' “1’2".
21133

Q
ti

‘3'.’ 3.1;"
5K

_...

'24,“. .-

14:2: ’.'fY

‘ « "32's :3:-

$.95 -" .mhz. - V

.." ‘7‘; ‘I“}£*':u.'."‘f’.’-

n- J ‘1? l-An-o I-OL‘

«gap-:5!1:1.»Qu-Jg, 5-3“:- ‘ r”...

‘1‘)“ P.3d.“ p"; 2‘ 'jhr': 'fn.‘; ”A’

gnaw-dc. 4. 4.. 2”».- "mm

.n/f.,4".",‘r'.;. ‘1 p" waﬁ'?” ’.'.4' >'

“/19 ,J‘ .. - . Win, ”a... a-

“ -n .h w ':.-‘vn.1oﬁ.‘.‘ '

d. '3 5' r' 5.402" "'5 ﬂ

. «1'34" #5"??? 71.: .r-nz
.' ‘

.
‘
‘

o

\'

MOW

v1

 

Mr

Jpoﬁl V

; .

43.
I
'. 4.44.

.. ’ L‘f

iéwr-lw-‘r'ua'b‘do' w-P Ir NM!"

'4‘“

u. .-

I‘.—'. 'I -.

A:

L

A“ L'AI" ‘Ix‘

.‘rm

-. Q -. «NIH
_ . _. ,, A. _.‘.
‘. swag}. 54".!“ ..~
, V . 3 .
mint-ratxm um -
. 'A" " bun“; .-
',7.‘,l;r,'1.

' ‘1’,"

 

’~
.
<
A

'1

' l ;-
(~43;

A . (3“

'

5‘ . ,. .‘ .
30594123329

pay... was-«qua» "he. mas: ~

4‘ A I 1 ‘

V .V.

’29:: ."f-" *
a 4

"did

(to.

A

 

~ 't 9W‘“

 

llll'llllllll‘fl lllllflllll L»

1293 01082 3668

 

 

[.1an

Michigan State
University

 

This is to certify that the
thesis entitled
A VLSI CHIP ARCHITECTURE FOR THE COMPUTATION OF
THE DIRECT KINEMATIC SOLUTION OF A ROBOTIC MANIPULATOR

presented by
STEVEN SIUTIT LEUNG

has been accepted towards fulﬁllment
of the requirements for

Masters degree in EIECt. Engr.

VWJ/ﬂf

Major professor

D t 4/12/85 Michael Shanblatt
a e

0-7539 MS U is an Afﬁrmative Action/Equal Opportunity Institution

 

 

 

IVTESI_J RETURNING MATERIALS:
Place in book drop to
LJBRARJES remove this checkout from
Alull5llH-L your record. FINE§ will

be charged if book is
returned after the date
stamped below.

 

 

 

 

 

we 2 l 1993

/

 

 

A v1.5: can: mcurncm ran ma COMPUTATION or

THE DIRECT KINEMATIC SOLUTION OF A ROBOTIC MANIPULATOR

BY

Steven Siutit Leung

A THESIS

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

MASTER or SCIENCE

Department of Electrical Engineering and System Science

1985

l‘7’//

‘v~

7., q
‘—_-~

\

bl-)

Copyright © by
STEVEN SIUTIT LEUNG
1985

ABSTRACT

A VLSI CHIP ARCHITECTURE FOR THE COMPUTATION OF

THE DIRECT KINEMATIC SOLUTION OF A ROBOTIC MANIPULATOR

BY

Steven Siutit Leung

This thesis describes the architecture design and simulation of a VLSI‘
chip dedicated to the computation of the Direct Kinematic Solution (DKS)
of a robotic manipulator. The design features fixed-point calculation
and on-chip generation of trigonometric functions. The calculation
achieves the resolution required for industrial applications by
augmenting the internal word size. Regularity considerations lead to
the modification of the homogeneous transformation algorithm.
Pipelining techniques applied to the inter- and intra-functional unit
designs speed up the DRS calculation. Control signals assume a
control store implementation, easing both the design and verification
processes without sacrificing the completeness of the system-phase
design tasks. A symbolic simulation approach verifies the correctness of
the design. Transistor-count and cell-count confirm the feasibility of
a semi-custom implementation approach with current technology.
Results indicate that the DKS can be obtained in microseconds, which
amounts to a speed improvement of three orders of magnitude compared to

a similar algorithm implemented on a 16-bit microprocessor.

To

my parents and my nemofe Zovcna

iii

ACKNOWLEDGEMENT

The author wishes to express his sincere appreciation to his major
advisor, Dr. Michael A. Shanblatt, for his guidance, support and
encouragement in the course of this research.

He also wishes to thank the committee members, Prof. P. D. Fisher,

Prof. E. Goodman and Dr. R. L. Tummala, for their valuable comments in

' this work.

iv

TABLE OF CONTENT

Page

L IST OF FIGURES O O O O O O O O O O O O O O O O O O O O O O O I O O O O O O O O O O O O O O O O O O 0 O O O Vii
L IST OF TABLES I O O O O O O O O O O O O O O 0 O O O O O 0 O O O O O O O O O O O O O O O O O O O O O O O O 0 ix
I 0 INTRODUCTION 0 O O O O O ‘0 O 0 I O O O O O O O O O O O O O O I O O O O O O O O O O O O O O O O O O 1

1.1 Problem Statement .................................
1.2 ApproaCh 0.00.00.00.000.0.00.0...OOOOOOOOIOOOOOOOOO

U1.»

II. BACKGROUND ............................................. 8
2.1 The Direct Kinematic Solution of
a Robotic Manipulator ............................. 8
2.2 VLSI System Design ................................ 11
2.2.1 Characteristics and Important Trends of
VLSI System Design ......................... 11
2.2.2 The VLSI Design Space Concept .............. 14
III. SYSTEM SPECIFICATIONS .................................. l7
3 1 Application Requirements .......................... 17
3.2 Technology Assessment ............................. 20
3 3 Fixed-Point Calculation ........................... 21
3 4 Specifications .................................... 23
3.4.1 Design Objective and Criteria .............. 23
3.4.2.1 I/O and Functional
Specifications .................... 25
3.4.2.2 Performance Specification ......... 26

IV. ALGORITHM DEVELOPMENT .................................. 28
4.1 Sine Function Generation .......................... 28
4.2 Modifications on the

Homogeneous Transformation Matrices ............... 32
4.2.1 Regularity of the Matrix Entries ........... 32

4.2.2 Decomposition of the 4x4 Homogeneous
Transformation Matrix ...................... 34
4.3 The Final Algorithm ............................... 35

VI.

VII.

VIII.

Page

ARCHITECTURE DESIGN ................................... 37
5.1 VLSI System Design Rules .... ...... ............... 38
5.2 Architecture Development I ....................... 38
5.2.1 Functional Units .......................... 38
5.2.2 Fixed-Point Calculation Implementation .... 42
5.2.3 SIN/COS Implementation .................... 44
5.3 Timing ........................................... 47
5.3.1 Clock Period and Pipelining ............... 49
5.3.2 Clocking Scheme Alternatives .............. 52
5.3.3 Timing Diagrams ........................... 55
5.4 Architecture Development II ...................... 58
5.4.1 Synthesis of the
Computational Structure ................... 58
5.4.2 Description of Hardware in RTL ............ 61

CONTROL AND DESIGN VERIFICATION ....................... 62
6.1 Control Signal Specification ..................... 62
6.2 Verification by Symbolic Simulation .............. 66
6.2.1 Simulation Principles ..................... 66
6.2.2 Program Development ....................... 68

EVALUATION ............................................ 74
7.1 Goal and Strategy ................................ 74
7.2 Area Estimation .................................. 75
7.2.1 Transistor Count .......................... 76
7.2.2 Macrocell Implementation .................. 78
7.3 Speed Estimation ................................. 80

CONCLUSION O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 84
8 O 1 AChievments O O O O O O O O O O O O O O O O O O O O O O O O I O O O I O 0 O O I O O O 84
8 O 2 sumries O O O O O O O O O O O O O O O O O O O O O O O O 0 O O O O O O O O O O O O O O O 85

APPENDIX 1 ............................................ 87
APPENDIX 2 ............................................ 89
APPENDIX 3 ............................................ 91
APPENDIX 4 ............................................ 94

BIBLImPHY OOOOOOOOOOOOOOO0.00....OOOOOOOOOOOOOOOOOOO 103

vi

LIST OF FIGURES

Figure _ Page

1.1 An overview of the design methodology
With respect to the researCh SCheme 0.0000000000000000. 4

2.1 The transformation matrices of the PUMA arm [5] ....... 9

2.2 The configuration of the PUMA robotic manipulator

andits paran‘eter values [5] 00000000000000.00000000000 10
2.3 Triartite representation of the design
and various design levels [17] ........................ 15

3.1 Absolute shaft encoder and incremental
Shaft encwer [19] 000.000..0000.000000......0.0.0.0... 18

3.2 The accumulation error versus internal word size ...... 24
303 The I/Odata formats 0.0000000000000000000000000.000.00 26

4.1 The schematic block diagram for
sine function generation [29] ......................... 29

4.2 Sine function by linear interpolation [26] ............ 31
4.3 The three modified transformation matrices ............ 33

5.1 The block diagram of a 5-by-5 Baugh-Wooley
two's complement array multiplier [35] ................ 41

5.2 The block diagram of a two's complement adder
“5d in the DKS Chip 0.0....000000000000.000.000.000.00 42

5.3 Illustration of the radix point position
in themultiplication 000.0000.000000000000000000000000 43

5.4 The table value adjustment scheme [26] ................ 44

5.5 Logic circuit diagram of the
SIN/cos contr01 section 00.000000000000000000.000000000 ‘6

vii

Figure

5.6

5.7

5.8

5.9

5.10

5.11

5.12

5.13

5.14

5.15

6.2

6.3

Block diagram of SIN/COS implementation ......... ......
The general form of a data path [36] ....... . ..........
The two-phase clocking scheme ....... ..... .............
Clocking scheme one using single phase latches .. ..... .
Clocking scheme two using single phase latches ...... ..
Clocking scheme three using

dynamic registers .....................................

Timing diagram of one pseudo-matrix-
mUltiplication 000.000.00..00.00000000000000000000.0000

Circuit diagram of an input register cell
With mUltiple sources [39] 0000000000.. ooooo 00.00.0000.

A two-port register cell and the
corresponding logic truth table .......................

BIOCk diagram or the DKS Chip 00000000000000.0000.00000

Flowchart of the DKS symbolic simulation program ......
Flowchart of the subroutine PHASEl ... .............. ...
Flowchart of the subroutine PHASEZ ............ ........

viii

48

50

50

53

54

54

57

58

59

60

71

72

73

Table

3.1

~ 4.1

LIST OF TABLES

Error due to the interpolation table size .............
A comparison of two sine generation methods ...........
Control fields and signal definitions .................
Transistor count of the DKS chip ......................

The IBM Master Image function-cell and
Chip Statistics 00000000000000000....0000.000.000.0000.

cell count Of theDKS Chip 00.00.000.000000.000000.000.
Estimtion Of the phase one delay 00.000.000.0000000000

Total calculation time for various approaches .........

ix

22

31

65

77

78

79

82

83

Chapter I

I NTRODUCT I ON

In a computer-based robotics system with multiple
degrees-of-freedom (DOF) movement, the position and orientation of the
manipulator are servo-controlled through each individual joint. The
parameters of position are conveniently expressed as joint angle
variables in a link coordinate system. The position of the object to be
manipulated, however, is usually expressed in the reference base (world)
coordinate system. Given the joint angle vector 9, in the link
coordinate system, the position and orientation of the manipulator in
world coordinates can be obtained via the Direct Kinematic Solution
(DKS).

A homogeneous transformation method has been developed to solve the
direct kinematic problem effectively [1]. However, the successive
matrix multiplications involved in the homogeneous transformation
require a cumbersome amount of calculations. For similar
computationally-bound problems, VLSI (Very Large Scale Integration)
technology has shown great potential in improving system performance by
implementing concurrent numerical algorithms directly in hardware.
Furthermore, advances in CAD (Computer-Aided, or more recently,
-Automated, Design) and CM: (Computer—Aided Engineering) systems for
VLSI enable such an implementation to be carried out in a custom design

approach with relatively low cost.

The purpose of this research project is to investigate aspects of a
custom-designed VLSI chip architecture, tailored for the computation of
DKS.

Specifically, application requirements for the DKS chip
architecture are investigated in order to formulate the design criteria
and a set of specifications describing the computational requirements.
Based on the PUMA robotic manipulator, a set of simulation program
modules is developed to study the effect of fixed-point calculations on
the resolution of the resultant position and orientation.

Various methods for SIN/COS function generation are compared and
the homogeneous transformation matrices are modified to reduce the
control complexities. Based on the modified algorithm and the SIN/COS
generation method, a set of functional units and storage elements are
determined.

Additional considerations, such as the choice of the most suitable
number representation, timing rules, and hardware constraints are
discussed. The pipeline concept is extensively applied to the inter-
and intra-functional unit designs to improve the speed of the DRS
calculation. A detailed timing diagram is provided to illustrate the
data flow in the machine and to facilitate the generation of control
signals for the DRS calculation cycle. The flow of control is presented
in the form of a program written in RTL (Register Transfer Language).

Control signals are identified and partitioned into fields to
facilitate the specification and generation. Viewed as machine codes,

they are generated through the manual compilation of the RTL program.

Verification of the architecture and control design is carried out by
symbolic simulation. By comparing the simulation results with the
expected DKS function, errors are discovered and corrected.

Finally, empirical and statistical data is collected from various
sources to aid in the estimation of the total chip area requirement.
The total DKS calculation time is evaluated from both simplified circuit

models and published experimental data.

1.1 Problem Statement

Current research is being conducted in the field of robotic
controls to improve the system performance by incorporating various
dedicated computing hardware [2]. In the same spirit, the purpose of
this research is to explore some specific algorithmic and architectural
design alternatives for a single VLSI chip for dedicated high speed DKS

computation.

1.2 Approach

The VLSI design process consists of a system design phase and a
physical design phase. This research effort focuses exclusively on the
first phase development. An overview of the research scheme is
presented in Figure 1.1. Tasks in each step are identified and listed

alongside while the output from each step is shown in Italic.

REQUIREMENT * Application Requirement Analysis
ANALYSIS * Fixed-Point Calculation Simulation

Design Objectives and Criteria
System Specifications

ALGORITHM * SIN/COS Generation Method
DEVELOPMENT * Transformation Matrix Modification

DKS Algorithm

ARCHITECTURE * Timing Diagram for Data/
DESIGN Control Flow Analysis

* Pipeline Implementation

* Functional Unit Design

* Interconnections

* HDL Programming

Architecture (Block Diagram)
Clocking Scheme
Description Program in RTL

 

CONTROL & * Machine Code Generation
VERIFICATION * Symbolic Simulation

Control Signal Specification
COHtFOl Signal Table

 

EVALUATION * Empirical and Statistical
Data Collection
* Circuit Modelling

Area/Speed Estimation

 

System Design

 

Physical Design

Figure 1.1. An overview of the design methodology
with respect to the research scheme.

The high cost of the VLSI development places unprecedented
importance upon the precise specification of the integrated system to be
designed. The first step, therefore, is to study the requirements as
carefully and comprehensively as possible. Questions to be addressed
include the future utility of the proposed st chip, the 1/0' format,
the interface requirements, the flexibility of adapting the chip to
other robotic manipulators, the impact of the working environment on the
choice of technologies, and the possibility of employing fixed-point
.calculations only.

Once the design goals and criteria are resolved, the design
proceeds with algorithmic development. Alternatives to the SIN/COS
generation methods are discussed. The transformation matrices are
modified and decomposed to yield a higher degree of regularity and
better hardware utilization. The final version of the DRS algorithm is
expressed in a Pascal-like language.

The implementation of algorithms with VLSI technology gives the
designer maximum freedom in making software/hardware choices. However,
such freedom must be exerted judiciously with an understanding of the
potentials as well as various .constraints and limitations. These
understandings are conveniently formulated as design rules or
principles, such as those suggested by Mead and Conway [3], which are
presented in Chapter 5. A two's complement number representation is
chosen and this presents certain difficulties which must be resolved by
the modifications of relevant functional units. Timing requirements are

considered in sufficient detail leading to a decision to implement a

two-stage pipeline multiplier.

Ideally, the architecture should achieve performance of such a high
degree that it will only be limited by the inherent data dependency.
Accordingly, a detailed timing diagram is constructed to aid in the
analysis of the data flow. The order of the original DKS calculation is
manipulated in such a way that a minimum number of clock cycles is
achieved for a given hardware-and-interconnect configuration. The
sequence of events is translated directly from the timing diagram into
an RTL program.

The control structure is further specified by defining the needed
control signals, which are subsquently partitioned into eleven control
fields. A table of machine code is compiled from the RTL program
manually. The control signal table is then used to drive the data
through a simulator (a symbolic simulation program) to verify the
design. Simulation principles and program developments are described in
Chapter 6.

In order to have a realistic picture of the feasibility of
implementing the DKS on a single chip, transistor count and gate count
are evaluated first and the total chip area is estimated from data
collected on comparable transistor/gate count vs chip area relationships
from various sources. Once the area, and hence the chip edge, are
known, interconnection delay as well as delay in the combinational logic
sections are estimated. Based on the estimated delays, the minimum

clock period is evaluated from the worst case delay path.

Finally, to conclude this work, insights to the VLSI design process
gained from the design of the DRS chip are sublimated to a higher level
of understanding. Results of the evaluation are compared with the design
objectives. Implications as well as problems for further research are

identified.

Chapter II

BACKGROUND

2.1 The Direct Kinematic Solution of a Robotic Manipulator

The robot arm consists of a number of link-joint pairs, each
providing one degree-of-freedom. It is convenient, for the purpose of
control, to express the position of each individual rotational (or
prismatic) link-joint pair by a single variable 9‘ (or d‘) with respect
to its own link coordinate system. A unique 4x4 homogeneous
transformation matrix A‘ , which is a function of 9‘ (or d‘), maps a
vector in the link ‘th coordinate system to the link c-lth coordinate
system. Thus, for a six degree-of-freedom robot arm, given its six
rotational joint variables 9 = (0,,0,,6,,0,,0,,9.), the joint space to

Cartesian space mapping is obtained by the successive multiplication of

the six homogeneous transformation matrices,

T = A,-A,-A3-A,°A5-A. = [ n s a p ]. (2.1)

The resultant homogeneous matrix T gives the orientation vectors of the
wrist n, s and a, and the current arm position p which is defined as the
vector from the origin of the base to the wrist, all in the world
coordinate system. This joint space to Cartesian space mapping is known

as the Direct Kinematic Solution (DKS) [4,5].

In the DKS calculation (eq. 2.1), each A‘ has the form

P C

cose -cosa sina sina sina a c050
l. t l l l. l C
sine cosa cosO —sina cose a sine
t I l. t l t t

A‘ - (2.2)

0 sina cosa d
t t t

 

0 0 i 0 l

 

where a‘, at, and d‘ are the parameters of the cm link-joint pair.

Even though the matrix A; appears complicated, the parameter a‘ of
almost all of the existing robots only takes on values that result in
sina‘ or cosa‘ being either Oor :1. As an example, the simplified
PUMA robot arm transformation matrices together with their parameter
values are shown in Figure 2.1 and Figure 2.2. The highly regular
pattern and small number of elements for each A‘ suggest the possible
use of a dedicated computing structure to enhance computational
throughput in the calculation of DKS. This can be obtained by

incorporating pipeline and concurrent processing concepts with high

speed VLSI hardware .

C0] -Cc; 5”. 3315131 ~60,
5‘." Ca,CO, -Sa,C0, 0'59]

”'7‘: 0 So, CO; d;
0 0 O 1
a. o ‘—8. o c: —szoec2 ’c. o 8. ace
5‘ O C. 0 S; C, 032$; 33 o —C; 033;
A‘: M: AI=
00-100 00.‘d; 0:00
0 0 0.1 o o 0 3 Lo 0 0 1
C. 0 44 0 C3 0 35 0 PC. -S. O O
s.oc.‘o 530-650 536.00
A‘: A:= A”:
, .3 0 -1 o a. 01 O 0 5 O 0 Id.
0 0 0 I 0 0 0 l 0 o 0 1

 

where C, 5 coat), ; S, z 3100,

Figure 2.1. The transformation matrices of the PUMA arm [5].

 

 

10

 

\\ y
\\ 25
152
”s

s
Y5(§)

26 (3)

2.61)
O

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PUMA Robot Arm Link Coordinate Parameters
Join” ‘6, a, l a, d, Range
1 90 [~90 I 0 o -160to+150
2 0 I o ! 431.8 mm , 149.09 mm -225 to as
a 90 i 90 9 -20.:32 mm I 0 ~45 to 225
4 ' o I -90 ,! 0 433.07 mm -110 to 170
s 0 I 90 3 o o ~100to1oo
6 017 0 3 o i 5&25rmnA -2562o255

 

Figure 2.2.

and its parameter values [5].

The configuration of the PUMA robotic manipulator

11

2.2 VLSI System Design

Design, in its most general sense, can be described as a process of
successive mappings or transformations of specifications from one domain
(or abstraction level) into another. In this aspect, VLSI design is no
different from design in other disciplines. However, the fabrication
of more than 100,000 transistors on one chip requires the use of

special tools and techniques to make the design and verification process

feasible .

2.2.1 Characteristics and Important Trends of VLSI System Design

In traditional engineering practice, design and manufacturing can
be done almost independently of each other. This situation, however,
has dramatically changed for VLSI design. ”Integrated”, in fact, means

not only the integration of a large number of circuit components, but
also implies the designer's integrated knowledge, from the choice of
design alternatives to circuit layout to packaging [6]. Fortunately,
the formulation of design rules and the development of powerful tools
relieve the designer's burden by providing] a well defined interface
between the system designers and engineers working on physical
implementation. Nevertheless, the complexities and interactions between

and among all design levels have increased to the extent that it is

12

beyond human capabilities. As a result, design automation is no longer
a luxury but a necessity.

nxmy,the process of physical design, including layout, wiring and
timing analysis, has been largely automated [7]. Some state-of-the-art
design automation systems are capable of placing, and wiring over 37K
logic gates on a chip, achieve 99% automatic physical design and 100%
automated checking, and assure error-free functioning [8]. While
improvement will certainly continue in this area, automated logic
synthesis is gaining more and more attention. CAE systems aimed at
automated translation of register transfer level description of digital
systems to VLSI realization through gate arrays or standard cells, have

been developed and claim varying degrees of success [9].

Amidst all the progress in VLSI technology, perhaps the single most
important event is the appearance of the 'silicon compilers' which are
about to revolutionize the VLSI design process. Silicon compilation
attempts to reduce the entire physical design process of a VLSI systm

into a compilation process such that, at any point of time, most of the
ultimate characteristics of the finished chip can be accurately
estimated from available data. [10,11]. This opens new opportunities for
system designers to experiment with various algorithmic/architectural
alternatives with much less effort and cost. This facilitates the
making of more intelligent design decisions based on alternative

analyses .

1.3

These developments have brought forth two major impacts on VLSI
design. First, more emphasis has been shifted toward higher and higher
levels of design. It is now recognized that the size and performance of
VLSI chips are more influenced by higher level designs, such as
task-realization algorithm design and architecture design, than logic

design or layout [6,12]. The significance of this shift may be better

understood from a historical perspective by comparing it with the shift

from machine language to high-level language in the early days of
computer programing [13]. In fact, the influence of the software
engineering practice on the VLSI design process is pervasive as
evidenced by the use of terms like ”silicon compiler", and "top-down"
and "bottom-up" approach.

In parallel with this upward shift, custom and semi-custom design
approaches emerge as the mainstream of next-generation VLSI design. The
custom IC market has grown from $54 million in 1979 to $1 billion in

1984 and is predicted to reach $5 billion by 1989 [11]. This trend can
also be witnessed in the widely published objectives of the US
Department of Defense's VHSIC program [14] and DARPA's announcement of a

fast turnaround fabrication service available for commercial use. With
the availability of silicon compilers and the formation of the silicon
foundry [11,15], the ideal of a fast design turnaround time with
relatively low cost is rapidly becoming a reality. It is under such
auspices that the custom design approach is elected for the design of

the DES chip in this work.

14

2.2.2 The VLSI Design Space Concept

A significant wealth of knowledge about VLSI design has been
accumulated in the past ten years as the industry grows. In order to
gain a better understanding and insight of the design methodology, the
domain of VLSI design may be expressed as a design space spanned by
three axes - functional, structural, and geometrical representations -
as shown in Figure 2.3. Elements in each axis represent the abstract
level or refinement steps involved in the design process. VLSI design
methodology, from this perspective, is visualized as a specific path
traversing through the design space. A particular path taken by a
certain school of thought, not surprisingly, more or less reflects a
particular environment (usually expressed in needs and constraints of
human, technological and financial resources) under which that
particular methodology has evolved. (It is interesting to note that the
term "level” suggests structured programming, and ”representation,” on
the other hand, comes from the AI (Artificial Intelligence) vocabulary.)

In a more formal sense, design methodology, as suggested by Baller,
can be defined as ”a set of codified techniqueIs], broadly applicable,
that facilitate the creation of designs that are functionally correct,
qualitatively acceptable, are easily ‘understood, and are easily

modified” [16].

15

STRUCTURAL FUNCTIONAL
REPRESENTATION REPRESENTATION

 
   

Processor Memory Switch Systems
Register Transfer Algorithmic

Circuit Boolean Expression

Mask Geometries

 

Cells

 

Layout Planning
1..

 

\V

GEOMETRIC
REPRESENTATION

Figure 2.3. Triartite representation of the design and various
design levels [17].

VLSI design methodology is still in a developmental stage.
Therefore, instead of endorsing a particular approach, only the basic
and common aspects of VLSI design methodology are discussed here. The
presentation below follows the philosophy of Gajski and Kuhn [17].

Extreme complexity in VLSI design forces the designer to take a
structured hierarchical approach. By hiding lower level details, the
designer is thus capable of concentrating on the design at a specific
abstract level. In spite of their wide differences, design levels in

the entire VLSI design space fall into three categories of

16

representations: the functional, structural, and geometrical
representations.

The functional representation is at the highest level of the design
and may be captured on several sublevels. The most widely accepted of
them are the systems, algorithmic, and Boolean expression (logic)
levels. Between the functional and geometrical representations is the
structural representation. It maps a functional representation onto a
set of components and connections under constraints such as cost, area
and timing. Sometimes structural representation may overlap with
functional representation. Commonly used levels of structural
representation are the processor memory switch, the register transfer
(operator register bus), and the circuit level. The geometrical
representation ignores, as much as possible, what the design is supposed
to do and binds its structure in space (physical design) or to silicon
(geometrical design). Geometric representation levels include layout
planning with arbitrary size blocks, cells, and physical mask
geometries.

This research effort focuses mainly on the design of the DKS chip
at levels corresponding to the system, algorithmic, and register

transfer levels.

Chapter III

SYSTEM SPECIFICATIONS

A lesson VLSI engineers learn from software engineering is the
crucial role of requirement and specification definition in the design
cycle. Study has shown that while requirement analysis occupies only 3%
of the total cost,its errors are 100 times more expensive to correct
than implementation errors. And unfortunately, over 40% of the errors
observed during testing are due to incorrect or misinterpreted
requirements or functional specification [18]. Therefore, at the
beginning of this work, a literature search was undertaken to collect
data on the application requirements in order to derive realistic design

specifications.

3.1 Application Requirements

Robots can be classified as Cartesian, cylindrical, rotational or
articulated depending on the number of prismatic/rotational link-joints.
‘The kinematic problems of the former three classes are simple, but their
rigid structures limit their use to only a certain class of
applications. The articulated robot arm, on the other hand, mimics the
actions. of the human arm and is the most flexible. Consequently, the

control is more complicated.

17

18

The function of the proposed DKS chip is to map the position and
orientation of an articulated robotic manipulator from the joint space
to the Cartesian space. The angular position of a joint can be measured
by a potentiometer. More recently, two major kinds of optical shaft
encoders (see Figure 3.1) have been developed. They utilize digital
techniques and ,the measurement is more accurate. An absolute shaft
encoder can provide a resolution of 0.04 degrees. The incremental shaft
encoder can provide a somewhat lower resolution of l/2500 turns (0.144
degrees) and requires more interface hardware. It has the advantage of
encoding position and velocity information simultaneously and its cost

is relatively low [19].

Reference slot

and sensor

 

Figure 3.1. Absolute shaft encoder and incremental shaft
encoder [l9] .

19

Surveys have shown that the positional resolution requirement of
various applications range from 0.1-10.0mm for a working range of l-2m
[20]. Expressed in a relative terms, the base two logarithm of the
ratio of the robot's maximum reach divided by its resolution is
approximately 14. Repeatability requirements of 0.01-lmm are typical.
The resolution requirement appears readily achievable using 16-bit
processors [21].

Minicomputers are often used as the control computer in which
information from sensors is transmitted to the CPU via a 16-bit bus.
The immediate use of the DKS chip is to accept joint angle inputs and
produce a real-time position and orientation in world coordinates. In
the future, the DKS chip may be incorporated into iterative algorithms
[22] to obtain the more useful Inverse Kinematic Solution (IKS). In
either case, it is desirable to have the input port and output port
separated.

To specify the orientation, only two out of three orientation
vectors are needed [5,19]. Therefore, during the entire DKS calculation
period, a total number of 15 I/0 operations are required if the SIN/COS
function can be generated on the chip. This includes six joint angle
inputs, one positional and two orientational vector outputs. Compared
with the minimum of 60 multiplications and ,3, additions, the I/o
obviously does not constitute a bottleneck. Therefore, a multiplexed
scheme through separated I/O ports is assumed for I/O operations and its

implementation will not be considered in detail here.

20

Throughout this project, the PUMA robot arm is chosen as a testbed
to study the DKS chip architecture because of its popularity. Even
though the general transformation matrix method can be applied to
various arm configurations, differences in parameter values and types of
variables (prismatic versus rotational) introduce tradeoffs between
flexibility and time-area complexity of the control hardware. However,
tradeoffs due to the difference of parameter values are simpler to

handle and thus flexibility is given a higher priority in such a case.

3.2 Technology Assessment

Three particular state-of-the-art VLSI achievements have
implications on the design of the DKS chip.

The first one is a lG-by-lé multiplier-accumulator chip by TRW.
Using l-um CMOS technology, it can perform a multiply-accumulation
Operation in 90ns with maximum power consumption under 350 mm [23].

The second achievement is the 32-bit MC68020 microprocessor by
Motorola. Using Z-um HC-MOS technology, it contains roughly 200K
transistors on a 375-by-350-mil chip. Operated at 16.67MHz, it
dissipates only 1.5 watts [24].

The third one is the 583 ”operating system on chip” by Gould AMI.
Using 3-um NMOS technology, it is actually a single chip integration of
a Zilog 280 microprocessor, a 64K ROM (containing its standard operating

system), and the system interface logic [25].

21

Because of the high packing density, low power consumption and high
noise immunity, which is particularly important in view of the generally
harsh working environment of robotic operations, CMOS is probably the
‘ natural choice for.the DKS chip. However, it is noted that the system
design rules (see Chapter 5) are intentionally formulated to be
independent of any particular IC technology. This enables the system
designer to take advantage of the best technology available. Therefore,
in this design, while CMOS technology is generally assumed, NMOS circuit
models are also used because of availability. The conversion problem

will be addressed during evaluation.

3.3 Fixed-Point Calculation

Since the joint angle input needs only 12 bits (maximum resolution
of 0.04 degrees), and the resolution requirement of position ranges from
0.1-10mm for a working range of l-Zm, the I/O values are readily
transferable through a lG-bit bus. It is of great advantage if the
internal calculations can be carried out in fixed-point calculations
since both the hardware design and control mechanism will be much
simpler. Accordingly, Fortran simulation programs have been developed
to study the round-off effects and ultimately determine the minimum
number of fixed-point bits required by the resolution parameters.

A table look-up and interpolation algorithm is chosen for the
SIN/COS generation (see Chapter 4). As a result, the position error may

originate from three sources. The error due to the resolution limit of

22

the joint angles is intrinsic and cannot be eliminated. Consequently,
all input angles in the simulations assume a lZ-bit value. The second
source of error is introduced in the sine function generation and is a
function of table size as well as word size. The third error source is
due to the accumulated round-off caused by the finite fixed-point
internal word size of the arithmetic hardware.

The first simulation program assumes no internal word size limit so
as to isolate the effect of table size as the subject of investigation.
1000 joint angle vectors are randomly generated as input. The range for
each angle is checked and the lZ-bit angle word size is enforced. The
DKS results, calculated with various table sizes, are then compared with

the correct values. The tabulated results are presented in Table 3.1.

Table 3.1. Error due to the interpolation table size.

 

Table Size (Entries)
Type 32 64 128 256
Orientation
(ave) 0.0006 0.0001 0.0000 0.0000
Position
(ave) 0.25(mm) 0.058(mm) 0.014(mm) 0.0034(mm)
Orientation
(max) 0.0028 0.0006 0.0002 0.0000
Position

(max) l.58(mm) 0.334(mm) 0.084(mm) 0.019(mm)

23

The first simulation result shows that with table size above 128
entries, the accumulated error due to the sine function interpolation
error is negligible. The result is encouraging, and therefore a second
Fortran program is used to simulate the fixed-point DKS calculation for
variable internal word sizes with table sizes of 128 and 256. The
results are plotted in Figure 3.2.

The figure shows two alternatives to achieve the 0.1mm positional
resolution. Either an 18-bit word with a table size of 256 or a 20-bit
word with a table size of 128 is required. Although the 18-bit word
alternative requires double the table size, the resultant increase in
total chip area will be smaller than that of the latter which will
increase roughly by 10%. Moreover, an increase of internal word size
will also increase the delay time. Therefore, the 18-bit word with a

table size of 256 is chosen.

3.4 Specifications

3.4.1 Design Objective and Criteria

The objective is to design a basic VLSI chip architecture dedicated
to the fixed-point computation of the DKS for a six DOF articulated
robotic manipulator, with computational speeds surpassing ordinary
computational techniques. A major effort is devoted to the study of

architectural alternatives for the computational requirements of DKS.

24

Position
Error (mm)

Max Error for Table Size l28
Mox Error for Table Size 256

Ave Error for Table Size l28
Ave Error for Table Size 256

X+OB

0.41.5

0 0.35

0.3

0.25

0.20

 

 

 

 

l6 I7 l8 :9 20 2: 22 23 24
Word Size (Bile)

Figure 3.2. The accumulation error versus internal word size.

25

The following criteria, based on the application requirement
analysis, have been established to guide the making of various design
decisions:

* MOS technology with a feature size of Zum or below and a

clock of up to 20MHz are to be used.

* SIN/COS functions should be generated on chip to avoid
possible I/O bottleneck.

* The architecture should be applicable to the whole class of
six DOF articulated manipulators of rotational joints.

* Intra-functional unit pipelining is to be used and limited
to two stages to reduce design complexities.

* Minimum cycle time and minimum interconnections are sought
with the former having higher priority.

3.4.2.1 I/O and Functional Specifications

The input to the proposed DKS chip consists of six lZ-bit joint
angle values in fractional-turn representation [26]. Each is subject to
a range limit as specified in Figure 2.2. The output consists of two
orientation vectors, n and a, and the position vector p, all in two's
complement representation.

The I/O formats are shown in Figure 3.3. I/O operations are
multiplexed through separated I/O ports. The chip function is
holistically defined by the matrix multiplications of eq. 3.1. The DKS

is obtained by fixed-point calculation with an internal word size of 18

26

bits.

0

(3.1)

O
CDF‘CDCJ
hoooc>c>

O

 

 

COMO

3.4.2.2 Performance Specification

Since the major concern at this point is whether the DKS function
can be realized on a single chip, only two basic performance

requirements are specified here:

* Resource utilization higher than 90%;

* Total calculation time less than lOus.

OUTPUTUB bile)

(I) Position Element

INPUT

 

 

 

 

 

 

 

 

 

 

 

 

l2-Bit Join? Angle u bits 5 bits
8 bits (mm)
/ /' xi Radix point
0
>l80 (2) Orientation Element
Tobie Size=28
l5 bits

 

 

 

 

Figure 3.3. The I/O data formats.

27

In any bus oriented processor system, the performance is ultimately
limited by the most precious resource - the bus bandwidth. Because of
this limitation and the data dependency, 100% resource utilization is
generally impossible. The performance specification figure is intended
to provide a measure of how well the algorithm and the architecture
match.

The second specification represents an absolute measure of
performance of the proposed chip. In order to get a feeling for the DKS
calculation speed, a DKS algorithm, similar to the one developed in
Chapter 4, has been implemented on the Intel 8086 microprocessor.
Experimental results show an average of 15ms total calculation time when
run at lOMHz. Thus, the specification goal of the design in this

project amounts to a speed improvement of three orders of magnitude.

Chapter IV

ALGORITHM DEVELOPMENT

4.1 Sine Function Generation

Each transformation matrix A‘ in eq. 3.1 is a function of the joint
angle 9‘ for which both sine and cosine values are needed in the
computation. Since the cosine can be obtained easily from the sine
through trigonometric equivalence, only sine generation methods need be
considered.

In general, there are four practical methods to generate the sine
function. The oldest method of Taylor series expansion is notoriously
slow and thus not considered here. The CORDIC (COordinate Rotation
DIgital Computer) algorithm [27] has the advantage of large function
capacities [28] but its hardware design is more involved. Moreover,
there is no indication that the accuracy it will achieve is needed nor
that it can be implemented as a small functional unit on a single DKS
chip. Therefore, this implementation method is also not considered.

The remaining two candidates both employ ROM look-up techniques in
their calculations. The first method, described by Muroga [29], is based
on the principle of trigonometric sum of angles. It recursively divides
the argument into a more significant part and a less significant part
until the individual parts are small enough to address a ROM of
reasonable size. The sine value is then obtained by summing up all the

partial results. An example of a schematic block diagram based on this

28

Ar
dimers!

Figure 4 1

ML

LM

LL

 

Th
e sch
ema
tic
bloc
k d'
iagram
for sine
fun
ctio
n
gene
rati
on

29

sllﬂMM + ML)

    

z-l-
2"-
2-3-
2-4
2-5
2-0

2"-
2-.-
2'9

2'10
2'”
2'12

Adders

2'13
2°14
cos 7'
tu+smu1sint
M

2-15
cos [(1 + 8)MM]
sinLl.

[29].

2-3

2-5
2-6
2-7
2-0

2-9

2-10
2'31
2‘12

2'13
240
2-15
2'15

:3]

 

sin 1 s '
sun (MM +
ML)
+c0s [(ltllMMll
sin LM+
sin LL)

30

concept is shown in Figure 4.1. The design accepts a 16-bit angle input
and outputs a 16-bit sine value.

In contrast, the second method proposed by Ruoff [26] is based on
linear interpolation as shown in Figure 4.2. In this method, the domain
of the argument is divided into intervals of 2“ which become the table
size. The endpoints of each interval are calculated and adjusted so
that the interpolation error is spread evenly over each interval. The
upper n bits (not including the most significant bit) of the argument
are used to address a precalculated table. The lower part is then used
to interpolate the sine value using the endpoint value and the
interpolation difference of that interval.

A comparison of the two described methods is shown in Table 4.1.
While the first method requires more ROM area, its only external need is
an adder. Since both methods require two-level calculations, the first
method appears faster as multiplication tends to be slower than
addition. However, an on-chip multiplier is necessary to speedup the
DKS calculation. The speed of the second method can approach the first
if pipelining is implemented within the multiplier, and between the
multiplier and adder. Thus, with much less ROM area, the second method

is preferred.

31

1M») MD
I

...... “gaff
\\\\

/’7 44:3 nmxmnou

y.- is.)
KNOWN!

 

 

 

 

 

«I. .s)
W '(‘e’
ll“) ‘1»
'“4/’,/’4*"

“We,

 

 

 

 

 

 

 

Figure 4.2. Sine function by linear interpolation [26].

Table 4.1. A comparison of two sine generation methods.

Method ROM Multiply Add Levels Accuracy
(Blocks) (Operations) (Operations)

 

Muroga 4 0 2 2 2

Ruoff l l l 2 2

* Table size = 128 entries

32

4.2 Modifications on the Homogeneous Transformation Matrices

A concensus among VLSI designers is that a good VLSI architecture
should have the following properties [30,31]:

1. It should be implementable by only a few different types

of simple cells.

2. It should have simple and regular data and control paths
so that the cells can be connected by a network with local
and regular interconnections. Long distance or irregular
communications must be minimized.

3. It should use extensive pipelining and multiprocessing.

In this way, a large number of cells are active at one
time so that the overall computational rate is high.

The realization of the above properties is ultimately constrained
by the data dependency inherent in the task to be implemented. However,
within this constraint, the design of the algorithm determines the
eventual performance of the chip. The demand placed on the algorithm

can probably be summarized into a single goal - regularity.

4.2.1 Regularity of the Matrix Entries

Each transformation matrix in eq. 3.1 can be partitioned into four

submatrices as

 

' R - ' Rotation - Position '
3x3 ! p3x1 Matrix ! Vector
T = -----:~---- = % (4.1)
£1x3 g 1x1 Perspective g Scaling
: Transf. : .

 

 

 

 

33'

For robotic applications, the perspective transformation and
scaling are always 0 and 1 respectively. The rotation matrix is already
well structured, with the first column always [ C. s‘ o ]T and the other
two columns [ o 0 :1 JT and 1[-S‘ c‘ o ]T arranged in different orders.
However, the position vectors in A, and A, have elements of a,,,,C,(,,
and a,,,,s,,,, which cause the entries to be irregular. Note that
A, can be decomposed as a product of a rotation matrix and a pure

translation matrix as

P - l' '1 r -
c, o s, a,C, c, o s, o l o o a,
A 3 s, o -c, a,S, g s, o -c, o o l o o (4 2)
3 o l o o o 1 o o o o l 0 °
0 o o l o o o l o o o l .

 

 

 

 

 

 

The translation matrix merely adds the element a, to the first element
of the position vector of A,. Likewise, A, can be decomposed which
results in a, being added to the left-component matrix of eq. 4.2. With
this modification, all the transformation matrices now have the same
basic structure. The redefined transformation matrices A, to A, are

shown in Figure 4.3.

A: A: AO
C3 -53 0 O C; 0 S; a; C. 0 -5. a;
S, C, 0 0 S, 0 .C, 0 S. 0 C. 0
0 O 0 1 O 0 0 1 0 0 0 1
L e e e l- d

 

 

 

 

 

 

Figure 4.3. The three modified transformation matrices.

34

4.2.2 Decomposition of the 4x4 Homogeneous Transformation Matrix

Even though systolic array processors designed for 4-by-4 matrix
multiplications have been proposed [32], they are not yet practical for
single chip implementation [33]. Moreover, most of the entries of each
matrix in the DKS calculation are either 0 or 1 such that the array
processor approach may be overkill. Since only three vectors are
actually involved in the calculations, each intermediate matrix
multiplication can be decomposed to a 3-by-3 matrix multiplication and a

vector addition as

'50

I
I
----'A.--.

I
HI'O

[ n a p ] = [ Ron R-a R-p+p' ]. (4.3)

In ordinary matrix multiplication, the resultant column can be

expanded to

all'bl + alz’bz + ala'ba
A'b = 3314), + azz'ba 4' 33,433 (4.4)

aal'bl + asz'bz + ala'ba

35

Because of the special structure of the rotation matrix, the resultant

column is actually either

all'bl + aia’ba all'bl + alz'bz
azl'bl + azs'ba or all'bl + aaz'bz
1b, 1b, .

From the expansion above, it is easy to see that twelve
multiplications and six additions are needed for each matrix
multiplication. Two additional multiplications for SIN/COS must be
carried out first. This implies that a functional unit set of more than
two multipliers and one adder will result in low hardware utilization.
However, the performance of a two-multiplier hardware unit can be
approximated by that of a two-stage pipelined multiplier with half of
the area and a much simpler interconnection requirement. Therefore, in
this design, only the architectural alternatives for a

one-multiplier-one-adder configuration are considered.

4.3 The Final Algorithm

Based on the decomposed matrix multiplication, the final DKS
algorithm is presented below. The SIN/COS generation and the
pseudo-matrix-multiplication described earlier will be given a more
detailed treatment in the next chapter and hence are simply presented as

two procedures here.

36

Procedure DKS (jangle, var world : register)

const d6=56.6*2**7; a3=-20.S*2**7; dl=432*2**7;
a2=432*2**7; d2=149.5*2**7;

type register=integer<0zl7>;

var jangle : array [l..6] of register;
world : array [l..3,l..3] of register;
buffer : array [l..3] of register;
posvec : array [1..5,l..3] of register;
creg, sreg : register;

Begin
posvec[l..5,l..3]:=0;
posvec[3,l]:=a3; posvec[4,3]:=d4;
posvec[2,3]:=d2; posvec[3,l]:=a2;
world[l..3,l..3]:=0; world[2,3]:=l*2**l7; wor1d[3,3]:=d6;
Table-look-up-and-interpolation (jangle,6,creg,sreg);
world[l,l]:=creg; world[2,l]:=sreg;
for I:=5 to 1 do
Begin
Table-1ook-up-and-interpolation (jangle,I,creg,sreg);
for Jz=l to 3 do
Begin
buffer[l..3]:=0;
Pseudo-Matrix-Multiply(I,buffer,creg,sreg,world[J,l..31);
if J<>3 then world[J,l..3]:=buffer[l..3]
End;
world[3,l..3]:=buffer[l..3]+posvec[I,l..3]
End
End.

Chapter v

ARCHITECTURE DESIGN

The architecture of a computational system can be characterized in
many ways depending on the designer's view as well as the design level.
A view proposed by Dasgupta [34] is particularly useful for the design
of the DKS chip. In this view, an architecture is collectively defined
by a specification of the functional capacities of -its physical
components, the logical structure of their interconnections, the nature
of the information (data) flow, and the control structure and mechanism.

The design decisions on system components, such as functional units
or storage elements, are rooted in the needs of how data are to be
processed. On the other hand, the interconnection between these
building blocks depends mainly on how the data flow. The strategy
adopted in this work is to determine the structure of the functional
units first. Considerations on timing are explained in detail and
several clocking scheme alternatives are discussed. A detailed timing
diagram is constructed to analyze the data flow of the DKS algorithm.
The data flow is then manipulated in such a way that maximum resource
utilization is.achieved. The architecture is modified as necessary and
presented in block diagram form. Once the architecture of the DKS chip
is defined, the DKS function is realized by a logical sequence of events
activated by signals from the control logic. This sequence of events is

expressed in the form of a program written in RTL (Register Transfer

37

38

Language).

5.1 VLSI System Design Rules

The following three rules from Mead and Conway [3] are adequate to
guarantee the correct functioning of a bus-oriented chip design and are

rigidly followed in the design of the DKS chip.

Rule 1: The system is driven by a two-phase, non—overlapped
clock; phase one inputs from phase two outputs and
vice-versa.

Rule 2: Functional units are timeless C/L (Combinational
Logic); operations on data are separated by latches
or registers which are controlled by the two-phase
clock.

Rule 3: A standard bus scheme is used; data transfer through

the bus occurs during phase one, and the bus is
precharged during phase two.

5.2 Architecture Development I

5.2.1 Functional Units

As will be seen, the choice of functional units, the mode of
calculation, and the sine function implementation are actually very
interdependent on each other. The mode of calculation is considered

first.

39

The elementary arithmetic operations needed for the DKS calculation
are multiplication, addition, subtraction and unitary complementation.
Several factors appear to favor the mode of sign-magnitude integer
calculation. First, the multiplier design is relatively simple as the
sign of the result is just the XOR (eXclusive OR) of the signs of the
two operands. Furthermore, the most significant bit of the sine
argument indicates whether the angle is greater than or equal to 180
degrees and can be used directly as the sign bit of the result. And
above all, the unitary complement operation can be carried out
separately without occupying the bus and adder with a small increase in
hardware. The fatal disadvantage, however, is that the adder design is
cumbersome and the speed is slow because of the need for post-adjustment
[35].

Another alternative is a sign-complement mode such as the two's-
complement number representation. Additionally, several two‘s
complement multiplier designs are available in the literature. However,
in the two's complement number system, the unitary complement has to be
carried out through the‘adder. Thus, a double penalty will be imposed
on this operation. This is because the operation will not only require
the bus and adder for a clock cycle, but it also requires that the
multiplier-adder pipeline be emptied and results in low functional unit

utilization.

40

A closer look at the unitary complementation in the DKS reveals
that there are actually two different sources of this operation. One
comes from the ”pseudo-matrix-multiplication", as described in Chapter
4, where the third row of the resultant column (b, or b,) may require
complementation depending on the relevant element of the rotation
matrix. Since five successive matrix multiplications are performed to
obtain the DKS, the complement operation can actually be combined
in the previous or next matrix multiplication depending on which is
more convenient. Therefore, by carefully manipulating the relevant
additions and subtractions in the neighboring matrix operations, the
complement operation due to this source can be eliminated.

The second source of the complement operation comes from the
process of sine function generation. In order to reduce the
interpolation table size, only the interval values from 0 to 180 degrees
are stored. Therefore, when the argument is greater than 180 degrees,
the result needs to be complemented. With the table look-up and

interpolation method, the sine value is obtained by

SIN (x) = T(xU) + xL~D(xU), (5.1)

where xU and xL denote the upper 8 bits and lower 3 bits of the argument
respectively, and T(xu) and D(xu) are table values addressed by xv.
Obviously, -SIN(x) can be obtained by -T(xU)-(xL-D(xu)). In two's
complement operations, -a is implemented by a'+l, where a' denotes the

bitwise complement of a. While a two's complement adder can only

41

execute a+b or a-b and not -a-b in one cycle, it is easy to modify the
adder so that it can also perform a'-b. The result may lose one digit's
resolution, but for a table size of 256, the worst maximum interpolation
error is about 2.15 (0.000019). The truncation of l in the last digit

17 and

'of an 18-bit word is equivalent to a loss in resolution of 2'
should be acceptable, especially since this occurs only when calculating
-SIN(x). In short, with a small penalty on resolution, a two's
complement implementation will be faster, the design effort will be
relatively simple, and is thus preferred. Figures 5.1 and 5.2 show the
schematic block diagrams of a Baugh-Wooley two's complement array

multiplier and a two's complement adder.

. .,h adj-etzauuan-m4)
05553 0$i$3 -
. ‘J’o ‘abo ‘abo ‘350 'o’o
o o .. g o
b b
“‘1 ‘ a I: re ‘2 n0 ‘1‘: FA '05:

a. b. _ “3’ 6’6 6 FA .,5,
?? -oo’o
.. uiwm 0 aiaiir_u_--i-_i_

 

 

 

 

a"
5'
”w
3
.9

Figure 5.1. The block diagram of a S-by-S Baugh-Wooley two's
complement array multiplier [35].

42

 

 

 

.6 bl Ei‘l b°b° (505
5A2 s 2
° '
SA' SA!
C2 FA 6; FA 5A2

8A2 Sum

0 “b s s

I o~b ' O

: o'—b

 

Figure 5.2. The block diagram of a two's complement adder
used in the DKS chip.

5.2.2 Fixed-Point Calculation Implementation

Three data formats, as shown in Figure 3.3 exist in the DKS
calculation. The integer values of the orientation vector element and

the position vector element are their real values multiplied by 217 and

27 respectively, In other words, an implicit exponent of either 2-17 or

2'7 is implied for each integer value. To assure a correct solution, it
is necessary to maintain the consistency of these implicit exponent
values. According to the DKS algorithm developed in Chapter 4, the
matrix multiplication of A‘W is carried out by multiplying the rotation
submatrix of A‘ with each column vector of W, and the exponent value of
the result is the same as that of the W element. Since the elements of

the rotation submatrix are in the orientation vector format, the

exponent value of the W element can be easily preserved by truncating

43

the 17 least significant bits and the most significant bit of the
lB-by-la two's complement integer multiplier output. This is
illustrated in Figure 5.3. Since the implicit exponents are always the
same for the two operands for the addition operations, no adjustment is
required. The multiplication in the interpolation of the sine value

will be addressed in next section.

Orientation Position
element OP 2 element

0P2
[InmeJiTwmm] HWWsJDmew]

 

 

 

MPY MPY

 

 

 

 

 

 

I 1 [7 bits I? bits ] n bush um] i? bits ]

 

 

 

 

 

 

 

 

Result Result

Op 2 = Rotation matrix element

Figure 5.3. Illustration of the radix point position in
the multiplication. .

44

5.2.3 SIN/COS Implementation

The sine function needed in the DKS is approximated by the linear
interpolation between the endpoint values of the interval in which the
angle lies. In order to reduce the approximation error, the maximum
error in each interval is Calculated first. The endpoints are then
adjusted such that the maximum error of each interval will be reduced to

half of its original error as shown in Figure 5.4.

y a- '(w) T: 3 ylE(I’ ’ I(X3)

D
‘L— /*‘ Y“ ""=’

'«ﬁE(x) max
I LINE SEGMENT

ii I, V:E(x) . in.)
T. ADJUSIEO 30R INIIRVAL I

1m y = I(In)

 

 

 

 

x, x x,

 

WIERVAL | -1 mutant .

 

 

 

 

V

‘0‘. ‘. ’30.

Figure 5.4. The table value adjustment scheme [26].

45

The interpolation formula is

f(x) = f(x,) + [(x-x1)/(x,-x,)][f(x,)-f(x,)]. (5.2)

In the above equation, x,, x, are the interval numbers determined by the
upper 8 bits of the argument and the term (X-x,)/(x,-x,) is simply the
lower 3 bits of the argument after the radix point. Accordingly, both
f(x,) (the left endpoint of the interval) and [f(x,)-f(x,)] are stored
in the table and denoted as T(x) and D(x). The table is generated for
an angle argument range from 0 to 180 degrees and stored in a ROM. In
the fractional-turn representation of an angle, the 1's in the two most
significant bits correspond to 180 degrees and 90 degrees respectively.
Since SIN(180+x) = -SIN(x), the sign bit of the angle is used to
control the complement operation. Since COS(x) = SIN(90+x), the cosine
function is obtained by adding one to the second significant bit of the
angle argument. A special combinational logic circuit shown in Figure
5.5 is designed to implement the above logic operations under the
control signal s/c.

Since fixed-point calculation is used and an 18-bit word can

represent a two's complement integer number from -2-17

518(90) is set to 2‘17—1, the largest integer value, thus preventing

overflow. The multiplier assumes that one of the input operands has the

17

implicit exponent value of 2' and the result will have the same

exponent value as that of the other operand. Since the implicit exponent

of the interpolation result is expected to be 2'17, both D(x) and

46

 

>To LX

 

 

Angle Registers

 

 

 

 

<9=4> Tags

H

 

Address

<|O=lO>

 

<1]:

  

 

 

 

 

s/c e<i0=l0>

 

 

C}
3s

 

 

 

 

 

 

 

 

Sign (s/c)'<II=ll>+((s/c)<l0=lo> @<II=H>)

Figure 5.5. Logic circuit diagram of the SIN/COS control section.

47

(x-x,)/(x,-x,) of eq. 5.2 must have the combined exponent value of 2-34.

The lower 3 bits of the angle argument can be hardwired to the upper 3
bits (not including the sign bit) of the multiplier input latch. In
practice, the exponent values of D(x) and (x-x,)/(x,-x,) can be
adjusted to yield a more accurate result.

According to design rule 2, data flowing through stages of the C/L
section must be separated by latches. It requires a total of three
stages to generate the SIN/COS functions. During the first stage, the
angle argument is fetched with its upper part flowing through the
SIN/COS logic section to generate the table address and its lower part
latched by a delay element. During the second stage, table values T(xU)
and D(xu) are transferred via the dual bus to another delay element and
the multiplier input respectively. During the last stage, the product of
D(xu) and xL is added to T(xU) to generate the desired function. The
angle register output is connected to the SIN/COS logic section by a
dedicated local interconnection path so that the angle fetch can be
overlapped with other operations. The entire hardware design for

SIN/COS generation is shown in Figure 5.6.

5.3 Timing

In the synchronous discipline of design, tasks are accomplished
through a logical sequence of events with each event eventually realized
under the constraint of the physical timing of transporting information

(electrical signals) from one point to another. This dual meaning of

48

 

 

D00

 

 

TOO

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

A
Sign Angle
SIN/COS Registers ‘]
gr ,
° LX
[ ] LA
M2 Mi 0
E:
V (DE
l'o
N
M P Y
>5
0
‘8
“O
C;-—-— .AZ Al
3 Al 3" SAI
' MUX ‘ 8A2 V +—

 

 

 

Figure 5.6.

ADDER

 

Block diagram of SIN/COS implementation.

49

timing is bound by a system-wide clock which serves both as the sequence
reference and as the time reference. Hence the design of the clocking
scheme has paramount importance to the system's correct functioning and

performance.

5.3.1 Clock Period and Pipelining

The general form of a data path and the corresponding two-phase
clocking scheme are illustrated in Figures 5.7 and 5.8.

As already seen in Chapter 4, the ”pseudo-matrix—multiplication”
involves repeated calculations of the form a-b+c-d; it is of great
advantage to have a multiplier and an adder forming a pipe because the
piped operation will save the bus bandwidth and storage of temporary
results. Thus, for the DKS chip, the multiplier and adder become two
major C/L sections in line.

From the timing diagram of the two-phase clocking scheme, it is
clear that the minimum clock period is the sum of the maximum C/L delays
of phase-one and phase-two plus some overhead of delay time and preset
time. Intuitively, the C/L delay due to the multiplier may constitute
the largest delay component in the minimum clock period calculation and
thus one may further be motivated to introduce additional pipeline

segments into the multiplier to shorten the clock period.

50

Phase I OP 0?; Phase I

  

Figure 5.7. The general form of a data path [36].

 

 

 

 

 

 

 

 

 

 

 

if Period
E5| —1 Q5.
02
._f \____
HngCdelay PMaxCdeiay
o as a

.. a /Lodd /Leven

g ' Deiay in ,
2:00;? Phasez Preset t.me

for PhaseZ
Preset Iime .
for Phase!

Figure 5.8. The two-phase clocking scheme.

51

While this idea is sound, there are at least two factors that
should be taken into consideration. The first one is the scaling effect
on the interconnect delay. Since it is uncertain at this point whether
the DKS function can be realized on a single chip with l-um MOS
technology, scaling may be necessary. However, as the feature size A is
scaled down by a factor of k, the gate delay r is also scaled down by k,
but the resistance of the interconnect path may scale up quadratically
[37]. As a result, the interconnect delay may well become dominant and
thus render the intra-functional unit pipelining effort meaningless.
Recent studies show that refractory metal silicides can provide
low-resistivity gates and interconnects. Taken together with appropriate
interconnect scaling rules, the scale-down effects on interconnect
delays may sustain down to a linewidth (21) of 0.5um [38]. Therefore,
in this work, if scale-down is necessary, it will be assumed that it is
possible to reduce the interconnect delays linearly down to the feature
size of 0.25um.

Another factor is related to the particular choice of
implementation technology. The estimation of various delays in this
work is based on a full-custom design approach. Consequently, if the
chip is to be implemented with a semi-custom approach, such as standard
cell or macrocell implementation, the effect on the timing relationship
of the C/L delay and the interconnect propagation delay is uncertain.
Again, this suggests that the interconnect delay problem may possibly

offset the intra-functional unit pipelining effort.

52

Nevertheless, the multiplier circuit can be divided into two stages
quite naturally with the delay time of each stage about the same as that
of a ripple-carry adder as indicated by the dotted line in Figure 5.1.
An about-equal delay time of all C/L sections allows the system to
achieve a higher degree of efficiency for a given clock period. Also,
interconnect delay in this case seems very unlikely to exceed the delay

of the adder.

5.3.2 Clocking Scheme Alternatives

Given the two-stage multiplier and an adder forming a pipeline,
several clocking schemes are applicable. The first scheme is a
brute-force application of the Mead and Conway approach as shown in
Figure 5.9.

Figure 5.9 shows that the clock period is essentially the
multiplication time plus some delay overhead. The advantage of this
scheme is that when addition alone is needed, it only takes one cycle to
empty the multiplier. Also, the second stage delay can be reduced by
using a faster adder such as a CLA (Carry Look~Ahead) design. However,
each C/L section is actually activated only half of a clock cycle with
this clocking scheme.

To overcome this shortcoming, the C/L section can be selected to
operate either at phase one or phase two. Since C/L sections are
considered timeless and design rule one states that phase one inputs

come from phase two outputs and vice-versa, additional latches are

53

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I 0| 0? g, a.
_;
: --) --) MPYI —-3 ﬂMPYZ *9 r-BADDER -)
-9 L_ Ci L.
Latch L L L L
Period j
0] g1
f3’2
Max. IMPYI.ADDERI MPY2 9i
a ‘—
Preset Preset
\ Deloy/

Figure 5.9. Clocking scheme one using single phase latches.

needed. Figures 5.10 and 5.11 illustrate two alternative clocking
schemes that allow the C/L sections to operate in a single phase.

A requirement for the architecture of the second clocking scheme is
that the clock period not exceed the refresh period. of the dynamic
storage elements used on the chip. In contrast, instead of making that
assumption, the third clocking scheme accommodates the refresh rate
consideration into the calculation of the clock period and is thus more
flexible. Compared with the first clocking scheme, these two schemes
have the apparent advantage of a shorter clock period with a slight
increase in chip area. However, caution must be exercised for now two

clock cycles are needed to empty the multiplier. Furthermore, data now

54

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

I re 0’0 0 3
r9
. ..ﬁ MPYI MPYZ ADDER __,
F’ m
Latch L L L L L L
[a Period
01 01
__// L ' .
ﬁeMaXIMPYIMPYZADDERl
Delay Preset

Figure 5.10. Clocking scheme two using single phase latches.

EA 9. a, y a, a, e a,

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

-—i —>
MPYI -3l MPYZ ADDER "'""
"‘3
TDynamIc R R R
Register
k e Period e
0' MaxIRefresh rate. Ol+02I 0'

Jm ,2 _/_\_

\_
ﬂax (MPYI.MPY2.ADDER)

Worst case path delay

 

 

Akr/

 

 

Figure 5.11. Clocking scheme three using dynamic registers.

55

take three cycles instead of two to pass through the entire pipe. This
implies that if the degree of data dependency of the algorithm is high,
the faster clock, stemming from the increase of pipeline segments, may
not be advantageous. Since the second and third schemes are essentially
the same with the third being more flexible, the study is focused on the

trade-offs of the first and third clocking schemes.

5.3.3 Timing Diagrams

Given the basic computational structure and the two clocking scheme
alternatives, the task now is to realize the desired algorithm within
this framework. This is the logical aspect of timing and can be
mastered efficiently by the aid of timing diagrams.

For full illustration of design options, the data flow timing
diagram for this design should also provide some mechanisms for
manipulating the data flow and be useful for later design work (for
example, the interconnection solution and the RTL-programming).
Accordingly, the timing diagram for this design work consists of three
resource components: the dual bus, the multiplier, and the adder. Both
adder input and multiplier input are identified by their logical meaning
rather than their functional unit name. Thus, the logical sequence of
the DKS can be easily followed. The add/subtract operation of the adder
is also explicitly specified in the diagram to facilitate the
manipulation of the neighboring matrix multiplication as discussed in

Section 5.2.1. For each bus, both source and destination are identified

56

in the diagram. This proves very useful because the data flow pattern is
soon discovered. By manipulating the data flow and modifying the
interconnect path, the total DKS calculation cycles are reduced from 83
in the original design to 68 for clocking scheme one.

Figure 5.12 illustrates the timing diagram for one pseudo-matrix
multiplication for clocking scheme one. The elements of the input
matrix are logically identified as Wl to W9 but physically stored in R1
to R9 and may or may not be in sequential order. At the beginning of
each iteration, sine and cosine values are calculated first and stored
in Creg and Sreg respectively. A local interconnect path is assumed so
that the result from the adder can be sent to the output register file
without using an internal system bus.

Another timing diagram is constructed for clocking scheme three and
the minimum total clock cycles for the entire DKS computation is found
to be 73. Compared with the first clocking scheme, the data path of the
third clocking scheme requires more area due to the additional dynamic
registers including one needed for the delay of the T(x) term in the
sine interpolation process. The resultant increase of the chip area,
however, will be small compared with the total chip area. Even for a
very unlikely 5% increase in total chip area, the time-area product of
the third scheme will still be smaller than the first one if the clock
period can be reduced at least 12%. By inspecting the timing diagrams
of Figures 9 and 11, this reduction is apparently achievable. Thus, the
third clocking scheme is adopted for the DKS chip. The entire DKS

timing diagram with clocking scheme three is shown in Appendix 1.

S7

 

 

 

 

 

 

w; w; w; C, 0 S, a, w, w, w,
w; w; w; a s, 0 -C, 0 w, w, w.
w; w; w; 0 l 0 0 w, w. w,
(0) (0) (l) L(0) (0) (0) (1) (0) (0) (1)

After Multiplication: Before Multiplication:

 

 

 

 

R1 = w; R1 - w,
R2 . w; 32 = w,
R3 3 w; R3 = W,
R4 = w; R4 = w.
R5 = w; R5 = w,
as = w; R6 = w.
R7 = W; R7 = w,
R8 = w; R8 = w,
R9 = w; R9 = w,
I
I< -------------- t I 3
I
I
BUS A src T3’ T3 R1 R3 R1 R3 R4 R6 R4 R6 R7 R9 R7 R9 R7
dst LA LA M1 M1 M1 M1 M1 M1 M1 Mt M1 M1 M1 M1 A1
BUS 8 src D3’ D3 SUM SUM Sreg Creg Creg Sreg Sreg Creg Creg Sreg Sreg Creg a2
dst M2 M2 M2 M2 M2 M2 M2 M2 M2 M2 M2 M2 M2 M2 A2
MPYer opt x3 X3 W1 -W3 W1 -W3 W4 W6 W4 W6 W7 W9 W7 W9
op2 03' 03 C3 $3 53 C3 03 S3 S3 C3 C3 S3 S3 C3
ADDER opt 13' T3 0 C3Wt O 53Wt O C3W4 O S3W4 0 C3W7 O S3W7 W7
op2 DX3' 0X3 C3Wt-53W3 S3Wt-C3W3 C3W4 S3W6 S3W4 C3W6 C3W7 S3W9 S3W7 €3W9 a2
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
SUM to R9 Creg Sreg R1 R3 R4 R6 R7 R9

Figure 5.12.

Timing diagram of one pseudo-matrix multiplication.

58

5.4 Architecture Development 11

5.4.1 Synthesis of the Computational Structure

Now that the data path structure and clocking scheme are defined,
the architecture design becomes a synthesis of previous sub-task
development with some necessary refinements and modifications.

Due to the pipeline structure, delay elements are needed to
synchronize the sine generation mechanism. The delay elements can be
implemented by the same dynamic registers shown in Figure 5.11. Most of
the input registers of functional units have more than one input source.
The circuit diagram of a register cell capable of handling multiple

sources is shown in Figure 5.13.

 

 

 

 

 

 

 

 

\7,‘ s9! “ 9| .
Select I Select 2 Select 3 :I:
Input I‘J—T i ﬂ
Input 3 ﬂ >~>‘D04>-— Output
Input 3 I i

 

Output

Figure 5.13. Circuit diagram of an input register cell with
multiple sources [39].

59

For MOS technology, the propagation delay of a device increases as
the number of loads increase. Therefore, if the number of device
destinations in a connection path is more than one, all the destination
latches or registers must be able to disconnect from the path. The
register file is required to send information to bus A and receive a
result from the adder output simultaneously (but not for the same
register) as seen in the timing diagram. A two-port register cell with
such capability is shown in Figure 5.14. And finally, the entire DKS

chip architecture is presented in block diagram form in Figure 5.15.

 

Bus A llkﬂdAItm

Eli-s

 

 

 

 

Ld,’-\=I=Q_|_‘ll:[_—wi _:

 

 

 

 

 

From ADDER
Oi
RdA LdA 62 OPCIOIIOH
I O O BusAt—Req
O I O Rege—ADDER
O O I Reg«——Reg

 

Figure 5.14. A two-port register cell and its logic truth table.

60

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

t I - A . - - t I
l N P 0 R T
s/c-——-—;L , |
is SIN/COS f i 7 7
.52 l t— AL D - .AL V
a: r:
O l g D(x) T(x) 3+)
‘3
«one jCans. LX \I
UUdO'O L j
[ WMZ j l LM‘ * J
I l .
LAJ:
C M P Y I §
0 a s
2;: 2 | + g d)
:1); 3 I _ g
(D E M P Y 2 l ‘I’
a) » OJ
0 i?» l I
... 1A2 ‘ .Al
0" pA|I L J T:
o 4 J, J:
“J ”0 W A D D E R <
B 3 c "’
x. 03 {89 TE; 1 a;
:2 t s g 1
C) re
0 J Register File
,i
a.
0 U T P O R T akl

 

Figure 5.15. Block diagram of the DKS chip.

61

5.4.2 Description of Hardware in RTL

The block diagram of Figure 5.15 presents only the structural
aspect of the chip architecture. The functional aspect, or the sequence
of events ultimately realized by a control structure, is yet to be
specified. Hardware Design and Description Language (HDL) is widely
used to specify the functional, as well as structural, aspects of a
digital hardware design and can be used as input to some hardware
compilers. Because of their strong dependency on technology and
developer, existing languages often evolve from a small design community
and tend to be specialized [12]. Recognizing this, major research
efforts are being made to standardize the HDL so as to strengthen the
VLSI design process [40,41]. In view of this situation, this work uses
RTL to describe the control flow structure of the DKS chip primarily for
communication purposes. Furthermore, it is assumed that the RTL program
presented may later be translated to other similar languages for
implementation when necessary. The complete listing of the RTL program

can be found in Appendix 2.

Chapter VI

CONTROL AND DESIGN VERIFICATION

Before the design process can proceed further, a legitimate
question is whether the architecture of Figure 5.15, under the control
structure described by the RTL program, will provide the desired DKS
function. In answer to this question control signals are first
identified and generated for the 73 clock cycles. These control signals
are then used to drive a symbolic simulation program to confirm the

correct behavior of the system.

6.1 Control Signal Specification

Even though the RTL program is created mainly for communication
purposes, in practice, it is increasingly used for implementation
purposes. In fact, control logic can be synthesized from an RTL program
input with the capabilities of today's CAD systems. With this in mind,
the question from a system designer's point of view is clearly not how
the control logic is implemented but What the control signals should be.

It is obviously desirable to minimize the number of control signals
as it implies less chip area. However, in deriving the needed signals
from the RTL program, one must be aware that implicit states of certain
functional entities may exist. For example, loading or transmitting a
register is explicitly instructed, but what is not explicitly instructed

is the implied state of neither loading nor transmitting and the

62

63

forbidden state of loading and transmitting simultaneously.
Such a problem can be solved by further clarifying the operation
rules of various functional entities. For the design of the DKS chip,

the following rules are observed:

1. no register is allowed to load and transmit at the same
time;

2. no more than one source is allowed for any interconnection
or register at a time;

3. the system bus can get data from only one source and give
data to only one destination during any one cycle.

Another intriguing issue is the trade-offs of implementing the
control signals with a horizontal, vertical, or a mixed policy [42]. In
order to facilitate the generation of the control signals and the later
verification process, the control signals are organized into fields and
subfields. By assigning a meaningful name to each field, the control
function of each signal can be easily captured. The way in which the
control signals are partitioned, however, depends on which policy is
chosen.

By employing decoding techniques, the vertical approach packs more
functionality into a control word of a given size and thus has better
chip area utilization. But it suffers a slower response time due to the
decoding process. It also tends to make the encoding and simulation

process more complex. On the other extreme, the horizontal approach

64

embodies maximum parallelism in generating the control signals and thus
has a speed edge. But the waste of area is formidable for this approach
even when a register file of moderate size is considered.

At first glance, this looks mostly like an implementation problem
and therefore should not be determined during the design stage. But in
practice, whenever a control field is defined, a certain amount of
decision on the policy is already implied. In order to reduce the
design effort, a mixed policy adapted from the horizontal approach is
chosen in the partitioning of the DKS control signals. With this
policy, a control field is created for each functional element. If a
control field requires more than four bits, the vertical policy is
employed within that particular control field.

Based on the above principle, the control signals required for the
DKS chip are partitioned into 11 fields with each field further divided
into two subfields for easy encoding. The 11 control fields with their
signal definitions are shown in Table 6.1. Once the control signals are
organized and defined, the specification of control signals for each
clock cycle of the entire DKS calculation is reduced to a compilation of
the RTL program into machine codes. This step is done manually in a
straightforward manner. The control words, or machine codes, for the 73

cycles are listed in Appendix 3.

65

Table 6.1. Control fields and signal definitions.

 

No. Field Sub Range Definition
1 ADRTBL Ev 0/1 1: TXPORT<—T(x); Lx<—ANGLE(V) ,
V 0-6 0: Angle Register output in high 2 state
2 SINCOS s/c 0/1 0(1): Select SIN(COS) operation
Et 0/1 1: BUSAe—TXPORT; BUSBe—DXPORT;
LA<—BUSA
3 ROM EM 0/1 1: BUSBe-CONS(Madr)
Madr l-S Constants address
4 BUSAR T 0/1 1: BUSA<—REG(Radr)
Radr 1-9 Register address
5 RSUM L 0/1 1: REG(Radr)<—ADDER
Radr 1-9 Register address
6 MPY Ml 0-2 : Ml<—Lx
2: MleeBUSA
M2 0/1 1: M2<—BUSB
7 Al Alzd 0-2 1: Ale—LA
: Al<—0
Albs 0-2 1: Ale—ADDER
: A1<—BUSA
8 AZ/G A2 0-2 1: A2<—MPY2
2: A2<—BUSB
Gate 0/1 1: BUSB‘eADDER
9 MINUS CTRL 0/1 0: SAl,SA2<-SIGN (from SIN/COS)
l: SAl<—0; SA2<—SGN
SGN 0/1 0: ADDER<—AI+A2; l: ADDER<eAl-A2
10 Creg Lc 0/1 1: CREG<~ADDER
Tc 0/1 1: BUSB4-CREG
ll Sreg Ls 0/1 1: SREGe-ADDER
Ts 0/1 : BUSBe—SREG

66

6.2 Verification by Symbolic Simulation

Many tasks of the DKS chip design, from the construction of the
timing diagram to the manipulation of data flow, and from the
development of the RTL program to its compilation into machine codes,
are performed manually. The flexibility of human activity in many cases
complements the limitation of available resources; but it is also prone
to error. As the complexity grows, simulation increasingly becomes an
indispensable tool for VLSI design.

Because the homogeneous transformation method is already well
understood in the case of the DKS chip design, an exhaustive simulation
is not necessary. Besides, even for an angle word size of six bits, the
six joint angles form a total of 646 > 64x109 joint angle vector
inputs! An exhaustive simulation is plainly impossible. Generally
speaking, symbolic simulation is more practical at the system level

where the behavior of the system as a whole, rather than the correct

functioning of an indvidual building block, is the major concern.
6.2.1 Simulation Principles

The validity of any simulation rests on the rationality of the
mapping of the real system to the simulator. The behavior of the system
can be thought of as the total sum of events occuring in various

entities. From this perspective, entities may be modeled as program

67

variables while events are simply program statements. These two
axiomatic correspondencies in turn constitute the foundation of the
higher correspondence between the logical sequence of events of the DKS
chip and the logic structure of the simulation program.

Three kinds of entities are observed in the architecture of the DKS
chip. Memory elements including static devices (such as ROM) and
dynamic devices (such as registers), are readily compatible with
variables and constants of a program. Interconnects serve as the media
for data transfer between entities; they may be multi-source/destination
(such as a system bus) or dedicated (such as a wire). Functional units,
such as multipliers and adders, are timeless combinational logic
circuits which receive operands from dynamic storage elements, perform
the assigned functions, and generate results.

Events of the system fall in two categories, data transfer and
logic operations. The former can be simulated by program assignment
statements with little effort. The establishment of interconnects as an
entity enable the data transfer event to be decomposed into two logical
phases: from source to medium, and from medium to destination. Various
rule-checkings can subsequently be classified in either phase and
enforced. In contrast, no rule needs be checked for the logic
operations; but the function simulated by manipulating the symbolic
values through program procedures is slightly more involved even for

events seemingly as innocent as add or multiply. As mandated by the

68

clocking scheme of Figure 5.11, all functions are accomplished by C/L

occurring exclusively in phase two.

6.2.2 Program Development

Based on the observed correspondence between entities and
variables, a unique variable of type character string is assigned to
each of the entities in the DKS, system. The symbolic value of a
variable] represents a data item. The property of timelessness of the
C/L allows the separation of the outcome from the symbolic manipulation
process. Therefore, the variable of a functional unit is only
associated with the outcome. For a dedicated path, the only event is
the invariable source to interconnect to destination, which is simply
simulated by a single assignment statement. On the other hand, the
multiple source/destination situation of a system bus requires
rule-checking. This is facilitated by designating the first character
of the BUS variable as a flag. A load-bus operation is legal only when
the flag indicates the bus is empty('E‘) and subsequently the flag will
be set to 'F' for full. A receive-fromjbus operation will set either
the 's' or m. flag to 'a', and an error message will be issued if this
operation is attempted when the flag is '0'.

The event of data transfer is simulated by scanning all the load
bus cases first followed by processing all the load storage-element

cases. During phase two, dynamic registers or delay elements have their

69

phase-one value transferred to the phase-two variable before the logic
simulation is performed. The precharge of the bus is simulated by
assigning a special value of 'E?‘ to signal the undefined state.

The events of multiply and add are simulated by appropriate
manipulation of the symbolic values of the data and the outcomes are
stored in the variables associated with the functional units.

The sine function generation mechanism is simulated as a sum of
four independent events. In one event, during phase two, the value of
'T'i and 'DX'i will be assigned to TXPORT and DXPORT, respectively, when
the address table signal is fired. The angle 'X'i will also be assigned
to latch Lx. In another event, the multiplier will take Lx and BUSB as
its operand sources and the data in BUSA will be latched by the first
stage of a two-stage delay element. Then, in another event occuring in
phase two, if the multiplier procedure detects the input operands having
the symbolic values of 'X'i and 'DX'i, it will abort the normal multiply
manipulation and assign 'DX'i to MPYl. The last event occurs when the
adder procedure detects its input operand being 'T'i and 'DX‘i; it will
assign Si as the outcome stored in ADDER. It also checks the appropriate
control field and the SIGN to guarantee that the adder control signal is
correct. The cosine is simulated in a similar fashion with the help of
a special character flag. Obviously, sine and cosine outputs will
result only when a particular sequence of events occur.

The main program is a simple kernel driven by the machine code of

Appendix 3. In order to help in the correction of design mistakes as

70

well as the debugging of the program, an error encoding technique is
employed and a controlled dump of memory contents is provided. The
flowcharts of the simulation program are presented in Figures 6.1-6.3.
A complete listing of the simulation program can be found in Appendix 4.
In Appendix 3, some of the signals are designated with the x (don't
care) state. These conditions were converted into two machine code
tables: one with x equal to 0 and another with x equal to 1. Two
simulation runs, one with each table, are performed. Results are checked
against the correct ones individually to prove the correctness. The

simulation results are also shown in Appendix 4.

71

 

INITIALIZATION
Set all storage elements to UNDEF

 

 

 

/
SET DEBUG CONTROL

 

 

 

 

 

 

DONE? ﬂ/

 

N DUMP REGISTERS

 

 

 

 

 

 

READ CONTROL WORD

 

 

 

 

 

 

 

 

CALL PHASEI ( STOP )<

 

 

   

 

 

 

 

 

 

 

 

 

   

 

 

 

 

Y
ERROR? \1/
DUMP REGISTERS
L
Y
DEBUG? q/
DUMP REGISTERS
J

 

 

 

CALL PHASE2

 

 

 

 

N DEBUG?

Y
DUMP REGISTERS

 

 

 

 

 

 

 

Figure 6.1. Flowchart of DKS symbolic simulation program.

72

Task I

 

 

Check rules by examining control fields:
If ERROR. Set ERR code and RETURN

 

Task 2 J,

 

 

Process all load BUS cases:
For each case DO
It BUS<I=I> NOT E0 'E‘ then

Set ERR code and RETURN
else
Load BUS and set BUS<I=I> 'F'

end

 

 

 

Task 3

 

 

 

Process all LOAD controlled REG cases:
If Source = BUS then
If BUS<l=I> E0 '0' then

Set ERR code and RETURN
else
LOAD REG and set BUS<l=l> ’0'

end
end

 

Task 4 \L

 

 

Process data transfer between delay stages:

Load Phase-l latch from Phase-2 source latch

 

 

 

 

 

(RETURN4)

Figure 6.2. Flowchart of subroutine PHASEl.

 

73

Task l

 

Transfer Phase-l Latch content to
Phase-2 Latch

 

 

Task 2

 

 

Precharge BUS:
Set all controlled Phase-lLatch
to UNDEF

 

 

 

 

Task 3

 

SIN/COS Simulation

lf SINCOS = I
Set TXPORT. DXPORT values

 

 

 

Task 4 \l/

 

Functional Unit Symbolic Simulations

4.l Isl-Stage Multiplier simulation:
Simulate MULTIPLY operation on symbolic
values of Ml and M2. and results in MPYI

4.2 2-Stage= MPY2 MPYI

4.3 ADDER Simulation

Simulate ADD operation on symbolic
values of Al and A2. and results in ADDER

 

 

 

 

(RETURN)

Figure 6.3. Flowchart of subroutine PHASEZ.

Chapter VII

EVALUATION

7.1 Goal and Strategy

Based on the architecture presented in previous chapters, the
precise performance of the DKS chip and area requirement must be
determined by the physical design. Without the support of a physical
design, any time/area evaluation is inevitably subject to suspicions.
However, without even a qualitative estimation of the time/area
complexity of the chip, a "go or no-go" decision on physical design is
even harder, if not impossible, to make.

Since physical design is beyond the scope of this work, no attempt
is made to acquire an accurate time/area evaluation. The goal here is
to provide some basic statistics of the chip, and by comparing it with
the capabilities of current technology, provide some sort of "confidence
vote" on the feasibility of the DKS chip. In view of the industrial
practice of prototyping with semi-custom design, special attention is
paid to the possibility of the standard cell implementation approach.
This approach has been predicted to become the dominant form of
application-specific ICs [43,44].

Current technical information in the literature tends to be
fragmented, and its technology-dependent nature makes it even harder to
meaningfully use. At the time of this project, the only relatively
complete information was the IBM Master Image Chip approach [8]. Both

statistical data on functional units and cell utilization as well as

74

75

basic logic gate delays are available. The area of the DKS chip is
evaluated by this approach. The area required for CMDS implementation
is obtained from the NMOS figure according to the conversion rules
suggested by Wollssen [45].

Semi-custom design usually requires more chip area. Therefore, the
chip edge from the area estimation should be considered as an upper
bound. The phase one clock width is evaluated by taking the chip edge
as the worst case path for the propagation delay. Since the phase two
clock width is determined by the maximum combinational logic delay,
delay times from an actual multiplier IC and from gate delay summations
are used to obtain a range of possible C/L delay values. The minimum

clock period is then the sum of the two clock widths.

7.2 Area Estimation

The silicon real estate of a VLSI chip is used either for device
fabrication or for interconnects. While the former is determined
exclusively by circuit design, the later depends on many physical design
factors such as layout, placement and wiring, which are beyond this
design task. To overcome this difficulty, this work uses the physical
device count as the major parameter, and the device count is then
weighted with the transistor/cell utilization statistics obtained from
published sources to get a realistic picture of the area required for

the DKS chip including the interconnects.

76

7.2.1 Transistor Count

The transistor counts for various functional circuits in the DKS
chip are shown in Table 7.1. All transistor counts are based on NMOS
circuits taken from [3] except the adder circuit which is taken from
[46].

The transistor count of the control logic is unknown until the
actual logic design is carried out. A convenient way to get around this
problem is to assume that the control logic is implemented by storing
the machine code in a ROM with a counter generating the instruction
sequentially. 37 individual signals are needed for the DKS calculation.
This figure is rounded off to 40 to account for the I/O and other
interface control functions. Since this approach generally requires
larger chip area, the transistor count in the table may once again be
interpreted as an upper bound of the control logic section.

Because transistors in driver circuits require larger area, they
are counted separately. The table shows that the transistor counts of
the regular logic circuits and the random logic circuits are about the
same. Because of their regular nature, the former will occupy much less
area. Also, the area for storing table D(x) can be further reduced as
the first few bits of the D(x) values are always zeros.

The total number of transistors of the DKS chip is in the
neighborhood of 25K, or about 10K logic gates. This figure is well
under the projected limit of 50K logic gates for gate array

implementation [47]. In fact, CMOS gate arrays of up to 24K logic gates

77

 

 

Table 7.1. Transistor count of the DKS chip.
No. of No. of Total No.
Type Function Transistors Units of Transistors
per Unit
Regular TX(1 bit) 1 18x256 4,608
DX(l bit) 1 18x256 4,608
Constant 1 18KB 90
Control 1 40x73 2,920
Subtotal 12,226
Random Reg Cell 9 18X(9+6) 2,430
Delay Cell 6 18X2+2 228
controlled latches
l-source 7 18x2 252
2-source 9 18x2 324
4-source 13 18 234
Pass Trans. 1 18x3 54
Decoder
3-8 30 2 60
4-16 72 2 144
8-256 72 72x17+256 1,480
Subtotal 5,206
FA 18
MPYl FA 18x17+2 5,544
AND 5 2x17+l7**2 1,615
MPYZ FA 18 324
ADD FA 18 324
2-1 MUX 2 18x2 72
Subtotal 7,879
Driver I/O Buf 12 16x2 385
BUSA 5 18 90
BUSB 11 18 198
Control 8 55 440
Subtotal 1,112
Grand Total 25,232

78

have already appeared [48]. Therefore, the DKS chip can be implemented
on a single chip with current semi-custom design technology provided

gate utilization of higher than 50% can be achieved.

7.2.2 Macrocell Implementation

The gate array approach is ideal for implementing designs at the
Boolean logic level but is not suitable for high performance system
designs. To bridge the gaps of cost, design turn-around time and
performance between semi-custom design and full custom design, the
standard cell approach shows promising potential.

One such approach is the IBM Master Image Chip approach [8,49].
This approach uses a predesigned image with a fixed power bus
distribution and preallocated areas for circuits and wiring. Well

designed and tested functional circuits are stored in a macrocell

Table 7.2. IBM Master Image function-cell and chip statistics.

 

 

No. of Chip Statistics
Function Cell Item A=2um A=l.25um
2-1 MUX 1 Area(mm’) 7.4x7.4 8.0X8.0
LSSD SRL 2 Cell 2,432-2,952 8,468
SRL CLK Driver 5 Cell Util. 63-93% 79%
CLK Driver 3 Gates 7.6K—1l.8K 36K
CSA 6 Gate Delay 5-20ns 3.1-12.5ns *
4-bit Counter 10
16-by-l6 Multiplier 630
256x16 ROS 200
128x16 R05 120

* Estimated from figures of Zen

79

library and can be called out during circuit design much the same way a
digital designer selects an IC from a catalog of standard ICs. Design
efforts are thus minimized and yet the system performance can be
expected to approach that of fully custom designed ICs. Currently, the
Master Image cell library contains only NMOS circuit designs; but CMOS
implementation has been planned [8].

Table 7.2 shows the IBM Master Image Chip statistics used in the
evaluation of the DKS chip. Table 7.3 shows the estimation of cell
numbers. Linear dependence is assumed in the evaluation. For example,
the 16-by-16 multiplier requires 630 cells, and the 18-by-l8 multiplier
of the DKS chip is assumed 18/16 times larger in each dimension,
resulting in 798 cells. Note that the random logic represents 79.1% of

the total cell count compared with only 51.7% of total transistor count.

Table 7.3. Cell count of the DKS chip.

 

Function Cell Count
TX (256x18) 450
DX (256x18) 450
Constant 10
Control 100

Subtotal 560
Adder 108
l8-by-l8 MP? 798
Latches & Reg 828
Pass/ Mux 90
Counter 20
Clock Driver 165
SRL Driver 115

Subtotal 2,146

Total 2,706

80

In terms of a percentage of chip area, the 79.1% figure may be closer to
reality. Assuming a 10% estimation error and a .somewhat conservative
70% cell utilization, the number of cells required for the DKS chip is
about 4.3K. The Master Image chip approach allows two feature size
implementations: either 2-um with 2,952 cells or 1.25-um with 8,468
cells. Clearly, the DKS chip can be implemented with 1.25-um
technology. With half of the cell count, the chip edge of the DKS chip
is estimated about 5.6mm.

For conversion of NMOS circuits to CMOS circuits, the rule of thumb
is that the area of CMOS will be 30% larger than NMOS equivalent
circuits for random logic but the same for memory cells [45]. Using
these factors, the CMOS DKS chip will require a total of 5,317 cells,
about 24% larger than the NMOS version. The resulant chip edge is

estimated to be about 6.34mm.

7.3 Speed Estimation

The evaluation of the total DKS calculation time is based on
average gate delays. The gate delay of a two-way NOR gate of the IBM
Master Image approach using 2-um NMOS technology ranges from 5-20ns
[49]. Assuming linear scaling, the average gate delay for the 1.25-um
technology is 7.8ns. It should be pointed out that this figure is well
below the state-of-the-art gate delay. For example, CMOS gate delays of

0.8ns using 1-um technology has been reported [50]. What's more, the

81

limit of CMOS gate delay for 1-um technology is projected to be 0.2ns
[51]. Therefore, in this work, two values of gate delays are used. The
IBM figure of 7.8ns is used mainly to provide a bottomline of
performance while a lns/gate delay is used to represent what is actually
achievable in the near term.

As shown in Figure 5.11, the mdnimum clock period is the sum of the
clock widths of phase one and two. In calculating the minimum clock
widths, empirical data is extracted from various sources and applied to
the DKS chip design. Since the target feature size is 1.25um as used in
the estimation of the chip area, all timing data is adjusted according
to the scaling rules.

For the interconnections, the delay is due to the RC time constant
which is normally independent of scaling of the feature size 1. But, it
may become inproportionally larger in the total delay calculation when A
is scaled down. However, improvements in conductivity of interconnect
lines may be regarded as another scaling dimension. Combined with the
separation of interconnect scaling rules from device scaling rules, the
”scaling" of conductivity may justify the application of the same
scaling factor to the interconnect delay [38].

According to the specified clocking scheme, the phase one clock
width is the worst case path propagation delay. By inspecting Figure
5.15, the transfer of data from the register file to the input latch of
the multiplier may be taken as the worst case path. The delay

components include the register response time, bus propagation delay,

82

and the input latch response time. Table 7.4 shows the estimation of
these three components. The gate counts are derived from the relevant
circuits and the CMOS gate delay is assumed to be lns whereas figures in
parenthesis are calulated from the IBM Master Image average gate delay.
The total delay time for the phase one ranges from 20ns to 102ns.

The phase two clock width is the maximum combinational logic delay.
In the DKS chip, the first stage of the multiplier pipe can be
considered as the maximum C/L delay. Since the standard cell approach
promises the ability of incorporating proven functional circuits as
macros of the cell library, it is not unreasonable to assume that the
multiply time achieved by today's multiplier IC is also achievable by
the multiplier unit in the DKS chip. A survey of existing commercial
multiplier ICs indicates that a 16-by-16 multiplier can be as fast as

45ns per operation [52]. An even faster multiply time of 20ns using

Table 7.4. Estimation of phase one delay.

 

Delay Component Estimation Method Delay Time (ns)
Register 36 + 4G + 26 * 9 (70.2)
Response (Decode)+(Driver)+(Resp.)

Interconnect 6.34mmx1.25ns/mm @ 8 (8)

Latch Response 36 3 (23.4)

Total 20 (102)

* l G = l (7.8) ns
@ Scaled down from Sns/mm for S-um design rule [38].

83

1.5-um NMOS design rules has been reported by Bell Labs [53]. Based on
the somewhat moderate figure of 90ns per multiply-accumulate operation
(from TRW's l-um CMOS l6-by-16 multiplier IC), a lOOns multiplication
time for an l8-by-18 multiplier using 1.25-um CMOS technology, appears
appropriate. Since the delay times for the two stages of the multiplier
are about the same, the phase two clock is set to half of the multiply
time, i.e., 50ns.

On the other hand, if one full adder (FA) is assUmed to require 6
gate delays, the 18-FA chain in the first stage of the multiplier pipe
may require as many as 120 gate delays. Taking l (7.8)ns for one gate
delay, the phase two clock width is estimated 120(936)ns.

Table 7.5 sums up the clock period and total calculation time
estimations for various approaches and technologies. To achieve the
goal of a lOus calculation time, a gate delay of 1ns or less seems to be

necessary.

Table 7.5 Total calculation times for various approaches.

Technology Phase 1(ns) Phase 2(ns) Clock(ns) Total(us)

 

Fast Multiplier 20 50 70 5.1

Standard CMOS Cell
(1ns/gate) 20 120 140 10.8

IBM Master Image
(7.8ns/gate) 101.6 936 ~l,000 73

Chapter VIII

CONCLUSION

8.1 Achievements

The advent of VLSI technology offers unprecedented freedom, as well
as many challenges, for designers to integrate their knowledge and
expertise into a tiny piece of material of utmost elegance. However, an
ever-expanding gap exists today between system designers and the latest
available technology [54]. One of the difficulties in bridging this gap
arises from the fact that the design process at the different levels
has different subjects of abstraction and requires different abstraction
mechanisms. From this perspective, the greatest contribution by Mead and
Conway to the-VLSI system design is the idea of formulating a set of
rules both as a way to express the subject of abstraction and as an
abstraction mechanism. Following their lead, this work extends these
ideas to higher level designs by proposing various rules as guidelines
or templates for abstractions at different levels. The design of the
DKS chip is subsequently treated as an instantiation of these rules.
Although these rules will certainly be enriched as our understanding of
VLSI design deepens, the accomplishment of this work is a verification
that such an approach can potentially bridge the gap between the system
designer's ability and the capability of technology.

In addition, this work develops a set of tools and techniques to
facilitate the design at the system architecture level. Various

mechanisms have been incorporated into the construct of the timing

84

85

diagram such that it not only demonstrates the data flow, but also
allows for its manipulation. As to the development of simulation
programs, the validity of the simulation approach has been rationalized.
Based on the observation of the correspondencies between architecture
elements and program elements, it proposes principles and rules to set
down their relationship. Following these rules, a designer can map a
design into a symbolic simulation program with relatively little effort.

This work also verifies the feasibility for fixed-point calculation
of the DKS and further quantifies the number of bits required for the
resolution in a standard industrial environment. The major structural
and functional aspects of a VLSI chip architecture dedicated to the DKS
computation have been specified. Results indicate that the design can
be fabricated on a single chip using a semi-custom implementation

approach with current technology.

8.2 Summary

From the timing diagram of Appendix 1, it is verified that the
utilization of the bus, multiplier and adder reach 86.3%, 80.8% and
86.3%, respectively. They are somewhat lower than the first performance
specification. This is mainly due to the introduction of pipeline
segments into the multiplier which causes the data dependence problem to
become more severe. In contrast, the first design, without pipelining
inside the multiplier, achieves an average of 90% utilization.

The evaluation of the last chapter indicates that the total DKS

86

calculation time of lOus is attainable provided the basic gate delay is
1.0ns or less.

In short, the architecture design presented in this thesis
satisfies the major design objectives. The speed improvements, in
retrospect, may be attributed to two fundamental improvements: (1) the
increased speed of basic operations by using dedicated functional
hardware and by inter- and intra- functional unit pipelining; and (2)
the relief of the fundamental limitation of bus bandwidth by providing
dedicated local interconnect paths.

As stated at the beginning, this work is mainly concerned with the
design specification at the system level. As a result, much work at the
physical levels, such as logic design, circuit design and timing
verification, remains to be done. At the system level, more pipeline
segments may be introduced into the functional units. The I/O
activities and mechanisms must be further specified. Depending on the
data access time, I/O buffering may be desirable. 'All I/O and other
interface control functions have to be incorporated into the control
logic of the chip.

Beyond the implementation problem, the future utilization of the
DKS chip may be more interesting. This work shows that with nascent
VLSI technology, the DKS of a robot arm can be obtained in as little as
lOus. This result may have implications on the method of obtaining the
Inverse Kinematic Solution (IKS). For example, with a modified algorithm
using the DKS chip as its building block, the IRS may be approximated

iteratively with superior speed.

APPENDICES

APPENDIX 1

DKS TIMING DIAGRAM

Al. DKS Timing Diagram

 

 

 

 

 

 

on we no .e on no oecn oeto on no «a
on wn nn an .n On an on an on ma wn no on .n 0n n. n.
n3nn n3no o3no w3nm o3nm w3no n3no-.3nm n3nn-.3no we no nxo .nxo o3 n3
n3no o w3nm o w3no o .3nm o .3no o m3- .53 n» .ne 0 o
( 1‘ < ( l\ < ( ( ( ( < ( ( ( ( ( <
nn nn no no nm no no no no nm no no .no mo
e3 n3 e3 o3 w3 o3 w3 n3- .3 n3. .3 nx nx w3wm
no nm no no no nm no no no nm nn no no .no
m3 n3 m3 n3 o3 w3 o3 w3 n3- .3 n3- .3 nx nx
«x n: a: a: «a a: a: n: a: «x a: n: we «w n: «3
onto ootm uocm auto auto uotm 00cm auto coco aoim ootm auto we no no .mo
.3 .z .a .a .x .x .z .x .x .3 .x .3 .< .w <4 «4
no no no no me we on we no .e no .e no em n» .n»
a
a
n . --------------v~
a
.u on we no mean oeco .a no «a euro we on
a. o. n. w. n. n. .. o. n o n o w n n . o

 

mivo wivm nzvm wivo vrvm QIQO mmwo vxo .vxo mowo o):

(

 

0x0 wmxo mxo .mxo

‘ ( ‘

(

 

 

 

 

 

.3wn o .3wo o o o 0 we .we 0 o me .me me .m»
we wo wm wn wo wn wo mo wo .wo oo oo oo .oo no .mo
w3 n3 .3 n3 .3 w3 w3 mm wx wx no m3- ox ox mx mx
oo oo wo wm wm wo wm wo oo wo .wo oo oo oo .oo no .mo
w3wn w3 n3 .3 n3 .3 w3 w3 mm wx wx mo m3- ox ox nx mx
«3 n: n: a: n: a: n: n: a: n: a: n: a: n: a: a: a:
on on auto ootm omen onto 33m 23m onto wo .wo son oo oo .oo no .mo
.3 .x .x .3 .x .z .x .x .x «a an .x .3 wo we «3 <4
me we on .e no .a we we we we .w» on me o» .m» me .me

a
H

----I--- ...... v“

H

ZOHH<NH4<ah~2H

«no
«do

«oo
.oo

«no
.no

you
0.3

«on
One

00 13m

mmoo<

N >a1

w >a!

m mam

( mam

00 83m

.no

«no
.no

«no
.Q0

#00
OLD

«on
One

umoo<

N >a!

p >d!

m mam

( mam

87

 

 

 

 

 

 

 

A1. DKS Timing Diagram

 

 

 

 

 

 

 

 

 

on an no we no .n on corn oeco on an a» 35m
n» .n on no no no no no we no no .o 00 on no en on no wn no
n3.o n3.n m3.n n3.o o3.o w3.o o3.m w3.o n3.o .3.o n3.m .3.o no .xo ..xo n3no n3nn noo
n3.m o n3.o o w3.m o w3.o o .3.n o .3.o 0 n3 .» ..e n3nm- o .no umoow
.o .m .m .o .o .n .n .o .o .m .n .o .o ..o no noo
n3 n3 n3 n3 o3 w3 o3 w3 n3 .3 n3 .3 .x .x n3 .oo n >a3
.o .m .m .o .o .m .m .o .o .m .m .o .o ..o noo
n3 n3 m3 n3 m3 w3 o3 w3 n3 .3 n3 .3 .x .x .no . >a3
n3 n3 n3 n3 n3 n3 n3 n3 n3 n3 n3 n3 nw n3 n3 «no
menu ootm oetm auto aeto Germ oecm Deco oeto oecm uecm oeto no .0 ..0 can m mom
.3 .3 .3 .3 .3 .3 .3 .3 .3 .3 .3 .3 .w we «a «no
on an on nu me we no we nu .e no .3 on .p ..e or» w mom
H
H
. w . --------------v.
H
on we nu .n em oecm oeto on an on 33m
no .n on nw ow aw ow mw ww nw nw .w ow on on an on
n3nm n3no n3no w3nm m3nm w3no n3no .3nm n3nm .3no no nxo .nxo m3no n3nm noo
n3no o w3nn- o w3no o .3nn- o .3no 0 n3 ne .ne e3nn o .no nmoow
nn no no no nn nn no no nn nn no no .no no noo
n3 n3 53 m3 w3 m3 w3 n3 .3 n3 .3 nx nx n3 .oo n >d3
no nm nm no no nm nm no no nm nm no no .no noo
o3 n3 m3 n3 n3 w3 m3 w3 n3 .3 n3 .3 nx nx .oo . >d3
n3 n3 n3 n3 n3 n3 n3 n3 n3 n3 n3 n3 nw n3 n3 «no
auto oetm oetm ooco ooco oocm ootm onto oeto oucm aetm auto no no .no one 0 mom
.3 .3 .3 .3 .3 .3 .3 .3 .3 .3 .3 .3 .w <4 <3 «no
on an no no on we on we no .a nu .m an ne .n» are w mom

H
u

n . . --------------vn

H

88

APPENDIX 2

DESCRIPTION OF THE

DKS CHIP IN RTL

A2. Description of the DKS Chip in RTL

type REGISTER : Bit<l7:0>;

var

DYNREG : Bit<l7:0>; /* Dynamic Register */
JOINT : Bit<ll:0>;
ROM : Bit<l7:0>;

R(9), CREG, SREG OF REGISTER;

ANGLE(6) OF JOINT;

LX, M1, M2, A1, A2, TXPORT, DXPORT, TXDLAY(2) OF DYNREG;
TX(256), DX(256), CONS(5) OF ROM;

MPYI, MPY2, ADDER 0F Bit<l7:0>; /* C/L output */
SINCOS OF Bit<7:0>; /* Table address */
SC, CTRL, SGN OF Bit<0:0>; /* Control signal */

MACRO TRIG (FUNC, I)
Cycle N: if FUNC = SIN then SC<-0 else sc<-l endif,

TXPORTé—TX [smcos (ANGLE( I )<ll : 4>> ] .
DXPORT<—DX[SINCOS(ANGLE(I)<ll:4>)]t

Cycle N+l: le—Lx, M2<—DXPORT, TXDLAY(1)<—TXPORT;
Cycle N+3: Alw—TXDLAY(2), A2<—MPY2, CTRLe—O;
Cycle N+4: if FUNC = SIN then Ls<—1 else Lc<—l;
END TRIG

Begin
Cycle (0): .TRIG(COS,5);

(l): .TRIG(SIN,S);

(2): .TRIG(COS,6); '

(3): .TRIG(SIN,6);

(4): R(6)<—ADDER;

(5): le-R(6), M2<—CON5(0), R(4)<—ADDER; /* CONS(0)=d6 */
(6): .TRIG(COS,4), M1<—R(6), M2<—ADDER;

(7): .TRIG(SIN,4), Ale—0, R(2)<—ADDER;

(8): Al<—0, R(9)<—ADDER;

(9): Ml<-R(4), M2<—CREG, R(1)<+ADDER;

(10): Ml<—R(4), M2w—ADDER;

(ll): Ml<—R(4), M2<—ADDER, Al<—0;

(12): Ml<—R(l), M24—CREG, Al<—0, R(3)<—ADDER;

(13): Ml<—R(2), M2<—SREG, Ale-0, RUM—ADDER;

(14): Mlé—R(1), Mzé-SREG, Al<—0, R(S)<—ADDER;

(15): Ml<—R(2), M2<—CREG, Al<—ADDER, SGNe-l;

(16): Ml<—R(4), M24—CONS(0), A14—0, R(l)<+ADDER;
(l7): .TRIG(COS,3), Ml<—R(5), MZw—CONS(0), Alé-ADDER;
(l8): .TRIG(SIN,3), Alé—O, R(2)<—ADDER;

(19): Al<r0, R(7)<—ADDER;

(20): R(8)<—ADDER;

(21):
(22): Al<—R(7), A2<—CONS(1); /* cous<1>=a3 */
(23): Al<—R(9), A2<-caNS(2), R(7)<—ADDER; /* cous<2>=d4 */

(24): M1<—R(1), MZG—CREG, R(9)<—ADDER;
(25): Ml<—R(3), M2<~SREG;

89

End.

(26):
(27):
(28):
(29):
(30):
(31):
(32):
(33):
(34):
(35):
(36):
(37):
(38):
(39):
(40):
(41):
(42):
(43):
(44):
(45):
(46):
(47):
(48):
(49):
(50):
(51):
(52):
(53):
(54):
(55):
(56):
(57):
(58):
(59):
(60):
(61):
(62):
(63):
(64):
(65):
(66):
(67):
(68):
(69):
(70):
(71):
(72):

Ml<—R(l),
Ml<—R(3) ,
M1<—R(4),
Ml<—R(6).
Ml<—R(4),
M1<—R(6),
Ml<—R(7),
Ml<—R(9) ,
Ml<—R(7) ,

M2<—SREG,
M2<—CREG,
M2<—CREG,
M24—SREG,
M2<—SREG,
MZe—CREG,
M2<—CREG,
M24—SREG,
Mzé—SREG,

A2. Description of the DKS Chip in RTL

Al<—0;

Ale—ADDER, SGNé-l;
Alé-O, le—ADDER;
Alé-ADDER;

Al<—0, R3<—ADDER;
Al<-ADDER;

Al<—0, R4<—ADDER;
Ale-ADDER, SGNw-l;
Ale-0, R6w—ADDER;

.TRIG(COS,2), Ml<—R(9), M2<—CREG, Al<—ADDER;
.TRIG(SIN,2), A1<-0, R(7)<eADDER;

Al<—ADDER, SGN<—1;
R(9)<—ADDER;

Al<—R(7),
le—R(l),
le—R(3),
M1<—R(1),
Ml<—R(3),
Ml<—R(4),
Mle—R(6),
Ml<—R(4),
M1<—R(6),
Ml<—R(7),
Ml<—R(9),
le—R(7),

A2<—CONS(3);

M2<—CREG,
M24—SREG;
M2<-SREG,
M24—CREG,
M2<—CREG,
M2<—SREG,
M23—SREG,
M2<—CREG,
M2<—CREG,
M24-SREG,
M24—SREG,

/* CONS(3)=a2 */
R(7)<—ADDER;

Alw—O;

Ale-ADDER, SGN<—l;

Al<—0, SGNw—l, Rl<—ADDER;
JIM—ADDER, SGNw—l;

Al<—0, R34—ADDER;
AlweADDER, SGNe-l;

Ale—0, SGNw-l, R4<—ADDER;
Al<—ADDER, SGN<—l;

Al<—0, R64—ADDER;

.TRIG(COS,1), M1<—R(9), M24—CREG, Al<—ADDER, SGN<—1;
.TRIG(SIN,1), Alé-O, SGN<+1, R(7)<—ADDER;

Ale-ADDER, SGN€—l;
R(9)<—ADDER;

Alw-R(8),
Ml<—R(l),
M1<—R(2),
M14—R(l),
Ml<—R(2) ,
Ml<—R( 4) ,
Ml<—R( 5) ,
le—R(4),
Ml<—R(5),
M1<—R(7) ,
le—R(8),
le—R(7),
Ml<—R(9),

A2<—CONS(4);

M2<—CREG,
M2<—SREG;
M2<—SREG,
M2<—CREG,
M2<—CREG,
M24—SREG,
Mzé—SREG,
M24—CREG,
M2<—CREG,
M24—SREG,
M24—SREG,
M24—CREG,

A1<—0, R(7)<—ADDER;

Al<—ADDER;

R(8)<—ADDER

/* cous<4>=d2 */
R(8)<—ADDER;

Al<—0;

Alw—ADDER, SGNw—l;
Ale—0, Rl<—ADDER;
Al<—ADDER;

Al<—0, R34—ADDER;
Alw—ADDER, SGN<—1;
Alw—O, R4<—ADDER;
Ale—ADDER;

Alw—O, R5<—ADDER;
Alw—ADDER, SGN<—1;

90

APPENDIX 3

DKS MACHINE CODE

A3. DKS Machine Code

CLOCK ADRTAB SINCOS ROM BUSAR RSUM MPY A I A2/G MIWS CREG SREG
CVCLE Ev v s/c 6t EM adr T adr L edr Mt M2 zd be A2 O CO SON Lo To Ls Ts

..................................................................................

2 I 6 I I O X 0 X 0 X I I X X 0 X 0 O O O
3 I 8 O I O X 0 O X I I I I O X 0 O
4 X 0 X I O X 0 I 6 I I I X 0 O

0-----------—--—----------------------------------u-uoccc------Q----------------n-

M‘----‘----------------------—-----------------------------------—-------.H

.--c-------------------------—-------------------—----------o-----------——-------.
3

.---—-------.-----o---~-—---—----—------------------o-----—---------------—------.

.-------w—--—--—-———-o---o------—Q—qn-n----——~-—---——-I—-—-----—-—u--——--’-----—--—

X
X X
10 X 0 X 0 O X I 4 O X 2 I I I I O I O O

‘;;'"’;“a“‘;"g"‘a";"‘;";“1;";"’;";"‘5 """""""""""""""""
7;“‘1";“;"a"'.;";“';“r"7253“:";""";";"; """ 6"}"3'"-
°;;"";"5"';".;"'5";°“;";"';";"';";‘";""";";"; """ a???
';;“";“.s‘"g"a"‘.;“;"‘;";'"In;";";"';"“";"a"‘; """" Jr's":
7;"32".?";"a"'a";"';";‘“.37"';";"‘;,“;“‘;".;"';";'"a”;'"a"a'
1:;"‘;“.;“';”5'"VJ”;“I";";‘”;"; """ 5"";"5‘“;"a"'.;"a"‘.s“a'

O

..................................................................................
..................................................................................
..................................................................................
..................................................................................
..................................................................................
..................................................................................
..................................................................................

.......-.--.-.----—...-.-.-.-.-...--..-......-.—....—........Q....--..-...--.----.

91

A3. DKS Machine Code

A

II

25

0
e
R
S.
I.
C
G...-
E
R
CO
L
N
56
W5
I
am
GO
/
2
‘2
A
.
Ib
d
I
2
N
V.
m:
H
r
".0
U.
5
RI.
P
Rd
A.
S
UT
8
I
N
E
be
E
C
I
.

CLOCK ADRTAB SINCOS ROM
CYCLE Ev v

--------.----------------------------------------.-.-.---------------------..----.

26

27

28

29

30

.--------------------------------------------------------------—‘---------‘~------

31

-----------------------------—---------------------------------------------------.

32

33

34

35

.----------—--------------------------------_---------------—--a-----------c-----.

36

a..—...-—..--—.-...--.—.-.---—....-.....-..-.-..-....----.——--—-....--.—.....-—-..

37

38

.------‘O---’--------------------------------------------------------------------.

39

40

41

42

43

44

45

.....-..-—--—.-........---..—..—.....----.--.---.-o...-.-—.——.......-..-.....--.--.

46

.---------------------------------------------------------w----—----------OD-----.

47

48

.—--.—....-.-.-....-....-----——.c..-...-—-—-—...—.....-............--...-._--.-....

49

50

92

A3. DKS Machine Code

BUSAR

TS

G
E
R
58
I.
c
GT
F.
R
CO
.L
N
SG
W5
IO
MC
66
/
2
A2
A
5
ID
Ad
2
2
M
V.
0.
"I
M
P
Md
U8
5
RI.
P
d
a
T
P
d
Ma
0
RM
E
St
0....
C
NC
.1/
$8
8
AV
.l
R
DV
AE
KE
CL
0C
LV-
CC

51

--------------—-----------------—-------------------------------------------------

52

.....-.-.......---........--.-....-----.-..-.-......o.—.---.---....-----..-...--..

53

----------------D---—-~¢----------------Q-----------------------------——---------.

54

55

56

C-----------—-------------.-----------------—-------------o-------------—-QQ------.

57

.------------------------c-----‘o-------—----Q----no-----e—------—---O-------------

58

59

60

61

--¢-----------------------n--.—-—«.---v-w--n——-—-——---—------~u—u.----------------—---.

62

63

64

u-—-----—------------";--——-—..---------na-—-—----—---——-——--—--—--------------ocucc

O O X

X

65

...-....-...-...-...-...----.—.........-—---....-.....-..O—.-.—..-.....----..---..

66

67

-
.........-.-.-—-.-.....—.-—...--....---.—-.--.-.-..-—...-..--—-.....-.—-~.....-...

68

69

7O

.-------—---:--------------------—-----u-----------------—------------h----------.

'II

72

93

APPENDIX 4

PROGRAM LISTING AND SIMULATION RESULT

A4. Symbolic Simulation Program Listing

!
l
PROGRAM OKSVRF 74/175 OPT: L ROUND: 445/ FTN 5. 587 O2é27/ss
Do-gLONcl- OT. ARG= -COMMON/ FIXED. cs- USER/— FI ED. DB=-TB/D SB/- SL/ ER/ /-PMD/ ST. PL= .
0 PROGRAM DKSVRF
c THIS PROGRAM SIMULATES THE DATA FLOw (USING SYMBOLIC VALUES)
c IN THE DKS CHIP TO VERIFY THE CONTROL SIGNAL SPECIFICATION
E AND THE DESIGN AS A WHOLE AT THE BEHAVIORAL LEVEL.
IMPLICIT INTEGER (A-Zl
COMMON /CONTRL/CDNTRL 11.2)
COMMON /CFIELD ADRTBL SINCOS CONS.BUSAR RSUM MPY. AL 42. MINUS. C. 5
COMMON /MEM/RO (a; 4).tHETA(6).UNDEF.CSFLAG. SIGN
COMMON /REG/REG 9).CREG.SREG
COMMON /LATCH DYNREG(4.2).DELAY(4 2),TXPORT DXPORT. Lx
6 COMMON /MISC/ USA BUSB.MPY1.MPY2.LDDER.MYPIPE(L2
CHARACTERt2 ROM. THETA. UNDEF.CREG SREG.CSFLAG.SIGN
CHARACTERt4 TXPORT DxPORT.Lx.DE Y
CHARACTERFIOO DYNREG. REG.MPY1.MPY2.ADDER.BUSA.BUSB.MYPIPE
c EXTERNAL PHASEI. PHASE2
c FOR INFORMATION ABOUT THE USAGE OF THE VARIABLES.
g SEE COMMENTS IN SUBROUTINE PHASEI.
DATA ROM/D 6’.’A3’ '04'. 'A2'. '02'
DATA THETA/‘Xt'.’x2. xa'. 'x4 x ' xa'l
DATA ADRTBL.SINCOS CONS. BUSAR. RSUM. MPY. 41/1. 2. 3. 4 5. 6. 7/
DATA A2.MINUS.C.S/é.9.1O.11/
C DATA UNDEF/’E?’/
g INITIALIZATION: SET THE CONTENTS OF ALL STORAGE ELEMENTS To UNDEFINED
DD 1 I: 1 4
DYNREG I. =UNDE
DYNREG I. 2 =DELAYII. I)=DELAY(I.2)=UNOEF(2:)
1 CONTINUE
on 2.9
REG(I HUNDEF(2 )
2 CONTI
BUSA= BUSB= UNDEF
CREG= SREG= UNDEF(2: )
TXPORT= DXPORT= Lx= UNDEF(2:)
MPY1=MPY2= ADDER=UNDEF(2: )
MYPIPE(L 1): MYPIPE(1 2): UNDEF(2 )
CSFLAG= SIGN: UNDEF(2: I
PRINT
3 FORMAT 3('1')
C THE BEHAVIOR OF THE CHIP Is SIMULATED THROUGH THE COMPLETE D.K.S.
c CALCULATION PERIOD DRIVEN BY THE CONTROL SIGNAL TABLE. THE ENTIRE
C PERIOD CONSISTS 0F 77 2-PHASE. NON-OVERLAPPED SYSTEM CLOCK CYCLES
c DURING PHASE ONE. DATA FLOwS BETwEEN VARIOUS STORAGE ELEMENTS VIA
c SYSTEM OR LOCAL BUSES. DURING PHASE TWO. DATA IS TRANSFERRED To
C PHASE TWO LATCHES AND PROCESSED BY COMBINATIONAL LOGIC SECTIONS.
c THE ASSIGNED FUNCTION IS SIMULATED BY MANIPULATING THE SYMBOLIC
g VALUES OF DATA. THE BUS Is ALSO PRECHARGED DURING PHASE Two.
ERR=o
CYCLE=O
OPEN 1.F;LE=’TABLE’)
READ 1.: NUM DBUG
IF (MUM .GT. 7O) NUM= 76
100 IF CYCLE .GT. NUM) THEN
CLOSE (1)
CALL DMPREG
STOP
ENDIF
READ (1 *)E ((CONTRL(FIELD SUB). SUB=1.2). FIELD=ADRTBL.S)
IF (DBUO.1)
PRINT s.E
PRINT t. ' CY LE 2 '. CYCLE. ' CONTROL wDRD = '. ((CONTRL
+(FIELD.SUB). SUB=t.2). FIELD=ADRTBL.S)
c ENDIF
c CALL PHASE1 (ERR)

IF (ERR .NE. 0) GOTO 200

94

A4. Symbolic Simulation Program Listing

PROGRAM DKSVRF 74/175 0PT81.ROUND= A/ S/ M/-D.-DS FTN 5.14587 02/27/85

IF (DBUG .E0. 1) CALL DMPOYN
CALL PHASE2
IF (DBUG .E0. 1) THEN

CALL OM DYN
PRINT 1. ' MPY1 . '.MPY1
PRINT ., '
PRINT A. ' ADDER = ’.ADDER
PRINT ., ' '
CALL OMPREG
C ENDIF
IF (éCYCLE EO.9).OR.(CYCLE.EO.24).OR.(CYCLE.EO.4O).OR.(CYCLE.EO
+.58)g ALL OMPREG
E=CYCLE+1
c GOTO 100
8 ERROR HANDLING SECTION
200 PRINT '. ' ttttt ERROR 45...!
pRINT ., . CYCLE = '. CYCLE. RROR CODE = '. ERR
PRINT -. ' CONTROL VORO = '. ((CONTRL(FIELO. SUB). SUB=1.2). FIEL
+D=ADRTBL.S)
PRINT .. ' '
PRINT -. ' BUSA . '. BUSA
PRINT o. ' BUSB = '. BUSB
PRINT t. ' '
CALL DMPOYN
CALL OMPREG
STOP
END
SUBROUTINE OMPREG 74/175 OPT=1 ROUND: A/ s/ M/-D - TN 5.1+587
-- -- _ - 02/? m/
DON SL0NG/- OT ARG- COMMON/— FIXED. CS— USER/~F IXED O.OB=-TB/-S SB/- SL/ ER/ IO/ pmo/ 51 pL L=500005
SUBROUTINE OMPREG
COMMON /REG /REG(9)
CHARACTER'1100 REG
831?TI‘1 é ..... OUTPUT REGISTER CONTENT DUMP .....I
PRINT c. ' OUT REG '. I. ' = '
1 CONTINUE ‘ REG(I)
PRINT t. ' '
RETURN
END
SUBROUTINE DMPOYN 74/175 OPT=1.ROUNO= A/ s/ M/- D 05 TN 87 02/27/35

9?;gi0NG/-OT.ARG=’COMMON/ FIXED. CS= USER/ FIXED. DB=-TB/ 58/" SL/ ER/ 10/ PMD/ SI. PL=SOOOO

SUBROUTINE DMPOYN

COMMON /LATCH/DYNREG(4 2)

CHARACTER»1OO DY

SSINTIFi ; ttttEE GDYNAMIC REGISTER CONTENT OUMP .....o

’ DYN REG ’. I. ' PHASE '. d. ’ = '. DYNREG(I.J)

95

TN5..

A4. Symbolic Simulation Program Listing

SUBROUTINE PHASE1 74/175 DPT-1 R0UND= A/ s/ M/D FTN 5 5802/27/85
DD--LON G/- OT. ARG= -COMMON/- FIXED. CS= 'USER/-FIxE D OB=- TO/D SB/- SL/ ER/- 0/ PMD/- ST. PL= 50000

SUBROUTINE PHASE1 (ERR)

IMPLICIT INTEGER (A-ZI

COMMON /CDNTRL CONTRL 11.2)

COMMON /CFIEL0/ADRTBL SINCOS CONS BUSAR. RSUM MPY A1. A2. MINUS c. 5

COMMON /MEM/ROM£O:4) THETA(5). UNDEF. CSFLAG. SIGN

COMMON /REG/REG 9).cREG SREG

COMMON /LATCHéDYNREG(4.2 DELAY 4 2) TXPORT DxPORT. Lx

COMMON /MISC/ USA.BUSB.MPY1.MPY .ADOER.MYPIPE(1

CHARACTER 2 RD

OOOCOOOOOOOOOOOOOOOOO

O

000

M. THETA UNDEF. CREG SREG. CSFLAG. SIGN
CHARACTER 4 TXPORT DxPORT. Lx. DELAY
CHARACTER11OO DYNREG. REG. MPY1. MPY2. ADDER. BUSA. BUSB MYPIPE

VARIABLE USAGE EXPLANATION:

CONTRL--CONTROL WORD OF THE CURRENT CYCLE
IST INDEX IOENTIFIES ’FIELD’ AS IN ’CFIELD' COMMON BLOCK
EACH FIELD HAS 2 SUBFIELDS IDENTIFIED BY THE 2ND INDEX
DYNREG(I. PHASE)--DYNAMIC REGISTERS CONTROLLED BY PHASE = 1 OR 2
I= 1: M1. MULTIPLIER OPERAND LATCH CONNECTEC TO BUSA
I=2: M2. MULTIPLIER OPERAND LATCH CONNECTEC TO BUSB
I=3: A1. ADDER OPERAND LATCH CONNECTEC TO BU SA
I=4t A2. ADDER OPERAND LATCH CONNECTEC TO BU SB
DELAY(I.PHASE)'-DELAY ELEMENTS CONTROLLED BY PHASE 8 1 OR 2
I=1: 1ST DELAY STAGE FOR TX
I=2: 2ND DELAY STAGE FOR TX
I=3: 1ST DELAY STAGE FOR SIGN CONTROL IN SIN CALCULATION
I=4: 2ND DELAY STAGE FOR SIGN CONTROL IN SIN CALCULATION
BUSA(B)‘-THE 1ST CHARACTER IS A E/F FLAG FUNCTIONS SIMILAR TO CTRLAH

TASK 1: RULE CHECKING

IF CDNTRL(BUSAR.1)+CONTRL(RSUM.11.GT. 1 THEN

I (CDNTRL(BUSAR. 2) .E0. CONTRL RSUM. 2) ERR=1
..... ERROR: TRY TO LOAD AND TRANSMIT AM REG AT THE SAME TIME .....
ELEEéF2(CONTRL(C.1)+C0NTRL(C.2) .GT. 1 THEN

**‘*‘ ERROR: TRY TO LOAD AND TRANSMIT CREG AT THE SAME TIME *“**
ELESéF3(CONTRL(S. 1)+CONTRL(S. 2) GT. 1)T HEN
***'* ERROR: TRY TO LOAD AND TRANSMIT SREG AT THE SAME TIME '***'
EEDIERR .NE. 0) RETURN

TASK 2: PROCESS LOAD BUS CASES

IF (CONTRL SINCOS .E0. 1) THEN

BUSA=’F /TXP MORT
BUSB=’F //DXPORT
NDIF

IF I£(CONTR%(BI)JSA5 1) .E0. 1) .AND. (CONTRL(BUSAR.2) .NE. 0)) THEN

“RR " E E') THEN
:tttt:E¥5GISTER ATTEMP TO LOAD THE ALREADY LOADED BUS A .....

ELSE
BUSA='F’ //REG(CONTRL(BUSAR. 2))

ENDIF
ENDIF
IF (CDNTRL(CONS. 1) .E0. 1) THEN
IF (BUS B .NE. ’E’ ) THEN
.....E RROM2 ATTEMP TO LOAD THE ALREADY LOADED BUS B .....
RETURN
ELSE
BUSB=‘F’//ROM(CONTRL(CONS.2))
ENDIF
ENDIF
IF éCONTRL’C.2) .E0. 1) THEN
I (BUSB :1 .NE. ’E’) THEN

96

4. Symbolic Simulation Program Listing

SUBROUTINE pHASE1 74/175 OPT=1.ROUND= A/ s/ M/-D.-DS FTN 5.1+537 02/27/85
c ..... CREG ATTEMP TO LOAD THE ALREADY LOADED BUS B .....
RETURN
ELSE
BUSBx'F'//CREG
ENDIF
ENDIF

000

IF éCONTRLES” .2) .E0. 1) THEN
.NE. ’E ) THEN

ERR 14
OttttRE¥REg ATTEMP TO LOAD THE ALREADY LOADED BUS B '*‘*'

LSE
BUSB='F'//SREG
ENDIF
ENDIF

IF 1(CONTRL2A2) 2) .E0. 1 THEN
NE. ’E’ THEN

ERR=
tttttpE¢885R OUTPUT ATTEMP TO LOAD THE ALREADY LOADED BUS B *“"

LSE
BUSB=’F'//AOOER
ENDIF
ENDIF
TASK 3: PROCESS THE LOADINGS OF REGISTERS OR CONTROLLED-LATCHES
IF P((CONTRL(ADRTBL. 1) .EO ngg .AND. (CONTRL(AORTBL.2) .NE.O)) THEN
tittl Ottt.’
PRINT * ’ AT CYCLEA '. CYCLE-1. ' ATTEMP TO LOAD THETA. INSTRU
+EL69N IGNOREO.’

.AND. (CONTRL(RSUM.2) .NE. 0)) THEN

IFE(gOgTRL(MPY. 1) .GT. 2) THEN
ﬁ'ttﬁ SLTEMP TO LOAD MULTIPLIER INPUT LATCH MORE THAN ONCE ****‘

RETU
IF (CONTRL(MPY.1) .E0. 1) THEN
OYNREG(1.1)=LX
ELSEIF (CONTRL(MPY.1) .E0. 2) THEN
IF (BUSA( 1) .E0. '0') THEN
ERR=22
RETURN
ELSE
DYNREG$1.1)=BUSA(2 )
BUSA(: )='0'
ENDIF
ENDIF
ENDIF
IF SCONTRLfMPY. .2) .E0. 1) THEN
.Eo . '0') THEN
tttttE ATTEMP TO LDAD MULTIPLIER INPUT LATCH MORE THAN ONCE .....
RETURN
ELSE
DYNREG(2.1)=BUSB(2: )
BUSB(:1)= 0'
ENDIF
ENDIF

IFEééCgNTRL(A1.1) .NE. 0) .AND. (CONTRL(A1.2) .NE. 0)) THEN
SSREETUSLTEMP TO LOAD A1 MORE THAN ONCE *‘***
ELEgéF2é(CONTRL(A1,1) .GT. 2) .OR. (CONTRL1A1.2) .GT. 2)) THEN
t‘tﬁt ATTEMP TO LOAD A1 MORE THAN ONCE ‘*'*'
RETURN
ELSEIF (CONTRL(A1.1) .E0. 2) THEN

97

A4. Symbolic Simulation Program Listing

SUBROUTINE RHASE1 74/175 OPT=1.ROUNO= A/ S/ M/-D.-DS FTN 5.1+537 02/27/85
DYNREG(3.1&"O'
ELSEIF (CONT L(A1 1 E?. 1) THEN
DYNREGé3.1&=DELAY 2.2
ELSEIF ( ONT L(A1.2 E0. 1) THEN
DYNREG(3.1g=ADDER
ELSEIF (CONT L(A1.2) .E0. 22 THEN
IF (BUSA(:1) .E0. '0') TH N
ERR=26
RETURN
ELSE
DYNREG$3.1)=BUSA(2 )
BUSA(: )8'0'
ENDIF
ENDIF
IF (CONTRL(A2.1) .GT. 2) THEN
ERR=27
c ..... ATTEMP TO LOAO A2 LATCH TWICE .....
R TURN
ELSEIF (CONTRL(A2.1) .E0. 1) THEN
DYNREG(4.1&=MPY2
ELSEIF (CONT L(A2.1) .EO. 2) THEN
IF (BUSB( 1) .E0. '0') THEN
ERR=23
RETURN
ELSE
DYNREG(4.1)=BUSB(2 )
BUSB(:1)=’O'
ENDIF
ENDIF
IF £CONTRL SINCOS.2) .E0. 1) THEN
I (BUSA :1) .E0. ’0’) THEN
ERR=29
RETURN
LSE
OELAv(1 1)=BUSA(2:)
BUSA(:1)='O’
ENDIF
ENDIF
IF ECONTRLEC,1; .E0. 1 CREG=ADDER
c IF CONTRL S 1 .E0 1 SREG=ADDER
8 TASK 4: TRANSFER OATA TO OTHER PHASE-1 LATCHES FROM PHASE-2 LATCHES
DELAY 2.1 =OELAv(1.2)
OELAv 3.1 =CSFLAG
OELAv 4.1 =OELAv(3.2)
SIGN=DELAYS4.2)
MYPIPE(1.1 =MPY1
RETURN
END

98

A4. Symbolic Simulation Program Listing

SUBROUTINE PHASE2 74/175 OPT' 1. ROUNO= A/ S/ M/ TN 5. 1*58 7 02/27/85
OOI-LONG/ OT. ARG3- -COMMON/- FIXED. CS'l USER/ FIXED. OBI-TB/D SB/- SL/ ER/- IO/- PMO/ ST. PL= 50000

FTNS.

000 a 000

000 M

00000

000

SUBROUTIN E PHASE2

IMPLICIT INTEGER (A-zz

COMMON /CONTRL/CONTRL 11.2)

COMMON /CFIELD¢A08TBL SINCOS CONS BUSAR RSUM MPY AL A2. MINUS c. 5
COMMON /MEM/RO EC. 4) THETA(6). UNDEF. CSFLAG. SIGN

COMMON /REG/REG 9) CREG SREG

COMMON /LATCH/DYNR§G§4.§).DEL 34 2) TXPORT, DXPORT. Lx

COMMON /MISC/8U SA .BU 8.MPY1.M PY .ADDER. MYPIPE(1. 2)
CHARACTERo2 ROM THETA UNDEF.CREG SREG.CSFLAG.SIGN

CHARACTER14 TxRORT DXPORT.LX DELAV

CHARACTERv1OO DYNREG REG.Mpv1.Mpv2.ADDER.8USA.BUSB.MYRIRE
CHARACTER‘TOO 0P2

TASK 1: TRANSFER DATA FROM RHASE-1 LATCHES TO PHASE-2 LATCHES
DO 1 1x1 4

DYNREG I 2)= DYNREG(I

DELAY 2)=DELAY(I.1)”
CONTINU
MYPIPE(1.2)=MYPIPE(1.1)

TASK 2: RRECHARGE BUS AND UNDEFINE PHASE-1 CONTROLLED LATCHES
gusg= Dusa= -UNOEF
0DYNREG(I.1)=UNDEF(2:)
CONTINUE
DELAY(1.1)=UNDEF(2:)
TASK 3: SIN/COS SIMULATION

IF (CONTRL‘ADRTBL. 1) .E0. 1) THEN
CSFLA G

Lx= THETA(CONTRL(ADRTBL. 2 A
TXPORT=’T'//THETA CONTRL DRTBL.2 )(2:2)
DXPORT='O’//THETA CONTRL ADRTBL,2
IF (CONTRL(SINCOS.1).EO.1) THEN
TXPORT2323;=""
DXPORT 4:4 :""
ENDIF
ELSE
LX=UNDEF(2:2
CSFLAG=UNDE (2:
TXPORT=OXPORT=UNOEF(2 )
ENDIF

TASK 4: SIMULATE DATA FLOWING THROUGH COMBINATIONAL LOGIC SECTIONS
TASK 4.1: SIMULATE THE 1ST STAGE OF MULTIPLIER PIPE

IF M((DYNREGH 2)(:1). Eo. 1x ) .AND. (DYNREG(2.2)(:2).EO.'OX')) THEN
WDYNREGT

EL SE
géLL ACTLEN(M1 FLAG. DYNREG(1 2))
IF DYNREG(1. 2) 1)E NEo. ' -') SGN=-SGN
IF FLAG . EQ.
IF (SGN .GT. )TT
ELgEv1= -DVNREG 2. 2)(: N2)//DYNREG(1. 2)
MP;1=’-’//DYNREG(2.2)(:2)//DYNREG(1.2)(2:)
IF (SGN .GT. 0)T
EL:EY1=—DYNREG(2 2)(: N2)// (’//DYNREG(1 2)(: M1)//’)'
MPY1=’-’//DYNREG(2.2)(:2)//DYNREG(1.2)(2:)
ENDIF
ENDIF
ENDIF
TASK 4.2: SIMULATE THE 2ND STAGE OF MULTIRLIER PIPE

MPY2=MYPIPE(1.2)

99

A4. Symbolic Simulation Program Listing

SUBROUTINE PHASE2 74/175 OPT-1.ROUND' A/ S/ M/‘D.-DS FTN 5.14587 02/27/85

TASK 4. 3: SIMULATE ADDER FUNCTION
IF$(DYNREGgé2g$1 .1T1) .AND. (DYNREG(4.2)(:2).E0.10x1)) THEN
I (31111) THEN

ADDER-1c1(//DYNREGo

000

3 2)(2n
ELSE
ADDER:1S1//DYNREG(3.2)(2:2)

END F
leé§§%6N(i1! .NE. 111) .OR. (CDNTRL(MINUS.1) .NE. 0)) THEN
‘0
PRINT 4.1 ... ADDER CONTROL ERROR IN SINE SIMULATION ...,
PRINT . 1
ENDIF
ELSE
geht'ACTLEN(L2.FLAG.DYNREG(4.2))
IFSéDYNREG(4. .2)(:1) .E0. 1-1) THEN
IF (FLAG .E0. 2 HEN
OP2= DVNREG(4. )(3: L2- 1)
L2=L2 3
FLAG= 1
ELSE
0P2=DYNREG(4.2)(2:)
L2=L2-1
ENDIF
ELSE
OP2=DYNREG(4 2)
ENDIF
IF (CONTRL MINUS 1 +CONTRL (M INUS.2) .E0. 2) SGN=~SGN
IF DYNRE 3. 2)( .E0. 0 ) THEN
IF (SN HEN
AODER=
ELSEIF (FLAG ED. 0) THEN
ADDER=1-1//OP2
ELSE
ADDER=1-(1//OP2(:L2)//1)1
ENDIF
ELSE
CALL ACTLEN(L1 FLAGI DYNREG(3. 2))N
IF (DYNREG(3 2)(:1) .NE. 1)
IF (SGN .GT. OI HEN
ADDER=DYNREG 3.2) :L1)‘/’ +1//OP2(: L2)
ELSEIF (FLAG .EO. D THE
ELggDER=DvNREG(3.2 :L1)//'-1//OP2(:L2)
ADDER=DvNREG(3.2)(:L1)//1-(1//DP2(:L2)//1)1
ENDIF
EL E
GN=-SGN
IF (FLAG1.EO.O THEN
IF (SGN . GT. 0
ADOER=’-(’//DYNREG(3.2)(2: L1)//'+'//0P2(: L2)// )1
ELSEIF (FLAG ED. 0) THEN
ELgEOER='-(’//DYNREG(3.2)(2:L1)//’-’//0P2(:L2)//' )1
AODER=’-(’//DYNREG(3.2)(2:L1)//' —(1//OP2(:L2)//'))'
ENDIF
LSE
IF (SGN .GT. 0 TH
ADDER= OYNREG 3. 2)N :L1-1)//'+'//0P2(:L2)//’)'
ELSEIF (FLAG . THEN
ELgDDER= —DYNREG(3. 2) :L1-1)//'-'//OP2(:L2)//')'
ADDER=DVNREG(3.2)(:L1-1)//1-(1//OP2(:L2)//'))'
ENDIF
ENDIF
ENDIF
ENDIF
ENDIF
RETURN
END

100

A4. Symbolic Simulation Program Listing

SUBROUTINE ACTLEN 74/175 0PT-1.ROUND= A/ S/ M/‘D 'DS FTN 5.1+587 02427/85
ggﬁéLONGI-OT,ARG=-COMMDN/-FIXED.CS' USER/~FIXED.DB=-Té/-SB/-SL/ ER/-lD/-PMD/-ST.PL= 0000

SUBROUTINE ACTLEN (I.FLAG.STRING)
INTEGER FLAG
gHgRACTER1100 STRING

0.
IF (STRING(I:I) .E0. 1(1) THEN
FLAG=2
ELSE
FLAG=O
ENOIF
1 IF £STRING(I:I) .go. 1 1 GOTO 2
I (FLAG .E0. 2 GOTO
IFd STRING(I:I .E0. 1 1) THEN
ELSESF'(STRING(I:I) .E0. 1)1) THEN
ELSEIF ((STRING(I I) .EO. 1+1) .OR. SSTRING(I:I) .E0. 1-1)) THEN
IF ((FLAG .E0. 0) .AND. (J .E0. 0) FLAG=1
ENOIF
T=I+1
IF (1 .LT. 100) GOTO 1
2 I=I~1
RETURN
ENO

101

A4. Symbolic Simulation Program Listing

8“88 OUTPUT REGISTER CONTENT DUMP 28888
OUT REG 1 8 C605
OUT REG 2 8 56

OUT REG 3 8 7
OUT REG 4 8 55
OUT REG 5 8 7
OUT REG 6 8 C5
OUT REG 7 8 ?
OUT REG 8 8 7
OUT REG 9 8 DGCS

‘8i‘8 OUTPUT REGISTER CONTENT DUMP 8ttt8
OUT REG 1 8 C4C6C5-5456

OUT REG 2 8 S4C6C5+C4SG
OUT REG 3 8 C655

OUT REG 4 8 C455

OUT REG 5 8 $455

OUT REG 6 8 C5

OUT REG 7 = D6C4S$+A3
OUT REG 8 = 065455

OUT REG 9 8 06C5+D4

011*: OUTPUT REGISTER CONTENT OUMP 1.1:.
OUT REG 1 - c3(CACGc5-SASG)-53CGSS

OUT REG 2 - $4csc5+cass

OUT REG 3 : 53(c4cscs-Sase)+c3cess
OUT REG 4 - c3c455+sacs

OUT REG 5 = $455

OUT REG 6 a 53c455-c3cs

OUT REG 7 = C3(DGC4SS+A3)+SS(DGCS+04)
OUT REG 8 = 065455

OUT REG 9 = S3(OGC4SS+A3)-03(OGCS+OA)

888.8 OUTPUT REGISTER CONTENT DUMP 9.888

OUT REG 1 8 C2(C3(C4C6C5-S456)'S3C655)‘S2(S3(C4C6C515456)+C3C6$5)

OUT REG 2 = S4C6C5+C4SG

OUT REG 3 8 -(S2(C3(C4C6C5-Sd$6)'SSCGSS)+C2(S3(C4C6CS-S4SG)+C3C6$5))

OUT REG 4 8 C2(C3C4SS+S3C5)*S2(S3C4SS‘C3C5)

OUT REG 5 = $455

OUT REG 6 8 '(S2(C3C4SS+S3C5)+C2(S3C4SS-C3CS)T

OUT REG 7 8 C2(C3(DGC4SS+A3)+S3(OGCS+04)+A2)'52(SS(DGC4SS+A3)~C3(D6C5+04))
OUT REG 8 8 065455+D2

OUT REG 9 8 -(S2(C3(OGC4SS+A3)+S3(06C5+O4)+A2)+C2(53(DGC455+A3)-C3(D6C5+04)))

888** OUTPUT REGISTER CONTENT DUMP 88888

OUT REG 1 8 C1(C2(CS(C4C6C5-S4SG)’SGCGSS)'$2(S3(C4C6C5-54$6)+C3C655))‘S1154C6C5+C456)

OUT REG 2 8 S1(C2(C3(C4C6C5“Sd$6)‘53C655)'52(S3(C4C6C5-S456)+C3C655))+C1(S4C6C5*C456)

OUT REG 3 8 -(S2(C3(C4C6C5-Sd$6)'SGCGSS)+C2(53(C4C6C5-5456)+C3C655))

OUT REG 4 8 C1(C2(C3C4SS+S3C5)'S2(S3C4SS-C3C5))‘S15455

OUT REG 5 8 S1(C2(C3C4SS+SBCS)‘S2(S3C4SS-C305))+C1S4SS

OUT REG 6 8 -(S2(C3C4SS+S3C5)*C2(SSC4SS-C3C5))

OUT REG 7 8 C1(C2(C3(DSC4SS+A3)+S3(DGC5+04)+A2)-S2(53(DGC4SS+A3)-C3(DGC5+D4)))'S1(D6SdSS+O2)
OUT REG 8 8 '(S1(C2(C3(DSC4SS+A3)+S3(OGC54D4)+A2)-S2(S3(DGC4SS+A3)-C3(DGC5*D4)))'C1(DGS4SS+D2))
OUT REG 9 8 '(52(C3(DGC4SS+A3)*S3(DGC5+O4)+A2)+C2(S3(DGC4SS+A3)-C3(DGC5+04)))

102

BIBLIOGRAPHY

10.

ll.

12.

BIBLIOGRAPHY

J. Denavit, and R. s. Hartenberg, "A Kinematic Notation for
Lower-Pair Mechanisms Based on Matrices," Journal of Applied
Mechanics, 1955, pp. 215-221.

E. Ribble and K. Olson, "Skeletal Motion Processor for High
Speed Robotics and Graphic Computations," Proceedings of 1984
International Conference on Robotics, IEEE Computer Society
Press, 1984, pp. 230-238.

Introduction to VLSI Systems, C. Mead and L. Conway, Readings,
MA: Addison-Wesley, 1978, Chapter 3-5.

Robot Manipulators: Mathematics, Programming and Control, R. P.
Paul, Cambridge, MA: MIT Press, 1981, Chapter 1,2.

G. C. S. Lee, ”Robot Arm Kinematics" in Tutorial on Robotics,
G. C. S. Lee, R. C. Gonzalez, and K. S. PO, 1888 Computer
Society Press, 1983, pp. 47-65.

VLSI System Design, When and How to Design Very-Large-Scale
Integrated Circuits, S. Muroga, Wiley, 1982, p. 42, Chapter
7.9.

J. A. Darringer, et al., "LSS: A System for Production LOgic
Synthesis,” IBM Journal of Research and Development, Vol. 28,
No. 5, September 1984, pp. 537-545.

W. H. Elder, P. P. Zenewicz and R. R. Alvarodiaz, ”An
Interactive System for VLSI Chip Physical Design,” IBM Journal
of Research and Development, Vol. 28, No. 5, September 1984,
pp. 524-536.

F. J. Hill, 2. Navabi, et al., "Hardware Compilation from an
RTL to a Storage Logic Array Target," IEEE Trans. on
Computer-Aided Design, Vol CAD-3, No. 3, July 1984, pp.
208-217.

S. C. Johnson, "VLSI Circuit Design Reaches the Level of
Architecture Description," Electronics, May 3, 1984, pp.
121-128.

P. Wallick, "On the Horizon: Fast Chip Quickly," IEEE Spectrum,
Vol. 21, No. 3, March 1984, pp. 28-34.

W. M. vanCleemput and H. Ofek, "Design Automation for Digital
Systems," IEEE Computer, Vol. 17, No. 10, October 1984, pp.

103

13.

14.

15.

16.

17.

18.

19.

20.

21.

22.

23.

24.

25.

26.

C. U. Smith and J. A. Dallen, "A Comparison of Design
Strategies for Software and for VLSI,” IEEE 1984 Workshop on
the Engineering of VLSI and Software, IEEE Computer Society,
1984, pp. 160-165.

D. H. Barbe, "VHSIC Systems and Technology," IEEE Computer,
Vol. 14, No. 2, February 1981, pp. 13-22.

F. Guterl, P. Wallich, and M. A. Fischetti, "In Pursuit of the
One-Month Chip," IEEE Spectrum, Vol. 21, No. 9, September 1984,
pp. 28-49 0

H. H. Baller, "VLSI/Software Engineering Design Methodology,"
IEEE 1984 Workshop on the Engineering of VLSI and Software,
IEEE Computer Society Press, 1984, pp. 3-5.

D. D. Gajski and R. H. Kuhn, ”New VLSI Tools," IEEE Computer,
Vol. 16, No. 12, December 1983, pp. 11-14.

C. V. Ramamoorthy, et al., "Software Engineering: Problems and
Perspectives," IEEE Computer, Vol. 17, No. 10, October 1984,
pp. 191-209.

Industrial Robots Computer Interfacing and Control, W. E.
Snyder, Englewood Cliffs, NJ: Prentice-Hall, 1985, pp. 24-28,
119.

L. D. Harmon, "Automated Tactile Sensing," The International
Journal of Robotics Research, vol. 1, no. 2, Summer, 1982.

J. F. Jarvis, "Robotics," IEEE Computer, Vol. 17, No. 10,
October 1984, pp.283-292.

J. J. Uicker, J. Denavit, and R. S. Hartenberg, "An iterative
Method for the Displacement Analysis of Spatial Mechanism,"
Transactions of ASME, Journal of Applied Mechanics, vol. 31,
Series E, 1964, pp. 309-314.

ElectronicsWeek, November 5, 1984, p. 88.

ElectronicsWeek, July 12, 1984, p. 46.

M. A. Fischetti, "Solid State" in "Technology '85," IEEE
Spectrum, Vol. 22, No. 1, January 1985, pp. 60-64.

C. F. Ruoff, "Fast Trig Functions for Robot Control,” in

Robotics Age in the Beginning, ed. by C. T. Helmers, Hasbrouck
Heights, NY: Hayden Book, 1983, pp. 73-79.

104

27.

28.

29.

30.

31.

32.

33.

34.

35.

36.

37.

38.

39.

40.

41.

42.

J. E. Valder, "The CORDIC Trigonometric Computing Technique,"
IEEE Trans. on Electronics and Computers, Vol. EC-9, September
1960, pp. 227-231.

Computer Arithmetic, K. Hwang, John Wiley and Son, 1979, pp.
368-373.

S. Muroga, op. cit., pp. 313-316.

M. J. Poster and H. T. Kung, "The Design of Special-Purpose
VLSI Chip," IEEE Computer, Vol. 13, No. 1, January 1980, pp.
26-40.

P. C. Treleaven, ”VLSI Processor Architecture," IEEE Computer,
Vol. 15, No. 6, June 1982, pp.33-45.

Computer Arithmetic and Parallel Processing, K. Hwang and F. A.
Briggs, New York, McGraw Hill, 1984, pp. 788-803.

W. C. Hsu and M. A. Shanblatt, " Evaluation of a Single VLSI
Chip Algorithm for Triangulating Large Band Form Matrices,"
Tech. Report No. MSU-ENGR 82-015, Michigan State University, E.
Lansing, Michigan (August 1982).

The. Design and Description of Computer Architecture, 5.
Dasgupta, Wiley, 1984, Chapter 1.

K. Hwang, op. cit., Chapter 3-6.

C. Mead and L. Conway, op. cit., p. 26.

C. Mead and L. Conway, op. cit., p. 231.

D. J. McGreivy, "Interconnects/Gates in VLSI Technologies," in
VLSI Through the 80's and Beyond, ed. by D. J. McGreivy and K.
A. Picker, IEEE Computer Society Press, 1982.

C. Mead and L. Conway, op. cit., p. 155.

L. I. Maissel and H. Ofek, "Hardware Design and Description
Language in IBM," IBM Journal of Research and Development, Vol.

28, No. 5, September 1984, pp. 557-563.

M. Shahdad, et al., "VHSIC Hardware Description Language,” IEEE
Computer, Vol. 18, No. 2, February 1985, pp. 94-103.

Structured Computer Organization, A. s. Tanenbaum, Englewood
Cliffs, NJ: Prentice-Hall, 1984, pp. 149-156.

105

43'
44.
45.
46.
47.

48.

49.

50.

51.

52.
53.

54.

R. Beresford, ”Advances in Customization Free VLSI System
Designers," Electronics, February 10, 1983, pp. 134-145.

B. R. Bourbon, "ICs Tailored to Applications Gain Ground,"
ElectronicsWeek, September 3, 1984, pp. 117-121.

D. L. Wollesen, "CMOS LSI - The Computer Component Process of
the 80's," IEEE Computer, Vol. 13, No. 2, February 1980, pp.
59-67.

S. Muroga, Op. cit. pp. 163-164.

D. J. McGreivy, "VLSI Chip Trend - Size, Complexity, Cost,” in
VLSI Through the 80's and Beyond, ed. by D. J. McGreivy and K.
A. Picker, IEEE Computer Society Press, 1982.

B. C. Cole, "Memories Dominate ISSCC," ElectronicsWeek,
February 11, 1985, pp. 51-60.

R. L. Donze and G. Sporzynski, ”Masterimage Approach to VLSI
Design," IEEE Computer, Vol. 16, No. 12, December 1983, pp.
18-25.

S. 20110, "CMOS Array Break 1N5," ElectronicsWeek, January 21,
1985, pp. 55-56.

K. Kokkonem and R. Pashley, "Modular Approach to CMOS
Technology Tailors Process to Application,” Electronics, May 3,
1984, pp. 129-133.

L. Waller, "1C Houses are Fruitful in Multipliers,"
Electronics, July 14, 1983, pp. 155-157.

9. T. Greiling, "High Speed Digital 1c performance Outlook,"
Seminar, Michigan State University, January 31, 1985.

J. E. Dykes, "Bridging the Gap," IEEE Proceedings of the 1984
Custom Integrated Circuits Conference, Rochester, NY, 1984,

pp. 5-8.

106

"IVIIIEITTIIIIIIIT