
This is to certify that the
thesis entitled

LOW ENERGY HARDWARE FOR SENSOR SIGNAL
CALIBRATION AND COMPENSATION

presented by
PRASANNA BALASUNDARAM

has been accepted towards fulfillment
of the requirements for the

Master of Science degree in Electrical and Computer
Engineering

Major Professor's Signature
January 15, 2004

Date

MSU is an Affirmative Action/Equal Opportunity Institution


LOW ENERGY HARDWARE FOR SENSOR SIGNAL CALIBRATION AND
COMPENSATION

By

Prasanna Balasundaram

A THESIS

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

MASTER OF SCIENCE
ELECTRICAL AND COMPUTER ENGINEERING
2004

 


ABSTRACT

LOW ENERGY HARDWARE FOR SENSOR SIGNAL CALIBRATION AND
COMPENSATION

By

Prasanna Balasundaram

When semiconductor sensors transfer a signal from one domain to another, the reported
output is inaccurate because of the inherent physical properties of sensor materials and
imperfections in sensor manufacturing. Non-linearity, offset, and cross-sensitivity are
typical phenomena observed in sensor outputs, and calibration and compensation are
required to obtain meaningful information from the sensors. This thesis makes use of
advances in the semiconductor industry to develop a correction engine, in the form of an
Application Specific Integrated Circuit (ASIC), that efficiently calibrates and compensates
sensor data. The correction engine uses floating-point hardware to perform the error
correction prescribed by the IEEE 1451.2 standard. The configurable correction engine can
perform the error correction operations either with high accuracy or with ultra-low energy
expenditure, to suit the energy demands of a battery-powered sensor microsystem. An
energy-efficient cell library and compact multipliers and adders reduce the power consumed
in the computations. A novel value prediction scheme and an efficient rounding mode are
employed so that energy is spent more effectively. The hardware correction engine
facilitates the development of key microsystems for medical, commercial, and industrial
applications using simple, low-cost sensors that otherwise would not provide reliable data.

To Michigan State University


ACKNOWLEDGEMENTS

I would like to start by thanking Dr. Andrew Mason for providing continuous moral
support during the thesis work. I am grateful for the freedom that he gave me throughout
the thesis work. Without his encouraging words and guidance, I would not have finished
this work. I would like to thank Dr. Michael Shanblatt, Dr. Nihar Mahapatra, and Dr. Peixin
Zhong for serving on the thesis committee despite their busy schedules. I would like to
thank the department chairperson, Dr. Satish Udpa, and the graduate coordinator, Dr.
Donnie Reinhard, for their continuing support during the Master's degree program. I would
like to thank Mr. Fredrick Hall and the other Unix administrators for answering my requests
immediately, even on holidays. I would like to thank Mr. Peter Sernig for clarifying my
doubts and providing tool kit support. His involvement in getting the technical documents
from Cadence SourceLink was very useful in the library development process. Jichun
Zhang, Junwei Zhou, and Kartik Vaidyanathan of the Advanced Micro Systems and Circuits
Lab provided good encouragement during the last couple of years. I would like to thank
Matthew Guthaus, Eric Marsman, and Robert Senger of the University of Michigan, who
gave technical support when I was developing the library. I would like to thank the
comp.cad.cadence Usenet group, especially Andrew Beckett of Cadence Design Systems, for
providing valuable suggestions when I faced problems with CAD tools. I would like to thank
my friends Chandan Reddy, Shankarshna Madhavan, Arvind Ravisekar, Loganathan
Anjaneyulu, Badrinarayanan Kasturi, Mahesh Arumugam, Narasimhan Swaminathan, Sunder
Balakrishnan, Srinivasan Rakhunathan, and many more for encouraging me to do challenging
things in life. I have learned a lot from them. Finally, I would like to thank my Mom;
without her blessings and kind heart, I would not have written this.

TABLE OF CONTENTS

Abstract
List of Tables
List of Figures

Chapters:

1. Introduction
   1.1 Smart sensors and error correction
   1.2 Motivation
   1.3 Goals
   1.4 Organization

2. Standard Cell Library Design
   2.1 Library Design Flow
       2.1.1 LEF Generation
       2.1.2 TLF Generation
   2.2 Results
       2.2.1 Combinational Cells
       2.2.2 Flip-flops
       2.2.3 System Design

3. Floating Point Unit
   3.1 Integer Operations
       3.1.1 Adders
       3.1.2 Multipliers
   3.2 Floating Point Operations
       3.2.1 Multiplication
       3.2.2 Addition
   3.3 Implementation

4. Calibration and Compensation Engine
   4.1 Correction Engine Architecture
   4.2 Correction Engine Operation
       4.2.1 Clocking
       4.2.2 Reconfigurability
       4.2.3 Perturbation Analysis
       4.2.4 Energy Efficiency
   4.3 Hardware Sorter
       4.3.1 Sorting Algorithm
       4.3.2 Sorter Architecture
       4.3.3 Sorter Operation

5. Conclusion and Future Research
   5.1 Conclusion
   5.2 Future Research

Appendices:

A. Design Flow With AMSAC Library
   A.1 Using the TLF file
   A.2 Sample TLF file
   A.3 Using the LEF file
   A.4 Sample LEF file

Bibliography

LIST OF TABLES

2.1 Logical and physical properties of cells developed for the correction engine design.

3.1 Floating point values for various exponent and significand combinations.

3.2 Design results of the modules in the correction engine design. Shown are the number
    of gates in the module, its area, the number of gates in the critical path (CP), the
    delay in the critical path, and the power consumption when operating at 40 MHz.

LIST OF FIGURES

1.1 Output of the sensor before (a) and after calibration (b), error surface before (c)
    and after calibration (d) [6].

1.2 Components of an Integrated Sensor Module.

1.3 Architecture of the central digital controller.

1.4 Segmenting the operational region of sensors [2].

2.1 Parameters associated with a cell path [3].

2.2 Delay of the flip-flop for various setup times.

2.3 Determining setup time by iterative simulations.

2.4 Energy expenditure for flip-flop operation. Non-data-storing edge energy consumption,
    usually ignored in the literature, is included for analysis.

2.5 Layout of UMSI: full-custom design (a), semi-custom design (b). A drastic reduction of
    design time is experienced in (b) when compared to (a).

3.1 IEEE (a) and custom (b) representation of single precision floating point quantities.

3.2 Four rounding modes used in the correction engine design.

4.1 Architecture of the Correction Engine with microprocessor core and memory.

4.2 Memory word organization. The powers of the input signals are stored in the MSB,
    while the correction coefficient is stored in the 3 LSBs.

4.3 Tentative exponent generation with input exponents and correction coefficient.

4.4 Architecture of the sorter.

4.5 Functional simulation of the sorter.

 

 


CHAPTER 1

Introduction

Semiconductor sensors play an important role in measuring physical quantities for
control applications. They take part in our everyday activities and make life easier. The
complexity of a sensor application may vary from simple room-temperature control to
advanced motion control in an airplane. Advances in the Micro Electro Mechanical Systems
(MEMS) industry have led to an age in which sensors are downsized to the order of
micrometers. Modern sensors acquire information such as temperature, pressure, or humidity
and transform it into another domain, where the information is processed and a suitable
control action is taken to keep the physical quantity under control. The transformed
domain is usually the electrical domain, since information processing and communication are
easier and more flexible using electronic circuits. When a sensor transfers the signal from
one domain to the other, the reported output is inaccurate because of the inherent
properties of the sensor and problems in sensor manufacturing. This error must be corrected
so that the quantity of interest is interpreted correctly. This thesis focuses on developing
an Application Specific Integrated Circuit (ASIC) capable of performing the error correction
efficiently, making use of advances in the semiconductor industry.

1.1 Smart sensors and error correction

The era of semiconductor sensors began when the piezoresistivity of silicon was
discovered. Consequently, devices such as photodiodes and Hall devices were developed. The
main purpose of these devices is to generate an electric signal proportional to the physical
quantity of interest. A practical sensor usually doesn't yield an ideal signal transfer curve,
but includes effects such as nonlinearity, offset, and non-unity gain. Commonly, the sensor
signal is not only proportional to the physical quantity of interest, but also sensitive to
other parameters such as temperature. Removing the effects of nonlinearity, offset, and gain
is known as calibration, while removing the effect of other physical quantities such as
temperature is known as compensation. These undesired effects should be removed from the
sensor signal so that the signal produced by the sensor can be interpreted correctly.

Traditionally, these errors were corrected by laser trimming discrete components such
as resistors and capacitors in the signal conditioning circuits. This type of error correction
required individual attention to each sensor; high material and labor costs increased the
price difference between a calibrated and an uncalibrated sensor. To reduce the cost of
error correction, more signal conditioning circuits were integrated with the sensors as
VLSI technology advanced. Mixed-signal designs enabled the amplifiers and other passive
components to become digitally programmable, so that the parameters associated with the
components can be modified as desired. The sensors became capable of communicating
the sensor signal to a digital computer, and sensor systems became smart once they started
processing the information. Improvements in semiconductor fabrication and packaging
reduce the cost of the sensors; at the same time, the intelligence of the sensors keeps
increasing.

The integrated sensor module can be simplified as in Figure 1.2, where the major blocks
include the analog interface, signal conditioning, amplifier, A/D converter, and a bus
interface to communicate the data to a digital controller. The error-correction procedure can
be performed at any stage after the sensor output is read by the signal conditioning circuit.
The transfer curve of the amplifier can be controlled by modifying the passive components
in the circuit to overcome the effects of nonlinearities. This involves using trimmable or

 

Figure 1.1: Output of the sensor before (a) and after calibration (b), error surface before (c)
and after calibration (d) [6]. Each panel is plotted against normalized pressure and
normalized temperature.

programmable resistors and capacitors. Their ability to control the calibration and
compensation process is limited, and each sensor needs individual attention, which
increases the production cost of the sensor. The calibration and compensation can also be
done using programmable amplifiers [13] and Analog to Digital Converters (ADC) [18].

These circuits become very complex and less flexible. A single-chip ASIC with signal
conditioning and error correction [11], and error correction performed in the sensor
module itself [12], offer good solutions for an independent sensor, but these
implementations suffer if many sensors are connected in the form of a small network.
Redundant sensors become unavoidable because the compensation of one signal in the
network may require data from another sensor. This forces the error correction to be done
in a centralized place where all the data are processed.

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 1.2: Components of an Integrated Sensor Module.

Error correction performed in the digital domain by a central digital controller offers a
flexible and accurate solution. The simplest method of error correction is to build a lookup
table in the memory of the controller. Although the lookup table method is very fast, it
requires a large amount of memory in the controller and offers only limited resolution. The
other method of error correction is to fit the sensor transfer curve as a multinomial and
evaluate the multinomial when the sensor signal arrives. Sophisticated techniques such as
spline functions [7] exist to determine the multinomial coefficients using minimal
measurements. Usually these coefficients are expressed as floating-point numbers to
increase the range of representable values. Once the multinomial coefficients are
determined, they are stored in the non-volatile memory of the sensor. When the sensor is
connected to the network, the non-volatile memory is read to perform the error correction.
Recalibration of the sensor can be done by simply rewriting the modified coefficients.
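
As a minimal sketch of the multinomial method for a single input, the following Python fragment evaluates a correction polynomial with Horner's rule. The coefficient values and the names `evaluate_correction` and `stored_coefficients` are hypothetical and only stand in for values that would be read from the sensor's non-volatile memory; they are not taken from the thesis design.

```python
def evaluate_correction(coefficients, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... with Horner's rule.

    coefficients[k] is the coefficient of x**k.
    """
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c
    return result


# Hypothetical coefficients standing in for the contents of the
# sensor's non-volatile memory.
stored_coefficients = [0.5, 2.0, -0.25]

raw_reading = 2.0
corrected = evaluate_correction(stored_coefficients, raw_reading)

# Recalibration only rewrites the stored coefficients; the evaluation
# routine itself is unchanged.
stored_coefficients = [0.0, 1.0, 0.0]
recalibrated = evaluate_correction(stored_coefficients, raw_reading)
```

With this form, a degree-d correction costs d multiplications and d additions per sample, which is the workload a floating-point unit would have to sustain.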

The widespread use of single-chip microcontrollers offers an attractive means of
performing sensor signal calibration and compensation, as they provide cost-effective
computation. Since many microcontroller cores don't support the floating-point operations
needed to perform the error correction, specific subroutines [5] are written to do the
floating-point operations. The disadvantage of the subroutine method is that the error
correction process for a single data point takes a large amount of time, limiting the
sampling rate of the sensor to a very low value. For example, running on a 2-4 MHz clock, a
typical error-correction routine takes about 4-13 ms [4]. This occupies the controller for so
long that it cannot perform other control tasks.
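
To make the sampling-rate limit concrete, the bound below is a simple worked estimate. It assumes, as a simplification not stated in the source, that the controller does nothing but error correction and that one correction of duration $t_{corr}$ must finish before the next sample is taken.

```latex
f_{s,\max} = \frac{1}{t_{corr}}, \qquad
\frac{1}{4\ \text{ms}} = 250\ \text{Hz}, \qquad
\frac{1}{13\ \text{ms}} \approx 77\ \text{Hz}
```

Under this assumption the quoted 4-13 ms routine caps the sampling rate at roughly 77 to 250 samples per second, before any other control task is scheduled.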

The accuracy of the error correction depends on how well the sensor signal transfer
curve is represented by the multinomial. Experiments show that the higher the order of the
multinomial, the lower the error bounds. But high accuracy comes at the cost of increased
processing time in the microcontroller. An attractive way to reduce both the order of the
multinomial and the processing time is to segment the operating range of the sensor into
regions and approximate the transfer curve in each region by a lower-order multinomial. In
most cases, this doesn't considerably increase the error. Using subroutines to perform the
error correction in the microcontroller also consumes more energy, due to the overhead
incurred when the controller handles interrupts. Hence, by developing dedicated
hardware that is capable of performing floating-point operations, the error correction can
be performed with lower energy consumption, leading to longer battery life.
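
The effect of segmentation can be illustrated with a small Python sketch. It is an illustration only: it approximates a hypothetical transfer curve (a pure square law) by straight chords through the segment end points rather than by fitted multinomials, and the function names are assumptions introduced here.

```python
from bisect import bisect_right


def chord(f, a, b):
    """Return (offset, slope) of the line through (a, f(a)) and (b, f(b))."""
    slope = (f(b) - f(a)) / (b - a)
    return f(a) - slope * a, slope


def build_segments(f, boundaries):
    """One first-order approximation per pair of adjacent boundaries."""
    return [chord(f, a, b) for a, b in zip(boundaries, boundaries[1:])]


def evaluate(boundaries, segments, x):
    """Select the segment that contains x and evaluate its line."""
    index = bisect_right(boundaries, x) - 1
    index = max(0, min(index, len(segments) - 1))  # clamp to a valid segment
    offset, slope = segments[index]
    return offset + slope * x


def max_error(f, boundaries, samples=1001):
    """Largest sampled deviation between f and its segmented approximation."""
    segments = build_segments(f, boundaries)
    lo, hi = boundaries[0], boundaries[-1]
    xs = [lo + (hi - lo) * k / (samples - 1) for k in range(samples)]
    return max(abs(f(x) - evaluate(boundaries, segments, x)) for x in xs)


def transfer(x):
    """Hypothetical second-order transfer curve."""
    return x * x


error_one_segment = max_error(transfer, [0.0, 1.0])
error_two_segments = max_error(transfer, [0.0, 0.5, 1.0])
```

For this curve, splitting the range into two segments reduces the worst-case error of the first-order approximation by a factor of four, which is the trade that allows a lower order per region.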

Since there are many ways to build sensor modules and to communicate with central
digital controllers, many consumer products have appeared in the market. To regulate the
products and to improve the portability of devices, IEEE 1451.2 [2] standardizes the
communication between the smart transducer and the microcontroller, as well as the
Transducer Electronic Data Sheet (TEDS). It defines how the data inside the sensor module
are organized for internal programmable control. It also recommends digital error correction
of the sensor signal using the piecewise multinomial approach. The standard also encourages
that the error correction be performed using floating-point operations.

The sensor signal transfer curve is approximated by a multinomial in an (n+1)-dimensional
space, where n is the number of parameters the sensor signal data depends upon. For
example, if a pressure sensor signal depends on the pressure channel data and the
temperature channel data, then n is equal to 2 and the space formed is a 3-dimensional
space. The pressure and temperature data form the X and Y axes, while the corrected
pressure data forms the Z axis. The TEDS allows the independent axes to be divided into as
many segments as needed to reduce the degree of the multinomial. For example, if the
pressure and temperature channel ranges are divided into 2 and 3 segments, then the
entire space is divided into 6 regions of operation. In each region of operation, the
transfer curve can be a multinomial of whatever degree is required. If we assume that the

 

Figure 1.3: Architecture of the central digital controller.

actual sensor data depends on the square of the pressure channel data and linearly on the
temperature channel data in a particular region of operation, then the signal transfer curve
will be an expression involving 6 coefficients, with all possible combinations of pressure
and temperature channel data raised to the powers of 0 through 2 and 0 through 1,
respectively. The general expression for the correction in a particular region of operation
is given as

$$Y = \sum_{i=0}^{D_1} \sum_{j=0}^{D_2} \cdots \sum_{p=0}^{D_n} C_{i,j,\ldots,p}\,[X_1 - H_1]^{i}\,[X_2 - H_2]^{j} \cdots [X_n - H_n]^{p}$$

where $X_k$ is the $k$-th input, $H_k$ is its offset, $D_k$ is the degree of the multinomial in $X_k$, and $C_{i,j,\ldots,p}$ are the correction coefficients.

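
The expression above can be sketched in Python for a single region of operation. The dictionary layout, the function name `correct`, and all numeric values are assumptions made for illustration; they are not the TEDS memory format or data from the thesis. The example uses the pressure and temperature case from the text, with degree 2 in the first input and degree 1 in the second, which gives 6 coefficients.

```python
from itertools import product


def correct(inputs, offsets, coefficients):
    """Sum C[e] * prod((X_k - H_k) ** e_k) over all exponent tuples e.

    coefficients maps an exponent tuple (i, j, ..., p) to C_{i,j,...,p}.
    """
    total = 0.0
    for exponents, c in coefficients.items():
        term = c
        for x, h, e in zip(inputs, offsets, exponents):
            term *= (x - h) ** e
        total += term
    return total


# Hypothetical coefficients for one region: pressure up to the power 2,
# temperature up to the power 1.
coefficients = {
    (0, 0): 1.0, (1, 0): 2.0, (2, 0): 0.5,
    (0, 1): -1.0, (1, 1): 0.25, (2, 1): 0.0,
}

pressure, temperature = 3.0, 2.0   # hypothetical channel readings X1, X2
offsets = (1.0, 1.0)               # hypothetical offsets H1, H2

corrected = correct((pressure, temperature), offsets, coefficients)
```

In a segmented TEDS, a region would first be selected from the input values, and the coefficient set and offsets of that region would then be passed to the same evaluation.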
1.2 Motivation

The error correction process in smart sensors accounts for up to 50% of the cost of the
sensors and consumes more than 30% of the energy spent by the microcontroller. Sometimes
the errors due to nonlinearities and cross-sensitivities can change the actual output by
anywhere from 50% to 75% [6] of the true value. Hence, automating the calibration and
compensation process of the sensor and performing sensor signal error correction with the
least amount of resources will lower the cost of sensors and allow them to be used in many more

Figure 1.4: Segmenting the operational region of sensors [2]. The X1 and X2 axes are
divided into segments, which partition the plane into cells.

applications in the future. The piecewise multinomial approach allows the sensor signal to
be corrected irrespective of the nonlinearities and cross dependencies. To maintain the
accuracy of the error correction process, it is desirable to do the correction in
floating-point arithmetic. Performing the correction in an integer-based microcontroller
involves more energy expenditure, due to the overheads in the subroutines and the data
communication process. Energy savings can be obtained by using dedicated hardware to
perform the floating-point operations for sensors that have cross dependencies with fewer
than two other physical quantities. An optimal balance is to design the hardware to perform
the correction up to the 3rd order and for up to 4 input signals, and to let software
perform the correction if a higher order is needed. We can also obtain considerable energy
savings if previously computed values are stored in on-chip memory to avoid repeated
calculations. If the contribution of a particular coefficient is very small, energy savings
can be obtained at the cost of a slight loss of accuracy. Also, for small perturbations in
the input signal, corrections can be obtained with fewer computations by applying Taylor's
expansion of the multinomial at the operating point.
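
The perturbation idea can be sketched for a single-input polynomial as follows. This is a first-order illustration under assumed values, not the scheme implemented in the correction engine; the class name `PerturbationCorrector` and the coefficients are hypothetical. The value and slope at the operating point are computed once and stored, so each later correction near that point costs one multiplication and one addition.

```python
def horner(coefficients, x):
    """Evaluate c0 + c1*x + c2*x**2 + ... with Horner's rule."""
    result = 0.0
    for c in reversed(coefficients):
        result = result * x + c
    return result


def derivative(coefficients):
    """Coefficients of the derivative polynomial."""
    return [k * c for k, c in enumerate(coefficients)][1:]


class PerturbationCorrector:
    """First-order Taylor approximation around a stored operating point."""

    def __init__(self, coefficients, operating_point):
        self.x0 = operating_point
        self.y0 = horner(coefficients, operating_point)                # stored value
        self.slope = horner(derivative(coefficients), operating_point)  # stored slope

    def correct(self, x):
        # One multiply and one add instead of a full evaluation.
        return self.y0 + self.slope * (x - self.x0)


coefficients = [1.0, 2.0, 3.0]      # hypothetical: 1 + 2x + 3x^2
corrector = PerturbationCorrector(coefficients, operating_point=2.0)

approximate = corrector.correct(2.01)
exact = horner(coefficients, 2.01)
```

The approximation error grows with the square of the perturbation, so a practical design would fall back to the full evaluation once the input moves too far from the stored operating point.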

1.3 Goals

The main goal is to design a reconfigurable correction engine to perform calibration and
compensation of sensor signals using floating-point hardware optimized for low energy. At
the lowest level, energy savings are considered while designing library cells such as logic
gates and flip-flops [8, 15]. At an intermediate level, modules such as adders, multipliers
[16], and rounding units are optimized for low energy dissipation. At the architectural
level, the required number of computations is optimized using a data analyzer. At the system
level, the previously computed values are analyzed for opportunities to reduce the number
of operations. Optimizing the hardware for energy savings at every level, from device to
system, yields low-energy error correction hardware.
1.4 Organization

The rest of the thesis is organized as follows: Chapter 2 discusses the design flow and
the cell library generation process, Chapter 3 describes design issues in the integer adder
and multiplier units and the floating-point rounding units, Chapter 4 highlights the
top-level design of the correction engine and the hardware sorter, and Chapter 5 summarizes
the results achieved during the project and gives directions for future research.

CHAPTER 2

Standard Cell Library Design

As the feature size of integrated circuits (ICs) grows smaller, designers can pack more
gates into a given chip area and achieve more functionality. At the same time, more burden
is placed on the designer to ensure the reliability of the chip. Understanding the design
flow and the tools used in the design process helps the circuit designer make wise choices
that maximize productivity. ASIC technology has proven to be cost effective for
mid-to-high-volume applications. This chapter focuses on the design flow and the library
used in the correction engine design.

Bottom-up and top-down are the two widely used design flows for integrated circuit
design. In the bottom-up design flow, the system is divided into sub-blocks, and each module
is designed to meet timing constraints by varying the transistor sizes. Though this
method gives the maximum efficiency in terms of area and power, it is not very efficient
and is sometimes impractical for designs involving thousands of transistors. In the top-down
design flow, the functionality of a block is given priority, and it is expressed in a
hardware description language such as Verilog. Once the design meets the functionality
requirements, the synthesis tool develops a generic implementation and maps it to the
target technology, meeting the timing and area constraints. The physical design is
performed by the Auto Place and Route (APR) tool after the synthesis process.

To design an integrated circuit in a specific process technology using the top-down
method, a cell library is needed. The cell library contains information about the timing
properties (to perform logic synthesis) and physical properties (to perform physical design)
of the cells in the library. An in-house standard cell library was developed for designing
the correction engine in the AMI C5N process for this project. The library contains a
complete set of combinational cells, tristates, latches, and flip-flops, so that any digital
design can be implemented using the top-down method. This cell library helps the designer
produce integrated circuits with high reliability in a short period of time.
2.1 Library Design Flow

In this correction engine design, Build Gates and Silicon Ensemble were used for
performing logic synthesis and physical design, respectively, since they are part of the
Cadence NCSU Design Kit used at Michigan State University. Silicon Ensemble requires
the physical information to be represented in the Library Exchange Format (LEF), and Build
Gates requires the timing information to be represented in the Timing Library Format (TLF).
This section explains the LEF and TLF generation processes.

2.1.1 LEF Generation

The LEF file contains information about the metals, vias, and poly layers for each cell
in the library. The LEF file omits information such as wells and active layers, which is
not relevant for cell-based place and route. The Cadence Abstract Generator is used for
generating the LEF file from the layout. A Design Planner Universal eXchange (DPUX) file
containing the technology information is created for the Abstract Generator. The layouts
of the cells are given as input to the Abstract Generator to generate the LEF file, and the
generated LEF file is verified by a sample place and route run.

DPUX Generation

The recommended method for generating the DPUX file is to input an existing LEF file
with technology information and ask the Abstract Generator to create this file. In the
absence of a technology LEF file, however, the information is entered manually by reading
the technology file dumped by the Design Framework II (dfII) environment.

Layout Generation

The layout for the cells in the library is created in the Virtuoso Layout Editor and must
pass the Design Rule Check (DRC) and Layout Versus Schematic (LVS) tests. Before designing
the layout, the horizontal and vertical pitches of the cells are decided (the pitch is the
distance between the centers of two adjacent metal strips; the grid spacing is set to the
corresponding pitch). The performance of the routing engine is enhanced if the ratio of the
horizontal pitch to the vertical pitch is kept simple. For the AMI C5N process library, a
horizontal pitch of 8λ (2.4 µm) and a vertical pitch of 10λ (3 µm) are chosen. The following
design considerations are kept in mind when designing the layout.

• The height and the width of all cells are kept in multiples of the vertical and
  horizontal pitch, respectively.

• The offset (the distance by which the boundary of the cell extends beyond the grid)
  for the cells is kept at half of the corresponding pitch.

• Metal 1 and 3 layers are drawn horizontally, and Metal 2 is drawn vertically.

• The input and output pins are kept at the intersections of the horizontal and vertical
  grids to increase the efficiency of the router.

• Text/pin labels are bound inside the shape pin to avoid exclusion of pins in the
  abstract generation step.

• The width of the power rails is kept at one vertical pitch (3 µm).
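
The dimensional rules in the list above can be expressed as a small consistency check. The sketch below is an illustration only, working in units of λ (0.3 µm, inferred from 8λ = 2.4 µm). The function name, the cell sizes, and the pin coordinates are hypothetical, and the reading that routing grid lines sit half a pitch inside the cell boundary is an assumption drawn from the offset rule.

```python
H_PITCH = 8    # horizontal pitch in lambda (2.4 um at lambda = 0.3 um)
V_PITCH = 10   # vertical pitch in lambda (3 um)


def check_cell(width, height, pins):
    """Return a list of rule violations for one cell.

    width, height: cell dimensions in lambda.
    pins: (x, y) pin centers in lambda, measured from the lower-left
          corner of the cell boundary.
    """
    violations = []
    if width % H_PITCH != 0:
        violations.append(f"width {width} is not a multiple of {H_PITCH}")
    if height % V_PITCH != 0:
        violations.append(f"height {height} is not a multiple of {V_PITCH}")
    for x, y in pins:
        # Assumption: grid lines are offset by half a pitch from the boundary.
        on_x_grid = (x - H_PITCH // 2) % H_PITCH == 0
        on_y_grid = (y - V_PITCH // 2) % V_PITCH == 0
        if not (on_x_grid and on_y_grid):
            violations.append(f"pin at ({x}, {y}) is off the routing grid")
    return violations


good_cell = check_cell(24, 60, [(4, 5), (12, 25)])
bad_cell = check_cell(25, 60, [(10, 25)])
```

A check of this kind could be run on the cell dimensions before the abstract generation step, where off-grid pins are harder to diagnose.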


Abstract Generation
The layouts of the cells are given as input to the Abstract Generator to create the LEF
file. The following steps are involved in this process.

• Input Technology: The DPUX file generated before is given as the input to the
  Abstract Generator in this step.

• Input Layout: The layout is exported from the dfII environment in GDSII (stream)
  format and imported into the Abstract Generator using a layer map table.

• Import Logical: The input and output pin information about the cells is given to the
  Abstract Generator in a Verilog file.

• Pins Step: The text labels are mapped to the terminal labels in the layout, and the
  place and route boundary is created.

• Extract Step: The Abstract Generator probes through the layout and finds the
  connectivity among the nodes using various layers. Antenna information (capacitance,
  inductance and resistance information) about the cells is also created in this step.

• Verify Step: The LEF view for the cell is created, and a target place and route run is
  performed to make sure that the cell can be used by the place and route engine.

• Export LEF: The LEF file, which can be directly used as input for Silicon Ensemble,
  is exported from the Abstract Generator.

Verifying LEF File

A Verilog netlist using the cells in the library is created and tested with Silicon
Ensemble. The density of the design is studied, and the layout is modified to improve the
output of the APR engine. A row utilization of 40% is achieved with the cell library, which
is quite reasonable for a 3-metal process, though a row utilization of 80-90% is common
with a 6-metal process.

2.1.2 TLF Generation

The Timing Library Format (TLF) file represents the input-output characteristics, such as
delay and functionality, of each cell in the library. The TLF classifies the cells as
combinational cells (the output of the cell depends only on the current inputs of the cell,
not on the previous outputs), tristates (in addition to logic high and low as in
combinational cells, the output may be floating for this type of cell), latches
(level-sensitive devices that store data when write enabled), flip-flops (edge-sensitive
storage devices), etc. Each type of cell has some unique characteristics that help the
synthesis tool identify a particular class of cells and perform the logic synthesis using
them. If an input signal applied to a cell changes its output, the cell has a path (or an
arc) from that input to the output. The TLF file summarizes all of the possible paths. The
quantities involved in the TLF are defined more formally below, with reference to
Figure 2.1.

 

 

Figure 2.1: Parameters associated with a cell path [3].

Slew: Time for an input/output signal to rise from V_TL to V_TH (V_TH to V_TL for a
fall transition). Usually V_TL and V_TH are set to 10% and 90% of Vdd, and the
input/output signal is approximated by a ramp when measuring this quantity.

Delay: Time difference between the input and output crossing the mid-point of the
transition (V_T).

Setup: The time for which the input signal has to be stable before the clock transition
to ensure proper storage of data in the flip-flop/latch.

Hold: The time for which the input signal has to be stable after the clock transition
to ensure proper storage of data in the flip-flop/latch.

Recovery: The time after which the asynchronous signal (set/reset) has to be applied
to override the data stored by the clock signal.

Removal: The time before which the asynchronous signal (set/reset) has to be applied
to override the data stored by the clock signal.
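
The slew and delay definitions above can be sketched as measurements on sampled waveforms. The code below is a hedged illustration, not the characterization scripts used for the library: the function names, the 5 V supply, and the ideal-ramp waveforms are assumptions, and the thresholds follow the 10%, 50%, and 90% levels given in the definitions.

```python
def crossing_time(times, values, level):
    """First time the waveform crosses `level`, by linear interpolation."""
    points = list(zip(times, values))
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        if v0 == level:
            return t0
        if (v0 - level) * (v1 - level) < 0:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    if values[-1] == level:
        return times[-1]
    raise ValueError("waveform never crosses the requested level")


def slew(times, values, vdd):
    """Time between the 10% and 90% crossings (rise or fall)."""
    t_low = crossing_time(times, values, 0.1 * vdd)
    t_high = crossing_time(times, values, 0.9 * vdd)
    return abs(t_high - t_low)


def delay(in_times, in_values, out_times, out_values, vdd):
    """Time between input and output crossing the 50% point."""
    t_in = crossing_time(in_times, in_values, 0.5 * vdd)
    t_out = crossing_time(out_times, out_values, 0.5 * vdd)
    return t_out - t_in


VDD = 5.0  # hypothetical supply voltage

# Hypothetical inverter-like waveforms (time in ns): rising input ramp,
# falling output ramp that starts 0.7 ns later.
in_t, in_v = [0.0, 1.0], [0.0, VDD]
out_t, out_v = [0.0, 0.7, 1.7], [VDD, VDD, 0.0]

input_slew = slew(in_t, in_v, VDD)
output_slew = slew(out_t, out_v, VDD)
path_delay = delay(in_t, in_v, out_t, out_v, VDD)
```

In a characterization flow, measurements of this kind would be repeated for each combination of input slew and output load to fill the timing tables.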

Combinational Cells

In a combinational cell, the output state depends on the inputs, and a change in an input can cause its output to change. The output slew and the delay (between input and output) depend on the input slew and the output load. Both the output slew and the delay are represented as two-dimensional timing tables (with the input slew and the output load as independent axes) in the TLF file for all possible paths in the cell. In addition, the functionality of each pin and the area of the cell are represented in the TLF file. The static timing analysis (STA) tool uses the TLF file to determine the delays in a circuit and hence the worst path in the circuit. The STA tool uses linear interpolation if the output slew and delay information is not readily available in the tables.
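A table lookup with linear interpolation along both axes can be sketched as follows. This is only an illustration of the idea, not the algorithm of any particular STA tool; the function name, the table layout (`table[i][j]` at `slews[i]`, `loads[j]`) and the sample numbers are assumptions.

```python
from bisect import bisect_right


def interpolate_table(slews, loads, table, slew, load):
    """Bilinear interpolation in a 2-D timing table indexed by slew and load."""

    def bracket(axis, x):
        # Pick the cell whose lower edge is at or below x, clamped to the
        # outermost cell, so points outside the table are linearly extrapolated.
        i = min(max(bisect_right(axis, x) - 1, 0), len(axis) - 2)
        frac = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, frac

    i, u = bracket(slews, slew)
    j, v = bracket(loads, load)
    low_slew = table[i][j] * (1 - v) + table[i][j + 1] * v
    high_slew = table[i + 1][j] * (1 - v) + table[i + 1][j + 1] * v
    return low_slew * (1 - u) + high_slew * u


# Hypothetical 2x2 delay table (ns) over input slew (ns) and output load (pF).
slew_axis = [0.1, 0.5]
load_axis = [0.01, 0.05]
delay_table = [[1.0, 2.0],
               [3.0, 4.0]]
print(interpolate_table(slew_axis, load_axis, delay_table, 0.3, 0.03))
```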


Parasitic capacitances as small as 0.001 pF are extracted from the layout and a netlist is created. The timing characteristics are obtained by simulating the cell with various input slew and output load conditions. Spectre is used for simulating the transistor-level circuits, and Perl scripts are used for reading the output generated by Spectre and reporting it to a TLF file. The scripts developed for this library can identify any combinational gate with up to 3 inputs and report all the possible paths for the specified input slew and load conditions.

Tristates

Tristate devices are used in synthesizing bus structures in a design. The tristate devices differ from the combinational blocks in that the output of a tristate device can be in the high impedance state (Z) in addition to logic high (1) and logic low (0). The timing information for the output transition from either 1 or 0 to Z is not critical, since a Z output does not drive any other gates. Hence the corresponding timing table entries are filled with zeros. The time for an output to go from Z to 1 can be no greater than the time for the output to go from 0 to 1. Hence the worst case times for 0 to 1 and 1 to 0 are substituted for the changes from Z to 1 and Z to 0 respectively.
Latches and Flip-ﬂops

Latches and flip-flops are storage elements that have additional parameters such as setup and hold times. If they have asynchronous inputs like preset or clear, the recovery and removal times are included in the TLF file. The definitions of the setup and hold times given in Section 2.1.2 were modified slightly so that these times could be determined by simulation. The delay for the data to get stored in the flip-flop after the rising/falling edge of the clock varies with the time at which the clock signal is applied after the data has settled. If the time difference between the clock edge and the data signal is decreased, the delay for the data to get stored in the flip-flop increases. The setup time is computed as the minimum time before the clock edge for which the data has to be stable so that the storing delay does not increase beyond 5% of the normal operational delay. Hence, if we determine the time difference between the data and the clock such that the delay between the clock and the output is as close as possible to 105% of the normal delay, that time difference becomes the setup time for the transition with the specified data and clock slews. From Figure 2.2, we see that the delay of the gate gradually increases as the data is moved closer to the clock, and the data fails to latch beyond a limit (shown as the discontinuity).

[Plot: relation between setup time and delay. Horizontal axis: setup time (ns). Vertical axis: delay (ns).]

Figure 2.2: Delay of the Flip-ﬂop for various setup times.

The setup time of the flip-flop is determined by performing SPICE simulations iteratively on the following basis. The normal storing delay is found by giving enough separation (say 5 ns) between the input and the clock signal, so that the setup condition is not violated. Another simulation is performed with no separation between the data and the clock signal, hence violating the setup condition. Subsequent simulations are performed by either increasing or decreasing the separation between the data and the clock signal depending on whether a violation has occurred or not. The increment or decrement in the separation is half of the time between a simulation without violation and a simulation with violation. In Figure 2.3, the setup time converges after 7-8 simulations performed in this manner. A similar argument is extended for determining the hold time.
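The iterative procedure above is essentially a bisection search. The sketch below illustrates it under stated assumptions: `clk_to_q_delay` is a hypothetical stand-in for one circuit simulation that returns the storing delay for a given data-to-clock separation (or `None` when the data fails to latch), and `toy_delay` is an invented model used only to exercise the search.

```python
def find_setup_time(clk_to_q_delay, safe_sep=5.0, iterations=10, margin=1.05):
    """Bisection search for the smallest separation that keeps the storing
    delay within `margin` times the normal delay."""
    normal = clk_to_q_delay(safe_sep)   # simulation with ample separation
    limit = margin * normal             # 105% of the normal delay
    good, bad = safe_sep, 0.0           # zero separation is assumed to violate
    for _ in range(iterations):
        sep = 0.5 * (good + bad)        # halve the interval between the two
        d = clk_to_q_delay(sep)
        if d is None or d > limit:
            bad = sep                   # violation: move away from the clock
        else:
            good = sep                  # no violation: move closer to the clock
    return good


def toy_delay(sep):
    """Invented delay model (ns): delay grows as data approaches the clock
    and the data fails to latch below 0.2 ns of separation."""
    if sep < 0.2:
        return None
    return 1.5 + 0.05 / sep


print(find_setup_time(toy_delay))
```

With ten iterations the interval shrinks from 5 ns to about 5 ps, which is consistent in spirit with the convergence after 7-8 simulations reported above.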

[Plot: determining the setup time of a flip-flop. Horizontal axis: iterations. Legend: delay time, max delay.]

Figure 2.3: Determining Setup time by iterative simulations.


2.2 Results

The developed cell library has an inverter, a tristate inverter, a buffer, 2- and 3-input NAND, NOR, AND, and OR gates, a 2-input XOR gate, a 2:1 multiplexer, a latch, a flip-flop with reset and a flip-flop without reset. Five structures were considered for the flip-flops, and the two most efficient ones were chosen for inclusion in the library. The following sections summarize the physical properties of the cells in the library.

2.2.1 Combinational Cells

Table 2.1 summarizes the logic function, width of the cell, number of transistors, input/output capacitance, and power consumption for a run at 40 MHz with all possible input changes.
2.2.2 Flip-ﬂops

Since flip-flops are used extensively in the pipelined microprocessor and data correction unit, the energy demands of five different flip-flop structures (in-house flip-flop with and without reset (dff, dffr), push-pull isolation (ppi) flip-flop [8], transmission gate flip-flop (tgfl), and a regular master-slave flip-flop (rdff)) were thoroughly analyzed (shown in Figure 2.4), and the two most efficient structures were included in the cell library. When a simulation is performed, the Spectre simulator dumps the raw data (the voltage at various nodes, and the current supplied by the sources in the circuit at each instant of the transient analysis) to a file. The file is read by a Perl script [14] to store the data in arrays. The current supplied by the source (ivdd) is multiplied by the value of vdd at each instant of time and integrated over the time period ($T_s$) of the simulation. Thus $\int_0^{T_s} vdd \cdot ivdd \, dt$ and

$$\frac{1}{T_s} \int_0^{T_s} vdd \cdot ivdd \, dt \qquad (2.1)$$

represent the energy and the average power dissipated throughout the simulation. In order to measure the energy dissipation for a particular transition (for example, clk01q01 indicates the clock transition for storing 1 in the flip-flop, overwriting the existing 0), the integration is started when the clock signal crosses 10% of its complete swing when rising from 0 V to 3 V and stopped when the incremental energy spent is less than 1% of the total energy spent during the transition. In the literature, the falling edge of a clock transition is usually ignored in power and energy reports [15]. From Figure 2.4, we see that the non data-storing edge of the clock also consumes a considerable amount of energy.

Table 2.1: Logical and physical properties of cells employed for the correction engine design. [Table body not recoverable from the source scan.]
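The energy and average power computation described in this section can be sketched numerically. This is a minimal illustration that assumes the simulator output has already been parsed into lists of time, supply voltage and supply current samples; it uses the trapezoidal rule, which is an assumption here since the integration method of the original scripts is not stated.

```python
def energy_and_power(times, vdd, ivdd):
    """Integrate vdd(t) * ivdd(t) over the simulation with the trapezoidal rule.

    Returns (energy in joules, average power in watts)."""
    energy = 0.0
    for k in range(1, len(times)):
        p_prev = vdd[k - 1] * ivdd[k - 1]   # instantaneous power at sample k-1
        p_curr = vdd[k] * ivdd[k]           # instantaneous power at sample k
        energy += 0.5 * (p_prev + p_curr) * (times[k] - times[k - 1])
    duration = times[-1] - times[0]         # the simulation period T_s
    return energy, energy / duration


# Hypothetical samples: a 3 V supply delivering a triangular 1 mA current pulse.
t = [0.0, 1e-9, 2e-9]
v = [3.0, 3.0, 3.0]
i = [0.0, 1e-3, 0.0]
print(energy_and_power(t, v, i))
```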

[Plot: energy dissipation in flip-flops. Vertical axis: energy. Legend: clk01q01, clk01q10, clk01q00, clk01q11, clk10q00, clk10q11.]

Figure 2.4: Energy expenditure for flip-flop operation. The energy consumed at the non data-storing clock edge, usually ignored in the literature, is included in the analysis.


2.2.3 System Design

The cell library was used in the top-down design flow of the Universal Micro-Sensor Interface chip, which performs data communication between a sensor node and a central microcontroller [19]. The first version of the circuit was designed using full-custom design methodologies, while the second version was designed using the cell library. Figure 2.5 shows the layout generated for the bus interface using the full-custom design method (a) and the semi-custom design method (b). The library reduced the design time of the bus interface from months to weeks in the semi-custom design. However, this was achieved only at the cost of design density. The layout occupied 1 mm x 1 mm in the first version, and about 1.5 mm x 1 mm in the second version.


 

Figure 2.5: Layout of UMSI - Full-Custom design (a), Semi-Custom Design(b). Drastic
reduction of design time is experienced in (b) when compared to (a).


 

 

 

CHAPTER 3

Floating Point Unit

A digital computer processes either a logical 0 or 1 through its arithmetic and logic units. Integers other than 0 and 1 are represented as a string of 0's and 1's of a specified length. An n-bit integer representation can accommodate $2^n$ discrete values in its spectrum. Floating point representation is used to represent the values between the integer values. Though floating point quantities can also represent only $2^n$ discrete values (sometimes even fewer), the spectrum is very broad compared to the integer representation. Floating point arithmetic operations can be carried out using integer arithmetic units, with small modifications to the inputs, such as alignment according to their exponents, and post-processing of the outputs, such as normalization and rounding. This chapter discusses the integer addition/multiplication units and the floating point units used in the correction engine.
3.1 Integer Operations

In order to perform the ﬂoating point computations for the calibration and compen-
sation engine, unsigned integer adders and multipliers are used. In this section a variety
of adder/multiplier structures are studied and a suitable architecture is identiﬁed for the

implementation.

3.1.1 Adders

The following adder structures were considered for the project: a ripple-carry adder, a carry-save adder, a carry-lookahead adder, and a Manchester carry chain adder. The adders were compared for delay, area and energy dissipation. Since the ripple-carry adder turned out to be the adder with the least energy dissipation, it was chosen for the project.

3.1.2 Multipliers

A variety of multipliers were considered for the project, including an array multiplier, a Booth-encoded multiplier, and a Wallace tree multiplier with a carry-save adder and a ripple-carry adder as final adders. Since the Wallace tree multiplier with the ripple-carry adder is the most efficient structure in terms of energy, it was chosen for the multiplier.
3.2 Floating Point Operations

The IEEE standard for Smart Transducer Interface for Sensors and Actuators [2] recommends that sensor signal processing be performed in floating point precision. A single precision floating point number is represented using 32 bits according to the IEEE floating point standard [1]. In Figure 3.1, f[22:0] represents the fraction bits (also known as the significand), e[30:23] represents the exponent bits and the leading s bit represents the sign of the number. Table 3.1 summarizes the value of the represented number for all the values of the fraction, exponent and sign bits [9]. If the given floating point number is a normal number, then the significand is greater than or equal to 1 and less than 2 due to the implicit presence of a leading 1. This is represented as [1, 2) in symbolic notation. However, denormalized numbers have significands less than 1 and are hence represented in the interval [0, 1). The value of 127 (known as the bias) is always added to the exponent in the IEEE representation so that negative exponents can be represented in unsigned form. Sensor outputs usually contain noise that limits the data precision to around 12 bits. Thus a precision of 24 bits in the significand is not necessary for the correction engine operations, and the significand width is reduced to 16 bits, deviating from the IEEE floating point standard. Figure 3.1 shows the floating point representation used in the correction engine design.

Table 3.1: Floating point values for various exponent and significand combinations.

Bit Pattern         Value
0 < e < 255         $(-1)^s \times 2^{e-127} \times 1.f$   (normal numbers)
e = 0; f ≠ 0        $(-1)^s \times 2^{-126} \times 0.f$   (denormal numbers)
e = 0; f = 0        $(-1)^s \times 0$   (signed zero)
e = 255; f = 0      $(-1)^s \times \infty$   (infinity)
e = 255; f ≠ 0      NaN (Not-a-Number)

 

 

(a) s | e[30:23] | f[22:0]
(b) s | e[22:15] | f[14:0]

 

Figure 3.1: IEEE (a) and custom (b) representation of single precision ﬂoating point quan-
tities.
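To make the custom format concrete, the following sketch decodes a 24-bit word laid out as in Figure 3.1(b): one sign bit, an 8-bit exponent e[22:15] and a 15-bit fraction f[14:0]. It is an illustration only; the function name is hypothetical, and the denormal scaling of $2^{-126}$ is assumed to follow the usual IEEE convention.

```python
BIAS = 127        # exponent bias, as in the IEEE single precision format
FRAC_BITS = 15    # stored fraction bits f[14:0] of the custom format


def decode_custom24(word):
    """Decode a 24-bit word (s | e[22:15] | f[14:0]) into a Python float."""
    s = (word >> 23) & 0x1
    e = (word >> FRAC_BITS) & 0xFF
    f = word & 0x7FFF
    sign = -1.0 if s else 1.0
    if e == 255:                               # infinity or Not-a-Number
        return sign * float("inf") if f == 0 else float("nan")
    if e == 0:                                 # signed zero or denormal: 0.f
        return sign * (f / 2 ** FRAC_BITS) * 2.0 ** -126
    return sign * (1 + f / 2 ** FRAC_BITS) * 2.0 ** (e - BIAS)   # normal: 1.f


print(decode_custom24(0x3F8000))                            # e = 127, f = 0
print(decode_custom24((1 << 23) | (128 << 15) | 0x2000))    # negative, e = 128
```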

The advantages of using floating point computation over integer computation are as follows:

• The floating point number has a wider range of numbers that can be represented using the specified number of bits.

• IEEE floating point representation handles exceptions precisely, and degradation of tiny numbers is handled gracefully.

• The exponent can be used as an indication of the magnitude of the represented number; before performing the actual multiplication and addition processes, the magnitude of the results can be predicted. This is used for isolating quantities of least significance to conserve energy while performing the computations.

3.2.1 Multiplication

The floating point inputs represented by $(s_1, e_1, f_1)$ and $(s_2, e_2, f_2)$ are multiplied to form the result $(s, e, f)$. The output is computed by $s = s_1 \oplus s_2$; $e = e_1 + e_2$ and $f = f_1 \cdot f_2$. The significands of the normal and denormal numbers are in the range of [1, 2) and [0, 1) respectively. Hence the product can be in the range of [0, 4), which has to be rounded to the range [0, 2). The following sections describe the steps taken to perform the multiplication operation.

Pre-normalization

The leading bit of the significand is not stored anywhere in the number and is said to be implicit. In the pre-normalization stage, the packed input number is unpacked to form the explicit significand for multiplication. A tentative exponent of the result is generated by summing the exponents of the inputs. Since the bias gets added twice in the addition process, the bias is subtracted once from the tentative exponent. If the result is too large to be represented in the given precision, an exception is set in this stage to handle the overflow of the product.
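The pre-normalization step can be sketched as follows, assuming operands are given as (s, e, f) tuples with an 8-bit biased exponent and a 15-bit stored fraction. The function name and the simple overflow test are illustrative assumptions; special values (infinity, NaN) and underflow are deliberately not handled.

```python
BIAS = 127
FRAC_BITS = 15


def mul_prenormalize(a, b):
    """Unpack two (s, e, f) operands for significand multiplication."""
    (s1, e1, f1), (s2, e2, f2) = a, b
    # Make the implicit leading bit explicit: 1 for normal, 0 for denormal.
    sig1 = ((1 if e1 != 0 else 0) << FRAC_BITS) | f1
    sig2 = ((1 if e2 != 0 else 0) << FRAC_BITS) | f2
    sign = s1 ^ s2
    # Both exponents carry the bias, so it is subtracted once.
    tentative_e = e1 + e2 - BIAS
    overflow = tentative_e >= 255   # does not fit the 8-bit exponent field
    return sign, tentative_e, sig1, sig2, overflow


# 1.0 * (-2.5): exponents 127 and 128, fractions 0 and 0.25 * 2**15.
print(mul_prenormalize((0, 127, 0), (1, 128, 0x2000)))
```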

Signiﬁcand Multiplication

The 16-bit input signiﬁcands are multiplied using the unsigned integer multiplier to

produce the 32-bit product and the 32-bit result is rounded to 16 bits in the rounding stage.


Rounding

Since the multiplication result is 32 bits wide, it is more precise than the result that can be represented by 16 bits. The process of converting the higher precision significand to a lower precision representable significand is defined as rounding. In other words, rounding is the process in which the higher order 16 bits are modified to account for the lower order 16 bits at the cost of accuracy. The result can take either the value of the higher order 16 bits (a) or the higher order 16 bits + 1 ulp (b), where ulp (unit in the last place) is the least representable quantity for the given bit width. Whether the result takes the value of a or b depends on where the precise 32-bit result is located between a and b, and on the rounding mode.

In order to round the result to 16 bits, the 17th bit (rounding bit) is examined. If the rounding bit is a 0, then the result takes the value of a, since the precise significand is closer to a than to b. If the rounding bit is a 1, and at least one of the bits from 18 to 32 is one (these bits are ORed together to make the sticky bit, since the position of the 1 in bits 18 to 32 does not change the decision), then the result is rounded to the higher magnitude b. However, if the rounding bit is 1 and the sticky bit is 0, then the precise 32-bit significand lies exactly between a and b, and the result is chosen to be either a or b depending on the rounding mode of operation. The IEEE floating point standard supports 4 rounding modes. The four rounding modes are explained in Figure 3.2.

In each of the rounding modes, the result is chosen as follows:

• Even: Since the significands a and b are one ulp apart, one of them will be even and the other will be odd. The even number among a and b is chosen as the result.

• +∞: The significand closer to +∞ is chosen as the result. This depends on the sign of the floating point product. If the sign of the floating point result is positive, then b is chosen as the result, since b is closer to +∞. If the sign is negative, then a is chosen as the result, as a is closer to +∞.

• −∞: The significand closer to −∞ is chosen as the result in this rounding mode. That is, if the sign of the product is positive, a is chosen as the result, and if the sign is negative, b is chosen as the result.

• Zero: The significand closer to 0 is chosen as the result. Since a is closer to zero, a is chosen as the result in this mode irrespective of the sign.
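The decision rules above can be condensed into a small function. The sketch follows the description in this section, in which the rounding mode is consulted only for an exact tie (rounding bit 1, sticky bit 0); it is not a full implementation of the IEEE 754 rounding attributes, and the function name and mode strings are assumptions.

```python
def round_significand(a, round_bit, sticky, sign, mode):
    """Choose between a and b = a + 1 ulp.

    `a` is the truncated 16-bit significand as an integer, `sign` is 0 for a
    positive result and 1 for a negative one, and `mode` is one of
    "even", "+inf", "-inf", "zero"."""
    b = a + 1
    if round_bit == 0:
        return a                  # precise value is closer to a
    if sticky == 1:
        return b                  # precise value is closer to b
    # Exact tie: the rounding mode decides.
    if mode == "even":
        return a if a % 2 == 0 else b
    if mode == "+inf":
        return b if sign == 0 else a
    if mode == "-inf":
        return a if sign == 0 else b
    if mode == "zero":
        return a
    raise ValueError("unknown rounding mode: " + mode)


print(round_significand(0x8001, 1, 0, 0, "even"))   # tie, a is odd, so b is chosen
```

If `a` is already the largest 16-bit value, `b` no longer fits in 16 bits; that carry is the case handled by incrementing the exponent.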

 

[Diagram: direction of rounding between a and b for positive and negative results in the EVEN, +INFTY, -INFTY and ZERO modes.]

 

 

 

 

Figure 3.2: Four rounding modes used in the correction engine design.

If the value of the significand a is the largest one that can be represented using 16 bits, then incrementing it by one ulp will produce a carry. In that case, the exponent is incremented and the resulting significand is chosen to be all zeroes. The process of choosing between the significands a and b is defined as significand-rounding, and incrementing the exponent is defined as post-normalization. During the post-normalization process, if incrementing the exponent results in a carry, it is impossible to represent the result in the described precision and hence an overflow occurs. The overflow can occur in all rounding modes except Zero, in which the significand and the exponent are never incremented, which avoids exponent overflow.

As mentioned previously, if the input significands are in [0, 2), then the product of those two significands may be in [0, 4), and rounding is the process used to represent the significand in [0, 2). If the product is in [2, 4), it is adjusted to be in [0, 2) by incrementing the exponent and shifting the implied binary point to the left before examining the rounding bit and proceeding to the significand-rounding process. If the product of the multiplier is in [2, 4), the most significant 16 bits are considered as a, and the rounding is carried out with the 17th bit as the rounding bit and the ORed value of bits 18 through 32 as the sticky bit. If the product itself is in [0, 2), the most significant bit of the product is ignored and the next 16 bits are treated as a. The 18th bit becomes the rounding bit and the ORed value of bits 19 through 32 becomes the sticky bit.
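The bit selection described above can be sketched for a 32-bit product of two 16-bit significands. Bits are numbered from 1 (most significant) to 32 as in the text; the function name is hypothetical.

```python
def normalize_product(product):
    """Split a 32-bit significand product into (a, round_bit, sticky, exp_inc)."""
    if product >> 31:                 # msb set: product lies in [2, 4)
        shift, exp_inc = 16, 1        # a = bits 1..16, exponent is incremented
    else:                             # product lies in [0, 2)
        shift, exp_inc = 15, 0        # msb ignored, a = bits 2..17
    a = (product >> shift) & 0xFFFF
    round_bit = (product >> (shift - 1)) & 1               # bit just below a
    sticky = int(product & ((1 << (shift - 1)) - 1) != 0)  # OR of remaining bits
    return a, round_bit, sticky, exp_inc


# 1.5 * 1.5 = 2.25: both significands are 0xC000 with one integer bit.
print(normalize_product(0xC000 * 0xC000))
```

In the first example the returned significand 0x9000 represents 1.125, and the incremented exponent accounts for the remaining factor of 2.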

3.2.2 Addition

The multiplication operation is carried out by calculating the 32-bit result and then rounding it to 16 bits. The addition, however, can be performed with a 16-bit adder. The addition process is a little more complicated than the multiplication, as the addition may turn out to be a subtraction depending on the signs of the operands. When the inputs are known, the smaller operand is identified by comparing the exponents and is shifted to the right such that both inputs have equal exponents. Then the addition is performed using a 16-bit adder and the result is post-normalized to form the final result. The following sections elaborate on the floating point addition process employed in the design of the correction engine.

Pre-normalization

In this stage, the packed floating point operands are processed and sent to the adder unit for addition. The operands are swapped such that the exponent difference between the operands is not negative. The significand of the second operand is complemented if the two operands differ in their signs. The second operand is shifted to the right by the exponent difference, preserving the sign. The bit adjacent to the lsb is assigned as the guard bit, the bit adjacent to the guard bit is assigned as the round bit, and the rest of the bits are ORed together to form the sticky bit.
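The alignment step can be sketched for unsigned significands as follows. This is a simplified illustration: the sign handling (complementing the second operand) is omitted, and the function name and tuple layout are assumptions.

```python
def align_operands(e1, sig1, e2, sig2):
    """Align two significands to the larger exponent.

    Returns (exponent, larger_sig, shifted_sig, guard, round_bit, sticky)."""
    if e2 > e1:                       # swap so the exponent difference is >= 0
        e1, sig1, e2, sig2 = e2, sig2, e1, sig1
    shift = e1 - e2
    extended = sig2 << 2              # two extra positions below the lsb
    shifted = extended >> shift
    lost = extended & ((1 << shift) - 1)   # bits shifted beyond the round bit
    guard = (shifted >> 1) & 1        # first bit below the lsb
    round_bit = shifted & 1           # second bit below the lsb
    sticky = int(lost != 0)           # OR of everything further down
    return e1, sig1, shifted >> 2, guard, round_bit, sticky


# The smaller operand 0x8003 is shifted right by three positions.
print(align_operands(130, 0x8000, 127, 0x8003))
```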

Signiﬁcand Addition

The signiﬁcands produced in the pre-normalization are added using a 16-bit adder to

produce the sum and carry in this stage.

Rounding

Different rounding procedures are carried out depending on the input operands and the result.

• In the pre-normalization process, the operands are compared only for exponents (to find which operand is larger and to shift the smaller operand's significand to the right). If the two operands have the same exponent and their signs are different, then the result produced will be exact and the result will contain leading zeroes. The result is shifted to the left until the msb of the result is 1, or until the exponent becomes 0, giving rise to a de-normalized quantity.

• If the operands are of the same sign and no carry is produced during the addition process, the significand is rounded with the guard bit as the new round bit, and the round and sticky bits are ORed to make the new sticky bit. If a significand overflow results, the exponent is incremented. If an exponent overflow occurs, appropriate overflow/infinity flags are set depending on the rounding mode.

• If the operands are of the same sign and a carry is produced in the addition process, the exponent is incremented and the significand-rounding is carried out with the lsb of the sum as the new round bit, and the guard, round and sticky bits are ORed together to make the new sticky bit. Abnormal conditions such as overflow are detected and the proper flags are set during the significand-rounding.

• If the operands are of different signs, a carry is produced in the adder and the msb of the sum is 0, the carry is discarded, the guard bit is included in the significand, and the significand-rounding is carried out with the actual round, guard and sticky bits.

3.3 Implementation

Both the multiplier and the adder are implemented in three pipeline stages. The first stage performs the pre-normalization, the second stage performs the actual multiplication/addition, and the third stage performs the rounding. These stages initiate their operations at the rising edge of the clock and write the data to internal registers, which act as the inputs of the next stage. To avoid spurious computations, each block is enabled by the previous stage. The floating point multipliers and adders were verified with test vectors using the NC-Verilog simulator. Table 3.2 summarizes the number of gates in each of the floating point multiplier/adder stages, the area of the design, the number of gates in the critical path and the critical path delay. Since the critical path delay of the multiplier is 21.2 ns, it limits the clock frequency to around 40 MHz.

Table 3.2: Design results of the modules in the correction engine design. Shown are the number of gates in the module, its area, the number of gates in the critical path (CP), the delay in the critical path, and the power consumption when operating at 40 MHz.

Block           #Gates   Area (um x um)    CP   CP delay (ns)   Power (mW)
Exp Predictor      999                     49       19.02          2.417
Mul-Pre-norm       351   688.8 x 668.4     21        8.02          0.835
Multiplier        4293   1951 x 1951       30       21.2          10.216
Mul-Rounding       867   1015 x 1017       30       16.25          2.039
Add-Pre-norm      1303   1149 x 1127       28       14.96          2.670
Adder              358   702 x 702         16        6.43          0.801
Add-Rounding      1595   1394 x 1370       24       14.71          4.116

 

 

 

 

 

 

 


CHAPTER 4

Calibration and Compensation Engine

The calibration and compensation engine (correction engine for brevity) performs the
error correction of sensor signal prescribed by the IEEE standards [2] using ﬂoating point
hardware. Essentially the error correction is the process of evaluating the multinomial

shown in Equation 4.1.

$$Y = \sum_{i=0}^{D(1)} \sum_{j=0}^{D(2)} \cdots \sum_{p=0}^{D(n)} C_{i,j,\ldots,p} \, [X'_1 - H_1]^i \, [X'_2 - H_2]^j \cdots [X'_n - H_n]^p \qquad (4.1)$$

For example, in a given region of operation, the correction equation might look like Equation 4.2.

$$Y = C_{000} + C_{120} \, [X'_1 - H_1] \, [X'_2 - H_2]^2 + C_{213} \, [X'_1 - H_1]^2 \, [X'_2 - H_2] \, [X'_3 - H_3]^3 + C_{030} \, [X'_2 - H_2]^3 \qquad (4.2)$$
$X'_1$, $X'_2$, and $X'_3$ are input signals such as pressure channel data and temperature channel data, $H_1$, $H_2$, and $H_3$ are the offset values for the given channels, and $C_{000}$, $C_{120}$, $C_{213}$, and $C_{030}$ are the gain coefficients. $D(1)$, $D(2)$, and $D(3)$ take the values 2, 3, and 3, the maximum order to which each signal is raised. The offset values are subtracted from the input signals, and the results are raised to the appropriate powers, multiplied with the other signals, and multiplied with the constant coefficient to form the partial sum term. The partial sum terms are accumulated to form the final output Y.
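A direct software evaluation of such a multinomial is sketched below for reference. The dictionary layout (order tuples mapped to coefficients) and all numeric values are hypothetical; the terms mirror the shape of Equation 4.2.

```python
def evaluate_correction(coeffs, inputs, offsets):
    """Evaluate Y = sum of C * product((X_k - H_k) ** order_k) over all terms.

    `coeffs` maps an order tuple such as (2, 1, 3) to its coefficient."""
    y = 0.0
    for orders, c in coeffs.items():
        term = c
        for x, h, n in zip(inputs, offsets, orders):
            term *= (x - h) ** n      # offset-compensated signal raised to its order
        y += term                     # accumulate the partial sum term
    return y


# Hypothetical coefficients for the terms C000, C120, C213 and C030.
coeffs = {(0, 0, 0): 1.0, (1, 2, 0): 2.0, (2, 1, 3): 0.5, (0, 3, 0): -1.0}
print(evaluate_correction(coeffs, inputs=(3.0, 2.0, 2.0), offsets=(1.0, 1.0, 1.0)))
```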
When the signals from the sensors arrive at the microcontroller, the microcontroller converts the sampled digital signal to a floating point value, compensates for the offset values and initiates the error correction operation (for simplicity, let us assume that the signals compensated with offsets are X1, X2, X3). The correction engine performs the error correction operation using its hardware resources and reports the corrected signal to the microcontroller. The correction engine employs value prediction schemes to perform the error correction operation with either high accuracy or low energy expenditure, trading off one for the other. The architecture and the operation of the correction engine and the hardware sorter are explained in this chapter.

4.1 Correction Engine Architecture

The blocks of the correction engine and their connectivity are shown in Figure 4.1. The microprocessor core directly interacts with the shared memory and stores the values necessary to perform the calibration. The shared memory is 32 bits wide. The calibration coefficients are stored in these 32-bit memory locations (shown in Figure 4.2). The lower 24 bits hold the calibration coefficient. The upper 8 bits are divided into 4 banks, each 2 bits wide, to store the orders of X1, X2, X3, and X4 respectively. This limits the number of independent signals to 4 and their corresponding orders to 3.
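The memory word layout can be illustrated with a pack/unpack pair. The text does not state which 2-bit bank belongs to which signal, so the sketch assumes that the order of X1 occupies the two most significant bits; the function names are hypothetical.

```python
def pack_word(orders, coeff24):
    """Pack four 2-bit orders and a 24-bit coefficient into a 32-bit word."""
    word = coeff24 & 0xFFFFFF                 # lower 24 bits: coefficient
    for k, n in enumerate(orders):
        if not 0 <= n <= 3:
            raise ValueError("an order must fit in 2 bits")
        word |= n << (30 - 2 * k)             # assumed: X1 in bits 31:30
    return word


def unpack_word(word):
    """Return (orders of X1..X4, 24-bit coefficient) from a 32-bit word."""
    orders = tuple((word >> (30 - 2 * k)) & 0x3 for k in range(4))
    return orders, word & 0xFFFFFF


w = pack_word((2, 1, 3, 0), 0x3F8000)
print(hex(w), unpack_word(w))
```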

Figure 4.1: Architecture of the Correction Engine with microprocessor core and memory.

The controller reads the memory location where the first calibration coefficient is stored and computes the exponent of the partial sum (like $C_{213} \cdot X_1^2 \cdot X_2 \cdot X_3^3$) using the exponents of the input signals. This exponent is tentative, as it might be modified during the rounding in the multiplication process (Section 3.2.1). This tentative exponent is calculated for all the partial sum terms and fed to the sorter, which arranges them in ascending order. The partial sum terms are evaluated starting from the one with the lowest tentative exponent. In order to evaluate a partial sum, inner products (like $X_1^2$, $X_3^3$) and cross products (like $(X_1^2 \cdot X_2) \cdot X_3^3$) are required. These values are computed using the pre-normalization, multiplier and rounding blocks of the floating point multiplier and stored in the memory. Subsequently, the partial sum is evaluated and passed to the accumulator, which is reset at the beginning of the multinomial evaluation. Evaluating the partial sums starting from the term of least significance prevents tiny numbers from disappearing in the floating point addition process. For example, if we want to add the numbers 10, 0.8, and 0.9 using a floating point adder that is 2 digits wide, accumulation starting from 10 will yield a result of 10, as the tiny quantities are lost when the operands are aligned. However, if the accumulation is started from 0.8 and 0.9, followed by 10, the result will be 11, a more accurate result.
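The two-digit example can be reproduced with Python's `decimal` module. The sketch assumes that digits which do not fit are simply dropped (truncation), which matches the numbers quoted in the example; the hardware itself works in binary, so this is an analogy only.

```python
from decimal import Context, Decimal, ROUND_DOWN


def accumulate(values, digits=2):
    """Add the values in the given order with a `digits`-wide decimal adder."""
    ctx = Context(prec=digits, rounding=ROUND_DOWN)   # drop digits that do not fit
    total = Decimal(0)
    for v in values:
        total = ctx.add(total, Decimal(v))
    return total


print(accumulate(["10", "0.8", "0.9"]))   # largest term first
print(accumulate(["0.8", "0.9", "10"]))   # smallest terms first
```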

4.2 Correction Engine Operation

When the inputs from the sensors arrive, the region of operation in the sensor signal transfer curve is determined by the microcontroller, and the error-correction coefficients are stored in the shared memory. The controller reads the first correction coefficient and the orders of the input signals. The controller has a 4-input multiplexer for each input signal.


Figure 4.2: Memory word organization. The powers of the input signals are stored in the most significant byte, while the correction coefficient is stored in the 3 least significant bytes.

The input signals for the multiplexer are 0, $e_{X_1}$ (the exponent of the input signal X1), $2 \cdot e_{X_1}$ ($e_{X_1}$ left shifted once), and $3 \cdot e_{X_1}$ (obtained by adding $e_{X_1}$ and $2 \cdot e_{X_1}$); the select signal is the order of the input signal ($O(X_1)$). The outputs of all four multiplexers are fed to a 5-input adder, where the other input comes from the exponent term of the coefficient, to form the tentative exponent of the partial sum term (Figure 4.3). The sign of the partial sum term is determined by evaluating $s_{X_1} \oplus s_{X_2} \oplus s_{X_3} \oplus s_{X_4} \oplus s_{C_{O(X_1)O(X_2)O(X_3)O(X_4)}}$. However, the sign evaluated here is used only to set an appropriate ±∞ flag during an overflow at any stage of the shift or addition process.

The exponents of the partial sum terms are computed one by one, one at each clock cycle. Reading a constant term, whose orders are all zero ($O(X_{1 \ldots 4}) = 0$), indicates that the multinomial does not have any more terms to evaluate. The tentative exponents are fed to the hardware sorter, which accepts all the inputs given in consecutive clock cycles. After getting the signal that the multinomial has reached a constant term, the sorter starts giving its output starting from the term of least significance. Inner products (like $X_1^2$, $X_3^3$) and cross products (like $(X_1^2 \cdot X_2) \cdot X_3^3$) are evaluated, followed by the partial sum term itself (like $C_{213} \cdot X_1^2 \cdot X_2 \cdot X_3^3$). Once the first partial sum is evaluated, it is fed to the adder (pre-normalize, add, and rounding stages) in the consecutive cycles. The other input for the adder comes from the accumulator of the correction engine.
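The tentative exponent generation and the ordering performed by the sorter can be sketched in software. The sketch assumes unbiased integer exponents and a simple list of (orders, coefficient exponent) pairs; bias handling and the hardware sorter structure are not modeled, and all names and values are hypothetical.

```python
def tentative_exponent(coeff_exp, input_exps, orders):
    """Coefficient exponent plus order-weighted sum of the input exponents."""
    # Each product n * e corresponds to one multiplexer output (0, e, 2e or 3e).
    return coeff_exp + sum(n * e for n, e in zip(orders, input_exps))


def evaluation_order(terms, input_exps):
    """Sort (orders, coeff_exp) terms by ascending tentative exponent."""
    return sorted(terms,
                  key=lambda t: tentative_exponent(t[1], input_exps, t[0]))


# Hypothetical exponents of X1..X4 and three partial sum terms.
exps = (3, -1, 2, 0)
terms = [((2, 1, 3, 0), 0), ((0, 0, 0, 0), 5), ((0, 3, 0, 0), 1)]
print(evaluation_order(terms, exps))
```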


 

 

Figure 4.3: Tentative exponent generation with input exponents and correction coefﬁcient.

4.2.1 Clocking

The clocking scheme for the various blocks is an interesting one, as not all the blocks produce and consume data on consecutive cycles. Once the sorter produces an output indicating which partial sum term is to be evaluated first, the necessary inner and cross products are generated to evaluate the complete partial sum term. During this computation phase, the sorter as well as the adder blocks are not clocked. Once a partial sum term is evaluated completely, the sorter is clocked to produce the next partial sum term and, at the same time, the adder pre-normalize block is clocked. The add and rounding blocks are clocked in the consecutive cycles after the adder pre-normalize block is clocked, to complete the accumulation. The internal flip-flops hold the data values if the stages are not clocked.


4.2.2 Reconﬁgurability

The data correction unit is capable of adapting itself to a variety of operations in
addition to the error correction operation. It can be programmed to work as a dedicated
integer/floating-point multiply-and-accumulate unit without clocking the hardware sorter unit.
It can adopt a rounding scheme in which the rounding bit is forced to zero, bypassing
the entire post-normalization block and completing computations in fewer clock
cycles. The microcontroller can take control of all hardware resources to operate the unit as a
general-purpose floating-point co-processor that performs filtering operations and data fusion
algorithms.
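The effect of forcing the rounding bit to zero can be sketched on an integer mantissa. This is a simplified model under the assumption that a zero rounding bit is equivalent to truncating the discarded low-order bits; the function names and the 8-bit example value are hypothetical.

```python
def round_to_nearest(mantissa, drop_bits):
    # Add half an LSB of the kept field before discarding the low bits
    # (ties round up in this simplified model).
    return (mantissa + (1 << (drop_bits - 1))) >> drop_bits


def forced_zero_round(mantissa, drop_bits):
    # Rounding bit forced to 0: the low bits are simply discarded.
    return mantissa >> drop_bits


wide = 0b10110111                        # hypothetical 8-bit intermediate mantissa
rounded = round_to_nearest(wide, 3)      # keeps 5 bits, rounds
truncated = forced_zero_round(wide, 3)   # keeps 5 bits, truncates
print(rounded, truncated)

# Rounding can carry out of the kept field, which is what makes a
# post-normalization step necessary; truncation never does.
overflowed = round_to_nearest(0b11111111, 3)   # needs 6 bits
```

In this model the truncated result differs from the rounded one by at most one unit in the last kept place, which is the accuracy given up in exchange for skipping the rounding and post-normalization work.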

4.2.3 Perturbation Analysis

The increment theorem for functions [17] describes how the output of a function changes when there
is a small change in its inputs. If $Y = f(X_1, X_2)$ and the input $X_1$ changes by $\Delta X_1 = X_1 - X_{10}$,
then the output is given by $Y \approx Y_0 + \frac{\partial f}{\partial X_1} \cdot \Delta X_1$, where $Y_0 = f(X_{10}, X_2)$, provided the perturbation
$\Delta X_1$ is small. If one of the inputs changes slightly during operation, this principle can be
used to compute the change in the output rather than computing the output again. Software
assistance is needed for calculating the perturbation of the input signal and for determining
whether the approximation is close enough to the actual output.
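A minimal numerical sketch of this first-order update is given below. The multinomial `f`, its coefficients, the operating point, and the tolerance are all made up for illustration; only the update rule $Y \approx Y_0 + \frac{\partial f}{\partial X_1} \cdot \Delta X_1$ follows the text.

```python
def f(x1, x2):
    # Hypothetical correction multinomial; the coefficients are made up.
    return 0.5 + 1.2 * x1 - 0.3 * x1 * x2 + 0.05 * x1**2 * x2


def df_dx1(x1, x2):
    # Analytic partial derivative of f with respect to x1.
    return 1.2 - 0.3 * x2 + 0.1 * x1 * x2


def perturbed_output(y0, slope, delta_x1):
    # First-order update: one multiply and one add instead of a full re-evaluation.
    return y0 + slope * delta_x1


x10, x2 = 2.0, 1.5           # previous operating point (assumed values)
y0 = f(x10, x2)              # output already computed for that point
slope = df_dx1(x10, x2)      # sensitivity at that point

delta = 0.01                 # small change in X1
approx = perturbed_output(y0, slope, delta)
exact = f(x10 + delta, x2)
error = abs(approx - exact)

TOLERANCE = 1e-5             # assumed acceptance threshold, checked in software
close_enough = error <= TOLERANCE
print(approx, exact, error, close_enough)
```

For this particular example the neglected term is the second-order one, so the error shrinks quadratically as the perturbation gets smaller.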

4.2.4 Energy Efﬁciency

Since the operating frequency of the correction engine is limited to 40 MHz, the power
measurements are taken by simulating the design at 40 MHz. All possible input
combinations are applied to each cell in the library, and the power dissipation is calculated.
The number of times a particular gate is used is counted and multiplied by its power
consumption. Performing this calculation for all the cells in the library yields the
typical power consumption. The correction engine takes about 28 clock cycles to evaluate
the multinomial shown in Equation 4.2 and consumes about 4.47 nJ from the battery. It is
assumed that the leakage power is very small compared to the switching power.
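The figures quoted above imply an evaluation time and an average power that can be checked with a few lines of arithmetic. The sketch below also shows the count-times-power bookkeeping in its simplest form; the cell names, gate counts, and per-cell power values in that part are hypothetical and are not taken from the thesis.

```python
CLOCK_HZ = 40e6        # operating frequency from the text
CYCLES = 28            # cycles to evaluate the multinomial
ENERGY_J = 4.47e-9     # energy per evaluation (4.47 nJ)

evaluation_time_s = CYCLES / CLOCK_HZ              # 28 * 25 ns = 700 ns
average_power_w = ENERGY_J / evaluation_time_s     # about 6.39 mW
energy_per_cycle_j = ENERGY_J / CYCLES             # about 0.16 nJ


def library_power(gate_counts, cell_power_w):
    # Sum over cells of (number of instances) * (power of one instance).
    return sum(count * cell_power_w[cell] for cell, count in gate_counts.items())


# Hypothetical numbers, only to show the bookkeeping.
gate_counts = {"nand2": 1200, "dff": 300}
cell_power_w = {"nand2": 1.0e-6, "dff": 4.0e-6}
total_power_w = library_power(gate_counts, cell_power_w)

print(evaluation_time_s, average_power_w, energy_per_cycle_j, total_power_w)
```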

The benefit of the hardware sorter comes into play when the correction equation
contains many partial sum terms. The hardware sorter reduces the number of computations
in the multinomial by trading off accuracy, and thereby improves the energy efficiency
of the correction engine. In sensor-based, battery-powered systems, data precision and
speed are usually less important than the energy demands of the system. Hence this
method is more effective than general-purpose floating-point hardware units, where data
accuracy is not compromised and a certain level of performance is guaranteed. Most
application-specific controllers in sensor-based microsystems support floating-point operations
in software rather than in hardware. These software routines add computational
and communication overheads to the correction process, making them unsuitable
for a system that predominantly wants to stay in sleep mode to save battery life. This
reconfigurable correction engine, as a single system, meets the requirements of both high
accuracy and low energy demands (though not simultaneously), making it a unique
product among the growing environmental and biosensor microsystems.

4.3 Hardware Sorter

The significance of a particular partial sum term in the error correction multinomial is
approximately estimated from the exponent values of the input signals and the correction
coefficient. The partial sum terms are rearranged in ascending order of their significance,
then evaluated and accumulated to form the final sum. This avoids the error due to
alignment in the floating-point addition operation. However, if the accumulation is started
from the term of most significance, a coarse result can be obtained in fewer computations,
saving the time and energy involved in the floating-point computation scheme. Arranging the
partial sum terms in order is therefore necessary to produce either accurate or faster results. For
the correction engine design, a hardware sorter working on the principle of assigning ranks
to integers on the fly is developed. The following sections describe the algorithm and the
working principle of the sorter.
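The idea of ranking terms by a predicted exponent can be sketched in software. The sketch assumes that the tentative exponent of a term is the coefficient exponent plus each input exponent multiplied by its power, which is consistent with the caption of Figure 4.3 but is not a transcription of the hardware. The function names, the term list, and the input values are hypothetical.

```python
import math


def exponent(value):
    # Binary exponent e such that value = m * 2**e with 0.5 <= |m| < 1.
    return math.frexp(value)[1]


def tentative_exponent(coeff, inputs, powers):
    # Estimate the magnitude of coeff * prod(x**p) from exponents only,
    # without performing any multiplication of mantissas.
    return exponent(coeff) + sum(p * exponent(x) for x, p in zip(inputs, powers))


def term_value(coeff, inputs, powers):
    return coeff * math.prod(x**p for x, p in zip(inputs, powers))


def accumulate(terms, inputs, most_significant_first=False, max_terms=None):
    # Order terms by predicted significance, optionally keep only the first few.
    ordered = sorted(
        terms,
        key=lambda t: tentative_exponent(t[0], inputs, t[1]),
        reverse=most_significant_first,
    )
    if max_terms is not None:
        ordered = ordered[:max_terms]
    total = 0.0
    for coeff, powers in ordered:
        total += term_value(coeff, inputs, powers)
    return total


INPUTS = (3.0, 0.25)                      # made-up sensor readings X1, X2
TERMS = [                                 # (coefficient, (power of X1, power of X2))
    (1.5, (0, 0)),
    (0.02, (1, 0)),
    (4.0, (2, 1)),
    (0.001, (0, 3)),
]

accurate = accumulate(TERMS, INPUTS)                                         # least significant first
coarse = accumulate(TERMS, INPUTS, most_significant_first=True, max_terms=2)
print(accurate, coarse)
```

Accumulating in ascending order uses every term, while the descending order with `max_terms` stops early and returns the coarse result described above.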

4.3.1 Sorting Algorithm

In this scheme, each incoming integer is associated with a rank, which determines its
position in a set of integers. When a new integer arrives at the network, it does not have a
rank. The rank of the new integer among the existing integers is determined by comparing
it against all the existing integers. Then the rank of every integer whose rank is greater than or
equal to that of the incoming integer is incremented to maintain the uniqueness of the ranks. If the
integers are recalled in the order of their ranks, a sorted list is produced. This methodology
avoids swapping integers inside the network, which is the main contributor to increased
processing time and energy dissipation. The disadvantage of the sorter is that inputs are given
and outputs are taken once per clock cycle (i.e., it takes n clock cycles to sort n integers).
However, since the input for the sorter comes from the exponent predictor block and the output is
fed to the Data Control block, all working in a pipeline, the designed sorter fits well in the
correction engine design.
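The rank assignment scheme can be expressed as a short software sketch. The class and method names are hypothetical; the point is that stored values never move, and only the ranks are updated when a new integer arrives.

```python
class RankSorter:
    """Software sketch of sorting by assigning ranks on the fly."""

    def __init__(self):
        self.values = []   # integers in arrival order (never swapped)
        self.ranks = []    # rank of each stored integer

    def insert(self, value):
        # Rank of the newcomer = number of stored integers smaller than it.
        new_rank = sum(1 for v in self.values if v < value)
        # Increment every rank >= new_rank so that ranks stay unique.
        self.ranks = [r + 1 if r >= new_rank else r for r in self.ranks]
        self.values.append(value)
        self.ranks.append(new_rank)

    def sorted_values(self):
        # Recall the integers in the order of their ranks.
        out = [None] * len(self.values)
        for value, rank in zip(self.values, self.ranks):
            out[rank] = value
        return out


sorter = RankSorter()
for item in [10, 30, 20]:
    sorter.insert(item)
print(sorter.ranks, sorter.sorted_values())
```

One integer is accepted per call to `insert`, mirroring the one-integer-per-clock-cycle behavior described in the text.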

4.3.2 Sorter Architecture

The sorter (shown in Figure 4.4) is capable of sorting sixteen 8-bit wide unsigned
integers, accepting one integer per clock cycle. It has sixteen 8-bit registers to store the integers
and a comparator associated with each integer. The comparators compare the existing
integers with the incoming integer and produce a 0 if the existing integer is smaller than
the incoming integer; otherwise they produce a 1. Counting the 0's from all the comparators
gives the rank of the incoming integer. The results of all the comparators are stored
in a 16-bit register, which has the special property that the result of any comparator can
be stored in any bit position. The bit position of a comparator's result is determined by
the rank of the existing integer, which is stored in its 4-bit index register. With this arrangement,
the rank determination problem is reduced to a leading-zero detection problem. The sorter
has a leading-zero detector for determining the rank of the incoming integer. Once the leading
zeros are detected and stored as the rank of the incoming integer, the rank is compared
against the ranks of all the existing integers. The ranks greater than or equal to the incoming
integer's rank are incremented. Hence the new integer is inserted in the array of ordered integers. Each
integer has a data valid bit associated with it, so that the comparison is performed only
when its data valid bit is enabled.
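A register-level behavioral model of this architecture is sketched below. It is an illustration rather than the thesis's RTL: the class name is hypothetical, the attribute names follow the signal names used in the text, and it assumes that bit positions of the compare register not driven by a valid comparator read as 1, so that they terminate the run of leading zeros.

```python
class HardwareSorterModel:
    """Behavioral model: 16 data registers, 4-bit ranks, compare register, LZD."""

    SIZE = 16

    def __init__(self):
        self.reg = [0] * self.SIZE             # sixteen 8-bit data registers
        self.reg_index = [0] * self.SIZE       # rank of each register
        self.data_valid = [False] * self.SIZE  # comparison enable per register
        self.store_index = 0                   # next free register

    def push(self, data_in):
        assert 0 <= data_in <= 0xFF and self.store_index < self.SIZE
        # Assumption: undriven bit positions read as 1.
        compare_reg = [1] * self.SIZE
        for i in range(self.SIZE):
            if self.data_valid[i]:
                # Comparator: 0 if the stored integer is smaller than the incoming one.
                bit = 0 if self.reg[i] < data_in else 1
                # The result lands at the bit position given by the stored rank.
                compare_reg[self.reg_index[i]] = bit
        # Leading-zero detection, counted from bit position 0.
        rank = 0
        while rank < self.SIZE and compare_reg[rank] == 0:
            rank += 1
        # Increment every existing rank greater than or equal to the new rank.
        for i in range(self.SIZE):
            if self.data_valid[i] and self.reg_index[i] >= rank:
                self.reg_index[i] += 1
        slot = self.store_index
        self.reg[slot] = data_in
        self.reg_index[slot] = rank
        self.data_valid[slot] = True
        self.store_index += 1
        return rank

    def read_sorted(self):
        out = [0] * self.store_index
        for i in range(self.store_index):
            out[self.reg_index[i]] = self.reg[i]
        return out


model = HardwareSorterModel()
ranks_on_arrival = [model.push(v) for v in (10, 30, 20)]
print(ranks_on_arrival, model.reg_index[:3], model.read_sorted())
```

Because the stored ranks always match the sorted order, the zeros in the compare register form a contiguous run starting at position 0, which is why counting leading zeros is enough to find the rank.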


Figure 4.4: Architecture of the sorter


4.3.3 Sorter Operation

The sorter is initialized by setting the inputht signal (Figure 4.5), which resets the
store_index counter. In consecutive clock cycles, store_index is incremented. Each time
an integer arrives at the network, the incoming integer is stored in the register indicated by
the contents of store_index. The same integer is also stored in the data_in register so that it can
be compared with the already existing integers. The reg_index associated with the first integer
is set to 0 (4'b0000, as reg_index is 4 bits wide) and the data_valid bit associated
with that register is set, so that the result of the compare operation is stored in the 0th
position of compare_reg.

When the next integer arrives, it is stored in the next register. The
new integer is compared with the already existing integer. If the new integer is bigger than
the old integer, the comparator produces a 0. The leading zeros in compare_reg are
counted by the leading-zero detection network. Since the number of leading zeros (1) is greater than the
reg_index of the first integer, that reg_index is not modified this time. The leading-zero count is stored in the
reg_index of the second integer.

If a number between the first and the second integer then arrives at the network, this third
integer is stored in the 3rd register. The already existing integers are compared against the
incoming integer. The comparator results of the first and second integers go to the 0th
and 1st bit positions, respectively. This time, too, the leading zeros in compare_reg are counted
as 1, and every reg_index with an entry greater than or equal to the leading-zero count is
incremented. Hence the reg_index of the first integer is not incremented, while the
reg_index of the second integer is. The leading-zero count is stored in
the reg_index of the third integer. If the integers are now recalled by their reg_index values, they
are sorted in ascending order. The sorter was tested with a typical input pattern, and the
functional simulation is shown in Figure 4.5.

Figure 4.5: Functional Simulation of the Sorter

CHAPTER 5

Conclusion and Future Research

5.1 Conclusion

A correction engine capable of performing sensor signal calibration and compensation
was implemented in a top-down design process using a custom library. Energy efficiency
was given priority right from the early design stages. The design and report generation
were performed with the help of Perl scripts, which required minimal manual
intervention.

Energy/power savings in the correction engine design are obtained in the following
ways:

• An optimized cell library of about 25 cells was developed to meet the low energy
constraints. For example, since flip-flops are used extensively in the pipelined
microprocessor and data correction unit, the energy demands of five different flip-flop
structures (in-house flip-flop with and without reset, push-pull isolation flip-flop,
transmission gate flip-flop and a regular master-slave flip-flop) were thoroughly
analyzed and the two most efficient structures were included in the cell library.

• Efficient multipliers and adders are used to perform the integer multiplication and
addition.

• The exponent term of the floating-point quantity is exploited to order the partial sum
terms even before the multiplication and addition process. This novel method saves
the energy otherwise spent computing the small terms in the multinomial, which do not
contribute to the final result. When accuracy is a concern, the accumulation can be started
from the term of least significance, which also avoids the error arising in the floating-point
alignment process. Thus the correction engine design meets the requirements of both
ends. Predicting the magnitude of the data using the exponents of floating-point
values is a unique contribution to the scientific community.

• The design can adopt a novel rounding scheme that forces the rounding bit to 0
and gains a clock cycle by shutting down the entire rounding block. This is an unusual
way of gaining performance.

• An algorithm that assigns ranks on the fly was used to reorder the terms in the
multinomial according to their significance. The data movements in the sorter are kept
to a minimum (no swapping between the contents of registers) to keep the energy
dissipation under control.

• Perturbation calculations, which reduce the order of the sum terms in the
multinomial, help high-order equations to be evaluated in hardware without software
intervention and contribute further to the low energy objective.

5.2 Future Research

• The interactions between the microcontroller and the correction engine can be
studied further and optimized.

• Moving to a technology with more interconnect metal layers and to smaller-feature-size
Silicon on Insulator technologies will give designs of higher density and can further
contribute to the low energy objective.

• Formal verification techniques and Design for Test techniques will improve the
reliability of the design process, identify faults in the circuit, and check the
consistency of the design at various stages of the design flow.

• Content Addressable Memory [10] locations for storing the inner and cross products of
a particular partial sum term in the multinomial would avoid evaluating the same
terms twice.

• Currently, more burden is placed on the hardware to resolve hazards and to
schedule resources. This load can be moved to the compiler to avoid stall
cycles in the operation.


APPENDIX A

Design Flow With AMSAC Library

The cell library has a Timing Library Format (TLF) file to help the synthesis tool
perform logic synthesis, and an LEF file to perform the physical design. Both the TLF and the
LEF file can be segmented into two sections. In the first section, information pertaining to
all the cells in the library is provided, and in the second section, information pertaining to
an individual cell is given. New cells can be added to the library with little or no change to the rest of
the cells. A portion of the TLF file and the LEF file is provided in this appendix to give an
insight into the library development process.

A.1 Using the TLF ﬁle

Ambit Buildgates can be invoked on any workstation at MSU by running these
commands in a console/xterm window: source $SOFT/spr40, followed by ac_shell -gui &
(these commands are subject to change; contact CAD support/unixadmin in case
of any problems).

• The TLF file can be read by Ambit Buildgates using the command read_tlf
amsac_lib.tlf (at the ac_shell prompt), keeping amsac_lib.tlf in the directory from which
Ambit Buildgates is run.

• The Verilog source file can be read using the command read_verilog test.v.

• After setting the timing constraints, a generic design can be built using the
command do_build_generic -all. The generic design can then be mapped to the library
by running do_optimize at the ac_shell prompt.

• A gate-level netlist in Verilog format can be obtained from Ambit Buildgates by
executing write_verilog -hier test.vg. This completes the synthesis process, and the
gate-level netlist can be fed as input to the physical design tool.

A.2 Sample TLF ﬁle

/* Library for Synthesis --
Copyright reserved by Advanced Micro Systems and Circuits Laboratory
Michigan State University

Author: Prasanna Balasundaram
This file will be used for synthesis using Ambit Build Gates.

Version: 1.02 11/03/2003

-Added the load capacitance of the gates which read zero before;

if the cap values are 6+ digits, it was written from the netlist directly
else it was added from the thesis report.

Version: 1.01 Date Unknown

-All cells in the library are recognized by the synthesis tool.
-synthesized netlist functionally matches the source.

*/

header(
library("amsac_lib")
date("Tue Feb 25 11:15:35 2003")
vendor("Michigan State University AMSAC Lab")
environment("com1c_tt_n-n")
technology("AMIC5N 0.3um")
version("1.02")
tlf_version("4.3")
)

Properties (

temperature(25)
voltage(3.0)

/* multipliers and k-factors */

proc_mult(1.0)
temp_mult(1.0)
volt_mult(1.0)

/* threshold definitions */

table_input_threshold (0.5)
table_output_threshold (0.5)
table_transition_start (0.1)
table_transition_end (0.9)
// for_cell(seq for_pin(input slew_limit(warn(2.0) error(2.0))))
// for_cell(comb for_pin(input slew_limit(warn(2.0) error(2.0))))

/* defaults */
load_limit(100.0) /* max output load */
)

/* additional header data */

/* end of header section */
/* ---------- */
cell (nand2
/* cell properties */
/* constraint models */
/* timing models */
timing_model (td_a10_y01_b1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.1222 0.1337 0.1343 0.1617 0.1430 )
(0.1361 0.1429 0.1429 0.2113 0.1835 )
(0.1996 0.2084 0.1978 0.2129 0.2509 )
(0.3292 0.2655 0.3752 0.2848 0.4019 )
(0.3042 0.3046 0.3093 0.3701 0.5635 )
)))

timing_model (td_b10_y01_a1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.1085 0.0893 0.0969 0.1354 0.1427 )
(0.1285 0.1243 0.1309 0.1493 0.1777 )
(0.1473 0.1504 0.1539 0.2143 0.2463 )
(0.2005 0.2223 0.3214 0.3332 0.3019 )
(0.3478 0.2562 0.3107 0.3903 0.4563 )
)))

timing_model (td_a01_y10_b1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.2064 0.2586 0.2580 0.2563 0.3498 )
(0.2216 0.2129 0.2363 0.2916 0.3588 )
(0.2783 0.2941 0.3120 0.3527 0.3823 )
(0.3127 0.3412 0.3309 0.4188 0.4606 )
(0.4270 0.3475 0.3460 0.4755 0.5042 )
)))

timing_model (td_b01_y10_a1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.1851 0.2189 0.2151 0.2682 0.3423 )
(0.2477 0.2350 0.2297 0.2719 0.3727 )
(0.3199 0.3340 0.3446 0.3898 0.4677 )
(0.4322 0.4413 0.4512 0.5526 0.6302 )
(0.4697 0.4872 0.5068 0.6789 0.8177 )
)))

timing_model (ts_a10_y01_b1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.1554 0.1535 0.1576 0.2070 0.2758 )
(0.1733 0.1669 0.1735 0.2198 0.2642 )
(0.2529 0.2584 0.2791 0.2998 0.3282 )
(0.3655 0.4362 0.3833 0.4058 0.4772 )
(0.4756 0.4764 0.6707 0.5750 0.6333 )
)))

timing_model (ts_b10_y01_a1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.0973 0.1421 0.1538 0.1502 0.2103 )
(0.1420 0.1521 0.1658 0.2124 0.2595 )
(0.2311 0.2463 0.2588 0.2917 0.2784 )
(0.2909 0.3753 0.3406 0.3748 0.4284 )
(0.5166 0.5321 0.5135 0.6225 0.6190 )
)))

timing_model (ts_a01_y10_b1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.3012 0.3316 0.3519 0.4498 0.4909 )
(0.2817 0.3222 0.3518 0.4358 0.5818 )
(0.3436 0.3455 0.3574 0.4121 0.5444 )
(0.4183 0.4128 0.4623 0.4862 0.5995 )
(0.4866 0.5302 0.5542 0.6594 0.7938 )
)))

timing_model (ts_b01_y10_a1
(spline (input_slew_axis 0.1 0.2 0.5 1 2)
(load_axis 0 1 2 5 10) (
(0.2913 0.3165 0.3539 0.4473 0.5960 )
(0.2817 0.3239 0.3446 0.3696 0.5802 )
(0.3448 0.3523 0.3788 0.4842 0.6753 )
(0.4516 0.4548 0.4933 0.5644 0.6153 )
(0.6158 0.6389 0.6183 0.7345 0.9436 )
)))

pin(A pintype(input) capacitance(0.719819993816823))
pin(B pintype(input) capacitance(0.719819993816823))
pin(Y pintype(output) capacitance(1.80803994797362) Function(!(A&B)))

/* path definitions */

Path(A => Y 10 01 Delay(td_a10_y01_b1) Slew(ts_a10_y01_b1))
Path(A => Y 01 10 Delay(td_a01_y10_b1) Slew(ts_a01_y10_b1))
Path(B => Y 10 01 Delay(td_b10_y01_a1) Slew(ts_b10_y01_a1))
Path(B => Y 01 10 Delay(td_b01_y10_a1) Slew(ts_b01_y10_a1))
)

A.3 Using the LEF ﬁle

Envisia Silicon Ensemble can be started from any workstation by running the following
commands in a console/xterm window: source $SOFT/dsmse53, source $SOFT/ic446
and sedsm -m=96.

• At the command prompt of Silicon Ensemble, the LEF file is imported into the
database using INPUT LEF FILENAME "amsac_lib.lef" REPORTFILE "importlef.rpt" ;

• Special variables are set using the following commands:
SET VAR INPUT.VERILOG.POWER.NET "vdd!" ;
SET VAR INPUT.VERILOG.GROUND.NET "gnd!" ;
SET VAR INPUT.VERILOG.LOGIC.1.NET "vdd!" ;
SET VAR INPUT.VERILOG.LOGIC.0.NET "gnd!" ;
SET VAR INPUT.VERILOG.SPECIAL.NETS "vdd! gnd! clk" ;

• A sample Verilog file with all the cells in the library is created and imported into the
database. These files need not have functional descriptions, but their pins should
match those in the LEF. INPUT VERILOG FILE "../verilog/amsac_lib.v" LIB "verilog_lib" ;

• The design (synthesized gate-level netlist) is imported into Silicon Ensemble by
executing INPUT VERILOG FILE "test.vg" LIB "cds_vbin" REFLIB "verilog_lib"
DESIGN "cds_vbin.name_of_the_top_module:hdl" ;

• Floorplanning is performed by the command: FINIT FLOOR rowu 0.35 rowsp 6000
blockhalo 2000 a 1 xio 30000 yio 30000 ;

• IOPLACE AUTOMATIC STYLE EVEN ; places the pins along the periphery at
random. The IO constraint file can be modified to place the pins in the desired locations.

• If the design is not heavily constrained, running the commands QPLACE NOCONFIG ;
and WROUTE NOCONFIG ; should complete the place and route process.

• The design can be exported in the LEF, GDSII and DEF (Design Exchange Format)
formats using the following commands:
OUTPUT GDSII MAPFILE amsac_lib.map FILE test.gds2 ;
OUTPUT DEF FILENAME "test.def" ;
OUTPUT LEF BLOCK FILENAME "test.lef" MACRONAME name_of_the_top_module ;

• The design can be imported into dfII by importing the DEF, followed by importing the
GDSII while keeping the layout view open. The source netlist for Silicon Ensemble
can be imported into a schematic, and LVS can be performed there.

A.4 Sample LEF ﬁle

VERSION 5.3 ;
NAMESCASESENSITIVE ON ;
BUSBITCHARS "[]" ;
DIVIDERCHAR "/" ;
UNITS

DATABASE MICRONS 1000 ;
END UNITS

LAYER nwell
TYPE VIRTUAL ;
END nwell

LAYER active
TYPE MASTERSLICE ;
END active

LAYER poly
TYPE MASTERSLICE ;
END poly

LAYER cc
TYPE CUT ;
SPACING 0.9 ;
END cc

LAYER metal1
TYPE ROUTING ;
DIRECTION HORIZONTAL ;
PITCH 3 ;
WIDTH 0.9 ;
SPACING 0.9 ;
RESISTANCE RPERSQ 0 ;
CAPACITANCE CPERSQDIST 0 ;
CURRENTDEN 0 ;

END metal1

LAYER via
TYPE CUT ;
SPACING 0.9 ;
END via

LAYER metal2
TYPE ROUTING ;
DIRECTION VERTICAL ;
PITCH 2.4 ;
WIDTH 0.9 ;
SPACING 0.9 ;
RESISTANCE RPERSQ 0 ;
CAPACITANCE CPERSQDIST 0 ;
CURRENTDEN 0 ;
END metal2

MACRO nand2
CLASS CORE ;
FOREIGN nand2 0.000 0.000 ;
ORIGIN 0.000 0.000 ;
SIZE 9.600 BY 21.000 ;
SYMMETRY X Y ;
SITE CoreSite ;
PIN A
DIRECTION INPUT ;
PORT
LAYER metal1 ;
RECT 1.800 8.400 3.300 9.600 ;
END
END A
PIN gnd!
DIRECTION INOUT ;
USE GROUND ;
SHAPE ABUTMENT ;
PORT
LAYER metal1 ;
RECT 1.800 0.000 3.000 5.100 ;
RECT 0.000 0.000 9.600 3.000 ;
END
END gnd!
PIN B
DIRECTION INPUT ;
PORT
LAYER metal1 ;
RECT 6.300 11.400 7.800 12.600 ;
END
END B
PIN vdd!
DIRECTION INOUT ;
USE POWER ;
SHAPE ABUTMENT ;
PORT
LAYER metal1 ;
RECT 1.800 14.250 3.000 21.000 ;
RECT 6.600 14.250 7.800 21.000 ;
RECT 0.000 18.000 9.600 21.000 ;
END
END vdd!
PIN Y
DIRECTION OUTPUT ;
PORT
LAYER metal1 ;
RECT 4.200 3.900 5.400 16.950 ;
RECT 4.200 3.900 7.800 5.100 ;
END
END Y
END nand2
END LIBRARY

BIBLIOGRAPHY

[1] IEEE Standard for Binary Floating-Point Arithmetic ANSI/IEEE Std 754. IEEE Press,
1985.

[2] IEEE Standard for a Smart Transducer Interface for Sensors and Actuators - Transducer
to Microprocessor Communication Protocols and Transducer Electronic Data
Sheet (TEDS) Formats. IEEE Press, 1997.

[3] Cadence Design Systems. Timing Library Format Reference, October 2000.

[4] A.V. Chavan. An integrated high resolution barometric pressure sensing system.
Technical Report SSEL-313, University of Michigan, 2000.

[5] Yoshikoru Yoshii et al. Integrated software calibrated cmos pressure sensor with mcu,
a/d converter, d/a converter, digital communications port, signal conditioning circuit
and temperature sensor. In Proceedings of Transducers, 1997.

[6] KB Lyahou, G. van der Horn, and J.H. Huijsing. A noniterative polynomial 2-d
calibration method implemented in a microcontroller. IEEE Transactions on
Instrumentation and Measurement, 46(4):752-757, 1997.

[7] O. Machul, D. Hammerschmidt, W. Brockherde, and B.J. Hosticka. A smart pressure
transducer with on-chip readout, calibration and nonlinear temperature compensation
based on spline-functions. In IEEE Integrated Solid-State Circuits Conference, pages
198-199, San Francisco, 1997.

[8] D. Markovic, B. Nikolic, and R.W. Brodersen. Analysis and design of low-energy
flip-flops. In Proceedings of the IEEE/ACM International Symposium on Low Power
Electronics and Design, ISLPED'01, pages 52-55, Huntington Beach, CA, August
6-7, 2001.

[9] S. Microsystems. Numerical computations guide, 1991.

[10] H. Miyatake, M. Tanaka, and Y. Mori. A design for high-speed low-power cmos fully
parallel content-addressable memory macros. IEEE Journal of Solid-State Circuits,
36(6):956-968, June 2001.

[11] M.L. Dunbar. Single chip asics for smart sensor signal conditioning. In Proceedings
of WESCON.

[12] M. Mozek, D. Vrtacnik, D. Resnik, U. Aljancic, M. Cvar, and S. Amon. Calibration
and error correction algorithms for smart pressure sensors. In IEEE MELECON,
Cairo, Egypt, May 7-9 2002.

[13] G.C.M. Meijer and P.C. de Jong. A high-temperature electronic system for
pressure-transducers. IEEE Transactions on Instrumentation and Measurement, 49(2):365-370,
April 2000.

[14] R.L. Schwartz and T. Christiansen. Learning Perl. O'Reilly and Associates,
November 1993.

[15] V. Stojanovic and V. Oklobdzija. Comparative analysis of master-slave latches and
flip-flops for high-performance and low-power systems. IEEE Journal Solid-State
Circuits, 34(4):536-548, April 1999.

[16] E. Swartzlander and T. Callaway. Power-delay characteristics of cmos multipliers. In 13th
IEEE Symposium on Computer Arithmetic, Asilomar, California, USA, July 6-9 1997.

[17] G. Thomas and R. Finney. Calculus and Analytic Geometry. Addison-Wesley, 9th
edition, 1996.

[18] H.K. Trieu, M. Knier, O. deter, H. Kappert, M. Schmidt, and W. Mokwa. Monolithic
integrated surface micromachined pressure sensors with analog on-chip linearization
and temperature compensation. In The Thirteenth Annual International Conference
on Micro Electro Mechanical Systems, volume 13, pages 547-550, Piscataway, NJ,
2000.

[19] J. Zhang, J. Zhou, P. Balasundaram, and A. Mason. A highly programmable sensor
network interface with multiple sensor readout circuits. In Proceedings of IEEE
Sensors 2003, Toronto, Canada, Oct 22-24 2003.