This is to certify that the
dissertation entitled

EFFICIENT TECHNIQUES FOR MODELING AND
MITIGATION OF SOFT ERRORS IN NANOMETER-SCALE
STATIC CMOS LOGIC CIRCUITS

presented by

Srivathsan Krishnamohan

has been accepted towards fulfillment
of the requirements for the

Ph.D. degree in Electrical and Computer
Engineering

 

 


 

Major Professor's Signature

IZ/I G /Zoos"

 

Date

MSU is an Affirmative Action/Equal Opportunity Institution


EFFICIENT TECHNIQUES FOR MODELING
AND MITIGATION OF SOFT ERRORS IN
NANOMETER-SCALE STATIC CMOS LOGIC
CIRCUITS

By

Srivathsan Krishnamohan

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Electrical and Computer Engineering

2005

ABSTRACT

EFFICIENT TECHNIQUES FOR MODELING AND MITIGATION OF
SOFT ERRORS IN NANOMETER-SCALE STATIC CMOS LOGIC CIRCUITS

By

Srivathsan Krishnamohan

Soft errors are changes in logic state resulting from the latching of single-event
transients (SETs) caused by high-energy particle strikes or electrical noise. Due
to scaling of minimum feature size, supply voltage, and clock frequency, soft error
rate (SER) is expected to increase by several orders of magnitude in combinational
and sequential logic circuits in the near future. In this dissertation, we address the
following three important issues related to logic soft errors: (1) modeling of SETs
generated in combinational logic blocks (CLBs), (2) efficient design techniques to
reduce the SER of CLBs, and (3) analysis and design of soft-error hardened latches.

Our main contributions in modeling and mitigation of soft errors in logic circuits are
as follows. (1) A fast and accurate lookup-table (LUT) based approach to estimate
SET width, which is necessary to gauge the effectiveness of time-redundancy based
SER mitigation techniques. The LUT provides more than 1000 times speedup over
HSPICE simulations and has less than 10% error, compared to existing techniques
which have 15% or more error, without significantly increasing LUT size. (2) An
efficient and systematic error masking (EM) technique that samples selected non-
critical primary outputs (POs) of a CLB three times using delay-chain-generated
control signals and then majority votes on them within the slack available in a cycle to
mask errors. Hence, it incurs no performance overhead, does not perform redundant
computation, and can mask SETs of width less than half the slack available at a
PO. The average SER reduction from EM on ISCAS85 circuits is 82.67%. Other
significant features of this technique include: (a) efficient triple sampling and majority
voting and (b) exploitation of circuit timing dependence upon input vector and using
non-uniform slack passing/borrowing in pipelined circuits to further reduce SER. (3)
A method that supports error masking plus efficient error detection and recovery
(EM+EDR) and is suitable for CLBs with a small fraction of non-critical POs. In
this case, EM is applied to POs with sufficient slack and EDR to critical or near-
critical POs. EM+EDR can tolerate SETs with width up to half the clock period
and provides an average SER reduction of 93.78% on ISCAS85 circuits. When a
soft error occurs, a very low-likelihood event for an application run, and is detected,
EM+EDR recovers from it within a single clock cycle. (4) Design of an efficient and
robust delay chain to produce phase-shifted clock signals for use in our EM/EM+EDR
techniques. (5) Finally, a comprehensive analysis of a number of existing soft-error
hardened latch designs using a variety of metrics and some new designs, the best
of which is vulnerable to only single-event multiple upsets, and which has a delay
overhead of 12% and consumes only 70% power compared to a standard latch.

Our SER mitigation work represents a significant advancement over previous
approaches which, in contrast, rely on introducing explicit hardware or time redundancy
or on redundant computation, often both. Consequently, our methods provide
substantial energy and performance/hardware advantages while significantly reducing
logic SER.

Acknowledgements

It has been an enriching experience working with my advisor over the course of
my Ph.D. and also interacting with him during the courses he has taught me. I
deeply appreciate his patience with me and his being very liberal with his time while I
was looking around for a dissertation problem, as I used to discuss with him in every
meeting a new topic that I had got excited about. His positive outlook, view that
every problem is solvable, and attention to detail are things that I have admired, and
I have tried to imbibe some of these qualities myself. His insistence on concise and
precise writing, whose importance was initially lost on me, has led me to learn much
of this skill.

I would like to thank Professors Anthony Wojcik, Michael Shanblatt, Peixin Zhong,
and Shantanu Chakrabartty for consenting to be on my Ph.D. committee. I'm also
grateful to them for being very co-operative with the scheduling of my proposal and
dissertation defense and for reviewing my dissertation at short notice. Their valuable
technical suggestions during my proposal defense, and their advice on the importance
of good writing, helped me to improve this final dissertation. Also, the courses taught
by Professor Wojcik on computer architecture and Professor Chakrabartty on low-
power mixed signal design helped broaden my knowledge in these areas.

I would like to thank Dr. Rajeev Murgai for providing me an opportunity to do my
internship at Fujitsu Labs, which has been a valuable experience and has helped me
to improve the dissertation research tremendously. My interaction with Dr. Rajeev
Murgai taught me the rigors of industrial research and also showed that "Most of us
can easily do two things at once; what's all but impossible is to do one thing at once"¹.
He has been more than a technical mentor to me, and my interaction with him has
helped me to develop many positive qualities (least of which is my forehand smash
in ping-pong!). Also, I would like to thank William Walker for his guidance and
funding during my internship, and Subodh Reddy for helping me to improve my coding
skills and making my stay in and outside of Fujitsu Labs enjoyable. I'm grateful to Dr.
Dipesh Patel, Stuart Biles, and Dr. Daryl Bradley for offering me the opportunity
to investigate my ideas with them in ARM R&D during Summer 2005, and for their
invaluable inputs. Also, I appreciate Dr. Dipesh Patel's kindness in offering to extend
my internship at their Sunnyvale office.

Thanks are also due to the other members of my lab, Krishnan, Gandhi, and Sandeep,
for making my stay at MSU memorable and for all the fun we had over the last
two years. It was a pleasure working on my Ph.D. alongside these guys, and the
productive working atmosphere they provided made my stay in the ACAC lab
enjoyable.

My experience at STMicroelectronics was an enriching one which gave me a deeper
insight into the field of chip design and one which developed me professionally. My
colleagues Murthy, Paolo, Wreeju, Arvind, Vijay and others in CMG-MCD made my
ST years memorable. My colleague Vijay has been a great friend since my STMicro
days and I'm deeply grateful for his kindness in providing all sorts of help and for his
enjoyable company whenever I have visited the Bay Area. I am also grateful to Shanker
and my other undergraduate friends from BITS Pilani in Intouch, for their great
company and support during my BITS years and through my graduate study. Many
thanks also to Dr. Sridhar from SUNY Buffalo, whose words of encouragement and
advice have helped me in many of my endeavors.

¹ Mignon McLaughlin, The Second Neurotic's Notebook, 1966

My brother Srikanth has been a beacon through my life. The confidence he instilled
in me early on and his constant motivation through the Ph.D. have been a big driving
force for my timely completion of the dissertation. Also my parents have helped me
reach this far by instilling in me right from the beginning that "The virtue of all
achievement is victory over oneself", and by supporting me in whatever endeavor I
have taken up. This dissertation is dedicated to my parents and brother for their
support and affection all these years. None of this would have been possible without
the guidance and inspiration from my brother Srikanth and my parents.

TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES

1 INTRODUCTION
  1.1 Radiation-Induced Soft Errors: Causes, SET Generation, and SER Calculation
  1.2 SET Propagation, Masking, and Latching in Logic Circuits
  1.3 SER Scaling Trends for Combinational Logic, Latches, and Memories
  1.4 Our Contributions
    1.4.1 Logic Circuit SER Estimation
    1.4.2 Efficient Soft-Error Mitigation Techniques for Combinational Logic
    1.4.3 Robust Delay Chain Construction
    1.4.4 Hardening of Latches for Soft Errors
  1.5 Dissertation Outline

2 Modeling and Analysis of Soft Errors in Logic Circuits
  2.1 Simulation Setup
    2.1.1 HSPICE Modeling of I(t)
  2.2 Sensitivity of SET Width
    2.2.1 Gate Inputs
    2.2.2 Output Load Capacitance
    2.2.3 Charge Collected
    2.2.4 Gate Size
  2.3 Lookup Table
  2.4 Accuracy of the LUT Model
  2.5 Regression for Supply Voltage Variation
  2.6 Conclusion

3 Error Masking for SER Mitigation
  3.1 Introduction
  3.2 Related Work
    3.2.1 Self-Checking Designs
    3.2.2 Architectural Techniques
    3.2.3 Gate and Circuit-Level Techniques
  3.3 Time Redundancy Based Error Masking
    3.3.1 Output Sampling and Majority Voting
  3.4 Delay Chain
  3.5 Simulation Results
    3.5.1 Extension of LUT to Calculate SET Width at Primary Output
    3.5.2 Critical Charge and Transient Pulse Width Calculation
    3.5.3 SER Calculation of Complete Circuit
    3.5.4 SER Reduction Using Error Masking
  3.6 Conclusion

4 Combining Error Masking and Error Detection Plus Recovery
  4.1 Introduction
  4.2 Related Work
    4.2.1 Error Masking
  4.3 Techniques to Combine Error Masking and Error Detection Plus Recovery
    4.3.1 Error Detection and Recovery on a Single Path
    4.3.2 Circuits for Error Detection and Recovery
  4.4 Techniques to Enhance Error Masking
    4.4.1 Exploiting Circuit Timing Dependence on Input Vector
    4.4.2 Slack Redistribution to Enhance Error Masking
  4.5 Simulation Results
    4.5.1 SER Reduction Using Slack Redistribution
  4.6 Conclusion

5 Robust Delay Chain Construction
  5.1 Introduction
    5.1.1 Delay Elements
  5.2 Yield Definition
  5.3 Parameters Studied
  5.4 Simulation Methodology
  5.5 Delay Element Analysis and Yield Results
    5.5.1 Transmission Gate Based Delay Element
    5.5.2 Cascaded Inverter Based Delay Element
    5.5.3 NP-Voltage Controlled Delay Element
  5.6 Comparison of Delay Elements
    5.6.1 Effect of VDD and Gate Length Variation
    5.6.2 Effect of VDD and Width Variation
  5.7 Control Signal Generation and Distribution from Delay Chain
  5.8 Conclusion

6 Analysis and Design of Soft Error Hardened Latches
  6.1 Simulation Methodology
    6.1.1 Latch Delay and Power Calculation
  6.2 Comparison of Latch Designs
    6.2.1 SEU Tolerant Latch
    6.2.2 Soft Error Hardened Latch Scheme for SoC
    6.2.3 Dual Interlocked Storage Cell
    6.2.4 Single Event Resistant Topology Latch
    6.2.5 Other Latch Designs
  6.3 New Latch Designs with Soft-Error Immunity
    6.3.1 Customizing Latches for Performance and Power Requirements
  6.4 Conclusion

7 Conclusion
  7.1 Key Contributions
  7.2 Future Work

BIBLIOGRAPHY

LIST OF FIGURES

1.1 Mechanism for SET generation
1.2 Delay fault and latching window of gates in a logic circuit
1.3 Long-term estimates from ITRS for the supply voltage and clock frequency
    of DRAMs, microprocessors, and ASICs used in high performance and low-
    power applications [1].
1.4 Permanent fault FIT rates for microprocessors, SRAMs and DRAMs
2.1 Circuit setup used to measure the sensitivity of SET width to various
    parameters.
2.2 Junction current waveform for different time constants.
2.3 Current waveform for different models
2.4 Q_crit dependence on input vectors applied
2.5 SET width variation for various output loads
2.6 SET width variation for different gate drive strengths
2.7 Surface described by three co-ordinates X, Y, and Z corresponding
    to Q, CL, and SET width. Q and CL values in the middle of LUT
    indices are used to test the accuracy of the LUT. The figure shows the
    neighboring points and the surface formed by them, when the LUT is
    indexed using Q = 10 fC and CL = 25 fF.
2.8 Variation of SET width for supply voltage perturbation
3.1 Existing temporal sampling latches
3.2 Latching window probability for SETs using multiple sampling error
    masking schemes
3.3 Flip-flop for error masking
3.4 Generation of control signals C and C
3.5 Effect of particle strikes on delay chain
3.6 Simulation setup for generating three dimensional LUT
4.1 Flip-flop used for error detection and recovery
4.2 Latch-based pipeline with dead time
4.3 Latch-based pipeline with dead time being used for error masking
4.4 SER reduction results for time borrowing
4.5 Algorithm for time borrowing to reduce SER
5.1 Schematic diagram of a transmission gate.
5.2 Delay of transmission gate for different iterations of a MCS.
5.3 Schematic diagram of a cascaded inverter.
5.4 Delay of cascaded inverter for various Monte Carlo iterations.
5.5 Schematic diagram of a NP-voltage cascaded inverter
5.6 Delay distribution of NP-voltage controlled delay element for various
    Monte Carlo iterations
5.7 The number of flip-flops driven by each delay tap in the delay line of
    c7552. Two separate delay chains (DL1, DL2) are used to prevent soft
    errors from occurring due to particle strikes on the delay chain itself.
5.8 Delay versus fanout for an inverter in TSMC 0.18 micron technology.
    The absolute value of the parasitic delay of an inverter is the
    Y-intercept of the line shown, and has a value of 26.4 ps.
5.9 Graph used to find the best stage effort.
6.1 Basic transmission gate latch used to normalize delay and power values
    of other latch designs. The delay and power values were measured by
    connecting a FO4 inverter at the latch output
6.2 Schematic of single event upset tolerant latch.
6.3 Schematic of soft error hardened latch.
6.4 Schematic of dual interlocked storage cell.
6.5 Schematic of single event resistant topology.
6.6 Hardening of the feedback node in a latch
6.7 Proposed latch designs for soft error tolerance
6.8 Stick diagram for layout of latch A
6.9 Improved latch designs with higher soft error tolerance
6.10 Stick diagram for layout of latch C
6.11 Customized latch designs trading off speed and power

LIST OF TABLES

1.1 Q_crit and Q_s in fC of combinational logic, latches, and SRAMs for
    different technology nodes
2.1 Percentage error for both PWL and exponential current sources.
2.2 Percentage error for interpolation from a LUT with both uniform and
    non-uniform interval between indices.
3.1 SER reduction for ISCAS85 circuits due to error masking.
4.1 SER reduction for ISCAS85 circuits. The power overhead in practice
    would be lower than the one presented above due to: (1) The original
    power has been estimated using zero delay model, which does not take
    into account glitchy or partial transitions. (2) The leakage energy,
    which has not been taken into account, consumed by the overhead
    circuit is far lower than the leakage of the CLB, due to fewer components.
5.1 Parameter variations for the process considered.
5.2 Mean delay and variability of the delay elements when VDD variation
    is 10% and 20%, and gate length variation is 10%.
5.3 Mean delay and variability of the delay elements when VDD and gate
    width variation are 10%.
6.1 Delay and power overhead of the proposed latch designs.
6.2 Delay and power overhead of the customized latch designs.

CHAPTER 1

INTRODUCTION

Designers strive to deliver ever-higher performance systems cost-effectively by
leveraging technology scaling to meet end-user application needs. However, with
unprecedented levels of device integration (~10^9 transistors/chip) and scaling of minimum
feature size (~10s of nm), clock frequency (~10s of GHz), and supply voltage
(VDD < 1 V), the transient fault or soft-error rate (SER) of logic circuits is becoming
a dominant reliability challenge even in commodity processors. Soft errors are
changes in logic state resulting from the latching of single-event transients (SETs,
i.e., transient voltage fluctuations at a logic node) caused by electrical noise or external
radiation. Unlike hard errors (arising from, say, electromigration, hot carrier effects, or
dielectric breakdown), they do not result in permanent damage of components. In this
dissertation, we are concerned with static CMOS circuit soft errors. Although most
of our discussion and mitigation techniques apply to soft errors due to either source
(i.e., electrical noise or external radiation), our focus is on radiation-induced errors,
particularly those resulting from high-energy neutron strikes. This is because, as
explained in Sec. 1.1, high-energy neutron strikes represent the most important source
of soft errors and their effects are well modeled, allowing us to accurately analyze the
effectiveness of our soft error mitigation techniques. Background on radiation-induced
soft errors and a brief discussion of our contributions relative to previous work follow
next.

1.1 Radiation-Induced Soft Errors: Causes, SET Generation, and SER Calculation

Soft errors are caused by electrical noise (e.g., due to crosstalk and IR or Ldi/dt
supply noise), electromagnetic interference, and external radiation, with the latter
being the most important source and our main focus. The operating environment
of a semiconductor chip contains background radiation from cosmic rays, low energy
thermal neutrons, and radioactive traces present in chip packaging material. This
radiation comprises electrons, protons, neutrons, pions, muons, alpha particles, and
other particles. The following two types of effects have been observed from this radiation.

• Total dose effects (TDEs): These result from the interaction of ionizing
  radiation with device materials. TDEs can cause changes in transistor threshold
  voltage and decrease the mobility of carriers in the channel, and hence
  the transconductance and gain of a transistor. Gain degradation results in
  increased propagation delay of a gate. TDEs cause changes in the transistor and
  circuit characteristics over a long period of time, which do not lead to failure in
  commodity chips. Hence, they are not considered in this dissertation.

• Single event effects (SEEs): These result from the interaction of a high-
  energy particle passing through a device. SEEs can cause single event upsets
  (SEUs) or single event latchups (SELs). SEUs are reversible bit-flips in a latch
  or memory element that change the logic state of a circuit. Changes in the
  logic state of a circuit due to SEUs are called soft errors, in contrast to hard
  errors, which are irreversible. Soft errors are reversible since their effects can
  be removed by resetting or rewriting the memory elements. SEL occurs when
  the injected charge activates the parasitic PNPN structure that exists in bulk
  CMOS transistors. SEL is not a threat to SOI devices and can be prevented in
  bulk CMOS by using thin epitaxial layers or guard rings [2].

Various ionizing particles generate electron-hole pairs through different
mechanisms. An alpha particle ionizes the atoms in a chip's substrate through the
electromagnetic force between itself and the valence electrons. However, high-energy cosmic-ray
neutrons and protons collide with nuclei within the silicon substrate and generate
secondary particles capable of ionizing silicon atoms. A neutron has to encounter, on
average, 10^10 atoms before it hits a nucleus. Based on the density of silicon (which is
2.3 g/cm^3) and the absorption length for neutrons, the average distance a neutron
has to travel in a silicon substrate before it hits a nucleus is on the order of tens of
centimeters. This explains the fact that neutrons, which have enough energy to pass
through shields that are hundreds of centimeters thick, do not easily cause logic upsets
at multiple nodes in a circuit block.

The secondary particles generated by neutrons traverse and ionize the silicon
substrate, in the process creating a track of excess electron-hole pairs, with the average
ionization energy (i.e., the average energy the particle must expend to ionize a silicon
atom) being 3.6 eV/electron-hole pair in silicon. Some of these electron-hole pairs can
eventually reach a reverse-biased pn junction of a sensitive node, such as a gate output
as shown in Figure 1.1. When that happens, the majority carriers are reflected from
the depletion region of the pn junction while the minority carriers are swept across
the junction through the drift mechanism. This causes a net current across the
depletion region that can either charge or discharge a logic node. For example, electrons
created near the drain node of an NMOS transistor connected to VDD pull the drain
node to GND. In the case of static nodes, the electrons recombine with holes shortly
thereafter, and the drain node is charged back to VDD. For static nodes, this creates a
transient glitch called a single-event transient (SET), whose amplitude and
duration depend upon the strength of the driving gate and the output capacitance that
it drives. Recently, Sun and Intel announced that chip packages with alpha-particle
emitting lead have been minimized. Therefore, high-energy cosmic rays and the
neutrons present in them are expected to become the primary source of soft errors in the
future. The methodology and schemes we propose in this dissertation are applicable
to soft errors caused by sources other than neutrons too. However, we evaluate most of
our proposed schemes based on radiation-induced errors, particularly those resulting
from high-energy neutron strikes, since SET generation mechanisms due to neutron
strikes have been studied for a long time and are well understood. As a result,
sufficient information for calculating SERs of CMOS process technologies is available
for neutron strikes.
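
To make the 3.6 eV per electron-hole pair figure concrete, the following Python sketch converts the energy a particle deposits in silicon into the number of pairs generated and the corresponding charge. The function names and the 1 MeV example are illustrative assumptions, not values from this dissertation, and the result is only the generated charge: the charge actually collected at a junction is generally a fraction of it.

```python
# Elementary charge in coulombs (exact SI value).
ELEMENTARY_CHARGE_C = 1.602176634e-19
# Average energy per electron-hole pair in silicon, as quoted in the text.
EV_PER_PAIR_SI = 3.6


def pairs_generated(deposited_energy_ev: float) -> float:
    """Number of electron-hole pairs created by the deposited energy (in eV)."""
    return deposited_energy_ev / EV_PER_PAIR_SI


def charge_generated_fc(deposited_energy_ev: float) -> float:
    """Charge of one carrier polarity, in femtocoulombs.

    This is an upper bound on what a sensitive node could collect.
    """
    return pairs_generated(deposited_energy_ev) * ELEMENTARY_CHARGE_C * 1e15


if __name__ == "__main__":
    energy_ev = 1.0e6  # hypothetical 1 MeV deposited by a secondary particle
    print(f"pairs:  {pairs_generated(energy_ev):.3e}")
    print(f"charge: {charge_generated_fc(energy_ev):.1f} fC")
```

Under these assumptions, 1 MeV of deposited energy corresponds to roughly 44.5 fC of generated charge.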

Figure 1.1. An ionizing particle passing through a silicon substrate, generating
holes and electrons. An SET is generated when the holes and electrons collect around a pn
junction of a drain node as shown. (Labels in the figure: gate, source, drain, substrate,
and path of ionizing particle.)

The basic or raw soft-error rate (SER) of CMOS circuits due to cosmic ray neutrons
can be calculated using the following equation from [3]:

\[
\mathrm{SER}(Q_{crit}) = K \times F \times A \times \exp\left(-\frac{Q_{crit}}{Q_s}\right), \qquad (1.1)
\]

where K is a technology-independent constant, F is the neutron flux, A is the
sensitive device area, Q_crit is the critical charge, and Q_s is the charge collection slope
for the technology, which is strongly dependent on doping and supply voltage. The
critical charge Q_crit is defined as the minimum charge required to cause a logic upset
in memory elements, and in logic circuits, it is the minimum charge which generates
SETs that can change the value stored in a latch or flip-flop. The sensitive device area
is equal to the sum of node areas where charge collection can lead to SET generation
in logic gates, or logic upset in latches and memory elements.
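
As a rough illustration of Eqn. 1.1, the Python sketch below evaluates the exponential form SER = K × F × A × exp(-Q_crit/Q_s) and the SER ratio between two design points. The function names and all numerical values are placeholders chosen for illustration; they are not taken from this dissertation or from [3], and the ratio helper assumes K, F, and Q_s stay fixed between the two points.

```python
import math


def raw_ser(k: float, flux: float, area: float, q_crit: float, q_s: float) -> float:
    """Raw SER per Eqn. 1.1; q_crit and q_s must share the same charge unit."""
    return k * flux * area * math.exp(-q_crit / q_s)


def relative_ser(area_ratio: float, q_crit_old: float, q_crit_new: float,
                 q_s: float) -> float:
    """SER_new / SER_old when only the sensitive area and Q_crit change."""
    return area_ratio * math.exp((q_crit_old - q_crit_new) / q_s)


if __name__ == "__main__":
    # Hypothetical example: the sensitive area halves while Q_crit drops
    # from 10 fC to 6 fC, with Q_s held at 4 fC.
    ratio = relative_ser(area_ratio=0.5, q_crit_old=10.0, q_crit_new=6.0, q_s=4.0)
    print(f"SER changes by a factor of {ratio:.2f}")
```

The sketch shows the two competing trends in the equation: a smaller sensitive area lowers SER linearly, while a smaller critical charge raises it exponentially.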

1.2 SET Propagation, Masking, and Latching in Logic Circuits

Soft errors can occur in both sequential circuits (latches and flip-flops) and
combinational logic blocks (CLBs). A particle strike causing a single event upset (SEU)
at a latch, when the latch is in hold mode, results in an error. However, a particle
strike at a gate leads to the generation of a single-event transient (SET). SETs are
transient voltage fluctuations, such as a 1→0 or 0→1 logic flip, occurring at a gate output. Such SETs,
apart from being generated by radiation, can also occur due to electrical noise such
as crosstalk and IR and Ldi/dt supply noise. Soft errors in CLBs result from SETs
changing the value stored in memory elements, such as latches, flip-flops, or register
files. For an SET originating in a CLB to cause a soft error, it must propagate to a
primary output (PO) gate and be finally captured by an output latch. However, a
soft error will not occur if the SET is either: (1) logically masked: some other input
of a gate in the SET propagation path determines its output instead of the SET; (2)
timing window masked: the SET does not arrive around the closing clock edge and
hence is not captured by the output latch; or (3) electrically masked: the amplitude of
the SET is not sufficient to cause a state change at one of the propagating gates or
at the output latch.

In a static CMOS circuit, since the SET can get latched only if it arrives around the
clock closing edge, soft errors in combinational circuits are synchronous. Synchronous
soft errors can be further classified into the following two types [4].

1. Delay faults: These are caused by transient pulses with width smaller than
   the latching window time (sum of latch setup and hold times). Such transients
   cause an error by delaying the arrival of the PO signal at the output latch. This
   leads the PO to transition after the setup time as shown in Figure 1.2.

2. Functional faults: These occur when the transient pulse is wider than and
   overlaps the latching window of the output latch. The value stored in the
   output latch gets flipped, which changes the logic state.

Figure 1.2. Delay fault and latching window of gates in a logic circuit. The figure shows
both correct and incorrect operation of a logic circuit. During correct operation, the data
corresponding to vector i transitions a setup time before the clock closing edge. Incorrect
operation results due to a delay fault, which pushes the data D of flip-flop 1 to transition
after the clock closing edge. Also, the figure shows the latching window of gates in a pipeline
stage. The latching window of gates located farther from the PO occurs earlier in the clock
cycle time.

An SET has two properties associated with it that determine whether it gets latched
at the primary output:

1. Spatial: whether the SET originates on a critical or non-critical path of the
   CLB.

2. Temporal: whether the SET originates at a gate output before or after the
   output has settled.

These two properties strongly influence the likelihood of an SET not getting masked
due to the latching window effect. Each gate has a latching window as shown in
Figure 1.2, during which time the error pulse has to pass through the gate in order not
to get latching window masked. The latching window of a gate extends approximately
from t_clk_edge - t_setup - t_TTL to t_clk_edge + t_hold - t_TTL, where t_clk_edge is the time of the clock
closing edge, t_setup and t_hold are the setup and hold times of a latch, respectively,
and t_TTL is the time to latch, i.e., the propagation delay from the gate output to a
latch. Any SET with width w passing through the gate has to completely overlap
this latching window to cause a logic flip at the output latch. The latching window of a
gate can begin at or after the time the actual output of the gate passes through it.
For gates on a critical path, the latching window overlaps with the time when the actual
gate output settles, while for gates on non-critical paths, the latching window occurs after
the actual gate output settles. An SET generated at any gate before its output settles
would get latching window masked, unless its pulse width extends to overlap the
complete latching window of the gate.
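
The latching-window condition described above can be sketched in Python as follows. The sketch takes the window of a gate to run from the clock closing edge minus the setup time minus the time to latch, up to the clock closing edge plus the hold time minus the time to latch, and reports whether an SET completely overlaps it. The class and function names and the timing numbers are hypothetical, and the model assumes the pulse reaches the latch unchanged (no logical or electrical masking, no pulse broadening or attenuation).

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LatchTiming:
    """Timing of the output latch; all values in the same unit (e.g., ps)."""
    t_clk_edge: float  # time of the clock closing edge
    t_setup: float     # latch setup time
    t_hold: float      # latch hold time


def gate_latching_window(timing: LatchTiming, t_ttl: float) -> Tuple[float, float]:
    """Latching window seen at a gate whose time to latch is t_ttl."""
    start = timing.t_clk_edge - timing.t_setup - t_ttl
    end = timing.t_clk_edge + timing.t_hold - t_ttl
    return start, end


def set_is_latched(set_start: float, set_width: float,
                   timing: LatchTiming, t_ttl: float) -> bool:
    """True if the SET completely overlaps the gate's latching window."""
    start, end = gate_latching_window(timing, t_ttl)
    return set_start <= start and set_start + set_width >= end


if __name__ == "__main__":
    timing = LatchTiming(t_clk_edge=1000.0, t_setup=50.0, t_hold=30.0)
    # Gate 200 ps away from the latch: window is (750.0, 830.0).
    print(gate_latching_window(timing, t_ttl=200.0))
    print(set_is_latched(740.0, 100.0, timing, t_ttl=200.0))  # covers the window
    print(set_is_latched(760.0, 100.0, timing, t_ttl=200.0))  # starts too late
```

A pulse narrower than the sum of the setup and hold times can never satisfy this full-overlap condition, which is consistent with such pulses being classified as delay faults rather than functional faults.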

The SER of a system is usually measured in failures in time (FITs) or mean time
between failures (MTBF). FIT is defined as the number of failures in one billion
hours of operation, while MTBF is the mean time between two successive failures.
For example, an MTBF of 1000 years equals a FIT rate of 114 (10^9 / (24 × 365 ×
1000)). A fault-tolerant system with infinite MTBF corresponds to zero FIT. FIT is
more commonly used by VLSI designers because it is additive (i.e., the FIT rate of a
system is obtained by adding the FIT rates of individual components), unlike MTBF.
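The conversion between MTBF and FIT, and the additivity of FIT, can be checked with a
short sketch. The helper names are illustrative, and a year is taken as 24 × 365
hours, as in the example above.

```python
HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 1e9


def mtbf_years_to_fit(mtbf_years):
    """FIT = failures per 10^9 hours = 10^9 / MTBF (in hours)."""
    return BILLION_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit):
    """Inverse conversion; an infinite MTBF corresponds to zero FIT."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def system_fit(component_fits):
    """FIT is additive: the system FIT is the sum of the component FITs."""
    return sum(component_fits)


fit_1000y = mtbf_years_to_fit(1000)
print(round(fit_1000y))  # about 114 FIT

# Two such components in one system: the FIT doubles and the MTBF halves.
two_components = system_fit([fit_1000y, fit_1000y])
print(fit_to_mtbf_years(two_components))
```

The second result shows why MTBF is not additive: combining two components with a
1000-year MTBF gives a system MTBF of about 500 years, not 2000.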

1.3 SER Scaling Trends for Combinational Logic, Latches, and Memories

The SER of memories, latches, and combinational logic scales differently with
advancing process generations; minimum physical gate length is used to demarcate
different process generations. As can be seen from Eqn. 1.1, SER is (i) linearly
proportional to the sensitive device area A and (ii) exponentially dependent on the
ratio Qcrit/Qs. When Qcrit/Qs is close to one, SER scaling is dominated by the
sensitive device area. The sensitive device area decreases quadratically with
shrinking feature size and, based on scaling trends, is 50% smaller than that of the
immediately preceding process generation. The critical charge depends only on the
charge stored (Qstored = C × VDD) at a dynamic node, and on both Qstored and the
charge dissipation capability at a static node. Critical charges are decreasing
nonlinearly with each process generation due to diminishing Qstored. This can be seen
from the International Technology Roadmap for Semiconductors (ITRS) prediction of
supply voltage and clock frequency scaling shown in Figure 1.3. The supply voltages
for high-performance and low-power applications are expected to drop to 0.7 V and
0.5 V, respectively, at the 7 nm technology node, while the clock frequency scales to
55 GHz. This would lead to a significant reduction in Qcrit, which implies that
particles of lower energy, which have higher flux, can cause SEUs. In addition, it
has been experimentally verified that the SER of logic circuits increases linearly
with clock frequency [5]. Thus, decreasing VDD and increasing clock frequency lead to
higher SERs in future technologies.

[Figure 1.3 plot: "Supply Voltage and Frequency Scaling". Series: High Performance,
Low-Power Applications, Local Clock Frequency. X-axis: Physical Gate Length (nm),
from 45 down to 7.]

Figure 1.3. Long-term estimates from ITRS for the supply voltage and clock frequency
of DRAMs, microprocessors, and ASICs used in high-performance and low-power
applications [1].

The charge collection efficiency Qs scales approximately linearly with device size on
a log-log scale [3]. The values of Qcrit and Qs for both logic and memories at
different technology nodes, reproduced from [6], are given in Table 1.1. The first
five rows give the critical charge for combinational logic, latches, and SRAMs in
femtocoulombs (fC). The last row gives the value of the charge collection slope in fC
for the different process generations. The value of Qcrit for latches and SRAM cells
becomes smaller than Qs at the 100 nm and 180 nm technology nodes, respectively, and
hence, as feature size and device area scale down, the SER of memory elements
reduces. However, in logic gates, Qcrit is more than Qs, due to which a small
decrease in critical charge increases SER by orders of magnitude.

Circuit Element            600 nm  350 nm  250 nm  180 nm  130 nm  100 nm  70 nm  50 nm
Qcrit of Logic: 16 FO4s    N/A     676     489     250     116     61.3    24.0   10.40
Qcrit of Logic: 4 FO4s     4160    509     336     131     63.9    35.2    16.0   7.02
Qcrit of Logic: 0 FO4s     1130    386     265     99.3    48.8    27.3    13.2   5.57
Qcrit of Latches           360     120     82.4    31.9    15.0    7.96    3.73   1.66
Qcrit of SRAM              146     48.8    33.7    12.9    6.31    3.43    1.52   0.67
Qs                         52.3    34.6    26.8    17.2    12.2    9.53    7.19   5.54

Table 1.1. Qcrit and Qs in fC of combinational logic, latches, and SRAMs for different
technology nodes.
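The crossover points quoted in the text can be read directly off Table 1.1. The
sketch below copies three rows of the table and the Qs row and reports the first
technology node at which Qcrit drops below Qs; the variable and function names are
illustrative only.

```python
NODES_NM = [600, 350, 250, 180, 130, 100, 70, 50]

# Values in fC, copied from Table 1.1.
Q_S = [52.3, 34.6, 26.8, 17.2, 12.2, 9.53, 7.19, 5.54]
Q_CRIT = {
    "logic, 0 FO4s": [1130, 386, 265, 99.3, 48.8, 27.3, 13.2, 5.57],
    "latch": [360, 120, 82.4, 31.9, 15.0, 7.96, 3.73, 1.66],
    "SRAM": [146, 48.8, 33.7, 12.9, 6.31, 3.43, 1.52, 0.67],
}


def first_node_below_qs(q_crit):
    """Largest node (in nm) at which Qcrit < Qs, or None if it never happens."""
    for node, qc, qs in zip(NODES_NM, q_crit, Q_S):
        if qc < qs:
            return node
    return None


for name, row in Q_CRIT.items():
    ratios = [round(qc / qs, 2) for qc, qs in zip(row, Q_S)]
    print(name, "Qcrit/Qs:", ratios, "crossover:", first_node_below_qs(row))
```

For the unloaded logic gate the ratio stays above one at every tabulated node, which
is the regime in which the exponential dependence on Qcrit/Qs dominates.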

In current technologies, the soft-error contribution of latches and SRAMs far exceeds
that of CLBs. Mitra and others estimate the contribution of combinational logic,
latches, and unprotected SRAM for a commercial state-of-the-art processor to be 11%,
49%, and 40%, respectively [7]. But as technology scales and clock frequencies
increase, an SET from a CLB has a higher likelihood of latching because of
diminishing Qcrit and a diminishing probability of latching-window masking. This is
expected to make the SER contribution of logic much larger than that of memories.
Shivakumar and others have shown that the SER per chip of combinational logic
circuits will increase by nine orders of magnitude when the minimum feature size
scales from 600 nm to 50 nm, becoming comparable to the SER per chip of unprotected
memory elements [6]. In addition to logic SER scaling, the SER per latch or SRAM bit
is expected to stay the same or decrease in future technology generations [8, 9].
Although the critical charge per latch or SRAM bit decreases, the smaller
cross-sectional area of these devices reduces the probability of an SEU occurring.
However, due to the increasing number of latches and SRAM nodes per chip, the
contribution of memory elements to total chip SER is estimated to increase in the
future, but at a rate lower than that of combinational logic circuits [10].

Advances in semiconductor process technology have reduced manufacturing-related
failures in chips. The trends observed in permanent fault rates for microprocessors,
SRAMs, and DRAMs are shown in Figure 1.4 [11]. This plot shows that the FIT rate due
to permanent faults clearly decreased over the period 1990-2001. In the case of the
256 KB SRAM, the FIT rate fell by a factor of six, from an initial value of 300 in
1990 to 50 in 2001. This rapid decline in permanent fault FIT rates, together with
technology scaling, is expected to make soft errors very critical in sub-100 nm
designs. In fact, soft errors, if unmitigated, are expected to become the primary
source of failures for sub-90 nm chips, causing a failure rate of 50,000 FITs,
exceeding that of all other reliability mechanisms combined [12].

[Figure 1.4 plot: "Permanent Failure Rates". Series: microprocessors, SRAMs, and
DRAMs of several capacities. X-axis: Year (1990, 1992, 1995, 1997, 2001).]

Figure 1.4. Permanent fault FIT rates for microprocessors, SRAMs, and DRAMs [11].

The impact of soft errors at the system level has been reported in diverse
applications, ranging from satellites to sea-level computer systems, a few of which
are summarized next. A detailed historical review of the experiments done by IBM on
radiation-induced soft fails, and of the failures observed in memories due to soft
errors in the period 1978-1994, is reported in [13]. This paper reports that all the
major memory suppliers of the time, such as Intel, IBM, and Hitachi, observed soft
failures in their memories due to both alpha particles and cosmic rays. As early as
1978, evidence of sea-level soft fails in 16 Kb DRAMs, caused by alpha particles
present in the memory packaging materials, was provided by May and Woods of Intel
[14]. Similarly, in 1980, Hitachi announced that some of their bipolar RAMs failed
under alpha-particle bombardment [13]. In addition, in 1981, IBM discovered
reliability problems with their 16 Kb DRAM memory chips. Sun Microsystems also
observed cosmic ray strikes on unprotected cache memories causing random crashes at
major customer sites in its Enterprise server line [15].

In addition to the memory errors reported above, Fujitsu announced that they have
protected 80% of the 200,000 latches in their fifth-generation SPARC64 processor
fabricated in 130 nm SOI CMOS [16]. Other logic errors in processors were reported by
iROC Technologies when they performed radiation testing on an 8-bit logic core
processor called ROC-CRll [7]. The processor was manufactured for the French space
agency CNES in 180 nm technology. During radiation testing, they observed eleven
logic errors in the processor datapath.

The increase in logic SER affects not only high-performance microprocessors (HPMs),
such as Pentiums, Opterons, SPARCs, and PowerPCs used in workstations and enterprise
servers, with core frequencies greater than 1 GHz and more than 100 million
transistors on-chip, but also embedded processors (such as those from ARM, MIPS
Technologies, and Tensilica) used in consumer, automotive, and networking
applications [17]. The push for tackling soft errors in HPMs is due both to
application factors and to the large number of transistors per chip. In the case of
embedded processors, application factors, the increase in the number of transistors
per chip, and the number of chips used in a system (such as a mobile phone) are
expected to play an important role in scaling the SER of these systems. For example,
when a semiconductor vendor ships a million units of a product with 100 components in
each product, the total FIT over the entire shipment would amount to hundreds of
errors every few hours. The resulting product recalls due to soft errors would lead
to a significant loss in the vendor's revenues and reputation.
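The shipment-level estimate follows from the additivity of FIT. The text does not
state a per-component FIT rate, so the value of 1000 FIT used below is purely a
hypothetical placeholder, as is the function name.

```python
def fleet_failures_per_hour(units, components_per_unit, fit_per_component):
    """Expected failures per hour over a whole shipment.

    FIT counts failures per 10^9 device-hours and adds across components.
    """
    total_fit = units * components_per_unit * fit_per_component
    return total_fit / 1e9


# One million units, 100 components each, and an ASSUMED 1000 FIT per component.
rate = fleet_failures_per_hour(1_000_000, 100, 1000.0)
print(rate, "failures per hour across the shipment")
```

With this assumed rate the shipment sees on the order of a hundred failures per hour,
which is consistent in magnitude with the estimate in the text.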

1.4 Our Contributions

In this dissertation, we address the following three important problems posed by
the dramatic increase in logic SER: (1) modeling of SETs generated in CLBs, (2)
efficient design techniques to reduce the SER of CLBs, and (3) analysis and design of
soft-error hardened latches. These issues and how we have addressed them are briefly
outlined and compared to previous work next.

1.4.1 Logic Circuit SER Estimation

Tackling SER along with the various conflicting nanometer design objectives of power,
performance, and area increases design cost, which has been identified as an
important factor with the potential to limit the semiconductor roadmap. Calculation
of logic SER is essential for devising efficient SER mitigation techniques and can be
used to optimize the different conflicting nanometer objectives, such as power,
performance, and reliability. It can also be used to isolate the most vulnerable
circuitry and apply the design techniques to it.

The SER of SRAM caches can be calculated by determining the Qcrit of each cell
from SPICE simulation and using it in Eqn. 1.1. However, estimating the critical
charge of a combinational logic gate requires the width of the SET generated at the
gate output. We describe a fast and accurate lookup-table (LUT) based methodology
to calculate both the SET width due to particle strikes and the SER reduction that
can be obtained with time-redundancy based mitigation techniques. Previous techniques
for SET width calculation use complex expressions or large LUTs and have greater than
15% error for inputs not close to pre-characterized points. We study the sensitivity
of an SET to various gate and circuit characteristics and determine the parameters to
be used, their spacing, and their lower and upper bounds for constructing the LUT.
The proposed LUT uses non-uniform spacing and surface-based interpolation between its
indices to obtain the SET width generated at a gate and at a primary output. It
provides more than 1000 times speedup over HSPICE simulations and has less than 10%
error, compared to existing techniques, which have 15% or more error.

1.4.2 Efficient Soft-Error Mitigation Techniques for Combinational Logic

For memories, efficient soft-error detection and correction techniques have been
developed thanks to their regular array structures. Commonly used techniques in
memories are error-correcting codes, or parity codes for detection. Using these
techniques to protect combinational logic circuits incurs high cost due to the
irregular structure of logic. Other techniques for logic soft-error protection, such
as triple modular redundancy (TMR) and RE-computing with triplication and voting,
rely on explicit spatial or temporal redundancy. However, most of these techniques
suffer from significant shortcomings: (1) they are meant primarily for error
detection only; (2) they are applicable only to specific classes of circuits such as
arithmetic units; and/or (3) they incur high power, performance, and area overheads.
This necessitates an efficient design approach that makes logic circuits used in
commodity as well as other applications soft-error resilient without adversely
affecting other design considerations such as power and performance.

We propose an efficient and systematic error masking (EM) technique that can
be applied to combinational logic circuits which have a significant fraction of
non-critical primary outputs (POs) with sufficient slack. This error masking
technique prevents an SET pulse of width less than approximately half of the slack
available in the propagation path from latching and turning into a soft error,
without any performance overhead. Previous techniques incur a performance overhead of
2W for masking an SET pulse of width W. We perform error masking only at PO
flip-flops with sufficient slack, which ensures that the delay increase caused by the
addition of a majority voter and control transistors to the flip-flops does not
affect the timing of the circuit. Additionally, our technique uses a single delay
chain to produce phase-shifted signals and sample the POs of a CLB. The results
obtained on ISCAS85 benchmark circuits show an average SER reduction of 82.67% from
the original unprotected circuit.
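The relation between slack and the maskable pulse width can be illustrated with a
behavioral sketch. The sampling arrangement used here (three samples of the PO spaced
by half the slack, the last one at the clock edge, followed by a majority vote) is an
assumption made for illustration; the text above states only that a delay chain
produces phase-shifted sampling signals and that a majority voter is added to the
flip-flops. All names and numbers are hypothetical, with times in picoseconds.

```python
def sample_po(t, correct_value, glitch_start, glitch_width):
    """Value seen at the PO at time t; an SET inverts it for glitch_width."""
    in_glitch = glitch_start <= t < glitch_start + glitch_width
    return (1 - correct_value) if in_glitch else correct_value


def majority(a, b, c):
    """Majority vote over three one-bit samples."""
    return 1 if (a + b + c) >= 2 else 0


def masked_value(t_clk_edge, slack, correct_value, glitch_start, glitch_width):
    """Vote over three samples taken inside the slack (assumed scheme)."""
    delta = slack / 2.0  # spacing between samples
    times = (t_clk_edge - 2 * delta, t_clk_edge - delta, t_clk_edge)
    samples = [sample_po(t, correct_value, glitch_start, glitch_width)
               for t in times]
    return majority(*samples)


# 200 ps of slack: samples at 800, 900, and 1000 ps.
print(masked_value(1000.0, 200.0, 1, 880.0, 80.0))   # narrow SET is outvoted
print(masked_value(1000.0, 200.0, 1, 880.0, 150.0))  # wide SET corrupts two samples
```

In this model a pulse narrower than the sample spacing can corrupt at most one of the
three samples, so the vote recovers the correct value regardless of when the pulse
arrives; all samples are taken within the existing slack, so the clock period is not
extended.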

For CLBs with a small fraction of non-critical POs, we propose a method that
supports error masking plus efficient error detection and recovery (EM+EDR). In
this case, EM is applied to POs with sufficient slack and EDR to critical or
near-critical POs. EM+EDR can tolerate SETs with width up to half the clock period
and provides an average SER reduction of 93.78% on ISCAS85 circuits. When a
soft error occurs (a very low-likelihood event for an application run) and is
detected, EM+EDR recovers from it within a single clock cycle.

1.4.3 Robust Delay Chain Construction

An important component of our EM and EM+EDR techniques is a delay chain used
to generate phase-shifted clock signals for sampling POs. Delay chains are also used
in a variety of other applications. With technology scaling, sub-90 nm process
technologies are introducing increasing variations in designs. This process variation
leads to delay uncertainty. Therefore, delay chains need to be constructed with
robust delay elements. We analyze three different families of delay elements in terms
of their robustness to process variation, and then determine the appropriate delay
element for delay chain construction. The three delay element families are:
(1) transmission gate based, (2) cascaded inverter based, and (3) voltage-controlled.
We compare the effectiveness of the delay elements in terms of yield, which is
defined as the number of circuits within the specified delay range. The delay
variations are obtained through HSPICE Monte Carlo simulations, and the delay
sensitivity to different process and environmental variations is studied using the
simulation results. A design methodology for constructing a delay chain that produces
control signals phase shifted from the system clock in steps of 200 ps is presented.
Finally, the construction of a least-delay buffer chain for distributing the
phase-shifted clock signals, based on the logical effort method, is explained. This
work will help designers to construct robust delay elements and chains.
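The yield metric can be sketched as follows. The delay samples here are drawn from a
Gaussian distribution purely as a stand-in for HSPICE Monte Carlo results; the
nominal delay, the spreads, and the acceptance range are hypothetical, and yield is
reported as a fraction of runs rather than as a raw count.

```python
import random


def delay_yield(delays, lower, upper):
    """Fraction of simulated circuits whose delay lies within [lower, upper]."""
    inside = sum(1 for d in delays if lower <= d <= upper)
    return inside / len(delays)


def monte_carlo_delays(nominal_ps, sigma_ps, runs, seed=0):
    """Stand-in for Monte Carlo circuit simulation: Gaussian delay samples."""
    rng = random.Random(seed)
    return [rng.gauss(nominal_ps, sigma_ps) for _ in range(runs)]


# Two hypothetical delay elements with the same 200 ps nominal delay.
robust = monte_carlo_delays(200.0, 5.0, 10_000)
sensitive = monte_carlo_delays(200.0, 20.0, 10_000)

# Acceptance range: within 10% of the nominal delay.
print(delay_yield(robust, 180.0, 220.0))
print(delay_yield(sensitive, 180.0, 220.0))
```

The element with the smaller delay spread yields more circuits inside the acceptance
range, which is the sense in which one delay element family is more robust than
another.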

1.4.4 Hardening of Latches for Soft Errors

Many different latch designs have been proposed to prevent soft errors due to
particle strikes on the latch nodes. We analyze and compare these designs based on
some existing and new metrics. This work will help designers to select latches for
applications where soft errors are an important design metric. We also propose new
latch designs, the best of which is vulnerable only to SEMUs, with a delay overhead
of 12% and a power consumption of 70% compared to a standard transmission gate latch.
The proposed latches can also be customized in accordance with application
requirements for power consumption, performance, and soft-error resilience. In
addition, some of the proposed latch designs can also be used to protect CLBs.

1.5 Dissertation Outline

The remainder of the dissertation is organized as follows. Chapter 2 presents the
LUT-based model for determining the width of the SET generated at a gate output.
The sensitivity of SET width to different circuit and striking-particle parameters
is studied, and then appropriate LUT indices are determined. The lower and upper
bounds for each LUT parameter, and the number of LUT indices and the interval between
them, are determined based on accuracy requirements. This chapter also explains the
interpolation performed within the LUT and the accuracy of the estimated SET width.
The LUT-based model is further extended in Chapter 3 and used to calculate the SER
reduction obtained using the time-redundancy techniques presented in the next two
chapters.

In the next three chapters, we present our work on soft-error mitigation of
combinational logic circuits. In Chapter 3, we first review existing methods to
mitigate soft errors in combinational logic circuits and their drawbacks. Then our
error masking technique is explained and the SET width tolerated is determined.
Finally, the methodology used to calculate the SER of the original and the
error-masked circuits and results for SER reduction are presented. Next, Chapter 4
describes how the error masking technique can be improved by combining it with error
detection and recovery in critical and near-critical paths. The pulse width tolerated
using the EDR technique and the SER reduction obtained are calculated. Later, steps
to improve the SER reduction obtained from the proposed error masking technique by
increasing the slack available in the CLBs are presented: first, by exploiting the
dependence of critical path delay upon the input vector and second, through slack
redistribution in pipelined circuits. Finally, Chapter 5 discusses construction of
the delay chain used to generate phase-shifted clock signals for PO sampling in our
EM and EM+EDR techniques. Robustness to process variation and power consumption of
different delay elements are studied, and guidelines for selecting appropriate delay
elements for delay chain construction are given.

Chapter 6 analyzes existing soft-error hardened latches and studies the trade-offs
involved in using these latches to protect CLBs. Some new latch designs that are only
SEMU vulnerable, and which can be customized based on application requirements,
are also presented.

Finally, Chapter 7 summarizes the important contributions of this dissertation and
discusses directions for future research.

CHAPTER 2

Modeling and Analysis of Soft Errors in Logic Circuits

Soft error rate estimation of logic circuits requires accurate and efficient
estimation of electrical, logical, and temporal masking effects. To evaluate the
electrical masking of a path, the amplitude and duration (AD) of an SET generated at
a gate output due to a particle strike need to be calculated. Here, amplitude refers
to the peak voltage, and duration is the width of the SET measured at VDD/2. The
width and amplitude of an SET, for a specific charge collected, depend on gate drive
strength, output load capacitance, supply voltage, and the shape of the current
waveform. For an SET with amplitude greater than VDD/2, the duration of the pulse
determines the SET width during propagation. Hence, for all SETs which reach a
voltage greater than VDD/2, we approximate the amplitude to be VDD. Further, this
approximation yields the minimum charge that can cause a soft error. In this chapter,
a methodology to calculate the SET width as a function of the charge collected at a
gate output is proposed. In Chapter 3, we extend this methodology to take into
account electrical masking when calculating the SET width at the output of a path.
The extension helps to calculate the worst-case width of the SET reaching a PO. Once
the SET width at a gate output is known, other masking effects can be calculated as
follows. (1) Commercial noise simulators such as Pacific [18] can be used to
characterize electrical masking and get the nominal SET width at a PO. (2) The
logical masking probability can be calculated through gate-level simulations of the
circuit as described in [19]. (3) The temporal masking probability can be calculated
using analytical expressions presented in Chapter 3. As all these tasks are well
understood, here we focus only on the calculation of SET width at a gate output.

Both lookup table (LUT) based approaches and closed-form expressions for the
shape of the SET pulse have been presented recently. A closed-form expression for
the output voltage of a gate Vout(t) due to charge collection was presented in [20]:

V_{out}(t) = \frac{Q}{T \times C_L} \times \frac{e^{-t/T} - e^{-t/\tau}}{1/\tau - 1/T}    (2.1)

where T is the time constant of charge collection, Q is the charge collected, CL is
the load capacitance, τ = f(Q, CL, gate size) is a time constant of the gate, and f
is a function obtained by linear regression on a table of τ values indexed by Q, CL,
and gate size. This leads to a complex expression for calculating Vout(t), and the
error for points not close to the table indices has been reported to be more than 15%
[20]. A uniform LUT was used for determining the width of the SET at a gate output in
[21]. The LUT was constructed for different gate types, fan-ins, sizes, channel
lengths, supply voltages, threshold voltages, and load capacitances, and for a
particular charge collected at the gate output. Our experiments on the sensitivity of
SET width show that it varies non-linearly with increasing gate sizes and load
capacitances. As the SET width varies non-linearly for some parameters, the distance
between the LUT indices needs to be non-uniform (and hence the interpolation for
points not close to an LUT index needs to be non-linear), which has not been
considered in [21]. Moreover, constructing a single LUT for many different parameters
results in a large LUT and leads to a significant loss of accuracy during LUT
interpolation. In addition, constructing a LUT for power supply variation leads to a
large number of points in the LUT; in our methodology, this variation is handled
through regression instead.

The sensitivity of SET width to gate drive strength, output capacitance, and the
charge collected around a gate's output is first studied through HSPICE simulations.
Based on the sensitivity of SET width to the different parameters, we determine the
LUT indices and the interval between them. Additionally, the lower and upper bounds
for each LUT parameter, and the number of LUT indices and the interval between them,
are determined based on accuracy requirements. The accuracy of the LUT and the time
required for interpolation within the LUT are measured and compared with HSPICE
simulation.

2.1 Simulation Setup
The simulation setup for studying the sensitivity is shown in Figure 2.1. CL is
the total load capacitance at a gate output, and is equal to the sum of the fanout
gate capacitance and the lumped wire capacitance driven by the gate.

[Figure 2.1 schematic: gate output Y with voltage V(t), current source I(t), and load
CL; the output pulse is annotated with its Amplitude and Duration.]

Figure 2.1. Circuit setup used to measure the sensitivity of SET width to various
parameters.

The current source connected to the gate output models the current flowing across the
P-N junction due to charge collection. The direction of the current source for a 0→1
output flip (charge collection around a PMOS drain) is as shown in Figure 2.1, while
it is reversed for a 1→0 flip (charge collection around an NMOS drain). The current
source is modeled by a single time constant and is given by equation 2.2 [22]:

I(t) = \frac{2Q}{T\sqrt{\pi}} \sqrt{\frac{t}{T}} \, e^{-t/T},    (2.2)

where Q is the charge collected at a gate output and T is the charge collection
time constant. The time constant T depends on the technology and on whether the drain
is P-type or N-type. It is a measure of how fast the electrons recombine in the drain
node. The current has a sharp rise time, which models the drift mechanism through
which the minority carriers are swept across the P-N junction, and it reaches its
maximum value at t = T/2. The fall time is more gradual, due to the diffusion of
carriers across the P-N junction, and is determined by the exponential term with time
constant T.
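Two properties of this current model can be checked numerically: the peak occurs at
t = T/2, and the total charge delivered equals Q. The sketch below assumes the
waveform I(t) = (2Q / (T√π)) · √(t/T) · exp(−t/T), which is consistent with both
properties; the function names, the example charge of 200 fC, and the integration
settings are illustrative. Charge is in fC and time in ps, so the current is in fC/ps,
which is numerically equal to mA.

```python
import math


def junction_current(t_ps, q_fc, t_const_ps):
    """Assumed single-time-constant current model, in fC/ps (equal to mA)."""
    if t_ps <= 0.0:
        return 0.0
    scale = 2.0 * q_fc / (t_const_ps * math.sqrt(math.pi))
    return scale * math.sqrt(t_ps / t_const_ps) * math.exp(-t_ps / t_const_ps)


def collected_charge(q_fc, t_const_ps, t_end_ps, steps=20000):
    """Trapezoidal integral of I(t) from 0 to t_end_ps, in fC."""
    dt = t_end_ps / steps
    total = 0.0
    prev = 0.0  # I(0) = 0
    for k in range(1, steps + 1):
        cur = junction_current(k * dt, q_fc, t_const_ps)
        total += 0.5 * (prev + cur) * dt
        prev = cur
    return total


Q_FC = 200.0        # example charge
T_CONST_PS = 45.2   # P-type drain time constant used in this chapter

peak = junction_current(T_CONST_PS / 2.0, Q_FC, T_CONST_PS)
total = collected_charge(Q_FC, T_CONST_PS, 40.0 * T_CONST_PS)
print(f"I(T/2) = {peak:.3f} mA, integrated charge = {total:.2f} fC")
```

Because the prefactor scales as 1/T, halving T doubles the peak current for the same
Q while shortening the pulse.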

The value of T for different CMOS technologies was determined through device
simulations and tabulated in [3]. We determine the value of the time constant for
TSMC 180 nm, which is the technology used in our experiments, by scaling the time
constant values given in [3]. The values used for the P-type and N-type drain time
constants are 45.2 and 46.4 ps, respectively. The shape of the current waveform for
different values of T is shown in Figure 2.2. As can be seen from Figure 2.2, for
smaller T the peak value of I(t) is higher, and the current decreases rapidly,
leading to a smaller current pulse width. A higher peak value of I(t) results in
lower Qcrit, and the smaller current pulse width leads to faster gate output recovery
and hence smaller SET width. If the exact value of T for a specific technology is not
known, then the LUT can be characterized for different T values. Later, the LUT can
be indexed with the time constant for which Qcrit is being calculated.

[Figure 2.2 plot: "Junction Current for Different T". X-axis: time in ps.]

Figure 2.2. Junction current waveform for different time constants.

[Figure 2.3 plot: "I(t) Models". Series: PWL (15 pts), EXP with T = 20 p, I(t)
equation. X-axis: time in picoseconds.]

Figure 2.3. Current waveform constructed using values from Eq. 2.2, the PWL model
with 15 points, and the exponential HSPICE model for Q = 200 fC.

2.1.1 HSPICE Modeling of I(t)

The current waveform in Eq. 2.2 can be modeled as a piece-wise linear (PWL) or
exponential current source in HSPICE. I(t) values for a PWL model with 15 points
and for an exponential model from HSPICE were compared to the values computed from
Eq. 2.2. The experiments were repeated for two different charges, Q=100 and Q=200
fC. The current waveforms for Q=200 fC using the PWL model, the exponential model,
and values from Eq. 2.2 are shown in Figure 2.3.

The percentage error for both the PWL and the exponential model was calculated
as follows:

\mathrm{Error}(\%) = \frac{\left| I(t) - I(t)_{\mathrm{PWL\ or\ EXP}} \right|}{I(t)} \times 100.    (2.3)

The maximum error for the PWL and exponential models between t=0 and 50 ps is
given in Table 2.1. Error is only measured between t=0 and 50 ps, as the value of
the current after t=50 ps falls to less than 70% of the peak value of I(t). The
maximum error for the PWL model with 15 points was found to be just 1.7%, as compared
to 117% for the exponential model. Hence, the PWL model is used in HSPICE for
modeling the current across the P-N junction due to charge collection.

                   Q=200 fC   Q=295 fC
PWL (%)            1.34       1.7
Exponential (%)    100        117

Table 2.1. Percentage error for both PWL and exponential current sources.

2.2 Sensitivity of SET Width

In this section, the sensitivity of SET width to different gate characteristics is
studied. The indices of the LUT are determined based on the sensitivity studies.

2.2.1 Gate Inputs

The critical charge required to cause a 1→0 or a 0→1 flip depends on the input
vector applied to the gate. Figure 2.4 shows the normalized Qcrit for different input
vectors in a three-input NAND and NOR gate, respectively. As can be seen from
Figure 2.4, the Qcrit considering only the 1→0 flip at the output of the NAND3
gate varies by a factor of six. This is due to two factors: (1) the strength of the
pull-up or pull-down network that is ON, which determines how fast the deposited
charge is dissipated by the gate; and (2) whether the top transistor in a stack is
conducting. When the top transistor is ON, the effective output capacitance is equal
to the sum of the actual output node capacitance and the capacitances at the internal
nodes. The larger capacitance increases the charge required to generate an SET, which
means that the SER for the input vectors "101", "110", and "100" is much lower
compared to the other inputs.

[Figure 2.4 plot: "Normalized Qcrit Values vs. Input Vectors". Series: Norm. NAND
Qcrit, Norm. NOR Qcrit. X-axis: Input Vectors. Y-axis: Norm. Qcrit Values.]

Figure 2.4. Qcrit of a NAND3 gate for a 1→0 flip, and of a NOR3 gate for a 0→1 flip,
normalized with respect to the minimum Qcrit among the input vectors considered.

2.2.2 Output Load Capacitance

The initial charge stored at a gate output node is equal to the load capacitance (CL)
times the supply voltage (VDD). The effect of the load capacitance on SET width
is shown in Figure 2.5. Increasing CL increases SET width for a particular charge
Q. However, CL also determines whether Q results in an SET with amplitude greater
than VDD/2. For example, this can be seen from the plot for Q=105 fC, where SET width
falls to zero when CL increases from 30→40 fF. This shows that the effect of CL on
SET width is highly non-linear.

[Figure 2.5 plot: "SET Width Vs Load Cap.". X-axis: Load Cap. (fF), 0 to 50. Y-axis:
SET Width (ps).]

Figure 2.5. The width of SET measured at an inverter output for different charges,
when the output load capacitance is varied from 0 to 50 fF.

2.2.3 Charge Collected

The generation of an SET depends on the charge collected around the P-N junction of
a drain node. The minimum or threshold charge (Qth), defined as the charge required
to produce an SET whose amplitude exceeds VDD/2, depends on the gate drive strength
and load capacitance. Once the charge collected exceeds Qth, the SET width increases
non-linearly with increasing Q. This can be seen from Figure 2.6, which shows the
SET width as a function of increasing Q. For example, in the case of INVX1, SET
width increases from zero to 600 ps when Q increases from 0→295 fC.

[Figure 2.6 plot: "INV - Q, Size, Vs SET Width". X-axis: Q (fC). Y-axis: SET Width
(ps).]

Figure 2.6. The width of SET measured for different drive strengths of an inverter
when charge Q varies from 0→300 fC.

2.2.4 Gate Size

The gate size or drive strength determines the Qth required to generate an SET of
width W. The drive strength determines how quickly the charge collected around the
P-N junction is dissipated. Higher drive strength gates have higher output
capacitances, which also increase Qth. The effect of different drive strengths on the
SET width can be seen from Figure 2.6, which plots the SET width for increasing Q at
the output of an inverter. For example, Qth increases from 60 to 180 fC when the
drive strength of an inverter increases from 1X→3X. In addition, the maximum SET
width reduces from approximately 600→300 ps when the inverter strength increases
from 1X→3X. The drive strength has a bigger and non-linear impact on the scaling
of Qth and SET width than CL and Q, due to which it is not used as an LUT index.

2.3 Lookup Table

A two-dimensional LUT was constructed with output load capacitance and the
critical charge as indices, for different gates. We first determine the lower and
upper bounds for the parameters in the LUT, and then find the number of points to be
used for less than 10% error in the SET width looked up. The lower and upper bounds
of CL are determined by the minimum and maximum load values driven by the gates
in the design. In the case of synthesized designs, these values can be determined
from the standard-cell library models for delays. For our experiments, we bound the
load capacitance of a gate to 0→4Cg, where Cg is the gate input capacitance. The
maximum charge collected around a P-N junction is determined as follows. The
magnitude of the charge collected around a P-N junction depends on the energy of
the particle passing through the silicon substrate, as well as the path length over
which the charge is collected. The energy of ionizing particles is measured by the
metric linear energy transfer (LET), which is the energy per unit mass per unit area
transferred from an ionizing particle to the material through which it passes,
expressed in MeV-cm²/mg. It has been estimated that 1 MeV-cm²/mg of neutrons deposits
10.8 fC/µm of charge [23]. The flux of particles decreases exponentially with
increasing LET, and there are far fewer particles with LET > 15 MeV-cm²/mg [24].
Therefore, the maximum LET of ionizing particles considered is 15 MeV-cm²/mg. The
length over which the charge is collected depends on the technology and is
approximated to be 2 microns for TSMC 180 nm technology [23]. Therefore, the maximum
charge collected is approximated to be 300 fC (= 15 MeV-cm²/mg × 10 fC/µm × 2 µm),
and the minimum charge is 0 fC.

Once the lower and upper bounds for the parameters were determined, the initial
LUT was created by choosing six and thirty points for Q and CL, respectively. The
points were uniformly spaced within the lower and upper bounds mentioned
above. For each gate, the total number of points in the LUT was 180, and
hence an equal number of HSPICE simulations were done. In each HSPICE run,
the SET width for a particular Q and CL was recorded. The surface described by three
coordinates X, Y, and Z, corresponding to Q, CL, and SET width, respectively, is
used for interpolation when the LUT is indexed with a Q and CL not in the LUT.
The SET width is obtained by solving Equation 2.4.

Z = α + βX + γY + δXY (2.4)

The values of the coefficients were obtained by solving four such equations using
Gaussian elimination. The four equations correspond to the four neighboring points shown
in Figure 2.7.
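The interpolation step can be illustrated with a minimal Python sketch. It fits Equation 2.4 through the four corner points of a LUT cell and evaluates the surface at the query point. The function name and the corner SET widths are hypothetical and chosen only for illustration; only the corner coordinates follow Figure 2.7.

```python
import numpy as np

def interpolate_set_width(points, q, cl):
    """Fit Z = a + b*X + c*Y + d*X*Y through four LUT points and evaluate at (q, cl).

    points: four (Q, CL, width) tuples at the corners of the enclosing LUT cell.
    """
    # One row [1, X, Y, X*Y] per corner; the right-hand side holds the SET widths.
    A = np.array([[1.0, x, y, x * y] for x, y, _ in points])
    z = np.array([w for _, _, w in points])
    # numpy.linalg.solve factorizes A (LU, i.e. Gaussian elimination with pivoting).
    a, b, c, d = np.linalg.solve(A, z)
    return a + b * q + c * cl + d * q * cl

# Corner coordinates as in Figure 2.7 (Q in fC, CL in fF); the widths (ps) are made up.
corners = [(5.0, 20.0, 100.0), (5.0, 30.0, 140.0),
           (15.0, 20.0, 180.0), (15.0, 30.0, 260.0)]
width = interpolate_set_width(corners, q=10.0, cl=25.0)
print(width)
```

At the center of a cell, this form reduces to the average of the four corner values, which gives a quick sanity check on any implementation.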

2.4 Accuracy of the LUT Model

The accuracy of the LUT model was tested using CL and Q values located midway between
the LUT indices, as shown in Figure 2.7. The SET width from the LUT
was compared to that of HSPICE simulations. The percentage error in the LUT
interpolation was calculated as follows:

Error(%) = (|LUT value − SPICE result| / LUT value) × 100 (2.5)

[Figure 2.7 plot: axes X (Q), Y (CL) and Z (SET width); the marked neighboring points are
(5 fC, 20 fF), (5 fC, 30 fF), (15 fC, 20 fF) and (15 fC, 30 fF).]

Figure 2.7. Surface described by three coordinates X, Y, and Z corresponding to Q,
CL, and SET width. Q and CL values in the middle of LUT indices are used to test
the accuracy of the LUT. The figure shows the neighboring points and the surface
formed by them when the LUT is indexed using Q = 10 fC and CL = 25 fF.

The percentage errors for Q values close to Qcrit, where the maximum error occurs,
are shown in Table 2.2.

 

CL (fF)   Q (fC)   Max. Error (%)
                   Uniform   Non-Uniform
5         80       2.31      1.05
15        90       7.05      4.3
25        100      17.02     9.18
35        110      10.14     6.45
45        120      15.37     8.6

Table 2.2. Percentage error for interpolation from a LUT with both uniform and
non-uniform intervals between indices.

The width of an SET changed significantly for values of Q near Qcrit, which led to the
LUT interpolation producing errors greater than the desired 10%. The accuracy can be
improved by increasing the number of points in the LUT. For example, increasing the
number of points for Q to forty would improve accuracy, but with a huge increase in
LUT size. To reduce the error while keeping the complexity of the LUT within bounds,
we used non-uniform spacing between charges instead of increasing
the number of points and maintaining a uniform interval between charges. The
spacing between charges from 60 to 120 fC was reduced to 5 fC, while maintaining 10
fC spacing for other charges. The percentage error for the new LUT is also given in
Table 2.2, in the column titled Non-Uniform. The time taken to interpolate
and look up the SET width was found to be 12 ms for 1000 points, while the same
takes 3 hours in HSPICE simulations. The model thus offers a speed-up of > 1000x
compared to HSPICE simulations.
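As a rough illustration of the non-uniform spacing, the Python sketch below builds a charge index with 5 fC steps between 60 and 120 fC and 10 fC steps elsewhere. The function name and the choice to include both endpoints (0 and 300 fC) are assumptions made for this sketch.

```python
def charge_grid(q_max=300, coarse=10, fine=5, fine_lo=60, fine_hi=120):
    """Return Q index values (fC): `fine` spacing inside [fine_lo, fine_hi], `coarse` elsewhere."""
    grid = []
    q = 0
    while q <= q_max:
        grid.append(q)
        # Step finely while the next interval starts inside the refined region.
        q += fine if fine_lo <= q < fine_hi else coarse
    return grid

uniform = list(range(0, 301, 10))   # 10 fC spacing everywhere, endpoints included
non_uniform = charge_grid()
print(len(uniform), len(non_uniform))
```

Under these assumptions the refinement adds only a handful of charge points, far fewer than would be needed to refine the whole range uniformly.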

2.5 Regression for Supply Voltage Variation

Power supply variation also causes the SET width to change. The SET width
for a 10% variation in VDD is shown in Figure 2.8. As can be seen, the SET width
varies linearly over the ±10% range. We tried linear regression over this small range and
found that it gave a good fit. The R² value was found to be 0.99 for both Q = 125 fC
and Q = 175 fC. The straight-line equation shown in Figure 2.8 can be used to scale the
characterized SET width when the operating voltage varies by up to 10%. This avoids the costly
HSPICE simulations required for re-characterizing the LUT for variations in supply
voltage.
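A minimal Python sketch of such a regression is given below. The data points are hypothetical and are not the values plotted in Figure 2.8; the sketch only shows how a straight-line fit and its R² value can be obtained and then reused for scaling.

```python
import numpy as np

# Hypothetical characterization data: supply deviation (%) and SET width (ps).
dv = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])
width = np.array([452.0, 431.0, 410.0, 388.0, 369.0])

slope, intercept = np.polyfit(dv, width, 1)   # least-squares straight line
pred = slope * dv + intercept
ss_res = np.sum((width - pred) ** 2)          # residual sum of squares
ss_tot = np.sum((width - width.mean()) ** 2)  # total sum of squares
r2 = 1.0 - ss_res / ss_tot

def scale_width(delta_vdd_percent):
    """SET width predicted for a supply deviation inside the fitted +/-10% range."""
    return slope * delta_vdd_percent + intercept

print(round(r2, 4), scale_width(3.0))
```

The fit should only be applied inside the range over which it was characterized, since nothing here supports extrapolating beyond ±10%.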

[Figure 2.8 plot: "SET Width vs. Delta VDD"; SPICE data points and fitted lines for
Q = 125 fC and Q = 175 fC, with one fit annotated R² = 0.9973.]

Figure 2.8. Variation of SET width for supply voltage perturbation of ±10% from
the nominal value of 1.8 V.

2.6 Conclusion

In this chapter, we first analysed the sensitivity of SET width to gate drive strength,
output load and charge collected at the gate output. A LUT, which is indexed using
output load and charge Q, was proposed to calculate the SET width. The variation
in the SET width for increasing Q is non-linear. Hence, the spacing between Q values
was varied to improve the accuracy of the LUT. The error from the LUT interpolation
was found to be less than 10%. A regression-based model for scaling SET width due
to VDD variation was also presented. In Chapter 3, this LUT is extended to
interpolate the SET width at different points along a path.


CHAPTER 3

Error Masking for SER Mitigation

3.1 Introduction

In this chapter, we present an efficient error-masking design technique for static
CMOS combinational circuits that exploits the inherent temporal redundancy (timing
slack) of logic signals to increase their soft error reliability [25]. It has a number of
features that make it attractive compared to existing approaches: (1) It modifies only
the flip-flops of a combinational logic block (CLB) for sampling PO values and thus
has lower area and power overheads. (2) These overheads are lowered further by the
use of a common delay line for an entire CLB, or even multiple CLBs, for producing
the control signals used in the technique. (3) In CLBs that have sufficient slack at a
significant fraction of the PO gates, which is quite common, SER can be reduced
markedly without any performance overhead. (4) The proposed design technique
also masks soft errors in both the CLB and the master stage of the flip-flop.

The remainder of the chapter is organized as follows. Techniques that have been
proposed to handle logic soft errors are discussed in Section 3.2. Section 3.3 first
characterizes the slack available in a path for error masking, then explains our error
masking technique along with the circuits used to achieve it. Section 3.4 explains
the logical construction of the delay chain used in the error masking technique.
Section 3.5 describes the simulation setup and presents results obtained with ISCAS85
circuits.

3.2 Related Work

In this section, we discuss some of the earlier techniques proposed for logic soft
error correction using self checking designs, architectural and circuit level techniques.

Their drawbacks are explained and we motivate the need for new techniques.

3.2.1 Self-Checking Designs

Online or concurrent error detection (CED) can be achieved by using self checking cir-
cuits [26, 27], or by exploiting temporal redundancy of signals [28]. CED schemes use
an output characteristic predictor, whose output is then compared (using a checker)
with actual circuit output to detect an error. The output characteristic predictor
is implemented in hardware using extra circuits, and recomputation is done in case
of an error to recover the correct value. Self checking circuits are more efﬁcient for
arithmetic units, and may require high hardware cost for arbitrary logic functions.
Also, online error detection and retry may affect performance (throughput) and cannot
be used in real-time systems to overcome transient faults due to electrical noise
or external radiation.


3.2.2 Architectural Techniques

At the architectural level, executing the same instructions in parallel using two cores
or datapaths has been used to detect soft errors in the core logic. This requires twice
the logic of a single-core processor and is extremely power and area hungry. Recently,
microprocessor vendors have introduced dual-core designs to reduce the power con-
sumed by their high-frequency processors. However, utilizing these dual cores for
error detection, by executing the same instructions in parallel on both the cores
would reduce the processor throughput. Additionally, it will have a big performance
impact, as compared to single core designs, due to reduced clock frequencies of the
dual-core designs. Therefore, dual core solutions targeted for SER reduction are pro-
hibitively expensive for commodity applications. Hence, lower overhead variants such
as redundant execution using spare elements (REESE) have been proposed [29]. The
REESE approach involves placing each instruction that completes execution into a
queue along with the results of the instruction. An additional stage in the pipeline
then schedules execution of these duplicate instructions by mixing them in with reg-
ular instructions. The results from the duplicate instructions are compared with the
original results, and any differences indicate that a fault has occurred. Queuing of
the instructions and repeated execution increase the complexity of the system and
require a significant amount of extra logic area. Moreover, the extra pipeline stage
would increase instruction latency, and any soft errors in the instruction fetch or decode
stages would not be detected. Time redundancy based architectural approaches
also have significant performance and power overheads and design time cost [10].


3.2.3 Gate and Circuit-Level Techniques

Traditional techniques to provide soft error tolerance rely on triple modular redundancy
(TMR), in which the original circuit is triplicated and a majority voter is
used to determine the final output. However, this technique involves high overhead
(> 200%) in terms of area and cost, which limits its usage to reliability-critical
applications. Various ideas for soft error tolerance based on time redundancy were
presented in [30]. The time domain majority voter presented in [30] has a performance
overhead, since the sampling is started after the longest path in the circuit
settles. Another technique, called partial error masking, corrects errors with lower
overhead than traditional TMR techniques by utilizing the difference in soft error
vulnerabilities of gates [31]. However, it masks soft errors only in CLBs and has higher
overhead compared to the technique presented in Section 3.3.

Upsizing of gates to reduce the SET amplitude to less than VDD/2 was proposed
in [32]. Gates which have the lowest logical masking probability are selected to achieve
cost-effective trade-offs between overhead and soft error failure rate reduction. The
results presented show a performance overhead of 12.2% for 90% SER reduction in
180 nm technology. A combination of skewed logic and output latches which respond
only to a 0→1 or 1→0 transition was used to tackle logic soft errors in [33]. The
gates in the CLB were skewed by upsizing the NMOS transistors in the gate, such
that the generation of 0-1-0 SETs was reduced. The 1-0-1 SETs were handled by
using a dual-sampling flip-flop that changed its output only for a 0-1 transition at
the input. The results presented show a performance overhead of 420 ps and power

[Figure 3.1 schematic: in each panel, flip-flops U1, U2 and U3 (temporal sampling) feed a
majority gate U4 (asynchronous voting) that drives Out.]

Figure 3.1. (a) Temporal sampling latch with internal clock generation delays. (b) Equivalent
temporal sampling latch using internal data delays [23].

overhead of 160 µW even for a simple inverter chain.

Temporal Sampling Latch Designs: Prior efforts have also focused on latch
design for mitigating soft errors in CLBs [34, 2, 23]. The latch design in [34] requires
resistor insertion to slow down the input stage, which incurs both performance and
area penalties. The techniques presented in [2, 23] use temporal redundancy to sample
the circuit output at multiple delayed time instances, which is used to mask a pulse
of width w. Some practical implementations of this technique described in [23] are
shown in Figure 3.1.

Figure 3.1 shows a temporal sampling latch with internal clock generation and data
delays. The three flip-flops U1, U2 and U3 sample the PO of a circuit at T, T+w
and T+2w, respectively. Majority voting done by U4 on the three samples prevents
any SET of width w from changing the Out signal, which reduces the SER of the
CLB. In Figure 3.1(a), the clock signal is delayed internally using delay chains to
generate multiple data sampling edges. Equivalently, in Figure 3.1(b), data is delayed
internally to arrive at multiple times with respect to a single clock edge. The
main drawbacks of the proposed latch designs are:


• Performance penalty: In order to tolerate an SET of width w, the flip-flop
setup time is increased by 2w, which increases the clock period and hence
reduces the clock frequency. For example, tolerating an SET width of 200 ps using
the above temporal sampling latch introduces a performance penalty of 8% for
a 200 MHz clock frequency, which jumps to 40% at 1 GHz.

• Area overhead: Usage of separate delay lines inside each flip-flop increases the
area and power overhead incurred for delayed clock or data signal generation.
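The quoted performance penalties follow from comparing the added setup time 2w with the clock period. A minimal Python sketch of this arithmetic is shown below; the function name is illustrative and does not come from the cited work.

```python
def setup_penalty(set_width_ps, clock_mhz):
    """Fraction of the clock period consumed by the extra 2w of setup time."""
    period_ps = 1e6 / clock_mhz          # a 1 MHz clock has a period of 1e6 ps
    return 2 * set_width_ps / period_ps

for f_mhz in (200, 1000):
    print(f_mhz, "MHz:", round(100 * setup_penalty(200, f_mhz)), "%")
```

For w = 200 ps the added setup time is 400 ps, which is 8% of a 5 ns period and 40% of a 1 ns period.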

The techniques we discuss in this chapter for error masking have zero performance
overhead. In addition, they use a delay line that is common to one or more combinational
logic blocks (CLBs), as opposed to a delay line within each latch as done in
[2].

3.3 Time Redundancy Based Error Masking

We first analyze the soft error vulnerability of a CLB in the original circuit, and then,
in the next paragraph, explain our technique conceptually and analyze how it exploits
timing slack to reduce SER. All time instants in the following discussion are specified
in terms of elapsed time after a cycle begins. Let T denote the cycle time. When
an SET pulse is generated at the output of a static CMOS gate in a combinational
circuit due to a high-energy particle strike, it may propagate through a path u and
be captured by an output flip-flop (FF), causing a soft error. At t3 = T − tsetup, u's
output (primary output) is sampled by an output FF, where tsetup is the setup time
of the FF. Consider an SET pulse of width w that can begin at any time during a
cycle with equal probability. The probability P(w) that this pulse will latch at an
output FF and cause a soft error (i.e., it will overlap the sampling instant t3) can be
determined to be P(w) = w/T,¹ since the pulse overlaps t3 only when it begins within
the interval [t3 − w, t3].

Since the effect of an SET is only temporary, it is possible to prevent a soft error by
exploiting the timing slack available in the path u as follows. Let t1 denote the worst-case
propagation delay from the primary inputs to the output of u. The slack for u is then
ts = t3 − t1, i.e., in the absence of an SET, u's output will be stable at its correct
value in the time interval [t1, t3]. If, in addition to t3, we sample u's output (in the
connected flip-flop) at t1 and t2 too, where t1 < t2 < t3, and we then perform majority
voting among the three sampled values, we will be able to obtain the correct value of
u's output whenever an SET pulse does not overlap more than one sampling instant.
Let ts12 = t2 − t1 and ts23 = t3 − t2, and let ts12 ≤ ts23 without loss of generality. The
probability P(w) that an SET pulse of width w, after reaching u's output, will cause
a soft error (i.e., it will overlap at least two sampling instants) can be verified to be
as follows: (1) P(w) = 0 when w < ts12; (2) P(w) = (w − ts12)/T when ts12 ≤ w < ts23;
(3) P(w) = (2w − ts)/T when ts23 ≤ w < ts; and (4) P(w) = w/T when w ≥ ts. The
transient pulse and its overlap with different sampling points to cause a soft error
are shown in Figure 3.2. Thus, in the first three cases our technique improves soft
error tolerance, and it has the same tolerance as the original circuit in the last case. In
the first case, soft errors are always prevented. To maximize the pulse width that is
guaranteed to be tolerated, we choose t2 = (t1 + t3)/2, or equivalently ts12 = ts23, so that
SET pulses of width less than half of the slack at u are guaranteed to be tolerated.
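The four cases can be checked numerically. The Python sketch below encodes the piecewise expression for P(w) and compares it against a brute-force sweep of pulse start times. All function names and the timing values in the usage lines are hypothetical, and the sweep assumes the pulse is short enough that wrap-around into the neighboring cycle can be ignored.

```python
def p_original(w, T):
    """Latching probability with a single sample at t3: P(w) = w/T (capped at 1)."""
    return min(w / T, 1.0)

def p_masked(w, T, t1, t2, t3):
    """Probability that a pulse of width w overlaps at least two of t1 < t2 < t3."""
    ts12, ts23 = sorted((t2 - t1, t3 - t2))   # enforce ts12 <= ts23
    ts = t3 - t1
    if w < ts12:
        return 0.0                  # case 1: at most one sample can be hit
    if w < ts23:
        return (w - ts12) / T       # case 2: only the closer pair can be hit
    if w < ts:
        return (2 * w - ts) / T     # case 3: either adjacent pair can be hit
    return min(w / T, 1.0)          # case 4: same as the original circuit

def p_masked_numeric(w, T, t1, t2, t3, steps=20_000):
    """Brute-force check: sweep pulse start times uniformly over a window of length T."""
    hits = 0
    for k in range(steps):
        start = t3 - T + (k + 0.5) * T / steps   # window ends at t3
        covered = sum(start <= t <= start + w for t in (t1, t2, t3))
        hits += covered >= 2
    return hits / steps

# Hypothetical timing in ps: T = 1000, samples at 600, 750 and 900 (ts = 300).
print(p_original(200, 1000), p_masked(200, 1000, 600, 750, 900))
print(p_masked_numeric(200, 1000, 600, 750, 900))
```

With equal sampling intervals of 150 ps, pulses narrower than 150 ps are never latched in this model, which matches the half-slack guarantee stated above.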

 

¹ More precisely, a soft error will be caused if the SET pulse overlaps the setup and hold time
interval of the output FF.

42

 

Legend:

 

 

 

   

 

 

 

Transient pulse
of Width w
ts12 t523 t1:12 t323
l<—>t<——>1
t1 t2 t3
|<i——————c>} , M,
15 ' /
(a) E
—“ 'h

 

w" t11:12 : :
+v\—’i T"—

(b) (c) (d)

Figure 3.2. Figures (b), (c) and ((1) Show different transient pulse widths and their starting
and ending times when they overlap two sampling points to cause soft error. (a) Effective
slack available in a path and the time when the FF samples: t1, t2 and t3. Probability of
the SET latching for three different widths: (b) Transient pulse width is greater than tslz
and covers both t1 and t2. (c) Transient pulse width is greater than tslz and t823, hence
can overlap both t1 and t2 or t2 and t3. ((1) Tfansient pulse width is greater than t8 and

completely covers the slack time ts.

We now move on to implementation issues. First, we discuss circuits for sampling
a path's output values and performing majority voting. Then we describe the logical
construction of a delay chain, which is used to generate the sampling control signals
for the FF. In the above discussion, we exploited the complete slack from t1 till t3
to reduce SER. However, due to implementation reasons (as explained below), the
actual slack available in a path for error masking within a clock cycle is given by:

Smax = T − (tpd,worst + tD-C1 + tD-CK + tCK-Q). (3.1)

In Eq. 3.1, tpd,worst is the worst-case propagation delay in a path, while tD-C1
and tD-CK are the setup time requirements for the first and third samples (D1 and D3,
respectively) in the sampling latch, and tCK-Q is the clock to flip-flop output delay, which
includes the majority voter delay. The setup time tD-C1 is defined as the D-to-C1
offset that causes a wrong value to be latched at D1, while the setup time tD-CK is
defined as the minimum D-to-clock offset that causes the clock-to-output (Q) delay
to be 5% higher than its nominal value. The effective transient pulse width that can
be tolerated is then Smax/2. The actual sampling is done at time instants t'1, t'2 and
t3 (the last sampling time remains unchanged), such that t1 ≤ t'1 < t'2 < t3. We
define t's = t3 − t'1, t's12 = t'2 − t'1 and t's23 = t3 − t'2, and let t's12 ≤ t's23 without loss of
generality.
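Eq. 3.1 and the resulting tolerable pulse width can be evaluated as in the following Python sketch. The function and argument names are illustrative, and the timing numbers are hypothetical rather than taken from the experiments.

```python
def effective_slack(T, tpd_worst, t_d_c1, t_d_ck, t_ck_q):
    """Eq. 3.1: slack left for error masking within one clock cycle (one time unit throughout)."""
    return T - (tpd_worst + t_d_c1 + t_d_ck + t_ck_q)

# Hypothetical timing numbers in ps; none of them come from the dissertation.
s_max = effective_slack(T=5000, tpd_worst=3000, t_d_c1=100, t_d_ck=120, t_ck_q=280)
max_tolerated_width = s_max / 2   # widest SET guaranteed to be masked
print(s_max, max_tolerated_width)
```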

3.3.1 Output Sampling and Majority Voting

The sampling is performed by adding two sets of n and p control transistors (corresponding
to t'1 and t'2) to a FF, as shown in Figure 3.3(a). At sampling time, control
signals C1 and C2 (C̄1 and C̄2) go high (low), which disconnects output node F from VDD
and VSS, thus preventing any further transitions and completing the sampling. A
majority voter embedded into the slave stage of the FF determines the final output
value (see Figure 3.3(a)). To reduce the susceptibility of nodes D1 and D2 to particle
strikes after sampling (when each is essentially a dynamic node), cross-coupled inverters
(shown in Figure 3.3(b)) are added to make them static. Explicit switched capacitors
can also be added to harden the cross-coupled inverters against soft errors [35]. The
capacitor addition should be done based on SER requirements and the power and area
overheads incurred.
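The effect of voting on the three samples can be shown with a small behavioral model in Python. This is only a logic-level sketch with hypothetical names and sampling times; it does not model the transistor-level flip-flop of Figure 3.3.

```python
def majority(d1, d2, d3):
    """Two-out-of-three vote: d1*d2 + d2*d3 + d1*d3."""
    return (d1 & d2) | (d2 & d3) | (d1 & d3)

def sample_and_vote(correct, pulse_start, pulse_end, sample_times):
    """Sample a node that holds `correct` except while an SET inverts it, then vote."""
    samples = [correct ^ 1 if pulse_start <= t <= pulse_end else correct
               for t in sample_times]
    return majority(*samples)

times = (600, 750, 900)                        # hypothetical sampling instants in ps
print(sample_and_vote(1, 690, 760, times))     # pulse covers only the middle sample
print(sample_and_vote(1, 590, 760, times))     # pulse covers the first two samples
```

A pulse that corrupts a single sample is outvoted, while one that spans two sampling instants flips the voted value.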

[Figure 3.3 schematic: (a) master stage and slave stage of the modified flip-flop;
(b) cross-coupled inverters.]

Figure 3.3. (a) A modified C²MOS flip-flop to sample and latch signal values at different
time instances within a clock cycle. The slave stage contains a majority voter to vote among
the different sampled values. (b) Cross-coupled inverters added to D1 and D2 to keep them
static after C1 and C2 become high.

An SET pulse generated in the CLB and reaching the modiﬁed FF will be tolerated
as per our analysis in Section 3.3. An SET pulse generated only at D1, or D2, or
D3 of the modiﬁed FF due to a particle strike (an SEU) can always be tolerated
because of majority voting. However, a single-event multiple upset (SEMU), i.e., a
single particle strike causing transient pulses to be generated at multiple data nodes,
can be a problem as it can cause a wrong value to appear at the majority voter
output. Since it is hard to characterize, through simulation, the charge required for
a SEMU, we do not include soft error contribution of FFs to calculate original and
ﬁnal reduced SER (i.e., we present quantitative SER reduction results only for CLB).
However, the data nodes D1, D2 and D3 in the modiﬁed FF can be spaced apart
in the layout, by placing the cross coupled inverters and the layout of any explicit

switched capacitances present between the data nodes. This would further reduce the


chances of a SEMU occurring in the FF itself.

There are two cases in which SETs are not masked by the error masking technique:
SETs generated at the output of the majority voter gate, and
transient pulses in critical or short-slack paths. Errors occurring at the
output of a majority voter gate affect the next stage in the pipeline, and are corrected
by using our technique in the subsequent pipeline stage. In the case of reconvergent
paths, where a transient pulse propagates through both paths, a single logical flip
originating before the reconvergent paths begin can affect more than one sampling point.
An error can occur if the delay difference between the reconverging paths makes the
same transient pulse overlap two sampling points. To protect the sampling points,
t's12 and t's23 should be made greater than the delay difference between reconverging
paths plus the overlapping error pulse width, or the delay difference between reconverging
paths can be reduced by increasing the delay of the faster path. Techniques to prevent
soft errors in critical paths are described in Chapter 4.

3.4 Delay Chain

The control signals C and C̄ are generated using the circuit shown in Figure 3.4.
The generation of the control signals is explained using the NMOS control signal
C̄. The circuit used for generation of C̄ depends on when it falls low. C̄ is generated
by delaying CK when it transitions low after T/2, while C̄ is generated by ANDing
CK and the delayed complement of CK when it goes low before T/2. C is generated by
inverting C̄ in both cases. Particle strikes in the control signal generation circuit can
also cause soft errors, as a result of glitchy transitions generated in control signals.
The occurrence

[Figure 3.4 schematics and waveforms: (i) CK passed through a delay element to produce DCK;
(ii) CK ANDed with the output of a delay element.]

Figure 3.4. Generation of control signals C and C̄. (i) C̄1 and C̄2 are generated by
delaying CK if they go low after T/2. DCK shown is used as C̄1 or C̄2. (ii) C̄1 and C̄2 are
generated by ANDing CK and delayed CK when they go low before T/2. C is generated by
inverting C̄ in both cases.

of such soft errors is determined by the sampling times t'1, t'2 and t3 for a FF. Since
sampling time t3 always occurs at T − tsetup, we only consider the occurrence of t'1 and
t'2 with respect to T/2 (CK is symmetric and, for simplicity, t3 = T is used here). We do
not consider particle strikes on the CK signal itself, due to the high load on the CK signal.

1. t'1 < T/2 and t'2 < T/2: A 0→1 logic flip occurring in the delay chain before t'1 and
extending till t'2 will make both C̄1 and C̄2 low before t'1. C̄2 remains low till t'2,
which causes a wrong value to be latched in both D1 and D2. The corresponding
waveforms are shown in Figure 3.5(a).

2. t'1 < T/2 and t'2 > T/2: In this case t'2d, the time by which the CK signal has to be
shifted to produce control signal C̄2, is:

[Figure 3.5 schematics and waveforms: panels (a), (b) and (c) show CK, the delay elements and
AND gates, and the error pulses on the generated control signals.]

Figure 3.5. (a) t'1 < T/2 and t'2 < T/2: Zero to one logic flip affects both C̄1 and C̄2.
(b) t'1 < T/2 and t'2 > T/2: Zero to one logic flip affects only C̄1. (c) t'1 > T/2 and
t'2 > T/2: Both C̄1 and C̄2 are affected.
t'2 = t'1 + (T − t'1)/2 = t'1/2 + T/2

t'2d = t'2 − T/2 = t'1/2

which is smaller than t'1d = t'1. The corresponding waveforms for C̄1 and C̄2
are shown in Figure 3.5(b). A 0→1 logic flip occurring in C̄2, as shown by the
dotted line, would cause C̄1 to go low earlier than t'1, which may cause a wrong
value at D1 in the gate shown in Figure 3.3(b). However, as C̄2 and hence D2
are not affected, the majority value still remains correct. Hence, a 0→1 logic
flip occurring in C̄2 does not cause a soft error. A 1→0 logic flip occurring in
signal C̄2 before t'1 could cause an error in D2 if the error pulse width extends
till t'2. Since C̄1 only changes to one, D1 is not affected by this 1→0 error in
C̄2, which gives a correct value at the majority voter output.

3. t'1 > T/2 and t'2 > T/2: The corresponding waveforms for C̄1 and C̄2 are shown in
Figure 3.5(c). A 1→0 logic flip occurring in C̄1 before t'1 and extending till t'2
can cut off both NMOS transistors controlled by C̄1 and C̄2, which can cause
wrong values to be latched in both D1 and D2.

To avoid soft errors due to particle strikes on delay chains, two different delay chains
are used to generate C̄1 and C̄2 in circuits which have flip-flops with triggering times
satisfying conditions one and three above. Due to the discrete nature of the delays produced
by the delay elements, sampling cannot happen exactly at the ideal t'1 and t'2 times,
which are equal to the worst-case output settling time of the path and (t'1 + t3)/2, respectively.
This requires us to determine the nearest sampling time which can be used to reduce
SER. The number of discrete control signals C and C̄ to be generated can be reduced
by clustering and using common control signals for flip-flops whose sampling times
occur close together. This reduces the area overhead by using fewer delay elements to
generate control signals and fewer wires to route. Due to clustering of control signals,
sampling may be done at new time instants t''1, t''2 and t3 (the last sampling time
remains unchanged), such that t'1 ≤ t''1 < t''2 < t3. We define t''s = t3 − t''1,
t''s12 = t''2 − t''1 and t''s23 = t3 − t''2. The new sampling time intervals t''s12 and t''s23
may reduce the effective error pulse width that can be tolerated. Therefore, the sampling
times t''1 and t''2 have to be selected such that the decrease in the SER reduction is minimized.
The construction of the discrete delay chain and the methodology used for clustering
of control signals are explained in Chapter 5.

3.5 Simulation Results

The results for SER reduction on applying our technique to ISCAS85 circuits are

presented below.

3.5.1 Extension of LUT to Calculate SET Width at Primary
Output

The LUT described in Chapter 2 is extended to calculate the best case SET width
that reaches the primary output. The simulation setup shown in Figure 3.6 is used
to create a three-dimensional LUT. The gate for which the LUT is being constructed
drives a fixed number of inverters, so that an SET generated at the gate output
has maximum width when it reaches the path end. This allows us to estimate the
worst-case SER reduction while using the error masking technique. The LUT is
constructed by measuring the SET width at the outputs of the inverters driven by
the gate. Therefore, the LUT is now constructed using three different parameters,
viz., gate output load, charge collected, and the level of the gate from the PO. A total
of ten inverters in a path are used to construct the LUT. Changes in the width of an
SET generated more than ten levels away from the PO were found to be negligible.
Hence, the LUT was found to provide sufficient accuracy for estimating the width
of an SET generated more than ten levels away from the PO. The LUT for each
gate now contains a total of 1800 points (6×30×10). Interpolation, as described in
Chapter 2, is done for output load and charge Q values not located in the LUT. The accuracy
of interpolation was tested using directed points located midway between
existing LUT indices. The maximum interpolation error was found to be less than
10%, which is within the acceptable limit. The percentage error is similar to
that of the two-dimensional LUT described in the previous chapter. This is because the
level of the gate used to index the LUT is the same as one of the pre-characterized points,
while interpolation is only done along the output load and the charge collected.

[Figure 3.6 schematic: a NAND gate with load CL driving a chain of inverters; the SET width
is measured at the NAND output and at each inverter output.]

Figure 3.6. A total of ten inverters are connected between the NAND gate and the PO.
The SET width is measured at the output of the NAND gate and the inverters.

3.5.2 Critical Charge and Transient Pulse Width Calculation

A LUT which contains the width of the SET reaching the PO gate was constructed for different
gate types, their distance from the PO gate, load capacitance and charge collected.
All the simulations were done using TSMC 180 nm transistor models with VDD = 1.8 V.
Thus, by varying the output capacitance and the charge collected around the P-N
junction (up to a maximum of 300 fC), we construct lookup tables for the charge
versus the transient pulse width. The LUTs were constructed for NAND, NOR and
inverter gates. The lookup tables are then searched using binary search to get
the Q corresponding to the required transient pulse width.
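The inverse lookup can be sketched in Python with the standard `bisect` module. The sketch assumes that, for a fixed gate, load and level, the tabulated SET width is non-decreasing in Q; the function name and the table values are hypothetical.

```python
from bisect import bisect_left

def charge_for_width(q_values, widths, w_required):
    """Smallest tabulated Q whose SET width is >= w_required, or None if never reached.

    Assumes `widths` is non-decreasing in Q, which is what makes binary search valid.
    """
    i = bisect_left(widths, w_required)
    return q_values[i] if i < len(widths) else None

# Hypothetical LUT row for one gate, one load and one level (Q in fC, width in ps).
q_values = [0, 50, 60, 70, 80, 100, 150, 200, 300]
widths = [0, 0, 40, 120, 180, 260, 400, 500, 600]
print(charge_for_width(q_values, widths, 150))
```

This version returns the next tabulated charge; a finer estimate could interpolate between the two bracketing entries, which is omitted here.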

3.5.3 SER Calculation of Complete Circuit

ISCAS85 circuits were synthesized in 0.18 micron technology using the standard cell
library described in [36]. Only inverters and two-input NAND and NOR gates were used
during synthesis. The SERs of the original and the error-masked circuits are given by
the following equations.

TSER_orig = Σ_i SEC(g_i, w_orig)

TSER_red = Σ_i SEC(g_i, w_ts/2)

SEC(g_i) = Σ_{∀j} ( Σ_{k=1..m} (SER(Q_Lk) − SER(Q_Rk)) × P_latch(Q_Lk) ) × P_j (3.2)

where $SEC(g_i, w_{orig})$ and $SEC(g_i, w_{s/2})$ are the soft error contributions of
gate $g_i$ when the transient pulse widths required to cause an error are $w_{orig}$
and $w_{s/2}$, respectively. SER(Q), which is the basic soft-error rate of a gate, is
calculated using Eq. 1.1. To calculate the basic SER, a neutron flux
$F = 0.00565$ neutrons cm$^{-2}$ s$^{-1}$ is used, the sensitive device area A is equal
to the sum of the drain node areas connected to the output node, Q is the charge
required to produce an SET pulse of the required width and is estimated from the
LUT, $Q_s$ is the charge collection efficiency of the device in fC, and K is a
technology-independent constant equal to $2.2 \times 10^{-5}$. The charge collection
efficiencies for 1→0 and 0→1 logic flips are 20.5 fC and 17.2 fC, respectively. For the
sake of calculating SER reduction, we consider only 1→0 flips, which have higher SER
due to higher $Q_s$. SER(Q) gives the soft-error rate for charges equal to and greater
than Q. The soft error contribution of each gate $g_i$ is calculated starting from
$Q_{crit}$ up to a charge of 300 fC, which is the maximum charge that can be collected
by a P-N junction in 180 nm. In order to calculate the SER of a gate for charges
between $Q_{crit}$ and 300 fC, we divide the charge values into m equal intervals of
5 fC. The soft error contribution of each interval is calculated by subtracting the SER
corresponding to the right endpoint from that of the left endpoint [6]. The soft error
contribution of each interval is weighted by the latching window probability of a
transient pulse produced by charge $Q_{L_k}$, corresponding to the left endpoint of the
interval. The SEC of each gate is calculated by summing the SER with respect to all
flip-flops in its fanout cone, weighted by the probability $P_j$ of the path to flip-flop
j being functionally sensitized. As ISCAS85 circuits do not have specific input
patterns to test them, the logical masking probability $P_j$ is generated as a random
number.
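
The interval summation described above can be sketched as follows. This is a minimal
illustration, not the tool used for the reported results. It assumes that Eq. 1.1 has
the form $SER(Q) = F \times A \times K \times e^{-Q/Q_s}$ (an assumption, since Eq. 1.1
is not reproduced in this section), and the latching-window probability is passed in
as a hypothetical callable `p_latch`.

```python
import math

F = 0.00565      # neutron flux in neutrons / (cm^2 s), from the text
K = 2.2e-5       # constant K, from the text
Q_S = 20.5       # charge collection efficiency for 1->0 flips, fC
Q_MAX = 300.0    # maximum collected charge considered, fC
STEP = 5.0       # width of each charge interval, fC

def basic_ser(q_fc, area_cm2):
    """Assumed form of Eq. 1.1: SER for charges equal to or greater than q_fc."""
    return F * area_cm2 * K * math.exp(-q_fc / Q_S)

def sec(q_crit, area_cm2, p_latch, p_paths):
    """Soft error contribution of one gate, following the structure of Eq. 3.2.

    p_latch: hypothetical callable, charge (fC) -> latching-window probability.
    p_paths: sensitization probabilities P_j, one per flip-flop in the fanout cone.
    """
    inner = 0.0
    q_left = q_crit
    while q_left < Q_MAX:
        q_right = min(q_left + STEP, Q_MAX)
        # SER of the interval: left endpoint minus right endpoint,
        # weighted by the latching probability at the left endpoint.
        inner += (basic_ser(q_left, area_cm2) - basic_ser(q_right, area_cm2)) * p_latch(q_left)
        q_left = q_right
    return sum(inner * p_j for p_j in p_paths)
```

With `p_latch` fixed at one the inner sum telescopes to `basic_ser(q_crit) - basic_ser(Q_MAX)`,
which is a convenient sanity check on the interval bookkeeping.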

3.5.4 SER Reduction Using Error Masking

We first estimate the slack $S_{max}$ available at each flip-flop for sampling the PO
values. We constructed the original and modified C2MOS flip-flops (shown in
Figure 3.3(a)) and simulated the circuits using TSMC 180 nm models. The setup and
hold times of the flip-flops were measured by connecting them to an FO4 load. The
increase in $t_{CK-Q}$ delay for the multi-sampling flip-flop when data transitions
closer to CK is shown in Figure 2.1. The setup time for the flip-flop is calculated
from this plot. The values of $t_{D-CK}$, $t_{D-C1}$ and $t_{CK-Q}$ in the modified
design were found to be 125, 115 and 75 ps, respectively. A delay chain capable of
generating phase-shifted clock signals every 200 ps was constructed, and hence the
sampling times $t_1$ and $t_2$ were determined from the control signal availability.
The width of the transient pulse that can be tolerated in the modified circuit,
$t'_w$, is then calculated as $\min(t_3 - t_2, t_2 - t_1)$. The charges required to
cause transient pulses of width $t'_w$ and of the latching window width
$t_w = 100$ ps are then retrieved from the LUT. If $t'_w \le t_w$, then no error
masking is applied. The results obtained for ISCAS85 circuits due to error masking
(EM) are given in Table 3.1.

 

 

 

 

 

 

 

 

 

               Circuit Features       N_trig    SER
Circuit     Gates     PIs     POs       EM      Redn. %
c432          210      36       7        3      55.66
c1908        1005      33      25       16      66.74
c2670        1498     233     128       44      99.58
c3540        2176      50      22       20      95.8
c7552        4785     207     108       58      81.14
c5315        3712     178     106      102      97.12
Avg.         2231   122.8    70.8     43.2      82.7

Table 3.1. SER reduction for ISCAS85 circuits due to error masking.

3.6 Conclusion
In this chapter, we presented an error masking technique for SER reduction in
CLBs. The error masking technique samples and votes on the primary outputs within
the slack available in a clock cycle. This results in zero performance overhead.
Efficient flip-flop designs to do triple sampling and majority voting were presented.
The error masking technique leads to an average SER reduction of 82.7% in ISCAS85
circuits.

CHAPTER 4

Combining Error Masking and

Error Detection Plus Recovery

4.1 Introduction

In this chapter, we describe techniques for combining error masking with error
detection and recovery (EDR) to cope with soft errors in combinational and sequential
circuits [37]. If the error masking technique is used alone, it prevents an SET pulse
of width less than approximately half of the slack available in the propagation path
from latching and causing a soft error. If the error masking technique is used in
combination with EDR, an SET of width approximately half the clock cycle time can
be tolerated. The EDR technique has a single cycle penalty for recovering from an
error latched into the pipeline. The EDR technique can be used in circuits with no
slack to provide complete error protection for most applications. The SET is also
masked without additional delay in an area- and energy-efficient manner, which makes
this technique attractive for commodity as well as reliability-critical applications.
The area and power overhead can be traded off against soft-error rate (SER) reduction
based on application requirements. Techniques to improve slack, and hence the SER
reduction of the error masking technique, are also presented: (1) exploiting circuit
delay dependence on input vectors; and (2) redistributing slack in pipelined circuits.
The remainder of the chapter is organized as follows. Section 4.2 presents existing
error detection and correction techniques. Section 4.3 explains the EDR technique,
and then characterizes the paths where error masking and EDR can be applied.
Section 4.4 presents ways to increase the effectiveness of the error masking technique
by utilizing the input value characteristics of a circuit, and by redistributing the
slack available in a latch-based pipeline circuit. Section 4.5 presents results
obtained with ISCAS85 circuits for all the techniques described in this chapter.

4.2 Related Work

In error detection and correction, detection is done by sampling the circuit output
at two different time instances and then XORing the sampled values. Once an error
has been detected, the correct output is recovered through recomputation. Efficient
techniques to detect soft errors due to particle strikes and delay faults were
presented in [28, 38]. In both techniques, an extra latch (called the shadow latch in
Razor [38]) is used to sample the circuit output. The first sample is stored in the
main pipeline flip-flop at the rising edge of the clock (i.e. after a time T has elapsed
from the beginning of the clock cycle), while the second sample is stored half a clock
cycle later (at time 3T/2) in the shadow latch. This means any transient pulse with
width less than T/2 is detected as an error. Razor recovers from the error by restoring
the value stored in the shadow latch into the main flip-flop, while the work presented
in [28] suggests redoing the computation to get the correct value. Implementing
recomputation requires storing the current state and executing the program from an
instruction not affected by the soft error. Also, recomputation requires many clock
cycles and very high area and energy overhead, and is difficult to implement in
modern super-scalar processors due to the complex circuitry required. The present
version of Razor works very efficiently for delay faults, but there are certain
limitations when it is used for handling soft errors due to particle strikes. In the
case of Razor, if a particle strike had altered the value stored in the shadow latch,
then restoring this value would result in a wrong circuit output. As particle strikes
are uniformly distributed in time, there is equal probability of a particle strike
affecting either the main flip-flop or the shadow latch. Moreover, a particle strike
in the combinational logic circuit can also change the value stored in the shadow
latch. Hence, restoring the value from the shadow latch does not reduce the
probability of soft error occurrence due to a particle strike on the latch or in the
combinational logic circuit.

4.2.1 Error Masking

Error masking refers to on-line error correction. Efficient error masking techniques
utilize both the spatial and temporal redundancy in a circuit. In the previous
chapter, we presented an error masking technique (EM), which samples the circuit
output three times within a single clock cycle and does a majority voting on the
sampled values. The error masking technique presented attempts to trade off the SER
reduction obtained against the performance and area overhead. The sampling and
majority voting were done within the slack available in a circuit, which results in
zero performance overhead. The area overhead was minimized by using a common
delay chain to generate the phase-shifted clock signals for sampling. The technique
presented in Chapter 3 cannot be applied to circuits without slack, and for circuits
with few non-critical paths the ratio of SER reduction to overhead would be very
small (overhead is greater).

4.3 Techniques to Combine Error Masking and Error Detection Plus Recovery

In certain applications or circuits with balanced paths, using error masking in
paths with slack alone cannot provide the required soft error protection. The
technique presented in Chapter 3 incurs performance overhead in paths with
insufficient slack, which is not acceptable for transient fault protection in
timing-critical applications. One way to improve the soft error protection provided
by error masking is to do only error detection in short-slack paths by sampling only
twice within the slack available, or sampling after the clock closing edge plus the
contamination delay of the path. This can detect pulses that are twice as wide as, or
much wider than, the nominal pulse width masked by error masking schemes. Thus,
error detection in critical paths can be combined with error masking in non-critical
paths to provide SER reduction. However, as explained in Section 4.2, the cost of
applying such error detection and retry techniques is very high. To overcome these
drawbacks, we present a novel technique exploiting both error detection and error
masking on a single path to provide sufficient soft error protection for all circuits.

4.3.1 Error Detection and Recovery on a Single Path

We first explain the technique for doing EDR on a single path, then explain where
EDR needs to be used, as opposed to error masking alone, analyze the overhead
required, and then present techniques to reduce the overhead. In the EDR technique,
to do error correction, we sample the path output or primary output (PO) three times
and do a majority voting among the sampled values. As error detection is also done,
sampling is extended till the end of the next clock cycle. Once an error is detected,
the pipeline is stalled and the correct value from the error correction circuitry is
injected into the pipeline in the next clock cycle. All time instants in the following
discussion regarding the sampling times $t_1$, $t_2$, and $t_3$ are specified in terms
of the elapsed time after a cycle begins. To better understand the discussion that
follows, the reader is referred to Figure 4.1, which presents the latch used for
sampling. The PO is sampled at times $t_1$, $t_2$, and $t_3$ to produce D1, D2, and D3.
Let T denote the cycle time. To tolerate the maximum transient pulse width, the time
interval $t_3 - t_1$ must be maximum, with $t_2 - t_1 = t_3 - t_2$. The maximum slack
($S_{max}$) available for sampling in a path (where EDR is used) is given by:

\[ S_{max} = 2 \cdot T - (t_{pd,worst} + t_{D-CK} + t_{D-C2} + t_{C2-fb}). \qquad (4.1) \]

In Eq. 4.1, $t_{pd,worst}$ is the worst case propagation delay in the path, while
$t_{D-CK}$ and $t_{D-C2}$ are the setup time requirements for the first and third
samples (D1 and D3, respectively) in the sampling latch, and $t_{C2-fb}$ is the delay
from signal C2 going high to the output of the multiplexer in the feedback path
settling, which includes the majority voter delay. The setup time $t_{D-CK}$ is defined
as the D-to-CK offset that causes a wrong value to be latched at D1, while the setup
time $t_{D-C2}$ is defined as the minimum D-to-C2 offset that causes the D3 settling
delay to be 5% higher than its nominal value.

[Schematic: master stage holding the samples D1, D2 and D3, and slave stage with the
embedded majority voter.]

Figure 4.1. Flip-flop used for EDR in a path. XOR is used for error detection and a
majority voter generates the correct output which is then fed back in case of an error.

In the ideal case, the first sampling can be done immediately after the worst case
output settling time, and $t_3 = 2 \cdot T - (t_{D-C2} + t_{C2-fb})$, while the second
sampling is done at the middle of the time interval between $t_3$ and $t_1$. One of the
sampled values needs to be passed on to the next pipeline stage, hence one of the
sampling times is fixed at time T. D1 is sampled at time T, as this enables the
maximum pulse width to be tolerated. Based on the worst case propagation delay of
the path, we offer the following guidelines for choosing the soft error protection
scheme in a path.

• The slack available in a path while using error masking alone is given by:

\[ S_{EM} = T - (t_{pd,worst} + t_{D-C1} + t_{D-CK} + t_{CK-Q}). \qquad (4.2) \]

The effective transient pulse width that can be tolerated is then $S_{EM}/2$. If the
transient pulse width tolerated is sufficient, then sampling and majority voting
can be done within the slack available.

• If error masking done within the slack available does not provide sufficient soft
error protection, then EDR should be used in a path. This requires the use
of the modified flip-flop shown in Figure 4.1. The first and third samplings are
done at $t'_1 = T - t_{D-CK}$ and $t_3 = 2 \cdot T - (t_{C2-fb})$. The second sampling
is done at $t'_2 = (t'_1 + t_3)/2$. The effective slack available in a path with EDR is:

\[ S_{EDR} = T - (t_{D-CK} + t_{D-C2} + t_{C2-fb}). \]

The error detection is done by XORing the samples latched at $t'_1$ and $t'_2$.
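
The two guidelines can be turned into a small calculation, sketched below. The
function names, the cycle time, the feedback delay, and the `required_width` threshold
are hypothetical choices made for illustration. Only the formulas for $S_{EM}$,
$S_{EDR}$ and the EDR sampling times come from the text.

```python
def em_slack(T, t_pd_worst, t_d_c1, t_d_ck, t_ck_q):
    """Eq. 4.2: slack available when error masking alone is used (all times in ps)."""
    return T - (t_pd_worst + t_d_c1 + t_d_ck + t_ck_q)

def edr_sampling(T, t_d_ck, t_d_c2, t_c2_fb):
    """Sampling times t1', t2', t3 and the slack S_EDR for a path that uses EDR."""
    t1 = T - t_d_ck
    t3 = 2 * T - t_c2_fb
    t2 = (t1 + t3) / 2          # second sample midway between the first and third
    s_edr = T - (t_d_ck + t_d_c2 + t_c2_fb)
    return t1, t2, t3, s_edr

def choose_scheme(T, t_pd_worst, t_d_c1, t_d_ck, t_ck_q, required_width):
    """Pick error masking when S_EM / 2 covers the required SET width, else EDR.

    required_width is an assumed, application-dependent protection target.
    """
    tolerated = em_slack(T, t_pd_worst, t_d_c1, t_d_ck, t_ck_q) / 2
    return "EM" if tolerated >= required_width else "EDR"
```

For example, with an assumed cycle time of 2000 ps and the 125, 115 and 75 ps
flip-flop timings quoted in Chapter 3, a path with a 1500 ps worst case delay has
$S_{EM} = 185$ ps and would be assigned EDR for a 100 ps target, while a 1000 ps path
would keep error masking.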

4.3.2 Circuits for Error Detection and Recovery

The flip-flop used for sampling the PO values within the slack available in a circuit
was described in the previous chapter. The master stage samples the PO values thrice,
while the majority voter is embedded into the slave stage of the flip-flop. The
flip-flop for doing both error detection and recovery on a single path is given in
Figure 4.1. As the first sampling of the PO is done at $t'_1$, D1 is latched by the CK
signal. The signals C1 and C2 go high corresponding to sampling times $t'_2$ and
$t_3$, respectively. The slave stage passes the value latched at time T, i.e. D1, to
the next pipeline stage. Due to an SET, D1 could have latched and passed on the wrong
value to the next pipeline stage. If the SET corrupts D1 but its width is smaller than
$t'_2 - t'_1$, so that D2 is sampled correctly, then the error detect signal (ED),
which is the XOR of D1 and D2, changes to one, which leads to the majority voter
output being fed back into D1 and D2.

Once an error has been detected, error recovery is done by clock gating and pipeline
stalling. All the error detect (ED) signals from a pipeline stage are ORed to generate
a single error detect signal for the stage. The error detect signals from different
stages are ORed to generate a global pipeline stall (PS) signal. The PS signal is used
to gate the clock signal that is fed to the pipeline latches. Clock gating prevents
the CLB output generated in the next clock cycle from conflicting with the output of
the multiplexer (fb) fed into D1.
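
The detection and recovery logic described above can be summarized with a small
behavioral model. This is a bit-level sketch only. It ignores all timing, and the
function names are hypothetical rather than signal names from Figure 4.1.

```python
def majority(d1, d2, d3):
    """Majority vote of three sampled bits."""
    return (d1 & d2) | (d2 & d3) | (d1 & d3)

def edr_flip_flop(d1, d2, d3):
    """Return (value passed on at time T, ED flag, recovered value).

    ED is the XOR of the first two samples. When ED is set, the majority
    voter output is the value fed back into the pipeline.
    """
    ed = d1 ^ d2
    return d1, ed, majority(d1, d2, d3)

def pipeline_stall(ed_by_stage):
    """OR the ED flags within each stage, then across stages, to form PS."""
    stage_flags = [any(flags) for flags in ed_by_stage]
    return any(stage_flags)
```

For instance, if the correct output is 1 and an SET corrupts only the first sample,
`edr_flip_flop(0, 1, 1)` passes on the wrong value 0, raises ED, and recovers 1 from
the majority voter.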

The generation of the clock gating signal needs to be done before the next clock
cycle begins. This is feasible for the following reasons. (1) As the error detect
signals are generated using D1 and D2, approximately half a clock cycle (T/2) is
available for generating the pipeline stall and clock gating signals. (2) High-speed
circuits, such as domino logic, can be used to generate the clock gating signal, as
soft errors in these circuits do not lead to a functional failure (explained later).
(3) Moreover, since EDR is applied to only the most critical paths in a circuit, the
number of ED signals to be ORed is expected to be small (Section 4.5). In case the
generation and distribution of PS exceeds half a clock cycle, we suggest the use of
counter-flow pipelining techniques as done in [38]. The correct value is fed back into
D2 too, so that ED selects the new circuit output latched in future clock cycles.

We assume that once a particle strike has occurred in the logic circuit, the chances
of a new particle strike occurring in the latch and majority voter in the same clock
cycle are negligible. Hence, a transient pulse without sufficient width to overlap two
sampling points can always be masked. However, the chance of a particle strike
occurring on the extra circuitry and causing an error is analyzed separately. There
are three different cases which need to be considered for particle strikes.

• Output of error detector: If a particle strike changes the output of the XOR
gate when there has been no SET generated in the CLB block, the circuit
still functions correctly. This is because the output of the majority voter
remains the same and the correct value is put into the pipeline in the next clock
cycle. However, there is a one-cycle penalty due to the false error detection.
Since particle strikes on the error detection and pipeline stall circuits do not
affect the circuit output, fast circuits constructed using domino logic can be
used to generate the pipeline stall and global clock gating signals.

• Output of majority voter: In the case of combined error detection and recovery
in a single path, if a particle strike flips the output of a majority voter, the
circuit still functions correctly. This is because the correct value has been passed
on to the next pipeline stage, and since the ED signal in Figure 4.1 is not
one, no feedback happens into the pipeline. In the case of error masking alone,
a particle strike at the output of the majority voter can be corrected in the next
stage, assuming all pipeline stages implement soft error protection schemes.

4.4 Techniques to Enhance Error Masking

In this section, we discuss two techniques to increase the slack available for doing
error masking. The first technique exploits the input vector characteristics of the
circuit. The second technique exploits time borrowing to increase the slack for the
most soft error vulnerable blocks in a pipelined circuit.

4.4.1 Exploiting Circuit Timing Dependence on Input Vector

The SER reduction obtained from the error masking technique can be improved
further if the value of $S_{EM}$ in Eq. 4.2 can be increased. This increases the width
of the transient pulse, and hence the particle charge, required to cause an error. To
increase $S_{EM}$, the sampling time $t_1$ can be shifted earlier than the worst case
arrival time in a path. This means that the probability of a correct output being
available at the sampled PO gate at time $t_1$, $P(t_1)$, is less than one. The
sampling times $t_2$ and $t_3$ should be positioned such that the probabilities of
sampling a correct value in D2 and D3, $P(t_2)$ and $P(t_3)$ respectively, are one. The
SER of a gate with a sensitized path to a flip-flop where $t_1 \le t_{pd}$ is:

\[ SER_{new} = P(t_1) \times SER(w \ge S_{new}/2) + (1 - P(t_1)) \times SER(w \ge t_{lw}), \qquad (4.3) \]

where $SER(w \ge S_{new}/2)$ is the SER of a path when the transient pulse width
required to cause an error is greater than half of the new path slack, obtained by
shifting $t_1$ before $t_{pd}$, and $SER(w \ge t_{lw})$ is the SER of the path when an
SET of width greater than the latching window ($t_{lw}$) of the original latch can
cause an error. The above equation represents the fact that when the first sample D1
is wrong, the SER of the path with error masking is the same as that of the original
circuit.

In order to reduce SER compared to the case when error masking was not used in
the path:

\[ SER_{new} < SER_{orig}, \qquad (4.4) \]

where $SER_{orig}$ is the soft-error rate of the path without error masking and is
equal to $SER(w \ge t_{lw})$. As SER is proportional to $e^{-Q/Q_s}$, where Q is the
charge collected around a gate output and $Q_s$ is the charge collection efficiency of
the technology, Eq. 4.3 can be rewritten as:

\[ SER_{new} \propto P(t_1) \times e^{-Q_{S_{new}/2}/Q_s} + (1 - P(t_1)) \times e^{-Q_{min}/Q_s}. \qquad (4.5) \]

Here $Q_{S_{new}/2}$ and $Q_{min}$ refer to the charges required to create transient
pulses of width $S_{new}/2$ and $t_{lw}$, respectively. When
$Q_{S_{new}/2} > Q_{min}$, then $SER(w \ge S_{new}/2) \ll SER(w \ge t_{lw})$. Therefore,
approximating by ignoring the contribution of $SER(w \ge S_{new}/2)$ to $SER_{new}$
implies that $SER_{new}$ decreases linearly with increasing $P(t_1)$. Thus, sampling
earlier gives a much better SER reduction. This means that if the application does not
excite the worst case delay of the path for all inputs, then the sampling point $t_1$
can be shifted earlier. In arithmetic units such as adders, multipliers and
comparators, narrow-width input vectors have been exploited for energy reduction.
For example, during addition, when the sixteen most significant bits (MSBs) of a
32-bit vector are zero, the sum and carry outputs settle much earlier, which can be
exploited to increase the SER reduction. A high-level functional description of the
ISCAS85 circuits was used to determine the effect of input vectors on the outputs. We
consider here input vectors which lead to the least delay in the most critical paths
of the circuit. For example, in C1908, when the inputs n953 and n952 are one and zero
respectively, output check bits from the error settle with only an AND gate delay.
This allows us to sample the output check bits much earlier than the worst case
critical path delay. For the results, the top five timing-critical paths in all
circuits were considered to settle at their minimum delay. The top five critical path
outputs in the circuits considered (all circuits in Table 4.1, except C432) were
sampled much earlier and the resulting SER reduction was calculated. We found that
the average SER reduction increased to 91%.
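
The dependence of $SER_{new}$ on $P(t_1)$ in Eq. 4.5 can be checked numerically with
the sketch below. The two charges used in the usage note are hypothetical values
chosen only for illustration. $Q_s$ is the 20.5 fC figure quoted in Chapter 3, and the
proportionality constant is omitted, so only ratios are meaningful.

```python
import math

Q_S = 20.5  # charge collection efficiency in fC (value quoted in Chapter 3)

def relative_ser_new(p_t1, q_half_slack, q_min):
    """Right-hand side of Eq. 4.5 with the proportionality constant omitted.

    p_t1: probability that the first sample is correct, P(t1).
    q_half_slack: charge (fC) needed for a pulse of width S_new / 2.
    q_min: charge (fC) needed for a pulse as wide as the latching window.
    """
    return (p_t1 * math.exp(-q_half_slack / Q_S)
            + (1.0 - p_t1) * math.exp(-q_min / Q_S))

def relative_ser_orig(q_min):
    """SER of the path without error masking, on the same relative scale."""
    return math.exp(-q_min / Q_S)
```

With assumed charges of 120 fC and 30 fC, `relative_ser_new(0.0, 120.0, 30.0)` equals
`relative_ser_orig(30.0)`, and the value falls linearly toward $e^{-120/20.5}$ as
$P(t_1)$ approaches one, so Eq. 4.4 holds for any $P(t_1) > 0$ in this setting.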

4.4.2 Slack Redistribution to Enhance Error Masking

In a symmetric latch-based pipeline, time borrowing can be used to improve the SER
reduction obtained. Time borrowing here refers to utilizing the time voluntarily
passed by a previous pipeline stage (usually referred to as slack passing) or taking
up time from the next pipeline stage. The discussion of time borrowing is done
with respect to the PO gate connected to a latch, so as to remain consistent with
previous sections. In common pipelined circuits (such as those present in
super-scalar processors), the sum of the total logic delay across N pipeline stages
and the latch $t_{D-Q}$ overhead is usually less than $N \times T/2$, where T is the
clock cycle time. This is because it is impossible in practice to construct the
pipeline such that data always arrives at a latch input when the latch is transparent
[39]. Figure 4.2 shows three stages of such a pipeline with ideal latches ($t_{setup}$
and $t_{D-Q}$ for the latches are zero) and assuming ideal clock signals. The output of
CLB B settles $0.3 \times T$ time units before latch L3 opens, which is not utilized by
any logic. This creates a dead time which can be utilized for SER reduction.
Figure 4.3 shows combinational logic block (CLB) A using the dead time to shift the
sampling time in its critical path. In Figure 4.3, $t_{early}$ is the contamination
delay of CLB B. Effectively, the width of the error pulse that can be tolerated by
logic in CLB A increases by $0.15 \times T$. The latch connected to PO gates in CLB A
is similar to the master stage of the flip-flop shown in Figure 4.1, without the
multiplexer in the feedback path. The latch is clocked by C1, C2 and CK1, while the
majority voter is not clocked. C1 and C2 close the second latch in Figure 4.3,
corresponding to the sampling times $t_1$ and $t_2$, respectively. A separate signal
for sampling at $t_3$ is not used, as that sampling time is set by the timing
constraints of the circuit.
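
As a worked check of this example, assume the sampling times read from Figure 4.3
are $t_1 = 0.5T$, $t_2 = 0.65T$ and $t_3 = 0.8T$, and apply the tolerated-width rule
$\min(t_3 - t_2, t_2 - t_1)$ used in Chapter 3:

\[ t_3 - t_1 = 0.8T - 0.5T = 0.3T, \qquad \min(t_3 - t_2,\; t_2 - t_1) = \min(0.15T,\; 0.15T) = 0.15T. \]

Under these assumptions, spreading the three samples evenly across the $0.3 \times T$
dead time accounts for the $0.15 \times T$ increase in tolerated pulse width.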

[Timing diagram: latches L1, L2 and L3 separated by logic delays of 0.5T, 0.2T and
0.6T, shown against the clock CK.]

Figure 4.2. A pipeline with dead time.

In pipelines where the cycle time cannot be further reduced for correct operation,
i.e., the sum of the total logic delay across N pipeline stages and the latch
$t_{D-Q}$ overhead is equal to $N \times T/2$, we present a selective time borrowing
technique that can be used in non-critical PO gates of a pipeline stage. This is due
to the unequal SER contribution and slack distribution of different PO gates in
different pipeline stages. The soft error rate of a PO gate i in stage j,
$SER(PO_{i,j})$, refers to the sum of the soft error rates of all its fanin gates. A
PO gate in pipeline stage j can borrow a maximum slack of $0.5 \times T$ (minus the
latch D-Q delay plus any jitter or skew, due to data settling close to the clock
closing edge) from pipeline stage j + 1 and ahead. The time borrowed can be used to
reduce the SER contribution of the PO gate in stage j by increasing the slack
available for sampling the output. However, time borrowed by $PO_{i,j}$ can reduce the
slack available for another gate $PO_{k,j+1}$ in the fanout cone of $PO_{i,j}$. This
requires us to borrow time based on the SER contribution of PO gates in a path. We
present an algorithm in Figure 4.5 to do selective phase time borrowing for SER
reduction, where two consecutive pipeline stages operating on the high and low phases
of the same clock pass slack between them. This is done to make the problem of slack
distribution across the pipeline stages simpler.

[Timing diagram: the pipeline of Figure 4.2 with the sampling times of CLB A shifted
into the dead time; t_early is marked, and the samples are taken at t1 = 0.5T,
t2 = 0.65T and t3 = 0.8T.]

Figure 4.3. Dead time being used to increase the width of error pulse that can be tolerated.

The algorithm assumes that the minimum clock cycle time in the latch-based
pipeline has been determined and that the data arrival times at all the latches are
known. The slack available for a non-critical $PO_{i,j}$ is $t_{crit} - t_{pd,i}$,
where $t_{crit}$ and $t_{pd,i}$ are the critical path delay and the worst case
propagation delay of the path ending at PO i, respectively. The time borrowing
potential for each $PO_{i,j}$ is then calculated as the minimum slack among the PO
gates in its fanout cone in stage j + 1. The SERs of $PO_{i,j}$ and all its fanout
gates in stage j + 1 are calculated when the error masking technique is applied with
the current slack available. Then the sum of the SER of $PO_{i,j}$ and all its fanout
gates is calculated and stored in $SSER_{i,j}$. The slack distribution between
$PO_{i,j}$ and the gates in its fanout cone is done based on the SER of $PO_{i,j}$. If
the SER of $PO_{i,j}$ is greater than the sum of the SER of its fanout cone gates, the
slack available at $PO_{i,j}$ is increased through time borrowing. The slack is
iteratively increased in steps of 10% of the total time available for borrowing.
During each iteration of the algorithm, when steps 3-6 are executed, it is always
ensured that the total sum of SER in pipeline stages j and j + 1 remains the same or
decreases. This ensures that the algorithm converges, and that the final SER of the
pipeline system is equal to or lower than the original SER. The results for SER
reduction on ISCAS85 circuits due to time borrowing are presented in Section 4.5.1.
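
The iteration described above can be sketched for a single PO gate as follows. This
is a simplified illustration and not the algorithm of Figure 4.5 itself. The
exponential `ser_model`, its `tau` parameter, and the data layout are hypothetical
stand-ins for the LUT-based SER calculation.

```python
import math

def ser_model(base, slack, tau=100.0):
    """Hypothetical SER model: SER falls exponentially as slack (ps) grows."""
    return base * math.exp(-slack / tau)

def borrow_time(po, fanout, steps=10):
    """Selective time borrowing for one PO gate in stage j.

    po: (base SER, slack) of the PO gate.
    fanout: list of (base SER, slack) for the PO gates in its fanout cone
            in stage j + 1.
    Returns the time borrowed and the total SER before and after.
    """
    base_po, slack_po = po
    budget = min(slack for _, slack in fanout)   # time borrowing potential
    step = budget / steps                        # 10% of the borrowable time

    def total(borrowed):
        own = ser_model(base_po, slack_po + borrowed)
        rest = sum(ser_model(b, s - borrowed) for b, s in fanout)
        return own, rest

    borrowed = 0.0
    own, rest = total(borrowed)
    start = own + rest
    for _ in range(steps):
        if own <= rest:                # PO no longer dominates: stop borrowing
            break
        new_own, new_rest = total(borrowed + step)
        if new_own + new_rest > own + rest:   # never let the total SER grow
            break
        borrowed += step
        own, rest = new_own, new_rest
    return borrowed, start, own + rest
```

Because a step is accepted only when the combined SER of the two stages does not
increase, the returned total is never higher than the starting total, mirroring the
convergence argument in the text.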

4.5 Simulation Results

The EDR flip-flop, shown in Figure 4.1, was simulated using TSMC 180 nm models
to calculate the values of $t_{D-CK}$, $t_{D-C2}$, and $t_{CK-Q}$. The values of
$t_{D-CK}$, $t_{D-C2}$ and $t_{CK-Q}$ in the modified design were found to be 125,
115, and 50 ps, respectively. A delay chain capable of generating phase-shifted clock
signals every 200 ps was constructed, and hence the sampling times $t'_1$ and $t'_2$
were determined from the control signal availability. The complete methodology for
determining the sampling times $t'_1$ and $t'_2$ is described in Chapter 5. Based on
the setup time and CK-Q delay for the EDR flip-flop, we first estimate the slack
$S_{EDR}$ available at each flip-flop for sampling the PO values. The width of the
transient pulse that can be tolerated in the modified circuit, $t'_w$, is then
calculated as $\min(t_3 - t'_2, t'_2 - t'_1)$. The charges required to cause transient
pulses of width $t'_w$ and $t_w = 100$ ps are then obtained from the lookup table. If
$t'_w \le t_w$, then we use EDR in the path, else error masking is used as described
in Chapter 3. The results of applying EDR to ISCAS85 circuits are given in Table 4.1.

In Table 4.1, the column $N_{trig}$ represents the number of flip-flops (FFs) modified.
Sub-column EM gives the number of FFs where error masking was used, while EDR
gives the number of POs connected to the flip-flops shown in Figure 4.1. As the
average number of paths on which EDR is applied equals 5.6, domino logic can be used
to generate ED signals without much delay. The average SER reduction on using
error masking alone is 82.67%, while combining error masking with EDR raised it to
93.78%. The original areas of the ISCAS85 circuits were obtained from Synopsys Design
Compiler, while the area overhead is equal to the sum of the area occupied by the
delay lines and associated buffers, the modified FFs, the circuit required to generate
ED signals, and a five percent wiring overhead. The overhead for generating PS
signals is not included, as the ISCAS85 circuits considered are not pipelined.

               Circuit Features      N_trig     SER Redn. %     Area      Power
Circuit     Gates     PIs     POs    EM   EDR     EM     Both   Ovhd. %   Ovhd. %
c432          210      36       7     3     4   55.66   89.66    30.8      72
c1908        1005      33      25    16     9   66.74   85.2     19.4      65
c2670        1498     233     128    44     4   99.58   99.81    30.8      52
c3540        2176      50      22    18     4   95.8    98.73    18.9      34
c7552        4785     207     108    58     8   81.14   90.14    22.63     27
c5315        3712     178     106   102     4   97.12   99.14    27.75     45
Avg.         2231   122.8    70.8  43.2   5.6   82.67   93.78    25.05     49.17

Table 4.1. SER reduction for ISCAS85 circuits. The power overhead in practice
would be lower than the one presented above because: (1) the original power has
been estimated using a zero delay model, which does not take into account glitchy or
partial transitions; (2) the leakage energy, which has not been taken into account,
consumed by the overhead circuit is far lower than the leakage of the CLB, due to
fewer components.

The area overhead depends on the number of modified FFs, the number of distinct
sampling times, and the maximum sampling time, which contribute to the delay
element overhead. If a number of sampling times are close together, then the delay
element overhead can be reduced more (by clustering) without significant loss of
SER reduction, as compared to circuits with sampling times wide apart. The delay
lines can be shared across multiple modules, which would further reduce their area as
well as power overheads. The power consumed by the original ISCAS85 circuits was
estimated in PrimePower using a zero delay model (i.e. delays and switching activity
have not been back-annotated). The power overhead is calculated by simulating the
delay and buffer chain in SPICE. The construction of the delay and buffer chain,
along with SPICE details, is described in Chapter 5. The average power overhead
in practice would be lower because: (1) the original power has been estimated using
a zero delay model, which does not take into account glitchy or partial transitions;
(2) the leakage energy, which has not been taken into account, consumed by the
overhead circuit is far lower than the leakage of the CLB due to fewer components in
the delay and buffer chain. In comparison to the overhead of 200% obtained from
modified TMR techniques, the overheads incurred by the techniques presented are
significantly lower.

The results presented here are for zero delay overhead, i.e., the critical path delay
is not affected. C499/C1355, which have the same overall function, are not selected
due to the presence of balanced paths in the circuit. As technology scales, clock
frequency increases, which decreases the absolute value of slack in circuits.
However, as the time constant for the charge collection process of a device decreases
exponentially with minimum gate length [6], the current pulse width due to a particle
strike also decreases. The decrease in current pulse width, coupled with the decrease
in gate output capacitance, leads to a decrease in the width of the SET as technology
scales. This should allow us to exploit the reduced slack available in a path to
decrease SER using the technique discussed.

[Plot: SER Reduction (%) versus Time Borrowed (fraction of clock period, up to 0.2T)
for C432, C1908, C2670, C3540, C7552 and C5315.]

Figure 4.4. Plot showing the SER reduction achieved versus the time borrowed.

4.5.1 SER Reduction Using Slack Redistribution

In order to simulate the effect of time borrowing on SER reduction when using error
masking alone, we increased the slack available across all paths and recalculated the
SER reduction. The slack available in all paths was increased by up to 0.2×T, where T is
the clock cycle time. The results are plotted in Figure 4.4. As can be seen, SER
decreases with a small increase in the slack time. SER reduction in C432 increases from
52% to 82% as the time borrowed is increased from 0.05×T to 0.1×T. This is because
the number of latches that are triggered and used for sampling doubles from three to
six when the slack across all paths is increased by 0.1×T.


4.6 Conclusion

A technique for performing both error detection and recovery (EDR) on short-slack paths
was presented in this chapter. This technique masks SETs of width approximately
equal to half the clock cycle time. If an error is detected, correction can be done
with a single-cycle penalty. This technique, in combination with the error masking
technique of the previous chapter, provides an average SER reduction of 93.78% for the
ISCAS85 circuits. Two other techniques to improve the SER reduction provided by error
masking were proposed. The first technique increases the slack available in logic
circuits by exploiting the dependence of circuit delay on input vectors. The second
technique utilizes time borrowing in latch-based pipeline circuits to increase the slack.
The potential for SER improvement using these techniques was demonstrated in
Section 4.5.


Algorithm Phase_Time_Borrow

Begin
1. Initialization:
   Slack vector $\vec{S}^{(j)} = \{s_1^{(j)}, s_2^{(j)}, \ldots, s_n^{(j)}\}$ /* $s_i^{(j)}$ is the slack at $PO_{i,j}$ in pipeline stage j */
   Potential time borrowable by $PO_{i,j}$: $tb_{i,j} = \min(s_k^{(j+1)})$, $\forall\, PO_{k,j+1} \in$ fanout cone (FOC) of $PO_{i,j}$
   $SER(PO_{i,j}) = \sum SER(g_{i,j})$, $\forall\, g_{i,j} \in$ fanin cone of $PO_{i,j}$
   $SSER_{i,j} = SER(PO_{i,j}) + \sum SER(PO_{k,j+1})$ /* sum of SER of PO gate i in stage j, and SER of PO gates in the fanout cone of $PO_{i,j}$ before slack redistribution */
   $SSER'_{i,j} = SER'(PO_{i,j}) + \sum SER'(PO_{k,j+1})$ /* SER' and SSER' are SER after slack redistribution */

For each PO i in stage j
2.   If ($SER(PO_{i,j}) > \sum SER(PO_{k,j+1})$), $\forall\, k \in$ FOC of $PO_{i,j}$
       Repeat
3.       Calculate new sampling times and SER for $PO_{i,j}$ and all $PO_{k,j+1}$ in FOC of $PO_{i,j}$, when time borrowed from stage j + 1 is $0.1 \times tb_{i,j}$.
4.       Calculate $SSER_{i,j}$ and $SSER'_{i,j}$.
5.       If ($SSER'_{i,j} < SSER_{i,j}$)
6.         Increment and decrement $s_i^{(j)}$, $s_k^{(j+1)}$ by $0.1 \times tb_{i,j}$, respectively.
7.         Decrement $tb_{i,j}$ by $0.1 \times tb_{i,j}$.
         Endif
       Until ($tb_{i,j} > 0$) and ($SSER'_{i,j} < SSER_{i,j}$)
     Endif
End For /* End of for loop */

End

Figure 4.5. Algorithm for time borrowing to reduce SER


CHAPTER 5

Robust Delay Chain Construction

5.1 Introduction

In this chapter, we explain the methodology for constructing the delay chain, which
is used in both the EM and EDR techniques. First, we analyze three different families
of delay elements for their robustness to process variation, and then determine the
appropriate delay element for the delay chain construction [40]. Later, in Section 5.7,
a delay chain, which produces control signals phase shifted from the system clock in
steps of 200 ps, is constructed using the most robust delay element. Finally, we explain
the construction of a buffer chain to distribute the phase shifted clock signals using
the method of logical effort.

The three delay element families analyzed are: (1) transmission gate based,
(2) cascaded inverter based, and (3) voltage-controlled. We compare the effectiveness
of the delay elements in terms of yield, which is defined as the number of circuits
within a specified delay range. The delay variations are obtained through HSPICE


Monte Carlo simulations (MCSs) and the delay sensitivity to different process and
environmental variations are studied using the simulation results. This enables us to
select a robust delay element for constructing the delay line.

Process variation refers to random die—to—die and within die parameter fluctuations
during the manufacturing process. The within-die variations can be classiﬁed as either
correlated (systematic) or uncorrelated (random). Meindl suggests that correlated
variations could occur due to aberrations in the stepper lens, whereas placement of
dopant atoms in the device channel region, which varies randomly and independently
from device to device within a die could cause uncorrelated variations [41]. As tech-
nology scales and transistor gate lengths become smaller than the wavelength of light
used in the lithography process, the uncorrelated within-die parameter variations are
expected to become a major design concern [41]. Therefore, we consider only random
within-die parameter variations in this chapter. In the next few sections, we study

the robustness of the delay elements under consideration.

5.1.1 Delay Elements

A delay element is a circuit that produces an output waveform similar to its input
waveform, only delayed by a certain amount of time. Delay elements ﬁnd wide use in
digital systems [42]. Asynchronous or self-timed designs, in which the global clock is
eliminated, make extensive use of delay elements [43]. Most asynchronous cells need
to generate a completion signal to indicate that their outputs have been evaluated. A

delay element can provide this as long as its delay amount is larger than the worst-case


delay of the cell [44]. For such structures as self—timed multipliers, delay elements are
needed in the micropipeline [43]. Even circuits that perform complex mathematical
calculations, such as computing the discrete cosine transform require delay elements
in their architecture [45]. Finally, delay elements are used for phase modulation in
delay-locked loops and phase—locked loops [46]. To our knowledge, there is no previous
study on the effect of process variation on delay elements. Therefore, a study of delay

elements vis-a—vis process variation would prove helpful in designing robust circuits.

5.2 Yield Deﬁnition

We define yield as the percentage of the total delay elements fabricated whose
propagation delay falls within a certain critical delay cut-off. In the presence of
parameter variations, delays are distributed over a certain range. The distribution
of delays can be evaluated by the normalized variability, $3\sigma/\mu$, where $\sigma$
and $\mu$ are the standard deviation and mean of the measured delay, respectively. As
the proposed techniques are highly sensitive to the delay of control signals, we define
the cut-off delay as ±10% of the mean delay.
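
The yield definition above can be sketched in a few lines of Python. This is a minimal illustration, not the scripts used in this work: the function names and the sample delay values are hypothetical, and the population standard deviation is assumed for $\sigma$.

```python
from statistics import mean, pstdev


def normalized_variability(delays):
    """Return 3*sigma/mu of the delay samples, as a percentage."""
    mu = mean(delays)
    return 100.0 * 3.0 * pstdev(delays) / mu


def yield_percent(delays, tolerance=0.10):
    """Return the percentage of samples within +/- tolerance of the mean delay."""
    mu = mean(delays)
    low, high = (1.0 - tolerance) * mu, (1.0 + tolerance) * mu
    passing = sum(1 for d in delays if low <= d <= high)
    return 100.0 * passing / len(delays)


# Hypothetical delay samples in ps (illustrative only, not simulation data).
sample_delays_ps = [180.0, 185.0, 175.0, 230.0, 178.0, 182.0, 181.0, 179.0]
print(yield_percent(sample_delays_ps))
print(normalized_variability(sample_delays_ps))
```

With these made-up samples, one of the eight delays falls outside the ±10% window around the mean, so it is the only one that counts against the yield.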

5.3 Parameters Studied

The delay of a circuit depends on gate length L, width W, supply voltage VDD, and
the NMOS (VTn) and PMOS (VTp) threshold voltages. These parameters are modeled
as normally distributed random variables to study their effect on the delay variation.
The standard deviation of each of these parameters and their nominal values for
TSMC 180 nm are given in Table 5.1. The nominal values of the gate width for the

PMOS and NMOS transistors in an inverter are also given in Table 5.1. For other delay


 

 

 

 

 

 

Parameter      Nominal Value   3σ/μ
Gate Length    180 nm          15%
Width          2/1 µm          15%
VTn            0.445 V         15%
VTp            -0.44 V         15%
VDD            1.8 V           10%

Table 5.1. Parameter variations for the process considered.

elements, the width is scaled based on the circuit design and the delay required.

In all our simulations comparing delay elements, we fixed the fan-in to be a single
minimum-sized inverter, and the load was FO4 inverters. These inverters are provided
with their own power supply, separate from the one connected to the delay element.
The propagation delay of the delay elements is calculated by averaging the rise and
fall delays. A square pulse with a slew of 50 ps was applied as input in all the
experiments.

5.4 Simulation Methodology

Using MCSs performed in HSPICE, we sample a signiﬁcant number of points from
the normal distribution of each parameter and calculate the delay in each iteration.
The resulting data gives the delay distribution, when the delay elements are manufac-
tured under the given parameter variations. For each delay element we perform MCS
with 500 iterations. We then calculate the mean, variance, and normalized variability

for each delay element from the simulation results.
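
The sampling step of this methodology can be illustrated with a short Python sketch. It is not the HSPICE flow used in this work: `toy_delay_model` is a hypothetical placeholder for a circuit simulation, and the helper names and the interpretation of the Table 5.1 percentages as $3\sigma$ fractions of the nominal value are assumptions made for illustration.

```python
import random


def sample_parameters(nominal, three_sigma_fraction, rng):
    """Draw one normally distributed sample per parameter."""
    sampled = {}
    for name, value in nominal.items():
        # A 3-sigma spread of x% of nominal means sigma = x% * |nominal| / 3.
        sigma = abs(value) * three_sigma_fraction[name] / 3.0
        sampled[name] = rng.gauss(value, sigma)
    return sampled


def monte_carlo_delays(delay_model, nominal, three_sigma_fraction,
                       iterations=500, seed=0):
    """Evaluate delay_model on `iterations` independent parameter samples."""
    rng = random.Random(seed)
    return [delay_model(sample_parameters(nominal, three_sigma_fraction, rng))
            for _ in range(iterations)]


def toy_delay_model(p):
    """Hypothetical stand-in for a SPICE run: delay grows with L, shrinks with VDD."""
    return 180e-12 * (p["L"] / 180e-9) * (1.8 / p["VDD"])


nominal = {"L": 180e-9, "VDD": 1.8}
spread = {"L": 0.15, "VDD": 0.10}
delays = monte_carlo_delays(toy_delay_model, nominal, spread)
print(len(delays), min(delays), max(delays))
```

The resulting list of delays is what the mean, variance, and normalized variability would then be computed from.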


5.5 Delay Element Analysis and Yield Results
5.5.1 Transmission Gate Based Delay Element

A transmission gate (T-gate) is a bidirectional switch consisting of a parallel
connection of an NMOS and a PMOS transistor that are controlled by complementary
control signals, as shown in Fig. 5.1(a). The NMOS and PMOS transistors pass logic
0 and 1, respectively, without degradation. By keeping the two transistors always on
($S$ = 1.8 V and $\bar{S}$ = 0 V in Fig. 5.1(a)), the transmission gate acts as a delay element.

Delay

The delay of a transmission gate is effectively determined by the time to charge or
discharge a load capacitance $C_L$ at its output through the equivalent resistance $R_{eq}$.
The output voltage is $V_{out}(t) = (1 - e^{-t/R_{eq}C_L})V_{DD}$. The propagation delay, which is
the time taken for $V_{out}(t)$ to reach $V_{DD}/2$, i.e., $V_{out}(t_p) = V_{DD}/2$, is given by [44]:

$$t_p = \ln(2)\,R_{eq}C_L = \ln(2)\,\frac{2V_{DD}}{k_n(V_{DD}-V_{Tn})^2 + k_p(V_{DD}-|V_{Tp}|)^2}\,C_L. \qquad (5.1)$$

Here $V_{Tn}$ and $V_{Tp}$ denote NMOS and PMOS transistor threshold voltages, respectively,
and $k_n$ and $k_p$ denote gain factors (which are proportional to the ratio of width
over length, $W/L$) of the two transistors. For a given fan-out, delay can be increased,
compared to that of a minimum-sized transmission gate, by increasing $L$ of the
transistors, which linearly increases $R_{eq}$. Delay may be decreased by increasing $W$ of the


Figure 5.1. Schematic diagram of a transmission gate.

transistors, which decreases $R_{eq}$; however, this effect is limited because the diffusion
(or junction) capacitance also increases, which adds to the load capacitance $C_L$.

A chain of $n$ transmission gates has a delay of [44]:

$$t_{p(chain)} = \ln(2)\,R_{eq}C_L\,\frac{n(n+1)}{2}, \qquad (5.2)$$

where $C_L$ is the load capacitance at the output of each transmission gate. Therefore,
delay increases quadratically with the number of transmission gates in the chain and
hence with area. An MCS with 500 iterations was performed by varying the gate
length, width, supply voltage VDD, and threshold voltage within the ranges given in
Table 5.1. The delay distribution plot when gate length, width, VDD, and VT are
varied is given in Figure 5.2. As can be seen from Figure 5.2, the delay values are
distributed around a mean value of 180 ps, with a significant number of delay values
between 160 and 200 ps. The yield of the transmission gate delay element was found
to be 97.8% and 95.8% when VDD variation is 10% and 20%, respectively, and gate


length is varied by 10%. All other parameters were fixed to their nominal values. This
shows that the yield of the transmission gate is affected significantly by supply
voltage variation.
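
As a rough numerical illustration of Eqs. 5.1 and 5.2, the following Python sketch evaluates the first-order delay expressions. The supply and threshold voltages follow Table 5.1, while the gain factors `kn`, `kp` and the load capacitance are hypothetical values chosen only to exercise the formulas; the function names are likewise assumptions.

```python
import math


def tgate_delay(c_load, vdd, vtn, vtp, kn, kp):
    """First-order transmission gate delay, following Eq. 5.1."""
    drive = kn * (vdd - vtn) ** 2 + kp * (vdd - abs(vtp)) ** 2
    r_eq = 2.0 * vdd / drive          # equivalent resistance R_eq
    return math.log(2) * r_eq * c_load


def tgate_chain_delay(n, c_load, vdd, vtn, vtp, kn, kp):
    """Delay of a chain of n identical transmission gates, following Eq. 5.2."""
    return tgate_delay(c_load, vdd, vtn, vtp, kn, kp) * n * (n + 1) / 2.0


# kn, kp (A/V^2) and c_load (F) are hypothetical, not extracted from the process.
params = dict(c_load=10e-15, vdd=1.8, vtn=0.445, vtp=-0.44, kn=200e-6, kp=200e-6)
for n in (1, 2, 4):
    print(n, tgate_chain_delay(n, **params))
```

The printed values grow as n(n+1)/2, which is the quadratic dependence on chain length noted above.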

[Plot "Delay Distribution of the Tgate": delay (sec, ×10^-10) versus iteration number (0 to 500).]

Figure 5.2. Delay of transmission gate for different iterations of a MCS.

5.5.2 Cascaded Inverter Based Delay Element

A pair of cascaded inverters can also function as a simple delay element that delays

the input signal by an amount equal to the combined propagation delays of the two

inverters (see Figure 5.3).

Delay

The propagation delay of an inverter depends upon the time taken to (dis)charge
the load capacitance. An exact computation of this delay is nontrivial because of

the nonlinear dependence of the (dis)charging current on the output voltage. An


Figure 5.3. Schematic diagram of a cascaded inverter.

approximate expression is derived by using an average value of this current equal to
the saturation current of the PMOS (NMOS) transistor, given by:

$$I_{av} = \frac{k_p}{2}(V_{GS}-|V_{Tp}|)^2 = \frac{k_p}{2}(V_{DD}-|V_{Tp}|)^2 \approx \frac{k_p}{2}V_{DD}^2. \qquad (5.3)$$

The above holds since $V_{DD} \gg |V_{Tp}|, V_{Tn}$. Based on this $I_{av}$ value, the propagation
delay is as follows [44]:

$$t_p = \frac{1}{2}(t_{pLH}+t_{pHL}) = \frac{C_L}{2V_{DD}}\left(\frac{1}{k_p}+\frac{1}{k_n}\right), \qquad (5.4)$$

where $t_{pLH}$ and $t_{pHL}$ denote propagation delays for low-to-high and high-to-low
output transitions, respectively. The above expression is valid when the input signal
makes an abrupt transition from VDD to VSS or vice versa. The effect of a nonzero
input rise time $t_r > t_{pHL}$ on propagation delay $t_{pHL}$ is captured by the following


equation [44]:

$$t_{pHL(actual)} = \sqrt{t_{pHL(step)}^2 + (t_r/2)^2}. \qquad (5.5)$$
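
The two inverter delay expressions can be evaluated with a small Python sketch. It assumes the step-input delay of Eq. 5.4 and the root-sum-square slope correction of Eq. 5.5; the function names and the numeric inputs are hypothetical and only illustrate how the input rise time adds to the step-response delay.

```python
import math


def inverter_step_delay(c_load, vdd, kn, kp):
    """Average propagation delay for an abrupt input transition (Eq. 5.4)."""
    return c_load / (2.0 * vdd) * (1.0 / kp + 1.0 / kn)


def delay_with_input_slope(t_step, t_rise):
    """Delay corrected for a nonzero input rise time (Eq. 5.5)."""
    return math.sqrt(t_step ** 2 + (t_rise / 2.0) ** 2)


# Hypothetical values: 10 fF load, 1.8 V supply, equal gain factors.
t_step = inverter_step_delay(10e-15, 1.8, 200e-6, 200e-6)
print(t_step, delay_with_input_slope(t_step, 50e-12))
```

For a step input (zero rise time) the corrected delay reduces to the step-response delay, as expected.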

An MCS with 500 iterations was performed by varying the gate length, width,
supply voltage VDD, and threshold voltage within the ranges given in Table 5.1. The
delay distribution plot when gate length, width, VDD, and VT are varied is given in
Figure 5.4. As can be seen from Figure 5.4, the variation in delay for the cascaded
inverter occurs over a small range of 20 ps, as compared to that of the transmission
gate, where the delay variation is spread over 40 ps. The delay of a cascaded inverter
depends on the ratio of the gate length to VDD, which leads to the small delay
variation. The yield of the cascaded inverter delay element was found to be 100% and
99.8% when VDD variation is 10% and 20%, respectively, with gate length variation
set to 10%.

5.5.3 NP-Voltage Controlled Delay Element

An NP-voltage-controlled delay element is shown in Fig. 5.5(a). It consists of a
cascaded inverter pair with an additional series-connected NMOS and PMOS transistor
in the pull-down and pull-up networks of each inverter, controlled by global control
voltages Vn and Vp; varying these voltages changes the delay of the element.


 

 

[Plot "Delay Distribution of Cascaded Inverter": delay (sec) versus Monte Carlo iteration number.]

Figure 5.4. Delay of cascaded inverter for various Monte Carlo iterations.

Delay

The delay for this element can be altered by changing the control voltages $V_n$ and
$V_p$. One advantage is that the delay can also be adjusted post-fabrication. The
(dis)charging takes place through a controlled transistor. The propagation delay of
this element is:

$$t_p = \frac{1}{2}(t_{pLH}+t_{pHL}) = \frac{C_LV_{DD}}{2}\left(\frac{1}{k_pV_p^2}+\frac{1}{k_nV_n^2}\right). \qquad (5.6)$$

Note that in this case, $t_p$ is inversely proportional to both $V_p^2$ and $V_n^2$. Both $V_n$
and $V_p$ are fed from a stable source, so they do not vary. An MCS with
500 iterations was performed by varying the gate length, width, supply voltage
VDD, and threshold voltage within the ranges given in Table 5.1. The nominal values of the


 

 

Figure 5.5. Schematic diagram of a NP-voltage cascaded inverter.

PMOS and NMOS widths were 4 µm and 2 µm, respectively. The delay distribution
plot is given in Figure 5.6. Since parameter variations in both the stacked transistors
affect delay, the variation is larger than that of the transmission gate. The
yield of the NP-voltage controlled delay element was found to be 92.2% and
84.8% when VDD variation is 10% and 20%, respectively, and gate length variation is
10%.

5.6 Comparison of Delay Elements
In this section, we present results and analyze the delay sensitivity of the delay

elements to various parameters.


[Plot "Delay Distribution of NP-Voltage Controlled Delay Element": delay (sec, ×10^-10) versus Monte Carlo iteration number.]

Figure 5.6. Delay distribution of NP-voltage controlled delay element for various
Monte Carlo iterations.

5.6.1 Effect of VDD and Gate Length Variation

The MCSs were run with gate width and threshold voltage fixed, while length and
supply voltage were randomly varied. The 3σ variation of length was fixed at 10%,
while 10% and 20% VDD variations were applied. Table 5.2 presents the mean, yield, and
3σ variability for both 10% and 20% VDD variation. The NP-voltage delay element has the
maximum variability of 28.5% among the three delay elements considered. The cascaded
inverter is the most robust delay element, as it has an almost 100% yield.

 

 

 

 

 

 

 

 

 

 

 

                 Mean μ (ps)      3σ (%)         Yield (%)
Delay element    10%     20%      10%    20%     10%    20%
Trans. gate      167.2   164.8    12.4   13.5    97.8   95.8
Cas. Inv.        225.3   224.6    7.9    8.43    100    99.8
NP-Volt.         193.5   194.6    21.1   28.5    92.2   84.8

Table 5.2. Mean delay and variability of the delay elements when VDD variation is
10% and 20%, and gate length variation is 10%.


 

5.6.2 Effect of VDD and Width Variation

We performed another set of experiments with gate length and threshold voltage fixed,
while gate width and supply voltage were varied. Table 5.3 shows the delay results
when VDD and transistor width have 10% variation, with length held constant. This table
shows that the yields of the transmission gate, cascaded inverter, and NP-voltage
controlled delay element increase, remain the same, and decrease, respectively, when
width varies randomly, as compared to random variation in length. The NP-voltage
delay element is more sensitive to width variation, while the transmission gate is more
sensitive to variation in gate length.

 

Delay element    Mean μ (ps)    3σ (%)    Yield (%)
Trans. gate      165.54         11.9      98.2
Cas. Inv.        253.52         8.6       100
NP-Volt.         193.1          20.7      85.6

Table 5.3. Mean delay and variability of the delay elements when VDD and gate width
variation are 10%.

We now summarize the area, power, and signal integrity characteristics of the
three different types of delay elements and offer some suggestions for choosing the

appropriate delay element.

• The advantage of the transmission gate is its small area overhead and power
dissipation. However, it has poor signal integrity, which is defined as the maximum
of the rise and fall times. Moreover, the signal integrity and power consumption of
the transmission gate degrade rapidly when producing large delays, and hence the
transmission gate is suitable for delays within 200-300 ps. The signal integrity


of transmission gate can be improved using Schmitt trigger, but it increases the
area and power overhead signiﬁcantly. Moreover, process variation can affect
the Schmitt trigger too. Process variation also introduces uncertainty in rise

and fall time, which can cause further delay variations.

• The cascaded inverter consumes more area and power than the transmission
gate. Its signal integrity is good for delay values of less than 500 ps. As the
cascaded inverters have the highest yield, they can be used to construct delay
chains with intermittent Schmitt triggers to provide higher delays. Some
variations of the cascaded inverter, such as replacing each inverter with a
cascoded version with multiple PMOS and NMOS transistors in the pull-up and
pull-down networks, can also be used to obtain higher delay values. However,
cascoded inverters have lower robustness than cascaded inverters,
due to the stacked transistors present in them.

• The NP-voltage controlled inverter's delay can be changed by altering its
controlling voltage. Based on our experiments, we find that it has the least
robustness to process variation and very poor signal integrity. It occupies a larger
area and consumes more power than cascaded inverters because
of the extra NMOS and PMOS control transistors. Because of its very poor signal
integrity response, process variation can introduce
more delay through high slew uncertainty.

Thus, we find that a cascaded inverter offers the best trade-off between area, power,
and robustness metrics among the three delay elements considered.


5.7 Control Signal Generation and Distribution from Delay Chain

A delay chain, which takes the system clock as an input and generates the control
signals used in the EM and EDR techniques, is constructed using cascaded inverters.
Each delay tap in the delay chain produces a control signal that is delayed from the
previous tap or clock input by 200 ps, with the final control signal delayed by 2 ns
from the clock input. The load driven by each tap in the delay chain is limited to
10-20 fF, so that the maximum delay variation at each delay tap is limited to ±10 ps
from the intended 200 ps. This is done by limiting the gate driven by each delay tap
to a minimum-sized inverter or AND gate.

The control signals from the delay taps are driven to the flip-flops using a buffer
chain. A single inverter with varying drive strengths is used as a buffer. The load
driven by the buffer chain, and hence its delay, depends on the number of flip-flops to
which the control signal distributed by the buffer chain is routed. The methodology
used for determining the load and delay of the buffer chains is summarized next.

1. Initially, only flip-flops in paths that have half the slack ($S/2$) greater than
the latching window time are chosen as candidates for error masking.

2. The sampling times $t_1'$ and $t_2'$ of the candidate flip-flops are determined based
on the slack ($S$) available in the path, using the steps explained in Chapter 3.

3. Then the initial $t_1''$ and $t_2''$ of the candidate flip-flops are determined based on
the control signal availability, such that $t_1'' \geq t_1'$ and $t_2'' = t_2' \pm \delta$.


4. Once the initial $t_1''$ and $t_2''$ are determined, the total number of flip-flops, and
hence the number of sampling transistors, driven by each delay tap is known. The total
capacitance for each delay tap is calculated as the sum of the gate capacitances of
the flip-flops driven by it. The interconnect capacitance is small compared
to the gate capacitance, as the delay chain is local to the CLB, and hence it is
ignored.

5. Based on the capacitance driven by each delay tap, a buffer chain that drives the
control signals with minimum delay is constructed using the method of logical
effort.

6. After the buffer chain construction, $t_1''$ and $t_2''$ are recalculated based on the
delay of the buffer chain and the availability of the control signals.

7. Control signals which are used to sample data before €- are generated by ANDing

CK and DCK as shown in Figure 3.4(ii).

We now demonstrate the construction of the buffer chain using logical effort.
Figure 5.7 shows the number of flip-flops driven by each delay tap for the largest ISCAS85
circuit, c7552. In each flip-flop, the control signals drive both a PMOS and an NMOS
transistor (sized 4/2 µm), whose total width is double the minimum inverter size.

Logical effort states that the minimum path delay occurs when each stage in the
path bears the same effort [47]. In logical effort, the unitless delay of a single stage
in a path is given by:


 

[Plot "Load Distribution of Delay Lines": FFs driven (0 to 25) versus time (0.2 to 2 ns), for delay lines DL1 and DL2.]

Figure 5.7. The number of flip-flops driven by each delay tap in the delay line of
c7552. Two separate delay chains, DL1 and DL2, are used to prevent soft errors from
occurring due to particle strikes on the delay chain itself.

$$f_i = g_i \times b_i \times h_i$$
$$d_i = f_i + p_i \qquad (5.7)$$

The parameters $f_i$ and $p_i$ represent the stage effort and the unitless parasitic delay
of that stage, respectively. The absolute value of the stage delay is $d_{abs} = d_i \times \tau$, where
$\tau$ is the technology time constant, defined as the average drive resistance of an
inverter multiplied by its input capacitance. Each stage effort is in turn a product of
the logical effort ($g_i$), branching effort ($b_i$), and electrical effort ($h_i$) of that stage.
Logical effort is a measure of a gate's drive strength relative to a minimum-sized
inverter in the same technology. Electrical effort is the ratio of a gate's output
capacitance to its input capacitance.


Branching effort is the ratio of total output capacitance to on-path capacitance. The
values of the parameters discussed above for a complete path are:

$$G = \prod g_i, \qquad B = \prod b_i, \qquad H = \prod h_i,$$
$$F = G \times B \times H,$$
$$P = \sum p_i,$$
$$D = \sum f_i + P. \qquad (5.8)$$

As all stages in a path bear the same effort for minimum delay, the total delay can
also be written as:

$$D = N F^{1/N} + P, \qquad (5.9)$$

where $N$ is the total number of stages in the path. The minimum delay is obtained
when the partial derivative of $D$ with respect to $N$ is zero. This leads to a relation
between the parasitic delay ($p_{buf}$) and the best stage effort ($\gamma$), which is given by
Equation 5.10 [47]:

$$p_{buf} = \gamma(\log_e(\gamma) - 1). \qquad (5.10)$$
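
Because $\gamma$ appears both inside and outside the logarithm, this relation has to be solved numerically or graphically. As an illustration only, the sketch below uses simple bisection, which is not the graphical lookup used in this work; the function name and search interval are assumptions. The inputs are the parasitic delay of 26.4 ps and the time constant of about 13.13 ps reported in this chapter.

```python
import math


def best_stage_effort(p_buf, low=1.0, high=50.0, iterations=100):
    """Solve p_buf = gamma * (ln(gamma) - 1) for gamma by bisection.

    The right-hand side is increasing for gamma >= 1, so one root lies in
    [low, high] as long as p_buf is between -1 and roughly 145.
    """
    def residual(gamma):
        return gamma * (math.log(gamma) - 1.0) - p_buf

    for _ in range(iterations):
        mid = 0.5 * (low + high)
        if residual(mid) < 0.0:
            low = mid
        else:
            high = mid
    return 0.5 * (low + high)


p_buf = 26.4 / 13.13      # unitless parasitic delay, about 2.01
print(best_stage_effort(p_buf))
```

The numerical root lands close to the stage effort of 4.32 that is read from the graph; the small difference reflects the resolution of a graphical lookup.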

As inverters are used as buffers, we determine the absolute value of $p_{buf}$ by fitting

a straight line through the FO1, FO2, FO3, and FO4 delays of an inverter and measuring


 

[Plot "Delay Vs Fanout": delay (ps) versus fanout for an inverter, with a linear fit.]

Figure 5.8. Delay versus fanout for an inverter in TSMC 0.18 micron technology.
The absolute value of the parasitic delay of an inverter is the Y-intercept of the line
shown, and has a value of 26.4 ps.

the Y-intercept of the line. The Y-intercept of the line, which is the delay of the
inverter for zero external load, is the required absolute value of $p_{buf}$.
The straight-line equation for an inverter in TSMC 0.18 micron is shown in Figure 5.8.

The unitless $p_{buf}$ is calculated by dividing its absolute value by $\tau$. As the value
of $\tau$ for TSMC 180 nm technology is approximately 13.13 ps, the unitless $p_{buf} \approx 2.01$.
Once the parasitic delay is known, Eq. 5.10 needs to be solved to obtain $\gamma$. As there
is no closed-form solution for Eq. 5.10, we solve it graphically by plotting the value of
$f(\gamma) = \gamma(\log_e(\gamma) - 1)$ versus $\gamma$. The value of $\gamma$ for which $f(\gamma)$ equals 2.01 is looked up
from the graph, as shown in Figure 5.9, and is found to be 4.32. Hence, a stage effort

of $f_i = 4.32$ results in minimum delay through an inverter chain in the technology we


 

[Plot: x(log(x) - 1) versus x.]

Figure 5.9. Graph used to find the best stage effort.

used. The total load at each delay tap is calculated based on the number of flip-flops
driven by that tap. The path electrical effort ($H$) is the ratio of the path load capacitance
to the input capacitance of a minimum-sized inverter. The logical effort ($G$) of the buffer
chain is one, as it is made up of inverters only. There is no branching in the path and
hence $B = 1$. The total path effort $F$ is calculated as shown in Eq. 5.8. The number
of stages, or inverters, in a path is calculated as $\log_{4.32} F$. We round the number of
stages up to the next highest integer. Thus a buffer chain is constructed and its delays
are simulated using HSPICE. The values of $t_1''$ and $t_2''$ are calculated by summing $t_1'$
and $t_2'$ with the delay of the buffer chain.
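
The stage-count calculation can be sketched in Python as follows. This is an illustrative estimate, not a substitute for the HSPICE simulation: it assumes that the first inverter is minimum sized, that each flip-flop presents a load of twice the minimum inverter input capacitance (as stated above), and that $G = B = 1$. The function name and the example of 20 flip-flops are hypothetical.

```python
import math


def buffer_chain(num_flip_flops, stage_effort=4.32, p_buf=2.01, tau_ps=13.13,
                 load_per_flip_flop=2.0):
    """Return (stages, effort per stage, estimated delay in ps) for one delay tap."""
    # Path effort F = G * B * H with G = B = 1, so F equals the electrical effort H.
    path_effort = load_per_flip_flop * num_flip_flops
    # Number of inverters: log base stage_effort of F, rounded up (at least one).
    stages = max(1, math.ceil(math.log(path_effort) / math.log(stage_effort)))
    # Equal effort per stage once the stage count is fixed.
    effort_per_stage = path_effort ** (1.0 / stages)
    # D = N * F^(1/N) + N * p_buf in units of tau (Eq. 5.9 with P = N * p_buf).
    delay_ps = stages * (effort_per_stage + p_buf) * tau_ps
    return stages, effort_per_stage, delay_ps


print(buffer_chain(20))
```

Rounding the stage count up lowers the effort per stage slightly below 4.32, which is why the per-stage effort is recomputed after the stage count is chosen.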


5.8 Conclusion

In this chapter, we analyzed the robustness of delay elements to process variation
and explained the methodology used to construct a delay and buffer chain for our
SER reduction techniques. A cascaded inverter was found to give a better yield under
process variation, since its delay is less sensitive to VDD and gate length variations. A
delay chain with a delay tap every 200 ps was constructed using cascaded inverters.
The delayed clock signals are then distributed using buffer chains. The construction
of buffer chains with the least delay was demonstrated using the method of logical
effort.


CHAPTER 6

Analysis and Design of Soft Error Hardened Latches

A previous study on the soft error vulnerability of flip-flops and scannable latches
considered latches designed without any explicit soft error protection [48]. As latches in
commodity applications are being increasingly protected against soft errors, new
soft-error hardened latch designs have been presented. In this chapter, we compare the
performance and power cost of the existing designs and also propose efficient latch
designs for soft error protection [49]. We use the following metrics to compare the
existing latch designs. (1) Robustness of latches to charge collection at their drain
nodes. We first investigate whether particle strikes on a latch change its output value.
If the latch output changes, we determine whether an error recovery function exists and
the time taken for error recovery. If the latch output does not change value, we check
whether it is held stable by a static or dynamic node. If the latch output is held stable by
a dynamic node, we study the effect of leakage on this stored value. (2) Soft error


protection against transient pulses originating from a combinational logic block (CLB). (3)
Robustness to single event multiple-upsets (SEMUs). (4) Setup time, hold time, and the
data-to-output (D-Q) delay. (5) Power overhead of the soft error hardened latches. (6) Issues
such as power and performance cost to be considered for system-level integration,
especially when using some of the latch designs for CLB protection.

The chapter is organized as follows. Section 6.1 explains the simulation setup
for measuring the critical charge, the latch delay, and the power consumption. In
Section 6.2, we analyze the various existing soft error hardened latches based on the
metrics presented above. Section 6.3 presents our proposed latch designs for soft
error immunity, some of which are affected by SEMUs on more than two nodes only.
Finally, we conclude in Section 6.4.

6.1 Simulation Methodology
6.1.1 Latch Delay and Power Calculation

All simulations were done in TSMC 0.18 micron technology with a supply voltage VDD
of 1.8 V. All the latches were designed with minimum sizes for the sake of comparison.
To calculate setup and hold time, all the latch outputs were connected to a
fanout-of-four (FO4) inverter load. The setup time $t_{setup}$ is defined as the minimum D-to-CK
offset that causes the data-to-output (D-Q) delay to be 5% higher than its nominal
value [50]. Based on this definition of setup time, the minimum clock cycle time when
flip-flop (FF) A is driving FF B is given by:

$$T \geq 1.05 \cdot t_{D-Q,A} + t_{Logic} + t_{setup,B} + t_{skew} \qquad (6.1)$$
99

The ﬁrst term of Eq. 6.1 accounts for the worst—case D-Q delay of FF A, when
data arrives exactly one setup time before the active clock edge. The second term,
tLogz-c, captures the worst case prOpagation delay through the combinational logic,
while the third parameter, tskew, captures the clock skew. The D-Q delay is the delay
measured from the active clock edge to the output. It depends on the clock slope
and the output load, apart from the D-CK offset. The clock slope was ﬁxed at 50
ps both in the rising and falling directions. The total delay of the latch is deﬁned as
the sum of D-Q delay (measured at the setup time) plus the setup time itself. The
total delay of a basic transmission gate latch, similar to the one in Figure 6.1, (but
without the explicit capacitance in node fb) was ﬁrst calculated. The delay values of
all other latches are reported after normalizing with the standard latch delay. The
power consumed by the latch is calculated as the average of the power consumed for
latching a logic 0 and 1. The energy values reported here are per clock cycle. Similar
to delay, the power values are normalized with respect to the transmission gate latch,

for ease of comparison.
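
As a small worked illustration of Eq. 6.1, the Python sketch below evaluates the minimum clock period from its four terms. The function name and the timing values (in ps) are hypothetical and are not measurements from this work.

```python
def min_clock_period(t_dq_a, t_logic, t_setup_b, t_skew):
    """Minimum clock cycle time when FF A drives FF B, following Eq. 6.1.

    The factor 1.05 reflects the 5% D-Q delay increase that defines setup time.
    """
    return 1.05 * t_dq_a + t_logic + t_setup_b + t_skew


# Hypothetical timing values in ps.
print(min_clock_period(t_dq_a=100.0, t_logic=500.0, t_setup_b=50.0, t_skew=20.0))
```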

 

Figure 6.1. Basic transmission gate latch used to normalize delay and power values
of other latch designs. The delay and power values were measured by connecting a

FO4 inverter at the latch output.


6.2 Comparison of Latch Designs
6.2.1 SEU Tolerant Latch

The schematic of an SEU tolerant latch is shown in Figure 6.2 [2]. The latch
stores data D at PP and NN, while $\bar{D}$ is stored at QP and QN. PP and QP are driven
only by PMOS transistors, while NN and QN are driven only by NMOS transistors.
This latch utilizes the fact that only a 0→1 flip can occur in a PMOS, and only a 1→0
flip can occur in an NMOS, due to which nodes QP and QN are soft error hardened.
A particle strike at either of the two nodes PP or NN does not change the output Q;
however, it causes node Q to become dynamic for a short period of time.

 

 

 

 

 

 

Figure 6.2. Schematic of single event upset tolerant latch.

This latch can be used to prevent soft errors due to particle strikes in combinational
logic. This requires inputs DP and DN to be fed from two different CLBs or delayed

by certain time interval. Delaying the PO to create DP and DN could introduce


performance overhead for the circuit. To avoid performance overhead such latches can
be used only in non-critical paths. The maximum transient pulse width tolerated by
using this latch in a path with slack S is S / 2. If the latch delay - t D-Q - is considered
then the transient pulse width tolerated reduces to (S —- tD_Q) / 2. Introduction of
time delay between DP and DN requires a delay chain inside each latch, which adds
to the overall system area and power overhead.
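The slack relation above can be checked numerically. The following is a minimal sketch, not part of the original work; the function name and the sample values in the usage note are illustrative assumptions.

```python
def max_tolerated_pulse_width(slack_ps, t_dq_ps=0.0):
    """Widest transient pulse (ps) tolerated in a path with the given slack.

    Follows the relation in the text: S/2 when the latch delay is ignored,
    (S - t_DQ)/2 when the D-Q delay of the latch is taken into account.
    """
    if slack_ps < 0 or t_dq_ps < 0:
        raise ValueError("slack and latch delay must be non-negative")
    usable = slack_ps - t_dq_ps
    # If the latch delay uses up all the slack, no pulse can be tolerated.
    return max(usable, 0.0) / 2.0
```

For example, a hypothetical path with 400 ps of slack would tolerate a 200 ps pulse when the latch delay is ignored, and a 150 ps pulse when a 100 ps D-Q delay is accounted for.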

The probability of an SEMU affecting more than two nodes is very low because of the
high particle energy required to cause multiple upsets. Hence, we consider SEMUs
occurring on two nodes only. There are six different node combinations which need to
be analyzed for SEMUs. The four node combinations PP-NN, QP-QN, PP-QP, and NN-QN
are not affected by SEMUs, since simultaneous logic flips cannot occur on these
node pairs. However, SEMUs occurring in the QN-PP or QP-NN combinations have the
potential to cause soft errors. By carefully spacing these nodes apart in the latch
layout, the critical charge required to cause an upset can be made quite high, and hence
the latch can be assumed to be SEMU hardened for most sea-level applications. The
delay of this latch was found to be five times the original latch delay, due to its high
D-Q delay. This latch consumes static power because some of the PMOS and NMOS
transistors do not turn off completely. This latch consumes 52% more power than the
original latch.
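The count of six node combinations follows from choosing two of the four storage nodes. As an illustrative sketch (the classification table simply encodes the pairs named in the text; the variable names are assumptions), the pairs can be enumerated as follows:

```python
from itertools import combinations

NODES = ("PP", "NN", "QP", "QN")

# Pairs that the text identifies as able to cause a soft error under an SEMU.
VULNERABLE = {frozenset(("QN", "PP")), frozenset(("QP", "NN"))}


def classify_pairs():
    """Map each two-node combination to True if it can cause a soft error."""
    return {pair: frozenset(pair) in VULNERABLE
            for pair in combinations(NODES, 2)}


pairs = classify_pairs()
safe = sorted("-".join(p) for p, bad in pairs.items() if not bad)
risky = sorted("-".join(p) for p, bad in pairs.items() if bad)
```

Only the pairs in `risky` need to be spaced apart in the layout; the four pairs in `safe` cannot flip simultaneously.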


6.2.2 Soft Error Hardened Latch Scheme for SoC

The schematic of the latch is shown in Figure 6.3 [51]. The latch stores D at node
DH, and its complement at nodes PDH and NDH. Node DH is kept static by either
transistor P1 or N1, depending on the value of input D, after CK becomes low. Whenever
a particle strike occurs at node DH, a glitch occurs at the latch output Q. For example,
when DH flips from 1→0, transistor N2 is cut off. But node PDH can be maintained
at zero by the parasitic capacitances, which enables P1 to pull DH back to logic one.
Therefore, transistors P1 and N2 should be sized larger than P8, such that DH is
pulled to logic one before P8 pulls PDH to one. The width of a glitch at the output
node due to a logic flip at DH is small, as node DH is restored to its correct value
within 50 ps, even at VDD = 0.8 V. When using this latch in a pipelined circuit, care
has to be taken that this glitchy output cannot cause an error in the next
stage. A particle strike at PDH does not affect nodes DH and Q. For example, in the
case of a 0→1 logic flip in PDH, transistor N2 pulls node PDH back to logic 0 and P1
also turns ON. Particle strikes on transistors P7, N3 and N7, P3 also do not change
DH, and hence the output value. It has been experimentally verified that soft errors
with charges up to 1000 fC are corrected by this latch [51].

This latch design can only handle SEUs on the latch itself. Any transient pulse
with width equal to the latching window of the modified latch could cause a soft
error if it is not logically or latching-window masked. The probability of soft errors
occurring due to particle strikes in the CLB is the same as for any original latch
without soft error hardening. Now we consider the possibility of an SEMU causing an
error. Let us again consider DH being held at logic 1 by P1. When a particle strike
flips both DH, from 1→0, and PDH, from 0→1, the output value flips permanently, until
new inputs arrive. Thus, soft errors can be caused by SEMUs on the latch. For both
DH and PDH to flip as mentioned, sufficient electrons have to accumulate around the N7
or N1 drain, or sufficient holes have to accumulate around P8. This can be avoided
by increasing the spacing between these transistors in the layout.

Figure 6.3. Schematic of soft error hardened latch.

The delay of this latch was reported to be 520% higher than that of the original
latch. The higher delay overhead is due to transistors P1 and N1 being sized bigger
to reduce SEUs at DH. The latch consumes 4-6% more power than the latch in
Figure 6.1 [51]. The SER reduction is 25x for neutrons and 99x for alpha particles
compared to the original latch.


6.2.3 Dual Interlocked Storage Cell

The schematic of the latch is shown in Figure 6.4 [52, 53]. The latch stores the
data values at D0a, D0b, D1a, and D1b, which are vulnerable to particle strikes.
Let us assume the latch stores logic 0 at D0a and D0b. When a particle strike flips
D0b from 0→1, N1a is enabled and thus D1a may reach an intermediate voltage
between logic 0 and 1. A glitch may also occur at output node Q because of the flip
in D0b. Nodes D0a and D1b maintain their values dynamically. As D1b stays at 1,
D0b discharges to 0 through N2b and N3b. The effect of leakage on the recovery
time was studied in [53]. The leakage effect was studied by connecting four current
sources of the same value to nodes D0a, D0b, D1a, and D1b. A significant increase in
SER was observed when the leakage current was 20% of the static current noise margin
(I_nm). Thus, this latch design is sensitive to leakage, but whether the leakage
observed is 20% of I_nm needs to be evaluated based on the process technology used.


Figure 6.4. Schematic of dual interlocked storage cell.


 

The dual interlocked storage cell (DICE) design does not mask transient pulses
originating in the CLB. If the input D is split and temporally separated before being
fed to the transmission gates driving D0a and D0b, the latch becomes unstable and the
output Q could flip to the wrong value. A variant of DICE, which uses a C-element
at each of its inputs to protect against SETs generated in CLBs, was presented in [54].
The C-element is fed both the normal and a delayed input, with the delay equal to the
width of the SET to be tolerated. This introduces a large performance penalty and a
high area overhead because separate delay chains are used in every flip-flop. This
latch is also vulnerable to SEMUs, because all the nodes are vulnerable to both 0→1 and
1→0 logic flips. An SEMU causes the stored value to flip permanently until a new input
is applied.
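The filtering action of a C-element fed with a signal and its delayed copy can be illustrated with a small discrete-time model. This is a behavioral sketch under simplifying assumptions (ideal delay, one sample per time step, hypothetical function names), not a model of the circuit in [54].

```python
def delayed(signal, delay):
    """Shift a sampled signal right by `delay` samples, padding with its first value."""
    if delay <= 0:
        return list(signal)
    return ([signal[0]] * delay + list(signal[:-delay]))[:len(signal)]


def c_element(a, b, initial=0):
    """Muller C-element: the output follows the inputs only when they agree."""
    out, state = [], initial
    for x, y in zip(a, b):
        if x == y:
            state = x  # inputs agree: update the stored value
        out.append(state)  # inputs disagree: hold the previous value
    return out


def filter_set(signal, delay):
    """Feed a signal and its delayed copy to a C-element."""
    return c_element(signal, delayed(signal, delay), initial=signal[0])


narrow = [0] * 5 + [1] * 3 + [0] * 12  # 3-sample glitch
wide = [0] * 5 + [1] * 6 + [0] * 9     # 6-sample pulse
```

With a delay of four samples, `filter_set(narrow, 4)` stays at 0 throughout, while `filter_set(wide, 4)` lets the wider pulse through. In this model, pulses no wider than the delay are suppressed, at the cost of delaying every legitimate transition by the same amount.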

The worst-case delay and power penalties of this latch compared to the standard
latch were reported to be 2-3% and 34%, respectively [53]. This latch provides a
10x SER reduction compared to the original latch.

6.2.4 Single Event Resistant Topology Latch

The schematic of the latch is shown in Figure 6.5 [55]. The clocked transistors and
buffers are not shown in the schematic for clarity. The data values are stored at both
Y0 and Y1, while their complements are stored at Y2 and Y3. The cross-coupling
of the transistors prevents an upset at any one of the four nodes from changing
the output value, and hence the relative sizing of the transistors does not matter,
unlike in the latches in Figure 6.2 and Figure 6.3. Let us consider an initial case in
which nodes Y0 and Y1 are both at 1 and Y2 and Y3 are at 0. If a particle strike flips
Y0 from 1→0, transistors N1c and N0d are disabled while P0d is enabled, which flips Y3
to 1. Y2 still retains 0 because N0c remains ON and P0c is disabled. As Y2 is still at 0,
P0a charges Y0 and brings it back to 1, which makes Y3 go low. Thus, the initial
state of the latch is restored. A glitch could possibly occur at the output of the latch.


Figure 6.5. Schematic of single event resistant topology.

The proposed latch can also be used to protect against transient faults originating in
the CLB. This was reportedly done by duplicating the CLB and connecting the output of
each block to the right (b, d) and left (a, c) sections of the latch shown in
Figure 6.5 [55]. This could also be achieved by temporal separation of the PO signals,
using the original and delayed versions to drive the right and left sections. This
latch is sensitive to SEMUs. For example, a particle strike that flips both Y0 and Y1
from 1→0 will change the stored data permanently without any chance of recovery,
until the next input is applied. Similar to previous approaches, the nodes Y0-Y3 can
be spaced apart to reduce the probability of SEMUs.

The delay of this latch was found to be 37% higher than the original latch. The
power consumed by the latch was found to be 70% of the original latch due to the
complex interconnection structure for holding the data values.

6.2.5 Other Latch Designs


Figure 6.6. Hardening of the feedback node in a latch. (a) Feedback node fb hardened
by adding explicit capacitances. (b) Feedback node hardened by duplicating feedback
inverters.

A few other designs which try to harden latches against soft errors have been
presented. A latch design which hardens the feedback node (fb) in a latch was presented
in [35]. The schematic of the latch is shown in Figure 6.6(a). An explicit gate
capacitance is added to fb to increase the charge required to flip the node. This
reduces the SER of the latch by 2x, but degrades the speed of the latch. Another
latch design which hardens the feedback node was presented in [56]. The schematic
of the latch is shown in Figure 6.6(b). This latch uses two inverters, I1 and I2, in the
feedback path instead of one. A particle strike on INT1 or INT2 alone cannot produce
an upset. However, a particle strike on Q can produce a soft error. Also, SEMUs are
not tolerated by this latch.

6.3 New Latch Designs with Soft-Error Immunity

We propose new latch designs which utilize some of the techniques used in asynchronous
circuits. The schematics of the two basic latch designs are shown in Figure 6.7(a)
and (b). Latch A stores data at nodes D0 and D1, which are held static
(after CK goes high) by transistors P2 and N2, respectively. This means nodes D0
and D1 do not reach true 0 and 1 (Vdd). Due to this, transistors P3 and N3 are not
enabled completely. In latch B, the nodes D0 and D1 (which store data) are held
static by N2 and P2, with the help of inverters I1 and I2, which means P3 and N3
are enabled completely. Therefore, latch A has a higher delay compared to latch B,
while latch B consumes more power due to the inverters and the full voltage swing at
nodes D0 and D1. Both latches are negative-level sensitive, that is, open when the
clock is low. The vulnerability of the latches to particle strikes is analyzed by
looking at the effect of charge collection at the data storage nodes. We only consider
transistors P2 and N2, as only these transistors have their drain/source nodes
connected to D0/D1. In latch A, charge collection at D0 or D1 can only turn P3 and N3
off, respectively. In latch B, charge collection at D0 or D1 can only turn P3 and N3
ON. This leads to different vulnerabilities for latches A and B.

Figure 6.7. (a) Latch A vulnerable only to SEMUs. (b) Latch B having lower delay
and higher power consumption than latch A, but vulnerable to SEUs.

An SEU in latch A, at either D0 or D1, only makes D’ dynamic, as transistors P3
and N3 can only be disabled. Therefore, latch A is susceptible only to SEMUs, while
in latch B a soft error can occur due to an SEU at either D0 or D1. A particle strike at
D0 or D1 in latch B can cause D’ to reach an intermediate voltage between 0 and
Vdd, which could lead to a wrong value at output Q. In latch A, an SEMU at either D’
and D0 or D’ and D1 could cause a soft error. An SEMU at nodes D’ and D0 or D’
and D1 in latch B creates a temporary 1→0 or 0→1 glitch at node Q, which finally
settles at a voltage between 0 and VDD. A stick diagram for the layout of latch A, in
which the SEMU-vulnerable nodes are spaced apart, is shown in Figure 6.8.

The delays of latch A and latch B were found to be 1.12x and 0.67x that of the latch
shown in Figure 6.1, respectively. As analyzed before, latch B has lower delay and
lower recovery time for a particle strike. However, the power consumed by latch B was
found to be 5x that of the original latch, while latch A consumes just 40% of the power
of the original latch.

Figure 6.8. Stick diagram for layout of latch A, such that nodes D’-D0 and D’-D1 are
spaced apart with minimum area overhead.

The two basic latch configurations A and B are presented to explain the concept of the
new latch designs. Latch designs which have more redundancy and provide increased
soft error tolerance compared to latches A and B, without too much area and
power overhead, are shown in Figure 6.9(a) and (b). Again we analyze the susceptibility
of the latches by studying the effect of charge collection around the data storage
nodes D0-D5. Simultaneous charge collection on both nodes D4 and D5 can result
in a voltage flip at output node Q. As nodes D4 and D5 are restored to their initial
values after a short duration, only a small glitch results at Q. SEMUs on nodes D4 and
D0 or D5 and D3 can only result in output node Q reaching a voltage between 0 and
VDD. Latch C is not susceptible to simultaneous charge collection at any combination
of two or three data nodes. Only an SEMU on four nodes could lead to a logic flip at
node Q. Thus latch C is much less vulnerable than latch A. Particle strikes
which cause an SEMU at nodes Q and D4 or Q and D5 only cause a glitch at output
node Q. As both the node combinations Q-D4 and Q-D5 are kept static all the time,
the probability of such a glitch occurring is low. Thus an SEMU in latch C could either
lead to Q being at an intermediate voltage between 0 and VDD, or cause a glitch, or,
in the worst case of an SEMU at four nodes, cause a complete logic flip. To keep node Q
from reaching an intermediate voltage between 0 and VDD, a layout of latch C with the
data nodes spaced apart is shown in Figure 6.10.

Figure 6.9. (a) Latch C having best power, performance, and soft error immunity.
(b) Latch D can be used to provide soft error protection for CLBs also.

In the case of latch D, shown in Figure 6.9(b), charge collection at the data nodes
D0-D3 only turns off P5, P6 and N5, N6. Therefore, an SEMU on only two nodes, such
as D0 and D1 or D2 and D3, does not result in an error at output Q, but it turns D’
into a dynamic node. Leakage in transistors N5 and N6 should be considered when
using this latch configuration. An error in latch D can occur when an SEMU changes
the values stored in D’, D0, and D1, or in D’, D2, and D3. Spacing these nodes apart
would make these latches SEMU resilient.

Figure 6.10. Stick diagram for layout of latch C, such that nodes D4-D0 and D5-D2
are spaced apart with minimum area overhead.

As most of the new latch designs are susceptible only to SEMUs, the actual SER
reduction compared to the original latch is difficult to calculate through simulation.
Hence, we have presented only analytical results for the soft error tolerance of these
latches. The new latch designs, apart from being vulnerable only to SEMUs, also have
significantly lower area and power cost than the earlier latches surveyed. Latches C
and D can be customized according to application requirements for speed and power.
Latch D can also be used to provide soft error protection against transient faults
in the CLB. This can be done by creating temporally separated signals D and D’ and
driving, for example, D0-D2 with D and D1-D3 with D’. The latches in Figure 6.7,
designed with only two nodes, D0 and D1, for storing data, could be used when soft
error protection for CLBs is not necessary. Table 6.1 presents the delay and power
overhead of the proposed designs compared to the standard latch.

The delay of latch C is 1.2x that of the standard latch and that of latch D is 2.3x.
Latch C also consumes just 74% of the power of the original latch. Hence, latch C is the
best configuration considering power, performance, and delay. But latch D, which
consumes just 67% of the original power, can be used to provide protection for CLBs also.

Latch type        Delay (ps)       Power (µW)
Standard Latch    194.92 (1x)      55.7 (1x)
Latch C           239.45 (1.2x)    41.32 (0.74x)
Latch D           455.37 (2.3x)    37.3 (0.67x)

Table 6.1. Delay and power overhead of the proposed latch designs.
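The normalized figures quoted for Table 6.1 can be cross-checked with a few lines of arithmetic. The sketch below assumes a standard-latch power of 55.7 µW and a latch D delay of 455.37 ps, which are the values implied by the quoted ratios; the dictionary layout and function name are illustrative.

```python
# (delay in ps, power in µW); values assumed as described in the lead-in.
LATCHES = {
    "Standard Latch": (194.92, 55.7),
    "Latch C": (239.45, 41.32),
    "Latch D": (455.37, 37.3),
}


def relative_to(latches, reference="Standard Latch"):
    """Return {name: (delay ratio, power ratio)} rounded to two decimals."""
    ref_delay, ref_power = latches[reference]
    return {
        name: (round(delay / ref_delay, 2), round(power / ref_power, 2))
        for name, (delay, power) in latches.items()
    }


ratios = relative_to(LATCHES)
```

Under these assumptions the two-decimal ratios come out as about 1.23x and 0.74x for latch C and about 2.34x and 0.67x for latch D, consistent with the rounded figures in the table.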

6.3.1 Customizing Latches for Performance and Power Requirements

The latches presented in Figure 6.9 can be customized based on power and speed
requirements. Latch designs which need to be optimized for speed could use storage
nodes similar to those of latch B, while those optimized for power could use storage
nodes similar to those of latch A. For example, to improve speed at the cost of
increased power, inverters can be added at selected data nodes of latches C and D. Two
such configurations are shown in Figure 6.11. The speed and power consumption of the
customized latches are shown in Table 6.2. Latches E and F have lower delay and higher
power than latches C and D, respectively. Both latches E and F are vulnerable only to
SEMUs on more than two nodes. Hence, the soft error vulnerability of these
latches is similar to that of C and D.

 

Latch type    Delay (ps)    Power (µW)
Latch E       172.6         260.64
Latch F       429.20        707

 

 

 

 

 

 

 

Table 6.2. Delay and power overhead of the customized latch designs.

Figure 6.11. (a) Latch C customized for reduced delay with higher power cost.
(b) Latch D can be used to provide soft error protection for CLBs also.

6.4 Conclusion

In this chapter, we analyzed existing latch designs for their soft error vulnerability
and their power and performance overheads. The latch design presented in [51] was
found to provide a good trade-off between power, performance, and soft error
protection. However, this latch cannot be used for soft error protection in CLBs. We
also proposed new latch designs, the best of which is vulnerable only to SEMUs, has a
performance penalty of 1.12x, and consumes only 40% of the power of a
standard transmission gate latch. Latches C and D can be customized according
to application requirements for speed and power. Latch D can also be used to provide
soft error protection against transient faults in the CLB. This can be done by creating
temporally separated signals and driving D0-D3 separately.


CHAPTER 7

Conclusion

In this dissertation, we presented our research on soft-error modeling and mitigation
techniques for logic circuits. Performance-, power-, and area-efficient techniques to
reduce the soft-error vulnerability of both combinational and sequential logic circuits,
along with an LUT-based methodology to estimate the SER reduction of these
techniques, were explained in depth. The important contributions and results from this
research are summarized below.

7.1 Key Contributions

In Chapter 2, we described a fast and accurate LUT-based methodology to calculate
both the SET width due to particle strikes and the SER reduction for
time-redundancy-based error mitigation techniques. Previous techniques for SET width
calculation use complex expressions or large LUTs and have greater than 15% error for
inputs not close to pre-characterized points. We studied the sensitivity of an SET to
various gate and circuit characteristics, and determined the parameters to be used,
their spacing, and their lower and upper bounds for constructing the LUT. The LUT uses
non-uniform spacing and surface-based interpolation between its indices to obtain the
SET width generated at a gate and at a primary output. We found the LUT to provide
a speedup of more than 1000 times compared to HSPICE simulations, with less than
10% error.
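To illustrate the idea of a look-up table with non-uniformly spaced indices, the following sketch performs bilinear interpolation on a two-dimensional table. Bilinear interpolation is used here only as a simple stand-in for the surface-based interpolation mentioned above; the axes (collected charge and load capacitance) and all table values are hypothetical placeholders, not characterization data from this work.

```python
from bisect import bisect_right


def _bracket(axis, x):
    """Index i of the grid cell [axis[i], axis[i+1]] used for x (clamped to the table)."""
    i = bisect_right(axis, x) - 1
    return min(max(i, 0), len(axis) - 2)


def lut_lookup(x_axis, y_axis, table, x, y):
    """Bilinear interpolation in a 2-D LUT with non-uniformly spaced axes.

    Points outside the table are extrapolated linearly from the nearest cell.
    """
    i, j = _bracket(x_axis, x), _bracket(y_axis, y)
    x0, x1 = x_axis[i], x_axis[i + 1]
    y0, y1 = y_axis[j], y_axis[j + 1]
    tx = (x - x0) / (x1 - x0)
    ty = (y - y0) / (y1 - y0)
    low = table[i][j] * (1 - tx) + table[i + 1][j] * tx
    high = table[i][j + 1] * (1 - tx) + table[i + 1][j + 1] * tx
    return low * (1 - ty) + high * ty


# Hypothetical axes: denser sampling where the response changes quickly.
charge_fc = [10, 20, 50, 100]
load_ff = [1, 2, 5]
width_ps = [
    [100, 80, 50],
    [180, 150, 100],
    [320, 280, 200],
    [450, 400, 300],
]
```

A query such as `lut_lookup(charge_fc, load_ff, width_ps, 15, 1.5)` returns the average of the four surrounding entries (127.5 in this made-up table), and a query at a grid point returns the stored entry itself.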

In Chapter 3, we presented an efﬁcient and systematic error masking (EM) tech-
nique that can be applied to combinational logic circuits which have a signiﬁcant
fraction of non-critical paths with sufficient slack. This error masking technique pre-
vents an SET pulse of width less than approximately half of the slack available in the
propagation path from latching and turning into a soft error, without any performance
overhead. Previous techniques incur a performance overhead of 2112 for masking an
SET pulse of width 20. We control ﬂip-ﬂops only in paths with sufficient slack which
ensures that the delay increase caused by the addition of majority voter and control
transistors to the flip-flops does not affect the timing of the overall circuit.
Additionally, our technique uses a single delay chain which produces phase-shifted signals
used to sample POs. The results obtained on ISCAS85 benchmark circuits show an
average SER reduction of 82.67% from the original unprotected circuit.
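The masking effect of sampling a primary output at phase-shifted instants and voting on the samples can be sketched behaviorally. The model below assumes three equally spaced samples and an ideal rectangular SET pulse; these choices, and all names and numbers, are illustrative assumptions rather than details of the circuit in Chapter 3.

```python
def majority(a, b, c):
    """Majority of three sampled bits."""
    return (a & b) | (b & c) | (a & c)


def po_value(t, correct, pulse_start, pulse_width):
    """Primary-output value at time t: inverted while the SET pulse is present."""
    in_pulse = pulse_start <= t < pulse_start + pulse_width
    return correct ^ 1 if in_pulse else correct


def voted_output(correct, pulse_start, pulse_width, t0, spacing):
    """Sample the PO at t0, t0 + spacing, t0 + 2*spacing and take the majority."""
    samples = [
        po_value(t0 + k * spacing, correct, pulse_start, pulse_width)
        for k in range(3)
    ]
    return majority(*samples)
```

In this model a pulse no wider than the sample spacing can corrupt at most one of the three samples, so the voted value remains correct. Since the last sample is taken two spacings after the first, the spacing, and hence the maskable width, is bounded by about half of the available slack.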

In Chapter 4, we presented a design technique to combine error masking with
error detection and recovery (EDR). This technique can be used to improve the
reliability of a circuit without a sufficient number of non-critical paths for applying
the EM technique. The EDR technique tolerates transient pulses with width up to
half a clock cycle period. In case a soft error occurs (a very low-probability event
for an application run) and is detected, recovery can be completed within a single
clock cycle. The results obtained show an average SER reduction of 93.78% for
the EM+EDR method. Apart from the EDR technique, we also described simple
and efficient methods, using input vector characteristics and time borrowing in
latch-based pipeline circuits, to improve the SER reduction obtained from the error
masking technique. Both soft-error mitigation techniques presented in this dissertation
can be used to reduce transient faults caused not only by particle strikes, but also by
cross-talk or power supply noise.
In Chapter 5, we explained the construction of the delay chain used in the EM and
EM+EDR techniques. We first analyzed the robustness of three different families
of delay elements to process variation using Monte Carlo simulations in HSPICE.
A cascaded inverter was found to give a better yield under process variation, since its
delay is less sensitive to VDD and gate length variations. A delay chain with a delay
tap every 200 ps was constructed using cascaded inverters. The delayed clock signals
are then distributed using buffer chains. The construction of buffer chains with the
least delay was demonstrated using the method of logical effort.
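As a reminder of how the method of logical effort sizes a buffer chain, the sketch below searches for the number of inverter stages that minimizes the normalized path delay N(F^(1/N) + p_inv), where F is the ratio of load to input capacitance. It assumes an unbranched chain of inverters (logical effort 1) with parasitic delay p_inv = 1; the capacitance values in the usage note are hypothetical, and signal polarity is not modeled.

```python
def best_inverter_chain(c_load, c_in, p_inv=1.0, max_stages=20):
    """Return (stages, stage_effort, delay) minimizing delay in units of tau."""
    if c_load <= 0 or c_in <= 0:
        raise ValueError("capacitances must be positive")
    path_effort = c_load / c_in
    best = None
    for n in range(1, max_stages + 1):
        stage_effort = path_effort ** (1.0 / n)
        delay = n * (stage_effort + p_inv)
        if best is None or delay < best[2]:
            best = (n, stage_effort, delay)
    return best


def stage_input_caps(c_in, stage_effort, stages):
    """Input capacitance of each stage when every stage bears equal effort."""
    return [c_in * stage_effort ** k for k in range(stages)]
```

For a hypothetical load 64 times the input capacitance, `best_inverter_chain(64, 1)` selects three stages with a stage effort of about 4 and a delay of about 15 tau, and `stage_input_caps` then gives input capacitances of roughly 1, 4, and 16 units.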

In Chapter 6, we analyzed existing latch designs for their soft-error vulnerability
and overheads. The latch proposed in [51] was found to provide a good trade-off between
power, performance, and soft-error robustness. However, this latch cannot be used
for soft-error protection of CLBs. We also proposed new latch designs, the best of
which is vulnerable only to SEMUs, has a delay overhead of just 12%, and consumes
only 40% of the power of a standard transmission gate latch. Further, we presented
efficient approaches to lay out the proposed latch designs, such that vulnerability to
SEMUs can be reduced. Finally, we presented two configurations of the proposed latch
designs, customized according to application power and performance requirements.

Our SER mitigation work represents a significant advancement over previous approaches
which, in contrast, rely on introducing explicit hardware or time redundancy
or on redundant computation, often both. Consequently, our methods provide substantial
energy and performance/hardware advantages.

7.2 Future Work

Three potentially fruitful directions for future research are briefly outlined next.

1. The delay chain and the distribution of the control signals contributed the largest
   share of the power overhead. The power overhead can be reduced using a
   low-voltage-swing delay chain and control signals. The low-voltage-swing control
   signals should be converted to full 0→VDD swing, using efficient level converters,
   before they are fed to the flip-flops. This is necessary to ensure correct
   operation of the flip-flops. Therefore, generation and distribution of low-voltage
   control signals should be investigated.

2. The distribution of slack for improving error masking using time borrowing was
   performed between two successive pipeline stages in Chapter 4. The potential of
   this technique can be increased further by doing slack redistribution across the
   entire pipeline. Future work can consider slack redistribution across pipeline
   stages and explore the feasibility of obtaining a globally optimal solution for
   this problem.

3. The SERs of logic circuits vary by orders of magnitude depending upon the
   application being executed. Efficient and accurate SER characterization of
   logic circuits for various applications can be performed. This will be useful
   in customizing SER reduction techniques to lower overhead based on the most
   frequent application executed. Such customized techniques are beneficial for
   embedded processors, most of which are used only in single-application systems.


 

BIBLIOGRAPHY

[1] Semiconductor Industry Association, “The international technology roadmap for
semiconductors,” http://www.itrs.net/Common/2004Update/2004Update.htm, 2004.

[2] K.J. Hass, J.W. Gambles, B. Walker, and M. Zampaglione, “Mitigating single event
upsets from combinational logic,” in Proc. 7th NASA Symposium on VLSI Design.
1998, NASA.

[3] P. Hazucha, Background radiation and soft errors in CMOS circuits, Ph.D. thesis,
Linkoping University, Sweden, 2000.

[4] K. Bernstein, “High speed CMOS logic responses to radiation-induced upsets,” in
Designing Robust Circuits and Systems with Unreliable Components. Workshop, 2002.

[5] M. J. Gadlage et al., “Single Event Transient Pulsewidths in Digital Microcircuits,”
IEEE Transactions on Nuclear Science, vol. 51, no. 6, pp. 3285-3290, 2004.

[6] Premkishore Shivakumar, Michael Kistler, Stephen W. Keckler, Doug Burger, and
Lorenzo Alvisi, “Modeling the effect of technology trends on the soft error rate of
combinational logic,” in Proc. International Conference on Dependable Systems and
Networks, June 2002, pp. 389-398.

[7] S. Mitra et al., “Robust system design with built-in soft-error resilience,” IEEE
Computer, Feb. 2005.

[8] S. Hareland, J. Maiz, M. Alavi, K. Mistry, S. Walsta, and C. Dai, “Impact of CMOS
process scaling and SOI on soft error rates of logical processes,” in Symposium on
VLSI Technology, Digest of Technical Papers. 2001, pp. 73-74, IEEE.

[9] Tanay Karnik, Bradley Bloechel, K. Soumyanath, Vivek De, and Shekhar Borkar,
“Scaling trends of cosmic rays induced soft errors in static latches beyond 0.18u,” in
Symposium on VLSI Circuits Digest of Technical Papers. 2001, pp. 61-62, IEEE.

[10] S. Mukherjee, C. Weaver, J. Emer, S.K. Reinhardt, and T. Austin, “A systematic
methodology to compute the architectural vulnerability factors for a high-performance
microprocessor,” in Proc. IEEE/ACM International Symposium on Microarchitecture,
Dec. 2003.


[11] C. Constantinescu, “Impact of deep submicron technology on dependability of VLSI
circuits,” in Proc. International Conference on Dependable Systems and Networks,
June 2002, pp. 205—209.

[12] R. Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design and Test
of Computers, vol. 22, no. 3, pp. 258-266, May 2005.

[13] J.F. Ziegler et al, “IBM experiments in soft fails in computer electronics,” IBM Journal
of Research and Development, vol. 40, no. 1, pp. 3—19, 1998.

[14] T. C. May and M. H. Woods, “Alpha-particle induced soft errors in dynamic memo-
ries,” IEEE Transactions on Electron Devices, vol. 26, no. 2, 1979.

[15] R. Baumann, “Soft errors in commercial semiconductor technology: overview and
scaling trends,” in IEEE 2002 Reliability Physics Tutorial Notes, Reliability Fundamentals,
Apr. 2002, pp. 121.1-121.14.

[16] H. Ando et. al., “A 1.3GHz ﬁfth generation SPARC64 microprocessor,” in Proc.
IEEE/ACM Design Automation Conference, June 2003, pp. 702—705.

[17] M. Santarini, “Cosmic radiation comes to ASIC and SOC design,” Electronics Design
Network, pp. 46—56, 2005.

[18] Cadence Design Systems, “Paciﬁc: User guide,” 2004.

[19] L.W. Massengill, A.E. Baranski, D.O. Van Nort, J. Meng, and B. Bhuva, “Analysis of
single-event effects in combinational logic-simulation of the AM2901 bitslice processor,”
IEEE Transactions on Nuclear Science, vol. 47, no. 6, pp. 2609-2615, Dec. 2000.

[20] K. Mohanram, “Closed-form simulation and robustness models for SEU-tolerant de-
sign,” in Proc. International VLSI Test Symposium, Apr. 2005, pp. 327—333.

[21] Y. S. Dhillon, A. U. Diril, and A. Chatterjee, “Soft-error tolerance analysis and op-
timization of nanometer circuits,” in Proc. Design Automation and Test in Europe,
Mar. 2005, pp. 288—293.

[22] L.B. Freeman, “Critical charge calculations for a bipolar SRAM array,” IBM Journal
of Research and Development, vol. 40, pp. 119-129, Jan. 1996.

[23] D.C. Mavis and P.H. Eaton, “Soft error rate mitigation techniques for modern
microcircuits,” in IEEE Reliability Physics Symposium, 2002, pp. 216-225.

[24] G. Hubert et al., “Study of basic mechanisms induced by an ionizing particle on simple
structures,” IEEE Transactions on Nuclear Science, vol. 47, no. 3, pp. 519-525, 2000.

[25] S. Krishnamohan and N.R. Mahapatra, “A highly-efficient technique for reducing soft
errors in static CMOS circuits,” in Proc. IEEE International Conference on Computer
Design (ICCD), Oct. 2004.


[26] J.C. Lo, “A novel area-time efficient static CMOS totally self-checking comparator,”
IEEE Journal of Solid-State Circuits, vol. 28, pp. 165-168, Feb. 1993.

[27] C. Metra, M. Favalli, and B. Ricco, “Self-checking detection and diagnosis of transient,
delay, and crosstalk faults affecting bus lines,” IEEE Transactions on Computers, vol.
49, pp. 560-574, June 2000.

[28] L. Anghel and M. Nicolaidis, “Cost reduction and evaluation of a temporary faults
detecting technique,” in Proc. Design Automation and Test in Europe, 2000.

[29] J. B. Nickel and A. K. Somani, “REESE: A Method of Soft Error Detection in Micro-
processors,” in Proc. International Conference on Dependable Systems and Networks,
June 2001, pp. 401—410.

[30] M. Nicolaidis, “Time redundancy based soft-error tolerance to rescue nanometer tech-
nologies,” in Proc. International VLSI Test Symposium, 1999.

[31] K. Mohanram and N. A. Touba, “Partial error masking to reduce soft error failure rate
in logic circuits,” in Proc. International Symposium on Defect and Fault Tolerance in
VLSI Systems, 2003, pp. 433—440.

[32] Q. Zhou and K. Mohanram, “Cost-Effective Radiation Hardening Technique for
Combinational Logic,” in Proc. IEEE/ACM International Conference on Computer-Aided
Design (ICCAD), Oct. 2004.

[33] M. Zhang and N.R. Shanbag, “An energy-efficient circuit technique for single event
transient noise-tolerance,” in Proc. IEEE International Symposium on Circuits and
Systems, May 2005, pp. 636-639.

[34] H. Cha and J.H. Patel, “Latch design for transient pulse tolerance,” in Proc. IEEE
International Conference on Computer Design (ICCD), Oct. 1994, pp. 385-388.

[35] T. Karnik, S. Vangal, V. Veeramachaneni, P. Hazucha, V. Erraguntla, and S. Borkar,
“Selective node engineering for chip-level soft error rate improvement,” in Symposium
on VLSI Circuits Digest of Technical Papers, June 2002, pp. 204—205.

[36] J. Grad and J. E. Stine, “A standard cell library for student projects,” in International
Conference on Microelectronic Systems Education, 2003, pp. 98–99.

[37] S. Krishnamohan and N.R. Mahapatra, “Combining Error Masking and Error De-
tection Plus Recovery to Combat Soft Errors in Static CMOS Circuits,” in Proc.
International Conference on Dependable Systems and Networks, June 2005.

[38] D. Ernst et al., “Razor: A low-power pipeline based on circuit-level timing specula-
tion,” in Proc. IEEE/ACM International Symposium on Microarchitecture, Dec. 2003.

[39] K. Bernstein et al., High speed CMOS design styles, Kluwer Academic Publishers, first
edition, 1998.

[40] S. Krishnamohan and N.R. Mahapatra, “An Analysis of the Robustness of CMOS
Delay Elements,” in Proc. Great Lakes Symposium on VLSI, Apr. 2005.

[41] K.A. Bowman, S.G. Duvall, and J.D. Meindl, “Impact of die-to-die and within-
die parameter fluctuations on the maximum clock frequency distribution for gigascale
integration,” IEEE Journal of Solid-State Circuits, pp. 183–190, 2002.

[42] G. Kim, M.-K. Kim, B.-S. Chang, and W. Kim, “A low-voltage, low-power CMOS
delay element,” IEEE Journal of Solid-State Circuits, pp. 966–971, 1996.

[43] Y. W. Pang et al., “An asynchronous cell library for self-timed system designs,” in
IEICE Transactions on Information and Systems, Mar. 1997, pp. 296–305.

[44] J.M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital integrated circuits, Prentice
Hall, first edition, 1996.

[45] M. F. Aburdene, J. Zheng, and R. J. Kozick, “New recursive VLSI architectures
for forward and inverse discrete cosine transform,” in Proceedings of SPIE - The
International Society for Optical Engineering, 1996.

[46] A. W. Buchwald, K. W. Martin, and A. K. Oki, “A 6GHz integrated phase-locked loop
using AlGaAs/GaAs heterojunction bipolar transistors,” IEEE Journal of Solid-State
Circuits, pp. 1752–1762, 1992.

[47] I. Sutherland, B. Sproull, and D. Harris, Logical effort: designing fast CMOS circuits,
Morgan Kaufmann Publishers, first edition, 1999.

[48] R. Ramanarayanan et al., “Analysis of soft error rate in flip-flops and scannable
latches,” in Proc. IEEE International System on Chip Conference, Sept. 2003.

[49] S. Krishnamohan and N.R. Mahapatra, “Analysis and Design of Soft Error Hardened
Latches,” in Proc. Great Lakes Symposium on VLSI, Apr. 2005.

[50] D. Markovic, B. Nikolic, and R. Brodersen, “Analysis and design of low-energy
flip-flops,” in Proc. IEEE/ACM International Symposium on Low Power Electronics and
Design, 2001, pp. 52–55.

[51] Y. Komatsu et al., “A soft-error hardened latch scheme for SOC in a 90nm technology
and beyond,” in Proc. IEEE Custom Integrated Circuits Conference, Oct. 2004.

[52] T. Calin, M. Nicolaidis, and R. Velazco, “Upset hardened memory design for submicron
CMOS technology,” IEEE Transactions on Nuclear Science, pp. 2874–2878, Dec. 1996.

[53] P. Hazucha et al., “Measurements and analysis of SER tolerant latch in a 90nm dual-
Vt CMOS process,” in Proc. IEEE Custom Integrated Circuits Conference, Oct. 2003,
pp. 617–620.

[54] R. Naseer and J. Draper, “The DF-DICE storage element for immunity to soft errors,”
in Proc. IEEE Midwest Symposium on Circuits and Systems, 2005.

[55] J. Gambles et al., “An ultra low-power, radiation-tolerant Reed Solomon encoder for
space applications,” in Proc. IEEE Custom Integrated Circuits Conference, Oct. 2003.

[56] M. Omana, D. Rossi, and C. Metra, “Novel transient fault hardened static latch,” in
Proc. International Test Conference, Sept. 2003, pp. 886–892.
