DIAGNOSIS OF INTERMITTENT FAULTS IN DIGITAL SYSTEMS

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
SAMIR KAMAL
1972

ABSTRACT

DIAGNOSIS OF INTERMITTENT FAULTS IN DIGITAL SYSTEMS

BY SAMIR KAMAL

Digital computers are being relied upon as integral parts of an increasing number of systems handling all aspects of our life. The proper operation of computers is vital to the functioning of these computerized systems. One of the major approaches to achieving proper operation of computers is fault diagnosis plus repair. This thesis addresses one aspect of this approach, namely the diagnosis of intermittent faults in combinational circuits.

Intermittent faults in digital systems are those faults whose effects are not observed all the time. A system having an intermittent fault may show the effect of such a fault when an input test is applied one time, and possibly not show it when the same test is applied many other times.

A probabilistic model is suggested for intermittent faults in logical circuits. The model assumes that the circuit is irredundant and that it can have only one of a possible set of faults (the single fault assumption). It also assumes that the faults are well behaved and are signal independent. Testing for such faults is done through the repeated application of tests that would detect these faults had their effect been permanent. These tests are generated using any of the methods employed for generating tests that detect permanent faults. The prior probability of the circuit having any of the intermittent faults is assumed to be known in the model. The probability of observing the effect of each fault, if that fault exists, is also assumed to be known.

A procedure for the detection of intermittent faults is suggested. It is analogous to a sequential statistical decision problem. At any stage during testing, the procedure terminates if a failure is observed or if the decision rule decides that the circuit is fault free. Otherwise, it selects an appropriate test to be applied at the next stage. The decision rule used in the detection procedure is selected such that it ensures that the procedure terminates in a finite number of steps. Least upper bounds on the number of repetitions of tests that detect a particular fault are derived. They are employed in designing optimum detection experiments. Such an optimization problem is found to be equivalent to an integer programming problem.

A diagnosis procedure, also employing the repetition of tests that detect permanent faults, is proposed. It is proved that the conditions required in the procedure guarantee that the expected length of the diagnosis experiment is finite. Test properties that are needed for the diagnosis of intermittent faults are determined. Some fundamental differences between those properties needed for the intermittent fault case and those needed for the permanent fault case are pointed out. The problem of optimizing the diagnosis experiment is examined. Unfortunately, the true optimum solution can be obtained only through impossibly lengthy enumeration. Two suboptimal approaches that are heuristic in nature are suggested.
DIAGNOSIS OF INTERMITTENT FAULTS IN DIGITAL SYSTEMS

BY
SAMIR KAMAL

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science
1972

TO MY PARENTS

ACKNOWLEDGMENTS

I am grateful to Dr. Carl V. Page, the chairman of my guidance committee, for his guidance and encouragement during the course of this thesis. My thanks also go to Dr. Richard C. Dubes, Dr. Edgar M. Palmer, Dr. Morteza A. Rahimi, and to Dr. Bernhard L. Weinberg for serving on my guidance committee and for reviewing this work. I wish to express my thanks to the Division of Engineering Research for their financial support during the writing of this thesis, and further, to the Egyptian Government for the financial support that made my graduate studies possible. Finally, I am deeply indebted to my wife Lillian for her encouragement, loyalty, patience and understanding; and also to my daughter Heba whose joyous smile is always a source of happiness and reassurance.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
GLOSSARY

Chapter
I.   INTRODUCTION
     1.1 FAULT-TOLERANT COMPUTING
     1.2 SOME APPROACHES TO ATTAIN STRUCTURALLY RELIABLE SYSTEMS
     1.3 SWITCHING CIRCUITS
         1.3.1 Combinational Circuits
         1.3.2 Sequential Circuits
     1.4 LOGICAL FAULTS
     1.5 INTERMITTENT FAULTS
     1.6 CONTRIBUTION AND ORGANIZATION OF THE THESIS

II.  DIAGNOSIS OF PERMANENT FAULTS IN COMBINATIONAL CIRCUITS
     2.1 TESTS
     2.2 TEST GENERATION
         2.2.1 Truth Table Method
         2.2.2 Path Sensitizing
         2.2.3 The D-Algorithm
         2.2.4 Algebraic Methods
     2.3 FAULT TABLE
     2.4 EXPERIMENTS AND THE DIAGNOSIS TREE
     2.5 MINIMIZATION OF PRESET EXPERIMENTS
         2.5.1 Exhaustive Enumeration
         2.5.2 The Prime Implicant Method
         2.5.3 The Test Set Intersection Method
         2.5.4 Method of Distinguishability Criteria
     2.6 MINIMIZATION OF ADAPTIVE EXPERIMENTS
         2.6.1 Exhaustive Enumeration
         2.6.2 Method of Distinguishability Criteria

III. DETECTION OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS
     3.1 THE MODEL
     3.2 DETECTION OF INTERMITTENT FAULTS
         3.2.1 A Simple Case
         3.2.2 The General Case
     3.3 OPTIMUM DETECTION EXPERIMENT
         3.3.1 A Suboptimal Solution
         3.3.2 Reduction of the Fault Table Matrix A

IV.  DIAGNOSIS OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS
     4.1 GENERAL ASSUMPTIONS
     4.2 DIAGNOSIS OF INTERMITTENT FAULTS
     4.3 EXPECTED LENGTH OF SUBEXPERIMENT
     4.4 FAULT TABLE
     4.5 OPTIMIZATION OF DIAGNOSIS EXPERIMENTS
         4.5.1 Exhaustive Enumeration
         4.5.2 Local Optimization
         4.5.3 The Method of Maximum Resolution
     4.6 SPECIAL CASES
         4.6.1 The Simple Fault Table Case
         4.6.2 The Simple Elimination Fault Table

V.   SUMMARY, CONCLUSIONS AND SUGGESTIONS FOR FUTURE WORK
     5.1 THESIS SUMMARY
     5.2 CONCLUSIONS
     5.3 SUGGESTIONS FOR FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

Table                                                        Page
2.1 Truth Table for Normal and Faulty Circuits when "a s-a-1"  24
2.2 Example of a Complete Fault Table                          44
2.3 Reduced Fault Table for Example 2.3 . . .
52 2.4 Complete Fault Table and Initial Weights . 54 2.5 First Rearrangement of Complete Fault Table 0 O O O O 0 O O O O O 55 2.6 Second Rearrangement of Complete Fault Table 0 O O O O O O I O O O 56 2.7 Third Rearrangement of Complete Fault Table 0 O O O O O O O O O O 57 2.8 Fourth Rearrangement of Complete Fault Table 0 O O O O O O O O O O 57 4.1 Reduced Fault Table after Detection Experiment. . . . . . . . . . 107 4.2 Reduced Fault Table after First Subexperiment. . . . . . . . . 108 4.3 Reduced Fault Table after Second Subexperiment. . . . . . . . .. 109 4.4 Reduced Fault Table when t6 Fails in Third Subexperiment. . . . . . . 110 vii Figure 1.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 LIST OF FIGURES Some Logical Elements . . . . . . . Block Diagram of a Combinational Circuit. Example of a Combinational Circuit. . . Block Diagram of a Sequential Circuit. . Faults in an Inverter Circuit . . . . Input Diode Failure of an AND Gate. . . Bridging Faults . . . . . . . . . Example of Faults in an Exclusive-OR Circuit . . . . . . . . . . Redundancy and Fault Masking. . . . . Truth Table Test Generation for "a 8-3-1". 0 o o o o o o o o Adjusting Gate Input to Make Output Sensitive Only to a Single Input . . Path Sensitizing. . . . . . . . . OR Gate and Its Singular Cover . . . . AND Gate Behaving as an OR Gate when Faulty. . . . . . . . . . . 0115 is a Propagation D-Cube of a NAND Gate for a Change in x1 . . . . . Experiment Types. . . . . . . . . Diagnosis Tree of a Preset Experiment. . Diagnosis Tree of an Adaptive Experiment. viii Page 0mm 12 13 14 19 21 23 26 28 32 33 35 46 48 48 List of Figures 3.1 State Space 0 . . . . . . . . . . 63 3.2 Sample Space 3 . . . . . . . . . . 64 3.3 Points in 0 x S with Non-Zero Probabilities . . . . . . . . . 68 3.4 Flow Chart for Detection Procedure . . . 73 3.5 O x 3 Space for a Simple Case . . . . . 74 4.1 Flow Chart for the Diagnosis Procedure . . 103 4.2 Diagnosis Tree for Example 4.1. . . . . 111 ix Term UI EI.) GLOSSARY Meaning An entry in the fault table matrix Fault table matrix Initial uncertainty Uncertainty after the application of a single test that did not fail Uncertainty after the application of a single test that failed Reduced fault table matrix k-dimensional random variable Number of blocks, or j-th bias A signal that is normally 1 but becomes 0 when a fault is present A signal that is normally 0 but becomes 1 when a fault is present Conditional probability of malfunction Expectation Page 40 4O 59 59 59 94 69 52 94 31 31 70 114 Glossary fi(xl,x2,...) I fi(x1'x2'ooo) F F Pi gk(xllleo--) iO'il'in k 31 k+m The fault free condition Permanent fault number i The row corresponding to fault fi in the fault table A Boolean function of x1,x2,... A Boolean function of x1,x2,... Permanent fault space Set of permanent faults corresponding to Q P Subset of Fp that is detected by i-th test of T P A Boolean function of x1,x2,... 
Fault parameters Least upper bound on length of the detection experiment for a simple case, or Number of repetitions of IF Least upper bound on number of tests detecting fi that are needed for the detection experiment, or Number of times fi is covered by Tp Expected value of k As k approaches infinity xi 40 19 41 10 64 101 123 10 38 81 114 91 113 114 78 Glossary NH Number of tests detecting a given fault, or Experiment length EXpected length of experiment Experiment length if i-th fault is present Number of 1's in column 1 of the fault table Number of 1's in portion of column 1 that belongs to block k Number of output variables, or Number of tests in T The j-th membership function Number of input variables, or Number of faults Number of secondary variables in a sequential circuit Prior probability of fault number i Posterior probability of mo of simple case after applying tests, none of them failed Posterior probability of ml of simple case after applying tests, none of them failed xii DJ 89 92 56 56 52 53 40 84 40 56 77 76 Glossary P(.) s-a-O s-a-l Probability of Probability of failure of test ti' or Number of l-entries in i-th row of the fault table matrix Resolution figure of merit Threshold for posterior probability of a simple case Sample point denoting all components being fault free Sample point denoting that component pertaining to i-th fault is faulty Stuck at logical value 0 Stuck at logical value 1 Sample space Test number j Random variable corresponding to test t. 3 Threshold for likelihood ratio of a simpel case Threshold for the i-th likelihood ratio Figure of merit Weight of test ti xiii 67 59 97 123 80 65 65 12 12 65 19 66 82 90 123 52 Glossary Weight of test ti after application of j tests A don't care value The j—th input of a switching circuits, or Number of repetitions of test tj Complement of the switching variable "1 Output of a gate Current value of i-th secondary variable of a sequential circuit Output function with fault parameters substituted Next value of i-th secondary variable of a sequential circuit Output function The i-th output of a switching circuit Output function Information gain due to the application of test ti Member of (set membership) Number of elements in 9p or in Fp Number of elements in Fp i xiv 53 32 92 12 38 28 37 59 69 101 123 Glossary Number of 0's in column 1 of the fault table 52 Number of 0's in portion of column 1 that belongs to block k 53 Likelihood ratio of a simple case after applying k tests, none of them failed 79 Likelihood ratio for mi after applying k tests, all detecting f1 and none of them failed 88 Number of elements in Tp 95 Summation 53 Test set 65 Subset of T containing all tests detecting fi 66 Set of tests that cover Fp 102 Random variable set 66 Random variable subset that corres- ponds to Ti 67 T1=OI T2=opooop Tu=o where TP = {T1,T2,...,Tu} 112 Ti=0, Ti=0,... k times for all i (lgiiu) where Tp = {T1,T2,...,Tu} ‘ 113 XV Glossary (9 x S)k {a,b,...} dz 3; State of being fault free State of having i-th intermittent fault State space Set of possible intermittent faults Action space (cartesian product of Q andS) k-dimensional binary vector all of its entries are 0's Cartesian product of (Q x.S) by itself k times A set whose elements are a,b,... Boolean difference of z with respect to x. 
1 Function mapping from the set on the left hand side of the arrow to the right hand side Smallest integer greater than or equal to x xvi 62 63 62 101 67 69 113 20 39 66 121 CHAPTER I INTRODUCTION The last quarter of a century has witnessed major changes in many systems organizations which affect every facet of life. The reliance on digital computers as intergral parts of these systems has introduced irreversible changes. The speed and efficiency of computers in handling massive amounts of information are the main motives for these changes. For these same reasons, huge and complex systems are now designed that were unthinkable before the use of computers, e.g., systems for space flights. These so-called computerized systems span a very wide range of applications, from the simple to the extremely complex, from government agencies to private enterprise, from space missions to patient monitoring. Air line reservations, production line scheduling, highway traffic control, ABM, are just a few examples of these systems. Prior to computerization, systems had manual backups for use in case of failures. With the current fast and sophisticated systems, manual backup is infeasible except for a limited number of cases for a short period of time. Proper operation of computers, being important parts of these systems, must be assured, particularly for real-time applications, where down time for an extended period of time would be very costly, if not disastrous. An auto assembly plant will be shut down and all the labor force sent home for the day, if the production line is down for half an hour. With this type of strigent requirements, it is imperative that we use more reliable computers, even though today's machines are built from components far more reliable than their counterparts that were used a decade or so ago. 1.1 FAULT-TOLERANT COMPUTING Fault-tolerant computing is a field of study whose major concern is the assurance of prOper operation of digital computers. A definition of the term was given by Ramamoorthy [30] as: "Fault-tolerant computing can be defined as the ability to execute specific algorithms correctly regardless of hardware failures or software errors." This definition, even though comprehensive, neglects the importance of time. It is very critical, especially with real-time applications, that algorithms be executed correctly within a tolerable period of time. Clearly, the goal implied in the above definition is far reaching and it is quite a challenge to achieve, even partially, this goal. The difficulty forseen in achieving this goal should not be a discouraging factor in the work toward that end, at least we should try to reduce the penalty we are likely to pay, in case of failures or errors, to a minimum. Even though the roots of this field started with the early days of computers [27, 28, 38], there has been con- siderable recent interest in it. Related research papers appear regularly in the literature. In addition, two recent international symposia* were SOlGlY devoted to this subject. Fault-tolerant computing encompasses theory and techniques for fault and error detection and correction, modeling, simulation, analysis, synthesis and architecture of fault-tolerant systems. Structural reliability is of major importance to fault-tolerant systems. Several approaches are used to ensure structural reliability, some of which are presented in the following section. 
It should be emphasized that the term "reliable" is used here in the generic sense and not in the formal probabilistic sense as in the theory of reliability in engineering parlance. The term "credible" was used in lieu of that by Carroll [5]. *First International Symposium on Fault-Tolerant Computing, March 1971, and Second International Symposium on Fault-Tolerant Computing, June 1972. 1.2 SOME APPROACHES TO ATTAIN STRUCTURELY RELIABLE SYSTEMS Several traditional approaches to the problem of assuring proper operation of digital systems have been studied. The most significant are: (a) Use of Better Components and Better Designs. An integrated circuit, for example, could be made more reliable if it were designed to be more noise discriminant, and to be capable of withstanding greater power supply fluctuations without suffering damage or malfunctioning than is presently possible. This approach appears to be an obvious and straight- forward one, but it is limited by the available technology and by the economics governing the design. Often, the application still requires more reliability than this approach can provide. (b) Redundancy. Through this approach, it is possible, using additional hardware, to build a system that will function properly even after the failure of one or more of its components. The redundancy employed in space flights is a typical example of this approach. A good portion of the early work in fault-tolerant computing concentrated on the study of different redundancy systems. By guadrupling the number of contacts in a relay switching system, Moore and Shannon [28] have shown that it is possible to come up with a system more reliable than the individual relays. This work was later extended to gate-networks by Tryon [37]. Other classic work in this area was due to Von Neumann who introduced the notion of the "Restoring Organ" [38]. This concept was later developed by Lyons [22] in the study of TMR (Triple Modular Redundancy). Redundancy is quite expensive and just postpones the inevitable: given.sufficient time, enough failures will occur and the system will eventually malfunction. It is quite suitable for short-term applications such as space missions where correct operation must be guaranteed for a relatively short period of time and repair is rather difficult or even impossible. (c) Fault Diagnosis Plus Repair. Economic considerations make this approach the most favored one. With this approach, a system is tested to determine whether or not it is functioning as intended (fault detection), and if not, which part caused the trouble (fault diagnosis). Such testing is needed through the life of the system. When it is first installed, an acceptance test is needed. Thereafter, routine testing, e.g., performed by the CE (Customer Engineer), is performed as part of a maintainance program. Studies about automatic testing for the so-called "self-repairing" systems were carried by Avizienis [1, 2] and others. This approach has received a great deal of attention in recent years. It can be safely stated that, so far, it has been the backbone of fault- tolerant computing. A combination of approaches (b) and (c), even though more expensive than the third approach, will lead to ulti- mate reliability. Such a combination is very suitable when the system is used for vital applications where uninter- rupted operation is a must. 
Combining these two approaches must be done with care, since redundancy is incompatible with diagnosis: it will tend to mask the effect of a fault when it occurs, so the fault will go undetected. However, clever use of redundancy (e.g., via additional outputs) can yield greater reliability, if the redundant components are used in such a way that they are not redundant for test purposes. A system that made use of this combined approach is the ESS (Electronic Switching System) of the Bell System, where the failure time is limited to 2 hours in 40 years (acknowledged off the record by the designers).

1.3 SWITCHING CIRCUITS

Most of the work done in the area of fault diagnosis deals with faults in switching circuits. The basic elements of switching circuits are the logical elements. Schematic notations for some of these elements are shown in Figure 1.1. Switching circuits are usually divided into two types: combinational and sequential circuits.

Figure 1.1. Some Logical Elements: NOT, AND, OR, EOR (Exclusive-OR), NAND and NOR gates.

1.3.1 Combinational Circuits

A switching circuit is combinational [24] if its outputs z_i (1 ≤ i ≤ m) can be written as Boolean functions of its inputs x_j (1 ≤ j ≤ n):

    z_i = f_i(x_1, x_2, ..., x_n)                              (1.1)

i.e., each output is a function only of the present values of the inputs. A block diagram for a combinational switching circuit is shown in Figure 1.2. The realization of such a circuit is characterized by the absence of feedback. An example of a combinational circuit is given in Figure 1.3.

Figure 1.2. Block Diagram of a Combinational Circuit.

Figure 1.3. Example of a Combinational Circuit.

1.3.2 Sequential Circuits

In a sequential circuit [24], the output at any time is a function of the current inputs and also of previous inputs. The history of previous inputs is summarized in the state of the circuit. The realization of a sequential circuit is characterized by feedback. The combined values of the feedback lines y_k (1 ≤ k ≤ p) represent the current state of the circuit. A typical block diagram is shown in Figure 1.4.

Figure 1.4. Block Diagram of a Sequential Circuit.

Sequential circuits are classified as "synchronous" or "asynchronous" depending on whether or not they operate under the control of clock pulses (an additional input to the "memory" block of Figure 1.4 that is not shown). Asynchronous sequential circuits are characterized by level inputs and outputs, while the inputs and outputs of synchronous sequential circuits can be either levels or pulses. Two different models for synchronous sequential circuits have been in use, one due to Mealy [25] and the other due to Moore [27]. In both models, the next state is a function of the current input and current state, i.e.,

    Y_k = g_k(x_1, x_2, ..., x_n; y_1, y_2, ..., y_p)          (1.2)

The difference between the two models is in the output function. In the Mealy model it is a function of the current state and current input, i.e.,

    z_i = f_i(x_1, x_2, ..., x_n; y_1, y_2, ..., y_p)          (1.3)

while in the Moore model, the output is a function only of the current state, i.e.,

    z_i = f_i(y_1, y_2, ..., y_p)                              (1.4)

The two models can be shown to be equivalent; see Gill [18].

1.4 LOGICAL FAULTS

A fault is a physical defect that causes the circuit to malfunction. There are numerous factors that could give rise to faults, among them:

(1) Aging and gradual deterioration with time.
This usually would result in what is called marginal faults, e.g., the gap between the ON and OFF thresholds of a transistor gets smaller. (2) Critical timing, noise, close design tolerances and loose wires. These would cause some sort of intermittent faults. Some intermittent faults eventually will become permanent faults. (3) Solid failures such as a permanently open col- lector or base lead of a transistor, a broken wire, or a short circuit between adjacent con- nections. These will result in what is called permanent faults. This thesis deals only with logical faults. These are the faults which affect changes in the logical behavior of the circuit. Failures that cause, say, changes in pulse shapes, but do not alter the logical functions realized by the circuit will not be considered. Also, power supply and clock failures are not considered here. Hereafter, the term fault would mean logical fault, unless otherwise indicated. The following are examples of faults and how they are logically described. 12 Example 1.1 (Faults in an inverter) Consider the inverter circuit shown in Figure 1.5. An open collector failure (lead 1 open) will cause the output y to be at a high voltage regardless of the value of the input signal x. Logically, this can be represented as line y being s-a-l (stuck at logical value 1) if positive logic is assumed. On the other hand, if the failure is Of the form of a short between the collector and the emitter (short between 1 and 2), the output y will always be at ground voltage regardless of the input signal x. This is represented logically as line y being s-a-O (stock at logical value 0). 4. 1 Y X Y 1- 1 Open (y high) y s-a-l l and 2 shorted (y low) y s-a-O Figure 1.5. Faults in an Inverter Circuit. 13 Example 1.2 (Faults in and AND gate) The circuit diagram Of a DRL (Diode Resistor Logic) AND gate is shown in Figure 1.6. If the diode at input x1 is open, the circuit will behave as if the x1 input is not present. This can be described logically as xl being s-a-l. +. I x1 X * x2 *‘ "3 Diode 1 Open xl s-a—l Figure 1.6. Input Diode Failure of an AND gate. Example 1.3 (Bridging Faults) If a short circuit occurs between two lines, say the outputs of two gates, both outputs will take a common signal value. The value of such a signal could be evaluated by detailed circuit analysis. In general, it depends on the type of technology used in the realization. With current technologies, which are mainly TTL (Transistor-Transistor Logic), ANDing 14 (a) Normal Circuit. x2 I B Afic-fi’: I "3 ‘c 1 “‘D— i3 x4 2 (c) Equivalent Circuit if Lines 1 and 3 are Shorted. Figure 1.7. Bridging Faults. 15 the two signals is an acceptable logical description of the fault. Thus, the effect of a short between lines 1 and 2 of the TTL circuit shown by its logical diagram in Figure 1.7(a), on the behavior of the circuit can be analyzed by inserting a virtual AND gate as shown in Figure 1.7(b). The equivalent circuit when a short occurs between lines 1 and 3 is shown in Figure 1.7(c). Notice how this fault has transformed the combinational circuit into a sequential one. The failures indicated in examples 1.1 and 1.2, are represented logically as stuck-at faults. This is typical of a good proportion of known failures. Again, this kind of representation depends on the technology in use. A considerable amount of work dealt only with the stuck-at faults; for example [8, 21]. 1.5 INTERMITTENT FAULTS Intermittent faults in logical circuits are those faults whose effects are not present all the time. 
A circuit having an intermittent fault may show the effect of such a fault when an input test is applied one time, and possibly not show it when the same input test is applied other times. As indicated in the previous section, many factors could result in intermittent faults, e.g., stray capacitances, close design tolerances, fatigue, or irregular physical structure of components [34, 40]. 16 Almost all of the work cited in the literature in the area of modeling, detection and diagnosis of faults in logical circuits deals only with permanent faults [7, 14, 31, 36]. Only one recent paper, Breur [3], deals with intermittent faults. Breur has a first order Markov chain model for the faults, but most of his results are based on the simplified assumption of a zero order Markov chain. His work deals mainly with detecting whether the circuit has a certain intermittent fault or not, i.e., the case where the circuit could have only one possible intermittent failure. 1.6 CONTRIBUTION AND ORGANIZATION OF THE THESIS Intermittent faults constitute a respectable portion of the faults that occur in digital systems. Ad-hoc methods have been used to handle these faults in practice, while formal treatment has been completely ignored despite the need for such a tool. This thesis investigates this problem. It develops a probabilistic model for these faults and defines a criterion for fault detectiOn in combinational circuits. The detection problem is treated as a sequential statistical decision problem using tech- niques similar to those employed in pattern recognition methodology [15]. A method is given for the design of optimum detection experiments, which was found to be equivalent to an integer programming problem. A diagnosis 17 philosophy is then presented leading to a diagnosis methodology. A method for finding suboptimal diagnosis experiments is presented since optimum experiments could only be found through enumeration of an impossibly large class of experiments. Chapter I presents some background material and a general discussion of fault-tolerant computing. In Chapter II, an overview of diagnosis of permanent faults is presented. Chapter III presents a model for intermittent faults and discusses the problem of their detection in combinational circuits. Diagnosis of intermittent faults is dealt with in Chapter IV, while Chapter V summarizes the thesis and recommends problems for further research. CHAPTER II DIAGNOSIS OF PERMANENT FAULTS IN COMBINATIONAL CIRCUITS It was indicated in Chapter I that one of the most common approaches to ensure proper operation of digital computers is fault diagnosis plus repair. Fault diagnosis deals not only with detecting faults when they occur, but also with pinpointing the locations of failures to enable repair. These diagnostic tasks are accomplished by testing the system at hand. Testing, in this sense, means applying inputs and observing the corresponding outputs. In this chapter, several test generation methods are surveyed. The total number of tests generated for a large circuit could be enormous; hence it is often desirable to select a minimal or a near-minimal subset of these tests that is sufficient for detection or diagnosis. Schemes for selection of such optimal test subsets are explored later in the chapter. 18 19 2.1 TESTS A test for a combinational circuit is an input vector together with the Observed output. 
A test tj is said to detect fault fi if, upon the application of tj, the output vectors are different when the circuit is fault- free and when it has fault fi' For example, consider the exclusive-OR circuit shown in Figure 2.1. Let f1 be the fault "x1 s-a-O," and f2 be the fault "y s-a-l." Test t1, denoting the input vector (0,1) (i.e., x1 = 0 and x2 = l), detects neither f1 nor f2 since upon the application of t1 the output is 1 when the circuit is fault-free, when it has f1, and when it has f On the 2. other hand, tests t2 (input vector (1,0) ) and t3 (input vector (1,1) ) detect f1 since they result in x2 f1 : "x1 s-a-O" , detected by (1,0) and (1,1). f2 : "y s-a-l" , detected by (0,0) and (1,1). Figure 2.1. Example of Faults in an Exclusive-OR Circuit. 20 outputs of 0 and 1 respectively, if the circuit has fl, and in outputs of 1 and 0 respectively if the circuit is fault-free. Similarly, tests t4 (input vector (0,0) ) and t3 detect. £2. The following remarks are now clear: (1) In general, a test detects more than one fault. For example, test t3 above. (a) A fault can generally be detected by more than one test. For example, fault f1 above. If fl and f2 are the only possible faults that could occur in the above circuit, then t3 is sufficient to determine whether the circuit is faulty or not. If t3 fails, i.e., produces an output that is different from that of a fault-free circuit, then the circuit is faulty. Actually this is true if any test fails. If t3 does not fail, then the circuit must be fault-free. For that reason {t3} is called a detection set for this circuit. Generally, a detection set would contain more than one test. {t2,t4} is also a detection set, while {t2} is not. It is clear that {t3} is an optimal detection set (since it has only one test) unless test costs are considered. Detection tests tell us whether a system is faulty or fault-free, but, in general, do not completely identify the failure. Diagnosis tests will isolate the faults to a specific component or a group of components depending on both the diagnosis resolution needed and on the technology in use. For example, if the system is built with discrete 21 component technology, it might be necessary to pinpoint the faults down to the gate level. On the other hand, with LSI (Large Scale Integration) technology, fault identification down to the module level would be sufficient. It is possible that a failure occurs, but no test will detect it; i.e., the effect Of the fault does not change the function realized by the circuit. This is due to some sort of redundancy in the circuit. Such redundancy is not necessarily of the copious type discussed in Chapter I. For example, the circuit of Figure 2.2 exhibits some redundancy for test purposes; faults "3 s-a-O" and "4 s-a-O" will go undetected. xx L» N cl tn 5! I“, 60‘ IF *< Figure 2.2. Redundancy and Fault Masking. Line "3 s-a-O" or line "4 s—a-O" will go undetected. 22 2.2 TEST GENERATION In this section we discuss several methods to generate tests that would detect a given fault. In the methods presented, the circuit is assumed to be irredundant, so that redundancy would not mask the effect of the fault. Friedman [l3] pointed out some of the difficulties that may arise when masked faults in redundant circuits interact with otherwise detectable faults. Also, it will be assumed that only one fault can be present at a time (single fault assumption). 
This assumption is quite reasonable if testing is done routinely as part of a maintainance program: then, the probability Of having two faults is negligible [7]. However, this would not be a reasonable assumption for an acceptance test of a new installation where whole sections of the machine may be constructed incorrectly. 2.2.1 Truth Table Method This is the most Obvious and straightforward method for test generation. The truth tables fOr the normal (fault—free) circuit and for the faulty circuit are com- pared. An input combination is a test which detects the fault under consideration if it results in two different output vectors for the two circuits. A different truth table is constructed for every fault considered. 23 Example 2.1 Consider the circuit shown in Figure 2.3 and let the line a be "s-a-l." The truth tables of the output functions of the normal and faulty circuits are shown in Table 2.1. From Table 2.1, it is clear that three tests will detect "a s-a-l." For these tests, the input vector (xl,x2,x3,x4) will take the values (0,0,0,0), (0,0,1,0) and (0,1,0,0). Alternatively, these test could be represented by the min-terms: §1§2‘3‘4, xlx2x3x4 and xlx2 3 4. Figure 2.3. Truth Table Test Generation for "a s-a-l". z’ is the output when "a s-a-l". 24 x1 x2 x3 x4 2 z’ * 0 0 0 0 0 1 0 0 0 1 0 0 * 0 0 l 0 0 l 0 0 l l l 1 * 0 1 0 0 0 l 0 1 0 l 0 0 0 l l 0 0 0 0 l l l‘ 0 0 l 0 0 0 l l l O 0 1 0 0 1 0 l 0 l l 1 0 l l l l l l 0 0 1 l 1 1 0 l l l l l l 0 O 0 1 l l 1 l l Table 2.1. Truth Table for Normal and Faulty Circuits when "a s-a-l." (0,0,0,0), (0,0,1,0) and (0,1,0,0) detect this fault. This method is effective for small circuits. If the circuit has n inputs, the number of computations needed is proportional to 2n for every fault, which makes it prohivitive to use this method for even the moderate size circuits. 25 2.2.2 Path Sensitizing Many investigators have worked on some form or the other of this method. The name of Armstrong is usually linked with it even though he did not publish his work that is related to it. This method attempts to generate tests faster and using less memory relative to the exponential requirements of the previous method. Path sensitizing deals with stuck-at faults in circuits consisting only of NOT, AND, NAND, OR and NOR gates. The idea is to propagate a change in signal value on a faulty line in the circuit to an output terminal. A path is chosen from that location to an output, and the inputs to the gates along this path are adjusted, depending on the type of the gate, so that the gate output is sensitive only to that input that is part of the path. For AND and NAND gates, all inputs except the changing one should be 1. For OR and NOR gates, these inputs should be 0. For example, in Figures 2.4(a) and 2.4(b), if x2 and x3 are assigned 1, the output of the AND gate will be x1, i.e., sensitive only to x1, and the output of the NAND gate will be i1, i.e., sensitive only to x1. Similarly in Figures 2.4(c) and 2.4(d), if x1 and x2 are assigned 0, the output of the OR gate will be x3, and of the NOR gate will be i3, i.e., sensitive only to x3. 26 (a) AND gate. = x3 = 1 implies y = x1. Y (c) OR gate. = x2 = 0 implies y = x3. Figure 2.4. (b) NAND gate. x2 = x3 = 1 implies y = x1. x1 X2 y x3 (d) NOR gate. x = x2 = 0 implies y = i3. Adjusting Gate Inputs to Make Output Sensitive Only to a Single Input. The general procedure can (1) A failure at a point location is assigned the fault condition. be summarized as follows: is assumed. 
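The side-input rule described above for Figure 2.4 (1 on the remaining inputs of AND and NAND gates, 0 on the remaining inputs of OR and NOR gates) is mechanical enough to be written down directly. The short Python sketch below is an illustration of that forward-trace bookkeeping only, not a reconstruction of any program in the thesis; the function and signal names are invented for the example. For a single path it records the side-input values each gate needs and whether the propagated change arrives at the output complemented.

```python
# Non-controlling ("enabling") value a side input must take so that the
# gate output follows the input on the sensitized path (cf. Figure 2.4).
NONCONTROLLING = {"AND": 1, "NAND": 1, "OR": 0, "NOR": 0}

# Whether the gate complements the signal it propagates.
INVERTING = {"AND": False, "NAND": True, "OR": False, "NOR": True}

def sensitize_path(path):
    """Forward trace along one path, given as a list of
    (gate_type, side_input_names) pairs from the fault site to the output.
    Returns the required side-input assignments and a flag telling whether
    the output is the complement of the signal at the fault site."""
    assignments = {}
    complemented = False
    for gate_type, side_inputs in path:
        for line in side_inputs:
            assignments[line] = NONCONTROLLING[gate_type]
        if INVERTING[gate_type]:
            complemented = not complemented
    return assignments, complemented

# Hypothetical path: an AND gate with side inputs b and c feeding a NOR
# gate with side input d.
print(sensitize_path([("AND", ["b", "c"]), ("NOR", ["d"])]))
# ({'b': 1, 'c': 1, 'd': 0}, True)
```

The backward-trace phase, which chooses primary-input values that actually produce these side-input signals, is where the arbitrary choices and possible contradictions of the general procedure arise.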
The faulty a value opposite to that of is That is, a value of l 27 assigned to a line with "s-a-O" fault and vice versa for a "s-a-l" fault. (2) A path is chosen from the fault location to an output terminal. The inputs to the gates along this path are assigned values so as to propagate any change on the faulty line, along the chosen path, to the output terminal. This path is said to be sensitized. This phase of the procedure is called the forward-trace phase. (3) An input vector (test) is determined by tracing back from the inputs Of the gates, along the path, to the inputs of the circuit, and assigning input values to obtain the desired signals for these gates. An arbitrary choice is made when differ- ent possibilities exist. This portion of the procedure is called the backward-trace phase. It could result in more than one input vector or even in a contradiction. In case of a contra- diction, the process is repeated with a different choice for the signals assigned arbitrarily, if such a choice exists, otherwise a different path should be chosen. An example follows to illustrate this method. 28 Example 2.2 Consider the circuit of Figure 2.5. In the following discussion, a gate label may indicate its output to simplify notation. Figure 2.5. Path Sensitizing. (1) "a s-a—l" : sensitize path DH; detect it. (2) "B s-a-O" : sensitize path EH; sensitize path FH, however, (0,0,0,0) (1,0,1,1) and (1,0,0,0) contradiction: contradiction: detects it. (1) (2) 29 To generate tests that detect the fault "a s-a-l": Sensitize path DH. Assign 0 to a. Assign 0 to the other input of gate D, i.e., x2 = 0. Assign 0 to outputs of gates E, F and G. This completes the forward-trace phase. a = 0 implies x1 1 or x3 = 1, say x1 = l. G = 0 implies x3 = l or C = l (i.e., x2 = 0 and x4 = 0), say x3 = 1. F = 0 implies x4 = l or B = l (i.e., x2 = O and x3 = 0), say x4 = 1. E = 0 implies x1 = 1 or B = 1, already satis- fied by a = 0. This completes the backward-trace phase. Thus we see that (l,0,l,1) is a test that detects "a s-a-l." Have we selected another signal choice, we would have obtained (l,0,0,0) as another possible test. To generate tests that detect "8 s-a-O": Sensitize path EH. = 0 and Assign 1 to B, i.e., = 0. x2 x3 Assign 0 to the other input of gate E, i.e., x1 = 0. Assign O to outputs of gates D, F and G. G = 0 implies x3 = 1 (not possible), or C = 1 (i.e., x2 = 0 and x4 = 0). 30 F = 0 implies x4 = 1. This is a contradiction since x4 is required to be 0 and 1 at the same time. Sensitize path FH. Due to the symmetry Of the circuit, we will end up with a similar contradiction. However, the input vector (0,0,0,0) detects this fault since it results in a l-output for the normal circuit and in a O-output for the faulty circuit. This example is due to Schneider [32] to show that this method is not algorithmic, i.e., even though a test exists, this method did not generate it. The main flaw is that only one path is allowed to be sensitized at one time. For this reason, this method is sometimes called "one- dimensional path sensitizing." The key to an algorithmic method is to simultaneously sensitize all possible paths from the site Of the fault to an output. This will be necessary if the circuit has reconverqent fan-out at the site Of the failure, i.e., there are two or more paths that fan-out from the fault location then subsequently reconverge as inputs to the same gate, e.g., paths EH and PE in the above example. 
31 2.2.3 The D-Algorithm This method was developed by Roth [31] to over- come the limitations of the path sensitizing method. This method is applicable to a wider class of faults than stuck-at type faults. Also, its use is not restricted to circuits constructed only of NOT, AND, NAND, OR and NOR gates. Most important, this method is algorithmic due to its ability to simultaneously sensitize all possible paths from the site of the fault to a circuit output. This method is sometimes referred to as two-dimensional path sensitizing. Only an overview of the algorithm is presented here. Details are found in Roth's paper. Roth's formulation is in terms Of the D-Calculus; a calculus for cubical complexes. In what follows, the symbol D represents a signal that is 1 in the normal circuit and 0 in the faulty circuit. The symbol D represents a signal that is normally 0, but becomes 1 when a fault is present. The definitions of D and 5 could be interchanged as long as they are consistent throughout the circuit. The elements of the D-calculus are: (a) Singular Cover. The singular cover of a gate (or a block) can be considered as a concise form Of its truth table. It is used to obtain the other elements of the D— calculus. 32 For example, Figure 2.6 shows a three-input OR gate together with its singular cover. An "x" denotes a "don't care” value. The cube lxxl means that the output y will take a l-value if x1 takes the value 1 regardless of the values of x2 and x3. Notice that no "x" appears in the output coordinate of the cubes of the singular cover. For details about how to obtain these cubes see [7, 14, 31]. (b) Primitive D-Cubes of a Fault. The primitive D-cubes of a fault define the in- puts tO a gate (block) which cause the output of the gate (block) to be different from its normal value if a given fault is present in the gate (block). These cubes are obtained by intersecting the singular covers of the normal and faulty gates (blocks). For cube x1 x2 x3 y l x x 1 y x l x 1 x x 1 l 0 0 0 0 Figure 2.6. OR Gate and Its Singular Cover. "x" denotes a don't care. 33 intersection rules see [7, 14, 31]. For example consider the two-input AND gate of Figure 2.7. Sup- pose this gate were realized by a threshold element and that the actual threshold, due to some malfunction, has dropped below the proper value; so the gate behaves as an OR gate. The singular covers for the normal and faulty gates are shown. Intersecting these two sets Of cubes we Obtain 010 and 105 as the primitive D-cubes of the fault. This means that to test for this fault, apply 0(1) on x1 and 1(0) on x2, if the output is 0, the circuit is normal; if xl-———- Y x2-——-1 X1 X2 y X1 X2 y 1 l 1 l x l 0 l x 0 0 0 0 Singular cover of Singular cover of normal circuit. faulty circuit. Figure 2.7. AND gate Behaving as an OR gate When Faulty. 015 and 105 are the primitive D-cubes of this fault. 34 it is 1, the circuit is faulty. Primitive D-cubes Of a fault can also be obtained for blocks with more than one output. Notice that D and 5 appear only in the output coordinate(s). (c) Propagation D-Cubes for Input changes(s). These cubes define the inputs that cause the output of a gate (block) to be sensitive only to one or more of its specified inputs, thus propagating a fault on these inputs to the output. If the output is to be sensitive to more than one input, then, of course, these inputs must be related, e.g., have identical signals or one is the complement of another. This allows simultaneous multiple path sensitizing. 
These cubes are obtained from the singular cover of the gate (block). Some of the coordinates Of the singular cover are complemented. The newly obtained cubes are intersected with the singular cover to Obtain the propagation D-cubes. For example, consider the three-input NAND gate of Figure 2.8. The propa- gation D-cube for a change in x1 is D115, i.e., to make the output sensitive to x1, apply 1 to both x2 and x3, then the output will be the complement of the signal on x1. To use the D-algorithm for generating tests, the singular covers and all the propagation D-cubes for single input change for all the gates (blocks) of the circuit are 35 x1 x2 x3 Y 0 x x l x 0 x l x x O 1 l 1 l 0 Singular Cover. Figure 2.8. 0115 is a Propagation D-cube of a NAND gate for a Change in x1. obtained. Only single-input propagation D-cubes are com- puted initially. Propagation D-cubes for multiple input changes are computed as necessary. They will be necessary when reconvergent fan-out paths are to be sensitized. A D-cube which represents partial signal values on the lines of the circuit is called a test cube. The procedure for deriving a test for a given fault consists of two parts: the D-drive, which is analogous to the forward-trace phase of the path sensitizing method, and the consistency operation, which is analogous to the backward-trace phase of the path sensitizing method. In the D-drive, one Of the primitive D-cubes Of the fault under consideration is chosen as the initial test cube. It is then intersected with one of the propagation 36 D-cubes. An activity vector is kept to help determine the next propagation D-cube to intersect with. The activity vector and the test cube are updated after every inter- section. The process is repeated until at least one output coordinate of the circuit is Obtained in the test cube. It is possible that intersections involving single-input propagation D-cubes may terminate prematurely before reaching an output terminal if the circuit has reconvergent fan-out. In this case, suitable propagation D-cubes for multiple input changes are computed and intersection proceeded. This is the case when more than one path needs to be sensitized simultaneously. After completion of the D-drive, the consistency operation is begun. The test cube is successively inter- sected with the cubes of the singular cover until enough circuit inputs have been assigned to generate the signal values specified by the test cube. It is possible that some intersections will be empty, in this case it is necessary to return to the D-drive phase and Obtain a new D-chain before the consistency Operation can be successfully completed. 2.2.4 Algebraic Methods The basic ideas Of three algebraic methods for test generation are presented. Even though some of these methods are mathematically neat, they are only suitable 37 for small circuits due to the large computation and memory requirements for larger circuits. These methods are: (a) Method of Complements. The output function 2 is computed for the normal circuit as is its complement, E. The corre- sponding output 2’ and its complement 2’ are computed for the faulty circuit. The above functions are Obtained in normal form expressions. The Boolean product of z and 2’ and of E and z’ are computed. The Boolean sum of these products represent the tests that detect the fault under consideration. This method is somewhat better than the truth table method since it deals with terms of the normal form rather than with all the minterms (rows of the truth tables). 
However, we need to store all the terms of these functions, which may very well exceed the available memory. It is estimated that it would take 10^9 reels of tape to store the terms of the minimal normal form of the parity check circuit for a 60-bit-per-word computer, even though the circuit would contain only about 63 logical blocks.

(b) Poage's Method. Poage [29] has developed a complete and thorough method for generating tests to detect all possible stuck-at faults in combinational circuits. His work is applicable to single faults as well as multiple faults. The disadvantage of the method is that it is practical only for relatively small circuits. In order to introduce the effect of a fault on the function realized by a circuit, he uses a kind of ternary algebra. For every line i in the circuit, three Boolean variables i_0, i_1 and i_n, called fault parameters, are defined as follows:

    i_0 = 1 iff line i is "s-a-0",
    i_1 = 1 iff line i is "s-a-1",
    i_n = 1 iff line i is normal.

Only one of these parameters is equal to 1 while the other two are 0's. If the signal on line i is supposed to be y, it is replaced in the analysis by the literal y* defined as:

    y* = y·i_n + i_1                                           (2.1)

The complement ȳ* is defined as

    ȳ* = ȳ·i_n + i_0                                           (2.2)

These substitutions are successively carried out until expressions z* and z̄* for the output and its complement are obtained. Substitution for the fault parameters is then made to insert the fault condition. Expressions for the faulty function and its complement are then obtained. The method of complements, described above, is then used to generate the tests.

(c) Boolean Differences. Sellers et al. [35] have developed the Boolean difference method to generate tests that detect stuck-at faults from the Boolean forms representing the output functions. The Boolean difference of an output z(x_1,x_2,...,x_n) with respect to one of its variables x_i is defined as follows:

    dz/dx_i = z(x_1,...,x_{i-1},0,x_{i+1},...,x_n) ⊕ z(x_1,...,x_{i-1},1,x_{i+1},...,x_n)        (2.3)

In general, dz/dx_i will be a function of some or all of the x_j's, j ≠ i. If dz/dx_i = 1, then any change in x_i will result in a change in z regardless of the other signals x_j, j ≠ i. If dz/dx_i = 0, then z is independent of x_i. For single output circuits, the tests that detect "x_i s-a-0" are represented by x_i·(dz/dx_i), while the tests detecting "x_i s-a-1" are represented by x̄_i·(dz/dx_i).

2.3 FAULT TABLE

In the previous section, several methods for generating tests that detect a particular fault were presented. A set of tests detecting all faults that are likely to occur is generated. Let the set of faults of interest, {f_1,f_2,...,f_n}, contain n elements, and let the set of tests generated to detect these faults, {t_1,t_2,...,t_m}, have m elements. As noted earlier, some of these tests detect more than one fault. An n x m fault table can be constructed from these two sets. The columns correspond to the tests and the rows correspond to the faults. The entries are zeros and ones. This table can be denoted by an n x m fault table matrix A.

Definition 2.1
The n x m fault table matrix A = (a_ij) is defined as:

    a_ij = 1 if test t_j detects fault f_i,
           0 if test t_j does not detect fault f_i.            (2.4)

For completeness, the set of faults may contain an additional element f_0 corresponding to the fault-free circuit condition, since it represents a circuit condition to be distinguished from the rest of the faults. A complete fault table is defined to be a fault table with an additional row corresponding to f_0 appended to it. All entries
All entries 41 Of the row corresponding to fo in the complete table will be zeros. The m-dimensional binary vector corresponding to fi (0|: 1‘: n) will be denoted by f1. It is possible to save some of the effort used to generate tests if some relations among faults are known. If a class of faults is known to have indentical test sets, it is sufficient to generate only a test set for one of the faults in that class. Such faults form an equivalence class that has indentical rows in the fault table. This technique is called fault collapsing and is due to Schertz and Metze [33]. For example, "s-a-O" faults at an input Of an AND gate and at the output of the same gate have indentical detection test sets. Practically, it does not matter which fault of these occur, since the gate has to be replaced anyway. Similarly, if all tests that detect a particular fault fi also detect another fault fj, then it is not necessary to generate tests to detect fj' In _). this case, for every 1-entry in fi, there is a corre- + + sponding l-entry in fj, this is referred to as row f. 3 dominates row fi. For example, any test that detects an input "s-a-l" for an AND gate also detects the output "s-a-l" for that gate. If two or more tests in the fault table have identical columns, all but one of these columns can be removed since they correspond to redundant tests. Similarly a column whose entries are all zeros can be eliminated since no information will be gained when the corresponding test is 42 applied. A fault table (complete fault table) is said to be a reduced fault table (reduced complete fault table) if it contains no zero or redundant columns. The following are some properties of fault tables. Theorem 2.1 An upper bound on the number of tests in a reduced fault table with n faults is given by: 1:152“ - 1 (2.5) 252:. Every column will have n entries. There are at most 2n different binary vectors of n co- ordinates each. One Of these vectors is all zeros. Thus, there are at most 2n - 1 non-zero different possible columns. Theorem 2.2 Maximum diagnostic resolution (i.e., every fault is diagnosable) is possible if and only if no two rows in the complete fault table are identical. 2.9.9.: + + If two rows, say fi and fj (i¢j), are identical, then it is impossible to distinguish between f1 and f. since the outcome of any test 3 applied when either fault exists will be the same. 43 If no two rows are identical, the following method constructs a diagnosis procedure with maximum diagnostic resolution. At any stage in the procedure apply the test with the lowest index which does not result in the same outcome for all faults which are still possibilities for the unknown circuit condition. As long as there are at least two remaining possible faults, such a test must exist, because for each pair of faults there is at least one test which distinguishes them. The procedure always terminates in fault identification since the possibilities of unknown faults are reduced at each stage. This is an existence proof. It does not necessarily mean that this is the only procedure with maximum diagnostic resolution. Q.E.D. Theorem 2.3 For maximum diagnostic resolution, a lower bound on the number of tests m that a fault table with n faults should have is given by: m 3 log2 (n + l) (2.6) Proof Suppose m < log2 (n + 1). With m tests, there can be at most 21“ distinct rows in the com- plete fault table. 
However, we have 44 log (n + 1) 2m < 2 2 = n + l, i.e., 2‘“ua c>la F410 0 PJEH F'FJ o c: 0 F410 :5 o'ca o H PJIH F'rd o c: 0 viii P‘CD o :5 Table 2.2. Example of a Complete Fault Table. 45 {t2,t4,t5}, {t1,t2,t4,t5}, {t2,t3,t4,t5} and {t1,t4} respectively. The corresponding complete fault table is given in Table 2.2. 2.4 EXPERIMENTS AND THE DIAGNOSIS TREE The process of applying tests and drawing con- clusions from the observed outputs is called an experiment. Thus, a detection experiment is a sequence of tests to be applied in order to determine whether the circuit is fault free or not. On the other hand, a diagnostic experiment is a sequence of tests whose outcome is used to decide which fault, or class of faults (depending on the diag- nostic resolution required), is present in the circuit. The tests used for either the detection experiment or the diagnostic experiment are selected from the set Of tests generated, using any of the methods of Section 2.2, to detect the set of faults of interest. The application of all generated tests is sufficient for either detection or diagnosis. However, the tests generated are usually more than necessary for either experiment. 3 Experiments are classified into two types: (a) Preset Experiments, where tests to be applied are completely determined in advance. (b) Adaptive Experiments, where the test to be applied at a given stage depends on the outcome of the previous test. 46 In a preset experiment, the order of test application is immaterial. The experiment length (number of tests to be applied) is the same regardless of the existing fault condition. In an adaptive experiment, the order of test application is essential. The experiment length generally varies depending on the existing fault condition. Adaptive experiments are more efficient since they tend to be shorter in average length. A schematic representation Of the two types of experiments is shown in Figure 2.9. An experiment can be represented by a binary tree structure. This is due to the fact that every test ti Test t - z Generator‘ Ci cuit (a) Perset Experiment. Test , t1 Circuit 2 Generator It» v ’ I 4* (b) Adaptive Experiment. Figure 2.9. Experiment Types. 47 partitions a set of fault conditions (possibly including f0) into two classes: those that are detected by ti (those faults with a l in column i in the complete fault table), and those that are not (faults with a 0 in the column 1). The nodes Of the tree are classes of faults representing the diagnosis resolution Obtained thus far in the experiment. The root corresponds to the class of all faults including £0. The root is defined to have 12331 0. The edges out of a node correspond to the two possible outcomes of the test applied at that stage. The level of a node, say node a, is one larger than the level of the node having an edge directed to node a. A tree corresponding to a preset experiment will have the same test applied at all nodes of the same level. This is not the case in a tree corresponding to an adaptive experiment. For maximum diagnostic resolution, the leaves of the tree should correspond to classes of single faults. Notice that for permanent faults, any test need only be applied once in either type of experiment. Example 2.4 Consider the complete fault table given in Table 2.2. A preset diagnosis tree is shown in Figure 2.10. Notice that at every level the same test is applied. The symbol ¢ denotes the empty set. Whenever a class of a single fault is reached, no further edges are shown. 
Figure 2.11 shows an 48 (:{fo,fl,f2,f3,f4,f5,f6} ) Level 0 tl/O t /l a {f3,f5} m” {f2,f4} Level 2 t3/O t3 t3/0 t 3/1 t /o /1 3 3 Q @0 {£1,136} 10 {f2.f4} oLevel 3 tM t/4 t/4 /° t4/1 @ Q 0 {fz'f4} Level 4 ts/O t5/1 O a Figure 2.10. Diagnosis Tree of a Preset Experiment. (Experiment length = 5) (: {fo,fl,f2,f3,f4,f5,f6}:) Level 0 Level 1 Figure 2.11. Diagnosis Tree of an Adaptive Experiment (Maximum Experiment Length = 3) 49 adaptive diagnosis tree. In this case, the experiment has three tests at most. 2.5 MINIMIZATION OF PRESET EXPERIMENTS A minimum preset experiment is a preset experi- ment with the least number of tests. Four approaches for selecting minimal test sets will be discussed. Exhaustive enumeration and the prime implicant method are two approaches that lead to a true minimal test sets. However, they are lengthy for large circuits. The method of test intersection and the method Of distinguishibility criteria produce suboptimal solutions that are not necessarily minimal, but Often close to minimal. 2.5.1 Exhaustive Enumeration Exhaustive enumeration can be accomplished by ordering all possible tests subsets and selecting the smallest subset which is sufficient for detection or diagnosis. Obviously, this method is impossibly lengthy for even small circuits. 2.5.2 The Prime Implicant Method This method makes use of the similarity between the problem of finding a minimal detection set and the problem of minimal cover is switching theory. Tests are analogous to prime implicants, while faults are analogous 50 to the minterms to be covered. Several solutions have been proposed for Obtaining minimal covers, most notably, the Quine~McC1uskey algorithm [23], also linear and integer programming solutions have been suggested [9, 10]. Any such solution can be directly applied to minimizing detection experiments. This method can be extended to handle diagnosis experiments. The extension is due to Poage [29]. A difference table is constructed from the original fault table. It has all of the rows of the original fault table, plus a new row for each pair of + faults. An entry Of a new row formed from f1 and + f. (i#j) is the exclusive-OR of the corresponding entries 3 in f1 and Ej' The l-entries of this formed row denote the tests that distinguish between f1 and fj‘ The 1- entries of an original row can also be thought of as denoting tests that distinguish between two faults, one of them being f0. This approach is elegant and guarantees a minimal experiment, but it is impractical for moderate or large size circuits since the solution to a covering problem for that many rows in the difference table is too long. 2.5.3 The Test Set Intersection Method This method produces near minimal test sets. It is due to Galey, Norby and Roth [16]. It is actually a method for reducing the fault table. After table reduction, any other method of test selection is used. f1 and f2 51 are intersected by ANDing the corresponding entries. If the result is a non-zero vector, the two rows are replaced by the intersection. A l-entry in the intersection represents a test that detects both f1 and £2. The process is repeated by intersecting the resultant vector with E3 and so on. If the intersection of f1 and f2 is the zero vector, we intersect f1 and f3 and proceed. If this intersection also happens to be zero, we intersect f2 and E3. The process is carried on until all rows are considered. The result is a reduced fault table. 
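As a rough illustration of the row-intersection reduction just described, the following sketch processes the rows one at a time, folding each new row into the first group whose running intersection stays non-zero and starting a new group otherwise. The dictionary representation, the function name, and the small demonstration table are illustrative assumptions, not part of the original method.

```python
def intersect_rows(fault_table):
    """Greedy row intersection for reducing a fault table.

    fault_table maps a fault name to a tuple of 0/1 entries, one per test.
    Returns a reduced table mapping a group of merged fault names to the
    ANDed row, i.e. the tests that detect every fault in that group.
    """
    groups = []                              # list of (fault_names, intersection)
    for fault, row in fault_table.items():
        for k, (names, inter) in enumerate(groups):
            merged = tuple(a & b for a, b in zip(inter, row))
            if any(merged):                  # non-zero: fold this row into the group
                groups[k] = (names + (fault,), merged)
                break
        else:                                # zero with every group: start a new one
            groups.append(((fault,), row))
    return dict(groups)


# A small made-up 4-fault, 4-test table, purely for illustration:
demo = {"f1": (1, 0, 1, 0), "f2": (1, 1, 0, 0),
        "f3": (0, 1, 0, 1), "f4": (0, 0, 0, 1)}
print(intersect_rows(demo))
# {('f1', 'f2'): (1, 0, 0, 0), ('f3', 'f4'): (0, 0, 0, 1)}
```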
We then apply any test selection method to this table, such as the prime implicant method, or selecting one test corresponding at a l in every row. The test set obtained will be a suboptimal detection test set. Examp1e 2.5 Consider the fault table given in Table 2.2. If + + + + + we intersect f1, f2, f3, f4, f we obtain the reduced table shown in Table 2.3. + 5 and f6 respectively, Selecting tests that cover every row in the reduced table, we obtain {t1, t4} as the detection set, which happens to be a true minimum in this example. It is to be noted that the outcome of this method depends on the order of row intersection. If we start with the difference table than do the intersection, we can Obtain a near-minimal diagnosis test set. 52 t1 t2 t3 t4 t5 (£1 £2) 1 o o o 0 (£3, £4, f5, f6) 0 o Table 2.3. Reduced Fault Table for Example 2.3. 2.5.4 Method of Distinguishabilipy Criteria This method, due to Chang [6], selects a near- minimal diagnosis test set. The basic idea is to assign weights to tests. The weight reflects the test's ability to distinguish faults. Tests are systematically selected on the basis of their weights. The weight Wi of test ti is defined to be the number of pairs of fault conditions (including f0) which it distinguishes, i.e., w. = e. z. (2.7) where, Bi and Zi are the number of 0's and the number of 1's in the i-th column of the complete fault table. We select the test for which W1 is greatest as the first test. When j tests have been chosen, the faults would have been partitioned into b. 3 depending on the possible outcomes of these j tests. To (bj :_23) disjount blocks select the j+1-st test of this procedure, the weights of the remaining tests are computed. At this stage, the weight of a test is the sum (over the bj blocks) of the number of pairs of faults that it can distinguish within each 53 block. Let be the number of 0'5 (1's) in ei,k(zi,k) that portion of the i—th column of the complete fault table that correspond to block k. Thus, the weight W. j of test 1 after j tests. 1: b. J Wi,j = :2 91,k 11.x (2.3) k=l The test for which W. . is maximized is chosen as the 1,3 j+1-st member of the test set. The selection of tests is continued until the partition of faults can be refined no further: that is, until the weights of the unselected tests are all zeros. Example 2.6 Consider the complete fault table given in Table 2.4. After calculating the initial weights, we select t1 (t4 or t5 will do) and rearrange the complete fault table to form Table 2.5. Select t4 since it has the largest weight at this stage, then obtain the rearrangement shown in Table 2.6. Test t5 has the highest weight, so it is selected. The next re- arrangement is shown in Table 2.7. Select t2. The fourth rearrangement is shown in Table 2.8. Here, every test has a zero weight, so the process termi- nates. The test set {t1, t2, t4, t5} is the near- minimal diagnosis test set obtained. It happened 54 t1 t2 t3 t4 t5 t6 t7 f0 0 o o o o o 0 £1 0 o o 1 1 o 0 f2 0 o o o 1 o 0 f3 1 o 1 o o o 0 f4 1 1 1 o o 1 0 f5 1 o o 1 1 o 0 f6 1 o o o 1 o 0 f7 0 o o 1 o 1 1 f8 1 o 1 1 o o 0 weight 8 18 20 20 14 8 Table 2.4. Complete Fault Table and Initial Weights. that this is a true minimal diagnosis test set for this problem. It should be emphasized that in computing the test weights we should deal with the complete fault table and not the fault table since fo corresponds to a circuit condition to be distinguished from the other fault con— ditions. Chang, in his paper, did not include the ED row in his analysis. 
As a result, the test set he obtained in his example was sufficient to distinguish among the faults, given that a fault is known to have occurred, but it was not sufficient as a detection set.

        t1 | t2  t3  t4  t5  t6  t7
f0       0 |  0   0   0   0   0   0
f1       0 |  0   0   1   1   0   0
f2       0 |  0   0   0   1   0   0
f7       0 |  0   0   1   0   1   1
f3       1 |  0   1   0   0   0   0
f4       1 |  1   1   0   0   1   0
f5       1 |  0   0   1   1   0   0
f6       1 |  0   0   0   1   0   0
f8       1 |  0   1   1   0   0   0
weight     |  4   6  10  10   7   3

Table 2.5. First Rearrangement of Complete Fault Table.

This method can be extended to allow for different degrees of diagnostic resolution, i.e., it can be used to point to a faulty block if any fault occurs in that block, without pinpointing the actual fault.

2.6 MINIMIZATION OF ADAPTIVE EXPERIMENTS

In general, the length of an adaptive experiment varies depending on the existing fault condition. Thus, the meaning of the term "minimal" in a "minimal adaptive experiment" should be different from that used for preset experiments. An adaptive experiment is minimal if the expected experiment length is minimal. If prior probabilities p_i (0 ≤ i ≤ n) for the different fault conditions are known, and if ℓ_i (0 ≤ i ≤ n) is the experiment length if fault condition f_i exists, then the expected experiment length ℓ̄ is:

    ℓ̄ = Σ_{i=0}^{n} p_i ℓ_i                                        (2.9)

        t1  t4 | t2  t3  t5  t6  t7
f0       0   0 |  0   0   0   0   0
f2       0   0 |  0   0   1   0   0
f1       0   1 |  0   0   1   0   0
f7       0   1 |  0   0   0   1   1
f3       1   0 |  0   1   0   0   0
f4       1   0 |  1   1   0   1   0
f6       1   0 |  0   0   1   0   0
f5       1   1 |  0   0   1   0   0
f8       1   1 |  0   1   0   0   0
weight         |  2   3   5   3   1

Table 2.6. Second Rearrangement of Complete Fault Table.

Table 2.7. Third Rearrangement of Complete Fault Table.

Table 2.8. Fourth Rearrangement of Complete Fault Table.

ℓ̄ is the function to be minimized for a minimal adaptive experiment. Two approaches to selecting a test set for minimal experiments are discussed: exhaustive enumeration, which leads to a true minimal solution, and the method of distinguishability criteria, which results in a near-minimal experiment.

2.6.1 Exhaustive Enumeration

This approach was not practical for preset experiments. It is even worse for adaptive experiments, since we have to consider all possible permutations of test subsets. Unfortunately, this is the only known method that gives a true minimal adaptive experiment. Garey [32], in his Ph.D. thesis, developed a systematic method for the enumeration and discussed special cases which have shorter solutions. This method is only suitable for small problems.

2.6.2 Method of Distinguishability Criteria

This method makes use of a figure of merit that is computed for every test that might be selected. At any stage in the experiment, the test that has the highest figure of merit is selected. The process is repeated until no further diagnosis is possible. This approach results in locally optimized procedures, rather than globally optimized ones.
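As a sketch of this locally optimized, figure-of-merit strategy (not taken from the thesis), the loop below always applies the as-yet-unused test with the highest merit computed over the current block of still-possible conditions. The pair-count weight of Subsection 2.5.4, restricted to a single block, is used as the figure of merit only because it has already been defined; the information-gain criterion discussed next could be substituted. The fault-table representation and the outcome_of callback are assumptions for illustration.

```python
def adaptive_next_test(block, table, applied):
    """Pick the next test for an adaptive experiment.

    block   : set of fault conditions still possible (may include 'f0')
    table   : dict fault -> tuple of 0/1 outcomes (complete fault table)
    applied : set of test indices already used on this branch

    Figure of merit: number of pairs of conditions in the current block
    that the test separates (ones * zeros within the block).
    Returns the chosen test index, or None when no test separates the block.
    """
    m = len(next(iter(table.values())))          # number of tests
    best, best_merit = None, 0
    for j in range(m):
        if j in applied:
            continue
        ones = sum(table[f][j] for f in block)
        zeros = len(block) - ones
        if ones * zeros > best_merit:
            best, best_merit = j, ones * zeros
    return best


def adaptive_diagnose(table, outcome_of):
    """Run one adaptive experiment; outcome_of(j) is assumed to return the
    observed 0/1 outcome of test j on the circuit under diagnosis."""
    block, applied = set(table), set()           # start with every condition
    while True:
        j = adaptive_next_test(block, table, applied)
        if j is None:
            return block                         # no further refinement possible
        applied.add(j)
        r = outcome_of(j)
        block = {f for f in block if table[f][j] == r}
```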
Chang's method, discussed in subsection 2.5.4, can be easily adjusted to apply for adaptive experi- ments. The test weight is considered its figure of merit. Instead of adding up the number of pairs a test can distinguish in all blocks, only one block is considered. Another figure of merit, based on information gain, has been suggested by Brule’ et a1. [4]. The initial uncertainty A0 about the fault condition is: n A0 = - 2 pi log2 pi . (2.10) i=0 At every stage, the test that results in maximum infor- mation gain is selected. For example, to select the first test, the information gain AAi due to every test ti is computed. If test ti is applied, it will fail with probability qi, narrowing down the fault condition to a smaller block. Let the uncertainty among this block be A1 1. It is also true that if ti is applied, it will I not fail with probability (l-qi), narrowing down the fault condition to a different smaller block. Let the uncertainty among this block be A1 0' The information gain AAi due I to ti 1s: AAi = A0 - (qi Al'l + (1-qi) Al'o). (2.11) 60 The test with the largest information gain is selected. The probability of failure qi can be calculated from the prior probabilities. The computations at later stages are computed in a similar fashion. CHAPTER III DETECTION OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS Intermittent faults in digital circuits are those faults whose effects are not present all the time. A probabilistic model for intermittent faults is presented. Permanent faults are a special case in this model. ' Detection of intermittent faults through repeated appli- cation of tests that detect such faults, as if they were permanent, is suggested, together with a detection criterion. The detection procedure proposed is equivalent to a sequential statistical decision problem. Optimization of detection experiments is discussed later in the chapter. Assumptions similar to those made in Section 2.2 for the permanent faults case are used here, namely: (1) The single fault assumption; i.e., the circuit can have only one fault during the testing experiment. (2) Irredundancy; i.e., the circuit is assumed to be irredundant, thus the effect of a fault would not be masked. 61 62 Moreover, we will assume that: (3) The faults are well behaved, i.e., during an application of a test, the circuit behaves as if it is fault free or having a permanent fault for that period of time. That is, during the appli— cation of a test, the effect of an intermittent fault is either not present at all, present for a relatively brief interval of time that the response to the test is the same as if no failure occurred, or present for a long enough period of time so that the fault appears to be permanent for that application of this test. (4) The faults are signal independent; i.e., the presence of a fault does not depend on the signal values existing in the circuit. 3 . 1 THE MODEL The model proposed is a probabilistic one. The basic elements of the model are defined below. Assumptions about the model are indicated as we proceed. The notation used is very close to that employed in pattern recognition literature, for example, see [12, 15, 26]. State Space 0: Q = {mo, m1,...,wn} Each point in 0 represents a state of the circuit; 63 mo: denotes the state of being fault free; mi: (1 3.1 §_n) denotes the state of having inter- mittent fault number i. 
It is assumed that the state of the circuit can be described by a single element from 0 (which one, is not known); i.e., the circuit is fault free or it has only one of n possible intermittent faults (the single fault assumption). During the testing experiment, the circuit is assumed to stay in the same state. If the circuit is in state mi (1‘: i‘: n), it does not mean that the effect Figure 3.1. State Space 9. State mi corresponds to the circuit having intermittent fault number i. 64 of the i-th fault on the behaviour of the circuit will be present all the time; this is due to the intermittency of the fault. Permanent Fault Space F: F = {f f lpoou'f} 0' n There is a one to one correspondence between F and 9. f. 1, (l :_i i n), is the permanent fault that the circuit would have, if the effect of mi is present all the time and fo denotes the fault free condition. Figure 3.2. Sample Space 8. Sample point ‘1 corre- sponds to the circuit behaving as if it has permanent fault fi. 65 Samale Space 3: S = {60, 61,..., an} Every point in 3 corresponds to the outcome of a random experiment. Conceptually, the random experiment can be thought of as testing all circuit components. With the single fault assumption, the outcome of such an experiment would be: 40: all components are fault free, or 61: (1': i i n), the component that pertains to the i-th fault is faulty. Observing 40 in this conceptual random experiment, we cannot infer that the state of the circuit is mo; in fact, it could be any mi (0 1.1 i n) due to the intermittency of the faults. However, observing 6i (1 1.1 i n), we could infer that the state of the circuit is mi for sure. Test Set I: r = {t1, t2,..., tm} The test set I is a complete test set that would detect all faults under consideration if they were perma— nent. This set can be generated using any of the methods discussed in Section 2.2 to detect permanent faults. In general, a test tj will detect more than one fault. Let be the set of faults detected by tj (those faults that correspond to the l-entries in column j of the fault 66 table [definition 2.1)). In terms of the random experiment, test tj can be thought of as testing the components corresponding to faults fj , f. , ..., and fj . The test 1 32 m. has two possible outcomes: 3 (a) all components tested are fault free, or (b) one component is faulty, which one, is not known. Test Subset Ti: (l 5,1 i n) A subset of I that contains all the tests in I that detect £1“ The elements of this subset correspond to .+ the l-entries in the vector fi of the fault table. Random Variable Set T: T = {T1, T2,..., Tm} Every test tj (l :,j :_m) in T defines a random variable Tj on S. If test tj is applied to the cir- cuit and it fails, i.e., the Observed output is different from that of a normal (fault free) circuit, the value of T3. is defined to be 1. On the other hand if tj does not fail (i.e., produces an output identical to that of a normal circuit), then the value of T. J 0. Formally, Tj is a function from S into the set {0,1} defined as: is defined to be Tj:3+{0,1} I (lijim) 7 l l < i < n, and t. detects fault f.. T44.) = " — 3 1 3 1 0 otherwise. (3.1) 67 Random Variable Subset Ti: (l 2,1 1.“) T1 is a subset of T; it contains all the random variables that correspond to tests in Ti. Action Space 0 x S: The probability measure to be used, is defined on the action space 0 x S in order to be able to use prior information about the distribution over 0. 
Since the presence of the effect of an intermittent fault corresponds to a single point in 3, many of the points in 0 x S will have zero probability, namely: P(w0,bi) = 0 , 1 1’1 1 n and, P(wi,6j) = 0 , l :_i,j : n and i f j : That is, only 2n+l points in 0 x S have non—zero proba- bilities, namely: (”0:60) o (wioé-) I (”1160) I 1 iiiji no 1 Every point mi in 0 defines an event on 0 x S as follows: mi = {041,434 | o _<_ j _<_ n} (3.2) It follows that only one point in 0 x S of those contained in the mo event has non-zero probability, namely: (w0,60). Similarly, only two points in 0 x S in the wi(l 5.1 :_n) event, have non-zero probabilities, namely: ((91:60) and (0)1161). 68 2 (w2,42) 011 “In (w1,41) «an,4n) (w2,40) (w1,60) «un'60) Figure 3.3. Points in 0 x 8 With Non-Zero Proba- bilities. Similarly, every point bi in 3 defines an event on 0 x S as follows: Only one point in 9 x S, of those in the event bi (l i 1‘: n), has non-zero probability, namely (mi'biI' and all the n+1 points in 40 have non-zero probabili- ties. 69 k k-Dimensional Random Variable Set Tk (Ti , 1 i_i : n): Tk(Tik) is the cartesian product of the set T(Ti) k) can be by itself k times. An element b from Tk(Ti written as a k-tuple of random variables from T(Ti), for example: b = ( ,...,T. ) T. ,T. 31 32 3k Tjr e T (Tjr 5 Ti) , for l i|r : k . The outcome of an experiment in which k tests from 1(11) are applied to the circuit can be represented by such an element b. Its value is a k-dimensional binary vector. ...p k will be used to denote a k-dimensional The notation b = 0 binary vector, all of its entries are 0's. This corre- sponds to an experiment of k tests; none of them has failed. It should be kept in mind that during the application of an experiment (a sequence of tests), the circuit will stay in the same state mi (0 :_i g n). During the course of the testing experiment, the sample space point will be 60 all the time or changing randomly between 40 and 6. 1 (for some, but fixed 1) only. Prior Probabilities P(wi): (0 :‘i 2.“) Pi = PIwi). 0 :_i i n, is the prior probability that the circuit is in state mi. The values of these pi's are assumed to be known in the model. These values could 70 be obtained empirically, from manufacturer information, or from experience. Conditional Probability of Malfunction P(Ai/mi): (1 po (3.8) 76 P(Tl = 1/011) P(0)1) PUD /T = l) = = 1(309) 1 1 P(Tl 8 1/w0\ P(w0) + P(T1 = 1/(01) Paul) Similarly, P(ub/Tl = 1) = 0 (3.10) The interpretation of (3.6) and (3.8) is that, if t1 is applied and the circuit produced a good output (identical with that of a fault free circuit), then our certainty about the circuit being faulty will decrease and our certainty about the circuit being fault free will increase; which is quite reasonable. On the other hand, the interpretation Of (3.9) and (3.10) is that, once a bad (different from that of a fault free circuit) output is observed upon the appli- cation of t1, then we know for sure that the circuit has an intermittent fault. 
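A minimal sketch of this single-test Bayes update (the simple case of one possible intermittent fault f1 and one test t1) may make the behaviour concrete. The function name and driver loop are illustrative; the failing branch implements the certainty expressed by (3.9) and (3.10), the non-failing branch the corresponding good-output update, and the numerical figures are those that appear in Example 3.1 below.

```python
def update_simple_case(p0, p1, e1, outcome):
    """One Bayes update for the single-fault, single-test case.

    p0, p1  : current probabilities of 'fault free' and 'intermittent fault 1'
    e1      : probability that the fault shows itself when t1 is applied
              (the conditional probability of malfunction)
    outcome : 1 if t1 failed, 0 if it produced the fault-free output

    Returns the posterior pair (p0', p1').
    """
    if outcome == 1:
        # A failure can only come from the faulty state.
        return 0.0, 1.0
    # A good output: P(T1=0 | w0) = 1, P(T1=0 | w1) = 1 - e1.
    denom = p0 + (1.0 - e1) * p1
    return p0 / denom, (1.0 - e1) * p1 / denom


# Repeated good outputs drive the fault probability toward zero:
p0, p1 = 1.0 - 1e-4, 1e-4          # figures used in Example 3.1, taken as priors
e1 = 0.05
for _ in range(91):
    p0, p1 = update_simple_case(p0, p1, e1, outcome=0)
print(p1 / p0)
# about 1e-4 * 0.95**91 ≈ 9.4e-7: the ratio shrinks by (1 - e1) per good output.
```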
The posterior probabilities after applying tj for k+l times and Observing k+l good outputs are: —> P(w1/'r1=o,b=ok ; b e Tk) Pl,k+l ‘I 7: P(Tl=O/wl,b=o ) P(wl/b=0 ) -—> —-> —-> —-> k k _ _ k _ k P(T1=0/w0,b=0 ) P(w0/b-O ) + P(T1-O/w1,b—0 ) P(w1/b-O ) a (l-el) pllk = (l-el) 91.x (3 11) (l-Pl,k) + (1-e1) 91.x 1"‘31 pl,k Shfilafihn 77 ...p k po,k+1 = PIwo/Tl=0,b=0 ; b 8 TR) 9 p = 1 - e 2’: = 1 _ 2’k(1_ - (3.12) 1 1 Po,k 1 Po,k) From (3.11) and (3.12) it follows that: p1,k+1 < pl,k ' (3°13) p0.k+1 > p0,k ° (3.14) Theorem 3.1 (l-e ) p (a) p1,k = 1 % (3.15) po (b) p0,k - k (3.16) Proof (a) By induction. (l-el) P1 (1'61) p1 p1:1 _ pO + (l-el) p1 - l - eljp1 k = 1 : which is true from (3.5). Assume true for k: from (3.11), (l-e = 1’ pl,k p1,k+l 1 - e1 pl,k k (l-el) (l-el) p1 k p0 + (l-e1)k p1 -e1 (1-e1) p1 78 k+1 = (l-el) p1 91 + (1-e1)k (1-e1> pl (1_e)k+l p1 = + i.e., true for k+l po - k Corollary 3.1 (a) lim p = 0 k+oo 1'k (b) lim = l k+oo po'k Proof: (l-el) is less than 1, therefore, lim (l-el) k+oo (a) Using (3.15) and (3.19) lim k+oo P1,k (3.17) (3.18) (3.19) 79 (b) Using (3.16) and (3.19) = po :1 p0+0 lim k+co 90.x Definition 3.1 If the test tl is applied k times and good output (identical with that of a fault free circuit) was Observed every time, the likelihood ratio 1k is defined as: From (3.15) and (3.16), it follows that: Corollaryf3.2 Ak+1 < 1k (3.21) Proof: 1 from (3.20) : —§il = (1-e1) < 1 k i.e., Ak+l < 1k Q.E.D Corollary 3.3 lim 1k = 0 (3.22) k+oo 80 Proof: From (3.19) and (3.20): lim Ak=I_—p—=O k+oo A Decision Rule Earlier in this section it was stated that the pos- terior probabilities will be used in the decision rule to decide whether to continue testing or that the circuit is fault free. The following decision rule is suggested to be part of the detection procedure: If the posterior probability p1,k goes below a certain threshold 3 (0 < s < 1), decide that the cir- cuit is fault free and stop, otherwise apply t1 and repeat (i.e., go to step (1) of the detection pro- cedure). This decision rule is an acceptable one, because it guarantees termination Of the detection procedure in a finite amount of time by virtue of the fact that p1,k is monotonically decreasing (from (3.13)), and that as k increases it can get below the threshold 3 (from (3.17)). The value of the threshold 5 can be chosen by con- sidering the probability of error (probability of deciding that the circuit is fault free while it is actually faulty). This of course depends on how critical the proper operation 81 of the circuit is. Also this affects the length of the testing experiment which is a factor that can be taken into consideration when choosing 5. TO determine a least upper bound on the length of the experiment, find k for which p1 k < s . From (3.15): I k k < S (l‘Pl) + pl (l-el) (l-el)k p1 < s [(l-pl) + p1 (1-e1)k] (1-e1>k (pl-s p1)< s (1-p1) k 8(1-91) (l-el) < p1 Il-s) s (l-pl) k log (1‘61) < log m s (1-p1) log EI—Tr:gy or, k > log (1'91) (3.23) Note that log(l-e1) is negative. Example 3.1 Among the gates produced by a certain manu- facturer, it is estimated that for about 0.01% of them, the gap between the ON and OFF voltages is smaller than some critical value. If the gap is below the critical value, the gate will malfunction 5% of the time. Find a least upper bound for the length of the testing experiment. 
82 Solution From the given figures, the following quantities are estimated as shown: p1 = 10-4 el = 0.05 If we choose the threshold 5 to be 10-6, then 10’6(1-1o'4) log -4 -6 k > 10 (1-10 ) log 0.95 or k 1 91 Another Decision Rule The rule suggested here compares the likelihood ratio (which is a function of the posterior probabilities) with a threshold u (u > 0) as follows: If Ak goes below u decide that the circuit is fault free and stop, otherwise apply t1 and repeat (i.e., go to step (1) of the detection procedure). This decision rule is also an acceptable one, since it guarantees termination of the detection procedure in a finite amount of time by virtue of the fact that 1k is monotonically decreasing (from (3.21)), and that as k increases it can get below u (from (3.22)). Actually this is the optimum Bayesian decision rule (that minimizes the average loss) for the (0,1) loss function [15]. 83 The value of the threshold u could be chosen in a fashion similar to that of selecting s. The least upper bound on the length Of the testing experiment is obtained by finding k such that l < u. k From (3.20): (1-e1)k p1 U’Pl) u A )k u (l-pl) 1 p1 A (l-e u (l‘Pl) k lo l-e < lo q ( 1) 9 pl 11 (1.131) p1 long-el) (3'24) log or, k > Which is similar to the results in Wald [39] for binomial samples. 3.2.2. The General Case This is the case described by the model in Section 3.1. The basic assumptions are: (l) the circuit is irredundant, (2) the circuit can have only a single intermittent fault out of the n faults considered, (3) the faults are well behaved, and (4) the faults are signal independent. Consider the probability distribution on 0 x S governed by the following conditions: 84 PI‘I-Ii) = P1 I 0 |A )5 IA :3 P(Ai/wi) = ei , l i 1 i_n Definition 3.2 The membership functions Mj (1 i’j i’m) from the the set {0,l,...,n} into the set {0,1} are defined as: II 0 ‘ Mj(0) 1 if t. detects f. . j 1 Mj(i) (1 < i < n) 0 if tj does not detect fi' The posterior probabilities are now calculated using Bayes' Rule (Equation 3.4)), P(Tj/wi) P(wi) P(wi/Tj) = n 2 P(Tj/w£) P(w£) i=0 ei 1f tj detects fi . P(Tj = l/wi) = 0 if tj does not detect fi . Similarly, l-e. if t. detects f.. 1 j 1 P(Tj = O/wi) = 1 if tj does not detect fi . In terms of the membership functions, we can write: P(Tj = O/mi) = ei Mj(i) (3.25) P(Tj = O/wi) 1 1 - e. Mj(i) . (3.26) 85 Thus: ei MJ. (1) P (mi) P(wi/Tj = l) n (3.27) 2 e£ Mj(£) P(w£) i=0 It follows that: P(wo/Tj = 1) = o . (3.28) If ti does not detect fi then: P(wi/Tj = 1, Mj(i) = 0) = o (3.29) If tj detects fi then: ei P(wi) P(u)i/Tj = l, Mj(i) l) n 2 ez Mj(£) P(w£) £=0 (3.30) Equations (3.28), (3.29) and (3.30) give the value of the posterior probabilities if the applied test tj fails (results in an output different from that of a fault free circuit). Similarly, [1-ei Mi(i)] P(wi) P(wi/Tj = 0) = (3.31) n 2 [l-e, Mjwn P(w£) £=0 86 Thus, P(wo) P(wo/Tj = 0) = n (3.32) i=0 Notice that: n £=0 Thus, n 2 el Mj(£) P(w£) < l , (3.33) i=0 and, n :2 [l-eL Mj(£)] P(w£) P(wo) - (3.35) If ti does not detect fi' then: P(wi/Tj = 0, Mj(i) = 0) > P(wi) (3.36) This is an interesting result, since it indicates that the certainty about the circuit having a particular fault increases (if the applied test does not detect that fault) 87 even though the applied test produced a good output (that is, identical with that of a fault free circuit). If tj detects fi' P(wi/Tj = 1) could be less than, equal to, or even greater than P(wi), i.e., , _ > P(u)i/Tj = 0, Mj(1) — 1))? 
P(wi) (3.37) This is even a more interesting result since the cer- tainty about the circuit having a particular fault could increase even if the applied test detects that fault and produces a good output. From (3.36) and (3.37), it is clear that a decision rule based on comparing the posterior probabilities with some thresholds, as was done in Subsection 3.2.1 for the simple case, is not acceptable since the posterior proba- bilities are not monotonically decreasing functions, thus there is no guarantee that the detection procedure will terminate in a finite amount of time. The posterior probabilities after applying k+1 tests; the k+1-st being, say, test t are (from (3.4)): j; P(Tj/wi,b) P(wi/b) k _ P(wi/Tj, b , b e T ) _ n . P(Tj/w£,b) P(w£/b) £=0 Notice that P(Tj/wi,b) = P(Tj/wi), thus: 88 [l-ei Mi(i)] PIwi/b) P(wi/Tj a o, b ; b 8 TR) = in 0 (3.38) Definition 3.3 If k+1 tests from T are applied: the k+1-st being, say, test tj: and a good output was observed every time, the likelihood ratios xi,k+l(1 1.1 :,n) are defined as: k P(mi/Tj=o'b= SbET) .3 A. - - (3.39) 1'k+1 Paco/Tj = 0, b - 6%»; b e Tk) From (3.38), it follows that: Ai,k+l = [l-ei Mj(1)] Ai,k . (3.40) Thus Ai,k+l : Ai,k (3.41) The equality in (3.41) holds only if the k+1-st test detects f. 1, otherwise this relation is strict inequality. Definition 3.4 The initial likelihood ratios xi (1 i|i :,n) are defined as: A. = 11,0 = mgr ° (“2’ 89 Theorem 3.2 If k tests from T are applied, L of them are from Ti, and none Of them failed, then: _ L Ai,k — (1-ei) Xi . (3.43) Proof It is clear that L :_k. Proof is by induction on k. k = 1 : from (3.40), A1,1 = [1'31 the first test being tj. Mj(i)] A1; L could be 0 or 1: II >a 0, i.e., Ai,l i' If L = 0 (tj t Ti), then Mj(i) which satisfies (3.43) for L = O. l, 1.e., Ai,l = If L = l (tj 8 Ti), then Mj(i) (1‘9 )Air 1 which satisfies (3.43) for L = 1. Assume true for k: from (3.40), A1,k+1 = [1'91 MjIIII A1,18 the k+1-st test being tj. If tj t Ti, then Mj(i) = 0, thus, A A. = (l-ei)£ A. . (3.44) i,k+l = 1,k 1 tj ¢ Ti also means that the number of tests from Ti in the first k+1 tests is L. Therefore, (3.44) indi- cates that the theorem is true for k+1. If tj 8 Ti, then Mj(1) = 1, thus; 90 _ _ _ L+l Ai'k+l - (1 31) Ai’k - (1 ei) A1 0 (3.45) tj 8 Ti also means that the number of tests from Ti in the first k+1 tests is L+1. Therefore, (3.45) indicates that the theorem is true for k+1. Q.E.D. Corollary 3.4 11:21+00 Ai,k = 0 (3.46) Proof: (l-ei) < 1 I therefore, lim 1. k = lim (l-e.)£1. = 0 . [+00 1' [+00 1 1' Q.E.D. The Decision Rule The decision rule suggested, compares the likelihood ratios xi, (1 :_i 1.“)! with thresholds ui(ui > 0 for all i) as follows: If 11‘: ui for all i, decide that the circuit is fault free and stop. If Xi > ui for some i, select a test from Ti and repeat, (i.e., go to step (1) of the detection pro- cedure). From (3.41) we see that the likelihood ratios are monotonically non-increasing functions. Any likelihood 91 ratio Ai,k will strictly decrease if we apply a test from Ti (and of course, that test produces a good output). Thus, from (3.46), the likelihood ratios could go below the specified thresholds, and the detection procedure is guaranteed to terminate in a finite amount of time using this decision rule. Hence, it is an acceptable rule. Theorem 3.2 can be used to determine a least upper bound on the number of tests ki from Ti (l i'i : n) that are needed, as follows: k. 1 (l-ei) Ai < u 1 k. u. 1 1 u. ki log(l-ei) < log X_' u. 1°? 1‘?- 1 °r' ki > 1°9I1'617 u. 
Since λ_i = p_i / p_0, this can be written as

    k_i > log(u_i p_0 / p_i) / log(1 - e_i) .                      (3.47)

3.3 OPTIMUM DETECTION EXPERIMENT

In Section 3.2.2, least upper bounds k_i on the number of tests that detect f_i that are needed were obtained in (3.47). However, a given test usually detects more than one fault; so, a method is needed to determine how many times every test should be repeated so that (3.47) is satisfied for all i (1 ≤ i ≤ n), and so that the overall experiment length is minimum. If t_j (1 ≤ j ≤ m) is repeated x_j times, then (3.47), in terms of the fault table matrix A (Definition 2.1), yields

    Σ_{j=1}^{m} a_ij x_j ≥ k_i ,   1 ≤ i ≤ n .                     (3.48)

The experiment length L is:

    L = Σ_{j=1}^{m} x_j                                            (3.49)

L is the function to be minimized. The quantities at hand satisfy the following conditions:

1. x_j ≥ 0
2. all x_j's are integers

This is an all-integer integer programming problem, the solution of which determines the optimum number of repetitions for each test. For details and solutions of integer programming problems, see [19, 20]. It should be noted that for the permanent fault case, every k_i (1 ≤ i ≤ n) will be equal to 1; the problem then becomes similar to the problem of minimizing Boolean functions. Integer programming treatments for this special case are found in [9, 10, 19].

Example 3.2

Consider the following fault table matrix:

        t1  t2  t3  t4  t5
f1       1   1   0   1   0
f2       1   1   1   0   1
f3       0   1   1   1   1
f4       0   1   0   0   1
f5       0   0   0   1   1
f6       1   0   1   0   0

The constraints (3.48) are:

    x1 + x2 + x4        ≥ k1
    x1 + x2 + x3 + x5   ≥ k2
    x2 + x3 + x4 + x5   ≥ k3
    x2 + x5             ≥ k4
    x4 + x5             ≥ k5
    x1 + x3             ≥ k6

Find integer x_j's that minimize: x1 + x2 + x3 + x4 + x5.

3.3.1 A Suboptimal Solution

If the values of the k_i's are relatively large (e.g., > 10), the integer programming problem presented by (3.48) and (3.49) can be solved as a linear programming problem (actually as a transportation problem, since the coefficients are 0's and 1's). The solutions are then rounded up (the smallest integers greater than or equal to the obtained solutions are used). This is generally a faster solution, since solving a linear programming problem is easier than solving an integer programming one. Linear programming methods will result in very little deviation from the optimal solution if the values of the k_i's are large enough, since the function to be minimized is just the sum of the x_j's.

3.3.2 Reduction of the Fault Table Matrix A

The size of the matrix A can easily get to be huge for a large circuit. The amount of memory and the number of computations needed to solve an integer programming problem (or a linear programming one) can be greatly reduced if the size of matrix A is reduced. In this subsection, an attempt is made to transform the problem into an equivalent one, but with a smaller matrix A*. As A is transformed into A*, the values of the k_i's will be adjusted. The obtained solution, i.e., the x_j's, for the reduced problem will also be adjusted by adding appropriate biases b_j's to obtain the solution for the original problem. The initial bias values are zeros. The following operations are suggested for the reduction of the fault table matrix A to A*:

(a) If only one 1 occurs in row i, say a_ij = 1 and a_ik = 0 for k ≠ j, delete row i from the matrix, increment b_j by k_i, and, for every row q for which a_qj = 1, replace k_q by k_q - k_i; if this difference is zero or negative, delete row q.

(b) If row i contains 1's wherever row j does (i ≠ j), that is, for each k, a_jk = 1 implies a_ik = 1 (row i dominates row j), and if k_j ≥ k_i, delete row i.
(c) If column 1 contains l's wherever column j does (1 # j), that is, for each k, akj=l implies aki=1' (column 1 dominates column j) then delete column j. These operations may be carried out in any order starting with A and continuing until a matrix A* is Obtained on which none of them can be applied. The solution for the reduced problem (with A*) is then obtained. Every xj in the solution of the original problem is obtained by adding the bias bj to the corresponding xj obtained in the solution with A*. Theorem 3.3 If A* is the matrix obtained from the fault table matrix A by a succession of operations of the (a), (b) and (c) types suggested above, then the solution to the original problem (presented by (3.48) and (3.49) is the same as that of the reduced problem adjusted by the biases bj's (1 i j :_m). 96 32222 It is sufficient to prove that any Operation will result in a problem with an equivalent solution. (1) Assume operation (a) is performed. This means that one of the constraints in (3.48) is xj Z ki , i.e., test tj has to be repeated at least ki times. If we adjust the corresponding bias bj by k1' this constraint can be removed (i.e., delete row i) since the bias will make sure that it is satisfied. Every constraint of the form: ... + xj + ... Z.kq , can be written as + (xj'ki) + :kq-ki . Let x3 = xj - ki. The set of constraints can now be written in terms of xi (the reduced problem). If kq-ki is zero or negative, then this con- straint is automatically satisfied by xj > ki thus row q can be eliminated. If ki < kq we solve the reduced problem. Obviously, a solution to the reduced problem is a solution to the original problem if we adjust the obtained value for x3 by ki, i.e., by the amount of bias adjust- ment. Thus, Operation (a) results in an equiva- lent solution. (2) (3) 97 Assume operation (b) is performed. Let the number of l-entries in rows 1 and j be qi and qj respectively. Without loss of generality, assume that the l-entries in row i are the first qi entries in that row. Since the condition for operation (b) is satisfied, we can also assume that the l-entries of row j are the first qj entries in that row. It is clear that qi :_q.. 3 The following constraints must now exist: +x2+ooo+xq +ooo+x Z x 1 . . 3 q X k1 1 k. +x +...+ 2 X [V l 3 j The second constraint implies: x +x +000+x +OOO+X 1 2 qj qi >k If kj :.ki' then the first constraint is auto- matically satisfied by satisfying the second. Thus, elimination of row i will not change the solution. Assume operation (c) is performed. In this case, it is clear that xj=0 does not result in any contradiction to the constraints of (3.48), since for every xj appearance, xi also appears; so, xi can be set large enough to satisfy the con- straint. We need to prove that xj=0 is part Of m a solution that minimizes : 2: 1g; . Assume q=l 98 that a solution minimizing this sum exists with xj > 0. From this solution, substitute xi + xj for xi and 0 for xj in (3.48). NO contradiction results since column i dominates column j. The same substitution will result in the same sum m 23 x q=1 q found a solution with xj=0. Thus column j can be , i.e., minimality is maintained. Thus we eliminated. CHAPTER IV DIAGNOSIS OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS In Chapter III, a probabilistic model for intermittent faults was introduced, also an approach for the detection of these faults. The detection procedure proposed relied on the repeated application of tests that would detect these faults had their effect been permanent. 
Several methods for test generation were discussed in Chapter II. In this chapter, we present an approach for the diagnosis of intermittent faults in combinational circuits, also, employing the repetition of tests that detect permanent faults. 4.1 GENERAL ASSUMPTIONS There are three basic assumptions for the diagnosis methodology to be presented, namely: (1) The probabilistic model, introduced in Section 3.1 for intermittent faults, will be used here. 99 (2) (3) 100 All the assumptions previously made in the model are also carried here. Of major importance is the single fault assumption. A detection experiment, similar to that proposed in Section 3.2, is assumed to have been run and resulted in the decision that the circuit has an intermittent fault, i.e., a test in that experi- ment has failed. This assumption assures that a fault exists in the circuit before we start the diagnosis experiment. The posterior probabilities of the states Of the circuit at the end of the detection experiment are assumed to be known to the experimenter. These will be used as prior probabilities in the diagnosis experiment. A fourth assumption, that is helpful even though not essential, will also be assumed: (4) The test that failed in the detection experiment, say test tj' is assumed to be known to the experiment. This assumption tends to reduce the length of the diagnosis experiment since it will start with less possible faults (those faults that correspond to the l-entries of column j of the fault table) than the total fault set. 101 4.2 DIAGNOSIS OF INTERMITTENT FAULTS The approach suggested for the diagnosis of intermittent faults is through the repeated application of tests from the test set I. A subset of T is selected and its tests are repeatedly applied until a failure occurs. This narrows down the possible faults that the circuit might have. The fault table is then reduced and another subset of T is selected and the process is repeated until enough failures occur to diagnose the fault in the circuit. Let the set of possible intermittent faults, at any stage in the experiment, be 0p. Let Fp be the subset of F that corresponds to 0p. Let 0p (also Fp) contain n elements. Initially Fp contains the faults that have 1-entries in column j of the fault table (assuming that tj was the test that failed in the detection experiment). Obviously, 0p does not contain mo since in the diagnosis phase, we know that the circuit is faulty for sure. The posterior probabilities of the states of the circuit at the end of the detection experiment are used as prior proba- bilities for the diagnosis experiment. Fault Table Reduction. Whenever a failure occurs during the diagnosis experiment, Fp is narrowed down, and consequently the fault table is reduced. The following reduction steps are applied until none of them can be applied any more: (a) (b) (C) (d) 102 Eliminate all rows that correspond to faults not contained in Fp. Eliminate all redundant columns, i.e., eliminate all but one of every group of identical columns. Eliminate all O-columns. Eliminate all the l-columns. The failure of a test, whose corresponding column in the fault table has all l-entries, does not contribute any information to the diagnosis Of the existing fault. Thus this test is eliminated. Diagposis Procedure. The following diagnosis pro- cedure is now suggested (a flow chart for this procedure is given in Figure 4.1): (1) (2) (3) (4) From the outcome of the detection experiment, compute Fp , 0p. Each of these sets has n elements. 
If Fp is a singleton (i.e., if n = 1) go to (9), otherwise go to (3). Obtain a reduced fault table using steps (a) through (d) indicated above. Select a subset Tp of I that covers Fp, i.e., Tp is a detection test set for Fp. Let Tp contain 0 tests. Without loss of gener- ality, we can assume that the tests in Tp are ordered, so we can speak of the i-th test in Tp (l i i i u). If no such Tp is found, go to (9). 103 C .....9 I Q = Set of possible faults p F corresponds to 0p Each contains n elements YES NO Reduce fault table Select Tp that covers Fp Tp contains u tests NO C ) [ Apply i—th test of Tp I U d t 9 F & +__.___ Lpaeplp 7) Figure 4.1. Flow Chart for the Diagnosis Procedure. (5) (6) (7) (8) (9) 104 i = 1. Apply the i-th test of Tp' If this test fails, go to (8), otherwise go to (7). Increment i by 1. If it exceeds u go to (5), otherwise go to (6). At this point, a test in Tp has failed, this narrows down the set of possible faults 9p (also Fp) to a smaller set. This smaller set is the one that corresponds to faults in Fp with l-entries in the column corresponding to the failing test in the fault table. Replace F P and 9p by the smaller sets indicated above. Update n. GO to (2). Stop. Diagnosis experiment is complete. Diag- nostic resolution is determined by F . If Fp P is a singleton, complete diagnosis is obtained. Notice that when selecting the test set Tp’ no test that failed earlier in the diagnosis experiment will be chosen, since such a test corresponds to a column, all of its entries are 1's, in the reduced fault table. Such a column will be eliminated by operation (d) of the fault table reduction procedure. The diagnosis procedure terminates only if one of two conditions arises: (1) Fp becomes a singleton; in which case complete diagnosis is obtained, or 105 (2) no Tp that covers Fp is found; in which case complete diagnosis is not obtained, the diagnosis resolution being determined by Fp. Diagnosis Tree. The diagnosis procedure can be represented by a tree. The nodes correspond to the Fp's, with the root being the initial Fp Obtained from the detection experiment. The edges out of a node correspond to the tests Of the appropriate Tp' i.e., every node has u (the corresponding u) edges out of it. An edge, corre- sponding to a test tj out of a node 8, goes to node B that corresponds to the subset of the faults of node a that are detected by tj. At any stage in the diagnosis procedure, if a different Tp is chosen, a different diag- nosis experiment and consequently a different diagnosis tree will result. A complete diagnosis tree is a diagnosis tree that contains all the subtrees corresponding to all the possible outcomes of the diagnosis experiments. The leaves corre- spond to the maximum diagnostic resolution possibly obtained. Definition 4.1 The portion of the diagnosis procedure that con- sists of applying the tests Of Tp repeatedly until a failure occurs is called a subexperiment. In terms of the diagnosis procedure, a subexperiment consists of iterations of the loop defined by steps (5), 106 (6) and (7) until a failure occurs to exit from this loop. Notice that the failure of a test is the only way to exit from this loop. In order for the diagnosis experiment to be finite, the expected length of every subexperiment in it should be finite. The condition that Tp covers Fp guarantees that, this will be proved later in Section 4.3. 
During the diagnosis procedure, it is desirable to calculate the posterior probabilities of the different states of the circuit after observing the outcome of every test applied. These posterior probabilities can be employed to select a Tp’ at the beginning of a subexperiment, that tends to make the diagnosis experiment shorter. It should be noted that the posterior probabilities for any wi that is not in Up is zero. This follows directly from (3.29) since such an mi corresponds to a permanent fault f. 1 that is not detected by a test that failed earlier. Example 4.1 A detection experiment was run. It resulted in the decision that the circuit has an intermittent fault. Observing the failing test ruled out some possibilities for the fault condition. Fault table reduction, as suggested above, resulted in the reduced fault table given in Table 4.1. A diagnosis experi- ment is to be run. t1 t2 t3 t4 t5 t6 t7 f1 0 0 0 1 1 0 1 f2 0 1 0 l 0 1 0 f3 0 1 0 0 1 1 1 f4 1 0 1 0 1 0 0 f5 1 l 1 0 0 0 1 f6 0 1 0 0 0 1 1 f7 0 0 1 1 0 1 1 f8 1 1 0 0 0 l 0 Table 4.1. Reduced Fault Table After Detection Experiment. Th ' 't' 1 Q d F : e 1n1 1a p an p are up = {w1,w2,...,w8} Fp = {fl'f2'...'f8} Notice that mo and f0 are not included in GP or Fp. The test set {tl,t2,t7} covers Fp; select 1t as Tp' Apply t1,t2,t7,t1,t2,t7,... unt1l a failure occurs. Assume that this subexperiment results in the failure of t7. The set of possible faults is nar- rowed down to those faults with 1's in column t7 of Table 4.1 (w2,w4, and w are ruled out). Thus: 8 Q = {wl,w3,w5,w6,w7}. F = Ifl'f3'fs'f6'f7}° 108 The corresponding reduced fault table is shown in Table 4.2. Select a new Tp that cOvers Fp. The test set {t2,t5,t6} is selected. Apply t2,t5,t6,t2,t5,t6,... until a failure occurs. Assume that t2 failed in this subexperiment. This rules out w and w 1 7' Thus, we have: 9 P Fp = {£3,£ {w3,w5,w6} s'fs} Reducing Table 4.2 in correspondence with the outcome of this subexperiment, we obtain Table 4.3. The test set {t1,t6} is a suitable Tp since it covers Fp. Apply t1,t6,t1,t6,... unt1l a fa1lure occurs. Say t1 fails. This rules out OB and w6. The corresponding Fp and 0p are: t1 t2 t3 t4 t5 t6 f1 0 o o 1 1 0 £3 0 1 o o 1 1 £5 1 1 1 o o 0 f6 0 1 o o o 1 f7 0 o 1 1 o 1 Table 4.2. Reduced Fault Table After First Subexperiment. 109 t1 t5 t6 f3 f5 f6 0 1 Table 4.3. Reduced Fault Table After Second Subexperiment. Q = {ms} 9 F {f5} P Fp is a singleton. This terminates the diagnosis experiment. The circuit is diagnosed as having inter- mittent fault ms. If test t6 is the one that failed in this sub- experiment and not t1, then ms is ruled out. The corresponding Fp and 0p are: Up = {w3,w6} Fp = {f3,f6} The corresponding reduced fault table is given in Table 4.4. In this case there is no TP that covers both f3 and f6. This terminates the diagnosis experiment with the diagnosis resolution being defined by Fp which conta1ns f3 and f6. 110 Table 4.4. Reduced Fault Table when t6 Fails in Third Subexperiment. That is, the result of the diagnosis experiment is: the circuit has either intermittent fault w3 or intermittent fault N6. Thus, the components that pertain to these two faults should be replaced for repair. A diagnosis tree for this example is shown in Figure 4.2. The tree shown is not a complete diagnosis tree. In this example, we did not make use of the posterior probabilities since we were not concerned about comparing Tp's or designing a shortest diag- nosis experiment. 
The last result of Example 4.1 should be compared with the diagnosis of permanent faults. If the faults of this example were permanent, maximum diagnostic is possible since the rows of the fault table are distinguishable (Theorem 2.2). However, in case of intermittent faults, and using the suggested diagnosis procedure, we were unable to obtain complete diagnosis. Hence, there are fundamental 111 (ifl’fz’fa'f4’fs'fe’f7’f8i> t1 t2 t7 {f4,f5,f8} ({ f ,f3, f5 ,f6 ,£;:> {f1,f3,f5, f6 ,fz:) t t 116 3} {f6.f7} Figure 4.2. Diagnosis Tree for Example 4.1. differences between the diagnosis of intermittent faults and the diagnosis of permanent ones that are more subtle than mere repetition of tests. Fault table properties that are relevant to the diagnosis of intermittent faults are explored later in Section 4.4. 112 4.3 EXPECTED LENGTH OF SUBEXPERIMENT In this section, we try to compute the expected length of a subexperiment (definition 4.1) of the diagnosis procedure. In a subexperiment, tests of an appropriate Tp are repeatedly applied in sequence until a failure occurs. Tp contains u tests. Let the set Of corresponding random variables be Tp. The set of possible faults 0p contains n states. Without loss of generality, we can label the elements of Tp and Tp as: l,2,...,u, and the elements of 9p as: l,2,...,n. We assume that the posterior probabilities for the states of the circuit were computed after Observing the outcome of every test in the diagnosis experiment. The posterior probabilities computed thus far (to the beginning of the subexperiment) will be used as prior probabilities in the subexperiment. The problem is to find the expected number of times the tests of Tp will be applied until the first failure. Notation. The following notation will be used in this section: —-+ _ U - _ _ _ Tp - 0 will mean Tl—O, T2—0,..., and Tu—O This corresponds to the case where every test in Tp has been applied once and one of them failed. ...) Tp # 01“ will mean the complement of the above 113 condition. That is, every test of Tp was applied once and at least one of them failed. ...) k ~«-> -+ k . . Tp = 011 will mean Tp=0u, Tp=0u,... k times. This corresponds to the case where the test set Tp was applied k times and no failure occurred. The wi's (l :_i i n) define disjoint events over the probability space (9 x S)k (the cartesian product of (0 x S) by itself k times). The probabilities of these events (the prior probabilities of the subexperiment) add to 1 due to the single fault assumption. Thus we can write: -73 _ _ _ _ - P(Tp—0 ) — P(Tl—T2—...-Tu—0) n ——> __ _ IJ - P(Tp-0 /wi) P(wi) (4.1) i=1 If mi 13 covered ki times by Tp (1.e., ki tests of Tp detect fi), then: ..., k. _ u _ _ 1 P(Tp—0 /wi) _ (1 ei) (4.3) The conditional probability of at least one failure during the application of Tp is: '3 _ - - p(Tp¢o /wi) — 1 (1 ei) (4.4) 114 If Tp is applied k times and no test failed, then: '—* k.k P(T k=0“pr = (1—ei) 1 p (4.5) The probability of a first failure on the k+1-st appli- cation of Tp is: + --> P(T k+1 ; T k+1 = T k T ' T k = OUk ' T #0“) P P P P P P n —» k+1 k k uk = P T o T " T T r T - 0 T O (1) Z ( P p p y! / o) p i=1 n k.k k. = Z (l-e.) l (l - (l-e.) 1) Pi (4.6) l 1. i=1 This is the probability that the subexperiment will have k+1 applications of Tp. Thus, the expected number of applications of Tp is:+ )2 = E(k) °° n k.k k. = 2 (k+1) Z (l-ei) 1 (l-(l-ei) 1) pjL = i=1 [I °° k.k k. .= :Ep .22 (1- e.) 1 (1- (1— e ) 1) (k+1) +E(k) denotes the expected value of k. 
The inner infinite series is the expansion of a negative binomial,* thus

    k̄ = Σ_{i=1}^{n}  p_i / (1 - (1 - e_i)^{k_i})                   (4.7)

The expected subexperiment length ℓ̄ is:

    ℓ̄ = u k̄                                                        (4.8)

The expected subexperiment length given by (4.8) is approximate, since the last application of T_p (when a failure occurs) will probably be incomplete. However, it is fairly accurate if k is relatively large. The conditional expectation of the number of applications of T_p is given by:

    E(k/ω_i) = 1 / (1 - (1 - e_i)^{k_i})                            (4.9)

From (4.7) and (4.3), if k_i > 0 for every i, then ℓ̄ is finite, and if k_i = 0 for some i (1 ≤ i ≤ n), then ℓ̄ = ∞. This explains why T_p was required to cover Ω_p in the diagnosis procedure: to ensure that the diagnosis procedure will eventually terminate.

*If 0 < x < 1, then Σ_{k=0}^{∞} (k+1) x^k = 1/(1-x)^2.
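A minimal computational sketch of (4.7) and (4.8) follows; the function name, argument layout, and toy numbers are illustrative assumptions.

```python
def expected_subexperiment_length(priors, e, cover_counts, u):
    """Expected subexperiment length per equations (4.7) and (4.8).

    priors       : p_i, probabilities of the n still-possible faults
    e            : e_i, conditional probabilities of malfunction
    cover_counts : k_i, how many tests of T_p detect fault i
    u            : number of tests in T_p

    Returns (k_bar, l_bar); raises if some k_i is zero, i.e. T_p does not
    cover every possible fault, in which case the expectation is infinite.
    """
    k_bar = 0.0
    for p_i, e_i, k_i in zip(priors, e, cover_counts):
        if k_i == 0:
            raise ValueError("T_p does not cover every possible fault")
        k_bar += p_i / (1.0 - (1.0 - e_i) ** k_i)   # equation (4.7)
    return k_bar, u * k_bar                         # equation (4.8)


# Hypothetical numbers: three possible faults, T_p of three tests, each fault
# covered once, equal posteriors, e_i = 0.1 for all i:
print(expected_subexperiment_length([1/3, 1/3, 1/3], [0.1, 0.1, 0.1],
                                    [1, 1, 1], u=3))
# (10.0, 30.0): about ten passes of T_p, roughly thirty test applications.
```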