DIAGNOSIS OF INTERMITTENT FAULTS IN DIGITAL SYSTEMS

Thesis for the Degree of Ph.D.
MICHIGAN STATE UNIVERSITY
SAMIR KAMAL
1972

ABSTRACT

DIAGNOSIS OF INTERMITTENT FAULTS IN DIGITAL SYSTEMS

BY SAMIR KAMAL

Digital computers are being relied upon as integral parts of an increasing number of systems handling all aspects of our life. The proper operation of computers is vital to the functioning of these computerized systems. One of the major approaches to achieving proper operation of computers is fault diagnosis plus repair. This thesis addresses one aspect of this approach, namely the diagnosis of intermittent faults in combinational circuits.

Intermittent faults in digital systems are those faults whose effects are not observed all the time. A system having an intermittent fault may show the effect of such a fault when an input test is applied one time, and possibly not show it when the same test is applied many other times.

A probabilistic model is suggested for intermittent faults in logical circuits. The model assumes that the circuit is irredundant and that it can have only one of a possible set of faults (the single fault assumption). It also assumes that the faults are well behaved and are signal independent. Testing for such faults is done through the repeated application of tests that would detect these faults had their effect been permanent. These tests are generated using any of the methods employed for generating tests that detect permanent faults. The prior probability of the circuit having any of the intermittent faults is assumed to be known in the model. The probability of observing the effect of each fault, if that fault exists, is also assumed to be known.

A procedure for the detection of intermittent faults is suggested. It is analogous to a sequential statistical decision problem. At any stage during testing, the procedure terminates if a failure is observed or if the decision rule decides that the circuit is fault free. Otherwise, it selects an appropriate test to be applied at the next stage. The decision rule used in the detection procedure is selected such that it ensures that the procedure terminates in a finite number of steps. Least upper bounds on the number of repetitions of tests that detect a particular fault are derived. They are employed in designing optimum detection experiments. Such an optimization problem is found to be equivalent to an integer programming problem.

A diagnosis procedure, also employing the repetition of tests that detect permanent faults, is proposed. It is proved that the conditions required in the procedure guarantee that the expected length of the diagnosis experiment is finite. Test properties that are needed for the diagnosis of intermittent faults are determined. Some fundamental differences between those properties needed for the intermittent fault case and those needed for the permanent fault case are pointed out. The problem of optimizing the diagnosis experiment is examined. Unfortunately, the true optimum solution can be obtained only through impossibly lengthy enumeration. Two suboptimal approaches that are heuristic in nature are suggested.
DIAGNOSIS OF INTERMITTENT FAULTS IN DIGITAL SYSTEMS

BY
SAMIR KAMAL

A THESIS
Submitted to Michigan State University in partial fulfillment of the requirements for the degree of
DOCTOR OF PHILOSOPHY
Department of Computer Science
1972

TO MY PARENTS

ACKNOWLEDGMENTS

I am grateful to Dr. Carl V. Page, the chairman of my guidance committee, for his guidance and encouragement during the course of this thesis. My thanks also go to Dr. Richard C. Dubes, Dr. Edgar M. Palmer, Dr. Morteza A. Rahimi, and to Dr. Bernhard L. Weinberg for serving on my guidance committee and for reviewing this work. I wish to express my thanks to the Division of Engineering Research for their financial support during the writing of this thesis, and further, to the Egyptian Government for the financial support that made my graduate studies possible. Finally, I am deeply indebted to my wife Lillian for her encouragement, loyalty, patience and understanding; and also to my daughter Heba whose joyous smile is always a source of happiness and reassurance.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
GLOSSARY

Chapter
I.   INTRODUCTION
     1.1 FAULT-TOLERANT COMPUTING
     1.2 SOME APPROACHES TO ATTAIN STRUCTURALLY RELIABLE SYSTEMS
     1.3 SWITCHING CIRCUITS
         1.3.1 Combinational Circuits
         1.3.2 Sequential Circuits
     1.4 LOGICAL FAULTS
     1.5 INTERMITTENT FAULTS
     1.6 CONTRIBUTION AND ORGANIZATION OF THE THESIS

II.  DIAGNOSIS OF PERMANENT FAULTS IN COMBINATIONAL CIRCUITS
     2.1 TESTS
     2.2 TEST GENERATION
         2.2.1 Truth Table Method
         2.2.2 Path Sensitizing
         2.2.3 The D-Algorithm
         2.2.4 Algebraic Methods
     2.3 FAULT TABLE
     2.4 EXPERIMENTS AND THE DIAGNOSIS TREE
     2.5 MINIMIZATION OF PRESET EXPERIMENTS
         2.5.1 Exhaustive Enumeration
         2.5.2 The Prime Implicant Method
         2.5.3 The Test Set Intersection Method
         2.5.4 Method of Distinguishability Criteria
     2.6 MINIMIZATION OF ADAPTIVE EXPERIMENTS
         2.6.1 Exhaustive Enumeration
         2.6.2 Method of Distinguishability Criteria

III. DETECTION OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS
     3.1 THE MODEL
     3.2 DETECTION OF INTERMITTENT FAULTS
         3.2.1 A Simple Case
         3.2.2 The General Case
     3.3 OPTIMUM DETECTION EXPERIMENT
         3.3.1 A Suboptimal Solution
         3.3.2 Reduction of the Fault Table Matrix A

IV.  DIAGNOSIS OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS
     4.1 GENERAL ASSUMPTIONS
     4.2 DIAGNOSIS OF INTERMITTENT FAULTS
     4.3 EXPECTED LENGTH OF SUBEXPERIMENT
     4.4 FAULT TABLE
     4.5 OPTIMIZATION OF DIAGNOSIS EXPERIMENTS
         4.5.1 Exhaustive Enumeration
         4.5.2 Local Optimization
         4.5.3 The Method of Maximum Resolution
     4.6 SPECIAL CASES
         4.6.1 The Simple Fault Table Case
         4.6.2 The Simple Elimination Fault Table

V.   SUMMARY, CONCLUSIONS AND SUGGESTIONS FOR FUTURE WORK
     5.1 THESIS SUMMARY
     5.2 CONCLUSIONS
     5.3 SUGGESTIONS FOR FUTURE WORK

BIBLIOGRAPHY

LIST OF TABLES

Table                                                        Page
2.1 Truth Table for Normal and Faulty Circuits when "a s-a-1"  24
2.2 Example of a Complete Fault Table                          44
2.3 Reduced Fault Table for Example 2.3 . . .
52 2.4 Complete Fault Table and Initial Weights . 54 2.5 First Rearrangement of Complete Fault Table 0 O O O O 0 O O O O O 55 2.6 Second Rearrangement of Complete Fault Table 0 O O O O O O I O O O 56 2.7 Third Rearrangement of Complete Fault Table 0 O O O O O O O O O O 57 2.8 Fourth Rearrangement of Complete Fault Table 0 O O O O O O O O O O 57 4.1 Reduced Fault Table after Detection Experiment. . . . . . . . . . 107 4.2 Reduced Fault Table after First Subexperiment. . . . . . . . . 108 4.3 Reduced Fault Table after Second Subexperiment. . . . . . . . .. 109 4.4 Reduced Fault Table when t6 Fails in Third Subexperiment. . . . . . . 110 vii Figure 1.1 2.2 2.3 2.4 2.5 2.6 2.7 2.8 2.9 2.10 2.11 LIST OF FIGURES Some Logical Elements . . . . . . . Block Diagram of a Combinational Circuit. Example of a Combinational Circuit. . . Block Diagram of a Sequential Circuit. . Faults in an Inverter Circuit . . . . Input Diode Failure of an AND Gate. . . Bridging Faults . . . . . . . . . Example of Faults in an Exclusive-OR Circuit . . . . . . . . . . Redundancy and Fault Masking. . . . . Truth Table Test Generation for "a 8-3-1". 0 o o o o o o o o Adjusting Gate Input to Make Output Sensitive Only to a Single Input . . Path Sensitizing. . . . . . . . . OR Gate and Its Singular Cover . . . . AND Gate Behaving as an OR Gate when Faulty. . . . . . . . . . . 0115 is a Propagation D-Cube of a NAND Gate for a Change in x1 . . . . . Experiment Types. . . . . . . . . Diagnosis Tree of a Preset Experiment. . Diagnosis Tree of an Adaptive Experiment. viii Page 0mm 12 13 14 19 21 23 26 28 32 33 35 46 48 48 List of Figures 3.1 State Space 0 . . . . . . . . . . 63 3.2 Sample Space 3 . . . . . . . . . . 64 3.3 Points in 0 x S with Non-Zero Probabilities . . . . . . . . . 68 3.4 Flow Chart for Detection Procedure . . . 73 3.5 O x 3 Space for a Simple Case . . . . . 74 4.1 Flow Chart for the Diagnosis Procedure . . 103 4.2 Diagnosis Tree for Example 4.1. . . . . 111 ix Term UI EI.) GLOSSARY Meaning An entry in the fault table matrix Fault table matrix Initial uncertainty Uncertainty after the application of a single test that did not fail Uncertainty after the application of a single test that failed Reduced fault table matrix k-dimensional random variable Number of blocks, or j-th bias A signal that is normally 1 but becomes 0 when a fault is present A signal that is normally 0 but becomes 1 when a fault is present Conditional probability of malfunction Expectation Page 40 4O 59 59 59 94 69 52 94 31 31 70 114 Glossary fi(xl,x2,...) I fi(x1'x2'ooo) F F Pi gk(xllleo--) iO'il'in k 31 k+m The fault free condition Permanent fault number i The row corresponding to fault fi in the fault table A Boolean function of x1,x2,... A Boolean function of x1,x2,... Permanent fault space Set of permanent faults corresponding to Q P Subset of Fp that is detected by i-th test of T P A Boolean function of x1,x2,... 
Fault parameters Least upper bound on length of the detection experiment for a simple case, or Number of repetitions of IF Least upper bound on number of tests detecting fi that are needed for the detection experiment, or Number of times fi is covered by Tp Expected value of k As k approaches infinity xi 40 19 41 10 64 101 123 10 38 81 114 91 113 114 78 Glossary NH Number of tests detecting a given fault, or Experiment length EXpected length of experiment Experiment length if i-th fault is present Number of 1's in column 1 of the fault table Number of 1's in portion of column 1 that belongs to block k Number of output variables, or Number of tests in T The j-th membership function Number of input variables, or Number of faults Number of secondary variables in a sequential circuit Prior probability of fault number i Posterior probability of mo of simple case after applying tests, none of them failed Posterior probability of ml of simple case after applying tests, none of them failed xii DJ 89 92 56 56 52 53 40 84 40 56 77 76 Glossary P(.) s-a-O s-a-l Probability of Probability of failure of test ti' or Number of l-entries in i-th row of the fault table matrix Resolution figure of merit Threshold for posterior probability of a simple case Sample point denoting all components being fault free Sample point denoting that component pertaining to i-th fault is faulty Stuck at logical value 0 Stuck at logical value 1 Sample space Test number j Random variable corresponding to test t. 3 Threshold for likelihood ratio of a simpel case Threshold for the i-th likelihood ratio Figure of merit Weight of test ti xiii 67 59 97 123 80 65 65 12 12 65 19 66 82 90 123 52 Glossary Weight of test ti after application of j tests A don't care value The j—th input of a switching circuits, or Number of repetitions of test tj Complement of the switching variable "1 Output of a gate Current value of i-th secondary variable of a sequential circuit Output function with fault parameters substituted Next value of i-th secondary variable of a sequential circuit Output function The i-th output of a switching circuit Output function Information gain due to the application of test ti Member of (set membership) Number of elements in 9p or in Fp Number of elements in Fp i xiv 53 32 92 12 38 28 37 59 69 101 123 Glossary Number of 0's in column 1 of the fault table 52 Number of 0's in portion of column 1 that belongs to block k 53 Likelihood ratio of a simple case after applying k tests, none of them failed 79 Likelihood ratio for mi after applying k tests, all detecting f1 and none of them failed 88 Number of elements in Tp 95 Summation 53 Test set 65 Subset of T containing all tests detecting fi 66 Set of tests that cover Fp 102 Random variable set 66 Random variable subset that corres- ponds to Ti 67 T1=OI T2=opooop Tu=o where TP = {T1,T2,...,Tu} 112 Ti=0, Ti=0,... k times for all i (lgiiu) where Tp = {T1,T2,...,Tu} ‘ 113 XV Glossary (9 x S)k {a,b,...} dz 3; State of being fault free State of having i-th intermittent fault State space Set of possible intermittent faults Action space (cartesian product of Q andS) k-dimensional binary vector all of its entries are 0's Cartesian product of (Q x.S) by itself k times A set whose elements are a,b,... Boolean difference of z with respect to x. 
1 Function mapping from the set on the left hand side of the arrow to the right hand side Smallest integer greater than or equal to x xvi 62 63 62 101 67 69 113 20 39 66 121 CHAPTER I INTRODUCTION The last quarter of a century has witnessed major changes in many systems organizations which affect every facet of life. The reliance on digital computers as intergral parts of these systems has introduced irreversible changes. The speed and efficiency of computers in handling massive amounts of information are the main motives for these changes. For these same reasons, huge and complex systems are now designed that were unthinkable before the use of computers, e.g., systems for space flights. These so-called computerized systems span a very wide range of applications, from the simple to the extremely complex, from government agencies to private enterprise, from space missions to patient monitoring. Air line reservations, production line scheduling, highway traffic control, ABM, are just a few examples of these systems. Prior to computerization, systems had manual backups for use in case of failures. With the current fast and sophisticated systems, manual backup is infeasible except for a limited number of cases for a short period of time. Proper operation of computers, being important parts of these systems, must be assured, particularly for real-time applications, where down time for an extended period of time would be very costly, if not disastrous. An auto assembly plant will be shut down and all the labor force sent home for the day, if the production line is down for half an hour. With this type of strigent requirements, it is imperative that we use more reliable computers, even though today's machines are built from components far more reliable than their counterparts that were used a decade or so ago. 1.1 FAULT-TOLERANT COMPUTING Fault-tolerant computing is a field of study whose major concern is the assurance of prOper operation of digital computers. A definition of the term was given by Ramamoorthy [30] as: "Fault-tolerant computing can be defined as the ability to execute specific algorithms correctly regardless of hardware failures or software errors." This definition, even though comprehensive, neglects the importance of time. It is very critical, especially with real-time applications, that algorithms be executed correctly within a tolerable period of time. Clearly, the goal implied in the above definition is far reaching and it is quite a challenge to achieve, even partially, this goal. The difficulty forseen in achieving this goal should not be a discouraging factor in the work toward that end, at least we should try to reduce the penalty we are likely to pay, in case of failures or errors, to a minimum. Even though the roots of this field started with the early days of computers [27, 28, 38], there has been con- siderable recent interest in it. Related research papers appear regularly in the literature. In addition, two recent international symposia* were SOlGlY devoted to this subject. Fault-tolerant computing encompasses theory and techniques for fault and error detection and correction, modeling, simulation, analysis, synthesis and architecture of fault-tolerant systems. Structural reliability is of major importance to fault-tolerant systems. Several approaches are used to ensure structural reliability, some of which are presented in the following section. 
It should be emphasized that the term "reliable" is used here in the generic sense and not in the formal probabilistic sense as in the theory of reliability in engineering parlance. The term "credible" was used in lieu of that by Carroll [5]. *First International Symposium on Fault-Tolerant Computing, March 1971, and Second International Symposium on Fault-Tolerant Computing, June 1972. 1.2 SOME APPROACHES TO ATTAIN STRUCTURELY RELIABLE SYSTEMS Several traditional approaches to the problem of assuring proper operation of digital systems have been studied. The most significant are: (a) Use of Better Components and Better Designs. An integrated circuit, for example, could be made more reliable if it were designed to be more noise discriminant, and to be capable of withstanding greater power supply fluctuations without suffering damage or malfunctioning than is presently possible. This approach appears to be an obvious and straight- forward one, but it is limited by the available technology and by the economics governing the design. Often, the application still requires more reliability than this approach can provide. (b) Redundancy. Through this approach, it is possible, using additional hardware, to build a system that will function properly even after the failure of one or more of its components. The redundancy employed in space flights is a typical example of this approach. A good portion of the early work in fault-tolerant computing concentrated on the study of different redundancy systems. By guadrupling the number of contacts in a relay switching system, Moore and Shannon [28] have shown that it is possible to come up with a system more reliable than the individual relays. This work was later extended to gate-networks by Tryon [37]. Other classic work in this area was due to Von Neumann who introduced the notion of the "Restoring Organ" [38]. This concept was later developed by Lyons [22] in the study of TMR (Triple Modular Redundancy). Redundancy is quite expensive and just postpones the inevitable: given.sufficient time, enough failures will occur and the system will eventually malfunction. It is quite suitable for short-term applications such as space missions where correct operation must be guaranteed for a relatively short period of time and repair is rather difficult or even impossible. (c) Fault Diagnosis Plus Repair. Economic considerations make this approach the most favored one. With this approach, a system is tested to determine whether or not it is functioning as intended (fault detection), and if not, which part caused the trouble (fault diagnosis). Such testing is needed through the life of the system. When it is first installed, an acceptance test is needed. Thereafter, routine testing, e.g., performed by the CE (Customer Engineer), is performed as part of a maintainance program. Studies about automatic testing for the so-called "self-repairing" systems were carried by Avizienis [1, 2] and others. This approach has received a great deal of attention in recent years. It can be safely stated that, so far, it has been the backbone of fault- tolerant computing. A combination of approaches (b) and (c), even though more expensive than the third approach, will lead to ulti- mate reliability. Such a combination is very suitable when the system is used for vital applications where uninter- rupted operation is a must. 
Combining these two approaches must be done with care, since redundancy is incompatible with diagnosis: it will tend to mask the effect of a fault when it occurs, so the fault will go undetected. However, clever use of redundancy (e.g., via additional outputs) can yield greater reliability, if the redundant components are used in such a way that they are not redundant for test purposes. A system that made use of this combined approach is the ESS (Electronic Switching System) of the Bell System, where the failure time is limited to 2 hours in 40 years (acknowledged off the record by the designers).

1.3 SWITCHING CIRCUITS

Most of the work done in the area of fault diagnosis deals with faults in switching circuits. The basic elements of switching circuits are the logical elements. Schematic notations for some of these elements are shown in Figure 1.1. Switching circuits are usually divided into two types: combinational and sequential circuits.

Figure 1.1. Some Logical Elements: NOT, AND, OR, EOR (Exclusive-OR), NAND and NOR gates.

1.3.1 Combinational Circuits

A switching circuit is combinational [24] if its outputs z_i (1 ≤ i ≤ m) can be written as Boolean functions of its inputs x_j (1 ≤ j ≤ n):

    z_i = f_i(x_1, x_2, ..., x_n)                              (1.1)

i.e., each output is a function only of the present values of the inputs. A block diagram for a combinational switching circuit is shown in Figure 1.2. The realization of such a circuit is characterized by the absence of feedback. An example of a combinational circuit is given in Figure 1.3.

Figure 1.2. Block Diagram of a Combinational Circuit.

Figure 1.3. Example of a Combinational Circuit.

1.3.2 Sequential Circuits

In a sequential circuit [24], the output at any time is a function of the current inputs and also of previous inputs. The history of previous inputs is summarized in the state of the circuit. The realization of a sequential circuit is characterized by feedback. The combined values of the feedback lines y_k (1 ≤ k ≤ p) represent the current state of the circuit. A typical block diagram is shown in Figure 1.4.

Figure 1.4. Block Diagram of a Sequential Circuit.

Sequential circuits are classified as "synchronous" or "asynchronous" depending on whether or not they operate under the control of clock pulses (an additional input to the "memory" block of Figure 1.4 that is not shown). Asynchronous sequential circuits are characterized by level inputs and outputs, while the inputs and outputs of synchronous sequential circuits can be either levels or pulses. Two different models for synchronous sequential circuits have been in use, one due to Mealy [25] and the other due to Moore [27]. In both models, the next state is a function of the current input and current state, i.e.,

    Y_k = g_k(x_1, x_2, ..., x_n; y_1, y_2, ..., y_p)          (1.2)

The difference between the two models is in the output function. In the Mealy model it is a function of the current state and current input, i.e.,

    z_i = f_i(x_1, x_2, ..., x_n; y_1, y_2, ..., y_p)          (1.3)

while in the Moore model, the output is a function only of the current state, i.e.,

    z_i = f_i(y_1, y_2, ..., y_p)                              (1.4)

The two models can be shown to be equivalent; see Gill [18].

1.4 LOGICAL FAULTS

A fault is a physical defect that causes the circuit to malfunction. There are numerous factors that could give rise to faults, among them:

(1) Aging and gradual deterioration with time.
This usually would result in what is called marginal faults, e.g., the gap between the ON and OFF thresholds of a transistor gets smaller. (2) Critical timing, noise, close design tolerances and loose wires. These would cause some sort of intermittent faults. Some intermittent faults eventually will become permanent faults. (3) Solid failures such as a permanently open col- lector or base lead of a transistor, a broken wire, or a short circuit between adjacent con- nections. These will result in what is called permanent faults. This thesis deals only with logical faults. These are the faults which affect changes in the logical behavior of the circuit. Failures that cause, say, changes in pulse shapes, but do not alter the logical functions realized by the circuit will not be considered. Also, power supply and clock failures are not considered here. Hereafter, the term fault would mean logical fault, unless otherwise indicated. The following are examples of faults and how they are logically described. 12 Example 1.1 (Faults in an inverter) Consider the inverter circuit shown in Figure 1.5. An open collector failure (lead 1 open) will cause the output y to be at a high voltage regardless of the value of the input signal x. Logically, this can be represented as line y being s-a-l (stuck at logical value 1) if positive logic is assumed. On the other hand, if the failure is Of the form of a short between the collector and the emitter (short between 1 and 2), the output y will always be at ground voltage regardless of the input signal x. This is represented logically as line y being s-a-O (stock at logical value 0). 4. 1 Y X Y 1- 1 Open (y high) y s-a-l l and 2 shorted (y low) y s-a-O Figure 1.5. Faults in an Inverter Circuit. 13 Example 1.2 (Faults in and AND gate) The circuit diagram Of a DRL (Diode Resistor Logic) AND gate is shown in Figure 1.6. If the diode at input x1 is open, the circuit will behave as if the x1 input is not present. This can be described logically as xl being s-a-l. +. I x1 X * x2 *‘ "3 Diode 1 Open xl s-a—l Figure 1.6. Input Diode Failure of an AND gate. Example 1.3 (Bridging Faults) If a short circuit occurs between two lines, say the outputs of two gates, both outputs will take a common signal value. The value of such a signal could be evaluated by detailed circuit analysis. In general, it depends on the type of technology used in the realization. With current technologies, which are mainly TTL (Transistor-Transistor Logic), ANDing 14 (a) Normal Circuit. x2 I B Afic-fi’: I "3 ‘c 1 “‘D— i3 x4 2 (c) Equivalent Circuit if Lines 1 and 3 are Shorted. Figure 1.7. Bridging Faults. 15 the two signals is an acceptable logical description of the fault. Thus, the effect of a short between lines 1 and 2 of the TTL circuit shown by its logical diagram in Figure 1.7(a), on the behavior of the circuit can be analyzed by inserting a virtual AND gate as shown in Figure 1.7(b). The equivalent circuit when a short occurs between lines 1 and 3 is shown in Figure 1.7(c). Notice how this fault has transformed the combinational circuit into a sequential one. The failures indicated in examples 1.1 and 1.2, are represented logically as stuck-at faults. This is typical of a good proportion of known failures. Again, this kind of representation depends on the technology in use. A considerable amount of work dealt only with the stuck-at faults; for example [8, 21]. 1.5 INTERMITTENT FAULTS Intermittent faults in logical circuits are those faults whose effects are not present all the time. 
A circuit having an intermittent fault may show the effect of such a fault when an input test is applied one time, and possibly not show it when the same input test is applied other times. As indicated in the previous section, many factors could result in intermittent faults, e.g., stray capacitances, close design tolerances, fatigue, or irregular physical structure of components [34, 40]. 16 Almost all of the work cited in the literature in the area of modeling, detection and diagnosis of faults in logical circuits deals only with permanent faults [7, 14, 31, 36]. Only one recent paper, Breur [3], deals with intermittent faults. Breur has a first order Markov chain model for the faults, but most of his results are based on the simplified assumption of a zero order Markov chain. His work deals mainly with detecting whether the circuit has a certain intermittent fault or not, i.e., the case where the circuit could have only one possible intermittent failure. 1.6 CONTRIBUTION AND ORGANIZATION OF THE THESIS Intermittent faults constitute a respectable portion of the faults that occur in digital systems. Ad-hoc methods have been used to handle these faults in practice, while formal treatment has been completely ignored despite the need for such a tool. This thesis investigates this problem. It develops a probabilistic model for these faults and defines a criterion for fault detectiOn in combinational circuits. The detection problem is treated as a sequential statistical decision problem using tech- niques similar to those employed in pattern recognition methodology [15]. A method is given for the design of optimum detection experiments, which was found to be equivalent to an integer programming problem. A diagnosis 17 philosophy is then presented leading to a diagnosis methodology. A method for finding suboptimal diagnosis experiments is presented since optimum experiments could only be found through enumeration of an impossibly large class of experiments. Chapter I presents some background material and a general discussion of fault-tolerant computing. In Chapter II, an overview of diagnosis of permanent faults is presented. Chapter III presents a model for intermittent faults and discusses the problem of their detection in combinational circuits. Diagnosis of intermittent faults is dealt with in Chapter IV, while Chapter V summarizes the thesis and recommends problems for further research. CHAPTER II DIAGNOSIS OF PERMANENT FAULTS IN COMBINATIONAL CIRCUITS It was indicated in Chapter I that one of the most common approaches to ensure proper operation of digital computers is fault diagnosis plus repair. Fault diagnosis deals not only with detecting faults when they occur, but also with pinpointing the locations of failures to enable repair. These diagnostic tasks are accomplished by testing the system at hand. Testing, in this sense, means applying inputs and observing the corresponding outputs. In this chapter, several test generation methods are surveyed. The total number of tests generated for a large circuit could be enormous; hence it is often desirable to select a minimal or a near-minimal subset of these tests that is sufficient for detection or diagnosis. Schemes for selection of such optimal test subsets are explored later in the chapter. 18 19 2.1 TESTS A test for a combinational circuit is an input vector together with the Observed output. 
A test tj is said to detect fault fi if, upon the application of tj, the output vectors are different when the circuit is fault- free and when it has fault fi' For example, consider the exclusive-OR circuit shown in Figure 2.1. Let f1 be the fault "x1 s-a-O," and f2 be the fault "y s-a-l." Test t1, denoting the input vector (0,1) (i.e., x1 = 0 and x2 = l), detects neither f1 nor f2 since upon the application of t1 the output is 1 when the circuit is fault-free, when it has f1, and when it has f On the 2. other hand, tests t2 (input vector (1,0) ) and t3 (input vector (1,1) ) detect f1 since they result in x2 f1 : "x1 s-a-O" , detected by (1,0) and (1,1). f2 : "y s-a-l" , detected by (0,0) and (1,1). Figure 2.1. Example of Faults in an Exclusive-OR Circuit. 20 outputs of 0 and 1 respectively, if the circuit has fl, and in outputs of 1 and 0 respectively if the circuit is fault-free. Similarly, tests t4 (input vector (0,0) ) and t3 detect. £2. The following remarks are now clear: (1) In general, a test detects more than one fault. For example, test t3 above. (a) A fault can generally be detected by more than one test. For example, fault f1 above. If fl and f2 are the only possible faults that could occur in the above circuit, then t3 is sufficient to determine whether the circuit is faulty or not. If t3 fails, i.e., produces an output that is different from that of a fault-free circuit, then the circuit is faulty. Actually this is true if any test fails. If t3 does not fail, then the circuit must be fault-free. For that reason {t3} is called a detection set for this circuit. Generally, a detection set would contain more than one test. {t2,t4} is also a detection set, while {t2} is not. It is clear that {t3} is an optimal detection set (since it has only one test) unless test costs are considered. Detection tests tell us whether a system is faulty or fault-free, but, in general, do not completely identify the failure. Diagnosis tests will isolate the faults to a specific component or a group of components depending on both the diagnosis resolution needed and on the technology in use. For example, if the system is built with discrete 21 component technology, it might be necessary to pinpoint the faults down to the gate level. On the other hand, with LSI (Large Scale Integration) technology, fault identification down to the module level would be sufficient. It is possible that a failure occurs, but no test will detect it; i.e., the effect Of the fault does not change the function realized by the circuit. This is due to some sort of redundancy in the circuit. Such redundancy is not necessarily of the copious type discussed in Chapter I. For example, the circuit of Figure 2.2 exhibits some redundancy for test purposes; faults "3 s-a-O" and "4 s-a-O" will go undetected. xx L» N cl tn 5! I“, 60‘ IF *< Figure 2.2. Redundancy and Fault Masking. Line "3 s-a-O" or line "4 s—a-O" will go undetected. 22 2.2 TEST GENERATION In this section we discuss several methods to generate tests that would detect a given fault. In the methods presented, the circuit is assumed to be irredundant, so that redundancy would not mask the effect of the fault. Friedman [l3] pointed out some of the difficulties that may arise when masked faults in redundant circuits interact with otherwise detectable faults. Also, it will be assumed that only one fault can be present at a time (single fault assumption). 
This assumption is quite reasonable if testing is done routinely as part of a maintainance program: then, the probability Of having two faults is negligible [7]. However, this would not be a reasonable assumption for an acceptance test of a new installation where whole sections of the machine may be constructed incorrectly. 2.2.1 Truth Table Method This is the most Obvious and straightforward method for test generation. The truth tables fOr the normal (fault—free) circuit and for the faulty circuit are com- pared. An input combination is a test which detects the fault under consideration if it results in two different output vectors for the two circuits. A different truth table is constructed for every fault considered. 23 Example 2.1 Consider the circuit shown in Figure 2.3 and let the line a be "s-a-l." The truth tables of the output functions of the normal and faulty circuits are shown in Table 2.1. From Table 2.1, it is clear that three tests will detect "a s-a-l." For these tests, the input vector (xl,x2,x3,x4) will take the values (0,0,0,0), (0,0,1,0) and (0,1,0,0). Alternatively, these test could be represented by the min-terms: §1§2‘3‘4, xlx2x3x4 and xlx2 3 4. Figure 2.3. Truth Table Test Generation for "a s-a-l". z’ is the output when "a s-a-l". 24 x1 x2 x3 x4 2 z’ * 0 0 0 0 0 1 0 0 0 1 0 0 * 0 0 l 0 0 l 0 0 l l l 1 * 0 1 0 0 0 l 0 1 0 l 0 0 0 l l 0 0 0 0 l l l‘ 0 0 l 0 0 0 l l l O 0 1 0 0 1 0 l 0 l l 1 0 l l l l l l 0 0 1 l 1 1 0 l l l l l l 0 O 0 1 l l 1 l l Table 2.1. Truth Table for Normal and Faulty Circuits when "a s-a-l." (0,0,0,0), (0,0,1,0) and (0,1,0,0) detect this fault. This method is effective for small circuits. If the circuit has n inputs, the number of computations needed is proportional to 2n for every fault, which makes it prohivitive to use this method for even the moderate size circuits. 25 2.2.2 Path Sensitizing Many investigators have worked on some form or the other of this method. The name of Armstrong is usually linked with it even though he did not publish his work that is related to it. This method attempts to generate tests faster and using less memory relative to the exponential requirements of the previous method. Path sensitizing deals with stuck-at faults in circuits consisting only of NOT, AND, NAND, OR and NOR gates. The idea is to propagate a change in signal value on a faulty line in the circuit to an output terminal. A path is chosen from that location to an output, and the inputs to the gates along this path are adjusted, depending on the type of the gate, so that the gate output is sensitive only to that input that is part of the path. For AND and NAND gates, all inputs except the changing one should be 1. For OR and NOR gates, these inputs should be 0. For example, in Figures 2.4(a) and 2.4(b), if x2 and x3 are assigned 1, the output of the AND gate will be x1, i.e., sensitive only to x1, and the output of the NAND gate will be i1, i.e., sensitive only to x1. Similarly in Figures 2.4(c) and 2.4(d), if x1 and x2 are assigned 0, the output of the OR gate will be x3, and of the NOR gate will be i3, i.e., sensitive only to x3. 26 (a) AND gate. = x3 = 1 implies y = x1. Y (c) OR gate. = x2 = 0 implies y = x3. Figure 2.4. (b) NAND gate. x2 = x3 = 1 implies y = x1. x1 X2 y x3 (d) NOR gate. x = x2 = 0 implies y = i3. Adjusting Gate Inputs to Make Output Sensitive Only to a Single Input. The general procedure can (1) A failure at a point location is assigned the fault condition. be summarized as follows: is assumed. 
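The side-input rule described above for Figure 2.4 (1 on the remaining inputs of AND and NAND gates, 0 on the remaining inputs of OR and NOR gates) is mechanical enough to be written down directly. The short Python sketch below is an illustration of that forward-trace bookkeeping only, not a reconstruction of any program in the thesis; the function and signal names are invented for the example. For a single path it records the side-input values each gate needs and whether the propagated change arrives at the output complemented.

```python
# Non-controlling ("enabling") value a side input must take so that the
# gate output follows the input on the sensitized path (cf. Figure 2.4).
NONCONTROLLING = {"AND": 1, "NAND": 1, "OR": 0, "NOR": 0}

# Whether the gate complements the signal it propagates.
INVERTING = {"AND": False, "NAND": True, "OR": False, "NOR": True}

def sensitize_path(path):
    """Forward trace along one path, given as a list of
    (gate_type, side_input_names) pairs from the fault site to the output.
    Returns the required side-input assignments and a flag telling whether
    the output is the complement of the signal at the fault site."""
    assignments = {}
    complemented = False
    for gate_type, side_inputs in path:
        for line in side_inputs:
            assignments[line] = NONCONTROLLING[gate_type]
        if INVERTING[gate_type]:
            complemented = not complemented
    return assignments, complemented

# Hypothetical path: an AND gate with side inputs b and c feeding a NOR
# gate with side input d.
print(sensitize_path([("AND", ["b", "c"]), ("NOR", ["d"])]))
# ({'b': 1, 'c': 1, 'd': 0}, True)
```

The backward-trace phase, which chooses primary-input values that actually produce these side-input signals, is where the arbitrary choices and possible contradictions of the general procedure arise.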
The faulty a value opposite to that of is That is, a value of l 27 assigned to a line with "s-a-O" fault and vice versa for a "s-a-l" fault. (2) A path is chosen from the fault location to an output terminal. The inputs to the gates along this path are assigned values so as to propagate any change on the faulty line, along the chosen path, to the output terminal. This path is said to be sensitized. This phase of the procedure is called the forward-trace phase. (3) An input vector (test) is determined by tracing back from the inputs Of the gates, along the path, to the inputs of the circuit, and assigning input values to obtain the desired signals for these gates. An arbitrary choice is made when differ- ent possibilities exist. This portion of the procedure is called the backward-trace phase. It could result in more than one input vector or even in a contradiction. In case of a contra- diction, the process is repeated with a different choice for the signals assigned arbitrarily, if such a choice exists, otherwise a different path should be chosen. An example follows to illustrate this method. 28 Example 2.2 Consider the circuit of Figure 2.5. In the following discussion, a gate label may indicate its output to simplify notation. Figure 2.5. Path Sensitizing. (1) "a s-a—l" : sensitize path DH; detect it. (2) "B s-a-O" : sensitize path EH; sensitize path FH, however, (0,0,0,0) (1,0,1,1) and (1,0,0,0) contradiction: contradiction: detects it. (1) (2) 29 To generate tests that detect the fault "a s-a-l": Sensitize path DH. Assign 0 to a. Assign 0 to the other input of gate D, i.e., x2 = 0. Assign 0 to outputs of gates E, F and G. This completes the forward-trace phase. a = 0 implies x1 1 or x3 = 1, say x1 = l. G = 0 implies x3 = l or C = l (i.e., x2 = 0 and x4 = 0), say x3 = 1. F = 0 implies x4 = l or B = l (i.e., x2 = O and x3 = 0), say x4 = 1. E = 0 implies x1 = 1 or B = 1, already satis- fied by a = 0. This completes the backward-trace phase. Thus we see that (l,0,l,1) is a test that detects "a s-a-l." Have we selected another signal choice, we would have obtained (l,0,0,0) as another possible test. To generate tests that detect "8 s-a-O": Sensitize path EH. = 0 and Assign 1 to B, i.e., = 0. x2 x3 Assign 0 to the other input of gate E, i.e., x1 = 0. Assign O to outputs of gates D, F and G. G = 0 implies x3 = 1 (not possible), or C = 1 (i.e., x2 = 0 and x4 = 0). 30 F = 0 implies x4 = 1. This is a contradiction since x4 is required to be 0 and 1 at the same time. Sensitize path FH. Due to the symmetry Of the circuit, we will end up with a similar contradiction. However, the input vector (0,0,0,0) detects this fault since it results in a l-output for the normal circuit and in a O-output for the faulty circuit. This example is due to Schneider [32] to show that this method is not algorithmic, i.e., even though a test exists, this method did not generate it. The main flaw is that only one path is allowed to be sensitized at one time. For this reason, this method is sometimes called "one- dimensional path sensitizing." The key to an algorithmic method is to simultaneously sensitize all possible paths from the site Of the fault to an output. This will be necessary if the circuit has reconverqent fan-out at the site Of the failure, i.e., there are two or more paths that fan-out from the fault location then subsequently reconverge as inputs to the same gate, e.g., paths EH and PE in the above example. 
31 2.2.3 The D-Algorithm This method was developed by Roth [31] to over- come the limitations of the path sensitizing method. This method is applicable to a wider class of faults than stuck-at type faults. Also, its use is not restricted to circuits constructed only of NOT, AND, NAND, OR and NOR gates. Most important, this method is algorithmic due to its ability to simultaneously sensitize all possible paths from the site of the fault to a circuit output. This method is sometimes referred to as two-dimensional path sensitizing. Only an overview of the algorithm is presented here. Details are found in Roth's paper. Roth's formulation is in terms Of the D-Calculus; a calculus for cubical complexes. In what follows, the symbol D represents a signal that is 1 in the normal circuit and 0 in the faulty circuit. The symbol D represents a signal that is normally 0, but becomes 1 when a fault is present. The definitions of D and 5 could be interchanged as long as they are consistent throughout the circuit. The elements of the D-calculus are: (a) Singular Cover. The singular cover of a gate (or a block) can be considered as a concise form Of its truth table. It is used to obtain the other elements of the D— calculus. 32 For example, Figure 2.6 shows a three-input OR gate together with its singular cover. An "x" denotes a "don't care” value. The cube lxxl means that the output y will take a l-value if x1 takes the value 1 regardless of the values of x2 and x3. Notice that no "x" appears in the output coordinate of the cubes of the singular cover. For details about how to obtain these cubes see [7, 14, 31]. (b) Primitive D-Cubes of a Fault. The primitive D-cubes of a fault define the in- puts tO a gate (block) which cause the output of the gate (block) to be different from its normal value if a given fault is present in the gate (block). These cubes are obtained by intersecting the singular covers of the normal and faulty gates (blocks). For cube x1 x2 x3 y l x x 1 y x l x 1 x x 1 l 0 0 0 0 Figure 2.6. OR Gate and Its Singular Cover. "x" denotes a don't care. 33 intersection rules see [7, 14, 31]. For example consider the two-input AND gate of Figure 2.7. Sup- pose this gate were realized by a threshold element and that the actual threshold, due to some malfunction, has dropped below the proper value; so the gate behaves as an OR gate. The singular covers for the normal and faulty gates are shown. Intersecting these two sets Of cubes we Obtain 010 and 105 as the primitive D-cubes of the fault. This means that to test for this fault, apply 0(1) on x1 and 1(0) on x2, if the output is 0, the circuit is normal; if xl-———- Y x2-——-1 X1 X2 y X1 X2 y 1 l 1 l x l 0 l x 0 0 0 0 Singular cover of Singular cover of normal circuit. faulty circuit. Figure 2.7. AND gate Behaving as an OR gate When Faulty. 015 and 105 are the primitive D-cubes of this fault. 34 it is 1, the circuit is faulty. Primitive D-cubes Of a fault can also be obtained for blocks with more than one output. Notice that D and 5 appear only in the output coordinate(s). (c) Propagation D-Cubes for Input changes(s). These cubes define the inputs that cause the output of a gate (block) to be sensitive only to one or more of its specified inputs, thus propagating a fault on these inputs to the output. If the output is to be sensitive to more than one input, then, of course, these inputs must be related, e.g., have identical signals or one is the complement of another. This allows simultaneous multiple path sensitizing. 
These cubes are obtained from the singular cover of the gate (block). Some of the coordinates Of the singular cover are complemented. The newly obtained cubes are intersected with the singular cover to Obtain the propagation D-cubes. For example, consider the three-input NAND gate of Figure 2.8. The propa- gation D-cube for a change in x1 is D115, i.e., to make the output sensitive to x1, apply 1 to both x2 and x3, then the output will be the complement of the signal on x1. To use the D-algorithm for generating tests, the singular covers and all the propagation D-cubes for single input change for all the gates (blocks) of the circuit are 35 x1 x2 x3 Y 0 x x l x 0 x l x x O 1 l 1 l 0 Singular Cover. Figure 2.8. 0115 is a Propagation D-cube of a NAND gate for a Change in x1. obtained. Only single-input propagation D-cubes are com- puted initially. Propagation D-cubes for multiple input changes are computed as necessary. They will be necessary when reconvergent fan-out paths are to be sensitized. A D-cube which represents partial signal values on the lines of the circuit is called a test cube. The procedure for deriving a test for a given fault consists of two parts: the D-drive, which is analogous to the forward-trace phase of the path sensitizing method, and the consistency operation, which is analogous to the backward-trace phase of the path sensitizing method. In the D-drive, one Of the primitive D-cubes Of the fault under consideration is chosen as the initial test cube. It is then intersected with one of the propagation 36 D-cubes. An activity vector is kept to help determine the next propagation D-cube to intersect with. The activity vector and the test cube are updated after every inter- section. The process is repeated until at least one output coordinate of the circuit is Obtained in the test cube. It is possible that intersections involving single-input propagation D-cubes may terminate prematurely before reaching an output terminal if the circuit has reconvergent fan-out. In this case, suitable propagation D-cubes for multiple input changes are computed and intersection proceeded. This is the case when more than one path needs to be sensitized simultaneously. After completion of the D-drive, the consistency operation is begun. The test cube is successively inter- sected with the cubes of the singular cover until enough circuit inputs have been assigned to generate the signal values specified by the test cube. It is possible that some intersections will be empty, in this case it is necessary to return to the D-drive phase and Obtain a new D-chain before the consistency Operation can be successfully completed. 2.2.4 Algebraic Methods The basic ideas Of three algebraic methods for test generation are presented. Even though some of these methods are mathematically neat, they are only suitable 37 for small circuits due to the large computation and memory requirements for larger circuits. These methods are: (a) Method of Complements. The output function 2 is computed for the normal circuit as is its complement, E. The corre- sponding output 2’ and its complement 2’ are computed for the faulty circuit. The above functions are Obtained in normal form expressions. The Boolean product of z and 2’ and of E and z’ are computed. The Boolean sum of these products represent the tests that detect the fault under consideration. This method is somewhat better than the truth table method since it deals with terms of the normal form rather than with all the minterms (rows of the truth tables). 
However, we need to store all the terms of these functions, which may very well exceed the available memory. It is estimated that it would take 10^9 reels of tape to store the terms of the minimal normal form of the parity check circuit for a 60-bit-per-word computer, even though the circuit would contain only about 63 logical blocks.

(b) Poage's Method. Poage [29] has developed a complete and thorough method for generating tests to detect all possible stuck-at faults in combinational circuits. His work is applicable to single faults as well as multiple faults. The disadvantage of the method is that it is practical only for relatively small circuits. In order to introduce the effect of a fault on the function realized by a circuit, he uses a kind of ternary algebra. For every line i in the circuit, three Boolean variables i_0, i_1 and i_n, called fault parameters, are defined as follows:

    i_0 = 1 iff line i is "s-a-0",
    i_1 = 1 iff line i is "s-a-1",
    i_n = 1 iff line i is normal.

Only one of these parameters is equal to 1 while the other two are 0's. If the signal on line i is supposed to be y, it is replaced in the analysis by the literal y* defined as:

    y* = y·i_n + i_1                                           (2.1)

The complement ȳ* is defined as

    ȳ* = ȳ·i_n + i_0                                           (2.2)

These substitutions are successively carried out until expressions z* and z̄* for the output and its complement are obtained. Substitution for the fault parameters is then made to insert the fault condition. Expressions for the faulty function and its complement are then obtained. The method of complements, described above, is then used to generate the tests.

(c) Boolean Differences. Sellers et al. [35] have developed the Boolean difference method to generate tests that detect stuck-at faults from the Boolean forms representing the output functions. The Boolean difference of an output z(x_1,x_2,...,x_n) with respect to one of its variables x_i is defined as follows:

    dz/dx_i = z(x_1,...,x_{i-1},0,x_{i+1},...,x_n) ⊕ z(x_1,...,x_{i-1},1,x_{i+1},...,x_n)        (2.3)

In general, dz/dx_i will be a function of some or all of the x_j's, j ≠ i. If dz/dx_i = 1, then any change in x_i will result in a change in z regardless of the other signals x_j, j ≠ i. If dz/dx_i = 0, then z is independent of x_i. For single output circuits, the tests that detect "x_i s-a-0" are represented by x_i·(dz/dx_i), while the tests detecting "x_i s-a-1" are represented by x̄_i·(dz/dx_i).

2.3 FAULT TABLE

In the previous section, several methods for generating tests that detect a particular fault were presented. A set of tests detecting all faults that are likely to occur is generated. Let the set of faults of interest, {f_1,f_2,...,f_n}, contain n elements, and let the set of tests generated to detect these faults, {t_1,t_2,...,t_m}, have m elements. As noted earlier, some of these tests detect more than one fault. An n x m fault table can be constructed from these two sets. The columns correspond to the tests and the rows correspond to the faults. The entries are zeros and ones. This table can be denoted by an n x m fault table matrix A.

Definition 2.1
The n x m fault table matrix A = (a_ij) is defined as:

    a_ij = 1 if test t_j detects fault f_i,
           0 if test t_j does not detect fault f_i.            (2.4)

For completeness, the set of faults may contain an additional element f_0 corresponding to the fault-free circuit condition, since it represents a circuit condition to be distinguished from the rest of the faults. A complete fault table is defined to be a fault table with an additional row corresponding to f_0 appended to it. All entries
All entries 41 Of the row corresponding to fo in the complete table will be zeros. The m-dimensional binary vector corresponding to fi (0|: 1‘: n) will be denoted by f1. It is possible to save some of the effort used to generate tests if some relations among faults are known. If a class of faults is known to have indentical test sets, it is sufficient to generate only a test set for one of the faults in that class. Such faults form an equivalence class that has indentical rows in the fault table. This technique is called fault collapsing and is due to Schertz and Metze [33]. For example, "s-a-O" faults at an input Of an AND gate and at the output of the same gate have indentical detection test sets. Practically, it does not matter which fault of these occur, since the gate has to be replaced anyway. Similarly, if all tests that detect a particular fault fi also detect another fault fj, then it is not necessary to generate tests to detect fj' In _). this case, for every 1-entry in fi, there is a corre- + + sponding l-entry in fj, this is referred to as row f. 3 dominates row fi. For example, any test that detects an input "s-a-l" for an AND gate also detects the output "s-a-l" for that gate. If two or more tests in the fault table have identical columns, all but one of these columns can be removed since they correspond to redundant tests. Similarly a column whose entries are all zeros can be eliminated since no information will be gained when the corresponding test is 42 applied. A fault table (complete fault table) is said to be a reduced fault table (reduced complete fault table) if it contains no zero or redundant columns. The following are some properties of fault tables. Theorem 2.1 An upper bound on the number of tests in a reduced fault table with n faults is given by: 1:152“ - 1 (2.5) 252:. Every column will have n entries. There are at most 2n different binary vectors of n co- ordinates each. One Of these vectors is all zeros. Thus, there are at most 2n - 1 non-zero different possible columns. Theorem 2.2 Maximum diagnostic resolution (i.e., every fault is diagnosable) is possible if and only if no two rows in the complete fault table are identical. 2.9.9.: + + If two rows, say fi and fj (i¢j), are identical, then it is impossible to distinguish between f1 and f. since the outcome of any test 3 applied when either fault exists will be the same. 43 If no two rows are identical, the following method constructs a diagnosis procedure with maximum diagnostic resolution. At any stage in the procedure apply the test with the lowest index which does not result in the same outcome for all faults which are still possibilities for the unknown circuit condition. As long as there are at least two remaining possible faults, such a test must exist, because for each pair of faults there is at least one test which distinguishes them. The procedure always terminates in fault identification since the possibilities of unknown faults are reduced at each stage. This is an existence proof. It does not necessarily mean that this is the only procedure with maximum diagnostic resolution. Q.E.D. Theorem 2.3 For maximum diagnostic resolution, a lower bound on the number of tests m that a fault table with n faults should have is given by: m 3 log2 (n + l) (2.6) Proof Suppose m < log2 (n + 1). With m tests, there can be at most 21“ distinct rows in the com- plete fault table. 
However, we have 44 log (n + 1) 2m < 2 2 = n + l, i.e., 2‘“ua c>la F410 0 PJEH F'FJ o c: 0 F410 :5 o'ca o H PJIH F'rd o c: 0 viii P‘CD o :5 Table 2.2. Example of a Complete Fault Table. 45 {t2,t4,t5}, {t1,t2,t4,t5}, {t2,t3,t4,t5} and {t1,t4} respectively. The corresponding complete fault table is given in Table 2.2. 2.4 EXPERIMENTS AND THE DIAGNOSIS TREE The process of applying tests and drawing con- clusions from the observed outputs is called an experiment. Thus, a detection experiment is a sequence of tests to be applied in order to determine whether the circuit is fault free or not. On the other hand, a diagnostic experiment is a sequence of tests whose outcome is used to decide which fault, or class of faults (depending on the diag- nostic resolution required), is present in the circuit. The tests used for either the detection experiment or the diagnostic experiment are selected from the set Of tests generated, using any of the methods of Section 2.2, to detect the set of faults of interest. The application of all generated tests is sufficient for either detection or diagnosis. However, the tests generated are usually more than necessary for either experiment. 3 Experiments are classified into two types: (a) Preset Experiments, where tests to be applied are completely determined in advance. (b) Adaptive Experiments, where the test to be applied at a given stage depends on the outcome of the previous test. 46 In a preset experiment, the order of test application is immaterial. The experiment length (number of tests to be applied) is the same regardless of the existing fault condition. In an adaptive experiment, the order of test application is essential. The experiment length generally varies depending on the existing fault condition. Adaptive experiments are more efficient since they tend to be shorter in average length. A schematic representation Of the two types of experiments is shown in Figure 2.9. An experiment can be represented by a binary tree structure. This is due to the fact that every test ti Test t - z Generator‘ Ci cuit (a) Perset Experiment. Test , t1 Circuit 2 Generator It» v ’ I 4* (b) Adaptive Experiment. Figure 2.9. Experiment Types. 47 partitions a set of fault conditions (possibly including f0) into two classes: those that are detected by ti (those faults with a l in column i in the complete fault table), and those that are not (faults with a 0 in the column 1). The nodes Of the tree are classes of faults representing the diagnosis resolution Obtained thus far in the experiment. The root corresponds to the class of all faults including £0. The root is defined to have 12331 0. The edges out of a node correspond to the two possible outcomes of the test applied at that stage. The level of a node, say node a, is one larger than the level of the node having an edge directed to node a. A tree corresponding to a preset experiment will have the same test applied at all nodes of the same level. This is not the case in a tree corresponding to an adaptive experiment. For maximum diagnostic resolution, the leaves of the tree should correspond to classes of single faults. Notice that for permanent faults, any test need only be applied once in either type of experiment. Example 2.4 Consider the complete fault table given in Table 2.2. A preset diagnosis tree is shown in Figure 2.10. Notice that at every level the same test is applied. The symbol ¢ denotes the empty set. Whenever a class of a single fault is reached, no further edges are shown. 
Figure 2.11 shows an 48 (:{fo,fl,f2,f3,f4,f5,f6} ) Level 0 tl/O t /l a {f3,f5} m” {f2,f4} Level 2 t3/O t3 t3/0 t 3/1 t /o /1 3 3 Q @0 {£1,136} 10 {f2.f4} oLevel 3 tM t/4 t/4 /° t4/1 @ Q 0 {fz'f4} Level 4 ts/O t5/1 O a Figure 2.10. Diagnosis Tree of a Preset Experiment. (Experiment length = 5) (: {fo,fl,f2,f3,f4,f5,f6}:) Level 0 Level 1 Figure 2.11. Diagnosis Tree of an Adaptive Experiment (Maximum Experiment Length = 3) 49 adaptive diagnosis tree. In this case, the experiment has three tests at most. 2.5 MINIMIZATION OF PRESET EXPERIMENTS A minimum preset experiment is a preset experi- ment with the least number of tests. Four approaches for selecting minimal test sets will be discussed. Exhaustive enumeration and the prime implicant method are two approaches that lead to a true minimal test sets. However, they are lengthy for large circuits. The method of test intersection and the method Of distinguishibility criteria produce suboptimal solutions that are not necessarily minimal, but Often close to minimal. 2.5.1 Exhaustive Enumeration Exhaustive enumeration can be accomplished by ordering all possible tests subsets and selecting the smallest subset which is sufficient for detection or diagnosis. Obviously, this method is impossibly lengthy for even small circuits. 2.5.2 The Prime Implicant Method This method makes use of the similarity between the problem of finding a minimal detection set and the problem of minimal cover is switching theory. Tests are analogous to prime implicants, while faults are analogous 50 to the minterms to be covered. Several solutions have been proposed for Obtaining minimal covers, most notably, the Quine~McC1uskey algorithm [23], also linear and integer programming solutions have been suggested [9, 10]. Any such solution can be directly applied to minimizing detection experiments. This method can be extended to handle diagnosis experiments. The extension is due to Poage [29]. A difference table is constructed from the original fault table. It has all of the rows of the original fault table, plus a new row for each pair of + faults. An entry Of a new row formed from f1 and + f. (i#j) is the exclusive-OR of the corresponding entries 3 in f1 and Ej' The l-entries of this formed row denote the tests that distinguish between f1 and fj‘ The 1- entries of an original row can also be thought of as denoting tests that distinguish between two faults, one of them being f0. This approach is elegant and guarantees a minimal experiment, but it is impractical for moderate or large size circuits since the solution to a covering problem for that many rows in the difference table is too long. 2.5.3 The Test Set Intersection Method This method produces near minimal test sets. It is due to Galey, Norby and Roth [16]. It is actually a method for reducing the fault table. After table reduction, any other method of test selection is used. f1 and f2 51 are intersected by ANDing the corresponding entries. If the result is a non-zero vector, the two rows are replaced by the intersection. A l-entry in the intersection represents a test that detects both f1 and £2. The process is repeated by intersecting the resultant vector with E3 and so on. If the intersection of f1 and f2 is the zero vector, we intersect f1 and f3 and proceed. If this intersection also happens to be zero, we intersect f2 and E3. The process is carried on until all rows are considered. The result is a reduced fault table. 
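As a rough illustration of the row-intersection reduction just described, the following sketch processes the rows one at a time, folding each new row into the first group whose running intersection stays non-zero and starting a new group otherwise. The dictionary representation, the function name, and the small demonstration table are illustrative assumptions, not part of the original method.

```python
def intersect_rows(fault_table):
    """Greedy row intersection for reducing a fault table.

    fault_table maps a fault name to a tuple of 0/1 entries, one per test.
    Returns a reduced table mapping a group of merged fault names to the
    ANDed row, i.e. the tests that detect every fault in that group.
    """
    groups = []                              # list of (fault_names, intersection)
    for fault, row in fault_table.items():
        for k, (names, inter) in enumerate(groups):
            merged = tuple(a & b for a, b in zip(inter, row))
            if any(merged):                  # non-zero: fold this row into the group
                groups[k] = (names + (fault,), merged)
                break
        else:                                # zero with every group: start a new one
            groups.append(((fault,), row))
    return dict(groups)


# A small made-up 4-fault, 4-test table, purely for illustration:
demo = {"f1": (1, 0, 1, 0), "f2": (1, 1, 0, 0),
        "f3": (0, 1, 0, 1), "f4": (0, 0, 0, 1)}
print(intersect_rows(demo))
# {('f1', 'f2'): (1, 0, 0, 0), ('f3', 'f4'): (0, 0, 0, 1)}
```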
We then apply any test selection method to this table, such as the prime implicant method, or selecting one test corresponding at a l in every row. The test set obtained will be a suboptimal detection test set. Examp1e 2.5 Consider the fault table given in Table 2.2. If + + + + + we intersect f1, f2, f3, f4, f we obtain the reduced table shown in Table 2.3. + 5 and f6 respectively, Selecting tests that cover every row in the reduced table, we obtain {t1, t4} as the detection set, which happens to be a true minimum in this example. It is to be noted that the outcome of this method depends on the order of row intersection. If we start with the difference table than do the intersection, we can Obtain a near-minimal diagnosis test set. 52 t1 t2 t3 t4 t5 (£1 £2) 1 o o o 0 (£3, £4, f5, f6) 0 o Table 2.3. Reduced Fault Table for Example 2.3. 2.5.4 Method of Distinguishabilipy Criteria This method, due to Chang [6], selects a near- minimal diagnosis test set. The basic idea is to assign weights to tests. The weight reflects the test's ability to distinguish faults. Tests are systematically selected on the basis of their weights. The weight Wi of test ti is defined to be the number of pairs of fault conditions (including f0) which it distinguishes, i.e., w. = e. z. (2.7) where, Bi and Zi are the number of 0's and the number of 1's in the i-th column of the complete fault table. We select the test for which W1 is greatest as the first test. When j tests have been chosen, the faults would have been partitioned into b. 3 depending on the possible outcomes of these j tests. To (bj :_23) disjount blocks select the j+1-st test of this procedure, the weights of the remaining tests are computed. At this stage, the weight of a test is the sum (over the bj blocks) of the number of pairs of faults that it can distinguish within each 53 block. Let be the number of 0'5 (1's) in ei,k(zi,k) that portion of the i—th column of the complete fault table that correspond to block k. Thus, the weight W. j of test 1 after j tests. 1: b. J Wi,j = :2 91,k 11.x (2.3) k=l The test for which W. . is maximized is chosen as the 1,3 j+1-st member of the test set. The selection of tests is continued until the partition of faults can be refined no further: that is, until the weights of the unselected tests are all zeros. Example 2.6 Consider the complete fault table given in Table 2.4. After calculating the initial weights, we select t1 (t4 or t5 will do) and rearrange the complete fault table to form Table 2.5. Select t4 since it has the largest weight at this stage, then obtain the rearrangement shown in Table 2.6. Test t5 has the highest weight, so it is selected. The next re- arrangement is shown in Table 2.7. Select t2. The fourth rearrangement is shown in Table 2.8. Here, every test has a zero weight, so the process termi- nates. The test set {t1, t2, t4, t5} is the near- minimal diagnosis test set obtained. It happened 54 t1 t2 t3 t4 t5 t6 t7 f0 0 o o o o o 0 £1 0 o o 1 1 o 0 f2 0 o o o 1 o 0 f3 1 o 1 o o o 0 f4 1 1 1 o o 1 0 f5 1 o o 1 1 o 0 f6 1 o o o 1 o 0 f7 0 o o 1 o 1 1 f8 1 o 1 1 o o 0 weight 8 18 20 20 14 8 Table 2.4. Complete Fault Table and Initial Weights. that this is a true minimal diagnosis test set for this problem. It should be emphasized that in computing the test weights we should deal with the complete fault table and not the fault table since fo corresponds to a circuit condition to be distinguished from the other fault con— ditions. Chang, in his paper, did not include the ED row in his analysis. 
As a result, the test set he obtained in his example was sufficient to distinguish among the faults, given that a fault is known to have occurred, but it was not sufficient as a detection set.

        t1 | t2  t3  t4  t5  t6  t7
f0       0 |  0   0   0   0   0   0
f1       0 |  0   0   1   1   0   0
f2       0 |  0   0   0   1   0   0
f7       0 |  0   0   1   0   1   1
f3       1 |  0   1   0   0   0   0
f4       1 |  1   1   0   0   1   0
f5       1 |  0   0   1   1   0   0
f6       1 |  0   0   0   1   0   0
f8       1 |  0   1   1   0   0   0
weight     |  4   6  10  10   7   3

Table 2.5. First Rearrangement of Complete Fault Table.

This method can be extended to allow for different degrees of diagnostic resolution, i.e., it can be used to point to a faulty block if any fault occurs in that block, without pinpointing the actual fault.

2.6 MINIMIZATION OF ADAPTIVE EXPERIMENTS

In general, the length of an adaptive experiment varies depending on the existing fault condition. Thus, the meaning of the term "minimal" in a "minimal adaptive experiment" should be different from that used for preset experiments. An adaptive experiment is minimal if the expected experiment length is minimal. If prior probabilities p_i (0 ≤ i ≤ n) for the different fault conditions are known, and if ℓ_i (0 ≤ i ≤ n) is the experiment length if fault condition f_i exists, then the expected experiment length ℓ̄ is:

    ℓ̄ = Σ_{i=0}^{n} p_i ℓ_i                                        (2.9)

        t1  t4 | t2  t3  t5  t6  t7
f0       0   0 |  0   0   0   0   0
f2       0   0 |  0   0   1   0   0
f1       0   1 |  0   0   1   0   0
f7       0   1 |  0   0   0   1   1
f3       1   0 |  0   1   0   0   0
f4       1   0 |  1   1   0   1   0
f6       1   0 |  0   0   1   0   0
f5       1   1 |  0   0   1   0   0
f8       1   1 |  0   1   0   0   0
weight         |  2   3   5   3   1

Table 2.6. Second Rearrangement of Complete Fault Table.

Table 2.7. Third Rearrangement of Complete Fault Table.

Table 2.8. Fourth Rearrangement of Complete Fault Table.

ℓ̄ is the function to be minimized for a minimal adaptive experiment. Two approaches to selecting a test set for minimal experiments are discussed: exhaustive enumeration, which leads to a true minimal solution, and the method of distinguishability criteria, which results in a near-minimal experiment.

2.6.1 Exhaustive Enumeration

This approach was not practical for preset experiments. It is even worse for adaptive experiments, since we have to consider all possible permutations of test subsets. Unfortunately, this is the only known method that gives a true minimal adaptive experiment. Garey [32], in his Ph.D. thesis, developed a systematic method for the enumeration and discussed special cases which have shorter solutions. This method is only suitable for small problems.

2.6.2 Method of Distinguishability Criteria

This method makes use of a figure of merit that is computed for every test that might be selected. At any stage in the experiment, the test that has the highest figure of merit is selected. The process is repeated until no further diagnosis is possible. This approach results in locally optimized procedures, rather than globally optimized ones.
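As a sketch of this locally optimized, figure-of-merit strategy (not taken from the thesis), the loop below always applies the as-yet-unused test with the highest merit computed over the current block of still-possible conditions. The pair-count weight of Subsection 2.5.4, restricted to a single block, is used as the figure of merit only because it has already been defined; the information-gain criterion discussed next could be substituted. The fault-table representation and the outcome_of callback are assumptions for illustration.

```python
def adaptive_next_test(block, table, applied):
    """Pick the next test for an adaptive experiment.

    block   : set of fault conditions still possible (may include 'f0')
    table   : dict fault -> tuple of 0/1 outcomes (complete fault table)
    applied : set of test indices already used on this branch

    Figure of merit: number of pairs of conditions in the current block
    that the test separates (ones * zeros within the block).
    Returns the chosen test index, or None when no test separates the block.
    """
    m = len(next(iter(table.values())))          # number of tests
    best, best_merit = None, 0
    for j in range(m):
        if j in applied:
            continue
        ones = sum(table[f][j] for f in block)
        zeros = len(block) - ones
        if ones * zeros > best_merit:
            best, best_merit = j, ones * zeros
    return best


def adaptive_diagnose(table, outcome_of):
    """Run one adaptive experiment; outcome_of(j) is assumed to return the
    observed 0/1 outcome of test j on the circuit under diagnosis."""
    block, applied = set(table), set()           # start with every condition
    while True:
        j = adaptive_next_test(block, table, applied)
        if j is None:
            return block                         # no further refinement possible
        applied.add(j)
        r = outcome_of(j)
        block = {f for f in block if table[f][j] == r}
```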
Chang's method, discussed in subsection 2.5.4, can be easily adjusted to apply for adaptive experi- ments. The test weight is considered its figure of merit. Instead of adding up the number of pairs a test can distinguish in all blocks, only one block is considered. Another figure of merit, based on information gain, has been suggested by Brule’ et a1. [4]. The initial uncertainty A0 about the fault condition is: n A0 = - 2 pi log2 pi . (2.10) i=0 At every stage, the test that results in maximum infor- mation gain is selected. For example, to select the first test, the information gain AAi due to every test ti is computed. If test ti is applied, it will fail with probability qi, narrowing down the fault condition to a smaller block. Let the uncertainty among this block be A1 1. It is also true that if ti is applied, it will I not fail with probability (l-qi), narrowing down the fault condition to a different smaller block. Let the uncertainty among this block be A1 0' The information gain AAi due I to ti 1s: AAi = A0 - (qi Al'l + (1-qi) Al'o). (2.11) 60 The test with the largest information gain is selected. The probability of failure qi can be calculated from the prior probabilities. The computations at later stages are computed in a similar fashion. CHAPTER III DETECTION OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS Intermittent faults in digital circuits are those faults whose effects are not present all the time. A probabilistic model for intermittent faults is presented. Permanent faults are a special case in this model. ' Detection of intermittent faults through repeated appli- cation of tests that detect such faults, as if they were permanent, is suggested, together with a detection criterion. The detection procedure proposed is equivalent to a sequential statistical decision problem. Optimization of detection experiments is discussed later in the chapter. Assumptions similar to those made in Section 2.2 for the permanent faults case are used here, namely: (1) The single fault assumption; i.e., the circuit can have only one fault during the testing experiment. (2) Irredundancy; i.e., the circuit is assumed to be irredundant, thus the effect of a fault would not be masked. 61 62 Moreover, we will assume that: (3) The faults are well behaved, i.e., during an application of a test, the circuit behaves as if it is fault free or having a permanent fault for that period of time. That is, during the appli— cation of a test, the effect of an intermittent fault is either not present at all, present for a relatively brief interval of time that the response to the test is the same as if no failure occurred, or present for a long enough period of time so that the fault appears to be permanent for that application of this test. (4) The faults are signal independent; i.e., the presence of a fault does not depend on the signal values existing in the circuit. 3 . 1 THE MODEL The model proposed is a probabilistic one. The basic elements of the model are defined below. Assumptions about the model are indicated as we proceed. The notation used is very close to that employed in pattern recognition literature, for example, see [12, 15, 26]. State Space 0: Q = {mo, m1,...,wn} Each point in 0 represents a state of the circuit; 63 mo: denotes the state of being fault free; mi: (1 3.1 §_n) denotes the state of having inter- mittent fault number i. 
It is assumed that the state of the circuit can be described by a single element from 0 (which one, is not known); i.e., the circuit is fault free or it has only one of n possible intermittent faults (the single fault assumption). During the testing experiment, the circuit is assumed to stay in the same state. If the circuit is in state mi (1‘: i‘: n), it does not mean that the effect Figure 3.1. State Space 9. State mi corresponds to the circuit having intermittent fault number i. 64 of the i-th fault on the behaviour of the circuit will be present all the time; this is due to the intermittency of the fault. Permanent Fault Space F: F = {f f lpoou'f} 0' n There is a one to one correspondence between F and 9. f. 1, (l :_i i n), is the permanent fault that the circuit would have, if the effect of mi is present all the time and fo denotes the fault free condition. Figure 3.2. Sample Space 8. Sample point ‘1 corre- sponds to the circuit behaving as if it has permanent fault fi. 65 Samale Space 3: S = {60, 61,..., an} Every point in 3 corresponds to the outcome of a random experiment. Conceptually, the random experiment can be thought of as testing all circuit components. With the single fault assumption, the outcome of such an experiment would be: 40: all components are fault free, or 61: (1': i i n), the component that pertains to the i-th fault is faulty. Observing 40 in this conceptual random experiment, we cannot infer that the state of the circuit is mo; in fact, it could be any mi (0 1.1 i n) due to the intermittency of the faults. However, observing 6i (1 1.1 i n), we could infer that the state of the circuit is mi for sure. Test Set I: r = {t1, t2,..., tm} The test set I is a complete test set that would detect all faults under consideration if they were perma— nent. This set can be generated using any of the methods discussed in Section 2.2 to detect permanent faults. In general, a test tj will detect more than one fault. Let be the set of faults detected by tj (those faults that correspond to the l-entries in column j of the fault 66 table [definition 2.1)). In terms of the random experiment, test tj can be thought of as testing the components corresponding to faults fj , f. , ..., and fj . The test 1 32 m. has two possible outcomes: 3 (a) all components tested are fault free, or (b) one component is faulty, which one, is not known. Test Subset Ti: (l 5,1 i n) A subset of I that contains all the tests in I that detect £1“ The elements of this subset correspond to .+ the l-entries in the vector fi of the fault table. Random Variable Set T: T = {T1, T2,..., Tm} Every test tj (l :,j :_m) in T defines a random variable Tj on S. If test tj is applied to the cir- cuit and it fails, i.e., the Observed output is different from that of a normal (fault free) circuit, the value of T3. is defined to be 1. On the other hand if tj does not fail (i.e., produces an output identical to that of a normal circuit), then the value of T. J 0. Formally, Tj is a function from S into the set {0,1} defined as: is defined to be Tj:3+{0,1} I (lijim) 7 l l < i < n, and t. detects fault f.. T44.) = " — 3 1 3 1 0 otherwise. (3.1) 67 Random Variable Subset Ti: (l 2,1 1.“) T1 is a subset of T; it contains all the random variables that correspond to tests in Ti. Action Space 0 x S: The probability measure to be used, is defined on the action space 0 x S in order to be able to use prior information about the distribution over 0. 
Since the presence of the effect of an intermittent fault corresponds to a single point in 3, many of the points in 0 x S will have zero probability, namely: P(w0,bi) = 0 , 1 1’1 1 n and, P(wi,6j) = 0 , l :_i,j : n and i f j : That is, only 2n+l points in 0 x S have non—zero proba- bilities, namely: (”0:60) o (wioé-) I (”1160) I 1 iiiji no 1 Every point mi in 0 defines an event on 0 x S as follows: mi = {041,434 | o _<_ j _<_ n} (3.2) It follows that only one point in 0 x S of those contained in the mo event has non-zero probability, namely: (w0,60). Similarly, only two points in 0 x S in the wi(l 5.1 :_n) event, have non-zero probabilities, namely: ((91:60) and (0)1161). 68 2 (w2,42) 011 “In (w1,41) «an,4n) (w2,40) (w1,60) «un'60) Figure 3.3. Points in 0 x 8 With Non-Zero Proba- bilities. Similarly, every point bi in 3 defines an event on 0 x S as follows: Only one point in 9 x S, of those in the event bi (l i 1‘: n), has non-zero probability, namely (mi'biI' and all the n+1 points in 40 have non-zero probabili- ties. 69 k k-Dimensional Random Variable Set Tk (Ti , 1 i_i : n): Tk(Tik) is the cartesian product of the set T(Ti) k) can be by itself k times. An element b from Tk(Ti written as a k-tuple of random variables from T(Ti), for example: b = ( ,...,T. ) T. ,T. 31 32 3k Tjr e T (Tjr 5 Ti) , for l i|r : k . The outcome of an experiment in which k tests from 1(11) are applied to the circuit can be represented by such an element b. Its value is a k-dimensional binary vector. ...p k will be used to denote a k-dimensional The notation b = 0 binary vector, all of its entries are 0's. This corre- sponds to an experiment of k tests; none of them has failed. It should be kept in mind that during the application of an experiment (a sequence of tests), the circuit will stay in the same state mi (0 :_i g n). During the course of the testing experiment, the sample space point will be 60 all the time or changing randomly between 40 and 6. 1 (for some, but fixed 1) only. Prior Probabilities P(wi): (0 :‘i 2.“) Pi = PIwi). 0 :_i i n, is the prior probability that the circuit is in state mi. The values of these pi's are assumed to be known in the model. These values could 70 be obtained empirically, from manufacturer information, or from experience. Conditional Probability of Malfunction P(Ai/mi): (1 po (3.8) 76 P(Tl = 1/011) P(0)1) PUD /T = l) = = 1(309) 1 1 P(Tl 8 1/w0\ P(w0) + P(T1 = 1/(01) Paul) Similarly, P(ub/Tl = 1) = 0 (3.10) The interpretation of (3.6) and (3.8) is that, if t1 is applied and the circuit produced a good output (identical with that of a fault free circuit), then our certainty about the circuit being faulty will decrease and our certainty about the circuit being fault free will increase; which is quite reasonable. On the other hand, the interpretation Of (3.9) and (3.10) is that, once a bad (different from that of a fault free circuit) output is observed upon the appli- cation of t1, then we know for sure that the circuit has an intermittent fault. 
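A minimal sketch of this single-test Bayes update (the simple case of one possible intermittent fault f1 and one test t1) may make the behaviour concrete. The function name and driver loop are illustrative; the failing branch implements the certainty expressed by (3.9) and (3.10), the non-failing branch the corresponding good-output update, and the numerical figures are those that appear in Example 3.1 below.

```python
def update_simple_case(p0, p1, e1, outcome):
    """One Bayes update for the single-fault, single-test case.

    p0, p1  : current probabilities of 'fault free' and 'intermittent fault 1'
    e1      : probability that the fault shows itself when t1 is applied
              (the conditional probability of malfunction)
    outcome : 1 if t1 failed, 0 if it produced the fault-free output

    Returns the posterior pair (p0', p1').
    """
    if outcome == 1:
        # A failure can only come from the faulty state.
        return 0.0, 1.0
    # A good output: P(T1=0 | w0) = 1, P(T1=0 | w1) = 1 - e1.
    denom = p0 + (1.0 - e1) * p1
    return p0 / denom, (1.0 - e1) * p1 / denom


# Repeated good outputs drive the fault probability toward zero:
p0, p1 = 1.0 - 1e-4, 1e-4          # figures used in Example 3.1, taken as priors
e1 = 0.05
for _ in range(91):
    p0, p1 = update_simple_case(p0, p1, e1, outcome=0)
print(p1 / p0)
# about 1e-4 * 0.95**91 ≈ 9.4e-7: the ratio shrinks by (1 - e1) per good output.
```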
The posterior probabilities after applying tj for k+l times and Observing k+l good outputs are: —> P(w1/'r1=o,b=ok ; b e Tk) Pl,k+l ‘I 7: P(Tl=O/wl,b=o ) P(wl/b=0 ) -—> —-> —-> —-> k k _ _ k _ k P(T1=0/w0,b=0 ) P(w0/b-O ) + P(T1-O/w1,b—0 ) P(w1/b-O ) a (l-el) pllk = (l-el) 91.x (3 11) (l-Pl,k) + (1-e1) 91.x 1"‘31 pl,k Shfilafihn 77 ...p k po,k+1 = PIwo/Tl=0,b=0 ; b 8 TR) 9 p = 1 - e 2’: = 1 _ 2’k(1_ - (3.12) 1 1 Po,k 1 Po,k) From (3.11) and (3.12) it follows that: p1,k+1 < pl,k ' (3°13) p0.k+1 > p0,k ° (3.14) Theorem 3.1 (l-e ) p (a) p1,k = 1 % (3.15) po (b) p0,k - k (3.16) Proof (a) By induction. (l-el) P1 (1'61) p1 p1:1 _ pO + (l-el) p1 - l - eljp1 k = 1 : which is true from (3.5). Assume true for k: from (3.11), (l-e = 1’ pl,k p1,k+l 1 - e1 pl,k k (l-el) (l-el) p1 k p0 + (l-e1)k p1 -e1 (1-e1) p1 78 k+1 = (l-el) p1 91 + (1-e1)k (1-e1> pl (1_e)k+l p1 = + i.e., true for k+l po - k Corollary 3.1 (a) lim p = 0 k+oo 1'k (b) lim = l k+oo po'k Proof: (l-el) is less than 1, therefore, lim (l-el) k+oo (a) Using (3.15) and (3.19) lim k+oo P1,k (3.17) (3.18) (3.19) 79 (b) Using (3.16) and (3.19) = po :1 p0+0 lim k+co 90.x Definition 3.1 If the test tl is applied k times and good output (identical with that of a fault free circuit) was Observed every time, the likelihood ratio 1k is defined as: From (3.15) and (3.16), it follows that: Corollaryf3.2 Ak+1 < 1k (3.21) Proof: 1 from (3.20) : —§il = (1-e1) < 1 k i.e., Ak+l < 1k Q.E.D Corollary 3.3 lim 1k = 0 (3.22) k+oo 80 Proof: From (3.19) and (3.20): lim Ak=I_—p—=O k+oo A Decision Rule Earlier in this section it was stated that the pos- terior probabilities will be used in the decision rule to decide whether to continue testing or that the circuit is fault free. The following decision rule is suggested to be part of the detection procedure: If the posterior probability p1,k goes below a certain threshold 3 (0 < s < 1), decide that the cir- cuit is fault free and stop, otherwise apply t1 and repeat (i.e., go to step (1) of the detection pro- cedure). This decision rule is an acceptable one, because it guarantees termination Of the detection procedure in a finite amount of time by virtue of the fact that p1,k is monotonically decreasing (from (3.13)), and that as k increases it can get below the threshold 3 (from (3.17)). The value of the threshold 5 can be chosen by con- sidering the probability of error (probability of deciding that the circuit is fault free while it is actually faulty). This of course depends on how critical the proper operation 81 of the circuit is. Also this affects the length of the testing experiment which is a factor that can be taken into consideration when choosing 5. TO determine a least upper bound on the length of the experiment, find k for which p1 k < s . From (3.15): I k k < S (l‘Pl) + pl (l-el) (l-el)k p1 < s [(l-pl) + p1 (1-e1)k] (1-e1>k (pl-s p1)< s (1-p1) k 8(1-91) (l-el) < p1 Il-s) s (l-pl) k log (1‘61) < log m s (1-p1) log EI—Tr:gy or, k > log (1'91) (3.23) Note that log(l-e1) is negative. Example 3.1 Among the gates produced by a certain manu- facturer, it is estimated that for about 0.01% of them, the gap between the ON and OFF voltages is smaller than some critical value. If the gap is below the critical value, the gate will malfunction 5% of the time. Find a least upper bound for the length of the testing experiment. 
82 Solution From the given figures, the following quantities are estimated as shown: p1 = 10-4 el = 0.05 If we choose the threshold 5 to be 10-6, then 10’6(1-1o'4) log -4 -6 k > 10 (1-10 ) log 0.95 or k 1 91 Another Decision Rule The rule suggested here compares the likelihood ratio (which is a function of the posterior probabilities) with a threshold u (u > 0) as follows: If Ak goes below u decide that the circuit is fault free and stop, otherwise apply t1 and repeat (i.e., go to step (1) of the detection procedure). This decision rule is also an acceptable one, since it guarantees termination of the detection procedure in a finite amount of time by virtue of the fact that 1k is monotonically decreasing (from (3.21)), and that as k increases it can get below u (from (3.22)). Actually this is the optimum Bayesian decision rule (that minimizes the average loss) for the (0,1) loss function [15]. 83 The value of the threshold u could be chosen in a fashion similar to that of selecting s. The least upper bound on the length Of the testing experiment is obtained by finding k such that l < u. k From (3.20): (1-e1)k p1 U’Pl) u A )k u (l-pl) 1 p1 A (l-e u (l‘Pl) k lo l-e < lo q ( 1) 9 pl 11 (1.131) p1 long-el) (3'24) log or, k > Which is similar to the results in Wald [39] for binomial samples. 3.2.2. The General Case This is the case described by the model in Section 3.1. The basic assumptions are: (l) the circuit is irredundant, (2) the circuit can have only a single intermittent fault out of the n faults considered, (3) the faults are well behaved, and (4) the faults are signal independent. Consider the probability distribution on 0 x S governed by the following conditions: 84 PI‘I-Ii) = P1 I 0 |A )5 IA :3 P(Ai/wi) = ei , l i 1 i_n Definition 3.2 The membership functions Mj (1 i’j i’m) from the the set {0,l,...,n} into the set {0,1} are defined as: II 0 ‘ Mj(0) 1 if t. detects f. . j 1 Mj(i) (1 < i < n) 0 if tj does not detect fi' The posterior probabilities are now calculated using Bayes' Rule (Equation 3.4)), P(Tj/wi) P(wi) P(wi/Tj) = n 2 P(Tj/w£) P(w£) i=0 ei 1f tj detects fi . P(Tj = l/wi) = 0 if tj does not detect fi . Similarly, l-e. if t. detects f.. 1 j 1 P(Tj = O/wi) = 1 if tj does not detect fi . In terms of the membership functions, we can write: P(Tj = O/mi) = ei Mj(i) (3.25) P(Tj = O/wi) 1 1 - e. Mj(i) . (3.26) 85 Thus: ei MJ. (1) P (mi) P(wi/Tj = l) n (3.27) 2 e£ Mj(£) P(w£) i=0 It follows that: P(wo/Tj = 1) = o . (3.28) If ti does not detect fi then: P(wi/Tj = 1, Mj(i) = 0) = o (3.29) If tj detects fi then: ei P(wi) P(u)i/Tj = l, Mj(i) l) n 2 ez Mj(£) P(w£) £=0 (3.30) Equations (3.28), (3.29) and (3.30) give the value of the posterior probabilities if the applied test tj fails (results in an output different from that of a fault free circuit). Similarly, [1-ei Mi(i)] P(wi) P(wi/Tj = 0) = (3.31) n 2 [l-e, Mjwn P(w£) £=0 86 Thus, P(wo) P(wo/Tj = 0) = n (3.32) i=0 Notice that: n £=0 Thus, n 2 el Mj(£) P(w£) < l , (3.33) i=0 and, n :2 [l-eL Mj(£)] P(w£) P(wo) - (3.35) If ti does not detect fi' then: P(wi/Tj = 0, Mj(i) = 0) > P(wi) (3.36) This is an interesting result, since it indicates that the certainty about the circuit having a particular fault increases (if the applied test does not detect that fault) 87 even though the applied test produced a good output (that is, identical with that of a fault free circuit). If tj detects fi' P(wi/Tj = 1) could be less than, equal to, or even greater than P(wi), i.e., , _ > P(u)i/Tj = 0, Mj(1) — 1))? 
P(wi) (3.37) This is even a more interesting result since the cer- tainty about the circuit having a particular fault could increase even if the applied test detects that fault and produces a good output. From (3.36) and (3.37), it is clear that a decision rule based on comparing the posterior probabilities with some thresholds, as was done in Subsection 3.2.1 for the simple case, is not acceptable since the posterior proba- bilities are not monotonically decreasing functions, thus there is no guarantee that the detection procedure will terminate in a finite amount of time. The posterior probabilities after applying k+1 tests; the k+1-st being, say, test t are (from (3.4)): j; P(Tj/wi,b) P(wi/b) k _ P(wi/Tj, b , b e T ) _ n . P(Tj/w£,b) P(w£/b) £=0 Notice that P(Tj/wi,b) = P(Tj/wi), thus: 88 [l-ei Mi(i)] PIwi/b) P(wi/Tj a o, b ; b 8 TR) = in 0 (3.38) Definition 3.3 If k+1 tests from T are applied: the k+1-st being, say, test tj: and a good output was observed every time, the likelihood ratios xi,k+l(1 1.1 :,n) are defined as: k P(mi/Tj=o'b= SbET) .3 A. - - (3.39) 1'k+1 Paco/Tj = 0, b - 6%»; b e Tk) From (3.38), it follows that: Ai,k+l = [l-ei Mj(1)] Ai,k . (3.40) Thus Ai,k+l : Ai,k (3.41) The equality in (3.41) holds only if the k+1-st test detects f. 1, otherwise this relation is strict inequality. Definition 3.4 The initial likelihood ratios xi (1 i|i :,n) are defined as: A. = 11,0 = mgr ° (“2’ 89 Theorem 3.2 If k tests from T are applied, L of them are from Ti, and none Of them failed, then: _ L Ai,k — (1-ei) Xi . (3.43) Proof It is clear that L :_k. Proof is by induction on k. k = 1 : from (3.40), A1,1 = [1'31 the first test being tj. Mj(i)] A1; L could be 0 or 1: II >a 0, i.e., Ai,l i' If L = 0 (tj t Ti), then Mj(i) which satisfies (3.43) for L = O. l, 1.e., Ai,l = If L = l (tj 8 Ti), then Mj(i) (1‘9 )Air 1 which satisfies (3.43) for L = 1. Assume true for k: from (3.40), A1,k+1 = [1'91 MjIIII A1,18 the k+1-st test being tj. If tj t Ti, then Mj(i) = 0, thus, A A. = (l-ei)£ A. . (3.44) i,k+l = 1,k 1 tj ¢ Ti also means that the number of tests from Ti in the first k+1 tests is L. Therefore, (3.44) indi- cates that the theorem is true for k+1. If tj 8 Ti, then Mj(1) = 1, thus; 90 _ _ _ L+l Ai'k+l - (1 31) Ai’k - (1 ei) A1 0 (3.45) tj 8 Ti also means that the number of tests from Ti in the first k+1 tests is L+1. Therefore, (3.45) indicates that the theorem is true for k+1. Q.E.D. Corollary 3.4 11:21+00 Ai,k = 0 (3.46) Proof: (l-ei) < 1 I therefore, lim 1. k = lim (l-e.)£1. = 0 . [+00 1' [+00 1 1' Q.E.D. The Decision Rule The decision rule suggested, compares the likelihood ratios xi, (1 :_i 1.“)! with thresholds ui(ui > 0 for all i) as follows: If 11‘: ui for all i, decide that the circuit is fault free and stop. If Xi > ui for some i, select a test from Ti and repeat, (i.e., go to step (1) of the detection pro- cedure). From (3.41) we see that the likelihood ratios are monotonically non-increasing functions. Any likelihood 91 ratio Ai,k will strictly decrease if we apply a test from Ti (and of course, that test produces a good output). Thus, from (3.46), the likelihood ratios could go below the specified thresholds, and the detection procedure is guaranteed to terminate in a finite amount of time using this decision rule. Hence, it is an acceptable rule. Theorem 3.2 can be used to determine a least upper bound on the number of tests ki from Ti (l i'i : n) that are needed, as follows: k. 1 (l-ei) Ai < u 1 k. u. 1 1 u. ki log(l-ei) < log X_' u. 1°? 1‘?- 1 °r' ki > 1°9I1'617 u. 
Since λ_i = p_i / p_0, this can be written as

    k_i > log(u_i p_0 / p_i) / log(1 - e_i) .                      (3.47)

3.3 OPTIMUM DETECTION EXPERIMENT

In Section 3.2.2, least upper bounds k_i on the number of tests that detect f_i that are needed were obtained in (3.47). However, a given test usually detects more than one fault; so, a method is needed to determine how many times every test should be repeated so that (3.47) is satisfied for all i (1 ≤ i ≤ n), and so that the overall experiment length is minimum. If t_j (1 ≤ j ≤ m) is repeated x_j times, then (3.47), in terms of the fault table matrix A (Definition 2.1), yields

    Σ_{j=1}^{m} a_ij x_j ≥ k_i ,   1 ≤ i ≤ n .                     (3.48)

The experiment length L is:

    L = Σ_{j=1}^{m} x_j                                            (3.49)

L is the function to be minimized. The quantities at hand satisfy the following conditions:

1. x_j ≥ 0
2. all x_j's are integers

This is an all-integer integer programming problem, the solution of which determines the optimum number of repetitions for each test. For details and solutions of integer programming problems, see [19, 20]. It should be noted that for the permanent fault case, every k_i (1 ≤ i ≤ n) will be equal to 1; the problem then becomes similar to the problem of minimizing Boolean functions. Integer programming treatments for this special case are found in [9, 10, 19].

Example 3.2

Consider the following fault table matrix:

        t1  t2  t3  t4  t5
f1       1   1   0   1   0
f2       1   1   1   0   1
f3       0   1   1   1   1
f4       0   1   0   0   1
f5       0   0   0   1   1
f6       1   0   1   0   0

The constraints (3.48) are:

    x1 + x2 + x4        ≥ k1
    x1 + x2 + x3 + x5   ≥ k2
    x2 + x3 + x4 + x5   ≥ k3
    x2 + x5             ≥ k4
    x4 + x5             ≥ k5
    x1 + x3             ≥ k6

Find integer x_j's that minimize: x1 + x2 + x3 + x4 + x5.

3.3.1 A Suboptimal Solution

If the values of the k_i's are relatively large (e.g., > 10), the integer programming problem presented by (3.48) and (3.49) can be solved as a linear programming problem (actually as a transportation problem, since the coefficients are 0's and 1's). The solutions are then rounded up (the smallest integers greater than or equal to the obtained solutions are used). This is generally a faster solution, since solving a linear programming problem is easier than solving an integer programming one. Linear programming methods will result in very little deviation from the optimal solution if the values of the k_i's are large enough, since the function to be minimized is just the sum of the x_j's.

3.3.2 Reduction of the Fault Table Matrix A

The size of the matrix A can easily get to be huge for a large circuit. The amount of memory and the number of computations needed to solve an integer programming problem (or a linear programming one) can be greatly reduced if the size of matrix A is reduced. In this subsection, an attempt is made to transform the problem into an equivalent one, but with a smaller matrix A*. As A is transformed into A*, the values of the k_i's will be adjusted. The obtained solution, i.e., the x_j's, for the reduced problem will also be adjusted by adding appropriate biases b_j's to obtain the solution for the original problem. The initial bias values are zeros. The following operations are suggested for the reduction of the fault table matrix A to A*:

(a) If only one 1 occurs in row i, say a_ij = 1 and a_ik = 0 for k ≠ j, delete row i from the matrix, increment b_j by k_i, and, for every row q for which a_qj = 1, replace k_q by k_q - k_i; if this difference is zero or negative, delete row q.

(b) If row i contains 1's wherever row j does (i ≠ j), that is, for each k, a_jk = 1 implies a_ik = 1 (row i dominates row j), and if k_j ≥ k_i, delete row i.
(c) If column 1 contains l's wherever column j does (1 # j), that is, for each k, akj=l implies aki=1' (column 1 dominates column j) then delete column j. These operations may be carried out in any order starting with A and continuing until a matrix A* is Obtained on which none of them can be applied. The solution for the reduced problem (with A*) is then obtained. Every xj in the solution of the original problem is obtained by adding the bias bj to the corresponding xj obtained in the solution with A*. Theorem 3.3 If A* is the matrix obtained from the fault table matrix A by a succession of operations of the (a), (b) and (c) types suggested above, then the solution to the original problem (presented by (3.48) and (3.49) is the same as that of the reduced problem adjusted by the biases bj's (1 i j :_m). 96 32222 It is sufficient to prove that any Operation will result in a problem with an equivalent solution. (1) Assume operation (a) is performed. This means that one of the constraints in (3.48) is xj Z ki , i.e., test tj has to be repeated at least ki times. If we adjust the corresponding bias bj by k1' this constraint can be removed (i.e., delete row i) since the bias will make sure that it is satisfied. Every constraint of the form: ... + xj + ... Z.kq , can be written as + (xj'ki) + :kq-ki . Let x3 = xj - ki. The set of constraints can now be written in terms of xi (the reduced problem). If kq-ki is zero or negative, then this con- straint is automatically satisfied by xj > ki thus row q can be eliminated. If ki < kq we solve the reduced problem. Obviously, a solution to the reduced problem is a solution to the original problem if we adjust the obtained value for x3 by ki, i.e., by the amount of bias adjust- ment. Thus, Operation (a) results in an equiva- lent solution. (2) (3) 97 Assume operation (b) is performed. Let the number of l-entries in rows 1 and j be qi and qj respectively. Without loss of generality, assume that the l-entries in row i are the first qi entries in that row. Since the condition for operation (b) is satisfied, we can also assume that the l-entries of row j are the first qj entries in that row. It is clear that qi :_q.. 3 The following constraints must now exist: +x2+ooo+xq +ooo+x Z x 1 . . 3 q X k1 1 k. +x +...+ 2 X [V l 3 j The second constraint implies: x +x +000+x +OOO+X 1 2 qj qi >k If kj :.ki' then the first constraint is auto- matically satisfied by satisfying the second. Thus, elimination of row i will not change the solution. Assume operation (c) is performed. In this case, it is clear that xj=0 does not result in any contradiction to the constraints of (3.48), since for every xj appearance, xi also appears; so, xi can be set large enough to satisfy the con- straint. We need to prove that xj=0 is part Of m a solution that minimizes : 2: 1g; . Assume q=l 98 that a solution minimizing this sum exists with xj > 0. From this solution, substitute xi + xj for xi and 0 for xj in (3.48). NO contradiction results since column i dominates column j. The same substitution will result in the same sum m 23 x q=1 q found a solution with xj=0. Thus column j can be , i.e., minimality is maintained. Thus we eliminated. CHAPTER IV DIAGNOSIS OF INTERMITTENT FAULTS IN COMBINATIONAL CIRCUITS In Chapter III, a probabilistic model for intermittent faults was introduced, also an approach for the detection of these faults. The detection procedure proposed relied on the repeated application of tests that would detect these faults had their effect been permanent. 
Several methods for test generation were discussed in Chapter II. In this chapter, we present an approach for the diagnosis of intermittent faults in combinational circuits, also, employing the repetition of tests that detect permanent faults. 4.1 GENERAL ASSUMPTIONS There are three basic assumptions for the diagnosis methodology to be presented, namely: (1) The probabilistic model, introduced in Section 3.1 for intermittent faults, will be used here. 99 (2) (3) 100 All the assumptions previously made in the model are also carried here. Of major importance is the single fault assumption. A detection experiment, similar to that proposed in Section 3.2, is assumed to have been run and resulted in the decision that the circuit has an intermittent fault, i.e., a test in that experi- ment has failed. This assumption assures that a fault exists in the circuit before we start the diagnosis experiment. The posterior probabilities of the states Of the circuit at the end of the detection experiment are assumed to be known to the experimenter. These will be used as prior probabilities in the diagnosis experiment. A fourth assumption, that is helpful even though not essential, will also be assumed: (4) The test that failed in the detection experiment, say test tj' is assumed to be known to the experiment. This assumption tends to reduce the length of the diagnosis experiment since it will start with less possible faults (those faults that correspond to the l-entries of column j of the fault table) than the total fault set. 101 4.2 DIAGNOSIS OF INTERMITTENT FAULTS The approach suggested for the diagnosis of intermittent faults is through the repeated application of tests from the test set I. A subset of T is selected and its tests are repeatedly applied until a failure occurs. This narrows down the possible faults that the circuit might have. The fault table is then reduced and another subset of T is selected and the process is repeated until enough failures occur to diagnose the fault in the circuit. Let the set of possible intermittent faults, at any stage in the experiment, be 0p. Let Fp be the subset of F that corresponds to 0p. Let 0p (also Fp) contain n elements. Initially Fp contains the faults that have 1-entries in column j of the fault table (assuming that tj was the test that failed in the detection experiment). Obviously, 0p does not contain mo since in the diagnosis phase, we know that the circuit is faulty for sure. The posterior probabilities of the states of the circuit at the end of the detection experiment are used as prior proba- bilities for the diagnosis experiment. Fault Table Reduction. Whenever a failure occurs during the diagnosis experiment, Fp is narrowed down, and consequently the fault table is reduced. The following reduction steps are applied until none of them can be applied any more: (a) (b) (C) (d) 102 Eliminate all rows that correspond to faults not contained in Fp. Eliminate all redundant columns, i.e., eliminate all but one of every group of identical columns. Eliminate all O-columns. Eliminate all the l-columns. The failure of a test, whose corresponding column in the fault table has all l-entries, does not contribute any information to the diagnosis Of the existing fault. Thus this test is eliminated. Diagposis Procedure. The following diagnosis pro- cedure is now suggested (a flow chart for this procedure is given in Figure 4.1): (1) (2) (3) (4) From the outcome of the detection experiment, compute Fp , 0p. Each of these sets has n elements. 
If Fp is a singleton (i.e., if n = 1) go to (9), otherwise go to (3). Obtain a reduced fault table using steps (a) through (d) indicated above. Select a subset Tp of I that covers Fp, i.e., Tp is a detection test set for Fp. Let Tp contain 0 tests. Without loss of gener- ality, we can assume that the tests in Tp are ordered, so we can speak of the i-th test in Tp (l i i i u). If no such Tp is found, go to (9). 103 C .....9 I Q = Set of possible faults p F corresponds to 0p Each contains n elements YES NO Reduce fault table Select Tp that covers Fp Tp contains u tests NO C ) [ Apply i—th test of Tp I U d t 9 F & +__.___ Lpaeplp 7) Figure 4.1. Flow Chart for the Diagnosis Procedure. (5) (6) (7) (8) (9) 104 i = 1. Apply the i-th test of Tp' If this test fails, go to (8), otherwise go to (7). Increment i by 1. If it exceeds u go to (5), otherwise go to (6). At this point, a test in Tp has failed, this narrows down the set of possible faults 9p (also Fp) to a smaller set. This smaller set is the one that corresponds to faults in Fp with l-entries in the column corresponding to the failing test in the fault table. Replace F P and 9p by the smaller sets indicated above. Update n. GO to (2). Stop. Diagnosis experiment is complete. Diag- nostic resolution is determined by F . If Fp P is a singleton, complete diagnosis is obtained. Notice that when selecting the test set Tp’ no test that failed earlier in the diagnosis experiment will be chosen, since such a test corresponds to a column, all of its entries are 1's, in the reduced fault table. Such a column will be eliminated by operation (d) of the fault table reduction procedure. The diagnosis procedure terminates only if one of two conditions arises: (1) Fp becomes a singleton; in which case complete diagnosis is obtained, or 105 (2) no Tp that covers Fp is found; in which case complete diagnosis is not obtained, the diagnosis resolution being determined by Fp. Diagnosis Tree. The diagnosis procedure can be represented by a tree. The nodes correspond to the Fp's, with the root being the initial Fp Obtained from the detection experiment. The edges out of a node correspond to the tests Of the appropriate Tp' i.e., every node has u (the corresponding u) edges out of it. An edge, corre- sponding to a test tj out of a node 8, goes to node B that corresponds to the subset of the faults of node a that are detected by tj. At any stage in the diagnosis procedure, if a different Tp is chosen, a different diag- nosis experiment and consequently a different diagnosis tree will result. A complete diagnosis tree is a diagnosis tree that contains all the subtrees corresponding to all the possible outcomes of the diagnosis experiments. The leaves corre- spond to the maximum diagnostic resolution possibly obtained. Definition 4.1 The portion of the diagnosis procedure that con- sists of applying the tests Of Tp repeatedly until a failure occurs is called a subexperiment. In terms of the diagnosis procedure, a subexperiment consists of iterations of the loop defined by steps (5), 106 (6) and (7) until a failure occurs to exit from this loop. Notice that the failure of a test is the only way to exit from this loop. In order for the diagnosis experiment to be finite, the expected length of every subexperiment in it should be finite. The condition that Tp covers Fp guarantees that, this will be proved later in Section 4.3. 
During the diagnosis procedure, it is desirable to calculate the posterior probabilities of the different states of the circuit after observing the outcome of every test applied. These posterior probabilities can be employed to select a Tp’ at the beginning of a subexperiment, that tends to make the diagnosis experiment shorter. It should be noted that the posterior probabilities for any wi that is not in Up is zero. This follows directly from (3.29) since such an mi corresponds to a permanent fault f. 1 that is not detected by a test that failed earlier. Example 4.1 A detection experiment was run. It resulted in the decision that the circuit has an intermittent fault. Observing the failing test ruled out some possibilities for the fault condition. Fault table reduction, as suggested above, resulted in the reduced fault table given in Table 4.1. A diagnosis experi- ment is to be run. t1 t2 t3 t4 t5 t6 t7 f1 0 0 0 1 1 0 1 f2 0 1 0 l 0 1 0 f3 0 1 0 0 1 1 1 f4 1 0 1 0 1 0 0 f5 1 l 1 0 0 0 1 f6 0 1 0 0 0 1 1 f7 0 0 1 1 0 1 1 f8 1 1 0 0 0 l 0 Table 4.1. Reduced Fault Table After Detection Experiment. Th ' 't' 1 Q d F : e 1n1 1a p an p are up = {w1,w2,...,w8} Fp = {fl'f2'...'f8} Notice that mo and f0 are not included in GP or Fp. The test set {tl,t2,t7} covers Fp; select 1t as Tp' Apply t1,t2,t7,t1,t2,t7,... unt1l a failure occurs. Assume that this subexperiment results in the failure of t7. The set of possible faults is nar- rowed down to those faults with 1's in column t7 of Table 4.1 (w2,w4, and w are ruled out). Thus: 8 Q = {wl,w3,w5,w6,w7}. F = Ifl'f3'fs'f6'f7}° 108 The corresponding reduced fault table is shown in Table 4.2. Select a new Tp that cOvers Fp. The test set {t2,t5,t6} is selected. Apply t2,t5,t6,t2,t5,t6,... until a failure occurs. Assume that t2 failed in this subexperiment. This rules out w and w 1 7' Thus, we have: 9 P Fp = {£3,£ {w3,w5,w6} s'fs} Reducing Table 4.2 in correspondence with the outcome of this subexperiment, we obtain Table 4.3. The test set {t1,t6} is a suitable Tp since it covers Fp. Apply t1,t6,t1,t6,... unt1l a fa1lure occurs. Say t1 fails. This rules out OB and w6. The corresponding Fp and 0p are: t1 t2 t3 t4 t5 t6 f1 0 o o 1 1 0 £3 0 1 o o 1 1 £5 1 1 1 o o 0 f6 0 1 o o o 1 f7 0 o 1 1 o 1 Table 4.2. Reduced Fault Table After First Subexperiment. 109 t1 t5 t6 f3 f5 f6 0 1 Table 4.3. Reduced Fault Table After Second Subexperiment. Q = {ms} 9 F {f5} P Fp is a singleton. This terminates the diagnosis experiment. The circuit is diagnosed as having inter- mittent fault ms. If test t6 is the one that failed in this sub- experiment and not t1, then ms is ruled out. The corresponding Fp and 0p are: Up = {w3,w6} Fp = {f3,f6} The corresponding reduced fault table is given in Table 4.4. In this case there is no TP that covers both f3 and f6. This terminates the diagnosis experiment with the diagnosis resolution being defined by Fp which conta1ns f3 and f6. 110 Table 4.4. Reduced Fault Table when t6 Fails in Third Subexperiment. That is, the result of the diagnosis experiment is: the circuit has either intermittent fault w3 or intermittent fault N6. Thus, the components that pertain to these two faults should be replaced for repair. A diagnosis tree for this example is shown in Figure 4.2. The tree shown is not a complete diagnosis tree. In this example, we did not make use of the posterior probabilities since we were not concerned about comparing Tp's or designing a shortest diag- nosis experiment. 
The last result of Example 4.1 should be compared with the diagnosis of permanent faults. If the faults of this example were permanent, maximum diagnostic is possible since the rows of the fault table are distinguishable (Theorem 2.2). However, in case of intermittent faults, and using the suggested diagnosis procedure, we were unable to obtain complete diagnosis. Hence, there are fundamental 111 (ifl’fz’fa'f4’fs'fe’f7’f8i> t1 t2 t7 {f4,f5,f8} ({ f ,f3, f5 ,f6 ,£;:> {f1,f3,f5, f6 ,fz:) t t 116 3} {f6.f7} Figure 4.2. Diagnosis Tree for Example 4.1. differences between the diagnosis of intermittent faults and the diagnosis of permanent ones that are more subtle than mere repetition of tests. Fault table properties that are relevant to the diagnosis of intermittent faults are explored later in Section 4.4. 112 4.3 EXPECTED LENGTH OF SUBEXPERIMENT In this section, we try to compute the expected length of a subexperiment (definition 4.1) of the diagnosis procedure. In a subexperiment, tests of an appropriate Tp are repeatedly applied in sequence until a failure occurs. Tp contains u tests. Let the set Of corresponding random variables be Tp. The set of possible faults 0p contains n states. Without loss of generality, we can label the elements of Tp and Tp as: l,2,...,u, and the elements of 9p as: l,2,...,n. We assume that the posterior probabilities for the states of the circuit were computed after Observing the outcome of every test in the diagnosis experiment. The posterior probabilities computed thus far (to the beginning of the subexperiment) will be used as prior probabilities in the subexperiment. The problem is to find the expected number of times the tests of Tp will be applied until the first failure. Notation. The following notation will be used in this section: —-+ _ U - _ _ _ Tp - 0 will mean Tl—O, T2—0,..., and Tu—O This corresponds to the case where every test in Tp has been applied once and one of them failed. ...) Tp # 01“ will mean the complement of the above 113 condition. That is, every test of Tp was applied once and at least one of them failed. ...) k ~«-> -+ k . . Tp = 011 will mean Tp=0u, Tp=0u,... k times. This corresponds to the case where the test set Tp was applied k times and no failure occurred. The wi's (l :_i i n) define disjoint events over the probability space (9 x S)k (the cartesian product of (0 x S) by itself k times). The probabilities of these events (the prior probabilities of the subexperiment) add to 1 due to the single fault assumption. Thus we can write: -73 _ _ _ _ - P(Tp—0 ) — P(Tl—T2—...-Tu—0) n ——> __ _ IJ - P(Tp-0 /wi) P(wi) (4.1) i=1 If mi 13 covered ki times by Tp (1.e., ki tests of Tp detect fi), then: ..., k. _ u _ _ 1 P(Tp—0 /wi) _ (1 ei) (4.3) The conditional probability of at least one failure during the application of Tp is: '3 _ - - p(Tp¢o /wi) — 1 (1 ei) (4.4) 114 If Tp is applied k times and no test failed, then: '—* k.k P(T k=0“pr = (1—ei) 1 p (4.5) The probability of a first failure on the k+1-st appli- cation of Tp is: + --> P(T k+1 ; T k+1 = T k T ' T k = OUk ' T #0“) P P P P P P n —» k+1 k k uk = P T o T " T T r T - 0 T O (1) Z ( P p p y! / o) p i=1 n k.k k. = Z (l-e.) l (l - (l-e.) 1) Pi (4.6) l 1. i=1 This is the probability that the subexperiment will have k+1 applications of Tp. Thus, the expected number of applications of Tp is:+ )2 = E(k) °° n k.k k. = 2 (k+1) Z (l-ei) 1 (l-(l-ei) 1) pjL = i=1 [I °° k.k k. .= :Ep .22 (1- e.) 1 (1- (1— e ) 1) (k+1) +E(k) denotes the expected value of k. 
The inner infinite series is the expansion of a negative binomial,* thus

    k̄ = Σ_{i=1}^{n}  p_i / (1 - (1 - e_i)^{k_i})                   (4.7)

The expected subexperiment length ℓ̄ is:

    ℓ̄ = u k̄                                                        (4.8)

The expected subexperiment length given by (4.8) is approximate, since the last application of T_p (when a failure occurs) will probably be incomplete. However, it is fairly accurate if k is relatively large. The conditional expectation of the number of applications of T_p is given by:

    E(k/ω_i) = 1 / (1 - (1 - e_i)^{k_i})                            (4.9)

From (4.7) and (4.3), if k_i > 0 for every i, then ℓ̄ is finite, and if k_i = 0 for some i (1 ≤ i ≤ n), then ℓ̄ = ∞. This explains why T_p was required to cover Ω_p in the diagnosis procedure: to ensure that the diagnosis procedure will eventually terminate.

*If 0 < x < 1, then Σ_{k=0}^{∞} (k+1) x^k = 1/(1-x)^2.
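A minimal computational sketch of (4.7) and (4.8) follows; the function name, argument layout, and toy numbers are illustrative assumptions.

```python
def expected_subexperiment_length(priors, e, cover_counts, u):
    """Expected subexperiment length per equations (4.7) and (4.8).

    priors       : p_i, probabilities of the n still-possible faults
    e            : e_i, conditional probabilities of malfunction
    cover_counts : k_i, how many tests of T_p detect fault i
    u            : number of tests in T_p

    Returns (k_bar, l_bar); raises if some k_i is zero, i.e. T_p does not
    cover every possible fault, in which case the expectation is infinite.
    """
    k_bar = 0.0
    for p_i, e_i, k_i in zip(priors, e, cover_counts):
        if k_i == 0:
            raise ValueError("T_p does not cover every possible fault")
        k_bar += p_i / (1.0 - (1.0 - e_i) ** k_i)   # equation (4.7)
    return k_bar, u * k_bar                         # equation (4.8)


# Hypothetical numbers: three possible faults, T_p of three tests, each fault
# covered once, equal posteriors, e_i = 0.1 for all i:
print(expected_subexperiment_length([1/3, 1/3, 1/3], [0.1, 0.1, 0.1],
                                    [1, 1, 1], u=3))
# (10.0, 30.0): about ten passes of T_p, roughly thirty test applications.
```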