A PROBABILISTIC ASSOCIATION MEASURE FOR PATTERN RECOGNITION

by

Xiaobo Li

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

1984

ABSTRACT

A PROBABILISTIC ASSOCIATION MEASURE FOR PATTERN RECOGNITION

by Xiaobo Li

This thesis investigates the properties of a probabilistic association measure in four areas of pattern recognition and image processing. It is essential to measure the association between patterns, between features, and between partitions in both pattern recognition and exploratory pattern analysis. Direct interpretation and a known distribution are desirable properties for an association measure. This thesis proposes a statistic with these characteristics under a permutation null hypothesis which has been used to advantage in the literature. The four problems attacked in this thesis are described below.

A preliminary feature analysis studies the unusualness of feature-category relations before feature extraction, which is just as important as a cluster tendency study before clustering. This is formally defined in this thesis for the first time in a statistical hypothesis testing framework. This thesis shows that the permutation statistic is preferable to the commonly used correlation coefficient with binary features. Both statistics have comparable power, but the threshold of the permutation statistic has a more direct interpretation than that of the correlation coefficient. The use of this statistic is demonstrated on questionnaire data.

The second and third problems attacked with the permutation statistic are the measurement of the adequacy of binary partitions in validating clustering results and in designing tree classifiers. The relations among three well-known measures of cluster validity are derived and the permutation statistic is shown to be different from, but highly correlated with, these measures. Computational advantages make it preferable to the other statistics. In tree classifier design, the permutation statistic can be used as a criterion for choosing the feature and threshold at each node in the tree. The threshold can be set more easily than with the mutual information criterion. Several examples on artificial and real data sets demonstrate that the permutation statistic is a reasonable criterion for node definition and leads to successful tree classifiers.

The fourth problem investigated is the general problem of image template matching posed as a test of hypothesis. The alternative hypothesis is that a distorted version of the object is in the image. The null hypothesis is that the object is not in the image. We derive an approximate likelihood ratio statistic for testing these hypotheses and compare it to three other statistics. The optimal statistic is most powerful but computationally intensive. A simplified Neyman-Pearson statistic is shown to be more powerful than other suboptimal measures, yet to remain sensitive to the true object location in the image. The permutation statistic acts about the same as other sub-optimal measures, such as absolute difference and the correlation coefficient.

The last chapter of the thesis reviews algorithms for computing cumulative hypergeometric distribution functions, which is essential to the proposed statistic.
A pipeline architecture design implementing a recursive computation formula is proposed. This design is more efficient than other exact computation methods by several orders of magnitude. It is faster than the best approximation algorithm for many reasonable cases.

To my father and my mother

ACKNOWLEDGEMENTS

I thank my thesis advisor, Professor Richard C. Dubes, for his help, guidance and encouragement throughout the preparation of this thesis. He has introduced me to careful scientific research and precise scientific writing and he has contributed greatly to my interest in Pattern Recognition and my research abilities. His continuing direction was essential to this study.

Thanks are also due to Professor Anil K. Jain and Professor George Stockman for their helpful discussions and guidance in the research group and my degree committee; to Professor Raoul LePage for his important suggestions on the theoretical work of this thesis; to Professor Carl Page for his advice and excellent teaching of many of my courses. Special thanks to Professor Lionel M. Ni for his brilliant help on the hardware implementation part of this thesis.

Acknowledgements are due in addition to past and present graduate students of the MSU PR/IP Laboratory for their assistance: Dr. George Cross, Dr. Weichung Lin, Dr. Steve Smith, Dr. James Coggins, Dr. Gautam Biswas, Dr. Ardeshir Goshtasby, Almost-Dr. Rick Hoffman, Phil Nagan, and Juan Esteva. System manager Phil Nagan's maintenance of the laboratory greatly facilitated this thesis work. A special acknowledgement to my office mate Rick Hoffman for his theoretical discussions and for creating useful system software which played a large role in the preparation of this thesis.

I must thank my wife Zhemin for sharing my burden and difficulties. Her understanding and support, and the sacrifices of my son and my family, made this study possible.

The financial support of NSF grants ECS-800716 and ECS-83002004 is also gratefully acknowledged.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1. INTRODUCTION
  1.1 Summary
  1.2 Background
  1.3 Probabilistic Proximity Measures and Permutation Models
  1.4 Proposed Probabilistic Indices for Binary Vector Pairs
CHAPTER 2. PRELIMINARY ANALYSIS OF DICHOTOMOUS FEATURES
  2.1 Are Any Two Features Unusually Similar to One Another?
  2.2 Is Any Feature Unusually Closely Related to Category?
  2.3 Approximating the Distributions of Ŝ and SM
    2.3.1 Null Distribution of Ŝ when K=2
    2.3.2 Approximating the Null Distribution of Ŝ when K>2
  2.4 Power Comparison of Ŝ and Ĉ
  2.5 An Application of Preliminary Feature Analysis
    2.5.1 Tests for Significant Relations between Features
    2.5.2 Verification on Original Pattern Matrix
    2.5.3 Feature Extraction Using Feature 9 as Categorical
    2.5.4 Feature Extraction Using Feature 8 as Categorical
  2.6 Summary and Conclusions
CHAPTER 3. ADEQUACY OF BINARY PARTITIONS
  3.1 Four External Measures of Association for Cluster Validity
    3.1.1 Definitions
    3.1.2 Relation among Measures
  3.2 Binary Tree Classifier Design
    3.2.1 Binary Tree Classifier
    3.2.2 Computation of Feature Thresholds
    3.2.3 Efficiency in Feature Threshold Computation
    3.2.4 Numerical Examples
  3.3 Summary and Conclusions
CHAPTER 4. TEMPLATE MATCHING
  4.1 Introduction
  4.2 Mathematical Model
    4.2.1 General Definition
    4.2.2 Statistics for Testing H0 vs. H1x
    4.2.3 Statistics for Testing H0 vs. H1
  4.3 Comparison Study
    4.3.1 Analytical Comparison of D and G
    4.3.2 Monte Carlo Comparison of D, G and R
  4.4 Results
    4.4.1 Power Study Results
    4.4.2 Sensitivity Study Results
    4.4.3 Formal Comparison
    4.4.4 Results on Landsat Images
  4.5 Summary and Conclusions
CHAPTER 5. COMPUTATIONAL CONSIDERATIONS
  5.1 Notation
  5.2 Peizer Approximation
  5.3 Hardware Implementation
    5.3.1 An Overview of the Architecture
    5.3.2 Computing Y(i)
  5.4 Comparison
  5.5 Summary and Conclusions
CHAPTER 6. CONCLUSION AND FUTURE RESEARCH
  6.1 Conclusions
  6.2 Future Research
    6.2.1 Extension to Multi-valued Vectors
    6.2.2 Two-sided Tests for Preliminary Analysis
    6.2.3 Multi-class Tree Classifiers
    6.2.4 Multi-stage and Sequential Approach to Template Matching
APPENDIX A. THE THRESHOLDS AT THE BOUNDARIES OF RUNS GIVE HIGHEST S VALUES IN MULTIVALUED VECTOR CASES
APPENDIX B. APPROXIMATION OF Sxy WHEN y≠x
APPENDIX C. RESULTS OF t-TESTS FOR SEC. 4.3
APPENDIX D. SUMMARY OF RESULTS FOR SEC. 4.4
APPENDIX E. AN EXAMPLE OF THE PERMUTATION STATISTIC OF SEC. 3.1
LIST OF REFERENCES

LIST OF TABLES

1.1 Frequencies of Combinations of V1(i) and V2(i)
2.1 Means and Variances of Ĉ and Ŝ under H01 for K=2
2.2 0.05 Thresholds for Ŝ and T'
2.3 Comparing Distributions of Ŝ and T'
2.4 Power Estimates When K=2
2.5 Power Estimates When K=5
2.6 Result of Power Comparison
2.7 Definitions of First 13 Features
2.8 Testing H01 and H02
2.9 Testing H03 and H04
2.10 Critical Levels of Gammas
3.1 Frequencies of Combinations
3.2 Examples of d' Ranges and g Values
3.3 Sample Correlation between S', E(S') and Gamma
3.4 Ordering of Training Patterns by Feature j
3.5 Parameters of Artificial Data Sets
3.6 Tree Design for Data Set 5
3.7 Tree Design for Data Set 2
3.8 Tree Design for Data Set 1
3.9 Tree Design for Data Set 3
3.10 Tree Design for Data Set 4
3.11 Tree Design for Data Set 6
4.1 Effect of (p,p') on Powers
4.2 Effect of (a,b) on Powers
4.3 Effect of Parameter Knowledge on Powers
4.4 Effect of Subtemplate Size on Powers
4.5 Sensitivities when Subtemplate is Balanced
4.6 Effect of (p,p') on Sensitivities
4.7 Effect of (a,b) on Sensitivities
4.8 Effect of Parameter Knowledge on Sensitivities
4.9 Effect of Subtemplate Size on Sensitivities
4.10 Sensitivities when Subtemplate is Balanced
4.11 Power Comparison
4.12 Sensitivity Comparison
4.13 Results on Several Landsat Images
5.1 Observables in Vector Pair
5.2 Time and Circuit Complexity of the Part Computing Y(i) from X(j)
5.3 Computation Time Comparison
5.4 Typical Computation Time Ranges

LIST OF FIGURES

3.1 d and E(S') vs d' when L=28, n1=9 and n2=13
3.2 Fourteen Examples of Category-feature Vector Pairs
3.3 Tree Designed with S'
4.1 An Example Image and Template
5.1 Overall Architecture for Computing H(a)
5.2 Computing X(j)
5.3 Compute Y(i) from X(j) when y=4
5.4 Compute Y(i) from X(j) when y=7
5.5 Compute Y(i) from X(j) when y=13
5.6 Compute Y(i) from X(j) (general design)
5.7 Compute Y(i) from X(j) when y=7 and x=3
5.8 Time Diagram of Figure 5.7

CHAPTER 1

INTRODUCTION

1.1 Summary

This thesis proposes a probabilistic association measure for binary vectors and examines its application in pattern recognition. It is essential to measure association between patterns, between features, between partitions of patterns, or between subimages. There exist many association measures in the pattern recognition literature. Only Goodall's similarity index is directly based on probability. This chapter reviews previous work on probabilistic measures and the permutation hypothesis and proposes a permutation statistic S, which is directly based on permutation unusualness and has a known distribution under randomness.

In Chapter 2, we apply S to the preliminary analysis of binary features for the first time and establish its utility. We define several null hypotheses expressing the relation between features and category. Rejection of these hypotheses motivates feature extraction. We compare S to Pearson correlation for testing these hypotheses. Since S has a known distribution under all null hypotheses, the selection of thresholds for the tests is direct. The selection of the threshold for Pearson correlation demands Monte Carlo simulation.
Both statistics have similar power for detecting Bahadur type alternative hypotheses and both act similarly in a questionnaire data analysis example. This suggests that S be used in the preliminary analysis of binary features.

Chapter 3 applies the permutation statistic S' = |S - 0.5| as an adequacy measure for binary partitions in cluster validity and as a criterion for tree classifier design. Three statistics, Rand, Hubert's Gamma and Fowlkes' B, are compared to S' in the permutation environment. These three statistics are linear functions of each other and thus have the same power against any alternative hypotheses in testing the random permutation null hypothesis. The statistic S' is highly correlated to those statistics when the numbers of 1's in the vectors are about half the vector length. The other three statistics have asymptotic normal distributions under the null hypothesis while S' has a known distribution. That makes S' easier to use for assessing the global fit of a binary partition to category than the other three statistics. The statistic S' is used for the first time in designing a tree classifier, and is compared to a mutual information criterion. Both criteria give similar trees but the threshold for S' is far easier to determine. Several artificial data sets and a real data set are used in the numerical examples. The results demonstrate that S' is a reasonable measure in tree classifier design.

Chapter 4 investigates the image template matching problem. We study a null hypothesis stated in the literature and propose a new alternative hypothesis. Under this model, we propose a statistic R which is derived directly from the likelihood ratio statistic to measure similarity between template and subimages. With an extra independence assumption, this statistic is theoretically optimal, but its computation is very time consuming. We suggest a simplified version of the statistic, G, as a similarity measure and compare the powers of these two statistics, by Monte Carlo means, to that of the commonly used similarity measure, the absolute difference D. The statistic R is most powerful. The G statistic is more powerful than other sub-optimal measures and has the same complexity as the others. More importantly, G is more sensitive to the true location of the object in the image than any of the others. The correlation coefficient C and the permutation statistic S defined in Chapter 1 are also compared to the likelihood statistics and they perform about the same, worse than the derived measures. All these statistics are applied to several Landsat data sets and demonstrate that R is not as sensitive as statistic G. In summary, G strikes the best balance between power and sensitivity.

Since the computation of S is directly based on the hypergeometric distribution, Chapter 5 reviews some commonly used algorithms for computing the hypergeometric c.d.f. Their computation times are compared and a pipeline architecture design is proposed to implement the recursive computation algorithm. The pipeline design is much more efficient than other exact computation methods. It is even faster than the Peizer approximation using a single CPU when the vector size is small or a, the number of (1,1) pairs, is small (a < 932). This pipeline design also solves a more general problem. In implementing a product using pipeline functional units with x segments, we give examples of some individual designs for different y values and one general design for y=2,3,...,15, which are typical pipeline lengths for arithmetic operators.
A computation time and circuit complexity analysis is given.

From the above applications, we conclude that the statistic S is a good similarity measure between features in preliminary analysis of binary features; S is a good similarity measure between feature and category in tree classifier design; and S is also a reasonable adequacy measure for binary partitions. Since S has a known distribution under the null hypotheses, the threshold and significance level are easy to determine. In image template matching, S acts as well as other sub-optimal similarity measures between sub-images. Finally, a hardware architecture design is more efficient than other methods for computing hypergeometric c.d.f.'s, which is essential to the computation of S. The application of S and other proposed measures to the above pattern recognition areas and the methodology for using S are the main contributions of this thesis.

1.2 Background

Pattern Recognition is concerned with the description and classification of objects, which are represented by measurements taken from realizations of physical processes. Pattern recognition includes three distinct steps - Representation, Abstraction, and Generalization.

The input data for pattern recognition are usually represented as a pattern matrix, whose rows are patterns. Each pattern consists of a series of measurements, called features, and describes an object. In classical pattern recognition problems, the category, or pattern class, of each training pattern is known. A subset of the features or some function of this subset is extracted for the representation of the patterns. Based on this subset and category information, a pattern classifier is designed. Feature extraction and classifier design are the Abstraction phase of Pattern Recognition. The Generalization phase of Pattern Recognition involves the evaluation of algorithms and classifiers.

There are two mathematical approaches to solving pattern recognition problems: the decision-theoretic (or discriminant) approach and the syntactic (or structural) approach. In the decision-theoretic approach, each pattern is considered as a vector in a feature space and the recognition is made by partitioning the feature space. In the syntactic approach, each pattern is considered as a composition of components, called sub-patterns or primitives, and the recognition is made by parsing the pattern according to some syntax rules.

Cluster analysis and image processing are closely related to pattern recognition. Cluster analysis explores the data structure with and without category information, which can be considered as the Representation phase of Pattern Recognition. Image processing deals with two-dimensional pictorial patterns. Many basic techniques used for pattern recognition and image processing are very similar in nature.

This thesis proposes an association measure for binary vector pairs based on permutation statistics. Its application in various areas of pattern recognition is compared to other measures of association and both hardware and software computational methods are developed. The particular areas of application studied are preliminary analysis of binary features, validity of binary partitions, design of tree classifiers, and image template matching. Section 1.3 reviews previous work in probabilistic proximity measures and permutation models. Section 1.4 defines the proposed statistic.
1.3 Probabilistic Proximity Measures and Permutation Models

A proximity (or association) measure, either a similarity or a dissimilarity, is a symmetric mapping from VxV to [0,∞), where V is a space of vectors. A similarity measure increases as two vectors become more alike, while a dissimilarity measure increases as two vectors become more unalike. For example, Euclidean distance is a dissimilarity measure and correlation is a similarity measure.

The history of Statistics is replete with association measures [84]. Hundreds of proximity measures have been proposed by researchers in the biological and engineering literature. Comprehensive reviews [1,25,26,27,28,77] and comparative experiments have been reported [21,30,41,71]. This thesis is mainly interested in measures using cumulative probability, not information theory, although some similarity measures based on information theory are also considered to be probabilistic measures [77]. Goodall's similarity index is the best example of this type [22,23,24]. We seek a statistic which can be easily interpreted and is easy to compute.

A population is needed to establish the significance of a probabilistic index of association. The null hypothesis is that all members of the population are equally likely. A permutation population is formed from all permutations of the components of the vectors. The vectors could be patterns, features, subimages or partitions. Permutations have long been used as a randomization procedure to generate populations [44]. Statistics useful in pattern analysis, such as the Mantel statistic [19,33,35,37,53,54] and the B statistic [17], use permutations to generate null hypotheses. These two indices have known asymptotic distributions under this null hypothesis. Formulas for exact means and variances are available but require a significant amount of computation.

The permutation null hypothesis for a vector pair used for much of our work is defined below.

H0: all permutations of one vector in the pair are equally likely.

If this vector pair represents partitions of a pattern set resulting from clustering, then the testing of this hypothesis is a cluster validity analysis, which validates the global fit of a partition. This null hypothesis is reasonable for external criteria, as when prior structure is being compared to the result of a cluster analysis [33]. However, it is irrelevant as an internal criterion [18,42,62] even though it is the only hypothesis for which analytical results are available.

Specific null hypotheses, alternative hypotheses, and appropriate tests are stated in each area of investigation. For some statistics, Monte Carlo simulation is needed to obtain the critical levels, thresholds or power estimates. Usually, we set the size of the Monte Carlo simulation to 100 [11]. In some permutation situations, we set the size of the Monte Carlo simulation to 1000 [33].

1.4 Proposed Probabilistic Indices for Binary Vector Pairs

If a statistic measuring association between data items has a known distribution under some H0, its tail probability (the probability that this statistic is less than or equal to a threshold) is called a p-value, and is a probabilistic index of association. This section proposes the p-value of the number of matches between two binary vectors as such an index. This index has a known distribution under H0.

Consider a pair of binary vectors, V1 and V2, of size L, whose entries are either 0 or 1. The number of 1's in Vi is ni. The set of all possible permutations of V2 forms a sample space.
Table 1.1 defines the observables for these two vectors. For example, a' is the number of positions in which both vectors are 0 and n1 is the number of 1's in V1.

Table 1.1 Frequencies of Combinations of Two Binary Vectors

                  V2
              0         1
        +----------+----------+
     0  |    a'    |    b'    |  L-n1
 V1     +----------+----------+
     1  |    c'    |    d'    |  n1
        +----------+----------+
            L-n2       n2        L

The random variable D' denotes the number of positions in which the two vectors are both 1. For example, if V1=(100110) and V2=(110010) then D' = d' = 2. The following statistic measures the degree of association between two vectors.

S(1,2) = Pr(D' < d') + Pr(D' = d') U

where U is an independent random variable with uniform distribution over [0,1]. Probabilities are computed under H0. Under H0, S(V1,V2) is distributed uniformly over [0,1]. It has a clear interpretation as permutation unusualness, i.e., the probability that a vector pair has at most that many (1,1) matches. The explicit formula for the statistic, in terms of the hypergeometric probability function h used in Sec. 2.1, is as follows:

S(1,2) = [ Σ h(L,n1,n2,y) ] + h(L,n1,n2,d') U

where the sum is for y from max(0, n1+n2-L) to d'-1. This statistic is used as a similarity measure between features in preliminary feature analysis (Chapter 2), as a similarity measure between partitions in measuring partitional adequacy (Chapter 3), and as a similarity measure between subimages in image template matching (Chapter 4). The null hypotheses are rejected for extreme values of this statistic.

CHAPTER 2

PRELIMINARY ANALYSIS OF DICHOTOMOUS FEATURES

Feature selection in pattern recognition chooses a subset of features which best reflects a-priori category information in some manner [78] and which are "independent" in some sense. Feature selection algorithms select or extract features one at a time and stop when a predefined threshold has been exceeded. For example, the percent variance retained in the eigenvector method of feature extraction [78] and the recognition rate in the sequential forward method of feature selection [86] are thresholds which the user must specify beforehand. Few theoretical guidelines exist for selecting these thresholds, especially when the relationships among the variables are completely arbitrary.

This chapter proposes a methodology for examining a data set before feature extraction to alleviate the danger of interpreting the results of feature extraction inappropriately. For example, we ask whether or not binary (dichotomous) features significantly relate to each other, and whether or not some features significantly relate to category. If no significant relations exist, it would appear useless or misleading to proceed with feature extraction and discriminant analysis. The motivation for this preliminary analysis is much the same as for clustering tendency [13], where the pattern set is examined for randomness.

In this chapter, we pose several questions about dichotomous features in a statistical hypothesis testing framework. We define hypotheses and examine the application of the probabilistic similarity measure S, defined in Chapter 1, for testing those hypotheses, in comparison with statistics based on Pearson correlation. An important application of the proposed methodology is to questionnaire data, which come as an NxK pattern matrix containing only 0's and 1's. Rows denote patterns and columns denote features. Some of the features may be reserved as category information.

In Sec. 2.1, we examine the entire pattern matrix in terms of statistical hypotheses, and define two statistics for establishing the existence of relationships between columns.
One is based on the correlation coefficient and the other is based on the similarity S. We discuss their distributions and tests of hypotheses. Section 2.2 asks if any feature is related to category. We briefly discuss the distributions of two statistics. Sec. 2.3 describes approximations to the distributions of the statistics defined in Sec. 2.1 and 2.2, based on the similarity S. Section 2.4 reports a power comparison of these statistics. The methodology is demonstrated on a data set derived from medical questionnaires in Sec. 2.5. Finally, Sec. 2.6 summarizes the results in this chapter. The main contribution of this chapter is the methodology for the preliminary analysis of binary features, which is summarized in Sec. 2.6.

2.1 Are Any Two Features Unusually Similar to One Another?

In this section, we consider the first question in the preliminary analysis of dichotomous features stated below. Two null hypotheses and their tests will be defined. The distributions of test statistics will be discussed in some detail.

Consider an NxK binary pattern matrix in which columns can be considered as categorical or measured features. This first question searches for significant similarity between any pair of features, including those used later as categories. Two reasonable hypotheses for describing a state of "no relationship" are defined below. In all cases, we regard the rows of the pattern matrix as samples of independent random variables.

H01: The patterns are samples of K independent Bernoulli random variables with parameters {pi}.

H02: The pattern matrix is chosen randomly from population P2.

Population P2 is formed by independently permuting each column in the pattern matrix. There are (N!)^K matrices in P2. If Ni denotes the number of 1's in column i, then the number of distinct matrices in P2, each occurring the same number of times, is the product over all columns i of the binomial coefficients C(N,Ni). If a matrix can be realized under both hypotheses, its probability is lower under H01 than under H02.

An ideal statistic for testing H01 or H02 satisfies the following criteria:

1. It has reasonable power against certain alternatives.
2. Its value is easy to compute.
3. Its distribution is known analytically, or at least can be approximated analytically.
4. The analytical form is simple.
5. Its distribution is independent of the parameters in the problem.

Several measures of correspondence between dichotomous variables have been suggested in the literature [1,30,77]. In this chapter, we examine the application of the similarity S to test H01 and H02. We define a statistic Ŝ directly based on S below. We also define a statistic Ĉ based on the commonly used Pearson correlation coefficient for purposes of comparison.

Ĉ = max{c(i,j)}          Ŝ = max{S(i,j)}

Here c(i,j) is the sample Pearson correlation between columns i and j, and S(i,j) is defined in Chapter 1. Let Ni be the number of 1's in column i and let Nij be the number of rows in which columns i and j are both 1.

c(i,j) = (N Nij - Ni Nj) / [Ni Nj (N-Ni)(N-Nj)]^(1/2)

In our case, S(i,j) takes the following form:

S(i,j) = [ Σ h(N,Ni,Nj,y) ] + h(N,Ni,Nj,Nij) U

where the sum is for y from max(0, Ni+Nj-N) to Nij-1, U is a (continuous) uniform random variable on the unit interval, independent of all other random variables, and h is the density function for a hypergeometric distribution, defined below:

h(N,K,L,y) = C(K,y) C(N-K,L-y) / C(N,L).

Note that since h(N,K,L,y) = h(N,L,K,y), including 0-0 matches as well as 1-1 matches does not alter the nature of the statistic.
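A minimal sketch of these definitions follows. It is not the thesis software: the function names are mine, and SciPy's hypergeometric distribution is assumed to supply h.

```python
import numpy as np
from itertools import combinations
from scipy.stats import hypergeom

def pair_statistics(col_i, col_j, rng):
    """c(i,j) and S(i,j) for two binary columns of length N."""
    N = col_i.size
    Ni, Nj = int(col_i.sum()), int(col_j.sum())
    Nij = int(np.sum(col_i * col_j))                  # rows in which both columns are 1
    denom = np.sqrt(Ni * Nj * (N - Ni) * (N - Nj))
    c = (N * Nij - Ni * Nj) / denom if denom > 0 else 0.0
    dist = hypergeom(N, Ni, Nj)                       # h(N, Ni, Nj, .) under H0
    s = dist.cdf(Nij - 1) + dist.pmf(Nij) * rng.uniform()
    return c, s

def max_statistics(X, rng=None):
    """C-hat and S-hat: maxima of c(i,j) and S(i,j) over all column pairs of X."""
    rng = rng or np.random.default_rng()
    pairs = [pair_statistics(X[:, i], X[:, j], rng)
             for i, j in combinations(range(X.shape[1]), 2)]
    return max(c for c, _ in pairs), max(s for _, s in pairs)
```

Since each S(i,j) is uniform on [0,1] under H02, the value of Ŝ computed this way can be referred directly to the approximation of Sec. 2.3.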
Large positive values of c(i,j) and S(i,j) indicate a positive relation between features i and j. With the questionnaire type applications in mind, we only consider positive relations because the coding in questionnaire data is usually fixed and (1,1) matches have stronger meaning than (0,0) matches.

Under H02, Nij has a hypergeometric distribution and c(i,j) is a linear function of Nij. The distribution of c(i,j) has a rather complicated form, so the analytical forms for the distributions of Ĉ under either H01 or H02 are also complicated. The threshold of this distribution is usually estimated by Monte Carlo means. The distribution of S(i,j) under H02 is clearly uniform. The fact that S(i,j) is uniformly distributed over [0,1] under H01 follows from the following equality. Write S(1,2) as an explicit function of the random variables N1, N2, N12 and U. The sum is over all allowable values of (z1,z2).

Pr(S(N1,N2,N12,U) ≤ y | H01)
  = Σ Pr(S(z1,z2,N12,U) ≤ y | H02) Pr(N1=z1, N2=z2 | H01)
  = y Σ Pr(N1=z1, N2=z2 | H01)
  = y

Based on the fact that S(i,j) has a uniform distribution under H01 and H02, Sec. 2.3 derives a simple approximation to the distribution of Ŝ under H02 for K>2, which eliminates Monte Carlo work when choosing thresholds for tests of hypothesis.

A test of H01 or H02 has the following form, where T is either Ĉ or Ŝ.

Reject H0 if T > t

where the threshold t is computed as P(T > t | H0) = α and α is a specified level, such as 0.05. Alternatively, we compute the critical level α*,

P(T > t* | H0) = α*,

where t* is the observed value of the statistic, and reject the null hypothesis if α* is small, say 0.1 or less.

The distribution of Ĉ does not have a simple analytical form, and depends on the parameters {pi} under H01 and on the original matrix under H02. The distribution of S(i,j) has a simple analytical form, regardless of parameters, under both H01 and H02. An approximation for the distribution of Ŝ will be developed. Therefore, according to Criteria 2 through 5, Ŝ is preferred over Ĉ. The comparison according to Criterion 1 is discussed in Sec. 2.4.

2.2 Is Any Feature Unusually Closely Related to Category?

Here we designate one feature, say feature M, as categorical and ask whether any one of the remaining features is unusually similar to feature M. Two null hypotheses of interest are defined below.

H03: All N! permutations of feature M are equally likely.

H04: All matrices in population P4 are equally likely.

Population P4 is created in the same way as population P2 except that feature M is not permuted. Two test statistics based on our indices of proximity are:

CM = max{c(i,M), i≠M}          SM = max{S(i,M), i≠M}

The forms of tests of these hypotheses are as in Sec. 2.1. A test of H03 asks whether it is likely that the observed statistic could have been produced were the category labels inserted at random. Accepting H03 implies that no feature is unusually "close" to the categorical feature. The distribution of CM under H03 must be obtained by Monte Carlo means, but a procedure discussed in Sec. 2.3.3 can approximate the distribution of SM under H03. All columns are permuted independently, so the distributions of CM and SM are known under H04.

P(SM > t | H04) = 1 - t^(K-1)

P(CM > t | H04) = 1 - Π P[c(i,M) ≤ t | H04],   product over i≠M,

where P[c(i,M) = ti | H04] = h(N,Ni,NM,ti) and ti = [Ni NM + t [Ni NM (N-Ni)(N-NM)]^(1/2)] / N.

Under H03, the distribution of SM can be approximated analytically, while the distribution of CM requires Monte Carlo simulation.
Under H04, both statistics have known distributions, and that of SM has a particularly simple form. This fact suggests that SM is better than CM under Criteria 3 through 5.

2.3 Approximating the Distributions of Ŝ and SM

We first discuss the distribution of Ŝ when K=2. Then we view Ŝ as the maximum of dependent uniform random variables and try to approximate its distribution. The distribution of SM is then considered.

2.3.1 Null Distribution of Ŝ when K=2

Assume K=2, so that Ŝ is simply the similarity measure between two binary vectors; Ŝ is uniform under H01 regardless of N, p1 and p2. Table 2.1 shows empirical means and variances for Ĉ and Ŝ. Each row in Table 2.1 corresponds to a parameter set (N, M, p1, p2). One hundred Monte Carlo samples were used for each row.

Table 2.1 Means and Variances of Ĉ and Ŝ under H01 for K=2

   N    p1   p2    mean(Ĉ)    var(Ĉ)    mean(Ŝ)   var(Ŝ)
   20  .10  .10   -.02808    .03129    .51732    .08919
   20  .10  .50   -.01565    .04664    .53185    .08632
   20  .10  .90    .01468    .04015    .48699    .08465
   20  .50  .50    .02982    .05273    .48520    .07934
   20  .50  .90    .00577    .03625    .54219    .06516
   20  .90  .90    .01331    .03786    .54297    .08038
  200  .10  .10    .00578    .04061    .45545    .07565
  200  .10  .50    .05577    .08443    .47903    .07233
  200  .10  .90    .00137    .04455    .49546    .09711
  200  .50  .50   -.07583    .09182    .47418    .08335
  200  .50  .90   -.05822    .23322    .50266    .08883
  200  .90  .90    .00896    .04410    .52179    .07967

The theoretical mean and variance of Ŝ under H01 are 0.5 and 0.0833. The empirical means and variances of Ŝ in Table 2.1 are stable and close to the theoretical values, independent of N, K, p1 and p2. The distribution of Ĉ varies with these parameters. This fact favors Ŝ over Ĉ with respect to Criterion 5.

2.3.2 Approximating the Null Distribution of Ŝ when K>2

The statistic Ŝ is a maximum of k=K(K-1)/2 dependent U(0,1) random variables under H01. A one-sided test of H01 is

Reject H01 if Ŝ > t+.

The dependences among {S(i,j)} require that t+ be estimated from an approximate null distribution for Ŝ. Since the distribution of the maximum of k independent random variables is well known, we consider the possibility of approximating t+ from this distribution. Let T' have this distribution, whose c.d.f. is

Pr(T' ≤ t' | H01) = (t')^k   if 0 ≤ t' ≤ 1.

Our results suggest that the distribution of T' approximates the distribution of Ŝ. This point is also demonstrated in the application described in Sec. 2.5.

2.4 Power Comparison of Ŝ and Ĉ

The purpose of using Ĉ and Ŝ is to detect relations among features. We now examine the powers of these statistics under a class of alternatives where the relations among features are governed by a Bahadur model [4], in which the feature relations are specified in terms of the correlation coefficient. In this section, we consider the following alternative hypothesis.

H11: The given matrix is generated with Bahadur parameter set (p1,...,pK and w) with w>0; pi is the Bernoulli parameter for column i.

Power studies are conducted for the case when pi under H01 is the same as the pi under H11 for any i, so that the hypotheses differ only in the relation among features. The test of H01 relative to H11 based on Ŝ is:

Reject H01 if Ŝ > ts.

The test relative to H11 based on Ĉ is:

Reject H01 if Ĉ > tc.

The general procedure for estimating the power of the test based on Ĉ is as follows [16]. One hundred matrices are generated under H01 with the given p's and the sixth largest value is used as the 0.05-level threshold tc. Then, 100 matrices are generated under H11 with given w ≠ 0. Our study involves the four factors listed below, with the levels for each factor indicated.

(p1,p2) = (.2,.2), (.2,.5), (.2,.8), (.5,.5), (.5,.8), (.8,.8); pi = 0.5 for 3 ≤ i ≤ K;
w = 0.1, 0.2, 0.4;
N = 20, 60, 200;
K = 2, 5.

All combinations of levels are used except the impossible case (p1, p2, w) = (.2,.8,.4). The power of the test based on Ĉ is estimated by the number of matrices generated under H11 with Ĉ value greater than tc. The power of the test based on Ŝ is estimated similarly, except that the threshold ts is approximated analytically (Sec. 2.3.2).
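A minimal sketch of this analytical threshold follows; the function name is mine, not a thesis routine. Treating Ŝ as the maximum of k = K(K-1)/2 independent U(0,1) variables (and SM as the maximum of K-1 of them, as in the H04 form above), the level-α threshold solves t^k = 1-α.

```python
# Approximate rejection thresholds from the maximum-of-uniforms c.d.f. Pr(T' <= t) = t**k.
def approx_threshold(K, alpha=0.05, categorical=False):
    """Level-alpha threshold for S-hat (all pairs) or S_M (categorical=True)."""
    k = (K - 1) if categorical else K * (K - 1) // 2
    return (1.0 - alpha) ** (1.0 / k)

# Example: the 26-feature questionnaire data of Sec. 2.5
t_pairs = approx_threshold(26)                    # threshold for testing H01/H02 with S-hat
t_cat = approx_threshold(26, categorical=True)    # threshold for testing H03/H04 with S_M
```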
(p1,p2) = (.2,.2),(.2,.5),(.2,.8),(.5,.5),(.5,.8),(.8,.8): pi = 0.5 for 3 g i 3 K; w = 0.1, 0.2, 0.4; ' N a 20, 60, 200; K = 2, 5. All combinations of levels are used except the impossible case (pl, p2, w )=(.2,.8,.4). The power of the test based on S is estimated by the number of matrices generated under H11 with S value greater than to' The power of the test based on S is estimated similarily, except that the 30 threshold ts is approximated analytically (Sec. 2.3.2). For each combination of parameters, we have two integers, the estimated power (P"(g)) of the correlation statistic and the estimated power (P"(S)) of S. Let P(S) and P(S) denote the true powers. Then p"(g) has a binomial distribution B(100, P(§)) and P"(§) has a 8(100, P(S)) distribution. We report P”(Q) and P”(S) in Table 2.4 for K82 and in Table 2.5 for K=5. The symbol "+" at the upper right corner indicates that P(§) > P(Q) at confidence level 0.95. The symbol "-" means P(§) < 9(9) at confidence level 0.95. 31 Table 2.4 Power Estimates when K=2 (p1,p2)=(.2,.2) (.2,.5) <.2,.a> (.5,.5) (.5,.a> <.s,.a) W I N I10 I I 12I 13 11 l I 7 I 8 +-------+-----——+-------+-------+-----—-+-------+ I 12 I I 1 0 20, I17 I I 25I 26 E 21 I I +-------+----—--+—------+-------+---—---+-------+ I 30 l I 21 :25 I I 2 0 20, 41 I57 I 51: 56 65 54 48 5 +-------+-------+-------+-------+-------+-------+ 5 46 42 0.4 I 20, I 14 I I 22 22 I 15E 15 I' 35 I I 43 I 49 46 I 45E 38 89 I 86 I I 95 93 I 94E :94 I I I 93 I +--—----+—-----—+———----+----—--+-------+------—+ E 30 I I 36 -E 55 I 36 I I +-------+------—+-—-----+--—----+-------+—------+ :54 I I 0 200, 81 100 {100 +: 74 84 E 100 {100 100 I {100 5 100 9 {100 I I 91 E 100 96 I {100 0.2 0.4 200, 200, +---------+ +---------+ 32 Table 2.5 Power Estimates when K=5 ( p1, p2)=(.2,.2) (.2,.5) (.2,.8) (.5,.5) (.5,.8) (.8,.8) +----—--+-------+-------+-------+-----—-+-------+ W N. 20, 7 I 8 I I 9 4 9 0.1 IIII + IIII 7 8 + l l 4 1 l 9 + 9 3 3 l 0 l 5 l 3 l . . . _ . . . + . . . . . . . + . . . . . . . + _ . . _ . . _ + . . . . . _ . ... . . . . . . +E 12 E 4 8 + 5 : 5 6 I I 7 I 6 22 13 9 12 IIII + IIII + IIII 2 0 ' 0 2 4. I 0 ' 0 2 l 0 0 6 III! + III... + IIII + . . . _ . . . + _ . . _ . . . + . . . . . . . + . . . . . . . + . . _ . . . . + . . . . . . Illl + I!!! . 5 4 0 6 2 200, . _ . . . . + . . . . . . + . 0.. 0.. . nu . 0 . 1. . IIII+ . nu. nu. 1.. nu . nu . 1. . IIII+ . nu. nu. 1.. nu . nu . 1. . IIII+ . . . . . _ . IIII+ . nu. nu. 1.. nu . nu . 1. . IIII+ . nu. nu. 1.. nu . nu . 1. . + 4 O o 0 0 2 +---------+ +---------+ Each 2.6. Table 2.4 and 2.5 are summarized into Table box in Table 2.6 corresponds to a combination of K, N and w. the is upper left corner the number at the In each box, of p' combinations (columns in Table 2.4 or 2.5) for number 33 which P(§) > P(S) (minus signs) at level 0.95. The number at the lower right corner is the number of cases for which P(Q) < P(S) (plus signs). 
Table 2.6 Result of Power Comparison

         |           K = 2           |           K = 5           |
         |  N=20  |  N=50  |  N=200  |  N=20  |  N=50  |  N=200  |
  -------+--------+--------+---------+--------+--------+---------+
   w=0.1 |  0  1  |  1  0  |  2  1   |  0  1  |  0  3  |  0  0   |
   w=0.2 |  0  0  |  1  0  |  0  1   |  0  2  |  0  2  |  2  0   |
   w=0.4 |  2  0  |  0  0  |  0  0   |  1  0  |  0  1  |  0  0   |

Table 2.6 indicates that, in 13 of the combinations of (K,N,w), P(Ĉ) and P(Ŝ) are significantly different for some (p1,p2) values. Seven combinations of (K,N,w) contain more cases for which P(Ŝ) > P(Ĉ), while in six combinations the reverse is true. We conclude that Ŝ is as powerful as Ĉ under the conditions of this experiment, even though the structure of the data under H11 was specified in terms of correlation. Although the powers of SM and CM for testing H03 and H04 were not compared experimentally, we expect them to be related in the same way.

The analytical approximations to the distributions of Ŝ and SM are independent of parameters. In addition, they have the same power as Ĉ and CM against certain alternatives. We conclude that Ŝ and SM are preferred over Ĉ and CM.

2.5 An Application of Preliminary Feature Analysis

As a detailed application of our methodology, we studied a set of questionnaire responses completed over the past several years by female patients of a medical doctor regarding self breast examination. We randomly chose 145 questionnaires for this study, no two from the same individual. We then reduced the data to 26 features. Table 2.7 defines the meaning of a "1" value for the first 13 features. A "1" value for features 14 through 26 denotes the occurrence of cancer in a relative, such as a father, mother, or aunt.

Table 2.7 Definitions of First 13 Features

 Feature   Meaning of "1" Value
    1      Menstrual period has stopped
    2      Operative menopause
    3      Pregnant at least once
    4      At least one miscarriage
    5      Have used female hormones
    6      Currently use female hormones
    7      Positive result on pap smear
    8      Perform semi-annual self breast exam
    9      Perform monthly self breast exam
   10      Have had mammograms
   11      Use contraceptive techniques
   12      Have had pelvic surgery
   13      Have had breast surgery

The ultimate goal was to discover any factors that explained why self breast examination was performed by some women but not by others. Features 8 and 9 in Table 2.7 will be taken as category variables. This is a typical feature extraction problem in which a subset of features that predicts category well is to be found. Some clustering algorithms were used to cluster features [40] and the resulting dendrograms suggested random clusters. A preliminary feature analysis should indicate whether any non-random relation exists among features. Our experiments proceeded as follows.

1. Test H01 using Ĉ and Ŝ.
2. Test H02 using Ŝ.
3. Test H03 and H04 using CM and SM for two categorical features (a sketch of this step follows the list).
4. Verify the results by a Mantel test [53] on the original pattern matrix.
5. Perform feature extraction for any non-random categories.
6. Verify the results by Mantel tests on selected features.
7. Perform feature extraction for any categorical feature for which either H03 or H04 is accepted.
8. Verify the results by a Mantel test on selected features.

The details are explained in the following sections.
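The following sketch illustrates step 3 for one categorical feature. It is my illustration rather than thesis code, with SciPy's hypergeometric distribution standing in for h; under H04 the critical level of SM has the exact closed form 1 - s^(K-1) used below.

```python
# Sketch of testing H04 with S_M: the largest similarity S(i,M) between the
# categorical column M and each remaining column, and its exact critical level.
import numpy as np
from scipy.stats import hypergeom

def s_M_critical_level(X, M, rng=None):
    """X: N-by-K binary pattern matrix; M: index of the categorical column."""
    rng = rng or np.random.default_rng()
    N, K = X.shape
    NM = int(X[:, M].sum())
    s_values = []
    for i in range(K):
        if i == M:
            continue
        Ni = int(X[:, i].sum())
        NiM = int(np.sum(X[:, i] * X[:, M]))          # rows where both columns are 1
        dist = hypergeom(N, Ni, NM)
        s_values.append(dist.cdf(NiM - 1) + dist.pmf(NiM) * rng.uniform())
    s_M = max(s_values)
    return s_M, 1.0 - s_M ** (K - 1)                  # observed S_M and its H04 critical level
```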
2.5.1 Tests for Significant Relations between Features

First, we test H01 and H02 using the statistics Ĉ and Ŝ. Critical levels for Ĉ were estimated from 1000 Monte Carlo trials. Random matrices were generated under H01 using Ni/N from the original pattern matrix to estimate pi, and Ĉ was computed for each matrix. The critical level for Ŝ was approximated by t' as explained in Sec. 2.3 and also estimated from 1000 Monte Carlo trials. Table 2.8 summarizes the results. The approximate critical level for Ŝ using t' is the same as the Monte Carlo result in this case. We reject H01 and H02 and conclude that the questionnaire data merit further study.

Table 2.8 Testing H01 and H02

 Hypothesis   Stat.   Value     Cr. Level   Method
    H01        Ĉ      0.4528     0.001      1000 Monte Carlo Trials
    H02        Ĉ      0.4528     0.002      1000 Monte Carlo Trials
    H02        Ŝ      1.0000     0.00016    Approximation or 1000 MC Trials

We feel justified to proceed with testing H03 and H04. We designate features 8 and 9 as categories separately, and we ask whether any one of the remaining features is unusually similar to these categories, based on the statistics CM and SM. The critical level of CM under H03 was estimated from 1000 Monte Carlo trials, while the critical level of SM was obtained by both Monte Carlo means and analytical approximation. The results reported in Table 2.9 indicate that some feature is close to category feature 9 but no feature is unusually similar to category feature 8. Testing H04 produces the same result as testing H03, so only feature 9 is used as a category in further study. Test results using the two statistics agree with each other. The approximate thresholds for S8 and S9 produce the same results as those derived by Monte Carlo means. These results support conclusions drawn earlier.

Table 2.9 Testing H03 and H04

 Hypothesis   Categorical feature   Statistic   Value     Cr. Level   Method
    H03               8                C8       0.1172      0.835     1000 MC Trials
    H03               9                C9       0.2463      0.034     1000 MC Trials
    H03               8                S8       0.9156      0.880     Approximation
    H03               8                S8       0.9156      0.824     1000 MC Trials
    H03               9                S9       0.9984      0.039     Approximation
    H03               9                S9       0.9984      0.025     1000 MC Trials
    H04               8                C8       0.1172      0.379     Exact
    H04               8                S8       0.9156      0.379     Exact
    H04               9                C9       0.2463      0.005     Exact
    H04               9                S9       0.9984      0.005     Exact

2.5.2 Verification on Original Pattern Matrix

We now ask whether the grouping of patterns defined by a particular categorical feature could have been defined in a purely random manner. We measure the similarity between patterns i and j by the Jaccard coefficient [1], J(i,j). The reason for using the Jaccard coefficient is that (1,1) matches are more important than (0,0) matches in questionnaire data. If n11 is the number of (1,1) matches in rows i and j of the pattern matrix and n00 is the number of (0,0) matches, then J(i,j) = n11/(M-1-n00). A category matrix is defined by:

B(i,j) = 1 if rows i and j have the same value for category feature M, and B(i,j) = 0 otherwise.

The test statistic, denoted by Gamma, is the point serial correlation between the matrices [J(i,j)] and [B(i,j)]. The baseline is defined by H03. A test of hypothesis based on Gamma is called a Mantel test [53,54] in other applications and tests whether the increases and decreases observed in the two matrices are unusually similar. The distribution of Gamma was estimated by Monte Carlo means and the critical level was found to be 0.049 for the questionnaire data, using feature 9 as category. This suggests that H03 be rejected, which implies that the set of features is significantly related to category feature 9.
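A minimal sketch of this verification step follows. It is not the thesis program: the function names are mine, the matrix X is assumed to exclude the categorical column, and the critical level is estimated by permuting the category labels as described above.

```python
import numpy as np

def jaccard_matrix(X):
    """Jaccard similarities J(i,j) between the rows of the binary matrix X
    (X excludes the categorical column, so its width plays the role of M-1)."""
    n11 = X @ X.T
    n00 = (1 - X) @ (1 - X).T
    denom = X.shape[1] - n00
    return np.where(denom > 0, n11 / np.maximum(denom, 1), 0.0)

def mantel_gamma(X, category, n_perm=1000, rng=None):
    """Gamma (point serial correlation of [J] and [B]) and its Monte Carlo critical level."""
    rng = rng or np.random.default_rng()
    iu = np.triu_indices(len(category), k=1)           # each pattern pair i < j once
    J = jaccard_matrix(X)[iu]
    def gamma(labels):
        B = (labels[:, None] == labels[None, :]).astype(float)[iu]
        return np.corrcoef(J, B)[0, 1]
    observed = gamma(np.asarray(category))
    null = [gamma(rng.permutation(category)) for _ in range(n_perm)]
    return observed, float(np.mean([g >= observed for g in null]))
```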
2.5.3 Feature Extraction Using Feature 9 as Categorical

Previous tests suggest that individual features and the entire matrix are unusually similar to categorical feature 9. Thus, we feel justified in seeking a subset of features which "explains" categorical feature 9 and which can lead to an efficient design for future questionnaires. We chose a sequential forward stepwise selection method for study [86]. The best individual feature is chosen first. The best pair containing the best individual feature is then found, and the best triple containing the best pair is identified. This process is continued until a suitable level of performance is observed. Although only an exhaustive search of all subsets ensures optimality with binary features [8,9,14,15], the stepwise procedure is computationally attractive and is recommended in the literature [58,86]. The criterion used to compare two subsets of features is the recognition rate under the leave-one-out method.

The five features selected, in order of selection, were features 10, 4, 6, 7, and 16. Feature 16 records whether a brother had cancer. To verify these results, we repeated the Mantel test described above using only the five features selected from the questionnaire data. The critical level of the Gamma statistic was 0.003, which suggests that the patterns cluster by category unusually well and that the five features reflect category better than the set of all features.

2.5.4 Feature Extraction Using Feature 8 as Categorical

We carried out feature extraction and a Mantel test using feature 8 as category. The critical level of Gamma for the entire matrix is 0.653, which suggests accepting H03. Feature 1 had recognition rate 0.97, no lower than any other feature or any feature subset selected by our stepwise procedure. The critical level of Gamma using Feature 1 alone is 0.093. These results are summarized in Table 2.10.

Table 2.10 Critical Levels of Gammas

 category   use all features   use selected features
    8            0.653                0.093
    9            0.049                0.003

Table 2.10 indicates that feature 8 is not a valid categorical feature. This might be caused by improper design of the questionnaires. Its high recognition rate is due to the fact that there are only four 1's in column 8. Blindly performing feature selection and trusting in the recognition rate alone is misleading.

2.6 Summary and Conclusions

This chapter discussed the problem of preliminary analysis of dichotomous features. Four null hypotheses are stated to describe randomness in feature relationships. The necessity of this preliminary analysis is examined for the first time and is demonstrated with questionnaire data, where much irrelevant information is usually involved. Rejecting these null hypotheses gives us confidence in performing feature extraction. Accepting one of these null hypotheses saves us from useless work and misleading results in feature extraction. This point is clear in the application, where preliminary analysis shows that feature 8 is not a valid categorical feature, although feature extraction could give a high recognition rate.

In testing H01 to H03, statistics based on the correlation coefficient require Monte Carlo simulation to estimate thresholds. The statistics based on the similarity S can be approximated analytically. Under H04, both statistics, CM and SM, have known distributions, and that of SM has a particularly simple form.
For detecting Bahadur type relations between features, Ŝ and SM have reasonable power. These facts suggest that Ŝ and SM are better than Ĉ and CM in preliminary feature analysis.

The Mantel test can also prevent us from accepting a misleading result, but estimating the critical level of Gamma requires either complicated asymptotic normal approximations [33,34] or Monte Carlo simulation. The test of H03 or H04 using SM does not require Monte Carlo work and gives the same conclusion. This fact suggests that SM is more suitable than Gamma for exploratory type analysis, which requires speed and simplicity.

Tests of H01 and H02 using the two types of statistics give similar results for the questionnaire data; tests of H03 and H04 also provide similar results. Since the thresholds of Ŝ and SM are easier to compute in the cases of H02 and H04 than those for Ĉ and CM, we may choose to limit ourselves to testing H02 and H04. In this case, the following simplified methodology is proposed.

(1) Test H02 on the original pattern matrix using Ŝ.
(2) If H02 is accepted, stop and try to gather more data.
(3) If H02 is rejected, choose categorical feature M.
(4) Test H04 on the pattern matrix with categorical feature M eliminated, using SM.
(5) If H04 is accepted, stop and try other categorical features or gather more data.
(6) If H04 is rejected, go on with feature extraction.

CHAPTER 3

ADEQUACY OF BINARY PARTITIONS

This chapter studies procedures to assess the significance of a binary partition of a pattern set. For example, in cluster analysis one must verify structures resulting from clustering methods. Verification procedures assess the global fit of a hierarchy, the global fit of a partition, and the isolation and compactness of individual clusters. A sequence of partition fits, i.e., a sequence of partitional adequacy measures, can provide a basis for assessing the global fit of a hierarchy [17].

We examine the application of the S statistic defined in Chapter 1 to two problems. One is the use of S as an external measure for binary partitions in cluster analysis. The second application is the design of a binary tree pattern classifier, where S measures the correspondence of a feature vector to category. Section 3.1 compares S, under the permutation hypothesis defined in Chapter 1, to three other commonly used measures in a cluster validity framework. The relations among them are derived and their powers and computational costs are compared. Section 3.2 uses S in binary tree classifier design in comparison with the mutual information criterion [74]. A brief conclusion is given in Sec. 3.3.

3.1 Four External Measures of Association for Cluster Validity

A clustering algorithm begins with a measure of proximity between all pairs of objects in a set and induces a clustering, or partition, of the objects in which "dissimilar" objects are placed in separate clusters. Clustering algorithms will impose partitions even on totally random data, so a clustering must be validated to establish its "adequacy", or its degree of unusualness, when compared to a hypothesis of randomness [13]. Two types of measures of partitional adequacy are used in cluster analysis: external and internal [33,56]. An external criterion validates a partition with respect to a source of information that is independent from that used to form this partition. Permutation statistics provide clear external criteria of partitional adequacy.
For example, if distances between patterns were used to establish a partition, then category labels can be the second source, and the partitional adequacy measure answers the question: Do the patterns cluster by category? A prior conjecture, the result of another clustering algorithm, or some independent proximities could also serve to define the second partition. An internal criterion assesses the fit of a partition using only the proximity data from which the partition was generated.

Several external criteria for partitional adequacy have been proposed [13,56]. The problem of partitional adequacy has been formulated under two hypotheses of randomness. The Random Graph Hypothesis [3] assumes that all possible rank order proximity matrices are equally likely. The permutation hypothesis [31] assumes that only those proximity matrices corresponding to a relabeling of patterns are equally likely. The permutation hypothesis provides a good baseline for assessing statistical significance [32,36,37]. In this section, we adopt the permutation hypothesis as the null hypothesis and compare S to three related measures, the Rand, Gamma and B statistics, in terms of power and computation time. Section 3.1.1 gives the definitions of the four statistics and Sec. 3.1.2 discusses the relation among them. Our goal is to examine the effectiveness of the statistic S' = |S - 0.5| in measuring the adequacy of binary partitions.

3.1.1 Definitions

Binary partitions are the basis for hierarchical clustering [1] and for binary tree classifier design. If we consider a hierarchy resulting from some hierarchical clustering method as a sequence of binary splits, then a binary partitional adequacy measure can be used to validate one split at a time and indicate from which point the splitting is random. In this section, we study four measures for the adequacy of binary partitions. These statistics assess the unusualness of a particular pair of partitions by testing the permutation hypothesis H0 (Chapter 1). All have unimodal null distributions, so a threshold can be selected under the null distribution for defining the significance of association.

Consider two independent binary partitions of a set of L patterns. Each pattern in each partition is coded as 0 or 1. The coding is presented as two binary L-vectors, V1=[V1(i)] and V2=[V2(i)]. Indicator matrices X and Y are defined as follows.

X(i,j) = 1 if V1(i) = V1(j), and X(i,j) = 0 otherwise;
Y(i,j) = 1 if V2(i) = V2(j), and Y(i,j) = 0 otherwise.

The four cells in Table 3.1 contain all the information about the similarity between X and Y; K = L(L-1)/2.

Table 3.1 Frequencies of Combinations

                  Y
              0        1
        +---------+---------+
     0  |    a    |    b    |  K-N1
 X      +---------+---------+
     1  |    c    |    d    |  N1
        +---------+---------+
            K-N2      N2       K

For example, a is the number of pattern pairs (i,j), i < j, for which X(i,j) = Y(i,j) = 0.
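A minimal sketch of these quantities follows (my code, not the thesis software): it builds the indicator matrices X and Y from the two partition vectors and tallies the four cells of Table 3.1 over the K = L(L-1)/2 pattern pairs.

```python
import numpy as np

def partition_pair_counts(v1, v2):
    """Counts a, b, c, d of Table 3.1 for two binary partition vectors of length L."""
    v1, v2 = np.asarray(v1), np.asarray(v2)
    iu = np.triu_indices(v1.size, k=1)                 # the K = L(L-1)/2 pairs i < j
    X = (v1[:, None] == v1[None, :]).astype(int)[iu]   # X(i,j) = 1 iff V1(i) = V1(j)
    Y = (v2[:, None] == v2[None, :]).astype(int)[iu]   # Y(i,j) = 1 iff V2(i) = V2(j)
    a = int(np.sum((1 - X) * (1 - Y)))
    b = int(np.sum((1 - X) * Y))
    c = int(np.sum(X * (1 - Y)))
    d = int(np.sum(X * Y))                             # pairs on which both partitions agree
    return a, b, c, d
```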
54 Table 3.2 Examples of d' Ranges and 9 Values L n1 n2 range of d' g 10 3 3 ( 0, 3 ) 0.5 * 10 3 5 ( 0, 3 ) 1.5 * 10 3 6 ( 0, 3 ) 2.0 * 10 5 5 ( 0, 5 ) 2.5 * 10 5 6 ( l, 5 ) 3.0 * 10 6 6 ( 2, 6 ) 3.5 * 20 5 5 ( 0, 5 ) 0.0 20 5 10 ( 0, 5 ) 2.5 * 20 5 13 ( 0, 5 ) 4.0 * 20 10 10 ( 0, 10 ) 5.0 * 20 10 13 ( 3, 10 ) 6.5 * 20 13 13 ( 6, 13 ) 8.0 * 40 10 10 ( 0, 10 ) 0.0 40 10 20 ( 0, 10 ) 5.0 * 40 10 25 ( 0, 10 ) 7.5 * 40 20 20 ( 0, 20 ) 10.0 * 40 20 25 ( 5, 20 ) 12.5 * 40 25 25 ( 10, 25 ) 15.0 * 80 20 20 ( 0, 20 ) 0.0 80 20 40 ( 0, 20 ) 10.0 * 80 20 50 ( 0, 20 ) 15.0 * 80 40 40 ( 0, 40 ) 20.0 * 80 40 50 ( 10, 40 ) 25. * 80 50 50 ( 20, 50 ) 30.0 * 200 50 50 ( 0, 50 ) 0.0 200 50 100 ( 0, 50 ) 25.0 * 200 50 130 ( 0, 50 ) 40.0 * 200 100 100 ( 0, 100 ) 50.0 * 200 100 130 ( 30, 100 ) 65.0 * 200 130 130 ( 60, 130 ) 80.0 * The quantity 9 is inside the range of d' when nl and n2 are around L/2. These cases exhibit the behavior in Figure 3.1. To further study the relation between S' and Gamma' we examine the sample correlation for the special cases when n1=n2=L/2, which is representative of the cases with star marks in Table 3.2. We generated vector 55 pairs of different lengths. For each permutation of one vector, we compute S' and Gamma which is converted to its z-score [33,34]. The correlation coefficients are shown in Table 3.3, which suggests that S' and Gamma are highly correlated under H0 when the number of 0's and 1's are equal. Table 3.3 Sample Correlation between S ,E(S ) and z-score of Gamma + ------- + -------- + ---------- + --------------- + --------------- E no.of E vector E Pearson E Kendall's Tau E Kendall's Tau E perms E size Ecorr(S',z)E corr(S',z) E corr(E(S'),z) + ------- + -------- + ---------- + --------------- + --------------- E 1000 E 30 E 0.873 E 0.900 E 1.000 E 1000 E 60 E 0.901 E 0.962 E 1.000 E 1000 E 90 E 0.913 E 0.969 E 1.000 E 10000 E 120 E 0.912 E 0.980 E 1.000 + ------- + -------- + ---------- + --------------- + --------------- The sample space consists of all permutations of one vector. Under any alternative a non-uniform probability assignment is made to these permutations. The fact that the two statistics have correlation coefficient 1 under H0 thus implies they have identical power under any alternative. Since R, B and G are linear functions of each other, we amma conclude that they have the same power against any alternative hypothesis. When nl=n2=L/2, S' is highly correlated to R, Gamma and B. Since the rank order correlation between S' and G is above 0.9, we expect amma + +---—— +--— 56 that they have compatible power against any alternatives. Since the threshold for S’ can be determined in a natural fashion and it is comparable to other statistics, 5' is a good choice for assessing partitional adequacy. 3.2 Binary Tree Classifier Design In this section, we apply 3' to the design of a binary tree classifier. Section 3.2.1 briefly discusses the concept of tree classifier, some existing techniques, and where this study fits. Section 3.2.2 describes the computation of feature thresholds from S. Section 3.2.3 considers the efficiency of computing feature thresholds. Some numerical examples are shown in Sec. 3.2.4. 3.2.1 Binary Tree Classifiers Binary tree classifiers have been used in many pattern recognition problems [50,51,59,72,73,75,85,90]. Each node of the tree corresponds to a subset of features. A pattern to be classified is passed through the tree from the root to a leaf. 
At each non-terminal node, based on the subset of feature values corresponding to that node, the pattern is sent to one of two descendent nodes. Every terminal node, or leaf, is labeled by category. The unknown pattern 57 eventually reaches a leaf and is assigned the corresponding category label. A tree classifier is designed from training patterns. At each non-terminal node, a subset of features is selected based on the training patterns available at the node. The training pattern set is divided into two disjoint subsets according to these features for use at successor nodes. A terminal node is labeled by the category of the majority of training patterns remaining. Since the computational cost in classification is roughly proportional to the square of the number of features used [81], we follow the literature and use one feature per node. Thus, descendent nodes are chosen by comparing the value of the feature corresponding to that node to a predefined threshold. The advantages of a tree classification rule versus a single stage classification rule are in computational efficiency, use of features, avoidance of the "curse of dimensionality" and ease in human interpretation [2,38,50,59,65]. One disadvantage is that the design of a binary tree classifier often requires a large amount of computing time and storage [75], especially when a fully optimal tree is desired [65]. 58 The design of a binary tree classifier with one feature per node consists of two components [46,49,75]: (1) Defining the structure of the tree; (2) Choosing the most effective feature and threshold at each node. Some design criteria [46,55,90] are low error rate, minimum number of nodes on the tree, shortest path length, and weighted sum of these factors [46]. Numerical examples show that local optimality does not ensure global optimality and that no simple method exists for specifying the optimal tree structure in a given problem [47]. Therefore, only conditional optimality can be achieved. Game tree search techniques and the look-ahead property have achieved partial global optimality [76]. Most practical tree designs use heuristic approaches and make no claim of optimality [50,59,70,73,75,8l,90]. Our study provides an alternative heuristic approach, without optimization'or look-ahead. Systematic procedures have been developed for the first component [46,55,65]. Given thresholds for features which partition the feature space, Meisel's dynamic programming method will generate equivalent partitions of that space which are optimal in the sense of having min-max path length, minimum number of nodes and minimum expected path 59 length. Since the number of possible trees under the constraint of a given partition is still very large, this algorithm is not feasible for large numbers of features [55,65]. Many approaches have been proposed to attack both components of tree design simultaneously [48,67,74]. A specific criterion is selected to determine the feature thresholds. At each node, the best threshold of the best feature splits the training pattern set on this node. This splitting criterion is used at every non-terminal node until the tree is constructed. The maximum distance between the empirical c.d.f.'s of the feature under different categories [70] and the mutual information between the category and the thresholded feature [74] have been used as criteria. We examine a splitting criterion based on the S statistic between a thresholded feature vector and category vector for the two-class problem. 
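To make the criterion concrete, a minimal sketch (in Python, not part of the original design) of how S' could be evaluated for one candidate split is given below; it assumes SciPy's hypergeom routine and uses the definition of S from Chapter 1, S = H(a) - [H(a) - H(a-1)]U, together with S' = |S - 0.5|.

import numpy as np
from scipy.stats import hypergeom

def s_prime(split_vector, category, rng):
    # split_vector: binary 0/1 values after thresholding one feature
    # category:     binary 0/1 category labels, same length L
    split_vector = np.asarray(split_vector)
    category = np.asarray(category)
    L = len(category)
    n = int(split_vector.sum())                  # number of 1's in the thresholded feature
    r = int(category.sum())                      # number of 1's in the category vector
    a = int((split_vector & category).sum())     # number of (1,1) agreements
    H = hypergeom(L, n, r).cdf                   # cumulative hypergeometric H(.)
    s = H(a) - (H(a) - H(a - 1)) * rng.random()  # S = H(a) - [H(a) - H(a-1)]U
    return abs(s - 0.5)                          # S' = |S - 0.5|

# example: s_prime([0, 0, 1, 1, 1, 1], [0, 1, 1, 0, 1, 0], np.random.default_rng(0))

Section 3.2.2 uses exactly this quantity as the SIMILAR function in procedure SPLIT.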
3.2.2 Computation of Feature Thresholds Consider a non-terminal node with N training patterns, each being described by M features. The features must be at least on an ordinal scale, although binary features are allowed. The data are represented as an NxMx2 matrix A, 60 where A(i,j,l) denotes feature j of pattern i, A(i,j,2) denotes the category label of pattern i for all j, and A(*,j,l) and A(*,j,2) denote the feature and category vectors respectively for feature j. Order the entries so that for each j, A(i,j,l) becomes the ith smallest value and A(i,j,2) is the corresponding category label. A threshold value between A(i,j,l) and A(i+l,j,1) creates a binary vector Ai(*,j,l) with i 0's and (N—i) 1's. A similarity measure between Ai(*,j,l) and the category vector A(*,j,2) serves as the splitting criterion. Table 3.4 is an example of feature and category values for feature j. Threshold 5.0 will result in binary feature vector A2(*,j,l)=001111, while the threshold 8.0 will give A4(*,j,l)=000011. Before sorting: A(i,j,2): 0 0 0 1 l' 1 A(i,j,l): 8.21 7.43 1.20 5.56 4.23 8.02 After sorting: A(i,j,2): 0 l l 0 1 0 A(i,j,l): 1.20 4.23 5.56 7.43 8.02 8.21 Table 3.4 Ordering of Training Patterns by Feature j (i=l,2,...,6) The splitting of a node can be described as the Pascal-like procedure SPLIT written below. 61 Procedure SPLIT (node,MINCUT); Begin Build matrix A for this node and set N,M; SMAX := 0; .For j=l,M begin Sort to obtain A(*,j,l); Arrange A(*,j,2) accordingly; For i=l,N-l begin THRESH(j):=(A(i,j,1)+A(i+l,j,l))/2; For k=l,i Ai(k,j,1):=0; For k=i+l,N Ai(k,j,l):=l; S(i,j):=SIMILAR [Ai(*,j,l),A(*,j,2)]; if S(i,j)>SMAX then begin SMAX := S(i,j); IMAX :=i; JMAX :=j end; (* if *) end; (* for *) end: (* for *) if SMAX > MINCUT then begin For i=l,IMAX the pattern corresponding to A(i,JMAX,l) is passed on to the left-son; For i=IMAX+l,N the pattern corresponding to A(i,JMAX,l) is passed on to the right-son;' SPLIT(left-son,MINCUT); SPLIT(right-son,MINCUT) end; (* if *) end; (* procedure *) 62 The algorithm for designing the whole tree is simply SPLIT(root,MINCUT), where MINCUT is the user specified minimum splitting criterion value. Several splitting criterion functions SIMILAR can be used, such as average mutual information [74]. We propose S'=ES[(A(*,j,l),A(*,j,2)]-0.5E as the SIMILAR function, since it has a direct intepretation and known distribution under H0 (Chapter 1). 3.2.3 Efficiency in Feature Threshold Computation Consider the N training patterns assigned to a node and feature. j, represented by sorted arrays A(*,j,l) and A(*,j,2). There are at most (N-l) possible thresholds in the procedure SPLIT. To avoid checking every possible threshold, we investigate whether the best feature threshold occurs between runs of 0's or 1's in the A(*,j,2) vector, or occurs at the boundaries of the largest runs. Sethi [74] suggested restricting the search for feature thresholds to the boundaries of runs, i.e., check threshold [A(i,j,l)+A(i+l,j,l)]/2 only if A(i,j,2) # A(i+l,j,2). Sethi didn't prove that other thresholds can be ignored when the mutual information is used as the splitting criterion. 63 Appendix A gives a simple induction proof that only thresholds at the boundaries of runs need be checked if S' is the splitting criterion. We now ask whether threshold checking can be further restricted to the boundaries of the largest runs of 0's and 1's. 
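Before turning to that question, the run-boundary restriction just established can be illustrated with a short sketch (an illustration only, assuming the feature values and labels at a node have already been sorted by feature value as in procedure SPLIT):

def run_boundary_thresholds(values, labels):
    # values: one feature's values for the patterns at this node, sorted ascending
    # labels: the corresponding category labels (0/1), in the same order
    # Keep only the midpoints at which the label changes, i.e. the boundaries of runs.
    return [(values[i - 1] + values[i]) / 2.0
            for i in range(1, len(labels))
            if labels[i] != labels[i - 1]]

For the sorted labels 0 1 1 0 1 0 of Table 3.4, four of the five possible thresholds survive this restriction.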
To be specific, suppose the run lengths in A(*,j,2) are kt,t=1,2,...,m and let * k =max{kl,k2,...,km} which is achieved between pattern k1+...+ki_l+l and kl+...+ki. We have observed that the threshold [A(kl+...+ki_l,j,l)+A(kl+...+ki,j,l)]/2 or [A(kl+...+ki,j,l)+A(kl+...+ki+1]/2 results in a binary feature vector with highest 5' among all possible thresholds. We have not been able to prove this in general. Figure 3.2 demonstrates this phenomenon with 14 cases. In each case, two vectors are presented. The first vector is the category vector A(*,j,2). The second is the feature vector thresholded to have the maximum E(S') value. In every case, the "best" threshold point is at a boundary of a largest run. The category vectors in the last two pairs are identical, i.e., the "best” threshold is not unique. If we can only check the boundaries of largest 64 runs, we can further reduce the time for finding feature thresholds. 1110000100 11011100010110 01001111010010011101 0001111111 00000011111111 00001111111111111111 01011110 1101001111 0100001100 010011010111 00011111 0000001111 0000001111 000000000111 0000000000110 01001100000000000000 1000000100 0000000000111 00000011111111111111 0111111111 0111000110110110 01000110 01000010 01000010 0000000111111111 00000111 00111111 00000011 Fig.3.2 Fourteen Examples of Category-Feature Vector Pair 3.2.4 Numerical Examples Since there are infinitely many different types of classification problems, a formal comparison of binary tree design algorithms is not feasible. Following the literature, we use several numerical examples to demonstrate the use of S' as a splitting criterion in tree design and indicate its advantages over the mutual information criterion in an informal manner. Our experiments are on four artificial data sets and two real data sets. 65 Data sets 1 through 4 are from a cluster process and are generated on a computer. Each cluster has about the same number of patterns. The number of clusters is a Poisson random variable. Patterns are distributed around cluster centers randomly chosen in a unit hypercube according to a Normal distribution with diagonal covariance matrix, all of whose diagonal elements equal sigma, the spread factor. The smaller sigma, the more distinct the clusters. We code the patterns in odd numbered clusters "1" and those in even numbered clusters "0” to serve as category labels. One half of the patterns in each cluster are used for training while the rest are for testing. The parameters actually used are shown in Table 3.5. Data sets 5 and 6 are created from the Munson handprinted Fortran character set, containing 48 samples of each of four letters, namely "I", "M", "O" and "X”. Each character is represented by an eight-dimensional pattern [12]. Data set 5 includes characters "I" and "M". Data set 6 contains all four characters with "I", "0" being one category and "M","X" being the other. The first half of each cluster (Data sets 1 through 4) and each alphabet (Data sets 5 and 6) are used for training. 66 Table 3.5 Parameters of Artificial Data Sets +----+ ---------- + ---------- + ---------- + ---------- + ------- + E E no. of E no. of E no. of E no. 
of E E EDataE training E testing E features E clusters E Sigma E E setE patterns E patterns E E E E +--—-+ ---------- + ---------- + ---------- + ---------- + ------- + E l E 74 E 76 E 6 E 5 E .09 E E 2 E 72 E 73 E 15 E 6 E .20 E E 3 E 71 E 74 E 15 E 6 E .30 E E 4 E 74 E 76 E 9 E 4 E 1.0 E E 5 E 48 E 48 E 8 E (2) E - E E 6 E 96 E 96 E 8 E (4) E - E +----+ ---------- + ---------- + ---------- + ---------- + ------- + The mutual information and S' are used as splitting criteria to construct two binary classification trees for each data set. The procedure SPLIT is used in all cases. We specify a minimum splitting criterion MINCUT in every case. When no threshold for any feature exceeds MINCUT,the node in question will be a leaf. Using the direct interpretation of S', we set MINCUT for S' at 0.475 for all data sets, which corresponds to a significance level of. 0.05. That is, S'>0.475 means that the thresholded feature vector and the category vector are more closely related than 95% of the pairs formed by permuting one of these vectors. The mutual information levels are explained with the data sets. 67 The tree classifiers are summarized in Table 3.6 through Table 3.11. The trees for data sets 1 and 2 are given and written in prefix form to give a general feeling for the tree structure. The labels in "<>" ("a' or "b") are labels of leaf nodes. Other numbers are the features used. For example, 2(4) indicates the tree in Figure 3.3. F2 < .382 / \ / \ F4 < .457 Cat / \ / \ Cat Cat Fig.3.3 Tree Designed with S' (minimum splitting S'=.475) (artificial data set 1) 68 Table 3.6 Tree Design for Data Set 5 mutual formation .9. I I I I =========3=======a=s= I I + splittin criterio MINCUT "DID IIH “:1 0.330 training recog.rate testing Table 3.7 Tree Design for Data Set 2 + ----------- + ---------------- + ----------------- + E splitting E mutual E E E criterion E information E S' E E MINCUT E 0.33 E 0.475 E + ----------- + ---------------- + ----------------- + E tree E 6()(9(3())) E + ----------- + ---------------------------------- + E number of E E E nodes E 7 E + ----------- + ---------------------------------- + E training E E E recog.rateE 1.000 E + ----------- + ---------------------------------- + E testing E E E recog.rateE 0.918 E + ----------- + ---------------------------------- + 69 Table 3.8 Tree Design for Data Set 1 + ----------- + ---------------- + ----------------- + E splitting E mutual E E E criterion E information E S' E E MINCUT E 0.33 E 0.475 E + ----------- + ---------------- + ----------------- + E tree E 2(4) E 4()(2) E + ----------- + ---------------- + ----------------- + E number of E E E E nodes E 5 I 5 I + ----------- + ---------------- 4 ----------------- + E training E E E E recog.rateE 1.000 E 1.000 E + ----------- + ---------------- + ----------------- + E testing E E E E recog.rateE 0.987 E 0.987 E + ----------- + ---------------- + ----------------- + Both methods gave the same result for Data set 5 (Table 3.6) and Data set 2 (Table 3.7). The difference between the trees for Data set 1 under the two criteria (Table 3.8) involves only the order of the features used and slight changes in feature threshold values. The minimum splitting criterion value MINCUT for mutual information was chosen because it gave good results for data set 5 in a preliminary trial. The distribution of mutual information is not known under H0 , although it has an asymptotic chi-square distribution [6]. These data sets suggest that the two criteria give comparable results. 
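For completeness, the cluster process that generated data sets 1 through 4 can be sketched as follows; this is an illustration only, since the Poisson mean and other generator details are not specified in the text, and NumPy is assumed.

import numpy as np

def make_artificial_data(n_patterns, n_features, sigma, rng, mean_clusters=5):
    # Number of clusters is Poisson; cluster centres are uniform in the unit
    # hypercube; each pattern is Normal(centre, sigma**2 I).  Odd-numbered
    # clusters are coded category 1, even-numbered clusters category 0.
    k = max(2, rng.poisson(mean_clusters))
    centres = rng.random((k, n_features))
    per_cluster = n_patterns // k
    patterns, labels = [], []
    for c in range(k):
        patterns.append(rng.normal(centres[c], sigma, size=(per_cluster, n_features)))
        labels.append(np.full(per_cluster, (c + 1) % 2))
    return np.vstack(patterns), np.concatenate(labels)

One half of each cluster would then be assigned to training and the other half to testing, as in Table 3.5.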
Table 3.9 Tree Design for Data Set 3
+----------------------+--------+--------+--------+--------+
| splitting criterion  |   mutual information     |   S'   |
| MINCUT               | 0.330  | 0.200  | 0.100  | 0.475  |
+----------------------+--------+--------+--------+--------+
| number of nodes      |   1    |   3    |   17   |   17   |
| training recog. rate | 0.563  | 0.761  | 1.000  | 1.000  |
| testing recog. rate  | 0.568  | 0.730  | 0.716  | 0.716  |
+----------------------+--------+--------+--------+--------+

Table 3.10 Tree Design for Data Set 4
+----------------------+--------+--------+--------+
| splitting criterion  | mutual information |  S'  |
| MINCUT               | 0.500  | 0.050  | 0.475  |
+----------------------+--------+--------+--------+
| number of nodes      |   1    |   25   |   13   |
| training recog. rate | 0.635  | 1.000  | 0.865  |
| testing recog. rate  | 0.618  | 0.566  | 0.566  |
+----------------------+--------+--------+--------+

Table 3.11 Tree Design for Data Set 6
+----------------------+--------+--------+--------+--------+
| splitting criterion  |   mutual information     |   S'   |
| MINCUT               | 0.350  | 0.200  | 0.020  | 0.475  |
+----------------------+--------+--------+--------+--------+
| number of nodes      |   3    |   5    |   9    |   5    |
| training recog. rate | 0.885  | 0.979  | 1.000  | 0.979  |
| testing recog. rate  | 0.885  | 0.938  | 0.917  | 0.938  |
+----------------------+--------+--------+--------+--------+

Data set 3 (Table 3.9) is very loosely clustered. The recognition rates are lower than those for data sets 1 and 2. The value of MINCUT for mutual information has a significant effect on the tree obtained. Data set 4 (Table 3.10) is completely random. Different MINCUT values for the mutual information method result in very different trees. Data set 6 (Table 3.11) also shows the effect of the MINCUT value with the mutual information criterion. The above examples demonstrate that S' is a reasonable splitting criterion in tree classifier design. The trees designed with S' are as good as those designed with the mutual information criterion. Since S' has a known distribution under H0 that is independent of the number of patterns at each node, S' has a direct meaning and the MINCUT value for S' can be based on theory. The user must select the minimum threshold MINCUT under the mutual information criterion. These data sets demonstrate that the selection of MINCUT has a dramatic effect on the tree structure and on the recognition rates. Little prior information is available for selecting a mutual information threshold, but a threshold on S' can be defined in a natural way. Mutual information has an asymptotic chi-square distribution [6], but the degrees of freedom depend on the number of patterns at each node. If one really wants the statistical significance to be consistent in the entire tree design, a p-value for mutual information can be approximated and different chi-square tables can be used when the number of patterns changes from node to node, assuming the asymptotic distribution is applicable. This process is much more tedious than selecting the threshold of S'.

3.3 Summary and Conclusions

This chapter evaluated the similarity S, defined in Chapter 1, in two applications. Both applications require that the adequacy of a binary partition be measured. A version of S named S' is compared to three well known statistics (R, Gamma, B).
The three are shown to be linear functions of each other. All have asymptotic normal distributions under H0. The measure S' is shown to be different from the others. When the numbers of 1's in both vectors are half the vector length, S' is highly correlated to the other measures. The statistical significance of the other three measures demands either Monte Carlo simulation or rather complex approximation, while the significance level of S' is obvious. We also applied S' to the design of a binary tree classifier. Numerical examples demonstrated that S' is a reasonable splitting criterion to be used in determining features and their thresholds. The known distribution and known statistical significance of S' permit one to establish the threshold of each feature in a simple and direct fashion. By comparison, the mutual information threshold must be approximated from asymptotic results and changes from node to node.

CHAPTER 4
TEMPLATE MATCHING

This chapter provides a probabilistic analysis of some statistics used in the template matching problem on binary images. We assume mathematical models for both null and alternative hypotheses. An approximately optimal statistic and two other statistics are derived and their powers are compared. We propose a suboptimal statistic which has reasonable power and is more sensitive to the true object location than existing statistics. Along with the experiments on artificial images generated under our mathematical model, this statistic is applied to several real Landsat images. The permutation statistic S defined in Chapter 1 and the Pearson correlation coefficient are also applied to matching binary images and are shown to provide results similar to other sub-optimal statistics. In this chapter, we concentrate on the statistic derived from the Neyman-Pearson criterion, which is optimal under a weak assumption and which has reasonable power and sensitivity.

4.1 Introduction

Template matching is a simple, classical approach to the problem of locating an object in a digitized image [29,68,79,82]. An image is an Nr by Nc matrix of gray levels. A template is an Nr' by Nc' matrix containing a picture of an object, with Nr' << Nr and Nc' << Nc. The image may contain that object without rotation or size distortion, or it may contain no object at all. The object in the image is the same as the template except for lighting conditions. An example in Figure 4.1 shows an image and a template which are pictures of the same scene but are seen through different light frequency channels. This chapter considers only binary images and templates.

Fig.4.1 An Image and a Template

The standard solution to template matching is to scan through the image and compare the template with subimages at all possible template locations, using some measure of similarity to decide whether or not the image contains an object and, if it does, to estimate its location. Since there are (Nr-Nr'+1)(Nc-Nc'+1) locations for the template, the computational complexity of the standard solution to the template matching problem is O((Nr-Nr'+1)(Nc-Nc'+1)(Nr'*Nc')). Several methods have been proposed to reduce the computational burden of the standard solution [5,43,60,61,66,87,88,89]. Bolles' planning method first applies an interest operator [57] to eliminate locations with low interest. This method is effective for grey level images and for images with structure, i.e., when pixels are dependent.
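For reference, the standard solution described above amounts to the following exhaustive scan (a sketch, not the two-stage procedure studied in this chapter), scored here with the simple matching coefficient on binary arrays; NumPy is assumed.

import numpy as np

def scan_template(image, template):
    # Slide the template over every possible location and score each subimage
    # with the simple matching coefficient (N00 + N11) / (Nr' * Nc').
    Nr, Nc = image.shape
    nr, nc = template.shape
    best_score, best_loc = -1.0, None
    for r in range(Nr - nr + 1):
        for c in range(Nc - nc + 1):
            score = np.mean(image[r:r + nr, c:c + nc] == template)
            if score > best_score:
                best_score, best_loc = score, (r, c)
    return best_loc, best_score

Each location costs on the order of Nr'*Nc' comparisons, which gives the overall complexity quoted above.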
Our study employs a randomness model with independent pixel values, which is appropriate for the Two Stage solution described below. 78 The first stage [83] compares a subtemplate to the subimages at all possible template locations. The second stage applies the entire template only at the locations with a sufficient match between the subtemplate and the subimage. The smaller the subtemplate, the lower the computational cost, but the higher the possibility of false match or missing a true match. The problem of choosing the optimal subtemplate size has been studied [83]. In one binary image model [83], a background pixel takes value 1 with probability p independent of other pixel values. The binomial distribution and the normal approximation describe the distributions of the simple matching coefficient between subtemplate and subimage. The independence assumption is more reasonable when the subtemplate is selected at random and is ”sparse" [83]. In two stage matching, the simple matching coefficient, or equivalently, the absolute difference of pixel values between template and subimage, has low computational cost, but it has not been compared to other measures, such as the Jaccard coefficient and the correlation coefficient [29]. These measures have been compared in other applications 79 [80]. This chapter examines two stage template matching on a binary image and with a fixed binary subtemplate. We suggest a probabilistic model for both null hypothesis and alternative hypothesis and seek a powerful and computationally feasible statistic for recognizing the object. We do not attack the subtemplate size problem or the methodology of the second stage. 4.2 Mathematical Model Section 4.2.1 discusses notation for the mathematical model. Section 4.2.2 defines five statistics for testing hypotheses. Approximations to the Neyman-Pearson statistic and two other statistics are defined in Sec. 4.2.3. 4.2.1 General Definition An object is a mapping from a two dimensional grid of size (Nr.xNC.) to {0,1}. A template is a perfect copy of the object. In our experiment, each object pixel is generated independently with probability p' of being 1. A subset of n template pixels forms a subtemplate. Let q be the fraction of 1's in the subtemplate. An image G is a 80 matrix of size (erNC) with binary valued elements. The image may contain a single distorted version of the object (template). The template is scanned over the image but the image is seen only through the subtemplate. Let the possible locations of the template in the image be arbitrarily ordered from 1 to x0. Let Lx denote the set of image pixels covered by a grid of size (Nr.ch.) at location x. Define Nii(x) as the number of pixels at which both the image and the subtemplate of a template at x are i for i=0,l, 0 5 Nii(x) g n. The mapping G is a matrix valued random variable and two hypotheses can be stated. We assume independent pixel values under both hypotheses. The template and the subtemplate are fixed. H0 : (No distorted object in the image), Pr[G(i,j)=1EH0]=p for any (i,j). Hlx: (The distorted object is at location x), Pr[G(i,j)=lE(i,j) not in Lx, Hlxl=pI Let Tx(i,j) be the template pixel value corresponding to image pixel (i,j) when the template is at location x. The parameters a and b measure distortion of the object in the image, such as lighting condition differences between the image and the template. 81 Pr[G(i,j)=lE(i,j) in Lx' Tx(i,j)=0, Hlx]=a, Pr[G(i,j)=lE(i,j) in Lx' Tx(i,j)=l, Hlx]=b. 
H1 : (The distorted object is at some location), {G(i,j)} are distributed as in H1x for some x. The random experiment consists of setting the template at all locations of the image and viewing the image at first only through the subtemplate. At each location y, 1 g y 5 x0, we observe N00(y) and Nll(y). Let Ey denote the event that N00(y)= n0(y) and Nll(y)= nl(y) for fixed numbers n0(y) and nl(y), where n0(y)+nl(y) g n. Consider the likelihood ratio Rx": Pr((W EYEHlx) Y Rx": --------------- Pr(IWZE EH0) Y Events Ey(l) and Ey(2) are not independent under H0, H1, or {Hlx' all x} when the templates at locations y(l) and y(2) overlap, but we treat {BY} as independent events to obtain an approximation to Rx"' 82 If x=y, the template and the distorted object coincide under Hlx° In this case, N00(y) has a B[n(l-q),1-a] distribution and Nll(y) has a B[nq,b] distribution, where B[M,p] is a binomial distribution representing the result of M independent Bernoulli trials with probability of success p. Under H0, N00(y) has a B[n(l-q),l-p] distribution and N11(y) has a B[nq,p] distribution. In addition, N00(y) and Nll(y) are independent under all hypotheses because the template and the subtemplate are fixed. Pr(E EH ) n (y) n (y) xx = ----57-;5— = u 0 v 1 Cxx where Pr(ExEHO) (l-a)p b(l-p) u: -------- v: -------- a(l-p) (l-b)p . Cxx= (a/pIn‘l‘q’((i-hI/Ii-p>)“q. Similar equations can be stated for the special cases a=0, a=l, b=0, and b=1. 83 The term Sxy involves the distribution of the sum of two binomial random variables, so no closed form exists for sxy' Poisson approximations create complicated expressions. We choose to use a binomial approximation, as explained in Appendix B. Let Rx denote the approximation to Rx' when S is approximated as in Appendix B. xy R = un0(x)vnl(x)C T-T [--_w§2----p_]no(y)-nl(y) } x xx' ' _ _ xY y¥x (l wxy)(1 p) log(Rx) = n0(x)log(u) + nl(x)log(v) + log(Cxx) + --- wxy p --- + \ {[n0(y)-nl(y)]log[ ----------- I} + \ 109(ny) / (l-wxy)(l-p) / 4.2.2 Statistics for Testing H0 vs. H1x To test Hlx' we simply place the template at position x and compute a statistic based on Noo(x) and N11(x). The five statistics which we will compare are defined below. Dx = N00(x)+ Nll(x), the simple matching coefficient, Gx = N00(x)log(u) + Nll(x)log(v), 84 an approximation to the optimal statistic. J = Nll(x)/(n-N00(x)), Jaccard coefficient, in which only (1,1) matches are considered, and n is the subtemplate size, C = Pearson correlation coefficient, S = the S statistic defined in Chapter 1. The test using any of the above statistics is to reject 140 when the observed value of the statistic is larger than some threshold. AMonte Carlo comparison study will be <3escribed in Sec. 4.3 to see if any of the statistics can Inatch the performance of 6x“ 4.2.3 Statistics for Testing H0 vs. H1 The most powerful statistic for testing H0 vs H1 can be 0, p is estimated from 100 random images, different from the images used in estimating ‘threshold and power. SPK=0= know all parameters; SPK=1: estimate p; p',a,b are known; SPK=2: estimate p,p': assume a=.10,b=.90; SPK=3: estimate p,p'; assume a=.25,b=.96; SPK=4: estimate p,p': assume a=.05,b=.88. Subtemplate sizes: {19,26,33,39]; The parameter p' is used only in statistic R. The =Ei‘tclbtemplate is selected randomly, except in one experiment (: fiL ndicated in Table 4.5), in which the number of 0's and 1's ‘3‘ 1:? e approximately equal. 91 4.4 Results The experimental results for comparing D, G and R described above are collected in this section. In Sec. 
4.4.1, Table 4.1 - Table 4.5 list the average powers of D, G and R. In Sec. 4.4.2, Table 4.6 - Table 4.10 list the average sensitivities of D, G and R. In Sec. 4.4.3, Tables 4.11 - 4.12 list the results of two sample t tests comparing powers and sensitivities of D, G and R. 44.4.1 Power Study Results: Table 4.1 lists the mean powers PD, PG and PR for the <=:aase when all parameters are known, the distortion is low, iEIlfid the template size is relatively small. The marginals 5L Indicate the average effect of p and p'. From the overall ‘t=~<:tal averages, we can compare powers between statistics and <==I::mpare to results of other subtemplate sizes, other =E=i1LJbtemplate selection methods, other parameter sets for (:«EEI,b), and other options for estimating parameters. In Table 4.1 and all following tables, "size" means the s ubtemplate size . 92 Table 4.1 Effect of (p,p‘) on Powers SPK=0, (a,b)=(0.l,0.9), size=l9 p= .2 .5 .8 + ------------- + ------------- + ------------- + E .23 E .80 El.00 E .68 P'=.2 E .31 E .82 E 1.00 E .71 E .30 E .95E 1.00 E + ------------- + ------------- + ------------- + E .89 E .75 E .83 E .82 p':.s E .95 E 75 E 95 E .89 E .99 E .74 E 1.00 E .91 + ------------- + ------------- + ------------- + E1.00 E 79 E .25 E .68 P'=.8 E 1.00 E .82 E .37 E .73 E 1.00 E .97 E .33 : + ------------- + ------------- + ------------- + 71 .78 .70 .73 .76 .80 .78 .78 .77 .89 .78 .81 + ------------- + I l 3 PD p, I l P I I ........... B-I Parameters p and p' indicate the difference in frequency 0 f 1's between the background and the object. The further a~part p and p' , the: easier. the object is to detect. Our 11‘ esults .confirm this. The mean powers of any statistic for t he cases (p,p')=(.2,.2) and (.8,.8) are comparable, which I: e flects the symmetry in the problem. That is, reversing t he coding should not alter the results. A similar comment C an be made for other symmetric locations in the table. All three mean powers for the cases (p,p')=(.2,.2) and (.8,.8) are low, because the information for discrimination in these .75 .77 93 cases is low. The expressions for Gx' u and v show that when p is small then u< the parameters p, p', a and b to get optimal results. 96 Table 4.5 Powers when Subtemplate is Balanced I P'=.2 E .93 I I p':.s E .95 I I I P'=.8 E 1.00 I I +-—-——-+--—-+-—-- + \J N +---— +--- +--- + \O \O O) \D 41.. 49E ..2 Sensitivity Study Results Tables 4.6 - 4.10 show the results of the sensitivity as . . t:“dl<:3y for the conditions of Table 4.1 - 4.5, respectively. .96 .90 .97 .94 97 (a,b)=(0.l,0.9), size=19 Table 4.6 Effect of (p,p') on Sensitivities SPK=0, 0.8 +------------+------------+------------+ 0.5 0.2 p: 2 4. 4. O 2 9 1 O 3 5 6 l 6 2 O 1 4. 2 O l 2 4. O 2 6 0 O 2 0 l O 2 l 0 4. l 0 O 4. 2 2 O 6 2 O = P a. .2 0 1 .5 1. 0 1 .2 .2 O 1. nu .3 O .1 .2 1. O l .2 .2 O 1. o. 1. O 1. o. 1. O 1. 3 .2 O 1. a. .2 O 1. 1. 1. O 1 .2 .2 O l .5 = P l 7 2 8 2 O 2 7 3 O 3 3 6 O 3 3 6 O 3 0 8 O 6 3 5 O 2 4. 9 O l 8 0 2 7 9 O l 7 2 O l 3 2 O l l 2 o 2 6 9 o 1 9 5 o 2. 
9 l o 2 0 0 o 2 9 0 o 3 5 0 o 2 3 7 o l 0 8 o l 0 4 o 2 4 1 o 2 9 8 o 2 +------------+ +-----------—+ size=l9 Table 4.7 Effect of (a,b) on Sensitivities SPK=0, D E 1.95 E 2.21 E 2.59 +----------+----------+---—------+ (a,b)=(.1,.9) 5.22 E .98 4 E 7.44 E (.2,.8) 98 Table 4.8 Effect of Parameter Knowledge on Sensitivity (a,b)=(0.2,0.8), size=l9 D c R + ---------- + ---------- + ---------- + spxso E 7.44 E 4.98 E 5.22 E + ---------- + ---------- + ---------- + 1 E 7.21 E 5.01 E 5.14 E + ---------- + ---------- + ---------- + 2 E 7.27 E 4.47 E 5.15 E + ---------- + ---------- + ---------- + 3 E 7.44 E 6.41 E 7.53 E + ---------- + ---------- + ---------- + 4 E 7.44 E 4.94 E 5.58 E + ---------- + ---------- + ---------- 4- Table 4.9 Effect of Subtemplate Size on Sensitivity SPK=0, (a,b)=(0.l,0.9) D G R + ---------- + ---------- + ---------- + subt.size= 9 E 8.29 E 7.03 E 6.06. E + ---------- + ---------- + ---------- + 13 E 4.88 E 3.81 E 3.69 E + ---------- + ---------- + ---------- + 19 E 2.59 E 1.96 E 2.21 E + ---------- + ---------- + ---------- + 26 E 1.61 E 1.27 E 1.50 E + ---------- + ---------- + ---------- + 33 E 1.25 E 1.09 E 1.28 E + ---------- + ---------- + ---------- + 39 E 1.10 E 1.02 E 1.19 E + ---------- + ---------- + ---------- + 99 Table 4.10 Sensitivities when Subtemplate is Balanced I I p'=.2 : 1.04 i p':.s Tables 4.6 - 4.10 indicate that the effects of ( p,p'),(a,b) and subtemplate size on sensitivity are similar t:o their effects on power. Estimating parameters does not a: ffect sensitivity very much. Interestingly enough, sstatistic G is more sensitive than statistic R, except when ‘tihe subtemplate is very small. An intuitive reason for this is that the high value of Rx may happen at several locations OVerlapping with the the true object location. The use of 1“formation at overlapping locations provides higher power ‘3}1an ignoring the overlap, but some sensitivity is lost. 100 4.4.3 Formal Comparison Table 4.1 - 4.10 use only mean values of power estimates and sensitivity estimates and give a general idea of the effects of parameters. To compare the three statistics, we performed a standard two sample t test in each case of different subtemplate size. For example, we test the hypothesis Power(D)>Power(G). A positive t value indicates the acceptance of the hypothesis at the critical level wt. A negative t value indicates the acceptance of the reverse hypothesis Power(D)PWR PWG>PWR PWD>PWG size + 'II'I . . 0 . 0 - o . _ 2 . 4 . 6 - o . 4 . . . + llll . _ 0 . 0 _ o . . 5 . 0 . 2 - o _ 5 . . _ + 'IIII. . . 5 . 3 - o . . 6 . 8 - 3 — o . 0 . _ . + 9 + lu|.l| _ . 0 . 0 . o . . 8 _ 6 . 7 — o . 2 . . . + Il'l . . 0 _ 0 - o . . l _ 0 _ 5 - o _ 3 . . _ + III-Il| . . 4 . 2 — o . . 8 . 0 . 7 — o _ 0 . _ . 3 1 -0.672 I .05 E .25 + . no. 4 . .. . nu . .4 . 1. . . . nu . . . . uuuu+ . 4.. 1.. .. . .4 . o. _ nu . . . 1. . . . . nunu+ . ,o. 1.. .. . a. . o. . o. . . . no . . . . unuu+ 6 2 .39 : 0.601 : .27 8 .0 +------------+-—-—--------+—---------—-+ : '1.450 I I .03 l - I w l L}. 
+----------—-+ t +-------—-- 102 Table 4.12 Sensitivity Comparison Result of Two Sample t Test SPK=0, (a,b)=(0.l,0.9) size V(D)>V(G) V(D)>V(R) V(G)>V(R) + ------------ + ------------ + ------------ + 9 : 1.132 : 1.965 : 0.881 : : .13 : .03 : .19 : + ------------ + ------------ + ------------ + 13 : 1 654 : 1.826 : 0.221 : : .05 : .04 : .41 : + ------------ + ------------ + ------------ + 19 g 1.681 : 0.983 : -0.881 : : .05 : .16 : .19 : + ------------ + ------------ + ------------ + 26 : 2.098 : 0 649 : -2.259 : : .02 : .26 : .01 : + ------------ + ------------ + ------------ + 33 : 2 108 : -0.33s : -3.609 : : .02 : .37 : .00 : + ------------ + ------------ + ------------ + 39 : 2.207 : -1.665 : -4.316 : : .02 : .05 : .00 : + ------------ + ------------ + ------------ + + ------------ + i t i I W I 1------_---1-1 Table 4.11 indicates that R is more powerful than the other statistics when-the subtemplate is small. This may be explained by the fact that overlapping locations provide most of the information for discrimination. Both R and G are significantly more powerful than D in almost all cases. When the subtemplate size is 19 or larger, R is only slightly more powerful than G. A similar situation exists for sensitivity, except that G is significantly more sensitive than R and D when subtemplate size is large. The 103 reason for R to perform worse when the subtemplate size is large might be the fact that some assumptions made in deriving R are violated when the subtemplate points are more dependent. 4.4.4 Results on Landsat Images This study involves landsat images of the same scene from different light frequency channels. Each image was converted into a binary image using the average grey level of that image as threshold. We arrange the study into several cases. In each case, we take a subimage of size 8x8 from a channel i image to form an object or template. Then we take another subimage of size 64x64 from channel j (i#j) which includes the object. We consider that this 64x64 image contains a intensity distorted (not geometrically distorted) object. We apply our scheme to find the relative location of the distorted object. We studied five measures of similarity and three subtemplate sizes. Each study was repeated three times. The parameter p is estimated by the fraction of 1's in the 64x64 image from channel j. The parameter p' is estimated by the fraction of 1's in the 8x8 template. The distortion parameters (a,b) are assumed to be (0.1,0.9). A balanced subtemplate is formed in each case. 104 Table 4.13 Results on Several Landsat Images subtemplate size V(D) V(G) V(R) V(max(Cx)) V(max(Sx)) 9 1512 1211 718 2244 1222 9 1719 1456 829 2159 1188 9 154 110 1528 110 110 l9 l8 14 541 18 22 19 165 64 1630 42 36 19 1188 979 711 1117 1117 29 79 64 544 78 102 29 926 756 750 787 722 29 719 468 681 969 719 Among the nine cases in Table 4.13, G is more sensitive than D in every case; 6 is more sensitive than R in five cases; more sensitive than max(Cx) in seven cases; more sensitive than max(Sx) in five cases. These results support our conclusion about G drawn from artificial data. In two cases using very small subtemplates, statistic R performs best of all statistics. However, in four cases using large subtemplates, R performs worst. The reason may be that the independent assumption we made in the derivation of R is violated for the Landsat images. Since R is the only statistic using the information from overlapping locations, it performs rather differently from all other statistics. 
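A minimal sketch of this preprocessing is given below; the band arrays, window coordinates and helper names are hypothetical (none appear in the thesis), and NumPy is assumed.

import numpy as np

def binarize(band):
    # Threshold a grey-level band at its own average grey level (Sec. 4.4.4).
    return (band >= band.mean()).astype(np.uint8)

def make_case(band_i, band_j, row, col, rng, subtemplate_size=19):
    # 8x8 template from channel i; 64x64 search image from channel j.  Both
    # windows are anchored at the same corner here, so the true object sits at
    # relative location (0, 0); the thesis does not specify the offsets.
    template = binarize(band_i)[row:row + 8, col:col + 8]
    image = binarize(band_j)[row:row + 64, col:col + 64]
    p, p_prime = image.mean(), template.mean()      # estimates of p and p'
    # Balanced subtemplate: equal numbers of 0-pixels and 1-pixels, chosen at random.
    ones, zeros = np.argwhere(template == 1), np.argwhere(template == 0)
    k = min(subtemplate_size // 2, len(ones), len(zeros))
    subtemplate = np.vstack([ones[rng.choice(len(ones), k, replace=False)],
                             zeros[rng.choice(len(zeros), k, replace=False)]])
    return template, image, subtemplate, p, p_prime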
105 4.5 Summary and Conclusions This chapter stated a null hypothesis (no object) and alternative hypotheses (single object present) for a problem in binary template matching and examined an approximation to the Neyman Pearson statistic for testing them. This statistic is compared to the simple matching coefficient and other similarity measures including the S statistic defined in Chapter 1. A new sensitivity index is proposed to assess the ability of a statistic to locate the true object in the image. Power, sensitivity and computation cost are the criteria for comparison. In testing H0 vs Hlx' G is more powerful than all other statistics. In testing H0 vs H1, R is most powerful, but less sensitive than G to the true object location. The statistic G, with the information in overlapping locations ignored, is almost as powerful as R, but computationally much simpler, and more sensitive to the true object location when the subtemplate is large. The threshold of G for testing H0 vs H1 can be obtained analytically, which is another important advantage of G over R. 106 The simple matching coefficient D, is less powerful and less sensitive than G, but has the same order of computational complexity as G. The correlation coefficient has been used for ,grey level image template matching problems [29]. Correlation and the 5 statistic defined in Chapter 1 have lower power and lower sensitivity than G and R, with computational complexity at least as large as G. Therefore, we suggest using G in the first stage of template matching. CHAPTER 5 COMPUTATIONAL CONSIDERATIONS Since the S statistic is computed directly from the cumulative hypergeometric distribution function (c.d.f.), we will discuss algorithms for computing hypergeometric c.d.f.'s that have appeared in the literature. Notation is defined in Sec.5.l. The Peizer approximation is described in Sec.5.2. We give a hardware architecture design for the recursive algorithm in Sec.5.3 and an overall computational time comparison in Sec.5.4. A brief conclusion is given in Sec.5.5. 5.1 Notation Consider two binary N vectors V1 and V2. We first define n,r and a as follows, to be consistent with the literature [52]. 107 108 Table 5.1 Observables in Vector Pair V2 1 0 + ----- + ----- + 1 : a : b : n V1 + ----- + ----- + 0 : c i d i m + ----- + ----- + r S N For example, for the following vector pair, V1=[0010110°] v =[10010110] 2 N=8, n=3, r=4 and a=l. For computational convenience [52], without generality, we can recode V1, V2 and 0,1, so that a g d and a < b g c, or, equivalently, 2a + l g n 5 r i m. Also, we let k=min(a,b-1,c-l,d) [52]. loss of The probability density function (p.d.f.) h(i) and the c.d.f. H(a) of the hypergeometric distribution are defined below. 109 (‘2) (ii?) h(i) = h(N,n,r,i) = --------- for 0 g i 5 a, N (.J a H(a) = \ h(i) for 0 5 a 5 min{n,r}. / i=0 The S statistic studied in the previous chapters can be written in the form of H(.) as follows. S(V1,V2) = H(a) - [H(a)-H(a-l)]U where U is an independent random variable distributed uniformly over [0,1]. various The remainer of this chapter is concerned with ways of computing H(a). The computation of the c.d.f. H(a) has attracted great attention. We compare them in terms of in computation time, i.e. the number of time cycles involed the computation. We first define the following notation. 
Ta Tm *3 Q: : number number number number number of of of of of cycles cycles cycles cycles cycles needed needed needed needed needed in in in in in addition, multiplication, division, square root operation, logarithm operation, 110 Tt: number of cycles needed in looking up a standard normal distribution table. Each number is the time needed for each operation when a single CPU is used. It is also the number of segments in each functional pipeline unit; e.g., Ta is the number of segments in an pipeline adder. The computation time for each algorithm discussed below assumes a single CPU unless otherwise specified. The order of operations is important since the word length in a computer is limited. The ratio of two long products should be done by alternating division and multiplication. Since the direct computation involves factorials and a great deal of repetition, it is very time consuming. The computation time is aTa+ (a+l) [(2r-1)Tm+ 2er] The following formula provides a recursive way to compute h(j). 111 m(m-l)...(m-r+l) N(N-l)...(N-r+l) h(j+l) = h(j)X(j) (n-j)(r-j) where X(j) = ------------------ (j+1)(N—n-r+j+l) The computation time for the recursive formula is (r-1)Tm + er + a(2Tm+ 2Td+Ta). 5.2 Peizer Approximation The approximate computation of H(a) has been investigated by many researchers and a recent extensive empirical study [52] of the accuracy of 12 normal and three binomial approximations showed that a normal approximation by Peizer is both far superior to other normal appoximations and simple to compute. Since the binomial tail is almost as difficult to compute as the hypergeometric tail, the binomial approximation is not recommended [52]. In this section, we briefly state the Peizer formula and give a simple discussion on computational complexity. The Peizer approximation is 8(a) ~ Fn(z) where Fn is the c.d.f. of the standard normal distribution, 112 and .02 .01 .01 where A = a+.5, A' = A+(l/6), A" = A'+----+---+--- , B,B',B",C,... are defined in a similar manner from b,c,..., and L = Alog(AN/nr)+Blog(BN/ns)+Ciog(CN/mr)+Dlog(DN/ms). The maximum absolute error of this approximation is .001 if k>2, with the maximum relative (percent) error being 0.71% through 19.7%. The guaranteed number of correct decimal places in this case is at least 3.040 [52].' When k is small, approximations are not needed since the exact computation is trivial. The computation time of Peizer's approximation T(Peizer) is a constant, independent from the vectors V1 and V2. The computation time (using a single CPU) is 37Ta + ZZTm + 18Td + 4Tg + Tr + Tt We noticed that some portions of the computation can be done concurrently, e.g., Alog(AN/nr) and Blog(BN/ns). If we have sufficient hardware and perform all possible concurrent computations, the computation time will be reduced to 113 3Ta+4Tm+Td+Tg+Tr+Tt. 5.3 A Hardware Implementation In recent years, considerable efforts have been devoted to developing special computer architectures for pattern recognition and image processing [20,63]. We desribe a two level pipeline design for the recursive computation of H(a). Section 5.3.1 is an overview of the architecture. A detailed discussion of the pipeline structure is given in Sec.5.3.2. 5.3.1 An Overview of the Architecture The overall structure is shown in Figure 5.1. The input is the observables of vector pair V1 and V2. Since the computation of h(O) can be implemented by a straightforward design or by a table look-up, h(O) is considered as input in our design. 
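As a serial reference for the pipeline described next, the recursive computation of H(a) can be sketched in a few lines (an illustration, not the hardware design): h(0) = m(m-1)...(m-r+1) / [N(N-1)...(N-r+1)] with m = N-n, and h(j+1) = h(j)X(j) with X(j) = (n-j)(r-j) / [(j+1)(N-n-r+j+1)]. Divisions and multiplications are interleaved to keep intermediate values in range.

def hypergeometric_cdf(N, n, r, a):
    # H(a) = sum_{i=0..a} h(i) with h(i) = C(n,i) C(N-n, r-i) / C(N, r),
    # assuming the recoding of Sec. 5.1 so that n <= r <= m = N - n.
    m = N - n
    h = 1.0
    for t in range(r):          # h(0) = m(m-1)...(m-r+1) / [N(N-1)...(N-r+1)]
        h *= (m - t) / (N - t)
    total = h
    for j in range(a):          # h(j+1) = h(j) * X(j)
        h *= (n - j) * (r - j) / ((j + 1) * (N - n - r + j + 1))
        total += h
    return total

The statistic S of Chapter 1 then follows from two such evaluations: S = H(a) - [H(a) - H(a-1)]U with U uniform on [0,1].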
114 Nnra hti+gl lvle-ol Hta) Fig.5.1 Overall Architecture for Computing H(a) The bottom pipeline adder contains z(=Ta) segments. The pipeline multiplier with y(=Tm) segments computes values of h(.) in the following fashion. h(i+y) = h(i)Y(i) where 2(1) = X(i)x(i+l)...X(i+y-l) 115 The box on the upper left corner in Figure 5.1 produces X(j) for different j every cycle. The box right below it computes Y(i) from X(j), which is the main part of this design. The details of those two boxes are shown in Figures 5.2 through 5.6. The boxes marked "MUX' indicates a multiplexer and the small boxes with a decrement or increment input are counters. The input from above to these counters are initiation lines. n r N-n-r+1 1 a ¢-— clock enable .XIJ) Fig.5.2 Compute X(j) 116 5.3.2 Computing Y(i) We discuss this part of architecture in a general environment. The function is Y(i) = x(i) * x(i+1) * * x(i+y) = where "*" can be any operator which is commutative and associative. In this design, a pipeline functional unit with x segments is used and represented by a box marked with a letter ”x". For the hypergeometric case, x=y=Tm. The design of the pipeline can vary with y. We consider y = 2,3,...,15, which are the usual lengths of pipeline functional units in computers such as Cray 1. Figures 5.3 through 5.5 show designs for y=4, 7 and 13. The general design is shown in Figure 5.6. In the cases y=4 and y=13, the design is the same as in the general case. The case y=7 takes advantages of individual y value and thus is different from the general design. A box with one input, one output and a number NUM, represents a NUM-bit shift register. Some of those boxes are small, with NUM=1,2,4. Some are larger with NUM=x,x-l,x-2. 117 XtJJ Fig.5.3 Compute Y(i) from X(j) when y=4 118 [ Fig.5.4 Compute Y(i) from X(j) when y=7 119 xi]: 1 L_:L. , x-l Fig.5.5 Compute Y(i) from X(j) when y=13 120 1 1 [12:] 1 ..---- 91---".% X I X 982 g>2 4 1 o-o-gzoc-o flux IE}:- I" ‘lf’ x —_—._J‘ 984 T!w 49(4 'x g>8 9:6 g<8 HUX Ytil Fig.5.6 Compute Y(i) from X(j) (general design) 121 Multiplexer control is indicated either as conditions at input lines, or as a dash line at the side connected to the control signal. The control signals happen to be the bits in the binary expression of y. Y 3 Y3Y2Y1YO To demonstrate how these pipelines work, we present a time diagram showing the data contents in different portions of the pipeline at different time steps. We pick y=7 and x=3 and give the pipeline design in Figure 5.7. Boxes (Cl,C2,C3), (Fl,F2,F3) and (11,12,13) are pipeline units. Boxes B, D1, D2, El, E2 and G are one-bit shift registers. The time diagram is shown in Figure 5.8. 122 Fig.5.7 Compute Y(i) from X(j) when y=7 and x=3 C1 C3 D1 02 E1 52 F1 F3 H1 H3 11 13 1234567891011 O1234 O123 OO12 123 OO 1 O O12 IO 1 Fig.5.8 Time Diagram of Figure 5.7 (The left most column is a circuit element, 5 4 HQ NH #U 8 N8 U 6 5 UN Ulb 0—9 N.- b r-Q u.- 7 6 5 6 8 F9 NF UN ON HQ #10 HS NO 7 UN Old 019 NIUI UIU 0| 8 N8 0‘8 NF 9N8 Ut- 9 1O 8 9 7 8 8 9 5 6 5 7 4 5 5 5 3 4 4 5 2 3 3 4 1 2 2 3 O 1 3 4 O O 1 2 7 8 4 5 5 7 2 3 4 5 O O 3 4 O O 1 2 12 11 1O U9 UIN eu (Jib OIUI ~10! GIN mm 10 US 019 0&1 13 12 11 1O 11 0|“ Ulb OIUI \IOI ON! 100 hr- 1O 401 (0‘1 93 08 the numbers on the right are the contents in that element.) 14 13 12 11 12 ~19 (DUI ~10 ON] mm 0110 11 UIO q.— 15 14 13 12 13 1O 11 ~10 0‘1 mm ... P10 N OIU (DUI ... 018 ON (0‘! 
16 15 14 13 14 11 12 1O 11 (DO 0% 100 Nib 13 1O 12 1O 17 15 15 14 15 12 13 11 12 1O 11 1O GUI 8‘1 14 2 8 18 17 15 15 15 13 14 12 13 11 12 1O 11 1O 11 15 12 14 1O 12 11 3 9 124 The computation time and the circuit complexity of the part computing Y(i) from X(j) are shown in Table 5.2. The time complexity, denoted by Ty, is the number of time cycles through the pipeline. The circuit complexity, denoted by M is the number of functional units used in the general Y, design in Fig.5.6. Also, we define the following functions. fl(y) - 1 if y is an odd number, = 0 otherwise; f2(y,a) = 1 if y > a, 0 otherwise. 125 Table 5.2 Time and Circuit Complexity of the Part Computing Y(i) from X(j) “I WOQUILIbUN rotor-bro bUNO-‘O HHH-l-l-IH—lhifi-l-I—IH p I] general des1gn for Individual des1gn for all g=2.3.....15 each 9 value when x>3 3110929] ’ 5 11°929] ’ ‘93’92’HIIQg‘1’ xrlog 912+g-1-f119l L ooooooooooooooooooooo db ooooooooooooooooooooooooooo x + 1 x + 1 p OOOOOOOOOOOOOOOOOOOOO C(- OOOOOOOOOOOOOOOOOOOOOOOOOOOO d 2x + 1 2x + 1 p OOOOOOOOOOOOOOOOOOOOO db OOOOOOOOOOOOOOOOOOOOOOOOOOOO 1 2x + 3 2x + 3 p ooooooooooooooooooooo q- ooooooooooooooooooooooooooo .4 3x + 3 3x + 3 3x + 5 3x + 3 3x + 5 3x + 5 ...................... qb-OQo-u—ococnoocoouc---—------J 3x + 7 L 3x + 7 4x + 7 4X * 7 )- ooooooooooooooooooooo «I- ---------------------------- d 4x + 9 4x + 7 4x + 9 4X + 7 p ooooooooooooooooooooo cl oooooooooooooooooooooooooooo .4 4x + 11 4x + 7 p ooooooooooooooooooo db ............................ «1 4x + 11 4x + 11 4x + 13 4x + 11 fl 4x + 13 4x + 13 126 Table 5.2 demonstrated that the individual design for a fixed y value can be more efficient than the general design for y<15. We now derive the total computing time of 8(a), for Fig.5.l, which is denoted by T We first define total‘ TX( ) to be the time for computing X(0) (Fig.5.2), define Tmerge to be the time for the bottom pipeline adder to produce final result after h(a) is fed in the adder. We have T = T total x(.) + TY + Tm + a + 1 + Tmerge' T 3 Tu + T XI.) d T9 2 T.rlong.1 + T. - 1 - t1(Tn’ Llonglj T a Tlrlong31-(2 -r.)+r merge 2(T'.a)(T.-a)rlogza] L1 0921...] T s (2rlongn]+41TI - 2 + Td + a + total + tth-.a)(T--alrlogza] - rllrn’ 127 5.4 Comparison We compare the computation times of all methods in Table 5.3. The word "soft” in parentheses means software serial computation on a single CPU; the word "hard" means special hardware which performs concurrent operation with possible pipeline structures. Since the actual computation in a computer depends on programming and the individual machine, the numbers in Tables 5.3 and 5.4 provide only the order of magnitude. The logarithm operation is performed by table look-up, all of which take 4 time cycles. The square root operation assumes a standard algorithm [39]. Based on Cray-l parameters, every cycle is 12.5 nsec and the number of cycles for various operations are as follows. Ta = 6, T111 8 7, Td-29, T9 8 4, Tr =30, Tt- 4. The results in Table 5.3 are computed according to [64]. In Table 5.4, we show how the computing time ranges over values of a for various methods when r=50, 150 and 450. The times are in number of cycles. All the times are computed when Cray-l parameters are used [10]. 128 Table 5.3 Computation Time Comparison general form of typical computing computing time time based on Crag 1 °"°°‘ aT +ta+11112r-11T +2rT 1 72ar+72r-a-7 (soft) a I d recursive _ _ (soft) . (r ilTl+er+a12Tn+2Td+Tal 36r+78a 7 Ll092TIJ recursive (2rlong.‘|+4lTn 2 +Td+a+ a + 94 + (hard) +£21T..al(Tu-a1r109231-tliT.) 
+t2(7.a117-a)rlo92a] Peizer (sort) 37Ta+22Tn+iBTd+4Tg+TP+Tt 1026 Peizer (hard) 373* “T.* Tg’ 79*Trth 113 Table 5.4 Typical Computation Time Ranges ( for a=0,1,2,...,r) r=50 r=150 r=450 + --------- + ------------- + ---------------- + ---------------- + : direct :3,593 ~ :10,793 ~ :32,393 ~ : : (soft) : 183,543| 1,630,643: 6,223,335: + --------- + ------------- + ---------------- + ---------------- + :recursive:l,793 ~ : 5,393 ~ :16,l93 ~ : : (soft) : 5,693: 17,093: 51,293: + --------- + ------------- + ---------------- + ---------------- + :recursive: 94 ~ : 94 ~ : 94 ~ : : (hard) : 144: 244: 544: + --------- + ------------- + ---------------- + ---------------- + : Peizer : : : (soft) : 1'025 : + --------- + ----------------------------------------------- + : Peizer : i : (hard) : 113 : + --------- + ----------------------------------------------- + 129 'The computing time of the Peizer approximation is problem independent. Therefore this approximation is prefered when a > 932 and the exact value of 8(a) is not necessary. The hardware approach to the recursive exact method is much faster than other exact methods in all cases. It is faster than Peizer approximation using single CPU when a < 932. 5.5 Summary and Conclusions In this chapter, we reviewed some existing techniques for computing the c.d.f. of the hypergeometric distribution. Since the direct computation is very time consuming, an approximation is desired in many cases. The recursive algorithm for the exact computation avoids much of the repetition of the direct method. We proposed a hardware pipeline architecture which implements the recursive formula and presented a detailed design. The main part of the design is actually a possible solution for a more general problem. This pipeline architecture is computationally efficient. For a large range of values of parameter a, the proposed architecture is 100 times to 10,000 times faster than the direct computation, 50 times to 100 times faster than the recursive method using single CPU, and two to ten times faster than the Peizer approximation using a single 130 CPU. This design uses six functional units for pipelines with no more than 15 segments. The modularity and the regularity of the system make it suited for VLSI implementation. CHAPTER 6 CONCLUSION AND FUTURE RESEARCH 6.1 Conclusions The statistic S has a known distribution under a permutation null hypothesis and its critical level and threshold are easy to determine. This statistic is a good similarity measure between features in preliminary analysis for binary features. It has about the same power as the correlation coefficient. The statistic S is a good similarity measure between feature and category in binary tree classifier design. It gives similar trees as the mutual information criterion. The statistic S is also a reasonable adequacy measure for binary partitions in cluster validity. It is different but highly correlated to Gamma and other commonly used measures. In image template matching, S acts as well as other sub-optimal similarity measures between sub-images. An approximatation to the likelihood statistic is proposed. The hardware architecture 131 132 design is more efficient than other methods for computing hypergeometric c.d.f.'s. The original contributions of this thesis are summarized below: (1) Adaption of the cumulative hypergeometric distribution to the definition of similarity measures in pattern recognition. 
(2) Definition of the preliminary feature analysis problem in pattern recognition and the use of S in this analysis (Chapter 2).

(3) Study of the relations among adequacy measures for binary partitions and the successful application of S in binary tree classifier design (Chapter 3).

(4) Definition of the alternative hypothesis in image template matching and the design of a modified Neyman-Pearson statistic (Chapter 4).

(5) Design of a hardware implementation for computing the hypergeometric c.d.f. which is much faster than conventional means (Chapter 5).

6.2 Future Research

Future research includes extension to multivalued vectors, extension to two-sided tests in preliminary feature analysis, extension to multi-class tree classifier design, and a multiple stage approach to image template matching.

6.2.1 Extension to Multi-Valued Vectors

This thesis involved binary vectors. The definition of a permutation statistic for multivalued nominal vectors is a direct extension, provided we can define a "match" between nominal values. When the match is obvious from common sense or from the physical meaning of the variables, the definition of S is straightforward. An example is given in Appendix E for a special case. Since not very many similarity measures exist for general nominal vectors, S could be an alternative; it will have a known distribution, as in the binary case. Since the computational complexity is high even for a moderate number of values, some algorithm or hardware structure must be used to apply this multi-valued permutation statistic to the problems in this thesis.

6.2.2 Two-Sided Tests for Preliminary Analysis

In Chapter 2 the tests are one-sided, since we are only interested in positive relations between features. This is true for questionnaire data, but not for cases where the (0,1) coding is irrelevant. Our statistic could then be max(|S(i,j) - 0.5|) and would still have a known distribution. It will be interesting to compare it with max(|C(i,j)|).

6.2.3 Multi-class Tree Classifier Design

Tree classifiers have been used for classification problems with a large number of pattern classes. At each node, the recognition problem is to identify the subset to which the unknown pattern belongs. If the tree itself is a binary tree, each node still deals with a binary partitioning of a pattern set. A multivalued version of S, or some hierarchy of the binary version of S, might be used to assess the association of features with the grouping of class symbols.

6.2.4 Multi-stage and Sequential Approach to Template Matching

In image template matching a multi-stage approach might have certain advantages over the two-stage approach. An analysis of the optimal number of stages in a hypothesis testing framework could be a future topic. It makes no sense to scan the entire image after the true location of the object has been found; a sequential decision making scheme under our hypothesis testing framework might be more efficient than the traditional approach. The hardware implementation of the computation of the proposed statistics may also be a future research topic.

APPENDIX A

THE THRESHOLDS AT THE BOUNDARIES OF RUNS GIVE HIGHEST S VALUES

Let h(n1,i) denote h(n,n1,n2,i) for fixed n and n2, where h(.,.,.,.) is defined as in Sec. 2.1, and let

    H(n1,n11) = Σ h(n1,i),  the sum running over i = 0,1,...,n11.

We need to prove that H(n1,n11) < H(n1+1,n11+1) for n1+n2-n ≤ n11 ≤ min(n1,n2). We proceed by induction on n11.
Base case (n11 ≤ 1):

(1) If n11 = 0,

    H(1,0) = (n - n2)/(2n),
    H(2,1) = (n - n2)(n + n2 - 1)/[n(n - 1)] > (n - n2)/(2n).

(2) If n11 = 1,

    H(1,1) = (n + n2)/(2n),
    H(2,2) = 1 > (n + n2)/(2n).

Induction step. Assume H(n1,n11) < H(n1+1,n11+1) for a particular n11.

(1) If n11 ≤ min(n1,n2): since

    h(n1+1,i) = c_i h(n1,i),      where 0 < c_i = (n1+1)(n - n1 - n2 + i)/[(n - n1)(n1 + 1 - i)],
    h(n1+2,i+1) = d_i h(n1+1,i+1), where 0 < d_i = (n1+2)(n - n1 - n2 + i)/[(n - n1 - 1)(n1 + 1 - i)],

therefore

    Σ [h(n1,i) - h(n1+1,i+1)] < Σ [c_i h(n1,i) - d_i h(n1+1,i+1)] < 0.

(2) If n2 > n1 and n11 ≤ n1 + 1,

    h(n1+2, n11+1) = (n1+2)(n2 - n1 - 1) [...] h(n1+1, n11+1)

Each entry of the following table corresponds to the hypothesis Power(i) > Power(j), where i (row) and j (column) can refer to any of the five statistics above. The first number of each entry is the actual t value and the second number is the critical level w_t. A positive t value indicates acceptance of the hypothesis at level w_t; a negative t value indicates acceptance of the reverse hypothesis. For example, Power(Dx) > Power(Jx) is accepted at level 0.36; Power(Cx) < Power(Gx) is accepted at level 0.10.

              PWJ            PWC            PWS            PWG
    PWD       .367 (.36)     .472 (.32)     .501 (.31)     -0.748 (.23)
    PWJ                      .092 (.46)     .122 (.45)     -1.146 (.13)
    PWC                                     .031 (.49)     -1.289 (.10)
    PWS                                                    -1.321 (.09)

This table indicates that Gx is more powerful than all the others and that Dx is slightly more powerful than the other three.

APPENDIX D

SUMMARY OF RESULTS FOR SEC. 4.3.1

The following table compares analytical approximations of the powers of D and G. In each box, the first integer is the number of combinations of (p, p', a, b) for which Power(D) < Power(G); the second integer is the number of cases for which Power(D) > Power(G). There are 18 cases for each box.

                            image size / template size
    subtemplate size        16x16 / 8x8        32x32 / 12x12
         19                 10 , 4             10 , 4
         25                 12 , 2             12 , 2
         33                 15 , 2              8 , 4
         39                 11 , 5             11 , 6

We consider all 144 pairs of power approximations as random samples and perform a standard two-sample t-test. The resulting t value is 2.16648, and the hypothesis Power(D) < Power(G) is accepted at significance level (1 - 0.975).

APPENDIX E

AN EXAMPLE OF THE PERMUTATION STATISTIC IN MULTIVALUED NOMINAL VECTOR CASES

Consider nominal vectors V1 with 3 possible values and V2 with 4 possible values. We define a "match" between the values of the vector components as follows; the letter "x" means the corresponding values in the two vectors are considered a "match".

                        V2
                  1     2     3     4
            1                 x     x
    V1      2           x
            3     x     x

Let n be the vector length, let ni* be the number of components taking value i in V1, let n*j be the number of components taking value j in V2, and let xij be the number of components for which V1 takes value i and V2 takes value j. The following table indicates the relation among them.

                        V2
                  1     2     3     4
            1                 x13   x14     n1*
    V1      2           x22                 n2*
            3     x31   x32                 n3*
                  n*1   n*2   n*3   n*4     n

Let the random experiment be the permutation of V2, let the random variable M be the number of matches between V1 and V2, and let m be the observed value of M in the original vector pair. Then a permutation statistic S can be defined as

    S = Pr(M ≤ m) - Pr(M = m)U,

where U is as defined in Chapter 1.
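For this example the value of S can also be checked empirically. The following Python sketch is a minimal illustration and uses names that are not from the thesis: it estimates Pr(M ≤ m) and Pr(M = m) by sampling random permutations of V2, uses the match pattern marked "x" above, and takes U to be a Uniform(0,1) randomization variable, which is how S = Pr(M ≤ m) - Pr(M = m)U is read here.

    import random

    # Value pairs (V1 value, V2 value) that count as a "match" in the example above.
    MATCHES = {(1, 3), (1, 4), (2, 2), (3, 1), (3, 2)}

    def count_matches(v1, v2):
        """Number of positions whose (V1, V2) value pair is a designated match."""
        return sum((a, b) in MATCHES for a, b in zip(v1, v2))

    def permutation_statistic(v1, v2, trials=20000, seed=0):
        """Monte Carlo estimate of S = Pr(M <= m) - Pr(M = m)*U over random
        permutations of V2, with U drawn from Uniform(0, 1)."""
        rng = random.Random(seed)
        m = count_matches(v1, v2)            # observed number of matches
        v2 = list(v2)
        le = eq = 0
        for _ in range(trials):
            rng.shuffle(v2)                  # one random permutation of V2
            mm = count_matches(v1, v2)
            le += mm <= m
            eq += mm == m
        return le / trials - (eq / trials) * rng.random()

    # Two short nominal vectors taking values in {1,2,3} and {1,2,3,4}.
    v1 = [1, 1, 2, 3, 3, 2, 1, 3]
    v2 = [3, 4, 2, 1, 2, 3, 1, 2]
    print(round(permutation_statistic(v1, v2), 3))

With enough permutations the estimate approaches the exact value of S, so a sketch of this kind is mainly useful as a sanity check on a chosen match definition before the exact distribution of M is worked out.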
LIST OF REFERENCES

1. M.R.Anderberg, Cluster Analysis for Applications, Academic Press, New York, 1973.

2. P.Argentiero, R.Chin and P.Beaudet, "An automated approach to the design of decision tree classifiers", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.PAMI-4, pp.51-57, Jan. 1982.

3. F.B.Baker and L.J.Hubert, "Measuring the power of hierarchical cluster analysis", J. of American Statistical Association, Vol.70, pp.31-38, Mar. 1975.

4. R.R.Bahadur, "A representation of the joint distribution of responses to n dichotomous items", Studies in Item Analysis and Prediction, H.Solomon (Ed.), Palo Alto, Calif.: Stanford Univ. Press, pp.158-168, 1961.

5. D.I.Barnea and H.F.Silverman, "A class of algorithms for fast digital image registration", IEEE Trans. Computers, Vol.C-21, No.2, pp.179-186, 1972.

6. P.L.Brockett, P.D.Haaland and A.Levine, "Information theoretic analysis of questionnaire data", IEEE Trans. Information Theory, Vol.IT-27, pp.438-445, 1981.

7. W.J.Conover, Practical Nonparametric Statistics, 2nd ed., John Wiley & Sons, New York, 1980.

8. T.M.Cover, "The two best independent measurements are not the best two", IEEE Trans. Systems, Man and Cybernetics, Vol.SMC-4, pp.116-117, 1974.

9. T.M.Cover and J.M.VanCampenhout, "On the possible orderings in the measurement selection problem", IEEE Trans. Systems, Man and Cybernetics, Vol.SMC-7, pp.657-661, 1977.

10. Cray Research, Cray-1 Computer System Hardware Reference Manual 2240004, 1977.

11. P.J.Diggle, "On parameter estimation and goodness-of-fit testing for spatial point patterns", Biometrics, Vol.35, pp.87-101, March 1979.

12. R.Dubes and A.Jain, "Clustering techniques: the user's dilemma", Pattern Recognition, Vol.8, pp.247-260, 1976.

13. R.Dubes and A.Jain, "Validity studies in clustering methodologies", Pattern Recognition, Vol.11, pp.235-254, 1979.

14. J.D.Elashoff, R.M.Elashoff and G.E.Goldman, "On the choice of variables in classification problems with dichotomous variables", Biometrika, Vol.54, pp.668-670, 1967.

15. G.S.Fang, "A note on optimal selection of independent observations", IEEE Trans. Systems, Man and Cybernetics, Vol.SMC-6, pp.309-311, 1979.

16. R.V.Foutz, "A method for constructing exact tests from test statistics that have unknown null distributions", Journal of Statistical Computation and Simulation, Vol.10, pp.187-193, 1980.

17. E.B.Fowlkes and C.L.Mallows, "A method for comparing two hierarchical clusterings", J. of American Statistical Association, Vol.78, pp.553-569, Sept. 1983.

18. E.B.Fowlkes and C.L.Mallows, "Rejoinder", J. of American Statistical Association, p.584, 1983.

19. J.M.Friedman and L.C.Rafsky, "Graph-theoretic measures of multivariate association and prediction", The Annals of Statistics, Vol.11, No.2, pp.377-391, 1983.

20. K.S.Fu and T.Ichikawa, Special Computer Architectures for Pattern Processing, CRC Press, Boca Raton, Fla., 1982.

21. G.V.Glass and A.R.Hakstian, "Measures of association in comparative experiments: their development and interpretation", American Educational Research Journal, pp.403-414, 1969.
D.W.Goodall, "A probabilistic similarity index”, Nature, 1964. D.W.Goodall, ”A new similarity index based on probability", Biometrics, pp.882-907, 1966. D.W.Goodall, ”Numerical taxonomy of bacteria - some published data re-examined", Journal of General Microbiology, Vol.42, pp.25-37, 1966. L.A.Goodman and Kruskal,W.H., ”Measures of association for cross classifications". J. of American Statistical Association, pp.732-764, 1954. L.A.Goodman and Kruskal,W.H., "Measures of association for cross classifications. II: further discussion and references", J. of American Statistical Association, Vol.54, pp.123-163, 1959. 27. 28. 29. 30. 31. 32. 33. 34. 35. 148 L.A.Goodman and Kruskal,W.H., "Measures of association for cross classifications. III: approximate sample theory", J. of American Statistical Association, Vol.58, pp.310-364, 1963. L.A.Goodman and Kruskal,W.H., ”Measures of association for cross classifications. IV: Simplification of asymptotic variances", J. of American Statistical Association, Vol.67, pp.415-447, 1972. A.Goshtasby, S.H.Gage, and J.F.Bartholic, "A two-stage cross correlation approach to template matching,“ IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.PAMI-G, No.3, pp.374-378, May 1984. z.Hubalek, "Coefficients of association and similarity, based on binary (presence, absence) data - an evaluation", Biological Review, Vol.57, pp.669-689, 1982. L.Hubert and J.Schultz, "Quadratic assignment as a general data analysis strategy”, British Journal Math. Statist. Psychol., Vol.29, pp.l90-241, 1976. L.Hubert and J.R.Levin, ”A general statistical framework for assessing categorical clustering in free recall", Psychological Bulletin, Vol.83, No.6, pp.lO72-1080, 1976. L.Hubert, "The comparison and fitting of given classification schemes", Journal of Mathematical Psychology, Vol.16, No.3, pp.233-253, 1977. ' L.Hubert, "Generalized proximity function comparisons", British Journal Math. Statist. Psychol., Vol.31, pp.l79-192, 1978. L.Hubert, "Generalized concordance", Psychometrika, Vol.44, No.2, pp.135-l42, 1979. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 149 L.Hubert and Subkoviak,M.J., "Confirmatory Inference and Geometric Models”, Psychological Bulletin, Vol 86,No.2,361-370, 1979. L.Hubert and R.G.Golledge, "A heuristic method for the comparison related structures", Journal of mathematical psychology, Vol.23, pp.214-226, 1981. G.E.Hughes,"0n the mean accuracy of statistical pattern recognizers”, IEEE Trans. Information Theory, Vol.IT-lS, pp.420-421, Jan. 1969. K.Hwang, Computer Arithmetic, John Wiley & Sons, New York, 1979. A.K.Jain and R.C. Dubes, "Feature definition in pattern recognition with small sample size”, Pattern recognition, Vol.10, pp.85-97, 1978. M.F.Janowitz, "Similarity measures on binary data", Systematic Zoology, pp.342-359, 1980. I.T.Jolliffe and B.J.T. Morgan, "Comment", J. of American Statistical Association, pp.580-581, 1983. M.D.Kelly, "Edge detection in pictures by computer using planning," Matching Intelligence 6 (D. Michie,Ed), pp.379-409, Edinburgh Univ. Press, Edinburgh, 1971. O.Kempthorne, Design and analysis pf experiments , John Wiley & Sons, N.Y., 1952. J.B.Kruskal, ”Multidimensional scaling and other methods for discovering structure," Statistical Methods for Digital Computers, pp.296-339, 1977. A.Kulkarni and L.Kanal, "An optimization approach to hierarchical classifier design", 3rd 47. 48. 49. 50. 51. 52. 53. 54. 55. 150 International. Joint Conf. on Pattern Recognition, pp.459-466, 1976. 
A.Kulkarni, "On the mean accuracy of hierarchical classifiers", IEEE Trans. Computers, Vol.C-27, No.8, pp.771-776, Aug. 1978. A.Kulkarni, ”Admissible search strategaies for paramtric and nonparametric hierarchical classifiers", 4th International. Joint Conf. on Pattern Recognition, pp.238-248, 1978. M.Rurzynski, "The optimal strategy of tree classifier", Pattern Recognition, Vol.16, No.1, pp.81-87, 1983. G.Landeweerd, T.Timmers, E.Gelsema, M.Bins and M.Ralie, "Binary tree versus single level tree classification of white blood cells", Pattern Recognition, Vol.16, No.6, pp.57l-577, 1983. Y.K.Lin and R.S.Fu, "Automatic classification of cervical cells using a binary tree classifier”, Pattern Recognition, Vol.16, No.1, pp.69-80, 1983. R.F.Ling and J.W.Pratt, "The accuracy of Peizer approximations to the hypergeometric distribution, with comparisons to some other approximations," J. of American Statistical Association, Vol.79, No.385, pp.49-60, Mar.1984 N.Mantel, "The detection of disease clustering and a generalized regression aproach," Cancer Research, Vol.27, pp.209-220, Feb.l967. N.Mantel and R.S.Va1and, “A technique of nonparametric multivariate analysis", Biometrics, Vol.26, pp.547-558, Sept.l970. W.Meisel and D.Michalopoulos, "A partitioning 56. 57. 58. 59. 60. 61. 62. 63. 64. 151 algorithm with application and the optimization of decision trees”, IEEE Trans. Computers, Vol.C-22, No.1, Jan.l973. G.W.Milligan, "A Monte Carlo study of thirty internal criterion measure for cluster analysis," Psychometrika, Vol.46, pp.187-l99, June 1981. H.Moravec and D.Gennery, Cart Project Progress Report, Stanford Artif. Intell. Project, Internal Memo. Oct. 1976. A.N.Mucciardi and E.E.Gose, "A comparison of seven techniques for choosing subsets of pattern recognition properties", IEEE Trans. Comput., Vol.20, pp.1023-1031, 1971. J.Mui and R.S.Fu, "Automated classification of nucleated blood cells using a binary tree classifier", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.PAMI-Z, No.5, pp.429-443, Sept.l980. C.Munteanu, "Evaluation of the sequential similarity detection algorithm applied to binary images," Pattern Recognition, Vol.13, No.2, pp.167-175, 1981. R.N.Nagel and A.Rosenfeld, "Ordered search techniques in template matching," Proc. IEEE, Vol.60, pp.242-244, 1972. J.W.V.Ness,"Comment”, J. of American Statistical Association, pp.576-579, 1983. L.M.Ni and A.K.Jain, ”A VLSI systolic architecture for pattern clustering," to appear on IEEE Trans. Pattern Analysis and Machine Intelligence. L.M.Ni and K.Hwang, "Vector reduction techniques 65. 66. 67. 68. 69. 70. 71. 72. 73. 152 for arithmetic pipelines," to appear on IEEE Trans. on Computers. H.Payne and W.Meisel, "An algorithm for constructing optimal binary decision trees", IEEE Trans. Computers, Vol.C-26, No.9, pp.905-916, Sept.1977. M.R.Ramapriyan, “A multilevel approach to sequential detection of pictorial features", IEEE Trans. Computers, Vol.C-25, No.1, pp.66-78, 1976. W.M.Rand, "Objective criteria for evaluation of clustering methods," J. of American Statistical Association, Vol.66, pp.846-850, 1971. A.Rosenfeld, Picture processing py computer. New York, Academic Press, 1969. A.Rosenfeld and G.J.Vanderbrug, ”Coarse-fine template matching", IEEE Trans. System, Man and Cybernetics, Vol.SMC-7, No.2, pp.104-107, 1977. E.Rounds, ”A combined nonparametric approach to feature selection and binary decision tree design”, Pattern Recognition, Vol.12, pp.313-3l7, 1980. 
71. G.E.Sarndal, "A comparative study of association measures", Psychometrika, Vol.39, pp.165-187, 1974.

72. J.Schuermann and W.Doster, "A decision theoretic approach to hierarchical classifier design", Pattern Recognition, Vol.17, No.3, pp.359-369, 1984.

73. I.Sethi and B.Chatterjee, "Machine recognition of constrained hand printed Devanagari", Pattern Recognition, Vol.9, pp.69-75, 1977.

74. I.Sethi and G.Sarvarayudu, "Hierarchical classifier design using mutual information", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.PAMI-4, No.4, pp.441-445, July 1982.

75. Q.Y.Shi and K.S.Fu, "A method for the design of binary tree classifiers", Pattern Recognition, Vol.16, No.6, pp.593-603, 1983.

76. J.Slagle and R.Lee, "Application of game tree searching techniques to sequential pattern recognition", Communications of the ACM, Vol.14, No.2, Feb. 1971.

77. P.H.A.Sneath and R.R.Sokal, Numerical Taxonomy, W.H.Freeman, San Francisco, 1973.

78. Special Issue on feature extraction and selection in pattern recognition, IEEE Trans. Comput., Vol.20, 1971.

79. W.Stallings, "Approaches to Chinese character recognition", Pattern Recognition, Vol.8, pp.87-98, 1976.

80. M.Svedlow, C.D.McGillem and P.E.Anuta, "Experimental examination of similarity measures and preprocessing methods used for image registration", Symp. Machine Processing of Remotely Sensed Data, 1976.

81. P.Swain and H.Hauska, "The decision tree classifier: design and potential", IEEE Trans. Geoscience Electronics, Vol.GE-15, No.3, pp.142-147, July 1977.

82. S.L.Tanimoto, "Template matching in pyramids", Computer Graphics and Image Processing, Vol.16, pp.356-369, 1981.

83. G.J.Vanderbrug and A.Rosenfeld, "Two-stage template matching", IEEE Trans. Computers, Vol.C-26, No.4, pp.384-393, 1977.

84. D.L.Wallace, "Comment", J. of American Statistical Association, pp.569-576, 1983.

85. Q.R.Wang and C.Y.Suen, "Analysis and design of a decision tree based on entropy reduction and its application to large character set recognition", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.PAMI-6, No.4, pp.406-417, 1984.

86. A.Whitney, "A direct method of nonparametric measurement selection", IEEE Trans. Computers, Vol.C-20, pp.1100-1103, 1971.

87. R.Y.Wong and E.L.Hall, "Scene matching with invariant moments", Computer Graphics and Image Processing, Vol.8, pp.16-24, 1978.

88. R.Y.Wong and E.L.Hall, "Sequential hierarchical scene matching", IEEE Trans. Computers, Vol.C-27, No.4, April 1978.

89. R.Y.Wong and E.L.Hall, "Performance comparison of scene matching techniques", IEEE Trans. Pattern Analysis and Machine Intelligence, Vol.PAMI-1, No.3, July 1979.

90. R.C.You and K.S.Fu, "An approach to the design of a linear binary tree classifier", 1976 Machine Processing of Remotely Sensed Data, pp.3A-1 - 3A-10.