A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM

Thesis for the Degree of Ph.D.
Michigan State University
Vasudeva Ananda Rao
1971

This is to certify that the thesis entitled A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM presented by Vasudeva Ananda Rao has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science. Major professor. Date: December 8, 1971.

ABSTRACT

A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM

By Vasudeva Ananda Rao

A new mathematical model is proposed for the clustering problem encountered in data analysis and pattern recognition. The set of multivariate observations on the objects to be grouped is considered as having been generated by an unknown multivariate continuous probability distribution having one or more distinct modes. This distribution is not treated as a mixture of several source pdfs. The clustering problem is identified as that of (1) the estimation of the number and location of the modes of such a distribution and (2) the selection of a suitable distance measure to group the observations based on a 'similarity measure' defined as the distance of each observation from the modes. A practical method is proposed for estimating the number and location of the modes and for detecting the structure in the data in the form of clusters. This method does not require that the number of clusters desired be specified in advance. For pattern classification in the absence of training patterns from each class, the clusters so detected are treated as the sets of training patterns for "learning" purposes. An algorithm is presented for classification of patterns from an unknown class using a minimum distance-to-mode decision rule. Apart from the Euclidean and Mahalanobis generalized distance measures, two other intuitively appealing distance measures are also discussed. Excellent numerical results have been obtained using test data.

A MULTIMODAL DISTRIBUTION BASED CLUSTERING ALGORITHM

By Vasudeva Ananda Rao

A THESIS submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY, Department of Computer Science, 1971.

TO MY MOTHER

ACKNOWLEDGEMENTS

The people to whom one is indebted in acquiring an education are indeed numerous. It is hoped that these people will accept my sincere thanks. I am highly grateful to Dr. B. Weinberg for his guidance and encouragement. He suggested the problem as a potentially fruitful area for research and devoted a generous part of his time to discussions leading to this thesis. For their guidance and interest I wish to thank the other members of my doctoral committee: Dr. R.C. Dubes, Dr. C.V. Page, Dr. M. Rahimi of the Department of Computer Science and Dr. J.S. Frame of the Department of Mathematics. For the financial support which made this work possible I wish to express my gratitude to: the United Nations Educational, Scientific and Cultural Organization, Paris; the Government of India, New Delhi; the Department of Technical Education, Government of Tamilnadu, Madras; and the Department of Computer Science, Michigan State University. Special thanks are due to my mother, wife and children for their patience and understanding.

TABLE OF CONTENTS

List of Tables
List of Figures

Chapter I    Motivation
    1.1  Introduction
O O O O C O I O O O O O O C C C O C O O 1 1.2 Pattern Recognition and Cluster Analysis ... 1 1 .3 Literature survey 0 C O O I O O C O O O O I O O O O O C O O O O O O O 4 1.4 Contributions of the Thesis ....... ......... 8 1.5 Organization of the Thesis .. ........ . ...... 10 II Mathematical Preliminaries ....... ..... ..... ..... 11 2.1 Introduction ...... ......................... 11 2.2 A Mathematical Model for the Clustering PrOblem ......00............OOCOOCOOOOOOOOOO 11 2.3 Estimation of Modes and Mathematical Results CO.........0.0.0.000...00.00.0000... 13 2.4 Chapter Summary ............................ 17 III The Mode Seeking Algorithm ............... . ...... 18 3.1 Introduction ............................... 18 3.2 Mode Estimation Procedure for the Multivariate Case ... ..... .......... ...... .. 19 3.3 Mathematical Details of the Algorithm ...... 20 3.4 Algorithm for Mode Estimation - the Univariate case I C O O I O C O O C O O O I O O O O O O O I I O O O O I 26 3.5 Results Of Tests 0 O O O O O O O O O O O O O O O O O O O O O O O O O O 33 3 O 6 Chapter sumry O I C O O O O O O O O O O O O O O O O C O O O O O O O O 39 IV Cluster Seeking and Pattern Classification ...... 41 4 .1 IntrOduction O O C O O O O O O O O I O O O O O C O O O O O O O O O O O O O 41 4.2 A Similarity Measure and an Algorithm for Clustering .0......OOOOOOOOOOOOOOOOOO0...... 42 4.3 Distance Measures for Classification ....... 43 4.4 A Classification Algorithm ................. 46 4.5 Results ......C.......COCOOOOOOOOOOOOOOO...O 46 4.6 Chapter Summary ..... ....... . ...... . ........ 48 iv Chapter Page V Summary and Conclusions ...................... ... 52 5 .1 T118818 sumry O O O O O O O O O O O O O O O O O O O O O O O O O O O O O 52 5 .2 conCIuSions O O Q C I O O O O O O O O O O O O O O O O 0 O O O O O O O C O O 53 5.3 Suggestions for Future Work ................ 54 Bibliography ... ...... ........................... 56 AppendiXA .........O..........OOOOOOOOOOOOOOOO 61 Table LIST OF TABLES Results of mode estimation - simulated data ..... Results of mode estimation - real data ..... Results of clustering and classification algorithm I00............OIOOOOOOOOOOOOOO Results of classification using training samples ......0.0..........OOOOOOOOOOOOO. vi Page 34 38 49 50 Figure LIST OF FIGURES Flow chart for multivariate mode estimation ..... Two dimensional mode estimation example ......... Flow chart for univariate mode estimation ....... Flow chart for the clustering algorithm ........ Flow chart for classification .................. Explanation of the asymmetrical distance measure vii Page 21 23 30 44 47 66 CHAPTER I Motivation 1.1 Introduction: In many diverse disciplines the scientist collects data in the form of p measurements or observations on each of N indi- viduals or objects. He is interested in grouping these individuals into subgroups in such a manner that members of a subgroup are highly similar or associated and relatively unassociated with members not belonging to the subgroup. 
For example: in electrical engineering, one is interested in the detection of signals of unknown characteristics that recur frequently in a background of random noise; in medicine, to group electrocardiograms or EEGs into subgroups [D-1]; in psychology and sociology, to group people into types that may relate to treatment categories or behaviour categories; in information retrieval, to find classes of descriptors for articles and papers [G-1]; in numerical taxonomy, to group species of living organisms into hierarchic trees, etc. Such grouping is a very valuable tool in data analysis.

1.2 Pattern Recognition and Cluster Analysis:

One of the aims of pattern recognition study is to find meaningful descriptions that adequately characterize a set of data. The principal objective of cluster analysis as applied to pattern recognition is to gain more information about the structure of a data set than is possible by more conventional methods such as factor analysis or principal components analysis [N-1].

In the statistical sense the pattern classification problem (two-class case) may be defined as a discriminatory problem as follows: a vector random variable X of observed values x is distributed over some p-dimensional space according to distribution F or G. The problem is to determine which of the two distributions x came from.

A statistical approach to the above problem can be considered to fall into one of three subproblems:

(i) F and G are completely known.
(ii) F and G are known with the exception of some finite set of parameters.
(iii) F and G are unknown except possibly for assumptions about the existence of densities, continuity, symmetry, etc.

Subproblem (i) has been completely solved by the Neyman-Pearson lemma and results in a likelihood ratio (LR) test. This LR test yields optimum results in the sense of minimum probability of misclassification.

In the formulation of the approach to subproblem (ii) one assumes the availability of training samples from each of the two distributions. It is also assumed that the forms of F and G are given but one or more parameters are unknown. The problem then reduces to one of parameter estimation, or one can also employ learning strategies. Once these parameters have been estimated using the training samples, the LR test is applied as though F and G were completely known.

The "parametric" approaches to subproblems (i) and (ii) appear reasonable provided the assumptions made are justifiable in practice, in that the assumed parametric forms are good representations of the data. But when little a priori knowledge exists about the underlying probability distribution associated with each pattern class, these parametric approaches to the classification problem become questionable in the sense that bad results may be obtained. This conclusion has led researchers to require less stringent assumptions about the form of the data structure and this, in turn, resulted in the emergence of a variety of nonparametric pattern recognition procedures as approaches to subproblem (iii).

One such approach is to treat the pdf of the observations as a mixture of several unknown source pdfs and then try to identify these pdfs. Such an approach has been considered, among others, by Teicher [T-1, T-2, T-3], Yakowitz [Y-1, Y-2], Yakowitz and Spragins [Y-3], and Stanat [S-5].
Application and development of this technique under the name of "unsupervised learning" or "learning without a teacher" has been mainly done by Fralick [F-3], Spragins [S-4], Patrick [P-3], Hilborn and Lainiotis [H-1], Patrick and Costello [P-4] and Patrick and Hancock [P-2]. However, it is known that the class of mixture distributions which have a unique solution for the parameters of the individual distributions constituting the mixture is limited, and whether or not a mixture admits a unique solution depends on its identifiability [F-6, Y-1, Y-2, Y-3].

Cluster analysis is another type of approach which has been pursued for a solution to subproblem (iii). It is a non-parametric technique to determine a type of structure describing a set of empirical data. A second way that cluster detection is applicable to pattern recognition is by providing an answer to the question whether or not a given set of features constitutes a good feature space in which to discriminate a given set of pattern classes [Z-1].

1.3 Literature Survey:

Clustering techniques have been used as long ago as 1939 by Tryon [T-4]. At present there are a number of proposed clustering procedures available. Widely used in numerical taxonomy [S-3] are agglomerative and divisive hierarchical clustering schemes. Here a small and fixed set of patterns is given and a matrix is computed whose (i,j)-th entry is the association or similarity between the i-th and j-th patterns. The agglomerative procedures [L-2] generally link together the most similar patterns. Then the similarities between the groups of grouped patterns and the remaining groups (or patterns) are recomputed using the minimum, maximum, or mean similarity between the two groups. The procedure continues in this manner, linking together the most similar patterns or groups. Michener and Sokal [M-4], Ward [W-1], and McQuitty [M-2] were early users of this scheme. In a recent paper Rohlf [R-2] describes sequential agglomerative hierarchical clustering schemes in particular detail and proposes several new methods.

The divisive procedures [L-2] begin with all patterns in the same group and split the group into the two most dissimilar groups. Edwards and Cavalli-Sforza [E-1] suggest dividing the points or patterns into two groups such that the sum of squared distances between the sets is a maximum. They define this as a cluster and suggest an algorithm to find the two sets with the desired property. Because the total sum of squared distances is a constant for a given sample of points, maximizing the between-set sum of squared distances is equivalent to minimizing the within-set sum of squared distances. Their algorithm is to examine all $2^{N-1} - 1$ partitions of the N points and select the one which gives the minimum within-set sum of squared distances. Lance and Williams [L-1] suggest successively splitting the groups in a way which is expected to reduce the variance the most for the split groups. Mattson and Dammann [M-1] suggest successively splitting each group by thresholding the dominant eigenvector of the covariance matrix of that group. Wirth et al. [W-4] suggest thresholding the association or similarity matrix and defining the components of the resulting graph as clusters. Thresholding is done successively from strict thresholds to more liberal thresholds.
Jardine and Sibson [J-1] outline a theoretical framework within which the properties of classificatory systems, which operate on data in the form of a dissimilarity coefficient on a set of objects, may be discussed. Hierarchical clustering schemes are also discussed by Johnson [J-2]. The three superficially different hierarchical clustering schemes of Sokal and Michener [M-4], Edwards and Cavalli-Sforza [E-1], and Williams and Lambert [W-3] have been compared by Gower [G-2] and suggestions made for their improvement.

Most popular among the non-hierarchical clustering schemes have been those iterative schemes beginning with an arbitrary set of all-inclusive and mutually exclusive clusters and successively improving the set of clusters by transferring patterns from one cluster to another until no further improvement is available. Such methods use as an evaluation index what has been called the "C criterion" by Switzer [S-6]. As a means of evaluating any given partition of the sample, the within-cluster distance or scatter [W-2], the between-cluster distance, or the ratio of total scatter to within-cluster scatter is used [F-4]. To circumvent the time-consuming computational difficulty involved in examining all possible partitions of the sample [F-2], Friedman and Rubin [F-4] use what they call a "hill climbing" algorithm. In principle their procedure attempts to examine only those partitions of the data for which the ratio of total scatter to within-group scatter is high. The logic of this technique can also be found in [F-5].

A technique for clustering which is very popular is the ISODATA (Iterative Self-Organizing Data Analysis Technique (A)) of Ball and Hall [B-1]. This technique clusters all of the data into distinct and independent groups. A computed mean or average response pattern is used to represent a group of patterns, and the iterative process creates new average response patterns to improve the accuracy of trial or existing average response patterns. The process also combines average response patterns that are so similar that their being separate fails to provide a significant amount of information about the structure of the response patterns. Each response pattern is put into that group for which the squared distance between it and the average response pattern (group mean) is the least. ISODATA implements an intuitively appealing mathematical idea for clustering patterns. It is not itself a standard statistical procedure, such as analysis of covariance or factor analysis. The order of computations and the setting of the various thresholds in the algorithm were motivated by heuristic reasoning and the performance of the program with standard data sets [D-2].

Jones and Jackson [J-3] suggest an iterative technique where clusters are found one at a time. An initial pattern is picked to be the first pattern in the cluster. Patterns are successively transferred into and out of the cluster in a way which increases the within-cluster similarities and decreases the in-cluster to out-cluster similarities.

Graph-theoretical procedures have also been applied to clustering. Bonner [B-2] starts out by thresholding the association or similarity matrix and defining as "core clusters" the maximal complete subgraphs (cliques) of the resulting graph. Then the smaller core clusters are merged into large core clusters, and largely overlapping core clusters are merged. Gotlieb and Kumar [G-1] discuss graph-theoretical clustering methods useful in information retrieval applications.
Zahn [Z-1] describes graph-theoretical algorithms, based on the minimal spanning tree (MST) of a graph, which are capable of detecting several kinds of cluster structure in arbitrary point sets. The concept of the MST for single linkage cluster analysis (SLCA) is also discussed by Gower and Ross [G-3]. They show that all the information required for the SLCA of a set of points is contained in the MST of the graph of these points. Augustson and Minker [A-2] also analyze some graph-theoretical clustering techniques as applied to information retrieval.

The statistical technique of principal components analysis used for clustering of multivariate data involves treating the N observations in p dimensions as N points in a p-dimensional metric space and projecting these points onto a space of smaller dimension with minimum loss of statistical information; that is, the inherent structure in the data is approximately preserved under the mapping. The primary interest is in mapping onto two or three dimensions, since the resultant data configuration can be easily evaluated by human observation in three or fewer dimensions. Rao [R-1] discusses the application of this technique to clustering. Nunnally [N-3] describes the application of factor analysis procedures in clustering. Dubes [D-2] has an interesting report on cluster analysis and decision making with a correlation matrix, wherein he describes minimum-average-distance clustering and mean-squared clustering with particular reference to Gaussian distributions.

A non-linear mapping technique useful in clustering multivariate data is discussed by Sammon [S-1]. The idea behind Sammon's method is similar to that of Kruskal [K-2], who discusses the technique as applied to non-metric hypotheses. An interesting new mathematical formulation of the clustering problem has been provided by Ruspini [R-3].

1.4 Contributions of the Thesis:

From the literature survey of the last section it is clear that, basically, cluster analysis techniques call for the identification of those regions of the observation space where the patterns are most heavily concentrated, thereby establishing a structure in the data in the form of clusters. In this thesis, it is believed that, to detect such a structure, it is not absolutely necessary (1) to treat the underlying pdf of the observations as a mixture of several component pdfs and then (2) try to identify the individual source pdfs. It is sufficient to consider that the underlying pdf is, in general, governed by a multivariate continuous distribution with one or more distinct modes. Since the mode of a pdf is that outcome which is likely to occur most often, the estimation of the number and location of the modes provides a clue to the structure of the data. Accordingly, the method for clustering proposed in this thesis is based on the premise that such an estimation of the number and location of the modes of a multimodal multivariate pdf underlying a set of empirical data is a key to a solution to the clustering problem. Thereafter, the selection and application of a suitable distance measure determines the identity and membership of the clusters in a straightforward manner. This intuitively appealing idea, not found in the literature, is presented and offered as a practical solution for the clustering problem. The performance of this method on data sets, both real and artificial, is found to strengthen the belief in such an approach.
Specifically, the contributions of this thesis are:

(i) A new mathematical model is proposed for the clustering problem. The set of multivariate observations on the objects to be clustered is assumed to have been generated by a multivariate continuous probability distribution whose density function has one or more distinct modes. This pdf is not treated as a mixture of several source pdfs, as is sometimes done in the classification problem [Y-1] of pattern recognition.

(ii) In his discussion of nonparametric decisions based on distance to modes, Nilsson [N-2] assumes the existence of a method to find good estimates for the modes of a probability density function, given the set of training samples. This thesis provides a practical method, guided by statistical theory, for the estimation of such modes. Once the modes have been estimated from a set of training samples, a starting point for the abstraction phase of the pattern recognition problem is available. The abstraction phase of the problem can then be completed using, for example, a piecewise linear machine implementing a minimum distance classifier.

1.5 Organization of the Thesis:

Chapter II presents the proposed model for clustering and discusses the relevant mathematical preliminaries. Chapter III describes an algorithm for the estimation of the number and location of the modes of a multivariate, multimodal pdf. Chapter IV extends the mode seeking algorithm to clustering and summarizes the results obtained on a few real and simulated sets of data, using both supervised and unsupervised learning techniques. Conclusions and suggestions for future work are included in Chapter V.

CHAPTER II
Mathematical Preliminaries

2.1 Introduction:

In this chapter the theoretical preliminaries required for a solution to the clustering problem are discussed. The problem is stated in the framework of a simple mathematical model. The basic assumptions made are:

(i) that there exists a probability distribution which generates the multivariate observations to be clustered;
(ii) that the probability density function (pdf) of this distribution is of the continuous type, characterized by one or more distinct modes (the word mode, as used here, denotes the location of a local maximum in the pdf; assumption (ii) implies that the pdf need not necessarily be symmetric);
(iii) that the number of observations is large compared with the number of dimensions of each observation; and
(iv) that the observations are independent.

2.2 A Mathematical Model for the Clustering Problem:

Let $x_1, x_2, \ldots, x_N$ denote the p-dimensional observations (features, measurements) of N objects. Assume these observations are values of a random variable $X = (X_1, X_2, \ldots, X_p)$ having a cumulative distribution function (cdf) $F(x)$ and the corresponding probability density function (pdf) $f(x)$. Let $f(x)$ be continuous and characterized by G distinct modes, that is,

$$f(x) = f(x_1, x_2, \ldots, x_p;\ M_1, M_2, \ldots, M_G)$$

where $M_i$, $i = 1, 2, \ldots, G$, denotes the i-th distinct p-dimensional mode vector of $f(x)$. Then the clustering problem is:

(i) Estimate G, the number of modes of $f(x)$;
(ii) Estimate the column vectors $M_i$, $i = 1, 2, \ldots, G$; and
(iii) Define a distance measure $d(x_j, M_i)$, $j = 1, 2, \ldots, N$; $i = 1, 2, \ldots, G$, which partitions the set of observations into G groups $\pi_1, \pi_2, \ldots, \pi_G$ such that $x_j \in \pi_i$ if and only if $d(x_j, M_i) \le d(x_j, M_k)$ for all $k \ne i$.

2.3 Estimation of Modes and Mathematical Results:

Let $x_1, x_2, \ldots, x_N$ be independent observations on X, let z be an arbitrary point of the observation space, and let $\{r_N\}$ be a sequence of positive integers such that

$$\lim_{N \to \infty} r_N = \infty \quad \text{and} \quad \lim_{N \to \infty} \frac{r_N}{N} = 0 .$$

Let $d_z(r_N)$ denote the distance from z to the $r_N$-th closest observation.

Theorem 2.1 (Loftsgaarden and Quesenberry): Under the above hypothesis, the estimate of $f(x)$ at the point z, given by the estimator
$$f_N(z) = \frac{(r_N - 1)/N}{\dfrac{\pi^{p/2}}{\Gamma\!\left(\frac{p}{2} + 1\right)}\,\big[d_z(r_N)\big]^p} \qquad (2.1)$$

is consistent.

Proof: The proof of the above theorem can be found in [L-3]. Loftsgaarden and Quesenberry suggest the use of an integer closest to $\sqrt{N}$ as a value of $r_N$.

The following corollary to the above theorem is obtained in the univariate case (p = 1): a consistent estimate of $f(x)$ at z is given by

$$f_N(z) = \frac{(r_N - 1)/N}{2\, d_z(r_N)} \,. \qquad (2.2)$$

Lemma 2.1 (Owen) [O-1]: Let $f(x)$ be a uniformly continuous probability density function and let $\{f_N(x)\}$ be a sequence of estimates of $f(x)$. Let $M_1$ be the value of x, assumed to be unique, where $\max_{x \in I_1} f(x)$ occurs, where $I_1$ is an interval over which $f(x)$ is continuous. Further, let $M_{1N}$ be the value of x where $\max_{x \in I_1} f_N(x)$ occurs. If $f_N(x)$ is a consistent estimate of $f(x)$, then $M_{1N}$ is a consistent estimate of $M_1$.

Proof of this lemma has been given by Owen.

Extension to Multiple Mode Estimation:

As before, let $x_1, x_2, \ldots, x_N$ be independent observations on a univariate random variable and let $d_z(r_N)$ be defined as before. Let $f(x)$ be uniformly continuous and positive over an interval $I_1$ and let there exist a unique mode $M_1 \in I_1$. Let a new estimate of the mode $M_1$ be formed as follows: $M'_{1N} = x_{(k)}$, where $x_{(k)}$ is the k-th smallest observation from the set of N observations (the k-th order statistic) and

$$k = \arg\Big[\min_i \big\{ d_{x_{(i)}}(r_N) : x_{(i)} \in I_1 \big\}\Big];$$

that is, the estimate $M'_{1N}$ of the mode is that order statistic contained in the interval $I_1$ which yields the minimum value of $d_z(r_N)$, as computed by the procedure of Loftsgaarden and Quesenberry.

Further, let $f_N(x)$, for $x \in I_1$, be constructed in a stepwise fashion according to

$$f_N(x) = \frac{(r_N - 1)/N}{2\, d_{x_{(i)}}(r_N)}\,, \qquad x_{(i)} \le x \le x_{(i+1)} .$$

Lemma 2.2 (Moore and Henrichon): The estimate $f_N(x)$ converges uniformly, in probability, to $f(x)$, $x \in I_1$.

Proof: see [M-5].

From Lemmas 2.1 and 2.2,

Theorem 2.2: $M'_{1N}$ is a consistent estimate of the mode $M_1$.

Proof: By Lemma 2.2 we know that the estimate $f_N(x)$, constructed in the manner indicated, converges in probability to $f(x)$, $x \in I_1$. Applying Lemma 2.1 and the fact that $\max_{x \in I_1} f_N(x)$ occurs at the value of $x_{(i)}$ where $d_{x_{(i)}}(r_N)$ is minimum, the theorem follows. Q.E.D.

For estimation of the modes in the case of a multimodal, univariate pdf, consider the set of L relative maxima (assumed unique) $\{m_i\}_{i=1}^L$ located at modes $\{M_i\}_{i=1}^L$ respectively, such that $M_i \in I_i = (a_{2i-1}, a_{2i})$, $i = 1, 2, \ldots, L$, where $a_1 < a_2 < \cdots < a_{2L}$. Further, let the pdf $f(x)$ be uniformly continuous and positive over the intervals $\{I_i\}_{i=1}^L$, and let the estimates $M_{kN}$ of $M_k$ be determined by

$$M_{kN} = x_{(j)}\,, \quad\text{where}\quad j = \arg\Big[\min_i \big\{ d_{x_{(i)}}(r_N) : x_{(i)} \in I_k \big\}\Big].$$

Then, as an immediate consequence of Theorem 2.2,

Corollary: The set of estimates $\{M_{kN}\}_{k=1}^L$ converges in probability to the set $\{M_i\}_{i=1}^L$.

2.4 Chapter Summary:

In this chapter a new model for cluster analysis was proposed. The problem of cluster detection was identified with that of estimating the number of modes, and the modes themselves, of a multimodal pdf with distinct modes. The remainder of the chapter related to the theoretical aspects of mode estimation.

CHAPTER III
A Mode Seeking Algorithm

3.1 Introduction:

The goal of this chapter is to provide a computational procedure to estimate the number, G, and the p-dimensional vectors $M_j$, $j = 1, 2, \ldots, G$, of the modes of the multimodal, multivariate pdf $f(x)$ of Section 2.2.
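The estimator (2.2) and the order-statistic rule of Theorem 2.2 translate directly into a short computation. The following Python fragment is a minimal sketch of that computation, not the thesis's own CDC 6500 program; the function names and the synthetic test sample are illustrative, and $r_N$ is taken as the integer nearest $\sqrt{N}$, as Loftsgaarden and Quesenberry suggest.

```python
import numpy as np

def lq_density(z, sample, r_n):
    """Univariate Loftsgaarden-Quesenberry estimate (2.2):
    f_N(z) = ((r_N - 1)/N) / (2 d_z(r_N))."""
    d = np.sort(np.abs(sample - z))[r_n - 1]   # d_z(r_N): r_N-th closest observation
    return (r_n - 1) / (len(sample) * 2.0 * d)

def mode_estimate(sample, interval, r_n):
    """Theorem 2.2: the mode estimate on I_1 = (a, b) is the order statistic
    x_(k) in I_1 that minimizes d_{x_(k)}(r_N)."""
    a, b = interval
    candidates = sample[(sample > a) & (sample < b)]
    # index r_n, not r_n - 1: position 0 is the zero self-distance of x_(k)
    dists = [np.sort(np.abs(sample - z))[r_n] for z in candidates]
    return candidates[int(np.argmin(dists))]

rng = np.random.default_rng(0)
x = np.sort(rng.normal(0.0, 1.0, 400))         # unimodal test sample
r_n = int(round(np.sqrt(len(x))))              # r_N = integer nearest sqrt(N)
print(mode_estimate(x, (-1.0, 1.0), r_n))      # should fall near the true mode 0
```

For the multimodal case of the corollary, the same mode_estimate call is simply repeated once per interval $I_k$.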
The mathematical results of the last chapter were directed towards the estimation of the modes of a univariate pdf, because one of the problems encountered in a direct application of the mode seeking techniques to a multivariate problem consists of deciding how to store the boundaries which separate the observation space into regions containing only one mode. To appreciate this problem one has only to note that in the one-dimensional case these boundaries are simply points on the real line. In two dimensions the corresponding boundaries are curves, and it is not easy to store arbitrary curves in a computer. The problem is even more difficult in a higher dimensional space. One method of circumventing this difficulty is to project the multivariate observations onto the principal eigenvectors of the sample covariance matrix [M-1] to get a univariate set of data, estimate the modes in this one-dimensional space, and finally transform back to the original p-dimensional space.

The direction of the eigenvector associated with the largest eigenvalue of the sample covariance matrix of the observations is the direction of maximum dispersion. This is the reason for projecting the observations onto the eigenvectors rather than onto the coordinate axes of the p-dimensional observation space. If separation between the data does indeed exist, it should be more easily detected along the eigenvectors [H-2].

3.2 Mode Estimation Procedure for the Multivariate Case:

The proposed mode estimation procedure can be summarized as follows:

Step (i): Compute the principal eigenvectors associated with the sample covariance matrix determined from a set of independent observations.

Step (ii): Project the multivariate observations onto the first principal eigenvector to obtain a univariate data set.

Step (iii): Apply the procedure of Section 3.4 to estimate the number and locations of the local minima for this univariate set, and the modes along this eigenvector. If only one mode is found, go to step (iv); else, go to step (v).

Step (iv): Using the next principal eigenvector, repeat the procedure from step (ii). If all the eigenvectors have been processed and only one mode is found along each eigenvector, compute the location of the mode, transform back to the original space, and stop.

Step (v): Partition the observation space into regions by hyperplanes perpendicular to the eigenvector and passing through the points of minima. Classify the observations into subsets corresponding to these regions.

Step (vi): For each new region so obtained, repeat the procedure from step (i).

Figure 1 depicts the details of this procedure in flow chart form. An illustration of the procedure applied to a two-dimensional data set is given in Figure 2.

3.3 Mathematical Details of the Algorithm:

Let $x_1^*, x_2^*, \ldots, x_N^*$ denote the N p-dimensional column vectors corresponding to the p feature measurements on each object; that is,

$$x_i^* = \begin{bmatrix} x_{i1}^* \\ \vdots \\ x_{ip}^* \end{bmatrix};\qquad i = 1, 2, \ldots, N . \qquad (3.3.1)$$

The asterisk indicates that these are raw measurements. The j-th feature average, $m_j$, is defined as the sample mean for the j-th feature:

$$m_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}^*\,;\qquad j = 1, 2, \ldots, p \qquad (3.3.2)$$

and the vector of feature averages is

$$m = \begin{bmatrix} m_1 \\ \vdots \\ m_p \end{bmatrix}. \qquad (3.3.3)$$

[Figure 1. Flow chart for multivariate mode estimation. Starting with the whole observation space as region 1: store the multivariate observations; compute the sample covariance matrix of the data in the region and its eigenvectors; obtain a univariate data set by projecting the observations in the region onto each eigenvector in turn; estimate the number and location of modes for the univariate data set (Figure 3). If more than one mode is found, divide the observation space into regions by hyperplanes, find the observations falling in each region, and repeat for each region; otherwise, once all eigenvectors for the region have been processed, store the number and location of the modes.]
[Figure 2. Two dimensional mode estimation example. Panels (a)-(f) sketch the data and the univariate density estimates along eigenvectors 1 and 2 for the processing of regions 1 and 2; regions 3 and 4 are processed in a similar manner. The sketch is intended only for visualizing the steps involved in the algorithm.]

Define the normalized measurements as

$$x_{ij} = x_{ij}^* - m_j \qquad (3.3.4)$$

and the normalized observation vector as

$$x_i = \begin{bmatrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{bmatrix}. \qquad (3.3.5)$$

These normalized vectors are arranged in the form of an (N × p) matrix [A]:

$$[A] = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{bmatrix} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1p} \\ x_{21} & x_{22} & \cdots & x_{2p} \\ \vdots & & & \vdots \\ x_{N1} & x_{N2} & \cdots & x_{Np} \end{bmatrix}. \qquad (3.3.6)$$

The (p × p) sample covariance matrix [S] is

$$[S] = \frac{1}{N-1}\,[A]^T [A] \qquad (3.3.7)$$

and is assumed to be positive definite. Let $\lambda_1, \lambda_2, \ldots, \lambda_p$ be the eigenvalues of [S] arranged in order such that $\lambda_1 > \lambda_2 > \cdots > \lambda_p$. The first principal (column) eigenvector, $C_1$, of [S] is the eigenvector corresponding to the eigenvalue $\lambda_1$. The remaining (column) eigenvectors are, respectively, $C_2, C_3, \ldots, C_p$, and the (p × p) matrix of eigenvectors is

$$[C] = [C_1, C_2, \ldots, C_p] .$$

In step (i) these eigenvectors $C_1, C_2, \ldots, C_p$ of [S] are computed, treating the entire observation space as one region. The univariate set of step (ii) is obtained by projecting the N observations onto $C_1$.

Let $V_k$ represent the univariate data set obtained by projecting the observations onto the k-th eigenvector $C_k$. Then

$$V_k = \begin{bmatrix} V_{1k} \\ V_{2k} \\ \vdots \\ V_{Nk} \end{bmatrix} = [A]\,C_k\,;\qquad k = 1, 2, \ldots, p . \qquad (3.3.8)$$

The p data sets so obtained can be arranged as the columns of an (N × p) matrix [V], given by

$$[V] = [A][C] . \qquad (3.3.9)$$

In step (iii) the mode seeking procedure of Section 3.4 is applied to the data set represented by the first column of [V]. If only one mode is detected for this set, the procedure is applied to the set given by the second column of [V], and so on, as mentioned in step (iv).

For the set of observations in each region of the observation space obtained in step (v), a sample covariance matrix [S], a matrix of eigenvectors [C] and a matrix [V] are obtained. Suppose that for the i-th region, only one mode is detected for each of the data sets corresponding to $V_1^{(i)}, V_2^{(i)}, \ldots, V_p^{(i)}$. The components of the mode vector for the i-th region, in the transformed space, are the elements of some row of $[V^{(i)}]$, say the j-th row. The mode vector for the region, in the original observation space, is therefore

$$M^{(i)} = \big[[C^{(i)}]^T\big]^{-1} V_j^{(i)T} + m^{(i)} = [C^{(i)}]\,V_j^{(i)T} + m^{(i)}, \quad\text{since } [C^{(i)}]^T [C^{(i)}] = [I] \qquad (3.3.10)$$

where $[C^{(i)}]$ is the matrix of eigenvectors for the i-th region,

$$V_j^{(i)} = \big[\,V_{j1}^{(i)}\ \ V_{j2}^{(i)}\ \cdots\ V_{jp}^{(i)}\,\big]$$

and

$$m^{(i)} = \begin{bmatrix} m_1^{(i)} \\ \vdots \\ m_p^{(i)} \end{bmatrix}$$

is the vector of feature averages for the $N_i$ observations in the i-th region.
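As a concrete summary of equations (3.3.1) through (3.3.10), the following sketch carries out steps (i) and (ii) of Section 3.2 in Python with NumPy. It is an illustration, not the original program; in particular, the use of eigh and the descending eigenvalue sort are implementation choices assumed here.

```python
import numpy as np

def principal_projections(x_raw):
    """Normalize the raw measurements (3.3.4), form the sample covariance
    matrix [S] = [A]^T [A] / (N - 1) (3.3.7), and project the observations
    onto the eigenvectors of [S], ordered by decreasing eigenvalue (3.3.9)."""
    n, p = x_raw.shape
    m = x_raw.mean(axis=0)                # vector of feature averages (3.3.3)
    a = x_raw - m                         # matrix [A] of normalized vectors
    s = (a.T @ a) / (n - 1)               # [S], symmetric positive definite
    lam, c = np.linalg.eigh(s)            # eigenvalues and column eigenvectors
    order = np.argsort(lam)[::-1]         # lambda_1 > lambda_2 > ... > lambda_p
    c = c[:, order]                       # matrix [C] of column eigenvectors
    return a @ c, c, m                    # [V] = [A][C], together with [C], m

def mode_in_original_space(c, v_row, m):
    """Equation (3.3.10): map a mode found in the projected coordinates back
    to the observation space, using [C][C]^T = [I] for orthonormal [C]."""
    return c @ v_row + m
```

Column k of the returned [V] is the univariate data set $V_k$ to which the procedure of Section 3.4 is applied; mode_in_original_space then recovers $M^{(i)}$ from the row of $[V^{(i)}]$ holding the one-dimensional mode components.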
3.4 Algorithm for Mode Estimation - The Univariate Case:

The theoretical formulation of the mode estimation problem for the univariate multimodal case, as presented in the last chapter, requires that those intervals on the real line where distinct modes are assumed to exist be known first. This is a condition which must be satisfied prior to the application of Theorem 2.2. Once these intervals have been identified, a consistent estimate of the mode of the underlying pdf of the observations falling in each interval can be obtained in a straightforward manner. The problem therefore reduces to that of finding the intervals $\{I_i\}_{i=1}^L$ on the real line.

One method which suggests itself immediately is to actually obtain the pointwise density estimate of the univariate pdf using the estimate

$$f_N(z) = \frac{(r_N - 1)/N}{2\, d_z(r_N)}$$

and to plot these values to identify the intervals containing the local extrema. The futility of this approach can be appreciated if one actually constructs such a plot: the local variations in the pointwise density estimate are too great to be of any use in isolating the intervals under consideration.

To overcome these difficulties, Fu et al. [F-7] suggest the following three "necessarily vague" guidelines:

(i) Some means of smoothing the pointwise approximation of the underlying density estimate is necessary;
(ii) Modes of an underlying density which are further apart from each other should be more readily detected; and
(iii) Modes which have a high probability mass associated with them should be more readily detected.

The approach adopted in this thesis is the construction of an equal bin-count histogram as a primary approximation to the underlying density. This histogram approach performs the integrating effect of guideline (i); that is, local variations in the pointwise density approximation will tend to be smoothed. A discussion of three types of histograms useful for analyzing a set of empirical data can be found in Dubes [D-1]; the equal bin-count histogram approximation was motivated by this discussion. The construction of histograms for estimating the modes has also been suggested by Sebestyen and Edie [S-2]. Such an approach may also be useful for achieving the three objectives referred to above, but their method is not used in this chapter and hence is not discussed further.

In an equal bin-count type of histogram the widths of the various bins indicate the concentration of the probability mass in any interval. Bins with a smaller width imply a higher concentration, just as in the equal bin-width type of histogram a bin with more height indicates that more observations lie in the interval specified by the bin width.

The first step in the algorithm to estimate the modes is the construction of a histogram. The number of bins desired, NB, is specified by the user and should preferably be selected such that the number of observations N (the sample size in the region) is a multiple of NB. Once this number is specified, the bin divisions are chosen to fulfill the condition that all bins have the same, or as close to the same as possible, number of counts. The smaller the number of bins specified, the greater the smoothing effect; that is, specifying a large number of observations per bin suppresses the local variations in the histogram approximation to a greater extent. The locations of the bin divisions carry the information required for further analysis.
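An equal bin-count histogram of this kind can be built directly from the ordered sample. The sketch below is a plausible construction; the thesis does not spell out exactly where its program placed the division points $D_i$, so placing each division midway between adjacent order statistics is an assumption made here.

```python
import numpy as np

def equal_bin_count_histogram(sample, n_bins):
    """Return the bin division points D_0, ..., D_NB of an equal bin-count
    histogram: each bin holds (as nearly as possible) N / NB observations,
    so a narrow bin marks an interval of high probability mass."""
    x = np.sort(sample)
    n = len(x)
    edges = [x[0]]
    for k in range(1, n_bins):
        i = (k * n) // n_bins                  # k-th equal-count split point
        edges.append(0.5 * (x[i - 1] + x[i]))  # divide between adjacent points
    edges.append(x[-1])
    return np.asarray(edges)
```

With N = 400 observations and NB = 20, for instance, each bin holds 20 observations, and the widths np.diff(edges) play the role that bar heights play in an equal bin-width histogram.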
An extrema seeking technique is then applied to this histogram in order to determine appropriate intervals in which to search for local maxima. The algorithm for selecting these intervals depends on a parameter, PARA, to be specified by the user. This parameter decides the value of a threshold which essentially determines how much local variation will be tolerated before a decision to specify an extremum interval is made.

The algorithm first seeks the bin divisions to be used in determining the intervals in which local minima exist between successive modes. The criterion used to store such an interval is that: (i) there exists a bin to the left of the chosen bin such that the difference in their widths is greater than the specified threshold, and (ii) there exists a bin to the right satisfying a similar condition. The threshold is calculated as PARA times the minimum bin width. From simulation studies it is found that values of PARA in the range 1 to 10 give good results. The points at which relative minima exist are then chosen as the midpoints of these bins.

The set of intervals $\{I_i\}_{i=1}^L$ in which modes are assumed to exist is now available. The techniques of the last chapter are then applied to estimate the mode in each interval. A flow chart for the algorithm is given in Figure 3; a sketch of the threshold criterion follows below.

[Figure 3. Flow chart for univariate mode estimation. Initialize: specify the number of bins for the histogram, $r_N$ for the Loftsgaarden and Quesenberry method, the threshold parameter PARA, and the minimum number of observations per region allowed, MCNT. Construct the equal bin-count histogram with bin locations $D_i$; find the minimum bin width and set threshold = PARA × (minimum width); search for bins whose widths satisfy the threshold criterion; take the midpoints of such bins as the local minima positions and store the interval boundaries containing each mode; count the observations in each interval, lumping adjacent regions and refixing the interval boundaries whenever a count falls below MCNT; compute the distance $d_z(r_N)$ from the observations in each interval to their $r_N$-th closest observation; the mode position is given by the observation (order statistic) with minimum $d_z(r_N)$; store the number of modes and the mode positions.]

From the flow chart it is observed that the number of modes is controlled by the following parameters, to be specified by the user:

(i) PARA, the parameter controlling the threshold used in finding the intervals where the local minima of the pdf exist; and
(ii) MCNT, the minimum number of observations the user allows in any region. If any region is found to contain fewer observations than MCNT, adjacent regions are lumped together until this criterion is satisfied.

The estimate of the location of the mode in any interval is controlled by the value of $r_N$ used in the expression for the pointwise density estimate derived by Loftsgaarden and Quesenberry. They suggest the use of an integer closest to $\sqrt{N}$, where N is the number of observations in a region. To allow a more general value, $r_N$ is taken as the integer closest to $\mathrm{PAR}\cdot\sqrt{N}$, where PAR is a third parameter to be specified by the user.
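The threshold criterion just described can be sketched as follows. The scan order and the treatment of several adjacent qualifying bins are not specified in the text, so the simple left-to-right scan below is an assumption.

```python
import numpy as np

def minima_positions(edges, para):
    """Midpoints of bins that are wider, by more than the threshold
    PARA * (minimum bin width), than some bin on each side: these are
    taken as the positions of the relative minima of the pdf."""
    widths = np.diff(edges)
    threshold = para * widths.min()
    minima = []
    for j, w in enumerate(widths):
        wider_than_left = any(w - widths[i] > threshold for i in range(j))
        wider_than_right = any(w - widths[i] > threshold
                               for i in range(j + 1, len(widths)))
        if wider_than_left and wider_than_right:
            minima.append(0.5 * (edges[j] + edges[j + 1]))
    return minima
```

Consecutive minima positions then bound the intervals $I_i$ in which the order-statistic mode estimate of Theorem 2.2 is computed.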
3.5 Results of Tests:

The algorithm presented in this chapter was tested on different sets of simulated as well as real data. The simulated sets of data were obtained by Monte Carlo methods. The results of such test runs are summarized in Tables 1 and 2. The time shown in each case refers to runs on a CDC 6500 digital computer.

Data set #1: Simulated data generated by a mixture of two univariate Weibull pdfs given by

$$f(x) = 0.4\,W(x;\,\alpha_1, \beta_1, \delta_1) + 0.6\,W(x;\,\alpha_2, \beta_2, \delta_2)$$

where

$$W(x;\,\alpha_i, \beta_i, \delta_i) = \begin{cases} \dfrac{\beta_i}{\alpha_i}\,(x - \delta_i)^{\beta_i - 1} \exp\!\Big[-\dfrac{1}{\alpha_i}\,(x - \delta_i)^{\beta_i}\Big], & x \ge \delta_i \\[4pt] 0, & x < \delta_i \end{cases} \qquad i = 1, 2;$$

$$\alpha_1 = \alpha_2 = 1.0\,;\quad \beta_1 = \beta_2 = 2.0\,;\quad \delta_1 = 0.0\,;\quad \delta_2 = 2.0\,.$$

[Table 1. Results of mode estimation - simulated data.]

Data set #2: Simulated data generated by a mixture of two-dimensional Gaussian distributions given by

$$f(x_1, x_2) = 0.3\,N(x_1, x_2;\,\mu_1, \Sigma_1) + 0.7\,N(x_1, x_2;\,\mu_2, \Sigma_2)$$

where

$$\mu_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix};\quad \mu_2 = \begin{bmatrix} -2.0 \\ -2.0 \end{bmatrix};\quad \Sigma_1 = \begin{bmatrix} 1.0 & 0.0 \\ 0.0 & 1.0 \end{bmatrix};\quad \Sigma_2 = \begin{bmatrix} 0.25 & 0 \\ 0 & 0.25 \end{bmatrix}.$$

Referring to the results in Table 1, it should be noted that the estimated mode locations for both sets of data are for a mixture of the two distributions. They are not necessarily the same as the mode locations of the component pdfs. The theorem given below shows that such is indeed the case except under special conditions. This theorem and its proof are given only to explain the results obtained for the simulated data sets; it is pointed out that in the model proposed in Chapter II, the pdf is not treated as a mixture of pdfs.

Theorem 3.4.1: Let $f_1(x)$ and $f_2(x)$ be two continuous pdfs of the multivariate random variable $X = (X_1, X_2, \ldots, X_p)$ and let $f(x)$ be a mixture of $f_1$ and $f_2$ given by

$$f(x) = p_1 f_1(x) + p_2 f_2(x)\,;\qquad p_1 + p_2 = 1 .$$

If $x_m$ is a mode of $f(x)$ and $x_0$ is a mode of $f_1(x)$, then the mode $x_m$ is not the same as $x_0$ unless $\nabla f_2$ is zero at $x = x_m$.

Proof: At any mode of $f(x)$,

$$\nabla f(x) = p_1\,\nabla f_1(x) + p_2\,\nabla f_2(x) = 0 .$$

Since $\nabla f_1(x)\big|_{x = x_0} = 0$, locally, in a neighborhood of $x_0$, we have the expansions

$$\nabla f_1(x) = [B_1]\,(x - x_0) + \text{higher order terms},$$
$$\nabla f_2(x) = A_2 + [B_2]\,(x - x_0) + \text{higher order terms},$$

where $A_2 = \nabla f_2(x)\big|_{x = x_0}$, and $[B_1]$ and $[B_2]$ are square matrices depending on $x_0$. Retaining only the first order terms, the mode $x_m$ occurs when

$$p_1 [B_1]\,(x_m - x_0) + p_2 A_2 + p_2 [B_2]\,(x_m - x_0) = 0 ,$$

from which

$$x_m = x_0 - \big(p_1 [B_1] + p_2 [B_2]\big)^{-1} p_2 A_2 = x_0 - \Big[\tfrac{p_1}{p_2}[B_1] + [B_2]\Big]^{-1} A_2 ,$$

assuming that the indicated inverse exists. If $A_2 = \nabla f_2(x) = 0$ at $x = x_0$, then $x_m = x_0$. Q.E.D.

The last condition may be interpreted as meaning that either the two distributions are well separated in the observation space, or a mode of one of the densities occurs where the other density is constant.

The theoretical modes of the mixture pdf for the Weibull data are 0.7070 and 2.7035 respectively. Considering the fact that the estimated modes were for data produced by simulation, there is reasonably good agreement between the ideal and estimated mode locations.
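The quoted mixture-mode values can be checked numerically. The sketch below, an illustration rather than anything from the thesis, evaluates the Weibull mixture of data set #1 on a fine grid and picks out its interior local maxima; it recovers modes near 0.707 and 2.703, in agreement with the theoretical values quoted above (note that the first component alone has its mode at $1/\sqrt{2} \approx 0.7071$, and the small tail of the first component shifts the second mixture mode slightly left of $2 + 1/\sqrt{2}$).

```python
import numpy as np

def weibull_pdf(x, alpha, beta, delta):
    """W(x; alpha, beta, delta) of data set 1, zero for x < delta."""
    u = np.where(x >= delta, x - delta, 0.0)
    f = (beta / alpha) * u ** (beta - 1) * np.exp(-(u ** beta) / alpha)
    return np.where(x >= delta, f, 0.0)

# mixture of data set 1: 0.4 W(x; 1, 2, 0) + 0.6 W(x; 1, 2, 2)
x = np.linspace(0.0, 5.0, 500001)
f = 0.4 * weibull_pdf(x, 1.0, 2.0, 0.0) + 0.6 * weibull_pdf(x, 1.0, 2.0, 2.0)

# interior local maxima of the sampled density
interior = (f[1:-1] > f[:-2]) & (f[1:-1] > f[2:])
print(x[1:-1][interior])       # two maxima, approximately 0.7071 and 2.7035
```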
The two artificial data sets were used only to illustrate the feasibility of the mode seeking algorithm. Applications to real data are considered next.

Data sets #3, 4, 5 and 6: The sets of data mentioned in Table 2 were selected from the 150 four-dimensional patterns used by Fisher in a classical paper [F-1]. Each pattern represents an iris and each feature represents a measurement on the iris. The patterns are divided into three species/classes called Setosa, Versicolor and Virginica, with fifty patterns in each class.

In each of the tests all four measurements were used for mode estimation. The parameter PAR was varied over the different runs to demonstrate its effect on the estimate of the mode locations. The iris data set has been used by many research workers to test the performance of algorithms proposed for clustering.

[Table 2. Results of mode estimation - real data.]

3.6 Chapter Summary:

[...]

CHAPTER IV
Cluster Seeking and Pattern Classification

4.1 Introduction:

[...]

4.2 A Similarity Measure and an Algorithm for Clustering:

[...]

4.3 Distance Measures for Classification:

[...] the quadratic form $(a - b)^T [S]^{-1} (a - b) > 0$; and $M_d$ is a vector representing the center of a constant unit distance ellipse. A detailed derivation of the measures (iii) and (iv) is given in Appendix A.

The 'asymmetric' distance measure is motivated by the belief that if the pdf underlying the observations in any cluster is asymmetric, in the sense that the mean is different from the mode, the distance measure chosen should reflect this skew. Because the dispersion of the observations is, in general, not the same in all directions, it is felt that the measure of distance used should reflect this.

4.4 A Classification Algorithm:

Based on the minimum distance-to-mode decision rule, an algorithm for classifying a pattern vector x from an unknown class is given in Figure 5 in flow chart form.

4.5 Results:

The clustering algorithm and its application to pattern classification using the different distance measures were tried on the sets of data used in Chapter III. Another set of real data tested consisted of measurements of different types of grain. Ehrlich and Weinberg [E-2] discuss how grain shape may be described as precisely as needed by a Fourier series expansion of the radius about the center of mass, utilizing co-ordinates of peripheral points. They also give illustrative examples to show that the shape variables easily discriminate grain differences arising from geographic, stratigraphic and process factors. In pattern recognition parlance this may be considered as their method of feature extraction. Professor Weinberg made available a set of eight feature measurements for each of the grains - Navy Bean, Wheat and Oats. As the sample size of each variety of grain is small compared with the number of features, only four of the eight features were chosen for the test. Specifically, the available sample sizes are:

Navy Bean: 46
Oats: 50
Wheat: 42

[Figure 5. Flow chart for classification. Clustering phase: estimate the modes of the given data set and identify clusters using the Euclidean metric. Learning phase: estimate $M^{(i)}$, the modes of the clusters, and choose a distance measure for classification. Classification phase: compute $d(x, M^{(i)})$, find $k = \arg\min_i \{d(x, M^{(i)})\}$, and decide $x \in$ class k.]

The results of these tests are summarized in Table 3. The mode seeking algorithm was also applied to estimate the modes of the pdf underlying the actual training sets, as distinct from the sets identified by the clusters. For each data set two types of tests were conducted: first using all the samples from each training set, and next using a subset of the set from each class. Both tests gave encouraging results.
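The decision rule of Figure 5 reduces to a few lines of code. The sketch below is an illustration, not the thesis program: it implements the minimum distance-to-mode rule with the Euclidean metric and, optionally, the Mahalanobis $D^2$ measure. The symmetric and asymmetric measures of Section 4.3 depend on the derivation in Appendix A and are not reproduced; the example modes are hypothetical.

```python
import numpy as np

def classify(x, modes, covs=None):
    """Minimum distance-to-mode rule: decide x belongs to class
    k = arg min_i d(x, M^(i)).  With covs supplied, d is the Mahalanobis
    D^2 measure (x - M^(i))^T [S^(i)]^{-1} (x - M^(i)); otherwise it is
    the squared Euclidean distance."""
    dists = []
    for i, m in enumerate(modes):
        diff = x - m
        if covs is None:
            dists.append(diff @ diff)                      # squared Euclidean
        else:
            dists.append(diff @ np.linalg.solve(covs[i], diff))
    return int(np.argmin(dists))

# example with two hypothetical cluster modes in the plane
modes = [np.array([0.0, 0.0]), np.array([-2.0, -2.0])]
print(classify(np.array([-1.8, -2.2]), modes))             # -> 1
```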
The results of the former type of test are tabulated in Table 4. An inspection of the results in Table 3 reveals that, with one exception, the reclassifications based on the initial cluster membership are worse almost in inverse proportion to the degree of intuitive sophistication of each distance measure. The reason for this is not known and has not been investigated further.

4.6 Chapter Summary:

In this chapter a clustering procedure was discussed, using the similarity measure defined by the distance of an observation point from the modes of the distribution which generated the set of observations. If the ultimate aim is clustering of a set of empirical data, the algorithm can be terminated at this stage.

[Table 3. Results of clustering and classification algorithm.]

[Table 4. Results of classification using training samples.]

However, if the aim is to detect a cluster structure for pattern classification analysis, the clusters generated could be used as sets of training patterns, their modes estimated, and a pattern vector x from an unknown class classified using a minimum distance-to-mode classifier. It is also demonstrated that if training subsets from known classes are available a priori, then the algorithm can be used for classification studies. Different distance measures are discussed, even though the two intuitively more sophisticated measures performed poorly on the data sets mentioned.

It may be recalled that in the mode seeking algorithm the observation space is partitioned into regions and the modes are estimated for the observations in each such region. To explain why the observations in these regions are not treated as the final clusters, one has only to note that the modes could also be estimated using other procedures which do not call for such partitioning. The algorithm to seek clusters may also be used with such mode seeking methods.

CHAPTER V
Conclusions and Suggestions for Future Work

5.1 Thesis Summary:

In this thesis a new model for the clustering problem is presented. The formulation of the problem is based on assumptions which are not severely restrictive.
The assumptions are of a nature similar to those generally made in most methods for the statistical analysis of the pattern recognition problem. The model is motivated by the belief that, to find a cluster structure in a set of empirical data, it is not necessary to treat the underlying pdf as a mixture of several source pdfs. The pdf can be treated, in its own right, as one governed by a multivariate continuous distribution with one or more distinct modes. This leads to the identification of the clustering problem with that of estimating the modes.

An algorithm is given to estimate these modes, and the estimated modes are used to detect the cluster membership. As no classified training samples from each pattern class are available a priori, the clusters so formed are taken to be the training sets for training a minimum distance classifier. However, if training patterns from each class are available in advance, it is shown that the mode seeking algorithm can be used to complete the abstraction phase of the pattern classification problem.

Apart from the standard Euclidean and the Mahalanobis $D^2$ measures, two heuristically motivated distance measures are discussed. Even though the reasoning involved in the derivation of these new measures seems intuitively sound, the measures performed poorly, contrary to expectation. However, the Mahalanobis $D^2$ measure gives very good results when used in conjunction with the proposed model and strengthens the belief in the suitability of the model.

5.2 Conclusions:

The clustering procedure proposed is useful in a number of situations. However, it is not hard to imagine examples where this approach may not give the desired results. One such situation is described by two concentric point sets in a plane. Ideally, one would like to obtain from a clustering algorithm applied to this set exactly two clusters. However, the scheme proposed here is likely to detect more than two clusters. It is doubtful whether any of the presently available clustering techniques would produce exactly two concentric clusters without rejecting any of the observations.

The method will certainly perform very well in situations where the ratio of the inter-cluster distance to the intra-cluster distance is high; that is, in situations where the data are well separated into groups in the observation space. Such a performance is, of course, to be expected of any good clustering algorithm.

As implemented, the algorithm is slightly more expensive as regards the computer time involved, compared with ISODATA operating
The manner of implementation adopted here did not give encouraging results. The reasons for instability, that is, why with a distance measure as implemented, the results diverge instead of converging to form 'ideal' clusters (or remain stationary) are worth investiga- tion. Other methods of implementing this notion may also be con- sidered for future work. It is quite possible that better results may be obtained. Iterations performed with all the four distance measures show no further improvement in performance. Even with the classical.Euclid and Mahalanobis distance measures, the iterative process diverges and the reason for this divergence requires further investigation. 55 Another area of future work is the estimation of the modes of a multivariate pdf directly from the Loftsgaarden and Quesenberry estimator of Chapter II, instead of first constructing a uni- variate data set. A stochastic approximation approach to estimate the pdf [K-l] may result in a computationally more economical method to solve the problem of mode estimation. BIBLIOGRAPHY [A-l] [A-Z] [3-1] [3-2] [0-1] [0-1] [12-11 [3-2] [F-l] BIBLIOGRAPHY Aizerman, M.A., Braverman, E.M., and L.I. Rozonoer, "Theoretical principles of potential function method in the problem of teaching automata to recognize classes", Autgmgtion and Remote Control, Vol. 25, No. 6, pp. 821- 837, 1964. Augustson, J.G. and J. Minker, "An analysis of some graph- theoretical cluster techniques", Technical Report NGL-21- 002-008, No. 70-106, University of Maryland, College Park, Maryland, 1970. Ball, G.H. and D.J. Hall, "ISODATA, a novel method of data analysis and pattern classification", Technical Report, Stanford Research Institute, Menlo Park, California, 1965. Bonner, R.E., "On some clustering techniques", I.B.M. Jour. Res. Dev., Vol. 8, pp. 22-32, 1969. Chernoff, H., "Estimation of the mode", Ann. of Inst. Stat. Math., Toyko, Vol. 16, pp. 31-41, 1964. Dubes, R.C., "Data reduction with grouping and Weibull models", Interim Scientific Report NO. 7, Div. of Eng. Res., Michigan State University, East Lansing, Michigan, 1970. , "Information compression, structure analysis, and decision making with a correlation matrix", Interim Scientific Report No. 11, Div. of Eng. Res., Michigan State University, East Lansing, Michigan, 1970. Edwards, A.W.F. and L.T. Cavalli-Sforza, “A method for cluster analysis", Biometrics, Vol. 21, NO. 2, pp. 362- 375, 1965. Ehrlich, R. and B. Weinberg, "An exact method for char- acterization of grain shape", Jour. of Sedimentapy Petrolo , Vol. 40, No. 1, pp. 205-212, 1970. Fisher, RsA., "The use of multiple measurements in taxonomic problems", Appals of Eugenics, 3, Part 2, pp. 179-188, 1936. 56 [F-B] [F-4] [F-S] [F-6] 11-71 [0-1] [0-2] [c-31 [G-4] [H-1] [3.21 [J-IJ [J-Z] 57 Fortier, J. and H. Solomon, "Clustering Procedures", in Multivariate Analysis - 1, Ed. P.R. Krishniah, Academic Press, New York, 1966. Fralick, 8.0., "Learning to recognize patterns without a teacher", IEEE Trans., Infom.lheory, Vol. IT-13, pp. 57-“, 19670 Friedman, H.P. and J. Rubin, "On some invariant criteria for grouping data", Jour. of Amer. Stat. Assoc., Vol. 62, ______, "The Logic of the Statistical Methods", The Borderline Syndrome, Ch. 5, Basic Books, New York, 1968. Fu, K.S., Seguethial Methods in Pattern Recognition and Machine Learning, Academic Press, New York, 1968. Fu, K.S. and E.G. Henrichon, Jr., "On non-parametric methods for pattern recognition", Technical Report No. 
TR-EE 68-19, School of Electrical Engineering, Purdue University, Lafayette, Indiana, 1968.

[G-1] Gotlieb, C.G. and S. Kumar, "Semantic clustering of index terms", Jour. ACM, Vol. 15, pp. 493-513, 1968.

[G-2] Gower, J.C., "A comparison of some methods of cluster analysis", Biometrics, Vol. 23, pp. 623-637, 1967.

[G-3] Gower, J.C. and G.J.S. Ross, "Minimum spanning trees and single linkage cluster analysis", App. Statistics, Vol. 18, No. 1, pp. 54-64, 1969.

[G-4] Grenander, U., "Some direct estimates of the mode", Ann. of Math. Stat., Vol. 36, pp. 131-138, 1965.

[H-1] Hilborn, C.G. and D.G. Lainiotis, "Optimal unsupervised learning multicategory dependent hypotheses pattern recognition", IEEE Trans. Inform. Theory, Vol. IT-14, pp. 468-470, 1968.

[H-2] Hotelling, H., "Analysis of a complex of statistical variables into principal components", Jour. Educ. Psych., Vol. 24, pp. 417-441, 1933.

[J-1] Jardine, N. and R. Sibson, "The construction of hierarchic and non-hierarchic classifications", Comp. Jour., Vol. 11, pp. 177-184, 1968.

[J-2] Johnson, S.C., "Hierarchical clustering schemes", Psychometrika, Vol. 32, pp. 241-254, 1967.

[J-3] Jones, K.S. and D. Jackson, "Current approaches to classification and clump-finding at the Cambridge Language Research Unit", Comp. Jour., Vol. 10, pp. 29-37, 1967.

[K-1] Kashyap, R.L. and C.C. Blaydon, "Estimation of probability density and distribution functions", Technical Report TR-EE67-14, School of Electrical Engineering, Purdue University, Lafayette, Indiana, 1967.

[K-2] Kruskal, J.B., "Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis", Psychometrika, Vol. 29, No. 2, pp. 115-129, 1964.

[L-1] Lance, G.N. and W.T. Williams, "Computer programs for monothetic classification (association analysis)", Comp. Jour., Vol. 8, pp. 246-249, 1965.

[L-2] ________, "A general theory of classificatory sorting strategies, Part I: hierarchical systems", Comp. Jour., Vol. 9, pp. 373-380, 1967.

[L-3] Loftsgaarden, D.O. and C.P. Quesenberry, "A non-parametric estimate of a multivariate density function", Ann. of Math. Stat., Vol. 36, pp. 1049-1051, 1965.

[M-1] Mattson, R.L. and J.E. Dammann, "A technique for determining and coding subclasses in pattern recognition problems", I.B.M. Jour. Res. Develop., pp. 294-302, 1965.

[M-2] McQuitty, L.L., "Hierarchical linkage analysis for the isolation of types", Educ. Psychol. Measurement, Vol. 20, No. 1, pp. 55-67, 1960.

[M-3] Mendel, J.M. and K.S. Fu, Adaptive, Learning and Pattern Recognition Systems: Theory and Applications, Academic Press, New York, 1970.

[M-4] Michener, C.D. and R.R. Sokal, "A quantitative approach to a problem in classification", Evolution, Vol. 11, pp. 130-162, 1957.

[M-5] Moore, D.S. and E.G. Henrichon, "Uniform consistency of some estimates of a density function", Purdue University Stat. Dept., No. 168, 1968.

[M-6] Murthy, V.K., "Estimation of probability density", Ann. of Math. Stat., Vol. 36, pp. 1027-1031, 1965.

[N-1] Nagy, G., "State of the art in pattern recognition", Proc. IEEE, Vol. 56, No. 5, pp. 836-862, 1968.

[N-2] Nilsson, N.J., Learning Machines: Foundations of Trainable Pattern Classifying Systems, McGraw-Hill, New York, 1965.

[N-3] Nunnally, J., "The analysis of profile data", Psych. Bulletin, Vol. 59, No. 4, pp. 311-319, 1962.

[O-1] Owen, J., "The consistency of a non-parametric decision procedure", Engr. Note No. 334, Applied Res. Lab., Sylvania Electronic Systems, 1964. (Quoted in [F-7].)
[P-1] Parzen, E., "On estimation of a probability density function and mode", Ann. of Math. Stat., Vol. 33, pp. 1065-1076, 1962.

[P-2] Patrick, E.A. and J.C. Hancock, "Non-supervised sequential classification and recognition of patterns", IEEE Trans. Inform. Theory, Vol. IT-12, pp. 362-372, 1966.

[P-3] Patrick, E.A., "On a class of unsupervised estimation problems", IEEE Trans. Inform. Theory, Vol. IT-14, pp. 407-415, 1968.

[P-4] Patrick, E.A. and J.P. Costello, "Unsupervised estimation and processing of unknown signals", Technical Report TR-69-430, Rome Air Development Center, Rome, New York, 1970.

[R-1] Rao, C.R., "The use and interpretation of principal component analysis in applied research", Sankhya, The Indian Journal of Statistics, Series A, Vol. 26, pp. 329-358, 1965.

[R-2] Rohlf, F.J., "Adaptive hierarchical clustering schemes", Sys. Zoology, Vol. 19, No. 1, pp. 58-82, 1970.

[R-3] Ruspini, E.H., "A new approach to clustering", Inform. Contr., Vol. 15, pp. 22-32, 1969.

[S-1] Sammon, J.W., Jr., "A non-linear mapping for data structure analysis", IEEE Trans. Computers, Vol. C-18, No. 5, pp. 401-409, 1969.

[S-2] Sebestyen, G. and J. Edie, "An algorithm for non-parametric pattern recognition", IEEE Trans. Elec. Computers, Vol. EC-15, No. 6, pp. 908-915, 1966.

[S-3] Sokal, R.R. and P.H.A. Sneath, Principles of Numerical Taxonomy, W.H. Freeman, San Francisco, California, 1963.

[S-4] Spragins, J., "Learning without a teacher", IEEE Trans. Inform. Theory, Vol. IT-12, pp. 223-230, 1966.

[S-5] Stanat, D.F., "Unsupervised learning of mixtures of probability functions", in Pattern Recognition, Ed. L. Kanal, Thompson, Washington, D.C., 1966.

[S-6] Switzer, P., "Statistical techniques in clustering and pattern recognition", Technical Report No. 139, Department of Statistics, Stanford University, Stanford, California, 1968.

[T-1] Teicher, H., "Identifiability of mixtures of product measures", Ann. of Math. Stat., Vol. 38, pp. 1300-1302, 1967.

[T-2] ________, "Identifiability of finite mixtures", Ann. of Math. Stat., Vol. 34, pp. 1265-1269, 1963.

[T-3] ________, "Identifiability of mixtures", Ann. of Math. Stat., Vol. 32, pp. 244-248, 1961.

[T-4] Tryon, R.C., Cluster Analysis, Edwards, Ann Arbor, Michigan, 1939.

[V-1] Venter, J.H., "On estimation of the mode", Ann. of Math. Stat., Vol. 38, pp. 1446-1455, 1967.

[W-1] Ward, J.H., "Hierarchical grouping to optimize an objective function", Jour. Amer. Stat. Assn., Vol. 58, pp. 236-245, 1963.

[W-2] Wilks, S.S., Mathematical Statistics, Ch. 18, Wiley, New York, 1963.

[W-3] Williams, W.T. and J.M. Lambert, "Multivariate methods in plant ecology I. Association analysis in plant communities", J. Ecol., Vol. 47, pp. 83-89, 1959.

[W-4] Wirth, M., Estabrook, G., and D. Rogers, "A group theory model for systematic biology", Systematic Zoology, Vol. 15, No. 1, pp. 59-69, 1966.

[Y-1] Yakowitz, S.J., "Unsupervised learning and the identification of finite mixtures", IEEE Trans. Inform. Theory, Vol. IT-16, pp. 330-338, 1970.

[Y-2] ________, "A consistent estimator for the identification of finite mixtures", Ann. of Math. Stat., Vol. 40, No. 5, pp. 1728-1735, 1969.

[Y-3] Yakowitz, S.J. and J.D. Spragins, "On the identifiability of finite mixtures", Ann. of Math. Stat., Vol. 39, No. 1, pp. 209-214, 1968.

[Z-1] Zahn, C.T., "Graph-theoretical methods for detecting and describing Gestalt clusters", IEEE Trans. Computers, Vol. C-20, No. 1, pp. 68-86, 1971.
APPENDIX A

This appendix discusses the motivation behind the use of the symmetrical and asymmetrical generalized distance measures referred to in Chapter IV and gives the details of the derivations leading to (4.3.3) through (4.3.6).

I. Symmetric generalized distance:

A generalized second-order distance metric, $D^2$, between two points $X$ and $Q$ in a $p$-dimensional space is

$$D^2 = (X - Q)^T [S] (X - Q) \tag{A.1}$$

where $[S]$ is a $(p \times p)$ positive definite matrix. We seek a particular matrix $[S]$ which satisfies the condition that the average of the distances of the points in a cluster from their mode is 1. That is, a matrix $[S^{(i)}]$ is sought such that, for the $i$-th cluster,

$$\frac{1}{N_i} \sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)^T [S^{(i)}] \bigl(X_j^{*(i)} - M^{(i)}\bigr) = 1. \tag{A.2}$$

The existence of such a matrix is shown below. Assume that the inverse of the matrix

$$\sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T$$

exists. Then a matrix satisfying condition (A.2) is given by

$$[S^{(i)}] = \frac{N_i}{p} \left[\, \sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T \right]^{-1}. \tag{A.3}$$

To prove this, we note that each quadratic form, being a scalar, equals its own trace, so that

$$\frac{1}{N_i} \sum_{j=1}^{N_i} D_j^2
= \frac{1}{N_i} \sum_{j=1}^{N_i} \operatorname{trace}\Bigl\{ [S^{(i)}] \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T \Bigr\}
= \frac{1}{N_i} \operatorname{trace}\Bigl\{ [S^{(i)}] \sum_{j=1}^{N_i} \bigl(X_j^{*(i)} - M^{(i)}\bigr)\bigl(X_j^{*(i)} - M^{(i)}\bigr)^T \Bigr\}.$$

Substituting for $[S^{(i)}]$ the expression given by (A.3),

$$\frac{1}{N_i} \sum_{j=1}^{N_i} D_j^2 = \frac{1}{N_i} \cdot \frac{N_i}{p} \operatorname{trace}[I] = \frac{p}{p} = 1.$$

The symmetric generalized distance measure between a pattern vector $X$ and the mode $M^{(i)}$ of the $i$-th cluster is now defined as

$$d_s^2\bigl(X, M^{(i)}\bigr) = \bigl(X - M^{(i)}\bigr)^T [S^{(i)}] \bigl(X - M^{(i)}\bigr). \tag{A.4}$$

It is to be observed that $[S^{(i)}]$ is proportional to the inverse of the second moment matrix of the points in the $i$-th cluster, taken with respect to the mode $M^{(i)}$ of the cluster, and is thus a measure of the dispersion of that cluster. The name "symmetric generalized distance" is used to distinguish it from the asymmetric measure discussed next.

II. Asymmetric generalized distance:

For defining the asymmetric generalized distance measure, the ellipsoidal surfaces described by $(X - Q)^T [S] (X - Q) = 1$ are centered at a new point which reflects the dispersion of the points in a cluster. The measures of the dispersion along each axis, both to the left and to the right of the mode $M^{(i)}$, are computed, and the center of the ellipsoid is obtained by shifting $M^{(i)}$ by the average of these dispersions along each axis. To be more specific, let $x_{jk}^{*(i)}$ be the $k$-th component of the $j$-th pattern vector of the $i$-th cluster, $j = 1, 2, \ldots, N_i$; $k = 1, 2, \ldots, p$; and further let this component be designated

$$x_{jk}^{-(i)} \quad \text{if } x_{jk}^{*(i)} \le M_k^{(i)}, \qquad \text{and} \qquad x_{jk}^{+(i)} \quad \text{if } x_{jk}^{*(i)} > M_k^{(i)}. \tag{A.5}$$

... $> 0$ if the point $M^{(i)}$ is interior to the ellipsoid; and $r = (a - b)^T [W] (a - b) > 0$, assuming $[W]$ to be positive definite.
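As a numerical check on the construction above, the following sketch computes $[S^{(i)}]$ from (A.3) and verifies the normalizing property (A.2). It is an illustration only, written here in Python with numpy; the mode estimate is taken as given, and the function names are choices made for this sketch rather than names from the thesis.

```python
import numpy as np

def symmetric_s_matrix(X, mode):
    # [S] of (A.3): (N_i / p) times the inverse of the cluster's
    # scatter matrix taken about the mode (not the mean).
    n, p = X.shape
    d = X - mode
    return (n / p) * np.linalg.inv(d.T @ d)

def symmetric_d2(x, mode, S):
    # Symmetric generalized squared distance of (A.4).
    d = x - mode
    return float(d @ S @ d)

# Verify (A.2): the average squared distance of the cluster's own
# points from its mode comes out 1 by construction.
rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, (200, 3))
mode = np.zeros(3)                      # mode estimate assumed given
S = symmetric_s_matrix(X, mode)
avg = np.mean([symmetric_d2(xj, mode, S) for xj in X])
print(round(avg, 10))                   # prints 1.0
```

The trace identity used in the proof is what makes this check exact: the average quadratic form equals $(1/N_i)\operatorname{trace}\{[S]\sum_j dd^T\}$, which (A.3) forces to 1 regardless of the data.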
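The shifted center of the asymmetric measure can be sketched in the same spirit. Since the construction is only outlined in prose above, the code below is one plausible reading and nothing more: the one-sided dispersions are taken as mean deviations on each side of the mode, and the shift along each axis as half their difference; both choices, like the function name, are assumptions of this sketch.

```python
import numpy as np

def shifted_center(X, mode):
    # One reading of the asymmetric construction: along each axis k,
    # average the deviations of the points lying to the left and to
    # the right of the mode separately, then move the ellipsoid
    # center from the mode toward the side with the larger dispersion.
    # The averaging and the sign convention are assumptions.
    center = np.asarray(mode, dtype=float).copy()
    for k in range(X.shape[1]):
        left = mode[k] - X[X[:, k] <= mode[k], k]   # left-side deviations
        right = X[X[:, k] > mode[k], k] - mode[k]   # right-side deviations
        d_left = left.mean() if left.size else 0.0
        d_right = right.mean() if right.size else 0.0
        center[k] += 0.5 * (d_right - d_left)
    return center
```

With the center so shifted, the asymmetric distance would be evaluated as in (A.1) with the shifted center in place of $Q$; for a cluster skewed to the right of its mode, the center moves right, so points on the long tail are penalized less than under the symmetric measure.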