asmomm OF A BASES FOR QGAL {IRENE CLUSTEBING Mssefiafim far the Degree 9‘? PM). Mimifim STATE UMVERSITY ALI BEHFORGOZ W75 4., Q, O!’ "I . we: a. a” - r ‘I‘Z‘uv . ' . 0- .0 _ _; i " ' " ' "V t" 1 “ This is to certify that the thesis entitled Development of 3 Basis for Goal Oriented Clustering presented by Ali Behforooz has been accepted towards fulfillment of the requirements for Ph.D degree in Computer Science %2/% Major professor 1" t 61(0 ABSTRACT DEVELOPMENT OF A BASIS FOR GOAL ORIENTED CLUSTERING By Ali Behforooz This thesis is concerned with the first two of the follow— ing group of problems which were faced during the design of a Television-Computer Education System: 1) representation of a nominal-scale variable with unknown nominal value by numeric data for cluster analysis, 2) development of a method of clustering for a set of data items defined by such nominal- scale variables , 3) assignments of an appr0priate sequence of actions to every generated cluster, 4) measurement of the effectiveness of the assigned actions , and 5) control or prediction of changes in the clusters and the members of the clusters. The basic structure of a clustering system which is designed so as to provide solutions to these problems is developed. It is called a Goal Oriented Clustering System, because every step of this system is dependent on the purpose(s) of the clustering. In this thesis a complete solution is provided for the first problem. The nominal-scale variables are represented by a vector of dimension M, where M is the number of categories of the variable. If the nominal value of a nominal variable V is known then the vector representing this variable is a deterministic vector. If the nominal value of V is unknown then the vector representing this variable is a probabilistic vector. Also, in this thesis a complete solution is provided for the clustering problem. To cluster a set of data items defined by nominal-scale variables with unknown nominal value, a data item is represented by an n x M matrix, where n is the number of variables. An M-dimensional distance function is proposed to calculate the association measure between data items. Two different clustering methods are developed which are capable of clustering data items represented by n x M matrices as well as data items represented by n-dimensional vectors. The last three problems mentioned at the beginning are left for future research by others. DEVELOPMENT OF A BASIS FOR GOAL ORIENTED CLUSTERING by Ali Behforooz A THESIS Submitted to Michigan State University Department of Computer Science in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY 197s TO FARIDEH 8 AMIR ii . ACKNOWLEDGMENT I am grateful to Dr. Morteza A. Rahimi,my major profe- ssor, for his encouragement during the course of this research. My thanks also go to Dr. Harry G. Hedges, Dr. Richard J. Reid, Dr. John J. Forsyth, and Dr. Dennis D. Gilliland for their serving on my guidance committee and their critical review of this thesis. My special thanks go to National Educational Television of Iran for their financial support of this work. Finally,my thanks go to Dr. Martin O. Holien, Chairman of the Department of Computer Science at Moorhead State University, Moorhead, Minnesota for his support. iii ISI I93 XRY R[A] [31R iff a==$b a¢=9b ‘V a 3 a xéfA xfiA LIST OF SYMBOLS Empty set, a set with no element. If S is a set,lSl is the number of elements of the set. The cardinality of the set of natural numbers. 
The cardinality of the set of real numbers. A vector with all components equal to r. The set of positive integers. The set of negative integers. The set of real numbers. The set of non-negative real numbers. p or q. p and q. x is related to y by relation R. x is not related to y by relation R. The set of R-related elements Of A- The set of all elements related to a by relation R. If and only if. a implies b. a if and only if b. For all a. There exists an a. There does nor exist an a. x is an element of A. x is not an element of A. Such that. iv A.CB \a,b,.. (a,b,.. [dij] [dek] LIST OF SYMBOLS (continue) is a subset of.B. set with elements a, b,.. vector with components a, matrix with d as its ii three dimensional matrix. etc. b,..etc. jth element of its i th TOW . Chapter I II III IV TABLE OF CONTENTS INTRODUCTION.. ..... . ..................... . ........... 1.1 Five Major Problems.... ..... .............. ...... 1 1.2 A Brief Review of Current Clustering Methods.... 5 1.3 Scope of This Research .................. . ....... 10 1.4 Organization of This Thesis ............... . ..... 11 GOAL ORIENTED CLUSTERING .............. .... ........... 2.1 Introduction.................................... 13 2.2 Goal Oriented Clustering System (GOCS) .......... 14 2.3 Elements of the GOCS .................. .......... 16 NUMERIC REPRESENTATION OF NON-NUMERIC VARIABLES(NNV'S) FOR CLUSTER ANALYSIS ........... ... ........... ........ 3.1 Introduction .................................... 21 3.2 Category Relation ........................... .... 22 3.3 Representation of an NNV. ....................... 26 3.4 Representation of a Data Item ......... .......... 27 3.5 Representation of a Variable...r.....o -------- .. 31 3.6 Some Motivation for the Proposed Representation. 36 ASSOCIATION MEASURES..... ............................ 4.1 Introduction ......... . ........... ............... 41 4.2 Association Measure Between Variables.... ...... . 42 4.2.1 Association Measure Between Numeric Variables. 43 4.2.2 Association Measure Between NNV's.. ........... 44 4.3 Association Measure Between Data Items .......... 48 vi TABLE OF CONTENTS (continue) Chapter Page V CLUSTERING PROCEDURES ................ . ....... . ...... 5.1 Introduction ................... ... ............ 54 5.2 Geometric Representation of D and A .......... 55 5.2.1 Hypersphere Representation of Data Items.... 55 5.3 Clustering the set D by Transforming A.to an N x N dissimilarity matrix.................... 58 5.4 General Centroid Method(GCM)................. 62 5.4.1 General Centroid A1gorithm(GCA)....... ...... 64 5.4.2 Analysis of the GCA ...... ................... 68 5.4.3 Computational Limits and the Problem of Storage.70 5.4.4 Some Special Cases. ......... ....... ......... 71 5.5 Graph Representation of D and A .............. 74 5.6 k-clustering Based on Graph Representation of D and A............... ........................ 78 5.6.1 k-clustering Algorithm ...................... 86 VI SUMMARY AND CONCLUSION ............. ...... ......... 6.1 Summary of this Research ...................... 92 6.2 Conclusion... ............. ... ......... ........ 94 6.3 Suggested Topics for Further Work ............ 95 BIBLIOGRAPHYooooooooo ooooooooo 00000000000000.0000. 97 vii LIST OF FIGURES Dynamic Property of Clusters ................... Visual and Algorithmic Clusters ................ Visual and Algorithmic Clusters ................ Visual and Algorithmic Cldsters .......... -..... Goal Oriented Clustering System ........... ..... 
Unit Spheres of M1, M2 and Md,. .............. Geometric Representation of D and A . . . . . ....... Hypersphere Representation.of the Data Items of Example 4.2 ........ . ................. . ....... .. Bipartite,krpartite, Uniform k-partite, Uniform Weighted k-partite and Average Graph.... ........ Graph Representation of D and A.of Example 4.2.. A Picture of Definition 5.9 for Weighted Graphs. viii Page 4 8 8 8 15 SO 56 57 76-77 77 80 CHAPTER I INTRODUCTION Most nominal-scale data , especially those which are to define behavioral characteristics , are such that their nominal value is unknown. Throughout this entire thesis, nominal- scale data which have uncertain nominal values are referred to as Non-Numeric Variables(NNV's). This terminology makes it possible to classify variables into two categories. The first category contains all variables which can be represented by a single number regardless of whether the scale is ratio, interval, ordinal, or nominal. Variables of this class are called numeric variables. The second category contains those nominal-scale variables which have uncertain nominal values and as stated above are called NNV's. The problem under consideration is to extend some of the existing clustering procedures so as to be able to cluster 3 set of N data items each of which is defined by n variables some of which are NNV's. The major topics covered in this thesis are 1) numeric representation of a NNV, 2) association measures between NNV's and also between data items defined by NNV's and 3) clustering a dissimilarity N-block matrix of a set of N data items defined by both NNV's and numeric variables. 1.1 Five Major Problems: The motivation for this research came from the design of a Television-Computer Educational (TCE) system. A major goal of the TCE system is to be able to act as an instructor and,at the v.1 2 same time, be relatively inexpensive, efficient, and fast enough to handle a large number of students. During the design study of the TCE system, five major problems as given below were faced. The first two problems are investigated and totally solved in this research and other peOple are working on the last three problems. 1. The Problem of Representing Data Items: The data items of the TCE system are the students and each student is repre- sented by a set of variables. Some of these variables are NNV's and they are important to the student clustering procedure. This type of variable must be represented numerically for clust— ering purposes. This problem is included in the scope of this research and is completed in chapter 3. 2. The Problem of Clustering: The number of students en- rolled for a given course is large, the facilities to serve these students are limited and the students are from different backgrounds. Because of these reasons and the fact that the TCE system is to be efficient, it is necessary to divide the students into different groups. A clustering algorithm that performs this task is essential to the operation of the TCE system. This problem is also included in the scope of this research and is discussed in Chapters 4 and 5. 3. The Problem of Assigning_Appropriate Actions to the Generated Clusters: The TCE system should be designed such that it gives prOper assignments to the students. The action sequence or sequence of assignments, tests, and homework corresponding to each group of students should be determined on the basis of the "label" of that group and modified according to certain 3 characteristics of the individual members of the group. 
In other words, the action sequence for a group cannot be pre-defined. Let Ci be the label of the ith group and Ai = Ail’AiZ’°"’Aini be the action sequence determined for this group. The action sequence Ai is called a pre-defined action sequence if it is not a function of the members of the ith group. 4. The Problem of Applicability of A;: Once the students are clustered and all action sequences are determined, it is time to determine whether or not the action sequence A1 is the best action sequence for the group Ci“ The action sequence Ai is accepted as the best action sequence for C1 if it possesses the following properties: a. there are no conflicts in Ai; b. all actions included in A1 are applicable to every individual member of Ci; c. the application of Ai to the members of Ci would provide the expected changes for the group Ci as well as the individual members of Ci° An algorithm is needed to test those three properties OfeVerY generated action sequence. 5. The Problem of Dynamic Clusters: The members of group Ci might move their positions from one point to another under the application of the action sequence Ai' This change of position is due to the changes in values of variables defining the students. This movement does not have to be in the same direction for all members of Ci' It is true that all members of Ci are under the effect of the same action sequence, but this does not mean that the progress or improvement of the _ — - — - - - -- —— - -— _ --- —— - - - - O - o - - -- u — - - .- — cs - - c - - a. - -_ - - O - o - q— -. o- — 0 fi - - - - ‘ - . -.. - - C - - - -. .- ,‘ 0 fl: ‘ Ogitn} Figure 1-1 . Dynamic property of Clusters. 4“ D 5 individuals will be the same. Figure 1-1 shows a set of four clusters at times to,t1,...,tn. The solid clusters are the positions of the dotted clusters at time t = t1 , i=1,2,...,n. The movement of the elements of the clusters is called the dynamic property of the clusters. The dynamic property of a cluster could be studied under the following two separate cases. 1. The members of a cluster move such that they stay in the same cluster. Clusters C1(t1) and C4(t1) of Figure 1-1 are examples of this type of movement. 2. The members of a cluster move such that they explode the cluster; that is, at least one of the members does not remain a member of the same cluster. Clusters C2(t1) and C3(t1) of Figure 1-1 are examples of exploded clusters. In Figure 1-1 C1(t0) ,...,C4(to) are four initial clusters and C1(t1),...,C4(t1) are four clusters after a period t=t1-t0. The application of the action sequence Ai to the cluster Ci(to) causes a change in values 0f the variables defining the members of this cluster. The change in position of a member of a cluster is due to the change of variables defining this member. The enlarged clusters are special case of exploded clusters. A system called Goal Oriented Clustering is designed in Chapter Two such that it is possible to solve all five major problems mentioned previously. 1.2 A Brief Review of Current ClusterigggMefhods: In this section a very brief description of some cluster analysis approaches is given. For those who are interested in a complete description , without referring to the original works, Anderberg [2 ] Chapters Six and Seven is a good reference. Ling [41] gives a description of some of the current work along with some criticisms. Most of the current methods in clustering are included in one of the following categories: 1. 
Agglomerative Algorithms: Most of the widely used clustering algorithms belong to this family of algorithms. There are a large number of papers in hierarchical and non- hierarchical clustering employing agglomerative algorithms. For a complete description and references to the work in this area the reader is referred to Anderberg [2]. Very brief descriptionsof some of the widely used agglomerative algorithms are given below. a. The Single Linkage Algorithm: This algorithm is also known as the "nearest neighbor" or the "minimum" algo- rithm. It has been considered by Sneath [65], Sokal and Sneath [66], Johanson [35], Shepard and Willmott [64], Zahn [79] and others. This algorithm combinesthe two clusters whose dis- tance is the minimum among all pairs of clusters. The distance between two clusters Xi and Y~ is given by J d(X.,Y,)=Minimum (d(x,y)) 1 J xEX i yEYj The merging process continues until a terminating criterion is satisfied or until all points belong to a single cluster. b. The Complete Linkage Algorithm: This algorithm has been studied by Sokal and Sneath [66], Johnson [35] and others. It differs from the minimum algorithm only in the definition of the distance between two clusters. If Xi and Yj are two clusters the distance between them is defined by 7 d(Xi,Y.) = Maximum (d(x,y)). J xtixi yEIYj c. The Centroid Algorithm: In this method the distance between two clusters is defined by d(Xi,Yj) = d(x,Y) , where Xi and Yj are two clusters and X , Y are the mean of these two clusters. The distance function d is the n-dimensional Euclidian distance function. d. Minimizing Within-Group Sum of Square (NGSS) Algorithm: This method was introduced by Fisher [16] and applied by Ward [77]. It is also known as Ward's method. Ward's method is employed by Jensen [36], Vinod [75], Dubes [10], Fisher and Van Ness [15], Fukunaga and Koontz [19] and others. The WGSS method is applicable only to some configuration similar to Figure l-2(a), where the clusters are compact and/or hyperelipsoidal and in either case they are well separated. In some cases, as in Figure 1-2(b) and (c), this method will not give an appropriate partition. Assuming the data items are in two-dimensional space, the optimal clusters of Figure 1-2 are the ones given in Figure 1-3. But by using the W688 method the resultant cluster for Figure l-2(a) is the same as in Figure l-3(a); and the resultant clusters for Figure l-2(b) and (c) are those given in Figure 1-4(b) and (c), respectively. 2. Divisive Algorithms: There exist a variety of tech- niques based on dividing the entire set of data items into subgroups. These methods are focused on finding the groups which are the most widely separated from each other. Divisive algorithms can be categorized into three different approaches. Clusters. ,. ,, ¢«- T ‘ \.: o .. J g Q : Q'il‘: {a " ~.‘.‘ 6 ‘ {1:‘. ..\. ‘ “‘ .I I‘\ r. 3 H‘ Q‘ ~ - ‘. o ’ . g . \ 00"". l‘ ' -' ' | “. no \‘ ... .\.: . ' O 0.. ‘I :n: :‘ 0 o o I. " "... t o ' V «n, 'o' ... ‘.‘\ 3'/ ‘ e. ’ (a) . (b) . (c, Figure 1-2. (a)- (b) (c) Figure 1-3. .1 (C) Figure 1-4. Visual and Algorithmic 9 The first approach is under the label "association analysis" and is discussed by Lance and Williams [43]. The second approach is proposed by Edwards and Cavalli-Sforza [12]. Scott and Symons [61] and Harding [28] have some work related to this approach. The third approach is based on the use of discriminant analysis. 
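Before taking up the discriminant-based divisive approach in more detail, the three pairwise linkage criteria defined above can be made concrete with a minimal Python sketch. It is illustrative only: the function names are not taken from the cited papers, and the Euclidean point distance used for single and complete linkage is an assumption of the example (the thesis fixes the distance function only for the centroid method).

    import numpy as np
    from itertools import product

    def point_dist(x, y):
        # Plain Euclidean distance between two points (assumed here only for
        # the illustration; d(x, y) is left generic for these two methods).
        return float(np.linalg.norm(np.asarray(x) - np.asarray(y)))

    def single_linkage(Xi, Yj):
        # d(Xi, Yj) = minimum of d(x, y) over x in Xi, y in Yj
        return min(point_dist(x, y) for x, y in product(Xi, Yj))

    def complete_linkage(Xi, Yj):
        # d(Xi, Yj) = maximum of d(x, y) over x in Xi, y in Yj
        return max(point_dist(x, y) for x, y in product(Xi, Yj))

    def centroid_linkage(Xi, Yj):
        # Distance between the cluster means X-bar and Y-bar.
        return point_dist(np.mean(Xi, axis=0), np.mean(Yj, axis=0))

An agglomerative pass repeatedly merges the pair of clusters whose chosen linkage value is smallest, until a terminating criterion is met or a single cluster remains; the discriminant-based divisive approach mentioned above works in the opposite, top-down direction.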
The idea is to start with some initial partition, compute a linear discriminant function and then iteratively reassign points and recompute discriminant functions so as to find the most strongly separated groups. Dubes [10] provides a computer program for this method. Since the k-clustering method , introduced by Ling[41],is one of the two methods which are extended in Chapter Five of this thesis, some more words about this method are appropriate. In the k-clustering method a cluster is a well-defined entity which depends on the dissimilarity matrix and a parameter k. Given a set of data items D, a dissimilarity rank matrix: R, and a positive integer k, the k-clustering algorithm will deliver all subsetsof D that are k-clusterd according to the definition of k-cluster. In this method the problem is not to define a chaster but the problem is to find the clusters. As soon as the dissimilarity matrix is determined no information about the individual data items is necessary and the dissimilarity matrix should be computed only once. 10 1.3 Scopeof This Research: Numeric variables are those which can be represented by a single number. For example, AGE and INCOME are two numeric variables. Non-numeric variables are those which cannot be represented by a single number. For example,BEHAVIOR and PERFORMANCE are two non-numeric variables. Every clustering procedure attempts to categorize data items defined by a set of n variables. Each data item consist of a vector of n dimension5(if all variables are numeric). Each component of this numeric vector is the value of one of the n variables for the given data item. The first goal of this research was to find a meaningful and/or comparable numeric representation for NNV's. The ap- proach used is explained in Chapter Three of this thesis. The second goal of this research was to find an appro- priate distance function between data items defined by both numeric variables and NNH's. The approach used is delineated in chapter Four of this thesis. The third goal of this research was to develop procedures capable of clustering data items defined by both numeric variables and NNV's. The form of the distance matrix for such data items is different from that for data items defined only by numeric variables so a new procedure was needed to cluster such data items. This problem has been approached in two different ways as explained in Chapter Five of this thesis. The fourth goal of this research was to approach the design of a general system which provides the solutions to the five major problems given in section 1.1.The goal oriented 11 clustering system is a procedure which is designed in a manner which may fully solve the above five major problems. The goal oriented clustering system is defined in Qhapter Two of this thesis. 1.4 Organization of This Thesis: Chapter one has been an overview and description of a number of major problems to be either solved in this research or considered for further research. Chapter two is a design study of a procedure called the goal oriented clustering system(GOCS). The description of the elements of the GOCS is given in this chapter. Chapter three provides a vector representation of'NNV's. A relation R,ca11ed a category relation , is defined and a partition is introduced on the range of every NNV X. Each NNV X is represented by an M-dimensional row vector and each data item Di is represented by an n x M matrix. 
Chapter four provides a number of appropriate distance functions to measure the dissimilarity (or similarity) between data items as well as between variables. An M-dimensional distance function is defined and some of the properties of this distance function are studied. The Minkowski metric is extended to an M-dimensional metric and is used to measure the dissimilarity between data items. Chapter five provides a number of different procedures for clustering data items defined by NNV's. Two different approaches are provided for clustering a set of data items D. The first approach is to change the N-block dissimilarity matrix 2 to an N x N square matrix and then use one of the 12 current clustering methods to cluster the set D. The second approach is to develop a new clustering procedure to cluster the data items based on the N-block dissimilarity matrixA. A method similar to the centroid method is develOped called the " General Centroid Method." Ling's [41] method is extended by using the graph representation of D and A . The theory for this extension is also provided in Chapter 5. Chapter six is a summary of this research. A number of important topics for further research are given. CHAPTER II GOAL ORIENTED CLUSTERING 2.1 Introduction: This chapter is a study of a new clustering system designed so as to be able to provide solutions for the five major problems given in chapter one. This system is fully described but only the clustering block of the system is completely developed. This system is calleda Goal Oriented Clustering System (GOCS). The clustering block of the GOCS consist of three different .procedures; 1) representation of data items defined by NNV's , 2) association measures between data items as well as between variables, and 3) a procedure to cluster an N-block dissimilarity matrix. These three procedures are discussed in Chapters 3, 4 and 5, respectively. Through discussion with people from different areas of application, it was discoverd that the five major problems mentioned in section 1.1 are not unique to the TCE system. People involved with research in areas such as behavioral science, psychology, sociology, biometry, biology,and educa— tion have problems very similar to those «mf the TCE system. The sociologists are vitally concerned about the overall behavior of every cluster after each period of time. They would like to see the relations between clusters as well as between the members of the cluster before and after application of the action sequences to the clusters. The psychologists would like to be able to represent their objects by NNV's 13 14 as well as numeric ones. The sociologists want to know the direction of any movement of their clusters. They use this information to predict some of the future events of society. The biologists want to know the interaction between clusters. They want to have >full control over the dynamic property of the clusters. 2.2 Goal Oriented ClusteringZSystem (GOCS): The GOCS consists of the three separate blocks given below. 1. Clustering Block: This block clusters the set of data items defined by a set of n variables some of which are NNV's. The input to the clustering block is a set of N data items each of which is defined by an n x M matrix. The output of one this block is a set of m clusters C1,C2,...,Cm. 2. Action block: This block assigns an action sequence to every generated cluster. These action sequences are based on the set of actions, the set of needs, the members of each cluster, and the cluster label. 
The input to the action block is a set of m clusters and the output is a set of m pairs (Ci, Af) , where Ci is a cluster and Ai is the action sequence corresponding to Ci' 3. Acceptor Block: This block teststhe applicablity of the action sequence A1 to the cluster Ci' Each Ai should be checked for the three general properties mentioned in section 1.1. The input to this block is a set of pairs (C1,Ai) and the output is either "accepted" or "rejected." GOAL ORIENTED CLUSTERING SYSTEM DATA ITEMS CLUSTERING BLOCK ' c: d H, ' B; [ ACTION BLOCK 1 e '3; I 9'! e ‘2 D“ . f1 . e > sf . =~ . > SO x/ ' SVT' §z_ l ACCEPTOR BLOCK YES FIND NEW VALUES 0? ALL VARIABLES END OF ANALYSIS? Figure 2-1. 15 fr RECHECK ACTIONS AND/OR VARIABLES TO FIND THE CONFILICT(S) i REMOVE ALL CONFLICTS a] INTERPRET RESULTS 16 The process of the GOCS is not limited to finding the most similar objects. It is to assign a set of actions to a set of data items in order to achieve a set of pre-defined goals. Figure 2-1 is a simple diagram of the GOCS. In order to take care of the dynamic property of a cluster , the GOCS is set up such that it could run periodically. The GOCS halts either by user request or when convergence occurs. The GOCS is said to converge if it generates only one pair (:1, Af); that is , all data items are clustered into one group. 2.3 Elements of The GOCS: The following is a list of elements needed to be defined and/or introduced for the GOCS before using it. 1. Data Item: Every member of a finite sample space Q is called a data item. For example, let afbe the set of all graduate students at Michigan State University. Everyone enrolled as a graduate student at MSU is a data item. To define the set of data items , it is sufficient to define % . Every member of i is a data item. The set D=[D1,D2,...,DN] is called a set of N data items and is a subset of‘X.. 2. Variable : A data item Di could be defined by a set of n variables and represented by a characteristic vector or by an n x M matrix(if a data item is defined by some NNV's , one might represent it by an n x M matrix as given in(:hapter Three.) The componentsof this characteristic vector (or the row of the characteristic matrix) are the values of the n variables for a data item Di' The variables which represent data items for the goal oriented clustering system are totally a function of the purposes of the clustering. 17 One might represent a graduate student of MSU by D-= (Department, 3.8. Major, Previous Institution) 1 for a purpose P1. For another purpose, say Pg. one COUId represent the same student by Di= (Department, GPA, Term Credits). Variables are the means to measure the distinction and the differences between data items. 3. What to Cluster: The set D = ]D1’ D2,..., ON} and the variables V1,V2,...,Vn are chosen. Does one need to put "similar" data items in one group (data item clustering) , does one want to see the "similarity" between variables (variable clustering) or does one want the simultaneous clusteiing of data items and variables? Answers to these questions must be decidedi“order to know what is to be clustered. 4. Variable Transformation: Every clustering procedure deals with a measure of association between variables. If the variables are not of the same scale of measurement it is necessary to homogenize the variables to have a unique scale of measurement. 5. 
Concept of Similarity: One of the well known definitions of clustering procedures says that a clustering procedure must generate an optimal partition on the set of data items so that the most "similar" data items are grouped together. The measure of "dissimilarity" between two elements of two different groups is to be maximum. Here, one sees the important role of similarity measure in the clustering procedure. The concept of similarity must be well defined aniavailable to the clustering procedure even before choosing the clustering algorithm. 18 6. Method of Clustering: The term "cluster" is not an abstract term with a fixed mathematical definition. The term "cluster" in classification , pattern recognition, cluster analysis and other related areas is treated much like the term "point" in mathematics. Depending on the set of data items, the purpose of clustering, the set of actions, the set of needs and some other factors, one could choose the appropriate method of clustering. 7. Algerithm: Once the data items are defined by homogeni- zed variables,the measure of similarity (or dissimilarity) is defined and the criterion of clustering is chosen,an algorithm is needed to specify the steps of the clustering procedure as well as a computer program to implement this algorithm. The choice of algorithm is important to the result of clustering. To choose the appropriate algorithm sufficient information must be known about the capacity of the available computer system, the size of the set of data items , the number of variables, and the method of clustering. 8. Number of Clusters: In some of the clustering methods, like non-hierarchical clustering, one needs to specify the number of clusters as a parameter to the system or as aparameter to the algorithm. In some other type of clustering the number of clusters does not have to be specified during the operation of the system. Since most of the existent algorithms arevnOt defined so as to be able to decide the number of clusters, it seems reasonable that the analyst for the clustering system keep accurate and timely records on this parameter. 19 9. The Set of Needs: There is always at least one purpose for using any clustering procedure. When one classifies his groceries to cold, meat, frozen, canned food and dry food he might want to satisfy the following needs: a. To have easier access to his groceries; b. To put the cold bags in the cooler in the back of his car; c. To unload the cold foods before the canned foods; The set of needs resulting in classification of groceries consists of those listed above. In general, whatever goals one wishes to reach by means of clustering are called the set of needs. 10. The Set of Actions: To satisfy the needs, a set of appropriate actions must be applied to the clusters in a certain order. Application of the set of actions to the generated clusters is intended to improve, in some sense, the behavior of the members of the cluster as well as the cluster itself. Since there is a unique sequence of actions corresponding to each cluster, one could represent a cluster by its action sequence. 11. Acceptor Block: In this new method of clustering the procedure of clustering is to generate a partition;7=]C1,C2,...,Cm] on the set of data items. Also, the procedure of clustering is to rassociate an action sequence Ai with the cluster Ci. 
The action sequence corresponding to Ci should be based on the setOf needs as well as on the members of Ci“ The pair (Ci,Ai) identifies both the group of data items and the actions which the members of the group should 22. In this method the pair (Ci,Ai) is called a cluster. Before applying the action sequence Ai to the group Ci, it is reasonable to test the applicability of Ai to the individual members of Cias well as to the group Ci' An algorithm which teststhe applicability of A1 for the group Ci is needed.The 20 block containing this algorithm is called the acceptor block. 12. Analysis of The Result: At the time one starts collecting data items for clustering analysis, his knowledge about the data items is non-existent or very small. By the time the procedure of clustering is done , at least the following information is available. a.the similarity of data items in terms of a set of given objects; b. the reaction of data items to the application of the action sequences; c. the effect of applying the action sequences.to the individuals; d. the relation between data itemsof one group; e. the relation between groups; f. the measure of "improvement;" g. the convergence direction of the clusters as well as individual members of clusters. Determininthhe degree of achievement of the goals, effectiveness of the procedure, efficiency of the algorithm and effectiveness of the application of the sequence of actions to the individuals are some of the factors which could be measured by analyzing the results of the GOCS. The GOCS is the solution to the five major problems given in section 1.1. The GOCS is introduced in this research and the direction for additional work on the GOCS is given in) Chapter six. CHAPTER III NUMERIC REPRESENTATION OF NNV FOR CLUSTER ANALISIS 3.1 Introduction: To cluster a set of data items D = {01, Dz, ..., DN} it is necessary to represent each data item Di by a set of n variables. Some of these variables are such that one can associate a single number to every one of them. This type of variable is called a numeric variable. For example age and income are two numeric variables. Some variables are such that they cannot be repre- sented by a single number. That is, the meaning of this type of variable cannot be represented by a single number as it could be for age and income. This second type of variable is of nominal scale , but the reason that it is considered a different type is that the nominal category of this type of variable is not certain. For example, let X represent the behavior of a person ,P, from a population in which M different types of behavior have been recognized. Usually, there is not enough information available to define the type of behavior for person P as one of the M types of existing behavior. That is why X cannot be represented by a single number. This type of variable,which has been referred to as a NNV in Chapter One, has not previously been used in cluster analysis. In this chapter a numeric representation for NNV's is proposed in a manner that could be used for cluster analysis. Anderberg (2) 21 22 gives a complete listing of problems concerning numeric variables. 3.2 Category Relation: Let X be a NNV and A be the range of X. It is assumed that the range of every NNV is a finite set. For example, assume X to be the behavior of an object Q and A = [A1, A2, ..., AM} to be the range of X. Al, A2, ..., and AM are M different types of behavior. 
The definition-of the elements of the set A as well as the number of these elements are dependent on the given problem. Corresponding to each NNV ,X, and its range ,A, define a set A' which is the set of the original population related to X. Corresponding to each element Ai of the set A, define a set A; which is a subset of A' and contains those members of the original population which have the same value A1 for the NNV X. For example , if X is a NNV ,say behavior, and A1, A2,...,AM are M different types of behavior then A; is the set of all members of the original population which have behavior of type Ai° The following paragraph describes some of the above notations. Suppose the problem is to cluster a set of 1000 data items (persons) to give them a one-year industrial job training. Every individual data item is defined by the following six variables: Vl=age, V2=income, V3: marital status, V4=background, V5= behavior , and V6: performance. The original population with respect to V5 from which the given sample is taken is 0f size 100,000. Let A be the set of all types of behavior which have been identified in the original population. Based on the 23 problem under study (clustering for industrial job training ) one can arbitrarily assume there are 4 different types of behavior, that is, A={A1,A2,A3,A4} . Let ni be the number of people with.behavior of type Ai,that is , ]Ai[= ni. For this example let n1= 25,000 , n2 = 37,000 , n3 = 28,000 and n4=10,000. This means 25,000 people of this population have behavior of type A1, 37,000 have behavior of type A2 and so on. It is clear that if one exchanges the problem given above with the problem "cluster a set of 1000 persons to give them a one-year job training in salesmanship" the definition of the different types of behavior might change as well as their number. Definition 3.1: Let X be a NNV and A be the range of X. Let P and Q be two members of the set A' (the original population related to X.) Define a relation R on the set A'as follows: Let P and Q be two elements of A'. Then P is said to be related to Q if and only if they have the same category defini- tion in terms of X. R is called the category relation. It is easy to verify that R, the category relation defined by definition 3.1 , is an equivalence relation. Lemma 3.1: A category relation R induces a partition ”.{Al’ A2’ "" Ad‘over the set A' ° Proof: Let the equivalence classes of the equivalence : relation R be 81,82....,Bk. It is sufficient to show that the partition T =[Bl,Bz,...,Bk] is the same as the partition 27 a] A',A§,...,Afi} . Let P and Q be any two elements of Bi' 24 This implies P and Q have the same category definition;and this further implies P and Q belong to a unique subset of A',say A5. Therefore, Bi is a subset of A}. On the other hand let U be any member of A; other than P and Q,if such an element exists. This implies P and U have the same category definition which inltfirn implies P and U are related. Since the 31's are disjoint, P and U have to belong to Bi' Therefore, Bi: Ag; and this is true for all i = 1,2,...,k and j = 1,2,...,M. In the case in which Bi has less than two elements it is easy to show that there must exist a corresponding A; with only one element in it- QED- Note 1: For a given NNV X with range A and population A' the partition 5;: [Ai,A§,...,A&} as well as the category relation R are not unique. That is,27and R might change from one problem to another. 
This is true because the definition of the elements of the set A (categories of X) might change by the nature of the given problem. Note 2: Depending on the nature of a given problem one might need to assign a set of weights to the elements of the set A,the range of a NNV X. These weights could be any real number in the interval (0, I]. If no weight is assigned to an element Ai it is assumed that the weight of Ai is 1. The weights corresponding to a set A of M elements are given by a vector UK: (w1,wz,...,wM) or by a diagonal matrix with weights on its diagonal. Definition 3.2: Let A =[A1,A2,...,AM] be the range of a NNV X and A' be the finite population related to A. The density degree of an element Ai , denoted by dd(Ai), is defined 25 V by A- dd(Ai) = iii-#- If A' is an infinite set the density degree of Ai is defined for i=1,2,...,M. by definition 3.3. Definition 3.3: Let A and A' be as defined in definition 3.2 except that A' be an infinite set.Let k be the number of ele- ments of partition 7%=[Ai,Aé,...,AlU which are infinite. Define dd(Ai) as follows: 0 if A; is a finite element of/7 l/k if A; is an infinite element offlf Definition 3.4: Let f be a function from 8x8 to the reals, where S is any set. Then f is called a distance function if: a. f(X,Y))0 , Vx,res; b. f(X,Y) = 0 iff X=Y VX,Y£S; c. f(X,Y) = f(Y,X) Vx,res. If in addition to the properties given above one has d. £(x,r) + f(v,z) > £(x,z) Vx,v,z es then f is called an Euclidian distance function or a metric. If for a function f and a set S one has the properties (a) and (c) of definition 3.4 plus the fact that f(X,X) = o Vxe s then f is called a psuedo distance. A psuedo distance which satisfies the triangle inequality is called a psuedo metric. For example, one can define A on A'xA' as follows: 2(P.Q) = [damn-damp! where p ‘Ai and Q eAJg,The Ai's and Ai's are as defined in the beginning of this section. It is easy to verify that A is a 26 psuedo metric. 3.3 Representation of a NNV: Throughout the remainder of this chapter assume X is a NNV defined on the set A , A' is the population set corresponding to X and 75" [Ai, Aé’°”’AM] is a partition on the set A'. Also, assume N” = (w1,w2,...,wM) is a weight vector corresponding to the partitionzg. Given a NNV X with range A and population A' and a data item Q there are three different possibilities as follows: 1. There is no information for the data item Q with respect to the NNV X. In this case it is assumed Q is uniformly distributed over A'. That is, P(Q6Ai) = dd(Ai). 2. Q is uniformly distributed over/7 , a subset ofag, with probability p. In this case , lAfl (I'P)Tfifi— if c; = O P(Q£EA{) =¢ pcdd(ci)) k if c; # a . where r'= iA;l, A;2,...,A;k] , swigl c; = Ain 3' and dd(Ci) =|€ii /|Bd A'. , D = A'-B' , $1 3. Q is distributed over [7, a subset ofqg, with k distribution function (psl,p52,...,p5k) such that 1 psi - p’ 1 .In this case [All I (l-p) if c: = a In 1 P(QtAi) =4 . , ._ . . psj 1f Ci # O and i-sJ , where D, B' , Ci and/7 are the same as defined in case 2. Note that in case: 2 and 3 it is assumed that Q is uniformly distributed over A'-B'. If one has complete information about the data item Q then A'-B' is an empty set. 27 Definition 3.5: For a given NNV X and a data item Q define P, = (Pm 2A1).P(Q “5)...”ch mm). The vector Px is called the distribution vector of X for the data item Q. If there is any weight vector corresponding to the range of a NNV X then one might represent X by the vector Vx = (w1P(Q 6 Ai) ,w2P(Q£ A5) ,. . . ,wMP(Q£A;‘)) . 
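Before the worked example, the computation implied by the three cases above and by Definition 3.5 can be summarized in a short Python sketch. It is illustrative only: the function name and calling convention are assumptions made for the example, and it covers the finite-population case, spreading the unassigned probability mass uniformly over A' - B' as in cases 2 and 3. The check at the end reproduces the vectors of Example 3.1, which follows.

    def distribution_vector(sizes, known=None, weights=None):
        # sizes  : subpopulation sizes |A'_1|, ..., |A'_M|
        # known  : dict {category index: given probability} (cases 2 and 3);
        #          an empty dict is case 1 (no information about Q).
        # weights: optional weight vector (w_1, ..., w_M); if given, the
        #          weighted vector V_X is returned together with P_X.
        known = known or {}
        p = sum(known.values())                        # mass already assigned
        rest_total = float(sum(s for i, s in enumerate(sizes)
                               if i not in known))     # |A' - B'|
        Px = []
        for i, s in enumerate(sizes):
            if i in known:
                Px.append(known[i])                    # given probability
            else:
                Px.append((1.0 - p) * s / rest_total)  # uniform over A' - B'
        if weights is None:
            return Px
        return Px, [w * q for w, q in zip(weights, Px)]

    # Example 3.1 below: |A'_1| = 3, |A'_2| = 3, |A'_3| = 4, P(Q in A'_2) = 7/10,
    # weight vector (1/3, 1, 1/4) (the reading consistent with V_X given there):
    # P_X = (9/70, 49/70, 12/70) and V_X = (3/70, 49/70, 3/70).
    Px, Vx = distribution_vector([3, 3, 4], known={1: 7/10},
                                 weights=[1/3, 1, 1/4])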
The following example is to explain the notations Px and Vx' Example 3.1: Let X be a NNV with range A={A1,A2,A3} and population of size 10. Let 7],}= [Ai,A§,A‘3] be a partition over A' with IA'1[= 3 , [Ail = 3 and (Ag: 4 and w”: (l/I,1,1/4) be a weight vector corresponding to A (org). Let QEA' and assume it is known that P(Q€Aé) = 7/10. In this example ['7 = A5 , D =[A',A'3] , |Dl= 7 , P(QéAi)=(1-7/10)3/.70 = 9/10, and P(Q€A:'5)= (1-7/10)4/70 = 12/70. The distribution vector of X for the data item Q is given by Px = (9/70, 49/70, 12/70L and Vx is given by Vx=(3/70, 49/70, 3/70). 3.4 Representation of a Data Item: Let D =[01, DZ’ ..., 0N] be a set of N data items and n be the number of variables defining each data item. If all n variables are numeric variables then each data item Di can be represented by a row vector Di = (vil,viz,...,vin). The set of all data items can be represented by an N x n matrix, when all variables are numeric. An alternative is to have s numeric variables and n-5 NNV's. Assume that the first 5 variables are numeric and the last n-s variables are NNV's. Let 77;; be a partition of order 1 28 ni over the population of the NNV Vi , i=s+l,s+2,..,n. Define M = Maximum(ni) . Represent each numeric variable by an M-dimEd2inzal row vector with the first component equal to the actual value of the variable and the next M-l components equal to o. For NNV Vi , if ni M, add M-ni zeroes to the end of both the distribution vector and the weight vector corresponding 1 P;. is called the augmented distribution vector of Vi. 1 to Vi and denote them by V; and P¢,,respectively. The vector 1 I .. , , . PVi-(pll’piz’ ° ° ° ’plni,&fi9£:~;fi) 1 and vVi=cwi1pil’wizpiZ"'°’winipinitosoa°°°so)s for i = s+l,s+2,...,n., where (Pil’PiZ""’pini) is the distri- bution vector of Vi and (wil,wiz,...,wini) is the weight vector of Vi' Using vector representation of NNV's , each data item can be represented by an n x M matrix as follows: v11 0 ... 0 F v31 0 o i i 1 Di- p5*1’1 ps+1,2 ' Ps+1,M i 1 Pi ps+2,1 ps+2,2 S+2,M1 - i i i i LPn,1 Pn,2 --- pn,M J . If the NNV's are represented by V¢_'s (there are weight vectors 1 O . I . i involved) then one can replace p:+j k with "s+j,kp:+j,k 1n D D and denote the.resultant matrix by 0:. In chapter pour it is shown how one can associate a i . dissimilarity measure to D1 and Dj using the matrices Dw and 0a. "I 29 The letter "D" with no subscript stands for the set of data items, with subscript "i" stands for the ith data item, with superscript "1" stands for the matrix representation of the 1th data item, and with superscript "i" and subscript "w" stands for the matrix representation of the ith data item when the weight vectors are effective. It should be mentioned that the representation of a data item, defined by NNV's, preposed in this chapter is one of the possible ways to represent such data items. Of course, there are some other ways to represent a data item defined by NNV's. For example, an alternative method for representing a data item defined by s numeric variables and n;s NNV's is as follows: Suppose the first 5 variables are numeric and the last n-s are NNV's. Let the number of categories of the Kth NNV be nk. The column vector ' i i i T ,Ps;1 ’ns*1,...,pn,1.’...,pn’nn) Wi=(v:,v:,...v:,P:+1,1... is an alternative form to represent the ith data item. In this alternative representation of data items,the components of the vector Wi all are treated as individual numeric variables when dealing with clustering algorithms. 
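The assembly just described can be condensed into a short Python sketch. It is only an illustration: the function names and sample values are hypothetical, and the weight vectors are left out (multiplying each NNV row of the matrix componentwise by its weight vector gives the weighted matrix as noted above).

    import numpy as np

    def data_item_matrix(numeric_values, nnv_dists, M):
        # Augmented n x M matrix for one data item: each numeric variable
        # becomes a row (value, 0, ..., 0); each NNV's distribution vector
        # is padded with zeros up to length M.
        rows = [[v] + [0.0] * (M - 1) for v in numeric_values]
        rows += [list(d) + [0.0] * (M - len(d)) for d in nnv_dists]
        return np.array(rows)

    def flattened_item(numeric_values, nnv_dists):
        # Alternative representation W_i: the numeric values followed by all
        # distribution-vector components in one long vector.
        comps = list(numeric_values)
        for d in nnv_dists:
            comps.extend(d)
        return np.array(comps)

    # A hypothetical data item with s = 2 numeric variables and two NNV's
    # whose partitions have 3 and 2 categories, so M = 3:
    Di = data_item_matrix([27, 42000], [[0.2, 0.5, 0.3], [0.6, 0.4]], M=3)
    Wi = flattened_item([27, 42000], [[0.2, 0.5, 0.3], [0.6, 0.4]])
    # Di has shape (4, 3); Wi has 2 + 3 + 2 = 7 numeric components.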
That is, W1 is treated as a data item of s+ns+1+ns+2+n+nn numeric variables. To see how this problem would affect the result of a clustering analysis one has to collect a set of real data and check the resulting clusters for being a well optimized set of clusters. Since the data items of NNV's are clustered for the first time in this thesis, the representa- tions and clustering methods proposed in this thesis could be a base for comparison for the future work in this area. 30 Example 3.2: Let each data item 01 be defined by six variables,where U;,V;,W are three numeric variables and X, Y, Z are three NNV's. To be specific , let each data item be a person defined by U=age, V= income, W=marital status, X=behavior, Y= background, and Z= performance. X,Y,and Z are defined on some sets A,B, and C with population A',B', and C' , respectively. The original population for all three NNV,s is the same and it is assumed to be 10. That is, IA'I =IB'I =IC'I =10 . The corresponding partitions are: 77A=[A'1,A'2,A'3} ,773=[B'1,35, 33,33,133 andzlc = ‘Ci,C§] and the subpupulations are: [Ail = [Ail = 3 , [A5] = 4 , [Bil = lag] = 3 , [83' = 2 ,lBAl = lag] = 1 , [Ci] = [cg] = s The weight vectors are: N =(.1, .2, .3) , "B =(.3, .2, .l, .4, .5) and "C =(.1,.2). Let assume for person Di it is known that: P(DiEA£)=l/2, P(Di€B:'5) =3/5 and P(Di eCi) = 2/3. The distribution vectors of X,Y, and Z for person Di when computed are as follows: (3/14, 1/2, 4/14) for X ,(2/3, 1/3) for Z and (6/70, 6/70, 42/70, 8/70, 8/70) for Y. The augmented weight and distribution vectors are: wA_=(.l, .2, .3, o, 0) , WB=(.3, .2, .l,..4, .5) , WC.=(.1, .2, 0, 0, 0), Px =(3/l4, 1/2, 4/14, 0, 0) , Py =(6/70, 6/70, 42/70, 8/70, 8/70) , Pz =(2/3, 1/3, 0, 0, 0). Let the person Di be 34 years old (U=34) with $15900 income 31 (V=15900) and married(W=0). The person 0; could be represented by the following 6 x 5 matrix. (The weight vectors are considered.) r 34 0 0 0 0 15900 0 0 0 0 . 0 0 0 0 0 01 = " 3/L40 2/20 12/140 0 0 18/700 12/700 42/700 32/700 40/700 2/30 2/30 0 O O L J Let Dj be another person defined by the same six variables and represented by the following information: 0 = 25 , v = 12000, w = 1 , 9x = (1/7, 3/5, 9/35) , Py = (0, 2/55, 10/55, 40/55, 3/55) and P2 = (3/5, 2/5). The matrix representation of Dj using the weight vectors is: 1’25 o 0 o 0 7 12000 0 o 0 0 1 o 0 0 0 j 1/70 6/50 27/350 0 0 D" 3 0 4/550 10/550 160/550 15/550 3/50 4/50 0 0 0 . L J In general,let D='{D1,D2,...,DN} be a set of N data items, n be the number of variables (consider numeric variables as NNV‘s with partition of order 1) , and M be the maximum order of the partitions of the population of the variables. Define ' i 01 , 0w * Using the alternative way to represent D- one has 0%:(34,15900,0,3/140,2/20,6/70,18/700,127700,42/700, 32/700,40/700,2/3o,2/30)T. , D , and EH as follows: 32 l r i i 1 P11 P12 °°: PIM ’ i 1 911 p22--- 92M 1 i i E3111 p112 ... PnMJ where ij is the kth component of the augmented distribution vector of the jth variable for the ith data item; 1 . o =[--i1 H ijka where "jk is the kth component of the weight vector correspon- ding to the jth variable; p = [ 01? 02; ... 30”]; and 1: 2~ : N 9-": [ D . Du E o o o :Dw] o The matrix Di is an augmented n x M matrix containing all th data item , excluding the weight information about the i vectors; glis an N-block matrixcontaining all available infor- mation about the whole set of data items , excluding the weight vectors. 
0: contains the same information as 01 plus the weight vectors and 2M contains the same information as 2 plus the weight vectors. Throughout the reminder of this chapter consider M, N, n, D, Di, 0:, D, and Dwas defined in this section. 3.5 Representation of a Variable: Let D =[D1,DZ,...,DN} be a set of N data items , each one represented by n variables( 5 numeric and n-5 NNV). A NNV X1 is defined on a set Ai with population set A1 and ”a1 is a partie tion of order ”1 on the set A1; 771, =1 A11» Aiz’HU an} 33 for i=s+l,s+2,...,n. According to section 3.3 each NNV can be represented by its distribution vector P§i=(p:1,p:2,...,p§ni) or by its augmented distribution vector ’k _ k k k Pxi ’ (pil’ p12, 0009 pinisosos°-°:°)s h where pgj is the jt component of the distribution vector of th the 1 variable for the kth data item. Therefore, P§_ is the 1 th th distribution vector of the i NNV for the k data item. An alternative for representing the ith NNV for the kth data item is vk = pk rw. xi xi A1 where 'WA is the weight matrix corresponding to the NNV xi 1 and is given by r % WAi= 0 0 W3... 0 L0 0 O o o o "nil o The weight matrix "A1 is the same as the weight vector (wl,w2,...,wni) but in a different notation to make multiplica- tion possible. The augmented weight matrix of X1 is given by le o ... 0 0 ... 0'1 0 W2 ... 0 0 ... 0 w' = 0 0 .. wniO 0 Ai 0 0 . 0 0 . 0 o 0 . 0 0 0 34 For a numeric variable , Xi, an N-dimensional vector T (x11, x21, ...,xNi) will represent the set of N values of the variable Xi for N data items 01,02,...,and DN; xki is the value of Xi for the kth data item. If Xi is a NNV then the matrix r. . . 1 1 1 p11 P12 --- Plni i . o 921 P22 --- Pin, W i . - 1 1 le pNz ° ' ' pNniJ will represent the set of N values of X1 corresponding to N h row of the matrix different data items Dl,D2,..., and DN' The jt X1 is the distribution vector of Xi for the jth data item. One could use the augmented distribution vectors of X1. In this case i X is an N x M augmented matrix , where M = Maximum(ni) . An i:;n-s alternative form for representing the set of all N values of a NNV Xi is i i Xw = X "A1 , where X1 is the augmented N x M matrix as defined above and "A1 is the augmented weight matrix corresponding to Xi. The whole information about a set of N data items of n variables (NNV or Numeric) can be stored in a 3-dimensional array as follows: 5 = [Pix 1 . where i=1,2,...,-u stands for number of variables, j=l,2,...,N stands for number of data items , and k=l,2,...,M. Another form for representing X could be x = [ x1! xzs...;x"], 35 where each X1 is an augmented N x M matrix. One might use x: to represent each variable. In this case if X1 and "A- are w 1 augmented matrices one could define Xw as follow: 2. .n we use :xwlo _. l: Ew-[xw.x In chapter four it is shown how one can associate a correlation (dissimilarity) measure to Xi and Xjusing the . i j matrices Xu and X". The following example is to explain the notations and definitions given in this section. Example 3.3: Three data items Dl,D2 and D3 are defined by two variables X and Y with ranges A and B and populations A' and B' , respectively. The partitions of A' and B' are: 7TA =[ A'1,A'2, A3] and 73 = [Bi,B§,B.‘5,B'4], respectively. In this example one has n=2, N=3 and M=Max(3,4)=4. 
The following information is given for each data item: for the first data item P(01€A'1)=P(D1EAé)=P(01€A§) = 1/3 and P(DIEBi) = P(DléBi) = 1/2 , P(DIEB§)=P(01£83) = 0; for the second data item 9(026A1) 0, and P(D2 6 Ag) I>(02 £95) for the third data item 1/2 , 9(02 £ Ag) 9(02e31) 9(02a3') = 1/3 , ”029.35g = 0; 4 PCDSEAi) = 9(03 £113) = 1/4 , ”0332115) = 1/2 and 903353;) = 9(03235) = 9(1):, 23%) = 11(03 {135) = 9(03535)=1/4. The weight vectors are: u, = (.1, .2, .3) and "a = (.1, .2, .3, .4). The augmented weight matrices are: "A: (.1 o 0 2 0 0 0 0 c ‘1 0 0 0 0 .3 0 0 0 J. 36 r.1 0 0 0 2 0 ; wB = 0 0 0 0 0 The augmented distribution vectors are: The augmented matrices X1, X , (1/2. (1/4. Y=X2) are: ’1/3 1/3 x1: 1/2 1/2 1/4 1/2 L. '1/3 1/3 1 x"- 1/2 1/2 Ll/4 1/2 3/2 0 x2: 1/3 1/3 W 1/4 1/4 The two-block matrices X and p X Ill 3/3 1/3 1/2 1/2 [— g/4 1/2 (1/3, 1/3, 1/3, 0) , P; = (1/2, 1/2, 0, o) , P; = (1/3, 1/2, 1/4, 0) , p; = (1/4, 2 Xi, and X3 1/3 6 5/2 0 0 0 x2 = 1/3 1/3 1/4 g 3/4 1/4 1/3 61’11 0 0 0] 1/30 0 0 0 2 0 0 = 1/20 1/4 q 0 0 .3 0 3/40 9 0 0 0] 0 1/2‘ 71 0 0 0] fi/zo 0 113 o .2 0 o = 1/30 1/4 1/3 0 0 .3 0 g/40 p 0 0 .g Xyare: 1/3 0'hfr1/2 o 0 1/2 0 0 E 1/3 1/3 0 1/3 1/4 0 ,iL1/4 1/4 1/4 1/4 0 l 0 0 'fl . 0, 0, 1/2) , 1/3, 0, 1/3), 1/4, 1/4, 1/4). (assuming X=X1 and q 0 1/2 0 1/3 , I/q 1/4 2/30 3/30 01 2/20 0 0 2/20 3/40 0 , J 0 0 4/201 2/30 0 4/30 2/40 3/40 4/40, 37 ’1/30 2/30 3/30 0 X = 1/20 2/20 0 0 L 1/40 2/20 1/40 0 L The matrices D1 , D , D , D1, w 1/3 1/3 1/3 0 1/2 0 0 1/2 1/2 1/2 0 0 1/3 1/3 0 1/3 1/4 1/2 1/4 0 1/4 1/4 1/4 1/4 The three-block matrices D and 1/3 1/3 1/3 0 3 1/2 1/2 0 = : 1/2 0 0 1/2 ; 1/3 1/3 1/30 2/30 3/30 0 f 1/20 2w: : 1/20 0 0 4/20 1/30 fi/zo 0 1/30 2/30 3/40 2/40 . 1/30 01 = W 1/20 2 1/20 0w = 1/30 03 _ 1/40 w - 1/40 Dyare: 0 0 3 0 1/3 E 2/20 0 0- D2 , and D3 w 0 0 3/40 are: 2/30 0 2/20 2/30 2/20 2/40' 4/20‘ 4/30 4/40) . 3/30 0 0 4/20 0 O 0 4/30 3/40 0 3/40 4/40 1/4 1/2 1/4 0 1/4 1/4 1/4 1/ 51/40 2/2013/40 0 . a 2/30 0 4/30: 1/40 2/40 3/40 4/40 . The elements of the kth row of D1 (03) are the same as the elements of the ith row of Xk (X5). Therefore, to calculate £.(1W) and D (2,) one has to calculate only Xk's (Xb's). 3.6 Some Motization for the ' Proposed Representation: The representations of a NNV (distribution vector) and a data item (an n x M matrix) are developed basically for cluster analysis of data items. That is, the representations prOposed in this chapter are such that one could use them for the 38 purpose of clustering. All changes in representation of variables start from introducing a distribution vector to represent the value of a variable for a given data item. Therefore, for simplicity, hereafter this new type of representation is referred to as "Distribution Vector Representation" (DVR). The previous representation of variables and data items is referred to as "Single Number Representation" (SNR) . There are some advantages as well as disadvantages to the DVR method , concerning the cluster analysis problem. Some of the advantages are as follows: ..- 1. Let X be a NNV with a range of M elements , that is , there are M nominal categories. It is known that the category definition of X for a given data item Q is not specifically defined. In the SNR method one either has to completely ignore this variable or make a rough guess in assigning a nominal value to X. In the first case,the question is; what happens if most of the variables defining the data item Q are NNV's such as in psychology and behavioral science? 
In the second case , the question is that of assigning a probability of l/M to every category of the range of X. ‘Comparing the distribution of X , in case of no available information , to the rough guess method one can easily see the difference. In the rough guess method , for N data items on the average N/M times one would be right. For example. let X be a NNV with range A =fA1,A2,A3f , population 10,000 and the subpopulations [Ail 100 , ”i" 6000 and ”[115“, 3900. 39 Let a sample of size 450 be drawn from this population.With the rough guess method one would consider approximately 150 observations out of 450 to be of category one ,in terms of NNV X. But it is known that even in the whole population there are only 100 observations.df category one. This simple example shows the danger associated with the rough guess method. 2. The DVR method makes it possible to compare the values of a variable X for two data items. Let P and Q be two data items and let Px(P) = (p1,p2,...,pfi) be the DVR of a NNV X for the data item P and Px(Q) = (q1,q2,...,qM) be the DVR of X for the data item Q. One can comparg pi to qi for i=1,2,...,M as follows, If pi = qi it means both P and Q have the same probability of being in category Ai' If pi~>qi then P has.a larger probability of being in category Ai than Q does. 3. The DVR method can be applied to those nominal-scale variables which have certain category definitions. Let X be a nominal-scale variable and Q be a data item.Assume X has M categories A1,A ..,AM and it is known that the category of 2" Q is Ai' Consider r'=]Ai] and p=l in case 2 of section 3.3. Note that in this case the subpopulationlAi] need not be known. In this case the distribution of X for the data item Q would be (0,0,...,1,0,0,...,0,0) . These illustrate some of L-i-1-J ’—-M-i-————’ the advantages of the DVR method. One of the disadvantages of the DVR method is the problem of storage and speed of computation. The amount of computer storage which is needed to operate a certain clustering algorithm when the data items are represented by the DVR method 40 is almost M times as much as when the data items are repre-- sented by the SNR method. This problem can be partially solved by using "The Storage Data Approach " methods; see Anderberg (2) Chapter Six. In this case the operation of clustering becomes significantly slow. For some areas such as psychology and behavioral science, where a large number of NNV's are required to represent the dataitemm the amount of time needed to solve the clustering problem when the DVR method is used is acceptable because of the advantages previously stated. In some areas where less NNV's ( or even nominal-scale variables with certain category definition) are used to define the data items one should compare the advantages of using the DVR method in spite of its slowness. Note that slow speed computation will cause more computer time and greater COSt. Also, note that the use of the DVR method will cause an increase in the number of compu- tations as much as M times more than when the SNR method is used. Hereafter, when the DVR method is chosen, it is assumed that a nominal-scale variable with a certain nominal value is a special case of a NNV as specified above. CHAPTER IV ASSOCIATION MEASURES 4.1 Introduction: In clustering a set of N "objects" one usually deals with some type of numeric measure!” . The measure/)1 is assumed to be an abstract number with ratio scale. 
CHAPTER IV

ASSOCIATION MEASURES

4.1 Introduction:

In clustering a set of N "objects" one usually deals with some type of numeric measure μ. The measure μ is assumed to be an abstract number with ratio scale. It is also assumed that μ is a mapping from D x D into a subset of the real numbers, where D is the set of objects (data items). Based on the nature of the measure μ there are different terminologies available, such as similarity measure, closeness measure, neighborhood distance and dissimilarity measure. It is assumed that μ is a symmetric measure, that is, for all X, Y ∈ D, μ(X,Y) = μ(Y,X). For N data items there are N(N-1)/2 different pairs. Therefore, the function μ is defined such that it has at most N(N-1)/2 possibly different values corresponding to the set D. Let D1, D2, ..., DN be N different elements of a set of data items and denote the value of the function μ corresponding to the pair (Di, Dj) by μ(Di, Dj) = dij. The matrix

   S = | d11 d12 ... d1N |
       | d21 d22 ... d2N |
       |  .   .  ...  .  |
       | dN1 dN2 ... dNN |

is a symmetric matrix. Based on the nature of the measure μ, the matrix S might be called a similarity matrix, a dissimilarity matrix or a distance matrix. If dij is the distance between Di and Dj, or if dij is the dissimilarity between Di and Dj, then all diagonal elements of the matrix S are zero, that is, dii = 0 for i = 1,2,...,N.

Let S be a similarity matrix and t a threshold given for the set of data items D. It is assumed that any two elements of the set D, say Di and Dj, are similar if and only if dij ≥ t. In most cases, as long as it is known whether the magnitude of dij exceeds the threshold t, the exact value of dij is not important. In these cases it is more convenient to transform dij to a binary variable as follows:

   dij = 1 if dij ≥ t ,
         0 if dij < t .

If the dij's are binary variables, S is called a binary matrix. To store a binary similarity matrix only N(N-1)/2 bits are needed, where S is an N x N matrix. If S is not a binary matrix, N(N-1)/2 computer words are needed, where each word may have as many as 60 bits. Therefore, transforming dij to a binary variable is a significant way to save machine memory.

4.2 Association Measure Between Variables:

This section is divided into two parts. In the first part some different types of measures between numeric variables are discussed; in the second part some measures of association between NNV's are discussed. Note that the association measure between variables is used for the purpose of clustering as well as feature reduction. When an association measure is used for feature reduction, the term "correlation" is more meaningful for the measure between variables than the term "dissimilarity." For example, in a set of N data items let X = (x1, x2, ..., xN) represent the N values of the variable AGE and Y = (y1, y2, ..., yN) represent the N values of the variable INCOME. Seeking a correlation measure between X and Y is more logical than seeking a similarity measure between them.

4.2.1 Association Measure Between Numeric Variables:

It is assumed that all numeric variables involved in a clustering procedure are of interval scale (after solving the problem of mixed variables). Since this assumption has been made, it is sufficient to consider only the association measures between interval variables. Let X and Y be two n-dimensional column vectors given by X = (x1, x2, ..., xn)^T and Y = (y1, y2, ..., yn)^T.

a. The inner product of X and Y is defined by <X,Y> = X^T Y = Y^T X.

b. The Euclidian norm of the column vector X is defined by ||X|| = sqrt(<X,X>).

Define

   d(X,Y) = X^T Y / ( ||X|| ||Y|| )

and

   r(X,Y) = Σ_{i=1..n} (xi - X̄)(yi - Ȳ) / [ sqrt( Σ_{i=1..n} (xi - X̄)² ) sqrt( Σ_{i=1..n} (yi - Ȳ)² ) ] ,

where X̄ = (1/n) Σ_{i=1..n} xi and Ȳ = (1/n) Σ_{i=1..n} yi. The measure d(X,Y) is the cosine of the angle between the two column vectors X and Y, and r(X,Y) is called the moment correlation between X and Y.
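For reference, the two measures just defined can be computed with a few lines of Python. The sketch and the AGE/INCOME values in it are purely illustrative.

import numpy as np

def cosine_measure(x, y):
    # d(X, Y): cosine of the angle between the column vectors X and Y.
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

def moment_correlation(x, y):
    # r(X, Y): moment correlation between X and Y.
    xc, yc = x - x.mean(), y - y.mean()
    return float(xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc)))

age    = np.array([23.0, 35.0, 47.0, 59.0])    # illustrative values of AGE
income = np.array([12.0, 20.0, 31.0, 38.0])    # illustrative values of INCOME
print(cosine_measure(age, income), moment_correlation(age, income))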
If X and Y are rank vectors then the moment correlation [81] between them is defined by

   ρ(X,Y) = 1 - 6 Σ_{i=1..n} (xi - yi)² / ( n(n² - 1) ) .

The functions d, r and ρ all take values in the range [-1,1]. These three functions are among the most widely used association measures between variables. Anderberg (2), chapter three, is a good reference for further discussion of association measures between numeric variables.

4.2.2 Association Measure Between NNV's:

In chapter three, section 3.5, it was shown that an NNV can be represented by an N x M augmented matrix Xi = [x^j_ik], where x^j_ik = w_ik p^j_ik, w_ik is the kth component of the weight vector corresponding to the ith variable, and p^j_ik is the kth component of the distribution vector of the ith variable for the jth data item. With this representation in mind one might want to define some sort of association measure between two NNV's Xi and Xj. The problem is approached in two different ways: in the first approach an average-association measure is suggested, and in the second approach an exact association measure is given by using an M-dimensional distance function (to be defined).

Definition 4.1: Let S be the set of all n x M matrices and V be the set of all non-negative M-dimensional vectors. Let X = [xij] and Y = [yij] be any two members of S. A function Γ = (γ1, γ2, ..., γM) from S x S into V is called an M-dimensional distance function if

   a. for all X, Y ∈ S, Γ(X,Y) = (e1, e2, ..., eM) with ei ≥ 0 for i = 1,2,...,M;
   b. for all X, Y ∈ S, Γ(X,Y) = (0, 0, ..., 0) iff X = Y;
   c. for all X, Y ∈ S, Γ(X,Y) = Γ(Y,X);

where ei = γi(X.i, Y.i) for i = 1,2,...,M, each γi is a distance function, and X.i and Y.i are the ith columns of the matrices X and Y, respectively. If all of the distance functions γ1, γ2, ..., γM satisfy the triangle inequality, then Γ is called an M-dimensional Euclidian distance function or an M-dimensional metric. The norm of Γ(X,Y) is denoted by ||Γ(X,Y)|| = sqrt( Σ_{i=1..M} ei² ). If one or more of the γi's is a pseudo distance then Γ is called a pseudo distance.

Definition 4.1 can be modified to map S x S into the set of non-negative n-dimensional column vectors. In this case one has β = (β1, β2, ..., βn) and β(X,Y) = (b1, b2, ..., bn)^T, where bi = βi(Xi., Yi.) and Xi. and Yi. are the ith rows of the matrices X and Y, respectively.

Definition 4.2: Let A = [aij] be an N x M matrix. The row vector V = (v1, v2, ..., vM) with vj = (1/N) Σ_{i=1..N} aij is called the average row vector of the matrix A. The column vector U = (u1, u2, ..., uN)^T with ui = (1/M) Σ_{j=1..M} aij is called the average column vector of the matrix A.

Remark: To keep the measures between NNV's meaningful, an association measure is defined only between those variables which have ranges of the same size (partitions of equal order, or the same number of categories). For example, in Example 3.3 there is no association measure defined between the variables X and Y, because the range of X has three elements and the range of Y has four elements.

Definition 4.3: Let X and Y be two M-dimensional vectors X = (x1, x2, ..., xM) and Y = (y1, y2, ..., yM). Define

   σ(X,Y) = (1/M) Σ_{i=1..M} |xi - yi| .

The value of the function σ is called the mean deviation measure of X and Y. Define σ̂(X,Y) = minimum σ(Xp, Yp), where Xp and Yp range over the permutations of X and Y. The value of the function σ̂ is called the minimum mean deviation measure of X and Y.
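As a small illustration, the mean deviation measure and its permutation-minimized version can be written directly from Definition 4.3. The Python sketch below is illustrative; the input vectors are arbitrary, and the comment about permuting only one vector is an elementary consequence of the definition rather than a statement from the text.

from itertools import permutations

def mean_deviation(x, y):
    # sigma(X, Y) = (1/M) * sum |x_i - y_i|
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

def min_mean_deviation(x, y):
    # Minimum of sigma over permutations of X and Y.  Permuting both vectors
    # independently gives the same minimum as permuting only one of them,
    # so it is enough to permute y.
    return min(mean_deviation(x, p) for p in permutations(y))

x = (1/3, 1/3, 1/3, 0.0)
y = (1/2, 1/2, 0.0, 0.0)
print(mean_deviation(x, y), min_mean_deviation(x, y))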
It is easy to verify that σ is a metric. The first three properties of a metric are immediate consequences of the definition of σ, and the triangle inequality for σ reduces to M triangle inequalities of the form

   |xi - yi| + |yi - zi| ≥ |xi - zi| .

The function σ can be considered as a dissimilarity measure between two NNV's Xi and Xj when they are represented by their average row vectors X̄i and X̄j. When all components of X and Y are between zero and one, the range of σ is [0,1]. It must be mentioned that, when one uses either the average row vectors representing NNV's or the β function for finding an association measure between NNV's, one must be sure that the function defining the association measure is invariant under any permutation of the distribution vectors representing the NNV's.

An alternative association measure between NNV's can be introduced by means of the N-dimensional distance function β. To use β one needs to specify β1, β2, ..., βN. For example, let β1 = β2 = ... = βN = σ. This implies β(Xi, Xj) = ( σ(x^i_1., x^j_1.), σ(x^i_2., x^j_2.), ..., σ(x^i_N., x^j_N.) )^T, where x^i_k. = (x^i_k1, x^i_k2, ..., x^i_kM) is the kth row of Xi. When β is used to compute the association measure between NNV's it is usually assumed that β1 = β2 = ... = βN. This assumption makes it possible to compare the components of the distance vector β(Xi, Xj) = (e1, e2, ..., eN)^T. Since this research deals with association measures between data items, the discussion of association measures between NNV's is closed at this point. The association measure between Xi and Xj when they are represented by the average row vectors X̄i and X̄j is computed in the same way as when they are represented by the matrices Xi and Xj.

4.3 Association Measure Between Data Items:

A set of data items D is given by its three-dimensional (N-block) matrix D_w = [D^1_w : D^2_w : ... : D^N_w] or D = [D^1 : D^2 : ... : D^N]. The problem is to find some sort of association measure between data items. In this section a few M-dimensional distance functions are discussed. These distance functions, as well as many others, can be used as measures between data items with matrix representation. The algorithm for working with an M-dimensional distance function is set up so that it is independent of the particular distance function. Therefore, if later on, during practical use of the measures given in this section, one finds a more effective and/or practical measure, he has only to change the formula for calculating the distance matrix (to be defined later in this chapter).

Definition 4.4: Let X = (x1, x2, ..., xn)^T and Y = (y1, y2, ..., yn)^T be two column vectors of the set of all column vectors V. The function Mp from V x V into the set of reals defined by

   Mp(X,Y) = [ Σ_{i=1..n} |xi - yi|^p ]^{1/p}

is called the Minkowski metric for all p ≥ 1. The limit of Mp as p approaches infinity is called the Chebyshev metric and is defined by

   lim_{p→∞} Mp(X,Y) = M∞(X,Y) = max_i |xi - yi| .

For each value of p a different metric is obtained, but any metric with p > 2 is of little practical use in the area of cluster analysis. For p = 1 the metric is

   M1(X,Y) = Σ_{i=1..n} |xi - yi| .

For p = 2 the metric is

   M2(X,Y) = [ Σ_{i=1..n} (xi - yi)² ]^{1/2} ;

M2 is the familiar Euclidian distance function. The set of all points for which Mp = 1 is called the unit sphere of the metric Mp. The unit spheres of M1, M2 and M∞ for a two-dimensional and a three-dimensional space are shown in Figure 4-1.
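A direct transcription of Definition 4.4 into Python follows; it is an illustrative sketch only, and the numeric vectors are arbitrary.

import numpy as np

def minkowski(x, y, p):
    # M_p(X, Y) = ( sum |x_i - y_i|**p )**(1/p), for p >= 1.
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def chebyshev(x, y):
    # M_inf(X, Y) = max_i |x_i - y_i|, the limit of M_p as p grows.
    return float(np.max(np.abs(x - y)))

x = np.array([0.1, 0.4, 0.2])
y = np.array([0.3, 0.1, 0.2])
print(minkowski(x, y, 1), minkowski(x, y, 2), chebyshev(x, y))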
Let γ1 = γ2 = ... = γM = Mp. The association measure between the elements of the set of data items D is then given by

   Γ(D^k_w, D^s_w) = (e1, e2, ..., eM)^T ,

where

   ei = [ Σ_{j=1..n} | f^k_ji - f^s_ji |^p ]^{1/p} ,

the sum running over the n rows (variables) of the matrices. Recall that D^k_w = [ p^k_ij w_ij ], i = 1,2,...,n, j = 1,2,...,M, k = 1,2,...,N, where p^k_ij is the jth component of the augmented distribution vector of the ith variable for the kth data item, and w_ij is the jth component of the weight vector of the ith variable. For simplicity it is assumed that p^k_ij w_ij = f^k_ij. The special case M2 of Mp is used for the association measure between the data items D^k_w and D^s_w. In this case M2(D^k_w, D^s_w) is a dissimilarity vector between D^k_w and D^s_w and is given by M2(D^k_w, D^s_w) = (e1, e2, ..., eM)^T, where

   ei = [ Σ_{j=1..n} ( f^k_ji - f^s_ji )² ]^{1/2} .

Figure 4-1. Unit spheres of M1, M2 and M∞.

Note: When a set of data items is defined by both NNV's and numeric variables, the magnitude of each numeric variable should be normalized so that it can be compared to an NNV. One appropriate normalizing factor for a numeric variable X is f_X = 1/M_X, where M_X is the maximum of the N values of the variable X over the N data items D1, D2, ..., DN. Hereafter, it is assumed that every numeric variable involved in the definition of a data item is normalized by some factor f_X.

Definition 4.5: A p x q matrix A is called a blockwise symmetric matrix if there exists an r x r block partition T = [Tij] of A such that Tij = Tji for i = 1,2,...,r and j = 1,2,...,r. Each Tij is a submatrix of the matrix A.

Definition 4.6: Let an M-dimensional distance function Γ be defined on a set of data items. Let n be the number of variables and let D^i be the matrix representing a data item Di. Then

   Γ(D^i, D^k) = ( d^k_i1, d^k_i2, ..., d^k_iM ) = ( d^i_k1, d^i_k2, ..., d^i_kM ) = Γ(D^k, D^i) ,

where d^k_ij = γj( f^i_.j , f^k_.j ), f^i_.j is the jth column of the matrix D^i, and γj is the jth distance function of Γ. Since there are N data items, if one calculates Γ for all possible pairs of data items the result is an N x (N x M) matrix

   Δ = | d^1_11 ... d^1_1M | d^2_11 ... d^2_1M | ... | d^N_11 ... d^N_1M |
       | d^1_21 ... d^1_2M | d^2_21 ... d^2_2M | ... | d^N_21 ... d^N_2M |
       |   ...             |   ...             | ... |   ...             |
       | d^1_N1 ... d^1_NM | d^2_N1 ... d^2_NM | ... | d^N_N1 ... d^N_NM | .

The matrix Δ is called an N-block distance matrix. Any N-block distance matrix is a blockwise symmetric matrix. To illustrate definitions 4.5 and 4.6 one may consider the three-block matrix D_w of Example 3.3,

   D_w = | 1/30 2/30 3/30 0 | 1/20 2/20 0 0    | 1/40 2/20 3/40 0    |
         | 1/20 2/20 0    0 | 1/30 2/30 0 4/30 | 1/40 2/40 3/40 4/40 | .

Using γ1 = γ2 = γ3 = γ4 = M2 one has

   Γ(D^1_w, D^2_w) = (.024, .047, .10, .066) = Γ(D^2_w, D^1_w) ,
   Γ(D^1_w, D^3_w) = (.026, .066, .078, .10) = Γ(D^3_w, D^1_w) ,
   Γ(D^2_w, D^3_w) = (.026, .017, .105, .033) = Γ(D^3_w, D^2_w) ,

and

   Δ = | 0    0    0    0    | .024 .047 .10  .066 | .026 .066 .078 .10  |
       | .024 .047 .10  .066 | 0    0    0    0    | .026 .017 .105 .033 |
       | .026 .066 .078 .10  | .026 .017 .105 .033 | 0    0    0    0    | .

As one can see, Δ is a blockwise symmetric matrix, and each block is a 4-dimensional vector. If the distance functions (the γj's) are all the same, then the components of each row of the matrix Δ within each block are comparable. Let d^j_ik be the kth component of the jth block of the ith row of Δ. Then d^j_ik is the association measure between the data items Di and Dj based on the kth elements of the ranges of the NNV's. In the case γ1 = γ2 = ... = γM = M2, the d^j_ik's are the dissimilarity measures between Di and Dj.
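The construction of Δ can be sketched in Python as follows, storing the N-block matrix as an N x N x M array rather than as an N x (N·M) table; the array layout and the small sample matrices are assumptions made only for the illustration.

import numpy as np

def n_block_distance_matrix(items, p=2):
    # items: list of N weighted data-item matrices, each of shape (n, M).
    # Returns Delta of shape (N, N, M); Delta[i, k, j] is the Minkowski
    # distance of order p between the j-th columns of items[i] and items[k].
    N = len(items)
    M = items[0].shape[1]
    delta = np.zeros((N, N, M))
    for i in range(N):
        for k in range(N):
            diff = np.abs(items[i] - items[k]) ** p
            delta[i, k, :] = diff.sum(axis=0) ** (1.0 / p)
    return delta

# Three small illustrative 2 x 4 matrices (values are arbitrary).
D = [np.array([[0.03, 0.07, 0.10, 0.00],
               [0.05, 0.10, 0.00, 0.00]]),
     np.array([[0.05, 0.10, 0.00, 0.00],
               [0.03, 0.07, 0.00, 0.13]]),
     np.array([[0.03, 0.10, 0.08, 0.00],
               [0.03, 0.05, 0.08, 0.10]])]
print(np.round(n_block_distance_matrix(D), 3))

The resulting array is blockwise symmetric in the sense used above: the vector stored at position (i, k) equals the one stored at (k, i).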
The matrix Δ is a general case of a dissimilarity matrix: if M = 1, then Δ is the same N x N dissimilarity matrix that would be obtained with the SNR method. The storage and computation problem discussed in chapter three is apparent here from the size of Δ.

Definition 4.6 can be modified for the n-dimensional function β. In this case Δβ is an (n x N) x N matrix; its transposed form is

   Δβ^T = | b^1_11 ... b^1_1n | b^2_11 ... b^2_1n | ... | b^N_11 ... b^N_1n |
          | b^1_21 ... b^1_2n | b^2_21 ... b^2_2n | ... | b^N_21 ... b^N_2n |
          |   ...             |   ...             | ... |   ...             |
          | b^1_N1 ... b^1_Nn | b^2_N1 ... b^2_Nn | ... | b^N_N1 ... b^N_Nn | .

As one can see, Δβ is also an N-block distance matrix. Each block of Δβ is a column vector of dimension n. The element b^j_ik is the kth component of the association vector between the data items Di and Dj; in other words, b^j_ik is the association measure between Di and Dj in terms of the kth variable.

In Chapter Five the M-dimensional distance function Γ is used for computing an N-block dissimilarity matrix for a set of N data items, but one could use the n-dimensional function β, or any other appropriate function, to compute an N-block dissimilarity measure for a set of N data items. Note that Γ is not invariant under permutation of the categories of the NNV's; if one is to use Γ, one should heuristically fix the order of the categories of the NNV's.

To summarize this chapter, let Di and Dj be two data items represented by two n x M matrices D^i_w and D^j_w, respectively. To find the dissimilarity measure between Di and Dj two approaches are given. The first approach is to define (choose) γ1, γ2, ..., γM and compute Γ(D^i_w, D^j_w). In this case the dissimilarity between Di and Dj is given by an M-dimensional row vector whose kth component is the dissimilarity between the kth columns of the matrices D^i_w and D^j_w. The second approach is to choose β1, β2, ..., βn and compute β(D^i_w, D^j_w). In this case the dissimilarity between Di and Dj is given by a column vector of dimension n whose kth component is the dissimilarity measure between the kth rows of the matrices D^i_w and D^j_w.

CHAPTER V

CLUSTERING PROCEDURES

5.1 Introduction:

In developing the clustering block of the GOC system several changes in the representation of the data items and variables were desirable. The following is a list of some of the major changes, introduced in Chapters Three and Four.

a. The representation of a data item changes from an n-dimensional vector to an n x M matrix;
b. The similarity (dissimilarity) measure between data items changes from a single number to an M-dimensional vector;
c. The similarity (dissimilarity) matrix changes from an N x N matrix to an N-block matrix.

Because of these three major changes, data items defined by NNV's need a new type of clustering procedure. In this chapter two different clustering procedures are extended accordingly. One can employ these extended procedures for clustering a set of data items defined by NNV's as well as by numeric variables. Since most of the comparisons in this chapter are between vectors of the same dimension, the statement given below will help to clarify the ordering relation between vectors of the same dimension.

Statement: Between every two vectors V = (v1, v2, ..., vr) and U = (u1, u2, ..., ur) at least one of the following relations holds:

1. V = U iff vi = ui for i = 1,2,...,r;
2. V ≥ U iff vi ≥ ui for i = 1,2,...,r;
3. V ≤ U iff vi ≤ ui for i = 1,2,...,r;
4. U and V are not comparable.
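The comparison rule in the Statement can be expressed directly in Python. The sketch below is illustrative only, and the threshold and dissimilarity vectors used in the example are sample values of the kind used later in this chapter.

def compare_vectors(v, u):
    # Ordering relation between two vectors of the same dimension, following
    # the Statement above: "=", "<=", ">=", or "not comparable".
    if all(a == b for a, b in zip(v, u)):
        return "="
    if all(a <= b for a, b in zip(v, u)):
        return "<="
    if all(a >= b for a, b in zip(v, u)):
        return ">="
    return "not comparable"

T     = (0.50, 0.45, 0.40, 0.30)   # a threshold vector
gamma = (0.35, 0.15, 0.08, 0.08)   # a dissimilarity vector
print(compare_vectors(gamma, T))   # "<=" : the dissimilarity vector is within the threshold vector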
5.2 Geometric Representation of D and Δ:

Since D is a matrix and Δ is a set of Euclidian distances, a geometric representation of D and Δ helps to clarify the notation used in sections 3.4 and 3.5. Consider two data items Di and Dj. One has D^i = [f^i_kr] and D^j = [f^j_kr], k = 1,2,...,n, r = 1,2,...,M. The distance between Di and Dj is given by

   Γ(D^i, D^j) = ( d^j_i1, d^j_i2, ..., d^j_iM ) = ( d^i_j1, d^i_j2, ..., d^i_jM ) ,

where d^j_ir = d^i_jr for r = 1,2,...,M and

   d^j_ir = [ Σ_{k=1..n} ( f^i_kr - f^j_kr )² ]^{1/2} .

Each data item Di can be represented by M points in an n-dimensional space; call these M points P^i_1, P^i_2, ..., P^i_M. The M components of the vector Γ(D^i, D^j), the jth block of the ith row of Δ, are the Euclidian distances between the points P^i_k and P^j_k for k = 1,2,...,M. Figure 5-1 is the geometric representation of D and Δ of Example 3.3. There were three data items of two variables and M was 4; therefore, each data item is represented by four points in a 2-dimensional space. In Figure 5-1 a set of four x's represents D1, a set of four +'s represents D2, and a set of four o's represents D3. The lines between corresponding points are the elements of Δ, as shown by arrows.

Figure 5-1. Geometric representation of D and Δ of Example 3.3.

5.2.1 Hypersphere Representation of Di:

Let P^i_1, P^i_2, ..., P^i_M be M points in an n-dimensional space and let D^i = [f^i_jk] be the matrix representation of Di. One has P^i_k = ( f^i_1k, f^i_2k, ..., f^i_nk ) for i = 1,2,...,N and k = 1,2,...,M, so each P^i_k is a point in an n-dimensional space. Define μ^i_j = (1/M) Σ_{k=1..M} f^i_jk; then the vector μ^i = ( μ^i_1, μ^i_2, ..., μ^i_n ) is the row mean vector of the matrix D^i. One can represent the data item Di by the smallest hypersphere Si containing all the points P^i_k, with center at μ^i. Figure 5-2 is the hypersphere representation of the data items of Example 3.3. The hypersphere (Si, μ^i) is called the hypersphere representation of the data item Di.

Figure 5-2. Hypersphere representation of the data items of Example 3.3.

5.3 Clustering the Set D by Transforming Δ to an N x N Dissimilarity Matrix:

The problem of transforming Δ to an N x N dissimilarity matrix is approached in two different ways: the first approach uses a threshold vector T, and the second approach does not.

A. Using a Threshold Vector: In this approach it is assumed that a threshold vector T = (t1, t2, ..., tM) is given. Two data items Di and Dj are said to be similar if all components of the vector Γ(D^i_w, D^j_w) are less than or equal to the corresponding components of the given threshold vector T, that is, Γ(D^i_w, D^j_w) ≤ T. As long as it is known whether d^i_jk ≤ tk or d^i_jk > tk, the magnitude of d^i_jk is not important to the clustering procedure. Considering this fact, it is possible to convert Δ to a binary matrix Δb = [b^i_jk] as follows:

   b^i_jk = 0 if d^i_jk ≤ tk ,
            1 if d^i_jk > tk .

The matrix Δb can in turn be transformed to an N x N binary matrix Db = [bij]:

   bij = 1 if Σ_{k=1..M} b^i_jk > M/2 ,
         0 if Σ_{k=1..M} b^i_jk < M/2 .

This means that if the number of 1's in the jth block of the ith row of Δb is greater than the number of 0's in that block, then bij = 1; otherwise bij = 0. By this method of transforming Δb to Db, two data items are called similar if they are similar in more than half of the categories.
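The two transformations just described can be sketched in Python as follows. The block-matrix values are those of Example 3.3 above, while the threshold vector is an assumed illustrative value.

import numpy as np

def binarize_blocks(delta, T):
    # b[i, j, k] = 1 when the distance between Di and Dj in category k
    # exceeds the threshold t_k, and 0 otherwise (as in the text above).
    return (delta > np.asarray(T)).astype(int)

def block_vote(delta_b):
    # D_b[i, j] = 1 when more than half of the M categories exceed the
    # threshold; D_b[i, j] = 0 means Di and Dj count as similar.
    M = delta_b.shape[2]
    return (delta_b.sum(axis=2) > M / 2).astype(int)

delta = np.zeros((3, 3, 4))
delta[0, 1] = delta[1, 0] = [0.024, 0.047, 0.10, 0.066]
delta[0, 2] = delta[2, 0] = [0.026, 0.066, 0.078, 0.10]
delta[1, 2] = delta[2, 1] = [0.026, 0.017, 0.105, 0.033]
T = [0.03, 0.03, 0.09, 0.05]          # hypothetical threshold vector
print(block_vote(binarize_blocks(delta, T)))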
An alternative way of transforming Δb to an N x N binary matrix Db, which is more reasonable but idealistic, is the following: bij = 1 if there exist one or more 1's in the jth block of the ith row of Δb, and bij = 0 otherwise. In this alternative two data items are said to be similar only if they are similar in all of the categories.

B. Without Using a Threshold: In this approach it is assumed that no threshold vector is given. To transform Δ to an N x N dissimilarity matrix one can define A = [aij] by

   aij = Σ_{k=1..M} d^i_jk    for i = 1,2,...,N and j = 1,2,...,N,

and define A' = [a'ij], where a'ij = (1/M) aij. The matrix A' is called the average dissimilarity matrix; it is an N x N symmetric matrix, and the element a'ij is the average dissimilarity measure between the data items Di and Dj. For example, the dissimilarity matrix corresponding to Example 3.3 is

   Δ = | 0    0    0    0    | .024 .047 .10  .066 | .026 .066 .078 .10  |
       | .024 .047 .10  .066 | 0    0    0    0    | .026 .017 .105 .033 |
       | .026 .066 .078 .10  | .026 .017 .105 .033 | 0    0    0    0    | ,

and the average dissimilarity matrix corresponding to this Δ is

   A' = (1/4) A = (1/4) | 0    .237 .270 |
                        | .237 0    .181 |
                        | .270 .181 0    | .

One can use A or A' as the dissimilarity matrix for the set of data items D and cluster the set D on that basis. There are a number of methods available for clustering a set of data items based on an N x N dissimilarity matrix (see section 1.2 for references); also, the algorithm given in the next section can be used to cluster any set of data items represented by an N x N dissimilarity matrix. Note that some clustering algorithms require recalculation of the dissimilarity matrix in each iteration; if one uses such an algorithm, he must note that to recalculate the N x N matrix A it is necessary to recalculate Δ.

The problem with using the matrix A (or A') as the dissimilarity matrix is the chance of generating invalid clusters in some special cases, because the dissimilarity measures are summed (or averaged) over the categories of the NNV's. Consider a particular case in which a set of four data items is represented by a 4-block dissimilarity matrix in which the (D1,D2) block is (.25, 0, .75, 0), the (D1,D3) block is (0, .75, .25, 0), and the average matrix corresponding to Δ is

   A' = | 0    .25  .25  .27 |
        | .25  0    .25  .30 |
        | .25  .25  0    .27 |
        | .27  .30  .27  0   | .

By looking at Δ one can see that the dissimilarity relation between the pair of data items (D1,D2) is totally different from the one between the pair (D1,D3). But the average matrix A' says that the dissimilarity relations between the pairs (D1,D2), (D1,D3) and (D2,D3) are exactly the same. In other words, it is very likely for a clustering algorithm to put D1, D2 and D3 into one group as three similar data items (using A or A' as the dissimilarity matrix). Because of this problem, the following necessary condition, which is based only on some experiments, is suggested for the use of an average dissimilarity matrix (or the matrix of the sums of the dissimilarity measures) in clustering a set of data items.

Necessary Condition: Calculate the standard deviation of each 1 x M block of Δ. There are N(N-1)/2 such standard deviations. Let σij be the standard deviation of the jth block in the ith row and μij the mean of this block.
o g 2:: 1 fi' 1’” k=l djk 0’“, g J70 kZ=1( djk) ,0“ i d,k- ff,’ , Aij a l/M Max k J 1 for i=1,2,...,N and j=l,2,...,N. Do not use the matrix A or A' for the purpose of clustering 1f there exist one or more 0/1]. such that 0Jij>)ij for 131.2.oo-aN and 5:142»--°'N° If it should happen that f0 0 o r some cases one has ggj {Aij for all 1's and 3's then one m1ght use A or A' as a dissimilarity matrix for clustering purposes. ' 7" 62 5.4 General Centroid Method (GCM): In the general centroid method it is assumed that the dissimilarity between each pair of data items is given by an M-dimensional row vector. The ith component of this vector is the dissimilarity between the pair of data items associated with the ith categories of the NNV's. It is also assumed that each data item D1 is represented by an n x M matrix D3. The dissimilarity matrix of these data items is an N-block matrix Note that if there is no weight vector corresponding to the range of a particular NNV then the weight vector for this NNV is assumed to be a vector with all components equal to 1. In this method it is assumed that either a fixed threshold vector T = (t1,t2,...,tM) is given for the clustering purposes or a vector T will be calculated in each iteration of the algo- rithm. If there is no fixed threshold vector given then one can define an appropriate formula to calculate a threshold vector T when it is necessary to be calculated. The following is a list of some apprOpriate formulas for computing T: j=l,29000,Nr be the 1. Let Ar =[d}k] , i=1,2,...,Nr, Nr-block dissimilarity matrix at the rth iteration of the algorithm. Define Nr N , 2 Z: Z: ah , k=l,2,...,M. tk =“‘—“" i=1 j>i Nr(Nr-l) Then T=(t1,t2,...,tM) is a threshold vector computed for the rth iteration of the algorithm. 2. Define tk for k=l,2,...,M to be the median of the numbers: 63 1 -1 1 2 2 2 ‘ i N d3k,eoo,dNrk , dSk,d4k,...’dNrk,...,di+1k,...,dNrk,...,dN;k 1 d2k’ The vector T=(t1,t2,...,tM) is a threshold vector computed for the rth iteration of the algorithm. 3. For 0 (q {I define tk to be the smallest number such that q*(Nr(Nr-l))/2 elements of the numbers dur" l 1 Nrk 2 2 2 ... 4‘ dgk’ d3k,oom'd§rk , dSkpd4k3000'dNrk30003di*1k, . Nth, ’ are less than or equal to tk and (l-q)(Nr(Nr-l))/2 of them are greater than tk. A special case of this could be for q = I. In this case t would be defined as t = Maximum (d3 ) k k . Jk 1:1(Nr j‘>i The initial groups of the data items consist of N groups, one data item per group. Assume Gi and Gj are two groups of data items. These groups are said to be similar if and only if i . I7(c,,.c,’,) (T. If during the operation of the algorithm it was decided to combine two groups Bi and Gj into one group,Gk, then the matrix representation of this new group is given by G: = (63 + 633/2 , where GS and G; are the matrix representationsof the groups Ci and Gj, respectively. At the end of each iteration of the algorithm , if there are some new groups generated,then the following items must be updated and/or recalculated: l. the total number of groups; 2. the matrices representing the new groups; 3. the dissimilarity Nr-block matrix 64 5.4.1 General Centroid Algorithm (GCA): The algorithm for the GCM based on an N-block dissimilarity matrix consistsof seven steps. Note that a formula for computing a threshold vector for the algorithm and a formula for computing the dissimilarity measure between data items are given as a part of the input to the algorithm. 
The steps of the algorithm are as follows: Step l-- Compute the dissimilarity N-block matrix ll. Compute the threshold vector T if it is not given as input. Step 2-- If there is no threshold given go to step 3. Otherwise consider those pairs of the total (g) pairs of the groups of data items which satisfy the condition /7(G$,G;)§'T, where Gi and Gj are two groups of data items and G: and G; are their matrix representations,.L§t‘ Q1 3 (Gil’cjl) 2 Q2 = (Gizscjz) 280's QC = (Gicscjc) be those pairs which satisfy the above condition. Every one of these pairs is a candidate pair for being combined into one group. If -c = 0 then go to step 7 , otherwise go to step 3. Step 3-- Assign a number qk to each pair Qk = (Gik’ij) as follows: M . = 4 i qk E; thr , k . where dj r 15 the rth component of the dissimilarity vector k between groups Gik and Gj . It could happen , although very k seldom , that qk=qk, for some k # k'. If this should happen replace qk by q; as follow: r + 1 'f . . . . éqk __10’V+1 1 Slk+ SJk> Slkl+SJlU 0 qk 3 l qk - 7+1 - 10 if s. + s. (sip. Sj \ 1k Jk k‘ ) 65 th where Sr is the sum of the elements in the r row of the matrix A , and Y is the maximum number of digits to the right of the decimal point in the qk's after rounding them. If Sik+sjk is - + S. then one should reconsider the matrices lkl ka representing the data items in the pairs Qk and Qk" equal to 8 Step 4-— Rank the pairs Q1,Q2,...,Qc according to their weights,qk's. Sort these pairs according to their ranks. Step S-- Mark the pair with the smallest weight(or rank). If Qk=(Gik’ij) is a marked pair,then Gi and ij should be k combined into one group. If no threshold vector is given go to step 6 . Otherwise delete all pairs which have either Gik or ij as one of their elements from the list of sorted pairs. Repeat step 5 until every remainder pair in the sorted list is marked. Step 6-- Go to step seven if either all data items are assigned into one group or if the number of groups is not greater than a previously defined number , provided such a number is given. Otherwise , calculate the matrices for the new groups , set the new number of groups to N , and go back to step I. Step 7-- Print all the information about the generated groups (clusters) and stop the algorithm. Example 5.1: Let D=iD1,D2,...,D6} be a set of six data items to be clustered. The initial groups are GI: {D1{, Gz=fD2‘, GS8 303%: G4: §D4l , GS= {05}, G6: {D65 and the matrices representing these groups are: 66 11 0 0 0 r2/3 o 0 0 ‘ r5/6 0 0 0 ci=1/3 2/3 0 0 63: 1/4 3/4 0 0 cg= 1/3 1/3 1/6 0 0/3 0 1/3 1/3 Ll/4 1/8 1/4 1/9 3/4 0 1/4 0) f '4 F '1 { '1 1/5 0 0 0 3/5 0 0 o 1 0 o 0 03= 0 1 o 0 ,0: = 0 1/2 1/4 0 ,66 = 1/2 1/2 0 0 W L1/2 0 1/2 0 B/z 0 0 1/2 1/2 1/4 0 0) J J L The distance functions X , 3%,..., XM all are the same and they are the Euclidian distancexin an n-dimensional space. Since.A is a blockwise symmetric matrix,one need only store N(N-1)/2 blocks ofll. In this example N is 6 and M is 4 ,therefore,.A can be stored in an array of two dimensions. The first dimension is equal to N(N-1)/2 = 15 and the second dimension is equal to M=4. 
Call this array DELTA(15.4)' Applying step 1 of the GCA: the initial dissimilarity matrix is computed and stored in DELTA(I,J) as follow: ,3 1 2 3 4 1 J 1 2 3 r 4 1 .35 .15 .08 .08 11 .48 .17 .26 .50 2 .45 .33 .19 .33 12 .33 .39, .30 .00 3 .38 .33 .17 .33 13 .40 .50 .56 .50 4 .55 .17 .42 .17 14 .94 .56 .50 .00 r 5 .24 .30 .33 .33 15 .64 .25 .25 .5& 6 .53 .44 .17 .25 7 .59 .28 .25 .25 8 .36 .28 .35 .25 9 .49 .28 .25 .25 10 .76 .66 .30 .00 DELTA(I,J) 67 Let a fixed threshold vector T = (.50, .45, .40,.30) be given. Applying step 2 of the GCA: Q1=(Gl,GZ), QZ=(GZ,G6) , Q3=(63,G6), and Q4=(GZ,GS). Applying step 3 of the GCA: q1=.66,q2=l.27 , q3= .93 , and q4= 1.24. Applying step 4 of the GCA Q1 and 02 are marked and Q2 and Q4 are deleted from the list. the matrix representation of the Applying step 6 of the GCA new groups HI and H2 are: E/6 0 o '0 1 1 = - 2 HH 7/24 17/24 0 0 .(0H + Gw)/2’ [7/24 1/16 7/24 7/24J p n 11/12 0 o 0 [ 2 3 6 NW = 5/12 5/12 1/12 0 = (cw+cw)/2. 5/8 1/8 1/8 0 F .3 The number of new groups ,N,fis 4 and the groups are: G =H1=101,02§ , 82: {03,06§ , 03:5041, c4={05} . l The second time through the algorithm , the new matrix A is given by DELTA(I,J) as below: 3 1 2 3 4 3 1 2 3 4 1 3. .37 .30 .19 .29 4 .84 .60 .37 .00 ;Z3 .30 .21 .29 5 .54 .15 .21 .50 .43 .22 .38 .21 6 .40 .50 .56 .50 DELTA(I,J). There is no change in the threshold T. The candidate pairs are Ql=(Gl,Gz) and Qz=(G1,G4). The marked pair is Q1 and the matrix representation of this new group is given by: 68 21/24 0 o 0 7 H: . 17/48 27/48 1/24 0 = (61+Gz)/2. 11/24 3/32 5/24 7/48 , The groups are G1 2 H1 ={Dl’DZ'DS'D6g ; stiD4] and G3 I‘DS]. The new number of groups is 3.The third time through the algorithm c would be zero and the procedure of clustering has to stop at step 7 of the algorithm. Corresponding to the given threshold vector T the final clusters are the groups generated at the second iteration of the algorithm. 5.4.2 Analysis of the General Centroid Algorithm: The five important features of an algorithm should be demOnstrated for the GCA in order to call it an algorithm. These five features are as follows: 1. Finiteness: The GCA is a finite procedure and it will stop after (§)-l iterations at NOSt- The maximum number of times that step 2 can be executed before going to step 7 is N(N-l)/2. Each time through the algorithm either there is no candidate pair or there is at least one candidate pair. If there is no candidate pair the algorithm will halt. The worsi case occurs when every time through step 2 there is only one candidate pair. Each time through step 5 at least one group will be marked. The initial number of groups is N(N-l)/2 and each time through step 5 the number of groups reduces at least by one. Therefore, after at most N(N-l)/2 -l iterations the number of groups is one and the GCA will halt at step 7. Each individual step of the GCA is finite and there is no infinite loop in any of the seven steps of the GCA. The only step which is 69 repeated in each iteration is step 5. Let there be c pairs included in the sorted list of step 4. The maximum number of times that step 5 might be repeated is c; and this happens only if all c candidate pairs are disjoint. 2. Definiteness: Step I is the calculation of the dis- similarity matrix‘A and the threshold vector T. Step 2 is the compariSon~ of at most (g) vectors of dimension M with the vector T. 
The result of the comparison is equal, less than , greater than , or not comparible , and one should keep record of those comparisons which result in less than or equal. Step 3 is the calculation of the sum of the dissimilarity mea- sures for each candidate pair of groups. Step 4 is the ranking and sorting of a set of c pairs which have c different weights. Steps 5, 6, and 7 are very well defined and their definiteness is obvious. 3. Input: There are obviously some inputs to the GCA such as , a set of N matrices each one representing a single data item, a threshold vector T (or a formula for calculating T) , a formula for calculating the dissimilarity measures between each pair of data items , and N , M , and n. 4. Output: The output of the GCA is a set of M0 clusters along with 311 relevant information about each cluster. 5. Effectiveness: All of the operations which are to be performed in the GCA are sufficiently basic that they can be done exactly and in a practical (not mathematical) finite length of time by a person familiar with the operations using pencil and paper. The GCA uses some basic operations 70 such as summation, subtraction, multiplication, comparison, and square root. 5.4.3 Computational Limit and the Problem of Storage: Given N and M, the dissimilarity matrix A needs M (N(N-l)) computer words to be stored. A normal size problem might have as many as 400 data items and M usually is between 3 and 10. For M85 and N-200 the amount of memory needed to store.Ais 99,500 words which is beyond the capacity of the average size existing computers. Because of this problem the Stored Matrix Approach is used for programming the GCA. The dissimilarity matrix/swill be stored on disc or on some other storage device and it will be read as needed. The amount of memory needed for programming the GCA consists of two two-dimensional arrays DATAl(n,M) and DATA2(n,M) , a one-dimensional array DISV(M) , and possibly some other small arrays as they are needed for programming purposes. The calculation limit of the GCA is very much problem- dependent and is especialy dependent on the threshold vector T. For fixed N, n, and M one might change T so that it makes the same problem twice as long.The number of basic computations which are required for the initial iteration of the GCA could be given as below. Given the Euclidian distance as the dissimilarity measure between data items and a fixed threshold vector T ,the number of basic computations needed in step 1 are: (N(N-l)/2)[n+(M-l)(n-s)] subtractions (s is the number of numeric variables) , the same number of multiplications, 71 M*(N(N-l))/2 square roots and a large number of input/output operations. At step 2 there are M*(N(N-l))/2 comparisons plus some input/output operations. At step 3 there are M a c summa- tions and perhaps some additional summations when two weights become equal. Step 4 is a sorting assignment for a list of c elements. (there are tables available for computational time of a binary sort for different number of data.) Computation at step 5 is a search through a sorted list. This search might be done as many times as : 2(c—l)+2(c-2)+....+2= 2(c-l)(c-2)/2 = (c-l)(c-2). Steps 6 and 7.do not require many computations. As mentioned before , the number of computations at the non- initial iterations.of the GCA are dependent on’the threshold vector T. Any estimation of the computation limits of the GCA at non-initial iterations without considering the effect of T would be a rough guess. 
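As a pointer to how the merging rule of the general centroid method can be realized, the following much-simplified Python sketch shows only two operations described in section 5.4: selecting candidate pairs against a threshold vector and averaging the matrices of two merged groups. It is an illustration under assumed array shapes and values, not the full seven-step GCA.

import numpy as np

def candidate_pairs(delta, T):
    # Pairs (i, j) whose whole dissimilarity vector is <= the threshold vector T.
    N = len(delta)
    T = np.asarray(T)
    return [(i, j) for i in range(N) for j in range(i + 1, N)
            if np.all(delta[i][j] <= T)]

def merge_groups(Gi_w, Gj_w):
    # Matrix representation of the combined group: G_k = (G_i + G_j) / 2.
    return (Gi_w + Gj_w) / 2.0

# Hypothetical 2 x 4 group matrices (n = 2 variables, M = 4 categories).
G1_w = np.array([[0.10, 0.00, 0.00, 0.00],
                 [0.03, 0.13, 0.00, 0.00]])
G2_w = np.array([[0.07, 0.00, 0.00, 0.00],
                 [0.03, 0.15, 0.00, 0.00]])
print(merge_groups(G1_w, G2_w))

delta = np.zeros((2, 2, 4))
delta[0, 1] = delta[1, 0] = [0.35, 0.15, 0.08, 0.08]
print(candidate_pairs(delta, [0.50, 0.45, 0.40, 0.30]))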
5.4.4 Some Special Cases: As anspecial case consider M=l,that is, there is no NNV. In ' other words, consider the case where the dissimilarity matrix is an NxN matrix. In this case T is a threshold number. Step 3 of the GCA is not needed for this special case exept! for tie breaking scheme:.The other steps of the GCA stay the same for this special case. As another special case consider the situation where neither a threshold vector is given nor a formula for finding it. In this case step 2 of the GCA is redundant, that is, all N(N-l)/2 pairs are candidate pairs. For this special case, 72 in each iteration of the algorithm there are only two groups and always two groups to be combined into one group. Therefore, in each iteration of the algorithm the number of clusters reduces by one. A third special case occurs 3 when the dissimilarity' matrix is an NxN matrix and there is no threshold vector given. The GCA as modified for the above two special casescan be used to cluster this third special'case. Let D be a set of data items and A or A' as defined in section 5.3.B. The GCA can be used to cluster the set D with the dissimilarity matrix A. As an example, consider the data items of Example 5.1. The matrix A corresponding to these data items is as follows: (Only the elements of the upper triangle are Printed‘) F0 .66 1.30 1.72 1.30 1.20‘ 0 1.39 1.37 1-24 1.27 A a 0 1.73 1.41 .93 O 1.96 2.00 0 1.64 L 0 J ' At step 5 the pair (D1,D2) is marked and at step 6 the number of new groups is S. The new matrix A corresponding to these fi . [aye groups 61 {01,02} , czafnsl , 03-{04§, c4afnsgand 85={ng§ {o 1.39 1.17 1.24 1.191 0 1.73 1.41 .93 0 1.96 2.00 0 1.64 0 73 At this time the pair (63,66) is marked. The number of new groups is 4 and the new matrix A corresponding to these four groups GI ={D1,DZ§ , G2 '{03’06} , G3 8104fand G4 ={D5} is: ’0 1.15 1.47 1 24, 0 1.72 1.40 0 1.96 0 J. At this time the pair (61,62) is marked. The number of new groups is 3 and the new matrix A for these groups GI: {61,621- SDI’DZ’DS’D6; , 62 3‘04; and Gs .{Ds} is 0 1.65 1.217 A = 0 1.96 0 . L . At this time the number of groups is 2 and the groups are: 81 ={01,02,03,05,06§ and 62 ={04g . Next time there will be only one group and the algorithm will stop at step 7. The remainder of this chapter is a different approachto the problem of clustering a set of N data item with an N-block dissimilarity matrix A.. In the GCM the dissimilarity matrix.A is to be recalculated every time repeating step 1 of the GCA. The approach discussed in the following sections has the advantage of calculating A only once for the whole clustering procedure. 74 5.5 Graph Representation of D andll: In this section a new type of graph is introduced and some of the properties of this graph.are discussed.This new type of graph is used to represent a set of data items D along with its dissimilarity N-block matrix A . Definition 5.1: A graph G= is called a weighted N-partite graph (or an N-partit: weighted graph) if there is a weight corresponding to every edge of the N-partite graph G. The mean of all weights of the edges of a source is the weight of that source. Definition 5.4: An N-partite graph G = <13 Si’ P> is called a uniform N-partite graph if: 1:1 1. the number of nodes in every source is the same, that is, [51' = M , i=1,2,...,N; 2. each node has exactly N-l branches (edges); and 3. there are exactly M edges between every two sources. 7S PropositionS.l: The number of edges of a uniform N-partite graph with source size M is M(N(N-l))/2. 
Proof: There are (g) = N(N-1)/2 pairs of sources and by (3) in definition 5.4 there are H edges corresponding to each pair of sources. QED. In Figure 5-3’(a) is a bipartite graph and‘(b) is a complete bipartite graph. A complete bipartite graph is usually denoted by Kr,s , where r and s are the size of its two sources. In Figure 5-3,(c) is a S-partite graph, (d) is a weighted 4-partite graph, (e) is a uniform 4-partite graph and (e') is a uniform weighted 4-partite graph. A Definition 5.5: Let G = ('3 G1 , P)>be a uniform weighted i=1 N-partite graph and G1 =f gil’giZ""’giMx and Gj=fgj1,gjz,.., J5”) be two sources of G. Let "(gik'gjk) = w be the weight k of the edge (gik’gjk)’ The weightM W G-,G- = 1 M . , , c 1 J) / 5;} wcglk ng) is called the average weight between two sources Gi and Gj. N Definition 5.6: Let G =< U Gi , P) be a uniform weighted i=1 N-partite graph. Construct a complete weighted graph ’Ga’ as follows: 1. The nodes of G8 are the sources of G; 2. The weight of each edge of G3 is the average weight of its corrsponding nodes. The graph G8 is called an average weighted graph corresponding to the graph G. Note that there are N-l average weights corresponding to each source of a uniform N-partite weighted graph G. But the total number of average weights of G is N(N-l)/2. 76 Figure 5-3. Bipartite, k-partite, Uniform k—partite, Uniform weighted k-partite and Average Graph. Figure 5-3 (continued) .0/7 “.‘I 026 v/US 033 o 81 Fi- “1'8 5’4. Graph representation of D andA. 78 Figure 5-3 (f) is the average graph of Figure 5-3 (e'). Let D =fD1,D2,...,DN§ be a set of N data items. Let Di be represented by M points Pi,P;,...,P; and A‘be the dissimilarity N-block matrix of the set D. Considering D and A»as defined above,the following theorem can be stated. Theorem 5.1: Let G = <3 Di , P), where p ”i (pimp [1,45 , #13:...»1 , j=l,2,...,N, k=l,2,..,M}. Then G is a uniform N-partite graph. Proof: By definition of P there are no edges between points in Di for i=1,2,...,N. Each point of Di has only one branch or edge to every one of the sources D1,D2,...,Di_1,Di+1,...,DN. And the fact that every source has exactly M points (nodes) completes the proof of the theorem. QED. The set of data items D could be represented by the graph G of theorem 5.1. Let dik be the weight of the edge (P:,Pi) of the graph G.Then G is a uniform N-partite weighted graph and is the graph representation of both D andTA.. For example, the graph representation of D and A of Example 3.3 is given in Figure 5.4 (a). Figure 5-4 (b) is the average graph of Example 3.3. 5.6 Egclustering Based on Graph Representation of D and A : The term "k-cluster" was introduced by Ling [41]. He proposed a method of clustering based on an NxN dissimilarity matrix. The method was called the k-clustering method. In this section the k-clustering terminology is used for a clustering method which deals with an N-block dissimilarity matrix4A. This method is based on graphical representationsof D and4A. Most of the terminologies used in this section are taken from Ling 79 but are defined for the UNWG's. Let D be a set of N data items and.A,be the N-block dissi- milarity matrix corresponding to D. There is a one-to-one corress- pondence between P of theorem 5.1 and A». Therefore, it is possible to replace P by A in theorem 5.1. Corresponding to each d? for i f j there is one edge between two sources Di and 3k Dj' Therefore, throughout this section N c= and G2 = (B,Q)be two graphs: 3. GI = 9) G9 A = 9) ; b. GIUG2 = (AflB, PflQ); c. 
610 62 =(AUB, PUQ); d. Hcc;l e: n = (c,R>,cc:A and RCCxA Note that PIIQ is the set of all edges which have their nodes in Ale. Similarily, PUQ is the union of all edges of both graphs, G1 and 62. In order to extend definition 5.7 to weighted graphs one can define the set of edges of a weighted graph 61 as follows: (a,b,w) is an edge of a weighted graph G1 only if (a,b) is an edge of G1 with a weight equal to w. For example, the weighted graphs of Figure 5-5 could be represented as follows: 61 = {A,P>', A= {a,b,c,d,e} , P =‘{(a,b,l)(a,c,5), (a,e,2),(b,d,7),(a,d,9)} and (:2 o and G = £_ for every s;urce Y of 61' There is no r-chain for: A1 and A2 , two sources of G4, satisfying (5-3) . In other words, rc(A1,A2))d£ for all r-chains in G4. QED. 83 Proposition 5.4: Let G1 =‘be a URI" r-chained sub- graph and G2 = (A2,P2) be a Ukzw r-chained gubgraph of G. If Glf)Gz # D then GIU 62 is a Uk3W r-chained subgraph of G. Pooof: Let 51(162 = (A3,P3) and A3 # ¢. Given any two sources X and Y of GlU 62 there are two casec: l. X and Y are both sources of G1 (or GZ)' In this case rc(X,Y)g'£ by the fact that G1 (or Gz) is r-chained. 2. X is a source of G1 and Y is a source of 62' In this case let 2 be a source ofGlnGz. One has rc(X,Z)(_1; and rc(Y,Z)s:£ by the fact that G1 and 62 are both r-chained graphs. Let X=Xl,X2,..., Xn=Z “and Z=Y1,Y2,...,Yn=Y u: two b chains connecting X to Z and Z to Y, respectively. The chain X=Xl ,Xz, ..., X.=Z=Y1, Y2,...,Yn=Y will connect X to Y and one has {rc(X,Z)s£ and rc(Z,Y){_r_}=g,rc(X,Y) So. therefore, 51 U G2 is r-chained. QED. Definition 5.10: A UsWSG of G , say H= ,is called (k,r)-bounded ,k)»0, if for every source A in X there exist a UkWSG F =, a UkiWSG of G is called a (k,r)-c1uster if r is the minimum value for which Gi is (k,s)-chained for some 5 and G1 is maximal (k,r)-chained. A subgraph of G which is a (k,r)-cluster for some r is called a k-cluster. Proposition 5.7: Not every UsWSG of G , s>.l, is a (k,r)-cluster for some r. Pooof: It is an immediate result of definition 5.13. Proposition 5.8: Let S be a (k,r)-c1uster and S' be a (k,r')-c1uster of G. a. r=r' =geither S = S' or SnS' = Q. B. Proof: a) Let S i S' and SITS' = X # D. This implies b. r')r=9either SCS' or 808' SUé? is (k,r)-chained ,by proposition 5.4, and this implies S is not maximal (k,r)-chained which is a contradiction. Therefore , if S # S' then S and 8' must be disjoint. 85 b) Let s¢s' and 505' = x ¢ ¢ . If s=s' then this implies r=r' which contradicts the fact that r')»r. If SIDS' then SUS' is a (k,r)-chained and this implies S (or S') is not the maximal (k,r)-chained UsWSG of G which is a contradiction. Therefore, If sgts' then S and 8' must be disjoint. QED. Theorem 5.3: If T is a nonempty maximal (k,r)-bounded UpWSG for a fixed integer k then T is uniquely partitioned into m (k,r')-clusters for some r'{'r. Proof: Let X1,X2,...,X be p sources of T. The fact that P T is (k,r)-bounded implies, for each Xi,there existssome UkWSG Ti of T with exactly k sources (excluding Xi) such that N(Xi,U)£{£ for all U being a source of Ti‘ ,(5-4) If there are more than one such Ti let r; be the minimal value of r satisfying (5-4) and T; be the corresponding UkWSG of T. Let i871 Téi # D for si€(l,2,...,p). Then 65 =(igl Xsi,Q> is a maximal (k,r')-cluster, where r' = Max(r;l,r' ,...,r' ) and q is the number of all such T'i's. The uniqueness of Ti's implies the uniqueness of the Cs. QED. 
It is clear from the above definitions, propositions and theorems that the clustering problem is only the problem of finding the maximal (k,r)-bounded UkiWSG's of G. The set of k-clusters corresponding to each (k,r)-bounded UkiWSG of G is a well defined entity and it can be found by theorem 5:3. Remark: Given an N-block dissimilarity matrix A.=[d;k], assign a rank matrix R=[r%k] to A as follow: Rank dgk's from 1 to(N(N-l)/2)*M according to their magnitude and set r;k to 1 be the rank of dj The scheme of handling ties by assigning k' 86 to the tied observations the average of their ranks should not be used in ranking dgk's, because it could raise some of the terms to fractional ranks which are not desirable for computational reasons. One way to break the ties is to assign the same rank to all tied observations and skip the unused ranks. An alternative is to add some random noise to the tied obser- vations to make them unequal. Some of the reasons for replacing Ah by R in the k-clustering algorithm are as follows: 1. The use of R simplifies the computations required to find all k-clusters of a set of N data items; 2. The use of R makes it possible to save computer memory by storing each rank in at most 12 bits instead of storing it in a complete computer word; 3. The use of R makes the clustering result invariant under any monotonic transformation of/X. 5.6.1 X-clusteringTAlgorithm: As mentioned before, Ling[4l] uses the term"k-cluster" for his method. Here in the last two sections some definitions and theorems were presented in order to make it possible that one can generalize the idea of k-clustering to cluster a set of data items which are represented by n x M matrices(not by n-dimensional vectors.) The algorithm for the k-clustering method based on graph representationsof D and A consists of the following steps: Step l-- Choose a positive integer k, set r=M*(k;1), and replace A by R. Step 2-- Find Br , the maximal (k,r)-bounded UsWSG of G. Step 3-- If Bri D find all m ,mjel, k-clusters 87 corresponding to B1. using theorem 5.3. If B - D go to step 5. r Step 4-- If a new cluster is foundzaIVeaall relevant information about this cluster. Step 5-- Increase r byld units(or if it is appropriate, increase it as it is specified in the following remark) and repeat the algorithm from step 2 until: a. the number of clusters is less than or equalto some previously defined number of clusters, if such a number is given. b. the largest cluster ,D, is reached, that is , every data item is assigned into one group. Remark: Let r* be the previous r used to find maximal (k,r)-bounded UsWSG S and I(r*) be the set of all ranks of clusters already feund.lnstead of increasing r bij)one can set the next r to be the smallest integer greater than r* and not in I(r*). The reason is that no integer in I(r*) can generate a new k-cluster. Steps 1 , 4 and S of the above algorithm are clearly well defined. To show the definiteness of the steps 2 and 3, and indeed the complete algorithm, the following computational steps of the algorithm are specified. Step c(l)-- Read the fixed parameters N,M,n and k. Compute the initial value of r as r0: M*(k;1) . Compute all N(N-l)/2 dissimilaritypg-dimensional vectors corresponding to the N(N-1)/2 pairs of data items.The following is the order of pairs for which the dissimilarity vectors are computed: (D1,D2),(D1,D3),...,(D1,DN),(D2,D3),(DZ,D4),...,(D2,DN),..., (Di,Di+l)9°°-9(D1,DN):°-'a(DN-1’DN)° 88 Store the ranks in an array called RANKS(N(N-l)/2 ,M). 
The ranks are stored as they are calculated. Therefore, RANKS(I,J) is the Jth component of the dissimilarity vector between Dt and DS where t is the integer solution of the inequality (t-1)(N-t/2) is a maximal (2,44)-bounded U4WSG with sources equal 1:1 D1 , D3 ’D6' This U5WSG is partitioned into only one 2-cluster, namely C1 =lD1., OS ’06} . The maximal (2,48)-bounded USWSG of G has Dl,D2,D3,D5,D6 as its sources and is partitioned into one 2-cluster ,namely C2=fD1,D2,D3,D5,D6§ . Since Cl is a 2, C1 is marked. The last maximal (2,r)-buunded UsWSG of G is a (2,56)-bounded U6WSG and is G itself. Since subcluster of C this last maximal (2,SS)-bounded U6WSG is partitioned into only one cluster ,namely the set D itself, the algorithm will halt at step 5 (Or step C(7).) The GCM and the k-clustering method are two different approaches. Each one has its own advantages as well as disad- vantages. But there are two important reasons that one might prefer the GCM. These reasons are l)forvelarge number of data items the GCA is significantly faster than the k-clustering algorithm, and 2) the GCA recomputes the dissimilarity matrix each time a new cluster is obtained,therefore, more Optimized clusters are expected from the GCA. One of the most important advantages of the k-clustering algorithm is that it can generate different clusters by changing the parameter k. 91 It should be mentioned that , due to the lack of a set of real data , the two algorithms given in this chapter have not been tested by a set of real data. But they have been programmed and tested by a set of generated data. I To summarize this chapter, a set of N different data items , D = {01, Dz, ..., DN} , each one of them represen- ted by an n x M matrix along with a dissimilarity matrix, A , are given. Two different approaches are developed for clustering the set D. The first approach is to transformld to an N x N dissimilarity matrix and use one of the existing clustering algorithms to cluster the set D. The second app: roach is to use A as it is. For this second approach two different clustering algorithms are developed. One is the generalization of the Centroid Algorithm and the other one is based on graphical representations of D and A . For exten- ding Ling's[4l] method, some theoretical properties of the graph representation of D and A are discovered in this chapter. CHAPTER VI SUMMARY AND CONCLUSION 6.1 Summary of this Desertation: Chapter one is an overview which includes brief descriptions of a number of major problems. These major problems are l) the problem of representation, 2) the problem of clustering, 3) the problem of assignment, 4) the problem of applicability of assignments and 5) the problem of dynamic clusters. Chapter two is the design study of a general system called goal oriented clustering. This system is designed so as to provide some solution to the above major problems. The goal oriented clustering procedure(GOCP) consists of three separate blocks. A list of elements of the GOCP is given in chapter two along with a brief description of each element. Chapter three is concerned with variables. Since numeric variables have been studied for a long time by most of the authors working in the area of clustering, only a reference is given for the problem related to the numeric variables. The numeric representation of NNV's is the most important step of GOCP. Two different representations of NNV's are provided in chapter three. 
Also, in this chapter some appropriate measures corresponding to the numeric representation of NNV's are defined. Each data item is represented by an n x M matrix and the set of all data items is represented by an N-block matrix. Finally , some motivation for all these proposed representations is given at the end of chapter three. 92 93 In chapter four an M-dimensional distance function [7: ( Yl’ 32,...,‘1M) is defined and a very special form of this distance function is used for association measure between variables. Also, in this chapter the Minkowski metric is discussed and extended to an M-dimensional Minkowski metric. The extended Minkowski metric is used for measuring the dissi- milarity between data items. A general N-block distance matrix is defined and . is used to represent the dissimilarity between data items. The following are some of the most important tOpics discussed in chapter five . l. Geometric Representation of D and.A: In this section an attempt is made to interpret the matrices D and4A.by some geometrical means. 2. General Centroid Method: Two approaches are provided for clustering a set of data items D with a dissimilarity matrixA. The first approach is to transform A to an N x N dissimilaiity matrix and then use one of the existing clustering methodsto cluster the set D. The other approach is to work with.A as it is. The General Centroid Method is proposed for this second approach. Some special casesof GCM arediscussed and an algorithm for the general case of the GCM is provided. 3. Graph Representation of D andA: In this section two new types of graphs are introduced. These new graphs are called N-partite Weighted and Uniform N-partite Weighted.Graphs. The definitions and properties of these graphs are discussed and D and A are represented by a uniform N-partite weighted graph. 94 4. k-clustering Based on Graph Representation of D andA : Chapter five is closed by providing the necessary mathematical theory for clustering a set of N data items when they are represented by a UNWG G=<§gf Di,AC>. A number of theorems and definitions are provided in this chapter dealing with graph representationSOf D and A . 6.2 Conclusion: .. One of the major results of this thesis is the development of a method of representing NNV's as part of a characteristic vector of a data item. NNV's had never before been considered as part of characteristic vector of a data item. On the other hand the contribution of this type of variable to the result of any cluster analysis is apparent. Through use of the GOCS one can represent an observation by any type of variable. One can incloude behavioral performance as well as age in the list of variables. A most important result of this thesis is the development of a complete solution to the clustering problem of the GOCS.The clustering criteria proposed in this research are generalized modesof some previous clustering criteria. The representation of variables has been generalized to M-dimen-. sional vectors,the representation of data items has been generalized to an n x M matrix , and the distance measure between data items has been generalized to an N-block matrix. The generalization is such that if no NNV is involved then the representations are the same as they were with the previous criteria. Two clustering algorithms are developed.The first 95 algorithm is a generalization of the Centroid Algorithm(GCA). The GCA ,if there is no threshold vector given, will generate only one new cluster in each iteration. 
The second algorithm is the generalization of the k-clustering Algorithm. The definitions and theorems developed for generalizing the k-clustering Algorithm are one of the important parts of this thesis.

The M-dimensional distance function introduced in Chapter Four could also be used for factor analysis, for multivariate analysis in the case of three-dimensional matrices, and for pattern recognition. The clustering criteria introduced in Chapter Five could also be used for clustering data items with no NNV, especially the criterion based on the graph representation of D and A. The advantage of this method is that the cluster is a well defined entity and is dependent on the dissimilarity matrix and the parameter k. The definition of a uniform k-partite graph could also be extended for purely graph-theoretic reasons.

6.3 Suggested Topics for Further Work:

The following is a list of suggested topics for further work in the area of GOC.

1. Implementing a General Action Block: At this time the clustering routine of the GOCP is completed, but the action routine of the GOCP is not implemented as a general computer program. The action block should be designed such that it accepts as input a set of M0 clusters, a set of needs associated with the problem, a set of actions, and a set of control parameters which are dependent on the given problem. The output of the action block should be a set of action sequences.

2. Implementing a General Acceptor Block: This block should be designed as a subblock of the action block. The inputs to this block are, possibly, a set of actions, a set of needs, a set of action sequences, and a set of parameters defining the nature of the given problem and specifying the relations between actions. The output of this routine is either "accepted" or "rejected." (One possible shape for these two interfaces is sketched after this list.)

3. Studying the Dynamic Property of a Cluster: This subject was mentioned as one of the five major problems in chapter one. As far as the author knows there has been no work related to this subject, and it is perhaps a most interesting subject to be studied.

4. Developing a Different Approach: Solving the clustering problem of the GOCS by a different approach and comparing the results of that approach to what is done in this thesis could be an interesting topic for further research in this area.
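The thesis deliberately leaves the design of these two blocks open. Purely as an illustrative sketch of what such interfaces might look like (every identifier, the length constraint, and the need-matching selection rule below are hypothetical and are not proposed by the thesis), the inputs and outputs described above could be arranged as follows.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class ActionSequence:
    cluster_id: int
    actions: List[str]

def acceptor_block(max_length: int) -> Callable[[List[str]], bool]:
    """Acceptor block: returns True ("accepted") or False ("rejected");
    the only control parameter in this sketch is a hypothetical length limit."""
    return lambda sequence: 0 < len(sequence) <= max_length

def action_block(clusters: List[List[str]],
                 needs: List[str],
                 actions: List[str],
                 control_params: Dict,
                 acceptor: Callable[[List[str]], bool]) -> List[ActionSequence]:
    """Action block: assign an action sequence to every cluster, keeping only
    sequences the acceptor approves.  The selection rule (match an action
    whenever it mentions a stated need) is purely illustrative, and
    control_params is accepted but unused in this sketch."""
    sequences = []
    for cid, _members in enumerate(clusters):
        proposed = [a for a in actions if any(need in a for need in needs)]
        if acceptor(proposed):
            sequences.append(ActionSequence(cluster_id=cid, actions=proposed))
    return sequences

# Hypothetical inputs
clusters = [["item_1", "item_4"], ["item_2", "item_3"]]
needs = ["reading", "arithmetic"]
actions = ["reading drill", "arithmetic review", "free study"]
print(action_block(clusters, needs, actions, {}, acceptor_block(max_length=2)))
```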
Bibliography

(1) Aitken, M. A. and Hume, N. W., "Correlation in a Singly Truncated Bivariate Normal Distribution III. Correlation between Ranks and Variate Values." Biometrika, Vol. 53, Parts 1/2, pp. 278-281, 1966.

(2) Anderberg, R. M., "Cluster Analysis for Applications." National Technical Information Service, 1972 (published 1973).

(3) Babu, C. C. and Chan, W. C., "An Algorithm for Pattern Classification Using Eigenvectors."

(4) Bailey, D. E. and Tryon, R. C., "Cluster Analysis." McGraw-Hill, 1970.

(5) Ball, G. H. and Hall, D. J., "A Clustering Technique for Summarizing Multivariate Data." Behavioral Science, 12, pp. 153-155, 1967.

(6) Ball, G. H. and Hall, D. J., "ISODATA, A Novel Method of Data Analysis and Pattern Classification." Stanford Research Institute, AD 699616, 1965.

(7) Bartels, P. H., Bahr, G. F., Calhoun, D. W. and Wied, G. L., "Cell Recognition by Neighbourhood Grouping Technique in TICAS." Acta Cytologica, Vol. 14, No. 6, pp. 313-324, 1970.

(8) Conover, W. J., "Practical Nonparametric Statistics." John Wiley & Sons, Inc., 1971.

(9) Diday, E., "Optimization in Non-Hierarchical Clustering." Pattern Recognition, Vol. 6, No. 1, pp. 17-34, 1974.

(10) Dubes, R. C., "Information Compression, Structure Analysis and Decision Making with a Correlation Matrix." Michigan State University, 1970.

(11) Dunn, J. C., "A Graph Theoretic Analysis of Pattern Classification via Tamura's Fuzzy Relation." IEEE Trans. on Systems, Man and Cybernetics, pp. 310-313, May 1974.

(12) Edwards, A. W. F. and Cavalli-Sforza, L. L., "A Method for Cluster Analysis." Biometrics, Vol. 21, No. 2, pp. 362-375, 1965.

(13) Eigen, D. J., Northouse, R. A. and Fromm, F. R., "Cluster Analysis Based on Dimensional Information with Application to Feature Selection and Classification." IEEE Trans. on Systems, Man and Cybernetics, pp. 284-294, May 1974.

(14) Fleiss, J. L. and Zubin, J., "On the Methods and Theory of Clustering." Multivariate Behavioral Research, 4, pp. 235-250, 1969.

(15) Fisher, L. and Van Ness, J. W., "Admissible Clustering Procedures." Biometrika, Vol. 58, pp. 99-104, 1971.

(16) Fisher, W. D., "On Grouping for Maximum Homogeneity." Journal of the American Statistical Association, No. 53, pp. 789-798, 1958.

(17) Friedman, H. P. and Rubin, J., "On Some Invariant Criteria for Grouping Data." Journal of the American Statistical Association, Vol. 62, pp. 1159-1178, 1967.

(18) Friedman, J. H. and Tukey, J. W., "A Projection Pursuit Algorithm for Exploratory Data Analysis." IEEE Trans. Comp., C-23, pp. 881-890, Sept. 1974.

(19) Fukunaga, K. and Koontz, W. L. G., "A Nonparametric Valley-Seeking Technique for Cluster Analysis." IEEE Trans. Comp., C-21, No. 2, pp. 171-179, Feb. 1972.

(20) Fukunaga, K. and Olsen, D. R., "Representation of Non-Linear Data Surfaces." IEEE Trans. Comp., Vol. C-22, No. 10, pp. 915-923, Oct. 1973.

(21) Fukunaga, K. and Koontz, W. L. G., "A Non-Linear Feature Extraction Algorithm Using Distance Transformation." IEEE Trans. Comp., Vol. C-21, No. 1, pp. 56-63, Jan. 1972.

(22) Fukunaga, K. and Koontz, W. L. G., "Asymptotic Analysis of a Nonparametric Clustering Technique." IEEE Trans. on Computers, Vol. C-21, No. 9, pp. 967-974, Sept. 1972.

(23) Fukunaga, K., "Introduction to Statistical Pattern Recognition." Academic Press, 1972.

(24) Fukunaga, K. and Olsen, D. R., "An Algorithm for Finding Intrinsic Dimensionality of Data." IEEE Trans. on Computers, Vol. C-20, No. 2, Feb. 1971.

(25) Fukunaga, K. and Kessell, D. L., "Estimation of Classification Error." IEEE Trans. Comp., Vol. C-20, No. 12, pp. 1521-1527, Dec. 1971.

(26) Gimilin, D. R. and Ferrell, D. R., "A K-K' Error Correction Procedure for Nonparametric Imperfectly Supervised Learning." IEEE Trans. on Systems, Man and Cybernetics, pp. 304-306, May 1974.

(27) Gower, J. C., "Comparison of Some Methods of Cluster Analysis." Biometrics, No. 23, pp. 623-637, 1967.

(28) Harding, E. F., "The Probabilities of Rooted Tree-Shapes Generated by Random Bifurcation." Advances in Applied Probability, Vol. 3.

(29) Hajek, J., "Nonparametric Statistics." Holden-Day, Inc., 1969.

(30) Hellman, M. E., "The Nearest Neighbor Classification Rule with a Reject Option." IEEE Trans., Vol. SSC-6, pp. 179-185, July 1970.

(31) Hsu, Harry T., "An Algorithm for Finding a Minimal Equivalent Graph of a Digraph." Journal of the ACM, Vol. 22, No. 1, Jan. 1975.

(32) Hasselblad, V., "Estimation of Parameters of a Mixture of Normal Distributions." Technometrics, Vol. 8, pp. 431-444, 1966.

(33) Lance, G. N. and Williams, W. T., "Note on the Classification of Multi-Level Data." Computer Journal, No. 9, pp. 381-382, 1967a.
(34) Jarvis, R. A. and Patrick, E. A., "Clustering Using a Similarity Measure Based on Shared Nearest Neighbors." IEEE Trans., Vol. C-22, pp. 1025-1034, Nov. 1973.

(35) Johnson, S. C., "Hierarchical Clustering Schemes." Psychometrika, Vol. 32, No. 3, pp. 241-254, Sept. 1967.

(36) Jensen, R. E., "A Dynamic Programming Algorithm for Cluster Analysis." Operations Research, Vol. 17, No. 6, pp. 1034-1057, 1969.

(37) Kruskal, J. B., "Multidimensional Scaling by Optimizing Goodness of Fit to a Nonmetric Hypothesis." Psychometrika, 29, pp. 1-27, 1964a.

(38) Labovitz, S., "In Defense of Assigning Numbers to Ranks." American Sociological Review, Vol. 36, No. 3, pp. 520-521, 1971.

(39) Labovitz, S., "The Assignment of Numbers to Rank Order Categories." American Sociological Review, Vol. 35, No. 3, pp. 515-524, 1970.

(40) Labovitz, S., "Some Observations on Measurement and Statistics." Social Forces, Vol. 46, No. 2, pp. 151-160, 1967.

(41) Ling, R. F., "Cluster Analysis." Ph.D. dissertation, Technical Report No. 18, Dept. of Statistics, Yale University, 1971.

(42) Lingoes, J. C., "An IBM 7090 Program for Guttman-Lingoes Smallest Space Analysis--I." Behavioral Science, 10, pp. 183-184, 1966.

(43) Lance, G. N. and Williams, W. T., "Computer Program for Hierarchical Polythetic Classification (Similarity Analysis)." Computer Journal, Vol. 9, No. 2, pp. 60-64, 1966.

(44) Lance, G. N. and Williams, W. T., "Computer Program for Monothetic Classification (Association Analysis)." Computer Journal, Vol. 8, No. 3, pp. 246-249, 1965.

(45) Lee, Y. H. and Rosenfeld, A., "A Clustering Heuristic for Line-Drawing Analysis." IEEE Trans. Comp., Vol. C-21, No. 8, pp. 904-911, Aug. 1972.

(46) McQuitty, L. L., "Hierarchical Linkage Analysis for the Isolation of Types." Educational and Psychological Measurement, Vol. 20, No. 1, pp. 55-67, 1960.

(47) Meisel, W. S., "Computer-Oriented Approaches to Pattern Recognition." Academic Press, 1972.

(48) Meisel, W. S. and Michalopoulos, D. A., "A Partitioning Algorithm with Application in Pattern Classification and the Optimization of Decision Trees." IEEE Trans. on Computers, Vol. C-22, No. 1, pp. 93-102, Jan. 1973.

(49) Minsky, M., "Semantic Information Processing." MIT Press, 1967.

(50) Morrison, D. F., "Multivariate Statistical Methods." McGraw-Hill, 1967.

(51) Mucciardi, A. N. and Gose, E. E., "An Automatic Clustering Algorithm and Its Properties in High Dimensional Space." IEEE Trans. on Systems, Man and Cybernetics, Vol. SMC-2, pp. 247-254, April 1972.

(52) Nagy, G., "State of the Art of Pattern Recognition." Proceedings of the IEEE, Vol. 56, No. 5, pp. 830-862, 1968.

(53) Patrick, E. A., "Fundamentals of Pattern Recognition." Prentice-Hall, Inc., 1972.

(54) Parks, J. M., "FORTRAN IV Program for Q-Mode Cluster Analysis and Distance Function with Printed Dendrogram." The University of Kansas, 1970.

(55) Rao, V. A., "A Multimodal Distribution Based Clustering Algorithm." Ph.D. thesis, Department of Computer Science, Michigan State University.

(56) Ruspini, E. H., "A New Approach to Clustering." Information and Control, No. 15, pp. 22-32, 1969.

(57) Ruspini, E. H., "Numerical Methods for Fuzzy Clustering." Information Sciences, No. 2, pp. 319-350, 1970.

(58) Sawrey, W. L., Keller, L. and Conger, J. J., "An Objective Method of Grouping Profiles by Distance Functions and Its Relation to Factor Analysis." Educational and Psychological Measurement, Vol. 20, No. 4, pp. 651-673, 1960.

(59) Schweitzer, S. and Schweitzer, D. G., "Comment on the Pearson r in Random Numbers and Precise Functional Scale Transformation." American Sociological Review, Vol. 36, No. 3, pp. 517-518, 1971.
(60) Scott, A. J. and Symons, M. J., "Clustering Methods Based on Likelihood Ratio Criteria." Biometrics, Vol. 27, No. 2, pp. 387-398, 1971a.

(61) Scott, A. J. and Symons, M. J., "On the Edwards and Cavalli-Sforza Method of Cluster Analysis." Biometrics, Vol. 27, No. 1, pp. 217-219, 1971b.

(62) Shepard, R. N., "The Analysis of Proximities: Multidimensional Scaling with an Unknown Distance Function." Psychometrika, 27, pp. 125-139 and 219-246, 1962.

(63) Shepard, R. N., "Representation of Structure in Similarity Data: Problems and Prospects." Psychometrika, Vol. 39, No. 4, pp. 373-421, Dec. 1974.

(64) Shepard, M. J. and Willmott, A. J., "Cluster Analysis on the Atlas Computer." Computer Journal, 11, pp. 57-62, 1968.

(65) Sneath, P. H. A., "The Application of Computers to Taxonomy." Journal of General Microbiology, Vol. 17, pp. 201-206, 1957.

(66) Sokal, R. R. and Sneath, P. H. A., "Principles of Numerical Taxonomy." W. H. Freeman and Company, San Francisco, 1963.

(67) Sokal, R. R. and Sneath, P. H. A., "Numerical Taxonomy." W. H. Freeman, San Francisco, 1973.

(68) Sokal, R. R. and Michener, C. D., "A Statistical Method for Evaluating Systematic Relationships." University of Kansas Science Bulletin, No. 38, pp. 1409-1438, 1958.

(69) Sokal, R. R., "Classification: Purposes, Principles, Progress, Prospects." Science, Vol. 185, No. 4157, pp. 1115-1123, Sept. 1974.

(70) Stanat, D. F. and Das, S. K., "A Modified Training Procedure for Linear Threshold Devices." IEEE Trans. Comp., Vol. C-21, No. 4, pp. 390-397, April 1972.

(71) Stuart, A., "The Correlation Between Variate Values and Ranks in Samples from a Continuous Distribution." The British Journal of Statistical Psychology, Vol. VII, Part I, pp. 37-44, 1954.

(72) Tamura, S., Higuchi, S. and Tanaka, K., "Pattern Classification Based on Fuzzy Relations." IEEE Trans. on Systems, Man and Cybernetics, pp. 61-66, June 1971.

(73) Uhr, L., "Pattern Recognition, Learning and Thought." Prentice-Hall, 1973.

(74) Ullmann, J. R., "Pattern Recognition Techniques." Butterworth, 1973.

(75) Vinod, H. D., "Integer Programming and the Theory of Grouping." Journal of the American Statistical Association, Vol. 64, pp. 506-519, 1969.

(76) Ward, J. H., Jr., "Hierarchical Grouping to Optimize an Objective Function." Journal of the American Statistical Association, Vol. 58, pp. 236-244, 1963.

(77) Wishart, D., "An Algorithm for Hierarchical Classification." Biometrics, 25, pp. 165-170, 1969a.

(78) Wolfe, J. H., "Pattern Clustering by Multivariate Mixture Analysis." Multivariate Behavioral Research, Vol. 5, No. 3, pp. 329-350, 1970.

(79) Zahn, C. T., "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers, Vol. C-20, pp. 68-86, Jan. 1971.

(80) Zadeh, L. A., "Fuzzy Sets." Information and Control, Vol. 8, pp. 338-353, 1965.

(81) Berztiss, A. T., "Data Structures: Theory and Practice." Academic Press, 1975.