EVOLUTIONARY OPTIMIZATION AND ENSEMBLE TECHNIQUES FOR DATA MINING AND PATTERN RECOGNITION

By

Alexander P. Topchy

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

2004

ABSTRACT

EVOLUTIONARY OPTIMIZATION AND ENSEMBLE TECHNIQUES FOR DATA MINING AND PATTERN RECOGNITION

By Alexander P. Topchy

This dissertation addresses fundamental data mining and pattern recognition problems (feature extraction, modeling, and data clustering) through evolutionary computation and ensemble-based approaches.

We offer feature extraction methods for improved pattern classification using genetic algorithms. New features are synthesized by merging the values of original variables during the search process. The genetic search for (sub-)optimal combinations of values is performed using a graph-based encoding of candidate solutions. A compact solution representation with minimal redundancy is used for a wide class of grouping problems, including clustering of variable values. Genetic value clustering is applied to text categorization, DNA-based assignment of individuals in population genetics, and parametric learning of Bayesian network classifiers. It is shown that such feature extraction results in better predictive accuracy of classification decisions.

We develop genetic programming algorithms for modeling input-output mappings of continuous variables that incorporate dynamic fitting of the free parameters of evolved models. Traditional genetic programming is extended by gradient descent optimization of the leaf coefficients of tree-like programs during the evolutionary search, made possible by algorithmic differentiation. Experimental results show significant improvement in both computational requirements and modeling accuracy for a set of symbolic regression problems.

Ensembles of partitions of data sets are studied in two respects: combination of multiple clusterings and generation of clusterings for an ensemble. We develop two efficient consensus functions for finding a combined partition of good quality. The first consensus function uses an information-theoretic principle based on maximal generalized mutual information. The second function finds a consensus clustering by estimating a probabilistic mixture model from the observed ensemble.
It is demonstrated that the ensemble's partitions can be generated by weak clustering algorithms, in particular by clustering in random low-dimensional subspaces of the original feature space. Experiments indicate that an ensemble of weak partitions can be more accurate than a single sophisticated clustering algorithm. Finally, we consider how the partition generation process can be made adaptive in order to provide better decisions for the patterns located near the inter-cluster boundaries.

Dedicated to my parents, Pavel Topchy and Tamara Zemel

Acknowledgements

The years at MSU have been a fantastic experience, and it is a joy to remember and to thank my teachers and colleagues whose influence contributed to this thesis.

First and foremost, I would like to thank my advisor, Bill Punch, for his extraordinary support and patience. His fruitful suggestions and our discussions contributed enormously to this work. From the first time I stepped in the door of the Computer Science Department until now, Bill has allowed me the freedom to explore my own directions in research. I have come to appreciate the wisdom of his way, which has guided me safely through the process.

I owe many inspirational ideas to Anil Jain, who has been like a second advisor to me. It was a privilege to be a student in his pattern recognition classes, which defined my interest in the field for many years to come. The results of this thesis would be impossible without Anil Jain's motivation and without the research assistantship he kindly provided in the past year.

Kim Scribner fostered my interest in genetic data analysis. I deeply appreciate his enthusiasm for unorthodox approaches, the high standards he set, and his genuinely positive attitude toward science and toward my research progress. He supported me materially and morally for many years, for which I am very grateful.

I thank Erik Goodman for asking just the right questions to improve the work, for providing valuable feedback on my projects at all times, and for unforgettable GARAGe group meetings. Most importantly, thanks to his help with organizing and hosting my visit to MSU in the summer of 1997, I became determined to pursue my studies at MSU, in particular in the GARAGe group, which eventually became my alma mater.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 INTRODUCTION
1.1 MOTIVATION, GOALS, PROBLEM DOMAINS
1.2 CONTRIBUTIONS OF THE THESIS
2 BACKGROUND ON DIMENSIONALITY REDUCTION
2.1 OPTIMAL SUBSET SEARCH
2.2 HEURISTIC SEARCH
2.3 FILTER METHODS
2.4 WRAPPER METHODS
2.5 EVOLUTIONARY FEATURE SELECTION
3 GENETIC VALUE CLUSTERING
3.1 CLUSTERING IN NAIVE BAYES MODEL
3.2 BACKGROUND ON VALUE ABSTRACTION AND CLUSTERING
3.3 CLUSTER REPRESENTATION IN GENETIC ALGORITHM
3.4 APPLICATION DOMAINS FOR VALUE CLUSTERING
3.5 CLOSED FORM ANALYSIS OF VALUE CLUSTERING
3.6 EXPERIMENTAL STUDY
3.6.1 Genetic algorithm details
3.6.2 Experimental results
3.7 CONCLUSIONS
4 GENETIC PROGRAMMING WITH LOCAL LEARNING
4.1 LAMARCKIAN VERSUS BALDWINIAN STRATEGY IN GP
4.2 HYBRID GENETIC PROGRAMMING
4.3 LEARNING LEAF COEFFICIENTS
4.4 PROGRAM DIFFERENTIATION IN DETAIL
4.5 LEARNING RATE AND NUMBER OF STEPS
4.6 EXPERIMENTAL DESIGN
4.7 EXPERIMENTAL RESULTS
4.8 CONCLUSIONS
5 CLUSTERING ENSEMBLES
5.1 RELATED WORK
5.2 REPRESENTATION OF MULTIPLE PARTITIONS
5.3 A MIXTURE MODEL OF CONSENSUS
5.4 INFORMATION-THEORETIC CONSENSUS OF CLUSTERINGS
5.5 COMBINATION OF WEAK CLUSTERINGS
5.5.1 Splitting by random hyperplanes
5.5.2 Combination of clusterings in random subspaces
5.6 EMPIRICAL STUDY
5.6.1 Datasets
5.6.2 Selection of parameters and algorithms
5.6.3 Experiments with complete partitions
5.6.4 Experiments with incomplete partitions
5.6.5 Results of random subspaces algorithm
5.6.6 Results of random splitting algorithm
5.7 ADAPTIVE CLUSTERING ENSEMBLES
5.7.1 Adaptive sampling and clustering
5.7.2 Empirical study of adaptive ensembles
5.8 CONCLUSIONS
6 SUMMARY
APPENDIX. EM ALGORITHM FOR CONSENSUS SOLUTION
BIBLIOGRAPHY

LIST OF TABLES

3.1 Comparison of test accuracy for the best solutions found by GA value clustering and best-first hillclimbing with "naive" non-clustered solution on the "acquisitions" versus "earnings" classification problem
3.2 Comparison of test accuracy for the best solutions found by GA value clustering and hillclimbing with "naive" non-clustered solution on the "silver" and "gold" versus "gas" and "nat-gas" topics classification problem
3.3 Characteristics of the datasets used in BNC experiments
3.4 Classification accuracy of the compared classifiers over test sets (no smoothing)
3.5 Classification accuracy on test sets with classifier smoothing (alpha = 1/2)
4.1 Performance comparison of hybrid and regular GP. All data collected after 30,000 function evaluations and averaged over 10 experiments
4.2 Effects of local learning
5.1 Clustering ensemble and consensus solution
5.2 Characteristics of the data sets
5.3 Mean error rate (%) for the "Galaxy" dataset
5.4 Mean error rate (%) for the "Half-rings" dataset
5.5 Mean error rate (%) for the "Biochemistry" dataset
5.6 Mean error rate (%) for the "2-spirals" dataset
5.7 Mean error rate (%) for the Iris dataset
5.8 Clustering error rate of the EM algorithm as a function of the number of missing labels for the large datasets
5.9 "2 spirals" data experiments. Average error rate (%, over 20 runs) of clustering combination using the random 1-d projections algorithm with different numbers of components in the combination H, different resolutions of components k, and seven types of consensus functions
5.10 Average error rate (%, over 20 runs) of combination clustering using the random projections algorithm on the "Galaxy/star" data set
5.11 Consistent re-labeling of 4 partitions

LIST OF FIGURES

2.1 Feature extraction wrapper-type approach with GA search engine
3.1 Regular graph-based clustering GA selects N genes from the total of N^2 edges, while compact encoding needs only N-1 from N(N-1)/2 edges. GA search space size is reduced from N^N to N!
3.2 Ratio of GA search space size to the true number of distinct partitions of objects, on a logarithmic scale
3.3 Sample chromosomes and corresponding clusterings are shown by connected subgraphs for regular and compact encodings
3.4 Ratio of GA search space sizes in non-compact and compact encoding
3.5 An example of encoding alphabets at each chromosome position in compact and equalized compact graph-based representations of a set of 6 items
3.6 Reduction of search space size in equalized compact encoding as compared to regular graph-based encoding
3.7 TAN structure with four attributes. Augmented edges are shown among the attribute nodes
3.8 Conditional probability table P(X,Y|C=c) before value clustering and after the merging of values {3,4} of variable X and values {2,3} of variable Y
3.9 Accuracy gain $\Delta E$ due to merging states {1,2} for sample size 2N=10: a) if {p1, p2 = p1}, {q1, q2 = q1}; b) if {p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}
3.10 Accuracy gain $\Delta E$ due to merging states {1,2} for sample size 2N=20: a) if {p1, p2 = p1}, {q1, q2 = q1}; b) if {p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}
3.11 Accuracy gain $\Delta E$ due to merging states {1,2} for sample size 2N=30: a) if {p1, p2 = p1}, {q1, q2 = q1}; b) if {p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}
3.12 Accuracy gain $\Delta E$ due to merging states {1,2} for sample size 2N=10: a) {p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}; b) {p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}
3.13 Accuracy gain $\Delta E$ due to merging states {1,2} for sample size 2N=20: a) {p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}; b) {p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}
3.14 Accuracy gain $\Delta E$ due to merging states {1,2} for sample size 2N=30: a) {p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}; b) {p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}
3.15 Underlying dependencies between the variables in the datasets "weak", "moderate" and "strong"
4.1 Sample tree with a set of random constants. In hybrid GP all the leaf coefficients are subjected to training
4.2 Example of a function of 3 variables drawn as a tree. c1 and c2 are the leaf coefficients
4.3 Example of a function for algorithmic differentiation with respect to a parameter c
4.4 Example of a function for algorithmic differentiation with respect to 2 parameters c1 and c2
4.5 Local learning strongly affects fitness of individuals. Typical learning progress is illustrated using individuals from test problem f2
4.6 Comparison of selection process in HGP and GP. Local learning changes the outcome of some tournaments used to select a mating pool
4.7 Typical dynamics of the number of terminals (numeric coefficients) used by the best program as a function of GP generations (for the test function f1)
4.8 Surface fitting test problems f1, f2 and respective learning curves
4.9 Surface fitting test problems f3, f4, f5 and respective learning curves
5.1 Four possible partitions of 12 data points into 2 clusters. Different partitions use different sets of labels
5.2 Clustering by a random hyperplane: (a) An example of splitting the 2-spiral data set by a random line. Points on the same side of the line are in the same cluster. (b) Probability of splitting two one-dimensional objects for different numbers of random thresholds, as a function of distance between the objects
5.3 Dependence of distances derived from the co-association values vs. the actual Euclidean distance x for each possible pair of objects in Iris data. Co-association matrices were computed for different numbers of hyperplanes r = 1, 2, 3, 4
5.4 Projecting data on a random line: (a) Sample data with two identifiable natural clusters and a line randomly selected for projection. (b) Histogram of the distribution of points resulting from data projection onto a random line
5.5 "2 spirals" and "Half-rings" datasets, difficult for any centroid-based clustering algorithm
5.6 Consensus clustering error rate as a function of the number of missing labels in the ensemble for the Iris dataset, H=5, k=3
5.7 Consensus error as a function of the number of clusters in the contributing partitions for Galaxy data and ensemble size H=20
5.8 Performance of the random subspaces algorithm on Iris data. (a) Number of errors by the combination of k-means partitions (k=4) in multidimensional space and projected 1-d subspaces. The average-link consensus function was used. (b) Accuracy of the projection algorithm as a function of the number of components and the number of clusters k in each component
5.9 Dependence of accuracy of the projection algorithm on the type of consensus function for the Iris data set, k=3
5.10 Number of misassigned points by the random hyperplanes algorithm in clustering the "2 spirals" dataset: (a) for different numbers of hyperplanes, (b) for different consensus functions
5.11 Two possible decision boundaries for a 2-cluster data set. Sampling probabilities of data points are indicated by gray-level intensity at different iterations

$p(x, \omega) = p(x)\,p(\omega)$

Observation of variable x does not help to infer the value of $\omega$ because they are statistically independent. Second, the accuracy of some classifiers may suffer if some input features x and y are highly correlated:

$|\mathrm{Corr}(x, y)| \rightarrow 1$

Any classification method relying on feature independence, e.g., maximum likelihood assignment or the naive (simple) Bayes classifier, can potentially perform worse when both x and y are included.

Let us review important dimensionality reduction algorithms to provide background for our own developments in the field. The terminology and notation used follow classical texts such as [26][64][28]. We mostly use the term feature to denote a component of a measurement vector, as commonly accepted in statistical pattern recognition. Attribute, variable, parameter and field are equivalent terms and can be used in the context of a specific application.

Feature selection algorithms are concerned with finding the best subset among existing features according to some criterion. Feature extraction algorithms usually synthesize a set of "new" features. Suppose there are D measurements representing the pattern and Y is the complete set of features:

$Y = \{y_i \mid i = 1, 2, \dots, D\}$

Feature selection is concerned with finding the subset X that maximizes a chosen criterion function $J(\cdot)$:

$J(X) = \max_{Z \subseteq Y} J(Z), \qquad X = \{x_i \mid i = 1, 2, \dots, d\}$

The number of features d in subset X can be fixed in advance or found during the search. Feature selection either accepts a feature as is, or not at all. In contrast, feature extraction transforms the existing variables Y into new variables X:

$X = W(Y)$

The dimensionality of the transformed set X is generally lower than that of the original feature space. The feature extraction transformation W(Y) is chosen to optimize an objective function $J(\cdot)$:

$J(W(Y)) = \max_{W'} J(W'(Y))$

Another taxonomy of methods is based on the type of criterion used for optimization. In one category, dimensionality reduction is performed independently of the subsequent classification. The relevance of a candidate feature set or transformation is estimated without running the actual model, but by analyzing the data distribution, often heuristically. Such methods are called filters. Filtering itself can rely on other models, e.g., decision trees or neural networks [87]. The other category is the wrapper methods: the optimization criterion is the prediction accuracy achieved by the selected (extracted) features, and the very same induction algorithm intended for use after the optimization produces the accuracy estimate. Both categories have their advantages, discussed in more detail below.

The complexity of the feature selection problem can be very high. The total number of possible subsets with d features out of D is $\binom{D}{d}$. In general, all possible feature combinations should be evaluated in order to guarantee the optimal subset of features, except for certain problems with special structure or size. Moreover, the optimality is achieved only with respect to the chosen evaluation criterion.
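A quick back-of-the-envelope computation (an illustration only, not taken from the dissertation) shows how fast the number of candidate subsets grows and why exhaustive wrapper evaluation is infeasible for realistic D:

```python
from math import comb

# Number of candidate feature subsets of size d drawn from D features.
# An exhaustive wrapper would have to train and evaluate a classifier on each one.
for D, d in [(20, 5), (50, 10), (100, 10), (200, 20)]:
    print(f"D={D:4d}, d={d:3d}: C(D, d) = {comb(D, d):,}")

# Allowing any subset size is worse still: 2^D - 1 non-empty subsets.
for D in (20, 50, 100):
    print(f"D={D:4d}: 2^D - 1 = {2 ** D - 1:,}")
```

Even at D = 100 with a fixed d = 10, more than 10^13 subsets would have to be scored, which is why the heuristic strategies of Section 2.2 dominate in practice.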
It is important to note that a new solution may be required if the evaluation criterion is changed, e.g., if a naive Bayes classifier is replaced by k-NN classification. As pointed out in [26], feature selection or extraction and classifier design are not independent, and an optimal subset of features is simply a part of the best classifier, at least conceptually. The search strategy can be considered independent of the model, but its efficiency can be vastly improved if domain-specific knowledge is taken into account.

Brute-force evaluation of all the subsets is not practical when the number of features is large. Heuristic search finds sub-optimal solutions in subset space. Still, it is possible to obtain optimal solutions for certain problems without exhaustive search, because clever search organization can exclude most of the search space states without their explicit evaluation [82]. Hence there are two search strategies: optimal and sub-optimal. Summarizing the taxonomy, feature selection algorithms are usually defined by the following components:

- Evaluation criterion. Filter-type algorithms do not use a classifier or induction step for selecting subsets. Instead, filtering evaluates features based only on such data properties as distance, information or consistency. Filter algorithms serve as preprocessors for further induction steps. Wrapper-type algorithms evaluate features based on the performance of the model or classifier learned.
- Search strategy. Optimal algorithms always find the best subset of features. In the worst case they require $O(2^d T)$ total computational effort, if T is the effort required for each subset. Heuristic algorithms do not guarantee optimality, but are often the only feasible approach to practical problems.
- Initialization and termination. Initialization sets the starting point for the search. In single-point algorithms, it can be the empty subset, a set with all the features included, or even some random subset of features. Multipoint (population) search starts with many candidate solutions. Termination is usually controlled by computational effort or solution quality.

2.1 Optimal subset search

The earliest non-exhaustive yet complete search that finds optimal subsets is the branch and bound feature selection due to Narendra and Fukunaga [96]. It is based on the assumption that the evaluation criterion is monotonic; that is, for nested feature sets

$X_1 \subset X_2 \subset \dots \subset X_k$,

the criterion function fulfills the property

$J(X_1) \le J(X_2) \le \dots \le J(X_k)$.

The monotonicity condition allows the algorithm to omit examination of entire branches in the search tree and effectively reduce the computational effort. The basic scheme of Narendra and Fukunaga [96] was significantly improved in [146][121] and made several times faster. Unfortunately, a monotonic criterion is rarely available in practical data mining tasks.

The FOCUS algorithm by Almuallim and Dietterich [3][4] offers optimal search for finding the best model explaining all the training instances. The search is optimal in the sense that it outputs a concept consistent with the training data. However, it works only with binary features. Its computational complexity is bounded by $O(n(2d)^{\log(nd)})$. The main idea of FOCUS was expanded by Schlimmer [117] with an efficient implementation of pruning heuristics for finding optimal solutions. Breadth-first search is used in both approaches.
The automatic branch and bound algorithm of Liu et al. [86] attempts to construct a monotonic evaluation criterion for general data in order to make regular branch and bound applicable. This measure is called "inconsistency" and is computed for every feature subset based on the number of identical (except for class labels) instances. Using the inconsistency measure, the search becomes complete but not exhaustive. In practice, researchers often seek satisfactory suboptimal solutions because problem size or non-monotonic criteria prohibit complete search. Most research is concentrated on heuristic algorithms.

2.2 Heuristic search

Heuristic search algorithms aim to provide reasonably good solutions within affordable time. Two important results in the area greatly influence our understanding of feature selection problems:

1. Sequential selection of features cannot guarantee that an optimal subset will be found, as shown in [21]. This is true for any non-exhaustive sequential algorithm.
2. The best individual feature is not necessarily a part of the best pair of features [20].

Perhaps the simplest approach is to collect the best individual features. Apparently, such a set can be suboptimal when features are not independent. Only when the evaluation criterion is a sum or product of the individual features' criterion functions can optimality be preserved. Several simple strategies employ two basic operations, addition and deletion, of features in a subset [73]. Sequential forward selection adds (at each step) one feature to an initially empty set until no improvement is possible. A feature is retained if the criterion function is maximized by the resulting subset; however, a feature cannot be discarded later. Backward elimination starts with the set of all the features. At each step, the feature whose deletion maximizes the evaluation value of the remaining features is removed. In general, various heuristics can be constructed by adopting one of many known variants of hill-climbing. For example, "plus l, take away r" methods at each step attempt to add l features and then delete r features [123]. Floating search [106][107] is a modification of basic sequential search. These algorithms do not have preset values of l and r but control them dynamically. Floating methods update current subsets by both forward and backward selection if the action results in better performance.

Sequential methods consider one solution at a time. In contrast, evolutionary algorithms (EAs) operate on a population of candidate solutions. In a genetic algorithm (GA) each solution is typically encoded as a binary vector of length d. Subsets of features are represented by 1's in the associated vector positions. Various crossover and mutation operators support search through subsets over many generations. The GA approach to feature selection was pioneered by Siedlecki and Sklansky [118]. GA search performs well on large-scale problems, where it quickly identifies promising feature subsets. It can be further combined with hill-climbing to allow fine-tuning of candidate solutions. GAs are not sensitive to non-monotonicity of the objective function and are not as prone as other methods to the difficulties of finding hidden, nested sets of features.
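To make the binary mask representation concrete, the sketch below runs a bare-bones generational GA over feature subsets. It is only a toy illustration: the score function, target subset, population size, and operator settings are placeholders chosen for brevity and are not the configuration used in this thesis; any wrapper criterion could be substituted for the score.

```python
import random

random.seed(0)
D = 12                      # total number of features
TARGET = {1, 3, 4, 8}       # hypothetical "useful" subset the placeholder score rewards

def score(mask):
    """Placeholder wrapper criterion: reward useful features, penalize extras.
    A real wrapper would train and validate a classifier on the selected subset."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    return len(chosen & TARGET) - 0.2 * len(chosen - TARGET)

def crossover(a, b):
    cut = random.randrange(1, D)          # one-point crossover on the bit mask
    return a[:cut] + b[cut:]

def mutate(mask, rate=1.0 / D):
    return [bit ^ (random.random() < rate) for bit in mask]

pop = [[random.randint(0, 1) for _ in range(D)] for _ in range(30)]
for generation in range(40):
    # binary tournament selection into a mating pool, then recombination
    pool = [max(random.sample(pop, 2), key=score) for _ in range(len(pop))]
    pop = [mutate(crossover(random.choice(pool), random.choice(pool)))
           for _ in range(len(pop))]

best = max(pop, key=score)
print("selected features:", [i for i, bit in enumerate(best) if bit])
```

In a real wrapper the score would be a cross-validated accuracy estimate, which is exactly the construction discussed in Section 2.4.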
2.3 Filter methods

The evaluation criterion plays a key role in feature selection, since it strongly influences the overall search complexity as well as the generality of the obtained results. It is rarely completely dictated by the application. Preprocessing of features, before actual model induction, can be done by filter methods. Filtering selects features by analyzing the data. Interclass distances, consistency, information-theoretic measures, and even auxiliary learning algorithms are often used as filter measures for feature selection and extraction.

Interclass distances can be estimated from data instances according to several metrics. The search goal is to find a feature subset that provides maximum separation between classes. Classical metrics are reviewed in detail in [26]. They fall into two major categories: 1) deterministic, those that explicitly compute distances between instances (points) of classes, and 2) probabilistic, those that compute distances between class-conditional probability distributions of points. The first category includes such well-known metrics as Euclidean, Minkowski, Chebyshev, city block, and quadratic. Probabilistic measures include mutual information, Chernoff, Bhattacharyya, Joshi, and Kullback-Leibler divergence, among others. The complexity of computing distances using probability metrics is comparable to estimating the error probability; in such cases the error probability criterion itself is recommended. The main advantage of filter methods is their speed, since classifier induction is not part of the search algorithm.

2.4 Wrapper methods

Wrapper methods have access not only to the data but also to the classifier itself during the search. The performance of a candidate solution is estimated during the search by running a classification algorithm using the candidate feature subset. This usually results in a tighter fit between the classifier and the features selected. Fig. 2.1 shows the overall wrapper construction with a classifier within the search loop.

Figure 2.1. Feature extraction wrapper-type approach with GA search engine (flow diagram: a training set feeds GA value clustering operations that produce candidate features; classifiers are induced and their performance estimated via cross-validation; the loop repeats until termination, after which the final classifier's estimated accuracy is checked on a test set).

A number of methods could be used to estimate the performance of the classifier. An important goal of any classification procedure is to achieve the highest possible accuracy with respect to previously unseen individuals. This would require access to the unknown individuals to evaluate such accuracy, which we will call true or predictive accuracy. In practice we have only a limited number of individuals in a baseline sample. Use of all the individuals from the baseline results in a better classifier, as opposed to a classifier built only from a subsample of individuals. At the same time, it is necessary to have an accurate measure of classification accuracy even in the absence of independent individuals. Several measures are available that can approximate the true accuracy: cross-validation accuracy, bootstrapping [23], and leave-one-out accuracy as a specific case of cross-validation. Leave-one-out (jackknife) evaluation of a classifier is most commonly used, since it is virtually unbiased. Absence of bias means that on average (over different samples) we accurately estimate the true accuracy. A jackknife accuracy estimate is the mean of the accuracies of N classifiers, where each classifier is obtained from the original sample (of size N) by removing a single individual. Each time, the (N-1) remaining individuals are used for training and the removed individual for testing. However, leave-one-out estimates have quite a large variance, since the estimated accuracy can vary widely for different baseline samples.
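The jackknife procedure just described is mechanical enough to sketch directly. The snippet below is only an illustration (the nearest-centroid classifier and the synthetic one-dimensional data are placeholders, not the classifiers or datasets used in this work); the same loop applies to any induction algorithm.

```python
import random

def nearest_centroid_predict(train, query):
    """Toy classifier: assign the label of the class whose mean is closest."""
    groups = {}
    for x, label in train:
        groups.setdefault(label, []).append(x)
    centroids = {lab: sum(xs) / len(xs) for lab, xs in groups.items()}
    return min(centroids, key=lambda lab: abs(centroids[lab] - query))

def leave_one_out_accuracy(sample):
    """Jackknife estimate: N classifiers, each trained on the other N-1 individuals."""
    hits = 0
    for i, (x, label) in enumerate(sample):
        train = sample[:i] + sample[i + 1:]
        hits += (nearest_centroid_predict(train, x) == label)
    return hits / len(sample)

random.seed(1)
# Two overlapping one-dimensional classes as stand-in data.
sample = [(random.gauss(0.0, 1.0), "a") for _ in range(30)] + \
         [(random.gauss(2.0, 1.0), "b") for _ in range(30)]
print("leave-one-out accuracy:", leave_one_out_accuracy(sample))
```

Replacing the single held-out individual with one of M disjoint folds gives the M-fold cross-validation estimator discussed next.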
M-fold cross-validation estimates the true accuracy via an average over many classifiers. Each classifier uses one of M disjoint subsets (of roughly N/M individuals) of the original sample for testing and the rest of the data for training. Similarly, bootstrap estimates of the true classification accuracy are computed as an average over many bootstrap samples of the original baseline sample. Cross-validation and bootstrapping may provide a better estimate of accuracy since they use more subsamples. However, this increase in complexity gives little improvement in precision in practice. Most researchers consider the jackknife accuracy estimator the best available or, at least, sufficient given practical constraints.

Jackknife estimation is regularly used in feature selection problems. However, it is often overlooked that leave-one-out estimation itself may mislead feature selection algorithms. In other words, if the search procedure employs a jackknife criterion to sift through various subsets, it may end up with a classifier that has a lower true predictive accuracy but a better jackknife accuracy. More often than not, the selected subset has a lower true accuracy when the jackknife criterion is used with small samples. The main reason is that the search among classifiers finds a classifier that is overfitting: the classifier fits the training data well but does not generalize. Jackknife accuracy varies widely as the features vary. Naturally, the feature subset having the highest apparent accuracy will be picked during the search, but that does not guarantee better predictive accuracy. Indeed, to prefer one classifier to another, we must make sure that the confidence intervals for their true predictive accuracy do not overlap at the required confidence level. Unfortunately, for small sample sizes the confidence intervals are not tight enough and often overlap. As a result the search becomes ineffective. This over-searching phenomenon is effectively equivalent to overtraining, which is well known for classification algorithms with many degrees of freedom (adjustable parameters), e.g., neural networks.

2.5 Evolutionary feature selection

Evolutionary computing has shown itself to be a highly robust approach to solving difficult optimization problems of structural or numeric nature. The current renaissance of the whole family of evolutionary algorithms (EAs) is due to a new and improved understanding of both the strengths and the limitations of the evolutionary paradigm [148]. New findings in theory are tightly coupled with progress in practical applications of evolutionary computation. The No Free Lunch (NFL) theorem by Wolpert and Macready [143] indicates that no single search algorithm is better than the others for all problems on average. This also means that we can only hope to design algorithms that perform well for certain classes of problems instead of universally good algorithms. It is generally acknowledged that evolutionary search must be carefully tailored to the objective function [32]. Still, little is known about how to select the best algorithm for practical problems. Most theoretical results are derived for "toy" problems which are far less challenging than real-world optimization. Optimization practitioners continue to be among the strongest supporters of evolutionary algorithms. While few artificial benchmarks demonstrate the superiority of standard EAs over conventional optimization, many outstanding results have been obtained by EAs in practice.
Every successful EA takes into account distinctive characteristics of real-world problems, including:

- Strong non-linear interaction between the parameters of the problem.
- A non-differentiable evaluation function: only the value of the objective function is available, not its derivatives.
- Many constraints on the problem (design) variables.
- Scalability of the search algorithm, since the search space of the problem may grow in the future.
- The possibility of evaluating several configurations in parallel.
- Multiple design criteria used simultaneously, thus requiring multiobjective search.
- Robustness of the search algorithm with respect to changes in the problem variables.
- High computational complexity: a single objective function evaluation may take minutes.

Evolutionary algorithms, in particular genetic algorithms (GAs), are a unique kind of optimization method as applied to feature selection and extraction. Unlike conventional sequential search over feature subsets [26], a GA can overcome non-monotonicity of the evaluation function and the difficulties involved in finding hidden, nested sets of features [64]. Siedlecki and Sklansky [118] introduced the conventional, direct representation of feature sets. More sophisticated schemes were proposed in [135][136][142]. Building on this seminal work [118], GAs were also applied to feature extraction coupled with k-nearest neighbor classifiers in [108][112][113] and with several other algorithms [85][88]. Typically, new features are created through the learning of real-valued weights applied to the original features [112] or through a sequence of primitive operators encoded in the chromosome [14]. Recent attribute construction algorithms successfully utilize GAs for data mining tasks [81][107]. A variety of relevant algorithms is reviewed in [88].

In the next chapter we introduce a different GA approach to dimensionality reduction that does not explicitly operate on whole features, but instead operates on the individual values assumed by the feature variables. Features are extracted by clustering several "old" values into a new meta-value which substitutes for the old values in the feature vector. Therefore, new features are created by clustering the values of variables. A GA is used as a search engine for this value clustering. If features assume nominal values, then clustering can be viewed as a grouping problem, and there are numerous known genetic algorithms that can be used [32]. Grouping and clustering GAs constitute an important research area. Scalability and redundancy are two characteristic problems of grouping algorithms. Here we present a grouping GA with an improved graph-based encoding that extends the work of Park and Song [102]. The proposed cluster representation has lower redundancy while preserving simplicity of decoding and evaluation. This encoding is not limited to value clustering and is therefore applicable to other grouping tasks.

Chapter 3

Genetic Value Clustering

Data mining applications typically deal with instances represented as a set of input attributes (features), where each attribute assumes one of several possible values. We are primarily interested in data represented by nominal values. Throughout the text, the term "value clustering" refers to a method that replaces several original values of an attribute by a new meta-value. This meta-value is completely artificial and is simply a substitute for all the values in the cluster.
Such a clustering affects the data model and consequently the classification rule. This clustering of values represents a kind of generalization, or value abstraction, that is used to improve classification accuracy. We consider important related work below. However, it is instructive to start with a simple example illustrating the idea of value clustering.

3.1 Clustering in Naive Bayes Model

Consider a naive Bayes classifier. The classifier learns the conditional probability P(A_i|C) of each attribute A_i given the class label C. Classification is then done by applying Bayes rule to compute the probability of C given the particular instance of A_1, ..., A_n and then predicting the class with the highest posterior probability (MAP hypothesis):

$c_{MAP} = \arg\max_{c_i} P(c_i \mid a_1, \dots, a_n)$,

where $c_i$ is an instance of C (a class) and $a_j$ is an instance of $A_j$. There is one major assumption behind this: all the attributes are conditionally independent given the value of the class C. Therefore, by Bayes rule:

$P(c_i \mid a_1, \dots, a_n) = \frac{P(a_1, \dots, a_n \mid c_i)\, P(c_i)}{P(a_1, \dots, a_n)} \propto P(c_i) \prod_j P(a_j \mid c_i)$

Let us assume a multinomial probability distribution for the values of each attribute. Using the maximum likelihood principle, one can estimate the class-conditional probabilities {P(a_j|c_i)} from training samples as the ratio of the number of instances with a specific value of a_j within class c_i to the total number of instances in the same class. For example, if there are only three possible values of the first attribute, $a_1 \in \{\alpha, \beta, \gamma\}$, one would estimate the three terms $P(a_1{=}\alpha|c_i)$, $P(a_1{=}\beta|c_i)$ and $P(a_1{=}\gamma|c_i)$. However, if the values $\alpha$ and $\beta$ are merged into one cluster, the meta-value $x = (\alpha \text{ or } \beta)$ is constructed, and only two model parameters remain to be estimated: $P(a_1{=}x|c_i)$ and $P(a_1{=}\gamma|c_i)$. In the multinomial model, the probability of the meta-value x obeys $P(a_1{=}x \mid c_i) = P(a_1{=}\alpha \mid c_i) + P(a_1{=}\beta \mid c_i)$. As a result of this value clustering, all occurrences of $\alpha$ or $\beta$ are replaced by x. Of course, the estimation of parameters is classifier specific, and can take a different form if another data model is assumed.
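The effect of merging the two values can be seen in a few lines of code. The sketch below only illustrates the counting argument above (the attribute values and their counts are made up); it estimates the multinomial parameters before and after clustering and checks that the meta-value's probability is the sum of its members' probabilities.

```python
from collections import Counter

# Made-up training values of attribute a1 observed within one class c_i.
values_in_class = ["alpha", "beta", "alpha", "gamma", "beta", "alpha", "gamma", "alpha"]

def multinomial_mle(values):
    """Maximum likelihood estimates P(a1 = v | c_i) = count(v) / N."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}

before = multinomial_mle(values_in_class)

# Merge alpha and beta into the artificial meta-value x.
merged = ["x" if v in ("alpha", "beta") else v for v in values_in_class]
after = multinomial_mle(merged)

print("before clustering:", before)
print("after clustering: ", after)
assert abs(after["x"] - (before["alpha"] + before["beta"])) < 1e-12
```

With fewer parameters per attribute, each estimate is supported by more training instances, which is precisely the motivation for value clustering when samples are small.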
The information bottleneck method maximizes mutual information I(X;Y) between X and Y, conditioned on the information content 10? ;X ) resulting from the clustering X with respect to X. Most significantly, the optimal solution of this information theoretic problem can be found in terms of probability 27 distributions p(i I x), p(y | 3c"), and p(f) with an arbitraryjoint distribution p(x, y). The exact solution is given by a set of nonlinear equations [126]. An iterative solution requires the user-defined cardinality of X . Unknown variables are usually initialized to random values that potentially may lead to sub-optimal local solutions. The resulting clustering is not “hard”, where each value of X belongs to a single cluster in X” , but rather “soft” with membership probabilities p(Sr' | x). However, hard clustering can be achieved in the agglomerative information bottleneck method by Slonim and Tishby [119]. The agglomerative information bottleneck method is a hierarchical process that uses distance measures between distributions and greedy search by merging clusters pair— wise. The multivariate information bottleneck method [45][120] further extends the approach for multivariate distributions, maximizing mutual information between Y and several “bottlenecked” variables simultaneously. Recently, the value abstraction approach (clustering in our terms) led to considerable progress in linkage analysis for genetic mapping [98], and was expanded for more general likelihood computations and faster Bayesian network inference in [44]. All these works can be characterized as filter-type feature extraction, since classifier accuracy or induction step feedback is not used for value clustering. The other relevant body of literature concerns the use of genetic algorithms for Clustering problems, and specifically for combinatorial optimization involving clustering. Combinatorial grouping problems are rarely supplied with a definition of distance between items. Advanced GA designs utilize graph-based solution encodings of the clustering problems. Originally, a graph-based encoding was introduced by Park and 28 Song [102]. The main motivation Of their study was to re—solve the problems of traditional grouping encodings that use group numbers or permutations with separators. In the simplest case a chromosome contains a group label (number) for each Object. Such a group-number encoding suffers from the incompatibility of labels between two chromosomes that ultimately participatie in genetic recombination. In fact, by using group numbers we are forced to solve an intractable correspondence problem between different labelings each time a crossover operator is applied. In group-number representation typical crossover cannot easily combine Objects’ memberships from parents to offspring. In addition, the number of groups must be known in advance. Moreover, this encoding contains redundancy because it does not take into account the symmetry of group label permutations. The GA community currently rejects such a representation in its raw form [32]. However, some heuristics can certainly help in adopting group—number representation [83][19], but their complexity leads to major computation overhead to ensure meaningful inheritance. Another popular encoding uses a permutation representation [32]. A chromosome is created by an arbitrary permutation of the original Objects. There is no cluster information in the chromosome itself. 
Instead, the responsibility of constructing a clustering from the chromosome lies on a special decoder function, which is problem specific. The decoder function is guided by the particular Optimization criterion for the given task. The drawback of the permutation representation is basically the same as that Of the group—number encoding: classical recombination of two chromosomes cannot combine two clusterings by transferring cluster content, Since clustering of the offspring 29 is exposed only by a decoder function. Membership of each individual item acquires its meaning only in the context of labeling Of other items. More effective design of a grouping genetic algorithm (GGA) is due to Falkenauer [31][32]; it operates with a permutation encoding with separators between groups. GGA requires sophisticated repairs during the crossover to ensure valid Offspring representation. In GGA, the crossover operator must be domain dependent in order to enable effective propagation Of good building blocks (clusters) to the next generation. 3.3 Cluster Representation in Genetic Algorithm We propose a novel genetic algorithm design for general clustering, and value clustering ' in particular. Value clustering can be regarded as a grouping problem, where a number of objects (items) must be placed in several groups. Examples of grouping problems include bin packing, graph coloring and line balancing, all of which are NP-hard. The search space size for partitioning N objects into exactly M groups grows exponentially with N. Stirling numbers of the second kind S(N, M) give the number of ways to partition N ‘ objects into M groups: 1 M M—i M -N S(N,M)=—M—!;(—1) (i); If the number of clusters is not known then M can assume any value between 1 and N. Thus we get the total number of distinct clusterings, also known as the Bell number B(N): B(N) = ismM) 30 For example, for N=1...10 we obtain Bell numbers B(N) = {1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975}. The major difficulty encountered by typical GA approaches is in designing a representation of a grouping as a chromosome. A good encoding must ensure the complete coverage Of the space Of possible partitions with minimal redundancy. This encoding should also allow meaningful inheritance Of information from parents to children. Finally, it is important that validity checks and chromosome repair have insignificant computational overhead, if in fact such repair is necessary. Several genetic algorithms are specifically known to solve grouping problems. The most straightforward approach uses a standard GA with group-numbers, one gene per object, where each gene contains a group-number for the object. Unfortunately, standard n-point crossover fails to transmit grouping information between chromosomes, since group—numbers do not carry meaning by. themselves and could be arbitrarily permuted for any given grouping. This representation also requires knowing the correct number of clusters in advance. Similar difficulties are encountered with permutation encodings. In contrast, the GGA algorithm [31] guarantees meaningful inheritance using permutation with separators and a sophisticated crossover operator with repairs. 31 Regular graph-based encoding: chromosome with N genes Compact graph-based encoding: chromosome with N-1 genes [1[2[3[...]N] [112]3|...|N-1| r N N N N N N N N N N-1 N-1 N-1 N-1 N-1 N-1 N-1 N—1 N-1 possible values 1 ateach gene 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 l 1 1 1 1 1 1 Figure 3.1. 
The major difficulty encountered by typical GA approaches is in designing a representation of a grouping as a chromosome. A good encoding must ensure complete coverage of the space of possible partitions with minimal redundancy. The encoding should also allow meaningful inheritance of information from parents to children. Finally, it is important that validity checks and chromosome repair have insignificant computational overhead, if in fact such repair is necessary. Several genetic algorithms are specifically designed to solve grouping problems. The most straightforward approach uses a standard GA with group numbers, one gene per object, where each gene contains a group number for the object. Unfortunately, standard n-point crossover fails to transmit grouping information between chromosomes, since group numbers do not carry meaning by themselves and could be arbitrarily permuted for any given grouping. This representation also requires knowing the correct number of clusters in advance. Similar difficulties are encountered with permutation encodings. In contrast, the GGA algorithm [31] guarantees meaningful inheritance using a permutation with separators and a sophisticated crossover operator with repairs.

Figure 3.1. Regular graph-based clustering GA selects N genes from a total of N^2 edges, while the compact encoding needs only N-1 genes drawn from N(N-1)/2 edges. The GA search space size is reduced from N^N to N!.

Graph-based encoding is a different method for solving clustering problems [102]. If there are N items to cluster, then a solution is represented by a string with N genes, one gene per item. Given that the original items are numbered from 1 to N, each gene assumes a value between 1 and N. Thus if the i-th gene contains value j, then the items with numbers i and j belong to the same cluster and can be connected by a link in a graph of all items. In this representation an arbitrary partitioning can be reached without pre-defining the correct number of clusters. No special crossover is required for this representation, since even standard crossover both transfers the item's link and preserves information about the partitioning. However, the redundancy of the encoding is extremely large, since there are N^N possible individuals. Many points in the GA search space correspond to the same cluster partitioning, because a cluster can be formed by any connected graph containing its items. The redundancy of the regular graph-based encoding is given by the ratio of the GA search space size to the Bell number B(N):

$\rho(N) = \frac{N^N}{B(N)}$

Figure 3.2. Ratio of GA search space size to the true number of distinct partitions of objects, on a logarithmic scale.

For example, for 20 objects we find $\rho(20) > 2 \cdot 10^{12}$. Park and Song have proposed a remedy to this redundancy problem by limiting the number of possible links based on domain knowledge [102]. For example, if the distance between objects can be defined, then one may restrict the possible links for an object to a set of its nearest neighbors. Unfortunately, feature value clustering, as well as the combinatorial grouping problem in general, does not allow the introduction of meaningful metrics or definitions of neighborhood for nominal values.

Figure 3.3. Sample chromosomes and their corresponding clusterings, shown as connected subgraphs, for the regular and compact encodings (regular representation chromosome: [4,3,1,2,5,7,8,6]; compact representation chromosome: [2,3,4,4,5,8,8]).
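Decoding such a chromosome into a partition amounts to finding the connected components of the link graph, which a small union-find routine makes explicit. The sketch below is illustrative code, not the thesis implementation; it decodes both sample chromosomes of Figure 3.3, using the convention described below that a compact gene whose value equals its own index places no edge.

```python
def find(parent, i):
    while parent[i] != i:
        parent[i] = parent[parent[i]]   # path halving
        i = parent[i]
    return i

def union(parent, i, j):
    parent[find(parent, i)] = find(parent, j)

def decode(edges, n):
    """Group items 1..n into clusters given by the connected components of `edges`."""
    parent = {i: i for i in range(1, n + 1)}
    for i, j in edges:
        union(parent, i, j)
    clusters = {}
    for i in range(1, n + 1):
        clusters.setdefault(find(parent, i), []).append(i)
    return sorted(clusters.values())

n = 8
regular = [4, 3, 1, 2, 5, 7, 8, 6]   # gene i links item i to item regular[i-1]
compact = [2, 3, 4, 4, 5, 8, 8]      # gene i links i to compact[i-1]; value == i means no edge

regular_edges = [(i, v) for i, v in enumerate(regular, start=1)]
compact_edges = [(i, v) for i, v in enumerate(compact, start=1) if v != i]

print(decode(regular_edges, n))   # [[1, 2, 3, 4], [5], [6, 7, 8]]
print(decode(compact_edges, n))   # [[1, 2, 3, 4], [5], [6, 7, 8]]
```

Both encodings yield the same partition {1,2,3,4}, {5}, {6,7,8}, which is the point of Figure 3.3.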
Here we present an improved compact graph-based encoding, whose redundancy is reduced exponentially with respect to the number of objects N, as compared to the original idea discussed in [102]. The proposed representation keeps all the good qualities of the raw graph-based encoding without sacrificing its representational power, as we strictly prove below. First, we note that the regular graph-based encoding picks N links out of a total of N^2 available, because the chromosome has N genes with N possible values per gene. However, even a complete graph on N vertices has only N(N-1)/2 edges. This results in a major combinatorial explosion in the number of ways to encode the same cluster. Furthermore, to represent any partition of items on a graph we need at most N-1 edges. Thus an arbitrary tree connecting all the objects within a cluster would be sufficient.

Figure 3.4. Ratio of GA search space sizes in the non-compact and compact encodings.

Indeed, the largest possible cluster contains N items and can be represented by a tree with N-1 edges. We propose an enhanced encoding that draws gene values only from the N(N-1)/2 edges and needs N-1 genes. The main idea is to remove duplicated edges from the pool of possible gene values. Retaining, for the i-th gene, only the values from i to N does this exactly. The first gene assumes values in {1,...,N}, the second gene in {2,...,N}, and so on. The N-th gene is unnecessary, since it is always set to N, and can be omitted. If the gene value happens to be equal to the gene's number, it means that no edge is placed on the graph. We cannot use fewer than the N(N-1)/2 edges of a complete graph, because a cluster with only two objects must be obtained by the unique link between them. Figure 3.1 compares the regular and compact graph-based encodings. Figure 3.3 shows graph representations of the sample chromosomes.

In the compact encoding, each gene obtains its values from an alphabet of a different cardinality. This raises the question of how the required clusters are formed. It is easy to prove that any possible clustering is still attainable in the new encoding. For this purpose, consider a cluster containing objects with arbitrary numbers. One can sort the objects in a cluster by their number. Any such cluster can be formed by connecting each object with the next higher-numbered object from the sorted list, because the required edges are always available: any object has access to the edges incident to any object with a greater number. Therefore, the proposed encoding has the same representational power as the original encoding but an exponentially smaller search space, as we can approximately estimate from

$\delta = \frac{N^N}{N!} \approx \frac{e^N}{\sqrt{2\pi N}}$,

where $\delta$ is the ratio of the sizes of the GA search space in the old and new encodings. For example, for N = 20 we find that $\delta > 43{,}000{,}000$, a dramatic reduction!

The compact encoding has one problem, namely that different genes carry different amounts of information because of the different cardinalities. This can be an issue for standard crossover (e.g., 1- or 2-point), since lower-numbered positions on the chromosome are more important than higher-numbered positions. As an alternative, we offer a modification of the compact encoding that equalizes the cardinality of the alphabets of different genes. The genes with the most available edges can transfer some of their edges to other genes, so we still have the very same pool of N(N-1)/2 edges at our disposal. For example, we can delete the value (N-1) from the alphabet of the first gene and insert the value 1 into the alphabet of the (N-1)-th gene, leaving us with the same edges. Equalization creates an alphabet of cardinality |A| = (1 + N/2) for every gene. Equalization can be performed exactly for even N, with minor deviations at some positions when N is odd. In the equalized compact representation, the search space is still exponentially smaller than in the regular graph-based encoding; the reduction is

$\frac{N^N}{(1 + N/2)^{N-1}} = \left(1 + \frac{N}{2}\right)\left(\frac{2}{1 + 2/N}\right)^{N}$
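The magnitude of these reductions is easy to confirm numerically; the snippet below is only an illustrative check that evaluates the three search-space sizes for N = 20 and reproduces the ratios quoted above.

```python
from math import factorial

N = 20
regular = N ** N                      # regular graph-based encoding: N genes, N values each
compact = factorial(N)                # compact encoding: gene i takes one of N - i + 1 values
equalized = (1 + N // 2) ** (N - 1)   # equalized compact encoding: alphabet size 1 + N/2

print(f"regular / compact   = {regular / compact:.3e}")    # ~4.3e+07, i.e. delta > 43,000,000
print(f"regular / equalized = {regular / equalized:.3e}")  # ~1.7e+06
```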
Let us prove that it is possible to realize any clustering in the equalized compact representation. We will show that a cluster with an arbitrary set of objects can be created. The alphabets of the genes corresponding to the chosen objects contain all possible links between these objects, because equalization preserves the edges by simple transfer. Now an auxiliary construction is necessary: imagine a complete graph on these objects. In this complete graph we assign a direction to the edge from object I to object J if the alphabet of gene I contains the value J; otherwise the edge goes in the opposite direction. The resulting directed graph is called a tournament, since it is complete and directed. Graph theory provides us with an important fact: every tournament has a Hamiltonian path [17]. A Hamiltonian path connects all the objects following the arcs and visits each and every object exactly once. Therefore, the arcs of such a directed path are equivalent to a valid assignment of edges to genes. Since arbitrary objects can be connected, we can obtain any partitioning. Figure 3.5 shows an example comparing the compact and equalized compact graph-based encodings for a set of 6 objects.

Figure 3.5. An example of the encoding alphabets at each chromosome position in the compact and equalized compact graph-based representations of a set of 6 items.

For most of the experiments we used the equalized compact encoding and 2-point crossover. The next section details several application domains for value clustering.

Figure 3.6. Reduction of search space size in the equalized compact encoding as compared to the regular graph-based encoding.

3.4 Application Domains for Value Clustering

Text categorization

Text categorization attempts to automatically place documents into one of a set of predefined classes. Training samples for the experiments were taken from the Reuters-21578 document collection. We use the classical "bag of words" data model with a naive Bayes classifier. The number of features equals the number of words in a document, and each feature can assume any value from the observed dictionary of size N. The dictionary is created from all the training data. The classifier is induced by estimating the probabilities of words in the entire collection of training documents. If a previously unseen word appears during the inference phase, its probability is set to a small value such as 0.1/N. Note that the number of features changes from document to document, while the number of feature values is constant for a fixed training set. Our goal is to cluster the values within features. The probability distribution of each feature is the same due to the independence assumption used in naive Bayes; therefore, the clustering of feature values is the same for all features, and each GA chromosome represents a clustering of the words in the dictionary.

The full dictionary of the Reuters data set contains tens of thousands of words (values to be clustered). It is commonly acknowledged that preprocessing is necessary to reduce computational costs to a reasonable level. We perform preprocessing by retaining the features with the highest mutual information with the category label; mutual information is assessed independently for each feature. The original baseline documents from each class are split between training and holdout sets (for classifier fitness evaluation), as well as a final test set used to estimate the true accuracy of the best solution found. After the search phase, the best classifier is induced from the combination of the training and holdout sets before the final testing.
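To make the fitness evaluation concrete, here is a hedged sketch (function and variable names are hypothetical, not the dissertation's code) of how a clustering of dictionary words into meta-values enters the multinomial naive Bayes estimates: counts of words mapped to the same meta-value are simply pooled before probabilities are formed.

```python
# Sketch: class log-likelihood of a document under naive Bayes with clustered word values.
import math
from collections import Counter

def class_log_likelihood(doc_words, class_counts, clustering, total_words):
    """class_counts: Counter of word frequencies in one class's training documents;
    clustering: dict word -> meta-value id (a decoded GA chromosome);
    total_words: total word count observed in that class."""
    meta_counts = Counter()
    for word, count in class_counts.items():
        meta_counts[clustering[word]] += count          # pooling is the whole effect of clustering
    unseen = 0.1 / len(clustering)                      # small probability for unseen words, as in the text
    score = 0.0
    for word in doc_words:
        if word in clustering:
            p = meta_counts[clustering[word]] / total_words
            score += math.log(p if p > 0 else unseen)
        else:
            score += math.log(unseen)
    return score
```

The classifier then picks the class with the larger log-likelihood plus log prior, and the GA fitness of a chromosome is the hold-out accuracy of the classifier built from the clustering it encodes.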
We compare three methods: the naive non-clustered approach, the proposed clustering GA, and pure best-first hill climbing with the same number of fitness evaluations.

DNA fingerprinting

Assignment of individuals to their putative populations of origin is a practical problem in population genetics. In the classical framework (see, e.g., Waser and Strobeck [139]), to assign an individual it is necessary to compute likelihood functions for each source population. The population with the highest likelihood value is deemed to be the most probable population of origin. Genetic markers, taken from DNA samples, serve as features, and their values (allelic configurations) are assumed to be multinomially distributed [53]. Classification is then performed using a naive Bayes approach, assuming independence of features. The large number of possible alleles (e.g., 10-20) per feature seriously complicates the assignment in many practical situations, because the reliability of the estimated parameters is low: a typical baseline sample consists of only 50-100 individuals. Value clustering is quite promising for binning the alleles together to improve the assignments.

Bayesian network classifiers

A Bayesian network, also known as a belief network, is a directed acyclic graph such that:

(1) A set of random variables makes up the nodes of the network.
(2) Each directed link in the network represents direct influence: an arrow from node A to node B means that A has a direct influence on B.
(3) Each node has a conditional probability table that quantifies the effects that its parents have on the node.

In theory, a belief network is very powerful because it can represent any joint probability distribution of its node variables. However, in practice a probabilistic model matching the data set cannot always be found. Belief networks can perform inference about any probabilistic quantity of interest, because theoretically everything can be obtained from the joint distribution; in practice, however, polynomial algorithms may not be available when some variables are not instantiated in the network. Good approximate inference procedures do exist. Our primary interest is in solving classification problems for discrete data analysis; discrete data is very common in genetics. We prefer belief networks because:

1. They can model joint probabilities with arbitrary accuracy, taking dependencies into account.
2. Unlike neural networks, their representation of the interactions between variables can be interpreted and understood.
3. Unlike decision trees or decision rules, they are not very sensitive to noise.

It is true that finding a general exact belief network for a particular data set is a task of exponential complexity. Therefore, some simplifications are necessary to limit the space of possible structures. For example, two types of inductive bias can be considered to simplify learning: the naive Bayesian classifier (trivial) and the tree-augmented naive (TAN) Bayesian network. The TAN structure has the following defining properties:

- The class variable (node) has no parents, and each attribute (non-class node) has only one or two parents, one of which must be the class node.
- Edges representing relations between attributes are included.
- In TAN, the graph structure among the attribute nodes is a tree.

Figure 3.7. TAN structure with four attributes. Augmented edges are shown among the attribute nodes.
For the TAN of Figure 3.7, the posterior of class c_i factorizes along the attribute tree:

$$P(c_i \mid a_1,\ldots,a_4) = \frac{P(a_1,\ldots,a_4,c_i)}{P(a_1,\ldots,a_4)} \;\propto\; P(a_1 \mid c_i)\,P(a_2 \mid a_1,c_i)\,P(a_3 \mid a_2,c_i)\,P(a_4 \mid a_3,c_i).$$

The TAN structure is preferred because it exploits dependencies between attributes, which usually provides better accuracy than a naive Bayesian classifier. Moreover, this structure can be learned in polynomial time: structural learning can be done in O(K²) time using the mutual information algorithm [18], where K is the number of nodes (variables) in the Bayesian network.

Figure 3.8. Conditional probability table P(X, Y | C = c) before value clustering and after merging values {3, 4} of variable X and values {2, 3} of variable Y. The merged table is:

            X = 1        X = 2        X = 3+4
Y = 1       p11          p12          p13+p14
Y = 2+3     p21+p31      p22+p32      p23+p24+p33+p34
Y = 4       p41          p42          p43+p44

Our goal is to use value clustering for learning the conditional probability distributions P(a_i | a_j, c_k). Value clustering is viewed as merging various cells in the conditional probability tables. Again, this reduces the number of estimated model parameters, and the parameters themselves become more reliable due to the increase in effective sample size. Encoding the TAN node tables in a GA is possible in a number of ways, such as the graph-based representation. Value clustering in conditional probability tables can be applied to rows, columns, or both simultaneously. For example, if we have P(X, Y | C = c) for variables X and Y, each assuming 4 possible values, then initially we need to estimate 16 entries in the table P(X, Y | C = c) for a fixed class. Figure 3.8 shows the result of clustering the values {3, 4} of variable X and the values {2, 3} of variable Y.

3.5 Closed-form analysis of value clustering

In several simple data models we can obtain exact expressions for the effect of value clustering on classifier accuracy. One such data model in feature space is a one-dimensional multinomial variable. In this case one can calculate the change in the classification error rate after clustering two values, if the Bayes classifier parameters are estimated by the maximum-likelihood procedure. Let X be a random variable that assumes one of M possible values {1, 2, ..., M} with probabilities {p1, p2, ..., pM} in class 1 and with probabilities {q1, q2, ..., qM} in class 2. This is a two-class model, for which we have p1 + p2 + ... + pM = 1 and q1 + q2 + ... + qM = 1. Suppose we observe a data sample of size N, where data points are drawn from the two source classes with equal prior probabilities π1 = π2. Each data point is sampled according to the above class-conditional probabilities: Prob(X = i | 1) = p_i, Prob(X = i | 2) = q_i. Our first step is to find the optimal decision rule, given the observed finite sample, and estimate its error rate E_old. Next we merge two of the values (characters) from the alphabet of variable X and estimate the error rate E_new of the resulting classifier. Finally, we calculate the change in error rate ΔE = E_old − E_new. Obviously, we are ultimately interested in the question of which distributions of the probability values yield a positive value ΔE > 0.

The gain (or loss) in accuracy ΔE is a function of many factors: the observed data sample (x1, x2, ..., xN), the class-conditional probabilities {p1, p2, ..., pM} and {q1, q2, ..., qM}, and the sample size N. However, if we merge the characters 1 and 2, then it can be proved that the value ΔE depends only on the values p1, p2, q1, q2.
To show this, we note that for any given data sample D = (x1, x2, ..., xN) the error e_D of the simple Bayes classifier is an additive function of the parameters of the distribution:

$$e_D = f_1(p_1) + f_2(p_2) + \cdots + f_M(p_M) + g_1(q_1) + g_2(q_2) + \cdots + g_M(q_M),$$

where {f_i} and {g_i} are certain functions whose exact values are determined by the actual sample drawn. Clearly, when the difference is taken between the classification errors before and after merging states 1 and 2 of X, we obtain

$$e_D(\text{old}) - e_D(\text{new}) = f_1(p_1) + f_2(p_2) + g_1(q_1) + g_2(q_2) - f'_1(p_1 + p_2) - g'_1(q_1 + q_2).$$

Hence, for a given sample the number of relevant parameters is reduced to four: {p1, p2} and {q1, q2}. At this point we can evaluate the difference of error rates ΔE. Recall that the error rate of a classifier averaged over the possible training samples is the expected value of the error on a single data sample, taken over the probability distribution of the data:

$$E[e_D] = \int e_D(x_1, x_2, \ldots, x_N)\, p(x_1, x_2, \ldots, x_N)\, dx_1\, dx_2 \cdots dx_N.$$

For multinomial samples the distribution is discrete and the integral must be replaced by the corresponding sums. For simplicity we consider training samples with an equal number of points N from each class; in the two-class situation this means the total sample size is 2N = N + N. The probability of any sample D = (x1, x2, ..., x2N) is then given by the multinomial

$$p(x_1, x_2, \ldots, x_{2N}) = \frac{N!}{n_1!\, n_2! \cdots n_M!}\, p_1^{n_1} p_2^{n_2} \cdots p_M^{n_M}\; \frac{N!}{m_1!\, m_2! \cdots m_M!}\, q_1^{m_1} q_2^{m_2} \cdots q_M^{m_M},$$

where n_1 + n_2 + ... + n_M = N and m_1 + m_2 + ... + m_M = N. Here n_i and m_i denote the number of observations of the value X = i in class 1 and class 2, respectively. The change in error rate ΔE = E_old − E_new can then be written (after some manipulation) as follows, where k_1, k_2 are the counts of values 1 and 2 in the class-1 sample and m_1, m_2 are the corresponding counts in the class-2 sample (terms with k_1 + k_2 > N or m_1 + m_2 > N are absent):

$$\begin{aligned}
\Delta E = \frac{1}{2}\sum_{m_2=0}^{N}\sum_{m_1=0}^{N}\sum_{k_2=0}^{N}\sum_{k_1=0}^{N}
\Bigg(\; & q_1\theta(k_1-m_1-1) + p_1\theta(m_1-k_1-1) + \frac{p_1+q_1}{2}\,\delta(m_1-k_1) \\
& + q_2\theta(k_2-m_2-1) + p_2\theta(m_2-k_2-1) + \frac{p_2+q_2}{2}\,\delta(m_2-k_2) \\
& - (q_1+q_2)\,\theta(k_1+k_2-m_1-m_2-1) - (p_1+p_2)\,\theta(m_1+m_2-k_1-k_2-1) \\
& - \frac{p_1+p_2+q_1+q_2}{2}\,\delta(k_1+k_2-m_1-m_2) \Bigg) \\
& \times \frac{N!}{k_1!\,k_2!\,(N-k_1-k_2)!}\, p_1^{k_1} p_2^{k_2} (1-p_1-p_2)^{N-k_1-k_2} \\
& \times \frac{N!}{m_1!\,m_2!\,(N-m_1-m_2)!}\, q_1^{m_1} q_2^{m_2} (1-q_1-q_2)^{N-m_1-m_2},
\end{aligned}$$

where δ(z) = 1 if z = 0 and δ(z) = 0 otherwise, and θ(z) = 1 if z ≥ 0 and θ(z) = 0 otherwise.

This expression for ΔE can be evaluated in closed form for particular values of N. Even for fixed N, the calculation of such combinatorial sums is quite laborious. For this purpose I used the specialized computer algebra system FORM by Jos Vermaseren [138], initially developed for similar expressions arising in high-energy physics calculations. Certainly, the asymptotic regime of large N is less interesting than small sample sizes, because a large sample usually provides reliable estimates of the classifier parameters, leaving less room for improvement due to value clustering.

We computed the change in error rate ΔE for total sample sizes (2N) of 10, 20, 30, and 50, for several different relations between the model parameters {p1, p2} and {q1, q2}. In order to represent ΔE in tractable form we considered the following four special cases:

{p1, p2 = p1}, {q1, q2 = q1}
{p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}
{p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}
{p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}

These assumptions cover many plausible situations and reflect a gradual increase in the discriminative power provided by the values/states {1, 2} selected for clustering from the alphabet of variable X.
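As a numeric check (hypothetical code added here; the dissertation itself used the FORM computer algebra system for the closed-form results), the expression for ΔE above can be evaluated by direct summation for any chosen parameters; the example call below picks one point of the first special case with 2N = 10.

```python
# Sketch: direct numerical evaluation of the Delta E sum for given N, p1, p2, q1, q2.
from math import comb

def theta(z):                      # step function: 1 if z >= 0, else 0
    return 1 if z >= 0 else 0

def delta_e(N, p1, p2, q1, q2):
    total = 0.0
    for k1 in range(N + 1):
        for k2 in range(N + 1 - k1):
            pk = comb(N, k1) * comb(N - k1, k2) * p1**k1 * p2**k2 * (1 - p1 - p2)**(N - k1 - k2)
            for m1 in range(N + 1):
                for m2 in range(N + 1 - m1):
                    qm = comb(N, m1) * comb(N - m1, m2) * q1**m1 * q2**m2 * (1 - q1 - q2)**(N - m1 - m2)
                    old = (q1 * theta(k1 - m1 - 1) + p1 * theta(m1 - k1 - 1) + (p1 + q1) / 2 * (k1 == m1)
                           + q2 * theta(k2 - m2 - 1) + p2 * theta(m2 - k2 - 1) + (p2 + q2) / 2 * (k2 == m2))
                    new = ((q1 + q2) * theta(k1 + k2 - m1 - m2 - 1)
                           + (p1 + p2) * theta(m1 + m2 - k1 - k2 - 1)
                           + (p1 + p2 + q1 + q2) / 2 * (k1 + k2 == m1 + m2))
                    total += (old - new) * pk * qm
    return total / 2

print(delta_e(5, 0.2, 0.2, 0.1, 0.1))   # expected gain for p1 = p2 = 0.2, q1 = q2 = 0.1, 2N = 10
```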
For instance, the configuration {p1, p2 = p1}, {q1, q2 = q1} corresponds to the case in which having distinct values {1, 2} generally does not help classification. In contrast, when {p1, p2 = 1/2 p1} and {q1, q2 = 2 q1}, one can expect a considerable contribution to accuracy from both (unmerged) values {1, 2}, provided that the maximum-likelihood estimates of the distribution parameters {p1, p2}, {q1, q2} are known with good precision. To get an idea of the resulting gains/losses, consider the following expression obtained for {p1, p2 = p1}, {q1, q2 = q1} and sample size 10 (N = 5):

$$\begin{aligned}
\Delta E = E_{old} - E_{new} = \tfrac{1}{2}\big(\, & 5 - 30 p_1 - 30 q_1 + 70 p_1^2 + 70 q_1^2 + 420 p_1 q_1 - 75 p_1^3 - 75 q_1^3 \\
& - 1575 p_1^2 q_1 - 1575 p_1 q_1^2 + 31 p_1^4 + 31 q_1^4 + 2356 p_1^3 q_1 + 2356 p_1 q_1^3 \\
& + 8556 p_1^2 q_1^2 - 1260 p_1^4 q_1 - 1260 p_1 q_1^4 - 17010 p_1^3 q_1^2 - 17010 p_1^2 q_1^3 \\
& + 11430 p_1^4 q_1^2 + 11430 p_1^2 q_1^4 + 43180 p_1^3 q_1^3 - 35700 p_1^4 q_1^3 - 35700 p_1^3 q_1^4 \\
& + 35770 p_1^4 q_1^4 \,\big)\,(p_1 - q_1)^2.
\end{aligned}$$

The number of terms in similar results for ΔE grows as O(N⁴), which makes them difficult to interpret and analyze. Instead, we visualize the behavior of the gains/losses ΔE in expected error rate in the plots shown in Figures 3.9-3.14.

Figure 3.9. Accuracy gain ΔE due to merging states {1,2} for sample size 2N = 10: a) {p1, p2 = p1}, {q1, q2 = q1}; b) {p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}.

Figure 3.10. Accuracy gain ΔE due to merging states {1,2} for sample size 2N = 20: a) {p1, p2 = p1}, {q1, q2 = q1}; b) {p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}.

Figure 3.11. Accuracy gain ΔE due to merging states {1,2} for sample size 2N = 30: a) {p1, p2 = p1}, {q1, q2 = q1}; b) {p1, p2 = 4/5 p1}, {q1, q2 = 5/4 q1}.

Figure 3.12. Accuracy gain ΔE due to merging states {1,2} for sample size 2N = 10: a) {p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}; b) {p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}.

Figure 3.13. Accuracy gain ΔE due to merging states {1,2} for sample size 2N = 20: a) {p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}; b) {p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}.

Figure 3.14. Accuracy gain ΔE due to merging states {1,2} for sample size 2N = 30: a) {p1, p2 = 2/3 p1}, {q1, q2 = 3/2 q1}; b) {p1, p2 = 1/2 p1}, {q1, q2 = 2 q1}.

Several important conclusions follow from this analysis. First, we can generally confirm an improvement in classification accuracy due to value clustering. For example, on average over possible training samples of size 10-20 we can expect an improvement of up to 3-4% for certain data models. Note that this is the improvement due to merging a single pair of states in a one-dimensional feature space; potentially, one can gain more by merging multiple pairs of states. Second, the actual expected improvement decreases with large sample sizes. This confirms our intuition that the positive effect of value clustering is mainly due to improved estimates for the collapsed meta-states of the variables.
Third, most accuracy gains are achieved when we merge those states of the variable X that contribute relatively little to the discrimination between classes. Hence, the main challenge of the developed genetic algorithm is to implicitly identify such values and merge them into meta-values. In the next section we consider the empirical results from the application domains discussed earlier.

3.6 Experimental Study

The main goal of the empirical study is to compare the performance of selected classifiers with and without GA value clustering. Even though an overall accuracy improvement is very valuable, we are also interested in the reduction of measurement complexity.

3.6.1 Genetic Algorithm Details

The driver GA program included the following major steps:

1. Initialization. The population is initialized with random individuals using the equalized compact graph-based encoding described in section 3.3. Each individual is created by selecting its genes from the alphabet of the respective chromosome position. For the i-th position, the probability of selecting the value i is set to (N−1)/N, to prevent the formation of a single big cluster covering all the objects. If the value i is not generated, the gene is initialized to one of the remaining available values (edges) for this position with uniform probability. Prevention of excessive clustering is motivated by a result from graph theory on the critical probability of connectedness of random graphs [13]. Finally, we always seeded the chromosome [1, 2, ..., N], corresponding to N non-merged objects (feature values).

2. Fitness evaluation of each individual in the population. Chromosomes are easily decoded into the actual cluster partitioning of feature values as described above. Model parameters are estimated according to the extracted clustering. Classifier accuracy is evaluated using hold-out samples or 10-fold cross-validation.

3. Termination criterion check. The number of generations was the measure of computational effort; the search was stopped after about 200 generations in each run.

4. Tournament selection of parents (tournament size = 2). Pairs of individuals are selected at random with replacement, and their number equals the population size. The better of the two individuals becomes a parent at the next step.

5. Crossover and reproduction. Standard 2-point crossover is used. Each pair of parents produces two offspring. Mutation with small probability pm = 0.01 is applied to each position. In addition, elitism was always used, and the best individual of the population survives unchanged.

6. Continue to step 2.
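Putting steps 1-6 together, the following compact sketch outlines the driver loop (hypothetical code; the operator functions are passed in and stand for the components described above, not for the dissertation's actual routines, and the mutation operator is assumed to apply the per-position rate pm internally).

```python
# Sketch of the driver GA loop described in steps 1-6 above.
import random

def run_ga(random_individual, fitness_fn, crossover, mutate, pop_size=100, generations=200):
    population = [random_individual() for _ in range(pop_size)]        # step 1
    for _ in range(generations):                                       # step 3: fixed budget
        scores = [fitness_fn(ind) for ind in population]               # step 2
        elite = population[scores.index(max(scores))]                  # elitism

        def tournament():                                              # step 4: size-2 tournament
            a, b = random.randrange(pop_size), random.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        next_pop = [elite]
        while len(next_pop) < pop_size:                                # step 5: 2-point crossover + mutation
            child1, child2 = crossover(tournament(), tournament())
            next_pop += [mutate(child1), mutate(child2)]
        population = next_pop[:pop_size]                               # step 6: loop back to step 2
    return max(population, key=fitness_fn)
```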
3.6.2 Experimental Results

Text categorization. The full dictionary of the Reuters data set contains tens of thousands of words (values to be clustered). It is commonly acknowledged that preprocessing is necessary to reduce computational costs to a reasonable level. We performed preprocessing by retaining the features with the highest mutual information with the category label; mutual information was assessed independently for each feature. In two different setups we selected 10 and 50 words respectively, corresponding to chromosome lengths of 9 and 49 genes in the equalized compact encoding.

The first set of experiments was designed to distinguish between Reuters articles on "acquisitions" and "earnings", with 2308 and 3785 non-overlapping documents in the two categories respectively. The known expected accuracy is extremely high, so we made the task more challenging by considering only the first two lines of the body of every document. 1500 documents from each class were split between training and holdout sets (for classifier fitness evaluation); the rest of the documents were used to estimate the true accuracy of the best solution found. After the search phase, the best classifier was induced from the combination of the training and holdout sets before the final testing. Three methods were compared: the naive non-clustered approach, our proposed clustering GA, and pure best-first hill climbing with the same number of fitness evaluations. We performed 100 runs of each algorithm with different sizes of the split between training and holdout sets. Our GA ran for 200 generations with a population of size 100. The estimated true accuracy and the percentage of runs with improvement are reported in Table 3.1; the last column shows the percentage of runs in which the GA found a solution better than the naive one. Each row is the average over 100 runs, with the training and holdout sets sampled at random from the original data in each run.

Table 3.1. Comparison of test accuracy for the best solutions found by GA value clustering and best-first hill climbing against the "naive" non-clustered solution on the "acquisitions" versus "earnings" classification problem. The first two columns are the training and holdout set sizes.

10 words
Tr. set   Holdout   "Naive" model   Hill-climbing   Clustering GA   Best runs, %
400       1100      79.9            79.9            82.3            87
600       900       79.8            80              81.6            81
800       700       79.7            79.9            81.5            78
1000      500       79.9            79.9            81.1            74

50 words
Tr. set   Holdout   "Naive" model   Hill-climbing   Clustering GA   Best runs, %
600       900       88.2            88.7            90.8            73
800       700       88.4            88.8            90.5            60

The results show that the GA typically improves the predictive accuracy by 1-2% and that most runs are successful. Perhaps the main advantage of the GA was in dimensionality reduction: the average number of clusters was 7.9 for the 10-word problem and 38.6 for the 50-word problem.

Experiments with a similar setup were performed to discriminate between the documents in the gold and silver news categories (topics) on one side and gas and nat-gas on the other. To ensure that our baseline sample contained enough documents, we sampled three-line subsets from large documents and considered them as individual texts. The classification accuracy, averaged over 100 independent runs of the GA, is compared with the results of hill-climbing search and the simple Bayes classifier in Table 3.2.

Table 3.2. Comparison of test accuracy for the best solutions found by GA value clustering and best-first hill climbing against the "naive" non-clustered solution on the "silver" and "gold" versus "gas" and "nat-gas" topic classification problem.

10 words
Tr. set size   Holdout size   "Naive" model   Hill-climbing   Clustering GA
200            200            84.0            85.8            86.5
250            200            84.5            85.7            86.6

50 words
Tr. set size   Holdout size   "Naive" model   Hill-climbing   Clustering GA
200            200            89.5            90.0            91.4
250            200            90.0            90.3            91.3

DNA fingerprinting. The experiments were done with multiple data sets, each containing individuals drawn from two populations. The data were produced by a simulation program that generates individuals from the source populations with a user-chosen level of inter-population dispersal, as measured by Wright's F-statistic, in particular the value of F_ST. We used 100 individuals per sampled population as training data. Each individual is given as a set of two alleles per locus (feature), with 8 loci in total. While most of the features had only 2-4 different alleles, there were 10 possible alleles for one of the loci.
We ran the GA value clustering only for the allelic values at this particular locus. The classification step and the GA search are very similar to those used for text categorization. However, the probabilities of individuals are computed differently, because for Mendelian genotypes two alleles are specified at each locus (feature). Namely, the probability of the allelic configuration (i, j) at a locus is estimated as 2·p_i·p_j if alleles i and j are different, or as p_i² if i = j, where p_i is the estimate of the i-th allele probability at that locus. This modification of the regular naive Bayes classifier is trivial and is needed only in the fitness computations for candidate solutions. The fitness was also estimated on a completely independent test set of 100 individuals. As a main result, the value clustering GA was able to improve the (cross-validated) classification accuracy from 69% to 73% for low-F_ST data (considerably overlapping populations) and from 91% to 92% for high-F_ST data (populations with little overlap). In most runs, the best solution was consistently represented by only 5-7 meta-values instead of the 10 original values (alleles).

Bayesian network optimization by value clustering. The goal of value clustering as applied to a Bayesian network is to intelligently modify the conditional probability tables in order to improve classification accuracy. In our experiments we followed these steps.

1. Learning the structure of the Bayesian network classifier. The main purpose of structural learning is to provide a tree of dependencies over the attribute nodes. Class-conditional mutual information (CMI) between attributes serves as the guiding principle for a better structure: only the links with the highest values of CMI are kept. That is, a maximal spanning tree is constructed from the complete graph whose edges are weighted by the CMI computed from the training data. It was shown in [18] that such a tree is optimal among all possible trees over the attribute nodes; the proof shows that the maximum spanning tree maximizes the likelihood function for the given data set.

2. Parametric training from the given training data. This includes estimation of all required conditional probability tables (CPTs) for the nodes present in the fixed structure found. Every CPT is either three-dimensional, P(x|y,c), or two-dimensional, P(x|c). If the only incoming link is from the parent class node, then we have the usual parameter estimation as in naive Bayes, i.e., we estimate assuming a multinomial distribution for the values of x. Learning the CPT entries is straightforward from the data provided, since it is done in linear time with respect to the sample size. During parametric training we compute the observed frequencies of the values of the required attribute conditioned on the combinations of values of the other attributes as well as the class label. This gives us the maximum-likelihood estimates of the parameters (CPT entries) for a fixed network structure.

3. Genetic value clustering. We use the genetic algorithm described earlier to optimize the performance of the BNC. The GA is applied to the CPT tables at the nodes of the network. Clustering of both rows and columns is done for one node at a time; we perform value clustering for at most two nodes, sequentially.

4. Testing of the trained BNC. At this phase we compute the posterior probabilities of each class according to the learned structure and CPTs. The maximum posterior probability decides the class membership of every test instance.
5. Editing of the BNC can be done after training. Links can be deleted from the maximum spanning tree; essentially this means that we can consider an arbitrary forest over the attribute nodes. Usually one or two edges remained after editing.

6. Smoothing control is achieved by combining weighted values of P(x|c) and P(x|y,c). The idea is to mix the actual value of the conditional probability with its prior value. Though this method was first introduced in Bayesian statistics in the late eighties, we rediscovered it for our application independently. The idea is to balance between a full link and the absence of a link in a TAN structure:

$$\hat{P}(x \mid y, c) = \alpha\, P(x \mid c) + (1-\alpha)\, P(x \mid y, c),$$

where y is a parent of x and α is a weight. If α = 1, this is equivalent to a naive Bayesian classifier; if α = 0, it gives a general TAN. In the experiments with smoothing below, the value α = 1/2 is used to gain the benefits of both.

Testing of the Bayesian network classification algorithms with value clustering was done using two types of datasets: 4 datasets from the UC Irvine Machine Learning repository and 3 datasets generated by the population genotype simulator written by the author. All data were discrete, with 3-6 features and 3-4 possible states per feature. Instances were drawn from 2 or 3 classes. Table 3.3 provides the necessary details about the datasets.

Table 3.3. Characteristics of the datasets used in the BNC experiments

Dataset      Samples in training set   Samples in test set   Number of features   Number of classes
Hayes-Roth   132                       28                    3                    3
Monk-1       124                       432                   6                    2
Monk-2       169                       432                   6                    2
Monk-3       122                       432                   6                    2
Moderate     200                       200                   6                    2
Strong       200                       200                   6                    2
Weak         200                       200                   6                    2

The Monk's test problems are often used in machine learning research. The instances of these datasets describe robots by the following nominal features:

Head shape ∈ {round, square, octagon}, Body shape ∈ {round, square, octagon},
Is smiling? ∈ {yes, no}, Holding ∈ {sword, balloon, flag},
Jacket colour ∈ {red, yellow, green, blue}, Has tie? ∈ {yes, no}

Monk-1 target hypothesis: (head shape = body shape) or (jacket colour = red).
Monk-2 target hypothesis: two and only two of the features assume their first listed value.
Monk-3 target hypothesis: (jacket colour = green and holding = sword) or (jacket colour ≠ blue and body shape ≠ octagon).

Another problem taken from the Irvine repository is the Hayes-Roth dataset. The target concept behind this dataset is described as follows:

Class 1: (# of 1's) > (# of 2's) for attributes 1-3
Class 2: (# of 2's) > (# of 1's) for attributes 1-3
Class 3: the value 4 occurs for any attribute

Apparently, decision trees are appropriate for recovering such generation rules due to the similarity of the hypothesis spaces. That is why we additionally ran all the tests through the decision tree program C5.0.

The three genetic data sets were created by a custom program that sets the desired feature distributions (allelic frequencies) as well as the degree of correlation among the features. We created data sets with weak, moderate and strong linkage. The features were pairwise correlated: each attribute was linked to at most one other attribute. For example, the related pairs were 1-2, 3-4 and 5-6, while the other pairs of features were independent. Therefore, we had good prior knowledge of the distributions and were able to compare the learned BN classifier structure with our assumptions. The expected structure of the classifiers for the custom data sets is shown in Figure 3.15.

Figure 3.15. Underlying dependencies between the variables in the datasets "weak", "moderate" and "strong".
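The author's genotype simulator is not listed in the text; purely as an illustration of the dependency structure in Figure 3.15, the following hedged sketch (all names and distributions are hypothetical, and class-conditional allele frequencies are omitted) generates nominal features in linked pairs, with the linkage strength controlling how often the second member of a pair simply copies the first.

```python
# Hypothetical sketch of sampling pairwise-correlated nominal features (pairs 1-2, 3-4, 5-6).
import random

def sample_individual(n_states=4, linkage=0.7):
    features = []
    for _ in range(3):                                   # three independent pairs of features
        first = random.randrange(n_states)
        second = first if random.random() < linkage else random.randrange(n_states)
        features += [first, second]
    return features

dataset = [sample_individual(linkage=0.9) for _ in range(200)]   # e.g. a "strong"-linkage sample
```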
Table 3.4. Classification accuracy of the compared classifiers over the test sets (no smoothing)

Data Set     DT C5.0   Naive Bayes   BNC-TAN 1 with GA   BNC-TAN 2 with GA
Hayes-Roth   92.9      88.8          91.4                88.9
Monk-1       74.3      71.3          73.1                72.7
Monk-2       65.0      54.7          59.6                53.2
Monk-3       97.2      97.2          97.2                93.5
Strong       77.0      76            77.0                73.5
Moderate     77.0      72.5          76.5                71.0
Weak         66.0      70            70.0                67.5

Table 3.5. Classification accuracy on the test sets with classifier smoothing (α = 1/2)

Data Set     Naive Bayes   BNC-TAN 1 with GA
Hayes-Roth   88.8          90.3
Monk-1       71.3          73.4
Monk-2       54.7          59.1
Monk-3       97.2          97.2
Strong       76            76.5
Moderate     72.5          73.0
Weak         70            70.0

The experimental results indicate that value clustering was able to improve on the results of the naive Bayes classifier for almost all datasets. Though the decision trees seem to outperform the probabilistic models on this data, this is to be expected, since their decision rules match the target concepts very well. The genetic value clustering performed nearly as well on the probabilistically generated data with limited attribute dependencies.

3.7 Conclusions

The novelty of the approach is in adopting a genetic algorithm for clustering the values of variables. In contrast, traditional genetic algorithm approaches to data mining operate on the features as a whole, by including/excluding them from a subset or by adjusting appropriate weights. We take a further step and organize the search in the entire input space. The enhanced graph-based encoding has a much less redundant chromosome while maintaining complete coverage of the possible partitions, as we strictly prove. The experiments performed demonstrate a reduction in measurement complexity and an improvement in classification accuracy due to the better reliability of meta-values. It is possible to generalize and apply genetic clustering to the difficult problem of parametric learning in the context of Bayesian networks. Bayesian network classifiers, even those augmented with trees over the feature nodes, require the estimation of a much greater number of parameters than a typical naive Bayes classifier. We demonstrate that the conditional probability distributions in a Bayesian network can be considerably simplified and learned from data with the help of GA value clustering.

Chapter 4

Genetic Programming with Local Learning

Data mining deals not only with discrete variables but also with continuous values of input and output variables. In this case a solution to a classification problem can be obtained in the form of a model that relates continuous input and output features. Building a model, in fact, amounts to learning a function that fits the given data samples, which is an interpolation problem. Genetic programming is a modern tool for learning arbitrary models from given data using a special set of building blocks, the node functions.

The quest for more efficient genetic programming is an important research problem, because high computational complexity is among GP's distinctive features [104]. Especially now, when variants of GP are being used on very ambitious projects [78][127], the speed and efficiency of evolution are crucial. Numerous modifications of the basic GP paradigm [75] are currently known; see [80] for a review. Among them, several researchers have considered GP augmentation by hill climbing, simulated annealing and other stochastic techniques.
In [100], crossover and mutation are used as the move operators of hill climbing, while Esparcia-Alcazar and Sharman [30] considered optimization of extra parameters (node gains) using simulated annealing. Terminal search was employed in [140], but due to the associated computational expense it was limited to 2-4% of individuals. The presence of stochasticity in local learning makes it relatively slow, even though some hybrid algorithms yield overall improvement. Iba and Nikolaev [62] and Rodriguez-Vazquez [115] considered least-squares coefficient fitting limited to linear models. Apparently, the full potential of local search optimization is yet to be realized.

The focus of this study is on the local adaptation of individual programs during the GP process. We rely on gradient descent for improved generation of GP individuals. This adaptation can be performed repeatedly during the lifetime of an individual. The results of local learning may or may not be coded back into the genotype (reverse transcription) based on the modified behavior, which is reported in the literature as Lamarckian and Baldwinian learning, respectively. The resulting new fitness values affect the selection process in both cases, which in turn changes the global optimization performance of the GP. Such an interaction between local learning, evolution and the associated phenomena without reverse transcription is also generally referred to as the Baldwin effect.

Figure 4.1. Sample tree with a set of random constants. In hybrid GP all the leaf coefficients are subjected to training.

We were motivated by a number of successful applications of hybridization to neural networks [8][147][97]. Both neural networks and GP trees perform input-output mappings using a number of adjustable parameters. In this respect, terminal values (leaf coefficients) in a GP tree serve a function similar to that of weights in a neural network. A form of gradient descent is usually used to adjust the weights in a neural net architecture. In contrast, the various terminal constants within GP trees are typically random and are rarely adjusted by gradient methods. The reasons for this are twofold: the unavailability of gradients/derivatives in some GP problems, and the computational expense assumed to be involved in computing those gradients. However, the complexity of computing derivatives is largely overestimated. In order to differentiate programs explicitly, algorithmic differentiation [51] may be adopted. Algorithmic (computational) differentiation is a technique that accurately determines the values of derivatives with essentially the same time complexity as the execution of the evaluation function itself. In fact, gradients may often be computed as part of the function evaluation. This is especially true for trees, and at least potentially true for arbitrary non-tree programs. Generalization of the algorithmic differentiation method to any program is possible, given that the generated program computes numeric values, even in the presence of loops, branches and intermediate variables. The main requirement is that the function be piecewise differentiable. While not always satisfied, this is the case for a great majority of engineering design applications. Moreover, it is also known that directional derivatives can be computed for many non-smooth functions [51]. Knowledge of only the gradient direction, not its value, is often enough to optimize the values of parameters.
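The idea that gradients can be computed as part of the function evaluation is easy to illustrate with a few lines of operator overloading; the sketch below (not the dissertation's implementation, and limited to the two operators it actually uses) carries a value together with its derivative with respect to one chosen parameter through the very same arithmetic the program performs.

```python
# Sketch: forward-mode algorithmic differentiation with a tiny dual-number class.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)   # product rule
    __rmul__ = __mul__

# d/dc of c*x + x*x at c = 2, x = 3: value 15, derivative 3
c = Dual(2.0, 1.0)      # the parameter we differentiate with respect to
x = Dual(3.0)           # an input, treated here as a constant
y = c * x + x * x
print(y.value, y.deriv)  # 15.0 3.0
```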
In this work we empirically compare conventional GP with a GP coupled with terminal-constant learning. The effectiveness of the approach is demonstrated on several symbolic regression problems. Arithmetic operations were chosen as the primitive set in our GP implementation for simplicity's sake. While such functions make differentiation easy, these techniques can be adapted to more difficult problems. Our results indicate that inexpensive differentiation, along with Baldwin-effect learning, leads to a very fast form of GP. Significant improvement in accuracy is achieved beyond what could be achieved by either local search or more generations of GP.

The Baldwin effect is known to change the inductive bias of an algorithm [134]. In the case of GP, where functional complexity is highly variable, it is expected that such a change of bias can be properly quantified. Two manifestations of the learning bias were observed in our experiments. Firstly, the selection process is affected by local learning, since the fitness of many individuals improves dramatically during their lifetime. Secondly, changes in the functional complexity of individuals were observed: both the length (number of nodes) of the best evolved programs and the number of leaf coefficients were higher with local learning than with regular GP.

4.1 Lamarckian versus Baldwinian Strategy in GP

Evolution rarely proceeds without phenotypic changes. As we are interested in digital evolution, two dominant strategies have been proposed which allow environmental fitness to affect genetic features. Lamarckian evolution, an alternative to the Darwinian approaches of its time, claimed that traits acquired from individual experience could be directly encoded into the genotype and inherited by offspring. In contrast, Baldwin claimed that Lamarckian effects could be observed where no direct transfer of phenotypic characteristics to the genotype occurs, in keeping with Darwinism. Rather, Baldwin argued that "innate" behaviors could be selected for (in a Darwinian sense) which the individual originally had to learn. In Lamarckian evolution, learning affects the fitness distribution as well as the underlying genotypic values, while the Baldwin effect is mediated via the fitness results only. In our case, the question is whether locally learned constants are copied back into the genotype (Lamarckian) or whether the constants are left unmodified while the individual's fitness value reflects the result of learning (Baldwinian).

Real algorithmic implementations of evolution coupled with local learning are much richer than the two original strategies. The researcher, usually guided by the total computational expense, may arbitrarily decide both the amount and the scheduling of learning or local adaptation of solutions. Moreover, since local learning comes at a price, it must be wisely traded off against the cost of genetic search. Several questions must be answered:

- What aspect of the solution should be learned beyond genetic search, as only a subset of solution parameters may be chosen for adaptation?
- Should learning be performed at every generation, or should it be used as a form of fine-tuning once the genetic search has converged?
- How many individuals, and which of them, should have local learning applied to them?
- How many iterations of local learning should be done (really, how much computational cost are we willing to incur)?
Accordingly, there are many ways to introduce local learning into GP. Evolution in GP is both parametric and structural in nature, and two important features are specific to GP:

1. The fitness of a functional structure depends critically on the values of its local parameters. Even very fit structures may perform poorly due to inappropriate numeric coefficients.
2. The fitness of an individual is highly context sensitive. Slight changes in structure dramatically influence fitness and may require completely new parameters.

That is why we focus on learning the numeric coefficients, the so-called Ephemeral Random Constants or ERCs [75], which are traditionally generated at random as shown in Figure 4.1. As explained below, the local learning algorithm, gradient descent on the error surface in the space of the individual's coefficients, turns out to be a very inexpensive approach, so much so that every individual can undergo local learning in every generation.

Formally we follow the Lamarckian principle of evolution, since we allow the tuned performance of an individual to directly affect the genome by modifying its numeric constants. At the same time, the choice between the Lamarckian and Baldwinian strategies in our implementation is not founded on the issue of computational complexity: in both cases the amount of extra work is approximately the same. The main issue arises when comparing the fitness values of offspring with inherited coefficients against offspring with unadjusted terminals. Our experiments indicate that there is little difference between these two fitness values when crossover is the main operator. Two factors contribute to this phenomenon:

1. Crossover usually generates individuals with significantly worse fitness than their parents. The coefficients found earlier to be good for the parents are not appropriate for the offspring structures. The subsequent local learning changes fitness dramatically by updating the ERCs to more appropriate values.
2. Newly generated offspring are equally well adjusted starting from any values: earlier trained, untrained, or even random. Hence, inheritance of the coefficients does not help much the performance of individuals created by crossover.

However, if an individual is transferred to a new generation as part of the elitist pool, i.e. unchanged by crossover or mutation, then its learned coefficients are also transferred. With respect to such an individual, the Baldwinian strategy would be wasteful, since it would require relearning the same parameters. Thus, even though our implementation formally follows the Lamarckian strategy, we effectively observe the very same phenomena peculiar to the Baldwin effect.

4.2 Hybrid Genetic Programming

The organization of the hybrid GP (HGP) is in many respects the same as that of standard GP. The only extra activity of the HGP algorithm is to update the values of the numeric coefficients: all individuals in the population are trained with a simple gradient algorithm in every generation of the standard GP. Below we discuss the exact formulation of the corresponding optimization problem. The hybrid GP is intended to solve problems of a numeric nature, which may include regression, recognition, system identification or control. We assume throughout that there are no non-differentiable nodes, such as Boolean functions. In general, given a set of N input-output pairs (d, x)_i, it is required to find a mapping f(x, c) minimizing a certain performance criterion, e.g. the
mean squared error (MSE):

$$MSE(c) = \frac{1}{N}\sum_{i=1}^{N}\big(d_i - f(x_i, c)\big)^2.$$

Here f is a scalar function (the generalization to multi-trees is trivial), x is the vector of input values, c is the vector of coefficients, and the sum runs over the training samples. Of course, in GP we are interested in discovering the mapping f(x, c) in the form of a program tree. That is, we seek not only the coefficients c but also the very structure of the mapping f, which is not known in advance. In our approach, finding the coefficients is done by gradient descent at the same time as the functional structures are evolved. Descriptions of the standard GP approach can be found elsewhere (e.g. [80]); below we focus on the details of the local learning algorithm.

4.3 Learning leaf coefficients

Minimization of the MSE is done by a few iterations of simple gradient descent. At each generation all numeric coefficients are updated several times using the rule

$$c_k \rightarrow c_k - \alpha\,\frac{\partial\, MSE(c)}{\partial c_k},$$

where α is the learning rate and k runs over all the coefficients at our disposal. Three important points must be discussed: how to find the derivatives, what the value of α should be, and how many iterations (steps) of descent should be used. From the two previous equations we obtain

$$\frac{\partial\, MSE(c)}{\partial c_k} = -\frac{2}{N}\sum_{i=1}^{N}\big(d_i - f(x_i, c)\big)\,\frac{\partial f(x_i, c)}{\partial c_k}.$$

Thus, an immediate goal is to differentiate any current program tree with respect to any of its leaves. The chain rule significantly simplifies computing ∂f/∂c. Indeed, if n_j(·) denotes the node functions, then

$$\frac{\partial f(n_1(n_2(n_3(\ldots),\ldots),\ldots))}{\partial c_k} = \frac{\partial f}{\partial n_1}\,\frac{\partial n_1}{\partial n_2}\,\frac{\partial n_2}{\partial n_3}\cdots\frac{\partial n_j(c_k,\ldots)}{\partial c_k}.$$

Therefore, differentiation of the tree simply reduces to the product of the node derivatives along the path that starts at the given leaf and ends at the root. Each term in the product is the derivative of a node's output with respect to one of its arguments (children). If the paths from different leaves share a common part, then the corresponding sub-chains of derivatives are also shared. The computation of such a product in practice depends on the data structure used for the program tree. In simple cases, differentiation uses a single recursive postorder traversal together with the actual function evaluation; the derivatives of the program tree with respect to all of its leaves can be obtained simultaneously. As soon as the entire sum in the leaf-coefficient correction is known, i.e. the derivatives at all training points have been obtained, one extra sweep through the tree may be needed to update the coefficients. In total, the incurred overhead depends on the complexity of the node derivatives and the number of leaves. For instance, in our implementation, which uses only an arithmetic functional set, the cost of differentiation was equal to the cost of function evaluation, making the overall cost twice the standard GP cost for the same problem.

4.4 Program Differentiation in Detail

Let us consider the details of on-the-fly differentiation of programs represented by trees. Potentially this technique can be generalized to programs of other structures, as shown in the field of algorithmic differentiation. We assume that first-order derivatives of the node functions exist. A program tree can be viewed as a complex nested function f(n_1(n_2(n_3(...), ...), ...)), where n_i(x, y) is a node function from a chosen functional set. For example, the tree shown in Figure 4.2 has an equivalent representation as the function

$$f(x, y, z, c_1, c_2) = \big((x \times y) + c_1\big) \times \big((x - z) - (y \times c_2)\big).$$

Figure 4.2. Example of a function of 3 variables drawn as a tree. c1 and c2 are the leaf coefficients.
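As a worked example added here for illustration (not part of the original text), the path-product rule of Section 4.3 applied to the two leaf coefficients of the Figure 4.2 tree gives

$$\frac{\partial f}{\partial c_1} = \big((x-z)-(y\,c_2)\big)\cdot 1, \qquad
\frac{\partial f}{\partial c_2} = \big((x\,y)+c_1\big)\cdot(-1)\cdot y = -\,y\,\big((x\,y)+c_1\big):$$

for c1 the path passes through the '+' node (derivative 1) and the root '×' node (derivative equal to its other argument), while for c2 it passes through the '×' node with y (derivative y), the right side of the '−' node (derivative −1), and the root (derivative equal to its left subtree).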
Here the node functions have two arguments; nodes with only one argument can be included too, and any node function with more than two arguments is trivial to decompose if necessary. For the purpose of differentiation it is only important to know the derivatives of every function in the set with respect to each of its arguments, no matter how complex the function is.

The overall goal of GP is to find the function that best fits the training data. Assuming that the input is represented as x and the desired output as y, it is required to find a mapping f(x, c) minimizing the mean squared error:

$$\min_{f,\,c} MSE = \min_{f,\,c}\; \frac{1}{N}\sum_{i=1}^{N}\big(y_i - f(x_i, c)\big)^2.$$

The sum runs over all given input-output pairs {y_i, x_i}, and c is the set of parameters. During the GP search we seek both the structure and the corresponding parameters. The essence of the problem is how to find the best values of the parameters c for a function f that is not known in advance but is generated; in fact, we have many such functions, because we work with a population of programs. If we are going to apply gradient descent, then we have to be able to find the derivatives of this function. At first glance this seems difficult, because the structure of f is not fixed but created dynamically during the search. However, algorithmic differentiation states that if one can compute the value of a function, then one can compute the value of its derivative. The key, of course, is that the programs are "computable", i.e. comprised of elementary operations. This provides an opportunity to find the values of the derivatives not approximately but exactly. This is important, since no estimation error is introduced, as would be the case with some form of approximation, e.g. finite differences. Also, the values of the derivatives can be found using essentially the same code that represents the function; the only error in the values is due to the nature of numerical computation. For example, if we evaluate the function f(x, y) = x² + 3xy at the point x = 1, y = 0, then we expect to obtain the value of the derivative with respect to x, f_x'(1, 0) = 2, and with respect to y, f_y'(1, 0) = 3, in addition to the value of the function itself, f(1, 0) = 1.

In general, in order to decrease the mean squared error we adjust the value of a function parameter c_k by making a step in the direction opposite to the gradient:

$$c_k \rightarrow c_k - \alpha\,\frac{\partial\, MSE(c)}{\partial c_k}.$$

Such adjustments can be made for all or only some of the available parameters {c_k}. As we already know, in order to compute an adjustment to the parameter c_k it is required to obtain ∂f(x_i, c)/∂c_k and f(x_i, c). Note that the output f(x_i, c) is computed for fitness evaluation anyway; hence, it is desirable to compute ∂f(x_i, c)/∂c_k at the same time.

Let us consider the simultaneous computation of a function and its derivative with respect to one parameter c. The free parameter c is shown explicitly in Figure 4.3, while the rest of the function is assumed to sit within the triangles and is therefore not relevant at the moment.

Figure 4.3. Example of a function for algorithmic differentiation with respect to a parameter c.

Differentiation by c gives:

$$\frac{\partial f}{\partial c} = \frac{\partial n_1(n_2(n_3(n_4(c,\ldots),\ldots),\ldots),\ldots)}{\partial c} = \frac{\partial n_1}{\partial n_2}\,\frac{\partial n_2}{\partial n_3}\,\frac{\partial n_3}{\partial n_4}\,\frac{\partial n_4(c,\ldots)}{\partial c}.$$

This example shows that to compute the derivative of a function, one must multiply the partial derivatives along the path from c to the root. Let us work out the example in more detail. Consider the function f(x, y, c) = x·(c·(x + y) − x·y) + y. Suppose we would like to compute
the values f(x=1, y=2, c=1) and ∂f(x, y, c)/∂c at (1, 2, 1). Again, we assume that all the triangles are arbitrary subtrees that do not depend on the parameter c. Without knowledge of the analytical expression of f we compute it by traversing the tree: if the function is stored as a tree, then it can be evaluated by a postorder traversal. However, it is necessary to introduce an auxiliary variable 'df', associated with the parameter c, in which we keep the value of the derivative. Hence we obtain:

1. f = 0, df = 0, n4_left = 1, n4_right = 3.
2. f = 3 (= n4(n4_left = 1, n4_right = 3)), df = 3, because we know that the derivative of this '*' node with respect to its left argument equals the right argument. In general, we distinguish between "left" and "right" derivatives, e.g. for '−' (minus) or other asymmetric node functions.
3. f = 1 (= n3(n3_left = 3, n3_right = 2)), df = 3 (= 3·1 = df·1), because the "left" derivative of the '−' node is 1.
4. f = 1 (= n2(n2_left = 1, n2_right = 1)), df = 3 (that is, 3·1 = df·1 = df·n2_right).
5. f = 3 (= n1(n1_left = 1, n1_right = 2)), df = 3 (that is, 3·1 = df·1).

Therefore the result computed is f(x=1, y=2, c=1) = 3 and ∂f(x, y, c)/∂c at (1, 2, 1) = 3. If there is more than one point in the training set, then we have just found one term of the sum over the dataset and need to repeat the same computation at the other points (x_i, y_i), i = 1..N. In this situation we do not adjust the parameter c yet, but instead accumulate the value of the correction in some intermediate variable, say 'delta'. Once all terms in the sum over the training set are found, we finally update the parameter c. Now it is clear how the general scheme works: we compute the value of the derivative at the same time as we compute the function, using the same program flow as the function itself. There are many implementation-dependent issues that influence how this is done technically. For example, how the 'delta' variable is allocated: if 'delta' is a pair (structure) of a value and a pointer to c, then in order to update c we do not need to traverse the tree again to find c. Also, the nodes become pointers to structures that keep pointers not only to the node functions but also to their derivatives (e.g. the "left" and "right" derivatives for two-argument functions). Of course, we have to supply the GP with a functional set and the corresponding set of derivatives.

Figure 4.4. Example of a function for algorithmic differentiation with respect to two parameters c1 and c2.

Now let us consider a more complex example with more than one parameter. If we want to find derivatives with respect to several parameters, we can also do it in the same pass through the tree. Suppose there are two parameters c1 and c2. In this case we have to carry three variables while traversing the tree: f, df1 and df2. We initialize the value of f at the very beginning of the evaluation and allocate df_i only if necessary. There are also some potential savings due to the tree-specific structure. The paths from c1 and from c2 share a common subpath to the root node, which means that the values of some terms in the chain-derivative product can be re-used for both derivatives; note that the factors ∂n1/∂n2 and ∂n2/∂n3 are common to both. In fact, once df1 and df2 are computed, their values are multiplied by the same factors as soon as their paths meet. For example, here the paths from c1 and c2 meet after the node n3.
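The following minimal sketch (hypothetical code, not the dissertation's implementation) mirrors this scheme: one postorder pass returns both the value of a program tree and its derivatives with respect to all leaf coefficients, and it reproduces the worked example above, f(1, 2, 1) = 3 with ∂f/∂c = 3.

```python
# Sketch: evaluate a program tree and its derivatives w.r.t. all leaf constants in one pass.
class Node:
    """Binary node: op is '+', '-' or '*'; children are Nodes, floats, input names such
    as 'x'/'y', or ('c', name) leaf coefficients."""
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

def evaluate(node, inputs, consts):
    """Return (value, {const_name: d value / d const}) for the subtree."""
    if isinstance(node, tuple) and node[0] == 'c':       # leaf coefficient
        return consts[node[1]], {node[1]: 1.0}
    if isinstance(node, str):                            # input variable
        return inputs[node], {}
    if isinstance(node, (int, float)):                   # fixed constant
        return float(node), {}
    lv, ld = evaluate(node.left, inputs, consts)
    rv, rd = evaluate(node.right, inputs, consts)
    if node.op == '+':
        val, dl, dr = lv + rv, 1.0, 1.0
    elif node.op == '-':
        val, dl, dr = lv - rv, 1.0, -1.0                 # distinct "left"/"right" derivatives
    else:                                                # '*'
        val, dl, dr = lv * rv, rv, lv
    grads = {k: dl * v for k, v in ld.items()}
    for k, v in rd.items():
        grads[k] = grads.get(k, 0.0) + dr * v
    return val, grads

# f(x, y, c) = x * (c*(x + y) - x*y) + y, the example from the text
tree = Node('+', Node('*', 'x',
                      Node('-', Node('*', ('c', 'c'), Node('+', 'x', 'y')),
                                Node('*', 'x', 'y'))), 'y')
print(evaluate(tree, {'x': 1.0, 'y': 2.0}, {'c': 1.0}))   # (3.0, {'c': 3.0})
```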
The scheme can be generalized to any number of parameters: all the derivatives can be computed in the same pass through the tree. It works not only with trees but with almost any program, because the computation of the derivatives mirrors the program flow. If, for example, we have conditional control such as 'IF' or 'FOR', then we only need to keep careful track of the derivatives. In this case the implementation can be more complex, because with 'FOR' loops we may repeatedly pass through the same "terminals" containing the parameters, and therefore must avoid re-initializing auxiliary variables like df. In order to adjust the parameters c we can sometimes sacrifice precision, because we rely less on the actual value of the derivatives than on their sign. That is, we mostly need to make a step in the right direction, to either increase or decrease the value of a parameter. Therefore, even if exact derivatives are not known (or do not exist) but only "directional" derivatives are available, we can still use them.

4.5 Learning rate and number of steps

In a simple gradient descent algorithm, the proper choice of the learning rate is very important. Too large a learning rate may increase the error, while too small a rate may require many training iterations. It is also known that the simple gradient (Cauchy) rule works better in areas far from the vicinity of local minima [114]. Therefore we decided to make the rate as large as possible without sacrificing the quality of learning. After a few trials on a test problem of symbolic regression we fixed the learning rate to
4.6 Experimental Design

The main goal of the empirical study is to compare the performance of GP with and without gradient-descent learning. Even though an overall speed-up is very valuable, we are also interested in other effects resulting from such learning. These effects have to be properly quantified to shed light on the internal mechanisms of the interaction between learning and evolution. Three major issues are studied:

- Improvement in search speed
- Changes in fitness distribution and selection
- Changes in the functional structure of the programs

The driver GP program included the following major steps:

1. Initialization of the population using the "grow" method. Starting from a set of random roots, more nodes and terminals are aggregated with equal probability until a specified number of nodes is generated. The total number of nodes in the initial population was chosen to be three times greater than the population size, which is three functional nodes per individual on average.
2. Fitness evaluation and training (in HGP) of each individual. Mean squared error over the given training set, as defined before, serves as an inverse fitness function since we seek to minimize error. This stage includes parametric training in HGP, provided that the individual has leaf coefficients.
3. Termination criteria check. The number of function evaluations was the measure of computational effort. For instance, every individual is evaluated only once in every GP generation, but three times in every HGP generation if its parameters are trained for three steps.
4. Tournament selection (tournament size = 2) of parents. Pairs are selected at random with replacement and their number is equal to the population size. The better of the two individuals becomes a parent at the next step (a simple sketch of this selection step is given after the list).
5. Crossover and reproduction. Standard tree crossover is used. Each pair of parents produces two offspring.
6. Mutation with a small probability pm = 0.01 applied to each node. In addition, elitism was always used and the best 10% of the population survive unchanged.
7. Pruning of the trees whose size exceeds a predefined threshold value (24 nodes for the test problems described below).
8. Continue to step 2.
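As an illustration of the selection and reproduction steps above, the following sketch shows one possible generational update with size-2 tournaments and 10% elitism. It is a simplified outline under stated assumptions, not the driver program itself: breed and mutate stand in for the tree crossover and node mutation operators, and fitness returns an individual's error (lower is better).

import random

def next_generation(population, fitness, breed, mutate, elite_frac=0.1):
    """One generational step: 10% elitism, size-2 tournaments, pairwise breeding."""
    pop_size = len(population)
    ranked = sorted(population, key=fitness)
    elite = ranked[:max(1, int(elite_frac * pop_size))]    # survive unchanged

    def tournament():
        a, b = random.choice(population), random.choice(population)
        return a if fitness(a) <= fitness(b) else b         # better individual wins

    offspring = list(elite)
    while len(offspring) < pop_size:
        child1, child2 = breed(tournament(), tournament())   # two offspring per pair
        offspring.extend([mutate(child1), mutate(child2)])
    return offspring[:pop_size]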
Test Problems

Five surface fitting problems were chosen as benchmarks:

f1(x,y) = xy + sin((x-1)(y+1))
f2(x,y) = x^4 - x^3 + y^2/2 - y
f3(x,y) = 6 sin(x) cos(y)
f4(x,y) = 8 / (2 + x^2 + y^2)
f5(x,y) = x^3/5 + y^3/2 - y - x

For each problem, 20 random training points (fitness cases) were generated in the range [-3...3] along each axis. Figures 4.8 and 4.9 show the desired surfaces to be evolved.

4.7 Experimental results

To compare the performance we made experiments with both hybrid and regular GP with the same effort of 30,000 function evaluations in each run. All experiments were done with a population size of 100 and the arithmetic operators {+, -, *, % (protected)} as the function set, with no ADFs. Initial leaf coefficients were randomly generated in the range [-1...1]. Also, the pruning threshold was set to 24 nodes. If the number of nodes in an individual grew beyond this threshold, a sub-tree beginning at some randomly chosen node was cut from the individual. Each experiment was run 10 times and the MSE value was monitored.

Our main results are shown in Figures 4.8-4.9 and also summarized in Table 4.1. The success of the hybrid GP is quite remarkable. For all the test problems, the average error of the best evolved programs was significantly smaller (1.5 to 25 times) when learning was employed. The first 20-30 generations usually brought most of these improvements. The gap in error levels is wide enough to require the regular GP to use hundreds more generations to achieve similar results. Certain improvements were also observed for the average population fitness, but with lesser magnitude. The similarity of each population's average fitness indicates a high diversity and that not all offspring reach small error values after local learning.

Another set of experiments included extra fine-tuning iterations performed only after the regular GP terminates. Again, we ran gradient optimization on the population from the last GP generation. Each individual was tuned by applying 100 gradient descent iterations. The results in Table 4.1 show that this approach is not effective and did not achieve the quality of result found by the HGP. This is a strong argument for the Baldwin effect, namely that another factor affecting the search speed-up is a change in the fitness distribution that directly affects the selection outcome. Learning introduces a bias that favors individuals that are more able to adapt to local learning modifications. If we were to suppose that the selection bias does not occur, then the hybrid GP would be only a trivial combination of genetic search and fine tuning. However, as we see from the results, this is not the case. We attempted to measure some properties of HGP that would demonstrate this synergy between local learning and evolution.

Table 4.1. Performance comparison of hybrid and regular GP. All data collected after 30,000 function evaluations and averaged over 10 experiments.

Test problem   Best MSE, HGP   Best MSE, GP   Ave. MSE, HGP   Ave. MSE, GP   Best MSE, GP + fine tuning
f1             0.009           0.26           0.47            0.80           0.233
f2             0.075           0.761          1.03            2.18           0.31
f3             2.32            6.22           5.98            6.59           6.21
f4             0.64            0.76           4.06            4.41           0.76
f5             0.097           0.36           0.27            0.78           0.30

First of all, a Baldwin effect in selection would mean that the results of some tournament selections are reversed after local learning. Indeed, individuals that are adaptable to local learning win their tournaments due to the improved fitness resulting from gradient descent; these individuals would lose the same tournaments in regular GP. Figure 4.6 shows both the typical and the average percentage of reversed tournaments for the problem f1. A summary of results for all the test problems is given in Table 4.2. It was found that the average percentage of selection changes remains the same during the course of the search for all test problems. Such behavior would be expected if selection pressure pushes offspring that are very adaptable, even when older elite members are almost converged.

Figure 4.6. Comparison of the selection process in HGP and GP. Local learning changes the outcome of some tournaments used to select a mating pool (percentage of reversed tournaments per generation, typical run and average over 10 runs, problem f1).

An empirical measure of this degree of adaptability is provided by the average gain in fitness achieved by newly generated offspring. The values are given in Table 4.2. We do not include elite members in this statistic in order to emphasize the magnitude of learning from scratch. The average observed drop of MSE is between 12% and 19% on all the test problems.
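One simple way to estimate the reversed-tournament statistic reported here is sketched below: given each individual's error before and after local learning, sample random size-2 tournaments and count how often the winner changes. This is only an illustrative estimator added for clarity, not the instrumentation used in the experiments.

import random

def fraction_reversed(err_before, err_after, n_tournaments=1000):
    """Estimate how often a size-2 tournament outcome flips when fitness is
    taken after local learning (err_after) instead of before it (err_before).
    Both arguments are lists of MSE values, one entry per individual."""
    n, flips = len(err_before), 0
    for _ in range(n_tournaments):
        a, b = random.sample(range(n), 2)
        flips += ((err_before[a] <= err_before[b]) !=
                  (err_after[a] <= err_after[b]))
    return flips / n_tournaments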
Figure 4.7. Typical dynamics of the number of terminals (numeric coefficients) used by the best program as a function of GP generations, for the test function f1 (hybrid GP vs. regular GP).

What exactly makes one program more adaptable than another? Clearly, it is the functional structure of the program. For example, a program with no numeric leaves cannot learn at all using the gradient local learning method described above. Furthermore, a tree with no terminal arguments (inputs), containing only terminal constants, will always produce the same output and will not benefit from local learning. We have therefore tried to understand which characteristics of adaptable programs are unique.

Table 4.2. Effects of local learning. Columns: difference in HGP vs. GP selection per generation (average, %); average MSE gain for newly generated offspring (%); complexity of the best programs (#coefficients / #nodes) after the same effort (30,000 f.e.) for HGP and GP.

Test problem   Selection difference, %   MSE gain, %   HGP complexity   GP complexity
f1             7.7                       16.5          16.0 / 22.4      12.2 / 21.2
f2             7.1                       12.7          16.6 / 23.0      13.7 / 21.8
f3             8.4                       15.1          17.5 / 23.5      11.8 / 20.4
f4             7.9                       18.7          17.3 / 22.9      12.4 / 21.6
f5             7.4                       15.0          17.0 / 23.1      12.9 / 21.8

Figure 4.8. Surface fitting test problems f1(x,y) = xy + sin((x-1)(y+1)) and f2, with the respective learning curves (MSE vs. number of function evaluations, up to 30,000) for regular and hybrid GP.

We have focused on the length (number of nodes) and on the number of coefficients in the best evolved programs (recall that the length had an upper limit as well). As Table 4.2 illustrates, both values are noticeably greater for the programs evolved by HGP. This is one illustration of the inductive bias of the hybrid algorithm. More adaptive programs use more coefficients and consequently have lengthier representations. Also, the number of terminal inputs (x and y) in the HGP results is slightly smaller. Figure 4.7 shows typical changes in the number of coefficients of a "best" individual on a generational scale for both GP and HGP.

Figure 4.9. Surface fitting test problems f3(x,y) = 6 sin(x) cos(y), f4 and f5, with the respective learning curves.

4.8 Conclusions

We examined the effectiveness of gradient search optimization of numeric leaf values for Genetic Programming. Genetic search for tree-like programs at the population level is complemented by the optimization of terminal values at the individual level. Local adaptation of individuals is made easier by algorithmic differentiation. We show how conventional random constants are tuned by gradient descent with minimal overhead. Local learning in the form of gradient descent can be efficiently included into GP search. Several experiments with symbolic regression problems were performed to demonstrate the approach's effectiveness. This learning provides a substantial improvement in both the final fitness and the speed of reaching this fitness. The effects of local learning are clearly manifest in both improved approximation accuracy and selection changes when periods of local and global search are interleaved. Finally, the inductive bias of local learning is quantified. The use of local learning creates a bias in the structure of the solutions, namely it prefers structures that are more readily adaptable by local learning. This approach can have a significant impact on practical, engineering problems that are addressed by GP.
Chapter 5

Clustering Ensembles

In contrast to supervised classification, clustering is an inherently ill-posed problem whose solution violates at least one of the common assumptions about scale-invariance, richness, and cluster consistency [72]. Different clustering solutions may seem equally plausible without a priori knowledge about the underlying data distributions. Every clustering algorithm implicitly or explicitly assumes a certain data model, and it may produce erroneous or meaningless results when these assumptions are not satisfied by the sample data. Thus the availability of prior information about the data domain is crucial for successful clustering, though such information can be hard to obtain, even from experts. Identification of relevant subspaces [2] or visualization [57] may help to establish the sample data's conformity to the underlying distributions or, at least, to the proper number of clusters.

The exploratory nature of clustering tasks demands efficient methods that would benefit from combining the strengths of many individual clustering algorithms. This is the focus of research on clustering ensembles, which seeks a combination of multiple partitions that provides improved overall clustering of the given data. Clustering ensembles can go beyond what is typically achieved by a single clustering algorithm in several respects:

- Robustness. Better average performance across domains and datasets.
- Novelty. Finding a combined solution unattainable by any single clustering algorithm.
- Stability and confidence estimation. Clustering solutions with lower sensitivity to noise, outliers or sampling variations. Clustering uncertainty can be assessed from ensemble distributions.
- Parallelization and scalability. Parallel clustering of data subsets with subsequent combination of results, and the ability to integrate solutions from multiple distributed sources of data or attributes (features).

Clustering ensembles can also be used in multiobjective clustering as a compromise between individual clusterings with conflicting objective functions. Fusion of clusterings using multiple sources of data or features becomes increasingly important in distributed data mining; see, e.g., the review in [101]. Several recent independent studies [29][35][38][39][109][124] have pioneered clustering ensembles as a new branch in the conventional taxonomy of clustering algorithms [63][65].

The problem of clustering combination can be defined generally as follows: given multiple clusterings of the data set, find a combined clustering with better quality. While the problem of clustering combination bears some traits of a classical clustering problem, it also has three major issues which are specific to combination design:

1. Consensus function: How to combine different clusterings? How to resolve the label correspondence problem? How to ensure a symmetrical and unbiased consensus with respect to all the component partitions?
2. Diversity of clustering: How to generate different partitions? What is the source of diversity in the components?
3. Strength of the constituents/components: How "weak" could each input partition be? What is the minimal complexity of component clusterings needed to ensure a successful combination?

Similar questions have already been addressed in the framework of multiple classifier systems. Combining results from many supervised classifiers is an active research area [110][11] and it provides the main motivation for clustering combination.
However, it is not possible to mechanically apply the combination algorithms from the classification (supervised) domain to the clustering (unsupervised) domain. Indeed, no labeled training data is available in clustering; therefore the ground truth feedback necessary for boosting the overall accuracy cannot be used. In addition, different clusterings may produce incompatible data labelings, resulting in intractable correspondence problems, especially when the numbers of clusters are different. Still, the supervised classifier combination demonstrates, in principle, how multiple outputs reduce the variance component of the expected error rate and increase the robustness of the solution. From the supervised case we also learn that a proper combination of weak classifiers [11][43][59][71] may achieve arbitrarily low error rates on training data, as well as reduce the predictive error. One can expect that using many simple, but computationally inexpensive, components will be preferable to combining clusterings obtained by sophisticated, but computationally involved, algorithms.

This research further advances ensemble methods in several aspects, namely the design of new effective consensus functions, the development of new partition generation mechanisms, and the study of their clustering accuracy. We offer a representation of multiple clusterings as a set of new attributes characterizing the data items. Such a view directly leads to a formulation of the combination problem as a categorical clustering problem in the space of these attributes, or, in other terms, a median partition problem. The median partition can be viewed as the best summary of the given input partitions. As an optimization problem, median partition is NP-complete [6], with a continuum of heuristics for an approximate solution.

This study focuses on the primary problem of clustering ensembles, namely the consensus function, which creates the combined clustering. We show how the median partition is related to the classical intra-class variance criterion when generalized mutual information is used as the evaluation function. A consensus function based on quadratic mutual information (QMI) is proposed and reduced to k-means clustering in the space of specially transformed cluster labels. We also propose a new fusion method for unsupervised decisions that is based on a probability model of the consensus partition in the space of contributing clusters. The consensus partition is found as a solution to the maximum likelihood problem for a given clustering ensemble. The likelihood function of an ensemble is optimized with respect to the parameters of a finite mixture distribution. Each component in this distribution corresponds to a cluster in the target consensus partition, and is assumed to be a multivariate multinomial distribution. The maximum likelihood problem is solved using the EM algorithm [25].

There are several advantages to the QMI and EM consensus functions. These include: (i) complete avoidance of solving the label correspondence problem, (ii) low computational complexity, and (iii) the ability to handle missing data, i.e., missing cluster labels for certain patterns in the ensemble (for example, when the bootstrap method is used to generate the ensemble).

Another goal of our work is to adopt weak clustering algorithms and combine their outputs. Vaguely defined, a weak clustering algorithm produces a partition which is only slightly better than a random partition of the data.
We propose two different weak clustering algorithms as the component generation mechanisms:

1. Clustering of random 1-dimensional projections of multidimensional data. This can be generalized to clustering in any random subspace of the original data space.
2. Clustering by splitting the data using a number of random hyperplanes. For example, if only one hyperplane is used, then the data is split into two groups.

Finally, this work compares the performance of different consensus functions. We have investigated the performance of a family of consensus functions based on categorical clustering, including the co-association-based hierarchical methods [39][40][41], hypergraph algorithms [124] and our new consensus functions. Combination accuracy is analyzed as a function of the number and the resolution of the clustering components. In addition, we study clustering performance when some cluster labels are missing, which is often encountered in distributed data or re-sampling scenarios.

5.1 Related Work

Approaches to the combination of clusterings differ in two main respects, namely the way in which the contributing component clusterings are obtained and the method by which they are combined. Fred [38] proposed to summarize various clustering results in a co-association matrix. Co-association values represent the strength of association between objects by analyzing how often each pair of objects appears in the same cluster. The final clustering in [38] is determined using a voting-type algorithm applied to the co-association matrix. Hence the co-association matrix serves as a similarity matrix for the data items. Clusters are formed from the co-association matrix by linking the objects whose co-association value exceeds a certain threshold. Further work by Fred and Jain [38][39][40] also used co-association values, but instead of a fixed threshold they applied a hierarchical (single-link) clustering to the co-association matrix. These studies use the k-means algorithm with various values of k and random initializations (for selecting the k cluster centers) for generating the component clusterings. Specific methods for combining the k-means algorithm and agglomerative linkage algorithms were proposed in [109].

Strehl and Ghosh [124] have considered three different consensus functions for ensemble clustering: one is similar to the evidence accumulation approach [39][40], and the other two methods are based on a hypergraph representation. All of them use various hypergraph operations to search for a solution. The Cluster-based Similarity Partitioning Algorithm (CSPA) [124] induces a graph from a co-association matrix and clusters it using the METIS algorithm [69]. Hypergraph partitioning represents each cluster by a hyperedge in a graph whose nodes correspond to a given set of objects. Good hypergraph partitions are found using minimal cut algorithms such as HMETIS [68] coupled with the proper objective functions, which also control partition size. Hyperedge collapsing operations are considered in another hypergraph-based Meta-Clustering Algorithm (MCLA) in [124]. The meta-clustering algorithm uses these operations to determine soft cluster-membership values for each object.

A combination of partitions obtained from multiple bootstrap samples is implemented in [29][35]. Both works pursued direct re-labeling approaches to the correspondence problem. A re-labeling can be done optimally between two clusterings using the Hungarian algorithm [103].
This was used in [29] to re-label each bootstrap partition using a single reference partition. The reference partition is determined by a single clustering of the entire data set. After an overall consistent re-labeling, voting can be applied to determine the cluster membership of each pattern.

Bagging of multiple k-means clustering results was done in [84] by clustering the k-means centers and assigning the objects to the closest cluster center. In fact, their component clusterings do not keep information about the individual object labels but only information about the cluster prototypes. The k-means centers are "bagged" and clustered by a hierarchical procedure. Such an approach is unique in that the components of the combination and the final clustering are defined implicitly via prototypes rather than by explicit labelings [84].

Kellam et al. [70] also combined clusterings through a type of co-association matrix. However, this matrix is used only to find the clusters with the highest value of support based on object co-occurrences. As a result, only a set of so-called robust clusters is produced, which may not contain all the initial objects.

A genetic algorithm is employed in [46] to produce the most stable partitions from an evolving ensemble (population) of clustering algorithms along with a special objective function. The objective function evaluates multiple partitions according to changes caused by data perturbations and prefers those clusterings that are least susceptible to those perturbations.

Dimitriadou et al. [24] proposed a voting/merging procedure that combines clusterings pair-wise and iteratively. The cluster correspondence problem must be solved at each iteration and the solution is not unique. Fuzzy membership decisions are accumulated during the course of merging. The final clustering is obtained by assigning each object to the derived cluster with the highest membership value.

The distributed clustering algorithm [67] constructs a global dendrogram for a set of objects from multiple local models produced by single-link algorithms. Collective hierarchical clustering combines dendrograms built on different subsets of features. The global hierarchy uses a specific approximation of distance that is estimated based on the merging thresholds of the component dendrograms. Recently, Fern and Brodley [34] introduced a random projections algorithm in the context of clustering ensembles, similar to one of the generation mechanisms considered in this work.

Generally, clustering ensemble approaches can be summarized as follows:

Generative mechanisms (How to obtain different partitions?)
1. Apply various clustering algorithms [124]
2. Use a single algorithm
   2.1. Different initialization and built-in randomness [38][39][40][41]
   2.2. Different parameters [39]
   2.3. Different subsets of data points
        - Deterministic subsets [41][124]
        - Resampling [29][35][94] without or with replacement (e.g., bootstrap)
   2.4. Projecting data onto different subspaces [34] (and this study)
   2.5. Different subsets of features [124]
Consensus functions (How to integrate multiple partitions?)
1. Co-association matrix [38][39][40][41]
   - Single Link (SL) / Minimum Spanning Tree (MST)
   - Complete Link (CL)
   - Average Link (AL)
   - Ward, or other similarity-based algorithms
2. (Hyper)graph partitioning [124][125]
   - Hypergraph Partition Algorithm (HGPA)
   - Meta Clustering Algorithm (MCLA)
   - Cluster-based Similarity Partitioning Algorithm (CSPA)
3. Information-theoretic methods, e.g. Quadratic Mutual Information (this study)
4. Voting approach [29][35]
5. Mixture model (this study)

Unique properties of the space of cluster labels are exploited by several known consensus functions, which we briefly review.

Co-association methods. Similarity between objects can be estimated by the number of clusters shared by two objects in all the partitions of an ensemble. This similarity definition expresses the strength of co-association of objects by a matrix containing the values:

s_{ij} = S(x_i, x_j) = \frac{1}{H} \sum_{k=1}^{H} \delta(\pi_k(x_i), \pi_k(x_j)),   (5.1)

where \delta(a, b) = 1 if a = b and 0 if a \neq b.

Thus, one can use numerous similarity-based clustering algorithms by applying them to the matrix of co-association values. It is also possible to construct more sophisticated similarity definitions when the ensemble Π contains "soft" partitions. There are three main concerns with this intuitively appealing approach. First, it has a quadratic complexity in the number of patterns, O(N^2). Second, there are no established guidelines concerning which clustering algorithm should be applied, e.g. single linkage or complete linkage; the shapes of the clusters embedded in the space of cluster labels may or may not reflect the cluster shapes in the original space. Third, an ensemble with a small number of clusterings may not provide a reliable estimate of the co-association values.

Hypergraph methods. All the clusters in the ensemble partitions can be represented as hyperedges on a graph with N vertices. Each hyperedge describes a set of objects belonging to the same cluster. A consensus function is formulated as a solution to the k-way min-cut hypergraph partitioning problem. Each connected component after the cut corresponds to a cluster in the consensus partition. Though the hypergraph partitioning problem is NP-hard, efficient heuristics have been developed for its solution. However, the cuts only remove hyperedges as a whole. Hypergraph algorithms seem to work best for nearly balanced clusters. For example, the complexities of CSPA, HGPA and MCLA are estimated in [124] as O(kN^2 H), O(kNH), and O(k^2 N H^2), respectively.

Re-labeling and voting approach. If the label correspondence problem is solved for all the given partitions, then a simple voting procedure can be used to assign objects to clusters. However, label correspondence is exactly what makes unsupervised combination difficult. A heuristic approximation to consistent labeling is possible: all the partitions in the ensemble can be re-labeled according to their best agreement with some chosen reference partition. The reference partition can be taken as one from the ensemble, or from a new clustering of the dataset. Also, a meaningful voting procedure assumes that the number of clusters in every given partition is the same as in the target partition, which requires that the number of clusters in the target consensus partition be known.

To summarize, existing consensus functions suffer from a number of drawbacks, including computational complexity, the heuristic character of the objective function, and the uncertain statistical status of the consensus solution. The next sections introduce new models of clustering combination that aim to overcome these drawbacks.
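As a concrete illustration of the co-association representation in Eq. (5.1), the following short Python sketch builds the co-association matrix from an ensemble of label vectors. The function name and the array layout are our own choices; it is not code from this work.

import numpy as np

def co_association(labels):
    """Co-association matrix of Eq. (5.1).
    `labels` is an (H, N) array: row k holds the cluster labels pi_k(x_i)."""
    labels = np.asarray(labels)
    H, N = labels.shape
    S = np.zeros((N, N))
    for pi in labels:                       # one partition at a time
        S += (pi[:, None] == pi[None, :])   # delta(pi(x_i), pi(x_j))
    return S / H

# e.g. two partitions of four objects:
S = co_association([[0, 0, 1, 1],
                    [1, 1, 1, 0]])
# S[0, 1] == 1.0 (always together), S[2, 3] == 0.5 (together in one of two)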
5.2 Representation of Multiple Partitions

Combination of multiple partitions can be viewed as a partitioning task itself. Typically, each partition in the combination is represented as a set of labels assigned by a clustering algorithm. The combined partition is obtained as the result of yet another clustering algorithm whose inputs are the cluster labels of the contributing partitions. We will assume that the labels are nominal values. In general, the clusterings can be "soft", i.e., described by real values indicating the degree of pattern membership in each cluster of a partition. We consider only "hard" partitions below, noting, however, that combination of "soft" partitions can be handled by numerous clustering algorithms and does not appear to be more complex.

Suppose we are given a set of N data points X = {x_1, ..., x_N} and a set of H partitions Π = {π_1, ..., π_H} of the objects in X. Different partitions of X return a set of labels for each point x_i, i = 1, ..., N:

x_i → {π_1(x_i), π_2(x_i), ..., π_H(x_i)}.

Here, H different clusterings are indicated and π_j(x_i) denotes the label assigned to x_i by the j-th algorithm. No assumption is made about the correspondence between the labels produced by different clustering algorithms. Also, no assumptions are needed at the moment about the data input: it could be represented in a non-metric space or as an N×N dissimilarity matrix. For simplicity, we use the notation y_ij = π_j(x_i) or y_i = π(x_i). The problem of clustering combination is to find a new partition π_C of the data X that summarizes the information from the gathered partitions Π. Our main goal is to construct a consensus partition without the assistance of the original patterns in X, but only from their labels Y delivered by the contributing clustering algorithms. Thus, such potentially important issues as the underlying structure of both the partitions and the data are ignored for the sake of a solution to the unsupervised consensus problem.

We emphasize that a space of new features is induced by the set Π. One can view each component partition π_i as a new feature with categorical values, i.e. cluster labels. The values assumed by the i-th new feature are simply the cluster labels from partition π_i. Therefore, the membership of an object x in different partitions is treated as a new feature vector y = π(x), an H-tuple. In this case, one can consider partition π_j(x) as a feature extraction function. Combination of clusterings becomes equivalent to the problem of clustering H-tuples if we use only the existing clusterings {π_1, ..., π_H}, without the original features of the data X. Hence the problem of combining partitions can be transformed into a categorical clustering problem. Such a view gives insight into the properties of the expected combination, which can be inferred through various statistical and information-theoretic techniques. In particular, one can estimate the sensitivity of the combination to the correlation of components (features) as well as analyze various sample size issues. Perhaps the main advantage of this representation is that it facilitates the use of known algorithms for categorical clustering [90][36] and allows one to design new consensus heuristics in a transparent way.

The extended representation of the data X can be illustrated by a table with N rows and (d+H) columns, where the first d columns contain the original features and the last H columns contain the new categorical features (cluster labels):

  x_1:   x_11 ... x_1d    π_1(x_1) ... π_H(x_1)
  x_2:   x_21 ... x_2d    π_1(x_2) ... π_H(x_2)
  ...
  x_N:   x_N1 ... x_Nd    π_1(x_N) ... π_H(x_N)
         (original d features)   ("new" H features)

The consensus clustering is found as a partition π_C of the set of vectors Y = {y_i} that directly translates to a partition of the underlying data points {x_i}.
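The "new features" view is easy to sketch in code. The fragment below is only an illustration (scikit-learn's k-means is used for convenience, and the function name and parameter choices are ours): it produces the N×H table of categorical label features by running k-means H times with different random initializations.

import numpy as np
from sklearn.cluster import KMeans

def label_features(X, H=5, k=3, seed=0):
    """Represent each object by H new categorical features y_i = pi(x_i),
    one column per contributing k-means partition (a sketch of Section 5.2)."""
    rng = np.random.RandomState(seed)
    Y = np.empty((len(X), H), dtype=int)
    for j in range(H):
        km = KMeans(n_clusters=k, n_init=1,
                    random_state=rng.randint(1 << 30)).fit(X)
        Y[:, j] = km.labels_            # cluster labels act as feature values
    return Y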
5.3 A Mixture Model of Consensus

Our approach to the consensus problem is based on a finite mixture model for the probability of the cluster labels y = π(x) of the pattern/object x. The main assumption is that the labels y_i are modeled as random variables drawn from a probability distribution described as a mixture of multivariate component densities:

P(y_i | \Theta) = \sum_{m=1}^{M} \alpha_m P_m(y_i | \theta_m),   (5.2)

where each component is parametrized by \theta_m. The M components in the mixture are identified with the clusters of the consensus partition π_C. The mixing coefficients \alpha_m correspond to the prior probabilities of the clusters. In this model, the data points {y_i} are presumed to be generated in two steps: first, by drawing a component according to the probability mass function \alpha_m, and then by sampling a point from the distribution P_m(y | \theta_m). All the data Y = {y_i}, i = 1, ..., N, are assumed to be independent and identically distributed. This allows one to represent the log-likelihood function for the parameters \Theta = {\alpha_1, ..., \alpha_M, \theta_1, ..., \theta_M} given the data set Y as:

\log L(\Theta | Y) = \log \prod_{i=1}^{N} P(y_i | \Theta) = \sum_{i=1}^{N} \log \sum_{m=1}^{M} \alpha_m P_m(y_i | \theta_m).   (5.3)

The objective of consensus clustering is now formulated as a maximum likelihood estimation problem. To find the best fitting mixture density for the given data Y, we must maximize the likelihood function with respect to the unknown parameters \Theta:

\Theta^{*} = \arg\max_{\Theta} \log L(\Theta | Y).   (5.4)

The next important step is to specify the model of the component-conditional densities P_m(y | \theta_m). Note that the original problem of clustering in the space of data X has been transformed, with the help of multiple clustering algorithms, into a space of new multivariate features y = π(x). To make the problem more tractable, a conditional independence assumption is made for the components of the vector y_i, namely that the conditional probability of y_i can be represented as the following product:

P_m(y_i | \theta_m) = \prod_{j=1}^{H} P_m^{(j)}(y_{ij} | \theta_m^{(j)}).   (5.5)

To motivate this, one can note that even if the different clustering algorithms (indexed by j) are not truly independent, the approximation by a product in Eq. (5.5) can be justified by the excellent performance of naive Bayes classifiers in discrete domains [34]. Our ultimate goal is to make a discrete label assignment to the data in X through the indirect route of density estimation of Y. The assignments of patterns to the clusters in π_C are much less sensitive to the conditional independence approximation than the estimated values of the probabilities P(y_i | \Theta), as supported by the analysis of the naive Bayes classifier in [27].

The last ingredient of the mixture model is the choice of a probability density P_m^{(j)}(y_{ij} | \theta_m^{(j)}) for the components of the vectors y_i. Since the variables y_{ij} take on nominal values from the set of cluster labels in the partition π_j, it is natural to view them as the outcome of a multinomial trial:

P_m^{(j)}(y_{ij} | \theta_m^{(j)}) = \prod_{k=1}^{K(j)} \vartheta_{jm}(k)^{\delta(y_{ij}, k)}.   (5.6)

Here, without loss of generality, the labels of the clusters in π_j are chosen to be integers in {1, ..., K(j)}. To clarify the notation, note that the probabilities of the outcomes are defined as \vartheta_{jm}(k) and the product is taken over all the possible values of the labels y_{ij} of the partition π_j. Also, the probabilities sum to one:

\sum_{k=1}^{K(j)} \vartheta_{jm}(k) = 1, \quad \forall j \in \{1, ..., H\}, \; \forall m \in \{1, ..., M\}.   (5.7)

For example, if the j-th partition has only two clusters, and the possible labels are 0 and 1, then Eq. (5.6) can be simplified as:

P_m^{(j)}(y_{ij} | \theta_m^{(j)}) = \vartheta_{jm}^{\,y_{ij}} (1 - \vartheta_{jm})^{1 - y_{ij}}.   (5.8)
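The likelihood of Eqs. (5.2)-(5.6) is straightforward to evaluate numerically. The sketch below is illustrative only; the array layout chosen for the multinomial parameters is our own convention, not part of the model description above.

import numpy as np

def log_likelihood(Y, alpha, theta):
    """log L(Theta | Y) from Eqs. (5.2)-(5.6).
    Y:      (N, H) integer label matrix, y_ij in {0, ..., K(j)-1}
    alpha:  (M,) mixing weights of the consensus clusters
    theta:  list of H arrays, theta[j][m, k] = probability of label k of
            partition j under consensus cluster m (multinomial parameters)."""
    N, H = Y.shape
    M = len(alpha)
    ll = 0.0
    for i in range(N):
        # P_m(y_i | theta_m) = prod_j theta[j][m, y_ij]   (Eqs. 5.5-5.6)
        comp = np.array([np.prod([theta[j][m, Y[i, j]] for j in range(H)])
                         for m in range(M)])
        ll += np.log(np.dot(alpha, comp))        # Eq. (5.2) inside the log
    return ll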
The maximum likelihood problem in Eq. (5.3) generally cannot be solved in a closed form when all the parameters \Theta = {\alpha_1, ..., \alpha_M, \theta_1, ..., \theta_M} are unknown. However, the likelihood function in Eq. (5.2) can be optimized using the EM algorithm. In order to adopt the EM algorithm, we hypothesize the existence of hidden data Z and the likelihood of the complete data (Y, Z). If the value of z_i is known, then one could immediately tell which of the M mixture components was used to generate the point y_i. The detailed derivation of the EM solution to the mixture model with multivariate, multinomial components is given in the Appendix. Here we give only the equations for the E- and M-steps, which are repeated at each iteration of the algorithm:

E[z_{im}] = \frac{\alpha_m \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \vartheta_{jm}(k)^{\delta(y_{ij},k)}}{\sum_{n=1}^{M} \alpha_n \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \vartheta_{jn}(k)^{\delta(y_{ij},k)}},   (5.9)

\alpha_m = \frac{\sum_{i=1}^{N} E[z_{im}]}{\sum_{m=1}^{M} \sum_{i=1}^{N} E[z_{im}]},   (5.10)

\vartheta_{jm}(k) = \frac{\sum_{i=1}^{N} \delta(y_{ij},k) \, E[z_{im}]}{\sum_{i=1}^{N} \sum_{k=1}^{K(j)} \delta(y_{ij},k) \, E[z_{im}]}.   (5.11)

The solution to the consensus clustering problem is obtained by a simple inspection of the expected values of the variables E[z_{im}], due to the fact that E[z_{im}] represents the probability that the pattern y_i was generated by the m-th mixture component. Once convergence is achieved, a pattern y_i is assigned to the component which has the largest value of the hidden label z_i.

Figure 5.1. Four possible partitions of 12 data points into 2 clusters. Different partitions use different sets of labels.

Table 5.1. Clustering ensemble and consensus solution.

       π_1   π_2   π_3   π_4   E[z_i1]   E[z_i2]   Consensus
y1     2     B     X     β     0.999     0.001     1
y2     2     A     X     α     0.997     0.003     1
y3     2     A     Y     β     0.943     0.057     1
y4     2     B     X     β     0.999     0.001     1
y5     1     A     X     β     0.999     0.001     1
y6     2     A     Y     β     0.943     0.057     1
y7     2     B     Y     α     0.124     0.876     2
y8     1     B     Y     α     0.019     0.981     2
y9     1     B     Y     β     0.260     0.740     2
y10    1     A     Y     α     0.115     0.885     2
y11    2     B     Y     α     0.124     0.876     2
y12    1     B     Y     α     0.019     0.981     2

It is instructive to consider a simple example of an ensemble. Figure 5.1 shows four 2-cluster partitions of 12 two-dimensional data points. The correspondence problem is emphasized by the different label systems used by the partitions. Table 5.1 shows the expected values of the latent variables after 6 iterations of the EM algorithm and the resulting consensus clustering. In fact, a stable combination appears as early as the third iteration, and it corresponds to the true underlying structure of the data.
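A compact implementation of the E- and M-steps in Eqs. (5.9)-(5.11) is sketched below. It is an illustrative sketch only: the random initialization of the multinomial parameters and the fixed iteration count are our own choices, and the ensemble is assumed to be given as an N×H integer label matrix.

import numpy as np

def em_consensus(Y, M, n_iter=20, seed=0):
    """EM for the multinomial mixture over cluster labels (Eqs. 5.9-5.11).
    Y is an (N, H) matrix of ensemble labels; returns consensus labels."""
    rng = np.random.RandomState(seed)
    N, H = Y.shape
    K = [Y[:, j].max() + 1 for j in range(H)]          # labels 0..K(j)-1
    alpha = np.full(M, 1.0 / M)
    theta = [rng.dirichlet(np.ones(K[j]), size=M) for j in range(H)]
    for _ in range(n_iter):
        # E-step, Eq. (5.9): responsibilities E[z_im]
        resp = np.tile(alpha, (N, 1))
        for j in range(H):
            resp *= theta[j][:, Y[:, j]].T             # theta[j][m, y_ij]
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step, Eqs. (5.10)-(5.11)
        alpha = resp.sum(axis=0) / N
        for j in range(H):
            for k in range(K[j]):
                theta[j][:, k] = resp[Y[:, j] == k].sum(axis=0)
            theta[j] /= theta[j].sum(axis=1, keepdims=True)
    return resp.argmax(axis=1)                         # largest E[z_im]

For large H the products in the E-step should be computed in log space to avoid underflow; the plain products are kept here only to match the equations directly.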
Our mixture model of consensus admits a generalization for clustering ensembles with incomplete partitions. Such partitions can appear as a result of clustering subsamples or resampling a dataset. For example, a partition of a bootstrap sample only provides labels for the selected points. Therefore, an ensemble of such partitions is represented by a set of vectors of cluster labels with potentially missing components. Moreover, different vectors of cluster labels are likely to miss different components. Incomplete information can also arise when some clustering algorithms do not assign outliers to any of the clusters: different clusterings in a diverse ensemble can consider the same point x_i as an outlier or not, which results in missing components in the vector y. Yet another scenario leading to missing information can occur in clustering combination of distributed data or in an ensemble of clusterings of non-identical replicas of a dataset.

It is possible to apply the EM algorithm in the case of missing data [47], namely missing cluster labels for some of the data points. In these situations, each vector y_i in Y can be split into observed and missing components, y_i = (y_i^obs, y_i^mis). Incorporation of missing data leads to a slight modification of the computation of the E- and M-steps. First, the expected values E[z_im | y_i^obs, Θ'] are now inferred from the observed components of the vector y_i, i.e. the products in Eq. (5.9) are taken over the known labels only. Additionally, one must compute the expected values E[z_im y_i^mis | y_i^obs, Θ'] and substitute them, as well as E[z_im | y_i^obs, Θ'], in the M-step for the re-estimation of the parameters ϑ_jm(k). More details on handling missing data can be found in [47]. Though data with missing cluster labels can be obtained in different ways, we analyze only the case when the components of y_i are missing completely at random [116]. This means that the probability of a component being missing does not depend on other observed or unobserved variables. Note that the outcome of clustering of data subsamples (e.g., bootstrap) is different from clustering the entire data set and then deleting a random subset of labels. However, our goal is to present a consensus function for general settings. We expect that the experimental results for ensembles with missing labels are applicable, at least qualitatively, even to a combination of bootstrap clusterings.

The proposed ensemble clustering based on the mixture model consensus algorithm is summarized below. Note that any clustering algorithm can be used to generate the ensemble instead of the k-means algorithm shown in this pseudocode:

begin
  for i = 1 to H                  // H - number of clusterings
    cluster the dataset: π ← k-means(X)
    add the partition to the ensemble: Π ← {Π, π}
  end
  initialize model parameters Θ = {α_1, ..., α_M, θ_1, ..., θ_M}
  do until convergence criterion is satisfied
    compute expected values E[z_im], i = 1..N, m = 1..M
    compute E[z_im y_i^mis] for missing data (if any)
    re-estimate parameters ϑ_jm(k), j = 1..H, m = 1..M, for all k
  end
  π_C(x_i) = index of the component of z_i with the largest expected value, i = 1..N
  return π_C                      // consensus partition
end

The value of M, the number of components in the mixture, deserves a separate discussion that is beyond the scope of this work. Here, we assume that the target number of clusters is predetermined. It should be noted, however, that the mixture model in unsupervised classification greatly facilitates the estimation of the true number of clusters [37]. The maximum likelihood formulation of the problem specifically allows us to estimate M by using additional objective functions during the inference, such as the minimum description length of the model. In addition, the proposed consensus algorithm can be viewed as a version of Latent Class Analysis (see, e.g., [7]), which has rigorous statistical means for quantifying the plausibility of a candidate mixture model. Whereas the finite mixture model may not be valid for the patterns in the original space (the initial representation), this model more naturally explains the separation of groups of patterns in the space of "extracted" features (labels generated by the partitions).
This is somewhat reminiscent of classification approaches based on kernel methods, which rely on linear discriminant functions in the transformed space. For example, Support Vector Clustering [9] seeks spherical clusters after the kernel transformation, corresponding to more complex cluster shapes in the original pattern space.

5.4 Information-theoretic consensus of clusterings

Another candidate consensus function is based on the notion of a median partition. A median partition is the best summary of the existing partitions in Π. In contrast to the co-association approach, the median partition is derived from estimates of similarities between attributes (i.e., partitions in Π), rather than from similarities between objects. A well-known example of this approach is implemented in the COBWEB algorithm in the context of conceptual clustering [36]. The COBWEB clustering criterion estimates the partition utility, which is the sum of the category utility functions introduced by Gluck and Corter [48]. In our terms, the category utility function U(π_C, π_i) evaluates the quality of a candidate median partition π_C = {C_1, ..., C_K} against some other partition π_i = {L_1^i, ..., L_{K(i)}^i}, with labels L_j^i for the j-th cluster:

U(\pi_C, \pi_i) = \sum_{r=1}^{K} p(C_r) \sum_{j=1}^{K(i)} p(L_j^i | C_r)^2 - \sum_{j=1}^{K(i)} p(L_j^i)^2,   (5.12)

with the following notation: p(C_r) = |C_r| / N, p(L_j^i) = |L_j^i| / N, and p(L_j^i | C_r) = |L_j^i ∩ C_r| / |C_r|. The function U(π_C, π_i) assesses the agreement between two partitions as the difference between the expected number of labels of partition π_i that can be correctly predicted with the knowledge of the clustering π_C and the number that can be predicted without it. The category utility function can also be written as the Goodman-Kruskal index for the contingency table between the two partitions [49][93]. The overall utility of the partition π_C with respect to all the partitions in Π can be measured as the sum of the pair-wise agreements:

U(\pi_C, \Pi) = \sum_{i=1}^{H} U(\pi_C, \pi_i).   (5.13)

Therefore, the best median partition should maximize the value of the overall utility:

\pi_C^{best} = \arg\max_{\pi_C} U(\pi_C, \Pi).   (5.14)

Importantly, Mirkin [93] has proved that maximization of the partition utility in Eq. (5.13) is equivalent to minimization of the square-error clustering criterion if the number of clusters K in the target partition π_C is fixed. This is somewhat surprising, in that the partition utility function in Eq. (5.14) uses only the between-attribute similarity measure of Eq. (5.12), while the square-error criterion makes use of distances between objects and prototypes. A simple standardization of the categorical labels in {π_1, ..., π_H} effectively transforms them into quantitative features [93]. This allows us to compute real-valued distances and cluster centers. The transformation replaces the i-th partition π_i, assuming K(i) values, by K(i) binary features, and standardizes each binary feature to zero mean. In other words, for each object x we can compute the values of the new features ỹ_{ij}(x) as follows:

\tilde{y}_{ij}(x) = \delta(j, \pi_i(x)) - p(L_j^i), \quad \text{for } j = 1, ..., K(i), \; i = 1, ..., H.   (5.15)

Hence, the solution of the median partition problem in Eq. (5.14) can be approached by the k-means clustering algorithm operating in the space of the features ỹ_{ij}, if the number of target clusters is predetermined. We use this heuristic as a part of our empirical study of consensus functions.
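The label standardization of Eq. (5.15) followed by k-means is easy to sketch. The fragment below is only an illustration of this heuristic (scikit-learn's k-means is used for convenience, and the names are ours), not the implementation used in the experiments.

import numpy as np
from sklearn.cluster import KMeans

def qmi_consensus(Y, n_clusters, seed=0):
    """QMI-style consensus: standardize the categorical labels as in
    Eq. (5.15) and run k-means in the resulting feature space.
    Y is an (N, H) matrix of ensemble labels."""
    N, H = Y.shape
    cols = []
    for j in range(H):
        for lab in np.unique(Y[:, j]):
            indicator = (Y[:, j] == lab).astype(float)
            cols.append(indicator - indicator.mean())   # delta(j, pi_i(x)) - p(L_j^i)
    Z = np.column_stack(cols)
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(Z)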
Let us now consider the information-theoretic approach to the median partition problem. In this framework, the quality of the consensus partition π_C is determined by the amount of information I(π_C, Π) it shares with the given partitions in Π. Strehl and Ghosh [124] suggest an objective function that is based on the classical Shannon definition of mutual information:

\pi_C^{best} = \arg\max_{\pi_C} I(\pi_C, \Pi), \quad \text{where } I(\pi_C, \Pi) = \sum_{i=1}^{H} I(\pi_C, \pi_i),   (5.16)

I(\pi_C, \pi_i) = \sum_{r=1}^{K} \sum_{j=1}^{K(i)} p(C_r, L_j^i) \log \frac{p(C_r, L_j^i)}{p(C_r)\, p(L_j^i)}.   (5.17)

Again, an optimal median partition can be found by solving this optimization problem. However, it is not clear how to directly use these equations in a search for consensus. We show that another information-theoretic definition of entropy reduces the mutual information criterion to the category utility function discussed above. We proceed from the generalized entropy of degree s for a discrete probability distribution P = (p_1, ..., p_n) [56]:

H^{s}(P) = (2^{1-s} - 1)^{-1} \left( \sum_{i=1}^{n} p_i^{s} - 1 \right), \quad s > 0, \; s \neq 1.   (5.18)

Shannon's entropy is the limit form of Eq. (5.18):

\lim_{s \to 1} H^{s}(P) = -\sum_{i=1}^{n} p_i \log p_i.   (5.19)

Generalized mutual information between π and π_C can be defined as:

I^{s}(\pi, \pi_C) = H^{s}(\pi) - H^{s}(\pi | \pi_C).   (5.20)

Quadratic entropy (s = 2) is of particular interest, since it is known to be closely related to the classification error when used in the probabilistic measure of inter-class distance. When s = 2, the generalized mutual information I^2(π_C, π_i) becomes:

I^{2}(\pi_C, \pi_i) = -2\left[ \sum_{j=1}^{K(i)} p(L_j^i)^2 - 1 \right] + 2 \sum_{r=1}^{K} p(C_r) \left[ \sum_{j=1}^{K(i)} p(L_j^i | C_r)^2 - 1 \right]
                   = 2 \sum_{r=1}^{K} p(C_r) \sum_{j=1}^{K(i)} p(L_j^i | C_r)^2 - 2 \sum_{j=1}^{K(i)} p(L_j^i)^2 = 2\, U(\pi_C, \pi_i).   (5.21)

Therefore, generalized mutual information gives the same consensus clustering criterion as the category utility function in Eq. (5.13). Moreover, the traditional Gini-index measure for attribute selection also follows from Eqs. (5.12) and (5.21). In light of Mirkin's result, all these criteria are equivalent to within-cluster variance minimization after a simple label transformation. Quadratic mutual information, the mixture model and other interesting consensus functions have been used in our comparative empirical study.

5.5 Combination of Weak Clusterings

The previous sections addressed the problem of clustering combination, namely how to formulate the consensus function regardless of the nature of the individual partitions in the combination. We now turn to the issue of generating different clusterings for the combination. There are several principal questions. Do we use the partitions produced by the numerous clustering algorithms available in the literature? Can we relax the requirements for the clustering components? There are several existing methods to provide diverse partitions:

1. Use different clustering algorithms, e.g. k-means, mixture of Gaussians, spectral, single-link, etc. [124].
2. Exploit built-in randomness or different parameters of some algorithms, e.g. initializations and various values of k in the k-means algorithm [39][40][84].
3. Use many subsamples of the data set, such as bootstrap samples [29][91].

These methods rely on clustering algorithms which are powerful on their own, and as such are computationally involved. We argue that it is possible to generate the partitions using weak, but less expensive, clustering algorithms and still achieve comparable or better performance. Certainly, the key motivation is that the synergy of many such components will compensate for their weaknesses. We consider two simple clustering algorithms:

1. Clustering of the data projected to a random subspace. In the simplest case, the data is projected onto a 1-dimensional subspace, a random line. The k-means algorithm clusters the projected data and gives a partition for the combination.
2. Random splitting of the data by hyperplanes.
For example, a single random hyperplane would create a rather trivial clustering of d-dimensional data by cutting the hypervolume into two regions. We will show that both approaches are capable of producing high quality consensus clusterings in conjunction with a proper consensus function.

5.5.1 Splitting by Random Hyperplanes

Direct clustering by use of a random hyperplane illustrates how a reliable consensus emerges from low-informative components. The random splits approach pushes the notion of weak clustering almost to an extreme. The data set is cut by random hyperplanes dissecting the original volume of the d-dimensional space containing the points. Points separated by the hyperplanes are declared to be in different clusters. Hence, the output clusters are convex. In this situation, a co-association consensus function is appropriate, since the only information needed is whether the patterns are in the same cluster or not. Thus the contribution of a hyperplane partition to the co-association value for any pair of objects can be either 0 or 1. Finer resolutions of distance are possible by counting the number of hyperplanes separating the objects, but for simplicity we do not use them here.

Figure 5.2. Clustering by a random hyperplane: (a) An example of splitting the 2-spiral data set by a random line; points on the same side of the line are in the same cluster. (b) Probability of splitting two one-dimensional objects for 1, 2, 3 and 4 random thresholds, as a function of the distance between the objects.

Consider a random line dissecting the classic 2-spiral data shown in Figure 5.2(a). While any one such partition does little to reveal the true underlying clusters, analysis of the hyperplane generating mechanism shows how multiple such partitions can discover the true clusters. Consider first the case of one-dimensional data. Splitting of objects in 1-dimensional space is done by a random threshold in R^1. In general, if r thresholds are randomly selected, then (r+1) clusters are formed. It is easy to derive that, in 1-dimensional space, the probability of separating two objects whose inter-point distance is x is exactly:

P(split) = 1 - (1 - x/L)^r,   (5.22)

where L is the length of the interval containing the objects, and the r threshold points are drawn at random from a uniform distribution on this interval. Figure 5.2(b) illustrates this dependence for L = 1 and r = 1, 2, 3, 4. If a co-association matrix is used to combine H different partitions, then the expected value of the co-association between two objects is H(1 - P(split)), which follows from the binomial distribution of the number of splits in H attempts. Therefore, the co-association values found after combining many random split partitions are generally expected to be a non-linear and monotonic function of the respective distances.
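Eq. (5.22) is easy to verify numerically. The following sketch is a Monte Carlo check added here only for illustration (all names are ours): it estimates the probability that r uniform random thresholds separate two points a distance x apart, which can be compared against 1 - (1 - x/L)^r.

import numpy as np

def split_probability(x, L=1.0, r=1, trials=100_000, seed=0):
    """Monte Carlo estimate of the probability that r random thresholds on
    [0, L] separate two points a distance x apart (compare with Eq. 5.22)."""
    rng = np.random.RandomState(seed)
    a = rng.uniform(0, L - x, size=trials)          # position of the left object
    cuts = rng.uniform(0, L, size=(trials, r))
    split = ((cuts > a[:, None]) & (cuts < (a + x)[:, None])).any(axis=1)
    return split.mean()

# e.g. x = 0.3, r = 2: the formula gives 1 - 0.7**2 = 0.51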
The situation is similar for multidimensional data; however, the generation of random hyperplanes is a bit more complex. To generate a random hyperplane in d dimensions, we should first draw a random point in the multidimensional region to serve as a point of origin, and then randomly choose a unit normal vector u that defines the hyperplane. Two objects characterized by vectors p and q (taken relative to the chosen origin) will be in the same cluster if (u·p)(u·q) > 0 and will be separated otherwise (here a·b denotes the scalar product of a and b). If r hyperplanes are generated, then the total probability that two objects remain in the same cluster is just the product of the probabilities that each of the hyperplanes does not split the objects. Thus we can expect that the law governing the co-association values is close to what is obtained in 1-dimensional space in Eq. (5.22).

Let us compare the actual dependence of the co-association values with the function in Eq. (5.22). Figure 5.3 shows the results of experiments with 1000 different partitions obtained by random splits of the Iris data set. The Iris data is 4-dimensional and contains 150 points; there are 11,175 pair-wise distances between the data items. For all the possible pairs of points, each plot in Figure 5.3 shows the number of times a pair was split.

Figure 5.3. Dependence of the distances derived from the co-association values vs. the actual Euclidean distance x for each possible pair of objects in the Iris data. Co-association matrices were computed for different numbers of hyperplanes, r = 1, 2, 3, 4.

The observed dependence of the inter-point "distances" derived from the co-association values on the true Euclidean distance can, indeed, be described by the function in Eq. (5.22). Clearly, the inter-point distances dictate the behavior of the respective co-association values. The probability of a cut between any two given objects does not depend on the other objects in the data set. Therefore, we can conclude that any clustering algorithm that works well with the original inter-point distances is also expected to work well with the co-association values obtained from a combination of multiple partitions by random splits. However, this result is more of theoretical value when the true distances are available, since they can be used directly instead of the co-association values. It illustrates the main idea of the approach, namely that the synergy of multiple weak clusterings can be very effective. We present an empirical study of the clustering quality of this algorithm in the experimental section.
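For illustration, a weak partition produced by r random hyperplanes could be generated as sketched below. This is our own sketch of the generation scheme just described (the origin is drawn uniformly from the bounding box of the data and the unit normal from an isotropic Gaussian); it is not code from this work.

import numpy as np

def random_hyperplane_partition(X, r=1, seed=0):
    """Weak clustering by r random hyperplanes: each object is labeled by
    the pattern of sides it falls on, giving up to 2**r convex cells."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float)
    labels = np.zeros(len(X), dtype=int)
    lo, hi = X.min(axis=0), X.max(axis=0)
    for _ in range(r):
        origin = rng.uniform(lo, hi)               # random point of origin
        u = rng.normal(size=X.shape[1])
        u /= np.linalg.norm(u)                     # random unit normal vector
        side = ((X - origin) @ u > 0).astype(int)  # which side of the hyperplane
        labels = 2 * labels + side                 # combine the side indicators
    # relabel the occupied cells with consecutive integers
    _, labels = np.unique(labels, return_inverse=True)
    return labels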
5.5.2 Combination of Clusterings in Random Subspaces

Random subspaces are an excellent source of clustering diversity that provides different views of the data. Projective clustering is an active topic in data mining. For example, algorithms such as CLIQUE [2] and DOC [105] can discover both useful projections and data clusters. Here, however, we are only concerned with the use of random projections for the purpose of clustering combination. Each random subspace can be of very low dimension and is by itself somewhat uninformative. On the other hand, clustering in 1-dimensional space is computationally cheap and can be effectively performed by the k-means algorithm. The main subroutine of the k-means algorithm, distance computation, becomes d times faster in 1-dimensional space. The cost of a projection is linear with respect to the sample size and the number of dimensions, O(Nd), and is less than the cost of one k-means iteration.

The main idea of our approach is to generate multiple partitions by projecting the data onto a random line. A fast and simple algorithm such as k-means clusters the projected data, and the resulting partition becomes a component in the combination. Afterwards, a chosen consensus function is applied to the components. We discuss and compare several consensus functions in the experimental section.

Figure 5.4. Projecting data on a random line: (a) A sample data set with two identifiable natural clusters and a line randomly selected for the projection. (b) Histogram of the distribution of points resulting from the projection of the data onto the random line.

It is instructive to consider a simple 2-dimensional data set and one of its projections, as illustrated in Figure 5.4(a). There are two natural clusters in the data. This data looks the same in any 1-dimensional projection, but the actual distribution of points differs between the clusters in the projected subspace. For example, Figure 5.4(b) shows one possible histogram distribution of points in a 1-dimensional projection of this data. There are three identifiable modes, each having a clear majority of points from one of the two classes. One can expect that clustering by the k-means algorithm will reliably separate at least a portion of the points of the outer ring cluster. It is easy to imagine that projecting the data in Figure 5.4(a) onto another random line would result in a different distribution of points and different label assignments, but for this particular data set it will always appear as a mixture of three bell-shaped components. Most probably, these modes will be identified as clusters by the k-means algorithm. Thus each new 1-dimensional view correctly helps to group some data points. Accumulation of multiple views eventually should result in a correct combined clustering.

The major steps of combining the clusterings using random 1-d projections are described by the following procedure:

begin
  for i = 1 to H                 // H is the number of clusterings in the combination
    generate a random vector u, s.t. |u| = 1
    project all data points {x_j}: {y_j} ← {u x_j}, j = 1...N
    cluster the projections {y_j}: π(i) ← k-means({y_j})
  end
  combine the clusterings via a consensus function: σ ← {π(i)}, i = 1...H
  return σ                       // consensus partition
end

The important parameter is the number of clusters in the component partition π_i returned by the k-means algorithm at each iteration, i.e. the value of k. If the value of k is too large, then the partitions {π_i} will overfit the data set, which in turn may cause unreliability of the co-association values. Too small a number of clusters in {π_i} may not be enough to capture the true structure of the data set. In addition, if the number of clusterings in the combination is too small, then the effective sample size for the estimates of distances from the co-association values is also insufficient, resulting in a larger variance of the estimates. That is why the consensus functions based on co-association values are more sensitive to the number of partitions in the combination (the value of H) than the consensus functions based on hypergraph algorithms.
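A direct transcription of this procedure is sketched below (illustrative only; scikit-learn's k-means is used for the 1-d clustering step and the function name is ours). The returned H×N label matrix can then be passed to any consensus function discussed above, e.g. the co-association or EM sketches given earlier (the latter expects the transposed, N×H layout).

import numpy as np
from sklearn.cluster import KMeans

def projection_ensemble(X, H=20, k=3, seed=0):
    """Generate H weak partitions by clustering random 1-d projections."""
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float)
    partitions = []
    for _ in range(H):
        u = rng.normal(size=X.shape[1])
        u /= np.linalg.norm(u)                     # random unit vector, |u| = 1
        y = X @ u                                  # 1-d projections u x_j
        km = KMeans(n_clusters=k, n_init=1,
                    random_state=rng.randint(1 << 30)).fit(y.reshape(-1, 1))
        partitions.append(km.labels_)
    return np.vstack(partitions)                   # shape (H, N)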
Five datasets of different nature have been used in the experiments. The "Biochemical" and "Galaxy" data sets are described in [1] and [99], respectively.

Table 5.2. Characteristics of the datasets.

 Dataset     No. of features  No. of classes  No. of points/class  Total no. of points  Av. k-means error (%)
 Biochem.    7                2               2138-3404            5542                 47.4
 Galaxy      14               2               2082-2110            4192                 21.1
 2-spirals   2                2               100-100              200                  43.5
 Half-rings  2                2               100-300              400                  25.6
 Iris        4                3               50-50-50             150                  15.1

We evaluated the performance of the evidence accumulation clustering algorithms by matching the detected and the known partitions of the datasets. The best possible matching of clusters provides a measure of performance expressed as the misassignment rate. To determine the clustering error, one needs to solve the correspondence problem between the labels of the known and the derived clusters. The optimal correspondence can be obtained using the Hungarian method for the minimal weight bipartite matching problem, with O(k³) complexity for k clusters.

5.6.2 Selection of Parameters and Algorithms

The accuracy of the QMI and EM consensus algorithms has been compared to six other consensus functions:

1. CSPA for partitioning of hypergraphs induced from the co-association values. Its complexity is O(N²), which leads to severe computational limitations. We did not apply this algorithm to the "Galaxy" [99] and "Biochemical" [1] data. For the same reason, we did not use other co-association methods, such as single-link clustering. The performance of these methods was already analyzed in [38][39].
2. HGPA for hypergraph partitioning.
3. MCLA, which modifies HGPA via an extended set of hyperedge operations and additional heuristics.
4. Consensus functions operating on the co-association matrix, but with three different hierarchical clustering algorithms for obtaining the final partition, namely single-linkage, average-linkage, and complete-linkage.

The first three methods (CSPA, HGPA and MCLA) were introduced in [47] and their code is available at http://www.strehl.com. The k-means algorithm was used as the method of generating the partitions for the combination. Diversity of the partitions is ensured by the solutions obtained after random initializations of the algorithm. The following parameters of the clustering ensemble are especially important:

i. H - the number of combined clusterings. We varied this value in the range [5..50].
ii. k - the number of clusters in the component clusterings {π1,..., πH} produced by the k-means algorithm, taken in the range [2..10].
iii. r - the number of hyperplanes used for obtaining clusterings {π1,..., πH} by the random splitting algorithm.

Both the EM and QMI algorithms are susceptible to the presence of local minima of their objective functions. To reduce the risk of convergence to a lower-quality solution, we used a simple heuristic afforded by the low computational complexity of these algorithms: the final partition was picked from the results of three runs (with random initializations) according to the value of the objective function. The highest value of the likelihood function served as the criterion for the EM algorithm, and the within-cluster variance served as the criterion for the QMI algorithm.

5.6.3 Experiments with Complete Partitions

The main results for each of the data sets are presented in Tables 5.3-5.7. The tables report the mean error rate (%) of clustering combination from 10 independent runs for the relatively large biochemical and astronomical data sets and from 20 runs for the other, smaller data sets.
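For reference, the misassignment rate used throughout these tables can be computed as sketched below (an illustrative sketch, not the dissertation's code; it assumes numpy and scipy are available, and uses scipy's implementation of the Hungarian method to find the best cluster-to-class correspondence):

import numpy as np
from scipy.optimize import linear_sum_assignment

def misassignment_rate(true_labels, found_labels):
    # Build the contingency table between found clusters and true classes.
    true_labels = np.asarray(true_labels)
    found_labels = np.asarray(found_labels)
    clusters = np.unique(found_labels)
    classes = np.unique(true_labels)
    C = np.zeros((clusters.size, classes.size), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            C[i, j] = np.sum((found_labels == c) & (true_labels == t))
    # Hungarian method: maximize the number of correctly matched points.
    rows, cols = linear_sum_assignment(-C)
    matched = C[rows, cols].sum()
    return 1.0 - matched / true_labels.size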
The first observation is that none of the consensus functions is the absolute winner. Good performance was achieved by different combination algorithms across the values of the parameters k and H. The EM algorithm slightly outperforms the other algorithms for ensembles of smaller size, while MCLA is superior when the number of clusterings H > 20. However, ensembles of very large size are less important in practice. All co-association methods are usually unreliable when the number of clusterings H < 50, and this is where we position the proposed EM algorithm. Both the EM and QMI consensus functions need to estimate at least kHM parameters. Therefore, accuracy degradation will inevitably occur with an increase in the number of partitions when the sample size is fixed. However, there was no noticeable decrease in the accuracy of the EM algorithm in the current experiments.

Figure 5.5. The "2-spirals" and "Half-rings" datasets, which are difficult for any centroid-based clustering algorithm.

The EM algorithm should also benefit from datasets of large size due to the improved reliability of model parameter estimation. A valuable property of the EM consensus algorithm is its fast convergence rate. Mixture model parameter estimates nearly always converged in fewer than 10 iterations for all the datasets. Moreover, pattern assignments were typically settled in 4-6 iterations.

Clustering combination accuracy also depends on the number of clusters k in the ensemble partitions, or, more precisely, on its ratio to the target number of clusters M, i.e. k/M. For example, the EM algorithm worked best with k=3 for the Iris dataset, k=3,4 for the "Galaxy" dataset and k=2 for the "Half-rings" data. These values of k are equal to or slightly greater than the number of clusters in the combined partition. In contrast, the accuracy of MCLA slightly improves with an increase in the number of clusters in the ensemble. Figure 5.7 shows the error as a function of k for different consensus functions for the Galaxy data. It is also interesting to note that, as expected, the average error of consensus clustering was lower than the average error of the k-means clusterings in the ensemble (Table 5.2) when k is chosen to be equal to the true number of clusters. Moreover, the clustering error obtained by the EM and MCLA algorithms with k=4 for the "Biochemistry" data [1] was the same as that found by supervised classifiers applied to this dataset [112].

5.6.4 Experiments with Incomplete Partitions

This set of experiments focused on the dependence of clustering accuracy on the number of patterns with missing cluster labels. As before, an ensemble of partitions was generated using the k-means algorithm. Then, we randomly deleted the cluster labels for a fixed number of patterns in each of the partitions. The EM consensus algorithm was used on such an ensemble. The number of missing labels in each partition was varied from 10% to 50% of the total number of patterns. The main results, averaged over 10 independent runs, are reported in Table 5.8 for the "Galaxy" and "Biochemistry" datasets for various values of H and k.
A typical dependence of the error on the number of patterns with missing data is also shown for the Iris data in Figure 5.6 (H=5, k=3). One can note that the combination accuracy decreases only insignificantly for the "Biochemistry" data when up to 50% of the labels are missing. This can be explained by the low inherent accuracy for this data, which leaves little room for further degradation. For the "Galaxy" data, the accuracy drops by almost 10% when k=3,4. However, when just 10-20% of the cluster labels are missing, there is only a small change in accuracy. Also, with different values of k, we see different sensitivity of the results to the missing labels. For example, with k=2, the accuracy drops by only slightly more than 1%. Ensembles of larger size H=10 suffered less from missing data than ensembles of size H=5.

Table 5.3. Mean error rate (%) for the "Galaxy" dataset.

            Type of Consensus Function
 H   k    EM    QMI   HGPA  MCLA
 5   2    18.9  19.0  50.0  18.9
 5   3    11.6  13.0  50.0  13.5
 5   4    11.3  13.0  50.0  11.7
 5   5    13.9  18.0  50.0  14.3
 5   7    14.5  21.9  50.0  15.6
 5   10   13.4  31.1  50.0  15.4
 10  2    18.8  18.8  50.0  18.8
 10  3    14.9  15.0  50.0  14.8
 10  4    11.6  11.1  50.0  12.0
 10  5    14.5  13.0  50.0  13.6
 15  2    18.8  18.8  50.0  18.8
 15  3    14.0  13.3  50.0  14.8
 15  4    11.7  11.5  50.0  11.6
 15  5    12.9  11.5  50.0  12.9
 20  2    18.8  18.9  50.0  18.8
 20  3    12.8  11.7  50.0  14.3
 20  4    11.0  10.8  50.0  11.5
 20  5    16.2  12.1  50.0  12.3

Figure 5.6. Consensus clustering error rate as a function of the number of missing labels in the ensemble for the Iris dataset, H=5, k=3.

5.6.5 Results of the random subspaces algorithm

Let us start by demonstrating how the combination of clusterings in projected 1-dimensional subspaces outperforms the combination of clusterings in the original multidimensional space. Figure 5.8(a) shows the learning dynamics for the Iris data and k=4, using the average-link consensus function based on co-association values. Note that the number of clusters in each of the components {π1,..., πH} is set to k=4, which is different from the true number of clusters (=3). Clearly, each individual clustering in the full multidimensional space is much stronger than any 1-dimensional partition, and therefore with only a small number of partitions (H<50) the combination of the weaker partitions is not yet effective. However, for larger numbers of combined partitions (H>50), the 1-dimensional projections together better reveal the true structure of the data. This is quite unexpected, since the k-means algorithm with k=3 makes, on average, 19 mistakes in the original 4-dimensional space and 25 mistakes in a 1-dimensional random subspace. Moreover, clustering in the projected subspace is d times faster than in the multidimensional space. However, the cost of computing a consensus partition σ is the same in both cases.

The results regarding the impact of the value of k are reported in Figure 5.8(b), which shows that there is a critical value of k for the Iris data set. This occurs when the average-linkage of co-association distances is used as a consensus function. In this case the value k=2 is not enough to separate the true clusters.
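The co-association consensus used in these experiments can be sketched as follows (a minimal sketch under the assumption that numpy and scipy are available; the function name is illustrative and this is not the dissertation's code). It accumulates the co-association matrix from the ensemble of label vectors and clusters the induced distances with a chosen linkage:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def coassociation_consensus(partitions, n_clusters, method="average"):
    # partitions: array of shape (H, N) holding cluster labels for N points in H clusterings.
    P = np.asarray(partitions)
    H, N = P.shape
    coassoc = np.zeros((N, N))
    for labels in P:
        coassoc += (labels[:, None] == labels[None, :])
    coassoc /= H                                  # fraction of partitions placing i and j together
    dist = 1.0 - coassoc                          # co-association values turned into "distances"
    np.fill_diagonal(dist, 0.0)
    Z = linkage(squareform(dist, checks=False), method=method)   # single/average/complete link
    return fcluster(Z, t=n_clusters, criterion="maxclust")

The choice of method ("single", "average" or "complete") corresponds to the three co-association consensus functions compared in the tables that follow.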
Table 5.4. Mean error rate (%) for the "Half-rings" dataset.

            Type of Consensus Function
 H   k    EM    QMI   CSPA  HGPA  MCLA
 5   2    25.4  25.4  25.5  50.0  25.4
 5   3    24.0  36.8  26.2  48.8  25.1
 10  2    26.7  33.2  28.6  50.0  23.7
 10  3    33.5  39.7  24.9  26.0  24.2
 30  2    26.9  40.6  26.2  50.0  26.0
 30  3    29.3  35.9  26.2  27.5  26.2
 50  2    27.2  32.3  29.5  50.0  21.1
 50  3    28.8  35.3  25.0  24.8  24.6

Table 5.5. Mean error rate (%) for the "Biochemistry" dataset.

            Type of Consensus Function
 H   k    EM    QMI   MCLA
 5   2    44.8  44.8  44.8
 5   3    43.2  48.8  44.7
 5   4    42.0  45.6  42.7
 5   5    42.7  44.3  46.3
 10  2    45.0  45.1  45.1
 10  3    44.3  45.4  40.2
 10  4    39.3  45.1  37.3
 10  5    40.6  45.0  41.2
 20  2    45.1  45.2  45.1
 20  3    46.6  47.4  42.0
 20  4    37.2  42.6  39.8
 20  5    40.5  42.1  39.9
 30  2    45.3  45.3  45.3
 30  3    47.1  48.3  46.8
 30  4    37.3  42.3  42.8
 30  5    39.9  42.9  38.4
 50  2    45.2  45.3  45.2
 50  3    46.9  48.3  44.6
 50  4    40.1  39.7  42.8
 50  5    39.4  38.1  42.1

Table 5.6. Mean error rate (%) for the "2-spirals" dataset.

            Type of Consensus Function
 H   k    EM    QMI   CSPA  HGPA  MCLA
 5   2    43.5  43.6  43.9  50.0  43.8
 5   3    41.1  41.3  39.9  49.5  40.5
 5   5    41.2  41.0  40.0  43.0  40.0
 5   7    45.9  45.4  45.4  42.4  43.7
 5   10   47.3  45.4  47.7  46.4  43.9
 10  2    43.4  43.7  44.0  50.0  43.9
 10  3    36.9  40.0  39.0  49.2  41.7
 10  5    38.6  39.4  38.3  40.6  38.9
 10  7    46.7  46.7  46.2  43.0  45.7
 10  10   46.7  45.6  47.7  47.1  42.4
 20  2    43.3  43.6  43.8  50.0  43.9
 20  3    40.7  40.2  37.1  49.3  40.0
 20  5    38.6  39.5  38.2  40.0  38.1
 20  7    45.9  47.6  46.7  44.4  44.2
 20  10   48.2  47.2  48.7  47.3  42.2

Figure 5.7. Consensus error as a function of the number of clusters in the contributing partitions for the Galaxy data and ensemble size H=20.

Table 5.7. Mean error rate (%) for the Iris dataset.

            Type of Consensus Function
 H   k    EM    QMI   CSPA  HGPA  MCLA
 5   3    11.0  14.7  11.2  41.4  10.9
 10  3    10.8  10.8  11.3  38.2  10.9
 15  3    10.9  11.9  9.8   42.8  11.1
 20  3    10.9  14.5  9.8   39.1  10.9
 30  3    10.9  12.8  7.9   43.4  11.3
 40  3    11.0  12.4  7.7   41.9  11.1
 50  3    10.9  13.8  7.9   42.7  11.2

Table 5.8. Clustering error rate of the EM algorithm as a function of the number of missing labels for the large datasets.

 H   k   Missing labels (%)   "Galaxy" error (%)   "Biochem." error (%)
 5   2   10                   18.81                45.18
 5   2   20                   18.94                44.73
 5   2   30                   19.05                45.08
 5   2   40                   19.44                45.64
 5   2   50                   19.86                46.23
 5   3   10                   12.95                43.79
 5   3   20                   13.78                43.89
 5   3   30                   14.92                45.67
 5   3   40                   19.58                47.88
 5   3   50                   23.31                48.41
 5   4   10                   11.56                43.10
 5   4   20                   11.98                43.59
 5   4   30                   14.36                44.50
 5   4   40                   17.34                45.12
 5   4   50                   24.47                45.62
 10  2   10                   18.87                45.14
 10  2   20                   18.85                45.26
 10  2   30                   18.86                45.28
 10  2   40                   18.93                45.13
 10  2   50                   19.85                45.35
 10  3   10                   13.44                44.97
 10  3   20                   14.46                45.20
 10  3   30                   14.69                47.91
 10  3   40                   14.40                47.21
 10  3   50                   15.65                46.92
 10  4   10                   11.06                39.15
 10  4   20                   11.17                37.81
 10  4   30                   11.32                40.41
 10  4   40                   15.07                37.78
 10  4   50                   16.46                41.56

Figure 5.8. Performance of the random subspaces algorithm on the Iris data. (a) Number of errors by the combination of k-means partitions (k=4) in the multidimensional space and in projected 1-d subspaces; the average-link consensus function was used. (b) Accuracy of the projection algorithm as a function of the number of components and the number of clusters k in each component.

The role of the consensus function is illustrated in Figure 5.9. Three consensus functions are compared on the Iris data set.
They all use similarities from the co-association matrix but cluster the objects using three different criterion functions, namely single link, average link and complete link. It is clear that the combination using single-link performs significantly worse than the other two consensus functions. This is expected, since the three classes in the Iris data have a hyperellipsoidal shape.

More results were obtained on the "half-rings" and "2-spirals" data sets shown in Figure 5.5, which are traditionally difficult for any partitional centroid-based algorithm. Table 5.9 reports the error rates for the "2-spirals" data using seven different consensus functions, different numbers of component partitions H = [5..500] and different numbers of clusters in each component k = 2, 4, 10. Similar results were obtained for the "half-rings" data set under the same experimental conditions and the same intermediate values of k. As we see, the single-link consensus function performed the best and was able to identify both the "half-rings" clusters and the spirals. In contrast to the results for the Iris data, average-link and complete-link consensus were not suitable for these data sets.

Table 5.9. "2-spirals" data experiments. Average error rate (%, over 20 runs) of clustering combination using the random 1-d projections algorithm with different numbers of components in the combination H, different resolutions of the components k, and seven types of consensus functions.

                  Co-association methods                    Hypergraph methods       Median partition
 H     k     Single link  Average link  Complete link   CSPA   HGPA   MCLA      QMI
 5     2     47.8         45.0          44.9            41.7   50.0   40.8      40.0
 10    2     48.0         44.3          43.1            41.6   50.0   40.4      40.0
 50    2     44.2         43.5          41.6            43.1   50.0   41.1      39.3
 100   2     46.2         43.9          42.1            46.2   50.0   43.2      42.6
 200   2     44.5         42.5          41.9            43.3   50.0   40.1      40.7
 500   2     41.6         43.5          39.6            46.4   50.0   44.0      43.3
 5     4     48.6         45.9          46.8            43.4   44.9   43.8      44.7
 10    4     47.4         47.5          48.7            44.1   43.8   42.6      44.0
 50    4     35.2         48.5          46.2            44.9   39.9   42.3      44.2
 100   4     29.5         49.0          47.0            44.2   39.2   39.5      42.9
 200   4     27.8         49.2          46.0            47.7   38.3   37.2      39.0
 500   4     4.4          49.5          44.0            48.1   39.4   45.0      43.4
 5     10    48.0         42.6          45.3            42.9   42.4   42.8      45.3
 10    10    44.7         44.9          44.7            42.4   42.6   42.8      44.2
 50    10    9.4          47.0          44.3            43.5   42.4   42.2      42.8
 100   10    0.9          46.8          47.4            41.8   41.1   42.3      44.2
 200   10    0.0          47.0          45.8            42.4   38.9   44.5      40.0
 500   10    0.0          47.3          43.4            43.3   35.2   44.8      37.4

This observation supports the idea that the accuracy of the consensus functions based on co-association values is sensitive to the choice of data set. In general, one can expect that average-link (single-link) consensus will be appropriate if standard average-link (single-link) agglomerative clustering works well for the data, and vice versa. Moreover, none of the three hypergraph consensus functions could find a correct combined partition. This is somewhat surprising given that the hypergraph algorithms performed well on the Iris data. However, the Iris data is far less problematic because one of the clusters is linearly separable and the other classes are well described as a mixture of two multivariate normal distributions.

Figure 5.9. Dependence of the accuracy of the projection algorithm on the type of consensus function for the Iris data set, k=3.
Perfect separation of the natural clusters was achieved with a large number of partitions in the clustering combination (H > 200) and for values of k > 3 for the "half-rings" and "2-spirals" data. Again, this indicates that for each problem there is a critical value of the resolution of the component partitions that guarantees a good clustering combination. This further supports the work of Fred and Jain [40][41], who showed that a random number of clusters in each partition ensures a greater diversity of components. We see that the minimal required value of the resolution for the Iris data is k=3, for "half-rings" it is k=2 and for "2-spirals" it is k=4. In general, the value of k should be larger than the true number of clusters.

The number of partitions affects the relative performance of the consensus functions. With large values of H (>100), co-association consensus becomes stronger, while with small values of H it is preferable to use hypergraph algorithms or the k-means median partition algorithm.

It is interesting to compare the combined clustering accuracy with the accuracy of some of the classical clustering algorithms. For example, for the Iris data the EM algorithm has the best average error rate of 6.17%. In our experiments, the best performers for the Iris data were the hypergraph methods, with an error rate as low as 3% for H > 200 and k > 5. For the "half-rings" data, the best standard result is a 5.25% error by the average-link algorithm, while the combined clustering using the single-link co-association algorithm achieved a 0% error with H > 200. Also, for the "2-spirals" data the clustering combination achieves 0% error, the same as regular single-link clustering. Hence, with an appropriate choice of consensus function, clustering combination outperforms many standard clustering algorithms. However, the choice of a good consensus function is similar to the problem of choosing a good conventional clustering algorithm. Perhaps a good alternative to guessing the right consensus function is simply to run all the available consensus functions and then pick the final consensus partition according to the partition utility criterion in Eq. (5.21). We hope to address this in future applications of the method.

Figure 5.10. Number of misassigned points by the random hyperplanes algorithm in clustering of the "2-spirals" dataset: (a) for different numbers of hyperplanes, (b) for different consensus functions.

Another set of experiments was performed on the "Galaxy" dataset, which has a significantly larger number of samples, N = 4192, and number of features, d = 14. The task is to separate patterns of galaxies from stars. We used the most difficult set of "faint" objects from the original data [99]. True labels for the objects were provided manually by experts. Even though computation of the component partitions is d times faster due to projection, the overall computational effort can be dominated by the complexity of computing the consensus partition. Quadratic computational complexity effectively prohibits co-association based consensus functions from being used on large data sets, since an O(N²) factor arises when the co-association matrix is built for all the objects. Therefore, for large datasets we do not use the three hierarchical agglomerative methods or the CSPA hypergraph algorithm by Strehl and Ghosh [124].
The k-means algorithm for the median partition via QMI is the most attractive in terms of speed, with a complexity of O(kNH). In addition, we also used two other hypergraph-based consensus functions, since they worked fast in practice. Table 5.10 reports the error rates achieved with these consensus functions. We also limited the number of components in the combination to H=20 because of the large data size. The results show that the k-means algorithm for the median partition has the best performance. HGPA did not work well due to its bias toward balanced cluster sizes, as also happened in the case of the "half-rings" data set. We see that the accuracy improves when the number of partitions and clusters increases. It is important to note that the average error rate of the standard k-means algorithm for the "Galaxy" data is about 20%, and the best known solution has an error rate of 18.6%. It is quite significant that the k-means median partition algorithm and MCLA obtained a much better partition, with an error rate of only around 13% for k>3.

Table 5.10. Average error rate (%, over 20 runs) of combination clustering using the random projections algorithm on the "Galaxy/star" data set.

             Hypergraph methods   Median partition
 H    k      HGPA    MCLA         QMI
 5    2      49.7    20.0         20.4
 10   2      49.7    23.5         21.1
 20   2      49.7    21.0         18.0
 5    3      49.7    22.0         21.7
 10   3      49.7    17.7         13.7
 20   3      49.7    15.8         13.3
 5    4      49.7    19.7         16.7
 10   4      49.7    16.9         15.5
 20   4      49.7    14.1         13.2
 5    5      49.7    22.0         22.5
 10   5      49.7    17.7         17.4
 20   5      49.6    15.2         12.9

5.6.6 Results of the random splitting algorithm

The same set of experiments was performed with clustering combination via splits by random hyperplanes, as described in Section 5.5.1. Here we would like to emphasize only the main and most interesting observations, because the results are in many details close to what has been obtained using random subspaces. There is little difference in terms of absolute performance: the random hyperplanes algorithm is slightly better on the "half-rings" data using the single-link consensus function, about the same on "2-spirals", and worse on the Iris data set. It is important to distinguish the number of clusters k in the component partition from the number of hyperplanes r, because hyperplanes intersect randomly and form varying numbers of clusters. For example, 3 lines can create anywhere between 4 and 7 distinct regions in a plane. The results for the "2-spirals" data set also demonstrate the convergence of consensus clustering to a perfect solution when H reaches 500 and for values of r = 2,...,5; see Figure 5.10(a). A larger number of hyperplanes (r=5) improves the convergence. Figure 5.10(b) illustrates that the choice of consensus function is crucial for successful clustering. In the case of the "2-spirals" data, only the single-link consensus is able to find the correct clusters.

5.7 Adaptive Clustering Ensembles

As we previously discussed, several methods to create partitions for clustering ensembles are known. For example, one can use: (i) different regular clustering algorithms [124], (ii) different initializations, parameter values or built-in randomness of a specific clustering algorithm [38], (iii) weak clustering algorithms, (iv) data resampling [29][35][91]. Two of these generation schemes are sketched below.
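A small sketch of schemes (ii) and (iv), assuming numpy and scikit-learn are available (not the dissertation's code; the function names are illustrative):

import numpy as np
from sklearn.cluster import KMeans

def ensemble_by_random_init(X, H=20, k=3):
    # Scheme (ii): the same k-means algorithm, different random initializations.
    return [KMeans(n_clusters=k, n_init=1, init="random",
                   random_state=i).fit_predict(X) for i in range(H)]

def ensemble_by_resampling(X, H=20, k=3, seed=0):
    # Scheme (iv): k-means on bootstrap resamples; for simplicity every point is
    # then labeled by the fitted centroids (the thesis instead treats points not
    # drawn into a subsample as having missing labels).
    rng = np.random.default_rng(seed)
    n = len(X)
    partitions = []
    for i in range(H):
        idx = rng.integers(0, n, size=n)          # N draws with replacement
        km = KMeans(n_clusters=k, n_init=1, random_state=i).fit(X[idx])
        partitions.append(km.predict(X))
    return partitions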
All these methods generate ensemble partitions independently, in the sense that the probability of obtaining an ensemble consisting of H partitions {π1, π2,..., πH} of the given data D can be factorized:

$$p(\{\pi_1, \pi_2, \ldots, \pi_H\} \mid D) = \prod_{t=1}^{H} p(\pi_t \mid D)$$

Hence, the increased efficacy of an ensemble is mostly attributed to the number of identically distributed and independent partitions, assuming that a partition of the data is treated as a random variable π. Even when the clusterings are generated sequentially, it is traditionally done without considering previously produced clusterings:

$$p(\pi_t \mid \pi_{t-1}, \pi_{t-2}, \ldots, \pi_1; D) = p(\pi_t \mid D)$$

However, similar to ensembles of supervised classifiers built with boosting algorithms [43], a more accurate consensus clustering can be obtained if the contributing partitions take into account the solutions found so far. Unfortunately, it is not possible to mechanically transfer decision fusion algorithms from the supervised (classification) domain to the unsupervised (clustering) domain. New objective functions for guiding partition generation and the subsequent decision integration process are necessary.

We propose an adaptive approach to partition generation that makes use of clustering history. In clustering, ground truth in the form of class labels is not available. Therefore, we need an alternative measure of performance for an ensemble of partitions. We determine clustering consistency for data points by evaluating the history of cluster assignments for each data point within the generated sequence of partitions. Clustering consistency serves for adapting the data sampling to the current state of an ensemble during partition generation. The goal of adaptation is to improve confidence in cluster assignments by concentrating the sampling distribution on problematic regions of the feature space. In other words, by focusing attention on the data points with the least consistent clustering assignments, one can better approximate (indirectly) the inter-cluster boundaries. To achieve this goal, we address the problems of estimating clustering consistency and of finding a consensus clustering. We then evaluate the performance of adaptive clustering ensembles on a number of real-world and artificial data sets, in comparison with more conventional clustering ensembles of bootstrap partitions.

5.7.1 Adaptive sampling and clustering

While there are many ways to construct diverse data partitions for an ensemble, not all of them easily generalize to adaptive clustering. Our approach extends the studies of ensembles whose partitions are generated via data resampling [29][35]. Intuitively, clustering ensembles generated by other methods can also be boosted. It was shown [91] that small subsamples generally suffice to represent the structure of the entire data set in the framework of clustering ensembles. Subsamples of small size can reduce the computational cost of many exploratory data mining tasks with distributed sources of data [101].

Let D be a data set of N points that is available either as an N×d pattern matrix in d-dimensional space or as an N×N dissimilarity matrix. Suppose that X = {X1,...,XH} is a set of H subsamples of size N drawn with replacement from the given data D. A chosen clustering algorithm is run on each of the samples in X, which results in H partitions Π = {π1, π2,..., πH}.
Each component partition in Π is a set of non-overlapping and exhaustive clusters, π_i = {C_1^i,..., C_{K(i)}^i}, with X_i = C_1^i ∪ ... ∪ C_{K(i)}^i for every π_i, where K(i) is the number of clusters in the i-th partition. The problem of combining partitions is to find a new partition σ = {C1,...,CM} of the entire data set D, given the partitions in Π, such that the data points in a cluster of σ are more similar to each other than to points in different clusters of σ. We assume that the number of clusters M in the consensus clustering is predefined and can be different from the number of clusters k in the ensemble partitions. In order to find this target partition σ, one needs a consensus function utilizing information from the partitions {π_i}. Several known consensus functions [41][124] can be employed to map a given set of partitions Π = {π1, π2,..., πH} to the target partition σ in our study.

The adaptive partition generation mechanism is aimed at reducing the variance of the inter-class decision boundaries. Unlike the regular bootstrap method, which draws subsamples uniformly from a given data set, adaptive sampling favors points from regions close to the decision boundaries. At the same time, the points located far from the boundary regions will be sampled less frequently.

Figure 5.11. Two possible decision boundaries for a 2-cluster data set. Sampling probabilities of data points are indicated by gray level intensity at different iterations (t0 < t1 < t2) of the adaptive sampling. True components in the 2-class mixture are shown as circles and triangles.

It is instructive to consider a simple example that shows the difference between ensembles of bootstrap partitions with and without the weighted sampling. Figure 5.11 shows how different decision boundaries can separate the two natural classes depending on the sampling probabilities. Here we assume that the k-means clustering algorithm is applied to the subsamples. Initially, all the data points have the same weight, namely the sampling probability p_i = 1/N, i ∈ [1,...,N]. Clearly, the main contribution to the clustering error is due to the sampling variation that causes an inaccurate inter-cluster boundary. In contrast, the solution variance can be significantly reduced if sampling is increasingly concentrated only on a subset of the objects (those near the boundary) at iterations t2 > t1 > t0, as demonstrated in Figure 5.11.

The key issue in the design of the adaptation mechanism is the updating of probabilities. We have to decide how, and for which data points, the sampling probabilities should be adjusted as we collect more and more clusterings in the ensemble. A consensus function based on the co-association values [38] provides the necessary guidelines for the adjustment of sampling probabilities. Recall that the co-association similarity between two data points x and y is defined as the number of clusters shared by these points in the partitions of an ensemble Π. A consensus clustering can be found by applying an agglomerative clustering algorithm (e.g., single linkage) to the co-association matrix constructed from all the points. The quality of the consensus solution depends on the accuracy of the similarity values as estimated by the co-association values. The least reliable co-association values come from the points located in the problematic areas of the feature space. Therefore, our adaptive strategy is to increase the sampling probability for such points as we proceed with the generation of different partitions in the ensemble.
The sampling probability can be adjusted not only by analyzing the co-association matrix, which has quadratic complexity O(N²), but also by the less expensive, O(N + K³), estimation of clustering consistency for the data points. Again, the motivation is that the points with the least stable cluster assignments, namely those that frequently change the cluster they are assigned to, require an increased presence in the data subsamples. In this case, a label correspondence problem must be approximately solved to obtain the same labeling of clusters throughout the ensemble's partitions. By default, the cluster labels in different partitions are arbitrary. To make the correspondence problem more tractable, one needs to re-label each partition in the ensemble using some fixed reference partition. Table 5.11 illustrates how 4 different partitions of twelve points can be re-labeled using the first partition as a reference.

Table 5.11. Consistent re-labeling of 4 partitions.

        π1  π2  π3  π4     π1' π2' π3' π4'   Consistency
 x1     2   B   X   α      2   1   2   1     0.5
 x2     2   A   X   α      2   2   2   1     0.7
 x3     2   A   Y   β      2   2   1   2     0.7
 x4     2   B   X   β      2   1   2   2     0.7
 x5     1   A   X   β      1   2   2   2     0.7
 x6     2   A   Y   β      2   2   1   2     0.7
 x7     2   B   X   α      2   1   2   1     0.5
 x8     1   B   Y   α      1   1   1   1     1
 x9     1   B   Y   β      1   1   1   2     0.7
 x10    1   A   Y   α      1   2   1   1     0.7
 x11    2   B   Y   α      2   1   1   1     0.7
 x12    1   B   Y   α      1   1   1   1     1

At the (t+1)-st iteration, when t different clusterings are already included in the ensemble, we use the Hungarian algorithm for the maximum weight bipartite matching problem in order to re-label the (t+1)-st partition. As an outcome of the re-labeling procedure, we can compute the consistency index of clustering for each data point. The clustering consistency index CI at iteration t for a point x is defined as the ratio of the maximal number of times the object is assigned to a particular cluster to the total number of partitions:

$$CI(x) = \frac{1}{t} \max_{L \in \text{cluster labels}} \left\{ \sum_{i=1}^{t} \delta(\pi_i(x), L) \right\}$$

The values of the consistency indices after four partitions were generated and re-labeled are shown in Table 5.11. We should note that clustering of subsamples of the data set D does not provide labels for the objects missing (not drawn) in some subsamples. The clustering consistency index of a point can be directly used to compute its sampling probability. In particular, the probability value is adjusted at each iteration as follows:

$$p_{t+1}(x) = Z\left(\alpha\, p_t(x) + CI(x)\right)$$

where α is a discount constant for the current sampling probability and Z is a normalization factor. The discount factor was set to α = 0.3 in our experiments.
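The bookkeeping described above can be sketched as follows (an illustrative sketch, not the dissertation's code; it assumes numpy and scipy are available, that cluster labels are integers 0..k-1, and that points missing from a subsample are excluded from the arrays):

import numpy as np
from scipy.optimize import linear_sum_assignment

def relabel(reference, labels, k):
    # Hungarian re-labeling: permute cluster labels to best match the reference partition.
    C = np.zeros((k, k), dtype=int)
    for ref, lab in zip(reference, labels):
        C[lab, ref] += 1
    rows, cols = linear_sum_assignment(-C)        # maximum-weight bipartite matching
    mapping = dict(zip(rows, cols))
    return np.array([mapping[lab] for lab in labels])

def consistency_index(relabeled_partitions, k):
    # CI(x): fraction of the t partitions that agree on the most frequent label of x.
    P = np.asarray(relabeled_partitions)          # shape (t, N)
    t = P.shape[0]
    counts = np.stack([(P == c).sum(axis=0) for c in range(k)])
    return counts.max(axis=0) / t

def update_probabilities(p, ci, alpha=0.3):
    # The update rule given above: p_{t+1}(x) = Z * (alpha * p_t(x) + CI(x)),
    # with Z chosen so the probabilities sum to one.
    new_p = alpha * np.asarray(p) + np.asarray(ci)
    return new_p / new_p.sum()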
The proposed clustering ensemble algorithm is summarized in pseudocode below:

Input: D - data set of N points
       H - number of partitions to be combined
       M - number of clusters in the consensus partition σ
       K - number of clusters in the partitions of the ensemble
       Γ - chosen consensus function operating on cluster labels
       p - sampling probabilities (initialized to 1/N for all the points)

Reference Partition <- k-means(D)
for i = 1 to H
  Draw a subsample X_i from D using sampling probabilities p
  Cluster the sample X_i: π(i) <- k-means(X_i)
  Re-label partition π(i) using the reference partition
  Compute the consistency indices for the data points in D
  Adjust the sampling probabilities p
end
Apply consensus function Γ to ensemble Π to find the partition σ
Validate the target partition σ (optional)
return σ   // consensus partition

5.7.2 Empirical Study of Adaptive Ensembles

The experiments were conducted on artificial and real-world data sets ("Galaxy", "half-rings", "wine", "3-gaussian", "Iris", "LON") with known cluster labels, to validate the accuracy of the consensus partition. A comparison of the proposed adaptive ensembles with the previous, non-adaptive ensembles is the primary goal of the experiments. We evaluated the performance of the clustering ensemble algorithms by matching the detected and the known partitions of the datasets. The best possible matching of clusters provides a measure of performance expressed as the misassignment rate. To determine the clustering error, one needs to solve the correspondence problem between the labels of the known and the derived clusters; again, the Hungarian algorithm was used for this purpose. The k-means algorithm was used to generate the partitions of samples of size N drawn with replacement, similar to bootstrap, albeit with dynamic sampling probabilities. Each experiment was repeated 20 times, and average values of the error (misassignment) rate are shown in Figure 5.12.

Figure 5.12. Clustering accuracy for ensembles with adaptive and non-adaptive sampling mechanisms as a function of ensemble size, for several data sets and selected consensus functions.

Consensus clustering was obtained by four different consensus functions: the hypergraph-based MCLA and CSPA algorithms [124], quadratic mutual information, and the EM algorithm based on the mixture model. The main observation is that adaptive ensembles slightly outperform the regular sampling schemes on most benchmarks. Typically, the clustering error decreased by 1-5%. However, the accuracy improvement depends on the number of clusters in the ensemble partitions (K). Generally, the adaptive ensembles were superior for values of K larger than the target number of clusters M by 1 or 2.
With either too small or too large a value of K, the performance of adaptive ensembles was less robust and occasionally worse than that of the corresponding non-adaptive algorithms. A simple inspection of the probability values always confirmed the expectation that points with large clustering uncertainty are drawn more frequently. The most significant progress was observed when the combination consisted of 25-75 partitions. Large numbers of partitions (H>75) almost never led to further improvement in clustering accuracy. Moreover, for H>125 we often observed increased error rates (except for the hypergraph-based consensus functions), due to the increased complexity of the consensus model and the number of model parameters requiring estimation.

To summarize, we have extended the clustering ensemble framework with an adaptive data sampling mechanism for the generation of partitions. We dynamically update the sampling probabilities to focus on the more uncertain and problematic points by computing clustering consistency on the fly. Empirical results demonstrate improved clustering accuracy and faster convergence as a function of the number of partitions in the ensemble.

5.8 Conclusions

We advanced previous research on clustering ensembles in several respects. First, we have introduced a unified representation for multiple clusterings and formulated the corresponding categorical clustering problem. Second, we have proposed a solution to the problem of clustering combination. A consensus clustering is derived from a solution of the maximum likelihood problem for a finite mixture model of the ensemble of partitions. An ensemble is modeled as a mixture of multivariate multinomial distributions in the space of cluster labels. The maximum likelihood problem is effectively solved using the EM algorithm. The EM-based consensus function is also capable of dealing with incomplete contributing partitions. Third, it is shown that another consensus function can be related to the classical intra-class variance criterion using the generalized mutual information definition. A consensus solution based on quadratic mutual information can be efficiently found by the k-means algorithm in the space of specially transformed cluster labels. Experimental results indicate good performance of the approach for several datasets and favorable comparison with other consensus functions. Among the advantages of these algorithms are their low computational complexity and well-grounded statistical model.

We have also considered combining weak clustering algorithms that use data projections and random data splits. A simple explanatory model is offered for the behavior of combinations of such weak components. We have analyzed combination accuracy as a function of the parameters that control the power and resolution of the component partitions, as well as the learning dynamics vs. the number of clusterings involved. An empirical study compared the effectiveness of several consensus functions applied to weak partitions.

It is interesting to continue the study of clustering combination in several directions: 1) design of effective search heuristics for consensus functions, 2) a more precise and quantifiable notion of weak clustering, and 3) an improved understanding of the effect of component resolution on overall performance. This research can also be extended to take into account the non-independence of partitions in the ensemble.
The consensus function presented here is equivalent to a certain kind of latent class analysis, which offers established statistical approaches to measuring and using dependencies (at least pair-wise) between variables. It is also interesting to consider a combination of partitions of different quality. In this case one needs to develop a consensus function that weights the contributions of the different partitions in proportion to their strength. We hope to address these issues in our future work.

Chapter 6

Summary

This dissertation addresses three information-processing problems, which are central to modern data mining, and advances their solution by using evolutionary computation and ensemble approaches.

Feature Extraction. The goal is to find the best subset of original variables, or a transformation of them, relevant to a computational model. In general, new variables can be synthesized from existing data features. Generalization, or predictive classification accuracy, of the model is the main evaluation criterion in feature selection and extraction. Our contribution to this problem is evolutionary feature extraction and dimensionality reduction based on a genetic search for optimal combinations of feature values. Accuracy-driven construction of new features uses intelligent value clustering. We offer feature extraction methods for improved pattern classification using genetic algorithms. New features are synthesized by merging the values of the original variables during the search process. In this study, we introduce a different GA approach to dimensionality reduction that does not explicitly operate on whole features, but instead operates on the individual values assumed by the feature variables. Features are extracted by clustering several "old" values into a new meta-value which substitutes for the old values in the feature vector. Therefore, new features are created by clustering the values of variables. A GA is used as a search engine for this value clustering. If the features assume nominal values, then clustering can be viewed as a grouping problem. Scalability and redundancy are two characteristic problems of grouping algorithms. Here we present a grouping GA with an improved graph-based encoding. The proposed cluster representation has lower redundancy while preserving the simplicity of decoding and evaluation. A compact solution representation with minimal redundancy of the search space is designed for a wide class of grouping problems, including clustering of variable values. This encoding is not limited to value clustering and is therefore applicable to other grouping tasks. The genetic search for (sub-)optimal combinations of values is performed in a graph-based encoding of candidate solutions. Genetic value clustering is applied to text categorization, DNA-based assignment of individuals in population genetics, and parametric learning of Bayesian network classifiers. It is shown that such feature extraction results in better predictive accuracy of classification decisions.

Data model identification. Prediction of output variables is based on learning an unknown function from a given training sample and a set of elementary functional components. We propose fast genetic programming for data-driven induction of predictive models with real-valued input-output mappings. We develop genetic programming algorithms for modeling input-output relations of continuous variables that incorporate dynamic fitting of the free parameters of the evolved models.
Traditional genetic programming is extended by gradient descent optimization of the leaf coefficients of tree-like programs during the evolutionary search, which is made possible by algorithmic differentiation. Experimental results show significant improvement in both computational requirements and modeling accuracy for a set of symbolic regression problems.

Data clustering. A data set must be partitioned into a number of meaningful groups of data points. Clustering criteria use the available data features and a variety of prior knowledge about data distribution properties. We propose generic and robust methods to combine multiple clusterings of data, including novel fast algorithms for consensus clustering as well as inexpensive and weak clustering components. Ensembles of partitions of data sets are studied in two respects: combination of multiple clusterings and generation of clusterings for an ensemble. We develop two efficient consensus functions for finding a combined partition of good quality. The first consensus function uses an information-theoretic principle based on maximal generalized mutual information. The second function finds a consensus clustering by estimating a probabilistic mixture model from the observed ensemble. It is demonstrated that the ensemble's partitions can be generated by weak clustering algorithms, in particular, by clustering in random low-dimensional subspaces of the original feature space. Experiments indicate that an ensemble of weak partitions can be more accurate than a single sophisticated clustering algorithm. Finally, we consider how the partition generation process can adapt to provide better decisions for the patterns near the boundaries between the data clusters.

The main results of this dissertation were published in [52][53][66][91][128][129][130][131][132].

Appendix

EM Algorithm for Consensus Solution

Here we provide a detailed derivation of the EM algorithm for estimating the parameters of the mixture model proposed in Section 5.3. The distribution of the hidden variable z should be consistent with the observed values Y:

$$\log P(Y \mid \Theta) = \log \sum_{z \in Z} P(Y, z \mid \Theta). \tag{A1}$$

If the value of z_i is known, then one could immediately tell which of the M mixture components was used to generate the point y_i. This means that with each observed data point y_i we associate a hidden vector variable z_i = {z_i1,...,z_iM} such that z_im = 1 if y_i belongs to the m-th component and z_im = 0 otherwise. It is convenient to write the complete data likelihood as:

$$\log L(\Theta \mid Y, Z) = \log \prod_{i=1}^{N} P(y_i, z_i \mid \Theta) = \log \prod_{i=1}^{N} \prod_{m=1}^{M} \bigl( \alpha_m P_m(y_i \mid \theta_m) \bigr)^{z_{im}} = \sum_{i=1}^{N} \sum_{m=1}^{M} z_{im} \log \alpha_m P_m(y_i \mid \theta_m). \tag{A2}$$

According to the general EM approach, we have to define an auxiliary function Q(Θ; Θ') that serves as a lower bound on the observed data likelihood in Eq. (5.3):

$$Q(\Theta; \Theta') = \sum_{Z} \log\bigl( P(Y, Z \mid \Theta) \bigr)\, p(Z \mid Y, \Theta'). \tag{A3}$$

Classical convergence analysis of the EM algorithm [25][28] establishes that maximization of the function Q(Θ; Θ') with respect to Θ is equivalent to increasing the observed likelihood function in Eq. (5.3). Evaluation of Q(Θ; Θ') is the first step of the EM algorithm. Substitution of Eq. (A2) into the definition of the Q function gives:

$$Q(\Theta; \Theta') = \sum_{i=1}^{N} \sum_{m=1}^{M} E[z_{im}] \log \alpha_m P_m(y_i \mid \theta_m). \tag{A4}$$

The last expression depends on the previous estimate of the parameters Θ' only via the expected values of the hidden variables E[z_im], which are defined as:

$$E[z_{im}] = \sum_{z} z_{im}\, p(z \mid Y, \Theta') = \frac{\alpha'_m P_m(y_i \mid \theta'_m)}{\sum_{n=1}^{M} \alpha'_n P_n(y_i \mid \theta'_n)}. \tag{A5}$$
Here, the current guess about the parameters Θ' = {α'_1,...,α'_M, θ'_1,...,θ'_M} is used to compute the expectations. Taking into account the concrete form of the component-conditional densities P_m(y_i | θ_m) from Eqs. (5.5) and (5.6), we obtain the E-step of the algorithm:

$$E[z_{im}] = \frac{\alpha'_m \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \bigl( \vartheta'_{jm}(k) \bigr)^{\delta(y_{ij},\,k)}}{\sum_{n=1}^{M} \alpha'_n \prod_{j=1}^{H} \prod_{k=1}^{K(j)} \bigl( \vartheta'_{jn}(k) \bigr)^{\delta(y_{ij},\,k)}}. \tag{A6}$$

The M-step consists of maximizing the Q function in Eq. (A4) over the parameters Θ, given the expected values of the hidden variables E[z_im] from the E-step in Eq. (A6):

$$\Theta^* = \arg\max_{\Theta} Q(\Theta; \Theta') = \arg\max_{\{\alpha_m,\, \vartheta_{jm}\}} \sum_{i=1}^{N} \sum_{m=1}^{M} \bigl( E[z_{im}] \log \alpha_m + E[z_{im}] \log P_m(y_i \mid \theta_m) \bigr). \tag{A7}$$

The two terms on the right-hand side can be optimized independently. The expression for the coefficients α_m is easily found using a Lagrange multiplier along with the constraint Σ_m α_m = 1:

$$\frac{\partial Q(\Theta; \Theta')}{\partial \alpha_m} = \frac{\partial}{\partial \alpha_m} \left( \sum_{i=1}^{N} \sum_{m=1}^{M} E[z_{im}] \log \alpha_m + \lambda \Bigl( \sum_{m=1}^{M} \alpha_m - 1 \Bigr) \right) = 0 \tag{A8}$$

$$\alpha_m = \frac{\sum_{i=1}^{N} E[z_{im}]}{\sum_{i=1}^{N} \sum_{m=1}^{M} E[z_{im}]}. \tag{A9}$$

Similarly, obtaining the optimizing values of ϑ_jm(k) is facilitated by the independence assumption for the component-conditional densities of the variables y_ij, as described by Eq. (5.3). Again, the natural constraints Σ_k ϑ_jm(k) = 1 and Lagrange multipliers λ_jm are utilized:

$$\frac{\partial Q(\Theta; \Theta')}{\partial \vartheta_{jm}(k)} = \frac{\partial}{\partial \vartheta_{jm}(k)} \left( \sum_{i=1}^{N} \sum_{m=1}^{M} E[z_{im}] \log P_m(y_i \mid \theta_m) + \lambda_{jm} \Bigl( \sum_{k=1}^{K(j)} \vartheta_{jm}(k) - 1 \Bigr) \right) = 0 \tag{A10}$$

$$\vartheta_{jm}(k) = \frac{\sum_{i=1}^{N} \delta(y_{ij}, k)\, E[z_{im}]}{\sum_{i=1}^{N} \sum_{k=1}^{K(j)} \delta(y_{ij}, k)\, E[z_{im}]}. \tag{A11}$$

To summarize, the EM algorithm starts with an initial guess for the values of the parameters {α_1,...,α_M, θ_1,...,θ_M} of the mixture density. After this, the E and M steps are repeated until a chosen convergence criterion is satisfied. The E-step computes the expected values of the hidden variables E[z_im] according to Eq. (A6). The M-step maximizes the likelihood by computing the new best parameter estimates according to Eqs. (A9) and (A11). The convergence criterion can be based on the increase in the likelihood function between two consecutive M-steps, or on the stability of the assignment of points from Y, or equivalently from X. In fact, it is the stability criterion that is more relevant to the clustering task. Intuitively, each E-step is essentially equivalent to the naive Bayes computation procedure. However, the M-step uses "soft" real-valued cluster memberships to determine the pattern contributions to the component-conditional densities, in contrast to conventional supervised maximum likelihood estimation.

Bibliography

[1] E. E. Abola, F. C. Bernstein, S. H. Bryant, T. F. Koetzle, and J. Weng, Protein Data Bank, in Crystallographic Databases: Information Content, Software Systems, Scientific Applications (F. Allen et al., eds.), Data Commission of the Intl. Union of Crystallography, Bonn/Cambridge/Chester, 107-132, 1987.

[2] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, in Proceedings of ACM-SIGMOD 1998 Int. Conf. on Management of Data, 94-105, 1998.

[3] H. Almuallim and T.G. Dietterich, Learning with many irrelevant features, in Proceedings of the Ninth National Conference on Artificial Intelligence, 547-552, Menlo Park, CA: AAAI Press, 1991.

[4] H. Almuallim and T.G. Dietterich, Efficient algorithms for identifying relevant features, in Proceedings of the Ninth Canadian Conference on Artificial Intelligence, 38-45, Vancouver, BC: Morgan Kaufmann, 1992.

[5] P.G. Angeline, G. Saunders, and J. Pollack, An evolutionary algorithm that constructs recurrent neural networks, IEEE Trans. on Neural Networks, Vol. 5, No. 1, 54-65, 1994.

[6] J.-P. Barthelemy and B.