This is to certify that the dissertation entitled ENHANCING PATTERN RECOGNITION USING EVOLUTIONARY COMPUTATION FOR FEATURE SELECTION AND EXTRACTION WITH APPLICATION TO THE BIOCHEMISTRY OF PROTEIN-WATER BINDING presented by Michael L. Raymer has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science and Engineering.

ENHANCING PATTERN RECOGNITION USING EVOLUTIONARY COMPUTATION FOR FEATURE SELECTION AND EXTRACTION WITH APPLICATION TO THE BIOCHEMISTRY OF PROTEIN-WATER BINDING

By Michael L. Raymer

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

August 23, 2000

ABSTRACT

ENHANCING PATTERN RECOGNITION USING EVOLUTIONARY COMPUTATION FOR FEATURE SELECTION AND EXTRACTION WITH APPLICATION TO THE BIOCHEMISTRY OF PROTEIN-WATER BINDING

By Michael L. Raymer

Statistical pattern recognition techniques classify objects in terms of a representative set of features. The selection and quality of the features representing each object have a considerable bearing on the success of subsequent pattern classification. Feature extraction is the process of deriving new features from the original features in order to reduce the cost of feature measurement, increase classifier efficiency, and allow higher classification accuracy. Many current feature extraction techniques involve linear transformations of the original features to produce new features. While useful for data visualization and increasing classification efficiency, these techniques do not necessarily reduce the number of features that must be measured, since each new feature may be a linear combination of some or all of the original features. Here a new approach is presented in which feature selection, feature extraction, and classifier training are performed simultaneously using evolutionary computing (EC). This method is tested in conjunction with a k-nearest-neighbors classifier, and shown to outperform other current methods for feature selection and extraction in terms of minimizing the number of features employed while maximizing classification accuracy. Two new classifiers based on the naive Bayes classifier are developed in conjunction with the EC feature selection and extraction technique, and the resulting hybrid classifiers are shown to yield further improvements in feature subset parsimony and classification accuracy.
A key advantage to the methods presented here is the ability to examine the set of linear feature weights produced by EC to perform data mining and exploratory data analysis. The EC feature selection and extraction technique is applied to an important and difficult problem in biochemistry—classification of potential protein-water binding sites. The resulting classifier is able to identify water-binding sites with ~68% accuracy, and identifies a set of physical and chemical features that correspond well with the results of other studies of protein-water binding.

© Copyright August 22, 2000 by Michael L. Raymer
All Rights Reserved

For my family, who never failed to believe in me, and especially for my father, who taught me the importance of learning.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisors, Dr. Bill Punch and Dr. Leslie Kuhn, for their guidance, support, encouragement, and patience. It has been my good fortune to work with two such outstanding scientists and individuals. Thank you both for inspiring me to strive to excel, and for providing examples of excellence. I would also like to thank the other two members of my graduate committee, Dr. Erik Goodman and Dr. Rich Enbody, for their guidance and direction over the years, as well as their excellent suggestions for improvement of this dissertation.

I gratefully acknowledge the financial support of the National Science Foundation. Much of this work was supported by the NSF grant "Resolving Protein-Water Recognition".

In addition, I would like to thank my colleagues and friends from the MSU GARAGe, past and present, including Owen Matthews, Terry Dexter, Jason Greanya, Victor Miagkikh, Oleg Lukibanov, Iliana Martinez-Bermoudes, Lakshman Thiruvenkatachari, Michael Resendez and others. In particular, many thanks to Marilyn Wulfekuhler, Boris Gelfand, Arnold Patton, and Mike Boers for their ideas, constructive criticisms, moral support, and friendship.

Likewise, I am grateful to the many members of the MSU Protein Structural Analysis and Design lab, and especially Sridhar Venkataraman, Carrie Barkham, Vishal Thakkar, Rajesh Korde, Trevor Barkham, and Maria Zavodszky, for creating an enjoyable research environment, and for their many contributions to this work. A special thanks to Paul Sanschagrin, Volker Schnecke and Brandon Hespenheide for their many contributions and friendship over the years.

Thank you to my friends and colleagues at MSU who have provided technical, intellectual, and emotional support, as well as making East Lansing an enjoyable place to spend the past seven years: Travis Doom, Jen White, Mark Brehob, Katrina Brehob and Dave Paoletti.

Finally, I would like to extend my heartfelt gratitude to my family. Mom, Ed, Laura—there is nothing in my life that has provided me with more motivation, encouragement, and support than your belief in me. Above all, I thank my wife, Delia. Without your support, understanding, and encouragement I would never have come this far. I continue to aspire to become as remarkable a scientist and person as you are.

TABLE OF CONTENTS

LIST OF TABLES xi
LIST OF FIGURES xvi
1 Introduction 1
1.1 Overview—Contributions of the Thesis ................... 1
1.2 Motivation ................................... 2
1.2.1 Evolutionary Computation for Feature Selection and Extraction . . . .
2 1.2.2 Understanding Protein-Water Interactions ................ 6 2 Background and Synopsis of Literature 7 2.1 Pattern Classification ............................. 7 2.1.1 The Pattern Recognition Task ....................... 7 2.1.2 The Bayes Classifier ............................ 10 2.1.3 Nearest Neighbors Classification ...................... 12 2.2 Feature Selection and Extraction ...................... 16 2.2.1 Feature Costs and the Curse of Dimensionality ............. 16 2.2.2 Feature Selection .............................. 17 2.2.3 Feature Extraction ............................. 21 2.3 Evolutionary Computation .......................... 23 2.3.1 Genetic Algorithms ............................. 23 2.3.2 Evolution Strategies and Evolutionary Programming .......... 25 2.4 Evolutionary Computation in Feature Selection and Extraction ..... 26 viii 4.1.3 3 Feature Selection and Extraction using the K-Nearest-Neighbors Classifier in Combination with Evolution-based Learning 31 3.1 Methods .................................... 31 3.1.1 Branch and Bound Near-Neighbor Searching ............... 31 3.1.2 The Hybrid GA/knn Classifier ...................... 32 3.1.3 EP Optimization with the Knn Classifier ................. 42 3.1.4 Bayes and KNN Classifiers ......................... 44 3.1.5 Data Sets~ .................................. 46 3.1.6 Testing and Error Rate Estimation .................... 54 3.2 Results ..................................... 57 3.2.1 Classification of Artificial Data Sets ................... 57 3.2.2 Classification of Data from the UCI Repository ............. 61 3.2.3 Classification of Medical Data ....................... 65 3.2.4 Classification of Protein Solvation Sites .................. 69 3.3 Discussion ................................... 70 4 Variations on the Bayes Classifier using Evolutionary Computation- Based Learning 72 4.1 Methods .................................... 72 4.1.1 Bayesian Discriminant Functions ..................... 72 4.1.2 Nonlinear Weighting of the Bayes Discriminant Function ........ 77 4.1.3 A New Discriminant Function Based on Summation of the Class- Conditional Marginal Probabilities ............... 80 4.2 Results ..................................... 80 4.2.1 Artificial Data Sets ............................. 81 4.2.2 Classification of the UCI Data Sets .................... 84 4.2.3 Classification of Medical Data ....................... 87 4.3 Discussion ................................... 88 ix 5 Identifying the Determinants of Solvent Binding in Proteins 91 5.1 Introduction .................................. 91 5.1.1 An Overview of the Protein-Water Problem ............... 92 5.1.2 State of the Art in Predicting Protein Solvation ............. 94 5.2 Methods .............................. . ...... 102 5.2.1 Compilation of a Protein Solvation Database .............. 102 5.2.2 Generation of Non-Solvated Probe Sites ................. 106 5.2.3 Feature Identification and Measurement ................. 107 5.2.4 Distinguishing Solvation Sites from N on-Solvated Sites Near the Protein Surface ............................... 111 5.2.5 Distinguishing Conserved Solvation Sites from Sites not Conserved Be- tween Ligand-Bound and Unbound Protein Structures . . . . 112 5.3 Results ..................................... 113 5.3.1 Differentiation of Solvation Sites from N on—Solvated Sites ........ 113 5.3.2 Distinguishing Between Conserved and Non-Conserved Solvation Sites 116 / 5.4 Discussion ....... L ........................... 
120 BIBLIOGRAPHY 123 3.2 T3 3.3 R1 3.4 A] 3.1 3.2 3.3 3.4 LIST OF TABLES Typical values for the GA cost function coefficients ............. Typical values for the GA run parameters. ................. Run parameters for a typical EP/knn experiment. d is the number of parameters to be optimized (i.e. the dimensionality of the problem). . Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC/knn hybrid classifiers for artificially generated data sets consisting of independent, Gaussian distributed features. For the Bayes classifier, accuracy is Shown for re-classifying the training data (Train), and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set ('Dune) and the corresponding 1: value are reported. The best I: value was then evaluated by reclassifying the training data (Rain), and by classifyng a new, independent test set (Test). For the EC/knn hy- brid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The I: value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when clas- sifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). ............ xi 38 42 44 60 3.5 AP 3-6 Res 3.? Bal g -‘ rm _‘ ‘ - 3.5 Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC / knn hybrid classifiers for artificially gen- erated data sets consisting of features with a mixture of random distri- butions. For the Bayes classifier, accuracy is Shown for re-classifying the training data (Train), and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set (Tune) and the corresponding k value are reported. The best It value was then evaluated by reclassifying the training data (Train), and by classifying a new, independent test set (Test). For the EC/knn hy- brid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The k value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when clas- sifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). ............ 62 3.6 Results of the hybrid EC/knn classifiers and the GADistAI algorithm on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. 'Irain/ Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. Features is the number of features with nonzero weights in the best performing weight set for each run; the mean value over all 50 runs is Shown .......................... 
64 3.7 Balance in predictive accuracy among classes for various classifiers. The maximum accuracy, (Max), and the minimum accuracy, (Min), among the various classes are shown for each classifier, along with the difference between the two values (Bal). Lower values for Ba], indicat- ing a smaller difference between the minimum and maximum predictive accuracies, are preferred. Train / Test results refer to accuracies for re- classifying the training data for the Bayes and Knn classifier, and for reclassifying the EC tuning data set for the GA/knn and EP/knn clas- sifiers. Test refers to the accuracy when classifying an independent test set. The knn classifier was tested for odd values of k from 1 to 101, and the best results are shown .................... 66 xii 3.9 h 3.8 Results of various classifiers on hypothyroid data, as reported by Weiss, in comparison with that of the EC / knn hybrid classifiers. Train/tune refers to the accuracy Obtained when reclassifying the training data, in the case of Weiss’ results, or the tuning data, in the case of the EC/knn hybrid classifiers. Testing refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers. .................... 67 3.9 Results of various classifiers on the appendicitis data, as reported by Weiss, in comparison with that of the EC/knn hybrid classifiers. As in the previous table, rain / tune refers to the accuracy obtained when reclas- sifying the training data, in the case of Weiss’ results, or the tuning data, in the case of the EC / knn hybrid classifiers. Testing again refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers ........ 68 4.1 Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially- generated data sets consisting of feature values drawn from indepen- dent Gaussian distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing ini- tial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclassifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes), and the two EC/knn hybrid classifiers are reproduced here from Table 3.4 for comparison. .......... 82 4.2 Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially- generated data sets consisting of features with a mixture of random distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclas- sifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes), and the two EC/knn hybrid classifiers are repro- duced here from Table 3.5 for comparison. ............... 
83 xiii 4.3 Ba 4.4 Re 4.5 Ex 5'1 30 4.3 Balance in predictive accuracy among classes for the various EC/ hybrid classifiers on the data sets M6 and N5. Predictive accuracy on the independent test set is Shown for each class. Balance is defined as the difference in predictive accuracy between class 1 and class 2, as described in Section 3.1.2. ........................ 84 4.4 Results of the nonlinear-weighted discriminant function (Nonlinear) and the discriminant function based on summation of the marginal prob- abilities (Sum) on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. 'I‘rain/ Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. The number of features is the mean num- ber of features used in classification over all 50 runs. Performance for the EC / knn classifiers is repeated here from Table 3.6 for comparison. 86 4.5 Execution times (wall clock time) for 200 generations of GA optimization of the knn and nonlinear discriminant function classifiers. For each data set, the number of features (d), the number of classes (0), the combined training and tuning set size (n), and the mean execution time (hourszminuteszseconds) over 50 runs are shown. Each run was executed on a single 250MHz UltraSPARC-II cpu of a six-cpu Sun Ultra-Enterprise system with 768 MB of system RAM. Runs were ex- ecuted in sets of 5 with no other user processes present on the system. 87 4.6 Accuracy of various classifiers on the hypothyroid and appendicitis diag- nosis data sets. Results for the discriminant function classifiers are averaged over five GA experiments. Results for the GA/knn classi- fier represent the best of five experiments. Train/Time refers to the accuracy obtained in reclassifying the GA tuning set; Test refers to bootstrap accuracy over 100 bootstrap sets. .............. 88 5.1 30 protein pairs included in the database. ................. 104 5.2 The physical and chemical features used to represent protein-bound wa- ter molecules and protein surface Sites. BVAL and MOB were only measured for conserved and non-conserved molecules, as the concept of thermal mobility is not defined for non-solvated probe sites. . . . . 108 xiv 5.3 Bootstrap test results, k—values, and weight sets from the top six GA experiments for distinguishing crystallographically-observed solvation sites from non-sites. Bootstrap testing accuracy is given as a percent- age for observed solvation sites (Site), non-solvated sites (Non), and both classes together (Tot). Balance (Bal) is the average difference between solvated-site and non-solvated-site prediction accuracy (‘76) over all bootstrap runs. Features with no weights were masked by the GA; i.e. their feature weights were zero, and they did not participate in classification. The feature weights for all six features are normalized to sum to 1.00 for each experiment. ................... 114 5.4 Bootstrap results of the best 21 weight sets for identifying conserved sol- vation sites. Mean bootstrap testing accuracy is given as a percentage (Acc). Mean balance (Bal) is the average difference between con- served and non-conserved prediction accuracy (‘76) over all bootstrap runs. Features with no weights were masked by the GA; i.e. their fea- ture weights were zero, and they did not participate in classification. 
The feature weights for all eight features are normalized to sum to 1.00 for each experiment. ........................... 117 XV 2.3 ‘24 3.1 3.3 3.4 A ge A il Am An £36 A ti 2.1 2.2 2.3 2.4 2.5 3.1 3.2 3.3 3.4 3.5 LIST OF FIGURES A general model for classical supervised pattern classification systems. A three class, 5-nearest-neighbors classifier. The training samples are plotted in a two-dimensional feature space. Three of the five nearest neighbors of the test sample are of class 2, so the test sample will be classified as belonging to class 2 ...................... A general model of a simple Genetic Algorithm. .............. A d-dimensional binary vector, comprising a Single member of the GA population for GA-based feature selection. ............... Effect of scaling feature axes on k-nearest-neighbor classification. (a) orig- inal data; (b) scaled data. Extension of the scale of the horizontal axis increases the distance between patterns which differ in feature 1, allow- ing the knn to discriminate more finely along this dimension. Here, the prediction of the unknown changes from class 2 to class 3 as a result of scaling. ................................. A GA-based feature extractor using an objective function based on clas- sification accuracy. Each transformation matrix from the GA is used to transform the input patterns, which are then passed to a knn classi- fier. The fitness of a particular transformation matrix is based on the classification accuracy of the knn using the transformed patterns. An example of a GA/knn chromosome with masking for a 4-dimensional feature set .................................. An example of an EP/GA hybrid chromosome with masking for a 4- dimensional feature set ........................... Effects of Gaussian smoothing on the computation of effective marginal probabilities. Assuming that the current feature value falls in the cen- ter bin (black rectangle), and assuming a = 2, then the two surround- ing bins on either side (grey rectangles) also contribute to the effective marginal probability for the current feature. .............. A two-dimensional projection of the data sets 010,1 (a), and 010,2 (b) onto the first two feature axes. 010,2 exhibits much more overlap between classes than 310,1 .............................. xvi 13 24 27 28 33 40 43 46 51 3.6 Div 4.1 The 4.2 An 5.1 A F 3.6 4.1 4.2 5.1 5.2 Division of training, testing, and tuning data during a single EC exper- iment (a), and error rate estimation of the tuned classifier (b). The cost score in (a) is based on the performance of the knn on the tuning set for a particular weight set and k value. The same knn training data is used for the entire EC experiment and during external testing. . . . The Bayes decision rule is invariant to linear transformations of the feature space. For the feature shown here, the raw feature values (a) have been multiplied by 10 in (b). Using a nonparametric Bayes classifier, we find that the original feature value falls in the bin 14—16 (black rectangle) in the original histogram. The scaled feature falls in the equivalent bin of histogram b, and the histogram values (marginal probabilities) of the two bins are identical, so the scaling has no bearing on the classification results .................................... An example of the EC chromosome for optimization of the nonlinear dis- criminant coefficients. A four-dimensional problem is shown. 
Each coefficient, Ci, in the discriminant function is determined by the chro- mosome weight, W,, and the masking field, Mi. ............ A particular sequence of amino acids will consistently fold into the same three-dimensional, “globular” structure. Each of the 20 common amino acids is identified by a one-letter code. The linear protein chain (left) folds into its globular structure (right) after synthesis. Water molecules, shown here as spheres, bind to the surface of the protein via hydrogen bonds ............................... Conserved and non-conserved water molecules plotted according to the first two principal components of the eight-feature water binding data set. Eight hundred randomly-selected water molecules from the data set are plotted. Non-conserved water molecules are plotted as 1’s, conserved water molecules are plotted as 2’s. The first principal com- ponent is composed primarily of the features HBDP, ADN, and AHP (with coefficients 0.63, 0.52, and 0.38, respectively). The second prin- cipal component is composed primarily of the three features based on temperature factor: ABVAL, BVAL, and NBVAL. (The respective co- efficients are —0.54, —0.53, and —0.44.) There is no clear separation between the two classes in this feature space ............... 5.3 Average normalized weight of each feature in the top fifteen weight sets for distinguishing crystallographically— observed solvation sites from non-solvation sites. ............................ xvii 58 74 79 92 110 5.4 Relat bi pi cz 5.4 Relationship between majority votes and accuracy. For each possible num- ber of majority votes, m, (x-axis) the solid line shows the cumulative predictive accuracy for all test sites for which m or more votes were cast in the majority. This accuracy value corresponds to the scale on the right y-axis. The outlined rectangles Show the actual number of sites for which m or more votes were cast in the majority, correspond- ing to the scale on the left y-axis. Since the k-value for the weight set tested was 65, the number of majority votes ranges from 33 (the smallest possible majority) to 65. .................... 121 xviii (Iha Intr 1.1 This the: tem tea for Simui describe, feature 5 Classifi(.a. Slfier are mahOdS. and We ProdUCe c it is ShOW' data mini Chapter 1 Introduction 1.1 Overview—Contributions of the Thesis This thesis describes several contributions to the state of the art of computational pat- tern recognition, evolutionary computation (EC), and biochemistry. A novel method for Simultaneous feature selection and extraction using evolutionary computation is described. This method compares favorably with the best current algorithms for feature subset selection in terms of minimizing the number of features selected and classification accuracy. In addition, several new classifiers based on the Bayes clas- sifier are developed and tested alongside the k-nearest—neighbors classifier and other methods, and are Shown to perform well in conjunction with the EC feature selection and extraction methods developed here. The ability of the hybrid EC-classifiers to produce consistent feature subsets across various experiments is demonstrated, and it is shown that this capability can be useful in the analysis of classifier results for data mining and analysis. Finally, application of these methods to the biochemical 1 problem I that can curacy of predictior not exhib Analysis I the deten sifiers des incorpor' Kuhn. ‘20 software ture data 1.2 1.2.1 The held many ar, employed Mitten. 
, 1n medic-i problem of understanding protein-water interactions demonstrates two new classifiers that can predict water binding and conservation among protein structures. The ac- curacy of these classifiers is shown to compare well with other current methods for prediction of protein solvation. Furthermore, the new classifiers developed here do not exhibit a tendency to overpredict solvation, as is common among other methods. Analysis of the experimental results of these classifiers has provided new insights into the determinants of water binding to proteins. The weighted k—nearest-neighbor clas- sifiers described here, with feature weights determined by GA optimization, has been incorporated into a new algorithm for identifying promising drug leads (Schnecke and Kuhn, 2000), and is currently being evaluated for incorporation into a well-known software package, XtalView, for identifying water binding sites during protein struc- ture determination (McRee, 1992). 1.2 Motivation 1.2.1 Evolutionary Computation for Feature Selection and Extraction The field of computational pattern recognition has produced useful applications in many areas of science and engineering. Pattern recognition techniques have been employed with great success, enabling computer systems to recognize typed, hand- written, and spoken language, to analyze satellite imagery, to aid in decision making in medicine and finance, and to perform numerous other difficult classification tasks. A broad works, at problems and techr nary goa system, f disease, i one of 1f perf0rm ' ple objec training 5 as a list I features ; OCR sys "1 a Chara ln des in adi'anc reduildan ing featu diffiCulty, Featu beSt A broad range of methods, including statistical classifiers, decision trees, neural net- works, and fuzzy logic have been applied to pattern recognition and classification problems. (Duda and Hart, 1973). Yet despite this wide variation in problem areas and techniques, several features are common to most classification systems. The pri- mary goal of a classifier is to categorize objects or concepts into groups. A diagnosis system, for example, might classify patients as “at—risk” or “not-at-risk” for a certain disease, while OCR systems typically classify typewritten characters as belonging to one of 100 or so groups, one for each possible printed ASCII character. In order to perform this categorization, a classifier is typically trained by providing it with sam- ple objects (or concepts) for which the correct classification is already known. These training samples, as well as the objects to be classified later, are generally represented as a list of features. For a diagnosis system, a patient might be represented by such features as height, weight, age, sex, and the results of various clinical tests. For an OCR system, features might include the length of the longest straight line segment in a character, or the presence or absence of a closed loop. In designing a classification system for a particular task, it is often not known in advance which features will prove useful for classification, and which will contain redundant or even misleading information. Several areas of investigation, includ- ing feature subset selection and feature extraction, have come about to address this difficulty. Feature selection is the problem of finding the subset of all the available features that best satisfies some classification-related criteria. 
Usually the goal is to find a small I imposs sets gr that In inrestig to find Feat original cars fro of the a‘ mass) i feat IECOgnit tides a j e(‘lUlee tallonal of lean" the i”Du II] ad C0nlllnetj fiers to b Ilends in small subset of features that provides good classification accuracy. It is generally impossible to exhaustively search all feature subsets, since the number of such sub- sets grows rapidly as d increases. For a fixed subset size, n, the number of subsets that must be searched is (3). When n is not fixed, there are 2" possible subsets to investigate. AS a result, heuristic and suboptimal search methods are often employed to find a feature subset that performs well, but cannot be guaranteed to be optimal. Feature extraction goes a step further and computes new features based on the original feature set. For example, if a classifier were attempting to distinguish sports cars from other vehicle types, and engine horsepower and total vehicle mass were two of the available features, a feature extraction method might find that (horsepower+ mass) was a useful new feature, computed from the original features. Feature selection and extraction can provide a number of benefits to a pattern recognition system. Limiting the number of features to be considered generally pro- vides a reduction in the overall cost of a classifier by reducing the time, effort, and equipment involved in feature measurement and processing. Furthermore, the compu- tational efficiency of many classification methods is improved by lowering the number of features to consider. Even the accuracy of a classifier can be improved by limiting the input data to a minimal set of salient features. In addition to these direct benefits, the use of feature selection and extraction in conjunction with traditional pattern recognition can allow many traditional classi- fiers to be employed for data mining—the identification and analysis of interesting trends in large data sets. By identifying the particular features that are useful for sepa ting; in u: lers: data “it theg owir alla as I hon. argt l its (1 ei'ol' ion; of n ma tech ,qut»; have Deer; Olpi tbler 1996 f0g 21180 thin C1355 lers the Flies The i’sul NCO; llllt’, mill ids separating natural classes in large data sets, feature selection and extraction can aid in understanding the interrelationships between the features and the classes in the data. With the low cost and relatively high Speed of modern magnetic storage, and the growing use of computational techniques for collection and analysis of data from all areas of science, engineering, and industry, the idea of mining interesting data from large data sets has become a topic of much experimentation and study recently. This dissertation describes a feature selection and extraction method based on evolutionary computing, a parameter optimization method based on the dynamics of natural selection and the Darwinian model of evolution. The two evolutionary techniques employed, genetic algorithms (GAS) and evolutionary programming (EP), have been shown to be effective search and optimization methods for a broad array of problems (Fogel, 1998; Holland, 1975; Goldberg, 1989; Back and Schwefel, 1993, 1996; Fogel et al., 1966; Koza, 1992). 
The EC-based feature selection and extraction algorithms developed here are utilized to optimize the performance of several different classifiers, including traditional algorithms such as nearest-neighbor classification and the Bayes classifier, and several novel methods derived from the Bayes classifier. The resulting hybrid algorithms are shown to outperform various traditional pattern recognition techniques alone and in conjunction with commonly-used feature selection methods on a variety of artificial and real-world data sets. In addition, the data- mining capabilities of the EC-hybrid classifiers are explored in the analysis of an important problem from biochemistry—understanding the features determining the sites of water binding at protein surfaces. 1.2.2 Extensive portant a Much of 1 their stru drug desi tion of pr play an i molecules design, p and func protein 0 The h Structure Sites and emacti0 “Hills of Ilit? inter 1.2.2 Understanding Protein-Water Interactions Extensive application of the EC-hybrid classifiers was made in the study of an im- portant and difficult problem in biochemistry—understanding protein—water binding. Much of the current research in biochemistry is targeted at understanding proteins— their structure, function, and interactions with other molecules. Most pharmaceutical drug design efforts are directed at producing drugs that bind to and inhibit the func- tion of proteins. It is well known that the water molecules that surround most proteins play an important role in protein folding, structural stability, and binding to other molecules, such as drugs. Unfortunately, correct treatment of water molecules in drug design, protein folding studies, and many other areas of study in protein structure and function has proven to be a difficult task, because the state of hydration of the protein changes during these processes. The hybrid EC-classifiers described here have been applied to a database of protein structures to produce a classifier that is capable of predicting protein water-binding sites and their conservation upon binding other molecules. The feature selection and extraction capabilities of this classifier have been employed to analyze the determi- nants of water binding, which has contributed to a more complete understanding of the interactions between proteins, the molecules they bind, and water molecules. Che Bac 2.1 2.]..1 llodern pre‘lous sunulatp broadly Pattern OUI ime} b." anal. process: l5 Such. benefits Chapter 2 Background and Synopsis of Literature 2.1 Pattern Classification 2.1.1 The Pattern Recognition Task Modern advances in computer hardware and software technology have allowed many previously intractable problems from various fields of scientific inquiry to be analyzed, simulated, or otherwise addressed using computational techniques. One of the most broadly applicable and potentially profitable of these techniques is computational pattern recognition. For humans, the ability to identify patterns is vital to many of our intellectual faculties, including: our ability to abstract from examples, to reason by analogy, and to learn by induction. Yet the actual mechanisms by which these processes are effected remains a subject for debate and empirical experimentation. 
AS such, advances in the realm of computational pattern recognition bear potential benefits in many areas, including improvement of data analysis methods for the phys- 7 halsd unders fig/1H9 / ical sciences, advancement of machine perception and vision, and eventually further understanding of human perception and cognition. While the large quantity and variety of pattern recognition algorithms and tech- niques prevent the construction of a universal model for all pattern recognition tasks, it is possible to consider a simple model which captures the essential features of many classification systems. Figure 2.1 illustrates a model of a generic pattern recognition system. Raw Feature {> Data - Values F Classifier ClaSSIficanon > Subject " Pattern Classes Supervisor Figure 2.1: A general model for classical supervised pattern classification systems. In this model, the object of classification may be a physical object or a concept. The task is to assign the object to one of several prespecified categories. Initially, the Object is observed, yielding a collection of unprocessed, raw data. The nature of this observation can vary greatly according to the type of object being classified. For physical objects, cameras, sensors, or other transducers are often used to record various physical and visual properties of the object. For classification of concepts, fi ‘observation” may be more abstract. For a medical diagnosis system, for example, aver mm min sflfle 0f it: isofi £025 ing s olth, unur, the d and ti not b the Us the data is Often a combination of clinical test results and a description of symptoms provided by the patient. While it is possible to use all of the unprocessed data to classify the object, it is usually beneficial to reduce this raw data to a set Of features or properties related to the classification categories. Visual information collected with a camera, for example, might be summarized into a set of features describing the average color, general shape, and approximate size of the classification object. In general, the goal of feature selection is to reduce the input data to a parsi- monious and salient representation for the classification subject: a vector of feature values. This feature vector then serves as the input for the classifier, which is respon- sible for determining which category each subject belongs to according to the values of its features. The feature vector associated with a particular classification subject is often called a pattern, and the category associated with the subject may be referred to as that pattern’s class. Many classical pattern recognition methods are operated in two stages. The train- ing stage is distinguished by the presence of a supervisory agent that has knowledge of the true class for each pattern. During training, the supervisor provides this class information to the classifier as each pattern is observed, allowing the classifier to learn the distinctions among classes. During the testing stage, the class input is removed, and the performance of the classifier is observed. The training and testing stages need not be completely disjoint. In some handwriting recognition systems, for example, the user can correct misclassified letters while the system is in normal Operation. 2.1 C0n the funt Pl; info best any prol obtz turn one mat the 2.1.2 The Bayes Classifier Consider the task of assigning a sample to one of C classes, {w1,w2, ...wc}, based on the d-dimensional Observed feature vector 55'. 
Let p(x̄|ω_i) be the probability density function for the feature vector, x̄, when the true class of the sample is ω_i. Also, let P(ω_i) be the relative frequency of occurrence of class ω_i in the samples. If no feature information is available, the probability that a new sample will be of class ω_i is P(ω_i)—this probability is referred to as the a priori or prior probability, and is the best estimate that a sample will belong to a particular class prior to observation of any feature values. Once the feature values are obtained, we can combine the prior probability with the class-conditional probability for the feature vector, p(x̄|ω_i), to obtain the a posteriori probability that a pattern belongs to a particular class. This combination is done using Bayes rule (Bayes, 1763):

P(\omega_j \mid \bar{x}) = \frac{p(\bar{x} \mid \omega_j) P(\omega_j)}{\sum_{i=1}^{C} p(\bar{x} \mid \omega_i) P(\omega_i)}     (2.1)

Once the posterior probability is obtained for each class, classification is a simple matter of assigning the pattern to the class with the highest posterior probability—the resulting decision rule is the Bayes decision rule: given x̄, decide ω_i if

P(\omega_i \mid \bar{x}) > P(\omega_j \mid \bar{x}) \quad \forall j \neq i

When the class-conditional probability density for the feature vector and the prior probabilities for each class are known, the Bayes classifier can be shown to be optimal in the sense that no other decision rule will yield a lower error rate (Duda and Hart, 1973, pp. 10-17). Of course, these probability distributions (both a priori and a posteriori) are rarely known during classifier design, and must instead be estimated from training data. Class-conditional probabilities for the feature values can be estimated from the training data using either a parametric or a non-parametric approach. A parametric method assumes that the feature values follow a particular probability distribution for each class and estimates the parameters of the distribution from the training data. For example, a common parametric method is to assume a Gaussian distribution of the feature values, and then estimate the parameters μ_i and σ_i for each class, ω_i, from the training data. The parametric method can simplify the problem of estimating the class-conditional feature distributions, but can also result in poor classification if the assumed probability distribution does not fit the actual distribution of feature values. A non-parametric approach usually involves construction of a histogram from the training data to approximate the class-conditional distribution of the feature values. Several decisions in the construction of this histogram, such as the number of bins and the size of each, are critical in obtaining good classification performance. The lack of a standard procedure for constructing a distribution histogram is a primary weakness in the non-parametric sampling approach.

Once the distribution of the feature values has been approximated for each class, the question remains how to combine the individual class-conditional probability density functions for each feature, p(x_1|ω_i), p(x_2|ω_i), ..., p(x_d|ω_i), to determine the probability density function for the entire feature vector, p(x̄|ω_i).
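Before turning to how the individual per-feature densities are combined, the decision rule built from Equation 2.1 can be made concrete with a small sketch. This is a minimal illustration only: the two Gaussian class-conditional densities, the prior values, and all function names below are hypothetical choices for the example, not part of the methods developed in this thesis.

    # Minimal sketch of the Bayes decision rule (Equation 2.1).
    # Assumes the class-conditional densities and priors are already known;
    # here they are illustrated with two Gaussian classes on a single feature.

    import math

    def gaussian_pdf(x, mean, std):
        """Class-conditional density p(x | w_i) under a Gaussian assumption."""
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

    # Hypothetical two-class problem: density parameters and priors P(w_i).
    class_params = {1: (0.0, 1.0), 2: (2.0, 1.5)}   # (mean, std) for each class
    priors       = {1: 0.7, 2: 0.3}

    def posteriors(x):
        """Bayes rule: P(w_j | x) = p(x | w_j) P(w_j) / sum_i p(x | w_i) P(w_i)."""
        joint = {c: gaussian_pdf(x, m, s) * priors[c] for c, (m, s) in class_params.items()}
        evidence = sum(joint.values())
        return {c: joint[c] / evidence for c in joint}

    def bayes_decision(x):
        """Decide w_i when P(w_i | x) exceeds the posterior of every other class."""
        post = posteriors(x)
        return max(post, key=post.get)

    if __name__ == "__main__":
        for x in (-1.0, 1.0, 3.0):
            print(x, posteriors(x), "->", bayes_decision(x))

The sketch assumes the densities are given in closed form; the following paragraphs address the more realistic setting in which they must be assembled from estimated per-feature densities.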
A common method is to assume that the feature values are statistically independent:

p(\bar{x} \mid \omega_i) = p(x_1 \mid \omega_i) \times p(x_2 \mid \omega_i) \times \cdots \times p(x_d \mid \omega_i)     (2.2)

The resulting classifier, often called the naive Bayes classifier, has been shown to perform well on a variety of data sets, even when the independence assumption is not strictly satisfied (Domingos and Pazzani, 1996). The selection of the prior probabilities for the various categories has been the subject of a substantial body of literature (Jaynes, 1968). One of the most common methods is to simply estimate the relative frequency for each class from the training data and use these values for the prior probabilities. An alternate method is to simply assume equal prior probabilities for all categories by setting P(ω_i) = 1/C, i = 1...C.

2.1.3 Nearest Neighbors Classification

Estimation of the class-conditional distribution of the feature values in Bayesian classification is a potential source of classification error. An alternative approach is to design a classifier that does not require an estimate of the distribution of the feature values, but instead memorizes the feature values for every training sample. In this way, no training information is lost in parameter estimation. One such classifier is the k-nearest-neighbors (knn) classifier (Cover and Hart, 1967). The operation of the knn classifier during training is quite simple—the feature values for each training sample are stored in their entirety. To classify a new testing sample, the training samples are considered in a d-dimensional space, where each of the orthogonal coordinate axes corresponds to one classification feature. The test sample is plotted into this feature space, and the k nearest neighbors, for some small constant k, are found. Figure 2.2 demonstrates the behavior of a 3-class, 5-nearest-neighbors classifier where d = 2.

Figure 2.2: A three class, 5-nearest-neighbors classifier. The training samples are plotted in a two-dimensional feature space. Three of the five nearest neighbors of the test sample are of class 2, so the test sample will be classified as belonging to class 2.

Any distance metric can be used to calculate the neighbors—Euclidean distance and Mahalanobis distance¹ are the two metrics most commonly employed. The classes of these k neighbors are tallied, and the most commonly represented class becomes the predicted class for the test sample. Several tie-breaking schemes are possible when a single majority class is not present among the near neighbors. The most common, and the method used here, is to choose the class of the closest neighbor. Other methods include scaling each vote by the distance from the current test sample, choosing the most common class from the training data, and choosing the class with the greatest misclassification cost. Feature values are usually normalized prior to computing the distances to prevent the scale of each feature from biasing the prediction.

¹Euclidean distance is defined as d^2 = (\bar{x}_2 - \bar{x}_1)^T (\bar{x}_2 - \bar{x}_1). Mahalanobis distance is defined as d^2 = (\bar{x}_2 - \bar{x}_1)^T \Sigma^{-1} (\bar{x}_2 - \bar{x}_1), where \Sigma is the d × d covariance matrix of the d-dimensional training data.
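The two distance metrics defined in the footnote, and the feature normalization mentioned above, can be sketched as follows. This is an illustrative fragment under stated assumptions: the array layout (rows as training patterns, columns as features) and all function names are choices made for the example, not a specification taken from the text.

    # Sketch of the two distance metrics used for near-neighbor search
    # (Euclidean and Mahalanobis; see footnote), plus the per-feature
    # normalization applied before distances are computed.

    import numpy as np

    def normalize(train, test):
        """Scale each feature to zero mean and unit variance using training statistics."""
        mu, sigma = train.mean(axis=0), train.std(axis=0)
        sigma[sigma == 0.0] = 1.0            # guard against constant features
        return (train - mu) / sigma, (test - mu) / sigma

    def euclidean_sq(x1, x2):
        """d^2 = (x2 - x1)^T (x2 - x1)."""
        diff = x2 - x1
        return float(diff @ diff)

    def mahalanobis_sq(x1, x2, cov_inv):
        """d^2 = (x2 - x1)^T Sigma^{-1} (x2 - x1), with Sigma estimated from training data."""
        diff = x2 - x1
        return float(diff @ cov_inv @ diff)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        train = rng.normal(size=(20, 3))     # 20 training patterns, d = 3 features
        test = rng.normal(size=(1, 3))
        train_n, test_n = normalize(train, test)
        cov_inv = np.linalg.inv(np.cov(train_n, rowvar=False))
        print(euclidean_sq(train_n[0], test_n[0]))
        print(mahalanobis_sq(train_n[0], test_n[0], cov_inv))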
It can be shown that the asymptotic error probability (that is, the error rate as the number of training samples approaches infinity) of the nearest neighbor classification rule is bounded by twice the Bayes error probability (Duda and Hart, 1973, p. 101); i.e.

\lim_{n \to \infty} P_e(\mathrm{Bayes}) \;\le\; \lim_{n \to \infty} P_e(\mathrm{NN}) \;\le\; 2 \lim_{n \to \infty} P_e(\mathrm{Bayes})     (2.3)

A drawback to the knn classifier is the computational expense of computing the distance between the testing sample and each of the training samples. A branch-and-bound approach to near-neighbor search (Fukunaga and Narendra, 1975) can significantly reduce this overhead for large training sets.

Many variants of the knn classifier have been proposed and explored. The condensed nearest neighbor rule (Hart, 1968) and the reduced nearest neighbor rule (Gates, 1972) seek to reduce the number of training patterns that must be stored and tested for each near-neighbor search, while maintaining the same decision boundaries as the nearest neighbor rule. The reduced nearest neighbor rule, in particular, expends considerable computation effort to find the minimal consistent subset of the training patterns in order to reduce the computation time needed later to find near neighbors during on-line classification. The edited nearest neighbor rule (Wilson, 1972) smoothes the decision surface by removing "outlier" training samples—that is, samples belonging to a certain class that fall in an area of the feature space dominated by another class. The resulting decision rule is less sensitive than the nearest neighbor rule to erroneous feature measurement and anomalous training data.

For values of k larger than 1, a measure of classification confidence can be obtained from the number of nearest neighbors belonging to the majority class. This measure of confidence can be used to define a reject option for the k nearest neighbor rule, which can result in a reduction in the error rate for those samples that are not rejected. The use of majority votes as a confidence measure is discussed further in Chapter 3.

The well-known statistical classifiers can be partitioned into parametric and non-parametric methods. As described previously, the Bayes classifier can be implemented using either approach. The parametric methods assume that the form of the class-conditional density function of the features is known in advance. A common assumption is that the feature values follow a multivariate Gaussian distribution. The training data are then used to estimate the parameters of these class-conditional densities (e.g. the mean vector, μ, and the covariance matrix, Σ, for the Gaussian distribution) and these estimated densities are then used to classify new patterns. The non-parametric methods, including Parzen window density estimation (Parzen, 1962) and the nearest-neighbor methods (Cover and Hart, 1967), make no assumptions about the class-conditional distribution of feature values. Rather, the form of the density function is either fitted to the training data, as in the Parzen window method, or dealt with implicitly, as in the nearest neighbor rule.
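A minimal k-nearest-neighbors classifier of the kind described in Section 2.1.3 is sketched below, using Euclidean distance, a majority vote over the k neighbors, and the closest-neighbor tie-breaking rule adopted in the text. It is an illustration under those assumptions rather than the implementation used in later chapters, and the small data set at the end is invented for the example.

    # Sketch of the k-nearest-neighbors decision rule of Section 2.1.3:
    # store the training patterns, find the k nearest under Euclidean distance,
    # take a majority vote, and break ties with the class of the closest neighbor.

    from collections import Counter

    def knn_classify(train_x, train_y, query, k=5):
        # Squared Euclidean distance from the query to every training pattern.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, query)), y)
            for x, y in zip(train_x, train_y)
        )
        neighbors = dists[:k]
        votes = Counter(label for _, label in neighbors)
        best = max(votes.values())
        tied = {label for label, count in votes.items() if count == best}
        if len(tied) == 1:
            return tied.pop()
        # Tie: return the class of the closest neighbor among the tied classes.
        for _, label in neighbors:
            if label in tied:
                return label

    if __name__ == "__main__":
        # Tiny two-feature, two-class example.
        train_x = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
        train_y = [1, 1, 2, 2, 2]
        print(knn_classify(train_x, train_y, (0.95, 0.9), k=3))   # expected: class 2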
2.2 Feature Selection and Extraction

2.2.1 Feature Costs and the Curse of Dimensionality

The selection of features to measure and include in the feature vector can have a profound impact on the accuracy of the resulting classifier, regardless of what specific classification rule is implemented. A common approach is to have human experts describe as many features as possible that are readily measurable and likely to be related to the classification categories. Unfortunately, there are several disadvantages to evaluating a profuse number of features in classification. First, each additional feature to be considered often incurs an additional cost in terms of measurement time, equipment costs, and storage space. In addition, the computational complexity of classification grows with each additional feature. For some classifiers the cost of each additional feature in computational complexity can be significant. Nearest-neighbor classification, for example, has a complexity that is O(n²), where n is the combined size of the training and testing data sets. In addition, the inclusion of spurious features (features unrelated to the classification categories) is likely to reduce classification accuracy. In fact, it is sometimes the case that the inclusion of features that do, in fact, contain information relevant to classification can result in reduced accuracy when the number of training samples is fixed. This phenomenon is sometimes referred to as the curse of dimensionality (Jain and Chandrasekaran, 1982).

At the root of the problem of high dimensionality lies the fact that in real-world classifiers the class-conditional densities are not known, and must be estimated from a finite number of training samples. This fact is easily seen when considering a classification task involving two classes, where the feature vector for each class follows a multivariate Gaussian distribution. As each new feature, x_i, is added to the pattern vector, additional information is made available for classification, as long as the mean of the feature, μ_i, is not identical to that of some other feature. However, in addition to providing new information, the inclusion of the new feature introduces additional parameters which must be estimated to characterize the class-conditional distribution of the feature vector. In the case of a Gaussian distribution, an additional mean value is added to μ and the covariance matrix Σ changes from a d × d matrix to a (d + 1) × (d + 1) matrix. Since the covariance matrix is symmetric, the number of parameters which must be estimated increases by (d + 2) for each additional feature. Clearly, the rapid growth in the number of parameters to be estimated has the potential to deteriorate the value of additional features to the classifier, even when the new features contain additional information. This effect was illustrated for a specific two-class problem with Gaussian distribution of feature values by Trunk (1979).

2.2.2 Feature Selection

A number of techniques have been developed to address the problem of dimensionality, including feature selection and feature extraction. The main purpose of feature selection is to reduce the number of features used in classification while maintaining an acceptable classification accuracy. Less discriminatory features are eliminated,
leaving a subset of the original features which retains sufficient information to discrim- inate well among classes. For most problems the brute-force approach is prohibitively expensive in terms of computation time. Cover and Campenhout (1977) have shown that to find an Optimal subset of size n from the original d features, it is necessary to evaluate all (3) possible subsets when the statistical dependencies among the features are not known. Furthermore, when the size of the feature subset is not specified in advance, each of the (2“) subsets of the original d features must be evaluated. In the special case where the addition of a new feature always improves performance, it is possible to significantly reduce the number of subsets that must be evaluated using a branch and bound search technique (Narendra and Fukunaga, 1977). Unfor- tunately, this sort of monotonic decrease in the error rate as new features are added is often not found in real-world classification problems due to the effects of the curse of dimensionality and finite training sample sizes. Various heuristic methods have been prOposed to search for near-Optimal feature subsets. Sequential methods involve the addition or removal of a single feature at each step. Sequential forward selection (Whitney, 1971) begins with the single fea- ture which provides the best classification performance. At each iteration, the feature which provides the best performance in combination with the current subset of fea- tures is added, until a subset of the desired size is generated. Once a feature is added, it cannot be removed. Sequential backwards selection operates similarly, but begins with all features included, removing a Single feature at each iteration. “Plus I — take away 1'” selection combines these two methods by alternately enlarging and reducing 18 the feature subset repeatedly. The sequential floating forward selection algorithm (SFFS) Of Pudil et at. (1994) is a further generalization of the plus I, take away r methods, where l and r are not fixed, but rather are allowed to “float” to approximate the Optimal solution as much as possible. The algorithm basically executes as follows: 0 Let X), = 131...:Ck be a set of It features 0 Let J () be a criterion function (e.g. classification accuracy) 0 Let the significance of the feature 27,- in the set Xk, be Sk_1(xj) = J (Xk) '- J(Xk — (Ej) 0 Let the significance Of the feature :12,- 9! Xk, be Sk+1(:r,~) = J (X,c + 133-) — J (Xk) 0 Beginning with an initial set Of It features, Xk... 1. Using the sequential forward selection method, find XHI. 2. Find the least significant feature in Xk+1. If 55,,“ is the least Significant feature, then set k = k + 1 and go to step 1. 3. If x,,1 S r g k is the least Significant feature in XHI, then exclude 1:, from Xk+1 to form a new feature set X;. 4. If J(xp > J(Xk): — If k = 2 then set Xk = X}, and go to step 1. — Otherwise continue conditional exclusion by returning to step 2. In a recent study of current feature selection techniques, Jain and Zongker (1997) evaluated the performance Of fifteen feature selection algorithms in terms Of classifica- 19 tio Th for an Sf} beh clas add gan beh the niqu 3 DH the i mot hate geflei tradi pIObl Class; Una tion error and run time on a 2-class, 20—dimensional, multivariate Gaussian data set. Their findings demonstrated that the SFFS algorithm dominated the other methods for this data, Obtaining feature selection results comparable to the Optimal branch- and-bound algorithm while requiring less computation time. 
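The floating search just described can be summarized in code. The sketch below follows the plus-l, take-away-r "floating" idea of the SFFS outline above with an arbitrary criterion function J(); it is a simplified rendering for illustration, not the exact algorithm of Pudil et al. (1994), and the criterion used in the example (a synthetic score over invented "useful" features) is purely hypothetical.

    # Simplified sketch of sequential floating forward selection (SFFS):
    # grow the feature set with the best single addition, then conditionally
    # remove the least significant feature whenever doing so improves J().

    def sffs(all_features, J, target_size):
        current = []
        best_of_size = {}                       # best J() seen for each subset size
        while len(current) < target_size:
            # Forward step: add the most significant remaining feature.
            candidates = [f for f in all_features if f not in current]
            add = max(candidates, key=lambda f: J(current + [f]))
            current = current + [add]
            best_of_size[len(current)] = J(current)
            # Conditional exclusion: drop the least significant feature while it helps.
            while len(current) > 2:
                worst = max(current, key=lambda f: J([g for g in current if g != f]))
                reduced = [g for g in current if g != worst]
                if worst != add and J(reduced) > best_of_size.get(len(reduced), float("-inf")):
                    current = reduced
                    best_of_size[len(current)] = J(current)
                else:
                    break
        return current

    if __name__ == "__main__":
        # Hypothetical criterion: reward features 0, 2, 5 and penalize large subsets.
        useful = {0: 0.5, 2: 0.4, 5: 0.3}
        J = lambda subset: sum(useful.get(f, 0.01) for f in subset) - 0.05 * len(subset)
        print(sffs(list(range(8)), J, target_size=3))   # expected: [0, 2, 5]

In practice J() would be a classification-accuracy estimate, as in the criterion function defined in the algorithm outline above.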
Further tests of the SFFS technique in combination with a knn classifier were conducted to observe the behavior of the knn classification rule as additional features were provided to the classifier. The results showed that classification accuracy was initially improved as additional features were introduced, but eventually reached a maximal value and began to decline with the introduction of further features. Because of this unimodal behavior, selection of an optimal number of features is straightforward when using the SFFS method in combination with a knn classifier.

When classification is being performed using neural networks, node pruning techniques can be used for dimensionality reduction (Mao et al., 1994). After training for a number of epochs, nodes are removed from the neural network in such a manner that the increase in squared error is minimized. When an input node is pruned, the feature associated with that node is no longer considered by the classifier. Similar methods have been employed in the use of fuzzy systems for pattern recognition through the generation of fuzzy if-then rules (Nozaki et al., 1996; Ishibuchi et al., 1995). Some traditional pattern classification techniques, while not specifically addressed to the problem of dimensionality reduction, can provide feature selection capability. Tree classifiers (Quinlan, 1986b), for example, typically partition the training data based on a single feature at each tree node. If a particular feature is not tested at any node of the decision tree, it is effectively eliminated from classification. Additionally, simplification of the final tree can provide further feature selection (Quinlan, 1987).

2.2.3 Feature Extraction

Feature extraction, a superset of feature selection, involves transforming the original set of features to provide a new set of features, where the transformed feature set usually consists of fewer features than the original set. While both linear and non-linear transformations have been explored, most of the classical feature extraction techniques involve linear transformations of the original features. Formally, the objective for linear feature extraction techniques can be stated as follows: Given an n x d pattern matrix A (n points in a d-dimensional space), derive an n x m pattern matrix B, m < d, where B = AH and H is a d x m transformation matrix.

According to this formalization, many common methods for linear feature extraction can be specified according to the method of deriving the transformation matrix, H. For unsupervised linear feature extraction, the most common technique is principal component analysis (Duda and Hart, 1973). For this method, the columns of H consist of the eigenvectors of the d x d covariance matrix of the given patterns. It can be shown that the new features produced by principal component analysis are uncorrelated and maximize the variance retained from the original feature set (Duda and Hart, 1973). The corresponding supervised technique is linear discriminant analysis. In this case, the columns of H are the eigenvectors corresponding to the nonzero eigenvalues of the matrix $S_W^{-1} S_B$, where $S_W$ is the within-class scatter matrix and $S_B$ is the between-class scatter matrix for the given set of patterns. Deriving H in this way maximizes the separation between class means relative to the covariance of the classes (Duda and Hart, 1973).
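As an illustration of this formalism, the following NumPy sketch derives H from the eigenvectors of the covariance matrix (the principal component case) and applies B = AH. The function name and the use of NumPy are my own choices for illustration; this is not the software described later in this work.

import numpy as np

def pca_transform(A, m):
    """Illustrative PCA-style linear feature extraction: B = A H.

    A -- n x d pattern matrix (n patterns, d features)
    m -- number of extracted features, m < d
    """
    A_centered = A - A.mean(axis=0)
    cov = np.cov(A_centered, rowvar=False)        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    H = eigvecs[:, order[:m]]                     # d x m transformation matrix
    B = A_centered @ H                            # n x m transformed patterns
    return B, H

# Example: reduce 50 ten-dimensional patterns to 3 extracted features.
A = np.random.randn(50, 10)
B, H = pca_transform(A, 3)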
In the general case, the matrix H is chosen to maximize some criterion, typically related to class separation or classification accuracy for a specific classifier. In this view, feature selection is a special case of linear feature extraction, where the off-diagonal entries of H are zero, and the diagonal entries are either zero or one.

Feature selection and extraction can enhance pattern recognition in a number of different ways. As discussed previously, these techniques can reduce the cost of feature measurement and storage, as well as providing improved classification accuracy. Additionally, feature selection and extraction can help to reveal the relationships between the features available to a classifier and the classification categories. By identifying minimal sets of features that are sufficient for accurate classification, feature selection and extraction can aid researchers in understanding which features are related to the categories in the data. This function is closely related to the problem of data mining, and can be a useful tool for analysis of large data sets.

2.3 Evolutionary Computation

Dimensionality reduction is well suited to formulation as an optimization problem, thus making it available to the diverse array of computational techniques that have been developed and explored in this area. Some of these methods have been developed by observation and modeling of the process of Darwinian evolution and natural selection. Such methods, collectively termed evolutionary computation, have been developed and studied from various viewpoints, leading to a number of different techniques and specific algorithms (Fraser, 1957; Crosby, 1967; Bremermann et al., 1966; Reed et al., 1967; Fogel, 1998).

2.3.1 Genetic Algorithms

Genetic algorithms comprise a subset of evolutionary computation focusing on the application of selection, mutation, and recombination to a population of competing problem solutions (Holland, 1975; Goldberg, 1989). Figure 2.3 shows a general model of a simple GA. Problem solutions are encoded as strings of information, or chromosomes. A population of competing solutions is maintained and sorted according to the application-specific fitness of each solution. Each generation, a stochastic selection process is employed to choose individuals to advance to the next generation. Individuals with higher fitness values are more likely to be chosen, but the stochastic nature of the selection process allows for the selection of any individual in the population. Various selection methods have been employed, including fitness proportionate selection (Holland, 1975), tournament selection, rank selection, and others. From the individuals that do advance to the next generation, some are chosen at random as parent individuals for recombination. The process of recombination combines traits of the parent individuals to form a new child individual. Again, various techniques are possible, including one- and two-point crossover, uniform crossover, and others (Goldberg, 1989; Mitchell, 1996). It has been proposed that the learning power of the genetic algorithm is largely derived from the interaction of recombination and selection, which together isolate "building blocks", or short motifs common to highly fit chromosomes. Such motifs can then combine to form new chromosomes of even higher fitness (Holland, 1975; Goldberg, 1989).
Finally, random mutation is employed to introduce a controlled amount of noise into the chromosomes, which in turn maintains diversity in the population and helps to avoid premature convergence of the population at local extrema. Genetic algorithms have been shown to be a useful tool in optimization problems from a broad range of scientific disciplines.

Figure 2.3: A general model of a simple Genetic Algorithm.

2.3.2 Evolution Strategies and Evolutionary Programming

The popularity of the GA as an optimization tool is largely due to its ability to avoid local optima by maintaining a diverse population of competing problem solutions. Unfortunately, the mechanisms of recombination and selection are not the most appropriate tools for fine-tuning nearly-optimal problem solutions. As a result, genetic algorithms are more powerful for the early stages of optimization, when general areas of high fitness are sought, than for later stages where solutions in these areas are being refined toward a global optimum. Some researchers have concluded that GAs are, in fact, not an appropriate technique for real-valued function optimization (De Jong, 1992).

Evolutionary programming (EP; Fogel et al., 1966) and evolution strategies (ES; Rechenberg, 1973; Schwefel, 1977), two subsets of the EC family of methods, are often considered more suited to real-valued function optimization and refinement of nearly-optimal problem solutions. Like GAs, EP and ES algorithms are parallel, iterative optimizers. Also like GAs, EP and ES maintain a population of competing problem solutions which are evaluated in terms of a fitness function and subjected to a selection function for inclusion in successive generations. Unlike GAs, however, EP and ES do not emphasize recombination as a primary method for optimization. Instead, these techniques generally rely on various forms of mutation as the key means of learning. For EP, which is often employed for real-valued function optimization, one of the most common mutation operators involves the addition of a Gaussian deviate to the current value of a field on the chromosome. The Gaussian deviate is generally chosen from a distribution with a mean of zero and a variance that is either computed based on the degree of convergence of the current population or stored and evolved with each individual chromosome (Bäck and Schwefel, 1996). The specific selection method used by EP is slightly different from those commonly used in GAs as well. A typical method is to compete each individual in a fixed number of "tournaments", in which its fitness is compared with that of a random member of the population to determine a winner. After each individual has competed in the requisite number of tournaments, the entire population is sorted according to the number of tournament wins, and the upper half of this sorted population is allowed to advance to the next generation. As with GA selection methods, many variations on this basic method have been explored.
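The tournament-based survivor selection just described can be captured in a few lines. The sketch below is a generic illustration, not tied to any particular EP implementation; the fitness callable and the number of tournaments q are assumed inputs.

import random

def ep_survivor_selection(population, fitness, q=10, rng=random):
    """Illustrative EP-style tournament survivor selection.

    population -- list of individuals
    fitness    -- function mapping an individual to a fitness value (higher is better)
    q          -- number of tournaments each individual competes in (assumed value)
    """
    scores = [fitness(ind) for ind in population]
    wins = []
    for i, _ in enumerate(population):
        opponents = [rng.randrange(len(population)) for _ in range(q)]
        wins.append(sum(scores[i] >= scores[j] for j in opponents))
    # Rank by tournament wins and keep the upper half of the population.
    order = sorted(range(len(population)), key=lambda i: wins[i], reverse=True)
    return [population[i] for i in order[: len(population) // 2]]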
2.4 Evolutionary Computation in Feature Selection and Extraction

A direct approach to using GAs for feature selection was introduced by Siedlecki and Sklansky (1989). In their work, a GA is used to find an optimal binary vector, where each bit is associated with a feature (Figure 2.4). If the ith bit of this vector equals 1, then the ith feature is allowed to participate in classification; if the bit is a 0, then the corresponding feature does not participate. Each resulting subset of features is evaluated according to its classification accuracy on a set of testing data using a nearest-neighbor classifier.

Figure 2.4: A d-dimensional binary vector, comprising a single member of the GA population for GA-based feature selection.

This technique was later expanded to allow linear feature extraction by Punch et al. (1993) and independently by Kelly and Davis (1991). The single bit associated with each feature is expanded to a real-valued coefficient, allowing independent linear scaling of each feature, while maintaining the ability to remove features from consideration by assigning a weight of zero. Given a set of feature vectors of the form X = {x_1, x_2, ..., x_d}, the GA produces a transformed set of vectors of the form X' = {w_1 x_1, w_2 x_2, ..., w_d x_d}, where w_i is a weight associated with feature i. Each feature value is first normalized, then scaled by the associated weight prior to training, testing, and classification. This linear scaling of features prior to classification allows a classifier to discriminate more finely along feature axes with larger scale factors. A knn classifier is used to evaluate each set of feature weights. The effects of linear feature weighting on the knn classification rule are visualized in Figure 2.5. Patterns plotted in feature space are spread out along feature axes with higher weight values, and compressed along features with lower weight values. The value of k for the knn classifier is fixed and determined empirically prior to feature extraction. A code sketch of this weighted evaluation is given below.

Figure 2.5: Effect of scaling feature axes on k-nearest-neighbor classification. (a) original data; (b) scaled data. Extension of the scale of the horizontal axis increases the distance between patterns which differ in feature 1, allowing the knn to discriminate more finely along this dimension. Here, the prediction of the unknown changes from class 2 to class 3 as a result of scaling.
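To make the weighting scheme concrete, the sketch below scales (already normalized) features by a candidate weight vector and scores the weights by k-nearest-neighbor accuracy on a tuning set. It is a minimal illustration of the idea, not the GA/knn software described later; the arrays and the example weight vector are assumptions.

import numpy as np

def weighted_knn_accuracy(weights, k, train_X, train_y, tune_X, tune_y):
    """Score a weight vector by the accuracy of a weighted knn on a tuning set.

    Features are assumed to be normalized before scaling by the weights.
    """
    sw_train = train_X * weights          # scale each feature axis by its weight
    sw_tune = tune_X * weights
    correct = 0
    for x, y in zip(sw_tune, tune_y):
        dists = np.linalg.norm(sw_train - x, axis=1)      # Euclidean distances
        neighbors = train_y[np.argsort(dists)[:k]]        # classes of the k nearest neighbors
        votes = np.bincount(neighbors)
        correct += int(np.argmax(votes) == y)
    return correct / len(tune_y)

# Example with random data: 3 features, feature 0 weighted most heavily, feature 2 removed.
rng = np.random.default_rng(0)
train_X, train_y = rng.random((100, 3)), rng.integers(0, 2, 100)
tune_X, tune_y = rng.random((30, 3)), rng.integers(0, 2, 30)
print(weighted_knn_accuracy(np.array([5.0, 1.0, 0.0]), k=3,
                            train_X=train_X, train_y=train_y,
                            tune_X=tune_X, tune_y=tune_y))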
In a similar approach, Yang and Honavar (1998) use a simple GA for feature subset selection in conjunction with DistAl, a neural network-based pattern classifier (Yang et al., 1998). As in other GA-based feature selectors, a simple binary representation was used where each bit corresponds to a single feature. The use of the GA for feature subset selection improved the accuracy of the DistAl classifier for nearly all of the data sets explored, while simultaneously reducing the number of features considered. Their hybrid classifier, GADistAI, outperformed a number of modern classification methods on the various data sets presented.

Vafaie and De Jong (1998) describe a hybrid technique in which EC methods are employed for both feature selection and extraction² in conjunction with the C4.5 decision tree classifier system (Quinlan, 1986a). Again, a binary representation is used for feature subset selection using traditional GA techniques. In this system, however, the features seen by the classifier are functions of the original features composed of simple arithmetic operations. For example, one such feature might be {(F1 - F2) x (F2 - F4)}, where F1, F2, and F4 represent values from the original feature set. These functions for constructing new features are represented as trees, in much the same way as commonly seen in the genetic programming (GP; Koza, 1992) literature. GP subtree crossover is used for recombination of these function trees, while traditional GA-style mutation and crossover are used for the feature selection chromosome. In this system, feature selection and extraction are not simultaneous. Rather, feature selection, feature extraction, and classifier construction are performed serially each generation. The resulting system was shown to outperform the decision tree classifier using the original feature set, while reducing the number of features considerably, for human facial image recognition. Additionally, the hybrid system was shown to outperform several contemporary classifiers on three diverse data sets.

²The authors use the term "feature construction".

Chapter 3

Feature Selection and Extraction using the K-Nearest-Neighbors Classifier in Combination with Evolution-based Learning

3.1 Methods

3.1.1 Branch and Bound Near-Neighbor Searching

The k-nearest-neighbors classifier has several features that make it a good choice for use with feature selection and extraction algorithms. The nonparametric nature of the nearest neighbors rule allows the classifier to be applied to a broad range of data sets, including those where the form of the class-conditional distribution of feature values is completely unknown. When used with a Euclidean distance metric, the knn decision rule is sensitive to scaling of the feature values, which forms the basis for the feature extraction technique applied here. Many classifiers based on the class-conditional distribution of feature values, including the Bayes classifier, are invariant to scaling of the feature values.

The primary drawback of the knn classifier is the computational complexity of the near neighbor search. The search for the neighbors of a single test pattern takes O(n * d) time, where n is the number of training samples and d is the number of features being considered. Thus, the evaluation of a test set of m samples takes O(n * m * d) time, which can quickly become prohibitively expensive for large training and testing set sizes and large numbers of features. To reduce the computational cost of the near neighbor search, the branch and bound search algorithm of Fukunaga and Narendra (1975) was incorporated into the knn classifier. The resulting branch and bound knn (bbknn) classifier scales more efficiently with the number of samples in the knn training set.

3.1.2 The Hybrid GA/knn Classifier

The basis for the hybrid nearest neighbor classification technique is the weighted nearest neighbors classifier of Punch et al. (1993). As described in Section 2.4, a genetic algorithm is used to optimize a vector of weight values, {w_1, w_2, ..., w_d}, where each weight is used as a linear scaling factor for one of the features considered by the classifier. The feature values are normalized over a common range prior to scaling. For the work described here the normalized feature values ranged from 1.0 to 10.0. The formalism for linear feature extraction described in Section 2.2.3 states that the objective is a d x m transformation matrix, H, where d is the number of original features and m is the number of transformed features.
The GA-weighted knn classifier fits this framework, where the GA produces the diagonal elements of the transformation matrix, H, and all the off-diagonal elements are zero. Figure 3.1 shows a schematic overview of a GA-based linear feature extractor in combination with a knn classifier. For the sake of efficiency, only the weight vector itself (that is, the diagonal elements of the transformation matrix, H) need be passed to the knn. Since the value of k (the number of nearest neighbors to consider) is closely related to the scaling of the feature dimensions, the value of k was also included on the GA chromosome for co-optimization with the weight vector.

Figure 3.1: A GA-based feature extractor using an objective function based on classification accuracy. Each transformation matrix from the GA is used to transform the input patterns, which are then passed to a knn classifier. The fitness of a particular transformation matrix is based on the classification accuracy of the knn using the transformed patterns.

The GA Cost Function

As each weight vector and k value is sent to the knn classifier for evaluation, a cost (or inverse fitness) score is computed, based primarily on the accuracy obtained by the knn in classifying a set of samples of known class. Since the genetic algorithm seeks to minimize the cost score, the formulation of the cost function is a key element in determining the quality of the resulting classifier. As such, several trade-offs must be considered in the design of the objective function:

Feature subset size versus classification accuracy — Up to a certain point, inclusion of meaningful features will generally increase the accuracy of a classifier. Thus, reducing the size of the feature set and maximizing classification accuracy are often competing goals. The importance of feature set reduction relative to overall accuracy can be controlled through the fitness function.

Measures of classification accuracy — All types of classification errors do not necessarily incur equal costs. In a two-class system, for example, the cost of misclassification for one class might differ from the other. The most commonly employed measure of accuracy is the classification rate, usually defined as the number of correct classifications divided by the number of test samples. This simple metric may not suffice for problems with differing objectives. Consider a test set consisting of n samples, {x_1, x_2, ..., x_n}, belonging to c classes. For a given classifier, let P_i be the number of patterns from the test set that were correctly predicted to belong to class i. Let U_i be the number of patterns falsely predicted as class i (that is, predicted as class i, but observed to belong to class j, i ≠ j). Similarly, let O_i be the number of patterns belonging to class i, but mispredicted as belonging to some other class; and let N_i be the number of patterns in the testing set neither observed nor predicted as belonging to class i. Using these four variables, we can establish several measures of classification accuracy that emphasize different aspects of the classification task.
While overall accuracy is generally desirable, it can be a somewhat short- sighted measure of classifier quality when used alone. For example, for a two-class problem where the test set consists of many samples from class 1 and only a few from class 2, a good classification rate might be achieved by always predicting class 1. Class balance metrics can be combined with classification rate to mitigate this effect. Classification rate is also often expressed as error rate, Err = 1.0 — Acc. C P e Average class accuracy: ACA = ( ’ ) /c f; 0.- + P.- Average class accuracy helps to insure class balance by computing the classification rate for each class, and then taking the mean over all classes. .01 bl ~Bz—°(P‘ ) main P1 ass a ance. a — millx 0‘ + B j=l 03- + P]- This direct measure of class balance can be combined with the classification rate to penalize for class bias. 35 Objec tl. P.- U,+P.~ Sometimes called “hit rate”, PBA is defined on a per-class basis. For a e Prediction-based Accuracy: PBA(2’) = given class, 2', PBA(2’) is the proportion of all the predictions for class 2' that are correct. For example, if 100 samples are predicted to belong to class 1, and 85 of them are observed to actually belong to class one, then PBA(1) = 0.85. This measure explicitly penalizes for overpredic- tion of each class, while AC‘A penalizes for underprediction. Average prediction-based accuracy (APBA) can be computed for the entire testing i=1 set: APBA(2’) = c (Fl—:43) /c e Matthews coefficient (Matthews, 1975): I),- * N,“ — U5 * 05 (I); + Ui) * (Pi + 0;)* (Ni-i” U,) * (N; + 0.) Like PBA, the Matthews coefficient is defined on a per-class basis. The Cm(i) = Matthews coefficients for each class can be averaged to produce an overall measure of accuracy and class balance. When the testing set is extremely unbalanced, Cm can have small values even when the accuracy for each class is high. Cm for a particular class is undefined when none of the training samples are predicted to belong to that class. Objective function smoothness -— For the knn classifier an objective function based largely on classification accuracy can have an unnecessarily rough, step- wise character. By smoothing this function we can provide more fine-grained feedback to the GA, enabling more efficient optimization. One way to achieve this is to consider the individual near neighbors and their classes. Consider a 36 two-class knn classifier with k = 7, and a weight vector that results in mis- classification of a particular test sample. Further suppose that the test sample belongs to class two, but all the near neighbors of the test sample are of class one when scaled by the weight vector. If the weight vector undergoes mutation or crossover and the resulting set of near neighbors contains two members of class two, then the classifier is closer to a correct classification of this test sample, but the fitness value of the weight vector does not change because the sample is still misclassified. By adding a penalty for each incorrect near-neighbor “vote” to the cost function, we can reward an individual weight vector each time an additional vote is cast for the correct class. This additional feedback to the GA can guide the search toward new correct classifications, providing a more efficient search. Coefficients are associated with each term in the GA cost function that allow control of each run. 
The following cost function is computed by the knn classifier for each individual (consisting of a weight vector and k value): cost(u7, k) 2 Cam x Err(tii, k) + 0pm x nonzero(zi2') + Cum x incorrect-votes(zii, k) + Cm, x Bal(u‘i, k) (3.1) 37 Where Err is the error rate, as defined above; nonzero(tU) is the number of nonzero weights in the weight vector 13; incorrect-votes(u7, k) is the total number of incorrect near neighbor votes cast during classification with the weight vector 11'} and neighbor count 1:; and Bal is the balance function, also defined above. Additionally, Cm, Cmnl, Cw“, and CM are coefficients for each of these terms, respectively. The coefficients determine the relative contribution of each part of the fitness function in guiding the GA search for the Optimal weight vector and I: value. The values for the cost function coefficients were determined empirically in a set of initial experiments for each data set. Typical values for these coefficients are given in Table 3.1. Deviations from these values for particular GA runs or data sets will be specifically noted in subsequent discussion. Table 3.1: Typical values for the GA cost function coefficients. Coefficient Typical value Cm 20.0 CM, 1.0 Cvote 2-0 CW 10.0 Representation Issues and Masking The representation of the feature weights and I: value on the chromosome is fairly direct—a 32 bit integer is used to represent each weight value. The resulting gene lPars is short for parsimony, indicating the importance of the nonzero term in obtaining a minimal feature set. 38 the ehrr each seha Sine Valm \Elue featu zero, nneth Dorat the h The \ above Inask In the Each ff on the GA chromosome yields an unsigned value over the range [0, 232 — 1]. This value is then scaled by division to yield a real value over [0.0, 100.0]. The value of k is represented as a 6-bit unsigned integer, but is constrained to values between 1 and 50. For two-class problems, the I: value is set to k = (kchmm * 2) + 1, where kchmm is the k-value from the GA chromosome. This constrains the value of k to odd integers, eliminating the need for a tie-breaking scheme in the knn classifier. While the cost function encourages parsimony by penalizing a weight vector for each nonzero feature weight, a simple real-valued representation for the weights them- selves does not provide an easy means for the GA to reduce feature weights to zero. Since the GA mutation operator tends to produce a small change in a single weight value, numerous mutations of the same feature weight are often required to yield a value at or near zero. Several methods were tested to aid the search for a minimal feature set, including reducing weight values below a predefined threshold value to zero, and including a penalty term in the cost function for higher weight values. The method that proved most effective, however, was a hybrid representation that incor- porates both the GA feature selection technique of Siedlecki and Sklansky (1989) and the feature weighting techniques of Punch et at. (1993) and Kelly and Davis (1991). The weights and k value are represented directly on the chromosome, as described above. Additionally, a mask field is assigned to each feature. The contents of the mask field determine whether the feature is included in the classifier (see Figure 3.2). In the initial implementation, a Single mask bit was stored on the chromosome for each feature. If the value of this bit was 1, then the feature was weighted and included 39 in classification. 
If, on the other hand, the mask bit for a feature was set to 0, then the feature weight was treated effectively as zero, eliminating the feature from con- sideration by the classifier. Since the masking fields comprised a very Small section of the chromosome relative to the feature weights and I: value, the number of mask bits associated with each feature was later increased to five. This increase had the effect of increasing the probability that a single bit mutation in a random location would affect the masking region of the chromosome. The interpretation of multiple mask bits is a simple generalization of the single bit case. When the majority of the mask bit values for a field are 1, then the field is weighted and included in classification. Otherwise, the field weight is reduced to 0, removing the feature from consideration by the knn. The number of mask bits is always odd so there is no possibility of a tie. Figure 3.2 shows a typical GA chromosome for the hybrid GA / knn classifier with masking. [— 32-bit L— S-bit masking feature field weight 6-bit k-value Figure 3.2: An example of a GA / knn chromosome with masking for a 4-dimensional feature set. The use of a masking field on the chromosome allows a more efficient search for a minimal subset of features that provides good classification accuracy, while Simultaneously searching for the optimal weights for the non-masked features. A single mutation can allow a feature to be tentatively included or removed from a 40 specific feature set without the loss of the partially-optimized weight value for that feature. Without masking, a weight value must be reduced to zero (or below a fixed threshold) to remove a feature from the current feature set. If the feature is later reintroduced, its weight value must be re-optimized from scratch. GA Optimization Details Several GA engines were employed to search for the optimal set of feature weights and I: value, including the GAUCSD algorithm (Schraudolph and Grefenstette, 1992) and GALOPPS (Goodman, 1996). GA run parameters were determined empirically via a set of initial experiments. Table 3.2 summarizes the GA parameters for a typical run. Where Specific runs have parameters that differ from these values in the subsequent discussion, this fact will be explicitly noted. Most runs used fitness proportionate selection, which requires an explicit scaling of the fitness of each individual in the GA population prior to selection. The specific scaling method used has a significant impact on the selection pressure applied by the GA during the subsequent search. For runs described here, the sigma scaling method of the GAUCSD algorithm was used. Several Sigma scaling factors were tested, with the most common value being 3.0, as noted in the table. Further details on sigma scaling and the significance of the sigma scaling factor can be found in Schraudolph and Grefenstette (1992). 41 3.1 Evol and . 1991; lem. J Were 2 to tha Optimi- We a[ for each Table 3.2: Typical values for the GA run parameters. GA Run Parameter Typcial value Population size 200 Crossover rate 0.8 per individual Mutation rate 0.001 per bit Max number of generations 200 Sigma scaling factor 3.00 3.1.3 EP Optimization with the Knn Classifier Evolutionary programming was also tested as a method for optimizing feature weights and I: value for the knn classifier. An EP framework based on SGA-C (Smith et al., 1991; Goldberg, 1989) was adapted to suit the feature selection and extraction prob- lem. 
GA Optimization Details

Several GA engines were employed to search for the optimal set of feature weights and k value, including the GAUCSD algorithm (Schraudolph and Grefenstette, 1992) and GALOPPS (Goodman, 1996). GA run parameters were determined empirically via a set of initial experiments. Table 3.2 summarizes the GA parameters for a typical run. Where specific runs have parameters that differ from these values in the subsequent discussion, this fact will be explicitly noted. Most runs used fitness proportionate selection, which requires an explicit scaling of the fitness of each individual in the GA population prior to selection. The specific scaling method used has a significant impact on the selection pressure applied by the GA during the subsequent search. For the runs described here, the sigma scaling method of the GAUCSD algorithm was used. Several sigma scaling factors were tested, with the most common value being 3.0, as noted in the table. Further details on sigma scaling and the significance of the sigma scaling factor can be found in Schraudolph and Grefenstette (1992).

Table 3.2: Typical values for the GA run parameters.

GA Run Parameter            Typical value
Population size             200
Crossover rate              0.8 per individual
Mutation rate               0.001 per bit
Max number of generations   200
Sigma scaling factor        3.00

3.1.3 EP Optimization with the Knn Classifier

Evolutionary programming was also tested as a method for optimizing feature weights and the k value for the knn classifier. An EP framework based on SGA-C (Smith et al., 1991; Goldberg, 1989) was adapted to suit the feature selection and extraction problem. In order to facilitate efficient sampling of feature subsets, the usual EP operators were augmented with a recombination operator, and a hybrid representation similar to that of the masking GA/knn was used for the EP chromosome. As in the GA optimizer, the EP chromosome consisted of a single real-valued weight for each feature, an integer value for k, and a masking field consisting of one or more mask bits for each feature. Unlike the GA, however, the distinct regions of the EP chromosome were subjected to different operators. The real-valued feature weights and the k value underwent only Gaussian mutation, while the masking fields were modified by GA-style bitwise mutation and single-point crossover. Figure 3.3 illustrates the dichotomy of the chromosome used for EP/GA hybrid optimization. As mentioned in Section 2.3.2, EP is commonly employed for the tuning of real-valued function parameters, but lacks a standard mechanism for dealing with binary fields such as the masking region of the chromosome. The objective of the EP/GA hybrid approach was to explore the real-valued parameter optimization capability of the EP for tuning the feature weights, while exploiting the known ability of traditional GA operators to search binary alphabets (Holland, 1975; Goldberg, 1989) in optimizing the mask bits.

Figure 3.3: An example of an EP/GA hybrid chromosome with masking for a 4-dimensional feature set. The real-valued weights and k value undergo Gaussian mutation, while the masking fields undergo bitwise mutation and single-point crossover.

The EP framework employed, like most EP-based real-valued parameter optimizers, utilizes a mutation operator that varies each parameter according to a Gaussian distribution. The standard deviation for this distribution is an adaptive parameter in the sense that it is stored with the individual as a vector of σ-values, one for each parameter. Each parameter, x_i, is mutated to produce x'_i as follows:

$$x'_i = x_i + N(0, \sigma_i) \tag{3.2}$$

The standard deviations themselves are randomly perturbed according to a log-normal distribution to promote a diversity of individuals with differing optimization strategies within the population. Each generation, the vector of standard deviations for each individual is updated according to:

$$\sigma'_i = \sigma_i \cdot \exp\big(\tau' \cdot N(0,1) + \tau \cdot N_i(0,1)\big) \tag{3.3}$$

where N(μ, σ) is a single normally distributed random variable, and N_i(0, 1) is the ith element of a vector of standard normal deviates of a length equal to the number of parameters to mutate (Angeline, 1995). The values of τ' and τ are compile-time parameters of the EP engine. Here, these parameters are set according to guidelines suggested by Bäck and Schwefel (1996). The rest of the EP run parameters, like those of the GA, were tuned empirically. Table 3.3 shows the run parameters for a typical EP/knn experiment.

Table 3.3: Run parameters for a typical EP/knn experiment. d is the number of parameters to be optimized (i.e., the dimensionality of the problem).

EP Run Parameter            Typical value
Population size             200
Mask field crossover rate   0.8 per individual
Mask field mutation rate    0.001 per bit
Max number of generations   200
τ'                          (√(2d))⁻¹
τ                           (√(2√d))⁻¹
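The self-adaptive mutation of Equations 3.2 and 3.3 can be sketched as follows. This is a generic illustration of the lognormal self-adaptation scheme under the Bäck and Schwefel (1996) settings listed in Table 3.3, not the SGA-C-based code used in this work.

import math
import random

def ep_mutate(x, sigma):
    """Self-adaptive Gaussian mutation (Equations 3.2 and 3.3).

    x     -- list of real-valued parameters (e.g., feature weights and k)
    sigma -- list of per-parameter standard deviations, same length as x
    """
    d = len(x)
    tau_prime = 1.0 / math.sqrt(2.0 * d)          # global learning rate
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(d))     # per-parameter learning rate
    global_step = random.gauss(0.0, 1.0)          # shared N(0,1) draw for this individual

    # Equation 3.3: lognormal perturbation of the standard deviations.
    new_sigma = [s * math.exp(tau_prime * global_step + tau * random.gauss(0.0, 1.0))
                 for s in sigma]
    # Equation 3.2: Gaussian mutation of each parameter with its updated sigma.
    new_x = [xi + random.gauss(0.0, ns) for xi, ns in zip(x, new_sigma)]
    return new_x, new_sigma

# Example: mutate a 4-feature weight vector.
weights, sigmas = ep_mutate([3.2, 0.0, 7.5, 1.1], [1.0, 1.0, 1.0, 1.0])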
The na'ive Bayes classifier, described in Sec- 44 tion 2.1.2, is nonparametric in the sense that the class conditional distributions Of the feature values are estimated by constructing a histogram for each feature based on the training data. The bin Size Of this histogram is a run-time parameter Of the classifier. Unless otherwise noted, 20 bins for each feature were used for the experi- ments described here. In addition, a Gaussian smoothing factor is applied in order to mitigate sampling anomalies that might introduce classification bias. Given a feature value, 23,-, for feature 2', and a class wj, then let bwj(x,) be the bin that x,- occupies in the histogram for class Wj. When the Gaussian smoothing is applied, the effec- tive marginal probability p(:c,-|wj) depends on the histogram value Of bin by, (23,-), as well as the histogram values Of neighboring bins. Let hwj(bwj(a:,-)) be the histogram value for bin bwj (z,)——that is, the proportion Of the training samples Of class u, that have values for feature i that belong in the bin bu]. (93,-)—then the effective marginal probability for feature value x,- is: +0 pence) = 2 (Gas, 0) x h..,(b.,,.(x.-) + a) (3.4) k=—a where G (k, a) is the mass density function for the Gaussian distribution at u = 0.0, with variance 02: G(k, a) = e-%(%)2 (3.5) The value Of 0, also a run-time parameter, determines the number Of bins that will contribute to each effective marginal probability value. Figure 3.4 illustrates the effect of Gaussian smoothing on the effective marginal probability for a particular 45 feature value. Proportion Of training samples J Hllfl Feature value Hflflfl Figure 3.4: Effects of Gaussian smoothing on the computation of effective marginal probabilities. Assuming that the current feature value falls in the center bin (black rectangle), and assuming 0‘ = 2, then the two surrounding bins on either side (grey rectangles) also contribute to the effective marginal probability for the current feature. The unweighted knn classifier, also employed for comparison with the hybrid clas- sification results, uses a Euclidean distance metric to find nearest neighbors. When there is a tie in determining the most frequent class among the near neighbors, the class of the nearest neighbor is used as a tie breaker. 3.1.5 Data Sets Artificially Generated Data The GA/knn and EP/knn hybrid classifiers were trained and tested against a variety Of artificially constructed data sets to evaluate their error rates and feature selection 46 and extraction capability under a variety Of classification conditions. These data sets were constructed with a varying number of features Of the following types: Univariate Gaussian: 001,0) — A random Gaussian deviate selected from the distribution with mean u and variance 02. Uniform: U (min, max) — A random real value selected from a uniform random distribution over the range [min, max]. Uniform integer: U I (min, max) — A random integer selected from a uniform random distribution over the range [min, max]. Computed fields — Field values can be computed from the results of other field values in combination with basic arithmetic Operations and/or any Of the pre- viously mentioned types Of random variables. For example, a data field might consist Of twice the value of the first data field, plus a uniform random value from 1 to 10: 2 * fl + U(1,10). 
Based on these basic types of field values, a number of two-class data sets were generated for use in testing various aspects of the GA- and EP-based feature selection and extraction methods in combination with the knn classifier. The specific data sets generated include:

Name: G10,1
Number of features: 10
Class 1 specifier: G(5,2), G(5,2), ..., G(5,2)
Class 2 specifier: G(10,2), G(10,2), ..., G(10,2)
This simple 10-dimensional data set consists of independent Gaussian-distributed features. The two classes are generally linearly separable, and weighting of the features is not likely to improve classification accuracy. This data set serves as a baseline for classification performance.

Name: G10,2
Number of features: 10
Class 1 specifier: G(7,2), G(7,2), ..., G(7,2)
Class 2 specifier: G(10,2), G(10,2), ..., G(10,2)
This data set is similar in construction to G10,1, but the means for the two classes are closer, and thus the data is slightly more overlapped. See Figure 3.5 for a comparison between G10,1 and G10,2.

Name: G5
Number of features: 5
Class 1 specifier: G(5,2), G(5,2), G(5,2), G(5,2), G(5,2)
Class 2 specifier: G(10,2), G(9,2), G(8,2), G(7,2), G(6,2)
This 5-dimensional data set again consists of independent features selected from Gaussian distributions. However, the distance between the class means is reduced by one for each feature after the first, so feature weighting is more likely to be useful for this data set. Furthermore, since inappropriate feature selection should now prove detrimental to classification accuracy, this is the first data set that will provide a test of the feature selection capability of each method.

Name: G6
Number of features: 6
Class 1 specifier: G(5,2), G(5,2), G(5,2), G(5,2), G(5,2), G(5,2)
Class 2 specifier: G(10,2), G(9,2), G(8,2), G(7,2), G(6,2), G(5,2)
This data set is identical to G5, but an additional feature, feature 6, is included. The mean and variance of feature 6 are identical for both classes. This feature should provide no useful classification information and is included to further test the feature selection capability of each method.

Name: M6
Number of features: 6
Class 1 specifier: G(5,2), G(2,1), G(5,3), G(2,2), U(1,5), UI(1,5)
Class 2 specifier: G(7,2), G(3,1), G(7,3), G(3,2), U(3,7), UI(3,7)
This is a mixed data set consisting of features drawn from overlapping Gaussian and uniform distributions. Various levels of overlap between classes are exhibited by the different features. Both continuous and discrete uniformly-distributed features are included.

Name: N5
Number of features: 5
Class 1 specifier: G(5,2), G(5,2)+U(1,2), G(5,2)+U(1,3), G(5,2)+U(1,4), G(5,2)+U(1,5)
Class 2 specifier: G(7,2), G(7,2)+U(1,2), G(7,2)+U(1,3), G(7,2)+U(1,4), G(7,2)+U(1,5)
This data set is designed to evaluate the ability of each method to deal with increasing noise in each feature. The first feature consists of a Gaussian random deviate with moderately overlapped distribution between classes. Each subsequent feature is a similar Gaussian deviate, but with an increasing level of uniform random noise added.
Name: N10
Number of features: 10
Class 1 specifier: G(5,2), G(5,3), G(5,4), G(5,5), G(5,6), G(5,6), G(5,6), G(5,6), G(5,6), G(5,6)
Class 2 specifier: G(7,2), G(7,3), G(7,4), G(7,5), G(7,6), G(7,6), G(7,6), G(7,6), G(7,6), G(7,6)
Another data set designed to test the ability of each method to deal with noisy features, this set includes a large number of features with extremely overlapped distributions between the two classes.

Medical, Biochemical and Other Data

Several data sets were selected from the UCI machine learning repository (Blake and Merz, 1998) in order to test the capability of the EC/knn classifier methods on real-world problems, and to facilitate comparison with other EC-based hybrid classification methods. The specific data sets tested were chosen to allow comparison with the GADistAI technique of Yang and Honavar (1998) and the EC feature construction and selection methods of Vafaie and De Jong (1998), and included the following:

Figure 3.5: A two-dimensional projection of the data sets G10,1 (a) and G10,2 (b) onto the first two feature axes. G10,2 exhibits much more overlap between classes than G10,1.

Hepatitis — This data consists of 19 descriptive and clinical test result values for 155 hepatitis patients (Diaconis and Efron, 1983; Cestnik et al., 1987). The two classes, survivors and patients for whom the hepatitis proved terminal, are strongly unbalanced—123 samples belong to the survivor class while 32 belong to the terminal class. The data includes qualitative, as well as both continuous and discrete-valued quantitative, features. There are missing values, the number of which varies largely by feature. Many features have no missing values, while others have as many as 67 missing values out of 155 samples. The small sample size and incompleteness of this data set are typical of many medical classification problems.

Pima — Diabetes diagnosis information for native American women of the Pima heritage, aged 21 or over (Smith et al., 1988). This data consists of diagnostic information for 768 women; 268 of these patients tested positive for diabetes, while 500 tested negative. Six of the eight features are quantitative and continuous, consisting of various clinical test results. The remaining two features, age in years and number of times pregnant, are quantitative and discrete. There are no missing feature values in the data. The completeness and moderate dimensionality of this data set make it suitable for testing the ability of a classifier and feature extractor to maintain or increase classification accuracy while reducing dimensionality when there are fewer features to work with.

Wine — This data set consists of the results of a chemical analysis of wines derived from three different cultivars (Aeberhard et al., 1992a,b). There are 13 continuous features, with no missing values. There are 59, 71, and 48 members of the three classes, respectively. The three classes are nearly linearly separable, and linear discriminant analysis can obtain 98.9% accuracy over all three classes. This data set is thus better for evaluating feature selection capability than classifier accuracy.

Ionosphere — The 34 continuous features in this data set are derived from the signals read by a phased array of 16 high-frequency antennas in Goose Bay, Labrador (Sigillito et al., 1989). These radar signals are designed to recognize structure in the ionosphere.
Each reading consists of 17 pulses, with two attributes per pulse, resulting in 34 features. There are 351 samples in this data set—225 are considered "good" readings, for which some structure in the ionosphere was detected, while 126 readings showed no structure. There are no missing feature values in this data. This data set was selected for evaluation of feature selection capability for higher-dimensionality data sets.

Two additional data sets, also selected from the UCI repository, were employed by Weiss and Kapouleas (1989, 1990) in a comparative study of classification methods from statistical pattern recognition, neural networks, and machine learning. These two medical data sets, thyroid and appendicitis, are included here to facilitate comparison with these results. The thyroid data consists of 21 clinical test results for a set of patients tested for thyroid dysfunction (Quinlan et al., 1986)—15 of these features are binary-valued, while the other 6 are continuous. The training data consist of 3772 cases from the year 1985, while the testing data consist of 3428 cases from the following year. The data are grouped into two classes, consisting of the patients that were/were not diagnosed as having certain categories of hypothyroid disorder. The two classes are highly unbalanced: the training data consist of 3487 negative diagnoses and 284 positive, while the testing data consist of 3177 negative samples and 250 positive. The appendicitis data consists of seven laboratory tests to confirm the diagnosis of acute appendicitis (Marchand et al., 1983). All seven features are continuous. This data set consists of only 106 samples in two classes: 85 patients had confirmed appendicitis while 21 did not.

In addition, extensive experimentation was conducted on a large biochemical data set dealing with the binding of water molecules to protein surfaces. The features and biological significance of this data, as well as the classification and feature selection results for the GA/knn, EP/knn and other classifiers and feature selection and extraction methods, are detailed in Chapter 5—Identifying the Determinants of Solvent Binding in Proteins.

3.1.6 Testing and Error Rate Estimation

Various techniques were used to estimate the error rate of the trained EC/knn hybrid classifier. For data sets from the UCI machine learning data repository (i.e., the hepatitis, Pima, wine, and ionosphere data sets) the testing methods were similar to those of Yang and Honavar (1998) in order to facilitate comparison with these results. Several minor differences were introduced to avoid problems with overfitting small training and testing sets. Two types of experiments were conducted on the UCI data sets, depending on the number of training and testing samples available for each class in the data set.

Where ample training and testing data (at least 100 samples) were available for each class, hold-out tuning experiments were conducted as follows. First, the available data are partitioned into three sets—training, tuning, and testing—with an equal number of samples of each class in each set. Thus, the size of these three sets is limited by the number of samples in the least numerous class. The first set (training) is used to populate the feature space for the knn classifier. Next, the EC is employed to optimize the k-value and feature weights. As previously described in Section 3.1.2, the EC cost function is based on classification accuracy for a set of samples of known class.
The second set of samples (tuning) is used for this purpose. Finally, after the k-value and feature weights have been optimized, the resulting classifier is evaluated using the final data set (testing) to obtain an estimate of the error rate. This estimate is then averaged over 5 independent runs of the EC using the same partitioning into training, tuning, and testing sets. Finally, the entire process is repeated 10 times with 10 different partitionings of the data, resulting in an error rate estimate averaged over 50 EC experiments.

When the number of samples for one or more classes is extremely limited (less than 100 samples), this procedure is modified slightly. In this case, the data are partitioned into only two sets (training and testing), and leave-one-out tuning experiments are conducted. In these experiments, the training and tuning sets are identical. When evaluating the EC cost function, each sample is temporarily removed from the training set and evaluated as a tuning set of size one. The cost function is then computed based on the average classification accuracy over all the training samples. After EC optimization of the feature weights and k-value, the error rate of the resulting classifier is estimated using the testing set, exactly as in the hold-out tuning experiments. Again, 5 independent EC runs are performed for each of 10 partitionings of the data, and the error estimates are averaged over all 50 experiments.

This method was simplified for the artificially generated data sets, since the number of potential independent samples is infinite for this data. Here, experiments were conducted for each classifier using large, independent training, tuning, and testing sets (1000 samples each). For GA and EP experiments, 5 independent experiments were conducted and the accuracy for the best result is reported.
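A minimal sketch of the hold-out partitioning step is given below. It stratifies by class and caps each split at the size of the least numerous class; the function name and the use of NumPy are illustrative choices, not the dissertation's code.

import numpy as np

def stratified_three_way_split(X, y, rng):
    """Partition data into training, tuning, and testing sets with an equal
    number of samples from each class in each set (illustrative sketch)."""
    classes = np.unique(y)
    per_set = min(np.sum(y == c) for c in classes) // 3   # limited by the rarest class
    train_idx, tune_idx, test_idx = [], [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(y == c))
        train_idx += list(idx[:per_set])
        tune_idx += list(idx[per_set:2 * per_set])
        test_idx += list(idx[2 * per_set:3 * per_set])
    return ((X[train_idx], y[train_idx]),
            (X[tune_idx], y[tune_idx]),
            (X[test_idx], y[test_idx]))

# The full protocol repeats this partitioning 10 times and runs the EC 5 times
# per partitioning, averaging the test accuracy over all 50 experiments.
rng = np.random.default_rng(0)
X, y = rng.random((300, 8)), rng.integers(0, 2, 300)
(train, tune, test) = stratified_three_way_split(X, y, rng)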
Bootstrap Testing

For other data sets, including the appendicitis, thyroid, and protein-water binding data, a variant of the bootstrap test method (Jain et al., 1987; Efron, 1979, 1982) is employed in order to obtain both an error rate estimate and a simple measure of the variance of this estimate. Traditional bootstrap measures require that a classifier be retrained and tested on a number of bootstrap testing sets selected with replacement from the available data (Jain et al., 1987). Since the training of the hybrid EC/knn classifier is an iterative, computationally-intensive, offline process, a modified bootstrap method was employed to obtain an error estimate for the trained and tuned hybrid classifier.

As with the previously described methods, the available data are partitioned into three sets: training, tuning, and testing. Within each of these sets, the number of samples of each class is equal. The training and tuning sets are used during each EC experiment in the same manner described above: the training set is used to populate the knn feature space, while the tuning set is used to evaluate a particular set of feature weights and k-value in terms of classification error, providing feedback to the EC optimizer. Once the feature weights and k-value are optimized, the testing set is randomly sampled with replacement to form n bootstrap test sets. The accuracy of the classifier (or any other performance measure to be estimated) is evaluated on each of these bootstrap test sets using the final weight set and k value produced by the EC experiment. Finally, the mean and standard deviation of the error rates for all n external test sets are computed and used as an estimator of the true error rate for the optimized classifier.

During the bootstrap tests, the same data set is used for training the knn classifier as during the EC experiment. Since the feature weights, k value, and training data cooperatively define the decision boundary for the final classifier, it would place an undue negative bias on the error rate estimation procedure to test the final classifier using a knn training set other than the one used during the EC experiment that produced that classifier. Figure 3.6 shows a schematic of the data sets used for training, tuning, and testing during a single EC experiment using the bootstrap-based testing method.

Figure 3.6: Division of training, testing, and tuning data during a single EC experiment (a), and error rate estimation of the tuned classifier (b). The cost score in (a) is based on the performance of the knn on the tuning set for a particular weight set and k value. The same knn training data is used for the entire EC experiment and during external testing.
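The modified bootstrap estimate can likewise be sketched in a few lines. The code below resamples the held-out test set with replacement and reports the mean and standard deviation of the accuracy of an already-tuned classifier; the evaluate callable stands in for the weighted-knn evaluation and is an assumed input, not part of the original software.

import numpy as np

def bootstrap_error_estimate(evaluate, test_X, test_y, n_sets=100, rng=None):
    """Estimate accuracy of a tuned classifier on n bootstrap test sets.

    evaluate -- callable(test_X, test_y) -> accuracy of the final, tuned classifier
    test_X, test_y -- NumPy arrays holding the held-out test set
    """
    rng = rng or np.random.default_rng()
    accuracies = []
    for _ in range(n_sets):
        idx = rng.integers(0, len(test_y), size=len(test_y))   # sample with replacement
        accuracies.append(evaluate(test_X[idx], test_y[idx]))
    return float(np.mean(accuracies)), float(np.std(accuracies))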
3.2 Results

3.2.1 Classification of Artificial Data Sets

Independent Gaussian Features

Three of the artificially generated data sets were constructed entirely of features drawn from independent Gaussian distributions: G10,1, G10,2, and G5. The set G10,1 is linearly separable, and a naive Bayes classifier is able to achieve 100% prediction accuracy using all 10 features. The set G10,2 is not as easily classified, but like G10,1, the features are all drawn from identical independent distributions, so there is no feature weighting that can obtain better performance than that of the unweighted classifiers. Like G10,1 and G10,2, the data set G5 is not difficult to classify—a naive Bayes classifier can discriminate between the two classes with ~96% accuracy—but the features are not drawn from identical distributions, so feature weighting can potentially have a positive effect on classification accuracy. For these data, the primary goal is to reduce the number of features used, while reducing classification accuracy as little as possible. For comparison, a naive Bayes classifier was trained and tested on each data set, as well as a knn classifier using odd values for k from 1 to 100. As described in Section 3.1.6, 5 independent experiments were conducted for each EC-based method, and the best k value and weight set were then tested on independent data. These results are summarized in Table 3.4.

Table 3.4: Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC/knn hybrid classifiers for artificially generated data sets consisting of independent, Gaussian-distributed features. For the Bayes classifier, accuracy is shown for reclassifying the training data (Train) and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set (Tune) and the corresponding k value are reported. The best k value was then evaluated by reclassifying the training data (Train), and by classifying a new, independent test set (Test). For the EC/knn hybrid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The k value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided, along with the number of non-zero feature weights (Features).

G10,1      k    Train   Tune   Test   Features
Bayes      —    100     —      100    10
Knn        3    100     100    100    10
GA/knn     3    98.9    100    99.4   4
EP/knn     3    100     100    100    4

G10,2      k    Train   Tune   Test   Features
Bayes      —    92.9    —      91.5   10
Knn        67   94.1    94.2   93.3   10
GA/knn     25   94.2    94.8   93.1   10
EP/knn     19   94.6    93.8   93.1   10

G5         k    Train   Tune   Test   Features
Bayes      —    96.0    —      95.6   5
Knn        79   95.8    97.7   96.5   5
GA/knn     33   94.7    97.2   95.7   3
EP/knn     35   95.8    97.2   96.2   3

The results for G10,1 and G10,2 verify the ability of the GA and the EP to perform feature selection. Since all the features in these two data sets are drawn from identical distributions, the feature weights produced by the hybrid classifiers are valueless. For G10,1, the class-conditional distributions of feature values are well enough separated that 6 of the 10 features can be removed without a significant adverse effect on classification accuracy. For G10,2, the EC optimizers were unable to remove any features while maintaining maximal classification performance.

Results for G5 are slightly more interesting. Since the distributions for the two classes have different degrees of overlap for each feature, the quality of the feature weighting provided by the EC optimizers should have a bearing on the classification accuracy. For this data, both the GA and the EP were able to remove 2 of 5 features while maintaining classification accuracies near to that of the knn using all 5 features. In all cases, the EP/knn was able to obtain slightly better overall accuracy than the GA/knn, probably due to the EP's ability to fine-tune the final feature weights during the later phases of the optimization run.

Mixed Multivariate Data

Several of the artificially constructed data sets contained a mixture of Gaussian and uniformly distributed random feature values. As described in Section 3.1.5, M6 consists of Gaussian deviates, discrete uniform random values, and continuous uniform random values. This data set provides a test of the hybrid classifiers' ability to maintain recognition accuracy through feature weighting while reducing the number of features to consider. The set N5 consists of five features with increasing levels of random noise, and is designed to test the feature selection capability of the two EC methods, and the effect of feature selection on classification accuracy. The test methods for these data sets were identical to those described for the previous artificially constructed data sets. The results are summarized in Table 3.5.
Table 3.5: Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC/knn hybrid classifiers for artificially generated data sets consisting of features with a mixture of random distributions. For the Bayes classifier, accuracy is shown for re-classifying the training data (Train), and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set (Tune) and the corresponding k value are reported. The best k value was then evaluated by reclassifying the training data (Train), and by classifying a new, independent test set (Test). For the EC/knn hybrid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The k value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features).

M6        k    Train   Tune    Test    Features
Bayes     -    89.5    -       87.7    6
Knn       29   91.6    92.4    91.2    6
GA/knn    19   92.4    93.6    90.8    5
EP/knn    25   90.8    92.2    89.0    4

N5        k    Train   Tune    Test    Features
Bayes     -    94.4    -       81.6    5
Knn       9    86.5    86.8    82.8    5
GA/knn    11   85.1    86.2    80.9    4
EP/knn    7    85.6    85.8    79.3    4

Again, the hybrid methods were able to reduce the feature set size by one or more features with a minimal reduction in classification accuracy for these data sets. For all data sets discussed thus far, the EP and GA classifiers have produced similar values for k (having a difference of 6 or less).

3.2.2 Classification of Data from the UCI Repository

Four data sets (pima, wine, ionosphere, and hepatitis) were selected for comparison with the GADistAI iterative feature selection and construction method of Yang and Honavar (1998). As described previously, the testing and error rate estimation for these data sets were performed in a manner similar to that of Yang and Honavar to facilitate direct comparison. Classification and feature selection results for this data, averaged over 50 EC/knn experiments, are summarized in Table 3.6.

Unfortunately, the error rates reported for GADistAI were obtained using the same data set used by the GA to tune feature subsets and weights. This is essentially equivalent to the "Train/Tune" accuracy provided in Table 3.6 for the EC-hybrid classifiers. Since no external or bootstrap testing was done, it remains to be seen how the GADistAI algorithm generalizes to new data, and no comparison to the "Test" accuracy obtained by the EC/knn hybrid classifiers can be made.

In all four data sets the bootstrap test accuracy of the EC-hybrid classifiers is comparable to the best accuracy achieved by any classifier. For wine and pima, the best test results were, in fact, obtained by the EC/knn hybrids. Additionally, both hybrid classifiers were able to achieve a significant reduction in the number of features considered for all four data sets. The GA/knn hybrid classifier used the lowest mean number of features in classification of all the feature selection and classification methods presented. For the four data sets examined here, the GA/knn consistently reduced the number of features considered to roughly one-half the original number of features.

Table 3.6: Results of the hybrid EC/knn classifiers and the GADistAI algorithm on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. Train/Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. Features is the number of features with nonzero weights in the best performing weight set for each run; the mean value over all 50 runs is shown.
Hepatitis    Train/Tune   Test    Features
Bayes        85.3         65.7    19
Knn          87.1         73.4    19
GADistAI     97.1         -       9.2
GA/knn       86.0         69.6    8.1
EP/knn       87.2         73.1    8.9

Wine         Train/Tune   Test    Features
Bayes        98.8         94.7    13
Knn          94.9         94.3    13
GADistAI     99.4         -       6.7
GA/knn       99.7         94.8    6.0
EP/knn       99.5         93.2    6.2

Ionosphere   Train/Tune   Test    Features
Bayes        93.0         90.1    34
Knn          83.4         93.2    34
GADistAI     98.6         -       17.3
GA/knn       95.0         91.9    8.5
EP/knn       93.2         92.3    13.5

Pima         Train/Tune   Test    Features
Bayes        76.1         64.6    8
Knn          73.5         71.5    8
GADistAI     79.5         -       3.8
GA/knn       80.0         72.1    3.1
EP/knn       79.1         72.9    3.9

Another important factor to consider when comparing the results of the EC/knn hybrid classifiers to classical methods and other feature selection and classification techniques is the balance between the prediction accuracies among the various classes. That is, depending on the class distribution of the training data, some classifiers can perform very well on some classes at the expense of others. The cost function that drives the optimization process for the EC/knn hybrid classifiers penalizes for disparity in the prediction accuracies among classes. In some cases, this will cause the EC to reject weight sets that lead to higher overall prediction accuracy in favor of those that lead to a better balance in prediction accuracy among classes.

Table 3.7 summarizes the balance in predictive accuracy, as defined in Section 3.1.2, of the various classifiers on the four UCI data sets. The EC/knn hybrid classifiers obtained the best balance in all cases when reclassifying the GA tuning data. For the Hepatitis and Pima data sets, the GA/knn classifier was able to achieve significantly better balance than the traditional classifiers. The drop in balance between the tuning data and the bootstrap testing data for the Wine data set may indicate that some overfitting of the tuning data has occurred. No information regarding balance in predictive accuracy was provided for GADistAI.
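As a concrete illustration of how the balance values reported in Table 3.7 can be computed, the short sketch below derives per-class accuracies and their spread from true and predicted labels. It is an illustrative reimplementation under the definition used here (balance is the difference between the best- and worst-predicted classes), not the code used for these experiments.

```python
import numpy as np

def class_accuracies(y_true, y_pred):
    """Per-class predictive accuracy: the fraction of samples of each class
    that receive the correct label.  y_true and y_pred are NumPy arrays."""
    return {c: np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)}

def balance(y_true, y_pred):
    """Difference between the maximum and minimum per-class accuracies
    (the Bal column of Table 3.7); smaller values are better."""
    acc = class_accuracies(y_true, y_pred)
    return max(acc.values()) - min(acc.values())
```

A cost function that adds a penalty proportional to this balance value is one way to steer the EC away from weight sets that favor one class at the expense of another.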
Table 3.7: Balance in predictive accuracy among classes for various classifiers. The maximum accuracy (Max) and the minimum accuracy (Min) among the various classes are shown for each classifier, along with the difference between the two values (Bal). Lower values for Bal, indicating a smaller difference between the minimum and maximum predictive accuracies, are preferred. Train/Tune results refer to accuracies for reclassifying the training data for the Bayes and Knn classifiers, and for reclassifying the EC tuning data set for the GA/knn and EP/knn classifiers. Test refers to the accuracy when classifying an independent test set. The knn classifier was tested for odd values of k from 1 to 101, and the best results are shown.

                 Train/Tune              Test
Hepatitis    Min    Max    Bal       Min    Max    Bal
KNN          34.5   97.6   63.1      30.8   95.7   64.8
Bayes        17.5   98.8   81.3      6.7    96.5   89.9
GA/knn       67.3   89.7   22.5      45.4   82.2   36.8
EP/knn       81.9   88.3   6.4       48.7   85.9   37.2

                 Train/Tune              Test
Pima         Min    Max    Bal       Min    Max    Bal
KNN          68.5   78.4   9.9       62.8   73.4   10.6
Bayes        68.1   84.1   16.0      62.0   75.7   13.7
GA/knn       79.8   80.1   0.3       72.1   72.4   0.3
EP/knn       78.9   79.3   0.4       72.5   74.6   2.2

                 Train/Tune              Test
Ionosphere   Min    Max    Bal       Min    Max    Bal
KNN          68.8   97.9   29.1      72.3   97.5   25.2
Bayes        92.4   93.6   1.2       88.6   96.9   8.3
GA/knn       94.6   95.4   0.8       80.7   94.2   13.5
EP/knn       91.5   94.9   3.4       83.6   94.1   10.5

                 Train/Tune              Test
Wine         Min    Max    Bal       Min    Max    Bal
KNN          84.8   100.0  15.3      89.4   100.0  10.6
Bayes        97.8   98.8   1.0       93.2   94.8   1.7
GA/knn       99.3   100.0  0.7       90.6   99.4   8.8
EP/knn       99.1   99.8   0.7       88.3   98.6   10.4

3.2.3 Classification of Medical Data

The thyroid and appendicitis data sets were selected for comparison with a broader study encompassing several traditional classification methods (Weiss and Kapouleas, 1989, 1990), including the linear and quadratic discriminant functions, the 1-nearest-neighbor classifier, a neural network classifier trained using the backpropagation method, the CART tree classifier (Breiman et al., 1984), the naive Bayes classifier, and a 2nd-order Bayes classifier. Error rates for these data sets were evaluated using the bootstrap testing method described in Section 3.1.6. Five independent experiments were conducted, and the best weight set and k value for each experiment were subjected to 100 bootstrap tests. Weiss and Kapouleas report two accuracy estimates for each classifier. The first, training accuracy, is the accuracy obtained when reclassifying the training samples. Since training of the EC/knn hybrid classifiers is a two-step process, Weiss' training accuracy is best compared to the accuracy obtained by the hybrid classifier when reclassifying the tuning set. Weiss also reports a less biased estimate of accuracy based on an independent testing set; this measure is best compared to the bootstrap accuracy obtained by the hybrid classifiers. Tables 3.8 and 3.9 compare the results reported by Weiss and Kapouleas with those of the GA/knn hybrid classifier on the thyroid and appendicitis data sets, respectively.

Table 3.8: Results of various classifiers on hypothyroid data, as reported by Weiss, in comparison with that of the EC/knn hybrid classifiers. Train/tune refers to the accuracy obtained when reclassifying the training data, in the case of Weiss' results, or the tuning data, in the case of the EC/knn hybrid classifiers. Testing refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers.

Method                    Accuracy (train/tune)   Accuracy (testing)
GA/knn                    98.5%                   98.4%
Linear Discriminant       93.8%                   93.8%
Quadratic Discriminant    89.7%                   88.4%
Nearest Neighbor          100%                    95.3%
Bayes (independent)       97.1%                   96.1%
Bayes (2nd order)         97.7%                   92.4%
Neural Net (Back prop)    99.5%                   98.5%
Predictive Value Max.     99.8%                   99.3%
CART Tree                 99.8%                   99.4%
Table 3.9: Results of various classifiers on the appendicitis data, as reported by Weiss, in comparison with that of the EC/knn hybrid classifiers. As in the previous table, train/tune refers to the accuracy obtained when reclassifying the training data, in the case of Weiss' results, or the tuning data, in the case of the EC/knn hybrid classifiers. Testing again refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers.

Method                    Accuracy (train/tune)   Accuracy (testing)
GA/knn                    90.4%                   90.6%
Linear Discriminant       88.7%                   86.8%
Quadratic Discriminant    79.3%                   73.6%
Nearest Neighbor          100%                    82.1%
Bayes (independent)       88.7%                   83.0%
Bayes (2nd order)         95.3%                   81.1%
Neural Net (Back prop)    90.0%                   85.8%
Predictive Value Max.     91.5%                   89.6%
CART Tree                 90.0%                   84.9%

Comparison with SFFS

The sequential floating forward selection (SFFS) algorithm of Pudil et al. (1994) has been shown to be a powerful technique for feature subset selection (Jain and Zongker, 1997). For comparison with the EC feature selection and extraction methods, SFFS was tested in conjunction with a knn classifier using odd values of k between 1 and 101 for the two medical data sets.

For the thyroid data, the sequential floating forward selection method achieved good classification results. The best accuracy obtained by the SFFS/knn algorithm during feature selection was 97.99%, using 6 of the 21 available features. A mean bootstrap accuracy of 98.06%, with a standard deviation of 0.6032%, was obtained over 100 bootstrap tests on this feature set. This accuracy is similar to those obtained by the various methods reported by Weiss. The GA feature extractor combined with a knn classifier obtained a similar accuracy, 98.48%, using only 3 of the available 21 features and a k value of 87. One hundred bootstrap tests for this set of feature weights yielded a mean bootstrap accuracy of 98.40%, with a standard deviation of 0.6256%.

For the appendicitis data, the best result was obtained for k = 7. The best predictive accuracy during selection was 88.46% using 3 of the 7 available features. Bootstrap testing for 100 trials using the best feature set found by SFFS yielded a mean predictive accuracy of 91.44% with a standard deviation of 3.94%. The GA feature extractor achieved a slightly higher accuracy than SFFS during extraction: 90.38% using 2 of 7 weighted features and k = 7. In bootstrap testing, however, the mean bootstrap accuracy over 100 trials proved to be similar to that of SFFS: 90.60% with a standard deviation of 4.21%.
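For readers unfamiliar with SFFS, the following sketch outlines the floating search loop of Pudil et al. (1994) as a wrapper around a subset-quality criterion such as the tuning-set accuracy of a knn restricted to the candidate features. It is a simplified reading of the algorithm; the function and variable names are assumptions, and it assumes d_max does not exceed the number of available features.

```python
import numpy as np

def sffs(n_features, criterion, d_max):
    """Sequential floating forward selection (after Pudil et al., 1994).
    criterion(list_of_feature_indices) -> score to maximize, e.g. the
    tuning-set accuracy of a knn using only those feature columns.
    Returns the best subset recorded at each size up to d_max."""
    best = {}                # best (score, subset) seen for each subset size
    subset = []
    while len(subset) < d_max:
        # Inclusion step: add the single most helpful remaining feature.
        remaining = [f for f in range(n_features) if f not in subset]
        score, f_add = max((criterion(subset + [f]), f) for f in remaining)
        subset.append(f_add)
        if score > best.get(len(subset), (-np.inf,))[0]:
            best[len(subset)] = (score, list(subset))
        # Conditional exclusion: drop features as long as doing so strictly
        # improves on the best subset previously recorded at the smaller size.
        while len(subset) > 2:
            d_score, f_drop = max(
                (criterion([g for g in subset if g != f]), f) for f in subset)
            if d_score > best.get(len(subset) - 1, (-np.inf,))[0]:
                subset.remove(f_drop)
                best[len(subset)] = (d_score, list(subset))
            else:
                break
    return best   # e.g. max(best.values()) gives the best subset found overall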
3.2.4 Classification of Protein Solvation Sites

The most extensive set of experiments was conducted on a sizeable set of biochemical data. These data, comprising the physical and chemical environments of 5542 protein-bound water molecules, were examined in light of two distinct objectives. The first goal of these experiments was to obtain good pattern recognition performance on various classes of water molecules, while the second objective was to identify the specific features which were consistently associated with membership in each class. The methods and results for these experiments are discussed in detail in Chapter 5, Identifying the Determinants of Solvent Binding in Proteins.

3.3 Discussion

In all of the data sets examined, the EC/knn hybrid classifiers exhibited an ability to reduce the number of features considered by the classifier significantly, while maintaining or slightly improving predictive accuracy, as compared to a non-weighted knn classifier using all available features. The overall predictive accuracy of the EC/knn classifiers was comparable to that of the best classifier tested for each data set examined, and in many cases the best accuracy was achieved by one of the EC/knn hybrids.

In terms of feature selection, the EC feature selection and extraction techniques equaled or exceeded the ability of other methods examined to reduce the number of features considered by the classifier, while maintaining similar or improved classification accuracy. For the medical data sets, the EC methods were able to meet or exceed the accuracy obtained using a knn classifier in conjunction with SFFS (Pudil et al., 1994), which is generally considered the most effective of the commonly-used feature selection methods (Jain and Zongker, 1997), while further reducing the number of features considered.

In addition, the parameterized nature of the cost function which drives EC optimization allows the EC methods to consider factors other than the apparent error rate in tuning the feature set and weights. One such factor is the balance in predictive accuracy among classes. On many data sets where the classical methods result in strong disparity of predictive accuracy among classes, the EC-hybrid methods are able to obtain better balance while maintaining overall accuracy.

Chapter 4

Variations on the Bayes Classifier using Evolutionary Computation-Based Learning

4.1 Methods

4.1.1 Bayesian Discriminant Functions

The Bayesian classifier has a computational advantage over the knn classifier in that the training data are summarized, rather than stored. The comparison of each test sample with every known training sample to find nearest neighbors during knn classification is a computationally expensive process, even when efficient search methods are employed (Fukunaga and Narendra, 1975). In contrast, finding the marginal probability associated with a particular feature value is computationally efficient for both the parametric and nonparametric forms of the Bayesian classifier. Since EC-based hybrid classifiers require many classifications to be performed during feature selection and extraction,¹ the use of a computationally efficient classifier such as a Bayes classifier is expedient.

¹The actual number of classifications to be performed is typically O(npt), where n is the number of generations for the EC experiment, p is the number of individual solutions in the EC population, and t is the number of samples in the EC tuning set. If all-neighbors search is conducted for each classification, the number of distance comparisons is O(npts), where s is the number of samples in the knn training set.

Unfortunately, the direct application of the Bayes classifier to the problem is not effective, because the Bayes decision rule is invariant to linear scaling of the feature space. In other words, multiplying the feature values for a given feature by a constant has no effect on the class-conditional probabilities considered by the classifier, as illustrated in Figure 4.1. Direct scaling of the marginal probabilities is also ineffective for the naive Bayes classifier, since the joint class-conditional probabilities are simply the products of the marginal probability values.

[Figure 4.1: two histograms of a single feature's training values, (a) raw and (b) scaled by 10, each showing the fraction of training samples per bin; in both cases the highlighted bin gives p(14 < x < 16) = p(140 < x < 160) = 0.045.]

Figure 4.1: The Bayes decision rule is invariant to linear transformations of the feature space. For the feature shown here, the raw feature values (a) have been multiplied by 10 in (b). Using a nonparametric Bayes classifier, we find that the original feature value falls in the bin 14-16 (black rectangle) in the original histogram. The scaled feature falls in the equivalent bin of histogram (b), and the histogram values (marginal probabilities) of the two bins are identical, so the scaling has no bearing on the classification results.

There are, nevertheless, several aspects of the Bayesian classifier that, when optimized, can yield better classification performance. One such area is the manner in which the marginal probabilities for each feature will be combined into the multivariate class-conditional probability densities. As noted previously (see Section 2.1.2), the most common approach is to assume that all features are independent. For the resulting naive Bayes classifier, the class-conditional probability is the product of the marginal probabilities for each feature. A more general approach would be to encode the entire d x d covariance matrix describing the interrelationships between all the features being considered, and allow an EC-based optimizer to search for the covariance matrix which best describes the true multivariate distribution of the training data.
Unfortunately, the search space involved in finding this covariance matrix grows as 2^(d^2), even if the elements of the covariance matrix are binary-valued. For real-valued matrix elements, the search space quickly becomes intractable, even for small problems.

We can simplify the problem somewhat by looking at the Bayesian classifier as a discriminant function. As outlined in Section 2.1.2, the Bayes decision rule can be written as follows: given x, decide class 1 (ω_1) if g(x) > 0, and class 2 (ω_2) if g(x) < 0, where g(x) = P(ω_1|x) - P(ω_2|x). The classification when g(x) = 0 is arbitrary. The discriminant function, then, is uniquely associated with a particular classifier, mapping an input feature vector to a value associated with a particular class. According to Duda and Hart:

    We can always multiply the discriminant functions by a positive constant or bias them by an additive constant without influencing the decision. More generally, if we replace every g_i(x) by f(g_i(x)), where f is a monotonically increasing function, the resulting classification is unchanged. (Duda and Hart, 1973, pp. 17-18)

Thus, we can design a parameterized classifier based on the concept of the discriminant function. We begin with a discriminant function based on the Bayes decision rule. Using this function as a model, we can design similar functions which classify well, but are more easily parameterizable for hybridization with EC optimization methods. After designing such a discriminant function and identifying the tunable parameters, we can use an EC to optimize these parameters with regard to a particular set of training and tuning data. Two distinct techniques were designed and tested using this method. The first is a simple nonlinear weighting of the Bayes discriminant function, while the second is a novel discriminant function based on the summation of the class-conditional marginal probabilities.
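The invariance illustrated in Figure 4.1 can be checked numerically with a toy, one-feature example: with a fixed number of histogram bins spanning the training range, multiplying the feature (and the query value) by a constant rescales the bin edges but leaves every bin's marginal probability, and therefore the decision, unchanged. The snippet below is only a demonstration of this point; the helper function is hypothetical and is not part of the classifiers developed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(10, 3, 500)      # training values of one feature, class 1
x2 = rng.normal(20, 3, 500)      # training values of one feature, class 2

def marginal(train, value, bins=20):
    """Histogram estimate of p(value | class) over the training range."""
    counts, edges = np.histogram(train, bins=bins)
    probs = counts / counts.sum()
    idx = np.clip(np.searchsorted(edges, value, side="right") - 1, 0, bins - 1)
    return probs[idx]

for scale in (1.0, 10.0):        # raw feature vs. the feature multiplied by 10
    v = 15.0 * scale
    p1, p2 = marginal(x1 * scale, v), marginal(x2 * scale, v)
    print(scale, p1, p2, "class 1" if p1 > p2 else "class 2")
# Both scales yield identical marginal probabilities and the same decision,
# so a simple linear scaling of a feature cannot act as a feature weight here.
```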
4.1.2 Nonlinear Weighting of the Bayes Discriminant Function

Consider the Bayes discriminant function,

    g(x) = P(ω_1|x) - P(ω_2|x)
         = [P(x|ω_1) P(ω_1) - P(x|ω_2) P(ω_2)] / Σ_{i=1}^{2} P(x|ω_i) P(ω_i)        (4.3)

The denominator can be eliminated, since it does not affect the sign of g(x), and thus does not affect the resulting classification. Since a > b implies log(a) > log(b), we can apply the log function to the a posteriori probabilities without changing the resulting classification. Thus, the following discriminant function is equivalent to the naive Bayes discriminant:

    g(x) = log(P(x|ω_1) P(ω_1)) - log(P(x|ω_2) P(ω_2))                               (4.4)
         = (log P(x|ω_1) + log P(ω_1)) - (log P(x|ω_2) + log P(ω_2))                 (4.5)

where

    log P(x|ω_i) = log P(x_1|ω_i) + log P(x_2|ω_i) + ... + log P(x_d|ω_i)            (4.6)

Finally, we can parameterize this discriminant function, while maintaining a similar level of classification accuracy, by adding coefficients to each of the marginal probabilities:

    P*(x|ω_i) = C_1 log P(x_1|ω_i) + C_2 log P(x_2|ω_i) + ...
                + C_d log P(x_d|ω_i) + log P(ω_i)                                     (4.7)

The values for the coefficients, C_1 ... C_d, are supplied by an EC optimizer. The effect of these coefficients is to apply a nonlinear weighting to each of the marginal probabilities, which are then combined to produce a confidence value, P*, for each class. While P*(x|ω_i) is no longer a joint probability distribution, the discriminant function is equivalent to the naive Bayes discriminant function when C_1 = C_2 = ... = C_d = 1. Furthermore, the new function has several desirable features for hybridization with an EC optimizer. When a particular coefficient, C_j, is reduced to zero, the associated feature value, x_j, is effectively eliminated from consideration by the classifier. This allows us to perform feature selection in conjunction with classifier tuning. Furthermore, when the value of a coefficient, C_i, is increased, the marginal probability value for the associated feature, x_i, has an increased influence on the confidence value, P*(x|ω_i), for each class.

GA Optimization of the Nonlinear Discriminant Coefficients

The implementation for this discriminant function was based on the previously described nonparametric naive Bayes classifier. The marginal probability distributions for each feature were approximated using histograms with 20 bins each. Gaussian smoothing was used to mitigate sampling anomalies, as described earlier, with σ = 2.0. The GA cost function was identical to that described in Section 3.1.2, with the exception that there is no voting involved in classification using the discriminant function classifier, so there was no corresponding incorrect-votes term in the fitness function. The coefficient values, except for C_votes, were set identically to those shown in Table 3.1.

The EC chromosome was also similar to that designed for the GA/knn hybrid classifier, although there is no k value included. Each coefficient is represented as a 32-bit integer scaled by division to produce a real value over the range [0.0, 100.0]. Masking was used to aid feature selection, with 5 mask bits for each coefficient. An example of the EC chromosome for optimization of the nonlinear discriminant coefficients is shown in Figure 4.2. GA and EP run parameters were set identically to those used in the corresponding hybrid knn classifiers, as shown in Tables 3.2 and 3.3.

[Figure 4.2: the discriminant P*(x|ω_j) = C_1 log(P(x_1|ω_j)) + C_2 log(P(x_2|ω_j)) + ... + C_4 log(P(x_4|ω_j)) + log(P(ω_j)), shown above a chromosome consisting of weight fields W_1...W_4 followed by mask fields M_1...M_4.]

Figure 4.2: An example of the EC chromosome for optimization of the nonlinear discriminant coefficients. A four-dimensional problem is shown. Each coefficient, C_i, in the discriminant function is determined by the chromosome weight, W_i, and the masking field, M_i.
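A minimal sketch of the nonlinear-weighted discriminant of Equation 4.7 is given below, assuming 20-bin histogram estimates of the marginal probabilities (Gaussian smoothing is omitted for brevity, and a small epsilon keeps the logarithms finite for empty bins) and externally supplied coefficients such as those decoded from the EC chromosome. The class and method names are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

class NonlinearBayesDiscriminant:
    """Weighted form of the naive Bayes discriminant (Eq. 4.7):
    P*(x|w_i) = sum_j C_j * log P(x_j|w_i) + log P(w_i)."""

    def __init__(self, bins=20, eps=1e-6):
        self.bins, self.eps = bins, eps

    def fit(self, X, y):
        """X: (n, d) array of training samples; y: (n,) array of labels."""
        self.classes = np.unique(y)
        self.priors = {c: np.mean(y == c) for c in self.classes}
        self.edges, self.hist = {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.edges[c], self.hist[c] = [], []
            for j in range(X.shape[1]):
                counts, edges = np.histogram(Xc[:, j], bins=self.bins)
                self.edges[c].append(edges)
                # eps avoids log(0) where a bin received no training samples
                self.hist[c].append(counts / counts.sum() + self.eps)
        return self

    def _log_marginals(self, x, c):
        logs = []
        for j, v in enumerate(x):
            idx = np.clip(np.searchsorted(self.edges[c][j], v, side="right") - 1,
                          0, self.bins - 1)
            logs.append(np.log(self.hist[c][j][idx]))
        return np.array(logs)

    def predict(self, X, coeffs):
        """coeffs: EC-supplied weights C_1..C_d; a zero coefficient removes
        the corresponding feature from consideration entirely."""
        coeffs = np.asarray(coeffs, dtype=float)
        preds = []
        for x in X:
            scores = {c: coeffs @ self._log_marginals(x, c) + np.log(self.priors[c])
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```

With all coefficients set to 1 this reduces to an ordinary histogram-based naive Bayes classifier, which is the equivalence noted after Equation 4.7.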
4.1.3 A New Discriminant Function Based on Summation of the Class-Conditional Marginal Probabilities

In the process of developing the nonlinear discriminant function, several alternatives, also derived from the Bayes discriminant function, were formulated. Of these alternatives, a discriminant function based on a linear combination of the marginal probability values showed initial promise in combination with EC coefficient optimization. As with the nonlinear discriminant, this sum-based discriminant function assigns a confidence value, P*(x|ω_i), to each class, ω_i. In this case, however, the value of P* is simply a weighted sum of the marginal probabilities for each feature value. That is,

    P*(x|ω_i) = C_1 P(x_1|ω_i) + C_2 P(x_2|ω_i) + ... + C_d P(x_d|ω_i)        (4.8)

Again, the marginal probabilities, P(x_1|ω_i) ... P(x_d|ω_i), were estimated using a histogram consisting of 20 bins. Gaussian smoothing was also employed, with σ = 2.0. Additionally, the representation of the coefficients on the chromosome, the EC cost function, and the EC parameter settings were identical to those detailed above for the Bayes-derived nonlinear discriminant function.
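The sum-based discriminant of Equation 4.8 differs from the nonlinear form only in how the marginals are combined, so a sketch needs little more than a smoothed histogram estimator and a weighted sum. The version below assumes that the smoothing σ = 2.0 is measured in histogram bin widths, which is an interpretation rather than a detail stated here, and all helper names are hypothetical.

```python
import numpy as np

def smoothed_histogram(values, bins=20, sigma=2.0):
    """Histogram estimate of one feature's class-conditional marginal,
    smoothed by convolving the bin counts with a Gaussian kernel
    (sigma assumed to be in bin widths)."""
    counts, edges = np.histogram(values, bins=bins)
    radius = int(3 * sigma)
    kernel = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(counts.astype(float), kernel, mode="same")
    return smoothed / smoothed.sum(), edges

def sum_discriminant(x, models, coeffs):
    """Confidence P*(x|w_i) = sum_j C_j * P(x_j|w_i) (Eq. 4.8); unlike the
    nonlinear form, no prior term appears.  models[c] is a list of
    (probs, edges) pairs, one per feature, built from class-c training data."""
    scores = {}
    for c, feats in models.items():
        total = 0.0
        for j, (probs, edges) in enumerate(feats):
            idx = np.clip(np.searchsorted(edges, x[j], side="right") - 1,
                          0, len(probs) - 1)
            total += coeffs[j] * probs[idx]
        scores[c] = total
    return max(scores, key=scores.get)

# Hypothetical usage with training arrays X (n x d) and labels y:
#   models = {c: [smoothed_histogram(X[y == c][:, j]) for j in range(X.shape[1])]
#             for c in np.unique(y)}
#   label = sum_discriminant(sample, models, coeffs)
```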
4.2 Results

A GA was employed to optimize the parameters of each of the discriminant function based classifiers. The two resulting hybrid classifiers were then tested on the same artificial and real-world data sets as the EC/knn hybrid classifiers, under identical training and testing conditions. Based on the similarity of the results for the GA and the EP optimizers when hybridized with a weighted knn classifier, only the GA optimizer was tested in conjunction with the discriminant function classifiers.

4.2.1 Artificial Data Sets

Results of the two parameterized discriminant function classifiers on data sets consisting of independent, Gaussian-distributed features are summarized in Table 4.1. Performance on these simple data sets is similar to that of the EC/knn hybrid classifiers, but the computation time is significantly less since all-pairs neighbor searching is no longer required. The nonlinear discriminant function seems to show slightly better feature selection performance than the other classifiers, as it is the only classifier to reduce the feature set for G10,1 down to 3 features, and G10,2 down to 9 features. There is, however, a performance penalty associated with this reduction in features: the EC/knn hybrid classifiers obtained better accuracy than the nonlinear discriminant on both of these data sets. When the number of features used is equal, as in the G5 data set, the performance of the nonlinear discriminant is similar to that of the EC/knn hybrid classifiers.

Table 4.1: Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially-generated data sets consisting of feature values drawn from independent Gaussian distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclassifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes) and the two EC/knn hybrid classifiers are reproduced here from Table 3.4 for comparison.

G10,1       Train   Tune    Test    Features
Bayes       100     -       100     10
Nonlinear   99.0    99.6    99.1    3
Sum         99.1    99.6    98.9    4
GA/knn      98.9    100     99.4    4
EP/knn      100     100     100     4

G10,2       Train   Tune    Test    Features
Bayes       92.9    -       91.5    10
Nonlinear   91.3    92.5    89.9    9
Sum         88.6    90.2    87.7    10
GA/knn      94.2    94.8    93.1    10
EP/knn      94.6    93.8    93.1    10

G5          Train   Tune    Test    Features
Bayes       96.0    -       95.6    5
Nonlinear   94.8    96.2    95.1    3
Sum         91.5    94.7    93.3    3
GA/knn      94.7    97.2    95.7    3
EP/knn      95.8    97.2    96.2    3

Table 4.2 summarizes the results for data sets consisting of feature values drawn from a mixture of uniform random and Gaussian distributions. These experiments were conducted identically to the previous set, and the results were similar. The nonlinear discriminant function outperformed all other classifiers for the data set N5, but was outperformed by the EC/knn hybrids for M6. The summation-based discriminant function found a unique solution for the N5 data set. Although the accuracy for this solution was considerably lower than that of the other classifiers, the data set was classified using only a single feature.

Table 4.2: Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially-generated data sets consisting of features with a mixture of random distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclassifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes) and the two EC/knn hybrid classifiers are reproduced here from Table 3.5 for comparison.

M6          Train   Tune    Test    Features
Bayes       89.5    -       87.7    6
Nonlinear   89.2    89.0    88.1    5
Sum         88.2    88.1    87.6    6
GA/knn      92.4    93.6    90.8    5
EP/knn      90.8    92.2    89.0    4

N5          Train   Tune    Test    Features
Bayes       94.4    -       81.6    5
Nonlinear   81.4    83.7    81.5    4
Sum         69.1    71.0    66.5    1
GA/knn      85.1    86.2    80.9    4
EP/knn      85.6    85.8    79.3    4

Once again, the balance in predictive accuracy among classes is an important factor to consider when comparing the results of the various EC-hybrid classifiers with that of the Bayes classifier. Since the EC cost function is biased to favor solutions with good balance in accuracy between classes, it will prefer such solutions to those with disparity among the classes, possibly at the cost of overall prediction accuracy. Table 4.3 summarizes the predictive balance obtained by the various classifiers for the M6 and N5 data sets. All the EC-hybrid methods except the EP/knn classifier outperform the naive Bayes classifier in terms of class balance.

Table 4.3: Balance in predictive accuracy among classes for the various EC-hybrid classifiers on the data sets M6 and N5. Predictive accuracy on the independent test set is shown for each class. Balance is defined as the difference in predictive accuracy between class 1 and class 2, as described in Section 3.1.2.
M6          Class 1   Class 2   Overall   Balance
Bayes       91.80     84.40     88.10     7.4
Nonlinear   89.40     85.80     87.60     3.6
Sum         93.20     88.40     90.80     4.8
GA/knn      92.00     86.00     89.00     6.0
EP/knn      93.00     82.40     87.70     10.6

N5          Class 1   Class 2   Overall   Balance
Bayes       76.60     86.40     81.50     9.8
Nonlinear   62.20     70.80     66.50     8.6
Sum         79.20     82.40     80.80     3.2
GA/knn      78.60     80.00     79.30     1.4
EP/knn      72.20     91.00     81.60     18.8

4.2.2 Classification of the UCI Data Sets

Classification of real-world data from the UCI machine learning data set repository not only allows a more realistic assessment of the performance of the various classifiers on data with erroneous and missing feature values, but also allows the classifiers to be tested on data sets with larger numbers of features than the artificially constructed data sets described above. Table 4.4 compares the classification and feature selection performance of the two discriminant-function-based classifiers with that of the EC/knn hybrid classifiers and the naive Bayes classifier.

The most evident aspect of the results on these four data sets is the feature selection capability demonstrated by the nonlinear discriminant function. For three of the four data sets, the minimum number of features used in classification was found by the nonlinear discriminant function in conjunction with the GA. Additionally, for the hepatitis data, the test accuracy obtained by the two discriminant function classifiers surpassed the other classifiers tested. For the other three data sets the accuracies obtained by the discriminant methods were similar to those obtained by other methods tested. The notable difference between Tune and Test results for the hepatitis and ionosphere data sets suggests that the discriminant classifiers may be more prone to overfitting of the training and tuning data than the other classifiers.

Examination of the run times for the UCI data sets illustrates the advantage held by the discriminant-function-based classifiers over the EC/knn hybrid classifiers in terms of computational efficiency. Table 4.5 compares the execution times for 200 generations of GA optimization in conjunction with the nonlinear discriminant function and the knn classifier. In all cases the nonlinear discriminant classifier is significantly faster than the GA/knn; in the case of the Pima Indian diabetes data set the difference is nearly tenfold.

Table 4.4: Results of the nonlinear-weighted discriminant function (Nonlinear) and the discriminant function based on summation of the marginal probabilities (Sum) on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. Train/Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. The number of features is the mean number of features used in classification over all 50 runs. Performance for the EC/knn classifiers is repeated here from Table 3.6 for comparison.
Hepatitis    Train/Tune   Test    Features
Bayes        85.3         65.7    19
Nonlinear    95.4         79.4    6.5
Sum          91.4         79.4    8.2
GA/knn       86.0         69.6    8.1
EP/knn       87.2         73.1    8.9

Wine         Train/Tune   Test    Features
Bayes        98.8         94.7    13
Nonlinear    97.8         91.3    4.5
Sum          99.0         92.3    7.6
GA/knn       99.7         94.8    6.0
EP/knn       99.5         93.2    6.2

Ionosphere   Train/Tune   Test    Features
Bayes        93.0         90.1    34
Nonlinear    97.6         87.5    8.5
Sum          93.8         84.2    11.2
GA/knn       95.0         91.9    8.5
EP/knn       93.2         92.3    13.5

Pima         Train/Tune   Test    Features
Bayes        76.1         64.6    8
Nonlinear    76.2         70.4    3.9
Sum          74.0         70.8    3.4
GA/knn       80.0         72.1    3.1
EP/knn       79.1         72.9    3.9

Table 4.5: Execution times (wall clock time) for 200 generations of GA optimization of the knn and nonlinear discriminant function classifiers. For each data set, the number of features (d), the number of classes (C), the combined training and tuning set size (n), and the mean execution time (hours:minutes:seconds) over 50 runs are shown. Each run was executed on a single 250 MHz UltraSPARC-II CPU of a six-CPU Sun Ultra Enterprise system with 768 MB of system RAM. Runs were executed in sets of 5 with no other user processes present on the system.

Data set     d    C   n     knn       nonlinear
Pima         8    2   400   1:40:13   0:10:52
Hepatitis    19   2   240   1:05:48   0:24:42
Ionosphere   34   2   400   2:02:25   0:43:37
Wine         13   3   240   0:23:59   0:14:39

4.2.3 Classification of Medical Data

For the thyroid and appendicitis data, the two discriminant-function-based classifiers were trained and tested in the same manner as the EC-hybrid classifiers (Section 3.2.3). For each data set, five experiments were conducted for each classifier. The appendicitis data set was re-partitioned into disjoint training/tuning and testing sets for each experiment. The much larger thyroid data set was pre-partitioned into training and testing sets in the UCI database (Quinlan et al., 1986). For this data, only the initial random GA population was changed for each experiment. The results of these experiments are summarized in Table 4.6.

Both discriminant function based classifiers performed well on the hypothyroid diagnosis data. The nonlinear classifier utilized a smaller feature set than the GA/knn on this data, at a slight cost in bootstrap test accuracy. Both discriminant function classifiers seem to exhibit overfitting of the tuning data for the appendicitis data set.

Table 4.6: Accuracy of various classifiers on the hypothyroid and appendicitis diagnosis data sets. Results for the discriminant function classifiers are averaged over five GA experiments. Results for the GA/knn classifier represent the best of five experiments. Train/Tune refers to the accuracy obtained in reclassifying the GA tuning set; Test refers to bootstrap accuracy over 100 bootstrap sets.

Thyroid        Train/Tune   Test   Features
GA/knn         98.5         98.4   3
Nonlinear      97.7         97.2   2.7
Sum            97.8         97.4   4.2

Appendicitis   Train/Tune   Test   Features
GA/knn         90.4         90.6   2
Nonlinear      80.4         67.0   2.6
Sum            83.0         74.2   2.2

This behavior, along with the previous results for the hepatitis and ionosphere data sets, suggests that the discriminant function classifiers may be prone to overfitting when presented with small (< 50 samples of each class) data sets for training, tuning, and testing.

4.3 Discussion

A key advantage of the discriminant function classifiers over the nearest-neighbor methods is the gain in computational efficiency obtained by estimating the class-conditional feature value distributions from the training data, rather than storing every training sample and performing an all-pairs search for near neighbors for each test sample.
While the experiments here were all executed for a fixed number of EC generations, it would be worthwhile to run several experiments constrained instead by wall-clock time. In this way, the efficiency advantage of the discriminant-function-based classifiers might be translated into further gains in classification accuracy relative to the near-neighbor methods.

The nonlinear discriminant function classifier, in conjunction with the GA feature extraction method, seems to exhibit the best feature selection capability of all the classifiers evaluated. In several cases, however, the additional reduction in the number of features, as compared to the GA/knn classifier, incurred a slight cost in terms of classification accuracy.

The promise exhibited by the discriminant-function-based classifiers suggests several avenues for further investigation. One possible improvement would be to include the prior probabilities for each class on the EC chromosome for optimization. Intuitively, this might allow the hybrid classifier to maintain more control over the balance in predictive accuracy among classes, even when there is disparity in the number of training and tuning samples available for each class. Initial experimentation in this area, however, suggested that inclusion of the prior probabilities on the chromosome can exacerbate the problem of overfitting the training and tuning data, resulting in poor performance on independent test data.

Since the capabilities of the Bayes classifier have been well explored in the literature, an alternate approach would be to avoid direct modification of the Bayes discriminant function. Instead, the independence assumption of the naive Bayes classifier might be abandoned, and the relationship among the various distributions of feature values encoded on the EC chromosome as a covariance matrix, or some subset thereof. Essentially, this approach would allow the EC to estimate the covariance matrix for each class based on both the training and tuning data provided. In conjunction with the previously proposed technique of including the a priori probabilities for each class on the chromosome, all the parameters of a traditional Bayes classifier might be optimized by the EC for a particular data set and error cost function.

Chapter 5

Identifying the Determinants of Solvent Binding in Proteins

5.1 Introduction

The most extensive application and testing of the various EC-based feature selection and extraction methods described here has been performed in the field of biochemistry. The specific problem area investigated was that of evaluating and understanding protein-water interactions. Understanding the nature of protein solvation (water binding) is a significant outstanding problem in structural biochemistry, and has been the subject of numerous studies, yielding a substantial body of literature (Palma et al., 1993; Westhof, 1993; Ladbury, 1995; Karplus and Faerman, 1994; Levitt and Park, 1993). Elucidation of the nature of protein-water interactions would be a significant contribution to a broad range of biochemical endeavors, including de novo drug design, database screening for drug leads, ligand docking, understanding of protein structure and function, and even the grand challenge "protein folding" problem.

5.1.1 An Overview of the Protein-Water Problem

Proteins are macromolecules consisting of linear (non-branching) chains of amino acids.
The twenty common amino acids are joined covalently during protein synthesis to form the protein "main chain" or "backbone". A particular protein can be completely specified by the order of the different amino acids that form its main chain. After synthesis, a protein will fold into a particular three-dimensional structure (Figure 5.1). Since this structure is generally a compact globule, folded, soluble proteins are occasionally referred to as "globular" proteins. In the context of a living organism, numerous water molecules will be bound, via hydrogen bonds, steric, and other interactions, to the surface of the protein in its globular form.

Figure 5.1: A particular sequence of amino acids will consistently fold into the same three-dimensional, "globular" structure. Each of the 20 common amino acids is identified by a one-letter code. The linear protein chain (left) folds into its globular structure (right) after synthesis. Water molecules, shown here as spheres, bind to the surface of the protein via hydrogen bonds.

While proteins have many functions in the context of a living organism, most proteins bind to a specific molecule (or set of molecules) and perform some action upon that molecule, such as transportation, cleavage, or catalysis of other chemical reactions. The particular molecule that a protein binds in this manner is called the ligand, and the area where the ligand binds is referred to as the protein's active site. Like the rest of the protein surface, the residues in the active site will typically bind water molecules. When a protein binds to a particular ligand, some of the water molecules in the active site will be displaced by the ligand, while some will remain in the area of the interface between the protein and the ligand. Those water molecules that remain in the interface will often form hydrogen bonds to both the protein and the ligand, thus helping the protein to bind to the ligand by forming a water-mediated "bridge" between the two molecules (Raymer et al., 1997; Poornima and Dean, 1995c). One study of 20 protein/ligand complexes found that the average protein/ligand interface includes 10 water molecules and 17 water-mediated bridges between the protein and ligand (A. Cayemberg & L. A. Kuhn, unpublished results).

In the process of designing a ligand to bind to a particular active site, such as a pharmaceutical drug, water molecules that are not displaced from the active site upon ligand binding must be considered, as they form an essential part of the protein surface (Poornima and Dean, 1995a). Unfortunately, during ligand design it is generally unknown which water molecules will remain bound upon ligand binding, and which will be displaced. If the sites of interfacial water binding were known, this information could be used to improve the chemical and shape complementarity of the ligand in the design process.

The primary goal in analyzing protein-water interactions is to isolate the determinants of water binding: the physical and chemical features of a local protein environment that determine whether or not the site is conducive to water binding. In terms of classification, it is desirable to categorize areas near a protein surface into one of three classes: sites that are unlikely to bind water molecules, "weak" solvation sites that bind water molecules but are not likely to be conserved during ligand binding, and "strong" solvation sites that are likely to be conserved between the ligand-bound and unbound forms of a protein (Ringe, 1995).
5.1.2 State of the Art in Predicting Protein Solvation

X-ray Crystallographic Data Collection

The input for most algorithms that predict protein solvation consists of the three-dimensional structure of the protein, usually represented as the Cartesian coordinates of each heavy (non-hydrogen) atom in the protein. This structure is generally obtained via X-ray crystallography. In this process, a protein is first purified and dissolved in an aqueous solvent. From this solution, microscopic crystals (with edges on the order of tenths of millimeters) are grown. These crystals consist of roughly equal amounts of the protein, arranged in a lattice, and interstitial water molecules. The crystals are then suspended in small capillaries and exposed to a beam of X-ray radiation. The electrons in the protein crystal diffract the X-rays, and the diffraction pattern is captured in two dimensions by exposing X-ray sensitive film or an electronic detector to the diffracted X-rays. The diffraction pattern on the exposed film can then be digitized and subjected to a Fourier transformation, resulting in a three-dimensional map of the electron density in the crystal. This electron density map is equivalent to a time-averaged and space-averaged¹ snapshot of the locations of the electrons in the protein crystal. From this information, the three-dimensional structure of the protein can be deduced.

¹The electron density map is space-averaged in the sense that the average electron density over all the copies of the protein in the crystal is measured.

The result of an X-ray crystallographic experiment, often referred to as a crystallographic structure, contains several other pieces of information along with the coordinates of the atoms. Each atom is assigned a crystallographic temperature factor, or B-value, which provides a relative measure of the thermal mobility of the atom. A related measurement, the occupancy of the atom, provides an estimate of the percentage of time that a site is occupied by a given atom type over the experiment time and throughout the various copies of the protein in the crystal. Generally, the information used for training and testing in empirical solvent-site prediction algorithms is calculated from the atomic coordinates, thermal mobility, and occupancy data obtained via crystallography. Occasionally, raw electron density data is also considered. From the crystallographic coordinates, other physical and chemical features of each protein atom can be calculated. Further details and examples are provided in Section 5.2.3.

Current Approaches to Solvent Site Prediction

Current algorithms for predicting the locations of bound water molecules can be divided into two classes. Theoretical approaches, such as GRID (Wade et al., 1993), use a potential energy function to evaluate the favorability of a probe site. For GRID, the potential energy function includes terms for Lennard-Jones interactions, electrostatic interactions, and a detailed evaluation of potential hydrogen bonds. For all theoretical approaches, the relative contribution of the terms in the potential energy function must be determined before solvation sites can be predicted. In contrast, empirical methods determine the favorability of a solvation site by analogy with known sites. A site is evaluated and compared to a database of known solvation sites and non-solvated sites, and predicted as being more similar to one than the other.
A set of features to observe and compare between solvation sites and non-solvated sites must be selected prior to classification. Several empirical methods for prediction of protein-bound water molecule locations have been developed. AQUARIUS2 (Pitt et al., 1993) uses a knowledge base of the distributions of water molecules around polar atoms at the protein surface. A "likelihood" value is assigned to a putative water molecule location based on its geometric relationship to nearby polar protein atoms. If the site lies in a region highly occupied by water molecules in the knowledge base and has significant electron density, as determined by X-ray crystallography, it receives a higher score. The highest scoring positions in a 3-dimensional matrix surrounding the protein are selected as likely water molecule locations. AUTO-SOL (Vedani and Huhta, 1991) predicts water sites based on the directionality of hydrogen bonds. A database of small-molecule crystal structures was analyzed to find the distribution of hydrogen-bond lengths and directions, and possible solvent sites are evaluated by AUTO-SOL according to how well their hydrogen-bond geometry matches this database. Current methods can reproduce ~70% of crystallographically observed solvent molecules within 1.5 Å of the experimental locations (Vedani and Huhta, 1991). There remains, however, a tendency for many current methods to produce false positives, predicting solvation sites where none are observed in the crystal structure.

Both empirical and theoretical methods require the evaluation of a set of features describing the environment of the site. Commonly-used features include hydrogen bond geometry, electrostatic interactions, local thermal mobility, protein surface shape, solvent accessibility, and the packing geometry, atom type, and residue type of neighboring protein atoms. Several studies have evaluated the correlation between individual features and water binding. For instance, in a study of 56 high-resolution crystallographic structures, a strong relationship was found between local surface shape and water binding (Kuhn et al., 1992). Deep grooves in the solvent-accessible surface bound three times as many water molecules as non-groove surface areas. Results obtained later also showed that deep surface grooves were the preferred sites for tightly-bound water molecules, which were unlikely to be displaced by ligand binding (Poornima and Dean, 1995b). Furthermore, conserved polar contacts between water molecules and the protein were found to be associated with solvation-site conservation (Poornima and Dean, 1995c).

As noted previously, X-ray crystallographic assignment is an imperfect source of information for known water sites, resulting in both false-positive and false-negative sites in any database based on crystallographic data. One method for identifying reliable water molecule positions is to utilize solvation sites conserved in multiple crystallographic structures. In a study of eighteen crystallographically independent structures representing ten unique crystal forms of T4 lysozyme, it was found that ~79% of the 20 most frequently occupied sites were conserved in all structures when allowing for potential steric interference induced by the crystal lattice (Zhang and Matthews, 1994).
A recent study of conserved solvation sites in thrombin, trypsin, and bovine pancreatic trypsin inhibitor (BPTI) found that 19% of unique water sites in 10 superimposed thrombin structures were conserved in at least half of the structures, while 77% of the unique sites in three superimposed trypsin structures were conserved in at least two of the three structures (Sanschagrin and Kuhn, 1998). In perhaps the most extreme change in crystallization conditions, the structure of subtilisin Carlsberg was solved in an organic solvent, anhydrous acetonitrile, and compared with the structure solved in an aqueous environment (Fitzpatrick et al., 1993). Of the 119 bound water sites observed in the aqueous structure, 99 were conserved in the acetonitrile structure. Together, these studies show that one-half or more of bound water sites are typically conserved in independent structures of a protein solved under diverse conditions.

A slightly different approach was taken by Gschwend and Karplus (1991) in their study of features correlating with increasing electron density of water sites in glutathione reductase. The water molecules in a 1.54 Å resolution structure were ranked in order of electron density in the final 2Fo - Fc map. Eight parameters were measured for each water molecule, weighted, and correlated with the electron density rankings; they included the energy from XPLOR refinement, solvent accessibility, strong hydrogen bonds to protein atoms and to other water molecules (donor-to-acceptor bond length between 2.5 and 3.2 Å), weak hydrogen bonds (bond length less than 2.5 Å or from 3.2 to 3.5 Å) to protein atoms and water molecules, the average temperature factor of hydrogen-bonding partners of the water molecule, and the average occupancies of hydrogen-bonding partners. The parameters that correlated best with increasing electron density included decreasing average temperature factor of hydrogen-bonding partners of the water molecule, decreasing solvent accessibility of the water, and increasing number of strong hydrogen bonds to protein atoms. The linear correlation coefficient between this set of features and the density ranking was 0.8.

An experimental approach was taken by Ringe (1995) to investigate the characteristics of ligand-binding sites on the protein surface. Crystals of elastase were grown in aqueous mother liquor, then transferred to an organic solvent. A shell of bound water molecules remained, even after the organic solvent was exchanged several times, though some organic solvent molecules also bound. The features associated with binding sites for the polar organic solvent molecules included the local surface shape, exposure of hydrophobic groups, and the presence of water molecules bound to polar groups near the site.

The approach presented here is a hybrid between prediction of bound water location and characterization of features favorable for water binding. Three classes of protein sites were studied: non-solvated sites, sites that were solvated in a single crystallographic structure of a protein, and solvation sites that were conserved in two crystallographically independent structures. The hybrid genetic algorithm/k-nearest-neighbors (GA/knn) classifier was provided with a set of features representing each site, and trained to distinguish between the classes of sites.
Motivation and Experimental Design

Unfortunately, one of the steps in the crystallographic determination of protein structures, the assignment of water molecule locations based on the electron density data from an X-ray crystallographic experiment, is an imperfect process. The resulting three-dimensional structure may include false sites via noise in the electron density, and omit many sites due to limited resolution. Additionally, the structure of the crystal lattice may produce favorable environments for bound water molecules where none were present in the soluble form of the protein. By applying the feature selection and extraction methods described here to protein crystallography data sets, we hope to improve water assignment by identifying the physical and chemical features associated with water-binding sites, and by producing a classifier that is capable of distinguishing solvation sites from non-solvated sites near the protein surface.

In order to train such a classifier, however, the current, imperfect information about protein-water binding available from X-ray crystallographic structures must be used. To mitigate the effect of noisy water coordinate data, we can superimpose multiple crystallographic structures of the same protein and examine the water-binding sites that are conserved between them. In this way we can identify the physical and chemical features associated with the most reliable water binding sites available: those that are conserved among multiple, independently-solved structures of the same protein. Finally, by using this information during the assignment of water molecules in future structures, we can improve the reliability of crystallographic water assignment by considering not only the electron density information obtained from the X-ray experiment, but also the physical and chemical features of the local protein environment in determining whether or not to assign a water molecule to a particular location near the protein surface. While the electron density is the experimental observable and the most important criterion, evaluating the likelihood of a given protein environment for water binding gives us insight into the protein-water recognition process and also can help identify missing or extraneous water sites in crystallographic structures.

In pursuit of this goal of understanding protein-water interactions more completely, the classification and feature selection and extraction methods described in earlier chapters were applied to two distinct classification problems:

Prediction of solvation sites: Given a location near the surface of a protein, determine whether the site is likely to be a water-binding site or a non-solvated site.

Prediction of solvent-site conservation: Given a known solvent site from a non-ligand-bound protein, determine whether the water molecule will be conserved (remain bound) in the protein-ligand complex.

For each of these problems, two subproblems related to understanding protein-water binding were investigated. The first was to develop and train a classifier capable of distinguishing between the two classes of sites: solvated vs. non-solvated, and conserved vs. non-conserved, respectively. The second objective was to utilize the feature selection and data mining capabilities of the EC-hybrid classifiers to identify the features of protein-bound water molecules that are associated with solvent binding, and with conservation of solvent sites between structures.
5.2 Methods

5.2.1 Compilation of a Protein Solvation Database

The task of understanding protein-water interactions was undertaken in two stages. The objective of the first stage was to design a hybrid EC/knn classifier capable of distinguishing water molecules from non-solvated sites on the surface of a protein, and to use the feature selection capabilities of this classifier to identify the physical and chemical features associated with solvation sites. The second goal was to further classify the water sites into those likely to be conserved between a ligand-bound and an unbound structure of a protein, and those likely to be non-conserved (displaced). Again, once this classification was made, the data mining and feature selection capabilities of the classifier were utilized to identify the determinants of solvent conservation between ligand-bound and unbound structures.

Prerequisite to any of these experiments, it was necessary to establish a database of conserved and non-conserved water molecules as well as non-solvated sites near protein surfaces to serve as training and tuning data for the GA/knn hybrid classifier. To this end, 60 protein structures were selected in pairs from the Brookhaven Protein Databank (Abola et al., 1987; Bernstein et al., 1977). Each pair consisted of the crystallographic structure of a protein bound to a ligand and the corresponding structure of the unbound protein. Protein pairs were selected based on several criteria. First, all proteins included in the database were non-homologous² (Hobohm et al., 1992). In addition, high-resolution (≤ 2.0 Å) structures were preferred, and structures exhibiting significant conformational changes upon ligand binding were excluded, since the concept of conserved water is no longer well-defined in this case. Pairs of structures with a root-mean-squared positional deviation (RMSD) of ≤ 2.0 Å upon superposition of main-chain atoms were also preferred. The 30 pairs of proteins selected for the database are detailed in Table 5.1.

² Homology, in this case, refers to proteins composed of similar sequences of amino acids. Inclusion of homologous proteins in the database can introduce redundant information, which can lead to unnecessary bias in the training and tuning data. No two proteins included in the database described here had > 25% amino acid sequence identity when the linear sequences of amino acids comprising each protein were aligned.

Table 5.1: 30 protein pairs included in the database.

PDB Code¹ | Protein / Ligand Complex | Res(Å)² | Waters | RMSD³
1ahc/1ahb | α-momorcharin / formycin 5'-monophosphate | … | … | …
1apm/1atp | cAMP-dependent protein kinase / MnATP | … | … | …
1bia/1bib | BirA bifunctional protein / biotinylated lysine | … | … | …
1bsa/1brn | barnase / d(CGAC) | … | … | …
1ca2/1bcd | carbonic anhydrase II / trifluoromethane sulphonamide | … | … | …
1cgf/1hfc | fibroblast collagenase / HAP⁴ | … | … | …
1cgt/1cgu | cyclodextrin glycosyltransferase / glucose | … | … | …
1chp/1xtc | cholera toxin B pentamer / polypeptide chain | … | … | …
1dr2/1dr3 | dihydrofolate reductase / biopterin | … | … | …
1gta/1gtb | glutathione S-transferase / praziquantel | … | … | …
1hel/1mlc | hen egg-white lysozyme / monoclonal antibody Fab D44.1 | … | … | …
1lib/1lic | adipocyte lipid-binding protein / hexadecanesulfonic acid | … | … | …
1nsb/1nsc | neuraminidase / N-acetyl neuraminic acid | … | … | …
1poa/1pob | phospholipase A2 / transition-state analogue | … | … | …
1syc/1syd | staphylococcal nuclease / 2'-deoxy-3',5'-diphosphothymidine | … | … | …
1thm/2tec | thermitase / eglin-C | … | … | …
Table 5.1 (cont.)

PDB Code¹ | Protein / Ligand Complex | Res(Å)² | Waters | RMSD³
1udg/1udh | uracil-DNA glycosylase / uracil | 1.75/1.75 | 121/94 | 0.19
2act/1aec | actinidin / E64⁵ | 1.7/1.86 | 272/268 | 0.11
2apr/3apr | acid proteinase / reduced peptide inhibitor | 1.8/1.8 | 373/344 | 0.13
2cla/3cla | chloramphenicol acetyltransferase / chloramphenicol | 2.35/1.75 | 104/204 | 0.41
2ctv/5cna | concanavalin A / α-methyl-D-mannopyranoside | 1.95/2.0 | 146/533 | 0.42
2sga/5sga | proteinase A / tetrapeptide Ace-Pro-Ala-Pro-Tyr | 1.5/1.8 | 220/185 | 0.08
2wrp/1tro | Trp repressor / synthetic operator | 1.65/1.9 | 170/572 | 2.18
3cox/1coy | cholesterol oxidase / 3-β-hydroxy-5-androsten-17-one | 1.8/1.8 | 453/424 | 0.24
3dni/2dnj | deoxyribonuclease I / DNA | 2.0/2.0 | 375/252 | 0.37
3enl/5enl | enolase / 2-phospho-D-glyceric acid | 2.25/2.2 | 353/355 | 0.21
3grs/1gra | glutathione reductase / glutathione disulfide | 1.54/2.0 | 523/530 | 0.12
3tln/3tmn | thermolysin / Val-Trp | 1.6/1.7 | 173/173 | 0.10
5cpa/6cpa | carboxypeptidase A / phosphonate | 1.54/2.0 | 315/148 | 0.36
See note 6 | RTEM-1 β-lactamase / penicillin G | 1.7/1.7 | 182/189 | 0.22

¹ Ligand-free/ligand-bound.
² Resolution of the crystallographic structures in Angstroms.
³ Main-chain root-mean-squared positional deviation from superposition of the ligand-bound and free structures.
⁴ HAP is (N-(2-hydroxamatemethylene-4-methyl-pentoyl)phenylalanyl) methyl amine.
⁵ E64 is [N-(L-3-trans-carboxyoxirane-2-carbonyl)-L-leucyl]-amido(4-guanido)butane.
⁶ Provided by Drs. Natalie Strynadka and Michael James, University of Alberta, Edmonton.

The water molecules from the unbound structures were used to construct the training and tuning database. The ligand-bound structures were used only to label the water molecules from the unbound structures as conserved or displaced. To obtain this labeling, the main-chain atoms of the ligand-bound and unbound structures were superimposed using the rigid-body superimposition algorithm of the InsightII molecular graphics software (Molecular Simulations, Inc., San Diego, CA). Each water molecule in the unbound structure was then compared with the superimposed ligand-bound structure; if a corresponding water molecule was found in the ligand-bound structure within 1.2 Å (Zhang and Matthews, 1994) of the water from the unbound structure, the water was labeled as conserved. Water molecules from the unbound structure with no corresponding water molecule in the superimposed ligand-bound structure were labeled as non-conserved. Only first-shell water molecules from the unbound structures were included in the database, where the first shell is defined as the set of water molecules within hydrogen-bonding distance (3.6 Å) of a protein surface atom.
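The conservation labeling and first-shell filtering can be summarized in a brief sketch. The Python fragment below is only an illustration of the procedure just described, not the software used in this work; it assumes the two structures have already been superimposed, that coordinates are available as NumPy arrays, and that the function names are hypothetical.

    import numpy as np

    CONSERVED_CUTOFF = 1.2    # Angstroms (Zhang and Matthews, 1994)
    FIRST_SHELL_CUTOFF = 3.6  # Angstroms, hydrogen-bonding distance to a protein surface atom

    def min_dist(points, targets):
        # Minimum Euclidean distance from each row of `points` to any row of `targets`.
        return np.linalg.norm(points[:, None, :] - targets[None, :, :], axis=-1).min(axis=1)

    def label_first_shell_waters(free_waters, bound_waters, surface_atoms):
        # free_waters:   (n, 3) water oxygen coordinates from the ligand-free structure
        # bound_waters:  (m, 3) water oxygens from the superimposed ligand-bound structure
        # surface_atoms: (p, 3) protein surface atom coordinates of the ligand-free structure
        first_shell = min_dist(free_waters, surface_atoms) <= FIRST_SHELL_CUTOFF
        conserved = min_dist(free_waters, bound_waters) <= CONSERVED_CUTOFF
        labels = np.where(conserved, "conserved", "non-conserved")
        return free_waters[first_shell], labels[first_shell]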
5.2.2 Generation of Non-Solvated Probe Sites

For distinguishing solvation sites from non-solvated sites near the protein surface, it was necessary to generate a collection of non-solvated probe sites near the surface of each of the 30 non-ligand-bound proteins selected for the database. To generate these probe sites, the solvent-accessible molecular surface for each of the non-ligand-bound structures in Table 5.1 was generated using the Molecular Surface (MS) software from Connolly (1983). A probe radius of 1.2 Å was used, with a surface density of 1 dot/Å² and unit normals generated at each surface dot. For each protein, the minimum distance of any crystallographically observed water molecule from the protein surface, dmin, and the maximum distance, dmax, were determined. For each surface point, a probe site was generated and placed at a distance along the surface normal, selected at random over the range of distances from dmin to dmax. Any non-sites overlapping crystallographic water sites or other non-sites (with positions within 1.2 Å) were removed. Finally, the same number of non-sites was selected for each protein as there were crystallographic water sites. This selection was done so that the distribution of distances of non-sites from the protein surface matched the distribution of distances for observed water molecules for that protein.
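The probe-site construction lends itself to a similar sketch. The following Python fragment is a simplified, hypothetical rendering of the steps above: it assumes the MS surface dots and unit normals are already available as arrays, uses the distance to the nearest protein atom as a stand-in for the distance to the molecular surface, and replaces the distance-distribution matching with a plain random subsample.

    import numpy as np

    def nearest_dist(a, b):
        # Minimum distance from each point in `a` to the point set `b`.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min(axis=1)

    def make_probe_sites(surface_dots, surface_normals, waters, protein_atoms,
                         clash_cutoff=1.2, seed=0):
        rng = np.random.default_rng(seed)
        # Range of observed water-to-surface distances for this protein.
        d_water = nearest_dist(waters, protein_atoms)
        d_min, d_max = d_water.min(), d_water.max()
        # One candidate per surface dot, pushed outward along its unit normal.
        offsets = rng.uniform(d_min, d_max, size=len(surface_dots))
        candidates = surface_dots + offsets[:, None] * surface_normals
        # Discard candidates within 1.2 A of a crystallographic water site.
        candidates = candidates[nearest_dist(candidates, waters) > clash_cutoff]
        # Greedily discard candidates within 1.2 A of an already accepted probe.
        accepted = []
        for c in candidates:
            if not accepted or np.linalg.norm(np.array(accepted) - c, axis=1).min() > clash_cutoff:
                accepted.append(c)
        accepted = np.array(accepted)
        # Keep as many probes as there are observed waters (a uniform subsample here;
        # the procedure in the text instead matches the observed distance distribution).
        n_keep = min(len(waters), len(accepted))
        return accepted[rng.choice(len(accepted), size=n_keep, replace=False)]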
5.2.3 Feature Identification and Measurement

Once a database of non-solvated sites, non-conserved water-binding sites, and conserved solvation sites had been established, the next step was to identify and measure a collection of features to represent each water-binding site or non-solvated site for purposes of classification. For each site, six physical and chemical features of the local protein environment were computed from the crystallographic coordinates. For conserved and non-conserved solvation sites, two additional features describing the thermal mobility of the water molecule in the crystal structure were included. The eight features included in the waters database are listed in Table 5.2.

Table 5.2: The physical and chemical features used to represent protein-bound water molecules and protein surface sites. BVAL and MOB were only measured for conserved and non-conserved water molecules, as the concept of thermal mobility is not defined for non-solvated probe sites.

ADN (Atomic density): The number of protein atom neighbors within 3.6 Å of the water molecule. This feature correlates with the local protein topography. Water molecules bound in deep grooves will have high ADN values, while those bound to protrusions will have low ADN values (Kuhn et al., 1992).

AHP (Atomic hydrophilicity): The hydrophilicity of the neighborhood of the water molecule is based on the frequency of hydration for each atom type in 56 high-resolution protein structures (Kuhn et al., 1995). Each water molecule is assigned an AHP value equal to the sum of the atomic hydrophilicity values of all atom neighbors within 3.6 Å of the water molecule.

BVAL (B-value): The crystallographic temperature factor from the PDB file of the crystal structure. This feature measures the thermal mobility of the water molecule.

HBDP (Hydrogen bonds to protein): The number of hydrogen bonds between the water molecule and neighboring protein atoms. Each donor or acceptor atom within 3.5 Å is considered a potential hydrogen bond.

HBDW (Hydrogen bonds to water): The number of hydrogen bonds between the water molecule and other water molecules in the ligand-free protein structure, based on a ≤ 3.5 Å distance between oxygen atoms in the two water molecules.

MOB (Mobility): A normalized measure of thermal mobility, defined in terms of the B-value and occupancy, two measures of the thermal mobility of the water molecule provided in the crystallographic structure of the protein: MOB = (Bw / Bavg) / (Occw / Occavg), where Bw and Bavg are the B-value of the water molecule and the average B-value of all the water molecules in the protein, respectively. Similarly, Occw and Occavg are the occupancy of the water molecule (from the PDB file) and the average occupancy of all water molecules in the protein structure (Craig et al., 1998).

ABVAL (Average B-value of protein atom neighbors): The average (mean) temperature factor of all protein atoms within 3.6 Å of the water molecule.

NBVAL (Net B-value of protein atom neighbors): The sum of the B-values of all protein atoms within 3.6 Å of the water molecule.

The eight features described in Table 5.2 comprise most of the factors currently believed to be related to protein-water binding, as discussed in Section 5.1.2. These features do not, unfortunately, establish a well-separated or easily classified data space. Figure 5.2 shows 800 waters selected randomly from the conserved and non-conserved protein-bound water molecule data set, plotted according to their first two principal components. This plot demonstrates the high degree of overlap between the two classes. This overlap is further evidenced by the poor classification results obtained by linear discriminant analysis (Fisher, 1936), which obtains a classification accuracy of only ~50% (equivalent to random prediction) in distinguishing between conserved and non-conserved solvation sites based upon this data set.

Figure 5.2: Conserved and non-conserved water molecules plotted according to the first two principal components of the eight-feature water-binding data set. Eight hundred randomly selected water molecules from the data set are plotted. Non-conserved water molecules are plotted as 1's, conserved water molecules are plotted as 2's. The first principal component is composed primarily of the features HBDP, ADN, and AHP (with coefficients 0.63, 0.52, and 0.38, respectively). The second principal component is composed primarily of the three features based on temperature factor: ABVAL, BVAL, and NBVAL (the respective coefficients are -0.54, -0.53, and -0.44). There is no clear separation between the two classes in this feature space.

5.2.4 Distinguishing Solvation Sites from Non-Solvated Sites Near the Protein Surface

The training and tuning data for distinguishing solvation sites from non-solvated regions of the protein surface consisted of all the first-shell water molecules from the non-ligand-bound structures in Table 5.1, a total of 5542 water molecules, and an equal number of non-solvated probe sites, for a total of 11,084 samples. The non-solvated sites were generated from the same 30 non-ligand-bound structures, as described in Section 5.2.2. The six features from Table 5.2 not related to thermal mobility were used to represent each sample, and the patterns were labeled "class 1" for non-solvated probe sites and "class 2" for solvation sites. For each GA/knn experiment, 1000 patterns of each class were selected at random from the database to serve as the training set. Likewise, the tuning data consisted of 1000 patterns of each class. After the GA/knn experiment, bootstrap testing was performed as described in Section 3.1.6 to test the predictive accuracy of the best weight set and k value found by the GA. The bootstrap testing pool consisted of the 3542 samples from each class not included in the training or tuning data. Individual test sets of size 500 were drawn from this pool during the bootstrap testing. The GA run parameters and the cost function coefficients for these experiments were identical to those detailed in Tables 3.1 and 3.2.
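As a concrete illustration of how a single GA-tuned weight set and k value are applied during bootstrap testing, consider the Python sketch below. It is not the implementation used in this work: the function names are hypothetical, class labels are assumed to be coded as small integers, the weights are assumed to scale each feature before the Euclidean distance is computed (as is typical for weighted knn), and the resampling details of Section 3.1.6 are only approximated by drawing fixed-size test sets at random from the held-out pool.

    import numpy as np

    def weighted_knn_predict(train_X, train_y, test_X, weights, k):
        # k-nearest-neighbors in the feature space scaled by the GA weight vector;
        # features with weight 0 are effectively masked out of the distance.
        tr = train_X * weights
        te = test_X * weights
        d = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=-1)   # (n_test, n_train)
        nn = np.argsort(d, axis=1)[:, :k]                              # k nearest training samples
        votes = train_y[nn]                                            # neighbor class labels
        return np.array([np.bincount(v).argmax() for v in votes])      # majority vote

    def bootstrap_accuracy(train_X, train_y, pool_X, pool_y, weights, k,
                           n_tests=100, test_size=500, seed=0):
        # Mean accuracy of a fixed weight set and k over repeated test draws from
        # the pool of samples never used for training or tuning.
        rng = np.random.default_rng(seed)
        accs = []
        for _ in range(n_tests):
            idx = rng.choice(len(pool_X), size=test_size, replace=False)
            pred = weighted_knn_predict(train_X, train_y, pool_X[idx], weights, k)
            accs.append(np.mean(pred == pool_y[idx]))
        return float(np.mean(accs))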
5.2.5 Distinguishing Conserved Solvation Sites from Sites Not Conserved Between Ligand-Bound and Unbound Protein Structures

The training and tuning data for classifying conserved solvation sites between ligand-bound and unbound protein structures consisted of the 5542 first-shell water molecules from the 30 non-ligand-bound structures in Table 5.1. Of these water molecules, the 2137 non-conserved water molecules were labeled "class 1", and the 3405 conserved water molecules were labeled "class 2". All eight features described in Table 5.2 were used to represent each sample. Training and tuning data were compiled in a similar manner to the previous set of experiments, but due to the smaller size of this data set, the training and tuning data consisted of 400 patterns of each class. The remaining 2605 conserved and 1337 non-conserved water molecules composed the pool of samples for bootstrap testing. Again, the bootstrap test set size was 500 samples. As with the previous set of experiments, the GA run parameters and the cost function coefficients were set as described in Tables 3.1 and 3.2.

5.3 Results

5.3.1 Differentiation of Solvation Sites from Non-Solvated Sites

The protein solvation site and non-solvated site experiments were conducted with several objectives. In addition to optimizing classification accuracy, the feature weights produced by the GA were analyzed to determine whether a subset of the available features was consistently selected for inclusion in the highest-performance weight sets. Fifteen GA/knn experiments were conducted with differing initial random populations. One hundred bootstrap test experiments were then conducted on the best six resulting weight sets. Descriptive statistics were calculated for each set of bootstrap tests, and the results are summarized in Table 5.3, sorted according to mean bootstrap accuracy.

In spite of randomized initial conditions for each GA experiment, the top performing runs for distinguishing protein solvation sites from non-sites (refer to Table 5.3) exhibited notable consistency in feature subset selection across runs. Although we cannot assume that this consistency is indicative of finding a globally optimal set of feature weights for classifying this data, we can conclude that the features that are consistently included in the optimized weight sets are a sufficient set to predict, with ~68% accuracy, water-binding sites at the protein surface. In this respect, then, these features can be concluded to be associated with water binding.

Table 5.3: Bootstrap test results, k-values, and weight sets from the top six GA experiments for distinguishing crystallographically observed solvation sites from non-sites. Bootstrap testing accuracy is given as a percentage for observed solvation sites (Site), non-solvated sites (Non), and both classes together (Tot). Balance (Bal) is the average difference between solvated-site and non-solvated-site prediction accuracy (%) over all bootstrap runs. Features with no weights were masked by the GA; i.e., their feature weights were zero, and they did not participate in classification. The feature weights for all six features are normalized to sum to 1.00 for each experiment.
Site | Non | Tot | Bal | k | Nonzero feature weights (original columns: ADN, AHP, HBDP, HBDW, ABVAL, NBVAL)
68.60 | 67.75 | 68.18 | 3.50 | 23 | 0.284, 0.524, 0.193
66.87 | 68.74 | 67.81 | 3.50 | 75 | 0.545, 0.235, 0.219
65.93 | 65.93 | 65.93 | 3.20 | 53 | 0.530, 0.197, 0.274
65.21 | 70.44 | 67.83 | 5.49 | 45 | 0.194, 0.721, 0.086
66.19 | 69.40 | 67.79 | 4.39 | 35 | 0.332, 0.643, 0.024
67.32 | 67.74 | 67.53 | 3.52 | 61 | 0.651, 0.289, 0.060

For distinguishing observed solvation sites from pseudo-water sites, the number of hydrogen bonds to protein atoms (HBDP) was the most common highly weighted feature, followed closely by atomic density (ADN). Some measure of the temperature factor of neighboring protein atoms was also included in each of the top weight sets. For half of the runs the average temperature factor (ABVAL) was included, while the sum of the temperature factors (NBVAL) was included in the other half. The local atomic hydrophilicity (AHP) and the number of hydrogen bonds to other water molecules (HBDW) were not included in any of the highest-performance weight sets. The average weights for each feature across the top fifteen weight sets are shown in Figure 5.3.

Figure 5.3: Average normalized weight of each feature in the top fifteen weight sets for distinguishing crystallographically observed solvation sites from non-solvation sites.

These results correspond well with several other studies of conserved solvation sites in specific protein structures. In a study of conserved solvation sites in thrombin, trypsin, and bovine pancreatic trypsin inhibitor (BPTI), various features of the local protein environment were correlated with the degree of conservation of solvation sites across multiple superimposed structures for each protein (Sanschagrin and Kuhn, 1998). The three features which demonstrated the strongest correlation with water-site conservation were hydrogen bonds to neighboring protein atoms, atomic hydrophilicity, and atomic density. Two of these features, hydrogen bonds to protein atoms and atomic density, were highly weighted by the GA in all of the weight sets reported in Table 5.3. A possible explanation for the GA's failure to include atomic hydrophilicity in any of these weight sets is the strong correlation between atomic density and atomic hydrophilicity. Since both of these features are strongly dependent on the number of protein atoms neighboring a given water molecule, the two values are correlated (r = 0.642).

Even more closely related to the features found by the GA are those found to be related to water electron density rankings by Gschwend and Karplus (1991): average temperature factor of hydrogen-bonding partners, solvent accessibility, and the number of strong hydrogen bonds to protein atoms. The hydrogen-bonding and temperature-factor features were selected by both methods, while the atomic density feature used by the GA is closely related to the solvent accessibility term used in Gschwend and Karplus's density correlations. Water molecules bound in deep grooves will have low solvent accessibility and higher atomic density values, while the reverse holds for waters bound to protein surface protrusions. Studies conducted by Kuhn et al. (1992) and Poornima and Dean (1995b) both demonstrated a relationship between water binding and local surface shape. The GA results identify the same relationship, as atomic density is included in all of the top performing weight sets; a high density of protein atoms surrounding a water site correlates with groove-like topography.
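The per-feature averages plotted in Figure 5.3 amount to a simple column mean over the top-ranked weight sets. The short Python sketch below is illustrative only: the weight values are invented for the example, and it assumes that masked features are carried as zeros in each normalized weight vector.

    import numpy as np

    features = ["ADN", "AHP", "HBDP", "HBDW", "ABVAL", "NBVAL"]
    # Hypothetical rows: top-ranked weight sets, each normalized to sum to 1.00,
    # with GA-masked features entered as 0.
    top_weight_sets = np.array([
        [0.30, 0.0, 0.50, 0.0, 0.20, 0.0],
        [0.55, 0.0, 0.25, 0.0, 0.0, 0.20],
        [0.50, 0.0, 0.20, 0.0, 0.30, 0.0],
    ])
    mean_weights = top_weight_sets.mean(axis=0)
    for name, w in sorted(zip(features, mean_weights), key=lambda t: -t[1]):
        print(f"{name:6s} {w:.3f}")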
5.3.2 Distinguishing Between Conserved and Non-Conserved Solvation Sites

The problem of identifying solvation sites likely to be conserved between ligand-bound and unbound structures of a protein was one of the first applications of the GA/knn hybrid classifier. As such, significantly more experiments have been conducted for this data set than for any of the other data sets presented here. The top 21 weight sets from all such experiments were selected for bootstrap testing. Again, 100 bootstrap tests were conducted for each weight set. Accuracy and balance results for each weight set and k value tested are summarized in Table 5.4.

Table 5.4: Bootstrap results of the best 21 weight sets for identifying conserved solvation sites. Mean bootstrap testing accuracy is given as a percentage (Acc). Mean balance (Bal) is the average difference between conserved and non-conserved prediction accuracy (%) over all bootstrap runs. Features with no weights were masked by the GA; i.e., their feature weights were zero, and they did not participate in classification. The feature weights for all eight features are normalized to sum to 1.00 for each experiment.

Acc | Bal | k | Nonzero feature weights (original columns: ADN, AHP, BVAL, HBDP, HBDW, MOB, ABVAL, NBVAL)
64.20 | 3.88 | 65 | 0.413, 0.135, 0.137, 0.315
63.62 | 3.80 | 29 | 0.667, 0.333
63.16 | 5.35 | 25 | 0.463, 0.323, 0.214
63.11 | 4.15 | 37 | 0.891, 0.109
63.00 | 3.52 | 77 | 0.308, 0.163, 0.225, 0.304
62.87 | 5.36 | 17 | 0.841, 0.159
62.47 | 7.74 | 97 | 0.459, 0.291, 0.250
62.36 | 3.27 | 27 | 0.371, 0.629
62.16 | 3.79 | 23 | 0.372, 0.240, 0.203, 0.184
62.15 | 3.50 | 7 | 0.571, 0.156, 0.273
62.02 | 4.26 | 87 | 0.118, 0.558, 0.323
61.72 | 4.14 | 17 | 0.252, 0.352, 0.397
61.70 | 4.20 | 67 | 0.421, 0.579
61.58 | 3.58 | 13 | 0.018, 0.388, 0.441, 0.153
61.46 | 3.84 | 63 | 0.227, 0.773
61.42 | 3.18 | 19 | 0.051, 0.417, 0.058, 0.474
61.36 | 3.39 | 15 | 0.392, 0.293, 0.207, 0.108
61.16 | 7.14 | 19 | 1.000
60.62 | 4.58 | 57 | 0.881, 0.119
60.30 | 2.78 | 75 | 0.317, 0.683
60.25 | 3.40 | 69 | 0.336, 0.664

In applying the feature selection and extraction capabilities of the EC-hybrid classifiers for data mining and analysis, consistency in feature weighting across multiple runs is a desirable result. While it is never possible to guarantee that the EC has found an optimal or near-optimal set of feature weights, features that are consistently included in weight sets that provide high classification accuracy are likely to be associated with the natural classes in the data. For the water conservation data, the two features related to thermal mobility, B-value (BVAL) and mobility (MOB), are included in all of the top six performing weight sets. Features such as atomic density (ADN), the number of hydrogen bonds to other water molecules (HBDW), and the average B-value of neighboring protein atoms (ABVAL), which are included in very few of the top 21 weight sets, are likely to be omitted either because they are not as closely associated with water binding, or because they are closely related to another included feature and thus provide redundant information. Because the EC cost function penalizes for each feature included in the classification, the EC search is directed towards a minimal feature set that provides good classification accuracy. It is interesting to note that atomic hydrophilicity (AHP) and mobility are nearly mutually exclusive, suggesting that some information needed for classification may be provided by both of these features. The linear correlation coefficient for AHP and MOB is r = -0.374.
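A quick way to look for this kind of redundancy is to inspect pairwise feature correlations directly. The fragment below is a generic sketch, not part of the analysis pipeline used here; the threshold of 0.6 and the function name are arbitrary illustrative choices.

    import numpy as np

    def redundant_pairs(X, names, threshold=0.6):
        # X: (n_samples, n_features) feature matrix. Flag feature pairs whose
        # absolute Pearson correlation meets or exceeds the threshold.
        r = np.corrcoef(X, rowvar=False)
        pairs = []
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if abs(r[i, j]) >= threshold:
                    pairs.append((names[i], names[j], round(float(r[i, j]), 3)))
        return pairs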
It is not surprising that the features that indicate conserved solvation across multiple structures differ somewhat from those that identify likely solvation sites in a single structure. Since both conserved and non-conserved water molecules are present in at least one crystallographic structure, they might be expected to have similar values for the features associated with bound water molecules in a single crystallographic structure. Although the task appears to be more difficult than identifying water-binding sites, the GA/knn classifier obtained a set of features which further distinguishes those sites that are likely to be conserved across multiple structures. The features thus identified include B-value (BVAL), which appeared in 19 of the top 21 weight sets and was the highest-weighted feature in 15 of these, and mobility (MOB), appearing in 11 weight sets, always as the first or second most strongly weighted feature. Also selected in 9 weight sets, including the set which obtained the best overall accuracy in bootstrap testing, was the number of hydrogen bonds to protein atoms. The best performing weight set included these three features as well as the number of hydrogen bonds to other water molecules.

The best bootstrap accuracy obtained by the GA/knn was 64.2%, with a mean bootstrap balance (difference in predictive accuracy between conserved and non-conserved solvation sites) of 3.88%. This accuracy is a notable improvement over the near-random classification obtained by linear discriminant analysis. Further improvement in accuracy can be obtained by using the k-nearest-neighbors vote tally as a measure of classification confidence. Intuitively, one would expect that a site for which the knn vote was definitive (say, 38 votes for conserved and 1 vote for non-conserved) was more likely to be correctly classified than one for which the vote was more evenly split (for example, 20 votes for conserved and 19 for non-conserved), and this intuition was supported by knn voting results. A weighted knn classifier was used with the top weight set from Table 5.4 to classify as conserved or non-conserved all water-binding sites from the data set that were not selected as training or tuning sites for this GA run. A total of 3942 solvation sites were tested. Figure 5.4 shows the relationship between the number of votes in the majority category, m, and the predictive accuracy for all test sites where the number of votes cast in the majority was ≥ m. The correlation coefficient between m and cumulative predictive accuracy was r = 0.905. As demonstrated by the figure, the knn classifier exhibits a strong tendency toward more accurate classification of sites for which the voting consensus is more definitive. Greater predictive accuracy can thus be obtained by allowing "don't know" classifications when the number of majority votes does not exceed a specified cutoff value, and the prediction is thus less likely to be correct. Also shown in the figure is the number of test sites for which the number of majority votes meets or exceeds each value of m. For example, by classifying only sites where 43 or more votes were cast for the same class, we can classify 1344 of the 3942 test sites with a predictive accuracy of 70.46%. While some sites are eliminated from classification, those that remain are classified with greater accuracy.
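The vote-cutoff analysis behind Figure 5.4 can be expressed compactly. The Python sketch below assumes the per-site majority vote counts and a boolean correctness flag have already been collected from the weighted knn classifier; the function names are hypothetical.

    import numpy as np

    def accuracy_by_majority(majority_votes, correct, k):
        # For each possible majority size m, report how many sites reach at least m
        # majority votes and the cumulative predictive accuracy on those sites
        # (the two quantities plotted in Figure 5.4).
        rows = []
        for m in range(k // 2 + 1, k + 1):
            sel = majority_votes >= m
            if sel.any():
                rows.append((m, int(sel.sum()), float(correct[sel].mean())))
        return rows

    def predict_or_abstain(predictions, majority_votes, cutoff):
        # Keep the knn prediction only when the majority reaches the cutoff;
        # otherwise return a "don't know" (coded here as -1).
        return np.where(majority_votes >= cutoff, predictions, -1)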
Figure 5.4: Relationship between majority votes and accuracy. For each possible number of majority votes, m (x-axis), the solid line shows the cumulative predictive accuracy for all test sites for which m or more votes were cast in the majority. This accuracy value corresponds to the scale on the right y-axis. The outlined rectangles show the actual number of sites for which m or more votes were cast in the majority, corresponding to the scale on the left y-axis. Since the k-value for the weight set tested was 65, the number of majority votes ranges from 33 (the smallest possible majority) to 65.

5.4 Discussion

In distinguishing solvation sites from non-solvated sites near the protein surface, the classification accuracy of the GA/knn hybrid classifier is comparable to other contemporary methods for solvent-site prediction, but the GA/knn does not have a tendency to overpredict solvation, a feature common to other techniques. For the problem of distinguishing conserved from non-conserved water-binding sites, the accuracy of the GA/knn surpassed that of other classical classification techniques tested. In addition, the consistency in feature weighting exhibited for both problems provides some insight into the determinants of water binding and water conservation in crystallographic protein structures. In the case of predicting solvation sites, the determinants identified by the GA/knn experiments correspond well with other computational and experimental results.

Taken together, the results of both sets of experiments paint a general picture of the physical and chemical features related to water binding and conservation. From these results, we can envision a continuum of water-binding favorability based on various features. Local atomic hydrophilicity, atomic density, and the hydrogen-bonding potential of local protein atoms are good indicators of likely solvation sites. Of those sites that do bind water molecules, a low crystallographic temperature factor, high occupancy, and a large number of hydrogen bonds to neighboring protein atoms indicate that the water molecule is likely to be conserved between independently determined structures.

Bibliography

Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F., and Weng, J. (1987). Protein Data Bank, pages 107–132. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester.

Aeberhard, S., Coomans, D., and de Vel, O. (1992a). The classification performance of RDA. Technical Report 92-01, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.

Aeberhard, S., Coomans, D., and de Vel, O. (1992b). Comparison of classifiers in high dimensional settings. Technical Report 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.

Angeline, P. J. (1995). Adaptive and self-adaptive evolutionary computations. In Palaniswami, M., Attikiouzel, Y., Marks, R., Fogel, D., and Fukuda, T., editors, Computational Intelligence: A Dynamic Systems Perspective, pages 152–163. IEEE Press, Piscataway, NJ.

Bäck, T. and Schwefel, H.-P. (1993). An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1:1–23.

Bäck, T. and Schwefel, H.-P. (1996).
Evolutionary computation: An overview. In Proceedings of the Third IEEE Conference on Evolutionary Computation, pages 20–29, Piscataway, NJ. IEEE Press.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 53.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, Jr., E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol., 112:535–542.

Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove.

Bremermann, H., Rogson, M., and Salaff, S. (1966). Global properties of evolution processes. In Pattee, H., Edlsack, E., Fein, L., and Callahan, A., editors, Natural Automata and Useful Simulations, pages 3–41. Spartan Books, Washington, DC.

Cestnik, G., Konenenko, I., and Bratko, I. (1987). Assistant-86: A knowledge-elicitation tool for sophisticated users. In Bratko, I. and Lavrac, N., editors, Progress in Machine Learning, pages 31–45. Sigma Press.

Connolly, M. L. (1983). Solvent-accessible surfaces of proteins and nucleic acids. Science, 221:709–713.

Cover, T. M. and Campenhout, J. M. V. (1977). On the possible orderings in the measurement selection problem. IEEE Transactions on Systems, Man, and Cybernetics, 7:657–661.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13:21–27.

Craig, L., Sanschagrin, P. C., Rozek, A., Lackie, S., Kuhn, L. A., and Scott, J. K. (1998). The role of structure in antibody cross-reactivity between peptides and folded proteins. J. Mol. Biol., 281(1):183–201.

Crosby, J. (1967). Computers in the study of evolution. Sci. Prog. Oxf., 55:279–292.

De Jong, K. A. (1992). Genetic algorithms are NOT function optimizers. In Proceedings of the Second Workshop on Foundations of Genetic Algorithms, pages 5–18, Vail, Colorado.

Diaconis, P. and Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248.

Domingos, P. and Pazzani, M. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Saitta, L., editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 105–112, San Francisco, CA. Morgan Kaufmann.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist., 7:1–26.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. In CBMS-NSF Regional Conf. Series in Applied Mathematics, no. 38. SIAM.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179–188.

Fitzpatrick, P. A., Steinmetz, A. C. U., Ringe, D., and Klibanov, A. M. (1993). Enzyme crystal structure in a neat organic solvent. Proc. Natl. Acad. Sci. USA, 90:8653–8657.

Fogel, D., editor (1998). Evolutionary Computation: The Fossil Record. IEEE Press, New York, NY.

Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial Intelligence through Simulated Evolution. John Wiley, NY.

Fraser, A. (1957). Simulation of genetic systems by automatic digital computers. I. Introduction. Australian J.
Biological Sciences, 10:484–491.

Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, pages 750–753.

Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory, IT-18:431–433.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, San Mateo, CA.

Goodman, E. D. (1996). An introduction to GALOPPS, the genetic algorithm optimized for portability and parallelism. Technical Report 96-07-01, Michigan State University.

Gschwend, D. A. and Karplus, P. A. (1991). Bound water visibility in protein structure determination by x-ray crystallography. Unpublished senior thesis.

Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516.

Hobohm, U., Scharf, M., Schneider, R., and Sander, C. (1992). Selection of representative protein data sets. Protein Sci., 1:409–417.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.

Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1995). Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Transactions on Fuzzy Systems, 3(3):260–270.

Jain, A. K. and Chandrasekaran, B. (1982). Dimensionality and sample size considerations in pattern recognition in practice. In Krishnaiah, P. R. and Kanal, L. N., editors, Handbook of Statistics, volume 2, pages 835–855. North-Holland.

Jain, A. K., Dubes, R. C., and Chen, C. C. (1987). Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):628–633.

Jain, A. K. and Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):153–158.

Jaynes, E. T. (1968). Prior probabilities. In IEEE Transactions on Systems Science and Cybernetics, volume SSC-4, pages 227–241.

Karplus, P. A. and Faerman, C. H. (1994). Ordered water in macromolecular structure. Curr. Opin. Struct. Biol., 4:770–776.

Kelly, J. D. and Davis, L. (1991). Hybridizing the genetic algorithm and the k nearest neighbors classification algorithm. In Proceedings of the Fourth International Conference on Genetic Algorithms and their Applications, pages 377–383.

Koza, J. R. (1992). Genetic Programming: On the Programming of a Computer by Means of Natural Selection. MIT Press.

Kuhn, L. A., Siani, M. A., Pique, M. E., Fisher, C. L., Getzoff, E. D., and Tainer, J. A. (1992). The interdependence of protein surface topography and bound water molecules revealed by surface accessibility and fractal density measures. J. Mol. Biol., 228:13–22.

Kuhn, L. A., Swanson, C. A., Pique, M. E., Tainer, J. A., and Getzoff, E. D. (1995). Atomic and residue hydrophilicity in the context of folded protein structures. Proteins: Str. Funct. Genet., 23:536–547.

Ladbury, J. E. (1995). Counting the calories to stay in the groove. Structure, 3:635–639.

Levitt, M. and Park, B. H. (1993). Water: now you see it, now you don't. Structure, 1(4):223–226.

Mao, J., Mohiuddin, K., and Jain, A. K. (1994). Parsimonious network design and feature selection through node pruning. In Proc. of the Intl. Conf. on Pattern Recognition, pages 622–624, Jerusalem.

Marchand, A., Lente, F. V., and Galen, R. (1983). The assessment of laboratory tests in the diagnosis of acute appendicitis.
American Journal of Clinical Pathology, 80(3):369–374.

Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta., 405:442–451.

McRee, D. E. (1992). A visual protein crystallographic software system for X11/XView. J. Molecular Graphics, 10:44–46.

Mitchell, M. (1996). An Introduction to Genetic Algorithms. MIT Press.

Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26:917–922.

Nozaki, K., Ishibuchi, H., and Tanaka, H. (1996). Adaptive fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 4(3):238–250.

Palma, M. U., Palma-Vittorelli, M. B., and Parak, F., editors (1993). Water-Biomolecule Interactions, volume 43 of Italian Physical Society Conference Proceedings. Editrice Compositori, Bologna, Italy.

Parzen, E. (1962). On the estimation of a probability density function and the mode. Ann. Math Statistics, 33:1065–1076.

Pitt, W. R., Murray-Rust, J., and Goodfellow, J. M. (1993). AQUARIUS2: Knowledge-based modeling of solvent sites around proteins. J. Comp. Chem., 14(9):1007–1018.

Poornima, C. S. and Dean, P. M. (1995a). Hydration in drug design. 1. Multiple hydrogen-bonding features of water molecules in mediating protein-ligand interactions. Journal of Computer-Aided Molecular Design, 9:500–512.

Poornima, C. S. and Dean, P. M. (1995b). Hydration in drug design. 2. Influence of local site surface shape on water binding. Journal of Computer-Aided Molecular Design, 9:513–520.

Poornima, C. S. and Dean, P. M. (1995c). Hydration in drug design. 3. Conserved water molecules at the ligand-binding sites of homologous proteins. Journal of Computer-Aided Molecular Design, 9:521–531.

Pudil, P., Novovicova, J., and Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125.

Punch, W. F., Goodman, E. D., Pei, M., Chia-Shun, L., Hovland, P., and Enbody, R. (1993). Further research on feature selection and classification using genetic algorithms. In Proc. International Conference on Genetic Algorithms 93, pages 557–564.

Quinlan, J. R. (1986a). The effect of noise on concept learning. In Michalski, R., Carbonnell, J., and Mitchell, T., editors, Machine Learning: an Artificial Intelligence Approach, pages 149–166. Morgan Kaufmann.

Quinlan, J. R. (1986b). Induction of decision trees. Machine Learning, 1:81–106.

Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, pages 221–234.

Quinlan, J. R., Compton, P. J., Horn, K. A., and Lazurus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems, Sydney, Australia.

Raymer, M. L., Sanschagrin, P. C., Punch, W. F., Venkataraman, S., Goodman, E. D., and Kuhn, L. A. (1997). Predicting conserved water-mediated and polar ligand interactions in proteins using a k-nearest-neighbors genetic algorithm. J. Mol. Biol., 265:445–464.

Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Fromman-Holzboog, Stuttgart.

Reed, J., Toombs, R., and Barricelli, N. (1967). Simulation of biological evolution and machine learning: I. Selection of self-reproducing numeric patterns by data processing machines, effects of hereditary control, mutation type and crossing. J. Theoret. Biol., 17:319–342.

Ringe, D. (1995).
What makes a binding site a binding site? Curr. Opin. Struct. Biol., 5:825–829.

Sanschagrin, P. C. and Kuhn, L. A. (1998). Cluster analysis of consensus water sites in thrombin and trypsin shows conservation between serine proteases and contributions to ligand specificity. Protein Sci., 7(9).

Schnecke, V. and Kuhn, L. A. (2000). Virtual screening with solvation and ligand-induced complementarity. Perspectives in Drug Discovery and Design, 20:1–22.

Schraudolph, N. N. and Grefenstette, J. J. (1992). A user's guide to GAUCSD. Technical Report CS92-249, Computer Science Department, University of California, San Diego, CA.

Schwefel, H.-P. (1977). Numerische Optimierung von Computermodellen mittels der Evolutionsstrategie. Birkhäuser, Basel.

Siedlecki, W. and Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10:335–347.

Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10:262–266.

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care, pages 261–265. IEEE Computer Society Press.

Smith, R. E., Goldberg, D. E., and Earickson, J. A. (1991). SGA-C: A C-language implementation of a simple genetic algorithm. Technical Report 91002, The Clearinghouse for Genetic Algorithms, The University of Alabama, Department of Engineering Mechanics, Tuscaloosa, AL 35487.

Trunk, G. V. (1979). A problem of dimensionality: A simple example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:306–307.

Vafaie, H. and De Jong, K. (1998). Evolutionary feature space transformation. In Liu, H. and Motoda, H., editors, Feature Extraction, Construction and Selection: A Data Mining Perspective, chapter 19, pages 307–323. Kluwer Academic Publishers, Norwell, MA.

Vedani, A. and Huhta, D. W. (1991). An algorithm for the systematic solvation of proteins based on the directionality of hydrogen bonds. J. Am. Chem. Soc., 113:5860–5862.

Wade, R. C., Clark, K. J., and Goodford, P. J. (1993). Further developments of hydrogen bond functions for use in determining energetically favorable binding sites on molecules of known structure. J. Med. Chem., 36:140–147.

Weiss, S. and Kapouleas, I. (1989). An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In Sridharan, N. S., editor, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 781–787, Detroit, MI. Morgan Kaufmann.

Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.

Westhof, E., editor (1993). Water and Biological Macromolecules. Topics in Molecular and Structural Biology. CRC Press Inc., Boca Raton, FL.

Whitney, A. (1971). A direct method of nonparametric measurement selection. IEEE Transactions on Computers, 20:1100–1103.

Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2:408–421.

Yang, J. and Honavar, V. (1998). Feature subset selection using a genetic algorithm. In Liu, H.
and Motoda, H., editors, Feature Extraction, Construction and Selection: A Data Mining Perspective, chapter 8, pages 117–136. Kluwer Academic Publishers, Norwell, MA.

Yang, J., Parekh, R., and Honavar, V. (1998). DistAl: an inter-pattern distance-based constructive learning algorithm. In Proceedings of the International Joint Conference on Neural Networks, Anchorage, Alaska.

Zhang, X.-J. and Matthews, B. W. (1994). Conservation of solvent-binding sites in 10 crystal forms of T4 lysozyme. Protein Sci., 3:1031–1039.