This is to certify that the dissertation entitled ENHANCING PATTERN RECOGNITION USING EVOLUTIONARY COMPUTATION FOR FEATURE SELECTION AND EXTRACTION WITH APPLICATION TO THE BIOCHEMISTRY OF PROTEIN-WATER BINDING presented by Michael L. Raymer has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science and Engineering.

ENHANCING PATTERN RECOGNITION USING EVOLUTIONARY COMPUTATION FOR FEATURE SELECTION AND EXTRACTION WITH APPLICATION TO THE BIOCHEMISTRY OF PROTEIN-WATER BINDING

By Michael L. Raymer

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of DOCTOR OF PHILOSOPHY

Department of Computer Science and Engineering

August 23, 2000

ABSTRACT

ENHANCING PATTERN RECOGNITION USING EVOLUTIONARY COMPUTATION FOR FEATURE SELECTION AND EXTRACTION WITH APPLICATION TO THE BIOCHEMISTRY OF PROTEIN-WATER BINDING

By Michael L. Raymer

Statistical pattern recognition techniques classify objects in terms of a representative set of features. The selection and quality of the features representing each object have a considerable bearing on the success of subsequent pattern classification. Feature extraction is the process of deriving new features from the original features in order to reduce the cost of feature measurement, increase classifier efficiency, and allow higher classification accuracy. Many current feature extraction techniques involve linear transformations of the original features to produce new features. While useful for data visualization and increasing classification efficiency, these techniques do not necessarily reduce the number of features that must be measured, since each new feature may be a linear combination of some or all of the original features. Here a new approach is presented in which feature selection, feature extraction, and classifier training are performed simultaneously using evolutionary computing (EC). This method is tested in conjunction with a k-nearest-neighbors classifier, and shown to outperform other current methods for feature selection and extraction in terms of minimizing the number of features employed while maximizing classification accuracy. Two new classifiers based on the naive Bayes classifier are developed in conjunction with the EC feature selection and extraction technique, and the resulting hybrid classifiers are shown to yield further improvements in feature subset parsimony and classification accuracy.
A key advantage to the methods presented here is the ability to examine the set of linear feature weights produced by EC to perform data mining and exploratory data analysis. The EC feature selection and extraction technique is applied to an important and difficult problem in biochemistry—classification of potential protein-water binding sites. The resulting classifier is able to identify water-binding sites with ~68% accuracy, and identifies a set of physical and chemical features that correspond well with the results of other studies of protein-water binding.

© Copyright August 22, 2000 by Michael L. Raymer
All Rights Reserved

For my family, who never failed to believe in me, and especially for my father, who taught me the importance of learning.

ACKNOWLEDGMENTS

First and foremost, I would like to thank my advisors, Dr. Bill Punch and Dr. Leslie Kuhn, for their guidance, support, encouragement, and patience. It has been my good fortune to work with two such outstanding scientists and individuals. Thank you both for inspiring me to strive to excel, and for providing examples of excellence. I would also like to thank the other two members of my graduate committee, Dr. Erik Goodman and Dr. Rich Enbody, for their guidance and direction over the years, as well as their excellent suggestions for improvement of this dissertation.

I gratefully acknowledge the financial support of the National Science Foundation. Much of this work was supported by the NSF grant "Resolving Protein-Water Recognition".

In addition, I would like to thank my colleagues and friends from the MSU GARAGe, past and present, including Owen Matthews, Terry Dexter, Jason Greanya, Victor Miagkikh, Oleg Lukibanov, Iliana Martinez-Bermoudes, Lakshman Thiruvenkatachari, Michael Resendez and others. In particular, many thanks to Marilyn Wulfekuhler, Boris Gelfand, Arnold Patton, and Mike Boers for their ideas, constructive criticisms, moral support, and friendship.

Likewise, I am grateful to the many members of the MSU Protein Structural Analysis and Design lab, and especially Sridhar Venkataraman, Carrie Barkham, Vishal Thakkar, Rajesh Korde, Trevor Barkham, and Maria Zavodszky, for creating an enjoyable research environment, and for their many contributions to this work. A special thanks to Paul Sanschagrin, Volker Schnecke and Brandon Hespenheide for their many contributions and friendship over the years.

Thank you to my friends and colleagues at MSU who have provided technical, intellectual, and emotional support, as well as making East Lansing an enjoyable place to spend the past seven years: Travis Doom, Jen White, Mark Brehob, Katrina Brehob and Dave Paoletti.

Finally, I would like to extend my heartfelt gratitude to my family. Mom, Ed, Laura—there is nothing in my life that has provided me with more motivation, encouragement, and support than your belief in me. Above all, I thank my wife, Delia. Without your support, understanding, and encouragement I would never have come this far. I continue to aspire to become as remarkable a scientist and person as you are.

TABLE OF CONTENTS

LIST OF TABLES xi
LIST OF FIGURES xvi
1 Introduction 1
1.1 Overview—Contributions of the Thesis ................... 1
1.2 Motivation ................................... 2
1.2.1 Evolutionary Computation for Feature Selection and Extraction . . . .
2 1.2.2 Understanding Protein-Water Interactions ................ 6 2 Background and Synopsis of Literature 7 2.1 Pattern Classification ............................. 7 2.1.1 The Pattern Recognition Task ....................... 7 2.1.2 The Bayes Classifier ............................ 10 2.1.3 Nearest Neighbors Classification ...................... 12 2.2 Feature Selection and Extraction ...................... 16 2.2.1 Feature Costs and the Curse of Dimensionality ............. 16 2.2.2 Feature Selection .............................. 17 2.2.3 Feature Extraction ............................. 21 2.3 Evolutionary Computation .......................... 23 2.3.1 Genetic Algorithms ............................. 23 2.3.2 Evolution Strategies and Evolutionary Programming .......... 25 2.4 Evolutionary Computation in Feature Selection and Extraction ..... 26 viii 4.1.3 3 Feature Selection and Extraction using the K-Nearest-Neighbors Classifier in Combination with Evolution-based Learning 31 3.1 Methods .................................... 31 3.1.1 Branch and Bound Near-Neighbor Searching ............... 31 3.1.2 The Hybrid GA/knn Classifier ...................... 32 3.1.3 EP Optimization with the Knn Classifier ................. 42 3.1.4 Bayes and KNN Classifiers ......................... 44 3.1.5 Data Sets~ .................................. 46 3.1.6 Testing and Error Rate Estimation .................... 54 3.2 Results ..................................... 57 3.2.1 Classification of Artificial Data Sets ................... 57 3.2.2 Classification of Data from the UCI Repository ............. 61 3.2.3 Classification of Medical Data ....................... 65 3.2.4 Classification of Protein Solvation Sites .................. 69 3.3 Discussion ................................... 70 4 Variations on the Bayes Classifier using Evolutionary Computation- Based Learning 72 4.1 Methods .................................... 72 4.1.1 Bayesian Discriminant Functions ..................... 72 4.1.2 Nonlinear Weighting of the Bayes Discriminant Function ........ 77 4.1.3 A New Discriminant Function Based on Summation of the Class- Conditional Marginal Probabilities ............... 80 4.2 Results ..................................... 80 4.2.1 Artificial Data Sets ............................. 81 4.2.2 Classification of the UCI Data Sets .................... 84 4.2.3 Classification of Medical Data ....................... 87 4.3 Discussion ................................... 88 ix 5 Identifying the Determinants of Solvent Binding in Proteins 91 5.1 Introduction .................................. 91 5.1.1 An Overview of the Protein-Water Problem ............... 92 5.1.2 State of the Art in Predicting Protein Solvation ............. 94 5.2 Methods .............................. . ...... 102 5.2.1 Compilation of a Protein Solvation Database .............. 102 5.2.2 Generation of Non-Solvated Probe Sites ................. 106 5.2.3 Feature Identification and Measurement ................. 107 5.2.4 Distinguishing Solvation Sites from N on-Solvated Sites Near the Protein Surface ............................... 111 5.2.5 Distinguishing Conserved Solvation Sites from Sites not Conserved Be- tween Ligand-Bound and Unbound Protein Structures . . . . 112 5.3 Results ..................................... 113 5.3.1 Differentiation of Solvation Sites from N on—Solvated Sites ........ 113 5.3.2 Distinguishing Between Conserved and Non-Conserved Solvation Sites 116 / 5.4 Discussion ....... L ........................... 
120 BIBLIOGRAPHY 123 3.2 T3 3.3 R1 3.4 A] 3.1 3.2 3.3 3.4 LIST OF TABLES Typical values for the GA cost function coefficients ............. Typical values for the GA run parameters. ................. Run parameters for a typical EP/knn experiment. d is the number of parameters to be optimized (i.e. the dimensionality of the problem). . Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC/knn hybrid classifiers for artificially generated data sets consisting of independent, Gaussian distributed features. For the Bayes classifier, accuracy is Shown for re-classifying the training data (Train), and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set ('Dune) and the corresponding 1: value are reported. The best I: value was then evaluated by reclassifying the training data (Rain), and by classifyng a new, independent test set (Test). For the EC/knn hy- brid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The I: value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when clas- sifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). ............ xi 38 42 44 60 3.5 AP 3-6 Res 3.? Bal g -‘ rm _‘ ‘ - 3.5 Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC / knn hybrid classifiers for artificially gen- erated data sets consisting of features with a mixture of random distri- butions. For the Bayes classifier, accuracy is Shown for re-classifying the training data (Train), and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set (Tune) and the corresponding k value are reported. The best It value was then evaluated by reclassifying the training data (Train), and by classifying a new, independent test set (Test). For the EC/knn hy- brid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The k value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when clas- sifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). ............ 62 3.6 Results of the hybrid EC/knn classifiers and the GADistAI algorithm on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. 'Irain/ Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. Features is the number of features with nonzero weights in the best performing weight set for each run; the mean value over all 50 runs is Shown .......................... 
64 3.7 Balance in predictive accuracy among classes for various classifiers. The maximum accuracy, (Max), and the minimum accuracy, (Min), among the various classes are shown for each classifier, along with the difference between the two values (Bal). Lower values for Ba], indicat- ing a smaller difference between the minimum and maximum predictive accuracies, are preferred. Train / Test results refer to accuracies for re- classifying the training data for the Bayes and Knn classifier, and for reclassifying the EC tuning data set for the GA/knn and EP/knn clas- sifiers. Test refers to the accuracy when classifying an independent test set. The knn classifier was tested for odd values of k from 1 to 101, and the best results are shown .................... 66 xii 3.9 h 3.8 Results of various classifiers on hypothyroid data, as reported by Weiss, in comparison with that of the EC / knn hybrid classifiers. Train/tune refers to the accuracy Obtained when reclassifying the training data, in the case of Weiss’ results, or the tuning data, in the case of the EC/knn hybrid classifiers. Testing refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers. .................... 67 3.9 Results of various classifiers on the appendicitis data, as reported by Weiss, in comparison with that of the EC/knn hybrid classifiers. As in the previous table, rain / tune refers to the accuracy obtained when reclas- sifying the training data, in the case of Weiss’ results, or the tuning data, in the case of the EC / knn hybrid classifiers. Testing again refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers ........ 68 4.1 Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially- generated data sets consisting of feature values drawn from indepen- dent Gaussian distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing ini- tial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclassifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes), and the two EC/knn hybrid classifiers are reproduced here from Table 3.4 for comparison. .......... 82 4.2 Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially- generated data sets consisting of features with a mixture of random distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclas- sifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes), and the two EC/knn hybrid classifiers are repro- duced here from Table 3.5 for comparison. ............... 
83 xiii 4.3 Ba 4.4 Re 4.5 Ex 5'1 30 4.3 Balance in predictive accuracy among classes for the various EC/ hybrid classifiers on the data sets M6 and N5. Predictive accuracy on the independent test set is Shown for each class. Balance is defined as the difference in predictive accuracy between class 1 and class 2, as described in Section 3.1.2. ........................ 84 4.4 Results of the nonlinear-weighted discriminant function (Nonlinear) and the discriminant function based on summation of the marginal prob- abilities (Sum) on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. 'I‘rain/ Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. The number of features is the mean num- ber of features used in classification over all 50 runs. Performance for the EC / knn classifiers is repeated here from Table 3.6 for comparison. 86 4.5 Execution times (wall clock time) for 200 generations of GA optimization of the knn and nonlinear discriminant function classifiers. For each data set, the number of features (d), the number of classes (0), the combined training and tuning set size (n), and the mean execution time (hourszminuteszseconds) over 50 runs are shown. Each run was executed on a single 250MHz UltraSPARC-II cpu of a six-cpu Sun Ultra-Enterprise system with 768 MB of system RAM. Runs were ex- ecuted in sets of 5 with no other user processes present on the system. 87 4.6 Accuracy of various classifiers on the hypothyroid and appendicitis diag- nosis data sets. Results for the discriminant function classifiers are averaged over five GA experiments. Results for the GA/knn classi- fier represent the best of five experiments. Train/Time refers to the accuracy obtained in reclassifying the GA tuning set; Test refers to bootstrap accuracy over 100 bootstrap sets. .............. 88 5.1 30 protein pairs included in the database. ................. 104 5.2 The physical and chemical features used to represent protein-bound wa- ter molecules and protein surface Sites. BVAL and MOB were only measured for conserved and non-conserved molecules, as the concept of thermal mobility is not defined for non-solvated probe sites. . . . . 108 xiv 5.3 Bootstrap test results, k—values, and weight sets from the top six GA experiments for distinguishing crystallographically-observed solvation sites from non-sites. Bootstrap testing accuracy is given as a percent- age for observed solvation sites (Site), non-solvated sites (Non), and both classes together (Tot). Balance (Bal) is the average difference between solvated-site and non-solvated-site prediction accuracy (‘76) over all bootstrap runs. Features with no weights were masked by the GA; i.e. their feature weights were zero, and they did not participate in classification. The feature weights for all six features are normalized to sum to 1.00 for each experiment. ................... 114 5.4 Bootstrap results of the best 21 weight sets for identifying conserved sol- vation sites. Mean bootstrap testing accuracy is given as a percentage (Acc). Mean balance (Bal) is the average difference between con- served and non-conserved prediction accuracy (‘76) over all bootstrap runs. Features with no weights were masked by the GA; i.e. their fea- ture weights were zero, and they did not participate in classification. 
The feature weights for all eight features are normalized to sum to 1.00 for each experiment. ........................... 117 XV 2.3 ‘24 3.1 3.3 3.4 A ge A il Am An £36 A ti 2.1 2.2 2.3 2.4 2.5 3.1 3.2 3.3 3.4 3.5 LIST OF FIGURES A general model for classical supervised pattern classification systems. A three class, 5-nearest-neighbors classifier. The training samples are plotted in a two-dimensional feature space. Three of the five nearest neighbors of the test sample are of class 2, so the test sample will be classified as belonging to class 2 ...................... A general model of a simple Genetic Algorithm. .............. A d-dimensional binary vector, comprising a Single member of the GA population for GA-based feature selection. ............... Effect of scaling feature axes on k-nearest-neighbor classification. (a) orig- inal data; (b) scaled data. Extension of the scale of the horizontal axis increases the distance between patterns which differ in feature 1, allow- ing the knn to discriminate more finely along this dimension. Here, the prediction of the unknown changes from class 2 to class 3 as a result of scaling. ................................. A GA-based feature extractor using an objective function based on clas- sification accuracy. Each transformation matrix from the GA is used to transform the input patterns, which are then passed to a knn classi- fier. The fitness of a particular transformation matrix is based on the classification accuracy of the knn using the transformed patterns. An example of a GA/knn chromosome with masking for a 4-dimensional feature set .................................. An example of an EP/GA hybrid chromosome with masking for a 4- dimensional feature set ........................... Effects of Gaussian smoothing on the computation of effective marginal probabilities. Assuming that the current feature value falls in the cen- ter bin (black rectangle), and assuming a = 2, then the two surround- ing bins on either side (grey rectangles) also contribute to the effective marginal probability for the current feature. .............. A two-dimensional projection of the data sets 010,1 (a), and 010,2 (b) onto the first two feature axes. 010,2 exhibits much more overlap between classes than 310,1 .............................. xvi 13 24 27 28 33 40 43 46 51 3.6 Div 4.1 The 4.2 An 5.1 A F 3.6 4.1 4.2 5.1 5.2 Division of training, testing, and tuning data during a single EC exper- iment (a), and error rate estimation of the tuned classifier (b). The cost score in (a) is based on the performance of the knn on the tuning set for a particular weight set and k value. The same knn training data is used for the entire EC experiment and during external testing. . . . The Bayes decision rule is invariant to linear transformations of the feature space. For the feature shown here, the raw feature values (a) have been multiplied by 10 in (b). Using a nonparametric Bayes classifier, we find that the original feature value falls in the bin 14—16 (black rectangle) in the original histogram. The scaled feature falls in the equivalent bin of histogram b, and the histogram values (marginal probabilities) of the two bins are identical, so the scaling has no bearing on the classification results .................................... An example of the EC chromosome for optimization of the nonlinear dis- criminant coefficients. A four-dimensional problem is shown. 
Each coefficient, Ci, in the discriminant function is determined by the chro- mosome weight, W,, and the masking field, Mi. ............ A particular sequence of amino acids will consistently fold into the same three-dimensional, “globular” structure. Each of the 20 common amino acids is identified by a one-letter code. The linear protein chain (left) folds into its globular structure (right) after synthesis. Water molecules, shown here as spheres, bind to the surface of the protein via hydrogen bonds ............................... Conserved and non-conserved water molecules plotted according to the first two principal components of the eight-feature water binding data set. Eight hundred randomly-selected water molecules from the data set are plotted. Non-conserved water molecules are plotted as 1’s, conserved water molecules are plotted as 2’s. The first principal com- ponent is composed primarily of the features HBDP, ADN, and AHP (with coefficients 0.63, 0.52, and 0.38, respectively). The second prin- cipal component is composed primarily of the three features based on temperature factor: ABVAL, BVAL, and NBVAL. (The respective co- efficients are —0.54, —0.53, and —0.44.) There is no clear separation between the two classes in this feature space ............... 5.3 Average normalized weight of each feature in the top fifteen weight sets for distinguishing crystallographically— observed solvation sites from non-solvation sites. ............................ xvii 58 74 79 92 110 5.4 Relat bi pi cz 5.4 Relationship between majority votes and accuracy. For each possible num- ber of majority votes, m, (x-axis) the solid line shows the cumulative predictive accuracy for all test sites for which m or more votes were cast in the majority. This accuracy value corresponds to the scale on the right y-axis. The outlined rectangles Show the actual number of sites for which m or more votes were cast in the majority, correspond- ing to the scale on the left y-axis. Since the k-value for the weight set tested was 65, the number of majority votes ranges from 33 (the smallest possible majority) to 65. .................... 121 xviii (Iha Intr 1.1 This the: tem tea for Simui describe, feature 5 Classifi(.a. Slfier are mahOdS. and We ProdUCe c it is ShOW' data mini Chapter 1 Introduction 1.1 Overview—Contributions of the Thesis This thesis describes several contributions to the state of the art of computational pat- tern recognition, evolutionary computation (EC), and biochemistry. A novel method for Simultaneous feature selection and extraction using evolutionary computation is described. This method compares favorably with the best current algorithms for feature subset selection in terms of minimizing the number of features selected and classification accuracy. In addition, several new classifiers based on the Bayes clas- sifier are developed and tested alongside the k-nearest—neighbors classifier and other methods, and are Shown to perform well in conjunction with the EC feature selection and extraction methods developed here. The ability of the hybrid EC-classifiers to produce consistent feature subsets across various experiments is demonstrated, and it is shown that this capability can be useful in the analysis of classifier results for data mining and analysis. Finally, application of these methods to the biochemical 1 problem I that can curacy of predictior not exhib Analysis I the deten sifiers des incorpor' Kuhn. ‘20 software ture data 1.2 1.2.1 The held many ar, employed Mitten. 
, 1n medic-i problem of understanding protein-water interactions demonstrates two new classifiers that can predict water binding and conservation among protein structures. The ac- curacy of these classifiers is shown to compare well with other current methods for prediction of protein solvation. Furthermore, the new classifiers developed here do not exhibit a tendency to overpredict solvation, as is common among other methods. Analysis of the experimental results of these classifiers has provided new insights into the determinants of water binding to proteins. The weighted k—nearest-neighbor clas- sifiers described here, with feature weights determined by GA optimization, has been incorporated into a new algorithm for identifying promising drug leads (Schnecke and Kuhn, 2000), and is currently being evaluated for incorporation into a well-known software package, XtalView, for identifying water binding sites during protein struc- ture determination (McRee, 1992). 1.2 Motivation 1.2.1 Evolutionary Computation for Feature Selection and Extraction The field of computational pattern recognition has produced useful applications in many areas of science and engineering. Pattern recognition techniques have been employed with great success, enabling computer systems to recognize typed, hand- written, and spoken language, to analyze satellite imagery, to aid in decision making in medicine and finance, and to perform numerous other difficult classification tasks. A broad works, at problems and techr nary goa system, f disease, i one of 1f perf0rm ' ple objec training 5 as a list I features ; OCR sys "1 a Chara ln des in adi'anc reduildan ing featu diffiCulty, Featu beSt A broad range of methods, including statistical classifiers, decision trees, neural net- works, and fuzzy logic have been applied to pattern recognition and classification problems. (Duda and Hart, 1973). Yet despite this wide variation in problem areas and techniques, several features are common to most classification systems. The pri- mary goal of a classifier is to categorize objects or concepts into groups. A diagnosis system, for example, might classify patients as “at—risk” or “not-at-risk” for a certain disease, while OCR systems typically classify typewritten characters as belonging to one of 100 or so groups, one for each possible printed ASCII character. In order to perform this categorization, a classifier is typically trained by providing it with sam- ple objects (or concepts) for which the correct classification is already known. These training samples, as well as the objects to be classified later, are generally represented as a list of features. For a diagnosis system, a patient might be represented by such features as height, weight, age, sex, and the results of various clinical tests. For an OCR system, features might include the length of the longest straight line segment in a character, or the presence or absence of a closed loop. In designing a classification system for a particular task, it is often not known in advance which features will prove useful for classification, and which will contain redundant or even misleading information. Several areas of investigation, includ- ing feature subset selection and feature extraction, have come about to address this difficulty. Feature selection is the problem of finding the subset of all the available features that best satisfies some classification-related criteria. 
Usually the goal is to find a small I imposs sets gr that In inrestig to find Feat original cars fro of the a‘ mass) i feat IECOgnit tides a j e(‘lUlee tallonal of lean" the i”Du II] ad C0nlllnetj fiers to b Ilends in small subset of features that provides good classification accuracy. It is generally impossible to exhaustively search all feature subsets, since the number of such sub- sets grows rapidly as d increases. For a fixed subset size, n, the number of subsets that must be searched is (3). When n is not fixed, there are 2" possible subsets to investigate. AS a result, heuristic and suboptimal search methods are often employed to find a feature subset that performs well, but cannot be guaranteed to be optimal. Feature extraction goes a step further and computes new features based on the original feature set. For example, if a classifier were attempting to distinguish sports cars from other vehicle types, and engine horsepower and total vehicle mass were two of the available features, a feature extraction method might find that (horsepower+ mass) was a useful new feature, computed from the original features. Feature selection and extraction can provide a number of benefits to a pattern recognition system. Limiting the number of features to be considered generally pro- vides a reduction in the overall cost of a classifier by reducing the time, effort, and equipment involved in feature measurement and processing. Furthermore, the compu- tational efficiency of many classification methods is improved by lowering the number of features to consider. Even the accuracy of a classifier can be improved by limiting the input data to a minimal set of salient features. In addition to these direct benefits, the use of feature selection and extraction in conjunction with traditional pattern recognition can allow many traditional classi- fiers to be employed for data mining—the identification and analysis of interesting trends in large data sets. By identifying the particular features that are useful for sepa ting; in u: lers: data “it theg owir alla as I hon. argt l its (1 ei'ol' ion; of n ma tech ,qut»; have Deer; Olpi tbler 1996 f0g 21180 thin C1355 lers the Flies The i’sul NCO; llllt’, mill ids separating natural classes in large data sets, feature selection and extraction can aid in understanding the interrelationships between the features and the classes in the data. With the low cost and relatively high Speed of modern magnetic storage, and the growing use of computational techniques for collection and analysis of data from all areas of science, engineering, and industry, the idea of mining interesting data from large data sets has become a topic of much experimentation and study recently. This dissertation describes a feature selection and extraction method based on evolutionary computing, a parameter optimization method based on the dynamics of natural selection and the Darwinian model of evolution. The two evolutionary techniques employed, genetic algorithms (GAS) and evolutionary programming (EP), have been shown to be effective search and optimization methods for a broad array of problems (Fogel, 1998; Holland, 1975; Goldberg, 1989; Back and Schwefel, 1993, 1996; Fogel et al., 1966; Koza, 1992). 
The EC-based feature selection and extraction algorithms developed here are utilized to optimize the performance of several different classifiers, including traditional algorithms such as nearest-neighbor classification and the Bayes classifier, and several novel methods derived from the Bayes classifier. The resulting hybrid algorithms are shown to outperform various traditional pattern recognition techniques alone and in conjunction with commonly-used feature selection methods on a variety of artificial and real-world data sets. In addition, the data- mining capabilities of the EC-hybrid classifiers are explored in the analysis of an important problem from biochemistry—understanding the features determining the sites of water binding at protein surfaces. 1.2.2 Extensive portant a Much of 1 their stru drug desi tion of pr play an i molecules design, p and func protein 0 The h Structure Sites and emacti0 “Hills of Ilit? inter 1.2.2 Understanding Protein-Water Interactions Extensive application of the EC-hybrid classifiers was made in the study of an im- portant and difficult problem in biochemistry—understanding protein—water binding. Much of the current research in biochemistry is targeted at understanding proteins— their structure, function, and interactions with other molecules. Most pharmaceutical drug design efforts are directed at producing drugs that bind to and inhibit the func- tion of proteins. It is well known that the water molecules that surround most proteins play an important role in protein folding, structural stability, and binding to other molecules, such as drugs. Unfortunately, correct treatment of water molecules in drug design, protein folding studies, and many other areas of study in protein structure and function has proven to be a difficult task, because the state of hydration of the protein changes during these processes. The hybrid EC-classifiers described here have been applied to a database of protein structures to produce a classifier that is capable of predicting protein water-binding sites and their conservation upon binding other molecules. The feature selection and extraction capabilities of this classifier have been employed to analyze the determi- nants of water binding, which has contributed to a more complete understanding of the interactions between proteins, the molecules they bind, and water molecules. Che Bac 2.1 2.]..1 llodern pre‘lous sunulatp broadly Pattern OUI ime} b." anal. process: l5 Such. benefits Chapter 2 Background and Synopsis of Literature 2.1 Pattern Classification 2.1.1 The Pattern Recognition Task Modern advances in computer hardware and software technology have allowed many previously intractable problems from various fields of scientific inquiry to be analyzed, simulated, or otherwise addressed using computational techniques. One of the most broadly applicable and potentially profitable of these techniques is computational pattern recognition. For humans, the ability to identify patterns is vital to many of our intellectual faculties, including: our ability to abstract from examples, to reason by analogy, and to learn by induction. Yet the actual mechanisms by which these processes are effected remains a subject for debate and empirical experimentation. 
AS such, advances in the realm of computational pattern recognition bear potential benefits in many areas, including improvement of data analysis methods for the phys- 7 halsd unders fig/1H9 / ical sciences, advancement of machine perception and vision, and eventually further understanding of human perception and cognition. While the large quantity and variety of pattern recognition algorithms and tech- niques prevent the construction of a universal model for all pattern recognition tasks, it is possible to consider a simple model which captures the essential features of many classification systems. Figure 2.1 illustrates a model of a generic pattern recognition system. Raw Feature {> Data - Values F Classifier ClaSSIficanon > Subject " Pattern Classes Supervisor Figure 2.1: A general model for classical supervised pattern classification systems. In this model, the object of classification may be a physical object or a concept. The task is to assign the object to one of several prespecified categories. Initially, the Object is observed, yielding a collection of unprocessed, raw data. The nature of this observation can vary greatly according to the type of object being classified. For physical objects, cameras, sensors, or other transducers are often used to record various physical and visual properties of the object. For classification of concepts, fi ‘observation” may be more abstract. For a medical diagnosis system, for example, aver mm min sflfle 0f it: isofi £025 ing s olth, unur, the d and ti not b the Us the data is Often a combination of clinical test results and a description of symptoms provided by the patient. While it is possible to use all of the unprocessed data to classify the object, it is usually beneficial to reduce this raw data to a set Of features or properties related to the classification categories. Visual information collected with a camera, for example, might be summarized into a set of features describing the average color, general shape, and approximate size of the classification object. In general, the goal of feature selection is to reduce the input data to a parsi- monious and salient representation for the classification subject: a vector of feature values. This feature vector then serves as the input for the classifier, which is respon- sible for determining which category each subject belongs to according to the values of its features. The feature vector associated with a particular classification subject is often called a pattern, and the category associated with the subject may be referred to as that pattern’s class. Many classical pattern recognition methods are operated in two stages. The train- ing stage is distinguished by the presence of a supervisory agent that has knowledge of the true class for each pattern. During training, the supervisor provides this class information to the classifier as each pattern is observed, allowing the classifier to learn the distinctions among classes. During the testing stage, the class input is removed, and the performance of the classifier is observed. The training and testing stages need not be completely disjoint. In some handwriting recognition systems, for example, the user can correct misclassified letters while the system is in normal Operation. 2.1 C0n the funt Pl; info best any prol obtz turn one mat the 2.1.2 The Bayes Classifier Consider the task of assigning a sample to one of C classes, {w1,w2, ...wc}, based on the d-dimensional Observed feature vector 55'. 
Let p(x̄|ω_i) be the probability density function for the feature vector, x̄, when the true class of the sample is ω_i. Also, let P(ω_i) be the relative frequency of occurrence of class ω_i in the samples. If no feature information is available, the probability that a new sample will be of class ω_i is P(ω_i)—this probability is referred to as the a priori or prior probability, and is the best estimate that a sample will belong to a particular class prior to observation of any feature values. Once the feature values are obtained, we can combine the prior probability with the class-conditional probability for the feature vector, p(x̄|ω_i), to obtain the a posteriori probability that a pattern belongs to a particular class. This combination is done using Bayes rule (Bayes, 1763):

P(\omega_j \mid \bar{x}) = \frac{p(\bar{x} \mid \omega_j) P(\omega_j)}{\sum_{i=1}^{C} p(\bar{x} \mid \omega_i) P(\omega_i)}     (2.1)

Once the posterior probability is obtained for each class, classification is a simple matter of assigning the pattern to the class with the highest posterior probability—the resulting decision rule is the Bayes decision rule: given x̄, decide ω_i if

P(\omega_i \mid \bar{x}) > P(\omega_j \mid \bar{x}) \quad \forall j \neq i

When the class-conditional probability density for the feature vector and the prior probabilities for each class are known, the Bayes classifier can be shown to be optimal in the sense that no other decision rule will yield a lower error rate (Duda and Hart, 1973, pp. 10-17). Of course, these probability distributions (both a priori and a posteriori) are rarely known during classifier design, and must instead be estimated from training data. Class-conditional probabilities for the feature values can be estimated from the training data using either a parametric or a non-parametric approach. A parametric method assumes that the feature values follow a particular probability distribution for each class and estimates the parameters of the distribution from the training data. For example, a common parametric method is to assume a Gaussian distribution of the feature values, and then estimate the parameters μ_i and σ_i for each class, ω_i, from the training data. The parametric method can simplify the problem of estimating the class-conditional feature distributions, but can also result in poor classification if the assumed probability distribution does not fit the actual distribution of feature values. A non-parametric approach usually involves construction of a histogram from the training data to approximate the class-conditional distribution of the feature values. Several decisions in the construction of this histogram, such as the number of bins and the size of each, are critical in obtaining good classification performance. The lack of a standard procedure for constructing a distribution histogram is a primary weakness in the non-parametric sampling approach.

Once the distribution of the feature values has been approximated for each class, the question remains how to combine the individual class-conditional probability density functions for each feature, p(x_1|ω_i), p(x_2|ω_i), ..., p(x_d|ω_i), to determine the probability density function for the entire feature vector, p(x̄|ω_i).
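Before turning to how the individual per-feature densities are combined, the decision rule built from Equation 2.1 can be made concrete with a small sketch. This is a minimal illustration only: the two Gaussian class-conditional densities, the prior values, and all function names below are hypothetical choices for the example, not part of the methods developed in this thesis.

    # Minimal sketch of the Bayes decision rule (Equation 2.1).
    # Assumes the class-conditional densities and priors are already known;
    # here they are illustrated with two Gaussian classes on a single feature.

    import math

    def gaussian_pdf(x, mean, std):
        """Class-conditional density p(x | w_i) under a Gaussian assumption."""
        return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

    # Hypothetical two-class problem: density parameters and priors P(w_i).
    class_params = {1: (0.0, 1.0), 2: (2.0, 1.5)}   # (mean, std) for each class
    priors       = {1: 0.7, 2: 0.3}

    def posteriors(x):
        """Bayes rule: P(w_j | x) = p(x | w_j) P(w_j) / sum_i p(x | w_i) P(w_i)."""
        joint = {c: gaussian_pdf(x, m, s) * priors[c] for c, (m, s) in class_params.items()}
        evidence = sum(joint.values())
        return {c: joint[c] / evidence for c in joint}

    def bayes_decision(x):
        """Decide w_i when P(w_i | x) exceeds the posterior of every other class."""
        post = posteriors(x)
        return max(post, key=post.get)

    if __name__ == "__main__":
        for x in (-1.0, 1.0, 3.0):
            print(x, posteriors(x), "->", bayes_decision(x))

The sketch assumes the densities are given in closed form; the following paragraphs address the more realistic setting in which they must be assembled from estimated per-feature densities.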
A common method is to assume that the feature values are statistically independent:

p(\bar{x} \mid \omega_i) = p(x_1 \mid \omega_i) \times p(x_2 \mid \omega_i) \times \cdots \times p(x_d \mid \omega_i)     (2.2)

The resulting classifier, often called the naive Bayes classifier, has been shown to perform well on a variety of data sets, even when the independence assumption is not strictly satisfied (Domingos and Pazzani, 1996). The selection of the prior probabilities for the various categories has been the subject of a substantial body of literature (Jaynes, 1968). One of the most common methods is to simply estimate the relative frequency for each class from the training data and use these values for the prior probabilities. An alternate method is to simply assume equal prior probabilities for all categories by setting P(ω_i) = 1/C, i = 1...C.

2.1.3 Nearest Neighbors Classification

Estimation of the class-conditional distribution of the feature values in Bayesian classification is a potential source of classification error. An alternative approach is to design a classifier that does not require an estimate of the distribution of the feature values, but instead memorizes the feature values for every training sample. In this way, no training information is lost in parameter estimation. One such classifier is the k-nearest-neighbors (knn) classifier (Cover and Hart, 1967). The operation of the knn classifier during training is quite simple—the feature values for each training sample are stored in their entirety. To classify a new testing sample, the training samples are considered in a d-dimensional space, where each of the orthogonal coordinate axes corresponds to one classification feature. The test sample is plotted into this feature space, and the k nearest neighbors, for some small constant k, are found. Figure 2.2 demonstrates the behavior of a 3-class, 5-nearest-neighbors classifier where d = 2.

Figure 2.2: A three class, 5-nearest-neighbors classifier. The training samples are plotted in a two-dimensional feature space. Three of the five nearest neighbors of the test sample are of class 2, so the test sample will be classified as belonging to class 2.

Any distance metric can be used to calculate the neighbors—Euclidean distance and Mahalanobis distance¹ are the two metrics most commonly employed. The classes of these k neighbors are tallied, and the most commonly represented class becomes the predicted class for the test sample. Several tie-breaking schemes are possible when a single majority class is not present among the near neighbors. The most common, and the method used here, is to choose the class of the closest neighbor. Other methods include scaling each vote by the distance from the current test sample, choosing the most common class from the training data, and choosing the class with the greatest misclassification cost. Feature values are usually normalized prior to computing the distances to prevent the scale of each feature from biasing the prediction.

¹Euclidean distance is defined as d^2 = (\bar{x}_2 - \bar{x}_1)^T (\bar{x}_2 - \bar{x}_1). Mahalanobis distance is defined as d^2 = (\bar{x}_2 - \bar{x}_1)^T \Sigma^{-1} (\bar{x}_2 - \bar{x}_1), where \Sigma is the d × d covariance matrix of the d-dimensional training data.
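The two distance metrics defined in the footnote, and the feature normalization mentioned above, can be sketched as follows. This is an illustrative fragment under stated assumptions: the array layout (rows as training patterns, columns as features) and all function names are choices made for the example, not a specification taken from the text.

    # Sketch of the two distance metrics used for near-neighbor search
    # (Euclidean and Mahalanobis; see footnote), plus the per-feature
    # normalization applied before distances are computed.

    import numpy as np

    def normalize(train, test):
        """Scale each feature to zero mean and unit variance using training statistics."""
        mu, sigma = train.mean(axis=0), train.std(axis=0)
        sigma[sigma == 0.0] = 1.0            # guard against constant features
        return (train - mu) / sigma, (test - mu) / sigma

    def euclidean_sq(x1, x2):
        """d^2 = (x2 - x1)^T (x2 - x1)."""
        diff = x2 - x1
        return float(diff @ diff)

    def mahalanobis_sq(x1, x2, cov_inv):
        """d^2 = (x2 - x1)^T Sigma^{-1} (x2 - x1), with Sigma estimated from training data."""
        diff = x2 - x1
        return float(diff @ cov_inv @ diff)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        train = rng.normal(size=(20, 3))     # 20 training patterns, d = 3 features
        test = rng.normal(size=(1, 3))
        train_n, test_n = normalize(train, test)
        cov_inv = np.linalg.inv(np.cov(train_n, rowvar=False))
        print(euclidean_sq(train_n[0], test_n[0]))
        print(mahalanobis_sq(train_n[0], test_n[0], cov_inv))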
It can be shown that the asymptotic error probability (that is, the error rate as the number of training samples approaches infinity) of the nearest neighbor classification rule is bounded by twice the Bayes error probability (Duda and Hart, 1973, p. 101); i.e.

\lim_{n \to \infty} P_e(\mathrm{Bayes}) \;\le\; \lim_{n \to \infty} P_e(\mathrm{NN}) \;\le\; 2 \lim_{n \to \infty} P_e(\mathrm{Bayes})     (2.3)

A drawback to the knn classifier is the computational expense of computing the distance between the testing sample and each of the training samples. A branch-and-bound approach to near-neighbor search (Fukunaga and Narendra, 1975) can significantly reduce this overhead for large training sets.

Many variants of the knn classifier have been proposed and explored. The condensed nearest neighbor rule (Hart, 1968) and the reduced nearest neighbor rule (Gates, 1972) seek to reduce the number of training patterns that must be stored and tested for each near-neighbor search, while maintaining the same decision boundaries as the nearest neighbor rule. The reduced nearest neighbor rule, in particular, expends considerable computation effort to find the minimal consistent subset of the training patterns in order to reduce the computation time needed later to find near neighbors during on-line classification. The edited nearest neighbor rule (Wilson, 1972) smoothes the decision surface by removing "outlier" training samples—that is, samples belonging to a certain class that fall in an area of the feature space dominated by another class. The resulting decision rule is less sensitive than the nearest neighbor rule to erroneous feature measurement and anomalous training data.

For values of k larger than 1, a measure of classification confidence can be obtained from the number of nearest neighbors belonging to the majority class. This measure of confidence can be used to define a reject option for the k nearest neighbor rule, which can result in a reduction in the error rate for those samples that are not rejected. The use of majority votes as a confidence measure is discussed further in Chapter 3.

The well-known statistical classifiers can be partitioned into parametric and non-parametric methods. As described previously, the Bayes classifier can be implemented using either approach. The parametric methods assume that the form of the class-conditional density function of the features is known in advance. A common assumption is that the feature values follow a multivariate Gaussian distribution. The training data are then used to estimate the parameters of these class-conditional densities (e.g. the mean vector, μ, and the covariance matrix, Σ, for the Gaussian distribution) and these estimated densities are then used to classify new patterns. The non-parametric methods, including Parzen window density estimation (Parzen, 1962) and the nearest-neighbor methods (Cover and Hart, 1967), make no assumptions about the class-conditional distribution of feature values. Rather, the form of the density function is either fitted to the training data, as in the Parzen window method, or dealt with implicitly, as in the nearest neighbor rule.
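A minimal k-nearest-neighbors classifier of the kind described in Section 2.1.3 is sketched below, using Euclidean distance, a majority vote over the k neighbors, and the closest-neighbor tie-breaking rule adopted in the text. It is an illustration under those assumptions rather than the implementation used in later chapters, and the small data set at the end is invented for the example.

    # Sketch of the k-nearest-neighbors decision rule of Section 2.1.3:
    # store the training patterns, find the k nearest under Euclidean distance,
    # take a majority vote, and break ties with the class of the closest neighbor.

    from collections import Counter

    def knn_classify(train_x, train_y, query, k=5):
        # Squared Euclidean distance from the query to every training pattern.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, query)), y)
            for x, y in zip(train_x, train_y)
        )
        neighbors = dists[:k]
        votes = Counter(label for _, label in neighbors)
        best = max(votes.values())
        tied = {label for label, count in votes.items() if count == best}
        if len(tied) == 1:
            return tied.pop()
        # Tie: return the class of the closest neighbor among the tied classes.
        for _, label in neighbors:
            if label in tied:
                return label

    if __name__ == "__main__":
        # Tiny two-feature, two-class example.
        train_x = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.1), (0.9, 1.0), (1.2, 0.8)]
        train_y = [1, 1, 2, 2, 2]
        print(knn_classify(train_x, train_y, (0.95, 0.9), k=3))   # expected: class 2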
2.2 Feature Selection and Extraction

2.2.1 Feature Costs and the Curse of Dimensionality

The selection of features to measure and include in the feature vector can have a profound impact on the accuracy of the resulting classifier, regardless of what specific classification rule is implemented. A common approach is to have human experts describe as many features as possible that are readily measurable and likely to be related to the classification categories. Unfortunately, there are several disadvantages to evaluating a profuse number of features in classification. First, each additional feature to be considered often incurs an additional cost in terms of measurement time, equipment costs, and storage space. In addition, the computational complexity of classification grows with each additional feature. For some classifiers the cost of each additional feature in computational complexity can be significant. Nearest-neighbor classification, for example, has a complexity that is O(n²), where n is the combined size of the training and testing data sets. In addition, the inclusion of spurious features (features unrelated to the classification categories) is likely to reduce classification accuracy. In fact, it is sometimes the case that the inclusion of features that do, in fact, contain information relevant to classification can result in reduced accuracy when the number of training samples is fixed. This phenomenon is sometimes referred to as the curse of dimensionality (Jain and Chandrasekaran, 1982).

At the root of the problem of high dimensionality lies the fact that in real-world classifiers the class-conditional densities are not known, and must be estimated from a finite number of training samples. This fact is easily seen when considering a classification task involving two classes, where the feature vector for each class follows a multivariate Gaussian distribution. As each new feature, x_i, is added to the pattern vector, additional information is made available for classification, as long as the mean of the feature, μ_i, is not identical to that of some other feature. However, in addition to providing new information, the inclusion of the new feature introduces additional parameters which must be estimated to characterize the class-conditional distribution of the feature vector. In the case of a Gaussian distribution, an additional mean value is added to μ and the covariance matrix Σ changes from a d × d matrix to a (d + 1) × (d + 1) matrix. Since the covariance matrix is symmetric, the number of parameters which must be estimated increases by (d + 2) for each additional feature. Clearly, the rapid growth in the number of parameters to be estimated has the potential to deteriorate the value of additional features to the classifier, even when the new features contain additional information. This effect was illustrated for a specific two-class problem with Gaussian distribution of feature values by Trunk (1979).

2.2.2 Feature Selection

A number of techniques have been developed to address the problem of dimensionality, including feature selection and feature extraction. The main purpose of feature selection is to reduce the number of features used in classification while maintaining an acceptable classification accuracy. Less discriminatory features are eliminated,
leaving a subset of the original features which retains sufficient information to discrim- inate well among classes. For most problems the brute-force approach is prohibitively expensive in terms of computation time. Cover and Campenhout (1977) have shown that to find an Optimal subset of size n from the original d features, it is necessary to evaluate all (3) possible subsets when the statistical dependencies among the features are not known. Furthermore, when the size of the feature subset is not specified in advance, each of the (2“) subsets of the original d features must be evaluated. In the special case where the addition of a new feature always improves performance, it is possible to significantly reduce the number of subsets that must be evaluated using a branch and bound search technique (Narendra and Fukunaga, 1977). Unfor- tunately, this sort of monotonic decrease in the error rate as new features are added is often not found in real-world classification problems due to the effects of the curse of dimensionality and finite training sample sizes. Various heuristic methods have been prOposed to search for near-Optimal feature subsets. Sequential methods involve the addition or removal of a single feature at each step. Sequential forward selection (Whitney, 1971) begins with the single fea- ture which provides the best classification performance. At each iteration, the feature which provides the best performance in combination with the current subset of fea- tures is added, until a subset of the desired size is generated. Once a feature is added, it cannot be removed. Sequential backwards selection operates similarly, but begins with all features included, removing a Single feature at each iteration. “Plus I — take away 1'” selection combines these two methods by alternately enlarging and reducing 18 the feature subset repeatedly. The sequential floating forward selection algorithm (SFFS) Of Pudil et at. (1994) is a further generalization of the plus I, take away r methods, where l and r are not fixed, but rather are allowed to “float” to approximate the Optimal solution as much as possible. The algorithm basically executes as follows: 0 Let X), = 131...:Ck be a set of It features 0 Let J () be a criterion function (e.g. classification accuracy) 0 Let the significance of the feature 27,- in the set Xk, be Sk_1(xj) = J (Xk) '- J(Xk — (Ej) 0 Let the significance Of the feature :12,- 9! Xk, be Sk+1(:r,~) = J (X,c + 133-) — J (Xk) 0 Beginning with an initial set Of It features, Xk... 1. Using the sequential forward selection method, find XHI. 2. Find the least significant feature in Xk+1. If 55,,“ is the least Significant feature, then set k = k + 1 and go to step 1. 3. If x,,1 S r g k is the least Significant feature in XHI, then exclude 1:, from Xk+1 to form a new feature set X;. 4. If J(xp > J(Xk): — If k = 2 then set Xk = X}, and go to step 1. — Otherwise continue conditional exclusion by returning to step 2. In a recent study of current feature selection techniques, Jain and Zongker (1997) evaluated the performance Of fifteen feature selection algorithms in terms Of classifica- 19 tio Th for an Sf} beh clas add gan beh the niqu 3 DH the i mot hate geflei tradi pIObl Class; Una tion error and run time on a 2-class, 20—dimensional, multivariate Gaussian data set. Their findings demonstrated that the SFFS algorithm dominated the other methods for this data, Obtaining feature selection results comparable to the Optimal branch- and-bound algorithm while requiring less computation time. 
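The floating search just described can be summarized in code. The sketch below follows the plus-l, take-away-r "floating" idea of the SFFS outline above with an arbitrary criterion function J(); it is a simplified rendering for illustration, not the exact algorithm of Pudil et al. (1994), and the criterion used in the example (a synthetic score over invented "useful" features) is purely hypothetical.

    # Simplified sketch of sequential floating forward selection (SFFS):
    # grow the feature set with the best single addition, then conditionally
    # remove the least significant feature whenever doing so improves J().

    def sffs(all_features, J, target_size):
        current = []
        best_of_size = {}                       # best J() seen for each subset size
        while len(current) < target_size:
            # Forward step: add the most significant remaining feature.
            candidates = [f for f in all_features if f not in current]
            add = max(candidates, key=lambda f: J(current + [f]))
            current = current + [add]
            best_of_size[len(current)] = J(current)
            # Conditional exclusion: drop the least significant feature while it helps.
            while len(current) > 2:
                worst = max(current, key=lambda f: J([g for g in current if g != f]))
                reduced = [g for g in current if g != worst]
                if worst != add and J(reduced) > best_of_size.get(len(reduced), float("-inf")):
                    current = reduced
                    best_of_size[len(current)] = J(current)
                else:
                    break
        return current

    if __name__ == "__main__":
        # Hypothetical criterion: reward features 0, 2, 5 and penalize large subsets.
        useful = {0: 0.5, 2: 0.4, 5: 0.3}
        J = lambda subset: sum(useful.get(f, 0.01) for f in subset) - 0.05 * len(subset)
        print(sffs(list(range(8)), J, target_size=3))   # expected: [0, 2, 5]

In practice J() would be a classification-accuracy estimate, as in the criterion function defined in the algorithm outline above.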
Further tests of the SFFS technique in combination with a knn classifier were conducted to observe the behavior of the knn classification rule as additional features were provided to the classifier. The results showed that classification accuracy was initially improved as additional features were introduced, but eventually reached a maximal value and began to decline with the introduction of further features. Because of this unimodal behavior, selection of an optimal number of features is straightforward when using the SFFS method in combination with a knn classifier.

When classification is being performed using neural networks, node pruning techniques can be used for dimensionality reduction (Mao et al., 1994). After training for a number of epochs, nodes are removed from the neural network in such a manner that the increase in squared error is minimized. When an input node is pruned, the feature associated with that node is no longer considered by the classifier. Similar methods have been employed in the use of fuzzy systems for pattern recognition through the generation of fuzzy if-then rules (Nozaki et al., 1996; Ishibuchi et al., 1995). Some traditional pattern classification techniques, while not specifically addressed to the problem of dimensionality reduction, can provide feature selection capability. Tree classifiers (Quinlan, 1986b), for example, typically partition the training data based on a single feature at each tree node. If a particular feature is not tested at any node of the decision tree, it is effectively eliminated from classification. Additionally, simplification of the final tree can provide further feature selection (Quinlan, 1987).

2.2.3 Feature Extraction

Feature extraction, a superset of feature selection, involves transforming the original set of features to provide a new set of features, where the transformed feature set usually consists of fewer features than the original set. While both linear and non-linear transformations have been explored, most of the classical feature extraction techniques involve linear transformations of the original features. Formally, the objective for linear feature extraction techniques can be stated as follows: Given an n x d pattern matrix A (n points in a d-dimensional space), derive an n x m pattern matrix B, m < d, where B = AH and H is a d x m transformation matrix.

According to this formalization, many common methods for linear feature extraction can be specified according to the method of deriving the transformation matrix, H. For unsupervised linear feature extraction, the most common technique is principal component analysis (Duda and Hart, 1973). For this method, the columns of H consist of the eigenvectors of the d x d covariance matrix of the given patterns. It can be shown that the new features produced by principal component analysis are uncorrelated and maximize the variance retained from the original feature set (Duda and Hart, 1973). The corresponding supervised technique is linear discriminant analysis. In this case, the columns of H are the eigenvectors corresponding to the nonzero eigenvalues of the matrix $S_W^{-1} S_B$, where $S_W$ is the within-class scatter matrix and $S_B$ is the between-class scatter matrix for the given set of patterns. Deriving H in this way maximizes the separation between class means relative to the covariance of the classes (Duda and Hart, 1973).
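As an illustration of this formalism, the following NumPy sketch derives H from the eigenvectors of the covariance matrix (the principal component case) and applies B = AH. The function name and the use of NumPy are my own choices for illustration; this is not the software described later in this work.

import numpy as np

def pca_transform(A, m):
    """Illustrative PCA-style linear feature extraction: B = A H.

    A -- n x d pattern matrix (n patterns, d features)
    m -- number of extracted features, m < d
    """
    A_centered = A - A.mean(axis=0)
    cov = np.cov(A_centered, rowvar=False)        # d x d covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]             # sort by decreasing variance
    H = eigvecs[:, order[:m]]                     # d x m transformation matrix
    B = A_centered @ H                            # n x m transformed patterns
    return B, H

# Example: reduce 50 ten-dimensional patterns to 3 extracted features.
A = np.random.randn(50, 10)
B, H = pca_transform(A, 3)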
In the general case, the matrix H is chosen to maximize some criterion, typically related to class separation or classification accuracy for a specific classifier. In this view, feature selection is a special case of linear feature extraction, where the off-diagonal entries of H are zero, and the diagonal entries are either zero or one.

Feature selection and extraction can enhance pattern recognition in a number of different ways. As discussed previously, these techniques can reduce the cost of feature measurement and storage, as well as providing improved classification accuracy. Additionally, feature selection and extraction can help to reveal the relationships between the features available to a classifier and the classification categories. By identifying minimal sets of features that are sufficient for accurate classification, feature selection and extraction can aid researchers in understanding which features are related to the categories in the data. This function is closely related to the problem of data mining, and can be a useful tool for analysis of large data sets.

2.3 Evolutionary Computation

Dimensionality reduction is well suited to formulation as an optimization problem, thus making it available to the diverse array of computational techniques that have been developed and explored in this area. Some of these methods have been developed by observation and modeling of the process of Darwinian evolution and natural selection. Such methods, collectively termed evolutionary computation, have been developed and studied from various viewpoints, leading to a number of different techniques and specific algorithms (Fraser, 1957; Crosby, 1967; Bremermann et al., 1966; Reed et al., 1967; Fogel, 1998).

2.3.1 Genetic Algorithms

Genetic algorithms comprise a subset of evolutionary computation focusing on the application of selection, mutation, and recombination to a population of competing problem solutions (Holland, 1975; Goldberg, 1989). Figure 2.3 shows a general model of a simple GA. Problem solutions are encoded as strings of information, or chromosomes. A population of competing solutions is maintained and sorted according to the application-specific fitness of each solution. Each generation, a stochastic selection process is employed to choose individuals to advance to the next generation. Individuals with higher fitness values are more likely to be chosen, but the stochastic nature of the selection process allows for the selection of any individual in the population. Various selection methods have been employed, including fitness proportionate selection (Holland, 1975), tournament selection, rank selection, and others. From the individuals that do advance to the next generation, some are chosen at random as parent individuals for recombination. The process of recombination combines traits of the parent individuals to form a new child individual. Again, various techniques are possible, including one- and two-point crossover, uniform crossover, and others (Goldberg, 1989; Mitchell, 1996). It has been proposed that the learning power of the genetic algorithm is largely derived from the interaction of recombination and selection, which together isolate "building blocks", or short motifs common to highly fit chromosomes. Such motifs can then combine to form new chromosomes of even higher fitness (Holland, 1975; Goldberg, 1989).
Finally, random mutation is employed to introduce a controlled amount of noise into the chromosomes, which in turn maintains diversity in the population and helps to avoid premature convergence of the population at local extrema. Genetic algorithms have been shown to be a useful tool in optimization problems from a broad range of scientific disciplines.

Figure 2.3: A general model of a simple Genetic Algorithm.

2.3.2 Evolution Strategies and Evolutionary Programming

The popularity of the GA as an optimization tool is largely due to its ability to avoid local optima by maintaining a diverse population of competing problem solutions. Unfortunately, the mechanisms of recombination and selection are not the most appropriate tools for fine-tuning nearly-optimal problem solutions. As a result, genetic algorithms are more powerful for the early stages of optimization, when general areas of high fitness are sought, than for later stages where solutions in these areas are being refined toward a global optimum. Some researchers have concluded that GAs are, in fact, not an appropriate technique for real-valued function optimization (De Jong, 1992).

Evolutionary programming (EP; Fogel et al., 1966) and evolution strategies (ES; Rechenberg, 1973; Schwefel, 1977), two subsets of the EC family of methods, are often considered more suited to real-valued function optimization and refinement of nearly-optimal problem solutions. Like GAs, EP and ES algorithms are parallel, iterative optimizers. Also like GAs, EP and ES maintain a population of competing problem solutions which are evaluated in terms of a fitness function and subjected to a selection function for inclusion in successive generations. Unlike GAs, however, EP and ES do not emphasize recombination as a primary method for optimization. Instead, these techniques generally rely on various forms of mutation as the key means of learning. For EP, which is often employed for real-valued function optimization, one of the most common mutation operators involves the addition of a Gaussian deviate to the current value of a field on the chromosome. The Gaussian deviate is generally chosen from a distribution with a mean of zero and a variance that is either computed based on the degree of convergence of the current population or stored and evolved with each individual chromosome (Bäck and Schwefel, 1996). The specific selection method used by EP is slightly different from those commonly used in GAs as well. A typical method is to compete each individual in a fixed number of "tournaments", in which its fitness is compared with that of a random member of the population to determine a winner. After each individual has competed in the requisite number of tournaments, the entire population is sorted according to the number of tournament wins, and the upper half of this sorted population is allowed to advance to the next generation. As with GA selection methods, many variations on this basic method have been explored.
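The tournament-based survivor selection just described can be captured in a few lines. The sketch below is a generic illustration, not tied to any particular EP implementation; the fitness callable and the number of tournaments q are assumed inputs.

import random

def ep_survivor_selection(population, fitness, q=10, rng=random):
    """Illustrative EP-style tournament survivor selection.

    population -- list of individuals
    fitness    -- function mapping an individual to a fitness value (higher is better)
    q          -- number of tournaments each individual competes in (assumed value)
    """
    scores = [fitness(ind) for ind in population]
    wins = []
    for i, _ in enumerate(population):
        opponents = [rng.randrange(len(population)) for _ in range(q)]
        wins.append(sum(scores[i] >= scores[j] for j in opponents))
    # Rank by tournament wins and keep the upper half of the population.
    order = sorted(range(len(population)), key=lambda i: wins[i], reverse=True)
    return [population[i] for i in order[: len(population) // 2]]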
2.4 Evolutionary Computation in Feature Selection and Extraction

A direct approach to using GAs for feature selection was introduced by Siedlecki and Sklansky (1989). In their work, a GA is used to find an optimal binary vector, where each bit is associated with a feature (Figure 2.4). If the ith bit of this vector equals 1, then the ith feature is allowed to participate in classification; if the bit is a 0, then the corresponding feature does not participate. Each resulting subset of features is evaluated according to its classification accuracy on a set of testing data using a nearest-neighbor classifier.

Figure 2.4: A d-dimensional binary vector, comprising a single member of the GA population for GA-based feature selection.

This technique was later expanded to allow linear feature extraction by Punch et al. (1993) and independently by Kelly and Davis (1991). The single bit associated with each feature is expanded to a real-valued coefficient, allowing independent linear scaling of each feature, while maintaining the ability to remove features from consideration by assigning a weight of zero. Given a set of feature vectors of the form X = {x_1, x_2, ..., x_d}, the GA produces a transformed set of vectors of the form X' = {w_1 x_1, w_2 x_2, ..., w_d x_d}, where w_i is a weight associated with feature i. Each feature value is first normalized, then scaled by the associated weight prior to training, testing, and classification. This linear scaling of features prior to classification allows a classifier to discriminate more finely along feature axes with larger scale factors. A knn classifier is used to evaluate each set of feature weights. The effects of linear feature weighting on the knn classification rule are visualized in Figure 2.5. Patterns plotted in feature space are spread out along feature axes with higher weight values, and compressed along features with lower weight values. The value of k for the knn classifier is fixed and determined empirically prior to feature extraction. A code sketch of this weighted evaluation is given below.

Figure 2.5: Effect of scaling feature axes on k-nearest-neighbor classification. (a) original data; (b) scaled data. Extension of the scale of the horizontal axis increases the distance between patterns which differ in feature 1, allowing the knn to discriminate more finely along this dimension. Here, the prediction of the unknown changes from class 2 to class 3 as a result of scaling.
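To make the weighting scheme concrete, the sketch below scales (already normalized) features by a candidate weight vector and scores the weights by k-nearest-neighbor accuracy on a tuning set. It is a minimal illustration of the idea, not the GA/knn software described later; the arrays and the example weight vector are assumptions.

import numpy as np

def weighted_knn_accuracy(weights, k, train_X, train_y, tune_X, tune_y):
    """Score a weight vector by the accuracy of a weighted knn on a tuning set.

    Features are assumed to be normalized before scaling by the weights.
    """
    sw_train = train_X * weights          # scale each feature axis by its weight
    sw_tune = tune_X * weights
    correct = 0
    for x, y in zip(sw_tune, tune_y):
        dists = np.linalg.norm(sw_train - x, axis=1)      # Euclidean distances
        neighbors = train_y[np.argsort(dists)[:k]]        # classes of the k nearest neighbors
        votes = np.bincount(neighbors)
        correct += int(np.argmax(votes) == y)
    return correct / len(tune_y)

# Example with random data: 3 features, feature 0 weighted most heavily, feature 2 removed.
rng = np.random.default_rng(0)
train_X, train_y = rng.random((100, 3)), rng.integers(0, 2, 100)
tune_X, tune_y = rng.random((30, 3)), rng.integers(0, 2, 30)
print(weighted_knn_accuracy(np.array([5.0, 1.0, 0.0]), k=3,
                            train_X=train_X, train_y=train_y,
                            tune_X=tune_X, tune_y=tune_y))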
In a similar approach, Yang and Honavar (1998) use a simple GA for feature subset selection in conjunction with DistAl, a neural network-based pattern classifier (Yang et al., 1998). As in other GA-based feature selectors, a simple binary representation was used where each bit corresponds to a single feature. The use of the GA for feature subset selection improved the accuracy of the DistAl classifier for nearly all of the data sets explored, while simultaneously reducing the number of features considered. Their hybrid classifier, GADistAI, outperformed a number of modern classification methods on the various data sets presented.

Vafaie and De Jong (1998) describe a hybrid technique in which EC methods are employed for both feature selection and extraction² in conjunction with the C4.5 decision tree classifier system (Quinlan, 1986a). Again, a binary representation is used for feature subset selection using traditional GA techniques. In this system, however, the features seen by the classifier are functions of the original features composed of simple arithmetic operations. For example, one such feature might be {(F1 - F2) x (F2 - F4)}, where F1, F2, and F4 represent values from the original feature set. These functions for constructing new features are represented as trees, in much the same way as commonly seen in the genetic programming (GP; Koza, 1992) literature. GP subtree crossover is used for recombination of these function trees, while traditional GA-style mutation and crossover are used for the feature selection chromosome. In this system, feature selection and extraction are not simultaneous. Rather, feature selection, feature extraction, and classifier construction are performed serially each generation. The resulting system was shown to outperform the decision tree classifier using the original feature set, while reducing the number of features considerably, for human facial image recognition. Additionally, the hybrid system was shown to outperform several contemporary classifiers on three diverse data sets.

²The authors use the term "feature construction".

Chapter 3

Feature Selection and Extraction using the K-Nearest-Neighbors Classifier in Combination with Evolution-based Learning

3.1 Methods

3.1.1 Branch and Bound Near-Neighbor Searching

The k-nearest-neighbors classifier has several features that make it a good choice for use with feature selection and extraction algorithms. The nonparametric nature of the nearest neighbors rule allows the classifier to be applied to a broad range of data sets, including those where the form of the class-conditional distribution of feature values is completely unknown. When used with a Euclidean distance metric, the knn decision rule is sensitive to scaling of the feature values, which forms the basis for the feature extraction technique applied here. Many classifiers based on the class-conditional distribution of feature values, including the Bayes classifier, are invariant to scaling of the feature values.

The primary drawback of the knn classifier is the computational complexity of the near neighbor search. The search for the neighbors of a single test pattern takes O(n * d) time, where n is the number of training samples and d is the number of features being considered. Thus, the evaluation of a test set of m samples takes O(n * m * d) time, which can quickly become prohibitively expensive for large training and testing set sizes and large numbers of features. To reduce the computational cost of the near neighbor search, the branch and bound search algorithm of Fukunaga and Narendra (1975) was incorporated into the knn classifier. The resulting branch and bound knn (bbknn) classifier scales more efficiently with the number of samples in the knn training set.

3.1.2 The Hybrid GA/knn Classifier

The basis for the hybrid nearest neighbor classification technique is the weighted nearest neighbors classifier of Punch et al. (1993). As described in Section 2.4, a genetic algorithm is used to optimize a vector of weight values, {w_1, w_2, ..., w_d}, where each weight is used as a linear scaling factor for one of the features considered by the classifier. The feature values are normalized over a common range prior to scaling. For the work described here the normalized feature values ranged from 1.0 to 10.0. The formalism for linear feature extraction described in Section 2.2.3 states that the objective is a d x m transformation matrix, H, where d is the number of original features and m is the number of transformed features.
The GA-weighted knn classifier fits this framework, where the GA produces the diagonal elements of the transformation matrix, H, and all the off-diagonal elements are zero. Figure 3.1 shows a schematic overview of a GA-based linear feature extractor in combination with a knn classifier. For the sake of efficiency, only the weight vector itself (that is, the diagonal elements of the transformation matrix, H) need be passed to the knn. Since the value of k (the number of nearest neighbors to consider) is closely related to the scaling of the feature dimensions, the value of k was also included on the GA chromosome for co-optimization with the weight vector.

Figure 3.1: A GA-based feature extractor using an objective function based on classification accuracy. Each transformation matrix from the GA is used to transform the input patterns, which are then passed to a knn classifier. The fitness of a particular transformation matrix is based on the classification accuracy of the knn using the transformed patterns.

The GA Cost Function

As each weight vector and k value is sent to the knn classifier for evaluation, a cost (or inverse fitness) score is computed, based primarily on the accuracy obtained by the knn in classifying a set of samples of known class. Since the genetic algorithm seeks to minimize the cost score, the formulation of the cost function is a key element in determining the quality of the resulting classifier. As such, several trade-offs must be considered in the design of the objective function:

Feature subset size versus classification accuracy — Up to a certain point, inclusion of meaningful features will generally increase the accuracy of a classifier. Thus, reducing the size of the feature set and maximizing classification accuracy are often competing goals. The importance of feature set reduction relative to overall accuracy can be controlled through the fitness function.

Measures of classification accuracy — All types of classification errors do not necessarily incur equal costs. In a two-class system, for example, the cost of misclassification for one class might differ from the other. The most commonly employed measure of accuracy is the classification rate, usually defined as the number of correct classifications divided by the number of test samples. This simple metric may not suffice for problems with differing objectives. Consider a test set consisting of n samples, {x_1, x_2, ..., x_n}, belonging to c classes. For a given classifier, let P_i be the number of patterns from the test set that were correctly predicted to belong to class i. Let U_i be the number of patterns falsely predicted as class i (that is, predicted as class i, but observed to belong to class j, i ≠ j). Similarly, let O_i be the number of patterns belonging to class i, but mispredicted as belonging to some other class; and let N_i be the number of patterns in the testing set neither observed nor predicted as belonging to class i. Using these four variables, we can establish several measures of classification accuracy that emphasize different aspects of the classification task.
While overall accuracy is generally desirable, it can be a somewhat short- sighted measure of classifier quality when used alone. For example, for a two-class problem where the test set consists of many samples from class 1 and only a few from class 2, a good classification rate might be achieved by always predicting class 1. Class balance metrics can be combined with classification rate to mitigate this effect. Classification rate is also often expressed as error rate, Err = 1.0 — Acc. C P e Average class accuracy: ACA = ( ’ ) /c f; 0.- + P.- Average class accuracy helps to insure class balance by computing the classification rate for each class, and then taking the mean over all classes. .01 bl ~Bz—°(P‘ ) main P1 ass a ance. a — millx 0‘ + B j=l 03- + P]- This direct measure of class balance can be combined with the classification rate to penalize for class bias. 35 Objec tl. P.- U,+P.~ Sometimes called “hit rate”, PBA is defined on a per-class basis. For a e Prediction-based Accuracy: PBA(2’) = given class, 2', PBA(2’) is the proportion of all the predictions for class 2' that are correct. For example, if 100 samples are predicted to belong to class 1, and 85 of them are observed to actually belong to class one, then PBA(1) = 0.85. This measure explicitly penalizes for overpredic- tion of each class, while AC‘A penalizes for underprediction. Average prediction-based accuracy (APBA) can be computed for the entire testing i=1 set: APBA(2’) = c (Fl—:43) /c e Matthews coefficient (Matthews, 1975): I),- * N,“ — U5 * 05 (I); + Ui) * (Pi + 0;)* (Ni-i” U,) * (N; + 0.) Like PBA, the Matthews coefficient is defined on a per-class basis. The Cm(i) = Matthews coefficients for each class can be averaged to produce an overall measure of accuracy and class balance. When the testing set is extremely unbalanced, Cm can have small values even when the accuracy for each class is high. Cm for a particular class is undefined when none of the training samples are predicted to belong to that class. Objective function smoothness -— For the knn classifier an objective function based largely on classification accuracy can have an unnecessarily rough, step- wise character. By smoothing this function we can provide more fine-grained feedback to the GA, enabling more efficient optimization. One way to achieve this is to consider the individual near neighbors and their classes. Consider a 36 two-class knn classifier with k = 7, and a weight vector that results in mis- classification of a particular test sample. Further suppose that the test sample belongs to class two, but all the near neighbors of the test sample are of class one when scaled by the weight vector. If the weight vector undergoes mutation or crossover and the resulting set of near neighbors contains two members of class two, then the classifier is closer to a correct classification of this test sample, but the fitness value of the weight vector does not change because the sample is still misclassified. By adding a penalty for each incorrect near-neighbor “vote” to the cost function, we can reward an individual weight vector each time an additional vote is cast for the correct class. This additional feedback to the GA can guide the search toward new correct classifications, providing a more efficient search. Coefficients are associated with each term in the GA cost function that allow control of each run. 
The following cost function is computed by the knn classifier for each individual (consisting of a weight vector and k value): cost(u7, k) 2 Cam x Err(tii, k) + 0pm x nonzero(zi2') + Cum x incorrect-votes(zii, k) + Cm, x Bal(u‘i, k) (3.1) 37 Where Err is the error rate, as defined above; nonzero(tU) is the number of nonzero weights in the weight vector 13; incorrect-votes(u7, k) is the total number of incorrect near neighbor votes cast during classification with the weight vector 11'} and neighbor count 1:; and Bal is the balance function, also defined above. Additionally, Cm, Cmnl, Cw“, and CM are coefficients for each of these terms, respectively. The coefficients determine the relative contribution of each part of the fitness function in guiding the GA search for the Optimal weight vector and I: value. The values for the cost function coefficients were determined empirically in a set of initial experiments for each data set. Typical values for these coefficients are given in Table 3.1. Deviations from these values for particular GA runs or data sets will be specifically noted in subsequent discussion. Table 3.1: Typical values for the GA cost function coefficients. Coefficient Typical value Cm 20.0 CM, 1.0 Cvote 2-0 CW 10.0 Representation Issues and Masking The representation of the feature weights and I: value on the chromosome is fairly direct—a 32 bit integer is used to represent each weight value. The resulting gene lPars is short for parsimony, indicating the importance of the nonzero term in obtaining a minimal feature set. 38 the ehrr each seha Sine Valm \Elue featu zero, nneth Dorat the h The \ above Inask In the Each ff on the GA chromosome yields an unsigned value over the range [0, 232 — 1]. This value is then scaled by division to yield a real value over [0.0, 100.0]. The value of k is represented as a 6-bit unsigned integer, but is constrained to values between 1 and 50. For two-class problems, the I: value is set to k = (kchmm * 2) + 1, where kchmm is the k-value from the GA chromosome. This constrains the value of k to odd integers, eliminating the need for a tie-breaking scheme in the knn classifier. While the cost function encourages parsimony by penalizing a weight vector for each nonzero feature weight, a simple real-valued representation for the weights them- selves does not provide an easy means for the GA to reduce feature weights to zero. Since the GA mutation operator tends to produce a small change in a single weight value, numerous mutations of the same feature weight are often required to yield a value at or near zero. Several methods were tested to aid the search for a minimal feature set, including reducing weight values below a predefined threshold value to zero, and including a penalty term in the cost function for higher weight values. The method that proved most effective, however, was a hybrid representation that incor- porates both the GA feature selection technique of Siedlecki and Sklansky (1989) and the feature weighting techniques of Punch et at. (1993) and Kelly and Davis (1991). The weights and k value are represented directly on the chromosome, as described above. Additionally, a mask field is assigned to each feature. The contents of the mask field determine whether the feature is included in the classifier (see Figure 3.2). In the initial implementation, a Single mask bit was stored on the chromosome for each feature. If the value of this bit was 1, then the feature was weighted and included 39 in classification. 
If, on the other hand, the mask bit for a feature was set to 0, then the feature weight was treated effectively as zero, eliminating the feature from con- sideration by the classifier. Since the masking fields comprised a very Small section of the chromosome relative to the feature weights and I: value, the number of mask bits associated with each feature was later increased to five. This increase had the effect of increasing the probability that a single bit mutation in a random location would affect the masking region of the chromosome. The interpretation of multiple mask bits is a simple generalization of the single bit case. When the majority of the mask bit values for a field are 1, then the field is weighted and included in classification. Otherwise, the field weight is reduced to 0, removing the feature from consideration by the knn. The number of mask bits is always odd so there is no possibility of a tie. Figure 3.2 shows a typical GA chromosome for the hybrid GA / knn classifier with masking. [— 32-bit L— S-bit masking feature field weight 6-bit k-value Figure 3.2: An example of a GA / knn chromosome with masking for a 4-dimensional feature set. The use of a masking field on the chromosome allows a more efficient search for a minimal subset of features that provides good classification accuracy, while Simultaneously searching for the optimal weights for the non-masked features. A single mutation can allow a feature to be tentatively included or removed from a 40 specific feature set without the loss of the partially-optimized weight value for that feature. Without masking, a weight value must be reduced to zero (or below a fixed threshold) to remove a feature from the current feature set. If the feature is later reintroduced, its weight value must be re-optimized from scratch. GA Optimization Details Several GA engines were employed to search for the optimal set of feature weights and I: value, including the GAUCSD algorithm (Schraudolph and Grefenstette, 1992) and GALOPPS (Goodman, 1996). GA run parameters were determined empirically via a set of initial experiments. Table 3.2 summarizes the GA parameters for a typical run. Where Specific runs have parameters that differ from these values in the subsequent discussion, this fact will be explicitly noted. Most runs used fitness proportionate selection, which requires an explicit scaling of the fitness of each individual in the GA population prior to selection. The specific scaling method used has a significant impact on the selection pressure applied by the GA during the subsequent search. For runs described here, the sigma scaling method of the GAUCSD algorithm was used. Several Sigma scaling factors were tested, with the most common value being 3.0, as noted in the table. Further details on sigma scaling and the significance of the sigma scaling factor can be found in Schraudolph and Grefenstette (1992). 41 3.1 Evol and . 1991; lem. J Were 2 to tha Optimi- We a[ for each Table 3.2: Typical values for the GA run parameters. GA Run Parameter Typcial value Population size 200 Crossover rate 0.8 per individual Mutation rate 0.001 per bit Max number of generations 200 Sigma scaling factor 3.00 3.1.3 EP Optimization with the Knn Classifier Evolutionary programming was also tested as a method for optimizing feature weights and I: value for the knn classifier. An EP framework based on SGA-C (Smith et al., 1991; Goldberg, 1989) was adapted to suit the feature selection and extraction prob- lem. 
GA Optimization Details

Several GA engines were employed to search for the optimal set of feature weights and k value, including the GAUCSD algorithm (Schraudolph and Grefenstette, 1992) and GALOPPS (Goodman, 1996). GA run parameters were determined empirically via a set of initial experiments. Table 3.2 summarizes the GA parameters for a typical run. Where specific runs have parameters that differ from these values in the subsequent discussion, this fact will be explicitly noted. Most runs used fitness proportionate selection, which requires an explicit scaling of the fitness of each individual in the GA population prior to selection. The specific scaling method used has a significant impact on the selection pressure applied by the GA during the subsequent search. For the runs described here, the sigma scaling method of the GAUCSD algorithm was used. Several sigma scaling factors were tested, with the most common value being 3.0, as noted in the table. Further details on sigma scaling and the significance of the sigma scaling factor can be found in Schraudolph and Grefenstette (1992).

Table 3.2: Typical values for the GA run parameters.

GA Run Parameter            Typical value
Population size             200
Crossover rate              0.8 per individual
Mutation rate               0.001 per bit
Max number of generations   200
Sigma scaling factor        3.00

3.1.3 EP Optimization with the Knn Classifier

Evolutionary programming was also tested as a method for optimizing feature weights and the k value for the knn classifier. An EP framework based on SGA-C (Smith et al., 1991; Goldberg, 1989) was adapted to suit the feature selection and extraction problem. In order to facilitate efficient sampling of feature subsets, the usual EP operators were augmented with a recombination operator, and a hybrid representation similar to that of the masking GA/knn was used for the EP chromosome. As in the GA optimizer, the EP chromosome consisted of a single real-valued weight for each feature, an integer value for k, and a masking field consisting of one or more mask bits for each feature. Unlike the GA, however, the distinct regions of the EP chromosome were subjected to different operators. The real-valued feature weights and the k value underwent only Gaussian mutation, while the masking fields were modified by GA-style bitwise mutation and single-point crossover. Figure 3.3 illustrates the dichotomy of the chromosome used for EP/GA hybrid optimization. As mentioned in Section 2.3.2, EP is commonly employed for the tuning of real-valued function parameters, but lacks a standard mechanism for dealing with binary fields such as the masking region of the chromosome. The objective of the EP/GA hybrid approach was to explore the real-valued parameter optimization capability of the EP for tuning the feature weights, while exploiting the known ability of traditional GA operators to search binary alphabets (Holland, 1975; Goldberg, 1989) in optimizing the mask bits.

Figure 3.3: An example of an EP/GA hybrid chromosome with masking for a 4-dimensional feature set. The real-valued weights and k value undergo Gaussian mutation, while the masking fields undergo bitwise mutation and single-point crossover.

The EP framework employed, like most EP-based real-valued parameter optimizers, utilizes a mutation operator that varies each parameter according to a Gaussian distribution. The standard deviation for this distribution is an adaptive parameter in the sense that it is stored with the individual as a vector of σ-values, one for each parameter. Each parameter, x_i, is mutated to produce x'_i as follows:

$$x'_i = x_i + N(0, \sigma_i) \tag{3.2}$$

The standard deviations themselves are randomly perturbed according to a log-normal distribution to promote a diversity of individuals with differing optimization strategies within the population. Each generation, the vector of standard deviations for each individual is updated according to:

$$\sigma'_i = \sigma_i \cdot \exp\big(\tau' \cdot N(0,1) + \tau \cdot N_i(0,1)\big) \tag{3.3}$$

where N(μ, σ) is a single normally distributed random variable, and N_i(0, 1) is the ith element of a vector of standard normal deviates of a length equal to the number of parameters to mutate (Angeline, 1995). The values of τ' and τ are compile-time parameters of the EP engine. Here, these parameters are set according to guidelines suggested by Bäck and Schwefel (1996). The rest of the EP run parameters, like those of the GA, were tuned empirically. Table 3.3 shows the run parameters for a typical EP/knn experiment.

Table 3.3: Run parameters for a typical EP/knn experiment. d is the number of parameters to be optimized (i.e., the dimensionality of the problem).

EP Run Parameter            Typical value
Population size             200
Mask field crossover rate   0.8 per individual
Mask field mutation rate    0.001 per bit
Max number of generations   200
τ'                          (√(2d))⁻¹
τ                           (√(2√d))⁻¹
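The self-adaptive mutation of Equations 3.2 and 3.3 can be sketched as follows. This is a generic illustration of the lognormal self-adaptation scheme under the Bäck and Schwefel (1996) settings listed in Table 3.3, not the SGA-C-based code used in this work.

import math
import random

def ep_mutate(x, sigma):
    """Self-adaptive Gaussian mutation (Equations 3.2 and 3.3).

    x     -- list of real-valued parameters (e.g., feature weights and k)
    sigma -- list of per-parameter standard deviations, same length as x
    """
    d = len(x)
    tau_prime = 1.0 / math.sqrt(2.0 * d)          # global learning rate
    tau = 1.0 / math.sqrt(2.0 * math.sqrt(d))     # per-parameter learning rate
    global_step = random.gauss(0.0, 1.0)          # shared N(0,1) draw for this individual

    # Equation 3.3: lognormal perturbation of the standard deviations.
    new_sigma = [s * math.exp(tau_prime * global_step + tau * random.gauss(0.0, 1.0))
                 for s in sigma]
    # Equation 3.2: Gaussian mutation of each parameter with its updated sigma.
    new_x = [xi + random.gauss(0.0, ns) for xi, ns in zip(x, new_sigma)]
    return new_x, new_sigma

# Example: mutate a 4-feature weight vector.
weights, sigmas = ep_mutate([3.2, 0.0, 7.5, 1.1], [1.0, 1.0, 1.0, 1.0])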
The na'ive Bayes classifier, described in Sec- 44 tion 2.1.2, is nonparametric in the sense that the class conditional distributions Of the feature values are estimated by constructing a histogram for each feature based on the training data. The bin Size Of this histogram is a run-time parameter Of the classifier. Unless otherwise noted, 20 bins for each feature were used for the experi- ments described here. In addition, a Gaussian smoothing factor is applied in order to mitigate sampling anomalies that might introduce classification bias. Given a feature value, 23,-, for feature 2', and a class wj, then let bwj(x,) be the bin that x,- occupies in the histogram for class Wj. When the Gaussian smoothing is applied, the effec- tive marginal probability p(:c,-|wj) depends on the histogram value Of bin by, (23,-), as well as the histogram values Of neighboring bins. Let hwj(bwj(a:,-)) be the histogram value for bin bwj (z,)——that is, the proportion Of the training samples Of class u, that have values for feature i that belong in the bin bu]. (93,-)—then the effective marginal probability for feature value x,- is: +0 pence) = 2 (Gas, 0) x h..,(b.,,.(x.-) + a) (3.4) k=—a where G (k, a) is the mass density function for the Gaussian distribution at u = 0.0, with variance 02: G(k, a) = e-%(%)2 (3.5) The value Of 0, also a run-time parameter, determines the number Of bins that will contribute to each effective marginal probability value. Figure 3.4 illustrates the effect of Gaussian smoothing on the effective marginal probability for a particular 45 feature value. Proportion Of training samples J Hllfl Feature value Hflflfl Figure 3.4: Effects of Gaussian smoothing on the computation of effective marginal probabilities. Assuming that the current feature value falls in the center bin (black rectangle), and assuming 0‘ = 2, then the two surrounding bins on either side (grey rectangles) also contribute to the effective marginal probability for the current feature. The unweighted knn classifier, also employed for comparison with the hybrid clas- sification results, uses a Euclidean distance metric to find nearest neighbors. When there is a tie in determining the most frequent class among the near neighbors, the class of the nearest neighbor is used as a tie breaker. 3.1.5 Data Sets Artificially Generated Data The GA/knn and EP/knn hybrid classifiers were trained and tested against a variety Of artificially constructed data sets to evaluate their error rates and feature selection 46 and extraction capability under a variety Of classification conditions. These data sets were constructed with a varying number of features Of the following types: Univariate Gaussian: 001,0) — A random Gaussian deviate selected from the distribution with mean u and variance 02. Uniform: U (min, max) — A random real value selected from a uniform random distribution over the range [min, max]. Uniform integer: U I (min, max) — A random integer selected from a uniform random distribution over the range [min, max]. Computed fields — Field values can be computed from the results of other field values in combination with basic arithmetic Operations and/or any Of the pre- viously mentioned types Of random variables. For example, a data field might consist Of twice the value of the first data field, plus a uniform random value from 1 to 10: 2 * fl + U(1,10). 
Based on these basic types of field values, a number of two-class data sets were generated for use in testing various aspects of the GA- and EP-based feature selection and extraction methods in combination with the knn classifier. The specific data sets generated include:

Name: G10,1
Number of features: 10
Class 1 specifier: G(5,2), G(5,2), ..., G(5,2)
Class 2 specifier: G(10,2), G(10,2), ..., G(10,2)
This simple 10-dimensional data set consists of independent Gaussian-distributed features. The two classes are generally linearly separable, and weighting of the features is not likely to improve classification accuracy. This data set serves as a baseline for classification performance.

Name: G10,2
Number of features: 10
Class 1 specifier: G(7,2), G(7,2), ..., G(7,2)
Class 2 specifier: G(10,2), G(10,2), ..., G(10,2)
This data set is similar in construction to G10,1, but the means for the two classes are closer, and thus the data is slightly more overlapped. See Figure 3.5 for a comparison between G10,1 and G10,2.

Name: G5
Number of features: 5
Class 1 specifier: G(5,2), G(5,2), G(5,2), G(5,2), G(5,2)
Class 2 specifier: G(10,2), G(9,2), G(8,2), G(7,2), G(6,2)
This 5-dimensional data set again consists of independent features selected from Gaussian distributions. However, the distance between the class means is reduced by one for each feature after the first, so feature weighting is more likely to be useful for this data set. Furthermore, since inappropriate feature selection should now prove detrimental to classification accuracy, this is the first data set that will provide a test of the feature selection capability of each method.

Name: G6
Number of features: 6
Class 1 specifier: G(5,2), G(5,2), G(5,2), G(5,2), G(5,2), G(5,2)
Class 2 specifier: G(10,2), G(9,2), G(8,2), G(7,2), G(6,2), G(5,2)
This data set is identical to G5, but an additional feature, feature 6, is included. The mean and variance of feature 6 are identical for both classes. This feature should provide no useful classification information and is included to further test the feature selection capability of each method.

Name: M6
Number of features: 6
Class 1 specifier: G(5,2), G(2,1), G(5,3), G(2,2), U(1,5), UI(1,5)
Class 2 specifier: G(7,2), G(3,1), G(7,3), G(3,2), U(3,7), UI(3,7)
This is a mixed data set consisting of features drawn from overlapping Gaussian and uniform distributions. Various levels of overlap between classes are exhibited by the different features. Both continuous and discrete uniformly-distributed features are included.

Name: N5
Number of features: 5
Class 1 specifier: G(5,2), G(5,2)+U(1,2), G(5,2)+U(1,3), G(5,2)+U(1,4), G(5,2)+U(1,5)
Class 2 specifier: G(7,2), G(7,2)+U(1,2), G(7,2)+U(1,3), G(7,2)+U(1,4), G(7,2)+U(1,5)
This data set is designed to evaluate the ability of each method to deal with increasing noise in each feature. The first feature consists of a Gaussian random deviate with moderately overlapped distribution between classes. Each subsequent feature is a similar Gaussian deviate, but with an increasing level of uniform random noise added.
Name: N10
Number of features: 10
Class 1 specifier: G(5,2), G(5,3), G(5,4), G(5,5), G(5,6), G(5,6), G(5,6), G(5,6), G(5,6), G(5,6)
Class 2 specifier: G(7,2), G(7,3), G(7,4), G(7,5), G(7,6), G(7,6), G(7,6), G(7,6), G(7,6), G(7,6)
Another data set designed to test the ability of each method to deal with noisy features, this set includes a large number of features with extremely overlapped distributions between the two classes.

Medical, Biochemical and Other Data

Several data sets were selected from the UCI machine learning repository (Blake and Merz, 1998) in order to test the capability of the EC/knn classifier methods on real-world problems, and to facilitate comparison with other EC-based hybrid classification methods. The specific data sets tested were chosen to allow comparison with the GADistAI technique of Yang and Honavar (1998) and the EC feature construction and selection methods of Vafaie and De Jong (1998), and included the following:

Figure 3.5: A two-dimensional projection of the data sets G10,1 (a) and G10,2 (b) onto the first two feature axes. G10,2 exhibits much more overlap between classes than G10,1.

Hepatitis — This data consists of 19 descriptive and clinical test result values for 155 hepatitis patients (Diaconis and Efron, 1983; Cestnik et al., 1987). The two classes, survivors and patients for whom the hepatitis proved terminal, are strongly unbalanced—123 samples belong to the survivor class while 32 belong to the terminal class. The data includes qualitative, as well as both continuous and discrete-valued quantitative, features. There are missing values, the number of which varies largely by feature. Many features have no missing values, while others have as many as 67 missing values out of 155 samples. The small sample size and incompleteness of this data set are typical of many medical classification problems.

Pima — Diabetes diagnosis information for native American women of the Pima heritage, aged 21 or over (Smith et al., 1988). This data consists of diagnostic information for 768 women; 268 of these patients tested positive for diabetes, while 500 tested negative. Six of the eight features are quantitative and continuous, consisting of various clinical test results. The remaining two features, age in years and number of times pregnant, are quantitative and discrete. There are no missing feature values in the data. The completeness and moderate dimensionality of this data set make it suitable for testing the ability of a classifier and feature extractor to maintain or increase classification accuracy while reducing dimensionality when there are fewer features to work with.

Wine — This data set consists of the results of a chemical analysis of wines derived from three different cultivars (Aeberhard et al., 1992a,b). There are 13 continuous features, with no missing values. There are 59, 71, and 48 members of the three classes, respectively. The three classes are nearly linearly separable, and linear discriminant analysis can obtain 98.9% accuracy over all three classes. This data set is thus better for evaluating feature selection capability than classifier accuracy.

Ionosphere — The 34 continuous features in this data set are derived from the signals read by a phased array of 16 high-frequency antennas in Goose Bay, Labrador (Sigillito et al., 1989). These radar signals are designed to recognize structure in the ionosphere.
Each reading consists of 17 pulses, with two attributes per pulse, resulting in 34 features. There are 351 samples in this data set—225 are considered "good" readings, for which some structure in the ionosphere was detected, while 126 readings showed no structure. There are no missing feature values in this data. This data set was selected for evaluation of feature selection capability for higher-dimensionality data sets.

Two additional data sets, also selected from the UCI repository, were employed by Weiss and Kapouleas (1989, 1990) in a comparative study of classification methods from statistical pattern recognition, neural networks, and machine learning. These two medical data sets, thyroid and appendicitis, are included here to facilitate comparison with these results. The thyroid data consists of 21 clinical test results for a set of patients tested for thyroid dysfunction (Quinlan et al., 1986)—15 of these features are binary-valued, while the other 6 are continuous. The training data consist of 3772 cases from the year 1985, while the testing data consist of 3428 cases from the following year. The data are grouped into two classes, consisting of the patients that were/were not diagnosed as having certain categories of hypothyroid disorder. The two classes are highly unbalanced: the training data consist of 3487 negative diagnoses and 284 positive, while the testing data consist of 3177 negative samples and 250 positive. The appendicitis data consists of seven laboratory tests to confirm the diagnosis of acute appendicitis (Marchand et al., 1983). All seven features are continuous. This data set consists of only 106 samples in two classes: 85 patients had confirmed appendicitis while 21 did not.

In addition, extensive experimentation was conducted on a large biochemical data set dealing with the binding of water molecules to protein surfaces. The features and biological significance of this data, as well as the classification and feature selection results for the GA/knn, EP/knn and other classifiers and feature selection and extraction methods, are detailed in Chapter 5—Identifying the Determinants of Solvent Binding in Proteins.

3.1.6 Testing and Error Rate Estimation

Various techniques were used to estimate the error rate of the trained EC/knn hybrid classifier. For data sets from the UCI machine learning data repository (i.e., the hepatitis, Pima, wine, and ionosphere data sets) the testing methods were similar to those of Yang and Honavar (1998) in order to facilitate comparison with these results. Several minor differences were introduced to avoid problems with overfitting small training and testing sets. Two types of experiments were conducted on the UCI data sets, depending on the number of training and testing samples available for each class in the data set.

Where ample training and testing data (at least 100 samples) were available for each class, hold-out tuning experiments were conducted as follows. First, the available data are partitioned into three sets—training, tuning, and testing—with an equal number of samples of each class in each set. Thus, the size of these three sets is limited by the number of samples in the least numerous class. The first set (training) is used to populate the feature space for the knn classifier. Next, the EC is employed to optimize the k-value and feature weights. As previously described in Section 3.1.2, the EC cost function is based on classification accuracy for a set of samples of known class.
The second set of samples (tuning) is used for this purpose. Finally, after the k-value and feature weights have been optimized, the resulting classifier is evaluated using the final data set (testing) to obtain an estimate of the error rate. This estimate is then averaged over 5 independent runs of the EC using the same partitioning into training, tuning, and testing sets. Finally, the entire process is repeated 10 times with 10 different partitionings of the data, resulting in an error rate estimate averaged over 50 EC experiments.

When the number of samples for one or more classes is extremely limited (less than 100 samples), this procedure is modified slightly. In this case, the data are partitioned into only two sets (training and testing), and leave-one-out tuning experiments are conducted. In these experiments, the training and tuning sets are identical. When evaluating the EC cost function, each sample is temporarily removed from the training set and evaluated as a tuning set of size one. The cost function is then computed based on the average classification accuracy over all the training samples. After EC optimization of the feature weights and k-value, the error rate of the resulting classifier is estimated using the testing set, exactly as in the hold-out tuning experiments. Again, 5 independent EC runs are performed for each of 10 partitionings of the data, and the error estimates are averaged over all 50 experiments.

This method was simplified for the artificially generated data sets, since the number of potential independent samples is infinite for this data. Here, experiments were conducted for each classifier using large, independent training, tuning, and testing sets (1000 samples each). For GA and EP experiments, 5 independent experiments were conducted and the accuracy for the best result is reported.
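A minimal sketch of the hold-out partitioning step is given below. It stratifies by class and caps each split at the size of the least numerous class; the function name and the use of NumPy are illustrative choices, not the dissertation's code.

import numpy as np

def stratified_three_way_split(X, y, rng):
    """Partition data into training, tuning, and testing sets with an equal
    number of samples from each class in each set (illustrative sketch)."""
    classes = np.unique(y)
    per_set = min(np.sum(y == c) for c in classes) // 3   # limited by the rarest class
    train_idx, tune_idx, test_idx = [], [], []
    for c in classes:
        idx = rng.permutation(np.flatnonzero(y == c))
        train_idx += list(idx[:per_set])
        tune_idx += list(idx[per_set:2 * per_set])
        test_idx += list(idx[2 * per_set:3 * per_set])
    return ((X[train_idx], y[train_idx]),
            (X[tune_idx], y[tune_idx]),
            (X[test_idx], y[test_idx]))

# The full protocol repeats this partitioning 10 times and runs the EC 5 times
# per partitioning, averaging the test accuracy over all 50 experiments.
rng = np.random.default_rng(0)
X, y = rng.random((300, 8)), rng.integers(0, 2, 300)
(train, tune, test) = stratified_three_way_split(X, y, rng)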
Bootstrap Testing

For other data sets, including the appendicitis, thyroid, and protein-water binding data, a variant of the bootstrap test method (Jain et al., 1987; Efron, 1979, 1982) is employed in order to obtain both an error rate estimate and a simple measure of the variance of this estimate. Traditional bootstrap measures require that a classifier be retrained and tested on a number of bootstrap testing sets selected with replacement from the available data (Jain et al., 1987). Since the training of the hybrid EC/knn classifier is an iterative, computationally-intensive, offline process, a modified bootstrap method was employed to obtain an error estimate for the trained and tuned hybrid classifier.

As with the previously described methods, the available data are partitioned into three sets: training, tuning, and testing. Within each of these sets, the number of samples of each class is equal. The training and tuning sets are used during each EC experiment in the same manner described above: the training set is used to populate the knn feature space, while the tuning set is used to evaluate a particular set of feature weights and k-value in terms of classification error, providing feedback to the EC optimizer. Once the feature weights and k-value are optimized, the testing set is randomly sampled with replacement to form n bootstrap test sets. The accuracy of the classifier (or any other performance measure to be estimated) is evaluated on each of these bootstrap test sets using the final weight set and k value produced by the EC experiment. Finally, the mean and standard deviation of the error rates for all n external test sets are computed and used as an estimator of the true error rate for the optimized classifier.

During the bootstrap tests, the same data set is used for training the knn classifier as during the EC experiment. Since the feature weights, k value, and training data cooperatively define the decision boundary for the final classifier, it would place an undue negative bias on the error rate estimation procedure to test the final classifier using a knn training set other than the one used during the EC experiment that produced that classifier. Figure 3.6 shows a schematic of the data sets used for training, tuning, and testing during a single EC experiment using the bootstrap-based testing method.

Figure 3.6: Division of training, testing, and tuning data during a single EC experiment (a), and error rate estimation of the tuned classifier (b). The cost score in (a) is based on the performance of the knn on the tuning set for a particular weight set and k value. The same knn training data is used for the entire EC experiment and during external testing.
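The modified bootstrap estimate can likewise be sketched in a few lines. The code below resamples the held-out test set with replacement and reports the mean and standard deviation of the accuracy of an already-tuned classifier; the evaluate callable stands in for the weighted-knn evaluation and is an assumed input, not part of the original software.

import numpy as np

def bootstrap_error_estimate(evaluate, test_X, test_y, n_sets=100, rng=None):
    """Estimate accuracy of a tuned classifier on n bootstrap test sets.

    evaluate -- callable(test_X, test_y) -> accuracy of the final, tuned classifier
    test_X, test_y -- NumPy arrays holding the held-out test set
    """
    rng = rng or np.random.default_rng()
    accuracies = []
    for _ in range(n_sets):
        idx = rng.integers(0, len(test_y), size=len(test_y))   # sample with replacement
        accuracies.append(evaluate(test_X[idx], test_y[idx]))
    return float(np.mean(accuracies)), float(np.std(accuracies))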
3.2 Results

3.2.1 Classification of Artificial Data Sets

Independent Gaussian Features

Three of the artificially generated data sets were constructed entirely of features drawn from independent Gaussian distributions: G10,1, G10,2, and G5. The set G10,1 is linearly separable, and a naive Bayes classifier is able to achieve 100% prediction accuracy using all 10 features. The set G10,2 is not as easily classified, but like G10,1, the features are all drawn from identical independent distributions, so there is no feature weighting that can obtain better performance than that of the unweighted classifiers. Like G10,1 and G10,2, the data set G5 is not difficult to classify—a naive Bayes classifier can discriminate between the two classes with ~96% accuracy—but the features are not drawn from identical distributions, so feature weighting can potentially have a positive effect on classification accuracy. For these data, the primary goal is to reduce the number of features used, while reducing classification accuracy as little as possible. For comparison, a naive Bayes classifier was trained and tested on each data set, as well as a knn classifier using odd values for k from 1 to 100. As described in Section 3.1.6, 5 independent experiments were conducted for each EC-based method, and the best k value and weight set were then tested on independent data. These results are summarized in Table 3.4.

Table 3.4: Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC/knn hybrid classifiers for artificially generated data sets consisting of independent, Gaussian-distributed features. For the Bayes classifier, accuracy is shown for reclassifying the training data (Train) and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set (Tune) and the corresponding k value are reported. The best k value was then evaluated by reclassifying the training data (Train), and by classifying a new, independent test set (Test). For the EC/knn hybrid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The k value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided, along with the number of non-zero feature weights (Features).

G10,1      k    Train   Tune   Test   Features
Bayes      —    100     —      100    10
Knn        3    100     100    100    10
GA/knn     3    98.9    100    99.4   4
EP/knn     3    100     100    100    4

G10,2      k    Train   Tune   Test   Features
Bayes      —    92.9    —      91.5   10
Knn        67   94.1    94.2   93.3   10
GA/knn     25   94.2    94.8   93.1   10
EP/knn     19   94.6    93.8   93.1   10

G5         k    Train   Tune   Test   Features
Bayes      —    96.0    —      95.6   5
Knn        79   95.8    97.7   96.5   5
GA/knn     33   94.7    97.2   95.7   3
EP/knn     35   95.8    97.2   96.2   3

The results for G10,1 and G10,2 verify the ability of the GA and the EP to perform feature selection. Since all the features in these two data sets are drawn from identical distributions, the feature weights produced by the hybrid classifiers are valueless. For G10,1, the class-conditional distributions of feature values are well enough separated that 6 of the 10 features can be removed without a significant adverse effect on classification accuracy. For G10,2, the EC optimizers were unable to remove any features while maintaining maximal classification performance.

Results for G5 are slightly more interesting. Since the distributions for the two classes have different degrees of overlap for each feature, the quality of the feature weighting provided by the EC optimizers should have a bearing on the classification accuracy. For this data, both the GA and the EP were able to remove 2 of 5 features while maintaining classification accuracies near to that of the knn using all 5 features. In all cases, the EP/knn was able to obtain slightly better overall accuracy than the GA/knn, probably due to the EP's ability to fine-tune the final feature weights during the later phases of the optimization run.

Mixed Multivariate Data

Several of the artificially constructed data sets contained a mixture of Gaussian and uniformly distributed random feature values. As described in Section 3.1.5, M6 consists of Gaussian deviates, discrete uniform random values, and continuous uniform random values. This data set provides a test of the hybrid classifiers' ability to maintain recognition accuracy through feature weighting while reducing the number of features to consider. The set N5 consists of five features with increasing levels of random noise, and is designed to test the feature selection capability of the two EC methods, and the effect of feature selection on classification accuracy. The test methods for these data sets were identical to those described for the previous artificially constructed data sets. The results are summarized in Table 3.5.
Table 3.5: Apparent accuracy and bootstrap accuracy rates for the Bayes classifier, the knn classifier, and the EC/knn hybrid classifiers for artificially generated data sets consisting of features with a mixture of random distributions. For the Bayes classifier, accuracy is shown for re-classifying the training data (Train), and for classifying the independent testing data (Test). For the knn classifier, experiments were run using odd values of k ranging from 3 to 101 using the same independent training and tuning set for all experiments. The best accuracy on the tuning set (Tune) and the corresponding k value are reported. The best k value was then evaluated by reclassifying the training data (Train), and by classifying a new, independent test set (Test). For the EC/knn hybrid classifiers, five EC experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The k value for the best result, the accuracy on reclassification of the knn training data (Train), the accuracy when reclassifying the EC tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features).

M6        k    Train   Tune    Test    Features
Bayes     -    89.5    -       87.7    6
Knn       29   91.6    92.4    91.2    6
GA/knn    19   92.4    93.6    90.8    5
EP/knn    25   90.8    92.2    89.0    4

N5        k    Train   Tune    Test    Features
Bayes     -    94.4    -       81.6    5
Knn       9    86.5    86.8    82.8    5
GA/knn    11   85.1    86.2    80.9    4
EP/knn    7    85.6    85.8    79.3    4

Again, the hybrid methods were able to reduce the feature set size by one or more features with a minimal reduction in classification accuracy for these data sets. For all data sets discussed thus far, the EP and GA classifiers have produced similar values for k (having a difference of 6 or less).

3.2.2 Classification of Data from the UCI Repository

Four data sets (pima, wine, ionosphere, and hepatitis) were selected for comparison with the GADistAI iterative feature selection and construction method of Yang and Honavar (1998). As described previously, the testing and error rate estimation for these data sets were performed in a manner similar to that of Yang and Honavar to facilitate direct comparison. Classification and feature selection results for this data, averaged over 50 EC/knn experiments, are summarized in Table 3.6.

Unfortunately, the error rates reported for GADistAI were obtained using the same data set used by the GA to tune feature subsets and weights. This is essentially equivalent to the "Train/Tune" accuracy provided in Table 3.6 for the EC-hybrid classifiers. Since no external or bootstrap testing was done, it remains to be seen how the GADistAI algorithm generalizes to new data, and no comparison to the "Test" accuracy obtained by the EC/knn hybrid classifiers can be made.

In all four data sets the bootstrap test accuracy of the EC-hybrid classifiers is comparable to the best accuracy achieved by any classifier. For wine and pima, the best test results were, in fact, obtained by the EC/knn hybrids. Additionally, both hybrid classifiers were able to achieve a significant reduction in the number of features considered for all four data sets. The GA/knn hybrid classifier used the lowest mean number of features in classification of all the feature selection and classification methods presented. For the four data sets examined here, the GA/knn consistently reduced the number of features considered to roughly one-half the original number of features.

Table 3.6: Results of the hybrid EC/knn classifiers and the GADistAI algorithm on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. Train/Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. Features is the number of features with nonzero weights in the best performing weight set for each run; the mean value over all 50 runs is shown.
Hepatitis    Train/Tune   Test    Features
Bayes        85.3         65.7    19
Knn          87.1         73.4    19
GADistAI     97.1         -       9.2
GA/knn       86.0         69.6    8.1
EP/knn       87.2         73.1    8.9

Wine         Train/Tune   Test    Features
Bayes        98.8         94.7    13
Knn          94.9         94.3    13
GADistAI     99.4         -       6.7
GA/knn       99.7         94.8    6.0
EP/knn       99.5         93.2    6.2

Ionosphere   Train/Tune   Test    Features
Bayes        93.0         90.1    34
Knn          83.4         93.2    34
GADistAI     98.6         -       17.3
GA/knn       95.0         91.9    8.5
EP/knn       93.2         92.3    13.5

Pima         Train/Tune   Test    Features
Bayes        76.1         64.6    8
Knn          73.5         71.5    8
GADistAI     79.5         -       3.8
GA/knn       80.0         72.1    3.1
EP/knn       79.1         72.9    3.9

Another important factor to consider when comparing the results of the EC/knn hybrid classifiers to classical methods and other feature selection and classification techniques is the balance between the prediction accuracies among the various classes. That is, depending on the class distribution of the training data, some classifiers can perform very well on some classes at the expense of others. The cost function that drives the optimization process for the EC/knn hybrid classifiers penalizes for disparity in the prediction accuracies among classes. In some cases, this will cause the EC to reject weight sets that lead to higher overall prediction accuracy in favor of those that lead to a better balance in prediction accuracy among classes.

Table 3.7 summarizes the balance in predictive accuracy, as defined in Section 3.1.2, of the various classifiers on the four UCI data sets. The EC/knn hybrid classifiers obtained the best balance in all cases when reclassifying the GA tuning data. For the Hepatitis and Pima data sets, the GA/knn classifier was able to achieve significantly better balance than the traditional classifiers. The drop in balance between the tuning data and the bootstrap testing data for the Wine data set may indicate that some overfitting of the tuning data has occurred. No information regarding balance in predictive accuracy was provided for GADistAI.
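As a concrete illustration of how the balance values reported in Table 3.7 can be computed, the short sketch below derives per-class accuracies and their spread from true and predicted labels. It is an illustrative reimplementation under the definition used here (balance is the difference between the best- and worst-predicted classes), not the code used for these experiments.

```python
import numpy as np

def class_accuracies(y_true, y_pred):
    """Per-class predictive accuracy: the fraction of samples of each class
    that receive the correct label.  y_true and y_pred are NumPy arrays."""
    return {c: np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)}

def balance(y_true, y_pred):
    """Difference between the maximum and minimum per-class accuracies
    (the Bal column of Table 3.7); smaller values are better."""
    acc = class_accuracies(y_true, y_pred)
    return max(acc.values()) - min(acc.values())
```

A cost function that adds a penalty proportional to this balance value is one way to steer the EC away from weight sets that favor one class at the expense of another.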
Table 3.7: Balance in predictive accuracy among classes for various classifiers. The maximum accuracy (Max) and the minimum accuracy (Min) among the various classes are shown for each classifier, along with the difference between the two values (Bal). Lower values for Bal, indicating a smaller difference between the minimum and maximum predictive accuracies, are preferred. Train/Tune results refer to accuracies for reclassifying the training data for the Bayes and Knn classifiers, and for reclassifying the EC tuning data set for the GA/knn and EP/knn classifiers. Test refers to the accuracy when classifying an independent test set. The knn classifier was tested for odd values of k from 1 to 101, and the best results are shown.

                 Train/Tune              Test
Hepatitis    Min    Max    Bal       Min    Max    Bal
KNN          34.5   97.6   63.1      30.8   95.7   64.8
Bayes        17.5   98.8   81.3      6.7    96.5   89.9
GA/knn       67.3   89.7   22.5      45.4   82.2   36.8
EP/knn       81.9   88.3   6.4       48.7   85.9   37.2

                 Train/Tune              Test
Pima         Min    Max    Bal       Min    Max    Bal
KNN          68.5   78.4   9.9       62.8   73.4   10.6
Bayes        68.1   84.1   16.0      62.0   75.7   13.7
GA/knn       79.8   80.1   0.3       72.1   72.4   0.3
EP/knn       78.9   79.3   0.4       72.5   74.6   2.2

                 Train/Tune              Test
Ionosphere   Min    Max    Bal       Min    Max    Bal
KNN          68.8   97.9   29.1      72.3   97.5   25.2
Bayes        92.4   93.6   1.2       88.6   96.9   8.3
GA/knn       94.6   95.4   0.8       80.7   94.2   13.5
EP/knn       91.5   94.9   3.4       83.6   94.1   10.5

                 Train/Tune              Test
Wine         Min    Max    Bal       Min    Max    Bal
KNN          84.8   100.0  15.3      89.4   100.0  10.6
Bayes        97.8   98.8   1.0       93.2   94.8   1.7
GA/knn       99.3   100.0  0.7       90.6   99.4   8.8
EP/knn       99.1   99.8   0.7       88.3   98.6   10.4

3.2.3 Classification of Medical Data

The thyroid and appendicitis data sets were selected for comparison with a broader study encompassing several traditional classification methods (Weiss and Kapouleas, 1989, 1990), including the linear and quadratic discriminant functions, the 1-nearest-neighbor classifier, a neural network classifier trained using the backpropagation method, the CART tree classifier (Breiman et al., 1984), the naive Bayes classifier, and a 2nd-order Bayes classifier. Error rates for these data sets were evaluated using the bootstrap testing method described in Section 3.1.6. Five independent experiments were conducted, and the best weight set and k value for each experiment were subjected to 100 bootstrap tests. Weiss and Kapouleas report two accuracy estimates for each classifier. The first, training accuracy, is the accuracy obtained when reclassifying the training samples. Since training of the EC/knn hybrid classifiers is a two-step process, Weiss' training accuracy is best compared to the accuracy obtained by the hybrid classifier when reclassifying the tuning set. Weiss also reports a less biased estimate of accuracy based on an independent testing set; this measure is best compared to the bootstrap accuracy obtained by the hybrid classifiers. Tables 3.8 and 3.9 compare the results reported by Weiss and Kapouleas with those of the GA/knn hybrid classifier on the thyroid and appendicitis data sets, respectively.

Table 3.8: Results of various classifiers on hypothyroid data, as reported by Weiss, in comparison with that of the EC/knn hybrid classifiers. Train/tune refers to the accuracy obtained when reclassifying the training data, in the case of Weiss' results, or the tuning data, in the case of the EC/knn hybrid classifiers. Testing refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers.

Method                    Accuracy (train/tune)   Accuracy (testing)
GA/knn                    98.5%                   98.4%
Linear Discriminant       93.8%                   93.8%
Quadratic Discriminant    89.7%                   88.4%
Nearest Neighbor          100%                    95.3%
Bayes (independent)       97.1%                   96.1%
Bayes (2nd order)         97.7%                   92.4%
Neural Net (Back prop)    99.5%                   98.5%
Predictive Value Max.     99.8%                   99.3%
CART Tree                 99.8%                   99.4%
Table 3.9: Results of various classifiers on the appendicitis data, as reported by Weiss, in comparison with that of the EC/knn hybrid classifiers. As in the previous table, train/tune refers to the accuracy obtained when reclassifying the training data, in the case of Weiss' results, or the tuning data, in the case of the EC/knn hybrid classifiers. Testing again refers to the accuracy obtained on an independent test set for Weiss, and to the bootstrap accuracy estimate for the hybrid classifiers.

Method                    Accuracy (train/tune)   Accuracy (testing)
GA/knn                    90.4%                   90.6%
Linear Discriminant       88.7%                   86.8%
Quadratic Discriminant    79.3%                   73.6%
Nearest Neighbor          100%                    82.1%
Bayes (independent)       88.7%                   83.0%
Bayes (2nd order)         95.3%                   81.1%
Neural Net (Back prop)    90.0%                   85.8%
Predictive Value Max.     91.5%                   89.6%
CART Tree                 90.0%                   84.9%

Comparison with SFFS

The sequential floating forward selection (SFFS) algorithm of Pudil et al. (1994) has been shown to be a powerful technique for feature subset selection (Jain and Zongker, 1997). For comparison with the EC feature selection and extraction methods, SFFS was tested in conjunction with a knn classifier using odd values of k between 1 and 101 for the two medical data sets.

For the thyroid data, the sequential floating forward selection method achieved good classification results. The best accuracy obtained by the SFFS/knn algorithm during feature selection was 97.99%, using 6 of the 21 available features. A mean bootstrap accuracy of 98.06%, with a standard deviation of 0.6032%, was obtained over 100 bootstrap tests on this feature set. This accuracy is similar to those obtained by the various methods reported by Weiss. The GA feature extractor combined with a knn classifier obtained a similar accuracy, 98.48%, using only 3 of the available 21 features and a k value of 87. One hundred bootstrap tests for this set of feature weights yielded a mean bootstrap accuracy of 98.40%, with a standard deviation of 0.6256%.

For the appendicitis data, the best result was obtained for k = 7. The best predictive accuracy during selection was 88.46% using 3 of the 7 available features. Bootstrap testing for 100 trials using the best feature set found by SFFS yielded a mean predictive accuracy of 91.44% with a standard deviation of 3.94%. The GA feature extractor achieved a slightly higher accuracy than SFFS during extraction: 90.38% using 2 of 7 weighted features and k = 7. In bootstrap testing, however, the mean bootstrap accuracy over 100 trials proved to be similar to that of SFFS: 90.60% with a standard deviation of 4.21%.
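For readers unfamiliar with SFFS, the following sketch outlines the floating search loop of Pudil et al. (1994) as a wrapper around a subset-quality criterion such as the tuning-set accuracy of a knn restricted to the candidate features. It is a simplified reading of the algorithm; the function and variable names are assumptions, and it assumes d_max does not exceed the number of available features.

```python
import numpy as np

def sffs(n_features, criterion, d_max):
    """Sequential floating forward selection (after Pudil et al., 1994).
    criterion(list_of_feature_indices) -> score to maximize, e.g. the
    tuning-set accuracy of a knn using only those feature columns.
    Returns the best subset recorded at each size up to d_max."""
    best = {}                # best (score, subset) seen for each subset size
    subset = []
    while len(subset) < d_max:
        # Inclusion step: add the single most helpful remaining feature.
        remaining = [f for f in range(n_features) if f not in subset]
        score, f_add = max((criterion(subset + [f]), f) for f in remaining)
        subset.append(f_add)
        if score > best.get(len(subset), (-np.inf,))[0]:
            best[len(subset)] = (score, list(subset))
        # Conditional exclusion: drop features as long as doing so strictly
        # improves on the best subset previously recorded at the smaller size.
        while len(subset) > 2:
            d_score, f_drop = max(
                (criterion([g for g in subset if g != f]), f) for f in subset)
            if d_score > best.get(len(subset) - 1, (-np.inf,))[0]:
                subset.remove(f_drop)
                best[len(subset)] = (d_score, list(subset))
            else:
                break
    return best   # e.g. max(best.values()) gives the best subset found overall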
3.2.4 Classification of Protein Solvation Sites

The most extensive set of experiments was conducted on a sizeable set of biochemical data. These data, comprising the physical and chemical environments of 5542 protein-bound water molecules, were examined in light of two distinct objectives. The first goal of these experiments was to obtain good pattern recognition performance on various classes of water molecules, while the second objective was to identify the specific features which were consistently associated with membership in each class. The methods and results for these experiments are discussed in detail in Chapter 5, Identifying the Determinants of Solvent Binding in Proteins.

3.3 Discussion

In all of the data sets examined, the EC/knn hybrid classifiers exhibited an ability to reduce the number of features considered by the classifier significantly, while maintaining or slightly improving predictive accuracy, as compared to a non-weighted knn classifier using all available features. The overall predictive accuracy of the EC/knn classifiers was comparable to that of the best classifier tested for each data set examined, and in many cases the best accuracy was achieved by one of the EC/knn hybrids.

In terms of feature selection, the EC feature selection and extraction techniques equaled or exceeded the ability of other methods examined to reduce the number of features considered by the classifier, while maintaining similar or improved classification accuracy. For the medical data sets, the EC methods were able to meet or exceed the accuracy obtained using a knn classifier in conjunction with SFFS (Pudil et al., 1994), which is generally considered the most effective of the commonly-used feature selection methods (Jain and Zongker, 1997), while further reducing the number of features considered.

In addition, the parameterized nature of the cost function which drives EC optimization allows the EC methods to consider factors other than the apparent error rate in tuning the feature set and weights. One such factor is the balance in predictive accuracy among classes. On many data sets where the classical methods result in strong disparity of predictive accuracy among classes, the EC-hybrid methods are able to obtain better balance while maintaining overall accuracy.

Chapter 4

Variations on the Bayes Classifier using Evolutionary Computation-Based Learning

4.1 Methods

4.1.1 Bayesian Discriminant Functions

The Bayesian classifier has a computational advantage over the knn classifier in that the training data are summarized, rather than stored. The comparison of each test sample with every known training sample to find nearest neighbors during knn classification is a computationally expensive process, even when efficient search methods are employed (Fukunaga and Narendra, 1975). In contrast, finding the marginal probability associated with a particular feature value is computationally efficient for both the parametric and nonparametric forms of the Bayesian classifier. Since EC-based hybrid classifiers require many classifications to be performed during feature selection and extraction,¹ the use of a computationally efficient classifier such as a Bayes classifier is expedient.

¹The actual number of classifications to be performed is typically O(npt), where n is the number of generations for the EC experiment, p is the number of individual solutions in the EC population, and t is the number of samples in the EC tuning set. If all-neighbors search is conducted for each classification, the number of distance comparisons is O(npts), where s is the number of samples in the knn training set.

Unfortunately, the direct application of the Bayes classifier to the problem is not effective, because the Bayes decision rule is invariant to linear scaling of the feature space. In other words, multiplying the feature values for a given feature by a constant has no effect on the class-conditional probabilities considered by the classifier, as illustrated in Figure 4.1. Direct scaling of the marginal probabilities is also ineffective for the naive Bayes classifier, since the joint class-conditional probabilities are simply the products of the marginal probability values.

[Figure 4.1: two histograms of a single feature's training values, (a) raw and (b) scaled by 10, each showing the fraction of training samples per bin; in both cases the highlighted bin gives p(14 < x < 16) = p(140 < x < 160) = 0.045.]

Figure 4.1: The Bayes decision rule is invariant to linear transformations of the feature space. For the feature shown here, the raw feature values (a) have been multiplied by 10 in (b). Using a nonparametric Bayes classifier, we find that the original feature value falls in the bin 14-16 (black rectangle) in the original histogram. The scaled feature falls in the equivalent bin of histogram (b), and the histogram values (marginal probabilities) of the two bins are identical, so the scaling has no bearing on the classification results.

There are, nevertheless, several aspects of the Bayesian classifier that, when optimized, can yield better classification performance. One such area is the manner in which the marginal probabilities for each feature will be combined into the multivariate class-conditional probability densities. As noted previously (see Section 2.1.2), the most common approach is to assume that all features are independent. For the resulting naive Bayes classifier, the class-conditional probability is the product of the marginal probabilities for each feature. A more general approach would be to encode the entire d x d covariance matrix describing the interrelationships between all the features being considered, and allow an EC-based optimizer to search for the covariance matrix which best describes the true multivariate distribution of the training data.
Unfortunately, the search space involved in finding this covariance matrix grows as 2^(d^2), even if the elements of the covariance matrix are binary-valued. For real-valued matrix elements, the search space quickly becomes intractable, even for small problems.

We can simplify the problem somewhat by looking at the Bayesian classifier as a discriminant function. As outlined in Section 2.1.2, the Bayes decision rule can be written as follows: given x, decide class 1 (ω_1) if g(x) > 0, and class 2 (ω_2) if g(x) < 0, where g(x) = P(ω_1|x) - P(ω_2|x). The classification when g(x) = 0 is arbitrary. The discriminant function, then, is uniquely associated with a particular classifier, mapping an input feature vector to a value associated with a particular class. According to Duda and Hart:

    We can always multiply the discriminant functions by a positive constant or bias them by an additive constant without influencing the decision. More generally, if we replace every g_i(x) by f(g_i(x)), where f is a monotonically increasing function, the resulting classification is unchanged. (Duda and Hart, 1973, pp. 17-18)

Thus, we can design a parameterized classifier based on the concept of the discriminant function. We begin with a discriminant function based on the Bayes decision rule. Using this function as a model, we can design similar functions which classify well, but are more easily parameterizable for hybridization with EC optimization methods. After designing such a discriminant function and identifying the tunable parameters, we can use an EC to optimize these parameters with regard to a particular set of training and tuning data. Two distinct techniques were designed and tested using this method. The first is a simple nonlinear weighting of the Bayes discriminant function, while the second is a novel discriminant function based on the summation of the class-conditional marginal probabilities.
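The invariance illustrated in Figure 4.1 can be checked numerically with a toy, one-feature example: with a fixed number of histogram bins spanning the training range, multiplying the feature (and the query value) by a constant rescales the bin edges but leaves every bin's marginal probability, and therefore the decision, unchanged. The snippet below is only a demonstration of this point; the helper function is hypothetical and is not part of the classifiers developed in this chapter.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(10, 3, 500)      # training values of one feature, class 1
x2 = rng.normal(20, 3, 500)      # training values of one feature, class 2

def marginal(train, value, bins=20):
    """Histogram estimate of p(value | class) over the training range."""
    counts, edges = np.histogram(train, bins=bins)
    probs = counts / counts.sum()
    idx = np.clip(np.searchsorted(edges, value, side="right") - 1, 0, bins - 1)
    return probs[idx]

for scale in (1.0, 10.0):        # raw feature vs. the feature multiplied by 10
    v = 15.0 * scale
    p1, p2 = marginal(x1 * scale, v), marginal(x2 * scale, v)
    print(scale, p1, p2, "class 1" if p1 > p2 else "class 2")
# Both scales yield identical marginal probabilities and the same decision,
# so a simple linear scaling of a feature cannot act as a feature weight here.
```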
4.1.2 Nonlinear Weighting of the Bayes Discriminant Function

Consider the Bayes discriminant function,

    g(x) = P(ω_1|x) - P(ω_2|x)
         = [P(x|ω_1) P(ω_1) - P(x|ω_2) P(ω_2)] / Σ_{i=1}^{2} P(x|ω_i) P(ω_i)        (4.3)

The denominator can be eliminated, since it does not affect the sign of g(x), and thus does not affect the resulting classification. Since a > b implies log(a) > log(b), we can apply the log function to the a posteriori probabilities without changing the resulting classification. Thus, the following discriminant function is equivalent to the naive Bayes discriminant:

    g(x) = log(P(x|ω_1) P(ω_1)) - log(P(x|ω_2) P(ω_2))                               (4.4)
         = (log P(x|ω_1) + log P(ω_1)) - (log P(x|ω_2) + log P(ω_2))                 (4.5)

where

    log P(x|ω_i) = log P(x_1|ω_i) + log P(x_2|ω_i) + ... + log P(x_d|ω_i)            (4.6)

Finally, we can parameterize this discriminant function, while maintaining a similar level of classification accuracy, by adding coefficients to each of the marginal probabilities:

    P*(x|ω_i) = C_1 log P(x_1|ω_i) + C_2 log P(x_2|ω_i) + ...
                + C_d log P(x_d|ω_i) + log P(ω_i)                                     (4.7)

The values for the coefficients, C_1 ... C_d, are supplied by an EC optimizer. The effect of these coefficients is to apply a nonlinear weighting to each of the marginal probabilities, which are then combined to produce a confidence value, P*, for each class. While P*(x|ω_i) is no longer a joint probability distribution, the discriminant function is equivalent to the naive Bayes discriminant function when C_1 = C_2 = ... = C_d = 1. Furthermore, the new function has several desirable features for hybridization with an EC optimizer. When a particular coefficient, C_j, is reduced to zero, the associated feature value, x_j, is effectively eliminated from consideration by the classifier. This allows us to perform feature selection in conjunction with classifier tuning. Furthermore, when the value of a coefficient, C_i, is increased, the marginal probability value for the associated feature, x_i, has an increased influence on the confidence value, P*(x|ω_i), for each class.

GA Optimization of the Nonlinear Discriminant Coefficients

The implementation for this discriminant function was based on the previously described nonparametric naive Bayes classifier. The marginal probability distributions for each feature were approximated using histograms with 20 bins each. Gaussian smoothing was used to mitigate sampling anomalies, as described earlier, with σ = 2.0. The GA cost function was identical to that described in Section 3.1.2, with the exception that there is no voting involved in classification using the discriminant function classifier, so there was no corresponding incorrect-votes term in the fitness function. The coefficient values, except for C_votes, were set identically to those shown in Table 3.1.

The EC chromosome was also similar to that designed for the GA/knn hybrid classifier, although there is no k value included. Each coefficient is represented as a 32-bit integer scaled by division to produce a real value over the range [0.0, 100.0]. Masking was used to aid feature selection, with 5 mask bits for each coefficient. An example of the EC chromosome for optimization of the nonlinear discriminant coefficients is shown in Figure 4.2. GA and EP run parameters were set identically to those used in the corresponding hybrid knn classifiers, as shown in Tables 3.2 and 3.3.

[Figure 4.2: the discriminant P*(x|ω_j) = C_1 log(P(x_1|ω_j)) + C_2 log(P(x_2|ω_j)) + ... + C_4 log(P(x_4|ω_j)) + log(P(ω_j)), shown above a chromosome consisting of weight fields W_1...W_4 followed by mask fields M_1...M_4.]

Figure 4.2: An example of the EC chromosome for optimization of the nonlinear discriminant coefficients. A four-dimensional problem is shown. Each coefficient, C_i, in the discriminant function is determined by the chromosome weight, W_i, and the masking field, M_i.
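A minimal sketch of the nonlinear-weighted discriminant of Equation 4.7 is given below, assuming 20-bin histogram estimates of the marginal probabilities (Gaussian smoothing is omitted for brevity, and a small epsilon keeps the logarithms finite for empty bins) and externally supplied coefficients such as those decoded from the EC chromosome. The class and method names are illustrative assumptions, not the dissertation's implementation.

```python
import numpy as np

class NonlinearBayesDiscriminant:
    """Weighted form of the naive Bayes discriminant (Eq. 4.7):
    P*(x|w_i) = sum_j C_j * log P(x_j|w_i) + log P(w_i)."""

    def __init__(self, bins=20, eps=1e-6):
        self.bins, self.eps = bins, eps

    def fit(self, X, y):
        """X: (n, d) array of training samples; y: (n,) array of labels."""
        self.classes = np.unique(y)
        self.priors = {c: np.mean(y == c) for c in self.classes}
        self.edges, self.hist = {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.edges[c], self.hist[c] = [], []
            for j in range(X.shape[1]):
                counts, edges = np.histogram(Xc[:, j], bins=self.bins)
                self.edges[c].append(edges)
                # eps avoids log(0) where a bin received no training samples
                self.hist[c].append(counts / counts.sum() + self.eps)
        return self

    def _log_marginals(self, x, c):
        logs = []
        for j, v in enumerate(x):
            idx = np.clip(np.searchsorted(self.edges[c][j], v, side="right") - 1,
                          0, self.bins - 1)
            logs.append(np.log(self.hist[c][j][idx]))
        return np.array(logs)

    def predict(self, X, coeffs):
        """coeffs: EC-supplied weights C_1..C_d; a zero coefficient removes
        the corresponding feature from consideration entirely."""
        coeffs = np.asarray(coeffs, dtype=float)
        preds = []
        for x in X:
            scores = {c: coeffs @ self._log_marginals(x, c) + np.log(self.priors[c])
                      for c in self.classes}
            preds.append(max(scores, key=scores.get))
        return np.array(preds)
```

With all coefficients set to 1 this reduces to an ordinary histogram-based naive Bayes classifier, which is the equivalence noted after Equation 4.7.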
4.1.3 A New Discriminant Function Based on Summation of the Class-Conditional Marginal Probabilities

In the process of developing the nonlinear discriminant function, several alternatives, also derived from the Bayes discriminant function, were formulated. Of these alternatives, a discriminant function based on a linear combination of the marginal probability values showed initial promise in combination with EC coefficient optimization. As with the nonlinear discriminant, this sum-based discriminant function assigns a confidence value, P*(x|ω_i), to each class, ω_i. In this case, however, the value of P* is simply a weighted sum of the marginal probabilities for each feature value. That is,

    P*(x|ω_i) = C_1 P(x_1|ω_i) + C_2 P(x_2|ω_i) + ... + C_d P(x_d|ω_i)        (4.8)

Again, the marginal probabilities, P(x_1|ω_i) ... P(x_d|ω_i), were estimated using a histogram consisting of 20 bins. Gaussian smoothing was also employed, with σ = 2.0. Additionally, the representation of the coefficients on the chromosome, the EC cost function, and the EC parameter settings were identical to those detailed above for the Bayes-derived nonlinear discriminant function.
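The sum-based discriminant of Equation 4.8 differs from the nonlinear form only in how the marginals are combined, so a sketch needs little more than a smoothed histogram estimator and a weighted sum. The version below assumes that the smoothing σ = 2.0 is measured in histogram bin widths, which is an interpretation rather than a detail stated here, and all helper names are hypothetical.

```python
import numpy as np

def smoothed_histogram(values, bins=20, sigma=2.0):
    """Histogram estimate of one feature's class-conditional marginal,
    smoothed by convolving the bin counts with a Gaussian kernel
    (sigma assumed to be in bin widths)."""
    counts, edges = np.histogram(values, bins=bins)
    radius = int(3 * sigma)
    kernel = np.exp(-0.5 * (np.arange(-radius, radius + 1) / sigma) ** 2)
    kernel /= kernel.sum()
    smoothed = np.convolve(counts.astype(float), kernel, mode="same")
    return smoothed / smoothed.sum(), edges

def sum_discriminant(x, models, coeffs):
    """Confidence P*(x|w_i) = sum_j C_j * P(x_j|w_i) (Eq. 4.8); unlike the
    nonlinear form, no prior term appears.  models[c] is a list of
    (probs, edges) pairs, one per feature, built from class-c training data."""
    scores = {}
    for c, feats in models.items():
        total = 0.0
        for j, (probs, edges) in enumerate(feats):
            idx = np.clip(np.searchsorted(edges, x[j], side="right") - 1,
                          0, len(probs) - 1)
            total += coeffs[j] * probs[idx]
        scores[c] = total
    return max(scores, key=scores.get)

# Hypothetical usage with training arrays X (n x d) and labels y:
#   models = {c: [smoothed_histogram(X[y == c][:, j]) for j in range(X.shape[1])]
#             for c in np.unique(y)}
#   label = sum_discriminant(sample, models, coeffs)
```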
4.2 Results

A GA was employed to optimize the parameters of each of the discriminant function based classifiers. The two resulting hybrid classifiers were then tested on the same artificial and real-world data sets as the EC/knn hybrid classifiers, under identical training and testing conditions. Based on the similarity of the results for the GA and the EP optimizers when hybridized with a weighted knn classifier, only the GA optimizer was tested in conjunction with the discriminant function classifiers.

4.2.1 Artificial Data Sets

Results of the two parameterized discriminant function classifiers on data sets consisting of independent, Gaussian-distributed features are summarized in Table 4.1. Performance on these simple data sets is similar to that of the EC/knn hybrid classifiers, but the computation time is significantly less since all-pairs neighbor searching is no longer required. The nonlinear discriminant function seems to show slightly better feature selection performance than the other classifiers, as it is the only classifier to reduce the feature set for G10,1 down to 3 features, and G10,2 down to 9 features. There is, however, a performance penalty associated with this reduction in features: the EC/knn hybrid classifiers obtained better accuracy than the nonlinear discriminant on both of these data sets. When the number of features used is equal, as in the G5 data set, the performance of the nonlinear discriminant is similar to that of the EC/knn hybrid classifiers.

Table 4.1: Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially-generated data sets consisting of feature values drawn from independent Gaussian distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclassifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes) and the two EC/knn hybrid classifiers are reproduced here from Table 3.4 for comparison.

G10,1       Train   Tune    Test    Features
Bayes       100     -       100     10
Nonlinear   99.0    99.6    99.1    3
Sum         99.1    99.6    98.9    4
GA/knn      98.9    100     99.4    4
EP/knn      100     100     100     4

G10,2       Train   Tune    Test    Features
Bayes       92.9    -       91.5    10
Nonlinear   91.3    92.5    89.9    9
Sum         88.6    90.2    87.7    10
GA/knn      94.2    94.8    93.1    10
EP/knn      94.6    93.8    93.1    10

G5          Train   Tune    Test    Features
Bayes       96.0    -       95.6    5
Nonlinear   94.8    96.2    95.1    3
Sum         91.5    94.7    93.3    3
GA/knn      94.7    97.2    95.7    3
EP/knn      95.8    97.2    96.2    3

Table 4.2 summarizes the results for data sets consisting of feature values drawn from a mixture of uniform random and Gaussian distributions. These experiments were conducted identically to the previous set, and the results were similar. The nonlinear discriminant function outperformed all other classifiers for the data set N5, but was outperformed by the EC/knn hybrids for M6. The summation-based discriminant function found a unique solution for the N5 data set. Although the accuracy for this solution was considerably lower than that of the other classifiers, the data set was classified using only a single feature.

Table 4.2: Performance of the nonlinear discriminant function (Nonlinear) and the summation-based discriminant function (Sum) on artificially-generated data sets consisting of features with a mixture of random distributions. For both discriminant functions, five GA experiments were conducted using the same data sets but differing initial random starting conditions, and the best result is reported. The accuracy on reclassification of the training data (Train), the accuracy when reclassifying the GA tuning data (Tune), and the accuracy when classifying an independent test set (Test) are provided along with the number of non-zero feature weights (Features). The results for the naive Bayes classifier (Bayes) and the two EC/knn hybrid classifiers are reproduced here from Table 3.5 for comparison.

M6          Train   Tune    Test    Features
Bayes       89.5    -       87.7    6
Nonlinear   89.2    89.0    88.1    5
Sum         88.2    88.1    87.6    6
GA/knn      92.4    93.6    90.8    5
EP/knn      90.8    92.2    89.0    4

N5          Train   Tune    Test    Features
Bayes       94.4    -       81.6    5
Nonlinear   81.4    83.7    81.5    4
Sum         69.1    71.0    66.5    1
GA/knn      85.1    86.2    80.9    4
EP/knn      85.6    85.8    79.3    4

Once again, the balance in predictive accuracy among classes is an important factor to consider when comparing the results of the various EC-hybrid classifiers with that of the Bayes classifier. Since the EC cost function is biased to favor solutions with good balance in accuracy between classes, it will prefer such solutions to those with disparity among the classes, possibly at the cost of overall prediction accuracy. Table 4.3 summarizes the predictive balance obtained by the various classifiers for the M6 and N5 data sets. All the EC-hybrid methods except the EP/knn classifier outperform the naive Bayes classifier in terms of class balance.

Table 4.3: Balance in predictive accuracy among classes for the various EC-hybrid classifiers on the data sets M6 and N5. Predictive accuracy on the independent test set is shown for each class. Balance is defined as the difference in predictive accuracy between class 1 and class 2, as described in Section 3.1.2.
M6          Class 1   Class 2   Overall   Balance
Bayes       91.80     84.40     88.10     7.4
Nonlinear   89.40     85.80     87.60     3.6
Sum         93.20     88.40     90.80     4.8
GA/knn      92.00     86.00     89.00     6.0
EP/knn      93.00     82.40     87.70     10.6

N5          Class 1   Class 2   Overall   Balance
Bayes       76.60     86.40     81.50     9.8
Nonlinear   62.20     70.80     66.50     8.6
Sum         79.20     82.40     80.80     3.2
GA/knn      78.60     80.00     79.30     1.4
EP/knn      72.20     91.00     81.60     18.8

4.2.2 Classification of the UCI Data Sets

Classification of real-world data from the UCI machine learning data set repository not only allows a more realistic assessment of the performance of the various classifiers on data with erroneous and missing feature values, but also allows the classifiers to be tested on data sets with larger numbers of features than the artificially constructed data sets described above. Table 4.4 compares the classification and feature selection performance of the two discriminant-function-based classifiers with that of the EC/knn hybrid classifiers and the naive Bayes classifier.

The most evident aspect of the results on these four data sets is the feature selection capability demonstrated by the nonlinear discriminant function. For three of the four data sets, the minimum number of features used in classification was found by the nonlinear discriminant function in conjunction with the GA. Additionally, for the hepatitis data, the test accuracy obtained by the two discriminant function classifiers surpassed the other classifiers tested. For the other three data sets the accuracies obtained by the discriminant methods were similar to those obtained by other methods tested. The notable difference between Tune and Test results for the hepatitis and ionosphere data sets suggests that the discriminant classifiers may be more prone to overfitting of the training and tuning data than the other classifiers.

Examination of the run times for the UCI data sets illustrates the advantage held by the discriminant-function-based classifiers over the EC/knn hybrid classifiers in terms of computational efficiency. Table 4.5 compares the execution times for 200 generations of GA optimization in conjunction with the nonlinear discriminant function and the knn classifier. In all cases the nonlinear discriminant classifier is significantly faster than the GA/knn; in the case of the Pima Indian diabetes data set the difference is nearly tenfold.

Table 4.4: Results of the nonlinear-weighted discriminant function (Nonlinear) and the discriminant function based on summation of the marginal probabilities (Sum) on various data sets from the UCI Machine Learning data set repository, averaged over 50 runs. Train/Tune refers to the accuracy obtained when reclassifying the data used by the EC in tuning (optimizing) feature subsets and weights. Test refers to the accuracy obtained on an independent test set for each experiment, disjoint from the training and tuning sets. The number of features is the mean number of features used in classification over all 50 runs. Performance for the EC/knn classifiers is repeated here from Table 3.6 for comparison.
Hepatitis    Train/Tune   Test    Features
Bayes        85.3         65.7    19
Nonlinear    95.4         79.4    6.5
Sum          91.4         79.4    8.2
GA/knn       86.0         69.6    8.1
EP/knn       87.2         73.1    8.9

Wine         Train/Tune   Test    Features
Bayes        98.8         94.7    13
Nonlinear    97.8         91.3    4.5
Sum          99.0         92.3    7.6
GA/knn       99.7         94.8    6.0
EP/knn       99.5         93.2    6.2

Ionosphere   Train/Tune   Test    Features
Bayes        93.0         90.1    34
Nonlinear    97.6         87.5    8.5
Sum          93.8         84.2    11.2
GA/knn       95.0         91.9    8.5
EP/knn       93.2         92.3    13.5

Pima         Train/Tune   Test    Features
Bayes        76.1         64.6    8
Nonlinear    76.2         70.4    3.9
Sum          74.0         70.8    3.4
GA/knn       80.0         72.1    3.1
EP/knn       79.1         72.9    3.9

Table 4.5: Execution times (wall clock time) for 200 generations of GA optimization of the knn and nonlinear discriminant function classifiers. For each data set, the number of features (d), the number of classes (C), the combined training and tuning set size (n), and the mean execution time (hours:minutes:seconds) over 50 runs are shown. Each run was executed on a single 250 MHz UltraSPARC-II CPU of a six-CPU Sun Ultra Enterprise system with 768 MB of system RAM. Runs were executed in sets of 5 with no other user processes present on the system.

Data set     d    C   n     knn       nonlinear
Pima         8    2   400   1:40:13   0:10:52
Hepatitis    19   2   240   1:05:48   0:24:42
Ionosphere   34   2   400   2:02:25   0:43:37
Wine         13   3   240   0:23:59   0:14:39

4.2.3 Classification of Medical Data

For the thyroid and appendicitis data, the two discriminant-function-based classifiers were trained and tested in the same manner as the EC-hybrid classifiers (Section 3.2.3). For each data set, five experiments were conducted for each classifier. The appendicitis data set was re-partitioned into disjoint training/tuning and testing sets for each experiment. The much larger thyroid data set was pre-partitioned into training and testing sets in the UCI database (Quinlan et al., 1986). For this data, only the initial random GA population was changed for each experiment. The results of these experiments are summarized in Table 4.6.

Both discriminant function based classifiers performed well on the hypothyroid diagnosis data. The nonlinear classifier utilized a smaller feature set than the GA/knn on this data, at a slight cost in bootstrap test accuracy. Both discriminant function classifiers seem to exhibit overfitting of the tuning data for the appendicitis data set.

Table 4.6: Accuracy of various classifiers on the hypothyroid and appendicitis diagnosis data sets. Results for the discriminant function classifiers are averaged over five GA experiments. Results for the GA/knn classifier represent the best of five experiments. Train/Tune refers to the accuracy obtained in reclassifying the GA tuning set; Test refers to bootstrap accuracy over 100 bootstrap sets.

Thyroid        Train/Tune   Test   Features
GA/knn         98.5         98.4   3
Nonlinear      97.7         97.2   2.7
Sum            97.8         97.4   4.2

Appendicitis   Train/Tune   Test   Features
GA/knn         90.4         90.6   2
Nonlinear      80.4         67.0   2.6
Sum            83.0         74.2   2.2

This behavior, along with the previous results for the hepatitis and ionosphere data sets, suggests that the discriminant function classifiers may be prone to overfitting when presented with small (< 50 samples of each class) data sets for training, tuning, and testing.

4.3 Discussion

A key advantage of the discriminant function classifiers over the nearest-neighbor methods is the gain in computational efficiency obtained by estimating the class-conditional feature value distributions from the training data, rather than storing every training sample and performing an all-pairs search for near neighbors for each test sample.
While the experiments here were all executed for a fixed number of EC generations, it would be worthwhile to run several experiments constrained instead by wall-clock time. In this way, the efficiency advantage of the discriminant-function-based classifiers might be translated into further gains in classification accuracy relative to the near-neighbor methods.

The nonlinear discriminant function classifier, in conjunction with the GA feature extraction method, seems to exhibit the best feature selection capability of all the classifiers evaluated. In several cases, however, the additional reduction in the number of features, as compared to the GA/knn classifier, incurred a slight cost in terms of classification accuracy.

The promise exhibited by the discriminant-function-based classifiers suggests several avenues for further investigation. One possible improvement would be to include the prior probabilities for each class on the EC chromosome for optimization. Intuitively, this might allow the hybrid classifier to maintain more control over the balance in predictive accuracy among classes, even when there is disparity in the number of training and tuning samples available for each class. Initial experimentation in this area, however, suggested that inclusion of the prior probabilities on the chromosome can exacerbate the problem of overfitting the training and tuning data, resulting in poor performance on independent test data.

Since the capabilities of the Bayes classifier have been well explored in the literature, an alternate approach would be to avoid direct modification of the Bayes discriminant function. Instead, the independence assumption of the naive Bayes classifier might be abandoned, and the relationship among the various distributions of feature values encoded on the EC chromosome as a covariance matrix, or some subset thereof. Essentially, this approach would allow the EC to estimate the covariance matrix for each class based on both the training and tuning data provided. In conjunction with the previously proposed technique of including the a priori probabilities for each class on the chromosome, all the parameters of a traditional Bayes classifier might be optimized by the EC for a particular data set and error cost function.

Chapter 5

Identifying the Determinants of Solvent Binding in Proteins

5.1 Introduction

The most extensive application and testing of the various EC-based feature selection and extraction methods described here has been performed in the field of biochemistry. The specific problem area investigated was that of evaluating and understanding protein-water interactions. Understanding the nature of protein solvation (water binding) is a significant outstanding problem in structural biochemistry, and has been the subject of numerous studies, yielding a substantial body of literature (Palma et al., 1993; Westhof, 1993; Ladbury, 1995; Karplus and Faerman, 1994; Levitt and Park, 1993). Elucidation of the nature of protein-water interactions would be a significant contribution to a broad range of biochemical endeavors, including de novo drug design, database screening for drug leads, ligand docking, understanding of protein structure and function, and even the grand challenge "protein folding" problem.

5.1.1 An Overview of the Protein-Water Problem

Proteins are macromolecules consisting of linear (non-branching) chains of amino acids.
The twenty common amino acids are joined covalently during protein synthesis to form the protein "main chain" or "backbone". A particular protein can be completely specified by the order of the different amino acids that form its main chain. After synthesis, a protein will fold into a particular three-dimensional structure (Figure 5.1). Since this structure is generally a compact globule, folded, soluble proteins are occasionally referred to as "globular" proteins. In the context of a living organism, numerous water molecules will be bound, via hydrogen bonds, steric, and other interactions, to the surface of the protein in its globular form.

Figure 5.1: A particular sequence of amino acids will consistently fold into the same three-dimensional, "globular" structure. Each of the 20 common amino acids is identified by a one-letter code. The linear protein chain (left) folds into its globular structure (right) after synthesis. Water molecules, shown here as spheres, bind to the surface of the protein via hydrogen bonds.

While proteins have many functions in the context of a living organism, most proteins bind to a specific molecule (or set of molecules) and perform some action upon that molecule, such as transportation, cleavage, or catalysis of other chemical reactions. The particular molecule that a protein binds in this manner is called the ligand, and the area where the ligand binds is referred to as the protein's active site. Like the rest of the protein surface, the residues in the active site will typically bind water molecules. When a protein binds to a particular ligand, some of the water molecules in the active site will be displaced by the ligand, while some will remain in the area of the interface between the protein and the ligand. Those water molecules that remain in the interface will often form hydrogen bonds to both the protein and the ligand, thus helping the protein to bind to the ligand by forming a water-mediated "bridge" between the two molecules (Raymer et al., 1997; Poornima and Dean, 1995c). One study of 20 protein/ligand complexes found that the average protein/ligand interface includes 10 water molecules and 17 water-mediated bridges between the protein and ligand (A. Cayemberg & L. A. Kuhn, unpublished results).

In the process of designing a ligand to bind to a particular active site, such as a pharmaceutical drug, water molecules that are not displaced from the active site upon ligand binding must be considered, as they form an essential part of the protein surface (Poornima and Dean, 1995a). Unfortunately, during ligand design it is generally unknown which water molecules will remain bound upon ligand binding, and which will be displaced. If the sites of interfacial water binding were known, this information could be used to improve the chemical and shape complementarity of the ligand in the design process.

The primary goal in analyzing protein-water interactions is to isolate the determinants of water binding: the physical and chemical features of a local protein environment that determine whether or not the site is conducive to water binding. In terms of classification, it is desirable to categorize areas near a protein surface into one of three classes: sites that are unlikely to bind water molecules, "weak" solvation sites that bind water molecules but are not likely to be conserved during ligand binding, and "strong" solvation sites that are likely to be conserved between the ligand-bound and unbound forms of a protein (Ringe, 1995).
5.1.2 State of the Art in Predicting Protein Solvation

X-ray Crystallographic Data Collection

The input for most algorithms that predict protein solvation consists of the three-dimensional structure of the protein, usually represented as the Cartesian coordinates of each heavy (non-hydrogen) atom in the protein. This structure is generally obtained via X-ray crystallography. In this process, a protein is first purified and dissolved in an aqueous solvent. From this solution, microscopic crystals (with edges on the order of tenths of millimeters) are grown. These crystals consist of roughly equal amounts of the protein, arranged in a lattice, and interstitial water molecules. The crystals are then suspended in small capillaries and exposed to a beam of X-ray radiation. The electrons in the protein crystal diffract the X-rays, and the diffraction pattern is captured in two dimensions by exposing X-ray sensitive film or an electronic detector to the diffracted X-rays. The diffraction pattern on the exposed film can then be digitized and subjected to a Fourier transformation, resulting in a three-dimensional map of the electron density in the crystal. This electron density map is equivalent to a time-averaged and space-averaged¹ snapshot of the locations of the electrons in the protein crystal. From this information, the three-dimensional structure of the protein can be deduced.

¹The electron density map is space-averaged in the sense that the average electron density over all the copies of the protein in the crystal is measured.

The result of an X-ray crystallographic experiment, often referred to as a crystallographic structure, contains several other pieces of information along with the coordinates of the atoms. Each atom is assigned a crystallographic temperature factor, or B-value, which provides a relative measure of the thermal mobility of the atom. A related measurement, the occupancy of the atom, provides an estimate of the percentage of time that a site is occupied by a given atom type over the experiment time and throughout the various copies of the protein in the crystal. Generally, the information used for training and testing in empirical solvent-site prediction algorithms is calculated from the atomic coordinates, thermal mobility, and occupancy data obtained via crystallography. Occasionally, raw electron density data is also considered. From the crystallographic coordinates, other physical and chemical features of each protein atom can be calculated. Further details and examples are provided in Section 5.2.3.

Current Approaches to Solvent Site Prediction

Current algorithms for predicting the locations of bound water molecules can be divided into two classes. Theoretical approaches, such as GRID (Wade et al., 1993), use a potential energy function to evaluate the favorability of a probe site. For GRID, the potential energy function includes terms for Lennard-Jones interactions, electrostatic interactions, and a detailed evaluation of potential hydrogen bonds. For all theoretical approaches, the relative contribution of the terms in the potential energy function must be determined before solvation sites can be predicted. In contrast, empirical methods determine the favorability of a solvation site by analogy with known sites. A site is evaluated and compared to a database of known solvation sites and non-solvated sites, and predicted as being more similar to one than the other.
A set of features to observe and compare between solvation sites and non-solvated sites must be selected prior to classification. Several empirical methods for prediction of protein-bound water molecule locations have been developed. AQUARIUS2 (Pitt et al., 1993) uses a knowledge base of the distributions of water molecules around polar atoms at the protein surface. A "likelihood" value is assigned to a putative water molecule location based on its geometric relationship to nearby polar protein atoms. If the site lies in a region highly occupied by water molecules in the knowledge base and has significant electron density, as determined by X-ray crystallography, it receives a higher score. The highest scoring positions in a 3-dimensional matrix surrounding the protein are selected as likely water molecule locations. AUTO-SOL (Vedani and Huhta, 1991) predicts water sites based on the directionality of hydrogen bonds. A database of small-molecule crystal structures was analyzed to find the distribution of hydrogen-bond lengths and directions, and possible solvent sites are evaluated by AUTO-SOL according to how well their hydrogen-bond geometry matches this database. Current methods can reproduce ~70% of crystallographically observed solvent molecules within 1.5 Å of the experimental locations (Vedani and Huhta, 1991). There remains, however, a tendency for many current methods to produce false positives, predicting solvation sites where none are observed in the crystal structure.

Both empirical and theoretical methods require the evaluation of a set of features describing the environment of the site. Commonly-used features include hydrogen bond geometry, electrostatic interactions, local thermal mobility, protein surface shape, solvent accessibility, and the packing geometry, atom type, and residue type of neighboring protein atoms. Several studies have evaluated the correlation between individual features and water binding. For instance, in a study of 56 high-resolution crystallographic structures, a strong relationship was found between local surface shape and water binding (Kuhn et al., 1992). Deep grooves in the solvent-accessible surface bound three times as many water molecules as non-groove surface areas. Results obtained later also showed that deep surface grooves were the preferred sites for tightly-bound water molecules, which were unlikely to be displaced by ligand binding (Poornima and Dean, 1995b). Furthermore, conserved polar contacts between water molecules and the protein were found to be associated with solvation-site conservation (Poornima and Dean, 1995c).

As noted previously, X-ray crystallographic assignment is an imperfect source of information for known water sites, resulting in both false-positive and false-negative sites in any database based on crystallographic data. One method for identifying reliable water molecule positions is to utilize solvation sites conserved in multiple crystallographic structures. In a study of eighteen crystallographically independent structures representing ten unique crystal forms of T4 lysozyme, it was found that ~79% of the 20 most frequently occupied sites were conserved in all structures when allowing for potential steric interference induced by the crystal lattice (Zhang and Matthews, 1994).
A recent study of conserved solvation sites in thrombin, trypsin, and bovine pancreatic trypsin inhibitor (BPTI) found that 19% of unique water sites in 10 superimposed thrombin structures were conserved in at least half of the structures, while 77% of the unique sites in three superimposed trypsin structures were conserved in at least two of the three structures (Sanschagrin and Kuhn, 1998). In perhaps the most extreme change in crystallization conditions, the structure of subtilisin Carlsberg was solved in an organic solvent, anhydrous acetonitrile, and compared with the structure solved in an aqueous environment (Fitzpatrick et al., 1993). Of the 119 bound water sites observed in the aqueous structure, 99 were conserved in the acetonitrile structure. Together, these studies show that one-half or more of bound water sites are typically conserved in independent structures of a protein solved under diverse conditions.

A slightly different approach was taken by Gschwend and Karplus (1991) in their study of features correlating with increasing electron density of water sites in glutathione reductase. The water molecules in a 1.54 Å resolution structure were ranked in order of electron density in the final 2Fo - Fc map. Eight parameters were measured for each water molecule, weighted, and correlated with the electron density rankings; they included the energy from XPLOR refinement, solvent accessibility, strong hydrogen bonds to protein atoms and to other water molecules (donor-to-acceptor bond length between 2.5 and 3.2 Å), weak hydrogen bonds (bond length less than 2.5 Å or from 3.2 to 3.5 Å) to protein atoms and water molecules, the average temperature factor of hydrogen-bonding partners of the water molecule, and the average occupancies of hydrogen-bonding partners. The parameters that correlated best with increasing electron density included decreasing average temperature factor of hydrogen-bonding partners of the water molecule, decreasing solvent accessibility of the water, and increasing number of strong hydrogen bonds to protein atoms. The linear correlation coefficient between this set of features and the density ranking was 0.8.

An experimental approach was taken by Ringe (1995) to investigate the characteristics of ligand-binding sites on the protein surface. Crystals of elastase were grown in aqueous mother liquor, then transferred to an organic solvent. A shell of bound water molecules remained, even after the organic solvent was exchanged several times, though some organic solvent molecules also bound. The features associated with binding sites for the polar organic solvent molecules included the local surface shape, exposure of hydrophobic groups, and the presence of water molecules bound to polar groups near the site.

The approach presented here is a hybrid between prediction of bound water location and characterization of features favorable for water binding. Three classes of protein sites were studied: non-solvated sites, sites that were solvated in a single crystallographic structure of a protein, and solvation sites that were conserved in two crystallographically independent structures. The hybrid genetic algorithm/k-nearest-neighbors (GA/knn) classifier was provided with a set of features representing each site, and trained to distinguish between the classes of sites.
Motivation and Experimental Design

Unfortunately, one of the steps in the crystallographic determination of protein structures, the assignment of water molecule locations based on the electron density data from an X-ray crystallographic experiment, is an imperfect process. The resulting three-dimensional structure may include false sites via noise in the electron density, and omit many sites due to limited resolution. Additionally, the structure of the crystal lattice may produce favorable environments for bound water molecules where none were present in the soluble form of the protein. By applying the feature selection and extraction methods described here to protein crystallography data sets, we hope to improve water assignment by identifying the physical and chemical features associated with water-binding sites, and by producing a classifier that is capable of distinguishing solvation sites from non-solvated sites near the protein surface.

In order to train such a classifier, however, the current, imperfect information about protein-water binding available from X-ray crystallographic structures must be used. To mitigate the effect of noisy water coordinate data, we can superimpose multiple crystallographic structures of the same protein and examine the water-binding sites that are conserved between them. In this way we can identify the physical and chemical features associated with the most reliable water binding sites available: those that are conserved among multiple, independently-solved structures of the same protein. Finally, by using this information during the assignment of water molecules in future structures, we can improve the reliability of crystallographic water assignment by considering not only the electron density information obtained from the X-ray experiment, but also the physical and chemical features of the local protein environment in determining whether or not to assign a water molecule to a particular location near the protein surface. While the electron density is the experimental observable and the most important criterion, evaluating the likelihood of a given protein environment for water binding gives us insight into the protein-water recognition process and also can help identify missing or extraneous water sites in crystallographic structures.

In pursuit of this goal of understanding protein-water interactions more completely, the classification and feature selection and extraction methods described in earlier chapters were applied to two distinct classification problems:

Prediction of solvation sites: Given a location near the surface of a protein, determine whether the site is likely to be a water-binding site or a non-solvated site.

Prediction of solvent-site conservation: Given a known solvent site from a non-ligand-bound protein, determine whether the water molecule will be conserved (remain bound) in the protein-ligand complex.

For each of these problems, two subproblems related to understanding protein-water binding were investigated. The first was to develop and train a classifier capable of distinguishing between the two classes of sites: solvated vs. non-solvated, and conserved vs. non-conserved, respectively. The second objective was to utilize the feature selection and data mining capabilities of the EC-hybrid classifiers to identify the features of protein-bound water molecules that are associated with solvent binding, and with conservation of solvent sites between structures.
5.2 Methods

5.2.1 Compilation of a Protein Solvation Database

The task of understanding protein-water interactions was undertaken in two stages. The objective of the first stage was to design a hybrid EC/knn classifier capable of distinguishing water molecules from non-solvated sites on the surface of a protein, and to use the feature selection capabilities of this classifier to identify the physical and chemical features associated with solvation sites. The second goal was to further classify the water sites into those likely to be conserved between a ligand-bound and an unbound structure of a protein, and those likely to be non-conserved (displaced). Again, once this classification was made, the data mining and feature selection capabilities of the classifier were utilized to identify the determinants of solvent conservation between ligand-bound and unbound structures.

Prerequisite to any of these experiments, it was necessary to establish a database of conserved and non-conserved water molecules as well as non-solvated sites near protein surfaces to serve as training and tuning data for the GA/knn hybrid classifier. To this end, 60 protein structures were selected in pairs from the Brookhaven Protein Databank (Abola et al., 1987; Bernstein et al., 1977). Each pair consisted of the crystallographic structure of a protein bound to a ligand and the corresponding structure of the unbound protein. Protein pairs were selected based on several criteria. First, all proteins included in the database were non-homologous² (Hobohm et al., 1992). In addition, high-resolution (≤ 2.0 Å) structures were preferred, and structures exhibiting significant conformational changes upon ligand binding were excluded, since the concept of conserved water is no longer well-defined in this case. Pairs of structures with a root-mean-squared positional deviation (RMSD) of ≤ 2.0 Å upon superposition of main-chain atoms were also preferred. The 30 pairs of proteins selected for the database are detailed in Table 5.1.

² Homology, in this case, refers to proteins composed of similar sequences of amino acids. Inclusion of homologous proteins in the database can introduce redundant information, which can lead to unnecessary bias in the training and tuning data. No two proteins included in the database described here had > 25% amino acid sequence identity when the linear sequences of amino acids comprising each protein were aligned.

Table 5.1: 30 protein pairs included in the database.

PDB Code¹ | Protein / Ligand Complex | Res(Å)² | Waters | RMSD³
1ahc/1ahb | α-momorcharin / formycin 5'-monophosphate | … | … | …
1apm/1atp | cAMP-dependent protein kinase / MnATP | … | … | …
1bia/1bib | BirA bifunctional protein / biotinylated lysine | … | … | …
1bsa/1brn | barnase / d(CGAC) | … | … | …
1ca2/1bcd | carbonic anhydrase II / trifluoromethane sulphonamide | … | … | …
1cgf/1hfc | fibroblast collagenase / HAP⁴ | … | … | …
1cgt/1cgu | cyclodextrin glycosyltransferase / glucose | … | … | …
1chp/1xtc | cholera toxin B pentamer / polypeptide chain | … | … | …
1dr2/1dr3 | dihydrofolate reductase / biopterin | … | … | …
1gta/1gtb | glutathione S-transferase / praziquantel | … | … | …
1hel/1mlc | hen egg-white lysozyme / monoclonal antibody Fab D44.1 | … | … | …
1lib/1lic | adipocyte lipid-binding protein / hexadecanesulfonic acid | … | … | …
1nsb/1nsc | neuraminidase / N-acetyl neuraminic acid | … | … | …
1poa/1pob | phospholipase A2 / transition-state analogue | … | … | …
1syc/1syd | staphylococcal nuclease / 2'-deoxy-3',5'-diphosphothymidine | … | … | …
1thm/2tec | thermitase / eglin-C | … | … | …
Table 5.1 (cont.)

PDB Code¹ | Protein / Ligand Complex | Res(Å)² | Waters | RMSD³
1udg/1udh | uracil-DNA glycosylase / uracil | 1.75/1.75 | 121/94 | 0.19
2act/1aec | actinidin / E64⁵ | 1.7/1.86 | 272/268 | 0.11
2apr/3apr | acid proteinase / reduced peptide inhibitor | 1.8/1.8 | 373/344 | 0.13
2cla/3cla | chloramphenicol acetyltransferase / chloramphenicol | 2.35/1.75 | 104/204 | 0.41
2ctv/5cna | concanavalin A / α-methyl-D-mannopyranoside | 1.95/2.0 | 146/533 | 0.42
2sga/5sga | proteinase A / tetrapeptide Ace-Pro-Ala-Pro-Tyr | 1.5/1.8 | 220/185 | 0.08
2wrp/1tro | Trp repressor / synthetic operator | 1.65/1.9 | 170/572 | 2.18
3cox/1coy | cholesterol oxidase / 3-β-hydroxy-5-androsten-17-one | 1.8/1.8 | 453/424 | 0.24
3dni/2dnj | deoxyribonuclease I / DNA | 2.0/2.0 | 375/252 | 0.37
3enl/5enl | enolase / 2-phospho-D-glyceric acid | 2.25/2.2 | 353/355 | 0.21
3grs/1gra | glutathione reductase / glutathione disulfide | 1.54/2.0 | 523/530 | 0.12
3tln/3tmn | thermolysin / Val-Trp | 1.6/1.7 | 173/173 | 0.10
5cpa/6cpa | carboxypeptidase A / phosphonate | 1.54/2.0 | 315/148 | 0.36
See note 6 | RTEM-1 β-lactamase / penicillin G | 1.7/1.7 | 182/189 | 0.22

¹ Ligand-free/ligand-bound.
² Resolution of the crystallographic structures in Angstroms.
³ Main-chain root-mean-squared positional deviation from superposition of the ligand-bound and free structures.
⁴ HAP is (N-(2-hydroxamatemethylene-4-methyl-pentoyl)phenylalanyl) methyl amine.
⁵ E64 is [N-(L-3-trans-carboxyoxirane-2-carbonyl)-L-leucyl]-amido(4-guanido)butane.
⁶ Provided by Drs. Natalie Strynadka and Michael James, University of Alberta, Edmonton.

The water molecules from the unbound structures were used to construct the training and tuning database. The ligand-bound structures were used only to label the water molecules from the unbound structures as conserved or displaced. To obtain this labeling, the main-chain atoms of the ligand-bound and unbound structures were superimposed using the rigid-body superimposition algorithm of the InsightII molecular graphics software (Molecular Simulations, Inc., San Diego, CA). Each water molecule in the unbound structure was then compared with the superimposed ligand-bound structure; if a corresponding water molecule was found in the ligand-bound structure within 1.2 Å (Zhang and Matthews, 1994) of the water from the unbound structure, the water was labeled as conserved. Water molecules from the unbound structure with no corresponding water molecule in the superimposed ligand-bound structure were labeled as non-conserved. Only first-shell water molecules from the unbound structures were included in the database, where the first shell is defined as the set of water molecules within hydrogen-bonding distance (3.6 Å) of a protein surface atom.
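The conservation labeling and first-shell filtering can be summarized in a brief sketch. The Python fragment below is only an illustration of the procedure just described, not the software used in this work; it assumes the two structures have already been superimposed, that coordinates are available as NumPy arrays, and that the function names are hypothetical.

    import numpy as np

    CONSERVED_CUTOFF = 1.2    # Angstroms (Zhang and Matthews, 1994)
    FIRST_SHELL_CUTOFF = 3.6  # Angstroms, hydrogen-bonding distance to a protein surface atom

    def min_dist(points, targets):
        # Minimum Euclidean distance from each row of `points` to any row of `targets`.
        return np.linalg.norm(points[:, None, :] - targets[None, :, :], axis=-1).min(axis=1)

    def label_first_shell_waters(free_waters, bound_waters, surface_atoms):
        # free_waters:   (n, 3) water oxygen coordinates from the ligand-free structure
        # bound_waters:  (m, 3) water oxygens from the superimposed ligand-bound structure
        # surface_atoms: (p, 3) protein surface atom coordinates of the ligand-free structure
        first_shell = min_dist(free_waters, surface_atoms) <= FIRST_SHELL_CUTOFF
        conserved = min_dist(free_waters, bound_waters) <= CONSERVED_CUTOFF
        labels = np.where(conserved, "conserved", "non-conserved")
        return free_waters[first_shell], labels[first_shell]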
5.2.2 Generation of Non-Solvated Probe Sites

For distinguishing solvation sites from non-solvated sites near the protein surface, it was necessary to generate a collection of non-solvated probe sites near the surface of each of the 30 non-ligand-bound proteins selected for the database. To generate these probe sites, the solvent-accessible molecular surface for each of the non-ligand-bound structures in Table 5.1 was generated using the Molecular Surface (MS) software from Connolly (1983). A probe radius of 1.2 Å was used, with a surface density of 1 dot/Å² and unit normals generated at each surface dot. For each protein, the minimum distance of any crystallographically observed water molecule from the protein surface, dmin, and the maximum distance, dmax, were determined. For each surface point, a probe site was generated and placed at a distance along the surface normal, selected at random over the range of distances from dmin to dmax. Any non-sites overlapping crystallographic water sites or other non-sites (with positions within 1.2 Å) were removed. Finally, the same number of non-sites was selected for each protein as there were crystallographic water sites. This selection was done so that the distribution of distances of non-sites from the protein surface matched the distribution of distances for observed water molecules for that protein.
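The probe-site construction lends itself to a similar sketch. The following Python fragment is a simplified, hypothetical rendering of the steps above: it assumes the MS surface dots and unit normals are already available as arrays, uses the distance to the nearest protein atom as a stand-in for the distance to the molecular surface, and replaces the distance-distribution matching with a plain random subsample.

    import numpy as np

    def nearest_dist(a, b):
        # Minimum distance from each point in `a` to the point set `b`.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1).min(axis=1)

    def make_probe_sites(surface_dots, surface_normals, waters, protein_atoms,
                         clash_cutoff=1.2, seed=0):
        rng = np.random.default_rng(seed)
        # Range of observed water-to-surface distances for this protein.
        d_water = nearest_dist(waters, protein_atoms)
        d_min, d_max = d_water.min(), d_water.max()
        # One candidate per surface dot, pushed outward along its unit normal.
        offsets = rng.uniform(d_min, d_max, size=len(surface_dots))
        candidates = surface_dots + offsets[:, None] * surface_normals
        # Discard candidates within 1.2 A of a crystallographic water site.
        candidates = candidates[nearest_dist(candidates, waters) > clash_cutoff]
        # Greedily discard candidates within 1.2 A of an already accepted probe.
        accepted = []
        for c in candidates:
            if not accepted or np.linalg.norm(np.array(accepted) - c, axis=1).min() > clash_cutoff:
                accepted.append(c)
        accepted = np.array(accepted)
        # Keep as many probes as there are observed waters (a uniform subsample here;
        # the procedure in the text instead matches the observed distance distribution).
        n_keep = min(len(waters), len(accepted))
        return accepted[rng.choice(len(accepted), size=n_keep, replace=False)]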
5.2.3 Feature Identification and Measurement

Once a database of non-solvated sites, non-conserved water-binding sites, and conserved solvation sites had been established, the next step was to identify and measure a collection of features to represent each water-binding site or non-solvated site for purposes of classification. For each site, six physical and chemical features of the local protein environment were computed from the crystallographic coordinates. For conserved and non-conserved solvation sites, two additional features describing the thermal mobility of the water molecule in the crystal structure were included. The eight features included in the waters database are listed in Table 5.2.

Table 5.2: The physical and chemical features used to represent protein-bound water molecules and protein surface sites. BVAL and MOB were only measured for conserved and non-conserved water molecules, as the concept of thermal mobility is not defined for non-solvated probe sites.

ADN (Atomic density): The number of protein atom neighbors within 3.6 Å of the water molecule. This feature correlates with the local protein topography. Water molecules bound in deep grooves will have high ADN values, while those bound to protrusions will have low ADN values (Kuhn et al., 1992).

AHP (Atomic hydrophilicity): The hydrophilicity of the neighborhood of the water molecule is based on the frequency of hydration for each atom type in 56 high-resolution protein structures (Kuhn et al., 1995). Each water molecule is assigned an AHP value equal to the sum of the atomic hydrophilicity values of all atom neighbors within 3.6 Å of the water molecule.

BVAL (B-value): The crystallographic temperature factor from the PDB file of the crystal structure. This feature measures the thermal mobility of the water molecule.

HBDP (Hydrogen bonds to protein): The number of hydrogen bonds between the water molecule and neighboring protein atoms. Each donor or acceptor atom within 3.5 Å is considered a potential hydrogen bond.

HBDW (Hydrogen bonds to water): The number of hydrogen bonds between the water molecule and other water molecules in the ligand-free protein structure, based on a ≤ 3.5 Å distance between oxygen atoms in the two water molecules.

MOB (Mobility): A normalized measure of thermal mobility, defined in terms of the B-value and occupancy, two measures of the thermal mobility of the water molecule provided in the crystallographic structure of the protein: MOB = (Bw / Bavg) / (Occw / Occavg), where Bw and Bavg are the B-value of the water molecule and the average B-value of all the water molecules in the protein, respectively. Similarly, Occw and Occavg are the occupancy of the water molecule (from the PDB file) and the average occupancy of all water molecules in the protein structure (Craig et al., 1998).

ABVAL (Average B-value of protein atom neighbors): The average (mean) temperature factor of all protein atoms within 3.6 Å of the water molecule.

NBVAL (Net B-value of protein atom neighbors): The sum of the B-values of all protein atoms within 3.6 Å of the water molecule.

The eight features described in Table 5.2 comprise most of the factors currently believed to be related to protein-water binding, as discussed in Section 5.1.2. These features do not, unfortunately, establish a well-separated or easily classified data space. Figure 5.2 shows 800 waters selected randomly from the conserved and non-conserved protein-bound water molecule data set, plotted according to their first two principal components. This plot demonstrates the high degree of overlap between the two classes. This overlap is further evidenced by the poor classification results obtained by linear discriminant analysis (Fisher, 1936), which obtains a classification accuracy of only ~50% (equivalent to random prediction) in distinguishing between conserved and non-conserved solvation sites based upon this data set.

Figure 5.2: Conserved and non-conserved water molecules plotted according to the first two principal components of the eight-feature water-binding data set. Eight hundred randomly selected water molecules from the data set are plotted. Non-conserved water molecules are plotted as 1's, conserved water molecules are plotted as 2's. The first principal component is composed primarily of the features HBDP, ADN, and AHP (with coefficients 0.63, 0.52, and 0.38, respectively). The second principal component is composed primarily of the three features based on temperature factor: ABVAL, BVAL, and NBVAL (the respective coefficients are -0.54, -0.53, and -0.44). There is no clear separation between the two classes in this feature space.

5.2.4 Distinguishing Solvation Sites from Non-Solvated Sites Near the Protein Surface

The training and tuning data for distinguishing solvation sites from non-solvated regions of the protein surface consisted of all the first-shell water molecules from the non-ligand-bound structures in Table 5.1, a total of 5542 water molecules, and an equal number of non-solvated probe sites, for a total of 11,084 samples. The non-solvated sites were generated from the same 30 non-ligand-bound structures, as described in Section 5.2.2. The six features from Table 5.2 not related to thermal mobility were used to represent each sample, and the patterns were labeled "class 1" for non-solvated probe sites and "class 2" for solvation sites. For each GA/knn experiment, 1000 patterns of each class were selected at random from the database to serve as the training set. Likewise, the tuning data consisted of 1000 patterns of each class. After the GA/knn experiment, bootstrap testing was performed as described in Section 3.1.6 to test the predictive accuracy of the best weight set and k value found by the GA. The bootstrap testing pool consisted of the 3542 samples from each class not included in the training or tuning data. Individual test sets of size 500 were drawn from this pool during the bootstrap testing. The GA run parameters and the cost function coefficients for these experiments were identical to those detailed in Tables 3.1 and 3.2.
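As a concrete illustration of how a single GA-tuned weight set and k value are applied during bootstrap testing, consider the Python sketch below. It is not the implementation used in this work: the function names are hypothetical, class labels are assumed to be coded as small integers, the weights are assumed to scale each feature before the Euclidean distance is computed (as is typical for weighted knn), and the resampling details of Section 3.1.6 are only approximated by drawing fixed-size test sets at random from the held-out pool.

    import numpy as np

    def weighted_knn_predict(train_X, train_y, test_X, weights, k):
        # k-nearest-neighbors in the feature space scaled by the GA weight vector;
        # features with weight 0 are effectively masked out of the distance.
        tr = train_X * weights
        te = test_X * weights
        d = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=-1)   # (n_test, n_train)
        nn = np.argsort(d, axis=1)[:, :k]                              # k nearest training samples
        votes = train_y[nn]                                            # neighbor class labels
        return np.array([np.bincount(v).argmax() for v in votes])      # majority vote

    def bootstrap_accuracy(train_X, train_y, pool_X, pool_y, weights, k,
                           n_tests=100, test_size=500, seed=0):
        # Mean accuracy of a fixed weight set and k over repeated test draws from
        # the pool of samples never used for training or tuning.
        rng = np.random.default_rng(seed)
        accs = []
        for _ in range(n_tests):
            idx = rng.choice(len(pool_X), size=test_size, replace=False)
            pred = weighted_knn_predict(train_X, train_y, pool_X[idx], weights, k)
            accs.append(np.mean(pred == pool_y[idx]))
        return float(np.mean(accs))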
5.2.5 Distinguishing Conserved Solvation Sites from Sites Not Conserved Between Ligand-Bound and Unbound Protein Structures

The training and tuning data for classifying conserved solvation sites between ligand-bound and unbound protein structures consisted of the 5542 first-shell water molecules from the 30 non-ligand-bound structures in Table 5.1. Of these water molecules, the 2137 non-conserved water molecules were labeled "class 1", and the 3405 conserved water molecules were labeled "class 2". All eight features described in Table 5.2 were used to represent each sample. Training and tuning data were compiled in a similar manner to the previous set of experiments, but due to the smaller size of this data set, the training and tuning data consisted of 400 patterns of each class. The remaining 2605 conserved and 1337 non-conserved water molecules composed the pool of samples for bootstrap testing. Again, the bootstrap test set size was 500 samples. As with the previous set of experiments, the GA run parameters and the cost function coefficients were set as described in Tables 3.1 and 3.2.

5.3 Results

5.3.1 Differentiation of Solvation Sites from Non-Solvated Sites

The protein solvation site and non-solvated site experiments were conducted with several objectives. In addition to optimizing classification accuracy, the feature weights produced by the GA were analyzed to determine whether a subset of the available features was consistently selected for inclusion in the highest-performance weight sets. Fifteen GA/knn experiments were conducted with differing initial random populations. One hundred bootstrap test experiments were then conducted on the best six resulting weight sets. Descriptive statistics were calculated for each set of bootstrap tests, and the results are summarized in Table 5.3, sorted according to mean bootstrap accuracy.

In spite of randomized initial conditions for each GA experiment, the top performing runs for distinguishing protein solvation sites from non-sites (refer to Table 5.3) exhibited notable consistency in feature subset selection across runs. Although we cannot assume that this consistency is indicative of finding a globally optimal set of feature weights for classifying this data, we can conclude that the features that are consistently included in the optimized weight sets are a sufficient set to predict, with ~68% accuracy, water-binding sites at the protein surface. In this respect, then, these features can be concluded to be associated with water binding.

Table 5.3: Bootstrap test results, k-values, and weight sets from the top six GA experiments for distinguishing crystallographically observed solvation sites from non-sites. Bootstrap testing accuracy is given as a percentage for observed solvation sites (Site), non-solvated sites (Non), and both classes together (Tot). Balance (Bal) is the average difference between solvated-site and non-solvated-site prediction accuracy (%) over all bootstrap runs. Features with no weights were masked by the GA; i.e., their feature weights were zero, and they did not participate in classification. The feature weights for all six features are normalized to sum to 1.00 for each experiment.
Site | Non | Tot | Bal | k | Nonzero feature weights (original columns: ADN, AHP, HBDP, HBDW, ABVAL, NBVAL)
68.60 | 67.75 | 68.18 | 3.50 | 23 | 0.284, 0.524, 0.193
66.87 | 68.74 | 67.81 | 3.50 | 75 | 0.545, 0.235, 0.219
65.93 | 65.93 | 65.93 | 3.20 | 53 | 0.530, 0.197, 0.274
65.21 | 70.44 | 67.83 | 5.49 | 45 | 0.194, 0.721, 0.086
66.19 | 69.40 | 67.79 | 4.39 | 35 | 0.332, 0.643, 0.024
67.32 | 67.74 | 67.53 | 3.52 | 61 | 0.651, 0.289, 0.060

For distinguishing observed solvation sites from pseudo-water sites, the number of hydrogen bonds to protein atoms (HBDP) was the most common highly weighted feature, followed closely by atomic density (ADN). Some measure of the temperature factor of neighboring protein atoms was also included in each of the top weight sets. For half of the runs the average temperature factor (ABVAL) was included, while the sum of the temperature factors (NBVAL) was included in the other half. The local atomic hydrophilicity (AHP) and the number of hydrogen bonds to other water molecules (HBDW) were not included in any of the highest-performance weight sets. The average weights for each feature across the top fifteen weight sets are shown in Figure 5.3.

Figure 5.3: Average normalized weight of each feature in the top fifteen weight sets for distinguishing crystallographically observed solvation sites from non-solvation sites.

These results correspond well with several other studies of conserved solvation sites in specific protein structures. In a study of conserved solvation sites in thrombin, trypsin, and bovine pancreatic trypsin inhibitor (BPTI), various features of the local protein environment were correlated with the degree of conservation of solvation sites across multiple superimposed structures for each protein (Sanschagrin and Kuhn, 1998). The three features which demonstrated the strongest correlation with water-site conservation were hydrogen bonds to neighboring protein atoms, atomic hydrophilicity, and atomic density. Two of these features, hydrogen bonds to protein atoms and atomic density, were highly weighted by the GA in all of the weight sets reported in Table 5.3. A possible explanation for the GA's failure to include atomic hydrophilicity in any of these weight sets is the strong correlation between atomic density and atomic hydrophilicity. Since both of these features are strongly dependent on the number of protein atoms neighboring a given water molecule, the two values are correlated (r = 0.642).

Even more closely related to the features found by the GA are those found to be related to water electron density rankings by Gschwend and Karplus (1991): average temperature factor of hydrogen-bonding partners, solvent accessibility, and the number of strong hydrogen bonds to protein atoms. The hydrogen-bonding and temperature-factor features were selected by both methods, while the atomic density feature used by the GA is closely related to the solvent accessibility term used in Gschwend and Karplus's density correlations. Water molecules bound in deep grooves will have low solvent accessibility and higher atomic density values, while the reverse holds for waters bound to protein surface protrusions. Studies conducted by Kuhn et al. (1992) and Poornima and Dean (1995b) both demonstrated a relationship between water binding and local surface shape. The GA results identify the same relationship, as atomic density is included in all of the top performing weight sets; a high density of protein atoms surrounding a water site correlates with groove-like topography.
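The per-feature averages plotted in Figure 5.3 amount to a simple column mean over the top-ranked weight sets. The short Python sketch below is illustrative only: the weight values are invented for the example, and it assumes that masked features are carried as zeros in each normalized weight vector.

    import numpy as np

    features = ["ADN", "AHP", "HBDP", "HBDW", "ABVAL", "NBVAL"]
    # Hypothetical rows: top-ranked weight sets, each normalized to sum to 1.00,
    # with GA-masked features entered as 0.
    top_weight_sets = np.array([
        [0.30, 0.0, 0.50, 0.0, 0.20, 0.0],
        [0.55, 0.0, 0.25, 0.0, 0.0, 0.20],
        [0.50, 0.0, 0.20, 0.0, 0.30, 0.0],
    ])
    mean_weights = top_weight_sets.mean(axis=0)
    for name, w in sorted(zip(features, mean_weights), key=lambda t: -t[1]):
        print(f"{name:6s} {w:.3f}")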
5.3.2 Distinguishing Between Conserved and Non-Conserved Solvation Sites

The problem of identifying solvation sites likely to be conserved between ligand-bound and unbound structures of a protein was one of the first applications of the GA/knn hybrid classifier. As such, significantly more experiments have been conducted for this data set than for any of the other data sets presented here. The top 21 weight sets from all such experiments were selected for bootstrap testing. Again, 100 bootstrap tests were conducted for each weight set. Accuracy and balance results for each weight set and k value tested are summarized in Table 5.4.

Table 5.4: Bootstrap results of the best 21 weight sets for identifying conserved solvation sites. Mean bootstrap testing accuracy is given as a percentage (Acc). Mean balance (Bal) is the average difference between conserved and non-conserved prediction accuracy (%) over all bootstrap runs. Features with no weights were masked by the GA; i.e., their feature weights were zero, and they did not participate in classification. The feature weights for all eight features are normalized to sum to 1.00 for each experiment.

Acc | Bal | k | Nonzero feature weights (original columns: ADN, AHP, BVAL, HBDP, HBDW, MOB, ABVAL, NBVAL)
64.20 | 3.88 | 65 | 0.413, 0.135, 0.137, 0.315
63.62 | 3.80 | 29 | 0.667, 0.333
63.16 | 5.35 | 25 | 0.463, 0.323, 0.214
63.11 | 4.15 | 37 | 0.891, 0.109
63.00 | 3.52 | 77 | 0.308, 0.163, 0.225, 0.304
62.87 | 5.36 | 17 | 0.841, 0.159
62.47 | 7.74 | 97 | 0.459, 0.291, 0.250
62.36 | 3.27 | 27 | 0.371, 0.629
62.16 | 3.79 | 23 | 0.372, 0.240, 0.203, 0.184
62.15 | 3.50 | 7 | 0.571, 0.156, 0.273
62.02 | 4.26 | 87 | 0.118, 0.558, 0.323
61.72 | 4.14 | 17 | 0.252, 0.352, 0.397
61.70 | 4.20 | 67 | 0.421, 0.579
61.58 | 3.58 | 13 | 0.018, 0.388, 0.441, 0.153
61.46 | 3.84 | 63 | 0.227, 0.773
61.42 | 3.18 | 19 | 0.051, 0.417, 0.058, 0.474
61.36 | 3.39 | 15 | 0.392, 0.293, 0.207, 0.108
61.16 | 7.14 | 19 | 1.000
60.62 | 4.58 | 57 | 0.881, 0.119
60.30 | 2.78 | 75 | 0.317, 0.683
60.25 | 3.40 | 69 | 0.336, 0.664

In applying the feature selection and extraction capabilities of the EC-hybrid classifiers for data mining and analysis, consistency in feature weighting across multiple runs is a desirable result. While it is never possible to guarantee that the EC has found an optimal or near-optimal set of feature weights, features that are consistently included in weight sets that provide high classification accuracy are likely to be associated with the natural classes in the data. For the water conservation data, the two features related to thermal mobility, B-value (BVAL) and mobility (MOB), are included in all of the top six performing weight sets. Features such as atomic density (ADN), the number of hydrogen bonds to other water molecules (HBDW), and the average B-value of neighboring protein atoms (ABVAL), which are included in very few of the top 21 weight sets, are likely to be omitted either because they are not as closely associated with water binding, or because they are closely related to another included feature and thus provide redundant information. Because the EC cost function penalizes for each feature included in the classification, the EC search is directed towards a minimal feature set that provides good classification accuracy. It is interesting to note that atomic hydrophilicity (AHP) and mobility are nearly mutually exclusive, suggesting that some information needed for classification may be provided by both of these features. The linear correlation coefficient for AHP and MOB is r = -0.374.
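A quick way to look for this kind of redundancy is to inspect pairwise feature correlations directly. The fragment below is a generic sketch, not part of the analysis pipeline used here; the threshold of 0.6 and the function name are arbitrary illustrative choices.

    import numpy as np

    def redundant_pairs(X, names, threshold=0.6):
        # X: (n_samples, n_features) feature matrix. Flag feature pairs whose
        # absolute Pearson correlation meets or exceeds the threshold.
        r = np.corrcoef(X, rowvar=False)
        pairs = []
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                if abs(r[i, j]) >= threshold:
                    pairs.append((names[i], names[j], round(float(r[i, j]), 3)))
        return pairs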
It is not surprising that the features that indicate conserved solvation across multiple structures differ somewhat from those that identify likely solvation sites in a single structure. Since both conserved and non-conserved water molecules are present in at least one crystallographic structure, they might be expected to have similar values for the features associated with bound water molecules in a single crystallographic structure. Although the task appears to be more difficult than identifying water-binding sites, the GA/knn classifier obtained a set of features which further distinguishes those sites that are likely to be conserved across multiple structures. The features thus identified include B-value (BVAL), which appeared in 19 of the top 21 weight sets and was the highest-weighted feature in 15 of these, and mobility (MOB), appearing in 11 weight sets, always as the first or second most strongly weighted feature. Also selected in 9 weight sets, including the set which obtained the best overall accuracy in bootstrap testing, was the number of hydrogen bonds to protein atoms. The best performing weight set included these three features as well as the number of hydrogen bonds to other water molecules.

The best bootstrap accuracy obtained by the GA/knn was 64.2%, with a mean bootstrap balance (difference in predictive accuracy between conserved and non-conserved solvation sites) of 3.88%. This accuracy is a notable improvement over the near-random classification obtained by linear discriminant analysis. Further improvement in accuracy can be obtained by using the k-nearest-neighbors vote tally as a measure of classification confidence. Intuitively, one would expect that a site for which the knn vote was definitive (say, 38 votes for conserved and 1 vote for non-conserved) was more likely to be correctly classified than one for which the vote was more evenly split (for example, 20 votes for conserved and 19 for non-conserved), and this intuition was supported by knn voting results. A weighted knn classifier was used with the top weight set from Table 5.4 to classify as conserved or non-conserved all water-binding sites from the data set that were not selected as training or tuning sites for this GA run. A total of 3942 solvation sites were tested. Figure 5.4 shows the relationship between the number of votes in the majority category, m, and the predictive accuracy for all test sites where the number of votes cast in the majority was ≥ m. The correlation coefficient between m and cumulative predictive accuracy was r = 0.905. As demonstrated by the figure, the knn classifier exhibits a strong tendency toward more accurate classification of sites for which the voting consensus is more definitive. Greater predictive accuracy can thus be obtained by allowing "don't know" classifications when the number of majority votes does not exceed a specified cutoff value, and the prediction is thus less likely to be correct. Also shown in the figure is the number of test sites for which the number of majority votes meets or exceeds each value of m. For example, by classifying only sites where 43 or more votes were cast for the same class, we can classify 1344 of the 3942 test sites with a predictive accuracy of 70.46%. While some sites are eliminated from classification, those that remain are classified with greater accuracy.
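The vote-cutoff analysis behind Figure 5.4 can be expressed compactly. The Python sketch below assumes the per-site majority vote counts and a boolean correctness flag have already been collected from the weighted knn classifier; the function names are hypothetical.

    import numpy as np

    def accuracy_by_majority(majority_votes, correct, k):
        # For each possible majority size m, report how many sites reach at least m
        # majority votes and the cumulative predictive accuracy on those sites
        # (the two quantities plotted in Figure 5.4).
        rows = []
        for m in range(k // 2 + 1, k + 1):
            sel = majority_votes >= m
            if sel.any():
                rows.append((m, int(sel.sum()), float(correct[sel].mean())))
        return rows

    def predict_or_abstain(predictions, majority_votes, cutoff):
        # Keep the knn prediction only when the majority reaches the cutoff;
        # otherwise return a "don't know" (coded here as -1).
        return np.where(majority_votes >= cutoff, predictions, -1)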
Figure 5.4: Relationship between majority votes and accuracy. For each possible number of majority votes, m (x-axis), the solid line shows the cumulative predictive accuracy for all test sites for which m or more votes were cast in the majority. This accuracy value corresponds to the scale on the right y-axis. The outlined rectangles show the actual number of sites for which m or more votes were cast in the majority, corresponding to the scale on the left y-axis. Since the k-value for the weight set tested was 65, the number of majority votes ranges from 33 (the smallest possible majority) to 65.

5.4 Discussion

In distinguishing solvation sites from non-solvated sites near the protein surface, the classification accuracy of the GA/knn hybrid classifier is comparable to other contemporary methods for solvent-site prediction, but the GA/knn does not have a tendency to overpredict solvation, a feature common to other techniques. For the problem of distinguishing conserved from non-conserved water-binding sites, the accuracy of the GA/knn surpassed that of other classical classification techniques tested. In addition, the consistency in feature weighting exhibited for both problems provides some insight into the determinants of water binding and water conservation in crystallographic protein structures. In the case of predicting solvation sites, the determinants identified by the GA/knn experiments correspond well with other computational and experimental results.

Taken together, the results of both sets of experiments paint a general picture of the physical and chemical features related to water binding and conservation. From these results, we can envision a continuum of water-binding favorability based on various features. Local atomic hydrophilicity, atomic density, and the hydrogen-bonding potential of local protein atoms are good indicators of likely solvation sites. Of those sites that do bind water molecules, a low crystallographic temperature factor, high occupancy, and a large number of hydrogen bonds to neighboring protein atoms indicate that the water molecule is likely to be conserved between independently determined structures.

Bibliography

Abola, E. E., Bernstein, F. C., Bryant, S. H., Koetzle, T. F., and Weng, J. (1987). Protein Data Bank, pages 107–132. Data Commission of the International Union of Crystallography, Bonn/Cambridge/Chester.

Aeberhard, S., Coomans, D., and de Vel, O. (1992a). The classification performance of RDA. Technical Report 92-01, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.

Aeberhard, S., Coomans, D., and de Vel, O. (1992b). Comparison of classifiers in high dimensional settings. Technical Report 92-02, Dept. of Computer Science and Dept. of Mathematics and Statistics, James Cook University of North Queensland.

Angeline, P. J. (1995). Adaptive and self-adaptive evolutionary computations. In Palaniswami, M., Attikiouzel, Y., Marks, R., Fogel, D., and Fukuda, T., editors, Computational Intelligence: A Dynamic Systems Perspective, pages 152–163. IEEE Press, Piscataway, NJ.

Bäck, T. and Schwefel, H.-P. (1993). An overview of evolutionary algorithms for parameter optimization. Evolutionary Computation, 1:1–23.

Bäck, T. and Schwefel, H.-P. (1996).
Evolutionary computation: An overview. In Proceedings of the Third IEEE Conference on Evolutionary Computation, pages 20–29, Piscataway, NJ. IEEE Press.

Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc., 53.

Bernstein, F. C., Koetzle, T. F., Williams, G. J. B., Meyer, Jr., E. F., Brice, M. D., Rodgers, J. R., Kennard, O., Shimanouchi, T., and Tasumi, M. (1977). The Protein Data Bank: A computer-based archival file for macromolecular structures. J. Mol. Biol., 112:535–542.

Blake, C. L. and Merz, C. J. (1998). UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html.

Breiman, L., Friedman, J., Olshen, R., and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Pacific Grove.

Bremermann, H., Rogson, M., and Salaff, S. (1966). Global properties of evolution processes. In Pattee, H., Edlsack, E., Fein, L., and Callahan, A., editors, Natural Automata and Useful Simulations, pages 3–41. Spartan Books, Washington, DC.

Cestnik, G., Konenenko, I., and Bratko, I. (1987). Assistant-86: A knowledge-elicitation tool for sophisticated users. In Bratko, I. and Lavrac, N., editors, Progress in Machine Learning, pages 31–45. Sigma Press.

Connolly, M. L. (1983). Solvent-accessible surfaces of proteins and nucleic acids. Science, 221:709–713.

Cover, T. M. and Campenhout, J. M. V. (1977). On the possible orderings in the measurement selection problem. IEEE Transactions on Systems, Man, and Cybernetics, 7:657–661.

Cover, T. M. and Hart, P. E. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13:21–27.

Craig, L., Sanschagrin, P. C., Rozek, A., Lackie, S., Kuhn, L. A., and Scott, J. K. (1998). The role of structure in antibody cross-reactivity between peptides and folded proteins. J. Mol. Biol., 281(1):183–201.

Crosby, J. (1967). Computers in the study of evolution. Sci. Prog. Oxf., 55:279–292.

De Jong, K. A. (1992). Genetic algorithms are NOT function optimizers. In Proceedings of the Second Workshop on Foundations of Genetic Algorithms, pages 5–18, Vail, Colorado.

Diaconis, P. and Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248.

Domingos, P. and Pazzani, M. (1996). Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In Saitta, L., editor, Proceedings of the Thirteenth International Conference on Machine Learning, pages 105–112, San Francisco, CA. Morgan Kaufmann.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley & Sons.

Efron, B. (1979). Bootstrap methods: Another look at the jackknife. Ann. Statist., 7:1–26.

Efron, B. (1982). The jackknife, the bootstrap, and other resampling plans. In CBMS-NSF Regional Conf. Series in Applied Mathematics, no. 38. SIAM.

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Ann. Eugenics, 7:179–188.

Fitzpatrick, P. A., Steinmetz, A. C. U., Ringe, D., and Klibanov, A. M. (1993). Enzyme crystal structure in a neat organic solvent. Proc. Natl. Acad. Sci. USA, 90:8653–8657.

Fogel, D., editor (1998). Evolutionary Computation: The Fossil Record. IEEE Press, New York, NY.

Fogel, L. J., Owens, A. J., and Walsh, M. J. (1966). Artificial Intelligence through Simulated Evolution. John Wiley, NY.

Fraser, A. (1957). Simulation of genetic systems by automatic digital computers. I. Introduction. Australian J.
Biological Sciences, 10:484–491.

Fukunaga, K. and Narendra, P. M. (1975). A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, pages 750–753.

Gates, G. W. (1972). The reduced nearest neighbor rule. IEEE Transactions on Information Theory, IT-18:431–433.

Goldberg, D. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, San Mateo, CA.

Goodman, E. D. (1996). An introduction to GALOPPS, the genetic algorithm optimized for portability and parallelism. Technical Report 96-07-01, Michigan State University.

Gschwend, D. A. and Karplus, P. A. (1991). Bound water visibility in protein structure determination by x-ray crystallography. Unpublished senior thesis.

Hart, P. E. (1968). The condensed nearest neighbor rule. IEEE Transactions on Information Theory, IT-14:515–516.

Hobohm, U., Scharf, M., Schneider, R., and Sander, C. (1992). Selection of representative protein data sets. Protein Sci., 1:409–417.

Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, MI.

Ishibuchi, H., Nozaki, K., Yamamoto, N., and Tanaka, H. (1995). Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Transactions on Fuzzy Systems, 3(3):260–270.

Jain, A. K. and Chandrasekaran, B. (1982). Dimensionality and sample size considerations in pattern recognition in practice. In Krishnaiah, P. R. and Kanal, L. N., editors, Handbook of Statistics, volume 2, pages 835–855. North-Holland.

Jain, A. K., Dubes, R. C., and Chen, C. C. (1987). Bootstrap techniques for error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9(5):628–633.

Jain, A. K. and Zongker, D. (1997). Feature selection: Evaluation, application, and small sample performance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(2):153–158.

Jaynes, E. T. (1968). Prior probabilities. In IEEE Transactions on Systems Science and Cybernetics, volume SSC-4, pages 227–241.

Karplus, P. A. and Faerman, C. H. (1994). Ordered water in macromolecular structure. Curr. Opin. Struct. Biol., 4:770–776.

Kelly, J. D. and Davis, L. (1991). Hybridizing the genetic algorithm and the k nearest neighbors classification algorithm. In Proceedings of the Fourth International Conference on Genetic Algorithms and their Applications, pages 377–383.

Koza, J. R. (1992). Genetic Programming: On the Programming of a Computer by Means of Natural Selection. MIT Press.

Kuhn, L. A., Siani, M. A., Pique, M. E., Fisher, C. L., Getzoff, E. D., and Tainer, J. A. (1992). The interdependence of protein surface topography and bound water molecules revealed by surface accessibility and fractal density measures. J. Mol. Biol., 228:13–22.

Kuhn, L. A., Swanson, C. A., Pique, M. E., Tainer, J. A., and Getzoff, E. D. (1995). Atomic and residue hydrophilicity in the context of folded protein structures. Proteins: Str. Funct. Genet., 23:536–547.

Ladbury, J. E. (1995). Counting the calories to stay in the groove. Structure, 3:635–639.

Levitt, M. and Park, B. H. (1993). Water: now you see it, now you don't. Structure, 1(4):223–226.

Mao, J., Mohiuddin, K., and Jain, A. K. (1994). Parsimonious network design and feature selection through node pruning. In Proc. of the Intl. Conf. on Pattern Recognition, pages 622–624, Jerusalem.

Marchand, A., Lente, F. V., and Galen, R. (1983). The assessment of laboratory tests in the diagnosis of acute appendicitis.
American Journal of Clinical Pathology, 80(3):369–374.

Matthews, B. W. (1975). Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta., 405:442–451.

McRee, D. E. (1992). A visual protein crystallographic software system for X11/XView. J. Molecular Graphics, 10:44–46.

Mitchell, M. (1996). An Introduction to Genetic Algorithms. MIT Press.

Narendra, P. M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26:917–922.

Nozaki, K., Ishibuchi, H., and Tanaka, H. (1996). Adaptive fuzzy rule-based classification systems. IEEE Transactions on Fuzzy Systems, 4(3):238–250.

Palma, M. U., Palma-Vittorelli, M. B., and Parak, F., editors (1993). Water-Biomolecule Interactions, volume 43 of Italian Physical Society Conference Proceedings. Editrice Compositori, Bologna, Italy.

Parzen, E. (1962). On the estimation of a probability density function and the mode. Ann. Math Statistics, 33:1065–1076.

Pitt, W. R., Murray-Rust, J., and Goodfellow, J. M. (1993). AQUARIUS2: Knowledge-based modeling of solvent sites around proteins. J. Comp. Chem., 14(9):1007–1018.

Poornima, C. S. and Dean, P. M. (1995a). Hydration in drug design. 1. Multiple hydrogen-bonding features of water molecules in mediating protein-ligand interactions. Journal of Computer-Aided Molecular Design, 9:500–512.

Poornima, C. S. and Dean, P. M. (1995b). Hydration in drug design. 2. Influence of local site surface shape on water binding. Journal of Computer-Aided Molecular Design, 9:513–520.

Poornima, C. S. and Dean, P. M. (1995c). Hydration in drug design. 3. Conserved water molecules at the ligand-binding sites of homologous proteins. Journal of Computer-Aided Molecular Design, 9:521–531.

Pudil, P., Novovicova, J., and Kittler, J. (1994). Floating search methods in feature selection. Pattern Recognition Letters, 15:1119–1125.

Punch, W. F., Goodman, E. D., Pei, M., Chia-Shun, L., Hovland, P., and Enbody, R. (1993). Further research on feature selection and classification using genetic algorithms. In Proc. International Conference on Genetic Algorithms 93, pages 557–564.

Quinlan, J. R. (1986a). The effect of noise on concept learning. In Michalski, R., Carbonnell, J., and Mitchell, T., editors, Machine Learning: an Artificial Intelligence Approach, pages 149–166. Morgan Kaufmann.

Quinlan, J. R. (1986b). Induction of decision trees. Machine Learning, 1:81–106.

Quinlan, J. R. (1987). Simplifying decision trees. International Journal of Man-Machine Studies, pages 221–234.

Quinlan, J. R., Compton, P. J., Horn, K. A., and Lazurus, L. (1986). Inductive knowledge acquisition: A case study. In Proceedings of the Second Australian Conference on Applications of Expert Systems, Sydney, Australia.

Raymer, M. L., Sanschagrin, P. C., Punch, W. F., Venkataraman, S., Goodman, E. D., and Kuhn, L. A. (1997). Predicting conserved water-mediated and polar ligand interactions in proteins using a k-nearest-neighbors genetic algorithm. J. Mol. Biol., 265:445–464.

Rechenberg, I. (1973). Evolutionsstrategie: Optimierung technischer Systeme nach Prinzipien der biologischen Evolution. Fromman-Holzboog, Stuttgart.

Reed, J., Toombs, R., and Barricelli, N. (1967). Simulation of biological evolution and machine learning: I. Selection of self-reproducing numeric patterns by data processing machines, effects of hereditary control, mutation type and crossing. J. Theoret. Biol., 17:319–342.

Ringe, D. (1995).
What makes a binding site a binding site? Curr. Opin. Struct. Biol., 5:825–829.

Sanschagrin, P. C. and Kuhn, L. A. (1998). Cluster analysis of consensus water sites in thrombin and trypsin shows conservation between serine proteases and contributions to ligand specificity. Protein Sci., 7(9).

Schnecke, V. and Kuhn, L. A. (2000). Virtual screening with solvation and ligand-induced complementarity. Perspectives in Drug Discovery and Design, 20:1–22.

Schraudolph, N. N. and Grefenstette, J. J. (1992). A user's guide to GAUCSD. Technical Report CS92-249, Computer Science Department, University of California, San Diego, CA.

Schwefel, H.-P. (1977). Numerische Optimierung von Computermodellen mittels der Evolutionsstrategie. Birkhäuser, Basel.

Siedlecki, W. and Sklansky, J. (1989). A note on genetic algorithms for large-scale feature selection. Pattern Recognition Letters, 10:335–347.

Sigillito, V. G., Wing, S. P., Hutton, L. V., and Baker, K. B. (1989). Classification of radar returns from the ionosphere using neural networks. Johns Hopkins APL Technical Digest, 10:262–266.

Smith, J. W., Everhart, J. E., Dickson, W. C., Knowler, W. C., and Johannes, R. S. (1988). Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the Symposium on Computer Applications and Medical Care, pages 261–265. IEEE Computer Society Press.

Smith, R. E., Goldberg, D. E., and Earickson, J. A. (1991). SGA-C: A C-language implementation of a simple genetic algorithm. Technical Report 91002, The Clearinghouse for Genetic Algorithms, The University of Alabama, Department of Engineering Mechanics, Tuscaloosa, AL 35487.

Trunk, G. V. (1979). A problem of dimensionality: A simple example. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1:306–307.

Vafaie, H. and De Jong, K. (1998). Evolutionary feature space transformation. In Liu, H. and Motoda, H., editors, Feature Extraction, Construction and Selection: A Data Mining Perspective, chapter 19, pages 307–323. Kluwer Academic Publishers, Norwell, MA.

Vedani, A. and Huhta, D. W. (1991). An algorithm for the systematic solvation of proteins based on the directionality of hydrogen bonds. J. Am. Chem. Soc., 113:5860–5862.

Wade, R. C., Clark, K. J., and Goodford, P. J. (1993). Further developments of hydrogen bond functions for use in determining energetically favorable binding sites on molecules of known structure. J. Med. Chem., 36:140–147.

Weiss, S. and Kapouleas, I. (1989). An empirical comparison of pattern recognition, neural nets, and machine learning classification methods. In Sridharan, N. S., editor, Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, pages 781–787, Detroit, MI. Morgan Kaufmann.

Weiss, S. and Kapouleas, I. (1990). An Empirical Comparison of Pattern Recognition, Neural Nets, and Machine Learning Classification Methods. Morgan Kaufmann.

Westhof, E., editor (1993). Water and Biological Macromolecules. Topics in Molecular and Structural Biology. CRC Press Inc., Boca Raton, FL.

Whitney, A. (1971). A direct method of nonparametric measurement selection. IEEE Transactions on Computers, 20:1100–1103.

Wilson, D. L. (1972). Asymptotic properties of nearest neighbor rules using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2:408–421.

Yang, J. and Honavar, V. (1998). Feature subset selection using a genetic algorithm. In Liu, H.
and Motoda, H., editors, Feature Extraction, Construction and Selection: A Data Mining Perspective, chapter 8, pages 117–136. Kluwer Academic Publishers, Norwell, MA.

Yang, J., Parekh, R., and Honavar, V. (1998). DistAl: an inter-pattern distance-based constructive learning algorithm. In Proceedings of the International Joint Conference on Neural Networks, Anchorage, Alaska.

Zhang, X.-J. and Matthews, B. W. (1994). Conservation of solvent-binding sites in 10 crystal forms of T4 lysozyme. Protein Sci., 3:1031–1039.