. «a.
.55 -a

.4
; A... c 51
1.!!ll)

 

Gil. .

a 2....
$3.? 1.
be“): shim-”Qt...“
. Vial.“ 1.! ‘
. .3...ch3:,. .
.c ....mh.ww1u.s.......s.....;n3.
.. . 1|
"(Hull
. ¢_ .m.,.§..._.
:3.

aw“...

saw?

2
5.1..

5...}.
5&3}
.. 2
.1...
A .as ..
v
3 Writ—km. r

Aﬁhmnmw

1!.

a. 0:
up}...

. J!
x: .l
3.)

. n
!u:
)a:.

‘ .
Liam?

\t': A
A” '1‘ 4“”

k

)1

I.

 

50:)?

This is to certify that the
dissertation entitled

Molecular Conformation of Clusters by Genetic Algorithm
Using Spatial Operators and Unlabeled Distance Data

PhD.

 

presented by

David M. Cherba

has been accepted towards fulﬁllment
of the requirements for the

degree in A Computer Science

 

  

 

‘ LIBRKRY ‘

Michic' “-‘tate

 

Major Professor’s Signature

/Z,/&'L/i> 5
Date

MSU is an Aﬁinnative Action/Equal Opportunity Institution

‘

Ul'lrv». vat,

 

 

PLACE IN RETURN BOX to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.

 

DATE DUE

DATE DUE

DATE DUE

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

2/05 p:/CIRCIDateDuo.indd-p.1

MOLECULAR CONFORMATION OF CLUSTERS BY GENETIC
ALGORITHM USING SPATIAL OPERATORS AND
UNLABELED DISTANCE DATA

By

David M. Cherba

A DISSERTATION

Submitted to
Michigan State University
in partial fulﬁllment of the requirements
for the degree of

DOCTOR OF PHILOSOPHY
Department of Computer Science

2005

ABSTRACT

MOLECULAR CONFORMATION OF CLUSTERS BY GENETIC
ALGORITHM USING SPATIAL OPERATORS AND UNLABELED DISTANCE
DATA

By
David M. Cherba

A set of Genetic Algorithm (GA) operators based on spatial location concepts will
provide improved performance for a class of NP hard search problems in N dimensional
spaces. A set of spatial operators for use with genetic algorithms is proposed for a class
of problems with real-valued genes that consist of N-dimensional homogeneous vectors.
Evolutionary computation is capable of providing solutions to problems that would be in-
tractable using more conventional methods. A subset of these problems is represented in
real-valued three dimensional spaces or other more complex vector spaces. This thesis ad-
dresses a number of issues related to the natural inﬂuences that adjacent locations in these
spaces have on the ﬁtness functions used in genetic algorithms. A subset of building blocks

(schema) will be utilized based on these natural inﬂuences.

It will be shown that these operators can be described by a building block style of theory
that supports the experiment results. Further, the spatial base operators naturally preserve
the interactions between genes for this class of problems. Genes have a natural inﬂuence
on each other based on proximity. To be an effective genetic algorithm, operators need to
take these proximity effects into consideration in order to preserve good contributions to
ﬁtness. Failure to utilize these spatial relationships will lead to very poor performance of

the genetic algorithm or require statistical methods to try to capture the relationships.

As a demonstration of these spatial operators, this dissertation will focus on the con-
formation of molecular clusters, where each atom’s location represents a gene with real-

valued coordinates. Further, the algorithm presented will work from unlabeled distance

information available from experiments with limited preparation. A set of theories will be
presented that form the basis for prediction of operator effectiveness, population size and
convergence for this class of problems. The theory will be based on a schemata like set
of building blocks constructed from subgraphs of limited size where the bonds represent
edges and the atoms are nodes. From this analysis, it is possible to Show the construction
of larger and more numerous building blocks by crossover and at the same time extract
the requirements for good exploration. Local search will be utilized, so that the genetic

algorithm operators focus on the construction of building blocks and exploration.

© Copyright 2005 by David M. Cherba
All Rights Reserved

ACKNOWLEDGMENTS

I would like to thank all the people who have encouraged me in this academic pursuit,
especially those members of my Doctoral Committee; Professor Goodman, Professor En-
body, Professor Duxbury, and my advisor Professor Punch. The patience, guidance and
the encouragement that all the members of my committee provided was invaluable in the
completion of this dissertation, especially when it was hard to articulate the ideas in an
organized manner. In addition, I would like to thank the CSE staff, especially Linda Moore
without whom the whole process of taking classes and ﬁling the paper work would never
get done. I would like to thank Professor Jain, Professor Stockman, Professor Esfahanian,
and all the faculty whose discussion about both academic topics and life issues have helped
me to be a better student and instructor.

The understanding and support from my family, especially my wife, Wendy, have been
unconditional despite the sacriﬁces. For the encouragement when I was having moments
of self doubt and the support of my children that have given me good advice about being a
better instructor and educator. To my parents for their early focus on learning and scientiﬁc

experimentation.

TABLE OF CONTENTS

LIST OF TABLES x
LIST OF FIGURES xi
1 Introduction 1
1.1 Hypothesis .................................... 2
1 .2 Motivation ..................................... 2
1.2.1 Source of Data ................................. 3
1.2.2 Preprocessing Data ............................... 5
1.2.3 Comparison to Protein Molecules ....................... 6
1.3 Class of Problems ................................. 7
1.3.1 Molecular Clusters ............................... 7
1.3.2 Generalized Problem Classes .......................... 8
1.3.3 Fitness Landscape ............................... 8
1.3.4 Number of Local Minima ........................... 9
1.4 Road Map of Work ................................ 9
2 Background 13
2.1 Evolutionary Computing ............................. 15
2.1.1 Broad Divisions ................................. 15
2.1.2 Basic GA Operation .............................. 16
2.1.3 Biological Inspiration .............................. 17
2. l .4 Chromosomes .................................. 1 8
2.1.5 Theory ..................................... 19
2. 1 .6 Operators .................................... 20
2.1.7 Replacement Strategy .............................. 23
2. 1 .8 Selection .................................... 24
2.2 Molecular Conformation of Clusters ....................... 24
2.2.1 Energy Functions ................................ 25
2.2.2 Diversity .................................... 28
2.2.3 Local Search .................................. 28
2.2.4 Distance Values ................................. 29
2.3 Current Work with Clusters Using GA ...................... 32
2.3.1 Deaven and Ho ................................. 34
2.3.2 Hartke ...................................... 38
2.3.3 Sastry ...................................... 42
2.4 Non-GA Solution Methods ............................ 46
2.4.1 Simulated Annealing .............................. 46

vi

2.4.2 Monte Carlo and Basin Hopping ........................ 48

2.4.3 Statistical Directed ............................... 49
2.5 Spatial Concepts in EC .............................. 50
2.5.1 Gene Arrangement ............................... 51
2.5.2 Multi-Deme ................................... 56
2.6 Local Search ................................... 59
2.6. 1 Quasi-Newton ................................. 59
2.6.2 Nelder-Mead .................................. 59
2.6.3 Constructive .................................. 60
2.6.4 Gradient ..................................... 60
2.6.5 Conjugate Gradient ............................... 61
2.7 Representation of Clusters / Molecules ...................... 61
2.7.1 Locations .................................... 62
2.7.2 Euclidian Distance Matrix ........................... 63
2.7.3 Distance Labels ................................. 63
2.7.4 Angles and Sequence .............................. 64
2.7.5 Crystals ..................................... 67
2.8 Geometric Algebra ................................ 69
2.8.1 Rank ...................................... 69
2.8.2 Vector Operations ................................ 70
2.8.3 Constraints ................................... 70
2.8.4 Cayley-Menger Determinants ......................... 70
2.8.5 Solution Methods ................................ 71
2.9 Protein Conformation ............................... 71
2.9.1 Preprocessing .................................. 72
2.9.2 Objective Functions ............................... 73
2.9.3 Search Process ............................ _ ..... 73
2.10 Summary ..................................... 74
3 Theory 76
3.1 Class of Problems ................................. 77
3.1.1 Homogeneous Vectors ............................. 77
3.1.2 Genes ...................................... 78
3.1.3 Fitness Function ................................ 79
3.2 Local Search ................................... 80
3.2.1 Volume ..................................... 81
3.2.2 Distance-Based Local Search .......................... 83
3.2.3 Convergence Criteria .............................. 84
3.2.4 Iteration Limit ................................. 84
3.3 Spatial Concept .................................. 85
3.3.1 Division of Space ................................ 87
3.4 Spatial Building Blocks and Subgraphs ..................... 91
3.4.1 Deﬁnition .................................... 92
3.4.2 Counting Spatial Subgraphs .......................... 93
3.4.3 Regular Graph Model .............................. 93

vii

3.4.4 Cube ...................................... 94

3.4.5 N11 Lennard-Jones Molecule ......................... 96
3.4.6 Bucky Ball ................................... 98
3.5 Crossover ..................................... 100
3.5.1 Standard Crossover ............................... 101
3.5.2 Spatial Crossover ................................ 102
3.6 Mutation ...................................... 103
3.6.] Simple Mutation ................................ 103
3.6.2 Relocation Mutation .............................. 104
3.7 Viable Population Size .............................. 105
3.7.1 Initial Population of Genes ........................... 106
3.7.2 Reduction in Gene Pool by Selection ..................... 107
3.7.3 Introduction of New Subgraphs ........................ 107
3.7.4 Diversity of Population ............................. 112
3.8 Graph Isomorphism ................................ 112
3.9 Convergence ................................... 1 15
3.9.1 Operator Effectiveness ............................. 1 16
3.9.2 Difference Equation .............................. 116
3.10 Scalability ..................................... 117
3.10.1 Fitness Function ................................ 118
3.10.2 GA Algorithm ................................. 119
3.11 Summary ..................................... 122
4 Experiments and Results 124
4.1 Local Search ................................... 125
4.2 Ideal Distances .................................. 130
4.2.1 N12 Lennard-Jones Molecule ......................... 130
4.2.2 N20 ....................................... 134
4.2.3 N38 ....................................... 135
4.2.4 N60 ....................................... 136
4.2.5 Bucky Ball ................................... 137
4.3 Non-Ideal Distances ................................ 141
4.3.1 Bucky Ball Experimental Data Set Target ................... 142
4.3.2 Artiﬁcially Degraded Data ........................... 146
4.4 Parameter Selection Searches ........................... 146
4.4.1 Population Size ................................. 147
4.4.2 Operator Probabilities ............................. 149
4.4.3 Local Search Parameter Effect ......................... 157
4.5 Classic Two-Point Crossover ........................... 158
4.6 Operator Effectiveness .............................. 160
4.7 Scalability ..................................... 162
4.8 Discussion ..................................... 165
4.8.] Comparison of Classic vs. Spatial Operators .................. 165
4.8.2 Population Size: Theory vs. Actual ...................... 165
4.8.3 Operator Effectiveness ............................. 166

viii

4.8.4 Scalability: Theory vs. Actual ......................... 167

4.9 Summary ..................................... 167
5 Conclusions 169
5.1 Spatial Operators ................................. 169
5.2 Interrelationship .................................. 170
5.3 Objective Functions ................................ 171
5.4 Conﬁrmation of the Hypothesis .......................... 172
5.5 Competitive Methods ............................... 173
5.6 Future Research .................................. 173
5.6.1 Improved Scalability .............................. 173
5.6.2 Adaptation to Folding Proteins ......................... 174
5.6.3 Dynamic Parameter Selection ......................... 175
5.6.4 Using Multiple Levels of GA .......................... 175
5.6.5 Tracking Genealogy .............................. 175
BIBLIOGRAPHY 177

ix

2.1

3.1
3.2
3.3
3.4
3.5

4.1

4.2
4.3
4.4
4.5
4.6
4.7

4.8

4.9

LIST OF TABLES

Main characteristics of evolutionary computing ................. 16
Multiple plane division of space ......................... 91
Building blocks for cube ............................. 96
Building blocks for ideal N11 LJ ......................... 97
Building blocks for N11 LJ random initialization and after local search ..... 97
Bucky building blocks .............................. 99
Failed convergence by local search to ideal N 11 by count and magnitude of

perturbation .................................. 127
Summary of 100 Trials Bucky Ball With Real Data ............... 146
GA operating parameters modes and sizes .................... 148
GA operation counters .............................. 149
GA operating ﬂoating point parameters ..................... 150
GA operating ﬁle names preﬁx and sufﬁx .................... 151

Solution comparison of standard and spatial crossover for N20 Lennard-Jones
molecules ................................... 160

Solution comparison of standard and spatial crossover for N38 Lennard-Jones
molecules ................................... 160

Scalability for various size molecules ...................... 164

1.1

1.2

2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8
2.9
2.10
2.11
2.12
2.13
2.14
2.15
2.16
2.17
2.18
2.19

LIST OF FIGURES

X-ray diffraction data (implied count of atom pairs) by atom separation dis-

tance [56] ................................... 4
Ideal Bucky ball derived from distance data ................... 4
General evolutionary computing operation .................... 17
Chromosome mutation .............................. 21
Chromosome recombination ........................... 22
Ideal data histogram ................................ 31
Histogram derived from experimental data .................... 32
Pair distance data implied counts by radius ................... 33
Left and Right hemisphere combined in crossover operation ............. 37
xy projection of n=104 molecule [27] ...................... 40
GA algorithm outline for Sastry [60] ................... '. . . . 44
K N graph arrangement of genes ......................... 52
Row arrangement of genes ............................ 52
2D matrix arrangement of genes ......................... 53
Circle arrangement of genes ........................... 53
3D matrix arrangement of genes ......................... 54
2D division of genes ............................... 54
2D multiple division of genes ........................... 55
2D multiple division of genes using Voronoi spaces ............... 55
Island populations for genetic algorithms ..................... 58
Multiple population multiple layer competition ................. 58

xi

2.20
2.21
2.22
2.23

3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9

4.1
4.2
4.3
4.4
4.5
4.6
4.7
4.8
4.9
4.10
4.11
4.12
4.13

Typical group of atoms .............................. 66

Two combined groups with dihedral angles ................... 66
Crystal base structure ............................... 68
Crystal lattice structure .............................. 68
Example of multiple plane division ........................ 90
BB counting R3 .................................. 94
Cube schema R3 ................................. 95
Initialized molecule for GA run .......................... 98
Initialized molecule for GA run after local search ................ 99
Bucky ball schema ................................ 100
Bucky division Shown in 2D ........................... 101
Local search contributions ............................ 120
Local search adjustment ............................. 121
N11 local search probability of success by number of iterations ........ 128
N11 local search probability failure by sum of error ............... 128
N11 local search probability failure by max perturbation ............ 129
N12 simple mutation probability sweep ..................... 131
N 12 population versus average generations to solve ............... 132
N 12 local search convergence factor versus average generations ........ 133
N12 population versus evaluations ........................ 133

N12 local search convergence factor versus standard deviation of generations . 134

Population size effect on computation cost to ﬁnd solution N20 ........ 135
Population size effect on average generations to ﬁnd solution N38 ....... 136
Ideal Bucky ball ﬁtness .............................. 138
Ideal Bucky ball energy ............................. 139
Combined energy and distance objective Bucky ball ............... 140

xii

4.14
4.15
4.16
4.17
4.18
4.19
4.20
4.21
4.22
4.23
4.24
4.25
4.26
4.27
4.28
4.29
4.30
4.31
4.32
4.33

Combined energy and distance objective N38 .................. 140

Ideal Bucky ball single-level runs ......................... 14]
Comparison of ideal and experimental distances for Bucky ball ......... 143
Accumulated distance error Bucky ball data set ................. 143
Good Real data Bucky ball Results ....................... 144
Typical Defect for Real Data Bucky ball result ................. 145
Population Size effect on average generations to ﬁnd solutions N 20 ....... 147
Population size effect on number of failed runs N20 ............... 151
Population size effect on total average evaluations on N20 ........... 152
N20 crossover probability verse failed solution ................. 153
Average generations to solution N20 ....................... 154
Effect of simple mutation on average solution generations N20 ......... 154
Effect of relocation mutation on average solution generations N20 ....... 156

Effect of relocation mutation repetition on average solution generations N20 . . 156

Average evaluations to ﬁnd solution N20 ..................... 158
Effect of convergence step size on failed solutions N20 ............. 159
Effect of convergence step size on average generations to ﬁnd solution N 20 . . 159
Crossover effectiveness by generation ...................... 162
Simple mutation effectiveness by generation .................. 163

Relocation mutation effectiveness by generation ................ 163

xiii

Chapter 1

Introduction

The derivation of molecular structure continues to be a major challenge in many ﬁelds of
science. It has become more important than ever to understand the effect of structure on
behavior and properties of everything from proteins, drugs and superconductors to nano-
devices. Even with all the advances in microscopic examinations, the two main sources
of information about molecular structure are X-ray diffraction and nuclear magnetic res-
onance (NMR). The results from both of these technologies are processed and eventually
yield a set of inter-atom distance pairs. The basic problem is how to use this set of distances
to derive the structure of the molecule. From a pure mathematical perspective, the problem
of embedding a graph has been shown to be NP—Hard by Saxe [61]. The ﬁeld of Evolu-
tionary Computation (EC) has proven to be a viable solution strategy for many problems
that are NP-Hard. With inspiration from natural biological principles such as reproduction,
survival of the ﬁttest, diverse populations and mutation, these algorithms stochastically

produce candidate solutions.

Applications of GA, a branch of evolutionary computation, have been successful at
solving certain classes of molecular conformation problems, the most notable being both
clusters and biological molecules. This dissertation will address a set of operators for use

with GA that improves efﬁciency for a general class of problems that includes molecular

1

conformation of clusters from atom-pair distance information.

1.1 Hypothesis

A set of GA operators based on spatial location concepts will provide improved perfor-
mance for a class of NP-hard search problems in N-dimensional space. This class of prob-
lems have real-valued genes that consist of homogeneous vectors that represent a spatial
location. These operators improve efﬁciencies by preserving groups of genes in the vector
Space that contribute to high ﬁtness values. The theory to support the observed performance
improvements is based on examination of groups of genes as small subgraphs where the
atoms are nodes and the bonds are edges. These subgraphs are a distinct subset of the
more general concept of schema as proposed by Holland [36]. The main difference is that
the groups of genes are deﬁned by the proximity to each other in vector space rather than

arbitrary subsets of a whole chromosome gene set.

1.2 Motivation

Given the sources of data used to generate structure, signiﬁcant amounts of domain knowl-
edge have to be supplied to even begin searching for the actual structure. This domain
knowledge can come in many forms. Known-good structures for similar molecules, protein
side chain and backbone sequences, doping of heavy atoms to identify speciﬁc locations,
and many other methods are used to provide either Starting points or constraints on the
search. The time required or the biases introduced by the domain knowledge application
process deter from the ability to ﬁnd unbiased structures. In many cases, the researchers
resort to artiﬁcial measures to search for viable conformations of molecules. The most
common is the introduction of energy models. An energy model is used to ﬁnd an optimal
low energy conformation of the molecule. These low energy structures are then compared

to the experimental data to determine the likelihood of a match. The energy models can be

2

simpliﬁed to the point of making calculations reasonable for large protein molecules and
using energy gradients to guide search procedures.

Given a set of distances, the ultimate goal is to make possible a search for viable struc-
tures using minimal domain knowledge. Basically, once a computer is given a set of dis-
tances without any assignment to speciﬁc atoms, it will derive candidate structure with an
automated process in reasonable computational time. Further, the set of distances can be
partial or contain a ﬁnite amount of noise that is similar to that produced by experimental
processes and data preparation.

Figure 1.1 and Figure 1.2 give a pictorial description of the problem. The ﬁrst ﬁgure
shows the experimental data and the second the desired results if the data were perfect. In
reality, the data represented in Figure 1.1 is far from perfect and this makes the derivation
of molecular structure difﬁcult. The vertical axis indirectly depicts the number of atom
pairs at a given separation distance and the horizontal axis is distance of separation. The
difﬁculty depends on the completeness and quality of experimental data. In abstract terms,

this is a problem of embedding a graph with points and edges.

1.2.1 Source of Data

X-ray diffraction and Nuclear Magnetic Resonance Resonance (NMR) are the two most
common sources for data used in molecular conformation. In both of these experimental
studies the distance information must be derived indirectly from the experimental results.
X-ray diffraction requires a single crystal to be irradiated with X rays while the NMR
can process materials in solutions. This is especially important for proteins because some
cannot be crystallized.

In all these experiments, some primitive set of information must be known. For ex-
ample, the basic chemical equation along with the types of bonds formed are required to
process the experimental data. For biological molecules the actual sequence of the back-

bone and substructure of side chains may be known. For either process, some quantity of

3

 

 

 

 

 

Figure 1.1: X-ray diffraction data (implied count of atom pairs) by atom separation distance
[56]

 

Figure 1.2: Ideal Bucky ball derived from distance data

basic information is needed to begin processing the experimental results for the desired set
of distances.

The X-ray diffraction produces a scattering pattern. The scatter of X rays by electron
clouds around each atom forms a diffraction pattern. Using this diffraction pattern one
could imply the location of the atoms, if the phase information were also known. However,
without the phase information, the Fourier transform applied to the diffraction pattern can
only yield a set of inter—atom distance pairs.

For NMR, the distances between atoms is inferred from resonant energy peaks and val-
leys. It is possible, given a wide range of frequencies and multiple points of irradiation, to
deduce some location information of certain atoms in the structure in addition to distances.
This set of distances has some very real limitations. First, the atoms are interacting to some
degree. Second, the distances must be within the range affected by the source radiation. A
common practice is to create molecules with heavy atoms attached to key points because it
is easy to identify the interactions between them in the data. In this way, it is often possible
to, in addition to the distance data, make some assignments to speciﬁc atoms. While adding
some identiﬁcation is not enough to make the structure determination easy, it does reduce
the search space signiﬁcantly.

Regardless of which method is used, the best possible expectation is some limited noisy
distance data which represents the shape of the molecule. In both methods, constraints and
limits must be placed on the distances that can be derived. It is possible, using some
predictive techniques based on knowledge of the molecule under study, to ﬁll in some of
the gaps and make estimates of the expected distance distribution, even for overlapping

peaks in the experimental data.

1.2.2 Preprocessing Data

The type of molecule and the experiment conducted will determine the processing used

to extract the distance data. This process can be dramatically improved by the domain

5

knowledge. For example, with X-ray diffraction, the basic chemical equation can provide
constraints for the number and length of bonds that are expected. While the actual lengths
can vary, the basic physics of the bonding process can limit the range for the length of
bonds. Given a chemical equation, the preprocessing has a reference target at least for the
smaller distances. The problem of overlapping distances is a major concern. When the
natural distribution or the noise associated with the experiment cause the peaks to overlap,
it is very hard to construct a discrete set of distances that correspond to the peaks in the
experimental data. Given an atom count for a cluster, about all that is known exactly is how
many total distances need to be extracted from the data.

This work will use the derived distance set as its target for the objective function used in
the search process. This requires that the experimental data be used to produce a complete

set of distances.

1.2.3 Comparison to Protein Molecules

It is expected that some comparisons between ﬁnding structures of clusters and proteins will
be made. The structures of proteins which contain thousands of atoms often have a much
larger set of assumptions and domain knowledge. Having key domain information aids in
the determination of structure. While proteins are not explored in this work, it is important
to understand some of the issues associated with the preparation of the data and the basic
processes used in conformation. There are many semi-automated processes for ﬁnding
structure from proteins. In the case of proteins, the sequence of subcomponents that form
the protein is often known. Further, the interactions or distances between speciﬁc types of
atoms or bonds can be identiﬁed in the experimental data. In some cases, these interactions
are unique enough to make assignment of distances to speciﬁc atom pairs possible. In
others, the uniqueness is adequate to reduce the assignment to some limited number of
pairs. Using this information, the search space is reduced in size and it is possible to

search for conformation of signiﬁcantly larger molecules than clusters. The key point is

6

that the assignment of speciﬁc experimentally derived distances to speciﬁc atoms changes
the nature of the problem complexity. For the work on clusters here, no assignment of

distances to atoms was used as part of the objective function.

1.3 Class of Problems

The speciﬁc problem addressed in this work is to ﬁnd the structure of molecular clusters
using only distance information derived from experimental data without assignment to spe-
ciﬁc atom pairs. The nature of the problem is that a solution is represented by a vectors. In
this case, the set of vectors are three-dimensional locations in space that represent the loca-
tions of each atom in the molecule. The objective function to be optimized is signiﬁcantly
affected by the interaction of atoms close to each other. That is, each atom has multiple in-
teractions with those closest to it that can signiﬁcantly affect the overall objective function.

There are many general problems that have a similar nature to the ones dealt with here.
While the solution may not be easily represented by a set of three-dimensional vectors, the
conﬁguration of these problems and the objective functions have the same multiple interac-
tions based on the logical or physical proximity of points in some multi-dimensional space.
The key issue is preserving interactions that contribute to high ﬁtness so as to improve the

search operations.

1.3.1 Molecular Clusters

A molecular cluster is an isolated molecule of typically less than 300 atoms whose proper-
ties and characteristics are unique as compared to amorphous solids. Many of the molecules
used in this study are hypothetical low energy conﬁgurations. Others are derived entirely
from experimental data. The determination of molecule structure is driven by a set of
distances between all the atoms. Given a complete and accurate set of distances for a

molecule, the problem of assignment of distances to atom pairs still exists. If the molecule

7

has N atoms, the number of distances between all pairs of atoms is the same as the number
of edges in a K n graph. The number of distances m = (ﬁgﬂ increases as the square of
the number of atoms in the molecule. In the simplest of terms, this is a search among N 21
possible assignments. This search can be reduced by applying triangulation constraints and
the realization of repeated distances. However, the assignment problem is still intractable
with current computer resources even for moderate values of N.

Molecular clusters clearly demonstrate this class of problems. Regardless of the objec-
tive function used, the interactions between atoms close to each other signiﬁcantly affects
the ﬁtness of the molecule. Further, it is possible to group the atoms into subgraphs that

can be used as building blocks to build complete molecules.

1.3.2 Generalized Problem Classes

This class of problems can be generalized as follows. It includes any search space where the
solution is deﬁned by a set of homogenous vectors and/or where the criterion function has
signiﬁcant contributions from the proximity of each vector to each other. Each vector can
be thought of as a location in Space. In molecular clusters, the interaction between bonded
atoms is the strongest. Taking this concept beyond normal physical spaces, imagine a set
of operating parameters for a grid of power plants where each plant’s operating point is
deﬁned by a complex vector with many values. Each plant’s operating point is a location
in hyperspace. The proximity of each plant can be deﬁned by the interconnection of the

power grid. Plants closer together will have a greater inﬂuence on each other.

1.3.3 Fitness Landscape

The ﬁtness landscape has a major inﬂuence on the methods that can be applied to ﬁnd
optimal solutions. A smooth landscape can use a variety of search methods simply by
following the gradients in the landscape. In molecular clusters, a signiﬁcant amount of

work has been done on taking complex ﬁtness landscapes and mapping them to deformed

8

surfaces where more traditional searches work. This issue of ﬁtness landscape affects the
operation of GA algorithms as well. The main difference is that, by maintaining a popula-
tion of candidate solutions and constantly introducing new solutions, the trapping effect of

all the local minima can be reduced.

1.3.4 Number of Local Minima

In the case of molecular clusters, the ﬁtness landscape based on energy criteria is very com-
plex. Further, the number of local minima has been shown by Hoare [34] to be 6(eN2 ). Not
only is the number of minimum large, but the energy wells that each minima resides in can
have very large walls that must be overcome to move from one minimum to another. This
combination makes the conformation problem extremely difﬁcult to search using potential
energy as the criterion function. Even for the most robust methods currently in use, the
real elapsed computer time for a search on a cluster with 100 atoms can take two thousand

hours [28] to ﬁnd a low energy conﬁguration.

1.4 Road Map of Work

This work is divided into sections for background information, theory of the spatial Oper-
ators, and the experimental results obtained using the spatial operators. The background
chapter starts with a review of the current work with molecular clusters. This review will
focus on the use of GA’s and other viable methods to ﬁnd low energy conﬁgurations of
clusters. Next in the chapter, a review of the basic issues for evolutionary computation will
be described as they pertain to the problem of applying a GA to the cluster problem. At
this point, a brief survey of spatial concepts that appear in the current literature on genetic
algorithms will be presented. Several applications of spatial concepts have been made with
multiple population GA and some graph problems. These are distinctly different from the

work here. The structure of the populations in these applications is conﬁgured in a variety

9

of shapes and spaces, while the spatial operators used here will divide individual genes into

discrete subspaces for the purpose of mutation and recombination.

As part of practical application for many GA’S, local searches can be Shown to signif-
icantly improve performance. In the background chapter, a review of local searches used
in the molecular conformation applications will be provided. The representation of the
molecules has an impact on the type of algorithm and the nature of the operators used. The
various possible representations will be explored with speciﬁc attention to the location of
the atoms as points in Space. Several other representations will be included because of the

applications to other types of problems.

The mathematical concepts for geometric algebra form a strong basis for many of the
constraints and iterative forms of solutions. Many protein conformation methods use evolu-
tionaryI computation utilize geometric algebra in the objective functions and in preparation
of experimental data. While only clusters will be studied here, the use of these principles
in the determination of the protein structure is a necessary background for understanding
the signiﬁcance that even partial distance assignment and data reﬁnement can have on the

search process.

The theory chapter starts with a formal deﬁnition of the class of problems that these
new operators address. This is further focused by how the class of problems is represented
and how ﬁtness functions are calculated. The class of problems is directly related to the
introduction of spatial concepts and the division of hyperspace. The theory chapter then
introduces the subgraph as a building block and a form of schema for vectors that are in
close proximity. The local search and the spatial operators are then combined to provide a
basis for judging the exploration of search spaces. Finally, in the theory chapter, rates of
convergence are predicted along with the impact of diversity in the population of candidate
solutions. As these factors signiﬁcantly impact scalability, some predictions are made based

on the overall theory.
The next chapter describes the experiments that were conducted and how well they

10

compare to the predicted results of the theory. This will include many examples of well
known low energy molecular clusters along with the Bucky ball. In addition, examples
of the spatial operators used with real distance data for the Bucky ball will be described.
Several examples of artiﬁcially degraded distance sets are also presented. The data sets
that have been run provide a broad spectrum of cluster sizes ranging from eleven atoms to
eighty-eight atoms. As part of the overall work, possible parameters for the GA operations

are explored for effective values.

The role of local search is also critical to the operation of the GA. In particular, the
ability to rapidly ﬁnd local minima and the best candidate solutions will impact overall
performance. In the experiments conducted, a characterization of the local search used
was conducted on a small cluster with eleven atoms. This characterization will prove to be

useful for connecting the theory to the performance of the GA with spatial operators.

While the critical test of all algorithms in this area is the ability to scale to larger clus-
ters, the experimental results will provide the needed information to predict the perfor-
mance for large clusters. The computational cost to calculate the objective function grows
as 0(n2). This objective function cost is multiplied by the cost associated with the basic na-
ture of the algorithm. The objective function computational cost has a major impact on the
scalability of the search methods and it is clear that larger molecules represent a signiﬁcant

challenge, as will be shown by the results presented here.

An experimental comparison is made between classical two-point crossovers and spa-
tial crossover. The comparison clearly shows the beneﬁts of the spatial operators. Differ—
ential tests, where individual components of the GA algorithm are substituted or removed,
will be presented. These results will show the utility of the spatial operators for this class

of problem.

As promising as the theoretical and experimental results are, the focus of the next chap-
ter is on the future work that must follow in order to demonstrate the utility beyond molec-

ular clusters. Expanding this work into small biological molecules is necessary. A variety

11

of ideas that have been developed while working on the molecular cluster problem need to
be explored. First, the problem with the search scaling needs to be explored. Expanding
this work into small proteins or working with groups of atoms as single entities needs to
be addressed. From some of the experimental data, it is clear that the spatial mutation en-
hanced the operation of the searches but was rarely directly responsible for the creation of
the current best-of-generation molecules. Exploration of the genealogical tree for the best
of molecules may Shed some insight into the actual role played by this operator.

The ﬁnal chapter summarizes this work with a number of conclusions about the ex-
periments and theory. An empirical measure of improvement is provided for several of
the experimental cases described. The foundation of the subgraph as a means to represent

multiple linkages among genes will be summarized.

12

Chapter 2

Background

The process for ﬁnding the structure of molecular clusters with genetic algorithms bridges
several distinct areas. First, the structure of clusters is determined by the force interactions
between the atoms. These forces can be summarized as the potential energy level of a
molecule. Many methods for energy calculations are used, and these have been a dominant
criterion for structure determination methods used throughout the ﬁeld. Second, Evolu-
tionary Computing is used as a means to explore the large and complex search space for
solutions. This basic biological process of evolution, modeled with chromosomes and ge-
netic operators, has been surprisingly successful in ﬁnding solutions to difﬁcult problems.
While the basic process of genetic algorithms can be stated in a simple outline, the practi-
cal implementation becomes complex as all the issues that effect performance are identiﬁed
and resolved. In addition to these main two areas, the problem has its roots in geometric
algebra, combinatorial mathematics, matrix algebra, physics and biochemistry. Nearly all
of these areas are involved in ﬁnding solutions to the molecular structure problem. Several
of the current methods combine Evolutionary Computing and the potential energy searches
to ﬁnd conformations. A representative selection of these methods is described in the ﬁrst
section "Molecular Conformation of Clusters”. In addition, several other competing meth—

ods are described that range from random style searches to sophisticated statistical based

13

searches.

Evolutionary computing can be divided into several broad areas and is constantly
changing with the introduction of new methods and styles of operators. The next sec-
tion in this chapter will provide a brief overview of Evolutionary Computing with a focus
on those issues that related directly to the genetic algorithm used in this work. One major
issue is the use of spatial concepts in the previous GA research. These have been limited
primarily to multiple GA populations and some graph problems. A review of this work will

be presented.

A major departure from what is viewed as pure GA, essential to solving many real prob-
lems, is the role that local search plays in the operation of genetic algorithms. The types of
searches range from classical gradients to Hessian reductions. When GA’S are combined
with local search, the formal term ”memetic” [49] is used to describe these algorithms.
The key point is that the search uses a population of potential solutions and some form of
a metaheuristic; in this case, a GA with local search. Nearly all the methods involved in
the search for the structure of molecules use some form of local search. Some of the more

common basic search methods will be reviewed.

The representation of molecular structure can vary depending on the domain informa-
tion that is known. At some point, despite the type of representation used, the location of
each atom in space must be calculated. The basic representation styles will be described
for clusters, proteins and crystals. The representation for clusters is the most straightfor-
ward. The crystals will be the most compact, utilizing only a few parameters to describe
the locations of many atoms in a lattice. Both crystals and proteins have additional ﬁxed

information that reduces the degrees of freedom for the model.

The Geometric Algebra section provides useful information for understanding the var-
ious preprocessing methods used on experimental data as well as the basic search methods
used on larger molecules like proteins. Many of the mathematical concepts, applied to

three dimensional objects, can be used on problems of higher dimensionality. One key

14

note is that a complete set of distances contains a signiﬁcant amount of redundant infor-
mation. Many of these algebraic concepts are used to reﬁne distance information or make
assignments to speciﬁc atoms possible.

The last section gives a brief overview of the automated processes that are currently
used in the discovery of protein structure. While this dissertation does not address proteins,
it is expected that some of the concepts used here will be applicable to the problem of
ﬁnding protein structure and that the current methods provide a contrasting approach to

molecular structure problems.

2.1 Evolutionary Computing

Evolutionary computing (EC) is a recent development. It was ﬁrst used in the late 1960’s
and early 1970’s to address a class of optimization problems. Like with most discover-
ies, there were several independent researchers taking slightly different approaches to what
later could be loosely consolidated as evolutionary computing [35] [57] [20]. This chapter
will present a brief history of EC and provide some background information on the types of
operators and selection processes that were used in solutions described in this work. Gen-
erally there are some broad categories that Evolutionary Computing can be divided into.
These categories will be listed along with some of the features that divide them by styles of
operators and population. Some common features will be described including populations
of candidate solutions, mutation, recombination, and a survival of the ﬁttest. This mea-
sure of ﬁtness can be any evaluation of the chromosome used to represent a member of the

populations that is the objective of the search problem.

2.1.1 Broad Divisions

Table 2.1 attempts to help classify the various ﬁelds within evolutionary computing. There

are more similarities than differences when all the variations are explored. All the ﬁelds

15

 

 

 

 

 

 

 

 

 

 

 

 

ES GA EP GP
Represen- real-valued binary real-valued tree and functions
tation ﬁxed length variable length
Population parents ﬁxed ﬁxed
generational generational
Selection best offsprings operation probabilistic
+/— parents extinctive
Fitness objective Scaled objective Scaled objective Objective
function value function value function value function value
Mutation Gaussian bit inversion Gaussian functions
Gaussian adaptive terminals
Recom- parents parents none parents
bination (2 or more) two offspring two offspring
discrete N point sub tree swap
sexual uniform
panmictic

 

use multiple members of a population. All members of the population are measured with

ﬁtness. Individual members of the population are altered by some form of mutation or other

Table 2.1: Main characteristics of evolutionary computing

operators during the evolutionary process. The population members are selected by ﬁtness

for inclusion in the next population or for participation as a parent. Some of the entries in

this table are based on a similar table in Back’s book on evolution strategies [3].

The entries in this table are not rigid and exceptions to all the entries are not hard to ﬁnd.

This table serves only to differentiate the four areas on a limited number of features and

is not complete. Key points have to do with basic population control of ﬁxed generational

versus p and A styles along with the type of encoding. The type of encoding leads to

some additional types of arithmetic recombination strategies rather than the simple choice

of inclusion found with the binary encodings.

2.1.2 Basic GA Operation

The block diagram 2.1 represents a generalized operation of an evolutionary algorithm.

Collection of information about individuals in the population include ﬁtness and popula-

l6

 

 

 

 

 

 

 

 

 

J Selection R\

Evaluations

Population , Population
Generation ::> Se'ectiO" _.> Transmrmation :> Generation

N N+1

L_.
U

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Characteristic Generation of .
of Population w—D operating 44» 96:33:20"
and individuals: parameters
fitness, rank,
diversity,
convergence

 

 

 

Figure 2.1: General evolutionary computing operation

tion statistics that can be used to control the selection and transformation processes. This
block diagram also shows the use of selection to determine the parents for the next gen-
eration as well as selection to determine the parents to be used in creating children. The
transformation process is intended to be general and includes mutation, recombination and

local searches. Details of these operations will be explored in the following sections.

2.1.3 Biological Inspiration

The biological parallels to the operation for evolutionary computing are very strong. From
the representation description as genes to the processes of mutation and recombination, the
analogies tie biological concepts to evolutionary computing. At the core is Charles Dar-
win’s theory of evolution. In this theory, the principle of ’survival of the ﬁttest’ combined

with apparently undirected small changes in phenotypes was recognized as a critical factor

17

in evolution. Several symptoms of small isolated populations can result in genetic drift and
reductions in overall ﬁtness of individuals in the population. This phenomenon of genetic
drift can be demonstrated with genetic algorithms and strong selection pressure on small
populations.

The amazing part of this whole process is the fact that all the hereditary information
is encoded by four nucleotide bases. Adenine, Thymine, Cytosine and Guanine make up
the entire set of building blocks for deoxyribonucleic acid (DNA). These blocks have many
unique properties that allow for replication and complex interactions. The hierarchy con-
sists of amino acids, proteins, and ribonucleic acid (RNA). At each level different processes
can occur.

A single gene can have a number of observable effects on an organism or genotypic
effect. On the other hand, many genes often combine to produce phenotypical changes in
the individual. This complex interaction on the overall characteristics of individuals is what
makes it hard to directly map cause and effect between the genetic encoding of DNA and
the resulting individual. The chromosome can be mapped into many small groups, which
have distinct functions such as translation or transcription. There are other groups which
appear to have no phenotypical function and these are called introns.

Just like with life, the use of genetic algorithms does not produce certain results, only
a likelihood of results. Further, just like in biological evolution, the process takes many
generations to Show distinct trends. This mean that it is often very hard to relate changes

in the genes and chromosome to speciﬁc measured results.

2.1.4 Chromosomes

The representation of chromosomes is built by sequences of the four nucleotide bases. This
alphabet of symbols is limited to the four abbreviations A, T, C, and G. Taking inspiration
from this limited set of symbols the binary 1’s and 0’s can be used to deﬁne the genes

and, from there, the overall chromosome. For the purposes of the genetic algorithms each

18

chromosome is expected to encode all the hereditary material for an individual. While the
mapping of some limited set of symbols such as binary can be shown to encode all the
required information, it is not always convenient to work with the binary bit representation.
In these cases, it is possible to have genes that are made up of real-valued numbers that often
have a more phenotypical nature. The phenotypical encoding breaks with true analogy of
the gene encoding. However, it often makes the actual manipulation of the chromosome
easier and avoids violation of the constraints that are applied to a given problem. In most
cases the chromosomes are made of genes that are either binary or real numbers, but there

is no limitation on mixing the types of genes in a real application.

2.1.5 Theory

There are several areas where the basic theory on evolutionary computing has provided
some foundations to build this work on. One area is the schema theory promoted by Hol-
land [36]. In this theory, the implied parallel nature of processing binary patterns is derived
and supports the building of successive larger blocks of good genes.

Another area of theory is the effect of selection on the convergence of a population.
Simply Stated, the stronger the selection pressure is the faster the population will converge.
If it converges too fast, it will not explore the solution space sufﬁciently to provide accept-
able solution candidates. The effects of selection pressure can be described by convergence
velocity, loss of diversity and dominance by highly fit individuals. In all this, the proba-
bilistic nature in the random selections and initial values of genes make exact analysis
difﬁcult.

Several authors have expanded the concept of schema to include real-valued genes. The
schema theory for real-valued genes has been addressed by Nehab in a recent work [50] and
another much earlier work by Antonisse [1]. Both of these describe a consistent schema
theory for real-valued genes. The Nehab paper describes a concept of schema as occupying

regions of a three-dimensional space. This paper goes on to develop a schema theory

19

using the same foundation as the binary formulation by Holland, but for these regions in

space[36].

2.1.6 Operators

The classic operators for genetic algorithms center on mutation and crossover. The terms
crossover and recombination are used interchangeably. Both describe the process of taking
two or more parents and selecting or combining genes from the parents to create a new
offspring. The range of these operators can vary based on the problem and the encoding.
When the encoding is pure binary, the mutation can be as simple as inversion of bits in
the parent used to create the offspring. When real numbers and problem constraints are
involved the mutation operators can span an enormous range of operations from multi-
variable Gaussian distributions to simple permutations. The crossover operators also in-
clude many styles. The most common form of variance is the number of points for dividing
and selecting genetic material from the parents. For real-valued genes there are a number
of numeric processes such as selection, averaging, minimum, maximum and statistically

weighed methods for generation of new gene values.

Mutation

Mutation normally involves a single parent from which a single offspring is constructed
based on that parent. In this process, the parent is modiﬁed by the mutation operator and
genes are changed in some very simple ways. TWO very different means of mutation ex-
ist. The ﬁrst involves inversion of bits in the binary representation. This method has some
drawbacks. For example, if all bits are equally likely to be inverted there are great dispari-
ties in the size of the change to some forms of binary encoding. For example, changing the
most signiﬁcant bit of an encoded number will basically cause a mutation whose magnitude
is half the range of valid numbers. Encoding methods such as gray code overcome some of

these issues for number representations.

20

 

Chromsome

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

gene 1 gene 2 gene 3 gene n
0 1 0 1 1 1 O 1 1 """ O 0 1
Mutate
Chromsome
gene 1 gene 2 gene 3 gene n
0 1 O 1 1 1 O 0 1 """ 0 1 1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.2: Chromosome mutation

The second mutation process is used for real-valued genes. This modiﬁcation to a gene
can be described by various random distributions. The distribution of the random change
can be a linear model or a more complex statistical method. A common distribution model
uses a mean value and standard deviation to characterize a normal Gaussian distribution
so that mutation magnitudes are statistically centered around a mean value. Controlled
statistical distributions can also be done for binary encoding, but it is more complicated
and involves multiple conversions. A simple example of mutation on binary encoded genes

is shown in Figure 2.2. In this Figure, the inverted bits are shown in bold.

Crossover-Recombination

Recombination normally involves two parents. These two parents are selected from the
population based on ﬁtness. The parents then each provide genetic material to create two
offspring. The total genetic material is conserved at this point in the recombination process.
Some material will be lost during the selection process. With real-valued genes, there are a
number of ways to contribute genetic material. It is possible to combine the values in some
arithmetic way from multiple (two or more) parents in a weighted fashion to create a single
offspring. It is possible to use gene values from many parents to create a single offspring.

For discrete selection of gene material, the offspring gets a copy of a gene from one of the

21

 

Chromsome Parent A

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

gene 1 gene 2 gene 3 gene n
0 1 0 1 0 1 0 1 1 """ 0 O 1
Chromsome Parent B
gene 1 gene 2 gene 3 gene n
0 1 1 1 1 1 O O 1 """ O 1 1
Recombination
Chromsome Offspring 1
gene 1 gene 2 ene 3 _gene n
O 1 1 1 1 1 0 1 1 """ 0 O 1
Chromsome Offspring 2
_gene 1 gene 2 gene 3 gene n
O 1 0 1 O 1 O O 1 """ O 1 1

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.3: Chromosome recombination

parents.

How the parent chromosome is divided to provide the genetic material is subject to
many variations. Each variation can be shown to be superior for some application. In
general, the chromosome is divided up into N pieces and each of the pieces is then labeled
as odd or even. When two offspring are created, the odd parts from parent A are combined
with the even parts from parent B to complete offspring one. The even parts from parent A
and the odd parts from parent B are used to complete offspring two. The manner in which
the dividing points are selected is subject to many controls. In the extreme, every gene is
probabilistically selected as odd or even. In other cases the dividing points follow a speciﬁc
rotating pattern. For example, the ﬁrst division point is between gene one and gene two.
The next division is between gene two and gene three. The pattern then continues until it

has to start over again. The dividing points can be determined by some statistical analysis

22

of the population based on ﬁtness. With binary coding, it is possible to divide a gene into
pieces. Normally, though, the division points preserve genes.

A taxonomy of crossover operators is provided in the work by Herra et a1. [32] for real-
valued genes. The work discusses the various arithmetic methods for creating offspring as
well as other discrete gene selection, but does not include any spatial based operators. The
problem of bias or effectiveness of crossover has also been addressed in several works [54]
[19] [18]. Another paper introduces a new class of crossover operator by Arabas[2] that

has some spatial qualities.

2.1.7 Replacement Strategy

There are three basic styles of population control. A generational style uses one gener-
ation as parent to produce offspring, and then the offspring become the parents. Each
replacement of the parents is a generation. A population is constructed and then parents are
selected from that population based on ﬁtness to produce offspring. Once the offspring gen-
eration is full, it takes on the role of supplying parents to the next population. A steady-state
approach selects parents based on ﬁtness to reproduce or mutate, but decided to whether to
replace some member of the population with the offspring based on ﬁtness. Some methods
replace one of the parents; others replace the weakest member of the population. The base
values for ﬁtness are then recalculated after every change to the population.

The third type of population is where )1 parents are created and then selection is done
among them based on ﬁtness to create offspring. This process is repeated until A offspring
are created. Once the offspring population is complete, a competition takes place to see
who will be the p parents in the next generation. TWO common versions are used. In one,
only the offspring compete for the [1 spots while in the other, parents and offspring compete.
These are referred to as (u, A) and (u + A) strategies [3].

Another prominent topic associated with population is multiple populations for more

robust algorithms. A recent paper by Hu [38] describes a system with a stratiﬁed population

23

in a hierarchy to maintain a robust operation, thus avoiding many of the common problems

with population stagnation.

2.1.8 Selection

There are many selection schemes in use. Selection occurs in two ways. One form of
selection is used to determine participation as a parent for mutation, recombination or other
operators that create an offspring. The second way is selection for inclusion in the next
population. Some methods rank all the members of the population then select the better
ranking members in a probabilistic fashion for parenthood or simply for inclusion in the
next population.

Roulette wheel is a method of selection based on a probability of a member’s ﬁtness in
relation to the sum of all the ﬁtness values. Tournament selection forces X members of the
population to compete, with the winner being chosen for inclusion, or parenthood. This
selection process, either for parenthood or inclusion in a population, is a key element to all
the algorithms. It is this survival of the ﬁttest that drives the evolutionary process. The down
side is when a population is overtaken by a single ﬁt individual or group of individuals. This

is referred to as the loss of diversity and is a major issue with evolutionary computing.

2.2 Molecular Conformation of Clusters

The basic process for determining the structure of molecular clusters focuses on ﬁnding
molecules with low potential energy. It then compares the derived distance information or
an X-ray scattering simulation to the experimental data. The actual data-driven solution
methods are not emphasized in the current approaches. As most of the methods used to
ﬁnd clusters use potential energy functions, there is no surprise that there are numerous
different energy functions. Several of these will be described. The most successful of the

methods employ genetic algorithms. Three current works cover not only some of the ﬁrst

24

innovations used with the genetic algorithms but also set the standard for scalability and
advance GA methods using learned linkages.

The application of genetic algorithms to this problem is relatively recent as compared
to the previous work in this area. While the use of closed form solutions has been limited,
several computer—based methods that involve concepts of simulated annealing, stochastic
searches, and surface deformations have been successful. These will be brieﬂy described.
For very small clusters (three to ﬁfteen atoms) many of the other methods are shown to be
more effective at ﬁnding the structure than GA’s. The difﬁculty increases as molecule size

grows.

2.2.1 Energy Functions

There are several different potential energy functions used in the search for molecular struc-
ture. The most common is the Lennard-Jones function. The energy-based objective func-
tion optimizations normally sum all the individual energy terms associated with pairs of
atoms. Depending on the level of modeling, additional terms based on adjacent triplet
and quad groups of atoms can be included in the summation. These forms use empirical
constants or some relative potential measures. Energy functions span from simple relative
potentials like the Lennard-Jones to ﬁrst principles. Three energy functions will be de-
scribed. The ﬁrst is the most common in the work on clusters, Lennard-Jones. The second
is the Gong energy function used by Sastry [60] in his work on clusters. The ﬁnal model
is based on Hooke’s law. This model has terms for the pairs, triplets and quads of atoms
including the Lennard-Jones terms.

The popularity of the Lennard-Jones is based on not only the simple calculation, but
also due to the fact that distance data can be readily derived from candidate molecules.
The calculation has two empirical constants that only have a few common values. The
basic nature of the calculation will emphasize atomic interaction of close atoms and will

contribute little energy for atoms far apart.

25

a 5
”(733) = m — R;
z] z]

(2.1)

Equation 2.1 shows the basic Lennard-Jones equation with the indexes z’ and j repre-
senting a pair of atoms. The overall energy potential is a relative value based on the sum of
all the pairs. There are actually many different versions of this equation where the values
for a, ,8, and e have different constants. To be consistent with several of the minimum
energy tables for Lennard-Jones energies, the constants are assumed to have these values:
a = 1 , [3 = 2, and c = 1. Equation 2.2 shows the molecule’s total potential energy sum—
mation with the n,- = d), substitution. This substitution demonstrated that, given the set of
distances without knowing the atom pairing, it is possible to estimate the potential energy

of the molecule using a Lennard-Jones approximation.

V(d) = 462 [difﬁ- — 2%] (2.2)
k=0 k k

The Gong potential takes the energy calculation one step further, using bond angles as
shown in equation 2.3. The more detailed equation for 222 has terms similar to the Lennard-
Jones but with an exponential decay term added. The 2);; function has terms with the cosine
of the angles formed by the three adjacent atoms groups . Details of this energy function
can be found in the paper by Sastry [60]. One of several issues with the Gong potential is
that the ﬁrst term U2 is based on pairs of atoms while the second term 113 is based on triplets.
The additional pairing can increase the cost of calculation because the number of pairs is
on the order of O(N2), where as the number of triplets can be on the order of O ((3)

terms.

14...: Zv2(z',j) + Z v3(z',j.k) (2.3)

i< j i<j<lc
A more complex potential energy typically used in the area of proteins can be expressed

as Shown in equation 2.4. Comparing the potential equations shown in 2.4 and 2.2 for

26

empirical equilibrium values for bond lengths, bond angles, and torsion angles leaves only
the Lennard-Jones terms. At the limit, if the bond lengths near ()0, the bond angles near 00
and the torsion angles near 7, the equation reduces to the Lennard-Jones with a coulomb
charge force term added. In addition, all the coefﬁcients are derived from empirical studies

for a given atom’s interactions.

V'total = Z Kb(b(rij) _ b0)2

bonds

+ Z K9(6(r,-j) — 60)2

angles

1(45
+ E — 1+cos n r,~ —
dihedrals 2 [ [ ¢( 1k) 7]]

+ Z [;%—T£6+—gﬂj—] (2.4)

nonbonded ij ij 6,- 60 Tij

The potential energy function’s complexity can be increasing complex. Hooke’s law
approximations use many empirical coefﬁcients for bond length, bond angle, torsion angle
and coulomb forces with additional terms. At the most complex end of the scale, are energy
functions derived from ﬁrst principles and quantum physics. The tradeoff of complexity
versus gain in quality of information is always a consideration. Will increases in complexity

yield a result that explains some properties not described by the simpler models?

On the other end of the energy calculation complexity scale can be the simple hy-
drophobic model that adds {+1, -1, 0} based on the relationship between certain atom
groups [14]. Further, the conﬁguration of the atoms is assumed to lie on a lattice. This
can greatly simplify the calculation by using the ﬁxed distance between the lattice nodes
and ﬁxed force interaction. This type of calculation is more important as the size of the
molecule increases. Larger proteins have thousands of atoms and the more complex meth-

ods would require signiﬁcant storage space and calculational cost.

27

2.2.2 Diversity

All the methods presented, (both Genetic Algorithm based and Others) use some selection
method to control the mixture of candidate solutions. This is referred to as diversity of
the candidate pool. In the case of the three GA’s presented, the ﬁrst method of diversity
control is achieved by only keeping candidates that have some signiﬁcant difference in
potential energy. Considering the fact that the whole optimization process is concerned
with minimizing potential energy, these algorithms keep molecules with poorer potential
energy over better ones to insure this diversity. The key concept is that even though the
current energy of the molecule is not as good as others, some part of its structure may be
exceptionally good and needs to be kept so that it can mix with the other molecules during
the recombination process. In some cases, the diversity can be measured by far more unique
means such as linkage between parts of the molecule representation, or the projection of
the molecule structure onto two-dimensional surfaces.

Diversity is a key issue for the effectiveness of the search methods. In simple terms,
a larger and more diverse the population of candidate solutions, the more likely it will
contain key parts of the optimal solution molecule. Therefore, it improves the probability

the search outcome will produce good quality results.

2.2.3 Local Search

Another issue that all practical methods have in common is some means for conducting
local search on the molecule structure. It is well established that there are many local mini-
mums that local search will converge to. These minima can have a probability of discovery
based on the concave region of space near the minima. The larger the region the more
likely the algorithm will discover the nearby local minima. For some of the low-energy
conﬁgurations, the concave region near the current known minimum is extremely small
and therefore the chance of discovery is small. These local searches are, nevertheless, crit-

ical to the operation of the successful methods. There are a large variety of possible search

28

methods to use. Many of the current methods address the fact that there are major compu-
tational costs in the search process by adopting processes that minimize the computational
cost.

The number of local minima is large and has been estimated by Hoare as a function
of atom count to be on the order of e36'”"°3"2 [34]. The number of local minima has an
interesting impact on both the local search and on scaling. For small numbers of atoms the
number of minima is very reasonable to ﬁnd and search, but, as it grows, this number soon

becomes too large to work on effectively, at least using only pure global search methods.

2.2.4 Distance Values

The sources for distance values include a variety of information. The use of two passes of
NMR can actually provide some distinct atom labels, which can greatly reduce the difﬁculty
of the assignment problem. Other forms of experimental data only produce an implied set
of distances, the number of atoms and the expected types of bonds that are involved in
the molecular cluster. The range of data and processes used makes it difﬁcult to compare
speciﬁc data sets. The focus of this work is on the use of the data set once it has been
created from the experimental data. The goal of this work is to use an unlabeled, noisy set
of distances to determine the structure of the molecule.

For X-ray diffraction experiments, the actual distance data set is derived by application
of Fourier transforms to the X—ray scattering data. Conversion of the NMR data to distances
requires calculations and scaling of the measurements. All of these methods used to extract
distances from the data collected will introduce some errors. Some of the errors will be
based on the inability to distinguish overlapping peaks in the raw data. Others will be
introduced by variation in some of the experimental methods.

In the ideal case, the distances are perfect in magnitude and completely labeled with
atom pairs. Each distance element is shown to have the form did. The subscript labels

signify the distance between atom 2' and atom j. These distances could then be used to ﬁll

29

the Euclidian distance matrix D directly, because the row and column locations are derived
from the i, j labels. This means that every entry in dij = z is known to an acceptable
accuracy in magnitude and atom pair. Even with this data it is not easy to derive the
locations in space for atoms. Selection of the correct entries to use during construction is
critical to form a structure with minimum residual error. The selection of atom quads used
in the reconstruction process is important, as errors in the location of each atom added to

the model will degrade atom locations added later.

Further, degradation of this data can take two forms. One form has distance data with
mislabeled pairs of atoms. The other form has distances which are measured with some
arbitrary error. In both cases, the problem becomes how to correct the erroneous data

before structure construction can take place.

More degradation can include missing distance entries entirely. In this case, the ques-
tion becomes whether there is enough redundant information to ﬁll in the missing pieces of
information. The rank of the D matrix is typically ﬁve and this can represent a lot of redun-
dant information for molecules where n > 5. That is, for every atom only ﬁve distances
are required to correctly locate the atom with some accuracy. As few as four distances
could be used, but the addition of the ﬁfth allows for some correction of the errors in the
placement. This represents a minimum set of 572 values. A perfect D matrix has n(n —— 1)

useful values.

Another variation of the data involves having perfect distance data but no labels. Using
this set of data, the assignment of individual atom labels is required to ﬁnd the molecular
structure prior to locating the atoms. The Bucky ball molecular structure of C60 makes a
good example of this type of data distance information. The histogram representation of
ideal data for a Bucky ball is shown in Figure 2.4. The vertical axis is the count of atom

pairs having a given distance and the horizontal axis is distance between atoms.

This histogram and the accompanying target set of distances are perfect to within some

accuracy but not labeled with atom pairs. The shape of this histogram is clean, with atom

30

 

250

200-

 

 

 

distance

Figure 2.4: Ideal data histogram

pair counts in regular fractions of the sixty atoms in the molecule. That is, all counts are
some even mod 60 value or evenly divisible by the the number of atoms.

The ﬁnal kind of data to be examined is the data derived with no labels and with noisy
distance values. Either the experimental collection or the method used to generate the
distance data set introduced some amount of noise such that they will never match a valid
set of atom locations beyond a limited accuracy.

In comparison to the ideal data histogram, the actual experimental data for the Bucky
ball is shown in Figure 2.5. The bond counts are not regular fractions of sixty and the
various symmetries of counts have clearly been altered. This data and histogram were
derived from an x-ray diffraction experiment and the collected data is shown in Figure 2.6.
The vertical axis is measured in relative terms to the number of atom pairs that have a
particular distance. Known information about the atomic bonds can then be used to imply
the count of distances. The horizontal axis has been scaled to distance, but the actual
data measured can have several different measures. In order to demonstrate the correlation
between the peaks and the histogram, the Signal graph has been scaled in width to make the

correlation easier to see. Both the ideal histogram and the histogram from real data used

31

 

250 r I 17 ﬁ I l T

200 '

150*

counts

100‘

50-

 

 

 

distance

Figure 2.5: Histogram derived from experimental data

the ideal distances as center points for each histogram bar.

The basic deﬁnitions for the quality of data have been established. The difference in
quantity and quality of data affect the solution strategies applied to solve the molecular
conformation problem. For many solution methods will involve the placement of distance
in the D matrix and the adjustment of values. For the purposes of this work, the distances
are reduced to a simple one-dimensional vector in ascending order of magnitude and will

be referenced as the (1 vector. The lower case denotes the lack of the atom pair labels.

2.3 Current Work with Clusters Using GA

Three works using GA’s to ﬁnd low-energy clusters will be reviewed. The three works are
chosen for speciﬁc reasons. The ﬁrst work, by Deaven and Ho, introduces the recombi-
nation process that cuts two parent molecules by a plane and then combines the halves.
The second work, by Hartke, shows the enhancements to the GA’s process for diversity
and demonstrates scalability of the current algorithms. The third work, by Sastry, demon-

strates the contrast between the use of real versus binary encoding of the genes and how

32

 

 

 

 

 

 

 

 

 

Figure 2.6: Pair distance data implied counts by radius

the introduction of linkage learning was needed to provide solutions. This learning of link-
age can be thought of as a limited version of the grouping provided by the spatial concept

introduced in this work.

This review will include a comparison of how the three GA works are implemented.
They will be compared for seeding the initial population, population control, mutation
methods, local search, diversity control, objective functions and scalability. In each of
these sections, the differences between these three methods are described. Each of these
areas covers a fundamental area of operation for the algorithms. The idea of seeding with
some known set of molecules or close derivation will signiﬁcantly reduce the computa-
tional cost for the algorithm. The size and control of population has a linear effect on the
computational cost. The objective functions for all three GA’s are energy based but use
different models and still yield similar results. The biggest issue is the scalability of the re-
sults. While Deaven and Ho approach only solved a few molecules, Hartke provides results

for a wide range of clusters. Sastry solves only a few small molecules of speciﬁc interest,

33

but he does so with very low computational cost.

2.3.1 Deaven and Ho

The Deaven and Ho [13] paper was landmark in that it used a genetic algorithm to obtain
complex molecule structures. It employed energy functions as the ﬁtness objective and a
variety of innovative operators. An article in Nature by Maddox [46] described the work
by Deaven and Ho, and brought the application of genetic algorithms to molecule confor—
mation of clusters to the attention of many researchers. Despite these results, the basic
techniques which were so successful did not really become the standard style of GA for

many physicists working on clusters until after 1999 [27].

Seeding

The initial population was constructed using random placement of atoms in a ﬁxed volume
of space such that no two atoms are ”too” close together or ”too” far apart. Atoms with too
much separation had little or no signiﬁcant interaction. Arbitrary selection of these limits
can be made by solving the energy function for minimum and maximum values in terms of

radius.

Population

The model for population control of the genetic algorithm developed by Deaven and Ho
was a mixture of diversity control, permutations and then the classic ([1 + A) selection
[3]. In the original paper [13], they start with u molecules, then create permutations with
A = (,u) (u — 1) total offspring. Each of these offspring is relaxed with a local search method
and then the total population of (u + A) is reduced back to (u) for the next generation.
The selection process requires a certain energy difference between each of the molecules.
In the ﬁrst few generations, diversity is not an issue because there was a wide range of

energies. After only a few generations, the molecules converge to good low energy levels

34

and the differences were much smaller. Requiring a minimum difference in energy helped
to keep the population more diverse. At least some small difference in energy level implies
a different atom conﬁguration.

What is unusual when comparing Deaven and Ho’s method to normal genetic algo-
rithms is that the size of the population it only ranged from four to Sixteen molecules. The
total number of Offspring ranged from 16 to 256. These were smaller in size than a typical
GA population, which can be on the order of thousands of molecules that are needed to

provide good results in only two to three thousand generations.

Mutation

Based on a probabilistic selection, a new member of the population is created by mutating
an original individual and replacing it in the population. Determination of which individual
is replaced is based on criteria such as energy difference. The mutation process applies
many random displacements to a number of atoms and then subjects the whole molecule
to a relaxation process by which a minimum distance between atoms is maintained. In
Deaven and Ho’s work another type of mutation is achieved with an inverse directed local

search which will be described at the end of the next section.

Local Search

The local search was rather unique in that it starts with a gradient vector in the direction of
the local minimum, but the vector is rotated around a normal to the energy surface to point
in the opposite direction. While this local search is normally used to ﬁnd the local minimum
it was also applied like a mutation operator. For example, rotation of the gradient vector
180 degrees from the direction of descent in essence moves the molecule’s atoms to the
most rapid ascent away from the minima. Doing this movement in the direction opposite
from the current local minimum some number of steps will force the molecule into a region

of an adjacent local minimum. The normal form of the local search was applied to every

35

new offspring, while the redirected form is applied on a probabilistic basis to only a few
percent of selected parents. This algorithm was iterative and the typical search applied to

each offspring was thirty iterations.

Crossover

The crossover operator used was based on a random dividing plane through the center of
mass for each molecule. The plane’s location was adjusted until approximately half of the
atoms resided on each side of the plane. Using two parents, and dividing each of them,
produces two sets of pairs that can then be recombined to form offspring. As described in
the population section, a single offspring was generated from this crossover operator for
each permutation of parents.

An example of this crossover operator is shown in ﬁgure 2.7. The two halves are shown
separated before being combined. As can be seen from the ﬁgure, when the two halves
are combined several atoms will be too close. If they are at least some minimum distance

apart, then the local search will relax the conﬁguration to some acceptable energy level.

Diversity

Diversity of the population was maintained by the mutation rate and also in the selection
process for inclusion in the )1 population. In Deaven and Ho’s work, some difference in
energy was chosen so that all individuals have some slightly different energy level. While
this insured diversity, it did not insure the right distribution of diversity to provide effective
populations. The mutation operators for both the anti-local search version and the random

movement of atoms will insure that some diversity is also maintained.

Selection

Deaven and Ho used all possible permutations of parents to generate offspring. The major

selection process was used with the diversity controls for inclusion from the total popula-

36

 

 

Figure 2.7: Left and Right hemisphere combined in crossover operation

tion of u+ A individuals into the 11 set of parents for the next generation. The basic selection
mechanism was to rank parents and offspring by energy values and then start selecting the

lowest energy molecules for inclusion, subject to the difference in energy constraint.

Objective Function

The Lennard-Jones energy function, as shown in equation 2.5, was used for this search
by Deaven and Ho. The (1, ﬂ, 6 of equation 2.2 have been replaced by the appropriate

constants .

"‘ 1 2
V(d) = Z [(1—12 — it] (2.5)

k=0

37

Performance

In the original Deaven and Ho work of 1995, they took three thousand generations to ﬁnd
the Bucky ball molecule. The Bucky ball molecule is named after Buckminster Fuller, who
was the inventor of the geodesic dome as well as many other unique styles of structures.
The Bucky ball has the same basic structure as a simple geodesic dome and the modern
soccer ball. This ball has 60 atoms arranged with 20 hexagons and 12 pentagons and was
created from carbon atoms vaporized and injected into an inert gas. Richard E. Smalley,
Robert F. Curl Jr. and Harold Kroto discovered the carbon Bucky ball while working on so-
lar chemistry research. This discovery led to the 1996 Nobel prize for the three researchers.
The applications of the Bucky ball structure include nano-structures, superconductors, and

optics.

Evaluations = p(p — 1) =1: 30 * generations (2.6)

Given p = 8 and generations = 3000 for the Bucky ball with sixty atoms scales approxi-

mately as 0(n3'76).

2.3.2 Hartke

The Hartke work builds on the Deaven and Ho method in several ways. A geometrically
based measure was used to improve the diversity. The local search method was improved
using a Newton-style iteration that will, in theory, take fewer steps. Also, the mutation was
well deﬁned and more extensive than hopping to the next region of the minimum. Using
all these improvements, his work found all the minimal energy states for molecules up to

250 atoms with a performance that scaled very competitively.

38

Seeding

In his latest work, Hartke used a set of molecules based on known low-energy conﬁgura-
tions and then randomly added atoms to make the full atom count. It should be noted that
seeding can really improve the ability to ﬁnd solutions for a great many of the low-energy
conﬁgurations which have similar geometries. Seeding can also cause signiﬁcant problems
for those conﬁgurations that are markedly different. Hartke used many of the Deaven and
Ho operators, with small improvements, to give a more robust algorithm. That algorithm
was used to ﬁnd the energy minima for molecules from a few atoms to several hundred, in-
cluding many known hard-to-reproduce low-energy states at certain sizes. One of the main
differences between Deaven and Ho, and Hartke, is the way that diversity in the population

was achieved. Hartke introduced a geometry measure rather than an energy difference.

Population and Diversity

Hartke’s typical population was ﬁve to ﬁfteen molecules. The initial population was con-
structed from known good geometries that yielded low energy conﬁgurations. Then atoms
were added until the molecule was complete. Hartke’s work uses the same (a + A) pop-
ulation methods as the Deaven and Ho work, but with both the energy and geometry con-
straints for diversity. Just like in the Deaven and Ho work, all permutations of the parents
were used to create an offspring. The next generation of parents was chosen using ranking
constrained by the energy and geometry difference requirements.

Hartke [27] took the concept of diversity control based on energy and added some fun-
damental geometry measures to enforce an even more diverse population. In this measure-
ment, the molecule was rotated so that a two-dimensional projection had the fewest number
of distinct atoms in the projection. The two-dimensional projection is then arbitrarily di-
vided by a ﬁxed grid of forty squares scaled to the diameter of the projection. A count of
squares with at least one projected point was then taken. Using the difference in counts

as a measure of similarity, the population diversity was improved. Figure 2.8 shows the

39

 

Figure 2.8: xy projection of n=104 molecule [27]

projection of a molecule with 104 atoms. Using this method, it is possible to keep several

unique low—energy conﬁgurations with signiﬁcantly different geometries in the population.

Mutation

Hartke’s mutation operator randomly selected one to forty percent of the atoms to move
and also selected a random magnitude between a Speciﬁc minimum distance and up to
the diameter of the cluster for each move. The mutation was applied to ﬁfteen percent of
the offspring produced by crossover and prior to the local search. So large a magnitude
of movement has the effect of completely relocating an atom in the conﬁguration. The
balance between the disruption of good locations against exploration seems a little extreme

towards the exploration.

Local Search

The local search used by Hartke was a quasi-Newton method developed by Liu and Nocedal
[45]. This particular method has been shown to have speedup factors in the range of 1.6
to 4.5 compared to other high-dimensionality local search methods. The local search was
applied to offspring and parents alike after any changes due to mutation, crossover, and
initialization.

40

Crossover

Hartke employed an improvement to the Deaven and Ho method of crossover. It used the
dividing plane and then evaluated the potential energy for each half of the molecule before
recombination. Further, instead of randomly selecting a plane, this crossover method ﬁnds
a plane that produced a half a molecule with lower potential energy. Doing this about ﬁfty
percent of the time improved performance. What was not clear in this work was how the
advantageous plane was found. Clearly, rotating a dividing plane around looking for low

energy halves of a molecule would require an intensive computational cost.

Selection

The selection process was basically rank elite. It was based on rank of the total population
of parents plus offspring. The constraints of geometric diversity and energy differences
were then applied to the selection process. In this way, a parent population for the next
generation should have included potentially new and beneﬁcial partial conﬁgurations even

if the ﬁtness of the individual was not competitive with the current best in the population.

Objective Function

This is the same energy function that was used by Deaven and Ho. The Lennard-Jones
potential was calculated from all the distances between atom pairs. Once the atom locations
were found and after the local search, the energy was calculated along with the geometric

diversity projection onto two dimensions.

Performance

The performance of Hartke’s algorithm was very good. It not only found all the lowest
energy conﬁgurations up to 150 atoms but does so in sealed performance with a time of

approximately 0(n3). Based on the results shown graphically, there was large deviation on

41

the time values for larger molecules. Some of these were noted as known difﬁcult conﬁg-
urations because of a shift from the normal low.energy pattern to that of a Face Centered
Cubic with truncation. It should be noted that other methods, like Monte Carlo with basin
hopping, perform better for small molecules and certain known hard conﬁgurations can

take an order of magnitude more time than the scaling prediction.

2.3.3 Sastry

In the previous two works, the encoding of the molecules used real-values or phenotypical
encoding. In contrast, some GA methods use binary encoding of the genes to represent
atom locations. TWO issues that are highlighted by the Sastry work are the reporting of
performance and the use of binary encoding. The work by Gregurick et a1. [24] is often
cited for its performance scaling as 0(n4'5) for an example of the genotypical encoding
with binary values. This work by Sastry [60] used a small ﬁve-bit genotypical binary
encoding. It described performance on the same order of Hartke, if the cost of developing
the seeds was included. The cost of producing the seed molecules must be included so
that a fair comparison can be made. Further enhancements to the basic genetic algorithm
included using statistic distributions of the population and minimum descriptive length to
partition building blocks and maintain diversity. This work also used local search with
the Extended Compact Genetic Algorithm (ECGA) to ﬁnd the low energy conﬁgurations.
The scaling claims included only the number of function evaluations and not the cost for
evaluating the ﬁtness function or producing the seed molecules. Additional information
will be provided that will make comparison between the previous two algorithms and this

one, using advanced GA methods such as linkage learning, possible.

Seeding

The seeding used the (n — 1)-atom low energy molecule conﬁguration as a seed, then

added a randomly placed atom to make the full molecule conﬁguration. This required that

42

all the low energy clusters, up to the size currently being attempted, had previously been
found. While some algorithms will make use of known good general conﬁgurations for

low energy, this requires the n — 1 conﬁguration to start.

Population

A population from twenty to one hundred molecules was used in this algorithm. The size
of the population required for a high probability of a successful run was stated as 0(n'83).
Examination of the results shows that, for molecules with twenty atoms, they used popu-
lations on the order of one hundred molecules. This population was much larger than the
twelve predicted by the claim. The population was ﬁxed and some percentage of the popu-
lation was replaced every generation. Taking two data points from the results reported for
molecules with ten atoms and twenty atoms the following prediction of population scaling
was calculated to be 0(n2'9) which is much larger than the claim. The exact data points
are twenty members of the population for ten atom molecules and 149 members for twenty

atom molecules.

Generational Outline

This algorithm was different from the basic simple genetic algorithm in that it used two
distinct tools to measure the proﬁle of the population. These tools were then used to re-
place some percentage of the population each generation. Figure 2.9 shows the outline as
described in the paper by Sastry. Each generation N (1 — PC) individuals were selected
by tournaments to be included as parents for the next generation. Then, using the selected
individuals as the parents, offspring were created with crossover operations to ﬁll the re-
maining positions in the population. PC was the probability of crossover. The marginal
product models (MPM) and minimum descriptive lengths (MDL) are used to guide the
operation of the GA.

43

 

Initialize Clusters
Loop
Perform Nelder-Mead simplex search
Evaluate potential energy of each cluster
If converged then stop
Perform tournament selection
Build MPM using minimum descriptive length
Create new clusters using MPM
Replace percentage of old clusters
Repeat loop

 

 

 

Figure 2.9: GA algorithm outline for Sastry [60]
Local Search

The Nelder-Mead local search was used by Iwamatsu [40] and also used by Sastry [60].
This method uses a form of a generalized triangulation to bound the area to advance toward.
The triangulation local search for new atom locations used thirteen data points to develop
a boundary to evaluate for determination of the direction to move. This direction was
either toward the interior of the bounded region or exterior depending on the value of the
objective function. At each step, a new boundary point was constructed from the existing
points. For a graphical description of this method, the web site by Mathews [47] has a good

explanation.

Crossover

Crossover was applied to the selected parents from the population, but was guided by the
statistics developed for both the Minimum Descriptive Length (MDL) and Marginal Prod-
uct Models (MPM). This process limits the disruption on the building blocks or sub-clusters
of atoms. Once the rest of the population was ﬁlled in, the whole generational process
Started again. The parent selection used a high-pressure tournament with as many as four
to sixteen participants. When the population was small, the poorer molecules were not

chosen for crossover often.

Selection

Tournament selection was used on the total population to reduce the population to some set
of parents that would then be used for crossover operations. The number of individuals in
each tournament was eight without replacement. Large tournament sizes represents a very
strong selection process. In this work, the typical Pc = .8 was used. This means that the
selection process only kept 20% of the population and replaced 80% with new molecules

each generation.

Objective Function

The Gong potential took the energy calculation one step further, using bond angles as
shown in equation 2.7. The more detailed equation for 222 has terms similar to the Lennard-
Jones but with an exponential decay term added. The 223 function has terms with the cosine
of the angles formed by the three adjacent atoms groups. Details of this energy function

can be found in the paper by Sastry [60].

14...: 2212031) + Z v3(z'.j.k) (2.7)

i<j i< j < In
Performance

In order to compare the results of Hartke to Sastry, one must do a summation on the cost
of building the seeds used by Sastry. Equation 2.8 shows this sum. Using the claim for
evaluation scaling and doing the summation yields a total cost of 0(n3'18) for the same
sizes of molecule structures that Hartke used. These claims only holds true if the population
size claims made by Sastry are used. If the population increases more rapidly as the result
graphs indicate rather than the claim 0(n0'83) , this performance will actually degrade
signiﬁcantly to on the order of 0(n4) or more and be much worse than the work by Hartke.

It is not clear how the cost of the local search was included in the cost estimate provided.

45

Evaluation = 2 i2'45 (2.8)

i=1
The performance represented in Sastry’s work was for smaller molecules. It is well
established that molecules greater than twenty atoms represent a much more challenging
problem. It is still possible to compare the Hartke and Sastry results on a limited scale as

shown.

2.4 Non-GA Solution Methods

A variety of methods compete with the genetic algorithms in their ability to solve the con-
formation problems. Simulated Annealing (SA), Monte Carlo (MC), and Molecular Dy-
namics (MD) are just some of the terms used to describe these methods. When reviewing
the current papers in this area, it is clear that the most successful results often involve a
combination or variation of these methods, and often some aspect of genetic algorithms as

well, with generation of candidates.

2.4.1 Simulated Annealing

The Boltzmann equation is at the core of the methods that use SA. The equation is used to
determine the probability of an event based on a simulated temperature. In principle, the
temperature is high in the beginning and large magnitude events will occur more frequently.
As the temperature is reduced, the likelihood of the large magnitude events will decrease.
Equation 2.9 shows the probability of an event based on the current temperature T and the
energy of the event E. The variable I: is the Boltzmann constant. The space is explored by

the high-temperature period. As the temperature drops local minima are explored.

P(E) = e‘E/kT (2.9)

46

Typical of the simulated annealing work is the paper by Hsu et al. [37]. In this paper,
the determination of structure from x-ray scattering data is done for a variety of molecules.
They start with the paired distance data objective function and combine that with simu-
lated annealing to generate molecules to compare with the experimental data. David et a1.
[12] used the same type of strategy. This work derives the crystal structures from power
diffraction data using a combination of methods. Both of these works used simulated an-
nealing to modify the proposed structures with a temperature schedule. Over many cycles
of modiﬁcation, these temperature schedules yielded candidate molecules with good ob-

jective functions.

The use of NMR-based data for simulated annealing was demonstrated in the paper by
Bassolino-Klimas et al. [4]. In this paper, several issues were highlighted about the source
and types of data used to control the process and the combination of source data with
domain knowledge and energy functions. The paper also gives a detailed list of various
reconstruction programs, primarily for proteins and organic molecules, but also used for
clusters. These programs include EMBED, DG, DG-II, DISMAN, DINA, X-PLOR and
CONGEN. One major issue was the use of bond angles and bond distances to limit the
ranges that Simulated Annealing can use. Further, the energy functions employed were

more complex than the Lennard-Jones type as described in equation 2.4.

The paper by Lee et al. [44] describes a form of conformational space annealing that is a
merger of three methods. Hartke refers to this work as a form of Tabu search [23], but it has
aspects of Simulated Annealing, genetic algorithms and Monte Carlo. Just like the fact that
evolutionary algorithms often use mixtures of local search, random events and complex-
ity evaluations, these methods identiﬁed as Simulated Annealing use different techniques.
These include the basic Boltzman probability evaluation for simulated temperatures. The
Lee paper uses the Lennard-Jones potential to ﬁnd minimum energy clusters of up to 201
atoms involving a typical mix of strategies. The algorithm starts with a population of ﬁfty

molecules. Each of these new molecules is subjected to a local search. Once the local

47

search is completed, the next set of ﬁfty is selected by randomly picking twenty from the
old set and creating some random new molecules and applying local search. The process
now starts an iterative loop, picking molecules from either the old set of ﬁfty or the new set
and combining them. This combination is then subjected to an energy minimization local
search. The process also exhibits diversity control and competition to become the seeds for
each new group. Further, if there is stagnation, the process injects a new group of random
seeds. This work was able to reproduce all of the low-energy conﬁgurations for n g 201

atoms.

2.4.2 Monte Carlo and Basin Hopping

Monte Carlo and Basin Hopping continue to be viable methods for ﬁnding minimum energy
conﬁgurations of clusters. These methods often use a combination of Monte Carlo, local
conjugate gradient search and selection to identify a direction of travel to improve the
effectiveness. The paper by Wales and Doye [65] [66] demonstrates an effective method
for ﬁnding clusters up to 110 atoms. In this method, the program starts with a Polak-
Ribiere variant of a conjugate gradient search to ﬁnd the minimum energy of the current
conﬁguration. Next, a set of 5000 Monte Carlo moves is used to map the energy landscape.
The threshold for energy was monitored, along with smaller Monte Carlo steps. When a
better basin was found, the method hopped to the new basin and started the search over
again. It should be noted that these clusters do not start at a random conﬁguration, but
utilize known good conﬁgurations for clusters with one less and one more atom. TWO
important points to remember about this work by Doye are that the Lennard-Jones energy
was calculated with different assumed coefﬁcients and the seeded starting conﬁgurations

were known good structures.

48

2.4.3 Statistical Directed

Potential energy surface mapping and search are also used to ﬁnd the energy minimums.
The paper by Chaudhury et al. [9] describes in detail a method using some aspects of
genetic algorithms and gradient search to locate the energy minimum by mapping the po-
tential energy surface (PES). Another approach used statistical mapping of the potential
energy surface to focus the search area. The paper by Phillip, Rosen and Dill [55] took
samplings of the minimum over a broad range of the potential energy surface. Then, using
averaging and statistical methods, it mapped a trend surface and then used this to guide the

search to more promising areas.

A lot of work has dealt with the energy landscape and methods to alter this landscape
to make the minimum energy points easier to ﬁnd. This landscape, even for simple two-
dimensional variations, is extremely complex. Not only are there an exponential number of
minima, but the energy barriers are large. Finding the global minimum in such a landscape
is difﬁcult. This landscape also highlights the general average shape of the Potential Energy
Surface (PES). In this case it has a general convex shape. In the work by Phillips et al. [55]
the searching strategy was directed by the statistical sampling of the PBS to an area where
the global minimum was more likely to be located, such as the center concave part of the

PES.

There is no shortage of methods and variations currently used for problems determining
molecular conformation. Classiﬁcation of the methods, however, is nearly impossible be-
cause all of them combine a variety of operators to become effective. It was clearly stated
in several of the papers that pure SA or pure GA was not an effective means to solving the
problem and all the papers described in this section use combinations of methods. While
this section has made an effort to describe some of the competitive methods, it is difﬁcult

to classify them strictly as SA, Monte Carlo or any of the other single term descriptions

available.

49

2.5 Spatial Concepts in EC

One of the simplest methods to incorporate the spatial relationships between genes or pos-
sible linkage is to arrange the genes in some fashion. The most likely arrangements can be
two-dimensional arrays or three-dimensional arrays. The paper by Bui [8] makes the case
for a ﬁxed number of crossover points in each dimension of the gene arrangements. This
concept was formalized to introduce the idea of schema with the probability of survival
of crossover. One issue that comes from having ﬁxed arrangements and ﬁxed numbers of
crossover points in each dimension occurs when the spatial location of the gene moves,
but does not change position in the ﬁxed arrangement. Bui reported results for both a
Very Large Scale Integration (VLSI) arrangement and a graph partitioning problem. These
suggest the use of the ﬁxed arrangement did improve the performance of the algorithm.

In a paper by Yoon [67], one idea that was presented was the synergy of multiple
crossovers when applied to a Traveling Sales Person (TSP) problems. However, one of
the methods used in the case studies was based on 2D geographic crossover or natural
crossover. When compared directly with uniform, ﬁve point and cycle crossover, it was
always the second or ﬁrst most effective on all the trials. When combined with other
crossover operators, it was always included in the best combination. The geographic
crossover inclusion actually makes good sense. The other recombination methods over-
come the problem of good genes mixed with bad by picking smaller sections of each chro—
mosome. At the same time the power of the spatial crossover, especially in the early gen-
erations, is clear. The work by Yoon provides graphs that show the effect of the various
crossover combinations as a function of generation. They demonstrate how the crossovers
are complementary and transition over generations. Yoon’s paper showed that it was difﬁ-
cult to pick a mixture of operators for the graph partition problems which consistently did
well. All the combinations that did well included the geographic crossover operator.

In M -dimensional classiﬁcation and a variety of two-dimensional space problems, rules

are used to divide the space into clusters or sets, in genetic algorithms. The paper by Saraﬁs

50

[59] uses a system of rules that are developed with the two-dimensional partitioning. The
interesting part about this paper is that the recombination operators and mutation were
entirely based on spatial concepts. It’s general concept focused on the division of the genes

into two partitions or more.

2.5.1 Gene Arrangement

In the survey of crossover operators by Seo and Moon [63], the impact of the arrangements
were discussed with the generalization of the linkage being a full Kn-style graph of all
the genes. It is interesting to note that the Saraﬁs-style crossover was not included in the
survey. This paper also describes the possible general partitions of a two-dimensional space
and, in ﬁgures 2.10, 2.11 and 2.12, shows typical possible partitions of such a space. These
partitions are ﬁxed arrangements of the genes, but the crossover is limited by certain rules.
For example, in Figure 2.12, the crossover can be deﬁned as selecting a division between
rows or columns for single-point crossover. More generally, the crossover can choose N
divisions between rows and columns. The case where the genes are arranged in a linear
fashion is the most common, but ﬁgure 2.13 shows them in a circle. The crossover points
can be limited to lines that pass through the center of the circle and break the genes into
two or more sets.

Figure 2.10 shows the maximum possible Single connections between genes. In general,
any set of loci can be divided from any other set for the purposes of crossover. The entire
super set of all edge combinations is the same as the total schema for binary encoding
except that the combination of genes can be more complex, with each gene having its
own coding of the material, much like biological DNA. These linkages between genes are
represented by the lines connecting the genes shown as circles.

For Figure 2.14, the crossover points can divide entire planes of loci or multiple
crossover points can select sub-planes in arbitrary sizes. The nature of all these ﬁxed ar-

rangement spatial crossovers only preserves the small grouping of loci if they happen to

51

r ‘7 . ‘
rat!!! .
mm.

4*
Qp

i7

irl
mm‘ Q
1‘“.

 

 

 

Figure 2.10: K N graph arrangement of genes

OOOOOOOO

Figure 2.11: Row arrangement of genes

be arranged in the same fashion as the solution vectors that inﬂuence the ﬁtness function.
When the arrangement corresponds to the location of genes in space this turns out to be
advantageous for some set of problems.

The major point is that some means for dividing the chromosome into two sets that
preserves the spatial relationship of the vectors will be less disruptive. These two sets then
identify the portions from each parent to use during crossover. Figure 2.15 shows a simple
division of the genes by a rectangular space. This rectangle can be expanded to be an
arbitrary closed polygon or a solid in three dimensions. The black dots represent the vector
values of individual genes mapped to this bounded space. In this ﬁgure, a rectangle would
be applied to both parents and genes. Genes from outside the rectangle would be selected
from parent A and genes inside the rectangle from parent B would be used to create an
offspring. The number of ways to divide the space can be arbitrarily based on some set of

boundaries. In ﬁgure 2.16, multiple line segments are used to divide the genes and each

52

Figure 2.12: 2D matrix arrangement of genes

cocoooo
loco...

 

O D

Figure 2.13: Circle arrangement of genes

53

 

Figure 2.14: 3D matrix arrangement of genes
if“—

Figure 2.15: 2D division of genes

 

 

 

space is assigned as contributions from parent A or parent B. For example, the genes in the
shaded areas would be selected from parent A and unshaded areas from parent B.

A Voronoi diagram is often associated with classiﬁcation systems [17]. It could be the
ultimate individual division of the genes in space with each space assigned to contribute a
gene from parent A or B. In many respects, the concept of linkage learning does exactly
that. It divides individual genes by assigning them to one of several sets so that crossover
will preserve good combinations of genes. Figure 2.17 shows a Voronoi diagram with each
gene in its own space.

Kahng [42] introduced the term geographic crossover and used multiple discrete di-

54

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.16: 2D multiple division of genes

 

 

Figure 2.17: 2D multiple division of genes using Voronoi spaces

55

visions in both two-dimensional and three-dimensional arrangements of genes for experi-
ments. He also introduced the general concept of multiple boundaries and spaces based on

gene arrangement.

2.5.2 Multi-Deme

A large amount of literature has addressed the issues of multiple populations with genetic
algorithms. These have a variety of forms and issues. The spatial arrangement of the
populations is an aspect in the application of the multiple populations to GA. The literature
has many examples where the use of multiple populations with periodic exchange provides
better results than single populations. Tomassini’s [16] work ties some of the principles of
population exchange to the basic theory of panmictic populations. The overall operation
of the multiple populations is effected by the number of individuals exchanged, how often
they are exchanged, which populations exchange individuals and the replacement strategy
used when exchanges are made. Population exchange parameters are in addition to the
already numerous parameters for the GA operation used by each localized population.
One measure of performance is the takeover time for the distributed populations. Gia-
cobini et al. [21] show the interrelationship between multiple populations, exchange param-
eters, and takeover times. There are several interesting effects that the spatial conﬁguration
of exchange has on the operation of the multiple populations. The more restrictive the
number of exchange points and number of individuals exchanged, the greater the isola-
tion. Restriction of exchanges allows for more localized development of individuals. On
the other hand, if the populations are too isolated, great individuals are not produced ef-
ﬁciently or possibly not at all. The major effect that the spatial arrangement has is in the
fostering of better competition by isolating populations then exchanging a limited number
of good individuals to provide genetic diversity. Figure 2.18 shows a general conﬁguration
of six island populations and the lines show the migration paths among the islands. The

conﬁguration of the islands can take just about any logical form. The most common forms

56

represent the same type of arrangements as shown for the genes. They can be arranged in
a two-dimensional matrix as shown in Figure 2.12, circles as shown in 2.13 , pyramids in
hierarchial conﬁgurations, and just about any other conﬁguration. Even random conﬁgura-

tions have been used in some of the studies.

The area of multiple competitive levels proposed by Goodman and Hu [38] introduces
Hierarchal Fair Competition (HFC) for sustained GA operation, with a constant feed of
new genetic material from lower layers in the hierarchy. In general, it has been shown that
multiple populations with the same amount of individuals as single pandemic populations
normally perform better and the hierarchal structure gives away to channel good individ-
uals to increasingly tougher competition. Whether the populations are divided in a single
style of run, where the competition between individuals is stratiﬁed, or there are multiple
populations feeding a series of higher level populations, the results Show an improvement
in the solutions found. Figure 2.19 shows a simple three-layer competition structure. At
each layer when a member of a population reaches a ﬁtness threshold it can then be trans-
ferred to the hierarchal layers above. These semi-ﬁnal populations then repeat the process
transferring good individuals that meet the threshold to the ﬁnal population layer which
normally consist of only one population. At each population, the normal selection process
will allow for fair competition among individuals with similar levels of ﬁtness and will
produce a better chance of diverse results. At any population, a dominant individual will

have the tendency to prevent other good individuals from developing.

All of these methods rely on reasonable choices for sizes, number of populations and
exchange parameters. As demonstrated by Giacobini when populations have sufﬁcient
interconnections (like small world graphs) with frequent interchanges, the systems will
behave just like a panmictic population [21]. A small world graph has a small number of
connections at each node. The diameter of the graph is small in terms of maximum number

of connections that must be traversed between any two nodes of the graph.
Similarly, the HFC approach also requires proper selection of the exchange parameters

57

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.18: Island populations for genetic algorithms

Finals

Semi Semi Semi
Final A Final B Final C

 

 

 

 

 

 

 

 

 

Prelim 1 Prelim 2 Prelim 3 Prelim 4 Prelim 5 Prelim 6

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

Figure 2.19: Multiple population multiple layer competition

and number of layers or demes. In addition, the strategy for the ﬁtness values to be dis-
tributed among the layers must be developed. There are many ways to dynamically adjust
the layers as competition progress on the HFC runs. Hu and Goodman current version
of the HFC process is called Quick Hierarchal Fair Competition(QHFC)[39] and it has

automated the assignment of the parameters dynamically as part of the overall process.

58

2.6 Local Search

Nearly all the current algorithms that are used to search for molecular clusters use some
form of local search. This form can vary based on a number of parameters. In the original
work by Deaven and Ho [13] a modiﬁed conjugate gradient was used. For the Hartke
[28] work on the clusters under 200 atoms a quasi-Newton search was used. In both cases
mentioned, they applied a limited number of iterations during each generation. This partial

convergence of the local search can play an interesting role in the operation of the GA.

2.6.1 Quasi-Newton

One key point in the quasi-Newton methods is their ability to utilize multiple passes to
construct the equivalent of a full Hessian matrix a little at a time. In doing so, the overall
efﬁciency is markedly improved. This method is demonstrated by Gill [22] and it can be
clearly shown that most of the partial derivatives do not change substantially from iteration
to iteration. The question becomes how often a complete generation of the matrix must be
done and what events are signiﬁcant enough to require the extra work. An interesting prob-
lem associated with the low-energy cluster search is that once the molecule conﬁguration
is nearly perfect, the ability to ﬁne tune the positions is difﬁcult. This is because the energy
curves are nearly ﬂat. With little slope to work with, the searches for those last couple of
decimal points are slow. Several of the search methods, once a very nearly optimal. solution

is reached, invoke other search algorithms based on higher-order derivatives.

2.6.2 Nelder-Mead

This method uses a form of N -dimensiona1 triangulation to ﬁnd a local minimum of several
variables. A good source for a graphic representation in two dimensions is provided on the
web and in a book by Mathews [47] [48]. A series of triangles are constructed that bound

the local optima [51]. Each successive triangle is the result of adding a single new vertex

59

to the existing series of triangles. In doing so, a single new function evaluation is required
for each step. This search method is efﬁcient in that only one new energy evaluation is
required for each iteration. The location of the new point to evaluate is simply a projection
perpendicular to the last two points. In hyperspace, like this example, the concept of normal
to the two existing points is complex. It is still a small cost of calculation as compared to

evaluating many partial derivatives.

2.6.3 Constructive

In the Dong and Wu work [15], the constructive approach uses some key facts to add atoms
one at a time to the solution space. The approach starts with four atoms chosen for their
relationship in three-dimensional space and then adds atoms until complete. In the addition
process, there are twelve possible sets of equations that can be used to solve the placement
of the new atom. Some of these twelve will provide a much better calculation of the location
than others. By evaluation of the twelve possible sets of equations, it is possible to choose
a set to use that will provide the most accurate solution. One important item to note here
is that the distances are labeled with the atoms involved. In the event that there are no
labels, the problem is considerably more difﬁcult. The claims of computational cost for
this method are on the order of 0(N). The key point is that they are, in fact, labeled and as

such the question is more of pure construction than one of search.

2.6.4 Gradient

A simple gradient is based on the impact to the objective function by each variable. For
molecular conformation, these variables are the coordinates of the atoms. The process
Starts by ﬁnding the derivative for each atom and each coordinate. Once the derivatives for
each axis at each atom location have been found, a selection process is used. This selection
process identiﬁes the x, y, or z coordinate associated with a speciﬁc atom that can reduce

the objective function the most. The process then moves the atom along that coordinate

6O

until a minimum is reached. The whole process repeats until no more improvement is
possible. While this may seem primitive, it is possible to implement with ﬁnite difference
methods in cases where the slope cannot be easily found in analytical terms. In both this
method and conjugate gradient, the step size is based on the rate of change and must be
conservative to insure stability. Both methods will iterate until the rate of improvement is

below some threshold.

2.6.5 Conjugate Gradient

The conjugate gradient [33] is an improvement over standard gradient in that it ﬁnds the
composite direction and magnitude to move an atom. This atom motion effectively moves
along the potential energy surface. After each iteration, a new vector is determined and the
atom is moved in the correct direction. In each step, the atom with the greatest potential to
improve the objective function is selected. In analytical terms, if it is possible to develop
the functions then the conjugate gradient vector can be found by simple evaluation of the
function. As with the gradient method, if the analytical equations are not convenient, ﬁnite
differences are used.

In both of the last methods, complete evaluation is not necessary if partial contributions
are kept for each of the contributing terms. Only the contribution of the atom moved needs
to be reevaluated. Some methods even short cut this, and use estimates for the overall

contributions.

2.7 Representation of Clusters / Molecules

This section will describe some of the most common methods for describing the structure
of molecules. While it is necessary at some point in all evaluations of structure to produce
the actual location of the molecule there are many advantages to other representations. The

basic representation of location will be described. These sets of locations produce the Eu-

61

clidian distance matrix which can have a couple of different styles. Producing the Euclidian
distances for each designated pair of atoms provides the distances with labels. In contrast,
typical data provided by experiments will, generally, not have any labels associating atoms
to distances. It is possible, in some cases, to provide partial labeling but it is not feasible to

identify labels for all the distances.

The use of other systems for describing structures for proteins and crystals is easier
because of the repetitive nature of the molecules or the domain knowledge, including se-
quence, substructures and bond length. The basic concept of representation of the molecu-
lar structure of proteins will be described. The last part of this section will include a brief

description of the tessellation systems that can be used to describe regular crystals.

2.7.1 Locations

For the work on clusters, most representations use the locations of atoms in space in refer-
ence to some arbitrary center. The location of each atom is described by three real-values
in an 2:, y, z coordinate system. The location of a speciﬁc atom is designated as p, as shown
in equation 2.10. The collection of all atoms in a molecule is referred to as P which is
described in equation 2.11. The Euclidean distance matrix is comprised of an N x N ma-
trix of distances between the atom pairs. The deﬁnition of each element entry is Shown in
equation 2.12. It should be noted that distance is the magnitude of the vector created when
the two points in space are subtracted. Other forms of the Euclidean distance matrix use

the distance squared as elements.

Pi = [51313 yr, Zr] (2.10)

62

P1
P2
P: (2.11)
Pn—I
Pn

 

 

2.7 .2 Euclidian Distance Matrix

3 . I i— '
di,j = 5'é .7 '1’ 193' (2.12)

2:]:0

Finally the full distance matrix is described by D matrix which is shown in equation
2.13. This matrix has some Speciﬁc properties which will help with the solution strategies.
First, it has redundant information. The distance from atom x to atom y is the same as from
atom y to atom x. Second is that the rank for this matrix is R g M + 2 where M is the
dimension of the space (in this case assumed to be three). The typical assumption is that

R = 5 [10].

 

0 611,2 "' dl,n—1 dl,n
612,1 0 ' ° ° d2,n—l d2,n
D: (2.13)
(in—1,1 dn—l,2 ' ' ' 0 dn—1,n
L dn,1 dn,2 ' ' ' dn,n-l 0 d

 

2.7 .3 Distance Labels

In some of the algorithms described, a sorted set of the distances will be used as the objec-
tive functions for optimization. This set of sorted distances will be referred to as d and is

Shown in equation 2.14. The set is constrained by equation 2.16.

63

d = (2-14)

 

 

t: (2.15)

tm—l

ltm

 

 

de, ZSksmldk_1Sdk (2.16)

While the distances in D matrix have atom labels, those in d are sorted in order and have
lost the direct association with speciﬁc atoms. This is denoted by the change in sufﬁx to
d), instead of d,j. The lower case m refers to the total number of distances for a given 17.
atoms as m 2: §(n — 1). Experimental data is given as a set of distances. These distances
are often referred to as the target set. The data shown in Equation 2.15 matrix is sorted in

ascending order just like the (1 matrix.

2.7 .4 Angles and Sequence

In the case of many organic-style molecules, the basic sequence of the backbone is known.
This information, along with the side chains, makes up a protein molecule. For these
molecules, it is possible to provide a more compact representation with fewer degrees of
freedom. There are several competing standards for protein molecule representation by

data ﬁles.

64

Based on the domain knowledge associated with this type of molecule, several repre-
sentations are possible beyond the atom location format described in the previous sections.
Domain knowledge available for proteins, or DNA, can include standard groups of atoms
that have a known conﬁguration. These groups then are combined together to build pro-
teins and larger conﬁgurations like DNA. Using this domain knowledge, the representation
of the molecular conformation will take an alternative form. Similarly, the study of crys-
tals relies heavily on the use of tessellations. Therefore, the representation in these studies

makes use of standard tessellation parameters.

When trying to determine the structure of proteins, the basic chemical composition
of the central chain of atoms is known. What is not known is the rotational angles for
all the side chains and Speciﬁc connecting points along the central chain of atoms. The
problem representation becomes this collection of angles, with constraints that match the
experimental data. Figure 2.20 and 2.21 shows a basic group of atoms whose structure is
known and the combination of two such groups showing the angles in a two-dimensional
representation. Each variable angle like those labeled 1,0 and 4p in the Figure 2.21 will need
to be solved for a given conﬁguration in three dimensions. The diagrams are based on a

work by Neumaier [52] on protein structure and folding.

In Figure 2.20, the R letter actually designates an amino acid with several additional
atoms. While there are roughly twenty common amino acids, there are also many other
known conﬁgurations that are found in proteins. The conﬁguration of the R group is nor-
mally known. In Figure 2.21, the combination of the two backbone components is shown
with two unspeciﬁed R groups attached. The key point here is that the carbon atom des-
ignated with the a has two bonds labeled with \II and (I). These bonds can rotate and form
the basis for protein folding. While the bonds can rotate freely, they are limited by the
interference from the R groups and other parts of the protein molecule. The various parts
of the protein interact with repulsion and attraction of the various polarized atom groups.

This is the basis for many of the potential energy calculations on proteins.

65

Figure 2.20: Typical group of atoms

\ \/H
NiC H
/ a\"’ / o
H C—No
/ \ ,. //
O Ca—C
/| \
R H

Figure 2.21: Two combined groups with dihedral angles

66

The actual variable part of the protein representation can be reduced to these two angles
of rotation and a table of known relative atom locations for each distinct amino acid that
comprise the sequence of the protein. Using the table and rotational angles is the most
simplistic of possible representations for the protein. It reduces the degrees of freedom
for a protein with thousands of atoms to a few hundred angles. While it would seem that
this reduction would make ﬁnding the conformation easier the number of constraints and

interactions between all the Side chains still makes ﬁnding solutions extremely difﬁcult.

2.7.5 Crystals

The representation of crystals is based on three-dimensional tessellation. Figure 2.22 shows
an example of an underlying structure for a crystal. This structure is part of a lattice, as
Shown in Figure 2.23. There are fourteen basic lattice structures and many underlying
structures. The representation of these crystals can be accomplished using the parameters
for the underlying structure and the deﬁnition of the lattice. A typical lattice deﬁnition con-
tains a translation and rotation component. Figure 2.22 shows the general triclinic shape.
Each side of the shape can have a different length and each angle shown as a, 5 and 7 can
be different. In some of the shapes, the angles are equal or 90 degrees and the length can be
the same. In Figure 2.22, the lengths of the sides are shown as a, b and c. These variations
make up the basic family of crystal foundation. There are other crystal geometries but they
are rare and more complex. In addition to the basic shape, the shape can be repeatedly
stacked to form larger complex forms called lattices. Figure 2.23 shows how the basic
shape can be stacked to form the lattices. The crystals have atoms at each of the vertices
shown in the ﬁgures. Atoms are symbolized by the black circles. There are variations that
include atoms in the center of the shape, at the center of the face and also the center of the
sides.

Crystals formed around these types of lattices produce very regular X-ray diffraction

patterns and libraries of the patterns with known shapes exist to help identify the structure

67

 

 

 

 

 

 

Figure 2.22: Crystal base structure

 

Figure 2.23: Crystal lattice structure

68

of molecules. Because of the tessellation and the regular pattern of the lattices, the encoding
of crystals can often be reduced to just a few variables for the angles, lengths, and rotations
found in the lattice of real samples. When searching for solutions to conformations of
molecules, the number of variables is small, especially compared to proteins and large

clusters.

2.8 Geometric Algebra

Havel [29] describes the ﬁeld of geometric algebra as it applies to chemical applications
in a distinct fashion. This description includes some history of the ﬁeld along with several
theories that are used in many of the approaches to the molecular conformation and espe-
cially the study of proteins. The theories are summarized here in the following subsections;
the rank of the distance matrix, vector operations, constraints, determinants and solution
methods. Each of the these areas is used to either aid in the correction of experimental data

or in the assignment of distances to atom pairs.

2.8.1 Rank

For a given molecule the number of parameters needed to specify the molecule are 3N
where N is the number of atoms. Two molecules are identical if, by translation and rotation,
they can be resolved to within some small amount of positional error. These rotations and
translations can be considered six additional degrees of freedom. It is common to refer
to the parameter total as 3N + 6. A fully ﬁlled Euclidean distance matrix for N atoms
has a total of N 2 parameters and is clearly over-speciﬁed. First, the upper triangle and
lower triangle are identical. Then, ignoring the diagonal which by deﬁnition is all zeros,
the matrix has Egg—111 entries. A theorem from Chu’s work [10] states that the rank for a
N -atom matrix is bounded by R 2 m + 2 where m is the number of dimensions. In this

case, m = 3 so the rank is no less than m + 2 and is often equal to m + 2 or a value of ﬁve.

69

2.8.2 Vector Operations

The product of three vectors is outlined and used for many of the analyses in distance
geometry problems. In this treatment, the classical inner product and the outer product are
covered for an arbitrary orthogonal basis set of vectors. Extending these vector operators to
two sets of three vectors produces determinants call Gramians. All of the vector operators
have utility in working with the distance geometry problems. Some of the uses include
the triple product for a signed volume, dihedral angles and quatemion operations used to
validate entries for location and distances. The signed volumes have signiﬁcance for work
with molecular conformation in several ways. They are linearly dependent and satisfy a
linear relationship among ﬁve points. Several of the properties are invariant. For example,
the Euclidian distance matrix will not change for any translation and rotation of the original

atom locations.

2.8.3 Constraints

By far the most common constraint used in assignment of the distances to atom pairs is
based on the triangle inequality. Given two distances between three atoms, an upper and
lower bound on the third distance can be derived using the triangle inequality. While the
bounds are not necessarily tight, they can signiﬁcantly reduce the possible assignments of
a given set of distances, to an atom pair. In cases with lots of symmetries, the assignment
probabilities are reduced and often the selection from available distances is reduced to just

three or four possible choices.

2.8.4 Cayley-Menger Determinants

Many of the properties associated with the distance matrix are invariant. The key point here
is that the hyper-volume spanned by four inter-point vectors is always zero. The hyper-

volume can be describe in terms of the Cayley-Menger determinant. With this information,

70

it is possible to repair some of the marginal data from the experimental results. Further,
the determinants of all higher-order sets vanish and can be used to reduce the possible

assignments.

2.8.5 Solution Methods

The work by Crippen and Havel [11] uses an eigenvector approach. The basic method
starts by ﬁnding three such vectors and then using them to locate a subset of the atoms. In
terms of calculational cost, this requires on the order of 0(n2) to 0(n3) operations. This
work also made a preparation pass through the data, which attempted to repair any ﬂaws in
the data. Flaws can included both missing data and distance errors. Failure to improve the
quality of the data would cause serious ﬂaws in the solutions found.

Hendrickson addressed this issue in a series of papers and his dissertation [31] [30].
In these works, his primary focus was on the generation of the missing data. Using an
objective function for optimization based on the squared error between candidate solutions
and the distance known, he was able to ﬁll the missing data to some degree of accuracy.
Beger Kleiberg and Leighton [5] built on this method with a system that is capable of
handling arbitrary errors and can reconstruct the matrix in 0(nlog(n)) time. This method

starts with poorly labeled distance data, errors in distances and missing data points.

2.9 Protein Conformation

Determination of the structure of proteins is closely related to the conformation of clusters.
Clusters are typically much smaller than proteins but there is a lot of information known
about the make up of the protein, such as the sequences of amino acids. The representation
will include ﬁxed bonds, bond lengths and back bone sequence. The representation can be
reduced to some set of folding angles. The bending of the known ﬂat molecule structure at

each of the «[2 and (1) bond points is commonly called folding or protein folding. Even with

71

all this domain knowledge, the problem is still nearly intractable, especially considering
the limitations on the data sources.

Just like with automation of the process for ﬁnding cluster structure, there is a great
amount of effort being applied to automated structure calculations of proteins. Several
automated or semi-automated methods are currently in popular use. They include ASNO,
SANE, NOAH, ARIA, CANDID, and KNOWNOE [25]. Many of these are focused on
the data from NMR experiments. The CANDID method is described by Guntert [25] and
is reviewed here to provide background for comparison of the methods used for proteins
versus the molecular clusters. Many of the automated systems are actually comprised of
several distinct steps and often some form of iteration. Several of the methods employ
GA’s, simulated annealing and various other objective functions that are optimized in some

way to ﬁnd the conformation.

2.9.1 Preprocessing

As described in Section 2.2.4, the experimental data either from X-ray diffraction or from
NMR has to be processed in some fashion to produce the distance information. When work-
ing with proteins and NMR data, it is often possible to identify some subset of distances and
speciﬁc atoms. The goal of preprocessing the N MR data is to reﬁne that process to produce
some number of viable candidate conﬁgurations. Nuclear Overhauser effect (NOESY) is
used with two-dimensional NMR to gain more information about the atom locations. This
is actually accomplished using a series of one-dimensional NMR experiments. Chemical
shift is also critical in the preprocessing step. Because many of the resonant frequencies are
associated with the hydrogen bonds, the variations in the bonds can be speciﬁcally identi-
ﬁed as belonging to a limited set of hydrogen atoms in the molecule. Identiﬁcation of some
atoms once again reduces the possible assignment of the distances to a smaller group of

atoms.

In the preprocessing step, the method uses as much domain knowledge as possible to

72

reduce the assignments to a manageable subset of atoms. This is referred to as partially
labeled data and may use probabilities to assign the distances based on a knowledge base
of known conﬁgurations.

Another issue is the accuracy of the experimental data. There is some uncertainty with
the assignments and the value of the distance. In contrast the x-ray diffraction is able
to determine distances with much more precision but with few hints on the assignments
to speciﬁc atoms. Until very recently this preprocessing step with NMR data has been
interactive and iterative in nature. The current automated process will repeatedly use the

preprocessing steps in the search methods.

2.9.2 Objective Functions

There are several aspects to the objective functions used in the protein conformation. Sev-
eral of them are based on some form of lowest energy. Objective functions can be as
simple as a quantized +, - summation of the locations that are hydrophobic pairs or full
Lennard-Jones potential energy. Still, other methods count interferences of the side chains
and deviation from NMR simulations on the candidate structures. Several of the methods
have introduced multiple objectives to try and decrease the uncertainty of the molecular

conformations.

2.9.3 Search Process

In most of the search methods, some population of molecules is used. These molecules
which have been folded in such a way that some of the distances from the preprocessing
steps are assigned and conform to the shape. The folding is based only on the rotational
angles of each free bond rotation needed to derive the shape and location of each atom.
Some of the geometric properties are then applied to the vectors that represent the atom
locations. The locations are reﬁned to minimize the residuals associated with the various

geometric properties. Small subgroups of atoms are evaluated and the total residual of all

73

the groups are summed. In addition, molecular dynamics, such as energy interactions, are
also used to perturb the atom locations.

The current process of ﬁnding protein structure relies on domain knowledge to assign
distances to speciﬁc atoms. From these partial assignments, an iterative process of as-
signments and residual evaluations occurs until some minimum residual is satisﬁed. The
distance assignments are partial and limited to atoms with known interactions that can be
detected in the NMR data. The residuals include candidate molecule deviation from de-
tectable distances, geometric algebra properties, energy and known substructures. Even
though the goal for both the cluster conformation and the protein conformation are similar,

the tools vary signiﬁcantly.

2.10 Summary

This chapter started out with a review of three current works addressing molecular clusters.
The works include the original introduction of spatial crossover by Deaven and Ho, the
Hartke reﬁnement of the algorithm with improved diversity control, and work by Sastry
that utilizes linkage learning. The Hartke work is the current standard for performance and
scaling of algorithms that seeks to optimize the energy function. While the work by Sastry
makes better scaling claims, these are unsupported by experimental evidence. The small
molecules used in his work are known to be easier to solve than larger ones.

Several other methods currently used in the molecular cluster area that utilize non-GA
methods for energy optimization are also described. It is important to note that, while the
methods have distinct names, there are often a variety of algorithms used in each. Many of
them border on evolutionary computational concepts. Combinations of the key elements of
these methods are very similar. For example, cooling cycles used in simulated annealing

are applied to parameters for some of the other methods.

The next section provided an overview of the evolutionary computing ﬁeld with speciﬁc

74

attention to the genetic algorithms and the parameters that are important to performance.
Included is a section on the use of spatial operators in evolutionary computing. These are
mostly concerned with multiple populations or graph-style problems. A few of the works
describe a form of spatial crossover, but, always combine this with other operators.

A section on the various methods for local search describes several of the main algo-
rithms used. There are many different local search methods beyond the basics described
here. The Sastry work uses a unique method of triangulation. The important point is that for
GA’s that are addressing complex real problems, the local search is deﬁnitely an advantage.

The molecules can be represented in several fashions. Even though, at some point, they
must be resolved to atom locations, there are several alternative forms. A presentation of
the straight-forward atom location representation is provided. Some of the alternatives used
in crystals and proteins are described brieﬂy.

The section on Geometric Algebra is provided to establish some of the mathematical
techniques that are used to describe algorithms and residuals in protein conformation stud-
ies. While proteins are not directly addressed by this research, it is a prominent ﬁeld for
molecular conformation and is used for comparison of the methods. This is especially true
for the labeled versus unlabeled distance assignments.

The last section describes some of the automation applied to current protein confor-
mation. The main point here is that, while these methods used large amounts of domain
knowledge, they are more data driven from experimental results than the search for low-
energy clusters. Much of the automation work is not only with the search but centers on

the assignment of distances to atom pairs.

75

Chapter 3

Theory

Holland’s introduction of schemata [36] provided a theoretical foundation for the operation
of genetic algorithms. Implicit in this foundation is that, for high ﬁtness chromosomes,
individual genes are linked. These links are expressed by the compound dependencies
between genes that lead to high ﬁtness values. In many applications, the genes used in a
genetic algorithm represent points in space. This chapter will introduce the theory for gene
linking based on the interactions between genes that are adjacent to each other in space.
The proposed GA operators for mutation and crossover, based on spatial concepts, will be
shown to reduce the disruption of high ﬁtness chromosomes. Further, the operators will be
shown to more effectively explore the search space and promote the construction of good
genes. Integrated in the GA algorithm is a local search based entirely on the location of the
genes in space. This search has an important impact on the overall operation of the GA.
Its integral nature makes it difﬁcult to isolate the effect that local search has on the overall

operation of the GA. In practical problems, the use of local search is often beneﬁcial.

Our approach will be as follows. First we will describe the convergence of a gener-
alized local search to some minimum. Second we will Show how mutation and crossover
move from local minimum to local minimum in an efﬁcient and directed manner. This

chapter will use the concept of hyperspace volume to describe the convergence to a spe-

76

ciﬁc minimum. Further, a deﬁnition of building blocks as subgraphs will be introduced as
N closely connected points. These subgraphs can be thought of as schemata and the count-
ing of these subgraphs forms the basis for a prediction of spatial operator effectiveness. We
will show that dividing the space into regions with a single plane improves GA crossover
operations. The concept of division is generalized for multiple planes in hyper-volumes.
These theories will then be used to predict population size, convergence rates and diversity
for the GA applied to the molecular conformation. Although the focus is on cluster confor-
mation, the concepts and theory are applicable to many problems with homogenous vector

representation and strong local interactions.

3.1 Class of Problems

This theory targets the representation of continuous-valued genes in three-dimensional
spaces. However, the theory has been formulated for application to M-dimensional spaces.
Three-dimensional spaces are the work-horse of many practical problems that need to be
solved and are the focus of this work. The basis for the molecular conformation problem
is: given a set of inter-atom distances, ﬁnd a set of atom locations that is consistent with
the distances provided. In this case the ﬁtness function is some measure that describes
the consistency between a set of atom locations and the inter-atom distance set. For other
problems, the ﬁtness function will use some subset of these distances as a main compo-
nent. For example, a design for strength of a mechanical part will be highly dependent on

the dimensions and locations of key points in the design that are close together.

3.1.1 Homogeneous Vectors

In this class of problems, each gene, or some signiﬁcant part of the ﬁtness function, is
described by a set of homogeneous vectors. These problems will have N vectors, each

of dimensionality M. Each of the N vectors represent a point in space 171 . . . pN. The

77

ﬁtness functions will, in general have the following form, as shown in equation 3.1. The
proximity of the points deﬁnes an individual point’s contribution to the ﬁtness function.
This relationship is described by equation 3.2. What this means is that for any two points
that are close together will contribute more signiﬁcantly to the ﬁtness of the problem than

two points that are farther from each other.

N—l N
Fitness = Z Z f(pk,pj) (3-1)
k=1j=k+l
lpk — pjl < lpx — pyl -> lf(pk.pj)| > |f(pz.py)l (3.2)

All the vectors must have an identical number of elements. However, all elements need
not be identical. This creates a problem in that the concept of distance is altered. Elements
with incompatible units are difﬁcult to combine in a single measure of distance. It would
be possible to use Mahalanobis distances for elements with different units that have known

correlations .

3.1.2 Genes

For true biological systems the relationship between speciﬁc genes and functions are often
complicated. For the class of problems that spatial operators work best on, a gene will be
more phenotypical. In molecular conformation, the genes are atom locations. These loca-
tions have some constraints. A molecule, for example, has a minimum distance between
atoms that which can be derived from the experimental data. Similarly, the maximum
distance can also be determined. All the atoms must be connected by at least one bond
within some range of distances. For many other practical problems that Spatial operators
can address, there are similar sets of constraints. These constraints can be applied at var-
ious stages of the GA algorithm to improve the probability of good ﬁtness and solutions.

Equation 3.3 shows the gene as a point in space that is made of an M-dimensional vector

78

of values as, 6 31

gene,- = p,- = {m1,-,x2,~, . . .mM,} (3.3)

3.1.3 Fitness Function

There are two types of distance criterion functions. The ﬁrst is shown in equation 3.4 and
is based on the sorted list of distances between atoms from the candidate molecule and
the unlabeled target distances. The second function is more common in the literature. It
is based on the differences between entries in the D matrix of the molecule candidate and
the target information Do matrix. This target information is shown in equation 3.5 with
elements designated as tij. Several variations of the criterion function based on distance
exist. The most common variation takes the square root of the summed function. It should
be noted that in equation 3.5, atom labels are implied by the values of i, j, and in equation

3.4, labels are not available.

Objd 2' Z“); - (IA-)2 (3.4)
1:21
n—l n
ObjD = Z Z (tij — dij)2 (3.5)
i=1 j=i+l

As described in Section 2.2.1 there are many energy-based ﬁtness functions. For the
purposes of this work only distance-based ﬁtness functions will be used. The theory of
ﬁtness function contribution is based on the Euclidean distance measure between two points
as represented by the homogenous vectors. There are a large variety of ﬁtness functions
that could be included, but in this work they are excluded to Show the effectiveness of the

spatial operators.

79

3.2 Local Search

The deﬁnition of the local search is based on the deﬁnition of the local minimum. Schwefel
[62] provides a useful deﬁnition of the local minimum. It is adapted to the objective func-
tion used in this work as shown in equation 3.6 and 3.7. The form of the local minimum
is very similar to that of the deﬁnition that two molecules are equal. TWO molecules are
deﬁned as equal if it is possible to rotate and translate them such that they are aligned so
that the distances between corresponding atom centers summed is below some limit. The
local minimum deﬁnition of equivalent molecules is stated as follows: Applying the local
search operator to two molecules will align the individual atoms such that the summed dis—
tances between corresponding atoms will be less than some limit. In many respects, any
two molecules that are in some region of space will converge to the same local minimum

and therefore, are considered equivalent.

F(P) g F(P) (3.6)
N

e > Z In — ml (3.7)
i=1

Under what condition can the two molecules be different and still converge to the same
local minimum? From both an experimental and geometric perspective, the molecules will
converge to the same local minimum if the positions of the atoms are within some range of
the ideal location as deﬁned for a speciﬁc local minimum. If the topology of the molecule
is deﬁned as each atom’s relationship to its closest neighbors, then any position such that
the relationships of the local minimum and the current state do not change will deﬁne the
local attractive basin for a given local search and local minimum.

From a pure geometry point of view, as long as the location of each current atom is
within some minimum of the ideal location for that local minimum, the topology will not

change when the atoms move to the ideal locations. In the case of a straight line arrange—

80

ment of atoms, any one atom can move up to the bond length and not change the topology.
Any two atoms can move up to one half the bond length and not change. This sets an upper
bound on the possible space around each atom that will converge to the same local mini-
mum. For the purposes of the local search based entirely on the distances between atoms,
some factor can be deﬁned such that, as long as the atoms are within some radius deﬁned
by the factor, it will converge to the local minimum.

The factor of 6 reduction in equation 3.10 is intended to insure the ability of the local
search to converge to the solution. That is, as long as all the locations are within some
small sphere deﬁned by 1%, (the minimum distance to another point), then the topology of

the graph will not change and the local search should be able to ﬁnd the local minimum.

3.2.1 Volume

The probability of locating an atom in the correct locations during initialization can be
estimated using volumes. If the placement of an atom is random within some bounded vol-
ume and this placement has some tolerance associated with it, comparing the total volume
to the volume of a valid placement will estimate the probability. Experimental data from
initialization methods has been collected and the related statistics provides an estimate for
the value of the molecule diameter. Equation 3.8 shows this volume as an expression for
a Sphere with some maximum radius. For M-dimensional spaces, a hyper-cube of volume
described by equation 3.9 can be used.

4
VOLtotal = " 71.123

3 max

(3.8)

VOLtotal = Riga; (3.9)

The factor of 6 reduction in equation 3.10 is intended to insure the ability of the local
search to converge to the solution, as well as to produce high quality building blocks. The

actual R4,,” will not necessarily be the same as the max distance in the distance data set.

81

The random initialization methods can produce larger or smaller volumes and this must be

recognized in the creation of building blocks.

The local search will have the tendency to heal crossover-disrupted building blocks
and also combine small building blocks into better quality ones. Equation 3.10 shows the
local volume that an atom can reside in and still be moved by the local search to the local

minimum position.

 

3
L00,,, = -:—1r (R?) (3.10)

The probability of having an atom within this volume is given by the ratio of the target
volume divided by the average volume per atom. Equation 3.11 shows the average volume
as a function of the number of atoms and the total volume of the molecule. The ratio is the

probability as shown in equation 3.12.

 

VOL 0 a
AVGvol : —N—t_t_£ (3.11)
_ LOCvol
(r) _ AV C... (3.12)

The nature of the building blocks requires that multiple atoms be in the correct 10-
cations. The location of the ﬁrst atom is arbitrary because it starts as the anchor of the
subgraph. The probability of having the next atom in the subgraph in the right volume of
space is given by the volume of viable possible locations for the second atom. The place-
ment of the third atom now has a much smaller restricted volume to be a building block or

valid subgraph.

Using a simplistic probability calculation, the ﬁrst atom has probability 1 and all the

others can be based on the ratio of the local volume to the average volume.

n—l

P..(n) = H Pa) (3.13)

l

82

This probability forms the base for predicting the number of population members
needed to have a rich enough set of building blocks to eventually ﬁnd a solution to the
molecule conformation. The calculation based on volume is a lower bound because the ini-
tialization process places atoms one bond length apart and this further restricts the volume

and increases the chance of creating a good building block.

3.2.2 Distance-Based Local Search

Vector summed local search calculates the magnitude and direction that each atom con-
tributes to the overall criterion functions. The criterion function produces a single scalar
term while the local search produces a vector term for each atom. The operation of the
local search can then either move a single atom, which has the greatest contribution, to
the error between the target and the current conﬁguration or it can move all the atoms in
the direction that will correct the error contributions. If the step size for each correction is
too large, the algorithm can become unstable. The correction factor must be based on the
total number of atom-to-atom interactions. The typical magnitude of each corrective step
is scaled down on the order of 0.3 to 0.01. These values were selected by experimentation
and also a simple analytical case of three atoms. In the case of three atoms, the stability for
all possible positions can be insured if the step size is less than 0.5. In a real conﬁguration,
there is some average contribution from each atom to the correction. Some of these error
contributions cancel one another and this leads to the range of values shown. The 6 will be
scaled by the total number of atom-to-atom distances.

Equation 3.15 shows how each contribution is calculated. For each atom pair, the dis—
tance difference between the target distance and the actual distance is calculated and then
scaled along the unit vector between the two atoms. While there is no mapping of the t),
target distances to an i, j, there is such a mapping for the molecule being examined. The
mapping is deﬁned in equation 3.16. In equation 3.17 the correction to the atom position

is shown. The 6 is the portion of the total error to move the atom. TWO forms of local

83

search are used: one that moves the atom with the largest magnitude of E,- and the other

that moves all atoms. The search converges when the sum of all the magnitudes of the E,-

 

is below some threshold.
6 = 315- (3.14)
E- = gm — d.) Ii): :2, (3.15)
de3i€(1...n),jE(1...n)|dk=-d,-,j (3.16)
17.: 1‘9. — 6F.- (3.17)

3.2.3 Convergence Criteria

When the total sum of all the magnitudes of the correction vectors gets small, then the local

search has converged.

N
c > Z |E,-| (3.18)
i=1

In the actual operation of the algorithm, full convergence is not required. The actual
processes Show evidence that a slower convergence spread over many generations of the
GA will actually be more effective. The basic process required the application of the local

search :1: times after each mutation, crossover or reproduction.

3.2.4 Iteration Limit

The number of iterations will depend on the step factor used in applying the correction to

each atom’s location. The stability of this search is increased if smaller steps are taken,

84

but the number of iterations increases. The compromise is that it will require more com-
putational effort to iterate repeatedly. It is possible to estimate the maximum number of
iterations for convergence starting with equation 3.19. After c iterations of applying the 6
factor to the largest of the correction vectors, a ﬁnal correction vector whose magnitude is
less than 6 will be the outcome. If the process only corrects one vector, then it will take
more iterations. The assumption here is that the largest of the correction vectors is chosen

as the worst case.

6 > max(|E,-|)(5c (3.19)

Taking the logarithm and solving for iteration count c can be done as shown in equation

3.20 and 3.21. The ﬁnal form as a function of c is given in equation 3.22.

ln(e) > ln(max(|—E,-|)) + cln(6) (3.20)

ln(c) — ln(max(|E,—|)) < cln(6) (3.21)

111(6) -1n(maX(|E-|))
ln(6)

 

< c (3.22)

3.3 Spatial Concept

There are two spatial concepts involved in this work. The ﬁrst is the idea of a point in
hyperspace deﬁned by the entire set of N vectors of dimensionality M. The second is the
location deﬁned by each M-dimensional vector. The relationship between two M dimen-
sional locations is described as the proximity between the two locations. This proximity
is deﬁned as the standard Euclidean distance or distance squared between two genes, 10-

cations, atoms or points. The loctions are paritioned into K sets, by some collection of

85

boundaries. Equation 3.23 shows the union of all the space divided set and equation 3.24

Shows that each vector can belong to only one set.

P=P1UP2...UP;c (3.23)

There is no requirement that the sets be non-empty or that they hold equal numbers
of points. For the work on the molecular conformation K = 2, however the number of

locations in each set is nearly equal.

The second spatial concept is that of all N vectors as a single location in hyperspace.
For the case of the molecular conformation, the absolute location is not as important as the
space deﬁned by the distances between all the locations. That is, two sets of N locations
translated and rotated by arbitrary amounts are identical if the sets of distances derived

from those sets of locations are identical.

Take, for example, the chromosome deﬁned by N sets of real-valued coordinates of the
atoms. There are an inﬁnite number of sets of coordinates that will satisfy the distance
matrix. TWO molecules P1, P2 are considered equal if there is a rotation and translation
matrix that can be applied to one molecule such that the total summed difference in loca-
tions is below some value 6. Equation 3.25 shows the two molecules each represented by a
set of points in space. For the two molecules to be equivalent there must exist a rotation-
translation matrix such that after the operation shown in 3.26, the total distance between
corresponding point locations calculated in equation 3.27 is below some limit. Then the
two molecules are considered equal, as shown in equation 3.28. One additional step is
needed because the indices of of atom locations may not be mapped correctly. An addi—
tional mapping process to arrange the two sets of atoms so that the index will correspond

to the correct pairs of atoms must be completed.

86

P1, P2 (3.25)

3 M | P1 = MP1 (3.26)
N

6 > Z lpi,.- - p2,.-| (3.27)
i=1

3.3.1 Division of Space

The concept of building blocks needs some method of mathematical division for identifying
which atoms belong to each building block set. The crudest division is breaking the set
into two pieces. This can be accomplish with the basic crossover operator that Deaven and
Ho [13] introduced. It can be generalized to M -dimensional Spaces as shown, using the
deﬁnition of the normal to the hyperplane. The number of methods that can be used to
divide the space is large. It is possible to use any solid shape as the base for dividing the
space into subsets. It is possible to use any combinations of planes (three-dimensional) or

hyperplanes (M-dimensional) to divide the space.

By Single Hyperplane

Once a normal to the plane is created, determining which side of the plane the atoms are on
can be achieved by using the vector to the atom and the normal to the plane. The deﬁnition
of this normal is shown in equation 3.29. This normal to the plane passes through the center

of mass.

87

Vn = 0.1331 ‘1‘ 02.272 . . . aiim (3.29)

Determining which set a point belongs to can be accomplished with a dot product op-
erator. Let the set of points as described in equation 2.11 be divided into two sets. Assume
that the hyperplane includes a point (0,0, . . .0). A point is on the + (p) side of the plane if

the dot produce is positive. Otherwise, it is on the — (n) side of the plane.

This section introduces a spatial concept of building blocks. It formalizes the operation
of the division of the space by one or more hyperplanes. The idea of good building blocks
can be thought of as a collection of points in space whose locations are within some region,

and will produce good contributions to the criterion functions.

P = PpUPn (3.30)
VpiePIpioVnZOszPpUp, (3.31)
VngP|pioVn<0Pn=PnUp,~ (3.32)

Formula 3.30 was created in order to divide real space with a single plane and assign
points in that space to the two subspaces. When the boundary is made of multiple hy-
perplanes, each described by a center point and a normal passing though the center, then
determining if a point occupies that bounded region can be accomplished by repeated ap-
plication of the dot product for each bounding plane. The vector from each center point to
the atom (arbitrary point) and the hyperplane’s normal can be used for the dot product. If
the multiple hyperplane set of boundaries is convex, this produces an easy division. If the
planes form a concave shape, the orientation of the normal must be constant in terms of

a topological manifold’s surface. Equations 3.32 and 3.31 demonstrate the division of the

88

points into the two sets Pp and P".

The formulation of the division is general to M -dimensional points and is not limited
to 2D or 3D spaces. In terms of genetic algorithms, this is no different than two-point
crossover on a linear arrangement of genes. For two-point crossover, the linear arrangement
can be thought of as a continuous circle. The two-point method is the same as drawing a line
though the center of the circle that divides the chromosome into two parts with near equal
numbers of genes. There is, in general, no requirement that an equal numbers of genes
be from each parent. In the later stages of a typical genetic algorithm, it will be shown
that, if equal numbers of genes were a requirement, improvement of better chromosomes
would not be likely to occur because of the distribution of bad genes among the good genes.
Smaller partitions can be made by repeated use of more than one plane to divide the space.
This can be analogous to N-point crossover, but in a three-dimensional setting.

For example, when any atom is near enough to a good location it can be coded as a

9’1”

binary and those far away are coded as ”0”. Good schemata would have lots of ones
and the grouping of ones would provide a building block. If viewed as a graph, relationships
between all the good locations would look like a K n graph. Further, the order would be the
number of points in a good location, but the length would need to be measured in a new

way such as the diameter of ”1” nodes on the K n graph.

By Multiple Hyperplanes

Division of space can be expanded into multiple spaces in two ways. First is to continue
to ﬁnd orthogonal planes and divide the space into quarters, then eighths and so on. The
second method can use an arbitrary number of planes. Each volume of space is deﬁned by
a set of planes and their intersections. A speciﬁc space is bounded by a set of planes. The
results of all the dot product operator signs form a binary set of bits. Each pattern of bits
deﬁnes a distinct space volume. The Figure 3.1 shows a simpliﬁed example of multiple

plane division using atoms in a plane and a set of three lines. For each line there is a small

89

 

 

Figure 3.1: Example of multiple plane division

arrow indicating the positive side of the dot product.

Table 3.1 shows the binary patterns that determine in which subspace the individual
atoms are located. In this table the spaces are identiﬁed by the numerical representation
of 0 through 7 as interpreted from the binary pattern of L1, L2, and L3. For example, the
subspace P1 consist of atoms {0, D, E}. Subspace P0 = {F, G, H}. Some subspaces are
empty sets like P7 and P3. The label of the subspace set is based on the binary value of
the bits from the table. Each bit is set if the atom produces a positive dot product. A zero
indictes the dot product is a negative value. For the purposes of crossover each subspace

can be assigned to one of the parents.

For the case where the number of genes is ﬁxed, there is no expectation that the numbers
of genes produced in each half of the space will total the ﬁxed number. In these cases the
boundaries can be moved or placed in such a way as to provide a clear division with the
correct total number of genes. In the simple division of the molecule though the center
of mass by a single plane, the chances of ﬁnding a clean division by moving the plane

slightly is high and the crossover contribution from each parent is evenly divided. Several

90

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

atom L1 L2 L3
A 1 0 l
B 1 O l
C 0 0 1
D 0 O 1
E 0 0 1
F 0 0 O
G 0 0 0
H 0 0 0
I 0 1 0
J 1 1 0
K 1 l 0
L 1 0 O
M 1 1 0
N 1 1 O
O 1 l 0

 

 

 

 

 

 

Table 3.1: Multiple plane division of space

experiments with nonequal division were conducted and provided poor results as compared

to the equal division.

3.4 Spatial Building Blocks and Subgraphs

Building blocks and schemata are closely related in the theory of genetic algorithms. For
this work, building blocks will be a subset of the more general concept in that genes in
the chromosome will be linked by proximity to each other in space. For molecules, it
is easy to ﬁnd some range of distances that deﬁnes a bond and therefore a connection
between two atoms. A subgraph with K atoms that are all connected by bonds will be
considered a building block or schemata of order L. The abbreviation is given as 391,. The
fact that spatial operators are advantageous is supported by the analysis of the numbers
and locations of these subgraphs during the operation of the genetic algorithm. Just like
with basic building block theory, the construction of larger and more numerous 391, are
fundamental to ﬁnding good solutions for the problems addressed here. To assist in the

analysis of the 591,, a number of concepts must be reviewed. First is the role that proximity

91

plays between locations as it affects the ﬁtness functions. Next, for small molecules, it is
possible to enumerate the 391,. For larger molecules, the counting is more difﬁcult and use
of R regular graphs will provide a way for estimation of the sgL population and growth.
Several examples of actual sgL counting will be included for the N11, cube, and Bucky
ball. These are provided as a means to Show the application of the counting methods and
provide the estimates for 391, disruptions by crossover operators.

Several actual examples of simple molecules will be used to demonstrate the counting
of the subgraphs and document the disruption of subgraphs by crossover operators. In
the case of large molecules, the count of subgraphs can be estimated. The example used
here will be for regular graphs with R edges. This is appropriate for modeling for many
large molecules where the number of bends is in a limited range of three to six. The
Simple example explored will be the cube structure. The enumeration of all the subgraphs
will be made and the effects of crossover illustrated. For a more complex example, the
Lennard-Jones molecule N11 will also be examined. The ﬁnal example for actual counting
of subgraphs will be the Bucky ball that is often the example used to demonstrate the

effectiveness of the spatial operators.

3.4.1 Deﬁnition

A spatial building block or subgraph is deﬁned to be a collection of nodes or atoms that
are close enough to have a high degree of interaction. This collection will typically be
described by a limited set of edges or bonds that connect the atoms. The key element
is that the nodes or atoms are in close proximity to each other. There are several issues
associated with proximity. The interaction of points can be characterized in linear and non-
linear models. For the case of atom attraction and repulsion, the interactions can be highly
non-linear. Even in the example of the molecules with distance-based search, the quadratic
nature of the ﬁtness function can be signiﬁcant. In the event where points are arranged in

a single sequence, proximity is only based on the next few points in line. For the case of

92

three-dimensional points in space, the proximity is based on the closest other points.

3.4.2 Counting Spatial Subgraphs

There are two ways to count the number of subgraphs of a speciﬁc conﬁguration. Use the
number of nodes or atoms to describe the Size of the subgraph or alternately use the edges
or bonds to describe the Size. The cutting of a molecular structure by planes will break
the bonds. Therefore, developing the counts based on atoms will be more useful. The
requirement to be a building block is that all the nodes must be connected by at least one
edge. If a more general counting scheme is used where all possible edge combinations that
can connect N locations are used, the number of subgraphs counted can be even larger. A
given subgraph can be connected by several distinct sets of edges. In this work as long
as there is at least one set of edges that connect N locations, that set of N locations is
counted only once regardless of how many edge set may connect the N nodes. In the pure
binary chromosome representation, the number of schemata is easily constructed using
combinations. For this theory, the construction of subgraphs with the proper number of
nodes is the basis for counting. Further, the way in which the nodes are combined can
have several permutations of edges. For this work, a special program using recursion for
the graph representation was used to count the building blocks. For large graphs such as
the Bucky ball, some methods for estimation on a single node are used. For the smaller
molecules such as N11 Lennard-Jones, a complete count was made with the recursive

program.

3.4.3 Regular Graph Model

Large molecules can be approximated with R regular graphs. The interaction of the atoms
in space can be modeled as a graph with the atoms as nodes and edges as bonds. For other
models, the edges are selected by some set of closely proximate points. These models are

normally considered to be some regular graph with vertex order of 3 or a R3 graph. Other

93

1M" 22‘ < \ fix fr X)
o O 0 0O O O L; c) O Q C:

Figure 3.2: BB counting R3

graphs such as regular R4 and R5 may be considered, but many of the practical applications
can be covered by R3. The number of overlapping building blocks for a simple cube and a
Bucky ball will be demonstrated. For the Bucky ball, only building blocks on the order of

3 atoms to 5 atoms will be constructed.

1' = 0 : 1
bbcnt(r,p, k, f) = (3.33)
r aé 0 : 3L. bbcnt<r — iik +1.f)("’;’)

cmt = Z (7:) (3.34)

k=0
p . k = Q .
brnh( ) cm t (3 35)
Vi=0,m Vj=0,m I Z +j > m Pimprovement : Z Pbbih(z) X Pbbih (j) (336)
limN BMW...“ ~ 0.437 (3.37)

3.4.4 Cube

When dealing with a cube as shown in Figure 3.3 with labeled nodes, the goal is to enumer-
ate all the building blocks associated with the cube to demonstrate the concept of subgraph

counting for spatial applications. As is well documented, the number of points used in clas-

94

 

 

 

 

 

 

 

 

 

Figure 3.3: Cube schema R3

Sic crossover operators will increase the likelihood of disrupting good schemata [64]. In
the case of spatial crossover operators, this is also true. The equivalence of a single cutting
plane for Spatial crossover is similar to two-point classic crossover for a linear arrangement
of genes where the two points must divide the chromosome into equal halves. Given the
deﬁnition of subgraphs as building blocks or a type of schemata, it is instructive to compare
the disruption of those subgraphs by the two crossover methods. The N2 building blocks
consist of the edges in the cube. For the N3 building block, each vertex is at the center of
three unique N3’s, so there are a total of 24. It should be noted that in normal schemata
each combination of items would be valid. However, for this concept, the locations have
to be connected. This eliminates many of the possible building blocks. The full table of
building blocks enumerated by Size is shown in Table 3.2. This table also shows the number
of building blocks that are destroyed by a cutting plane. All the building blocks above N4
are destroyed. Even for this simple example of the cube, the reduction in the disruption of

the small subgraphs is signiﬁcantly reduced by the spatial crossover.

95

 

 

 

 

 

 

 

 

order Number Broken by Plane 2 Point

N2 12 4 10
N3 24 16 20
N4 38 36 35
N5 48 48 48
N6 28 28 28
N7 8

N8 1 l 1

 

 

 

 

 

 

Table 3.2: Building blocks for cube

3.4.5 N11 Lennard-J ones Molecule

Now that the general method for counting subgraphs for large molecules modeled as R reg-
ular graphs and a simple molecule arranged as a cube have been examined, it is appropriate
to explore a moderate-sized molecule. Two examples of the N11 Lennard-Jones molecule
are examined. For this molecule the size of the subgraph will be described by the number
of nodes it contains and then the total number of subgraphs of that size subgraph. In this
example, the size of each subgraph is shown in the ﬁrst column. The number of total sub-
graphs at each size is shown in the second column. The third and fourth columns Show the
number of subgraphs that are broken by a plane division and two-point crossover respec-
tively. In the Table 3.3, the number of smaller subgraphs that are disrupted is signiﬁcantly
less with the plane division than with two-point crossover. The number of subgraphs with
size less than ﬁve atoms or nodes Shows a reduced number of disruptions. As the size of
the subgraph grows, then just about any division in the chromosome is likely to disrupt the
subgraph.

When a molecule is examined after random creation and the application of the ﬁrst
local search, the number of good subgraphs is much higher than would be expected by pure
random placement. Table 3.4 shows the number of good subgraphs after initialization and
then after the ﬁrst local search is completed. While the local search does not signiﬁcantly
improve the number of small subgraphs, it does greatly increase the number of larger order

subgraphs. The subgraphs with seven atoms increase more than two-fold after the local

96

 

 

 

 

 

 

 

 

 

 

 

order Number Broken by Plane 2 Point
N2 31 12 28
N3 91 65 86
N4 215 195 210
N5 374 367 369
N6 438 437 435
N7 328 328 328
N8 165 165 165
N9 55 55 55
N10 11 11 11
N 11 1 l 1

 

 

 

 

 

 

Table 3.3: Building blocks for ideal N11 LJ

search is completed. The effect of the local search to repair subgraph disruptions is clearly

 

 

 

 

 

 

 

 

 

 

demonstrated.

order Number Random init after ls
N2 3 l 10 12
N3 91 11 16
N4 215 15 25
N5 374 20 37
N6 438 23 46
N7 328 l 8 44
N8 165 9 34
N9 55 2 20
N 10 1 1 0 8

N11 1 0 1

 

 

 

 

 

 

 

Table 3.4: Building blocks for N11 LJ random initialization and after local search

Random initialization makes the following assumptions. First, that each atom is con-
nected to at least one other atom. Second, these connections must be at least the minimum
distance in the target table. Using these two constraints and then placing atoms in a random
direction will produce valid molecules that have higher numbers of valid subgraphs than
would be expected with placing atoms at random locations in some bounding sphere.

Table 3.4 contains the actual building block counts of a typical member in the pop-
ulation after the random initialized molecule at the start of a run, along with that same

molecule after the application of the local search. It is clear that the initialization process

97

builds many more blocks than would be expected if true random locations were chosen. In
fact the initialization process requires that each atom is one minimum distance from another
and that no two atoms are too close together. Therefore, the worst case conﬁguration would
be a straight line of atoms. In fact, the process yields a typical conﬁguration as shown in

Figure 3.4.

 

Figure 3.4: Initialized molecule for GA run

Figure 3.5 shows the same molecule after the local search has been completed. Com—
parison of the two ﬁgures shows signiﬁcant positive impact that the local search has on the

molecules early in the GA run.

3.4.6 Bucky Ball

The Bucky ball is an example of the regular style of graph and the subgraph building blocks
of interest are on the order of ﬁve or fewer atoms. First, Figure 3.6 shows a single bond
between two atoms and all the subgraphs up to a size of ﬁve atoms. If the bond is broken

by the crossover operator, then all the subgraphs shown will be broken. Figure 3.7 shows

98

 

Figure 3.5: Initialized molecule for GA run after local search

a 2D representation of the Bucky ball and a simulated division down the center. In Figure
3.7 it is possible to see the 14 edges that are broken by the plane.
Table 3.5 shows the estimates for the number of subgraph disruptions for the plane

division and the two-point crossover. This table will be used as the basis for comparison of

the two crossover methods in the next section.

 

Table 3.5: Bucky building blocks

From this table, it is clear that, especially for the Bucky ball, the number of disrupted
subgraphs is much smaller for plane division than from two-point crossover. The reduction

in the disruption is on the order of 3 : 1 for the smaller subgraph sizes.

99

 

Figure 3.6: Bucky ball schema

3.5 Crossover

The previous section on counting subgraphs provides the foundation for the examination
of crossover based on spatial operators. Crossover has long been known to have detrimen-
tal effects on building blocks, as well as its positive effect in assembling larger ones. The
role of the crossover is two-fold in this application. First, it spreads good mutations and
good building blocks throughout the population. Second, it slices the molecule in such a
way that most of the good subgraphs are preserved and the bad ones are isolated. In the
simplest of perspectives, when the number of good building blocks is small, the odds of
constructing a better one are on the order of .25, assuming the good building blocks are
distributed unevenly among the four halves and that the two parents have roughly the same
numbers of building blocks. As the number of good subgraphs increases, the effective-
ness of the crossover will change dynamically. The larger the number of good subgraphs,
the more disruptive crossover will be, up to some limit. Then, an equilibrium state will
be reached. This state will have some constant probability of improvement. In standard
crossover operators where there is no provision for preserving spatial-based arrangements,

the disruption is even greater. The ﬁrst section will examine standard simple one-point

100

 

 

 

 

 

 

 

Figure 3.7: Bucky division shown in 2D

and two-point crossover impact on subgraphs. The next section will describe the spatial

crossover impact.

3.5.1 Standard Crossover

For a standard two-point crossover operator, there is one important question. What is the
probability of disrupting the building blocks? The following assumptions are made. The
location of the gene in a linear arrangement has no correlation to its spatial location relative
to any coordinate system. For one-, two- or N-point crossover, the probability that a given
gene of a building block will lie in the same half section of a molecule is roughly P(S) = .5
for each selection. This is based on the fact that a plane was used to divide the molecule
such that roughly half the atoms are on each side of the plane. Using the example of a
Bucky ball, the standard crossover will destroy forty-ﬁve of the ninety (N = 2)-order
building blocks. The standard crossover will also break the chromosome in two places.
The additional break will also lead to a slightly higher probability of disruption for the
building blocks. For N3, the factor will be .25. In this example 135 building blocks will
be destroyed. This ignores the additional effect of the two points breaking a possible set of

three genes. It is clear that, as the order progresses, virtually all the building blocks will be

101

destroyed by these methods of crossover. Each linear chain of linkages only corresponds
to 1 building block as described above. Only a small subset of all the interactions can be
preserved by linear linkages. For example, if the program learned twenty N3 linkages, then

the number of building blocks destroyed during crossover would be 160 or 88%.

3.5.2 Spatial Crossover

There are several ways in which the spatial crossover is superior to the standard crossover
methods. First, the number of building blocks preserved is much higher, especially for
small building blocks. Second, the nature of recombining two sets, where the absolute
reference can be completely arbitrary, requires some elasticity. In the spatial crossover,
proposed atoms or locations are added from the ”inside out”, with each addition checked
for minimum distance. This allows for few failed crossovers that violate the minimum
distance constraint. There is a ﬁnite likelihood that a crossover operation will fail. If this
occurs, it is treated as a reproduction selection with the best parent propagated to the next
generation.

The preservation of the subgraphs or building blocks for small orders of N is much
better for the spatial crossover. In the case of the Bucky ball, the number of N2 subgraphs
destroyed is only 14, as compared to 45 for the two-point crossover. As the building blocks
get larger, the number of subgraphs destroyed by crossover eventually reaches 100% for
both methods of crossover. Based on this, how is possible that any solutions are found?
A solution can stem from two different areas. First, the population can become much
less diverse, such that only molecules of very similar construction are taking part in the
crossover. In this case, the atoms are essentially in the correct locations and only those that
are different will be affected. That means all the subgraphs that are good in both molecules
will be good when the Spatial crossover or two-point crossover is conducted. For less
diverse populations, the disruptions are less. However it is much harder to search for the

solutions.

102

3.6 Mutation

In standard mutation, the magnitude and the quantity of genes altered to effectively explore
the solution space needs to be understood. For mutations that move the locations less
than ﬂight, the effect will be undone by the local search. Therefore, to be effective in the
exploration, the mutations must signiﬁcantly alter enough locations that the topology of
the interactions changes. TWO ways of accomplishing this are to create a large number of
large displacements or take a single location and relocate it relative to another point. Both
are needed. The ﬁrst style does a local search around the current minimum and the second

signiﬁcantly moves to the region of another minimum far away.

3.6.1 Simple Mutation

Mutation of the simple type can be deﬁned as described in Equation 3.38 where each r,- is
a random number between 0.0 S r,- g 1.0. This vector R represents the direction and
the magnitude to displace a single atom from its current location. In addition to the basic
movement of an individual atom, the mutations can all be repeated several times. If the
atom is chosen at random, it is possible to choose a single atom more than once in a given
mutation operation with X repetitions. There are three components to the simple mutation:
the random direction, the magnitude of the displacement of the atom moved and how many

times to repeat the mutation process on a given molecule.

Simple Mutation Magnitude and Direction

The distribution of the mag can be Gaussian around some mean or uniform random in
some range. The mutation as shown in 3.39 is simply the sum of the current location and

the random vector.

103

 

(3.38)

 

Pi = Pi + R (339)

The ﬁnal position of the atom is based on its original position. It is displaced in a random

direction by a random magnitude.

Simple Mutation Repetition

A repetition is deﬁned as one individual atom mutation. It was necessary to apply a simple
mutation multiple times to a given atom to create the needed exploration. In some works
cited [13][27] the number of mutations used was nearly half the number of atoms in the
molecule. This is very disruptive. The effect of the local search will return many of the
atoms to their original locations. Another added issue for exploration is when the same
atom is chosen multiple times in a given mutation operation. Even if the magnitude of
the displacement vector is small, the overall motion can create signiﬁcant changes in the
topology of the molecule. An interesting issue is that, while it will destroy more subgraphs,
it will also increase the chances of building a new needed subgraph. Repetition is also

applied to the relocation mutation.

3.6.2 Relocation Mutation

Mutation of a localized nature would still use a random vector but would choose an atom
to move and a new atom to move it near. The location near the new atom would will be
based on the random vector but with the mag = min(t,,). Where t), is the target data set
and it contains a minimum distance that any two atoms can be apart. Equation 3.40 shows

this mutation of the 2' atom moved to a location near the j atom.

104

For both types of mutation the number of atoms moved can be adjusted as well as
the occurrence of mutation to members of the population. Some current methods choose
crossover or mutation while others choose to mutate independent of prior operations. All
of the mutations are subject to the minimum distance constraint as described by equation

3.41

Pr IVi¢j le‘ — le > min(distance) (341)

3.7 Viable Population Size

Exploration is an interesting term in the GA vocabulary. While it’s deﬁnition might seem
obvious it’s role in the operation of a genetic algorithm is not always obvious. The full
phrase should be to ”explore likely areas of the solution space”. The problem is how to
identify the likely areas. Further, exploring the same localized area wastes effort. Lo-
calized searching of areas repletely is not effective. Exploration can be achieved by an
abundance of good genetic material and then by crossover to combine likely local areas.
Another way is to provide a high degree of mutations that can explore the search space. The
probability of having a certain number and size of building blocks in a given size popula-
tion provides the bases for selecting population size. If a minimum number of these blocks
must be present in the population at any point in the solution search then this become the
limiting factor for determination of population size. Knowing the probability of have a
building block of size L and the minimum number blocks needed, then the population can
be estimated by the simple fraction. The lower bound for the probability of creating a build
block was shown in Equation 3.13 as the product of individual probabilities for occupation

of a desired volume. The Pbb(L) is the chance of creating a building block with L atoms.

105

Equation 3.42 shows the expected number of building blocks with size L created in a Single

chromosome with N atoms.

PL = (1:) Pbb(L) (3°42)

It is possible to have an abundance of small building blocks but the probability of larger
building blocks decreases rapidly. Equation 3.43 provides the estimate of Q, the number of
population members needed to have a count of building blocks with a given size. The actual
count needs to be several times the total needed at a given size in order to be effective. If
the PL is on the order of .001 for L = 5 and the count needed was 20 this would predict a

population size of over 20, 000.

_ BBcount

Q a

(3.43)

It should be noted that the actual number of subgraphs of order L is much higher than
the random probability of locating atoms in volumes would predict. This is due to the
initialization method. Random molecule initialization routine only locates atoms one bond
length apart. In doing so, it signiﬁcantly increases the probability of constructing subgraphs
on the order of L where L < 5; the smaller subgraphs. It also insures that some of the larger
subgraphs are constructed. In the case where the known solution subgraphs at each size L
can be counted, it would be possible to predict the number of individuals needed in the

population to have all the subgraphs represented.

3.7.1 Initial Population of Genes

The real question to be addressed is how large of a population is needed to have an effective
algorithm for ﬁnding solutions with some acceptable probability of success. TWO very
related issues are the expected number of 39,, that the random initialization produces and

the effectiveness of combining good .991, that are distributed throughout the population. The

106

L in the sufﬁx refers to the number of atoms in each subgraph.

3.7 .2 Reduction in Gene Pool by Selection

For the algorithm used in the molecular conformation, only one child is produced from
a crossover operation. As a consequence, genetic material from the unused halves is not
propagated into future generations. This loss has the effect of reducing the gene pool. A
highly ﬁt individual is more likely to be chosen multiple times and will provide material
in random halves so that good material is not lost. Having a population large enough to
have at least one of every needed subgraph created during the initialization process is not
feasible. Just using the probability based on volume occupation with effective local search
still requires a population several times larger than those proved effective in practice. The
explanation for this is based on the probability of constructing new useful subgraphs. One
interesting property of an increasing number of similar molecules is that the disruption of

good subgraphs by spatial crossover will be reduced.

3.7 .3 Introduction of New Subgraphs

A new subgraph can be created by the local search, random combinations along the bound-
ary of the cutting plane, and by mutation. Each of these areas is important for the operation
of the GA. For each of these areas the operator may also destroy good subgraphs as well
as create new material. Multiple copies of good material distributed among the rest of
the population are critical to allow the newly created material to be combined into bet-
ter molecules. Early in the GA run, the probabilities of creating new subgraphs is very
high. As the run progresses, the dynamics change to the point where the creation of a new

subgraph is much less frequent and the destruction of subgraphs is more common.

107

Creation of Subgraphs by Crossover

In the case where two molecules are combined, a new subgraph can be created along the
cutting plane. Cutting the molecules in half will also destroy a certain number of good
subgraphs. The probability of creating a new good subgraph can be estimated from the
number of atoms in the neighborhood of the cutting plane. These atoms occupy a volume
based on a disk of radius equal to that of the molecule and the thickness of a bond length.
The number of atoms in this region can be estimated by assuming a spherical volume of
the molecule. For the creation of a new subgraph in the cutting plane area, the two parents
must have different conﬁgurations of atoms. If the conﬁgurations are the same, the cutting
of the molecules would preserve the existing subgraph and not create a new one. Further,
only the differences in the region of the cutting plane are important for the introduction
of new subgraphs. As described in Section 3.4.3, if the disruptions of the subgraphs by
the cutting plane are ignored, the probability of creating a new child with more subgraphs
than either parent based on random distribution of the subgraphs in each half can be as
high as .437 in the limit, even for small subgraphs. This is clearly seen in the ﬁrst few
generations of the GA run. This rate of improvement is not sustained because, as the
number of good subgraphs increases, the disruptive issues come into equilibrium with the

constructive mechanisms.

The total number of subgraphs cut by the dividing plane also represents the number of
possible new subgraphs that can be completed with the spatial crossover operator. Sub-
graphs that are the same in both parents, even if cut by the plane, will be completed again
during the crossover and will not be lost. Those atoms in the immediate area of the cutting
plane that are either part of a good subgraph that was broken or just in that volume of space

become the possible candidates for completing new subgraphs.

 

2
Vdisk = min(dk) (ma);(dk)) (3.44)

108

 

3
Vsphere : 3'” (ma);(dk)) (3.45)

Atoms = N i__m1n(dk)

27r max(dk) (346)

First, Equation 3.44 gives the volume of the cutting plane region based on a Single bond
length or minimum distance between atoms. Second, Equation 3.45 gives the volume of the
molecule based on the maximum distance in the molecule. The ﬁnal Equation 3.46 shows
the number of atoms in the region that can become part of new subgraphs based on the total
number of atoms in the molecule. The local search will have the effect of including each
atom not already part of a good subgraph to be added to a good subgraph. This represents
the upper bound on the number of new subgraphs that can be created in this region. The

lower bound is zero.

Creation of Subgraphs by Mutation

The introduction of new subgraphs by simple mutation has two problem areas. The muta-
tion may break a good subgraph and the movement will not be large enough to escape the
local search. Relocation mutation avoids the problem of inability to escape from the local
search by relocating atoms to new neighborhoods in a molecule. This forces an exploration
of possible solutions. Simple mutation has a much smaller chance of exploration if the

% times the bond length. Virtually every repetition of

magnitude of the moves is less than
mutation will destroy some number 391, of subgraphs. Simple mutation explores topolo-
gies that are close to each other in space, while relocation mutation guarantees that new
topologies of the molecule farther from each other are explored .

Some number of simple mutation repetitions will be undone by the local search effects,

returning some atoms to their previous locations. For some number of atoms moved, the 10-

cal search will move the atoms to new desired locations and thus explore a nearby topology.

109

The net effect on the creation of subgraphs is more subtle. The added effect of local search
will make every atom part of a 392 subgraph. The real possibility of increasing the longer
subgraphs will depend on the defects that need to be ﬁxed. If the movement of an atom to
a location completes many subgraphs of several lengths, the overall ﬁtness improves more

than if only one subgraph is completed.

The effect could be simpliﬁed to that of defective locations and non-defective locations
for each atom. An estimate of impact by mutation could be as simple as determining the
probability of moving an atom to create one more defect, two more defects, no net change
in defects or one fewer defect. Assume that there are N — 2: locations. If an atom is
removed, it’s removal can create another defect. Similarly, there are also N — :1: locations
where if an atom is added, it creates a new defect. In addition, there are approximately 2:
locations where placing an atom will remove a defect. In this example, there are :1: atoms
that are defective. Equation 3.47 show that as :3 gets small, the probability falls to 7%.
For larger molecules with more atoms, this represents a limitation of improvement in the

structure by mutation.

 

x 2
defeCtreduced N (N) (3.47)
N — a: 2 17(N — z)
defeCtincrease N ( N ) + (————N2 ) (3.48)

Equation 3.48 shows that, with small values of 2:, nearly every mutation will add de-
fects. The dilemma is how to perform enough mutations so that the reduction in defects
has a chance to occur and at the same time maintain a ﬁtness value high enough so that
the molecules will survive in the population. This also means that when a defect reduction
Occurs the process of recombination must somehow preserve the reduction and remove all

the other defects that were introduced.

110

¥

Subgraph Creation with Local Search

The local search will tend to move atoms to good locations and increase the number of
subgraphs. The N 1 1 molecules examined in Section 3.4.5 show the effect on the count after
application to randomly initialized molecules. In broad terms, any atom that is in the right
vicinity will be moved to the correction location by local search. There are two effects that
crossover provides that will impact the local search. When crossover occurs, atoms cannot
be placed too close together and are adjusted slightly as needed to maintain the minimum
distance. Second, when the two halves are combined, the atoms near the center of mass are
added ﬁrst. The process then works its way toward the outer edge of the molecule, adding
atoms. This will tend to prevent minimum distance violations and increase the number of
successful crossover operations. Overall, the two operations may constantly undo the each
other results. Local search may be trying to move an atom between two others and violate
the distance minimum. Crossover will move it back. Distance-based local search treats
large distances in the same fashion as short distances. This will have the effect of forcing
an overall shape on the molecule that is consistent with the target molecule. In contrast,
energy-based search will be mostly concerned with only those atoms that are a bond length
or two away and will not drive the overall shape of the molecule. In fact, the results for
some of the Bucky ball studies by Deaven and Ho [13] have a distorted appearance because

the larger distances have little to no effect on the potential energy value.

Local search is applied to all the operations. Even simple reproduction allows the
work of local search to be spread out over several generations. As noted for mutation
and crossover, the ﬁtness of the molecule will degrade during the actual operations even
for those that contain new subgraphs that are ultimately needed to ﬁnd the solution. Local
search will increase the ﬁtness so that the offspring will be chosen for future operations.
The other effect is that any mutation or crossover that gets a defective atom into the right
region to make a repair will move that atom to the proper location and thus increase the

number of subgraphs with higher lengths L.

111

3.7.4 Diversity of Population

A natural way to measure the diversity of the current population is to use the ﬁtness measure
with the best molecule as the target. This will highlight the differences between the current
best molecule and the rest of the population. The distance sets will have small errors
for isomorphic graphs or similar arrangements of atoms. For molecules with signiﬁcantly
different topologies, the set of distances will be substantially different and will produce a
large value for the summed differences in the distances. It is interesting to note that if the
two molecules are used during crossover and they are very similar, the disruption of good
subgraphs is reduced along the cutting plane. Only good subgraphs that span the cutting
plane in both parents and that have no matching atoms on the other parent are lost. As the
population converges, the probability of having the same good subgraph affected by the

cutting plane increases and fewer disruptions will occur.

3.8 Graph Isomorphism

The nature of determining the structure of a molecular from the distance set has an ex-
tremely strong parallel to determination if two graphs are isomorphic. The set of atom-to-
atom distances can be used as a test for isomorphism. If the two molecules are represented
as two graphs with the bonds being edges, then testing for possible isomorphism can be
used to rule out equivalent molecules. This test can also be used as a measure of diver-
sity in the population. The comparison of two chromosomes on a gene-by-gene basis is
not useful for the determination of diversity because of the fact that gene placement in the
chromosome has no correlation to actual position in the molecule. In order to measure the
diversity of the population, the true measure is the degree of isomorphism. The total sum
error in these distances can represent a useful measure of similarity of two graphs and in
this application the two molecules. In graph theory terms, two graphs are either isomorphic

or they are not. This diversion to graph isomorphism is useful because it shows that the

112

sorted distances can be used as an additional test of isomorphism. The distances in graph

terms are integer number of edges between nodes.

Two graphs G and H are isomorphic if there exists a mapping f : V(G) —> V(H) such
that any pair of adjacent vertices u, v E G (-) f (u), f (v) 6 H are also adjacent [7][26].
In the case of molecular conformation, the graphs are deﬁned by bonds as edges and the
atoms as nodes. A Simple test of isomorphism for two molecules is identity of the two sets

of sorted distances, when distances are restricted to those that represent bond lengths.

Another test that can eliminate two graphs as being isomorphic is the set of {deg(v,~)}
for all nodes in the graphs. If the two sets are not the same, then the two graphs are
not isomorphic. While GA works with the complete set of distances produced in a fully
connected graph K n, the set of distances is restricted to bond lengths that form the topology
of the molecule. In a typical molecule, the deg(v,~) is in the range of 1 . .6 for each atom. The

total number of distances that make the set of bonds is then based on dcount = ail-313M.

The following conditions can rule out isomorphic graphs. Equation 3.49 describes the
two graphs in question as G and H. The Simplest is based on the fact that the graphs must

have the same number of nodes or atoms as shown in Equation 3.50.

v, e G, u,- e n (3.49)

n ¢ m (3.50)

The number of edges for the two graphs must also be equal and can be calculated
from the vertex degrees as shown in Equation 3.51. The next check shown in Equation
3.52 is based on the degree of each vertex. These two sets of vertex degree values must

match, indicating that there is a one-to-one match of vertex degrees. This test can rule out

113

isomorphism but cannot provide the mapping from G -—) H required to prove isomorphism.

 

 

2;; (189(2),) 7e 23:1 d89(uj) (3.51)
2 2
{deg(v1) . . .deg(vn)} 74 {deg(u1) . ..deg(um)} (3.52)

Working with the set of distances it is possible to add another check that can rule out
isomorphism between graphs. If the graphs are isomorphic then the sets of distances will
also be equal. That is, the distance set consists of all the distances between any two vertices
in the graph as measured by the edge count between them. For each graph, the distances
between all combinations of nodes is calculated, as shown in 3.53 and 3.54 for each of the
graphs. These distances are then collected and sorted into a set for each graph and these
sets must be identical as shown in Equation 3.55 for the graphs to be isomorphic. This is

necessary, but not sufﬁcient, condition for isomorphism of the graphs.

Vi 7e j d. = dis(v,-, 2),) (3.53)
vr 75 j t), = dis(u,-, uj) (3.54)
{d1...d,,};£{t1...tk} (3.55)
Vv,eGC{d1...d,,}3u,-EHC{t1...t,,} (3.56)

Determining isomorphism uses these sets of distances as constraints for each of the
graphs. Any subset of the distances incidents on a given vertex 1 in graph G must match

the possible candidates in H. If the set is unique, it is possible to assign the mapping

114

between the vertex v1 6 G —-) u.r E b. If there is no matching set, then the two graphs are
not isomorphic. When there are multiple sets that are identical, the number of permutations
that must be tested is signiﬁcantly reduced. Equation 3.56 describes that, for every subset
of distances associated with each I),- e G, there must exist an identical subset of distances
for some vertex Uj 6 H.

The number of unmatched sets can be used as the measure of difference between the two
graphs or molecules. If the measure of similarity is desired instead, the count of matching

sets can be used.

3.9 Convergence

There is a natural tension between exploration and convergence. Premature convergence
produces poor results and prolonged slow convergence wastes computational resources.
The correct convergence velocity or rate of convergence is the desired goal for the best
solution possible. The rate of convergence for the molecular conformation GA is affected
by many factors. Some of these factors that can lead to premature convergence are the size
of the population, selection pressure, mutation rate, crossover rate, local search and lack of
diversity. Lack of diversity in the population is a typical cause for premature convergence.
Diversity alone will not solve premature convergence. Having a population of poor ﬁtness
candidates may be diverse, but it will still not necessarily help in ﬁnding a good ﬁtness
solution.

Further, the diversity should contain at least a large number of the needed low-order
subgraphs that need to be combined into good solutions. Having a number of candidate
solutions that are similar is a natural consequence of competition of the ﬁttest. Worse yet,
in this type of application, some arrangements of the atoms are isomorphic to each other.
While they are different in orientation or possibly chivalry, they add nothing to the diversity

of the population and, therefore, do not assist in ﬁnding the solutions. The main issues in

115

 

determination of the convergence are the Size of the population, the introduction of new

genetic material by mutations, and the selection pressure.

3.9.1 Operator Effectiveness

During the course of a GA run the effectiveness of the GA operators can be measured in
several ways. Ultimately, the key point is that some number of individuals will be created
that lead to a valid solution. It is possible to measure the Operator effectiveness in a couple
of ways. The simplest is to count the number of resulting molecules that have a better
ﬁtness than the parents. The other is to sum the improvement of ﬁtness for the offspring
that are better than the parents. As the GA runs, the ability to create better children will be
more difﬁcult. This is due ﬁrst to the fact that the size of the space with better ﬁtness values
is getting smaller and smaller as ﬁtness converges to better values. Second, is that all the
operators are more likely to disrupt good ﬁtness subgraphs than to build a new high-ﬁtness
subgraph. The population of potential solutions has a certain equilibrium point as the run
progresses. At this point the ﬁtness values for some molecules will be poor, based on the
disruptive nature of the operators. This will lead to a near-steady-state effectiveness rate.
This rate is calculated for each type of operator. For each generation, the rate is simply
the number of offspring that are better than the parents divided by the total number of
operations of that type performed. Each operator is a combination of the local search with

one of the GA operators, simple mutation, relocation mutation, and spatial crossover.

3.9.2 Difference Equation

Operator effectiveness is based on the current ﬁtness of the population and the amount of
degradation that occurs from the GA operators. For example, if all the molecules are near
perfect, the probability of an operator producing a new molecule with better ﬁtness is much
lower than when the population is ﬁrst initialized. This will be clearly shown in the results

section that describes operator effectiveness. At the same time, if the GA operators degrade

116

 

 

the overall population, a steady-state number of ﬁtness improvements can be expected.

Cbetter = Pbetter(E) (3-57)

3.10 Scalability

The ability to solve the problem has several interrelated factors that all affect scalability.
For the molecular conformation problem, scalability is the ability to ﬁnd solutions as the
size of clusters increases. Computation cost is impacted two ways. First is the cost of
evaluation of the objective function. Second is the increases associated with population
Size and number of generations required for ﬁnding a solution.

A larger population size is typically thought to improve the probability of solution. This
increases the computation cost for every generation. If the size is too small, the probability
of ﬁnding a solution will become very small and a high number of failures will occur. If the
population is larger than it needs to be, it may take extra generations to ﬁnd the solution.
The key question about population size is if it is large enough to contain a critical mass of
genetic material needed to ﬁnd a solution. In this particular problem, a constant tension
between size and effective mixing occurs. Too large a population and the needed mutations
are not mixed sufﬁciently by crossover to ﬁnd the solution. Too small a population and
the probability of getting the needed mutation is so small that the chances of ﬁnding a
solution goes down. So for effective scalability, the population size must be scaled within a
window of viable probabilities. On the smaller molecules, this window is much larger and
the population size parameters will behave in the expected fashion. As the molecules get

larger, the overlap in windows gets smaller and this leads to fewer good solutions.

The cost of evaluation of the ﬁtness function is at the base of the scalability issue. This
cost will determine the lower bound for the entire systems operation. The creation of new

subgraphs is the key element in determination of scalability, assuming that it is not feasible

117

to create a population with a sufﬁcient number of subgraphs that can then be combined
into a solution. The sources of these new subgraphs are provided by the genetic operators
along with the local search. The ﬁtness function will be analyzed followed by the impact
of the GA Operation. This will be divided among the population size, local search, and GA

operators.

3.10.1 Fitness Function

The ﬁtness function’s cost of evaluation is on the order of 8(N2). The current process
requires the calculation of the distances, which is 8(N2), and sorting the distances which is
on the order of 9(N ln(N)), along with the actual summation of error which is on the order
of O(N2). This is an interesting observation in and of itself. Any genetic algorithm that
uses the distance matrix or energy calculational cost must scale by some factor x such that
6(Nf) > O(N2). In this work, the entire set of distances is used for the ﬁtness function.
In contrast, the energy functions are most dependent on only near-atom interactions.
There are several issues that affect the ability of algorithms to solve larger problems. At
the base of these issues is the way in which solutions scale with difﬁcultly and size. For the
molecular conformation, the size of the representation scales on the order of O(N). How-
ever, many of the data structures used to calculate the ﬁtness function grow as O(N2). This
sets a lower bound on the ability to scale the algorithms with the current ﬁtness functions.
It is interesting to note that if the ﬁtness function was reduced to some value 6(aN) based
on only the bonds of the molecule, the performance would be improved. In addition to the
ﬁtness function, the ability to scale the GA will depend on the population Size required and
the cost of performing the various operations such as local search, crossover, and mutation.
The number of small subgraphs grows in a linear fashion and, if the GA ability to operate
is based on these small subgraphs, this should provide good scaling as the molecules grow
in size. The other areas that affect the scalability include the size of the population, number

of generations required for convergence and cost of performing the local search.

118

3.10.2 GA Algorithm

There are many trade-off parameters in a genetic algorithm whose selection will not be
obvious. For example, increasing the size of the population could signiﬁcantly reduce the
number of generations needed for solution. Similarity, increasing the number of iterations
on the local search may in fact prevent ﬁnding solutions or reduce the number of genera-
tions required. Clearly these choices are not obvious. In addition to the ﬁtness function,
the dynamic operation of the GA will depend on population size, application of operator
parameters, and number of generations required.

The ability of mutation to ﬁll in defects has been estimated. The probability is pro-
portional to Flf- This implies that the population size must increase by N 2 to improve the
chances of ﬁlling in the defect. The crossover operator’s role is to spread good subgraphs
around while trying to preserve as many good ones as possible. If the population is too

large, this will be harder to do.

Population

The population should be large enough to insure a good set of genetic material is available
and at the same time, small enough to provide good computational cost. The population
size can be smaller if the introduction of new good genetic material is provided. If the
population is assumed to have all the genetic material it might need at the start of the GA
run, then the required size of the population can be calculated. Starting with the probability
of having a building block and the number of building blocks required, it is possible to
estimate the size. Needed building blocks must survive until they are combined into high
ﬁtness individuals that lead to a solution. For the current work, the local search can create
new good subgraphs along with the GA operator. The goal is balance population size
and the number of generations to ﬁnd the solution. Further, there is a critical population
size at which not enough material exists to work with and the disruptions caused by the

genetic operators will cause convergence to some poor ﬁtness value. In this situation, the

119

probability of creating new subgraphs and also having all the required subgraphs in the

population in order to ﬁnd a solution is so low that it starts to resemble a random search.

Generations

The larger populations are expected to decrease the number of generations needed to ﬁnd
a solution. This relationship is not well established and, for some of the experiment results
in this work, it is shown that smaller populations actually perform better for the larger

molecules in relation to the number of atoms.

Local Search

TWo factors of local search are the rate of adjustment to each location and the number
of iterations performed. Increasing the number of iteration linearly affects the number of
evaluations. At the same time, the higher the adjustment rate, the fewer number of iterations
should be required. These parameters are bound by a stability criterion proportional to ﬁ.
Figures 3.8 and 3.9 show how the individual contributions are added in a simple two—
dimensional example. The magnitude of the ﬁnal vector for correction of the atom location

is based on the scale factor that is subject to the stability criteria.

 

 

 

 

 

Figure 3.8: Local search contributions

120

 

 

 

 

 

 

Figure 3.9: Local search adjustment

If the scale factor is small and the number of iterations is high then the computation cost
is signiﬁcant. On the other side, if the scale factor is larger, the number of iterations can
be reduced. When this scale factor is too large, the local search can become unstable. The
basic impact on computation cost is linear with the number of iterations used. Convergence
is detected by examination of the sum of correction vector magnitudes. If the total sum of
the correction vectors is below some threshold, then the local search has converged and
is terminated. The disruptive nature of all the GA operators will cause the local search to
iterate the maximum number of times for a large percentage of the molecules. Multiple
generations may be required in order to reach convergence, even when the subgraphs are

not disrupted with other operators.

Estimate of scalability

If the relocation mutation is assumed to be the major source of new subgraphs and crossover
is how these are spread among members of the population, with local search just doing the
ﬁnal adjustments, then the basis for the scalability will be O(N4). The impact of the local
search is a linear scale factor. While it can signiﬁcantly increase the cost of each GA run,

the real problem is the probability of constructing the new subgraphs that repair defects

121

 

in the molecule structure. This somewhat primitive analysis is in good agreement with
established scalability results for energy search. In Hartke [27], a claim of 9(N3'15) is

made and the data could be construed to be somewhat higher for larger values of N.

3.11 Summary

This chapter introduced the concepts of space division for homogeneous vectors as loca-
tions in space. Many ﬁtness functions used in GA’s have signiﬁcant contributions that are
based on the proximity and arrangement of these locations in space. These locations are
represented as genes in a chromosomes used to describe each individual member of the
GA’s population. A method for dividing spaces is introduced based on the dot product of a
normal to a hyperplane.

Next, the ideal of local search was formalized as any collection of locations such that
application of a local search will lead to a speciﬁc set of locations. The equivalences of
two molecules is demonstrated for translation and rotation such that the individual atom
locations between two molecules are within some arbitrarily small sphere. The difference

between the local search and the equivalence is the size of the spheres.

The probability of the random creation of building blocks or small subgraphs is de-
scribed in terms of space volumes. The actual process for the local search is described,
along with the deﬁnition of convergence for the local search. The generalized division of
space by multiple hyperplanes is described with assignment to speciﬁc spaces based on the
binary pattern of positive or negative dot products for each of the dividing planes.

The ﬁtness values have been shown to have a direct relationship to the number of small
subgraphs or building blocks that each molecule has. To Show the theoretical basis for
the performance of the spatial crossover, the counting of the subgraphs is used to predict
the performance. While the subgraphs are similar to schemata, they are restricted by the

existence of edges between each gene where in schema theory no such restrictions apply.

122

Several molecules and a basic graph are examined. In each case, the single cutting plane
disrupts far fewer small subgraphs than the two-point standard crossover.

The standard crossover and the spatial crossover are then compared. Using the Bucky
ball example, it is clear that the spatial crossover preserves the subgraphs and leads to
subgraph creation. It is interesting to note that as the diversity is reduced, the disruption of
subgraphs is also reduced.

The two forms of mutation used in this GA are formalized. The ﬁrst is a random
displacement of atoms and the second a relocation of an atom. The ﬁrst type of mutation
has a very localized effect if the number and the magnitude of mutation is limited. The
second form signiﬁcantly alters the local space and provides a better means for exploration.

An important factor to the operation is determining a viable population size and asso-
ciated convergence rate. This section provides a formulation for the number of members
needed in a given population along with the expected rates of convergence. There are sev-
eral critical parameters that are based on the creation of subgraphs in the initial population
along with the steady-state operating points.

The overall scalability is based on the probability of constructing new good subgraphs
and the cost of the ﬁtness function. These, along with the linear impact of the local search,

are the dominant factors in scalability of the spatial operators.

123

Chapter 4

Experiments and Results

This chapter will present the results of experiments that explored the nature of the spatial
operators. These runs are divided into several areas, starting with local search, followed by
a series of trials for ideal molecules and real molecule distance sets. Additional results are

provided for parameter selection searches and operator efﬁciency determination.

The results presented were accomplished using a typical single-level run of the GA
algorithm with spatial operators. A single population was initialized and a GA run was
conducted on this ﬁxed population. For some limited number of molecules, multi-level runs
were conducted. In this style of run, a number of GA runs were conducted on initialized
populations. These runs were terminated either by ﬁtness value or ﬁxed generations. The

best of each of these ﬁrst-level runs were used as seed material for a single second-level

GA run.

Section 4.1 describes the results of the local search without any of the genetic algorithm
operations utilized. This is provided to establish the size of the region within which an atom
can be displaced and still converge. This characterization of the local search provides the

basis for describing equivalent molecules that converge to the same atom locations.

Section 4.2 is based on ﬁnding solutions to molecule structure using GA on of a variety

0f molecules with different sizes and shapes. While most of the work will center on known

124

 

‘1'”

Lennard-Jones low-energy conﬁgurations, other shapes, such as the Bucky ball, will be
examined. One of the main results provided is the impact of molecule size on the solution

process.

Solving real conformation problems given real and artiﬁcially degraded distance data
sets are the focus of the next section. In Section 4.3, the ability of the algorithm to ﬁnd
Bucky balls from what is measurably a noisy data set of distances is presented. One of
the problems with the application of genetic algorithms is the large number of user-deﬁned
parameters and what impact they have on the algorithm performance. In this application
there are over ninety parameters that are deﬁned at run time that can be customized for each
type of molecule. Determination of the optimal operating points is in itself another useful
application of genetic algorithms. In this work the parameters were manually selected
based on some of the parameter sweep studies presented in Section 4.4. The molecule
Size of twenty atoms was chosen because of the reasonable run times and the demonstrated
sensitivity of some of the parameters. The ability to use these experiments as guides for

larger molecules turns out to be limited.

4.1 Local Search

In order to understand the limits on the local search, an experiment was constructed that
started with the ideal molecule and then randomly perturbed X atoms some random direc-
tion with a maximum random magnitude of perturbation M. For each set of X and M forty
trials were run. The raw results are presented as the number of failures, in Table 4.1. Trials
that were run with less than ﬁve atom movements had very few failures and are not shown.
All the tests were run on the Lennard-Jones N11 molecule. Each test starts with the ideal
molecule then repeats the following procedure X times as shown in the leftmost column of
Table 4.1. It randomly selects an atom and moves it in a random direction with a magnitude

described by the ﬁrst row. Once the perturbations are complete, the molecule is compared

125

to the ideal for the following measurements before the local search and after. In addition,
the number of iterations to ﬁnd the ideal was recorded. If the local search failed in 4000
iterations to ﬁnd the ideal, then the run was terminated.

The following information was collected for each perturbed molecule that was created.
The values were collected after the ideal molecule was perturbed and again once the local
search had been completed. The local search was terminated if the total number of itera-
tions exceeded the limit or when the total distance objective function was below a speciﬁc

threshold.

1. Lennard-Jones energy

2. Distance objective function

3. Average distance errors

4. Maximum distance error

5. Standard deviation of distance error
6. Total vector summed error

7. Histogram of distance errors

Because the atoms to be moved are chosen at random, even for small numbers of rep-
etitions, there is a signiﬁcant chance one atom may be moved two or more times by the
perturbation process. This local search, as describe in Section 3.10.2, is not a gradient or
conjugate gradient method and it does not have the analytical properties that either method
would. The correction vector is distributed to both atoms in every atom pair even though
only one atom might need the correction. On some iterations, only a subset of the distances
are used to calculate the correction vector. The size and the range of distances used are
selected at random. Near convergence to the solution, all the distances should be near zero

and therefore, a subset should have only minor impact on the actual atom movement.

126

 

 

 

 

 

 

 

 

 

 

perturbation magnitude

atoms 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8
5 0 0 0 l 0 0 3 4
6 O 0 0 0 l 1 1 1
7 0 0 0 2 O O 0 5
8 0 0 0 0 l 0 6 7
9 0 0 0 0 0 2 7 10
10 0 0 1 0 l 3 6 1 1
1 1 0 0 1 0 l 4 3 7

 

 

 

 

 

 

 

 

 

 

 

Table 4.1: Failed convergence by local search to ideal N11 by count and magnitude of
perturbation

As demonstrated in the Table 4.1, the number of local search failures increases as the

number of repetitions increases or the magnitude of the atom movements increases.

The number of iterations required to ﬁnd the solution has a signiﬁcant impact on the
overall efﬁciency of the GA. In fact, it is the local search that contributes most to the com-
putational cost. In Figure 4.1, the actual number of iterations before termination is shown.
The vertical axis is the probability of ﬁnding a solution in the range shown. For example,
the ﬁrst bar shows that 58% of all trails found a solution in less than ﬁfty iterations. The
second bar shows that 12% of all trials solved in 50 — 99 iterations and so on. The last bar

represents the total number of failures for all 2240 trials which is 90 or 4%.

In order to determine the true nature of the failures from this experiment, two other sets
of probabilities were created. First, we determined the probability of failure given the total
summed error of the perturbed molecules. The second set of probabilities is the failure

given the maximum perturbation of the molecule.

Figure 4.2 shows the probability of failure given a total summed error in a speciﬁc
range. With the exception of the large spike for summed errors in the range 1.0 to 1.1, the
graph shows the expected increase with the total amount of summed error. The small bar
for summed error in the range 0.2 to 0.3 is really only two cases that failed. It is important
to note the vertical scale of probability. For the largest bar, only 3% of the cases in the

range 1.0 to 1.1 fail. This is almost the same as the overall failure rate for the whole trial.

127

 

 

0.1 .

0.5 (
0.4 7
0.3 1
02 1

0.1 -

Probabillty of solution in x lnteratlons

 

01

006 T '

 

‘a‘rsrcsa‘b‘riilyssalur...

 

<50 <100 <150 <200 <400 <600 <aoo <1000 <1500 <2000 <2500 <3000 <3500 <4000 <4001

Figure 4.1: N11 local search probability of success by number of iterations

0.035

I

01325...

Probability

 

0.015

 

iiﬁéiﬁéilftféliéilﬁr5.

 

 

 

 

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1 .1 2
Squared sum of error

Figure 4.2: N11 local search probability failure by sum of error

128

 

0.4
0.35 ' .. HQ.-- .. j.
i- mum

0.25

Probability
O
N

0.15

0.05

 

 

 

o . — f . y A ,
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 1.1
magnltude

 

Figure 4.3: N11 local search probability failure by max perturbation

Figure 4.3 shows a much more signiﬁcant trend of failures as determined by the max-
imum error associated with any one atom. In this graph, an expected rise with increased
maximum error occurs but the correlation to failure is much stronger. In section 3.2, the
hypothesis stated that if the atom locations are within some fraction F of the ideal loca-
tion then the local search should converge to either the ideal location or the closest local
minimum. The vertical axis in this graph is an order of magnitude larger than the previous
one. In this graph the probability of convergence failure to the ideal solution was 37.5% for
maximum error of any single atom between 0.9 and 1.0 bond length. Below a maximum

error of 0.6 bond length, the probability of failure is extremely small.

This experiment supports the hypothesis that it is the maximum displacement of any
given atom that leads to the inability to converge to the ideal solutions. This maximum

displacement is just over 0.5 of the bond length.

129

4.2 Ideal Distances

The use of ideal distances was necessary to demonstrate the feasibility of the algorithm
and the impact of spatial operators on performance. While the runs with eleven and twelve
atoms allowed for close examination of the local search impact, it turns out that the prob-
ability of solution even for marginal methods is high. It is only when the molecule size
exceeds thirty eight or so atoms that the problem really becomes difﬁcult and scalability
clear. The molecules used include N38, a known difﬁcult conﬁguration, the Bucky ball

with sixty atoms, and the N60 Lennard-Jones low-energy conﬁguration.

4.2.1 N12 Lennard-Jones Molecule

The results for the N12 ideal molecule provides a base for comparison with larger
molecules. In these runs, the algorithm ﬁnds the solution nearly always, even for poor
choices of parameters. The following series of graphs demonstrate the effects the mutation
rate, local search iteration count, local search convergence factor, and population size have
on the solution. The number of evaluations is on the order of 0(n3'1). While the graphs
exhibit some of the expected properties, they also show how robust the solution methods

are for small numbers of atoms.

In the Figure 4.4, the probability of simple mutation was varied for a series of runs.
Each run consisted of 100 trials and the results were averaged. As can be seen, the muta-
tion rate has a limited affect on the average number of generations to solution. The main
reason is that local search is nearly able to ﬁnd the solution by itself. The fact that more
mutations lead to slightly faster average solution times is expected because the space will
be explored faster. Continuing to increase the mutation frequency will eventually lead to
slower solutions because of the disruptive nature of mutation. The interaction of crossover
and more complex mutation will become far more apparent in the larger molecules. The

size of the population also affects the average number of generations to ﬁnd a solutions.

130

 

14

 

 

 

 

 

—Av A“- tions
12 , .. _.._. _. __,, . -- _ .mew,._ _.--H-- . z
10 \/\
c 8
8i
3'
3 6
4
2
0
0.01 0.05 0.1 0.15 0.2
prob m1

Figure 4.4: N12 simple mutation probability sweep

Figure 4.5 shows this effect using the N12 molecules. The slow decrease in the average
number of generations as population size increases is shown. An interesting point here
is that, as the population increases, the overall number of evaluations increases. Figure
4.7 shows the number of evaluations as a function of population. This ﬁgure has some
very intriguing peaks and dips. Each population size was run 100 times, and while the
number of samples is limited, the results should still be representative. The dips at 18 and
38 are very pronounced, along with the peak at 13. Several additional trials were run on
population size 18 in addition to the original 100 and the results were consistently fewer
total evaluations. The true computational cost is better indicated by number of evaluations
however the impact on the average number of generations to solution shows the continued
decrease as the population continues to increase. The vertical scale is relative, based on the
product of the population size and the average number of generations to achieve the desired

solution.

Because of the important role local search plays in the conformation algorithm, the

131

18

 

--Avonuwaononnkuui

16

14

12

 

 

 

 

10 12 14 15 16 18 20 25 30 35 40 45 50 55
Fknnuaﬁon

Figure 4.5: N12 population versus average generations to solve

affect of its parameters are very pronounced. The next two graphs demonstrate this point.
The ﬁrst graph, shown in Figure 4.6, shows the affect of the convergence factor on the
solution process. This is the factor that the error vector computed for each atom applied
to the location of the atom. If it had too small a value, the time to ﬁnd the solution goes
up signiﬁcantly. Too large a value can make the algorithm become unstable. The point of
instability is very steep and all runs beyond a factor of 0.3 were unstable. The interesting
point here is that any operating point in the range from 0.1 to 0.25 will give good results,

with a slight advantage to larger values.

What is very interesting is that the average number of generations does not vary much
with the convergence factor. The standard deviation of the average does vary and exhibits
a very clear effect as shown in Figure 4.8. In this ﬁgure, the stability of the number of
generations to ﬁnd a solution is strongly impacted by the selection of the convergence

factor. The range is clearly identiﬁed in the graph.

132

 

 

+ Average Generations

25

 

20

Generations

   

 

 

 

15
10
5
o
o 0.05 0.1 0.15 0.2 0.25 0.3 0.35
SF adjustment

Figure 4.6: N12 local search convergence factor versus average generations

 

400

    

5" 9'31““.‘903 .

250

§

150

100

 

 

o 5 10 15 20 25 30 35 40 45 50 55 60
Population

 

Figure 4.7: N 12 population versus evaluations

133

120

 

+smm i0...

 

100

 

 

 

 

0 0.05 0.1 0.15 0.2
Factor

0.35

Figure 4.8: N12 local search convergence factor versus standard deviation of generations
4.2.2 N20

The use of twenty-atom molecules demonstrates the various effects that parameters have on
the solution, but at the same time offers a high degree of probability that the molecule will
be solved. The average number of evaluations to solve N20 molecule versus the population
size is displayed in Figure 4.9. This ﬁgure shows the relative cost of computation to ﬁnd
the solution as a function of population size. For lower population sizes, the operations fail
more often and this has the effect of increasing the average number of generations needed
to ﬁnd the solution. As the population increases, the number of subgraphs that can be
present at one time increases along with the increase in possible creation of new subgraphs.
This increases the likelihood of solution. After a population threshold is reached, further
increases in population have a linear degrading effect on the computation cost. Populations
beyond this threshold will take longer on the average to ﬁnd a solution and require more

evaluations of the molecules. Figure 4.9 shows the computational cost as a relative value,

134

14000

 

-O- Evaluations

12000 ‘

10000

8000

 

 

 

 

20 40 50 60 70 80 100120140160180200220240260280
Population

Figure 4.9: Population size effect on computation cost to ﬁnd solution N20

based on the product of the average number of generations to solution and the population
size. This graph shows the pronounced dip between 80 and 120 population members. This
population size is much smaller than the predicted size needed, based on the probability of

having sufﬁcient subgraphs created during initialization only.

4.2.3 N38

The N38 Lennard-J ones molecule has a reputation of being difﬁcult for many of the energy
based search methods [27]. This difﬁcultly stems from the signiﬁcant change in the conﬁg-
uration from nearby low energy states to the current known lowest state. For the distance
based objective function, these large energy wells are not as difﬁcult as they are for the
energy based search. While the population size and number of generations required to ﬁnd
the solution are signiﬁcantly larger than the N20, the current algorithm found the solution
100% of the time with a population greater than 110 molecules. The average number of

generations to ﬁnd the solution was also larger. Figure 4.10 shows the plot of population

135

size on the average generations for a solution.

700

 

+AvgrageGenerathS‘
600 . \ _ . ....... ...
500 ~ .—

 

300

 

Generations

200

100 A

 

 

 

50 100 150 200
Population

Figure 4.10: Population size effect on average generations to ﬁnd solution N 38

4.2.4 N60

There are two sets of ideal molecules used in this study with sixty-atom molecules. The
ﬁrst is the carbon Bucky ball [43]. Although the Bucky ball is named after R. Buckminster
Fuller, the discovery of the carbon molecule that resembles the shape of the Buckminster
domes occurred when exploring the chemical processes on the sun. This discovery eventu-
ally lead to a Nobel prize and many advanced applications of the carbon Bucky ball. The
molecule is formed like a soccer ball and was the original target of these investigations.
The ball has many interesting properties that make it desirable to study. First, the number
of symmetrical distances is very high. Second, the energy level is not the lowest possible.

This means that search methods that rely on energy alone would not ﬁnd the conformation

136

without using distance constraints. Third, the size of the ball is large compared to many
other molecules and this affects the overall operation of the search.

The second N60 molecule is the low-energy conﬁguration. This molecule has an in-
teresting history. It was only recently found and displaced a molecule that was the prior
known low-energy conﬁguration for several years [53]. The difference in energy level is
quite large. The Bucky ball has a Lennard-Jones energy of —44.1 while the low energy
conﬁguration has --305.8. This means that ﬁnding the solution to the conformation strictly

on energy would be difﬁcult.

4.2.5 Bucky Ball

One of the most interesting aspects, of the ideal Bucky ball problem is the effect that of
only a couple of iterations of the local search had on the ability to ﬁnd a solution. Several
months of searching with just ”straight” GA spatial mutation and crossover failed to ﬁnd
a solution in a single-level run. Mth the inclusion of the local search after every GA
operation, the program found solutions 80% of the time in a few hundred generations. One
of the most interesting observations is the progression of potential energy as a function of
generation compared to the distance-based objective function. Figure 4.11 shows the raw
ﬁtness as a function of generation. In this graph, the initial drop is due to local search
and the fact that combinations with a few good building blocks have a high probability of
creating a molecule with more building blocks or good subgraphs. This phase only lasts
about six to twelve generations and then the steady-state equilibrium is dominant, with
a more gradual improvement in ﬁtness. In these early stages, just about all operations
will produce a molecule that is better than the parents because there is so much room for
improvement. This steady decline in objective function value is accented with jumps that
are very characteristic of a GA search. Once a conﬁguration is found with the local search
within range of a solution, it only takes three to four generations to ﬁnish.

In contrast, Figure 4.12 shows the progression of the potential energy as a function

137

 

— Oblective Function

Squared Sum Error
at

_a
O

 

 

 

1 21 41 61 81 101 121 141 161 181 201 221
Generations

Figure 4.11: Ideal Bucky ball ﬁtness

of generation. The horizontal dotted line is the expected energy for the ideal Bucky ball
and you can see that the molecular conﬁgurations have lower energy level at the start than
the Bucky ball. An energy-based search would not normally ﬁnd this particular plateau
without constraints. As the search progresses, the levels bounce around signiﬁcantly. A last
big jump occurs just before a good conformation is found. These two graphs demonstrate
the disconnect between the convergence of energy and distance objective functions. The
distances used as the target in the objective function are derived from experiments and not

biased by the low-energy hypothesis.

Superimposing the two graphs shows clearly that, as the distance function converges,
the energy diverges from the target value for the Bucky ball. Figure 4.13 shows the two
functions on a single graph with dual vertical scales by generations. The rapid change in
energy near generation 211 shows the inverse relationship between the distance and energy
objective functions. Eventually the two converge as the GA ﬁnishes. Large improvements

in the distance function can have detrimental changes on the energy values of the molecules.

138

 

'r'LJlﬂﬁnev

 

81 101 121 141 161 181 201 221i

 

 

 

 

Figure 4.12: Ideal Bucky ball energy

This does not have a minimum energy conﬁguration. Structure search methods that rely

only on ﬁnding minimum energy would have a problem with this molecule.

In Figure 4.14 the objective function and the LJ energy levels track in general. There are
still the signiﬁcant differences over the course of the search. Once the molecule approaches
the equilibrium point both the energy and the objective function are well behaved and both

converge as expected.

Figure 4.15 shows the histogram of solution times by generation for seventy trials. The
failure rate for ﬁnding solutions was 18%. The number of iterations for the local search
was eight and the population was ninety members. The bulk of the runs converged in under

ﬁve hundred generations.

139

60

 

'e—Ehergy“ 1
—0blect|ve 5901:1190 .1

40

 

 

 

-60

 

.80 ,

 

 

 

 

 

 

 

   

i T I
0 i ___.__m-w____._..____ 7|
- -LJ energy. I 6
l “—oblective i |
i l 5
.501, 1
l 1 1| '
i \\_/ l 4,, 4
i i| .
l i
' l
I '2
-150 i :
I I. 1
' I
I
-m L4!" 1"!- —‘ '19-,- 9—1'9 *q—p 7— ”-9— i-i-r-P-q-I—t—e-r - P'- -P -, ~—;§-§ —‘ o

1611162126313641465156616671

Figure 4.14: Combined energy and distance objective N38

140

35
Generations for solution
30 ' ' ,, -‘ ‘ '
25

20

15 -- ~

 
 

10~

 

<301 >300 and < 501 >500 and <1001 >1000

 

Figure 4.15: Ideal Bucky ball single-level runs

4.3 N on-Ideal Distances

Real data sets for the target distances will have a certain amount of error. This error can
manifest itself as uncertainty in the distances, missing distances, and other measurement
issues. The initial research used an ideal Bucky ball as the target molecule because of its
unique nature and size. A set of experimental data was available for the Bucky ball that
would demonstrate the ability of the algorithm to ﬁnd good structures from real data. These
structures, by the very nature of the errors in the data set, cannot generate a perfect Bucky
ball. In addition, several molecules were distorted using random mutations to the atom
positions to simulate errors in measurement. These distorted molecules were then used
as targets in the experiments. The sixty-atom molecules deﬁnitely remove the concerns
associated with very small molecules and limited search spaces, both for the ideal and real
data sets. There are a number of search algorithms that have reported good successes with

small molecules only to have signiﬁcant problems with larger ones. Two different non-

141

ideal distance sets were used in this work. The ﬁrst was the experimental data for a Bucky

ball and the second was a set of artiﬁcially perturbed atom locations for a N38 molecule.

4.3.1 Bucky Ball Experimental Data Set Target

The goal of this portion of the experimentation was to take pure measurement data and
determine the structure of the molecule. Pure in this sense means without human or algo-
rithmic assignment of distance to atom pairs. In the case of the Bucky ball experimental
data set, there is a base error in the data that can be measured compared to the ideal Bucky
ball. This base error determines the limiting performance on any algorithm that is applied.
First, the real experimental data has a raw distance ﬁtness function using the ideal as the
target of 41.17, which represents a very high mismatch. That means, using an ideal Bucky
ball as the target data, the real data has a raw objective value of 41.17 , where a value of
less than .1 is considered a good match. Second, the overall set of distances have an accu-
mulated error of —89.95 in total sum of lengths. The amazing part is that the energy of the
ideal Bucky ball is —41.8 and the real data has a potential energy of —44.1. Considering
the differences with the distance data, this small difference in potential energy underscores
the impact of the smaller distances on energy while the larger distances can have a major
effect on overall shape. Figure 4.16 shows the differences in the 1770 distances between
the two target data sets. With any real data set, it will be impossible to reduce the distance
ﬁtness to zero because of the inherent errors in the data set.

How the two molecules can have similar energy but signiﬁcant differences in the dis-
tance is shown in the Figure 4.17. This ﬁgure shows the differences for distances on the
order of bond length. They are very small, and the distance errors that contribute to the
large mismatch and the total occur for the middle distances. This accounts for the similar
energies but signiﬁcant distance errors, because the longer distance errors don’t contribute
as signiﬁcantly to the energy. These distances do play a major role in the overall shape.

A typical good result for the real distance data run is shown in the Figure 4.18. From

142

20

 

 

 

 

 

F’Aceurnulated Distance

Error... ,
-20
.40
-60
-80
-100

assessessgggggggg

Figure 4.16: Comparison of ideal and experimental distances for Bucky ball

20

 

—Aocumilated Distance Error

 

 

 

 

1 01
201
301
401
501
601
701
801
901
1 001
1 1 01
1 201
1 301
1 401
1 501
1 601
1 701

Figure 4.17: Accumulated distance error Bucky ball data set

143

a visual inspection, it appears to be a perfect Bucky ball. The small difference in energy
values is based on minor variations in the bond lengths. The shape when compared to the
ideal Bucky ball has some minor distortions. The diameter of the ball varies and this leads
to the inherent error in the distance data. The near distances as shown in the Figure 4.17

are almost perfect, so the energy values are close.

 

Figure 4.18: Good Real data Bucky ball Results

Figure 4.19 shows a typical defect in a real data Bucky ball result. There is one atom
that is misplaced. It is interesting to note that the actual raw ﬁtness error is 0.0573, which
is very close to the target, yet the molecule is obviously not perfect.

One hundred trials with the real Bucky ball data were conducted. Examination of the

resulting molecules shows that over eighty-three of them had less than four defects. There

144

 

Figure 4.19: Typical Defect for Real Data Bucky ball result

were eleven of the resulting molecules that had perfect shape and symmetry with no sig-
niﬁcant defects. Each of these trials were conducted for one thousand generations or ter-
minated when the raw ﬁtness was low enough. The population consisted of 260 molecules

with a local search iteration count of four.

Given the noisy real data, the perfect Bucky ball results were unexpected. This range
of actual results with close target raw ﬁtness shows the robust nature of the distance data.
The energy values are extremely close to those for a perfect Bucky ball and are very close

to the calculation of energy for the target distance table.

145

 

 

raw ﬁtness energy

average 0.2285 -43.50
max 1 .0541 41 .76
min 0.0422 -46.68
stdev 0.2391 1 .29

 

 

 

 

 

 

 

 

 

Table 4.2: Summary of 100 Trials Bucky Ball With Real Data

4.3.2 Artiﬁcially Degraded Data

Two sets of degraded data were used. The ﬁrst applied a random variation to each of the
distances in an ideal data set. This method produces a set of data with an inherent error
such that no molecule could match the data set. The runs would never quite reduce the
distance error to low values, but still converged to molecules that had a high degree of
similarity to the original. The second molecule used was derived by taking a N38 Lennard-
Jones minimum energy molecular structure and randomly moving one third of the atoms
a random distance. The resulting molecule was then used as the target set of distances.
While this molecule could be realized, it still took longer to converge. Where the normal
ideal N38 would converge in a few hundred generations, this degraded molecule would

take three to four times the average number of generations.

4.4 Parameter Selection Searches

The number of parameters that are used by the current algorithm are varied and numer-
ous. The basic parameters associated with genetic algorithms include the probability of
crossover, mutation, and selection along with all those used by the local search. Because
the program was used to compare a variety of methods, the parameters include many mode
settings. These modes include population models such as (p, A), permutations of parents to
construct children, crossover modes, selection modes, and many other variations of basic
algorithm operation. The following series of tables lists most of the algorithm parameter

settings with a brief statement of use. Table 4.3 has most of the mode settings and integer

146

 

 

 

 

 

 

f—Avocsmi was

400 .
$300 + -
>
<

200 --

100

0 . , . 4 . .
20 40 50 60 70 80100120140160180200220240260280
Population

Figure 4.20: Population size effect on average generations to ﬁnd solutions N 20

parameter settings that control the algorithm. These are continued in Table 4.4 with the
counts that are used to calculate the Operator effectiveness. Table 4.5 contains the real-
valued control parameters including the probabilities for operation selection. Last, Table

4.6, holds some of the string constants that are used by the program to specify ﬁle names.

4.4.1 Population Size

The ﬁrst series of parameter setting runs looked at the effect of population size on the ability
to ﬁnd solutions. Population size has an impact on the overall cost of computation. Figure
4.20 shows how the population size affects the average number of generations required to
ﬁnd a solution. It should be noted that, for smaller population sizes, the algorithm failed to
ﬁnd solutions a high percentage of the time, and runs were terminated at 2000 generations.
This was also used in the calculation of the averages. It is clear that, once past a population
size of 110 molecules, the process saw little reduction in failure rate.

The boundary for the size at which the failure decreased is very evident in Figure 4.21.

147

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

POPN 25 Populations size

SELMODE 1 Mode of selection tournament

TOURNSIZE 2 tournament size 1

STARTMODE 0 modes 0-5 0-random initialization 5-seed
molecules

TARGETMODE 0 O-ideal molecule l-only distances

POPS 20 How many seed molecules to make room
for

ELITECNT 1 How many elite to copy each generation

LAMDACNT 0 A, u mode A count

MUCNT 5 A, p. mode ,0 count

DUMPMD 0 dump ﬁle mode

KEEPMD 1 1 keep all offspring 0- only keep better

ISPLT 19 count of atoms for split

TARGETM 0 Type of target ﬁle

MUT2CNT 4 how many atoms to relocated

MUTlCNT 30 how many atoms to randomly move

MAXGEN 7000 Maximum generations to run

DIVERCNT 10 How many molecules to compare for di-
versity

RANDSEED -1 -1 random seed else a positive seed num-
ber

ACTUALGEN 0 Actual number of generations run

MU LAMMD 0 A mode

PERMUMD 2 permutation mode for [1 parents

NICHMD 2 mode for niche control

VSINTCNT 8 count of iterations for local search

NATOMS 38 total number of atoms

VSACTCNT 0 actual number of local search iterations
per generation

CNTBEST 0 Number of new best molecule

CROSSMD 0 Cross over mode 0— spatial 1- N point

CROSSMDX 0 Number of cut points for N point

FITMODE 0 select between distance and energy

ITERCNT 1 Number of iterations to perform for sta-
tistical purposes

BORECMD 0

RANGER 5 division factor restricts random start of
distances LS

RAN GEA 5 division factor restricts size of random

 

 

range LS

 

Table 4.3: GA operating parameters modes and sizes

148

 

 

 

 

 

 

 

 

 

 

 

 

 

OPCNTO 0 Total number of reproduction

OPCNTl 0 Total number of crossover

OPCNT2 0 Total number of simple mutate
OPCNT3 0 Total number of relocation mutate
OPCNTKO 0 Total number of reproductions better
OPCNTKl 0 Total number of crossover better
OPCNTKZ 0 Total number of simple mutate better
OPCNTK3 0 Total number of relocation mutate better

 

 

Table 4.4: GA operation counters

The increase from twenty to ﬁfty molecules in the population cuts the failure rate by a

factor of ﬁve. The failures don’t reduce to zero until the population size reaches over one

hundred.

Ultimately, the computational cost needs to be minimized and larger populations ad-
versely affect computational cost. Figure 4.22 shows the relative number of evaluations as
a function of population size. It is clear that the most cost-effective solutions are found

between population sizes 80 and 110 .

4.4.2 Operator Probabilities

The three basic GA operators are crossover, simple mutation, and relocation mutation. All
three have a probability of occurrence which, when summed with reproduction, will equal
one. In a typical set of parameters, crossover will occur 70% of the time, simple mutations
10% of the time and relocation mutation about 15% of the time. Simple reproduction makes
up the remaining percentage of operations. The mutation operations also have a count
associated with them. This count represents the number of repetitions that each operation
is performed. In addition, the simple mutation has a magnitude associated with it that is a
function of the raw ﬁtness. The local search is applied to results of all operations including

simple reproduction.

149

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

 

PROB_REP 0 1 — PI — Pml — Pm2

PROB_X 0.7 probability that crossover will occur

PROB_M1 0.1 probability that simple mutation will oc-
cur

PROB_M2 0.15 probability that relocation mutation will
occur

MAXFIT 10040 if scale ﬁtness exceed this stop

MAGMUTRT 4 Magnitude of mutation based on
ﬁt/MAGMUTRT

MAGMUT 0.6 Base magnitude of mutation

DIVERAVG 0 Average diversity of samples

SUMWLIM 0.001 Limit on local search total summed error

F8] 0

FS2 0

FS3 0

F84 5

FSS 36

FS6 0

F87 0

F88 0

PRANDVS 0.8 probability of restricted range local
search

MINDISI 1.09733 Minimum distance over ride

ENERGY 0 Actual energy

FIT 0 Current raw ﬁtness

SUMWFAC 0.09 Convergence factor for Local search

REPDF 0 Sum of improvement for all molecules
better than parent

CROSSDF 0 Sum of improvement for all molecules
better than both parents

MUTIDF 0 Sum of improvement for all molecules
better than parent

MUTZDF 0 Sum of improvement for all molecules

 

 

better than parent

 

Table 4.5: GA operating ﬂoating point parameters

150

 

 

 

 

 

 

 

 

 

 

 

DUMPFILE dmp.txt Name of ﬁle to dump data structure

PARMFILE n381jset.txt parameter setting from this ﬁle * par-
mdef.txt default*

BESTOF bo.txt sufﬁx added to best of molecule ﬁle

RUNLOG rl.txt sufﬁx added run log ﬁle

TARGETF n38idealx.txt Target ﬁle with information

FILEPREF n38a/c_000 File preﬁx to store results

SEEDFILE slista.txt File with list of seeds best of molecules
ﬁle names

TITLE N38 text for report

 

 

 

Table 4.6: GA operating ﬁle names preﬁx and sufﬁx

 

60

 

 

I Failed

 

 

20 40 50 60 70 80100120140160180200220240260280

Pop size

Figure 4.21: Population size effect on number of failed runs N20

151

 

14000

-0- Evaluations

12000

10000

 

 

 

 

20 40 50 80 70 80 100120140160180200220240260280
Population

Figure 4.22: Population size effect on total average evaluations on N20

Crossover Probability Parameter Effect

The advantages to the spatial crossover can be demonstrated by the next two graphs. In
Figure 4.23, the number of failed attempts to ﬁnd a solution to N20 is signiﬁcantly affected
by the crossover probability. The rate of failures decreases as the percentage of crossover
operations increases. Once the number of crossover operations exceed 60%, the failure
rate goes to zero for the N20 molecules. The impact on the average generations is also
interesting. In Figure 4.24, the average generations to a solution declines even more steeply
than the solution failure rate as a function of crossover probability. The failed average
trials were removed from the graph showing average generations to a solution in this result.
Most of the average generations for a solution results include the failed runs with a penalty
in terms of generations. For example, if the GA terminated at 1500 generations then an

additional penalty of 500 generations was added when calculating the average for all 100
GA runs.

152

 

new A 1

count
3

10‘

I -
05 06

0.65 0.7 0.75 0.8 0.84
Probability Crossover

 

 

 

Figure 4.23: N20 crossover probability verse failed solution

Crossover is normally thought to be very disruptive. Interestingly, in this case, it is
clearly a major positive inﬂuence. There is a slight increase in the average number of
generations as the percentage of crossover continues to increase. This increase occurs
primarily because the number of mutations has to be decreased for the given algorithm.
The total of all mutation, crossovers, and reproduction probabilities must total 1.0. This

constraint could be eliminated if serial probabilities of operator selection were used instead.

Mutation Probability Parameter Effect

Simple mutation has three factors; the probability of application, number of repetitions
applied if chosen, and the magnitude of each random move. Relocation mutation has two
parameters. As with simple mutation it has a probability of being selected for application
and the number of repetitions. There is no magnitude parameter for relocation mutation
because, when the atom is moved to a location relative to a chosen existing atom, it is

placed at minimum distance from that atom.

153

 

700 _ . .. _ .. .. M. "MM- W. -

 

 

 

 

 

 

 

600
500 .
g 400
E 300
200
100
o
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Probability oi crossover
Figure 4.24: Average generations to solution N20
46
-_0-Averge Generations
45 ._ ,.,WW,W-. ,-__.......--._ L

t

 

Average generations
5 8

 

 

 

0.01 0.05 0.1 0.15 0.2
Probability of simple mutation

Figure 4.25: Effect of simple mutation on average solution generations N20

154

What is interesting about both of these mutations is that they can adversely affect the
average solution time. In Figure 4.25, the probability of mutation is varied from 0.01 to
0.2. Ranges beyond this were not tried because the crossover probability would have to be
reduced. From experiments, the average solution times increased substantially and would
have been the major inﬂuence masking the effect of mutation on this algorithm. From
these results, the inﬂuence of mutation is limited. Even after repeated trials of 100 runs
the results were of limited impact. Elimination of the simple mutation prevented complete
convergence. This indicates some small amount of mutation was required to assist the local

search near the solution and to prevent it being stuck in a shallow minimum.

Figure 4.26 shows the effect of the relocation mutation on the average number of gen-
erations to ﬁnd a solution. This mutation clearly has a detrimental effect on average gen-
erations to solution. These parameter sweep studies were completed for the N20 Lennard-
Jones molecule, which is small enough that just about any set of parameters will work.
Similar to the simple mutation, total elimination of the mutation, increased the number of
failures from zero to over 20%. At the N38 level, the elimination of the relocation mu-
tation caused failures to increase 50%. The trend of increased average solution time with

increased probability rate was similar to that shown for the N20 molecule.

Each of the mutations, once selected, will repeatedly mutate the molecule some num-
ber of times. This repetition count also has a random selection associated with it. If the
repetition count is set at eight, the number of actual repetitions can be anywhere from one
to eight. Several one-hundred-trial runs were completed and they all had similar features,
as shown in Figure 4.27. The impact on the average solution degrades as the number of
repetitions increases. Each of the trials for the N20 molecule showed a dip after eight repe—
titions, then a steady decline again. This is consistent with the work by Hartke [27], where
they report that mutation applied to 48% of the atoms in a molecule leads to better solution

times.

155

350

 

tAxmqeam ’ tions
300 .. _ ...... .. .... “W" “MN”.-.,.,_,__,.__.w.._..__.... . _.

150

average generations

100

 

50 -

 

 

 

o 0.05 0.1 0.15 0.2 0.25
Probability or Mutation 2

Figure 4.26: Effect of relocation mutation on average solution generations N20

 

 

Average generations
h
M

 

 

 

 

0 2 4 6 8 10 12 14 16
Count of mutations

Figure 4.27: Effect of relocation mutation repetition on average solution generations N20

156

4.4.3 Local Search Parameter Effect

The local search has many parameters that affect its operation. First is the number of it-
erations performed each time. The second is the convergence factor that controls the ratio
of the movement applied to each atom, based on the error vector calculated for each atom.
In addition to these parameters, there is a limit value that terminates the iterations if the
total summed error of the magnitude for all the vectors gets below a set limit. Based on a
probability, the ranges of distances used can be restricted by two parameters. These param—
eters are used to select a subrange of distances used in the local search vector correction

calculations.

The computation cost is signiﬁcantly affected by the number of iterations performed on
each molecule. Calculational cost is the total number of molecule evaluations performed.
The cost is used in the graphs that show average evaluations to ﬁnd a solution are based on

a ﬁx size of molecule.

cost = (population size)(local search repetition) (average generations) (4.1)

Figure 4.28 shows a steady increase in cost as the number of iterations increases. The
parameter ”vsintcnt” is maximum number of local search iterations that can be applied to
each offspring. It should be noted that there is not a signiﬁcant reduction in the number
of average generations needed to ﬁnd the solution. The number of evaluations is simply
the product of the average number of generations and the number of iterations on each

molecule for a given population size.

The outcome of the study shows that just about any reasonable number of iteration
steps will work and that the impact on computation cost implies that smaller the better.
However, not using local search will cause a signiﬁcant number of failed solutions. Two or

more iterations will result in complete convergence.

157

 

 

-0- Evaluations

 

 

 

 

0481216202428323640444852566064
vslntcnt

Figure 4.28: Average evaluations to ﬁnd solution N20

As already demonstrated, a large convergence factor can have a negative effect if it is so
large that instability results. Figure 4.29 shows the number of failures for each convergence
factor from .05 up to .21. As can be seen, the instability has a signiﬁcant rise after .2.
Second, Figure 4.30 shows the effect on the average number of generations needed to ﬁnd
a solution. In this Figure a mild downward trend is shown as the factor increases. Then
there is a signiﬁcant rise when the critical point is reached. The minimum occurs when the
convergence factor is about 80% of the critical value for a convergence factor where the
failures increase rapidly. This is also a function of the square of the number of atoms. Each

of these distances has the potential to contribute to the error correction vector.

4.5 Classic Two-Point Crossover

In this set of experiments, the spatial crossover method was substituted with a standard

two-point crossover. The only parameter that was changed was the method of crossover.

158

f'.

 

‘

 

 

 

 

‘2 .ll-Ll_-l_-l-l~lal.11-lraW-_lml

0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21
SF

Figure 4.29: Effect of convergence step size on failed solutions N20

 

 

 

 

 

0.05 0.06 0.07 0.08 0.09 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21
SF

Figure 4.30: Effect of convergence step size on average generations to ﬁnd solution N 20

159

 

N20 Two-Point Spatial
Failures 54 0
Average 491.3 35.65

 

 

 

 

 

 

 

Table 4.7: Solution comparison of standard and spatial crossover for N 20 Lennard-Jones
molecules

 

N38 Two-Point Spatial
Failures 100 4
Average NA 306.

 

 

 

 

 

 

 

Table 4.8: Solution comparison of standard and spatial crossover for N38 Lennard-Jones
molecules

The population size, operator probabilities, and iteration counts were all kept the same. The
local search was also used with the same parameters. When spatial crossover was replaced
with the two-point crossover method, the ability to ﬁnd solutions was severely degraded.
In the N20 molecules, the algorithm only found 44 solutions in a 100-trial series. The
comparison of the spatial crossover to the standard two-point crossover is summarized in
the Table 4.7. The average generations to ﬁnd a solution for a successful run increased by
over a factor of ten.

Another set of experiments was conducted on the N38 Lennard-Jones molecules and
the results were dramatically worse. In this case, 100% of the trials failed to ﬁnd a solution.
Expecting that several other GA parameters beside the crossover were having a signiﬁcant
negative impact, a series of parameter sweeps was conducted to discover better parameters.
In all cases, there was either no change or further degradation of the search. Based on this,

only the N20 and N 38 series of runs using the expected optimum parameter settings are

included.

4.6 Operator Effectiveness

Operator effectiveness provides a key insight into the operation of the spatial crossover and

relocation mutation. Effectiveness is measured by counting the number of operations that

160

produce a better offspring and dividing by the total count of that type of operation in a
generation. This is tracked separately by the style of GA operation such as crossover and
mutation. The curious behavior of the crossover ratio was especially interesting. In Figure
4.31, the ratio starts out very high, indicating that just about any two halves selected have
a high probability of increasing the good subgraph count. The interesting point is when
the ratio effectively goes to zero and then seems to level off with only slight rises and dips
around when signiﬁcant changes in ﬁtness occur. The explanation for the near-zero point is
that it is an inﬂection point. It is at this point that the average ﬁtness has improved such that
the probability of creating a better molecule through crossover given the current population
is very small. This represents a peak of the average ﬁtness in the short run and it occurs
after only a few generations. This number of generations is very close to the convergence of
a pure tournament selection process, typically around six to ﬁfteen, generations depending
on the population size. The disruptive effects of both the mutations and crossover actually
degrade the average ﬁtness so that a steady-state ratio of successful to total crossovers will
be expected. This ratio remains nearly constant for the remainder of the runs. Figure 4.3]
shows the ratio at every generation and a smoothed ﬁve generation average. The dip to zero
at generation six is exactly what would be expected for the small population size using only
tournament selection. The steady ratio of 0.1 has a steady decline for the remainder of the
run. This decline is expected by the fact that average ﬁtness increases and the chance
of having a crossover operation that improves ﬁtness becomes less and less. The N20

molecule used for this ﬁgure was typical of all the molecules explored.

The next two graphs have the objective value plotted along with the effectiveness ratio
for simple mutation and for relocation mutation. While the average number of generations
for a solution does not seem to correlate to the mutation probability and repetition counts,
the tracking of objective value changes is clearly tied to mutation events. By plotting the
objective function and mutation effectiveness ratio, it can be seen that those generations that

produce good mutations are key to the overall solution process. In Figure 4.32, the high

161

 

 

 

‘44emectwms‘;

0.9 .-
0.8 -

0.7 a ,
0.6

l
0.5 -~ 1

ratio

0.4
as l , i

 

 

 

1 6 11 16 21 26 31
Generation

Figure 4.31: Crossover effectiveness by generation

ratio of simple mutations near generation 18 directly corresponds to a desired decrease in
the objective function.

The relocation mutation in Figure 4.33 has, in general, much lower ratios. What is crit-
ical is that those generations that had high ratios show how those events had a major impact
on the ﬁtness. From the ﬁgure, events at generation 7, 20 and 31 show how those mutations
lead to signiﬁcant objective value drops over the next few generations and ultimately to the

solution.

4.7 Scalability

The molecules N12, N20, N38 and Bucky ball were all solved using single-level runs and
they sealed in an acceptable fashion. The N60 Lennard-Jones molecular cluster problems
were solved in a single-level run but had a high failure rate and took substantially longer to

solve. Because of this, only a limited number of trials were conducted for this molecule. In

162

 

 

142‘

P
as

Objective Function

9
.5

0.2

1.4

1.2

Objective
0
lib

0.2 .

 

 

 

+Ob|ectlv H evEunc‘tlon'
‘ : -0—Slmple Mutation

 

 

 

16 21 26

Generations

Figure 4.32: Simple mutation effectiveness by generation

31

 

==i—1>
;
i:

 

 

 

AAAAA

 

 

 

 

Figure 4.33: Relocation mutation effectiveness by generation

163

 

0.9

0.8

P
at

no
bin

9 O P
d lo a
Effectiveness ratio

O

 

 

 

 

 

 

 

 

 

 

 

N12 N20 N38 N60 Bucky Bucky R
Failures 0 0 4 7 l3 5
Generations 8.29 35.65 65.3 3212.5 762.2 1000
Population 18 120 90 90 90 260
LSINT 5 8 18 1 8 8 8
Size 66 190 703 1770 1770 1770
Evals. .74k 34.22k 105.78k 12,403.44k 548.78k 2,080k
Cost 4.92E+04 6.50E+06 7.44E+07 2.20E+10 9.71E+08 3.68E+09

 

 

 

 

 

 

 

 

 

Table 4.9: Scalability for various size molecules

the earlier work on the N60 and the Bucky ball, two-level trials were conducted. In these
trials, a series of between ten and one hundred GA runs were conducted with a high ﬁtness
and low generation termination point that was well short of a solution. The molecules were
still reasonably close in basic shape to the desired solutions. These terminated runs were
then used as seed material for a second-level run. Using this approach consistent good
solutions were found for the N60 Lennard-Jones, and occasionally the N88 Lennard-Jones

molecules.

Table 4.9 shows the summary of the results for the various trials. The signiﬁcant in-
crease between the N20 and N38 shows major problems with the length of run versus
molecule size. The near ten-fold increase for a less than two—fold increase in size is not
encouraging. The Bucky ball with both the ideal and the real data, provided good results.
Even those that were classiﬁed as failures still had a structure that was clearly a Bucky ball
with defects. The real problem was the N60 molecule. A two-level run was conducted.
The ﬁrst-level run had a population of sixty molecules and was terminated at ﬁve hundred
generations. The second-level run selected twenty ﬁrst-level run molecules and used them

as seeds, ﬁlling in the population until the size requirement was fulﬁlled.

The column labeled Bucky R represent the trials where the real experimental data was
used as the target. ”LSINT” represents the number of iterations that the local search used on
each molecule. The row labeled ”Evals.” is the total number of function evaluations on the

average per solution. The last row, called ”Cost”, represents the total cost of computation

164

that also factors in the computation of objective function evaluations based on the size of
the molecules in atoms.
The number of function evaluations in terms of N for the ideal Bucky ball shows scaling

on the order of 9(N3'2).

4.8 Discussion

The four main areas that were documented by these results include the comparison between
crossover methods, population size, operator effectiveness and scalability. In each of these

areas a number of trials were conducted.

4.8.1 Comparison of Classic vs. Spatial Operators

This was by far the simplest of the experiments to conduct. In these experiments, the
two-point crossover was substituted for the spatial crossover method. All other parameters
were kept the same. The degradation in performance was signiﬁcant. There has been
prior research that has used traditional crossover methods for small molecules under twenty
atoms and shown reasonable results [58]. Few of these have worked with larger molecules.
The results presented here show that the difﬁculty of the larger molecules rises signiﬁcantly
after forty atoms. The most successful GA applications to the cluster molecules have used
the spatial crossover method with extremely high mutation rates for simple mutation [28]. It
is clear from the N38 molecule that the traditional crossover is far too disruptive to achieve

any solutions.

4.8.2 Population Size: Theory vs. Actual

The basic theory of population size, based entirely on the likelihood of creating small
building blocks by random atom placement, provides at best a very high upper bound.

The initialization process, by its very nature, provides for an order of magnitude more

165

 

 

useful subgraphs than simple probability would predict. The initialization method correctly
assumes that each atom will be one minimum distance from at least one other atom and that
each atom will not violate some minimum distance. This, combined with the local search,
yields a population with far more potentially useful subgraphs than would be achieved with
random placement. Further, the inﬂuence of the distance-based local search also increases
the number of good subgraphs. In all the trials where the population was varied, there
appear to be a dip in the average generations required to solve for the structure.

The increase in the population size actually reduced the ability to ﬁnd the solution in
a smaller number of generations. This is counter-intuitive, given the abundance of genetic
building blocks that larger populations would provide. It take longer to assemble them
into a single molecule. In the algorithm based on energy search by Hartke [28], a forced
mixing of all the parents occurs, insuring that the potential new material always gets a
chance to create a better offspring. In the algorithm used here, the tournament selection
for crossover, mutation and reproduction will tend to reduce the chances for participation
of a poor-ﬁtness molecule even if it contains highly needed subgraphs. With the smaller
population, the chance of competing against itself is higher and therefore more likely to
occur. Then, even though the mutation is needed, it will not get the chance to mix with the
better-ﬁtness molecules.

From the population sweep studies, the average number of generations to ﬁnd a solution
dips to some minimum and then rises. This supports the need for a sufﬁcient number of
good subgraphs that can be mixed with the crossover operator and at the same time shows
the detrimental effect that too many competitors can have on selecting good material with

lower ﬁtness.

4.8.3 Operator Effectiveness

Examination of operator effectiveness during a run provided some unexpected results.

These results were consistent for nearly all the trials. The crossover operator effective-

166

 

 

ness was the most interesting, especially the initial high likelihood of creating a better
offspring and the dip to near zero chance of creating a better offspring. This rate then
leveled off and represented the probability of making a better offspring from an average
degraded population created by the constant disruption of mutations and crossover.

The behavior of the mutation effectiveness was more expected. It was not possible to
increase the probability of creating a better offspring by manipulating the repetition or the
chance of mutation operation. The overall chance of creating a better offspring was con-
sistently low after the initial few generations of a run. Simple mutation can be indirectly
connected to early changes in the ﬁtness by spikes in the ratio, while the relocation muta-
tion can be indirectly connected to signiﬁcant improvement of ﬁtness later in the run and

ultimately leading to the solution.

4.8.4 Scalability: Theory vs. Actual

In the theory section, the cost of calculating the ﬁtness function was shown to be 0(N2).
The estimated cost to ﬁx defects was expected to be the Z 0(N2) with a corresponding
increase in population size. The ability of the algorithm to ﬁnd effective solutions to small
molecules was very promising. The cost for ﬁnding the solution to larger molecules indi-
cates a need to focus the mutation operations on atoms that have a higher chance of im-
proving the ﬁtness. Even using the two-level method of solution for the larger molecules,

the overall scalability remains approximately 0(N3'7).

4.9 Summary

In this chapter a variety of experimental results were reported with molecules ranging from
twelve atoms to sixty atoms. The results were provided in a variety of forms with speciﬁc
focus on individual aspects. The average number of generations to ﬁnd a solution was the

key indicator used for parameter selection searches. In the parameter searches involving

167

varying population sizes the number of evaluations were used so that the larger popula-
tion sizes were properly compared. In these parameter searches, an operating point was
located giving good performance for the overall search. Included in this chapter was the
comparison of the standard two-point crossover to the spatial crossover. In this compari-
son the two-point crossover performed very poorly and in the larger molecules, failed to
ﬁnd solutions completely. The operator effectiveness provides additional understanding of
the mechanisms at work in the GA operation. The role of crossover in distributing and
combining genetic material provided by the mutation operators proved to be critical to the
operation of the GA. The ratio of crossover to mutation for these algorithms is very high
to allow time for the new material developed by mutation to migrate among members of
the population. This supports the previous work, where all permutations of molecules are
tried with only the best kept for each generation. This is exactly the kind of mechanism
demonstrated in this algorithm.

The scalability of the algorithm is limited by the probability of creating new genetic
material that is needed to complete the molecule. The local search does not overcome the
limit in topology of the molecule. That is, the local search will not alter the topology, one
of the required tasks for escaping a local minimum. The mutation operators are critical for
providing this alteration. The limiting factor for the scalability is the probability of cor-
recting the defects in the molecule. With pure random selection of atoms, this probability

becomes much smaller as the size of the atoms increases.

168

Chapter 5

Conclusions

The original hypothesis for the improved performance of the spatial operators has been
clearly shown. The theory to support the performance of the spatial operators is consistent
with the performance. It is also clear that all the various aspects of the GA algorithm are
strongly interrelated. The absence of any one operator will have a strong negative impact on
performance. The use of an objective function based only on distance allows for unbiased
searches with experimental data as the target. Each of these conclusions will be described
along with future research work. The scalability for larger molecules, and possibly proteins,
will have to be addressed. In addition, the limited work on the multi-level algorithm shows
that hierarchal GA populations need to be explored and may be critical for the challenges
that the larger molecules will produce.

A constant question about GA algorithms is the genealogy of the solutions. This par-
ticular problem can provide a rich environment that could easily be recorded and used to

trace the genealogy of the solutions.

5.1 Spatial Operators

The two GA operators, spatial crossover and relocation mutation, have been shown to have

an advantageous performance over standard two-point crossover and simple mutation. The

169

interrelationship between the spatial operators and the local search is not separable. Re-
moval of any operational component makes the problem much more difﬁcult and requires
multiple levels of GA trials to achieve a solution. The spatial crossover’s advantage is de-
rived from its preservation of the small subgraphs that are key to calculation of good ﬁtness
values. The relocation mutation provides a more effective way to explore different molecule
topologies. Each relocation signiﬁcantly alters the arrangement of atoms. From examina-
tion of the failed results for many trials, it appears that the algorithm achieves near-perfect
molecules with only a few atoms in wrong locations. Relocation mutation addressed this
problem because otherwise it would take massive simple mutations to explore the same
space. Worse yet, the massive mutations would destroy many of the good subgraphs at the
same time and, even if the newly mutated atom had good subgraphs, they might not survive
the competition to reproduce, mutate or crossover. Relocation mutation preserves nearly

all the good subgraphs and allows for the distribution of new genetic material.

The tracking of jumps in the ﬁtness value can be seen to correspond to increases in the
operator effectiveness for the mutations. This alignment is not a direct result, but an indirect
one. The events that lead to the high ratio of mutation effectiveness take several generations
to distribute among the population and, in doing so, ultimately lead to the improvement in
ﬁtness value. This concept adds support to the use of permutations for all members of a
small a population to create the A population. It is interesting to note that the size of the A
population is roughly the same as the constant population used in this work. The number

of evaluations are similar with the energy-based search methods.

5.2 Interrelationship

In nearly all the trails run, the elimination of any single aspect caused the failure rate to
jump signiﬁcantly. In the case of standard crossover, the algorithm failed 100% to ﬁnd a

solution for N38 molecules. For the spatial operators, the elimination of any of the oper-

170

 

 

ators caused signiﬁcant degradation. If the the local search was eliminated, no solutions
were found using a single-level run. Even as the parameters for each of the operators var-
ied, no one parameter had a major effect on the overall performance. When the mutation
probability or the repetition counts were set to zero, the algorithms would take an order of
magnitude longer to ﬁnd a solution, if they found one at all. When the crossover operator
probability was reduced, no solutions were found. The removal of the relocation operator
was the only one that still allowed for solutions, but they took ﬁve to ten times as many

generations.

5.3 Objective Functions

As the distance-based objective function converges, the energy function also converges.
From the results section, it is clear that the two functions are not at all consistent with
each other. The best-of—generation molecule provided a steady progression of distance
objective values while the corresponding energy values were sporadic jumping above and
below the eventual asymptotic value. In the limit, for any given molecule the energy and
distance objective functions are both continuous. The tracking of the two demonstrates the
inﬂuence on longer distances for the distance objective function, while these distances for
the energy function are not as important. This is a signiﬁcant observation. In the ﬁnal stages

of convergence, the two objective functions both smoothly approach the desired values.

This effect leads to the following hypothesis. The distance objective function will more
likely produce a grossly correct shape of the molecule while the energy function will pro-
duce more correct close atom distances and sacriﬁce overall shape. That is, the shape may
be distorted and, in some cases, atoms may be in the wrong locations. However, this loca-
tion will be far from the other atoms, such that an energy minimum is found. In contrast,
with the distance objective function, if an atom is in the wrong general location, the effect

will produce an error contribution on the order of N * factor, which will tend to magnify

171

 

 

the placement error. At the same time, the local search will require many iterations and can

become stuck more easily by large distance errors.

5.4 Conﬁrmation of the Hypothesis

A set of genetic algorithm operators based on spatial location concepts provides improved
performance for a class of NP-hard search problems in N-dimensional spaces. Spatial
crossover has been shown to outperform standard crossover for the molecular conformation
problem, which is categorized as NP-hard, with homogeneous vector based solution. For
larger molecules, the difference in performance was substantial. The relocation crossover
demonstrated the ability to reduce search times by more effectively exploring the search
space. Both of these operators are based on the premise that the individual genes are part of
an N-dimensional space and that the preservation of the spatial relationships among genes
is essential to insuring the propagation of good ﬁtness values. These spatial operators can
by shown by analysis to preserve those relationships and the experimental results presented

here support the theory.

This algorithm also used a distance-based objective function and local search. This
allows for an unbiased search, based on target data from experimental x-ray diffraction
and NMR results. The distance-based objective function allows for applications to a wide
range of problems where the only requirement is that the actual equations that describe
interactions between points or atoms conform to a continuous function with consistent
inﬂuences. The polarity of the interactions needs to be stable without repeated changes in
sign. This mapping can be highly nonlinear, as is the case for the molecular conformation

with the Lennard-J ones energy models.

172

5.5 Competitive Methods

The energy-based search provides cost effective methods for ﬁnding molecular conforma—
tion. Basin Hopping [66] , GA [6] and Simulated Annealing [37] all demonstrate effective
solutions for a wide range of molecules using energy-based searches. The Liga algorithm
[41] has combined several GA concepts with a constructive approach using competitive
layers similar to a sports league to solve the molecular conformation problem. The Liga

algorithm uses the a robust target distance set to construct the molecules.

5.6 Future Research

While this work has shown that a general concept using distance and spatial relationships
is able to provide improved performance over standard GA operators, the ability to solve
larger problems has some limitations. In the examination of the prior art, the most success-
ful algorithms add many domain-speciﬁc methods to achieve good results. These include
sorting and searching on individual operations, forcing permutations among all the most ﬁt
individuals, serial mutations of large magnitude, and geometric diversity control. This sec-
tion contains some proposed improvements that will address some of these speciﬁc areas

and also more general concepts.

5.6.1 Improved Scalability

From the theory on the reduction of the last few defects, it is clear that a better methods
need to be used rather than random point or atom selection. This process could be guided
by the magnitude of the contribution to the objective function or some derived value based
on the subgraphs that each atom is a part of. The hypothesis is that it is the repair of the last
few defects that takes nearly all the computational power. The initial gains in the ﬁrst few
generations are enormous, while the majority of the generations are required to remove the

last few defects.

173

Another signiﬁcant improvement could come from the multiple-layer sustained style of
GA. This approach was used on a batch basis for the successful attempts on the molecular
conformation problem without the local search. This has also been shown to be effective
after the addition of the local search. The process of ﬁnding the desired operating points for
the number of seed solutions needed and the quality of the seeds has a major effect on the
solution time. Using this multiple-layer approach will increase the parallelism of mixing
high ﬁtness value candidates in the higher layers. It still took over a thousand generations
at the second layer to ﬁnd the solution. Exploring this hierarchial set of populations may

signiﬁcantly help with scaling problems.

5.6.2 Adaptation to Folding Proteins

Even though the atomic structure of all molecules comes down to atom location, the ad-
ditional amounts of domain information available for proteins makes it desirable to store
the protein shape as a combination of known sequences with rotational angles. Also the
GA crossover operator needs to be based on dividing the main sequence chain. A pure
spatial dividing plane would create invalid molecules at the completion of the crossover
operator. Some combination of crossovers that use both the division of the main sequence
chain along with spatial relationships will be required to insure valid molecules and also
take advantage of the spatial relationships.

The other issue is the much larger size of the protein molecules. Even the smallest is a
few hundred atoms. This would impact the evaluation time in several ways for this algo-
rithm. First, the size of the data structures and evaluation time would increase by 0(N2).
One possible solution to this would be to reduce the data structures and the evaluations to
some limited range of distances ,BN where [3 would be on the order of three to ten. This
would eliminate the squared N growth and at the same time would keep the close proxi-
mate interactions and would actually conform more to the subgraph model presented in this

work. The other issue is the availability of experimental distances. While the smaller dis-

174

tances between atoms representing bond lengths are easier to detect using frequency-based

experiments, the large distances are difﬁcult to measure.

5.6.3 Dynamic Parameter Selection

Like nearly all applications of GA’s, the parameters that control the overall operation of the
algorithm have a tendency to grow in number and complexity. Several of the parameters
are expected to have linear relationships to the number of atoms in the molecules. From
the effectiveness ratio, trying to achieve a higher ratio of good results was not possible by
variation of a single parameter. From the theory section, the probability for improvement
can become very low at the end of the run due to the random selection process. For the
algorithm solution time to improve, the dynamic variation of parameters will need to be

explored.

5.6.4 Using Multiple Levels of GA

The concepts of distinct population islands in the GA’s algorithms have been explored ex-
tensively for optimal conﬁgurations, sizes, transfer rates and topology of interactions. The
recent work on Hierarchial Fair Competition (HFC) has been shown effective for creating
an environment for sustained evolution. In the HFC work, the number of levels of evolution
and how they interact or overlap was explored. In the middle stage of the research reported
here a two-level evolutionary environment that was a batch implementation of a limited

HFC was able to achieve good results without the addition of local search.

5.6.5 'h'acking Genealogy

One of the more interesting results was the tracking of operator effectiveness. This provided
good insight into the operation of the GA for this application. The correlations between the

high mutation ratio events to the improvement in ﬁtness was identiﬁed. The other fact

175

that was clear was the distribution of the event through the population by the crossover
operator. To this extent, the current algorithm was able to provide a rudimentary means of
inferring a limited generation or two of genealogy that led to the solution to the molecular
conformation. One of the areas for future work would be to develop a complete genealogy

tracking system that can show the various family trees that lead to a solution.

176

BIBLIOGRAPHY

177

 

[1]

[2]

[3]

[4]

[5]

[6]

[71

[8]

[9]

Bibliography

Jim Antonisse. A new interpretation of schema notation that overturns the binary
encoding constraint. In Proceedings of the 3rd International Conference on Genetic
Algorithms, pages 86—91. Morgan Kaufrnann Publishers Inc., 1989.

Jaroslaw Arabas, Jan J. Mulawka, and Jacek Pokrasniewicz. A new class of the
crossover operators for the numerical optimization. In Proceedings of the 6th Inter-
national Conference on Genetic Algorithms, pages 42—48. Morgan Kaufmann Pub-
lishers Inc., 1995.

Thomas Back. Evolutionary algorithms in theory and practice. Oxford University
Press, 1996.

Donna Bassolino-Klimas, Roberto Tejer, Stanley R. Krystek, “William Metzler, Gae-
tano T. Montelione, and Robert E. Bruccoleri. Simulated annealing with restrained
molecular dynamics using a ﬂexible restraint potential: Theory and evaluation with
simulated NMR constraints. Protein Science, 5:593—603, 1996.

Bonnie Berger, Jon Kleinberg, and Tom Leighton. Reconstructing a three-
dimensional model with arbitrary errors. In STOC ’96: Proceedings of the twenty-
eighth annual ACM symposium on theory of computing, pages 449—458. ACM Press,
1996.

B.Hartke. GEC C0-2001 : Proceedings of the Genetic and Evolutionary Computation
Conference, chapter Global geometry optimization of atomic and molecular clusters
by genetic algorithms. Morgan Kaufmann, San Francisco, 2001.

Fred Buckley and Marty Lewinter. A Friendly Introduction to Graph Theory. Prentice
Hall, Pearson Education, Inc, Upper Saddle River, New Jersey, 2003.

Thang N. Bui and Byung-Ro Moon. On multi-dimensional encoding / crossover.
International Conference on Genetic Algorithms, pages 49—55, 1995.

P. Chaudhury, S.P. Bhattacharyya, and W. A Quapp. Genetic algorithm-based tech-

nique for locating ﬁrst-order saddle point using a gradient dominated recipe. Chem.
Phys, 253:295—303, 2000.

178

[10] David I. Chu, Hunter C. Brown, and Moody T. Chu. On least squares euclidean

distance matrix approximation and completion. SIAM. J. Matrix Anal. Appl., 16:645—
654, 1995.

[11] G. M. Crippen and T. F. Havel. Distance Geometry and Molecular Conformation.
John Wiley and Sons, New York, NY, 1988.

[12] Willian David, Kenneth Shanklad, and Norman Shanklad. Routine determinaion of
molecular crystal structures from powder diffraction data. Chem Commun, pages
931—932. 1996.

[13] D. Deaven and K. Ho. Molecular geometry optimization with a genetic algorithm.
Phy. Rev. Let, 75:288-291, 1995.

[14] KA Dill, S Bromberg, K Yue, KM Fiebig, DP Yee, PD Thomas, and HS Chan. Princi-
ples of protein folding—a perspective from simple exact models. Protein Sci, 4:561—
602), April 1995.

[15] Q. Dong and Z. Wu. A geometric build-up algorithm for solving the molecular dis-
tance geometry problem with sparse distance data. Journal of Global Optimization,
26:321-333, 2003.

[16] B. Dorronsoro, E. Alba, M. Giacobini, and M. Tomassini. The inﬂuence of grid
shape and asynchronicity on cellular evolutionary algorithms. In Y. Shi, editor, IEEE
International Conference on Evolutionary Computation, pages 2152—2158, Portland,
Oregon, June 20—23 2004. IEEE Press.

[17] Richard Duda, Peter Hart, and David Stork. Pattern Classiﬁction. John Wiley and
Sons, 2nd edition, 2001.

[18] L. J. Eshelman, K. E. Mathias, and J. D. Schaffer. Crossover operator biases: Exploit-
ing the population distribution. Proc. of the 7th ICGA, pages 354—361, 1997.

[19] Larry J. Eshelman, Rich Caruana, and J. David Schaffer. Biases in the crossover
landscape. ICGA, pages 10—19, 1989.

[20] L. J. Fogel. Autonomous automata. Industrial Research, 4:14—19, 1962.

[21] Mario Giacobini, Marco Tomassini, and Andrea Tettamanzi. Takeover time curves
in random and small-world structured populations. In Hans-Georg Beyr et a1, edi-
tor, GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary
computation, 2005.

[22] Philip E. Gill and Michael W. Leonard. Reduced-Hessian quasi-Newton methods for
unconstrained optimization. SIAM Journal on Optimization, 12(1):209—237, 2001.

[23] Fred Glover and M. Laguna. Tabu search. In C. Reeves, editor, Modem Heuristic
Techniques for Combinatorial Problems, Oxford, England, 1993. Blackwell Scientiﬁc
Publishing.

179

 

[24] S. K. Gregurick, M. H. Alexander, and B. Hartke. Global geometry optimization
of (Ar)n and B(Ar)n clusters using a modiﬁed genetic algorithm. J. Chem. Phys,
104:2684, 1996.

[25] Peter Guntert. Automated NMR protein structure calculation. Progress in Nuclear
Magnetic Resonance Spectroscopy, 43(3-4):105—125, December 2003.

[26] Frank Harary. Graph Theory. Addison-Wesley Publishing Comp, Reading, Mas-
sachusetts, 3rd edition, 1972.

[27] B. Hartke. Global cluster geometry optimization by a phenotype algorithm with
niches: location of elusive minima, and low-order scaling with cluster size.
J. Comput. Chem, 20(16): 1752-1759, 1999.

[28] B. Hartke. Application of Evolutionary Computation in Chemistry, volume 110 of
Structure and Bonding, chapter Application of Evolutionary Algorithms to global
cluster geometry optimization, pages 33—53. Springer, Heidelberg, 2004.

[29] T.F. Havel. Encyclopedia of Computational Chemistry, chapter Distance geometry:
Theory, algorithms and chemical applications, pages 723-741. J. Wiley and Sons,
New York NY, 1998.

[30] Bruce Hendrickson. The Molecule Problem: Determining Conformation from Pair-
wise Distances. Phd thesis, Cornell University, Computer Science Department, 1990.

[31] Bruce Hendrickson. The molecule problem: Exploiting structure in global optimiza—
tion. SIAM Journal on Optimization, 5(4):835—857, 1995.

[32] F. Herrera, M. Lozano, and A.M. sanchez. A taxonomy for the crossover operator
for real-coded genetic algorithms. an experimental study. International Journal of
Intelligent Systems, l8(3):309-338, 2003.

[33] Magnus R Hestenes. Conjugate Direction Methods in Optimization, volume 12 of
Application Mathematics. Springer Verglag, New York, 1980.

[34] MR. Hoare. Structure and dynamics of simple microclusters. Advances in Chemical
Physics, 40:49—135, 1979.

[35] J. H. Holland. Outline for a logical theory of adaptive systems. Journal of the ACM,
9(3):297—314, July 1962.

[36] J .H. Holland. Adaptation in Natural and Artiﬁcial Systems. University of Michigan
Pres, Ann Arbor, 1975.

[37] Hsiao-Ping Hsu, Ulrich, H. E. Hansmann, and Simon C. Lin. Structure determination
of organic molecules from diffraction data by simulated annealing. Physical Review
E, 64(056707):1—2, 2001.

180

[38] J. Hu, E. Goodman, and K. Seo. Genetic Programming Theory and Practice, chap-
ter Continuous Hierarchical Fair Competition Model for Sustainable Innovation in
Genetic programming, pages 81-98. Kluwer Publishers, 2003.

[39] Jianjun Hu. Sustainable Evolutionary Algorithms and Scalable Evolutionary Syn-
thesis of Dynamic Systems. Phd thesis, Michigan State University, Department of
Computer Science and Engineering, 2005.

[40] M. Iwamatsu. Global geometry optimization of silicon clusters using the space-ﬁxed
genetic algorithm. Journal of Chemical Physics, 112:10976—10983, jun 2000.

[41] P. Juhas, D. M. Cherba, P. M. Duxbury, W. F. Punch, and S. J. L. Billinge. Ab initio
solid state nano-structure determination. Accepted for publication in Nature, Nov
2005.

[42] Andrew B. Kahng and Byung Ro Moon. Toward more powerful recombinations. In
Larry Eshelman, editor, Proceedings of the Sixth International Conference on Genetic
Algorithms, pages 96—103, San Francisco, CA, 1995. Morgan Kaufmann.

[43] H. W. Kroto, J .R. Heath, S.C. O’Brien, R.F. Curl, and R.E.C. Smalley. Buckminser-
fullerene. Nature, 318:162—163, 1985.

[44] J. Lee, I.-H. Lee, and J. Lee. Unbiased global optimization of lennard-jones clus-
ters for n ;= 201 by the conformational space annealing method. Phys. Rev. Lett.,
91(8):080201, 2003.

[45] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale
optimization. Math. Programming, 45(3, (Ser. B)):503—528, 1989.

[46] J. Maddox. Genetics helping molecular-dynamics. Nature, 376(6537):209—209, Jul
201995.

[47] John H. Mathews. Nelder-mead method. web, 2003.
http:math.fullerton.edu/mathews/n2003/NelderMeadMod.html.

[48] John H. Mathews and Kurtis K. Fink. Numerical Methods Using Matlab. Prentice-
Hall Inc., Upper Saddle River, New Jersey, USA, 4th edition, 2004.

[49] P. Moscato. Memetic algorithms for molecular conformation and other optimization
problems. Newsletter of the Commission for Powder Diffraction, of the International
Union of Crystallography, 20, 1998.

[50] Diego F. Nehab and Marco C. Pacheco. Schemata theory for the real coding and
arithmetical operators. In SAC ’04: Proceedings of the 2004 ACM symposium on
Applied computing, pages 1006—1012. ACM Press, 2004.

[51] J .A. Nelder and R. Mead. A simplex method for function minimization. Computer
Journal, 7:308—313, 1965.

181

[52] Arnold Neumaier. Molecular modeling of proteins and mathematical prediction of
protein structure. SIAM Review, 39(3):407—460, 1997.

[53] J.A. Northby. Structure and binding of lennard-jones clusters: 13 < 147. J. Chem.
Phys, 87:6166—6167, 1987.

[54] A. Patton, E. Goodman, and W. Punch. Beyond encoding: Using populationg sam-
pling as a rotation guide for ga operators. Garage tech report garage99-01-02, Michi-
gan State University, 1999.

[55] A.T. Phillips, J. B. Rosen, and K. A. Dill. Energy landscape projections of molecular
potential functions. Optimization in Computation Chemistry and Molecular Biology,
pages 1-2, 2000.

[56] A. P. Ramirez and R. C. Haddon. C60 neutron data collection at IPNS. Funded by
DOE work done at MSU.

[57] I. Rechenberg. Evolution strategy: Optimization of technical systems by means of
biological evolution. Fromman-Holzboog, Stuttgart, 1973.

[58] Mlson Rivera-Gallego. A genetic algorithm for solving the euclidean distance matri-
ces completion problem. SAC 1999, pages 286—290, 1999. San Antonio, Texas.

[59] I. A. Saraﬁs, P. W. Trinder, and A.M.S. Zalzala. Towards effective subspace clustering
with an evolutionary algorithm. Proc. IEEE Congress an Evolutionary Computation,
Dec. 2003.

[60] Kumara Sastry. Efﬁcient cluster optimization using extended compact genetic algo-
rithm with seeded population, optimization by building and using probabilistic mod-
els. OBUPM, pages 222—225, Jul. 2001.

[61] 1B. Saxe. Embeddability of weighted graphs in k-space is strongly np-hard. In
Proceedings of the 17th Allerton Conference in Communications, Control and Com-
puting, pages 480—489, 1979.

[62] H.-P. Schwefel. Numerical Optimization of Computer Models. John Wiley and Sons,
Chichester, 1981.

[63] Dong-i1 Seo and Byung Ro Moon. A survey on chromosomal structures and operators
for exploiting topological linkages of genes. In GECCO, pages 1357-1368, 2003.

[64] William M. Spears and Kenneth A. De Jong. An analysis of multi-point crossover.
In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 301—315, San
Mateo, CA, 1991. Morgan Kaufmann.

[65] David J. Wales and Jonathan P. K. Doye. Global optimization by basin-hopping and
the lowest energy structures of Lennard-J ones clusters containing up to 110 atoms. J.
Phys. Chem. A, 101(28):5111 - 5116, 1997.

182

[66] DJ Wales and HA Scheraga. Global optimization of clusters, crystals, and
biomolecules. Science, 285(5432):1368—1372, Aug 199.

[67] Hyun-Sook Yoon and Byung-R0 Moon. Synergy of multiple crossovers in a genetic

algorithm. IEEE Transactions on Evolutionary Computation, 6(2):212 - 223, April
2002.

183

 

TE Si 1E5

lllllilllﬂialllﬂﬂlﬂllllﬂlzlljlll