This is to certify that the dissertation entitled

Molecular Conformation of Clusters by Genetic Algorithm Using Spatial Operators and Unlabeled Distance Data

presented by David M. Cherba has been accepted towards fulfillment of the requirements for the Ph.D. degree in Computer Science.

Major Professor's Signature

Date

MSU is an Affirmative Action/Equal Opportunity Institution

MOLECULAR CONFORMATION OF CLUSTERS BY GENETIC ALGORITHM USING SPATIAL OPERATORS AND UNLABELED DISTANCE DATA

By

David M. Cherba

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Department of Computer Science

2005

ABSTRACT

MOLECULAR CONFORMATION OF CLUSTERS BY GENETIC ALGORITHM USING SPATIAL OPERATORS AND UNLABELED DISTANCE DATA

By

David M. Cherba

A set of Genetic Algorithm (GA) operators based on spatial location concepts will provide improved performance for a class of NP-hard search problems in N-dimensional spaces. A set of spatial operators for use with genetic algorithms is proposed for a class of problems with real-valued genes that consist of N-dimensional homogeneous vectors. Evolutionary computation is capable of providing solutions to problems that would be intractable using more conventional methods. A subset of these problems is represented in real-valued three-dimensional spaces or other more complex vector spaces. This thesis addresses a number of issues related to the natural influences that adjacent locations in these spaces have on the fitness functions used in genetic algorithms. A subset of building blocks (schema) will be utilized based on these natural influences. It will be shown that these operators can be described by a building-block style of theory that supports the experimental results. Further, the spatial operators naturally preserve the interactions between genes for this class of problems.

Genes have a natural influence on each other based on proximity. To be effective, genetic algorithm operators need to take these proximity effects into consideration in order to preserve good contributions to fitness. Failure to utilize these spatial relationships will lead to very poor performance of the genetic algorithm or will require statistical methods to try to capture the relationships.

As a demonstration of these spatial operators, this dissertation will focus on the conformation of molecular clusters, where each atom's location represents a gene with real-valued coordinates. Further, the algorithm presented will work from unlabeled distance information available from experiments with limited preparation. A set of theories will be presented that forms the basis for prediction of operator effectiveness, population size and convergence for this class of problems.
The theory will be based on a schemata-like set of building blocks constructed from subgraphs of limited size, where the bonds represent edges and the atoms are nodes. From this analysis, it is possible to show the construction of larger and more numerous building blocks by crossover and, at the same time, extract the requirements for good exploration. Local search will be utilized, so that the genetic algorithm operators can focus on the construction of building blocks and on exploration.

© Copyright 2005 by David M. Cherba. All Rights Reserved.

ACKNOWLEDGMENTS

I would like to thank all the people who have encouraged me in this academic pursuit, especially the members of my Doctoral Committee: Professor Goodman, Professor Enbody, Professor Duxbury, and my advisor, Professor Punch. The patience, guidance and encouragement that all the members of my committee provided were invaluable in the completion of this dissertation, especially when it was hard to articulate the ideas in an organized manner. In addition, I would like to thank the CSE staff, especially Linda Moore, without whom the whole process of taking classes and filing the paperwork would never get done. I would like to thank Professor Jain, Professor Stockman, Professor Esfahanian, and all the faculty whose discussions about both academic topics and life issues have helped me to be a better student and instructor. The understanding and support from my family, especially my wife, Wendy, have been unconditional despite the sacrifices. I am grateful for the encouragement when I was having moments of self-doubt, and for the support of my children, who have given me good advice about being a better instructor and educator. To my parents, for their early focus on learning and scientific experimentation.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Hypothesis
1.2 Motivation
1.2.1 Source of Data
1.2.2 Preprocessing Data
1.2.3 Comparison to Protein Molecules
1.3 Class of Problems
1.3.1 Molecular Clusters
1.3.2 Generalized Problem Classes
1.3.3 Fitness Landscape
1.3.4 Number of Local Minima
1.4 Road Map of Work
2 Background
2.1 Evolutionary Computing
2.1.1 Broad Divisions
2.1.2 Basic GA Operation
2.1.3 Biological Inspiration
2.1.4 Chromosomes
2.1.5 Theory
2.1.6 Operators
2.1.7 Replacement Strategy
2.1.8 Selection
2.2 Molecular Conformation of Clusters
2.2.1 Energy Functions
2.2.2 Diversity
2.2.3 Local Search
2.2.4 Distance Values
2.3 Current Work with Clusters Using GA
2.3.1 Deaven and Ho
2.3.2 Hartke
2.3.3 Sastry
2.4 Non-GA Solution Methods
2.4.1 Simulated Annealing
2.4.2 Monte Carlo and Basin Hopping
2.4.3 Statistical Directed
2.5 Spatial Concepts in EC
2.5.1 Gene Arrangement
2.5.2 Multi-Deme
2.6 Local Search
2.6.1 Quasi-Newton
2.6.2 Nelder-Mead
2.6.3 Constructive
2.6.4 Gradient
2.6.5 Conjugate Gradient
2.7 Representation of Clusters / Molecules
2.7.1 Locations
2.7.2 Euclidian Distance Matrix
2.7.3 Distance Labels
2.7.4 Angles and Sequence
2.7.5 Crystals
2.8 Geometric Algebra
2.8.1 Rank
2.8.2 Vector Operations
2.8.3 Constraints
2.8.4 Cayley-Menger Determinants
2.8.5 Solution Methods
2.9 Protein Conformation
2.9.1 Preprocessing
2.9.2 Objective Functions
2.9.3 Search Process
2.10 Summary
3 Theory
3.1 Class of Problems
3.1.1 Homogeneous Vectors
3.1.2 Genes
3.1.3 Fitness Function
3.2 Local Search
3.2.1 Volume
3.2.2 Distance-Based Local Search
3.2.3 Convergence Criteria
3.2.4 Iteration Limit
3.3 Spatial Concept
3.3.1 Division of Space
3.4 Spatial Building Blocks and Subgraphs
3.4.1 Definition
3.4.2 Counting Spatial Subgraphs
3.4.3 Regular Graph Model
3.4.4 Cube
3.4.5 N11 Lennard-Jones Molecule
3.4.6 Bucky Ball
3.5 Crossover
3.5.1 Standard Crossover
3.5.2 Spatial Crossover
3.6 Mutation
3.6.1 Simple Mutation
3.6.2 Relocation Mutation
3.7 Viable Population Size
3.7.1 Initial Population of Genes
3.7.2 Reduction in Gene Pool by Selection
3.7.3 Introduction of New Subgraphs
3.7.4 Diversity of Population
3.8 Graph Isomorphism
3.9 Convergence
3.9.1 Operator Effectiveness
3.9.2 Difference Equation
3.10 Scalability
3.10.1 Fitness Function
3.10.2 GA Algorithm
3.11 Summary
4 Experiments and Results
4.1 Local Search
4.2 Ideal Distances
4.2.1 N12 Lennard-Jones Molecule
4.2.2 N20
4.2.3 N38
4.2.4 N60
4.2.5 Bucky Ball
4.3 Non-Ideal Distances
4.3.1 Bucky Ball Experimental Data Set Target
4.3.2 Artificially Degraded Data
4.4 Parameter Selection Searches
4.4.1 Population Size
4.4.2 Operator Probabilities
4.4.3 Local Search Parameter Effect
4.5 Classic Two-Point Crossover
4.6 Operator Effectiveness
4.7 Scalability
4.8 Discussion
4.8.1 Comparison of Classic vs. Spatial Operators
4.8.2 Population Size: Theory vs. Actual
4.8.3 Operator Effectiveness
4.8.4 Scalability: Theory vs. Actual
4.9 Summary
5 Conclusions
5.1 Spatial Operators
5.2 Interrelationship
5.3 Objective Functions
5.4 Confirmation of the Hypothesis
5.5 Competitive Methods
5.6 Future Research
5.6.1 Improved Scalability
5.6.2 Adaptation to Folding Proteins
5.6.3 Dynamic Parameter Selection
5.6.4 Using Multiple Levels of GA
5.6.5 Tracking Genealogy
BIBLIOGRAPHY

LIST OF TABLES

2.1 Main characteristics of evolutionary computing
3.1 Multiple plane division of space
3.2 Building blocks for cube
3.3 Building blocks for ideal N11 LJ
3.4 Building blocks for N11 LJ random initialization and after local search
3.5 Bucky building blocks
4.1 Failed convergence by local search to ideal N11 by count and magnitude of perturbation
4.2 Summary of 100 trials of Bucky ball with real data
4.3 GA operating parameters, modes and sizes
4.4 GA operation counters
4.5 GA operating floating-point parameters
4.6 GA operating file name prefixes and suffixes
4.7 Solution comparison of standard and spatial crossover for N20 Lennard-Jones molecules
4.8 Solution comparison of standard and spatial crossover for N38 Lennard-Jones molecules
4.9 Scalability for various size molecules

LIST OF FIGURES

1.1 X-ray diffraction data (implied count of atom pairs) by atom separation distance [56]
1.2 Ideal Bucky ball derived from distance data
2.1 General evolutionary computing operation
2.2 Chromosome mutation
2.3 Chromosome recombination
2.4 Ideal data histogram
2.5 Histogram derived from experimental data
2.6 Pair distance data implied counts by radius
2.7 Left and right hemispheres combined in crossover operation
2.8 xy projection of n=104 molecule [27]
2.9 GA algorithm outline for Sastry [60]
2.10 K_N graph arrangement of genes
2.11 Row arrangement of genes
2.12 2D matrix arrangement of genes
2.13 Circle arrangement of genes
2.14 3D matrix arrangement of genes
2.15 2D division of genes
2.16 2D multiple division of genes
2.17 2D multiple division of genes using Voronoi spaces
2.18 Island populations for genetic algorithms
2.19 Multiple population multiple layer competition
2.20 Typical group of atoms
2.21 Two combined groups with dihedral angles
2.22 Crystal base structure
2.23 Crystal lattice structure
3.1 Example of multiple plane division
3.2 BB counting R3
3.3 Cube schema R3
3.4 Initialized molecule for GA run
3.5 Initialized molecule for GA run after local search
3.6 Bucky ball schema
3.7 Bucky division shown in 2D
3.8 Local search contributions
3.9 Local search adjustment
4.1 N11 local search probability of success by number of iterations
4.2 N11 local search probability of failure by sum of error
4.3 N11 local search probability of failure by max perturbation
4.4 N12 simple mutation probability sweep
4.5 N12 population versus average generations to solve
4.6 N12 local search convergence factor versus average generations
4.7 N12 population versus evaluations
4.8 N12 local search convergence factor versus standard deviation of generations
4.9 Population size effect on computation cost to find solution N20
4.10 Population size effect on average generations to find solution N38
4.11 Ideal Bucky ball fitness
4.12 Ideal Bucky ball energy
4.13 Combined energy and distance objective Bucky ball
4.14 Combined energy and distance objective N38
4.15 Ideal Bucky ball single-level runs
4.16 Comparison of ideal and experimental distances for Bucky ball
4.17 Accumulated distance error Bucky ball data set
4.18 Good real-data Bucky ball results
4.19 Typical defect for real-data Bucky ball result
4.20 Population size effect on average generations to find solutions N20
4.21 Population size effect on number of failed runs N20
4.22 Population size effect on total average evaluations on N20
4.23 N20 crossover probability versus failed solutions
4.24 Average generations to solution N20
4.25 Effect of simple mutation on average solution generations N20
4.26 Effect of relocation mutation on average solution generations N20
4.27 Effect of relocation mutation repetition on average solution generations N20
4.28 Average evaluations to find solution N20
4.29 Effect of convergence step size on failed solutions N20
4.30 Effect of convergence step size on average generations to find solution N20
4.31 Crossover effectiveness by generation
4.32 Simple mutation effectiveness by generation
4.33 Relocation mutation effectiveness by generation

Chapter 1

Introduction

The derivation of molecular structure continues to be a major challenge in many fields of science. It has become more important than ever to understand the effect of structure on the behavior and properties of everything from proteins, drugs and superconductors to nano-devices. Even with all the advances in microscopic examination, the two main sources of information about molecular structure are X-ray diffraction and nuclear magnetic resonance (NMR). The results from both of these technologies are processed and eventually yield a set of inter-atom distance pairs. The basic problem is how to use this set of distances to derive the structure of the molecule. From a pure mathematical perspective, the problem of embedding a graph has been shown to be NP-hard by Saxe [61]. The field of Evolutionary Computation (EC) has proven to be a viable solution strategy for many problems that are NP-hard. With inspiration from natural biological principles such as reproduction, survival of the fittest, diverse populations and mutation, these algorithms stochastically produce candidate solutions.

Applications of GA, a branch of evolutionary computation, have been successful at solving certain classes of molecular conformation problems, most notably clusters and biological molecules. This dissertation will address a set of operators for use with GA that improves efficiency for a general class of problems that includes molecular conformation of clusters from atom-pair distance information.

1.1 Hypothesis

A set of GA operators based on spatial location concepts will provide improved performance for a class of NP-hard search problems in N-dimensional space. This class of problems has real-valued genes that consist of homogeneous vectors representing spatial locations. These operators improve efficiency by preserving groups of genes in the vector space that contribute to high fitness values. The theory to support the observed performance improvements is based on examination of groups of genes as small subgraphs, where the atoms are nodes and the bonds are edges. These subgraphs are a distinct subset of the more general concept of schema as proposed by Holland [36]. The main difference is that the groups of genes are defined by their proximity to each other in vector space rather than by arbitrary subsets of a whole chromosome gene set. A minimal sketch of this proximity-based view of the chromosome is given below.
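The following Python sketch is an illustration of the representation only, assuming a simple cutoff-radius notion of proximity; the operators developed in Chapter 3 divide space with planes rather than a fixed cutoff. A chromosome is held as a list of three-dimensional gene vectors, and the proximity group around a single gene, the raw material for a spatial subgraph, can be read off directly:

    import math

    def neighborhood(chromosome, index, cutoff):
        """A chromosome is a list of (x, y, z) gene vectors (atom locations).
        Return the indices of genes within `cutoff` of gene `index`; these
        proximity groups form the spatial building blocks (subgraphs).
        The cutoff radius is an assumed illustrative parameter."""
        center = chromosome[index]
        return [i for i, point in enumerate(chromosome)
                if i != index and math.dist(point, center) <= cutoff]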
1.2 Motivation

Given the sources of data used to generate structure, significant amounts of domain knowledge have to be supplied to even begin searching for the actual structure. This domain knowledge can come in many forms. Known-good structures for similar molecules, protein side-chain and backbone sequences, doping with heavy atoms to identify specific locations, and many other methods are used to provide either starting points or constraints on the search. The time required and the biases introduced by the application of domain knowledge detract from the ability to find unbiased structures. In many cases, researchers resort to artificial measures to search for viable conformations of molecules. The most common is the introduction of energy models. An energy model is used to find an optimal low-energy conformation of the molecule. These low-energy structures are then compared to the experimental data to determine the likelihood of a match. The energy models can be simplified to the point of making calculations reasonable for large protein molecules and of using energy gradients to guide search procedures.

Given a set of distances, the ultimate goal is to make possible a search for viable structures using minimal domain knowledge. Basically, once a computer is given a set of distances without any assignment to specific atoms, it should derive candidate structures with an automated process in reasonable computational time. Further, the set of distances can be partial or contain a finite amount of noise similar to that produced by experimental processes and data preparation. Figure 1.1 and Figure 1.2 give a pictorial description of the problem. The first figure shows the experimental data and the second the desired result if the data were perfect. In reality, the data represented in Figure 1.1 is far from perfect, and this makes the derivation of molecular structure difficult. The vertical axis indirectly depicts the number of atom pairs at a given separation distance, and the horizontal axis is the distance of separation. The difficulty depends on the completeness and quality of the experimental data. In abstract terms, this is a problem of embedding a graph with points and edges.

1.2.1 Source of Data

X-ray diffraction and Nuclear Magnetic Resonance (NMR) are the two most common sources of data used in molecular conformation. In both of these experimental studies, the distance information must be derived indirectly from the experimental results. X-ray diffraction requires a single crystal to be irradiated with X rays, while NMR can process materials in solution. This is especially important for proteins because some cannot be crystallized. In all these experiments, some primitive set of information must be known.
For example, the basic chemical equation along with the types of bonds formed is required to process the experimental data. For biological molecules, the actual sequence of the backbone and the substructure of the side chains may be known. For either process, some quantity of basic information is needed to begin processing the experimental results for the desired set of distances.

Figure 1.1: X-ray diffraction data (implied count of atom pairs) by atom separation distance [56]

Figure 1.2: Ideal Bucky ball derived from distance data

X-ray diffraction produces a scattering pattern: the scatter of X rays by the electron clouds around each atom forms a diffraction pattern. Using this diffraction pattern one could infer the locations of the atoms, if the phase information were also known. However, without the phase information, the Fourier transform applied to the diffraction pattern can only yield a set of inter-atom distance pairs.

For NMR, the distances between atoms are inferred from resonant energy peaks and valleys. It is possible, given a wide range of frequencies and multiple points of irradiation, to deduce some location information for certain atoms in the structure in addition to distances. This set of distances has some very real limitations. First, the atoms are interacting to some degree. Second, the distances must be within the range affected by the source radiation. A common practice is to create molecules with heavy atoms attached to key points, because it is easy to identify the interactions between them in the data. In this way, it is often possible, in addition to obtaining the distance data, to make some assignments to specific atoms. While adding some identification is not enough to make the structure determination easy, it does reduce the search space significantly.

Regardless of which method is used, the best possible expectation is limited, noisy distance data that represents the shape of the molecule. In both methods, constraints and limits must be placed on the distances that can be derived. It is possible, using predictive techniques based on knowledge of the molecule under study, to fill in some of the gaps and make estimates of the expected distance distribution, even for overlapping peaks in the experimental data.

1.2.2 Preprocessing Data

The type of molecule and the experiment conducted determine the processing used to extract the distance data. This process can be dramatically improved by domain knowledge. For example, with X-ray diffraction, the basic chemical equation can provide constraints on the number and length of bonds that are expected. While the actual lengths can vary, the basic physics of the bonding process can limit the range for the length of bonds. Given a chemical equation, the preprocessing has a reference target, at least for the smaller distances. The problem of overlapping distances is a major concern. When the natural distribution or the noise associated with the experiment causes peaks to overlap, it is very hard to construct a discrete set of distances that corresponds to the peaks in the experimental data. Given an atom count for a cluster, about all that is known exactly is how many total distances need to be extracted from the data. This work will use the derived distance set as the target for the objective function used in the search process. This requires that the experimental data be used to produce a complete set of distances. A rough sketch of this extraction step is shown below.
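The dissertation does not prescribe a particular extraction routine; as a minimal illustration of the idea, the following Python sketch allocates the known total number of pair distances, m = N(N-1)/2, across the bins of a measured pair-distance histogram in proportion to their counts. The histogram arrays and the proportional-allocation rule are assumptions for illustration, not the author's method.

    import numpy as np

    def distances_from_histogram(bin_centers, counts, n_atoms):
        """Turn a pair-distance histogram into a discrete multiset of
        exactly m = N(N-1)/2 distances by proportional allocation
        (largest-remainder rounding keeps the total at exactly m).
        Illustrative sketch only; real preprocessing must also resolve
        overlapping peaks with domain knowledge."""
        m = n_atoms * (n_atoms - 1) // 2
        raw = np.asarray(counts, dtype=float) * (m / np.sum(counts))
        alloc = np.floor(raw).astype(int)
        short = m - int(alloc.sum())
        # Give the remaining entries to the bins with the largest remainders.
        alloc[np.argsort(raw - np.floor(raw))[::-1][:short]] += 1
        return np.repeat(np.asarray(bin_centers, dtype=float), alloc)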
1.2.3 Comparison to Protein Molecules

It is expected that some comparisons between finding the structures of clusters and of proteins will be made. The structures of proteins, which contain thousands of atoms, often come with a much larger set of assumptions and domain knowledge. Having key domain information aids in the determination of structure. While proteins are not explored in this work, it is important to understand some of the issues associated with the preparation of the data and the basic processes used in conformation. There are many semi-automated processes for finding the structure of proteins. In the case of proteins, the sequence of subcomponents that form the protein is often known. Further, the interactions or distances between specific types of atoms or bonds can be identified in the experimental data. In some cases, these interactions are unique enough to make assignment of distances to specific atom pairs possible. In others, the uniqueness is adequate to reduce the assignment to some limited number of pairs. Using this information, the search space is reduced in size and it is possible to search for conformations of significantly larger molecules than clusters. The key point is that the assignment of specific experimentally derived distances to specific atoms changes the nature of the problem complexity. For the work on clusters here, no assignment of distances to atoms was used as part of the objective function.

1.3 Class of Problems

The specific problem addressed in this work is to find the structure of molecular clusters using only distance information derived from experimental data, without assignment to specific atom pairs. The nature of the problem is that a solution is represented by a set of vectors. In this case, the vectors are three-dimensional locations in space that represent the locations of each atom in the molecule. The objective function to be optimized is significantly affected by the interaction of atoms close to each other. That is, each atom has multiple interactions with those closest to it that can significantly affect the overall objective function. There are many general problems that have a similar nature to the ones dealt with here. While the solution may not be easily represented by a set of three-dimensional vectors, the configuration of these problems and the objective functions have the same multiple interactions based on the logical or physical proximity of points in some multi-dimensional space. The key issue is preserving interactions that contribute to high fitness so as to improve the search operations.

1.3.1 Molecular Clusters

A molecular cluster is an isolated molecule of typically fewer than 300 atoms whose properties and characteristics are unique as compared to amorphous solids. Many of the molecules used in this study are hypothetical low-energy configurations. Others are derived entirely from experimental data. The determination of molecular structure is driven by a set of distances between all the atoms. Given a complete and accurate set of distances for a molecule, the problem of assigning distances to atom pairs still exists. If the molecule has N atoms, the number of distances between all pairs of atoms is the same as the number of edges in a K_N graph. The number of distances, m = N(N-1)/2, increases as the square of the number of atoms in the molecule. In the simplest of terms, this is a search among the m! possible assignments of distances to atom pairs. The scale of this assignment space is illustrated below.
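To give a sense of scale, the short calculation below (illustrative only; not from the original experiments) evaluates m and the order of magnitude of m! for several of the cluster sizes studied later:

    import math

    for n in (11, 20, 38, 60):
        m = n * (n - 1) // 2                        # edges of K_N
        digits = math.lgamma(m + 1) / math.log(10)  # log10(m!)
        print(f"N={n:3d}  m={m:4d}  m! ~ 10^{digits:.0f}")

Even at N = 20 (m = 190) the assignment space exceeds 10^350, which is why the objective function used in this work avoids explicit assignment altogether.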
This search can be reduced by applying triangulation constraints and by recognizing repeated distances. However, the assignment problem is still intractable with current computer resources, even for moderate values of N.

Molecular clusters clearly demonstrate this class of problems. Regardless of the objective function used, the interactions between atoms close to each other significantly affect the fitness of the molecule. Further, it is possible to group the atoms into subgraphs that can be used as building blocks to assemble complete molecules.

1.3.2 Generalized Problem Classes

This class of problems can be generalized as follows. It includes any search space where the solution is defined by a set of homogeneous vectors and/or where the criterion function has significant contributions from the proximity of the vectors to each other. Each vector can be thought of as a location in space. In molecular clusters, the interaction between bonded atoms is the strongest. Taking this concept beyond normal physical spaces, imagine a set of operating parameters for a grid of power plants, where each plant's operating point is defined by a complex vector with many values. Each plant's operating point is a location in hyperspace. The proximity of the plants can be defined by the interconnection of the power grid. Plants closer together will have a greater influence on each other.

1.3.3 Fitness Landscape

The fitness landscape has a major influence on the methods that can be applied to find optimal solutions. A smooth landscape admits a variety of search methods that simply follow the gradients in the landscape. In molecular clusters, a significant amount of work has been done on taking complex fitness landscapes and mapping them to deformed surfaces where more traditional searches work.
Several applications of spatial concepts have been made with multiple population GA and some graph problems. These are distinctly different from the work here. The structure of the populations in these applications is configured in a variety 9 of shapes and spaces, while the spatial operators used here will divide individual genes into discrete subspaces for the purpose of mutation and recombination. As part of practical application for many GA’S, local searches can be Shown to signif- icantly improve performance. In the background chapter, a review of local searches used in the molecular conformation applications will be provided. The representation of the molecules has an impact on the type of algorithm and the nature of the operators used. The various possible representations will be explored with specific attention to the location of the atoms as points in Space. Several other representations will be included because of the applications to other types of problems. The mathematical concepts for geometric algebra form a strong basis for many of the constraints and iterative forms of solutions. Many protein conformation methods use evolu- tionaryI computation utilize geometric algebra in the objective functions and in preparation of experimental data. While only clusters will be studied here, the use of these principles in the determination of the protein structure is a necessary background for understanding the significance that even partial distance assignment and data refinement can have on the search process. The theory chapter starts with a formal definition of the class of problems that these new operators address. This is further focused by how the class of problems is represented and how fitness functions are calculated. The class of problems is directly related to the introduction of spatial concepts and the division of hyperspace. The theory chapter then introduces the subgraph as a building block and a form of schema for vectors that are in close proximity. The local search and the spatial operators are then combined to provide a basis for judging the exploration of search spaces. Finally, in the theory chapter, rates of convergence are predicted along with the impact of diversity in the population of candidate solutions. As these factors significantly impact scalability, some predictions are made based on the overall theory. The next chapter describes the experiments that were conducted and how well they 10 compare to the predicted results of the theory. This will include many examples of well known low energy molecular clusters along with the Bucky ball. In addition, examples of the spatial operators used with real distance data for the Bucky ball will be described. Several examples of artificially degraded distance sets are also presented. The data sets that have been run provide a broad spectrum of cluster sizes ranging from eleven atoms to eighty-eight atoms. As part of the overall work, possible parameters for the GA operations are explored for effective values. The role of local search is also critical to the operation of the GA. In particular, the ability to rapidly find local minima and the best candidate solutions will impact overall performance. In the experiments conducted, a characterization of the local search used was conducted on a small cluster with eleven atoms. This characterization will prove to be useful for connecting the theory to the performance of the GA with spatial operators. 
While the critical test of all algorithms in this area is the ability to scale to larger clusters, the experimental results will provide the information needed to predict the performance for large clusters. The computational cost of calculating the objective function grows as O(n^2). This objective-function cost is multiplied by the cost associated with the basic nature of the algorithm. The objective function's computational cost has a major impact on the scalability of the search methods, and it is clear that larger molecules represent a significant challenge, as will be shown by the results presented here.

An experimental comparison is made between classical two-point crossover and spatial crossover. The comparison clearly shows the benefits of the spatial operators. Differential tests, where individual components of the GA algorithm are substituted or removed, will be presented. These results will show the utility of the spatial operators for this class of problem.

As promising as the theoretical and experimental results are, the focus of the next chapter is on the future work that must follow in order to demonstrate the utility beyond molecular clusters. Expanding this work into small biological molecules is necessary. A variety of ideas that were developed while working on the molecular cluster problem need to be explored. First, the problem of search scaling needs to be explored. Expanding this work into small proteins, or working with groups of atoms as single entities, needs to be addressed. From some of the experimental data, it is clear that the spatial mutation enhanced the operation of the searches but was rarely directly responsible for the creation of the current best-of-generation molecules. Exploration of the genealogical tree for the best molecules may shed some insight into the actual role played by this operator.

The final chapter summarizes this work with a number of conclusions about the experiments and theory. An empirical measure of improvement is provided for several of the experimental cases described. The foundation of the subgraph as a means to represent multiple linkages among genes will be summarized.

Chapter 2

Background

The process of finding the structure of molecular clusters with genetic algorithms bridges several distinct areas. First, the structure of clusters is determined by the force interactions between the atoms. These forces can be summarized as the potential energy level of a molecule. Many methods for energy calculation are used, and these have been a dominant criterion for the structure determination methods used throughout the field. Second, Evolutionary Computing is used as a means to explore the large and complex search space for solutions. This basic biological process of evolution, modeled with chromosomes and genetic operators, has been surprisingly successful in finding solutions to difficult problems. While the basic process of genetic algorithms can be stated in a simple outline, the practical implementation becomes complex as all the issues that affect performance are identified and resolved. In addition to these two main areas, the problem has its roots in geometric algebra, combinatorial mathematics, matrix algebra, physics and biochemistry. Nearly all of these areas are involved in finding solutions to the molecular structure problem.

Several of the current methods combine Evolutionary Computing and potential energy searches to find conformations. A representative selection of these methods is described in the first section, "Molecular Conformation of Clusters". In addition, several other competing methods are described that range from random-style searches to sophisticated statistically based searches.
A representative selection of these methods is described in the first section "Molecular Conformation of Clusters”. In addition, several other competing meth— ods are described that range from random style searches to sophisticated statistical based 13 searches. Evolutionary computing can be divided into several broad areas and is constantly changing with the introduction of new methods and styles of operators. The next sec- tion in this chapter will provide a brief overview of Evolutionary Computing with a focus on those issues that related directly to the genetic algorithm used in this work. One major issue is the use of spatial concepts in the previous GA research. These have been limited primarily to multiple GA populations and some graph problems. A review of this work will be presented. A major departure from what is viewed as pure GA, essential to solving many real prob- lems, is the role that local search plays in the operation of genetic algorithms. The types of searches range from classical gradients to Hessian reductions. When GA’S are combined with local search, the formal term ”memetic” [49] is used to describe these algorithms. The key point is that the search uses a population of potential solutions and some form of a metaheuristic; in this case, a GA with local search. Nearly all the methods involved in the search for the structure of molecules use some form of local search. Some of the more common basic search methods will be reviewed. The representation of molecular structure can vary depending on the domain informa- tion that is known. At some point, despite the type of representation used, the location of each atom in space must be calculated. The basic representation styles will be described for clusters, proteins and crystals. The representation for clusters is the most straightfor- ward. The crystals will be the most compact, utilizing only a few parameters to describe the locations of many atoms in a lattice. Both crystals and proteins have additional fixed information that reduces the degrees of freedom for the model. The Geometric Algebra section provides useful information for understanding the var- ious preprocessing methods used on experimental data as well as the basic search methods used on larger molecules like proteins. Many of the mathematical concepts, applied to three dimensional objects, can be used on problems of higher dimensionality. One key 14 note is that a complete set of distances contains a significant amount of redundant infor- mation. Many of these algebraic concepts are used to refine distance information or make assignments to specific atoms possible. The last section gives a brief overview of the automated processes that are currently used in the discovery of protein structure. While this dissertation does not address proteins, it is expected that some of the concepts used here will be applicable to the problem of finding protein structure and that the current methods provide a contrasting approach to molecular structure problems. 2.1 Evolutionary Computing Evolutionary computing (EC) is a recent development. It was first used in the late 1960’s and early 1970’s to address a class of optimization problems. Like with most discover- ies, there were several independent researchers taking slightly different approaches to what later could be loosely consolidated as evolutionary computing [35] [57] [20]. 
This chapter will present a brief history of EC and provide some background information on the types of operators and selection processes that were used in the solutions described in this work. Generally, there are some broad categories into which Evolutionary Computing can be divided. These categories will be listed along with some of the features that divide them by styles of operators and population. Some common features will be described, including populations of candidate solutions, mutation, recombination, and survival of the fittest. The measure of fitness can be any evaluation of the chromosome used to represent a member of the population that is the objective of the search problem.

2.1.1 Broad Divisions

Table 2.1 attempts to help classify the various fields within evolutionary computing. There are more similarities than differences when all the variations are explored. All the fields use multiple members of a population. All members of the population are measured with fitness. Individual members of the population are altered by some form of mutation or other operators during the evolutionary process. The population members are selected by fitness for inclusion in the next population or for participation as a parent.

Table 2.1: Main characteristics of evolutionary computing

                    ES                     GA                   EP                   GP
    Representation  real-valued,           binary,              real-valued,         trees and functions,
                    fixed length           fixed length         fixed length         variable length
    Population      mu parents,            fixed,               fixed,               fixed,
                    lambda offspring       generational         generational         generational
    Selection       best offspring         probabilistic        probabilistic,       probabilistic
                    (+/- parents),                              extinctive
                    extinctive
    Fitness         objective function     scaled objective     scaled objective     objective function
                    value                  function value       function value       value
    Mutation        Gaussian,              bit inversion        Gaussian,            functions and
                    adaptive                                    adaptive             terminals
    Recombination   parents (2 or more),   two parents,         none                 subtree swap
                    discrete, sexual,      two offspring,
                    panmictic              N-point, uniform

Some of the entries in this table are based on a similar table in Back's book on evolution strategies [3]. The entries in this table are not rigid, and exceptions to all of them are not hard to find. The table serves only to differentiate the four areas on a limited number of features and is not complete. Key points have to do with basic population control (fixed generational versus the mu and lambda styles) along with the type of encoding. The type of encoding leads to some additional types of arithmetic recombination strategies, rather than the simple choice of inclusion found with binary encodings.

2.1.2 Basic GA Operation

The block diagram in Figure 2.1 represents the generalized operation of an evolutionary algorithm. The collection of information about individuals in the population includes fitness and population statistics that can be used to control the selection and transformation processes.

Figure 2.1: General evolutionary computing operation. (Block diagram: Population Generation N passes through Selection and Transformation to produce Population Generation N+1; evaluations and characteristics of the population and individuals, such as fitness, rank, diversity and convergence, feed the generation of operating parameters, which in turn control selection and transformation.)

This block diagram also shows the use of selection to determine the parents for the next generation as well as selection to determine the parents to be used in creating children. The transformation process is intended to be general and includes mutation, recombination and local searches. Details of these operations will be explored in the following sections.
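As a concrete illustration of the loop in Figure 2.1, the following Python sketch implements one generational cycle: evaluate, select parents by fitness, and transform them into the next generation. The function names and the use of tournament selection here are illustrative assumptions, not the specific configuration used later in this work.

    import random

    def evolve(population, fitness, transform, generations, k=2):
        """Minimal generational EC skeleton: evaluate, select, transform.
        `fitness` scores an individual (higher is better); `transform`
        turns two selected parents into one new individual (for example
        crossover followed by mutation and local search)."""
        for _ in range(generations):
            scored = [(fitness(ind), ind) for ind in population]

            def tournament():
                # k-way tournament selection: an assumption of this sketch
                return max(random.sample(scored, k), key=lambda s: s[0])[1]

            population = [transform(tournament(), tournament())
                          for _ in range(len(population))]
        return max(population, key=fitness)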
2.1.3 Biological Inspiration

The biological parallels to the operation of evolutionary computing are very strong. From the description of the representation as genes to the processes of mutation and recombination, the analogies tie biological concepts to evolutionary computing. At the core is Charles Darwin's theory of evolution. In this theory, the principle of 'survival of the fittest', combined with apparently undirected small changes in phenotypes, was recognized as a critical factor in evolution. Small, isolated populations can exhibit genetic drift and reductions in the overall fitness of individuals in the population. This phenomenon of genetic drift can be demonstrated with genetic algorithms using strong selection pressure on small populations.

The amazing part of this whole process is the fact that all the hereditary information is encoded by four nucleotide bases. Adenine, Thymine, Cytosine and Guanine make up the entire set of building blocks for deoxyribonucleic acid (DNA). These blocks have many unique properties that allow for replication and complex interactions. The hierarchy consists of amino acids, proteins, and ribonucleic acid (RNA). At each level, different processes can occur.

A single gene can have a number of observable effects on an organism. On the other hand, many genes often combine to produce phenotypical changes in the individual. This complex interaction affecting the overall characteristics of individuals is what makes it hard to directly map cause and effect between the genetic encoding of DNA and the resulting individual. The chromosome can be mapped into many small groups which have distinct functions, such as translation or transcription. There are other groups which appear to have no phenotypical function, and these are called introns. Just as in life, the use of genetic algorithms does not produce certain results, only a likelihood of results. Further, just as in biological evolution, the process takes many generations to show distinct trends. This means that it is often very hard to relate changes in the genes and chromosome to specific measured results.

2.1.4 Chromosomes

Biological chromosomes are built from sequences of the four nucleotide bases. This alphabet of symbols is limited to the four abbreviations A, T, C, and G. Taking inspiration from this limited set of symbols, binary 1s and 0s can be used to define the genes and, from there, the overall chromosome. For the purposes of genetic algorithms, each chromosome is expected to encode all the hereditary material for an individual. While a mapping onto some limited set of symbols such as binary can be shown to encode all the required information, it is not always convenient to work with the binary bit representation. In these cases, it is possible to have genes that are made up of real-valued numbers, which often have a more phenotypical nature. The phenotypical encoding breaks with the true analogy of gene encoding. However, it often makes the actual manipulation of the chromosome easier and avoids violation of the constraints that are applied to a given problem. In most cases the chromosomes are made of genes that are either binary or real numbers, but there is no limitation on mixing the types of genes in a real application.

2.1.5 Theory

There are several areas where the basic theory of evolutionary computing has provided foundations on which to build this work. One area is the schema theory promoted by Holland [36].
In this theory, the implied parallel nature of processing binary patterns is derived and supports the building of successively larger blocks of good genes.

Another area of theory is the effect of selection on the convergence of a population. Simply stated, the stronger the selection pressure, the faster the population will converge. If it converges too fast, it will not explore the solution space sufficiently to provide acceptable solution candidates. The effects of selection pressure can be described by convergence velocity, loss of diversity and dominance by highly fit individuals. In all this, the probabilistic nature of the random selections and initial gene values makes exact analysis difficult.

Several authors have expanded the concept of schema to include real-valued genes. The schema theory for real-valued genes has been addressed by Nehab in a recent work [50] and in a much earlier work by Antonisse [1]. Both of these describe a consistent schema theory for real-valued genes. The Nehab paper describes a concept of schema as occupying regions of a three-dimensional space, and goes on to develop a schema theory for these regions in space using the same foundation as Holland's binary formulation [36].

2.1.6 Operators

The classic operators for genetic algorithms center on mutation and crossover. The terms crossover and recombination are used interchangeably. Both describe the process of taking two or more parents and selecting or combining genes from the parents to create a new offspring. The range of these operators can vary based on the problem and the encoding. When the encoding is pure binary, mutation can be as simple as inversion of bits in the parent used to create the offspring. When real numbers and problem constraints are involved, the mutation operators can span an enormous range of operations, from multi-variable Gaussian distributions to simple permutations. The crossover operators also come in many styles. The most common form of variation is the number of points used for dividing and selecting genetic material from the parents. For real-valued genes there are a number of numeric processes, such as selection, averaging, minimum, maximum and statistically weighted methods, for generating new gene values.

Mutation

Mutation normally involves a single parent from which a single offspring is constructed. In this process, the parent is modified by the mutation operator and genes are changed in some very simple ways. Two very different means of mutation exist. The first involves inversion of bits in the binary representation. This method has some drawbacks. For example, if all bits are equally likely to be inverted, there are great disparities in the size of the change for some forms of binary encoding: changing the most significant bit of an encoded number causes a mutation whose magnitude is half the range of valid numbers. Encoding methods such as Gray code overcome some of these issues for number representations.

    Parent:     gene 1  gene 2  gene 3  ...  gene n
                0 1 0   1 1 1   0 1 1  ...  0 0 1
                        | mutate
    Offspring:  0 1 0   1 1 1   0 0 1  ...  0 1 1

Figure 2.2: Chromosome mutation

The second mutation process is used for real-valued genes. This modification to a gene can be described by various random distributions. The distribution of the random change can be a linear model or a more complex statistical method. A common distribution model uses a mean value and standard deviation to characterize a normal Gaussian distribution, so that mutation magnitudes are statistically centered around a mean value. Controlled statistical distributions can also be applied to binary encodings, but this is more complicated and involves multiple conversions. A simple example of mutation on binary encoded genes is shown in Figure 2.2, where the bits inverted between parent and offspring are in gene 3 and gene n.
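Written out for real-valued genes, the Gaussian scheme above reduces to a few lines. This is a generic sketch (the rate and sigma values are arbitrary assumptions), not the simple or relocation mutation defined later in Chapter 3:

    import random

    def gaussian_mutate(genes, rate=0.05, sigma=0.1):
        """With probability `rate`, add a zero-mean Gaussian step of
        standard deviation `sigma` to each real-valued gene."""
        return [g + random.gauss(0.0, sigma) if random.random() < rate else g
                for g in genes]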
Crossover-Recombination

Recombination normally involves two parents, selected from the population based on fitness. The parents then each provide genetic material to create two offspring. The total genetic material is conserved at this point in the recombination process; some material will be lost later, during the selection process. With real-valued genes, there are a number of ways to contribute genetic material. It is possible to combine the values in some arithmetic way from multiple (two or more) parents in a weighted fashion to create a single offspring. It is possible to use gene values from many parents to create a single offspring. For discrete selection of gene material, the offspring gets a copy of a gene from one of the parents.

    Parent A:     gene 1  gene 2  gene 3  ...  gene n
                  0 1 0   1 0 1   0 1 1  ...  0 0 1
    Parent B:     0 1 1   1 1 1   0 0 1  ...  0 1 1
                          | recombination
    Offspring 1:  0 1 1   1 1 1   0 1 1  ...  0 0 1
    Offspring 2:  0 1 0   1 0 1   0 0 1  ...  0 1 1

Figure 2.3: Chromosome recombination

How the parent chromosome is divided to provide the genetic material is subject to many variations, and each variation can be shown to be superior for some application. In general, the chromosome is divided into N pieces and each of the pieces is then labeled as odd or even. When two offspring are created, the odd parts from parent A are combined with the even parts from parent B to complete offspring one. The even parts from parent A and the odd parts from parent B are used to complete offspring two. The manner in which the dividing points are selected is subject to many controls. In the extreme, every gene is probabilistically labeled odd or even. In other cases the dividing points follow a specific rotating pattern: the first division point is between gene one and gene two, the next between gene two and gene three, and the pattern continues until it has to start over again. The dividing points can also be determined by some statistical analysis of the population based on fitness. A sketch of the general N-point mechanism follows.
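The odd/even alternation described above can be expressed directly. The sketch below is a generic N-point crossover over a flat gene list, with cut points chosen uniformly at random (an assumption of this sketch; the spatial crossover of Chapter 3 chooses its division differently):

    import random

    def n_point_crossover(parent_a, parent_b, n_points):
        """Cut both chromosomes at the same n_points positions and swap
        alternating ('odd'/'even') segments to build two offspring."""
        cuts = sorted(random.sample(range(1, len(parent_a)), n_points))
        child1, child2 = [], []
        src1, src2 = parent_a, parent_b
        prev = 0
        for cut in cuts + [len(parent_a)]:
            child1.extend(src1[prev:cut])
            child2.extend(src2[prev:cut])
            src1, src2 = src2, src1   # alternate sources at each cut point
            prev = cut
        return child1, child2

With n_points = 2 this reduces to the classic two-point crossover used as the baseline comparison in Chapter 4.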
2.1.7 Replacement Strategy

There are three basic styles of population control. A generational style uses one generation as parents to produce offspring, and then the offspring become the parents; each replacement of the parents is a generation. A population is constructed, parents are selected from it based on fitness to produce offspring, and once the offspring generation is full, it takes on the role of supplying parents to the next population.

A steady-state approach selects parents based on fitness to reproduce or mutate, but decides whether to replace some member of the population with the offspring based on fitness. Some methods replace one of the parents; others replace the weakest member of the population. The base fitness values are then recalculated after every change to the population.

The third style is one in which μ parents are created and then selection is done among them, based on fitness, to create offspring. This process is repeated until λ offspring are created. Once the offspring population is complete, a competition takes place to determine the μ parents of the next generation. Two common versions are used: in one, only the offspring compete for the μ spots, while in the other, parents and offspring compete. These are referred to as the (μ, λ) and (μ + λ) strategies [3].

Another prominent topic associated with populations is the use of multiple populations for more robust algorithms. A recent paper by Hu [38] describes a system with a stratified population in a hierarchy that maintains robust operation, thus avoiding many of the common problems with population stagnation.

2.1.8 Selection

There are many selection schemes in use. Selection occurs in two ways. One form of selection determines participation as a parent for mutation, recombination, or other operators that create an offspring. The second is selection for inclusion in the next population. Some methods rank all the members of the population and then select the better-ranking members in a probabilistic fashion, either for parenthood or simply for inclusion in the next population. Roulette wheel is a method of selection based on the probability given by a member's fitness in relation to the sum of all fitness values. Tournament selection forces X members of the population to compete, with the winner being chosen for inclusion or parenthood. This selection process, whether for parenthood or for inclusion in a population, is a key element of all the algorithms; it is this survival of the fittest that drives the evolutionary process. The downside occurs when a population is overtaken by a single fit individual or group of individuals. This is referred to as loss of diversity and is a major issue in evolutionary computing. Both tournament and roulette-wheel selection are sketched below.
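The following sketch is illustrative only; the tournament size k and the treatment of fitness (energies to be minimized for the tournament, nonnegative scores to be maximized for the roulette wheel) are assumptions of the example.

    import numpy as np

    def tournament_select(fitness, k, rng=None):
        """Pick k members at random and return the index of the fittest.
        `fitness` is a 1-D NumPy array of energies (lower is better)."""
        rng = rng or np.random.default_rng()
        entrants = rng.choice(len(fitness), size=k, replace=False)
        return entrants[np.argmin(fitness[entrants])]

    def roulette_select(fitness, rng=None):
        """Select an index with probability proportional to fitness.
        Assumes nonnegative scores where higher is better."""
        rng = rng or np.random.default_rng()
        p = fitness / fitness.sum()
        return rng.choice(len(fitness), p=p)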
2.2 Molecular Conformation of Clusters

The basic process for determining the structure of molecular clusters focuses on finding molecules with low potential energy, and then compares the derived distance information or an X-ray scattering simulation to the experimental data. The actual data-driven solution methods are not emphasized in current approaches. Since most of the methods used to find clusters rely on potential energy functions, it is no surprise that there are numerous different energy functions; several of these will be described. The most successful of the methods employ genetic algorithms. Three current works cover not only some of the first innovations used with genetic algorithms but also set the standard for scalability and for advanced GA methods using learned linkages. The application of genetic algorithms to this problem is relatively recent compared to the previous work in this area.

While the use of closed-form solutions has been limited, several computer-based methods that involve concepts of simulated annealing, stochastic searches, and surface deformations have been successful. These will be briefly described. For very small clusters (three to fifteen atoms), many of the other methods are shown to be more effective at finding the structure than GAs. The difficulty increases as molecule size grows.

2.2.1 Energy Functions

There are several different potential energy functions used in the search for molecular structure. The most common is the Lennard-Jones function. Energy-based objective function optimizations normally sum all the individual energy terms associated with pairs of atoms. Depending on the level of modeling, additional terms based on adjacent triplet and quad groups of atoms can be included in the summation. These forms use empirical constants or some relative potential measures. Energy functions span from simple relative potentials like the Lennard-Jones to first-principles models. Three energy functions will be described. The first is the most common in the work on clusters, the Lennard-Jones. The second is the Gong energy function used by Sastry [60] in his work on clusters. The final model is based on Hooke's law; it has terms for the pairs, triplets, and quads of atoms, including the Lennard-Jones terms.

The popularity of the Lennard-Jones function is based not only on the simple calculation, but also on the fact that distance data can be readily derived from candidate molecules. The calculation has two empirical constants that take only a few common values. The basic nature of the calculation emphasizes the interaction of close atoms and contributes little energy for atoms far apart.

V(r_{ij}) = \epsilon \left( \frac{\alpha}{r_{ij}^{12}} - \frac{\beta}{r_{ij}^{6}} \right)   (2.1)

Equation 2.1 shows the basic Lennard-Jones equation, with the indexes i and j representing a pair of atoms. The overall energy potential is a relative value based on the sum over all pairs. There are actually many different versions of this equation in which the values of α, β, and ε take different constants. To be consistent with several of the minimum-energy tables for Lennard-Jones energies, the constants are assumed to have these values: α = 1, β = 2, and ε = 1. Equation 2.2 shows the molecule's total potential energy summation with the substitution r_{ij} = d_k. This substitution demonstrates that, given the set of distances without knowing the atom pairing, it is possible to estimate the potential energy of the molecule using a Lennard-Jones approximation.

V(d) = \sum_{k} \left( \frac{1}{d_k^{12}} - \frac{2}{d_k^{6}} \right)   (2.2)
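A minimal sketch of equation 2.2 in Python follows. It assumes the common constants α = 1, β = 2, ε = 1 stated above and takes any collection of pair distances, labeled or not.

    import numpy as np

    def lj_energy_from_distances(d):
        """Lennard-Jones energy estimated directly from a list of pair
        distances (equation 2.2); no atom-pair labels are required, since
        the sum runs over the distances themselves."""
        d = np.asarray(d, dtype=float)
        return float(np.sum(1.0 / d**12 - 2.0 / d**6))

For a set of pairs all at the ideal separation d_k = 1, each term contributes -1, the minimum of the pair potential.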
The Gong potential takes the energy calculation one step further, using bond angles as shown in equation 2.3. The more detailed equation for v_2 has terms similar to the Lennard-Jones but with an added exponential decay term. The v_3 function has terms with the cosine of the angles formed by groups of three adjacent atoms. Details of this energy function can be found in the paper by Sastry [60]. One of several issues with the Gong potential is that the first term v_2 is based on pairs of atoms while the second term v_3 is based on triplets. The additional terms can increase the cost of calculation, because the number of pairs is on the order of O(n^2), whereas the number of triplets can be on the order of O(\binom{n}{3}) terms.

V_{total} = \sum_{i<j} v_2(i,j) + \sum_{i<j<k} v_3(i,j,k)   (2.3)

That is, for every atom only five distances are required to correctly locate the atom with some accuracy. As few as four distances could be used, but the addition of the fifth allows for some correction of errors in the placement. This represents a minimum set of 5n values. A perfect D matrix has n(n - 1) useful values.

Another variation of the data involves having perfect distance data but no labels. Using this set of data, the assignment of individual atom labels is required to find the molecular structure prior to locating the atoms. The Bucky ball molecular structure of C60 makes a good example of this type of distance information. The histogram representation of ideal data for a Bucky ball is shown in Figure 2.4. The vertical axis is the count of atom pairs having a given distance, and the horizontal axis is the distance between atoms. This histogram and the accompanying target set of distances are perfect to within some accuracy, but are not labeled with atom pairs.

[Figure 2.4: Ideal data histogram. Counts of atom pairs versus distance for the Bucky ball.]

The shape of this histogram is clean, with atom pair counts in regular fractions of the sixty atoms in the molecule; that is, all counts are evenly divisible by the number of atoms.

The final kind of data to be examined is data derived with no labels and with noisy distance values. Either the experimental collection or the method used to generate the distance data set introduces some amount of noise, such that the distances will never match a valid set of atom locations beyond a limited accuracy. In comparison to the ideal data histogram, the actual experimental data for the Bucky ball is shown in Figure 2.5. The bond counts are not regular fractions of sixty, and the various symmetries of the counts have clearly been altered.

[Figure 2.5: Histogram derived from experimental data. Counts of atom pairs versus distance.]

This data and histogram were derived from an X-ray diffraction experiment, and the collected data is shown in Figure 2.6. The vertical axis is measured in terms relative to the number of atom pairs that have a particular distance; known information about the atomic bonds can then be used to infer the count of distances. The horizontal axis has been scaled to distance, although the actual measured data can be expressed in several different measures. In order to demonstrate the correlation between the peaks and the histogram, the signal graph has been scaled in width to make the correlation easier to see. Both the ideal histogram and the histogram from real data use the ideal distances as center points for each histogram bar.

The basic definitions for the quality of data have now been established. Differences in the quantity and quality of data affect the solution strategies applied to the molecular conformation problem. Many solution methods involve the placement of distances in the D matrix and the adjustment of values. For the purposes of this work, the distances are reduced to a simple one-dimensional vector in ascending order of magnitude and will be referenced as the d vector; the lower case denotes the lack of atom pair labels. A sketch of this reduction is given below.
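The reduction from labeled coordinates to the unlabeled d vector, and the pair-count histograms of Figures 2.4 and 2.5, can be sketched as follows. The function names and the binning choice are assumptions of the example.

    import numpy as np

    def d_vector(points):
        """Reduce an n x 3 array of atom coordinates to the sorted,
        unlabeled d vector: all n(n-1)/2 pair distances, ascending."""
        diff = points[:, None, :] - points[None, :, :]
        dmat = np.sqrt((diff ** 2).sum(axis=-1))        # the labeled D matrix
        return np.sort(dmat[np.triu_indices(len(points), k=1)])

    def pair_histogram(d, bins):
        """Counts of atom pairs per distance bin, as plotted in
        Figures 2.4 and 2.5."""
        counts, edges = np.histogram(d, bins=bins)
        return counts, edges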
2.3 Current Work with Clusters Using GA

Three works using GAs to find low-energy clusters will be reviewed, each chosen for specific reasons. The first work, by Deaven and Ho, introduces the recombination process that cuts two parent molecules by a plane and then combines the halves. The second work, by Hartke, shows the enhancements to the GA process for diversity and demonstrates the scalability of current algorithms. The third work, by Sastry, demonstrates the contrast between real and binary encoding of the genes and how the introduction of linkage learning was needed to provide solutions. This learning of linkage can be thought of as a limited version of the grouping provided by the spatial concept introduced in this work.

[Figure 2.6: Pair distance data; implied counts by radius.]

This review will include a comparison of how the three GA works are implemented. They will be compared on seeding of the initial population, population control, mutation methods, local search, diversity control, objective functions, and scalability. Each of these areas covers a fundamental aspect of the algorithms' operation. The idea of seeding with some known set of molecules, or a close derivation, can significantly reduce the computational cost of the algorithm. The size and control of the population has a linear effect on the computational cost. The objective functions for all three GAs are energy based but use different models and still yield similar results. The biggest issue is the scalability of the results. While the Deaven and Ho approach solved only a few molecules, Hartke provides results for a wide range of clusters. Sastry solves only a few small molecules of specific interest, but does so with very low computational cost.

2.3.1 Deaven and Ho

The Deaven and Ho [13] paper was a landmark in that it used a genetic algorithm to obtain complex molecular structures. It employed energy functions as the fitness objective and a variety of innovative operators. An article in Nature by Maddox [46] described the work by Deaven and Ho and brought the application of genetic algorithms to molecular conformation of clusters to the attention of many researchers. Despite these results, the basic techniques which were so successful did not really become the standard style of GA for many physicists working on clusters until after 1999 [27].

Seeding

The initial population was constructed by random placement of atoms in a fixed volume of space such that no two atoms are "too" close together or "too" far apart. Atoms with too much separation have little or no significant interaction. These limits can be chosen by solving the energy function for minimum and maximum values in terms of radius.

Population

The model for population control in the genetic algorithm developed by Deaven and Ho was a mixture of diversity control, permutations, and the classic (μ + λ) selection [3]. In the original paper [13], they start with μ molecules, then create permutations with λ = μ(μ - 1) total offspring. Each offspring is relaxed with a local search method, and the total population of (μ + λ) is then reduced back to μ for the next generation. The selection process requires a certain energy difference between each of the molecules. In the first few generations, diversity is not an issue because there is a wide range of energies. After only a few generations, however, the molecules converge to good low energy levels and the differences become much smaller. Requiring a minimum difference in energy helped keep the population more diverse, since at least some small difference in energy level implies a different atom configuration.

What is unusual, when comparing Deaven and Ho's method to typical genetic algorithms, is that the size of the population ranged only from four to sixteen molecules.
The total number of offspring ranged from 16 to 256. These populations are far smaller than a typical GA population, which can be on the order of thousands of individuals in order to provide good results within two to three thousand generations.

Mutation

Based on a probabilistic selection, a new member of the population is created by mutating an original individual and replacing it in the population. Determination of which individual is replaced is based on criteria such as energy difference. The mutation process applies many random displacements to a number of atoms and then subjects the whole molecule to a relaxation process by which a minimum distance between atoms is maintained. In Deaven and Ho's work, another type of mutation is achieved with an inverse directed local search, described at the end of the next section.

Local Search

The local search was rather unique in that it starts with a gradient vector in the direction of the local minimum, but the vector can be rotated around a normal to the energy surface to point in the opposite direction. While this local search is normally used to find the local minimum, it was also applied like a mutation operator. For example, rotating the gradient vector 180 degrees from the direction of descent in essence moves the molecule's atoms along the direction of most rapid ascent away from the minimum. Moving this way for some number of steps forces the molecule into the region of an adjacent local minimum. The normal form of the local search was applied to every new offspring, while the redirected form was applied on a probabilistic basis to only a few percent of selected parents. The algorithm was iterative, and the typical search applied to each offspring was thirty iterations.

Crossover

The crossover operator was based on a random dividing plane through the center of mass of each molecule. The plane's location was adjusted until approximately half of the atoms resided on each side of the plane. Using two parents, and dividing each of them, produces two sets of pairs that can then be recombined to form offspring. As described in the population section, a single offspring was generated from this crossover operator for each permutation of parents. An example of this crossover operator is shown in Figure 2.7, with the two halves shown separated before being combined. As can be seen from the figure, when the two halves are combined, several atoms may be too close. Provided they are at least some minimum distance apart, the local search will relax the configuration to an acceptable energy level.

[Figure 2.7: Left and right hemispheres combined in the crossover operation.]
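A minimal sketch of this plane-cut crossover follows. It is an illustration, not Deaven and Ho's implementation: here the plane through the center of mass is "adjusted" simply by splitting the atoms at the median of their signed distances along a random normal, and the minimum-distance check and relaxation are left to the subsequent local search.

    import numpy as np

    def plane_cut(points, rng=None):
        """Split a molecule (n x 3 coordinates) by a random plane through
        its center of mass, returning (left, right) halves of nearly
        equal size."""
        rng = rng or np.random.default_rng()
        centered = points - points.mean(axis=0)
        normal = rng.normal(size=3)
        normal /= np.linalg.norm(normal)
        side = centered @ normal                 # signed distance to plane
        order = np.argsort(side)
        half = len(points) // 2
        return points[order[:half]], points[order[half:]]

    def plane_crossover(parent_a, parent_b, rng=None):
        """Combine the left half of one parent with the right half of the
        other; the local search must then relax the result."""
        left_a, _ = plane_cut(parent_a, rng)
        _, right_b = plane_cut(parent_b, rng)
        return np.vstack([left_a, right_b])

Because both parents have the same atom count and are split at the median, the offspring always contains the full number of atoms.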
Diversity

Diversity of the population was maintained by the mutation rate and also by the selection process for inclusion in the μ population. In Deaven and Ho's work, some difference in energy was required so that all individuals have slightly different energy levels. While this ensured diversity, it did not ensure the right distribution of diversity to provide effective populations. The mutation operators, both the anti-local-search version and the random movement of atoms, ensure that some diversity is also maintained.

Selection

Deaven and Ho used all possible permutations of parents to generate offspring. The major selection process was used, together with the diversity controls, for inclusion from the total population of μ + λ individuals into the μ set of parents for the next generation. The basic selection mechanism was to rank parents and offspring by energy value and then select the lowest-energy molecules for inclusion, subject to the difference-in-energy constraint.

Objective Function

The Lennard-Jones energy function, as shown in equation 2.5, was used for this search by Deaven and Ho. The α, β, ε of equation 2.1 have been replaced by the appropriate constants.

V(d) = \sum_{k} \left( \frac{1}{d_k^{12}} - \frac{2}{d_k^{6}} \right)   (2.5)

Performance

In their original 1995 work, Deaven and Ho took three thousand generations to find the Bucky ball molecule. The Bucky ball is named after Buckminster Fuller, the inventor of the geodesic dome as well as many other unique styles of structures. The Bucky ball has the same basic structure as a simple geodesic dome and the modern soccer ball: 60 atoms arranged in 20 hexagons and 12 pentagons, created from carbon atoms vaporized and injected into an inert gas. Richard E. Smalley, Robert F. Curl Jr., and Harold Kroto discovered the carbon Bucky ball while working on solar chemistry research, a discovery that led to the 1996 Nobel prize for the three researchers. Applications of the Bucky ball structure include nano-structures, superconductors, and optics.

Evaluations = \mu(\mu - 1) \times 30 \times generations   (2.6)

Given μ = 8 and 3000 generations for the Bucky ball with sixty atoms, the cost scales approximately as O(n^{3.76}).
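As a worked check of equation 2.6, using the stated values of μ = 8, thirty local-search iterations per offspring, and three thousand generations:

Evaluations = 8 \times (8 - 1) \times 30 \times 3000 = 5{,}040{,}000

that is, on the order of five million evaluations for the sixty-atom molecule.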
2.3.2 Hartke

The Hartke work builds on the Deaven and Ho method in several ways. A geometrically based measure was used to improve diversity. The local search method was improved using a Newton-style iteration that will, in theory, take fewer steps. The mutation was also well defined and more extensive than hopping to the next region of attraction. Using all these improvements, his work found all the minimal energy states for molecules of up to 250 atoms, with performance that scaled very competitively.

Seeding

In his latest work, Hartke used a set of molecules based on known low-energy configurations and then randomly added atoms to make up the full atom count. It should be noted that seeding can greatly improve the ability to find solutions for the many low-energy configurations that have similar geometries; it can also cause significant problems for those configurations that are markedly different. Hartke used many of the Deaven and Ho operators, with small improvements, to give a more robust algorithm. That algorithm was used to find the energy minima for molecules from a few atoms to several hundred, including many known hard-to-reproduce low-energy states at certain sizes. One of the main differences between Deaven and Ho and Hartke is the way diversity in the population was achieved: Hartke introduced a geometry measure rather than an energy difference.

Population and Diversity

Hartke's typical population was five to fifteen molecules. The initial population was constructed from known good geometries that yielded low energy configurations, with atoms then added until the molecule was complete. Hartke's work uses the same (μ + λ) population methods as the Deaven and Ho work, but with both energy and geometry constraints for diversity. Just as in the Deaven and Ho work, all permutations of the parents were used to create offspring, and the next generation of parents was chosen by ranking constrained by the energy and geometry difference requirements.

Hartke [27] took the concept of diversity control based on energy and added fundamental geometry measures to enforce an even more diverse population. In this measurement, the molecule is rotated so that a two-dimensional projection has the fewest number of distinct atoms in the projection. The two-dimensional projection is then divided by a fixed grid of forty squares scaled to the diameter of the projection, and a count is taken of the squares containing at least one projected point. Using the difference in counts as a measure of similarity, the population diversity was improved. Figure 2.8 shows the projection of a molecule with 104 atoms. Using this method, it is possible to keep in the population several unique low-energy configurations with significantly different geometries.

[Figure 2.8: xy projection of an n = 104 molecule [27].]

Mutation

Hartke's mutation operator randomly selected one to forty percent of the atoms to move, and also selected for each move a random magnitude between a specific minimum distance and up to the diameter of the cluster. The mutation was applied to fifteen percent of the offspring produced by crossover, prior to the local search. So large a magnitude of movement has the effect of completely relocating an atom in the configuration; the balance between disruption of good locations and exploration seems tilted somewhat toward exploration.

Local Search

The local search used by Hartke was a quasi-Newton method developed by Liu and Nocedal [45]. This particular method has been shown to have speedup factors in the range of 1.6 to 4.5 compared to other high-dimensionality local search methods. The local search was applied to offspring and parents alike after any changes due to mutation, crossover, or initialization.

Crossover

Hartke employed an improvement to the Deaven and Ho method of crossover. It used the dividing plane but evaluated the potential energy of each half of the molecule before recombination. Further, instead of randomly selecting a plane, this crossover method finds a plane that produces a half molecule with lower potential energy; doing this about fifty percent of the time improved performance. What is not clear in this work is how the advantageous plane was found. Clearly, rotating a dividing plane in search of low-energy halves of a molecule would incur an intensive computational cost.

Selection

The selection process was basically rank elite, based on the rank of the total population of parents plus offspring. The constraints of geometric diversity and energy difference were then applied to the selection process. In this way, a parent population for the next generation could include potentially new and beneficial partial configurations even if the fitness of the individual was not competitive with the current best in the population.

Objective Function

This is the same energy function used by Deaven and Ho. The Lennard-Jones potential was calculated from all the distances between atom pairs. Once the atom locations were found, and after the local search, the energy was calculated along with the geometric diversity projection onto two dimensions.

Performance

The performance of Hartke's algorithm was very good. It not only found all the lowest-energy configurations up to 150 atoms, but did so with scaled performance of approximately O(n^3). Based on the results shown graphically, there was a large deviation in the time values for larger molecules.
Some of these were noted as known difficult configurations, because of a shift from the normal low-energy pattern to that of a truncated face-centered cubic. It should be noted that other methods, like Monte Carlo with basin hopping, perform better for small molecules, and that certain known hard configurations can take an order of magnitude more time than the scaling prediction.

2.3.3 Sastry

In the previous two works, the encoding of the molecules used real values, a phenotypical encoding. In contrast, some GA methods use binary encoding of the genes to represent atom locations. Two issues highlighted by the Sastry work are the reporting of performance and the use of binary encoding. The work by Gregurick et al. [24], with its performance scaling of O(n^{4.5}), is often cited as an example of genotypical encoding with binary values. The work by Sastry [60] used a small five-bit genotypical binary encoding. It described performance on the same order as Hartke, if the cost of developing the seeds is included; the cost of producing the seed molecules must be included so that a fair comparison can be made. Further enhancements to the basic genetic algorithm included using statistical distributions of the population and minimum descriptive length to partition building blocks and maintain diversity. This work also used local search with the Extended Compact Genetic Algorithm (ECGA) to find the low energy configurations. The scaling claims included only the number of function evaluations, and not the cost of evaluating the fitness function or of producing the seed molecules. Additional information will be provided to make a comparison possible between the previous two algorithms and this one, which uses advanced GA methods such as linkage learning.

Seeding

The seeding used the (n - 1)-atom low-energy molecule configuration as a seed, then added a randomly placed atom to make the full n-atom configuration. This requires that all the low-energy clusters, up to the size currently being attempted, have previously been found. While some algorithms make use of known good general configurations for low energy, this method requires the n - 1 configuration to start.

Population

A population of twenty to one hundred molecules was used in this algorithm. The size of the population required for a high probability of a successful run was stated as O(n^{0.83}). Examination of the results shows that, for molecules with twenty atoms, populations on the order of one hundred molecules were used; this is much larger than the twelve predicted by the claim. The population was fixed, and some percentage of it was replaced every generation. Taking two data points from the reported results, twenty population members for ten-atom molecules and 149 members for twenty-atom molecules, the population scaling works out to roughly O(n^{2.9}), which is much larger than the claim.

Generational Outline

This algorithm differs from the basic simple genetic algorithm in that it uses two distinct tools to measure the profile of the population. These tools are then used to replace some percentage of the population each generation. Figure 2.9 shows the outline as described in the paper by Sastry. Each generation, N(1 - P_c) individuals were selected by tournaments to be included as parents for the next generation.
Then, using the selected individuals as the parents, offspring were created with crossover operations to fill the remaining positions in the population, where P_c is the probability of crossover. The marginal product models (MPM) and minimum descriptive lengths (MDL) are used to guide the operation of the GA.

    Initialize clusters
    Loop:
        Perform Nelder-Mead simplex search
        Evaluate potential energy of each cluster
        If converged then stop
        Perform tournament selection
        Build MPM using minimum descriptive length
        Create new clusters using MPM
        Replace percentage of old clusters
    Repeat loop

Figure 2.9: GA algorithm outline for Sastry [60]

Local Search

The Nelder-Mead local search was used by Iwamatsu [40] and also by Sastry [60]. This method uses a form of generalized triangulation to bound the area to advance toward. The triangulation local search for new atom locations used thirteen data points to develop a boundary to evaluate in determining the direction to move, either toward the interior of the bounded region or toward the exterior, depending on the value of the objective function. At each step, a new boundary point is constructed from the existing points. For a graphical description of this method, the web site by Mathews [47] has a good explanation.

Crossover

Crossover was applied to the parents selected from the population, but was guided by the statistics developed for both the Minimum Descriptive Length (MDL) and the Marginal Product Models (MPM). This process limits the disruption of the building blocks, or sub-clusters of atoms. Once the rest of the population was filled in, the whole generational process started again. The parent selection used a high-pressure tournament with as many as four to sixteen participants. When the population was small, the poorer molecules were rarely chosen for crossover.

Selection

Tournament selection was used on the total population to reduce it to a set of parents that would then be used for crossover operations. The number of individuals in each tournament was eight, without replacement. Large tournament sizes represent a very strong selection process. In this work, the typical P_c = 0.8 was used; that is, the selection process kept only 20% of the population and replaced 80% with new molecules each generation.

Objective Function

The Gong potential takes the energy calculation one step further, using bond angles as shown in equation 2.7. The more detailed equation for v_2 has terms similar to the Lennard-Jones but with an added exponential decay term. The v_3 function has terms with the cosine of the angles formed by groups of three adjacent atoms. Details of this energy function can be found in the paper by Sastry [60].

V_{total} = \sum_{i<j} v_2(i,j) + \sum_{i<j<k} v_3(i,j,k)   (2.7)

|f(p_k, p_j)| > |f(p_x, p_y)|   (3.2)

All the vectors must have an identical number of elements. However, all elements need not be identical. This creates a problem in that the concept of distance is altered: elements with incompatible units are difficult to combine in a single measure of distance. It would be possible to use Mahalanobis distances for elements with different units that have known correlations.

3.1.2 Genes

In true biological systems, the relationship between specific genes and functions is often complicated. For the class of problems that spatial operators work best on, a gene is more phenotypical. In molecular conformation, the genes are atom locations. These locations have some constraints.
A molecule, for example, has a minimum distance between atoms that can be derived from the experimental data; similarly, the maximum distance can also be determined. All the atoms must be connected by at least one bond within some range of distances. For many other practical problems that spatial operators can address, there are similar sets of constraints. These constraints can be applied at various stages of the GA algorithm to improve the probability of good fitness and solutions. Equation 3.3 shows the gene as a point in space, an M-dimensional vector of values x_{ji} \in \mathbb{R}.

gene_i = p_i = \{x_{1i}, x_{2i}, \ldots, x_{Mi}\}   (3.3)

3.1.3 Fitness Function

There are two types of distance criterion functions. The first, shown in equation 3.4, is based on the sorted list of distances between atoms from the candidate molecule and the unlabeled target distances. The second function is more common in the literature; it is based on the differences between entries in the D matrix of the candidate molecule and the target information D_0 matrix. This target information is shown in equation 3.5, with elements designated t_{ij}. Several variations of the distance-based criterion function exist; the most common variation takes the square root of the summed function. It should be noted that in equation 3.5 atom labels are implied by the values of i and j, while in equation 3.4 labels are not available.

Obj_d = \sum_{k=1}^{m} (t_k - d_k)^2   (3.4)

Obj_D = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} (t_{ij} - d_{ij})^2   (3.5)

As described in Section 2.2.1, there are many energy-based fitness functions. For the purposes of this work, only distance-based fitness functions will be used. The theory of fitness function contribution is based on the Euclidean distance measure between two points as represented by the homogeneous vectors. A large variety of fitness functions could be included, but in this work they are excluded in order to show the effectiveness of the spatial operators.
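A minimal sketch of the two criterion functions follows; the helper names are assumptions of the example, and the square-root variation mentioned above is omitted.

    import numpy as np

    def _pair_distances(points):
        """Full D matrix and the flat upper-triangle distances."""
        diff = points[:, None, :] - points[None, :, :]
        dmat = np.sqrt((diff ** 2).sum(axis=-1))
        return dmat, dmat[np.triu_indices(len(points), k=1)]

    def obj_d(targets, points):
        """Unlabeled criterion of equation 3.4: sorted candidate distances
        d_k against sorted target distances t_k."""
        _, d = _pair_distances(points)
        return float(np.sum((np.sort(targets) - np.sort(d)) ** 2))

    def obj_D(target_matrix, points):
        """Labeled criterion of equation 3.5: entries of the candidate D
        matrix against the target D0 matrix entries t_ij."""
        dmat, _ = _pair_distances(points)
        iu = np.triu_indices(len(points), k=1)
        return float(np.sum((target_matrix[iu] - dmat[iu]) ** 2))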
3.2 Local Search

The definition of the local search is based on the definition of the local minimum. Schwefel [62] provides a useful definition of the local minimum, adapted here to the objective function used in this work as shown in equations 3.6 and 3.7. The form of the local minimum is very similar to the definition of two molecules being equal. Two molecules are defined as equal if it is possible to rotate and translate them such that they are aligned so that the sum of the distances between corresponding atom centers is below some limit. The local minimum definition of equivalent molecules is stated as follows: applying the local search operator to two molecules will align the individual atoms such that the summed distances between corresponding atoms will be less than some limit. In many respects, any two molecules that are in some region of space will converge to the same local minimum and are therefore considered equivalent.

F(\tilde{P}) \le F(P)   (3.6)

\epsilon > \sum_{i=1}^{N} |p_i - \tilde{p}_i|   (3.7)

Under what conditions can two molecules be different and still converge to the same local minimum? From both an experimental and a geometric perspective, the molecules will converge to the same local minimum if the positions of the atoms are within some range of the ideal locations defined by that specific local minimum. If the topology of the molecule is defined by each atom's relationship to its closest neighbors, then any position such that the relationships of the local minimum and the current state do not change will define the local attractive basin for a given local search and local minimum.

From a pure geometry point of view, as long as the location of each current atom is within some minimum of its ideal location for that local minimum, the topology will not change when the atoms move to the ideal locations. In the case of a straight-line arrangement of atoms, any one atom can move up to the bond length without changing the topology; any two atoms can move up to one half the bond length. This sets an upper bound on the possible space around each atom that will converge to the same local minimum. For the purposes of a local search based entirely on the distances between atoms, some factor can be defined such that, as long as the atoms are within some radius defined by that factor, the search will converge to the local minimum. The factor-of-6 reduction in equation 3.10 is intended to ensure the ability of the local search to converge to the solution. That is, as long as all the locations are within some small sphere of radius R_{min}/6 (where R_{min} is the minimum distance to another point), the topology of the graph will not change and the local search should be able to find the local minimum.

3.2.1 Volume

The probability of locating an atom in the correct location during initialization can be estimated using volumes. If the placement of an atom is random within some bounded volume, and this placement has some tolerance associated with it, comparing the total volume to the volume of a valid placement estimates the probability. Experimental data from the initialization methods has been collected, and the related statistics provide an estimate of the molecule diameter. Equation 3.8 expresses this volume as a sphere with some maximum radius. For M-dimensional spaces, the hyper-cube volume of equation 3.9 can be used.

VOL_{total} = \frac{4}{3}\pi R_{max}^{3}   (3.8)

VOL_{total} = R_{max}^{M}   (3.9)

The factor-of-6 reduction in equation 3.10 is intended to ensure the ability of the local search to converge to the solution, as well as to produce high-quality building blocks. The actual R_{max} will not necessarily be the same as the maximum distance in the distance data set; the random initialization methods can produce larger or smaller volumes, and this must be recognized in the creation of building blocks. The local search will tend to heal crossover-disrupted building blocks and also to combine small building blocks into better quality ones. Equation 3.10 shows the local volume that an atom can reside in and still be moved by the local search to the local minimum position.

LOC_{vol} = \frac{4}{3}\pi \left(\frac{R_{min}}{6}\right)^{3}   (3.10)

The probability of having an atom within this volume is given by the ratio of the target volume to the average volume per atom. Equation 3.11 shows the average volume as a function of the number of atoms and the total volume of the molecule; the ratio is the probability, as shown in equation 3.12.

AVG_{vol} = \frac{VOL_{total}}{N}   (3.11)

P(r) = \frac{LOC_{vol}}{AVG_{vol}}   (3.12)

The nature of the building blocks requires that multiple atoms be in the correct locations. The location of the first atom is arbitrary, because it starts as the anchor of the subgraph. The probability of having the next atom of the subgraph in the right volume of space is given by the volume of viable locations for the second atom; the placement of the third atom has a much smaller restricted volume in which to form a building block or valid subgraph. Using a simplistic probability calculation, the first atom has probability 1 and all the others can be based on the ratio of the local volume to the average volume.

P_{bb}(n) = \prod_{i=1}^{n-1} P(i)   (3.13)

This probability forms the basis for predicting the number of population members needed to have a rich enough set of building blocks to eventually find a solution to the molecular conformation. The calculation based on volume is a lower bound, because the initialization process places atoms one bond length apart; this further restricts the volume and increases the chance of creating a good building block.
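The volume argument of equations 3.8 through 3.13 can be collected into one small sketch. This is an illustration under the assumptions stated above (a spherical total volume, a capture radius of R_min/6, and a constant per-atom ratio); the function and parameter names are introduced here.

    import math

    def p_building_block(k, n_atoms, r_min, r_max):
        """Volume-ratio estimate (equations 3.8-3.13) of the probability
        that random initialization places a k-atom subgraph within
        local-search range; the first atom is the anchor (probability 1)."""
        vol_total = 4.0 / 3.0 * math.pi * r_max ** 3          # eq. 3.8
        loc_vol = 4.0 / 3.0 * math.pi * (r_min / 6.0) ** 3    # eq. 3.10
        avg_vol = vol_total / n_atoms                         # eq. 3.11
        p = loc_vol / avg_vol                                 # eq. 3.12
        return p ** (k - 1)                                   # eq. 3.13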
3.2.2 Distance-Based Local Search

The vector-summed local search calculates the magnitude and direction that each atom contributes to the overall criterion function. The criterion function produces a single scalar term, while the local search produces a vector term for each atom. The local search can then either move the single atom that makes the greatest contribution to the error between the target and the current configuration, or move all the atoms in the direction that corrects their error contributions. If the step size for each correction is too large, the algorithm can become unstable, so the correction factor must be based on the total number of atom-to-atom interactions. The typical magnitude of each corrective step is scaled down to the order of 0.3 to 0.01. These values were selected by experimentation and by a simple analytical case of three atoms: with three atoms, stability for all possible positions is ensured if the step size is less than 0.5. In a real configuration, there is some average contribution from each atom to the correction, and some of these error contributions cancel one another; this leads to the range of values shown. The δ is scaled by the total number m of atom-to-atom distances.

\delta = \frac{3}{m}   (3.14)

Equation 3.15 shows how each contribution is calculated. For each atom pair, the difference between the target distance and the actual distance is calculated and then scaled along the unit vector between the two atoms. While there is no mapping of the t_k target distances to a pair i, j, there is such a mapping for the molecule being examined; the mapping is defined in equation 3.16. Equation 3.17 shows the correction to the atom position, where δ is the portion of the total error by which to move the atom. Two forms of local search are used: one that moves only the atom with the largest magnitude of E_i, and one that moves all atoms. The search converges when the sum of the magnitudes of the E_i falls below some threshold.

E_i = \sum_{j \ne i} (t_{ij} - d_{ij}) \frac{p_j - p_i}{|p_j - p_i|}   (3.15)

\forall d_k \; \exists \; i \in (1 \ldots n), j \in (1 \ldots n) \;|\; d_k = d_{ij}   (3.16)

\tilde{p}_i = p_i - \delta E_i   (3.17)

3.2.3 Convergence Criteria

When the total sum of the magnitudes of the correction vectors becomes small, the local search has converged.

\epsilon > \sum_{i=1}^{N} |E_i|   (3.18)

In the actual operation of the algorithm, full convergence is not required. The actual runs show evidence that a slower convergence, spread over many generations of the GA, is more effective. The basic process applies the local search x times after each mutation, crossover, or reproduction.
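One iteration of the all-atom form of this local search can be sketched as follows. The sketch assumes the unlabeled target distances t_k have already been assigned to atom pairs by the mapping of equation 3.16, so the target is presented as an n x n matrix; the function and parameter names are assumptions of the example.

    import numpy as np

    def local_search_step(points, target, delta=0.05):
        """One iteration of the distance-based local search
        (equations 3.15-3.17).  Returns the moved positions and the total
        correction magnitude used by the convergence test (eq. 3.18)."""
        diff = points[None, :, :] - points[:, None, :]   # p_j - p_i at [i, j]
        dmat = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(dmat, 1.0)                      # avoid divide-by-zero
        unit = diff / dmat[:, :, None]                   # unit vectors i -> j
        err = target - dmat                              # t_ij - d_ij
        np.fill_diagonal(err, 0.0)
        E = (err[:, :, None] * unit).sum(axis=1)         # correction E_i
        return points - delta * E, np.linalg.norm(E, axis=1).sum()

Iterating until the returned total magnitude falls below ε implements the convergence test of equation 3.18.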
3.2.4 Iteration Limit

The number of iterations depends on the step factor used in applying the correction to each atom's location. The stability of this search increases if smaller steps are taken, but the number of iterations then increases; the compromise is that more computational effort is required to iterate repeatedly. It is possible to estimate the maximum number of iterations for convergence, starting with equation 3.19. After c iterations of applying the δ factor to the largest of the correction vectors, the outcome is a final correction vector whose magnitude is less than ε. If the process corrects only one vector at a time, it will take more iterations; the assumption here is that the largest of the correction vectors is chosen, as the worst case.

\epsilon > \max(|E_i|)\,\delta^{c}   (3.19)

Taking the logarithm and solving for the iteration count c can be done as shown in equations 3.20 and 3.21; since δ < 1, ln(δ) is negative, and dividing by it reverses the inequality, giving the final bound on c in equation 3.22.

\ln(\epsilon) > \ln(\max(|E_i|)) + c\,\ln(\delta)   (3.20)

\ln(\epsilon) - \ln(\max(|E_i|)) > c\,\ln(\delta)   (3.21)

\frac{\ln(\epsilon) - \ln(\max(|E_i|))}{\ln(\delta)} < c   (3.22)

3.3 Spatial Concept

There are two spatial concepts involved in this work. The first is the idea of a point in hyperspace defined by the entire set of N vectors of dimensionality M. The second is the location defined by each M-dimensional vector. The relationship between two M-dimensional locations is described as the proximity between the two locations. This proximity is defined as the standard Euclidean distance, or distance squared, between two genes, locations, atoms, or points. The locations are partitioned into K sets by some collection of boundaries. Equation 3.23 shows the union of all the divided sets, and equation 3.24 shows that each vector can belong to only one set.

P = P_1 \cup P_2 \ldots \cup P_K   (3.23)

P_i \cap P_j = \emptyset, \quad i \ne j   (3.24)

There is no requirement that the sets be non-empty or that they hold equal numbers of points. For the work on molecular conformation K = 2, and the number of locations in each set is nearly equal.

The second spatial concept is that of all N vectors as a single location in hyperspace. For the case of molecular conformation, the absolute location is not as important as the space defined by the distances between all the locations. That is, two sets of N locations translated and rotated by arbitrary amounts are identical if the sets of distances derived from those sets of locations are identical. Take, for example, the chromosome defined by N sets of real-valued coordinates of the atoms: there are an infinite number of sets of coordinates that will satisfy the distance matrix.

Two molecules P_1, P_2 are considered equal if there is a rotation and translation matrix that can be applied to one molecule such that the total summed difference in locations is below some value ε. Equation 3.25 shows the two molecules, each represented by a set of points in space. For the two molecules to be equivalent, there must exist a rotation-translation matrix such that, after the operation shown in equation 3.26, the total distance between corresponding point locations calculated in equation 3.27 is below some limit; the two molecules are then considered equal, as shown in equation 3.28. One additional step is needed, because the indices of the atom locations may not be mapped correctly: an additional mapping process must arrange the two sets of atoms so that the indices correspond to the correct pairs of atoms.
P_1, P_2   (3.25)

\exists M \;|\; \tilde{P}_1 = M P_1   (3.26)

\epsilon > \sum_{i=1}^{N} |\tilde{p}_{1,i} - p_{2,i}|   (3.27)

P_1 = P_2   (3.28)

3.3.1 Division of Space

The concept of building blocks needs some method of mathematical division for identifying which atoms belong to each building block set. The crudest division breaks the set into two pieces. This can be accomplished with the basic crossover operator that Deaven and Ho [13] introduced, and it can be generalized to M-dimensional spaces, as shown below, using the definition of the normal to a hyperplane. The number of methods that can be used to divide the space is large: any solid shape can serve as the base for dividing the space into subsets, and any combination of planes (three-dimensional) or hyperplanes (M-dimensional) can be used.

By Single Hyperplane

Once a normal to the plane is created, determining which side of the plane an atom lies on can be achieved using the vector to the atom and the normal to the plane. The definition of this normal is shown in equation 3.29; the normal passes through the center of mass.

V_n = a_1 x_1 + a_2 x_2 + \ldots + a_M x_M   (3.29)

Determining which set a point belongs to can be accomplished with a dot product operator. Let the set of points described in equation 2.11 be divided into two sets, and assume that the hyperplane includes the point (0, 0, \ldots, 0). A point is on the positive (p) side of the plane if the dot product is positive; otherwise, it is on the negative (n) side.

This section introduces a spatial concept of building blocks and formalizes the operation of dividing the space by one or more hyperplanes. Good building blocks can be thought of as collections of points in space whose locations are within some region and that produce good contributions to the criterion functions.

P = P_p \cup P_n   (3.30)

\forall p_i \in P \;|\; p_i \cdot V_n \ge 0 : \; P_p = P_p \cup p_i   (3.31)

\forall p_i \in P \;|\; p_i \cdot V_n < 0 : \; P_n = P_n \cup p_i   (3.32)

Formulation 3.30 divides real space with a single plane and assigns the points in that space to the two subspaces. When the boundary is made of multiple hyperplanes, each described by a center point and a normal passing through that center, determining whether a point occupies the bounded region can be accomplished by repeated application of the dot product for each bounding plane, using the vector from each center point to the atom (an arbitrary point) and the hyperplane's normal. If the set of bounding hyperplanes is convex, this produces an easy division; if the planes form a concave shape, the orientation of the normals must be consistent in terms of the topological manifold's surface. Equations 3.31 and 3.32 demonstrate the division of the points into the two sets P_p and P_n.

The formulation of the division is general to M-dimensional points and is not limited to 2D or 3D spaces. In terms of genetic algorithms, this is no different from two-point crossover on a linear arrangement of genes. For two-point crossover, the linear arrangement can be thought of as a continuous circle; the two-point method is then the same as drawing a line through the center of the circle that divides the chromosome into two parts with nearly equal numbers of genes. There is, in general, no requirement that equal numbers of genes come from each parent. In the later stages of a typical genetic algorithm, it will be shown that if equal numbers of genes were required, improvement of better chromosomes would be unlikely to occur, because of the distribution of bad genes among the good genes.
Smaller partitions can be made by the repeated use of more than one plane to divide the space. This is analogous to N-point crossover, but in a three-dimensional setting. For example, when an atom is near enough to a good location it can be coded as a binary "1", and those far away can be coded as "0". Good schemata would have many ones, and the grouping of ones would provide a building block. Viewed as a graph, the relationships between all the good locations would look like a K_n graph. Further, the order would be the number of points in a good location, but the length would need to be measured in a new way, such as the diameter of the "1" nodes on the K_n graph.

By Multiple Hyperplanes

Division of space can be expanded to multiple subspaces in two ways. The first is to continue to find orthogonal planes and divide the space into quarters, then eighths, and so on. The second method uses an arbitrary number of planes. Each volume of space is defined by a set of planes and their intersections; a specific subspace is bounded by a set of planes. The signs resulting from all the dot product operations form a binary set of bits, and each pattern of bits defines a distinct volume of space. Figure 3.1 shows a simplified example of multiple-plane division using atoms in a plane and a set of three lines; for each line there is a small arrow indicating the positive side of the dot product.

[Figure 3.1: Example of multiple plane division.]

Table 3.1 shows the binary patterns that determine in which subspace the individual atoms are located. In this table, the subspaces are identified by the numerical representations 0 through 7, as interpreted from the binary pattern of L1, L2, and L3. For example, subspace P1 consists of atoms {C, D, E}, and subspace P0 = {F, G, H}. Some subspaces, such as P7 and P3, are empty sets. The label of each subspace set is based on the binary value of the bits in the table: a bit is set if the atom produces a positive dot product, and a zero indicates a negative dot product.

    atom   L1   L2   L3
     A      1    0    1
     B      1    0    1
     C      0    0    1
     D      0    0    1
     E      0    0    1
     F      0    0    0
     G      0    0    0
     H      0    0    0
     I      0    1    0
     J      1    1    0
     K      1    1    0
     L      1    0    0
     M      1    1    0
     N      1    1    0
     O      1    1    0

Table 3.1: Multiple plane division of space

For the purposes of crossover, each subspace can be assigned to one of the parents. For the case where the number of genes is fixed, there is no expectation that the number of genes produced in each half of the space will total the fixed number. In these cases, the boundaries can be moved or placed in such a way as to provide a clean division with the correct total number of genes. In the simple division of the molecule through the center of mass by a single plane, the chance of finding a clean division by moving the plane slightly is high, and the crossover contribution from each parent is evenly divided. Several experiments with unequal division were conducted and gave poor results compared to equal division.
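The bit-pattern bookkeeping of Table 3.1 can be sketched as follows; the function name and the representation of each hyperplane by a center point and a normal are assumptions of the example.

    import numpy as np

    def subspace_labels(points, normals, centers):
        """Assign each point a subspace id from the pattern of dot-product
        signs against a set of hyperplanes: bit k of the label is 1 when
        the point lies on the positive side of plane k."""
        labels = np.zeros(len(points), dtype=int)
        for k, (n_vec, c) in enumerate(zip(normals, centers)):
            positive = ((points - c) @ n_vec) >= 0
            labels |= positive.astype(int) << k
        return labels

With the three lines of Figure 3.1 supplied as planes, the returned labels 0 through 7 reproduce the subspace assignments of Table 3.1.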
3.4 Spatial Building Blocks and Subgraphs

Building blocks and schemata are closely related in the theory of genetic algorithms. In this work, building blocks are a subset of the more general concept, in that genes in the chromosome are linked to each other by proximity in space. For molecules, it is easy to find some range of distances that defines a bond, and therefore a connection between two atoms. A subgraph of L atoms that are all connected by bonds will be considered a building block, or schema, of order L; the abbreviation is sg_L.

The advantage of spatial operators is supported by analysis of the numbers and locations of these subgraphs during the operation of the genetic algorithm. Just as in basic building block theory, the construction of larger and more numerous sg_L is fundamental to finding good solutions for the problems addressed here. To assist in the analysis of the sg_L, a number of concepts must be reviewed. First is the role that proximity between locations plays in the fitness functions. Next, for small molecules it is possible to enumerate the sg_L directly; for larger molecules the counting is more difficult, and R-regular graphs provide a way to estimate the sg_L population and growth. Several examples of actual sg_L counting will be included, for the N11 cluster, the cube, and the Bucky ball. These are provided to show the application of the counting methods and to provide estimates of sg_L disruption by crossover operators.

Several actual examples of simple molecules will be used to demonstrate the counting of subgraphs and to document the disruption of subgraphs by crossover operators. For large molecules, the count of subgraphs can be estimated; the example used here is regular graphs with R edges per node. This is an appropriate model for many large molecules, where the number of bonds per atom lies in a limited range of three to six. The simple example explored is the cube structure, for which all the subgraphs are enumerated and the effects of crossover illustrated. For a more complex example, the Lennard-Jones molecule N11 is also examined. The final example of actual subgraph counting is the Bucky ball, which is often the example used to demonstrate the effectiveness of the spatial operators.

3.4.1 Definition

A spatial building block, or subgraph, is defined as a collection of nodes or atoms that are close enough to have a high degree of interaction. The collection will typically be described by a limited set of edges, or bonds, that connect the atoms; the key element is that the nodes or atoms are in close proximity to each other. There are several issues associated with proximity. The interaction of points can be characterized by linear and non-linear models; for atom attraction and repulsion, the interactions can be highly non-linear. Even in the example of molecules with distance-based search, the quadratic nature of the fitness function can be significant. Where points are arranged in a single sequence, proximity is based only on the next few points in line; for three-dimensional points in space, proximity is based on the closest other points.

3.4.2 Counting Spatial Subgraphs

There are two ways to count the number of subgraphs of a specific configuration: use the number of nodes (atoms) to describe the size of the subgraph, or alternately use the edges (bonds). Since cutting a molecular structure by planes breaks bonds, developing the counts based on atoms is more useful. The requirement for a building block is that all the nodes be connected by at least one edge. If a more general counting scheme were used, in which all possible edge combinations that can connect N locations are counted, the number of subgraphs would be even larger, since a given subgraph can be connected by several distinct sets of edges.
In this work, as long as there is at least one set of edges that connects N locations, that set of N locations is counted only once, regardless of how many edge sets may connect the N nodes. In the pure binary chromosome representation, the number of schemata is easily computed using combinations. For this theory, the construction of subgraphs with the proper number of nodes is the basis for counting; further, the ways in which the nodes are combined can have several permutations of edges. For this work, a special program using recursion on the graph representation was used to count the building blocks; a sketch of such a count is given at the end of this section. For large graphs such as the Bucky ball, estimation methods based on a single node are used. For smaller molecules such as the N11 Lennard-Jones cluster, a complete count was made with the recursive program.

3.4.3 Regular Graph Model

Large molecules can be approximated with R-regular graphs. The interaction of the atoms in space can be modeled as a graph, with the atoms as nodes and the bonds as edges. In other models, the edges are selected by some set of closely proximate points. These models are normally considered to be regular graphs with vertex order 3, or R3 graphs.

[Figure 3.2: Building block counting on an R3 graph.]

Other regular graphs such as R4 and R5 may be considered, but many of the practical applications are covered by R3. The number of overlapping building blocks for a simple cube and a Bucky ball will be demonstrated; for the Bucky ball, only building blocks of 3 to 5 atoms will be constructed.

bbcnt(r, p, k, f) = \begin{cases} 1 & r = 0 \\ \sum bbcnt(\ldots) & r \ne 0 \end{cases}   (3.33)

P_{improvement} = \sum P_{bb}(i) \times P_{bb}(j)   (3.36)

\lim_{N \to \infty} P_{improvement} \approx 0.437   (3.37)

3.4.4 Cube

When dealing with a cube, as shown in Figure 3.3 with labeled nodes, the goal is to enumerate all the building blocks associated with the cube, to demonstrate the concept of subgraph counting for spatial applications.

[Figure 3.3: Cube schema, an R3 graph.]

As is well documented, increasing the number of points used in classic crossover operators increases the likelihood of disrupting good schemata [64]; the same is true of spatial crossover operators. A single cutting plane for spatial crossover is similar to two-point classic crossover on a linear arrangement of genes, where the two points must divide the chromosome into equal halves. Given the definition of subgraphs as building blocks, a type of schema, it is instructive to compare the disruption of those subgraphs by the two crossover methods.

The N2 building blocks are the edges of the cube. For the N3 building blocks, each vertex is at the center of three unique N3's, giving a total of 24. It should be noted that in normal schemata each combination of items would be valid; here, however, the locations have to be connected, which eliminates many of the possible building blocks. The full table of building blocks, enumerated by size, is shown in Table 3.2, which also shows the number of building blocks destroyed by a cutting plane. All the building blocks above N4 are destroyed. Even for this simple example of the cube, the disruption of the small subgraphs is significantly reduced by the spatial crossover.

    order   Number   Broken by Plane   2 Point
     N2       12            4             10
     N3       24           16             20
     N4       38           36             35
     N5       48           48             48
     N6       28           28             28
     N7        8            8              8
     N8        1            1              1

Table 3.2: Building blocks for the cube
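A brute-force version of the recursive subgraph count can be sketched as follows; it is an illustration, not the dissertation's counting program. The adjacency list for the cube is included so the N3 entry of Table 3.2 can be checked.

    from itertools import combinations

    def count_connected_subgraphs(adj, size):
        """Count node sets of the given size that are connected through at
        least one path of edges, i.e. the sg_L building blocks of this
        section.  Brute force over all combinations; fine for small graphs."""
        nodes = list(adj)
        count = 0
        for combo in combinations(nodes, size):
            chosen = set(combo)
            # depth-first search restricted to the chosen nodes
            seen, frontier = {combo[0]}, [combo[0]]
            while frontier:
                v = frontier.pop()
                for w in adj[v]:
                    if w in chosen and w not in seen:
                        seen.add(w)
                        frontier.append(w)
            count += len(seen) == size
        return count

    # The cube as an R3 regular graph: 8 vertices, three neighbors each.
    cube = {0: [1, 3, 4], 1: [0, 2, 5], 2: [1, 3, 6], 3: [0, 2, 7],
            4: [0, 5, 7], 5: [1, 4, 6], 6: [2, 5, 7], 7: [3, 4, 6]}
    # count_connected_subgraphs(cube, 3) returns 24, the N3 entry of Table 3.2.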
Table 3.2: Building blocks for cube

order   Number   Broken by Plane   Broken by 2-Point
N2      12       4                 10
N3      24       16                20
N4      38       36                35
N5      48       48                48
N6      28       28                28
N7      8        8                 8
N8      1        1                 1

3.4.5 N11 Lennard-Jones Molecule

Now that the general method for counting subgraphs for large molecules modeled as R regular graphs, and for a simple molecule arranged as a cube, has been examined, it is appropriate to explore a moderate-sized molecule. Two examples of the N11 Lennard-Jones molecule are examined. For this molecule, the size of each subgraph is described by the number of nodes it contains, together with the total number of subgraphs of that size. In this example, the size of each subgraph is shown in the first column. The total number of subgraphs at each size is shown in the second column. The third and fourth columns show the number of subgraphs that are broken by a plane division and by two-point crossover, respectively. In Table 3.3, the number of smaller subgraphs that are disrupted is significantly less with the plane division than with two-point crossover. The subgraphs with size less than five atoms or nodes show a reduced number of disruptions. As the size of the subgraph grows, just about any division in the chromosome is likely to disrupt the subgraph.

Table 3.3: Building blocks for ideal N11 LJ

order   Number   Broken by Plane   Broken by 2-Point
N2      31       12                28
N3      91       65                86
N4      215      195               210
N5      374      367               369
N6      438      437               435
N7      328      328               328
N8      165      165               165
N9      55       55                55
N10     11       11                11
N11     1        1                 1

When a molecule is examined after random creation and the application of the first local search, the number of good subgraphs is much higher than would be expected by pure random placement. Table 3.4 shows the number of good subgraphs after initialization and then after the first local search is completed. While the local search does not significantly improve the number of small subgraphs, it does greatly increase the number of larger order subgraphs. The subgraphs with seven atoms increase more than two-fold after the local search is completed. The effect of the local search in repairing subgraph disruptions is clearly demonstrated.

Table 3.4: Building blocks for N11 LJ, random initialization and after local search

order   Number   Random init   After local search
N2      31       10            12
N3      91       11            16
N4      215      15            25
N5      374      20            37
N6      438      23            46
N7      328      18            44
N8      165      9             34
N9      55       2             20
N10     11       0             8
N11     1        0             1

Random initialization makes the following assumptions. First, each atom is connected to at least one other atom. Second, these connections must be at least the minimum distance in the target table. Using these two constraints and then placing atoms in a random direction will produce valid molecules that have higher numbers of valid subgraphs than would be expected from placing atoms at random locations in some bounding sphere.

Table 3.4 contains the actual building block counts for a typical member of the population: the randomly initialized molecule at the start of a run, along with that same molecule after the application of the local search. It is clear that the initialization process builds many more blocks than would be expected if true random locations were chosen. In fact, the initialization process requires that each atom is one minimum distance from another and that no two atoms are too close together. Therefore, the worst case configuration would be a straight line of atoms. In practice, the process yields a typical configuration as shown in Figure 3.4.

Figure 3.4: Initialized molecule for GA run
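A minimal sketch of such an initialization routine, under the two constraints just listed, might look as follows; the rejection-sampling loop and the parameter `d_min` (the minimum distance from the target table) are illustrative assumptions rather than the thesis's exact procedure.

```python
import math
import random

def random_init(n_atoms, d_min, max_tries=1000):
    """Grow a random molecule: every new atom is placed exactly one bond
    length (d_min) from some already-placed atom, in a random direction,
    and is rejected if it comes closer than d_min to any other atom."""
    atoms = [(0.0, 0.0, 0.0)]
    while len(atoms) < n_atoms:
        for _ in range(max_tries):
            ax, ay, az = random.choice(atoms)      # atom to attach to
            # uniform random direction on the unit sphere
            z = random.uniform(-1.0, 1.0)
            phi = random.uniform(0.0, 2.0 * math.pi)
            s = math.sqrt(1.0 - z * z)
            p = (ax + d_min * s * math.cos(phi),
                 ay + d_min * s * math.sin(phi),
                 az + d_min * z)
            # minimum-distance constraint against all placed atoms
            if all(math.dist(p, q) >= d_min - 1e-9 for q in atoms):
                atoms.append(p)
                break
        else:
            raise RuntimeError("placement failed; retry initialization")
    return atoms
```

Placing each atom exactly one bond length from an existing atom is what biases the initial population toward the small-$sg_L$ counts shown in the "Random init" column of Table 3.4.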
Figure 3.5 shows the same molecule after the local search has been completed. Comparison of the two figures shows the significant positive impact that the local search has on the molecules early in the GA run.

Figure 3.5: Initialized molecule for GA run after local search

3.4.6 Bucky Ball

The Bucky ball is an example of the regular style of graph, and the subgraph building blocks of interest are on the order of five or fewer atoms. First, Figure 3.6 shows a single bond between two atoms and all the subgraphs up to a size of five atoms. If the bond is broken by the crossover operator, then all the subgraphs shown will be broken.

Figure 3.6: Bucky ball schema

Figure 3.7 shows a 2D representation of the Bucky ball and a simulated division down the center. In Figure 3.7 it is possible to see the 14 edges that are broken by the plane.

Figure 3.7: Bucky division shown in 2D

Table 3.5 shows the estimates for the number of subgraph disruptions for the plane division and the two-point crossover. This table will be used as the basis for comparison of the two crossover methods in the next section.

Table 3.5: Bucky building blocks

From this table, it is clear that, especially for the Bucky ball, the number of disrupted subgraphs is much smaller for plane division than for two-point crossover. The reduction in the disruption is on the order of 3:1 for the smaller subgraph sizes.

3.5 Crossover

The previous section on counting subgraphs provides the foundation for the examination of crossover based on spatial operators. Crossover has long been known to have detrimental effects on building blocks, as well as its positive effect in assembling larger ones. The role of the crossover is two-fold in this application. First, it spreads good mutations and good building blocks throughout the population. Second, it slices the molecule in such a way that most of the good subgraphs are preserved and the bad ones are isolated. In the simplest of perspectives, when the number of good building blocks is small, the odds of constructing a better one are on the order of .25, assuming the good building blocks are distributed unevenly among the four halves and that the two parents have roughly the same numbers of building blocks. As the number of good subgraphs increases, the effectiveness of the crossover will change dynamically. The larger the number of good subgraphs, the more disruptive crossover will be, up to some limit. Then an equilibrium state will be reached. This state will have some constant probability of improvement. In standard crossover operators, where there is no provision for preserving spatial-based arrangements, the disruption is even greater. The first section will examine the impact of standard simple one-point and two-point crossover on subgraphs. The next section will describe the spatial crossover impact.

3.5.1 Standard Crossover

For a standard two-point crossover operator, there is one important question: what is the probability of disrupting the building blocks? The following assumptions are made. The location of a gene in the linear arrangement has no correlation to its spatial location relative to any coordinate system. For one-, two- or N-point crossover, the probability that a given gene of a building block will lie in the same half section of a molecule is roughly P(S) = .5 for each selection. This is based on the fact that a plane was used to divide the molecule such that roughly half the atoms are on each side of the plane.
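Under this independence assumption, the expected disruption of an order-k building block by a division into equal halves can be worked out directly: the block survives only if all k of its atoms land on the same side, so

$$P_{survive}(k) = 2\left(\frac{1}{2}\right)^{k} = \left(\frac{1}{2}\right)^{k-1}, \qquad E[\text{destroyed}] = n_k\left(1 - \left(\frac{1}{2}\right)^{k-1}\right),$$

where $n_k$ is the number of order-k blocks. For the Bucky ball, whose 60 degree-3 vertices give $n_2 = 90$ bonds and $n_3 = 60 \times \binom{3}{2} = 180$ three-atom paths, this predicts $90 \times \frac{1}{2} = 45$ destroyed N2 blocks and $180 \times \frac{3}{4} = 135$ destroyed N3 blocks, matching the figures quoted below.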
Using the example of a Bucky ball, the standard crossover will destroy forty-five of the ninety (N = 2)-order building blocks. The standard crossover will also break the chromosome in two places; the additional break will lead to a slightly higher probability of disruption for the building blocks. For N3, the survival factor will be .25, so in this example 135 building blocks will be destroyed. This ignores the additional effect of the two points breaking a possible set of three genes. It is clear that, as the order progresses, virtually all the building blocks will be destroyed by these methods of crossover. Each linear chain of linkages only corresponds to one building block, as described above. Only a small subset of all the interactions can be preserved by linear linkages. For example, if the program learned twenty N3 linkages, then the number of building blocks destroyed during crossover would be 160, or 88%.

3.5.2 Spatial Crossover

There are several ways in which the spatial crossover is superior to the standard crossover methods. First, the number of building blocks preserved is much higher, especially for small building blocks. Second, the nature of recombining two sets, where the absolute reference can be completely arbitrary, requires some elasticity. In the spatial crossover, proposed atoms or locations are added from the "inside out", with each addition checked for minimum distance. This allows for few failed crossovers that violate the minimum distance constraint. There is a finite likelihood that a crossover operation will fail. If this occurs, it is treated as a reproduction selection, with the best parent propagated to the next generation.

The preservation of the subgraphs or building blocks for small orders of N is much better for the spatial crossover. In the case of the Bucky ball, the number of N2 subgraphs destroyed is only 14, as compared to 45 for the two-point crossover. As the building blocks get larger, the number of subgraphs destroyed by crossover eventually reaches 100% for both methods of crossover. Based on this, how is it possible that any solutions are found? A solution can stem from two different areas. First, the population can become much less diverse, such that only molecules of very similar construction are taking part in the crossover. In this case, the atoms are essentially in the correct locations and only those that are different will be affected. That means all the subgraphs that are good in both molecules will be good when the spatial crossover or two-point crossover is conducted. For less diverse populations, the disruptions are fewer; however, it is much harder to search for the solutions.

3.6 Mutation

In standard mutation, the magnitude and the quantity of genes altered to effectively explore the solution space need to be understood. For mutations that move the locations only slightly, the effect will be undone by the local search. Therefore, to be effective in the exploration, the mutations must significantly alter enough locations that the topology of the interactions changes. Two ways of accomplishing this are to create a large number of large displacements, or to take a single location and relocate it relative to another point. Both are needed. The first style does a local search around the current minimum, and the second significantly moves to the region of another minimum far away.

3.6.1 Simple Mutation

Mutation of the simple type can be defined as described in Equation 3.38, where each $r_i$ is a random number with $0.0 \le r_i \le 1.0$.
This vector R represents the direction and the magnitude of the displacement of a single atom from its current location. In addition to the basic movement of an individual atom, the mutations can all be repeated several times. If the atom is chosen at random, it is possible to choose a single atom more than once in a given mutation operation with X repetitions. There are three components to the simple mutation: the random direction, the magnitude of the displacement of the atom moved, and how many times to repeat the mutation process on a given molecule.

Simple Mutation Magnitude and Direction

The distribution of the magnitude $mag$ can be Gaussian around some mean or uniform random in some range. The mutation, as shown in Equation 3.39, is simply the sum of the current location and the random vector.

$$R = mag \cdot (r_1, r_2, r_3) \quad (3.38)$$

$$P_i = P_i + R \quad (3.39)$$

The final position of the atom is based on its original position: it is displaced in a random direction by a random magnitude.

Simple Mutation Repetition

A repetition is defined as one individual atom mutation. It was necessary to apply a simple mutation multiple times to a given atom to create the needed exploration. In some works cited [13][27], the number of mutations used was nearly half the number of atoms in the molecule. This is very disruptive. The effect of the local search will return many of the atoms to their original locations. Another issue for exploration arises when the same atom is chosen multiple times in a given mutation operation. Even if the magnitude of the displacement vector is small, the overall motion can create significant changes in the topology of the molecule. An interesting trade-off is that, while this will destroy more subgraphs, it will also increase the chances of building a new needed subgraph. Repetition is also applied to the relocation mutation.

3.6.2 Relocation Mutation

Mutation of a localized nature would still use a random vector, but would choose an atom to move and a new atom to move it near. The location near the new atom will be based on the random vector, but with $mag = \min(t_k)$, where $t_k$ is the target data set, which contains the minimum distance that any two atoms can be apart. Equation 3.40 shows this mutation of the $i$ atom moved to a location near the $j$ atom.

$$P_i = P_j + R, \qquad |R| = \min(t_k) \quad (3.40)$$

For both types of mutation, the number of atoms moved can be adjusted, as well as the rate of occurrence of mutation among members of the population. Some current methods choose either crossover or mutation, while others choose to mutate independent of prior operations. All of the mutations are subject to the minimum distance constraint as described by Equation 3.41.

$$\forall\, i \neq j \quad |P_i - P_j| > \min(distance) \quad (3.41)$$
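The two mutation operators can be sketched in a few lines of Python; the helper names, the uniform magnitude distribution, and the list-of-tuples molecule representation are illustrative assumptions, and the Equation 3.41 check (followed by local search) is applied separately after either operator.

```python
import math
import random

def rand_unit():
    """Uniform random direction on the unit sphere."""
    z = random.uniform(-1.0, 1.0)
    phi = random.uniform(0.0, 2.0 * math.pi)
    s = math.sqrt(1.0 - z * z)
    return (s * math.cos(phi), s * math.sin(phi), z)

def simple_mutation(atoms, mag, repetitions):
    """Equation 3.39: displace a randomly chosen atom by a random vector;
    with several repetitions the same atom may be picked more than once."""
    for _ in range(repetitions):
        i = random.randrange(len(atoms))
        m = random.uniform(0.0, mag)          # could also be Gaussian
        dx, dy, dz = rand_unit()
        x, y, z = atoms[i]
        atoms[i] = (x + m * dx, y + m * dy, z + m * dz)

def relocation_mutation(atoms, d_min):
    """Equation 3.40: move atom i next to a different atom j, one minimum
    target distance (d_min) away in a random direction."""
    i, j = random.sample(range(len(atoms)), 2)
    dx, dy, dz = rand_unit()
    x, y, z = atoms[j]
    atoms[i] = (x + d_min * dx, y + d_min * dy, z + d_min * dz)
```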
3.7 Viable Population Size

Exploration is an interesting term in the GA vocabulary. While its definition might seem obvious, its role in the operation of a genetic algorithm is not always obvious. The full phrase should be to "explore likely areas of the solution space". The problem is how to identify the likely areas. Further, exploring the same localized area wastes effort; repeatedly searching the same localized areas is not effective. Exploration can be achieved by an abundance of good genetic material, combined by crossover across likely local areas. Another way is to provide a high degree of mutation that can explore the search space. The probability of having a certain number and size of building blocks in a population of a given size provides the basis for selecting the population size. If a minimum number of these blocks must be present in the population at any point in the solution search, then this becomes the limiting factor for the determination of population size. Knowing the probability of having a building block of size L and the minimum number of blocks needed, the population size can be estimated by a simple fraction. The lower bound for the probability of creating a building block was shown in Equation 3.13 as the product of individual probabilities for occupation of a desired volume. The quantity $P_{bb}(L)$ is the chance of creating a building block with L atoms.

Equation 3.42 shows the expected number of building blocks of size L created in a single chromosome with N atoms.

$$P_L = \binom{N}{L} P_{bb}(L) \quad (3.42)$$

It is possible to have an abundance of small building blocks, but the probability of larger building blocks decreases rapidly. Equation 3.43 provides the estimate of Q, the number of population members needed to have a given count of building blocks of a given size. The actual count needs to be several times the total needed at a given size in order to be effective. If $P_L$ is on the order of .001 for L = 5 and the count needed is 20, this predicts a population size of over 20,000.

$$Q = \frac{BB_{count}}{P_L} \quad (3.43)$$

It should be noted that the actual number of subgraphs of order L is much higher than the random probability of locating atoms in volumes would predict. This is due to the initialization method. The random molecule initialization routine locates atoms only one bond length apart. In doing so, it significantly increases the probability of constructing the smaller subgraphs, of order L < 5. It also ensures that some of the larger subgraphs are constructed. In the case where the known solution subgraphs at each size L can be counted, it would be possible to predict the number of individuals needed in the population to have all the subgraphs represented.

3.7.1 Initial Population of Genes

The real question to be addressed is how large a population is needed for an effective algorithm that finds solutions with some acceptable probability of success. Two closely related issues are the expected number of $sg_L$ that the random initialization produces, and the effectiveness of combining good $sg_L$ that are distributed throughout the population. The L in the suffix refers to the number of atoms in each subgraph.

3.7.2 Reduction in Gene Pool by Selection

For the algorithm used in the molecular conformation, only one child is produced from a crossover operation. As a consequence, genetic material from the unused halves is not propagated into future generations. This loss has the effect of reducing the gene pool. A highly fit individual is more likely to be chosen multiple times and will provide material in random halves, so that good material is not lost. Having a population large enough to contain at least one of every needed subgraph created during the initialization process is not feasible. Just using the probability based on volume occupation with effective local search still requires a population several times larger than those proven effective in practice. The explanation for this is based on the probability of constructing new useful subgraphs. One interesting property of an increasing number of similar molecules is that the disruption of good subgraphs by spatial crossover will be reduced.

3.7.3 Introduction of New Subgraphs

A new subgraph can be created by the local search, by random combinations along the boundary of the cutting plane, and by mutation.
Each of these areas is important for the operation of the GA. For each of these areas, the operator may destroy good subgraphs as well as create new material. Multiple copies of good material distributed among the rest of the population are critical to allow the newly created material to be combined into better molecules. Early in the GA run, the probability of creating new subgraphs is very high. As the run progresses, the dynamics change to the point where the creation of a new subgraph is much less frequent and the destruction of subgraphs is more common.

Creation of Subgraphs by Crossover

In the case where two molecules are combined, a new subgraph can be created along the cutting plane. Cutting the molecules in half will also destroy a certain number of good subgraphs. The probability of creating a new good subgraph can be estimated from the number of atoms in the neighborhood of the cutting plane. These atoms occupy a volume based on a disk of radius equal to that of the molecule and a thickness of one bond length. The number of atoms in this region can be estimated by assuming a spherical volume for the molecule. For the creation of a new subgraph in the cutting plane area, the two parents must have different configurations of atoms. If the configurations are the same, the cutting of the molecules would preserve the existing subgraph and not create a new one. Further, only the differences in the region of the cutting plane are important for the introduction of new subgraphs.

As described in Section 3.4.3, if the disruptions of the subgraphs by the cutting plane are ignored, the probability of creating a new child with more subgraphs than either parent, based on a random distribution of the subgraphs in each half, can be as high as .437 in the limit, even for small subgraphs. This is clearly seen in the first few generations of the GA run. This rate of improvement is not sustained because, as the number of good subgraphs increases, the disruptive issues come into equilibrium with the constructive mechanisms.

The total number of subgraphs cut by the dividing plane also represents the number of possible new subgraphs that can be completed with the spatial crossover operator. Subgraphs that are the same in both parents, even if cut by the plane, will be completed again during the crossover and will not be lost. Those atoms in the immediate area of the cutting plane that are either part of a good subgraph that was broken, or simply in that volume of space, become the possible candidates for completing new subgraphs.

$$V_{disk} = \pi \min(d_k) \left( \frac{\max(d_k)}{2} \right)^2 \quad (3.44)$$

$$V_{sphere} = \frac{4\pi}{3} \left( \frac{\max(d_k)}{2} \right)^3 \quad (3.45)$$

$$Atoms = N \frac{V_{disk}}{V_{sphere}} = N \frac{3 \min(d_k)}{2 \max(d_k)} \quad (3.46)$$

First, Equation 3.44 gives the volume of the cutting plane region based on a single bond length, the minimum distance between atoms. Second, Equation 3.45 gives the volume of the molecule based on the maximum distance in the molecule. The final Equation 3.46 gives the number of atoms in the region that can become part of new subgraphs, based on the total number of atoms in the molecule. The local search will have the effect of adding each atom that is not already part of a good subgraph to a good subgraph. This represents the upper bound on the number of new subgraphs that can be created in this region. The lower bound is zero.
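Taking the reconstructed Equations 3.44-3.46 at face value, the expected number of atoms available at the crossover seam reduces to a one-line estimate; the function below is a sketch under that assumption, with illustrative names.

```python
import math

def atoms_near_cut(n_atoms, d_min, d_max):
    """Equations 3.44-3.46: atoms in the cutting-plane slab, estimated as
    the slab-to-sphere volume ratio times the total atom count."""
    r = d_max / 2.0                           # molecule radius
    v_disk = math.pi * r ** 2 * d_min         # slab one bond length thick
    v_sphere = (4.0 / 3.0) * math.pi * r ** 3
    return n_atoms * v_disk / v_sphere        # = n_atoms * 3*d_min/(2*d_max)

# e.g. a hypothetical 60-atom molecule whose largest distance is about
# seven bond lengths: atoms_near_cut(60, 1.0, 7.0) -> about 12.9 atoms
```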
Creation of Subgraphs by Mutation

The introduction of new subgraphs by simple mutation has two problem areas: the mutation may break a good subgraph, and the movement may not be large enough to escape the local search. Relocation mutation avoids the problem of inability to escape from the local search by relocating atoms to new neighborhoods in the molecule. This forces an exploration of possible solutions. Simple mutation has a much smaller chance of exploration if the magnitude of the moves is less than one half of the bond length. Virtually every repetition of mutation will destroy some number of $sg_L$ subgraphs. Simple mutation explores topologies that are close to each other in space, while relocation mutation guarantees that new topologies of the molecule farther from each other are explored. Some number of simple mutation repetitions will be undone by the local search effects, returning some atoms to their previous locations. For some number of atoms moved, the local search will move the atoms to new desired locations and thus explore a nearby topology.

The net effect on the creation of subgraphs is more subtle. The added effect of local search will make every atom part of an $sg_2$ subgraph. The real possibility of increasing the longer subgraphs will depend on the defects that need to be fixed. If the movement of an atom to a location completes many subgraphs of several lengths, the overall fitness improves more than if only one subgraph is completed. The effect can be simplified to that of defective and non-defective locations for each atom. An estimate of the impact of mutation could be as simple as determining the probability of moving an atom to create one more defect, two more defects, no net change in defects, or one fewer defect. Assume that there are x defective atoms. There are N - x locations where, if an atom is removed, its removal can create another defect. Similarly, there are N - x locations where, if an atom is added, it creates a new defect. In addition, there are approximately x locations where placing an atom will remove a defect. Equation 3.47 shows that as x gets small, the probability of reducing defects falls rapidly toward zero (for a single defect, on the order of $1/N^2$). For larger molecules with more atoms, this represents a limitation on improvement of the structure by mutation.

$$defect_{reduced} \approx \left( \frac{x}{N} \right)^2 \quad (3.47)$$

$$defect_{increased} \approx \left( \frac{N - x}{N} \right)^2 + \frac{x(N - x)}{N^2} \quad (3.48)$$

Equation 3.48 shows that, for small values of x, nearly every mutation will add defects. The dilemma is how to perform enough mutations so that a reduction in defects has a chance to occur, while at the same time maintaining a fitness value high enough that the molecules will survive in the population. This also means that when a defect reduction occurs, the process of recombination must somehow preserve the reduction and remove all the other defects that were introduced.

Subgraph Creation with Local Search

The local search will tend to move atoms to good locations and increase the number of subgraphs. The N11 molecules examined in Section 3.4.5 show the effect on the count after application to randomly initialized molecules. In broad terms, any atom that is in the right vicinity will be moved to the correct location by local search. There are two effects of crossover that impact the local search. First, when crossover occurs, atoms cannot be placed too close together and are adjusted slightly as needed to maintain the minimum distance. Second, when the two halves are combined, the atoms near the center of mass are added first. The process then works its way toward the outer edge of the molecule, adding atoms. This will tend to prevent minimum distance violations and increase the number of successful crossover operations.
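The inside-out assembly just described can be sketched as follows; this version simply rejects a recombination that cannot satisfy the minimum-distance constraint (falling back to reproduction, as described in Section 3.5.2), whereas the actual operator nudges atoms slightly instead.

```python
import math

def recombine(half_a, half_b, d_min):
    """Merge two half-molecules from the inside out: atoms closest to the
    combined center of mass are added first; an atom that cannot satisfy
    the minimum-distance constraint makes the whole crossover fail."""
    atoms = half_a + half_b
    n = len(atoms)
    center = tuple(sum(a[k] for a in atoms) / n for k in range(3))
    placed = []
    for a in sorted(atoms, key=lambda p: math.dist(p, center)):
        if all(math.dist(a, q) >= d_min for q in placed):
            placed.append(a)
        else:
            return None       # failed crossover -> treat as reproduction
    return placed
```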
Overall, the two operations may constantly undo each other's results. Local search may be trying to move an atom between two others and violate the minimum distance; crossover will move it back. Distance-based local search treats large distances in the same fashion as short distances. This has the effect of forcing an overall shape on the molecule that is consistent with the target molecule. In contrast, an energy-based search is mostly concerned with only those atoms that are a bond length or two away, and will not drive the overall shape of the molecule. In fact, the results for some of the Bucky ball studies by Deaven and Ho [13] have a distorted appearance because the larger distances have little to no effect on the potential energy value.

Local search is applied after all the operations. Even simple reproduction allows the work of local search to be spread out over several generations. As noted for mutation and crossover, the fitness of the molecule will degrade during the actual operations, even for offspring that contain new subgraphs that are ultimately needed to find the solution. Local search will increase the fitness so that the offspring will be chosen for future operations. The other effect is that any mutation or crossover that gets a defective atom into the right region to make a repair will move that atom to the proper location and thus increase the number of subgraphs with higher lengths L.

3.7.4 Diversity of Population

A natural way to measure the diversity of the current population is to use the fitness measure with the best molecule as the target. This will highlight the differences between the current best molecule and the rest of the population. The distance sets will have small errors for isomorphic graphs or similar arrangements of atoms. For molecules with significantly different topologies, the sets of distances will be substantially different and will produce a large value for the summed differences in the distances. It is interesting to note that if the two molecules used during crossover are very similar, the disruption of good subgraphs along the cutting plane is reduced. Only good subgraphs that span the cutting plane in both parents, and that have no matching atoms in the other parent, are lost. As the population converges, the probability of having the same good subgraph affected by the cutting plane increases and fewer disruptions will occur.

3.8 Graph Isomorphism

Determining the structure of a molecule from the distance set has an extremely strong parallel to determining whether two graphs are isomorphic. The set of atom-to-atom distances can be used as a test for isomorphism. If the two molecules are represented as two graphs with the bonds as edges, then testing for possible isomorphism can be used to rule out equivalent molecules. This test can also be used as a measure of diversity in the population. The comparison of two chromosomes on a gene-by-gene basis is not useful for the determination of diversity, because gene placement in the chromosome has no correlation to actual position in the molecule. In order to measure the diversity of the population, the true measure is the degree of isomorphism. The total summed error in these distances can represent a useful measure of the similarity of two graphs, and in this application, of two molecules. In graph theory terms, two graphs are either isomorphic or they are not.
This diversion into graph isomorphism is useful because it shows that the sorted distances can be used as an additional test of isomorphism. The distances, in graph terms, are integer numbers of edges between nodes. Two graphs G and H are isomorphic if there exists a mapping $f : V(G) \to V(H)$ such that a pair of vertices $u, v \in G$ is adjacent if and only if $f(u), f(v) \in H$ are adjacent [7][26]. In the case of molecular conformation, the graphs are defined with the bonds as edges and the atoms as nodes. A simple test of isomorphism for two molecules is the identity of the two sets of sorted distances, when distances are restricted to those that represent bond lengths. Another test that can eliminate two graphs as being isomorphic is the set of $\{deg(v_i)\}$ for all nodes in the graphs: if the two sets are not the same, then the two graphs are not isomorphic. While the GA works with the complete set of distances produced in a fully connected graph $K_n$, here the set of distances is restricted to the bond lengths that form the topology of the molecule. In a typical molecule, $deg(v_i)$ is in the range 1..6 for each atom. The total number of distances that make up the set of bonds is then $d_{count} = \frac{1}{2}\sum_i deg(v_i)$.

The following conditions can rule out isomorphic graphs. Equation 3.49 names the vertices of the two graphs in question, G and H. The simplest check is based on the fact that the graphs must have the same number of nodes or atoms: if the vertex counts n and m differ, as in Equation 3.50, the graphs cannot be isomorphic.

$$v_i \in G, \quad u_j \in H \quad (3.49)$$

$$n \neq m \quad (3.50)$$

The numbers of edges of the two graphs must also be equal, and can be calculated from the vertex degrees as shown in Equation 3.51. The next check, shown in Equation 3.52, is based on the degree of each vertex: the two multisets of vertex degrees must match, indicating that there is a one-to-one match of vertex degrees. This test can rule out isomorphism but cannot provide the mapping from $G \to H$ required to prove isomorphism.

$$\sum_{i=1}^{n} deg(v_i) \neq \sum_{j=1}^{m} deg(u_j) \quad (3.51)$$

$$\{deg(v_1) \ldots deg(v_n)\} \neq \{deg(u_1) \ldots deg(u_m)\} \quad (3.52)$$

Working with the set of distances, it is possible to add another check that can rule out isomorphism between graphs. If the graphs are isomorphic, then the sets of distances will also be equal. That is, the distance set consists of all the distances between any two vertices in the graph, as measured by the edge count between them. For each graph, the distances between all combinations of nodes are calculated, as shown in Equations 3.53 and 3.54. These distances are then collected and sorted into a set for each graph, and these sets must be identical, as in the negation of Equation 3.55, for the graphs to be isomorphic. This is a necessary, but not sufficient, condition for isomorphism of the graphs.

$$\forall\, i \neq j \quad d_k = dis(v_i, v_j) \quad (3.53)$$

$$\forall\, i \neq j \quad t_k = dis(u_i, u_j) \quad (3.54)$$

$$\{d_1 \ldots d_n\} \neq \{t_1 \ldots t_k\} \quad (3.55)$$

$$\forall\, v_i \in G \;\; \exists\, u_j \in H : D_G(v_i) = D_H(u_j) \quad (3.56)$$

Determining isomorphism uses these sets of distances as constraints for each of the graphs. Any subset of the distances incident on a given vertex $v_1$ in graph G must match the possible candidates in H. If the set is unique, it is possible to assign the mapping between the vertices $v_1 \in G \to u_r \in H$. If there is no matching set, then the two graphs are not isomorphic. When there are multiple sets that are identical, the number of permutations that must be tested is still significantly reduced. Equation 3.56 states that, for every subset of distances $D_G(v_i)$ associated with each $v_i \in G$, there must exist an identical subset of distances $D_H(u_j)$ for some vertex $u_j \in H$.
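The checks of Equations 3.50-3.55 translate directly into code; the sketch below uses breadth-first search for the edge-count distances, and the dictionary-of-sets graph representation is a choice of this sketch, not of the thesis program.

```python
from collections import Counter, deque

def hop_distances(adj, s):
    """Edge-count distances from s to every reachable node (BFS)."""
    dist = {s: 0}
    queue = deque([s])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def may_be_isomorphic(g, h):
    """Necessary (not sufficient) conditions for isomorphism:
    node count (3.50), degree multiset (3.51/3.52), and the sorted
    multiset of pairwise hop distances (3.53-3.55)."""
    if len(g) != len(h):
        return False
    if Counter(len(nbrs) for nbrs in g.values()) != \
       Counter(len(nbrs) for nbrs in h.values()):
        return False
    dg = sorted(d for s in g for d in hop_distances(g, s).values() if d)
    dh = sorted(d for s in h for d in hop_distances(h, s).values() if d)
    return dg == dh   # each unordered pair appears twice in both lists
```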
The number of unmatched sets can be used as a measure of the difference between the two graphs or molecules. If a measure of similarity is desired instead, the count of matching sets can be used.

3.9 Convergence

There is a natural tension between exploration and convergence. Premature convergence produces poor results, and prolonged slow convergence wastes computational resources. The correct convergence velocity, or rate of convergence, is the desired goal for the best solution possible. The rate of convergence for the molecular conformation GA is affected by many factors. Some of the factors that can lead to premature convergence are the size of the population, selection pressure, mutation rate, crossover rate, local search, and lack of diversity. Lack of diversity in the population is a typical cause of premature convergence. Diversity alone will not solve premature convergence: a population of poor fitness candidates may be diverse, but it will still not necessarily help in finding a good fitness solution. Further, the diversity should include at least a large number of the needed low-order subgraphs that must be combined into good solutions. Having a number of candidate solutions that are similar is a natural consequence of competition of the fittest. Worse yet, in this type of application, some arrangements of the atoms are isomorphic to each other. While they differ in orientation, or possibly chirality, they add nothing to the diversity of the population and, therefore, do not assist in finding the solutions. The main issues in the determination of convergence are the size of the population, the introduction of new genetic material by mutations, and the selection pressure.

3.9.1 Operator Effectiveness

During the course of a GA run, the effectiveness of the GA operators can be measured in several ways. Ultimately, the key point is that some number of individuals will be created that lead to a valid solution. It is possible to measure operator effectiveness in a couple of ways. The simplest is to count the number of resulting molecules that have a better fitness than their parents. The other is to sum the improvement in fitness for the offspring that are better than their parents. As the GA runs, it becomes more difficult to create better children. This is due, first, to the fact that the size of the space with better fitness values gets smaller and smaller as fitness converges to better values. Second, all the operators are more likely to disrupt good fitness subgraphs than to build a new high-fitness subgraph. The population of potential solutions has a certain equilibrium point as the run progresses. At this point the fitness values for some molecules will be poor, based on the disruptive nature of the operators. This will lead to a near-steady-state effectiveness rate. This rate is calculated for each type of operator. For each generation, the rate is simply the number of offspring that are better than their parents, divided by the total number of operations of that type performed. Each operator is a combination of the local search with one of the GA operators: simple mutation, relocation mutation, or spatial crossover.
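Bookkeeping for this per-generation rate is straightforward; the class below is a sketch with illustrative names, assuming that a lower raw distance fitness is better.

```python
from collections import defaultdict

class OperatorStats:
    """Per-operator effectiveness: offspring better than their parents,
    divided by the number of applications of that operator."""

    def __init__(self):
        self.applied = defaultdict(int)
        self.improved = defaultdict(int)

    def record(self, op, child_fitness, parent_fitness):
        self.applied[op] += 1
        if child_fitness < parent_fitness:    # lower raw error is better
            self.improved[op] += 1

    def rates(self):
        """Effectiveness rate for each operator this generation."""
        return {op: self.improved[op] / self.applied[op]
                for op in self.applied}
```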
3.9.2 Difference Equation

Operator effectiveness is based on the current fitness of the population and the amount of degradation caused by the GA operators. For example, if all the molecules are near perfect, the probability of an operator producing a new molecule with better fitness is much lower than when the population is first initialized. This is clearly shown in the results section that describes operator effectiveness. At the same time, if the GA operators degrade the overall population, a steady-state number of fitness improvements can be expected.

$$C_{better} = P_{better}(E) \quad (3.57)$$

3.10 Scalability

The ability to solve the problem has several interrelated factors that all affect scalability. For the molecular conformation problem, scalability is the ability to find solutions as the size of the clusters increases. Computation cost is impacted in two ways. First is the cost of evaluating the objective function. Second is the increase associated with the population size and the number of generations required to find a solution. A larger population size is typically thought to improve the probability of solution; this increases the computation cost for every generation. If the size is too small, the probability of finding a solution becomes very small and a high number of failures will occur. If the population is larger than it needs to be, it may take extra generations to find the solution. The key question about population size is whether it is large enough to contain the critical mass of genetic material needed to find a solution. In this particular problem, there is a constant tension between size and effective mixing. With too large a population, the needed mutations are not mixed sufficiently by crossover to find the solution. With too small a population, the probability of getting the needed mutation is so small that the chances of finding a solution go down. So, for effective scalability, the population size must be scaled within a window of viable probabilities. For the smaller molecules, this window is much larger and the population size parameters behave in the expected fashion. As the molecules get larger, the overlap in windows gets smaller, and this leads to fewer good solutions.

The cost of evaluating the fitness function is at the base of the scalability issue. This cost determines the lower bound for the entire system's operation. The creation of new subgraphs is the key element in the determination of scalability, assuming that it is not feasible to create a population with a sufficient number of subgraphs that can then be combined into a solution. The sources of these new subgraphs are the genetic operators along with the local search. The fitness function will be analyzed first, followed by the impact of the GA operation, divided among the population size, local search, and GA operators.

3.10.1 Fitness Function

The fitness function's cost of evaluation is on the order of $\Theta(N^2)$. The current process requires the calculation of the distances, which is $\Theta(N^2)$, and sorting the distances, which is on the order of $\Theta(N \ln N)$, along with the actual summation of error, which is on the order of $O(N^2)$. This is an interesting observation in and of itself: any genetic algorithm that uses the distance matrix or pays the energy calculation cost must scale by some factor x such that $\Theta(N^x) \geq \Theta(N^2)$. In this work, the entire set of distances is used for the fitness function. In contrast, the energy functions depend mostly on only near-atom interactions.
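The cost structure just described is visible in a direct sketch of the distance-based fitness function (assuming the sorted target distances are precomputed); the pairwise generation step dominates, and the sort runs over the N(N-1)/2 distances.

```python
import math

def distance_fitness(atoms, target_sorted):
    """Raw distance fitness: compute all pairwise distances (Theta(N^2)),
    sort them, and sum the error against the sorted, unlabeled target
    distance set. Assumes len(target_sorted) == N*(N-1)/2."""
    dists = sorted(math.dist(a, b)
                   for i, a in enumerate(atoms) for b in atoms[i + 1:])
    return sum(abs(d - t) for d, t in zip(dists, target_sorted))
```

Because the target distances are unlabeled, the sort is what aligns the candidate's distance multiset with the target's before the errors are summed.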
There are several issues that affect the ability of algorithms to solve larger problems. At the base of these issues is the way in which solutions scale with difficulty and size. For the molecular conformation, the size of the representation scales on the order of O(N). However, many of the data structures used to calculate the fitness function grow as $O(N^2)$. This sets a lower bound on the ability to scale the algorithms with the current fitness functions. It is interesting to note that if the fitness function were reduced to some cost $\Theta(aN)$, based only on the bonds of the molecule, the performance would be improved. In addition to the fitness function, the ability to scale the GA will depend on the population size required and the cost of performing the various operations such as local search, crossover, and mutation. The number of small subgraphs grows in a linear fashion and, if the GA's ability to operate is based on these small subgraphs, this should provide good scaling as the molecules grow in size. The other areas that affect the scalability include the size of the population, the number of generations required for convergence, and the cost of performing the local search.

3.10.2 GA Algorithm

There are many trade-off parameters in a genetic algorithm whose selection is not obvious. For example, increasing the size of the population could significantly reduce the number of generations needed for a solution. Similarly, increasing the number of iterations of the local search may in fact prevent finding solutions, or may reduce the number of generations required. Clearly these choices are not obvious. In addition to the fitness function, the dynamic operation of the GA will depend on population size, the operator parameters applied, and the number of generations required.

The ability of mutation to fill in defects has been estimated; the probability is proportional to $1/N^2$. This implies that the population size must increase by $N^2$ to improve the chances of filling in a defect. The crossover operator's role is to spread good subgraphs around while trying to preserve as many good ones as possible. If the population is too large, this will be harder to do.

Population

The population should be large enough to ensure that a good set of genetic material is available and, at the same time, small enough to keep the computational cost reasonable. The population size can be smaller if new good genetic material is introduced during the run. If the population is assumed to have all the genetic material it might need at the start of the GA run, then the required size of the population can be calculated: starting with the probability of having a building block and the number of building blocks required, it is possible to estimate the size, as sketched below. Needed building blocks must survive until they are combined into high-fitness individuals that lead to a solution. For the current work, the local search can create new good subgraphs along with the GA operators. The goal is to balance the population size and the number of generations needed to find the solution. Further, there is a critical population size below which not enough material exists to work with, and the disruptions caused by the genetic operators will cause convergence to some poor fitness value. In this situation, the probability of creating new subgraphs, and also of having all the required subgraphs in the population in order to find a solution, is so low that the search starts to resemble a random search.
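Equations 3.42 and 3.43 reduce to a two-line estimator; the sketch below, with illustrative names, reproduces the 20,000-member example from Section 3.7.

```python
from math import comb

def expected_blocks(n_atoms, l, p_bb):
    """Equation 3.42: expected number of size-L building blocks in one
    randomly created chromosome of n_atoms atoms."""
    return comb(n_atoms, l) * p_bb

def population_estimate(needed, p_l):
    """Equation 3.43: members required so the population holds `needed`
    building blocks of a given size at initialization."""
    return needed / p_l

# e.g. P_L ~ 0.001 at L = 5 with 20 blocks needed:
# population_estimate(20, 0.001) -> 20000.0
```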
Generations

Larger populations are expected to decrease the number of generations needed to find a solution. This relationship is not well established and, for some of the experimental results in this work, it is shown that smaller populations actually perform better for the larger molecules, in relation to the number of atoms.

Local Search

Two factors of the local search are the rate of adjustment applied to each location and the number of iterations performed. Increasing the number of iterations linearly increases the number of evaluations. At the same time, the higher the adjustment rate, the fewer iterations should be required. These parameters are bound by a stability criterion. Figures 3.8 and 3.9 show how the individual contributions are added in a simple two-dimensional example. The magnitude of the final vector for correction of the atom location is based on the scale factor, which is subject to the stability criterion.

Figure 3.8: Local search contributions

Figure 3.9: Local search adjustment

If the scale factor is small and the number of iterations is high, then the computation cost is significant. On the other side, if the scale factor is larger, the number of iterations can be reduced. When this scale factor is too large, the local search can become unstable. The basic impact on computation cost is linear in the number of iterations used. Convergence is detected by examination of the sum of the correction vector magnitudes: if the total sum of the correction vectors is below some threshold, then the local search has converged and is terminated. The disruptive nature of all the GA operators will cause the local search to iterate the maximum number of times for a large percentage of the molecules. Multiple generations may be required in order to reach convergence, even when the subgraphs are not disrupted by other operators.

Estimate of scalability

If the relocation mutation is assumed to be the major source of new subgraphs, and crossover is how these are spread among members of the population, with local search just doing the final adjustments, then the basis for the scalability will be $O(N^4)$. The impact of the local search is a linear scale factor. While it can significantly increase the cost of each GA run, the real problem is the probability of constructing the new subgraphs that repair defects in the molecule structure. This somewhat primitive analysis is in good agreement with established scalability results for energy search. In Hartke [27], a claim of $\Theta(N^{3.15})$ is made, and the data could be construed to be somewhat higher for larger values of N.

3.11 Summary

This chapter introduced the concepts of space division for homogeneous vectors as locations in space. Many fitness functions used in GAs have significant contributions that are based on the proximity and arrangement of these locations in space. These locations are represented as genes in the chromosome used to describe each individual member of the GA's population. A method for dividing spaces was introduced based on the dot product with the normal of a hyperplane. Next, the idea of local search was formalized as acting on any collection of locations such that application of the local search will lead to a specific set of locations. The equivalence of two molecules was demonstrated for translation and rotation, such that the individual atom locations of the two molecules are within some arbitrarily small sphere. The difference between the local search and the equivalence is the size of the spheres.

The probability of the random creation of building blocks or small subgraphs was described in terms of space volumes.
The actual process for the local search was described, along with the definition of convergence for the local search. The generalized division of space by multiple hyperplanes was described, with assignment to specific subspaces based on the binary pattern of positive or negative dot products for each of the dividing planes.

The fitness values have been shown to have a direct relationship to the number of small subgraphs or building blocks that each molecule contains. To show the theoretical basis for the performance of the spatial crossover, the counting of the subgraphs is used to predict the performance. While the subgraphs are similar to schemata, they are restricted by the existence of edges between genes, where in schema theory no such restrictions apply. Several molecules and a basic graph were examined. In each case, the single cutting plane disrupts far fewer small subgraphs than the two-point standard crossover. The standard crossover and the spatial crossover were then compared. Using the Bucky ball example, it is clear that the spatial crossover preserves the subgraphs and leads to subgraph creation. It is interesting to note that as the diversity is reduced, the disruption of subgraphs is also reduced.

The two forms of mutation used in this GA were formalized. The first is a random displacement of atoms and the second a relocation of an atom. The first type of mutation has a very localized effect if the number and the magnitude of the mutations are limited. The second form significantly alters the local space and provides a better means for exploration.

An important factor in the operation is determining a viable population size and the associated convergence rate. This chapter provided a formulation for the number of members needed in a given population, along with the expected rates of convergence. There are several critical parameters that are based on the creation of subgraphs in the initial population along with the steady-state operating points.

The overall scalability is based on the probability of constructing new good subgraphs and the cost of the fitness function. These, along with the linear impact of the local search, are the dominant factors in the scalability of the spatial operators.

Chapter 4

Experiments and Results

This chapter presents the results of experiments that explored the nature of the spatial operators. These runs are divided into several areas, starting with local search, followed by a series of trials for ideal molecules and real molecule distance sets. Additional results are provided for parameter selection searches and operator efficiency determination.

The results presented were produced using a typical single-level run of the GA algorithm with spatial operators: a single population was initialized and a GA run was conducted on this fixed population. For some limited number of molecules, multi-level runs were conducted. In this style of run, a number of GA runs were conducted on separately initialized populations; these runs were terminated either by fitness value or after a fixed number of generations. The best of each of these first-level runs were used as seed material for a single second-level GA run.

Section 4.1 describes the results of the local search without any of the genetic algorithm operations utilized. This is provided to establish the size of the region within which an atom can be displaced and still converge. This characterization of the local search provides the basis for describing equivalent molecules that converge to the same atom locations.
Section 4.2 covers finding solutions for molecule structures using the GA on a variety of molecules with different sizes and shapes. While most of the work centers on known Lennard-Jones low-energy configurations, other shapes, such as the Bucky ball, will be examined. One of the main results provided is the impact of molecule size on the solution process.

Solving real conformation problems, given real and artificially degraded distance data sets, is the focus of the next section. In Section 4.3, the ability of the algorithm to find Bucky balls from what is measurably a noisy data set of distances is presented.

One of the problems with the application of genetic algorithms is the large number of user-defined parameters and the impact they have on the algorithm's performance. In this application there are over ninety parameters that are defined at run time and can be customized for each type of molecule. Determination of the optimal operating points is in itself another useful application of genetic algorithms. In this work the parameters were manually selected, based on some of the parameter sweep studies presented in Section 4.4. The molecule size of twenty atoms was chosen because of the reasonable run times and the demonstrated sensitivity of some of the parameters. The ability to use these experiments as guides for larger molecules turns out to be limited.

4.1 Local Search

In order to understand the limits of the local search, an experiment was constructed that started with the ideal molecule and then randomly perturbed X atoms in random directions with a maximum random magnitude of perturbation M. For each setting of X and M, forty trials were run. The raw results are presented as the number of failures in Table 4.1. Trials that were run with fewer than five atom movements had very few failures and are not shown. All the tests were run on the Lennard-Jones N11 molecule. Each test starts with the ideal molecule and then repeats the following procedure X times, as shown in the leftmost column of Table 4.1: it randomly selects an atom and moves it in a random direction with a magnitude described by the first row. Once the perturbations are complete, the molecule is compared
On some iterations, only a subset of the distances are used to calculate the correction vector. The size and the range of distances used are selected at random. Near convergence to the solution, all the distances should be near zero and therefore, a subset should have only minor impact on the actual atom movement. 126 perturbation magnitude atoms 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 5 0 0 0 l 0 0 3 4 6 O 0 0 0 l 1 1 1 7 0 0 0 2 O O 0 5 8 0 0 0 0 l 0 6 7 9 0 0 0 0 0 2 7 10 10 0 0 1 0 l 3 6 1 1 1 1 0 0 1 0 l 4 3 7 Table 4.1: Failed convergence by local search to ideal N11 by count and magnitude of perturbation As demonstrated in the Table 4.1, the number of local search failures increases as the number of repetitions increases or the magnitude of the atom movements increases. The number of iterations required to find the solution has a significant impact on the overall efficiency of the GA. In fact, it is the local search that contributes most to the com- putational cost. In Figure 4.1, the actual number of iterations before termination is shown. The vertical axis is the probability of finding a solution in the range shown. For example, the first bar shows that 58% of all trails found a solution in less than fifty iterations. The second bar shows that 12% of all trials solved in 50 — 99 iterations and so on. The last bar represents the total number of failures for all 2240 trials which is 90 or 4%. In order to determine the true nature of the failures from this experiment, two other sets of probabilities were created. First, we determined the probability of failure given the total summed error of the perturbed molecules. The second set of probabilities is the failure given the maximum perturbation of the molecule. Figure 4.2 shows the probability of failure given a total summed error in a specific range. With the exception of the large spike for summed errors in the range 1.0 to 1.1, the graph shows the expected increase with the total amount of summed error. The small bar for summed error in the range 0.2 to 0.3 is really only two cases that failed. It is important to note the vertical scale of probability. For the largest bar, only 3% of the cases in the range 1.0 to 1.1 fail. This is almost the same as the overall failure rate for the whole trial. 127 0.1 . 0.5 ( 0.4 7 0.3 1 02 1 0.1 - Probabillty of solution in x lnteratlons 01 006 T ' ‘a‘rsrcsa‘b‘riilyssalur... <50 <100 <150 <200 <400 <600 300 and < 501 >500 and <1001 >1000 Figure 4.15: Ideal Bucky ball single-level runs 4.3 N on-Ideal Distances Real data sets for the target distances will have a certain amount of error. This error can manifest itself as uncertainty in the distances, missing distances, and other measurement issues. The initial research used an ideal Bucky ball as the target molecule because of its unique nature and size. A set of experimental data was available for the Bucky ball that would demonstrate the ability of the algorithm to find good structures from real data. These structures, by the very nature of the errors in the data set, cannot generate a perfect Bucky ball. In addition, several molecules were distorted using random mutations to the atom positions to simulate errors in measurement. These distorted molecules were then used as targets in the experiments. The sixty-atom molecules definitely remove the concerns associated with very small molecules and limited search spaces, both for the ideal and real data sets. 
There are a number of search algorithms that have reported good success with small molecules, only to have significant problems with larger ones. Two different non-ideal distance sets were used in this work. The first was the experimental data for a Bucky ball, and the second was a set of artificially perturbed atom locations for an N38 molecule.

4.3.1 Bucky Ball Experimental Data Set Target

The goal of this portion of the experimentation was to take pure measurement data and determine the structure of the molecule. Pure in this sense means without human or algorithmic assignment of distances to atom pairs. In the case of the Bucky ball experimental data set, there is a base error in the data that can be measured against the ideal Bucky ball. This base error determines the limiting performance of any algorithm that is applied. First, the real experimental data has a raw distance fitness of 41.17 when the ideal is used as the target, which represents a very high mismatch. That means, using an ideal Bucky ball as the target data, the real data has a raw objective value of 41.17, where a value of less than .1 is considered a good match. Second, the overall set of distances has an accumulated error of -89.95 in the total sum of lengths. The amazing part is that the energy of the ideal Bucky ball is -41.8, while the real data has a potential energy of -44.1. Considering the differences in the distance data, this small difference in potential energy underscores the impact of the smaller distances on energy, while the larger distances can have a major effect on overall shape. Figure 4.16 shows the differences in the 1770 distances between the two target data sets. With any real data set, it will be impossible to reduce the distance fitness to zero because of the inherent errors in the data set.

Figure 4.16: Comparison of ideal and experimental distances for Bucky ball

How the two molecules can have similar energy but significant differences in the distances is shown in Figure 4.17. This figure shows that the differences for distances on the order of a bond length are very small; the distance errors that contribute to the large mismatch occur for the middle distances. This accounts for the similar energies but significant distance errors, because the longer distance errors do not contribute as significantly to the energy. These distances do, however, play a major role in the overall shape.

Figure 4.17: Accumulated distance error for the Bucky ball data set

A typical good result for the real distance data run is shown in Figure 4.18. From a visual inspection, it appears to be a perfect Bucky ball. The small difference in energy values is based on minor variations in the bond lengths. The shape, when compared to the ideal Bucky ball, has some minor distortions: the diameter of the ball varies, and this reflects the inherent error in the distance data. The near distances, as shown in Figure 4.17, are almost perfect, so the energy values are close.

Figure 4.18: Good real-data Bucky ball result

Figure 4.19 shows a typical defect in a real-data Bucky ball result: one atom is misplaced. It is interesting to note that the actual raw fitness error is 0.0573, which is very close to the target, yet the molecule is obviously not perfect.
One hundred trials with the real Bucky ball data were conducted. Examination of the resulting molecules shows that over eighty-three of them had fewer than four defects. Eleven of the resulting molecules had perfect shape and symmetry with no significant defects. Each of these trials was run for one thousand generations or terminated when the raw fitness was low enough. The population consisted of 260 molecules with a local search iteration count of four. Given the noisy real data, the perfect Bucky ball results were unexpected. This range of actual results, all with raw fitness close to the target, shows the robust nature of the distance data. The energy values are extremely close to those for a perfect Bucky ball and are very close to the calculated energy for the target distance table.

Figure 4.19: Typical defect for real data Bucky ball result

            raw fitness    energy
average        0.2285      -43.50
max            1.0541      -41.76
min            0.0422      -46.68
stdev          0.2391        1.29

Table 4.2: Summary of 100 trials, Bucky ball with real data

4.3.2 Artificially Degraded Data

Two sets of degraded data were used. The first applied a random variation to each of the distances in an ideal data set. This method produces a set of data with an inherent error such that no molecule could match the data set. The runs would never quite reduce the distance error to low values, but still converged to molecules that had a high degree of similarity to the original. The second molecule used was derived by taking an N38 Lennard-Jones minimum-energy molecular structure and randomly moving one third of the atoms a random distance. The resulting molecule was then used as the target set of distances. While this molecule could be realized, it still took longer to converge. Where the normal ideal N38 would converge in a few hundred generations, this degraded molecule would take three to four times the average number of generations.

4.4 Parameter Selection Searches

The parameters used by the current algorithm are varied and numerous. The basic parameters associated with genetic algorithms include the probabilities of crossover, mutation, and selection, along with all those used by the local search. Because the program was used to compare a variety of methods, the parameters include many mode settings. These modes include population models such as (mu, lambda), permutations of parents to construct children, crossover modes, selection modes, and many other variations of basic algorithm operation. The following series of tables lists most of the algorithm parameter settings with a brief statement of use. Table 4.3 has most of the mode settings and integer parameter settings that control the algorithm. These are continued in Table 4.4 with the counters that are used to calculate the operator effectiveness. Table 4.5 contains the real-valued control parameters, including the probabilities for operator selection. Last, Table 4.6 holds some of the string constants that are used by the program to specify file names.

4.4.1 Population Size

The first series of parameter-setting runs looked at the effect of population size on the ability to find solutions. Population size has an impact on the overall cost of computation. Figure 4.20 shows how the population size affects the average number of generations required to find a solution.

Figure 4.20: Population size effect on average generations to find solutions, N20
It should be noted that, for smaller population sizes, the algorithm failed to find solutions a high percentage of the time, and runs were terminated at 2000 generations. These terminated runs were also included in the calculation of the averages. It is clear that, once past a population size of 110 molecules, the process saw little further reduction in failure rate. The boundary in size at which the failures decrease is very evident in Figure 4.21.

POPN        25     Population size
SELMODE     1      Mode of selection, tournament
TOURNSIZE   2      Tournament size
STARTMODE   0      Modes 0-5: 0, random initialization; 5, seed molecules
TARGETMODE  0      0, ideal molecule; 1, only distances
POPS        20     How many seed molecules to make room for
ELITECNT    1      How many elite to copy each generation
LAMDACNT    0      (mu, lambda) mode lambda count
MUCNT       5      (mu, lambda) mode mu count
DUMPMD      0      Dump file mode
KEEPMD      1      1, keep all offspring; 0, only keep better
ISPLT       19     Count of atoms for split
TARGETM     0      Type of target file
MUT2CNT     4      How many atoms to relocate
MUT1CNT     30     How many atoms to randomly move
MAXGEN      7000   Maximum generations to run
DIVERCNT    10     How many molecules to compare for diversity
RANDSEED    -1     -1, random seed; else a positive seed number
ACTUALGEN   0      Actual number of generations run
MULAMMD     0      (mu, lambda) mode
PERMUMD     2      Permutation mode for mu parents
NICHMD      2      Mode for niche control
VSINTCNT    8      Count of iterations for local search
NATOMS      38     Total number of atoms
VSACTCNT    0      Actual number of local search iterations per generation
CNTBEST     0      Number of new best molecules
CROSSMD     0      Crossover mode: 0, spatial; 1, N-point
CROSSMDX    0      Number of cut points for N-point
FITMODE     0      Select between distance and energy
ITERCNT     1      Number of iterations to perform for statistical purposes
BORECMD     0
RANGER      5      Division factor, restricts random start of distances (LS)
RANGEA      5      Division factor, restricts size of random range (LS)

Table 4.3: GA operating parameters, modes and sizes

OPCNT0      0      Total number of reproductions
OPCNT1      0      Total number of crossovers
OPCNT2      0      Total number of simple mutates
OPCNT3      0      Total number of relocation mutates
OPCNTK0     0      Total number of reproductions, better
OPCNTK1     0      Total number of crossovers, better
OPCNTK2     0      Total number of simple mutates, better
OPCNTK3     0      Total number of relocation mutates, better

Table 4.4: GA operation counters

The increase from twenty to fifty molecules in the population cuts the failure rate by a factor of five. The failures do not reduce to zero until the population size reaches over one hundred.

Ultimately, the computational cost needs to be minimized, and larger populations adversely affect computational cost. Figure 4.22 shows the relative number of evaluations as a function of population size. It is clear that the most cost-effective solutions are found between population sizes of 80 and 110.

4.4.2 Operator Probabilities

The three basic GA operators are crossover, simple mutation, and relocation mutation. All three have a probability of occurrence which, when summed with reproduction, will equal one. In a typical set of parameters, crossover will occur 70% of the time, simple mutation 10% of the time, and relocation mutation about 15% of the time. Simple reproduction makes up the remaining percentage of operations. The mutation operations also have a count associated with them; this count represents the number of repetitions of each operation that is performed. In addition, the simple mutation has a magnitude associated with it that is a function of the raw fitness. The local search is applied to the results of all operations, including simple reproduction.
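Because the four operator probabilities must sum to one, the per-offspring choice can be made with a single roulette draw. A minimal sketch using the typical settings quoted above (names are illustrative):

import random

# Typical operator probabilities quoted above (crossover 0.70, simple
# mutation 0.10, relocation mutation 0.15); reproduction absorbs the rest.
OPERATORS = [("crossover", 0.70),
             ("simple_mutation", 0.10),
             ("relocation_mutation", 0.15),
             ("reproduction", 0.05)]

def pick_operator():
    """Single roulette draw choosing one operator per offspring (a sketch)."""
    r, acc = random.random(), 0.0
    for name, prob in OPERATORS:
        acc += prob
        if r < acc:
            return name
    return OPERATORS[-1][0]   # guard against floating-point round-off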
PROB_REP    0        1 - Px - Pm1 - Pm2
PROB_X      0.7      Probability that crossover will occur
PROB_M1     0.1      Probability that simple mutation will occur
PROB_M2     0.15     Probability that relocation mutation will occur
MAXFIT      10040    If scaled fitness exceeds this, stop
MAGMUTRT    4        Magnitude of mutation based on fit/MAGMUTRT
MAGMUT      0.6      Base magnitude of mutation
DIVERAVG    0        Average diversity of samples
SUMWLIM     0.001    Limit on local search total summed error
FS1         0
FS2         0
FS3         0
FS4         5
FS5         36
FS6         0
FS7         0
FS8         0
PRANDVS     0.8      Probability of restricted-range local search
MINDISI     1.09733  Minimum distance override
ENERGY      0        Actual energy
FIT         0        Current raw fitness
SUMWFAC     0.09     Convergence factor for local search
REPDF       0        Sum of improvement for all molecules better than parent
CROSSDF     0        Sum of improvement for all molecules better than both parents
MUT1DF      0        Sum of improvement for all molecules better than parent
MUT2DF      0        Sum of improvement for all molecules better than parent

Table 4.5: GA operating floating-point parameters

DUMPFILE    dmp.txt          Name of file to dump data structure
PARMFILE    n38ljset.txt     Parameter settings from this file (parmdef.txt default)
BESTOF      bo.txt           Suffix added to best-of molecule file
RUNLOG      rl.txt           Suffix added to run log file
TARGETF     n38idealx.txt    Target file with information
FILEPREF    n38a/c_000       File prefix to store results
SEEDFILE    slista.txt       File with list of seed best-of molecule file names
TITLE       N38              Text for report

Table 4.6: GA operating file names, prefix and suffix

Figure 4.21: Population size effect on number of failed runs, N20

Figure 4.22: Population size effect on total average evaluations, N20

Crossover Probability Parameter Effect

The advantages of the spatial crossover can be demonstrated by the next two graphs. In Figure 4.23, the number of failed attempts to find a solution to N20 is significantly affected by the crossover probability. The rate of failures decreases as the percentage of crossover operations increases. Once the number of crossover operations exceeds 60%, the failure rate goes to zero for the N20 molecules.

The impact on the average generations is also interesting. In Figure 4.24, the average generations to a solution decline even more steeply than the solution failure rate as a function of crossover probability. The failed trials were removed from the graph showing average generations to a solution in this result; most of the average-generations results elsewhere include the failed runs with a penalty in terms of generations. For example, if the GA terminated at 1500 generations, then an additional penalty of 500 generations was added when calculating the average for all 100 GA runs.

Figure 4.23: N20 crossover probability versus failed solutions

Crossover is normally thought to be very disruptive. Interestingly, in this case, it is clearly a major positive influence. There is a slight increase in the average number of generations as the percentage of crossover continues to increase. This increase occurs primarily because the number of mutations has to be decreased for the given algorithm: the probabilities of all mutation, crossover, and reproduction operations must total 1.0. This constraint could be eliminated if serial probabilities of operator selection were used instead.
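A hypothetical sketch of that serial alternative: each operator fires on an independent coin flip, so the probabilities no longer have to sum to one, and raising one probability does not force the others down.

import random

def pick_operators_serial(p_x=0.70, p_m1=0.10, p_m2=0.15):
    """Serial (independent) operator selection, one coin flip per operator;
    several operators may fire on the same offspring (an assumed scheme)."""
    chosen = [name for name, p in [("crossover", p_x),
                                   ("simple_mutation", p_m1),
                                   ("relocation_mutation", p_m2)]
              if random.random() < p]
    return chosen or ["reproduction"]   # plain copy if nothing fired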
Mutation Probability Parameter Effect

Simple mutation has three factors: the probability of application, the number of repetitions applied if chosen, and the magnitude of each random move. Relocation mutation has two parameters; as with simple mutation, it has a probability of being selected for application and a number of repetitions. There is no magnitude parameter for relocation mutation because, when the atom is moved to a location relative to a chosen existing atom, it is placed at the minimum distance from that atom.

Figure 4.24: Average generations to solution, N20

Figure 4.25: Effect of simple mutation on average solution generations, N20

What is interesting about both of these mutations is that they can adversely affect the average solution time. In Figure 4.25, the probability of mutation is varied from 0.01 to 0.2. Ranges beyond this were not tried because the crossover probability would have to be reduced; from experiments, the average solution times would then increase substantially and would have been the major influence, masking the effect of mutation on this algorithm. From these results, the influence of mutation is limited. Even after repeated trials of 100 runs, the results were of limited impact. Elimination of the simple mutation, however, prevented complete convergence. This indicates that some small amount of mutation was required to assist the local search near the solution and to prevent it from being stuck in a shallow minimum.

Figure 4.26 shows the effect of the relocation mutation on the average number of generations to find a solution. This mutation clearly has a detrimental effect on average generations to solution. These parameter sweep studies were completed for the N20 Lennard-Jones molecule, which is small enough that just about any set of parameters will work. Similar to the simple mutation, total elimination of the relocation mutation increased the number of failures from zero to over 20%. At the N38 level, the elimination of the relocation mutation caused failures to increase by 50%. The trend of increased average solution time with increased probability was similar to that shown for the N20 molecule.

Each of the mutations, once selected, will repeatedly mutate the molecule some number of times. This repetition count also has a random selection associated with it: if the repetition count is set at eight, the number of actual repetitions can be anywhere from one to eight. Several one-hundred-trial runs were completed and they all had similar features, as shown in Figure 4.27. The impact on the average solution degrades as the number of repetitions increases. Each of the trials for the N20 molecule showed a dip after eight repetitions, then a steady decline again. This is consistent with the work by Hartke [27], where it is reported that mutation applied to 48% of the atoms in a molecule leads to better solution times.
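A sketch of the two mutations as described at the start of this subsection. The exact coupling of the simple-mutation magnitude to the raw fitness (MAGMUT and MAGMUTRT in Table 4.5) is an assumption here, as is the uniform move; the relocation move needs no magnitude because the atom is dropped at the minimum distance (MINDISI) from a randomly chosen anchor atom.

import numpy as np

def simple_mutation(coords, raw_fitness, reps=30, magmut=0.6, magmutrt=4.0):
    """Randomly displace atoms; the move size grows with the raw fitness
    (the fit/MAGMUTRT coupling of Table 4.5; the exact form is assumed)."""
    out = coords.copy()
    mag = magmut + raw_fitness / magmutrt
    for _ in range(np.random.randint(1, reps + 1)):   # 1..reps repetitions
        a = np.random.randint(len(out))
        out[a] += np.random.uniform(-mag, mag, size=3)
    return out

def relocation_mutation(coords, reps=4, min_dist=1.09733):
    """Pick an atom and re-place it at exactly the minimum distance from
    another randomly chosen atom; no magnitude parameter is needed."""
    out = coords.copy()
    for _ in range(np.random.randint(1, reps + 1)):
        a, b = np.random.choice(len(out), size=2, replace=False)
        direction = np.random.normal(size=3)
        out[a] = out[b] + min_dist * direction / np.linalg.norm(direction)
    return out

The default repetition counts follow MUT1CNT and MUT2CNT in Table 4.3, and the draw of an actual repetition count between one and the maximum mirrors the behavior described above.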
Figure 4.26: Effect of relocation mutation on average solution generations, N20

Figure 4.27: Effect of relocation mutation repetition on average solution generations, N20

4.4.3 Local Search Parameter Effect

The local search has many parameters that affect its operation. First is the number of iterations performed each time. The second is the convergence factor that controls the ratio of the movement applied to each atom, based on the error vector calculated for each atom. In addition to these parameters, there is a limit value that terminates the iterations if the total summed error of the magnitudes of all the vectors gets below a set limit. Based on a probability, the range of distances used can be restricted by two parameters; these parameters select a subrange of distances used in the local search vector-correction calculations.

The computational cost is significantly affected by the number of iterations performed on each molecule. Computational cost here is the total number of molecule evaluations performed; the graphs that show average evaluations to find a solution are based on a fixed size of molecule:

cost = (population size) x (local search repetitions) x (average generations)    (4.1)

Figure 4.28 shows a steady increase in cost as the number of iterations increases. The parameter "vsintcnt" is the maximum number of local search iterations that can be applied to each offspring. It should be noted that there is not a significant reduction in the number of average generations needed to find the solution; the number of evaluations is simply the product of the average number of generations and the number of iterations on each molecule for a given population size. The outcome of the study shows that just about any reasonable number of iteration steps will work, and the impact on computational cost implies that smaller is better. However, not using local search at all causes a significant number of failed solutions; two or more iterations result in complete convergence.

Figure 4.28: Average evaluations to find solution, N20

As already demonstrated, a large convergence factor can have a negative effect if it is so large that instability results. Figure 4.29 shows the number of failures for each convergence factor from 0.05 up to 0.21. As can be seen, instability rises significantly after 0.2. Second, Figure 4.30 shows the effect on the average number of generations needed to find a solution. In this figure a mild downward trend is shown as the factor increases; then there is a significant rise when the critical point is reached. The minimum occurs when the convergence factor is about 80% of the critical value at which the failures increase rapidly. This behavior is also a function of the square of the number of atoms, since each of the distances has the potential to contribute to the error-correction vector.

4.5 Classic Two-Point Crossover

In this set of experiments, the spatial crossover method was substituted with a standard two-point crossover. The only parameter that was changed was the method of crossover.
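For reference, a minimal sketch of the two crossover styles being compared. The classic two-point operator exchanges a contiguous run of atom indices, while the spatial operator, in the cut-and-splice spirit used throughout this work, exchanges the atoms on one side of a shared random cutting plane. The exact plane selection and any repair of atom counts in the real program may differ.

import numpy as np

def two_point_crossover(p1, p2):
    """Classic two-point crossover on the flat atom list: list position,
    not spatial neighborhood, decides which atoms travel together."""
    a, b = sorted(np.random.choice(np.arange(1, len(p1)), 2, replace=False))
    return np.vstack([p1[:a], p2[a:b], p1[b:]])

def spatial_crossover(p1, p2, k):
    """Spatial crossover (a sketch): cut both centered parents with the
    same random plane, keep the k atoms of p1 on one side, and fill the
    rest from p2's far side, so intact subgraphs are inherited."""
    p1 = p1 - p1.mean(axis=0)                 # center both parents
    p2 = p2 - p2.mean(axis=0)
    normal = np.random.normal(size=3)
    normal /= np.linalg.norm(normal)
    keep = p1[np.argsort(p1 @ normal)[:k]]    # k atoms of p1 below the plane
    fill = p2[np.argsort(p2 @ normal)[k:]]    # the rest from p2
    return np.vstack([keep, fill])

The split count k corresponds to the ISPLT setting of Table 4.3.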
Figure 4.29: Effect of convergence step size on failed solutions, N20

Figure 4.30: Effect of convergence step size on average generations to find solution, N20

N20          Two-Point   Spatial
Failures        54           0
Average        491.3       35.65

Table 4.7: Solution comparison of standard and spatial crossover for N20 Lennard-Jones molecules

N38          Two-Point   Spatial
Failures       100           4
Average         NA         306

Table 4.8: Solution comparison of standard and spatial crossover for N38 Lennard-Jones molecules

The population size, operator probabilities, and iteration counts were all kept the same. The local search was also used with the same parameters. When spatial crossover was replaced with the two-point crossover method, the ability to find solutions was severely degraded. For the N20 molecules, the algorithm found only 44 solutions in a 100-trial series. The comparison of the spatial crossover to the standard two-point crossover is summarized in Table 4.7. The average generations to find a solution for a successful run increased by over a factor of ten.

Another set of experiments was conducted on the N38 Lennard-Jones molecules, and the results were dramatically worse: in this case, 100% of the trials failed to find a solution. Expecting that several other GA parameters besides the crossover were having a significant negative impact, a series of parameter sweeps was conducted to discover better parameters. In all cases, there was either no change or further degradation of the search. Based on this, only the N20 and N38 series of runs using the expected optimum parameter settings are included.

4.6 Operator Effectiveness

Operator effectiveness provides a key insight into the operation of the spatial crossover and relocation mutation. Effectiveness is measured by counting the number of operations that produce a better offspring and dividing by the total count of that type of operation in a generation. This is tracked separately by the style of GA operation, such as crossover and mutation.

The behavior of the crossover ratio was especially interesting. In Figure 4.31, the ratio starts out very high, indicating that just about any two halves selected have a high probability of increasing the good subgraph count. The interesting point is when the ratio effectively goes to zero and then seems to level off, with only slight rises and dips when significant changes in fitness occur. The explanation for the near-zero point is that it is an inflection point: the average fitness has improved such that the probability of creating a better molecule through crossover, given the current population, is very small. This represents a peak of the average fitness in the short run, and it occurs after only a few generations. This number of generations is very close to the convergence of a pure tournament selection process, typically around six to fifteen generations, depending on the population size. The disruptive effects of both the mutations and crossover actually degrade the average fitness, so that a steady-state ratio of successful to total crossovers is to be expected. This ratio remains nearly constant for the remainder of the runs. Figure 4.31 shows the ratio at every generation along with a smoothed five-generation average.
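The effectiveness bookkeeping just described maps directly onto the OPCNTn and OPCNTKn counters of Table 4.4. A sketch, with illustrative names:

from collections import Counter

class OperatorStats:
    """Per-generation effectiveness tracking in the spirit of the
    OPCNTn/OPCNTKn counters of Table 4.4 (a sketch)."""

    def __init__(self):
        self.total = Counter()   # operations applied this generation
        self.better = Counter()  # operations that yielded a better offspring

    def record(self, op, child_fitness, parent_fitness):
        self.total[op] += 1
        if child_fitness < parent_fitness:   # lower raw fitness is better
            self.better[op] += 1

    def effectiveness(self, op):
        """Fraction of this operator's applications that improved fitness,
        the ratio plotted in Figures 4.31 through 4.33."""
        return self.better[op] / self.total[op] if self.total[op] else 0.0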
The dip to zero at generation six is exactly what would be expected for the small population size using only tournament selection. The ratio then settles near 0.1 and declines slowly for the remainder of the run. This decline is explained by the fact that, as the average fitness increases, the chance of a crossover operation improving fitness becomes less and less. The N20 molecule used for this figure was typical of all the molecules explored.

Figure 4.31: Crossover effectiveness by generation

The next two graphs have the objective value plotted along with the effectiveness ratio for simple mutation and for relocation mutation. While the average number of generations for a solution does not seem to correlate with the mutation probability and repetition counts, the tracking of objective-value changes is clearly tied to mutation events. By plotting the objective function and the mutation effectiveness ratio, it can be seen that those generations that produce good mutations are key to the overall solution process. In Figure 4.32, the high ratio of simple mutations near generation 18 directly corresponds to a desired decrease in the objective function.

The relocation mutation in Figure 4.33 has, in general, much lower ratios. What is critical is that the generations with high ratios show how those events had a major impact on the fitness. From the figure, events at generations 7, 20 and 31 show how those mutations lead to significant objective-value drops over the next few generations, and ultimately to the solution.

Figure 4.32: Simple mutation effectiveness by generation

Figure 4.33: Relocation mutation effectiveness by generation

4.7 Scalability

The molecules N12, N20, N38 and the Bucky ball were all solved using single-level runs, and they scaled in an acceptable fashion. The N60 Lennard-Jones molecular cluster problems were solved in a single-level run, but had a high failure rate and took substantially longer to solve. Because of this, only a limited number of trials were conducted for this molecule.

               N12       N20       N38          N60       Bucky    Bucky R
Failures         0         0         4            7          13         5
Generations    8.29     35.65      65.3       3212.5       762.2     1000
Population      18       120        90           90          90       260
LSINT            5         8        18           18           8         8
Size            66       190       703         1770        1770      1770
Evals.        0.74k    34.22k   105.78k   12,403.44k     548.78k    2,080k
Cost        4.92E+04  6.50E+06  7.44E+07    2.20E+10    9.71E+08  3.68E+09

Table 4.9: Scalability for various size molecules

In the earlier work on the N60 and the Bucky ball, two-level trials were conducted. In these trials, a series of between ten and one hundred GA runs was conducted with a high-fitness, low-generation termination point that was well short of a solution. The molecules were still reasonably close in basic shape to the desired solutions. These terminated runs were then used as seed material for a second-level run. Using this approach, consistently good solutions were found for the N60 Lennard-Jones, and occasionally the N88 Lennard-Jones, molecules.

Table 4.9 shows the summary of the results for the various trials. The significant increase between N20 and N38 shows major problems with the length of run versus molecule size: a near ten-fold increase in effort for a less than two-fold increase in size is not encouraging.
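The Evals. and Cost rows of Table 4.9 follow from equation (4.1) together with the O(N^2) distance-table size. A quick check against two of the table's columns:

# evals = population * generations * LSINT; cost = evals * size,
# where size is the N*(N-1)/2 distance-table size listed in Table 4.9.
rows = {"N20": (120, 35.65, 8, 190),   # population, generations, LSINT, size
        "N38": (90, 65.3, 18, 703)}
for name, (pop, gens, lsint, size) in rows.items():
    evals = pop * gens * lsint
    print(f"{name}: {evals / 1e3:.2f}k evals, {evals * size:.2E} cost")
# Prints 34.22k / 6.50E+06 for N20 and 105.79k / 7.44E+07 for N38, matching
# the table to rounding.  The two-level N60 column does not follow this
# single-level relation, although its Cost is still Evals. times Size.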
The Bucky ball, with both the ideal and the real data, provided good results. Even those runs that were classified as failures still produced a structure that was clearly a Bucky ball with defects. The real problem was the N60 molecule, for which a two-level run was conducted. The first-level run had a population of sixty molecules and was terminated at five hundred generations. The second-level run selected twenty first-level molecules and used them as seeds, filling in the population until the size requirement was fulfilled.

In Table 4.9, the column labeled Bucky R represents the trials where the real experimental data was used as the target. "LSINT" is the number of iterations that the local search used on each molecule. The row labeled "Evals." is the total number of function evaluations, on average, per solution. The last row, "Cost", represents the total cost of computation, which also factors in the cost of objective-function evaluations based on the size of the molecules in atoms. The number of function evaluations in terms of N for the ideal Bucky ball shows scaling on the order of O(N^3.2).

4.8 Discussion

The four main areas documented by these results are the comparison between crossover methods, population size, operator effectiveness, and scalability. In each of these areas a number of trials were conducted.

4.8.1 Comparison of Classic vs. Spatial Operators

This was by far the simplest of the experiments to conduct. In these experiments, the two-point crossover was substituted for the spatial crossover method; all other parameters were kept the same. The degradation in performance was significant. There has been prior research that used traditional crossover methods for small molecules under twenty atoms and showed reasonable results [58]. Few of these have worked with larger molecules. The results presented here show that the difficulty of the larger molecules rises significantly after forty atoms. The most successful GA applications to cluster molecules have used the spatial crossover method with extremely high rates of simple mutation [28]. It is clear from the N38 molecule that the traditional crossover is far too disruptive to achieve any solutions.

4.8.2 Population Size: Theory vs. Actual

The basic theory of population size, based entirely on the likelihood of creating small building blocks by random atom placement, provides at best a very loose upper bound. The initialization process, by its very nature, provides an order of magnitude more useful subgraphs than simple probability would predict. The initialization method assumes that each atom will be one minimum distance from at least one other atom and that no atom will violate some minimum distance; a sketch of this placement scheme is given below. This, combined with the local search, yields a population with far more potentially useful subgraphs than would be achieved with purely random placement. Further, the influence of the distance-based local search also increases the number of good subgraphs.

In all the trials where the population was varied, there appears to be a dip in the average generations required to solve for the structure: increasing the population size beyond a point actually reduced the ability to find the solution in a smaller number of generations. This is counter-intuitive, given the abundance of genetic building blocks that larger populations would provide; it simply takes longer to assemble them into a single molecule.
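A minimal sketch of the seeded initialization referred to above, under the stated constraints: each new atom sits exactly one minimum distance from some existing atom and no closer than the minimum distance to any other. The box size and rejection loop are illustrative.

import numpy as np

def init_molecule(n, min_dist=1.09733, box=4.0, rng=np.random.default_rng()):
    """Seeded random initialization (a sketch).  Placements that crowd an
    existing atom are rejected and redrawn; large n may need many tries."""
    coords = [rng.uniform(-box, box, size=3)]
    while len(coords) < n:
        anchor = coords[rng.integers(len(coords))]
        direction = rng.normal(size=3)
        cand = anchor + min_dist * direction / np.linalg.norm(direction)
        if all(np.linalg.norm(cand - c) >= min_dist - 1e-9 for c in coords):
            coords.append(cand)
    return np.asarray(coords)

Every atom placed this way is a member of at least one minimum-distance pair, which is why this scheme seeds far more useful subgraphs than uniform random placement would.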
In the algorithm based on energy search by Hartke [28], a forced mixing of all the parents occurs, ensuring that potential new material always gets a chance to create a better offspring. In the algorithm used here, the tournament selection for crossover, mutation and reproduction will tend to reduce the chances of participation for a poor-fitness molecule, even if it contains highly needed subgraphs. With a smaller population, the chance of such a molecule competing against itself is higher, so its selection is more likely to occur; in a larger population, even though the mutated material is needed, it will not get the chance to mix with the better-fitness molecules. From the population sweep studies, the average number of generations to find a solution dips to some minimum and then rises. This supports the need for a sufficient number of good subgraphs that can be mixed by the crossover operator, and at the same time shows the detrimental effect that too many competitors can have on the selection of good material carried by lower-fitness molecules.

4.8.3 Operator Effectiveness

Examination of operator effectiveness during a run provided some unexpected results, and these results were consistent across nearly all the trials. The crossover operator effectiveness was the most interesting, especially the initial high likelihood of creating a better offspring and the subsequent dip to a near-zero chance of doing so. The rate then leveled off and represented the probability of making a better offspring from an average, degraded population created by the constant disruption of mutations and crossover.

The behavior of the mutation effectiveness was more expected. It was not possible to increase the probability of creating a better offspring by manipulating the repetition count or the chance of a mutation operation. The overall chance of creating a better offspring was consistently low after the initial few generations of a run. Simple mutation can be indirectly connected to early changes in the fitness by spikes in the ratio, while the relocation mutation can be indirectly connected to significant improvements of fitness later in the run, ultimately leading to the solution.

4.8.4 Scalability: Theory vs. Actual

In the theory section, the cost of calculating the fitness function was shown to be O(N^2). The estimated cost to fix defects was expected to be at least O(N^2), with a corresponding increase in population size. The ability of the algorithm to find effective solutions to small molecules was very promising. The cost of finding solutions to larger molecules indicates a need to focus the mutation operations on atoms that have a higher chance of improving the fitness. Even using the two-level method of solution for the larger molecules, the overall scalability remains approximately O(N^3.7).

4.9 Summary

In this chapter a variety of experimental results were reported for molecules ranging from twelve atoms to sixty atoms. The results were provided in a variety of forms, with specific focus on individual aspects. The average number of generations to find a solution was the key indicator used in the parameter selection searches. In the searches involving varying population sizes, the number of evaluations was used instead, so that the larger population sizes were properly compared. In these parameter searches, an operating point was located giving good performance for the overall search. Also included in this chapter was the comparison of the standard two-point crossover to the spatial crossover.
In this comparison the two-point crossover performed very poorly and, for the larger molecules, failed to find solutions entirely.

The operator effectiveness measurements provide additional understanding of the mechanisms at work in the GA. The role of crossover in distributing and combining genetic material provided by the mutation operators proved to be critical to the operation of the GA. The ratio of crossover to mutation in these algorithms is kept very high to allow time for the new material developed by mutation to migrate among members of the population. This supports the previous work, where all permutations of molecules are tried and only the best are kept each generation; that is exactly the kind of mechanism demonstrated in this algorithm.

The scalability of the algorithm is limited by the probability of creating the new genetic material that is needed to complete the molecule. The local search does not overcome the limit in the topology of the molecule; that is, the local search will not alter the topology, one of the required tasks for escaping a local minimum. The mutation operators are critical for providing this alteration. The limiting factor for scalability is the probability of correcting the defects in the molecule. With purely random selection of atoms, this probability becomes much smaller as the number of atoms increases.

Chapter 5

Conclusions

The original hypothesis of improved performance from the spatial operators has been clearly confirmed, and the theory developed to support the spatial operators is consistent with their measured performance. It is also clear that all the various aspects of the GA are strongly interrelated: the absence of any one operator has a strong negative impact on performance. The use of an objective function based only on distance allows for unbiased searches with experimental data as the target. Each of these conclusions is described below, along with future research work. The scalability for larger molecules, and possibly proteins, will have to be addressed. In addition, the limited work on the multi-level algorithm shows that hierarchical GA populations need to be explored and may be critical for the challenges that the larger molecules will produce.

A constant question about GA algorithms is the genealogy of the solutions. This particular problem provides a rich environment in which genealogy could easily be recorded and used to trace the ancestry of the solutions.

5.1 Spatial Operators

The two GA operators, spatial crossover and relocation mutation, have been shown to have a performance advantage over standard two-point crossover and simple mutation. The interrelationship between the spatial operators and the local search is not separable: removal of any operational component makes the problem much more difficult and requires multiple levels of GA trials to achieve a solution. The spatial crossover's advantage is derived from its preservation of the small subgraphs that are key to the calculation of good fitness values.

The relocation mutation provides a more effective way to explore different molecule topologies. Each relocation significantly alters the arrangement of atoms. From examination of the failed results of many trials, it appears that the algorithm achieves near-perfect molecules with only a few atoms in wrong locations. Relocation mutation addresses this problem; otherwise it would take massive simple mutations to explore the same space.
Worse yet, massive mutations would destroy many of the good subgraphs at the same time and, even if the newly mutated atoms formed good subgraphs, those might not survive the competition to reproduce, mutate or cross over. Relocation mutation preserves nearly all the good subgraphs while still allowing the distribution of new genetic material.

The tracking of jumps in the fitness value can be seen to correspond to increases in the operator effectiveness of the mutations. This alignment is not a direct result, but an indirect one: the events that lead to the high ratio of mutation effectiveness take several generations to distribute among the population and, in doing so, ultimately lead to the improvement in fitness value. This concept adds support to the use of permutations of all members of a small mu population to create the lambda population. It is interesting to note that the size of the lambda population is roughly the same as the constant population size used in this work, and the number of evaluations is similar to that of the energy-based search methods.

5.2 Interrelationship

In nearly all the trials run, the elimination of any single aspect caused the failure rate to jump significantly. In the case of standard crossover, the algorithm failed 100% of the time to find a solution for N38 molecules. For the spatial operators, the elimination of any of the operators caused significant degradation. If the local search was eliminated, no solutions were found using a single-level run. Even as the parameters for each of the operators were varied, no one parameter had a major effect on the overall performance. When the mutation probability or the repetition counts were set to zero, the algorithm would take an order of magnitude longer to find a solution, if it found one at all. When the crossover operator probability was reduced, no solutions were found. The removal of the relocation operator was the only change that still allowed solutions, but they took five to ten times as many generations.

5.3 Objective Functions

As the distance-based objective function converges, the energy function also converges. From the results section, however, it is clear that the two functions are not at all consistent with each other. The best-of-generation molecule provided a steady progression of distance objective values, while the corresponding energy values were sporadic, jumping above and below the eventual asymptotic value. In the limit, for any given molecule, the energy and distance objective functions are both continuous. Tracking the two demonstrates the influence of the longer distances on the distance objective function, while those distances matter far less to the energy function. This is a significant observation. In the final stages of convergence, the two objective functions both smoothly approach the desired values.

This effect leads to the following hypothesis. The distance objective function will more likely produce a grossly correct shape of the molecule, while the energy function will produce more correct close atom distances and sacrifice overall shape. That is, the shape may be distorted and, in some cases, atoms may be in wrong locations; however, such a location will be far from the other atoms, such that an energy minimum is still found. In contrast, with the distance objective function, if an atom is in the wrong general location, the effect will produce an error contribution on the order of N times some factor, which will tend to magnify the placement error.
At the same time, the local search will require many iterations and can become stuck more easily because of large distance errors.

5.4 Confirmation of the Hypothesis

A set of genetic algorithm operators based on spatial location concepts provides improved performance for a class of NP-hard search problems in N-dimensional spaces. Spatial crossover has been shown to outperform standard crossover on the molecular conformation problem, which is categorized as NP-hard, with a homogeneous vector-based solution. For larger molecules, the difference in performance was substantial. The relocation mutation demonstrated the ability to reduce search times by exploring the search space more effectively. Both of these operators are based on the premise that the individual genes are part of an N-dimensional space and that the preservation of the spatial relationships among genes is essential to ensuring the propagation of good fitness values. These spatial operators can be shown by analysis to preserve those relationships, and the experimental results presented here support the theory.

This algorithm also used a distance-based objective function and local search. This allows for an unbiased search based on target data from experimental x-ray diffraction and NMR results. The distance-based objective function allows for application to a wide range of problems where the only requirement is that the equations describing the interactions between points, or atoms, conform to a continuous function with consistent influences: the polarity of the interactions needs to be stable, without repeated changes in sign. The mapping can be highly nonlinear, as is the case for molecular conformation with the Lennard-Jones energy models.

5.5 Competitive Methods

Energy-based search provides cost-effective methods for finding molecular conformations. Basin hopping [66], GAs [6] and simulated annealing [37] all demonstrate effective solutions for a wide range of molecules using energy-based searches. The Liga algorithm [41] has combined several GA concepts with a constructive approach, using competitive layers similar to a sports league, to solve the molecular conformation problem. The Liga algorithm uses a robust target distance set to construct the molecules.

5.6 Future Research

While this work has shown that a general concept using distance and spatial relationships is able to provide improved performance over standard GA operators, the ability to solve larger problems has some limitations. In the examination of the prior art, the most successful algorithms add many domain-specific methods to achieve good results. These include sorting and searching on individual operations, forcing permutations among all the most fit individuals, serial mutations of large magnitude, and geometric diversity control. This section contains some proposed improvements that address some of these specific areas, as well as more general concepts.

5.6.1 Improved Scalability

From the theory on the reduction of the last few defects, it is clear that a better method is needed than random point or atom selection. This process could be guided by the magnitude of each atom's contribution to the objective function, or by some derived value based on the subgraphs that each atom is part of. The hypothesis is that it is the repair of the last few defects that consumes nearly all the computational power.
The initial gains in the first few generations are enormous, while the majority of the generations are required to remove the last few defects.

Another significant improvement could come from a multiple-layer, sustained style of GA. This approach was used on a batch basis for the successful attempts on the molecular conformation problem without the local search, and it has also been shown to be effective after the addition of the local search. The process of finding the desired operating points for the number of seed solutions needed and for the quality of the seeds has a major effect on the solution time. Using this multiple-layer approach will increase the parallelism of mixing high-fitness candidates in the higher layers. It still took over a thousand generations at the second layer to find the solution. Exploring this hierarchical set of populations may significantly help with scaling problems.

5.6.2 Adaptation to Folding Proteins

Even though the atomic structure of all molecules comes down to atom locations, the additional domain information available for proteins makes it desirable to store the protein shape as a combination of known sequences with rotational angles. The GA crossover operator would also need to be based on dividing the main sequence chain: a pure spatial dividing plane would create invalid molecules at the completion of the crossover operation. Some combination of crossovers that uses both the division of the main sequence chain and spatial relationships will be required to ensure valid molecules while still taking advantage of the spatial relationships.

The other issue is the much larger size of protein molecules; even the smallest is a few hundred atoms. This would impact the evaluation time in several ways for this algorithm. First, the size of the data structures and the evaluation time would grow as O(N^2). One possible solution would be to reduce the data structures and the evaluations to some limited range of distances, beta*N, where beta would be on the order of three to ten. This would eliminate the squared growth in N, would keep the close proximate interactions, and would actually conform more closely to the subgraph model presented in this work. The other issue is the availability of experimental distances: while the smaller distances between atoms, representing bond lengths, are easier to detect using frequency-based experiments, the large distances are difficult to measure.

5.6.3 Dynamic Parameter Selection

Like nearly all applications of GAs, the parameters that control the overall operation of the algorithm have a tendency to grow in number and complexity. Several of the parameters are expected to have linear relationships to the number of atoms in the molecules. From the effectiveness-ratio studies, achieving a higher ratio of good results was not possible by varying any single parameter. From the theory section, the probability of improvement can become very low at the end of the run due to the random selection process. For the algorithm's solution time to improve, dynamic variation of the parameters will need to be explored.

5.6.4 Using Multiple Levels of GA

The concept of distinct population islands in GAs has been explored extensively for optimal configurations, sizes, transfer rates and topologies of interaction. The recent work on Hierarchical Fair Competition (HFC) has been shown to be effective in creating an environment for sustained evolution.
In the HFC work, the number of levels of evolution and how they interact or overlap was explored. In the middle stage of the research reported here, a two-level evolutionary environment, a batch implementation of a limited HFC, was able to achieve good results without the addition of local search.

5.6.5 Tracking Genealogy

One of the more interesting results was the tracking of operator effectiveness, which provided good insight into the operation of the GA for this application. The correlation between high mutation-ratio events and subsequent improvements in fitness was identified. It was also clear that such events are distributed through the population by the crossover operator. To this extent, the current algorithm was able to provide a rudimentary means of inferring a generation or two of the genealogy that led to the solution of the molecular conformation. One area for future work would be to develop a complete genealogy tracking system that can show the various family trees that lead to a solution.

BIBLIOGRAPHY

[1] Jim Antonisse. A new interpretation of schema notation that overturns the binary encoding constraint. In Proceedings of the 3rd International Conference on Genetic Algorithms, pages 86-91. Morgan Kaufmann Publishers Inc., 1989.

[2] Jaroslaw Arabas, Jan J. Mulawka, and Jacek Pokrasniewicz. A new class of the crossover operators for the numerical optimization. In Proceedings of the 6th International Conference on Genetic Algorithms, pages 42-48. Morgan Kaufmann Publishers Inc., 1995.

[3] Thomas Back. Evolutionary Algorithms in Theory and Practice. Oxford University Press, 1996.

[4] Donna Bassolino-Klimas, Roberto Tejero, Stanley R. Krystek, William Metzler, Gaetano T. Montelione, and Robert E. Bruccoleri. Simulated annealing with restrained molecular dynamics using a flexible restraint potential: Theory and evaluation with simulated NMR constraints. Protein Science, 5:593-603, 1996.

[5] Bonnie Berger, Jon Kleinberg, and Tom Leighton. Reconstructing a three-dimensional model with arbitrary errors. In STOC '96: Proceedings of the twenty-eighth annual ACM symposium on theory of computing, pages 449-458. ACM Press, 1996.

[6] B. Hartke. GECCO-2001: Proceedings of the Genetic and Evolutionary Computation Conference, chapter Global geometry optimization of atomic and molecular clusters by genetic algorithms. Morgan Kaufmann, San Francisco, 2001.

[7] Fred Buckley and Marty Lewinter. A Friendly Introduction to Graph Theory. Prentice Hall, Pearson Education, Inc., Upper Saddle River, New Jersey, 2003.

[8] Thang N. Bui and Byung-Ro Moon. On multi-dimensional encoding/crossover. International Conference on Genetic Algorithms, pages 49-55, 1995.

[9] P. Chaudhury, S.P. Bhattacharyya, and W. Quapp. A genetic algorithm-based technique for locating first-order saddle points using a gradient dominated recipe. Chem. Phys., 253:295-303, 2000.

[10] David I. Chu, Hunter C. Brown, and Moody T. Chu. On least squares euclidean distance matrix approximation and completion. SIAM J. Matrix Anal. Appl., 16:645-654, 1995.

[11] G. M. Crippen and T. F. Havel. Distance Geometry and Molecular Conformation. John Wiley and Sons, New York, NY, 1988.

[12] William David, Kenneth Shankland, and Norman Shankland. Routine determination of molecular crystal structures from powder diffraction data. Chem. Commun., pages 931-932, 1996.

[13] D. Deaven and K. Ho. Molecular geometry optimization with a genetic algorithm. Phys. Rev. Lett., 75:288-291, 1995.
[14] K.A. Dill, S. Bromberg, K. Yue, K.M. Fiebig, D.P. Yee, P.D. Thomas, and H.S. Chan. Principles of protein folding: a perspective from simple exact models. Protein Sci., 4:561-602, April 1995.

[15] Q. Dong and Z. Wu. A geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data. Journal of Global Optimization, 26:321-333, 2003.

[16] B. Dorronsoro, E. Alba, M. Giacobini, and M. Tomassini. The influence of grid shape and asynchronicity on cellular evolutionary algorithms. In Y. Shi, editor, IEEE International Conference on Evolutionary Computation, pages 2152-2158, Portland, Oregon, June 20-23 2004. IEEE Press.

[17] Richard Duda, Peter Hart, and David Stork. Pattern Classification. John Wiley and Sons, 2nd edition, 2001.

[18] L. J. Eshelman, K. E. Mathias, and J. D. Schaffer. Crossover operator biases: Exploiting the population distribution. Proc. of the 7th ICGA, pages 354-361, 1997.

[19] Larry J. Eshelman, Rich Caruana, and J. David Schaffer. Biases in the crossover landscape. ICGA, pages 10-19, 1989.

[20] L. J. Fogel. Autonomous automata. Industrial Research, 4:14-19, 1962.

[21] Mario Giacobini, Marco Tomassini, and Andrea Tettamanzi. Takeover time curves in random and small-world structured populations. In Hans-Georg Beyer et al., editor, GECCO 2005: Proceedings of the 2005 conference on Genetic and evolutionary computation, 2005.

[22] Philip E. Gill and Michael W. Leonard. Reduced-Hessian quasi-Newton methods for unconstrained optimization. SIAM Journal on Optimization, 12(1):209-237, 2001.

[23] Fred Glover and M. Laguna. Tabu search. In C. Reeves, editor, Modern Heuristic Techniques for Combinatorial Problems, Oxford, England, 1993. Blackwell Scientific Publishing.

[24] S. K. Gregurick, M. H. Alexander, and B. Hartke. Global geometry optimization of (Ar)n and B(Ar)n clusters using a modified genetic algorithm. J. Chem. Phys., 104:2684, 1996.

[25] Peter Guntert. Automated NMR protein structure calculation. Progress in Nuclear Magnetic Resonance Spectroscopy, 43(3-4):105-125, December 2003.

[26] Frank Harary. Graph Theory. Addison-Wesley Publishing Comp., Reading, Massachusetts, 3rd edition, 1972.

[27] B. Hartke. Global cluster geometry optimization by a phenotype algorithm with niches: location of elusive minima, and low-order scaling with cluster size. J. Comput. Chem., 20(16):1752-1759, 1999.

[28] B. Hartke. Application of Evolutionary Computation in Chemistry, volume 110 of Structure and Bonding, chapter Application of Evolutionary Algorithms to global cluster geometry optimization, pages 33-53. Springer, Heidelberg, 2004.

[29] T.F. Havel. Encyclopedia of Computational Chemistry, chapter Distance geometry: Theory, algorithms and chemical applications, pages 723-741. J. Wiley and Sons, New York, NY, 1998.

[30] Bruce Hendrickson. The Molecule Problem: Determining Conformation from Pairwise Distances. PhD thesis, Cornell University, Computer Science Department, 1990.

[31] Bruce Hendrickson. The molecule problem: Exploiting structure in global optimization. SIAM Journal on Optimization, 5(4):835-857, 1995.

[32] F. Herrera, M. Lozano, and A.M. Sanchez. A taxonomy for the crossover operator for real-coded genetic algorithms: an experimental study. International Journal of Intelligent Systems, 18(3):309-338, 2003.

[33] Magnus R. Hestenes. Conjugate Direction Methods in Optimization, volume 12 of Applications of Mathematics. Springer Verlag, New York, 1980.

[34] M.R. Hoare. Structure and dynamics of simple microclusters. Advances in Chemical Physics, 40:49-135, 1979.
[35] J. H. Holland. Outline for a logical theory of adaptive systems. Journal of the ACM, 9(3):297-314, July 1962.

[36] J.H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, Ann Arbor, 1975.

[37] Hsiao-Ping Hsu, Ulrich H. E. Hansmann, and Simon C. Lin. Structure determination of organic molecules from diffraction data by simulated annealing. Physical Review E, 64(056707):1-2, 2001.

[38] J. Hu, E. Goodman, and K. Seo. Genetic Programming Theory and Practice, chapter Continuous Hierarchical Fair Competition Model for Sustainable Innovation in Genetic Programming, pages 81-98. Kluwer Publishers, 2003.

[39] Jianjun Hu. Sustainable Evolutionary Algorithms and Scalable Evolutionary Synthesis of Dynamic Systems. PhD thesis, Michigan State University, Department of Computer Science and Engineering, 2005.

[40] M. Iwamatsu. Global geometry optimization of silicon clusters using the space-fixed genetic algorithm. Journal of Chemical Physics, 112:10976-10983, June 2000.

[41] P. Juhas, D. M. Cherba, P. M. Duxbury, W. F. Punch, and S. J. L. Billinge. Ab initio solid state nano-structure determination. Accepted for publication in Nature, Nov 2005.

[42] Andrew B. Kahng and Byung Ro Moon. Toward more powerful recombinations. In Larry Eshelman, editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 96-103, San Francisco, CA, 1995. Morgan Kaufmann.

[43] H. W. Kroto, J.R. Heath, S.C. O'Brien, R.F. Curl, and R.E. Smalley. Buckminsterfullerene. Nature, 318:162-163, 1985.

[44] J. Lee, I.-H. Lee, and J. Lee. Unbiased global optimization of Lennard-Jones clusters for N <= 201 by the conformational space annealing method. Phys. Rev. Lett., 91(8):080201, 2003.

[45] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Programming, 45(3, (Ser. B)):503-528, 1989.

[46] J. Maddox. Genetics helping molecular-dynamics. Nature, 376(6537):209-209, Jul 20 1995.

[47] John H. Mathews. Nelder-Mead method. Web, 2003. http:math.fullerton.edu/mathews/n2003/NelderMeadMod.html.

[48] John H. Mathews and Kurtis K. Fink. Numerical Methods Using Matlab. Prentice-Hall Inc., Upper Saddle River, New Jersey, USA, 4th edition, 2004.

[49] P. Moscato. Memetic algorithms for molecular conformation and other optimization problems. Newsletter of the Commission for Powder Diffraction, of the International Union of Crystallography, 20, 1998.

[50] Diego F. Nehab and Marco C. Pacheco. Schemata theory for the real coding and arithmetical operators. In SAC '04: Proceedings of the 2004 ACM symposium on Applied computing, pages 1006-1012. ACM Press, 2004.

[51] J.A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308-313, 1965.

[52] Arnold Neumaier. Molecular modeling of proteins and mathematical prediction of protein structure. SIAM Review, 39(3):407-460, 1997.

[53] J.A. Northby. Structure and binding of Lennard-Jones clusters: 13 <= N <= 147. J. Chem. Phys., 87:6166-6167, 1987.

[54] A. Patton, E. Goodman, and W. Punch. Beyond encoding: Using population sampling as a rotation guide for GA operators. GARAGe Technical Report garage99-01-02, Michigan State University, 1999.

[55] A.T. Phillips, J. B. Rosen, and K. A. Dill. Energy landscape projections of molecular potential functions. Optimization in Computational Chemistry and Molecular Biology, pages 1-2, 2000.

[56] A. P. Ramirez and R. C. Haddon. C60 neutron data collection at IPNS.
Funded by DOE; work done at MSU.

[57] I. Rechenberg. Evolution strategy: Optimization of technical systems by means of biological evolution. Fromman-Holzboog, Stuttgart, 1973.

[58] Wilson Rivera-Gallego. A genetic algorithm for solving the euclidean distance matrices completion problem. SAC 1999, pages 286-290, 1999. San Antonio, Texas.

[59] I. A. Sarafis, P. W. Trinder, and A.M.S. Zalzala. Towards effective subspace clustering with an evolutionary algorithm. Proc. IEEE Congress on Evolutionary Computation, Dec. 2003.

[60] Kumara Sastry. Efficient cluster optimization using extended compact genetic algorithm with seeded population, optimization by building and using probabilistic models. OBUPM, pages 222-225, Jul. 2001.

[61] J.B. Saxe. Embeddability of weighted graphs in k-space is strongly NP-hard. In Proceedings of the 17th Allerton Conference in Communications, Control and Computing, pages 480-489, 1979.

[62] H.-P. Schwefel. Numerical Optimization of Computer Models. John Wiley and Sons, Chichester, 1981.

[63] Dong-il Seo and Byung Ro Moon. A survey on chromosomal structures and operators for exploiting topological linkages of genes. In GECCO, pages 1357-1368, 2003.

[64] William M. Spears and Kenneth A. De Jong. An analysis of multi-point crossover. In G. J. E. Rawlins, editor, Foundations of Genetic Algorithms, pages 301-315, San Mateo, CA, 1991. Morgan Kaufmann.

[65] David J. Wales and Jonathan P. K. Doye. Global optimization by basin-hopping and the lowest energy structures of Lennard-Jones clusters containing up to 110 atoms. J. Phys. Chem. A, 101(28):5111-5116, 1997.

[66] D.J. Wales and H.A. Scheraga. Global optimization of clusters, crystals, and biomolecules. Science, 285(5432):1368-1372, Aug 1999.

[67] Hyun-Sook Yoon and Byung-Ro Moon. Synergy of multiple crossovers in a genetic algorithm. IEEE Transactions on Evolutionary Computation, 6(2):212-223, April 2002.