A GENETIC PROGRAMMING TECHNIQUE FOR PROTEIN OPTIMIZATION DEMONSTRATED IN MRI REPORTER GENES

By

Alexander Robert Bricco

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Biomedical Engineering—Doctor of Philosophy

2024

ABSTRACT

Reporter genes are important tools for researchers studying molecular and cellular biology: by linking the activation of a gene to a detectable phenomenon, they give a location and a measurable value for the expression level of that gene. Reporter genes for MRI allow these measurements to be made noninvasively and at arbitrary tissue depth. Chemical Exchange Saturation Transfer (CEST) based reporter genes have shown promise as reliable reporters in MRI, but the relatively low sensitivity of the method has limited its utility in research settings. Initial attempts to optimize existing CEST reporter genes proved difficult due to a series of technical challenges. This led to the development of a new machine learning tool for protein engineering, the Protein Optimization Engineering Tool (POET). An iterative process of POET predictions and experiments produced improved CEST reporter genes: new peptides with nearly a fourfold increase in contrast over prior art. Additionally, POET was used to generate a reporter gene that produces significant contrast at a farther downfield frequency than prior CEST reporter genes.

I dedicate this work to my wife, Emily, and my Lord, Jesus Christ, who when I lacked strength enough to continue, lent me some of theirs.

ACKNOWLEDGEMENTS

This work would not have been possible without the guidance and assistance of many others. I would like to thank my collaborator and friend Illiya Miralavy, who wrote the first (and several subsequent) versions of POET; Dr. Jory Schlossau, who optimized and improved it; and their advisor Dr. Banzhaf for guiding its development. I would also like to thank Dr. Michael McMahon at Johns Hopkins University for guiding me through learning CEST MRI and keeping the project moving forward when I was still learning. Further thanks go to Dr. Christian Farrar and his lab at Harvard Medical School for providing the information on exchange rates that helped complete this project. This work was made possible via funding from NIH/NINDS: R01-NS098231, R01-NS104306; NIH/NIBIB: R01-EB031008, R01-EB030565, R01-EB031936, P41-EB024495; and NSF 2027113. Lastly, but mostly, I would like to thank my advisor, Dr. Assaf Gilad, who introduced me to the project, taught me countless things about science, and served as an exemplary role model for what it means to be a professor and run a laboratory.

TABLE OF CONTENTS

INTRODUCTION
METHODS
PROJECT 1: IMPROVING CEST PEPTIDE SENSITIVITY THROUGH DIRECTED EVOLUTION
PROJECT 2: IMPROVING CEST CONTRAST AT 3.6 PPM VIA MACHINE LEARNING
PROJECT 3: IMPROVING CEST CONTRAST AT 5.0 PPM VIA MACHINE LEARNING
CONCLUSION
REFERENCES
APPENDIX A: CESTIDE PROPERTIES
APPENDIX B: POET CODE

INTRODUCTION

Many portions of this section are adapted from a currently published paper1.

Magnetic Resonance Imaging (MRI)

Magnetic Resonance Imaging (MRI) is a field of biomedical imaging that uses powerful magnetic fields to align the magnetic moments of unbalanced nuclei. MRI is widely used clinically to inform the diagnosis of a wide variety of diseases2. In an MRI scanner a strong homogeneous magnetic field (B0) is applied. This field causes the spins of unbalanced nuclei to align with the field and allows them to resonate. Not all nuclei within the field are aligned; the number of aligned nuclei increases with the strength of the B0 field. In clinical practice scanners have a B0 field with a strength of 0.5T-3T, while preclinical scanners typically achieve strengths of 7T-15T. Along with aligning their magnetic moments, the nuclei spin around the axis of the B0 field in a motion called precession. The Larmor frequency is the rate of precession and is linearly related to the strength of the B0 field. Within MRI, frequency offsets are often expressed in millionths of the Larmor frequency (ppm): a 7T preclinical MRI has a Larmor frequency of ~300 MHz for a hydrogen nucleus, so one "ppm" on that magnet represents 300 Hz for a hydrogen nucleus. These nuclei can be shifted out of alignment with the B0 field by exposing them to a radio frequency (RF) pulse in a perpendicular magnetic field (B1). At the cessation of the RF pulse the nuclei are in a higher energy state, referred to as excited, and decay from this high energy state to realign with the B0 field in a process called relaxation. As these nuclei relax they produce a measurable change in the magnetization, which is often detected by the same coils used to generate the B1 field and is the signal measured in MRI3.

Chemical Exchange Saturation Transfer (CEST)

By applying an RF pulse before the imaging RF pulse, a nucleus may be saturated. When an imaging pulse is applied to these saturated nuclei, it results in the nuclei being aligned antiparallel to the B0 field. As these saturated nuclei undergo relaxation, they produce a negative signal as they drop to the normal excited state, in addition to the normal signal as they relax from excited to their normal state. These signals cancel each other out, and as a result the saturated nuclei can be said to produce no signal. Although the B0 field is the major contributor to the magnetic field experienced by a given nucleus undergoing MRI, there is also an effect from the surrounding chemistry. Each nucleus surrounding the nucleus being imaged has its own magnetic field, which produces a change in the Larmor frequency of those nuclei called a chemical shift4. This allows specific nuclei to be saturated or excited without saturating or exciting other nuclei in the surrounding area. The change in magnetic field, and thus the change in Larmor frequency, is constant for a given chemistry. This allows for the specific saturation of protons associated with a given functional group.
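These shifts are expressed in the same ppm units introduced above. To make the arithmetic concrete, here is a minimal sketch (not from the dissertation; the 1H gyromagnetic ratio of roughly 42.58 MHz/T is a standard constant, and all names are illustrative):

```python
# Larmor-frequency arithmetic for a preclinical scanner (illustrative sketch).
GAMMA_MHZ_PER_T = 42.58  # gyromagnetic ratio of 1H over 2*pi, a standard value

def larmor_mhz(b0_tesla: float) -> float:
    """Larmor frequency of hydrogen, in MHz, at a given field strength."""
    return GAMMA_MHZ_PER_T * b0_tesla

def ppm_to_hz(offset_ppm: float, b0_tesla: float) -> float:
    """One ppm is a millionth of the Larmor frequency, so ppm * f0[MHz] = Hz."""
    return offset_ppm * larmor_mhz(b0_tesla)

print(larmor_mhz(7.0))      # ~298 MHz, the "~300 MHz" quoted above
print(ppm_to_hz(1.0, 7.0))  # ~298 Hz: one ppm on a 7 T magnet
```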
If a saturation pulse is sustained, and the nucleus is in a molecule where it can be exchanged with the solvent (such as a hydrogen nucleus in a hydroxyl group)5, this saturated proton may be exchanged with the environment. If this exchange is fast enough, enough saturation accumulates in the bulk solvent to generate a measurable decrease in signal, despite the low concentration of the solute carrying the exchangeable nucleus. This mechanism of contrast is called chemical exchange saturation transfer (CEST)6. This process can be seen graphically in figure 1.

Figure 1: Visualization of CEST MRI. An organic amide sits in solution with water (a). In a normal T1 weighted MRI scan an RF pulse is applied to the hydrogens in the water (b) and this results in signal (c). In a CEST scan, first a saturation pulse is applied, which saturates a proton on the amide (d). The saturated proton is then exchanged with the protons in the surrounding water (e). When the RF pulse is applied to produce an image (f), the saturated proton doesn't produce a signal, resulting in reduced signal (g).

In the most common example, a hydrogen nucleus (frequently simply called a proton) is exchanged with the water from an exchangeable organic group such as a hydroxyl or amide7. In CEST MRI 0 ppm is the water saturation frequency, and the subject is typically exposed to a series of saturation pulses at frequencies symmetrical around the water saturation frequency; the resulting series of measurements is called a Z-spectrum. The effect of CEST after a saturation pulse is visible as a decrease in signal in the area where the exchangeable protons are present. This localized decrease in signal is called CEST contrast. CEST contrast is quantified by comparing the signal after a given saturation pulse with the signal after saturation at that frequency multiplied by -1. The difference between them is then divided by the reference signal S0, acquired either without saturation or with a saturation pulse far off-resonance from water. The asymmetry due to the transfer of saturated protons gives the magnetization transfer ratio asymmetry (MTRasym), as shown in equation 1 below8, where Sω is the signal at a given frequency, S-ω is the signal at the inverse of that frequency, and S0 is the signal without any saturation. This can be seen visually in figure 2.

Eq. 1: MTRasym = (S-ω - Sω) / S0

In addition to the MTRasym, it is important in experiments to correct for mass and concentration: larger molecules can generate more contrast than smaller ones by virtue of having more exchangeable protons to saturate, and higher concentrations of a CEST contrast agent produce more contrast. This correction is done by multiplying the MTRasym by the ratio of the concentration of protons in the solvent to the concentration of the CEST compound. This results in the proton transfer efficiency (PTE) of a given peptide, as shown in equation 2 below8, where [H2O] is the molar concentration of H2O in the sample and [peptide] is the molar concentration of the peptide. In this dissertation all PTE values are divided by 1000 for ease of communication.

Eq. 2: PTE = MTRasym × 2 × [H2O] / [peptide]
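As an illustration of equations 1 and 2, a minimal sketch (the function names are mine; the default water concentration of ~55.5 M is a standard value for dilute aqueous samples and is not stated in the text):

```python
def mtr_asym(s_neg: float, s_pos: float, s0: float) -> float:
    """Eq. 1: asymmetry between the signals at -w and +w, normalized to S0."""
    return (s_neg - s_pos) / s0

def pte(mtr: float, peptide_molar: float, water_molar: float = 55.5) -> float:
    """Eq. 2: proton transfer efficiency; the factor of 2 counts both
    protons of each water molecule."""
    return mtr * 2.0 * water_molar / peptide_molar

# Illustrative numbers only; the dissertation reports PTE divided by 1000.
print(pte(mtr_asym(0.90, 0.85, 1.0), peptide_molar=0.0035) / 1000)
```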
Figure 2: Visualization of MTR calculation. (a) In black is the Z-spectrum acquired from the highest contrast protein from the 5th generation of 3.6 ppm development (see below). If the sample were pure water, the spectrum would be symmetrical across the 0 ppm point, as seen in blue. The difference between the two is used to calculate the MTRasym, seen in red. (b) The three inputs to the calculation (S0, Sω, S-ω) for a frequency of 3.6 ppm offset from the water saturation frequency. S0 was acquired without a saturation pulse. S3.6 was acquired after a saturation pulse at 3.6 ppm offset from water. S-3.6 was acquired after a saturation pulse at -3.6 ppm offset from water. The MTRasym figure is the result of performing the calculation from equation 1 on the data for each pixel in the images.

CEST contrast is generated naturally by any compound that exchanges imageable atoms with the surrounding solvent, including naturally occurring biomolecules. Carbohydrates produce contrast from their hydroxyl groups, which has led to studies using this hydroxyl CEST signal as an indicator of metabolism and function in the brain9 and in mouse livers10. Proteins produce contrast from the amine groups present in the backbone of their amino acid chain and are also able to produce contrast at other frequencies due to the presence of exchangeable protons in their functional groups. The saturation frequencies used to generate CEST contrast from some functional groups common in biochemistry can be seen in Table 1 below.

Table 1: Saturation frequency of common functional groups

Functional Group    Chemistry    Saturation Frequency (ppm)
Hydroxyl            R-OH         0.8
Primary Amine       R-NH2        1.8
Secondary Amine     R-NH-R       3.6

Although functional groups produce contrast at a consistent frequency, different compounds produce different amounts of MTRasym. The primary influences are the number of exchangeable protons, the rate at which each of those protons exchanges, and the accessibility of those exchangeable protons to the solvent11,12.

Reporter Genes

In the study of genetics and molecular biology it can be difficult to tell cells apart from one another, or to tell cells where a gene is active apart from cells where it is inactive. To overcome these challenges researchers employ reporter genes. A reporter gene is a gene that, when expressed, produces a measurable change that can be used to tell that the gene is currently being expressed and at what level. Typically this change is the result of a generated protein that has a detectable presence. The first reporter genes were for optical imaging13-15. When these genes are expressed, the proteins produced can generate light from chemical energy (bioluminescence), emit light at one frequency after stimulation with light of another frequency (fluorescence), or create a change in the color of a solution. In MRI, the initial efforts at reporter genes relied on generating T1 or T2 based contrast by binding metals within cells16-20. More recently there has been a demand for reporter genes that don't require an added substrate to work. MRI reporter genes using CEST have shown great success21, despite their relatively low sensitivity compared to T1 or T2 based reporter genes. The mechanism allows the generation of contrast without the addition of metals, and since the contrast requires a presaturation pulse, researchers can get a clear structural image without the effects of contrast. The first reporter gene designed for CEST was the Lysine Rich Protein (LRP)22,23, which, true to its name, was primarily a long series of lysines with other residues interspersed to improve expression, and was designed from the observation of high contrast from poly-L-lysine (PLL), a synthetic lysine polymer, in prior experiments24.
Later CEST reporter genes included human protamine (hPRM1)25 and an adaptation of the thymidine kinase system already used in positron emission tomography (PET)26, both of which primarily produce contrast as a result of amine exchange at 3.6 ppm. There has also been work to generate CEST contrast at other frequencies, such as 0.8 ppm and 1.8 ppm8, since reporters at these frequencies, when expressed alongside 3.6 ppm reporters, allow multiple contrast agents to be detected in parallel for "multicolor" imaging. LRP remains the most used CEST reporter gene and has achieved a PTE of roughly 12.5 thousand. This leads to detection in the nM-mM range, allowing it to be used in the study of cardiac gene therapy27 and oncolytic virotherapy28, but leaving the analysis of smaller systems such as neuronal activation out of reach.

Protein Engineering

Proteins have been very useful for industrial, medical, and environmental purposes. However, having been developed in nature, few of these proteins are optimal for the purposes to which we have put them. Protein engineering is based on either generating a new protein to fulfill a task for which no natural protein exists, or optimizing a protein for a given task. The most common mechanism for protein optimization is directed evolution29, which starts from a natural protein and an assay capable of measuring that protein's fitness in the required task. After a baseline is established, mutations are introduced into the genetic code for the protein by a variety of means. Each of these mutant variants is then measured for its fitness. The most fit protein becomes the new baseline, and new mutations are introduced into it. This process continues until the protein has hit the desired level of function or the experimenter chooses to stop. Although straightforward and easy to perform, directed evolution can become locked on a plateau or local maximum and be unable to find its way to the global maximum30,31.

There have been many attempts to use computational methods to aid in protein engineering and bypass the time limitations and lab requirements of directed evolution. One notable example is Rosetta32, which computationally determines the shape a protein will take, allowing researchers to attach pieces of other proteins to a common backbone33; this makes design straightforward when the protein engineer knows what overall shape they are trying to achieve. Since the ideal conformation of an engineered protein is sometimes unknown, machine learning tools have been developed to aid in protein engineering. A notable example is Protein Optimization with Optimal Learning (POOL), which makes use of Bayesian classification34. There are other machine learning models35, such as AlphaFold, which uses deep learning to model protein structure36; UniRep, which uses a generalized understanding of proteins to provide a general-purpose model37; and, more recently, language models that predict interactions from sequence without developing a structure38. Existing machine learning methods of designing and optimizing proteins currently rely on large amounts of training data and on long proteins, which provide increased context for understanding a given protein.
Otherwise, the training data that enables optimization tools like POOL to work must contain proteins that share a high degree of similarity with the protein being optimized.

Genetic Programming

Evolutionary Computation is a field in computer science studying algorithms inspired by biological evolution. Genetic Programming (GP)39,40 is among the most powerful evolutionary computation techniques, acting as a general problem solver that evolves solutions to difficult structural design tasks. GP has been used in protein engineering before to predict key protein motifs41, to evolve energy functions for evaluating protein structures42, and to predict protein-protein interactions related to disease43. This earlier work demonstrates the capability of GP to model features in the protein search domain, and in particular its ability to extract features relevant for a prediction task. This is a central capability in biological applications, where high-dimensional, inhomogeneous datasets are often used as input to predict output values. In addition to creating predictive models, the underlying mechanisms of GP allow it to come up with novel models, often at first sight surprising or even counterintuitive to the user44. Over the last decades, GP has proven to produce human-competitive solutions to many problems45.

Significance

My work has pursued the development of new CEST reporter genes, as well as demonstrating the utility of a tool for generalized protein engineering. The reporter genes offer a wealth of new options for scientists to study the molecular biology and genetics of living creatures noninvasively through use of MRI. As the sensitivity of such reporters increases, smaller events, such as the activity of neurons in the brain, could be measured using such reporter genes. POET can help scientists develop proteins for a variety of purposes. Currently it has been used to engineer reporter genes and targeting proteins for extracellular vesicles (EVs)46. The approach of POET can be used to help develop any important small protein, with options such as silk monomers47 or painkillers46.

METHODS

Directed Evolution

For directed evolution experiments I used an error-prone polymerase kit as a means of introducing mutations into the gene of interest. The DNA target was replicated via a polymerase chain reaction (PCR) using the error-prone polymerase. The resulting sequences were cloned using a TOPO TA Cloning Kit (ThermoFisher, K4600-01). pCR II-TOPO recombinant plasmids were transformed into One Shot TOP10 Chemically Competent E. coli (ThermoFisher, C404010). Bacterial colonies were grown on ampicillin-rich agar plates overnight at 37 °C. Colony PCR was performed using Quick-Load Taq 2X master mix (New England Biolabs, M0271L). Successful colonies were grown in 100 µg/mL ampicillin-rich lysogeny broth overnight at 37 °C. DNA was extracted using a PureLink Quick Plasmid Miniprep Kit (ThermoFisher, K4510-02) and then sent to the MSU Genomics Core for DNA sequencing.

MRI Scanners

MRI experiments were done on three different MRI machines. One of them was a 7T Bruker horizontal bore preclinical MRI at the Biomedical and Physical Sciences Building (BPS) at MSU. The next was a 7T Bruker horizontal bore preclinical MRI at the Institute for Quantitative Health Science and Engineering (IQ). The last was an 11.7T Bruker horizontal bore preclinical MRI used by our collaborator at Johns Hopkins University (JHU).
Pulse Program

CEST spectra, also known as Z-spectra, were captured in scans that spanned -7 to 7 ppm offset from the water frequency in steps of 0.2 ppm. The saturation pulses had a saturation power of 4.7 µT and lasted for 4 seconds. For B0 correction, a WASSR scan48 was also performed, spanning -1.5 to 1.5 ppm in steps of 0.1 ppm. The saturation pulses for the WASSR scans had a saturation power of 1.25 µT and lasted for 2 seconds. Both the WASSR and CEST scans used a modified RARE sequence with a RARE factor of 16 and a resolution of 48 x 48 covering a field of view of 40 mm x 40 mm, with a slice thickness of 3 mm. Both WASSR and CEST scans had a TR of 10 s and a TE of 4.74 ms. Each scan results in a series of images, all covering the same geometric area, which contains the entire phantom. Each image is taken after the application of a saturation pulse at a different frequency, allowing for comparisons between the intensities of each voxel across the range of saturation frequencies. To account for changes in the B0 field over time, a new WASSR scan was done before each CEST scan. Four to five Z-spectra were acquired in each experiment.

Phantom and Phantom Preparation

MRI experiments were done in a custom-designed 3D-printed phantom with 12 wells within a central body, which contains water to prevent the magnetic field from distorting at the air-water interface. Each well is designed to hold 250 µL of liquid. The circular distribution of the wells ensures that each well experiences a similar field strength. The shape of the phantom can be seen in figure 3 below.

Figure 3: Imaging Phantom. A 3D image from the software used to design the phantom.

Proteins were ordered from Genscript (Piscataway, NJ). All samples were dissolved to a concentration of 5 mg/mL in PBS. After dissolution, each sample's pH was corrected to 7.4 via titration with 0.1 M HCl and 0.1 M NaOH, measuring the pH with pH paper. In a typical experiment five experimental proteins would be used. In addition to these, there would be wells of PBS, a negative control; salmon protamine (sPRM), a positive control; and a peptide of twelve consecutive lysines (K12), which is used as a stand-in for LRP, since LRP represents the contrast generated by currently used CEST reporter genes.

Data Processing

CEST Processing

The data from all MRI scans were processed using an in-house MATLAB script based on prior literature49,50. First the ROIs are isolated automatically using a greedy region growing algorithm, which finds the highest intensity voxel not already in a mask. The program then checks the voxels adjacent to all the voxels in the mask; if they aren't in another mask and produce at least 55% of the intensity of the initial voxel of the mask, they are added to the mask. The mask grows until it can't acquire additional voxels, and then a new seed voxel is selected as above. Masks covering areas that aren't part of one of the phantom's wells are excluded if they aren't circular, as determined by their ratio of diameter to area, or if they are of an unusual size. All masks are visually confirmed before use in image processing.

B0 correction is done using the data from the WASSR scan48 performed immediately prior to the scan being corrected. In this scan a spline interpolation of each voxel's Z-spectrum is used to determine the true central point of the water saturation peak. Later in the process this B0 map is used to correct the Z-spectrum of each voxel, again via spline interpolation.
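The actual processing was done in MATLAB; purely as an illustration of the two spline steps just described, here is a minimal Python sketch for a single voxel (the sign convention of the shift depends on the acquisition and is assumed here):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def b0_offset_ppm(wassr_ppm: np.ndarray, wassr_signal: np.ndarray) -> float:
    """Find the true center of the water saturation peak for one voxel."""
    spline = CubicSpline(wassr_ppm, wassr_signal)
    fine = np.linspace(wassr_ppm.min(), wassr_ppm.max(), 10_001)
    return float(fine[np.argmin(spline(fine))])  # ppm where signal is lowest

def correct_zspectrum(cest_ppm, cest_signal, offset_ppm):
    """Resample one voxel's Z-spectrum onto the B0-corrected frequency axis."""
    spline = CubicSpline(cest_ppm, cest_signal)
    # Shift by the measured offset; clipping avoids extrapolating past the
    # measured range. The sign of the shift is an assumption.
    shifted = np.clip(np.asarray(cest_ppm) + offset_ppm,
                      np.min(cest_ppm), np.max(cest_ppm))
    return spline(shifted)
```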
From a single CEST experiment, contrast values are determined by averaging all voxels within the same ROI. All repetitions of the same scan are then averaged together to determine the final MTRasym value for a given well, yielding a single value for each peptide at each frequency. The PTE values are determined by normalizing the values according to water, adjusting them to account for molar concentrations, and then scaling them to the literature K12 value of 12.5 thousand8 to retain consistency across the different scanners used by us and our collaborators. For ease of communication and entry into POET (see below), these PTE values are divided by 1000.

T-test Mapping

To generate the t-test maps used in this dissertation, I used an in-house MATLAB script based on prior literature49,50. The intensity of each voxel across five experiments at 3.6 ppm was compared to the intensity of the same voxel at -3.6 ppm using a two-tailed unpaired t-test. The resulting p-value was assigned to the same 2D location as the voxel to produce the map.

Protein Optimization Engineering Tool (POET)

POET is a genetic program that was developed in a collaboration between the Banzhaf and Gilad labs for the purpose of improving the function of peptides. POET uses genetic programming as its means of learning. As input data POET takes a table of peptides, given as amino acid sequences in single-letter code, each paired with a numeric measure of the protein's fitness. These values are loaded into POET in the form of a comma-separated value (CSV) file; the first 10 lines of the training data can be seen in Table 2 below. Within this work, the training data of POET uses the PTE values for the appropriate frequency (3.6 ppm for project 2 and 5.0 ppm for project 3) as the fitness value for each peptide, but the concept is deliberately open-ended: fitness can be any numerically quantifiable quality of a peptide that the experimenter wishes to optimize. This open definition of fitness in POET allows it to be used in a variety of applications. PTE was chosen as the measurement for fitness to allow the use of existing data from literature in training POET.

Table 2: Example POET input data

Sequence        Fitness
KKKKKKKKKKKK    12.5
KSKSKSKSKSKS    17
KHKHKHKHKHKH    12.7
KGKGKGKGKGKG    10.8
KSSKSSKSSKSS    13.2
KGGKGGKGGKGG    11.8
KSSSKSSSKSSS    13
KGGGKGGGKGGG    12.1
RRRRRRRRRRRR    22
RSRSRSRSRSRS    12.8

POET generates models called Protein Optimization Evolved Models (POEMs), which predict protein fitness as a function of a peptide's sequence. Each POEM consists of a series of rules. Each rule is composed of a motif and a weight. A motif is a series of amino acids with a length between 1 and 10. A weight is a number, which may be positive if the motif benefits fitness or negative if it impairs fitness. Each model assesses peptide sequences by applying its rules: the predicted fitness for each peptide is the sum of the weights for the motifs it contains. POEMs are generated in groups of 100 called populations. Each time POET is run it begins with a population where the rules in each POEM are random, while later populations are generated by a process of tournament selection, crossover, and mutation. Each POEM within the population is randomly assigned to a group of five called a tournament. The POEMs within a tournament are evaluated by their ability to accurately predict the fitness of peptides in the training data.
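A minimal sketch of a POEM and of how POEMs might be ranked within a tournament (the real POET code in Appendix B differs in detail, and the use of RMSE as the error metric is an assumption):

```python
from dataclasses import dataclass

@dataclass
class Rule:
    motif: str     # 1-10 amino acids in single-letter code
    weight: float  # positive if the motif benefits fitness, negative if not

def predict_fitness(rules: list[Rule], peptide: str) -> float:
    """A POEM's prediction: the sum of the weights of the motifs it contains."""
    return sum(r.weight for r in rules if r.motif in peptide)

def poem_error(rules: list[Rule], training: dict[str, float]) -> float:
    """RMSE of a POEM's predictions over the training data (assumed metric)."""
    sq = [(predict_fitness(rules, pep) - fit) ** 2
          for pep, fit in training.items()]
    return (sum(sq) / len(sq)) ** 0.5

# Within a tournament of five, POEMs are ranked by this error (lower is better):
# winners = sorted(tournament, key=lambda poem: poem_error(poem, training))[:2]
```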
The two most accurate POEMs in each tournament produce four new POEMs for the new population, where each new POEM's rules are randomly inherited from the two tournament winners. An additional POEM is produced by applying "mutations" to the best POEM in the tournament. Mutations in POET are modifications to the rules of a POEM, in a manner analogous to mutational changes to DNA in biological evolution. The types of mutation can be separated into two groups: POEM mutations and rule mutations. When a POEM is being mutated, POET first randomly determines for each POEM mutation whether it occurs; then each rule of the POEM has a chance for each of the rule mutations to occur to it. The different types of mutation performed in POET can be seen in Table 3.

Table 3: Mutations in POET

Mutation        Type    Chance    Description
Add Rule        POEM    20%       A randomly generated new rule is added to the POEM.
Remove Rule     POEM    20%       A randomly selected existing rule is removed from the POEM.
Adjust Weight   Rule    20%       The rule's weight is increased or decreased by a percentage of its prior weight.
Add Letter      Rule    10%       An additional letter is added to the rule's motif, unless the motif already has the maximum number of letters.
Remove Letter   Rule    10%       A letter is removed from the rule's motif, unless the motif has one or fewer letters.

This cycle of generating rules and comparing POEMs repeats to produce each new population of models. If a rule is useful for predicting the training data, it will improve the accuracy of the POEMs containing it. As these rules mutate, beneficial mutations will increase the accuracy of the model, leading to their presence in later populations, while detrimental mutations decrease a POEM's accuracy and thus make it less likely to be present in later populations. This results in the accrual of useful rules that increase the POEMs' accuracy as the learning process continues, despite all the rules being generated and mutated randomly.

When running POET on MSU's High Performance Computing Center (HPCC), POET's process of learning was run multiple times in parallel, typically 100. Each of these runs produced its own series of POEMs from the initial steps and evolved in different ways. After POEMs are developed, POET can be used to predict the fitness of a peptide from its sequence. First POET finds the model most accurate at predicting the training data, and then that model is used to predict the fitness of the peptide. Additionally, before the fitness values of a peptide are calculated, POET applies a simple amino acid model of hydrophobicity51 to predict whether the peptide is soluble. The learning process of POET can be visualized in figure 4.

Figure 4: Visual depiction of POET. Motifs, such as arginine-lysine-arginine (RKR) or lysine-serine (KS), are identified from the peptides in the training data provided to POET (a). Weights are assigned to these motifs to create POEMs (b). A population of POEMs undergoes mutation and recombination (c) to produce the next generation of POEMs (d), which themselves are used to generate later generations of models using further cycles of recombination and mutation.

When using POET to produce a list of peptides to examine, it first generates a random list of peptides, predicts the fitness of each peptide on the list, sorts the list, and returns the top results.
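A minimal sketch of this predictor step, reusing predict_fitness from the sketch above (the defaults correspond to the values described next; the solubility filter is noted but not implemented):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def suggest_peptides(best_poem, n_candidates=10_000, length=12, top_k=10):
    """Generate random peptides, score them with the best POEM, return the top."""
    candidates = ["".join(random.choices(AMINO_ACIDS, k=length))
                  for _ in range(n_candidates)]
    # POET also applies a hydrophobicity-based solubility check here (omitted).
    candidates.sort(key=lambda pep: predict_fitness(best_poem, pep),
                    reverse=True)
    return candidates[:top_k]
```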
The number of randomly generated peptides, the length of the peptides generated, and the number of top peptides that are returned are variables that can be altered in POET's config file for ease of use in different applications. My experiments used 10,000, 12, and 10, respectively, for these inputs. For more information on POET please consult an article published by my collaborator on it52.

POET Design Cycle

The cycle of using POET would begin with predictions of new peptides, which would then be synthesized commercially by Genscript (Piscataway, NJ). These peptides then had their contrast measured as described above. The new data from these peptides was then added to the overall training data, and POET was started again from the beginning to develop a new series of POEMs using this newer, larger body of training data. This design cycle can be visualized in figure 5 below.

Figure 5: Design Cycle using POET. POET identifies motifs from a preexisting dataset (bottom). This leads to identifying proteins for testing (upper left). The proteins are tested, here using an MRI (upper right), and the data is added to POET to begin the cycle anew (bottom).

PROJECT 1: IMPROVING CEST PEPTIDE SENSITIVITY THROUGH DIRECTED EVOLUTION

Introduction

Prior work established that sPRM, an alternate histone produced in sperm cells to tightly bind DNA, was able to function as a good CEST contrast agent, since the many positively charged arginine residues used to bind the negatively charged DNA provide many highly accessible amide protons. Use of sPRM as a transgenic reporter gene, however, presents the long-term issue of immune system rejection common to using a non-human gene within a human body. In human sperm there is a homolog to sPRM, human protamine 1 (hPRM1)53. Experiments testing the CEST contrast produced by hPRM1 demonstrated substantial contrast, though less than that of sPRM25,54, and its localization in sperm cells should prevent it from interfering with the background signal in humans. We hypothesized that if we were to apply optimization to hPRM1 we would be able to increase its contrast without significantly increasing its immunogenicity.

Results

Figure 6: Logo of Human Protamine Variants. The height of each letter demonstrates the relative occurrence of a given amino acid in the mutant proteins, using single-letter amino acid codes. The wild type for hPRM1 is MARYRCCRSQSRSRYYRQRQSRRRRRRSCQTRRRAMRCCRPRY. This figure was made using WebLogo 3.6.0.

The experiment was repeated three times, producing three plates and 65 colonies where the sequence was analyzed. Across all of the studied peptide sequences there were 23 different mutants. Of these there were 13 different mutations to the overall protein sequence. A logo of the mutants generated can be seen in figure 6.

Discussion

Of the 13 mutations to the protein sequence, only 4 occurred at one of the sequence's arginine residues, despite the arginine residues making up 20 of the 44 amino acids in the overall sequence. This can be understood as a result of arginine codon redundancy. Arginine is encoded by six codons (CGU, CGC, CGA, CGG, AGA, AGG), tied with leucine and serine for the most of any amino acid. For the four CGN codons, any change to the third nucleotide is silent; several first-position changes are silent as well (for example, CGA to AGA), while others create a stop codon (for example, CGA to UGA).
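To make the codon argument concrete, a short calculation of my own (standard genetic code; not part of the original analysis), counting how often a random single-nucleotide substitution in an arginine codon is silent:

```python
# Silent here means the mutant codon still encodes arginine.
ARG_CODONS = {"CGU", "CGC", "CGA", "CGG", "AGA", "AGG"}
BASES = "ACGU"

mutants = [codon[:i] + base + codon[i + 1:]
           for codon in ARG_CODONS
           for i in range(3)
           for base in BASES if base != codon[i]]
silent = sum(m in ARG_CODONS for m in mutants)
print(silent, len(mutants))  # 18 of 54: a third of substitutions are silent
```

A full third of random single-nucleotide substitutions in an arginine codon leave the protein sequence unchanged, before even counting substitutions into chemically similar residues.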
Conclusion

Due to the slow progress of these directed evolution experiments, the project was ended before the contrast of the mutants could be evaluated or the immunogenicity of the mutants could be measured. Whether an improved version of hPRM1 could be produced via directed evolution remains unclear. Furthermore, whether any of the hPRM1 mutants have an immunogenicity that would prevent them from clinical application is unclear. Although this project did not produce the reporter gene that it set out to produce, it did generate insights into the limitations of developing CEST reporter genes using directed evolution and what can be done to overcome them. Mutagenesis seems to introduce changes very slowly into hPRM1. The density of arginine residues and the likelihood of an arginine codon mutating into another arginine codon mean that any mutation to the DNA sequence has a high likelihood of being silent. Further, the process of evaluating the contrast from these variants presents many challenges, such as purification and standardization of concentration, which could obscure the effects of a small change in sequence, such as the mutation of a single amino acid into a similar amino acid in a chain of 44. This suggests that further developments in the engineering of CEST reporter genes must be made using a method other than directed evolution, optimally one that allows a greater variety of protein biochemistries to be assayed.

PROJECT 2: IMPROVING CEST CONTRAST AT 3.6 PPM VIA MACHINE LEARNING

Many portions of this section are adapted from a currently published paper1.

Introduction

The directed evolution experiments encountered a variety of problems: the mutation count was low, the silent mutation rate was high, and normalizing for protein content in each sample presented technical difficulties. This led us to pursue a different means of developing CEST reporter genes using machine learning. This approach had its own issues: in the prior literature only one paper provided a database of peptides, and it contained only 31 peptides, each 12 amino acids long. Current machine learning methods of protein engineering require far larger datasets, typically thousands of proteins, and rely on longer sequences, typically over 100 amino acids, to provide context. This led to the collaboration with the Banzhaf lab and, with it, the development of POET.

The first experiments using POET focused on the development of a CEST-producing peptide (CESTide) generating contrast at 3.6 ppm, the amine resonance frequency. This was taken as a focus due to the relatively large amount of prior data available in the literature and the lab's prior focus on reporter genes at this frequency, such as LRP; there are no published results that would allow a training library at other frequencies, such as 1.8 ppm or 0.8 ppm. Although the 1.8 and 0.8 ppm peaks are larger than the 3.6 ppm peaks in many cases, they are far closer to the water saturation frequency, which produces issues with contrast-to-noise ratio (CNR), especially on scanners of clinical strength. This would demonstrate POET's ability on a relatively well-understood question.

Generational Development

The first generation resulted in only one soluble peptide, which generated less contrast than K12. The issue of solubility led to integrating a solubility estimation into POET's predictor (see methods). The Z-spectra and MTR values for generation 1 can be seen in figure 7.

Figure 7: 3.6 ppm Generation 1.
Shows the MTR (a) and Z-spectra (b) acquired from the only soluble peptide in the 1st generation of 3.6 ppm POET design. The results are from one experiment.

The second generation saw an increase in contrast and universal solubility. The best peptide in the generation produced more contrast than K12. Further, this peptide, NSSNHSNNMPCQ, had a neutral charge (pI 7.32), which was of interest since prior attempts to engineer a CEST reporter gene focused on incorporating as many positively charged residues as possible, leading them to have a strong positive charge55. The Z-spectra and MTR values for generation 2 can be seen in figure 8.

Figure 8: 3.6 ppm Generation 2. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 2nd generation of 3.6 ppm POET design. Data is taken from a single experiment. Error bars show standard deviation.

Generation 3 was developed unusually, with half of the peptides produced using a model that had been developed to predict generation 2 but was overlooked in that generation because it was not the most accurate at the time. The model was in the final population for predicting generation 2, but didn't have the lowest error when determining which POEM would be used with the predictor. A post-hoc analysis determined that when incorporating the data gathered from the generation 2 peptides, this model's error was lower than that of a model trained from scratch with both generations of data. The other five peptides were predicted from a model developed as normal. The peptides from the generation 2 model produced significantly (p=0.01) more contrast than those from the entirely new model. Later generations didn't use this split-model approach, and similar post-hoc analyses weren't performed, in order to maintain a consistent experimental framework across generations. The Z-spectra and MTR values for generation 3 can be seen in figure 9.

Figure 9: 3.6 ppm Generation 3. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 3rd generation of 3.6 ppm POET design. Points are the average from two experiments for the first five peptides or five experiments for the last five peptides. Error bars show standard deviation.

The 4th generation was the first to see a decrease in maximal contrast, to 22.30, but the average of the generation did see an increase, to 14.49. The Z-spectra and MTR values for generation 4 can be seen in figure 10.

Figure 10: 3.6 ppm Generation 4. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 4th generation of 3.6 ppm POET design. All values are from the average of two experiments. Error bars show standard deviation.

In the 5th generation, two new peptides were discovered that produced more contrast than any prior generation: WFGLQRHLKKKD (40.58) and LELKLGKRPMGW (43.64), and the average contrast increased to 23.14. The Z-spectra and MTR values for generation 5 can be seen in figure 11.

Figure 11: 3.6 ppm Generation 5. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 5th generation of 3.6 ppm POET design. Points are taken from the average of four experiments. Error bars show standard deviation.

During the production of the 6th generation of peptide optimization, the peptides were 10 amino acids long instead of the normal 12, due to an experimenter error in using POET's predictor function. The maximum contrast produced was 31.40, which is less than the top CESTides of generation 5.
The average contrast also decreased, to 14.30. The Z-spectra and MTR values for generation 6 can be seen in figure 12.

Figure 12: 3.6 ppm Generation 6. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 6th generation of 3.6 ppm POET design. Points are taken from the average of five experiments for the first five peptides or four experiments for the last five peptides. Error bars show standard deviation.

The 7th generation of peptide optimization saw the two highest contrast peptides generated by POET: LWSDIKMKLKKT (PTE of 49.4) and KMGKLIGIPVLK (PTE of 47.8). This generation also achieved an average PTE of 21.67, the second highest after generation 5. The Z-spectra and MTR values for generation 7 can be seen in figure 13.

Figure 13: 3.6 ppm Generation 7. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 7th generation of 3.6 ppm POET design. Points are taken from the average of five experiments. Error bars show standard deviation.

The 8th generation of CESTides saw a decrease in both peak and average contrast, to 30.59 and 18.54 respectively. The Z-spectra and MTR values for generation 8 can be seen in figure 14.

Figure 14: 3.6 ppm Generation 8. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 8th generation of 3.6 ppm POET design. Points are taken from the average of four experiments for the first five peptides or five experiments for the last five peptides. Error bars show standard deviation.

In the 9th generation of CESTides the maximum contrast increased relative to the 8th generation, to 30.64, but the average contrast decreased to 12.25. The Z-spectra and MTR values for generation 9 can be seen in figure 15.

Figure 15: 3.6 ppm Generation 9. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 9th generation of 3.6 ppm POET design. Points are taken from the average of five experiments. Error bars show standard deviation.

In the 10th and final generation of 3.6 ppm peptide evolution there was an increase in both the peak contrast and the average contrast relative to generation 9, with values of 29.03 and 17.64 respectively. The 10th generation can be seen in figure 16, and the contrast generated in each generation can be seen in figure 17.

Figure 16: 3.6 ppm Generation 10. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 10th generation of 3.6 ppm POET design. Points are taken from the average of five experiments for the first five peptides or three experiments for the last five peptides. Error bars show standard deviation.

Figure 17: Contrast by generation for 3.6 ppm peptides. Each point represents a single peptide from the generation. The red dashed line is the contrast generated by K12, shown for comparison. Black lines represent the median of the generation. Contrast values are normalized to K12.

Sensitivity Experiments

To examine whether the increases in exchange rate and concentration were able to lead to higher sensitivity, I examined differing concentrations of K12 and KYTKTRKQSSKA (the highest contrast peptide from generation 3, chosen for its high exchange rate). T-test maps were generated from this experiment as described in methods. The comparison between the two samples can be seen in figure 18. KYTK produced more contrast and had more voxels determined to be significant by a t-test at every concentration.

Figure 18: Sensitivity increase from exchange rate increase.
Wells of KYTK and K12 graphed by concentration and contrast (a), compared by the results of a Student's t-test (b), and the MTR values for each of the wells (c).

Interesting Discoveries

The CESTides found using POET demonstrated several unique qualities as a result of not being developed through iterative changes to a single base peptide. This is most clearly evident in the unusual combinations of amino acids seen within the peptides. Earlier attempts to create CEST reporter genes were highly homogenous, relying on large numbers of lysine and arginine residues. Although lysine is the most commonly used amino acid in the CESTides studied, no amino acid holds a majority of the residues as is seen in hPRM25 or LRP22. The variety of amino acids used by the CESTides generated with POET can be seen in figure 19. These usage patterns also form trends in the learning throughout the generations of POET design and optimization, as seen in figure 20. The most notable trend follows lysine and arginine, which form major parts of currently studied CEST reporter genes such as LRP and hPRM1. Initially lysine and arginine make up a small portion of the generations. Identifying the importance of these residues in generating contrast, POET increases the amount of lysine and arginine in its predictions up through generation five. The concentration decreases in generations six and seven, where the highest peak contrast is achieved. Generations eight and nine see the lysine and arginine content stabilize before increasing in generation 10. This perhaps follows POET learning how to generate contrast via lysine and arginine content before reaching a fitness plateau and attempting to find additional means of increasing contrast, such as the surges in hydrophobic residues like valine in generation nine and leucine in generation seven.

Figure 19: Diversity of amino acids within CESTides generated with POET. Every peptide generated is shown in the order of its testing.

Figure 20: Generational Changes in Amino Acid Usage.

Another area of interest is the way these changing chemistries have impacted chemical properties, most notably the peptide's charge. In prior literature, peptides were engineered to generate more contrast by introducing additional positively charged residues, such as lysine55. The relationship between the charge of the CESTide and the contrast it generated can be clearly seen in figure 21. Although the highest contrast producing peptides are still positively charged, peptides generated by POET demonstrated higher contrast than any peptide in the training data even among neutral and negatively charged peptides.

Figure 21: Relationship between charge and contrast. Black circles are the peptides from literature used as training data. Red squares are the peptides generated using POET.

Analysis of POET Predictions

We sought to examine the differences between the peptides generated by POET to determine whether POET was converging toward a solution. This was calculated via the nearest neighbor distance between peptides using the Grantham distance, which takes into consideration differences in chemical composition, polarity, and molecular volume between amino acids56. The basic assumption is that amino acids that are similar in chemical composition, polarity, and molecular volume are more likely to substitute for one another throughout evolution, as they are less disruptive to protein function.
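A minimal sketch of the nearest neighbor calculation (GRANTHAM stands for a symmetric lookup of the published Grantham scores; treating the distance between two equal-length peptides as the sum of per-position scores is my assumption about the analysis, not taken from the original scripts):

```python
def peptide_distance(a: str, b: str, grantham: dict) -> float:
    """Sum of per-residue Grantham distances between two aligned peptides."""
    return sum(grantham[(x, y)] if x != y else 0.0 for x, y in zip(a, b))

def nearest_neighbor(peptide: str, pool: list, grantham: dict) -> float:
    """Distance from a peptide to its most similar neighbor in a pool
    (for the intragenerational case the pool excludes the peptide itself)."""
    return min(peptide_distance(peptide, q, grantham) for q in pool)

# Intergenerational: pool = all peptides from prior generations.
# Intragenerational: pool = the other peptides of the same generation.
```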
To determine whether POET was learning and converging on a solution, we compared the Grantham distance between the peptides discovered with POET and peptides that were generated randomly. We first examined the intergenerational nearest neighbor distance by finding, for each peptide, the shortest Grantham distance to any peptide in all prior generations. The Grantham distance decreased as the number of generations increased, implying that learning took place: the predictions of POET are more similar to earlier peptides than random predictions would be, and the distance decreases faster. Next, we examined the intragenerational nearest neighbor distance by comparing each peptide to all peptides in the same generation to determine the most similar peptide. We find that the distance stays lower than in the random simulation, implying that a form of selection is occurring. As POET optimizes, it would be expected that the diversity of new generations decreases as POET begins to single out the area of the global maximum in the fitness landscape. The distances in the experiments are, however, not decreasing by generation, which suggests that POET is not converging on a solution. The nearest neighbor analysis over the CESTides generated by POET can be seen in figure 22.

Figure 22: Learning with POET. The intergenerational (a) and intragenerational (b) nearest neighbor distance from all the CESTides produced by POET. The data from the actual POET data set is shown in blue circles, while the data from a random simulation is shown as red squares. Brackets show the 95% CI for the data.

Conclusions

POET demonstrated the ability to discover several new peptides with higher contrast than those in prior literature, with the peak seeing roughly a fourfold increase over K12. This peptide can be seen compared to K12 in figure 23. Work done through collaborators indicates that the increased contrast generated is a result of an increased exchange rate1.

Figure 23: Top 3.6 ppm peptide vs K12. The top peptide produced by the 3.6 ppm POET experiments is shown beside the K12 MTR. Data is the average of five experiments.

This suggests that POET is a good tool for engineering CEST reporter genes, and possibly peptides in general. Although not universal, the increased contrast appears to be generated by an increased rate of proton exchange. Most of the peptides discovered in the search have chemistries that mean they would likely never have been found by standard methods, such as directed evolution, within a feasible time scale. There is an unexplained loss of improvement after the 7th generation, which could have many possible sources. One possibility is a fitness plateau, with POET unable to generate a better peptide via the same mechanism. Another possibility is a known bug in POET that wasn't discovered until after the final experiment was run. The bug decreases the overall learning done by POET by ending POET's learning process prematurely. This problem scaled with the increase in time to evaluate a population because of the increasing volume of learning data.

PROJECT 3: IMPROVING CEST CONTRAST AT 5.0 PPM VIA MACHINE LEARNING

Introduction

Generating contrast at 3.6 ppm is very common for proteins and has been the approach taken with hPRM1, LRP, and scGFP, as well as prior experimentation using POET.
This is due to the ease of incorporating exchangeable amine groups into proteins using lysine, histidine, or arginine residues. This ease, however, works in reverse as well: these same residues are used frequently in naturally occurring proteins, which together make up a large portion of the contents of a cell. This results in a high background signal at 3.6 ppm (see figure 24). There are measures of CEST contrast at frequencies other than 3.6 ppm, notably 5.0 ppm, which is generated by thymidine and other biomolecules6. This contrast is believed to be related to exchangeable protons bound to ring structures such as those found in histidine or tryptophan, but no protein CEST agents had been developed that generate contrast at that frequency. If a protein CEST agent were developed that generated contrast at 5 ppm, it would be easier to distinguish from the background in a biological system. This would result in increased specificity and sensitivity. Since POET doesn't understand the mechanism of contrast, but instead determines relationships between the amino acid sequence and the output, we sought to use POET to generate a new class of reporters at 5 ppm.

Figure 24: CEST Background. In black circles is the MTRasym of cell lysate from E. coli. In red squares is the best protein from generation 5 of the 3.6 ppm POET experiments.

Generational Development

A full Z-spectrum running from -7 to 7 ppm was acquired for the prior work in the generation of 3.6 ppm reporters (see above). The contrast values at 5 ppm were used as the initial training data (generation 0), containing all the data from generations 1-5 (all peptides with a generation value less than 5 and a series value of 3.6 in Appendix A). For purposes of statistical analysis, the training data was treated as representing the contrast produced by a random soluble peptide. The overall contrast generated by the peptides in each generation can be seen in figure 25.

Figure 25: 5 ppm contrast in each generation. Each peptide tested from the POET development of a 5 ppm contrast agent is included. MTR values are normalized to the 3.6 ppm peak of K12. Generation 0 represents the learning data.

The first generation saw significant (p = 1.2*10^-15) improvement over the training data, with the best peptide having a PTE of 10.1 and the average being 6.9. None of the peptides show clear signs of generating contrast from an exchangeable proton at 5 ppm; instead most of the signal at 5 ppm seems to be part of a gradual decrease of the 3.6 ppm peak. The MTRasym and Z-spectra for the 1st generation can be seen in figure 26.

Figure 26: 5 ppm Generation 1. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 1st generation of 5 ppm POET design. Points are the average from three experiments for the first six peptides or two experiments for the last four peptides. Error bars show standard deviation.

The second generation had difficulties with solubility, with five of the peptides for that generation being insoluble and going unmeasured. The generation also had a lower average than generation 1, at 5.6 among the soluble peptides, but the peptides still performed significantly better than the training data (p=0.005). The MTRasym and Z-spectra for the 2nd generation can be seen in figure 27.

Figure 27: 5 ppm Generation 2. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 2nd generation of 5 ppm POET design. Points are the average from three experiments.
Error bars show standard deviation.

The third generation, like the second, saw a decrease in both maximum (10.77) and average contrast (5.32). The results were still significantly (p=0.008) better than the training data. The MTRasym and Z-spectra for the 3rd generation can be seen in figure 28.

Figure 28: 5 ppm Generation 3. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 3rd generation of 5 ppm POET design. Points are the average from four experiments for the first five peptides or five experiments for the last five. Error bars show standard deviation.

In the fourth generation the maximum contrast decreased to a PTE of 7.1 and the average decreased to 4.38. The results were still significantly (p=0.008) better than the training data. The MTRasym and Z-spectra for the 4th generation can be seen in figure 29.

Figure 29: 5 ppm Generation 4. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 4th generation of 5 ppm POET design. Points are the average from five experiments for the first five peptides or three for the last five. Error bars show standard deviation.

In the fifth generation the maximum and average contrast increased, with a maximum of 9.73, higher than the fourth generation but still lower than the first generation. The MTRasym and Z-spectra for the 5th generation can be seen in figure 30.

Figure 30: 5 ppm Generation 5. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 5th generation of 5 ppm POET design. Points are the average from five experiments. Error bars show standard deviation.

The sixth generation saw the peak of the 5 ppm experiments. Its top CESTide, KVNFNKAVSNLK, produced the highest 5 ppm PTE of 14.23, and it appears to form a distinct peak, despite an apparent contribution from the amide protons exchanging at 3.6 ppm. The average contrast for this generation was also the highest, at a PTE of 7.01. The MTRasym and Z-spectra for the 6th generation can be seen in figure 31.

Figure 31: 5 ppm Generation 6. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 6th generation of 5 ppm POET design. Points are the average from five experiments. Error bars show standard deviation.

In the seventh generation the maximum contrast decreased to 8.01 and the average contrast decreased to 3.82. The MTRasym and Z-spectra for the 7th generation can be seen in figure 32.

Figure 32: 5 ppm Generation 7. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 7th generation of 5 ppm POET design. Points are the average from five experiments. Error bars show standard deviation.

The eighth generation continued the pattern of decrease, with a maximum contrast value of 7.26 and an average contrast value of 3.14. This generation wasn't significantly different from the initial training data (p=0.062). The MTRasym and Z-spectra for the 8th generation can be seen in figure 33.

Figure 33: 5 ppm Generation 8. Shows the MTR (a) and Z-spectra (b) acquired from the peptides in the 8th generation of 5 ppm POET design. Points are the average from five experiments for the first five peptides or four for the last five. Error bars show standard deviation.

The ninth and final generation performed the worst out of all the generations, performing significantly (p=0.017) worse than the training data. The generation's maximum is the lowest generational maximum, at 3.67, and the average was also the lowest, with a mean contrast of 2.81. The MTRasym and Z-spectra for the 9th generation can be seen in figure 34.

Figure 34: 5 ppm Generation 9.
Conclusions

Using POET, I was able to develop a single peptide (KVNFNKAVSNLK) that produces more contrast at 5 ppm than K12 produces at 3.6 ppm, over three times the highest 5 ppm contrast in any of the data used to train the initial POEMs for the 5 ppm experiments. This peptide should prove a useful tool as a basis for reporter genes and could become the baseline for later optimization work via other methods such as directed evolution. The contrast produced by this top peptide compared to the contrast of K12 can be seen in figure 35.

Figure 35: Top 5 ppm peptide vs K12. The top peptide produced by the 5 ppm POET experiments is shown beside the K12 MTR. Data is the average of five experiments.

POET was unable to continue developing better 5 ppm peptides after the 6th generation; later generations saw increasingly low contrast from the peptides being developed, rather than the maximum contrast increasing as the amount of training data increased. The reason for this decrease is unclear, but several factors are possible contributors.

One possible cause is that the initial dataset was composed of peptides designed to generate contrast at 3.6 ppm. In such a dataset the majority of the 5 ppm contrast is a result of decay from the peak at 3.6 ppm rather than contrast from a proton exchanging at 5 ppm. This bias might have caused POET's learning to focus on increasing 5 ppm contrast in the relatively inefficient manner of increasing the 3.6 ppm peak. This could be addressed by, instead of using the PTE at 5 ppm as the fitness value, calculating the fitness as in Equation 3 below.

Eq. 3: Fitness = PTE_{5 ppm} / PTE_{3.6 ppm}

Calculating the fitness in this way would maintain selective pressure to produce peptides that generate high contrast at 5 ppm, while assigning low fitness to peptides whose 5 ppm contrast draws heavily on the 3.6 ppm peak, making them less likely to be suggested. A minimal sketch of this calculation follows.
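The sketch below is mine, not code from POET; the epsilon guard is an added assumption to avoid division by zero for peptides with no measurable 3.6 ppm contrast:

def ratio_fitness(pte_5ppm, pte_3p6ppm, eps=1e-9):
    # Eq. 3: reward 5 ppm contrast while penalizing peptides whose
    # 5 ppm signal is mostly shoulder from the 3.6 ppm amide peak
    return pte_5ppm / (pte_3p6ppm + eps)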
Another problem, discovered after the experiments were performed, lay in the design of POET itself: a bug caused the learning process to end prematurely as the learning time grew with additional training data. This bug has been addressed in the current version of POET, but had not been fixed when this research took place.

CONCLUSION

Functionality of POET

Using POET I have been able to engineer CESTides that produce nearly four times as much contrast as the prior state of the art. I was additionally able to develop a reporter that produces contrast at 5 ppm on a scale similar to what PLL produces at 3.6 ppm. This suggests that POET is a useful tool for protein engineers and synthetic biologists developing peptides that cannot be screened in a quick or efficient way. The dramatic differences between the peptides discovered and the peptides used as a starting point imply that directed evolution would not have developed them in a reasonable amount of time.

As the dataset of POET grows, its ability to learn seems to decrease after crossing a certain point that I have not been able to determine experimentally. Colleagues focused on improving POET have since made several improvements and fixed several bugs, allowing POET to develop POEMs with lower error faster than before. Whether this weakness in learning persists in the current version of POET is a point for further research.

Impact

POET has been shown to be an effective means of designing new reporter genes for MRI. This demonstration, together with the early results from the development of uPar peptides, suggests that POET will be useful for researchers seeking to optimize peptides by their function in ways that explore a wide area of the search space.

The 3.6 series peptides appear to be unique in chemistry and produce contrast that exceeds that of prior peptides. If developed into fully functional reporter genes, these peptides are likely to become the backbone of later efforts to generate MRI reporter genes using CEST.

Future Research

Over the course of writing this dissertation, POET has continued to be developed with the objectives of making it more efficient and able to solve different kinds of problems. There is still substantial room for improvement in making POET's learning faster and more complete.

Since POET seeks a simple correlation between peptide sequence and function, it is currently being used to optimize proteins for guiding extracellular vesicles. These peptides would help the vesicles deliver drugs to the areas where they are most needed.

The peptides themselves are not at a finished point. To make the jump from protein contrast agent to reporter gene, the peptides need to be assembled into complete proteins and expressed in cells. Work on this has already begun in the Gilad Lab57.

Several of the proteins discovered using POET demonstrate highly unexpected properties, especially the reporters with neutral or negative charge. Closer examination of these peptides could yield discoveries in protein biochemistry about what factors influence the exchange rate of amine protons.

The peptides developed in this project are not yet full reporter genes. To reach that point, the peptides would need to be combined into a longer protein, which would allow more efficient expression in vivo. Further, to ensure that there are no immunogenicity problems with the fully developed reporter gene, it should be tested with software that predicts immunogenicity and undergo testing in humanized small animals before it can be useful in clinical work.

REFERENCES

1. Bricco, A. R. et al. Protein Optimization Evolving Tool (POET) based on genetic programming. bioRxiv, 2022.03.05.483103 (2022).
2. Smith-Bindman, R. et al. Trends in Use of Medical Imaging in US Health Care Systems and in Ontario, Canada, 2000-2016. JAMA 322, 843-856, doi:10.1001/jama.2019.11456 (2019).
3. Plewes, D. B. & Kucharczyk, W. Physics of MRI: a primer. Journal of Magnetic Resonance Imaging 35, 1038-1054 (2012).
4. Brateman, L. Chemical shift imaging: a review. American Journal of Roentgenology 146, 971-980 (1986).
5. Meyerhoff, D. J., Rooney, W. D., Tokumitsu, T. & Weiner, M. W. Evidence of multiple ethanol pools in the brain: an in vivo proton magnetization transfer study. Alcoholism: Clinical and Experimental Research 20, 1283-1288 (1996).
6. Ward, K., Aletras, A. & Balaban, R. S. A new class of contrast agents for MRI based on proton chemical exchange dependent saturation transfer (CEST). Journal of Magnetic Resonance 143, 79-87 (2000).
7. Van Zijl, P. C. & Yadav, N. N. Chemical exchange saturation transfer (CEST): what is in a name and what isn't? Magnetic Resonance in Medicine 65, 927-948 (2011).
8. McMahon, M. T. et al. New "multicolor" polypeptide diamagnetic chemical exchange saturation transfer (DIACEST) contrast agents for MRI. Magnetic Resonance in Medicine 60, 803-812 (2008).
9. Nasrallah, F. A., Pagès, G., Kuchel, P. W., Golay, X. & Chuang, K.-H. Imaging brain deoxyglucose uptake and metabolism by glucoCEST MRI. Journal of Cerebral Blood Flow & Metabolism 33, 1270-1278 (2013).
10. Van Zijl, P. C., Jones, C. K., Ren, J., Malloy, C. R. & Sherry, A. D. MRI detection of glycogen in vivo by using chemical exchange saturation transfer imaging (glycoCEST). Proceedings of the National Academy of Sciences 104, 4359-4364 (2007).
11. Woessner, D. E., Zhang, S., Merritt, M. E. & Sherry, A. D. Numerical solution of the Bloch equations provides insights into the optimum design of PARACEST agents for MRI. Magnetic Resonance in Medicine 53, 790-799 (2005).
12. Zhang, S., Merritt, M., Woessner, D. E., Lenkinski, R. E. & Sherry, A. D. PARACEST agents: modulating MRI contrast via water proton exchange. Accounts of Chemical Research 36, 783-790 (2003).
13. Casadaban, M. J., Chou, J. & Cohen, S. N. In vitro gene fusions that join an enzymatically active beta-galactosidase segment to amino-terminal fragments of exogenous proteins: Escherichia coli plasmid vectors for the detection and cloning of translational initiation signals. Journal of Bacteriology 143, 971-980 (1980).
14. Chalfie, M., Tu, Y., Euskirchen, G., Ward, W. W. & Prasher, D. C. Green fluorescent protein as a marker for gene expression. Science 263, 802-805 (1994).
15. Brasier, A., Tate, J. & Habener, J. Optimized use of the firefly luciferase assay as a reporter gene in mammalian cell lines. Biotechniques 7, 1116-1122 (1989).
16. Louie, A. Y. et al. In vivo visualization of gene expression using magnetic resonance imaging. Nature Biotechnology 18, 321-325 (2000).
17. Weissleder, R. et al. In vivo magnetic resonance imaging of transgene expression. Nature Medicine 6, 351-354 (2000).
18. Cohen, B. et al. MRI detection of transcriptional regulation of gene expression in transgenic mice. Nature Medicine 13, 498-503 (2007).
19. Genove, G., DeMarco, U., Xu, H., Goins, W. F. & Ahrens, E. T. A new transgene reporter for in vivo magnetic resonance imaging. Nature Medicine 11, 450-454 (2005).
20. Kodibagkar, V. D., Yu, J., Liu, L., Hetherington, H. P. & Mason, R. P. Imaging β-galactosidase activity using 19F chemical shift imaging of LacZ gene-reporter molecule 2-fluoro-4-nitrophenol-β-D-galactopyranoside. Magnetic Resonance Imaging 24, 959-962 (2006).
21. Gilad, A. A., Bar-Shir, A., Bricco, A. R., Mohanta, Z. & McMahon, M. T. Protein and peptide engineering for chemical exchange saturation transfer imaging in the age of synthetic biology. NMR in Biomedicine, e4712 (2022).
22. Gilad, A. A. et al. Artificial reporter gene providing MRI contrast based on proton exchange. Nature Biotechnology 25, 217-219 (2007).
23. Perlman, O. et al. Redesigned reporter gene for improved proton exchange-based molecular MRI contrast. Scientific Reports 10, 20664 (2020).
24. Goffeney, N., Bulte, J. W., Duyn, J., Bryant, L. H. & Van Zijl, P. C. Sensitive NMR detection of cationic-polymer-based gene delivery systems using saturation transfer via proton exchange. Journal of the American Chemical Society 123, 8628-8629 (2001).
25. Bar-Shir, A. et al. Human protamine-1 as an MRI reporter gene based on chemical exchange. ACS Chemical Biology 9, 134-138 (2014).
26. Bar-Shir, A. et al. Transforming thymidine into a magnetic resonance imaging probe for monitoring gene expression. Journal of the American Chemical Society 135, 1617-1624 (2013).
27. Meier, S. et al. Non-invasive detection of adeno-associated viral gene transfer using a genetically encoded CEST-MRI reporter gene in the murine heart. Scientific Reports 8, 4638 (2018).
28. Farrar, C. T. et al. Establishing the lysine-rich protein CEST reporter gene as a CEST MR imaging detector for oncolytic virotherapy. Radiology 275, 746-754 (2015).
29. Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology 10, 866-876 (2009).
30. Goldsmith, M. & Tawfik, D. S. Enzyme engineering: reaching the maximal catalytic efficiency peak. Current Opinion in Structural Biology 47, 140-150 (2017).
31. Meyer, J. R. et al. Repeatability and contingency in the evolution of a key innovation in phage lambda. Science 335, 428-432 (2012).
32. Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. in Methods in Enzymology Vol. 383, 66-93 (Elsevier, 2004).
33. Loshbaugh, A. L. & Kortemme, T. Comparison of Rosetta flexible-backbone computational protein design methods on binding interactions. Proteins: Structure, Function, and Bioinformatics 88, 206-226 (2020).
34. Tallorin, L. et al. Discovering de novo peptide substrates for enzymes using machine learning. Nature Communications 9, 5253, doi:10.1038/s41467-018-07717-6 (2018).
35. Xu, Y. et al. Deep Dive into Machine Learning Models for Protein Engineering. Journal of Chemical Information and Modeling 60, 2773-2790, doi:10.1021/acs.jcim.0c00073 (2020).
36. David, A., Islam, S., Tankhilevich, E. & Sternberg, M. J. The AlphaFold database of protein structures: a biologist's guide. Journal of Molecular Biology 434, 167336 (2022).
37. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods 16, 1315-1322, doi:10.1038/s41592-019-0598-1 (2019).
38. Dong, T. N., Brogden, G., Gerold, G. & Khosla, M. A multitask transfer learning framework for the prediction of virus-human protein-protein interactions. BMC Bioinformatics 22, 572, doi:10.1186/s12859-021-04484-y (2021).
39. Koza, J. R. Genetic programming as a means for programming computers by natural selection. Statistics and Computing 4, 87-112 (1994).
40. Banzhaf, W., Nordin, P., Keller, R. E. & Francone, F. D. Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications. (Morgan Kaufmann Publishers Inc., 1998).
41. Koza, J. R. & Andre, D. in Evolutionary Computation: Theory and Applications 171-197 (World Scientific, 1999).
42. Widera, P., Garibaldi, J. M. & Krasnogor, N. GP challenge: evolving energy function for protein structure prediction. Genetic Programming and Evolvable Machines 11, 61-88 (2010).
43. Vyas, R. et al. Application of genetic programming (GP) formalism for building disease predictive models from protein-protein interactions (PPI) data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 15, 27-37 (2016).
44. Lehman, J. et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. arXiv preprint arXiv:1803.03453 (2018).
45. Koza, J. R. Human-competitive results produced by genetic programming. Genetic Programming and Evolvable Machines 11, 251-284 (2010).
46. Mironidou-Tzouveleki, M. & Dokos, C. Nature's drug store attacks again: the cone snail's pain relief toxin. Aristotle University Medical Journal 34, 23-30 (2007).
47. Vendrely, C. & Scheibel, T. Biotechnological production of spider-silk proteins enables new applications. Macromolecular Bioscience 7, 401-409 (2007).
48. Kim, M., Gillen, J., Landman, B. A., Zhou, J. & Van Zijl, P. C. Water saturation shift referencing (WASSR) for chemical exchange saturation transfer (CEST) experiments. Magnetic Resonance in Medicine 61, 1441-1450 (2009).
49. Gilad, A. A. et al. Functional and molecular mapping of uncoupling between vascular permeability and loss of vascular maturation in ovarian carcinoma xenografts: the role of stroma cells in tumor angiogenesis. International Journal of Cancer 117, 202-211 (2005).
50. Gilead, A., Meir, G. & Neeman, M. The role of angiogenesis, vascular maturation, regression and stroma infiltration in dormancy and growth of implanted MLS ovarian carcinoma spheroids. International Journal of Cancer 108, 524-531 (2004).
51. Wimley, W. C. & White, S. H. Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nature Structural Biology 3, 842-848, doi:10.1038/nsb1096-842 (1996).
52. Miralavy, I., Bricco, A., Gilad, A. & Banzhaf, W. Using Genetic Programming to Predict and Optimize Protein Function. arXiv preprint arXiv:2202.04039 (2022).
53. McKay, D. J., Renaux, B. S. & Dixon, G. H. Human sperm protamines: Amino-acid sequences of two forms of protamine P2. European Journal of Biochemistry 156, 5-8 (1986).
54. Oskolkov, N. et al. Biophysical characterization of human protamine-1 as a responsive CEST MR contrast agent. ACS Macro Letters 4, 34-38 (2015).
55. Bar-Shir, A., Liang, Y., Chan, K. W., Gilad, A. A. & Bulte, J. W. Supercharged green fluorescent proteins as bimodal reporter genes for CEST MRI and optical imaging. Chemical Communications 51, 4869-4871 (2015).
56. Grantham, R. Amino Acid Difference Formula to Help Explain Protein Evolution. Science 185, 862-864 (1974).
57. Fillion, A. J. et al. Development of a Synthetic Biosensor for Chemical Exchange MRI Utilizing In Silico Optimized Peptides. bioRxiv, 2023.03.08.531737 (2023).
60 Table 4: CESTide Properties Sequence Series Gen Mass Charge pI MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE APPENDIX A: CESTIDE PROPERTIES NQYSNWNKNYK NSSNHSNNMPCQ IRTYLRKRNSTQ GIFKTTKCKHNS SNHKMSECRGLR FNSNKITPTSNM VNSDPSNGQMRD LSNRRGREQYAG QTATENSQMNSG QTEHYENSARNS KDRTSKPKRPWC GRKRGAIWKDTK CCWHNPKWRRTR KYTKTRKQSSKA RGKMPLRWMTRK GNCPMKVCSPMG VNLPMVMPNLRM GPMPMNAKMKLC KVIRYVVAPMKL IKGMNIKMPTDQ MWQMKWTRKTRE HGRKWKRTKFDD DKRKIKQKMWWG RRMVNRTITRMW RKHHGWRWEQWK HWSTCTRTRTLS WWWKPKREDFMK 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 1 1458.54 2 1332.39 2 1535.76 2 1363.59 2 1417.62 2 1353.51 2 1319.36 2 1406.51 2 1267.29 2 1435.42 3 1501.76 3 1415.65 3 1642.92 3 1425.64 3 1559.96 3 1223.52 3 1414.81 3 1320.72 3 1416.82 3 1375.66 4 1681.00 4 1573.76 4 1603.94 4 1619.97 4 1733.94 4 1448.61 4 1737.04 2 0.2 9.93 7.36 4 12.23 3.2 10.51 9.82 2.2 9.69 1 -1 4.11 2 11.05 3.85 -1 -0.8 5.48 3.9 11.06 4 11.75 4.2 11.38 5 11.28 5 12.81 8.23 1 10.55 9.67 1.9 10.9 3 1 9.53 3 11.65 3.2 11.05 4 11.25 4 12.98 3.5 11.65 2.2 10.53 10.5 0.9 2 0.68% 4.69% 1.99% 2.35% 1.47% 1.85% 2.09% 2.34% 1.43% 2.18% 3.47% 3.99% 2.31% 3.72% 3.06% 3.63% 4.63% 4.89% 4.54% 4.64% 6.22% 5.64% 4.33% 6.17% 4.67% 0.00% 3.85% 0.3997 1.3671 0.7912 0.7630 0.5670 0.4688 0.4054 0.6653 0.3545 0.1127 1.6561 2.4314 1.1928 1.5194 2.2182 1.8346 1.0656 0.8821 1.4825 1.9309 1.7114 1.8044 0.9433 1.6217 1.2924 0.0000 1.1410 0.1036 0.2290 0.0973 0.1144 0.0718 0.0903 0.1021 0.1140 0.0697 0.1062 0.8601 0.9899 0.5717 0.6450 0.7589 0.9003 0.8024 0.8470 0.7865 0.8049 0.9701 0.8802 0.6756 0.9625 0.7282 0.0000 0.6001 5.3300 19.9573 10.0214 10.8835 7.7795 6.7376 5.9771 9.2007 5.4404 1.5273 21.4496 33.4077 14.1218 20.7299 27.6587 29.1656 14.6495 12.9911 20.3534 27.3020 19.8029 22.3019 11.4397 19.4717 14.4985 0.0000 15.3203 1.3821 3.3431 1.2319 1.6319 0.9858 1.2979 1.5051 1.5767 1.0699 1.4396 11.1401 13.6019 6.7689 8.7997 9.4622 14.3128 11.0312 12.4750 10.7975 11.3816 11.2255 10.8788 8.1935 11.5567 8.1692 0.0000 8.0583 2.64% 28.03% 16.22% 15.64% 11.62% 9.61% 8.31% 13.64% 7.27% 2.31% 6.68% 9.80% 4.81% 8.76% 8.94% 7.40% 6.15% 5.09% 8.55% 11.14% 10.97% 11.56% 6.05% 10.39% 8.28% 0.00% 7.31% 61 Table 4 (cont’d) Sequence HIKWRLTKGTRT WDRTSTRPSSVL KPWHGCASRTKR DKRKIKQKMWWG KKRLHWIRWHCG RKHHGWRWEQWK WFGLQRHLKKKD CHLKDLRKMGLR QRHDSHRHGLWL LELKLGKRPMGW GQRWLYKMKDSM LDHTWGKWGHQS DKVCKIQKRKWH WDWEQKKKWI ERQEEKIKKW SDGSKIKDRD SSDQDRDKWL LLRLLGLVER KEEVWLKWLI KGKLDKDRNL HDDKNKESDD QERRDDILWD KRIIEDDQLE VCNRIEPLKPIL LHSSQWLKVDHLL VINKVISNPCVN GNKKNWRWYKNR ICLKSQPICGID LWSDIKMKLKKT Series Gen Mass Charge pI MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 4 1496.77 4 1404.53 4 1426.66 5 1603.94 5 1619.95 5 1733.94 5 1555.83 5 1469.83 5 1541.68 5 1427.76 5 1542.83 5 1451.55 5 1568.90 6 1446.66 6 1373.56 6 1120.18 6 1249.29 6 1181.48 6 1343.62 6 1186.37 6 1202.15 6 1345.42 6 1258.38 7 1394.73 7 1575.82 7 1299.55 7 1649.86 7 1289.57 7 1490.86 10.78 4.2 12.53 1 10.45 4.2 11.65 4 11.35 4.4 11.65 3.5 11.65 3.2 11.08 3.2 1.7 10.45 2 10.79 2 10.75 7.72 0.5 10.8 4.2 9.45 1 9.45 1 6.51 0 4.31 -1 1 10.45 6.51 0 10.5 2 4.19 -2.8 4.06 -2 4.11 -2 8.55 0.9 7.72 0.5 8.55 0.9 5 11.76 6.07 10.8 -0.1 3 5.79% 3.70% 4.53% 2.13% 0.60% 1.47% 2.78% 3.04% 1.06% 1.60% 1.66% 1.66% 1.28% 2.28% 2.12% 0.85% 1.00% 0.40% 0.97% 1.99% 0.90% 0.69% 0.61% 1.23% 1.85% 0.72% 1.64% 2.30% 3.13% 1.4892 0.6208 1.0365 1.5834 0.6114 2.1189 
3.2462 2.7647 1.0017 3.2035 1.3625 1.4077 0.9342 2.3350 1.4733 0.4987 1.0840 0.1851 0.8278 1.5579 0.4578 0.1869 0.9340 1.5409 1.4376 0.5440 1.2538 1.9645 3.7870 9.54% 3.98% 6.64% 5.27% 2.03% 7.05% 10.80% 9.20% 3.33% 10.66% 4.53% 4.68% 3.11% 7.77% 4.90% 1.66% 3.61% 0.62% 2.62% 4.93% 1.45% 0.59% 2.95% 5.36% 5.00% 1.89% 4.36% 6.83% 11.90% 62 0.9039 0.5769 0.7064 11.7464 19.3531 7.9896 8.5967 9.6317 14.1323 0.640964 19.20307 7.773029 0.180985 7.341661 2.173127 0.441435 23.76972 4.951965 0.834161 40.58454 10.42875 0.912535 36.58821 12.07611 0.317516 12.63926 4.006044 0.481419 43.64305 6.558619 0.499469 17.17881 6.297021 0.49875 18.86464 6.683378 0.384813 11.58307 4.770885 9.1958 31.3955 9.0175 20.8631 4.4611 8.6591 4.6992 16.8780 1.9900 3.0482 4.4267 11.9835 10.3234 25.5433 4.6245 7.4068 3.1713 2.7023 2.9897 14.4364 4.9386 21.4892 6.5493 17.7453 3.0914 8.1418 5.5431 14.7815 9.9543 29.6311 13.0076 49.4092 0.6839 0.6368 0.2569 0.3018 0.1209 0.3058 0.6296 0.2858 0.2194 0.1934 0.3541 0.5306 0.2065 0.4702 0.6600 0.9970 Table 4 (cont’d) Sequence LWSDIKMKLKKT NWRDCLSLIVPN KMGKLIGIPVLK NDISMCNKNNNW VSLQCWELGPNK TVSEPVMMVSVS PRSWEVKEKETM PGGVRSNDLLEV PVNRLGKMSKNR VGSVKSGNLRMR TSKSKKRMTAKK ETNVRVKVVSES EPSNLPKGMNEK RLWNSGEGRGEN RPPMLNVVRVVG KWVVRPRIRRLL IGVLRSVKQTVR VINKVISNPCVN ETNVRVKVVSES RLPKRVQGNVEK GLGNQHVVVLGV KVRCLVEARPSW HLVVSPRVSWGC IIRSPICCVSRV DKRKIKQKMWWG RKHHGWRWEQWK EMRQWKWMWENA PIKQIAWPIIEH KMWDWEQKKKWI Series Gen Mass Charge pI MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 3.6 7 1490.86 7 1429.65 7 1296.71 7 1452.58 7 1373.58 8 1265.50 8 1519.73 8 1255.38 8 1399.67 8 1303.54 8 1393.71 8 1346.49 8 1343.51 8 1374.42 9 1336.66 9 1592.00 9 1355.64 9 1299.55 9 1346.49 9 1423.67 9 1191.39 9 1443.72 9 1339.57 9 1345.68 10 1603.94 10 1733.94 10 1694.94 10 1444.73 10 1706.03 10.8 3 6.18 -0.1 11.1 3 6.18 -0.1 6.18 -0.1 3.85 -1 6.62 0 -1 4.19 4 12.53 3 12.51 6 12.06 6.51 0 6.51 0 6.51 0 2 12.5 5 12.98 3 12.51 8.55 0.9 0 6.51 3 11.65 7.55 9.82 8.55 8.83 4 11.25 3.5 11.65 6.51 7.55 2 10.33 0.2 1.9 1.2 1.9 0 0.2 3.13% 0.46% 3.18% 0.69% 0.86% 1.55% 2.42% 2.26% 2.68% 1.90% 2.94% 0.92% 1.10% 1.25% 0.64% 1.07% 1.68% 0.64% 0.63% 1.32% 0.50% 0.78% 0.76% 0.72% 2.02% 1.37% 0.94% 1.20% 2.35% 3.7870 0.2337 3.1918 0.6580 1.1072 0.4845 1.4242 0.7658 2.2014 1.7579 2.1357 0.3946 1.7061 0.8660 0.4524 1.2228 2.0159 0.5951 0.1792 2.2428 0.2155 0.6087 0.3656 0.8920 1.5780 1.8767 0.4802 1.0018 2.1270 0.9970 0.1456 1.0104 0.2180 0.2751 0.3811 0.5959 0.5569 0.6583 0.5361 0.8311 0.2602 0.3107 0.3547 0.1865 0.3120 0.4893 0.1873 0.1841 0.4009 0.1521 0.2373 0.2306 0.2199 0.6349 0.4311 0.2945 0.3757 0.7384 49.4092 3.1803 47.8780 8.8105 15.6792 7.4473 18.2290 11.8653 30.5927 26.2309 29.8071 5.7005 24.7003 12.2564 6.5834 14.9403 28.9254 8.9079 2.5889 30.6422 3.5190 8.2005 5.3092 12.8927 19.1361 21.0528 5.5107 13.4878 24.2503 13.0076 1.9811 15.1564 2.9197 3.8963 5.8584 7.6270 8.6293 9.1487 7.9991 11.5992 3.7595 4.4982 5.0203 2.7145 3.8123 7.0210 2.8029 2.6591 5.4776 2.4829 3.1976 3.3481 3.1790 7.6999 4.8359 3.3791 5.0584 8.4193 11.90% 0.73% 10.03% 2.07% 3.48% 1.97% 5.79% 3.11% 8.95% 6.22% 7.55% 1.40% 6.03% 3.06% 1.55% 4.19% 6.91% 2.04% 0.61% 7.37% 0.71% 2.00% 1.20% 2.93% 5.03% 5.98% 1.53% 3.19% 6.78% 63 Table 4 (cont’d) Sequence ARNRKKIMMRWI NAPWKHWRIINE NKQRRMLSRERS LSQQPRKRATWR IRRWNDRIRITS RRCQAQEFWLGA GLIEARAMQQCC QCRAGAMPAMYV NFLRAQRQCQKQ MAMADAAAPMNA AQCCQHRKGYMN MAALLYQHRLARR KPCKWAGRACAK CQLAWRPCAKAS 
QCSGWVQKRQIQ NRVTESVRNVKM NVVVQRRNHHTS VINKVISCPCVN GGRVWEWNVAA NNKCQVVAAFVM VPNIQVKGSK PVARKVVQICHP VTRMTIQVKGSK MAMADAAAPMNA MKVAAAMAPKQV PVVYKTVIQCCD KVLWRMPAQIIQ VSVVATGCVWET AKCKVQSANVCK Series Gen Mass Charge pI MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE 3.6 3.6 3.6 3.6 3.6 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 10 1603.02 10 1563.77 10 1560.79 10 1526.75 10 1585.82 1 1464.66 1 1322.58 1 1297.58 1 1519.74 1 1164.38 1 1164.38 1 1598.92 1 1318.62 1 1333.59 1 1460.67 2 1432.66 2 1446.58 2 1288.59 2 1244.36 2 1323.59 3 1069.26 3 1346.65 3 1347.63 3 1164.38 3 1244.58 3 1367.64 3 1482.84 3 1250.43 3 1278.55 1.2 5 12.81 9.69 4 12.51 4 12.81 12.5 3 8.55 0.9 6.23 -0.1 0.9 8.53 2.9 11.48 3.75 -1 2.2 3.75 3.2 12.21 3.9 10.51 8.83 1.9 9.84 1.9 2 11.48 12.5 8.23 6.34 8.55 10.8 9.48 3 11.82 3.75 -1 10.8 2 6.07 -0.1 2 11.66 3.85 9.67 2.5 0.9 0 0.9 2 2.2 -1.1 2.9 2.3923 2.83% 0.7157 0.78% 2.2884 2.15% 0.9899 0.89% 1.1343 1.17% 0.7700 1.40% 0.3218 1.68% 1.1436 2.63% 1.4000 1.59% 0.3734 1.24% 1.2996 2.28% 0.2748 1.20% 1.5767 1.78% 1.8239 2.06% 2.88% 1.9930 0.78% 0.316758 1.44% 2.386539 1.56% 0.782529 1.03% 0.613375 2.1930 1.03% 0.2746 0.87% 1.2888 1.20% 2.0923 2.12% 0.4173 0.57% 2.4757 2.13% 0.2867 0.54% 1.0557 0.74% 0.9526 0.57% 2.5923 2.06% 7.75% 2.32% 7.41% 3.21% 3.67% 3.05% 1.27% 4.53% 5.54% 1.48% 5.15% 1.04% 5.98% 6.92% 7.56% 0.93% 6.99% 2.29% 1.80% 1.80% 0.95% 4.44% 7.20% 1.44% 8.52% 0.83% 3.07% 2.77% 7.54% 64 0.8744 0.2423 0.6651 0.2755 0.3612 0.3542 0.4237 0.6639 0.4017 0.3140 0.5770 0.3165 0.4698 0.5433 0.7590 10.6099 29.0288 3.0134 8.9019 8.2886 28.5194 3.5103 12.6112 4.4307 13.9124 4.7036 10.2256 6.2316 4.7330 9.9518 17.1431 5.1412 17.9181 5.2457 6.2378 9.6385 21.7106 3.8498 3.3429 6.9305 23.2581 7.9241 26.6024 10.1078 26.5398 0.265939 4.300599 3.61064 0.492561 32.09008 6.623116 0.532269 11.81219 8.034538 5.49304 0.35141 9.587929 18.4637 32.2278 4.5740 4.9961 5.0211 18.6148 8.9059 30.1993 2.7492 6.9711 9.6601 38.6914 2.6585 4.0781 3.3434 13.8476 3.0629 14.8188 10.7740 39.4381 1.2564 0.2514 0.3476 0.6170 0.1646 0.6181 0.1869 0.2549 0.1969 0.7082 Table 4 (cont’d) Sequence VAWVMKAHVCTM EAPMPKVNVIVN VPKRLVVVMNTC VINKVISNPCVN PLPDNVVKAVVW RPPMLNVVRVVG LTFNLVSDKVVV TNVLCVPEMVGV TFPANPNVKVTV NSWDSVPNPRVV PPVVNECCPHSV PGPPAGPTLSNR NRSMPLPMNPGD ANNNPWSGMMNG NSPQLNLNPPQS NAARMNRLNDAM CATNAKNNNMRG TTWPNNRSQWP NQHTPPKPLNNA NKAKACNGMKPR CKPARTDWPPMP PVARKVVQICHP DAPSFAKVVPNR PLPDNVVKAVVW RPPMLNVVRVVG EAPMPKVNVIVN KVNFNKAVSNLK LVKPMNKRWVRM VPKDKRKKLTNP Series Gen Mass Charge pI MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 3 1375.73 4 1310.57 4 1358.72 4 1299.55 4 1336.58 4 1336.66 4 1333.58 4 1260.53 4 1286.48 4 1369.49 4 1280.48 5 1163.29 5 1328.52 5 1292.41 5 1308.40 5 1376.57 5 1293.44 5 1386.48 5 1330.46 5 1317.59 5 1398.66 6 1346.65 6 1300.47 6 1336.58 6 1336.66 6 1310.57 6 1361.60 6 1557.98 6 1423.71 1.2 0 1.9 0.9 0 2 0 -1.1 1 0 -0.8 8.56 6.34 9.84 8.55 6.23 12.5 6.23 3.85 9.69 6.23 5.35 1 10.55 6.23 0 6.02 0 0 6.02 1 10.45 9.84 1 10.55 1.2 9.69 3.9 11.08 8.55 0.9 9.84 2.2 9.71 1 6.23 0 12.5 2 6.34 0 11.1 3 4 12.53 4 11.25 1.9 0.51% 0.3273 0.41% 0.274349 0.53% 0.308921 0.64% 0.578747 0.85% 1.320008 0.49% 0.498041 0.6720 1.06% 1.0539 1.37% 2.1917 1.16% 1.1243 0.81% 1.1448 1.47% 1.0934 1.37% 0.9028 1.30% 1.0282 1.56% 1.2728 1.75% 0.8490 1.01% 1.0891 1.23% 1.2405 0.99% 2.1977 2.20% 1.4193 1.72% 1.0205 1.43% 1.6667 1.19% 0.9633 0.78% 2.5585 1.86% 0.8099 1.06% 1.0341 0.95% 3.3259 2.90% 1.0837 1.04% 1.3678 1.71% 0.95% 0.74% 0.83% 
1.56% 3.57% 1.35% 2.11% 3.30% 6.87% 3.52% 3.59% 4.01% 3.31% 3.77% 4.66% 3.11% 3.61% 4.11% 7.27% 4.70% 3.38% 4.24% 2.45% 6.51% 2.06% 2.63% 9.68% 3.15% 3.98% 65 4.6271 0.1765 2.4957 0.150771 4.071816 2.237699 0.197219 4.422446 2.823342 0.236603 8.662444 3.541376 0.314956 19.20996 4.58353 0.182992 7.247509 2.662906 4.9172 6.7245 5.5918 3.6585 7.1409 6.2517 5.2058 6.4084 7.0986 3.9130 5.5788 4.2046 9.7337 7.6752 6.0038 6.7461 4.5663 10.6487 6.0843 5.5189 14.2310 4.4677 8.0437 9.8021 16.2628 33.1381 15.9681 17.3907 18.2825 13.2179 15.4752 18.9221 11.9968 16.3789 17.4028 32.1296 20.9532 14.1924 24.0737 14.4078 37.2337 11.7854 15.3478 47.5116 13.5299 18.6873 0.3371 0.4358 0.3698 0.2576 0.4701 0.3739 0.3556 0.4258 0.4775 0.2769 0.3710 0.2997 0.6658 0.5199 0.4317 0.4670 0.3053 0.7317 0.4181 0.3719 0.9962 0.3578 0.5888 Table 4 (cont’d) Sequence VPKRLVVVMNTC AACPNLAVAAPM RDPDKNNRDWPI WLLDLRRQMDDT QLGVHRNPRWNQ LHKDRRKPLPLP HLRLNKNRMQKS LDDDTPDRNRDW VGQNNDERRRQR HQWHRHRPPIRR DNPHQLLPVNWR RWHHVNVLWQDR KPWWMTTQVDVP NTTWPNNRSQWP KMPPKLNCPGVW PVPWLQTGPTNC VVPGLCRVKTQA WKKYNVQPPKVW CSWNVGDNPPVD PNVRMNKQSWQP GWTPPLVVNDTL AVQLKPPSPNDN IVMCNLNVQVRD TISDIVKVVIRS VIDMSDMVVEIV VITNNWALAQVR MLMKVPVIRRVW VARVDIDLQAIA DCLAVPPNAVTI Series Gen Mass Charge pI MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 6 1358.72 6 1128.37 7 1525.63 7 1561.77 7 1504.66 7 1469.78 7 1524.80 7 1517.52 7 1527.61 7 1675.91 7 1488.66 7 1645.83 8 1487.73 8 1500.58 8 1369.70 8 1312.50 8 1270.55 8 1572.86 8 1302.38 8 1484.69 8 1311.49 8 1279.40 9 1403.68 9 1329.59 9 1349.62 9 1384.59 9 1527.99 9 1283.48 9 1212.42 9.84 1.9 6.02 -0.1 6.51 0 4.31 -1 2.2 12.5 3.2 11.65 4.2 12.53 -3 3.85 2 12.02 4.7 12.98 0.2 7.55 1.5 10.45 0 6.23 1 10.55 9.67 6.02 9.84 3 10.64 3.49 2 11.66 3.75 -1 6.23 0 6.18 -0.1 9.71 1 -3 3.38 1 10.55 3 12.51 4.11 -1 3.75 -1.1 1.9 -0.1 1.9 -2.1 0.96% 0.85% 2.43% 1.17% 2.05% 1.76% 3.28% 0.96% 0.89% 0.87% 1.24% 0.86% 0.51% 1.66% 1.51% 0.87% 1.25% 1.14% 0.95% 3.34% 0.75% 1.50% 0.77% 1.05% 1.20% 0.93% 1.08% 0.60% 1.14% 0.3274 0.5062 1.3125 0.5568 1.2639 0.6919 1.7866 0.2999 0.4303 0.3976 0.5860 0.2484 0.1746 0.9058 0.7186 0.2810 0.8197 0.7118 0.4739 1.5277 0.2383 0.6662 0.1658 0.3957 0.2773 0.1964 0.6206 0.1070 0.6178 0.3314 0.2931 0.4653 0.2248 0.3924 0.3373 0.6282 0.1908 0.1770 0.1738 0.2467 0.1718 0.0861 0.2820 0.2556 0.1470 0.2118 0.1893 0.1571 0.5545 0.1254 0.2495 0.1558 0.2121 0.2424 0.1867 0.2178 0.1206 0.2287 4.6876 8.7267 16.7336 6.9350 16.3390 9.1566 22.7905 3.8435 5.4785 4.6152 7.6571 2.9358 2.2833 11.7415 10.2048 4.1649 12.5488 8.8021 7.0779 20.0150 3.5338 10.1289 2.2970 5.7882 3.9963 2.7587 7.9005 1.6214 9.9122 4.7443 5.0518 5.9324 2.7995 5.0730 4.4643 8.0131 2.4451 2.2541 2.0166 3.2236 2.0302 1.1255 3.6550 3.6293 2.1789 3.2418 2.3413 2.3468 7.2645 1.8603 3.7936 2.1585 3.1025 3.4932 2.6231 2.7725 1.8275 3.6693 0.95% 1.47% 6.85% 2.91% 6.60% 3.61% 9.33% 1.51% 2.16% 2.00% 2.95% 1.25% 1.03% 5.35% 4.24% 1.66% 4.84% 4.28% 2.85% 9.19% 1.43% 4.01% 0.82% 1.97% 1.38% 0.98% 3.08% 0.53% 3.07% 66 Table 4 (cont’d) Sequence VVLVIADMCSAN PLNIAISNSPDS NWRDCLSLIVPN Notes on data: Series Gen Mass 5 5 5 9 1234.49 9 1227.33 9 1429.65 Charge pI -1.1 -1 -0.1 3.75 3.75 6.18 MTR @3.6 MTR @5 3.6/K12 5.0/K12 3.6 PTE 5.0 PTE 1.74% 0.77% 2.37% 0.72% 1.08% 1.01% 0.3494 0.1558 0.4776 0.1450 0.2176 0.2033 5.5049 2.4694 6.4986 2.2843 3.4490 2.7662 The PTE values are calculated through the contrast after it’s normalized contrast to K12. Where K12’s PTE is 12.5 *1000. 
This is maintained for consistency with our collaborators and with the literature values that formed the initial training data for the 3.6 ppm experiments. Data is provided for every peptide for which I can find the experiment where it was measured. When possible I use the experiment or experiments that were used to develop the learning data for POET. The first two generations were thus measured by my collaborator Dr. Michael McMahon at Johns Hopkins University on a 9T scanner, instead of the 7T scanners that I used to perform my experiments. Contrast values are therefore relatively high in the first two generations, although the contrast relative to K12 is low. PTE values are all divided by 1000 to maintain consistency with the original literature values.

APPENDIX B: POET CODE

POET was written for use in Python 3.6.4. This is the version of POET that I used in the experiments detailed in the main body of this dissertation, presented should a reader wish to reproduce my work or examine the code closely to answer a question. This version of the code is no longer current and contains several known bugs. For an updated version of POET please contact me and I can direct you to an appropriate GitHub repository.

POET

# POET: Protein Optimization Enhancing Tool
# Authors: Iliya "eLeMeNOhPi" Alavy - Department of Computer Science and Engineering - Michigan State University
#          Alexander Bricco - Department of Bioengineering - Michigan State University
# All Rights Reserved @ Michigan State University

import argparse
import pandas as pd
import pop as Population
import settings
import optimizer as Optimizer
import predictor as P
import individual as I
import math
import fitness as F
import os
import random as R
from subprocess import call
import archivist as Archivist

def main():
    print ("\n\n######################################################\n")
    print ("POET V3.0 \n")
    print ("######################################################\n")
    print ("Configuring the application...\n")

    # Argument descriptions
    parser = argparse.ArgumentParser(description='Finds a model to predict fitness value of any given protein sequence. Fitness can be manually defined to any protein characteristic but our main goal is to predict CEST ability of proteins')
    parser.add_argument('-learn', default=settings.default_learn, help='Path to the learn data (format: csv, default: ' + settings.default_learn + ')')
    parser.add_argument('-translation', default=settings.TT, help='Path to the translation table (format: csv, default: ' + settings.TT + ')')
    parser.add_argument('-pop', help='Path to the initial population files. If not specified, this application uses random initial population as default (format: csv)')
    parser.add_argument('-model', help='Path to a generated model to determine a given protein\'s fitness. You will need to use -seq after this command')
    parser.add_argument('-seq', help='A sequence to be tested using an already specified model. This command only runs if it\'s used jointly with the -model command')
    parser.add_argument('-predict', help='Number of potential protein sequences you want the program to predict. Must be jointly used with -seqsize and -iter')
    parser.add_argument('-seqsize', help='Size of the protein sequences for prediction')
    parser.add_argument('-iter', help='Number of iterations to predict/find potential proteins')
    parser.add_argument('-hpcc', help="Number of replications you need to queue on hpcc.
Uses the default config file") parser.add_argument('-o', help="Output file name") parser.add_argument('-r', default=settings.runs, help='Number of GP iterations to find a model.') parser.add_argument('-seed', help='The random seed') parser.add_argument('-f', help="Gets path to a model as it's input and returns the fitness of it") #ToDo:: Code this part. parser.add_argument('-c', nargs='*', help="Compares the fitness of all given models") parser.add_argument('-al', nargs='*', help="Computes the average length of all given models") parser.add_argument('-archive', nargs='*', help='Setups the default output directories if necessary and archives existing files/results') args = parser.parse_args() # Read the tables settings.learn_df = pd.read_csv(args.learn) settings.TT = pd.read_csv(args.translation) arch = Archivist.Archivist() if args.archive != None: arch.setup(True) print("Archiving completed Successfully!") 69 exit(1) # Setting up the random seed if args.seed != None: settings.seed = int(args.seed) R.seed(settings.seed) # total number of runs if args.r != None: settings.runs = int(args.r) # output file name if args.o != None: settings.output_file_name = args.o # Should be after initializing the output file name ^^^ arch.setup() # run on hpcc if args.hpcc != None: hpcc(int(args.hpcc), 13, 10000, 300) exit(1) # get the fitness of a given model if args.f != None: modelFitness(args.f) # compare a bunch of models if args.c != None: compareModels(args.c) # calculate the average length of the models if args.al != None: averageLength(args.al) # Make a dictionary out of the translation table for i, row in settings.TT.iterrows(): settings.dic[row[0]] = row[1] # Translate the data using the translation table in case we are not using a numeric translation table #if not is_numeric(): #for i, row in settings.learn_df.iterrows(): # translatedSeq = "" 70 #origSeq = row[0] #for j in range(len(origSeq)): # try: # except IndexError: translatedSeq += settings.dic[origSeq[j]] print("Could not find all the required data in the # # Translation Table. 
Exiting...") # #settings.learn_df.iloc[i, 0] = translatedSeq exit(0) # ToDo:: I never checked the prediction part if args.predict != None: if args.seqsize == None or args.iter == None: raise("Prediction mode needs -seqsize and -iter in order to work") else: predictor = P.Predictor(int(args.predict), int(args.seqsize)+1, int(args.iter)) predictor.predict() return elif args.pop == None and args.seq == None and args.model == None: pop = Population.Population() opt = Optimizer.Optimizer(pop) opt.optimize() elif args.seq != None and args.model != None: proteinSeq = args.seq modelPath = args.model raise("Invalid arguements.") pass else: return def is_numeric(): codes = settings.TT['code'] for i in range(codes.size): try: float(codes[i]) except ValueError: return False return True def modelFitness(path): model = I.Individual() model.makeFromFile(path) f = F.Fitness() fitness = f.measureTotal(model) 71 print("Pro-Predictor: Fitness (RMSE) of {}: {}".format(path, fitness)) exit(1) def compareModels(paths): model = I.Individual() avg = 0 best = 100000 bestModel = '' for path in paths: model.makeFromFile(path) f = F.Fitness() fitness = f.measureTotal(model) avg += fitness print("Pro-Predictor: Fitness (RMSE) of {}: {}".format(path, fitness)) if fitness < best: bestModel = path best = fitness avg /= len(paths) print("Pro-Predictor: Best model: {} with RMSE: {} Average RMSE: {}".format(bestModel, best, avg)) exit(1) def averageLength(paths): model = I.Individual() for index, path in enumerate(paths): model.makeFromFile(path) f = F.Fitness() fitness = f.measureTotal(model) sumL = 0 counter = 0 for rule in model.rules: if rule.status: try: if math.isnan(rule.pattern): continue except: pass sumL += len(rule.pattern) counter += 1 if counter == 0: print("Pro-Predictor: {}: Old Model. 
Old Models are Not Supported".format(path)) continue avg = sumL/counter 72 print("Pro-Predictor: {}: Fitness (RMSE): {} Average Rule Length: {}".format(path, fitness, avg)) exit(1) def hpcc(reps, hours, runs, startingSeed): for i in range(reps): filename = "subs/{}.sb".format(i) file = open(filename, "w") file.write("#!/bin/bash --login\n") file.write("\n########## SBATCH Lines for Resource Request ##########\n\n") file.write("#SBATCH --time={}:02:00 # limit of wall clock time - how long the job will run (same as -t)\n".format(hours)) file.write("#SBATCH --nodes=1-5 # number of different nodes - could be an exact number or a range of nodes (same as -N)\n") file.write("#SBATCH --ntasks=5 # number of tasks - how many tasks (nodes) that you require (same as -n)\n") file.write("#SBATCH --cpus-per-task=2 # number of CPUs (or cores) per task (same as -c)\n") file.write("#SBATCH --mem-per-cpu=2G # memory required per allocated CPU (or core) - amount of memory (in bytes)\n") file.write("#SBATCH --job-name POET_rep_{} # you can give your job a name for easier identification (same as -J)\n".format(i)) file.write("\n########## Command Lines to Run ##########\n\n") file.write("module pandas\n") # file.write("module load GCC/6.4.0-2.28 OpenMPI ### load necessary modules, e.g\n") file.write("cd ~/POET\n") file.write("srun -n 5 python poet.py -r {} -o {} -seed {}\n".format(runs, i, i+startingSeed)) file.write("cd batch\n") file.write("scontrol show job $SLURM_JOB_ID ### Optimizer # Authors: Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University Alexander Bricco - Department of Bioengineering - Michigan State University # import settings import pop as Population import fitness as F import pandas as pd import archivist as Archivist import copy import random as R import individual as I 73 import rule as Rule import time class Optimizer: def __init__(self, population): self.TT = settings.TT self.codes = self.TT['code'] self.P = population self.runs = settings.runs self.tournamentSize = settings.tournament_size self.logInterval = settings.pop_log_interval self.crossRate = settings.cross_rate self.ruleSize = settings.rule_size self.ruleCount = settings.maximum_rule_count self.minWeight = settings.rule_weight_min self.maxWeight = settings.rule_weight_max self.output_file_name = settings.output_file_name self.mAR = settings.mut_add_rule self.mRR = settings.mut_remove_rule self.mCW = settings.mut_change_weight self.mATP = settings.mut_add_to_pattern self.mRFP = settings.mut_remove_from_pattern self.mCWmin = 0 self.mCWmax = 1 pass def optimize(self): # We need an instance of the Fitness and Archivist classes for later usage fitness = F.Fitness() arch = Archivist.Archivist() # ToDo:: # raise("Add dynamic mutation rates to escape from premature convergence.\nAdd Remove duplicate rules to make space for the new rules without changing the fitness") # growthLog = 0 # mutRates = [self.mAR, self.mRR, self.mCW, self.mATP, self.mRFP, self.mCWmin, self.mCWmax] # For all the generations for i in range(self.runs): # To log the time start_time = int(round(time.time() * 1000)) 74 # Measure fitness avgFitness = 0.0 avgRuleCount = 0.0 avgUsedRulesCount = 0.0 # To calculate the best fitness # Temporarily set the first member as the best member self.P.pop[0].fitness = fitness.measureTotal(self.P.pop[0]) bestIndividual = self.P.pop[0] for j in self.P.pop: j.fitness = fitness.measureTotal(j) avgFitness += j.fitness avgRuleCount += len(j.rules) avgUsedRulesCount += j.usedRulesCount if j.fitness < 
bestIndividual.fitness: bestIndividual = j # To calculate the average fitness and the average avgFitness = round(avgFitness / float(len(self.P.pop)), 3) avgRuleCount = round(avgRuleCount / float(len(self.P.pop)), 0) avgUsedRulesCount = round(avgUsedRulesCount / float(len(self.P.pop)), 0) bestFitness = round(bestIndividual.fitness, 3) bestRuleCount = len(bestIndividual.rules) bestUsedRulesCount = bestIndividual.usedRulesCount # Log the outcome before doing the changes to the population / generating a new population log_string = "{}: -b {} -rc {} -urc {} ||| -a {} -arc {} -aurc {}".format(i, bestFitness, bestRuleCount, bestUsedRulesCount, avgFitness, avgRuleCount, avgUsedRulesCount) # Print the evolutionary log print(log_string) # Dynamic Mutation # If the fitness is not growing every 25 generations, increase the rates # if i%25 == 0 and bestIndividual.fitness == growthLog: # # # # print("Mutation rate increasing") self.mAR += 0.05 self.mRR += 0.1 self.mCW += 0.2 75 self.mATP += 0.1 self.mRFP += 0.1 self.mCWmin = 0 self.mCWmax = 0.1 if self.mCW >= 0.6: # # # # # self.mAR = mutRates[0] # self.mRR = mutRates[1] # self.mCW = mutRates[2] # self.mATP = mutRates[3] # # self.mRFP = mutRates[4] # elif i%25 == 0 and growthLog != bestIndividual.fitness: # # # # # # # # # print("Mutation rate reset") self.mAR = mutRates[0] self.mRR = mutRates[1] self.mCW = mutRates[2] self.mATP = mutRates[3] self.mRFP = mutRates[4] self.mCWmin = mutRates[5] self.mCWmax = mutRates[6] growthLog = bestIndividual.fitness # Log the evolution arch.saveEvo(log_string) # Create a copy of the population newPop = copy.deepcopy(self.P) newPop.pop.clear() # Note: This is the code for storing the population with an interval of generations. I temporarily disabled this feature to see if it needs any further developments. Right now I'm only concerned about the core. index = 0 for individual in self.P.pop: # # Check if we need to store the data # if i % self.logInterval == 0 and self.logInterval > 0: # # # # # data = [] for rule in individual.rules: data.append([rule.pattern, rule.weight, rule.status]) # # # Store the data df = pd.DataFrame(data,columns=['pattern','weight','used']) 76 str(100+index)) # # arc.saveCSV(df, self.outputPath + "/" + str(i), index += 1 # Elitism newPop.pop.append(bestIndividual) # Save the best model data = [] for rule in bestIndividual.rules: data.append([rule.pattern, rule.weight, rule.status]) df = pd.DataFrame(data,columns=['pattern','weight','status']) arch.saveModel(df) # Select Parents (Tournament Selection) and Crossover for k in range(len(self.P.pop)-1): tournament = [] offspring = I.Individual() tournament as we need len(self.P.pop)-1)]) for j in range(self.tournamentSize): # Randomly append as much individuals to the tournament.append(self.P.pop[R.randint(0, tournament = self.bubbleSortTournament(tournament) # We got two best parents parentA = copy.deepcopy(tournament[0]) parentB = copy.deepcopy(tournament[1]) # Do the crossover magic - Cluster crossover # Efficiency thing. 
Find the greater rule length lenA = len(parentA.rules) lenB = len(parentB.rules) maxLen = lenA if lenA < lenB: maxLen = lenB # we keep track of the rules we want to add to the offspring rules = [] for j in range(maxLen): if j < lenA: ruleA = parentA.rules[j] 77 rules some chance to get selected if ruleA.status == 1: rules.append(ruleA) elif R.random() < self.crossRate: # we give unused rules.append(ruleA) if j < lenB: ruleB = parentB.rules[j] if ruleB.status == 1: rules.append(ruleB) rules some chance to get selected elif R.random() < self.crossRate: # we give unused rules.append(ruleB) offspring.rules = rules offspring.bubbleSort() count # Resize the offspring so it doesn't exceed the maximum allowed while len(offspring.rules) > self.ruleCount: countGreens = 0 for index in range(len(offspring.rules)-1, -1, -1): if(countGreens >= index): del(offspring.rules[index]) break else: if(offspring.rules[index].status == 0): del(offspring.rules[index]) break else: countGreens += 1 newPop.pop.append(offspring) self.P.pop = copy.deepcopy(newPop.pop) # Mutations # We keep a copy of the elite elite = copy.deepcopy(self.P.pop[0]) for indv in self.P.pop: 78 # On Model if R.random() <= self.mAR: # add rule self.mut_add_rule(indv) if R.random() <= self.mRR: # remove rule self.mut_remove_rule(indv) # On Rule for rule in indv.rules: if R.random() <= self.mCW: # change weight self.mut_change_weight(rule) if R.random() <= self.mATP: # add to pattern self.mut_add_to_pattern(rule) if R.random() <= self.mRFP: # remove from pattern self.mut_remove_from_pattern(rule) if rule.pattern == "": indv.rules.remove(rule) indv.bubbleSort() zeroFitness = fitness.measureTotal(self.P.pop[0]) # Check if elite got worse if elite.fitness < zeroFitness: self.P.pop[0] = elite # self.P.pop[0].print() # for indv in self.P.pop: # # # # print(len(indv.rules)) self.removeExtra(indv) # print("after: {}".format(len(indv.rules))) # self.P.pop[0].print() # To Log the time end_time = int(round(time.time() * 1000)) - start_time # print(end_time) pass 79 def bubbleSortTournament(self, t): n = len(t) # Traverse through all array elements for i in range(n): # Last i elements are already in place for j in range(0, n-i-1): # traverse the array from 0 to n-i-1 # Swap if the element found is greater # than the next element if t[j].fitness > t[j+1].fitness: t[j], t[j+1] = t[j+1], t[j] return t # Add a random rule mutation def mut_add_rule(self, individual): if len(individual.rules) >= self.ruleCount: return pattern = "" weight = round(R.uniform(self.minWeight, self.maxWeight), 2) # Add these many rules for i in range(R.randint(1, self.ruleSize)): combination of codes with a specified size # Rule size is calculated randomly, and now we need to select a random code = self.codes[R.randint(0, self.codes.size - 1)] randomchar = code[R.randint(0, (len(code) - 1))] pattern += randomchar rule = Rule.Rule(pattern, weight,0) individual.rules.append(rule) # Remove rule mutation def mut_remove_rule(self, individual): # print("mrr") fitness = F.Fitness() if len(individual.rules) == 0: return tempRand = R.randint(0, len(individual.rules)-1) del(individual.rules[tempRand]) # Add to weight mutation def mut_change_weight(self, rule): optRand = R.randint(0, 1) weightRand = round(R.uniform(self.mCWmin, self.mCWmax), 2) if optRand == 0: # Addition 80 rule.weight += weightRand elif optRand == 1: # Substraction rule.weight -= weightRand pass # Alter patterns mutation (add letter) def mut_add_to_pattern(self, rule): if len(rule.pattern) >= self.ruleSize: return pattern = 
rule.pattern code = self.codes[R.randint(0, self.codes.size - 1)] randomchar = code[R.randint(0, (len(code) - 1))] if len(pattern) == 0: pattern = randomchar else: insPos = R.randint(0, len(pattern)) pattern = pattern[0:insPos] + randomchar + pattern[insPos:(len(pattern))] rule.pattern = pattern # Alter patterns mutation (remove letter) def mut_remove_from_pattern(self, rule): if len(rule.pattern) == 0: return if len(rule.pattern) == 1: rule.pattern = "" return pattern = rule.pattern insPos = R.randint(0, len(pattern)-1) pattern = pattern[0:insPos] + pattern[insPos+1:(len(pattern))] rule.pattern = pattern def removeExtra(self, indv): # removeList = [] for j, rule in enumerate(indv.rules): for k in range(j+1, len(indv.rules)): if indv.rules[k].pattern == indv.rules[j].pattern: # removeList.append(indv.rules[j]) if indv.rules[j].status == 0: indv.rules.remove(indv.rules[j]) j -= 1 elif indv.rules[k].status == 0: indv.rules.remove(indv.rules[k]) k -= 1 81 break # for i in removeList: # indv.rules.remove(i) Settings # Authors: Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University Alexander Bricco - Department of Bioengineering - Michigan State University # import configparser import random as R global rule_size, TT, default_learn, default_unseen, population_size, alphabet_size, maximum_rules_count, rule_value_min, rule_value_max config = configparser.ConfigParser() config.read('config.ini') # Parameter Initialization # LEARN default_learn = config['LEARN']['learn_data'] default_unseen = config['LEARN']['unseen_data'] # PREDICTOR prediction_model = config['PREDICTOR']['prediction_model'] # TRANSLATION TT = "data/translation/" + config['TRANSLATION']['translation_table'] dic = {} alphabet_size = int(config['TRANSLATION']['alphabet_size']) # GENETIC PROGRAMMING population_size = int(config['GP']['population_size']) rule_size = int(config['GP']['maximum_rule_size']) maximum_rule_count = int(config['GP']['maximum_rule_count']) rule_weight_min = float(config['GP']['rule_weight_min']) rule_weight_max = float(config['GP']['rule_weight_max']) runs = int(config['GP']['runs']) pattern_mode = int(config['GP']['pattern_mode']) tournament_size = int(config['GP']['tournament_size']) cross_rate = float(config['GP']['crossover_unused_selection_chance']) # output_name 82 pop_log_interval = int(config['OUTPUT']['pop_log_interval']) output_file_name = config['OUTPUT']['output_name'] # GENERAL seed = int(config['GENERAL']['seed']) debug = int(config['GENERAL']['debug']) # MUTATION mut_add_rule = float(config['MUTATION']['mut_add_rule']) mut_remove_rule = float(config['MUTATION']['mut_remove_rule']) mut_change_weight = float(config['MUTATION']['mut_change_weight']) mut_add_to_pattern = float(config['MUTATION']['mut_add_to_pattern']) mut_remove_from_pattern = float(config['MUTATION']['mut_remove_from_pattern']) Pop # Authors: Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University Alexander Bricco - Department of Bioengineering - Michigan State University # import individual as Individual import settings class Population: # constructor for random initialization def __init__(self): self.size = settings.population_size self.TT = settings.TT self.pop = [] print ("Initializing a population with size of " +str(self.size) + "...\n") if self.is_numeric(): print ("Translation Table supports the NUMERIC mode - Generating self.populate_formulas() else: print ("Translation Table supports the PATTERN mode - Generating formula...\n") rules...\n") self.populate_rules() # 
Randomly initializes the population with rules def populate_rules(self): for i in range(self.size): indv = Individual.Individual() indv.init_pattern() 83 self.pop.append(indv) # Randomly initializes the population with formulas def populate_formulas(self): raise("This feature is not coded yet.") for i in range(self.size): indv = Individual.Individual() indv.init_formula() self.pop.append(indv) # Uses the input files to create the population def populate_preset(self, population): raise ValueError('Uncharted terretories... Exiting') pass # Returns True if all the codes are numeric, otherwise returns False def is_numeric(self): codes = self.TT['code'] for i in range(codes.size): try: float(codes[i]) except ValueError: return False return True Individual # Authors: Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University Alexander Bricco - Department of Bioengineering - Michigan State University # import rule as Rule import settings import random as R import pandas as pd class Individual: # Constructor def __init__(self): self.TT = settings.TT self.aSize = settings.alphabet_size self.rules = [] self.usedRules = {} self.usedRulesCount = 0 self.ruleSize = settings.rule_size self.maxRuleCount = settings.maximum_rule_count self.minWeight = settings.rule_weight_min 84 self.maxWeight = settings.rule_weight_max self.fitness = 0 self.validate = 0 self.extra = {} def makeFromFile(self, file): # print ("Creating an Individual from file...") self.rules = [] self.usedRules = {} self.fitness = 0 tempIndv = pd.read_csv(file) tmpPatterns = tempIndv['pattern'] tmpWeights = tempIndv['weight'] try: tmpStatus = tempIndv['status'] except: print("Pro-Predictor: model {} does not have status column, putting 0 for all.".format(file)) tmpStatus = [0]*len(tmpPatterns) # print("size " + str(tmpPatterns.size)) for i in range(0, len(tmpPatterns) - 1): # print (str(i) + " " + str(tmpPatterns[i]) + " " + str(tmpWeights[i])) rule = Rule.Rule(tmpPatterns[i],tmpWeights[i], tmpStatus[i]) self.rules.append(rule) pass def init_pattern(self): codes = self.TT['code'] for i in range(R.randint(1, int(self.maxRuleCount/3))): pattern = "" weight = round(R.uniform(self.minWeight, self.maxWeight), 2) # Add these many rules for j in range(R.randint(1, self.ruleSize)): # Rule size is calculated randomly, and now we need to select a random combination of codes with a specified size code = codes[R.randint(0, codes.size - 1)] randomchar = code[R.randint(0, (len(code) - 1))] pattern += randomchar rule = Rule.Rule(pattern, weight, 0) self.rules.append(rule) # for i in self.rules: # print(str(i.pattern) + " => " + str(i.weight)) 85 # print ("\n\n") self.bubbleSort() # for i in self.rules: # print(str(i.pattern) + " => " + str(i.weight)) def init_formula(self): raise Exception ("uncharted territories") pass def bubbleSort(self): n = len(self.rules) # Traverse through all array elements for i in range(n): # Last i elements are already in place for j in range(0, n-i-1): # traverse the array from 0 to n-i-1 # Swap if the element found is greater # than the next element if len(self.rules[j].pattern) < len(self.rules[j+1].pattern): self.rules[j], self.rules[j+1] = self.rules[j+1], self.rules[j] def print(self): for kh, rule in enumerate(self.rules): print("{}- {} - {} - {}".format(kh, rule.pattern, rule.weight, rule.status)) Fitness # Authors: Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University Alexander Bricco - Department of Bioengineering - Michigan State University # import random as R import 
settings import individual as I import math class Fitness: def __init__(self): self.learn = settings.learn_df self.mode = settings.pattern_mode # 0 is the summation mode and 1 is the multiplication mode self.k = 0 # for 10-fold cross validation implementation 86 pass def measureTotal(self, individual): # Re-initalize the values to 0 fitness = 0.0 # this is the model's fitness sum = 0.0 individual.usedRulesCount = 0 # Make sure that every rule status is 0 for rule in individual.rules: rule.status = 0 # Iterating through the training set for i, row in self.learn.iterrows(): # Pretty much initialization sequence = row[0] actualFitness = row[1] # This is the starting position of the sequence that we're looking at each time. pos = 0 measuredFitness = 0.0 # Fitness of individual i # Iterate through the sequence while pos < len(sequence): # Check every rule for rule in individual.rules: sequence length # Find the length of the rule try: length = len(rule.pattern) except: # happens in the case of empty rules. continue # Continue if the rule length is more than the unchecked if length < (len(sequence) - pos) or length == 0: continue reverse_pattern = rule.pattern[::-1] # Normal or Reverse if ((rule.pattern == sequence[pos : (pos + length)]) or reverse_pattern == sequence[pos : (pos + length)]): # rule is found. Update its status and the usedRulesCount to avoid further computation 87 always use the summation mode tho. if rule.status == 0: rule.status = 1 individual.usedRulesCount += 1 # Check the multiplication/summation mode. We if self.mode == 0: measuredFitness += rule.weight elif self.mode == 1: measuredFitness *= rule.weight # If the rule is found, we don't wanna check any other "shorter" rules on the very same position. We wanna update the positioning and look for other rules. If no rule was found, the position updates anyway after the end of this loop so the difference is only the break command which happens only if the rule is found. This defenitely takes more processing power and considers overlapping rules but this is the way to go to consider all the possible cases. Also, note that this raises the issue of exponential slow-down when the number of rules in a table increase. Therefore, I assume we need some sort of bloat removal when the speed of the tool has decreased drastically. 
    # Deprecated: raises immediately; measureTotal() is the live scoring path
    def measure(self, individual):
        raise NotImplementedError("update this function according to measureTotal")
        if self.k == 10:
            self.k = 0
        fitness = 0.0
        for i, row in self.learn.iterrows():
            # Skip the rows that belong to the current validation fold
            if i >= int(self.learn.size / 10) * self.k and i < int(self.learn.size / 10) * (self.k + 1):
                continue
            sequence = row[0]
            actualFitness = row[1]
            pos = 0
            measuredFitness = 0.0  # estimated fitness of this sequence
            if self.mode == 1:
                measuredFitness = 1.0
            # Iterate through the sequence
            while pos < len(sequence):
                # Check every rule
                found = False
                for rule in individual.rules:
                    length = len(rule.pattern)
                    reverse_pattern = rule.pattern[::-1]
                    # Normal or reverse match
                    if (pos + length < len(sequence)
                            and (rule.pattern == sequence[pos:(pos + length)]
                                 or reverse_pattern == sequence[pos:(pos + length)])):
                        # The rule is found
                        individual.usedRules[rule.pattern] = True
                        # print("pos is at " + str(pos) + " sequence is: " + sequence + " found: " + sequence[pos:(pos + length)] + " weight: " + str(rule.weight))
                        if self.mode == 0:
                            measuredFitness += rule.weight
                        elif self.mode == 1:
                            measuredFitness *= rule.weight
                        pos = pos + length
                        found = True
                        break
                if not found:
                    pos += 1
                if found and length == 0:
                    pos += 1
            error = (measuredFitness - actualFitness) ** 2
            # ToDo:: check out limiting measuredFitness to positive values
            fitness += error
        MSE = fitness / int(self.learn.size - int(self.learn.size / 10))
        RMSE = math.sqrt(MSE)
        return RMSE

    # Deprecated: raises immediately; measureTotal() is the live scoring path
    def validate(self, individual):
        raise NotImplementedError("update this function according to measureTotal")
        if self.k == 10:
            self.k = 0
        fitness = 0.0
        for i, row in self.learn.iterrows():
            # Keep only the rows that belong to the current validation fold
            if not (i >= int(self.learn.size / 10) * self.k and i < int(self.learn.size / 10) * (self.k + 1)):
                continue
            sequence = row[0]
            actualFitness = row[1]
            pos = 0
            measuredFitness = 0.0  # estimated fitness of this sequence
            if self.mode == 1:
                measuredFitness = 1.0
            # Iterate through the sequence
            while pos < len(sequence):
                # Check every rule
                found = False
                for rule in individual.rules:
                    length = len(rule.pattern)
                    reverse_pattern = rule.pattern[::-1]
                    # Normal or reverse match
                    if (pos + length < len(sequence)
                            and (rule.pattern == sequence[pos:(pos + length)]
                                 or reverse_pattern == sequence[pos:(pos + length)])):
                        # The rule is found
                        individual.usedRules[rule.pattern] = True
                        if self.mode == 0:
                            measuredFitness += rule.weight
                        elif self.mode == 1:
                            measuredFitness *= rule.weight
                        pos = pos + length
                        found = True
                        break
                if not found:
                    pos += 1
                if found and length == 0:
                    pos += 1
            error = (measuredFitness - actualFitness) ** 2
            fitness += error
        MSE = fitness / int(self.learn.size / 10)
        RMSE = math.sqrt(MSE)
        return RMSE
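    # The fold arithmetic above slices self.learn into tenths by row index,
    # using windows of width int(self.learn.size / 10); note again that
    # DataFrame.size counts cells (rows times columns), not rows. measure()
    # skips the window for fold self.k while validate() keeps only that
    # window, giving a train/hold-out split for each fold.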
    # Deprecated: raises immediately; retained for reference
    def measure_unseen(self, individual):
        raise NotImplementedError("update this function according to measureTotal")
        if self.unseen.size <= 0:
            return 0
        fitness = 0.0
        for i, row in self.unseen.iterrows():
            sequence = row[0]
            actualFitness = row[1]
            pos = 0
            measuredFitness = 0.0  # estimated fitness of this sequence
            if self.mode == 1:
                measuredFitness = 1.0
            # Iterate through the sequence
            while pos < len(sequence):
                # Check every rule
                found = False
                for rule in individual.rules:
                    length = len(rule.pattern)
                    # ToDo:: not tested
                    reverse_pattern = rule.pattern[::-1]
                    if (pos + length < len(sequence)
                            and (rule.pattern == sequence[pos:(pos + length)]
                                 or reverse_pattern == sequence[pos:(pos + length)])):
                        if self.mode == 0:
                            measuredFitness += rule.weight
                        elif self.mode == 1:
                            measuredFitness *= rule.weight
                        pos = pos + length
                        found = True
                        break
                if not found:
                    pos += 1
                if found and length == 0:
                    pos += 1
            error = (measuredFitness - actualFitness) ** 2
            fitness += error
        MSE = fitness / self.learn.size
        RMSE = math.sqrt(MSE)
        return RMSE

    # Scores a single candidate sequence against a trained model; only rules
    # whose status was set during training are applied
    def predict(self, sequence, individual):
        fitness = 0.0
        pos = 0
        # Iterate through the sequence
        while pos < len(sequence):
            # Check every rule
            found = False
            for i, rule in enumerate(individual.rules):
                if rule.status == 0:
                    continue
                try:
                    length = len(rule.pattern)
                except TypeError:
                    continue
                reverse_pattern = rule.pattern[::-1]
                if (pos + length < len(sequence)
                        and (rule.pattern == sequence[pos:(pos + length)]
                             or reverse_pattern == sequence[pos:(pos + length)])):
                    if self.mode == 0:
                        fitness += rule.weight
                    elif self.mode == 1:
                        fitness *= rule.weight
                    pos = pos + length
                    found = True
                    break
            if not found:
                pos += 1
            if found and length == 0:
                pos += 1
        return fitness

Rule
# Authors:
# Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University
# Alexander Bricco - Department of Bioengineering - Michigan State University

class Rule:
    def __init__(self, pattern, weight, status):
        self.pattern = pattern
        self.weight = weight
        self.status = status

Predictor
# Authors:
# Iliya "eLeMeNOhPi" Alavy - Department of Engineering - Michigan State University
# Alexander Bricco - Department of Bioengineering - Michigan State University

import settings
import fitness as F
import random as R
import individual as I


class Sequence:
    def __init__(self, pattern, fitness):
        self.pattern = pattern
        self.fitness = fitness

# all I know is that iter is a lie...
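# Overview of the class below: Predictor generates self.popSize random
# candidate peptides from the translation-table alphabet, scores each with
# Fitness.predict() under the best saved model (net-hydrophobic candidates
# are zeroed out), and bubble-sorts the pool so the top self.count sequences
# can be printed.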
# ToDo:: the iter argument is stored but never used in this module
class Predictor:
    def __init__(self, count, size, iter):
        self.count = count
        self.size = size
        self.iter = iter
        self.pop = []
        self.output = []
        self.popSize = 100
        self.codes = settings.TT['key']
        # Per-residue interfacial hydrophobicity values (these match the
        # Wimley-White interfacial scale, where negative values are
        # hydrophobic and positive values hydrophilic)
        self.hydroDic = {
            "I": -0.31, "L": -0.56, "F": -1.13, "V": 0.07, "M": -0.23,
            "P": 0.45, "W": -1.85, "J": 0.17, "T": 0.14, "E": 2.02,
            "Q": 0.58, "C": -0.24, "Y": -0.94, "A": 0.17, "S": 0.13,
            "N": 0.42, "D": 1.23, "R": 0.81, "G": 0.01, "H": 0.96,
            "K": 0.99
        }

    def predict(self):
        self.populate()
        self.sort()
        for i in range(self.count):
            print(self.pop[i].pattern + " -> " + str(self.pop[i].fitness))

    def populate(self):
        print("Populating the sequence pool...\n")
        # Read the best model and turn it into an individual
        individual = I.Individual()
        individual.makeFromFile(settings.prediction_model)
        objF = F.Fitness()
        for i in range(self.popSize):
            pattern = ""
            if self.size <= 0:
                self.size = R.randint(8, 12)
            for j in range(self.size - 1):
                rand = R.randint(0, len(self.codes) - 1)
                pattern += self.codes[rand]
            # Restrict the amino acid search: candidates that are net
            # hydrophobic are assigned zero fitness (empirical research)
            if self.hydrophobic(pattern):
                tempF = 0
            else:
                tempF = objF.predict(pattern, individual)
            sequence = Sequence(pattern, tempF)
            self.pop.append(sequence)

    # Returns True if the summed hydrophobicity values mark the pattern as
    # net hydrophobic
    def hydrophobic(self, pattern):
        temp = 0
        for char in pattern:
            temp += self.hydroDic[char]
        if temp >= 0:
            return False
        else:
            return True

    # Sorts the candidate sequences by predicted fitness, highest first
    def sort(self):
        print("Sorting the sequences...\n")
        n = len(self.pop)
        # Traverse through all array elements
        for i in range(n):
            # Last i elements are already in place
            for j in range(0, n - i - 1):
                # Swap if this sequence scores lower than the next one
                if self.pop[j].fitness < self.pop[j + 1].fitness:
                    self.pop[j], self.pop[j + 1] = self.pop[j + 1], self.pop[j]
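For reference, a minimal sketch of how this module might be driven on its own; the argument values are hypothetical, and in POET they come from the command line and the settings module, which is not part of this listing:

    import settings
    import predictor as P

    # settings must be initialized first so that settings.TT (the translation
    # table) and settings.prediction_model (the best saved rule table) exist.
    pred = P.Predictor(count=10, size=12, iter=1)
    pred.predict()  # prints the ten highest-scoring candidate peptides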