MACHINE LEARNING FOR POSE SELECTION

By

Jun Pei

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Chemistry - Doctor of Philosophy

2020

ABSTRACT

MACHINE LEARNING FOR POSE SELECTION

By

Jun Pei

Scoring functions play an important role in protein-related systems. In general, scoring functions are developed to connect three-dimensional structures with their corresponding stabilities. In protein-folding systems, scoring functions can be used to predict the most stable protein structure; in protein-ligand and protein-protein systems, scoring functions can be used to find the best ligand structure, predict binding affinities, and identify the correct binding modes. Potential functions make up an essential part of scoring functions. Each potential function usually represents a different interaction that exists in a protein or protein-ligand system. In many traditional scoring functions, the energies calculated from individual potential functions are simply summed to estimate the stability of the whole structure. However, it is possible that those energies cannot be directly added together. In other words, some potential functions might describe more important interactions, whereas others represent insignificant interactions. Hence, it is useful to construct a model that emphasizes the important interactions and ignores the insignificant ones. With the development of machine learning (ML), it has become possible to build a model that addresses the importance of different interactions. In this work, we combined the random forest (RF) algorithm with different potential function sets to solve the pose selection problem in protein-folding and protein-ligand systems. Chapter 3 and chapter 5 show the results of combining the RF algorithm with knowledge-based potential functions and force field potential functions for protein-folding systems.
Chapter 4 shows the results of combining the RF method with knowledge-based potential functions for protein-ligand systems. The results from chapter 3, chapter 4, and chapter 5 show that the RF models based on potential functions outperformed all of the traditional scoring functions in accuracy and native ranking tests. To test the importance of potential functions, scrambled and uniform artificial potential function sets were generated in chapter 3. The test results suggest that the potential function set is important to the model, and that the most useful information from knowledge-based potential functions is the peak positions. In chapter 5, the importance of the RF algorithm and of the potential functions was tested. The results again suggest that the potential functions are important and that the RF model is also necessary to achieve the best performance.

This dissertation is dedicated to my parents, who always love, trust, and support me.

ACKNOWLEDGEMENTS

I thank my parents for their generous love. My parents raised me and spared no effort to give me a good life and education. They encouraged me to follow my heart and respected my decisions. They set examples of good behavior for me and supported me when I was stressed out. I thank my friends for their friendship and company. I wish to express my deepest gratitude to my advisor, Professor Kenneth M. Merz Jr., for his patient love and kind support of my research. He respects my decisions and helps me to fulfill my dream. His positive attitude encouraged me when I was pessimistic; his broad thinking showed me the possibilities of research; his deep insights illuminated the research direction for me. I would never have touched machine learning without him. I thank my guidance committee members: Professor Katharine C. Hunt, Professor Robert I. Cukier, and Professor Benjamin G. Levine. I thank the members of the Merz research group for their company.
Among them, I specifically thank Zheng for handing me the project of constructing KECSA2 for protein systems. I thank Lin for helping me perform Amber calculations in chapter 2 and chapter 3. Being a member of the Merz research group is a precious memory that I will never forget.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
CHAPTER 1: INTRODUCTION
1.1 Introduction for protein-folding pose selection
1.2 Introduction for protein-ligand pose selection
1.3 Combine machine learning and conventional scoring function
1.4 Logistic regression
1.5 Decision tree
1.6 Support vector machine
1.7 The random forest algorithm
CHAPTER 2: METHOD
2.1 From potential functions to descriptors
2.2 Random Forest model
2.3 Decoy sets
2.4 Structure preparation
2.5 Potential functions
2.6 Machine learning and validation
CHAPTER 3: RANDOM FOREST MODEL WITH KECSA2 FOR PROTEIN-FOLDING POSE SELECTION
3.1 Accuracy of individual decoy set training
3.2 Native ranking of individual decoy set training
3.3 1st decoy RMSD and TM-score of individual decoy set training
3.4 Feature importance analysis for the overall decoy set
3.5 Comparison of overall RF model with traditional scoring functions
3.6 Importance of potential
3.7 Conclusion
CHAPTER 4: RANDOM FOREST MODEL WITH GARF FOR PROTEIN-LIGAND POSE SELECTION
4.1 Accuracy
4.2 Native ranking
4.3 Random Forest model with decoy comparison information
4.4 1st decoy RMSD and TM-score
4.5 Uniform probability function
4.6 Influence of training set size
4.7 Conclusion
CHAPTER 5: COMBINE RANDOM FOREST WITH AMBER FORCE FIELD FOR PROTEIN-FOLDING POSE SELECTION
5.1 Definitions of atom types, torsion types, and nonbond types
5.2 From Amber parameters to descriptors
5.4 Encoding validation
5.5 Feature importance analysis
5.6 Accuracy
5.7 Native ranking
5.8 1st decoy RMSD and TM-score
5.9 Impact of the RF algorithm
5.10 Potential analysis
5.11 Conclusion
APPENDICES
APPENDIX A: TABLES
APPENDIX B: FIGURES
APPENDIX C: COPYRIGHT NOTICE
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1. An example of training data to construct a decision tree
Table 1.2. Relationship between humidity and jogging decisions
Table 2.1. Atom types in the GARF potential database
Table 2.2. General form of a confusion matrix
Table 3.1. Accuracy values for different models
Table 3.2. Native structure's ranking of different models
Table 3.3. 1st decoy's RMSD for different models
Table 3.4. 1st decoy's TM-score of different models
Table 3.5. Comparison of accuracies of RF models with different numbers of features
Table 3.6.
Comparison of the overall performance of RF models (with different numbers of features) with traditional potentials on the overall data set
Table 3.7. Comparison of RF models based on different potentials
Table 4.1. Comparisons between RF models and 29 other scoring functions
Table 4.2. Comparison between RF models considering different numbers of decoy poses in the training set
Table 4.3. Comparison between RF models with different probability function sets
Table 4.4. Summary of peak positions and number of probability functions at each peak position in GARF
Table 4.5. Accuracy values for different training set sizes from RF models with original and uniform GARF
Table 5.1. Summary of charge, ε, and van der Waals radii for each atom type in the ff94 force field
Table 5.2. Summary of charge, ε, and van der Waals radii for each atom type in the ff14SB force field
Table 5.3. Torsion and nonbond energies calculated by an encoded program with ff94 parameters for single amino acid test set
Table 5.4. Torsion and nonbond energies calculated by Amber software with ff94 parameters for single amino acid test set
Table 5.5. Comparisons of torsion and nonbond energies calculated by encoded programs and Amber software with ff94 parameters for single amino acid test set
Table 5.6. Torsion and nonbond energies calculated by an encoded program with ff14SB parameters for single amino acid test set
Table 5.7. Torsion and nonbond energies calculated by Amber software with ff14SB parameters for single amino acid test set
Table 5.8. Comparisons of torsion and nonbond energies calculated by encoded programs and Amber software with ff14SB parameters for single amino acid test set
Table 5.9. Torsion and nonbond energies calculated by an encoded program with ff94 parameters for double amino acid test set
Table 5.10. Torsion and nonbond energies calculated by Amber software with ff94 parameters for double amino acid test set
Table 5.11. Comparisons of torsion and nonbond energies calculated by the encoded program and Amber software with ff94 parameters for double amino acid test set
Table 5.12. Torsion and nonbond energies calculated by an encoded program with ff14SB parameters for double amino acid test set
Table 5.13. Torsion and nonbond energies calculated by Amber software with ff14SB parameters for double amino acid test set
Table 5.14. Comparisons of torsion and nonbond energies calculated by the encoded program and Amber software with ff14SB parameters for double amino acid test set
Table 5.15. A brief comparison between FFENCODER and Amber with ff94 and ff14SB parameter sets
Table 5.16. Accuracy and native ranking comparison between RF models based on ff94 and ff14SB with other scoring functions
Table 5.17. 1st decoy RMSD and TM-score comparison between RF models based on ff94 and ff14SB with other scoring functions
Table 5.18. Distribution of decoys' lowest RMSD values in the combined decoy set
Table 5.19. Accuracy comparisons between scoring functions with and without RF refinement
Table 5.20. Accuracy comparisons between scoring functions with and without force field parameters

LIST OF FIGURES

Figure 1.1. The shape of the sigmoid function shown in equation (1.3)
Figure 1.2. An example of a simple decision tree
Figure 1.3. The comparison between a general hyperplane and the hyperplane generated by the maximum margin classifier. (a) shows that infinitely many hyperplanes can be used to separate the data set. (b) shows the hyperplane with the largest margin of separation width.
Figure 1.4. An example of a support vector machine algorithm. (a) An example of a dataset that cannot be separated by a hyperplane. The observations in the data set are one-dimensional points. (b) A polynomial kernel equation is used to change those one-dimensional points to 2D points. A hyperplane (black line) can be used to separate those observations
Figure 1.5. An example of bagging
Figure 1.6. An example of RF
Figure 2.1. Probability versus distance plot for atom pair O-MET__CG-MET; the shaded region is the averaged region with a length of 1 Å
Figure 2.2. The protocol used to build the Random Forest model. Parameter p (equal to 16029) represents the total number of atom pairs in KECSA2. Parameter n represents the native structure; d_1, ..., d_m are the 1st, ..., mth decoy structures
Figure 2.3. Protocol for generating the ranking list for the Random Forest model. Parameter p (equal to 16029) represents the total number of atom pairs in KECSA2. S_1, S_2, ..., S_n are the 1st, 2nd, ..., nth protein structures with the same residue sequence
Figure 3.1. Feature importance analysis results for the overall decoy set. The red point represents the 500th atom pair
Figure 4.1.
The protocol used to include the comparison information between the best decoy binding pose and other decoy poses
Figure 4.2. Accuracy trend from RF models based on original (blue line) and uniform (orange line) GARF data sets
Figure 5.1. Comparisons between energies calculated by FFENCODER and the Amber software package. (a) - (e) are results for dihedral, 1-4 van der Waals, 1-4 electrostatic, van der Waals, and electrostatic energies. Columns (1) and (3) are comparisons of single amino acid and dipeptide test sets for ff94. Columns (2) and (4) are comparisons of single amino acid and dipeptide test sets for ff14SB
Figure 5.2. Importance analysis for features in the ff94 and ff14SB parameter sets. The red point in each plot represents the 500th most important feature in each parameter set

KEY TO ABBREVIATIONS

Amber        Assisted Model Building with Energy Refinement
CASF         Comparative Assessment of Scoring Functions
CHAID        Chi-square automatic interaction detection
ff           Force Field
FFENCODER    Force Field Encoded Program
FN           False Negative
FP           False Positive
KECSA2       Knowledge-Based and Empirical Combined Scoring Algorithm 2
ML           Machine Learning
RF           Random Forest
RMSD         Root-mean-square deviation
SVM          Support Vector Machine
TN           True Negative
TP           True Positive

CHAPTER 1: INTRODUCTION

1.1 Introduction for protein-folding pose selection

According to the "thermodynamic hypothesis", the native protein in its preferred chemical environment should have the structure with the lowest Gibbs free energy [1]. Identifying the three-dimensional protein structure with the lowest Gibbs free energy is important in many applications to protein systems, including protein folding [2-40], protein structure prediction [41-52], and protein design problems [53-61]. It is a challenge to understand the relationship between the three-dimensional structure of a protein and its corresponding stability. For example, in the protein design field, it is important to predict the native structure of a protein (the most stable structure) in order to tailor the properties of that protein. Scoring functions were developed to connect three-dimensional structures with their corresponding stabilities. Currently, there are three broad categories of scoring functions for protein folding.

(i) Physics-based scoring functions [62-68]. These scoring functions usually employ relatively simple equations to represent bond, angle, dihedral, van der Waals, and electrostatic interactions, and calculate the score of a protein at the atomic level.

(ii) Knowledge-based scoring functions [2-33], also referred to as statistical scoring functions. These functions use crystal structures of proteins as the data source and extract the radial distribution functions of atom/residue pairs from them. A reference state is then constructed from a statistical model to isolate the "pure" interactions between different atom/residue pairs. Hence, these scoring functions can generate scores of proteins at the atomic/residue level.

(iii) Machine learning-based scoring functions (ML-based scoring functions) [34-40]. These functions usually include more features of a protein, features that are hard to incorporate into physics-based and knowledge-based scoring functions. ML-based scoring functions utilize different ML algorithms and information from protein structures to predict the stabilities of protein structures.
1.2 Introduction for protein-ligand pose selection

Similar to the protein-folding pose selection described in section 1.1, scoring functions are also needed for protein-ligand pose selection problems. In drug discovery, it is important to identify the binding mode of a small ligand (that is, the most stable binding pose). Without the correct binding mode, it is hard to optimize lead structures. The existing scoring functions for protein-ligand systems can be classified into four groups: physics-based [69-79], knowledge-based [80-91], empirical-based [92-98], and ML-based scoring functions [99-110]. The physics-based, knowledge-based, and ML-based scoring functions are similar to the functions discussed for protein-folding pose selection. Empirical-based scoring functions are built on the assumption that the total binding affinity between a protein and a ligand can be decomposed into basic components with different coefficients. The coefficients can be determined with a multivariate regression model and a benchmark that contains experimentally determined protein-ligand structures and the corresponding binding affinities.

1.3 Combine machine learning and conventional scoring function

With the success of ML in the computer vision area, more ML-based scoring functions have been developed for protein-folding and protein-ligand systems [34-40, 99-110]. Owing to the flexibility of ML algorithms, a large variety of information has been employed as features in ML-based scoring functions. For example, in protein-folding systems, the secondary surface area of proteins and the solvent accessible surface area were used as input features [35, 38]. In protein-ligand systems, topological fingerprints and three-dimensional graphs of a protein-ligand complex were used as inputs [100-101]. Some of the information used in ML-based scoring functions cannot be used in traditional scoring functions.
Although those ML-based scoring functions achieved higher accuracy and lower computational cost than conventional scoring functions, some important information might have been lost when they were constructed. Several conventional scoring functions perform well in protein-folding and protein-ligand systems, so important information might be buried in those conventional scoring functions. Physics-based, knowledge-based, and empirical-based scoring functions share a common characteristic: all of them employ potential functions to describe interactions between atoms/residues. Potential functions can be used to represent a three-dimensional structure as a series of pairwise energies. In conventional scoring functions, all pairwise energies are treated as equally important. However, the importance of each pairwise energy might be different. Due to data and algorithm limitations, it was hard in the past to address the importance of different pairwise interactions. With the development of ML algorithms, it has become possible to use ML models to address the importance of different interactions.

It is important to understand basic ML algorithms in order to select a suitable ML model, one that can emphasize the more important pairwise interactions and ignore the insignificant ones. In general, most machine learning problems can be classified into two groups: supervised and unsupervised learning problems. Supervised learning refers to cases in which a model is fitted to predict values or categories. Unsupervised learning refers to cases in which there are no real values or classes to be predicted. In supervised learning, there are two categories of models: regression and classification. Regression models are usually used to predict continuous values, whereas classification models predict the category of each input.
In this study, the goal is to build an ML model that can correctly identify the most stable structure in protein-folding and protein-ligand systems. Hence, supervised models can be used to solve the problem. On the other hand, it is hard to find experimentally determined values that describe different protein-folding and protein-ligand poses; therefore, classification models should be used. Among ML classification models, logistic regression, decision tree, support vector machine, and random forest classifiers are the basic ones. Brief introductions to these methods are given here [111].

1.4 Logistic regression

In statistics, logistic regression models are usually used to calculate the probability that a certain event occurs. This method is based on linear regression and uses the sigmoid function to transform the continuous output into binary classes. In general, for a linear regression model, if the input features are $(x_1, x_2, x_3, \dots, x_m)$, the predicted value $p$ can be calculated as follows:

$$ p(x) = \theta^{T} x = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_m x_m \tag{1.1} $$

where $\theta_0, \theta_1, \dots, \theta_m$ are the coefficients of the input elements. In linear regression, $p$ takes continuous values in the range $(-\infty, +\infty)$. However, in logistic regression, the predicted value is expected to be 0 or 1. Hence, a sigmoid function is needed to transform the continuous values into binary classes. The sigmoid function is:

$$ g(x) = \frac{1}{1 + e^{-x}} \tag{1.2} $$

In logistic regression, the predictions are calculated with the following equation:

$$ h_\theta(x) = g(\theta^{T} x) = \frac{1}{1 + e^{-\theta^{T} x}} \tag{1.3} $$

The above equation is usually called the logistic equation or sigmoid equation. Figure 1.1 shows the shape of the sigmoid equation. When $\theta^{T} x$ approaches $+\infty$, $g(\theta^{T} x)$ approaches 1; when $\theta^{T} x$ approaches $-\infty$, $g(\theta^{T} x)$ approaches 0. By using the sigmoid function, the range of predicted values is changed from $(-\infty, +\infty)$ to $(0, 1)$.
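The mapping from the unbounded linear output of equation (1.1) to the interval (0, 1) in equation (1.3) can be illustrated with a short sketch. The function names `sigmoid` and `h_theta`, the convention that the first entry of the feature vector is a constant 1 (so that the first coefficient acts as the intercept), and the coefficient values are illustrative assumptions rather than anything specified in this work.

```python
import math

def sigmoid(z):
    """Equation (1.2): g(z) = 1 / (1 + exp(-z)), evaluated without overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form exp(z) / (1 + exp(z)),
    # so that exp() is never called with a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)

def h_theta(theta, x):
    """Equation (1.3): sigmoid of the linear combination theta^T x."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return sigmoid(z)

# Hypothetical coefficients and one input; x[0] = 1 is the assumed intercept term.
theta = [-1.0, 2.0, 0.5]
x = [1.0, 0.8, -0.4]

print(h_theta(theta, x))                             # a value strictly between 0 and 1
print(sigmoid(-30.0), sigmoid(0.0), sigmoid(30.0))   # near 0, exactly 0.5, near 1
```

A class label could then be assigned by thresholding the output, for instance at 0.5; that threshold is a common convention and is likewise an assumption here.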
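The coefficients $\theta$ must be determined to fix the functional form of $h_\theta(x)$. Usually, given a training set that contains a set of inputs and the corresponding classes, the parameters are calculated by maximizing the likelihood of the actual classes under the predicted probabilities. We assume that:

$$ P(y = 1 \mid x; \theta) = h_\theta(x) \tag{1.4} $$

$$ P(y = 0 \mid x; \theta) = 1 - h_\theta(x) \tag{1.5} $$

For a training set that contains $m$ independently generated examples, the likelihood of the parameters can be written as:

$$ L(\theta) = \prod_{i=1}^{m} p\left(y^{(i)} \mid x^{(i)}; \theta\right) = \prod_{i=1}^{m} \left(h_\theta(x^{(i)})\right)^{y^{(i)}} \left(1 - h_\theta(x^{(i)})\right)^{1 - y^{(i)}} \tag{1.6} $$

Instead of maximizing the likelihood in equation (1.6), it is easier to maximize its logarithm:

$$ l(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] \tag{1.7} $$

Many methods (for example, gradient ascent, conjugate gradient, BFGS, and L-BFGS) can be used to maximize equation (1.7). Here, the gradient ascent method is discussed as an example. If the learning rate is $\alpha$, the update of $\theta$ is $\theta := \theta + \alpha \, \partial l(\theta) / \partial \theta$. A single training example $(x, y)$ is used to show the calculation details:

$$ l(\theta) = y \log h_\theta(x) + (1 - y) \log\left(1 - h_\theta(x)\right) \tag{1.8} $$

$$
\begin{aligned}
\frac{\partial l(\theta)}{\partial \theta}
&= \left( \frac{y}{h_\theta(x)} - \frac{1 - y}{1 - h_\theta(x)} \right) \frac{\partial h_\theta(x)}{\partial \theta} \\
&= \left( \frac{y}{h_\theta(x)} - \frac{1 - y}{1 - h_\theta(x)} \right) h_\theta(x) \left(1 - h_\theta(x)\right) \frac{\partial\, \theta^{T} x}{\partial \theta} \\
&= \left( y \left(1 - h_\theta(x)\right) - (1 - y)\, h_\theta(x) \right) x = \left( y - h_\theta(x) \right) x
\end{aligned}
\tag{1.9}
$$

Here $\partial h_\theta(x) / \partial \theta = h_\theta(x)\left(1 - h_\theta(x)\right) \partial\, \theta^{T} x / \partial \theta$ has been used; its derivation is given in equation (1.10):

$$
\begin{aligned}
\frac{\partial h_\theta(x)}{\partial \theta}
&= \frac{\partial}{\partial \theta} \left( \frac{1}{1 + e^{-\theta^{T} x}} \right)
= (-1) \left( \frac{1}{1 + e^{-\theta^{T} x}} \right)^{2} e^{-\theta^{T} x} \, (-1) \, \frac{\partial\, \theta^{T} x}{\partial \theta} \\
&= \frac{1}{1 + e^{-\theta^{T} x}} \cdot \frac{e^{-\theta^{T} x}}{1 + e^{-\theta^{T} x}} \cdot \frac{\partial\, \theta^{T} x}{\partial \theta}
= h_\theta(x) \left(1 - h_\theta(x)\right) \frac{\partial\, \theta^{T} x}{\partial \theta}
\end{aligned}
\tag{1.10}
$$

Hence, the update of the parameters $\theta$ is:

$$ \theta := \theta + \alpha \left( y - h_\theta(x) \right) x \tag{1.11} $$

Each iteration updates the parameters $\theta$ once; after a sufficient number of iterations the parameters converge, and the logistic regression model is then constructed.

1.5 Decision tree

The decision tree classification model is another important classifier besides logistic regression. The decision tree algorithm can be used for both regression and classification problems.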
Based on the training data, decision tree models learn a series of questions that are used to infer the class labels of examples. Figure 1.2 shows an example of a simple decision tree model. For such a model, the main challenge is to locate all of the questions; in other words, obtaining the right order of questions (conditions) is the most significant problem in constructing a decision tree model. Here, we go through the working of the decision tree step by step to understand how each question in the model is located.

For example, suppose a decision tree model is needed to predict whether or not it is a good day to go out jogging. First, a collection of weather and jogging information is gathered in Table 1.1. In Table 1.1, every feature (weather, temperature, humidity, and wind) can be used to decide whether to jog. To construct a decision tree for predicting the jogging decision, the order in which the features are used must be determined. For example, should humidity or weather be considered first? Which feature is the least significant for making the decision? Here, the feature humidity is used as an example of the calculations that determine the order of features. Table 1.2 contains the relationship between humidity and jogging decisions. Three values (chi-square automatic interaction detection, information gain, and Gini) can be calculated from Table 1.2, and by using these values, the order of features can be identified.

(1) Chi-square automatic interaction detection (CHAID)

Chi-square and the degrees of freedom can be calculated using the following equations:

$$ \text{Chi-square} = \sum \frac{(O - E)^{2}}{E} = \frac{(-0.5)^{2}}{3.5} + \frac{(0.5)^{2}}{3.5} + \frac{(1.5)^{2}}{3.5} + \frac{(-1.5)^{2}}{3.5} \approx 1.4 \tag{1.12} $$

$$ \text{Degrees of freedom} = (r - 1)(c - 1) = (2 - 1)(2 - 1) = 1 \tag{1.13} $$

In equation (1.12), $O$ and $E$ are the observed and expected counts. In equation (1.13), $r$ represents the number of row components in Table 1.2, and $c$ is the number of response variables.
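The arithmetic of equations (1.12) and (1.13) can be checked with a minimal sketch. Table 1.2 is not reproduced in this chapter, so the observed counts below are an assumption: they are chosen only so that the deviations from the expected count of 3.5 per cell match the four terms printed in equation (1.12). The function names are hypothetical, and the right-tail formula is the closed form for exactly one degree of freedom.

```python
import math

def chi_square(observed, expected):
    """Sum of (O - E)^2 / E over all cells, as in equation (1.12)."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def degrees_of_freedom(n_rows, n_cols):
    """Equation (1.13): (r - 1) * (c - 1)."""
    return (n_rows - 1) * (n_cols - 1)

def right_tail_p_one_dof(stat):
    """Right-tailed chi-square probability, valid only for one degree of freedom.

    With one degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    """
    return math.erfc(math.sqrt(stat / 2.0))

# Assumed cell counts: O - E gives -0.5, 0.5, 1.5, -1.5 when E = 3.5.
observed = [3, 4, 5, 2]
expected = [3.5, 3.5, 3.5, 3.5]   # expected count taken as printed in equation (1.12)

stat = chi_square(observed, expected)
dof = degrees_of_freedom(2, 2)
print(round(stat, 4), dof, round(right_tail_p_one_dof(stat), 4))
```

The unrounded statistic is 5/3.5; evaluating `right_tail_p_one_dof` at the rounded value 1.4 gives approximately 0.237.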
With Chi-square and the degrees of freedom, the p-value (the right-tailed probability of the chi-square distribution) can be calculated with the CHIDIST function in Excel. Here, the calculated p-value is 0.237. The p-values of the other features are calculated in the same way, and the order of features follows from these p-values: the best feature is the one with the lowest p-value.

(2) Information gain

Based on Table 1.2, entropy can be calculated using the following equation:

$$\text{Entropy} = -\sum p \log_2 p \tag{1.14}$$

The concept of entropy comes from information theory, where it represents the impurity of the data. For a two-class problem, entropy values lie in the range [0, 1]. As an example, the total entropy for humidity is:

$$\text{Entropy}_{\text{Total}} = -\frac{7}{14} \log_2\left(\frac{7}{14}\right) - \frac{7}{14} \log_2\left(\frac{7}{14}\right) = 1 \tag{1.15}$$

The entropy values for high humidity and normal humidity are calculated with equations (1.16) and (1.17):

$$\text{Entropy}_{\text{High}} = -\frac{3}{8} \log_2\left(\frac{3}{8}\right) - \frac{5}{8} \log_2\left(\frac{5}{8}\right) = 0.9544 \tag{1.16}$$

$$\text{Entropy}_{\text{Normal}} = -\frac{4}{6} \log_2\left(\frac{4}{6}\right) - \frac{2}{6} \log_2\left(\frac{2}{6}\right) = 0.9183 \tag{1.17}$$

With these three entropy values, the information gain is obtained by weighting each subset entropy by the fraction of examples in that subset:

$$\text{Information gain} = \text{Entropy}_{\text{Total}} - \frac{8}{14} \, \text{Entropy}_{\text{High}} - \frac{6}{14} \, \text{Entropy}_{\text{Normal}} = 0.0611 \tag{1.18}$$

In equation (1.18), the information gain represents the reduction in entropy when the feature humidity is split into "high humidity" and "normal humidity". The goal of a decision tree is to split the data so that each resulting node contains examples from only one class. Hence, the feature with the largest information gain is the most important feature. The information gain of every other feature is calculated in the same way, and the order of features is determined from these values: the feature with the largest information gain is the best feature.
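As a hedged illustration of equations (1.14) to (1.18), the following Python sketch computes the entropy of each humidity subset and the resulting information gain. The class counts (3 and 5 examples for high humidity, 4 and 2 for normal humidity) are assumptions read off the fractions in equations (1.16) and (1.17). The function names are illustrative, and each subset entropy is weighted by the fraction of examples that fall in that subset.

```python
from math import log2


def entropy(counts):
    """Shannon entropy (base 2) of a list of class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Parent entropy minus the size-weighted entropy of the subsets."""
    total = sum(parent_counts)
    weighted = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - weighted


# Class counts assumed from the fractions in equations (1.15) to (1.17).
parent = [7, 7]
high, normal = [3, 5], [4, 2]

gain = information_gain(parent, [high, normal])
print(round(entropy(high), 4), round(entropy(normal), 4), round(gain, 4))
```

Repeating the same call for the other features in Table 1.1 and sorting by the returned value would give the feature order described above.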
(3) Gini

The Gini value represents the degree of misclassification. It works similarly to entropy but can be calculated faster. Equation (1.19) gives the general definition of Gini:

$$\text{Gini} = 1 - \sum_{i} p_i^2 \tag{1.19}$$

Here, humidity is again taken as an example for calculating the corresponding Gini values:

$$\text{Gini}_{\text{High}} = 1 - \left(\frac{3}{8}\right)^2 - \left(\frac{5}{8}\right)^2 = 0.4688 \tag{1.20}$$

$$\text{Gini}_{\text{Normal}} = 1 - \left(\frac{4}{6}\right)^2 - \left(\frac{2}{6}\right)^2 = 0.4444 \tag{1.21}$$

Based on $\text{Gini}_{\text{High}}$ and $\text{Gini}_{\text{Normal}}$, the expected Gini can be calculated with the following equation:

$$\text{Expected Gini} = \frac{8}{14} \times 0.4688 + \frac{6}{14} \times 0.4444 = 0.4583 \tag{1.22}$$

The expected Gini of the other features is obtained in the same way, and the order of features follows from these values. The best feature is the one with the lowest expected Gini.

In summary, any one of CHAID, information gain, and expected Gini can be used to obtain the order of features. The decision tree classifier is then generated from that order.

1.6 Support vector machine

Besides logistic regression and the decision tree classifier, another important classification algorithm is the support vector machine (SVM). In general, the SVM method constructs hyperplanes by maximizing the boundaries between different types of data points, so that the hyperplanes create subregions that are as homogeneous as possible. Based on how they work, support vector machines can be divided into three major methods: the maximum margin classifier, the support vector classifier, and the support vector machine.

(1) Maximum margin classifier

Consider a data set that contains two categories of examples that can be separated by a hyperplane. In that case, an infinite number of hyperplanes can separate the data set. The most challenging question is which hyperplane is the best one to use.
The maximum margin classifier answers this question: the best hyperplane is the one with the largest margin of separation. Consider a data set that contains $n$ training examples, $x_1, x_2, x_3, \ldots, x_n$ (each $x$ is a column vector with $p$ elements), with corresponding labels $y_1, y_2, y_3, \ldots, y_n \in \{-1, 1\}$. The hyperplane defined by the maximum margin classifier is the solution to the following optimization problem:

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} M \tag{1.23}$$

$$\text{Constraint 1:} \quad \sum_{j=1}^{p} \beta_j^2 = 1 \tag{1.24}$$

$$\text{Constraint 2:} \quad y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M \quad \forall \, i = 1, 2, \ldots, n \tag{1.25}$$

$$\text{Hyperplane:} \quad \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = 0 \tag{1.26}$$

In equation (1.23), $M$ is the width of the margin. Under the normalization in equation (1.24), the left-hand side of equation (1.25) is the distance between an observation and the hyperplane. Equation (1.26) gives the form of the hyperplane. With the two constraints in equations (1.24) and (1.25), the hyperplane obtained is the one with the maximum margin. Figure 1.3 compares general hyperplanes that separate a data set with the hyperplane determined by the maximum margin classifier. The three points marked with a vector in Figure 1.3(b) are called "support vectors". The hyperplane determined by the maximum margin classifier depends only on those three support vectors. If a support vector changes, the hyperplane changes. On the other hand, if examples other than the support vectors change, the hyperplane stays the same.

(2) Support vector classifier

A maximum margin classifier can only be used if at least one hyperplane separates the whole data set. However, some data sets cannot be separated by a hyperplane. For those cases, a support vector classifier can be used instead of a maximum margin classifier. A support vector classifier works similarly to a maximum margin classifier, but it allows some observations to lie in the margin area or on the wrong side of the hyperplane.
The support vector classifier sacrifices some observations to guarantee that the majority of data points are on the right side of the hyperplane. In general, the hyperplane determined by a support vector classifier is the solution to the following optimization problem:

$$\max_{\beta_0, \beta_1, \ldots, \beta_p} M \tag{1.27}$$

$$\text{Constraint 1:} \quad \sum_{j=1}^{p} \beta_j^2 = 1 \tag{1.28}$$

$$\text{Constraint 2:} \quad y_i \left( \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right) \ge M \left( 1 - \varepsilon_i \right) \quad \forall \, i = 1, \ldots, n \tag{1.29}$$

$$\text{Constraint 3:} \quad \varepsilon_i \ge 0, \quad \sum_{i=1}^{n} \varepsilon_i \le C \tag{1.30}$$

$$\text{Hyperplane:} \quad \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p = 0 \tag{1.31}$$

Equations (1.27), (1.28), and (1.31) are the same as for the maximum margin classifier. The difference between the two classifiers lies in the second and third constraints. In a maximum margin classifier, the second constraint (equation (1.25)) requires every observation to be on the right side of the margin area. In a support vector classifier, the second constraint (equation (1.29)) allows some observations to lie in the margin area or on the wrong side of the hyperplane. The variable $\varepsilon_i$ describes the position of observation $i$, and the variable $C$ bounds the sum of the $\varepsilon_i$ and therefore limits how many observations can be on the wrong side of the margin or hyperplane. For example, if $\varepsilon_i = 0$, the corresponding point is on the right side of the margin; if $\varepsilon_i > 0$ and $\varepsilon_i \le 1$, the point is on the wrong side of the margin but on the right side of the hyperplane, in other words, the observation has violated the margin; if $\varepsilon_i > 1$, the point is on the wrong side of the hyperplane. By tuning the variable $C$ in equation (1.30), we can control the number and severity of the violations that the model tolerates. If $C = 0$, the support vector classifier is the same as a maximum margin classifier; as $C$ increases, more observations are allowed to violate the margin, and the model becomes more flexible.
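The role of the slack variables in equation (1.29) can be made concrete with a small sketch. It assumes that a hyperplane with unit-norm coefficients and a margin width M have already been obtained; the hyperplane, margin, and points below are hypothetical values chosen only for illustration. For each observation, the smallest slack that satisfies the constraint is max(0, 1 - y f(x) / M), and its value determines on which side of the margin and hyperplane the observation lies.

```python
def decision_value(beta0, beta, x):
    """f(x) = beta0 + sum_j beta_j * x_j, with beta assumed to have unit norm."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))


def slack(beta0, beta, x, y, margin):
    """Smallest epsilon >= 0 with y * f(x) >= margin * (1 - epsilon)."""
    return max(0.0, 1.0 - y * decision_value(beta0, beta, x) / margin)


def position(eps):
    """Interpret a slack value in the way described in the text."""
    if eps == 0:
        return "right side of margin"
    if eps <= 1:
        return "violates margin"
    return "wrong side of hyperplane"


# Hypothetical hyperplane x1 = 0 with margin M = 1 and four labelled points.
beta0, beta, M = 0.0, [1.0, 0.0], 1.0
points = [([2.0, 0.5], 1), ([0.5, 1.0], 1), ([0.5, -1.0], -1), ([-3.0, 0.0], -1)]

slacks = [slack(beta0, beta, x, y, M) for x, y in points]
for (x, y), eps in zip(points, slacks):
    print(x, y, eps, position(eps))
print("total slack:", sum(slacks))
```

In this toy configuration the total slack is 2, so the hyperplane would satisfy equation (1.30) only for a budget C of at least 2.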
(3) Support vector machine

The support vector classifier and the maximum margin classifier both require that at least one hyperplane can separate the majority of the data set. However, some data sets cannot be separated by a hyperplane at all. To separate those data sets, a support vector machine is needed. Figure 1.4 shows an example. In Figure 1.4(a), the data points are distributed along a line, so all of the points are one-dimensional. Clearly, no hyperplane can separate the two classes, and neither a maximum margin classifier nor a support vector classifier can be used in this case. A support vector machine handles this case with a concept called the "kernel function". In the specific case shown in Figure 1.4(b), the values of $x_1^2$ are calculated to transform the one-dimensional data points into two-dimensional points. With the two-dimensional points, a support vector classifier can construct a hyperplane that separates the two classes. The general idea of the support vector machine is to include more features so that the data set lies in a higher-dimensional space, where a support vector classifier can separate the points. The challenge in a support vector machine is to find the best kernel function. There are four popular kernel functions: the linear kernel, the polynomial kernel, the radial basis function (RBF) or Gaussian kernel, and the sigmoid kernel. Of these, the polynomial and RBF kernels are especially common choices.

1.7 The random forest algorithm

The random forest (RF) model is an ensemble learning method that can be used for classification, regression, and other tasks. An RF model is built from a large number of decision trees. The prediction of the RF model is the result of a vote among all individual decision trees (classification) or the mean of the predictions from all decision trees (regression).
The first RF algorithm was created by Tin Kam Ho [122]; Leo Breiman [116] and Adele Cutler [123] then developed an extension of the algorithm and registered "Random Forests" as a trademark. The RF model is built on single decision trees, so to understand why the RF algorithm was developed, it is necessary to understand the limitations of a single decision tree model. As discussed in section 1.5, a decision tree model can be constructed using CHAID, information gain, or Gini calculations. Based on those values, the order (importance) of each feature can be determined. However, the decision tree model has a high risk of overfitting. Suppose a data set is randomly split into two subsets (80% training, 20% testing) in two different ways, and a decision tree model is trained on each of the two training sets. The order of features can be totally different for the two models, because the examples in the training data strongly affect the importance (the CHAID, information gain, and Gini values) of each feature. To obtain a decision tree model with lower variance, a bagging strategy is used first.

Here, the bagging classifier is introduced. Bagging stands for bootstrap aggregation. It repeatedly samples examples with replacement, trains a model on each sample, and aggregates the results. It is a general methodology for reducing the variance of a model. Figure 1.5 shows an example of bagging. Suppose the whole data set contains six observations (in real cases, the number will be much larger). Instead of fitting a decision tree model to all six examples, the bagging algorithm selects subsets of the whole data set. In Figure 1.5, two subsets are selected (orange and blue). Two decision trees are then constructed from the selected examples to reduce the variance relative to a decision tree built on all examples. Theoretically, this procedure should reduce the variance.
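A minimal sketch of the bagging idea follows, assuming six observations as in the example above. It draws bootstrap samples (sampling with replacement) and aggregates per-tree class predictions by majority vote. The observation names and the tree predictions are placeholders introduced for illustration; they are not outputs of fitted decision trees.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]


def majority_vote(predictions):
    """Return the most common class label among the individual predictions."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
observations = ["obs1", "obs2", "obs3", "obs4", "obs5", "obs6"]

# Each bootstrap sample would be used to train one decision tree.
samples = [bootstrap_sample(observations, rng) for _ in range(3)]
for s in samples:
    print(sorted(s))

# Placeholder predictions from three hypothetical trees for one new example.
tree_predictions = [1, 0, 1]
print(majority_vote(tree_predictions))
```

Because sampling is done with replacement, a bootstrap sample usually repeats some observations and omits others, which is what makes the individual trees differ.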
However, bagging alone does not reduce the variance efficiently. Every subset selected by the bagging algorithm contains all features (columns). As a result, the decision trees constructed on the subsets may be correlated with each other, and correlated trees might not be good enough to solve the overfitting problem of the decision tree model.

The RF model was developed to solve the problem that decision trees built on subsets might be correlated. Figure 1.6 shows an example of the RF algorithm. In the bagging algorithm, decision trees built on different subsets tend to be correlated because every subset contains all existing features. To reduce the correlation between decision trees, the features considered in each subset should differ. The RF algorithm therefore randomly selects some of the features in each subset to construct the decision trees, which reduces the correlation between the trees. Hence, with the two-dimensional randomness of selecting both examples and features, the RF algorithm can effectively reduce the risk of overfitting.

CHAPTER 2: METHOD

2.1 From potential functions to descriptors

The general protocols for calculating descriptors for protein-folding and protein-ligand systems are similar. Here, the protein-folding system is used as an example to describe how the descriptors are calculated from potential functions.

If all independent pairwise probabilities with different magnitudes in an $n$-body system are known, the probability of the whole $n$-particle system can be obtained as:

$$p_n = \prod_{\substack{i, j = 1 \\ i \ne j}}^{n} c_{ij} \cdot p_{ij}, \tag{2.1}$$

where $p_n$ is the probability of the $n$-particle system, $c_{ij}$ is the scaling factor of the pairwise probability $p_{ij}$, which can be evaluated using the random forest model, and $i$ and $j$ represent two different particles. Using a knowledge-based potential with pairwise independent interactions, the independent pairwise probabilities for the bond, angle, torsion, and non-bonding terms can be obtained.
If the protein structure is treated as an $n$-particle system, the probability is:

$$p_{\text{protein}} = \Big( \prod c_{ij} \cdot p_{ij} \Big)_{\text{bond}} \Big( \prod c_{kl} \cdot p_{kl} \Big)_{\text{angle}} \Big( \prod c_{mn} \cdot p_{mn} \Big)_{\text{torsion}} \Big( \prod c_{pq} \cdot p_{pq} \Big)_{\text{nonbond}} \tag{2.2}$$

Here $p_{\text{protein}}$ is the protein structure probability, $c_{\alpha\beta}$ and $p_{\alpha\beta}$ represent the scaling factor and the probability of the atom pair $\alpha$ and $\beta$, and the subscripts $ij$, $kl$, $mn$, and $pq$ correspond to bond, angle, torsion, and non-bonded atom pairs, respectively. In this work, we make two further assumptions: (i) $\prod_{\text{bond}} p_{\text{bond}}$ and $\prod_{\text{angle}} p_{\text{angle}}$ are similar for the native structure and all decoys, hence the product of those two values is treated as a constant $C$; (ii) the probabilities for the torsion and nonbond atom pairs are independent, since a reference state is used to remove contributions from the ideal-gas state. With these assumptions, the probability of an $n$-atom protein can be written as:

$$p_{\text{protein}} = C \, \Big( \prod c_{mn} \cdot p_{mn} \Big)_{\text{torsion}} \Big( \prod c_{pq} \cdot p_{pq} \Big)_{\text{nonbond}} \tag{2.3}$$

Taking the logarithm of both sides of equation (2.3), we get:

$$\log\left( p_{\text{protein}} \right) = \log(C) + \sum_{\text{torsion}} \left[ x_{mn} + \log\left( p_{mn} \right) \right] + \sum_{\text{nonbond}} \left[ x_{pq} + \log\left( p_{pq} \right) \right] \tag{2.4}$$

where $x_{mn}$ and $x_{pq}$ are the logarithms of $c_{mn}$ and $c_{pq}$, respectively. A detailed potential database, KECSA2, was utilized to obtain $p_{mn}$ and $p_{pq}$. Below we use the atom pair O-MET and CG-MET as an example of what is involved in calculating the pairwise probability of a given protein. From KECSA2, the probability versus distance function, shown as a red curve in Figure 2.1, can be found. If the distance between O-MET and CG-MET in the protein is 4.5 Å, we first obtain the corresponding probabilities for the distances from 4 Å to 5 Å with an interval of 0.005 Å. Next, we take the logarithm of the average of the 201 probabilities obtained in the previous step and use it to represent the probability at the distance 4.5 Å.
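The averaging step described above can be sketched as follows. Because the KECSA2 probability curves are not reproduced here, `toy_probability` is a hypothetical stand-in (a Gaussian-shaped curve with made-up parameters) for the probability versus distance function of one atom pair. The natural logarithm is also an assumption, since the base is not stated. Only the windowing and averaging logic is meant to mirror the text.

```python
from math import exp, log


def toy_probability(r, peak=4.2, width=0.6):
    """Hypothetical placeholder for a KECSA2 probability-versus-distance curve."""
    return exp(-((r - peak) / width) ** 2)


def window_log_probability(prob_fn, r1, half_width=0.5, step=0.005):
    """Log of the average probability on the grid r1 - 0.5, ..., r1 + 0.5 (in Å)."""
    n_points = int(round(2 * half_width / step)) + 1  # 201 points for the defaults
    grid = [r1 - half_width + k * step for k in range(n_points)]
    total = sum(prob_fn(r) for r in grid)
    # log(average) = log(sum) - log(number of points)
    return log(total) - log(n_points)


descriptor = window_log_probability(toy_probability, 4.5)
print(descriptor)
```

Averaging over a window rather than reading a single value makes the descriptor less sensitive to small changes in the measured distance.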
Equation (2.5) shows the general way to obtain the probability of atom pair A-B at distance $r_1$, where $\text{KECSA2}_{A-B}$ is the potential function of atom pair A-B obtained from the KECSA2 potential database and $r_{ABi}$ is a distance between $r_1 - 0.5$ Å and $r_1 + 0.5$ Å sampled with an interval of 0.005 Å.

$$p_{A-B}(r_1) = \log\left[ \sum_{r_1 - 0.5}^{r_1 + 0.5} \text{KECSA2}_{A-B}\left( r_{ABi} \right) \right] - \log(201) \tag{2.5}$$

Using equation (2.5), the probability for each atom pair present in the protein can be calculated; the probabilities of identical atom pairs are summed to yield the final probability. In this way, the probability list for each protein examined can be generated.

2.2 Random Forest model

The traditional native structure recognition problem is to detect the native or most native-like structure in a collection of decoys. Many different scoring functions have been described [2, 10-40, 47-53] that attempt to address this problem. At first glance, it is hard to use unbalanced decoy sets directly as the training data set. Yet a balanced data structure can be generated if the decoy set is replaced with a "comparison" data set. Instead of training an ML model that focuses on directly finding the native structure among hundreds of decoys, we can create an ML model that accurately distinguishes between native and decoy structures. In a "comparison" data set for decoy detection, the native structure should have the highest probability, which means:

$$\log\left( p_{\text{native}} \right) - \log\left( p_{\text{decoy}} \right) > 0$$
$$\log\left( p_{\text{decoy}} \right) - \log\left( p_{\text{native}} \right) < 0 \tag{2.6}$$

Figure 2.2 shows a detailed workflow for our protocol. If a decoy set consists of one native structure and $m$ decoys, an atom pairwise descriptor (probability) can be built for each structure as described above. Each descriptor set contains 16029 elements in total (KECSA2 has 2001 torsion atom pairs and 14028 nonbonded atom pairs, yielding 16029 = 2001 + 14028).
The descriptor sets are labeled "Descriptor vector" in Figure 2.2. Next, the descriptor vector of the native structure minus the vector of each decoy is assigned to class "1", which means "more stable than", since the native structure is always more stable than the decoys; the descriptor vector of each decoy minus the vector of the native structure is assigned to class "0", which means "less stable than". The resulting descriptors are labeled "final descriptor vector" in Figure 2.2. In this way, equal numbers of class "0" and class "1" members are generated, which is an ideal situation for classification. Hence, an RF model can be trained on those two classes. With this classification scheme, the RF model can compare the relative probabilities of two proteins with the same sequence. A final descriptor vector is generated as the descriptor vector of the first protein minus that of the second, and the RF model is used to predict the class of that final descriptor vector. If the prediction from the RF model is "1", the first protein is "more stable than" the second one, and if the prediction is "0", the first protein is "less stable than" the second one.

Constructing an RF model that can accurately differentiate native and decoy structures is not enough. In a native recognition blind test, a ranking of all structures has to be generated in order to identify the native structure. Thus, the RF model needs to be used to obtain the ranking list for a decoy set. Figure 2.3 gives the protocol used to obtain the ranking of a decoy set with $n$ structures. First, the probability descriptor of each protein structure is built using the KECSA2 database. Second, a table is obtained for each structure by subtracting the probability vectors of all the other structures from the probability descriptor of that structure.
Then, the RF model is used to predict the class of each column in all tables; in other words, the RF model is used to "compare" two structures. Finally, a row of length $n - 1$ is generated for each structure. The value of each column in the resulting row is either "0" or "1" and represents the result of comparing that structure with one of the other structures. The sum of the resulting row is defined as the "score", which is the number of other structures that the corresponding structure is predicted to be more stable than. In this way, the score of each structure is generated, which in turn creates a ranking list.

2.3 Decoy sets

The decoy sets we used for protein-folding systems include multiple decoy sets from the Decoys 'R' Us collection (http://compbio.buffalo.edu/dd/download.shtml): the 4state_reduced, fisa, fisa_casp3, hg_structal, ig_structal, ig_structal_hires, lattice_ssfit, lmds, and lmds_v2 decoy sets. The MOULDER decoy set was downloaded from https://salilab.org/decoys/; the I-TASSER decoy set-II was obtained from https://zhanglab.ccmb.med.umich.edu/decoys/decoy2.html; and the ROSETTA all-atom decoy set was obtained from https://zenodo.org/record/48780#.WvtCA63MzLF. Our RF model for protein-folding systems was compared to the following potentials designed for decoy detection: KECSA2, GOAP [2], DFIRE [40], dDFIRE [37, 43], and RWplus [39]. The programs for these methods were downloaded from the corresponding authors' websites.

For protein-ligand systems, 191 of the 195 systems in CASF-2013 [112] were used; the remaining four were excluded because of formatting issues with our program. CASF-2013 [112], the "Comparative Assessment of Scoring Functions", includes data sets for testing the scoring, docking, screening, and ranking powers of scoring functions. Here, we only used the data sets that were designed to test the docking power of scoring functions.
The decoy ligand binding poses were prepared with three popular molecular docking programs: GOLD (v5), Surflex-Dock as implemented in SYBYL (v8.1), and the docking module built into MOE (v2011). These three programs use different algorithms for ligand pose sampling; therefore, the resulting decoy set is more complete and avoids the bias inherent in using only one program. In total, we used 191 protein-ligand systems, 15802 ligand decoy poses, and 31604 native-decoy comparisons.

2.4 Structure preparation

All protein structures (both native and decoys) for protein-folding systems were converted into their biological oligomerization state and prepared with the Protein Preparation Wizard, REF which adds missing atoms, optimizes the H-bond network, and performs energy minimization to clean up the structures for subsequent calculations. The decoy sets can be found at https://github.com/JunPei000/protein_folding-decoy-set. Ligand pose structures for protein-ligand systems were obtained directly from CASF-2013 [112].

2.5 Potential functions

The KECSA2 potential function set was used for the first project, on protein-folding pose selection. KECSA2 was developed from the KECSA potential function set. KECSA REF is a potential database originally designed for protein-ligand systems; we applied the same methodology used to derive KECSA to protein structures to generate KECSA2, a potential database for protein systems only. The detailed derivation and parameters for KECSA can be found in reference 91. PDBbind v2014 [113] was used as the source of protein crystal structures, and two criteria were used to filter these structures: (1) protein structures with a resolution better than 2.5 Å were selected; (2) metal ions and residues within 4 Å of the metal ions were deleted. After filtering, 9606 protein crystal structures were selected as the protein structure source.
A detailed atom type definition was used in the KECSA2 potential; in other words, every atom type represents a specific heavy atom in the twenty naturally occurring amino acids. For instance, "CA_ALA" corresponds to the alpha carbon in alanine. In total, there are 167 atom types in KECSA2. The methodology of KECSA was used to construct the reference state and remove ideal-gas contributions. Finally, 2001 torsion and 14028 nonbond atom pairwise interactions were generated. The following function was used to describe the nonbond interactions between each atom pair:

$$E_{AB}(r_i) = \varepsilon_1 \left( \frac{\sigma}{r_i} \right)^{\alpha} - \varepsilon_2 \left( \frac{\sigma}{r_i} \right)^{\beta} \tag{2.7}$$

The five parameters in equation (2.7), $\varepsilon_1$, $\varepsilon_2$, $\sigma$, $\alpha$, and $\beta$, for each atom pair in KECSA2 can be found at https://github.com/JunPei000/protein_folding-decoy-set.

For protein-ligand systems, we used the GARF [114] potential to calculate the pairwise probabilities for each protein-ligand complex. GARF is a potential database developed by our group. It employs a graphical-model-based approach with Bayesian field theory to construct atom pairwise potential functions. There are 20 atom types for the protein atoms and 24 atom types for the ligands. All definitions of the atom types are listed in Table 2.1. Further details regarding GARF can be found in the original article. [114]

The force field parameter sets ff94 [120] and ff14SB [121] from Amber were also used for protein-folding pose selection. For ff94, atom types with their corresponding charges were obtained from the file "all_amino94.lib" in the directory "$AMBERHOME/dat/leap/lib/". $R_{min}$, $\varepsilon$, and the torsion-related parameters ($V_n$, $n$, and $\gamma$) were taken from the file "parm94.dat" in the directory "$AMBERHOME/dat/leap/parm/". For ff14SB, atom types with their corresponding charges were obtained from the file "amino12.lib" in the directory "$AMBERHOME/dat/leap/lib/". The $R_{min}$ and $\varepsilon$ values for each atom type were taken from the file "parm10.dat" in the directory "$AMBERHOME/dat/leap/parm/".
Torsion-related parameters ($V_n$, $n$, and $\gamma$) were taken from the file "frcmod.ff14SB" in the directory "$AMBERHOME/dat/leap/parm/".

2.6 Machine learning and validation

The sklearn.ensemble.RandomForestClassifier function from Scikit-learn was used to create the proposed classification model. [115] One training-testing iteration includes:
(1) The whole data set is randomly split into two parts, 80% as the training set and 20% as the test set.
(2) A grid search with five-fold cross validation is performed on the training set to identify the best set of hyperparameters for the RF model.
(3) The RF model with the best set of hyperparameters is then validated on the test set.
Although the data set is split randomly into training and test sets, any single split can still introduce a bias. In this work, ten independent iterations were performed on the combined decoy set to avoid bias from the data partitioning scheme.

Accuracy, a typical evaluation metric for ML classifiers, was used to evaluate the performance of the RF models. An accuracy value is calculated from a "confusion matrix", which is commonly used in supervised ML. The general format of a confusion matrix is presented in Table 2.2. A confusion matrix contains four values: True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN). TPs are the cases whose predicted class is class 1, the same as their actual class. FPs are the cases predicted as class 1 whose actual class is class 0. FNs are the cases predicted as class 0 whose actual class is class 1. TNs are the cases whose predicted class is class 0, the same as their actual class. Accuracy is calculated from these four numbers using:

$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{2.8}$$

In many cases, accuracy alone cannot be used to judge the performance of an ML classifier.
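Equation (2.8) can be illustrated with a short sketch that counts the four entries of the confusion matrix and computes the accuracy from them. The labels below are hypothetical and only stand in for a small, balanced set of native-decoy comparisons; the function names are likewise illustrative.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN for binary labels, with class 1 as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of correct predictions: (TP + TN) / (TP + FP + FN + TN)."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical labels for a balanced set of eight comparisons.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts, accuracy(*counts))
```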
For example, consider a data set with 90 positive samples and 10 negative ones. A naive classifier that predicts all samples as positive has an accuracy of 0.90, yet it is obvious that such a classifier cannot provide reliable predictions. In this work, accuracy is an appropriate measure of the performance of the RF models because the data set is balanced (the numbers of positive and negative samples are the same). In addition, ten accuracy values are obtained from ten independent RF models built with different data partitioning schemes, and the highest, lowest, and averaged accuracy values were used to represent the general performance of the RF models.

CHAPTER 3: RANDOM FOREST MODEL WITH KECSA2 FOR PROTEIN-FOLDING POSE SELECTION

3.1 Accuracy of individual decoy set training

The most important characteristic of the resulting scoring function is its ability to differentiate the native structure from decoys. Table 3.1 shows the accuracy of both the RF models and the traditional scoring functions. Since ten cycles of independent training and testing were performed for each decoy set, the highest, lowest, and averaged accuracies were used to represent the general performance of the RF models on that specific decoy set. In this way, the performance of the RF model can be better interpreted. In general, the RF model shows higher accuracies than all traditional methods for all of the decoy sets. For some decoy sets, such as fisa, ig_structal, lmds_v2, and rosetta, the RF models raised the averaged accuracy to nearly 1.000, and even the lowest accuracy values are higher than the best accuracies of the other scoring functions.
For the other decoy sets, such as 4state_reduced, fisa_casp3, hg_structal, ig_structal_hires, I-TASSER, lattice_ssfit, lmds, and MOULDER, the averaged accuracies of the RF models are similar to the best accuracies of the traditional scoring functions, while the lowest accuracies of the RF models are similar to the accuracies of the other methods. Overall, the RF models show better performance.

3.2 Native ranking of individual decoy set training

Although the accuracies of the RF models are higher than those of the other methods, we still wanted to further validate their performance. Besides accuracy, another important criterion for judging a model or scoring function is whether it ranks the native structure first. Hence, the native structure rankings from the different methods were also compared. Table 3.2 shows the rankings of the native structures from several different models. The highest, lowest, and averaged rankings are shown to assess the performance of the RF models. For the decoy sets fisa, ig_structal, lmds, lmds_v2, and ROSETTA, the RF models substantially improve the native structure ranking over the other models. For the remaining decoy sets, the averaged rankings of the native structures are similar to the best performance of the other scoring functions. It can be concluded that, in general, the RF model ranks the native structure better than the other methods we tested.

3.3 1st decoy RMSD and TM-score of individual decoy set training

The ability to recognize the native structure as the most stable structure is a crucial characteristic of a good model or potential. However, for a model or potential to be useful for guiding conformational sampling, it should also correlate well with structural quality. The RMSD and TM-score were used as two criteria for assessing the quality of each decoy structure. RMSD is the root mean squared deviation of all Cα pairs of the decoy from the native structure.
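For reference, the RMSD definition above can be sketched in a few lines. The sketch assumes that the decoy has already been superimposed on the native structure and that the two Cα coordinate lists are in the same residue order; the coordinates below are made up for illustration.

```python
from math import sqrt


def ca_rmsd(native, decoy):
    """Root mean squared deviation over paired C-alpha coordinates (in Å)."""
    if len(native) != len(decoy):
        raise ValueError("both structures need the same number of C-alpha atoms")
    squared = [
        sum((a - b) ** 2 for a, b in zip(p, q))
        for p, q in zip(native, decoy)
    ]
    return sqrt(sum(squared) / len(squared))


# Made-up, pre-superimposed C-alpha coordinates for a three-residue fragment.
native = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
decoy = [(0.0, 0.0, 0.0), (3.8, 3.0, 0.0), (7.6, 0.0, 4.0)]

print(ca_rmsd(native, decoy))
```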
The TM-score [47] gives large distances a small weight, which makes its magnitude more sensitive to the topology. Table 3.3 and Table 3.4 summarize the results of best model selection for the different methods. Table 3.3 shows the RMSD of the first-ranked decoy for the RF models and for a range of available scoring functions. The RMSD values of the available methods are generally within the range between the lowest and highest RMSD values of the RF models for each decoy set. This means that the performance of those traditional scoring functions is within the confidence range of our RF models. Table 3.4 shows the TM-score of the first-ranked decoy for the RF model and for several other models; these results are similar to what we observed in the RMSD analysis. In each decoy set, the TM-score of the first-ranked decoy is within the range between the lowest and highest TM-scores of the RF models. Considering that the independent training and testing process was done ten times, the range from the lowest to the highest RMSD or TM-score value shows the confidence range of the RF models for each decoy set. In general, the RMSD and TM-score performance of the available models is within the confidence range of the RF models, and the averaged values are similar for the RF models and the models we tested against. In other words, the RF models and the other models performed similarly when selecting the best decoy structure.

3.4 Feature importance analysis for the overall decoy set

It is important to understand whether the better performance of the RF models is due to overfitting, because a large number of descriptors were used. To this end, a feature importance analysis was performed and is shown in Figure 3.1. Based on the analysis, new RF models using only the top 500 features were constructed using the previous procedure. Table 3.5 compares the accuracies obtained with only the top 500 features and with all 16029 features.
In general, the highest, lowest, and averaged accuracy values of the RF models using the top 500 features for each decoy set are similar to the corresponding values of the RF models using all features. Hence, we conclude that the better performance of the RF models with all features is not simply due to overfitting, and that the current method is robust even in the face of potentially non-essential features.

3.5 Comparison of overall RF model with traditional scoring functions

Besides creating RF models for each individual decoy set, combined RF models using all decoy sets were also constructed. This examines the situation in which one might generate decoys using one method and then score them with another. There were 291 individual systems across the 12 decoy sets; once combined, they yielded 235 different protein systems (several proteins overlapped among the decoy sets). In these studies, 80% of the combined data set was used as training data to build the RF models, instead of choosing several specific decoy sets (such as 4state_reduced, fisa, etc.). This was done to ensure that the training and testing data sets covered the same feature space and had the same distribution, that is, that they were independent and identically distributed (IID) [116]. The feature space and distribution of decoys from different decoy sets differ because different models were used to generate those structures. Table 3.6 compares the overall performance of the RF models with that of a number of available potentials. Due to the large number of descriptors, it is impossible to obtain RF models using the entire 16029-feature set. Therefore, based on the importance analysis discussed previously, the top 100 and top 500 features were used to build the overall RF models on the combined decoy sets. First, all RF models, regardless of the number of important features used, provide higher averaged accuracy values than the traditional scoring functions.
Clearly, the RF models outperform the conventional methods in accuracy. Second, the highest rankings of the native structure from the RF models are smaller than the rankings from the other methods, and all of the averaged rankings of the RF models were ~10 or less, which means the RF models can identify the native structure within the top ten structures. Hence, the RF models outperform the other methods on this task. Finally, the RMSD and TM-score values of the conventional potentials are within the corresponding confidence ranges of the RF models, and those values are similar to the averaged RMSDs and TM-scores of the RF models. Both the RMSD and TM-score results suggest that the performance of the RF models is similar to that of the conventional potentials.

3.6 Importance of potential

Based on the previous discussion, it is clear that the RF models with KECSA2 perform best in accuracy and ranking on both the individual and overall decoy sets. This leads directly to an interesting question: does the potential play a significant role in the RF models? Two analyses were done to address the importance of the KECSA2 potential to the performance of the RF model. First, the probability functions of the top 100 and 500 features (atom pairs) were scrambled to test whether the probability functions played a role in the RF model. For example, after scrambling, the probability function of the atom pair O-PRO and N-ALA might be replaced by the probability function of CA-GLY and C-THR. If the KECSA2 potential plays a role in the RF models, the performance with the scrambled probability functions should be worse than with KECSA2. The peak positions in the probability functions represent the most favorable distances between different atom pairs found in the experimental structure database. For example, the peak position of the atom pair O-PRO and N-ALA is 3.04 Å, which means those two atoms are most stable when they form a hydrogen bond at that distance.
However, after scrambling, the peak position might change to 4.51 Å (the peak position of CA-GLY and C-THR), which no longer represents a hydrogen bond. Thus, the scrambled probability function suggests that the atom pair O-PRO and N-ALA is most stable when the two atoms do not form a hydrogen bond. The scrambled probability functions are therefore unphysical. It is unlikely that the RF model, as employed herein, could correct these deficiencies, so the performance of the scrambled probability functions is expected to be worse than that of the original KECSA2.

Second, uniform probability functions (or potential functions) were built for the top 100 and 500 atom pairs to test whether the KECSA2 probability peak heights (or well depths) are important in the RF models. The uniform probability functions have the same peak positions as those found in KECSA2, but all with the same height. In this way, the interaction strength 'bias' of different atom types in KECSA2 is eliminated. If the KECSA2 probability peak heights (or interaction potentials) are significant, the performance of the uniform potential should be worse than that of KECSA2.

The comparison between the original KECSA2 potential, the scrambled potential, and the uniform potential is shown in Table 7. Comparing the original KECSA2 with the scrambled potentials, we find that the accuracy of the models decreased by ~0.15, which is a clear signal that the full KECSA2 potential (well depth and energy minimum) plays a role in the RF models. The comparison between the uniform and original KECSA2 potentials shows how important the r_max component of the KECSA2 potential is in building an effective model. For the RF models with the top 100 features, the averaged, highest, and lowest accuracies based on the original KECSA2 potential are slightly higher than the corresponding accuracies from the RF models based on uniform potentials.
However, if the number of features is increased to 500, the averaged, highest, and lowest accuracies from the RF models based on the original KECSA2 are similar to those from the uniform potentials. This provides strong evidence that only the peak positions in the probability functions are critical in building RF models for native protein structure detection. More importantly, the result also implies that RF models can be used to tune the peak heights in probability functions (or the depths of potential functions) using only the peak position information from protein structures.

3.7 Conclusion

In this work, we utilized a 'comparison' concept to construct RF models on an unbalanced data set. With these RF models, the knowledge-based potential, KECSA2, was refined by assigning different importance factors to the different atom pairs present in the scoring function. The performance of the resulting RF models was assessed with individual and combined decoy sets and compared with the results from conventional models. We find that the RF models perform better in accuracy and native ranking and have similar performance in the RMSD and TM-score tests. In other words, the RF models improved the effectiveness of finding native structures in a set of decoys without compromising the ability to find the best decoy structures. This RF-based refinement can be used not only to improve the performance of KECSA2 but also to refine other atom/residue pair-based potentials. More importantly, we find that only the peak positions in the probability functions play a significant role in constructing the RF models. This result implies that, with peak position information, RF models can be used to construct probability functions (or potential functions) by tuning the peak heights in those functions based on native and decoy protein structures.
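The two control sets from Section 3.6 can be illustrated with a minimal sketch. It assumes each probability function is stored as a list of (distance, probability) samples keyed by atom pair; this storage format, the function names, and the probability values are hypothetical, and only the two peak positions (3.04 Å and 4.51 Å) come from the text. Rescaling each curve is one possible way to impose a common peak height, not necessarily the procedure used in this work.

```python
import random

def peak(curve):
    """Return the (distance, probability) sample with the highest probability."""
    return max(curve, key=lambda point: point[1])

def scramble(functions, seed=0):
    """Reassign the probability curves to randomly chosen atom pairs."""
    rng = random.Random(seed)
    pairs = list(functions)
    curves = [functions[p] for p in pairs]
    rng.shuffle(curves)
    return dict(zip(pairs, curves))

def make_uniform(functions, height=1.0):
    """Keep every peak position but rescale each curve to the same peak height."""
    uniform = {}
    for pair, curve in functions.items():
        scale = height / peak(curve)[1]
        uniform[pair] = [(r, p * scale) for r, p in curve]
    return uniform

# Toy curves: distances in Angstrom; probabilities are made up for illustration.
toy = {
    ("O-PRO", "N-ALA"): [(2.5, 0.1), (3.04, 0.9), (4.0, 0.3)],
    ("CA-GLY", "C-THR"): [(3.5, 0.2), (4.51, 0.5), (5.5, 0.1)],
}
scrambled = scramble(toy, seed=1)
uniform = make_uniform(toy)
```

In the uniform set every curve peaks at the same height while its peak distance is unchanged, so any remaining difference in model performance relative to the original set can be attributed to the peak heights.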

CHAPTER 4: RANDOM FOREST MODEL WITH GARF FOR PROTEIN-LIGAND POSE SELECTION

4.1 Accuracy

The most important goal of a scoring function is to accurately identify the native structure among a plethora of decoy structures. To evaluate the ability of a scoring function to identify the native structure, the concept of accuracy is used in this work. If a decoy set contains 100 decoy structures but only one native, the scoring function is expected to make 200 correct comparisons to identify the native pose. The higher the accuracy of the comparisons, the better the performance of the scoring function. The third column in Table 4.1 compares the accuracies of the RF models and 29 other scoring functions. The averaged, highest, and lowest accuracies of the RF models are 0.953, 0.969, and 0.942, respectively. The averaged accuracy is higher than that of every other scoring function tested, and even the lowest accuracy is still higher than all of the other accuracies. The RF models therefore perform better than all other scoring functions in comparing the ligand native pose with decoy poses.

4.2 Native ranking

Besides accuracy, another criterion for evaluating a scoring function is the ranking of the ligand native pose. In other words, a scoring function is expected to give the ligand native pose the lowest rank. The fourth column in Table 4.1 shows the ligand native pose ranking from each method. The averaged, highest, and lowest ligand native pose rankings from the RF models are 4.49, 5.54, and 3.54, respectively. The confidence interval of the native pose's ranking from the RF models is [3.54, 5.54]. All 29 scoring functions have ligand native rankings higher than the averaged native ranking obtained from the RF models, and these rankings are also larger than the highest native ranking from the RF models.
Thus, it can be concluded that the RF models perform better in selecting the ligand native pose than existing models. If the accuracy values are compared with the ligand native pose rankings, a correlation between the two sets of data can be found: the higher the accuracy, the lower the native pose ranking. The most important goal of a scoring function is to identify the most stable ligand pose (the native pose); therefore, the minimum standard for a scoring function is to correctly compare the native pose with the decoy poses. Using our previous example of a decoy set containing 100 decoy structures and one native pose, we have 200 comparisons between the native and decoy poses. Hence, minimally, the scoring function should make 200 correct comparisons to obtain the native structure. With more correct comparisons, the native pose has a higher chance of being found at a lower rank. For example, if the scoring function makes ten mistakes, the accuracy is around 0.95, and the ligand pose ranking would be ≈ 5.

4.3 Random Forest model with decoy comparison information

Our RF model is focused on identifying the native binding pose of a ligand among all decoy poses. However, it is not effective in identifying the best decoy, because it lacks comparison information between the best decoy structure and the other decoy poses. To include the comparison information between decoy poses in the RF analysis, we made the following assumption: the ligand decoy pose with the lowest RMSD is the most stable decoy structure (the best decoy pose). Figure 4.1 shows the protocol for adding comparisons between the best decoy pose and the other decoy poses.
For example, if a decoy set contains m decoy structures and one native pose, two kinds of comparisons were considered when the model was trained: (1) comparisons between the native binding pose and all decoy poses, for a total of 2m comparisons (m comparisons for each class); (2) without the native binding pose, comparisons between the best decoy pose and all other decoy poses, for a total of 2(m − 1) comparisons. RF models trained on these comparisons were then used to select the best decoy through the protocol discussed in chapter 2.

4.4 1st decoy RMSD and TM-score

Besides accuracy and native pose ranking, another criterion, the RMSD of the best decoy structure, is used to judge the performance of a scoring function. The best ligand decoy pose refers to the decoy structure that a scoring function selects, among all decoy poses, as the structure most similar to the native pose. Scores generated by a scoring function are expected to correlate with the quality, or native-likeness, of a structure. The RMSD between the ligand native binding pose and a decoy binding pose is often used to represent the quality of that decoy pose. If the RMSD is below a predefined cutoff (RMSD < 2 Å), the decoy binding pose is considered "native-like". The last column in Table 4.1 shows the RMSD values from each of the scoring functions. The averaged, highest, and lowest RMSD values from the RF models are 3.87 Å, 4.47 Å, and 3.38 Å, respectively. The confidence interval for the RF models is [3.38, 4.47]. There are 26 scoring functions that can identify ligand decoy poses with RMSDs lower than 3.38 Å, and two scoring functions provide RMSD values within the confidence range of the RF models. In general, 28 scoring functions perform better than our initial RF models in selecting the best decoy structure.
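The comparison counts of Section 4.3 and the accuracy arithmetic of Section 4.2 can be checked with a short sketch. The pair ordering and the 1/0 class labels are hypothetical encodings chosen for illustration (the actual descriptor construction is described in chapter 2), and the RMSD values are made up.

```python
def labeled_comparisons(reference, others):
    """Build 2 * len(others) labeled pairs, one per class for each other pose.

    Label 1 means the first pose of the pair is the more stable one,
    label 0 means the second pose is the more stable one.
    """
    pairs = []
    for pose in others:
        pairs.append(((reference, pose), 1))
        pairs.append(((pose, reference), 0))
    return pairs

def best_decoy_comparisons(decoys, rmsd):
    """Compare the lowest-RMSD decoy with all other decoys: 2(m - 1) pairs."""
    best = min(decoys, key=lambda d: rmsd[d])
    others = [d for d in decoys if d != best]
    return labeled_comparisons(best, others)

def accuracy(n_comparisons, n_mistakes):
    """Fraction of comparisons that were made correctly."""
    return (n_comparisons - n_mistakes) / n_comparisons

m = 100
decoys = [f"decoy_{i}" for i in range(m)]
rmsd = {d: 1.0 + 0.1 * i for i, d in enumerate(decoys)}  # made-up RMSD values

native_pairs = labeled_comparisons("native", decoys)   # 2m = 200 comparisons
decoy_pairs = best_decoy_comparisons(decoys, rmsd)     # 2(m - 1) = 198 comparisons
print(len(native_pairs), len(decoy_pairs), accuracy(200, 10))
```

With m = 100 this gives 200 native comparisons and 198 best-decoy comparisons, and ten mistakes out of 200 comparisons correspond to an accuracy of 0.95, as in the example of Section 4.2.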
The RF models used in Table 4.1 contain only comparisons between native and decoy poses; comparison information between decoy poses was not considered when the models were trained. Hence, we conclude that these RF models do not have enough information to find the "best" ligand decoy poses among a large number of decoy structures. To improve the ability of our RF models to identify the best decoy structure, comparison information between decoy poses should be added when training the RF models. Here, we assume that, among all decoy poses, the pose with the lowest RMSD is perhaps the most stable of all the decoys because it is the most "native-like". With this assumption, comparisons between the best decoy and the other decoy poses can be generated. Instead of using only comparisons between the native and all decoy poses, the new training set also included comparisons between the best decoy pose and all other decoy poses. Table 4.2 shows the results when different numbers of decoy structures were identified as the most stable poses. Four sets of training data were used: (1) only comparisons between the native and decoy poses; (2) comparisons between the native and decoy poses, and between the decoy structure with the lowest RMSD and all other decoy poses; (3) comparisons between the native and decoy poses, and between the two lowest-RMSD decoy poses and all other decoy poses; (4) comparisons between the native and decoy poses, and between the three lowest-RMSD decoy poses and all other decoy poses. Table 4.2 gives the overall performance in terms of accuracy, ligand native pose ranking, and best decoy RMSD. With the inclusion of decoy structures in the training set, the accuracy of the RF models and the ranking of the ligand native binding pose were slightly negatively affected. On the other hand, the best decoy pose's RMSD dropped dramatically.
The averaged, highest, and lowest RMSDs of the best decoy pose from RF models trained on the data set that includes only comparisons between native and decoy binding poses are 3.87 Å, 4.47 Å, and 3.38 Å, respectively. The corresponding values from RF models that include the three lowest-RMSD decoy structures are 2.27 Å, 2.44 Å, and 1.73 Å, respectively (the confidence interval is [1.73, 2.44]). By including low-RMSD decoy structure comparisons, we obtain RF models (see Table 4.1) that give better first decoy RMSDs than 13 scoring functions; a further 15 scoring functions have first decoy RMSDs within the confidence interval of the RF model, and only one scoring function gave an RMSD smaller than 1.73 Å. Hence, we conclude that the overall performance (i.e., accuracy, native rank, and low-RMSD first decoy) of the RF models can be improved by including lowest-RMSD decoy comparisons in the fitting of the model.

Based on the previous discussion, it is clear that with a higher accuracy, a scoring function can give the native binding pose a lower rank. If the accuracy values are compared with the RMSD of the best decoy, however, the two sets of data do not appear to correlate. Some scoring functions are better at selecting the native pose but provide a relatively large RMSD value, whereas other scoring functions do a better job of selecting the best decoy structure but cannot identify the native binding pose. This leads to a basic philosophical question: which is more important, accuracy or RMSD? Both should be important in the limit where all decoy poses can be obtained. However, it is almost impossible to generate all relevant decoy poses using contemporary approaches. In our opinion, the basic requirement for a scoring function is that it can accurately identify the native pose.
To some degree, RMSD might be useful in judging whether a structure has a low free energy, but a decoy structure can clearly have a high free energy while having a low RMSD value. Hence, if two scoring functions were compared solely on identifying the best decoy, and one gives an RMSD larger than 2 Å while the other gives an RMSD smaller than 2 Å, it is unclear, at least to us, how to judge which one is better. On the other hand, accuracy, the factor that represents the performance of a scoring function when comparing native and decoy poses, is a clear standard. The explicit hypothesis we make when docking and scoring is that the native structure always has a lower free energy than the decoys. When comparing two scoring functions, the better scoring function should be the one with the higher accuracy. Put another way, when creating, for example, ML models for a self-driving car, what is more important: accurately identifying an obstacle or being close to identifying an obstacle? Therefore, we believe that accuracy is the more important criterion.

4.5 Uniform probability function

The RF models perform better than all other scoring functions on accuracy and native binding pose ranking. It is interesting to consider whether the GARF potential is critical in these RF models. Two tests were set up to assess the importance of the GARF potential database. First, a scrambled probability function set was constructed based on GARF; it was followed by a uniform probability function set, used to test whether GARF's peak position or peak height is more critical. The scrambled probability function set was generated by randomly mixing up the atom pairs in the GARF potential database. Taking the 480 atom pairwise potential functions in GARF, we randomly scrambled the atom pair names.
For example, before scrambling, one probability function represented the interaction between N and O.co2, while after scrambling, the same probability function might be used to describe the interaction between C and F. Hence, the scrambled probability function set is physically unrealistic. Based on the scrambled probability function set, ten independent RF models were constructed following the same procedure described in the methods section. Since the scrambled function set is physically unrealistic, the performance of these RF models is expected to be worse than that of models using the original GARF potential.

Two kinds of information are embedded in the GARF potential: peak positions (well positions) and peak heights (well depths). Which is more important, or are both important? To address this fundamental question, a uniform probability function set was built. Uniform probability functions share the same peak positions as the original GARF potential, but all peak heights are set to a constant value, eliminating the impact of the prior peak heights. If the RF models based on the uniform probability function set perform similarly to the models obtained with the original GARF, peak positions are more important than peak heights. Alternatively, if these RF models perform more poorly than the original models, peak height is significant.

Table 4.3 compares the accuracies of RF models based on the original, scrambled, and uniform GARF potential databases. If we compare the accuracy values of RF models based on the original and scrambled GARF, it is clear that the averaged, highest, and lowest accuracies of the RF models with scrambled probability functions are lower. The accuracy did not drop as much as we have seen in the past [51] because the GARF potential contains only intermolecular interactions found in protein-ligand systems.
Moreover, the 480 peak positions found in GARF are all in the range [2.5, 5.1], with 355 peak positions in the range [3.4, 4.4] (see Table 4.4). Therefore, the peak positions in the scrambled probability function set might be similar to the original positions in GARF. It is reasonable to expect that the accuracy of the RF models based on the scrambled probability function set is lower than that of the original models. On the other hand, if we compare the accuracy values from the uniform probability function set with those provided by the original RF models, the averaged accuracy values from the two sets of models are the same. This further supports the notion that peak position is more important than well depth in a given potential function used to build an RF model [117].

4.6 Influence of training set size

In supervised machine learning, especially when the data set does not contain a large number of data points, it is common to split the data set into a training set (80% of the total; with five-fold cross validation within the training data, each cross-validation set is 16% of the total) and a test set (20% of the total). The 80:20 ratio works well in most cases, but we wanted to test whether the RF models can achieve a similar accuracy with a smaller training set. Table 4.5 shows the accuracies of RF models based on the original and uniform GARF databases trained on data sets of differing sizes. Figure 4.2 is the corresponding plot of the data in Table 4.5. The blue and orange lines in Figure 4.2 represent the performance of RF models based on the original and uniform GARF databases, respectively. Both lines show that the accuracy of the RF models generally increases with the size of the training set. The accuracy values converge for training sets >60%, and the RF models based on the original and uniform GARF potentials follow the same trend.
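The 80:20 split with five-fold cross validation described above can be sketched as follows. This is a minimal illustration on item indices; the function name, the shuffling seed, and the choice of what counts as one item (for example, one protein-ligand complex) are assumptions, not details taken from this work.

```python
import random

def split_indices(n_items, test_fraction=0.2, n_folds=5, seed=0):
    """Shuffle item indices, hold out a test set, and cut the rest into folds."""
    rng = random.Random(seed)
    indices = list(range(n_items))
    rng.shuffle(indices)
    n_test = round(n_items * test_fraction)
    test = indices[:n_test]
    train = indices[n_test:]
    # Each fold serves once as the cross-validation set during training.
    folds = [train[k::n_folds] for k in range(n_folds)]
    return train, test, folds

train, test, folds = split_indices(1000)
print(len(train), len(test), [len(fold) for fold in folds])
```

For 1000 items this yields 800 training items, 200 test items, and five folds of 160 items, so each cross-validation set is 16% of the total. Lowering the training fraction in such a split is one way to set up the training-set-size test reported in Table 4.5.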

4.7 Conclusion

In this work, we constructed RF models on unbalanced data sets, utilizing the 'comparison' concept to identify native protein-ligand poses. Using RF, the GARF potential database was refined by assigning different importance factors to each atom pair in that potential. The resulting RF models were tested on a well-known protein-ligand decoy set, CASF-2013 [5], which includes decoy structures generated by three docking packages using different docking algorithms. The results suggest that our RF models outperformed other scoring functions on accuracy and native binding pose selection. By including comparisons between the best decoy pose and the remaining decoy poses, the RMSD of the best decoy was reduced. We also tested the importance of GARF in creating the corresponding RF models. The use of a scrambled GARF probability function set to build an RF model provided evidence for the significance of the GARF potential, while the uniform GARF potential indicated that peak position (or well position) is most relevant in building an RF model. Finally, we tested the influence of training set size, which showed that the accuracy converged when ~60% of the data set was used in building the RF model. Overall, we showed that potential-function-based RF models perform at a high level when identifying a native pose from a collection of decoys.

CHAPTER 5: COMBINE RANDOM FOREST WITH AMBER FORCE FIELD FOR PROTEIN-FOLDING POSE SELECTION

5.1 Definitions of atom types, torsion types, and nonbond types

For protein systems, Amber has its own definition of atom types. It uses three parameters (atomic charge, potential well depth ε, and van der Waals radius) to describe atoms in various chemical environments. Atoms sharing the same values for ε
and van der Waals radii were defined as one atom type; however, atoms belonging to the same atom type by this criterion might have very different charges due to different chemical environments. For example, the gamma carbon (defined as 'CG2' in pdb format and 'CT' in Amber atom type format) in isoleucine has a charge of -0.3204, whereas the delta carbon (defined as 'CD' in pdb format and 'CT' in Amber atom type format) in proline has a positive charge of 0.0192. Hence, atoms with different charge values but the same Amber atom type had to be separated into different atom types. In this work, charge, ε, and van der Waals radius were used to define different atom types. Table 5.1 and Table 5.2 summarize the names of our atom types and their corresponding values of charge, ε, and van der Waals radius in the ff94 and ff14SB force fields, respectively. In total, there are 191 detailed atom types for both the ff94 and ff14SB force fields.

Definitions of torsion types in Amber are based on Amber atom types; hence, with detailed atom types, the representations of torsion interactions must also be redefined. Amber primarily uses four atoms to define a torsion type. For example, X-CT-CT-X (where X represents any atom) represents torsion interactions between any two atoms connected by two CT atoms. This definition reduces the total number of torsion interactions; however, it lumps different and unique torsion information together. Here, it is necessary to split the Amber torsion types based on our more detailed atom type definitions. For example, we split the Amber torsion type "H-N-C-O" (ff94) into 14 torsion types: "H-0-N-C-O-0", "H-0-N-C-O-1", "H-0-N-C-O-3", "H-0-N-C-O-5", "H-1-N-C-O-0", "H-1-N-C-O-1", "H-1-N-C-O-3", "H-1-N-C-O-5", "H-4-N-C-O-2", "H-5-N-C-O-0", "H-5-N-C-O-1", "H-5-N-C-O-3", "H-5-N-C-O-5", and "H-6-N-C-O-4".
In this way, torsion interactions in different chemical environments can be classified as different torsion types. For instance, torsion type "H-4-N-C-O-2" represents the torsion interaction between the HD and OD1 atoms in the ASN sidechain. Besides torsion angles, nonbond interactions are also redefined based on the available atom types. For example, "C-0_C-1" represents the nonbond interaction between C-0 and C-1 atoms, and so on. Briefly, in this work, for ff94 there are 191 detailed atom types, 1,143 torsion types, and 18,336 nonbond types; for ff14SB there are 191 detailed atom types, 1,175 torsion types, and 18,336 nonbond types.

5.2 From Amber parameters to descriptors

In Amber, the total energy of a protein structure can be calculated as follows [118]:

\[
E_{total} = \sum_{bonds} k_b (r - r_0)^2 + \sum_{angles} k_\theta (\theta - \theta_0)^2 + \sum_{dihedrals} V_n \left[ 1 + \cos(n\phi - \gamma) \right] + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left[ \epsilon_{ij} \left( \frac{R_{min}}{R_{ij}} \right)^{12} - 2 \epsilon_{ij} \left( \frac{R_{min}}{R_{ij}} \right)^{6} + \frac{q_i q_j}{R_{ij}} \right] \tag{5.1}
\]

where k_b and k_θ are force constants, and r_0 and θ_0 are the equilibrium bond lengths and bond angles, respectively. V_n, n, and γ are the torsion barrier, periodicity, and phase, respectively. R_min is the sum of the van der Waals radii of atoms i and j. ε_ij is the depth of the potential well for the interaction between atoms i and j. q_i and q_j are the charges on atoms i and j. R_ij is the distance between atoms i and j. Here, we make the first assumption that all bond and angle interactions are identical in the native and decoy protein structures. Hence, the total energy of a structure can be simplified to:

\[
E_{total} = Cons + \sum_{dihedrals} V_n \left[ 1 + \cos(n\phi - \gamma) \right] + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left[ \epsilon_{ij} \left( \frac{R_{min}}{R_{ij}} \right)^{12} - 2 \epsilon_{ij} \left( \frac{R_{min}}{R_{ij}} \right)^{6} + \frac{q_i q_j}{R_{ij}} \right] \tag{5.2}
\]

where Cons is the total energy of the bond and angle interactions.
With parameters extracted from Amber, the energy of one specific torsion interaction, which is the sum of the dihedral, 1_4 van der Waals, and 1_4 electrostatics energies, can be calculated using equation (5.3):

\[
E_{A-B-C-D} = V_{A-B-C-D} \left[ 1 + \cos(n_{A-B-C-D}\,\phi - \gamma_{A-B-C-D}) \right] + \frac{1}{2.0} \times \left[ \epsilon_{AD} \left( \frac{R_{min,AD}}{R_{AD}} \right)^{12} - 2 \epsilon_{AD} \left( \frac{R_{min,AD}}{R_{AD}} \right)^{6} \right] + \frac{1}{1.2} \times \frac{q_A q_D}{R_{AD}} \tag{5.3}
\]

\[
E_{A-B} = \left[ \epsilon_{AB} \left( \frac{R_{min,AB}}{R_{AB}} \right)^{12} - 2 \epsilon_{AB} \left( \frac{R_{min,AB}}{R_{AB}} \right)^{6} \right] + \frac{q_A q_B}{R_{AB}} \tag{5.4}
\]

\[
E_{A-B-C-D} = V_{A-B-C-D} \left[ 1 + \cos(n_{A-B-C-D}\,\phi - \gamma_{A-B-C-D}) \right] \tag{5.5}
\]

In the Amber software package, the three parts of equation (5.3) are calculated separately and belong to different energy components: E_dihedral, E_1-4-vdW, and E_1-4-EEL. These are the dihedral, van der Waals, and electrostatics energies between terminal atoms (for example, atoms A and D in torsion A-B-C-D), respectively. Here, instead of calculating the torsion energies separately (using three energy components), only one value was generated to represent a torsion interaction, based on equation (5.3). In equation (5.3), 2.0 and 1.2 are the scale factors for the energies of the 1-4 van der Waals and electrostatics interactions, respectively [118]. Similarly, equation (5.4) is used to obtain the pair wise nonbond energies, which include both van der Waals and electrostatics energies. Besides the general torsion and nonbond interactions described by equations (5.3) and (5.4), Amber considers another torsion interaction. Out-of-plane terms, also referred to as improper torsions, represent "branched" four-atom systems. In these branched systems, there are three bonds between the central atom and the other three atoms, and the central atom is forced into the plane of the other three. Equation (5.5) was used to calculate the torsion energies of these branched systems. In out-of-plane systems, the torsion energies do not contain van der Waals and electrostatics energies; hence, the out-of-plane torsion definitions in this work are the same as in Amber.
In this way, the total energy of a protein structure can be simplified as:

\[
E_{total} = Cons + \sum_{torsion} e_{ij} + \sum_{nonbond} e_{kl} \tag{5.6}
\]

where e_ij and e_kl are the pair wise energies for the torsion and nonbond interactions, respectively. It is known that assigning different importance factors to emphasize the more significant pair wise interactions can efficiently improve a scoring function's ability to identify the native protein structure [117]. Therefore, equation (5.7) was used to represent the final score (S_total) of a given protein structure:

\[
S_{total} = C + \sum_{torsion} w_{ij} \times e_{ij} + \sum_{nonbond} w_{kl} \times e_{kl} \tag{5.7}
\]

where C is a constant, and w_ij and w_kl are the pair wise importance factors, which are to-be-determined parameters, for the torsion and nonbond interactions, respectively. In this way, a three dimensional protein structure can be represented using a series of atom pair wise energies. Specifically, with the known numbers of torsion and nonbond interaction types in this work, the total score of a protein structure can be obtained using:

\[
S_{total} = C + \sum_{tor=1}^{1143\ \mathrm{or}\ 1175} w_{ij} \times e_{ij} + \sum_{nonb=1}^{18336} w_{kl} \times e_{kl} \tag{5.8}
\]

In one protein structure, a specific torsion / nonbond atom pair might occur several times. For each torsion / nonbond atom pair, e_ij / e_kl in equation (5.8) is the sum of the energies of all occurrences of that atom pair in the protein structure. For example, if torsion "H-5-N-C-O-5" is present three times in a protein structure, three corresponding energies are obtained using equation (5.3). The sum of these three energies is assigned as the energy of torsion "H-5-N-C-O-5" in equation (5.8). In this way, the total score S_total of any protein structure can be represented as the sum of 19479 (19479 = 1143 + 18336, ff94) or 19511 (19511 = 1175 + 18336, ff14SB) pair wise energies. With equation (5.8), a three dimensional structure can be represented as a one dimensional vector.
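The encoding described by equations (5.3), (5.4), and (5.8) can be sketched in a few lines. This is a hedged illustration rather than the FFENCODER implementation: the function names are hypothetical, the parameter values are made up, and the unit-conversion constant that Amber applies to the electrostatic term is omitted for brevity. Only the two type names and the scale factors 2.0 and 1.2 are taken from the text.

```python
import math
from collections import defaultdict

def pair_terms(eps, r_min, q_a, q_b, r):
    """Return the (van der Waals, electrostatic) energies of one atom pair."""
    ratio6 = (r_min / r) ** 6
    vdw = eps * (ratio6 ** 2 - 2.0 * ratio6)
    eel = q_a * q_b / r  # unit-conversion constant omitted in this sketch
    return vdw, eel

def torsion_energy(v_n, n, gamma, phi, eps, r_min, q_a, q_d, r_ad):
    """Equation (5.3): dihedral term plus scaled 1-4 vdW and electrostatics."""
    vdw, eel = pair_terms(eps, r_min, q_a, q_d, r_ad)
    return v_n * (1.0 + math.cos(n * phi - gamma)) + vdw / 2.0 + eel / 1.2

def nonbond_energy(eps, r_min, q_a, q_b, r):
    """Equation (5.4): unscaled vdW plus electrostatics for one atom pair."""
    vdw, eel = pair_terms(eps, r_min, q_a, q_b, r)
    return vdw + eel

def encode(interactions):
    """Sum the energies of all occurrences of each torsion / nonbond type."""
    vector = defaultdict(float)
    for type_name, energy in interactions:
        vector[type_name] += energy
    return dict(vector)

def score(vector, weights, constant=0.0):
    """Equation (5.8): weighted sum of the per-type energies plus a constant."""
    return constant + sum(weights[t] * e for t, e in vector.items())

# Made-up parameters for illustration only (not Amber values).
tors = torsion_energy(v_n=1.0, n=2, gamma=0.0, phi=0.0,
                      eps=0.1, r_min=3.8, q_a=0.0, q_d=0.0, r_ad=3.8)
interactions = [("H-5-N-C-O-5", tors)] * 3 + [
    ("C-0_C-1", nonbond_energy(0.1, 3.8, 0.0, 0.0, 3.8)),
]
vector = encode(interactions)
total = score(vector, {"H-5-N-C-O-5": 1.0, "C-0_C-1": 1.0})
```

Here the torsion type that occurs three times contributes a single summed entry to the vector, which mirrors the "H-5-N-C-O-5" example above. With all importance factors set to 1 and the constant set to Cons, the score reduces to the energy of equation (5.6).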
5.3 Structures for encoding validation

The first and most important task is to make sure that the calculations from FFENCODER and Amber are consistent. FFENCODER makes five assumptions: (1) only the 20 common amino acids are considered; (2) terminal amino acids are not treated differently; (3) amino acids HIP, HIE, and HID are all treated as HID; (4) there is no energy cutoff when the repulsion is too strong; (5) no metal ions are considered in the calculation. Assumptions (1)-(3) were made to control the total number of atom pairs used in the RF models.

In order to test whether FFENCODER is consistent with the Amber package, two sets of structures were constructed. The first test set contains 20 different amino acid structures and tests whether torsion / nonbond pairs that exist within the same amino acid are encoded correctly. The second test set, which consists of 210 dipeptide structures (all possible connections between those 20 amino acids), tests whether torsion / nonbond atom pairs that span different amino acids are encoded correctly. The LEaP module from AmberTools 18 [118] was used to generate the topologies of the single amino acids and dipeptides. The systems were first minimized in implicit solvent described by the generalized Born model. [119] 25000 cycles of steepest descent minimization followed by 25000 cycles of conjugate gradient minimization were performed. The energies of the minimized single amino acids / dipeptides were then calculated without solvent by setting the number of minimization steps to zero. All simulations were performed with pmemd.cuda from the Amber18 package. [118] No periodicity was applied, and the cutoff was set to 9999 Å to include all long-range interactions.

5.4 Encoding validation

In order to confirm that the force field parameters were correctly encoded, the results calculated with FFENCODER and Amber were compared.
Because the RF algorithm described in the method section uses atom pairwise energies as input, it is necessary to guarantee that all of them are encoded correctly. Because of the complexity of protein structures, it is hard to compare all pairwise energies in a protein structure. Here, two sets of relatively simple structures were used as test structures instead of whole proteins. The first test set contains the 20 common amino acids, and the second one contains 210 (21 × 20/2) dipeptides. Although the structures in these two test sets are relatively simple, they cover most of the parameter space of the Amber force fields. As a compromise, five energy components were compared instead of all pairwise energies: the total energies of the dihedral, 1-4 van der Waals (1_4_vdW), 1-4 electrostatics (1_4_EEL), van der Waals (vdW), and electrostatics (EEL) interactions. These five energy values calculated using FFENCODER and the Amber software package are listed in Tables 5.3-5.14.

Figure 5.1 shows the comparisons of the different energy components between the two programs. The x-axis is the energy calculated with Amber, and the y-axis is the corresponding energy provided by FFENCODER. Panels (a)-(e) show the comparisons of the dihedral, 1_4_vdW, 1_4_EEL, vdW, and EEL energies between the two programs. Columns (1) and (3) show the comparisons for ff94 for the single amino acid and the dipeptide test sets, respectively. Similarly, columns (2) and (4) show the comparisons for ff14SB. The trend line, the equation of the trend line, and the R-squared value are presented in each plot to evaluate the overall performance of FFENCODER. Theoretically, the trend line equation should be y = x. Thus, the closer the trend line equation shown in a plot is to y = x, the closer the results calculated with FFENCODER are to the original Amber output.
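The trend line comparison described above amounts to an ordinary least-squares fit of the FFENCODER energies against the Amber energies. A minimal sketch of such a check is given below; the function name is hypothetical and the fit is written out with the standard library only, which is an assumption about tooling rather than a description of how the plots in Figure 5.1 were produced.

```python
def fit_trend_line(x, y):
    """Least-squares line y = slope * x + intercept and its R-squared value.

    x: reference energies (e.g. from Amber); y: energies to be validated.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    r_squared = 1.0 - ss_res / ss_tot
    return slope, intercept, r_squared
```

A perfectly consistent encoding would give a slope of 1, an intercept of 0, and an R-squared value of 1; deviations from these values quantify the disagreement between the two programs.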
In Figure 5.1, all trend line equations are close to y = x: the slopes of the trend lines lie in the range [0.9985, 1.0003], and the absolute intercept values lie in the range [0.000002, 0.0046]. The ranges of both the slope and the intercept are small. Furthermore, the R-squared values for all plots are equal to 1. Table 5.15 gives a brief summary of the absolute energy differences given in Table 5.5, Table 5.8, Table 5.11, and Table 5.14. For the single amino acid test set, the ranges of the energy differences for ff94 and ff14SB are [-0.0177, 0.0244] kcal/mol and [-0.0191, 0.0247] kcal/mol, respectively. For the dipeptide test set, the range of the energy differences for ff94 is [-0.0519, 0.0807] kcal/mol, and the corresponding range for ff14SB is [-0.0434, 0.0584] kcal/mol. For both the ff94 and ff14SB parameter sets, our results are therefore consistent with the Amber force fields. Based on Figure 5.1 and Table 5.15, we conclude that we are modeling the canonical Amber force fields.

5.5 Feature importance analysis

Based on the definitions of atom types, torsion types, and nonbond types given in the method section, there are 191 atom types, 1143 pairwise torsion types, and 18336 pairwise nonbond types in ff94, and 191 atom types, 1175 pairwise torsion types, and 18336 pairwise nonbond types in ff14SB. Because of the limited number of data points (around 154,000), it is computationally more intensive and largely unnecessary to include all pairwise interactions as features for our RF models. Here, before building the final RF model, a feature importance analysis was performed to filter the pairwise interactions. Figure 5.2 shows the importance analyses for the features in the ff94 and ff14SB parameter sets. In Figure 5.2, the red point in each plot marks the 500th most important feature in the corresponding parameter set. After the 500th feature, the contributions from atom pairs are trivial.
Hence, in this work, the top 100, 200, 300, 400, and 500 features were used to construct RF models. These smaller feature sets also reduce the risk of overfitting and the computational cost.

5.6 Accuracy

The most important criterion for judging a scoring function is its ability to identify the native structure. In this work, accuracy values are used to represent the ability of scoring functions to locate native structures. The accuracy used here is defined in the method section; it evaluates the capability of a scoring function to compare two structures. A scoring function with a high accuracy is expected to perform better in identifying the native structure. For example, if a decoy set contains one native and 100 decoy structures, the scoring function is required to make 200 correct comparisons between the native and each decoy structure in order to locate the native structure. In other words, the native structure cannot be detected if the scoring function makes even one wrong comparison. In general, the higher the accuracy value, the better the scoring function performs. In Table 5.16, columns 3 to 5 give the comparisons of accuracy between the RF models and the other scoring functions. The results can be analyzed from three perspectives:

(i) If RF models with force field parameters are compared with traditional scoring functions (RWplus, DFIRE, dDFIRE, and GOAP), the RF models achieve higher averaged accuracy values, and even their lowest accuracies are higher than the accuracies of the conventional scoring functions. Therefore, RF models perform better than traditional scoring functions when differentiating native and decoy structures.
(ii) If RF models with force field parameters are compared with RF models based on knowledge-based potentials (KECSA2), then for every number of input features (top 100 to 500), the RF models with force field parameters provide a higher averaged accuracy than the models based on the knowledge-based potential.

(iii) If RF models with ff94 are compared with models based on ff14SB with the same number of input features (100 to 500), the models based on the two force field parameter sets generate similar averaged accuracies, and the confidence ranges ([lowest accuracy, highest accuracy]) are similar as well. With an increasing number of input features, the averaged accuracy values from the RF models based on ff94 and ff14SB remain similar.

Based on this, we conclude that RF models with force field potentials perform better than all other scoring functions considered in this work, and that the RF models based on ff94 and ff14SB perform similarly with the same number of input features.

5.7 Native ranking

Besides accuracy, another important criterion for evaluating the performance of a scoring function is the native ranking. A scoring function is always expected to rank the native structure first, that is, to give it the lowest rank number. In Table 5.16, columns 6 to 8 give the comparisons of native ranking between the different scoring functions. The comparison results can again be analyzed in three ways:

(i) If the RF models with force field parameters are compared with traditional scoring functions (RWplus, DFIRE, dDFIRE, and GOAP), the averaged native rankings from the RF models are smaller than the corresponding values from the conventional scoring functions. Furthermore, the native rankings provided by the conventional scoring functions are higher than even the highest native rankings generated by the RF models with force field parameters. Hence, RF models with force field parameters outperform the traditional scoring functions considered in this work.
(ii) If RF models based on force field parameters are compared with RF models based on the knowledge-based potential (KECSA2) with the same number of input features, the RF models with force field potentials always achieve lower native rankings than the models with KECSA2.

(iii) If RF models based on ff94 are compared with RF models based on ff14SB with the same number of input features, the averaged native rankings generated by the RF models with each potential set are similar.

In summary, we can conclude that, in the native ranking test, RF models with force field parameters outperform all other scoring functions considered in this work; RF models with ff94 and ff14SB perform similarly; and with an increasing number of input features, RF models with force field potentials provide native rankings of around 4.

When comparing native rankings with the corresponding accuracies, a correlation can be found. In general, the higher the accuracy, the lower the native ranking. A decoy set with one native and 100 decoys can be used as an example. In order to locate the native structure, the scoring function is required to make 200 correct comparisons between the native structure and each decoy. If a scoring function has an accuracy value of 0.95, it made 10 incorrect comparisons, and the ranking of the native structure provided by that scoring function will be larger than or equal to 5. If a scoring function makes more incorrect comparisons, it will have a lower accuracy value and a higher native ranking. Hence, in general, the higher the accuracy, the lower the native ranking.

5.8 1st decoy RMSD and TM-score

Besides accuracy and native ranking, two other values, the 1st decoy RMSD and TM-score, [79] are commonly used to evaluate the capability of a scoring function to identify the best decoy structure. The best decoy is the most stable decoy structure selected by a scoring function among a set of candidates.
In the protein design and protein structure prediction fields, the scoring function is expected to identify the most stable decoy structure among a large number of structure candidates. RMSD and TM-score are often used to represent the quality of a decoy structure. RMSD refers to the root mean squared deviation of all Cα pairs of the decoy relative to the native structure. TM-score gives large distances a small weight, which makes the magnitude of the TM-score more sensitive to topology. The best decoy structure selected by a scoring function is always expected to have a low RMSD and a high TM-score.

Table 5.17 shows the 1st decoy RMSD and TM-score comparisons between the different scoring functions. In general, all scoring functions in Table 5.17 provide 1st decoy RMSD values of around 4.5 Å and 1st decoy TM-scores of around 0.62. Some of them generate an RMSD or TM-score slightly better than others. The difference between the highest and lowest RMSD values is smaller than 1 Å, and the TM-score difference is within 0.1. Among the RF based scoring functions, no matter which potential data set was used (force field potentials like ff94 and ff14SB, or knowledge-based potentials like KECSA2), the performances in selecting the best decoy are similar. This common performance of the RF models is due to the fact that they were trained on native-decoy comparisons, and decoy-decoy comparisons are missing from the training data set. However, decoy-decoy comparisons are necessary to locate the best decoy structure among a large number of candidates. In our previous work, [124] we showed that including more decoy-decoy comparisons in the training set improves the ability of RF models to identify the best decoy. In this work, it is hard to include more decoy-decoy comparisons due to the sparseness of the data.
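For reference, the two decoy quality measures used in this section can be sketched as follows. This is a simplified illustration, not the tool used to produce Table 5.17: it assumes the decoy and native Cα coordinates are already matched one-to-one and superposed, whereas the published TM-score is the maximum over superpositions. The distance scale d0 = 1.24 (L - 15)^(1/3) - 1.8 is the standard TM-score definition; the lower bound of 0.5 Å on d0 and the function names are assumptions of this sketch.

```python
import math


def ca_rmsd(decoy, native):
    """Root mean squared deviation over matched C-alpha coordinates (x, y, z)."""
    if not native or len(decoy) != len(native):
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = sum(
        sum((a - b) ** 2 for a, b in zip(p, q)) for p, q in zip(decoy, native)
    )
    return math.sqrt(squared / len(native))


def tm_score_presuperposed(decoy, native):
    """TM-score-like value for coordinates that are already superposed.

    Large per-residue distances contribute little, so the value is less
    dominated by local outliers than the RMSD.
    """
    length = len(native)
    if len(decoy) != length:
        raise ValueError("coordinate lists must be equally long")
    if length <= 15:
        raise ValueError("this sketch needs more than 15 residues")
    # Standard length-dependent distance scale; the 0.5 floor is an assumption.
    d0 = max(1.24 * (length - 15) ** (1.0 / 3.0) - 1.8, 0.5)
    total = 0.0
    for p, q in zip(decoy, native):
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        total += 1.0 / (1.0 + (dist / d0) ** 2)
    return total / length
```

An identical decoy gives an RMSD of 0 and a score of 1 in this sketch, while a uniformly displaced decoy gives a larger RMSD and a score between 0 and 1.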
Table 5.18 shows the distribution of the lowest decoy RMSD values in all 234 protein systems; only 75 systems provide a decoy structure with an RMSD smaller than 1 Å. Compared to the total number of native-decoy comparisons in the training data, the decoy-decoy comparison information is insufficient. Therefore, it is hard to improve the ability of RF models with force field potentials to identify the best decoy structure in this work. Taken together, it can be concluded that RF models with force field potentials perform similarly to the other scoring functions considered in this work in the best decoy selection test.

5.9 Impact of the RF algorithm

RF models with force field potentials achieve a higher accuracy and a lower native ranking than the other scoring functions considered in this work. At the same time, all scoring functions used here perform similarly in selecting the best decoy structure. It is interesting to test whether the better performance of the RF models is the result of the RF algorithm or not. In order to test the importance of RF refinement, the accuracy values from scoring functions with and without RF models should be compared. RF models can emphasize the more important pairwise interactions and ignore insignificant ones; in contrast, a scoring function without RF refinement treats every pairwise energy equally. In other words, without RF refinement, a score can be calculated directly as the sum of all pairwise energies (the sum over all descriptors) obtained from FFENCODER with the ff94 or ff14SB parameter sets. An accuracy can then be obtained from these calculated scores. Table 5.19 shows the comparison between scoring functions with and without RF refinement. With RF refinement, the accuracy values improve from ~0.65 to ~0.99. The trend is the same for both the ff94 and ff14SB force field parameters.
Therefore, it can be concluded that the RF refinement protocol is important and that it helps the scoring functions to achieve higher accuracy values.

5.10 Potential analysis

The performance of scoring functions with force field descriptors can be improved by RF refinement. On the other hand, it is also necessary to test whether the force field parameters themselves are important. In order to test the importance of the force field potentials in RF models, the performance of RF models with and without force field parameters needs to be compared. A set of RF models constructed from counts of interactions was used as a reference. In potential functions, different distances between atoms give different pairwise energies; using counts of interactions eliminates the impact of the potential functions on the RF models. In detail, there is no energy difference between different atom pairs, and the same atom pair at different distances is treated identically. For example, if an atom pair 'H-0_C-0' exists five times in a protein, the count of interactions of that atom pair is five, and five is used instead of the total energy. In this way, the torsion and nonbond potential functions in the force fields were replaced by horizontal lines with intercepts of 1.

Table 5.20 shows the comparison between scoring functions with and without force field parameters. Without force field potential functions, the RF models only reach accuracy values of 0.679 and 0.712 with 100 and 500 features, respectively. With force field parameters, the accuracy values increase to ~0.99 using either 100 or 500 features. Hence, the force field parameters are also important in the model. Furthermore, it can be concluded that both the force field potential functions and the RF refinement algorithm are important in the RF model: neither of them generates high accuracies by itself, and combining them is the only way to achieve the best performance.
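The two control experiments in Sections 5.9 and 5.10 can be summarized with the following sketch. It is illustrative only: the function names are hypothetical, the pair-type labels other than 'H-0_C-0' are made up, and the accuracy shown here is a simplified stand-in that counts one comparison per native-decoy pair, whereas the accuracy used in this work is the one defined in the method section.

```python
from collections import Counter


def unweighted_score(energy_vector):
    """Score without RF refinement: every pairwise energy counts equally."""
    return sum(energy_vector)


def count_features(interactions, feature_order):
    """Replace energies by occurrence counts (each interaction contributes 1)."""
    counts = Counter(pair_type for pair_type, _energy in interactions)
    return [counts[name] for name in feature_order]


def pairwise_accuracy(native_score, decoy_scores):
    """Fraction of native-vs-decoy comparisons in which the native scores lower.

    Simplified assumption: a lower score means a more stable structure.
    """
    correct = sum(1 for score in decoy_scores if native_score < score)
    return correct / len(decoy_scores)
```

Feeding count vectors instead of energy vectors into the same learning procedure isolates the contribution of the potential functions, while scoring with `unweighted_score` instead of a trained model isolates the contribution of the RF refinement.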
5.11 Conclusion

In this work, the Amber force field pairwise potentials from ff94 and ff14SB were successfully encoded based on five assumptions: (1) only the 20 common amino acids are considered; (2) terminal amino acids are not treated differently; (3) amino acids HIP, HIE, and HID are all treated as HID; (4) there is no energy cutoff if the repulsion is too strong; (5) no metal ions are considered in the calculation. The detailed pairwise energies obtained from FFENCODER were used as input features to construct RF models. Twelve popular protein folding decoy sets were combined based on protein systems and used to train and test the RF models. The comparisons between RF models and other scoring functions suggest that RF models with force field parameters outperform other scoring functions in the accuracy and native ranking tests, and perform similarly to other scoring functions in selecting the best decoy. The importance of the RF algorithm was tested by comparing scoring functions with and without RF refinement, and the results clearly showed that the RF algorithm is an important reason for the observed high accuracies. On the other hand, counts of interactions were used to replace all force field potential functions in order to test the importance of the force field potentials. The comparisons between RF models with and without force field potentials suggest that the force fields also play a key role in the observed high accuracy values. A model cannot achieve high accuracy without both RF refinement and appropriate force field parameters. Moreover, this work showed only one example of building ML models using force field potential functions. FFENCODER makes it possible to combine other novel ML algorithms with pairwise energies as encoded by Amber force field potentials.

APPENDICES

APPENDIX A: TABLES

Table 1.1. An example of training data to construct a decision tree.
Day  Weather   Temperature  Humidity  Wind    Jogging
1    Sunny     Hot          Normal    Weak    No
2    Sunny     Mild         High      Weak    Yes
3    Overcast  Mild         High      Strong  Yes
4    Overcast  Hot          Normal    Weak    No
5    Rain      Cool         High      Strong  No
6    Rain      Cool         High      Strong  No
7    Sunny     Mild         Normal    Weak    Yes
8    Sunny     Mild         Normal    Weak    Yes
9    Sunny     Hot          High      Strong  No
10   Overcast  Hot          High      Strong  Yes
11   Overcast  Hot          Normal    Weak    Yes
12   Rain      Cool         High      Weak    No
13   Rain      Cool         High      Strong  No
14   Overcast  Mild         Normal    Weak    Yes

Table 1.2. Relationship between humidity and jogging decisions.

Humidity  Jogging       Expected      Difference
          Yes    No     Yes    No     Yes    No
High      3      5      3.5    3.5    -0.5   1.5
Normal    4      2      3.5    3.5    0.5    -1.5
Total     7      7      7      7      -      -

Table 2.1. Atom types in the GARF potential database. (a)

Atom type  Definition

Protein atom types
C          sp2 carbonyl carbon and aromatic carbon with hydroxyl substituent in tyrosine
C*         sp2 aromatic carbon in 5-membered ring with one substituent
CA         sp2 aromatic carbon in 6-membered ring with one substituent
CB         sp2 aromatic carbon at junction between 5- and 6-membered rings
CC         sp2 aromatic carbon in 5-membered ring with one substituent and next to a nitrogen atom
CN         sp2 aromatic junction carbon in between 5- and 6-membered rings
CR         sp2 aromatic carbon in 5-membered ring between two nitrogen atoms and bonded to one hydrogen atom (in HIS)
CT         sp3 carbon with four explicit substituents
CV         sp2 aromatic carbon in 5-membered ring bonded to one nitrogen atom and bonded to an explicit hydrogen
CW         sp2 aromatic carbon in 5-membered ring bonded to one N-H group and an explicit hydrogen
N          sp2 nitrogen in amide group
N2         sp2 nitrogen in base NH2 group or arginine NH2
N3         sp2 nitrogen with four substituents
NA         sp2 nitrogen in 5-membered ring with hydrogen attached
NB         sp2 nitrogen in 5-membered ring with lone pairs
O          carbonyl oxygen
O2         carboxyl oxygen
OH         alcohol oxygen
S          sulfur in disulfide linkage or methionine
SH         sulfur in cysteine

Ligand atom types
C.3        sp3 carbon without polar group substituent
C.2        sp2 carbonyl carbon without polar group substituent
C.1        sp carbon
C.ar       sp2 aromatic carbon without polar group substituent
O.3        alcohol oxygen
O.3P       ether oxygen
O.2        carbonyl oxygen
O.co2      carboxylate oxygen
O.2v       sulfate/phosphate oxygen
N.2        sp/sp2/aromatic nitrogen
N.1h       sp3 nitrogen with one hydrogen atom attached
N.2h       sp3 nitrogen with two hydrogen atoms attached
N.3h       sp3 nitrogen with three hydrogen atoms attached
P          Phosphorus
F          Fluorine
Cl         Chlorine
Br         Bromine
I          Iodine
C.cat      carbon cation
S.3        thiol/thioether sulfur
S.o        sulfoxide sulfur
C.3X       sp3 carbon with polar group substituent
C.2X       sp2 carbonyl carbon with polar group substituent
C.arX      sp2 aromatic carbon with polar group substituent

(a) This table is the same as Table 2 in the GARF paper, reference 114.

Table 2.2. General form of a confusion matrix.

                  Predicted (class 1)  Predicted (class 0)
Actual (class 1)  TP                   FN
Actual (class 0)  FP                   TN

Table 3.1. Accuracy values for different models. (a)
RF avg., RF high, and RF low are the averaged, highest, and lowest accuracy of the RF model.

Decoy sets         RF avg.  RF high  RF low   KECSA2   RWplus   DFIRE    dDFIRE   GOAP
4state_reduced     1.000    1.000    0.998    0.997    1.000    1.000    1.000    1.000
fisa               1.000    1.000    0.997    0.751    0.775    0.810    0.761    0.816
fisa_casp3         1.000    1.000    0.999    0.842    1.000    1.000    0.989    1.000
hg_structal        0.971    0.994    0.939    0.828    0.902    0.882    0.881    0.934
ig_structal        1.000    1.000    0.999    0.887    0.540    0.536    0.895    0.955
ig_structal_hires  1.000    1.000    1.000    0.953    0.580    0.567    0.942    1.000
I-TASSER           0.982    0.998    0.966    0.971    0.914    0.856    0.919    0.857
lattice_ssift      0.999    1.000    0.998    1.000    1.000    1.000    1.000    1.000
lmds               0.999    1.000    0.997    0.963    0.722    0.727    0.735    0.798
lmds_v2            0.999    1.000    0.990    0.762    0.861    0.899    0.871    0.906
MOULDER            0.988    1.000    0.969    0.829    0.982    0.985    0.982    0.991
ROSETTA            1.000    1.000    1.000    0.776    0.939    0.770    0.537    0.798

(a) RF models were trained on different decoy sets.

Table 3.2. Native structure's ranking of different models. (a)
RF avg., RF high, and RF low are the averaged, highest, and lowest native ranking of the RF model.

Decoy sets         RF avg.  RF high  RF low   KECSA2   RWplus   DFIRE    dDFIRE   GOAP
4state_reduced     1.70     7.00     1.00     3.29     1.00     1.00     1.00     1.00
fisa               1.50     3.00     1.00     125.50   113.5    95.75    120.25   92.75
fisa_casp3         1.00     1.00     1.00     228.60   1.60     1.60     17.20    1.00
hg_structal        2.43     5.67     1.33     5.93     3.79     4.38     4.41     2.90
ig_structal        1.01     1.08     1.00     8.03     29.7     29.98    7.57     3.79
ig_structal_hires  1.00     1.00     1.00     1.90     8.95     9.20     2.10     1.00
I-TASSER           13.26    39.17    2.83     13.71    38.13    63.25    36.16    62.89
lattice_ssift      1.05     1.50     1.00     1.38     1.00     1.00     1.00     1.00
lmds               1.05     1.50     1.00     138.91   138.90   136.50   132.2    101.10
lmds_v2            1.40     5.00     1.00     29.50    17.70    13.10    16.5     12.30
MOULDER            5.65     11.50    1.00     55.15    6.65     5.75     6.65     3.80
ROSETTA            1.00     1.00     1.00     23.33    7.07     23.84    47.16    21.07

(a) RF models were trained on different decoy sets.

Table 3.3. 1st decoy's RMSD for different models. (a)
RF avg., RF high, and RF low are the averaged, highest, and lowest 1st decoy RMSD of the RF model.

Decoy sets         RF avg.  RF high  RF low   KECSA2   RWplus   DFIRE    dDFIRE   GOAP
4state_reduced     3.28     6.06     1.34     3.17     2.69     2.61     2.25     1.83
fisa               6.13     9.60     4.68     6.51     5.26     5.77     6.05     4.48
fisa_casp3         11.67    15.76    6.35     12.30    11.80    11.10    9.88     10.40
hg_structal        2.62     4.88     1.39     2.59     2.31     2.45     2.66     2.43
ig_structal        2.21     2.62     1.73     2.02     2.00     2.06     1.86     1.88
ig_structal_hires  2.63     4.10     1.48     2.06     2.14     2.13     2.10     2.08
I-TASSER           1.71     2.21     1.27     1.73     1.73     1.70     1.70     1.65
lattice_ssift      10.37    11.44    9.17     9.55     9.26     9.17     9.21     10.01
lmds               7.91     10.75    4.13     7.72     8.08     8.23     6.69     8.55
lmds_v2            7.60     9.38     4.46     8.01     7.74     7.82     7.67     7.36
MOULDER            9.18     12.83    6.67     10.77    9.74     9.98     10.08    9.96
ROSETTA            7.27     8.75     5.88     8.54     7.65     7.36     7.53     7.53

(a) RF models were trained on different decoy sets.

Table 3.4. 1st decoy's TM-score of different models. (a)
RF avg., RF high, and RF low are the averaged, highest, and lowest 1st decoy TM-score of the RF model.

Decoy sets         RF avg.  RF high  RF low   KECSA2   RWplus   DFIRE    dDFIRE   GOAP
4state_reduced     0.620    0.864    0.278    0.617    0.700    0.714    0.725    0.791
fisa               0.398    0.468    0.315    0.411    0.467    0.432    0.389    0.472
fisa_casp3         0.263    0.318    0.233    0.296    0.285    0.286    0.313    0.313
hg_structal        0.871    0.924    0.790    0.888    0.894    0.892    0.869    0.891
ig_structal        0.931    0.941    0.923    0.943    0.945    0.943    0.951    0.950
ig_structal_hires  0.936    0.950    0.914    0.939    0.948    0.947    0.949    0.951
I-TASSER           0.431    0.529    0.377    0.451    0.442    0.451    0.445    0.444
lattice_ssift      0.224    0.291    0.179    0.240    0.270    0.258    0.277    0.249
lmds               0.347    0.430    0.283    0.333    0.344    0.336    0.376    0.342
lmds_v2            0.367    0.484    0.296    0.363    0.442    0.451    0.445    0.444
MOULDER            0.429    0.555    0.211    0.394    0.426    0.418    0.416    0.422
ROSETTA            0.487    0.573    0.410    0.438    0.460    0.466    0.477    0.471

(a) RF models were trained on different decoy sets.

Table 3.5. Comparison of accuracies of RF models with different numbers of features. (a)
Avg., High, and Low are the averaged, highest, and lowest accuracy.

                   RF model_with_IMP_500        RF_model_with_all_features
Decoy sets         Avg.     High     Low        Avg.     High     Low
4state_reduced     1.000    1.000    0.999      1.000    1.000    0.998
fisa               1.000    1.000    1.000      1.000    1.000    0.997
fisa_casp3         0.992    0.999    0.987      1.000    1.000    0.999
hg_structal        0.955    0.977    0.909      0.971    0.994    0.939
ig_structal        0.999    1.000    0.998      1.000    1.000    0.999
ig_structal_hires  1.000    1.000    1.000      1.000    1.000    1.000
I-TASSER           0.978    0.997    0.955      0.982    0.998    0.966
lattice_ssift      1.000    1.000    0.997      0.999    1.000    0.998
lmds               0.994    1.000    0.948      0.999    1.000    0.997
lmds_v2            1.000    1.000    1.000      0.999    1.000    0.990
MOULDER            0.989    1.000    0.974      0.988    1.000    0.969
ROSETTA            0.999    1.000    0.996      1.000    1.000    1.000

(a) RF models were trained on different decoy sets.

Table 3.6. Comparison of the overall performance of RF models (with different numbers of features) with traditional potentials on the overall data set.

                                nIMP_100  nIMP_500  KECSA2   RWplus   DFIRE    dDFIRE   GOAP
Accuracy
  Averaged                      0.963     0.981     0.908    0.916    0.886    0.904    0.917
  Lowest                        0.931     0.965     -        -        -        -        -
  Highest                       0.987     0.994     -        -        -        -        -
Ranking of native structure
  Averaged                      10.62     7.95      25.67    23.43    31.35    26.49    23.09
  Lowest                        4.86      2.77      -        -        -        -        -
  Highest                       21.64     17.59     -        -        -        -        -
RMSD of 1st selected decoy
  Averaged                      4.62      4.57      4.84     4.53     4.51     4.44     4.45
  Lowest                        3.77      3.52      -        -        -        -        -
  Highest                       5.49      5.72      -        -        -        -        -
TM-score of 1st selected decoy
  Averaged                      0.634     0.614     0.610    0.622    0.623    0.625    0.674
  Lowest                        0.574     0.536     -        -        -        -        -
  Highest                       0.685     0.695     -        -        -        -        -

Table 3.7. Comparison of RF models based on different potentials.
Avg., High, and Low are the averaged, highest, and lowest accuracy.

            RF model_nIMP_100            RF model_nIMP_500
            Avg.     High     Low        Avg.     High     Low
Original    0.963    0.987    0.931      0.981    0.994    0.965
Scrambled   0.822    0.854    0.805      0.827    0.864    0.799
Uniform     0.940    0.968    0.872      0.977    0.990    0.959

Table 4.1. Comparisons between RF models and 29 other scoring functions.

                   Accuracy  Native's ranking  1st decoy RMSD
RF models
  Averaged         0.953     4.49              3.87
  Highest          0.969     5.54              4.47
  Lowest           0.942     3.54              3.38
Conventional SFs
  GOLD-ASP         0.924     6.13              1.74
  GOLD-ChemPLP     0.923     6.25              1.51
  DS-PLP1          0.917     6.68              1.80
  DS-PLP2          0.914     7.07              1.87
  MOE-Affinity_dG  0.900     8.23              2.42
  Xscore-HMScore   0.891     8.89              2.45
  Xscore-Average   0.886     9.33              2.38
  GOLD-ChemScore   0.882     9.58              1.72
  DS-PMF04         0.874     10.58             3.38
  SYBYL-PMF        0.874     10.53             3.42
  Xscore-HPScore   0.871     10.63             2.75
  MOE-Alpha        0.870     10.38             1.85
  Xscore-HSScore   0.869     10.80             2.64
  DS-LigScore2     0.867     10.67             1.83
  MOE-London_dG    0.863     11.38             2.52
  DS-PMF           0.857     11.89             3.48
  MOE-ASE          0.856     11.88             2.91
  GlideScore-SP    0.832     13.20             1.72
  DS-LigScore1     0.823     14.28             2.31
  GlideScore-XP    0.823     14.07             1.86
  GOLD-GoldScore   0.819     14.70             1.88
  DS-LUDI2         0.807     15.62             2.23
  DS-LUDI1         0.799     16.35             2.34
  DS-LUDI3         0.783     17.48             2.87
  SYBYL-ChemScore  0.782     17.70             2.40
  SYBYL-Gscore     0.725     22.64             3.13
  dSAS             0.692     25.48             3.96
  DS-Jain          0.685     25.63             2.90
  SYBYL-Dscore     0.674     26.70             4.03

Table 4.2. Comparison between RF models trained with different numbers of decoy poses in the training set.

                                        Accuracy  Native's ranking  1st decoy RMSD
With no decoy structure
  Averaged                              0.953     4.49              3.87
  Highest                               0.969     5.54              4.47
  Lowest                                0.942     3.54              3.38
With one lowest RMSD decoy structure
  Averaged                              0.958     4.28              2.41
  Highest                               0.974     7.49              2.72
  Lowest                                0.921     3.03              2.13
With two lowest RMSD decoy structure
  Averaged                              0.950     5.03              2.50
  Highest                               0.957     6.08              2.99
  Lowest                                0.937     4.39              1.95
With three lowest RMSD decoy structure
  Averaged                              0.947     5.21              2.27
  Highest                               0.963     6.56              2.44
  Lowest                                0.930     3.97              1.73

Table 4.3. Comparison between RF models with different probability function sets.

RF models       Averaged accuracy  Highest accuracy  Lowest accuracy
Original GARF   0.953              0.969             0.942
Scrambled GARF  0.933              0.951             0.911
Uniform GARF    0.953              0.980             0.918

Table 4.4. Summary of peak positions and number of probability functions at each peak position in GARF.

Peak position / Å  Number of probability functions
2.5                2
2.6                2
2.7                5
2.8                14
2.9                9
3.0                6
3.1                5
3.2                2
3.3                14
3.4                23
3.5                23
3.6                23
3.7                53
3.8                43
3.9                34
4.0                36
4.1                24
4.2                28
4.3                38
4.4                30
4.5                8
4.6                12
4.7                13
4.8                17
4.9                8
5.0                4
5.1                4

Table 4.5. Accuracy values for different training set sizes from RF models with original and uniform GARF.
Avg., High, and Low are the averaged, highest, and lowest accuracy.

                   Original GARF                Uniform GARF
Training size / %  Avg.     High     Low        Avg.     High     Low
5                  0.883    0.926    0.833      0.900    0.929    0.867
10                 0.923    0.939    0.912      0.916    0.936    0.879
15                 0.919    0.941    0.884      0.933    0.948    0.913
20                 0.926    0.947    0.887      0.936    0.946    0.902
25                 0.938    0.953    0.918      0.939    0.960    0.919
30                 0.943    0.953    0.935      0.945    0.965    0.933
35                 0.947    0.962    0.928      0.942    0.960    0.914
40                 0.945    0.959    0.924      0.946    0.951    0.933
45                 0.941    0.951    0.925      0.949    0.969    0.922
50                 0.953    0.966    0.938      0.944    0.957    0.935
55                 0.952    0.973    0.916      0.945    0.968    0.923
60                 0.954    0.975    0.935      0.950    0.968    0.937
65                 0.945    0.964    0.937      0.949    0.965    0.921
70                 0.954    0.980    0.922      0.955    0.972    0.924
75                 0.953    0.980    0.922      0.952    0.977    0.924
80                 0.953    0.969    0.942      0.953    0.980    0.918

Table 5.1. Summary of charge, ε, and van der Waals radii for each atom type in the ff94 force field.

Atom type  Charge   vdW radii  ε        Atom type  Charge   vdW radii  ε
C*-0  -0.1415  1.9080  0.0860  H1-1  0.1560  1.3870  0.0157
C-0  0.5973  1.9080  0.0860  H1-10  0.0881  1.3870  0.0157
C-1  0.7341  1.9080  0.0860  H1-11  0.0869  1.3870  0.0157
C-2  0.7130  1.9080  0.0860  H1-12  0.0922  1.3870  0.0157
C-3  0.7994  1.9080  0.0860  H1-13  0.1426  1.3870  0.0157
C-4  0.5366  1.9080  0.0860  H1-14  0.0440  1.3870  0.0157
C-5  0.6951  1.9080  0.0860  H1-15  0.0684  1.3870  0.0157
C-6  0.8054  1.9080  0.0860  H1-16  0.0978  1.3870  0.0157
C-7  0.5896  1.9080  0.0860  H1-17  0.0391  1.3870  0.0157
C-8  0.3226  1.9080  0.0860  H1-18  0.0641  1.3870  0.0157
CA-0  0.8076  1.9080  0.0860  H1-19  0.0843  1.3870  0.0157
CA-1  0.0118  1.9080  0.0860  H1-2  0.0687  1.3870  0.0157
CA-10  -0.1906  1.9080  0.0860  H1-20  0.0352  1.3870  0.0157
CA-11  -0.2341  1.9080  0.0860  H1-21  0.1007  1.3870  0.0157
CA-2  -0.1256  1.9080  0.0860  H1-22  0.0043  1.3870  0.0157
CA-3  -0.1704  1.9080  0.0860  H1-23  0.1123  1.3870  0.0157
CA-4  -0.1072  1.9080  0.0860  H1-24  0.0876  1.3870  0.0157
CA-5  -0.2601  1.9080  0.0860  H1-25  0.0969  1.3870  0.0157
CA-6  -0.1134  1.9080  0.0860  H1-3  0.1048  1.3870  0.0157
CA-7  -0.1972  1.9080  0.0860  H1-4  0.0880  1.3870  0.0157
CA-8  -0.2387  1.9080  0.0860  H1-5  0.1124  1.3870  0.0157
CA-9  -0.0011  1.9080  0.0860  H1-6  0.1112  1.3870  0.0157
CB-0  0.1243  1.9080  0.0860  H1-7  0.0850  1.3870  0.0157
CC-0  -0.0266  1.9080  0.0860  H1-8  0.1105  1.3870  0.0157
CN-0  0.1380  1.9080  0.0860  H1-9  0.0698  1.3870  0.0157
CR-0  0.2057  1.9080  0.0860  H4-0  0.1147  1.4090  0.0150
CT-0  0.0337  1.9080  0.1094  H4-1  0.2062  1.4090  0.0150
CT-1  -0.1825  1.9080  0.1094  H5-0  0.1392  1.3590  0.0150
CT-10  0.0213  1.9080  0.1094  HA-0  0.1330  1.4590  0.0150
CT-11  -0.1231  1.9080  0.1094  HA-1  0.1430  1.4590  0.0150
CT-12  -0.0031  1.9080  0.1094  HA-2  0.1297  1.4590  0.0150
CT-13  -0.0036  1.9080  0.1094  HA-3  0.1572  1.4590  0.0150
CT-14  -0.0645  1.9080  0.1094  HA-4  0.1417  1.4590  0.0150
CT-15  0.0397  1.9080  0.1094  HA-5  0.1447  1.4590  0.0150
CT-16  0.0560  1.9080  0.1094  HA-6  0.1700  1.4590  0.0150
CT-17  0.0136  1.9080  0.1094  HA-7  0.1699  1.4590  0.0150
CT-18  -0.0252  1.9080  0.1094  HA-8  0.1656  1.4590  0.0150
CT-19  0.0188  1.9080  0.1094  HC-0  0.0603  1.4870  0.0157
CT-2  -0.2637  1.9080  0.1094  HC-1  0.0327  1.4870  0.0157
CT-20  -0.0462  1.9080  0.1094  HC-10  0.0187  1.4870  0.0157
CT-21  -0.0597  1.9080  0.1094  HC-11  0.0882  1.4870  0.0157
CT-22  0.1303  1.9080  0.1094  HC-12  0.0236  1.4870  0.0157
CT-23  -0.3204  1.9080  0.1094  HC-13  0.0186  1.4870  0.0157
CT-24  -0.0430  1.9080  0.1094  HC-14  0.0457  1.4870  0.0157
CT-25  -0.0660  1.9080  0.1094  HC-15  -0.0361  1.4870  0.0157
CT-26  -0.0518  1.9080  0.1094  HC-16  0.1000  1.4870  0.0157
CT-27  -0.1102  1.9080  0.1094  HC-17  0.0362  1.4870  0.0157
CT-28  0.3531  1.9080  0.1094  HC-18  0.0103  1.4870  0.0157
CT-29  -0.4121  1.9080  0.1094  HC-19  0.0621  1.4870  0.0157
CT-3  -0.0007  1.9080  0.1094  HC-2  0.0285  1.4870  0.0157
CT-30  -0.2400  1.9080  0.1094  HC-20  0.0241  1.4870  0.0157
CT-31  -0.0094  1.9080  0.1094  HC-21  0.0295  1.4870  0.0157
CT-32  0.0187  1.9080  0.1094  HC-22  0.0213  1.4870  0.0157
CT-33  -0.0479  1.9080  0.1094  HC-23  0.0253  1.4870  0.0157
CT-34  -0.0143  1.9080  0.1094  HC-24  0.0642  1.4870  0.0157
CT-35  -0.0237  1.9080  0.1094  HC-25  0.0339  1.4870  0.0157
CT-36  0.0342  1.9080  0.1094  HC-26  -0.0297  1.4870  0.0157
CT-37  0.0018  1.9080  0.1094  HC-27  0.0791  1.4870  0.0157
CT-38  -0.0536  1.9080  0.1094  HC-3  0.0797  1.4870  0.0157
CT-39  -0.0024  1.9080  0.1094  HC-4  -0.0122  1.4870  0.0157
CT-4  0.0390  1.9080  0.1094  HC-5  0.0171  1.4870  0.0157
CT-40  -0.0343  1.9080  0.1094  HC-6  0.0352  1.4870  0.0157
CT-41  0.0192  1.9080  0.1094  HC-7  -0.0173  1.4870  0.0157
CT-42  0.0189  1.9080  0.1094  HC-8  -0.0425  1.4870  0.0157
CT-43  -0.0070  1.9080  0.1094  HC-9  0.0402  1.4870  0.0157
CT-44  -0.0266  1.9080  0.1094  HO-0  0.4275  0.0000  0.0000
CT-45  -0.0249  1.9080  0.1094  HO-1  0.4102  0.0000  0.0000
CT-46  0.2117  1.9080  0.1094  HO-2  0.3992  0.0000  0.0000
CT-47  -0.0389  1.9080  0.1094  HP-0  0.1135  1.1000  0.0157
CT-48  0.3654  1.9080  0.1094  HS-0  0.1933  0.6000  0.0157
CT-49  -0.2438  1.9080  0.1094  N-0  -0.4157  1.8240  0.1700
CT-5  0.0486  1.9080  0.1094  N-1  -0.3479  1.8240  0.1700
CT-50  -0.0275  1.9080  0.1094  N-2  -0.9191  1.8240  0.1700
CT-51  -0.0050  1.9080  0.1094  N-3  -0.5163  1.8240  0.1700
CT-52  -0.0014  1.9080  0.1094  N-4  -0.9407  1.8240  0.1700
CT-53  -0.0152  1.9080  0.1094  N-5  -0.2548  1.8240  0.1700
CT-54  -0.0875  1.9080  0.1094  N2-0  -0.5295  1.8240  0.1700
CT-55  0.2985  1.9080  0.1094  N2-1  -0.8627  1.8240  0.1700
CT-56  -0.3192  1.9080  0.1094  N3-0  -0.3854  1.8240  0.1700
CT-6  0.0143  1.9080  0.1094  NA-0  -0.3811  1.8240  0.1700
CT-7  -0.2041  1.9080  0.1094  NA-1  -0.3418  1.8240  0.1700
CT-8  0.0381  1.9080  0.1094  NB-0  -0.5727  1.8240  0.1700
CT-9  -0.0303  1.9080  0.1094  O-0  -0.5679  1.6612  0.2100
CV-0  0.1292  1.9080  0.0860  O-1  -0.5894  1.6612  0.2100
CW-0  -0.1638  1.9080  0.0860  O-2  -0.5931  1.6612  0.2100
H-0  0.2719  0.6000  0.0157  O-3  -0.5819  1.6612  0.2100
H-1  0.2747  0.6000  0.0157  O-4  -0.6086  1.6612  0.2100
H-2  0.3456  0.6000  0.0157  O-5  -0.5748  1.6612  0.2100
H-3  0.4478  0.6000  0.0157  O2-0  -0.8014  1.6612  0.2100
H-4  0.4196  0.6000  0.0157  O2-1  -0.8188  1.6612  0.2100
H-5  0.2936  0.6000  0.0157  OH-0  -0.6546  1.7210  0.2104
H-6  0.4251  0.6000  0.0157  OH-1  -0.6761  1.7210  0.2104
H-7  0.3649  0.6000  0.0157  OH-2  -0.5579  1.7210  0.2104
H-8  0.3400  0.6000  0.0157  S-0  -0.2737  2.0000  0.2500
H-9  0.3412  0.6000  0.0157  SH-0  -0.3119  2.0000  0.2500
H1-0  0.0823  1.3870  0.0157

Table 5.2. Summary of charge, ε, and van der Waals radii for each atom type in ff14SB force field.
Atom type  Charge  van der Waals radius  ε  Atom type  Charge  van der Waals radius  ε
2C-0  -0.2041  1.9080  0.1094  H1-1  0.1560  1.3870  0.0157
2C-1  -0.0303  1.9080  0.1094  H1-10  0.0881  1.3870  0.0157
2C-10  0.0018  1.9080  0.1094  H1-11  0.0869  1.3870  0.0157
2C-11  0.2117  1.9080  0.1094  H1-12  0.0922  1.3870  0.0157
2C-2  -0.1231  1.9080  0.1094  H1-13  0.1426  1.3870  0.0157
2C-3  -0.0036  1.9080  0.1094  H1-14  0.0440  1.3870  0.0157
2C-4  -0.0645  1.9080  0.1094  H1-15  0.0684  1.3870  0.0157
2C-5  0.0560  1.9080  0.1094  H1-16  0.0978  1.3870  0.0157
2C-6  0.0136  1.9080  0.1094  H1-17  0.0391  1.3870  0.0157
2C-7  -0.0430  1.9080  0.1094  H1-18  0.0641  1.3870  0.0157
2C-8  -0.1102  1.9080  0.1094  H1-19  0.0843  1.3870  0.0157
2C-9  0.0342  1.9080  0.1094  H1-2  0.0687  1.3870  0.0157
3C-0  0.1303  1.9080  0.1094  H1-20  0.0352  1.3870  0.0157
3C-1  0.3531  1.9080  0.1094  H1-21  0.1007  1.3870  0.0157
3C-2  0.3654  1.9080  0.1094  H1-22  0.0043  1.3870  0.0157
3C-3  0.2985  1.9080  0.1094  H1-23  0.1123  1.3870  0.0157
C-0  0.5973  1.9080  0.0860  H1-24  0.0876  1.3870  0.0157
C-1  0.7341  1.9080  0.0860  H1-25  0.0969  1.3870  0.0157
C-2  0.7130  1.9080  0.0860  H1-3  0.1048  1.3870  0.0157
C-3  0.5366  1.9080  0.0860  H1-4  0.0880  1.3870  0.0157
C-4  0.6951  1.9080  0.0860  H1-5  0.1124  1.3870  0.0157
C-5  0.5896  1.9080  0.0860  H1-6  0.1112  1.3870  0.0157
C-6  0.3226  1.9080  0.0860  H1-7  0.0850  1.3870  0.0157
C*-0  -0.1415  1.9080  0.0860  H1-8  0.1105  1.3870  0.0157
C8-0  -0.0007  1.9080  0.1094  H1-9  0.0698  1.3870  0.0157
C8-1  0.0390  1.9080  0.1094  H4-0  0.1147  1.4090  0.0150
C8-2  0.0486  1.9080  0.1094  H4-1  0.2062  1.4090  0.0150
C8-3  -0.0094  1.9080  0.1094  H5-0  0.1392  1.3590  0.0150
C8-4  0.0187  1.9080  0.1094  HA-0  0.1330  1.4590  0.0150
C8-5  -0.0479  1.9080  0.1094  HA-1  0.1430  1.4590  0.0150
C8-6  -0.0143  1.9080  0.1094  HA-2  0.1297  1.4590  0.0150
CA-0  0.8076  1.9080  0.0860  HA-3  0.1572  1.4590  0.0150
CA-1  0.0118  1.9080  0.0860  HA-4  0.1417  1.4590  0.0150
CA-10  -0.1906  1.9080  0.0860  HA-5  0.1447  1.4590  0.0150
CA-11  -0.2341  1.9080  0.0860  HA-6  0.1700  1.4590  0.0150
CA-2  -0.1256  1.9080  0.0860  HA-7  0.1699  1.4590  0.0150
CA-3  -0.1704  1.9080  0.0860  HA-8  0.1656  1.4590  0.0150
CA-4  -0.1072  1.9080  0.0860  HC-0  0.0603  1.4870  0.0157
CA-5  -0.2601  1.9080  0.0860  HC-1  0.0327  1.4870  0.0157
CA-6  -0.1134  1.9080  0.0860  HC-10  0.0187  1.4870  0.0157
CA-7  -0.1972  1.9080  0.0860  HC-11  0.0882  1.4870  0.0157
CA-8  -0.2387  1.9080  0.0860  HC-12  0.0236  1.4870  0.0157
CA-9  -0.0011  1.9080  0.0860  HC-13  0.0186  1.4870  0.0157
CB-0  0.1243  1.9080  0.0860  HC-14  0.0457  1.4870  0.0157
CC-0  -0.0266  1.9080  0.0860  HC-15  -0.0361  1.4870  0.0157
CN-0  0.1380  1.9080  0.0860  HC-16  0.1000  1.4870  0.0157
CO-0  0.7994  1.9080  0.0860  HC-17  0.0362  1.4870  0.0157
CO-1  0.8054  1.9080  0.0860  HC-18  0.0103  1.4870  0.0157
CR-0  0.2057  1.9080  0.0860  HC-19  0.0621  1.4870  0.0157
CT-0  -0.1825  1.9080  0.1094  HC-2  0.0285  1.4870  0.0157
CT-1  -0.0462  1.9080  0.1094  HC-20  0.0241  1.4870  0.0157
CT-10  -0.2438  1.9080  0.1094  HC-21  0.0295  1.4870  0.0157
CT-11  -0.0050  1.9080  0.1094  HC-22  0.0213  1.4870  0.0157
CT-12  -0.0152  1.9080  0.1094  HC-23  0.0253  1.4870  0.0157
CT-13  -0.3192  1.9080  0.1094  HC-24  0.0642  1.4870  0.0157
CT-2  -0.3204  1.9080  0.1094  HC-25  0.0339  1.4870  0.0157
CT-3  -0.0660  1.9080  0.1094  HC-26  -0.0297  1.4870  0.0157
CT-4  -0.4121  1.9080  0.1094  HC-27  0.0791  1.4870  0.0157
CT-5  -0.0536  1.9080  0.1094  HC-3  0.0797  1.4870  0.0157
CT-6  -0.0343  1.9080  0.1094  HC-4  -0.0122  1.4870  0.0157
CT-7  0.0192  1.9080  0.1094  HC-5  0.0171  1.4870  0.0157
CT-8  0.0189  1.9080  0.1094  HC-6  0.0352  1.4870  0.0157
CT-9  -0.0070  1.9080  0.1094  HC-7  -0.0173  1.4870  0.0157
CV-0  0.1292  1.9080  0.0860  HC-8  -0.0425  1.4870  0.0157
CW-0  -0.1638  1.9080  0.0860  HC-9  0.0402  1.4870  0.0157
CX-0  0.0337  1.9080  0.1094  HO-0  0.4275  0.0000  0.0000
CX-1  -0.2637  1.9080  0.1094  HO-1  0.4102  0.0000  0.0000
CX-10  -0.0518  1.9080  0.1094  HO-2  0.3992  0.0000  0.0000
CX-11  -0.2400  1.9080  0.1094  HP-0  0.1135  1.1000  0.0157
CX-12  -0.0237  1.9080  0.1094  HS-0  0.1933  0.6000  0.0157
CX-13  -0.0024  1.9080  0.1094  N-0  -0.4157  1.8240  0.1700
CX-14  -0.0266  1.9080  0.1094  N-1  -0.3479  1.8240  0.1700
CX-15  -0.0249  1.9080  0.1094  N-2  -0.9191  1.8240  0.1700
CX-16  -0.0389  1.9080  0.1094  N-3  -0.5163  1.8240  0.1700
CX-17  -0.0275  1.9080  0.1094  N-4  -0.9407  1.8240  0.1700
CX-18  -0.0014  1.9080  0.1094  N-5  -0.2548  1.8240  0.1700
CX-19  -0.0875  1.9080  0.1094  N2-0  -0.5295  1.8240  0.1700
CX-2  0.0143  1.9080  0.1094  N2-1  -0.8627  1.8240  0.1700
CX-3  0.0381  1.9080  0.1094  N3-0  -0.3854  1.8240  0.1700
CX-4  0.0213  1.9080  0.1094  NA-0  -0.3811  1.8240  0.1700
CX-5  -0.0031  1.9080  0.1094  NA-1  -0.3418  1.8240  0.1700
CX-6  0.0397  1.9080  0.1094  NB-0  -0.5727  1.8240  0.1700
CX-7  -0.0252  1.9080  0.1094  O-0  -0.5679  1.6612  0.2100
CX-8  0.0188  1.9080  0.1094  O-1  -0.5894  1.6612  0.2100
CX-9  -0.0597  1.9080  0.1094  O-2  -0.5931  1.6612  0.2100
H-0  0.2719  0.6000  0.0157  O-3  -0.5819  1.6612  0.2100
H-1  0.2747  0.6000  0.0157  O-4  -0.6086  1.6612  0.2100
H-2  0.3456  0.6000  0.0157  O-5  -0.5748  1.6612  0.2100
H-3  0.4478  0.6000  0.0157  O2-0  -0.8014  1.6612  0.2100
H-4  0.4196  0.6000  0.0157  O2-1  -0.8188  1.6612  0.2100
H-5  0.2936  0.6000  0.0157  OH-0  -0.6546  1.7210  0.2104
H-6  0.4251  0.6000  0.0157  OH-1  -0.6761  1.7210  0.2104
H-7  0.3649  0.6000  0.0157  OH-2  -0.5579  1.7210  0.2104
H-8  0.3400  0.6000  0.0157  S-0  -0.2737  2.0000  0.2500
H-9  0.3412  0.6000  0.0157  SH-0  -0.3119  2.0000  0.2500
H1-0  0.0823  1.3870  0.0157

Table 5.3. Torsion and nonbond energies calculated by an encoded program with ff94 parameters for single amino acid test set.
Energies / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA  0.0088  0.5494  45.9693  -0.1578  -24.2543
ARG  0.0706  1.8904  -260.3572  -1.2202  148.0743
ASN  4.1250  1.3423  -23.2339  -0.8248  -39.3539
ASP  0.0194  0.8207  56.3004  -0.4736  -36.3626
CYS  0.0447  0.7164  39.1112  -0.3116  -16.1384
GLN  4.2578  1.3251  -30.2072  -0.8694  -14.0307
GLU  0.0742  1.3344  46.5920  -0.7514  -15.0577
GLY  0.0000  0.4985  37.1013  -0.0511  -21.1957
HID  0.0158  0.5368  23.8454  -1.2102  -10.6580
ILE  0.3316  2.1826  21.2849  -0.0174  -7.2089
LEU  0.0679  2.3109  16.3250  -0.4998  -21.3487
LYS  0.0743  1.6171  58.9714  -0.8348  -1.7579
MET  0.0668  0.8724  36.1038  -0.7017  -17.0426
PHE  0.0120  4.3120  36.2859  -1.3411  -17.6305
PRO  3.1304  0.4955  14.0621  -0.3670  -3.2757
SER  0.0115  0.6387  12.8409  -0.1832  1.0681
THR  0.0783  1.5075  -21.1780  -0.2961  12.0923
TRP  0.0115  3.5231  52.6121  -2.2212  -26.2422
TYR  0.0131  4.2971  24.5879  -1.4562  -23.3515
VAL  0.1042  1.8272  -10.7949  -0.0816  9.0526

Table 5.4. Torsion and nonbond energies calculated by Amber software with ff94 parameters for single amino acid test set.
Energies / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA  0.0088  0.5492  45.9746  -0.1578  -24.2532
ARG  0.0704  1.8940  -260.3507  -1.2195  148.0715
ASN  4.1252  1.3436  -23.2274  -0.8252  -39.3537
ASP  0.0184  0.8247  56.3181  -0.4730  -36.3870
CYS  0.0446  0.7168  39.1207  -0.3116  -16.1453
GLN  4.2583  1.3287  -30.1907  -0.8689  -14.0280
GLU  0.0740  1.3323  46.5987  -0.7506  -15.0496
GLY  0.0000  0.4957  37.0937  -0.0510  -21.1894
HID  0.0155  0.5387  23.8397  -1.2100  -10.6602
ILE  0.3327  2.1786  21.2916  -0.0195  -7.2120
LEU  0.0676  2.3076  16.3301  -0.5007  -21.3532
LYS  0.0750  1.6158  58.9735  -0.8338  -1.7559
MET  0.0663  0.8775  36.1123  -0.7021  -17.0471
PHE  0.0120  4.3059  36.2908  -1.3421  -17.6348
PRO  3.1301  0.4933  14.0603  -0.3672  -3.2746
SER  0.0115  0.6394  12.8384  -0.1832  1.0766
THR  0.0789  1.5039  -21.1695  -0.2965  12.0833
TRP  0.0123  3.5237  52.6181  -2.2210  -26.2490
TYR  0.0128  4.2906  24.5879  -1.4566  -23.3583
VAL  0.1039  1.8261  -10.7968  -0.0813  9.0583

Table 5.5. Comparisons of torsion and nonbond energies calculated by encoded programs and Amber software with ff94 parameters for single amino acid test set.
Energy difference / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA  0.0000  0.0002  -0.0053  0.0000  -0.0011
ARG  0.0002  -0.0036  -0.0065  -0.0007  0.0028
ASN  -0.0002  -0.0013  -0.0065  0.0004  -0.0002
ASP  0.0010  -0.0040  -0.0177  -0.0006  0.0244
CYS  0.0001  -0.0004  -0.0095  0.0000  0.0069
GLN  -0.0005  -0.0036  -0.0165  -0.0005  -0.0027
GLU  0.0002  0.0021  -0.0067  -0.0008  -0.0081
GLY  0.0000  0.0028  0.0076  -0.0001  -0.0063
HID  0.0003  -0.0019  0.0057  -0.0002  0.0022
ILE  -0.0011  0.0040  -0.0067  0.0021  0.0031
LEU  0.0003  0.0033  -0.0051  0.0009  0.0045
LYS  -0.0007  0.0013  -0.0021  -0.0010  -0.0020
MET  0.0005  -0.0051  -0.0085  0.0004  0.0045
PHE  0.0000  0.0061  -0.0049  0.0010  0.0043
PRO  0.0003  0.0022  0.0018  0.0002  -0.0011
SER  0.0000  -0.0007  0.0025  0.0000  -0.0085
THR  -0.0006  0.0036  -0.0085  0.0004  0.0090
TRP  -0.0008  -0.0006  -0.0060  -0.0002  0.0068
TYR  0.0003  0.0065  0.0000  0.0004  0.0068
VAL  0.0003  0.0011  0.0019  -0.0003  -0.0057
Maximum  0.0010  0.0065  0.0076  0.0021  0.0244
Minimum  -0.0011  -0.0051  -0.0177  -0.0010  -0.0085

Table 5.6. Torsion and nonbond energies calculated by an encoded program with ff14SB parameters for single amino acid test set.
Energies / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA  0.1711  0.5419  45.9076  -0.1566  -24.1773
ARG  0.6312  1.9243  -260.1021  -1.2163  147.9333
ASN  10.7889  1.3478  -23.4786  -0.5022  -39.5225
ASP  6.0681  1.3423  58.0559  -0.1236  -38.6990
CYS  2.1838  0.7572  38.8337  -0.3214  -15.8136
GLN  9.2666  1.5997  -30.4210  -0.6938  -12.3584
GLU  6.0002  1.5892  46.1863  -0.6777  -14.6146
GLY  0.7558  0.4869  37.0364  -0.0505  -21.1259
HID  4.0438  0.8284  26.7941  -1.0869  -13.7885
ILE  4.7027  2.5636  26.2700  -0.1579  -12.1176
LEU  3.4216  2.3113  16.4199  -0.5225  -21.4754
LYS  1.0358  1.6150  59.0155  -0.8341  -1.7977
MET  3.4048  0.8709  35.9177  -0.7018  -16.8777
PHE  1.2477  4.5331  39.0332  -1.5347  -20.8393
PRO  3.6242  0.4790  14.0622  -0.3663  -3.2798
SER  2.4248  1.0014  14.0655  -0.2804  -0.8490
THR  7.6226  1.5090  -21.5826  -0.2696  12.7759
TRP  1.5419  3.7928  55.3833  -2.3750  -29.7700
TYR  1.5085  4.5045  27.0749  -1.6484  -26.6298
VAL  2.6052  2.2623  -7.2995  -0.2499  5.5357

Table 5.7. Torsion and nonbond energies calculated by Amber software with ff14SB parameters for single amino acid test set.
Energies / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA  0.1712  0.5424  45.9121  -0.1567  -24.1781
ARG  0.6313  1.9285  -260.0866  -1.2161  147.9086
ASN  10.7881  1.3465  -23.4731  -0.5034  -39.5127
ASP  6.0686  1.3472  58.0653  -0.1221  -38.7120
CYS  2.1834  0.7575  38.8367  -0.3212  -15.8130
GLN  9.2677  1.6053  -30.4211  -0.6948  -12.3676
GLU  5.9995  1.5870  46.1810  -0.6764  -14.6191
GLY  0.7553  0.4867  37.0303  -0.0505  -21.1246
HID  4.0424  0.8261  26.7944  -1.0859  -13.7808
ILE  4.7030  2.5634  26.2721  -0.1615  -12.1235
LEU  3.4221  2.3102  16.4172  -0.5229  -21.4743
LYS  1.0363  1.6139  59.0196  -0.8337  -1.7965
MET  3.4055  0.8690  35.9189  -0.7020  -16.8745
PHE  1.2475  4.5322  39.0335  -1.5350  -20.8375
PRO  3.6209  0.4811  14.0628  -0.3666  -3.2788
SER  2.4255  1.0041  14.0670  -0.2803  -0.8570
THR  7.6225  1.5065  -21.5635  -0.2691  12.7765
TRP  1.5421  3.7865  55.3799  -2.3752  -29.7711
TYR  1.5077  4.5149  27.0864  -1.6483  -26.6331
VAL  2.6041  2.2635  -7.2958  -0.2495  5.5362

Table 5.8. Comparisons of torsion and nonbond energies calculated by encoded programs and Amber software with ff14SB parameters for single amino acid test set.
Energy difference / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA  -0.0001  -0.0005  -0.0045  0.0001  0.0008
ARG  -0.0001  -0.0042  -0.0155  -0.0002  0.0247
ASN  0.0008  0.0013  -0.0055  0.0012  -0.0098
ASP  -0.0005  -0.0049  -0.0094  -0.0015  0.0130
CYS  0.0004  -0.0003  -0.0030  -0.0002  -0.0006
GLN  -0.0011  -0.0056  0.0001  0.0010  0.0092
GLU  0.0007  0.0022  0.0053  -0.0013  0.0045
GLY  0.0005  0.0002  0.0061  0.0000  -0.0013
HID  0.0014  0.0023  -0.0003  -0.0010  -0.0077
ILE  -0.0003  0.0002  -0.0021  0.0036  0.0059
LEU  -0.0005  0.0011  0.0027  0.0004  -0.0011
LYS  -0.0005  0.0011  -0.0041  -0.0004  -0.0012
MET  -0.0007  0.0019  -0.0012  0.0002  -0.0032
PHE  0.0002  0.0009  -0.0003  0.0003  -0.0018
PRO  0.0033  -0.0021  -0.0006  0.0003  -0.0010
SER  -0.0007  -0.0027  -0.0015  -0.0001  0.0080
THR  0.0001  0.0025  -0.0191  -0.0005  -0.0006
TRP  -0.0002  0.0063  0.0034  0.0002  0.0011
TYR  0.0008  -0.0104  -0.0115  -0.0001  0.0033
VAL  0.0011  -0.0012  -0.0037  -0.0004  -0.0005
Maximum  0.0033  0.0063  0.0061  0.0036  0.0247
Minimum  -0.0011  -0.0104  -0.0191  -0.0015  -0.0098

Table 5.9. Torsion and nonbond energies calculated by an encoded program with ff94 parameters for double amino acid test set.
Energies / kcal/mol  Dihedral  1_4_VdW  1_4_EEL  VdW  EEL
ALA_ALA  2.4541  2.4645  118.0755  -1.3252  -96.9829
ALA_ARG  2.5390  3.7265  -155.1786  -2.6131  57.5582
ALA_ASN  6.6826  3.0835  50.9307  -2.1937  -111.2274
ALA_ASP  2.4786  3.0685  139.1431  -1.9733  -122.5850
ALA_CYS  2.4849  2.6440  116.2346  -1.6243  -93.4271
ALA_GLN  6.7149  3.2556  53.4273  -2.2422  -96.5684
ALA_GLU  2.5180  3.2527  132.1015  -2.1393  -105.8521
ALA_GLY  1.6734  2.1374  123.1075  -0.9507  -105.6187
ALA_HID  2.4609  2.6784  106.8670  -2.7616  -95.8874
ALA_ILE  2.7927  4.5115  118.8190  -1.8707  -102.5848
ALA_LEU  2.5165  4.1994  97.7516  -1.9502  -98.8091
ALA_LYS  2.5381  3.4659  161.4802  -2.2229  -90.8739
ALA_MET  2.5177  2.8025  122.6702  -2.0665  -102.5814
ALA_PHE  2.4561  6.4551  121.5178  -2.9508  -103.4251
ALA_PRO  6.6337  2.9762  112.4849  -1.3731  -86.7326
ALA_SER  2.4803  2.8831  109.0765  -1.6101  -95.8092
ALA_THR  2.7028  3.2771  85.3268  -2.0097  -93.7234
ALA_TRP  2.4672  5.4390  139.2381  -3.8377  -109.2930
ALA_TYR  2.4550  6.4463  109.7341  -3.0755  -109.3333
ALA_VAL  2.5765  4.1802  95.8870  -1.9082  -95.3666
ARG_ARG  3.5989  5.0196  -464.1369  -4.2882  250.8879
ARG_ASN  7.6781  4.4892  -271.0586  -3.6803  63.7242
ARG_ASP  3.7294  4.0194  -187.4092  -3.5980  16.2763
ARG_CYS  3.5285  3.9415  -204.7689  -3.2842  80.1727
ARG_GLN  7.7672  4.5536  -265.7064  -3.8918  77.1582
ARG_GLU  3.5508  4.5561  -191.2274  -3.7826  35.4200
ARG_GLY  2.7520  3.3233  -195.2478  -2.4176  66.3483
ARG_HID  3.5023  3.9778  -213.3264  -4.4010  78.3238
ARG_ILE  4.0226  5.4434  -198.7507  -3.5590  67.2681
ARG_LEU  3.5625  5.4637  -221.5235  -3.5774  74.1001
ARG_LYS  3.5979  4.7507  -148.3798  -3.8962  103.1412
ARG_MET  3.5586  4.0921  -195.4928  -3.7156  68.8743
ARG_PHE  3.4939  7.7531  -197.9532  -4.5825  68.5464
ARG_PRO  8.1324  4.0358  -191.2188  -3.5445  73.4977
ARG_SER  3.6510  4.0405  -207.3283  -3.3388  71.8203
ARG_THR  3.9047  4.4901  -227.9708  -3.8193  69.9015
ARG_TRP  3.5026  7.0099  -176.9734  -5.6027  55.4917
ARG_TYR  3.4931  7.7412  -209.6550  -4.7163  62.5798
ARG_VAL  3.7927  5.0970  -218.4410  -3.5805  71.2631
ASN_ASN  11.0339  3.8413  -20.3290  -3.0611  -127.1591
ASN_ASP  6.8400  3.6823  68.0288  -3.1432  -135.1406
ASN_CYS  6.8886  3.2317  45.1966  -2.6885  -108.1918
ASN_GLN  11.1264  3.8336  -17.6676  -3.2636  -111.5403
ASN_GLU  6.8894  3.8464  60.8697  -3.2337  -117.7684
ASN_GLY  6.0943  2.5801  51.4823  -1.7582  -119.7158
ASN_HID  6.8701  3.2598  35.7919  -3.8110  -110.8099
ASN_ILE  7.1218  5.1757  47.8864  -3.0779  -117.5689
ASN_LEU  6.9373  4.7856  26.7859  -2.9730  -113.4098
ASN_LYS  6.9685  4.0641  90.7499  -3.1775  -107.9412
ASN_MET  6.9262  3.3775  51.5919  -3.1027  -116.9573
ASN_PHE  6.8602  7.0480  50.4911  -4.0217  -117.9304
ASN_PRO  12.7024  3.3361  42.5972  -3.2959  -105.3569
ASN_SER  6.8498  3.4902  38.0614  -2.7114  -110.4575
ASN_THR  7.1053  3.8595  14.5698  -2.9901  -110.8819
ASN_TRP  6.9157  6.0046  68.1078  -4.7112  -127.2461
ASN_TYR  6.9122  6.8035  36.8430  -3.7995  -122.2596
ASN_VAL  6.9041  4.8457  25.0682  -3.0967  -110.4704
ASP_ASP  3.1607  3.2354  144.2870  -3.4380  -78.8422
ASP_CYS  3.1835  2.8171  122.5938  -2.9579  -101.4878
ASP_GLN  7.3641  3.4838  58.9727  -3.5224  -106.3848
ASP_GLU  3.1665  3.4410  136.4192  -3.4401  -69.0539
ASP_GLY  2.0169  2.2660  126.9188  -1.1338  -116.5652
ASP_HID  3.1247  2.8672  112.6631  -4.0776  -105.7256
ASP_ILE  3.5392  4.6492  123.9353  -3.2977  -105.8915
ASP_LEU  2.9222  4.5808  103.5540  -2.6487  -108.6484
ASP_LYS  2.9092  3.8309  166.3842  -2.7405  -137.4003
ASP_MET  3.1749  3.0066  128.0064  -3.3565  -110.0100
ASP_PHE  3.1360  6.6599  127.2950  -4.3028  -110.1849
ASP_PRO  8.0789  3.2402  115.2603  -2.5744  -79.6765
ASP_SER  3.1873  3.0302  113.4443  -2.9438  -99.9177
ASP_THR  2.9886  3.6904  87.1147  -2.4968  -100.6664
ASP_TRP  3.1386  5.8814  146.8720  -5.3003  -120.3027
ASP_TYR  3.1347  6.6239  115.4293  -4.4321  -115.9908
ASP_VAL  3.3172  4.3338  99.9173  -3.3141  -97.6671
CYS_CYS  2.9858  2.5885  106.2667  -1.6170  -82.7387
CYS_GLN  7.1923  3.2100  43.3146  -2.2132  -86.1207
CYS_GLU  3.0456  3.2067  121.0599  -2.1566  -92.1949
CYS_GLY  2.1382  2.0243  112.8776  -0.8067  -94.5699
CYS_HID  2.9452  2.6232  96.8071  -2.7539  -85.4161
CYS_ILE  3.3271  4.4902  108.9223  -1.9375  -91.9022
CYS_LEU  3.0051  4.1647  87.7933  -1.9280  -88.1488
CYS_LYS  3.0124  3.4130  152.2366  -2.1891  -82.3759
CYS_MET  3.0035  2.7517  112.5730  -2.0482  -91.7585
CYS_PHE  2.9505  6.4087  111.4804  -2.9572  -92.6200
CYS_PRO  6.8531  3.1440  111.6524  -1.7290  -82.1681
CYS_SER  2.9942  2.8136  98.9303  -1.6165  -84.8319
CYS_THR  3.1399  3.2513  75.0342  -1.9495  -83.5785
CYS_TRP  2.9437  5.3702  128.8096  -3.7325  -99.0577
CYS_TYR  2.9509  6.3918  99.6718  -3.0864  -98.5395
CYS_VAL  3.1088  4.1469  85.9137  -1.9688  -84.6144
GLN_GLN  12.0125  3.8958  -29.4666  -3.4373  -79.1768
GLN_GLU  7.7871  3.8965  47.4288  -3.3190  -88.8434
GLN_GLY  6.9803  2.7353  40.1925  -2.0050  -88.5248
GLN_HID  7.8438  3.1434  22.3377  -3.9581  -74.7281
GLN_ILE  8.1410  4.7107  34.3822  -2.9965  -82.7077
GLN_LEU  7.8086  4.8222  14.9122  -3.1315  -81.6610
GLN_LYS  7.8463  4.1136  80.2541  -3.4401  -73.8852
GLN_MET  7.8095  3.4372  39.8187  -3.2595  -85.3447
GLN_PHE  7.8348  6.8875  36.9260  -4.1157  -83.3885
GLN_PRO  12.2978  3.4309  32.6702  -3.0442  -70.9782
GLN_SER  7.8505  3.3561  25.9878  -2.8537  -77.7492
GLN_THR  8.0483  3.8781  3.1903  -3.3084  -78.8237
GLN_TRP  7.8317  6.0764  56.6957  -5.1470  -94.4435
GLN_TYR  7.8354  6.8930  25.4572  -4.2521  -89.4582
GLN_VAL  7.9239  4.3761  12.1612  -3.0337  -76.3121
GLU_GLU  3.6441  3.6768  120.8392  -3.1749  -50.3009
GLU_GLY  2.7831  2.5925  113.6088  -1.8125  -85.6407
GLU_HID  3.5662  3.1174  97.7211  -3.7862  -76.3207
GLU_ILE  3.9177  4.9371  108.8633  -2.9739  -79.8023
GLU_LEU  3.6198  4.6441  88.9800  -2.9597  -78.8607
GLU_LYS  3.6384  3.9212  152.6513  -3.2500  -100.1574
GLU_MET  3.6275  3.2404  113.1587  -3.0892  -82.0002
GLU_PHE  3.5659  6.8793  112.2930  -3.9829  -82.5193
GLU_PRO  8.0638  3.2750  104.4112  -2.7713  -63.5008
GLU_SER  3.6084  3.3089  98.6877  -2.6488  -72.9980
GLU_THR  3.8444  3.7322  73.5535  -3.0288  -68.0000
GLU_TRP  3.6527  5.8363  129.3251  -4.9669  -84.5115
GLU_TYR  3.5642  6.8681  100.4585  -4.1154  -88.2992
GLU_VAL  3.7021  4.5986  85.0392  -2.9978  -71.8262
GLY_GLY  1.5000  1.6585  102.8303  -0.6707  -92.2244
GLY_HID  2.2982  1.9844  85.8240  -2.3846  -81.1907
GLY_ILE  2.6346  3.8167  98.1373  -1.4559  -88.3262
GLY_LEU  2.3558  3.5112  77.0045  -1.5708  -84.4465
GLY_LYS  2.3749  2.7742  143.0126  -1.8421  -78.7054
GLY_MET  2.3556  2.1040  101.7901  -1.6939  -88.0633
GLY_PHE  2.2971  5.7707  100.5821  -2.5727  -88.8199
GLY_PRO  5.9381  2.5873  96.9794  -1.0577  -77.4042
GLY_SER  2.3166  2.1843  88.2247  -1.2364  -81.3288
GLY_THR  2.5377  2.5896  64.4826  -1.6198  -79.3192
GLY_TRP  2.2977  4.7539  118.2816  -3.4389  -94.5748
GLY_TYR  2.2959  5.7412  88.7830  -2.6974  -94.7276
GLY_VAL  2.4186  3.4758  75.3282  -1.4954  -81.2503
HID_HID  2.7411  2.7769  82.2093  -5.3569  -80.8651
HID_ILE  3.3192  4.5185  94.0478  -3.7986  -91.7020
HID_LEU  2.7948  4.3076  73.2120  -4.5351  -83.5880
HID_LYS  2.8364  3.5823  138.0382  -4.7869  -77.3242
HID_MET  2.7975  2.9080  98.0181  -4.6417  -87.0351
HID_PHE  2.7464  6.5656  96.8807  -5.5492  -88.2746
HID_PRO  8.2597  2.8998  90.6641  -4.2738  -81.9202
HID_SER  2.7973  2.9601  84.2997  -4.1701  -80.5401
HID_THR  3.0268  3.5179  62.7066  -4.5594  -85.3626
HID_TRP  3.0908  5.8468  116.3558  -5.7944  -103.4417
HID_TYR  2.7466  6.5465  85.0769  -5.6854  -94.2045
HID_VAL  2.9019  3.9920  70.9571  -4.1012  -83.0132
ILE_ILE  3.9864  6.1954  84.5272  -2.7266  -77.4666
ILE_LEU  3.7072  5.8961  63.4315  -2.7114  -73.6186
ILE_LYS  3.7414  5.1632  130.3178  -3.0124  -68.8156
ILE_MET  3.7185  4.4992  88.1964  -2.8384  -77.2333
ILE_PHE  3.6532  8.1476  86.8547  -3.7223  -77.8879
ILE_PRO  8.4893  4.3784  84.7914  -2.4266  -67.7552
ILE_SER  3.7002  4.5634  74.5719  -2.3869  -70.5831
ILE_THR  3.9212  4.9769  51.2479  -2.8167  -68.8712
ILE_TRP  3.6978  7.1315  104.8627  -4.6951  -84.3497
ILE_TYR  3.6557  8.1320  75.0594  -3.8572  -83.7759
ILE_VAL  3.7748  5.8484  61.7970  -2.7504  -70.4904
LEU_LEU  3.5751  5.7919  63.1293  -3.1584  -93.7915
LEU_LYS  3.6360  5.0654  128.5000  -3.4701  -87.1845
LEU_MET  3.5843  4.3928  87.9066  -3.2875  -97.4388
LEU_PHE  3.5154  8.0419  86.5862  -4.1762  -98.1084
LEU_PRO  8.5188  4.4980  80.8828  -2.9783  -83.8714
LEU_SER  3.5617  4.4836  74.3449  -2.8578  -90.8153
LEU_THR  3.8547  4.8451  51.0380  -3.2704  -89.1047
LEU_TRP  4.6167  6.7250  102.9661  -4.9713  -102.9845
LEU_TYR  3.5145  8.0406  74.8016  -4.3098  -103.9483
LEU_VAL  3.5938  5.7632  61.4935  -3.2261  -90.6281
LYS_LYS  3.5837  4.4076  171.7248  -3.4583  -46.3914
LYS_MET  3.5439  3.7477  125.0376  -3.2698  -80.5089
LYS_PHE  3.6336  7.2059  121.1888  -4.1676  -79.4678
LYS_PRO  8.1071  3.7199  128.3801  -3.1028  -76.3896
LYS_SER  3.5235  3.8554  113.4248  -2.8400  -77.7344
LYS_THR  3.8315  4.1794  92.4664  -3.3389  -79.9144
LYS_TRP  3.6313  6.3921  141.9639  -5.1972  -91.5786
LYS_TYR  3.4805  7.3876  110.9817  -4.2773  -86.8944
LYS_VAL  3.6041  5.1130  102.6601  -3.1837  -79.3747
MET_MET  3.5960  2.9323  103.7658  -3.1716  -87.6575
MET_PHE  3.6690  6.4674  101.1667  -4.1196  -85.1823
MET_PRO  8.0672  2.9715  97.8373  -3.0239  -75.7004
MET_SER  3.5540  3.0426  90.2650  -2.7477  -81.1390
MET_THR  3.8860  3.3612  66.9389  -3.2329  -79.1961
MET_TRP  3.6729  5.6313  120.8851  -5.1474  -95.9502
MET_TYR  3.5259  6.5691  90.6895  -4.1833  -94.2637
MET_VAL  3.6244  4.3085  77.2636  -3.0982  -80.8930
PHE_PHE  2.9438  10.1367  107.2588  -6.0510  -92.2948
PHE_PRO  7.5244  6.5883  101.7166  -4.9261  -80.4459
PHE_SER  3.0410  6.4996  94.6164  -4.6578  -84.3437
PHE_THR  2.9420  7.1106  70.5430  -4.8268  -84.2807
PHE_TRP  2.6765  9.1694  123.8937  -6.5484  -98.3989
PHE_TYR  2.6793  9.9424  92.6308  -5.6331  -93.3411
PHE_VAL  3.1880  7.7654  81.6957  -5.0487  -84.0391
PRO_PRO  9.4988  2.9215  73.0478  -2.0934  -57.4202
PRO_SER  5.5616  2.7334  63.5137  -2.2047  -60.4104
PRO_THR  5.7983  3.1318  39.6713  -2.6288  -58.0718
PRO_TRP  5.5485  5.2737  93.7747  -4.4444  -73.7855
PRO_TYR  5.5344  6.2878  64.1343  -3.6779  -73.8010
PRO_VAL  5.6571  4.0296  50.5453  -2.5153  -60.2468
SER_SER  2.7780  2.8641  62.7424  -1.3467  -58.8199
SER_THR  2.8729  3.3333  38.5876  -1.7058  -57.2738
SER_TRP  2.7289  5.4444  92.3778  -3.4552  -72.8540
SER_TYR  2.7434  6.4582  63.4548  -2.8300  -72.4031
SER_VAL  2.8921  4.2053  49.9412  -1.6923  -58.8025
THR_THR  2.9842  4.3523  -0.0002  -2.5669  -32.3897
THR_TRP  2.7521  6.5861  54.2923  -4.4176  -48.8353
THR_TYR  2.6783  7.5370  24.0810  -3.6018  -48.1746
THR_VAL  2.8136  5.2687  10.7400  -2.4526  -34.8625
TRP_TRP  2.6206  8.4303  138.7384  -7.5334  -108.2350
TRP_TYR  2.6155  9.2108  107.4267  -6.6273  -103.1310
TRP_VAL  2.6460  6.7610  94.2838  -5.5834  -90.5809
TYR_TYR  2.6631  9.8961  80.3178  -5.8225  -99.2140
TYR_VAL  3.1168  7.7432  69.4624  -5.3466  -89.2782
VAL_VAL  3.5552  5.4546  21.9492  -2.6268  -46.6209

Table 5.10. Torsion and nonbond energies calculated by Amber software with ff94 parameters for double amino acid test set.
Energy difference / kcal/mol Dihedral 1_4_VdW 1_4_EEL VdW EEL ALA_ALA 2.4544 2.4627 118.0785 -1.3248 -96.9775 ALA_ARG 2.5391 3.7338 -155.1922 -2.6129 57.5772 ALA_ASN 6.6843 3.0813 50.9371 -2.1954 -111.2287 ALA_ASP 2.4792 3.0652 139.1533 -1.9718 -122.5947 ALA_CYS 2.4851 2.6409 116.2341 -1.6264 -93.4288 ALA_GLN 6.7146 3.2590 53.4295 -2.2435 -96.5693 ALA_GLU 2.5170 3.2510 132.1055 -2.1362 -105.8683 ALA_GLY 1.6730 2.1359 123.1055 -0.9506 -105.6109 ALA_HID 2.4599 2.6777 106.8644 -2.7612 -95.8808 ALA_ILE 2.7934 4.5113 118.8296 -1.8709 -102.5990 ALA_LEU 2.5162 4.2023 97.7498 -1.9472 -98.8072 ALA_LYS 2.5390 3.4660 161.4899 -2.2236 -90.8720 ALA_MET 2.5178 2.7988 122.6631 -2.0685 -102.5792 ALA_PHE 2.4571 6.4543 121.5168 -2.9501 -103.4290 ALA_PRO 6.6365 2.9748 112.4853 -1.3720 -86.7384 ALA_SER 2.4817 2.8791 109.0744 -1.6111 -95.8011 ALA_THR 2.7032 3.2769 85.3299 -2.0119 -93.7124 ALA_TRP 2.4685 5.4374 139.2497 -3.8380 -109.2967 ALA_TYR 2.4567 6.4398 109.7249 -3.0767 -109.3268 ALA_VAL 2.5777 4.1744 95.8936 -1.9052 -95.3793 ARG_ARG 3.5997 5.0160 -464.2176 -4.2875 250.9398 ARG_ASN 7.6774 4.4885 -271.0645 -3.6819 63.7411 ARG_ASP 3.7284 4.0150 -187.3786 -3.5989 16.2760 ARG_CYS 3.5283 3.9402 -204.7583 -3.2870 80.1704 ARG_GLN 7.7688 4.5507 -265.7263 -3.8927 77.1420 ARG_GLU 3.5495 4.5547 -191.1995 -3.7789 35.4041 ARG_GLY 2.7524 3.3212 -195.2521 -2.4169 66.3516 ARG_HID 3.5017 3.9840 -213.3195 -4.4024 78.3416 ARG_ILE 4.0238 5.4497 -198.7196 -3.5621 67.2394 ARG_LEU 3.5602 5.4658 -221.4991 -3.5768 74.0706 ARG_LYS 3.5993 4.7538 -148.3877 -3.9016 103.1335 ARG_MET 3.5566 4.0948 -195.5267 -3.7152 68.8839 ARG_PHE 3.4919 7.7516 -197.9175 -4.5855 68.4979 ARG_PRO 8.1292 4.0352 -191.2415 -3.5461 73.5166 ARG_SER 3.6525 4.0429 -207.3750 -3.3379 71.8319 ARG_THR 3.9021 4.4818 -227.9744 -3.8195 69.8984 ARG_TRP 3.5026 7.0102 -177.0093 -5.6061 55.4896 ARG_TYR 3.4934 7.7420 -209.6527 -4.7161 62.5689 ARG_VAL 3.7929 5.1030 -218.4348 -3.5826 71.2634 ASN_ASN 11.0334 3.8400 -20.3347 -3.0666 -127.1316 ASN_ASP 
6.8370 3.6817 68.0145 -3.1457 -135.1461 ASN_CYS 6.8863 3.2332 45.2105 -2.6877 -108.2019 ASN_GLN 11.1274 3.8422 -17.6776 -3.2635 -111.5313 ASN_GLU 6.8896 3.8416 60.8696 -3.2354 -117.7836 ASN_GLY 6.0926 2.5825 51.4769 -1.7563 -119.7288 ASN_HID 6.8698 3.2619 35.7755 -3.8100 -110.8445 ASN_ILE 7.1217 5.1775 47.8983 -3.0761 -117.5813 ASN_LEU 6.9374 4.7896 26.7930 -2.9731 -113.4332 ASN_LYS 6.9672 4.0655 90.7454 -3.1776 -107.9458 !107 Table 5.10 . (contÕd) ASN_MET 6.9241 3.3836 51.5909 -3.1037 -116.9608 ASN_PHE 6.8606 7.0463 50.4930 -4.0226 -117.9426 ASN_PRO 12.7031 3.3373 42.6040 -3.3021 -105.3404 ASN_SER 6.8512 3.4881 38.0507 -2.7182 -110.4764 ASN_THR 7.1072 3.8617 14.5607 -2.9917 -110.8854 ASN_TRP 6.9151 6.0066 68.1072 -4.7071 -127.2454 ASN_TYR 6.9142 6.7951 36.8455 -3.8035 -122.2804 ASN_VAL 6.9034 4.8491 25.0748 -3.0930 -110.4792 ASP_ASP 3.1600 3.2306 144.2811 -3.4409 -78.8267 ASP_CYS 3.1859 2.8250 122.5939 -2.9569 -101.5003 ASP_GLN 7.3615 3.4733 58.9701 -3.5203 -106.3775 ASP_GLU 3.1683 3.4438 136.4099 -3.4423 -69.0742 ASP_GLY 2.0171 2.2736 126.9328 -1.1311 -116.5589 ASP_HID 3.1246 2.8712 112.6762 -4.0788 -105.7152 ASP_ILE 3.5394 4.6457 123.9222 -3.2958 -105.8612 ASP_LEU 2.9237 4.5787 103.5521 -2.6510 -108.6535 ASP_LYS 2.9100 3.8307 166.3765 -2.7342 -137.4193 ASP_MET 3.1772 3.0039 127.9881 -3.3566 -109.9989 ASP_PHE 3.1335 6.6542 127.2905 -4.3022 -110.1798 ASP_PRO 8.0744 3.2427 115.2702 -2.5702 -79.6948 ASP_SER 3.1877 3.0286 113.4609 -2.9425 -99.9249 ASP_THR 2.9870 3.6916 87.0965 -2.4957 -100.6566 ASP_TRP 3.1385 5.8781 146.8618 -5.2980 -120.2735 ASP_TYR 3.1348 6.6333 115.4404 -4.4336 -115.9782 ASP_VAL 3.3180 4.3218 99.9057 -3.3178 -97.6521 CYS_CYS 2.9818 2.5847 106.2724 -1.6156 -82.7409 CYS_GLN 7.1915 3.2086 43.3375 -2.2185 -86.1317 CYS_GLU 3.0453 3.2110 121.0603 -2.1576 -92.1871 CYS_GLY 2.1393 2.0234 112.8800 -0.8104 -94.5822 CYS_HID 2.9454 2.6218 96.7936 -2.7541 -85.4031 CYS_ILE 3.3274 4.4852 108.9150 -1.9379 -91.8907 CYS_LEU 3.0044 4.1645 87.7902 -1.9265 -88.1597 
CYS_LYS 3.0131 3.4162 152.2474 -2.1883 -82.3812 CYS_MET 3.0019 2.7471 112.5761 -2.0484 -91.7541 CYS_PHE 2.9505 6.4050 111.4871 -2.9550 -92.6174 CYS_PRO 6.8521 3.1381 111.6451 -1.7317 -82.1610 CYS_SER 2.9935 2.8133 98.9338 -1.6176 -84.8356 CYS_THR 3.1392 3.2624 75.0432 -1.9480 -83.5826 CYS_TRP 2.9426 5.3762 128.8149 -3.7309 -99.0455 CYS_TYR 2.9525 6.3886 99.6839 -3.0857 -98.5499 CYS_VAL 3.1089 4.1492 85.9173 -1.9668 -84.6098 GLN_GLN 12.0110 3.8942 -29.4624 -3.4372 -79.1611 GLN_GLU 7.7898 3.8917 47.4497 -3.3223 -88.8583 GLN_GLY 6.9762 2.7326 40.1859 -2.0034 -88.5262 GLN_HID 7.8450 3.1454 22.3377 -3.9603 -74.7360 GLN_ILE 8.1417 4.7129 34.3919 -2.9997 -82.7214 GLN_LEU 7.8088 4.8238 14.9147 -3.1325 -81.6566 GLN_LYS 7.8437 4.1130 80.2583 -3.4370 -73.8935 GLN_MET 7.8068 3.4379 39.8478 -3.2607 -85.3444 GLN_PHE 7.8342 6.8882 36.9509 -4.1175 -83.3861 GLN_PRO 12.3028 3.4305 32.6720 -3.0451 -70.9599 !108 Table 5.10 . (contÕd) GLN_SER 7.8519 3.3569 25.9844 -2.8528 -77.7516 GLN_THR 8.0497 3.8803 3.1854 -3.3087 -78.8275 GLN_TRP 7.8325 6.0791 56.7260 -5.1509 -94.4490 GLN_TYR 7.8369 6.8984 25.4729 -4.2536 -89.4721 GLN_VAL 7.9232 4.3765 12.1859 -3.0318 -76.3078 GLU_GLU 3.6440 3.6797 120.8308 -3.1756 -50.2902 GLU_GLY 2.7825 2.5935 113.5969 -1.8134 -85.6298 GLU_HID 3.5660 3.1174 97.7237 -3.7887 -76.3215 GLU_ILE 3.9172 4.9340 108.8781 -2.9737 -79.8164 GLU_LEU 3.6213 4.6447 88.9770 -2.9630 -78.8664 GLU_LYS 3.6432 3.9187 152.6442 -3.2493 -100.1717 GLU_MET 3.6271 3.2411 113.1626 -3.0886 -82.0180 GLU_PHE 3.5658 6.8935 112.2770 -3.9848 -82.5250 GLU_PRO 8.0626 3.2689 104.3902 -2.7755 -63.4770 GLU_SER 3.6104 3.3156 98.6891 -2.6492 -73.0100 GLU_THR 3.8443 3.7273 73.5461 -3.0266 -67.9961 GLU_TRP 3.6489 5.8355 129.3415 -4.9667 -84.4984 GLU_TYR 3.5664 6.8780 100.4523 -4.1151 -88.2858 GLU_VAL 3.7046 4.6008 85.0406 -2.9988 -71.8135 GLY_GLY 1.5000 1.6571 102.8327 -0.6700 -92.2307 GLY_HID 2.2986 1.9853 85.8342 -2.3853 -81.1880 GLY_ILE 2.6343 3.8187 98.1352 -1.4533 -88.3262 GLY_LEU 2.3559 3.5132 
77.0051 -1.5720 -84.4497 GLY_LYS 2.3755 2.7771 143.0221 -1.8434 -78.7119 GLY_MET 2.3558 2.1078 101.7965 -1.6936 -88.0565 GLY_PHE 2.2967 5.7624 100.5767 -2.5726 -88.8228 GLY_PRO 5.9399 2.5833 96.9601 -1.0596 -77.3908 GLY_SER 2.3163 2.1858 88.2293 -1.2359 -81.3223 GLY_THR 2.5366 2.5890 64.4655 -1.6185 -79.3202 GLY_TRP 2.2977 4.7396 118.2748 -3.4395 -94.5778 GLY_TYR 2.2958 5.7470 88.7782 -2.6979 -94.7217 GLY_VAL 2.4178 3.4846 75.3170 -1.4933 -81.2464 HID_HID 2.7414 2.7761 82.1977 -5.3610 -80.8640 HID_ILE 3.3216 4.5215 94.0431 -3.8031 -91.7022 HID_LEU 2.7939 4.3138 73.2260 -4.5333 -83.5980 HID_LYS 2.8377 3.5796 138.0366 -4.7862 -77.3246 HID_MET 2.7976 2.8991 97.9979 -4.6444 -87.0275 HID_PHE 2.7464 6.5613 96.8854 -5.5503 -88.2734 HID_PRO 8.2695 2.9031 90.6678 -4.2754 -81.9213 HID_SER 2.7989 2.9605 84.2969 -4.1717 -80.5334 HID_THR 3.0262 3.5205 62.7036 -4.5587 -85.3689 HID_TRP 3.0918 5.8445 116.3516 -5.7936 -103.4272 HID_TYR 2.7477 6.5433 85.0745 -5.6842 -94.2101 HID_VAL 2.9015 3.9975 70.9666 -4.0980 -83.0187 ILE_ILE 3.9893 6.1898 84.5022 -2.7263 -77.4548 ILE_LEU 3.7068 5.8945 63.4259 -2.7123 -73.6242 ILE_LYS 3.7416 5.1646 130.3269 -3.0102 -68.8154 ILE_MET 3.7173 4.4936 88.1951 -2.8379 -77.2311 ILE_PHE 3.6519 8.1484 86.8631 -3.7280 -77.8917 ILE_PRO 8.4908 4.3771 84.7955 -2.4262 -67.7631 ILE_SER 3.6965 4.5701 74.5959 -2.3879 -70.5793 !109 Table 5.10 . 
(contÕd) ILE_THR 3.9209 4.9757 51.2420 -2.8164 -68.8646 ILE_TRP 3.7009 7.1316 104.8576 -4.6936 -84.3266 ILE_TYR 3.6532 8.1342 75.0667 -3.8582 -83.7737 ILE_VAL 3.7771 5.8522 61.7753 -2.7475 -70.4835 LEU_LEU 3.5758 5.7878 63.1173 -3.1598 -93.7925 LEU_LYS 3.6350 5.0664 128.4960 -3.4687 -87.1873 LEU_MET 3.5843 4.3960 87.9149 -3.2904 -97.4306 LEU_PHE 3.5127 8.0506 86.5882 -4.1764 -98.1057 LEU_PRO 8.5212 4.4980 80.8836 -2.9817 -83.8724 LEU_SER 3.5623 4.4851 74.3365 -2.8558 -90.8023 LEU_THR 3.8552 4.8466 51.0338 -3.2705 -89.1112 LEU_TRP 4.6172 6.7301 102.9716 -4.9758 -102.9842 LEU_TYR 3.5157 8.0369 74.7967 -4.3086 -103.9834 LEU_VAL 3.5903 5.7649 61.4958 -3.2258 -90.6297 LYS_LYS 3.5815 4.4049 171.7302 -3.4559 -46.4067 LYS_MET 3.5450 3.7468 125.0406 -3.2716 -80.5012 LYS_PHE 3.6350 7.2086 121.1929 -4.1695 -79.4669 LYS_PRO 8.1075 3.7240 128.3883 -3.1022 -76.4028 LYS_SER 3.5211 3.8529 113.4396 -2.8401 -77.7377 LYS_THR 3.8299 4.1699 92.4868 -3.3378 -79.8854 LYS_TRP 3.6354 6.3872 141.9460 -5.1986 -91.5772 LYS_TYR 3.4809 7.3927 110.9881 -4.2775 -86.8880 LYS_VAL 3.6019 5.1101 102.6591 -3.1806 -79.3695 MET_MET 3.5965 2.9354 103.7696 -3.1733 -87.6672 MET_PHE 3.6705 6.4660 101.1788 -4.1244 -85.1871 MET_PRO 8.0664 2.9726 97.8380 -3.0201 -75.6937 MET_SER 3.5577 3.0452 90.2541 -2.7482 -81.1389 MET_THR 3.8853 3.3582 66.9378 -3.2337 -79.1920 MET_TRP 3.6734 5.6287 120.8928 -5.1500 -95.9500 MET_TYR 3.5277 6.5796 90.6851 -4.1868 -94.2762 MET_VAL 3.6268 4.3164 77.2806 -3.0987 -80.9085 PHE_PHE 2.9417 10.1371 107.2569 -6.0531 -92.3045 PHE_PRO 7.5262 6.5884 101.7216 -4.9263 -80.4541 PHE_SER 3.0406 6.4960 94.6190 -4.6570 -84.3487 PHE_THR 2.9373 7.1091 70.5480 -4.8241 -84.2736 PHE_TRP 2.6793 9.1659 123.9031 -6.5492 -98.3957 PHE_TYR 2.6789 9.9335 92.6355 -5.6323 -93.3557 PHE_VAL 3.1874 7.7650 81.6879 -5.0488 -84.0534 PRO_PRO 9.4985 2.9231 73.0407 -2.0938 -57.4253 PRO_SER 5.5604 2.7348 63.5045 -2.2033 -60.4158 PRO_THR 5.7972 3.1312 39.6629 -2.6273 -58.0726 PRO_TRP 5.5493 5.2889 93.7669 -4.4449 
-73.7840
PRO_TYR 5.5369 6.2911 64.1512 -3.6784 -73.8046
PRO_VAL 5.6553 4.0326 50.5590 -2.5144 -60.2538
SER_SER 2.7792 2.8633 62.7370 -1.3553 -58.8400
SER_THR 2.8733 3.3342 38.5763 -1.7091 -57.2605
SER_TRP 2.7291 5.4406 92.3879 -3.4544 -72.8563
SER_TYR 2.7413 6.4489 63.4493 -2.8300 -72.4056
SER_VAL 2.8957 4.2003 49.9344 -1.6929 -58.7845
THR_THR 2.9849 4.3608 0.0107 -2.5675 -32.4047
THR_TRP 2.7528 6.5830 54.2738 -4.4163 -48.8367
THR_TYR 2.6785 7.5363 24.0798 -3.6039 -48.1608
THR_VAL 2.8117 5.2668 10.7514 -2.4545 -34.8742
TRP_TRP 2.6185 8.4391 138.7433 -7.5365 -108.2237
TRP_TYR 2.6174 9.2155 107.4061 -6.6271 -103.1196
TRP_VAL 2.6448 6.7645 94.2846 -5.5814 -90.5867
TYR_TYR 2.6629 9.8953 80.3259 -5.8205 -99.2277
TYR_VAL 3.1148 7.7502 69.4697 -5.3431 -89.2894
VAL_VAL 3.5552 5.4500 21.9507 -2.6293 -46.6316

Table 5.11. Comparisons of torsion and nonbond energies calculated by the encoded program and Amber software with ff94 parameters for double amino acid test set.
Energy difference / kcal/mol Dihedral 1_4_VdW 1_4_EEL VdW EEL ALA_ALA -0.0003 0.0018 -0.0030 -0.0004 -0.0054 ALA_ARG -0.0001 -0.0073 0.0136 -0.0002 -0.0190 ALA_ASN -0.0017 0.0022 -0.0064 0.0017 0.0013 ALA_ASP -0.0006 0.0033 -0.0102 -0.0015 0.0097 ALA_CYS -0.0002 0.0031 0.0005 0.0021 0.0017 ALA_GLN 0.0003 -0.0034 -0.0022 0.0013 0.0009 ALA_GLU 0.0010 0.0017 -0.0040 -0.0031 0.0162 ALA_GLY 0.0004 0.0015 0.0020 -0.0001 -0.0078 ALA_HID 0.0010 0.0007 0.0026 -0.0004 -0.0066 ALA_ILE -0.0007 0.0002 -0.0106 0.0002 0.0142 ALA_LEU 0.0003 -0.0029 0.0018 -0.0030 -0.0019 ALA_LYS -0.0009 -0.0001 -0.0097 0.0007 -0.0019 ALA_MET -0.0001 0.0037 0.0071 0.0020 -0.0022 ALA_PHE -0.0010 0.0008 0.0010 -0.0007 0.0039 ALA_PRO -0.0028 0.0014 -0.0004 -0.0011 0.0058 ALA_SER -0.0014 0.0040 0.0021 0.0010 -0.0081 ALA_THR -0.0004 0.0002 -0.0031 0.0022 -0.0110 ALA_TRP -0.0013 0.0016 -0.0116 0.0003 0.0037 ALA_TYR -0.0017 0.0065 0.0092 0.0012 -0.0065 ALA_VAL -0.0012 0.0058 -0.0066 -0.0030 0.0127 ARG_ARG -0.0008 0.0036 0.0807 -0.0007 -0.0519 ARG_ASN 0.0007 0.0007 0.0059 0.0016 -0.0169 ARG_ASP 0.0010 0.0044 -0.0306 0.0009 0.0003 ARG_CYS 0.0002 0.0013 -0.0106 0.0028 0.0023 ARG_GLN -0.0016 0.0029 0.0199 0.0009 0.0162 ARG_GLU 0.0013 0.0014 -0.0279 -0.0037 0.0159 ARG_GLY -0.0004 0.0021 0.0043 -0.0007 -0.0033 ARG_HID 0.0006 -0.0062 -0.0069 0.0014 -0.0178 ARG_ILE -0.0012 -0.0063 -0.0311 0.0031 0.0287 ARG_LEU 0.0023 -0.0021 -0.0244 -0.0006 0.0295 ARG_LYS -0.0014 -0.0031 0.0079 0.0054 0.0077 ARG_MET 0.0020 -0.0027 0.0339 -0.0004 -0.0096 ARG_PHE 0.0020 0.0015 -0.0357 0.0030 0.0485 ARG_PRO 0.0032 0.0006 0.0227 0.0016 -0.0189 ARG_SER -0.0015 -0.0024 0.0467 -0.0009 -0.0116 ARG_THR 0.0026 0.0083 0.0036 0.0002 0.0031 ARG_TRP 0.0000 -0.0003 0.0359 0.0034 0.0021 ARG_TYR -0.0003 -0.0008 -0.0023 -0.0002 0.0109 ARG_VAL -0.0002 -0.0060 -0.0062 0.0021 -0.0003 ASN_ASN 0.0005 0.0013 0.0057 0.0055 -0.0275 ASN_ASP 0.0030 0.0006 0.0143 0.0025 0.0055 ASN_CYS 0.0023 -0.0015 -0.0139 -0.0008 0.0101 ASN_GLN -0.0010 -0.0086 0.0100 
-0.0001 -0.0090 ASN_GLU -0.0002 0.0048 0.0001 0.0017 0.0152 ASN_GLY 0.0017 -0.0024 0.0054 -0.0019 0.0130 ASN_HID 0.0003 -0.0021 0.0164 -0.0010 0.0346 ASN_ILE 0.0001 -0.0018 -0.0119 -0.0018 0.0124 ASN_LEU -0.0001 -0.0040 -0.0071 0.0001 0.0234 ASN_LYS 0.0013 -0.0014 0.0045 0.0001 0.0046 !112 Table 5.11 . (contÕd) ASN_MET 0.0021 -0.0061 0.0010 0.0010 0.0035 ASN_PHE -0.0004 0.0017 -0.0019 0.0009 0.0122 ASN_PRO -0.0007 -0.0012 -0.0068 0.0062 -0.0165 ASN_SER -0.0014 0.0021 0.0107 0.0068 0.0189 ASN_THR -0.0019 -0.0022 0.0091 0.0016 0.0035 ASN_TRP 0.0006 -0.0020 0.0006 -0.0041 -0.0007 ASN_TYR -0.0020 0.0084 -0.0025 0.0040 0.0208 ASN_VAL 0.0007 -0.0034 -0.0066 -0.0037 0.0088 ASP_ASP 0.0007 0.0048 0.0059 0.0029 -0.0155 ASP_CYS -0.0024 -0.0079 -0.0001 -0.0010 0.0125 ASP_GLN 0.0026 0.0105 0.0026 -0.0021 -0.0073 ASP_GLU -0.0018 -0.0028 0.0093 0.0022 0.0203 ASP_GLY -0.0002 -0.0076 -0.0140 -0.0027 -0.0063 ASP_HID 0.0001 -0.0040 -0.0131 0.0012 -0.0104 ASP_ILE -0.0002 0.0035 0.0131 -0.0019 -0.0303 ASP_LEU -0.0015 0.0021 0.0019 0.0023 0.0051 ASP_LYS -0.0008 0.0002 0.0077 -0.0063 0.0190 ASP_MET -0.0023 0.0027 0.0183 0.0001 -0.0111 ASP_PHE 0.0025 0.0057 0.0045 -0.0006 -0.0051 ASP_PRO 0.0045 -0.0025 -0.0099 -0.0042 0.0183 ASP_SER -0.0004 0.0016 -0.0166 -0.0013 0.0072 ASP_THR 0.0016 -0.0012 0.0182 -0.0011 -0.0098 ASP_TRP 0.0001 0.0033 0.0102 -0.0023 -0.0292 ASP_TYR -0.0001 -0.0094 -0.0111 0.0015 -0.0126 ASP_VAL -0.0008 0.0120 0.0116 0.0037 -0.0150 CYS_CYS 0.0040 0.0038 -0.0057 -0.0014 0.0022 CYS_GLN 0.0008 0.0014 -0.0229 0.0053 0.0110 CYS_GLU 0.0003 -0.0043 -0.0004 0.0010 -0.0078 CYS_GLY -0.0011 0.0009 -0.0024 0.0037 0.0123 CYS_HID -0.0002 0.0014 0.0135 0.0002 -0.0130 CYS_ILE -0.0003 0.0050 0.0073 0.0004 -0.0115 CYS_LEU 0.0007 0.0002 0.0031 -0.0015 0.0109 CYS_LYS -0.0007 -0.0032 -0.0108 -0.0008 0.0053 CYS_MET 0.0016 0.0046 -0.0031 0.0002 -0.0044 CYS_PHE 0.0000 0.0037 -0.0067 -0.0022 -0.0026 CYS_PRO 0.0010 0.0059 0.0073 0.0027 -0.0071 CYS_SER 0.0007 0.0003 -0.0035 0.0011 0.0037 CYS_THR 
0.0007 -0.0111 -0.0090 -0.0015 0.0041 CYS_TRP 0.0011 -0.0060 -0.0053 -0.0016 -0.0122 CYS_TYR -0.0016 0.0032 -0.0121 -0.0007 0.0104 CYS_VAL -0.0001 -0.0023 -0.0036 -0.0020 -0.0046 GLN_GLN 0.0015 0.0016 -0.0042 -0.0001 -0.0157 GLN_GLU -0.0027 0.0048 -0.0209 0.0033 0.0149 GLN_GLY 0.0041 0.0027 0.0066 -0.0016 0.0014 GLN_HID -0.0012 -0.0020 0.0000 0.0022 0.0079 GLN_ILE -0.0007 -0.0022 -0.0097 0.0032 0.0137 GLN_LEU -0.0002 -0.0016 -0.0025 0.0010 -0.0044 GLN_LYS 0.0026 0.0006 -0.0042 -0.0031 0.0083 GLN_MET 0.0027 -0.0007 -0.0291 0.0012 -0.0003 GLN_PHE 0.0006 -0.0007 -0.0249 0.0018 -0.0024 GLN_PRO -0.0050 0.0004 -0.0018 0.0009 -0.0183 !113 Table 5.11 . (contÕd) GLN_SER -0.0014 -0.0008 0.0034 -0.0009 0.0024 GLN_THR -0.0014 -0.0022 0.0049 0.0003 0.0038 GLN_TRP -0.0008 -0.0027 -0.0303 0.0039 0.0055 GLN_TYR -0.0015 -0.0054 -0.0157 0.0015 0.0139 GLN_VAL 0.0007 -0.0004 -0.0247 -0.0019 -0.0043 GLU_GLU 0.0001 -0.0029 0.0084 0.0007 -0.0107 GLU_GLY 0.0006 -0.0010 0.0119 0.0009 -0.0109 GLU_HID 0.0002 0.0000 -0.0026 0.0025 0.0008 GLU_ILE 0.0005 0.0031 -0.0148 -0.0002 0.0141 GLU_LEU -0.0015 -0.0006 0.0030 0.0033 0.0057 GLU_LYS -0.0048 0.0025 0.0071 -0.0007 0.0143 GLU_MET 0.0004 -0.0007 -0.0039 -0.0006 0.0178 GLU_PHE 0.0001 -0.0142 0.0160 0.0019 0.0057 GLU_PRO 0.0012 0.0061 0.0210 0.0042 -0.0238 GLU_SER -0.0020 -0.0067 -0.0014 0.0004 0.0120 GLU_THR 0.0001 0.0049 0.0074 -0.0022 -0.0039 GLU_TRP 0.0038 0.0008 -0.0164 -0.0002 -0.0131 GLU_TYR -0.0022 -0.0099 0.0062 -0.0003 -0.0134 GLU_VAL -0.0025 -0.0022 -0.0014 0.0010 -0.0127 GLY_GLY 0.0000 0.0014 -0.0024 -0.0007 0.0063 GLY_HID -0.0004 -0.0009 -0.0102 0.0007 -0.0027 GLY_ILE 0.0003 -0.0020 0.0021 -0.0026 0.0000 GLY_LEU -0.0001 -0.0020 -0.0006 0.0012 0.0032 GLY_LYS -0.0006 -0.0029 -0.0095 0.0013 0.0065 GLY_MET -0.0002 -0.0038 -0.0064 -0.0003 -0.0068 GLY_PHE 0.0004 0.0083 0.0054 -0.0001 0.0029 GLY_PRO -0.0018 0.0040 0.0193 0.0019 -0.0134 GLY_SER 0.0003 -0.0015 -0.0046 -0.0005 -0.0065 GLY_THR 0.0011 0.0006 0.0171 -0.0013 0.0010 GLY_TRP 0.0000 
0.0143 0.0068 0.0006 0.0030 GLY_TYR 0.0001 -0.0058 0.0048 0.0005 -0.0059 GLY_VAL 0.0008 -0.0088 0.0112 -0.0021 -0.0039 HID_HID -0.0003 0.0008 0.0116 0.0041 -0.0011 HID_ILE -0.0024 -0.0030 0.0047 0.0045 0.0002 HID_LEU 0.0009 -0.0062 -0.0140 -0.0018 0.0100 HID_LYS -0.0013 0.0027 0.0016 -0.0007 0.0004 HID_MET -0.0001 0.0089 0.0202 0.0027 -0.0076 HID_PHE 0.0000 0.0043 -0.0047 0.0011 -0.0012 HID_PRO -0.0098 -0.0033 -0.0037 0.0016 0.0011 HID_SER -0.0016 -0.0004 0.0028 0.0016 -0.0067 HID_THR 0.0006 -0.0026 0.0030 -0.0007 0.0063 HID_TRP -0.0010 0.0023 0.0042 -0.0008 -0.0145 HID_TYR -0.0011 0.0032 0.0024 -0.0012 0.0056 HID_VAL 0.0004 -0.0055 -0.0095 -0.0032 0.0055 ILE_ILE -0.0029 0.0056 0.0250 -0.0003 -0.0118 ILE_LEU 0.0004 0.0016 0.0056 0.0009 0.0056 ILE_LYS -0.0002 -0.0014 -0.0091 -0.0022 -0.0002 ILE_MET 0.0012 0.0056 0.0013 -0.0005 -0.0022 ILE_PHE 0.0013 -0.0008 -0.0084 0.0057 0.0038 ILE_PRO -0.0015 0.0013 -0.0041 -0.0004 0.0079 ILE_SER 0.0037 -0.0067 -0.0240 0.0010 -0.0038 !114 Table 5.11 . (contÕd) ILE_THR 0.0003 0.0012 0.0059 -0.0003 -0.0066 ILE_TRP -0.0031 -0.0001 0.0051 -0.0015 -0.0231 ILE_TYR 0.0025 -0.0022 -0.0073 0.0010 -0.0022 ILE_VAL -0.0023 -0.0038 0.0217 -0.0029 -0.0069 LEU_LEU -0.0007 0.0041 0.0120 0.0014 0.0010 LEU_LYS 0.0010 -0.0010 0.0040 -0.0014 0.0028 LEU_MET 0.0000 -0.0032 -0.0083 0.0029 -0.0082 LEU_PHE 0.0027 -0.0087 -0.0020 0.0002 -0.0027 LEU_PRO -0.0024 0.0000 -0.0008 0.0034 0.0010 LEU_SER -0.0006 -0.0015 0.0084 -0.0020 -0.0130 LEU_THR -0.0005 -0.0015 0.0042 0.0001 0.0065 LEU_TRP -0.0005 -0.0051 -0.0055 0.0045 -0.0003 LEU_TYR -0.0012 0.0037 0.0049 -0.0012 0.0351 LEU_VAL 0.0035 -0.0017 -0.0023 -0.0003 0.0016 LYS_LYS 0.0022 0.0027 -0.0054 -0.0024 0.0153 LYS_MET -0.0011 0.0009 -0.0030 0.0018 -0.0077 LYS_PHE -0.0014 -0.0027 -0.0041 0.0019 -0.0009 LYS_PRO -0.0004 -0.0041 -0.0082 -0.0006 0.0132 LYS_SER 0.0024 0.0025 -0.0148 0.0001 0.0033 LYS_THR 0.0016 0.0095 -0.0204 -0.0011 -0.0290 LYS_TRP -0.0041 0.0049 0.0179 0.0014 -0.0014 LYS_TYR -0.0004 -0.0051 
-0.0064 0.0002 -0.0064
LYS_VAL 0.0022 0.0029 0.0010 -0.0031 -0.0052
MET_MET -0.0005 -0.0031 -0.0038 0.0017 0.0097
MET_PHE -0.0015 0.0014 -0.0121 0.0048 0.0048
MET_PRO 0.0008 -0.0011 -0.0007 -0.0038 -0.0067
MET_SER -0.0037 -0.0026 0.0109 0.0005 -0.0001
MET_THR 0.0007 0.0030 0.0011 0.0008 -0.0041
MET_TRP -0.0005 0.0026 -0.0077 0.0026 -0.0002
MET_TYR -0.0018 -0.0105 0.0044 0.0035 0.0125
MET_VAL -0.0024 -0.0079 -0.0170 0.0005 0.0155
PHE_PHE 0.0021 -0.0004 0.0019 0.0021 0.0097
PHE_PRO -0.0018 -0.0001 -0.0050 0.0002 0.0082
PHE_SER 0.0004 0.0036 -0.0026 -0.0008 0.0050
PHE_THR 0.0047 0.0015 -0.0050 -0.0027 -0.0071
PHE_TRP -0.0028 0.0035 -0.0094 0.0008 -0.0032
PHE_TYR 0.0004 0.0089 -0.0047 -0.0008 0.0146
PHE_VAL 0.0006 0.0004 0.0078 0.0001 0.0143
PRO_PRO 0.0003 -0.0016 0.0071 0.0004 0.0051
PRO_SER 0.0012 -0.0014 0.0092 -0.0014 0.0054
PRO_THR 0.0011 0.0006 0.0084 -0.0015 0.0008
PRO_TRP -0.0008 -0.0152 0.0078 0.0005 -0.0015
PRO_TYR -0.0025 -0.0033 -0.0169 0.0005 0.0036
PRO_VAL 0.0018 -0.0030 -0.0137 -0.0009 0.0070
SER_SER -0.0012 0.0008 0.0054 0.0086 0.0201
SER_THR -0.0004 -0.0009 0.0113 0.0033 -0.0133
SER_TRP -0.0002 0.0038 -0.0101 -0.0008 0.0023
SER_TYR 0.0021 0.0093 0.0055 0.0000 0.0025
SER_VAL -0.0036 0.0050 0.0068 0.0006 -0.0180
THR_THR -0.0007 -0.0085 -0.0109 0.0006 0.0150
THR_TRP -0.0007 0.0031 0.0185 -0.0013 0.0014
THR_TYR -0.0002 0.0007 0.0012 0.0021 -0.0138
THR_VAL 0.0019 0.0019 -0.0114 0.0019 0.0117
TRP_TRP 0.0021 -0.0088 -0.0049 0.0031 -0.0113
TRP_TYR -0.0019 -0.0047 0.0206 -0.0002 -0.0114
TRP_VAL 0.0012 -0.0035 -0.0008 -0.0020 0.0058
TYR_TYR 0.0002 0.0008 -0.0081 -0.0020 0.0137
TYR_VAL 0.0020 -0.0070 -0.0073 -0.0035 0.0112
VAL_VAL 0.0000 0.0046 -0.0015 0.0025 0.0107
Maximum 0.0047 0.0143 0.0807 0.0086 0.0485
Minimum -0.0098 -0.0152 -0.0357 -0.0063 -0.0519

Table 5.12. Torsion and nonbond energies calculated by an encoded program with ff14SB parameters for double amino acid test set.
Energy difference / kcal/mol Dihedral 1_4_VdW 1_4_EEL VdW EEL ALA_ALA 6.5280 2.3126 117.6865 -1.4025 -96.1481 ALA_ARG 7.0166 3.6239 -155.1195 -2.6920 57.7425 ALA_ASN 17.1385 3.0982 49.9245 -1.8712 -111.5289 ALA_ASP 12.4042 3.1055 137.8428 -1.5128 -122.4079 ALA_CYS 8.5105 2.5530 115.5498 -1.7194 -92.5868 ALA_GLN 15.6410 3.3854 52.9007 -2.1378 -95.2512 ALA_GLU 12.3539 3.3747 131.5935 -2.1234 -104.8375 ALA_GLY 4.0157 2.1425 123.3087 -1.0269 -106.0016 ALA_HID 9.9732 2.4055 106.3553 -2.7117 -95.0531 ALA_ILE 11.0335 4.3097 118.5912 -1.9128 -102.1124 ALA_LEU 9.7861 4.0513 97.5048 -2.0554 -98.1436 ALA_LYS 7.4148 3.3108 161.3028 -2.3049 -90.6006 ALA_MET 9.7597 2.6423 122.2876 -2.1289 -101.7359 ALA_PHE 7.5967 6.3179 121.1512 -3.0358 -102.6852 ALA_PRO 14.0934 2.9264 112.2445 -1.5247 -86.3372 ALA_SER 8.7284 2.7859 108.3262 -1.6823 -95.0701 ALA_THR 13.1230 3.5712 87.2247 -2.1949 -95.7348 ALA_TRP 7.8911 5.5500 140.9540 -4.0232 -114.0306 ALA_TYR 7.8507 6.2878 109.4003 -3.1586 -108.6624 ALA_VAL 8.9258 4.0006 95.4875 -1.9366 -94.7429 ARG_ARG 7.9093 4.9418 -464.2798 -4.3308 251.9799 ARG_ASN 18.0246 4.4111 -271.9103 -3.4613 64.5626 ARG_ASP 13.2676 4.4496 -186.8101 -3.0997 19.1102 ARG_CYS 9.3903 3.8928 -205.3387 -3.3387 82.6308 ARG_GLN 16.5225 4.7283 -266.3321 -3.7592 78.5311 ARG_GLU 13.2131 4.7245 -191.7080 -3.7306 39.4793 ARG_GLY 4.8914 3.4126 -194.9461 -2.5079 66.8828 ARG_HID 10.8838 3.7354 -213.8339 -4.3277 81.2921 ARG_ILE 11.8959 5.6543 -197.8616 -3.5909 67.7083 ARG_LEU 10.6662 5.3722 -221.7754 -3.6701 76.2546 ARG_LYS 8.3084 4.6439 -148.5852 -3.9435 104.1729 ARG_MET 10.6426 3.9693 -196.0188 -3.7440 71.5125 ARG_PHE 8.4738 7.6443 -198.3676 -4.6517 70.9449 ARG_PRO 15.2728 4.1005 -191.1913 -3.6667 73.5006 ARG_SER 9.6089 4.1231 -207.8877 -3.2955 74.3369 ARG_THR 14.1231 4.8720 -226.8575 -4.0157 66.6243 ARG_TRP 8.7656 6.8922 -177.4498 -5.6548 57.8402 ARG_TYR 8.7314 7.6511 -210.0522 -4.7774 64.8574 ARG_VAL 9.7880 5.3323 -218.1662 -3.6080 72.3257 ASN_ASN 27.9845 3.7466 -20.8185 -2.8517 
-126.6835 ASN_ASP 23.2611 3.7717 66.9269 -2.5927 -133.8336 ASN_CYS 19.3692 3.1944 44.8261 -2.7487 -107.1895 ASN_GLN 26.4816 4.0282 -17.9992 -3.1324 -109.5975 ASN_GLU 23.1964 4.0255 60.5317 -3.2025 -116.2429 ASN_GLY 14.9039 2.8194 52.7182 -1.9765 -120.5007 ASN_HID 20.7802 3.0410 35.3693 -3.6568 -110.5193 ASN_ILE 21.8810 4.9814 47.9360 -3.0682 -116.3531 ASN_LEU 20.6334 4.7090 26.7883 -3.0481 -112.9382 ASN_LYS 18.2466 3.9508 90.7423 -3.2222 -107.8708 !117 Table 5.12 . (contÕd) ASN_MET 20.6045 3.2704 51.3683 -3.1206 -116.2559 ASN_PHE 18.4480 6.9523 50.3266 -4.0792 -117.1698 ASN_PRO 26.9232 3.4028 42.5466 -3.5477 -104.6211 ASN_SER 19.5877 3.4210 37.4947 -2.7528 -109.3346 ASN_THR 24.6294 4.0473 15.6958 -3.2017 -111.0090 ASN_TRP 18.7492 6.1818 70.2015 -5.0988 -128.4517 ASN_TYR 18.7043 6.9387 38.6138 -4.2082 -123.2203 ASN_VAL 19.7815 4.6713 24.8437 -3.0867 -108.9964 ASP_ASP 18.9973 3.6512 141.7389 -1.7719 -83.9246 ASP_CYS 15.0767 3.0595 120.7193 -1.9233 -103.0556 ASP_GLN 22.1663 3.9078 57.1436 -2.2825 -104.4918 ASP_GLU 18.9232 3.8978 134.6703 -2.3619 -70.8014 ASP_GLY 10.6740 2.7197 127.8023 -1.2125 -113.8296 ASP_HID 16.4189 2.9559 110.8170 -2.7748 -108.9906 ASP_ILE 17.6283 4.8410 122.4760 -2.2940 -107.8805 ASP_LEU 16.3337 4.6208 102.5449 -2.2155 -108.7822 ASP_LYS 13.9114 3.8506 165.4792 -2.3452 -136.9553 ASP_MET 16.2893 3.1785 126.2913 -2.2814 -111.5839 ASP_PHE 14.1524 6.8256 125.6892 -3.2719 -111.8069 ASP_PRO 22.4531 3.3107 115.1372 -2.5028 -81.7720 ASP_SER 15.2961 3.2847 111.4700 -1.9350 -101.6278 ASP_THR 20.6377 3.7235 87.0424 -2.1337 -99.0827 ASP_TRP 14.4596 6.0729 145.2762 -4.2969 -121.8024 ASP_TYR 14.4070 6.8155 113.8923 -3.4045 -117.7040 ASP_VAL 15.5271 4.5476 98.3364 -2.3175 -99.4558 CYS_CYS 10.9090 2.3820 104.4951 -1.9397 -79.8821 CYS_GLN 18.0289 3.2132 41.7873 -2.3490 -82.6290 CYS_GLU 14.7703 3.1908 119.4827 -2.3647 -89.3800 CYS_GLY 6.4282 2.0082 112.2790 -1.1685 -93.3558 CYS_HID 12.3470 2.2323 95.2076 -2.9147 -82.7586 CYS_ILE 13.4405 4.1231 107.5202 -2.1709 
-89.1540 CYS_LEU 12.1801 3.8890 86.4928 -2.2614 -85.5853 CYS_LYS 9.7930 3.1453 151.0383 -2.5100 -80.0173 CYS_MET 12.1512 2.4645 111.1543 -2.3366 -89.0926 CYS_PHE 9.9940 6.1437 110.0443 -3.2649 -89.9671 CYS_PRO 16.2765 2.6795 102.4443 -2.4433 -75.3615 CYS_SER 11.1365 2.6122 97.1852 -1.9034 -82.0838 CYS_THR 15.4396 3.3550 75.9955 -2.5582 -81.5455 CYS_TRP 10.2893 5.3699 129.8747 -4.2597 -101.2032 CYS_TYR 10.2463 6.1186 98.2846 -3.3865 -95.9285 CYS_VAL 11.3392 3.8139 84.3884 -2.1964 -81.7533 GLN_GLN 25.1965 4.3181 -30.5644 -3.1758 -76.8539 GLN_GLU 21.8911 4.3081 46.4162 -3.1491 -86.8118 GLN_GLY 13.5618 3.0293 39.8007 -1.9474 -87.5874 GLN_HID 19.5461 3.3314 22.7876 -3.7452 -76.2270 GLN_ILE 20.5590 5.2572 35.2401 -2.9972 -84.4041 GLN_LEU 19.3402 4.9763 14.1147 -3.0926 -79.9226 GLN_LYS 16.9780 4.2589 79.4603 -3.3551 -72.6103 GLN_MET 19.3138 3.5751 38.9063 -3.1609 -83.4619 GLN_PHE 17.1457 7.2553 37.6332 -4.0693 -84.4708 GLN_PRO 23.9525 3.7069 32.2013 -3.0117 -72.0316 !118 Table 5.12 . (contÕd) GLN_SER 18.2711 3.7387 24.9622 -2.7113 -77.2972 GLN_THR 22.7446 4.4935 3.8166 -3.3947 -79.2694 GLN_TRP 17.4348 6.5009 57.5191 -5.0727 -95.9935 GLN_TYR 17.3970 7.2267 25.8928 -4.1949 -90.4446 GLN_VAL 18.4576 4.9438 12.3060 -3.0158 -77.1930 GLU_GLU 18.6469 4.1410 119.8831 -3.1349 -49.2654 GLU_GLY 10.2981 2.9417 113.2641 -1.9266 -84.1324 GLU_HID 16.2416 3.1823 96.7034 -3.7083 -75.6152 GLU_ILE 17.3143 5.0948 108.1624 -2.9681 -79.1952 GLU_LEU 16.1185 4.7117 88.2765 -2.8534 -77.9850 GLU_LYS 13.6872 4.1099 151.8997 -3.3098 -99.1985 GLU_MET 16.0387 3.4182 112.2653 -3.1265 -80.8048 GLU_PHE 13.8727 7.0954 111.4195 -4.0475 -81.4035 GLU_PRO 20.5202 3.4694 104.0916 -2.6343 -62.9287 GLU_SER 15.0082 3.5810 97.4747 -2.6851 -72.0477 GLU_THR 19.4009 4.3377 75.4383 -3.3655 -67.5472 GLU_TRP 14.1706 6.3260 131.0136 -5.0503 -91.7576 GLU_TYR 14.1272 7.0818 99.6316 -4.1777 -87.1942 GLU_VAL 15.2073 4.7776 84.1698 -2.9840 -71.0269 GLY_GLY 4.0568 1.6628 102.7650 -0.6670 -92.1571 GLY_HID 10.0112 1.9220 85.5779 
-2.2987 -80.9198 GLY_ILE 11.0758 3.8174 98.1305 -1.4620 -88.3679 GLY_LEU 9.8318 3.5648 77.0117 -1.6355 -84.3610 GLY_LYS 7.4595 2.8325 143.0816 -1.8874 -78.8953 GLY_MET 9.8050 2.1521 101.6555 -1.7154 -87.7890 GLY_PHE 7.6425 5.8266 100.4681 -2.6193 -88.6530 GLY_PRO 14.1084 2.5868 96.6969 -1.2354 -77.0594 GLY_SER 8.7689 2.3038 87.7694 -1.2733 -81.1589 GLY_THR 13.1815 3.0914 66.6990 -1.7132 -81.9155 GLY_TRP 7.9347 5.0726 120.3785 -3.5978 -100.1006 GLY_TYR 7.8962 5.8082 88.7199 -2.7418 -94.6132 GLY_VAL 8.9699 3.5072 75.1814 -1.4938 -81.1571 HID_HID 13.8723 2.3360 81.3216 -5.2920 -78.6067 HID_ILE 14.8938 4.2318 93.7811 -4.6085 -85.2229 HID_LEU 13.6837 3.9912 72.6270 -4.6440 -81.4766 HID_LYS 11.3069 3.2492 137.4771 -4.9035 -74.4730 HID_MET 13.6595 2.5665 97.3235 -4.7074 -84.7784 HID_PHE 11.4934 6.2439 96.2257 -5.6568 -85.7839 HID_PRO 18.4480 2.8698 90.4206 -4.2054 -81.3746 HID_SER 12.6064 2.7177 83.4012 -4.2505 -78.3310 HID_THR 16.8116 3.7174 64.8478 -4.3805 -89.1514 HID_TRP 11.7783 5.4720 116.0573 -6.6665 -97.2551 HID_TYR 11.7490 6.2120 84.4592 -5.7844 -91.8738 HID_VAL 12.7866 3.9258 70.6870 -4.6053 -77.8978 ILE_ILE 15.8893 5.9655 84.2998 -2.6350 -77.1191 ILE_LEU 14.6457 5.7084 63.2005 -2.7063 -73.1160 ILE_LYS 12.2767 4.9821 130.0936 -2.9701 -68.6753 ILE_MET 14.6201 4.3029 87.8457 -2.7763 -76.5255 ILE_PHE 12.4555 7.9848 86.5235 -3.6973 -77.3056 ILE_PRO 19.2547 4.4081 84.5870 -2.5322 -67.5111 ILE_SER 13.5915 4.4535 73.8938 -2.3300 -70.0161 !119 Table 5.12 . 
(contÕd) ILE_THR 17.9678 5.2311 52.7670 -3.0057 -70.5951 ILE_TRP 12.7474 7.2226 106.4756 -4.7046 -88.7978 ILE_TYR 12.7106 7.9674 74.7862 -3.8266 -83.2669 ILE_VAL 13.7835 5.6578 61.4502 -2.6499 -70.0550 LEU_LEU 13.5945 5.6899 63.0049 -3.3027 -93.3937 LEU_LYS 11.2464 4.9581 128.3883 -3.5745 -87.2265 LEU_MET 13.5768 4.2700 87.6483 -3.3772 -96.8095 LEU_PHE 11.4064 7.9611 86.3356 -4.2916 -97.6166 LEU_PRO 18.5714 4.4984 80.7557 -3.1356 -83.7364 LEU_SER 12.5473 4.4305 73.6836 -2.9347 -90.3296 LEU_THR 16.9417 5.1981 52.6275 -3.6534 -90.9104 LEU_TRP 11.6967 7.1828 106.3209 -5.3009 -109.1064 LEU_TYR 11.6591 7.9309 74.5793 -4.4222 -103.5734 LEU_VAL 12.7080 5.6387 61.2403 -3.2638 -90.2831 LYS_LYS 8.6939 4.2781 171.3038 -3.5314 -45.4134 LYS_MET 11.0330 3.6081 124.4217 -3.3356 -78.2226 LYS_PHE 8.8668 7.2951 122.1350 -4.2423 -78.8246 LYS_PRO 15.6520 3.7526 128.2222 -3.2329 -76.2325 LYS_SER 10.0010 3.7651 112.4688 -2.8809 -75.3101 LYS_THR 14.4987 4.5370 93.5142 -3.5840 -82.7968 LYS_TRP 9.1591 6.5259 143.0397 -5.2445 -91.8499 LYS_TYR 9.1216 7.2670 110.4253 -4.3677 -84.9175 LYS_VAL 10.1856 4.9676 102.1771 -3.1953 -77.2144 MET_MET 13.4350 2.8200 103.0089 -3.2060 -86.3285 MET_PHE 11.2652 6.4971 101.7498 -4.1168 -87.2333 MET_PRO 18.0173 2.9833 97.5342 -3.1331 -75.5615 MET_SER 12.3963 2.9839 89.0948 -2.7589 -79.9252 MET_THR 16.8934 3.7504 67.9690 -3.5088 -81.1402 MET_TRP 11.5585 5.7433 121.6639 -5.1254 -98.6982 MET_TYR 11.5213 6.4843 90.0092 -4.2453 -93.1956 MET_VAL 12.5811 4.1817 76.5158 -3.0688 -79.8849 PHE_PHE 8.8690 10.0964 107.0504 -6.1741 -92.1391 PHE_PRO 15.4840 6.5747 101.4620 -5.0988 -79.9522 PHE_SER 10.0096 6.5360 94.1256 -4.7424 -84.2691 PHE_THR 14.2474 7.2030 72.8452 -5.4750 -83.0128 PHE_TRP 9.1665 9.3161 126.8904 -7.1916 -103.3919 PHE_TYR 9.1257 10.0721 95.2836 -6.3051 -98.1342 PHE_VAL 10.2077 7.7409 81.4624 -5.1171 -83.9398 PRO_PRO 17.5120 2.8868 72.8938 -2.2480 -57.3659 PRO_SER 12.2004 2.6706 63.0541 -2.2137 -60.5165 PRO_THR 16.6168 3.4533 41.8183 -2.7470 -61.0888 PRO_TRP 
11.3682 5.4203 95.7614 -4.5660 -79.3417
PRO_TYR 11.3270 6.1699 64.1022 -3.7006 -73.9033
PRO_VAL 12.3954 3.8743 50.4475 -2.4789 -60.3810
SER_SER 11.2371 2.7488 61.0942 -1.7591 -55.9101
SER_THR 15.5318 3.5052 39.9507 -2.3943 -55.1979
SER_TRP 10.3993 5.5132 93.7906 -4.1119 -74.9205
SER_TYR 10.3531 6.2666 62.1423 -3.2397 -69.6242
SER_VAL 11.4523 3.9602 48.5044 -2.0396 -55.7095
THR_THR 19.2257 4.6269 1.2414 -2.7763 -34.8942
THR_TRP 13.9779 6.5994 54.8633 -4.5659 -52.6021
THR_TYR 13.9386 7.3544 23.1877 -3.6970 -47.1025
THR_VAL 15.0228 5.0458 9.7650 -2.4901 -33.8353
TRP_TRP 9.3833 8.5720 140.8008 -8.5027 -109.7283
TRP_TYR 9.3402 9.3300 109.1063 -7.6203 -104.3638
TRP_VAL 11.7296 6.6231 92.8557 -6.1057 -88.1176
TYR_TYR 9.3749 10.0634 83.1288 -6.6073 -102.9884
TYR_VAL 10.4433 7.7158 69.3328 -5.4419 -88.8127
VAL_VAL 11.6396 5.3037 21.1357 -2.5695 -45.7340

Table 5.13. Torsion and nonbond energies calculated by Amber software with ff14SB parameters for double amino acid test set.
Energy difference / kcal/mol Dihedral 1_4_VdW 1_4_EEL VdW EEL ALA_ALA 6.5282 2.3091 117.6841 -1.4026 -96.1489 ALA_ARG 7.0170 3.6214 -155.1509 -2.6933 57.7762 ALA_ASN 17.1392 3.0909 49.9035 -1.8698 -111.5165 ALA_ASP 12.4040 3.1048 137.8546 -1.5127 -122.4424 ALA_CYS 8.5098 2.5578 115.5685 -1.7206 -92.5853 ALA_GLN 15.6404 3.3884 52.9085 -2.1389 -95.2611 ALA_GLU 12.3527 3.3713 131.5812 -2.1224 -104.8197 ALA_GLY 4.0146 2.1422 123.3091 -1.0269 -106.0097 ALA_HID 9.9729 2.4023 106.3461 -2.7109 -95.0422 ALA_ILE 11.0312 4.3062 118.5851 -1.9103 -102.1069 ALA_LEU 9.7864 4.0532 97.5104 -2.0542 -98.1581 ALA_LYS 7.4157 3.3160 161.3121 -2.3042 -90.5891 ALA_MET 9.7609 2.6382 122.2837 -2.1295 -101.7366 ALA_PHE 7.5973 6.3132 121.1507 -3.0366 -102.6976 ALA_PRO 14.0927 2.9197 112.2296 -1.5277 -86.3184 ALA_SER 8.7280 2.7884 108.3392 -1.6829 -95.0772 ALA_THR 13.1223 3.5711 87.2368 -2.1952 -95.7339 ALA_TRP 7.8912 5.5500 140.9576 -4.0233 -114.0243 ALA_TYR 7.8514 6.2958 109.4072 -3.1600 -108.6725 ALA_VAL 8.9269 3.9916 95.4977 -1.9359 -94.7606 ARG_ARG 7.9099 4.9417 -464.2442 -4.3340 251.9600 ARG_ASN 18.0242 4.4043 -271.9286 -3.4648 64.5757 ARG_ASP 13.2674 4.4491 -186.8442 -3.1060 19.1445 ARG_CYS 9.3891 3.8967 -205.3971 -3.3413 82.6710 ARG_GLN 16.5234 4.7258 -266.3313 -3.7600 78.5323 ARG_GLU 13.2153 4.7238 -191.7468 -3.7332 39.4827 ARG_GLY 4.8920 3.4106 -194.9587 -2.5090 66.8844 ARG_HID 10.8851 3.7338 -213.8494 -4.3274 81.2735 ARG_ILE 11.8945 5.6490 -197.8503 -3.5930 67.7062 ARG_LEU 10.6667 5.3734 -221.8038 -3.6715 76.2732 ARG_LYS 8.3086 4.6396 -148.6216 -3.9451 104.1923 ARG_MET 10.6430 3.9709 -195.9901 -3.7467 71.5163 ARG_PHE 8.4747 7.6495 -198.3387 -4.6529 70.9472 ARG_PRO 15.2757 4.0963 -191.2041 -3.6702 73.5045 ARG_SER 9.6097 4.1278 -207.9238 -3.2953 74.3420 ARG_THR 14.1231 4.8727 -226.8921 -4.0152 66.6317 ARG_TRP 8.7665 6.8905 -177.4064 -5.6572 57.8598 ARG_TYR 8.7297 7.6336 -210.0276 -4.7785 64.8528 ARG_VAL 9.7899 5.3299 -218.1479 -3.6102 72.3134 ASN_ASN 27.9848 3.7478 -20.8155 -2.8476 
-126.6976 ASN_ASP 23.2596 3.7634 66.9171 -2.5902 -133.8253 ASN_CYS 19.3676 3.1900 44.8419 -2.7493 -107.1840 ASN_GLN 26.4813 4.0234 -17.9978 -3.1309 -109.6119 ASN_GLU 23.1974 4.0234 60.5460 -3.1982 -116.2772 ASN_GLY 14.9037 2.8193 52.7154 -1.9774 -120.5051 ASN_HID 20.7816 3.0441 35.3851 -3.6560 -110.5166 ASN_ILE 21.8834 4.9776 47.9255 -3.0698 -116.3440 ASN_LEU 20.6328 4.7050 26.7914 -3.0464 -112.9472 ASN_LYS 18.2463 3.9543 90.7562 -3.2184 -107.8911 !122 Table 5.13 . (contÕd) ASN_MET 20.6039 3.2760 51.3890 -3.1217 -116.2577 ASN_PHE 18.4486 6.9534 50.3511 -4.0785 -117.1686 ASN_PRO 26.9224 3.4016 42.5755 -3.5515 -104.6450 ASN_SER 19.5868 3.4227 37.4880 -2.7506 -109.3075 ASN_THR 24.6287 4.0420 15.7161 -3.1990 -111.0184 ASN_TRP 18.7472 6.1894 70.2149 -5.1032 -128.4556 ASN_TYR 18.7036 6.9341 38.5993 -4.2090 -123.2130 ASN_VAL 19.7796 4.6664 24.8189 -3.0872 -108.9721 ASP_ASP 18.9974 3.6566 141.7366 -1.7714 -83.9031 ASP_CYS 15.0738 3.0641 120.7395 -1.9229 -103.0825 ASP_GLN 22.1672 3.9166 57.1639 -2.2846 -104.4838 ASP_GLU 18.9239 3.8998 134.6789 -2.3561 -70.8137 ASP_GLY 10.6717 2.7244 127.7975 -1.2064 -113.8517 ASP_HID 16.4164 2.9555 110.8062 -2.7735 -108.9838 ASP_ILE 17.6335 4.8453 122.4809 -2.3007 -107.8662 ASP_LEU 16.3341 4.6208 102.5543 -2.2113 -108.7919 ASP_LYS 13.9133 3.8490 165.4784 -2.3478 -136.9650 ASP_MET 16.2923 3.1735 126.2859 -2.2781 -111.5886 ASP_PHE 14.1537 6.8407 125.6969 -3.2740 -111.7951 ASP_PRO 22.4506 3.3064 115.1461 -2.5038 -81.7805 ASP_SER 15.2951 3.2852 111.4828 -1.9388 -101.6421 ASP_THR 20.6374 3.7314 87.0484 -2.1300 -99.0745 ASP_TRP 14.4583 6.0683 145.2645 -4.3018 -121.7983 ASP_TYR 14.4095 6.8174 113.8966 -3.4065 -117.7014 ASP_VAL 15.5282 4.5469 98.3435 -2.3158 -99.4606 CYS_CYS 10.9100 2.3817 104.5054 -1.9379 -79.8937 CYS_GLN 18.0295 3.2153 41.7720 -2.3502 -82.6228 CYS_GLU 14.7725 3.1898 119.4726 -2.3617 -89.3755 CYS_GLY 6.4300 2.0053 112.2827 -1.1687 -93.3515 CYS_HID 12.3455 2.2341 95.1971 -2.9113 -82.7579 CYS_ILE 13.4402 4.1245 107.5194 -2.1703 
-89.1386 CYS_LEU 12.1810 3.8890 86.4840 -2.2630 -85.5892 CYS_LYS 9.7949 3.1438 151.0466 -2.5078 -80.0327 CYS_MET 12.1524 2.4671 111.1535 -2.3387 -89.1061 CYS_PHE 9.9926 6.1396 110.0459 -3.2636 -89.9577 CYS_PRO 16.2760 2.6776 102.4500 -2.4416 -75.3708 CYS_SER 11.1345 2.6103 97.1729 -1.9056 -82.0977 CYS_THR 15.4373 3.3564 75.9967 -2.5560 -81.5378 CYS_TRP 10.2921 5.3722 129.8671 -4.2635 -101.2073 CYS_TYR 10.2482 6.1211 98.2938 -3.3890 -95.9400 CYS_VAL 11.3381 3.8127 84.3965 -2.1921 -81.7574 GLN_GLN 25.1965 4.3243 -30.5645 -3.1760 -76.8420 GLN_GLU 21.8907 4.3089 46.4066 -3.1492 -86.7995 GLN_GLY 13.5617 3.0394 39.7990 -1.9483 -87.5960 GLN_HID 19.5480 3.3332 22.8073 -3.7448 -76.2269 GLN_ILE 20.5604 5.2530 35.2455 -2.9977 -84.3988 GLN_LEU 19.3382 4.9808 14.1123 -3.0886 -79.9226 GLN_LYS 16.9792 4.2608 79.4663 -3.3553 -72.6036 GLN_MET 19.3146 3.5733 38.8806 -3.1620 -83.4509 GLN_PHE 17.1451 7.2502 37.6421 -4.0719 -84.4789 GLN_PRO 23.9502 3.7046 32.2167 -3.0092 -72.0375 !123 Table 5.13 . (contÕd) GLN_SER 18.2733 3.7322 24.9567 -2.7114 -77.2824 GLN_THR 22.7461 4.4885 3.8052 -3.3946 -79.2634 GLN_TRP 17.4344 6.4919 57.5225 -5.0740 -95.9908 GLN_TYR 17.3992 7.2338 25.9013 -4.1971 -90.4286 GLN_VAL 18.4552 4.9370 12.2917 -3.0152 -77.1943 GLU_GLU 18.6487 4.1413 119.8938 -3.1346 -49.2682 GLU_GLY 10.2968 2.9371 113.2688 -1.9271 -84.1351 GLU_HID 16.2436 3.1839 96.6895 -3.7066 -75.6186 GLU_ILE 17.3121 5.0882 108.1523 -2.9666 -79.2074 GLU_LEU 16.1179 4.7079 88.2605 -2.8551 -77.9815 GLU_LYS 13.6894 4.1101 151.9113 -3.3092 -99.2071 GLU_MET 16.0368 3.4216 112.2724 -3.1272 -80.7798 GLU_PHE 13.8753 7.0947 111.4143 -4.0489 -81.4068 GLU_PRO 20.5160 3.4710 104.0921 -2.6301 -62.9401 GLU_SER 15.0077 3.5758 97.4635 -2.6833 -72.0663 GLU_THR 19.4001 4.3390 75.4564 -3.3651 -67.5732 GLU_TRP 14.1693 6.3326 131.0126 -5.0513 -91.7557 GLU_TYR 14.1292 7.0770 99.6367 -4.1748 -87.2148 GLU_VAL 15.2079 4.7771 84.1734 -2.9838 -71.0673 GLY_GLY 4.0571 1.6663 102.7669 -0.6665 -92.1614 GLY_HID 10.0137 1.9185 85.5641 
-2.2972 -80.9031 GLY_ILE 11.0750 3.8193 98.1415 -1.4622 -88.3655 GLY_LEU 9.8317 3.5702 77.0255 -1.6373 -84.3646 GLY_LYS 7.4591 2.8330 143.0797 -1.8861 -78.8947 GLY_MET 9.8045 2.1546 101.6800 -1.7160 -87.7888 GLY_PHE 7.6422 5.8286 100.4684 -2.6192 -88.6522 GLY_PRO 14.1074 2.5911 96.6984 -1.2380 -77.0469 GLY_SER 8.7684 2.3052 87.7668 -1.2730 -81.1536 GLY_THR 13.1819 3.0889 66.6964 -1.7144 -81.9070 GLY_TRP 7.9352 5.0654 120.3817 -3.5973 -100.1118 GLY_TYR 7.8957 5.8108 88.7200 -2.7416 -94.6265 GLY_VAL 8.9702 3.5071 75.1746 -1.4934 -81.1613 HID_HID 13.8729 2.3383 81.3272 -5.2944 -78.5939 HID_ILE 14.8953 4.2373 93.7686 -4.6061 -85.2148 HID_LEU 13.6844 3.9917 72.6274 -4.6461 -81.4784 HID_LYS 11.3083 3.2481 137.4761 -4.9013 -74.4760 HID_MET 13.6590 2.5692 97.3191 -4.7082 -84.7790 HID_PHE 11.4937 6.2437 96.2157 -5.6565 -85.7706 HID_PRO 18.4450 2.8695 90.4082 -4.2075 -81.3837 HID_SER 12.6047 2.7181 83.3998 -4.2492 -78.3228 HID_THR 16.8114 3.7095 64.8340 -4.3792 -89.1386 HID_TRP 11.7805 5.4768 116.0556 -6.6675 -97.2560 HID_TYR 11.7450 6.2253 84.4679 -5.7820 -91.8666 HID_VAL 12.7876 3.9261 70.6911 -4.6109 -77.8877 ILE_ILE 15.8877 5.9732 84.3253 -2.6350 -77.1453 ILE_LEU 14.6445 5.7113 63.2066 -2.7042 -73.1167 ILE_LYS 12.2764 4.9827 130.1112 -2.9697 -68.6922 ILE_MET 14.6221 4.2991 87.8321 -2.7770 -76.5238 ILE_PHE 12.4562 7.9745 86.5202 -3.6956 -77.2997 ILE_PRO 19.2531 4.4099 84.5980 -2.5348 -67.5243 ILE_SER 13.5922 4.4533 73.8963 -2.3312 -70.0281 !124 Table 5.13 . 
(cont'd)

ILE_THR   17.9664    5.2206   52.7832   -3.0056  -70.5953
ILE_TRP   12.7503    7.2130  106.4955   -4.7046  -88.8111
ILE_TYR   12.7109    7.9575   74.7720   -3.8214  -83.2649
ILE_VAL   13.7825    5.6585   61.4601   -2.6494  -70.0597
LEU_LEU   13.5940    5.6858   63.0026   -3.3027  -93.3972
LEU_LYS   11.2465    4.9574  128.3958   -3.5757  -87.2341
LEU_MET   13.5761    4.2746   87.6406   -3.3781  -96.8174
LEU_PHE   11.4052    7.9490   86.3375   -4.2959  -97.6069
LEU_PRO   18.5746    4.5002   80.7675   -3.1374  -83.7435
LEU_SER   12.5444    4.4279   73.6941   -2.9369  -90.3152
LEU_THR   16.9427    5.2028   52.6326   -3.6550  -90.9020
LEU_TRP   11.6953    7.1875  106.3052   -5.3042 -109.1018
LEU_TYR   11.6601    7.9322   74.5923   -4.4227 -103.5727
LEU_VAL   12.7097    5.6310   61.2493   -3.2665  -90.3028
LYS_LYS    8.6952    4.2772  171.3113   -3.5313  -45.4004
LYS_MET   11.0333    3.6107  124.4326   -3.3344  -78.2293
LYS_PHE    8.8658    7.2889  122.1373   -4.2427  -78.8203
LYS_PRO   15.6536    3.7499  128.2239   -3.2305  -76.2389
LYS_SER   10.0019    3.7673  112.4833   -2.8832  -75.3285
LYS_THR   14.4996    4.5276   93.5190   -3.5844  -82.7711
LYS_TRP    9.1589    6.5282  143.0227   -5.2453  -91.8419
LYS_TYR    9.1210    7.2727  110.4476   -4.3681  -84.9413
LYS_VAL   10.1855    4.9699  102.1546   -3.1944  -77.2011
MET_MET   13.4340    2.8189  103.0132   -3.2066  -86.3307
MET_PHE   11.2657    6.4954  101.7523   -4.1213  -87.2293
MET_PRO   18.0155    2.9767   97.5353   -3.1302  -75.5661
MET_SER   12.3981    2.9768   89.0866   -2.7593  -79.9232
MET_THR   16.8929    3.7465   67.9793   -3.5080  -81.1601
MET_TRP   11.5571    5.7361  121.6680   -5.1263  -98.7293
MET_TYR   11.5201    6.4790   90.0090   -4.2468  -93.1910
MET_VAL   12.5809    4.1823   76.5132   -3.0690  -79.8933
PHE_PHE    8.8691   10.0936  107.0462   -6.1755  -92.1359
PHE_PRO   15.4860    6.5744  101.4619   -5.0999  -79.9568
PHE_SER   10.0113    6.5467   94.1375   -4.7430  -84.2849
PHE_THR   14.2484    7.1953   72.8453   -5.4724  -83.0104
PHE_TRP    9.1682    9.3181  126.8987   -7.1934 -103.3995
PHE_TYR    9.1249   10.0724   95.2835   -6.3031  -98.1220
PHE_VAL   10.2036    7.7438   81.4659   -5.1192  -83.9486
PRO_PRO   17.5127    2.8886   72.8967   -2.2462  -57.3684
PRO_SER   12.1990    2.6669   63.0465   -2.2128  -60.5143
PRO_THR   16.6148    3.4496   41.8326   -2.7460  -61.0971
PRO_TRP   11.3664    5.4249   95.7587   -4.5677  -79.3411
PRO_TYR   11.3261    6.1706   64.1092   -3.7004  -73.8992
PRO_VAL   12.3973    3.8713   50.4373   -2.4812  -60.3802
SER_SER   11.2379    2.7509   61.0849   -1.7581  -55.9137
SER_THR   15.5296    3.5013   39.9627   -2.3935  -55.1953
SER_TRP   10.3988    5.5147   93.8043   -4.1144  -74.9394
SER_TYR   10.3529    6.2659   62.1415   -3.2397  -69.6217
SER_VAL   11.4515    3.9557   48.5157   -2.0406  -55.7108
THR_THR   19.2258    4.6262    1.2434   -2.7759  -34.8859
THR_TRP   13.9776    6.6122   54.8779   -4.5681  -52.6080
THR_TYR   13.9395    7.3578   23.1922   -3.6967  -47.1002
THR_VAL   15.0208    5.0505    9.7613   -2.4891  -33.8293
TRP_TRP    9.3826    8.5740  140.7899   -8.5034 -109.7377
TRP_TYR    9.3407    9.3261  109.1256   -7.6196 -104.3716
TRP_VAL   11.7309    6.6227   92.8398   -6.1037  -88.1134
TYR_TYR    9.3730   10.0502   83.1507   -6.6067 -102.9952
TYR_VAL   10.4448    7.7262   69.3126   -5.4425  -88.8102
VAL_VAL   11.6387    5.3042   21.1595   -2.5726  -45.7387

Table 5.14. Comparisons of torsion and nonbond energies calculated by the encoded program and Amber software with ff14SB parameters for double amino acid test set.
Energy difference (kcal/mol)

          Dihedral  1_4_VdW  1_4_EEL      VdW      EEL
ALA_ALA  -0.0002   0.0035   0.0024   0.0001   0.0008
ALA_ARG  -0.0004   0.0025   0.0314   0.0013  -0.0337
ALA_ASN  -0.0007   0.0073   0.0210  -0.0014  -0.0124
ALA_ASP   0.0002   0.0007  -0.0118  -0.0001   0.0345
ALA_CYS   0.0007  -0.0048  -0.0187   0.0012  -0.0015
ALA_GLN   0.0006  -0.0030  -0.0078   0.0011   0.0099
ALA_GLU   0.0012   0.0034   0.0123  -0.0010  -0.0178
ALA_GLY   0.0011   0.0003  -0.0004   0.0000   0.0081
ALA_HID   0.0003   0.0032   0.0092  -0.0008  -0.0109
ALA_ILE   0.0023   0.0035   0.0061  -0.0025  -0.0055
ALA_LEU  -0.0003  -0.0019  -0.0056  -0.0012   0.0145
ALA_LYS  -0.0009  -0.0052  -0.0093  -0.0007  -0.0115
ALA_MET  -0.0012   0.0041   0.0039   0.0006   0.0007
ALA_PHE  -0.0006   0.0047   0.0005   0.0008   0.0124
ALA_PRO   0.0007   0.0067   0.0149   0.0030  -0.0188
ALA_SER   0.0004  -0.0025  -0.0130   0.0006   0.0071
ALA_THR   0.0007   0.0001  -0.0121   0.0003  -0.0009
ALA_TRP  -0.0001   0.0000  -0.0036   0.0001  -0.0063
ALA_TYR  -0.0007  -0.0080  -0.0069   0.0014   0.0101
ALA_VAL  -0.0011   0.0090  -0.0102  -0.0007   0.0177
ARG_ARG  -0.0006   0.0001  -0.0356   0.0032   0.0199
ARG_ASN   0.0004   0.0068   0.0183   0.0035  -0.0131
ARG_ASP   0.0002   0.0005   0.0341   0.0063  -0.0343
ARG_CYS   0.0012  -0.0039   0.0584   0.0026  -0.0402
ARG_GLN  -0.0009   0.0025  -0.0008   0.0008  -0.0012
ARG_GLU  -0.0022   0.0007   0.0388   0.0026  -0.0034
ARG_GLY  -0.0006   0.0020   0.0126   0.0011  -0.0016
ARG_HID  -0.0013   0.0016   0.0155  -0.0003   0.0186
ARG_ILE   0.0014   0.0053  -0.0113   0.0021   0.0021
ARG_LEU  -0.0005  -0.0012   0.0284   0.0014  -0.0186
ARG_LYS  -0.0002   0.0043   0.0364   0.0016  -0.0194
ARG_MET  -0.0004  -0.0016  -0.0287   0.0027  -0.0038
ARG_PHE  -0.0009  -0.0052  -0.0289   0.0012  -0.0023
ARG_PRO  -0.0029   0.0042   0.0128   0.0035  -0.0039
ARG_SER  -0.0008  -0.0047   0.0361  -0.0002  -0.0051
ARG_THR   0.0000  -0.0007   0.0346  -0.0005  -0.0074
ARG_TRP  -0.0009   0.0017  -0.0434   0.0024  -0.0196
ARG_TYR   0.0017   0.0175  -0.0246   0.0011   0.0046
ARG_VAL  -0.0019   0.0024  -0.0183   0.0022   0.0123
ASN_ASN  -0.0003  -0.0012  -0.0030  -0.0041   0.0141
ASN_ASP   0.0015   0.0083   0.0098  -0.0025  -0.0083
ASN_CYS   0.0016   0.0044  -0.0158   0.0006  -0.0055
ASN_GLN   0.0003   0.0048  -0.0014  -0.0015   0.0144
ASN_GLU  -0.0010   0.0021  -0.0143  -0.0043   0.0343
ASN_GLY   0.0002   0.0001   0.0028   0.0009   0.0044
ASN_HID  -0.0014  -0.0031  -0.0158  -0.0008  -0.0027
ASN_ILE  -0.0024   0.0038   0.0105   0.0016  -0.0091
ASN_LEU   0.0006   0.0040  -0.0031  -0.0017   0.0090
ASN_LYS   0.0003  -0.0035  -0.0139  -0.0038   0.0203
ASN_MET   0.0006  -0.0056  -0.0207   0.0011   0.0018
ASN_PHE  -0.0006  -0.0011  -0.0245  -0.0007  -0.0012
ASN_PRO   0.0008   0.0012  -0.0289   0.0038   0.0239
ASN_SER   0.0009  -0.0017   0.0067  -0.0022  -0.0271
ASN_THR   0.0007   0.0053  -0.0203  -0.0027   0.0094
ASN_TRP   0.0020  -0.0076  -0.0134   0.0044   0.0039
ASN_TYR   0.0007   0.0046   0.0145   0.0008  -0.0073
ASN_VAL   0.0019   0.0049   0.0248   0.0005  -0.0243
ASP_ASP  -0.0001  -0.0054   0.0023  -0.0005  -0.0215
ASP_CYS   0.0029  -0.0046  -0.0202  -0.0004   0.0269
ASP_GLN  -0.0009  -0.0088  -0.0203   0.0021  -0.0080
ASP_GLU  -0.0007  -0.0020  -0.0086  -0.0058   0.0123
ASP_GLY   0.0023  -0.0047   0.0048  -0.0061   0.0221
ASP_HID   0.0025   0.0004   0.0108  -0.0013  -0.0068
ASP_ILE  -0.0052  -0.0043  -0.0049   0.0067  -0.0143
ASP_LEU  -0.0004   0.0000  -0.0094  -0.0042   0.0097
ASP_LYS  -0.0019   0.0016   0.0008   0.0026   0.0097
ASP_MET  -0.0030   0.0050   0.0054  -0.0033   0.0047
ASP_PHE  -0.0013  -0.0151  -0.0077   0.0021  -0.0118
ASP_PRO   0.0025   0.0043  -0.0089   0.0010   0.0085
ASP_SER   0.0010  -0.0005  -0.0128   0.0038   0.0143
ASP_THR   0.0003  -0.0079  -0.0060  -0.0037  -0.0082
ASP_TRP   0.0013   0.0046   0.0117   0.0049  -0.0041
ASP_TYR  -0.0025  -0.0019  -0.0043   0.0020  -0.0026
ASP_VAL  -0.0011   0.0007  -0.0071  -0.0017   0.0048
CYS_CYS  -0.0010   0.0003  -0.0103  -0.0018   0.0116
CYS_GLN  -0.0006  -0.0021   0.0153   0.0012  -0.0062
CYS_GLU  -0.0022   0.0010   0.0101  -0.0030  -0.0045
CYS_GLY  -0.0018   0.0029  -0.0037   0.0002  -0.0043
CYS_HID   0.0015  -0.0018   0.0105  -0.0034  -0.0007
CYS_ILE   0.0003  -0.0014   0.0008  -0.0006  -0.0154
CYS_LEU  -0.0009   0.0000   0.0088   0.0016   0.0039
CYS_LYS  -0.0019   0.0015  -0.0083  -0.0022   0.0154
CYS_MET  -0.0012  -0.0026   0.0008   0.0021   0.0135
CYS_PHE   0.0014   0.0041  -0.0016  -0.0013  -0.0094
CYS_PRO   0.0005   0.0019  -0.0057  -0.0017   0.0093
CYS_SER   0.0020   0.0019   0.0123   0.0022   0.0139
CYS_THR   0.0023  -0.0014  -0.0012  -0.0022  -0.0077
CYS_TRP  -0.0028  -0.0023   0.0076   0.0038   0.0041
CYS_TYR  -0.0019  -0.0025  -0.0092   0.0025   0.0115
CYS_VAL   0.0011   0.0012  -0.0081  -0.0043   0.0041
GLN_GLN   0.0000  -0.0062   0.0001   0.0002  -0.0119
GLN_GLU   0.0004  -0.0008   0.0096   0.0001  -0.0123
GLN_GLY   0.0001  -0.0101   0.0017   0.0009   0.0086
GLN_HID  -0.0019  -0.0018  -0.0197  -0.0004  -0.0001
GLN_ILE  -0.0014   0.0042  -0.0054   0.0005  -0.0053
GLN_LEU   0.0020  -0.0045   0.0024  -0.0040   0.0000
GLN_LYS  -0.0012  -0.0019  -0.0060   0.0002  -0.0067
GLN_MET  -0.0008   0.0018   0.0257   0.0011  -0.0110
GLN_PHE   0.0006   0.0051  -0.0089   0.0026   0.0081
GLN_PRO   0.0023   0.0023  -0.0154  -0.0025   0.0059
GLN_SER  -0.0022   0.0065   0.0055   0.0001  -0.0148
GLN_THR  -0.0015   0.0050   0.0114  -0.0001  -0.0060
GLN_TRP   0.0004   0.0090  -0.0034   0.0013  -0.0027
GLN_TYR  -0.0022  -0.0071  -0.0085   0.0022  -0.0160
GLN_VAL   0.0024   0.0068   0.0143  -0.0006   0.0013
GLU_GLU  -0.0018  -0.0003  -0.0107  -0.0003   0.0028
GLU_GLY   0.0013   0.0046  -0.0047   0.0005   0.0027
GLU_HID  -0.0020  -0.0016   0.0139  -0.0017   0.0034
GLU_ILE   0.0022   0.0066   0.0101  -0.0015   0.0122
GLU_LEU   0.0006   0.0038   0.0160   0.0017  -0.0035
GLU_LYS  -0.0022  -0.0002  -0.0116  -0.0006   0.0086
GLU_MET   0.0019  -0.0034  -0.0071   0.0007  -0.0250
GLU_PHE  -0.0026   0.0007   0.0052   0.0014   0.0033
GLU_PRO   0.0042  -0.0016  -0.0005  -0.0042   0.0114
GLU_SER   0.0005   0.0052   0.0112  -0.0018   0.0186
GLU_THR   0.0008  -0.0013  -0.0181  -0.0004   0.0260
GLU_TRP   0.0013  -0.0066   0.0010   0.0010  -0.0019
GLU_TYR  -0.0020   0.0048  -0.0051  -0.0029   0.0206
GLU_VAL  -0.0006   0.0005  -0.0036  -0.0002   0.0404
GLY_GLY  -0.0003  -0.0035  -0.0019  -0.0005   0.0043
GLY_HID  -0.0025   0.0035   0.0138  -0.0015  -0.0167
GLY_ILE   0.0008  -0.0019  -0.0110   0.0002  -0.0024
GLY_LEU   0.0001  -0.0054  -0.0138   0.0018   0.0036
GLY_LYS   0.0004  -0.0005   0.0019  -0.0013  -0.0006
GLY_MET   0.0005  -0.0025  -0.0245   0.0006  -0.0002
GLY_PHE   0.0003  -0.0020  -0.0003  -0.0001  -0.0008
GLY_PRO   0.0010  -0.0043  -0.0015   0.0026  -0.0125
GLY_SER   0.0005  -0.0014   0.0026  -0.0003  -0.0053
GLY_THR  -0.0004   0.0025   0.0026   0.0012  -0.0085
GLY_TRP  -0.0005   0.0072  -0.0032  -0.0005   0.0112
GLY_TYR   0.0005  -0.0026  -0.0001  -0.0002   0.0133
GLY_VAL  -0.0003   0.0001   0.0068  -0.0004   0.0042
HID_HID  -0.0006  -0.0023  -0.0056   0.0024  -0.0128
HID_ILE  -0.0015  -0.0055   0.0125  -0.0024  -0.0081
HID_LEU  -0.0007  -0.0005  -0.0004   0.0021   0.0018
HID_LYS  -0.0014   0.0011   0.0010  -0.0022   0.0030
HID_MET   0.0005  -0.0027   0.0044   0.0008   0.0006
HID_PHE  -0.0003   0.0002   0.0100  -0.0003  -0.0133
HID_PRO   0.0030   0.0003   0.0124   0.0021   0.0091
HID_SER   0.0017  -0.0004   0.0014  -0.0013  -0.0082
HID_THR   0.0002   0.0079   0.0138  -0.0013  -0.0128
HID_TRP  -0.0022  -0.0048   0.0017   0.0010   0.0009
HID_TYR   0.0040  -0.0133  -0.0087  -0.0024  -0.0072
HID_VAL  -0.0010  -0.0003  -0.0041   0.0056  -0.0101
ILE_ILE   0.0016  -0.0077  -0.0255   0.0000   0.0262
ILE_LEU   0.0012  -0.0029  -0.0061  -0.0021   0.0007
ILE_LYS   0.0003  -0.0006  -0.0176  -0.0004   0.0169
ILE_MET  -0.0020   0.0038   0.0136   0.0007  -0.0017
ILE_PHE  -0.0007   0.0103   0.0033  -0.0017  -0.0059
ILE_PRO   0.0016  -0.0018  -0.0110   0.0026   0.0132
ILE_SER  -0.0007   0.0002  -0.0025   0.0012   0.0120
ILE_THR   0.0014   0.0105  -0.0162  -0.0001   0.0002
ILE_TRP  -0.0029   0.0096  -0.0199   0.0000   0.0133
ILE_TYR  -0.0003   0.0099   0.0142  -0.0052  -0.0020
ILE_VAL   0.0010  -0.0007  -0.0099  -0.0005   0.0047
LEU_LEU   0.0005   0.0041   0.0023   0.0000   0.0035
LEU_LYS  -0.0001   0.0007  -0.0075   0.0012   0.0076
LEU_MET   0.0007  -0.0046   0.0077   0.0009   0.0079
LEU_PHE   0.0012   0.0121  -0.0019   0.0043  -0.0097
LEU_PRO  -0.0032  -0.0018  -0.0118   0.0018   0.0071
LEU_SER   0.0029   0.0026  -0.0105   0.0022  -0.0144
LEU_THR  -0.0010  -0.0047  -0.0051   0.0016  -0.0084
LEU_TRP   0.0014  -0.0047   0.0157   0.0033  -0.0046
LEU_TYR  -0.0010  -0.0013  -0.0130   0.0005  -0.0007
LEU_VAL  -0.0017   0.0077  -0.0090   0.0027   0.0197
LYS_LYS  -0.0013   0.0009  -0.0075  -0.0001  -0.0130
LYS_MET  -0.0003  -0.0026  -0.0109  -0.0012   0.0067
LYS_PHE   0.0010   0.0062  -0.0023   0.0004  -0.0043
LYS_PRO  -0.0016   0.0027  -0.0017  -0.0024   0.0064
LYS_SER  -0.0009  -0.0022  -0.0145   0.0023   0.0184
LYS_THR  -0.0009   0.0094  -0.0048   0.0004  -0.0257
LYS_TRP   0.0002  -0.0023   0.0170   0.0008  -0.0080
LYS_TYR   0.0006  -0.0057  -0.0223   0.0004   0.0238
LYS_VAL   0.0001  -0.0023   0.0225  -0.0009  -0.0133
MET_MET   0.0010   0.0011  -0.0043   0.0006   0.0022
MET_PHE  -0.0005   0.0017  -0.0025   0.0045  -0.0040
MET_PRO   0.0018   0.0066  -0.0011  -0.0029   0.0046
MET_SER  -0.0018   0.0071   0.0082   0.0004  -0.0020
MET_THR   0.0005   0.0039  -0.0103  -0.0008   0.0199
MET_TRP   0.0014   0.0072  -0.0041   0.0009   0.0311
MET_TYR   0.0012   0.0053   0.0002   0.0015  -0.0046
MET_VAL   0.0002  -0.0006   0.0026   0.0002   0.0084
PHE_PHE  -0.0001   0.0028   0.0042   0.0014  -0.0032
PHE_PRO  -0.0020   0.0003   0.0001   0.0011   0.0046
PHE_SER  -0.0017  -0.0107  -0.0119   0.0006   0.0158
PHE_THR  -0.0010   0.0077  -0.0001  -0.0026  -0.0024
PHE_TRP  -0.0017  -0.0020  -0.0083   0.0018   0.0076
PHE_TYR   0.0008  -0.0003   0.0001  -0.0020  -0.0122
PHE_VAL   0.0041  -0.0029  -0.0035   0.0021   0.0088
PRO_PRO  -0.0007  -0.0018  -0.0029  -0.0018   0.0025
PRO_SER   0.0014   0.0037   0.0076  -0.0009  -0.0022
PRO_THR   0.0020   0.0037  -0.0143  -0.0010   0.0083
PRO_TRP   0.0018  -0.0046   0.0027   0.0017  -0.0006
PRO_TYR   0.0009  -0.0007  -0.0070  -0.0002  -0.0041
PRO_VAL  -0.0019   0.0030   0.0102   0.0023  -0.0008
SER_SER  -0.0008  -0.0021   0.0093  -0.0010   0.0036
SER_THR   0.0022   0.0039  -0.0120  -0.0008  -0.0026
SER_TRP   0.0005  -0.0015  -0.0137   0.0025   0.0189
SER_TYR   0.0002   0.0007   0.0008   0.0000  -0.0025
SER_VAL   0.0008   0.0045  -0.0113   0.0010   0.0013
THR_THR  -0.0001   0.0007  -0.0020  -0.0004  -0.0083
THR_TRP   0.0003  -0.0128  -0.0146   0.0022   0.0059
THR_TYR  -0.0009  -0.0034  -0.0045  -0.0003  -0.0023
THR_VAL   0.0020  -0.0047   0.0037  -0.0010  -0.0060
TRP_TRP   0.0007  -0.0020   0.0109   0.0007   0.0094
TRP_TYR  -0.0005   0.0039  -0.0193  -0.0007   0.0078
TRP_VAL  -0.0013   0.0004   0.0159  -0.0020  -0.0042
TYR_TYR   0.0019   0.0132  -0.0219  -0.0006   0.0068
TYR_VAL  -0.0015  -0.0104   0.0202   0.0006  -0.0025
VAL_VAL   0.0009  -0.0005  -0.0238   0.0031   0.0047
Maximum   0.0042   0.0175   0.0584   0.0067   0.0404
Minimum  -0.0052  -0.0151  -0.0434  -0.0061  -0.0402

Table 5.15. A brief comparison between FFENCODER and Amber with ff94 and ff14SB parameter sets.

                        ff94                      ff14SB
               Maximum     Minimum       Maximum     Minimum
            (kcal/mol)  (kcal/mol)    (kcal/mol)  (kcal/mol)
Single amino acid test set
  Dihedral      0.0010     -0.0011        0.0033     -0.0011
  1_4_VdW       0.0065     -0.0051        0.0063     -0.0104
  1_4_EEL       0.0076     -0.0177        0.0061     -0.0191
  VdW           0.0021     -0.0010        0.0036     -0.0015
  EEL           0.0244     -0.0085        0.0247     -0.0098
Dipeptide test set
  Dihedral      0.0047     -0.0098        0.0042     -0.0052
  1_4_VdW       0.0143     -0.0152        0.0175     -0.0151
  1_4_EEL       0.0807     -0.0357        0.0584     -0.0434
  VdW           0.0086     -0.0063        0.0067     -0.0061
  EEL           0.0485     -0.0519        0.0404     -0.0402

Table 5.16. Accuracy and native ranking comparison between RF models based on ff94 and ff14SB with other scoring functions.
                                  Accuracy                   Native ranking
                         average  highest   lowest    average  highest   lowest
RF models with ff94
  IMP_100                  0.988    0.999    0.963       5.88    17.17     1.47
  IMP_200                  0.987    1.000    0.963       5.93    17.87     1.11
  IMP_300                  0.989    0.999    0.958       4.40    14.66     1.28
  IMP_400                  0.988    0.999    0.965       5.01    15.06     1.26
  IMP_500                  0.992    0.999    0.986       3.52     6.19     1.23
RF models with ff14SB
  IMP_100                  0.987    1.000    0.973       5.16    10.45     1.11
  IMP_200                  0.987    0.999    0.962       5.83    16.85     1.17
  IMP_300                  0.991    1.000    0.972       4.28    14.28     1.04
  IMP_400                  0.987    0.999    0.965       4.80    11.11     1.51
  IMP_500                  0.990    0.996    0.973       4.06     9.34     1.85
RF models with KECSA2
  IMP_100                  0.963    0.987    0.931      10.62    21.64     4.86
  IMP_200                  0.972    0.989    0.947       8.01    11.17     6.05
  IMP_300                  0.976    0.993    0.933       8.69    13.31     3.92
  IMP_400                  0.977    0.965    0.956       6.06    15.34     2.36
  IMP_500                  0.981    0.994    0.965       7.95    17.59     2.77
RWplus                     0.916        -        -      23.43        -        -
DFIRE                      0.886        -        -      31.35        -        -
dDFIRE                     0.904        -        -      26.49        -        -
GOAP                       0.917        -        -      23.09        -        -

Table 5.17. 1st decoy RMSD and TM-score comparison between RF models based on ff94 and ff14SB with other scoring functions.

                               1st decoy RMSD              1st decoy TM-score
                         average  highest   lowest    average  highest   lowest
RF models with ff94
  IMP_100                   4.55     6.03     3.38      0.614    0.679    0.550
  IMP_200                   4.83     5.51     4.05      0.618    0.654    0.551
  IMP_300                   4.84     5.92     3.88      0.618    0.704    0.561
  IMP_400                   4.67     5.50     3.67      0.621    0.669    0.564
  IMP_500                   4.93     6.36     3.8       0.611    0.672    0.536
RF models with ff14SB
  IMP_100                   4.74     5.47     3.79      0.621    0.688    0.563
  IMP_200                   4.83     5.76     4.48      0.620    0.656    0.587
  IMP_300                   4.87     5.83     3.78      0.609    0.639    0.572
  IMP_400                   5.22     5.66     4.57      0.594    0.628    0.564
  IMP_500                   4.80     5.96     4.15      0.616    0.660    0.591
RF models with KECSA2
  IMP_100                   4.62     5.49     3.77      0.634    0.685    0.574
  IMP_200                   5.06     5.71     4.18      0.598    0.656    0.561
  IMP_300                   4.78     5.43     3.56      0.604    0.679    0.546
  IMP_400                   4.74     5.48     4.00      0.629    0.684    0.582
  IMP_500                   4.57     5.72     3.52      0.614    0.695    0.536
RWplus                      4.53        -        -      0.622        -        -
DFIRE                       4.51        -        -      0.623        -        -
dDFIRE                      4.44        -        -      0.625        -        -
GOAP                        4.45        -        -      0.674        -        -

Table 5.18. Distribution of decoys' lowest RMSD values in the combined decoy set.
Protein  Lowest    Protein  Lowest    Protein  Lowest    Protein  Lowest    Protein  Lowest
name       RMSD    name       RMSD    name       RMSD    name       RMSD    name       RMSD
1r69      0.195    1gaf      0.797    1nbv      1.166    1opd      1.931    1nkl      5.325
1di2      0.202    256b      0.803    1mcp      1.169    1cew      1.954    1dxt      5.401
1o2f      0.204    1gyv      0.804    1c8c      1.173    1c9o      2.007    1trl      5.401
1no5      0.205    1jnu      0.805    1mla      1.185    1gpt      2.012    1ew4      5.703
1a32      0.212    1tfi      0.812    6fab      1.191    1mfa      2.044    1pgb      5.874
1org      0.220    1thx      0.817    1of9      1.212    1hbg      2.049    2mta      5.944
1hbk      0.224    2cro      0.823    1fai      1.225    1ubi      2.116    1bg8      6.050
2reb      0.227    1shf      0.831    1tif      1.230    1ash      2.254    1kpe      6.244
1cy5      0.265    1bm8      0.843    1kjs      1.233    1cc8      2.264    2vik      6.256
2cr7      0.271    1fpt      0.847    1yuh      1.254    4sdh      2.287    1gky      6.275
1b72      0.290    1mlb      0.873    1fbi      1.280    2chf      2.334    1bbh      6.478
1abv      0.290    1hil      0.878    1mrd      1.283    1mn8      2.447    1cei      6.504
1cqk      0.343    1vcc      0.887    1ikf      1.308    1scj      2.457    1eyv      6.515
1aoy      0.346    2cgr      0.893    1ctf      1.319    1b0n      2.466    1a68      6.620
1mky      0.403    1igf      0.910    1sn3      1.341    2cmd      2.528    5cro      6.627
1bq9      0.419    1frg      0.917    1kem      1.358    1fc2      2.547    1dkt      6.713
1kvi      0.425    1igc      0.917    1dfb      1.366    1gdm      2.609    1cpc      6.873
1hda      0.429    1hbh      0.919    2a0b      1.371    1hdd      2.722    1gvp      6.957
1myg      0.433    1dbb      0.931    1opg      1.371    1bba      2.788    1beo      7.045
1csp      0.438    1for      0.938    1fig      1.378    1gnu      2.816    1vie      7.152
1ogw      0.457    3icb      0.961    4rxn      1.379    1b3a      2.883    1bk2      7.157
2f3n      0.500    2pcy      0.962    1gig      1.410    1hlb      2.884    1jwe      7.879
1ah9      0.510    3hfm      0.963    4pti      1.421    1a19      2.889    1tit      8.190
1elw      0.513    1ggi      0.968    1ngq      1.429    1hlm      2.994    1bgf      8.395
1ten      0.536    1jel      0.970    1ecd      1.478    1fna      3.009    1dhn      8.422
2dhb      0.541    1egx      0.973    1b4b      1.491    2lhb      3.024    1bkr      8.521
1pgx      0.552    1fgv      0.974    1mam      1.493    1acf      3.120    1cg5      8.545
1nps      0.559    1n0u      0.980    1nmb      1.493    1igd      3.171    smd3      8.612
1myj      0.563    1plg      1.004    1fvc      1.494    1cid      3.348    1rnb      8.778
1af7      0.573    1myt      1.004    1ibg      1.501    1eaf      3.455    811b      9.143
1g1c      0.576    1igm      1.028    1sfp      1.532    1cau      3.469    1who      9.170
1itp      0.615    1iai      1.041    1eap      1.545    1c2r      3.490    1lis      9.269
1sro      0.632    2fbj      1.046    1rmf      1.545    1bl0      3.641    1ptq      9.403
2pgh      0.641    1aiu      1.052    1igi      1.555    1wit      3.712    1fkb      9.682
1fad      0.644    1jhl      1.053    1bbd      1.580    1mdc      3.793    2acy      9.781
1dvf      0.645    1flr      1.065    1vge      1.602    1onc      3.807    2ci2      9.794
1dtj      0.656    1ail      1.071    1ith      1.614    1utg      3.900    2pna     10.803
1tig      0.670    1ne3      1.083    2gfb      1.630    1vls      3.984    1tul     11.621
1emy      0.690    1baf      1.087    1iib      1.662    2ovo      3.989    1lga     12.349
1bab      0.694    1dcj      1.111    2fb4      1.675    1eh2      4.002    1urn     12.423
1tet      0.735    1vfa      1.135    1ig5      1.704    1e6i      4.131    1col     12.427
1hkl      0.737    1nsn      1.146    1mbs      1.715    1dtk      4.333    1hz6     12.468
1fvd      0.772    1acy      1.148    1flp      1.719    4ubp      4.724    4sbv     14.050
1hsy      0.772    1ncb      1.149    1enh      1.745    4icb      4.781    1mup     14.293
1fo5      0.782    1gjx      1.156    1mba      1.824    1lou      4.974    2sim     15.671
1lht      0.784    1ucb      1.159    7fab      1.850    1fca      5.190    2afn     19.744
1bbj      0.791    1ind      1.162    8fab      1.913    1ugh      5.311

Table 5.19. Accuracy comparisons between scoring functions with and without RF refinement.

                                        Accuracy
                               average  highest   lowest
RF models with ff94
  IMP_100                        0.988    0.999    0.963
  IMP_500                        0.992    0.999    0.986
RF models with ff14SB
  IMP_100                        0.987    1.000    0.973
  IMP_500                        0.990    0.996    0.973
ff94 without RF refinement
  IMP_100                        0.626        -        -
  IMP_500                        0.564        -        -
ff14SB without RF refinement
  IMP_100                        0.656        -        -
  IMP_500                        0.660        -        -

Table 5.20. Accuracy comparisons between scoring functions with and without force field parameters.

                                        Accuracy
                               average  highest   lowest
RF models with ff94
  IMP_100                        0.988    0.999    0.963
  IMP_500                        0.992    0.999    0.986
RF models with ff14SB
  IMP_100                        0.987    1.000    0.973
  IMP_500                        0.990    0.996    0.973
RF models without force field potentials
  IMP_100                        0.679    0.768    0.593
  IMP_500                        0.712    0.780    0.572

APPENDIX B: FIGURES

Figure 1.1. The shape of the sigmoid function shown in equation (1.3).

Figure 1.2. An example of a simple decision tree.

Figure 1.3. Comparison between a general hyperplane and the hyperplane generated by the maximum margin classifier. (a) Infinitely many hyperplanes can be used to separate the data set. (b) The hyperplane with the largest margin of separation.

Figure 1.4. An example of a support vector machine algorithm.
(a) An example of a data set that cannot be separated by a hyperplane. The observations in the data set are one-dimensional points. (b) A polynomial kernel is used to map those one-dimensional points to 2D points. A hyperplane (black line) can then be used to separate the observations.

Figure 1.5. An example of bagging.

Figure 1.6. An example of RF.

Figure 2.1. Probability versus distance plot for atom pair O-MET__CG-MET. The shaded region is the averaged region, with a length of 1 Å.
[Plot labels: O-MET__CG-MET; Probability; r = 4.5]

Figure 2.2. The protocol used to build the Random Forest model. Parameter p (equal to 16029) represents the total number of atom pairs in KECSA2. Parameter n represents the native structure; d1, ..., dm are the 1st, ..., mth decoy structures.
[Diagram labels: descriptor vector; final descriptor vector; Class '1'; Class '0'; Random Forest training; voting; final class]

Figure 2.3. Protocol for generating the ranking list for the Random Forest model. Parameter p (equal to 16029) represents the total number of atom pairs in KECSA2. S1, S2, ..., Sn are the 1st, 2nd, ..., nth protein structures with the same residue sequence.
[Diagram labels: Table_S1, Table_S2, Table_Sn; voting; final class; Score_S1, Score_S2, Score_Sn; sum]

Figure 3.1. Feature importance analysis results for the overall decoy set. The red point represents the 500th atom pair.

Figure 4.1. The protocol used to include the comparison information between the best decoy binding pose and the other decoy poses.

Figure 4.2. Accuracy trend from RF models based on the original (blue line) and uniform (orange line) GARF data sets.

Figure 5.1. Comparisons between energies calculated by FFENCODER and the Amber software package. (a)-(e) are results for dihedral, 1_4 van der Waals, 1_4 electrostatic, van der Waals, and electrostatic energies. Columns (1) and (3) are comparisons of the single amino acid and dipeptide test sets for ff94. Columns (2) and (4) are comparisons of the single amino acid and dipeptide test sets for ff14SB.

Figure 5.2. Importance analysis for features in the ff94 and ff14SB parameter sets. The red point in each plot represents the 500th most important feature in each parameter set.

APPENDIX C: COPYRIGHT NOTICE

Chapters 2, 3, 4, and 5 of this dissertation (including their supporting information) are adapted with permission from the publications listed below:
(1) Adapted with permission from ref 117. Copyright 2019 American Chemical Society.
(2) Adapted with permission from ref 124. Copyright 2019 American Chemical Society.

BIBLIOGRAPHY

1. John, B.; Sali, A. Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res. 2003, 31, 3982–3992.
2. Zhou, H.; Skolnick, J. GOAP: A Generalized Orientation-Dependent, All-Atom Statistical Potential for Protein Structure Prediction. Biophys. J. 2011, 101, 2043–2052.
3. Skolnick, J. In quest of an empirical potential for protein structure prediction. Curr. Opin. Struct. Biol. 2006, 16, 166–171.
4. Sippl, M. J. Knowledge-based potentials for proteins. Curr. Opin. Struct. Biol. 1995, 5, 229–235.
5. Jernigan, R. L.; Bahar, I. Structure-derived potentials and protein simulations. Curr. Opin. Struct. Biol. 1996, 6, 195–209.
6. Moult, J. Comparison of database potentials and molecular mechanics force fields. Curr. Opin. Struct. Biol. 1997, 7, 194–199.
7. Lazaridis, T.; Karplus, M. Effective energy functions for protein structure prediction. Curr. Opin. Struct. Biol. 2000, 10, 139–145.
8. Gohlke, H.; Klebe, G. Statistical potentials and scoring functions applied to protein-ligand binding. Curr. Opin. Struct. Biol. 2001, 11, 231–235.
9. Russ, W. P.; Ranganathan, R.
Knowledge-based potential functions in protein design. Curr. Opin. Struct. Biol. 2002, 12, 447–452.
10. Buchete, N. V.; Straub, J. E.; Thirumalai, D. Development of novel statistical potentials for protein fold recognition. Curr. Opin. Struct. Biol. 2004, 14, 225–232.
11. Poole, A. M.; Ranganathan, R. Knowledge-based potentials in protein design. Curr. Opin. Struct. Biol. 2006, 16, 508–513.
12. Zhou, Y.; Zhou, H.; Zhang, C.; Liu, S. What is a desirable statistical energy function for proteins and how can it be obtained? Cell Biochem. Biophys. 2006, 46, 165–174.
13. Ma, J. Explicit Orientation Dependence in Empirical Potentials and Its Significance to Side-Chain Modeling. Acc. Chem. Res. 2009, 42, 1087–1096.
14. Gilis, D.; Biot, C.; Buisine, E.; Dehouck, Y.; Rooman, M. Development of Novel Statistical Potentials Describing Cation–π Interactions in Proteins and Comparison with Semiempirical and Quantum Chemistry Approaches. J. Chem. Inf. Model. 2006, 46, 884–893.
15. Hendlich, M.; Lackner, P.; Weitckus, S.; Floeckner, H.; Froschauer, R.; Gottsbacher, K.; Casari, G.; Sippl, M. J. Identification of native protein folds amongst a large number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol. 1990, 216, 167–180.
16. Hoppe, C.; Schomburg, D. Prediction of protein thermostability with a direction- and distance-dependent knowledge-based potential. Protein Sci. 2005, 14, 2682–2692.
17. Jones, D. T.; Taylor, W. R.; Thornton, J. M. A new approach to protein fold recognition. Nature 1992, 358, 86–89.
18. Koliński, A.; Bujnicki, J. M. Generalized protein structure prediction based on combination of fold-recognition with de novo folding and evaluation of models. Proteins: Struct., Funct., Genet. 2005, 61, 84–90.
19. Miyazawa, S.; Jernigan, R. L. Estimation of Effective Interresidue Contact Energies from Protein Crystal Structures: Quasi-Chemical Approximation. Macromolecules 1985, 18, 534–552.
20. DeBolt, S. E.; Skolnick, J. Evaluation of atomic level mean force potentials via inverse folding and inverse refinement of protein structures: atomic burial position and pairwise non-bonded interactions. Protein Eng., Des. Sel. 1996, 9, 637–655.
21. Zhang, C.; Vasmatzis, G.; Cornette, J. L.; DeLisi, C. Determination of atomic desolvation energies from the structures of crystallized proteins. J. Mol. Biol. 1997, 267, 707–726.
22. Tobi, D.; Elber, R. Distance-dependent, pair-potential for protein folding: Results from linear optimization. Proteins: Struct., Funct., Genet. 2000, 41, 40–46.
23. Wu, Y.; Lu, M.; Chen, M.; Li, J.; Ma, J. OPUS-Ca: A knowledge-based potential function requiring only Cα positions. Protein Sci. 2007, 16, 1449–1463.
24. Zhang, Y.; Kolinski, A.; Skolnick, J. TOUCHSTONE II: A New Approach to Ab Initio Protein Structure Prediction. Biophys. J. 2003, 85, 1145–1164.
25. Sippl, M. J. Calculation of Conformational Ensembles from Potential of Mean Force. J. Mol. Biol. 1990, 213, 859–883.
26. Lu, H.; Skolnick, J. A distance dependent atomic knowledge-based potential for improved protein structure selection. Proteins: Struct., Funct., Genet. 2001, 44, 223–232.
27. Lu, M.; Dousis, A. D.; Ma, J. OPUS-PSP: An Orientation-dependent Statistical All-atom Potential Derived from Side-chain Packing. J. Mol. Biol. 2008, 376, 288–301.
28. Samudrala, R.; Moult, J. An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. J. Mol. Biol. 1998, 275, 895–916.
29. Shen, M.-Y.; Sali, A. Statistical potential for assessment and prediction of protein structures. Protein Sci. 2006, 15, 2507–2524.
30. Yang, Y.; Zhou, Y. Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins: Struct., Funct., Genet. 2008, 72, 793–803.
31. Skolnick, J.; Kolinski, A.; Ortiz, A. Derivation of protein-specific pair potentials based on weak sequence fragment similarity. Proteins: Struct., Funct., Genet. 2000, 38, 3–16.
32. Zhang, J.; Zhang, Y. A novel side-chain orientation dependent potential derived from random-walk reference state for protein fold selection and structure prediction. PLoS One 2010, 5, No. e15386.
33. Zhou, H.; Zhou, Y. Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Sci. 2002, 11, 2714–2726.
34. Benkert, P.; Tosatto, S. C. E.; Schomburg, D. QMEAN: A comprehensive scoring function for model quality assessment. Proteins: Struct., Funct., Genet. 2008, 71, 261–277.
35. Cao, R.; Bhattacharya, D.; Hou, J.; Cheng, J. DeepQA: Improving the estimation of single protein model quality with deep belief networks. BMC Bioinf. 2016, 17, 1–9.
36. Uziela, K.; Shu, N.; Wallner, B.; Elofsson, A. ProQ3: Improved model quality assessments using Rosetta energy terms. Sci. Rep. 2016, 6, 1–10.
37. Uziela, K.; Hurtado, D. M.; Shu, N.; Wallner, B.; Elofsson, A. ProQ3D: Improved model quality assessments using deep learning. Bioinformatics 2017, 33, 1578–1580.
38. Manavalan, B.; Lee, J. SVMQA: support-vector-machine-based protein single-model quality assessment. Bioinformatics 2017, 33, 2496–2503.
39. Olechnovič, K.; Venclovas, Č. VoroMQA: Assessment of protein structure quality using interatomic contact areas. Proteins: Struct., Funct., Genet. 2017, 85, 1131–1145.
40. Hurtado, D. M.; Uziela, K.; Elofsson, A. Deep transfer learning in the assessment of the quality of protein models. arXiv:1804.06281, 2018.
41. Simons, K. T.; Kooperberg, C.; Huang, E.; Baker, D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J. Mol. Biol. 1997, 268, 209–225.
42. Balakrishnan, S.; Kamisetty, H.; Carbonell, J. G.; Lee, S.-I.; Langmead, C. J. Learning generative models for protein fold families. Proteins: Struct., Funct., Genet. 2011, 79, 1061–1078.
43. Leaver-Fay, A.; Tyka, M.; Lewis, S. M.; Lange, O. F.; Thompson, J.; Jacak, R.; Kaufman, K. W.; Renfrew, P. D.; Smith, C. A.; Sheffler, W.; Davis, I. W.; Cooper, S.; Treuille, A.; Mandell, D. J.; Richter, F.; Ban, Y. A.; Fleishman, S. J.; Corn, J. E.; Kim, D. E.; Lyskov, S.; Berrondo, M.; Mentzer, S.; Popovic, Z.; Havranek, J. J.; Karanicolas, J.; Das, R.; Meiler, J.; Kortemme, T.; Gray, J. J.; Kuhlman, B.; Baker, D.; Bradley, P. ROSETTA3: an object-oriented software suite for the simulation and design of macromolecules. Methods Enzymol. 2011, 487, 545–574.
44. Wood, C. W.; Bruning, M.; Ibarra, A. A.; Bartlett, G. J.; Thomson, A. R.; Sessions, R. B.; Brady, R. L.; Woolfson, D. N. CCBuilder: an interactive web-based tool for building, designing and assessing coiled-coil protein assemblies. Bioinformatics 2014, 30, 3029–3035.
45. Negron, C.; Keating, A. E. Multistate protein design using CLEVER and CLASSY. Methods Enzymol. 2013, 523, 171–190.
46. Smadbeck, J.; Peterson, M. B.; Khoury, G. A.; Taylor, M. S.; Floudas, C. A. Protein WISDOM: a workbench for in silico de novo design of biomolecules. J. Visualized Exp. 2013, 77, No. e50476.
47. Dahiyat, B. I.; Mayo, S. L. Protein design automation. Protein Sci. 1996, 5, 895–903.
48. Dahiyat, B. I.; Mayo, S. L. De novo protein design: fully automated sequence selection. Science 1997, 278, 82–87.
49. Kamisetty, H.; Ovchinnikov, S.; Baker, D. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. U. S. A. 2013, 110, 15674–15679.
50. Ovchinnikov, S.; Kamisetty, H.; Baker, D. Robust and accurate prediction of residue–residue interactions across protein interfaces using evolutionary information. eLife 2014, 3, No. e02030.
51. Ovchinnikov, S.; Kinch, L.; Park, H.; Liao, Y.; Pei, J.; Kim, D.
E.; Kamisetty, H.; Grishin, N. V.; Baker, D. Large-scale determination of previously unsolved protein structures using evolutionary information. eLife 2015, 4, No. e09248.
52. Källberg, M.; Wang, H.; Wang, S.; Peng, J.; Wang, Z.; Lu, H.; Xu, J. Template-based protein structure modeling using the RaptorX web server. Nat. Protoc. 2012, 7, 1511–1522.
53. Kuhlman, B.; Dantas, G.; Ireton, G. C.; Varani, G.; Stoddard, B. L.; Baker, D. Design of a novel globular protein fold with atomic-level accuracy. Science 2003, 302, 1364–1368.
54. Huang, P.-S.; Ban, Y.-E. A.; Richter, F.; Andre, I.; Vernon, R.; Schief, W. R.; Baker, D. RosettaRemodel: a generalized framework for flexible backbone protein design. PLoS One 2011, 6, No. e24109.
55. Harbury, P. B.; Plecs, J. J.; Tidor, B.; Alber, T.; Kim, P. S. High-resolution protein design with backbone freedom. Science 1998, 282, 1462–1467.
56. Thomson, A. R.; Wood, C. W.; Burton, A. J.; Bartlett, G. J.; Sessions, R. B.; Brady, R. L.; Woolfson, D. N. Computational design of water-soluble α-helical barrels. Science 2014, 346, 485–488.
57. Grigoryan, G.; DeGrado, W. F. Probing designability via a generalized model of helical bundle geometry. J. Mol. Biol. 2011, 405, 1079–1100.
58. Huang, P.-S.; Oberdorfer, G.; Xu, C.; Pei, X. Y.; Nannenga, B. L.; Rogers, J. M.; DiMaio, F.; Gonen, T.; Luisi, B.; Baker, D. High thermodynamic stability of parametrically designed helical bundles. Science 2014, 346, 481–485.
59. Regan, L.; DeGrado, W. F. Characterization of a helical protein designed from first principles. Science 1988, 241, 976–978.
60. Lin, Y.-R.; Koga, N.; Tatsumi-Koga, R.; Liu, G.; Clouser, A. F.; Montelione, G. T.; Baker, D. Control over overall shape and size in de novo designed proteins. Proc. Natl. Acad. Sci. U. S. A. 2015, 112, E5478–E5485.
61. Koga, N.; Tatsumi-Koga, R.; Liu, G.; Xiao, R.; Acton, T. B.; Montelione, G. T.; Baker, D. Principles for designing ideal protein structures. Nature 2012, 491, 222–227.
62. MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E., III; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiórkiewicz-Kuczera, J.; Yin, D.; Karplus, M. All-Atom Empirical Potential for Molecular Modeling and Dynamics Studies of Proteins. J. Phys. Chem. B 1998, 102, 3586–3616.
63. Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. CHARMM: A program for macromolecular energy, minimization, and dynamics calculations. J. Comput. Chem. 1983, 4, 187–217.
64. Weiner, S. J.; Kollman, P. A.; Nguyen, D. T.; Case, D. A. An all atom force field for simulations of proteins and nucleic acids. J. Comput. Chem. 1986, 7, 230–252.
65. Case, D. A.; Cheatham, T. E.; Darden, T.; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. The Amber biomolecular simulation programs. J. Comput. Chem. 2005, 26, 1668–1688.
66. Arnautova, Y. A.; Jagielska, A.; Scheraga, H. A. A New Force Field (ECEPP-05) for Peptides, Proteins, and Organic Molecules. J. Phys. Chem. B 2006, 110, 5025–5044.
67. Ponder, J. W.; Case, D. A. Force fields for protein simulations. Adv. Protein Chem. 2003, 66, 27–85.
68. Jagielska, A.; Wroblewska, L.; Skolnick, J. Protein model refinement using an optimized physics-based all-atom force field. Proc. Natl. Acad. Sci. U. S. A. 2008, 105, 8268–8273.
69. Meng, E. C.; Shoichet, B. K.; Kuntz, I. D. Automated Docking with Grid-Based Energy Evaluation. J. Comput. Chem. 1992, 13, 505–524.
70. Makino, S.; Kuntz, I. D. Automated Flexible Ligand Docking: Method and Its Application for Database Search. J. Comput. Chem. 1997, 18, 1812–1825.
71. Goodsell, D. S.; Morris, G. M.; Olson, A. J. Automated Docking of Flexible Ligands: Applications of AutoDock. J. Mol. Recog. 1996, 9, 1–5.
72. Ortiz, A. R.; Pisabarro, M. T.; Gago, F.; Wade, R. C. Prediction of Drug Binding Affinities by Comparative Binding Energy Analysis. J. Med. Chem. 1995, 38, 2681–2691.
73. Yin, S.; Biedermannova, L.; Vondrasek, J.; Dokholyan, N. V. MedusaScore: An Accurate Force Field-Based Scoring Function for Virtual Drug Screening. J. Chem. Inf. Model. 2008, 48, 1656–1662.
74. Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R. Development and Validation of a Genetic Algorithm for Flexible Docking. J. Mol. Biol. 1997, 267, 727–748.
75. Aqvist, J.; Medina, C.; Samuelsson, J. E. New Method for Predicting Binding Affinity in Computer-Aided Drug Design. Protein Eng. 1994, 7, 385–391.
76. Almlof, M.; Brandsdal, B. O.; Aqvist, J. Binding Affinity Prediction with Different Force Fields: Examination of the Linear Interaction Energy Method. J. Comput. Chem. 2004, 25, 1242–1254.
77. Carlson, H. A.; Jorgensen, W. L. Extended Linear Response Method for Determining Free Energies of Hydration. J. Phys. Chem. 1995, 99, 10667–10673.
78. Jones-Hertzog, D. K.; Jorgensen, W. L. Binding Affinities for Sulfonamide Inhibitors with Human Thrombin Using Monte Carlo Simulations with a Linear Response Method. J. Med. Chem. 1997, 40, 1539–1549.
79. Kollman, P. A.; Massova, I.; Reyes, C.; Kuhn, B.; Huo, S.; Chong, L.; Lee, M.; Lee, T.; Duan, Y.; Wang, W.; Donini, O.; Cieplak, P.; Srinivasan, J.; Case, D. A.; Cheatham, T. E. Calculating Structures and Free Energies of Complex Molecules: Combining Molecular Mechanics and Continuum Models. Acc. Chem. Res. 2000, 33, 889–897.
80. DeWitte, R. S.; Shakhnovich, E. I. SMoG: de Novo Design Method Based on Simple, Fast, and Accurate Free Energy Estimates. 1. Methodology and Supporting Evidence. J. Am. Chem. Soc. 1996, 118, 11733–11744.
81. Grzybowski, B. A.; Ishchenko, A. V.; Shimada, J.; Shakhnovich, E. I. From Knowledge-Based Potentials to Combinatorial Lead Design in Silico. Acc. Chem. Res. 2002, 35, 261–269.
82. Muegge, I.; Martin, Y. C. A General and Fast Scoring Function for Protein-Ligand Interactions: A Simplified Potential Approach. J. Med. Chem. 1999, 42, 791–804.
83. Muegge, I. A Knowledge-Based Scoring Function for Protein-Ligand Interactions: Probing the Reference State. Perspect. Drug Discovery Des. 2000, 20, 99–114.
84. Muegge, I. Effect of Ligand Volume Correction on PMF Scoring. J. Comput. Chem. 2001, 22, 418–425.
85. Gohlke, H.; Hendlich, M.; Klebe, G. Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions. J. Mol. Biol. 2000, 295, 337–356.
86. Velec, H. F. G.; Gohlke, H.; Klebe, G. DrugScore(CSD): Knowledge-Based Scoring Function Derived from Small Molecule Crystal Data with Superior Recognition Rate of Near-Native Ligand Poses and Better Affinity Prediction. J. Med. Chem. 2005, 48, 6296–6303.
87. Neudert, G.; Klebe, G. DSX: A Knowledge-Based Scoring Function for the Assessment of Protein-Ligand Complexes. J. Chem. Inf. Model. 2011, 51, 2731–2745.
88. Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: I. Derivation of Interaction Potentials. J. Comput. Chem. 2006, 27, 1865–1875.
89. Huang, S.-Y.; Zou, X. An Iterative Knowledge-Based Scoring Function to Predict Protein-Ligand Interactions: II. Validation of the Scoring Function. J. Comput. Chem. 2006, 27, 1876–1882.
90. Huang, S.-Y.; Zou, X. Inclusion of Solvation and Entropy in the Knowledge-Based Scoring Function for Protein-Ligand Interactions. J. Chem. Inf. Model. 2010, 50, 262–273.
91. Zheng, Z.; Merz, K. M. Development of the Knowledge-Based and Empirical Combined Scoring Algorithm (KECSA) To Score Protein–Ligand Interactions. J. Chem. Inf. Model. 2013, 53, 1073–1083.
92. Wang, R.; Lai, L.; Wang, S.
Further Development and Validation of Empirical Scoring Functions for Structure -Based Binding Affinity Prediction. J. Comput. Aided. Mol. Des . 2002, 16, 11 ! 26. 93.!Bohm, H. J. The Development of A Simple Empirical Scoring Function t o Estimate the Binding Constant for a Protein -Ligand Complex of Known Three -Dimensional Structure. J. Comput. Aided. Mol. Des. 1994, 8, 243 ! 256. 94.!Verkhivker, G.; Appelt, K.; Freer, S. T.; Villafranca, J. E. Empirical Free Energy Calculations of Ligand -Prote in Crystallographic Complexes. I. Knowledge -Based Ligand -Protein Interaction Potentials Applied to the Prediction of Human Immunodeficiency Virus 1 Protease Binding Affinity. Protein Eng. 1995, 8, 677 ! 691. 95.!Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paol ini, G. V.; Mee, R. P. Empirical Scoring Functions: I. The Development of a Fast Empirical Scoring Function to Estimate the Binding Affinity of Ligands in Receptor Complexes. J. Comput. Aided. Mol. Des. 1997, 11, 425 ! 445. !161 96.!Murray, C. W.; Auton, T. R.; Eldri dge, M. D. Empirical Scoring Functions. II. The Testing of an Empirical Scoring Function for the Prediction of Ligand -Receptor Binding Affinities and the Use of Bayesian Regression to Improve the Quality of the Model. J. Comput. Aided. Mol. Des. 1998, 12, 503! 519. 97.!Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: A New Approach for Rapid, Accurate Docking and Scor ing. 1. Method and Assessment of Docking Accuracy. J. Med. Chem. 2004, 47 (7), 1739 ! 1749. 98.!Friesner, R. A.; Murphy, R. B.; Repasky, M. P.; Frye, L. L.; Greenwood, J. R.; Halgren, T. A.; Sanschagrin, P. C.; Mainz, D. T. Extra Precision Glide: Docking and Sc oring Incorporating a Model of Hydrophobic Enclosure for Protein -Ligand Complexes. J. Med. Chem. 2006, 49, 6177! 6196. 
99.!Jim”nez, J.; ( kali ) , M.; Mart™nez -Rosell, G.; De Fabritiis, G.. KDEEP: Protein -Ligand Absolute Binding Affinity Prediction via 3D -Convolut ional Neural Networks. J. Chem. Inf. Model. 2018, 58(2) , 287 Ð296. 100.!Cang, Z., Wei, G. W. Integration of Element Specific Persistent Homology and Machine Learning for Protein -Ligand Binding Affinity Prediction. Int. J. Numer Meth. Biomed. Engng . 2018, 34(2) , 1Ð17. 101.!Ragoza, M.; Hochuli, J.; Idrobo, E.; Sunseri, J.; Koes, D. R. Protein -Ligand Scoring with Convolutional Neural Networks. J. Chem. Inf. Model. 2017, 57(4), 942Ð957. 102.!Deng, W.; Breneman, C.; Embrechts, M. J. Predicting Protein -Ligand Binding Affinitie s Using Novel Geometrical Descriptors and Machine -Learning Methods. J. Chem. Inf. Comput. Sci. 2004, 44, 699 ! 703. 103.!Zhang, S.; Golbraikh, A.; Tropsha, A. Development of Quantitative Structure -Binding Affinity Relationship Models Based on Novel Geometrical C hemical Descriptors of the Protein -Ligand Interfaces. J. Med. Chem. 2006, 49, 2713 ! 2724. 104.!Durrant, J. D.; McCammon, J. A. NNScore: A Neural -Network - Based Scoring Function for the Characterization of Protein -Ligand Complexes. J. Chem. Inf. Model. 2010, 50, 1865! 1871. 105.!Durrant, J. D.; McCammon, J. A. NNScore 2.0: A Neural - Network Receptor -Ligand Scoring Function. J. Chem. Inf. Model. 2011, 51, 2897 ! 2903. 106.!Ballester, P. J.; Mitchell, J. B. O. A Machine Learning Approach to Predicting Protein ! Ligand Binding Affi nity with Applications to Molecular Docking. Bioinformatics 2010, 26, 1169 ! 1175. 107.!Ballester, P. J.; Schreyer, A.; Blundell, T. L. Does a More Precise Chemical Description of Protein ! Ligand Complexes Lead to More Accurate Prediction of Binding Affinity ? J. C hem. Inf. Model. 2014, 54, 944 ! 955. !162 108.!Zilian, D.; Sotriffer, C. A. SFCscore RF: A Random Forest -Based Scoring Function for Improved Affinity Prediction of Protein -Ligand Complexes. J. Chem. Inf. Model. 2013, 53, 1923! 1933. 
109.!Li, G. B.; Yang, L. L.; Wang, W. J.; Li, L. L.; Yang, S. Y. ID - Score: A New Empirical Scoring Function Based on a Comprehensive Set of Descriptors Related to Protein ! Ligand Interactions. J. Chem. Inf. Model. 2013, 53, 592 ! 600. 110.!Deng, Z.; Chuaqui, C.; Singh , J. Structural Interaction Fingerprint (SIFt): A Novel Method for Analyzing Three -Dimensional Protein -Ligand Binding Interactions. J. Med. Chem. 2004, 47, 337 ! 344. 111.!Dangetti, P. Journey from Statistics for Machine Learning. In Statistical for Machine Lear ning, Editing, S., Pagare, V., Singh, A., Pawanikar, M., Pawar, D. Ltd: Packt Publishing, Birmingham , United Kingdom, 2017; pp 9. 112.!Li, Y.; Liu, Z.; Li, J.; Han, L.; Liu, J.; Zhao, Z.; Wang, R. Comparative Assessment of Scoring Functions on an Updated Benchm ark: 1. Compilation of the Test Set. J. Chem. Inf. Model. 2014, 54(6) , 1700 Ð1716. 113.!Liu, Z.; Li, Y.; Han, L.; Li, J.; Liu, J.; Zhao, Z.; Nie, W.; Liu , Y.; Wang, R. PDB -wide collection of binding data: current status of the PDBbind database. Bioinformatics 2015, 31 (3): 405 -412 114.!Zheng, Z.; Pei, J.; Bansal, N.; Liu, H.; Song, L. F.; Merz, K. M. Generation of Pairwise Potentials Using Multidimensional Data Mining. J. Chem. Theory Comput. 2018, 14(10), 5045Ð5067. 115.!Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Miche l, V.; Thirion, B.; Grisel, O.; Blondel M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; Duchesnay, …. Scikit -Learn: Machine Learning in Python. Journal of Machine Learning Research 2011, 12, 2825 Ð2830. 116.!Breiman, L. Random forests. Machine Learning 2001 , 45, 5Ð32. 117.!Pei, J.; Zheng, Z.; Merz, K. M. Random Forest Refinement of the KECSA2 Knowledge -Based Scoring Function for Protein Decoy Detection. J. Chem. Inf. Model. 2019, 59, 1919-1929. 118.!D.A. Case, I. Y. B. -S., S.R. Brozell, D.S. Cerutti, T.E. Cheatham, III, V.W.D. Cruzeiro, T.A. Darden,; R.E. Duke, D. G., M.K. Gilson, H. 
Gohlke, A.W. Goetz, D. Greene, R Harris, N. Homeyer, Y. Huang,; S. Izadi, A. K., T. Kurtzman, T.S. Lee, S. LeGrand, P. Li, C. Lin, J. Liu, T. Luchko, R. Luo, D.J.; Mermelstein, K. M. M., Y. Miao, G. Monard, C. Nguyen, H. Nguyen, I. Omelyan, A. Onufriev, F. Pan, R.; Qi, D. R. R., A. Roitberg, C. Sagui, S. Schott -Verdugo, J. Shen, C.L. Simmerling, J. Smith, R. SalomonFerrer, J. Swails, R.C. Walker, J. Wang, H. Wei, R.M. Wolf, X. Wu, L. Xiao, D.M. York and P.A. Kollman, AMBER 2018, University of California, San Francisco. 2018. !163 119.!Tsui, V.; Case, D. A., Theory and applications of the generalized Born solvation model in macromolec ular simulations. Biopolymers 2000, 56 (4), 275 -91. 120.!Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. A Second Generation Force Field for the Simulation of Pr oteins, Nucleic Acids, and Organic Molecules. J. Am. Chem. Soc. , 1995, 117(19), 5179 Ð5197. 121.!Maier, J. A.; Martinez, C.; Kasavajhala, K.; Wickstrom, L.; Hauser, K. E.; Simmerling, C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameter s from ff99SB . J. Chem. Theory Comput. , 2015, 11(8), 3696 Ð3713. 122.!Ho, T. K. Random Decision Forests. Proc. Third Int. Conf. 1995,1, 278 ! 282. 123.!Liaw, A.; Wiener, M.; Breiman, L.; Cutler, A. Package ÔrandomForestÕ . 2015. 124.!Pei, J.; Zheng, Z.; Kim, H.; Song, L. F.; Walworth, S.; Merz, M. R.; Merz, K. M. Random Forest Refinement of Pairwise Potentials for Protein ÐLigand Decoy Detection. J. Chem. Inf. Model. , 2019, 59(7), 3305 Ð3315 !