PREDICTING THE PROPERTIES OF LIGANDS USING MOLECULAR DYNAMICS AND MACHINE LEARNING By Nazanin Donyapour A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering – Doctor of Philosophy 2022

ABSTRACT

PREDICTING THE PROPERTIES OF LIGANDS USING MOLECULAR DYNAMICS AND MACHINE LEARNING By Nazanin Donyapour

The discovery and design of new drugs requires extensive experimental assays that are usually very expensive and time-consuming. To cut down the cost and time of the drug development process and help design effective drugs more efficiently, various computational methods have been developed that are referred to collectively as in silico drug design. These in silico methods can be used not only to identify compounds that can bind to a target receptor but also to determine whether those compounds show ideal drug-like properties. In this dissertation, I provide solutions to these problems by developing novel methods for molecular simulation and molecular property prediction. Firstly, we have developed a new enhanced sampling MD algorithm called Resampling of Ensembles by Variation Optimization, or “REVO”, that can generate binding and unbinding pathways of ligand-target interactions. These pathways are useful for calculating transition rates and Residence Times (RT) of protein-ligand complexes. This is particularly useful for drug design, as studies on some systems show that drug efficacy correlates more strongly with RT than with binding affinity. The method is also generally useful for generating long-timescale transitions in complex systems, including alternate ligand binding poses and protein conformational changes. Secondly, we have developed a technique we refer to as “ClassicalGSG” to predict the partition coefficient (log 𝑃) of small molecules.
log 𝑃 is one of the main factors in determining the drug-likeness of a compound, as it helps determine bioavailability, solubility, and membrane permeability. This method has been very successful compared to other methods in the literature. Finally, we have developed a method called “Flexible Topology” that we hope can eventually be used to screen a database of potential ligands while accounting for ligand-induced conformational changes. After molecules with drug-like properties are discovered in the drug design pipeline, Virtual Screening (VS) methods are employed to perform an extensive search over drug databases with hundreds of millions of compounds to find candidates that bind tightly to a molecular target. However, to keep this computationally tractable, typically only static snapshots of the target are used, which cannot respond to the presence of the drug compound. To efficiently capture drug-target interactions during screening, we have developed a machine-learning algorithm that employs Molecular Dynamics (MD) simulations with a protein of interest and a set of atoms called “Ghost Particles”. During the simulation, the Flexible Topology method induces forces that constantly modify the ghost particles and optimize them toward drug-like molecules that are compatible with the molecular target.

Copyright by NAZANIN DONYAPOUR 2022

To my parents, husband and daughter

ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Dr. Alex Dickson, for his support, wisdom, enthusiasm, and encouragement. He has been a kind and patient advisor and has tirelessly helped me become a better researcher. Second, I want to extend my gratitude to my Ph.D. committee members, Dr. Scott Barton, Dr. Matthew Hirn, and Dr. Ekaterina Rapinchuk, for their insightful comments and feedback. I thank my fellow lab members for their ample support, particularly Nicole M. Roussey, Dr. Tom Dixon, and Dr. Samik Bose.
I would also like to thank the Department of Computational Mathematics, Science, and Engineering for their guidance. My appreciation goes to my family and friends for their unconditional love and continuous support. In particular, I offer many thanks to my beloved husband, Dr. Hamid Karimi, for his incredible support along my Ph.D. journey. I express my sincere gratitude to my dear friend, Mona Lavasani, who has always been there for me. I cannot express how grateful I am to my parents and in-laws for their tremendous help. Finally, my last words go to my late mother, Farahnaz Emami, the light of my life, whose presence in my heart is the source of my strength.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Kinetics-orientated drug design
1.2 Predicting molecular properties
1.3 Virtual Screening
CHAPTER 2 PREDICTING THE RESIDENCE TIME OF DRUG MOLECULES (REVO)
2.1 Introduction
2.2 Methods
2.2.1 Generalized framework for Weighted ensemble sampling
2.2.2 REVO resampling algorithm
2.2.3 WExplore sampling algorithm
2.2.4 𝑁-dimensional biased random walk
2.2.5 Trypsin-benzamidine system
2.2.6 Clustering and network visualizations
2.3 Results
2.3.1 𝑁-dimensional random walk
2.3.2 Choosing an optimal distance exponent (𝛼) for ligand unbinding simulations
2.3.3 Trypsin-benzamidine ligand unbinding
2.3.3.1 Residence time
2.3.3.2 Heterogeneity of ligand unbinding pathways
2.4 Discussion
CHAPTER 3 PREDICTING LOGP OF SMALL MOLECULES
3.1 Introduction
3.2 Methods
3.2.1 Datasets
3.2.2 Generating atomic attributes
3.2.3 Log P predictions using GSG
3.2.3.1 Geometric Scattering for graphs
3.2.3.2 Neural networks architecture and training
3.2.4 Log P predictions using GCNs
3.2.4.1 Graph convolutional neural network
3.2.4.2 GCN architecture and training
3.3 Results
3.3.1 Evaluation of molecular representations
3.3.2 Geometric scattering vs graph convolution
3.3.3 Features visualization
3.3.4 Distinguishing features of failed molecules
3.3.5 Uncertainty estimations for the SAMPL7 predictions
3.3.6 Predictions for SAMPL7 log 𝑃 challenge
3.3.7 Performance of Open-source force fields
3.4 Discussion and Conclusions
CHAPTER 4 MODELING COMPOUNDS IN BINDING SITE WITH THE FLEXIBLE TOPOLOGY METHOD
4.1 Introduction
4.2 Methods
4.2.1 Framework of Flexible Topology
4.2.2 Symmetry-invariant Molecular Features
4.2.3 Continuous Atomic Identities
4.2.4 Loss function and Flexible Topology Restraint Energy
4.2.5 Implementation details
4.2.5.1 Bromodomain simulation setup details
4.3 Results
4.3.1 Assembly Simulations in Vacuum
4.4 Assembly in a Bromodomain binding pocket
4.5 Discussion and Conclusions
CHAPTER 5 SUMMARY AND FUTURE DIRECTION
5.1 Summary
5.2 Future Directions
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Characteristic distance values for the 𝑁-dimensional random walk.
Table 2.2: Cluster counts for all simulations.
Table 2.3: Co-clustering information.
Table 3.1: Classifications of log 𝑃 prediction models.
Table 3.2: Independent test sets used for evaluating ClassicalGSG models.
Table 3.3: log 𝑃 datasets used for training.
Table 3.4: Classification of CGenFF atom types into 36 categories (AC36). C_Name is the name of the category, and C_Value shows the value of the category.
Table 3.5: Classification of GAFF2 atom types into 31 categories (AC31). C_Name is the name of the category, and C_Value shows the value of the category.
Table 3.6: Classification of atoms into 5 categories (AC5).
Table 3.7: Neural network settings. Square brackets denote possible parameter values used in the grid search method.
Table 3.8: Neural network settings.
Table 3.9: Parameters for the Geometric Scattering for Graphs algorithm. The square brackets show all of the values examined for each parameter. For “Scattering operators”, ‘z’ represents the zero-order operator, ‘f’ is first order, and ‘s’ is second order.
Table 3.10: The log 𝑃 prediction results using different methods on the FDA test set. The result from our model is shown in red.
The TopP-S is the best performing method (GBDT-ESTD+-2-AD) on the FDA test set from [92]. Table from [92].
Table 3.11: The log 𝑃 prediction performance results for four independent test sets.
Table 3.12: The log 𝑃 prediction results using different methods on the Star test set. The result from our model is shown in red. The TopP-S is the best performing method (MT-ESTD+-1-AD) on the Star test set from [92]. Table from [92].
Table 3.13: The log 𝑃 prediction results using different methods on the NonStar test set. The result from our model is shown in red. The TopP-S is the best performing method (MT-ESTD+-2-AD) on the NonStar test set from [92]. Table from [92].
Table 3.14: The log 𝑃 prediction results using different sets of features and models.
Table 3.15: The averaged KL-divergence between fingerprint distributions of all data versus failed molecules, averaged over 5 GCGNN models.
Table 3.16: The PIs and coverage range for the SAMPL7 molecules using four ClassicalGSG methods. 𝑦ˆ is the predicted log 𝑃.
Table 3.17: The log 𝑃 prediction results for the SAMPL7 molecules.
Table 3.18: Sets of parameters used to evaluate MMFF94 ClassicalGSG models on external test sets. Sets in square brackets denote all the GSG parameter values used for generating the molecular features. For scattering operators, “z” denotes the zeroth-order operator (Equation 3.1), “f” is first order (Equation 3.2), and “s” is second order (Equation 3.3).
Table 3.19: The log 𝑃 prediction results from MMFF94 force field parameters for the external test sets.
Table 3.20: The log 𝑃 prediction results from different log 𝑃 methods for the FDA test set.
The TopP-S is the best performing method (GBDT-ESTD+-2-AD) on the FDA test set from [92]. Table from [92].
Table 3.21: The log 𝑃 prediction results from different log 𝑃 methods for the Star test set. The TopP-S is the best performing method (MT-ESTD+-1-AD) on the Star test set from [92]. Table from [92].
Table 3.22: The log 𝑃 prediction results from different log 𝑃 methods for the NonStar test set. The TopP-S is the best performing method (MT-ESTD+-2-AD) on the NonStar test set from [92]. Table from [92].
Table 4.1: Parameter values for Behler-Parrinello functions, grouped into radial and angular components. The number of values used for each parameter is shown in the 𝑛vals column. The product of 𝑛vals is the total number of features for each component (𝑁feat).
Table 4.2: Minimum and maximum parameter values for ghost particle attributes.

LIST OF FIGURES

Figure 2.1: The WE simulation framework. Each walker is represented as a circle. The size of a circle represents the weight, and different colors represent different conformations. An ensemble of walkers with the same weight and conformation is run for a set number of steps (“Dynamics”). Then resampling is performed. These two steps repeat until the simulation ends.
Figure 2.2: Average predicted probability distributions. The black curve is the target probability. Probability distributions are averaged over all 10 runs for (A) REVO and (B) WExplore.
Figure 2.3: (A) The calculated accuracy values are shown for the three methods, averaged over 10 runs. REVO outperforms the WExplore and CONV methods for all dimensions.
(B) The range of visited 𝑥 values for the three methods, averaged over 10 runs. REVO explores a broader sampling space than the WExplore and CONV methods for all dimensions. Error bars show the standard error of the mean across the set of runs.
Figure 2.4: (A) The final walker positions for a representative 𝑁 = 2 simulation are shown as points, overlayed on a probability distribution heat map calculated using 𝑃𝑡 (𝑥, 𝑦) = 𝑃𝑡 (𝑥)𝑃𝑡 (𝑦). (B) Probability distributions for the distance to the origin, averaged over all simulations, all walkers, and 10 cycles spaced evenly between cycle 1000 and cycle 10000. Distributions are shown separately for 𝑁 = 2, 5, 10 and 20. (C) The expectation values of these curves as a function of 𝑁.
Figure 2.5: Accuracy (𝐴) and range values considering only the first two dimensions. Values are averaged over 10 simulations in each case and the error bars indicate the standard error of the mean.
Figure 2.6: (Top) Average accuracy (left) and range (right) values for different values of the distance exponent 𝛼: 𝛼 = 1 (black), 𝛼 = 2 (green) and 𝛼 = 4 (orange). (Bottom) Average accuracy and range values are shown both with (orange) and without (green) the weight novelty term (𝜙).
Figure 2.7: Representative ensembles from WExplore simulations are shown at “early” and “late” timepoints, as described in the text. (Top) Ligand density distributions from the trypsin-benzamidine WExplore simulations reported here. (Bottom) Ligand density distributions for the TPPU ligand unbinding from the enzyme soluble epoxide hydrolase (sEH). Ensembles were taken from previous work [88]. Below each snapshot, a bar graph of trajectory weights is shown, sorted from highest to lowest.
Figure 2.8: The trajectory variation determined for four 𝛼 values for (A) trypsin-benzamidine and (B) sEH-TPPU.
Figure 2.9: The average unbound probability for all runs for (A) REVO and (B) WExplore. The thick blue region represents the standard error of the mean at each time point. The black curve shows the average probability over all runs.
Figure 2.10: Average predicted residence times are shown in black for (A) REVO and (B) WExplore. The red line shows the experimental residence time for the trypsin-benzamidine system [172].
Figure 2.11: Average trajectory flux values are shown using all possible sub-samples over the set of five runs. Individual averages are shown as points, and the probability of the sub-samples is shown using a violin plot. The trajectory flux corresponding to the experimental residence time is shown as a horizontal red line.
Figure 2.12: Network representations of the free energy landscapes of binding are shown for REVO (A) and WExplore (B). In both cases, discrete transition path ensembles were visually identified and labeled. Nodes are colored according to their ligand solvent accessible surface area using the color bar at the bottom of the figure, and node size corresponds to the statistical weight of the states. Representative conformations are shown to depict each ligand unbinding pathway for REVO (C) and WExplore (D). Loop regions 209-218 and 179-190 are shown in blue (left) and orange (right).
Figure 2.13: Ligand densities for unique clusters visited by REVO (red) and WExplore (yellow). Density maps are plotted using the VMD Volmap tool with an isosurface value of 0.02.
Figure 3.1: Architecture of the GSG method.
The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes that are shown as colored bars. Wavelet matrices Ψ are built at different logarithmic scales, 𝑗, using the adjacency matrix as described in the text. Finally, the scattering transform is applied to get the graph features using both the wavelet matrices and the signal vectors. Modified from a figure made by Feng et al. [105].
Figure 3.2: Architecture of the GCN method. The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes, shown as colored bars. GCN layers are shown in gray and are followed by a max-pooling layer, shown in purple. The graph gathering layer, shown in green, adds the features of all nodes to generate the molecular feature vector.
Figure 3.3: Average 𝑟² (A) and RMSE (B) for the OpenChem test set using GSGNN models. Each average is calculated over 20 individual parameter values and the error bars show the best and worst performing models. The atomic attributes are generated with either the CGenFF or GAFF2 force field and using one of the atom type classification schemes (“AC1”, “AC5”, “AC36/AC31” or “ACall”).
Figure 3.4: The 𝑟² for the OpenChem test set using GSGNN models. The atomic attributes are all generated with the CGenFF force field, the AC36 atom type classification scheme, and 2D molecular structure.
Figure 3.5: The 𝑟² for different test sets using GSGNN models. A) shows 𝑟² for the FDA test set. B) shows 𝑟² for the Huuskonen test set. C) and D) show 𝑟² for the Star and NonStar test sets, respectively. The horizontal axis indicates the maximum wavelet scale 𝐽.
The atomic attributes are generated with 2D molecular structure, the CGenFF force field, and the AC36 atom type classification scheme.
Figure 3.6: The t-SNE plots with GSG and NN features of the OpenChem test set molecules. Each point represents a molecule and is colored by its actual log 𝑃 value. ⟨Δ log 𝑃⟩𝑁 shows the mean log 𝑃 difference calculated over the nearest neighbors in the t-SNE plot. A) The GSG features of size 1716 projected into 2-dimensional space. B) The NN features from the last hidden layer, of size 400, projected into 2-dimensional space.
Figure 3.7: Probability distributions of molecular fingerprints. The histograms show the distribution of fingerprints of all data and of failed molecules from 5 GCGNN models. The distribution of all data is shown as a thick black line. A) The number of shortest paths of length 2, B) the atomic weight, C) the number of carbon atoms (ncarb) and D) the number of heavy atoms.
Figure 3.8: Cumulative distribution of MAE of molecules in the S7_TEST set. The solid blue line shows the cumulative distribution for each set of predictions. The dashed red line represents the probability of 90%. Panels A through D show MAEs using models trained on DB1 through DB4, respectively.
Figure 3.9: Prediction intervals of log 𝑃 predictions for the SAMPL7 molecules. The experimental log 𝑃 values are shown as red circles in a scatter plot. The predictions are shown as a red line, and the wide orange range shows the prediction intervals (PIs). Panels A through D show predictions from models trained on DB1 through DB4, respectively. In all cases, data is sorted according to the predicted log 𝑃 values.
Figure 3.10: The best fit lines for prediction sets. The experimental versus predicted values are shown as red circles in a scatter plot.
The actual fit line is shown in orange. The dashed blue curve shows the best fit line. A) predictions using DB1, B) predictions using DB2, C) predictions using DB3, and D) predictions using the DB4 training set.
Figure 3.11: The log 𝑃 predictions from our submissions to the SAMPL7 challenge. The orange line shows the experimental values. The ClassicalGSG predictions are shown as circles (DB1: blue, DB2: orange, DB3: green, DB4: red). The thick orange area shows the MAE interval of 0.44, which is the lowest MAE of our submitted predictions (ClassicalGSG-DB2). Molecules are labeled with their molecule ID from SAMPL7 [212].
Figure 3.12: Results of ClassicalGSG models trained using open-source force field parameters. Error bars are computed over five independently-trained models. These models are trained using the 2D structure information and all the scattering moments, with a maximum wavelet scale (𝐽) of 4. For each set of ClassicalGSG models trained using these force field parameters we show A) the average RMSE, and B) the average 𝑟².
Figure 4.1: The framework of the Flexible Topology method. The gray and blue spheres show the ghost particles. MD simulations start with an initial structure of the protein of interest and randomly initialized ghost particles. The MLForce plugin computes external forces on the ghost particle positions and attributes using a loss function, 𝐿. 𝑋𝐺𝑃 denotes the ghost particle positions, 𝐴𝐺𝑃 represents the atomic attributes of the ghost particles, and 𝑀 is the mapping function. 𝐹𝐺𝑃 and 𝐹𝑇 show the ghost particle and target ligand features, respectively. The final output shows the ghost particles assembled into a drug-like molecule in the binding pocket of the protein.
Figure 4.2: The assembly simulation target molecules, randomly chosen from the OpenChem dataset [55], are shown as 2D structures. The number of atoms for each molecule is written under it. The 2D structures are generated and drawn with MarvinSketch using implicit hydrogens.
Figure 4.3: Average values of RMSD and 𝑈𝐹𝑇 for assembly simulations run with different MLForce scale values. Error bars show standard errors calculated over 3 separate assembly simulations for the last 10% of data. A) RMSD between the positions of the ghost particles and the target molecule. B) The potential energy of ghost-ghost interactions, normalized by 𝐶 (Eq. 4.8).
Figure 4.4: The average RMSE of atomic attributes between the ghost and target ligands for the last 10% of simulation data. Errors are determined over three different runs for 10 molecules (30 assembly simulations in total for each MLForce scale).
Figure 4.5: The average RMSD of positions and FT energy for an MLForce scale of 2000 over the last 10% of data. The standard errors are determined over three different runs. A) The average RMSD between the ghost particles and the target ligand. B) The FT potential energy, divided by the MLForce scale.
Figure 4.6: The minimum RMSD of positions for an MLForce scale of 2000 for the assembly simulations. The standard errors are determined over three different runs.
Figure 4.7: The bound pose of 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one from the 4a9e PDB entry. Residues used to define the binding pocket centroid (Pro98, Val103, Tyr155 and Ile162) are shown in white licorice. Hydrogens are not shown on the ligand or binding site residues. A) Front view. B) Top view. C) 2D structure of the ligand.
D) A size-charge representation where each sphere is sized according to the atomic radius 𝜎 and colored according to its charge, 𝑞.
Figure 4.8: Ten assembly simulations in a bromodomain binding pocket. The central panel shows traces of the FT energy (𝑈𝐹𝑇) for the ten trajectories (thin lines) as well as an average over the trajectory set (thick blue line). Two example trajectories are highlighted. The top panels, outlined in red, show snapshots from an example “failed” trajectory, which did not assemble to the correct structure. The bottom panels, outlined in green, show snapshots from a correctly assembled trajectory.
Figure A.1: Minimum weight of walker 𝑖 to ensure that ΔΔ𝑖𝑚 > 0.

LIST OF ALGORITHMS

Algorithm 2.1: REVO Resampler Algorithm

KEY TO ABBREVIATIONS

QSAR Quantitative Structure-Activity Relationship
ADMET Absorption, Distribution, Metabolism, Elimination, and Toxicity
ML Machine Learning
SVM Support Vector Machines
NNs Neural Networks
DL Deep Learning
RT Residence Time
TR-FRET Time-Resolved Fluorescence Resonance Energy Transfer
IC50 Half-Maximal Inhibitory Concentration
MD Molecular Dynamics
WE Weighted Ensemble
MSM Markov State Model
SMILES Simplified Molecular-Input Line-Entry System
CM Coulomb Matrix
DFT Density Functional Theory
3D 3-Dimensional
2D 2-Dimensional
GSG Geometric Scattering for Graphs
HTS High Throughput Screening
VS Virtual Screening
LBVS Ligand-Based Virtual Screening
SBVS Structure-Based Virtual Screening
GPU Graphics Processing Unit
CPU Central Processing Unit
CV Collective Variable
REVO Resampling Of Ensembles By Variation Optimization
VP Voronoi Polyhedra
RMSD Interface Root Mean Square Deviation
CSN Conformation Space Network
SASA Solvent Accessible Surface Area
NMR Nuclear Magnetic Resonance
LiPE Lipophilic Efficiency
GCN Graph Convolutional Networks
MM Molecular Mechanics
QM Quantum Mechanics
RMSD Root Mean Square Deviation
RMSE Root Mean Squared Error
SAMPL Statistical Assessment of the Modeling of Proteins and Ligands
MAE Mean Absolute Error
MUE Mean Unsigned Errors
CGenFF CHARMM Generalized Force Field
GAFF General AMBER Force Field
UFF Universal Force Field
MMFF94 Merck Molecular Force Field 94
AC Atomic Category
ReLU Rectified Linear Unit
FF atomic attributes
CLint Intrinsic Clearance Rates
GP Ghost Particles
API Application Programming Interface
SPR Surface Plasmon Resonance
LJ Lennard-Jones
sEH Soluble Epoxide Hydrolase
TPPU Inhibitor of sEH in crystal structure 4OD0

CHAPTER 1 INTRODUCTION

The discovery and development of a new drug is a lengthy and expensive process that often takes up to 14 years [1, 2, 3], with an average cost of 1.3 billion USD [4]. The typical drug design pipeline consists of early research and discovery, development, approval, and marketing. In the first stage, the disease target and the potential drug compounds that can bind and interact with the target are identified (hit compounds) and optimized (lead compounds). In the second stage, the drug candidates undergo preclinical and clinical studies and trials to ensure their effectiveness and safety for patients. Finally, the successful compounds from clinical trials are registered for approval and marketing. In recent years, to help expedite and economize the design of effective medicines, computational methods and models have found significant applications in the traditional drug discovery workflow. This field is referred to as in silico drug design. The use of in silico approaches such as Quantitative Structure-Activity Relationships (QSAR) for predicting the relation between chemical properties and molecular structure goes back to the 1950s [5].
In drug design, QSAR methods are employed to predict a small molecule’s drug-like properties (e.g., those indicated in Lipinski’s Rule of Five [6]) such as molecular mass, log 𝑃, and numbers of hydrogen bond donors and acceptors. In addition to QSAR, extensive in silico tools have been developed for target discovery [7, 8, 9], molecular docking [10, 11, 12, 13, 14], virtual screening [15, 16], binding site prediction [17, 18, 19], de novo drug design [20, 21, 22], ligand similarity search [23, 24, 25], and prediction of pharmacokinetic properties like absorption, distribution, metabolism, elimination, and toxicity (ADMET) [26, 27, 28]. These methods can be categorized into two types: ligand-based and structure-based methods. Ligand-based methods rely on knowledge of small molecules that interact with the target of interest, while structure-based methods rely on the three-dimensional (3D) structural information of the target molecules [29, 30, 31, 32, 33, 34]. While many approved drugs were discovered with the help of computational methods, such as Raltitrexed [35], Amprenavir [36, 37], Isoniazid [38], Flurbiprofen [39, 40] and Dorzolamide [41], there are also reports of failures of computational methods. For instance, the antidepressant/anxiolytic RPX00023 was reported to be a selective serotonin 5-HT1A receptor agonist but turned out to be an inhibitor [42]. In fact, 90% of drug candidates fail during clinical trials, due to insufficient clinical efficacy, unmanageable toxicity or side effects, poor pharmacokinetic or drug-like properties, or lack of commercial interest [43, 44]. The failure rate for new drugs targeting Alzheimer’s disease is even higher (99.6%) [45].
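As a concrete illustration of the rule-of-five criteria mentioned above, the following sketch counts Lipinski violations from precomputed descriptors. The thresholds (molecular mass ≤ 500 Da, log 𝑃 ≤ 5, ≤ 5 hydrogen bond donors, ≤ 10 hydrogen bond acceptors) are Lipinski's published cutoffs; the function name and the example descriptor values are illustrative only, and in practice the descriptors would come from a cheminformatics toolkit such as RDKit.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count violations of Lipinski's Rule of Five.

    Inputs are precomputed molecular descriptors (mass in Da,
    octanol-water log P, and hydrogen bond donor/acceptor counts).
    """
    rules = [
        mol_weight <= 500.0,   # molecular mass no more than 500 Da
        logp <= 5.0,           # octanol-water partition coefficient
        h_donors <= 5,         # hydrogen bond donors
        h_acceptors <= 10,     # hydrogen bond acceptors
    ]
    return sum(not ok for ok in rules)

# Aspirin-like descriptor values: no violations expected
print(lipinski_violations(180.2, 1.2, 1, 4))   # -> 0
# A large, highly lipophilic compound violates two rules
print(lipinski_violations(650.0, 6.3, 2, 8))   # -> 2
```

Compounds with more than one violation are conventionally flagged as likely to have poor oral bioavailability, which is why log 𝑃 prediction (Chapter 3) matters for drug-likeness filtering.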
With the availability and growth of molecular databases that include both chemical and pharmacological properties, Machine Learning (ML) methods such as Support Vector Machines (SVM), Decision Trees, and Deep Neural Networks (DNN) have been widely employed in in silico drug design for various tasks such as binding site prediction [46, 47], ADMET prediction [48], de novo drug design [49, 50], similarity search [51], drug-target interaction prediction [52] and molecular property prediction [53, 54, 55, 56]. Computational methods are needed both to accurately predict whether a compound will bind potently to its target, and to maintain ideal drug-like properties along a number of different axes during development: solubility, toxicity, ease of manufacturing, stability and molecular weight. In this dissertation, we have developed advanced computational methods for predicting molecular properties, identifying hit compounds using a novel ML-based screening method, and modeling receptor-drug binding kinetics.

1.1 Kinetics-oriented drug design

In the traditional drug discovery paradigm, the binding affinity of potential drug candidates is measured to predict their efficacy. The binding affinity is quantified by the equilibrium dissociation constant ($K_d$) or the half-maximal inhibitory concentration ($IC_{50}$) [57]. Several effective drugs have been discovered using estimations of binding affinity [58, 59, 60]. However, these measurements are carried out in vitro, which is very different from the biological environment of the human body, sometimes resulting in poor predictions of drug efficacy [61]. One fundamental difference is that living organisms exist far from equilibrium. Further, equilibrium quantities like $K_d$ are not always appropriate, especially when the timescales of ligand binding and unbinding are comparable to the timescales of drug metabolism, elimination and excretion.
Hence, several reports have shown that the efficacy of a drug can be more correlated with the drug-receptor binding kinetics than with the binding affinity [62, 63, 64]. The drug-receptor association rate ($k_{on}$), dissociation rate ($k_{off}$), and mean ligand residence time (RT) are the main parameters describing binding kinetics. The mean residence time is the amount of time that the drug stays in contact with its target, and it has been shown that in some cases a drug with a longer residence time can lead to better biological efficacy and therapeutic effects [62, 63, 64]. This has motivated researchers to treat the optimization of binding kinetics as a goal in designing new drugs, a process called kinetics-oriented drug design. Recently, many experimental and computational methods have been developed to predict drug-receptor binding and unbinding rates and pathways. Experimental methods to measure the binding rates include label-free Surface Plasmon Resonance (SPR) [65, 66], radioligand binding [67, 68, 69] and time-resolved fluorescence resonance energy transfer (TR-FRET) [70]. The main challenge of experimental methods is the inability to observe and characterize the unbinding transition states of the ligand, because the lifetime of a transition state is very short, often on the order of femtoseconds. This limitation of experimental approaches raises a need for computational methods to reveal structural information about transition states. Therefore, a broad range of enhanced sampling approaches such as Metadynamics [71, 72], Scaled Molecular Dynamics (MD) [73, 74, 75], Well-tempered Metadynamics [76, 77], Steered MD [78, 79], Brownian dynamics [80, 81], Markov State Models (MSM) [82, 83, 72], Weighted Ensemble (WE) [84, 85], and WExplore [86, 87, 88, 89] have been proposed. In this dissertation, we propose a novel enhanced sampling method for investigating the kinetics of ligand unbinding and understanding unbinding pathways (see Chapter 2).
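The kinetic and thermodynamic quantities above are linked by simple relations: the mean residence time is the reciprocal of the dissociation rate (RT = 1/$k_{off}$), and at equilibrium $K_d$ = $k_{off}$/$k_{on}$. A toy calculation with assumed rate constants (not values from this work) makes the connection concrete:

```python
# Toy illustration of binding-kinetics relations; the rate constants are
# assumed for illustration only.  RT = 1/k_off and K_d = k_off/k_on.

k_on = 1.0e6    # association rate constant, 1/(M*s)
k_off = 1.0e-3  # dissociation rate constant, 1/s

residence_time = 1.0 / k_off  # mean residence time, in seconds
K_d = k_off / k_on            # equilibrium dissociation constant, in M

print(f"RT  = {residence_time:.0f} s")  # 1000 s
print(f"K_d = {K_d:.1e} M")
```

Note that two ligands with the same $K_d$ can have very different residence times, which is precisely why kinetics-oriented design treats RT as an independent optimization target.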
1.2 Predicting molecular properties

In the early stages of drug design, computational predictions of molecular properties can help filter drug-like candidates and reduce drug discovery time and cost. Recently, a diverse set of machine learning techniques have been employed to predict molecular properties like aqueous solubility [53, 54], solvation free energy [90], log P [91, 92, 55], blood-brain barrier penetration [56] and protein–ligand binding affinity [93]. However, the accuracy of ML methods heavily depends on the manner in which molecules are represented. Representation models describe molecular characteristics in terms of descriptors such as atomic environments [94, 95], internal coordinates (i.e., bond lengths, bond angles, and dihedral angles) [96], and molecular fingerprints [49]. To respect the underlying symmetry, the features that are extracted from the representation models should be invariant to rotation, translation, reflection and the re-indexing of atoms. One of the widely used models for molecular representation is the Simplified Molecular-Input Line-Entry System (SMILES) string. These are strings of characters that uniquely identify a particular molecule (e.g., benzene: C1=CC=CC=C1). Although SMILES does not include physical or structural features of molecules, it has shown surprising success in the prediction of the properties of small molecules and has even been used to generate novel chemical compounds [55, 49, 97, 98]. Another model, the Coulomb Matrix (CM) [96], utilizes atomic energies and the inter-nuclear Coulomb repulsion operator to predict the molecular energy. The eigenvalues of the CM are used as a similarity measure, and they are invariant to these symmetry operations. Another approach to representing atomic systems is symmetry functions, initially proposed by Behler and Parrinello [94].
The local environment of an atom is described by two sets of radial and angular symmetry functions, which are invariant to rotation and translation of the system and define the local environment energetics. The summation of these local energies is used to calculate the total energy of the system. An improved version of Behler's angular symmetry function is used in ANAKIN-ME [95] to construct a molecular force field with accuracy comparable to density functional theory (DFT). Another molecular model, the Grid Featurizer, has been used to predict protein-ligand binding affinity [99, 100, 101, 102]. Here, the 3D (three-dimensional) structure of a protein-ligand complex is discretized into a grid and each atom is expressed by a density function to form the grid matrix. Another approach to representation modeling is the graph-based model, which directly encodes the molecular structure. In graph-based models, atoms and bonds represent the nodes and edges of the graph, respectively. Functions are then defined on the molecular graph to create feature vectors or matrices of the same size. In recent years, various graph-based methods have been developed for predicting different molecular properties, such as graph convolutional networks [49, 93], directed acyclic graph models [54], deep tensor neural networks [103], message passing neural networks [104] and geometric scattering for graphs [105]. Unfortunately, these methods have all been applied to different applications, making it hard to carry out a fair comparison between them. Here, we use the geometric scattering for graphs (GSG) [105] model to predict the log P of small molecules (Chapter 3). GSG quickly creates features that are invariant to symmetry operations. Also, its feature vectors have the same length, allowing us to use any distance metric for similarity measurement purposes.
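Of the representations surveyed above, the Coulomb matrix is compact enough to sketch in a few lines. The following is a minimal NumPy illustration, not code from this work; the three-atom geometry is hypothetical, and the sorted eigenvalues serve as the invariant feature vector (permuting the atom order applies a similarity transform to the matrix, so the eigenvalues are unchanged, and only interatomic distances enter, so rotations and translations have no effect either).

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 * Z_i**2.4, M_ij = Z_i*Z_j / |R_i - R_j|."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def cm_features(Z, R):
    """Sorted eigenvalues of the Coulomb matrix: invariant to rotation,
    translation, and re-indexing of atoms."""
    return np.sort(np.linalg.eigvalsh(coulomb_matrix(Z, R)))[::-1]

# Hypothetical three-atom geometry (water-like arrangement, in Angstrom)
Z = np.array([8.0, 1.0, 1.0])
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])

# Re-indexing the atoms leaves the eigenvalue features unchanged
perm = [2, 0, 1]
f1 = cm_features(Z, R)
f2 = cm_features(Z[perm], R[perm])
print(np.allclose(f1, f2))  # True
```

Like the GSG features used in Chapter 3, these eigenvalue vectors have a fixed length for a fixed number of atoms, so standard distance metrics can be applied to them directly.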
1.3 Virtual Screening

After target identification, the drug discovery process continues with compound screening in the laboratory to identify hit compounds. Compound screening is performed using automated High Throughput Screening (HTS) [106] assays. In this step, for a given protein of interest, small drug-like compounds are examined by docking to the receptor, and a drug compound is selected that is predicted to bind tightly and specifically to its target. However, HTS assays are physical experiments that are expensive, slow, and limited to screening millions of compounds [5, 107, 108]. Therefore, computational screening methods, referred to as Virtual Screening (VS) [107, 109, 110, 111], have been proposed to expand the search of chemical space for small molecules. VS methods that utilize prior knowledge of known active ligands are called ligand-based VS (LBVS) [112, 113, 114, 115]. VS approaches that rely on the 3D structure of target molecules are called structure-based VS (SBVS) [116, 117, 118]. In SBVS, a handful of drug databases such as ZINC [119], ChEMBL [120], PubChem [121, 122] and NCI [123], with hundreds of millions of compounds, are examined, which can be extremely computationally demanding. Furthermore, each compound (ligand) has the potential to cause significant conformational changes in the protein binding pocket during the binding process. To overcome this computational hurdle, the traditional VS process overlooks the flexibility and conformational changes of the receptor during docking by taking into account just a single conformation of the receptor [124, 125, 126, 127, 128]. Recently, a wide range of approaches have been developed that consider the flexibility of the receptor in screening [127, 129, 130, 131, 132, 133, 134].
A class of these methods, referred to as ensemble docking [128, 133, 135, 134], utilizes an ensemble of target conformations generally determined either by MD simulations or crystallography. Ensemble docking enables docking of each ligand into multiple structures, and docking scores are determined by aggregation or consensus strategies. This method has been successful in some applications [136, 137, 138]; however, the selection of an optimal ensemble size and appropriate structures varies with the target protein [139], and there are studies showing that increasing the ensemble size can lead to a poorer model [140, 141, 142]. To solve this problem, we develop an efficient ML-based compound screening method that can identify potent "hit" compounds in large datasets, while taking into account ligand-induced conformational changes, including those of explicit water molecules in the binding site. This method is called "Flexible Topology" since it allows ligand atoms to change their chemical type in response to forces from the surrounding binding site atoms. This way, the search process through the dataset is coupled to the dynamics of a molecular simulation, and the final ligand poses can be used as predictions of potent binders. Chapter 4 presents the first results obtained with the Flexible Topology method.

CHAPTER 2

PREDICTING THE RESIDENCE TIME OF DRUG MOLECULES (REVO)

This chapter was adapted from the paper titled "REVO: Resampling of ensembles by variation optimization" as it was published [143]. The figures and tables from the supporting information (SI) are incorporated into the text.

2.1 Introduction

Unraveling the function of macromolecules and exploring their structures is a central research topic in biochemistry, and one that can be pursued through molecular dynamics (MD) simulation. MD can be used to explore and sample the conformation space of a system. However, its sampling power is often limited by large energetic barriers that separate stable molecular states.
This is a problem in a variety of applications, such as protein binding, unbinding and folding processes, despite advances in high-performance computing hardware and graphics processing units (GPUs). Enhanced sampling techniques have thus been useful to increase the efficiency of MD and observe rare events in biomolecular systems. Enhanced sampling methods have a long history. Over the last few decades, a wide variety of methods have been described that involve either the introduction of external forces [144], manipulation of the energy landscape [145, 146, 147], or coupling to systems at higher temperatures [148, 149]. Although in most cases these methods can be used to obtain accurate thermodynamic quantities such as free energy differences, they use perturbed dynamics, which complicates the collection of kinetic information: both transition rates between macrostates and microscopic state-to-state transitions. Other enhanced sampling methods can be used to simulate rare events without the use of biasing forces. For example, Markov State Models (MSM) [150] are based on unbiased sampling of trajectories, in which system dynamics are described by transitions between a set of states at discrete time intervals (i.e., a lag time $\tau$). However, in the MSM, the Markovian assumption (that transitions are independent of history) is only guaranteed to be fulfilled in the limit of long $\tau$, in practice tens of nanoseconds [151]. MSMs can also be sensitive to clustering parameters and feature selection [89] and typically require a very long simulation time. The weighted ensemble (WE) [84] method also uses unbiased trajectories, but offers a way to calculate observables directly, without the use of the Markovian assumption. The WE algorithm periodically applies cloning and merging operations to a set of trajectories in order to balance computational effort between different regions of space.
When possible, these regions can be defined as "bins" along a collective variable (CV) that describes a transition of interest. However, for some systems the processes of interest cannot be described by a single CV, for instance when multiple transition paths are possible between multiple stable states. The use of traditional binning procedures in the WE framework is limited for these high-dimensional systems, as the number of bins grows exponentially with the number of CVs used. This is a problem for traditional WE, as the number of trajectories (also called "walkers") per region is typically fixed, leading to an exponential increase in total simulation time. Even if one employed a large number of regions and then allowed most of these to be unoccupied, it would still be difficult to prioritize which underrepresented regions should be chosen for cloning. The WExplore algorithm was introduced to address this problem [152]. WExplore is a WE approach that dynamically defines a large number of sampling regions using a distance metric within a high-dimensional CV space. These regions are defined within a hierarchy, allowing sampling to be balanced between branches of the hierarchy at multiple levels. This allows a small number of walkers to be efficiently distributed across a (possibly) high-dimensional space. WExplore has been applied to sample a variety of rare events, including ligand (un)binding pathways [86, 87, 88, 89], protein folding pathways [152] and RNA conformational changes [153]. Despite this success, WExplore is limited by three main issues related to the definition of these hierarchical regions. First, the nature of the hierarchical regions leads to inconsistent cloning activity: when a threshold is crossed and a region is defined on a new level of the hierarchy for the first time, many cloning events of a single trajectory occur in quick succession.
We call this process "thresholding", and its stochastic nature has the potential to produce large differences between different WExplore runs. Secondly, regions in WExplore are not moved once they are created. The centers of these regions can thus differ from the positions of local energy minima. We call this problem "sub-optimal region definition". Further, although WExplore can divide a space into a large number of regions (e.g. 10000), typically a maximum branching factor is defined at each level of the hierarchy to limit the total number of regions that can be defined. This can lead to an uneven distribution of sampling regions throughout the space. To address the aforementioned problems, we propose a new region-free enhanced sampling algorithm called Resampling of Ensembles by Variation Optimization, or "REVO". REVO uses cloning and merging to create ensembles of diverse trajectories without defining any regions, and instead optimizes a measure of "variation" that depends on the pairwise distances between the walkers. In this paper, we first describe the REVO algorithm and its differences from WExplore and other WE methods. We then apply the REVO method to a tunable N-dimensional random walk system to study its performance as a function of dimensionality. We also apply REVO to sample unbinding pathways in the well-studied trypsin-benzamidine system, and compare the results to WExplore. Finally, we conclude with a discussion of the REVO algorithm, including new possibilities for enhanced sampling.

2.2 Methods

2.2.1 Generalized framework for weighted ensemble sampling

Since the original publication of the weighted ensemble (WE) algorithm, a number of augmentations and improvements to the method have been introduced. Here we describe a generalized framework that is common to the different algorithms in the WE family.
This framework includes two alternating steps: 1) MD simulations that move walkers forward in time, and 2) resampling operations that merge and clone walkers. A resampling function is designed such that desirable walkers are cloned and less-desirable walkers are merged together. Historically, this "desirability" has been defined using counts of walkers in a set of regions (or "bins") constructed along one or more collective variables that describe the system dynamics; however, as shown here, it can be thought of more generally.

Figure 2.1: The WE simulation framework. Each walker is represented as a circle. The size of a circle represents the weight, and different colors represent different conformations. An ensemble of walkers with the same weight and conformation is run for a set number of steps ("Dynamics"). Then resampling is performed. These two steps continue until the simulation ends.

When a walker is cloned, it creates two independent walkers that get the conformation of the cloned walker and half of its weight. Merging two walkers $A$ and $B$ creates a walker $C$ with weight $w_C = w_A + w_B$, where $C$ inherits the conformation of $A$ or $B$ with a probability proportional to the two weights. On the whole, a resampling process aims to increase the diversity of the trajectory ensemble and increase the probability of observing the events (or conformations) of interest. A resampling function (Figure 2.1) accepts a set of walkers and returns the new set of walkers that result from the cloning and merging operations. The new conformations are thus a subset of the input conformations, and the sum of the weights must be unchanged by the resampler. In general, a resampler can return a different number of walkers, but in this work we keep the number of walkers constant.
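The cloning and merging operations just described can be sketched in a few lines. This is a minimal illustration, not the actual Wepy implementation; a walker is represented here as a (weight, conformation) pair with string labels standing in for conformations.

```python
import random

# Minimal sketch of WE cloning and merging (illustrative, not Wepy code).
# A walker is a (weight, conformation) pair.

def clone(walker):
    """Cloning: two independent walkers, each with half the weight."""
    w, conf = walker
    return (w / 2, conf), (w / 2, conf)

def merge(walker_a, walker_b, rng=random):
    """Merging: one walker carrying the summed weight; the surviving
    conformation is drawn with probability proportional to the weights."""
    (w_a, conf_a), (w_b, conf_b) = walker_a, walker_b
    w_c = w_a + w_b
    conf_c = conf_a if rng.random() < w_a / w_c else conf_b
    return (w_c, conf_c)

a, b = clone((0.5, "pose1"))               # two walkers of weight 0.25 each
c = merge((0.25, "pose1"), (0.75, "pose2"))  # one walker of weight 1.0
print(a, b, c[0])
```

Note that both operations conserve the total statistical weight, which is the property that keeps the resampled ensemble statistically valid.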
WE simulations can use arbitrary resampling methods and remain a statistically valid process from which unbiased estimates of observables can be calculated [154]. Seen this way, conventional WE and WExplore can simply be viewed as different resamplers.

2.2.2 REVO resampling algorithm

Here, we present a new method for resampling trajectory ensembles in the WE simulation framework. The REVO resampling method works by explicitly maximizing a measure of "trajectory variation", which is defined using the weights of walkers and an all-to-all pairwise distance matrix obtained from the distances between walkers. This distance metric is system-specific and should describe the events of interest. Notably, resampling in REVO does not involve the construction of regions in order parameter space, which avoids the region-related limitations of WExplore mentioned in the Introduction. We calculate the variation using the following equation:

$$V = \sum_i V_i = \sum_i \sum_j \phi_i \phi_j \left( \frac{d_{ij}}{d_0} \right)^\alpha \qquad (2.1)$$

where $d_{ij}$ is the distance between walker $i$ and walker $j$ determined using a distance metric of choice. The exponent $\alpha$ is used to modulate the influence of the distances in the variation calculation. A procedure for selecting an appropriate value of $\alpha$ is given in Section 2.3.2. $\phi_i$ is a non-negative function which is referred to as a "novelty". It can be a function of walker conformation and/or walker weight, and it is a measurement of the relative importance of each walker, which can be defined in a system-specific fashion. Here, we define $\phi$ as a function of walker weight ($w_i$):

$$\phi_i = \log(w_i) - \log\left(\frac{p_{min}}{100}\right) \qquad (2.2)$$

This function prioritizes walkers with higher weight values, and ranges from $\phi_i \approx 32$ for $w_i \approx 1$ down to $\phi_i = 4.6$ for $w_i = p_{min}$. Parameters $p_{min}$ and $p_{max}$ are the minimum and maximum statistical weights, respectively, that a walker can hold.
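Equations 2.1 and 2.2 translate directly into code. The sketch below is an illustration (not the Wepy implementation), using natural logarithms, as implied by the quoted $\phi$ values, and hypothetical three-walker inputs:

```python
import numpy as np

def novelty(weights, p_min):
    """Eq. 2.2: phi_i = log(w_i) - log(p_min / 100)."""
    return np.log(weights) - np.log(p_min / 100)

def variation(weights, dist_matrix, d0, alpha, p_min):
    """Eq. 2.1: V = sum_i V_i, with V_i = sum_j phi_i*phi_j*(d_ij/d0)**alpha."""
    phi = novelty(weights, p_min)
    terms = np.outer(phi, phi) * (dist_matrix / d0) ** alpha
    V_i = terms.sum(axis=1)  # per-walker variation values
    return V_i.sum(), V_i

# Hypothetical three-walker example: weights and a symmetric distance matrix
weights = np.array([0.5, 0.3, 0.2])
dists = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])
V, V_i = variation(weights, dists, d0=1.0, alpha=2.0, p_min=1e-12)
print(V, V_i)
```

A quick sanity check against the text: with $p_{min} = 10^{-12}$, a walker of weight $p_{min}$ has $\phi = \ln(100) \approx 4.6$, matching the quoted lower bound.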
Following previous work with WExplore, in REVO we do not clone walkers of weight $< p_{min}$, to avoid spending simulation time on walkers that will not significantly contribute to statistical observables. We enforce a maximum weight ($p_{max}$) in order to avoid the accumulation of probability in a single walker ($w \approx 1$), which can lower our chances of seeing new rare events within a given simulation. For instance, here we set $p_{max}$ to 0.1, in order to always have at least 10 walkers with reasonably high probabilities. We also employ a check where the two walkers that are merged must be within a certain distance of each other, which we call the "merge distance threshold". This ensures that minimal information is lost when two trajectories are merged. The parameter $d_0$, called the "characteristic distance", does not affect cloning and merging behavior, but is defined to make the variation function unit-less and to facilitate comparison across different distance metrics. A procedure for calculating the characteristic distance for a given system is explained below. The goal of the resampling process in REVO is to optimize $V$ in Equation 2.1. To do this, walkers with high $V_i$ values are selected for cloning, and walkers with low $V_i$ values are selected for merging. This is further explained in Appendix 9. The pseudocode of the REVO resampler is shown in Algorithm 2.1.
Algorithm 2.1: REVO Resampler Algorithm

    Input: Ensemble of walkers, REVO parameters
    Output: Ensemble of resampled walkers
    Dist_Matrix = AlltoAll_Dist(walkers)
    V_old, {V_i} = CalcVariation(weights, Dist_Matrix)
    while TRUE do
        c  = Select the walker with highest V_i where w_i > p_min
        m1 = Select the walker with lowest V_i where w_i < p_max
        m2 = Select the walker closest to m1 where w_i + w_m1 < p_max
             and d_{i,m1} < d_merge_distance
        if m1 and m2 are defined then
            /* Changes the conformations and weights of walkers */
            Do cloning
            Do merging
            V_new, {V_i} = CalcVariation(weights, Dist_Matrix)
            if V_new > V_old then
                V_old = V_new
            else
                Undo cloning and merging step
                break
        else
            break

2.2.3 WExplore sampling algorithm

For completeness, we describe our implementation of the WExplore sampling algorithm based on previous work [152, 87]. Similar to WE, each walker in WExplore carries a statistical weight that changes during the resampling procedure. The WExplore algorithm dynamically splits the sampling space into a set of hierarchical Voronoi polyhedra (VP), which are used as the "regions" to guide resampling. Each VP is defined using a central point called an "image", which is a specific conformation of the system. A walker is assigned to a VP region by calculating its distance to each VP image, and assigning it to the region with the smallest such distance. A WExplore simulation employs a distance metric which is defined to emphasize the process of interest. For example, in protein-ligand unbinding simulations the distance metric between walkers is defined as the root mean squared distance (RMSD) between the ligands after aligning the binding sites of both walkers. All walkers start with the same structure and initial weight. The sampling space initially includes just a single region, defined at each level of the hierarchy by the image of the initial structure.
As the simulation progresses, new regions are defined when a structure is sampled whose distance to all previously defined images is greater than a predefined distance threshold. The hierarchical regions are defined using a set of progressively smaller distance thresholds. There is a maximum number of child regions that can be defined under each parent at each level of the hierarchy, which is set here to 10 for all systems. In this paper, we use a four-level hierarchy of regions with 10000 regions in total. Walker resampling in WExplore occurs through the cloning and merging processes, where the walkers are distributed as equally as possible across all regions. This occurs from the top of the hierarchy downwards: first balancing between the largest hierarchical regions, then the second-largest, and so on. At the beginning of a simulation, only the smallest regions are defined and resampling occurs only at the lowest level. As mentioned in the Introduction, the first time a walker establishes a new region at a new level of the hierarchy, it is cloned repeatedly until the number of walkers in the new and old regions is as even as possible. WExplore and REVO share many of the same qualities, which facilitates their direct comparison here. As in REVO, the number of walkers is constant throughout the simulation. The same distance metrics can be used in both algorithms. Also, the parameters $p_{min}$ and $p_{max}$ have the same role, and can be enforced in the same way. In WExplore, two walkers are only merged if they are in the same region (at all levels of the hierarchy). This is analogous to the merge distance threshold in REVO, introduced above.

2.2.4 N-dimensional biased random walk

We first use the N-dimensional biased random walk to study and analyze the performance of REVO in higher-dimensional spaces. In this system, the conformation of a walker is defined as an N-dimensional vector ($x_i \in \mathbb{R}^N$) of non-negative values.
A walker starts at the origin ($\vec{0}$) and randomly moves either one unit forward (with probability $P_u = 0.25$) or one unit backward (with probability $1 - P_u = 0.75$) in each dimension at each dynamics step. The walkers are confined to non-negative position values by rejecting moves to negative values. In this system, the distance metric used is a scaled version of the Manhattan norm:

$$d_{ij} = \frac{1}{N} \sum_{k=1}^{N} |x_{ik} - x_{jk}| \qquad (2.3)$$

For WExplore we use a four-level region hierarchy with distance thresholds of $d$ = 0.25, 1, 4 and 16. The "merge distance" in REVO is set to 2.5 for all N.

2.2.5 Trypsin-benzamidine system

We run simulations of the trypsin-benzamidine system, which is a test system used to evaluate the sampling power of enhanced sampling techniques for ligand binding [87, 155, 156, 157]. We used the OpenMMRunner in Wepy (https://github.com/ADicksonLab/wepy) and OpenMM version 7.2.2 [158] to run parallel simulations for each walker on nodes equipped with 4 NVIDIA K80 GPUs. The system was set up following our previous work [87]. Atomic coordinates from the PDB ID 3PTB structure are used to set up the system, including the crystallographic calcium ion and the crystallographic water molecules. The system is solvated using a periodic cubic water box of size 74.3 Å. This system has a total of 41006 atoms with nine neutralizing chloride ions. The benzamidine ligand is parameterized using the CHARMM Generalized Force Field (CGenFF) [159, 160]. The system is run at constant temperature and pressure using Langevin dynamics with a friction coefficient of 1 ps⁻¹, which couples the system to a heat bath at a temperature of 300 K, with an integration step size of 2 fs. We employ a 1 atm constant pressure Monte Carlo algorithm where volume move attempts are carried out every 50 steps. Nonbonded forces are calculated using the CutoffPeriodic method in OpenMM, in which only the interaction of each particle with the nearest periodic copy of other particles is considered.
A cutoff distance of 10 Å is used for nonbonded particle interactions. Covalent bonds to hydrogen are constrained using the OpenMM HBonds option. For WExplore we use a four-level region hierarchy with distance thresholds of $d$ = 10, 5, 3 and 1.7 Å. The REVO merge distance was set to 25 Å, effectively allowing all non-local merges. For both REVO and WExplore, five independent simulations were run with 48 walkers each and a step size of 2 fs, with resampling occurring every 20 ps. For both resamplers, we measure the distance between two walkers ($A$ and $B$) as the RMSD between the $A$ and $B$ ligands after aligning the binding sites of $A$ and $B$.

2.2.6 Clustering and network visualizations

To compare the structures obtained by the REVO and WExplore resamplers, we build conformation space networks (CSNs) as follows [161, 162, 163]. First, the feature vector of each frame is determined: a set of distances between a predefined set of ligand and protein atoms. This set includes the 50 nearest heavy protein atoms to the ligand as well as the 9 heavy atoms of the ligand. The feature vector includes all possible pairs of atoms from these two sets, resulting in a feature vector of size 450. Feature-vector-based clustering was done with the MSMBuilder [164] program using the KCenters method and the Canberra distance metric. Three sets of clusters were determined: one using only WExplore trajectories, one using only REVO trajectories, and one using both sets of trajectories. In each case, the data was grouped into 2000 clusters. After clustering, the CSNs are constructed using the CSNAnalysis tool (https://github.com/ADicksonLab/CSNAnalysis), using the unweighted transition counts matrix and a lag time of 20 ps.

2.3 Results

2.3.1 N-dimensional random walk

To compare the REVO and WExplore resampling algorithms, we first run simulations for the random walk system at dimensions N = 2, 5, 10 and 20, with 10 copies each.
The simulations were run with 200 walkers, for 10000 cycles that consist of 10 dynamics steps followed by resampling. For both REVO and WExplore, a minimum and maximum walker weight of $p_{min} = 10^{-100}$ and $p_{max} = 0.1$ were used. The characteristic distance parameter ($d_0$) of REVO is determined by running a single dynamics cycle and then calculating the average distance between all walkers. This overall average value is the characteristic distance, which is tabulated for each value of N in Table 2.1.

Table 2.1: Characteristic distance values for the N-dimensional random walk.

    Dimension (N)    Characteristic distance
    2                0.7165
    5                0.7161
    10               0.7138
    20               0.7157

To compare with our REVO and WExplore results, we also ran straightforward random walk simulations (CONV) with no resampling. Figure 2.2 shows the average predicted probability along each dimension, calculated by averaging the positional probability distributions over all dimensions and all runs. Since the random walk is biased towards the origin, the probability decreases drastically with increasing $x$. In this system, the target equilibrium probability of each $x$ position can be directly calculated as $P_t(x) = \left(\frac{2}{3}\right)\left(\frac{1}{3}\right)^x$.

Figure 2.2: Average predicted probability distributions. The black curve is the target probability. Probability distributions are averaged over all 10 runs for (A) REVO and (B) WExplore.

We find that, using the same number of dynamics steps, REVO is capable of visiting more distant points in comparison to the more limited sampling by WExplore. Furthermore, we can compare the run-to-run variability of the two algorithms by calculating the average standard error of the predicted probability for $x$ in the range 0 to 22 for N = 2, 5, 10 and 20. The value of this averaged standard error is $1.003 \times 10^{-5}$ and $2.06 \times 10^{-5}$ for REVO and WExplore, respectively, which shows that the REVO simulations are more consistent than WExplore.
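The target distribution $P_t(x) = (2/3)(1/3)^x$ can be checked against the move probabilities directly: it is normalized, and it satisfies detailed balance for forward steps taken with probability 0.25 and backward steps with probability 0.75 (moves below $x = 0$ being rejected). A quick numerical sanity check:

```python
# Sanity check that P_t(x) = (2/3)*(1/3)**x is stationary for the 1D biased
# walk: forward step with p = 0.25, backward with p = 0.75, and moves to
# x < 0 rejected.  Detailed balance: P_t(x) * 0.25 == P_t(x+1) * 0.75.

P_u = 0.25  # probability of a forward step

def P_t(x):
    return (2 / 3) * (1 / 3) ** x

# Normalization: geometric series, (2/3) / (1 - 1/3) = 1
total = sum(P_t(x) for x in range(200))
print(abs(total - 1.0) < 1e-12)  # True

# Detailed balance between neighboring positions
balanced = all(abs(P_t(x) * P_u - P_t(x + 1) * (1 - P_u)) < 1e-15
               for x in range(50))
print(balanced)  # True
```

Because $P_t(x)$ decays by a factor of 3 per unit of $x$, reaching $x \approx 22$ means sampling states with probability on the order of $10^{-11}$, which is what makes this a stringent test of resampling.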
Following previous work [152], we quantify the quality of sampling of each probability distribution using two values: the "accuracy" and the "range". The range of a given simulation is calculated by determining the largest $x$ value visited along each dimension, and then averaging them. The accuracy ($A$) of a given curve $P(x)$ is equal to:

$$A = \sum_x a(x) \qquad (2.4)$$

where

$$a(x) = \begin{cases} 1 + \dfrac{|\log(P_t(x)) - \log(P(x))|}{\log(P_t(x))} & \text{if } \log(P(x)) > 2 \log(P_t(x)) \\[1ex] 0 & \text{otherwise} \end{cases}$$

For a given $x$ point, the highest accuracy contribution ($a(x)$) is 1, which is achieved when $P(x) = P_t(x)$. This decreases as $P(x)$ gets farther from the target probability. Figure 2.3 shows that REVO obtains the highest accuracy and range for all simulated values of N.

Figure 2.3: (A) The calculated accuracy values for the three methods, averaged over 10 runs. REVO outperforms the WExplore and CONV methods for all dimensions. (B) The range of visited $x$ values for the three methods, averaged over 10 runs. REVO explores a broader sampling space than the WExplore and CONV methods for all dimensions. Error bars show the standard error of the mean across the set of runs.

Notably, WExplore struggles with very high-dimensional spaces, with accuracy and range values approaching those of conventional simulation. REVO resampling dramatically outperforms WExplore for N = 5, 10 and 20, indicating that it is much more capable of efficiently discovering new areas of space in high-dimensional systems. In fact, the accuracy and range values actually improve with increasing N, reaching their maximum values at N = 10. To investigate this phenomenon, we examine the walkers and their distances to the origin, for each value of N (Figure 2.4). For the 2-dimensional system, walker positions are evenly spread over two sampling "arms" that extend along the $x$ and $y$ axes. In general, an N-dimensional system will have N of these arms, extending outward from the origin.
We hypothesized that having more sampling arms allows a higher fraction of the walkers to be far from the origin, which could improve both the accuracy and range values. In Figure 2.4B and C, we confirm that the expected distance to the origin, averaged over the set of walkers, increases as 𝑁 goes from 2 to 10. Once 𝑁 increases to 20, even though there are more sampling “arms”, the same 200 walkers are not able to sample all of these arms efficiently, and the expected distance to the origin decreases. Interestingly, Figure 2.4C tracks very well with the expected accuracy in Figure 2.3A.

Figure 2.4: (A) The final walker positions for a representative 𝑁 = 2 simulation are shown as points, overlaid on a probability distribution heat map calculated using 𝑃_𝑡(𝑥, 𝑦) = 𝑃_𝑡(𝑥)𝑃_𝑡(𝑦). (B) Probability distributions for the distance to the origin, averaged over all simulations, all walkers, and 10 cycles spaced evenly between cycle 1000 and cycle 10000. Distributions are shown separately for 𝑁 = 2, 5, 10 and 20. (C) The expectation values of these curves as a function of 𝑁.

In our random walk simulations, we propose a move along each dimension for each timestep. An 𝑁-dimensional system can then be seen as 𝑁 one-dimensional systems, where the only thing coupling them together is the resampling algorithm. As the average probability distributions are calculated over all dimensions, the higher-dimensional systems gain a benefit, as they have more chances to sample higher values. To remove this effect, we calculate the accuracy and range using only the first two dimensions for 𝑁 = 2, 5, 10 and 20 (Figure 2.5).

Figure 2.5: Accuracy (𝐴) and range values considering only the first two dimensions. Values are averaged over 10 simulations in each case and the error bars indicate the standard error of the mean.
The accuracy values in this case are within standard error for 𝑁 = 2, 5 and 10, but drop for 𝑁 = 20. The range is constant for 𝑁 = 2 and 5, and sees a slight increase for 𝑁 = 10, before again dropping for 𝑁 = 20. In Figure 2.6, we examine different values of the distance exponent 𝛼, and the presence or absence of the weight novelty term (𝜙). The weight novelty term was introduced to prioritize walkers with higher weights, thereby encouraging not only that higher 𝑥 values are sampled, but that they are sampled with as high a probability as possible.

Figure 2.6: (Top) Average accuracy (left) and range (right) values for different values of the distance exponent 𝛼: 𝛼 = 1 (black), 𝛼 = 2 (green) and 𝛼 = 4 (orange). (Bottom) Average accuracy and range values are shown both with (orange) and without (green) the weight novelty term (𝜙).

As expected, turning off this weight novelty term (setting 𝜙_𝑖 = 1 for all 𝑖) results in a lower average accuracy for all 𝑁. We would also expect this to result in a higher range, since the range is independent of the weights of the walkers. This is what is observed, on average, although the weight novelty leads to a slightly higher range for 𝑁 = 10. Possible reasons for this phenomenon will be addressed in the Discussion section below.

2.3.2 Choosing an optimal distance exponent (𝛼) for ligand unbinding simulations

For biomolecular simulations, it is not feasible to run a large set of simulations with many different 𝛼 values. Here we describe a procedure for determining an optimal distance exponent without running any additional simulations. We instead use ensembles of walkers from previous WExplore simulations, taken at two different time points.
“Early” ensembles were taken from early time points in WExplore ligand unbinding simulations (less than 50 cycles), where all walkers have reasonably low distances to each other and all walker weights are still relatively high (Figure 2.7). “Late” ensembles were taken from the end of these simulations, where some walkers are in the unbound state (with low weight) and some walkers remain in the binding site. We isolate five early ensembles and five late ensembles from two different sets of ligand unbinding simulations: 1) the WExplore trypsin-benzamidine simulations conducted here and 2) the unbinding of the TPPU ligand from soluble epoxide hydrolase (sEH), conducted in previous work [88]. In each case the ensembles had 48 walkers, with 𝑝_min = 10^−12 and 𝑝_max = 0.1.

Figure 2.7: Representative ensembles from WExplore simulations are shown at “early” and “late” timepoints, as described in the text. (Top) Ligand density distributions from the trypsin-benzamidine WExplore simulations reported here. (Bottom) Ligand density distributions are shown for the TPPU ligand unbinding from the enzyme soluble epoxide hydrolase (sEH). Ensembles were taken from previous work [88]. Below each snapshot, a bar graph of trajectory weights is shown, sorted from highest to lowest.

The trajectory variation values were calculated using Eq. 2.1 for the early and late ensembles using 𝛼 = 1, 2, 3 and 4. Average variation values are shown in Figure 2.8. For both systems, higher 𝛼 increases the difference in trajectory variation between early and late trajectory ensembles. An appropriate 𝛼 value is one that clearly differentiates between the early and late ensembles.

Figure 2.8: The trajectory variation determined for four 𝛼 values for (A) trypsin-benzamidine and (B) sEH-TPPU.
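The trajectory variation calculation can be sketched as follows. The functional form V = Σ_i Σ_j (d_ij/d_0)^α 𝜙_i 𝜙_j and the weight-based novelty 𝜙_i = log(w_i/p_min) used here are assumptions for illustration and may differ in detail from the Eq. 2.1 definitions given earlier in the chapter.

```python
import math

def trajectory_variation(positions, weights, d0, alpha, p_min=1e-12):
    """Assumed form: V = sum_{i,j} (d_ij / d0)**alpha * phi_i * phi_j,
    with a simple 1D distance d_ij = |x_i - x_j| and a weight-based
    novelty phi_i = log(w_i / p_min) (a stand-in, for illustration)."""
    phi = [math.log(w / p_min) for w in weights]
    n = len(positions)
    V = 0.0
    for i in range(n):
        for j in range(n):
            d = abs(positions[i] - positions[j])
            V += (d / d0) ** alpha * phi[i] * phi[j]
    return V

# An "early"-like ensemble (walkers close together, high weights) versus
# a "late"-like ensemble (some distant, low-weight walkers); the numbers
# are illustrative only.
early = trajectory_variation([0.0, 0.1, 0.2], [0.4, 0.3, 0.3], d0=1.0, alpha=4)
late = trajectory_variation([0.0, 5.0, 10.0], [0.5, 0.4, 0.1], d0=1.0, alpha=4)
```

Raising 𝛼 amplifies the contribution of pairwise distances larger than 𝑑_0, which is why higher 𝛼 separates early and late ensembles more sharply.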
If our measure of trajectory variation is not higher for the late ensembles that include the unbound state, then we would likely not be able to sample ligand unbinding events by maximizing that measure of variation alone. Based on these results we choose to use 𝛼 = 4 for our REVO simulations of the trypsin-benzamidine system.

2.3.3 Trypsin-benzamidine ligand unbinding

We now compare results for the trypsin-benzamidine unbinding process obtained using the REVO and WExplore methods.

2.3.3.1 Residence time

The mean ligand residence time has been shown to be important for determining drug efficacy [165]. It can be calculated via the flux of unbinding trajectories in ligand-protein unbinding simulations, using a technique called ensemble splitting, or “coloring” [166, 167, 168, 169]. The starting structure for the trypsin-benzamidine simulations for both REVO and WExplore is the ligand bound in the binding pocket. After each dynamics cycle and before resampling, we apply a boundary condition that examines the conformation of the walkers to determine if the unbound state is reached. A walker is considered unbound if the minimum ligand-protein distance exceeds 10 Å. A walker that reaches the unbound state is “warped”: the structure is set back to the initial bound state. The sum of the weights of the warped walkers is used to determine the unbinding rates and the mean residence time of the ligand. The flux as a function of time is determined using the weights of the warped walkers as follows:

Flux(𝑡) = (Σ_{𝑖∈W} 𝑤_𝑖) / 𝑡    (2.5)

where W is the set of all warped walkers. This flux is shown in Figure 2.9. The black curve shows the average probability of the five runs and is influenced strongly by the highest weighted warping events. In total we observe 1160 unbinding events for REVO and 740 for WExplore. The first unbinding events occur at 182.4 ns and 345.6 ns for REVO and WExplore, respectively.
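The boundary condition and Eq. 2.5 translate directly into code; the warped-walker weights below are hypothetical values for illustration.

```python
def is_unbound(min_ligand_protein_dist, cutoff=10.0):
    """Boundary condition: a walker is unbound once the minimum
    ligand-protein distance exceeds 10 Angstroms."""
    return min_ligand_protein_dist > cutoff

def unbound_flux(warp_weights, t):
    """Eq. 2.5: Flux(t) = (sum of warped-walker weights) / t."""
    return sum(warp_weights) / t

# Hypothetical warped-walker weights accumulated over 100 ns:
flux = unbound_flux([1e-6, 4e-7, 2.5e-6], t=100.0)
```

The mean residence time then follows as the reciprocal of the average flux, as described in the text.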
The ligand residence time, or the mean first passage time of unbinding, can be determined as the reciprocal of the average probability flux [166, 167, 170, 171]. Figure 2.10 shows the predicted residence time as a function of simulation time for both REVO and WExplore. For both methods, the total simulation time over the five runs was 8.75 μs. The final calculated residence times are 3.76 ms for REVO and 1.19 ms for WExplore, which are close to the experimental value of 1.6 ms [172]. The standard error in Figure 2.10 is calculated using the standard error of the average flux from Figure 2.9.

Figure 2.9: The average unbound probability for all runs for (A) REVO and (B) WExplore. The thick blue region represents the standard error of the mean at each time point. The black curve shows the average probability for all runs.

Figure 2.10: Average predicted residence times are shown in black for (A) REVO and (B) WExplore. The red line shows the experimental residence time for the trypsin-benzamidine system [172].

As shown in Figure 2.10, the predicted residence time can exhibit large jumps when new highly-weighted warping events are recorded. A key motivation for developing the REVO method was to increase the consistency in residence time estimates across different simulations. Figure 2.9 shows that the REVO simulations are more consistent in the aggregated unbound probability, ranging from 4.13 × 10^−8 to 2 × 10^−3, whereas the WExplore results varied from 3.09 × 10^−10 to 38 × 10^−3. We quantify the convergence of the average trajectory flux as a function of the size of the trajectory set in Figure 2.11. Importantly, this shows that REVO can obtain more reliable residence time estimates using a smaller number of runs.

2.3.3.2 Heterogeneity of ligand unbinding pathways

We now compare the heterogeneity of the unbinding pathways that are observed using the two sampling methods.
Two conformation space networks (CSNs) are shown in Figure 2.12 that combine sampling results for the five simulations conducted with each sampling algorithm. The undirected CSNs are created using the force-directed layout algorithm Force Atlas in Gephi [173]. Each node in the CSN represents a state, and the size of each node is proportional to the sum of the weights of all walker conformations that were assigned to that state. Directed edge weights are computed as 100 times the transition probability. The weight of each undirected edge in the CSN is the average of the in-edge and out-edge weights. The nodes in the Figure 2.12 networks are colored by the solvent accessible surface area (SASA) in Å^2, with the SASA value averaged over all conformations in that cluster. The total number of frames is 437,280 for both REVO and WExplore. For visualization purposes, the weight of all edges is set to 1.0 after graph minimization.

Figure 2.11: Average trajectory flux values are shown using all possible sub-samples over the set of five runs. Individual averages are shown as points, and the probability of the sub-samples is shown using a violin plot. The trajectory flux corresponding to the experimental residence time is shown as a horizontal red line.

As seen in the CSNs, the bound and unbound states are connected via different exit paths. The CSN for the REVO simulations is shown in Figure 2.12A. We find four main ligand unbinding pathways for trypsin-benzamidine, three of which (Paths 1-3) are consistent with those found in an earlier work using WExplore [87]. The CSN for the WExplore simulations, shown in Figure 2.12B, depicts only two of these pathways. Representative structures from the REVO pathways are shown in Figure 2.12C, and the WExplore pathways are shown in Figure 2.12D. The binding site is predominantly formed by two loops: one, depicted in blue, consists of residues 209-218, and the other, depicted in orange, consists of residues 179-190.
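Returning to the CSN construction: the undirected edge weights described above can be computed from a matrix of state-to-state transition counts. A minimal sketch with illustrative numbers (numpy used for brevity):

```python
import numpy as np

def csn_edge_weights(counts):
    """Directed edge weights are 100 * transition probability
    (row-normalized transition counts); each undirected edge weight
    is the average of the corresponding in-edge and out-edge weights."""
    probs = counts / counts.sum(axis=1, keepdims=True)
    directed = 100.0 * probs
    return 0.5 * (directed + directed.T)

# Toy 2-state transition counts (illustrative only):
counts = np.array([[8.0, 2.0],
                   [1.0, 9.0]])
undirected = csn_edge_weights(counts)
```

Here the 0→1 directed weight is 20 and the 1→0 weight is 10, so the undirected edge weight is their average, 15.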
In Path 1 the ligand exits directly from the binding site without any large changes in loop conformation. Paths 2, 3 and 4 depend on conformational changes of the loop regions. In Path 2 the blue loop opens and the ligand exits through it. This benzamidine unbinding pathway was observed in two previous works [155, 87]. In Path 3, the ligand exits to the right through a newly-formed opening in the orange loop. This path has only been previously observed in our WExplore simulations [87]. Finally, in Path 4 benzamidine exits between the blue and orange loops, as in Path 1, but through a newly-formed opening above the disulfide bond formed by residues CYS188 and CYS212. This path has not been observed by any method before.

To measure the breadth of sampling of individual runs, we jointly cluster the trajectories from REVO and WExplore into a set of 2000 clusters. The numbers of clusters visited by each simulation are shown in Table 2.2. We find that REVO visits a higher number of clusters, on average, with a lower standard error.

Table 2.2: Cluster counts for all simulations.

Resampler    Run        Number of clusters visited
REVO         1          803
             2          720
             3          862
             4          793
             5          892
             Average    814.0
             STD err    26.67
WExplore     1          921
             2          811
             3          876
             4          660
             5          716
             Average    796.8
             STD err    43.42

Figure 2.12: Network representations of the free energy landscapes of binding are shown for REVO (A) and WExplore (B). In both cases, discrete transition path ensembles were visually identified and labeled. Nodes are colored according to their ligand solvent accessible surface area using the color bar at the figure bottom, and node size corresponds to the statistical weight of the states. Representative conformations are shown to depict each ligand unbinding pathway for REVO (C) and WExplore (D). Loop regions 209-218 and 179-190 are shown in blue (left) and orange (right).

Pooling all simulations together, REVO visits 424 clusters that were not visited by WExplore. Conversely, WExplore visits 294 unique clusters.
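The pooled cluster comparison amounts to set operations over per-run cluster assignments. A toy sketch with hypothetical cluster labels:

```python
def pooled_clusters(assignments_by_run):
    """Union of cluster labels visited across a set of runs."""
    return set().union(*(set(a) for a in assignments_by_run))

# Hypothetical cluster-label sequences (one list per run):
revo_runs = [[0, 1, 2, 2], [1, 3, 4]]
wexplore_runs = [[0, 1, 1], [5, 0]]

revo_all = pooled_clusters(revo_runs)
wex_all = pooled_clusters(wexplore_runs)
revo_unique = revo_all - wex_all   # clusters visited only by REVO
wex_unique = wex_all - revo_all    # clusters visited only by WExplore
```

In the thesis analysis the same set differences, computed over the joint 2000-cluster assignment, yield the 424 REVO-only and 294 WExplore-only clusters quoted above.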
This shows that REVO samples a broader set of states than WExplore. To analyze structural properties of the unique clusters found by each algorithm, we determine a representative conformation for each cluster as the conformation with the minimum distance to the center of the cluster.

Table 2.3: Co-clustering information.

                               REVO     WExplore
Exclusive cluster numbers      435      268
Average ligand RMSD (Å)        6.65     7.07
Average loops RMSD (Å)         3.76     3.20
Average SASA (Å^2)             43.30    88.32
Average probability            0.014    0.008

For each algorithm, a density map of ligand poses from unique structures is shown in Figure 2.13. The red volume shows the ligand density for REVO, which is localized mostly inside the blue loop and up between the orange and blue loops, consistent with Path 4. The WExplore unique clusters are concentrated in the area on the surface of trypsin adjacent to the binding site, related to the higher-probability ligand transition Path 1.

Figure 2.13: Ligand densities for unique clusters visited by REVO (red) and WExplore (yellow). Density maps are plotted using the VMD Volmap tool with an isosurface value of 0.02.

2.4 Discussion

The above results demonstrate the ability of REVO to explore a broad sampling space with greater accuracy and range when compared to WExplore simulations. For the 𝑁-dimensional random walk system, we found that the accuracy and range of REVO are greater than those of WExplore for all values of 𝑁, suggesting that REVO may be especially powerful for systems with very high-dimensional sampling spaces. In addition to finding all previously discovered unbinding pathways for the trypsin-benzamidine system, the REVO resampler discovered a new unbinding pathway involving significant protein conformational change. These findings are remarkable as WExplore was already notable for its broad sampling of ligand unbinding pathways in the trypsin-benzamidine system.
As REVO is a region-free sampling algorithm, it is not limited by the region-definition obstacles that hinder its predecessors, WExplore and conventional weighted ensemble sampling. “Thresholding” is a key issue in the WExplore algorithm, occurring when a region is defined on a new level of the hierarchy, resulting in many highly-correlated cloning events of a single trajectory. We hypothesized that removing this behavior would lead to more consistent measurements of observables. Encouragingly, this is exactly what is observed here, both in the unbinding flux for trypsin-benzamidine and in the standard error measurements in the 𝑁-dimensional random walk.

As in WExplore, the distance metric used in REVO is flexible. It can be any measurement of distance, and need not be differentiable as a function of system coordinates. For instance, distances could be defined as differences between TM-scores [174], or other measures of template similarity (e.g. the “global distance test total score” (GDT-TS) used in CASP competitions). Distance metrics can also involve histograms of ion and/or water positions, which are discontinuous as a function of atomic positions. Another means of customization is the novelty function (𝜙 in Equation 2.1). Here, the novelty function for each trajectory is defined using only the trajectory weight; however, this function can include any trajectory feature that is of interest to the researcher. Furthermore, the objective function that we are maximizing in this work is the variation within the trajectory ensemble. This could also be modified to optimize other properties of the ensemble: for instance, the matching of NMR observables such as Nuclear Overhauser Effects and coupling constants, or matching density maps from crystallography or cryo-EM.

The efficient nature of the 𝑁-dimensional random walk system allowed us to run a number of simulations under different conditions to examine the properties of the REVO algorithm.
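As an illustration of this flexibility, a custom distance metric is simply a function over pairs of conformations. The sketch below uses a plain coordinate RMSD (without structural superposition, a simplification) as one such metric; it is an illustrative stand-in, not the metric used in the thesis simulations.

```python
import math

def rmsd_distance(conf_a, conf_b):
    """A simple custom distance metric: coordinate RMSD between two
    conformations, given as equal-length lists of (x, y, z) tuples.
    Note: no alignment/superposition is performed here, a deliberate
    simplification. Any such function, even a non-differentiable one,
    could serve as the resampler's distance metric."""
    n = len(conf_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(conf_a, conf_b))
    return math.sqrt(sq / n)
```

Swapping in a different metric (TM-score differences, water-histogram distances, and so on) only requires replacing this one function.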
One puzzling result was the increase in the sampling range for 𝑁 = 10 when the weight novelty was turned on. This was counter-intuitive, as the weight novelty term seeks to encourage cloning of outlier trajectories that have reasonably high weights, while simulations without the weight novelty seek to clone the farthest outlier trajectories at all costs. One important factor is that these simulations are run with a minimum attainable trajectory probability (𝑝_min). This can explain the puzzling result, in that the weight novelty encourages higher-weighted trajectories to venture out from the origin, which can be cloned a higher number of times before they reach 𝑝_min. Aside from this small increase in range, we expect the weight novelty to be broadly useful in obtaining accurate rate constants for rare events, as evidenced by the higher accuracy values in the 𝑁-dimensional random walk simulations.

CHAPTER 3

PREDICTING LOGP OF SMALL MOLECULES

This chapter was adapted from two papers titled “ClassicalGSG: Prediction of log 𝑃 using classical molecular force fields and geometric scattering for graphs” [175] and “Predicting partition coefficients for the SAMPL7 physical property challenge using the ClassicalGSG method” [176] as they were published. The figures and tables from the supporting information (SI) are incorporated into the text.

3.1 Introduction

The partition coefficient (𝑃) measures the relative solubility of a compound between two solvents. It is defined as the ratio of the concentration of an uncharged compound dissolved in an organic solvent (e.g. octanol) to that in water. The logarithm of this ratio (log 𝑃) is considered to be one of the main factors in determining the drug-likeness of a chemical compound. log 𝑃 determines lipophilicity, which affects the bioavailability, solubility, and membrane permeability of a drug compound in the body.
Moreover, according to one of the criteria of the famous Lipinski Rule of Five [6], an orally active drug should have a log 𝑃 value of less than 5. Thus, predicting log 𝑃 plays an essential role in the drug discovery process and is our main focus in this study. The prediction of log 𝑃 is also used as a stepping stone to calculate other molecular properties such as the distribution coefficient (log 𝐷) [177], drug solubility (log 𝑆) [178, 179], and lipophilic efficiency (LiPE) [180]. All of these properties have been used in drug discovery and design: log 𝐷 is an effective descriptor for the lipophilicity of an ionized compound [181], LiPE combines the potency and lipophilicity of a drug compound to estimate its quality, and log 𝑆 is important for modeling the solubility of a compound in the human body.

The partition coefficient is widely used in cheminformatics, and there are diverse experimental methods to measure it [182]. However, these methods are time-consuming and expensive for large databases of compounds, and are not feasible for compounds that have not been synthesized [183]. Therefore, a large number of computational methods have been developed to predict accurate values of log 𝑃. These methods have a long history and can be classified by both their input features (atomic/fragment, molecular and hybrid) and their models or algorithms (parametric models vs. machine learning methods) (Table 3.1).

Table 3.1: Classification of log 𝑃 prediction models.
Features           Parametric Models                   Machine Learning Methods
Atomic/Fragment    XlogP3 [184], AlogP [185],          László et al. [189],
                   ClogP [186], KowWIN [187],          Huuskonen et al. [190]
                   JPlogP [188]
Molecular          MlogP [191], iLogP [192, 193],      AlogPS [195, 196],
                   Mannhold [194]                      S+logP [197], CSLogP [91]
Hybrid             Silicos-IT LogP [198]               TopP-S [92], OpenChem [55]

In atomic-based or fragment-based methods [184, 185, 186, 187, 188, 189, 190], which are based on atomic or fragment contributions, a set of atomic or fragment descriptors is used as input to the model, while “molecular” methods use descriptors of the whole molecule, such as topology or molecular fingerprints [191, 192, 193, 194, 195, 196, 197, 91]. “Hybrid” models combine atomic and molecular descriptors as input to the model [198, 92, 55]. In general, there are challenges with both atomic and molecular descriptors. In atomic-based or fragment-based methods, the accuracy of the atomic contributions depends on the similarity of the atomic environments of the atoms in the group; unfortunately, more training data is needed as the number of groups grows larger. Fragment-based methods can struggle with defining the optimal size of the fragments that participate in predictions. On the other hand, the accuracy of molecular (property-based) methods heavily depends on the choice of molecular descriptors. Common descriptors include 3D molecular structure [92], molecular volume and surface area [199], solvation free energies [192, 200, 193], and the numbers of carbon atoms and heteroatoms [191]. Furthermore, these molecular descriptors can be difficult or computationally costly to generate.

A second way of categorizing log 𝑃 predictors is by the type of mathematical model used to process the input data. Parametric models use methods such as least squares estimation or multiple linear regression to fit parameters that govern the relative contributions of the different input features.
Machine learning methods such as Support Vector Machines (SVMs) [201, 202, 203], Neural Networks (NNs) [201, 202, 204, 92], and Graph Convolutional Networks (GCNs) [49] have been used to predict log 𝑃 values. Recently, some methods have been described that create their own custom molecular features from atomic attributes, which go beyond simple additive models. The TopP-S [92] predictor, developed by Wu et al., uses the atomic positions to create topological descriptors called Betti barcodes. These Betti barcodes are used as molecular features that are input into deep neural networks, along with atomic attributes such as atom type. Results were shown to further improve upon the addition of 633 “molecular fingerprints” calculated with ChemoPy [205]. TopP-S has shown success in predicting log 𝑃 over other methods like XlogP3, ClogP and KowWIN on a group of independent test sets.

Graph representations of molecules have also shown success in various applications including predicting molecular properties [54, 93, 206, 207, 49, 208], virtual screening [209] and molecular force field calculations [95]. In particular, OpenChem [55] uses a graph representation of the molecules. Each atom represents a node in a graph and has a vector of atomic attributes including element type, valence, charge, hybridization, and aromaticity; bonded atoms are connected by an edge in the graph. GCNs are then trained on the graph representations created using these atomic attributes and the 2D structure, or “graph structure”, of the molecules.

Graph representations are beneficial in that they are invariant to translation, rotation, and reflection symmetries. Another molecular symmetry that should be respected is invariance to the re-indexation of atoms: changing the order in which atoms are input to the model should not affect the computed molecular features.
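A quick numerical check of re-indexation symmetry: summing per-atom attribute rows yields the same molecular feature vector for any atom ordering, whereas simply concatenating the rows does not. (Toy numbers; numpy used for brevity.)

```python
import numpy as np

attrs = np.array([[1.0, 0.5],
                  [0.2, 0.1],
                  [0.7, 0.9]])          # 3 atoms, 2 attributes each
perm = np.array([2, 0, 1])              # a re-indexation of the atoms

summed = attrs.sum(axis=0)              # re-indexation invariant
summed_perm = attrs[perm].sum(axis=0)   # identical to `summed`

concat = attrs.ravel()                  # order-dependent representation
concat_perm = attrs[perm].ravel()       # differs from `concat`
```

This is exactly why simple additive models are re-indexation invariant but lose structural detail, motivating the construction discussed next.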
Summation operations respect re-indexation symmetry, but it is not straightforward to capture more detailed information about molecular structure while maintaining this symmetry. A recently-described method, Geometric Scattering for Graphs (GSG) [105], provides a solution to this problem. GSG, which is analogous to GCNs, creates molecular features by scattering atomic attributes across a graph using lazy random walks. GSG is fast at creating re-indexation-invariant features, and its feature vectors have the same length for every molecule, allowing us to easily measure the similarity of molecules, even those with different numbers of atoms. It has shown promising results in the classification of social network data and in predicting Enzyme Commission (EC) numbers [105]. Given this abundance of algorithms for creating molecular features, we seek to compare different methods based on molecular graph representations and their ability to predict log 𝑃. Here we use GSG in combination with a set of atomic descriptors that are generated for use with classical molecular dynamics force fields: partial charges, atom types, and Lennard-Jones interaction parameters. We call this method “ClassicalGSG” and examine its performance as a function of different atomic/molecular features.

In publications, log 𝑃 predictor methods are often assessed using specific test sets, making comparison between different algorithms challenging. Hence, benchmarks and standardized test sets are needed to effectively compare these methods and further advance research on log 𝑃 prediction. To help meet this need, the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) [210] project recently created a distinct blind challenge for predicting log 𝑃, allowing fair evaluation and comparison of different log 𝑃 prediction methods (SAMPL6 in 2019 [211] and SAMPL7 in 2020 [212]).
In this challenge, participants predict log 𝑃 for a set of drug-like molecules and the predictions are assessed against experimental log 𝑃 values that are revealed later. The submitted prediction methods are classified into one of the following categories: empirical methods, physical molecular mechanics (MM)-based methods, physical quantum mechanics (QM)-based methods, or mixed methods. Empirical methods [184, 185, 186, 187, 188, 189, 191, 193, 190, 194, 195, 197, 91, 198, 92, 49, 213] are data-driven methods in which predictor models are trained directly on a dataset of molecules. The empirical category includes methods that employ atomic/fragment-based additive methods, machine learning, and quantitative structure-property relationship (QSPR) approaches. In MM-based methods [214, 215, 200, 216, 217, 214], molecular dynamics simulations are run and used to estimate the solvation free energy; the log 𝑃 for a compound is then calculated analytically from the solvation energy. QM-based methods [218, 219, 220, 221, 222, 223, 224] utilize solvation free energies estimated from quantum mechanical energy calculations. Mixed approaches [225, 223, 226, 227] employ a combination of physical (QM/MM-based) and empirical techniques.

Here, we first employed ClassicalGSG to examine the performance of two force field parameter matching programs: the CHARMM General Force Field (CGenFF) [228, 229] and the General AMBER Force Field 2 (GAFF2) [230, 231]. The NN models were trained using a dataset of molecules made available by OpenChem [55], and we showed that CGenFF-generated parameters with a specific ad hoc scheme for classifying CGenFF atom types achieved the highest accuracy in predicting log 𝑃 values. We also examine the ClassicalGSG performance as a function of different atomic/molecular features, and compare the ClassicalGSG results with GCNs trained on the same data using a variety of atomic attributes, including those from previous work [49].
We then evaluate the performance of ClassicalGSG on several independent test sets and study the properties of the features generated in the pipeline of GSGNN models. In addition, we investigate the properties of molecules with high log 𝑃 prediction error. For the SAMPL7 target molecules, we used the best-performing parameter sets to train four log 𝑃 predictor models. One of our verified (but non-ranked) prediction sets achieved the lowest RMSE (0.55) and the second-lowest Mean Absolute Error (MAE, 0.44) among all the submitted predictions. We note that in SAMPL6 [211] the best set of predictions, made using Cosmotherm [219], achieved an RMSE of 0.38, and 10 models achieved an RMSE of less than 0.5. In SAMPL7, none of the submitted predictions were below this threshold, implying that these molecules had specific structures that introduced difficulty into both the empirical and physical predictions. Here we describe the process of curating the four training datasets, training the models, and making predictions for the SAMPL7 target molecules. Further, to achieve better predictions we examined the performance of various open-source force fields such as the General AMBER Force Field (GAFF) [232], the Universal Force Field (UFF) [233], the Merck Molecular Force Field 94 (MMFF94) [234, 235] and Ghemical [236]. Our results show that MMFF94 models create predictors that on average are as accurate as or better than those created with CGenFF.

We conclude with a discussion of the features generated by the GSG method, the computational cost of generating atomic attributes with CGenFF and GAFF2, and the relative performance of 2D versus 3D structure in predicting log 𝑃 values. We then discuss the curation of training sets for the SAMPL7 challenge, the performance of open-source force field generator tools, and the code available in the ClassicalGSG package.

3.2 Methods

3.2.1 Datasets

The dataset used in this work was generated by Korshunova et al. [55] from the public version of the PHYSPROP database [237].
This dataset consists of 14176 molecules in SMILES format and their corresponding log 𝑃 values, which we refer to as the OpenChem dataset. The molecules are converted from SMILES format to mol2 format and their 3D structures are created by OpenBabel (https://github.com/openbabel/openbabel). Then CHARMM General Force Field (CGenFF) [228, 229] and General AMBER Force Field 2 (GAFF2) [230, 231] parameter files are generated for each molecule. These force field parameter files are created either by CGenFF using the CGenFF tool of the SilcsBio software package (http://silcsbio.com) or by the Antechamber tool implemented in the AmberTools18 [238] package. The process of generating parameter files fails for 175 molecules with CGenFF and for 681 molecules with GAFF2. These 774 molecules are removed from the OpenChem dataset, resulting in a dataset of 13402 molecules. Then 80% of the molecules are used for training and the rest for testing. In addition, we evaluate our trained models on four independent test sets, shown in Table 3.2. The Star and NonStar [194] test sets are publicly available at https://ochem.eu/article/17434 and the Huuskonen [190] test set can be found at https://ochem.eu/article/164. The FDA dataset contains drug molecules that are approved by the Food and Drug Administration (FDA) of the United States and was originally prepared by Cheng et al. [184].

Table 3.2: Independent test sets used for evaluating ClassicalGSG models.

Test set name      Number of molecules
FDA [184]          406
Huuskonen [190]    348
Star [194]         223
NonStar [194]      43

For the SAMPL7 challenge, we first built a master training dataset by combining the log 𝑃 datasets in Table 3.3.

Table 3.3: log 𝑃 datasets used for training.
Dataset name                         Number of molecules
PHYSPROP [237]                       41,039
Huuskonen training set [190, 239]    1,496
TopP-S [190]                         8,199
OpenChem [55]                        14,176
ALOGPS_3_01                          17,436
Logpt_all_data_training              233
Logpt_challenge_training             187

The physical properties database (PHYSPROP) [237] was built by the Syracuse Research Corporation (SRC) and contains the log P values of over 41,000 diverse chemical compounds. Here, we used the public version of PHYSPROP. The Huuskonen dataset [190] has 1,844 unique molecules, combining 326 molecules from its initial version [239] with 1,663 molecules from the Klopman dataset [240]. The 1,844 molecules in the Huuskonen dataset have been organized into a training set with 1,496 compounds and a test set with 348 compounds. Here we use molecules from the Huuskonen training set. The TopP-S dataset consists of 8,199 chemical compounds, initially compiled by Hansch et al. [241] and later refined by Cheng et al. [184] to include only molecules with reliable experimental log P values. The OpenChem dataset was curated from the PHYSPROP drug database [55] and contains 14,176 molecules. Logpt_all_data_training, ALOGPS_3_01, and Logpt_challenge_training are public log P training sets which can be downloaded from https://ochem.eu. The RDKit program [242] is employed to create canonical SMILES for the molecules in these 7 datasets. After removing duplicate molecules, 44,595 molecules remained in the dataset. As the generation of CGenFF atomic attributes failed for some molecules, we ended up with 41,409 molecules in our dataset, which we refer to as the "master dataset". The master dataset itself serves as "DB1", which is used to train a GSG model to generate a set of predictions for the SAMPL7 molecules. The presence of only five atomic elements (C, N, O, S, and H) in the SAMPL7 target molecules motivated us to make a subset of the master dataset where each compound contains only these atomic elements, which we call "DB2".
Molecules that either had an element not listed above or did not have the full set of elements were not selected. The DB2 dataset has 3,482 molecules. Also, the existence of a specific structure (a sulfonyl moiety) in all of the SAMPL7 target molecules inspired us to generate a third dataset by filtering the master training set and keeping only molecules with a sulfonyl moiety. The "HasSubstructMatch" function of RDKit was used to check whether a molecule has this moiety. The resulting training dataset is referred to as "DB3" and has 2,379 molecules. The fourth training set was obtained by filtering the master dataset and keeping only molecules with both a sulfonyl moiety and the elements (C, N, O, S, and H). This training set has 1,482 molecules, and we refer to it as "DB4". These four datasets, DB1, DB2, DB3, and DB4, are used to train four ClassicalGSG models and generate four sets of predictions for the SAMPL7 target molecules. Further, to assess the performance of the open-source force field tools we use a group of external test sets including FDA [184], Star [194], NonStar [194], Huuskonen [190, 239] (Table 3.2) and the SAMPL6 molecule set [211]. To quantify our uncertainty, we chose molecules from these test sets that are similar to the set of SAMPL7 molecules. More specifically, these test sets are filtered to include molecules with a sulfonyl moiety and containing each of the elements (C, N, O, S, and H); molecules containing other elements were excluded. The 44 selected molecules are filtered further by keeping those whose molecular weight lies within the range of the SAMPL7 molecular weights. The resulting test set has 36 molecules and is referred to as S7_TEST.

3.2.2 Generating atomic attributes

As mentioned above, we use atomic attributes including partial charges, atom types, and Lennard-Jones interaction parameters. Below, we explain how we generate these atomic attributes.
Atomic attributes for each atom are extracted from the parameter files generated by CGenFF [228, 229], GAFF2 [230, 231], or open-source force fields such as GAFF [232], UFF [233], MMFF94 [234, 235] and Ghemical [236]. To determine atom type classifications, we compared a number of different schemes. In one scheme, we classify atom types into one of five categories as shown in Table 3.6. This is referred to below as "AC5". Alternatively, we directly use the atom types as generated by either CGenFF or GAFF2; referred to below as "ACall". In the third classification scheme, we manually sorted CGenFF atom types into 36 groups (AC36; Table 3.4) and GAFF2 atom types into 31 groups (AC31; Table 3.5) based on chemical knowledge.

Table 3.4: Classifying CGenFF atom types in 36 categories (AC36). The C_Name is the name of the category, and C_Value shows the value of the category.

Atom type  C_Name  C_Value | Atom type  C_Name  C_Value | Atom type  C_Name  C_Value
LPH     LPH  0  | OG2D1   O2   10 | CG324   C3   19
NG321   N3   1  | OG2S1   O2   10 | CG331   C3   19
NG301   N3   1  | OG2D4   O2   10 | CG302   C3   19
NG311   N3   1  | SG2R50  S3R  11 | CG301   C3   19
NG331   N3   1  | SG311   S3R  11 | SG3O3   SO3  20
NG3N1   N3   1  | CG2O5   C2   12 | OG304   O3   21
NG3P1   N3   1  | CG2O6   C2   12 | OG312   O3   21
CLGR1   CL   2  | CG2O1   C2   12 | OG311   O3   21
CLGA3   CL   2  | CG2O7   C2   12 | OG303   O3   21
CLGA1   CL   2  | CG2D1   C2   12 | OG302   O3   21
HGPAM3  H2   3  | CG2D2   C2   12 | OG301   O3   21
HGPAM2  H2   3  | CG2D1O  C2   12 | NG3P0   N4   22
HGPAM1  H2   3  | CG2O4   C2   12 | NG3P3   N4   22
HGP5    H2   3  | CG2O3   C2   12 | NG3P2   N4   22
HGP3    H2   3  | CG2D2O  C2   12 | HGR51   H3   23
HGP2    H2   3  | CG2DC1  C2   12 | HGR62   H3   23
HGP1    H2   3  | CG2DC2  C2   12 | HGR61   H3   23
HGP4    H2   3  | CG2DC3  C2   12 | HGR53   H3   23
SG301   S3   4  | CG2O2   C2   12 | HGR52   H3   23
CG2R63  C2R  5  | CG2N1   C2   12 | HGR71   H3   23
CG25C1  C2R  5  | CG2N2   C2   12 | HGR63   H3   23
CG25C2  C2R  5  | CG1T2   C1   13 | PG2     P2   24
CG2R57  C2R  5  | CG1N1   C1   13 | IGR1    I    25
CG2R53  C2R  5  | CG1T1   C1   13 | CG3C52  C3R  26
CG251O  C2R  5  | OG2N1   O2B  14 | CG3C31  C3R  26
CG252O  C2R  5  | SG302   S3C  15 | CG3C50  C3R  26
CG2R61  C2R  5  | NG2RC0  N2R  16 | CG3C54  C3R  26
CG2R62  C2R  5  | NG2R50  N2R  16 | CG3RC1  C3R  26
CG2R64  C2R  5  | NG2R62  N2R  16 | CG3C41  C3R  26
CG2R51  C2R  5  | NG2R61  N2R  16 | CG3C51  C3R  26
CG2R67  C2R  5  | NG2R60  N2R  16 | CG3C53  C3R  26
CG2RC0  C2R  5  | NG2R57  N2R  16 | OG2D2   O2A  27
CG2R71  C2R  5  | NG2R53  N2R  16 | SG3O1   SO1  28
CG2RC7  C2R  5  | NG2R52  N2R  16 | BRGA1   BR   29
CG2R52  C2R  5  | NG2R51  N2R  16 | BRGA2   BR   29
CG2R66  C2R  5  | NG2R43  N2R  16 | BRGA3   BR   29
SG2D1   S2   6  | NG2R67  N2R  16 | BRGR1   BR   29
SG2P1   S2   6  | ALG1    AL   17 | NG2S0   N2   30
SG2P2   S2   6  | FGA3    F    18 | NG2S1   N2   30
SG3O2   SO2  7  | FGR1    F    18 | NG2S2   N2   30
NG1T1   N1   8  | FGP1    F    18 | NG2S3   N2   30
HGA1    H1   9  | FGA1    F    18 | NG2D1   N2   30
HGA7    H1   9  | FGA2    F    18 | NG2P1   N2   30
HGA4    H1   9  | CG3AM2  C3   19 | NG2O1   N2   30
HGA5    H1   9  | CG3AM1  C3   19 | OG2P1   O2C  31
HGA6    H1   9  | CG3AM0  C3   19 | PG0     P0   32
HGAAM0  H1   9  | CG334   C3   19 | NG3C51  N3R  33
HGA3    H1   9  | CG311   C3   19 | OG2R50  O3R  34
HGAAM1  H1   9  | CG314   C3   19 | OG3R60  O3R  34
HGAAM2  H1   9  | CG321   C3   19 | OG3C31  O3R  34
HGA2    H1   9  | CG322   C3   19 | OG3C61  O3R  34
OG2D5   O2   10 | CG323   C3   19 | OG3C51  O3R  34
OG2D3   O2   10 | CG312   C3   19 | PG1     P1   35

Table 3.5: Classifying GAFF2 atom types in 31 categories (AC31). The C_Name is the name of the category, and C_Value shows the value of the category.

Atom type  C_Name  C_Value | Atom type  C_Name  C_Value | Atom type  C_Name  C_Value
c    C2   0 | f    F    8  | nn   N3R  17
cz   C2   0 | cl   CL   9  | np   N3R  17
cf   C2   0 | br   BR   10 | nq   N3R  17
ce   C2   0 | i    I    11 | n5   N3R  17
cs   C2   0 | n    N2   12 | n6   N3R  17
c2   C2   0 | nt   N2   12 | nu   N3R  17
cg   C1   1 | n2   N2   12 | o    O2   18
ch   C1   1 | ns   N2   12 | ow   O3   19
c1   C1   1 | nf   N2   12 | os   O3   19
c3   C3   2 | na   N2   12 | oh   O3   19
cc   C2R  3 | ne   N2   12 | oq   O3R  20
cq   C2R  3 | n1   N1   13 | op   O3R  20
cp   C2R  3 | n8   N3   14 | pe   P2   21
ca   C2R  3 | n7   N3   14 | pf   P2   21
cd   C2R  3 | n3   N3   14 | p2   P2   21
cx   C3R  4 | n9   N3   14 | p4   P3   22
cy   C3R  4 | nk   N4   15 | p3   P3   22
cu   C2R  4 | nl   N4   15 | px   P3   22
cv   C2R  4 | n+   N4   15 | p5   P4   23
h5   H1   5 | nz   N4   15 | py   P4   23
hx   H1   5 | no   N4   15 | pb   P2R  24
h4   H1   5 | nx   N4   15 | pc   P2R  24
hc   H1   5 | n4   N4   15 | pd   P2R  24
h2   H1   5 | ny   N4   15 | s    S1   25
h3   H1   5 | nd   N2R  16 | s2   S2   26
h1   H1   5 | nc   N2R  16 | sh   S2   26
ha   H3   6 | ni   N2R  16 | s4   S3   27
hn   H2   7 | nj   N2R  16 | sx   S3   27
ho   H2   7 | nb   N2R  16 | s6   S4   28
hp   H2   7 | nv   N3R  17 | sy   S4   28
hs   H2   7 | nh   N3R  17 | sp   S3R  29
hw   H2   7 | nm   N3R  17 | sq   S3R  29
ss   SS   30

Specifically,
efforts were made to make new groups for atom types with different elements and hybridization values, and to separately identify atoms that are members of ring structures. Finally, a fourth classification scheme simply uses a uniform atom type for all atoms, referred to as "AC1".

Table 3.6: Classifying atoms in 5 categories (AC5).

Atom type                              Category number
Hydrogen                               1
Oxygen and Nitrogen                    2
Carbon with hybridization value < 3    3
Carbon with hybridization value = 3    4
Others                                 5

The two Lennard-Jones parameters, the radius (r) and well-depth (ε), are extracted from either CHARMM or AMBER parameter files for each atom type. In summary, the atomic attributes are defined by both the force field generation tool (CGenFF or GAFF2) and the atom classification scheme (AC1, AC5, AC36, AC31 or ACall), and we generally refer to these as "FF" atomic attributes. The atomic charges and Lennard-Jones parameters are scalar values, but atom types are represented in one-hot encoding format. For comparison, we also examine a different set of atomic attributes, following previous work for the log P predictor from the OpenChem toolkit (https://github.com/Mariewelt/OpenChem.git) [55]. These atomic attributes are defined as: atom element type, valence, charge, hybridization, and aromaticity. The 3D structure is generated from SMILES strings by RDKit (https://www.rdkit.org); then, again using RDKit, the atomic attributes are produced for each molecule and represented in one-hot encoding format as node features in OpenChem. These are referred to below as "OPENCHEM" atomic attributes.

3.2.3 Log P predictions using GSG

3.2.3.1 Geometric Scattering for graphs

The Geometric Scattering for Graphs (GSG) method, which was introduced in Ref. [105], is a feature extraction method for graph data types and uses a geometric transform defined on the graph.
In the GSG method, molecules are represented by a graph of atoms where each atom has a vector of attributes; a single attribute evaluated at each vertex is also referred to as a "graph signal". GSG combines the molecular structure (defined using an adjacency matrix) and atomic signal vectors to construct invariant, stable, and informative features as shown in Figure 3.1.

Figure 3.1: Architecture of the GSG method. The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes that are shown as colored bars. Wavelet matrices Ψ are built at different logarithmic scales, j, using the adjacency matrix as described in the text. Finally, the scattering transform is applied to get the graph features using both the wavelet matrices and the signal vectors. Modified from figure made by Feng et al. [105].

Now we briefly explain the mathematics behind this method. Let G = (V, E, W) be a weighted graph and x(v_l), 1 ≤ l ≤ n, be the signal function defined on each node, where n is the number of nodes in the graph and v_l represents node l. A graph random walk operator is defined as P = (1/2)(I + AD^{-1}), where A is the adjacency matrix and D is the degree matrix. P^t describes the probability distribution of a lazy random walk after t steps, hence the term lazy random walk for P. In fact, graph random walks are low-pass filters that act like convolution filters to capture graph features at different scales. Graph wavelet operators, also known as "signal transform operators for graphs", are defined at the scale 2^j as Ψ_j = P^{2^{j-1}} − P^{2^j}. Finally, a set of scattering transform operators, S_0, S_1 and S_2, are defined using the graph signals and wavelets to compute the molecular-level features.
Unique S_0, S_1 and S_2 vectors are created using different moments (q) of x and their wavelet coefficients Ψ, as well as different wavelet scales indexed by j:

(S_0)_q = Σ_{l=1}^{n} x(v_l)^q,   1 ≤ q ≤ Q   (3.1)

(S_1)_{j,q} = Σ_{l=1}^{n} |Ψ_j x(v_l)|^q,   1 ≤ j ≤ J,  1 ≤ q ≤ Q   (3.2)

(S_2)_{j,j',q} = Σ_{l=1}^{n} |Ψ_{j'} |Ψ_j x(v_l)||^q,   1 ≤ j < j' ≤ J,  1 ≤ q ≤ Q   (3.3)

The scattering operators are named based on the number of times the wavelets Ψ are used to transform the signal x. For example, S_0 features do not use the wavelet operators, and are simply different moments of the atomic signals. These are called the "zero order scattering moments." Accordingly, the S_1 operators with one wavelet transformation are called "first order scattering moments" and S_2 with two wavelet transformations are called the "second order scattering moments." Note that these allow mixing between different wavelet scales, as j and j' are set independently. We define the set S as the concatenation of all S_0, S_1 and S_2 vectors using all possible values of q, j and j' as specified by the ranges in Eqs. 3.1, 3.2 and 3.3. The set of graph features is generated deterministically from the atomic signals and adjacency matrix. The size of this set of features is equal to Q N_s (1 + J(J + 1)/2), where N_s is the number of attributes per vertex, although this is lower if not all zeroth, first and second order features are used. As mentioned above, one of the inputs to GSG is the adjacency matrix of the graph, and we consider two distinct ways to represent it. One is to use the 2D connectivity of the molecule, where A_ij = 1 indicates a bond between atoms i and j, and A_ij = 0 otherwise.
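To make the construction concrete, the lazy random walk, the wavelet operators Ψ_j, and the scattering moments of Eqs. 3.1-3.3 can be sketched in a few lines of NumPy. This is a simplified illustration for a single graph signal on a small 2D-connectivity graph, not the actual ClassicalGSG implementation:

```python
import numpy as np

def lazy_random_walk(A):
    """P = (1/2)(I + A D^-1), built from the adjacency matrix A (connected graph assumed)."""
    d = A.sum(axis=0)                       # node degrees
    return 0.5 * (np.eye(len(A)) + A / d)   # A / d scales column j by 1/d_j, i.e. A D^-1

def wavelets(A, J):
    """Psi_j = P^(2^(j-1)) - P^(2^j) for j = 1..J."""
    P = lazy_random_walk(A)
    return [np.linalg.matrix_power(P, 2 ** (j - 1)) - np.linalg.matrix_power(P, 2 ** j)
            for j in range(1, J + 1)]

def scattering_features(A, x, J=4, Q=4):
    """Zeroth-, first- and second-order scattering moments of a single signal x."""
    Psi = wavelets(A, J)
    feats = [np.sum(x ** q) for q in range(1, Q + 1)]            # S0: plain moments
    for j in range(J):
        wx = np.abs(Psi[j] @ x)
        feats += [np.sum(wx ** q) for q in range(1, Q + 1)]      # S1: one wavelet
        for jp in range(j + 1, J):                               # mix scales j < j'
            wwx = np.abs(Psi[jp] @ wx)
            feats += [np.sum(wwx ** q) for q in range(1, Q + 1)] # S2: two wavelets
    return np.array(feats)
```

For J = 4 and Q = 4 this yields Q(1 + J(J + 1)/2) = 44 features, matching the feature-count formula above with N_s = 1.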
Alternatively, the adjacency matrix can be calculated using the 3D structure, where A_ij = f(R_ij), and f(R) is a smooth and differentiable function that is equal to 1 at low R and decreases to 0 as R exceeds some cutoff value R_c [94]:

f(R_ij) = 0.5 (cos(π R_ij / R_c) + 1)   for R_ij ≤ R_c
f(R_ij) = 0                             for R_ij > R_c   (3.4)

where R_c is the radial cutoff and R_ij is the Euclidean distance between atoms i and j. If f(R_ij) is non-zero, nodes i and j are considered connected, and disconnected otherwise. The wavelet operators in GSG can thus be defined using either discrete (2D) or continuous (3D) values in the adjacency matrix.

3.2.3.2 Neural networks architecture and training

To predict the log P values from the GSG features (S), we develop a feedforward neural network using PyTorch [243] with a nonlinear activation function such as the Rectified Linear Unit (ReLU). To determine the best performing network, we consider three important hyperparameters: the number of hidden layers, the number of neurons in a hidden layer, and the dropout rate, whose ranges are shown in Table 3.7. We perform cross-validation using the PyTorch wrapper Skorch (https://skorch.readthedocs.io/en/stable/) to tune the hyperparameters. The loss function of our NN model is MSELoss (mean squared error) and we use Adam (adaptive moment estimation) [244] for optimizing the parameters. We use the MultiStepLR scheduler, with the learning rate set to an initial value of 0.005 and decreased during training every 15 steps by a factor of 0.5. In training our GSGNN models, we used two data regularization methods: the L1 norm and standardization using the StandardScaler function from scikit-learn [245]. For GSG features with maximum wavelet scales of J = 4, 5 and 6 we mostly benefit from using the standardization method. Other settings can be found in Table 3.7.

Table 3.7: Neural network settings. Square brackets denote possible parameter values used in the grid search method.
Parameter                  Values
Number of hidden layers    [2, 3, 4, 5]
Size of hidden layers      [300, 400, 500]
Dropout rate               [0.2, 0.4]
Initial learning rate      0.005
Learning coefficient       0.5
Batch size                 256
Max epoch size             400

3.2.4 Log P predictions using GCNs

3.2.4.1 Graph convolutional neural network

Graph convolutional neural networks (GCNs) extend the application of convolutional neural networks to graph data. The goal of GCNs is to take a graph and generate features according to node attributes and graph structure. The heart of the GCN method is the convolution operator, which aggregates the features of neighboring nodes within the graph. Recently, various implementations of GCNs [246, 247, 248, 249, 250, 207] have been developed to increase the speed and accuracy of GCN models. In this work, we employ the GCN model that was developed in OpenChem [55] based on the method introduced by Kipf and Welling [249]. Similar to the GSG model above, this model takes a graph G = (V, E) as the input, where each node has a vector of attributes x. The model processes the graph by passing it through multiple hidden layers performing convolution operations (Figure 3.2).

Figure 3.2: Architecture of the GCN method. The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes, shown as colored bars. GCN layers are shown in gray and are each followed by a max-pooling layer, shown in purple. The graph gathering layer, shown in green, sums the features over all nodes to generate the molecular feature vector.
The GCN method works by propagating a feature matrix H, originally set to the value of the attribute vectors x for all nodes, through a set of convolution operators as follows:

H^(0) = X
H^(l+1) = σ(Ã H^(l) W^(l))   (3.5)

where Ã = A + I is the adjacency matrix of the input graph with self-connections, I is the identity matrix, W^(l) is the trainable weight matrix of layer l, and σ is a non-linearity function such as ReLU. H^(l) denotes the value of the feature matrix on layer l. Each convolution layer is followed by a graph max-pooling layer introduced in [250]. Following these convolution and max-pooling operations, the final set of graph features is obtained by a graph gather layer, where the values of each feature are summed over nodes. This last layer gives GCNs their index-invariance property.

3.2.4.2 GCN architecture and training

We predict log P using the GCN model implemented in the OpenChem toolkit (https://mariewelt.github.io/OpenChem/html/index.html). This model contains 5 layers of graph convolutions with a hidden layer size of 128. The GCN layers are followed by max-pooling and a graph gather layer. A 2-layer neural network with ReLU as the activation function is added after the graph gather layer. The PyTorch Adam optimizer and MSELoss (mean squared error) are used as the parameter optimizer and training loss function, respectively. We use the MultiStepLR learning scheduler, implemented in PyTorch, with an initial value of 0.001, a step size of 15 and a learning coefficient of 0.5 for training the model. Parameters used for GCN training are summarized in Table 3.8.
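The propagation rule in Eq. 3.5, followed by a sum-over-nodes gather layer, can be sketched as follows. This is an illustrative NumPy sketch with random weights that omits the max-pooling layers; it is not the OpenChem implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_forward(A, X, weights):
    """Propagate node features: H_(l+1) = relu(A~ H_l W_l), then gather by summing over nodes."""
    A_tilde = A + np.eye(len(A))   # adjacency with self-connections
    H = X
    for W in weights:
        H = relu(A_tilde @ H @ W)
    return H.sum(axis=0)           # graph gather layer: sum over nodes

rng = np.random.default_rng(0)
A = np.array([[0., 1.], [1., 0.]])     # two bonded atoms
X = rng.normal(size=(2, 4))            # 4 attributes per atom
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]
g = gcn_forward(A, X, weights)         # molecular feature vector of size 8
```

Because the gather step sums over nodes, relabeling the atoms leaves the output unchanged, which is the index-invariance property mentioned above.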
Table 3.8: Neural network settings.

Parameter                      Values
Number of GCN hidden layers    5
Number of NN hidden layers     2
Size of hidden layers          128
Dropout rate                   0
Initial learning rate          0.01
Learning coefficient           0.5
Batch size                     128
Max epoch size                 200

3.3 Results

3.3.1 Evaluation of molecular representations

For our set of atomic attributes (FF), we use parameters from classical MD force fields: the partial charge, and the Lennard-Jones radius and well-depth. We first compare the performance of atomic attributes generated using two different algorithms: GAFF2 and CGenFF. Both algorithms are used to automatically generate force field parameters for small molecule ligands by matching atomic environments with atoms that are part of existing force fields; GAFF2 uses the Amber [232] force field, while CGenFF uses the CHARMM [228, 229] force field. Here we generate molecular features from atomic attributes using GSG (see Section 3.2.3.1). We examine the four different atom type classification schemes discussed above: AC1, AC5, AC36/AC31 and ACall. The GSG parameters used to construct the molecular features are shown in Table 3.9.

Table 3.9: Parameters for the Geometric Scattering for Graphs algorithm. The square brackets show all of the values examined for each parameter. For "Scattering operators", 'z' represents the zero order operator, 'f' is first order, and 's' is second order.

Parameter                          Values
Adjacency matrix (A)               [2D, 3D]
Wavelet maximum scale index (J)    [4, 5, 6, 7, 8]
Scattering operators (z, f, s)     Four combinations: (z, f), (z, s), (f, s), (z, f, s)
Atom classification                [AC1, AC5, AC36/AC31, ACall]

We trained 160 models for each of the CGenFF and GAFF2 atomic attribute sets, each trained using a 3-fold cross-validation method. After training, we ran tests on subsets of the full database using an 80:20 train:test split.
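The split-and-evaluate protocol can be sketched as follows. This is a minimal NumPy illustration with placeholder prediction arrays; the actual pipeline uses Skorch-wrapped PyTorch models:

```python
import numpy as np

def split_80_20(n, seed=0):
    """Shuffle indices and return disjoint 80:20 train/test index sets."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]

def evaluate(y_true, y_pred):
    """Squared Pearson correlation, root mean squared error, and mean unsigned error."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mue = np.mean(np.abs(y_true - y_pred))
    return r ** 2, rmse, mue
```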
We then calculated evaluation metrics such as the squared correlation coefficient (r^2), root mean squared error (RMSE) and mean unsigned error (MUE) between the predicted and experimental log P values. In Figure 3.3, r^2 and RMSE values are shown for predictions using 2D and 3D molecular structures. Each r^2 and RMSE is averaged over the 20 NN models that are created with all combinations of the 4 geometric scattering operator sets and the 5 maximum wavelet scales shown in Table 3.9.

Figure 3.3: Average r^2 (A) and RMSE (B) for the OpenChem test set using GSGNN models. Each average is calculated over 20 individual parameter values and the error bars show the best and worst performing models. The atomic attributes are generated with either CGenFF or GAFF2 force fields and using one of four atom type classification schemes ("AC1", "AC5", "AC36/AC31" or "ACall").

We find that models with 2D structure universally have higher accuracy in the prediction of log P values. Additionally, Figure 3.3 demonstrates that CGenFF atomic attributes on average are more accurate in predicting log P values compared to GAFF2 atomic attributes, although this is not true for ACall. We find that good results on average are obtained by classifying atom types using AC36 categories; the best performing individual models are also found in this category. We thus use CGenFF, 2D structure and AC36 for all models going forward.

Figure 3.4: The r^2 for the OpenChem test set using GSGNN models, plotted against the wavelet maximum scale index (J) for each scattering operator set. The atomic attributes are all generated with CGenFF force fields, the AC36 atom type classification scheme, and 2D molecular structure.

We next investigate different scattering moment sets and wavelet scales in Figure 3.4. We find that for each examined maximum wavelet scale (4, 5, 6, 7 and 8) there is at least one model with an r^2 value of 0.9.
The models also perform well for all combinations of scattering moments, making it difficult to determine an optimal set of parameters. To further evaluate the accuracy of our method, we examine four different independent test sets: FDA, Huuskonen, Star and NonStar, which are described in Section 3.2.1. To avoid any accidental overlap, we identified shared molecules and removed them from our training dataset. The training dataset contains 10,722 randomly selected molecules from the OpenChem dataset that are successfully processed by CGenFF. The trained models are evaluated using these test sets and the r^2 between the actual and predicted log P values is calculated. The results are shown in Figure 3.5. For the Huuskonen and FDA data sets we find that there is at least one GSGNN model with r^2 = 0.92 and r^2 = 0.90, respectively. The best performing model for the Star test set uses a maximum wavelet scale of J = 4 and zero- and second-order scattering operators, while the best model for NonStar uses a maximum wavelet scale of J = 7 and zero-, first- and second-order scattering operators.

Figure 3.5: The r^2 for different test sets using GSGNN models. A) shows r^2 for the FDA test set. B) shows r^2 for the Huuskonen test set. C) and D) show r^2 for the Star and NonStar test sets, respectively. The horizontal axis indicates the maximum wavelet scale J. The atomic attributes are generated with 2D molecular structure, CGenFF force fields and the AC36 atom type classification scheme.

The r^2, RMSE and MUE of the best-performing GSGNN model for each test set are shown in Table 3.11.
In Ref. [92], the log P values for the FDA, Star and NonStar test sets are predicted using topology-based models (TopP-S) and compared to established methods such as ALOGPS, XLOGP3 and KowWIN.

Table 3.10: The log P prediction results using different methods on the FDA test set. The result from our model is shown in red. The TopP-S entry is the best performing method (GBDT-ESTD+ -2-AD) on the FDA test set from [92]. Table adapted from [92]. N = 406; 2 molecules are removed due to pre-processing failure.

Method         r^2     RMSE   MUE
TopP-S         0.935   0.51   0.24
ClassicalGSG   0.910   0.55   0.35
ALOGPS         0.908   0.60   0.42
XLOGP3         0.872   0.72   0.51
XLOGP3-AA      0.847   0.80   0.57
CLOGP          0.838   0.88   0.51
TOPKAT         0.815   0.88   0.56
ALOGP98        0.800   0.90   0.64
KowWIN         0.771   1.10   0.63
HINT           0.491   1.93   1.30

We find that our results do not outperform the TopP-S method, but we do achieve higher accuracy compared to other established methods such as ALOGPS, KowWIN and XLOGP3 for the FDA test set (Table 3.10), methods like XLOGP3 and ALOGP for the Star test set (Table 3.12), and the XLOGP3 method for the NonStar test set (Table 3.13).

Table 3.11: The log P prediction performance results for four independent test sets.

Dataset     r^2    RMSE   MUE
FDA         0.91   0.56   0.35
Star        0.90   0.49   0.35
NonStar     0.72   1.02   0.81
Huuskonen   0.93   0.37   0.23

Table 3.12: The log P prediction results using different methods on the Star test set. The result from our model is shown in red. The TopP-S entry is the best performing method (MT-ESTD+ -1-AD) on the Star test set from [92]. Table adapted from [92].
N = 223. The last three columns give the percentage of molecules within each error range.

Method          RMSE   <0.5   0.5-1   >1
AB/LogP         0.41   84     12      4
S+logP          0.45   76     22      3
TopP-S          0.49   77     16      7
ClassicalGSG    0.49   75     18      7
ACD/logP        0.50   75     17      7
CLOGP           0.52   74     20      6
VLOGP OPS       0.52   64     21      7
ALOGPS          0.53   71     23      6
XLOGP3          0.62   60     30      10
KowWIN          0.64   68     21      11
CSLogP          0.65   66     22      12
ALOGP           0.69   60     25      16
MolLogP         0.69   61     25      14
ALOGP98         0.70   61     26      13
OsirisP         0.71   59     26      16
VLOGP           0.72   65     22      14
TLOGP           0.74   67     16      13
ABSOLV          0.75   53     30      17
QikProp         0.77   53     30      17
QuantlogP       0.80   47     30      22
SLIPPER-2002    0.80   62     22      15
COSMOFrag       0.84   48     26      19
XLOGP2          0.87   57     22      20
QLOGP           0.96   48     26      25
VEGA            1.04   47     27      26
CLIP            1.05   41     25      30
LSER            1.07   44     26      30
MLOGP(Sim+)     1.26   38     30      33
NC+NHET         1.35   29     26      45
SPARC           1.36   45     22      32
HINTLOGP        1.80   34     22      44

Table 3.13: The log P prediction results using different methods on the NonStar test set. The result from our model is shown in red. The TopP-S entry is the best performing method (MT-ESTD+ -2-AD) on the NonStar test set from [92]. Table adapted from [92]. N = 43. The last three columns give the percentage of molecules within each error range.

Method          RMSE   <0.5   0.5-1   >1
ALOGPS          0.82   42     30      28
MiLogP          0.86   49     30      21
S+logP          0.87   40     35      26
XLOGP3          0.89   47     23      30
CLOGP           0.91   47     28      26
ALOGP           0.92   28     40      33
CSLogP          0.93   58     19      23
MolLogP         0.93   40     25      26
TopP-S          0.94   51     19      30
OsirisP         0.94   42     26      33
ALOGP98         1.00   30     37      33
AB/LogP         1.00   42     23      35
ACD/logP        1.00   44     32      23
ClassicalGSG    1.02   36     31      33
ABSOLV          1.02   49     28      23
KowWIN          1.05   40     30      30
VLOGP OPS       1.07   33     28      26
TLOGP           1.12   30     37      30
VLOGP           1.13   40     28      33
XLOGP2          1.16   35     23      42
SLIPPER-2002    1.16   35     23      42
QuantlogP       1.17   35     26      40
COSMOFrag       1.23   26     40      23
VEGA            1.24   28     30      42
QikProp         1.24   40     26      35
LSER            1.26   35     16      49
QLOGP           1.42   21     26      53
CLIP            1.54   33     9       49
MLOGP(Sim+)     1.56   26     28      47
SPARC           1.70   28     21      49
NC+NHET         1.71   19     16      65
HINTLOGP        2.72   30     5       65

3.3.2 Geometric scattering vs graph convolution

To compare the performance of the geometric scattering (GSGNN) models to the graph convolution (GCN) models, we trained models using these two methods with the two different sets of atomic attributes, "FF" and "OPENCHEM", as
described in Section 3.2.1. To rule out the influence of the training/test split, we trained 5 models for each method, where the data is randomly divided into training and test sets with an 80:20 ratio. Each of the five GSGNN models is trained using 5-fold cross-validation. The RMSE and r^2 are determined for each model and are shown in Table 3.14. Note that the atomic attributes for these GSGNN models are generated with 2D molecular structure, CGenFF force fields, and the AC36 atom type classification scheme. The GSG parameters that are used to generate the molecular features are from one of the best performing models (a maximum wavelet scale of J = 4 and all three of the zero-, first- and second-order scattering operators) from Figure 3.4. Our results suggest that the ClassicalGSG method, which is a GSGNN model trained on FF atomic attributes, has the most accurate prediction of log P values.

Table 3.14: The log P prediction results using different sets of features and models.

Method name    Atomic attributes   Model   r^2 (STD)      RMSE (STD)
ClassicalGSG   FF                  GSGNN   0.91 (0.003)   0.52 (0.009)
-              OPENCHEM            GSGNN   0.89 (0.003)   0.57 (0.005)
-              FF                  GCN     0.75 (0.091)   0.91 (0.18)
OpenChem       OPENCHEM            GCN     0.79 (0.052)   0.83 (0.122)

3.3.3 Features visualization

To examine the molecular features that are generated in the pipeline of the ClassicalGSG method, we visualize the features generated by the GSG method and by the last layer of the NN model using t-distributed stochastic neighbor embedding (t-SNE) [251] plots. t-SNE projects high-dimensional data into a low-dimensional space so it can be visualized readily. Figure 3.6 shows the GSG and NN features for molecules in the OpenChem test set visualized in a 2D space. Here each point shows a molecule in the OpenChem test set and is colored by its log P value. The initial dimension of the GSG features is 1716, while the NN features have a size of 400.
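A projection like the one in Figure 3.6 can be sketched with scikit-learn's TSNE. Here we use random stand-in features rather than the real GSG features, so only the shapes are meaningful:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
gsg_features = rng.normal(size=(100, 1716))  # stand-in for the 1716-dimensional GSG features

# project to 2D for plotting; each row becomes one point, colored by its log P value
emb = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(gsg_features)
```

In practice the same call would be repeated on the 400-dimensional activations of the last NN hidden layer to produce the second panel.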
These GSG features are generated from CGenFF atomic charges, 2D molecular structure, the AC36 type classification scheme and all three scattering moment operators with a maximum wavelet scale of 4. We can observe from Figure 3.6 that features extracted from the NN are more discriminative with respect to log P values. More specifically, in the last layer of the NN model, molecules with similar log P values tend to be near each other and form distinct clusters. In contrast, the GSG features are less discriminative, with nearby molecules having different log P values. To further verify this, the five nearest neighbors are determined for each point in the GSG and NN reduced feature spaces. Then, the differences between the actual log P value of each point and those of its five neighbors are calculated and averaged. This value is averaged over all the molecules and is denoted ⟨Δ log P⟩_N in the figure. The ⟨Δ log P⟩_N in the reduced GSG feature space is 0.81, while this value for the reduced NN feature space is only 0.17. The average log P difference between two random points is 2.00. This shows that molecules with similar log P values are closer in the NN feature space. This is noteworthy, as the GSG features are task independent and therefore are not adapted to any particular prediction task, including log P prediction. On the other hand, the NN model, which takes as input the GSG features, is a supervised model that is trained for the specific task of log P prediction. The results in Figure 3.6, in addition to the results in Tables 3.11 and 3.14, illustrate that the GSG features provide not only a translation, rotation, and permutation invariant representation of the molecules, but one that is also sufficiently rich so that, when combined with a downstream supervised NN, the resulting model provides accurate estimates of log P values for molecules.
Figure 3.6: The t-SNE plots of the GSG and NN features of the OpenChem test set molecules. Each point represents a molecule and is colored by its actual log P value. ⟨Δ log P⟩_N shows the mean log P difference calculated over the nearest neighbors in the t-SNE plot. A) The GSG features of size 1716 projected into 2-dimensional space. B) The NN features from the last hidden layer, of size 400, projected into 2-dimensional space.

3.3.4 Distinguishing features of failed molecules

To investigate the common characteristics of molecules that fail during the prediction of their log P value, we created molecular fingerprints, which have been used extensively in cheminformatics, QSAR/QSPR predictions, and drug design [92]. Here we employ ChemoPy [205] to create a set of constitutional fingerprints. To specify the failed molecules, we define a failure cutoff value (here, 0.5): if the difference between the actual and predicted log P values is larger than the cutoff, we consider the molecule a failed prediction. We determine the failed molecules from the ClassicalGSG models described in Section 3.3.2. These models are trained on features constructed with CGenFF force fields, 2D structure and the AC36 atom type classification method. The probability distributions of 30 constitutional fingerprints are calculated for all molecules in the OpenChem dataset and for the failed molecules in each of the 5 GSGNN models. Kullback-Leibler divergence (KL-divergence) [252] values are determined by comparing the probability distributions over all data with the distributions of the failed molecules from each model. The KL-divergence values are averaged over the five models and their standard error (STE) is shown in Table 3.15. We note that the PCX descriptors count the number of shortest paths of length X.
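The histogram-based KL-divergence comparison described above can be sketched as follows. This is a simplified version; the bin count and the pseudocount used to avoid log(0) are our own illustrative choices:

```python
import numpy as np

def kl_divergence(failed_values, all_values, bins=20):
    """Estimate KL(P_failed || P_all) from histograms of one fingerprint."""
    lo = min(failed_values.min(), all_values.min())
    hi = max(failed_values.max(), all_values.max())
    p, edges = np.histogram(failed_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(all_values, bins=edges)
    # normalize with a small pseudocount so empty bins do not give log(0)
    p = (p + 1e-10) / (p + 1e-10).sum()
    q = (q + 1e-10) / (q + 1e-10).sum()
    return float(np.sum(p * np.log(p / q)))
```

A fingerprint whose distribution among failed molecules matches the full-dataset distribution gives a value near zero; larger values flag attributes whose distribution shifts in the failed set.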
The attributes with the highest KL-divergence values are: PC counts with lengths of 2, 1, 3 and 4; molecular weight (Weight); the number of carbon atoms (ncarb); and the number of heavy atoms (nhev). In other words, the distributions of failed molecules are enriched in particular values of these attributes.

Table 3.15: The averaged KL-divergence between fingerprint distributions of all data versus failed molecules, averaged over 5 GSGNN models.

Fingerprint  Average_KL  STE    | Fingerprint  Average_KL  STE
PC2          0.048       0.005  | nhet         0.028       0.003
PC1          0.047       0.006  | noxy         0.026       0.003
PC3          0.046       0.007  | nrot         0.026       0.004
Weight       0.045       0.004  | nsulph       0.026       0.003
PC4          0.044       0.004  | ndb          0.017       0.002
ncarb        0.043       0.005  | ndonr        0.012       0.004
nhev         0.043       0.007  | ncof         0.011       0.001
nta          0.043       0.007  | AWeight      0.011       0.002
PC5          0.04        0.005  | nnitro       0.01        0.001
naro         0.04        0.003  | ncocl        0.009       0.002
PC6          0.039       0.005  | nhal         0.008       0.001
nsb          0.038       0.004  | nphos        0.007       0.003
naccr        0.036       0.002  | ncobr        0.006       0
nring        0.035       0.004  | ncoi         0.005       0.002
nhyd         0.031       0.003  | ntb          0.001       0

The probability distributions of the fingerprints with the highest KL-divergence values are shown in Figure 3.7. The distributions for the failed molecules largely follow the complete dataset distribution, but are enriched towards higher values, suggesting that the log 𝑃 predictions are more likely to fail for larger molecules. Interestingly, the distributions for attributes that count rare element types (e.g. the number of phosphorus atoms (nphos), the number of iodine atoms (ncoi) and the number of bromine atoms (ncobr)) do not show large KL-divergence values, although this might have been expected given that these elements are poorly represented in the training set. The atom element types and their counts in our dataset are shown in Figure S1.

Figure 3.7: Probability distributions of molecular fingerprints. The histograms show the distributions of fingerprints of all data and of the failed molecules of the 5 GSGNN models.
The distribution of all data is shown as a thick black line. A) The number of shortest paths of length 2, B) the atomic weight, C) the number of carbon atoms (ncarb) and D) the number of heavy atoms. 3.3.5 Uncertainty estimations for the SAMPL7 predictions Upon submission of our results to the SAMPL7 organizers, our predictions for the uncertainty in log 𝑃 were simply the standard errors of the mean of the predictions using the four training sets. Here we first calculate more accurate estimates of the prediction uncertainty using prediction intervals (PIs) obtained separately for each of the four predictors. PIs define the range of values in which predictions for new data are expected to lie with a defined probability. For example, a 90% PI in the range [𝑎, 𝑏] indicates that a future prediction will fall into the range [𝑎, 𝑏] 90% of the time. There are a variety of methods for calculating PIs for NNs, such as bootstrapping [253, 254], Mean-Variance Estimation (MVE) [255], Delta, and Bayesian methods. In this work, we utilize a parametric approach similar to the MVE method. However, unlike MVE, our method constructs PIs from the Mean Absolute Errors (MAE) between the predicted and observed values of similar inputs rather than all of the input data. The PIs are determined by finding a MAE value (𝜖90) below which 90% of the MAE values fall. Thus for a future prediction 𝑦̂ the PI is defined as [𝑦̂ − 𝜖90, 𝑦̂ + 𝜖90]. To construct the PIs, we make four sets of predictions for the S7_TEST dataset using ClassicalGSG models trained on the DB1, DB2, DB3, or DB4 training sets. As mentioned in Section 3.2.1, S7_TEST is a subset of the external test sets containing molecules similar to the SAMPL7 molecules. For training these models, we used the parameters of the best ClassicalGSG models obtained in Section 3.3.1.
More specifically, atomic attributes from CGenFF parameters, 2D molecular structure information and AC36 atomic types were fed to GSG, with a maximum wavelet scale (𝐽) of 4 and all scattering operators (zeroth, first, and second order), to generate the molecular features. After making predictions using these models, the MAE values are calculated for S7_TEST for each set of predictions. We then binned them in a histogram with 20 bins to determine the cumulative probability distributions shown in Figure 3.8.

Figure 3.8: Cumulative distribution of the MAE of molecules in the S7_TEST set. The solid blue line shows the cumulative distribution for each set of predictions. The dashed red line represents the probability of 90%. Panels A through D show MAEs using models trained on DB1 through DB4, respectively.

Table 3.16: The PIs and coverage range for the SAMPL7 molecules using the four ClassicalGSG methods. 𝑦̂ is the predicted log 𝑃.

Model name        PIs        Coverage of PIs
ClassicalGSG_DB1  𝑦̂ ± 0.94   72.72%
ClassicalGSG_DB2  𝑦̂ ± 0.88   90.90%
ClassicalGSG_DB3  𝑦̂ ± 0.88   68.18%
ClassicalGSG_DB4  𝑦̂ ± 0.98   86.36%

The four sets of predictions made for the SAMPL7 target molecules using each model, and their 90% PIs, are shown in Figure 3.9. We determined the coverage of the PIs for the experimental log 𝑃 values for each model and the results are shown in Table 3.16. The ClassicalGSG_DB2 method has the highest coverage at 90.90%, as expected, although the other predictors fall below this threshold.
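The PI construction and coverage check described above reduce to an empirical quantile and a simple interval test; a minimal sketch with illustrative numbers (not the actual S7_TEST or SAMPL7 values):

```python
import numpy as np

def pi_half_width(abs_errors, coverage=0.90):
    """The error value eps below which `coverage` of the MAE values
    on similar molecules fall (an empirical quantile)."""
    return float(np.quantile(abs_errors, coverage))

def pi_coverage(y_true, y_pred, eps):
    """Fraction of experimental values inside [y_pred - eps, y_pred + eps]."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= eps))

# illustrative MAE values on the similar-molecule set
errors = np.linspace(0.1, 1.0, 10)
eps90 = pi_half_width(errors)      # 90% PI half-width

# illustrative experimental vs. predicted logP values
y_true = np.array([1.2, 2.3, 0.8, 1.9])
y_pred = np.array([1.0, 2.0, 1.5, 2.1])
cov = pi_coverage(y_true, y_pred, eps90)
```

A coverage below the nominal 90% indicates that the new molecules were harder to predict than the set used to calibrate the interval.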
This could indicate that the SAMPL7 log 𝑃 values were more difficult to predict than those of the similar S7_TEST molecules.

Figure 3.9: Prediction intervals of the log 𝑃 predictions for the SAMPL7 molecules. The experimental log 𝑃 values are shown as red circles in a scatter plot. The predictions are shown as a red line, and the wide orange band shows the prediction intervals (PIs). Panels A through D show predictions from models trained on DB1 through DB4, respectively. In all cases, data is sorted according to the predicted log 𝑃 values.

Table 3.17: The log 𝑃 prediction results for the SAMPL7 molecules.

Model name        RMSE  𝑟²    MAE   Rank among all verified  Rank among ranked
ClassicalGSG_DB2  0.55  0.51  0.44  1                        –
ClassicalGSG_DB4  0.65  0.50  0.56  3                        –
ClassicalGSG_DB1  0.76  0.28  0.62  7                        –
ClassicalGSG_DB3  0.77  0.51  0.62  9                        5

3.3.6 Predictions for the SAMPL7 log 𝑃 challenge Four sets of blind predictions were generated using different training sets, as described above. By the rules of the SAMPL7 challenge, only one of these predictions could be used as a “ranked” submission. To determine which set to use we examined the performance of each predictor on the FDA [184] and Huuskonen [190] test sets and chose the model with the lowest RMSE: ClassicalGSG-DB3. We note that there was an error in our analysis code at this time; we later found that ClassicalGSG-DB1 was the model with the lowest RMSE. The RMSE and 𝑟² values of the prediction sets using these four models were determined after the SAMPL7 challenge (Table 3.17). Note that the best model in terms of RMSE is the one trained on the DB2 training set, and the worst performing model was trained on DB3. The model we intended to select, DB1, is the second-worst performing model.
In retrospect, this shows that the FDA and Huuskonen test sets were not good proxies for the SAMPL7 molecules. We show comparisons of our predictions with the experimental results, along with linear fit lines for each model, in Figure 3.10. Note that the best fitting coefficients also correspond to the model trained on the DB2 training set. To identify the prediction outliers, we show the log 𝑃 predictions from our methods for the SAMPL7 target molecules in Figure 3.11. We find the largest systematic errors for the compounds SM36, SM40, SM41, SM42, SM43 and SM45. As shown in Figure 6A of Ref. [212], these molecules were found to have some of the highest prediction errors across all submissions. For SM36, SM41, SM42 and SM43, ClassicalGSG consistently over-predicted the experimentally determined log 𝑃, which was also true for the other well-performing methods (Figure 6D of Ref. [212]). Molecules SM40 and SM45 were under-predicted compared to experiment, which was also in line with other well-performing methods, although the trend is less clear.

Figure 3.10: The best fit lines for the prediction sets. The experimental versus predicted values are shown as red circles in a scatter plot. The best fit line is shown in orange, and the dashed blue line shows 𝑦 = 𝑥. The fitted lines are 𝑦 = 0.46𝑥 + 0.96 (DB1), 𝑦 = 0.71𝑥 + 0.52 (DB2), 𝑦 = 0.47𝑥 + 0.82 (DB3) and 𝑦 = 0.62𝑥 + 0.80 (DB4). A) predictions using DB1, B) predictions using DB2, C) predictions using DB3, and D) predictions using the DB4 training set.
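Linear fits like those in Figure 3.10 can be reproduced with an ordinary least-squares fit; a sketch with placeholder data (the real values are the SAMPL7 predictions and experiments):

```python
import numpy as np

# placeholder experimental and predicted logP values
experimental = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
predicted = np.array([0.8, 1.1, 1.7, 1.9, 2.6, 2.9])

# slope and intercept of the best-fit line y = a*x + b
a, b = np.polyfit(experimental, predicted, deg=1)

# r^2 of the fit
residuals = predicted - (a * experimental + b)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((predicted - predicted.mean())**2)
r2 = 1.0 - ss_res / ss_tot
```

A slope near 1 and intercept near 0 indicate predictions that track experiment without systematic bias.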
3.3.7 Performance of open-source force fields Although there is an online server for generating CGenFF parameter files for a given molecule, it is still challenging to use CGenFF for high-throughput applications as it is not open-source. Hence, we decided to assess the performance of the open-source force field tools implemented by OpenBabel [256, 257], which is open-source and free to use on large databases of molecules. We utilized GAFF [232], UFF [233], MMFF94 [234, 235] and Ghemical [236] force field parameters to generate atomic attributes including atom types, partial charges and the two Lennard-Jones interaction parameters (𝜖 and 𝑟). As above, we applied the GSG method with a maximum wavelet scale of 4, using all scattering operators, to generate molecular features from the atomic attributes. We used the DB2 training set to train 5 log 𝑃 predictor models for each force field. Each of these models is trained using a 5-fold cross-validation approach. These models are tested on the SAMPL7 molecules and the RMSE and 𝑟² values are calculated for each set of predictions. We took the average values over the 5 runs for each force field and the results are shown in Figure 3.12. This figure shows that the ClassicalGSG models built from MMFF94 force field parameters achieve the highest 𝑟² and lowest RMSE values, which are on par with the CGenFF results submitted to the challenge.

Figure 3.11: The log 𝑃 predictions from our submissions to the SAMPL7 challenge. The orange line shows the experimental values. The ClassicalGSG predictions are shown as circles (DB1: blue, DB2: orange, DB3: green, DB4: red). The thick orange area shows the MAE interval of 0.44, which is the lowest MAE of our submitted predictions (ClassicalGSG-DB2). Molecules are labeled with their molecule ID from SAMPL7 [212].
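The 5-fold cross-validation and averaging used for each force field above can be sketched generically. This uses a stand-in linear least-squares model on synthetic data; the actual models are the ClassicalGSG neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                             # stand-in molecular features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)   # stand-in logP values

def kfold_metrics(X, y, k=5):
    """Train a linear least-squares model on each fold and return the
    RMSE and r^2 averaged over the k validation folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    rmses, r2s = [], []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        err = X[val] @ w - y[val]
        rmses.append(np.sqrt(np.mean(err**2)))
        r2s.append(1.0 - np.sum(err**2) / np.sum((y[val] - y[val].mean())**2))
    return float(np.mean(rmses)), float(np.mean(r2s))

avg_rmse, avg_r2 = kfold_metrics(X, y)
```

Averaging over folds reduces the sensitivity of the reported RMSE and 𝑟² to any one train/validation split.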
Additionally, we studied the performance of the MMFF94 ClassicalGSG models on independent external test sets such as FDA [184], Huuskonen [190, 239], Star [194], NonStar [194] and the compounds from the SAMPL6 log 𝑃 prediction challenge [211]. For the purpose of a fair comparison, we used the same 10722 molecules from the OpenChem dataset as utilized in Section 3.3.1. All combinations of a set of maximum wavelet scales (𝐽) and sets of scattering operators are used as GSG parameters to train 20 ClassicalGSG models, as indicated in Table 3.18. The atomic attributes were generated from MMFF94 atomic parameters using all atomic types from MMFF94 (ACall). The parameters corresponding to the best model for each test set, along with their performance results, are shown in Table 3.19. The ClassicalGSG models based on MMFF94 achieve better performance than the CGenFF-based models for all test sets (Table 3.11).

Figure 3.12: Results of ClassicalGSG models trained using open-source force field parameters. Error bars are computed over five independently-trained models. These models are trained using the 2D structure information and all the scattering moments with a maximum wavelet scale (𝐽) of 4. For each set of ClassicalGSG models trained using these force field parameters we show A) the average RMSE, and B) the average 𝑟².

Table 3.18: Sets of parameters used to evaluate MMFF94 ClassicalGSG models on external test sets. Sets in square brackets denote all the GSG parameter values used for generating the molecular features. For scattering operators, “z” denotes the zeroth order operator (Equation 3.1), “f” is first order (Equation 3.2), and “s” is second order (Equation 3.3).

Parameter               Values
Max. wavelet scale (𝐽)  [4, 5, 6, 7, 8]
Scattering operators    [(z, f), (z, s), (f, s), (z, f, s)]
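The 20 model configurations in Table 3.18 are simply the Cartesian product of the five wavelet scales and the four scattering-operator sets:

```python
from itertools import product

wavelet_scales = [4, 5, 6, 7, 8]
operator_sets = [("z", "f"), ("z", "s"), ("f", "s"), ("z", "f", "s")]

# each (J, operators) pair defines one ClassicalGSG model configuration
configs = list(product(wavelet_scales, operator_sets))
```

Each configuration is trained once, yielding the 5 × 4 = 20 models evaluated on the external test sets.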
Comparisons between the log 𝑃 prediction results for the FDA, Star, and NonStar test sets and those from other log 𝑃 predictor methods are shown in Tables 3.20, 3.21 and 3.22.

Table 3.19: The log 𝑃 prediction results from MMFF94 force field parameters for the external test sets.

Max. wavelet scale  Scattering operators  Test set name  RMSE  𝑟²    MAE
7                   (f, s)                FDA            0.53  0.93  0.27
7                   (z, f, s)             Star           0.44  0.93  0.29
5                   (z, f)                NonStar        0.74  0.89  0.59
7                   (z, f)                Huuskonen      0.35  0.94  0.18
5                   (z, f, s)             SAMPL6         0.29  0.87  0.23

Table 3.20: The log 𝑃 prediction results from different log 𝑃 methods for the FDA test set. TopP-S is the best performing method (GBDT-ESTD+ -2-AD) on the FDA test set from [92]. Table from [92]. FDA test set, N=406 (2 molecules are removed due to pre-processing failure).

Method               𝑟²     RMSE   MAE
TopP-S               0.935  0.51   0.24
MMFF94-ClassicalGSG  0.929  0.535  0.27
CGenFF-ClassicalGSG  0.910  0.55   0.35
ALOGPS               0.908  0.6    0.42
XLOGP3               0.872  0.72   0.51
XLOGP3-AA            0.847  0.8    0.57
CLOGP                0.838  0.88   0.51
TOPKAT               0.815  0.88   0.56
ALOGP98              0.800  0.90   0.64
KowWIN               0.771  1.10   0.63
HINT                 0.491  1.93   1.3

Table 3.21: The log 𝑃 prediction results from different log 𝑃 methods for the Star test set. TopP-S is the best performing method (MT-ESTD+ -1-AD) on the Star test set from [92]. Table from [92].
Star test set, N=223. Columns give the RMSE and the percentage of molecules with absolute error ≤0.5, between 0.5 and 1, and >1 log units.

Method               RMSE  ≤0.5  0.5–1  >1
AB/LogP              0.41  84    12     4
MMFF94-ClassicalGSG  0.44  81    14     5
S+logP               0.45  76    22     3
TopP-S               0.49  77    16     7
CGenFF-ClassicalGSG  0.49  75    18     7
ACD/logP             0.5   75    17     7
CLOGP                0.52  74    20     6
VLOGP OPS            0.52  64    21     7
ALOGPS               0.53  71    23     6
XLOGP3               0.62  60    30     10
KowWIN               0.64  68    21     11
CSLogP               0.65  66    22     12
ALOGP                0.69  60    25     16
MolLogP              0.69  61    25     14
ALOGP98              0.7   61    26     13
OsirisP              0.71  59    26     16
VLOGP                0.72  65    22     14
TLOGP                0.74  67    16     13
ABSOLV               0.75  53    30     17
QikProp              0.77  53    30     17
QuantlogP            0.80  47    30     22
SLIPPER-2002         0.80  62    22     15
COSMOFrag            0.84  48    26     19
XLOGP2               0.87  57    22     20
QLOGP                0.96  48    26     25
VEGA                 1.04  47    27     26
CLIP                 1.05  41    25     30
LSER                 1.07  44    26     30
MLOGP(Sim+)          1.26  38    30     33
NC+NHET              1.35  29    26     45
SPARC                1.36  45    22     32
HINTLOGP             1.80  34    22     44

Table 3.22: The log 𝑃 prediction results from different log 𝑃 methods for the NonStar test set. TopP-S is the best performing method (MT-ESTD+ -2-AD) on the NonStar test set from [92]. Table from [92]. NonStar test set, N=43. Columns are as in Table 3.21.

Method               RMSE  ≤0.5  0.5–1  >1
MMFF94-ClassicalGSG  0.74  49    35     16
ALOGPS               0.82  42    30     28
MiLogP               0.86  49    30     21
S+logP               0.87  40    35     26
XLOGP3               0.89  47    23     30
CLOGP                0.91  47    28     26
ALOGP                0.92  28    40     33
CSLogP               0.93  58    19     23
MolLogP              0.93  40    25     26
TopP-S               0.94  51    19     30
OsirisP              0.94  42    26     33
ALOGP98              1.00  30    37     33
AB/LogP              1.00  42    23     35
ACD/logP             1.00  44    32     23
CGenFF-ClassicalGSG  1.02  36    31     33
ABSOLV               1.02  49    28     23
KowWIN               1.05  40    30     30
VLOGP OPS            1.07  33    28     26
TLOGP                1.12  30    37     30
VLOGP                1.13  40    28     33
XLOGP2               1.16  35    23     42
SLIPPER-2002         1.16  35    23     42
QuantlogP            1.17  35    26     40
COSMOFrag            1.23  26    40     23
VEGA                 1.24  28    30     42
QikProp              1.24  40    26     35
LSER                 1.26  35    16     49
QLOGP                1.42  21    26     53
CLIP                 1.54  33    9      49
MLOGP(Sim+)          1.56  26    28     47
SPARC                1.70  28    21     49
NC+NHET              1.71  19    16     65
HINTLOGP             2.72  30    5      65

As these tables show, MMFF94 ClassicalGSG achieves the best results to date for the NonStar test set and the second-best results for the FDA and Star test sets.
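The error-range percentages reported in Tables 3.21 and 3.22 can be computed from the absolute errors, assuming the bins are |error| ≤ 0.5, 0.5–1, and > 1 log units:

```python
import numpy as np

def error_range_percentages(y_true, y_pred):
    """Percentage of molecules with |error| <= 0.5, in (0.5, 1], and > 1."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    n = len(err)
    small = 100.0 * np.sum(err <= 0.5) / n
    mid = 100.0 * np.sum((err > 0.5) & (err <= 1.0)) / n
    large = 100.0 * np.sum(err > 1.0) / n
    return small, mid, large
```

The three percentages sum to 100 up to rounding, which matches the small deviations visible in the published tables.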
Moreover, our method shows a significant improvement in the prediction of log 𝑃 values for the SAMPL6 molecules, with an RMSE in the range [0.29, 0.52] and a median of 0.42 over the 20 models. This compares favorably to the best performing model in the SAMPL6 blind challenge (Cosmotherm [219]), which had an RMSE of 0.35. 3.4 Discussion and Conclusions In this chapter, we introduced a method called “ClassicalGSG” for predicting the partition coefficient. Our method uses atomic attributes that are usually utilized as parameters for classical MD simulations. These parameters include partial charges, the Lennard-Jones well depth, the Lennard-Jones radius and the atomic type. The pipeline for generating these parameters includes: creating a 3D structure from SMILES, creating PDB- and Mol2-formatted files, and generating atomic parameters using either the Antechamber (GAFF2) or CGenFF tools. We note that, in our implementation, the Antechamber tool is about 200 times slower than CGenFF, requiring a couple of days to process 10,000 molecules. We employ the geometric scattering for graphs (GSG) method to transform the atomic attributes into molecular features that satisfy re-indexation symmetry. Our results indicate that GSG is powerful enough to capture universal molecular features, as the average log 𝑃 difference for all pairs of adjacent molecules in a t-SNE plot (⟨Δ log 𝑃⟩𝑁) is 0.81, which is low compared to the random-pair log 𝑃 difference (2.00). As these features are general to the molecule and not specific to log 𝑃, this suggests that they can be used in multi-task NNs to predict other molecular properties, such as solubility (log 𝑆), melting point, pKa, and intestinal permeability (e.g. Caco2) [258]. Our results show that employing the 2D molecular structure in ClassicalGSG yields more accurate log 𝑃 predictions than 3D structures, confirming conclusions reached previously [259, 260].
This could be due to difficulties in generating appropriate 3D structures, or because a single 3D structure is insufficient to capture the high-probability conformations of a given molecule. Additionally, our 3D adjacency matrix did not explicitly distinguish between bonded and non-bonded interactions, where the former are much more important for determining molecular properties. The results reported here for four independent external test sets show that our ClassicalGSG log 𝑃 method is generalizable to new molecules. However, we do not expect this model to perform well for molecules with new elements or functional groups that are not covered in the training set. Like other empirical methods, we expect the accuracy will improve as the availability of training data grows. Furthermore, we described the curation of four training sets that we utilized to train ClassicalGSG log 𝑃 predictor models for the SAMPL7 physical property blind challenge. The molecular features originally submitted for these models were created from CGenFF force field parameters. Our most accurate set of predictions – with an RMSE of 0.55 and MAE of 0.44 – was made by the ClassicalGSG-DB2 model, which had the lowest RMSE among the 36 submitted sets of predictions based on the non-ranked predictions analysis. Our ranked predictions were from ClassicalGSG-DB3 – with an RMSE of 0.77 and MAE of 0.62 – which were ranked in 5th place. To further compare ClassicalGSG with other predictors, we also made post-hoc predictions for the previous SAMPL6 challenge molecules. We trained 20 predictors using different parameters and obtained some estimates that had a significantly lower RMSE (0.29) than the best performing model at the time (Cosmotherm [219], 0.35). However, this parameter selection had the benefit of hindsight, so a more meaningful comparison is with our median RMSE of 0.42. We note that this RMSE would have placed fourth among the submissions to SAMPL6 [211].
Here we trained several ClassicalGSG models on molecular features generated from the atomic attributes of open-source force fields. We find that the MMFF94 ClassicalGSG models are slightly more accurate than the CGenFF ClassicalGSG models. Applying the MMFF94 log 𝑃 predictor model trained on the OpenChem dataset to external test sets obtains excellent results throughout, at times achieving the best results acquired to date. An added benefit is that the MMFF94 ClassicalGSG models provide an end-to-end framework for predicting log 𝑃 values using only SMILES strings as input, and do not require any auxiliary stream files, unlike the CGenFF models. It might be counter-intuitive that a force field developed in the 1990s [234, 235] would outperform a modern force field that is still being actively developed [228, 229]. We note here that the pertinent features of force fields for predicting log 𝑃 values are very different from those needed to conduct physically-meaningful molecular dynamics simulations. We suspect that the leading benefit of MMFF94 is its broad coverage of atom types describing the chemical features of small organic molecules that are relevant to log 𝑃. Differences in one-hot encodings of atom type would likely have a much stronger impact than improved predictions of partial charges, for example. Our code is publicly available on GitHub at https://github.com/ADicksonLab/ClassicalGSG, and our training and test sets are available at https://doi.org/10.5281/zenodo.4531015 and, in SDF format, at https://doi.org/10.5281/zenodo.4560967. The ClassicalGSG repository contains two pre-trained log 𝑃 predictors, one using MMFF94 and another using CGenFF atomic attributes. Once the predictor is trained, values can be predicted extremely quickly: predictions for a set of 1000 molecules can be made in about 150 seconds on an Intel i7 processor, without parallelization. The code provides modules for extracting GSG features and training NN models on new datasets.
As mentioned, the ClassicalGSG method is not specific to log 𝑃 and could predict other molecular properties as well. We emphasize that progress in the field of molecular property prediction can be greatly accelerated by the free sharing of molecular property datasets. Efforts such as OpenChem [55] that support the sharing of methods and datasets will be useful catalysts for methods development. Publicly available data sources for properties such as intestinal permeability, pKa values, intrinsic clearance rates (CLint) and serum protein binding fractions would similarly be great catalysts for the development of accurate predictors of pharmacokinetic effects. CHAPTER 4 MODELING COMPOUNDS IN THE BINDING SITE WITH THE FLEXIBLE TOPOLOGY METHOD 4.1 Introduction One of the common steps in early-stage drug discovery is High Throughput Screening (HTS), where millions of drug-like compounds are examined against a target protein and compounds with high affinity are selected [106]. HTS is carried out in a wet lab via experimental methods, where the binding of up to millions of compounds can be examined at significant cost [5, 107, 108]. To increase the efficiency of this process, in silico screening tools referred to as Virtual Screening (VS) emerged in the 1980s [5]; these can now evaluate large chemical databases with billions of compounds [107, 109, 110, 111]. VS can either rely on the 3D structure information of the target molecule – Structure-Based Virtual Screening (SBVS) [116, 117, 118] – or on known bioactive ligands of the target – Ligand-Based Virtual Screening (LBVS) [112, 113, 114, 115]. LBVS uses chemical similarity search methods such as QSAR, similarity search [261], pharmacophore matching [262] or 3D shape matching [263] to find novel ligands that show biological activity with the target of interest.
SBVS techniques require the 3D structure of the target protein, which can be determined by X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, or homology modeling. Drug-like molecules are then docked into the binding site of the target protein [11] and scored based on how well they bind to the target using empirical functions [264]. The highest-scoring compounds are selected as hit compounds and passed to the next step for further computational or experimental evaluation. In SBVS, compounds from large databases such as ZINC [119], ChEMBL [120], PubChem [121, 122] and NCI [123] are screened, which requires heavy computational resources and time. Conventional SBVS methods consider a single fixed snapshot of the target protein structure during docking to reduce the computational burden, which can lead to inaccurate predictions of both ligand pose and binding affinity [124, 125, 126, 127]. A variety of methods have been developed over the years to alleviate the deficiencies of single-structure docking [127, 129, 130, 131, 132, 133, 134, 265]. There are more than 60 docking programs [266], and some of them, like AutoDock [267], DOCK [268], AutoDock Vina [269], GalaxyDock [270], and FITTED [271], can incorporate protein flexibility into docking, although this is typically limited to sidechain motions in the binding pocket. In addition, there are approaches that incorporate protein flexibility using a larger set of protein conformations that are generated either beforehand (“ensemble docking”) [128, 135, 134] or on-the-fly (“induced fit docking”) [133, 272, 273]. Boltzmann docking [274] and Implicit Ligand Theory [275] are also used to dock ligands to an ensemble of protein conformations; however, these ensembles are generated from the apo structure and would not easily capture pocket conformations that are induced by the presence of a ligand.
While these algorithms can be accelerated by parallel computing approaches such as the message-passing interface (MPI) [276, 277, 278, 279], graphics processing units (GPUs) [280, 281, 282, 283, 284] or heterogeneous MPI-GPU methods [285, 275], their higher computational expense makes them unsuitable for virtual screening of large ligand databases. Developing new techniques for accelerating docking and for considering ligand-induced conformational changes in the receptor are two main active areas of research in SBVS. Here, we present an ML-based method that can be used for SBVS and that uses molecular dynamics (MD) to model ligand-induced conformational changes in the receptor, as well as the effects of water molecules on ligand-receptor interactions. We refer to this method as “Flexible Topology”, since it allows the topology (i.e. the chemical structure) of the ligand to change in response to forces from the surrounding protein and water atoms. This algorithm has the potential to achieve efficiency gains because it combines screening (chemical space search) with the sampling of ligand poses (conformational space search) through MD simulations. In Flexible Topology (FT), we run MD simulations with a set of randomly-initialized atoms called “Ghost Particles” (GP), where each ghost particle has a vector of attributes. These attributes are defined as dynamical variables that – just like the atomic positions – experience forces given by the negative partial derivatives of the energy function. The coordinates and attributes of the ghost particles thus change over time and gradually optimize toward different drug-like molecules during the MD simulation. The identity of each ghost particle is defined by a set of parameters from classical molecular force fields: the partial charge (𝑞) and two Lennard-Jones interaction parameters, the radius (𝜎) and the well-depth (𝜖).
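The idea that ghost-particle attributes, like positions, evolve under forces given by negative gradients of an energy can be illustrated numerically. This is a toy sketch with a stand-in quadratic energy, not the actual FT potential described later in this chapter:

```python
import numpy as np

rng = np.random.default_rng(1)
positions = rng.normal(size=(4, 3))    # ghost particle coordinates
attributes = rng.random(size=(4, 4))   # (q, sigma, epsilon, lambda) per particle

def toy_energy(x, a):
    """Stand-in energy coupling positions and attributes; the real FT
    potential is the ML-based loss described in Section 4.2.4."""
    return np.sum(x**2) + np.sum((a - 0.5)**2)

def grad(f, arr, h=1e-6):
    """Central-difference gradient of f() with respect to arr (in place)."""
    g = np.zeros_like(arr)
    for i in np.ndindex(arr.shape):
        arr[i] += h; fp = f()
        arr[i] -= 2 * h; fm = f()
        arr[i] += h                    # restore original value
        g[i] = (fp - fm) / (2 * h)
    return g

# forces on both positions AND attributes are negative energy gradients
force_pos = -grad(lambda: toy_energy(positions, attributes), positions)
force_attr = -grad(lambda: toy_energy(positions, attributes), attributes)
```

In an actual simulation these gradients would come from automatic differentiation of the MLForce loss, and the attribute forces would drive the particles toward valid chemical identities.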
In addition, each particle has a parameter called 𝜆, restricted to [0, 1], that determines its presence. The ghost particles assemble into target ligands by running an ML-driven dynamical trajectory in which the assembly of the ghost particles is informed by their environment: the binding site residues and explicit solvent. Possible applications of this method include the prediction of bound poses, the calculation of relative binding free energies, and fast virtual screening of large ligand sets. The FT method combines the searches of conformational and chemical space. To illustrate the benefits of this approach, it is helpful to consider the example of a protein searching its conformational landscape to find its folded state. Levinthal’s paradox observes that there is an enormous number of possible protein conformations (on the order of 10^300) [286]: if all the conformations of a protein were sampled one by one at a rate of 1 conformation per millisecond, it would take longer than the age of the universe to find the folded conformation. However, in nature, protein folding does not require sampling all of the high-energy conformations; the protein is automatically drawn towards low-energy conformations due to the funnel-shaped nature of the energy landscape [287]. The same reasoning applies to the FT approach, as its search space will also be funnel-shaped, with the lowest-energy target molecules at its bottom. In this way, FT aims to avoid sampling all high-energy ligands and conformations, and to be gradually drawn towards the best atomic identities and structures with each simulation step. At the end of a simulation we aim to achieve the optimal binding pose and the correct chemical types. In this chapter, we first describe the framework of the method and the details of its implementation. We then run assembly simulations with a set of target ligands to study its performance and optimize the energy landscape of the FT energy function.
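As a sanity check on the Levinthal estimate quoted above, the arithmetic is straightforward (the age of the universe in seconds is an approximate figure):

```python
# Levinthal's estimate: sampling 10^300 conformations at 1 per millisecond
conformations = 10**300
rate_per_second = 1000.0               # 1 conformation per millisecond
seconds_needed = conformations / rate_per_second

age_of_universe_s = 4.35e17            # ~13.8 billion years in seconds
ratio = seconds_needed / age_of_universe_s   # vastly greater than 1
```

The required time exceeds the age of the universe by hundreds of orders of magnitude, which is the point of the paradox.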
Further, we employ FT to run assembly simulations in the binding pocket of a bromodomain. Finally, we conclude with a discussion of the challenges and future work. 4.2 Methods 4.2.1 Framework of Flexible Topology The goal of the FT method is to model a set of atoms that can continuously transform between different molecules. Ultimately we envision this as an efficient way to identify candidate drug compounds for a drug target, while taking into account full flexibility in the receptor and surrounding solvent. To implement this method, we use MD simulations that are coupled with an ML-based external force (MLForce) that slowly nudges a set of ghost particles to become a drug-like molecule (Fig. 4.1). MLForce consists of three components: a molecular representation model, an assignment algorithm and a loss function. The molecular representation model defines a set of symmetry-invariant atomic environment vectors (AEVs) using functions introduced by Behler and Parrinello [94]. The AEVs, in combination with the ghost particle attributes, determine the set of features for each atom that describe both its environment and its identity. We discuss the features in Sec. 4.2.2 and the attributes in Sec. 4.2.3. The second component is an algorithm that solves an assignment problem to determine an optimal mapping between the ghost particles and the target ligand atoms. In our implementation, we employ the Hungarian algorithm [288] to find a minimum-cost assignment between the ghost particles and the target ligand atoms. The cost matrix is determined by calculating the Euclidean norm between the ghost particle features (𝐹𝐺𝑃) and the target molecule features (𝐹𝑇). While the brute-force solution to the assignment problem is 𝑂(𝑛!), the Hungarian algorithm has 𝑂(𝑛³) time complexity. We use a C++ implementation by Cong Ma (the original code was implemented in Matlab by Buehren): https://github.com/mcximing/hungarian-algorithm-cpp.git.
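The assignment step can be illustrated for a handful of particles: build the cost matrix from Euclidean distances between feature vectors, then find the minimum-cost mapping. For clarity this sketch uses brute force over permutations (the 𝑂(𝑛!) approach); the actual implementation uses the 𝑂(𝑛³) Hungarian algorithm:

```python
import numpy as np
from itertools import permutations

def min_cost_assignment(F_gp, F_target):
    """Map each ghost particle to a target atom, minimizing the total
    Euclidean distance between their feature vectors."""
    n = len(F_gp)
    # cost[i, j] = ||F_gp[i] - F_target[j]||
    cost = np.linalg.norm(F_gp[:, None, :] - F_target[None, :, :], axis=-1)
    best_perm, best_cost = None, np.inf
    for perm in permutations(range(n)):    # O(n!) -- fine only for tiny n
        c = cost[np.arange(n), perm].sum()
        if c < best_cost:
            best_perm, best_cost = perm, c
    return list(best_perm), float(best_cost)

# three ghost particles vs. three target atoms in a toy 2D feature space
F_gp = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
F_t = np.array([[0.0, 1.1], [0.1, 0.0], [1.0, 0.1]])
mapping, total = min_cost_assignment(F_gp, F_t)
```

Here each ghost particle is matched to the target atom whose features it most closely resembles, which is exactly the mapping 𝑀 used in the loss function.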
The output of this algorithm is the mapped set of features (M(F_GP)). Finally, the loss function L is used to define the restraint potential that guides the ghost particles to adopt their target configurations. Gradients of L yield the external forces on both the positions and attributes of the ghost particles. These forces are applied to the ghost particles to change their positions and attributes at each dynamics step, gradually transforming them into a drug-like molecule in an orientation that fits the protein of interest. This loss function is described in Sec. 4.2.4, followed by implementation details in Sec. 4.2.5.

Figure 4.1: The framework of the Flexible Topology method. The gray and blue spheres show the ghost particles. MD simulations start with an initial structure of the protein of interest and randomly initialized ghost particles. The MLForce plugin computes external forces on the ghost particle positions and attributes using a loss function, L. X_GP denotes the ghost particle positions, A_GP represents the atomic attributes of the ghost particles, and M is the mapping function. F_GP and F_T denote the ghost particle and target ligand features, respectively. The final output shows the ghost particles assembled into a drug-like molecule in the binding pocket of the protein.

4.2.2 Symmetry-invariant Molecular Features

The atomic environment vectors that guide the assembly of the ghost particles are mostly comprised of a set of customized Behler-Parrinello symmetry functions [94]. For each particle, these functions capture the radial and angular information of the other ghost particles in its surrounding environment. These features are invariant to translation and rotation operations and are differentiable with respect to atomic positions.
The local radial environment of an atom is defined by a sum of Gaussian terms multiplied by a cutoff function:

    G_i^R = \sum_{j \ne i} e^{-\eta_R (R_{ij} - R_s^R)^2} f_c(R_{ij})    (4.1)

where R_{ij} is the distance between atoms i and j. The cutoff function f_c goes from 1 to 0 as R_{ij} surpasses a cutoff distance R_c, which focuses the sum to capture only the local atomic environment. The parameter η_R controls the width and R_s^R determines the location of the Gaussian functions. Using a set of different values for R_s^R allows the capture of information in particular regions of the radial environment around i. The cutoff function is defined as follows, where R_c is the cutoff distance:

    f_c(R_{ij}) = \begin{cases} 0.5\left(\cos\left(\frac{\pi R_{ij}}{R_c}\right) + 1\right) & \text{for } R_{ij} \le R_c \\ 0 & \text{for } R_{ij} > R_c \end{cases}    (4.2)

The angular functions are defined for an atom i using the sum of cosine terms for all possible angles centered at atom i:

    G_i^A = 2^{1-\zeta} \sum_{j,k \ne i} \left(1 + \cos(\theta_{ijk} - \theta_s)\right)^{\zeta} e^{-\eta_A \left(\frac{R_{ij} + R_{ik}}{2} - R_s^A\right)^2} f_c(R_{ij}) f_c(R_{ik})    (4.3)

where η_A and R_s^A play the same roles in modifying the width and location of the radial Gaussian functions, and the parameters ζ and θ_s control the width and location of the different angular regions. Note that these angular functions are a modified version of the original Behler-Parrinello symmetry functions [94] and reflect the changes made by Smith et al. [95]. Using multiple values of η_R, R_s^R, η_A, R_s^A, θ_s and ζ enables us to construct a set of symmetry functions that can uniquely describe the local environment of an atom. Although we employ the TorchANI module [289] to compute the symmetry functions, there are significant differences in our implementation. Firstly, whereas in ANI atoms have a fixed elemental type (e.g. H, C, N or O), here we consider all the atoms to have the same type.
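As a concrete reference for the radial formulas above (Eqs. 4.1 and 4.2), a single radial symmetry function can be evaluated in a few lines; the parameter values come from Tab. 4.1, while the neighbor distances are hypothetical.

```python
import math

def cutoff(r, r_c):
    """Cosine cutoff f_c (Eq. 4.2): smoothly decays from 1 to 0 at r_c."""
    if r > r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def radial_feature(distances, eta, r_s, r_c):
    """One radial symmetry function G_i^R (Eq. 4.1) for a central atom,
    given the distances R_ij to all other atoms. Depending only on
    distances makes the feature translation- and rotation-invariant."""
    return sum(math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)
               for r in distances)

# eta_R = 1600 nm^-2 and R_c = 1.0 nm as in Tab. 4.1; two neighbors
# near the R_s = 0.169 nm Gaussian and one distant neighbor that the
# cutoff suppresses almost entirely.
g = radial_feature([0.15, 0.16, 0.9], eta=1600.0, r_s=0.169, r_c=1.0)
```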
However, we found that symmetry functions with one atom type struggle to couple chemical and geometrical information properly in some cases. Therefore, we added an additional set of radial functions to describe the local charge environment. We employ the radial function defined in Eq. 4.1 while using the following cutoff function, which incorporates the charge of the partner atom (q_j):

    f_c^q(R_{ij}) = \begin{cases} 0.5\left(\cos\left(\frac{\pi R_{ij}}{R_c}\right) + 1\right) q_j & \text{for } R_{ij} \le R_c \\ 0 & \text{for } R_{ij} > R_c \end{cases}    (4.4)

The resulting features are denoted G_i^{Rq}. The full set of features for atom i is then F_i = ({G_i^R}, {G_i^{Rq}}, {G_i^A}, {A_i}), where the {} brackets for the G functions denote a list over all permutations of the parameters η_R, R_s^R, η_A, R_s^A, θ_s and ζ. {A_i} is the set of the four attributes for atom i, namely q_i, ε_i, σ_i and λ_i. The parameter values used to construct the features are given in Tab. 4.1. In total, there are 24 radial features (for each of {G_i^R} and {G_i^{Rq}}), 128 angular features, and the four atomic attributes, for a grand total of 180 features per atom.

Table 4.1: Parameter values for Behler-Parrinello functions, grouped into radial and angular components. The number of values used for each parameter is shown in the n_vals column. The product of n_vals is the total number of features for each component (N_feat).

             Var.              Values                      n_vals   N_feat
  Radial     η_R (nm^-2)       1600                        1
             R_s^R (nm)        0.090, 0.169, ..., 0.708    24       24
             R_c^R (nm)        1.0                         1
  Angular    η_A (nm^-2)       800                         1
             R_s^A (nm)        0.090, 0.155, ..., 0.545    8
             θ_s^A (radians)   π/16, 2π/16, ..., π         16       128
             ζ                 32                          1
             R_c^A (nm)        0.5                         1

We denote the ordered list of the ghost particle feature vectors as F_GP. Note that while F_GP is invariant to translation and rotation, it is not invariant to the re-indexing of atoms. However, the mapped set of features, M(F_GP), is invariant to re-indexing. In this way, our loss function, energies and forces are invariant to all three symmetries.
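The charge-weighted cutoff in Eq. 4.4, together with the feature-count bookkeeping implied by Tab. 4.1, can be sketched as follows; the grid sizes come from the table, while the neighbor distance and charge in the example are hypothetical.

```python
import math

def cutoff_q(r, q_j, r_c):
    """Charge-weighted cutoff f_c^q (Eq. 4.4): the ordinary cosine
    cutoff scaled by the partial charge q_j of the partner atom, so
    the resulting radial features encode the local charge environment."""
    if r > r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) * q_j

# Feature-count bookkeeping from Tab. 4.1:
n_radial = 1 * 24            # eta_R values x R_s^R values
n_charge_radial = 1 * 24     # same radial grid, charge-weighted cutoff
n_angular = 1 * 8 * 16 * 1   # eta_A x R_s^A x theta_s x zeta values
n_attributes = 4             # q, epsilon, sigma, lambda
n_total = n_radial + n_charge_radial + n_angular + n_attributes  # 180
```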
4.2.3 Continuous Atomic Identities

In the framework of a classical MD simulation, the "topology" describes the identities of each atom and their covalent connectivity in the system. The atomic identities are extensive; for instance, the CHARMM36 protein force field has 20 different types of carbon, where each atom type has unique parameters for the partial charge, bond lengths, and rotation around chemical bonds [290]. A harmonic energy function defines the bonds between atoms, which can vibrate around a minimum-energy position but can never break. The system's topology for a classical MD simulation is thus determined at the beginning of the simulation and remains fixed for the duration of a trajectory. The FT method allows the topology to change over the course of the simulation by treating the attributes that define atomic identities as dynamic variables. To enable a continuous and dynamic description of chemical identities, a key step is to separate external interactions from internal ones and treat them separately. While internal interactions are complex and involve a multitude of atom-type-specific parameters, external interactions are comprised of only Lennard-Jones and electrostatic interactions that can be wholly described by only three parameters: partial charges (q), Lennard-Jones well depths (ε) and Lennard-Jones radii (σ). To describe how we turn these parameters into dynamic variables, we first consider the role of λ in alchemical simulations. In alchemical simulations, λ is an external parameter that is used to "turn on" or "turn off" additional terms in a potential energy function [291, 292, 293, 294]:

    U(X, x, λ) = U_0(X) + λ U_pert(x)    (4.5)

where X represents the positions of the fixed system and x represents the positions of the alchemical group, U_0 is the original potential energy function, and U_pert is a perturbative term that is turned on when λ = 1 and off when λ = 0.
This approach has also been used in the λ-dynamics method, where each λ is considered a dynamical variable [295, 296]. That is, λ feels a force according to F_λ = −∂U/∂λ, and has a velocity v_λ and a mass m_λ. In FT, each atom is assigned a dynamic λ variable, as well as the three parameters q_i, ε_i, and σ_i. These are all treated as dynamical variables that change as a function of time, and together we refer to them as "attributes", denoted A_i = {q_i, ε_i, σ_i, λ_i} for atom i and A_GP for the complete set. These variables, like λ, have masses (m_q, m_ε and m_σ) and their motion follows the gradients of an extended Hamiltonian energy function. These dynamic variables are used to model non-bonded interactions between the ghost particles and the rest of the system, using conventional Lennard-Jones and Coulomb equations.

4.2.4 Loss function and Flexible Topology Restraint Energy

While the MLForce model does not have any trainable parameters, we define a loss function that measures the difference between the mapped ghost particle features (M(F_GP)) and the target features (F_T):

    L = |M(F_GP) − F_T|    (4.6)

where |·| is the L2 norm and F_T is a set of features computed for a given conformation of the target ligand. Note that because F_T is computed using a single ligand structure, the loss function can restrict the internal degrees of freedom of the ligand after assembly. To that end, we could also employ a loss function with multiple minima that represent either different ligand conformations or different ligands entirely:

    L = \min_{i \in \ell} \left( |M(F_GP) − F_{T,i}| + \varsigma_i \right)    (4.7)

where only one set of target features F_{T,i} in a (possibly large) set ℓ is used to determine L, and an offset for each target (ς_i) can be used to balance their relative probabilities.
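The multi-minimum loss of Eq. 4.7 can be sketched directly: the closest target (after its offset) determines L, so the ghost particles are free to assemble into whichever ligand or conformation is nearest. The feature vectors and offsets below are hypothetical.

```python
import math

def l2(a, b):
    """L2 norm of the difference between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def multi_target_loss(mapped_features, targets, offsets):
    """Loss with multiple minima (Eq. 4.7): L is set by the single
    target whose offset-shifted distance is smallest."""
    return min(l2(mapped_features, f_t) + s
               for f_t, s in zip(targets, offsets))

# Two hypothetical targets; the offset penalizes the second target,
# biasing assembly toward the first even though it is geometrically
# farther in feature space.
m_f = [1.0, 2.0, 0.0]
targets = [[1.0, 2.0, 1.0], [1.0, 2.0, 0.1]]
loss = multi_target_loss(m_f, targets, offsets=[0.0, 1.5])
# Distances are 1.0 + 0.0 and 0.1 + 1.5, so the loss is 1.0.
```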
This loss is transformed into a perturbative energy according to:

    U_FT(X_GP, A_GP) = \frac{C}{\beta} L(X_GP, A_GP)    (4.8)

where we explicitly show the loss and potential energy as functions of the ghost particle coordinates (X_GP) and attributes (A_GP). β is the inverse temperature, and C is a parameter called the "MLForce scale" that is chosen to scale U_FT against the other forces in the system. We then add this term, multiplied by a tunable coupling parameter (φ), to the system energy function U:

    U(X, X_GP, A_GP; φ) = U_0(X) + U_nb(X, X_GP, A_GP) + φ U_FT(X_GP, A_GP)    (4.9)

where U_nb(X, X_GP, A_GP) is the nonbonded interaction energy between the ghost particles and the remainder of the system (X). The free energy between the φ = 0 ("random") state and the φ = 1 ("assembled") state could be rigorously calculated using the work applied during a nonequilibrium trajectory with progressively increasing values of the coupling parameter φ, using the Jarzynski equality [297]. This would allow the system to be initialized in an arbitrary configuration and to "assemble" into the most probable orientation as the loss function is turned on. Alternatively, we can run "relaxation" trajectories that begin with φ = 1 and the ghost particles in a disordered state. As shown below, although relaxation trajectories do not yet allow for rigorous free energy calculation, they are sufficient to determine suitable algorithmic parameters and to assemble ligands inside a protein binding pocket.

4.2.5 Implementation details

Assembly simulations were conducted in OpenMM 7.7 [298] using a custom OpenMM integrator. In this integrator, Langevin dynamics with a friction coefficient of 1 ps^-1 and an integration step size of 1 fs is used to advance the positions and velocities, while Brownian dynamics is used to evolve the atomic attributes A_GP.
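A minimal sketch of an overdamped (Brownian) update for a single attribute is shown below. The timestep, effective friction/mass, and restoring force are illustrative values only, not the parameters of the actual custom integrator; the clamping mimics the bounded range of attributes such as λ ∈ [0, 1].

```python
import math
import random

def brownian_step(value, force, gamma_mass, dt, kT, rng, lo=None, hi=None):
    """One overdamped-dynamics step for an attribute (e.g. q or lambda):
    drift down the potential gradient plus Gaussian thermal noise.
    The clamp to [lo, hi] is a simple stand-in for bound handling;
    the real integrator's details may differ."""
    noise = math.sqrt(2.0 * kT * dt / gamma_mass) * rng.gauss(0.0, 1.0)
    new = value + (force / gamma_mass) * dt + noise
    if lo is not None:
        new = max(lo, new)
    if hi is not None:
        new = min(hi, new)
    return new

rng = random.Random(0)
lam = 0.7
for _ in range(100):
    # Hypothetical restoring force pulling lambda toward full presence.
    f = 10.0 * (1.0 - lam)
    lam = brownian_step(lam, f, gamma_mass=1.0, dt=0.01, kT=0.01,
                        rng=rng, lo=0.0, hi=1.0)
```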
We also keep the sum of the ghost particle partial charges equal to a constant value, which is set to zero for all molecules examined here but could be set to an integer value when examining charged ligands. The implementation of MLForce follows the architecture of the OpenMM library [298]. It includes multiple layers, as follows: 1) an OpenMM public Application Programming Interface (API) to generate the MLForce object with an embedded PyTorch model, 2) a platform-independent layer to load the model, and 3) a computational kernel layer for executing the model and computing the gradients. All of the layers are written in C++, and the computational kernels are implemented for the CUDA, OpenCL, and Reference (CPU) platforms in OpenMM. Furthermore, to compute derivatives, MLForce can be implemented using an automatic differentiation tool such as Autograd (e.g., https://github.com/HIPS/autograd). Here we utilize TorchANI [289], a PyTorch [299] implementation of the Behler-Parrinello symmetry functions, to compute features, and we use Torch functions to calculate the loss function derivatives with respect to the positions and attributes of the ghost particles. The feature calculator, Hungarian mapping algorithm, and loss function are created in a PyTorch model, which is then converted to a TorchScript module and saved in the ".pt" format. The assembly MD simulation loads this TorchScript model using the MLForce plugin and adds it as an external force in an OpenMM simulation, where it is automatically combined with the other system forces at each dynamics step. Simulations that embed the ghost particles in a larger system required additional setup steps. OpenMM System and Topology objects were created in a standard fashion for all of the non-ghost-particle atoms (referred to as the "system atoms"). The ghost particles were then added to the Topology object with masses of 20 amu.
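One simple way to realize the charge-sum constraint mentioned above is a uniform shift projection, sketched below. This is an illustrative guess at the mechanism: the actual implementation inside the custom integrator may instead project the forces or use a different constraint scheme.

```python
def project_charges(charges, total=0.0):
    """Shift partial charges uniformly so that they sum to a fixed
    total (zero for the neutral ligands studied here, or an integer
    for charged ligands). Hypothetical sketch of the constraint."""
    n = len(charges)
    shift = (total - sum(charges)) / n
    return [q + shift for q in charges]

# Four hypothetical ghost particle charges summing to +0.4 are shifted
# back onto the neutral (sum = 0) constraint surface.
q = project_charges([0.3, -0.1, 0.2, 0.0])
```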
The ghost particles were also added to the existing nonbonded forces in the System, although this was done in such a way that their interaction energy would always be zero. Nonbonded interactions between the ghost particles and the system atoms were implemented using a CustomNonbondedForce for each ghost atom, which took into account the dynamic variables q, σ, ε and λ. These attributes were defined for each ghost particle as "global variables", which are accessible to both the external forces and the integrator. The addEnergyParameterDerivative function of OpenMM was then used to easily compute derivatives of the nonbonded energy with respect to these parameters. For each CustomNonbondedForce we implemented an interaction group that restricts its application to only the interactions of the ghost particle in question with the system atoms.

4.2.5.1 Bromodomain simulation setup details

The PDB entry 4a9e was used as a starting point for system setup in CHARMM-GUI [300]. The ligand, protein and crystallographic waters from chain A were used, with ligand parameters obtained from CGenFF [301, 229]. The system was solvated in a cubic box using a 10 Å cutoff from the nearest protein atom. Four neutralizing Cl ions were added to the solution. To allow for easier introduction of ghost particles, we manually removed 8 water molecules from the vicinity of the binding site, taking care not to select waters that are involved in water-mediated hydrogen bonds [302]. The final system contained 24788 atoms. In each assembly simulation, we begin by introducing 22 particles in a 6 × 6 × 6 Å^3 cube centered at the protein binding site, which was defined as the centroid of residues Pro98, Val103, Tyr155 and Ile162. Ghost particles are inserted one at a time, and trial positions are only accepted if the atom is farther than 0.5 from all system atoms and all ghost particles inserted so far. Ghost particle attributes σ, ε and q were selected randomly from the bounds in Tab. 4.2.
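The random initialization of ghost particle attributes within the Tab. 4.2 bounds can be sketched as follows; the function and variable names are hypothetical, and only the bounds themselves come from the table.

```python
import random

# Attribute bounds (min, max) from Tab. 4.2, with units as given there.
BOUNDS = {"q": (-0.7, 0.7), "sigma": (0.022, 0.23), "epsilon": (0.037, 2.63)}

def random_attributes(rng):
    """Draw initial sigma, epsilon and q uniformly from the Tab. 4.2
    bounds, one set per inserted ghost particle."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}

rng = random.Random(42)
attrs = [random_attributes(rng) for _ in range(22)]  # 22 ghost particles
```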
The λ attributes were all initialized to 0.7. To prevent the ghost particles from drifting away into solvent, a CustomCentroidBondForce was used to restrain each ghost particle to be close to the centroid of a selection of atoms in the protein binding site. The binding site residues Pro98, Val103, Tyr155 and Ile162 are used to define the binding site centroid. The restraint energy is a one-sided, flat-bottom potential of the form:

    U_rest(r_i) = \frac{k}{2} \Theta(r_i - r_0)(r_i - r_0)^2

where r_0 is a cutoff equal to 8 Å, Θ(x) is a step function equal to 1 when x > 0 and 0 otherwise, and k = 1000 kcal/mol·nm^2. The ghost–ghost forces are added to the system with the MLForce package using the ligand features as a target, as described above. Before simulation, the positions of all atoms were minimized with all forces turned on for 1000 steps. The system was then run at 300 K for 4 ps with a 2 fs timestep, which was sufficient to reach the target temperature. Assembly simulations consisted of 1000 ps of simulation, with structures and features reported every 50 timesteps.

4.3 Results

4.3.1 Assembly Simulations in Vacuum

To evaluate the performance of the FT method and, in particular, examine the smoothness of U_FT with respect to X_GP and A_GP, we run assembly simulations in vacuum with randomly initialized ghost particles and a single set of target features. These are relaxation simulations, where the disordered state is unstable and φ = 1 throughout. Ideally, U_FT would be a smooth function that steadily guides the ghost particles to assemble into positions that are compatible with the target features.

Figure 4.2: The assembly simulation target molecules, randomly chosen from the OpenChem dataset [55], are shown as 2D structures. The number of atoms for each molecule (6, 8, 10, 15, 20, 25, 30, 35, 40 and 50) is written under it. The 2D structures are generated and drawn by MarvinSketch using implicit hydrogens.
We randomly selected 10 molecules from the OpenChem dataset [55] as targets, with sizes ranging from 6 atoms to 50 atoms (Fig. 4.2). Each simulation consists of 10 cycles of 100 steps of minimization followed by 20 ps of dynamics at 300 K. In order to optimize the landscape of the loss function, we examine a parameter which we refer to as the "MLForce scale": a scalar multiplier to the loss function that allows us to tune the loss function forces with respect to the other forces in the system (denoted C in Eq. 4.8). We run 3 simulations for each molecule across a range of C values and compare the performance. Each simulation is 200 ps and takes, on average, 20 minutes to run. The Root Mean Squared Distance (RMSD) between the ghost particle and target ligand positions is calculated in a way that takes into account equivalent atoms in the structure (e.g. those that have equal values of the BP functions) and finds the minimum RMSD evaluated over all permutations of equivalent atoms. Average values of these RMSDs and of U_FT (Eq. 4.8) for the last 10% of the assembly simulations are shown in Fig. 4.3.

Figure 4.3: Average values of RMSD and U_FT for assembly simulations run with different MLForce scale values. Error bars show standard errors calculated over 3 separate assembly simulations for the last 10% of data. A) RMSD between positions of ghost particles and the target molecule. B) The potential energy of ghost–ghost interactions, normalized by C (Eq. 4.8).

Table 4.2: Minimum and maximum parameter values for ghost particle attributes.

               Min.     Max.
  q (Debye)    −0.7     0.7
  σ (nm)       0.022    0.23
  ε (kJ/mol)   0.037    2.63
  λ            0.0      1.0

These results show that both the RMSD and the U_FT energy are lower for an MLForce scale of 2000 than for other MLForce scale values.
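The symmetry-aware RMSD metric described above can be sketched as follows. For simplicity, this toy version permutes a single group of equivalent atoms, whereas a full implementation would take the product over all equivalence groups; the coordinates and equivalence group are hypothetical.

```python
import math
from itertools import permutations

def rmsd(a, b):
    """Plain positional RMSD between two equal-length coordinate lists."""
    n = len(a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(a, b))
    return math.sqrt(sq / n)

def min_rmsd_equivalents(ghost, target, equiv):
    """Minimum RMSD over all permutations of one group of
    symmetry-equivalent atoms (e.g. atoms with identical BP feature
    values). Exhaustive, so only small groups are practical."""
    best = float("inf")
    for perm in permutations(equiv):
        trial = list(ghost)
        for src, dst in zip(equiv, perm):
            trial[dst] = ghost[src]
        best = min(best, rmsd(trial, target))
    return best

# Toy example: atoms 1 and 2 are equivalent and appear swapped in the
# ghost structure, so the symmetry-aware RMSD is zero even though the
# naive RMSD is not.
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ghost = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
naive = rmsd(ghost, target)
symm = min_rmsd_equivalents(ghost, target, equiv=[1, 2])
```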
This suggests that an MLForce scale of 2000 is the best value for reaching a smooth FT potential energy function. To see the differences between the attributes of the ghost particles and the target molecule, we determined the Root Mean Square Errors (RMSE) of charge (q), particle size (σ), Lennard-Jones well depth (ε), and λ for the last 10% of data in the assembly simulations (Fig. 4.4). The errors are scaled by the range of each attribute (Tab. 4.2) and are therefore unitless quantities. The lowest RMSEs are achieved with an MLForce scale of 10000, although the errors differ for each atomic attribute. RMSEs of 10^-3 indicate an average difference from the target feature of 0.1% of the allowed range for that parameter. We view all of the results obtained with MLForce scale values ≥ 2000 as showing acceptable convergence.

Figure 4.4: The average RMSE of atomic attributes between the ghost and target ligands for the last 10% of simulation data. Errors are determined over three different runs for 10 molecules (30 assembly simulations in total for each MLForce scale).

We then focused on an MLForce scale value of 2000 and examined the performance of assembly simulations for molecules of different sizes. We observed that the best results are achieved with smaller molecules and that the RMSD and U_FT generally increase with increasing molecular size (Fig. 4.5). In Fig. 4.6 we plot the minimum RMSD obtained in each simulation and see that FT can reach RMSD < 2 Å for the largest system we examine (50 atoms).

Figure 4.5: The average RMSD of positions and FT energy for an MLForce scale of 2000 over the last 10% of data.
The standard errors are determined over three different runs. A) The average RMSD between the ghost particles and the target ligand. B) The FT potential energy, divided by the MLForce scale.

Figure 4.6: The minimum RMSD of positions for an MLForce scale of 2000 for the assembly simulations. The standard errors are determined over three different runs.

4.4 Assembly in a Bromodomain binding pocket

To test the ability of FT to assemble ghost particles in a heterogeneous environment, we simulate the assembly of a ligand within the binding pocket of a bromodomain. Bromodomains are small left-handed bundles of four alpha helices with a well-defined binding pocket. They are known as epigenetic "readers" that use their binding pockets to recognize acetylated lysine residues [303]. Bromodomains are included – often multiple times – in nuclear proteins that are involved in chromatin remodeling. As dysregulation of chromatin remodeling is a hallmark of cancer progression, bromodomain inhibitors have long been sought as potential cancer treatments [304, 305]. Here we assemble the compound 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one in the first bromodomain of BRD2 [306]. A crystal structure of this compound is available in the Protein Data Bank (PDB ID: 4a9e). The ligand has 22 atoms and is composed of two conjoined six-membered rings (a quinazoline scaffold) with protonations and substitutions on the nitrogen-containing ring (Fig. 4.7).

Figure 4.7: The bound pose of 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one from the 4a9e PDB entry. Residues used to define the binding pocket centroid (Pro98, Val103, Tyr155 and Ile162) are shown in white licorice. Hydrogens are not shown on the ligand or binding site residues. A) Front view. B) Top view. C) 2D structure of the ligand.
D) A size–charge representation where each sphere is sized according to the atomic radius σ and colored according to its charge, q.

We conducted ten assembly simulations of the 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one ligand inside the BRD2 binding pocket using an MLForce scale of 5000. The simulations began from different randomly initialized positions and attributes of the ghost particles. Fig. 4.8 shows the U_FT energies over time for the ten assembly simulations. For these parameters, we found that U_FT energies of approximately 15–20 kcal/mol correspond to correctly assembled states, while incorrectly assembled states had U_FT energies of approximately 80 kcal/mol. We find that seven of the ten simulations assembled into the correct structure within the time limit. One trajectory assembled between 800 and 1000 ps, suggesting that a longer time limit could allow all trajectories to converge. Generally, the assembly simulations show periods of stagnation followed by sudden, rapid drops in energy, which suggests the presence of metastable local minima in the landscape. The insets of Fig. 4.8 show snapshots from a successful and an unsuccessful simulation. The final structure of the successful simulation (bottom) shows the same chemical structure and attribute pattern as the target ligand in Fig. 4.7D. Similar results were found for the other six simulations with U_FT < 40 kcal/mol. The final structure in the unsuccessful simulation (top) shows some correct elements of the target molecule, such as smaller, positively charged atoms (hydrogens) arranged around a larger, neutral atom (carbon). While these results are encouraging, a close examination of the ligand pose in the example trajectory shows that it differs from the crystal structure bound pose shown in Fig. 4.7. The ligand should be rotated 180 degrees around the horizontal axis, with the methyl group on the left pointing down into the binding pocket instead of upwards.
Similar results were found for the other seven correctly assembled ligands: although the general position of the ligand was correct, the overall bound pose did not match that found in the crystal structure. This suggests that some further refinement of the method is needed before it can be used to predict bound poses.

Figure 4.8: Ten assembly simulations in a bromodomain binding pocket. The central panel shows traces of the FT energy (U_FT) for the ten trajectories (thin lines) as well as an average over the trajectory set (thick blue line). Two example trajectories are highlighted. The top panels, outlined in red, show snapshots from an example "failed" trajectory, which did not assemble into the correct structure. The bottom panels, outlined in green, show snapshots from a correctly assembled trajectory.

4.5 Discussion and Conclusions

Here we have presented a novel method called Flexible Topology that accounts for ligand-induced fit and the effect of surrounding water molecules upon drug binding to a receptor. In this method, atomic identities are fluidly defined by a set of four continuous dynamical variables for each atom: partial charge (q), particle size (σ), Lennard-Jones well depth (ε), and a presence variable (λ). These atoms are guided to assemble into a target ligand using a deterministic loss function defined by the difference between the atomic environment vectors (AEVs) of the atoms and the AEVs of the target atoms. The AEVs are modified versions of Behler-Parrinello symmetry functions [94], which describe the local environment around each atom in a way that is invariant to translation and rotation operations.
Derivatives of the loss function are used to define atomic forces that are added to an OpenMM MD simulation, implemented using a PyTorch model that incorporates features from the TorchANI package [289]. This was achieved with an external force plugin for OpenMM, referred to as "MLForce", which is made freely available at https://github.com/ADicksonLab/mlforce.git. MLForce is responsible for loading the TorchScript file of a TorchANI model and computing the internal forces on the ghost particles. FT simulations also require a custom integrator to update the continuous dynamical variables. Code to implement this integrator, as well as other supporting scripts, is given in the Flexible Topology repository: https://github.com/ADicksonLab/flexibletopology.git. One of the main challenges of this project was finding a representation that could describe molecular topologies uniquely. For this purpose, we originally tried features from a method called "Geometric Scattering for Graphs" [105], which we had used previously to predict partition coefficients for small molecules in Chapter 3. We attempted to use these features by themselves, as well as in combination with the Behler-Parrinello symmetry functions used above, but none of these feature sets were able to generate proper models for FT simulations. In these cases we found that minimizing the loss function was not sufficient to achieve the desired ligand structures. In other words, many structures were possible that were far from the target ligand structure but still reproduced the features exactly. The current implementation, which directly compares atomic feature values after an on-the-fly Hungarian algorithm mapping step, was found to be superior to the graph scattering approach. In this chapter, we have run assembly simulations in vacuum and in a bromodomain binding pocket.
Impressively, we could reach RMSD < 2 Å for a system of 50 atoms starting from random initial positions, and we can achieve convergence in both positions and attributes. We expect that assembly results for larger system sizes can be further improved through additional modifications to the feature set. For instance, adding new features could improve performance by allowing for a more precise description of a molecular conformation. However, reducing the size of the feature space could also help focus on the most important conformational differences. Another approach would be to learn a set of scaling parameters for the loss function:

    L_{scaled} = \sum_i w_i (F_{GP,i} - F_{T,i})^2    (4.10)

where w_i is the scale factor for feature i. This would allow us to "train" the loss function for optimal performance in assembly simulations. We reserve these lines of investigation for future work. For the assembly simulations in the bromodomain, we found that the predicted ligand poses did not coincide with the binding pose observed in the 4a9e crystal structure. We plan to further investigate the ability of the FT method to predict bound poses in future work. We will first expand our results to include pose prediction of other ligands to the same bromodomain structure. A search of the Protein Data Bank revealed six other ligand-bound BRD2 structures: 2ydw, 4uyf, 6db0, 6jke, 6mo7 and 6ony. It will be instructive to assemble these ligands as well and examine the pose-prediction performance across the whole dataset. Another key detail to examine will be whether the predicted poses are sensitive to the speed at which the FT energy is turned on. The results above are initialized with a high MLForce scale parameter (5000), causing the first stages of assembly to happen rather quickly. We hypothesize that a more gradual introduction of the MLForce will allow the ghost particle positions and attributes to better respond to the local electrostatic environment.
In future work we plan to examine the performance of the algorithm across a range of coupling speeds.

CHAPTER 5

SUMMARY AND FUTURE DIRECTIONS

In this chapter, we provide a summary of the research presented in this dissertation and discuss promising future directions.

5.1 Summary

In this dissertation, we introduced three in silico methods that can be applied at different points in the drug design and discovery process: 1) an enhanced sampling algorithm for simulating ligand (un)binding pathways and predicting binding kinetics, 2) a neural network model to predict molecular properties, and 3) a new method for virtual screening. The motivations behind these methods were discussed in Chapter 1. Next, we summarize our contributions in this dissertation. In Chapter 2, we proposed a region-free weighted ensemble (WE) [84] enhanced sampling algorithm for observing long-timescale rare events in biological systems, called Resampling of Ensembles by Variation Optimization, or "REVO". Since its inception, REVO has been employed to predict unbinding rates and ligand residence times in unbinding simulations [307, 308, 309, 310]. We used a tunable N-dimensional random walk system with an analytical solution to show that REVO is able to visit more distant points when compared to a previously developed weighted ensemble algorithm, WExplore [152, 87], as well as conventional MD sampling. Furthermore, we found that REVO obtains the highest accuracy and range for all simulated values of N compared to WExplore and conventional sampling. We then applied REVO to run unbinding simulations for the well-studied trypsin–benzamidine system, sampling the benzamidine unbinding pathways. Using REVO, we determined a novel unbinding pathway in which benzamidine exits between the two loops, moving downward. We predicted the residence time of the trypsin–benzamidine complex, which was in agreement with experimental values.
In summary, the main characteristic of REVO is that it can be tuned for simulating any system by defining a new distance metric, as was discussed in detail in Chapter 2.

In Chapter 3, we proposed a new method to predict the log P values of small molecules, referred to as “ClassicalGSG”. We employed neural networks to train log P predictor models on molecular features generated using a technique called Geometric Scattering for Graphs (GSG) [105]. The GSG method uses the graph representation of molecules, wherein each atom has a set of attributes. The atomic attributes are classical molecular mechanics force field parameters such as partial charges, Lennard-Jones parameters (well-depth and radius) and atomic type. GSG combines the atomic attributes and molecular topology information (using either 2D or 3D molecular structure) and transforms them into symmetry-invariant molecular features. First, we employed ClassicalGSG to examine the performance of two force field parameter matching programs: CHARMM General Force Field (CGenFF) [228, 229] and General AMBER Force Field 2 (GAFF2) [230, 231]. The NN models were trained using a dataset of molecules made available by OpenChem [55]. CGenFF-generated parameters with a specific ad hoc scheme of classifying CGenFF atomic types achieved the highest accuracy in predicting log P values. For the SAMPL7 target molecules, we used the best-performing parameter sets to train four log P predictor models. One of our verified (but non-ranked) prediction sets achieved the lowest RMSE (0.55) and the second lowest Mean Absolute Error (MAE) of 0.44 among all the submitted predictions. Furthermore, we examined the performance of the log P predictor models trained on atomic attributes generated by open-source force fields such as Merck Molecular Force Field 94 (MMFF94), Universal Force Field (UFF) and Ghemical.
Our results showed that the log P predictor models trained on MMFF94 with 2D molecular structure data yield more accurate predictions than those trained with other force field parameters. Our final log P predictor model was based on an end-to-end high-throughput pipeline to predict the log P of small molecules from their SMILES strings.

Finally, in Chapter 4, we developed a novel virtual screening method called “Flexible Topology” to identify hit compounds while considering the ligand-induced conformational changes in the receptor. This method employs external forces from a TorchANI model to gradually change the identity and positions of “Ghost Particles” and assemble them into target molecules. In addition, we developed a PyTorch force plugin for the OpenMM tool [298] called “MLForce” to generate internal energy and forces on ghost particles. In MLForce, a set of Behler-Parrinello symmetry functions define a symmetry-invariant molecular representation for the ghost particles and target molecule [94]. The assembly simulations are guided by a loss function between the ghost particle features and the target ligand features. In addition, a set of parameters from classical molecular force fields, including the partial charge (q), two Lennard-Jones interaction parameters (the radius (σ) and the well-depth (ε)) and a presence variable λ, define the identity of each ghost particle and are treated as dynamical variables. We ran assembly simulations in vacuum and in the binding pocket of a bromodomain using the Flexible Topology method. We reached RMSE < 2 Å for a system of 50 atoms and observed convergence in both structure and dynamical variables.

5.2 Future Directions

In this section, we present several possible future directions across the major areas of in silico drug discovery, and discuss how machine learning can help advance them.

• Incorporating VAMPNets into REVO.
One future direction for the REVO algorithm will be to incorporate distance metrics that consider collective motions in the solvent or receptor degrees of freedom. Ideally, these collective motions can correspond to long-timescale processes of interest. To achieve this, a distance metric based on VAMPNets [311, 312] can be incorporated into REVO. VAMPNets are neural network models trained on time-lagged data to predict the slow molecular processes of a system. For instance, they can be trained to learn optimal collective variables and describe ligand (un)binding, possibly involving non-linear combinations of features such as distances, solvent densities, and dihedral angles. To incorporate these CVs into REVO, one can simply use the distance between CVs calculated separately for each walker at each resampling step.

• Predicting drug-like properties using a multi-task model. In Chapter 3, we trained NN models on the GSG features to predict the partition coefficients of drug-like molecules. Similarly, if data is available, one can attempt to predict any drug-like property such as permeability, solubility, blood-brain barrier penetration, toxicity, drug metabolism, and pKa. Given this, an interesting future direction will be to develop a multi-task model to simultaneously predict several drug-like chemical properties, which we believe could help better predict whether a small molecule is a suitable drug candidate. For example, each property can have its own loss function in this multi-task model, and a weighted combination of the individual loss functions can be used to optimize the model.

• Rigorous assessment of Flexible Topology to predict binding poses. The next step in improving the Flexible Topology method will be to rigorously evaluate its pose-prediction power against established benchmark methods. In this regard, we plan to run assembly simulations with various target ligands and receptors and quantify the RMSD of the predicted poses.
We expect that Flexible Topology will outperform methods such as docking for systems where the flexibility of the binding pocket and solvent molecules plays an important role.

• Flexible Topology with multiple ligand targets. We can improve the screening power of the Flexible Topology method by allowing multiple ligands in the target set. This way, we can take into account the flexibility of the ligands and identify conformations that are the most compatible with a given binding site. By including multiple ligands – and possibly large databases of ligands – we hope to quickly identify ligands that are the most favorable. In this setup, the loss function can be formulated as follows:

L(F_GP, {F_T}) = min_i |F(X_GP, A_GP) − F_T_i|,  i ∈ {1 … n}    (5.1)

where n is the number of feature vectors in the target set. These features can correspond either to different ligands, or to different ligand conformations.

• Flexible Topology with a binding score function. Another future direction for the Flexible Topology method would be to have a score function to guide the ghost particles toward a correct structure and pose, instead of using a set of target features for the loss function. The score function will take the bound structure of the protein and ligand and determine a) how well the ghost particles fit in the binding pocket of the target protein and b) the correct molecular structure for the ligand. This would be a purely generative ligand design model that considers the target’s dynamics. Moreover, the score function could be a deep learning model trained extensively on both ground-state data and conformations from assembly simulations.

APPENDIX

REVO METHOD MATHEMATICAL MODEL

The REVO resampling method solves an optimization problem which aims to select a set of walkers for resampling operations (e.g. cloning and merging) in order to maximize an objective function. Walker i is denoted as W_i = {X_i, w_i}, where X_i represents the system coordinates and w_i is the walker weight.
In this section we precisely define the REVO resampling problem using a mathematical model, with the following definitions:

n is the number of walkers.
p_min is the minimum allowable probability of a walker.
p_max is the maximum allowable probability of a walker.
d_merge is the maximum walker-walker distance for merge operations.
w = [w_1, w_2, …, w_n]^T ∈ R^n is the vector of walker weights, whose elements sum to 1.
φ = [φ(W_1), φ(W_2), …, φ(W_n)]^T is a vector of walker “novelties” defined by the positively-valued φ function.
α is the exponent of the distance term in the variation function.
d_0 is the characteristic distance parameter.
D = {d_ij(X_i, X_j)} for i = 1 … n and j = 1 … n is the distance matrix, where d_ij(X_i, X_j) is the distance between walker i and walker j, with d_ij = d_ji and d_ii = 0.

The “variation” of the system is defined as

V = Σ_a Σ_b (d_ab / d_0)^α φ_a φ_b    (2)

where we denote φ(W_a) as “φ_a”. REVO solves this optimization problem using a “greedy” algorithm, which at each step finds the cloning operation that would cause a maximal increase in V. We further denote w* and φ* as the weight and novelty vectors after resampling operations, and V* as the variation of the walkers after resampling. Now we can define ΔV:

ΔV = V* − V = Σ_a Σ_b (d_ab / d_0)^α φ*_a φ*_b − Σ_a Σ_b (d_ab / d_0)^α φ_a φ_b    (3)

Let i denote the index of a cloned walker, and let j and k denote the indices of walkers that are merged together, with walker k continuing on to the next cycle (that is, absorbing the weight of walker j). The goal of the REVO resampler in a given iteration is to maximize ΔV by finding the optimal set of walkers (i, j, k). The cloning and merging operations change the weights of walkers i, j and k, and change the inter-walker distances d_jx for all x. All other walker weights and distances are preserved.
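The variation of Equation 2 is straightforward to compute from the distance matrix and novelty vector; the following minimal sketch uses arbitrary stand-in values for α and d_0, not the tuned parameters used in production REVO runs:

```python
import numpy as np

def variation(dist, novelty, d0=1.0, alpha=4.0):
    """REVO variation (Eq. 2): V = sum_ab (d_ab/d0)^alpha * phi_a * phi_b."""
    phi = np.asarray(novelty, dtype=float)
    d = np.asarray(dist, dtype=float)
    return float(np.sum((d / d0) ** alpha * np.outer(phi, phi)))

# Two walkers a distance d0 apart, each with novelty 1: only the two
# off-diagonal terms contribute, so V = 2.
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
assert variation(d, [1.0, 1.0]) == 2.0
```

Cloning a high-novelty, distant walker adds new terms to this double sum, which is exactly the ΔV that the greedy step of Equation 3 maximizes.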
The new weights and distances are:

w*_i = w*_j = w_i / 2
w*_k = w_j + w_k
d*_ij = d*_ji = 0
d*_jx = d_ix  ∀ x ≠ j    (4)

Finally, we can formulate this optimization problem as follows:

maximize over φ*:  ΔV = V* − V
subject to:
    i = 1, 2, …, n
    j = 1, 2, …, n and j ≠ i
    k = 1, 2, …, n and k ≠ i and k ≠ j
    w_i ≥ 2 p_min
    w_j + w_k ≤ p_max
    d_jk ≤ d_merge
    w*_i = w*_j = w_i / 2
    w*_k = w_j + w_k
    w*_l = w_l for l = 1, 2, …, n and l ≠ i, l ≠ j, l ≠ k

This is a complex optimal assignment problem which may be solved using binary integer programming [313]. Here we illustrate the algorithm with a simpler model.

Mathematical model: Cloning only

Now imagine we have n walkers and we want to choose one of them to clone into two identical walkers, each with half the weight of the chosen initial walker and the same distances to the other walkers. The new arrangement will have n + 1 walkers, as no merging will take place. The variation will be calculated as above, and the change in variation upon cloning walker i is denoted ΔV_i. Let us define V = Σ_i V_i, where V_i is as follows:

V_i = Σ_l (d_il / d_0)^α φ_i φ_l    (5)

It is intuitive that walkers with higher V_i values should be the walkers most beneficial for cloning, since they would then get to contribute twice to the variation function. However, the weight of this walker will also be reduced, which will lower φ_i. We will first show that if V_i ≥ V_m and φ_i ≥ φ_m, then walker i will show a higher increase in variation than walker m. In other words, ΔΔ_im = ΔV_i − ΔV_m > 0.

ΔΔ_im = ΔV_i − ΔV_m
      = [ Σ_a Σ_b (d_ab/d_0)^α φ*_a φ*_b − Σ_a Σ_b (d_ab/d_0)^α φ_a φ_b ]_i
      − [ Σ_a Σ_b (d_ab/d_0)^α φ*_a φ*_b − Σ_a Σ_b (d_ab/d_0)^α φ_a φ_b ]_m    (6)

where the subscript on each bracket denotes which walker is cloned. Only the weight of one walker changes, so most of the terms cancel out.
After simplification, we end up with the following equation:

ΔΔ_im = (2φ*_i − φ_i) Σ_l φ_l (d_il/d_0)^α − (2φ*_m − φ_m) Σ_l φ_l (d_ml/d_0)^α
      = (2 φ*_i/φ_i − 1) Σ_l (d_il/d_0)^α φ_i φ_l − (2 φ*_m/φ_m − 1) Σ_l (d_ml/d_0)^α φ_m φ_l    (7)

Using Equation 5 we have:

ΔΔ_im = (2 φ*_i/φ_i − 1) V_i − (2 φ*_m/φ_m − 1) V_m    (8)

In this work we employ a specific form of φ:

φ_i = log(w_i) − log(p_min/100)    (9)

which varies with walker weight as follows:

w*_i = w_i / 2
φ*_i = log(w*_i) − log(p_min/100)
     = log(w_i/2) − log(p_min/100)
     = log(w_i) − log(p_min/100) − log 2
     = φ_i − log 2    (10)

Therefore we have:

ΔΔ_im = (1 − 2 log 2/φ_i) V_i − (1 − 2 log 2/φ_m) V_m    (11)

If we assume φ_i ≥ φ_m, it can be shown that

(1 − 2 log 2/φ_i) ≥ (1 − 2 log 2/φ_m)    (12)

Additionally, if we have V_i ≥ V_m, then this guarantees ΔΔ_im > 0. However, it is a common scenario that V_i > V_m but φ_i < φ_m. This would be the case if walker i had large distances to other walkers, but had a lower weight than walker m. To find the minimum φ_i value such that ΔΔ_im is still positive, we set ΔΔ_im to zero, and solve for φ_i in terms of φ_m and V_i/V_m. The result is as follows:

φ_i^min = ( (V_i/V_m) · 2 log 2 ) / ( V_i/V_m − 1 + 2 log 2/φ_m )    (13)

Thus, if V_i = V_m then φ_i^min = φ_m, which is intuitive: if walker m has both the highest V_m and the highest φ_m, then it is guaranteed to cause the largest increase in V, as shown above.

[Figure A.1 is a log-log plot of w_i^min versus w_m for V_i/V_m = 1, 1.1, 1.5 and 2.0.]

Figure A.1: Minimum walker weight of i to ensure that ΔΔ_im > 0.

Figure A.1 shows w_i^min, which corresponds to φ_i^min, for different values of w_m and V_i/V_m. This clearly shows that, for the φ function in Equation 9, even for small values of V_i/V_m we find that φ_i^min is very small, approaching p_min, which is the smallest allowable walker weight.
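Equations 11 and 13 can be checked numerically; the short sketch below (using arbitrary example values for φ_m and V_i/V_m) verifies that ΔΔ_im vanishes at the boundary φ_i = φ_i^min, and that the boundary reduces to φ_m when V_i = V_m:

```python
import math

def delta_delta(v_i, v_m, phi_i, phi_m):
    """Eq. 11: (1 - 2 ln2/phi_i) V_i - (1 - 2 ln2/phi_m) V_m."""
    two_ln2 = 2 * math.log(2)
    return (1 - two_ln2 / phi_i) * v_i - (1 - two_ln2 / phi_m) * v_m

def phi_i_min(v_ratio, phi_m):
    """Eq. 13: minimum phi_i keeping dd_im positive, v_ratio = V_i/V_m."""
    two_ln2 = 2 * math.log(2)
    return v_ratio * two_ln2 / (v_ratio - 1 + two_ln2 / phi_m)

phi_m = 10.0
# At phi_i = phi_i_min, dd_im = 0 (Eq. 13 is the root of Eq. 11).
for r in (1.0, 1.1, 1.5, 2.0):
    assert abs(delta_delta(r, 1.0, phi_i_min(r, phi_m), phi_m)) < 1e-12
# When V_i = V_m, the boundary is exactly phi_m.
assert abs(phi_i_min(1.0, phi_m) - phi_m) < 1e-12
```

The V_i/V_m ratios used here match the curves shown in Figure A.1.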
Thus, in our implementation of the REVO algorithm, we only consider the walkers with the highest V_i values for cloning. Note that for different choices of the novelty function φ_i(W_i), this assumption might have to be revisited.

BIBLIOGRAPHY

[1] Chun Meng Song, Shen Jean Lim, and Joo Chuan Tong. Recent advances in computer-aided drug design. Briefings in bioinformatics, 10(5):579–591, 2009.
[2] Si-sheng Ou-Yang, Jun-yan Lu, Xiang-qian Kong, Zhong-jie Liang, Cheng Luo, and Hualiang Jiang. Computational drug discovery. Acta Pharmacologica Sinica, 33(9):1131–1140, 2012.
[3] Maria Batool, Bilal Ahmad, and Sangdun Choi. A structure-based drug discovery paradigm. International journal of molecular sciences, 20(11):2783, 2019.
[4] Olivier J Wouters, Martin McKee, and Jeroen Luyten. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. JAMA, 323(9):844–853, 2020.
[5] Brian S Hilbush. In Silico Dreams: How Artificial Intelligence and Biotechnology Will Create the Medicines of the Future. John Wiley & Sons, 2021.
[6] Christopher A Lipinski, Franco Lombardo, Beryl W Dominy, and Paul J Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced drug delivery reviews, 23(1-3):3–25, 1997.
[7] Philip J Hajduk, Jeffrey R Huth, and Christin Tse. Predicting protein druggability. Drug discovery today, 10(23-24):1675–1682, 2005.
[8] Honglin Li, Zhenting Gao, Ling Kang, Hailei Zhang, Kun Yang, Kunqian Yu, Xiaomin Luo, Weiliang Zhu, Kaixian Chen, Jianhua Shen, et al. Tarfisdock: a web server for identifying drug targets with docking approach. Nucleic acids research, 34(suppl_2):W219–W224, 2006.
[9] Nagasuma Chandra. Computational systems approach for drug target discovery. Expert opinion on drug discovery, 4(12):1221–1236, 2009.
[10] Douglas B Kitchen, Hélène Decornez, John R Furr, and Jürgen Bajorath.
Docking and scoring in virtual screening for drug discovery: methods and applications. Nature reviews Drug discovery, 3(11):935–949, 2004.
[11] Sergio Filipe Sousa, Pedro Alexandrino Fernandes, and Maria Joao Ramos. Protein–ligand docking: current status and future challenges. Proteins: Structure, Function, and Bioinformatics, 65(1):15–26, 2006.
[12] Xuan-Yu Meng, Hong-Xing Zhang, Mihaly Mezei, and Meng Cui. Molecular docking: a powerful approach for structure-based drug discovery. Current computer-aided drug design, 7(2):146–157, 2011.
[13] Leonardo G Ferreira, Ricardo N Dos Santos, Glaucius Oliva, and Adriano D Andricopulo. Molecular docking and structure-based drug design strategies. Molecules, 20(7):13384–13421, 2015.
[14] Shikha Agnihotry, Rajesh Kumar Pathak, Ajeet Srivastav, Pradeep Kumar Shukla, and Budhayash Gautam. Molecular docking and structure-based drug design. In Computer-Aided Drug Design, pages 115–131. Springer, 2020.
[15] Campbell McInnes. Virtual screening strategies in drug discovery. Current opinion in chemical biology, 11(5):494–502, 2007.
[16] Antonio Lavecchia and Carmen Di Giovanni. Virtual screening strategies in drug discovery: a critical review. Current medicinal chemistry, 20(23):2839–2860, 2013.
[17] Shide Liang, Chi Zhang, Song Liu, and Yaoqi Zhou. Protein binding site prediction using an empirical scoring function. Nucleic acids research, 34(13):3698–3707, 2006.
[18] Bingding Huang. Metapocket: a meta approach to improve protein ligand binding site prediction. OMICS A Journal of Integrative Biology, 13(4):325–330, 2009.
[19] Zengming Zhang, Yu Li, Biaoyang Lin, Michael Schroeder, and Bingding Huang. Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics, 27(15):2083–2088, 2011.
[20] Gisbert Schneider and Uli Fechner. Computer-based de novo design of drug-like molecules. Nature Reviews Drug Discovery, 4(8):649–663, 2005.
[21] Yun Tang, Weiliang Zhu, Kaixian Chen, and Hualiang Jiang. New technologies in computer-aided drug design: Toward target identification and new chemical entity discovery. Drug discovery today: technologies, 3(3):307–313, 2006.
[22] Markus Hartenfeller and Gisbert Schneider. De novo drug design. Chemoinformatics and computational chemical biology, pages 299–323, 2010.
[23] Robert P Sheridan and Simon K Kearsley. Why do we need so many chemical similarity search methods? Drug discovery today, 7(17):903–911, 2002.
[24] Adrià Cereto-Massagué, María José Ojeda, Cristina Valls, Miquel Mulero, Santiago Garcia-Vallvé, and Gerard Pujadas. Molecular fingerprint similarity search in virtual screening. Methods, 71:58–63, 2015.
[25] Ingo Muegge and Prasenjit Mukherjee. An overview of molecular fingerprint similarity search in virtual screening. Expert opinion on drug discovery, 11(2):137–148, 2016.
[26] Han Van De Waterbeemd and Eric Gifford. Admet in silico modelling: towards prediction paradise? Nature reviews Drug discovery, 2(3):192–204, 2003.
[27] Jie Dong, Ning-Ning Wang, Zhi-Jiang Yao, Lin Zhang, Yan Cheng, Defang Ouyang, Ai-Ping Lu, and Dong-Sheng Cao. Admetlab: a platform for systematic admet evaluation based on a comprehensively collected admet database. Journal of cheminformatics, 10(1):1–11, 2018.
[28] Jangampalli Adi Pradeepkiran, SB Sainath, and KVL Shrikanya. In silico validation and admet analysis for the best lead molecules. In Brucella Melitensis, pages 133–176. Elsevier, 2021.
[29] Kenneth M Merz Jr, Dagmar Ringe, and Charles H Reynolds. Drug design: structure- and ligand-based approaches. Cambridge University Press, 2010.
[30] Polamarasetty Aparoy, Kakularam Kumar Reddy, and Pallu Reddanna. Structure and ligand based drug design strategies in the development of novel 5-lox inhibitors. Current medicinal chemistry, 19(22):3763–3778, 2012.
[31] Gregory L Wilson and Markus A Lill.
Integrating structure-based and ligand-based approaches for computational drug design. Future medicinal chemistry, 3(6):735–750, 2011.
[32] Vidushi Sharma, Sharad Wakode, and Hirdesh Kumar. Structure- and ligand-based drug design: concepts, approaches, and challenges. In Chemoinformatics and Bioinformatics in the Pharmaceutical Sciences, pages 27–53. Elsevier, 2021.
[33] Leonardo LG Ferreira and Adriano D Andricopulo. Chemoinformatics approaches to structure- and ligand-based drug design. Frontiers in Pharmacology, page 1416, 2018.
[34] Javier Vázquez, Manel López, Enric Gibert, Enric Herrero, and F Javier Luque. Merging ligand-based and structure-based methods in drug discovery: An overview of combined virtual screening approaches. Molecules, 25(20):4723, 2020.
[35] Amy C Anderson. The process of structure-based drug design. Chemistry & biology, 10(9):787–797, 2003.
[36] Alexander Wlodawer and Jiri Vondrasek. Inhibitors of hiv-1 protease: a major success of structure-assisted drug design. Annual review of biophysics and biomolecular structure, 27(1):249–284, 1998.
[37] David E Clark. What has computer-aided molecular design ever done for drug discovery? Expert opinion on drug discovery, 1(2):103–110, 2006.
[38] Hedia Marrakchi, Gilbert Lanéelle, and Annaïk Quémard. Inha, a target of the antituberculous drug isoniazid, is involved in a mycobacterial fatty acid elongation system, fas-ii. Microbiology, 146(2):289–296, 2000.
[39] Sakineh Dadashpour, Tuba Tuylu Kucukkilinc, Oya Unsal Tan, Keriman Ozadali, Hamid Irannejad, and Saeed Emami. Design, synthesis and in vitro study of 5,6-diaryl-1,2,4-triazine-3-ylthioacetate derivatives as cox-2 and β-amyloid aggregation inhibitors. Archiv der Pharmazie, 348(3):179–187, 2015.
[40] Zachary Miller, Keun-Sik Kim, Do-Min Lee, Vinod Kasam, Si Eun Baek, Kwang Hyun Lee, Yan-Yan Zhang, Lin Ao, Kimberly Carmony, Na-Ra Lee, et al. Proteasome inhibitors with pyrazole scaffolds from structure-based virtual screening.
Journal of Medicinal Chemistry, 58(4):2036–2041, 2015.
[41] Sandeep Grover, Marsha A Apushkin, and Gerald A Fishman. Topical dorzolamide for the treatment of cystoid macular edema in patients with retinitis pigmentosa. American journal of ophthalmology, 141(5):850–858, 2006.
[42] Tomas de Paulis. Drug evaluation: Prx-00023, a selective 5-ht1a receptor agonist for depression. Current Opinion in Investigational Drugs (London, England: 2000), 8(1):78–86, 2007.
[43] Duxin Sun, Wei Gao, Hongxiang Hu, and Simon Zhou. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 2022.
[44] Richard K Harrison. Phase ii and phase iii failures: 2013–2015. Nat Rev Drug Discov, 15(12):817–818, 2016.
[45] Debomoy K Lahiri, Bryan Maloney, Justin M Long, and Nigel H Greig. Lessons from a bace1 inhibitor trial: Off-site but not off base. Alzheimer’s & Dementia, 10(5):S411–S419, 2014.
[46] Radoslav Krivák and David Hoksza. P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of cheminformatics, 10(1):1–12, 2018.
[47] Marta M Stepniewska-Dziubinska, Piotr Zielenkiewicz, and Pawel Siedlecki. Improving detection of protein-ligand binding sites with 3d segmentation. Scientific reports, 10(1):1–9, 2020.
[48] Evan N Feinberg, Elizabeth Joshi, Vijay S Pande, and Alan C Cheng. Improvement in admet prediction with multitask deep featurization. Journal of medicinal chemistry, 63(16):8835–8848, 2020.
[49] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science advances, 4(7):eaap7885, 2018.
[50] Daniel Reker and Gisbert Schneider. Active-learning strategies in computer-assisted drug discovery. Drug discovery today, 20(4):458–465, 2015.
[51] Jérôme Hert, Peter Willett, David J Wilton, Pierre Acklin, Kamal Azzaoui, Edgar Jacoby, and Ansgar Schuffenhauer.
New methods for ligand-based virtual screening: use of data fusion and machine learning to enhance the effectiveness of similarity searching. Journal of chemical information and modeling, 46(2):462–470, 2006.
[52] Ji-Yong An, Fan-Rong Meng, and Zi-Ji Yan. An efficient computational method for predicting drug-target interactions using weighted extreme learning machine and speed up robot features. BioData Mining, 14(1):1–17, 2021.
[53] John S Delaney. Esol: estimating aqueous solubility directly from molecular structure. Journal of chemical information and computer sciences, 44(3):1000–1005, 2004.
[54] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of chemical information and modeling, 53(7):1563–1575, 2013.
[55] Maria Korshunova, Boris Ginsburg, Alexander Tropsha, and Olexandr Isayev. Openchem: A deep learning toolkit for computational chemistry and drug design. Journal of Chemical Information and Modeling, 61(1):7–13, 2021.
[56] Bilal Shaker, Myeong-Sang Yu, Jin Sook Song, Sunjoo Ahn, Jae Yong Ryu, Kwang-Seok Oh, and Dokyun Na. Lightbbb: computational prediction model of blood–brain-barrier penetration based on lightgbm. Bioinformatics, 37(8):1135–1139, 2021.
[57] Albert C Pan, David W Borhani, Ron O Dror, and David E Shaw. Molecular determinants of drug–receptor binding kinetics. Drug discovery today, 18(13-14):667–673, 2013.
[58] James A. Maier, Carmenza Martinez, Koushik Kasavajhala, Lauren Wickstrom, Kevin E. Hauser, and Carlos Simmerling. ff14sb: Improving the accuracy of protein side chain and backbone parameters from ff99sb. Journal of Chemical Theory and Computation, 11(8):3696–3713, 2015.
[59] Tanaji T Talele, Santosh A Khedkar, and Alan C Rigby. Successful applications of computer aided drug discovery: moving drugs from concept to the clinic. Current topics in medicinal chemistry, 10(1):127–141, 2010.
[60] Vijay Kumar Bhardwaj and Rituraj Purohit. Targeting the protein-protein interface pocket of aurora-a-tpx2 complex: rational drug design and validation. Journal of Biomolecular Structure and Dynamics, 39(11):3882–3891, 2021.
[61] Sam RJ Hoare, Beth A Fleck, John P Williams, and Dimitri E Grigoriadis. The importance of target binding kinetics for measuring target binding affinity in drug discovery: a case study from a crf1 receptor antagonist program. Drug Discovery Today, 25(1):7–14, 2020.
[62] Robert A Copeland, David L Pompliano, and Thomas D Meek. Drug–target residence time and its implications for lead optimization. Nature reviews Drug discovery, 5(9):730–739, 2006.
[63] David C Swinney. The role of binding kinetics in therapeutically useful drug action. Current opinion in drug discovery & development, 12(1):31–39, 2009.
[64] Hao Lu and Peter J Tonge. Drug–target residence time: critical information for lead optimization. Current opinion in chemical biology, 14(4):467–474, 2010.
[65] Anthony M Giannetti. From experimental design to validated hits: a comprehensive walk-through of fragment lead identification using surface plasmon resonance. In Methods in enzymology, volume 493, pages 169–218. Elsevier, 2011.
[66] David A Sykes and Steven J Charlton. Single step determination of unlabeled compound kinetics using a competition association binding method employing time-resolved fret. In Rational Drug Design, pages 177–194. Springer, 2018.
[67] Edward C Hulme and Mike A Trevethick. Ligand binding assays at equilibrium: validation and interpretation. British journal of pharmacology, 161(6):1219–1237, 2010.
[68] Sai Kiran Sharma, Serge K Lyashchenko, Hijin A Park, Nagavarakishore Pillarsetty, Yorann Roux, Jiong Wu, Sophie Poty, Kathryn M Tully, John T Poirier, and Jason S Lewis. A rapid bead-based radioligand binding assay for the determination of target-binding fraction and quality control of radiopharmaceuticals.
Nuclear medicine and biology, 71:32–38, 2019.
[69] Anni Allikalt and Ago Rinken. Budded baculovirus particles as a source of membrane proteins for radioligand binding assay: The case of dopamine d1 receptor. Journal of Pharmacological and Toxicological Methods, 86:81–86, 2017.
[70] Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002.
[71] Huiyong Sun, Youyong Li, Mingyun Shen, Dan Li, Yu Kang, and Tingjun Hou. Characterizing drug–target residence time with metadynamics: How to achieve dissociation rate efficiently without losing accuracy against time-consuming approaches. Journal of Chemical Information and Modeling, 57(8):1895–1906, 2017.
[72] Mattia Bernetti, Matteo Masetti, Maurizio Recanatini, Rommie E Amaro, and Andrea Cavalli. An integrated markov state model and path metadynamics approach to characterize drug binding processes. Journal of chemical theory and computation, 15(10):5689–5702, 2019.
[73] Luca Mollica, Sergio Decherchi, Syeda Rehana Zia, Roberto Gaspari, Andrea Cavalli, and Walter Rocchia. Kinetics of protein-ligand unbinding via smoothed potential molecular dynamics simulations. Scientific reports, 5(1):1–12, 2015.
[74] Luca Mollica, Isabelle Theret, Mathias Antoine, Françoise Perron-Sierra, Yves Charton, Jean-Marie Fourquez, Michel Wierzbicki, Jean A Boutin, Gilles Ferry, Sergio Decherchi, et al. Molecular dynamics simulations and kinetic measurements to estimate and predict protein–ligand residence times. Journal of medicinal chemistry, 59(15):7167–7176, 2016.
[75] Indrajit Deb and Aaron T Frank. Accelerating rare dissociative processes in biomolecules using selectively scaled md simulations. Journal of Chemical Theory and Computation, 15(11):5817–5828, 2019.
[76] Donatella Callegari, Alessio Lodola, Daniele Pala, Silvia Rivara, Marco Mor, Andrea Rizzi, and Anna Maria Capelli.
Metadynamics simulations distinguish short- and long-residence-time inhibitors of cyclin-dependent kinase 8. Journal of chemical information and modeling, 57(2):159–169, 2017.
[77] Riccardo Capelli, Paolo Carloni, and Michele Parrinello. Exhaustive search of ligand binding pathways via volume-based metadynamics. The journal of physical chemistry letters, 10(12):3495–3499, 2019.
[78] Barry Isralewitz, Mu Gao, and Klaus Schulten. Steered molecular dynamics and mechanical functions of proteins. Current opinion in structural biology, 11(2):224–230, 2001.
[79] Yuzhen Niu, Shuyan Li, Dabo Pan, Huanxiang Liu, and Xiaojun Yao. Computational study on the unbinding pathways of b-raf inhibitors and its implication for the difference of residence time: insight from random acceleration and steered molecular dynamics simulations. Physical Chemistry Chemical Physics, 18(7):5622–5629, 2016.
[80] Gary A Huber and Sangtae Kim. Weighted-ensemble brownian dynamics simulations for protein association reactions. Biophysical journal, 70(1):97–110, 1996.
[81] Jeffrey C Sung, Adam W Van Wynsberghe, Rommie E Amaro, Wilfred W Li, and J Andrew McCammon. Role of secondary sialic acid binding sites in influenza n1 neuraminidase. Journal of the American Chemical Society, 132(9):2883–2885, 2010.
[82] Nuria Plattner and Frank Noé. Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and markov models. Nature communications, 6:7653, 2015.
[83] Jagannath Mondal, Navjeet Ahalawat, Subhendu Pandit, Lewis E Kay, and Pramodh Vallurupalli. Atomic resolution mechanism of ligand binding to a solvent inaccessible cavity in t4 lysozyme. PLoS computational biology, 14(5):e1006180, 2018.
[84] G A Huber and S Kim. Weighted-ensemble brownian dynamics simulations for protein association reactions. Biophysical Journal, 70(1):97–110, January 1996.
[85] Matthew C Zwier, Adam J Pratt, Joshua L Adelman, Joseph W Kaus, Daniel M Zuckerman, and Lillian T Chong.
Efficient atomistic simulation of pathways and calculation of rate constants for a protein–peptide binding process: Application to the mdm2 protein and an intrinsically disordered p53 peptide. The journal of physical chemistry letters, 7(17):3440–3445, 2016.
[86] Alex Dickson and Samuel D. Lotz. Ligand release pathways obtained with wexplore: Residence times and mechanisms. Journal of Physical Chemistry B, 120(24):5377–5385, 2016. PMID: 27231969.
[87] Alex Dickson and Samuel D. Lotz. Multiple ligand unbinding pathways and ligand-induced destabilization revealed by wexplore. Biophysical Journal, 112(4):620–629, February 2017.
[88] Samuel D Lotz and Alex Dickson. Unbiased molecular dynamics of 11 min timescale drug unbinding reveals transition state stabilizing interactions. Journal of the American Chemical Society, 140(2):618–628, 2018.
[89] Alex Dickson. Mapping the ligand binding landscape. Biophysical Journal, 115(9):1707–1719, 2018.
[90] David L Mobley, Karisa L Wymer, Nathan M Lim, and J Peter Guthrie. Blind prediction of solvation free energies from the sampl4 challenge. Journal of computer-aided molecular design, 28(3):135–150, 2014.
[91] Tewksbury ChemSilico LLC. Cslogp program.
[92] Kedi Wu, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. Topp–s: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. Journal of computational chemistry, 39(20):1444–1454, 2018.
[93] Evan N Feinberg, Debnil Sur, Zhenqin Wu, Brooke E Husic, Huanghao Mai, Yang Li, Saisai Sun, Jianyi Yang, Bharath Ramsundar, and Vijay S Pande. Potentialnet for molecular property prediction. ACS central science, 4(11):1520–1530, 2018.
[94] Jörg Behler and Michele Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical review letters, 98(14):146401, 2007.
[95] Justin S Smith, Olexandr Isayev, and Adrian E Roitberg.
Ani-1: an extensible neural network potential with dft accuracy at force field computational cost. Chemical science, 8(4):3192–3203, 2017. [96] Masashi Tsubaki and Teruyasu Mizoguchi. Fast and accurate molecular property prediction: Learning atomic interactions and potentials with neural networks. The journal of physical chemistry letters, 9(19):5733–5741, 2018. [97] Lei Cao, Wenbo Tao, Sungtae An, Jing Jin, Yizhou Yan, Xiaoyu Liu, Wendong Ge, Adam Sah, Leilani Battle, Jimeng Sun, et al. Smile: a system to support machine learning on eeg data at scale. Proceedings of the VLDB Endowment, 12(12):2230–2241, 2019. [98] Gabriel A Pinheiro, Johnatan Mucelini, Marinalva D Soares, Ronaldo C Prati, Juarez LF Da Silva, and Marcos G Quiles. Machine learning prediction of nine molecular properties based on the smiles representation of the qm9 quantum-chemistry dataset. The Journal of Physical Chemistry A, 124(47):9854–9866, 2020. [99] Izhar Wallach, Michael Dzamba, and Abraham Heifets. Atomnet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015. [100] Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, and David Ryan Koes. Protein–ligand scoring with convolutional neural networks. Journal of chemical information and modeling, 57(4):942–957, 2017. [101] José Jiménez, Miha Skalic, Gerard Martinez-Rosell, and Gianni De Fabritiis. Kdeep: protein–ligand absolute binding affinity prediction via 3d-convolutional neural networks. Journal of chemical information and modeling, 58(2):287–296, 2018. [102] Andrew T McNutt, Paul Francoeur, Rishal Aggarwal, Tomohide Masuda, Rocco Meli, Matthew Ragoza, Jocelyn Sunseri, and David Ryan Koes. Gnina 1.0: molecular docking with deep learning. Journal of cheminformatics, 13(1):1–20, 2021. [103] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks.
Nature communications, 8:13890, 2017. [104] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR. org, 2017. [105] Feng Gao, Guy Wolf, and Matthew Hirn. Geometric scattering for graph data analysis. In International Conference on Machine Learning, pages 2122–2131, 2019. [106] Robert P Hertzberg and Andrew J Pope. High-throughput screening: new technology for the 21st century. Current opinion in chemical biology, 4(4):445–451, 2000. [107] W Patrick Walters, Matthew T Stahl, and Mark A Murcko. Virtual screening—an overview. Drug discovery today, 3(4):160–178, 1998. [108] Tiejun Cheng, Qingliang Li, Zhigang Zhou, Yanli Wang, and Stephen H Bryant. Structure-based virtual screening for drug discovery: a problem-centric review. The AAPS journal, 14(1):133–141, 2012. [109] Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862–865, 2004. [110] Ulrich Rester. From virtuality to reality - virtual screening in lead discovery and lead optimization: a medicinal chemistry perspective. Current opinion in drug discovery & development, 11(4):559–568, 2008. [111] Sharangdhar S Phatak, Clifford C Stephan, and Claudio N Cavasotto. High-throughput and in silico screenings in drug discovery. Expert Opinion on Drug Discovery, 4(9):947–959, 2009. [112] Jürgen Bajorath. Integration of virtual and high-throughput screening. Nature Reviews Drug Discovery, 1(11):882–894, 2002. [113] Vincent Zoete, Antoine Daina, Christophe Bovigny, and Olivier Michielin. Swisssimilarity: a web tool for low to ultra high throughput ligand-based virtual screening, 2016. [114] Carlos Garcia-Hernandez, Alberto Fernandez, and Francesc Serratosa. Ligand-based virtual screening using graph edit distance as molecular similarity measure.
Journal of chemical information and modeling, 59(4):1410–1421, 2019. [115] Markus Hofmarcher, Andreas Mayr, Elisabeth Rumetshofer, Peter Ruch, Philipp Renz, Johannes Schimunek, Philipp Seidl, Andreu Vall, Michael Widrich, Sepp Hochreiter, et al. Large-scale ligand-based virtual screening for sars-cov-2 inhibitors using deep neural networks. Available at SSRN 3561442, 2020. [116] Holger Gohlke and Gerhard Klebe. Approaches to the description and prediction of the binding affinity of small-molecule ligands to macromolecular receptors. Angewandte Chemie International Edition, 41(15):2644–2676, 2002. [117] Inbal Halperin, Buyong Ma, Haim Wolfson, and Ruth Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function, and Bioinformatics, 47(4):409–443, 2002. [118] Nuno MFSA Cerqueira, Diana Gesto, Eduardo F Oliveira, Diogo Santos-Martins, Natércia F Brás, Sérgio F Sousa, Pedro A Fernandes, and Maria J Ramos. Receptor-based virtual screening protocol for drug discovery. Archives of biochemistry and biophysics, 582:56–67, 2015. [119] John J Irwin and Brian K Shoichet. Zinc - a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1):177–182, 2005. [120] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012. [121] Yanli Wang, Jewen Xiao, Tugba O Suzek, Jian Zhang, Jiyao Wang, and Stephen H Bryant. Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37(suppl_2):W623–W633, 2009. [122] Qingliang Li, Tiejun Cheng, Yanli Wang, and Stephen H Bryant. Pubchem as a public resource for drug discovery. Drug discovery today, 15(23-24):1052–1057, 2010.
[123] George WA Milne and JA Miller. The nci drug information system. 1. system overview. Journal of chemical information and computer sciences, 26(4):154–159, 1986. [124] Claudio N Cavasotto and Narender Singh. Docking and high throughput docking: successes and the challenge of protein flexibility. Current Computer-Aided Drug Design, 4(3):221–234, 2008. [125] Pietro Cozzini, Glen E Kellogg, Francesca Spyrakis, Donald J Abraham, Gabriele Costantino, Andrew Emerson, Francesca Fanelli, Holger Gohlke, Leslie A Kuhn, Garrett M Morris, et al. Target flexibility: an emerging consideration in drug discovery and design. Journal of medicinal chemistry, 51(20):6237–6255, 2008. [126] Thomas Scior, Andreas Bender, Gary Tresadern, José L Medina-Franco, Karina Martínez-Mayorga, Thierry Langer, Karina Cuanalo-Contreras, and Dimitris K Agrafiotis. Recognizing pitfalls in virtual screening: a critical review. Journal of chemical information and modeling, 52(4):867–881, 2012. [127] Katrina W Lexa and Heather A Carlson. Protein flexibility in docking and surface mapping. Quarterly reviews of biophysics, 45(3):301–343, 2012. [128] Claudio N Cavasotto and Ruben A Abagyan. Protein flexibility in ligand docking and virtual screening to protein kinases. Journal of molecular biology, 337(1):209–225, 2004. [129] Maxim Totrov and Ruben Abagyan. Flexible ligand docking to multiple receptor conformations: a practical alternative. Current opinion in structural biology, 18(2):178–184, 2008. [130] B-Rao Chandrika, Jyothi Subramanian, and Somesh D Sharma. Managing protein flexibility in docking and its applications. Drug discovery today, 14(7-8):394–400, 2009. [131] Jacob D Durrant and J Andrew McCammon. Computer-aided drug-discovery techniques that account for receptor flexibility. Current opinion in pharmacology, 10(6):770–774, 2010. [132] Ferran Feixas, Steffen Lindert, William Sinko, and J Andrew McCammon. Exploring the role of receptor flexibility in structure-based drug discovery.
Biophysical chemistry, 186:31–45, 2014. [133] Sheng-You Huang and Xiaoqin Zou. Ensemble docking of multiple protein structures: considering protein structural variations in molecular docking. Proteins: Structure, Function, and Bioinformatics, 66(2):399–421, 2007. [134] Rommie E Amaro, Jerome Baudry, John Chodera, Özlem Demir, J Andrew McCammon, Yinglong Miao, and Jeremy C Smith. Ensemble docking in drug discovery. Biophysical journal, 114(10):2271–2278, 2018. [135] Ian R Craig, Jonathan W Essex, and Katrin Spiegel. Ensemble docking into multiple crystallographically derived protein structures: an evaluation based on the statistical analysis of enrichments. Journal of chemical information and modeling, 50(4):511–524, 2010. [136] Julie R Schames, Richard H Henchman, Jay S Siegel, Christoph A Sotriffer, Haihong Ni, and J Andrew McCammon. Discovery of a novel binding trench in hiv integrase. Journal of medicinal chemistry, 47(8):1879–1881, 2004. [137] Lily S Cheng, Rommie E Amaro, Dong Xu, Wilfred W Li, Peter W Arzberger, and J Andrew McCammon. Ensemble-based virtual screening reveals potential novel antiviral compounds for avian influenza neuraminidase. Journal of medicinal chemistry, 51(13):3878–3894, 2008. [138] Albert H Chan, Jeff Wereszczynski, Brendan R Amer, Sung Wook Yi, Michael E Jung, J Andrew McCammon, and Robert T Clubb. Discovery of staphylococcus aureus sortase a inhibitors using virtual screening and the relaxed complex scheme. Chemical biology & drug design, 82(4):418–428, 2013. [139] Wilfredo Evangelista Falcon, Sally R Ellingson, Jeremy C Smith, and Jerome Baudry. Ensemble docking in drug discovery: how many protein configurations from molecular dynamics simulations are needed to reproduce known ligand binding? The Journal of Physical Chemistry B, 123(25):5189–5195, 2019. [140] Xavier Barril and S David Morley. Unveiling the full potential of flexible receptor docking using multiple crystallographic structures.
Journal of medicinal chemistry, 48(13):4432–4443, 2005. [141] Manuel Rueda, Giovanni Bottegoni, and Ruben Abagyan. Recipes for the selection of experimental protein conformations for virtual screening. Journal of chemical information and modeling, 50(1):186–193, 2010. [142] Mengang Xu and Markus A Lill. Utilizing experimental data for reducing ensemble size in flexible-protein docking. Journal of chemical information and modeling, 52(1):187–198, 2012. [143] Nazanin Donyapour, Nicole M Roussey, and Alex Dickson. Revo: Resampling of ensembles by variation optimization. The Journal of Chemical Physics, 150(24):244112, 2019. [144] Anna Maria Capelli and Gabriele Costantino. Unbinding pathways of vegfr2 inhibitors revealed by steered molecular dynamics. Journal of Chemical Information and Modeling, 54(11):3124–3136, 2014. PMID: 25299731. [145] G.M. Torrie and J.P. Valleau. Nonphysical sampling distributions in monte carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977. [146] Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002. [147] Xiongwu Wu and Bernard R. Brooks. Self-guided langevin dynamics simulation method. Chemical Physics Letters, 381(3):512–518, 2003. [148] Luca Maragliano and Eric Vanden-Eijnden. A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chemical Physics Letters, 426(1):168–175, 2006. [149] Hongfeng Lou, Robert I Cukier, et al. Molecular dynamics of apo-adenylate kinase: a distance replica exchange method for the free energy of conformational fluctuations. Journal of Physical Chemistry B, 110(47):24121, 2006. [150] John D. Chodera, William C. Swope, Jed W. Pitera, and Ken A. Dill. Long-time protein folding dynamics from short-time molecular dynamics simulations. Multiscale Modeling & Simulation, 5(4):1214–1226, 2006.
[151] Christof Schütte, Frank Noé, Jianfeng Lu, Marco Sarich, and Eric Vanden-Eijnden. Markov state models based on milestoning. The Journal of chemical physics, 134(20):05B609, 2011. [152] Alex Dickson and Charles L Brooks III. Wexplore: hierarchical exploration of high-dimensional spaces using the weighted ensemble algorithm. The Journal of Physical Chemistry B, 118(13):3532–3542, 2014. [153] Alex Dickson, Anthony M. Mustoe, Loic Salmon, and Charles L. Brooks III. Efficient in silico exploration of rna interhelical conformations using euler angles and wexplore. Nucleic Acids Research, 42(19), 2014. [154] Bin W. Zhang, David Jasnow, and Daniel M. Zuckerman. The “weighted ensemble” path sampling method is statistically exact for a broad class of stochastic processes and binning procedures. The Journal of Chemical Physics, 132(5), 2010. [155] Nuria Plattner and Frank Noé. Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and markov models. Nature Communications, 6, July 2015. [156] Vittorio Limongelli, Massimiliano Bonomi, and Michele Parrinello. Funnel metadynamics as accurate binding free-energy method. Proceedings of the National Academy of Sciences, 110(16):6358–6363, 2013. [157] Pratyush Tiwary, Vittorio Limongelli, Matteo Salvalaglio, and Michele Parrinello. Kinetics of protein–ligand unbinding: Predicting pathways, rates, and rate-limiting steps. Proceedings of the National Academy of Sciences, 112(5):E386–E391, 2015. [158] Peter Eastman, Jason Swails, John D. Chodera, Robert T. McGibbon, Yutong Zhao, Kyle A. Beauchamp, Lee-Ping Wang, Andrew C. Simmonett, Matthew P. Harrigan, Chaya D. Stern, Rafal P. Wiewiora, Bernard R. Brooks, and Vijay S. Pande. Openmm 7: Rapid development of high performance algorithms for molecular dynamics. PLOS Computational Biology, 13(7):1–17, 07 2017. [159] Kenno Vanommeslaeghe and Alexander D MacKerell Jr.
Automation of the charmm general force field (cgenff) i: bond perception and atom typing. Journal of chemical information and modeling, 52(12):3144–3154, 2012. [160] Kenno Vanommeslaeghe, E Prabhu Raman, and Alexander D MacKerell Jr. Automation of the charmm general force field (cgenff) ii: assignment of bonded parameters and partial atomic charges. Journal of chemical information and modeling, 52(12):3155–3168, 2012. [161] Francesco Rao and Amedeo Caflisch. The protein folding network. Journal of Molecular Biology, 342(1):299–306, 2004. [162] Danzhi Huang and Amedeo Caflisch. The free energy landscape of small molecule unbinding. PLoS Computational Biology, 7:1–12, February 2011. [163] Alex Dickson and Charles L. Brooks. Native states of fast-folding proteins are kinetic traps. Journal of the American Chemical Society, 135(12):4729–4734, 2013. PMID: 23458553. [164] Kyle A. Beauchamp, Gregory R. Bowman, Thomas J. Lane, Lutz Maibaum, Imran S. Haque, and Vijay S. Pande. Msmbuilder2: Modeling conformational dynamics on the picosecond to millisecond scale. Journal of Chemical Theory and Computation, 7(10):3412–3419, 2011. PMID: 22125474. [165] Robert A Copeland. The dynamics of drug-target interactions: drug-target residence time and its impact on efficacy and safety. Expert Opinion on Drug Discovery, 5(4):305–310, 2010. PMID: 22823083. [166] Alex Dickson, Aryeh Warmflash, and Aaron R. Dinner. Separating forward and backward pathways in nonequilibrium umbrella sampling. Journal of Chemical Physics, 131(154104), 2009. [167] Eric Vanden-Eijnden and Maddalena Venturoli. Exact rate calculations by trajectory parallelization and tilting. Journal of Chemical Physics, 131(044120), 2009. [168] Ernesto Suárez, Steven Lettieri, Matthew C. Zwier, Carsen A. Stringer, Sundar Raman Subramanian, Lillian T. Chong, and Daniel M. Zuckerman. Simultaneous computation of dynamical and equilibrium information using a weighted ensemble of trajectories.
Journal of Chemical Theory and Computation, 10(7):2658–2667, 2014. PMID: 25246856. [169] Alex Dickson, Mark Maienschein-Cline, Allison Tovo-Dwyer, Jeff R. Hammond, and Aaron R. Dinner. Flow-dependent unfolding and refolding of an rna by nonequilibrium umbrella sampling. Journal of Chemical Theory and Computation, 7(9):2710–2720, 2011. PMID: 26605464. [170] Daniel M. Zuckerman and Lillian T. Chong. Weighted ensemble simulation: Review of methodology, applications, and software. Annual Review of Biophysics, 46(1):43–57, 2017. PMID: 28301772. [171] Terrell L Hill. Free Energy Transduction and Biochemical Cycle Kinetics. Springer, 1989. [172] Florent Guillain and Darwin Thusius. Use of proflavine as an indicator in temperature-jump studies of the binding of a competitive inhibitor to trypsin. Journal of the American Chemical Society, 92(18):5534–5536, 1970. [173] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks. 2009. [174] Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004. [175] Nazanin Donyapour, Matthew Hirn, and Alex Dickson. Classicalgsg: Prediction of log p using classical molecular force fields and geometric scattering for graphs. Journal of Computational Chemistry, 42(14):1006–1017, 2021. [176] Nazanin Donyapour and Alex Dickson. Predicting partition coefficients for the sampl7 physical property challenge using the classicalgsg method. Journal of Computer-Aided Molecular Design, 35(7):819–830, 2021. [177] Younggil Kwon. Handbook of essential pharmacokinetics, pharmacodynamics and drug metabolism for industrial scientists. Springer Science & Business Media, 2001. [178] Yingqing Ran and Samuel H Yalkowsky. Prediction of drug solubility by the general solubility equation (gse).
Journal of chemical information and computer sciences, 41(2):354–357, 2001. [179] Samuel H Yalkowsky and Shri C Valvani. Solubility and partitioning i: solubility of nonelectrolytes in water. Journal of pharmaceutical sciences, 69(8):912–922, 1980. [180] Thomas Ryckmans, Martin P Edwards, Val A Horne, Ana Monica Correia, Dafydd R Owen, Lisa R Thompson, Isabelle Tran, Michelle F Tutt, and Tim Young. Rapid assessment of a novel series of selective cb2 agonists using parallel synthesis protocols: A lipophilic efficiency (lipe) analysis. Bioorganic & medicinal chemistry letters, 19(15):4406–4409, 2009. [181] Sanjivanjit K Bhal, Karim Kassam, Ian G Peirson, and Greg M Pearl. The rule of five revisited: applying log d in place of log p in drug-likeness filters. Molecular pharmaceutics, 4(4):556–560, 2007. [182] Zhi Chen and Stephen G Weber. High-throughput method for lipophilicity measurement. Analytical chemistry, 79(3):1043–1049, 2007. [183] James Sangster. Octanol-water partition coefficients: fundamentals and physical chemistry, volume 1. John Wiley & Sons, 1997. [184] Tiejun Cheng, Yuan Zhao, Xun Li, Fu Lin, Yong Xu, Xinglong Zhang, Yan Li, Renxiao Wang, and Luhua Lai. Computation of octanol-water partition coefficients by guiding an additive model with knowledge. Journal of chemical information and modeling, 47(6):2140–2148, 2007. [185] Arup K Ghose and Gordon M Crippen. Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure-activity relationships i. partition coefficients as a measure of hydrophobicity. Journal of computational chemistry, 7(4):565–577, 1986. [186] Albert J Leo. Calculating log poct from structures. Chemical Reviews, 93(4):1281–1306, 1993. [187] William M Meylan and Philip H Howard. Atom/fragment contribution method for estimating octanol–water partition coefficients. Journal of pharmaceutical sciences, 84(1):83–92, 1995. [188] Jeffrey Plante and Stephane Werner.
Jplogp: an improved logp predictor trained using predicted data. Journal of Cheminformatics, 10(1):61, 2018. [189] László Molnár, György M Keserű, Ákos Papp, Zsolt Gulyás, and Ferenc Darvas. A neural network based prediction of octanol–water partition coefficients using atomic fragmental descriptors. Bioorganic & medicinal chemistry letters, 14(4):851–853, 2004. [190] Jarmo J Huuskonen, David J Livingstone, and Igor V Tetko. Neural network modeling for estimation of partition coefficient based on atom-type electrotopological state indices. Journal of chemical information and computer sciences, 40(4):947–955, 2000. [191] Ikuo Moriguchi, Shuichi Hirono, Qian Liu, Izumi Nakagome, and Yasuo Matsushita. Simple method of calculating octanol/water partition coefficient. Chemical and pharmaceutical bulletin, 40(1):127–130, 1992. [192] Antoine Daina, Olivier Michielin, and Vincent Zoete. ilogp: a simple, robust, and efficient description of n-octanol/water partition coefficient for drug design using the gb/sa approach. Journal of chemical information and modeling, 54(12):3284–3301, 2014. [193] Deliang Chen, Qingyun Wang, Yibao Li, Yongdong Li, Hui Zhou, and Yulan Fan. A general linear free energy relationship for predicting partition coefficients of neutral organic compounds. Chemosphere, 247:125869, 2020. [194] Raimund Mannhold, Gennadiy I Poda, Claude Ostermann, and Igor V Tetko. Calculation of molecular lipophilicity: State-of-the-art and comparison of log p methods on more than 96,000 compounds. Journal of pharmaceutical sciences, 98(3):861–893, 2009. [195] Igor V Tetko, Vsevolod Yu Tanchuk, and Alessandro EP Villa. Prediction of n-octanol/water partition coefficients from physprop database using artificial neural networks and e-state indices. Journal of chemical information and computer sciences, 41(5):1407–1421, 2001. [196] Igor V Tetko and Vsevolod Yu Tanchuk. Application of associative neural networks for prediction of lipophilicity in alogps 2.1 program.
Journal of chemical information and computer sciences, 42(5):1136–1145, 2002. [197] ADMET Predictor, version 2.3.0. Simulations Plus, Inc.: Lancaster, CA, 2009. [198] Silicos-it. Filter-it software. [199] Jian-Wei Zou, Wen-Na Zhao, Zhi-Cai Shang, Mei-Lan Huang, Ming Guo, and Qing-Sen Yu. A quantitative structure-property relationship analysis of logp for disubstituted benzenes. The Journal of Physical Chemistry A, 106(47):11550–11557, 2002. [200] Maximiliano Riquelme and Esteban Vöhringer-Martinez. Sampl6 octanol–water partition coefficients from alchemical free energy calculations with mbis atomic charges. Journal of Computer-Aided Molecular Design, pages 1–8, 2020. [201] Hai-Feng Chen. In silico log p prediction for a large data set with support vector machines, radial basis neural networks and multiple linear regression. Chemical biology & drug design, 74(2):142–147, 2009. [202] Edward W Lowe, Mariusz Butkiewicz, Matthew Spellings, Albert Omlor, and Jens Meiler. Comparative analysis of machine learning techniques for the prediction of logp. In 2011 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–6. IEEE, 2011. [203] Quan Liao, Jianhua Yao, and Shengang Yuan. Svm approach for predicting logp. Molecular diversity, 10(3):301–309, 2006. [204] Andreas Breindl, Bernd Beck, Timothy Clark, and Robert C Glen. Prediction of the n-octanol/water partition coefficient, logp, using a combination of semiempirical mo-calculations and a neural network. Molecular modeling annual, 3(3):142–155, 1997. [205] Dong-Sheng Cao, Qing-Song Xu, Qian-Nan Hu, and Yi-Zeng Liang. Chemopy: freely available python package for computational biology and chemoinformatics. Bioinformatics, 29(8):1092–1094, 2013. [206] Peng Gao, Jie Zhang, Yuzhu Sun, and Jianguo Yu. Accurate predictions of aqueous solubility of drug molecules via the multilevel graph convolutional network (mgcn) and schnet architectures.
Physical Chemistry Chemical Physics, 22(41):23766–23772, 2020. [207] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015. [208] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8):3370–3388, 2019. [209] Junshui Ma, Robert P Sheridan, Andy Liaw, George E Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structure–activity relationships. Journal of chemical information and modeling, 55(2):263–274, 2015. [210] SAMPL. Sampl. [211] Mehtap Işık, Teresa Danielle Bergazin, Thomas Fox, Andrea Rizzi, John D Chodera, and David L Mobley. Assessing the accuracy of octanol–water partition coefficient predictions in the sampl6 part ii log p challenge. Journal of computer-aided molecular design, 34(4):335–370, 2020. [212] Teresa Danielle Bergazin, Nicolas Tielker, Yingying Zhang, Junjun Mao, Marilyn R Gunner, Carlo Ballatore, Stefan Kast, David Mobley, et al. Evaluation of log p, pka and log d predictions from the sampl7 blind challenge. ChemRxiv, 2021. [213] Raymond Lui, Davy Guan, and Slade Matthews. A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the sampl6 logp prediction challenge. Journal of Computer-Aided Molecular Design, pages 1–12, 2020. [214] Andreas Krämer, Phillip S Hudson, Michael R Jones, and Bernard R Brooks. Multi-phase boltzmann weighting: accounting for local inhomogeneity in molecular simulations of water–octanol partition coefficients in the sampl6 challenge. Journal of Computer-Aided Molecular Design, pages 1–13, 2020.
[215] Ye Ding, You Xu, Cheng Qian, Jinfeng Chen, Jian Zhu, Houhou Huang, Yi Shi, and Jing Huang. Predicting partition coefficients of drug-like molecules in the sampl6 challenge with drude polarizable force fields. Journal of Computer-Aided Molecular Design, pages 1–15, 2020. [216] Shujie Fan, Bogdan I Iorga, and Oliver Beckstein. Prediction of octanol-water partition coefficients for the sampl6-log 𝑃 molecules using molecular dynamics simulations with opls-aa, amber and charmm force fields. Journal of Computer-Aided Molecular Design, pages 1–18, 2020. [217] Piero Procacci and Guido Guarnieri. Sampl6 blind predictions of water-octanol partition coefficients using nonequilibrium alchemical approaches. Journal of Computer-Aided Molecular Design, pages 1–14, 2019. [218] Aleksandr V Marenich, Christopher J Cramer, and Donald G Truhlar. Universal solvation model based on solute electron density and on a continuum model of the solvent defined by the bulk dielectric constant and atomic surface tensions. The Journal of Physical Chemistry B, 113(18):6378–6396, 2009. [219] Christoph Loschen, Jens Reinisch, and Andreas Klamt. Cosmo-rs based predictions for the sampl6 logp challenge. Journal of computer-aided molecular design, 34(4):385–392, 2020. [220] Nicolas Tielker, Daniel Tomazic, Lukas Eberlein, Stefan Güssregen, and Stefan M Kast. The sampl6 challenge on predicting octanol–water partition coefficients from ec-rism theory. Journal of computer-aided molecular design, pages 1–9, 2020. [221] Davy Guan, Raymond Lui, and Slade Matthews. Logp prediction performance with the smd solvation model and the m06 density functional family for sampl6 blind prediction challenge molecules. Journal of Computer-Aided Molecular Design, pages 1–12, 2020. [222] Michael R Jones and Bernard R Brooks. Quantum chemical predictions of water–octanol partition coefficients applied to the sampl6 log p blind challenge. Journal of Computer-Aided Molecular Design, pages 1–9, 2020.
[223] Jonathan A Ouimet and Andrew S Paluch. Predicting octanol/water partition coefficients for the sampl6 challenge using the sm12, sm8, and smd solvation models. Journal of Computer-Aided Molecular Design, pages 1–14, 2020. [224] William J Zamora, Silvana Pinheiro, Kilian German, Clara Ràfols, Carles Curutchet, and F Javier Luque. Prediction of the n-octanol/water partition coefficients in the sampl6 blind challenge from mst continuum solvation calculations. Journal of computer-aided molecular design, 34(4):443–451, 2020. [225] Shuzhe Wang and Sereina Riniker. Use of molecular dynamics fingerprints (mdfps) in sampl6 octanol–water log p blind challenge. Journal of Computer-Aided Molecular Design, pages 1–11, 2019. [226] Prajay Patel, David M Kuntz, Michael R Jones, Bernard R Brooks, and Angela K Wilson. Sampl6 log p challenge: machine learning and quantum mechanical approaches. Journal of Computer-Aided Molecular Design, pages 1–16, 2020. [227] Evrim Arslan, Basak K Findik, and Viktorya Aviyente. A blind sampl6 challenge: insight into the octanol-water partition coefficients of drug-like molecules via a dft approach. Journal of Computer-Aided Molecular Design, pages 1–8, 2020. [228] Kenno Vanommeslaeghe and Alexander D MacKerell Jr. Automation of the charmm general force field (cgenff) i: bond perception and atom typing. Journal of chemical information and modeling, 52(12):3144–3154, 2012. [229] Kenno Vanommeslaeghe, E Prabhu Raman, and Alexander D MacKerell Jr. Automation of the charmm general force field (cgenff) ii: assignment of bonded parameters and partial atomic charges. Journal of chemical information and modeling, 52(12):3155–3168, 2012. [230] James A Maier, Carmenza Martinez, Koushik Kasavajhala, Lauren Wickstrom, Kevin E Hauser, and Carlos Simmerling. ff14sb: improving the accuracy of protein side chain and backbone parameters from ff99sb. Journal of chemical theory and computation, 11(8):3696–3713, 2015.
[231] Dario Vassetti, Marco Pagliai, and Piero Procacci. Assessment of gaff2 and opls-aa general force fields in combination with the water models tip3p, spce, and opc3 for the solvation free energy of druglike organic molecules. Journal of Chemical Theory and Computation, 15(3):1983–1995, 2019. [232] Junmei Wang, Romain M Wolf, James W Caldwell, Peter A Kollman, and David A Case. Development and testing of a general amber force field. Journal of computational chemistry, 25(9):1157–1174, 2004. [233] Anthony K Rappé, Carla J Casewit, KS Colwell, William A Goddard III, and W Mason Skiff. Uff, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American chemical society, 114(25):10024–10035, 1992. [234] Thomas A Halgren. Merck molecular force field. i. basis, form, scope, parameterization, and performance of mmff94. Journal of computational chemistry, 17(5-6):490–519, 1996. [235] Thomas A Halgren. Merck molecular force field. ii. mmff94 van der waals and electrostatic parameters for intermolecular interactions. Journal of Computational Chemistry, 17(5-6):520–552, 1996. [236] Tommi Hassinen and Mikael Peräkylä. New energy terms for reduced protein models implemented in an off-lattice force field. Journal of Computational Chemistry, 22(12):1229–1242, 2001. [237] P Howard and W Meylan. Physical/chemical property database (physprop), 1999. [238] DA Case, IY Ben-Shalom, SR Brozell, DS Cerutti, TE Cheatham III, VWD Cruzeiro, TA Darden, RE Duke, D Ghoreishi, MK Gilson, et al. Amber 2018. University of California, San Francisco, 2018. [239] Jarmo J Huuskonen, Alessandro EP Villa, and Igor V Tetko. Prediction of partition coefficient based on atom-type electrotopological state indices. Journal of pharmaceutical sciences, 88(2):229–233, 1999. [240] Gilles Klopman, Ju-Yun Li, Shaomeng Wang, and Mario Dimayuga. Computer automated log p calculations based on an extended group contribution approach.
Journal of Chemical Information and Computer Sciences, 34(4):752–781, 1994. [241] C Hansch, A Leo, and D Hoekman. Hydrophobic, electronic, and steric constants: exploring qsar, 1995. [242] RDKit. Program. [243] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019. [244] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [245] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825–2830, 2011. [246] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [247] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015. [248] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017. [249] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. [250] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS central science, 3(4):283–293, 2017. [251] Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In Advances in neural information processing systems, pages 857–864, 2003. [252] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
[253] TM Heskes, WAJJ Wiegerinck, and HJ Kappen. Practical confidence and prediction intervals for prediction tasks. Progress in Neural Processing, pages 128–135, 1997.
[254] Sricharan Kumar and Ashok Srivastava. Bootstrap prediction intervals in non-parametric regression with applications to anomaly detection. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2012.
[255] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994.
[256] Noel M O'Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3(1):33, 2011.
[257] Open Babel. Program.
[258] Ismael J Hidalgo, Thomas J Raub, and Ronald T Borchardt. Characterization of the human colon carcinoma cell line (Caco-2) as a model system for intestinal epithelial permeability. Gastroenterology, 96(2):736–749, 1989.
[259] Robert D Brown and Yvonne C Martin. The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. Journal of Chemical Information and Computer Sciences, 37(1):1–9, 1997.
[260] Georgia B McGaughey, Robert P Sheridan, Christopher I Bayly, J Chris Culberson, Constantine Kreatsoulas, Stacey Lindsley, Vladimir Maiorov, Jean-Francois Truchon, and Wendy D Cornell. Comparison of topological, shape, and docking methods in virtual screening. Journal of Chemical Information and Modeling, 47(4):1504–1519, 2007.
[261] Jordi Mestres and Ronald Knegtel. Similarity versus docking in 3D virtual screening. In Virtual Screening: An Alternative or Complement to High Throughput Screening?, pages 191–207. Springer, 2000.
[262] JS Mason, AC Good, and EJ Martin. 3-D pharmacophores in drug discovery. Current Pharmaceutical Design, 7(7):567–597, 2001.
[263] Jayashree Srinivasan, Angelo Castellino, Erin K Bradley, John E Eksterowicz, Peter DJ Grootenhuis, Santosh Putta, and Robert V Stanton. Evaluation of a novel shape-based computational filter for lead evolution: Application to thrombin inhibitors. Journal of Medicinal Chemistry, 45(12):2494–2500, 2002.
[264] Sheng-You Huang, Sam Z Grinter, and Xiaoqin Zou. Scoring functions and their evaluation methods for protein–ligand docking: recent advances and future directions. Physical Chemistry Chemical Physics, 12(40):12899–12908, 2010.
[265] Hannes Stärk, Octavian-Eugen Ganea, Lagnajit Pattanaik, Regina Barzilay, and Tommi Jaakkola. EquiBind: Geometric deep learning for drug binding structure prediction. arXiv preprint arXiv:2202.05146, 2022.
[266] N Moitessier, P Englebienne, D Lee, J Lawandi, and CR Corbeil. Towards the development of universal, fast and highly accurate docking/scoring methods: a long way to go. British Journal of Pharmacology, 153(S1):S7–S26, 2008.
[267] Garrett M Morris, Ruth Huey, William Lindstrom, Michel F Sanner, Richard K Belew, David S Goodsell, and Arthur J Olson. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30(16):2785–2791, 2009.
[268] William J Allen, Trent E Balius, Sudipto Mukherjee, Scott R Brozell, Demetri T Moustakas, P Therese Lang, David A Case, Irwin D Kuntz, and Robert C Rizzo. DOCK 6: Impact of new features and current docking performance. Journal of Computational Chemistry, 36(15):1132–1156, 2015.
[269] Oleg Trott and Arthur J Olson. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.
[270] Woong-Hee Shin and Chaok Seok. GalaxyDock: Protein–ligand docking with flexible protein side-chains. Journal of Chemical Information and Modeling, 52(12):3225–3232, 2012.
[271] Christopher R Corbeil and Nicolas Moitessier. Docking ligands into flexible and solvated macromolecules. 3. Impact of input ligand conformation, protein flexibility, and water molecules on the accuracy of docking programs. Journal of Chemical Information and Modeling, 49(4):997–1009, 2009.
[272] Ian W. Davis, Kaushik Raha, Martha S. Head, and David Baker. Blind docking of pharmaceutically relevant compounds using RosettaLigand. Protein Science, 18, 2009.
[273] Edward B. Miller, Robert B. Murphy, Daniel Sindhikara, Kenneth W. Borrelli, Matthew J. Grisewood, Fabio Ranalli, Steven L. Dixon, Steven Jerome, Nicholas A. Boyles, Tyler Day, Phani Ghanakota, Sayan Mondal, Salma B. Rafi, Dawn M. Troast, Robert Abel, and Richard A. Friesner. Reliable and accurate solution to the induced fit docking problem for protein-ligand binding. Journal of Chemical Theory and Computation, 17, 2021.
[274] Kathryn M Hart, Chris MW Ho, Supratik Dutta, Michael L Gross, and Gregory R Bowman. Modelling proteins' hidden conformations to predict antibiotic resistance. Nature Communications, 7(1):1–10, 2016.
[275] David DL Minh. Implicit ligand theory: Rigorous binding free energies and thermodynamic expectations from molecular docking. The Journal of Chemical Physics, 137(10):104106, 2012.
[276] A Peters, ME Lundberg, PT Lang, and CP Sosa. High throughput computing validation for drug discovery using the DOCK program on a massively parallel system. Technical report, IBM Redpaper REDP-4410, 2008.
[277] Barbara Collignon, Roland Schulz, Jeremy C Smith, and Jerome Baudry. Task-parallel message passing interface implementation of AutoDock4 for docking of very large databases of compounds using high-performance super-computers. Journal of Computational Chemistry, 32(6):1202–1209, 2011.
[278] Sally R Ellingson, Jeremy C Smith, and Jerome Baudry. VinaMPI: Facilitating multiple receptor high-throughput virtual docking on high-performance computers, 2013.
[279] Xiaohua Zhang, Sergio E Wong, and Felice C Lightstone. Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines. Journal of Computational Chemistry, 34(11):915–927, 2013.
[280] John E Stone, David J Hardy, Ivan S Ufimtsev, and Klaus Schulten. GPU-accelerated molecular modeling coming of age. Journal of Molecular Graphics and Modelling, 29(2):116–125, 2010.
[281] Oliver Korb, Thomas Stützle, and Thomas E Exner. Accelerating molecular docking calculations using graphics processing units. Journal of Chemical Information and Modeling, 51(4):865–876, 2011.
[282] Serkan Altuntaş, Zeki Bozkus, and Basilio B Fraguela. GPU accelerated molecular docking simulation with genetic algorithms. In European Conference on the Applications of Evolutionary Computation, pages 134–146. Springer, 2016.
[283] Xinqiang Ding, Yujin Wu, Yanming Wang, Jonah Z Vilseck, and Charles L Brooks III. Accelerated CDOCKER with GPUs, parallel simulated annealing, and fast Fourier transforms. Journal of Chemical Theory and Computation, 16(6):3910–3919, 2020.
[284] Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas F Tillack, Michel F Sanner, Andreas Koch, and Stefano Forli. Accelerating AutoDock4 with GPUs and gradient-based local search. Journal of Chemical Theory and Computation, 17(2):1060–1073, 2021.
[285] Jianhua Li, Guanlong Liu, Zhiyuan Zhen, Zihao Shen, Shiliang Li, and Honglin Li. Molecular docking for ligand-receptor binding process based on heterogeneous computing. Scientific Programming, 2022, 2022.
[286] Cyrus Levinthal. Mössbauer spectroscopy in biological systems. In Proceedings of a meeting held at Allerton House. P. Debrunner, JCM Tsibris, and E. Münck, editors. University of Illinois Press, Urbana, IL, 1969.
[287] Peter E Leopold, Mauricio Montal, and José N Onuchic. Protein folding funnels: A kinetic approach to the sequence-structure relationship. Proceedings of the National Academy of Sciences, 89(18):8721–8725, 1992.
[288] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[289] Xiang Gao, Farhad Ramezanghorbani, Olexandr Isayev, Justin S Smith, and Adrian E Roitberg. TorchANI: A free and open source PyTorch-based deep learning implementation of the ANI neural network potentials. Journal of Chemical Information and Modeling, 60(7):3408–3415, 2020.
[290] Jing Huang and Alexander D MacKerell Jr. CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data. Journal of Computational Chemistry, 34(25):2135–2145, 2013.
[291] Bhalachandra L Tembre and J Andrew McCammon. Ligand-receptor interactions. Computers & Chemistry, 8(4):281–283, 1984.
[292] Pavel V Klimovich, Michael R Shirts, and David L Mobley. Guidelines for the analysis of free energy calculations. Journal of Computer-Aided Molecular Design, 29(5):397–411, 2015.
[293] David L Mobley, Alan P Graves, John D Chodera, Andrea C McReynolds, Brian K Shoichet, and Ken A Dill. Predicting absolute ligand binding free energies to a simple model site. Journal of Molecular Biology, 371(4):1118–1134, 2007.
[294] John D Chodera, David L Mobley, Michael R Shirts, Richard W Dixon, Kim Branson, and Vijay S Pande. Alchemical free energy methods for drug discovery: progress and challenges. Current Opinion in Structural Biology, 21(2):150–160, 2011.
[295] Xianjun Kong and Charles L Brooks III. λ-dynamics: A new approach to free energy calculations. The Journal of Chemical Physics, 105(6):2414–2423, 1996.
[296] Jennifer L Knight and Charles L Brooks III. Multisite λ dynamics for simulated structure–activity relationship studies. Journal of Chemical Theory and Computation, 7(9):2728–2739, 2011.
[297] Christopher Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14):2690, 1997.
[298] Peter Eastman, Jason Swails, John D Chodera, Robert T McGibbon, Yutong Zhao, Kyle A Beauchamp, Lee-Ping Wang, Andrew C Simmonett, Matthew P Harrigan, Chaya D Stern, et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Computational Biology, 13(7):e1005659, 2017.
[299] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[300] Sunhwan Jo, Taehoon Kim, Vidyashankara G Iyer, and Wonpil Im. CHARMM-GUI: A web-based graphical user interface for CHARMM. Journal of Computational Chemistry, 29(11):1859–1865, 2008.
[301] K. Vanommeslaeghe, E. Hatcher, C. Acharya, S. Kundu, S. Zhong, J. Shim, E. Darian, O. Guvench, P. Lopes, I. Vorobyov, and A. D. MacKerell. CHARMM General Force Field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. Journal of Computational Chemistry, 31(4):671–690, 2010.
[302] Nagakumar Bharatham, Peter J Slavish, William R Shadrick, Brandon M Young, and Anang A Shelat. The role of ZA channel water-mediated interactions in the design of bromodomain-selective BET inhibitors. Journal of Molecular Graphics and Modelling, 81:197–210, 2018.
[303] Chun-wa Chung, Hervé Coste, Julia H White, Olivier Mirguet, Jonathan Wilde, Romain L Gosmini, Chris Delves, Sylvie M Magny, Robert Woodward, Stephen A Hughes, et al. Discovery and characterization of small molecule inhibitors of the BET family bromodomains. Journal of Medicinal Chemistry, 54(11):3827–3838, 2011.
[304] Montserrat Pérez-Salvia and Manel Esteller. Bromodomain inhibitors and cancer therapy: From structures to applications. Epigenetics, 12(5):323–339, 2017.
[305] Monica M Mita and Alain C Mita. Bromodomain inhibitors a decade later: a promise unfulfilled?, 2020.
[306] Chun-wa Chung, Anthony W Dean, James M Woolven, and Paul Bamborough. Fragment-based discovery of bromodomain inhibitors part 1: Inhibitor binding modes and implications for lead discovery. Journal of Medicinal Chemistry, 55(2):576–586, 2012.
[307] Tom Dixon, Samuel D Lotz, and Alex Dickson. Predicting ligand binding affinity using on- and off-rates for the SAMPL6 SAMPLing challenge. Journal of Computer-Aided Molecular Design, 32(10):1001–1012, 2018.
[308] Robert Hall, Tom Dixon, and Alex Dickson. On calculating free energy differences using ensembles of transition paths. Frontiers in Molecular Biosciences, 7:106, 2020.
[309] Tom Dixon, Derek MacPherson, Barmak Mostofian, Taras Dauzhenka, Samuel Lotz, Dwight McGee, Sharon Shechter, Utsab R Shrestha, Rafal Wiewiora, Zachary A McDargh, et al. Atomic-resolution prediction of degrader-mediated ternary complex structures by combining molecular simulations with hydrogen deuterium exchange. bioRxiv, 2021.
[310] Tom Dixon, Arzu Uyar, Shelagh Ferguson-Miller, and Alex Dickson. Membrane-mediated ligand unbinding of the PK-11195 ligand from TSPO. Biophysical Journal, 120(1):158–167, 2021.
[311] Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9(1):1–11, 2018.
[312] Andreas Mardt, Luca Pasquali, Frank Noé, and Hao Wu. Deep learning Markov and Koopman models with physical constraints. In Mathematical and Scientific Machine Learning, pages 451–475. PMLR, 2020.
[313] Dirk G Cattrysse and Luk N Van Wassenhove. A survey of algorithms for the generalized assignment problem. European Journal of Operational Research, 60(3):260–272, 1992.