PREDICTING THE PROPERTIES OF LIGANDS USING MOLECULAR DYNAMICS AND MACHINE LEARNING By Nazanin Donyapour A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering – Doctor of Philosophy 2022

ABSTRACT

PREDICTING THE PROPERTIES OF LIGANDS USING MOLECULAR DYNAMICS AND MACHINE LEARNING By Nazanin Donyapour

The discovery and design of new drugs requires extensive experimental assays that are usually very expensive and time-consuming. To cut down the cost and time of the drug development process and help design effective drugs more efficiently, various computational methods have been developed that are referred to collectively as in silico drug design. These in silico methods can be used not only to identify compounds that can bind to a target receptor but also to determine whether those compounds show ideal drug-like properties. In this dissertation, I provide solutions to these problems by developing novel methods for molecular simulation and molecular property prediction. Firstly, we have developed a new enhanced sampling MD algorithm called Resampling of Ensembles by Variation Optimization, or “REVO”, that can generate binding and unbinding pathways of ligand-target interactions. These pathways are useful for calculating transition rates and Residence Times (RT) of protein-ligand complexes. This is particularly useful for drug design, as studies on some systems show that drug efficacy correlates more strongly with RT than with binding affinity. The method is also generally useful for generating long-timescale transitions in complex systems, including alternate ligand binding poses and protein conformational changes. Secondly, we have developed a technique we refer to as “ClassicalGSG” to predict the partition coefficient (log 𝑃) of small molecules.
log 𝑃 is one of the main factors in determining the drug-likeness of a compound, as it helps determine bioavailability, solubility, and membrane permeability. This method has been very successful compared to other methods in the literature. Finally, we have developed a method called “Flexible Topology” that we hope can eventually be used to screen a database of potential ligands while accounting for ligand-induced conformational changes. After molecules with drug-like properties are discovered in the drug design pipeline, Virtual Screening (VS) methods are employed to perform an extensive search over drug databases with hundreds of millions of compounds to find candidates that bind tightly to a molecular target. However, to keep this computationally tractable, typically only static snapshots of the target are used, which cannot respond to the presence of the drug compound. To efficiently capture drug-target interactions during screening, we have developed a machine-learning algorithm that employs Molecular Dynamics (MD) simulations with a protein of interest and a set of atoms called “Ghost Particles”. During the simulation, the Flexible Topology method induces forces that constantly modify the ghost particles and optimize them toward drug-like molecules that are compatible with the molecular target.

Copyright by NAZANIN DONYAPOUR 2022

To my parents, husband and daughter

ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Dr. Alex Dickson, for his support, wisdom, enthusiasm, and encouragement. He has been a kind and patient advisor and has tirelessly helped me become a better researcher. Second, I want to extend my gratitude to my Ph.D. committee members, Dr. Scott Barton, Dr. Matthew Hirn, and Dr. Ekaterina Rapinchuk, for their insightful comments and feedback. I thank my fellow lab members for their ample support, particularly Nicole M. Roussey, Dr. Tom Dixon, and Dr. Samik Bose.
I would also like to thank the Department of Computational Mathematics, Science, and Engineering for their guidance. My appreciation goes to my family and friends for their unconditional love and continuous support. In particular, I offer many thanks to my beloved husband, Dr. Hamid Karimi, for his incredible support along my Ph.D. journey. I express my sincere gratitude to my dear friend, Mona Lavasani, who has always been there for me. I cannot express how grateful I am to my parents and in-laws for their tremendous help. Finally, my last words go to my late mother, Farahnaz Emami, the light of my life, whose presence in my heart is the source of my strength.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Kinetics-orientated drug design
1.2 Predicting molecular properties
1.3 Virtual Screening
CHAPTER 2 PREDICTING THE RESIDENCE TIME OF DRUG MOLECULES (REVO)
2.1 Introduction
2.2 Methods
2.2.1 Generalized framework for Weighted ensemble sampling
2.2.2 REVO resampling algorithm
2.2.3 WExplore sampling algorithm
2.2.4 𝑁-dimensional biased random walk
2.2.5 Trypsin-benzamidine system
2.2.6 Clustering and network visualizations
2.3 Results
2.3.1 𝑁-dimensional random walk
2.3.2 Choosing an optimal distance exponent (𝛼) for ligand unbinding simulations
2.3.3 Trypsin-benzamidine ligand unbinding
2.3.3.1 Residence time
2.3.3.2 Heterogeneity of ligand unbinding pathways
2.4 Discussion
CHAPTER 3 PREDICTING LOGP OF SMALL MOLECULES
3.1 Introduction
3.2 Methods
3.2.1 Datasets
3.2.2 Generating atomic attributes
3.2.3 Log P predictions using GSG
3.2.3.1 Geometric Scattering for graphs
3.2.3.2 Neural networks architecture and training
3.2.4 Log P predictions using GCNs
3.2.4.1 Graph convolutional neural network
3.2.4.2 GCN architecture and training
3.3 Results
3.3.1 Evaluation of molecular representations
3.3.2 Geometric scattering vs graph convolution
3.3.3 Features visualization
3.3.4 Distinguishing features of failed molecules
3.3.5 Uncertainty estimations for the SAMPL7 predictions
3.3.6 Predictions for SAMPL7 log 𝑃 challenge
3.3.7 Performance of Open-source force fields
3.4 Discussion and Conclusions
CHAPTER 4 MODELING COMPOUNDS IN BINDING SITE WITH THE FLEXIBLE TOPOLOGY METHOD
4.1 Introduction
4.2 Methods
4.2.1 Framework of Flexible Topology
4.2.2 Symmetry-invariant Molecular Features
4.2.3 Continuous Atomic Identities
4.2.4 Loss function and Flexible Topology Restraint Energy
4.2.5 Implementation details
4.2.5.1 Bromodomain simulation setup details
4.3 Results
4.3.1 Assembly Simulations in Vacuum
4.4 Assembly in a Bromodomain binding pocket
4.5 Discussion and Conclusions
CHAPTER 5 SUMMARY AND FUTURE DIRECTION
5.1 Summary
5.2 Future Directions
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: Characteristic distance values for the 𝑁-dimensional random walk.
Table 2.2: Cluster counts for all simulations.
Table 2.3: Co-clustering information.
Table 3.1: Classifications of log 𝑃 prediction models.
Table 3.2: Independent test sets used for evaluating ClassicalGSG models.
Table 3.3: log 𝑃 datasets used for training.
Table 3.4: Classification of CGenFF atom types into 36 categories (AC36). C_Name is the name of the category, and C_Value shows the value of the category.
Table 3.5: Classification of GAFF2 atom types into 31 categories (AC31). C_Name is the name of the category, and C_Value shows the value of the category.
Table 3.6: Classification of atoms into 5 categories (AC5).
Table 3.7: Neural network settings. Square brackets denote possible parameter values used in the grid search method.
Table 3.8: Neural network settings.
Table 3.9: Parameters for the Geometric Scattering for Graphs algorithm. The square brackets show all of the values examined for each parameter. For “Scattering operators”, ‘z’ represents the zero-order operator, ‘f’ is first order, and ‘s’ is second order.
Table 3.10: The log 𝑃 prediction results using different methods on the FDA test set. The result from our model is shown in red.
The TopP-S is the best performing method (GBDT-ESTD+-2-AD) on the FDA test set from [92]. Table from [92].
Table 3.11: The log 𝑃 prediction performance results for four independent test sets.
Table 3.12: The log 𝑃 prediction results using different methods on the Star test set. The result from our model is shown in red. The TopP-S is the best performing method (MT-ESTD+-1-AD) on the Star test set from [92]. Table from [92].
Table 3.13: The log 𝑃 prediction results using different methods on the NonStar test set. The result from our model is shown in red. The TopP-S is the best performing method (MT-ESTD+-2-AD) on the NonStar test set from [92]. Table from [92].
Table 3.14: The log 𝑃 prediction results using different sets of features and models.
Table 3.15: The averaged KL-divergence between fingerprint distributions of all data versus failed molecules, averaged over 5 GCGNN models.
Table 3.16: The PIs and coverage range for the SAMPL7 molecules using four ClassicalGSG methods. 𝑦ˆ is the predicted log 𝑃.
Table 3.17: The log 𝑃 prediction results for the SAMPL7 molecules.
Table 3.18: Sets of parameters used to evaluate MMFF94 ClassicalGSG models on external test sets. Sets in square brackets denote all the GSG parameter values used for generating the molecular features. For scattering operators, “z” denotes the zeroth-order operator (Equation 3.1), “f” is first order (Equation 3.2), and “s” is second order (Equation 3.3).
Table 3.19: The log 𝑃 prediction results from MMFF94 force field parameters for the external test sets.
Table 3.20: The log 𝑃 prediction results from different log 𝑃 methods for the FDA test set.
The TopP-S is the best performing method (GBDT-ESTD+-2-AD) on the FDA test set from [92]. Table from [92].
Table 3.21: The log 𝑃 prediction results from different log 𝑃 methods for the Star test set. The TopP-S is the best performing method (MT-ESTD+-1-AD) on the Star test set from [92]. Table from [92].
Table 3.22: The log 𝑃 prediction results from different log 𝑃 methods for the NonStar test set. The TopP-S is the best performing method (MT-ESTD+-2-AD) on the NonStar test set from [92]. Table from [92].
Table 4.1: Parameter values for Behler-Parrinello functions, grouped into radial and angular components. The number of values used for each parameter is shown in the 𝑛vals column. The product of 𝑛vals is the total number of features for each component (𝑁feat).
Table 4.2: Minimum and maximum parameter values for ghost particle attributes.

LIST OF FIGURES

Figure 2.1: The WE simulation framework. Each walker is represented as a circle. The size of a circle represents the weight, and different colors represent different conformations. An ensemble of walkers with the same weight and conformation is run for a set number of steps (“Dynamics”). Then resampling is performed. These two steps repeat until the simulation ends.
Figure 2.2: Average predicted probability distributions. The black curve is the target probability. Probability distributions are averaged over all 10 runs for (A) REVO and (B) WExplore.
Figure 2.3: (A) The calculated accuracy values are shown for the three methods, averaged over 10 runs. REVO outperforms the WExplore and CONV methods for all dimensions.
(B) The range of visited 𝑥 values for the three methods, averaged over 10 runs. REVO explores a broader sampling space than the WExplore and CONV methods for all dimensions. Error bars show the standard error of the mean across the set of runs.
Figure 2.4: (A) The final walker positions for a representative 𝑁 = 2 simulation are shown as points, overlayed on a probability distribution heat map calculated using 𝑃𝑡 (𝑥, 𝑦) = 𝑃𝑡 (𝑥)𝑃𝑡 (𝑦). (B) Probability distributions for the distance to the origin, averaged over all simulations, all walkers, and 10 cycles spaced evenly between cycle 1000 and cycle 10000. Distributions are shown separately for 𝑁 = 2, 5, 10 and 20. (C) The expectation values of these curves as a function of 𝑁.
Figure 2.5: Accuracy (𝐴) and range values considering only the first two dimensions. Values are averaged over 10 simulations in each case and the error bars indicate the standard error of the mean.
Figure 2.6: (Top) Average accuracy (left) and range (right) values for different values of the distance exponent 𝛼: 𝛼 = 1 (black), 𝛼 = 2 (green) and 𝛼 = 4 (orange). (Bottom) Average accuracy and range values are shown both with (orange) and without (green) the weight novelty term (𝜙).
Figure 2.7: Representative ensembles from WExplore simulations are shown at “early” and “late” timepoints, as described in the text. (Top) Ligand density distributions from the trypsin-benzamidine WExplore simulations reported here. (Bottom) Ligand density distributions for the TPPU ligand unbinding from the enzyme soluble epoxide hydrolase (sEH). Ensembles were taken from previous work [88]. Below each snapshot, a bar graph of trajectory weights is shown, sorted from highest to lowest.
Figure 2.8: The trajectory variation determined for four 𝛼 values for (A) trypsin-benzamidine and (B) sEH-TPPU.
Figure 2.9: The average unbound probability for all runs for (A) REVO and (B) WExplore. The thick blue region represents the standard error of the mean at each time point. The black curve shows the average probability over all runs.
Figure 2.10: Average predicted residence times are shown in black for (A) REVO and (B) WExplore. The red line shows the experimental residence time for the trypsin-benzamidine system [172].
Figure 2.11: Average trajectory flux values are shown using all possible sub-samples over the set of five runs. Individual averages are shown as points, and the probability of the sub-samples is shown using a violin plot. The trajectory flux corresponding to the experimental residence time is shown as a horizontal red line.
Figure 2.12: Network representations of the free energy landscapes of binding are shown for REVO (A) and WExplore (B). In both cases, discrete transition path ensembles were visually identified and labeled. Nodes are colored according to their ligand solvent accessible surface area using the color bar at the bottom of the figure, and node size corresponds to the statistical weight of the states. Representative conformations are shown to depict each ligand unbinding pathway for REVO (C) and WExplore (D). Loop regions 209-218 and 179-190 are shown in blue (left) and orange (right).
Figure 2.13: Ligand densities for unique clusters visited by REVO (red) and WExplore (yellow). Density maps are plotted using the VMD Volmap tool with an isosurface value of 0.02.
Figure 3.1: Architecture of the GSG method.
The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes that are shown as colored bars. Wavelet matrices Ψ are built at different logarithmic scales, 𝑗, using the adjacency matrix as described in the text. Finally, the scattering transform is applied to get the graph features using both the wavelet matrices and the signal vectors. Modified from a figure made by Feng et al. [105].
Figure 3.2: Architecture of the GCN method. The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes, shown as colored bars. GCN layers are shown in gray and are followed by a max-pooling layer, shown in purple. The graph gathering layer, shown in green, adds the features of all nodes to generate the molecular feature vector.
Figure 3.3: Average 𝑟² (A) and RMSE (B) for the OpenChem test set using GSGNN models. Each average is calculated over 20 individual parameter values and the error bars show the best and worst performing models. The atomic attributes are generated with either the CGenFF or GAFF2 force field and using one of the atom type classification schemes (“AC1”, “AC5”, “AC36/AC31” or “ACall”).
Figure 3.4: The 𝑟² for the OpenChem test set using GSGNN models. The atomic attributes are all generated with the CGenFF force field, the AC36 atom type classification scheme, and 2D molecular structure.
Figure 3.5: The 𝑟² for different test sets using GSGNN models. A) shows 𝑟² for the FDA test set. B) shows 𝑟² for the Huuskonen test set. C) and D) show 𝑟² for the Star and NonStar test sets, respectively. The horizontal axis indicates the maximum wavelet scale 𝐽.
The atomic attributes are generated with 2D molecular structure, the CGenFF force field, and the AC36 atom type classification scheme.
Figure 3.6: The t-SNE plots with GSG and NN features of the OpenChem test set molecules. Each point represents a molecule and is colored by its actual log 𝑃 value. ⟨Δ log 𝑃⟩𝑁 shows the mean log 𝑃 difference calculated over the nearest neighbors in the t-SNE plot. A) The GSG features of size 1716 projected into 2-dimensional space. B) The NN features from the last hidden layer, of size 400, projected into 2-dimensional space.
Figure 3.7: Probability distributions of molecular fingerprints. The histograms show the distribution of fingerprints of all data and of failed molecules from 5 GCGNN models. The distribution of all data is shown as a thick black line. A) The number of shortest paths of length 2, B) the atomic weight, C) the number of carbon atoms (ncarb) and D) the number of heavy atoms.
Figure 3.8: Cumulative distribution of MAE of molecules in the S7_TEST set. The solid blue line shows the cumulative distribution for each set of predictions. The dashed red line represents the probability of 90%. Panels A through D show MAEs using models trained on DB1 through DB4, respectively.
Figure 3.9: Prediction intervals of log 𝑃 predictions for the SAMPL7 molecules. The experimental log 𝑃 values are shown as red circles in a scatter plot. The predictions are shown as a red line, and the wide orange range shows the prediction intervals (PIs). Panels A through D show predictions from models trained on DB1 through DB4, respectively. In all cases, data is sorted according to the predicted log 𝑃 values.
Figure 3.10: The best fit lines for prediction sets. The experimental versus predicted values are shown as red circles in a scatter plot.
The actual fit line is shown in orange. The dashed blue curve shows the best fit line. A) predictions using DB1, B) predictions using DB2, C) predictions using DB3, and D) predictions using the DB4 training set.
Figure 3.11: The log 𝑃 predictions from our submissions to the SAMPL7 challenge. The orange line shows the experimental values. The ClassicalGSG predictions are shown as circles (DB1: blue, DB2: orange, DB3: green, DB4: red). The thick orange area shows the MAE interval of 0.44, which is the lowest MAE of our submitted predictions (ClassicalGSG-DB2). Molecules are labeled with their molecule ID from SAMPL7 [212].
Figure 3.12: Results of ClassicalGSG models trained using open-source force field parameters. Error bars are computed over five independently-trained models. These models are trained using the 2D structure information and all the scattering moments, with a maximum wavelet scale (𝐽) of 4. For each set of ClassicalGSG models trained using these force field parameters we show A) the average RMSE, and B) the average 𝑟².
Figure 4.1: The framework of the Flexible Topology method. The gray and blue spheres show the ghost particles. MD simulations start with an initial structure of the protein of interest and randomly initialized ghost particles. The MLForce plugin computes external forces on the ghost particle positions and attributes using a loss function, 𝐿. 𝑋𝐺𝑃 denotes the ghost particle positions, 𝐴𝐺𝑃 represents the atomic attributes of the ghost particles, and 𝑀 is the mapping function. 𝐹𝐺𝑃 and 𝐹𝑇 show the ghost particle and target ligand features, respectively. The final output shows the ghost particles assembled into a drug-like molecule in the binding pocket of the protein.
Figure 4.2: The assembly simulation target molecules, randomly chosen from the OpenChem dataset [55], are shown as 2D structures. The number of atoms for each molecule is written under it. The 2D structures are generated and drawn with MarvinSketch using implicit hydrogens.
Figure 4.3: Average values of RMSD and 𝑈𝐹𝑇 for assembly simulations run with different MLForce scale values. Error bars show standard errors calculated over 3 separate assembly simulations for the last 10% of data. A) RMSD between the positions of the ghost particles and the target molecule. B) The potential energy of ghost-ghost interactions, normalized by 𝐶 (Eq. 4.8).
Figure 4.4: The average RMSE of atomic attributes between the ghost and target ligands for the last 10% of simulation data. Errors are determined over three different runs for 10 molecules (30 assembly simulations in total for each MLForce scale).
Figure 4.5: The average RMSD of positions and FT energy for an MLForce scale of 2000 over the last 10% of data. The standard errors are determined over three different runs. A) The average RMSD between the ghost particles and the target ligand. B) The FT potential energy, divided by the MLForce scale.
Figure 4.6: The minimum RMSD of positions for an MLForce scale of 2000 for the assembly simulations. The standard errors are determined over three different runs.
Figure 4.7: The bound pose of 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one from the 4a9e PDB entry. Residues used to define the binding pocket centroid (Pro98, Val103, Tyr155 and Ile162) are shown in white licorice. Hydrogens are not shown on the ligand or binding site residues. A) Front view. B) Top view. C) 2D structure of the ligand.
D) A size-charge representation where each sphere is sized according to the atomic radius 𝜎 and colored according to its charge, 𝑞.
Figure 4.8: Ten assembly simulations in a bromodomain binding pocket. The central panel shows traces of the FT energy (𝑈𝐹𝑇) for the ten trajectories (thin lines) as well as an average over the trajectory set (thick blue line). Two example trajectories are highlighted. The top panels, outlined in red, show snapshots from an example “failed” trajectory, which did not assemble to the correct structure. The bottom panels, outlined in green, show snapshots from a correctly assembled trajectory.
Figure A.1: Minimum weight of walker 𝑖 to ensure that ΔΔ𝑖𝑚 > 0.

LIST OF ALGORITHMS

Algorithm 2.1: REVO Resampler Algorithm

KEY TO ABBREVIATIONS

QSAR Quantitative Structure-Activity Relationship
ADMET Absorption, Distribution, Metabolism, Elimination, and Toxicity
ML Machine Learning
SVM Support Vector Machines
NNs Neural Networks
DL Deep Learning
RT Residence Time
TR-FRET Time-Resolved Fluorescence Resonance Energy Transfer
IC50 Half-Maximal Inhibitory Concentration
MD Molecular Dynamics
WE Weighted Ensemble
MSM Markov State Model
SMILES Simplified Molecular-Input Line-Entry System
CM Coulomb Matrix
DFT Density Functional Theory
3D 3-Dimensional
2D 2-Dimensional
GSG Geometric Scattering for Graphs
HTS High Throughput Screening
VS Virtual Screening
LBVS Ligand-Based Virtual Screening
SBVS Structure-Based Virtual Screening
GPU Graphics Processing Unit
CPU Central Processing Unit
CV Collective Variable
REVO Resampling Of Ensembles By Variation Optimization
VP Voronoi Polyhedra
RMSD Interface Root Mean Square Deviation
CSN Conformation Space Network
SASA Solvent Accessible Surface Area
NMR Nuclear Magnetic Resonance
LiPE Lipophilic Efficiency
GCN Graph Convolutional Networks
MM Molecular Mechanics
QM Quantum Mechanics
RMSD Root Mean Square Deviation
RMSE Root Mean Squared Error
SAMPL Statistical Assessment of the Modeling of Proteins and Ligands
MAE Mean Absolute Error
MUE Mean Unsigned Errors
CGenFF CHARMM Generalized Force Field
GAFF General AMBER Force Field
UFF Universal Force Field
MMFF94 Merck Molecular Force Field 94
AC Atomic Category
ReLU Rectified Linear Unit
FF atomic attributes
CLint Intrinsic Clearance Rates
GP Ghost Particles
API Application Programming Interface
SPR Surface Plasmon Resonance
LJ Lennard-Jones
sEH Soluble Epoxide Hydrolase
TPPU Inhibitor of sEH in crystal structure 4OD0

CHAPTER 1 INTRODUCTION

The discovery and development of a new drug is a lengthy and expensive process that often takes up to 14 years [1, 2, 3], with an average cost of 1.3 billion USD [4]. The typical drug design pipeline consists of early research and discovery, development, approval, and marketing. In the first stage, the disease target and the potential drug compounds that can bind and interact with the target are identified (hit compounds) and optimized (lead compounds). In the second stage, the drug candidates undergo preclinical and clinical studies and trials to ensure their effectiveness and safety for patients. Finally, the successful compounds from clinical trials are registered for approval and marketing. In recent years, to help expedite and economize the design of effective medicines, computational methods and models have found significant applications in the traditional drug discovery workflow. This field is referred to as in silico drug design. The use of in silico approaches such as Quantitative Structure-Activity Relationships (QSAR) for predicting the relation between chemical properties and molecular structure goes back to the 1950s [5].
In drug design, QSAR methods are employed to predict a small molecule’s drug-like properties (e.g., those indicated in Lipinski’s Rule of Five [6]) such as molecular mass, log 𝑃, and numbers of hydrogen bond donors and acceptors. In addition to QSAR, extensive in silico tools have been developed for target discovery [7, 8, 9], molecular docking [10, 11, 12, 13, 14], virtual screening [15, 16], binding site prediction [17, 18, 19], de novo drug design [20, 21, 22], ligand similarity search [23, 24, 25], and prediction of pharmacokinetic properties like absorption, distribution, metabolism, elimination, and toxicity (ADMET) [26, 27, 28]. These methods can be categorized into two types: ligand-based and structure-based methods. Ligand-based methods rely on knowledge of small molecules that interact with the target of interest, while structure-based methods rely on the three-dimensional (3D) structural information of the target molecules [29, 30, 31, 32, 33, 34]. While many approved drugs were discovered with the help of computational methods, such as Raltitrexed [35], Amprenavir [36, 37], Isoniazid [38], Flurbiprofen [39, 40] and Dorzolamide [41], there are also reports of failures of computational methods. For instance, the antidepressant/anxiolytic RPX00023 was reported to be a selective serotonin 5-HT1A receptor agonist but turned out to be an inhibitor [42]. In fact, 90% of drug candidates fail during clinical trials, due to insufficient clinical efficacy, unmanageable toxicity or side effects, poor pharmacokinetic or drug-like properties, or lack of commercial interest [43, 44]. The failure rate for new drugs targeting Alzheimer’s disease is even higher (99.6%) [45].
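As a concrete illustration of the rule-of-five criteria mentioned above, the following sketch counts Lipinski violations from precomputed descriptors. The thresholds (molecular mass ≤ 500 Da, log 𝑃 ≤ 5, ≤ 5 hydrogen bond donors, ≤ 10 hydrogen bond acceptors) are Lipinski's published cutoffs; the function name and the example descriptor values are illustrative only, and in practice the descriptors would come from a cheminformatics toolkit such as RDKit.

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count violations of Lipinski's Rule of Five.

    Inputs are precomputed molecular descriptors (mass in Da,
    octanol-water log P, and hydrogen bond donor/acceptor counts).
    """
    rules = [
        mol_weight <= 500.0,   # molecular mass no more than 500 Da
        logp <= 5.0,           # octanol-water partition coefficient
        h_donors <= 5,         # hydrogen bond donors
        h_acceptors <= 10,     # hydrogen bond acceptors
    ]
    return sum(not ok for ok in rules)

# Aspirin-like descriptor values: no violations expected
print(lipinski_violations(180.2, 1.2, 1, 4))   # -> 0
# A large, highly lipophilic compound violates two rules
print(lipinski_violations(650.0, 6.3, 2, 8))   # -> 2
```

Compounds with more than one violation are conventionally flagged as likely to have poor oral bioavailability, which is why log 𝑃 prediction (Chapter 3) matters for drug-likeness filtering.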
With the availability and growth of molecular databases that include both chemical and pharmacological properties, Machine Learning (ML) methods such as Support Vector Machines (SVM), Decision Trees, and Deep Neural Networks (DNN) have been widely employed in in silico drug design for various tasks such as binding site prediction [46, 47], ADMET prediction [48], de novo drug design [49, 50], similarity search [51], drug-target interaction prediction [52] and molecular property prediction [53, 54, 55, 56]. Computational methods are needed both to accurately predict whether a compound will bind potently to its target, and to maintain ideal drug-like properties along a number of different axes during development: solubility, toxicity, ease of manufacturing, stability and molecular weight. In this dissertation, we have developed advanced computational methods for predicting molecular properties, identifying hit compounds using a novel ML-based screening method, and modeling receptor-drug binding kinetics.

1.1 Kinetics-oriented drug design

In the traditional drug discovery paradigm, the binding affinity of potential drug candidates is measured to predict their efficacy. The binding affinity is quantified by the equilibrium dissociation constant ($K_d$) or the half-maximal inhibitory concentration ($IC_{50}$) [57]. Several effective drugs have been discovered using estimations of binding affinity [58, 59, 60]. However, these measurements are carried out in vitro, which is very different from the biological environment of the human body, sometimes resulting in poor predictions of drug efficacy [61]. One fundamental difference is that living organisms exist far from equilibrium. Further, equilibrium quantities like $K_d$ are not always appropriate, especially when the timescales of ligand binding and unbinding are comparable to the timescales of drug metabolism, elimination and excretion.
Hence, several reports have shown that the efficacy of a drug can be more correlated with the drug-receptor binding kinetics than with the binding affinity [62, 63, 64]. The drug-receptor association rate ($k_{on}$), dissociation rate ($k_{off}$), and mean ligand residence time (RT) are the main parameters describing binding kinetics. The mean residence time is the amount of time that the drug stays in contact with its target, and it has been shown that in some cases a drug with a longer residence time can lead to better biological efficacy and therapeutic effects [62, 63, 64]. This has motivated researchers to treat the optimization of binding kinetics as a goal in designing new drugs, a process called kinetics-oriented drug design. Recently, many experimental and computational methods have been developed to predict drug-receptor binding and unbinding rates and pathways. Experimental methods to measure the binding rates include label-free Surface Plasmon Resonance (SPR) [65, 66], radioligand binding [67, 68, 69] and time-resolved fluorescence resonance energy transfer (TR-FRET) [70]. The main challenge of experimental methods is the inability to observe and characterize the unbinding transition states of the ligand, because the lifetime of a transition state is very short, often on the order of femtoseconds. This limitation of experimental approaches raises a need for computational methods to reveal structural information about transition states. Therefore, a broad range of enhanced sampling approaches such as Metadynamics [71, 72], Scaled Molecular Dynamics (MD) [73, 74, 75], Well-tempered Metadynamics [76, 77], Steered MD [78, 79], Brownian dynamics [80, 81], Markov State Models (MSM) [82, 83, 72], Weighted Ensemble (WE) [84, 85], and WExplore [86, 87, 88, 89] have been proposed. In this dissertation, we propose a novel enhanced sampling method for investigating the kinetics of ligand unbinding and understanding unbinding pathways (see Chapter 2).
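The kinetic and thermodynamic quantities above are linked by simple relations: the mean residence time is the reciprocal of the dissociation rate (RT = 1/$k_{off}$), and at equilibrium $K_d$ = $k_{off}$/$k_{on}$. A toy calculation with assumed rate constants (not values from this work) makes the connection concrete:

```python
# Toy illustration of binding-kinetics relations; the rate constants are
# assumed for illustration only.  RT = 1/k_off and K_d = k_off/k_on.

k_on = 1.0e6    # association rate constant, 1/(M*s)
k_off = 1.0e-3  # dissociation rate constant, 1/s

residence_time = 1.0 / k_off  # mean residence time, in seconds
K_d = k_off / k_on            # equilibrium dissociation constant, in M

print(f"RT  = {residence_time:.0f} s")  # 1000 s
print(f"K_d = {K_d:.1e} M")
```

Note that two ligands with the same $K_d$ can have very different residence times, which is precisely why kinetics-oriented design treats RT as an independent optimization target.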
1.2 Predicting molecular properties

In the early stages of drug design, computational predictions of molecular properties can help filter drug-like candidates and reduce drug discovery time and cost. Recently, a diverse set of machine learning techniques have been employed to predict molecular properties like aqueous solubility [53, 54], solvation free energy [90], log P [91, 92, 55], blood-brain barrier penetration [56] and protein–ligand binding affinity [93]. However, the accuracy of ML methods heavily depends on the manner in which molecules are represented. Representation models describe molecular characteristics in terms of descriptors such as atomic environments [94, 95], internal coordinates (i.e., bond lengths, bond angles, and dihedral angles) [96], and molecular fingerprints [49]. To respect the underlying symmetry, the features that are extracted from the representation models should be invariant to rotation, translation, reflection and the re-indexing of atoms. One of the widely used models for molecular representation is the Simplified Molecular-Input Line-Entry System (SMILES) string. These are strings of characters that uniquely identify a particular molecule (e.g., benzene: C1=CC=CC=C1). Although SMILES does not include physical or structural features of molecules, it has shown surprising success in the prediction of the properties of small molecules and has even been used to generate novel chemical compounds [55, 49, 97, 98]. Another model, the Coulomb Matrix (CM) [96], utilizes atomic energies and the inter-nuclear Coulomb repulsion operator to predict the molecular energy. The eigenvalues of the CM are used as a similarity measure, and they are invariant to these symmetry operations. Another approach to representing atomic systems is symmetry functions, initially proposed by Behler and Parrinello [94].
The local environment of an atom is described by two sets of radial and angular symmetry functions, which are invariant to rotation and translation of the system and define the local environment energetics. The summation of these local energies is used to calculate the total energy of the system. An improved version of Behler's angular symmetry function is used in ANAKIN-ME [95] to construct a molecular force field with accuracy comparable to density functional theory (DFT). Another molecular model, the Grid Featurizer, has been used to predict protein-ligand binding affinity [99, 100, 101, 102]. Here, the 3D (three-dimensional) structure of a protein-ligand complex is discretized into a grid and each atom is expressed by a density function to form the grid matrix. Another approach to representation modeling is the graph-based model, which directly encodes the molecular structure. In graph-based models, atoms and bonds represent the nodes and edges of the graph, respectively. Functions are then defined on the molecular graph to create feature vectors or matrices of the same size. In recent years, various graph-based methods have been developed for predicting different molecular properties, such as graph convolutional networks [49, 93], directed acyclic graph models [54], deep tensor neural networks [103], message passing neural networks [104] and geometric scattering for graphs [105]. Unfortunately, these methods have all been applied to different applications, making it hard to carry out a fair comparison between them. Here, we use the geometric scattering for graphs (GSG) [105] model to predict the log P of small molecules (Chapter 3). GSG quickly creates features that are invariant to symmetry operations. Also, its feature vectors have the same length, allowing us to use any distance metric for similarity measurement purposes.
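Of the representations surveyed above, the Coulomb matrix is compact enough to sketch in a few lines. The following is a minimal NumPy illustration, not code from this work; the three-atom geometry is hypothetical, and the sorted eigenvalues serve as the invariant feature vector (permuting the atom order applies a similarity transform to the matrix, so the eigenvalues are unchanged, and only interatomic distances enter, so rotations and translations have no effect either).

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 * Z_i**2.4, M_ij = Z_i*Z_j / |R_i - R_j|."""
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, j] = 0.5 * Z[i] ** 2.4
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

def cm_features(Z, R):
    """Sorted eigenvalues of the Coulomb matrix: invariant to rotation,
    translation, and re-indexing of atoms."""
    return np.sort(np.linalg.eigvalsh(coulomb_matrix(Z, R)))[::-1]

# Hypothetical three-atom geometry (water-like arrangement, in Angstrom)
Z = np.array([8.0, 1.0, 1.0])
R = np.array([[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]])

# Re-indexing the atoms leaves the eigenvalue features unchanged
perm = [2, 0, 1]
f1 = cm_features(Z, R)
f2 = cm_features(Z[perm], R[perm])
print(np.allclose(f1, f2))  # True
```

Like the GSG features used in Chapter 3, these eigenvalue vectors have a fixed length for a fixed number of atoms, so standard distance metrics can be applied to them directly.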
1.3 Virtual Screening

After target identification, the drug discovery process continues with compound screening in the laboratory to identify hit compounds. Compound screening is performed using automated High Throughput Screening (HTS) [106] assays. In this step, for a given protein of interest, small drug-like compounds are examined by docking to the receptor, and a drug compound is selected that is predicted to bind tightly and specifically to its target. However, HTS assays are physical experiments that are expensive, slow, and limited to screening millions of compounds [5, 107, 108]. Therefore, computational screening methods, referred to as Virtual Screening (VS) [107, 109, 110, 111], have been proposed to expand the search of chemical space for small molecules. VS methods that utilize prior knowledge of known active ligands are called ligand-based VS (LBVS) [112, 113, 114, 115]. VS approaches that rely on the 3D structure of target molecules are called structure-based VS (SBVS) [116, 117, 118]. In SBVS, a handful of drug databases such as ZINC [119], ChEMBL [120], PubChem [121, 122] and NCI [123], with hundreds of millions of compounds, are examined, which can be extremely computationally demanding. Furthermore, each compound (ligand) has the potential to cause significant conformational changes in the protein binding pocket during the binding process. To overcome this computational hurdle, the traditional VS process overlooks the flexibility and conformational changes of the receptor during docking by taking into account just a single conformation of the receptor [124, 125, 126, 127, 128]. Recently, a wide range of approaches have been developed that consider the flexibility of the receptor in screening [127, 129, 130, 131, 132, 133, 134].
A class of these methods, referred to as ensemble docking [128, 133, 135, 134], utilizes an ensemble of target conformations generally determined either by MD simulations or crystallography. Ensemble docking enables docking of each ligand into multiple structures, and docking scores are determined by aggregation or consensus strategies. This method has been successful in some applications [136, 137, 138]; however, the selection of an optimal ensemble size and appropriate structures varies with the target protein [139], and there are studies showing that increasing the ensemble size can lead to a poorer model [140, 141, 142]. To solve this problem, we develop an efficient ML-based compound screening method that can identify potent "hit" compounds in large datasets, while taking into account ligand-induced conformational changes, including those of explicit water molecules in the binding site. This method is called "Flexible Topology" since it allows ligand atoms to change their chemical type in response to forces from the surrounding binding site atoms. This way, the search process through the dataset is coupled to the dynamics of a molecular simulation, and the final ligand poses can be used as predictions of potent binders. Chapter 4 presents the first results obtained with the Flexible Topology method.

CHAPTER 2

PREDICTING THE RESIDENCE TIME OF DRUG MOLECULES (REVO)

This chapter was adapted from the paper titled "REVO: Resampling of ensembles by variation optimization" as it was published [143]. The figures and tables from the supporting information (SI) are incorporated into the text.

2.1 Introduction

Unraveling the function of macromolecules and exploring their structures is a central research topic in biochemistry, and one that can be pursued through molecular dynamics (MD) simulation. MD can be used to explore and sample the conformation space of a system. However, its sampling power is often limited by large energetic barriers that separate stable molecular states.
This is a problem in a variety of applications, such as protein binding, unbinding and folding processes, despite advances in high-performance computing hardware and graphics processing units (GPUs). Enhanced sampling techniques have thus been useful to increase the efficiency of MD and observe rare events in biomolecular systems. Enhanced sampling methods have a long history. Over the last few decades, a wide variety of methods have been described that involve either the introduction of external forces [144], manipulation of the energy landscape [145, 146, 147], or coupling to systems at higher temperatures [148, 149]. Although in most cases these methods can be used to obtain accurate thermodynamic quantities such as free energy differences, they use perturbed dynamics, which complicates the collection of kinetic information: both transition rates between macrostates and microscopic state-to-state transitions. Other enhanced sampling methods can be used to simulate rare events without the use of biasing forces. For example, Markov State Models (MSM) [150] are based on unbiased sampling of trajectories, in which system dynamics are described by transitions between a set of states at discrete time intervals (i.e., a lag time $\tau$). However, in the MSM, the Markovian assumption (that transitions are independent of history) is only guaranteed to be fulfilled in the limit of long $\tau$, in practice tens of nanoseconds [151]. MSMs can also be sensitive to clustering parameters and feature selection [89] and typically require a very long simulation time. The weighted ensemble (WE) [84] method also uses unbiased trajectories, but offers a way to calculate observables directly, without the use of the Markovian assumption. The WE algorithm periodically applies cloning and merging operations to a set of trajectories in order to balance computational effort between different regions of space.
When possible, these regions can be defined as "bins" along a collective variable (CV) that describes a transition of interest. However, for some systems the processes of interest cannot be described by a single CV, for instance when multiple transition paths are possible between multiple stable states. The use of traditional binning procedures in the WE framework is limited for these high-dimensional systems, as the number of bins grows exponentially with the number of CVs used. This is a problem for traditional WE, as the number of trajectories (also called "walkers") per region is typically fixed, leading to an exponential increase in total simulation time. Even if one employed a large number of regions and then allowed most of these to be unoccupied, it would still be difficult to prioritize which underrepresented regions should be chosen for cloning. The WExplore algorithm was introduced to address this problem [152]. WExplore is a WE approach that dynamically defines a large number of sampling regions using a distance metric within a high-dimensional CV space. These regions are defined within a hierarchy, allowing sampling to be balanced between branches of the hierarchy at multiple levels. This allows a small number of walkers to be efficiently distributed across a (possibly) high-dimensional space. WExplore has been applied to sample a variety of rare events, including ligand (un)binding pathways [86, 87, 88, 89], protein folding pathways [152] and RNA conformational changes [153]. Despite this success, WExplore is limited by three main issues related to the definition of these hierarchical regions. First, the nature of the hierarchical regions leads to inconsistent cloning activity: when a threshold is crossed and a region is defined on a new level of the hierarchy for the first time, many cloning events of a single trajectory occur in quick succession.
We call this process "thresholding", and its stochastic nature has the potential to produce large differences between different WExplore runs. Secondly, regions in WExplore are not moved once they are created. The centers of these regions can thus differ from the positions of local energy minima. We call this problem "sub-optimal region definition". Further, although WExplore can divide a space into a large number of regions (e.g. 10000), typically a maximum branching factor is defined at each level of the hierarchy to limit the total number of regions that can be defined. This can lead to an uneven distribution of sampling regions throughout the space. To address the aforementioned problems, we propose a new region-free enhanced sampling algorithm called Resampling of Ensembles by Variation Optimization, or "REVO". REVO uses cloning and merging to create ensembles of diverse trajectories without defining any regions, and instead optimizes a measure of "variation" that depends on the pairwise distances between the walkers. In this paper, we first describe the REVO algorithm and its differences from WExplore and other WE methods. We then apply the REVO method to a tunable N-dimensional random walk system to study its performance as a function of dimensionality. We also apply REVO to sample unbinding pathways in the well-studied trypsin-benzamidine system, and compare the results to WExplore. Finally, we conclude with a discussion of the REVO algorithm, including new possibilities for enhanced sampling.

2.2 Methods

2.2.1 Generalized framework for weighted ensemble sampling

Since the original publication of the weighted ensemble (WE) algorithm, a number of augmentations and improvements to the method have been introduced. Here we describe a generalized framework that is common to the different algorithms in the WE family.
This framework includes two alternating steps: 1) MD simulations that move walkers forward in time, and 2) resampling operations that merge and clone walkers. A resampling function is designed such that desirable walkers are cloned and less-desirable walkers are merged together. Historically, this "desirability" has been defined using counts of walkers in a set of regions (or "bins") constructed along one or more collective variables that describe the system dynamics; however, as shown here, it can be thought of more generally.

Figure 2.1: The WE simulation framework. Each walker is represented as a circle. The size of a circle represents the weight, and different colors represent different conformations. An ensemble of walkers with the same weight and conformation is run for a set number of steps ("Dynamics"). Then resampling is performed. These two steps continue until the simulation ends.

When a walker is cloned, it creates two independent walkers that get the conformation of the cloned walker and half of its weight. Merging two walkers $A$ and $B$ creates a walker $C$ with weight $w_C = w_A + w_B$, where $C$ inherits the conformation of $A$ or $B$ with a probability proportional to the two weights. On the whole, a resampling process aims to increase the diversity of the trajectory ensemble and increase the probability of observing the events (or conformations) of interest. A resampling function (Figure 2.1) accepts a set of walkers and returns the new set of walkers that result from the cloning and merging operations. The new conformations are thus a subset of the input conformations, and the sum of the weights must be unchanged by the resampler. In general, a resampler can return a different number of walkers, but in this work we keep the number of walkers constant.
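The cloning and merging operations just described can be sketched in a few lines. This is a minimal illustration, not the actual Wepy implementation; a walker is represented here as a (weight, conformation) pair with string labels standing in for conformations.

```python
import random

# Minimal sketch of WE cloning and merging (illustrative, not Wepy code).
# A walker is a (weight, conformation) pair.

def clone(walker):
    """Cloning: two independent walkers, each with half the weight."""
    w, conf = walker
    return (w / 2, conf), (w / 2, conf)

def merge(walker_a, walker_b, rng=random):
    """Merging: one walker carrying the summed weight; the surviving
    conformation is drawn with probability proportional to the weights."""
    (w_a, conf_a), (w_b, conf_b) = walker_a, walker_b
    w_c = w_a + w_b
    conf_c = conf_a if rng.random() < w_a / w_c else conf_b
    return (w_c, conf_c)

a, b = clone((0.5, "pose1"))               # two walkers of weight 0.25 each
c = merge((0.25, "pose1"), (0.75, "pose2"))  # one walker of weight 1.0
print(a, b, c[0])
```

Note that both operations conserve the total statistical weight, which is the property that keeps the resampled ensemble statistically valid.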
WE simulations can use arbitrary resampling methods and remain a statistically valid process from which unbiased estimates of observables can be calculated [154]. Seen this way, conventional WE and WExplore can simply be viewed as different resamplers.

2.2.2 REVO resampling algorithm

Here, we present a new method for resampling trajectory ensembles in the WE simulation framework. The REVO resampling method works by explicitly maximizing a measure of "trajectory variation", which is defined using the weights of walkers and an all-to-all pairwise distance matrix obtained from the distances between walkers. This distance metric is system-specific and should describe the events of interest. Notably, resampling in REVO does not involve the construction of regions in order parameter space, which avoids the region-related limitations of WExplore mentioned in the Introduction. We calculate the variation using the following equation:

$$V = \sum_i V_i = \sum_i \sum_j \phi_i \phi_j \left( \frac{d_{ij}}{d_0} \right)^\alpha \qquad (2.1)$$

where $d_{ij}$ is the distance between walker $i$ and walker $j$ determined using a distance metric of choice. The exponent $\alpha$ is used to modulate the influence of the distances in the variation calculation. A procedure for selecting an appropriate value of $\alpha$ is given in Section 2.3.2. $\phi_i$ is a non-negative function which is referred to as a "novelty". It can be a function of walker conformation and/or walker weight, and it is a measurement of the relative importance of each walker, which can be defined in a system-specific fashion. Here, we define $\phi$ as a function of walker weight ($w_i$):

$$\phi_i = \log(w_i) - \log\left(\frac{p_{min}}{100}\right) \qquad (2.2)$$

This function prioritizes walkers with higher weight values, and ranges from $\phi_i \approx 32$ for $w_i \approx 1$ down to $\phi_i = 4.6$ for $w_i = p_{min}$. Parameters $p_{min}$ and $p_{max}$ are the minimum and maximum statistical weights, respectively, that a walker can hold.
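Equations 2.1 and 2.2 translate directly into code. The sketch below is an illustration (not the Wepy implementation), using natural logarithms, as implied by the quoted $\phi$ values, and hypothetical three-walker inputs:

```python
import numpy as np

def novelty(weights, p_min):
    """Eq. 2.2: phi_i = log(w_i) - log(p_min / 100)."""
    return np.log(weights) - np.log(p_min / 100)

def variation(weights, dist_matrix, d0, alpha, p_min):
    """Eq. 2.1: V = sum_i V_i, with V_i = sum_j phi_i*phi_j*(d_ij/d0)**alpha."""
    phi = novelty(weights, p_min)
    terms = np.outer(phi, phi) * (dist_matrix / d0) ** alpha
    V_i = terms.sum(axis=1)  # per-walker variation values
    return V_i.sum(), V_i

# Hypothetical three-walker example: weights and a symmetric distance matrix
weights = np.array([0.5, 0.3, 0.2])
dists = np.array([[0.0, 1.0, 2.0],
                  [1.0, 0.0, 1.5],
                  [2.0, 1.5, 0.0]])
V, V_i = variation(weights, dists, d0=1.0, alpha=2.0, p_min=1e-12)
print(V, V_i)
```

A quick sanity check against the text: with $p_{min} = 10^{-12}$, a walker of weight $p_{min}$ has $\phi = \ln(100) \approx 4.6$, matching the quoted lower bound.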
Following previous work with WExplore, in REVO we do not clone walkers of weight $< p_{min}$, to avoid spending simulation time on walkers that will not significantly contribute to statistical observables. We enforce a maximum weight ($p_{max}$) in order to avoid the accumulation of probability in a single walker ($w \approx 1$), which can lower our chances of seeing new rare events within a given simulation. For instance, here we set $p_{max}$ to 0.1, in order to always have at least 10 walkers with reasonably high probabilities. We also employ a check where the two walkers that are merged must be within a certain distance of each other, which we call the "merge distance threshold". This ensures that minimal information is lost when two trajectories are merged. The parameter $d_0$, called the "characteristic distance", does not affect cloning and merging behavior, but is defined to make the variation function unit-less and to facilitate comparison across different distance metrics. A procedure for calculating the characteristic distance for a given system is explained below. The goal of the resampling process in REVO is to optimize $V$ in Equation 2.1. To do this, walkers with high $V_i$ values are selected for cloning, and walkers with low $V_i$ values are selected for merging. This is further explained in Appendix 9. The pseudocode of the REVO resampler is shown in Algorithm 2.1.
Algorithm 2.1: REVO Resampler Algorithm

    Input: Ensemble of walkers, REVO parameters
    Output: Ensemble of resampled walkers
    Dist_Matrix = AlltoAll_Dist(walkers)
    V_old, {V_i} = CalcVariation(weights, Dist_Matrix)
    while TRUE do
        c  = Select the walker with highest V_i where w_i > p_min
        m1 = Select the walker with lowest V_i where w_i < p_max
        m2 = Select the walker closest to m1 where w_i + w_m1 < p_max
             and d_{i,m1} < d_merge_distance
        if m1 and m2 are defined then
            /* Changes the conformations and weights of walkers */
            Do cloning
            Do merging
            V_new, {V_i} = CalcVariation(weights, Dist_Matrix)
            if V_new > V_old then
                V_old = V_new
            else
                Undo cloning and merging step
                break
        else
            break

2.2.3 WExplore sampling algorithm

For completeness, we describe our implementation of the WExplore sampling algorithm based on previous work [152, 87]. Similar to WE, each walker in WExplore carries a statistical weight that changes during the resampling procedure. The WExplore algorithm dynamically splits the sampling space into a set of hierarchical Voronoi polyhedra (VP), which are used as the "regions" to guide resampling. Each VP is defined using a central point called an "image", which is a specific conformation of the system. A walker is assigned to a VP region by calculating its distance to each VP image, and assigning it to the region with the smallest such distance. A WExplore simulation employs a distance metric which is defined to emphasize the process of interest. For example, in protein-ligand unbinding simulations the distance metric between walkers is defined as the root mean squared distance (RMSD) between the ligands after aligning the binding sites of both walkers. All walkers start with the same structure and initial weight. The sampling space initially includes just a single region, defined at each level of the hierarchy by the image of the initial structure.
As the simulation progresses, new regions are defined when a structure is sampled whose distance to all previously defined images is greater than a predefined distance threshold. The hierarchical regions are defined using a set of progressively smaller distance thresholds. There is a maximum number of child regions that can be defined under each parent at each level of the hierarchy, which is set here to 10 for all systems. In this paper, we use a four-level hierarchy of regions with 10000 regions in total. Walker resampling in WExplore occurs through the cloning and merging processes, where the walkers are distributed as equally as possible across all regions. This occurs from the top of the hierarchy downwards: first balancing between the largest hierarchical regions, then the second-largest, and so on. At the beginning of a simulation, only the smallest regions are defined and resampling occurs only at the lowest level. As mentioned in the Introduction, the first time a walker establishes a new region at a new level of the hierarchy, it is cloned repeatedly until the number of walkers in the new and old regions is as even as possible. WExplore and REVO share many of the same qualities, which facilitates their direct comparison here. As in REVO, the number of walkers is constant throughout the simulation. The same distance metrics can be used in both algorithms. Also, the parameters $p_{min}$ and $p_{max}$ have the same role, and can be enforced in the same way. In WExplore, two walkers are only merged if they are in the same region (at all levels of the hierarchy). This is analogous to the merge distance threshold in REVO, introduced above.

2.2.4 N-dimensional biased random walk

We first use the N-dimensional biased random walk to study and analyze the performance of REVO in higher-dimensional spaces. In this system, the conformation of a walker is defined as an N-dimensional vector ($x_i \in \mathbb{R}^N$) of non-negative values.
A walker starts at the origin ($\vec{0}$) and randomly moves either one unit forward (with probability $P_u = 0.25$) or one unit backward (with probability $1 - P_u = 0.75$) in each dimension at each dynamics step. The walkers are confined to non-negative position values by rejecting moves to negative values. In this system, the distance metric used is a scaled version of the Manhattan norm:

$$d_{ij} = \frac{1}{N} \sum_{k=1}^{N} |x_{ik} - x_{jk}| \qquad (2.3)$$

For WExplore we use a four-level region hierarchy with distance thresholds of $d$ = 0.25, 1, 4 and 16. The "merge distance" in REVO is set to 2.5 for all N.

2.2.5 Trypsin-benzamidine system

We run simulations of the trypsin-benzamidine system, which is a test system used to evaluate the sampling power of enhanced sampling techniques for ligand binding [87, 155, 156, 157]. We used the OpenMMRunner in Wepy (https://github.com/ADicksonLab/wepy) and OpenMM version 7.2.2 [158] to run parallel simulations for each walker on nodes equipped with 4 NVIDIA K80 GPUs. The system was set up following our previous work [87]. Atomic coordinates from the PDB ID 3PTB structure are used to set up the system, including the crystallographic calcium ion and the crystallographic water molecules. The system is solvated using a periodic cubic water box of size 74.3 Å. This system has a total of 41006 atoms with nine neutralizing chloride ions. The benzamidine ligand is parameterized using the CHARMM Generalized Force Field (CGenFF) [159, 160]. The system is run at constant temperature and pressure using Langevin dynamics with a friction coefficient of 1 ps⁻¹, which couples the system to a heat bath at a temperature of 300 K, with an integration step size of 2 fs. We employ a 1 atm constant pressure Monte Carlo algorithm where volume move attempts are carried out every 50 steps. Nonbonded forces are calculated using the CutoffPeriodic method in OpenMM, in which only the interaction of each particle with the nearest periodic copy of other particles is considered.
A cutoff distance of 10 Å is used for nonbonded particle interactions. Covalent bonds to hydrogen are constrained using the OpenMM HBonds option. For WExplore we use a four-level region hierarchy with distance thresholds of $d$ = 10, 5, 3 and 1.7 Å. The REVO merge distance was set to 25 Å, effectively allowing all non-local merges. For both REVO and WExplore, five independent simulations were run with 48 walkers each and a step size of 2 fs, with resampling occurring every 20 ps. For both resamplers, we measure the distance between two walkers ($A$ and $B$) as the RMSD between the $A$ and $B$ ligands after aligning the binding sites of $A$ and $B$.

2.2.6 Clustering and network visualizations

To compare the structures obtained by the REVO and WExplore resamplers, we build conformation space networks (CSNs) as follows [161, 162, 163]. First, the feature vector of each frame is determined: a set of distances between a predefined set of ligand and protein atoms. This set includes the 50 nearest heavy protein atoms to the ligand as well as the 9 heavy atoms of the ligand. The feature vector includes all possible pairs of atoms from these two sets, resulting in a feature vector of size 450. Feature-vector-based clustering was done with the MSMBuilder [164] program using the KCenters method and the Canberra distance metric. Three sets of clusters were determined: one using only WExplore trajectories, one using only REVO trajectories, and one using both sets of trajectories. In each case, the data was grouped into 2000 clusters. After clustering, the CSNs are constructed using the CSNAnalysis tool (https://github.com/ADicksonLab/CSNAnalysis), using the unweighted transition counts matrix and a lag time of 20 ps.

2.3 Results

2.3.1 N-dimensional random walk

To compare the REVO and WExplore resampling algorithms, we first run simulations for the random walk system at dimensions N = 2, 5, 10 and 20, with 10 copies each.
The simulations were run with 200 walkers, for 10000 cycles that consist of 10 dynamics steps followed by resampling. For both REVO and WExplore, a minimum and maximum walker weight of $p_{min} = 10^{-100}$ and $p_{max} = 0.1$ were used. The characteristic distance parameter ($d_0$) of REVO is determined by running a single dynamics cycle and then calculating the average distance between all walkers. This overall average value is the characteristic distance, which is tabulated for each value of N in Table 2.1.

Table 2.1: Characteristic distance values for the N-dimensional random walk.

    Dimension (N)    Characteristic distance
    2                0.7165
    5                0.7161
    10               0.7138
    20               0.7157

To compare with our REVO and WExplore results, we also ran straightforward random walk simulations (CONV) with no resampling. Figure 2.2 shows the average predicted probability along each dimension, calculated by averaging the positional probability distributions over all dimensions and all runs. Since the random walk is biased towards the origin, the probability decreases drastically with increasing $x$. In this system, the target equilibrium probability of each $x$ position can be directly calculated as $P_t(x) = \left(\frac{2}{3}\right)\left(\frac{1}{3}\right)^x$.

Figure 2.2: Average predicted probability distributions. The black curve is the target probability. Probability distributions are averaged over all 10 runs for (A) REVO and (B) WExplore.

We find that, using the same number of dynamics steps, REVO is capable of visiting more distant points in comparison to the more limited sampling by WExplore. Furthermore, we can compare the run-to-run variability of the two algorithms by calculating the average standard error of the predicted probability for $x$ in the range 0 to 22 for N = 2, 5, 10 and 20. The value of this averaged standard error is $1.003 \times 10^{-5}$ and $2.06 \times 10^{-5}$ for REVO and WExplore, respectively, which shows that the REVO simulations are more consistent than WExplore.
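The target distribution $P_t(x) = (2/3)(1/3)^x$ can be checked against the move probabilities directly: it is normalized, and it satisfies detailed balance for forward steps taken with probability 0.25 and backward steps with probability 0.75 (moves below $x = 0$ being rejected). A quick numerical sanity check:

```python
# Sanity check that P_t(x) = (2/3)*(1/3)**x is stationary for the 1D biased
# walk: forward step with p = 0.25, backward with p = 0.75, and moves to
# x < 0 rejected.  Detailed balance: P_t(x) * 0.25 == P_t(x+1) * 0.75.

P_u = 0.25  # probability of a forward step

def P_t(x):
    return (2 / 3) * (1 / 3) ** x

# Normalization: geometric series, (2/3) / (1 - 1/3) = 1
total = sum(P_t(x) for x in range(200))
print(abs(total - 1.0) < 1e-12)  # True

# Detailed balance between neighboring positions
balanced = all(abs(P_t(x) * P_u - P_t(x + 1) * (1 - P_u)) < 1e-15
               for x in range(50))
print(balanced)  # True
```

Because $P_t(x)$ decays by a factor of 3 per unit of $x$, reaching $x \approx 22$ means sampling states with probability on the order of $10^{-11}$, which is what makes this a stringent test of resampling.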
Following previous work [152], we quantify the quality of sampling of each probability distribution using two values: the "accuracy" and the "range". The range of a given simulation is calculated by determining the largest $x$ value visited along each dimension, and then averaging them. The accuracy ($A$) of a given curve $P(x)$ is equal to:

$$A = \sum_x a(x) \qquad (2.4)$$

where

$$a(x) = \begin{cases} 1 + \dfrac{|\log(P_t(x)) - \log(P(x))|}{\log(P_t(x))} & \text{if } \log(P(x)) > 2 \log(P_t(x)) \\[1ex] 0 & \text{otherwise} \end{cases}$$

For a given $x$ point, the highest accuracy contribution ($a(x)$) is 1, which is achieved when $P(x) = P_t(x)$. This decreases as $P(x)$ gets farther from the target probability. Figure 2.3 shows that REVO obtains the highest accuracy and range for all simulated values of N.

Figure 2.3: (A) The calculated accuracy values for the three methods, averaged over 10 runs. REVO outperforms the WExplore and CONV methods for all dimensions. (B) The range of visited $x$ values for the three methods, averaged over 10 runs. REVO explores a broader sampling space than the WExplore and CONV methods for all dimensions. Error bars show the standard error of the mean across the set of runs.

Notably, WExplore struggles with very high-dimensional spaces, with accuracy and range values approaching those of conventional simulation. REVO resampling dramatically outperforms WExplore for N = 5, 10 and 20, indicating that it is much more capable of efficiently discovering new areas of space in high-dimensional systems. In fact, the accuracy and range values actually improve with increasing N, reaching their maximum values at N = 10. To investigate this phenomenon, we examine the walkers and their distances to the origin, for each value of N (Figure 2.4). For the 2-dimensional system, walker positions are evenly spread over two sampling "arms" that extend along the $x$ and $y$ axes. In general, an N-dimensional system will have N of these arms, extending outward from the origin.
We hypothesized that having more sampling arms allows a higher fraction of the walkers to be far from the origin, which could improve both the accuracy and range values. In Figure 2.4B and C, we confirm that the expected distance to the origin, averaged over the set of walkers, increases as 𝑁 goes from 2 to 10. Once 𝑁 increases to 20, even though there are more sampling “arms”, the same 200 walkers are not able to sample all of these arms efficiently, and the expected distance to the origin decreases. Interestingly, Figure 2.4C tracks very well with the expected accuracy in Figure 2.3A.

Figure 2.4: (A) The final walker positions for a representative 𝑁 = 2 simulation are shown as points, overlaid on a probability distribution heat map calculated using 𝑃_𝑡(𝑥, 𝑦) = 𝑃_𝑡(𝑥)𝑃_𝑡(𝑦). (B) Probability distributions for the distance to the origin, averaged over all simulations, all walkers, and 10 cycles spaced evenly between cycle 1000 and cycle 10000. Distributions are shown separately for 𝑁 = 2, 5, 10 and 20. (C) The expectation values of these curves as a function of 𝑁.

In our random walk simulations, we propose a move along each dimension for each timestep. An 𝑁-dimensional system can then be seen as 𝑁 one-dimensional systems, where the only thing coupling them together is the resampling algorithm. As the average probability distributions are calculated over all dimensions, the higher-dimensional systems gain a benefit, as they have more chances to sample higher values. To remove this effect, we calculate the accuracy and range using only the first two dimensions for 𝑁 = 2, 5, 10 and 20 (Figure 2.5).

Figure 2.5: Accuracy (𝐴) and range values considering only the first two dimensions. Values are averaged over 10 simulations in each case and the error bars indicate the standard error of the mean.
The accuracy values in this case are within standard error for 𝑁 = 2, 5 and 10, but drop for 𝑁 = 20. The range is constant for 𝑁 = 2 and 5, and sees a slight increase for 𝑁 = 10, before again dropping for 𝑁 = 20. In Figure 2.6, we examine different values of the distance exponent 𝛼, and the presence or absence of the weight novelty term (𝜙). The weight novelty term was introduced to prioritize walkers with higher weights, thereby encouraging not only that higher 𝑥 values are sampled, but that they are sampled with as high a probability as possible.

Figure 2.6: (Top) Average accuracy (left) and range (right) values for different values of the distance exponent 𝛼: 𝛼 = 1 (black), 𝛼 = 2 (green) and 𝛼 = 4 (orange). (Bottom) Average accuracy and range values are shown both with (orange) and without (green) the weight novelty term (𝜙).

As expected, turning off this weight novelty term (setting 𝜙_𝑖 = 1 for all 𝑖) results in a lower average accuracy for all 𝑁. We would also expect this to result in a higher range, since the range is independent of the weights of the walkers. This is what is observed, on average, although the weight novelty leads to a slightly higher range for 𝑁 = 10. Possible reasons for this phenomenon will be addressed in the Discussion section below.

2.3.2 Choosing an optimal distance exponent (𝛼) for ligand unbinding simulations

For biomolecular simulations, it is not feasible to run a large set of simulations with many different 𝛼 values. Here we describe a procedure for determining an optimal distance exponent without running any additional simulations. We instead use ensembles of walkers from previous WExplore simulations, taken at two different time points.
“Early” ensembles were taken from early time points in WExplore ligand unbinding simulations (less than 50 cycles), where all walkers have reasonably low distances to each other and all walker weights are still relatively high (Figure 2.7). “Late” ensembles were taken from the end of these simulations, where some walkers are in the unbound state (with low weight) and some walkers remain in the binding site. We isolate five early ensembles and five late ensembles from two different sets of ligand unbinding simulations: 1) the WExplore trypsin-benzamidine simulations conducted here and 2) the unbinding of the TPPU ligand from soluble epoxide hydrolase (sEH), conducted in previous work [88]. In each case the ensembles had 48 walkers, with 𝑝_min = 10^−12 and 𝑝_max = 0.1.

Figure 2.7: Representative ensembles from WExplore simulations are shown at “early” and “late” timepoints, as described in the text. (Top) Ligand density distributions from the trypsin-benzamidine WExplore simulations reported here. (Bottom) Ligand density distributions are shown for the TPPU ligand unbinding from the enzyme soluble epoxide hydrolase (sEH). Ensembles were taken from previous work [88]. Below each snapshot, a bar graph of trajectory weights is shown, sorted from highest to lowest.

The trajectory variation values were calculated using Eq. 2.1 for the early and late ensembles using 𝛼 = 1, 2, 3 and 4. Average variation values are shown in Figure 2.8. For both systems, higher 𝛼 increases the difference in trajectory variation between early and late trajectory ensembles. An appropriate 𝛼 value is one that clearly differentiates between the early and late ensembles.

Figure 2.8: The trajectory variation determined for four 𝛼 values for (A) trypsin-benzamidine and (B) sEH-TPPU.
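The trajectory variation calculation can be sketched as follows. The functional form V = Σ_i Σ_j (d_ij/d_0)^α 𝜙_i 𝜙_j and the weight-based novelty 𝜙_i = log(w_i/p_min) used here are assumptions for illustration and may differ in detail from the Eq. 2.1 definitions given earlier in the chapter.

```python
import math

def trajectory_variation(positions, weights, d0, alpha, p_min=1e-12):
    """Assumed form: V = sum_{i,j} (d_ij / d0)**alpha * phi_i * phi_j,
    with a simple 1D distance d_ij = |x_i - x_j| and a weight-based
    novelty phi_i = log(w_i / p_min) (a stand-in, for illustration)."""
    phi = [math.log(w / p_min) for w in weights]
    n = len(positions)
    V = 0.0
    for i in range(n):
        for j in range(n):
            d = abs(positions[i] - positions[j])
            V += (d / d0) ** alpha * phi[i] * phi[j]
    return V

# An "early"-like ensemble (walkers close together, high weights) versus
# a "late"-like ensemble (some distant, low-weight walkers); the numbers
# are illustrative only.
early = trajectory_variation([0.0, 0.1, 0.2], [0.4, 0.3, 0.3], d0=1.0, alpha=4)
late = trajectory_variation([0.0, 5.0, 10.0], [0.5, 0.4, 0.1], d0=1.0, alpha=4)
```

Raising 𝛼 amplifies the contribution of pairwise distances larger than 𝑑_0, which is why higher 𝛼 separates early and late ensembles more sharply.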
If our measure of trajectory variation is not higher for the late ensembles that include the unbound state, then we would likely not be able to sample ligand unbinding events by maximizing that measure of variation alone. Based on these results we choose to use 𝛼 = 4 for our REVO simulations of the trypsin-benzamidine system.

2.3.3 Trypsin-benzamidine ligand unbinding

We now compare results for the trypsin-benzamidine unbinding process obtained using the REVO and WExplore methods.

2.3.3.1 Residence time

The mean ligand residence time has been shown to be important for determining drug efficacy [165]. It can be calculated via the flux of unbinding trajectories in ligand-protein unbinding simulations, using a technique called ensemble splitting, or “coloring” [166, 167, 168, 169]. The starting structure for the trypsin-benzamidine simulations for both REVO and WExplore is the ligand bound in the binding pocket. After each dynamics cycle and before resampling, we apply a boundary condition that examines the conformation of the walkers to determine if the unbound state is reached. A walker is considered unbound if the minimum ligand-protein distance exceeds 10 Å. A walker that reaches the unbound state is “warped”: the structure is set back to the initial bound state. The sum of the weights of the warped walkers is used to determine the unbinding rates and the mean residence time of the ligand. The flux as a function of time is determined using the weights of the warped walkers as follows:

Flux(𝑡) = (Σ_{𝑖∈W} 𝑤_𝑖) / 𝑡    (2.5)

where W is the set of all warped walkers. This flux is shown in Figure 2.9. The black curve shows the average probability of the five runs and is influenced strongly by the highest weighted warping events. In total we observe 1160 unbinding events for REVO and 740 for WExplore. The first unbinding events occur at 182.4 ns and 345.6 ns for REVO and WExplore, respectively.
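The boundary condition and Eq. 2.5 translate directly into code; the warped-walker weights below are hypothetical values for illustration.

```python
def is_unbound(min_ligand_protein_dist, cutoff=10.0):
    """Boundary condition: a walker is unbound once the minimum
    ligand-protein distance exceeds 10 Angstroms."""
    return min_ligand_protein_dist > cutoff

def unbound_flux(warp_weights, t):
    """Eq. 2.5: Flux(t) = (sum of warped-walker weights) / t."""
    return sum(warp_weights) / t

# Hypothetical warped-walker weights accumulated over 100 ns:
flux = unbound_flux([1e-6, 4e-7, 2.5e-6], t=100.0)
```

The mean residence time then follows as the reciprocal of the average flux, as described in the text.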
The ligand residence time, or the mean first passage time of unbinding, can be determined as the reciprocal of the average probability flux [166, 167, 170, 171]. Figure 2.10 shows the predicted residence time as a function of simulation time for both REVO and WExplore. For both methods, the total simulation time over the five runs was 8.75 μs. The final calculated residence times are 3.76 ms for REVO and 1.19 ms for WExplore, which are close to the experimental value of 1.6 ms [172]. The standard error in Figure 2.10 is calculated using the standard error of the average flux from Figure 2.9.

Figure 2.9: The average unbound probability for all runs for (A) REVO and (B) WExplore. The thick blue region represents the standard error of the mean at each time point. The black curve shows the average probability for all runs.

Figure 2.10: Average predicted residence times are shown in black for (A) REVO and (B) WExplore. The red line shows the experimental residence time for the trypsin-benzamidine system [172].

As shown in Figure 2.10, the predicted residence time can exhibit large jumps when new highly-weighted warping events are recorded. A key motivation for developing the REVO method was to increase the consistency in residence time estimates across different simulations. Figure 2.9 shows that the REVO simulations are more consistent in the aggregated unbound probability, ranging from 4.13 × 10^−8 to 2 × 10^−3, whereas the WExplore results varied from 3.09 × 10^−10 to 38 × 10^−3. We quantify the convergence of the average trajectory flux as a function of the size of the trajectory set in Figure 2.11. Importantly, this shows that REVO can obtain more reliable residence time estimates using a smaller number of runs.

2.3.3.2 Heterogeneity of ligand unbinding pathways

We now compare the heterogeneity of the unbinding pathways that are observed using the two sampling methods.
Two conformation space networks (CSNs) are shown in Figure 2.12 that combine sampling results for the five simulations conducted with each sampling algorithm. The undirected CSNs are created using the force-directed layout algorithm Force Atlas in Gephi [173]. Each node in the CSN represents a state, and the size of each node is proportional to the sum of the weights of all walker conformations that were assigned to that state. Directed edge weights are computed as 100 times the transition probability. The weight of each undirected edge in the CSN is the average of the in-edge and out-edge weights. The nodes in the Figure 2.12 networks are colored by the solvent accessible surface area (SASA) in Å^2, with the SASA value averaged over all conformations in that cluster. The total number of frames is 437,280 for both REVO and WExplore. For visualization purposes, the weight of all edges is set to 1.0 after graph minimization.

Figure 2.11: Average trajectory flux values are shown using all possible sub-samples over the set of five runs. Individual averages are shown as points, and the probability of the sub-samples is shown using a violin plot. The trajectory flux corresponding to the experimental residence time is shown as a horizontal red line.

As seen in the CSNs, the bound and unbound states are connected via different exit paths. The CSN for the REVO simulations is shown in Figure 2.12A. We find four main ligand unbinding pathways for trypsin-benzamidine, three of which (Paths 1-3) are consistent with those found in an earlier work using WExplore [87]. The CSN for the WExplore simulations, shown in Figure 2.12B, depicts only two of these pathways. Representative structures from the REVO pathways are shown in Figure 2.12C, and the WExplore pathways are shown in Figure 2.12D. The binding site is predominantly formed by two loops: one, depicted in blue, consists of residues 209-218, and the other, depicted in orange, consists of residues 179-190.
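Returning to the CSN construction: the undirected edge weights described above can be computed from a matrix of state-to-state transition counts. A minimal sketch with illustrative numbers (numpy used for brevity):

```python
import numpy as np

def csn_edge_weights(counts):
    """Directed edge weights are 100 * transition probability
    (row-normalized transition counts); each undirected edge weight
    is the average of the corresponding in-edge and out-edge weights."""
    probs = counts / counts.sum(axis=1, keepdims=True)
    directed = 100.0 * probs
    return 0.5 * (directed + directed.T)

# Toy 2-state transition counts (illustrative only):
counts = np.array([[8.0, 2.0],
                   [1.0, 9.0]])
undirected = csn_edge_weights(counts)
```

Here the 0→1 directed weight is 20 and the 1→0 weight is 10, so the undirected edge weight is their average, 15.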
In Path 1 the ligand exits directly from the binding site without any large changes in loop conformation. Paths 2, 3 and 4 depend on conformational changes of the loop regions. In Path 2 the blue loop opens and the ligand exits through it. This benzamidine unbinding pathway was observed in two previous works [155, 87]. In Path 3, the ligand exits to the right through a newly-formed opening in the orange loop. This path has only been previously observed in our WExplore simulations [87]. Finally, in Path 4 benzamidine exits between the blue and orange loops, as in Path 1, but through a newly-formed opening above the disulfide bond formed by residues CYS188 and CYS212. This path has not been observed by any method before.

To measure the breadth of sampling of individual runs, we jointly cluster the trajectories from REVO and WExplore into a set of 2000 clusters. The numbers of clusters visited by each simulation are shown in Table 2.2. We find that REVO visits a higher number of clusters, on average, with a lower standard error.

Table 2.2: Cluster counts for all simulations.

Resampler    Run        Number of clusters visited
REVO         1          803
             2          720
             3          862
             4          793
             5          892
             Average    814.0
             STD err    26.67
WExplore     1          921
             2          811
             3          876
             4          660
             5          716
             Average    796.8
             STD err    43.42

Figure 2.12: Network representations of the free energy landscapes of binding are shown for REVO (A) and WExplore (B). In both cases, discrete transition path ensembles were visually identified and labeled. Nodes are colored according to their ligand solvent accessible surface area using the color bar at the figure bottom, and node size corresponds to the statistical weight of the states. Representative conformations are shown to depict each ligand unbinding pathway for REVO (C) and WExplore (D). Loop regions 209-218 and 179-190 are shown in blue (left) and orange (right).

Pooling all simulations together, REVO visits 424 clusters that were not visited by WExplore. Conversely, WExplore visits 294 unique clusters.
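The pooled cluster comparison amounts to set operations over per-run cluster assignments. A toy sketch with hypothetical cluster labels:

```python
def pooled_clusters(assignments_by_run):
    """Union of cluster labels visited across a set of runs."""
    return set().union(*(set(a) for a in assignments_by_run))

# Hypothetical cluster-label sequences (one list per run):
revo_runs = [[0, 1, 2, 2], [1, 3, 4]]
wexplore_runs = [[0, 1, 1], [5, 0]]

revo_all = pooled_clusters(revo_runs)
wex_all = pooled_clusters(wexplore_runs)
revo_unique = revo_all - wex_all   # clusters visited only by REVO
wex_unique = wex_all - revo_all    # clusters visited only by WExplore
```

In the thesis analysis the same set differences, computed over the joint 2000-cluster assignment, yield the 424 REVO-only and 294 WExplore-only clusters quoted above.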
This shows that REVO samples a broader set of states than WExplore. To analyze structural properties of the unique clusters found by each algorithm, we determine a representative conformation for each cluster as the conformation with the minimum distance to the center of the cluster.

Table 2.3: Co-clustering information.

                               REVO     WExplore
Exclusive cluster numbers      435      268
Average ligand RMSD (Å)        6.65     7.07
Average loops RMSD (Å)         3.76     3.20
Average SASA (Å^2)             43.30    88.32
Average probability            0.014    0.008

For each algorithm, a density map of ligand poses from unique structures is shown in Figure 2.13. The red volume shows the ligand density for REVO, which is localized mostly inside the blue loop and up between the orange and blue loops, consistent with Path 4. The WExplore unique clusters are concentrated in the area on the surface of trypsin adjacent to the binding site, related to the higher-probability ligand transition Path 1.

Figure 2.13: Ligand densities for unique clusters visited by REVO (red) and WExplore (yellow). Density maps are plotted using the VMD Volmap tool with an isosurface value of 0.02.

2.4 Discussion

The above results demonstrate the ability of REVO to explore a broad sampling space with greater accuracy and range when compared to WExplore simulations. For the 𝑁-dimensional random walk system, we found that the accuracy and range of REVO are greater than those of WExplore for all values of 𝑁, suggesting that REVO may be especially powerful for systems with very high-dimensional sampling spaces. In addition to finding all previously discovered unbinding pathways for the trypsin-benzamidine system, the REVO resampler discovered a new unbinding pathway involving significant protein conformational change. These findings are remarkable as WExplore was already notable for its broad sampling of ligand unbinding pathways in the trypsin-benzamidine system.
As REVO is a region-free sampling algorithm, it is not limited by the region-definition obstacles that hinder its predecessors, WExplore and conventional weighted ensemble sampling. “Thresholding” is a key issue in the WExplore algorithm, occurring when a region is defined on a new level of the hierarchy, resulting in many highly-correlated cloning events of a single trajectory. We hypothesized that removing this behavior would lead to more consistent measurements of observables. Encouragingly, this is exactly what is observed here, both in the unbinding flux for trypsin-benzamidine and in the standard error measurements in the 𝑁-dimensional random walk.

As in WExplore, the distance metric used in REVO is flexible. It can be any measurement of distance, and need not be differentiable as a function of system coordinates. For instance, distances could be defined as differences between TM-scores [174], or other measures of template similarity (e.g. the “global distance test total score” (GDT-TS) used in CASP competitions). Distance metrics can also involve histograms of ion and/or water positions, which are discontinuous as a function of atomic positions. Another means of customization is the novelty function (𝜙 in Equation 2.1). Here, the novelty function for each trajectory is defined using only the trajectory weight; however, this function can include any trajectory feature that is of interest to the researcher. Furthermore, the objective function that we are maximizing in this work is the variation within the trajectory ensemble. This could also be modified to optimize other properties of the ensemble: for instance, the matching of NMR observables such as Nuclear Overhauser Effects and coupling constants, or matching density maps from crystallography or cryo-EM.

The efficient nature of the 𝑁-dimensional random walk system allowed us to run a number of simulations under different conditions to examine the properties of the REVO algorithm.
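As an illustration of this flexibility, a custom distance metric is simply a function over pairs of conformations. The sketch below uses a plain coordinate RMSD (without structural superposition, a simplification) as one such metric; it is an illustrative stand-in, not the metric used in the thesis simulations.

```python
import math

def rmsd_distance(conf_a, conf_b):
    """A simple custom distance metric: coordinate RMSD between two
    conformations, given as equal-length lists of (x, y, z) tuples.
    Note: no alignment/superposition is performed here, a deliberate
    simplification. Any such function, even a non-differentiable one,
    could serve as the resampler's distance metric."""
    n = len(conf_a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(conf_a, conf_b))
    return math.sqrt(sq / n)
```

Swapping in a different metric (TM-score differences, water-histogram distances, and so on) only requires replacing this one function.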
One puzzling result was the increase in the sampling range for 𝑁 = 10 when the weight novelty was turned on. This was counter-intuitive, as the weight novelty term seeks to encourage cloning of outlier trajectories that have reasonably high weights, while simulations without the weight novelty seek to clone the farthest outlier trajectories at all costs. One important factor is that these simulations are run with a minimum attainable trajectory probability (𝑝_min). This can explain the puzzling result, in that the weight novelty encourages higher-weighted trajectories to venture out from the origin, which can be cloned a higher number of times before they reach 𝑝_min. Aside from this small increase in range, we expect the weight novelty to be broadly useful in obtaining accurate rate constants for rare events, as evidenced by the higher accuracy values in the 𝑁-dimensional random walk simulations.

CHAPTER 3

PREDICTING LOGP OF SMALL MOLECULES

This chapter was adapted from two papers titled “ClassicalGSG: Prediction of log 𝑃 using classical molecular force fields and geometric scattering for graphs” [175] and “Predicting partition coefficients for the SAMPL7 physical property challenge using the ClassicalGSG method” [176] as they were published. The figures and tables from the supporting information (SI) are incorporated into the text.

3.1 Introduction

The partition coefficient (𝑃) measures the relative solubility of a compound between two solvents. It is defined as the ratio of the concentration of an uncharged compound dissolved in an organic solvent (e.g. octanol) to that in water. The logarithm of this ratio (log 𝑃) is considered to be one of the main factors in determining the drug-likeness of a chemical compound. log 𝑃 determines lipophilicity, which affects the bioavailability, solubility, and membrane permeability of a drug compound in the body.
Moreover, according to one of the criteria of the famous Lipinski Rule of Five [6], an orally active drug should have a log 𝑃 value of less than 5. Thus, predicting log 𝑃 plays an essential role in the drug discovery process and is our main focus in this study. The prediction of log 𝑃 is also used as a stepping stone to calculate other molecular properties such as the distribution coefficient (log 𝐷) [177], drug solubility (log 𝑆) [178, 179], and lipophilic efficiency (LiPE) [180]. All of these properties have been used in drug discovery and design: log 𝐷 is an effective descriptor for the lipophilicity of an ionized compound [181], LiPE combines the potency and lipophilicity of a drug compound to estimate its quality, and log 𝑆 is important for modeling the solubility of a compound in the human body.

The partition coefficient is widely used in cheminformatics, and there are diverse experimental methods to measure it [182]. However, these methods are time-consuming and expensive for large databases of compounds, and are not feasible for compounds that have not been synthesized [183]. Therefore, a large number of computational methods have been developed to predict accurate values of log 𝑃. These methods have a long history and can be classified by both their input features (atomic/fragment, molecular and hybrid) and their models or algorithms (parametric models vs. machine learning methods) (Table 3.1).

Table 3.1: Classification of log 𝑃 prediction models.
Features           Parametric Models                   Machine Learning Methods
Atomic/Fragment    XlogP3 [184], AlogP [185],          László et al. [189],
                   ClogP [186], KowWIN [187],          Huuskonen et al. [190]
                   JPlogP [188]
Molecular          MlogP [191], iLogP [192, 193],      AlogPS [195, 196],
                   Mannhold [194]                      S+logP [197], CSLogP [91]
Hybrid             Silicos-IT LogP [198]               TopP-S [92], OpenChem [55]

In atomic-based or fragment-based methods [184, 185, 186, 187, 188, 189, 190], which are based on atomic or fragment contributions, a set of atomic or fragment descriptors is used as input to the model, while “molecular” methods use descriptors of the whole molecule, such as topology or molecular fingerprints [191, 192, 193, 194, 195, 196, 197, 91]. “Hybrid” models combine atomic and molecular descriptors as input to the model [198, 92, 55]. In general, there are challenges with both atomic and molecular descriptors. In atomic-based or fragment-based methods, the accuracy of the atomic contributions depends on the similarity of the atomic environments of the atoms in the group; unfortunately, more training data is needed as the number of groups grows larger. Fragment-based methods can struggle with defining the optimal size of the fragments that participate in predictions. On the other hand, the accuracy of molecular (property-based) methods heavily depends on the choice of molecular descriptors. Common descriptors include 3D molecular structure [92], molecular volume and surface area [199], solvation free energies [192, 200, 193], and the numbers of carbon atoms and heteroatoms [191]. Furthermore, these molecular descriptors can be difficult or computationally costly to generate.

A second way of categorizing log 𝑃 predictors is by the type of mathematical model used to process the input data. Parametric models use methods such as least squares estimation or multiple linear regression to fit parameters that govern the relative contributions of the different input features.
Machine learning methods such as Support Vector Machines (SVMs) [201, 202, 203], Neural Networks (NNs) [201, 202, 204, 92], and Graph Convolutional Networks (GCNs) [49] have been used to predict log 𝑃 values. Recently, some methods have been described that create their own custom molecular features from atomic attributes, which go beyond simple additive models. The TopP-S [92] predictor, developed by Wu et al., uses the atomic positions to create topological descriptors called Betti barcodes. These Betti barcodes are used as molecular features that are input into deep neural networks, along with atomic attributes such as atom type. Results were shown to further improve upon the addition of 633 “molecular fingerprints” calculated with ChemoPy [205]. TopP-S has shown success in predicting log 𝑃 over other methods like XlogP3, ClogP and KowWIN on a group of independent test sets.

Graph representations of molecules have also shown success in various applications including predicting molecular properties [54, 93, 206, 207, 49, 208], virtual screening [209] and molecular force field calculations [95]. In particular, OpenChem [55] uses a graph representation of the molecules. Each atom represents a node in a graph and has a vector of atomic attributes including element type, valence, charge, hybridization, and aromaticity; bonded atoms are connected by an edge in the graph. GCNs are then trained on the graph representations created using these atomic attributes and the 2D structure, or “graph structure”, of the molecules.

Graph representations are beneficial in that they are invariant to translation, rotation, and reflection symmetries. Another molecular symmetry that should be respected is invariance to the re-indexation of atoms: changing the order in which atoms are input to the model should not affect the computed molecular features.
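A quick numerical check of re-indexation symmetry: summing per-atom attribute rows yields the same molecular feature vector for any atom ordering, whereas simply concatenating the rows does not. (Toy numbers; numpy used for brevity.)

```python
import numpy as np

attrs = np.array([[1.0, 0.5],
                  [0.2, 0.1],
                  [0.7, 0.9]])          # 3 atoms, 2 attributes each
perm = np.array([2, 0, 1])              # a re-indexation of the atoms

summed = attrs.sum(axis=0)              # re-indexation invariant
summed_perm = attrs[perm].sum(axis=0)   # identical to `summed`

concat = attrs.ravel()                  # order-dependent representation
concat_perm = attrs[perm].ravel()       # differs from `concat`
```

This is exactly why simple additive models are re-indexation invariant but lose structural detail, motivating the construction discussed next.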
Summation operations respect re-indexation symmetry, but it is not straightforward to capture more detailed information about molecular structure while maintaining this symmetry. A recently-described method, Geometric Scattering for Graphs (GSG) [105], provides a solution to this problem. GSG, which is analogous to GCNs, creates molecular features by scattering atomic attributes across a graph using lazy random walks. GSG is fast at creating re-indexation-invariant features, and its feature vectors have the same length for every molecule, allowing us to easily measure the similarity of molecules, even those with different numbers of atoms. It has shown promising results in the classification of social network data and in predicting Enzyme Commission (EC) numbers [105]. Given this abundance of algorithms for creating molecular features, we seek to compare different methods based on molecular graph representations and their ability to predict log 𝑃. Here we use GSG in combination with a set of atomic descriptors that are generated for use with classical molecular dynamics force fields: partial charges, atom types, and Lennard-Jones interaction parameters. We call this method “ClassicalGSG” and examine its performance as a function of different atomic/molecular features.

In publications, log 𝑃 predictor methods are often assessed using specific test sets, making comparison between different algorithms challenging. Hence, benchmarks and standardized test sets are needed to effectively compare these methods and further advance research on log 𝑃 prediction. To help meet this need, the Statistical Assessment of the Modeling of Proteins and Ligands (SAMPL) [210] project recently created a distinct blind challenge for predicting log 𝑃, allowing fair evaluation and comparison of different log 𝑃 prediction methods (SAMPL6 in 2019 [211] and SAMPL7 in 2020 [212]).
In this challenge, participants predict log 𝑃 for a set of drug-like molecules and the predictions are assessed against experimental log 𝑃 values that are revealed later. The submitted prediction methods are classified into one of the following categories: empirical methods, physical molecular mechanics (MM)-based methods, physical quantum mechanics (QM)-based methods, or mixed methods. Empirical methods [184, 185, 186, 187, 188, 189, 191, 193, 190, 194, 195, 197, 91, 198, 92, 49, 213] are data-driven methods in which predictor models are trained directly on a dataset of molecules. The empirical category includes methods that employ atomic/fragment-based additive methods, machine learning, and quantitative structure-property relationship (QSPR) approaches. In MM-based methods [214, 215, 200, 216, 217, 214], molecular dynamics simulations are run and used to estimate the solvation free energy; the log 𝑃 for a compound is then calculated analytically from the solvation energy. QM-based methods [218, 219, 220, 221, 222, 223, 224] utilize solvation free energies estimated from quantum mechanical energy calculations. Mixed approaches [225, 223, 226, 227] employ a combination of physical (QM/MM-based) and empirical techniques.

Here, we first employed ClassicalGSG to examine the performance of two force field parameter matching programs: the CHARMM General Force Field (CGenFF) [228, 229] and the General AMBER Force Field 2 (GAFF2) [230, 231]. The NN models were trained using a dataset of molecules made available by OpenChem [55], and we showed that CGenFF-generated parameters with a specific ad hoc scheme for classifying CGenFF atom types achieved the highest accuracy in predicting log 𝑃 values. We also examine the ClassicalGSG performance as a function of different atomic/molecular features, and compare the ClassicalGSG results with GCNs trained on the same data using a variety of atomic attributes, including those from previous work [49].
We then evaluate the performance of ClassicalGSG on several independent test sets and study the properties of the features generated in the pipeline of GSGNN models. In addition, we investigate the properties of molecules with high log 𝑃 prediction error. For the SAMPL7 target molecules, we used the best-performing parameter sets to train four log 𝑃 predictor models. One of our verified (but non-ranked) prediction sets achieved the lowest RMSE (0.55) and the second-lowest Mean Absolute Error (MAE, 0.44) among all the submitted predictions. We note that in SAMPL6 [211] the best set of predictions, made using Cosmotherm [219], achieved an RMSE of 0.38, and 10 models achieved an RMSE of less than 0.5. In SAMPL7, none of the submitted predictions were below this threshold, implying that these molecules had specific structures that introduced difficulty into both the empirical and physical predictions. Here we describe the process of curating the four training datasets, training the models, and making predictions for the SAMPL7 target molecules. Further, to achieve better predictions we examined the performance of various open-source force fields such as the General AMBER Force Field (GAFF) [232], the Universal Force Field (UFF) [233], the Merck Molecular Force Field 94 (MMFF94) [234, 235] and Ghemical [236]. Our results show that MMFF94 models create predictors that on average are as accurate as or better than those created with CGenFF.

We conclude with a discussion of the features generated by the GSG method, the computational cost of generating atomic attributes with CGenFF and GAFF2, and the relative performance of 2D versus 3D structure in predicting log 𝑃 values. We then discuss the curation of training sets for the SAMPL7 challenge, the performance of open-source force field generator tools, and the code available in the ClassicalGSG package.

3.2 Methods

3.2.1 Datasets

The dataset used in this work was generated by Korshunova et al. [55] from the public version of the PHYSPROP database [237].
This dataset consists of 14176 molecules in SMILES format and their corresponding log 𝑃 values, which we refer to as the OpenChem dataset. The molecules are converted from SMILES format to mol2 format and their 3D structures are created by OpenBabel (https://github.com/openbabel/openbabel). Then CHARMM General Force Field (CGenFF) [228, 229] and General AMBER Force Field 2 (GAFF2) [230, 231] parameter files are generated for each molecule. These force field parameter files are created either by CGenFF using the CGenFF tool of the SilcsBio software package (http://silcsbio.com) or by the Antechamber tool implemented in the AmberTools18 [238] package. The process of generating parameter files fails for 175 molecules with CGenFF and for 681 molecules with GAFF2. These 774 molecules are removed from the OpenChem dataset, resulting in a dataset of 13402 molecules. Then 80% of the molecules are used for training and the rest for testing. In addition, we evaluate our trained models on four independent test sets, shown in Table 3.2. The Star and NonStar [194] test sets are publicly available at https://ochem.eu/article/17434 and the Huuskonen [190] test set can be found at https://ochem.eu/article/164. The FDA dataset contains drug molecules that are approved by the Food and Drug Administration (FDA) of the United States and was originally prepared by Cheng et al. [184].

Table 3.2: Independent test sets used for evaluating ClassicalGSG models.

Test set name      Number of molecules
FDA [184]          406
Huuskonen [190]    348
Star [194]         223
NonStar [194]      43

For the SAMPL7 challenge, we first built a master training dataset by combining the log 𝑃 datasets in Table 3.3.

Table 3.3: log 𝑃 datasets used for training.
Dataset name                         Number of molecules
PHYSPROP [237]                       41,039
Huuskonen training set [190, 239]    1,496
TopP-S [190]                         8,199
OpenChem [55]                        14,176
ALOGPS_3_01                          17,436
Logpt_all_data_training              233
Logpt_challenge_training             187

The physical properties database (PHYSPROP) [237] was built by the Syracuse Research Corporation (SRC) and contains the log P values of over 41,000 diverse chemical compounds. Here, we used the public version of PHYSPROP. The Huuskonen dataset [190] has 1,844 unique molecules, combining 326 molecules from its initial version [239] with 1,663 molecules from the Klopman dataset [240]. The 1,844 molecules in the Huuskonen dataset have been organized into a training set with 1,496 compounds and a test set with 348 compounds. Here we use molecules from the Huuskonen training set. The TopP-S dataset consists of 8,199 chemical compounds, initially compiled by Hansch et al. [241] and later refined by Cheng et al. [184] to include only molecules with reliable experimental log P values. The OpenChem dataset was curated from the PHYSPROP drug database [55] and contains 14,176 molecules. Logpt_all_data_training, ALOGPS_3_01, and Logpt_challenge_training are public log P training sets which can be downloaded from https://ochem.eu. The RDKit program [242] is employed to create canonical SMILES for the molecules in these 7 datasets. After removing duplicate molecules, 44,595 molecules remained in the dataset. As the generation of CGenFF atomic attributes failed for some molecules, we ended up with 41,409 molecules in our dataset, which we refer to as the "master dataset". The master dataset itself serves as "DB1", which is used to train a GSG model to generate a set of predictions for the SAMPL7 molecules. The presence of only five atomic elements (C, N, O, S, and H) in the SAMPL7 target molecules motivated us to make a subset of the master dataset where each compound contains only these atomic elements, which we call "DB2".
Molecules that either had an element not listed above or did not have the full set of elements were not selected. The DB2 dataset has 3,482 molecules. Also, the existence of a specific structure (a sulfonyl moiety) in all of the SAMPL7 target molecules inspired us to generate a third dataset by filtering the master training set and keeping only molecules with a sulfonyl moiety. The "HasSubstructMatch" function of RDKit was used to check whether a molecule has this moiety. The resulting training dataset is referred to as "DB3" and has 2,379 molecules. The fourth training set was obtained by filtering the master dataset and keeping only molecules with both a sulfonyl moiety and the elements (C, N, O, S, and H). This training set has 1,482 molecules, and we refer to it as "DB4". These four datasets, DB1, DB2, DB3, and DB4, are used to train four ClassicalGSG models and generate four sets of predictions for the SAMPL7 target molecules. Further, to assess the performance of the open-source force field tools we use a group of external test sets including FDA [184], Star [194], NonStar [194], Huuskonen [190, 239] (Table 3.2) and the SAMPL6 molecule set [211]. To quantify our uncertainty, we chose molecules from these test sets that are similar to the set of SAMPL7 molecules. More specifically, these test sets are filtered to include molecules with a sulfonyl moiety and containing each of the elements (C, N, O, S, and H); molecules containing other elements were excluded. The 44 selected molecules are filtered further by keeping those whose molecular weight lies within the range of the SAMPL7 molecular weights. The resulting test set has 36 molecules and is referred to as S7_TEST.

3.2.2 Generating atomic attributes

As mentioned above, we use atomic attributes including partial charges, atom types, and Lennard-Jones interaction parameters. Below, we explain how we generate these atomic attributes.
Atomic attributes for each atom are extracted from the parameter files generated by CGenFF [228, 229], GAFF2 [230, 231], or open-source force fields such as GAFF [232], UFF [233], MMFF94 [234, 235] and Ghemical [236]. To determine atom type classifications, we compared a number of different schemes. In one scheme, we classify atom types into one of five categories as shown in Table 3.6. This is referred to below as "AC5". Alternatively, we directly use the atom types as generated by either CGenFF or GAFF2; referred to below as "ACall". In the third classification scheme, we manually sorted CGenFF atom types into 36 groups (AC36; Table 3.4) and GAFF2 atom types into 31 groups (AC31; Table 3.5) based on chemical knowledge.

Table 3.4: Classifying CGenFF atom types in 36 categories (AC36). The C_Name is the name of the category, and C_Value shows the value of the category.

Atom type  C_Name  C_Value | Atom type  C_Name  C_Value | Atom type  C_Name  C_Value
LPH     LPH  0  | OG2D1   O2   10 | CG324   C3   19
NG321   N3   1  | OG2S1   O2   10 | CG331   C3   19
NG301   N3   1  | OG2D4   O2   10 | CG302   C3   19
NG311   N3   1  | SG2R50  S3R  11 | CG301   C3   19
NG331   N3   1  | SG311   S3R  11 | SG3O3   SO3  20
NG3N1   N3   1  | CG2O5   C2   12 | OG304   O3   21
NG3P1   N3   1  | CG2O6   C2   12 | OG312   O3   21
CLGR1   CL   2  | CG2O1   C2   12 | OG311   O3   21
CLGA3   CL   2  | CG2O7   C2   12 | OG303   O3   21
CLGA1   CL   2  | CG2D1   C2   12 | OG302   O3   21
HGPAM3  H2   3  | CG2D2   C2   12 | OG301   O3   21
HGPAM2  H2   3  | CG2D1O  C2   12 | NG3P0   N4   22
HGPAM1  H2   3  | CG2O4   C2   12 | NG3P3   N4   22
HGP5    H2   3  | CG2O3   C2   12 | NG3P2   N4   22
HGP3    H2   3  | CG2D2O  C2   12 | HGR51   H3   23
HGP2    H2   3  | CG2DC1  C2   12 | HGR62   H3   23
HGP1    H2   3  | CG2DC2  C2   12 | HGR61   H3   23
HGP4    H2   3  | CG2DC3  C2   12 | HGR53   H3   23
SG301   S3   4  | CG2O2   C2   12 | HGR52   H3   23
CG2R63  C2R  5  | CG2N1   C2   12 | HGR71   H3   23
CG25C1  C2R  5  | CG2N2   C2   12 | HGR63   H3   23
CG25C2  C2R  5  | CG1T2   C1   13 | PG2     P2   24
CG2R57  C2R  5  | CG1N1   C1   13 | IGR1    I    25
CG2R53  C2R  5  | CG1T1   C1   13 | CG3C52  C3R  26
CG251O  C2R  5  | OG2N1   O2B  14 | CG3C31  C3R  26
CG252O  C2R  5  | SG302   S3C  15 | CG3C50  C3R  26
CG2R61  C2R  5  | NG2RC0  N2R  16 | CG3C54  C3R  26
CG2R62  C2R  5  | NG2R50  N2R  16 | CG3RC1  C3R  26
CG2R64  C2R  5  | NG2R62  N2R  16 | CG3C41  C3R  26
CG2R51  C2R  5  | NG2R61  N2R  16 | CG3C51  C3R  26
CG2R67  C2R  5  | NG2R60  N2R  16 | CG3C53  C3R  26
CG2RC0  C2R  5  | NG2R57  N2R  16 | OG2D2   O2A  27
CG2R71  C2R  5  | NG2R53  N2R  16 | SG3O1   SO1  28
CG2RC7  C2R  5  | NG2R52  N2R  16 | BRGA1   BR   29
CG2R52  C2R  5  | NG2R51  N2R  16 | BRGA2   BR   29
CG2R66  C2R  5  | NG2R43  N2R  16 | BRGA3   BR   29
SG2D1   S2   6  | NG2R67  N2R  16 | BRGR1   BR   29
SG2P1   S2   6  | ALG1    AL   17 | NG2S0   N2   30
SG2P2   S2   6  | FGA3    F    18 | NG2S1   N2   30
SG3O2   SO2  7  | FGR1    F    18 | NG2S2   N2   30
NG1T1   N1   8  | FGP1    F    18 | NG2S3   N2   30
HGA1    H1   9  | FGA1    F    18 | NG2D1   N2   30
HGA7    H1   9  | FGA2    F    18 | NG2P1   N2   30
HGA4    H1   9  | CG3AM2  C3   19 | NG2O1   N2   30
HGA5    H1   9  | CG3AM1  C3   19 | OG2P1   O2C  31
HGA6    H1   9  | CG3AM0  C3   19 | PG0     P0   32
HGAAM0  H1   9  | CG334   C3   19 | NG3C51  N3R  33
HGA3    H1   9  | CG311   C3   19 | OG2R50  O3R  34
HGAAM1  H1   9  | CG314   C3   19 | OG3R60  O3R  34
HGAAM2  H1   9  | CG321   C3   19 | OG3C31  O3R  34
HGA2    H1   9  | CG322   C3   19 | OG3C61  O3R  34
OG2D5   O2   10 | CG323   C3   19 | OG3C51  O3R  34
OG2D3   O2   10 | CG312   C3   19 | PG1     P1   35

Table 3.5: Classifying GAFF2 atom types in 31 categories (AC31). The C_Name is the name of the category, and C_Value shows the value of the category.

Atom type  C_Name  C_Value | Atom type  C_Name  C_Value | Atom type  C_Name  C_Value
c    C2   0 | f    F    8  | nn   N3R  17
cz   C2   0 | cl   CL   9  | np   N3R  17
cf   C2   0 | br   BR   10 | nq   N3R  17
ce   C2   0 | i    I    11 | n5   N3R  17
cs   C2   0 | n    N2   12 | n6   N3R  17
c2   C2   0 | nt   N2   12 | nu   N3R  17
cg   C1   1 | n2   N2   12 | o    O2   18
ch   C1   1 | ns   N2   12 | ow   O3   19
c1   C1   1 | nf   N2   12 | os   O3   19
c3   C3   2 | na   N2   12 | oh   O3   19
cc   C2R  3 | ne   N2   12 | oq   O3R  20
cq   C2R  3 | n1   N1   13 | op   O3R  20
cp   C2R  3 | n8   N3   14 | pe   P2   21
ca   C2R  3 | n7   N3   14 | pf   P2   21
cd   C2R  3 | n3   N3   14 | p2   P2   21
cx   C3R  4 | n9   N3   14 | p4   P3   22
cy   C3R  4 | nk   N4   15 | p3   P3   22
cu   C2R  4 | nl   N4   15 | px   P3   22
cv   C2R  4 | n+   N4   15 | p5   P4   23
h5   H1   5 | nz   N4   15 | py   P4   23
hx   H1   5 | no   N4   15 | pb   P2R  24
h4   H1   5 | nx   N4   15 | pc   P2R  24
hc   H1   5 | n4   N4   15 | pd   P2R  24
h2   H1   5 | ny   N4   15 | s    S1   25
h3   H1   5 | nd   N2R  16 | s2   S2   26
h1   H1   5 | nc   N2R  16 | sh   S2   26
ha   H3   6 | ni   N2R  16 | s4   S3   27
hn   H2   7 | nj   N2R  16 | sx   S3   27
ho   H2   7 | nb   N2R  16 | s6   S4   28
hp   H2   7 | nv   N3R  17 | sy   S4   28
hs   H2   7 | nh   N3R  17 | sp   S3R  29
hw   H2   7 | nm   N3R  17 | sq   S3R  29
ss   SS   30

Specifically,
efforts were made to make new groups for atom types with different elements and hybridization values, and to separately identify atoms that are members of ring structures. Finally, a fourth classification scheme simply uses a uniform atom type for all atoms, referred to as "AC1".

Table 3.6: Classifying atoms in 5 categories (AC5).

Atom type                              Category number
Hydrogen                               1
Oxygen and Nitrogen                    2
Carbon with hybridization value < 3    3
Carbon with hybridization value = 3    4
Others                                 5

The two Lennard-Jones parameters, the radius (r) and well-depth (ε), are extracted from either CHARMM or AMBER parameter files for each atom type. In summary, the atomic attributes are defined by both the force field generation tool (CGenFF or GAFF2) and the atom classification scheme (AC1, AC5, AC36, AC31 or ACall), and we generally refer to these as "FF" atomic attributes. The atomic charges and Lennard-Jones parameters are scalar values, but atom types are represented in one-hot encoding format. For comparison, we also examine a different set of atomic attributes, following previous work for the log P predictor from the OpenChem toolkit (https://github.com/Mariewelt/OpenChem.git) [55]. These atomic attributes are defined as: atom element type, valence, charge, hybridization, and aromaticity. The 3D structure is generated from SMILES strings by RDKit (https://www.rdkit.org); then, again using RDKit, the atomic attributes are produced for each molecule and represented in one-hot encoding format as node features in OpenChem. These are referred to below as "OPENCHEM" atomic attributes.

3.2.3 Log P predictions using GSG

3.2.3.1 Geometric Scattering for graphs

The Geometric Scattering for Graphs (GSG) method, which was introduced in Ref. [105], is a feature extraction method for graph data types and uses a geometric transform defined on the graph.
In the GSG method, molecules are represented by a graph of atoms where each atom has a vector of attributes; a single attribute evaluated at each vertex is also referred to as a "graph signal". GSG combines the molecular structure (defined using an adjacency matrix) and atomic signal vectors to construct invariant, stable, and informative features as shown in Figure 3.1.

Figure 3.1: Architecture of the GSG method. The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes that are shown as colored bars. Wavelet matrices Ψ are built at different logarithmic scales, j, using the adjacency matrix as described in the text. Finally, the scattering transform is applied to get the graph features using both the wavelet matrices and the signal vectors. Modified from figure made by Feng et al. [105].

Now we briefly explain the mathematics behind this method. Let G = (V, E, W) be a weighted graph and x(v_l), 1 ≤ l ≤ n, be the signal function defined on each node, where n is the number of nodes in the graph and v_l represents node l. A graph random walk operator is defined as P = (1/2)(I + AD^{-1}), where A is the adjacency matrix and D is the degree matrix. P^t describes the probability distribution of a lazy random walk after t steps, hence the term lazy random walk for P. In fact, graph random walks are low-pass filters that act like convolution filters to capture graph features at different scales. Graph wavelet operators, also known as "signal transform operators for graphs", are defined at the scale 2^j as Ψ_j = P^{2^{j-1}} − P^{2^j}. Finally, a set of scattering transform operators, S_0, S_1 and S_2, are defined using the graph signals and wavelets to compute the molecular-level features.
Unique S_0, S_1 and S_2 vectors are created using different moments (q) of x and their wavelet coefficients Ψ, as well as different wavelet scales indexed by j:

(S_0)_q = Σ_{l=1}^{n} x(v_l)^q,   1 ≤ q ≤ Q   (3.1)

(S_1)_{j,q} = Σ_{l=1}^{n} |Ψ_j x(v_l)|^q,   1 ≤ j ≤ J,  1 ≤ q ≤ Q   (3.2)

(S_2)_{j,j',q} = Σ_{l=1}^{n} |Ψ_{j'} |Ψ_j x(v_l)||^q,   1 ≤ j < j' ≤ J,  1 ≤ q ≤ Q   (3.3)

The scattering operators are named based on the number of times the wavelets Ψ are used to transform the signal x. For example, S_0 features do not use the wavelet operators, and are simply different moments of the atomic signals. These are called the "zero order scattering moments." Accordingly, the S_1 operators with one wavelet transformation are called "first order scattering moments" and S_2 with two wavelet transformations are called the "second order scattering moments." Note that these allow mixing between different wavelet scales, as j and j' are set independently. We define the set S as the concatenation of all S_0, S_1 and S_2 vectors using all possible values of q, j and j' as specified by the ranges in Eqs. 3.1, 3.2 and 3.3. The set of graph features is generated deterministically from the atomic signals and adjacency matrix. The size of this set of features is equal to Q N_s (1 + J(J + 1)/2), where N_s is the number of attributes per vertex, although this is lower if not all zeroth, first and second order features are used. As mentioned above, one of the inputs to GSG is the adjacency matrix of the graph, and we consider two distinct ways to represent it. One is to use the 2D connectivity of the molecule, where A_ij = 1 indicates a bond between atoms i and j, and A_ij = 0 otherwise.
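To make the construction concrete, the lazy random walk, the wavelet operators Ψ_j, and the scattering moments of Eqs. 3.1-3.3 can be sketched in a few lines of NumPy. This is a simplified illustration for a single graph signal on a small 2D-connectivity graph, not the actual ClassicalGSG implementation:

```python
import numpy as np

def lazy_random_walk(A):
    """P = (1/2)(I + A D^-1), built from the adjacency matrix A (connected graph assumed)."""
    d = A.sum(axis=0)                       # node degrees
    return 0.5 * (np.eye(len(A)) + A / d)   # A / d scales column j by 1/d_j, i.e. A D^-1

def wavelets(A, J):
    """Psi_j = P^(2^(j-1)) - P^(2^j) for j = 1..J."""
    P = lazy_random_walk(A)
    return [np.linalg.matrix_power(P, 2 ** (j - 1)) - np.linalg.matrix_power(P, 2 ** j)
            for j in range(1, J + 1)]

def scattering_features(A, x, J=4, Q=4):
    """Zeroth-, first- and second-order scattering moments of a single signal x."""
    Psi = wavelets(A, J)
    feats = [np.sum(x ** q) for q in range(1, Q + 1)]            # S0: plain moments
    for j in range(J):
        wx = np.abs(Psi[j] @ x)
        feats += [np.sum(wx ** q) for q in range(1, Q + 1)]      # S1: one wavelet
        for jp in range(j + 1, J):                               # mix scales j < j'
            wwx = np.abs(Psi[jp] @ wx)
            feats += [np.sum(wwx ** q) for q in range(1, Q + 1)] # S2: two wavelets
    return np.array(feats)
```

For J = 4 and Q = 4 this yields Q(1 + J(J + 1)/2) = 44 features, matching the feature-count formula above with N_s = 1.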
Alternatively, the adjacency matrix can be calculated using the 3D structure, where A_ij = f(R_ij), and f(R) is a smooth and differentiable function that is equal to 1 at low R and decreases to 0 as R exceeds some cutoff value R_c [94]:

f(R_ij) = 0.5 (cos(π R_ij / R_c) + 1)   for R_ij ≤ R_c
f(R_ij) = 0                             for R_ij > R_c   (3.4)

where R_c is the radial cutoff and R_ij is the Euclidean distance between atoms i and j. If f(R_ij) is non-zero, nodes i and j are considered connected, and disconnected otherwise. The wavelet operators in GSG can thus be defined using either discrete (2D) or continuous (3D) values in the adjacency matrix.

3.2.3.2 Neural networks architecture and training

To predict the log P values from the GSG features (S), we develop a feedforward neural network using PyTorch [243] with a nonlinear activation function such as the Rectified Linear Unit (ReLU). To determine the best performing network, we consider three important hyperparameters: the number of hidden layers, the number of neurons in a hidden layer, and the dropout rate, whose ranges are shown in Table 3.7. We perform cross-validation using the PyTorch wrapper Skorch (https://skorch.readthedocs.io/en/stable/) to tune the hyperparameters. The loss function of our NN model is MSELoss (mean squared error) and we use Adam (adaptive moment estimation) [244] for optimizing the parameters. We use the MultiStepLR scheduler, with the learning rate set to an initial value of 0.005 and decreased during training every 15 steps by a factor of 0.5. In training our GSGNN models, we used two data regularization methods: the L1 norm and standardization using the StandardScaler function from scikit-learn [245]. For GSG features with maximum wavelet scales of J = 4, 5 and 6 we mostly benefit from using the standardization method. Other settings can be found in Table 3.7.

Table 3.7: Neural network settings. Square brackets denote possible parameter values used in the grid search method.
Parameter                  Values
Number of hidden layers    [2, 3, 4, 5]
Size of hidden layers      [300, 400, 500]
Dropout rate               [0.2, 0.4]
Initial learning rate      0.005
Learning coefficient       0.5
Batch size                 256
Max epoch size             400

3.2.4 Log P predictions using GCNs

3.2.4.1 Graph convolutional neural network

Graph convolutional neural networks (GCNs) extend the application of convolutional neural networks to graph data. The goal of GCNs is to take a graph and generate features according to node attributes and graph structure. The heart of the GCN method is the convolution operator, which aggregates the features of neighboring nodes within the graph. Recently, various implementations of GCNs [246, 247, 248, 249, 250, 207] have been developed to increase the speed and accuracy of GCN models. In this work, we employ the GCN model that was developed in OpenChem [55] based on the method introduced by Kipf and Welling [249]. Similar to the GSG model above, this model takes a graph G = (V, E) as the input, where each node has a vector of attributes x. The model processes the graph by passing it through multiple hidden layers performing convolution operations (Figure 3.2).

Figure 3.2: Architecture of the GCN method. The adjacency matrix describes the graph structure of the molecule. Each atom has a set of attributes, shown as colored bars. GCN layers are shown in gray and are each followed by a max-pooling layer, shown in purple. The graph gathering layer, shown in green, sums the features over all nodes to generate the molecular feature vector.
The GCN method works by propagating a feature matrix H, originally set to the value of the attribute vectors x for all nodes, through a set of convolution operators as follows:

H^(0) = X
H^(l+1) = σ(Ã H^(l) W^(l))   (3.5)

where Ã = A + I is the adjacency matrix of the input graph with self-connections, I is the identity matrix, W^(l) is the trainable weight matrix of layer l, and σ is a non-linearity function such as ReLU. H^(l) denotes the value of the feature matrix on layer l. Each convolution layer is followed by a graph max-pooling layer introduced in [250]. Following these convolution and max-pooling operations, the final set of graph features is obtained by a graph gather layer, where the values of each feature are summed over nodes. This last layer gives GCNs their index-invariance property.

3.2.4.2 GCN architecture and training

We predict log P using the GCN model implemented in the OpenChem toolkit (https://mariewelt.github.io/OpenChem/html/index.html). This model contains 5 layers of graph convolutions with a hidden layer size of 128. The GCN layers are followed by max-pooling and a graph gather layer. A 2-layer neural network with ReLU as the activation function is added after the graph gather layer. The PyTorch Adam optimizer and MSELoss (mean squared error) are used as the parameter optimizer and training loss function, respectively. We use the MultiStepLR learning scheduler, implemented in PyTorch, with an initial value of 0.001, a step size of 15 and a learning coefficient of 0.5 for training the model. Parameters used for GCN training are summarized in Table 3.8.
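The propagation rule in Eq. 3.5, followed by a sum-over-nodes gather layer, can be sketched as follows. This is an illustrative NumPy sketch with random weights that omits the max-pooling layers; it is not the OpenChem implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gcn_forward(A, X, weights):
    """Propagate node features: H_(l+1) = relu(A~ H_l W_l), then gather by summing over nodes."""
    A_tilde = A + np.eye(len(A))   # adjacency with self-connections
    H = X
    for W in weights:
        H = relu(A_tilde @ H @ W)
    return H.sum(axis=0)           # graph gather layer: sum over nodes

rng = np.random.default_rng(0)
A = np.array([[0., 1.], [1., 0.]])     # two bonded atoms
X = rng.normal(size=(2, 4))            # 4 attributes per atom
weights = [rng.normal(size=(4, 8)), rng.normal(size=(8, 8))]
g = gcn_forward(A, X, weights)         # molecular feature vector of size 8
```

Because the gather step sums over nodes, relabeling the atoms leaves the output unchanged, which is the index-invariance property mentioned above.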
Table 3.8: Neural network settings.

Parameter                      Values
Number of GCN hidden layers    5
Number of NN hidden layers     2
Size of hidden layers          128
Dropout rate                   0
Initial learning rate          0.01
Learning coefficient           0.5
Batch size                     128
Max epoch size                 200

3.3 Results

3.3.1 Evaluation of molecular representations

For our set of atomic attributes (FF), we use parameters from classical MD force fields: the partial charge, and the Lennard-Jones radius and well-depth. We first compare the performance of atomic attributes generated using two different algorithms: GAFF2 and CGenFF. Both algorithms are used to automatically generate force field parameters for small molecule ligands by matching atomic environments with atoms that are part of existing force fields; GAFF2 uses the Amber [232] force field, while CGenFF uses the CHARMM [228, 229] force field. Here we generate molecular features from atomic attributes using GSG (see Section 3.2.3.1). We examine the four different atom type classification schemes discussed above: AC1, AC5, AC36/AC31 and ACall. The GSG parameters used to construct the molecular features are shown in Table 3.9.

Table 3.9: Parameters for the Geometric Scattering for Graphs algorithm. The square brackets show all of the values examined for each parameter. For "Scattering operators", 'z' represents the zero order operator, 'f' is first order, and 's' is second order.

Parameter                          Values
Adjacency matrix (A)               [2D, 3D]
Wavelet maximum scale index (J)    [4, 5, 6, 7, 8]
Scattering operators (z, f, s)     Four combinations: (z, f), (z, s), (f, s), (z, f, s)
Atom classification                [AC1, AC5, AC36/AC31, ACall]

We trained 160 models for each of the CGenFF and GAFF2 atomic attribute sets, each trained using a 3-fold cross-validation method. After training, we ran tests on subsets of the full database using an 80:20 train:test split.
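The split-and-evaluate protocol can be sketched as follows. This is a minimal NumPy illustration with placeholder prediction arrays; the actual pipeline uses Skorch-wrapped PyTorch models:

```python
import numpy as np

def split_80_20(n, seed=0):
    """Shuffle indices and return disjoint 80:20 train/test index sets."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = int(0.8 * n)
    return idx[:cut], idx[cut:]

def evaluate(y_true, y_pred):
    """Squared Pearson correlation, root mean squared error, and mean unsigned error."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    mue = np.mean(np.abs(y_true - y_pred))
    return r ** 2, rmse, mue
```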
We then calculated evaluation metrics such as the squared correlation coefficient (r^2), root mean squared error (RMSE) and mean unsigned error (MUE) between the predicted and experimental log P values. In Figure 3.3, r^2 and RMSE values are shown for predictions using 2D and 3D molecular structures. Each r^2 and RMSE is averaged over the 20 NN models that are created with all combinations of the 4 geometric scattering operator sets and the 5 maximum wavelet scales shown in Table 3.9.

Figure 3.3: Average r^2 (A) and RMSE (B) for the OpenChem test set using GSGNN models. Each average is calculated over 20 individual parameter values and the error bars show the best and worst performing models. The atomic attributes are generated with either CGenFF or GAFF2 force fields and using one of four atom type classification schemes ("AC1", "AC5", "AC36/AC31" or "ACall").

We find that models with 2D structure universally have higher accuracy in the prediction of log P values. Additionally, Figure 3.3 demonstrates that CGenFF atomic attributes on average are more accurate in predicting log P values compared to GAFF2 atomic attributes, although this is not true for ACall. We find that good results on average are obtained by classifying atom types using AC36 categories; the best performing individual models are also found in this category. We thus use CGenFF, 2D structure and AC36 for all models going forward.

Figure 3.4: The r^2 for the OpenChem test set using GSGNN models, plotted against the wavelet maximum scale index (J) for each scattering operator set. The atomic attributes are all generated with CGenFF force fields, the AC36 atom type classification scheme, and 2D molecular structure.

We next investigate different scattering moment sets and wavelet scales in Figure 3.4. We find that for each examined maximum wavelet scale (4, 5, 6, 7 and 8) there is at least one model with an r^2 value of 0.9.
The models also perform well for all combinations of scattering moments, making it difficult to determine an optimal set of parameters. To further evaluate the accuracy of our method, we examine four different independent test sets: FDA, Huuskonen, Star and NonStar, which are described in Section 3.2.1. To avoid any accidental overlap, we identified shared molecules and removed them from our training dataset. The training dataset contains 10,722 randomly selected molecules from the OpenChem dataset that are successfully processed by CGenFF. The trained models are evaluated using these test sets and the r^2 between the actual and predicted log P values is calculated. The results are shown in Figure 3.5. For the Huuskonen and FDA data sets we find that there is at least one GSGNN model with r^2 = 0.92 and r^2 = 0.90, respectively. The best performing model for the Star test set uses a maximum wavelet scale of J = 4 and zero- and second-order scattering operators, while the best model for NonStar uses a maximum wavelet scale of J = 7 and zero-, first- and second-order scattering operators.

Figure 3.5: The r^2 for different test sets using GSGNN models. A) shows r^2 for the FDA test set. B) shows r^2 for the Huuskonen test set. C) and D) show r^2 for the Star and NonStar test sets, respectively. The horizontal axis indicates the maximum wavelet scale J. The atomic attributes are generated with 2D molecular structure, CGenFF force fields and the AC36 atom type classification scheme.

The r^2, RMSE and MUE of the best-performing GSGNN model for each test set are shown in Table 3.11.
In Ref. [92], the log P values for the FDA, Star and NonStar test sets are predicted using topology-based models (TopP-S) and compared to established methods such as ALOGPS, XLOGP3 and KowWIN.

Table 3.10: The log P prediction results using different methods on the FDA test set. The result from our model is shown in red. The TopP-S entry is the best performing method (GBDT-ESTD+ -2-AD) on the FDA test set from [92]. Table adapted from [92]. N = 406; 2 molecules are removed due to pre-processing failure.

Method         r^2     RMSE   MUE
TopP-S         0.935   0.51   0.24
ClassicalGSG   0.910   0.55   0.35
ALOGPS         0.908   0.60   0.42
XLOGP3         0.872   0.72   0.51
XLOGP3-AA      0.847   0.80   0.57
CLOGP          0.838   0.88   0.51
TOPKAT         0.815   0.88   0.56
ALOGP98        0.800   0.90   0.64
KowWIN         0.771   1.10   0.63
HINT           0.491   1.93   1.30

We find that our results do not outperform the TopP-S method, but we do achieve higher accuracy compared to other established methods such as ALOGPS, KowWIN and XLOGP3 for the FDA test set (Table 3.10), methods like XLOGP3 and ALOGP for the Star test set (Table 3.12), and the XLOGP3 method for the NonStar test set (Table 3.13).

Table 3.11: The log P prediction performance results for four independent test sets.

Dataset     r^2    RMSE   MUE
FDA         0.91   0.56   0.35
Star        0.90   0.49   0.35
NonStar     0.72   1.02   0.81
Huuskonen   0.93   0.37   0.23

Table 3.12: The log P prediction results using different methods on the Star test set. The result from our model is shown in red. The TopP-S entry is the best performing method (MT-ESTD+ -1-AD) on the Star test set from [92]. Table adapted from [92].
N = 223. The last three columns give the percentage of molecules within each error range.

Method          RMSE   <0.5   0.5-1   >1
AB/LogP         0.41   84     12      4
S+logP          0.45   76     22      3
TopP-S          0.49   77     16      7
ClassicalGSG    0.49   75     18      7
ACD/logP        0.50   75     17      7
CLOGP           0.52   74     20      6
VLOGP OPS       0.52   64     21      7
ALOGPS          0.53   71     23      6
XLOGP3          0.62   60     30      10
KowWIN          0.64   68     21      11
CSLogP          0.65   66     22      12
ALOGP           0.69   60     25      16
MolLogP         0.69   61     25      14
ALOGP98         0.70   61     26      13
OsirisP         0.71   59     26      16
VLOGP           0.72   65     22      14
TLOGP           0.74   67     16      13
ABSOLV          0.75   53     30      17
QikProp         0.77   53     30      17
QuantlogP       0.80   47     30      22
SLIPPER-2002    0.80   62     22      15
COSMOFrag       0.84   48     26      19
XLOGP2          0.87   57     22      20
QLOGP           0.96   48     26      25
VEGA            1.04   47     27      26
CLIP            1.05   41     25      30
LSER            1.07   44     26      30
MLOGP(Sim+)     1.26   38     30      33
NC+NHET         1.35   29     26      45
SPARC           1.36   45     22      32
HINTLOGP        1.80   34     22      44

Table 3.13: The log P prediction results using different methods on the NonStar test set. The result from our model is shown in red. The TopP-S entry is the best performing method (MT-ESTD+ -2-AD) on the NonStar test set from [92]. Table adapted from [92]. N = 43. The last three columns give the percentage of molecules within each error range.

Method          RMSE   <0.5   0.5-1   >1
ALOGPS          0.82   42     30      28
MiLogP          0.86   49     30      21
S+logP          0.87   40     35      26
XLOGP3          0.89   47     23      30
CLOGP           0.91   47     28      26
ALOGP           0.92   28     40      33
CSLogP          0.93   58     19      23
MolLogP         0.93   40     25      26
TopP-S          0.94   51     19      30
OsirisP         0.94   42     26      33
ALOGP98         1.00   30     37      33
AB/LogP         1.00   42     23      35
ACD/logP        1.00   44     32      23
ClassicalGSG    1.02   36     31      33
ABSOLV          1.02   49     28      23
KowWIN          1.05   40     30      30
VLOGP OPS       1.07   33     28      26
TLOGP           1.12   30     37      30
VLOGP           1.13   40     28      33
XLOGP2          1.16   35     23      42
SLIPPER-2002    1.16   35     23      42
QuantlogP       1.17   35     26      40
COSMOFrag       1.23   26     40      23
VEGA            1.24   28     30      42
QikProp         1.24   40     26      35
LSER            1.26   35     16      49
QLOGP           1.42   21     26      53
CLIP            1.54   33     9       49
MLOGP(Sim+)     1.56   26     28      47
SPARC           1.70   28     21      49
NC+NHET         1.71   19     16      65
HINTLOGP        2.72   30     5       65

3.3.2 Geometric scattering vs graph convolution

To compare the performance of the geometric scattering (GSGNN) models to the graph convolution (GCN) models, we trained models using these two methods with the two different sets of atomic attributes, "FF" and "OPENCHEM", as
described in Section 3.2.1. To rule out the influence of the training/test split, we trained 5 models for each method, where the data is randomly divided into training and test sets with an 80:20 ratio. Each of the five GSGNN models is trained using 5-fold cross-validation. The RMSE and r^2 are determined for each model and are shown in Table 3.14. Note that the atomic attributes for these GSGNN models are generated with 2D molecular structure, CGenFF force fields, and the AC36 atom type classification scheme. The GSG parameters that are used to generate the molecular features are from one of the best performing models (a maximum wavelet scale of J = 4 and all three of the zero-, first- and second-order scattering operators) from Figure 3.4. Our results suggest that the ClassicalGSG method, which is a GSGNN model trained on FF atomic attributes, has the most accurate prediction of log P values.

Table 3.14: The log P prediction results using different sets of features and models.

Method name    Atomic attributes   Model   r^2 (STD)      RMSE (STD)
ClassicalGSG   FF                  GSGNN   0.91 (0.003)   0.52 (0.009)
-              OPENCHEM            GSGNN   0.89 (0.003)   0.57 (0.005)
-              FF                  GCN     0.75 (0.091)   0.91 (0.18)
OpenChem       OPENCHEM            GCN     0.79 (0.052)   0.83 (0.122)

3.3.3 Features visualization

To examine the molecular features that are generated in the pipeline of the ClassicalGSG method, we visualize the features generated by the GSG method and by the last layer of the NN model using t-distributed stochastic neighbor embedding (t-SNE) [251] plots. t-SNE projects high-dimensional data into a low-dimensional space so it can be visualized readily. Figure 3.6 shows the GSG and NN features for molecules in the OpenChem test set visualized in a 2D space. Here each point shows a molecule in the OpenChem test set and is colored by its log P value. The initial dimension of the GSG features is 1716, while the NN features have a size of 400.
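A projection like the one in Figure 3.6 can be sketched with scikit-learn's TSNE. Here we use random stand-in features rather than the real GSG features, so only the shapes are meaningful:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
gsg_features = rng.normal(size=(100, 1716))  # stand-in for the 1716-dimensional GSG features

# project to 2D for plotting; each row becomes one point, colored by its log P value
emb = TSNE(n_components=2, perplexity=10.0, random_state=0).fit_transform(gsg_features)
```

In practice the same call would be repeated on the 400-dimensional activations of the last NN hidden layer to produce the second panel.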
These GSG features are generated from CGenFF atomic charges, 2D molecular structure, the AC36 type classification scheme and all three scattering moment operators with a maximum wavelet scale of 4. We can observe from Figure 3.6 that features extracted from the NN are more discriminative with respect to log P values. More specifically, in the last layer of the NN model, molecules with similar log P values tend to be near each other and form distinct clusters. In contrast, the GSG features are less discriminative, with nearby molecules having different log P values. To further verify this, the five nearest neighbors are determined for each point in the GSG and NN reduced feature spaces. Then, the differences between the actual log P value of each point and those of its five neighbors are calculated and averaged. This value is averaged over all the molecules and is denoted ⟨Δ log P⟩_N in the figure. The ⟨Δ log P⟩_N in the reduced GSG feature space is 0.81, while this value for the reduced NN feature space is only 0.17. The average log P difference between two random points is 2.00. This shows that molecules with similar log P values are closer in the NN feature space. This is noteworthy, as the GSG features are task independent and therefore are not adapted to any particular prediction task, including log P prediction. On the other hand, the NN model, which takes as input the GSG features, is a supervised model that is trained for the specific task of log P prediction. The results in Figure 3.6, in addition to the results in Tables 3.11 and 3.14, illustrate that the GSG features provide not only a translation, rotation, and permutation invariant representation of the molecules, but one that is also sufficiently rich so that, when combined with a downstream supervised NN, the resulting model provides accurate estimates of log P values for molecules.
Figure 3.6: The t-SNE plots of the GSG and NN features of the OpenChem test set molecules. Each point represents a molecule and is colored by its actual log P value. ⟨Δ log P⟩_N shows the mean log P difference calculated over the nearest neighbors in the t-SNE plot. A) The GSG features of size 1716 projected into 2-dimensional space. B) The NN features from the last hidden layer, of size 400, projected into 2-dimensional space.

3.3.4 Distinguishing features of failed molecules

To investigate the common characteristics of molecules that fail during the prediction of their log P value, we created molecular fingerprints, which have been used extensively in cheminformatics, QSAR/QSPR predictions, and drug design [92]. Here we employ ChemoPy [205] to create a set of constitutional fingerprints. To specify the failed molecules, we define a failure cutoff value (here, 0.5): if the difference between the actual and predicted log P values is larger than the cutoff, we consider the molecule a failed prediction. We determine the failed molecules from the ClassicalGSG models described in Section 3.3.2. These models are trained on features constructed with CGenFF force fields, 2D structure and the AC36 atom type classification method. The probability distributions of 30 constitutional fingerprints are calculated for all molecules in the OpenChem dataset and for the failed molecules in each of the 5 GSGNN models. Kullback-Leibler divergence (KL-divergence) [252] values are determined by comparing the probability distributions over all data with the distributions of the failed molecules from each model. The KL-divergence values are averaged over the five models and their standard error (STE) is shown in Table 3.15. We note that the PCX descriptors count the number of shortest paths of length X.
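The histogram-based KL-divergence comparison described above can be sketched as follows. This is a simplified version; the bin count and the pseudocount used to avoid log(0) are our own illustrative choices:

```python
import numpy as np

def kl_divergence(failed_values, all_values, bins=20):
    """Estimate KL(P_failed || P_all) from histograms of one fingerprint."""
    lo = min(failed_values.min(), all_values.min())
    hi = max(failed_values.max(), all_values.max())
    p, edges = np.histogram(failed_values, bins=bins, range=(lo, hi))
    q, _ = np.histogram(all_values, bins=edges)
    # normalize with a small pseudocount so empty bins do not give log(0)
    p = (p + 1e-10) / (p + 1e-10).sum()
    q = (q + 1e-10) / (q + 1e-10).sum()
    return float(np.sum(p * np.log(p / q)))
```

A fingerprint whose distribution among failed molecules matches the full-dataset distribution gives a value near zero; larger values flag attributes whose distribution shifts in the failed set.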
The attributes with the highest KL-divergence values are: PC counts with lengths of 2, 1, 3 and 4; molecular weight (Weight); the number of carbon atoms (ncarb); and the number of heavy atoms (nhev). In other words, the distributions of failed molecules are enriched in particular values of these attributes.

Table 3.15: The averaged KL-divergence between fingerprint distributions of all data versus failed molecules, averaged over 5 GSGNN models.

Fingerprint  Average_KL  STE    | Fingerprint  Average_KL  STE
PC2          0.048       0.005  | nhet         0.028       0.003
PC1          0.047       0.006  | noxy         0.026       0.003
PC3          0.046       0.007  | nrot         0.026       0.004
Weight       0.045       0.004  | nsulph       0.026       0.003
PC4          0.044       0.004  | ndb          0.017       0.002
ncarb        0.043       0.005  | ndonr        0.012       0.004
nhev         0.043       0.007  | ncof         0.011       0.001
nta          0.043       0.007  | AWeight      0.011       0.002
PC5          0.04        0.005  | nnitro       0.01        0.001
naro         0.04        0.003  | ncocl        0.009       0.002
PC6          0.039       0.005  | nhal         0.008       0.001
nsb          0.038       0.004  | nphos        0.007       0.003
naccr        0.036       0.002  | ncobr        0.006       0
nring        0.035       0.004  | ncoi         0.005       0.002
nhyd         0.031       0.003  | ntb          0.001       0

The probability distributions of the fingerprints with the highest KL-divergence values are shown in Figure 3.7. The distributions for the failed molecules largely follow the complete dataset distribution, but are enriched towards higher values, suggesting that the log 𝑃 predictions are more likely to fail for larger molecules. Interestingly, the distributions for attributes that count rare element types (e.g. the number of phosphorus atoms (nphos), the number of iodine atoms (ncoi) and the number of bromine atoms (ncobr)) do not show large KL-divergence values, although this might have been expected given that these elements are poorly represented in the training set. The atom element types and their counts in our dataset are shown in Figure S1.

Figure 3.7: Probability distributions of molecular fingerprints. The histograms show the distributions of fingerprints of all data and of the failed molecules of the 5 GSGNN models.
The distribution of all data is shown as a thick black line. A) The number of shortest paths of length 2, B) the atomic weight, C) the number of carbon atoms (ncarb) and D) the number of heavy atoms. 3.3.5 Uncertainty estimations for the SAMPL7 predictions Upon submission of our results to the SAMPL7 organizers, our predictions for the uncertainty in log 𝑃 were simply the standard errors of the mean of the predictions using the four training sets. Here we first calculate more accurate estimates of the prediction uncertainty using prediction intervals (PIs) obtained separately for each of the four predictors. PIs define the range of values in which predictions for new data are expected to lie with a defined probability. For example, a 90% PI in the range [𝑎, 𝑏] indicates that a future prediction will fall into the range [𝑎, 𝑏] 90% of the time. There are a variety of methods for calculating PIs for NNs, such as bootstrapping [253, 254], Mean-Variance Estimation (MVE) [255], Delta, and Bayesian methods. In this work, we utilize a parametric approach similar to the MVE method. However, unlike MVE, our method constructs PIs from the Mean Absolute Errors (MAE) between the predicted and observed values of similar inputs rather than all of the input data. The PIs are determined by finding a MAE value (𝜖90) below which 90% of the MAE values fall. Thus for a future prediction 𝑦̂ the PI is defined as [𝑦̂ − 𝜖90, 𝑦̂ + 𝜖90]. To construct the PIs, we make four sets of predictions for the S7_TEST dataset using ClassicalGSG models trained on the DB1, DB2, DB3, or DB4 training sets. As mentioned in Section 3.2.1, S7_TEST is a subset of the external test sets containing molecules similar to the SAMPL7 molecules. For training these models, we used the parameters of the best ClassicalGSG models obtained in Section 3.3.1.
More specifically, atomic attributes from CGenFF parameters, 2D molecular structure information and AC36 atomic types were fed to GSG, with a maximum wavelet scale (𝐽) of 4 and all scattering operators (zeroth, first, and second order), to generate the molecular features. After making predictions using these models, the MAE values are calculated for S7_TEST for each set of predictions. We then binned them in a histogram with 20 bins to determine the cumulative probability distributions shown in Figure 3.8.

Figure 3.8: Cumulative distribution of the MAE of molecules in the S7_TEST set. The solid blue line shows the cumulative distribution for each set of predictions. The dashed red line represents the probability of 90%. Panels A through D show MAEs using models trained on DB1 through DB4, respectively.

Table 3.16: The PIs and coverage range for the SAMPL7 molecules using the four ClassicalGSG methods. 𝑦̂ is the predicted log 𝑃.

Model name        PIs        Coverage of PIs
ClassicalGSG_DB1  𝑦̂ ± 0.94   72.72%
ClassicalGSG_DB2  𝑦̂ ± 0.88   90.90%
ClassicalGSG_DB3  𝑦̂ ± 0.88   68.18%
ClassicalGSG_DB4  𝑦̂ ± 0.98   86.36%

The four sets of predictions made for the SAMPL7 target molecules using each model, and their 90% PIs, are shown in Figure 3.9. We determined the coverage of the PIs for the experimental log 𝑃 values for each model and the results are shown in Table 3.16. The ClassicalGSG_DB2 method has the highest coverage at 90.90%, as expected, although the other predictors fall below this threshold.
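The PI construction and coverage check described above reduce to an empirical quantile and a simple interval test; a minimal sketch with illustrative numbers (not the actual S7_TEST or SAMPL7 values):

```python
import numpy as np

def pi_half_width(abs_errors, coverage=0.90):
    """The error value eps below which `coverage` of the MAE values
    on similar molecules fall (an empirical quantile)."""
    return float(np.quantile(abs_errors, coverage))

def pi_coverage(y_true, y_pred, eps):
    """Fraction of experimental values inside [y_pred - eps, y_pred + eps]."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= eps))

# illustrative MAE values on the similar-molecule set
errors = np.linspace(0.1, 1.0, 10)
eps90 = pi_half_width(errors)      # 90% PI half-width

# illustrative experimental vs. predicted logP values
y_true = np.array([1.2, 2.3, 0.8, 1.9])
y_pred = np.array([1.0, 2.0, 1.5, 2.1])
cov = pi_coverage(y_true, y_pred, eps90)
```

A coverage below the nominal 90% indicates that the new molecules were harder to predict than the set used to calibrate the interval.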
This could indicate that the SAMPL7 log 𝑃 values were more difficult to predict than those of the similar S7_TEST molecules.

Figure 3.9: Prediction intervals of the log 𝑃 predictions for the SAMPL7 molecules. The experimental log 𝑃 values are shown as red circles in a scatter plot. The predictions are shown as a red line, and the wide orange band shows the prediction intervals (PIs). Panels A through D show predictions from models trained on DB1 through DB4, respectively. In all cases, data is sorted according to the predicted log 𝑃 values.

Table 3.17: The log 𝑃 prediction results for the SAMPL7 molecules.

Model name        RMSE  𝑟²    MAE   Rank among all verified  Rank among ranked
ClassicalGSG_DB2  0.55  0.51  0.44  1                        –
ClassicalGSG_DB4  0.65  0.50  0.56  3                        –
ClassicalGSG_DB1  0.76  0.28  0.62  7                        –
ClassicalGSG_DB3  0.77  0.51  0.62  9                        5

3.3.6 Predictions for the SAMPL7 log 𝑃 challenge Four sets of blind predictions were generated using different training sets, as described above. By the rules of the SAMPL7 challenge, only one of these predictions could be used as a “ranked” submission. To determine which set to use we examined the performance of each predictor on the FDA [184] and Huuskonen [190] test sets and chose the model with the lowest RMSE: ClassicalGSG-DB3. We note that there was an error in our analysis code at this time; we later found that ClassicalGSG-DB1 was the model with the lowest RMSE. The RMSE and 𝑟² values of the prediction sets using these four models were determined after the SAMPL7 challenge (Table 3.17). Note that the best model in terms of RMSE is the one trained on the DB2 training set, and the worst performing model was trained on DB3. The model we intended to select, DB1, is the second-worst performing model.
In retrospect, this shows that the FDA and Huuskonen test sets were not good proxies for the SAMPL7 molecules. We show comparisons of our predictions with the experimental results, along with linear fit lines for each model, in Figure 3.10. Note that the best fitting coefficients also correspond to the model trained on the DB2 training set. To identify the prediction outliers, we show the log 𝑃 predictions from our methods for the SAMPL7 target molecules in Figure 3.11. We find the largest systematic errors for the compounds SM36, SM40, SM41, SM42, SM43 and SM45. As shown in Figure 6A of Ref. [212], these molecules were found to have some of the highest prediction errors across all submissions. For SM36, SM41, SM42 and SM43, ClassicalGSG consistently over-predicted the experimentally determined log 𝑃, which was also true for the other well-performing methods (Figure 6D of Ref. [212]). Molecules SM40 and SM45 were under-predicted compared to experiment, which was also in line with other well-performing methods, although the trend is less clear.

Figure 3.10: The best fit lines for the prediction sets. The experimental versus predicted values are shown as red circles in a scatter plot. The best fit line is shown in orange, and the dashed blue line shows 𝑦 = 𝑥. The fitted lines are 𝑦 = 0.46𝑥 + 0.96 (DB1), 𝑦 = 0.71𝑥 + 0.52 (DB2), 𝑦 = 0.47𝑥 + 0.82 (DB3) and 𝑦 = 0.62𝑥 + 0.80 (DB4). A) predictions using DB1, B) predictions using DB2, C) predictions using DB3, and D) predictions using the DB4 training set.
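Linear fits like those in Figure 3.10 can be reproduced with an ordinary least-squares fit; a sketch with placeholder data (the real values are the SAMPL7 predictions and experiments):

```python
import numpy as np

# placeholder experimental and predicted logP values
experimental = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
predicted = np.array([0.8, 1.1, 1.7, 1.9, 2.6, 2.9])

# slope and intercept of the best-fit line y = a*x + b
a, b = np.polyfit(experimental, predicted, deg=1)

# r^2 of the fit
residuals = predicted - (a * experimental + b)
ss_res = np.sum(residuals**2)
ss_tot = np.sum((predicted - predicted.mean())**2)
r2 = 1.0 - ss_res / ss_tot
```

A slope near 1 and intercept near 0 indicate predictions that track experiment without systematic bias.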
3.3.7 Performance of open-source force fields Although there is an online server for generating CGenFF parameter files for a given molecule, it is still challenging to use CGenFF for high-throughput applications as it is not open-source. Hence, we decided to assess the performance of the open-source force field tools implemented by OpenBabel [256, 257], which is open-source and free to use on large databases of molecules. We utilized GAFF [232], UFF [233], MMFF94 [234, 235] and Ghemical [236] force field parameters to generate atomic attributes including atom types, partial charges and the two Lennard-Jones interaction parameters (𝜖 and 𝑟). As above, we applied the GSG method with a maximum wavelet scale of 4, using all scattering operators, to generate molecular features from the atomic attributes. We used the DB2 training set to train 5 log 𝑃 predictor models for each force field. Each of these models is trained using a 5-fold cross-validation approach. These models are tested on the SAMPL7 molecules and the RMSE and 𝑟² values are calculated for each set of predictions. We took the average values over the 5 runs for each force field and the results are shown in Figure 3.12. This figure shows that the ClassicalGSG models built from MMFF94 force field parameters achieve the highest 𝑟² and lowest RMSE values, which are on par with the CGenFF results submitted to the challenge.

Figure 3.11: The log 𝑃 predictions from our submissions to the SAMPL7 challenge. The orange line shows the experimental values. The ClassicalGSG predictions are shown as circles (DB1: blue, DB2: orange, DB3: green, DB4: red). The thick orange area shows the MAE interval of 0.44, which is the lowest MAE of our submitted predictions (ClassicalGSG-DB2). Molecules are labeled with their molecule ID from SAMPL7 [212].
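The 5-fold cross-validation and averaging used for each force field above can be sketched generically. This uses a stand-in linear least-squares model on synthetic data; the actual models are the ClassicalGSG neural networks:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                             # stand-in molecular features
y = X @ rng.normal(size=10) + 0.1 * rng.normal(size=200)   # stand-in logP values

def kfold_metrics(X, y, k=5):
    """Train a linear least-squares model on each fold and return the
    RMSE and r^2 averaged over the k validation folds."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, k)
    rmses, r2s = [], []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        err = X[val] @ w - y[val]
        rmses.append(np.sqrt(np.mean(err**2)))
        r2s.append(1.0 - np.sum(err**2) / np.sum((y[val] - y[val].mean())**2))
    return float(np.mean(rmses)), float(np.mean(r2s))

avg_rmse, avg_r2 = kfold_metrics(X, y)
```

Averaging over folds reduces the sensitivity of the reported RMSE and 𝑟² to any one train/validation split.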
Additionally, we studied the performance of the MMFF94 ClassicalGSG models on independent external test sets such as FDA [184], Huuskonen [190, 239], Star [194], NonStar [194] and the compounds from the SAMPL6 log 𝑃 prediction challenge [211]. For the purpose of a fair comparison, we used the same 10722 molecules from the OpenChem dataset as utilized in Section 3.3.1. All combinations of a set of maximum wavelet scales (𝐽) and sets of scattering operators are used as GSG parameters to train 20 ClassicalGSG models, as indicated in Table 3.18. The atomic attributes were generated from MMFF94 atomic parameters using all atomic types from MMFF94 (ACall). The parameters corresponding to the best model for each test set, along with their performance results, are shown in Table 3.19. The ClassicalGSG models based on MMFF94 achieve better performance than the CGenFF-based models for all test sets (Table 3.11).

Figure 3.12: Results of ClassicalGSG models trained using open-source force field parameters. Error bars are computed over five independently-trained models. These models are trained using the 2D structure information and all the scattering moments with a maximum wavelet scale (𝐽) of 4. For each set of ClassicalGSG models trained using these force field parameters we show A) the average RMSE, and B) the average 𝑟².

Table 3.18: Sets of parameters used to evaluate MMFF94 ClassicalGSG models on external test sets. Sets in square brackets denote all the GSG parameter values used for generating the molecular features. For scattering operators, “z” denotes the zeroth order operator (Equation 3.1), “f” is first order (Equation 3.2), and “s” is second order (Equation 3.3).

Parameter               Values
Max. wavelet scale (𝐽)  [4, 5, 6, 7, 8]
Scattering operators    [(z, f), (z, s), (f, s), (z, f, s)]
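The 20 model configurations in Table 3.18 are simply the Cartesian product of the five wavelet scales and the four scattering-operator sets:

```python
from itertools import product

wavelet_scales = [4, 5, 6, 7, 8]
operator_sets = [("z", "f"), ("z", "s"), ("f", "s"), ("z", "f", "s")]

# each (J, operators) pair defines one ClassicalGSG model configuration
configs = list(product(wavelet_scales, operator_sets))
```

Each configuration is trained once, yielding the 5 × 4 = 20 models evaluated on the external test sets.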
Comparisons between the log 𝑃 prediction results for the FDA, Star, and NonStar test sets and those from other log 𝑃 predictor methods are shown in Tables 3.20, 3.21 and 3.22.

Table 3.19: The log 𝑃 prediction results from MMFF94 force field parameters for the external test sets.

Max. wavelet scale  Scattering operators  Test set name  RMSE  𝑟²    MAE
7                   (f, s)                FDA            0.53  0.93  0.27
7                   (z, f, s)             Star           0.44  0.93  0.29
5                   (z, f)                NonStar        0.74  0.89  0.59
7                   (z, f)                Huuskonen      0.35  0.94  0.18
5                   (z, f, s)             SAMPL6         0.29  0.87  0.23

Table 3.20: The log 𝑃 prediction results from different log 𝑃 methods for the FDA test set. TopP-S is the best performing method (GBDT-ESTD+ -2-AD) on the FDA test set from [92]. Table from [92]. FDA test set, N=406 (2 molecules are removed due to pre-processing failure).

Method               𝑟²     RMSE   MAE
TopP-S               0.935  0.51   0.24
MMFF94-ClassicalGSG  0.929  0.535  0.27
CGenFF-ClassicalGSG  0.910  0.55   0.35
ALOGPS               0.908  0.6    0.42
XLOGP3               0.872  0.72   0.51
XLOGP3-AA            0.847  0.8    0.57
CLOGP                0.838  0.88   0.51
TOPKAT               0.815  0.88   0.56
ALOGP98              0.800  0.90   0.64
KowWIN               0.771  1.10   0.63
HINT                 0.491  1.93   1.3

Table 3.21: The log 𝑃 prediction results from different log 𝑃 methods for the Star test set. TopP-S is the best performing method (MT-ESTD+ -1-AD) on the Star test set from [92]. Table from [92].
Star test set, N=223. Columns give the RMSE and the percentage of molecules with absolute error ≤0.5, between 0.5 and 1, and >1 log units.

Method               RMSE  ≤0.5  0.5–1  >1
AB/LogP              0.41  84    12     4
MMFF94-ClassicalGSG  0.44  81    14     5
S+logP               0.45  76    22     3
TopP-S               0.49  77    16     7
CGenFF-ClassicalGSG  0.49  75    18     7
ACD/logP             0.5   75    17     7
CLOGP                0.52  74    20     6
VLOGP OPS            0.52  64    21     7
ALOGPS               0.53  71    23     6
XLOGP3               0.62  60    30     10
KowWIN               0.64  68    21     11
CSLogP               0.65  66    22     12
ALOGP                0.69  60    25     16
MolLogP              0.69  61    25     14
ALOGP98              0.7   61    26     13
OsirisP              0.71  59    26     16
VLOGP                0.72  65    22     14
TLOGP                0.74  67    16     13
ABSOLV               0.75  53    30     17
QikProp              0.77  53    30     17
QuantlogP            0.80  47    30     22
SLIPPER-2002         0.80  62    22     15
COSMOFrag            0.84  48    26     19
XLOGP2               0.87  57    22     20
QLOGP                0.96  48    26     25
VEGA                 1.04  47    27     26
CLIP                 1.05  41    25     30
LSER                 1.07  44    26     30
MLOGP(Sim+)          1.26  38    30     33
NC+NHET              1.35  29    26     45
SPARC                1.36  45    22     32
HINTLOGP             1.80  34    22     44

Table 3.22: The log 𝑃 prediction results from different log 𝑃 methods for the NonStar test set. TopP-S is the best performing method (MT-ESTD+ -2-AD) on the NonStar test set from [92]. Table from [92]. NonStar test set, N=43. Columns are as in Table 3.21.

Method               RMSE  ≤0.5  0.5–1  >1
MMFF94-ClassicalGSG  0.74  49    35     16
ALOGPS               0.82  42    30     28
MiLogP               0.86  49    30     21
S+logP               0.87  40    35     26
XLOGP3               0.89  47    23     30
CLOGP                0.91  47    28     26
ALOGP                0.92  28    40     33
CSLogP               0.93  58    19     23
MolLogP              0.93  40    25     26
TopP-S               0.94  51    19     30
OsirisP              0.94  42    26     33
ALOGP98              1.00  30    37     33
AB/LogP              1.00  42    23     35
ACD/logP             1.00  44    32     23
CGenFF-ClassicalGSG  1.02  36    31     33
ABSOLV               1.02  49    28     23
KowWIN               1.05  40    30     30
VLOGP OPS            1.07  33    28     26
TLOGP                1.12  30    37     30
VLOGP                1.13  40    28     33
XLOGP2               1.16  35    23     42
SLIPPER-2002         1.16  35    23     42
QuantlogP            1.17  35    26     40
COSMOFrag            1.23  26    40     23
VEGA                 1.24  28    30     42
QikProp              1.24  40    26     35
LSER                 1.26  35    16     49
QLOGP                1.42  21    26     53
CLIP                 1.54  33    9      49
MLOGP(Sim+)          1.56  26    28     47
SPARC                1.70  28    21     49
NC+NHET              1.71  19    16     65
HINTLOGP             2.72  30    5      65

As these tables show, MMFF94 ClassicalGSG achieves the best results to date for the NonStar test set and the second-best results for the FDA and Star test sets.
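The error-range percentages reported in Tables 3.21 and 3.22 can be computed from the absolute errors, assuming the bins are |error| ≤ 0.5, 0.5–1, and > 1 log units:

```python
import numpy as np

def error_range_percentages(y_true, y_pred):
    """Percentage of molecules with |error| <= 0.5, in (0.5, 1], and > 1."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    n = len(err)
    small = 100.0 * np.sum(err <= 0.5) / n
    mid = 100.0 * np.sum((err > 0.5) & (err <= 1.0)) / n
    large = 100.0 * np.sum(err > 1.0) / n
    return small, mid, large
```

The three percentages sum to 100 up to rounding, which matches the small deviations visible in the published tables.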
Moreover, our method shows a significant improvement in the prediction of log 𝑃 values for the SAMPL6 molecules, with an RMSE in the range [0.29, 0.52] and a median of 0.42 over the 20 models. This compares favorably to the best performing model in the SAMPL6 blind challenge (Cosmotherm [219]), which had an RMSE of 0.35. 3.4 Discussion and Conclusions In this chapter, we introduced a method called “ClassicalGSG” for predicting the partition coefficient. Our method uses atomic attributes that are usually utilized as parameters for classical MD simulations. These parameters include partial charges, the Lennard-Jones well depth, the Lennard-Jones radius and the atomic type. The pipeline for generating these parameters includes: creating a 3D structure from SMILES, creating PDB- and Mol2-formatted files, and generating atomic parameters using either the Antechamber (GAFF2) or CGenFF tools. We note that, in our implementation, the Antechamber tool is about 200 times slower than CGenFF, requiring a couple of days to process 10,000 molecules. We employ the geometric scattering for graphs (GSG) method to transform the atomic attributes into molecular features that satisfy re-indexation symmetry. Our results indicate that GSG is powerful enough to capture universal molecular features, as the average log 𝑃 difference for all pairs of adjacent molecules in a t-SNE plot (⟨Δ log 𝑃⟩𝑁) is 0.81, which is low compared to the random-pair log 𝑃 difference (2.00). As these features are general to the molecule and not specific to log 𝑃, this suggests that they can be used in multi-task NNs to predict other molecular properties, such as solubility (log 𝑆), melting point, pKa, and intestinal permeability (e.g. Caco2) [258]. Our results show that employing the 2D molecular structure in ClassicalGSG yields more accurate log 𝑃 predictions than 3D structures, confirming conclusions reached previously [259, 260].
This could be due to difficulties in generating appropriate 3D structures, or because a single 3D structure is insufficient to capture the high-probability conformations of a given molecule. Additionally, our 3D adjacency matrix did not explicitly distinguish between bonded and non-bonded interactions, where the former are much more important for determining molecular properties. The results reported here for four independent external test sets show that our ClassicalGSG log 𝑃 method is generalizable to new molecules. However, we do not expect this model to perform well for molecules with new elements or functional groups that are not covered in the training set. Like other empirical methods, we expect the accuracy will improve as the availability of training data grows. Furthermore, we described the curation of four training sets that we utilized to train ClassicalGSG log 𝑃 predictor models for the SAMPL7 physical property blind challenge. The molecular features originally submitted for these models were created from CGenFF force field parameters. Our most accurate set of predictions – with an RMSE of 0.55 and MAE of 0.44 – was made by the ClassicalGSG-DB2 model, which had the lowest RMSE among the 36 submitted sets of predictions based on the non-ranked predictions analysis. Our ranked predictions were from ClassicalGSG-DB3 – with an RMSE of 0.77 and MAE of 0.62 – which were ranked in 5th place. To further compare ClassicalGSG with other predictors, we also made post-hoc predictions for the previous SAMPL6 challenge molecules. We trained 20 predictors using different parameters and obtained some estimates that had a significantly lower RMSE (0.29) than the best performing model at the time (Cosmotherm [219], 0.35). However, this parameter selection had the benefit of hindsight, so a more meaningful comparison is with our median RMSE of 0.42. We note that this RMSE would have placed fourth among the submissions to SAMPL6 [211].
Here we trained several ClassicalGSG models on molecular features generated from the atomic attributes of open-source force fields. We find that the MMFF94 ClassicalGSG models are slightly more accurate than the CGenFF ClassicalGSG models. Applying the MMFF94 log 𝑃 predictor model trained on the OpenChem dataset to external test sets obtains excellent results throughout, at times achieving the best results acquired to date. An added benefit is that the MMFF94 ClassicalGSG models provide an end-to-end framework for predicting log 𝑃 values using only SMILES strings as input, and do not require any auxiliary stream files, unlike the CGenFF models. It might be counter-intuitive that a force field developed in the 1990s [234, 235] would outperform a modern force field that is still being actively developed [228, 229]. We note here that the pertinent features of force fields for predicting log 𝑃 values are very different from those needed to conduct physically-meaningful molecular dynamics simulations. We suspect that the leading benefit of MMFF94 is its broad coverage of atom types describing the chemical features of small organic molecules that are relevant to log 𝑃. Differences in one-hot encodings of atom type would likely have a much stronger impact than improved predictions of partial charges, for example. Our code is publicly available on GitHub at https://github.com/ADicksonLab/ClassicalGSG, and our training and test sets are available at https://doi.org/10.5281/zenodo.4531015 and, in SDF format, at https://doi.org/10.5281/zenodo.4560967. The ClassicalGSG repository contains two pre-trained log 𝑃 predictors, one using MMFF94 and another using CGenFF atomic attributes. Once the predictor is trained, values can be predicted extremely quickly: predictions for a set of 1000 molecules can be made in about 150 seconds on an Intel i7 processor, without parallelization. The code provides modules for extracting GSG features and training NN models on new datasets.
As mentioned, the ClassicalGSG method is not specific to log 𝑃 and could predict other molecular properties as well. We emphasize that progress in the field of molecular property prediction can be greatly accelerated by the free sharing of molecular property datasets. Efforts such as OpenChem [55] that support the sharing of methods and datasets will be useful catalysts for methods development. Publicly available data sources for properties such as intestinal permeability, pKa values, intrinsic clearance rates (CLint) and serum protein binding fractions would similarly be great catalysts for the development of accurate predictors of pharmacokinetic effects. CHAPTER 4 MODELING COMPOUNDS IN THE BINDING SITE WITH THE FLEXIBLE TOPOLOGY METHOD 4.1 Introduction One of the common steps in early-stage drug discovery is High Throughput Screening (HTS), where millions of drug-like compounds are examined against a target protein and compounds with high affinity are selected [106]. HTS is carried out in a wet lab via experimental methods, where the binding of up to millions of compounds can be examined at significant cost [5, 107, 108]. To increase the efficiency of this process, in silico screening tools referred to as Virtual Screening (VS) emerged in the 1980s [5]; these can now evaluate large chemical databases with billions of compounds [107, 109, 110, 111]. VS can either rely on the 3D structure information of the target molecule – Structure-Based Virtual Screening (SBVS) [116, 117, 118] – or on known bioactive ligands of the target – Ligand-Based Virtual Screening (LBVS) [112, 113, 114, 115]. LBVS uses chemical similarity search methods such as QSAR, similarity search [261], pharmacophore matching [262] or 3D shape matching [263] to find novel ligands that show biological activity with the target of interest.
SBVS techniques require the 3D structure of the target protein, which can be determined by X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, or homology modeling. Drug-like molecules are then docked into the binding site of the target protein [11] and scored based on how well they bind to the target using empirical functions [264]. The highest-scoring compounds are selected as hit compounds and passed to the next step for further computational or experimental evaluation. In SBVS, compounds from large databases such as ZINC [119], ChEMBL [120], PubChem [121, 122] and NCI [123] are screened, which requires heavy computational resources and time. Conventional SBVS methods consider a single fixed snapshot of the target protein structure during docking to reduce the computational burden, which can lead to inaccurate predictions of both ligand pose and binding affinity [124, 125, 126, 127]. A variety of methods have been developed over the years to alleviate the deficiencies of single-structure docking [127, 129, 130, 131, 132, 133, 134, 265]. There are more than 60 docking programs [266], and some of them, like AutoDock [267], DOCK [268], AutoDock Vina [269], GalaxyDock [270], and FITTED [271], can incorporate protein flexibility into docking, although this is typically limited to sidechain motions in the binding pocket. In addition, there are approaches that incorporate protein flexibility using a larger set of protein conformations that are generated either beforehand (“ensemble docking”) [128, 135, 134] or on-the-fly (“induced fit docking”) [133, 272, 273]. Boltzmann docking [274] and Implicit Ligand Theory [275] are also used to dock ligands to an ensemble of protein conformations; however, these ensembles are generated from the apo structure and would not easily capture pocket conformations that are induced by the presence of a ligand.
While these algorithms can be accelerated by parallel computing approaches such as the message-passing interface (MPI) [276, 277, 278, 279], graphics processing units (GPUs) [280, 281, 282, 283, 284] or heterogeneous MPI-GPU methods [285, 275], their higher computational expense makes them unsuitable for virtual screening of large ligand databases. Developing new techniques for accelerating docking and for considering ligand-induced conformational changes in the receptor are two main active areas of research in SBVS. Here, we present an ML-based method that can be used for SBVS and that uses molecular dynamics (MD) to model ligand-induced conformational changes in the receptor, as well as the effects of water molecules on ligand-receptor interactions. We refer to this method as “Flexible Topology”, since it allows the topology (i.e. the chemical structure) of the ligand to change in response to forces from the surrounding protein and water atoms. This algorithm has the potential to achieve efficiency gains because it combines screening (chemical space search) with the sampling of ligand poses (conformational space search) through MD simulations. In Flexible Topology (FT), we run MD simulations with a set of randomly-initialized atoms called “Ghost Particles” (GP), where each ghost particle has a vector of attributes. These attributes are defined as dynamical variables that – just like the atomic positions – experience forces given by the negative partial derivatives of the energy function. The coordinates and attributes of the ghost particles thus change over time and gradually optimize toward different drug-like molecules during the MD simulation. The identity of each ghost particle is defined by a set of parameters from classical molecular force fields: the partial charge (𝑞) and two Lennard-Jones interaction parameters, the radius (𝜎) and the well-depth (𝜖).
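The idea that ghost-particle attributes, like positions, evolve under forces given by negative gradients of an energy can be illustrated numerically. This is a toy sketch with a stand-in quadratic energy, not the actual FT potential described later in this chapter:

```python
import numpy as np

rng = np.random.default_rng(1)
positions = rng.normal(size=(4, 3))    # ghost particle coordinates
attributes = rng.random(size=(4, 4))   # (q, sigma, epsilon, lambda) per particle

def toy_energy(x, a):
    """Stand-in energy coupling positions and attributes; the real FT
    potential is the ML-based loss described in Section 4.2.4."""
    return np.sum(x**2) + np.sum((a - 0.5)**2)

def grad(f, arr, h=1e-6):
    """Central-difference gradient of f() with respect to arr (in place)."""
    g = np.zeros_like(arr)
    for i in np.ndindex(arr.shape):
        arr[i] += h; fp = f()
        arr[i] -= 2 * h; fm = f()
        arr[i] += h                    # restore original value
        g[i] = (fp - fm) / (2 * h)
    return g

# forces on both positions AND attributes are negative energy gradients
force_pos = -grad(lambda: toy_energy(positions, attributes), positions)
force_attr = -grad(lambda: toy_energy(positions, attributes), attributes)
```

In an actual simulation these gradients would come from automatic differentiation of the MLForce loss, and the attribute forces would drive the particles toward valid chemical identities.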
In addition, each particle has a parameter called 𝜆, restricted to [0, 1], that determines its presence. The ghost particles assemble into target ligands by running an ML-driven dynamical trajectory in which the assembly of the ghost particles is informed by their environment: the binding site residues and explicit solvent. Possible applications of this method include the prediction of bound poses, the calculation of relative binding free energies, and fast virtual screening of large ligand sets. The FT method combines the searches of conformational and chemical space. To illustrate the benefits of this approach, it is helpful to consider the example of a protein searching its conformational landscape to find its folded state. Levinthal’s paradox observes that there is an enormous number of possible protein conformations (on the order of 10^300) [286]: if all the conformations of a protein were sampled one by one at a rate of 1 conformation per millisecond, it would take longer than the age of the universe to find the folded conformation. However, in nature, protein folding does not require sampling all of the high-energy conformations; the protein is automatically drawn towards low-energy conformations due to the funnel-shaped nature of the energy landscape [287]. The same reasoning applies to the FT approach, as its search space will also be funnel-shaped, with the lowest-energy target molecules at its bottom. In this way, FT aims to avoid sampling all high-energy ligands and conformations, and to be gradually drawn towards the best atomic identities and structures with each simulation step. At the end of a simulation we aim to achieve the optimal binding pose and the correct chemical types. In this chapter, we first describe the framework of the method and the details of its implementation. We then run assembly simulations with a set of target ligands to study its performance and optimize the energy landscape of the FT energy function.
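As a sanity check on the Levinthal estimate quoted above, the arithmetic is straightforward (the age of the universe in seconds is an approximate figure):

```python
# Levinthal's estimate: sampling 10^300 conformations at 1 per millisecond
conformations = 10**300
rate_per_second = 1000.0               # 1 conformation per millisecond
seconds_needed = conformations / rate_per_second

age_of_universe_s = 4.35e17            # ~13.8 billion years in seconds
ratio = seconds_needed / age_of_universe_s   # vastly greater than 1
```

The required time exceeds the age of the universe by hundreds of orders of magnitude, which is the point of the paradox.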
Further, we employ FT to run assembly simulations in the binding pocket of a bromodomain. Finally, we conclude with a discussion of the challenges and future work. 4.2 Methods 4.2.1 Framework of Flexible Topology The goal of the FT method is to model a set of atoms that can continuously transform between different molecules. Ultimately we envision this as an efficient way to identify candidate drug compounds for a drug target, while taking into account full flexibility in the receptor and surrounding solvent. To implement this method, we use MD simulations that are coupled with an ML-based external force (MLForce) that slowly nudges a set of ghost particles to become a drug-like molecule (Fig. 4.1). MLForce consists of three components: a molecular representation model, an assignment algorithm and a loss function. The molecular representation model defines a set of symmetry-invariant atomic environment vectors (AEVs) using functions introduced by Behler and Parrinello [94]. The AEVs, in combination with the ghost particle attributes, determine the set of features for each atom that describe both its environment and its identity. We discuss the features in Sec. 4.2.2 and the attributes in Sec. 4.2.3. The second component is an algorithm that solves an assignment problem to determine an optimal mapping between the ghost particles and the target ligand atoms. In our implementation, we employ the Hungarian algorithm [288] to find a minimum-cost assignment between the ghost particles and the target ligand atoms. The cost matrix is determined by calculating the Euclidean norm between the ghost particle features (𝐹𝐺𝑃) and the target molecule features (𝐹𝑇). While the brute-force solution to the assignment problem is 𝑂(𝑛!), the Hungarian algorithm has 𝑂(𝑛³) time complexity. We use a C++ implementation by Cong Ma (the original code was implemented in Matlab by Buehren): https://github.com/mcximing/hungarian-algorithm-cpp.git.
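The assignment step can be illustrated for a handful of particles: build the cost matrix from Euclidean distances between feature vectors, then find the minimum-cost mapping. For clarity this sketch uses brute force over permutations (the 𝑂(𝑛!) approach); the actual implementation uses the 𝑂(𝑛³) Hungarian algorithm:

```python
import numpy as np
from itertools import permutations

def min_cost_assignment(F_gp, F_target):
    """Map each ghost particle to a target atom, minimizing the total
    Euclidean distance between their feature vectors."""
    n = len(F_gp)
    # cost[i, j] = ||F_gp[i] - F_target[j]||
    cost = np.linalg.norm(F_gp[:, None, :] - F_target[None, :, :], axis=-1)
    best_perm, best_cost = None, np.inf
    for perm in permutations(range(n)):    # O(n!) -- fine only for tiny n
        c = cost[np.arange(n), perm].sum()
        if c < best_cost:
            best_perm, best_cost = perm, c
    return list(best_perm), float(best_cost)

# three ghost particles vs. three target atoms in a toy 2D feature space
F_gp = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
F_t = np.array([[0.0, 1.1], [0.1, 0.0], [1.0, 0.1]])
mapping, total = min_cost_assignment(F_gp, F_t)
```

Here each ghost particle is matched to the target atom whose features it most closely resembles, which is exactly the mapping 𝑀 used in the loss function.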
The output of this algorithm is the mapped set of features (M(F_GP)). Finally, the loss function L is used to define the restraint potential that guides the ghost particles to adopt their target configurations. Gradients of L yield the external forces on both the positions and attributes of the ghost particles. These forces are applied to the ghost particles to change their positions and attributes at each dynamics step, gradually transforming them into a drug-like molecule in an orientation that fits the protein of interest. This loss function is described in Sec. 4.2.4, followed by implementation details in Sec. 4.2.5.

Figure 4.1: The framework of the Flexible Topology method. The gray and blue spheres show the ghost particles. MD simulations start with an initial structure of the protein of interest and randomly initialized ghost particles. The MLForce plugin computes external forces on the ghost particle positions and attributes using a loss function, L. X_GP denotes the ghost particle positions, A_GP represents the atomic attributes of the ghost particles, and M is the mapping function. F_GP and F_T denote the ghost particle and target ligand features, respectively. The final output shows the ghost particles assembled into a drug-like molecule in the binding pocket of the protein.

4.2.2 Symmetry-invariant Molecular Features

The atomic environment vectors that guide the assembly of the ghost particles are mostly comprised of a set of customized Behler-Parrinello symmetry functions [94]. For each particle, these functions capture the radial and angular information of the other ghost particles in its surrounding environment. These features are invariant to translation and rotation operations and are differentiable with respect to atomic positions.
The local radial environment of an atom is defined by a sum of Gaussian terms multiplied by a cutoff function:

    G_i^R = \sum_{j \ne i} e^{-\eta_R (R_{ij} - R_s^R)^2} f_c(R_{ij})    (4.1)

where R_{ij} is the distance between atoms i and j. The cutoff function f_c goes from 1 to 0 as R_{ij} surpasses a cutoff distance R_c, which focuses the sum to capture only the local atomic environment. The parameter η_R controls the width and R_s^R determines the location of the Gaussian functions. Using a set of different values for R_s^R allows the capture of information in particular regions of the radial environment around i. The cutoff function is defined as follows, where R_c is the cutoff distance:

    f_c(R_{ij}) = \begin{cases} 0.5\left(\cos\left(\frac{\pi R_{ij}}{R_c}\right) + 1\right) & \text{for } R_{ij} \le R_c \\ 0 & \text{for } R_{ij} > R_c \end{cases}    (4.2)

The angular functions are defined for an atom i using the sum of cosine terms for all possible angles centered at atom i:

    G_i^A = 2^{1-\zeta} \sum_{j,k \ne i} \left(1 + \cos(\theta_{ijk} - \theta_s)\right)^{\zeta} e^{-\eta_A \left(\frac{R_{ij} + R_{ik}}{2} - R_s^A\right)^2} f_c(R_{ij}) f_c(R_{ik})    (4.3)

where η_A and R_s^A play the same roles in modifying the width and location of the radial Gaussian functions, and the parameters ζ and θ_s control the width and location of the different angular regions. Note that these angular functions are a modified version of the original Behler-Parrinello symmetry functions [94] and reflect the changes made by Smith et al. [95]. Using multiple values of η_R, R_s^R, η_A, R_s^A, θ_s and ζ enables us to construct a set of symmetry functions that can uniquely describe the local environment of an atom. Although we employ the TorchANI module [289] to compute the symmetry functions, there are significant differences in our implementation. Firstly, whereas in ANI atoms have a fixed elemental type (e.g. H, C, N or O), here we consider all the atoms to have the same type.
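As a concrete reference for the radial formulas above (Eqs. 4.1 and 4.2), a single radial symmetry function can be evaluated in a few lines; the parameter values come from Tab. 4.1, while the neighbor distances are hypothetical.

```python
import math

def cutoff(r, r_c):
    """Cosine cutoff f_c (Eq. 4.2): smoothly decays from 1 to 0 at r_c."""
    if r > r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def radial_feature(distances, eta, r_s, r_c):
    """One radial symmetry function G_i^R (Eq. 4.1) for a central atom,
    given the distances R_ij to all other atoms. Depending only on
    distances makes the feature translation- and rotation-invariant."""
    return sum(math.exp(-eta * (r - r_s) ** 2) * cutoff(r, r_c)
               for r in distances)

# eta_R = 1600 nm^-2 and R_c = 1.0 nm as in Tab. 4.1; two neighbors
# near the R_s = 0.169 nm Gaussian and one distant neighbor that the
# cutoff suppresses almost entirely.
g = radial_feature([0.15, 0.16, 0.9], eta=1600.0, r_s=0.169, r_c=1.0)
```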
However, we found that symmetry functions with one atom type struggle to couple chemical and geometrical information properly in some cases. Therefore, we added an additional set of radial functions to describe the local charge environment. We employ the radial function defined in Eq. 4.1 while using the following cutoff function, which incorporates the charge of the partner atom (q_j):

    f_c^q(R_{ij}) = \begin{cases} 0.5\left(\cos\left(\frac{\pi R_{ij}}{R_c}\right) + 1\right) q_j & \text{for } R_{ij} \le R_c \\ 0 & \text{for } R_{ij} > R_c \end{cases}    (4.4)

The resulting features are denoted G_i^{Rq}. The full set of features for atom i is then F_i = ({G_i^R}, {G_i^{Rq}}, {G_i^A}, {A_i}), where the {} brackets for the G functions denote a list over all permutations of the parameters η_R, R_s^R, η_A, R_s^A, θ_s and ζ. {A_i} is the set of the four attributes for atom i, namely q_i, ε_i, σ_i and λ_i. The parameter values used to construct the features are given in Tab. 4.1. In total, there are 24 radial features (for each of {G_i^R} and {G_i^{Rq}}), 128 angular features, and the four atomic attributes, for a grand total of 180 features per atom.

Table 4.1: Parameter values for Behler-Parrinello functions, grouped into radial and angular components. The number of values used for each parameter is shown in the n_vals column. The product of n_vals is the total number of features for each component (N_feat).

             Var.              Values                      n_vals   N_feat
  Radial     η_R (nm^-2)       1600                        1
             R_s^R (nm)        0.090, 0.169, ..., 0.708    24       24
             R_c^R (nm)        1.0                         1
  Angular    η_A (nm^-2)       800                         1
             R_s^A (nm)        0.090, 0.155, ..., 0.545    8
             θ_s^A (radians)   π/16, 2π/16, ..., π         16       128
             ζ                 32                          1
             R_c^A (nm)        0.5                         1

We denote the ordered list of the ghost particle feature vectors as F_GP. Note that while F_GP is invariant to translation and rotation, it is not invariant to the re-indexing of atoms. However, the mapped set of features, M(F_GP), is invariant to re-indexing. In this way, our loss function, energies and forces are invariant to all three symmetries.
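The charge-weighted cutoff in Eq. 4.4, together with the feature-count bookkeeping implied by Tab. 4.1, can be sketched as follows; the grid sizes come from the table, while the neighbor distance and charge in the example are hypothetical.

```python
import math

def cutoff_q(r, q_j, r_c):
    """Charge-weighted cutoff f_c^q (Eq. 4.4): the ordinary cosine
    cutoff scaled by the partial charge q_j of the partner atom, so
    the resulting radial features encode the local charge environment."""
    if r > r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0) * q_j

# Feature-count bookkeeping from Tab. 4.1:
n_radial = 1 * 24            # eta_R values x R_s^R values
n_charge_radial = 1 * 24     # same radial grid, charge-weighted cutoff
n_angular = 1 * 8 * 16 * 1   # eta_A x R_s^A x theta_s x zeta values
n_attributes = 4             # q, epsilon, sigma, lambda
n_total = n_radial + n_charge_radial + n_angular + n_attributes  # 180
```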
4.2.3 Continuous Atomic Identities

In the framework of a classical MD simulation, the "topology" describes the identities of each atom and their covalent connectivity in the system. The atomic identities are extensive; for instance, the CHARMM36 protein force field has 20 different types of carbon, where each atom type has unique parameters for the partial charge, bond lengths, and rotation around chemical bonds [290]. A harmonic energy function defines the bonds between atoms, which can vibrate around a minimum-energy position but can never break. The system's topology for a classical MD simulation is thus determined at the beginning of the simulation and remains fixed for the duration of a trajectory. The FT method allows the topology to change over the course of the simulation by treating the attributes that define atomic identities as dynamic variables. To enable a continuous and dynamic description of chemical identities, a key step is to separate external interactions from internal ones and treat them separately. While internal interactions are complex and involve a multitude of atom-type-specific parameters, external interactions are comprised of only Lennard-Jones and electrostatic interactions that can be wholly described by only three parameters: partial charges (q), Lennard-Jones well depths (ε) and Lennard-Jones radii (σ). To describe how we turn these parameters into dynamic variables, we first consider the role of λ in alchemical simulations. In alchemical simulations, λ is an external parameter that is used to "turn on" or "turn off" additional terms in a potential energy function [291, 292, 293, 294]:

    U(X, x, λ) = U_0(X) + λ U_pert(x)    (4.5)

where X represents the positions of the fixed system and x represents the positions of the alchemical group, U_0 is the original potential energy function, and U_pert is a perturbative term that is turned on when λ = 1 and off when λ = 0.
This approach has also been used in the λ-dynamics method, where each λ is considered a dynamical variable [295, 296]. That is, λ feels a force according to F_λ = −∂U/∂λ, and has a velocity v_λ and a mass m_λ. In FT, each atom is assigned a dynamic λ variable, as well as the three parameters q_i, ε_i, and σ_i. These are all treated as dynamical variables that change as a function of time, and together we refer to them as "attributes", denoted A_i = {q_i, ε_i, σ_i, λ_i} for atom i and A_GP for the complete set. These variables, like λ, have masses (m_q, m_ε and m_σ) and their motion follows the gradients of an extended Hamiltonian energy function. These dynamic variables are used to model non-bonded interactions between the ghost particles and the rest of the system, using conventional Lennard-Jones and Coulomb equations.

4.2.4 Loss function and Flexible Topology Restraint Energy

While the MLForce model does not have any trainable parameters, we define a loss function that measures the difference between the mapped ghost particle features (M(F_GP)) and the target features (F_T):

    L = |M(F_GP) − F_T|    (4.6)

where |·| is the L2 norm and F_T is a set of features computed for a given conformation of the target ligand. Note that because F_T is computed using a single ligand structure, the loss function can restrict the internal degrees of freedom of the ligand after assembly. To that end, we could also employ a loss function with multiple minima that represent either different ligand conformations or different ligands entirely:

    L = \min_{i \in \ell} \left( |M(F_GP) − F_{T,i}| + \varsigma_i \right)    (4.7)

where only one set of target features F_{T,i} in a (possibly large) set ℓ is used to determine L, and an offset for each target (ς_i) can be used to balance their relative probabilities.
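The multi-minimum loss of Eq. 4.7 can be sketched directly: the closest target (after its offset) determines L, so the ghost particles are free to assemble into whichever ligand or conformation is nearest. The feature vectors and offsets below are hypothetical.

```python
import math

def l2(a, b):
    """L2 norm of the difference between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def multi_target_loss(mapped_features, targets, offsets):
    """Loss with multiple minima (Eq. 4.7): L is set by the single
    target whose offset-shifted distance is smallest."""
    return min(l2(mapped_features, f_t) + s
               for f_t, s in zip(targets, offsets))

# Two hypothetical targets; the offset penalizes the second target,
# biasing assembly toward the first even though it is geometrically
# farther in feature space.
m_f = [1.0, 2.0, 0.0]
targets = [[1.0, 2.0, 1.0], [1.0, 2.0, 0.1]]
loss = multi_target_loss(m_f, targets, offsets=[0.0, 1.5])
# Distances are 1.0 + 0.0 and 0.1 + 1.5, so the loss is 1.0.
```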
This loss is transformed into a perturbative energy according to:

    U_FT(X_GP, A_GP) = \frac{C}{\beta} L(X_GP, A_GP)    (4.8)

where we explicitly show the loss and potential energy as functions of the ghost particle coordinates (X_GP) and attributes (A_GP). β is the inverse temperature, and C is a parameter called the "MLForce scale" that is chosen to scale U_FT against the other forces in the system. We then add this term, multiplied by a tunable coupling parameter (φ), to the system energy function U:

    U(X, X_GP, A_GP; φ) = U_0(X) + U_nb(X, X_GP, A_GP) + φ U_FT(X_GP, A_GP)    (4.9)

where U_nb(X, X_GP, A_GP) is the nonbonded interaction energy between the ghost particles and the remainder of the system (X). The free energy between the φ = 0 ("random") state and the φ = 1 ("assembled") state could be rigorously calculated using the work applied during a nonequilibrium trajectory with progressively increasing values of the coupling parameter φ, using the Jarzynski equality [297]. This would allow the system to be initialized in an arbitrary configuration and to "assemble" into the most probable orientation as the loss function is turned on. Alternatively, we can run "relaxation" trajectories that begin with φ = 1 and the ghost particles in a disordered state. As shown below, although relaxation trajectories do not yet allow for rigorous free energy calculation, they are sufficient to determine suitable algorithmic parameters and to assemble ligands inside a protein binding pocket.

4.2.5 Implementation details

Assembly simulations were conducted in OpenMM 7.7 [298] using a custom OpenMM integrator. In this integrator, Langevin dynamics with a friction coefficient of 1 ps^-1 and an integration step size of 1 fs is used to advance the positions and velocities, while Brownian dynamics is used to evolve the atomic attributes A_GP.
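A minimal sketch of an overdamped (Brownian) update for a single attribute is shown below. The timestep, effective friction/mass, and restoring force are illustrative values only, not the parameters of the actual custom integrator; the clamping mimics the bounded range of attributes such as λ ∈ [0, 1].

```python
import math
import random

def brownian_step(value, force, gamma_mass, dt, kT, rng, lo=None, hi=None):
    """One overdamped-dynamics step for an attribute (e.g. q or lambda):
    drift down the potential gradient plus Gaussian thermal noise.
    The clamp to [lo, hi] is a simple stand-in for bound handling;
    the real integrator's details may differ."""
    noise = math.sqrt(2.0 * kT * dt / gamma_mass) * rng.gauss(0.0, 1.0)
    new = value + (force / gamma_mass) * dt + noise
    if lo is not None:
        new = max(lo, new)
    if hi is not None:
        new = min(hi, new)
    return new

rng = random.Random(0)
lam = 0.7
for _ in range(100):
    # Hypothetical restoring force pulling lambda toward full presence.
    f = 10.0 * (1.0 - lam)
    lam = brownian_step(lam, f, gamma_mass=1.0, dt=0.01, kT=0.01,
                        rng=rng, lo=0.0, hi=1.0)
```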
We also keep the sum of the ghost particle partial charges equal to a constant value, which is set to zero for all molecules examined here but could be set to an integer value when examining charged ligands. The implementation of MLForce follows the architecture of the OpenMM library [298]. It includes multiple layers, as follows: 1) an OpenMM public Application Programming Interface (API) to generate the MLForce object with an embedded PyTorch model, 2) a platform-independent layer to load the model, and 3) a computational kernel layer for executing the model and computing the gradients. All of the layers are written in C++, and the computational kernels are implemented for the CUDA, OpenCL, and Reference (CPU) platforms in OpenMM. Furthermore, to compute derivatives, MLForce can be implemented using an automatic differentiation tool such as Autograd (e.g., https://github.com/HIPS/autograd). Here we utilize TorchANI [289], a PyTorch [299] implementation of the Behler-Parrinello symmetry functions, to compute features, and we use Torch functions to calculate the loss function derivatives with respect to the positions and attributes of the ghost particles. The feature calculator, Hungarian mapping algorithm, and loss function are created in a PyTorch model, which is then converted to a TorchScript module and saved in the ".pt" format. The assembly MD simulation loads this TorchScript model using the MLForce plugin and adds it as an external force in an OpenMM simulation, where it is automatically combined with the other system forces at each dynamics step. Simulations that embed the ghost particles in a larger system required additional setup steps. OpenMM System and Topology objects were created in a standard fashion for all of the non-ghost-particle atoms (referred to as the "system atoms"). The ghost particles were then added to the Topology object with masses of 20 amu.
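One simple way to realize the charge-sum constraint mentioned above is a uniform shift projection, sketched below. This is an illustrative guess at the mechanism: the actual implementation inside the custom integrator may instead project the forces or use a different constraint scheme.

```python
def project_charges(charges, total=0.0):
    """Shift partial charges uniformly so that they sum to a fixed
    total (zero for the neutral ligands studied here, or an integer
    for charged ligands). Hypothetical sketch of the constraint."""
    n = len(charges)
    shift = (total - sum(charges)) / n
    return [q + shift for q in charges]

# Four hypothetical ghost particle charges summing to +0.4 are shifted
# back onto the neutral (sum = 0) constraint surface.
q = project_charges([0.3, -0.1, 0.2, 0.0])
```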
The ghost particles were also added to the existing nonbonded forces in the System, although this was done in such a way that their interaction energy would always be zero. Nonbonded interactions between the ghost particles and the system atoms were implemented using a CustomNonbondedForce for each ghost atom, which took into account the dynamic variables q, σ, ε and λ. These attributes were defined for each ghost particle as "global variables", which are accessible to both the external forces and the integrator. The addEnergyParameterDerivative function of OpenMM was then used to easily compute derivatives of the nonbonded energy with respect to these parameters. For each CustomNonbondedForce we implemented an interaction group that restricts its application to only the interactions of the ghost particle in question with the system atoms.

4.2.5.1 Bromodomain simulation setup details

The PDB entry 4a9e was used as a starting point for system setup in CHARMM-GUI [300]. The ligand, protein and crystallographic waters from chain A were used, with ligand parameters obtained from CGenFF [301, 229]. The system was solvated in a cubic box using a 10 Å cutoff from the nearest protein atom. Four neutralizing Cl ions were added to the solution. To allow for easier introduction of ghost particles, we manually removed 8 water molecules from the vicinity of the binding site, taking care not to select waters that are involved in water-mediated hydrogen bonds [302]. The final system contained 24788 atoms. In each assembly simulation, we begin by introducing 22 particles in a 6 × 6 × 6 Å^3 cube centered at the protein binding site, which was defined as the centroid of residues Pro98, Val103, Tyr155 and Ile162. Ghost particles are inserted one at a time, and trial positions are only accepted if the atom is farther than 0.5 from all system atoms and all ghost particles inserted so far. Ghost particle attributes σ, ε and q were selected randomly from the bounds in Tab. 4.2.
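The random initialization of ghost particle attributes within the Tab. 4.2 bounds can be sketched as follows; the function and variable names are hypothetical, and only the bounds themselves come from the table.

```python
import random

# Attribute bounds (min, max) from Tab. 4.2, with units as given there.
BOUNDS = {"q": (-0.7, 0.7), "sigma": (0.022, 0.23), "epsilon": (0.037, 2.63)}

def random_attributes(rng):
    """Draw initial sigma, epsilon and q uniformly from the Tab. 4.2
    bounds, one set per inserted ghost particle."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in BOUNDS.items()}

rng = random.Random(42)
attrs = [random_attributes(rng) for _ in range(22)]  # 22 ghost particles
```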
The λ attributes were all initialized to 0.7. To prevent the ghost particles from drifting away into solvent, a CustomCentroidBondForce was used to restrain each ghost particle to be close to the centroid of a selection of atoms in the protein binding site. The binding site residues Pro98, Val103, Tyr155 and Ile162 are used to define the binding site centroid. The restraint energy is a one-sided, flat-bottom potential of the form:

    U_rest(r_i) = \frac{k}{2} \Theta(r_i - r_0)(r_i - r_0)^2

where r_0 is a cutoff equal to 8 Å, Θ(x) is a step function equal to 1 when x > 0 and 0 otherwise, and k = 1000 kcal/mol·nm^2. The ghost–ghost forces are added to the system with the MLForce package using the ligand features as a target, as described above. Before simulation, the positions of all atoms were minimized with all forces turned on for 1000 steps. The system was then run at 300 K for 4 ps with a 2 fs timestep, which was sufficient to reach the target temperature. Assembly simulations consisted of 1000 ps of simulation, with structures and features reported every 50 timesteps.

4.3 Results

4.3.1 Assembly Simulations in Vacuum

To evaluate the performance of the FT method and, in particular, examine the smoothness of U_FT with respect to X_GP and A_GP, we run assembly simulations in vacuum with randomly initialized ghost particles and a single set of target features. These are relaxation simulations, where the disordered state is unstable and φ = 1 throughout. Ideally, U_FT would be a smooth function that steadily guides the ghost particles to assemble into positions that are compatible with the target features.

Figure 4.2: The assembly simulation target molecules, randomly chosen from the OpenChem dataset [55], are shown as 2D structures. The number of atoms for each molecule (6, 8, 10, 15, 20, 25, 30, 35, 40 and 50) is written under it. The 2D structures are generated and drawn by MarvinSketch using implicit hydrogens.
We randomly selected 10 molecules from the OpenChem dataset [55] as targets, with sizes ranging from 6 atoms to 50 atoms (Fig. 4.2). Each simulation consists of 10 cycles of 100 steps of minimization followed by 20 ps of dynamics at 300 K. In order to optimize the landscape of the loss function, we examine a parameter which we refer to as the "MLForce scale": a scalar multiplier to the loss function that allows us to tune the loss function forces with respect to the other forces in the system (denoted C in Eq. 4.8). We run 3 simulations for each molecule across a range of C values and compare the performance. Each simulation is 200 ps and takes, on average, 20 minutes to run. The Root Mean Squared Distance (RMSD) between the ghost particle and target ligand positions is calculated in a way that takes into account equivalent atoms in the structure (e.g. those that have equal values of the BP functions) and finds the minimum RMSD evaluated over all permutations of equivalent atoms. Average values of these RMSDs and of U_FT (Eq. 4.8) for the last 10% of the assembly simulations are shown in Fig. 4.3.

Figure 4.3: Average values of RMSD and U_FT for assembly simulations run with different MLForce scale values. Error bars show standard errors calculated over 3 separate assembly simulations for the last 10% of data. A) RMSD between positions of ghost particles and the target molecule. B) The potential energy of ghost–ghost interactions, normalized by C (Eq. 4.8).

Table 4.2: Minimum and maximum parameter values for ghost particle attributes.

               Min.     Max.
  q (Debye)    −0.7     0.7
  σ (nm)       0.022    0.23
  ε (kJ/mol)   0.037    2.63
  λ            0.0      1.0

These results show that both the RMSD and the U_FT energy are lower for an MLForce scale of 2000 than for other MLForce scale values.
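The symmetry-aware RMSD metric described above can be sketched as follows. For simplicity, this toy version permutes a single group of equivalent atoms, whereas a full implementation would take the product over all equivalence groups; the coordinates and equivalence group are hypothetical.

```python
import math
from itertools import permutations

def rmsd(a, b):
    """Plain positional RMSD between two equal-length coordinate lists."""
    n = len(a)
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(a, b))
    return math.sqrt(sq / n)

def min_rmsd_equivalents(ghost, target, equiv):
    """Minimum RMSD over all permutations of one group of
    symmetry-equivalent atoms (e.g. atoms with identical BP feature
    values). Exhaustive, so only small groups are practical."""
    best = float("inf")
    for perm in permutations(equiv):
        trial = list(ghost)
        for src, dst in zip(equiv, perm):
            trial[dst] = ghost[src]
        best = min(best, rmsd(trial, target))
    return best

# Toy example: atoms 1 and 2 are equivalent and appear swapped in the
# ghost structure, so the symmetry-aware RMSD is zero even though the
# naive RMSD is not.
target = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ghost = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 0.0, 0.0)]
naive = rmsd(ghost, target)
symm = min_rmsd_equivalents(ghost, target, equiv=[1, 2])
```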
This suggests that an MLForce scale of 2000 is the best value for reaching a smooth FT potential energy function. To see the differences between the attributes of the ghost particles and the target molecule, we determined the Root Mean Square Errors (RMSE) of charge (q), particle size (σ), Lennard-Jones well depth (ε), and λ for the last 10% of data in the assembly simulations (Fig. 4.4). The errors are scaled by the range of each attribute (Tab. 4.2) and are therefore unitless quantities. The lowest RMSEs are achieved with an MLForce scale of 10000, although the errors differ for each atomic attribute. RMSEs of 10^-3 indicate an average difference from the target feature of 0.1% of the allowed range for that parameter. We view all of the results obtained with MLForce scale values ≥ 2000 as showing acceptable convergence.

Figure 4.4: The average RMSE of atomic attributes between the ghost and target ligands for the last 10% of simulation data. Errors are determined over three different runs for 10 molecules (30 assembly simulations in total for each MLForce scale).

We then focused on an MLForce scale value of 2000 and examined the performance of assembly simulations for molecules of different sizes. We observed that the best results are achieved with smaller molecules and that the RMSD and U_FT generally increase with increasing molecular size (Fig. 4.5). In Fig. 4.6 we plot the minimum RMSD obtained in each simulation and see that FT can reach RMSD < 2 Å for the largest system we examine (50 atoms).

Figure 4.5: The average RMSD of positions and FT energy for an MLForce scale of 2000 over the last 10% of data.
The standard errors are determined over three different runs. A) The average RMSD between the ghost particles and the target ligand. B) The FT potential energy, divided by the MLForce scale.

Figure 4.6: The minimum RMSD of positions for an MLForce scale of 2000 for the assembly simulations. The standard errors are determined over three different runs.

4.4 Assembly in a Bromodomain binding pocket

To test the ability of FT to assemble ghost particles in a heterogeneous environment, we simulate the assembly of a ligand within the binding pocket of a bromodomain. Bromodomains are small left-handed bundles of four alpha helices with a well-defined binding pocket. They are known as epigenetic "readers" that use their binding pockets to recognize acetylated lysine residues [303]. Bromodomains are included – often multiple times – in nuclear proteins that are involved in chromatin remodeling. As dysregulation of chromatin remodeling is a hallmark of cancer progression, bromodomain inhibitors have long been sought as potential cancer treatments [304, 305]. Here we assemble the compound 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one in the first bromodomain of BRD2 [306]. A crystal structure of this compound is available in the Protein Data Bank (PDB ID: 4a9e). The ligand has 22 atoms and is composed of two conjoined six-membered rings (a quinazoline scaffold) with protonations and substitutions on the nitrogen-containing ring (Fig. 4.7).

Figure 4.7: The bound pose of 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one from the 4a9e PDB entry. Residues used to define the binding pocket centroid (Pro98, Val103, Tyr155 and Ile162) are shown in white licorice. Hydrogens are not shown on the ligand or binding site residues. A) Front view. B) Top view. C) 2D structure of the ligand.
D) A size–charge representation where each sphere is sized according to the atomic radius σ and colored according to its charge, q.

We conducted ten assembly simulations of the 3-methyl-1,2,3,4-tetrahydroquinazolin-2-one ligand inside the BRD2 binding pocket using an MLForce scale of 5000. The simulations began from different randomly initialized positions and attributes of the ghost particles. Fig. 4.8 shows the U_FT energies over time for the ten assembly simulations. For these parameters, we found that U_FT energies of approximately 15–20 kcal/mol correspond to correctly assembled states, while incorrectly assembled states had U_FT energies of approximately 80 kcal/mol. We find that seven of the ten simulations assembled into the correct structure within the time limit. One trajectory assembled between 800 and 1000 ps, suggesting that a longer time limit could allow all trajectories to converge. Generally, the assembly simulations show periods of stagnation followed by sudden, rapid drops in energy, which suggests the presence of metastable local minima in the landscape. The insets of Fig. 4.8 show snapshots from a successful and an unsuccessful simulation. The final structure of the successful simulation (bottom) shows the same chemical structure and attribute pattern as the target ligand in Fig. 4.7D. Similar results were found for the other six simulations with U_FT < 40 kcal/mol. The final structure in the unsuccessful simulation (top) shows some correct elements of the target molecule, such as smaller, positively charged atoms (hydrogens) arranged around a larger, neutral atom (carbon). While these results are encouraging, a close examination of the ligand pose in the example trajectory shows that it differs from the crystal structure bound pose shown in Fig. 4.7. The ligand should be rotated 180 degrees around the horizontal axis, with the methyl group on the left pointing down into the binding pocket instead of upwards.
Similar results were found for the other seven correctly assembled ligands: although the general position of the ligand was correct, the overall bound pose did not match that found in the crystal structure. This suggests that some further refinement of the method is needed before it can be used to predict bound poses.

Figure 4.8: Ten assembly simulations in a bromodomain binding pocket. The central panel shows traces of the FT energy (U_FT) for the ten trajectories (thin lines) as well as an average over the trajectory set (thick blue line). Two example trajectories are highlighted. The top panels, outlined in red, show snapshots from an example "failed" trajectory, which did not assemble into the correct structure. The bottom panels, outlined in green, show snapshots from a correctly assembled trajectory.

4.5 Discussion and Conclusions

Here we have presented a novel method called Flexible Topology that accounts for ligand-induced fit and the effect of surrounding water molecules upon drug binding to a receptor. In this method, atomic identities are fluidly defined by a set of four continuous dynamical variables for each atom: partial charge (q), particle size (σ), Lennard-Jones well depth (ε), and a presence variable (λ). These atoms are guided to assemble into a target ligand using a deterministic loss function defined by the difference between the atomic environment vectors (AEVs) of the atoms and the AEVs of the target atoms. The AEVs are modified versions of Behler-Parrinello symmetry functions [94], which describe the local environment around each atom in a way that is invariant to translation and rotation operations.
Derivatives of the loss function are used to define atomic forces that are added to an OpenMM MD simulation, implemented using a PyTorch model that incorporates features from the TorchANI package [289]. This was achieved with an external force plugin for OpenMM, referred to as "MLForce", which is made freely available at https://github.com/ADicksonLab/mlforce.git. MLForce is responsible for loading the TorchScript file of a TorchANI model and computing the internal forces on the ghost particles. FT simulations also require a custom integrator to update the continuous dynamical variables. Code to implement this integrator, as well as other supporting scripts, is given in the Flexible Topology repository: https://github.com/ADicksonLab/flexibletopology.git. One of the main challenges of this project was finding a representation that could describe molecular topologies uniquely. For this purpose, we originally tried features from a method called "Geometric Scattering for Graphs" [105], which we had used previously to predict partition coefficients for small molecules in Chapter 3. We attempted to use these features by themselves, as well as in combination with the Behler-Parrinello symmetry functions used above, but none of these feature sets were able to generate proper models for FT simulations. In these cases we found that minimizing the loss function was not sufficient to achieve the desired ligand structures. In other words, many structures were possible that were far from the target ligand structure but still reproduced the features exactly. The current implementation, which directly compares atomic feature values after an on-the-fly Hungarian algorithm mapping step, was found to be superior to the graph scattering approach. In this chapter, we have run assembly simulations in vacuum and in a bromodomain binding pocket.
Impressively, we could reach RMSD < 2 Å for a system of 50 atoms starting from random initial positions, and we can achieve convergence in both positions and attributes. We expect that assembly results for larger system sizes can be further improved through additional modifications to the feature set. For instance, adding new features could improve performance by allowing for a more precise description of a molecular conformation. However, reducing the size of the feature space could also help focus on the most important conformational differences. Another approach would be to learn a set of scaling parameters for the loss function:

    L_{scaled} = \sum_i w_i (F_{GP,i} - F_{T,i})^2    (4.10)

where w_i is the scale factor for feature i. This would allow us to "train" the loss function for optimal performance in assembly simulations. We reserve these lines of investigation for future work. For the assembly simulations in the bromodomain, we found that the predicted ligand poses did not coincide with the binding pose observed in the 4a9e crystal structure. We plan to further investigate the ability of the FT method to predict bound poses in future work. We will first expand our results to include pose prediction of other ligands to the same bromodomain structure. A search of the Protein Data Bank revealed six other ligand-bound BRD2 structures: 2ydw, 4uyf, 6db0, 6jke, 6mo7 and 6ony. It will be instructive to assemble these ligands as well and examine the pose-prediction performance across the whole dataset. Another key detail to examine will be whether the predicted poses are sensitive to the speed at which the FT energy is turned on. The results above are initialized with a high MLForce scale parameter (5000), causing the first stages of assembly to happen rather quickly. We hypothesize that a more gradual introduction of the MLForce will allow the ghost particle positions and attributes to better respond to the local electrostatic environment.
In future work we plan to examine the performance of the algorithm across a range of coupling speeds.

CHAPTER 5

SUMMARY AND FUTURE DIRECTIONS

In this chapter, we provide a summary of the research presented in this dissertation and discuss promising future directions.

5.1 Summary

In this dissertation, we introduced three in silico methods that can be applied at different points in the drug design and discovery process: 1) an enhanced sampling algorithm for simulating ligand (un)binding pathways and predicting binding kinetics, 2) a neural network model to predict molecular properties, and 3) a new method for virtual screening. The motivations behind these methods were discussed in Chapter 1. Next, we summarize our contributions in this dissertation. In Chapter 2, we proposed a region-free weighted ensemble (WE) [84] enhanced sampling algorithm for observing long-timescale rare events in biological systems, called Resampling of Ensembles by Variation Optimization, or "REVO". Since its inception, REVO has been employed to predict unbinding rates and ligand residence times in unbinding simulations [307, 308, 309, 310]. We used a tunable N-dimensional random walk system with an analytical solution to show that REVO is able to visit more distant points when compared to a previously developed weighted ensemble algorithm, WExplore [152, 87], as well as conventional MD sampling. Furthermore, we found that REVO obtains the highest accuracy and range for all simulated values of N compared to WExplore and conventional sampling. We then applied REVO to run unbinding simulations for the well-studied trypsin–benzamidine system, sampling the benzamidine unbinding pathways. Using REVO, we determined a novel unbinding pathway in which benzamidine exits between the two loops, moving downward. We predicted the residence time of the trypsin–benzamidine complex, which was in agreement with experimental values.
In summary, the main characteristic of REVO is that it can be tuned for simulating any system by defining a new distance metric, as was discussed in detail in Chapter 2.

In Chapter 3, we proposed a new method to predict the log P values of small molecules, referred to as “ClassicalGSG”. We employed neural networks to train log P predictor models on molecular features generated using a technique called Geometric Scattering for Graphs (GSG) [105]. The GSG method uses the graph representation of molecules, wherein each atom has a set of attributes. The atomic attributes are classical molecular mechanics force field parameters such as partial charges, Lennard-Jones parameters (well-depth and radius) and atomic type. GSG combines the atomic attributes and molecular topology information (using either 2D or 3D molecular structure) and transforms them into symmetry-invariant molecular features. First, we employed ClassicalGSG to examine the performance of two force field parameter matching programs: CHARMM General Force Field (CGenFF) [228, 229] and General AMBER Force Field 2 (GAFF2) [230, 231]. The NN models were trained using a dataset of molecules made available by OpenChem [55]. CGenFF-generated parameters with a specific ad hoc scheme of classifying CGenFF atomic types achieved the highest accuracy in predicting log P values. For the SAMPL7 target molecules, we used the best-performing parameter sets to train four log P predictor models. One of our verified (but non-ranked) prediction sets achieved the lowest RMSE (0.55) and the second lowest Mean Absolute Error (MAE) of 0.44 among all the submitted predictions. Furthermore, we examined the performance of the log P predictor models trained on atomic attributes generated by open-source force fields such as Merck Molecular Force Field 94 (MMFF94), Universal Force Field (UFF) and Ghemical.
Our results showed that the log P predictor models trained on MMFF94 with 2D molecular structure data yield more accurate predictions than those trained with other force field parameters. Our final log P predictor model was based on an end-to-end high-throughput pipeline to predict the log P of small molecules from their SMILES strings.

Finally, in Chapter 4, we developed a novel virtual screening method called “Flexible Topology” to identify hit compounds while considering the ligand-induced conformational changes in the receptor. This method employs external forces from a TorchANI model to gradually change the identity and positions of “Ghost Particles” and assemble them into target molecules. In addition, we developed a PyTorch force plugin for the OpenMM tool [298] called “MLForce” to generate internal energy and forces on ghost particles. In MLForce, a set of Behler-Parrinello symmetry functions define a symmetry-invariant molecular representation for the ghost particles and target molecule [94]. The assembly simulations are guided by a loss function between the ghost particle features and the target ligand features. In addition, a set of parameters from classical molecular force fields, including the partial charge (q), two Lennard-Jones interaction parameters (the radius (σ) and the well-depth (ε)) and a presence variable λ, define the identity of each ghost particle and are treated as dynamical variables. We ran assembly simulations in vacuum and in the binding pocket of a bromodomain using the Flexible Topology method. We reached RMSE < 2 Å for a system of 50 atoms and observed convergence in both structure and dynamical variables.

5.2 Future Directions

In this section, we present several possible future directions across the major areas of in silico drug discovery, and discuss how machine learning can help advance them.

• Incorporating VAMPNets into REVO.
One future direction for the REVO algorithm will be to incorporate distance metrics that consider collective motions in the solvent or receptor degrees of freedom. Ideally, these collective motions can correspond to long-timescale processes of interest. To achieve this, a distance metric based on VAMPNets [311, 312] can be incorporated into REVO. VAMPNets are neural network models trained on time-lagged data to predict the slow molecular processes of a system. For instance, they can be trained to learn optimal collective variables and describe ligand (un)binding, possibly involving non-linear combinations of features such as distances, solvent densities, and dihedral angles. To incorporate these CVs into REVO, one can simply use the distance between CVs calculated separately for each walker at each resampling step.

• Predicting drug-like properties using a multi-task model. In Chapter 3, we trained NN models on the GSG features to predict the partition coefficients of drug-like molecules. Similarly, if data is available, one can attempt to predict any drug-like property such as permeability, solubility, blood-brain barrier penetration, toxicity, drug metabolism, and pKa. Given this, an interesting future direction will be to develop a multi-task model to simultaneously predict several drug-like chemical properties, which we believe could help better predict whether a small molecule is a suitable drug candidate. For example, each property can have its own loss function in this multi-task model, and a weighted combination of the individual loss functions can be used to optimize the model.

• Rigorous assessment of Flexible Topology to predict binding poses. The next step in improving the Flexible Topology method will be to rigorously evaluate its pose-prediction power against established benchmark methods. In this regard, we plan to run assembly simulations with various target ligands and receptors and quantify the RMSD of the predicted poses.
We expect that Flexible Topology will outperform methods such as docking for systems where the flexibility of the binding pocket and solvent molecules plays an important role.

• Flexible Topology with multiple ligand targets. We can improve the screening power of the Flexible Topology method by allowing multiple ligands in the target set. This way, we can take into account the flexibility of the ligands and identify conformations that are the most compatible with a given binding site. By including multiple ligands – and possibly large databases of ligands – we hope to quickly identify ligands that are the most favorable. In this setup, the loss function can be formulated as follows:

L(F_GP, {F_T}) = min_i |F(X_GP, A_GP) − F_T_i|,  i ∈ {1 … n}    (5.1)

where n is the number of feature vectors in the target set. These features can correspond either to different ligands, or to different ligand conformations.

• Flexible Topology with a binding score function. Another future direction for the Flexible Topology method would be to have a score function to guide the ghost particles toward a correct structure and pose, instead of using a set of target features for the loss function. The score function will take the bound structure of the protein and ligand and determine a) how well the ghost particles fit in the binding pocket of the target protein and b) the correct molecular structure for the ligand. This would be a purely generative ligand design model that considers the target’s dynamics. Moreover, the score function could be a deep learning model trained extensively on both ground-state data and conformations from assembly simulations.

APPENDIX

REVO METHOD MATHEMATICAL MODEL

The REVO resampling method solves an optimization problem which aims to select a set of walkers for resampling operations (e.g. cloning and merging) in order to maximize an objective function. Walker i is denoted as W_i = {X_i, w_i}, where X_i represents the system coordinates and w_i is the walker weight.
In this section we precisely define the REVO resampling problem using a mathematical model, with the following definitions:

n is the number of walkers.
p_min is the minimum allowable probability of a walker.
p_max is the maximum allowable probability of a walker.
d_merge is the maximum walker-walker distance for merge operations.
w = [w_1, w_2, …, w_n]^T ∈ R^n is the vector of walker weights, whose elements sum to 1.
φ = [φ(W_1), φ(W_2), …, φ(W_n)]^T is a vector of walker “novelties” defined by the positively-valued φ function.
α is the exponent of the distance term in the variation function.
d_0 is the characteristic distance parameter.
D = {d_ij(X_i, X_j)} for i = 1 … n and j = 1 … n is the distance matrix, where d_ij(X_i, X_j) is the distance between walker i and walker j, with d_ij = d_ji and d_ii = 0.

The “variation” of the system is defined as

V = Σ_a Σ_b (d_ab / d_0)^α φ_a φ_b    (2)

where we denote φ(W_a) as “φ_a”. REVO solves this optimization problem using a “greedy” algorithm, which at each step finds the cloning operation that would cause a maximal increase in V. We further denote w* and φ* as the weight and novelty vectors after resampling operations, and V* as the variation of the walkers after resampling. Now we can define ΔV:

ΔV = V* − V = Σ_a Σ_b (d_ab / d_0)^α φ*_a φ*_b − Σ_a Σ_b (d_ab / d_0)^α φ_a φ_b    (3)

Let i denote the index of a cloned walker, and let j and k denote the indices of walkers that are merged together, with walker k continuing on to the next cycle (that is, absorbing the weight of walker j). The goal of the REVO resampler in a given iteration is to maximize ΔV by finding the optimal set of walkers (i, j, k). The cloning and merging operations change the weights of walkers i, j and k, and change the inter-walker distances d_jx for all x. All other walker weights and distances are preserved.
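The variation of Equation 2 is straightforward to compute from the distance matrix and novelty vector; the following minimal sketch uses arbitrary stand-in values for α and d_0, not the tuned parameters used in production REVO runs:

```python
import numpy as np

def variation(dist, novelty, d0=1.0, alpha=4.0):
    """REVO variation (Eq. 2): V = sum_ab (d_ab/d0)^alpha * phi_a * phi_b."""
    phi = np.asarray(novelty, dtype=float)
    d = np.asarray(dist, dtype=float)
    return float(np.sum((d / d0) ** alpha * np.outer(phi, phi)))

# Two walkers a distance d0 apart, each with novelty 1: only the two
# off-diagonal terms contribute, so V = 2.
d = np.array([[0.0, 1.0],
              [1.0, 0.0]])
assert variation(d, [1.0, 1.0]) == 2.0
```

Cloning a high-novelty, distant walker adds new terms to this double sum, which is exactly the ΔV that the greedy step of Equation 3 maximizes.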
The new weights and distances are:

w*_i = w*_j = w_i / 2
w*_k = w_j + w_k
d*_ij = d*_ji = 0
d*_jx = d_ix  ∀ x ≠ j    (4)

Finally, we can formulate this optimization problem as follows:

maximize over φ*:  ΔV = V* − V
subject to:
    i = 1, 2, …, n
    j = 1, 2, …, n and j ≠ i
    k = 1, 2, …, n and k ≠ i and k ≠ j
    w_i ≥ 2 p_min
    w_j + w_k ≤ p_max
    d_jk ≤ d_merge
    w*_i = w*_j = w_i / 2
    w*_k = w_j + w_k
    w*_l = w_l for l = 1, 2, …, n and l ≠ i, l ≠ j, l ≠ k

This is a complex optimal assignment problem which may be solved using binary integer programming [313]. Here we illustrate the algorithm with a simpler model.

Mathematical model: Cloning only

Now imagine we have n walkers and we want to choose one of them to clone into two identical walkers, each with half the weight of the chosen initial walker and the same distances to the other walkers. The new arrangement will have n + 1 walkers, as no merging will take place. The variation will be calculated as above, and the change in variation upon cloning walker i is denoted ΔV_i. Let us define V = Σ_i V_i, where V_i is as follows:

V_i = Σ_l (d_il / d_0)^α φ_i φ_l    (5)

It is intuitive that walkers with higher V_i values should be the walkers most beneficial for cloning, since they would then get to contribute twice to the variation function. However, the weight of this walker will also be reduced, which will lower φ_i. We will first show that if V_i ≥ V_m and φ_i ≥ φ_m, then walker i will show a higher increase in variation than walker m. In other words, ΔΔ_im = ΔV_i − ΔV_m > 0.

ΔΔ_im = ΔV_i − ΔV_m
      = [ Σ_a Σ_b (d_ab/d_0)^α φ*_a φ*_b − Σ_a Σ_b (d_ab/d_0)^α φ_a φ_b ]_i
      − [ Σ_a Σ_b (d_ab/d_0)^α φ*_a φ*_b − Σ_a Σ_b (d_ab/d_0)^α φ_a φ_b ]_m    (6)

where the subscript on each bracket denotes which walker is cloned. Only the weight of one walker changes, so most of the terms cancel out.
After simplification, we end up with the following equation:

ΔΔ_im = (2φ*_i − φ_i) Σ_l φ_l (d_il/d_0)^α − (2φ*_m − φ_m) Σ_l φ_l (d_ml/d_0)^α
      = (2 φ*_i/φ_i − 1) Σ_l (d_il/d_0)^α φ_i φ_l − (2 φ*_m/φ_m − 1) Σ_l (d_ml/d_0)^α φ_m φ_l    (7)

Using Equation 5 we have:

ΔΔ_im = (2 φ*_i/φ_i − 1) V_i − (2 φ*_m/φ_m − 1) V_m    (8)

In this work we employ a specific form of φ:

φ_i = log(w_i) − log(p_min/100)    (9)

which varies with walker weight as follows:

w*_i = w_i / 2
φ*_i = log(w*_i) − log(p_min/100)
     = log(w_i/2) − log(p_min/100)
     = log(w_i) − log(p_min/100) − log 2
     = φ_i − log 2    (10)

Therefore we have:

ΔΔ_im = (1 − 2 log 2/φ_i) V_i − (1 − 2 log 2/φ_m) V_m    (11)

If we assume φ_i ≥ φ_m, it can be shown that

(1 − 2 log 2/φ_i) ≥ (1 − 2 log 2/φ_m)    (12)

Additionally, if we have V_i ≥ V_m, then this guarantees ΔΔ_im > 0. However, it is a common scenario that V_i > V_m but φ_i < φ_m. This would be the case if walker i had large distances to other walkers, but had a lower weight than walker m. To find the minimum φ_i value such that ΔΔ_im is still positive, we set ΔΔ_im to zero, and solve for φ_i in terms of φ_m and V_i/V_m. The result is as follows:

φ_i^min = ( (V_i/V_m) · 2 log 2 ) / ( V_i/V_m − 1 + 2 log 2/φ_m )    (13)

Thus, if V_i = V_m then φ_i^min = φ_m, which is intuitive: if walker m has both the highest V_m and the highest φ_m, then it is guaranteed to cause the largest increase in V, as shown above.

[Figure A.1 is a log-log plot of w_i^min versus w_m for V_i/V_m = 1, 1.1, 1.5 and 2.0.]

Figure A.1: Minimum walker weight of i to ensure that ΔΔ_im > 0.

Figure A.1 shows w_i^min, which corresponds to φ_i^min, for different values of w_m and V_i/V_m. This clearly shows that, for the φ function in Equation 9, even for small values of V_i/V_m we find that φ_i^min is very small, approaching p_min, which is the smallest allowable walker weight.
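Equations 11 and 13 can be checked numerically; the short sketch below (using arbitrary example values for φ_m and V_i/V_m) verifies that ΔΔ_im vanishes at the boundary φ_i = φ_i^min, and that the boundary reduces to φ_m when V_i = V_m:

```python
import math

def delta_delta(v_i, v_m, phi_i, phi_m):
    """Eq. 11: (1 - 2 ln2/phi_i) V_i - (1 - 2 ln2/phi_m) V_m."""
    two_ln2 = 2 * math.log(2)
    return (1 - two_ln2 / phi_i) * v_i - (1 - two_ln2 / phi_m) * v_m

def phi_i_min(v_ratio, phi_m):
    """Eq. 13: minimum phi_i keeping dd_im positive, v_ratio = V_i/V_m."""
    two_ln2 = 2 * math.log(2)
    return v_ratio * two_ln2 / (v_ratio - 1 + two_ln2 / phi_m)

phi_m = 10.0
# At phi_i = phi_i_min, dd_im = 0 (Eq. 13 is the root of Eq. 11).
for r in (1.0, 1.1, 1.5, 2.0):
    assert abs(delta_delta(r, 1.0, phi_i_min(r, phi_m), phi_m)) < 1e-12
# When V_i = V_m, the boundary is exactly phi_m.
assert abs(phi_i_min(1.0, phi_m) - phi_m) < 1e-12
```

The V_i/V_m ratios used here match the curves shown in Figure A.1.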
Thus, in our implementation of the REVO algorithm, we only consider the walkers with the highest V_i values for cloning. Note that for different choices of the novelty function φ_i(W_i), this assumption might have to be revisited.

BIBLIOGRAPHY

[1] Chun Meng Song, Shen Jean Lim, and Joo Chuan Tong. Recent advances in computer-aided drug design. Briefings in bioinformatics, 10(5):579–591, 2009.
[2] Si-sheng Ou-Yang, Jun-yan Lu, Xiang-qian Kong, Zhong-jie Liang, Cheng Luo, and Hualiang Jiang. Computational drug discovery. Acta Pharmacologica Sinica, 33(9):1131–1140, 2012.
[3] Maria Batool, Bilal Ahmad, and Sangdun Choi. A structure-based drug discovery paradigm. International journal of molecular sciences, 20(11):2783, 2019.
[4] Olivier J Wouters, Martin McKee, and Jeroen Luyten. Estimated research and development investment needed to bring a new medicine to market, 2009-2018. JAMA, 323(9):844–853, 2020.
[5] Brian S Hilbush. In Silico Dreams: How Artificial Intelligence and Biotechnology Will Create the Medicines of the Future. John Wiley & Sons, 2021.
[6] Christopher A Lipinski, Franco Lombardo, Beryl W Dominy, and Paul J Feeney. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Advanced drug delivery reviews, 23(1-3):3–25, 1997.
[7] Philip J Hajduk, Jeffrey R Huth, and Christin Tse. Predicting protein druggability. Drug discovery today, 10(23-24):1675–1682, 2005.
[8] Honglin Li, Zhenting Gao, Ling Kang, Hailei Zhang, Kun Yang, Kunqian Yu, Xiaomin Luo, Weiliang Zhu, Kaixian Chen, Jianhua Shen, et al. Tarfisdock: a web server for identifying drug targets with docking approach. Nucleic acids research, 34(suppl_2):W219–W224, 2006.
[9] Nagasuma Chandra. Computational systems approach for drug target discovery. Expert opinion on drug discovery, 4(12):1221–1236, 2009.
[10] Douglas B Kitchen, Hélène Decornez, John R Furr, and Jürgen Bajorath.
Docking and scoring in virtual screening for drug discovery: methods and applications. Nature reviews Drug discovery, 3(11):935–949, 2004.
[11] Sergio Filipe Sousa, Pedro Alexandrino Fernandes, and Maria Joao Ramos. Protein–ligand docking: current status and future challenges. Proteins: Structure, Function, and Bioinformatics, 65(1):15–26, 2006.
[12] Xuan-Yu Meng, Hong-Xing Zhang, Mihaly Mezei, and Meng Cui. Molecular docking: a powerful approach for structure-based drug discovery. Current computer-aided drug design, 7(2):146–157, 2011.
[13] Leonardo G Ferreira, Ricardo N Dos Santos, Glaucius Oliva, and Adriano D Andricopulo. Molecular docking and structure-based drug design strategies. Molecules, 20(7):13384–13421, 2015.
[14] Shikha Agnihotry, Rajesh Kumar Pathak, Ajeet Srivastav, Pradeep Kumar Shukla, and Budhayash Gautam. Molecular docking and structure-based drug design. In Computer-Aided Drug Design, pages 115–131. Springer, 2020.
[15] Campbell McInnes. Virtual screening strategies in drug discovery. Current opinion in chemical biology, 11(5):494–502, 2007.
[16] Antonio Lavecchia and Carmen Di Giovanni. Virtual screening strategies in drug discovery: a critical review. Current medicinal chemistry, 20(23):2839–2860, 2013.
[17] Shide Liang, Chi Zhang, Song Liu, and Yaoqi Zhou. Protein binding site prediction using an empirical scoring function. Nucleic acids research, 34(13):3698–3707, 2006.
[18] Bingding Huang. Metapocket: a meta approach to improve protein ligand binding site prediction. OMICS A Journal of Integrative Biology, 13(4):325–330, 2009.
[19] Zengming Zhang, Yu Li, Biaoyang Lin, Michael Schroeder, and Bingding Huang. Identification of cavities on protein surface using multiple computational approaches for drug binding site prediction. Bioinformatics, 27(15):2083–2088, 2011.
[20] Gisbert Schneider and Uli Fechner. Computer-based de novo design of drug-like molecules. Nature Reviews Drug Discovery, 4(8):649–663, 2005.
[21] Yun Tang, Weiliang Zhu, Kaixian Chen, and Hualiang Jiang. New technologies in computer-aided drug design: Toward target identification and new chemical entity discovery. Drug discovery today: technologies, 3(3):307–313, 2006.
[22] Markus Hartenfeller and Gisbert Schneider. De novo drug design. Chemoinformatics and computational chemical biology, pages 299–323, 2010.
[23] Robert P Sheridan and Simon K Kearsley. Why do we need so many chemical similarity search methods? Drug discovery today, 7(17):903–911, 2002.
[24] Adrià Cereto-Massagué, María José Ojeda, Cristina Valls, Miquel Mulero, Santiago Garcia-Vallvé, and Gerard Pujadas. Molecular fingerprint similarity search in virtual screening. Methods, 71:58–63, 2015.
[25] Ingo Muegge and Prasenjit Mukherjee. An overview of molecular fingerprint similarity search in virtual screening. Expert opinion on drug discovery, 11(2):137–148, 2016.
[26] Han Van De Waterbeemd and Eric Gifford. Admet in silico modelling: towards prediction paradise? Nature reviews Drug discovery, 2(3):192–204, 2003.
[27] Jie Dong, Ning-Ning Wang, Zhi-Jiang Yao, Lin Zhang, Yan Cheng, Defang Ouyang, Ai-Ping Lu, and Dong-Sheng Cao. Admetlab: a platform for systematic admet evaluation based on a comprehensively collected admet database. Journal of cheminformatics, 10(1):1–11, 2018.
[28] Jangampalli Adi Pradeepkiran, SB Sainath, and KVL Shrikanya. In silico validation and admet analysis for the best lead molecules. In Brucella Melitensis, pages 133–176. Elsevier, 2021.
[29] Kenneth M Merz Jr, Dagmar Ringe, and Charles H Reynolds. Drug design: structure- and ligand-based approaches. Cambridge University Press, 2010.
[30] Polamarasetty Aparoy, Kakularam Kumar Reddy, and Pallu Reddanna. Structure and ligand based drug design strategies in the development of novel 5-lox inhibitors. Current medicinal chemistry, 19(22):3763–3778, 2012.
[31] Gregory L Wilson and Markus A Lill.
Integrating structure-based and ligand-based approaches for computational drug design. Future medicinal chemistry, 3(6):735–750, 2011.
[32] Vidushi Sharma, Sharad Wakode, and Hirdesh Kumar. Structure- and ligand-based drug design: concepts, approaches, and challenges. In Chemoinformatics and Bioinformatics in the Pharmaceutical Sciences, pages 27–53. Elsevier, 2021.
[33] Leonardo LG Ferreira and Adriano D Andricopulo. Chemoinformatics approaches to structure- and ligand-based drug design. Frontiers in Pharmacology, page 1416, 2018.
[34] Javier Vázquez, Manel López, Enric Gibert, Enric Herrero, and F Javier Luque. Merging ligand-based and structure-based methods in drug discovery: An overview of combined virtual screening approaches. Molecules, 25(20):4723, 2020.
[35] Amy C Anderson. The process of structure-based drug design. Chemistry & biology, 10(9):787–797, 2003.
[36] Alexander Wlodawer and Jiri Vondrasek. Inhibitors of hiv-1 protease: a major success of structure-assisted drug design. Annual review of biophysics and biomolecular structure, 27(1):249–284, 1998.
[37] David E Clark. What has computer-aided molecular design ever done for drug discovery? Expert opinion on drug discovery, 1(2):103–110, 2006.
[38] Hedia Marrakchi, Gilbert Lanéelle, and Annaïk Quémard. Inha, a target of the antituberculous drug isoniazid, is involved in a mycobacterial fatty acid elongation system, fas-ii. Microbiology, 146(2):289–296, 2000.
[39] Sakineh Dadashpour, Tuba Tuylu Kucukkilinc, Oya Unsal Tan, Keriman Ozadali, Hamid Irannejad, and Saeed Emami. Design, synthesis and in vitro study of 5,6-diaryl-1,2,4-triazine-3-ylthioacetate derivatives as cox-2 and β-amyloid aggregation inhibitors. Archiv der Pharmazie, 348(3):179–187, 2015.
[40] Zachary Miller, Keun-Sik Kim, Do-Min Lee, Vinod Kasam, Si Eun Baek, Kwang Hyun Lee, Yan-Yan Zhang, Lin Ao, Kimberly Carmony, Na-Ra Lee, et al. Proteasome inhibitors with pyrazole scaffolds from structure-based virtual screening.
Journal of Medicinal Chemistry, 58(4):2036–2041, 2015.
[41] Sandeep Grover, Marsha A Apushkin, and Gerald A Fishman. Topical dorzolamide for the treatment of cystoid macular edema in patients with retinitis pigmentosa. American journal of ophthalmology, 141(5):850–858, 2006.
[42] Tomas de Paulis. Drug evaluation: Prx-00023, a selective 5-ht1a receptor agonist for depression. Current Opinion in Investigational Drugs (London, England: 2000), 8(1):78–86, 2007.
[43] Duxin Sun, Wei Gao, Hongxiang Hu, and Simon Zhou. Why 90% of clinical drug development fails and how to improve it? Acta Pharmaceutica Sinica B, 2022.
[44] Richard K Harrison. Phase ii and phase iii failures: 2013–2015. Nat Rev Drug Discov, 15(12):817–818, 2016.
[45] Debomoy K Lahiri, Bryan Maloney, Justin M Long, and Nigel H Greig. Lessons from a bace1 inhibitor trial: Off-site but not off base. Alzheimer’s & Dementia, 10(5):S411–S419, 2014.
[46] Radoslav Krivák and David Hoksza. P2rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of cheminformatics, 10(1):1–12, 2018.
[47] Marta M Stepniewska-Dziubinska, Piotr Zielenkiewicz, and Pawel Siedlecki. Improving detection of protein-ligand binding sites with 3d segmentation. Scientific reports, 10(1):1–9, 2020.
[48] Evan N Feinberg, Elizabeth Joshi, Vijay S Pande, and Alan C Cheng. Improvement in admet prediction with multitask deep featurization. Journal of medicinal chemistry, 63(16):8835–8848, 2020.
[49] Mariya Popova, Olexandr Isayev, and Alexander Tropsha. Deep reinforcement learning for de novo drug design. Science advances, 4(7):eaap7885, 2018.
[50] Daniel Reker and Gisbert Schneider. Active-learning strategies in computer-assisted drug discovery. Drug discovery today, 20(4):458–465, 2015.
[51] Jérôme Hert, Peter Willett, David J Wilton, Pierre Acklin, Kamal Azzaoui, Edgar Jacoby, and Ansgar Schuffenhauer.
New methods for ligand-based virtual screening: use of data fusion and machine learning to enhance the effectiveness of similarity searching. Journal of chemical information and modeling, 46(2):462–470, 2006.
[52] Ji-Yong An, Fan-Rong Meng, and Zi-Ji Yan. An efficient computational method for predicting drug-target interactions using weighted extreme learning machine and speed up robot features. BioData Mining, 14(1):1–17, 2021.
[53] John S Delaney. Esol: estimating aqueous solubility directly from molecular structure. Journal of chemical information and computer sciences, 44(3):1000–1005, 2004.
[54] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of chemical information and modeling, 53(7):1563–1575, 2013.
[55] Maria Korshunova, Boris Ginsburg, Alexander Tropsha, and Olexandr Isayev. Openchem: A deep learning toolkit for computational chemistry and drug design. Journal of Chemical Information and Modeling, 61(1):7–13, 2021.
[56] Bilal Shaker, Myeong-Sang Yu, Jin Sook Song, Sunjoo Ahn, Jae Yong Ryu, Kwang-Seok Oh, and Dokyun Na. Lightbbb: computational prediction model of blood–brain-barrier penetration based on lightgbm. Bioinformatics, 37(8):1135–1139, 2021.
[57] Albert C Pan, David W Borhani, Ron O Dror, and David E Shaw. Molecular determinants of drug–receptor binding kinetics. Drug discovery today, 18(13-14):667–673, 2013.
[58] James A. Maier, Carmenza Martinez, Koushik Kasavajhala, Lauren Wickstrom, Kevin E. Hauser, and Carlos Simmerling. ff14sb: Improving the accuracy of protein side chain and backbone parameters from ff99sb. Journal of Chemical Theory and Computation, 11(8):3696–3713, 2015.
[59] Tanaji T Talele, Santosh A Khedkar, and Alan C Rigby. Successful applications of computer aided drug discovery: moving drugs from concept to the clinic. Current topics in medicinal chemistry, 10(1):127–141, 2010.
[60] Vijay Kumar Bhardwaj and Rituraj Purohit. Targeting the protein-protein interface pocket of aurora-a-tpx2 complex: rational drug design and validation. Journal of Biomolecular Structure and Dynamics, 39(11):3882–3891, 2021.
[61] Sam RJ Hoare, Beth A Fleck, John P Williams, and Dimitri E Grigoriadis. The importance of target binding kinetics for measuring target binding affinity in drug discovery: a case study from a crf1 receptor antagonist program. Drug Discovery Today, 25(1):7–14, 2020.
[62] Robert A Copeland, David L Pompliano, and Thomas D Meek. Drug–target residence time and its implications for lead optimization. Nature reviews Drug discovery, 5(9):730–739, 2006.
[63] David C Swinney. The role of binding kinetics in therapeutically useful drug action. Current opinion in drug discovery & development, 12(1):31–39, 2009.
[64] Hao Lu and Peter J Tonge. Drug–target residence time: critical information for lead optimization. Current opinion in chemical biology, 14(4):467–474, 2010.
[65] Anthony M Giannetti. From experimental design to validated hits: a comprehensive walk-through of fragment lead identification using surface plasmon resonance. In Methods in enzymology, volume 493, pages 169–218. Elsevier, 2011.
[66] David A Sykes and Steven J Charlton. Single step determination of unlabeled compound kinetics using a competition association binding method employing time-resolved fret. In Rational Drug Design, pages 177–194. Springer, 2018.
[67] Edward C Hulme and Mike A Trevethick. Ligand binding assays at equilibrium: validation and interpretation. British journal of pharmacology, 161(6):1219–1237, 2010.
[68] Sai Kiran Sharma, Serge K Lyashchenko, Hijin A Park, Nagavarakishore Pillarsetty, Yorann Roux, Jiong Wu, Sophie Poty, Kathryn M Tully, John T Poirier, and Jason S Lewis. A rapid bead-based radioligand binding assay for the determination of target-binding fraction and quality control of radiopharmaceuticals.
Nuclear medicine and biology, 71:32–38, 2019.
[69] Anni Allikalt and Ago Rinken. Budded baculovirus particles as a source of membrane proteins for radioligand binding assay: The case of dopamine d1 receptor. Journal of Pharmacological and Toxicological Methods, 86:81–86, 2017.
[70] Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002.
[71] Huiyong Sun, Youyong Li, Mingyun Shen, Dan Li, Yu Kang, and Tingjun Hou. Characterizing drug–target residence time with metadynamics: How to achieve dissociation rate efficiently without losing accuracy against time-consuming approaches. Journal of Chemical Information and Modeling, 57(8):1895–1906, 2017.
[72] Mattia Bernetti, Matteo Masetti, Maurizio Recanatini, Rommie E Amaro, and Andrea Cavalli. An integrated markov state model and path metadynamics approach to characterize drug binding processes. Journal of chemical theory and computation, 15(10):5689–5702, 2019.
[73] Luca Mollica, Sergio Decherchi, Syeda Rehana Zia, Roberto Gaspari, Andrea Cavalli, and Walter Rocchia. Kinetics of protein-ligand unbinding via smoothed potential molecular dynamics simulations. Scientific reports, 5(1):1–12, 2015.
[74] Luca Mollica, Isabelle Theret, Mathias Antoine, Françoise Perron-Sierra, Yves Charton, Jean-Marie Fourquez, Michel Wierzbicki, Jean A Boutin, Gilles Ferry, Sergio Decherchi, et al. Molecular dynamics simulations and kinetic measurements to estimate and predict protein–ligand residence times. Journal of medicinal chemistry, 59(15):7167–7176, 2016.
[75] Indrajit Deb and Aaron T Frank. Accelerating rare dissociative processes in biomolecules using selectively scaled md simulations. Journal of Chemical Theory and Computation, 15(11):5817–5828, 2019.
[76] Donatella Callegari, Alessio Lodola, Daniele Pala, Silvia Rivara, Marco Mor, Andrea Rizzi, and Anna Maria Capelli.
Metadynamics simulations distinguish short- and long-residence-time inhibitors of cyclin-dependent kinase 8. Journal of chemical information and modeling, 57(2):159–169, 2017.
[77] Riccardo Capelli, Paolo Carloni, and Michele Parrinello. Exhaustive search of ligand binding pathways via volume-based metadynamics. The journal of physical chemistry letters, 10(12):3495–3499, 2019.
[78] Barry Isralewitz, Mu Gao, and Klaus Schulten. Steered molecular dynamics and mechanical functions of proteins. Current opinion in structural biology, 11(2):224–230, 2001.
[79] Yuzhen Niu, Shuyan Li, Dabo Pan, Huanxiang Liu, and Xiaojun Yao. Computational study on the unbinding pathways of b-raf inhibitors and its implication for the difference of residence time: insight from random acceleration and steered molecular dynamics simulations. Physical Chemistry Chemical Physics, 18(7):5622–5629, 2016.
[80] Gary A Huber and Sangtae Kim. Weighted-ensemble brownian dynamics simulations for protein association reactions. Biophysical journal, 70(1):97–110, 1996.
[81] Jeffrey C Sung, Adam W Van Wynsberghe, Rommie E Amaro, Wilfred W Li, and J Andrew McCammon. Role of secondary sialic acid binding sites in influenza n1 neuraminidase. Journal of the American Chemical Society, 132(9):2883–2885, 2010.
[82] Nuria Plattner and Frank Noé. Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and markov models. Nature communications, 6:7653, 2015.
[83] Jagannath Mondal, Navjeet Ahalawat, Subhendu Pandit, Lewis E Kay, and Pramodh Vallurupalli. Atomic resolution mechanism of ligand binding to a solvent inaccessible cavity in t4 lysozyme. PLoS computational biology, 14(5):e1006180, 2018.
[84] G A Huber and S Kim. Weighted-ensemble brownian dynamics simulations for protein association reactions. Biophysical Journal, 70(1):97–110, January 1996.
[85] Matthew C Zwier, Adam J Pratt, Joshua L Adelman, Joseph W Kaus, Daniel M Zuckerman, and Lillian T Chong.
Efficient atomistic simulation of pathways and calculation of rate constants for a protein–peptide binding process: Application to the mdm2 protein and an intrinsically disordered p53 peptide. The journal of physical chemistry letters, 7(17):3440–3445, 2016.
[86] Alex Dickson and Samuel D. Lotz. Ligand release pathways obtained with wexplore: Residence times and mechanisms. Journal of Physical Chemistry B, 120(24):5377–5385, 2016. PMID: 27231969.
[87] Alex Dickson and Samuel D. Lotz. Multiple ligand unbinding pathways and ligand-induced destabilization revealed by wexplore. Biophysical Journal, 112(4):620–629, February 2017.
[88] Samuel D Lotz and Alex Dickson. Unbiased molecular dynamics of 11 min timescale drug unbinding reveals transition state stabilizing interactions. Journal of the American Chemical Society, 140(2):618–628, 2018.
[89] Alex Dickson. Mapping the ligand binding landscape. Biophysical Journal, 115(9):1707–1719, 2018.
[90] David L Mobley, Karisa L Wymer, Nathan M Lim, and J Peter Guthrie. Blind prediction of solvation free energies from the sampl4 challenge. Journal of computer-aided molecular design, 28(3):135–150, 2014.
[91] Tewksbury ChemSilico LLC. Cslogp program.
[92] Kedi Wu, Zhixiong Zhao, Renxiao Wang, and Guo-Wei Wei. Topp–s: Persistent homology-based multi-task deep neural networks for simultaneous predictions of partition coefficient and aqueous solubility. Journal of computational chemistry, 39(20):1444–1454, 2018.
[93] Evan N Feinberg, Debnil Sur, Zhenqin Wu, Brooke E Husic, Huanghao Mai, Yang Li, Saisai Sun, Jianyi Yang, Bharath Ramsundar, and Vijay S Pande. Potentialnet for molecular property prediction. ACS central science, 4(11):1520–1530, 2018.
[94] Jörg Behler and Michele Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical review letters, 98(14):146401, 2007.
[95] Justin S Smith, Olexandr Isayev, and Adrian E Roitberg.
Ani-1: an extensible neural network potential with dft accuracy at force field computational cost. Chemical science, 8(4):3192–3203, 2017. [96] Masashi Tsubaki and Teruyasu Mizoguchi. Fast and accurate molecular property prediction: Learning atomic interactions and potentials with neural networks. The journal of physical chemistry letters, 9(19):5733–5741, 2018. [97] Lei Cao, Wenbo Tao, Sungtae An, Jing Jin, Yizhou Yan, Xiaoyu Liu, Wendong Ge, Adam Sah, Leilani Battle, Jimeng Sun, et al. Smile: a system to support machine learning on eeg data at scale. Proceedings of the VLDB Endowment, 12(12):2230–2241, 2019. [98] Gabriel A Pinheiro, Johnatan Mucelini, Marinalva D Soares, Ronaldo C Prati, Juarez LF Da Silva, and Marcos G Quiles. Machine learning prediction of nine molecular properties based on the smiles representation of the qm9 quantum-chemistry dataset. The Journal of Physical Chemistry A, 124(47):9854–9866, 2020. [99] Izhar Wallach, Michael Dzamba, and Abraham Heifets. Atomnet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015. [100] Matthew Ragoza, Joshua Hochuli, Elisa Idrobo, Jocelyn Sunseri, and David Ryan Koes. Protein–ligand scoring with convolutional neural networks. Journal of chemical information and modeling, 57(4):942–957, 2017. [101] José Jiménez, Miha Skalic, Gerard Martinez-Rosell, and Gianni De Fabritiis. Kdeep: protein–ligand absolute binding affinity prediction via 3d-convolutional neural networks. Journal of chemical information and modeling, 58(2):287–296, 2018. [102] Andrew T McNutt, Paul Francoeur, Rishal Aggarwal, Tomohide Masuda, Rocco Meli, Matthew Ragoza, Jocelyn Sunseri, and David Ryan Koes. Gnina 1.0: molecular docking with deep learning. Journal of cheminformatics, 13(1):1–20, 2021. [103] Kristof T Schütt, Farhad Arbabzadah, Stefan Chmiela, Klaus R Müller, and Alexandre Tkatchenko. Quantum-chemical insights from deep tensor neural networks.
Nature communications, 8:13890, 2017. [104] Justin Gilmer, Samuel S Schoenholz, Patrick F Riley, Oriol Vinyals, and George E Dahl. Neural message passing for quantum chemistry. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1263–1272. JMLR. org, 2017. [105] Feng Gao, Guy Wolf, and Matthew Hirn. Geometric scattering for graph data analysis. In International Conference on Machine Learning, pages 2122–2131, 2019. [106] Robert P Hertzberg and Andrew J Pope. High-throughput screening: new technology for the 21st century. Current opinion in chemical biology, 4(4):445–451, 2000. [107] W Patrick Walters, Matthew T Stahl, and Mark A Murcko. Virtual screening—an overview. Drug discovery today, 3(4):160–178, 1998. [108] Tiejun Cheng, Qingliang Li, Zhigang Zhou, Yanli Wang, and Stephen H Bryant. Structure-based virtual screening for drug discovery: a problem-centric review. The AAPS journal, 14(1):133–141, 2012. [109] Brian K Shoichet. Virtual screening of chemical libraries. Nature, 432(7019):862–865, 2004. [110] Ulrich Rester. From virtuality to reality - virtual screening in lead discovery and lead optimization: a medicinal chemistry perspective. Current opinion in drug discovery & development, 11(4):559–568, 2008. [111] Sharangdhar S Phatak, Clifford C Stephan, and Claudio N Cavasotto. High-throughput and in silico screenings in drug discovery. Expert Opinion on Drug Discovery, 4(9):947–959, 2009. [112] Jürgen Bajorath. Integration of virtual and high-throughput screening. Nature Reviews Drug Discovery, 1(11):882–894, 2002. [113] Vincent Zoete, Antoine Daina, Christophe Bovigny, and Olivier Michielin. Swisssimilarity: a web tool for low to ultra high throughput ligand-based virtual screening, 2016. [114] Carlos Garcia-Hernandez, Alberto Fernandez, and Francesc Serratosa. Ligand-based virtual screening using graph edit distance as molecular similarity measure.
Journal of chemical information and modeling, 59(4):1410–1421, 2019. [115] Markus Hofmarcher, Andreas Mayr, Elisabeth Rumetshofer, Peter Ruch, Philipp Renz, Johannes Schimunek, Philipp Seidl, Andreu Vall, Michael Widrich, Sepp Hochreiter, et al. Large-scale ligand-based virtual screening for sars-cov-2 inhibitors using deep neural networks. Available at SSRN 3561442, 2020. [116] Holger Gohlke and Gerhard Klebe. Approaches to the description and prediction of the binding affinity of small-molecule ligands to macromolecular receptors. Angewandte Chemie International Edition, 41(15):2644–2676, 2002. [117] Inbal Halperin, Buyong Ma, Haim Wolfson, and Ruth Nussinov. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function, and Bioinformatics, 47(4):409–443, 2002. [118] Nuno MFSA Cerqueira, Diana Gesto, Eduardo F Oliveira, Diogo Santos-Martins, Natércia F Brás, Sérgio F Sousa, Pedro A Fernandes, and Maria J Ramos. Receptor-based virtual screening protocol for drug discovery. Archives of biochemistry and biophysics, 582:56–67, 2015. [119] John J Irwin and Brian K Shoichet. Zinc - a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1):177–182, 2005. [120] Anna Gaulton, Louisa J Bellis, A Patricia Bento, Jon Chambers, Mark Davies, Anne Hersey, Yvonne Light, Shaun McGlinchey, David Michalovich, Bissan Al-Lazikani, et al. Chembl: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(D1):D1100–D1107, 2012. [121] Yanli Wang, Jewen Xiao, Tugba O Suzek, Jian Zhang, Jiyao Wang, and Stephen H Bryant. Pubchem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37(suppl_2):W623–W633, 2009. [122] Qingliang Li, Tiejun Cheng, Yanli Wang, and Stephen H Bryant. Pubchem as a public resource for drug discovery. Drug discovery today, 15(23-24):1052–1057, 2010.
[123] George WA Milne and JA Miller. The nci drug information system. 1. system overview. Journal of chemical information and computer sciences, 26(4):154–159, 1986. [124] Claudio N Cavasotto and Narender Singh. Docking and high throughput docking: successes and the challenge of protein flexibility. Current Computer-Aided Drug Design, 4(3):221–234, 2008. [125] Pietro Cozzini, Glen E Kellogg, Francesca Spyrakis, Donald J Abraham, Gabriele Costantino, Andrew Emerson, Francesca Fanelli, Holger Gohlke, Leslie A Kuhn, Garrett M Morris, et al. Target flexibility: an emerging consideration in drug discovery and design. Journal of medicinal chemistry, 51(20):6237–6255, 2008. [126] Thomas Scior, Andreas Bender, Gary Tresadern, José L Medina-Franco, Karina Martínez-Mayorga, Thierry Langer, Karina Cuanalo-Contreras, and Dimitris K Agrafiotis. Recognizing pitfalls in virtual screening: a critical review. Journal of chemical information and modeling, 52(4):867–881, 2012. [127] Katrina W Lexa and Heather A Carlson. Protein flexibility in docking and surface mapping. Quarterly reviews of biophysics, 45(3):301–343, 2012. [128] Claudio N Cavasotto and Ruben A Abagyan. Protein flexibility in ligand docking and virtual screening to protein kinases. Journal of molecular biology, 337(1):209–225, 2004. [129] Maxim Totrov and Ruben Abagyan. Flexible ligand docking to multiple receptor conformations: a practical alternative. Current opinion in structural biology, 18(2):178–184, 2008. [130] B-Rao Chandrika, Jyothi Subramanian, and Somesh D Sharma. Managing protein flexibility in docking and its applications. Drug discovery today, 14(7-8):394–400, 2009. [131] Jacob D Durrant and J Andrew McCammon. Computer-aided drug-discovery techniques that account for receptor flexibility. Current opinion in pharmacology, 10(6):770–774, 2010. [132] Ferran Feixas, Steffen Lindert, William Sinko, and J Andrew McCammon. Exploring the role of receptor flexibility in structure-based drug discovery.
Biophysical chemistry, 186:31–45, 2014. [133] Sheng-You Huang and Xiaoqin Zou. Ensemble docking of multiple protein structures: considering protein structural variations in molecular docking. Proteins: Structure, Function, and Bioinformatics, 66(2):399–421, 2007. [134] Rommie E Amaro, Jerome Baudry, John Chodera, Özlem Demir, J Andrew McCammon, Yinglong Miao, and Jeremy C Smith. Ensemble docking in drug discovery. Biophysical journal, 114(10):2271–2278, 2018. [135] Ian R Craig, Jonathan W Essex, and Katrin Spiegel. Ensemble docking into multiple crystallographically derived protein structures: an evaluation based on the statistical analysis of enrichments. Journal of chemical information and modeling, 50(4):511–524, 2010. [136] Julie R Schames, Richard H Henchman, Jay S Siegel, Christoph A Sotriffer, Haihong Ni, and J Andrew McCammon. Discovery of a novel binding trench in hiv integrase. Journal of medicinal chemistry, 47(8):1879–1881, 2004. [137] Lily S Cheng, Rommie E Amaro, Dong Xu, Wilfred W Li, Peter W Arzberger, and J Andrew McCammon. Ensemble-based virtual screening reveals potential novel antiviral compounds for avian influenza neuraminidase. Journal of medicinal chemistry, 51(13):3878–3894, 2008. [138] Albert H Chan, Jeff Wereszczynski, Brendan R Amer, Sung Wook Yi, Michael E Jung, J Andrew McCammon, and Robert T Clubb. Discovery of staphylococcus aureus sortase a inhibitors using virtual screening and the relaxed complex scheme. Chemical biology & drug design, 82(4):418–428, 2013. [139] Wilfredo Evangelista Falcon, Sally R Ellingson, Jeremy C Smith, and Jerome Baudry. Ensemble docking in drug discovery: how many protein configurations from molecular dynamics simulations are needed to reproduce known ligand binding? The Journal of Physical Chemistry B, 123(25):5189–5195, 2019. [140] Xavier Barril and S David Morley. Unveiling the full potential of flexible receptor docking using multiple crystallographic structures.
Journal of medicinal chemistry, 48(13):4432–4443, 2005. [141] Manuel Rueda, Giovanni Bottegoni, and Ruben Abagyan. Recipes for the selection of experimental protein conformations for virtual screening. Journal of chemical information and modeling, 50(1):186–193, 2010. [142] Mengang Xu and Markus A Lill. Utilizing experimental data for reducing ensemble size in flexible-protein docking. Journal of chemical information and modeling, 52(1):187–198, 2012. [143] Nazanin Donyapour, Nicole M Roussey, and Alex Dickson. Revo: Resampling of ensembles by variation optimization. The Journal of Chemical Physics, 150(24):244112, 2019. [144] Anna Maria Capelli and Gabriele Costantino. Unbinding pathways of vegfr2 inhibitors revealed by steered molecular dynamics. Journal of Chemical Information and Modeling, 54(11):3124–3136, 2014. PMID: 25299731. [145] G.M. Torrie and J.P. Valleau. Nonphysical sampling distributions in monte carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977. [146] Alessandro Laio and Michele Parrinello. Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566, 2002. [147] Xiongwu Wu and Bernard R. Brooks. Self-guided langevin dynamics simulation method. Chemical Physics Letters, 381(3):512–518, 2003. [148] Luca Maragliano and Eric Vanden-Eijnden. A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chemical Physics Letters, 426(1):168–175, 2006. [149] Hongfeng Lou, Robert I Cukier, et al. Molecular dynamics of apo-adenylate kinase: a distance replica exchange method for the free energy of conformational fluctuations. Journal of Physical Chemistry B, 110(47):24121, 2006. [150] John D. Chodera, William C. Swope, Jed W. Pitera, and Ken A. Dill. Long-time protein folding dynamics from short-time molecular dynamics simulations. Multiscale Modeling & Simulation, 5(4):1214–1226, 2006.
[151] Christof Schütte, Frank Noé, Jianfeng Lu, Marco Sarich, and Eric Vanden-Eijnden. Markov state models based on milestoning. The Journal of chemical physics, 134(20):05B609, 2011. [152] Alex Dickson and Charles L Brooks III. Wexplore: hierarchical exploration of high-dimensional spaces using the weighted ensemble algorithm. The Journal of Physical Chemistry B, 118(13):3532–3542, 2014. [153] Alex Dickson, Anthony M. Mustoe, Loic Salmon, and Charles L. Brooks III. Efficient in silico exploration of rna interhelical conformations using euler angles and wexplore. Nucleic Acids Research, 42(19), 2014. [154] Bin W. Zhang, David Jasnow, and Daniel M. Zuckerman. The “weighted ensemble” path sampling method is statistically exact for a broad class of stochastic processes and binning procedures. The Journal of Chemical Physics, 132(5), 2010. [155] Nuria Plattner and Frank Noé. Protein conformational plasticity and complex ligand-binding kinetics explored by atomistic simulations and markov models. Nature Communications, 6, July 2015. [156] Vittorio Limongelli, Massimiliano Bonomi, and Michele Parrinello. Funnel metadynamics as accurate binding free-energy method. Proceedings of the National Academy of Sciences, 110(16):6358–6363, 2013. [157] Pratyush Tiwary, Vittorio Limongelli, Matteo Salvalaglio, and Michele Parrinello. Kinetics of protein–ligand unbinding: Predicting pathways, rates, and rate-limiting steps. Proceedings of the National Academy of Sciences, 112(5):E386–E391, 2015. [158] Peter Eastman, Jason Swails, John D. Chodera, Robert T. McGibbon, Yutong Zhao, Kyle A. Beauchamp, Lee-Ping Wang, Andrew C. Simmonett, Matthew P. Harrigan, Chaya D. Stern, Rafal P. Wiewiora, Bernard R. Brooks, and Vijay S. Pande. Openmm 7: Rapid development of high performance algorithms for molecular dynamics. PLOS Computational Biology, 13(7):1–17, 07 2017. [159] Kenno Vanommeslaeghe and Alexander D MacKerell Jr.
Automation of the charmm general force field (cgenff) i: bond perception and atom typing. Journal of chemical information and modeling, 52(12):3144–3154, 2012. [160] Kenno Vanommeslaeghe, E Prabhu Raman, and Alexander D MacKerell Jr. Automation of the charmm general force field (cgenff) ii: assignment of bonded parameters and partial atomic charges. Journal of chemical information and modeling, 52(12):3155–3168, 2012. [161] Francesco Rao and Amedeo Caflisch. The protein folding network. Journal of Molecular Biology, 342(1):299–306, 2004. [162] Danzhi Huang and Amedeo Caflisch. The free energy landscape of small molecule unbinding. PLoS Computational Biology, 7:1–12, February 2011. [163] Alex Dickson and Charles L. Brooks. Native states of fast-folding proteins are kinetic traps. Journal of the American Chemical Society, 135(12):4729–4734, 2013. PMID: 23458553. [164] Kyle A. Beauchamp, Gregory R. Bowman, Thomas J. Lane, Lutz Maibaum, Imran S. Haque, and Vijay S. Pande. Msmbuilder2: Modeling conformational dynamics on the picosecond to millisecond scale. Journal of Chemical Theory and Computation, 7(10):3412–3419, 2011. PMID: 22125474. [165] Robert A Copeland. The dynamics of drug-target interactions: drug-target residence time and its impact on efficacy and safety. Expert Opinion on Drug Discovery, 5(4):305–310, 2010. PMID: 22823083. [166] Alex Dickson, Aryeh Warmflash, and Aaron R. Dinner. Separating forward and backward pathways in nonequilibrium umbrella sampling. Journal of Chemical Physics, 131(154104), 2009. [167] Eric Vanden-Eijnden and Maddalena Venturoli. Exact rate calculations by trajectory parallelization and tilting. Journal of Chemical Physics, 131(044120), 2009. [168] Ernesto Suárez, Steven Lettieri, Matthew C. Zwier, Carsen A. Stringer, Sundar Raman Subramanian, Lillian T. Chong, and Daniel M. Zuckerman. Simultaneous computation of dynamical and equilibrium information using a weighted ensemble of trajectories.
Journal of Chemical Theory and Computation, 10(7):2658–2667, 2014. PMID: 25246856. [169] Alex Dickson, Mark Maienschein-Cline, Allison Tovo-Dwyer, Jeff R. Hammond, and Aaron R. Dinner. Flow-dependent unfolding and refolding of an rna by nonequilibrium umbrella sampling. Journal of Chemical Theory and Computation, 7(9):2710–2720, 2011. PMID: 26605464. [170] Daniel M. Zuckerman and Lillian T. Chong. Weighted ensemble simulation: Review of methodology, applications, and software. Annual Review of Biophysics, 46(1):43–57, 2017. PMID: 28301772. [171] Terrell L Hill. Free Energy Transduction and Biochemical Cycle Kinetics. Springer, 1989. [172] Florent Guillain and Darwin Thusius. Use of proflavine as an indicator in temperature-jump studies of the binding of a competitive inhibitor to trypsin. Journal of the American Chemical Society, 92(18):5534–5536, 1970. [173] Mathieu Bastian, Sebastien Heymann, and Mathieu Jacomy. Gephi: An open source software for exploring and manipulating networks. 2009. [174] Yang Zhang and Jeffrey Skolnick. Scoring function for automated assessment of protein structure template quality. Proteins: Structure, Function, and Bioinformatics, 57(4):702–710, 2004. [175] Nazanin Donyapour, Matthew Hirn, and Alex Dickson. Classicalgsg: Prediction of log p using classical molecular force fields and geometric scattering for graphs. Journal of Computational Chemistry, 42(14):1006–1017, 2021. [176] Nazanin Donyapour and Alex Dickson. Predicting partition coefficients for the sampl7 physical property challenge using the classicalgsg method. Journal of Computer-Aided Molecular Design, 35(7):819–830, 2021. [177] Younggil Kwon. Handbook of essential pharmacokinetics, pharmacodynamics and drug metabolism for industrial scientists. Springer Science & Business Media, 2001. [178] Yingqing Ran and Samuel H Yalkowsky. Prediction of drug solubility by the general solubility equation (gse).
Journal of chemical information and computer sciences, 41(2):354–357, 2001. [179] Samuel H Yalkowsky and Shri C Valvani. Solubility and partitioning i: solubility of nonelectrolytes in water. Journal of pharmaceutical sciences, 69(8):912–922, 1980. [180] Thomas Ryckmans, Martin P Edwards, Val A Horne, Ana Monica Correia, Dafydd R Owen, Lisa R Thompson, Isabelle Tran, Michelle F Tutt, and Tim Young. Rapid assessment of a novel series of selective cb2 agonists using parallel synthesis protocols: A lipophilic efficiency (lipe) analysis. Bioorganic & medicinal chemistry letters, 19(15):4406–4409, 2009. [181] Sanjivanjit K Bhal, Karim Kassam, Ian G Peirson, and Greg M Pearl. The rule of five revisited: applying log d in place of log p in drug-likeness filters. Molecular pharmaceutics, 4(4):556–560, 2007. [182] Zhi Chen and Stephen G Weber. High-throughput method for lipophilicity measurement. Analytical chemistry, 79(3):1043–1049, 2007. [183] James Sangster. Octanol-water partition coefficients: fundamentals and physical chemistry, volume 1. John Wiley & Sons, 1997. [184] Tiejun Cheng, Yuan Zhao, Xun Li, Fu Lin, Yong Xu, Xinglong Zhang, Yan Li, Renxiao Wang, and Luhua Lai. Computation of octanol-water partition coefficients by guiding an additive model with knowledge. Journal of chemical information and modeling, 47(6):2140–2148, 2007. [185] Arup K Ghose and Gordon M Crippen. Atomic physicochemical parameters for three-dimensional structure-directed quantitative structure-activity relationships i. partition coefficients as a measure of hydrophobicity. Journal of computational chemistry, 7(4):565–577, 1986. [186] Albert J Leo. Calculating log poct from structures. Chemical Reviews, 93(4):1281–1306, 1993. [187] William M Meylan and Philip H Howard. Atom/fragment contribution method for estimating octanol–water partition coefficients. Journal of pharmaceutical sciences, 84(1):83–92, 1995. [188] Jeffrey Plante and Stephane Werner.
Jplogp: an improved logp predictor trained using predicted data. Journal of Cheminformatics, 10(1):61, 2018. [189] László Molnár, György M Keserű, Ákos Papp, Zsolt Gulyás, and Ferenc Darvas. A neural network based prediction of octanol–water partition coefficients using atomic fragmental descriptors. Bioorganic & medicinal chemistry letters, 14(4):851–853, 2004. [190] Jarmo J Huuskonen, David J Livingstone, and Igor V Tetko. Neural network modeling for estimation of partition coefficient based on atom-type electrotopological state indices. Journal of chemical information and computer sciences, 40(4):947–955, 2000. [191] Ikuo Moriguchi, Shuichi Hirono, Qian Liu, Izumi Nakagome, and Yasuo Matsushita. Simple method of calculating octanol/water partition coefficient. Chemical and pharmaceutical bulletin, 40(1):127–130, 1992. [192] Antoine Daina, Olivier Michielin, and Vincent Zoete. ilogp: a simple, robust, and efficient description of n-octanol/water partition coefficient for drug design using the gb/sa approach. Journal of chemical information and modeling, 54(12):3284–3301, 2014. [193] Deliang Chen, Qingyun Wang, Yibao Li, Yongdong Li, Hui Zhou, and Yulan Fan. A general linear free energy relationship for predicting partition coefficients of neutral organic compounds. Chemosphere, 247:125869, 2020. [194] Raimund Mannhold, Gennadiy I Poda, Claude Ostermann, and Igor V Tetko. Calculation of molecular lipophilicity: State-of-the-art and comparison of log p methods on more than 96,000 compounds. Journal of pharmaceutical sciences, 98(3):861–893, 2009. [195] Igor V Tetko, Vsevolod Yu Tanchuk, and Alessandro EP Villa. Prediction of n-octanol/water partition coefficients from physprop database using artificial neural networks and e-state indices. Journal of chemical information and computer sciences, 41(5):1407–1421, 2001. [196] Igor V Tetko and Vsevolod Yu Tanchuk. Application of associative neural networks for prediction of lipophilicity in alogps 2.1 program.
Journal of chemical information and computer sciences, 42(5):1136–1145, 2002. [197] ADMET Predictor, version 2.3.0. Simulations Plus, Inc.: Lancaster, CA, 2009. [198] Silicos-it. Filter-it software. [199] Jian-Wei Zou, Wen-Na Zhao, Zhi-Cai Shang, Mei-Lan Huang, Ming Guo, and Qing-Sen Yu. A quantitative structure-property relationship analysis of logp for disubstituted benzenes. The Journal of Physical Chemistry A, 106(47):11550–11557, 2002. [200] Maximiliano Riquelme and Esteban Vöhringer-Martinez. Sampl6 octanol–water partition coefficients from alchemical free energy calculations with mbis atomic charges. Journal of Computer-Aided Molecular Design, pages 1–8, 2020. [201] Hai-Feng Chen. In silico log p prediction for a large data set with support vector machines, radial basis neural networks and multiple linear regression. Chemical biology & drug design, 74(2):142–147, 2009. [202] Edward W Lowe, Mariusz Butkiewicz, Matthew Spellings, Albert Omlor, and Jens Meiler. Comparative analysis of machine learning techniques for the prediction of logp. In 2011 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–6. IEEE, 2011. [203] Quan Liao, Jianhua Yao, and Shengang Yuan. Svm approach for predicting logp. Molecular diversity, 10(3):301–309, 2006. [204] Andreas Breindl, Bernd Beck, Timothy Clark, and Robert C Glen. Prediction of the n-octanol/water partition coefficient, logp, using a combination of semiempirical mo-calculations and a neural network. Molecular modeling annual, 3(3):142–155, 1997. [205] Dong-Sheng Cao, Qing-Song Xu, Qian-Nan Hu, and Yi-Zeng Liang. Chemopy: freely available python package for computational biology and chemoinformatics. Bioinformatics, 29(8):1092–1094, 2013. [206] Peng Gao, Jie Zhang, Yuzhu Sun, and Jianguo Yu. Accurate predictions of aqueous solubility of drug molecules via the multilevel graph convolutional network (mgcn) and schnet architectures.
Physical Chemistry Chemical Physics, 22(41):23766–23772, 2020. [207] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pages 2224–2232, 2015. [208] Kevin Yang, Kyle Swanson, Wengong Jin, Connor Coley, Philipp Eiden, Hua Gao, Angel Guzman-Perez, Timothy Hopper, Brian Kelley, Miriam Mathea, et al. Analyzing learned molecular representations for property prediction. Journal of chemical information and modeling, 59(8):3370–3388, 2019. [209] Junshui Ma, Robert P Sheridan, Andy Liaw, George E Dahl, and Vladimir Svetnik. Deep neural nets as a method for quantitative structure–activity relationships. Journal of chemical information and modeling, 55(2):263–274, 2015. [210] SAMPL. Sampl. [211] Mehtap Işık, Teresa Danielle Bergazin, Thomas Fox, Andrea Rizzi, John D Chodera, and David L Mobley. Assessing the accuracy of octanol–water partition coefficient predictions in the sampl6 part ii log p challenge. Journal of computer-aided molecular design, 34(4):335–370, 2020. [212] Teresa Danielle Bergazin, Nicolas Tielker, Yingying Zhang, Junjun Mao, Marilyn R Gunner, Carlo Ballatore, Stefan Kast, David Mobley, et al. Evaluation of log p, pka and log d predictions from the sampl7 blind challenge. ChemRxiv, 2021. [213] Raymond Lui, Davy Guan, and Slade Matthews. A comparison of molecular representations for lipophilicity quantitative structure–property relationships with results from the sampl6 logp prediction challenge. Journal of Computer-Aided Molecular Design, pages 1–12, 2020. [214] Andreas Krämer, Phillip S Hudson, Michael R Jones, and Bernard R Brooks. Multi-phase boltzmann weighting: accounting for local inhomogeneity in molecular simulations of water–octanol partition coefficients in the sampl6 challenge. Journal of Computer-Aided Molecular Design, pages 1–13, 2020.
[215] Ye Ding, You Xu, Cheng Qian, Jinfeng Chen, Jian Zhu, Houhou Huang, Yi Shi, and Jing Huang. Predicting partition coefficients of drug-like molecules in the sampl6 challenge with drude polarizable force fields. Journal of Computer-Aided Molecular Design, pages 1–15, 2020. [216] Shujie Fan, Bogdan I Iorga, and Oliver Beckstein. Prediction of octanol-water partition coefficients for the sampl6-log 𝑃 molecules using molecular dynamics simulations with opls-aa, amber and charmm force fields. Journal of Computer-Aided Molecular Design, pages 1–18, 2020. [217] Piero Procacci and Guido Guarnieri. Sampl6 blind predictions of water-octanol partition coefficients using nonequilibrium alchemical approaches. Journal of Computer-Aided Molecular Design, pages 1–14, 2019. [218] Aleksandr V Marenich, Christopher J Cramer, and Donald G Truhlar. Universal solvation model based on solute electron density and on a continuum model of the solvent defined by the bulk dielectric constant and atomic surface tensions. The Journal of Physical Chemistry B, 113(18):6378–6396, 2009. [219] Christoph Loschen, Jens Reinisch, and Andreas Klamt. Cosmo-rs based predictions for the sampl6 logp challenge. Journal of computer-aided molecular design, 34(4):385–392, 2020. [220] Nicolas Tielker, Daniel Tomazic, Lukas Eberlein, Stefan Güssregen, and Stefan M Kast. The sampl6 challenge on predicting octanol–water partition coefficients from ec-rism theory. Journal of computer-aided molecular design, pages 1–9, 2020. [221] Davy Guan, Raymond Lui, and Slade Matthews. Logp prediction performance with the smd solvation model and the m06 density functional family for sampl6 blind prediction challenge molecules. Journal of Computer-Aided Molecular Design, pages 1–12, 2020. [222] Michael R Jones and Bernard R Brooks. Quantum chemical predictions of water–octanol partition coefficients applied to the sampl6 log p blind challenge. Journal of Computer-Aided Molecular Design, pages 1–9, 2020.
[223] Jonathan A Ouimet and Andrew S Paluch. Predicting octanol/water partition coefficients for the sampl6 challenge using the sm12, sm8, and smd solvation models. Journal of Computer-Aided Molecular Design, pages 1–14, 2020. [224] William J Zamora, Silvana Pinheiro, Kilian German, Clara Ràfols, Carles Curutchet, and F Javier Luque. Prediction of the n-octanol/water partition coefficients in the sampl6 blind challenge from mst continuum solvation calculations. Journal of computer-aided molecular design, 34(4):443–451, 2020. [225] Shuzhe Wang and Sereina Riniker. Use of molecular dynamics fingerprints (mdfps) in sampl6 octanol–water log p blind challenge. Journal of Computer-Aided Molecular Design, pages 1–11, 2019. [226] Prajay Patel, David M Kuntz, Michael R Jones, Bernard R Brooks, and Angela K Wilson. Sampl6 log p challenge: machine learning and quantum mechanical approaches. Journal of Computer-Aided Molecular Design, pages 1–16, 2020. [227] Evrim Arslan, Basak K Findik, and Viktorya Aviyente. A blind sampl6 challenge: insight into the octanol-water partition coefficients of drug-like molecules via a dft approach. Journal of Computer-Aided Molecular Design, pages 1–8, 2020. [228] Kenno Vanommeslaeghe and Alexander D MacKerell Jr. Automation of the charmm general force field (cgenff) i: bond perception and atom typing. Journal of chemical information and modeling, 52(12):3144–3154, 2012. [229] Kenno Vanommeslaeghe, E Prabhu Raman, and Alexander D MacKerell Jr. Automation of the charmm general force field (cgenff) ii: assignment of bonded parameters and partial atomic charges. Journal of chemical information and modeling, 52(12):3155–3168, 2012. [230] James A Maier, Carmenza Martinez, Koushik Kasavajhala, Lauren Wickstrom, Kevin E Hauser, and Carlos Simmerling. ff14sb: improving the accuracy of protein side chain and backbone parameters from ff99sb. Journal of chemical theory and computation, 11(8):3696–3713, 2015.
[231] Dario Vassetti, Marco Pagliai, and Piero Procacci. Assessment of gaff2 and opls-aa general force fields in combination with the water models tip3p, spce, and opc3 for the solvation free energy of druglike organic molecules. Journal of Chemical Theory and Computation, 15(3):1983–1995, 2019. [232] Junmei Wang, Romain M Wolf, James W Caldwell, Peter A Kollman, and David A Case. Development and testing of a general amber force field. Journal of computational chemistry, 25(9):1157–1174, 2004. [233] Anthony K Rappé, Carla J Casewit, KS Colwell, William A Goddard III, and W Mason Skiff. Uff, a full periodic table force field for molecular mechanics and molecular dynamics simulations. Journal of the American chemical society, 114(25):10024–10035, 1992. [234] Thomas A Halgren. Merck molecular force field. i. basis, form, scope, parameterization, and performance of mmff94. Journal of computational chemistry, 17(5-6):490–519, 1996. [235] Thomas A Halgren. Merck molecular force field. ii. mmff94 van der waals and electrostatic parameters for intermolecular interactions. Journal of Computational Chemistry, 17(5-6):520–552, 1996. [236] Tommi Hassinen and Mikael Peräkylä. New energy terms for reduced protein models implemented in an off-lattice force field. Journal of Computational Chemistry, 22(12):1229–1242, 2001. [237] P Howard and W Meylan. Physical/chemical property database (physprop), 1999. [238] DA Case, IY Ben-Shalom, SR Brozell, DS Cerutti, TE Cheatham III, VWD Cruzeiro, TA Darden, RE Duke, D Ghoreishi, MK Gilson, et al. Amber 2018. University of California, San Francisco, 2018. [239] Jarmo J Huuskonen, Alessandro EP Villa, and Igor V Tetko. Prediction of partition coefficient based on atom-type electrotopological state indices. Journal of pharmaceutical sciences, 88(2):229–233, 1999. [240] Gilles Klopman, Ju-Yun Li, Shaomeng Wang, and Mario Dimayuga. Computer automated log p calculations based on an extended group contribution approach.
Journal of Chemical Information and Computer Sciences, 34(4):752–781, 1994. [241] C Hansch, A Leo, and D Hoekman. Hydrophobic, electronic, and steric constants: exploring qsar, 1995. [242] RDKit. Program. [243] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pages 8024–8035, 2019. [244] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. [245] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal of Machine Learning Research, 12:2825–2830, 2011. [246] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203, 2013. [247] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. arXiv preprint arXiv:1506.05163, 2015. [248] Will Hamilton, Zhitao Ying, and Jure Leskovec. Inductive representation learning on large graphs. In Advances in neural information processing systems, pages 1024–1034, 2017. [249] Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016. [250] Han Altae-Tran, Bharath Ramsundar, Aneesh S Pappu, and Vijay Pande. Low data drug discovery with one-shot learning. ACS central science, 3(4):283–293, 2017. [251] Geoffrey E Hinton and Sam T Roweis. Stochastic neighbor embedding. In Advances in neural information processing systems, pages 857–864, 2003. [252] Solomon Kullback and Richard A Leibler. On information and sufficiency. The annals of mathematical statistics, 22(1):79–86, 1951.
[253] TM Heskes, WAJJ Wiegerinck, and HJ Kappen. Practical confidence and prediction intervals for prediction tasks. Progress in Neural Processing, pages 128–135, 1997.
[254] Sricharan Kumar and Ashok Srivastava. Bootstrap prediction intervals in non-parametric regression with applications to anomaly detection. In Proceedings of the 18th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2012.
[255] David A Nix and Andreas S Weigend. Estimating the mean and variance of the target probability distribution. In Proceedings of the 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pages 55–60. IEEE, 1994.
[256] Noel M O'Boyle, Michael Banck, Craig A James, Chris Morley, Tim Vandermeersch, and Geoffrey R Hutchison. Open Babel: An open chemical toolbox. Journal of Cheminformatics, 3(1):33, 2011.
[257] Open Babel. Program.
[258] Ismael J Hidalgo, Thomas J Raub, and Ronald T Borchardt. Characterization of the human colon carcinoma cell line (Caco-2) as a model system for intestinal epithelial permeability. Gastroenterology, 96(2):736–749, 1989.
[259] Robert D Brown and Yvonne C Martin. The information content of 2D and 3D structural descriptors relevant to ligand-receptor binding. Journal of Chemical Information and Computer Sciences, 37(1):1–9, 1997.
[260] Georgia B McGaughey, Robert P Sheridan, Christopher I Bayly, J Chris Culberson, Constantine Kreatsoulas, Stacey Lindsley, Vladimir Maiorov, Jean-Francois Truchon, and Wendy D Cornell. Comparison of topological, shape, and docking methods in virtual screening. Journal of Chemical Information and Modeling, 47(4):1504–1519, 2007.
[261] Jordi Mestres and Ronald Knegtel. Similarity versus docking in 3D virtual screening. In Virtual Screening: An Alternative or Complement to High Throughput Screening?, pages 191–207. Springer, 2000.
[262] JS Mason, AC Good, and EJ Martin. 3-D pharmacophores in drug discovery. Current Pharmaceutical Design, 7(7):567–597, 2001.
[263] Jayashree Srinivasan, Angelo Castellino, Erin K Bradley, John E Eksterowicz, Peter DJ Grootenhuis, Santosh Putta, and Robert V Stanton. Evaluation of a novel shape-based computational filter for lead evolution: Application to thrombin inhibitors. Journal of Medicinal Chemistry, 45(12):2494–2500, 2002.
[264] Sheng-You Huang, Sam Z Grinter, and Xiaoqin Zou. Scoring functions and their evaluation methods for protein–ligand docking: recent advances and future directions. Physical Chemistry Chemical Physics, 12(40):12899–12908, 2010.
[265] Hannes Stärk, Octavian-Eugen Ganea, Lagnajit Pattanaik, Regina Barzilay, and Tommi Jaakkola. EquiBind: Geometric deep learning for drug binding structure prediction. arXiv preprint arXiv:2202.05146, 2022.
[266] N Moitessier, P Englebienne, D Lee, J Lawandi, and CR Corbeil. Towards the development of universal, fast and highly accurate docking/scoring methods: a long way to go. British Journal of Pharmacology, 153(S1):S7–S26, 2008.
[267] Garrett M Morris, Ruth Huey, William Lindstrom, Michel F Sanner, Richard K Belew, David S Goodsell, and Arthur J Olson. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of Computational Chemistry, 30(16):2785–2791, 2009.
[268] William J Allen, Trent E Balius, Sudipto Mukherjee, Scott R Brozell, Demetri T Moustakas, P Therese Lang, David A Case, Irwin D Kuntz, and Robert C Rizzo. DOCK 6: Impact of new features and current docking performance. Journal of Computational Chemistry, 36(15):1132–1156, 2015.
[269] Oleg Trott and Arthur J Olson. AutoDock Vina: Improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2):455–461, 2010.
[270] Woong-Hee Shin and Chaok Seok. GalaxyDock: Protein–ligand docking with flexible protein side-chains. Journal of Chemical Information and Modeling, 52(12):3225–3232, 2012.
[271] Christopher R Corbeil and Nicolas Moitessier. Docking ligands into flexible and solvated macromolecules. 3. Impact of input ligand conformation, protein flexibility, and water molecules on the accuracy of docking programs. Journal of Chemical Information and Modeling, 49(4):997–1009, 2009.
[272] Ian W. Davis, Kaushik Raha, Martha S. Head, and David Baker. Blind docking of pharmaceutically relevant compounds using RosettaLigand. Protein Science, 18, 2009.
[273] Edward B. Miller, Robert B. Murphy, Daniel Sindhikara, Kenneth W. Borrelli, Matthew J. Grisewood, Fabio Ranalli, Steven L. Dixon, Steven Jerome, Nicholas A. Boyles, Tyler Day, Phani Ghanakota, Sayan Mondal, Salma B. Rafi, Dawn M. Troast, Robert Abel, and Richard A. Friesner. Reliable and accurate solution to the induced fit docking problem for protein-ligand binding. Journal of Chemical Theory and Computation, 17, 2021.
[274] Kathryn M Hart, Chris MW Ho, Supratik Dutta, Michael L Gross, and Gregory R Bowman. Modelling proteins' hidden conformations to predict antibiotic resistance. Nature Communications, 7(1):1–10, 2016.
[275] David DL Minh. Implicit ligand theory: Rigorous binding free energies and thermodynamic expectations from molecular docking. The Journal of Chemical Physics, 137(10):104106, 2012.
[276] A Peters, ME Lundberg, PT Lang, and CP Sosa. High throughput computing validation for drug discovery using the DOCK program on a massively parallel system. Technical report, IBM Redpaper REDP-4410, 2008.
[277] Barbara Collignon, Roland Schulz, Jeremy C Smith, and Jerome Baudry. Task-parallel message passing interface implementation of AutoDock4 for docking of very large databases of compounds using high-performance super-computers. Journal of Computational Chemistry, 32(6):1202–1209, 2011.
[278] Sally R Ellingson, Jeremy C Smith, and Jerome Baudry. VinaMPI: Facilitating multiple receptor high-throughput virtual docking on high-performance computers, 2013.
[279] Xiaohua Zhang, Sergio E Wong, and Felice C Lightstone. Message passing interface and multithreading hybrid for parallel molecular docking of large databases on petascale high performance computing machines. Journal of Computational Chemistry, 34(11):915–927, 2013.
[280] John E Stone, David J Hardy, Ivan S Ufimtsev, and Klaus Schulten. GPU-accelerated molecular modeling coming of age. Journal of Molecular Graphics and Modelling, 29(2):116–125, 2010.
[281] Oliver Korb, Thomas Stützle, and Thomas E Exner. Accelerating molecular docking calculations using graphics processing units. Journal of Chemical Information and Modeling, 51(4):865–876, 2011.
[282] Serkan Altuntaş, Zeki Bozkus, and Basilio B Fraguela. GPU accelerated molecular docking simulation with genetic algorithms. In European Conference on the Applications of Evolutionary Computation, pages 134–146. Springer, 2016.
[283] Xinqiang Ding, Yujin Wu, Yanming Wang, Jonah Z Vilseck, and Charles L Brooks III. Accelerated CDOCKER with GPUs, parallel simulated annealing, and fast Fourier transforms. Journal of Chemical Theory and Computation, 16(6):3910–3919, 2020.
[284] Diogo Santos-Martins, Leonardo Solis-Vasquez, Andreas F Tillack, Michel F Sanner, Andreas Koch, and Stefano Forli. Accelerating AutoDock4 with GPUs and gradient-based local search. Journal of Chemical Theory and Computation, 17(2):1060–1073, 2021.
[285] Jianhua Li, Guanlong Liu, Zhiyuan Zhen, Zihao Shen, Shiliang Li, and Honglin Li. Molecular docking for ligand-receptor binding process based on heterogeneous computing. Scientific Programming, 2022, 2022.
[286] Cyrus Levinthal. Mössbauer spectroscopy in biological systems. In Proceedings of a meeting held at Allerton House. P. Debrunner, JCM Tsibris, and E. Münck, editors. University of Illinois Press, Urbana, IL, 1969.
[287] Peter E Leopold, Mauricio Montal, and José N Onuchic. Protein folding funnels: A kinetic approach to the sequence-structure relationship. Proceedings of the National Academy of Sciences, 89(18):8721–8725, 1992.
[288] Harold W Kuhn. The Hungarian method for the assignment problem. Naval Research Logistics Quarterly, 2(1-2):83–97, 1955.
[289] Xiang Gao, Farhad Ramezanghorbani, Olexandr Isayev, Justin S Smith, and Adrian E Roitberg. TorchANI: A free and open source PyTorch-based deep learning implementation of the ANI neural network potentials. Journal of Chemical Information and Modeling, 60(7):3408–3415, 2020.
[290] Jing Huang and Alexander D MacKerell Jr. CHARMM36 all-atom additive protein force field: Validation based on comparison to NMR data. Journal of Computational Chemistry, 34(25):2135–2145, 2013.
[291] Bhalachandra L Tembre and J Andrew McCammon. Ligand-receptor interactions. Computers & Chemistry, 8(4):281–283, 1984.
[292] Pavel V Klimovich, Michael R Shirts, and David L Mobley. Guidelines for the analysis of free energy calculations. Journal of Computer-Aided Molecular Design, 29(5):397–411, 2015.
[293] David L Mobley, Alan P Graves, John D Chodera, Andrea C McReynolds, Brian K Shoichet, and Ken A Dill. Predicting absolute ligand binding free energies to a simple model site. Journal of Molecular Biology, 371(4):1118–1134, 2007.
[294] John D Chodera, David L Mobley, Michael R Shirts, Richard W Dixon, Kim Branson, and Vijay S Pande. Alchemical free energy methods for drug discovery: progress and challenges. Current Opinion in Structural Biology, 21(2):150–160, 2011.
[295] Xianjun Kong and Charles L Brooks III. λ-dynamics: A new approach to free energy calculations. The Journal of Chemical Physics, 105(6):2414–2423, 1996.
[296] Jennifer L Knight and Charles L Brooks III. Multisite λ dynamics for simulated structure–activity relationship studies. Journal of Chemical Theory and Computation, 7(9):2728–2739, 2011.
[297] Christopher Jarzynski. Nonequilibrium equality for free energy differences. Physical Review Letters, 78(14):2690, 1997.
[298] Peter Eastman, Jason Swails, John D Chodera, Robert T McGibbon, Yutong Zhao, Kyle A Beauchamp, Lee-Ping Wang, Andrew C Simmonett, Matthew P Harrigan, Chaya D Stern, et al. OpenMM 7: Rapid development of high performance algorithms for molecular dynamics. PLoS Computational Biology, 13(7):e1005659, 2017.
[299] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
[300] Sunhwan Jo, Taehoon Kim, Vidyashankara G Iyer, and Wonpil Im. CHARMM-GUI: A web-based graphical user interface for CHARMM. Journal of Computational Chemistry, 29(11):1859–1865, 2008.
[301] K. Vanommeslaeghe, E. Hatcher, C. Acharya, S. Kundu, S. Zhong, J. Shim, E. Darian, O. Guvench, P. Lopes, I. Vorobyov, and A. D. MacKerell. CHARMM General Force Field: A force field for drug-like molecules compatible with the CHARMM all-atom additive biological force fields. Journal of Computational Chemistry, 31(4):671–690, 2010.
[302] Nagakumar Bharatham, Peter J Slavish, William R Shadrick, Brandon M Young, and Anang A Shelat. The role of ZA channel water-mediated interactions in the design of bromodomain-selective BET inhibitors. Journal of Molecular Graphics and Modelling, 81:197–210, 2018.
[303] Chun-wa Chung, Hervé Coste, Julia H White, Olivier Mirguet, Jonathan Wilde, Romain L Gosmini, Chris Delves, Sylvie M Magny, Robert Woodward, Stephen A Hughes, et al. Discovery and characterization of small molecule inhibitors of the BET family bromodomains. Journal of Medicinal Chemistry, 54(11):3827–3838, 2011.
[304] Montserrat Pérez-Salvia and Manel Esteller. Bromodomain inhibitors and cancer therapy: From structures to applications. Epigenetics, 12(5):323–339, 2017.
[305] Monica M Mita and Alain C Mita. Bromodomain inhibitors a decade later: a promise unfulfilled?, 2020.
[306] Chun-wa Chung, Anthony W Dean, James M Woolven, and Paul Bamborough. Fragment-based discovery of bromodomain inhibitors part 1: Inhibitor binding modes and implications for lead discovery. Journal of Medicinal Chemistry, 55(2):576–586, 2012.
[307] Tom Dixon, Samuel D Lotz, and Alex Dickson. Predicting ligand binding affinity using on- and off-rates for the SAMPL6 SAMPLing challenge. Journal of Computer-Aided Molecular Design, 32(10):1001–1012, 2018.
[308] Robert Hall, Tom Dixon, and Alex Dickson. On calculating free energy differences using ensembles of transition paths. Frontiers in Molecular Biosciences, 7:106, 2020.
[309] Tom Dixon, Derek MacPherson, Barmak Mostofian, Taras Dauzhenka, Samuel Lotz, Dwight McGee, Sharon Shechter, Utsab R Shrestha, Rafal Wiewiora, Zachary A McDargh, et al. Atomic-resolution prediction of degrader-mediated ternary complex structures by combining molecular simulations with hydrogen deuterium exchange. bioRxiv, 2021.
[310] Tom Dixon, Arzu Uyar, Shelagh Ferguson-Miller, and Alex Dickson. Membrane-mediated ligand unbinding of the PK-11195 ligand from TSPO. Biophysical Journal, 120(1):158–167, 2021.
[311] Andreas Mardt, Luca Pasquali, Hao Wu, and Frank Noé. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9(1):1–11, 2018.
[312] Andreas Mardt, Luca Pasquali, Frank Noé, and Hao Wu. Deep learning Markov and Koopman models with physical constraints. In Mathematical and Scientific Machine Learning, pages 451–475. PMLR, 2020.
[313] Dirk G Cattrysse and Luk N Van Wassenhove. A survey of algorithms for the generalized assignment problem. European Journal of Operational Research, 60(3):260–272, 1992.