CONFORMATIONAL SAMPLING OF BINDING POCKET AND PREDICTING BINDING FREE ENERGIES By Nupur Bansal A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Chemistry⎯Doctor of Philosophy 2018 CONFORMATIONAL SAMPLING OF BINDING POCKET AND PREDICTING BINDING By ABSTRACT FREE ENERGIES Nupur Bansal In order to correctly predict protein-ligand binding poses and free energies, it is essential to accurately take into account receptor flexibility. However, incorporating it into even the smallest region, for example, the binding site of a protein is computationally demanding. Even if this task can be accomplished there is a risk of running into false positives due to the enormous conformational space involved. So, it is the interplay between sampling and scoring. Nonetheless, to better mimic experiments, structure-based drug design methods need to identify and then incorporate the most populated receptor states in any docking and scoring campaign. This work addresses the development of a novel tool that has been implemented to incorporate receptor flexibility into the ligand-binding domain of a protein. This method enumerates conformational states on an energy landscape in a computationally tractable manner. The algorithm treats molecules at an atom pair level and uses a distance-based coordinate system, where each selected distance is associated with a pair-potential value selected from a look-up table. With a collection of conformations in hand, we then perform on-the-fly local partition function estimations on each of the “seed structures” using the Movable Type (MT) method to estimate the associated free energy changes. This strategy helps to simultaneously generate relevant structures with the most favorable free energies. We initially applied our side chain flexibility method to a set of 159 protein-ligand systems and the docking score of Glide and our in-house Movable Type based scoring improved over the crystal docking score. Later we applied it to study the active site loop transitions seen in the Streptavidin-biotin system. Copyright by NUPUR BANSAL 2018 To my wonderful parents for their unconditional love and unwavering support v ACKNOWLEDGEMENTS I convey my sincere thanks to my Ph.D. advisor, Professor Kenneth M. Merz Jr. for giving me an opportunity to learn from him. He is a great mentor who taught me a good deal about research and life too. His guidance correctly molded my scientific outlook and helped me in finishing up my projects. Words are not sufficient to describe my gratitude for him. All I can say is without his support and encouragement; I wouldn’t have made this far. I would like to thank Dr. Zheng Zheng for collaborating with me on my research projects and guiding me. His mentorship and friendship helped me in learning things in a much easier way. I would also like to thanks my committee members Professor Katharine C. Hunt, Professor Robert I. Cukier, and Professor Heedeok Hong for giving me numerous suggestions on my projects, taking out their time to attend meetings, providing valuable feedback and comments. Special thank goes to former and current Merz group members whose affection and non- scientific discussions made my Ph.D. life really easy. Thank you Doris, Jun, Lin, and Anthony! I would also like to thank Dr. Nihan M. Ucisik, Dr. Mona Minkara, Dr. Pengfei Li, Dr. Arkajyoti Sengupta and Dr. David Cerutti for answering numerous questions (scientific and otherwise) and their kind friendship. I would extend my thanks to Dr. Dhruva Chakravorty for helping me in the initial phase of my research life. I would also like to thank HPCC facility at Michigan State University and HPC facility at the University of Florida for providing me with the computational resources. I would also like to thank the chemistry staff for answering my endless list of administrative questions and taking care of my needs very efficiently. vi I would like to extend special thanks to my Master's advisor, Dr. Neelanjana Sengupta for teaching me basic principals and ethics of doing research. I cannot thank her enough. Countless thanks to my parents for keeping their faith in me and loving me unconditionally. I thank my brother and sister for always being there for me. They are my support system and have always stood by me. I would like to thank their spouses and my two lovely nieces. I sincerely thank God for giving me such a nice family and this opportunity in life. A big shout out to Prashant for sticking with me through thick and thin. Last, but definitely not the least, I would like to thank my crazy friends for their support and encouragement. vii LIST OF TABLES .......................................................................................................................... x LIST OF FIGURES ...................................................................................................................... xii KEY TO ABBREVIATIONS ..................................................................................................... xvii CHAPTER 1 ................................................................................................................................... 1 Introduction ..................................................................................................................................... 1 1.1 Overview ........................................................................................................................... 2 1.2 Receptor Flexibility in Docking Methods ........................................................................ 5 1.2.1 Implicit flexibility .......................................................................................................... 5 1.2.2 Side chain and limited backbone flexibility .............................................................. 6 1.2.3. Large conformational changes and loop rearrangement ......................................... 13 1.3 Simulation based methods .............................................................................................. 19 1.3.1 Enhanced sampling .................................................................................................. 19 1.3.2. Receptor flexibility in pathway free energy methods ............................................. 20 1.4 Other algorithms ............................................................................................................. 20 1.5 Binding affinity predictions in docking methods ........................................................... 21 1.6 Prospective validation of docking methods with receptor flexibility ............................. 22 1.7 Discussion ....................................................................................................................... 26 REFERENCES ..................................................................................................................... 27 CHAPTER 2 ................................................................................................................................. 38 Methodology ................................................................................................................................. 38 2.1 Abstract ........................................................................................................................... 39 2.2 Introduction ..................................................................................................................... 39 2.3 Conformation generation ................................................................................................ 41 2.4 Side chain flexibility ....................................................................................................... 46 2.5 Loop flexibility ............................................................................................................... 46 2.6. Free energy estimation ................................................................................................... 48 2.7. Solvation free energy ..................................................................................................... 50 REFERENCES ..................................................................................................................... 53 CHAPTER 3 ................................................................................................................................. 55 Incorporation of Side Chain Flexibility into Protein Binding Pockets using MTflex ................ 55 3.1 Abstract ........................................................................................................................... 56 3.2 Introduction ..................................................................................................................... 56 3.3 Results and Discussion ................................................................................................... 59 3.3.1. Generation of the protein side chain conformations and the relative free energies using MTflex ....................................................................................................................... 59 3.3.2. Ligand docking and scoring .................................................................................... 66 3.3.3. Multi seed versus rigid receptor free energy calculations ...................................... 73 3.4 Conclusions ..................................................................................................................... 79 TABLE OF CONTENTS viii 3.5 Acknowledgement .......................................................................................................... 80 3.6 Supplementary information ............................................................................................ 81 3.6.1 RMSD in-house code ............................................................................................... 93 REFERENCES ..................................................................................................................... 95 CHAPTER 4 ............................................................................................................................... 105 The Role of the Active Site Flap in Streptavidin/Biotin Complex Formation ....................... 105 4.1 Abstract ......................................................................................................................... 106 4.2 Introduction ................................................................................................................... 107 4.3 Results and Discussion: Streptavidin-biotin ................................................................. 111 4.3.1 RMSD .................................................................................................................... 114 4.3.2 MT Free Energy Surface and MD Potential of Mean Force Studies ..................... 116 4.3.3 Thermodynamic free energy cycle ........................................................................ 128 4.4 Conclusions ................................................................................................................... 131 4.5 Acknowledgement ........................................................................................................ 134 4.6 Supporting information ................................................................................................. 134 4.6.1 MD-PMF methodology .......................................................................................... 147 4.6.2 MD simulation details ............................................................................................ 147 REFERENCES ................................................................................................................... 149 CHAPTER 5 ............................................................................................................................... 162 Conclusions and Future Outlook ............................................................................................ 162 REFERENCES ................................................................................................................... 167 ix Table 3.1. Statistical data for the ΔG’S evaluated using the crystal protein docked ligand complex, MTflex lowest free energy complex, MTflex with conformations selected up to 2 kcal/mol and 4 kcal/mol higher in binding affinity than the crystal structure and MTflex average (including the full range) relative to the experimental binding affinity. ....................................... 78 Table 3.2. Average CPU time (in seconds) to generate MTflex conformers for each Amino acid 81 Table 3.3. Minimum and Maximum RMSDs of MTflex conformers along with the number of conformers retained for all 159 systems of the validation dataset. ............................................... 81 Table 3.4. Minimum RMSDs of the docked ligand conformers in both the crystal and MTflex binding pockets of all 159 systems of the validation dataset. ....................................................... 85 Table 3.5. Correlation of Glide’s best score (top docking solution) for crystal protein, Glide’s best score for MTflex conformers across whole conformations and Glide’s score for the lowest free energy MTflex conformation in the ligand-unbound state to the experimental binding affinities. ....................................................................................................................................... 92 Table 4.1 Summary of free energy differences (in kcal/mol) between open and closed states of the loop in apo and holo states obtained with MT-free energy surface (MT-FES), MT- thermodynamic cycle (MT-TC), and MD using FF99SBILDN and FF14SB force fields. The differences are reported for the crystal open and loop position and minima obtained by the MT and MD methods. ........................................................................................................................ 127 Table 4.2. The total number of backbone loop conformations and the (backbone + sidechain) conformations generated by MTFlex-b in the presence of Biotin bound in the active site of crystal streptavidin monomer. ................................................................................................................ 134 Table 4.3 List of the residues involved in forming hydrogen bond and vander waal interactions with Biotin. The residues color-coded in red are part of the loop3-4. .......................................... 143 Table 4.4 The free energy difference between open and closed loop state with the free energy components in the form of the change in the protein torsion free energy, the change in the protein non-covalent interactions, the free energy change in the protein-ligand interactions, they change in the solvation free energy, the net free energy change in gas phase and in solution phase. All free energy values are in kcal/mol. ............................................................................................. 144 Table 4.5 Comparison of main chain phi-psi dihedrals of all the loop residues of the experimental closed structure with the closed minima structures obtained using MTFlex-b and from the FF14SB MD simulations. The differences are shown as positive or negative from a reference value of 0º. ..................................................................................................................................................... 145 LIST OF TABLES x Table 4.6 Comparison of main chain phi-psi dihedrals of all the loop residues of the experimental open structure with the open minima structures obtained using MTFlex-b and from the FF14SB MD simulations. The differences are shown as positive or negative from a reference value of 0º. ..................................................................................................................................................... 146 xi LIST OF FIGURES Figure 1.1. A model example to show side-chain flexibility in the active site residues of a protein (PDBID: 2XDX) with the ligand bound. The residues of the crystal active site are shown in ice blue color with the ligand highlighted in yellow background. The flexible side-chains (shown in pink, tan, purple, etc.) are superimposed on the crystal active site. ................................................ 7 Figure 1.2. The closed (left) and the open (right) state of the loop3-4 of Streptavidin monomer are highlighted. The effect of loop transitioning can be clearly seen on the size of the binding pocket and also on the preferred binding mode of Biotin. ....................................................................... 17 Figure 2.1. Pictorial representation of our free energy estimation procedure. The conformations generated by MTFlex-B are generated on an energy surface (shown in the heat map plot on the left). Each “seed structure” is fed to the Movable type method (shown on the right), which performs the local sampling around the initial conformation as explained with the red dots on the right heat map plot. ....................................................................................................................... 40 Figure 2.2. An example illustrating the parallel printing scheme used to construct alanine. The printing starts from three atoms with known locations. In each step a new atom is added to the system fixed by three distances constrained by the pair potentials from the lookup table. Bonds and angles are fixed to their minimum well depths due to their extremely narrow distribution ranges. Distances for the torsion angles and non-bonded interactions (not used in the alanine example) are selected based on a range of points around the energy minima (only minima are selected in this figure for simplicity). ........................................................................................... 45 Figure 2.3. Pictorial representation of our loop closure strategy. Loop fragments are generated from both N and C terminals individually. The individual loop fragments are then combined from both ends to form complete loops. ....................................................................................... 48 Figure 2.4. Illustration of the solvation free energy change during protein-ligand binding. The red arrow shows the desolvation and placement of the ligand into the protein-binding site. This process replaces water molecules bound in the protein-binding site that are displaced by the ligand volume. The green arrow shows the dissociation and solvation of these water molecules during the binding procedure. ....................................................................................................... 51 Figure 3.1. The top panel represents the minimum, maximum and RMSD range of MTflex conformers generated for all 159 systems. Dark blue curve denotes the lowest RMSD, Green curve denotes the highest RMSD conformer and black stripes represent the RMSD ranges of MTflex conformers for all 159 systems. The lower panel represents the percentage (%) of native contacts in MTflex conformers with respect to the crystal-binding pocket. Dark blue curve shows the maximum %, Green curve symbolizes the minimum % and black stripes represent the range of native contacts in MTflex conformers. ....................................................................................... 62 xii Figure 3.2. It shows the side chain of binding pocket region of PDB ID 3UEU (left) and 1R5Y (right). The crystal-binding pocket represented in pink licorice conformation is superimposed with the MTflex conformer with the maximum percentage of native contacts (tan) and MTflex conformer with minimum percentage of native contacts (cyan color). ........................................ 64 Figure 3.3. The upper panel represents the relative free energy (kcal/mol) between the lowest free energy MTflex conformer and the crystal structure (blue curve) and the highest free energy MTflex conformer and crystal structure (red curve) for all 159 systems in the validation dataset. The lower panel represents the corresponding RMSDs (Å) of the minimum RMSD MTflex conformer (purple) superimposed with the maximum RMSD MTflex conformer (green). ........... 66 Figure 3.4. Ligand RMSD (Å) for all 159 systems after docking the native ligand into the crystal receptor’s binding pocket (red) and into the MTflex binding pocket (green). Lowest RMSD Conformations are used for comparison in both the datasets. The blue, yellow and the black columns represent the PDBIDs: 2R23, 3PXF and 1U33, which will be discussed in text. .......... 68 Figure 3.5. It shows the crystal complex ligand (orange), lowest RMSD docked ligand pose in the crystal binding pocket (blue) and the lowest RMSD docked ligand pose in the MTflex binding pocket (green) superimposed onto each other for PDB IDs: 2R23 and 1U33 (from left to right). The background represents the binding pocket area of the respective protein. ............................ 69 Figure 3.6. The position of the MTflex lowest free energy conformation in the ligand bound state on the free energy scale obtained for the MTflex conformations in the ligand-unbound state is shown. The lowest free energy MTflex conformation in the unbound state is used as reference (set to 0 kcal/mol). ............................................................................................................................... 73 Figure 3.7. It shows the absolute binding affinities (kcal/mol) of and superimposed with the experimental binding affinity (kcal/mol). ± 3kcal/mol error window is extended for the experimental binding affinities. ......................................................................... 75 Figure 3.8. Free energy scale of PMTLD complex for all 159 systems in the validation dataset. Approximately ~70.4% of the systems have lower free energy than the crystal complex. .......... 77 Figure 3.9. Ligand Interaction diagrams of PDBIDs: 2R23 (top panel) and 1U33 (lower panel). Crystal complex is shown on the left side, crystal protein docked ligand (center) and the MTflex protein docked ligand in right. ...................................................................................................... 90 Figure 3.10. The superposition of MTflex generated binding pocket (Red) with the crystal binding pocket (coloured) along with the native ligand (orange) for PDBIDs 2R23 (left) and 1U33(right). For PDBID 2R23, Arg30 and Asn52 of chain D interact with the native ligand restoring the position of docked ligand in the MTflex binding pocket. For PDBID 1U33, the side chain of GLU233 in the MTflex binding pocket clashes with an atom of native ligand. ............................. 91 Figure 3.11. A plot showing the position of MTflex lowest free energy conformation in ligand bound state (marked in red square) on the free energy scale generated by MTflex conformers in xiii ligand-unbound state for four systems from our validation dataset. PDBIDs of the systems are: 10gs, 2fvd, 1qbr and 2zjw. ............................................................................................................ 92 Figure 3.12. Side chain rotation of Phenylalanine residue. .......................................................... 93 Figure 4.1. Cartoon representation of crystal Streptavidin monomer in holo (Biotin bound - in licorice) and apo states is superimposed on top of each other. The closed state of the loop3-4 (Residues 45 to 52) pertaining to the holo conformation is highlighted in mauve color while the open loop is shown in blue color. The distance between the Cα of Residue 49 of the closed and open state is approx. 12.67Å. ...................................................................................................... 113 Figure 4.2 The left box displays the closed (biotin-bound) and open states of streptavidin superimposed on top of each other with the loop3-4 highlighted in mauve for the closed and blue for the open conformation. The center image shows the lowest RMSD (Å) loop conformation (shown in green) generated by MTFlex-b superimposed on the holo crystal structure (closed loop), and the right image shows the lowest RMSD MTFlex-b conformation superimposed on the apo crystal conformation (open loop). ............................................................................................... 115 Figure 4.3 The Cα RMSD (Å) of the MTFlex-b loop conformations with respect to the closed state of loop3-4 in streptavidin. The x-axis represents the total number of generated backbone loop conformations. For the sake of structural comparison, crystal-closed (pink) and open (blue) states of the loop are superimposed on the generated MTFlex-b loop conformation (green). ................. 116 Figure 4.4 Relative free energy (kcal/mol) heat map for apo (left) and holo (right) streptavidin obtained using the Cα distance between ASN49/LEU109 and GLY48/ILE30 as reaction coordinate. ................................................................................................................................... 118 Figure 4.5 Relative free energy (kcal/mol) heat map of apo (left) and holo (right) streptavidin obtained using the Cα distance between Asn49/Leu109 and Gly48/Ile30 superimposed with the MD snapshots generated using FF14SB force fields. MD snapshots are highlighted in black color as scattered points in both the left and right figures with the rest of the PMF slightly faded in the background. ................................................................................................................................. 120 Figure 4.6 Relative free energy (kcal/mol) heat map for apo (left) and holo (right) streptavidin obtained by using umbrella sampling with the FF14SB force fields using the Cα distances between Asn49/Leu109 (y-axis) and Gly48/Ile30 (x-axis) as the two reaction coordinates. ..... 123 Figure 4.7 Computed changes in the relative free energy on going from the closed to the open holo state. The closed loop is the global minima in the holo state and the open loop is ~11 kcal/mol higher in free energy. The observed transition state barrier is ~17.5 kcal/mol on going from the closed to open loop. ...................................................................................................... 124 Figure 4.8. a) Detailed representation of thermodynamic free energy cycle for binding in solution phase for the streptavidin-biotin system. b) The net free energy change upon biotin binding and loop closure in the solution phase. .............................................................................................. 131 xiv Figure 4.9 Cartoon representation of Streptavidin crystal monomer (Chain A) with the loop3-4 in the closed conformation. The reaction coordinates are distances between Cα atoms of Resid30 and 48, and between Resid109 and 49. ....................................................................................... 135 Figure 4.10. Superimposed conformations of open minima observed in MT-apo state (shown in tan color) and MT-holo state (highlighted in green color). The observed Cα RMSD is ~1.07 Å. ..................................................................................................................................................... 136 Figure 4.11 Relative free energy (kcal/mol) heat map of apo (left) and holo (right) streptavidin obtained using the Cα distance between Asn49/Leu109 and Gly48/Ile30 superimposed with the MD snapshots generated using FF99SBILDN force fields. MD snapshots are highlighted in black color as scattered points in both the left and right figures with the rest of the PMF slightly faded in the background. ............................................................................................................. 137 Figure 4.12 Superimposed MD snapshots generated using FF99SBILDN (shown as black scattered points) and FF14SB (pink scattered points) force fields obtained using the Cα distance between Asn49/Leu109 and Gly48/Ile30 ................................................................................... 138 Figure 4.13 The crystal-closed state (shown in pink) superimposed with the apo-closed minima obtained by MD-FF14SB force field (highlighted in aqua color). The left panel shows the alignment based on only the loop region while the right panel shows the alignment based on the entire protein monomer. .............................................................................................................. 139 Figure 4.14 The crystal open state (shown in brown) superimposed with the holo-open minima obtained by MD-FF14SB force field (highlighted in green color). The upper panel shows the alignment based on only the loop region while the lower panel shows the alignment based on the entire protein monomer. .............................................................................................................. 140 Figure 4.15 The superimposed minima conformations predicted by MT (green) and MD-FF14SB (mauve) methods for apo open minima, apo closed minima and holo open minima (from left to right). ........................................................................................................................................... 141 Figure 4.16 Relative free energy (kcal/mol) heat map for apo (left) and holo (right) streptavidin obtained by using umbrella sampling with FF99SBILDN force fields using the Cα distances between ASN49/LEU109 and GLY48/ILE30 as the two reaction coordinates. ........................ 141 Figure 4.17 The crystal-closed state (shown in blue) superimposed with the apo-closed minima obtained by MD-FF99SBILDN force field (highlighted in yellow color). The left panel shows the alignment based on only the loop region while the right panel shows the alignment based on the entire protein monomer. ........................................................................................................ 142 Figure 4.18 The crystal-open state (shown in blue) superimposed with the holo-open minima obtained by MD-FF99SBILDN force field (highlighted in yellow and cyan color). The left panel shows the alignment based on only the loop region while the right panel shows the alignment based on the entire protein monomer. ......................................................................................... 143 xv Figure 4.19 The active site of streptavidin with Biotin shown for the closed loop (left) and open loop (right) states superimposed with the structure of the active site at the transition state (shown in black in both plots). Residues forming hydrogen bonds are highlighted in orange, residues forming van der Waals interactions are shown in green, Trp120 from sub-unit D is shown in purple and Arg84 in red. ............................................................................................................. 144 xvi KEY TO ABBREVIATIONS Assisted Model Building with Energy Refinement Computer Aided Drug Design Force Field Free Energy Perturbation Free Energy Surface Genetic Algorithm AMBER CADD FF FEP FES GA GAFF General Amber Force Field LJ MC MCI MD MM MT MTFlex MTFlex-B MUE NMR PBC PDB PME PMF Lennard-Jones Monte Carlo Monte Carlo Integration Molecular Dynamics Molecular Mechanics Movable Type Movable Type with Flexible side chains Movable Type with Flexible backbone and side chains Mean Unsigned Error Nuclear Magnetic Resonance Periodic Boundary Conditions Protein Data Bank Particle Mesh Ewald Potential of Mean Force xvii QM RESP RMSD TC TI TS US VDW WHAM Quantum Mechanics Restrained Electrostatic Potential Root Mean Squared Deviation Thermodynamic Cycle Thermodynamic Integration Transition State Umbrella Sampling Van der Waals Weighted Histogram Analysis Method xviii CHAPTER 1 Introduction †Adapted from Bansal, N.; Merz, K. M., Introducing receptor flexibility in protein-ligand binding associations. Manuscript in preparation. 1 This dissertation aims to discuss the novel in-house tools developed to incorporate receptor flexibility in the active site of the proteins, which we call MTFlex for including side chain flexibility and MTFlex-B for including both side-chains and backbone flexibility. The first chapter provides a brief introduction of the current methods used to introduce flexibility in the binding site of the protein in the protein-ligand binding mechanism. It provides a retrospective and prospective view of the available tools and their performances in the blind challenges hosted by the community. The second chapter describes the development of both MTFlex (side chain flexibility) and MTFlex-B (side-chains and backbone flexibility) methods and the procedure to calculate the free energies to predict the binding affinities and conformational free energies. The third chapter entails the application of MTFlex on the active site of 159 protein-ligand complexes. The fourth chapter presents the application of MTFlex-B on the eight residue long loop of Streptavidin from the streptavidin-biotin complex. Finally, the fifth chapter summarizes the findings of our work and discusses the future prospects. 1.1 Overview In order to correctly predict protein-ligand binding poses and free energies, it is essential to accurately take into account receptor flexibility. However, incorporating it into even the smallest region, for example, the binding site of a protein, is computationally demanding. Even if this task can be accomplished there is a risk of running into false positives due to the enormous conformational space involved. This thesis addresses computational approaches that have been implemented to incorporate receptor flexibility into the ligand-binding domain of a protein. The introduction is not limited to receptor flexibility methods used in molecular docking studies. Prospective validation of several docking and other comprehensive tools has also been discussed 2 largely within the scope of blind challenges conducted by the D3R and CSAR organizations. Based on our analysis to date, we conclude that the community is moving forward by fine-tuning several computational approaches but that the statistical uncertainties in the sampling and scoring accuracy still need to be improved. The plasticity of proteins is closely associated with several important molecular recognition processes including protein-protein and protein-ligand binding events, enzyme catalysis, allosteric control, and bio-molecular assembly[1]. Conformational changes linked to ligand binding have been characterized by the induced fit and conformational selection models[2]. The induced fit model states that external perturbations like ligand binding induces conformational changes in the receptor forcing it to the holo conformation while the conformational selection model proposes that the ligand selectively picks one of the conformations out of a pool of a pre-existing conformational ensemble[1]. Both of the theories are plausible for different systems and have been supported experimentally[1]. Several experiments suggest that both models play a significant role in the protein-ligand binding process[3, 4]. These peculiarities of the mechanistic details of protein-ligand binding, pose an extremely difficult yet interesting challenge for the computer-aided drug discovery (CADD) community. The problem is further complicated by the enormous conformational space available to proteins, which increases the chances of encountering false positives. Using only available crystal structures offers a simple solution to the problem but at the cost of adding inaccuracies in binding affinity prediction associated with the lack of inclusion of receptor flexibility[5]. Indeed, considering only the crystal structure is fundamentally incorrect as proteins are inherently flexible and undergo a variety of conformational changes ranging from vibrational fluctuations to 3 large-scale domain motions upon ligand binding. Some motions are so large in both the apo and holo states that the typical time scales used in in silico studies are too short to capture them. The importance of receptor flexibility in the CADD field has been emphasized multiple times[1, 6- 11]. Accurate prediction of binding free energies and binding mode prediction rely heavily on the accurate accounting of receptor flexibility[11]. To this end, it is crucial to incorporate both ligand-induced and inherent receptor flexibility when estimating protein-ligand binding affinities[2, 11]. This chapter presents an overview of the computational methods available to explore receptor flexibility as a result of protein-ligand binding in the field of structure-based drug design (SBDD). There are several reviews that focus only on receptor flexibility in terms of molecular docking[12-18], but this introduction will cover the general methods used to model receptor flexibility. Our emphasis will be on methods exploring receptor flexibility in the ligand-binding region of proteins. As proteins undergo a variety of motions, different levels of flexibility modeling are required for different protein target systems. For instance, His64 in Human carbonic anhydrase II is the only residue found in alternate conformations on binding with three very similar inhibitors thereby accounting for variations in binding affinity[19]. While kinases exhibit flexibility in terms of loop rearrangements and large lobe motions delimiting the active site region[13]. To include flexibility in drug design applications simulation-based algorithms and docking based methods have been reported. However, these two approaches are not necessarily mutually exclusive. Multiple docking methods use MD or MC simulations or their variants to generate an ensemble of protein conformations and then perform ensemble based docking. Similarly, some simulation-based algorithms perform docking as the first step to obtain the optimum protein-ligand bound conformation and then generate a conformational ensemble. 4 Although it is quite difficult to segregate the various approaches, we have tried to categorize methods into docking based, simulation-based or other. Methods based on docking have been categorized based on the level of receptor flexibility employed, which are then further parsed into sub-sections. A section is devoted to the assessment of prospective validation of emerging computational methods utilizing receptor flexibility. 1.2 Receptor Flexibility in Docking Methods 1.2.1 Implicit flexibility Several docking methods incorporate receptor flexibility without including it explicitly. One way of introducing implicit receptor flexibility in docking algorithms is by enlarging the binding pocket. The simplest way to accomplish this is by lowering the repulsive part of the potential energy function using a soft docking approach[20]. This approach qualitatively addresses very small conformational changes (up to 1Å) in the binding pocket region of the protein. Fundamentally this method works by reducing the penalty to binding afforded by the repulsive term in the Lennard Jones[20]. For example, Ferari et al applied this approach in their virtual screening study of T4 lysozyme and aldol reductase and found it afforded better results[21]. The advantage of this strategy is that it is computationally efficient but it can only include very subtle conformational changes limited to one or very few side-chains in the proteins. Several variants of the soft docking approach have been employed recently where the “softened” van der Waals (vdw) term is combined with structural refinement. Soft docking has 5 also been coupled with Monte Carlo minimization to include both receptor and ligand flexibility[22]. Mizutani et al enlarged the protein cavity by offsetting the vdw radii and then optimized the structure in the subsequent step[23]. There are more advanced hybrid approaches involving soft docking as the first step in a docking campaign and these are discussed further below[24, 25]. Another simple way to enlarge the binding pocket area is by mutating the amino acid side chains to alanine (so-called “Alanine scanning”)[26] which has the effect of enlarging the binding pocket. In this way the energy landscape becomes “smoother” resulting in an enhancement of the conformational sampling of the ligand. This approach is quite fast but has the drawback that it may result in false positives requiring further refinement[24]. 1.2.2 Side chain and limited backbone flexibility The methods listed in this section handle larger conformational changes but are still localized, as flexibility is localized to a few select residues in the binding site of the protein. To obtain a sense of the scope of the problem, consider a typical ligand-binding site for a drug like molecule, which is generally delineated by twelve to twenty amino acid side chains thereby approximating to dozens of rotatable torsions. In this class of methods the side chains undergo rearrangement with only limited or no backbone flexibility. A number of methods have been devised to explore this class of partial flexibility using both induced-fit and conformational selection approaches. An example is illustrated in Figure 1.1 to illustrate the level of flexibility introduced by incorporating side chain flexibility in active site residues. The active site displayed belongs to PDBID: 2XDX with the bound ligand highlighted in yellow color. The flexible side- chains (shown in pink, tan, purple, etc.) are superimposed on the crystal active site. Incorporating side chain flexibility is less computationally expensive and can be treated independently while backbone flexibility typically cannot[27]. For some target systems side chain flexibility is 6 sufficient to represent conformational changes induced by ligand binding, but for some not. To address this methods that incorporate full side-chain with limited backbone flexibility have been introduced. To reduce the computational expense of this model it is typically applied only to a few targeted residues in the binding pocket to explore their available conformational space. This category of method can be further binned into discrete (rotamer library) and continuous approaches to incorporate partial receptor flexibility. Figure 1.1 A model example to show side-chain flexibility in the active site residues of a protein (PDBID: 2XDX) with the ligand bound. The residues of the crystal active site are shown in ice blue color with the ligand highlighted in yellow background. The flexible side-chains (shown in pink, tan, purple, etc.) are superimposed on the crystal active site. 7 1.2.2.1 Rotamer Library exploration To explore side chain rotatable bonds, rotamer libraries are generated that contain collections of experimentally preferred rotameric states (i.e. low energy conformations) of amino-acid side chains. Leach developed the side chain specific rotamer library in order to explore the conformational space of the side-chains of the protein receptor. Given a specific orientation of ligand, the global energy minimum conformation of the side chain was found within the provided energy cutoff. He observed that most of the side-chain conformations had no concerted motions and were independent from each other[28, 29]. The large solution space offered by the flexible side-chains was explored using A* and dead-end elimination. Hartman et al. also explored discrete rotamer libraries for generating side-chains conformations in the binding site region and used a knowledge based scoring function called ROTA for scoring the protein-ligand interactions. They incorporated multiple side chain conformations of protein and investigated its effect on the docking efficiency by validating it against the screening DUD[30] database[31]. ROSETTADOCK uses a backbone dependent rotamer library boosted with additional side-chain dihedral angles χ1 and χ2 rotamers. In their iterative procedure, the side- chain conformations are optimized via first substituting, at each position, the allowable rotamers and then using a quasi-Newton minimization for refinement of torsion angles. Flexibility is incorporated in all the amino acids up to the second shell of ligand-binding region[32]. Schumann et al. employed rotamer libraries to generate the receptor side chain conformations and then used the cheapest-path algorithm to achieve a fast and thorough optimization of the receptor structures[33]. MedusaDock uses STROLL (stochastic rotamer library of ligands) and simultaneously models flexibility of both the receptor side-chains and the 8 ligand molecule[34]. The Fleksy algorithm generates side-chain conformers at selected binding site residues using a backbone-dependent rotamer library[35]. It employs “interaction sampling” to explore the ambiguous orientations of the Asn, Gln and His side chains, different tautomers of His, and the rotational freedom of the thiol group in Cys and the hydroxyl groups in Ser, Thr, Tyr. This interaction sampling approach alleviates the problem of managing protonation states and the handling of different side chain variations observed by the FlexE approach. GalaxyDock pre-defines the residues within the binding site region and accounts for their flexibility by using global optimization. The side-chain flexibility version1 of GalaxyDock performed better than RosettaLigand and SCARE when flexible residues were pre-specified[36]. GalaxyDock2 uses the same energy function for scoring as GalaxyDock but constructs side-chains more quickly using Voronoi diagrams of protein atoms[37]. Both the versions of GalaxyDock estimated the RMSD of the predicted ligand within 2 Å as compared to the experimental structures for ~80- 87% of the cases. This performance was comparable to several other leading docking algorithms such as AutoDock (versions 3 and 4), SCARE, RosettaLigand and FLIPDock[17, 36, 37]. 1.2.2.2 Continuous approaches Rotamer libraries are simple to use and have low computational costs but their main drawback is that the method is discrete. The sampling is restricted and biased by the contents of the rotamer library itself. To address this issue methods have been developed which explore selected rotatable bonds beyond simply using discrete rotamer libraries. For example Specitope and Slide are docking tools developed in the Kuhn lab[38-40], which carry out very fast and low- resolution docking using ligands from a large database. They use soft-docking approach to dock the ligand into the protein receptors. Both the receptor and ligand flexibility is incorporated 9 during the post-docking optimization step by rotating single bonds of either the ligand and the protein side-chain in order to resolve clashes. Rotations are incorporated using Mean-field theory such that there is an improvement in shape complementarity and the collisions are resolved. ICM (internal coordinate modeling) performs global energy optimization of active side chains and small molecules together using a “double energy” Monte Carlo minimization procedure. It is called double energy scheme as the energies calculated during local minimization and after minimization are both used in the Metropolis selection criteria. In this procedure, torsion angles of the ligand and protein side-chain within 7Å of the binding site are randomly changed. The receptor conformations are generated using a biased probability Monte Carlo procedure (BPMC)[41]. Using a continuous probability distribution function for a given conformational subspace (e.g., side-chain torsional angles or phi-psi), the method chooses a new random position completely unbiased of the previous position which is then locally minimized in the torsional angle space[42]. The mining minima method offers no limitation on the number of rotatable bonds to be considered as continuous degree of freedom[43]. It is computationally expensive because it calculates the configurational integral of the protein-ligand complex. The user defines the binding site region (also called “the live set”), which is considered flexible while the other part of receptor (“real set”) is held rigid and untreated. Initially only side-chains were made flexible, but in subsequent versions limited backbone flexibility was also included[44]. The energy optimization is carried out simultaneously for the flexible side chains and the ligand. Conformation search algorithm is used to search the local minima for different binding modes. The binding affinity predictions for 24 HIV-1 protease inhibitors and 20 inhibitors of phosphodiesterase 10a were successfully studied with this method[44]. Within the family of 10 continuous approaches, methods can be further categorized as on-the-fly (during the docking process) and ensemble based (pre-generation of conformational ensemble) methods. 1.2.2.2.1. On-the-fly methods SCARE (SCan Alanines and Refine) is an induced fit docking protocol in which the algorithm scans the neighboring side-chains pairs and mutates them to Alanine to fit the ligand[26]. Afterwards, the pocket residues are completely optimized for side-chains with limited backbone flexibility. The algorithm does not rely on the knowledge of location of the binding pocket, geometry of binding ligand or the extent of flexible receptor regions. Induced fit docking (IFD) uses a softened potential to allow for modest steric clashes in the first round of Glide docking[45, 46] with a rigid receptor and then PRIME is used to refinement the protein[24]. FlipDock searches the conformational space by using a divide and conquer approach combined with a very powerful genetic algorithm (GA). The conformational space of both the ligands and receptors is represented by using Flexibility Tree (FT) database[47]. Flexibility Tree (FT) is a tree-like computational data structure, which allows for the hierarchical and multi- resolution encoding of sub-spaces of a protein’s conformational space[48]. The FT structure combines and nests a wide variety of motions such as shear, screw, hinge, twist, rotameric side chains, normal mode and essential dynamics in a straightforward manner. POPSS (Pose Prediction using shape similarity) method, as the name suggests uses the shape similarity approach to place the ligand optimally in the receptor structure. After the ligand is placed, side-chain repacking is performed followed by Monte Carlo minimizations to refine the docked complex[49]. 11 1.2.2.2.2. Ensemble Based FlexE docking method relies on the united protein description obtained from the superimposed structures of the ensemble[50]. It implies that upon superposition, similar parts of the structures are merged together while dissimilar parts stand out and can be treated as separate alternatives. The concept is quite similar to the rotamer library approach. All atoms are selected within 6.5 Å of any part of the ligand within the binding pocket to consider flexibility. The initial structures were taken from PDBs but in principle can be taken from MD simulation, rotamer libraries or homology modeling. The algorithm is fast as multiple receptor conformations from the ensemble are treated simultaneously and sometimes even new structures are formed by the combination of several structures. The flexibility is incorporated by selection of a combination of partial structures (from the provided input structures) suited best for a particular ligand based on the scoring function. It considers full side chain flexibility and even loop flexibility to some extent. Although it is an ensemble-based approach, the protein flexibility is incorporated during ligand placement stage and not during ligand optimization. A problem of handling different side chain variations and protonation states was observed in FlexE algorithm[51]. It is able to perform very well for the side-chain and some small variations in loops. However, any large movements are not predicted accurately. MTFlex algorithm introduced by our group also generates an ensemble of side chain conformations within the 6Å cutoff of the bound ligands on an energy landscape using a pair potential look up table. The best ligand mode predictions for a set of 159 systems were ~1.31 Å deviated from the native pose as opposed to the RMSD of 2.3 Å for the best ligand modes docked in the native structure. Our method calculates free energies by utilizing a Monte Carlo integration (MCI) scheme for simulation of local partition function. It 12 was observed that apart from pose prediction, scoring also improved for ligand docked in flexible side-chain conformers as compared to the ligands docked in the crystal structure[52]. ALiBERO (Automatic Ligand-guided Backbone Ensemble Receptor Optimization) algorithm incorporates flexibility by generating either the best or a “team” of complimentary pockets best suited for the particular ligand. It takes single or multiple receptor structures as input along with a ligand training dataset. Receptor ensembles are created iteratively followed by the virtual screening and MC refinement until they converge to the fitness function. The complimentary pockets (either best or the team) are then selected based on the ligands provided[53]. AutoDock is a widely used docking algorithm that allows specifications of flexible side-chains. It was successfully applied to study the conformational variability in the HIV protease’s binding site by several side chains reorganization and a water molecule. However, the problem persists as it has a hardcoded limit of including flexibility in ~32 rotatable bonds which is easily surpassed when the side chains in the receptor are made flexible[54]. AutoDockFR (AutoDock for flexible Receptors) supersedes FlipDock by using a very powerful and efficient genetic algorithm along with the advanced motion descriptors of FT for simulating partial side chain flexibility. The novel genetic algorithm can allow explicit side chain flexibility for upto ~14 residues in the binding site region as well as very limited backbone flexibility[55]. Higher docking success rates were achieved by using AutoDockFR and implementing receptor flexibility in the binding site region of the experimentally determined apo receptor conformations. 1.2.3. Large conformational changes and loop rearrangement MD or MC simulations provide a very detailed representation of molecular flexibility. However, large domain motions occur beyond the time scales of current conventional MD 13 simulations. The flexibility of DFG loops in Kinases was implemented using mean field approach. DFG-in loops of various kinases were converted to type-II bound state (or DFG-out state) using a general deterministic modeling protocol[56]. DOLPHIN models were prepared from DFG-in known structures by removal of all the atoms of Phenylalanine residues on the DFG loop and the consecutive 4 residues following Phe. In the second step, the side-chain atoms of removed Phe and the backbone atoms of the other removed residues except for Gly were used to generate pharmacophore like field. These models represent the average physico-chemical profile of the loop. The density map of the DOLPHIN model was combined with the standard ICM receptor maps for docking. These models performed exceptionally well in both ligand docking and in-silico activity profiling when validated against a kinase-ligand benchmark database. A large set of methods has been developed using ensemble approach. 1.2.3.1. Ensemble approach Several methods attempt to use multiple low energy loop conformations in docking. These methods usually dwell on the conformational selection model and aim to mimic conformational changes that occur in the protein region upon ligand binding[57]. The conformational selection model assumes the pre-existence of all the protein conformations in solution and states that upon ligand binding, the holo-like conformation is stabilized. The encoding hypothesis is an extension of the conformational selection model, which states that the apo form of protein encodes all the necessary fluctuations critical for the ligand bound holo conformation[2, 58-60]. Such methods rely on collective degrees of freedom where most atoms 14 move relative to each other. Collective degree of freedom captures only the dominant motions of protein derived from the native degrees of freedom[15]. 1.2.3.1.1. Collective degree of freedom Collective degrees of freedom enable modeling of huge protein motions as they capture conformational variations in terms of changes in all or part of native degrees of freedom. These methods assume that perturbations in the apo form of the protein such as ligand binding can lead to the holo form and all these essential protein fluctuations/perturbations are encoded in the collective degrees of freedom. Ikeguchi and co-workers used linear response theory where they modeled ligand binding as an external perturbation to the apo form of the protein. They applied their method to the ferric binding protein, F1-ATPase and citrate synthetase and observed that large-scale changes between the apo and holo forms of proteins (up to 15 Å Cα) was predicted on the basis of 5 collective modes[61]. However, some portions of the predicted holo structures differed from the experimental structure by 2-5 Å. Normal mode analysis (NMA) was used by Cavassato et al. on the apo form of cAMP dependent kinase to model the loop flexibility coupled to ligand binding[62]. Normal modes are Eigen vectors of the hessian matrix and have been shown to represent most motions of the protein at low frequency (< 30 cm-1)[62-66]. To increase the computational efficiency, the collective variables of NMA derived from an elastic network model (ENM) have also been used[67-69]. In ENM, protein structures are modeled as elastic networks in which amino acids are represented by Cα atoms and the Cα pairs are connected by uniform springs within a pre-defined distance cutoff[70]. The connectivity matrix of inter- residue contacts is used to derive vibrational vectors and frequencies, which is used to assess the protein flexibility. Dietzan et al. investigated the applicability of Cα-ENMs normal modes for the 15 binding-pocket region in protein-small molecule docking[71]. Using low-lying normal modes from apo structures, they reproduced the Cα trace of holo partner proteins for the 433 apo/holo pairs contained in the Astex[72] data sets. They then assessed the docking capability based on the number of modes used to represent the holo structure. However, they found that even for cognate docking, the use of NMA was limited and it was difficult to find a generalized rule to define the number of necessary modes. An alternative approach to NMA is essential dynamics (ED). In ED, atomic coordinates are used to build the covariance matrix and a principal component analysis of this matrix yields the eigenvalues (squared magnitude) and eigenvectors (principal components), where the top principal component corresponds to the most significant conformational change. In contrast to NMA, PCA does not rest on the assumption of a harmonic potential[73]. Cukier et al. studied the adenylate kinase by applying PCA analysis to its apo structure[59]. As the substrates AMP and Mg2+-ATP bind, the LID (ligand) and AMP binding domains of adenylate kinase undergo huge conformational change leading to the closed form of the binding pocket. With the help of PCA analysis, they discerned the presence of 12 modes that encoded the conformational change of the LID domain upon ligand binding. However, the modes did not encode the motion associated with the AMP binding domain. Their study reinforced the encoding hypothesis as only the modes associated with the ligand-binding domain were encoded in the apo form. 16 Figure 1.2 The closed (left) and the open (right) state of the loop3-4 of Streptavidin monomer are highlighted. The effect of loop transitioning can be clearly seen on the size of the binding pocket and also on the preferred binding mode of Biotin. 1.2.3.1.2. Native degree of freedom The loop3-4 in the streptavidin-biotin system was recently studied using MTFlex-B by Bansal et al.[74] The algorithm generated an ensemble of loop conformations where both apo and holo forms of the eight-residue loop were generated with the best loop conformation having a RMSD of 1.6 Å with respect to available crystal structures of the two forms. An example to illustrate the closed and open state of the Streptavidin loop3-4 is shown in Figure 1.2. The opening and closing of the loop is highly affecting the size of binding site as shown in the highlighted region in Figure 1.2. It can also be seen that the preferred binding mode of Biotin is quite different in the open and closed states of loop reinforcing the importance of receptor 17 flexibility in docking algorithms. Wong et al also studied “latch” loops (i.e. loops covering the ligand in the holo state and switches to an open conformation in the unbound state)[75]. Their method used a combination of replica exchange molecular dynamics (REMD) simulations with harmonic restraints to sample the loop conformations in an unbound state, followed by clustering to identify conformational substates and, finally performed docking against a representative structure from each cluster. For loops spanning up to 15 residues, it yielded a RMSD of ~2 Å with respect to the known holo structures. Although harmonic restraints were applied to limit the sampling of the loop, the method cannot a priori determine the number of degrees of freedom required to capture the possible rearrangements. Bansal’s and Wong’s study both support and provide a computational validation of the conformational selection model. Flick et al implemented a multistage loop reconstruction algorithm where they predefined segments of backbone and changed all the dihedral angles of the backbone simultaneously. A global loop closure search was applied to close the loop and the resulting conformation was then locally minimized using a modified steepest descent algorithm. If after minimization, the loop was closed successfully with the gap < 0.001 nm then the resulting conformation was preserved in the acceptance criteria. This method was applied to study the apo-holo transitions in the ATP binding loop of ERK2 kinase and adenosine dependent protein kinase (PKA)[76]. Recently a new algorithm called ED/MD (essential dynamics/molecular dynamics) was developed for generating perturbed ensembles representing ligand induced binding site flexibility[77]. It is a hybrid method that uses collective degrees of freedom to enhance the sampling. The first thirty modes obtained by a PCA analysis of MD trajectories were perturbed in the ligand-binding domain. Perturbations were accounted by Lennard-Jones term and then ED/MD simulations[78] 18 were carried out. These ensembles led to superior docking performance when compared to single structure or conventionally derived MD ensembles. 1.3 Simulation based methods Apart from the docking based methods, another way of introducing receptor flexibility is through the use of simulation-based methods. These methods are not necessarily independent of the docking based approach as several docking based methods use simulations to collect an ensemble of conformations before performing docking. Some of the methodologies using simulation based conformation generation methods were already covered in the ensemble based docking section. In this section, we will briefly go over the enhanced sampling approaches and the novel tools employed in pathway-based free energy methods to incorporate receptor flexibility. Several extant review articles discuss simulation based methods[6, 10, 11, 58, 79, 80]. 1.3.1 Enhanced sampling Short MD simulations are known to be susceptible to get trapped in the local minima on a free energy landscape[10, 11]. Recent advances in this area to overcome simulation length shortcomings come by implementation of GPU accelerated computing[81], specialized supercomputers such as Anton[82, 83], or using the cloud[84]. Another solution is to introduce bias into the potential or by using enhanced sampling methods such as metadynamics, umbrella sampling, temperature accelerated replica exchange, Morkov state models, and Hamiltonian based accelerated MD[10, 80] 19 1.3.2. Receptor flexibility in pathway free energy methods Pathway free energy methods (alchemical and PMF based) in combination with MD or MC simulations can accurately estimate free energies. Unfortunately, the procedure requires extensive conformational sampling and is computationally demanding. Recently, enhanced sampling methods have been added to standard pathway based free energy methods to facilely subsume conformational changes associated with ligand binding[85-89]. Mobley et al used the “confine-and-release” framework in conjunction with umbrella sampling methods to study the binding site of T4-lysozyme involving a conformational change[86]. aMD was coupled with thermodynamic integration simulations to improve free energy convergence[87]. Independent trajectory thermodynamic integration (IT-TI) was used to study the flexible loop regions of the H5N1 avian influenza virus neuraminidase interacting with peramivir[85]. aMD was applied on selective dihedrals of neuraminidase and free energy calculations were performed to obtain the converged free energy of binding[88]. FEP/REST was used to study the apolar cavity of T4 lysozyme L99A and it was shown that binding free energies are sensitive to protein reorganization in the binding pocket region[89]. 1.4 Other algorithms Apart from incorporating conformational sampling in docking and simulation based methods, there are loop prediction algorithms that are based on the conformational selection model. Danielson et al. used CorLps program to generate holo like protein conformations from the apo form of the protein in the presence of the ligand[90]. Loop prediction was performed on the following three systems: GART (a six residue loop segment), CYP119 (two nine residue loop regions), and enolase (an eleven residue loop region)[90]. The energetically favorable loop 20 conformational ensemble was generated, which was then filtered by clustering and further refined and the remaining top 100 poses were ranked based on the DFIRE scoring function. 1.5 Binding affinity predictions in docking methods Once the conformational ensemble containing the most relevant structure is generated, the focus is shifted to how one can predict reliable binding affinities. Although great success has been achieved in retrospective pose prediction and virtual screening strategies, binding affinity predictions still pose a daunting challenge[14, 79, 91, 92]. A wide range of scoring function strategies have implemented protein flexibility and achieved some level of improvement in either pose prediction or binding affinity or both[24, 32, 36, 93-99]. Recently, Zheng et al. developed a novel tool for free energy calculations called the movable type (MT) sampling method. It directly calculates the Helmholtz free energy by estimating the local partition function using the principles of statistical mechanics. Bansal et al. used this approach in their MTFlex strategy for side-chain flexibility, which allowed for a better prediction of the binding free energy (over 159 protein-ligand systems an RMSE of ~2.72 kcal/mol versus 3.4 kcal/mol using only the crystal structure)[52]. However, one common thread in all of the strategies discussed so far is that all the evaluations are performed retrospectively. Our tools are clearly honed and perfected to perform better when we know the answer a priori, but what about the cases where we don’t. The real challenge is to evaluate our methods against the prospective targets and this is discussed below. 21 1.6 Prospective validation of docking methods with receptor flexibility Retrospective assessments are important but nonetheless suffer from inherent bias. The knowledge of the correct answer is embedded in the construction of the problem, which often leads to skewing of the results. Hence, there is no substitute for making truly prospective predictions[100]. Prospective validations are the gold standard to accurately judge the predictive ability of methods, as they leave no scope for fitting of parameters and other human interventions or biases. In this regard, several blind community challenges initiated by CSAR and further continued by the D3R organizations have been run since 2009 for the prospective assessment of ligand pose prediction and ranking protein-ligand binding affinities[101-105]. Grand challenges (GC) hosted by the D3R organization offered the community a range of prospective validations of docking pose prediction[103, 104]. These challenges included receptor targets with flexible binding pockets. In the 2015 GC1 challenge, HSP90 and MAP4K4 were the receptor targets chosen to perform docking and binding affinity prediction. All submissions that afforded good binding mode predictions incorporated receptor flexibility in some manner in their modeling efforts. The Camacho lab used Smina (default parameters) and the AutoDock Vina scoring function to dock into HSP90 and MAP4K4[106]. They used multiple receptor/ligand co-crystal structures as binding templates for performing minimization and docking from the known complex structures and obtained ligand RMSDs of 0.32 Å and 1.6 Å for the target systems. POPSS method, which incorporates 3D shape similarity and included side chain repacking and protein minimization, also performed well in the D3R prospective challenge with median RMSD for their top prediction of 0.73 Å and for 2.87Å[107] for HSP90 and MAP4K4, respectively. DockBench tool also performed reasonably well for both the targets. It also used existing crystal structures to 22 obtain training set to perform ensemble based docking[108]. GRIM graph matching method also predicted the best binding mode with reasonable accuracy by performing docking using several crystal protein structures with and without conserved water molecules[109]. The D3R evaluators established that overall the best pose-prediction methods were less associated with a single docking algorithm, rather with a “similarity docking” approach[104]. Its worth noting that several pose predictions for HSP90 were within 2 Å RMSD which was not the case for MAP4K4. HSP90 is a structurally well-characterized system, while MAP4K4 is less so allowing participants the ability to pre-evaluate the capability of their chosen approach. The MAP4K4 compound collection was also more challenging with ~1/3 coming from a congeneric series while for HSP90 the compounds were all from a congeneric series. MAP4K4 was also simply more challenging due to the flexibility of the classic P-loop found in kinases. In Grand Challenge2 as well, a rather flexible bile acid receptor target called FXR was used. Totrov and group predicted the lowest mean RMSD (of all the submissions) of 1.95 Å by using a new hybrid ligand/receptor structure-based docking method called LigBEnd[110]. The Cournia group used a combination of docking and physics based methods and attained the lowest median RMSD of 0.99 Å out of all the submissions[111]. Most of the leading docking methods were combined with the knowledge of existing PDB data, which made it quite difficult to discern the performance of individual methods. In fact, when organizers excluded the most similar ligands from evaluations, the performance of the known docking methods became much worse. Literature search for similar compounds proved to be handy except for one of the ligands (FXR34) despite of having a similarity coefficient of 0.83 with the available systems. It happened because the common available known structures for that ligand had appreciable different binding modes and different protein conformations, which misguided the challengers 23 relying on the known crystal structures[103]. In the first stage of both the Grand Challenges, there were very few methods that behaved consistently well for all the targets. This problem was realized in CSAR 2014 challenge as well where the major task was to find the methods that performed consistently well for all the three targets provided by the organizers[101]. Only 34% of the methods had RMSD of < 2 Å for all the provided targets. A few methods that consistently edged the prediction of binding mode did not do well on the predictions of binding affinity. Ranking of ligands based on affinity was an issue for nearly all the blind challenged conducted so far[101-105]. In GC1, the highest correlation of the submitted rankings with respect to the experimental rankings was quite poor with the values of 0.32 for HSP90 and 0.48 for MAP4K4[104]. The highest Kendall Tau attained for affinity ranking in GC2 was 0.45 despite of the knowledge of crystal structures. The only encouraging aspect observed in Kendall’s Tau trend for GC2 was that all the values were positive depicting the lack of randomness in the protocols. Nevertheless, no particular approach was observed to edge both the docking and scoring challenges showing that the results are statistically insignificant. Apart from these blind challenges, prospective testing of cross docking was performed by Jain lab on a set of 10 pharmaceutically relevant targets using a total of 949 ligands[112]. They used a cross docking benchmark dataset “PINC” to follow a knowledge-guided docking protocol. Protein pocket similarity was used to perform ensemble docking using Surflex-Dock. Out of all the predictions, correct poses were identified nearly 90% of the time. Understanding protein-ligand binding mechanism is a challenging problem and to solve it using a generalized method that handles all target systems equally well has still not been identified. A good example illustrating the problem involves the participation of RosettaLigand with full ligand flexibility and receptor backbone flexibility in theSAMPL1 blind challenge. In 24 this challenge participants were asked as to dock nearly 100 ligands into the JNK-3 kinase and urokinase-type plasminogen activator. Initially, they did not incorporate full backbone flexibility in their challenge submissions, but reevaluated their results retrospectively where they included full flexibility[113]. The addition of full backbone flexibility benefitted the JNK3 challenge yielding more predictions within a 2 Å RMSD of the native ligand and lowering the structural RMSD on average. However, implementing backbone flexibility for urokinase opened avenues for mistakes since the receptor does not undergo significant conformational changes upon binding. Overall, docking of some compounds fare better but several fared worse than the original submission for the urokinase challenge. The problem of structural flexibility of the binding pocket residues is clearly important, but there are several other issues that pose problems like accounting for conserved water molecules, accurate sampling of ligand conformations, missing residues, solvation models, scoring functions, etc. Another major bottleneck in the field is the lack of a plethora of high-resolution experimental protein-ligand structures. Some of the concerns associated with the field of binding mode prediction, affinity estimation and virtual screening are listed here[114] For the more precise estimate of binding affinities, it is crucial to consider the configurational entropy contributed by the conformational substates of proteins, but the conundrum is that these additional components are also adding inaccuracies into the calculated binding affinity. The hard part is to decipher whether the problem lies with the scoring functions or with the sampling strategies or both. In a nutshell, these blind challenges for protein- ligand docking and affinity rankings are helping the community to move forward but to clearly pinpoint the underlying cause of the error in any particular method is still a major issue. 25 1.7 Discussion Receptor flexibility is one of the major challenges faced by the CADD community in binding mode prediction and accurate prediction of binding free energies. Recent advancements in the field of SBDD have accounted for receptor flexibility in the binding pocket. Computational methods are being continuously improved as not all methods work for all the protein systems. The computational approaches accounting for receptor flexibility in docking based, simulation based and other protein loop prediction algorithms have been discussed. As most of the groups validate their methodologies retrospectively, we discussed prospective validation of extant methods based on blind challenges hosted by CSAR and D3R. Although we are seeing significant improvement in retrospective evaluations, prospective validations are still statistically challenged. It is difficult to a priori anticipate the level of receptor flexibility needed for a particular target system, however it is certain that some level of receptor motion is important to account for the entropy associated with the binding processes[115]. Considering the configurational entropy contributed by receptor conformations is extremely important for the accurate estimation of binding affinities, however probably adding this component is adding more inaccuracies in the free energy calculations. The problem remains with the scoring function or the sampling strategy is hard to pinpoint as of yet. 26 REFERENCES 27 REFERENCES Boehr, D.D., R. Nussinov, and P.E. Wright, The role of dynamic conformational ensembles in biomolecular recognition. Nature Chemical Biology, 2009. 5(11): p. 789- 796. Lill, M.A., Efficient Incorporation of Protein Flexibility and Dynamics into Molecular Docking Simulations. Biochemistry, 2011. 50(28): p. 6157-6169. James, L.C. and D.S. Tawfik, Structure and kinetics of a transient antibody binding intermediate reveal a kinetic discrimination mechanism in antigen recognition. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(36): p. 12730-12735. Ma, B.Y. and R. Nussinov, Enzyme dynamics point to stepwise conformational selection in catalysis. Current Opinion in Chemical Biology, 2010. 14(5): p. 652-659. Huang, S.Y. and X.Q. Zou, Advances and Challenges in Protein-Ligand Docking. International Journal of Molecular Sciences, 2010. 11(8): p. 3016-3034. Carlson, H.A., Protein flexibility and drug design: how to hit a moving target. Current Opinion in Chemical Biology, 2002. 6(4): p. 447-452. Nichols, S.E., R.V. Swift, and R.E. Amaro, Rational Prediction with Molecular Dynamics for Hit Identification. Current Topics in Medicinal Chemistry, 2012. 12(18): p. 2002-2012. Teague, S.J., Implications of protein flexibility for drug discovery. Nature Reviews Drug Discovery, 2003. 2(7): p. 527-541. Csermely, P., R. Palotai, and R. Nussinov, Induced fit, conformational selection and independent dynamic segments: an extended view of binding events. Trends in Biochemical Sciences, 2010. 35(10): p. 539-546. Sinko, W., S. Lindert, and J.A. McCammon, Accounting for Receptor Flexibility and Enhanced Sampling Methods in Computer-Aided Drug Design. Chemical Biology & Drug Design, 2013. 81(1): p. 41-49. Feixas, F., et al., Exploring the role of receptor flexibility in structure-based drug discovery. Biophysical Chemistry, 2014. 186: p. 31-45. Antunes, D.A., D. Devaurs, and L.E. Kavraki, Understanding the challenges of protein flexibility in drug design. Expert Opinion on Drug Discovery, 2015. 10(12): p. 1301- 1313. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 28 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 24. 25. 26. Apostolakis, J., A. Pluckthun, and A. Caflisch, Docking small ligands in flexible binding sites. Journal of Computational Chemistry, 1998. 19(1): p. 21-37. 23. Mizutani, M.Y., et al., Effective handling of induced-fit motion in flexible docking. Proteins-Structure Function and Bioinformatics, 2006. 63(4): p. 878-891. Cavasotto, C.N. and R.A. Abagyan, Protein flexibility in ligand docking and virtual screening to protein kinases. Journal of Molecular Biology, 2004. 337(1): p. 209-225. Guedes, I.A., C.S. de Magalhaes, and L.E. Dardenne, Receptor-ligand molecular docking. Biophys Rev, 2014. 6(1): p. 75-87. Teodoro, M.L., G.N. Phillips, and L.E. Kavraki, Understanding protein flexibility through dimensionality reduction. Journal of Computational Biology, 2003. 10(3-4): p. 617-634. Yuriev, E., M. Agostino, and P.A. Ramsland, Challenges and advances in computational docking: 2009 in review. Journal of Molecular Recognition, 2011. 24(2): p. 149-164. Yuriev, E., J. Holien, and P.A. Ramsland, Improvements, trends, and new ideas in molecular docking: 2012-2013 in review. Journal of Molecular Recognition, 2015. 28(10): p. 581-604. Yuriev, E. and P.A. Ramsland, Latest developments in molecular docking: 2010-2011 in review. Journal of Molecular Recognition, 2013. 26(5): p. 215-239. Smith, G.M., et al., Positions of His-64 and a Bound Water in Human Carbonic- Anhydrase-Ii Upon Binding 3 Structurally Related Inhibitors. Protein Science, 1994. 3(1): p. 118-125. Jiang, F. and S.H. Kim, Soft Docking - Matching of Molecular-Surface Cubes. Journal of Molecular Biology, 1991. 219(1): p. 79-102. Ferrari, A.M., et al., Soft docking and multiple receptor conformations in virtual screening. Journal of Medicinal Chemistry, 2004. 47(21): p. 5076-5084. Sherman, W., et al., Novel procedure for modeling ligand/receptor induced fit effects. Journal of Medicinal Chemistry, 2006. 49(2): p. 534-553. Venkatraman, V. and D.W. Ritchie, Flexible protein docking refinement using pose- dependent normal mode analysis. Proteins-Structure Function and Bioinformatics, 2012. 80(9): p. 2262-2274. Bottegoni, G., et al., A new method for ligand docking to flexible receptors by dual alanine scanning and refinement (SCARE). Journal of Computer-Aided Molecular Design, 2008. 22(5): p. 311-325. 29 27. 28. 29. 30. 31. 33. 34. 36. 37. 38. 39. Hartmann, C., I. Antes, and T. Lengauer, Docking and scoring with alternative side- chain conformations. Proteins-Structure Function and Bioinformatics, 2009. 74(3): p. 712-726. 32. Meiler, J. and D. Baker, ROSETTALIGAND: Protein-small molecule docking with full side-chain flexibility. Proteins-Structure Function and Bioinformatics, 2006. 65(3): p. 538-548. Schumann, M. and R.S. Armen, Systematic and efficient side chain optimization for molecular docking using a cheapest-path procedure. Journal of Computational Chemistry, 2013. 34(14): p. 1258-1269. Ding, F., S.Y. Yin, and N.V. Dokholyan, Rapid Flexible Docking Using a Stochastic Rotamer Library of Ligands. Journal of Chemical Information and Modeling, 2010. 50(9): p. 1623-1632. 35. Wagener, M., J. de Vlieg, and S.B. Nabuurs, Flexible Protein-Ligand Docking Using the Fleksy Protocol. Journal of Computational Chemistry, 2012. 33(12): p. 1215-1217. Totrov, M. and R. Abagyan, Flexible ligand docking to multiple receptor conformations: a practical alternative. Current Opinion in Structural Biology, 2008. 18(2): p. 178-184. Leach, A.R., Ligand Docking to Proteins with Discrete Side-Chain Flexibility. Journal of Molecular Biology, 1994. 235(1): p. 345-356. Leach, A.R. and A.P. Lemon, Exploring the conformational space of protein side chains using dead-end elimination and the A* algorithm. Proteins-Structure Function and Bioinformatics, 1998. 33(2): p. 227-239. Huang, N., B.K. Shoichet, and J.J. Irwin, Benchmarking sets for molecular docking. Journal of Medicinal Chemistry, 2006. 49(23): p. 6789-6801. Shin, W.H. and C. Seok, GalaxyDock: Protein-Ligand Docking with Flexible Protein Side-chains. Journal of Chemical Information and Modeling, 2012. 52(12): p. 3225-3232. Shin, W.H., et al., GalaxyDock2: Protein-Ligand Docking Using Beta-Complex and Global Optimization. Journal of Computational Chemistry, 2013. 34(30): p. 2647-2656. Schnecke, V., et al., Screening a peptidyl database for potential ligands to proteins with side-chain flexibility. Proteins-Structure Function and Genetics, 1998. 33(1): p. 74-87. Schnecke, V. and L.A. Kuhn, Virtual screening with solvation and ligand-induced complementarity. Perspectives in Drug Discovery and Design, 2000. 20(1): p. 171-190. 30 Schnecke, V. and L.A. Kuhn, Database screening for HIV protease ligands: the influence of binding-site conformation and representation on ligand selectivity. Proc Int Conf Intell Syst Mol Biol, 1999: p. 242-51. Abagyan, R. and M. Totrov, Biased Probability Monte-Carlo Conformational Searches and Electrostatic Calculations for Peptides and Proteins. Journal of Molecular Biology, 1994. 235(3): p. 983-1002. Totrov, M. and R. Abagyan, Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins-Structure Function and Genetics, 1997: p. 215-220. David, L., R. Luo, and M.K. Gilson, Ligand-receptor docking with the Mining Minima optimizer. Journal of Computer-Aided Molecular Design, 2001. 15(2): p. 157-171. Chen, W., et al., Modeling Protein-Ligand Binding by Mining Minima. Journal of Chemical Theory and Computation, 2010. 6(11): p. 3540-3557. Friesner, R.A., et al., Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. Journal of Medicinal Chemistry, 2004. 47(7): p. 1739-1749. Halgren, T.A., et al., Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. Journal of Medicinal Chemistry, 2004. 47(7): p. 1750-1759. Zhao, Y. and M.F. Sanner, FLIPDock: Docking flexible ligands into flexible receptors. Proteins-Structure Function and Bioinformatics, 2007. 68(3): p. 726-737. Zhao, Y., D. Stoffler, and M. Sanner, Hierarchical and multi-resolution representation of protein flexibility. Bioinformatics, 2006. 22(22): p. 2768-2774. Kumar, A. and K.Y.J. Zhang, A pose prediction approach based on ligand 3D shape similarity. Journal of Computer-Aided Molecular Design, 2016. 30(6): p. 457-469. Claussen, H., et al., FlexE: Efficient molecular docking considering protein structure variations. Journal of Molecular Biology, 2001. 308(2): p. 377-395. Polgar, T. and G.M. Keseru, Ensemble docking into flexible active sites. Critical evaluation of FlexE against JNK-3 and beta-secretase. Journal of Chemical Information and Modeling, 2006. 46(4): p. 1795-1805. Bansal, N., Z. Zheng, and K.M. Merz, Incorporation of side chain flexibility into protein binding pockets using MTflex. Bioorganic & Medicinal Chemistry, 2016. 24(20): p. 4978-4987. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 31 Kufareva, I. and R. Abagyan, Type-II Kinase Inhibitor Docking, Screening, and Profiling Using Modified Structures of Active Kinase States. Journal of Medicinal Chemistry, 2008. 51(24): p. 7921-7932. 57. Ma, B.Y., et al., Folding funnels and binding mechanisms. Protein Engineering, 1999. 12(9): p. 713-720. 53. 54. 55. 56. 58. 59. 60. 61. 62. 63. 64. 65. Rueda, M., M. Totrov, and R. Abagyan, ALiBERO: Evolving a Team of Complementary Pocket Conformations Rather than a Single Leader. Journal of Chemical Information and Modeling, 2012. 52(10): p. 2705-2714. Osterberg, F., et al., Automated docking to multiple target structures: Incorporation of protein mobility and structural water heterogeneity in AutoDock. Proteins-Structure Function and Genetics, 2002. 46(1): p. 34-40. Ravindranath, P.A., et al., AutoDockFR: Advances in Protein-Ligand Docking with Explicitly Specified Binding Site Flexibility. Plos Computational Biology, 2015. 11(12). Carlson, H.A. and J.A. McCammon, Accommodating protein flexibility in computational drug design. Molecular Pharmacology, 2000. 57(2): p. 213-218. Cukier, R.I., Apo Adenylate Kinase Encodes Its Holo Form: A Principal Component and Varimax Analysis. Journal of Physical Chemistry B, 2009. 113(6): p. 1662-1672. Lou, H.F. and R.I. Cukier, Molecular dynamics of apo-adenylate kinase: A principal component analysis. Journal of Physical Chemistry B, 2006. 110(25): p. 12796-12808. Ikeguchi, M., et al., Protein structural change upon ligand binding: Linear response theory. Physical Review Letters, 2005. 94(7). Cavasotto, C.N., J.A. Kovacs, and R.A. Abagyan, Representing receptor flexibility in ligand docking through relevant normal modes. Journal of the American Chemical Society, 2005. 127(26): p. 9632-9640. Brooks, B. and M. Karplus, Normal-Modes for Specific Motions of Macromolecules - Application to the Hinge-Bending Mode of Lysozyme. Proceedings of the National Academy of Sciences of the United States of America, 1985. 82(15): p. 4995-4999. Hayward, S., A. Kitao, and H.J.C. Berendsen, Model-free methods of analyzing domain motions in proteins from simulation: A comparison of normal mode analysis and molecular dynamics simulation of lysozyme. Proteins-Structure Function and Genetics, 1997. 27(3): p. 425-437. Hinsen, K., A. Thomas, and M.J. Field, Analysis of domain motions in large proteins. Proteins-Structure Function and Genetics, 1999. 34(3): p. 369-382. 32 66. Tama, F., et al., Building-block approach for determining low-frequency normal modes of macromolecules. Proteins-Structure Function and Genetics, 2000. 41(1): p. 1-7. 67. May, A. and M. Zacharias, Protein-ligand docking accounting for receptor side chain and global flexibility in normal modes: Evaluation on kinase inhibitor cross docking. Journal of Medicinal Chemistry, 2008. 51(12): p. 3499-3506. Petrone, P. and V.S. Pande, Can conformational change be described by only a few normal modes? Biophysical Journal, 2006. 90(5): p. 1583-1593. Rueda, M., G. Bottegoni, and R. Abagyan, Consistent Improvement of Cross-Docking Results Using Binding Site Ensembles Generated with Elastic Network Normal Modes. Journal of Chemical Information and Modeling, 2009. 49(3): p. 716-725. Tirion, M.M., Large amplitude elastic motions in proteins from a single-parameter, atomic analysis. Physical Review Letters, 1996. 77(9): p. 1905-1908. Dietzen, M., et al., On the Applicability of Elastic Network Normal Modes in Small- Molecule Docking (vol 52, pg 844, 2012). Journal of Chemical Information and Modeling, 2014. 54(12): p. 3453-3453. Hartshorn, M.J., et al., Diverse, high-quality test set for the validation of protein-ligand docking performance. Journal of Medicinal Chemistry, 2007. 50(4): p. 726-741. Amadei, A., A.B.M. Linssen, and H.J.C. Berendsen, Essential Dynamics of Proteins. Proteins-Structure Function and Genetics, 1993. 17(4): p. 412-425. 68. 69. 70. 71. 72. 73. 74. 76. 77. 78. Bansal, N., et al., The Role of the Active Site Flap in Streptavidin/Biotin Complex Formation. Journal of the American Chemical Society, Submitted 75. Wong, S. and M.P. Jacobson, Conformational selection in silico: Loop latching motions and ligand binding in enzymes. Proteins-Structure Function and Bioinformatics, 2008. 71(1): p. 153-164. Flick, J., F. Tristram, and W. Wenzel, Modeling loop backbone flexibility in receptor- ligand docking simulations. Journal of Computational Chemistry, 2012. 33(31): p. 2504- 2515. Chaudhuri, R., et al., Application of Drug-Perturbed Essential Dynamics/Molecular Dynamics (ED/MD) to Virtual Screening and Rational Drug Design. Journal of Chemical Theory and Computation, 2012. 8(7): p. 2204-2214. Carrillo, O., C.A. Laughton, and M. Orozco, Fast Atomistic Molecular Dynamics Simulations from Essential Dynamics Samplings. Journal of Chemical Theory and Computation, 2012. 8(3): p. 792-799. 33 Kokh, D.B., R.C. Wade, and W. Wenzel, Receptor flexibility in small-molecule docking calculations. Wiley Interdisciplinary Reviews-Computational Molecular Science, 2011. 1(2): p. 298-314. Rocchia, W., M. Masetti, and A. Cavalli, Enhanced Sampling Methods in Drug Design. Physico-Chemical and Computational Approaches to Drug Discovery, 2012(23): p. 273- 301. Buch, I., T. Giorgino, and G. De Fabritiis, Complete reconstruction of an enzyme- inhibitor binding process by molecular dynamics simulations. Proceedings of the National Academy of Sciences of the United States of America, 2011. 108(25): p. 10184- 10189. Shan, Y.B., et al., How Does a Drug Molecule Find Its Target Binding Site? Journal of the American Chemical Society, 2011. 133(24): p. 9181-9183. Scarpazza, D.P., et al., Extending the generality of molecular dynamics simulations on a special-purpose machine. Ieee 27th International Parallel and Distributed Processing Symposium (Ipdps 2013), 2013: p. 933-945. De Paris, R., et al., wFReDoW: A Cloud-Based Web Environment to Handle Molecular Docking Simulations of a Fully Flexible Receptor Model. Biomed Research International, 2013. 79. 80. 81. 82. 83. 84. 85. 87. 89. Independent-Trajectories Lawrenz, M., R. Baron, for Biomolecular Systems: Thermodynamic-Integration Free-Energy Changes Determinants of H5N1 Avian Influenza Virus Neuraminidase Inhibition by Peramivir. Journal of Chemical Theory and Computation, 2009. 5(4): p. 1106-1116. J.A. McCammon, and 86. Mobley, D.L., J.D. Chodera, and K.A. Dill, Confine-and-release method: Obtaining correct binding free energies in the presence of protein conformational change. Journal of Chemical Theory and Computation, 2007. 3(4): p. 1231-1235. de Oliveira, C.A.F., D. Hamelberg, and J.A. McCammon, Coupling accelerated molecular dynamics methods with thermodynamic integration simulations. Abstracts of Papers of the American Chemical Society, 2009. 238. 88. Wereszczynski, J. and J.A. McCammon, Using Selectively Applied Accelerated Molecular Dynamics to Enhance Free Energy Calculations. Journal of Chemical Theory and Computation, 2010. 6(11): p. 3285-3292. Lim, N.M., et al., Sensitivity in Binding Free Energies Due to Protein Reorganization. Journal of Chemical Theory and Computation, 2016. 12(9): p. 4620-4631. 34 Danielson, M.L. and M.A. Lill, Predicting flexible loop regions that interact with ligands: The challenge of accurate scoring. Proteins-Structure Function and Bioinformatics, 2012. 80(1): p. 246-260. Forrey, C., J.F. Douglas, and M.K. Gilson, The fundamental role of flexibility on the strength of molecular binding. Soft Matter, 2012. 8(23): p. 6385-6392. Kolb, P. and J.J. Irwin, Docking screens: right for the right reasons? Curr Top Med Chem, 2009. 9(9): p. 755-70. Abagyan, R., M. Totrov, and D. Kuznetsov, Icm - a New Method for Protein Modeling and Design - Applications to Docking and Structure Prediction from the Distorted Native Conformation. Journal of Computational Chemistry, 1994. 15(5): p. 488-506. Knegtel, R.M.A., I.D. Kuntz, and C.M. Oshiro, Molecular docking to ensembles of protein structures. Journal of Molecular Biology, 1997. 266(2): p. 424-440. Sotriffer, C.A. and I. Dramburg, "In situ cross-docking" to simultaneously address multiple targets. Journal of Medicinal Chemistry, 2005. 48(9): p. 3122-3125. Yang, C.Y., R.X. Wang, and S.M. Wang, M-score: A knowledge-based potential scoring function accounting for protein atom mobility. Journal of Medicinal Chemistry, 2006. 49(20): p. 5903-5911. I.R., J.W. Essex, and K. Spiegel, Ensemble Docking into Multiple Craig, Crystallographically Derived Protein Structures: An Evaluation Based on the Statistical Analysis of Enrichments. Journal of Chemical Information and Modeling, 2010. 50(4): p. 511-524. 98. Morris, G.M., et al., AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. Journal of Computational Chemistry, 2009. 30(16): p. 2785-2791. 90. 91. 92. 93. 94. 95. 96. 97. 99. Neves, M.A.C., M. Totrov, and R. Abagyan, Docking and scoring with ICM: the benchmarking results and strategies for improvement. Journal of Computer-Aided Molecular Design, 2012. 26(6): p. 675-686. 100. Leach, A.R., B.K. Shoichet, and C.E. Peishoff, Prediction of protein-ligand interactions. Docking and scoring: Successes and gaps. Journal of Medicinal Chemistry, 2006. 49(20): p. 5851-5855. 101. Carlson, H.A., et al., CSAR 2014: A Benchmark Exercise Using Unpublished Data from Pharma. Journal of Chemical Information and Modeling, 2016. 56(6): p. 1063-1077. 102. Damm-Ganamet, K.L., et al., CSAR Benchmark Exercise 2011-2012: Evaluation of Results from Docking and Relative Ranking of Blinded Congeneric Series. Journal of Chemical Information and Modeling, 2013. 53(8): p. 1853-1870. 35 103. Gaieb, Z., et al., D3R Grand Challenge 2: blind prediction of protein-ligand poses, affinity rankings, and relative binding free energies. Journal of Computer-Aided Molecular Design, 2018. 32(1): p. 1-20. 104. Gathiaka, S., et al., D3R grand challenge 2015: Evaluation of protein-ligand pose and affinity predictions. Journal of Computer-Aided Molecular Design, 2016. 30(9): p. 651- 668. 105. Smith, R.D., et al., CSAR Benchmark Exercise of 2010: Combined Evaluation Across All Submitted Scoring Functions. Journal of Chemical Information and Modeling, 2011. 51(9): p. 2115-2131. 106. Ye, Z.F., et al., Optimal strategies for virtual screening of induced-fit and flexible target in the 2015 D3R Grand Challenge. Journal of Computer-Aided Molecular Design, 2016. 30(9): p. 695-706. 107. Kumar, A. and K.Y.J. Zhang, Prospective evaluation of shape similarity based pose prediction method in D3R Grand Challenge 2015. Journal of Computer-Aided Molecular Design, 2016. 30(9): p. 685-693. 108. Salmaso, V., et al., DockBench as docking selector tool: the lesson learned from D3R Grand Challenge 2015. Journal of Computer-Aided Molecular Design, 2016. 30(9): p. 773-789. 109. Slynko, I., et al., Docking pose selection by interaction pattern graph similarity: application to the D3R grand challenge 2015. Journal of Computer-Aided Molecular Design, 2016. 30(9): p. 669-683. 110. Lam, P.C.H., R. Abagyan, and M. Totrov, Ligand-biased ensemble receptor docking (LigBEnD): a hybrid ligand/receptor structure-based approach. Journal of Computer- Aided Molecular Design, 2018. 32(1): p. 187-198. 111. Athanasiou, C., et al., Using physics-based pose predictions and free energy perturbation calculations to predict binding poses and relative binding affinities for FXR ligands in the D3R Grand Challenge 2. Journal of Computer-Aided Molecular Design, 2018. 32(1): p. 21-44. 112. Cleves, A.E. and A.N. Jain, Knowledge-guided docking: accurate prospective prediction of bound configurations of novel ligands using Surflex-Dock. Journal of Computer-Aided Molecular Design, 2015. 29(6): p. 485-509. 113. Davis, I.W. and D. Baker, ROSETTALIGAND Docking with Full Ligand and Receptor Flexibility. Journal of Molecular Biology, 2009. 385(2): p. 381-392. 36 Jain, A.N. and A. Nicholls, Recommendations for evaluation of computational methods. Journal of Computer-Aided Molecular Design, 2008. 22(3-4): p. 133-139. 115. Cournia, Z., B. Allen, and W. Sherman, Relative Binding Free Energy Calculations in Drug Discovery: Recent Advances and Practical Considerations. Journal of Chemical Information and Modeling, 2017. 57(12): p. 2911-2937. 114. 37 CHAPTER 2 Methodology †Reprinted (adapted) with permission from Bansal, N.; Zheng, Z.; Merz, K. M., Incorporation of side chain flexibility into protein binding pockets using MTflex. Bioorgan Med Chem 2016, 24, 4978-4987. Reprinted (adapted) with permission from Bansal, N.; Zheng, Z.; Song, F. L.; Pei, J.; Merz, K. M., The Role of the Active Site Flap in Streptavidin/Biotin Complex Formation. J Am Chem Soc 2018, DOI 10.1021/jacs.8b00743 38 2.1 Abstract This chapter delineates the motivation and methodologies used in our study. Sampling and scoring are two key factors to improve the overall free energy estimated by a computational method. We generate conformations with the bigger purpose of improving the overall binding free energy estimation by our movable type free energy method. The conformers are generated by novel methods, which we call MTflex and MTFlex-b. The conformational ensemble is generated on an energy scale using pre-tabulated one-dimensional databases of distance versus energies. MTflex generates conformations for the side-chain only keeping the backbone rigid while MTFlex-b extends the concept to include both backbone and side chain flexibility. However, both the methods generate conformations using same mechanics. Free energies are further calculated by using our in-house method called Movable type free energy method 2.2 Introduction Our unique and trivially parallel free energy estimation procedure follows a two-step procedure. In the first step, the significant uncorrelated configurational states are assembled on a molecular energy landscape using our conformation search algorithm for receptor flexibility- MTflex and MTFlex-b. The molecular conformations are generated at an atom-pair level using a distance-based coordinate system, where each selected distance is associated with the pair- potential value selected from a pre-built look up table, which helps us to rapidly generate structures on an energy landscape. In this way, we obtain several uncorrelated molecular conformations of the ensemble on an energy surface in a rapid manner by performing smart or guided dihedral scans using our potentials. 39 Single crystal structure Free energy extrapolation using Movable type method for each conformation Energy guided database for conformation generation Conformational ensemble Figure 2.1 Pictorial representation of our free energy estimation procedure. The conformations generated by MTFlex-B are generated on an energy surface (shown in the heat map plot on the left). Each “seed structure” is fed to the Movable type method (shown on the right), which performs the local sampling around the initial conformation as explained with the red dots on the right heat map plot. The generated pool of molecular conformations served as the “seed structures”, which were fed to the MT free energy method for extrapolation of the local partition function and plotting the complete energy landscape. Figure 2.1 shows the pictorial representation of our free energy simulation procedure. 40 2.3 Conformation generation Our conformation generation strategy is analogous to 3-D printing. In three-dimensional space, any coordinate system requires three variables to locate a particle in the domain of definition e.g. x, y, z in the Cartesian coordinate system and r, θ, φ in the spherical coordinate system. Locating one particle requires three distance variables associated with three reference points. A coordinate translation from the Cartesian coordinate system is shown in equation 2.1, with the particle coordinate (x, y, z) translated to (d1, d2, d3) using three reference points (x1, y1, z1), (x2, y2, z2) and (x3, y3, z3). d1 = (x − x1)2 + (y − y1)2 + (z − z1)2 d2 = (x − x2)2 + (y − y2)2 + (z − z2)2 d3 = (x − x3)2 + (y − y3)2 + (z − z3)2 (2.1) The reference points, like the origin in the Cartesian coordinate system, can be randomly defined within the domain of definition. Nonetheless, due to the association of our pair potentials to the distances, the reference points used herein are defined as atoms with known locations. In each step a new atom is added to the growing molecular ensemble based on the solution of equation 4 using three different reference atoms (see Figure 2.2). Every selected distance is associated with a pair potential value chosen from a pair potential lookup table indexed with the atom pair type at discretized distances. Theoretically, the selection of the distances (d1, d2, d3) could be random once it satisfies the solution of equation 2.1. However, random selection yields too many unfavorable structures appearing in the final molecular ensemble collection, which suggests the importance of using the pair potentials (as Boltzmann weighted factors – See the upper-right of Figure 2.2) as a reference for the selection of appropriate distances. Moreover, any three atoms with known locations in the growing molecular ensemble can be selected as the 41 reference points during each step of the construction, resulting in a much larger sample size during the computation. Briefly, for each of the distances (d1, d2, d3) between the new atom and the reference atoms, a sampling or “blurring” range is set around the selected distance to create a vector of energies as a function of r, which is the candidate pool for future structure sampling. The vector of one-dimensional energies (atom pairwise energy component to the molecular ensemble) is then expanded into a matrix with a user defined size, by randomly “scrambling” the order of the original energy vectors repeatedly and then “tiling” these disordered vectors into a fix-sized matrix. The numerical integral is performed by point-wise multiplication through all the three one-dimensional energy matrices according to the three reference atoms (equations 2.2 to 2.4). /zα = ⎡ ! ⎢ zα(r0 − 2Δr) ⎢ ⎢ zα(r0 − Δr) ⎢ ⎢ zα(r0) ⎢ zα(r0 + Δr) ⎢ ⎢ zα(r0 + 2Δr) ⎢ ⎢ ! ⎣ ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ = ⎡ ! ⎢ e−βEα(r0−2Δr) ⎢ ⎢ e−βEα(r0−Δr) ⎢ ⎢ e−βEα(r0 ) ⎢ ⎢ e−βEα(r0+Δr) ⎢ e−βEα(r0+2Δr) ⎢ ⎢ ! ⎢ ⎣ ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ ⇒ scramble /zα( ) = ⎡ zα(r0 −8Δr) ⎢ zα(r0 + 3Δr) ⎢ ⎢ zα(r0 + Δr) ⎢ ⎢ ! ⎢ zα(r0) ⎢ ⎣ ⎤ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎥ ⎦ m rows (2.2) (2.3) (2.4) scramble( /zα)1 scramble( /zα)2 ! scramble( /zα)i+1 ! scramble( /zα)n−i+1 scramble( /zα)n−i+2 scramble( /zα)i+2 ⎫ ⎤ ⎪ ⎥ ⎪⎪ ⎥ ⎬ ⎥ ⎪ ⎥ ⎥ ⎪ ⎪ ⎦ ⎭ ############ $ scramble( /zα)2i ! scramble( /zα)n ! ! " scramble( /zα)i ############ " n columns ⎡ ⎢ ⎢ ⎢ ⎢ ⎢ ⎣ ! /Zα = /ZA = /Z1 i /Z2 i /Z3 42 The random disordered permutations to each one-dimensional energy matrix is meant to maximize the variety of energy combinations at different distances, and the fixed-size matrix multiplication is to maintain a computationally tractable sampling size. As shown in equation 2.2, the row number m of the matrices is defined as the least multiplier of all the atom pairwise vector sizes in the molecular system under study in order to identically size all atom pairwise energy vectors with different sampling or “blurring” ranges for different pair potentials, while the column number n is a user defined number defining the sampling size to satisfy a convergent ensemble through the different pair potentials. Hence, the matrix ZA in equation 2.4 after the assembly through the 3 pair-potential matrices includes m×n possibilities to locate the new atom for each distance set (d1, d2, d3) that have been selected. This candidate pool selection for just one single atom would be a burden for the construction of a molecular structure due to the exponential increase in the sampling size with the number of atoms. To address this, the selection of each distance set (d1, d2, d3) is constrained within a limited range according to the pair potential significance (see upper-right of Figure 2.2). The method used herein is analogous to the conformational search algorithm for small molecules recently developed by our group.[99] Bonds and angles are restricted to their well- depth location due to their extremely narrow energy ranges. Distances for torsions and non-bond contacts are chosen at their most favorable regions to give precedence to the most relevant contacts. For instance when coming to a step involving the addition of a new hydrogen bond donor/acceptor to the growing ensemble, other than considering all possible torsion distributions of this atom, a 3 Å vicinity is searched for new locations to satisfy any potential hydrogen bond contacts. Atoms forming a disulphide bond are not allowed to move during the side chain generation. The generated poses are accepted only if they don’t clash with the rest of the 43 structure. A geometric cutoff of 2.8 Å is set as the collision criteria for the non-bonded heavy atoms. This procedure narrows down our conformational search while sampling in the appropriate regime. An example illustrating the parallel printing scheme of our method is shown in the left-hand side of Figure 2.2. 44 Figure 2.2. An example illustrating the parallel printing scheme used to construct alanine. The printing starts from three atoms with known locations. In each step a new atom is added to the system fixed by three distances constrained by the pair potentials from the lookup table. Bonds and angles are fixed to their minimum well depths due to their extremely narrow distribution ranges. Distances for the torsion angles and non-bonded interactions (not used in the alanine example) are selected based on a range of points around the energy minima (only minima are selected in this figure for simplicity). 45 2.4 Side chain flexibility The construction of side chain conformers for one residue was independent of the other residues of the binding pocket; i.e. the three reference points to locate a particle were taken from the same residue. In this way, the first side chain atom to be rotated for each residue was Cγ, where backbone amide N, Cα and Cβ served as the three reference atoms. Because the exploration of side chain conformations was done per residue steric clashes for each residue were checked after looping over all residues with respect to all other residues of the binding pocket and the remaining residues in the protein. Following this strategy, conformers were generated for 17 out of a total of 20 amino acids (Gly, Pro and Ala were skipped). Though conceptually similar, MTflex differs from the rotamer library based methods in that new atoms are not built according to a structural database but on a local free-energy landscape defined by the potential utilized. 2.5 Loop flexibility Conformational space available to loop grows exponentially with the loop length, so we generated the loop from both N and C termini separately. This loop buildup procedure bears similarity to Jacobson et al strategy [1], where loop was generated from both ends and closed in the middle. Our loop building strategy is different in a way that backbone and side-chain atoms for each residue were constructed altogether. This helped us in eliminating those backbone positions immediately, which couldn’t fit side chain atoms. Several restraints were also applied during structural sampling. The entire loop was generated in the presence of the rest of the protein structure so the Vander Waal collision restraints served as the primary source of screening. A geometric restraint of 2.8 Å was used for the non-bonded heavy atoms and 2.3 Å 46 for the hydrogen bond forming atoms. Steric clashes within the generated loop structure were also identified using the same geometric restraint. The atoms forming a disulphide bond were also kept intact during the side chain formation of the generated loop structure. Chirality at α- carbon position was considered for all the amino acids for the backbone generation. β-carbon chirality was considered for isoleucine and threonine for the generation of side chains conformers. Ring planarity of aromatic amino acids was also enforced. The sampling size was kept at a manageable level in this way. The loop closure of the generated loop fragments was obtained using the following procedure. The N-termini loop fragments were generated from 1 to ith loop residues, where the ith residue corresponds to one of the middle residues of the loop. Similarly, the C-termini loop fragments were generated from M to (i+1)th residues, where M is the last residue of the loop. See Figure 2.3 for reference. Both ends were tried to meet by finding the pair from either end falling in the optimum bond distance for the carbonyl carbon of the ith residue and backbone Nitrogen of (i+1)th residue. The optimum distance was chosen from our pre-built look up table of distance versus energies. Apart from the bond distance restraint, the geometric criterion for checking Van der Waals collisions between the connecting residues was also imposed. In this way, the loop fragments, which were unable to close or had steric clashes, were pruned from the ensemble on the fly. 47 a. Loop fragments generated from N- terminal for 1 to ith residues b. Loop fragments generated from C- terminal for M to (i+1)th residues c. Complete loops for residues (1-M) generated by selective screening of loop fragments N-ter C-ter N-ter C-ter Bond distance range Figure 2.3 Pictorial representation of our loop closure strategy. Loop fragments are generated from both N and C terminals individually. The individual loop fragments are then combined from both ends to form complete loops. 2.6. Free energy estimation The Movable Type (MT) method numerically simulates local partition functions utilizing a Monte Carlo integration (MCI) scheme, given an initial structure, from a canonical ensemble. The MCI method is a widely used numerical approach for free energy calculation.[2, 3] By simulating the integral of the canonical partition function instead of generating enthalpy and entropy values separately, the MCI method allows for the avoidance of expensive and poorly converging entropy calculations. A =− RT ln Z [ M ] =− RT ln e ⎡ ⎣ ∫ D ( ) E ⎤ β τ τ − ≈ − M ⎦ d RT ln ⎡ ⎢ V ⎢ ⎢ ⎢ ⎣ N ∑ i β − E r ( ) i e N ⎤ ⎥ ⎥ ⎥ ⎥ ⎦ (2.5) The development of the MT algorithm is inspired by the idea of the MCI approach expressed in equation 2.5, where the Helmholtz free energy is simulated using the average of the sampled energy states multiplied by the actual sampling volume. The distinctive feature of the 48 MT method is that it numerically simulates the average of the local partition function given a defined sampling volume centered around an initial structure, instead of searching among actual physical structures within that defined volume. In our first MT algorithm publication, a matrix- based random sampling strategy combining every atom pairwise potential against each target molecular system was introduced, in which all pairwise potentials are regarded as orthogonal and a total random combination among the atom pairwise distances were performed within a small range of sampling (±0.5 Å for every atom pairwise distances). The generated hyper-dimensional energy states were associated with pre-modeled structural weighting factors and averaged over their sampling magnitude CN, where C is the defined sampling range and N is the pairwise contact number. This approach is to simulate the average of the actual physical energy states using the more easily constructed virtual states. Given an N-particle physical space, a quantitative description of the ensemble volume is written as: V = ! dτ1 !dτN (2.6) ∫ D∫ where, τ1 to τN are the coordinates of all the particles and D is the domain of definition for all of the particle coordinates. The ensemble volume is under exponential growth as the number of dimensions increases. The MT algorithm uses a distance-based coordinate system in order to better estimate the ensemble volume, and importantly, this approach is well suited to the MT method, where each distance is associated with a pair potential value chosen from a pre-built lookup table allowing us to simultaneously generate the structures and energies for a given system. [4] The feature of our method that sets it apart from other traditional methods is that it simultaneously generates the structure and its local free energy by simulating the partition 49 function within a defined ensemble volume by multiplying the pair potential matrices as summarized in equations 2.2-2.4 through all atom pairwise potentials in the molecular system. Atom pairwise energies stored in the final partition function matrix (ZM) represent the virtual hyper-dimensional energy states given the defined sampling range regarding each generated conformation. The free energy is then calculated using the ensemble volume and ZM. /ZM = /Z1 i /Z2 i /Z3 i!i /Zn A =− RT ln ( ) β τ − E V e ⎡ ⎣ ⎤ =− ⎦ RT ln ⎡ V ⎢ ⎣ Ζ/ M m n × ⎤ ⎥ ⎦ (2.7) (2.8) where, β is the Boltzmann constant, E(τ) is the molecular energy as a function of geometric variable τ, m and n represent the number of rows and columns of the Z-matrix. For further detailed explanation of the free energy calculation and the solvation model, please see Zheng et al. and Pan et al.[6, 7, 8] 2.7. Solvation free energy Solvation free energy is calculated using the implicit solvation model KECSA-Movable Type Implicit Solvation Model (KMTISM).[5] KMTISM is a semi-continuum solvation model, which places water molecules around the solute as isotropic rigid balls with the van der Waals radii of 1.6 Å. The water molecules are placed in incremented isometric layers of 0.005 Å up to 8 Å from the solute’s van der Waals surface. The solvation free energy change for the protein- ligand binding process is separated into two components, representing the translation of the solutes and water molecules during the binding process. During the protein-ligand binding in solution, the ligand replaces a number of water molecules in the protein-binding site, with 50 respect to the volume occupied by the ligand. At the same time, these water molecules lost upon ligand binding are transferred into bulk. Figure 2.4 highlights the pictorial representation of the change in solvation free energy during the protein-ligand binding process. Figure 2.4 Illustration of the solvation free energy change during protein-ligand binding. The red arrow shows the desolvation and placement of the ligand into the protein-binding site. This process replaces water molecules bound in the protein-binding site that are displaced by the ligand volume. The green arrow shows the dissociation and solvation of these water molecules during the binding procedure. Given these two simultaneous processes, the solvation free energy change is calculated using the following equation: ΔGsol = ΔGsol PL+ΔGsol water (2.9) where, 51 ΔGsol PL= 1 −β log ⎡ ∑ ⎢ ⎣ i exp −β Esol PL ( ( )i ⎤ ) ⎥− ⎦ 1 −β log ⎡ ∑ ⎢ ⎢ ⎣ j exp −β Esol P ( ( ) j ) ⎤ ⎥ ⎥ ⎦ − 1 −β log ⎡ ∑ ⎢ ⎣ k exp −β Esol L ( ( )k ) ⎤ ⎥ ⎦ (2.10) PL ΔGsol represents the solvation free energy change with respect to the solute-solvent interactions, and: )i ⎤ ) ⎥− ⎦ 1 −β log ⎡ ∑ ⎢ ⎢ ⎣ j ΔGsol water = 1 −β log ⎡ ∑ ⎢ ⎣ i exp −β Esol water ( ( exp −β Ereceptor water ( ( ) j ) ⎤ ⎥ ⎥ ⎦ (2.11) water ΔGsol represents the solvation free energy change with respect to removing water molecules at the binding site and placing them into solvent. solE PL , solE P and solE L are interaction potential energies for the protein-ligand complex, protein and ligand in contact with solvent respectively; solE water and E water receptor are interaction potential energies for the water molecules being replaced representing their contacts with solvent and with the protein receptor, respectively. Boltzmann factors for all the sampled states are summed to simulate the corresponding partition functions. Free energy changes are then calculated using the logarithms of the summed Boltzmann factors using the mechanics described above for the free energy calculation. 52 REFERENCES 53 REFERENCES 1. Jacobson, M.P., et al., A hierarchical approach to all-atom protein loop prediction. Proteins-Structure Function and Bioinformatics, 2004. 55(2): p. 351-367. 2. Valleau, J.P. and D.N. Card, Monte-Carlo Estimation of Free-Energy by Multistage Sampling. Journal of Chemical Physics, 1972. 57(12): p. 5457-&. 3. Jorgensen, W.L. and C. Ravimohan, Monte-Carlo Simulation of Differences in Free- Energies of Hydration. Journal of Chemical Physics, 1985. 83(6): p. 3050-3054. 4. Zheng, Z. and K.M. Merz, Development of the Knowledge-Based and Empirical Combined Scoring Algorithm (KECSA) To Score Protein-Ligand Interactions. Journal of Chemical Information and Modeling, 2013. 53(5): p. 1073-1083. 5. Zheng, Z., et al., KECSA-Movable Type Implicit Solvation Model (KMTISM). Journal of Chemical Theory and Computation, 2015. 11(2): p. 667-682. 6. Zheng, Z., M.N. Ucisik, and K.M. Merz, Jr., The Movable Type Method Applied to Protein-Ligand Binding. J Chem Theory Comput, 2013. 9(12): p. 5526-5538. 7. Pan, L.L., et al., Free Energy-Based Conformational Search Algorithm Using the Movable Type Sampling Method. Journal of Chemical Theory and Computation, 2015. 11(12): p. 5853-5864. 8. Zheng, Z., et al., KECSA-Movable Type Implicit Solvation Model (KMTISM). Journal of Chemical Theory and Computation, 2015. 11(2): p. 667-682. 54 CHAPTER 3 Incorporation of Side Chain Flexibility into Protein Binding Pockets using MTflex † Reprinted (adapted) with permission from Bansal, N.; Zheng, Z.; Merz, K. M., Incorporation of side chain flexibility into protein binding pockets using MTflex. Bioorgan Med Chem 2016, 24, 4978-4987. 55 3.1 Abstract The plasticity of active sites plays a significant role in drug recognition and binding, but the accurate incorporation of “receptor flexibility” remains a significant computational challenge. Many approaches have been put forward to address receptor flexibility in docking studies by generating relevant ensembles on the energy surface, but, herein, we describe a method (Movable Type with flexibility (MTflex)) that generates ensembles on the more relevant free energy surface in a computationally tractable manner. This novel approach enumerates conformational states based on side chain flexibility and then estimates their relative free energies using the MT methodology. The resultant conformational states can then be used in subsequent docking and scoring exercises. In particular, we demonstrate that using the MTflex ensembles improves MT’s ability to predict binding free energies over docking only to the crystal structure. 3.2 Introduction Gaining a deep understanding of the interactions involved in the protein-ligand binding process has been of great interest for many decades because of its importance in a wide range of biochemical processes occurring inside living organisms. To explain this process several theories have been put forward of which the most widely accepted ones include: lock and key[1], induced fit[2], and the conformational selection[3] model. The former dates back to 1854, when Emil Fischer first proposed the lock and key model where the active site was viewed as structurally immutable. His theory remained popular until experiments reported changes in the receptor structure upon ligand binding.[4-8] From these observations, Koshland proposed the induced fit model while Nussinov proposed the conformational selection model, which incorporated 56 receptor flexibility. The induced fit model proposed that ligand binding induced conformational changes in the receptor; the conformational selection model assumes that the protein has an ensemble of states and that the ligand selectively picks one of these.[9, 10] Altogether experimental and theoretical considerations have modified our perception of how a ligand recognizes a biological receptor. Today, proteins are considered as inherently flexible that possess a range of intrinsic motions.[5] The inherent flexibility of proteins is attributed to the numerous non-covalent interactions that exist in their native states. They possess a variety of motions, which range from vibrational fluctuations up to large backbone rearrangements, which taken altogether generate a conformational ensemble for a given protein system.[5, 11-15] After much debate, it has been posited that proteins exist in multiple conformations in solution while available crystal structures represent just one member of the entire conformational ensemble. [5, 9, 11, 16-18] Computational chemists have expended enormous intellectual capital to capture the plasticity of proteins.[19, 20] The conformational space spanned by proteins can be quite substantial making addressing this issue a highly challenging problem.[21-25] Even in the face of these challenges, numerous methods have been developed to account for protein flexibility in protein-ligand binding studies using simulations or docking methodologies.[26] Soft-receptor docking methods represent some of the first attempts to address this issue. These methods enumerated a very limited or localized flexibility by softening the van der Waals potentials, which allow for small steric clashes between the protein and ligand molecule.[27, 28] Rotamer library exploration is another way to approach partial flexibility by including side chain motions on the basis of allowable states of the rotatable bonds of select active site residues.[29-32] Ensemble docking is another class of docking methods that employs multiple receptor 57 conformations to account for protein flexibility prior to docking. These multiple conformations are generated by either computer simulations based on Monte Carlo, conventional or accelerated molecular dynamics[33-36], Normal Mode Analysis[37-40], homology models[41], or by collecting experimental structures using NMR or X-Ray[42-44]. Few of the known docking algorithms that use ensemble based docking are Autodock[45, 46], ITScore[47, 48], IFREDA[49], MedusaDock[50], RosettaBackrub[51], FlexE[52], BP-Dock[53] and MDock[44]. Induced fit docking (IFD) is another major category of docking algorithms that account for a certain degree of receptor flexibility. Few examples of docking algorithms that use IFD are Autodock Vina[54], Autodock4[45, 55], GOLD[56, 57], GLIDE[58], RosettaLigand[59, 60], ICM[39, 49], FLIPDock[61, 62], DOCK6.0[63], SLIDE[64-66], ReFlexin[67], GalaxyDock[68], FITTED[69-71], PC-RELAX[72, 73], Flesky[74], FiberDock[75], hinge belt docking algorithms[76-79] and Adaptive BP-Dock.[80] Other methods include introducing flexibility in the receptor structures using hybrid methods like the linear interaction energy (LIE)[81, 82], Relaxed complex scheme (RCS)[83, 84], dynamic pharmacophore modeling[85, 86] and single step perturbation.[87-89] An extensive overview of all receptor flexibility methods and docking software packages is provided in recent reviews.[26, 90-94] One drawback of docking methods is the difficulty of obtaining a converged partition function within a limited structural ensemble. Monte Carlo and molecular dynamics simulations (and their variations, e.g., Markov Chain Monte Carlo, replica exchange, etc.) can delineate more of the relevant ensemble, but in general, at an increased computational cost. Recently we have described a novel free energy calculation method called the Movable Type (MT) method.[95] Using its mechanics of parallel conformation generation, we have developed MTflex and an associated program for fast protein conformation generation and energy 58 sampling. The present manuscript introduces a method to sample the side chain conformations within a protein active site via the MTflex methodology generating an ensemble of structures that can be subsequently used in ensemble docking studies. 3.3 Results and Discussion In the current study, we applied the MTflex method to side-chain conformational ensemble generation against holo protein crystal structures, which was then followed by docking and scoring to the resultant structure. In particular, we explored the following: (1) side chain flexibility and protein free energy variations with fixed backbone conformations in the binding pocket area; (2) binding mode predictions and binding free energy predictions using protein structures with different side chain conformations. The validation benchmark set involved 159 protein-ligand crystal structures from the PDBbind v2014 core database after excluding proteins having metal ions in the binding pockets area.[96, 97] The core dataset consists of high- resolution structures with a wide distribution of pKd values. 3.3.1. Generation of the protein side chain conformations and the relative free energies using MTflex The binding pocket’s residues were identified as those lying in a 6-Å vicinity of the co- crystal native ligands. All the heavy atoms in a residue were used to identify a binding pocket. If a residue had one or more than one heavy atoms (be it a side chain or backbone atom) within 6 Å of the native ligand, then the entire residue was considered to be a part of the binding pocket. Isoleucine and threonine are chiral at the β-carbon position, which is the first heavy atom of the 59 side chains of all amino acids except Glycine. Chirality was considered for these two amino acids for the generation of side chains conformers using MTflex. Ring planarity of aromatic amino acids was also enforced. Disulphide bonds were also kept intact. Steric clashes within the generated binding pocket and with the rest of the protein were avoided by the use of a geometric restraint of 2.3 Å for the non-bonded heavy atoms with the capacity to form hydrogen bonds and 2.8 Å for the non-bonded non-hydrogen heavy atoms. The MTflex conformations were generated in-vacuo (no explicit or implicit water model). The MTflex code was written using MATLAB R2015a software.[98] The code was run using a single CPU processor on the Intel14 cluster provided by the High Performance Computer Center (HPCC) facility at MSU. Each node consists of two 2.5 GHz 10-core Intel Xeon Ivy Bridge E5-2670v2 processors and 512 GB memory per node. The performance of the code varied based on several factors including the- i) type of residue, ii) number of residues in the binding pocket, and iii) the crowding of the binding pocket. We did an averaging of the computation time required to generate MTflex conformers for the 17 amino acids (excluding Gly, Pro and Ala) over 159 systems. Table 3.2 lists the average CPU time (in seconds) required to generate the MTflex conformers for each amino acids. Currently, we are in the process of transitioning the code to C++ to facilitate distribution and to improve the computational performance. The application of MTflex with this criteria and restraints generated a varied number of conformers for each system depending on the nature of the binding pocket. The magnitude of the number of conformations varies from ~10 conformations for the crowded and compact binding pockets to thousands of poses. In total, we generated 38397 conformers for 159 systems. After generation of the conformers, we used a RMSD criterion of 0.5 Å to select the final set of 60 receptor conformations. The first conformation was always selected. For the remaining conformations, their RMSDs were compared to all the preceding selected conformations. If a conformation’s RMSD was <= 0.5 Å with respect to any of the already selected conformations, it was rejected otherwise it was accepted. The list containing the numbers of MTflex conformers retained for each system is provided in Table 3.3 of the supporting information. The RMSDs (Å) of the MTflex generated conformers were calculated with respect to the crystal structure for all systems. RMSDs were obtained by an in-house code to make sure that MTflex conformers were orientated for the best fit to the crystal structure. The code was written using MATLAB R2015a. [98] The details of the algorithm are provided in Section 3.5.1 of the supporting information. The top panel of Figure 3.1 shows the deviation of the MTflex conformers from each crystal structure in the validation set. It is clear that more conformations were generated for more flexible binding pockets (RMSD distribution range as the gap between the green line and the blue line shown in Figure 3.1). The distribution of the lowest RMSD conformers (the blue line in top panel of Figure 3.1) is from 0.05 Å to 1.15 Å representing how MTflex reproduces structures relative to the crystal structures in the validation set. 61 Figure 3.1 The top panel represents the minimum, maximum and RMSD range of MTflex conformers generated for all 159 systems. Dark blue curve denotes the lowest RMSD, Green curve denotes the highest RMSD conformer and black stripes represent the RMSD ranges of MTflex conformers for all 159 systems. The lower panel represents the percentage (%) of native contacts in MTflex conformers with respect to the crystal-binding pocket. Dark blue curve shows the maximum %, Green curve symbolizes the minimum % and black stripes represent the range of native contacts in MTflex conformers. In general, binding pockets do not afford much space to work with in terms of reorienting the side chains. Indeed, any large-scale movement of a side chain would be rejected because of 62 van der Waals collisions with the rest of the protein. All of the systems studied had their lowest RMSDs less than 1.15 Å with the minimum value being ~0.05 Å. However, for some systems the side chain RMSDs were higher with the maximum as high as 4.91Å. On closer examination, we observed that the source of the large RMSD values was the rotation of aromatic side chains in the receptor’s binding pocket. RMSD is a good criterion to measure the deviation from a native structure, but we realized that they were biased somewhat due to the rotation of aromatic side chain in the binding pocket. Hence, in order to further understand the overall mobility of the binding pocket, we also calculated the percentage of native contacts in MTflex conformers with respect to the binding pocket of the crystal structure (refer to the lower panel of Figure 3.1). The percentage of native contacts reflects the number of native contacts retained in the MTflex conformers over the total number of contacts present in the crystal protein structure. The number of native contacts was calculated with a distance cutoff of 6.0 Å using the CPPTRAJ utility of Amber tools (2015).[99] It can be seen from the lower panel of Figure 3.1 that the maximum percentage of native contacts for all 159 systems of the validation dataset is within 92.6-99.6%, while the minimum percentage is as low as 76.7%. Given the fixed backbone structures, the gap between the minimum and the maximum percentage of native contacts indicates the magnitude of flexibility of the side chains in the binding pockets according to the MTflex method. Shown in Figure 3.2, two protein structures, with PDBIDs 3UEU and 1R5Y, were selected from the validation dataset as examples to illustrate the positioning of the side chains produced by the MTflex method. These two structures had the largest gaps in the percentage of native contacts (according to the distances between the blue and green lines in Figure 3.1), while at the same time the MTflex generated conformations covered the native structures for the two 63 proteins (lowest 0.50 Å RMSD and 0.17 Å RMSD and maximum native contacts of 97% for 3UEU and 99% for 1R5Y, respectively). The largest deviations for these two proteins have RMSDs of 2.8Å for 3UEU and 3.51Å for 1R5Y, with the corresponding percentage of native contacts being 76.7% for 3UEU and 79.4% for 1R5Y. Table S2 lists the minimum and maximum RMSDs for each system along with the PDBID of each system. Figure 3.2 It shows the side chain of binding pocket region of PDB ID 3UEU (left) and 1R5Y (right). The crystal-binding pocket represented in pink licorice conformation is superimposed with the MTflex conformer with the maximum percentage of native contacts (tan) and MTflex conformer with minimum percentage of native contacts (cyan color). The highest and lowest free energy differences among the protein conformations were obtained determined and the top panel of Figure 3.3 illustrates the range of free energies obtained for MTflex binding pocket conformations relative to the holo crystal conformation. The blue curve represents the free energy difference between the most “relaxed” (lowest free energy) MTflex binding pocket conformation and the holo crystal conformation and the red curve between the highest free energy and the crystal structure for all 159 systems in the validation dataset. Approximately, 68.5% of the systems in the validation dataset a MTflex generated conformation gave a free energy lower than the crystal structure, while for 31.5% of the systems all MTflex generated conformations had higher free energies than the crystal structure. To dig into this 64 more, we lumped the 68.5% into Region 1 and the rest in Region 2. We estimated the net difference in free energies between the highest free energy and the lowest free energy MTflex conformation for each system of the validation dataset and observed that the average difference of free energies for all the systems in Region 1 is 4.31 kcal/mol and for Region 2 is 3.9 kcal/mol. We also calculated the minimum and maximum RMSDs (Å) of the MTflex conformations for each system (plotted in the lower panel of Figure 3.3) and found out that the net RMSD difference between the minimum and maximum RMSD conformations in Region 1 is 2.21 Å while in Region 2 is 1.99 Å. The narrower range of free energy and the constricted RMSD difference between minimum and maximum RMSD conformations in Region 2 suggests conformational sampling is restricted in Region 2. The restricted conformational sampling suggests two possible scenarios-a) the crystal-binding pockets in Region 2 are not flexible in nature or, b) partial relaxation of the crystal binding pocket with only side chain flexibility is not sufficient for the systems in Region 2. Alternatively, issues with our statistical energy function can give the observed behavior. Because there are no hard and fast rules governing whether or not pocket sampling across a series of proteins should yield a certain percentage of lower or higher free energies conformations for a given pocket we cannot fully assess the performance of our approach. 65 Figure 3.3 The upper panel represents the relative free energy (kcal/mol) between the lowest free energy MTflex conformer and the crystal structure (blue curve) and the highest free energy MTflex conformer and crystal structure (red curve) for all 159 systems in the validation dataset. The lower panel represents the corresponding RMSDs (Å) of the minimum RMSD MTflex conformer (purple) superimposed with the maximum RMSD MTflex conformer (green). 3.3.2. Ligand docking and scoring After generating the MTflex conformers, the next task was to dock the ligands into their respective binding pocket. Ligands were docked exhaustively into all of the receptor poses 66 generated by MTflex for each of the 159 systems using GLIDE version 6.1 in the Schrodinger 2013-3 suite.[100-102] The Protein Preparation wizard utility, accessible from the Maestro interface of the Schrodinger 2013-3 suite was used to process the receptor structures.[100, 103] Protonation was performed at a pH-7.0.[104, 105] The hydrogen atom positions were optimized using the OPLS 2005 force field. The optimization was done in the absence of ligand.[106, 107] Crystal waters, if present, were removed from the receptor structures. Minimization was done only on the hydrogen atoms and the heavy atoms were kept fixed. Docking was done using the standard precision (SP) methodology of Glide version 6.1. The identical procedure was followed to dock ligands into the crystal structure for the purpose of comparison. The top five scored docked ligand poses were retained. In the current study, the Glide SP method was applied to ensure that a certain number of binding modes were been generated to cover as broad a range as possible of the conformational space for the following free energy simulations. We obtained the ligand heavy atom RMSD (Å) after docking the native ligand into the binding pocket from the crystal structure and also into the MTflex binding pocket(s) for all 159 systems. These RMSDs were calculated with respect to the native ligand geometry in the crystal structure. The ligand heavy atom RMSD values were obtained from the Glide report generated after docking the ligands into the respective binding pockets. Figure 3.4 gives a detailed pictorial analysis of the comparison of ligand RMSD (Å) between the ligand docked into the crystal structure and the ligand docked into the MTflex generated binding pockets for all 159 systems. Ligand RMSDs reduced for ~ 78% of the systems after docking the native ligands into the MTflex binding pockets as compared to the docked poses in the crystal structure binding pocket. The average RMSD of ligands over 159 systems docked into the MTflex binding pocket was reduced to around 1.31 Å as opposed to the ligands docked into the crystal binding pockets with the 67 average being around ~2.11 Å. The reduction in average RMSD (Å) of the ligands supports the notion that inclusion of receptor flexibility is important in computer aided drug design campaigns. Figure 3.4 Ligand RMSD (Å) for all 159 systems after docking the native ligand into the crystal receptor’s binding pocket (red) and into the MTflex binding pocket (green). Lowest RMSD Conformations are used for comparison in both the datasets. The blue, yellow and the black columns represent the PDBIDs: 2R23, 3PXF and 1U33, which will be discussed in text. A table containing the RMSD values (Å) of the minimum RMSD docked ligand conformers in both the MTflex and binding pockets from the crystal structure for all 159 systems is provided in supporting information as Table 3.4. It can be seen from Figure 3.4 that for some of the systems, the RMSD difference is quite large (~7-8 Å) and two such cases are highlighted in Figure 3.4 in blue (PDBID 2R23) and black (PDBID 1U33). These two PDBIDs were picked to show the contradictory affect of docking ligand in the MTflex binding pocket. 68 Figure 3.5 shows the superimposed images of the ligand in the crystal structure (orange), lowest RMSD docked ligand pose into the crystal structure binding pocket (blue) and the lowest RMSD docked ligand pose in the MTflex binding pocket (green) for both PDBIDs. For PDBID 2R23 (left), the ligand docked in crystal structure is quite far apart from the ligand position in the crystal structure with an RMSD of ~7.3 Å. However, the ligand docked in the MTflex binding pocket is nearly coincident with the ligand in the crystal structure with a RMSD of 0.83 Å. The situation is reversed for PDBID 1U33 (right side of Figure 3.5), where the lowest RMSD crystal structure docked ligand pose is well aligned to the ligand in the crystal structure with a RMSD of 0.54 Å while the lowest RMSD MTflex docked ligand pose is 8.20 Å. Figure 3.5 It shows the crystal complex ligand (orange), lowest RMSD docked ligand pose in the crystal binding pocket (blue) and the lowest RMSD docked ligand pose in the MTflex binding pocket (green) superimposed onto each other for PDB IDs: 2R23 and 1U33 (from left to right). The background represents the binding pocket area of the respective protein. We visualized the 2-D ligand interaction diagram (LID) in Maestro to look for the protein-ligand interactions in the crystal structure, ligand docked into the crystal structure and the ligand docked into MTflex derived binding pocket for both PDBIDs. LIDs are shown in Figure 3.9 for PDBID: 2R23 (top panel) and 1U33 (lower panel). In both the panels, the left image displays 69 the protein ligand interactions found in the crystal structure, the center image is for the ligand docked into the crystal structure and the right image is for the ligand docked into the MTflex derived binding pocket. We can establish from the LIDs that Glide is looking for a maximum number of contacts between the binding site residues and ligands in order to find an optimized position for the bound ligand. The number of contacts between the protein from the crystal structure and the docked ligand for PDB ID 2R23 (see top panel) are far more than the number of contacts found in the crystal structure. We anticipate that this is why the ligand docked in the protein from the crystal structure has a large RMSD from the position of the ligand found in the crystal structure. However, the side chains of the Arginine (Arg) residue (Resid 30) and Asparagine (Asn) (Resid 52 of chain D) are rotated in the MTflex binding pocket (refer to the left panel of Figure 3.10) yielding a net RMSD of 1.86 Å for the MTflex binding pocket with respect to the pocket found in the crystal structure. These rotations have established a new H-bond between the side chain of the Asn and the ligand and have restored the original salt bridge between the Arg and the ligand, keeping the position of the docked ligand nearly the same as found for the ligand in the crystal complex. For PDBID 1U33 (lower panel), Glide identifies very similar interactions between the docked ligand and crystal protein as that of the crystal complex, thereby finding the optimized position of the docked ligand almost overlapping with the ligand in crystal complex with the RMSD of 0.54 Å. But in the MTflex binding pocket, even for the lowest RMSD MTflex conformer (0.33 Å), the side chain of the Glutamate (Glu) residue (Resid 233) is tilted slightly such that it is blocking the position of the native ligand pose found in the crystal structure (see the right panel of Figure 3.10). Due to the slight change of the Glu residue coupled with few other side chain rotations, the docked position of the ligand for the MTflex binding pocket is far from that found for the ligand in the crystal structure. 70 For further comparison, we correlated the Glide scores of the docked crystal and MTflex ligands with the experimental binding affinities (pKd/pKi). To be consistent with the energy unit used by Glide, we converted experimental pKd/pKi values to ΔG in kcal/mole. We compared Glide’s score for the best docking solution for both cases across all 159 systems. Correlation of Glide’s score for the top docking solution in the crystal structure generated a Pearson’s R of 0.39 and a Spearman’s rank correlation coefficient (Spearman’s rho) of 0.43 while the correlation of Glide’s top docking score across all MTflex conformations generated a Pearson’s R of 0.47 and a Spearman’s rho of 0.50. The Root mean square error (RMSE) (kcal/mol) reduced from 3.21 to 2.70 kcal/mol and the Mean unassigned error (MUE) (kcal/mol) from 2.34 to 2.17. The table listing all the statistical parameters for the best solution using Glide’s score is provided in Table 3.5 of the supporting information. The correlation between Glide’s top docking score for the crystal receptor and for the MTflex conformers generated a Spearman’s Rho of 0.82 and Pearson’s R of 0.86. Overall, the Glide score for docking the ligand into MTflex binding pocket improved in terms of both error and correlation when compared to docking the ligand into the crystal structure. By this analysis, it was observed that MTflex generated multiple conformers serve as better input than a single crystal structure “seed” on which ligand scoring and docking is performed. We also re-ranked Glide’s docking solutions based on the relative free energy scale obtained for the MTflex conformations in the ligand-unbound states. After re-ranking, we obtained the correlation between the Glide’s score for the lowest free energy MTflex conformer (in the ligand-unbound state) and the experimental binding affinities (the statistical parameters are present in Table 3.5). The correlation was quite poor with Pearson’s R and Spearman Rho values of 0.37 and 0.38, respectively. The free energy errors were also higher with a RMSE of 71 3.17 kcal/mol and MUE of 2.43 kcal/mol. All the statistical parameters for correlation were poor as compared to the Glide’s best score for crystal and MTflex docking, which is understandable because the ranking of conformations on the free energy scale can be entirely different before and after ligand binding. A ligand does not necessarily have the strongest affinity towards the lowest free energy receptor conformation. It can bind with more affinity to a higher free energy conformation on a free energy scale and change the overall ranking of the conformations in the bound state as compared to the ligand unbound state. Figure 3.6 shows the relative free energy between the MTflex lowest free energy conformation in the ligand bound state and the MTflex lowest free energy conformations in the ligand-unbound state for all 159 systems of the validation dataset. The MTflex lowest free energy conformation in the ligand-unbound state is the reference state (0 kcal/mol) for all systems. Based on our analysis, we found that only for ~31% of the systems, the MTflex lowest free energy conformer in the ligand bound state was also the lowest free energy MTflex conformer in the unbound state. However, no particular trend was observed. To give a pictorial overview of the free energy scales, we include Figure 3.11 in supporting information. It shows the position of the MTflex lowest free energy conformation in the ligand bound state on the MTflex free energy scale generated for four random systems from our validation dataset (PDBIDs: 10gs, 2fvd, 2qbr and 2zjw). It can be also be seen from Figure 3.11 that the position of the lowest free energy conformation in the ligand bound state (marked as a red square) is different for each of the systems and does not show any particular preference to a particular rank on the free energy scale. 72 Figure 3.6 The position of the MTflex lowest free energy conformation in the ligand bound state on the free energy scale obtained for the MTflex conformations in the ligand-unbound state is shown. The lowest free energy MTflex conformation in the unbound state is used as reference (set to 0 kcal/mol). 3.3.3. Multi seed versus rigid receptor free energy calculations Using the MT method, we calculated the binding affinity (ΔG in kcal/mole) of all the protein-ligand complexes obtained after docking ligands into the MTflex generated binding pockets and also into the crystal structure pockets. We also calculated the binding affinity of the crystal protein-ligand complex for the entire validation dataset and incorporated it as one of the conformers of the MTflex ensemble. For the sake of simplicity, we will denote the experimental protein-ligand complex as PCLC, the crystal protein-docked ligand complex as PCLD and the MTflex protein docked ligand complex as PMTLD. We calculated binding affinities of (average over all the docked solutions of the crystal protein docked ligand complex) and (average over all the docked solutions of 73 each MTflex conformer) for each of the 159 systems from the validation dataset and correlated them with the experimental ΔG values. The binding affinity prediction for yielded a Pearson’s R of 0.58 and Spearman’s Rho of 0.60. The correlation of the average binding affinity for PMTLD including the full range () showed an improvement over the crystal structure docked structures in all statistical parameters including Pearson’s R, Spearman’s Rho, RMSE and MUE (see Table 3.1). The correlation coefficient, Pearson’s R showed a modest improvement with a value of 0.65 and Spearman’s Rho with a value of 0.69. The errors decreased with the RMSE dropping from 3.37 to 2.83 kcal/mol and the MUE went from 2.74 to 2.27 kcal/mol. The RMSE and MUE drops are significant observations because they affirm the relevance of our sampling space. Figure 3.7 shows the superposition of the binding affinities (kcal/mol) of and along with the experimental binding affinities for all 159 systems in the validation set. We observed that for approximately 63.3% of the validation dataset, the binding affinities of the complexes lie within ±3 kcal/mol of the experimental binding affinities while the percentage increases to 75.4% for the complexes. The overall improvement in binding affinity prediction over the complex highlights the significance of using a multi-seed strategy over single receptor structures in MT based binding affinity prediction. 74 Figure 3.7 It shows the absolute binding affinities (kcal/mol) of and superimposed with the experimental binding affinity (kcal/mol). ± 3kcal/mol error window is extended for the experimental binding affinities. As already described in the method section, our method is based on a Monte Carlo integration. The accuracy of any MCI approach relies on the selection of the structural ensemble. The improvement of the binding affinity prediction validates that our sampling set does possess some of the key states of the conformational ensemble beyond the crystal structure. Nonetheless, some less significant conformations are also generated and the overall calculation would benefit from their elimination. The goal here was to find a converged ensemble out of the larger pool of conformers generated by MTflex. To prune out the high-energy conformers and/or the very low populated states from our sampling space, we calculated the free energy difference (ΔΔGbind) between all the docked solutions of each MTflex conformer in the PMTLD complexes 75 (PMTLD-All) and the PCLC for each system using equation 3.1, where n refers to the number of complexes in PMTLD-All for each system in the validation dataset. ΔΔGn bind = ΔGn PMTLD - ΔGPCLC (3.1) The free energy difference provided us with a scale to rank our conformations from the lowest to highest binding affinities with respect to the crystal structure. The crystal protein ligand complex was incorporated into the PMTLD-All ensemble as one of the conformers. The free energy scale for all the 159 systems of the validation dataset is represented in Figure 3.8. The lowest free energy MT protein ligand complex (PMTLD-Lowest) for each system is highlighted by the orange curve. Based on the free energy scale, approximately 70.4% of the systems in the validation dataset have lower free energy PMTLD-Lowest conformations (lower free energy binding modes) than the crystal complex. The observed free energy differences are not remarkable owing to the fact that we only introduced side chain flexibility. However, we hypothesize that if we add backbone flexibility on top of side chain flexibility, we would likely span a broader free energy range. We compared the binding affinity of the lowest free energy MT protein ligand complex (PMTLD-Lowest) by correlating it with the experimental binding affinity (kcal/mol). The correlation parameter for PMTLD-Lowest showed a slight improvement over the prediction for the complex with the Pearson’s R incrementing from 0.58 to 0.64 and Spearman’s Rho from 0.60 to 0.68. The errors were reduced substantially with the RMSE reducing from 3.37 to 2.72 kcal/mol and the MUE dropping from 2.74 to 2.14 kcal/mol. In comparison to the , the correlation trends for PMTLD-Lowest did not show any improvement but both the RMSE and MUE (kcal/mol) reduced. 76 Figure 3.8 Free energy scale of PMTLD complex for all 159 systems in the validation dataset. Approximately ~70.4% of the systems have lower free energy than the crystal complex. Based on the free energy difference scale, we also selected several subsets of the conformations and correlated them with the experimental binding affinities. The subsets were formed by collecting PL complexes starting from the lowest up to a selected free energy difference endpoint from the crystal structure. We calculated the average binding affinity for all of the subsets and referred to them as PMTLD-set 1 to set 5, where PMTLD-set1 represents ΔG’s up to 1 kcal/mole higher in free energy than the free energy of the crystal structure, set 2 denotes ΔG’s up to 2 kcal/mole higher in free energy and so on. The selection of set 1 to set 5 is a qualitative approach to explore how the ensemble affects the computed results. In terms of correlation, PMTLD set 1-5 generated nearly identical trends with the Pearson’s R correlation coefficient being ~0.65 and the Spearman’s Rho being ~0.69. The errors increased slightly on 77 going from PMTLD set 1-5 as more noise was added to the ensemble. The errors had small differences between set 1 and 2 and between set 3,4 and 5, so we have only shown PMTLD set 2 and set 4 in Table 3.1 to give a general sense of the outcome. It can be seen from Table 3.1 that the lowest free energy MT complex (PMTLD-Lowest) gives the lowest RMSE and MUE (kcal/mol) from the experimental binding affinities. After that there is a slight increase in both RMSE and MUE as the cut off is increased for all of the PMTLD ensembles. For the PMTLD set 1-5, the error reduction is better in comparison to the binding affinity prediction found for (including the full range). Since, it is an ensemble-based study, error reductions are a more accurate measure to assess the precision of any method rather than the correlation coefficients. Table 3.1 Statistical data for the ΔG’S evaluated using the crystal protein docked ligand complex, MTflex lowest free energy complex, MTflex with conformations selected up to 2 kcal/mol and 4 kcal/mol higher in binding affinity than the crystal structure and MTflex average (including the full range) relative to the experimental binding affinity. Pearson’s R PMTLD-Lowest PMTLD-set2 PMTLD-set4 < PMTLD > 0.58 0.64 0.65 0.64 0.65 rho 0.60 0.68 0.69 0.69 0.69 Spearman’s RMSE (kcal/mol) MUE (kcal/mol) 2.74 2.14 2.21 2.23 2.27 3.37 2.72 2.75 2.77 2.83 The results for PMTLD ensembles are certainly far better both in terms of error and correlation than single docking into a receptor site (PCLD). Overall, the statistical comparison of the PMTLD-Lowest and PMTLD sets with the experimental binding affinity enhanced the improvement over the results (including the full range) indicating that by eliminating the low probability conformations we can improve the accuracy of the modeling. This analysis further strengthens our hypothesis that with MTflex, we can obtain better binding modes, which are difficult to attain with single receptor docking. 78 An ensemble of conformations can be easily generated with the major sampling methods available, but the problem lies in the identification of the conformations that have the lowest free energy and, hence, contribute the most to the ensemble. With MTflex, the identification of the lowest free energy conformations becomes feasible. The improved correlation and reduced RMSE (kcal/mole) after elimination of the higher energy conformations further illustrates the importance of generating relevant conformations. It also nicely demonstrates the significance of incorporating receptor flexibility yielding multiple seed structures that can be used in MT scoring. Moreover this multi-seed strategy shows significant improvement over traditional rigid/crystal structure docking. 3.4 Conclusions Proteins undergo a wide variety of motions ranging from ultrafast vibrations to long- range backbone motions. All these motions help in generating multiple conformations of the protein, thus giving an ensemble of structures. Consideration of all the relevant receptor poses is extremely important in the field of structure based drug design to garner a detailed picture of the binding pocket and its interactions with guest molecules. MTflex, unlike other sampling methods[15, 19, 20, 27, 28, 30-33, 36, 45, 58-60, 67, 84, 108], generate conformations on a free energy surface and as we show these structures seem to impart relevance to binding free energy prediction. We calculate the Boltzmann factor contribution of each torsional interaction, which allows us to create the ensemble on a free energy surface. Thus, along with the crystal structure, we can include several relevant (i.e., low free energy) receptor poses as the starting point or seed for calculating protein-ligand binding free energies. By using MTflex generated multiple conformers, we have shown that Glide generates 79 better binding modes (lower RMSD conformers with respect to the ligand in crystal complex) and improved correlation with experimental binding affinities by using MTflex generated multiple conformers rather than a single crystal structure “seed”. Apart from Glide scoring, we also used our MT scoring method and validated that scoring results improve via the inclusion of the MTflex generated ensemble. Overall, via side chain conformational sampling, we have shown that docking with multiple receptor poses or seeds gives better correlation with the experimental data than docking with a single structure. This viewpoint is consistent with the consensus docking viewpoint, but with one scoring function generating multiple samples. [109] The two major motions that occur in the protein ligand binding process are side chain and backbone movements. Herein, we have only incorporated side-chain flexibility via generation of conformations on the free energy surface. The addition of backbone flexibility should also improve the quality of our computed free energies of binding, but offer computational complexities beyond side chain sampling. 3.5 Acknowledgement NB would like to thank Mr. Pengfei Li for fruitful discussions. The authors would like to thank high performance computing center (HPCC) at Michigan State University for providing computational resources. 80 3.6 Supplementary information Table 3.2 Average CPU time (in seconds) to generate MTflex conformers for each Amino acid Amino acid (3 letter code) CPU time (seconds) 2.11 30.83 3.40 222.24 223.55 217.52 37.80 1008.62 784.53 5449.91 5.36 2649.07 1.00 2.67 4.52 1140.16 165.75 Ile Leu Val Phe Trp Tyr Asp Glu Gln Arg His Lys Ser Thr Cys Met Asn Table 3.3 Minimum and Maximum RMSDs of MTflex conformers along with the number of conformers retained for all 159 systems of the validation dataset. PDBID Minimum RMSD (Å) Maximum RMSD (Å) 10gs 1a30 1bcu 1e66 1f8b 1f8c 1f8d 1gpk 1h23 1hnn 1igj 1jyq 0.49 0.20 0.47 0.29 1.07 1.15 1.15 0.68 0.80 0.43 0.79 0.56 3.41 2.47 4.91 1.27 2.88 2.67 2.63 2.15 2.87 2.94 3.29 3.26 81 Number of MTflex conformers retained 255 10 4 3 10 14 15 4 19 18 4 40 Table 3.3 (cont’d) 1kel 1lbk 1lol 1loq 1lor 1mq6 1n1m 1n2v 1nvq 1o3f 1o5b 1oyt 1p1q 1q8t 1q8u 1qi0 1r5y 1sqa 1u1b 1u33 1uto 1vso 1w3k 1w3l 1w4o 1xd0 1yc1 1z95 1zea 2brb 2cbj 2cet 2d3u 2fvd 2g70 2gss 2hb1 2iwx 2j62 2j78 0.43 0.43 0.47 0.71 0.23 0.79 0.56 0.14 0.40 0.29 0.25 0.16 0.87 0.51 0.67 0.36 0.17 0.63 0.34 0.33 0.39 0.95 0.34 0.35 0.40 0.35 0.14 0.40 0.43 1.10 0.54 0.35 0.34 0.60 0.63 0.34 0.39 0.29 0.45 0.45 2 19 15 35 31 36 7 20 12 6 20 15 4 4 18 22 22 6 14 8 7 4 12 7 49 33 13 40 2 528 116 14 30 81 12 53 17 20 34 3 1.97 2.95 2.97 2.46 2.54 3.34 2.04 3.69 2.42 2.76 2.30 2.32 2.29 2.77 3.36 2.47 3.51 2.34 2.04 1.80 2.70 1.87 2.32 2.17 2.77 2.64 2.84 2.39 2.37 2.28 4.14 2.22 3.70 2.61 2.30 3.23 2.74 3.02 2.95 2.26 82 Table 3.3 (cont’d) 2jdm 2jdu 2jdy 2obf 2ole 2p4y 2pcp 2pq9 2qbp 2qbr 2qft 2qmj 2r23 2v00 2vl4 2vo5 2vot 2vvn 2vw5 2w66 2wbg 2wca 2wtv 2x00 2x0y 2xb8 2xbv 2xdl 2xnb 2xys 2y5h 2yfe 2yge 2yki 2ymd 2zjw 2zwz 2zx6 2zxd 3acw 0.98 0.24 0.05 0.46 0.62 0.96 0.58 0.44 0.24 0.38 0.50 0.36 0.83 0.19 0.19 0.21 0.12 0.38 0.44 0.23 0.52 0.20 0.61 0.44 0.34 1.06 0.43 0.52 0.75 0.32 0.30 0.46 0.36 0.12 0.38 0.63 0.46 0.63 0.48 0.25 4 12 16 37 59 6 20 3 23 148 12 39 19 12 18 8 29 12 8 18 3 4 12 61 20 10 4 18 17 17 11 15 7 5 4 80 4 2 24 6 2.30 2.37 2.54 2.74 2.96 2.09 1.82 2.03 2.73 3.49 2.55 1.69 2.05 2.58 1.96 1.78 1.94 2.01 2.80 2.80 1.65 2.82 2.95 3.57 3.02 2.94 2.31 2.76 2.98 2.48 2.90 3.20 2.61 2.37 2.44 3.37 1.78 2.22 2.22 2.29 83 Table 3.3 (cont’d) 3ag9 3ao4 3b68 3bfu 3bpc 3cft 3cj2 3coy 3cyx 3dxg 3e93 3ebp 3f3a 3f3c 3f3e 3fk1 3fv1 3g0w 3g2n 3g2z 3gbb 3gcs 3ge7 3gnw 3gy4 3huc 3imc 3ivg 3jvs 3k5v 3kgp 3l4u 3l4w 3l7b 3mss 3myg 3n7a 3n86 3nox 3nq3 0.73 0.64 0.44 0.47 1.09 0.20 0.44 0.40 0.58 0.29 0.15 0.74 0.22 0.16 0.24 0.69 0.64 0.46 0.50 0.47 0.69 0.45 0.56 0.32 0.43 0.51 0.66 0.51 0.48 0.49 0.14 0.85 0.42 0.54 0.34 0.57 0.33 0.92 0.80 0.76 14 13 10 18 4 176 35 29 8 35 11 5 2 2 3 18 3 8 20 4 5 11 16 14 14 32 47 125 53 7 7 6 14 59 14 21 12 7 26 10 2.76 2.80 2.17 2.98 1.43 3.71 3.44 2.90 2.48 2.51 2.25 2.99 1.79 1.23 1.34 2.80 2.30 2.53 2.63 1.80 2.99 2.58 2.82 2.83 3.00 3.66 3.90 3.18 3.33 2.41 3.52 1.83 1.29 2.62 2.67 3.19 2.83 3.12 2.27 2.37 84 Table 3.3 (cont’d) 3ov1 3owj 3pe2 3pww 3pxf 3s8o 3su2 3su3 3su5 3u9q 3udh 3ueu 3uex 3uo4 3uri 3utu 3zso 3zsx 4de1 4de2 4des 4dew 4djr 4djv 4g8m 4gid 4gqq PDBID 10gs 1a30 1bcu 1e66 1f8b 1f8c 1f8d Table 3.4 Minimum RMSDs of the docked ligand conformers in both the crystal and MTflex binding pockets of all 159 systems of the validation dataset. Crystal ligand RMSD (Å) MTflex RMSD 0.46 0.80 0.47 0.56 0.68 0.38 0.17 0.46 0.71 0.80 0.52 0.49 0.39 0.41 0.31 0.30 0.26 0.31 0.54 0.54 0.47 0.14 0.71 0.35 0.33 0.23 0.63 2.60 2.57 2.29 1.93 2.88 2.57 2.71 2.84 2.93 2.85 2.64 2.80 2.29 3.47 2.31 3.26 2.90 2.43 1.98 2.61 2.45 2.50 1.95 2.78 2.59 2.38 3.35 12 24 34 3 60 7 25 23 16 10 8 66 5 47 9 24 10 21 4 4 12 17 6 6 11 7 28 85 (Å) 0.67 1.63 0.25 0.40 0.86 1.28 0.93 0.84 2.90 0.34 0.49 0.33 0.25 0.58 Table 3.4 (cont’d) 1gpk 1h23 1hnn 1igj 1jyq 1kel 1lbk 1lol 1loq 1lor 1mq6 1n1m 1n2v 1nvq 1o3f 1o5b 1oyt 1p1q 1q8t 1q8u 1qi0 1r5y 1sqa 1u1b 1u33 1uto 1vso 1w3k 1w3l 1w4o 1xd0 1yc1 1z95 1zea 2brb 2cbj 2cet 2d3u 2fvd 2g70 0.39 5.49 1.04 0.92 9.91 1.18 0.53 2.70 0.83 0.19 0.50 0.56 2.00 0.25 1.21 1.02 0.69 0.81 0.99 0.54 5.88 5.00 0.59 8.94 0.54 1.26 3.93 0.25 0.16 7.67 8.65 0.40 0.38 12.78 0.59 1.67 0.51 0.47 0.36 0.42 0.30 2.05 0.46 1.11 1.73 1.16 0.51 1.08 0.49 0.34 1.05 0.49 0.82 0.25 0.65 0.46 0.47 0.41 0.99 0.54 4.19 0.34 0.56 4.03 8.20 0.65 1.52 0.20 0.29 0.87 7.63 0.36 0.38 8.56 0.49 1.09 0.48 0.47 0.30 1.02 86 Table 3.4 (cont’d) 2gss 2hb1 2iwx 2j62 2j78 2jdm 2jdu 2jdy 2obf 2ole 2p4y 2pcp 2pq9 2qbp 2qbr 2qft 2qmj 2r23 2v00 2vl4 2vo5 2vot 2vvn 2vw5 2w66 2wbg 2wca 2wtv 2x00 2x0y 2xb8 2xbv 2xdl 2xnb 2xys 2y5h 2yfe 2yge 2yki 2ymd 3.05 1.10 0.19 0.29 0.63 3.07 0.29 0.33 1.09 2.46 0.79 0.51 0.63 1.95 0.98 1.32 5.44 7.32 1.50 0.50 0.48 1.70 0.26 0.39 0.28 2.38 5.16 0.84 0.49 0.29 0.39 0.32 5.24 0.83 0.33 0.33 0.61 0.67 1.92 0.44 1.72 0.92 0.18 0.29 0.38 1.53 0.67 0.83 1.00 1.94 0.69 0.40 0.99 1.12 0.76 1.18 0.83 0.83 0.62 0.26 0.40 0.49 0.26 0.48 0.32 2.35 2.97 0.67 0.49 0.31 1.08 0.24 1.17 0.56 0.41 0.33 0.61 0.76 1.29 0.29 87 Table 3.4 (cont’d) 2zjw 2zwz 2zx6 2zxd 3acw 3ag9 3ao4 3b68 3bfu 3bpc 3cft 3cj2 3coy 3cyx 3dxg 3e93 3ebp 3f3a 3f3c 3f3e 3fk1 3fv1 3g0w 3g2n 3g2z 3gbb 3gcs 3ge7 3gnw 3gy4 3huc 3imc 3ivg 3jvs 3k5v 3kgp 3l4u 3l4w 3l7b 3mss 3.72 0.26 2.03 0.27 0.35 14.84 0.34 0.93 1.04 11.59 0.56 0.40 2.77 5.47 6.12 0.60 0.34 0.87 0.41 0.71 3.47 0.76 0.20 0.74 3.60 1.10 0.24 0.37 0.34 0.23 0.51 0.34 5.96 2.01 2.48 0.94 1.73 1.80 0.51 2.49 0.86 0.59 1.88 0.35 0.64 9.92 0.31 0.93 0.83 3.41 0.49 0.36 0.61 0.56 2.20 0.59 4.88 0.58 0.92 0.42 1.43 0.75 0.45 0.63 3.45 0.61 0.43 1.98 0.24 0.68 0.64 0.26 1.13 0.66 1.99 0.92 1.33 2.20 0.44 2.26 88 Table 3.4 (cont’d) 3myg 3n7a 3n86 3nox 3nq3 3ov1 3owj 3pe2 3pww 3pxf 3s8o 3su2 3su3 3su5 3u9q 3udh 3ueu 3uex 3uo4 3uri 3utu 3zso 3zsx 4de1 4de2 4des 4dew 4djr 4djv 4g8m 4gid 4gqq 1.78 0.24 0.72 0.75 3.05 0.79 5.22 0.25 10.70 9.15 0.90 0.28 0.36 0.33 1.27 0.59 2.90 1.43 0.71 12.42 2.11 0.80 4.39 0.65 1.30 2.42 6.78 10.67 0.29 1.29 12.12 0.34 1.72 0.66 0.87 0.75 2.73 1.24 0.49 0.37 9.21 0.90 0.73 0.28 0.36 0.33 0.62 0.32 1.17 1.58 0.69 10.22 0.52 0.69 1.30 0.97 0.68 0.43 2.18 4.86 0.28 0.98 9.97 0.96 89 Figure 3.9 Ligand Interaction diagrams of PDBIDs: 2R23 (top panel) and 1U33 (lower panel). Crystal complex is shown on the left side, crystal protein docked ligand (center) and the MTflex protein docked ligand in right. 90 Figure 3.10 The superposition of MTflex generated binding pocket (Red) with the crystal binding pocket (coloured) along with the native ligand (orange) for PDBIDs 2R23 (left) and 1U33(right). For PDBID 2R23, Arg30 and Asn52 of chain D interact with the native ligand restoring the position of docked ligand in the MTflex binding pocket. For PDBID 1U33, the side chain of GLU233 in the MTflex binding pocket clashes with an atom of native ligand. 91 Figure 3.11 A plot showing the position of MTflex lowest free energy conformation in ligand bound state (marked in red square) on the free energy scale generated by MTflex conformers in ligand-unbound state for four systems from our validation dataset. PDBIDs of the systems are: 10gs, 2fvd, 1qbr and 2zjw. Table 3.5 Correlation of Glide’s best score (top docking solution) for crystal protein, Glide’s best score for MTflex conformers across whole conformations and Glide’s score for the lowest free energy MTflex conformation in the ligand-unbound state to the experimental binding affinities. Spearman’s Pearson’s RMSE (kcal/mol) MUE (kcal/mol) 92 Glide’s best score- Crystal Glide’s best score-MTflex Glide’s score-MTflex lowest free energy Rho 0.43 0.50 0.38 R 0.39 0.47 0.37 3.21 2.71 3.17 2.34 2.17 2.43 3.6.1 RMSD in-house code For calculating RMSDs of the MTflex conformers from the crystal structure, we wanted to do best fitting of the structures by aligning them and minimizing the rotations. We were especially concerned about the two aromatic amino acids - Phenylalanine and Tyrosine, which have benzene and 4-hydroxy benzene as the side chain and have an axis of symmetry. Figure 3.12 Side chain rotation of Phenylalanine residue. If these residues undergo a 180° rotation around Cβ atom, the net RMSD of the side chain would be zero. Please see the attached figure 3.12 for Phenylalanine. Similar situation would happen for Tyrosine. Our in-house code incorporated a feature, which would swap Cδ1 ’ with Cδ2 ’ and Cε1 ’ with Cε2 ’, if the Euclidean distance between Cδ1 and Cδ1 ’ is greater than Cδ1 with Cδ2 ’. Another feature of our code is that it includes only those heavy atoms of side chains in doing averaging, which deviates from their position in the crystal structure. The code was written as a simple script in MATLAB software. 93 Algorithm: If then, elseif then, ||Cδ1 - Cδ1’|| > ||Cδ1 - Cδ2’|| ’ (new) = Cδ2 Cδ1 ’ (new) = Cε2 Cε1 ’ and Cδ2 ’ ’ (new) = Cδ1 ’ ’ ’ and Cε2 (new) = Cε1 ||Cδ1 - Cδ1’|| <= ||Cδ1 - Cδ2’|| ’ Cδ1 (new) = Cδ1 ’ Cε1 (new) = Cε1 ’ ’ and Cδ2 ’ (new) = Cδ2 ’ ’ and Cε2 ’ (new) = Cε2 94 REFERENCES 95 REFERENCES 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. Koshland, D.E., The Key-Lock Theory and the Induced Fit Theory. Angewandte Chemie- International Edition, 1994. 33(23-24): p. 2375-2378. Koshland, D.E., Correlation of Structure and Function in Enzyme Action. Science, 1963. 142(359): p. 1533-&. Ma, B.Y., et al., Folding funnels and binding mechanisms. Protein Engineering, 1999. 12(9): p. 713-720. Fitzgerald, P.M.D., et al., Crystallographic Analysis of a Complex between Human- Immunodeficiency-Virus Type-1 Protease and Acetyl-Pepstatin at 2.0-a Resolution. Journal of Biological Chemistry, 1990. 265(24): p. 14209-14219. Gutteridge, A. and J. Thornton, Conformational changes observed in enzyme crystal structures upon substrate binding. Journal of Molecular Biology, 2005. 346(1): p. 21-28. Jorgensen, W.L., Rusting of the Lock and Key Model for Protein-Ligand Binding. Science, 1991. 254(5034): p. 954-955. Vanduyne, G.D., et al., Atomic-Structure of Fkbp-Fk506, an Immunosuppressant Complex. Science, 1991. 252(5007): p. 839-842. Immunophilin- Weber, P.C., et al., Structural Origins of High-Affinity Biotin Binding to Streptavidin. Science, 1989. 243(4887): p. 85-88. Boehr, D.D., R. Nussinov, and P.E. Wright, The role of dynamic conformational ensembles in biomolecular recognition (vol 5, pg 789, 2009). Nature Chemical Biology, 2009. 5(12): p. 954-954. Teague, S.J., Implications of protein flexibility for drug discovery. Nat Rev Drug Discov, 2003. 2(7): p. 527-41. Feixas, F., et al., Exploring the role of receptor flexibility in structure-based drug discovery. Biophys Chem, 2014. 186: p. 31-45. Gerstein, M., A.M. Lesk, and C. Chothia, Structural Mechanisms for Domain Movements in Proteins. Biochemistry, 1994. 33(22): p. 6739-6749. Gunasekaran, K. and R. Nussinov, How different are structurally flexible and rigid binding sites? Sequence and structural features discriminating proteins that do and do not undergo conformational change upon ligand binding. Journal of Molecular Biology, 2007. 365(1): p. 257-273. Lesk, A.M. and C. Chothia, Mechanisms of Domain Closure in Proteins. Journal of Molecular Biology, 1984. 174(1): p. 175-191. 96 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. Najmanovich, R., et al., Side-chain flexibility in proteins upon ligand binding. Proteins- Structure Function and Genetics, 2000. 39(3): p. 261-268. Frauenfelder, H., S.G. Sligar, and P.G. Wolynes, The Energy Landscapes and Motions of Proteins. Science, 1991. 254(5038): p. 1598-1603. Sorensen, J., et al., Molecular Docking to Flexible Targets. Molecular Modeling of Proteins: 2nd Edition, 2015. 1215: p. 445-469. Du, X., et al., Insights into Protein-Ligand Interactions: Mechanisms, Models, and Methods. International Journal of Molecular Sciences, 2016. 17(2). Abagyan, R. and M. Totrov, High-throughput docking for lead generation. Current Opinion in Chemical Biology, 2001. 5(4): p. 375-382. Carlson, H.A., Protein flexibility and drug design: how to hit a moving target. Current Opinion in Chemical Biology, 2002. 6(4): p. 447-452. Antunes, D.A., D. Devaurs, and L.E. Kavraki, Understanding the challenges of protein flexibility in drug design. Expert Opinion on Drug Discovery, 2015. 10(12): p. 1301- 1313. B-Rao, C., J. Subramanian, and S.D. Sharma, Managing protein flexibility in docking and its applications. Drug Discovery Today, 2009. 14(7-8): p. 394-400. Henzler-Wildman, K. and D. Kern, Dynamic personalities of proteins. Nature, 2007. 450(7172): p. 964-972. Kokh, D.B., R.C. Wade, and W. Wenzel, Receptor flexibility in small-molecule docking calculations. Wiley Interdisciplinary Reviews-Computational Molecular Science, 2011. 1(2): p. 298-314. Spyrakis, F., et al., Protein Flexibility and Ligand Recognition: Challenges for Molecular Modeling. Current Topics in Medicinal Chemistry, 2011. 11(2): p. 192-210. Sousa, S.F., et al., Protein-Ligand Docking in the New Millennium - A Retrospective of 10 Years in the Field. Current Medicinal Chemistry, 2013. 20(18): p. 2296-2314. Ferrari, A.M., et al., Soft docking and multiple receptor conformations in virtual screening. Journal of Medicinal Chemistry, 2004. 47(21): p. 5076-5084. Jiang, F. and S.H. Kim, Soft Docking - Matching of Molecular-Surface Cubes. Journal of Molecular Biology, 1991. 219(1): p. 79-102. 97 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. Jain, T., D.S. Cerutti, and J.A. McCammon, Configurational-bias sampling technique for predicting side-chain conformations in proteins. Protein Science, 2006. 15(9): p. 2029- 2039. Leach, A.R., Ligand Docking to Proteins with Discrete Side-Chain Flexibility. Journal of Molecular Biology, 1994. 235(1): p. 345-356. Yang, A.Y.C., P. Kallblad, and R.L. Mancera, Molecular modelling prediction of ligand binding site flexibility. Journal of Computer-Aided Molecular Design, 2004. 18(4): p. 235-250. Alberts, I.L., N.P. Todorov, and P.M. Dean, Receptor flexibility in de novo ligand design and docking. Journal of Medicinal Chemistry, 2005. 48(21): p. 6585-6596. Sinko, W., S. Lindert, and J.A. McCammon, Accounting for Receptor Flexibility and Enhanced Sampling Methods in Computer-Aided Drug Design. Chemical Biology & Drug Design, 2013. 81(1): p. 41-49. Frembgen-Kesner, T. and A.H. Elcock, Computational sampling of a cryptic drug binding site in a protein receptor: Explicit solvent molecular dynamics and inhibitor docking to p38 MAP kinase. Journal of Molecular Biology, 2006. 359(1): p. 202-214. Kua, J., Y.K. Zhang, and J.A. McCammon, Studying enzyme binding specificity in acetylcholinesterase using a combined molecular dynamics and multiple docking approach. Journal of the American Chemical Society, 2002. 124(28): p. 8260-8267. Armen, R.S., J.H. Chen, and C.L. Brooks, An Evaluation of Explicit Receptor Flexibility in Molecular Docking Using Molecular Dynamics and Torsion Angle Molecular Dynamics. Journal of Chemical Theory and Computation, 2009. 5(10): p. 2909-2923. Orry, A.J.W., R.A. Abagyan, and C.N. Cavasotto, Structure-based development of target- specific compound libraries. Drug Discovery Today, 2006. 11(5-6): p. 261-266. Atilgan, C., et al., Manipulation of Conformational Change in Proteins by Single-Residue Perturbations. Biophysical Journal, 2010. 99(3): p. 933-943. Cavasotto, C.N., J.A. Kovacs, and R.A. Abagyan, Representing receptor flexibility in ligand docking through relevant normal modes. Journal of the American Chemical Society, 2005. 127(26): p. 9632-9640. Gerek, Z.N. and S.B. Ozkan, Change in Allosteric Network Affects Binding Affinities of PDZ Domains: Analysis through Perturbation Response Scanning. Plos Computational Biology, 2011. 7(10). Gerek, Z.N. and S.B. Ozkan, A flexible docking scheme to explore the binding selectivity of PDZ domains. Protein Science, 2010. 19(5): p. 914-928. 98 Huang, S.Y. and X.Q. Zou, Ensemble docking of multiple protein structures: Considering protein structural variations in molecular docking. Proteins-Structure Function and Bioinformatics, 2007. 66(2): p. 399-421. 45. Morris, G.M., et al., AutoDock4 and AutoDockTools4: Automated Docking with Selective Receptor Flexibility. Journal of Computational Chemistry, 2009. 30(16): p. 2785-2791. 42. 43. 44. 46. 47. 48. 49. 50. 51. 52. 53. Carlson, H.A. and J.A. McCammon, Accommodating protein flexibility in computational drug design. Molecular Pharmacology, 2000. 57(2): p. 213-218. Damm, K.L. and H.A. Carlson, Exploring experimental sources of multiple protein conformations in structure-based drug design. Journal of the American Chemical Society, 2007. 129(26): p. 8225-8235. Osterberg, F., et al., Automated docking to multiple target structures: Incorporation of protein mobility and structural water heterogeneity in AutoDock. Proteins-Structure Function and Genetics, 2002. 46(1): p. 34-40. Huang, S.Y. and X.Q. Zou, An iterative knowledge-based scoring function to predict protein-ligand function. Journal of Computational Chemistry, 2006. 27(15): p. 1876-1882. interactions: II. Validation of the scoring Huang, S.Y. and X.Q. Zou, An iterative knowledge-based scoring function to predict protein-ligand interaction potentials. Journal of Computational Chemistry, 2006. 27(15): p. 1866-1875. interactions: I. Derivation of Cavasotto, C.N. and R.A. Abagyan, Protein flexibility in ligand docking and virtual screening to protein kinases. Journal of Molecular Biology, 2004. 337(1): p. 209-225. Ding, F. and N.V. Dokholyan, Incorporating Backbone Flexibility in MedusaDock Improves Ligand-Binding Pose Prediction in the CSAR2011 Docking Benchmark. Journal of Chemical Information and Modeling, 2013. 53(8): p. 1871-1879. Lauck, F., et al., RosettaBackrub-a web server for flexible backbone protein structure modeling and design. Nucleic Acids Research, 2010. 38: p. W569-W575. Claussen, H., et al., FlexE: Efficient molecular docking considering protein structure variations. Journal of Molecular Biology, 2001. 308(2): p. 377-395. Bolia, A., Z.N. Gerek, and S.B. Ozkan, BP-Dock: A Flexible Docking Scheme for Exploring Protein Ligand Interactions Based on Unbound Structures. Journal of Chemical Information and Modeling, 2014. 54(3): p. 913-925. 99 54. Trott, O. and A.J. Olson, Software News and Update AutoDock Vina: Improving the Speed and Accuracy of Docking with a New Scoring Function, Efficient Optimization, and Multithreading. Journal of Computational Chemistry, 2010. 31(2): p. 455-461. 55. Morris, G.M., et al., Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry, 1998. 19(14): p. 1639-1662. Jones, G., et al., Development and validation of a genetic algorithm for flexible ligand docking. Abstracts of Papers of the American Chemical Society, 1997. 214: p. 154- COMP. Kallblad, P. and P.M. Dean, Efficient conformational sampling of local side-chain flexibility. Journal of Molecular Biology, 2003. 326(5): p. 1651-1665. Sherman, W., et al., Novel procedure for modeling ligand/receptor induced fit effects. Journal of Medicinal Chemistry, 2006. 49(2): p. 534-553. Davis, I.W. and D. Baker, ROSETTALIGAND Docking with Full Ligand and Receptor Flexibility. Journal of Molecular Biology, 2009. 385(2): p. 381-392. 60. Meiler, J. and D. Baker, ROSETTALIGAND: Protein-small molecule docking with full side-chain flexibility. Proteins-Structure Function and Bioinformatics, 2006. 65(3): p. 538-548. 56. 57. 58. 59. 61. 62. 63. 64. 65. 66. 67. Zhao, Y. and M.F. Sanner, FLIPDock: Docking flexible ligands into flexible receptors. Abstracts of Papers of the American Chemical Society, 2007. 233: p. 152-152. Zhao, Y. and M.F. Sanner, Protein-ligand docking with multiple flexible side chains. Journal of Computer-Aided Molecular Design, 2008. 22(9): p. 673-679. Allen, W.J., et al., DOCK 6: Impact of New Features and Current Docking Performance. Journal of Computational Chemistry, 2015. 36(15): p. 1132-1156. Schnecke, V., et al., Screening a peptidyl database for potential ligands to proteins with side-chain flexibility. Proteins-Structure Function and Genetics, 1998. 33(1): p. 74-87. Zavodszky, M.I. and L.A. Kuhn, Side-chain flexibility in protein-ligand binding: The minimal rotation hypothesis. Protein Science, 2005. 14(4): p. 1104-1114. Zavodszky, M.I., et al., Scoring ligand similarity in structure-based virtual screening. Journal of Molecular Recognition, 2009. 22(4): p. 280-292. Leis, S. and M. Zacharias, ReFlexIn: A Flexible Receptor Protein-Ligand Docking Scheme Evaluated on HIV-1 Protease. Plos One, 2012. 7(10). 100 Shin, W.H. and C. Seok, GalaxyDock: Protein-Ligand Docking with Flexible Protein Side-chains. Journal of Chemical Information and Modeling, 2012. 52(12): p. 3225-3232. Corbeil, C.R., P. Englebienne, and N. Moitessier, Docking ligands into flexible and solvated macromolecules. 1. Development and validation of FITTED 1.0. Journal of Chemical Information and Modeling, 2007. 47(2): p. 435-449. Corbeil, C.R., et al., Docking Ligands into flexible and solvated macromolecules. 2. Development and application of FITTED 1.5 to the virtual screening of potential HCV polymerase inhibitors. Journal of Chemical Information and Modeling, 2008. 48(4): p. 902-909. Corbeil, C.R. and N. Moitessier, Docking Ligands into Flexible and Solvated Macromolecules. 3. Impact of Input Ligand Conformation, Protein Flexibility, and Water Molecules on the Accuracy of Docking Programs. Journal of Chemical Information and Modeling, 2009. 49(4): p. 997-1009. Zacharias, M., Rapid protein-ligand docking using soft modes from molecular dynamics simulations to account for protein deformability: Binding of FK506 to FKBP. Proteins- Structure Function and Bioinformatics, 2004. 54(4): p. 759-767. Zacharias, M., Combining elastic network analysis and molecular dynamics simulations by Hamiltonian replica exchange. Journal of Chemical Theory and Computation, 2008. 4(3): p. 477-487. Nabuurs, S.B., M. Wagener, and J. De Vlieg, A flexible approach to induced fit docking. Journal of Medicinal Chemistry, 2007. 50(26): p. 6507-6518. 75. Mashiach, E., R. Nussinov, and H.J. Wolfson, FiberDock: a web server for flexible induced-fit backbone refinement in molecular docking. Nucleic Acids Research, 2010. 38: p. W457-W461. 76. Mashiach, E., et al., An integrated suite of fast docking algorithms. Proteins-Structure 68. 69. 70. 71. 72. 73. 74. 77. 78. 79. Function and Bioinformatics, 2010. 78(15): p. 3197-3204. Sandak, B., H.J. Wolfson, and R. Nussinov, Flexible docking allowing induced fit in proteins: Insights from an open to closed conformational isomers. Proteins-Structure Function and Genetics, 1998. 32(2): p. 159-174. Schneidman-Duhovny, D., et al., Geometry-based flexible and symmetric protein docking. Proteins-Structure Function and Bioinformatics, 2005. 60(2): p. 224-231. Schneidman-Duhovny, D., R. Nussinov, and H.J. Wolfson, Automatic prediction of protein large scale motion. Proteins-Structure Function and Bioinformatics, 2007. 69(4): p. 764-773. interactions with 101 80. 81. 82. 83. 84. 85. 87. 88. 89. 90. 91. Bolia, A. and S.B. Ozkan, Adaptive BP-Dock: An Induced Fit Docking Approach for Full Receptor Flexibility. Journal of Chemical Information and Modeling, 2016. 56(4): p. 734- 746. Almlof, M., B.O. Brandsdal, and J. Aqvist, Binding affinity prediction with different force fields: examination of the linear interaction energy method. J Comput Chem, 2004. 25(10): p. 1242-54. Aqvist, J., C. Medina, and J.E. Samuelsson, A new method for predicting binding affinity in computer-aided drug design. Protein Eng, 1994. 7(3): p. 385-91. Amaro, R.E., R. Baron, and J.A. McCammon, An improved relaxed complex scheme for receptor flexibility in computer-aided drug design. Journal of Computer-Aided Molecular Design, 2008. 22(9): p. 693-705. Lin, J.H., et al., Computational drug design accommodating receptor flexibility: The relaxed complex scheme. Journal of the American Chemical Society, 2002. 124(20): p. 5632-5633. Carlson, H.A., et al., Developing a dynamic pharmacophore model for HIV-1 integrase. Journal of Medicinal Chemistry, 2000. 43(11): p. 2100-2114. 86. Meagher, K.L. and H.A. Carlson, Incorporating protein flexibility in structure-based drug discovery: Using HIV-1 protease as a test case. Journal of the American Chemical Society, 2004. 126(41): p. 13276-13281. Liu, H.Y., A.E. Mark, and W.F. vanGunsteren, Estimating the relative free energy of different molecular states with respect to a single reference state. Journal of Physical Chemistry, 1996. 100(22): p. 9485-9494. Oostenbrink, C. and W.F. Van Gunsteren, Single-step perturbations to calculate free energy differences from unphysical reference states: Limits on size, flexibility, and character. Journal of Computational Chemistry, 2003. 24(14): p. 1730-1739. Zagrovic, B. and W.F. van Gunsteren, Computational analysis of the mechanism and thermodynamics of inhibition of phosphodiesterase 5A by synthetic ligands. Journal of Chemical Theory and Computation, 2007. 3(1): p. 301-311. Grinter, S.Z. and X.Q. Zou, Challenges, Applications, and Recent Advances of Protein- Ligand Docking in Structure-Based Drug Design. Molecules, 2014. 19(7): p. 10150- 10176. Lexa, K.W. and H.A. Carlson, Protein flexibility in docking and surface mapping. Quarterly Reviews of Biophysics, 2012. 45(3): p. 301-343. 102 92. Mobley, D.L. and K.A. Dill, Binding of Small-Molecule Ligands to Proteins: "What You See" Is Not Always "What You Get". Structure, 2009. 17(4): p. 489-498. Sliwoski, G., et al., Computational Methods in Drug Discovery. Pharmacological Reviews, 2014. 66(1): p. 334-395. Durrant, J.D. and J.A. McCammon, Computer-aided drug-discovery techniques that account for receptor flexibility. Current Opinion in Pharmacology, 2010. 10(6): p. 770- 774. Zheng, Z., M.N. Ucisik, and K.M. Merz, Jr., The Movable Type Method Applied to Protein-Ligand Binding. J Chem Theory Comput, 2013. 9(12): p. 5526-5538. 96. Wang, R.X., et al., The PDBbind database: Collection of binding affinities for protein- ligand complexes with known three-dimensional structures. Journal of Medicinal Chemistry, 2004. 47(12): p. 2977-2980. 97. Wang, R.X., et al., The PDBbind database: Methodologies and updates. Journal of Medicinal Chemistry, 2005. 48(12): p. 4111-4119. 98. MATLAB R2015a. MATLAB R2015a, The Mathworks Inc., Natick, Massachusetts, United States, 2015. 93. 94. 95. 99. Roe, D.R. and T.E. Cheatham, PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. Journal of Chemical Theory and Computation, 2013. 9(7): p. 3084-3095. 100. Friesner, R.A., et al., Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy. Journal of Medicinal Chemistry, 2004. 47(7): p. 1739-1749. 101. Friesner, R.A., et al., Extra precision glide: Docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. Journal of Medicinal Chemistry, 2006. 49(21): p. 6177-6196. 102. Halgren, T.A., et al., Glide: A new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database screening. Journal of Medicinal Chemistry, 2004. 47(7): p. 1750-1759. 103. Sastry, G.M., et al., Protein and ligand preparation: parameters, protocols, and influence on virtual screening enrichments. Journal of Computer-Aided Molecular Design, 2013. 27(3): p. 221-234. 104. Olsson, M.H.M., et al., PROPKA3: Consistent Treatment of Internal and Surface Residues in Empirical pK(a) Predictions. Journal of Chemical Theory and Computation, 2011. 7(2): p. 525-537. 103 105. Rostkowski, M., et al., Graphical analysis of pH-dependent properties of proteins predicted using PROPKA. Bmc Structural Biology, 2011. 11. Jorgensen, W.L. and J. Tiradorives, The Opls Potential Functions for Proteins - Energy Minimizations for Crystals of Cyclic-Peptides and Crambin. Journal of the American Chemical Society, 1988. 110(6): p. 1657-1666. 107. Kaminski, G.A., et al., Evaluation and reparametrization of the OPLS-AA force field for proteins via comparison with accurate quantum chemical calculations on peptides. Journal of Physical Chemistry B, 2001. 105(28): p. 6474-6487. 108. Totrov, M. and R. Abagyan, Flexible protein-ligand docking by global energy optimization in internal coordinates. Proteins-Structure Function and Genetics, 1997: p. 215-220. 109. Houston, D.R. and M.D. Walkinshaw, Consensus Docking: Improving the Reliability of Docking in a Virtual Screening Context. Journal of Chemical Information and Modeling, 2013. 53(2): p. 384-390. 106. 104 CHAPTER 4 The Role of the Active Site Flap in Streptavidin/Biotin Complex Formation † Reprinted (adapted) with permission from Bansal, N.; Zheng, Z.; Song, F. L.; Pei, J.; Merz, K. M., The Role of the Active Site Flap in Streptavidin/Biotin Complex Formation. J Am Chem Soc 2018, DOI 10.1021/jacs.8b00743 105 4.1 Abstract Obtaining a detailed description of how active site flap motion affects substrate or ligand binding will advance structure-based drug design (SBDD) efforts on systems including the kinases, HSP90, HIV protease, ureases, etc. Through this understanding we will be able to design better inhibitors and better proteins that have desired functions. Herein we address this issue by generating the relevant configurational states of a protein flap on the molecular energy landscape using an approach we call MTFlex-b and then following this with a procedure to estimate the free energy associated with the motion of the flap region. To illustrate our overall workflow, we explored the free energy changes in the Streptavidin/Biotin system upon introducing conformational flexibility in loop3-4 in the Biotin un-bound (apo) and bound (holo) state. The free energy surfaces were created using the Movable Type free energy method and for further validation we compared them to potential of mean force (PMF) generated free energy surfaces using MD simulations employing the FF99SBILDN and FF14SB force fields. We also estimated the free energy thermodynamic cycle using an ensemble of closed-like and open-like end states for the ligand unbound and bound states and estimated the binding free energy to be ~ –16.2 kcal/mol (experimental -18.3 kcal/mol). The good agreement between MTFlex-b in combination with the MT method with experiment and MD simulations supports the effectiveness of our strategy in obtaining unique insights into the motions in proteins that can then be used in a range of biological and biomedical applications. 106 4.2 Introduction The rapid and chemically accurate estimation of the free energy change involved when a ligand binds to a biochemical system is the Holy Grail of structure based drug design[1-3]. Several free energy methods ranging from end-point approach to pathway based free energy calculation have been explored computationally. The so-called end-point methods used to estimate free energies (e.g., docking, MMPBSA and MMGBSA) are computationally inexpensive but they typically rely on a single static structure and often ignore receptor flexibility[4, 5]. Pathway free energy methods require extensive sampling to estimate the free energies. The pathway free energy methods can be broadly categorized into alchemical and potential of mean force approaches. Alchemical free energy methods use an “alchemical” pathway of non-physical intermediate states to build a thermodynamic cycle and compute the free energy differences between the end states[6-8]. Free energy perturbation[9] (FEP) and thermodynamic integration[10] (TI) are two of the most commonly used alchemical free energy methods. Another pathway method to estimate free energies is the Potential of mean force (PMF) approach[11-13]. Umbrella sampling coupled with the WHAM (weighted histogram analysis method) analysis is one of the most commonly used PMF approaches[14]. The major bottleneck of the PMF approach is the computational cost associated with the conformational sampling. Moreover, the pathway free energy methods, in contrast to end-points methods, are computationally intensive and are limited by simulation timescales. Finally, all methods are affected by the accuracy of the underlying force field and improvements in these remain an active area of research[6, 15-17]. 107 Conformational sampling of the relevant receptor “states” and ligand poses is essential in order to advance the field of structure based drug design (SBDD). Conformational flexibility is pivotal to several important molecular recognition processes including protein-protein and protein-ligand binding events, enzyme catalysis, allosteric control, and biomolecular assembly[18-20]. In many protein structures the conformational flexibility is often attributed to the loop regions, which serve as the “connecting segments” between two defined secondary structures or different domains[20-23]. But, loops are not mere connectors. They participate in crucial functions, including ligand binding events[24-26], enzyme catalysis[27] and molecular recognition processes[28-30]. Often, crystal structures of proteins are reported with loops missing because their conformational flexibility makes it difficult to resolve them in X-ray crystallographic experiments[31]. In many instances, these missing loops mediate the biological function of a protein; hence, computational chemists have developed a range of loop modeling algorithms to address this issue[32-92]. Sampling methods based on MC and MD simulations have been used extensively[79, 93-99]. Several, inverse kinematic methods with Monte Carlo sampling scheme have been developed to simulate loop flexibility and domain motion. These methods explore the entire torsional space (ϕ, ψ, ω) by designing Monte Carlo moves and select or reject a move based on a metropolis algorithm. The conformational energy in these methods is usually explored using force fields based methods[100-104]. Rosetta also uses Metropolis Monte Carlo sampling of ϕ and ψ for local refinement of protein structures. It models large-scale backbone conformational changes by exchanging the backbone conformations of peptide fragments collected from homologous sequences in the PDB and a backbone dependent rotamer library for inclusion of side chain flexibility. The Rosetta suite uses an all atom energy function, which calculates the weighted sum of energy terms in Rosetta Energy units. The scoring function 108 is a combination of physics based and statistical terms[105]. Conceptually, our approach is similar to using rotamer library based methods but our atoms are added not in terms of structural database rather on a local energy landscape defined by the potential. Moreover, unlike other methods, we report free energies rather than energies. Detailed discussions of loop modeling methods and algorithms can be found in the extant literature[106-109]. A combination of both flexibility and rigidity are crucial for the protein function[110]. Flexibility of an amino acid is usually defined by its atomic temperature factors of B-factor[111]. Highly flexible regions are generally characterized by more hydrophilic residues and are enriched with higher charged residues while the less flexible regions tend to be high in hydrophobic residues and have a reduced number of charged residues[112]. Recently, a H192P point mutation on one of the cofactor binding loops in E.coli Thermus thermophilus improved the stability of the protein by two-fold at elevated temperature compared to wild type at 60º C[113]. Surface loops are known to undergo functionally relevant conformational changes. Experiments have shown that often these conformational changes are related to external perturbations including ligand binding[114]. These conformations pre-exist but the relative population shifts upon the binding event[115-117]. This leads to discrete open or closed states or multiple different states in the protein separated by an energy barrier. They may have a relatively low free energy barrier, which favors fast interconversion between different states or they may have a relatively high barrier, thereby, making the interconversion relatively slow[118]. If the barrier height is too high, it becomes more computationally challenging for sampling methods to capture the entire conformational space spanned by the loop requiring the use of biasing methods. 109 The accuracy of free energy calculations in part depends on the extent and accuracy of the conformational sampling of the relevant chemical space. With the continued advancement in computational hardware and software, molecular dynamics (MD) simulations have reached the millisecond milestone, but a sufficient number of uncorrelated configurations is still not guaranteed[3]. Monte Carlo sampling, on the other hand, does not suffer from the problem of uncorrelated configurations, however, convergence is still an issue. Moreover, the computational cost associated with “sufficient” conformational sampling with both of these methods is a major bottleneck for proteins due to their size and the number of degrees of freedom[3]. Apart from the computational challenges faced when sampling relevant conformational space, the accuracy of current force fields can be a concern as well[99]. In this spirit, we introduce a new procedure to estimate the free energies in biochemical systems via extrapolation of the local partition function for a set of pre-generated conformational ensembles. The conformational ensemble of the targeted flexible region of the protein is generated via a sampling method we call MTFlex-b, which treats the molecule at an atom pair level and utilizes atom pairwise one-dimensional knowledge based potentials to enumerate the molecular conformations on an energy surface. The free energies are then estimated by extrapolating the local partition function of the generated “seed structures” from the conformational ensemble using the MT method[119]. We have applied our method to study the free energy changes observed in Streptavidin by accommodating conformational flexibility in the loop3-4 region (in the streptavidin monomer) in the Biotin bound (holo) and un-bound (apo) states. This loop is highly mobile and undergoes an open to closed transition upon biotin binding[120-123]. Currently, we have incorporated flexibility into the loop region only, but the workflow is straightforward and can be applied to 110 larger regions of a given protein as needed. Using MTFlex-b, we obtained 21,295 unique backbone conformations of the eight-residue long loop3-4 where we observe examples of closed-like, open- like and several intermediate loop conformations. The lowest Cα RMSD conformations generated by MTFlex-b (with respect to the closed and open crystal structures) were similar to their corresponding crystal conformations with Cα RMSDs of ~1.6 Å. Using an ensemble of closed- like and open-like MTFlex-b generated conformations as end states in the ligand unbound and bound states, we obtained a free energy thermodynamic cycle that gives us detailed insights into biotin binding and the role flap motion plays in binding. We further explored the energy landscape by generating a free energy surface (FES) using the MT method and further validated it by generating a PMF via umbrella sampling followed by analysis by WHAM using force field based MD simulation methodologies. The conformational search was run from both loop ends in two separate jobs each using a single CPU on the Laconia cluster available at the High Performance Computer Center (HPCC) facility at MSU. Conformation generation using this code base took ~7days to generate over 11 million different conformations of the eight residue long loop of Streptavidin. The free energy calculations using the MT method took ~20 minutes per seed using a MATLAB code on a single CPU. Since the conformations were pre-generated, the free energy calculations were run in a trivially parallel fashion using the HPCC facility at MSU. 4.3 Results and Discussion: Streptavidin-biotin The streptavidin-biotin complex is biotechnologically an interesting protein-ligand system with one of the most remarkable binding affinities known[122-125]. Streptavidin is a homo tetramer with an eight-residue loop between β–strand 3 and strand 4 (also known as loop3- 111 4) in all four monomers. This loop3-4 region is flexible and is known to undergo a transition from an open to closed state upon biotin binding[121, 124, 126]. The closed and open states of the streptavidin monomer with biotin bound in the binding pocket are shown in Figure 4.1. The backbone RMSD (Å) between the closed and open states of loop3-4 is ~ 8.6 Å. The motion of the loop can be characterized by the distance between the C-alpha carbon of residue 49 (roughly the mid-point of loop3-4) in the closed and open state, which is 12.7Å. This loop is highly mobile and its interaction with biotin has been extensively explored in several experimental and computational studies[123, 124, 126-129]. Recently, the conformational dynamics of loop3-4 was studied by Song et al using MD simulations and enhanced sampling methods[129]. Using accelerated MD, they were able to study the transition of loop3-4 between the open and closed states in the presence of biotin in the monomer of streptavidin but conventional MD was not able to observe the loop transition at the timescales employed. Because they studied the monomer the binding pocket was incomplete due to absence of an important hydrophobic/p-stacking interaction contributed by TRP120 from an adjacent monomer. This residue is important in stabilizing the streptavidin-biotin complex and how its neglect affects the outcome of the study of Song et al was not addressed[124, 130, 131]. To the best of our knowledge, no other loop modeling algorithm or sampling strategy has elucidated the loop3-4 transitions of streptavidin. This motivated us to apply our method to study the free energy changes associated with the loop motion in streptavidin in the presence and absence of biotin by generating multiple loop3-4 conformational states. 112 Figure 4.1 Cartoon representation of crystal Streptavidin monomer in holo (Biotin bound - in licorice) and apo states is superimposed on top of each other. The closed state of the loop3-4 (Residues 45 to 52) pertaining to the holo conformation is highlighted in mauve color while the open loop is shown in blue color. The distance between the Cα of Residue 49 of the closed and open state is approx. 12.67Å. The conformational ensemble for loop3-4 was generated for a monomer (chain A) of the streptavidin holo structure (PDBID 1mk5) in the context of the tetramer. We preferred to use the holo conformation to the apo since biotin is present in the binding pocket, which helped us to calculate the protein-ligand binding affinities directly without having to resort to molecular docking and introduce additional errors into the free energy calculations pertaining to closed like conformations. Moreover, the crystal structures for the apo and holo conformations of the streptavidin protein are quite similar, except for the loop3-4 region (the monomer structures are shown in Figure 4.1). Excluding the loop3-4 region, the heavy atom RMSD (Å) between the 113 experimental apo and holo structures of streptavidin is ~1 Å with a backbone RMSD of less than 0.6 Å. The initial loop structure from the crystal structure isn’t relevant when using MTFlex-b, because of the nature of its conformation generation strategy. Only, the environment surrounding the loop residues is important since it defines van der Waals collisions and determines other important interactions (see chapter 2 for details). The rest of the protein including monomers B, C and D and rest of chain A (excluding the loop3-4 region) were kept rigid. The biotin ligands in all the sub-units of streptavidin were also kept fixed. Biotin and the rest of the protein (the tetramer excluding the loop region of chain A) were kept rigid in order to check for van der Waals collisions. The presence of biotin added an additional layer of steric clashes, which were also identified during loop generation using the geometric restraint of 2.8 Å for the non-bonded heavy atoms and 2.3 Å for non-bonded hydrogen atoms. The total number of backbone and side- chain conformations generated using MTFlex-b for loop3-4 of streptavidin, in the presence of biotin, was ~11 million conformations, for a total of 21,295 unique backbone loop conformations (see Table 4.2). The conformation generation calculations required ~7 days using our in-house MATLAB script. CPPTRAJ[132] and VMD[133] software were employed for analysis and visualization of the structures. 4.3.1 RMSD The structural Cα RMSD (Å) of the MTFlex-b generated loop conformations was calculated with respect to the closed loop conformation from streptavidin (chain A from PDB ID 1mk5) using CPPTRAJ, a utility program in the AMBER suite[132]. The MTFlex-b loop conformation structurally closest to the closed state of the loop from streptavidin (closed-like conformation) had a Cα RMSD of 1.63 Å and the open-like conformation (structurally most similar to the open 114 loop in the crystal structure) had a Cα RMSD of 1.61 Å. Figure 4.2 highlights the closed-like and open-like (lowest RMSD (Å)) loop conformations superimposed with the closed and open states of loop3-4 from the crystal structures. Figure 4.2 The left box displays the closed (biotin-bound) and open states of streptavidin superimposed on top of each other with the loop3-4 highlighted in mauve for the closed and blue for the open conformation. The center image shows the lowest RMSD (Å) loop conformation (shown in green) generated by MTFlex-b superimposed on the holo crystal structure (closed loop), and the right image shows the lowest RMSD MTFlex-b conformation superimposed on the apo crystal conformation (open loop). Using our conformation sampling strategy, we were able to map both the closed and open loop conformations from the available crystal structures. Apart from the closed-like and open-like conformations, we obtained quite a few intermediate loop conformations generated by MTFlex-b. Figure 4.3 shows the Cα RMSD (Å) of all the generated MTFlex-b loop conformations with respect to the experimentally observed closed loop of streptavidin. The generated MTFlex-b loop conformers were arranged in an increasing order of structural RMSD (Å) with respect to the closed state of loop3-4 from the streptavidin crystal structure. As can be seen from Figure 4.3, the MTFlex-b loop conformations traverse a 115 RMSD range of ~ 10 Å from the closed state covering all the major conformations including the closed, open and several intermediate states. Examples of conformational states explored by MTFlex-b are shown in Figure 4.3 in green. The experimental closed and open states of the loop are superimposed on the generated loop conformation in Figure 4.3 for comparison. Figure 4.3 The Cα RMSD (Å) of the MTFlex-b loop conformations with respect to the closed state of loop3-4 in streptavidin. The x-axis represents the total number of generated backbone loop conformations. For the sake of structural comparison, crystal-closed (pink) and open (blue) states of the loop are superimposed on the generated MTFlex-b loop conformation (green). 4.3.2 MT Free Energy Surface and MD Potential of Mean Force Studies Two-dimensional relative free energy surfaces were assembled using MT in order to assess the free energy differences between the different loop states. The closed to open transition 116 of the loop was characterized by the separation between Cα atoms of residues Gly48 and Ile30 (X-axis), and Asn49 and Leu109 (Y-axis). The positions of the residues chosen for the reaction coordinates are shown in Figure 4.9. For the free energy surface (FES) construction, the entire reaction pathway was divided into a number of windows with a grid size of 0.5 Å. The MTFlex-b generated conformations were arranged in windows based on the two reaction coordinates. Our docking protocol[119] was applied to obtain the optimized ligand position for each protein conformation in all the windows. The crystal closed and open states of the loop were also included in the FES construction. The free energies were calculated in the holo (ligand bound) and the apo (free) states in the aqueous phase for each of the windows in a parallel fashion using the MT method with a sampling range of ±0.25 Å. Please note that the closed and open states of loop3-4 in the ligand bound and free states were considered only for one of the monomers (monomer A) while the remaining three monomers were always considered in the native closed state with biotin bound in the binding pocket. Figure 4.4 highlights the free energy profile for the apo and holo conformations obtained using this procedure. The entire FES construction has been subdivided into open (O, closed (C and intermediate (I regions. These regions are pinpointed on the 2-D heat map plot as well. For the apo state (shown in the left image of Figure 4.4), it can be seen that the free energy gradually increases on going from the open to the closed state of the loop with the closed state being higher in energy than the rest of the conformations. The open state of the loop is estimated to be around 10.5 kcal/mol more stable than the closed state. For the apo state FES, we also identified a region where the free energy was ~3 kcal/mol lower than the open state (shown on the upper right part of the apo FES in Figure 4.4). On visualization, we found that this region has conformations quite similar to the crystal open loop with Cα RMSDs of ~3.2 Å. The structure corresponding to 117 the local minima of the closed apo state had a ~2.5 Å RMSD from the closed loop seen in the crystal structure. The free energy difference between the local minimums in the apo state predicted via MT-FES is 9.1 kcal/mol. The loop conformations are shown on the apo FES in Figure 4.4. For the holo FES, the closed state of the loop is clearly the lowest free energy region on the entire landscape. It is approximately 12.0 kcal/mol lower in free energy than the open state of the loop. In the holo state, MT is again identifying a local minima for the open state of the loop, which is around ~0.7 kcal/mol lower in free energy and has a 2.5 Å RMSD from the open loop seen in the crystal structure which gives a free energy difference between the open and the closed state of -11.3 kcal/mol. The local minimums for the open state observed in both the apo and the holo states are almost similar with a backbone RMSD of 1.07Å. The superimposed open loop conformations predicted by MT in the apo and the holo states are shown in Figure S3. Apo (O local minima: Loop backbone RMSD wrt crystal open : 3.2 Å 35 35 30 30 25 25 20 20 15 15 10 10 5 5 0 0 A A C C - - 9 9 0 0 1 1 U U E E L L / / a a C C - - 9 9 4 4 N N S S A A 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca local minima: Loop backbone RMSD wrt crystal close : 2.5 Å Holo local minima: Loop backbone RMSD wrt crystal open : 3.0 Å (O (I 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca (C 20 16 12 7 4 70 70 60 60 50 50 40 40 30 30 20 20 10 10 0 0 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 A A C C - - 9 9 0 0 1 1 U U E E L L a a C C - - 9 9 4 4 N N S S A A / / (C Figure 4.4 Relative free energy (kcal/mol) heat map for apo (left) and holo (right) streptavidin obtained using the Cα distance between ASN49/LEU109 and GLY48/ILE30 as reaction coordinate. We also performed two 90 ns MD simulations starting from the experimental open and 118 closed state and overlaid the simulation results onto the MT generated FES using the same set of reaction coordinates (shown in Figure 4.5). The MD simulations details are described in Section 4.6.1 of the supporting information. In order to make a robust comparison, MD simulations were carried out using the ff99SBILDN[134] and ff14SB[135] force fields. Figure 4.5 highlights the MD snapshots generated by FF14SB simulations superimposed on the FES generated using MT. From the apo MT FES superimposed with the MD snapshots in Figure 4.5, we observe that the MD generated data points are most densely populated in the open region of the PMF with a horizontal spreading across the surface covering the low energy region observed by MT in the apo state. This shows that the MD simulation samples similar regions as predicted by the MT FES. For holo, MD generated data points are highly concentrated in the closed region of the FES, which is also identified as the lowest free energy point by MT in the ligand bound case. Similar trends are observed in the case of MD snapshots generated by FF99SBILDN simulations for both the apo and holo simulations as shown in Figure 4.11. Both the force fields behaved similarly and populated nearly the same regions in both the apo and holo simulations. Comparison between the two force fields is shown in Figure 4.12, which shows the superimposed MD snapshots obtained by both force fields for both the apo and holo simulations. As can be seen from the figure, the regions sampled by the two force fields are nearly on top of each other with small differences in the case of the apo simulation, highlighting the similarity of the two force fields in terms of their structural preferences. 119 Apo (O Holo (O 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 A A C C - - 9 9 0 0 1 1 U U E E L L / / a a C C - - 9 9 4 4 N N S S A A (C 35 35 30 30 25 25 20 20 15 15 10 10 5 5 0 0 A A C C - - 9 9 0 0 1 1 U U E E L L / / a a C C - - 9 9 4 4 N N S S A A 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca (C 20 16 12 7 4 70 70 60 60 50 50 40 40 30 30 20 20 10 10 0 0 Figure 4.5 Relative free energy (kcal/mol) heat map of apo (left) and holo (right) streptavidin obtained using the Cα distance between Asn49/Leu109 and Gly48/Ile30 superimposed with the MD snapshots generated using FF14SB force fields. MD snapshots are highlighted in black color as scattered points in both the left and right figures with the rest of the PMF slightly faded in the background. Further validation of the MT FES was performed using umbrella sampling (US) coupled with the weighted histogram analysis method (WHAM)[136, 137] in order to study the free energy change upon loop transitions in both the apo and the holo systems. The PMF simulations were also carried out using both the FF99SBILDN and FF14SB force fields. Both the PMF simulations ran in parallel and took roughly 10 days each to finish using a single K80 GPU processor in the MSU HPCC facility. The PMF simulations were initiated from the closed state to obtain the holo PMF and from the open state for the apo MD PMF. Initially, a diagonal scan was performed along the chosen set of reaction coordinates used to study the transition between the open and the closed states of the loop. After that, a detailed scan was performed successively in both the X and Y directions using every conformation generated from the diagonal scan. This 120 scanning strategy was adopted from our previous work on metallochaperones[138]. The procedure to generate the MD-PMF along with the necessary MD parameters and related details are provided in the 4.6.1 and 4.6.2 sections of the supporting information. The PMF study shows significant differences between the two force fields. Figure 4.6 shows the MD-PMF obtained using the FF14SB force field for both the apo and holo simulations. The apo MD-PMF shows a free energy difference of 10.5 kcal/mol between the crystal open and crystal-closed state with the open state seen crystallographically as the global minimum. A local minimum was observed near the crystal-closed state (structure highlighted in the left image in Figure 4.6). The structural RMSD (Å) between the local minimum is ~2.85 Å from the crystal-closed state. This RMSD analysis only considers the loop region, but there was a large change in the orientation of the loop relative to the rest of the protein (see Figure 4.13). So, when the RMSD calculation is carried out via alignment on the entire protein, the loop backbone RMSD is estimated to be ~3.8 Å. The local minimum observed for the open state was quite similar to the experimental open state structure with a backbone RMSD of 0.41 Å. The free energy difference between the local minima in the apo state is roughly 9.4 kcal/mol. The free energy difference obtained from the MT-FES surface between the local minima of the closed and open states in the apo state is quite close (~9.1 kcal/mol). The holo MD-PMF (right image in Figure 4.6) estimates a free energy difference of 16 kcal/mol between the crystal closed and open state. A local minimum was identified here as well near the crystal open state. The backbone RMSD (Å) of the local minimum from the crystal structure of the open state is ~1.85Å. The total RMSD including the change in orientation was ~4.7 Å (structures shown in Figure 4.14). The local minimum observed for the closed state is almost same as the crystal closed state with the backbone loop RMSD of 0.67 Å. The free energy difference between the local minima in the 121 holo state is 11.7 kcal/mol. This free energy difference is quite similar to the free energy difference of 11.0 kcal/mol obtained from the MT-FES between the local minima and the experimental closed state. Since the free energy differences for the predicted minima were so similar between MT-FES and MD FF14SB PMF methodologies, we were also interested in comparing the structural differences between these two methods. Figure 4.15 highlights the superimposed structures obtained for the apo open, apo closed and holo open states predicted from the MT FES and the MD PMF (FF14SB) methods. The structures for the apo closed state have a Cα RMSD of 2.64 Å for the loop region. For the apo open state, the structure predicted by MT is curled inwards towards the protein relative to the MD FF14SB with a resultant Cα RMSD of 3.85 Å. The structure of the holo open state for MT is also curled inwards and is significantly different from the FF14SB structure with a RMSD of 5.90Å. MT-FES and FF14SB holo closed state is essentially identical to the experimental structure (see Figure 4.6 for details). Table 4.5 and 4.6 in the supporting information show the comparison of the phi-psi main chain dihedral between the crystal structure (closed and open state) with the dihedrals of the local minima obtained by MTFlex-b and from the MD FF14SB PMF simulations. 122 Apo local minima: Loop backbone RMSD wrt crystal open: 0.41 Å A C - 9 0 1 U E L / a C - 9 4 N S A 28 26 24 22 20 18 16 14 10 12 14 16 18 20 GLY48-Ca/ILE30-Ca local minima: Loop backbone RMSD wrt crystal closed: 2.85 Å 45 40 35 30 25 20 15 10 5 0 A C - 9 0 1 U E L / a C - 9 4 N S A 28 26 24 22 20 18 16 14 10 20 18 16 14 10 8 6 4 2 45 40 35 30 25 20 15 10 5 0 Holo local minima: Loop backbone RMSD wrt crystal open : 1.85 Å 12 14 16 18 20 GLY48-Ca/ILE30-Ca local minima: Loop backbone RMSD wrt crystal closed : 0.67 Å Figure 4.6 Relative free energy (kcal/mol) heat map for apo (left) and holo (right) streptavidin obtained by using umbrella sampling with the FF14SB force fields using the Cα distances between Asn49/Leu109 (y-axis) and Gly48/Ile30 (x-axis) as the two reaction coordinates. From the holo MD FF14SB PMF plot in Figure 4.6 (right panel), a transition state can be clearly seen between the open and closed loop state. We further analyzed the structural and free energy differences between the transition state and the closed and open loop state to obtain some pathway details. Figure 4.7 schematically shows the one-dimensional relative free energy differences obtained using the open minima, closed minima and the transition state as observed in the FF14SB PMF simulations for the holo case. Residues forming hydrogen bond are highlighted in yellow color and residues forming vander Waal interactions are shown in green color in the figure 4.7. 123 ) l o m / l a c k ( G Δ Δ 17.5 11.7 0.00 Open loop Closed loop Reaction coordinate Figure 4.7 Computed changes in the relative free energy on going from the closed to the open holo state. The closed loop is the global minima in the holo state and the open loop is ~11 kcal/mol higher in free energy. The observed transition state barrier is ~17.5 kcal/mol on going from the closed to open loop. In the holo state, the closed loop is the global minima and is favored because of numerous non-covalent interactions between biotin and streptavidin. The ureido oxygen of biotin has a hydrogen bond (H-bond) with the side chains of Ser27, Asn23 and Tyr43. The side chains of Ser45 and Asp128 form H-bonds with the hydrogen atoms attached to the two ureido nitrogen atoms. The sulfur from biotin also forms a H-bond with the side chain of Thr90. The carboxylate end of biotin forms H-bonds with the amide backbone of Asn49 and the side chain of Ser88. Stabilization by van der Waals interaction come from Trp70, Trp92, Trp105 and Trp120 (provided by the adjacent monomer). Apart from these Trp residues, three residues from the loop, Val47, Gly48 and Ala50 also provide hydrophobic interactions. Table 4.3 summarizes all the residues involved in H-bond and van der waal interactions with biotin. 124 The activation barrier observed by FF14SB MD PMF simulations in the holo state is ~17.5 kcal/mol. In the transition state, the loop (highlighted in purple ribbon in Figure 4.19) has partially opened up causing changes in the interactions present. Due to the extra space provided by partial loop opening, the carboxylate end of biotin has straightened up instead of being curled inwards while the H-bond between Asn49 and biotin is absent. The comparison of biotin’s position in closed and transition state is shown in the left image of Figure 4.19. The carboxylate of biotin is slightly straightened and pushed out forming two H-bonds with the side chains of Arg84 (red in Figure 4.19). The first half of the loop is still folded in, but the side-chain of Ser45 has rotated breaking the H-bond with the ureido nitrogen of biotin. The H-bond between Asp128 and the other ureido nitrogen is also broken mainly due to the rotation of the side chain of Asp128. However, the H-bonds between the ureido oxygen and Tyr43 and Ser27 remain intact. The distance between Trp120 (from the adjacent monomer) and the ureido ring of biotin has also increased reducing the magnitude of the hydrophobic interaction. From the left image of Figure 4.19, Trp120 in the closed loop state (purple) is much closer to the biotin ring relative to its position in the transition state (black). As the loop opens up completely, the H-bond between biotin and loop residues Ser-45 and Asn49 remain severed. The carboxylate end of biotin is still forming 2 H-bonds with the side-chains of Arg84. The hydrophobic interactions between Trp120 and the ureido ring are lost as can be seen from the right image of Figure 4.19. The side-chain of Tyr43 moves closer to the ureido nitrogen thus forming a H-bond between O -Tyr and the H- ureido nitrogen along with the ureido oxygen. The hydrogen bond between Asp128 and the ureido nitrogens is re-established. The open loop is further stabilized by several intermolecular interactions with the protein itself. There is a H-bond observed between the side chain of Ser45 125 and backbone of Ser52, the backbones of Ser45 and Gly48, the backbone of Ser52 and the side chain of Asn81 and the side chains of Ser52 and Asn81. The MD PMF obtained using the FF99SBILDN force fields is shown in Figure 4.16. In the apo PMF simulations by, the local minimum observed for the open state has an RMSD of 0.79 Å from experiment (shown in the left image in Figure 4.16). The local minimum for the closed state has a backbone RMSD of 2.36 Å with respect to experiment when the loops are aligned to each other. When the alignment is over the entire protein, the loop backbone RMSD is 4.64 Å (highlighted in Figure 4.17). The free energy difference between the crystal open and closed loop in the apo state is ~15 kcal/mol while the free energy difference between the observed minima is 14.6 kcal/mol. The holo state (right image in Figure 4.16) obtained using FF99SBILDN predicts a free energy difference of ~22 kcal/mol between the crystal open and closed states. The local minimum observed for the closed state is quite close to the crystal-closed state with a RMSD of 0.54 Å. The minimum in the case of holo PMF simulations has a backbone RMSD of 2.12 Å when aligned to the loop seen in the crystal structure and 2.78 Å when aligned to the entire protein (shown in Figure 4.18). The free energy difference observed between the two minima in the holo system is 21.3 kcal/mol. With these PMFs in hand the free energy of binding in the open and closed states (exp = - 18.26 kcal/mol) can be extrapolated from the free energy differences observed in the apo and holo states (via construction of a thermodynamic cycle). Using this approach, the free energy differences between open and closed loop states predicted using the FF99SBILDN force field are: free energy of binding to the closed state is -33.3 kcal/mol and to the open state is +3.7 kcal/mol. The former represents a significant overestimation of the experimental binding affinity, while the latter is an underestimation. The free energy differences observed using the FF14SB 126 force field are in better accord with experiment and with the free energy differences observed using MT FES. The free energy differences generated via different methodologies are summarized in Table 4.1. Table 4.1 Summary of free energy differences (in kcal/mol) between open and closed states of the loop in apo and holo states obtained with MT-free energy surface (MT-FES), MT- thermodynamic cycle (MT-TC), and MD using FF99SBILDN and FF14SB force fields. The differences are reported for the crystal open and loop position and minima obtained by the MT and MD methods. ΔG (kcal/mol) MT-FES MT-TC MD-FF14SB MD-FF99SBILDN ΔGO–>C Apo-crystal 10.5 ΔGO–>C Holo-crystal –12.0 ΔGO–>C Apo-minima 9.1 ΔGO–>C Holo-minima –11.3 10.4 –12.0 9.0 –11.3 10.5 –16 9.4 –11.7 15 –22 14.6 -21.3 The observed free energy differences between the closed and open states obtained by MD-FF99SBILDN, MD-FF14SB and MT are largely different due to the different energy functions employed by the methods. Sampling effects, etc., also play a role, but the energy functions used are almost certainly the largest source of variance. Overall, the free energy trends are qualitatively similar in both (MT and MD) methods with the closed state being the lowest in free energy in the holo state and the least stable in the apo state. We performed a more detailed analysis of our free energy calculations to figure out the reason of stability of open loop in the apo state and stability of closed loop in holo state. Table 4.4 highlights the individual components of free energy differences for both the apo and holo states. It can be seen from the table that for the apo state, the net free energy change in the gas phase favors the closed loop over the open loop by 1.2 kcal/mol but the solvation free energy change is highly favorable for the open state (-11.6 kcal/mol) as compared to the closed state favoring the open state when the 127 ligand is not bound in the solution phase. So, based on our results we observe that the loop prefers the open state to closed state by 10.4 kcal/mol mainly due to a favorable solvation free energy in the open state. In the holo state as well, the solvation free energy is highly favorable for the open loop when compared to the closed loop but the streptavidin-biotin interactions are much stronger for the closed loop (~24 kcal/mol) making the closed loop more favorable in the holo state by ~12 kcal/mol. The free energy differences observed between PMF obtained using FF14SB and MT-FES method are comparable in both the apo and holo states. In the apo state, both FF14SB and MT- FES estimate a free energy difference of 10.5 kcal/mol between the closed and open state seen experimentally. For the holo case, the free energy difference between the experimentally observed open and closed states is 16 kcal/mol for the former and ~12 kcal/mol for the latter. But, the free energy differences between the local minimums obtained via MT-FES and FF14SB PMF are quite similar (see Table 4.1). The fact that MT gives comparable free energy differences demonstrates its relevance as a relatively efficient technique for estimating free energies. 4.3.3 Thermodynamic free energy cycle The binding free energy of the streptavidin-biotin system in the aqueous phase was estimated using a thermodynamic cycle linking the open and the closed states of the loop3-4 of streptavidin protein in the holo (ligand bound) and the apo (free) states as the end states (see Figure 4.8). The end states were represented by selecting an ensemble of closed-like and open- like conformations from the MT-FES windows containing the experimental closed and open 128 state of the loop. The free energies for each of the MTFlex-b conformations in the selected ensembles were calculated in the holo (ligand bound) and the apo (free) states using the MT method in the aqueous phase as described in the Methods section. The free energies of the end states (open apo, open holo, closed apo and closed holo) for the thermodynamic cycle were estimated by taking the lowest free energy conformation for each of the end states individually. From the thermodynamic cycle (Figure 4.8a), it can be seen that the open loop in the apo state is ~10.4 kcal/mol more stable than the closed loop. The trend is in modest agreement with the Song et al MD study, which estimated the stability of open loop to be ~5 kcal/mol more stable than the closed loop in the apo state[129]. In the holo state, the closed loop is estimated to be more stable (as expected) than the open state (by 12 kcal/mol). The free energy differences between the open and closed states are consistent with those estimated by the MT-FES method (see Table 4.1 for a summary). A slight difference in the relative free energy differences exist because the apo state in the TC accounts for the free energy contributions from the free states of both protein and ligand while in MT-FES, the apo state FES correspond to only the free state of the protein. Apart from the relative stability of the open and closed loops in the ligand un-bound and bound states; we can also estimate the free energy change upon ligand binding. The free energy change upon biotin binding to the open loop state is estimated to be –4.2 kcal/mol (as shown in Figure 4.8a). A study by Chu et al in 1998 estimated that the binding affinity of Streptavidin- Biotin was –10 kcal/mol after deleting loop3-4 via circular permutation[120]. Based on this analysis, we are underestimating the binding affinity of biotin in the open loop state and overestimating the free energy change upon loop ordering in the ligand bound state (by ~6 kcal/mol). The red and the blue arrows shown in Figure 4.8a represent the two paths that can be 129 taken to estimate the free energy of binding. The estimated binding free energy of biotin taking either path is ~ –16.2 kcal/mol. Our prediction for binding free energy change is in reasonable accord with the experimental binding affinity of –18.3 kcal/mol[139, 140]. ΔGbinding = ΔGO–>C ΔGbinding = –16.2 kcal/mol unbound + ΔGC bind = ΔGO bind + ΔGO–>C bound (4.1) The high binding affinity of biotin to streptavidin is attributed to three major factors: extensive hydrogen bond network of streptavidin with the ureido and carboxylate groups of biotin, hydrophobic interactions between several residues and biotin and the closure of the loop3- 4 over the Biotin binding site[123, 127, 141, 142]. Out of all the hydrophobic interactions, Trp120 from an adjacent monomer (highlighted in Figure 4.8) forms a key hydrophobic interaction with biotin bound (Trp120-biotin interactions are formed between monomers A and D, and, B and C). Its importance in the binding free energy calculations has been shown previously [124, 130, 131]. Hence, reinforcing our use of the tetrameric structure in our calculations. 130 a) b) Open apo ΔGO –>C 10.4 kcal/mol unbound = Trp 120 bind = ΔGO –4.2 kcal/mol Open state of loop3-4 Closed state of loop3-4 bind = ΔGC –26.6 kcal/mol Open holo ΔGO –>C bound = –12.0 kcal/mol Closed apo Open apo Closed holo Closed holo ΔGbind = –16.2 kcal/mol ΔGExperimental = –18.26 kcal/mol Figure 4.8 a) Detailed representation of thermodynamic free energy cycle for binding in solution phase for the streptavidin-biotin system. b) The net free energy change upon biotin binding and loop closure in the solution phase. 4.4 Conclusions Rapidly obtaining accurate free energies remains a daunting problem, which if solved will have a major impact on structure based drug design[1-3]. Despite the many attempts to address this issue, we still need fast and accurate methods to obtain free energies for biochemical 131 processes[6, 7]. To a large extent, the current state of affairs is rooted in our use of compute intensive conformational sampling strategies. The conformational sampling issue is especially notable for large biomolecules like proteins[82, 84, 114]. Often, proteins possess a highly flexible loop or intrinsically disordered region (IDR)[143-145], which may undergo conformational changes of functional relevance. Therefore estimation of the energetic cost of loop motions in molecular recognition processes is of great import. To fully understand the free energy changes induced due to loop conformational changes, it is important to model the equilibrium flexibility of a protein loop, which remains a computationally intensive problem using our current sampling and free energy protocols. To address this ongoing challenge, we introduce a novel approach for relatively fast and accurate estimation of free energies. The significant configurational states of the protein are sampled by using MTFlex-b, which uses 1-D atom pairwise databases to generate molecular conformations. Each of the generated molecular conformations serves as the “seed structure” for the MT method, which estimates the free energies by performing local partition function estimations. Using the parallel mechanics of the MT method, this strategy provides us with a trade off between sampling speed and accuracy. This procedure was applied to study the free energy changes associated with the movement of the loop3-4 region within streptavidin in both the free and bound states. The large conformational ensemble of the loop3-4 region generated by MTFlex-b contained both closed-like and open-like loop conformations with a best structural RMSD of ~1.6 Å with respect to experiment. We generated the MT FES and found that the closed state of the loop is more stable in the holo state with the open loop being least stable. Similarly, in the apo state, the trend is reversed and the closed loop is found to be least stable. We also carried out MD PMF 132 simulations using umbrella sampling and obtained similar trends for both the apo and holo PMFs using the FF99SBILDN and FF14SB force fields relative to the MT FES. Even though the MT FES is less continuous than the MD PMF the trend in free energy is similar in both the MT and MD PMF plots. In particular, the free energy differences predicted by the FF14SB PMF simulations and MT-FES method were in good accord, while FF99SBILDN provide less internally consist free energy results. Clearly, the reparameterization carried out by the Simmerling group (creating FF14SB) has made significant improvements to the modeling of the processes highlighted herein[135]. To summarize, the free energy difference observed by MD- FF14SB between closed and open state in the apo state is 10.5 kcal/mol, which is exactly the same as the free energy difference estimated by MT FES. In the holo state, the free energy difference between the experimental closed and open states differ by ~4kcal/mol between the estimates provided by MD FF14SB and MT FES. But, the free energy differences between the minima estimated by the two methods are quite comparable at ~11kcal/mol. Our free energy components estimate that solvation is playing a key role in the stability of the open loop in the apo state. In the holo state as well, solvation is significantly favoring the open loop, but the contacts between Streptavidin and Biotin are stabilizing the closed loop much more favorably making it ~12 kcal/mol more stable than the open loop. Using a free energy thermodynamic cycle, we observed that the relative stability of the open loop in the apo state is -10.4 kcal/mol with respect to the closed state, while in the holo state; the closed loop conformation was estimated to be -12 kcal/mol more stable than the open loop. The binding free energy change was also obtained and was estimated to be –16.2 kcal/mol, which is in reasonable agreement with the experimental binding affinity of –18.3 kcal/mol. Give the success of the MT based approach, we can envision this method being applied to 133 other examples of free energy changes accompanied by the conformational mobility of loops in other protein systems including. As the size of the loops increases the complexity and the computational expense for the conformational search would concomitantly increase. However, with advancement in GPU and CPU processors and parallel computation coupled with enhanced software abilities, all of these problems are addressable using our approach. 4.5 Acknowledgement This work was supported in part by Michigan State University through computational resources provided by the Institute for Cyber-Enabled Research. 4.6 Supporting information Table 4.2 The total number of backbone loop conformations and the (backbone + sidechain) conformations generated by MTFlex-b in the presence of Biotin bound in the active site of crystal streptavidin monomer. 21,295 11546486 134 Number of conformations in the presence of Biotin Backbone conformations Backbone and side chain conformations ASN49:CA GLY48:CA ILE109:CA ILE30:CA Figure 4.9 Cartoon representation of Streptavidin crystal monomer (Chain A) with the loop3-4 in the closed conformation. The reaction coordinates are distances between Cα atoms of Resid30 and 48, and between Resid109 and 49. 135 Figure 4.10 Superimposed conformations of open minima observed in MT-apo state (shown in tan color) and MT-holo state (highlighted in green color). The observed Cα RMSD is ~1.07 Å. 136 Apo (O Holo (O 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 A A C C - - 9 9 0 0 1 1 U U E E L L a a C C - - 9 9 4 4 N N S S A A / / (C 35 35 30 30 25 25 20 20 15 15 10 10 5 5 0 0 A A C C - - 9 9 0 0 1 1 U U E E L L a a C C - - 9 9 4 4 N N S S A A / / 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca (C 20 16 12 7 4 70 70 60 60 50 50 40 40 30 30 20 20 10 10 0 0 Figure 4.11 Relative free energy (kcal/mol) heat map of apo (left) and holo (right) streptavidin obtained using the Cα distance between Asn49/Leu109 and Gly48/Ile30 superimposed with the MD snapshots generated using FF99SBILDN force fields. MD snapshots are highlighted in black color as scattered points in both the left and right figures with the rest of the PMF slightly faded in the background. 137 Apo Holo FF99SBILDN FF14SB A C - 9 0 1 U E L / a C - 9 4 N S A 28 26 24 22 20 18 16 14 10 A C - 9 0 1 U E L / a C - 9 4 N S A 28 26 24 22 20 18 16 14 10 12 14 16 18 20 12 14 16 18 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca Figure 4.12 Superimposed MD snapshots generated using FF99SBILDN (shown as black scattered points) and FF14SB (pink scattered points) force fields obtained using the Cα distance between Asn49/Leu109 and Gly48/Ile30 138 Figure 4.13 The crystal-closed state (shown in pink) superimposed with the apo-closed minima obtained by MD-FF14SB force field (highlighted in aqua color). The left panel shows the alignment based on only the loop region while the right panel shows the alignment based on the entire protein monomer. 139 Figure 4.14 The crystal open state (shown in brown) superimposed with the holo-open minima obtained by MD-FF14SB force field (highlighted in green color). The upper panel shows the alignment based on only the loop region while the lower panel shows the alignment based on the entire protein monomer. 140 Figure 4.15 The superimposed minima conformations predicted by MT (green) and MD- FF14SB (mauve) methods for apo open minima, apo closed minima and holo open minima (from left to right). 45 45 40 40 35 35 30 30 25 25 20 20 15 15 10 10 5 5 0 0 A A C C - - 9 9 0 0 1 1 U U E E L L / / a a C C - - 9 9 4 4 N N S S A A 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 141 Apo MD PMF local minima: Loop backbone RMSD wrt crystal open: 0.79 Å 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca A A C C - - 9 9 0 0 1 1 U U E E L L a a C C - - 9 9 4 4 N N S S A A / / 28 28 26 26 24 24 22 22 20 20 18 18 16 16 14 14 10 10 local minima: Loop backbone RMSD wrt crystal closed: 2.36 Å Holo MD PMF local minima: Loop backbone RMSD wrt crystal open: 2.12 Å 12 12 14 14 16 16 18 18 20 20 GLY48-Ca/ILE30-Ca GLY48-Ca/ILE30-Ca 23 16 12 8 4 50 50 45 45 40 40 35 35 30 30 25 25 20 20 15 15 10 10 5 5 0 0 local minima: Loop backbone RMSD wrt crystal closed: 0.54 Å Figure 4.16 Relative free energy (kcal/mol) heat map for apo (left) and holo (right) streptavidin obtained by using umbrella sampling with FF99SBILDN force fields using the Cα distances between ASN49/LEU109 and GLY48/ILE30 as the two reaction coordinates. Figure 4.17 The crystal-closed state (shown in blue) superimposed with the apo-closed minima obtained by MD-FF99SBILDN force field (highlighted in yellow color). The left panel shows the alignment based on only the loop region while the right panel shows the alignment based on the entire protein monomer. 142 Figure 4.18 The crystal-open state (shown in blue) superimposed with the holo-open minima obtained by MD-FF99SBILDN force field (highlighted in yellow and cyan color). The left panel shows the alignment based on only the loop region while the right panel shows the alignment based on the entire protein monomer. 143 Table 4.3 List of the residues involved in forming hydrogen bond and vander waal interactions with Biotin. The residues color-coded in red are part of the loop3-4. Hydrogen-bond interactions Vander-waal interactions Loop state Ureido oxygens Ureido Nitrogens Biotin Sulfur Closed Ser27, Asn23 and Tyr43 Ser45 and Asp128 Thr90 Transition Ser27 and Tyr43 Ser27 and Open Tyr43 - Tyr43 and Asp128 Thr90 Thr90 Carboxylate end Amide backbone of Asn49 and side chain of Ser88 2 H-bonds with Arg84 2 H-bonds with Arg84 Trp120 Other hydrophobic interactions Trp70, Trp92, Trp105, Yes Ala46, Val47, Gly48, Ala50 Partial Trp70, Trp92, Trp105, Ala46 Trp70, Trp92, Lost Trp105 Figure 4.19 The active site of streptavidin with Biotin shown for the closed loop (left) and open loop (right) states superimposed with the structure of the active site at the transition state (shown in black in both plots). Residues forming hydrogen bonds are highlighted in orange, residues forming van der Waals interactions are shown in green, Trp120 from sub-unit D is shown in purple and Arg84 in red. Table 4.4 The free energy difference between open and closed loop state with the free energy components in the form of the change in the protein torsion free energy, the change in the protein non-covalent interactions, the free energy change in the protein-ligand interactions, they change in the solvation free energy, the net free energy change in gas phase and in solution phase. All free energy values are in kcal/mol. ΔΔGTorsion- protein Apo state Holo state 1.8 -1.0 ΔΔGIntra- protein (Non- covalent) ΔΔGInter- ProteinLigand ΔΔGsolvation Net change ΔΔG (Gas phase) Net change ΔΔG (Solution phase) -0.6 2.1 0 -11.6 23.3 -12.4 1.2 24.4 -10.4 12.0 144 Table 4.5 Comparison of main chain phi-psi dihedrals of all the loop residues of the experimental closed structure with the closed minima structures obtained using MTFlex-b and from the FF14SB MD simulations. The differences are shown as positive or negative from a reference value of 0º. Res 45 Res 46 Res 47 Res 48 Res 49 Res 50 Res 51 Res 52 phi psi phi psi phi psi phi psi phi psi phi psi phi psi phi psi MTFlex-b close minima FF14SB close minima +15.15 +13.97 +10.61 +1.11 +3.27 +5.24 +14.43 +4.26 +20.67 +89.59 +89.75 +1.26 +5.73 +13.13 +5.77 –33.37 +1.37 +10.26 +71.55 +33.87 +70.34 +21.30 –160.41 –32.50 –115.34 –137.97 +59.84 +17.29 +2.30 +31.30 +22.10 –52.41 145 Table 4.6 Comparison of main chain phi-psi dihedrals of all the loop residues of the experimental open structure with the open minima structures obtained using MTFlex-b and from the FF14SB MD simulations. The differences are shown as positive or negative from a reference value of 0º. Res 45 Res 46 Res 47 Res 48 Res 49 Res 50 Res 51 Res 52 phi psi phi psi phi psi phi psi phi psi phi psi phi psi phi psi MTFlex-b open minima FF14SB open minima +21.01 +19.33 +12.37 +31.99 +11.61 +9.44 +16.22 +20.37 +3.69 +3.91 +2.36 +8.79 +15.11 +26.04 +11.49 +8.39 +35.33 –93.95 +30.28 –44.62 +18.29 +23.13 –75.16 +95.92 -57.22 +34.18 –94.64 +123.47 +14.86 +44.38 +38.06 +28.54 146 4.6.1 MD-PMF methodology Starting from the holo crystal closed structure and apo crystal open state, we performed PMF simulations that transited to the open or closed state through a series of windows where the two “reaction coordinate” distances increased/decreased by the window size successively[138]. The window size for the X and Y coordinate was chosen to be 0.32 Å and 0.46 Å, respectively. There were a total of 32 windows in this diagonal scan. For each window, 1ns equilibration and 2ns sampling was performed, and the data points were stored every 200fs. The starting structure of each window came from the final sampling snapshot of the previous window. After that, the last snapshot of each of the 32 diagonal windows was used as starting structure to sample along the X-axis while maintaining the Y coordinate distance. Again, data points were collected every 200fs and total sampling was 2ns for each window, with the starting structure coming from the final sampling snapshot from the previous window. This time, there were total 32 parallel runs and for each run there were 34 successive windows along the X-axis. The force constant for both “reaction coordinate” was 60 kcal/(mol*Å2) for all the US windows. The two-dimensional WHAM was utilized to obtain the free energy profile[137]. 4.6.2 MD simulation details The simulations were performed using the FF99Sbildn[134] and FF14SB[135] force fields of the AMBER14 software package[146]. Each protein structure was solvated in a periodically replicated octahedral box using the TIP3P water model. SHAKE was used to constrain bonds with hydrogen atoms and a time step of 2fs was used during all the MD runs. The Particle mesh Ewald method[147-149] was used to treat long-range electrostatic interactions and a cutoff of 12.0 Å was used for the non-bonded interactions.[150] The counter-ions were 147 added to neutralize the system and the parameters for the ions were taken from IOD parameter set[151]. The parameters for Biotin were obtained using the GAFF force field[152]. HF/6-31G* level theory was used to optimize the Biotin structure and charges were fitted by using RESP fitting at MP2/6-31G* level[153]. 148 REFERENCES 149 REFERENCES Gilson, M.K. and H.X. Zhou, Calculation of protein-ligand binding affinities. Annual Review of Biophysics and Biomolecular Structure, 2007. 36: p. 21-42. Michel, J. and J.W. Essex, Prediction of protein-ligand binding affinity by free energy simulations: assumptions, pitfalls and expectations. Journal of Computer-Aided Molecular Design, 2010. 24(8): p. 639-658. Shirts, M.R., D.L. Mobley, and J.D. Chodera, Alchemical Free Energy Calculations: Ready for Prime Time? Annual Reports in Computational Chemistry, Vol 3, 2007. 3: p. 41-59. Halperin, I., et al., Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins-Structure Function and Genetics, 2002. 47(4): p. 409-443. Yuriev, E. and P.A. Ramsland, Latest developments in molecular docking: 2010-2011 in review. Journal of Molecular Recognition, 2013. 26(5): p. 215-239. Chodera, J.D., et al., Alchemical free energy methods for drug discovery: progress and challenges. Current Opinion in Structural Biology, 2011. 21(2): p. 150-160. Christ, C.D., A.E. Mark, and W.F. van Gunsteren, Feature Article Basic Ingredients of Free Energy Calculations: A Review. Journal of Computational Chemistry, 2010. 31(8): p. 1569-1582. Woo, H.J. and B. Roux, Calculation of absolute protein-ligand binding free energy from computer simulations. Proceedings of the National Academy of Sciences of the United States of America, 2005. 102(19): p. 6825-6830. Zwanzig, R.W., High-Temperature Equation of State by a Perturbation Method .1. Nonpolar Gases. Journal of Chemical Physics, 1954. 22(8): p. 1420-1426. Kirkwood, J.G., Statistical mechanics of fluid mixtures. Journal of Chemical Physics, 1935. 3(5): p. 300-313. Buch, I., S.K. Sadiq, and G. De Fabritiis, Optimized Potential of Mean Force Calculations for Standard Binding Free Energies. Journal of Chemical Theory and Computation, 2011. 7(6): p. 1765-1772. Essex, J.W., et al., Monte Carlo simulations for proteins: Binding affinities for trypsin- benzamidine complexes via free-energy perturbations. Journal of Physical Chemistry B, 1997. 101(46): p. 9663-9669. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 150 13. 14. 15. 16. 17. 18. 19. 20. 21. 23. 24. 25. 26. Bahar, I., et al., Global Dynamics of Proteins: Bridging Between Structure and Function. Annual Review of Biophysics, Vol 39, 2010. 39: p. 23-42. 22. Meireles, L., et al., Pre-existing soft modes of motion uniquely defined by native contact topology facilitate ligand binding to proteins. Protein Science, 2011. 20(10): p. 1645- 1658. Gumbart, J.C., B. Roux, and C. Chipot, Standard Binding Free Energies from Computer Simulations: What Is the Best Strategy? Journal of Chemical Theory and Computation, 2013. 9(1): p. 794-802. Roux, B., The Calculation of the Potential of Mean Force Using Computer-Simulations. Computer Physics Communications, 1995. 91(1-3): p. 275-282. Jorgensen, W.L., Free-Energy Calculations - a Breakthrough for Modeling Organic- Chemistry in Solution. Accounts of Chemical Research, 1989. 22(5): p. 184-189. Klimovich, P.V., M.R. Shirts, and D.L. Mobley, Guidelines for the analysis of free energy calculations. Journal of Computer-Aided Molecular Design, 2015. 29(5): p. 397- 411. Shirts, M.R. and D.L. Mobley, An introduction to best practices in free energy calculations. Methods Mol Biol, 2013. 924: p. 271-311. Bajorath, J. and S. Sheriff, Comparison of an antibody model with an X-ray structure: The variable fragment of BR96. Proteins-Structure Function and Genetics, 1996. 24(2): p. 152-157. Streaker, E.D. and D. Beckett, Ligand-linked structural changes in the Escherichia coli biotin repressor: The significance of surface loops for binding and allostery. Journal of Molecular Biology, 1999. 292(3): p. 619-632. Nestl, B.M. and B. Hauer, Engineering of Flexible Loops in Enzymes. Acs Catalysis, 2014. 4(9): p. 3201-3211. Papaleo, E., et al., The Role of Protein Loops and Linkers in Conformational Dynamics and Allostery. Chemical Reviews, 2016. 116(11): p. 6391-6423. Saraste, M., P.R. Sibbald, and A. Wittinghofer, The P-loop--a common motif in ATP- and GTP-binding proteins. Trends Biochem Sci, 1990. 15(11): p. 430-4. Slesinger, P.A., Y.N. Jan, and L.Y. Jan, The S4-S5 loop contributes to the ion-selective pore of potassium channels. Neuron, 1993. 11(4): p. 739-49. Stuart, D.I., et al., Alpha-lactalbumin possesses a novel calcium binding loop. Nature, 1986. 324(6092): p. 84-7. 151 27. 28. 29. 30. 31. 33. 34. 35. 36. 37. 38. 39. Steichen, J.M., et al., Structural Basis for the Regulation of Protein Kinase A by Activation Loop Phosphorylation. Journal of Biological Chemistry, 2012. 287(18): p. 14672-14680. Bernstein, L.S., et al., RGS2 binds directly and selectively to the M1 muscarinic acetylcholine receptor third intracellular loop to modulate Gq/11alpha signaling. J Biol Chem, 2004. 279(20): p. 21248-56. Ciarapica, R., et al., Molecular recognition in helix-loop-helix and helix-loop-helix leucine zipper domains - Design of repertoires and selection of high affinity ligands for natural proteins. Journal of Biological Chemistry, 2003. 278(14): p. 12182-12190. Kiss, C., et al., Antibody binding loop insertions as diversity elements. Nucleic Acids Research, 2006. 34(19). Fukuchi, S., et al., Intrinsically disordered loops inserted into the structural domains of human proteins. Journal of Molecular Biology, 2006. 355(4): p. 845-857. 32. Michalsky, E., A. Goede, and R. Preissner, Loops In Proteins (LIP) - a comprehensive loop database for homology modelling. Protein Engineering, 2003. 16(12): p. 979-985. Greer, J., Model for Haptoglobin Heavy-Chain Based Upon Structural Homology. Proceedings of the National Academy of Sciences of the United States of America- Biological Sciences, 1980. 77(6): p. 3393-3397. Kwasigroch, J.M., J. Chomilier, and J.P. Mornon, A global taxonomy of loops in globular proteins (vol 259, pg 855, 1996). Journal of Molecular Biology, 1996. 261(5): p. 673- 673. Ring, C.S., et al., Taxonomy and Conformational-Analysis of Loops in Proteins. Journal of Molecular Biology, 1992. 224(3): p. 685-699. Tramontano, A. and A.M. Lesk, Common Features of the Conformations of Antigen- Binding Loops in Immunoglobulins and Application to Modeling Loop Conformations. Proteins-Structure Function and Genetics, 1992. 13(3): p. 231-245. Zhang, H.Y., et al., A fast and efficient program for modeling protein loops. Biopolymers, 1997. 41(1): p. 61-72. Cui, M., M. Mezei, and R. Osman, Prediction of protein loop structures using a local move Monte Carlo approach and a grid-based force field. Protein Engineering Design & Selection, 2008. 21(12): p. 729-735. de Bakker, P.I.W., et al., Ab initio construction of polypeptide fragments: Accuracy of loop decoy discrimination by an all-atom statistical potential and the AMBER force field 152 with the generalized born solvation model. Proteins-Structure Function and Genetics, 2003. 51(1): p. 21-40. Rapp, C.S. and R.A. Friesner, Prediction of loop geometries using a generalized born model of solvation effects. Proteins-Structure Function and Genetics, 1999. 35(2): p. 173- 183. Spassov, V.Z., P.K. Flook, and L. Yan, LOOPER: a molecular mechanics-based algorithm for protein loop prediction. Protein Engineering Design & Selection, 2008. 21(2): p. 91-100. Jacobson, M.P., et al., A hierarchical approach to all-atom protein loop prediction. Proteins-Structure Function and Bioinformatics, 2004. 55(2): p. 351-367. Rapp, C.S., et al., Prediction of protein loop geometries in solution. Proteins-Structure Function and Bioinformatics, 2007. 69(1): p. 69-74. Zhu, K., et al., Long loop prediction using the protein local optimization program. Proteins-Structure Function and Bioinformatics, 2006. 65(2): p. 438-452. Felts, A.K., et al., Prediction of protein loop conformations using the AGBNP implicit solvent model and torsion angle sampling. Journal of Chemical Theory and Computation, 2008. 4(5): p. 855-868. Danielson, M.L. and M.A. Lill, Predicting flexible loop regions that interact with ligands: The challenge of accurate scoring. Proteins-Structure Function and Bioinformatics, 2012. 80(1): p. 246-260. Fogolari, F. and S.C.E. Tosatto, Application of MM/PBSA colony free energy to loop decoy discrimination: Toward correlation between energy and root mean square deviation. Protein Science, 2005. 14(4): p. 889-901. Sellers, B.D., et al., Toward better refinement of comparative models: Predicting loops in inexact environments. Proteins-Structure Function and Bioinformatics, 2008. 72(3): p. 959-971. Zhou, H.Y. and Y.Q. Zhou, Distance-scaled, finite ideal-gas reference state improves structure-derived potentials of mean force for structure selection and stability prediction. Protein Science, 2002. 11(11): p. 2714-2726. Yang, Y.D. and Y.Q. Zhou, Specific interactions for ab initio folding of protein terminal regions with secondary structures. Proteins-Structure Function and Bioinformatics, 2008. 72(2): p. 793-803. Park, J. and K. Saitou, ROTAS: a rotamer-dependent, atomic statistical potential for assessment and prediction of protein structures. Bmc Bioinformatics, 2014. 15. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 153 Shen, M.Y. and A. Sali, Statistical potential for assessment and prediction of protein structures. Protein Science, 2006. 15(11): p. 2507-2524. Rata, I.A., Y.H. Li, and E. Jakobsson, Backbone Statistical Potential from Local Sequence-Structure Interactions in Protein Loops. Journal of Physical Chemistry B, 2010. 114(5): p. 1859-1869. Liang, S.D., et al., Protein Loop Modeling with Optimized Backbone Potential Functions. Journal of Chemical Theory and Computation, 2012. 8(5): p. 1820-1827. Galaktionov, S., G.V. Nikiforovich, and G.R. Marshall, Ab initio modeling of small, medium, and large loops in proteins. Biopolymers, 2001. 60(2): p. 153-168. Burke, D.F. and C.M. Deane, Improved protein loop prediction from sequence alone. Protein Engineering, 2001. 14(7): p. 473-478. Fiser, A., R.K.G. Do, and A. Sali, Modeling of loops in protein structures. Protein Science, 2000. 9(9): p. 1753-1773. Schueler-Furman, O., et al., Progress in modeling of protein structures and interactions. Science, 2005. 310(5748): p. 638-42. 59. Mandell, D.J., E.A. Coutsias, and T. Kortemme, Sub-angstrom accuracy in protein loop reconstruction by robotics-inspired conformational sampling. Nature Methods, 2009. 6(8): p. 551-552. Rohl, C.A., et al., Modeling structurally variable regions in homologous proteins with rosetta. Proteins-Structure Function and Bioinformatics, 2004. 55(3): p. 656-677. Simons, K.T., et al., Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins-Structure Function and Bioinformatics, 1999: p. 171-176. Xiang, Z.X., C.S. Soto, and B. Honig, Evaluating conformational free energies: The colony energy and its application to the problem of loop prediction. Proceedings of the National Academy of Sciences of the United States of America, 2002. 99(11): p. 7432- 7437. 52. 53. 54. 55. 56. 57. 58. 60. 61. 62. 63. Jamroz, M. and A. Kolinski, Modeling of loops in proteins: a multi-method approach. Bmc Structural Biology, 2010. 10. 64. Mas, M.T., et al., Modeling the Anti-Cea Antibody Combining Site by Homology and Conformational Search. Proteins-Structure Function and Genetics, 1992. 14(4): p. 483- 498. 154 66. 67. 68. 69. 70. 71. 72. 73. 74. 75. 76. 77. 78. 65. Martin, A.C.R., J.C. Cheetham, and A.R. Rees, Modeling Antibody Hypervariable Loops - a Combined Algorithm. Proceedings of the National Academy of Sciences of the United States of America, 1989. 86(23): p. 9268-9272. vanVlijmen, H.W.T. and M. Karplus, PDB-based protein loop prediction: Parameters for selection and methods for optimization. Journal of Molecular Biology, 1997. 267(4): p. 975-1001. Deane, C.M. and T.L. Blundell, CODA: A combined algorithm for predicting the structurally variable regions of protein models. Protein Science, 2001. 10(3): p. 599-612. Sudarsanam, S., et al., Modeling Protein Loops Using a Phi-I+1, Psi-I Dimer Database. Protein Science, 1995. 4(7): p. 1412-1420. Fernandez-Fuentes, N., J. Zhai, and A. Fiser, ArchPRED: a template based loop structure prediction server. Nucleic Acids Research, 2006. 34: p. W173-W176. Lyskov, S., et al., Serverification of Molecular Modeling Applications: The Rosetta Online Server That Includes Everyone (ROSIE). Plos One, 2013. 8(5). Hildebrand, P.W., et al., SuperLooper-a prediction server for the modeling of loops in globular and membrane proteins. Nucleic Acids Research, 2009. 37: p. W571-W574. Chys, P. and P. Chacon, Random Coordinate Descent with Spinor-matrices and Geometric Filters for Efficient Loop Closure. Journal of Chemical Theory and Computation, 2013. 9(3): p. 1821-1829. Lopez-Blanco, J.R., et al., RCD plus : Fast loop modeling server. Nucleic Acids Research, 2016. 44(W1): p. W395-W400. Choi, Y. and C.M. Deane, FREAD revisited: Accurate loop structure prediction using a database search algorithm. Proteins-Structure Function and Bioinformatics, 2010. 78(6): p. 1431-1440. Park, H., et al., Protein Loop Modeling Using a New Hybrid Energy Function and Its Application to Modeling in Inaccurate Structural Environments. Plos One, 2014. 9(11). Ko, J., et al., The FALC-Loop web server for protein loop modeling. Nucleic Acids Research, 2011. 39: p. W210-W214. Fiser, A. and A. Sali, ModLoop: automated modeling of loops in protein structures. Bioinformatics, 2003. 19(18): p. 2500-2501. Li, Y.H., I. Rata, and E. Jakobsson, Sampling Multiple Scoring Functions Can Improve Protein Loop Structure Prediction Accuracy. Journal of Chemical Information and Modeling, 2011. 51(7): p. 1656-1666. 155 Lee, A., I. Streinu, and O. Brock, A methodology for efficiently sampling the conformation space of molecular structures. Physical Biology, 2005. 2(4): p. S108-S115. 83. Mamonova, T., et al., Protein flexibility using constraints from molecular dynamics simulations. Physical Biology, 2005. 2(4): p. S137-S147. 79. 80. 81. 82. 84. 85. 86. 87. 88. 89. 90. 91. Liu, P., et al., A Self-Organizing Algorithm for Modeling Protein Loops. Plos Computational Biology, 2009. 5(8). Shehu, A., C. Clementi, and L.E. Kavraki, Modeling protein conformational ensembles: From missing loops to equilibrium fluctuations. Proteins-Structure Function and Bioinformatics, 2006. 65(1): p. 164-179. Cortes, J., et al., A path planning approach for computing large-amplitude motions of flexible molecules. Bioinformatics, 2005. 21: p. I116-I125. Thorpe, M.F. and M. Lei, Macromolecular flexibility. Philosophical Magazine, 2004. 84(13-16): p. 1323-1331. Amato, N.M., K.A. Dill, and G. Song, Using motion planning to map protein folding landscapes and analyze folding kinetics of known native structures. Journal of Computational Biology, 2003. 10(3-4): p. 239-255. Apaydin, M.S., et al., Stochastic roadmap simulation: An efficient representation and algorithm for analyzing molecular motion. Journal of Computational Biology, 2003. 10(3-4): p. 257-281. Lavalle, S.M., et al., A randomized kinematics-based approach to pharmacophore- constrained conformational search and database screening. Journal of Computational Chemistry, 2000. 21(9): p. 731-747. Abagyan, R. and M. Totrov, Biased Probability Monte-Carlo Conformational Searches and Electrostatic Calculations for Peptides and Proteins. Journal of Molecular Biology, 1994. 235(3): p. 983-1002. Abagyan, R.A. and M.M. Totrov, Biased probability Monte Carlo as a powerful global energy optimization method for biomolecular structure prediction. Abstracts of Papers of the American Chemical Society, 1996. 211: p. 35-COMP. Zhu, K. and T. Day, Ab initio structure prediction of the antibody hypervariable H3 loop. Proteins-Structure Function and Bioinformatics, 2013. 81(6): p. 1081-1089. Bruccoleri, R.E. and M. Karplus, Prediction of the Folding of Short Polypeptide Segments by Uniform Conformational Sampling. Biopolymers, 1987. 26(1): p. 137-168. 156 Shenkin, P.S., et al., Predicting Antibody Hypervariable Loop Conformation .1. Ensembles of Random Conformations for Ring-Like Structures. Biopolymers, 1987. 26(12): p. 2053-2085. Unger, R., The genetic algorithm approach to protein structure prediction. Applications of Evolutionary Computation in Chemistry, 2004. 110: p. 153-175. Ring, C.S. and F.E. Cohen, Conformational Sampling of Loop Structures Using Genetic Algorithms. Israel Journal of Chemistry, 1994. 34(2): p. 245-252. Collura, V., J. Higo, and J. Garnier, Modeling of Protein Loops by Simulated Annealing. Protein Science, 1993. 2(9): p. 1502-1510. Olson, M.A., M. Feig, and C.L. Brooks, Prediction of protein loop conformations using multiscale Modeling methods with physical energy scoring functions. Journal of Computational Chemistry, 2008. 29(5): p. 820-831. 97. Wu, M.G. and M.W. Deem, Efficient Monte Carlo methods for cyclic peptides. Molecular Physics, 1999. 97(4): p. 559-580. Shukla, D., et al., Activation pathway of Src kinase reveals intermediate states as targets for drug design. Nature Communications, 2014. 5. 92. 93. 94. 95. 96. 98. 99. Raval, A., et al., Refinement of protein structure homology models via long, all-atom molecular dynamics simulations. Proteins-Structure Function and Bioinformatics, 2012. 80(8): p. 2071-2079. 100. Coutsias, E.A., et al., A kinematic view of loop closure. Journal of Computational Chemistry, 2004. 25(4): p. 510-528. 101. Dinner, A.R., Local deformations of polymers with nonplanar rigid main-chain internal coordinates. Journal of Computational Chemistry, 2000. 21(13): p. 1132-1144. 102. Hayward, S. and A. Kitao, Monte Carlo Sampling with Linear Inverse Kinematics for Simulation of Protein Flexible Regions. Journal of Chemical Theory and Computation, 2015. 11(8): p. 3895-3905. 103. Hoffmann, D. and E.W. Knapp, Polypeptide folding with off-lattice Monte Carlo dynamics: The method. European Biophysics Journal with Biophysics Letters, 1996. 24(6): p. 387-403. 104. Nilmeier, J., et al., Assessing Protein Loop Flexibility by Hierarchical Monte Carlo Sampling. Journal of Chemical Theory and Computation, 2011. 7(5): p. 1564-1574. 105. Kaufmann, K.W., et al., Practically Useful: What the ROSETTA Protein Modeling Suite Can Do for You. Biochemistry, 2010. 49(14): p. 2987-2998. 157 106. Friesner, R.A., et al., Computational methods for high resolution prediction and refinement of protein structures. Current Opinion in Structural Biology, 2013. 23(2): p. 177-184. 107. Li, Y., Conformational sampling in template-free protein loop structure modeling: an overview. Comput Struct Biotechnol J, 2013. 5: p. e201302003. 108. Shehu, A. and L.E. Kavraki, Modeling Structures and Motions of Loops in Protein Molecules. Entropy, 2012. 14(2): p. 252-290. 109. Soto, C.S., et al., Loop modeling: Sampling, filtering, and scoring. Proteins-Structure Function and Bioinformatics, 2008. 70(3): p. 834-843. 110. Vihinen, M., Relationship of Protein Flexibility to Thermostability. Protein Engineering, 1987. 1(6): p. 477-480. 111. Karplus, P.A. and G.E. Schulz, Prediction of Chain Flexibility in Proteins - a Tool for the Selection of Peptide Antigens. Naturwissenschaften, 1985. 72(4): p. 212-213. 112. Radivojac, P., et al., Protein flexibility and intrinsic disorder. Protein Science, 2004. 13(1): p. 71-80. 113. Morris, P., et al., Impact of cofactor-binding loop mutations on thermotolerance and activity of E-coli transketolase. Enzyme and Microbial Technology, 2016. 89: p. 85-91. 114. Boehr, D.D., R. Nussinov, and P.E. Wright, The role of dynamic conformational ensembles in biomolecular recognition. Nature Chemical Biology, 2009. 5(11): p. 789- 796. 115. Kumar, S., et al., Folding and binding cascades: Dynamic landscapes and population shifts. Protein Science, 2000. 9(1): p. 10-19. 116. Ma, B.Y., et al., Multiple diverse ligands binding at a single protein site: A matter of pre- existing populations. Protein Science, 2002. 11(2): p. 184-197. 117. Tsai, C.J., et al., Folding funnels, binding funnels, and protein function. Protein Science, 1999. 8(6): p. 1181-1190. 118. Monod, J., J. Wyman, and J.P. Changeux, On the Nature of Allosteric Transitions: A Plausible Model. J Mol Biol, 1965. 12: p. 88-118. 119. Zheng, Z., M.N. Ucisik, and K.M. Merz, The Movable Type Method Applied to Protein- Ligand Binding. Journal of Chemical Theory and Computation, 2013. 9(12): p. 5526- 5538. 158 120. Chu, V., et al., Thermodynamic and structural consequences of flexible loop deletion by circular permutation in the streptavidin-biotin system. Protein Science, 1998. 7(4): p. 848-859. 121. Freitag, S., et al., Structural studies of the streptavidin binding loop. Protein Science, 1997. 6(6): p. 1157-1166. 122. Le Trong, I., et al., Streptavidin and its biotin complex at atomic resolution. Acta Crystallographica Section D-Biological Crystallography, 2011. 67: p. 813-821. 123. Stayton, P.S., et al., Streptavidin-biotin binding energetics. Biomolecular Engineering, 1999. 16(1-4): p. 39-44. 124. Weber, P.C., et al., Structural Origins of High-Affinity Biotin Binding to Streptavidin. Science, 1989. 243(4887): p. 85-88. 125. Izrailev, S., et al., Molecular dynamics study of unbinding of the avidin-biotin complex. Biophysical Journal, 1997. 72(4): p. 1568-1581. 126. General, I.J. and H. Meirovitch, Relative stability of the open and closed conformations of the active site loop of streptavidin. Journal of Chemical Physics, 2011. 134(2). 127. Miyamoto, S. and P.A. Kollman, Absolute and Relative Binding Free-Energy Calculations of the Interaction of Biotin and Its Analogs with Streptavidin Using Molecular-Dynamics Free-Energy Perturbation Approaches. Proteins-Structure Function and Genetics, 1993. 16(3): p. 226-245. 128. O'Sullivan, V.J., et al., Development of a Tetrameric Streptavidin Mutein with Reversible Biotin Binding Capability: Engineering a Mobile Loop as an Exit Door for Biotin. Plos One, 2012. 7(4). 129. Song, J.N., et al., Functional Loop Dynamics of the Streptavidin-Biotin Complex. Scientific Reports, 2015. 5. 130. Hendrickson, W.A., et al., Crystal-Structure of Core Streptavidin Determined from Multiwavelength Anomalous Diffraction of Synchrotron Radiation. Proceedings of the National Academy of Sciences of the United States of America, 1989. 86(7): p. 2190- 2194. 131. Wu, S.C. and S.L. Wong, Engineering soluble monomeric streptavidin with reversible biotin binding capability. Journal of Biological Chemistry, 2005. 280(24): p. 23225- 23231. 132. Roe, D.R. and T.E. Cheatham, PTRAJ and CPPTRAJ: Software for Processing and Analysis of Molecular Dynamics Trajectory Data. Journal of Chemical Theory and Computation, 2013. 9(7): p. 3084-3095. 159 133. Humphrey, W., A. Dalke, and K. Schulten, VMD: Visual molecular dynamics. Journal of Molecular Graphics & Modelling, 1996. 14(1): p. 33-38. 134. Lindorff-Larsen, K., et al., Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins-Structure Function and Bioinformatics, 2010. 78(8): p. 1950- 1958. 135. Maier, J.A., et al., ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J Chem Theory Comput, 2015. 11(8): p. 3696-713. 136. Kumar, S., et al., The Weighted Histogram Analysis Method for Free-Energy Calculations on Biomolecules .1. The Method. Journal of Computational Chemistry, 1992. 13(8): p. 1011-1021. 137. Grossfield, A., WHAM:the weighted histogram analysis method: http://membrane.urmc.rochester.edu/content/wham. 138. Chakravorty, D.K., et al., Metal Ion Capture Mechanism of a Copper Metallochaperone. Biochemistry, 2016. 55(3): p. 501-509. 139. Howarth, M., et al., A monovalent streptavidin with a single femtomolar biotin binding site. Nature Methods, 2006. 3(4): p. 267-273. 140. Chaiet, L. and F.J. Wolf, The Properties of Streptavidin, a Biotin-Binding Protein Produced by Streptomycetes. Arch Biochem Biophys, 1964. 106: p. 1-5. 141. Weber, P.C., et al., Crystallographic and Thermodynamic Comparison of Natural and Synthetic Ligands Bound to Streptavidin. Journal of the American Chemical Society, 1992. 114(9): p. 3197-3200. 142. Le Trong, I., et al., Structural consequences of cutting a binding loop: two circularly permuted variants of streptavidin. Acta Crystallographica Section D-Biological Crystallography, 2013. 69: p. 968-977. 143. Dyson, H.J. and P.E. Wright, Intrinsically unstructured proteins and their functions. Nature Reviews Molecular Cell Biology, 2005. 6(3): p. 197-208. 144. van der Lee, R., et al., Classification of Intrinsically Disordered Regions and Proteins. Chemical Reviews, 2014. 114(13): p. 6589-6631. 145. Wright, P.E. and H.J. Dyson, Intrinsically disordered proteins in cellular signalling and regulation. Nature Reviews Molecular Cell Biology, 2015. 16(1): p. 18-29. 146. D. A. Case, V.B., J. T. Berryman, et al., Amber 14. 2014. 160 147. Darden, T., D. York, and L. Pedersen, Particle Mesh Ewald - an N.Log(N) Method for Ewald Sums in Large Systems. Journal of Chemical Physics, 1993. 98(12): p. 10089- 10092. 148. Essmann, U., et al., A Smooth Particle Mesh Ewald Method. Journal of Chemical Physics, 1995. 103(19): p. 8577-8593. 149. Toukmaji, A., et al., Efficient particle-mesh Ewald based approach to fixed and induced dipolar interactions. Journal of Chemical Physics, 2000. 113(24): p. 10913-10927. Jorgensen, W.L., et al., Comparison of Simple Potential Functions for Simulating Liquid Water. Journal of Chemical Physics, 1983. 79(2): p. 926-935. 151. Li, P.F., L.F. Song, and K.M. Merz, Parameterization of Highly Charged Metal Ions Using the 12-6-4 LJ-Type Nonbonded Model in Explicit Water. Journal of Physical Chemistry B, 2015. 119(3): p. 883-895. 152. Wang, J.M., et al., Development and testing of a general amber force field. Journal of Computational Chemistry, 2004. 25(9): p. 1157-1174. 153. Bour, P. and T.A. Keiderling, Ab-Initio Simulations of the Vibrational Circular- Dichroism of Coupled Peptides. Journal of the American Chemical Society, 1993. 115(21): p. 9602-9607. 150. 161 CHAPTER 5 Conclusions and Future Outlook 162 The objective of this dissertation is to contribute to the field of structure based drug design by introduction of a novel conformational sampling method for receptors in the binding pocket region. Proteins are inherently plastic and undergo a variety of motions that range from ultrafast vibrations to long-range backbone motions. In order to better understand the protein- ligand binding interactions, it is crucial to capture the flexibility of proteins in our current existing computational models. However, incorporating it into even the smallest region, for example, the binding site of a protein, is computationally demanding. Even if this task can be accomplished there is a risk of running into false positives due to the enormous conformational space involved. Crystal structure represents only one of the conformations from the ensemble and can be hardly considered as a good substitute for an entire conformational space available to proteins. The first chapter of this thesis addresses computational approaches that have been implemented to incorporate receptor flexibility into the ligand-binding domain of a protein. The introduction is not limited to receptor flexibility methods used in molecular docking studies. Prospective validation of several docking and other comprehensive tools has also been discussed largely within the scope of blind challenges conducted by the D3R and CSAR organizations. Based on our analysis to date, we conclude that the community is moving forward by fine-tuning several computational approaches but that the statistical uncertainties in the sampling and scoring accuracy still remain significant. In this regard, we have developed a method that can introduce receptor flexibility in the binding pocket region of the protein in a novel fashion. Chapter 2 entails the algorithm and the method development section. The algorithm is quite simple and can be succinctly described here. We treat a molecule at an atom pair level and use a distance based coordinate system. Each 163 selected distance is associated with pair potential look up tables for each atom pairwise interaction. These look up tables help us in generating structures on an energy scale in a very efficient manner. There are two main motions that occur in the protein-ligand binding motions- side chains and backbone motions. We included receptor flexibility in terms of both. Once the conformational ensemble is generated on an energy scale, the free energies are calculated by using Movable Type free energy method[1]. We initially applied this strategy to treat receptor flexibility in terms of side chains, which we are calling MTflex[2]. MTflex was successfully applied on a set of 159 protein-ligand systems derived from core PDBBINDv2014 dataset after excluding proteins with metal ions in the binding pocket area[3, 4]. This application is discussed in details in Chapter 3 of this dissertation. The conformational states generated by MTflex were employed in subsequent docking and scoring exercise. We performed Glide docking on the generated conformational ensemble and also on the crystal structure. It was observed that Glide docking generated better binding modes in terms of both structural RMSDs (Å) and scores when docking was performed by using MTflex generated conformations as compared to docking in crystal structure. Apart from Glide scoring, we also used our in-house Movable type free energy method and validated that the binding affinity showed better correlation with respect to experimental results when we employed MTflex generated ensemble as compared to the score of using crystal docked structures. Overall, we showed that by including side chain flexible multiple receptor structures, docking and binding affinity measures improve as compared to including only single structure. After including side chain receptor flexibility, we next applied our method to include backbone flexibility as well. The extension to include backbone flexibility was quite straightforward, yet computationally expensive. We are calling this strategy, MTFlex-b. To take 164 care of the computational expense, we used Jacobsen’s strategy of generating loops from both the terminals and meeting in the middle[5]. We successfully applied our strategy to study the loop3-4 of the Streptavidin-Biotin system. This highly mobile, eight residue long loop undergoes upon to close transition in going from Biotin un-bound (apo) to bound (holo) state[6-9]. It has been discussed in Chapter 4 of this thesis. We generated a huge conformational ensemble of the loop3-4 region using MTFlex-b with over ~11 million conformations. The ensemble contained both closed-like and open-like loop conformations with a best structural RMSD of ~1.6 Å with respect to experiment. We obtained a free energy surface (MT-FES) for both the apo and holo states and observed that the open loop is ~10.5 kcal/mol more stable in the apo state. In the holo state, the trend was reversed and it was observed that the closed loop is ~12 kcal/mol more stable than the open loop. We observed that the stability of the open loop in the apo state is attributed to solvation free energy, while in the holo state, the protein-ligand contacts are much more favorable towards the closed state, thereby compensating for the solvation free energy. We validated our MT-FES by performing MD-PMF simulations using FF99SBILDN and FF14SB force fields. A good correlation between MT-FES and MD-PMF using FF14SB simulations was observed. We also generated a thermodynamic free energy cycle and obtained a binding free energy change of –16.2 kcal/mol, which is in reasonable agreement with the experimental binding affinity of –18.3 kcal/mol. With the help of fast and efficient thermodynamic cycle and free energy surfaces, we hope that we are introducing a tool that will hugely impact the field of structure based drug design. The ultimate goal of our study is to not only generate a conformational ensemble but to employ it in understanding the deeper knowledge of protein-ligand interactions. We hope to expand our ability to accurately predict free energies, which will hugely impact the growth of 165 structure based drug design[10-12]. This understanding of side chain and flap motions will help us in understanding its affects on substrate or ligand binding, which will help in the advancement of structure-based drug design efforts. This strategy can be extended to systems like Kinases, ureases, HSP90, HIV protease, etc. This understanding will help the community in designing better inhibitors that have desired functions. 166 REFERENCES 167 REFERENCES Zheng, Z., M.N. Ucisik, and K.M. Merz, The Movable Type Method Applied to Protein- 1. Ligand Binding. Journal of Chemical Theory and Computation, 2013. 9(12): p. 5526-5538. 2. Bansal, N., Z. Zheng, and K.M. Merz, Incorporation of side chain flexibility into protein binding pockets using MTflex. Bioorganic & Medicinal Chemistry, 2016. 24(20): p. 4978-4987. 3. 4. 5. 6. 7. 8. 9. 10. 12. Wang, R., et al., The PDBbind database: collection of binding affinities for protein- ligand complexes with known three-dimensional structures. J Med Chem, 2004. 47(12): p. 2977-80. Wang, R., et al., The PDBbind database: methodologies and updates. J Med Chem, 2005. 48(12): p. 4111-9. Jacobson, M.P., et al., A hierarchical approach to all-atom protein loop prediction. Proteins-Structure Function and Bioinformatics, 2004. 55(2): p. 351-367. Chu, V., et al., Thermodynamic and structural consequences of flexible loop deletion by circular permutation in the streptavidin-biotin system. Protein Science, 1998. 7(4): p. 848-859. Freitag, S., et al., Structural studies of the streptavidin binding loop. Protein Science, 1997. 6(6): p. 1157-1166. Le Trong, I., et al., Streptavidin and its biotin complex at atomic resolution. Acta Crystallographica Section D-Biological Crystallography, 2011. 67: p. 813-821. Stayton, P.S., et al., Streptavidin-biotin binding energetics. Biomolecular Engineering, 1999. 16(1-4): p. 39-44. Gilson, M.K. and H.X. Zhou, Calculation of protein-ligand binding affinities. Annual Review of Biophysics and Biomolecular Structure, 2007. 36: p. 21-42. 11. Michel, J. and J.W. Essex, Prediction of protein-ligand binding affinity by free energy simulations: assumptions, pitfalls and expectations. Journal of Computer-Aided Molecular Design, 2010. 24(8): p. 639-658. Shirts, M.R., D.L. Mobley, and J.D. Chodera, Alchemical Free Energy Calculations: Ready for Prime Time? Annual Reports in Computational Chemistry, Vol 3, 2007. 3: p. 41-59. 168