. u... ”mm :m .11: I .x: {.1 :5: .95 V . 69$ , T :J u s. I 1 . urxdxa J a .. a u» a. .fi ... Ltd a. “an“. 3. 1 .6. {9.3.07 3%”) 3.19:. ti!!!- 3 . Jpn "gasp-qty"; Jame-eta. I c 4..."... V . . .. , . H V $23 m . ‘ _ .. . ‘ ‘ . . . . . ‘ . . . . ,aéufim ggfifig .fiafiégaw? is: . flawlhwau.‘ .u 7%! 7355’ LIBRARY Michigan State University This is to certify that the thesis entitled OPTIMIZING SIDE-CHAIN INTERACTIONS IN PROTEIN-LIGAND INTERFACES presented by SAMEER ARORA has been accepted towards fulfillment of the requirements for the MS. degree in Computer Science 5/44, Major Professor’s Signature 2/ f/ of Date MSU is an Aflirmative Action/Equal Opportunity Institution PlACE lN RETURN BOX to remove this checkout from your record. To AVOID FINES return on or before date due. MAY BE RECALLED with earlier due date if requested. DATE DUE DATE DUE DATE DUE 2/05 chlRC/DatoDqundd—pts OPTIMIZING SIDE-CHAIN INTERACTIONS IN PROTEIN-LIGAND INTERFACES By Sameer Arara A THESIS Submitted to Michigan State University in partial fulfillment of the requirements for the degree of MASTER OF SCIENCE Department of Computer Science 2005 ABSTRACT OPTIMIZING SIDE-CHAIN INTERACTIONS IN PROTEIN-LIGAND INTERFACES By Sameer Arora Proteins bind to other proteins or small molecules to perform essential cellular functions. Protein side-chain flexibility is crucial for binding and molecular recog- nition. Hence modeling side-chain flexibility in protein-ligand docking algorithms to predict the optimal inter- and intra-molecular interactions is extremely desirable. However, modeling side-chain flexibility in docking and screening is computationally expensive due to the numerous side chains and their many degrees of freedom. Our research indicates that direct, intra-protein hydrogen bonds and hydrophobic interactions are preserved to a significant extent upon ligand binding. This provides guidance to restrict the number of side-chain candidates for conformational sam- pling in docking. While these bond-preservation tendencies limit side-chain move- ments, large side-chain motions are also observed in the protein-ligand interface. Subsequently, the extent of these large side-chain motions from ligand-free to ligand- bound crystal-structure conformations are characterized, as is the suitability of using backbone-dependent rotamers for sampling these larger motions. The ability to accurately identify which side chains move significantly upon ligand binding as well as their optimal conformations is crucial for docking and screening. A new scoring function, having good linear correlation with experimentally determined protein-ligand binding affinities, is presented for scoring dockings and side-chain in- teractions by SLIDE. Using the new scoring function as a cost measure, a mean-field based algorithm, exploiting rotamer-based side-chain flexibility modeling, is proposed and tested for optimizing interactions in protein-ligand interfaces. Copyright by Sameer Arora 2005 To my Mummyji and Papajz'. It is your endless love, sacrifices, committment and support, that I am here, forever indebted. iv ACKNOWLEDGEMENTS First, I would like to express my deep gratitude for my advisor, Dr. Leslie A. Kuhn, for all hers support, guidance and encouragement throughout my graduate education. Her boundless energy and infectious enthusiasm for science helped me embark and persist on this research work. I am grateful for her mentorship, and feel lucky to have been associated with such a person, whose optimism, patience and dynamism continue to set high standards for me. I want to thank my co-advisor, Dr. Bill Punch. His advice and guidance strength- ened my decision and faith in pursuing this interdisciplinary research. And his support and patience helped me continue forward with a calm and positive mind. I extend my thanks to Dr. Phil Duxbury. My discussions with him always left me with valuable insight and a physicist’s perspective on our research problem. I would also like to thank the past and present members of the Kuhn lab, Litian He, Chetan Sukuru, Rajesh Korde, Dr. Brandon Hespenheide, and Sandeep Nami- likonda for creating an enjoyable and stimulating research environment. I am espe- cially thankful to Dr. Maria Zévodszky, who helped me grow, besides teaching me many things over the years, including Biochemistry and SLIDE. She was always ready to help, discuss ideas and answer countless (quick) questions. Finally, I would like to thank my friends (especially Kantha Kumar) and family, without whose support and encouragement, it would be hard to imagine myself here. Table of Contents List of Tables viii List of Algorithms viii List of Figures viii 1 Flexibility in Protein-Ligand Docking 1 1.1 Introduction: Protein-Ligand Docking .................. 1 1.2 Overview of Protein-Structure Prediction Methods ........... 2 1.3 Overview of Current Side-Chain Modeling Approaches in Docking Tools 7 1.4 Side-Chain Modeling in SLIDE ..................... 10 1.5 Motivation and Focus in this Thesis Work ............... 11 1.6 Organization of the Thesis ........................ 13 2 Hydrogen Bond Preservation in Protein Binding Sites 15 2.1 Motivation ................................. 15 2.2 Methods .................................. 16 2.3 Results and Discussion .......................... 22 2.4 Conclusions ................................ 27 3 Sampling Side-Chain Positions in Protein-Ligand Interfaces in SLIDE 29 3.1 Side-Chain Displacements upon Ligand Binding ............ 30 3.2 Current Rotation Paradigm in SLIDE .................. 32 3.2.1 Motions Modeling in SLIDE ................... 39 3.3 Employing Hydrogen-Bond Preservation Bias in Mean-Field Optimiza- tion .................................... 43 3.4 Sampling Large Side-Chain Motions ................... 45 3.4.1 Rotamer Libraries ......................... 49 3.5 Conclusions ................................ 56 4 Scoring SLIDE Dockings 58 4.1 Introduction ................................ 58 4.2 SLIDE Scoring Function ......................... 60 4.3 Methods .................................. 61 4.3.1 Calculation of Terms for Scoring Function ........... 62 4.3.2 Preparation of Test Set ...................... 66 4.3.3 Training and Testing ....................... 67 4.4 Results ................................... 69 4.5 Conclusions ................................ 73 vi 5 Guiding Sampling by Score 74 5.1 Sampling Choices for Maximizing Score ................. 75 5.2 Methods .................................. 81 5.3 Results ................................... 82 5.4 Analysis .................................. 94 5.5 Conclusion ................................. 97 6 Summary and Future Directions 99 6.1 Future Directions ............................. 100 Bibliography 103 vii 2.1 2.2 3.1 4.2 4.1 4.3 4.4 List of Tables List of protein-structure pairs studied for hydrogen bond preservation upon ligand binding. ........................... Rules for categorization of atom types ................. Visual analysis of reasons behind large dihedral rotations in interfacial side chains ................................. Scoring function variants evaluated to predict binding affinities . . . . Potential terms for new scoring function for SLIDE .......... PDB codes of crystal complexes used for training and testing the scor- ing functions ................................ Predicted and experimental binding-affinity linear correlation coeffi- cients and average weights derived from linear multiple regression viii 50 61 63 68 70 List of Algorithms 1 Pseudo-code for rotamer sampling for unsatisfied side chains and mean- field optimization for maximizing score ................. 79 ix 1.1 1.2 2.1 2.2 2.3 2.4 2.5 2.6 3.1 3.2 3.3 3.4 3.5 3.6 3.7 3.8 3.9 List of Figures Example of a protein-ligand crystal complex .............. 3 Screening and docking algorithm implemented in SLIDE ....... 12 Geometric parameters used to identify hydrogen bonds and measure their energy ................................ 18 Percentages and displacements, by atom category, of preserved direct intra-protein hydrogen bonds in protein-ligand interfaces ....... 23 Percentages, by residue type, of preserved direct intra-protein hydrogen bonds in protein-ligand interfaces .................... 24 Percentages, by residue type, of preserved interfacial intra-protein hy- drogen bonds mediated by one water molecule ............. 25 Percentages, by residue type, of preserved interfacial intra-protein hy- drogen bonds mediated by two water molecules ............ 26 Percentages, by atom category, of preserved interfacial intrarprotein hydrophobic interactions ......................... 28 Frequency distribution of RMSD ligand-bound side-chain conforma- tions from ligand-free conformations ................... 31 Change in X1 and x2 dihedral angles of preserved interfacial all side chains within 4.0 A of the ligand, across 30 structures listed in Table 2.1. 33 Change in x3 and x4 dihedral angles of preserved interfacial all side chains within 4.0 A of the ligand, across 30 structures listed in Table 2.1. 34 Rotation angle choices for resolving an inter-atomic steric clash . . . 36 The algorithm for resolving side-chain collisions using the mean-field Optimization technique. When there are still collisions exceeding the threshold after 10 iterations of the outer loop, this ligand orientation is discarded ................................. 38 Frequency distributions of RMSD between ligand-free and ligand- bound side-chain conformation, as well as RMSD between ligand-free conformation and conformation in the best docking generated by SLIDE. 40 Comparison of side-chain RMSDs between ligand-free to ligand-bound conformation, and between ligand-free conformation and conformation in the top docking. ............................ 42 Comparison of the number of intra-protein hydrogen bonds before and after ligand binding in crystal complexes, in best dockings by SLIDE and the hydrogen-bond-preservationist version SLIDE. ........ 46 Comparison of the number of intra-protein hydrogen-bonds lost upon ligand binding in nature (between ligand-free and ligand-bound struc- tures), in best dockings by SLIDE, and in the best dockings by hydrogen-bond-preservationist version SLIDE. ............. 47 3.10 Comparison of the number of intra-protein hydrogen bonds gained upon ligand binding in crystal complexes, in the best dockings by SLIDE and by the hydrogen-bond—preservationist version of SLIDE. . 3.11 Illustration of bond-rotation angles associated with single bonds in a side chain. ................................. 3.12 Number of rotamers approximating dihedral angles of target confor- mations for 25 interfacial side chains undergoing large rotation upon ligand binding. .............................. 3.13 Cartesian space search for rotamers close to ligand-bound conforma- tions of selected side chains using Dunbrack rotamer library ..... 4.1 Correlation between experimentally known binding affinity and score values determined by SLIDE scoring function 61 ............ 4.2 Correlation between experimentally known binding affinity and score values determined by old SLIDE scoring function ........... 4.3 Correlation between affinity values known experimentally and those determined by DrugScore ......................... 5.1 Flowchart of algorithm used to guide selection of a rotamer from a pool of rotamers ................................ 5.2 Distribution of change in score across all dockings with triangle matches in common, generated by both old and new SLIDE versions . 5.3 Comparison of scores of best dockings by old and new SLIDE ver- sions for each protein, as well as scores evaluated for the corresponding crystal complexes ............................. 5.4 Comparison of the number of interfacial unsatisfied polar groups in the best dockings generated by old and new SLIDE versions ....... 5.5 Comparison of displacement and direction of side-chain motions per- formed for resolving protein-ligand collisions .............. 5.6 Side-chain motions performed in new SLIDE to optimize interactions in the same 24 complexes ......................... 5.7 Large natural side-chain motions undetected by old or new SLIDE versions .................................. 5.8 Selection or rejection reasons for each of the sampled rotamers for unsatisfied side chains in the best dockings ............... Images in this thesis/dissertation are presented in color. xi 48 52 54 57 69 71 72 80 84 86 87 89 90 96 Chapter 1 Flexibility in Protein-Ligand Docking 1.1 Introduction: Protein-Ligand Docking Like people interact and cooperate with one another to perform many functions for the sustenance and growth of life at the social level, proteins too interact and cooperate with each other as well as with other molecules for the sustenance and growth of life at the cellular level. Proteins perform various vital functions in a cell - they provide cellular structure, bind and transport other proteins or organic compounds, and catalyze or inhibit reactions. Underlying all stable bindings between a protein and another molecule lies the mechanism for the two molecules to recognize each other and achieve a bound state more stable than their individual unbound states. Understanding the mechanism of protein binding is key understanding their function as well providing valuable insights for discovering novel compounds that can bind to specific proteins for designing therapeutic drugs. Molecular docking is a term used to describe computational techniques that at- tempt to find the “best” mode binding between two molecules. Protein-ligand docking aims at finding the optimal binding between a protein and small molecule to a spe- cific site of the protein. In protein-ligand docking, the atomic structures of the two molecules are given as input in the most general form, no additional data is provided. However, in practice, additional biochemical information can be given, specifically, information about the location of the binding site. Docking methods can be categorized in multiple ways. “Rcdocking” attempts to reconstruct a complex using bound structures of the receptor and the ligand. More challenging is the “unbound” docking, which attempts to reconstruct a complex us- ing unbound structures of the receptor and the ligand. In this case, some degree of conformational change in the protein and the ligand must be modeled or accommo- dated. Another categorization may be based on flexibility: “rigid” docking keep the structures rigid during the docking, while flexible docking techniques allow flexibility in the receptor or ligand or both. Typically, there are three key ingredients in docking algorithms: 1. representation of the molecular system 2. conformational and orientational space search (“sampling”) 3. ranking of potential solutions (“scoring”) The present work focuses on modeling protein side-chain motion in docking, and the subsequent sections of this Introduction cover how flexibility has been modeled by others in a variety of protein modeling methods. 1.2 Overview of Protein-Structure Prediction Methods Over several years, various techniques have been developed to predict protein structures from their amino acid sequences. These prediction techniques attempt to Figure 1.1: This image is presented in color. Example of a protein-ligand crys- tal complex. Crystal structure of a-momorcharin complexed with formycin 5'- monophosphate. Green ribbon represents the protein backbone while the ligand is displayed with Connolly solvent-accessible molecular surface, colored by atom type. Only binding-site side chains are displayed (in tubes) with interfacial side-chains col- ored cyan and side chains beyond the interface colored by atom type. define the protein structure in the native state from the known sequence of amino acids, and in some cases aim to capture dynamics of processes like protein folding and protein-ligand or protein-protein docking. While each of these methods is used for structure prediction, they have been used in context of dockings as well. Molecular Dynamics (MD) Simulation is one such technique that serves as an im- portant tool in protein structural dynamics and refinement. It uses torsional degrees of freedom in proteins and ligands and key physical properties at the atomic level to solve Newton’s equations of motion. Forces on individual atoms are represented as sums of potential terms (e.g. electrostatic and van der Waals) during the entire simulation. Each simulation run is divided into time steps, which are typically scaled small enough (1-2 femtoseconds) to ensure that the physical interactions are modeled accurately. While smaller time steps, combined with an accurate force field, lead to a better quality of simulation results, the large number of steps required makes the sim- ulation extremely slow for docking and screening purposes, as each time step involves thousands of degrees of freedom in the molecule(s), requires evaluation of computa- tionally expensive potential energy terms, and must also account for the surrounding solvent. The computational intensity also limits the amount of conformational space sampled for both protein and ligands. To optimize side-chain placement in docking, sometimes less computationally demanding energy minimization is used, but this in- volves very local motions. Carlson and colleagues[2] have used MD to determine relatively rigid regions from multiple crystal structures and focus on these immobile regions as templates for drug design. Monte Carlo (MC) sampling is a stochastic sampling method applied to a num- ber of diverse domains, including economics, physics, traffic regulation and others. In molecular simulations, an MC simulation is often guided by simulated annealing. In simulated annealing, the “temperature” of the system is raised and then system is al- lowed to cool down gradually. Here, temperature implies any parameter of the system that controls the magnitude of sampled motions. Like high an input of energy helps electrons jump to higher energy orbital in atoms, high “temperature” in MC simu- lations helps torsional rotations overcome high energetic barriers. Once the systems has been allowed to sample a variety of states, temperature is dropped in a step-wise manner to allow the system to relax into a low energy state. This relaxation may be guided by a force field or could result from a random move. New conformations are accepted if the energy drops. the simulation stops when the temperature falls below a threshold. Since there is some randomness to the generation of the final molecular complex, and since there are usually many protein-ligand conformations with similar energy, an ensemble of many structures is usually calculated to observe which binding modes appear most often. This means that this computationally intensive program must be repeated many times. The Dead-End Elimination (DEE) approach relies on a potential energy function to evaluate discrete side—chain conformations. A DEE run searches for a combination of side-chain conformations that result in global energy minimum for the protein or protein-ligand complex[6]. Typically, a set of side-chains is identified for conforma- tional sampling. The rest of the atoms provide a representation of the immediate environment and additional constraints for the sampling. The DEE implementations typically use rotamer libraries to generate side-chain conformations. The relatively high efficiency of this relies on a mathematical expression, the DEE criterion, which rules out infeasible rotamers that cannot be part of the global minimum conforma- tions. One example is that if a certain x angle value of a particular side chain is observed to cause unresolvable collisions with neighboring residues, then this x value is ruled out in all future moves. This criterion helps reduce the number of rotamers to be considered for side chains in the modeling set, resulting in a number that is sufficiently small to analyze by means of “brute force” combinatorial analysis. The DEE approach is useful for homology modeling, where backbone structure is usually known from a related protein. For docking and screening purposes, DEE scales well as long as the number of side-chains sampled is small. Schaffer et a1 [40] have success- fully applied DEE for side-chain Optimization after docking HIV-1 protease mutant complexes using Monte Carlo/simulated annealing sampling. The latest version of Dunbrack’s SCRWL algorithm[1] for side-chain modeling uses results from graph theory to solve the combinatorial problem encountered in side-chain prediction. In this method, side chains are represented as vertices in an undirected graph. Any two residues that have rotamers with nonzero interaction energies are considered to have an edge in the graph. The resulting graph can be partitioned into connected subgraphs with no edges between them. These subgraphs can in turn be broken into biconnected components, which are graphs that cannot be disconnected by removal of a single vertex. The combinatorial problem is reduced to finding the minimum energy of these small biconnected components representing sets of mutually acceptable rotamers for different side chains and combining the results to identify the global minimum energy conformation. While SCRWL’s main usage is in homology modeling, ab initio protein structure prediction, and protein design applications, it can be used for side-chain modeling in the presence of ligands too. While such an approach may help develop induced fit, the ligand largely remains rigid in the process. In mean-field optimization methods[38, 26], the part of the protein that requires modeling, like the side chains, is oversampled by replacing single copies by multiple different copies. The copies of each side chain do not interact with each other but “see” the other side chains as an average. The multiple copies essentially correspond to a distribution function of one side chain, represented by a number of discrete structures with assigned, often equal, weights. Mean—field optimization has been used both for side-chain modeling in homology modeling [38] as well as modeling flexible side chains during docking in SLIDE, a ligand screening and docking software developed in our laboratory [42, 41]. Jackson et al[19] also attempt side-chain optimization in protein- protein interfaces through iterative cycles of mean-field optimization and rigid-body energy minimization. Rotamers from the rotamer library of Thffrey et al [46] are used for discrete sampling of side-chain conformational space while the main-chain atoms have fixed coordinates. The self-consistent mean field approach is used to determine the most probable set of rotamers from an ensemble of rotamers. In this closed system, the potential mean force on each rotamer is based on the internal energy of the rotamer itself, rotamer-backbone interactions, interactions between the rotamer and rotamers of other residues, and the rotamer’s interactions with the surrounding solvent. Each rotamer’s interactions are calculated prior to the Optimization, thus saving time during the mean-field optimization step. Raplacing a system of side chains by a multiple copies per side chain forming a mean-field system has two major advantages. First, while mean-field optimization complexity is lower than the exponential complexity of searching the entire conforma- tional space, the global energy minimum of the new mean-field system is the same[38]. Secondly, the barriers separating the minima in the mean-field calculation are lower than in the original system, making annealing easier. However, the result of opti- mization does depend on initial conditions, like the probability distributions for the multiple copies of the side chains. 1.3 Overview of Current Side-Chain Modeling Ap- proaches in Docking Tools Accounting for side-chain reorientation during docking is similar to predicting side-chain conformations in homology modeling. The search for candidate solutions in a docking problem is addressed by two essentially different approaches: (1) a gradual guided search through solution space, or (2) a full solution space search. The first either scans only part of the solution space in a partially random and partially criteria: guided manner. This approach consists mainly of Monte Carlo (MC), simulated annealing, molecular dynamics (MD), and evolutionary algorithms such as genetic algorithms (GA). In contrast, the second scans the entire solution space in a predefined systematic manner such as using rotamer-library or geometric-hashing based searches of conformations. The classical algorithm implemented for computational docking is that Of DOCK, the first ligand docking tool [7, 5, 9, 44]. DOCK operates by generating a set of spheres to describe the volume, or negative image, of the binding site and uses the centers Of these spheres as sites for matching to ligand atoms. Sets of receptor spheres are matched to sets of ligand atoms to generate a ligand orientation, which can then be scored according to their complementarity with the protein. The early internal DOCK scoring function, GRID [32], is a grid-based scoring function in the method of Goodford and colleagues [13]. Later implementations have used more robust scoring functions which are also grid-based. It is possible to use the ligand docking method of DOCK with an externally supplied scoring function. The initial implementation of DOCK used only steric fit and electrostatics as a determinant for ligand docking, but later versions implemented chemical type matching to better model chemical complementarity between ligand and receptor groups, including hydrogen bonding interactions. It should be noted that DOCK uses only rigid-body translations and rotations, including no internal molecular flexibility within the docking algorithm. Recently, Shoichet and colleagues have used conformational ensembles as input to DOCKBm. Another popular docking algorithm is AutoDock[33], which employs a Monte- Carlo simulated annealing method to sample binding orientations and ligand con- formations, by randomized rotation of torsional angles in the ligand. AutoDock is inexpensive, easy to use, achieves reasonable dockings, and has been applied success- fully to predict ligand binding modes in several cases[33]. Another stochastic approach based on the use of a genetic algorithms (GA) was developed by Jones and colleagues[21]. Their approach uses a simple GA operating on rotational angles in the protein, rotational angles in the ligand, and on hydro- gen bonds between protein and ligand. The fitness or scoring function encompasses terms for the hydrogen bond energy between protein and ligand, for the van der Waals energy between protein and ligand, and for the internal van der Waals energy Of contacts within the ligand. Improvements to this algorithm resulted in the de- velopment of the GOLD algorithm[22], with changes in the representation of angles and hydrogen bonds and inclusion of a more robust scoring function. The GOLD algorithm achieved “acceptable” dockings for 71 of 100 test cases. Similar to GA approaches is the evolutionary programming approach AGDOCK [11], developed at Agouron Pharmaceuticals, which is able to correctly dock (within 1.5 A of correct po- sition), methotrexate into dihydrofolate reductase (DHFR) and a proprietary ligand, AG-l343, into HIV protease. To address side-chain and limited main-chain flexibility of the receptor structure in docking, FlexE [3], was created. FlexE docks ligands into an ensemble of structures of the receptor instead of a single receptor binding site. The binding site ensembles can be from different crystallographic structures, as in [3], from a homology model with uncertain side-chain positions, from a series molecular dynamics time steps, or from another source. The key component of this algorithm is that multiple conformations of the protein can be used as a docking target simultaneously. In this algorithm, the receptor structures are merged, with regions of similar conformations reduced to a single structure and regions with dissimilar conformation constituting alternate positions. While this algorithm may handle some backbone movement in addition to side-chain rotations, the authors claim it is not able to work with large domain movements and limit their test set to protein ensembles with similar backbone traces. 1.4 Side-Chain Modeling in SLIDE The docking and screening software SLIDE (Screening for Ligands by Induced-fit Docking, Efficiently) was developed in our laboratory for screening and docking small molecules into specified binding sites of target proteins [43, 41, 42]. Figure 1.2 pictori- ally describes the algorithm used in SLIDE for screening and docking small molecules. Surfaces of both the protein and the molecule being screened are represented by sets of strategically located interaction points, each point representing the binding site’s or the ligand’s polar atom or hydrophobic characteristics. The protein’s interaction points are called template points, while the potential ligand’s interaction points are called interaction centers. Each of these points is assigned one of the descriptor types : hydrogen-bond donor if the protein or ligand atom can donate a proton in an hy- drogen bond, acceptor if the atom can accept a hydrogen bond, donor/acceptor if the atom can donate and/or accept a hydrogen bond, and hydrophobic if the atom or point represents a hydrophobic site in the ligand or protein. Unlike the potential ligand’s interaction-centers, template points represent the binding-site’s “negative” image, or ideal ligand atom types to bind at the position. Screening and docking is based on matching triplets of ligand’s interaction centers onto triangles Of template points based on geometry as well as descriptor types. Since there can be 0(n!) tem- plate triangles from n template points, SLIDE uses a multi-level geometric hashing for matching interaction-point triangles to pre-computed template triangles stored in the hash table, indexed using the triangle’s descriptor types and geometry. Geomet- ric hashing was originally applied in Object matching and identification in computer vision [49]. Combined with an adequate molecular surface representation, geometric hashing yields a state-of-the-art toolkit for docking[10, 35]. This indexing empowers SLIDE to exhaustively explore orientational space for the ligand with respect to the protein speedily and efficiently. The part of the ligand within the triangle of inter- action centers is treated as rigid, and called an anchor fragment. It serves as the 10 anchor for docking the ligand into the binding site. Any remaining fragments of the ligand are considered flexible. Single bonds in the flexible parts of both the protein and ligand candidates are rotated as needed to remove inter-atomic collisions and to generate a shape-complementary interface, before the complex is scored by the number of intermolecular hydrogen bonds and the hydrophobic complementarity of the contact surfaces. SLIDE was the first method to balance protein and ligand flexibility in dock- ing while developing induced fit between the protein and ligand surfaces during the docking step. It has identified and correctly docked diverse, known ligands into the ligand-free conformation of the binding site for a variety of proteins, e.g., subtilisin, cyclodextrin glycosyltransferase, uracil DNA glycosylase, rhizopuspepsin, HIV pro- tease, estrogen receptor, and Asn tRNA synthetase [43, 41, 42]. 1.5 Motivation and Focus in this Thesis Work Various groups have studied protein side-chain flexibility modeling in docking. Docking techniques using MD or MC simulations can model side-chain flexibility with good accuracy, however they are slow. Besides, MD is poor at crossing energetic barriers at reasonable temperatures, hence limited in its sampling conformational space. Genetic algorithms encoding flexibility tend to be too slow for use in screening many molecules by docking. Exhaustive search space sampling methods using dis- crete rotamer libraries are fast and efficient if the number of side chains are limited. Methods like SCRWL using rotamer libraries have good accuracies for predicting 20.2 x angles, however such predictions have been confined to homology modeling or ab initio protein design rather than docking. Because rotamer libraries specify rotamers as average conformations representing clusters of similar side-chain conformations, a realistic use of libraries requires expanding the rotamers by sampling around the 11 For All Possible Anchor Fragments Defined by All Trlplets of lntersctlon Centers In Each of the Screened Molecules Identliy Chemically and Geornelrlcally Feasible Superposition of Ligand Triangle onto Template TriangIeA ————> . Dock Rigid Anchor Model Induced Md Ligand Side Fragment based on Complementarity Chains in Triengle'e with Figure 1.2: This image is presented in color. Screening and docking algorithm im- plemented in SLIDE [39]. SLIDEs docking of potential ligands into the binding site is based on mapping triplets of ligand interaction centers (hydrogen-bond donors, acceptors donor/ acceptors, or hydrophobic atom centers) onto triangles of template points located above the protein surface. Feasible template triangles for each possible triplet in a screened molecule are directly accessed via a multi-level hash table, and the corresponding mapping is used to dock the rigid anchor fragment of the potential ligand. Single bonds in the flexible parts of both molecules are rotated to generate a shape-complementary interface, before the complex is scored by the number of inter- molecular hydrogen bonds and hydrophobic complementarity of the contact surfaces. In all steps, the ligand triplets or dockings that do not meet a particular threshold are discarded. 12 dihedral angles within specified deviations[24]. DEE-based side-chain conformational search techniques are efficient but can miss global energy minima. The fuzzy-end elimination theorem[28, 25] corrects this problem but the search space becomes huge. Some groups have studied specific side-chain flexibility aspects, such as the num- ber of interfacial side chains undergoing large conformational changes or whether the side chains remain rotameric after ligand binding. For instance, [14, 15] claim that interfacial side chains are not always rotameric as ligand binding could induce non- rotamericity in the binding-site residues. Najmanovich and colleagues[34] show that for 85% of the binding pockets they studied, 3 or fewer side chains underwent a dihe- dral rotation of more than 30°. Knowledge about such proclivity might help restrict the conformational sampling in an exhaustive search without loosing the ability to model substantial side-chain conformational changes; and hence is very valuable for to improve docking algorithms. This research was motivated to develop techniques that accurately model side chain positioning during docking to optimize the interactions in the protein-ligand interfaces. Optimal interactions, which may be defined as interactions observed in crystal complexes, play significant role in molecular recognition. Ability to predict Optimal interactions in SLIDE would require developing an algorithm that not only samples larger side-chain conformational space than sampled by the current induced- fit collision resolution paradigm, but can also correctly identify those side chains for which bigger motions need to be sampled. 1.6 Organization of the Thesis This thesis addresses the questions of which side chains are observed to undergo large rotations upon ligand binding (and why), how to sample such motions and the extent to which rotameric sampling and protein-ligand scoring facilitate or limit our 13 ability to model these motions accurately. Chapter 2 presents ana analysis of intra- protein noncovalent bond preservation upon ligand binding, conclusions from which help identify the side chains which undergo large motions during docking. Chapter 3 analyzes the rotamer-based sampling for these side chains. Of the many conformations available for a collection of side chains, a docking tool should be able to identify conformations that Optimize interactions and rank them highly. Chapter 4 presents developing a new scoring function to predict the affinity of a protein-ligand complex with state-of-the-art accuracy and applies it to select from high-probability side-chain rotamers. Chapter 5 presents the side-chain interaction optimization algorithm in SLIDE and subsequent results. Chapter 6 concludes the thesis with a brief summary and a discussion on future directions of this work. 14 Chapter 2 Hydrogen Bond Preservation in Protein Binding Sites 2.1 Motivation Predicting the structure of the complex of a small ligand with a protein is still a complex task, two major problems being the definition of an appropriate scoring function to discriminate good binding modes of the ligand among all possible binding modes and the huge size of the binding-mode search space itself. Assuming that one knows the correct scoring function, a successful search procedure should consider the three factors that give rise to the size of search space: 0 sampling the relative orientations of ligand and receptor, 0 sampling the low-energy ligand conformations, and e sampling the low-energy protein conformations SLIDE can sample different relative positions of ligand. Moreover it also ad- dresses small-scale receptor and ligand flexibility through resolving steric overlaps and developing a shape complementarity between the two. However, while enhancing I5 shape complementarity, it does not yet take into account how chemical complemen- tarity can also be enhanced during the docking. Enhancing chemical complementarity in the binding site through exhaustive conformational search for both protein and lig- and leads to a very large conformational search space. If there are ‘n’ configurations for each of ‘m’ side chains in the protein-ligand interface, then there are 12'" possible configurations for the protein alone to be assessed by scoring. Such expensive spatial sampling is infeasible for high throughput screening algo- rithms. To perform such affinity enhancing sampling, we first focus on what can be learned from known changes in a receptor’s non-covalent bond network upon ligand binding in order to limit the choices to those that are reasonable. This can help us identify circumstances in which side chains do not undergo much rearrangement. Hence, the knowledge of the circumstances under which side-chain motions occur, and their extent of motion, can provide guidance for developing improved docking prediction algorithms. 2.2 Methods We selected 30 non-homologous protein crystal structures, both ligand free (apo- protein) and ligand bound (holo—protein) from the PDB macromolecular structural database. 16 Table 2.1: List Of protein-structure pairs studied for hydrogen bond preservation upon ligand binding. PDB CodeI Protein / Ligand Complex R.ee(A)2 RMSD3 1ahc/lahb a-momorcharin / formycin 2.0/2.2 0.23 5’ -monophosphate lapm/latp CAMP-dependent protein kinase / 2.0/2.2 0.35 MnATP 1ajz/ 1aj2 dihydropteroate synthsse / 2.0/2.0 0.29 diphospbate 1ca2/1bcd carbonic anhydrase ll / 2.0/ 1.9 0.20 trifluoromethane sulphonamide 1cgf/1hfc fibroblast collagenase / HAP4 2.1 / 1.56 0.39 lgmq/ lgmr ribonuclease/2’ -guanosine- 1.8 / 1.77 0.25 monophosphate lkem/lkel catalytic antibody/hapten 2.2/1.90 0.89 2hvm/lllo hevamine a/allosamidin 1.8/ 1.85 0.12 1swa/1swd apo-core-streptavidin / biotin 2.0/ 1.90 0.97 2ptn/1tps trypsin / a90720a 1.55/1.9 0.3 1xib/1xid d-xylose isomerase / l-ascorbic acid 1.6/ 1.7 0.19 lydc/lydb carbonic anhydrase ii / acetamlamide 1.95/ 1.9 0.13 1tli/3tmn thermolysin / val-trp 2.05/ 1.7 0.22 6taa/7taa family 13 alpha amylase / acarbose 2.1 / 1.98 0.30 lgta/lgtb glutathione S-transferase / 2.4/2.6 0.20 praziquantel lhel/lmlc hen egg—white lysozyme / monoclonal 1.7/2.1 0.49 antibody Fab D44.1 llib/llic adipocyte lipid-binding protein / 1.7/1.6 0.32 hexadecanesulfonic acid lnsb/lnsc neuraminidase / N-acetyl neuraminic 2.2/ 1.7 0.12 acid lpoa/lpob phospholipase A2 / transition-state 1.5/2.0 0.72 analogue lsyc/ lsyd staphylococcal nuclease / 1.8/ 1.7 0.41 2’ -deoxy-3’ -5’ -diphosphothymidine ludg/ludh uracil-DNA glycosylase / uracil 1.75/1.75 0.19 2sct/1aec actinidin / E645 1.7/1.86 0.11 2apr/3apr acid proteinase / reduced peptide 1.8/1.8 0.13 inhibitor 2cla/3cla chloramphenicol acetyltransferase / 2.35/ 1.75 0.41 chloramphenicol 2ctv/5cna concanavalin A / 1.95/20 0.42 a-methyl-D—mannopyranoside 2sgs/5sga proteinase A / tetrapeptide 1.5/ 1.8 0.08 AcePro-AlapPro-Tyr 2wrp/1tro 'Ii'p repressor / synthetic operator 1.65/ 1.9 2.18 3cox/lcoy cholesterol oxidase / 1.8/1.8 0.24 3-B-hydrmry-5—androsten-17—one 3dni/2dnj deoxyribonuclease I / DNA 2.0/2.0 0.37 3enl/5enl enolase / 2-phospho-D-glyceric acid 2.25/22 0.21 3grs/1gra glutathione reductase / glutathione 1.54/20 0.12 disulfide 5cpa/6cpa carboxypeptidase A / phosphonate 1.54 /2.0 0.36 ‘ Ligand-free/ligand-bound 2Resolution of the crystallographic structures in Angstroms. 3Main-chain RMS positional deviation from superposition of the ligand-bound and free structures. ‘HAP is (N-(2-hydrouramatemethylene-4—methyl-pentoyl)phenylalanyl) methylarnine. 5E64 is [N-(l-3-trans-carboxyoxirane-2-carbonyl)-l-leucyl]-amido(4-guanido)butane. 17 These structures are a set of diverse, non-homologous structures with resolution better than 2.2 A . Positional root-mean-square deviation after superimposition of ligand-free and ligand-bound structures was less than 1.0 A (mostly smaller than 0.5 A ) . The structures had no missing residues in the binding site. Hydrogen bonds were identified between donor and acceptor groups according to the following geometric criteria [45, 31], shown graphically in Figure 2.1: 1. Donor-Acceptor distance, d S 3.6A 2. Hydrogen-Acceptor distance, r 5 2.6A 3. Donor-Hydrogen-Acceptor angle, 90° 3 0 5 180° Figure 2.1: This image is presented in color. Geometric parameters used to identify hydrogen bonds and measure their energy. The hydrogen bond is depicted as a dashed line between the hydrogen and the acceptor oxygen. r is the hydrogen-acceptor distance, d is the donor-acceptor distance, 0 is the donor-hydrogen-acceptor angle and d: is the hydrogen-acceptor-base atom angle, where the carbon is the base atom in this example. The energy of each hydrogen bond was measured using a modified Mayo poten- tial [16, 4]. The function evaluates the favorability of the observed hydrogen-bond 18 length relative to the optimal, equilibrium length for that pair Of atoms based on their electron orbital hybridization, as well as the favorability of the angles between the donor and acceptor groups. The modification avoids non-physical H-bonds with angles near 90 deg (e.g., between C=O(i) and NH(i+3), rather than the important C=O(i)4—>NH(i+4) interactions in the middle of a-helices). The energies of hydrogen bonds, Egg were calculated using equations 2.1. Ens = Vo {5 (%)12 - 6 (%)lo} F09. «A w) (2-1) with V0 = 8 kcal/mol R0 = 2.80 A sp3 donor - sp3 acceptor F = cos2l9e'(*‘“)°cos2 (qi — 109.5) sp3 donor - sp2 acceptor F = cos”0e"*‘”l°cos"’¢ sp2 donor - sp3 acceptor F = c0340(e‘2(*‘9)6) sp2 donor - sp2 acceptor F = cos20e’(""’lficos2 (max [(1), rer “XX XXX)“ m: X0 0 0mm >181. Dihedral Angle Change (degrees) 0 I X (X ram mo 0 -50—l 5 o o X .350 400- o x 33 .X. 38 -150... O o “ZOOIIIIIIIIIIIIIIIIIII 02amzagurarnr-ruarrra. easesalIesassserséi ResidueType Figure 3.2: Change in X1 and x2 dihedral angles of all interfacial side chains within 4.0 A of the ligand, across 30 structures listed in Table 2.1. The change is calculated as the difference of x angle of interfacial side chains between their ligand-free and ligand-bound conformations. The standard deviation for X1 changes is 25.8 and for x2 changes is 34.7. Note that most changes in x”) fall within the range of —20° to 20°. 33 200 150"" O O ' 8 A100- " 9 I x O 8’ o 2.50- 52 O X 3 o. s 8 o 6 o- i M g. 0 . i“ x ’8 : §-50- g 00 5-100- ‘ % 450- 6 . 1 O L '200 TiillTIIIlllllliliI 02am aggawr-urrraco. rasaésx-ssessaefi Figure 3.3: Change in x3 and x4 dihedral angles of all interfacial side chains within 4.0 A of the ligand, across 30 structures listed in Table 2.1. The change is calculated as the difference of x angle of interfacial side chains between their ligand-free and ligand-bound conformations. The standard deviation for x3 changes is 42.9 while for x4 changes is 46.4. Note the somewhat broader range of change in x3 and x4 compared to changes in X1 and x2 dihedral angles in Figure 3.2. 34 these steric clashes involve the flexible portion of the ligand, it is possible to resolve each steric clash independently by rotating either the ligand’s flexible side chain, or a flexible group in the protein with which it overlaps. In this context, a ligand side chain is any group connected to the anchor fragment by a rotatable bond. When the target (protein) atom clashing with the ligand belongs to a flexible protein side chain, that side chain’s rotation also presents another option to resolve the steric clash. Geometrically, any collision between two atoms, which are modeled as va der Waals spheres can be resolved in multiple ways, given that each atom as 3 degrees of translational freedom. However, the colliding atoms are themselves bonded to other atoms by single or double bonds. For each single-bond rotation that can help resolve a steric clash, overlapping atom displaced by the bond rotation is considered mobile, while the other overlapping atom is considered fixed. When both the overlapping atoms are independently connected to single bonds, than their overlap resolution is evaluated with each of them being considered mobile and fixed, one at a time. Single bond rotations allow moving the rotating atoms in a circular trajectory in planes perpendicular to the rotation axis, which is along the single bond. To resolve a collision through a specific bond rotation, the rotatable single bond is transformed along the Y-axis, with the atom of the bond further from the backbone forming the new origin. The same transformation is applied to all the atoms in the side-chain beyond, including both the atoms having the steric clash. The transformations are calculated in a way such that, on application, they would move the fixed atom into the X-Y plane, as shown in Figure 3.4. This later helps calculations for determining the rotatable bond’s rotation angle for collision resolution. While there are potentially infinite positions on the trajectory where mobile atom can be placed to resolve a clash, most of these are usually infeasible due to the high probability of causing new collisions. SLIDE’s paradigm to choose a position to resolve collisions during docking is based on developing “induced fit” between the 35 Circular trajectory available to mobile atom through rotation around a rotatable bond aligned along Y- “ axis. which points downward. perpendicular to the plane or the Z 'a' reflects the minimum angular rotation for mobile \ , ~ Sterlc atom to resolve the steric ‘ \ ‘ overlq: clash \ Mobile Atom I 0 Fixed Atom X Figure 3.4: Induced fit development mechanism implemented in SLIDE performs di- rected rotations to resolve atomic collisions. Directed rotations are performed around a rotatable bond not adjacent to the mobile atom. The rotations are performed in a plane parallel to X—Z plane, by aligning rotatable bond along the Y-axis. The hinge- atom of the bond is new origin for the rotation. The fixed atom is transformed into the positive quadrant of the X-Y plane, while the mobile atom is rotated through the minimum angle of rotation, ‘a’ to resolve the steric overlap. 36 ligand and the protein binding site[23]. To mimic induced fit, SLIDE chooses the smallest angle through which the mobile atom can be rotated to resolve the collision. However, choosing the smallest rotation angle for a bond rotation is not sufficient since there may be multiple single bonds in a side chain in residues like lysine, arginine which could be rotated to resolve a collision. SLIDE’s approach for deciding which rotations to apply for resolving a set of collisions is based on mean-field theory[26, 19], implemented to optimize bond rotations[42]. The key feature of this approach is a probability matrix P(i, j), which describes the probability that a particular collision 2' will be resolved by a rotation of bond j . First, all intermolecular collisions in the complex are identified. They form one dimension of the matrix. Only complexes with up to 20 collisions between atom pairs undergo side-chain collision resolution in SLIDE; ligand dockings with more collisions are discarded. All rotatable bonds that can be used to resolve at least one of the collisions build the other dimension of the matrix. Only those rotations are considered that do not result in an intramolecular collision based on the current conformation. Note that there is no differentiation between ligand and protein side chains in this matrix. All rotations that can resolve a particular collision are initialized with equal probability values. For each probability entry P(z', j), a cost value E’ (z', j) is computed that reflects the cost of rotating bond j to resolve collision i. This cost is simply the product of the number of displaced non-hydrogen atoms and the absolute value of the rotation angle. This makes large rotation anng or the rotation of large side chains more costly, as they are more likely to cause steric problems elsewhere. During the iterative cycles of the mean-field optimization process, the probability matrix P is updated to converge to high probabilities for those rotations that provide the lowest-cost conformational change of both molecules to resolve a maximal number of the observed collisions (Figure ??). In each cycle a mean cost, E (i, j), is computed 37 Find all side-chain collisions 2' and all rotatable bonds j; While there are between 1 and 20 side-chain collisions do: Compute probability matrix P and cost matrix E’; For 10 cycles do: Compute mean cost E(i, j); update probability matrix P; Do feasible highest probability rotations; Find all remaining side-chain collisions i; Figure 3.5: The algorithm for resolving side-chain collisions using the mean-field optimization technique. When there are still collisions exceeding the threshold after 10 iterations of the outer loop, this ligand orientation is discarded. for each rotation, as follows: E(i.j) = E'(z°.j) + Z depi(t,j),(h,k)i-P(h.k)~E’(h,k) haéi, I: The value of dep[(i, j), (h, k)] is set to -1.0 if j = k, i.e., both entries refer to the same bond and both rotations are in the same direction. Hence, two collisions can be resolved at once by a rotation of this bond, and the —1.0 value results in a lower mean cost E(i, j). The value of dep[(z', 3'), (h, k)] is set to 1.0, with a resulting increase in E(z’, j), in two cases. The first case is when both entries refer to the same bond (j = 1:), but the corresponding rotation directions are opposed to each other. In the other case, bond j lies on the path from bond 1: to the anchor fragment or the main chain, respectively. Here, the mean cost of rotating bond j is increased, since if rotation (i, j) were applied, then bond k would be moved, and this would invalidate the assumptions made in the current iteration regarding rotations involving this bond. At the end of each cycle, the entries in the probability matrix are updated based 38 on the mean costs E (2', j) using the Boltzmann principle: . . e-Efiajvl‘ P(2,J) = Tl: e-E(i.k)/u where p is the average value of all computed mean costs. Convergence of the values in the probability matrix is usually observed in fewer than ten cycles, and those rotations with the highest probability are chosen to resolve the collisions. At this point, it is again necessary to check for negative correlations between bonds. Although this was already considered during the computation of the mean cost, two correlated bonds can receive high probabilities if they are the only bonds to resolve particular collisions, or if alternative rotations are much more expensive. During the mean- field optimization process, it is not possible to anticipate complex dependencies, e.g., which ligand rotations influence protein bonds related to other collisions. Rotations are only accepted if they do not cause any intramolecular collisions. Since it is likely that not all collisions can be resolved in one application of the mean-field Optimization technique, up to 10 iterations of this process are executed, as outlined in Figure ??. Ligand conformations are discarded if they have more than 20 collisions at any time during the optimization or have collisions exceeding the threshold after 10 iterations. 3.2.1 Motions Modeling in SLIDE Mean-field optimization helps choose bond rotations to minimize the confor- mational cost of developing shape complementarity. Figures 3.6 and 3.7 present a displacement comparison of side chains that were moved by SLIDE for collision resolution with displacements of corresponding side chains from their ligand-free to ligand-bound conformations. (Hence, these figures present a subset of interfacial side chains which were found to have moved in Figure 3.1). To examine how well SLIDE models known side—chain motions, the distribution 39 Ligand-free to ligand-bound side-chain RMSD 40 " Ligand-free to SLIDE's best docking's side—chain RMSD Z / V a A E ZS M . /\/§ /§/\ Zfifi /§%\ /\/\ /\/\ Zhé§ a . Mt as, \T . l3“. . -2 0 2 4 6 RMSD (A) (negaMmRMSD:MienSUDEmvedasldednhmmermyfiunllgmdbwndomt) Figure 3.6: Frequency distributions of RMSD between ligand-free and ligand-bound side-chain conformation, as well as RMSD between ligand-free side-chain conforma- tion and sidechain conformation in the best docking generated by SLIDE. The best docking was determined using the RMSD of the docked ligand relative to the crystal structure orientation. Positive values indicate the RMSD by which SLIDE’s con- formation of side chain got closer to crystal complex conformation, while negative values indicates that SLIDE moved the side chain further from crystal complex con- formation. Observations are derived from SLIDE’s best dockings of 24 structures, comparing only those interfacial side chains that were moved by SLIDE. 40 of root-mean-square deviations (RMSDs) between ligand-free and docked side-chain conformations found in the best dockings by SLIDE were compared with the distribu- tion of RMSDs between ligand-free and ligand-bound side-chain conformations from crystal structures (Figure 3.6). This distribution data was derived from 24 of the 30 structures specified in Table 2.1. The remaining 6 structures did not permit dock- ing without an increase in inter- atomic interpenetration parameters, so they were excluded from this analysis. The RMSD value of a docked side-chain conformation relative to its ligand-free conformation is displayed negative when it is greater than the RMSD value between the ligand-free and ligand-bound conformation in crystal structures. Hence, this figure compares side-chain displacement distributions between modeled and observed structures, both in magnitude and direction. While SLIDE’s distribution of RMSD magnitudes was quite similar to nature, as most of the motions were restricted between 0-1 A, a few large displacements in ligand-bound conforma- tions are also observed. These observations are in agreement with [50], where it has been found that in 85% of the cases, interfacial side chains rotate through less than 45° upon ligand binding. While Figure 3.6 compares displacement distributions, Figure 3.7 compares dis- placements on individual side-chain basis. Presented as a scatter plot is the compar- ison of side chain RMSDs between ligand-free and ligand-bound conformations, and between the ligand-free conformation and the conformation in the best dockings for 24 structures by SLIDE. Comparing with Figure 3.1, it is clear that SLIDE moves very few side chains compared to nature. The correlation between the ligand-free to ligand-bound side chain RMSD and ligand-free to docked side chain RMSD is small (0.28), indicating the extent to which SLIDE misjudges the magnitudes of motion, as is clear by the degree of scatter. The color coding indicates the quality of motion, answering the question: “ did SLIDE move the side chain closer to ligand-bound conformation or further away ? ” SLIDE, while removing steric overlaps, tends to 41 5 Closer to :, llg-bound oonf. O . -1 I 0 0.6000 ‘0 0 c: 4 o g ‘ 0.3375 5 g 5 l 0 07500 9.. § 5 §. 2 3‘ ,- u, 0.1875 8. i " m. g 0.4500 .2 .E 2 'l .0 g .3 “ . ° . ' c.7125 _ _ E . . 3 g . . 0 0.9750 g 1 - . e re 0 0 ..:~ 4.238 (O O .9. 0 z '0 ' ° 1 500 m 0 -l - ' Further from I - I . I ' I . I . liq-bound cont. 0 1 2 3 4 5 RMSD between Ligand-free and SLIDE‘s best docking's side-chain conformation Figure 3.7: This image is presented in color. Comparison of side chain RMSDs between ligand-free to ligand-bound conformation, and between ligand-free confor- mation and conformation in the best docking. The best docking was determined using the RMSD of the docked ligand versus the crystal structure ligand orientation. The color scale indicates the quality of motion. Positive values indicate the RMSD by which the SLIDE-generated best docking’s side-chain conformation became closer to crystal complex conformation, while negative values indicate that SLIDE moved the side chain further from the crystal complex conformation. Observations are de- rived from the best dockings of 24 structures, with only those interfacial side chains being compared which were moved by SLIDE. Correlation between the two RMSD distributions is 0.28. 42 move the side chains away from the ligand-bound conformation (colors ranging from cyan to red). For only 18 of 58 observations did SLIDE move side chain closer to the ligand-bound conformation (blue). Having observed that most of the SLIDE motions are minimal to resolve colli- sions, that each collision can usually be resolved by either atom’s rotation around a rotatable bond, and that often, SLIDE moves atoms further away from ligand-bound positions, it is postulated that both directions of rotation should be explored for col- lision resolution. While the angle of rotation in one direction will be bigger than the angle in the opposite direction, the ability to make better interactions in one direc- tion versus the other should influence the direction of rotation. Furthermore, from the observed intra—protein hydrogen-bond preservation tendency presented in Chapter 2, any rotations which might disrupt direct hydrogen bonds between the binding-site side chains may be discouraged so that SLIDE’s dockings will have a similar bias towards preserving most of the intra-protein hydrogen bonds. 3.3 Employing Hydrogen-Bond Preservation Bias in Mean-Field Optimization As explained in previous section 3.2, the key feature of mean-field optimization process is the probability matrix P(i, j), which describes the probability that a par- ticular collision i will be resolved by a rotation of bond j. At the beginning of the optimization, each rotation that can resolve a collision is assigned equal probabil- ity. Subsequently, these probabilities are updated iteratively during the optimization process, depending upon the conformational cost of rotation and dependencies of re- solving other collisions. To bias against those specific side-chain rotations that disrupt a hydrogen bond, initial probabilities can be reduced according to the hydrogen-bond preservation probabilities derived from the statistics shown in Figure 2.2. This should 43 effectively discourage rotations which can break a hydrogen bond. Nevertheless, de- spite low probability, a particular rotation that disrupts a hydrogen bond can still be effected if it is the only one that can resolve a particular collision, hence ensuring that final docking will be free from atomic overlaps. Biasing against breaking intra-protein hydrogen bonds would likely improve the chemical complementarity of the docking as well. To measure the effect of incorporating hydrogen-bond preservation bias, a pilot study of 5 ligand-free and ligand-bound structure pairs was conducted. The templates representing the binding sites of the ligand-free structures included random sampling of potential interaction points as well as known interaction points with the ligand. Known points were included to ensure that at least one docking would result in placing the ligand close to its native position and orientation, so that side-chain motion would not be required to compensate for inaccurate ligand placement. Figures 3.8, 3.9, and 3.10 present a comparison of SLIDE and the new SLIDE variant with a hydrogen-bond preservation bias based on the statistics from Figure 2.2. Data presented for both SLIDE versions include only the best dockings according to the ligand RMSD (relative to its crystal complex position), which must be less than 1.0 A. The native ligand conformations from ligand-bound crystal complexes were used for docking into apo structures. Figure 3.8 presents the total number of intra- target (protein binding site) hydrogen bonds. The crystal complexes have more intra- protein hydrogen bonds upon ligand binding compared to the ligand-free structures. Both SLIDE versions preserve a similar number of hydrogen bonds, but fewer than what nature preserves in the crystal complexes. This indicates that new intra-target hydrogen-bonding opportunities are missed in both versions of SLIDE. Figure 3.9 presents intra—target hydrogen bonds that were lost upon ligation in the crystal com- plex and in the best dockings by SLIDE. Interestingly, while more hydrogen bonds were lost in the crystal complexes, Figure 3.8 indicates that these complexes form 44 new intra-target hydrogen bonds that were not present in the ligand-free structures, as shown in Figure 3.10. Even though the sample set of 5 structures is small, it nev- ertheless suggests that nature breaks and makes more intretarget hydrogen bonds upon ligation through optimal arrangement of side chains. SLIDE, on the other hand, due to its minimal rotation collision resolution model, tends to keep the binding site side-chain positions and their interactions more or less the same, breaking and making fewer intra-target hydrogen bonds compared to nature. Considering that each hydro- gen bond has an energy of about -5 Kcal/mol, and that typical complex formation involves a favorable energy change of only ~3 times this number, careful modeling of interfacial hydrogen bonds is likely to be crucial to correctly sample and assess the best dockings. 3.4 Sampling Large Side-Chain Motions Comparison of the extent of displacement in Figures 3.6 and 3.7, as well as hydrogen-bond preservation results from best dockings by SLIDE presented in Fig- ures 3.8, 3.9, and 3.10 indicate that aside from developing good shape comple- mentarity through small side chain adjustments, optimal rearrangement of the intra- target hydrogen-bond network is important for favorable binding affinity between the molecules. This rearrangement may involve not only small but also large rotations, which SLIDE’s induced-fit mechanism does not encourage. To explore if there are any prominent reasons for rearrangements through large side-chain rotations, all interfacial residues that had undergone a side—chain dihedral rotation greater than 60° upon ligand binding were analyzed by molecular graphics to understand the reasons for these large rotations. Any binding sites having gaps between residues were excluded from this analysis using the molecular graphics tool InsightII, since those mobile residues could influence motions in unpredictable ways. 45 lntra-protein direct hydrogen—bond comparison between ligand-free, ligand-bound structures and best dockings by old and H-bond-preservationist SLIDE +0 3 rs H-bond-preservatlonist SLIDE 9 +1 223 Old SLIDE E Ligand-bound crsytal complex DID Ligand-free structure +0 1syc +3 1nsb Complex 1 ajz 1ahb 0 1O 20 30 4O 50 60 Total number of intra-protein hydrogen bonds in the binding site Figure 3.8: Comparison of the number of intra-protein hydrogen bonds before and af- ter ligand binding in crystal complexes, in best dockings by SLIDE and the hydrogen- bond-preservationist version SLIDE. 46 Comparison of intra-protein direct hydrogen bonds lost upon docking by old SLIDE and H-bond-preservationist SLIDE versions, compared against H-bonds lost in crystal complexes H-bond preservationist SLIDE agrs Old SLIDE E Crystal complex 1 eye 1nsb Complex 1ajz Iahb 0 1 2 3 4 5 6 7 8 9 Number of intra-protein hydrogen bonds lost in the binding site Figure 3.9: Comparison of the number of intra—protein hydrogen-bonds lost upon ligand binding in nature (between ligand-free and ligand-bound structures), in best dockings by SLIDE, and in the best dockings by hydrogen-bond-preservationist ver- sion SLIDE. 47 Comparison of intra-protein direct hydrogen bonds gained upon docking by SLIDE and by H-bond-preservationist SLIDE, compared to hydrogen bonds lost in the crystal complexes H-bond preservationist SLIDE 398 I222 Old SLIDE E Crystal complex 1syc 1ahb O 1 2 3 4 5 6 7 8 9 Number of intra-protein hydrogen bonds gained in the binding site Figure 3.10: Comparison of the number of intra—protein hydrogen bonds gained upon ligand binding in crystal complexes, in the best dockings by SLIDE and by the hydrogen-bond—preservationist version of SLIDE. 48 Table 3.1 presents the conclusions from visual inspection of such large rotations. The process involved super-imposing the ligand-free binding site onto ligand-bound site, followed by analyzing steric and chemical factors that could encourage large rotations. Several causes for rotations or displacements were found - main-chain movements of more than 1.0 A (for residues like L33 in PDB entry lkel L33, residues A45 and A49 in lswd), side-chain movements caused by motion in adjacent residues, aromatic interaction with the ligand (lahb, residue 70), 1r—cation interaction with the ligand (lgmr residue B40, lsyd residue 115 ), surface exposure or even crystal packing effects (ludh residue 87). Nevertheless, in two-thirds of the cases, side chains moved not to resolve any steric clashes upon ligation. Rather, side chains moved to satisfy polar atoms (in 10 cases, excluding cases with aromatic or 1r-cation interactions) which would have remained unsatisfied and buried if they had remained in ligand- free conformations. Visual inspection and results trends indicate that, in 50% of the cases, the reason to satisfying the hydrogen—bond potential of a polar buried atom encouraged large side-chain rotations in the interface upon ligand-binding. Exhaustive modeling of pos- sible large side-chain rotations in SLIDE around single bonds would mean sampling 300° of dihedral-angular space for each single bond. Given that a side chain can have from zero to four rotatable bonds, the combinatorial sampling, depending upon the fineness of sampling and the number of side chains to model, would be prohibitively time-consuming to do in docking. 3.4.1 Rotamer Libraries Fortunately, for most of the bond-rotation angles, protein side chains adopt pri- marily a staggered dihedral angle such that covalently bonded, sp3-hybridized atoms tend to stagger, rather than eclipse, their substituent atoms. As the number of solved protein structures has increased over the years, statistical distributions of the 49 Table 3.1: Visual analysis of reasons behind large dihedral rotations (> 60°) in inter- facial side chains. Structure Residue# Residue Type Did side chain clash Would side-chain with the ligand ? H-bond potential have remained if side chain had not moved I? 1aj2 221 LYS N 0 YES latp El 20 MET NO YES 1 atp E53 SER NO YES lcgu 257 GLU NO YES lgmr B40 ARG NO YES lpob A52 ASN N 0 YES lsyd 43 GLU N 0 YES lsyd 1 1 5 TYR NO YES ltps 192 GLN NO YES 7taa 296 HIS NO YES lgmr B38 GLN N O NO 1ke1 L33 ASN NO NO lbib 1 83 LYS NO YES llic 58 LYS NO NO 1ahb 70 TYR YES YES lgmr 865 ARG YES YES 1kel H56 LYS YES YES lpob A30 ARG YES NO laj2 115 ASN YES NO llic 57 PHE YES NO lpob A2 LEU YES NO lcoy 122 MET YES NO 5O side chain bond-rotational angles for each residue type have been characterized[24]. These bond-rotation angles are labelled x angles by convention, as explained in Fig- ure 3.11. Side-chain X angles have been found to occur in tight clusters around certain values. This is both because of the hybridization of the bonded atoms, as well as to prevent collisions with the main-chain atoms of the residue and its neighbors. Each single bond in the side chain can sample its dihedral degrees of freedom subject to these constraints. The resulting side chain conformation is called ‘rotational isomer’ of the side chain. Abbreviated as ‘rotamer’, this favored orientation is represented as a set of values, one for each dihedral angle degree of freedom. Since bond angles and bond lengths in proteins have rather small variances, they are usually not in- cluded in the definition. A library of such favored sets of side-chain x-angle values for each residue type is called a rotamer library. Rotamer libraries usually also contain information about the frequency of each rotamer, as some conformations are more likely than others due to side chain stereochemistry, besides the information about the variance around the dihedral angle mean or mode. Using rotamers to sample side chain conformation can help avoid exhaustive spatial sampling and make tractable the identification of a suitable combination of dihedral angles. With known geometries derived from experimentally solved crystal structures, not only can time be saved during sampling, it is also more likely to arrive at a side chain conformation that is natural and low in energy. Using rotamer library for side chain modeling does have certain drawbacks. Firstly, rotamers are typically averaged conformations representing a cluster of con- formations; they may not be real conformation themselves. Besides, granularity of dihedral space sampling of side chains within the rotamer library can itself be a limit, especially because rotamers largely reflect favored side-chain orientations outside of interfaces. Another aspect is that even though rotamers are thought to represent local energy minima on a potential energy landscape, not all rotamers may be local 51 Figure 3.11: Illustration of bond-rotation angles associated with single bonds in an arginine side chain. The number of single bonds in a side chain, ranging from 0 to 5, depends on the residue type. A single bond is free to sample dihedral rotations and its associated dihedral angle is called a x angle. The x angles are labelled X1» x2, x3 and so on, according to the level of the single bond from the residue’s backbone. 52 extrema since distributions Of side chain dihedral angles are fairly broad (i20° or 30° relative to the average value presented). Nevertheless, rotameric conformations can definitely be considered candidate conformations to sample, and the extent to which they approach Optimal confor- mation can be assessed. For sampling larger conformational space for side chains, rotamer libraries Offer time savings while providing ready-tO-use, pre-calculated rO- tameric side chain conformations. Rotamer libraries can be backbone-independent, backbone-dependent, or even secondary-structure-dependent. Backbone-dependent rotamers have dihedral angles and/ or frequencies that are binned according to the local backbone conformation. Backbone-independent rotamers are calculated using all available side chains of a residue type. Backbone-dependent rotamers have the advantage that no rotameric side chain can have steric clashes with its own backbone. TO see if one can find rotamers from a library similar to ligand-bound conforma- tions Of few side chains which rotated through more than 60° on ligand-binding, the Dunbrack May 2002 backbone dependent rotamer library was searched (downloaded from http://dunbrack.fccc.edu/bbdep). Rotamer searches were done in incremental fashion - first comparing only X1, then X1 and x2 and so on up to x4, depending upon the number Of x angles in the side chain. A rotamer was considered similar to a side chain if each Of the side-chain x angles were within the range x :l: a, where a is the standard deviation of respective x angle specified for the rotamer in the rotamer library. As is clear from search results presented in Figure 3.12, for side chains with x3 and x4 angles, fewer suitably close rotamers were found. Furthermore, for a few Of the side chains, like Tyr70 in 1ahc and Tyr115 in lsyc, no rotamer was found. This result, in effect, places an upper bound on how close, in x-angular space, tO a ligand-bound side chain conformation can rotamers from this particular library reach. 53 Rotamers found by x—angie search for selected ligand-bound side chain conformations 1ahc_ TYR _701 1syc_ TYR_ 1151 1swa_ A _SER _451 1apm: E _SER _531 1Iib_ ”PHE _571 3cox_ MET_ 1221 1apm_ E_ MET: 1201 1kem_ H L_YS _561 1ajz_LYS_2211 1bia_ us: 1831 1Iib_ LYS _581 1kem_ L_ LYS 501 6taa:_ HIS _2961 1syc: GLU _431 1cgt_ GLU _2571 1udg_GLN_ 871 1gmq__BG GLN _381 GLN_ 1921 1swa_ A_ ASN _491 1kem: L _ASN :331 1p0a_ ASN :521 1poa_ ARG :301 1gmq_B_ ARG_ 651 13m mcLB _ARG: 401 chs_ A_ ARG :901 WMmemk Matching upto 13 Matching upto x4 4 Matching upto x, 4 00290-0000... DO 4 Sidechain 4 B 8 4 3' A V 'l'l'Ijjinl'l'l'lejT 510152025303540455055 Number Of rotamers found In the Dunbrack May 2002 rotamer library 010090000000. 4 -5 Figure 3.12: Number Of rotamers, from Dunbrack May 2002 backbone-dependent rotamer library (http://dunbrack.fccc.edu/bbdep), approximating dihedral angles of target conformations for 25 interfacial side chains undergoing large rotation upon ligand binding. Selected side chains across structures are those that had rotated more than 60° from the ligand-free tO ligand-bound conformation. Side chain names (Y-axis) include the PDB code, chain ID (if present), residue type and residue number. Backbone d, 21) angles were used to locate the bins in which rotamers were searched, according the x angles Of the side chain in question. In the rotamer library, the d, w resolution is 10°. The neighboring 8 bins, using 45 :t 10 , 1,0 :l: 10, were also searched. Rotamer searches were done in incremental fashion - first comparing only X1, then X1 and x2, and so on, up tO x4, depending upon the number Of x angles in the side chain. A rotamer was considered similar to a side chain if each Of the side-chain x angles were within the range x i a, where a is the standard deviation Of respective x angle specified for the rotamer in the rotamer library. Rotamers searched consisted of only those rotamers which had the probability Of at least 0.05 times the probability Of the highest probable rotamer within the same 45, ll) bin. This helps exclude rotamers that were very rare, since many Of them could be poorly resolved side-chain conformations in PDB. 54 Another worthwhile aspect to explore is to determine how close rotameric side chains, which are the 3D side-chain orientations generated from the rotamer library’s 05, w and x angles, can approach ligand-bound conformations in terms Of RMSD. The reason for this is that single-bond rotations through adjacent x-angle changes can either compensate for each other or augment each other in terms of the resulting side-chain motion. The closest rotamers found, in 3D Cartesian space, for ligand- bound conformations are presented Figure 3.13. Rotarneric conformations within 1 A Of the ligand-bound conformations were found for about half Of the 22 side chain cases presented here. This again effectively defines the limit on how close rotameric modeling approach can approach ligand-bound side-chain conformations using Dunbrack’s backbone-dependent rotamer library. In Figure 3.13, the relationship between B-values for the large-motion side chains and the probability Of their having a rotameric conformation is presented. The moti- vation for this is that poor resolution Of high B-value residues could be reflected in the unlikelihood Of finding these conformations in the rotamer library. On a color-scale representing the maximum B—factor value found for any atom Of the side chain, black to yellow indicate that the atom coordinates are low in mobility, while light-green to grey indicate that the side chain had at least one atom whose position was highly mo- bile and thus likely to be poorly resolved in the crystal complex. SO, while rotamers close to ligand-bound conformations were found (RMSD < 0.7 A) for lbib-Ly8183 and 1gmr-ArgB40, high B-factor values indicate that these side chains do not have a well-defined conformation. Hence reliable conclusions may not be drawn about the efl'ectiveness Of a rotamer library for its ability or inability to find a close rotamer for side chains that consist Of atoms having high B-factor values. However, for most (13 out of 19) side chains having low B-factor values, a ro- tamer within an RMSD rang of 0.28 A to 0.83 A was found in each case. For 5 side chains having low B-factor values for which no close rotamers were found, three (1ahb- 55 Tyr70, 1cgu-G1u257 and lpoa-A52 ) moved to interact with the ligand, while 1poa—A2 moved due tO excessive steric clashes with the ligand, indicating that their favored conformations are influenced by the ligand as well as the side-chain conformational energetics. 3.5 Conclusions In this chapter, the bond-preservation bias discovered in 2 chapter was im- plemented in SLIDE. However, on further experiments it was found that SLIDE models small motions well enough during docking to preserve most Of the hydrogen- bond interactions and, thus, additional bias towards preserving hydrogen bonds is not warranted. Relatively large side—chain rotations in ligand-bound conformations and causes for them were further investigated. Trying to satisfy buried polar groups was determined as a predominant reason. Investigating how large rotations can be modeled in SLIDE, the effectiveness Of using Dunbrack backbone-dependent rotamer library to sample larger conformational space was analyzed. For X1 and x2 angles, the rotamer libraries always contained entries reflecting the conformations Of side chains that underwent large side-chain motions upon ligand binding. However, for long side chains there were Often no rotamers that also reflected their final ligand-bound x3 and x4 angles. In Cartesian RMSD space, 13 out Of221igand-bound side-chain conformer were represented in the rotamer library tO within 1.0 A RMSD and 8 Of the remain- der to within 2.0 A RMSD. For all cases where side-chain mobility values were low, rotamers close to the ligand-bound conformations were found. At other times, side chains were found to be non-rotameric either because Of ligand-induced strain[14, 15] in the interface or because they moved to enhance interactions with the ligand. 56 RMSD of the ligand—bound side chain conformations from closest rotamers found in Dunbrack's backbone dependent rotamer library 120 a. i 1001 so- q so- 401 / a\ a m1z/Aaj:A/\\z/./X:z\n E - An I: g ’ . +clloast—rotamuzerMSDmiativeto a l. 8 25‘ +MaximumB-factorofanysidechainatom 3 —A—AverageB—factorofdlsIde—ctuinatonu s 2.0- 3250 I . o \ I (I) _ 2 1.5 /\I_ I! 0.5- u/' an "" IIIrIIIIIIIrIIIIIIII 9:» fié $5069 3’ éeée o 06’ ~89 “its“ .9. 3W . 33:55:55 73* 1...:ij ti 5’5: :«t Large dihedral rotation residues Figure 3.13: This image is presented in color. RMSD of rotamers from Dunbrack’s May 2002 backbone-dependent rotamer library found to be closest to selected ligand- bound side-chain conformations in Cartesian space. Selected side chains across struc- tures are those that had dihedral rotations more than 60° from the ligand-free to ligand-bound conformation. Side-chain observation names include the PDB code, chain ID (if present), residue type, and the residue number. Backbone dz, i/) angles were used to locate the bins in which rotamers were searched, using a it, w resolution of 10°. The neighboring 8 bins, using :15 :l: 10 , 1,!) d: 10 were also searched. Each ro- tamer in these bins was converted to a side—chain orientation in Cartesian space and transformed into the reference frame of the selected side chain. The rotamer library consisted of only those rotamers which had a probability of at least 0.05 times the probability of the highest probable rotamer within the same (1), 1/1 bin. This helped exclude rotamers that were rare, since they could be poorly resolved side-chain con- formations in the PDB. Maximum and average B-factor values over all the side-chain atoms are also shown in the top half of the figure. SO, while close rotamers (RMSD < 0.7 A) were found for lbib-Ly5183 and 1gmr-ArgB40, high B-factor values convey that some target side-chain positions are in fact poorly defined. In general, there is no trend between the B-value mobility of a side-chain and the probability that this conformation is present in the rotamer library. 57 Chapter 4 Scoring SLIDE Dockings 4.1 Introduction Computer-aided structure-based methods for virtual screening are aimed at pre- dicting the binding mode Of a ligand in the binding site of a protein or molecular target, and estimating the resulting binding affinity. A conformational search algo- rithm can produce an immense number Of solutions from which the right solution(s) must be selected by an energy-based or other scoring procedure. Thus, it is extremely valuable tO predict binding affinity accurately so the user may select good conforma- tions and orientations from poor ones with confidence. For ranking protein-ligand dockings, a class Of programs, namely ‘scoring func- tions’ has been actively researched for many years. Since docking algorithms search for potentially good binding modes between a specified target and, and in some cases, thousands Of small molecules, they must have the ability to discriminate not only be- tween promising and poor binders, but also between different binding modes Of the same ligand. Theoretically, free—energy simulations can be a reliable check for dis- criminating good from bad dockings, but this approach is too expensive for screening large libraries Of small molecules and remains a techniques performed by relatively 58 few experts. Scoring functions can be broadly categorized into three categories: force—field based, knowledge-based and empirical. Force-field based scoring functions apply clas- sical molecular-mechanics based energy functions to approximate the binding free en- ergy Of the protein-ligand using a sum Of van der Waals, electrostatic, bond length, and bond angle terms. Solvation is usually taken into account using a distance de- pendent dielectric or solvent-accessible surface term, although explicit modeling of discrete water molecules is also done. Non-polar contributions are usually assumed to be proportional to the solvent-accessible surface area. A drawback is that energy landscapes associated with the force-field potentials are usually rugged, and therefore minimization is required prior to any energy evaluation. Furthermore, small inac- curacies in positioning atoms during docking can lead to large discrepancies in the energy score. Knowledge-based scoring functions represent the binding affinity as a sum Of protein-ligand pairwise atomic interactions. Using the protein-ligand complexes de- posited in the Protein Data Bank (PDB), knowledge-based potentials are derived from the Observed preferences for particular atom-pair interactions tO occur at given distances pr distance ranges between ligands and proteins. However, structures in PDB do not provide a thermodynamic ensemble at equilibrium. Hence knowledge- based potentials should be considered as a statistical preference rather than an actual potential. Recently, many empirical functions have emerged as alternatives to force-field- based and knowledge-based scoring functions. Unlike force fields, the weights Of terms in empirical scoring functions are directly calibrated using a set Of protein- ligand complexes with experimentally determined structures and binding affinities through multivariate regression analysis. These scoring functions may also include derived terms that include interactions that would otherwise involve several terms in 59 a force-field based function, e.g. terms for hydrogen bonds and hydrophobic inter- actions. Empirical scoring functions have several appealing features. They can be calibrated against a set Of diverse protein-ligand complexes. Secondly, each term in an empirical function has an intuitive physical meaning. The resulting regression co- efficients before each term conveys the relative contribution Of each interaction in the ligand-binding process. Thirdly, empirical scoring functions are extremely fast with reasonable accuracy, which makes them acceptable for structure-based drug design applications like virtual database screening and de novo ligand generation. Finally, most force-field-based energy functions have been parameterized solely tO fit pep- tidyl data, and are just beginning to be parameterized to recognize and accurately represent atom and bond types that occur in non-protein ligand molecules. 4.2 SLIDE Scoring Function The current SLIDE scoring function is an empirical scoring function trained tO score a collision-free complex using a linear combination of hydrophobic and hydrogen- bond interactions terms: SCORE(P, L) = A - HPHOB(P, L) + B - HBONDS(P, L) The weights A and B have been fit to Optimize SCORE(P, L) to match the affinities Of 89 high-resolution complexes listed in reference [8]. For more information about the terms and their tuning in current version Of SLIDE, please see reference [42]. While the current SLIDE scoring function is based on hydrogen-bonding and hydrophobic terms, more detailed terms tO represent both stabilizing and destabilizing thermodynamic interactions can be explored for developing new scoring functions for SLIDE. Inclusion Of such terms is also desirable to help discriminate among different 60 binding orientations generated while sampling rotamers for buried unsatisfied side chains and to detect interactions that could be improved through further Optimization. 4.3 Methods Empirical scoring functions are typically trained and tested against series of experimentally determined protein-ligand crystal complexes whose binding affinities are known. Unfortunately, no experimental data exists for suboptimal orientations Of ligands with proteins. For designing a new empirical scoring function for SLIDE to be used during and after docking, we developed and tested a series of scoring functions resulting from linear combinations Of individual terms listed in Table 4.1. The choice Of which terms to combine was driven by low-correlation among the terms, as well their ability to represent different interaction types. Table 4.2: Scoring function variants evaluated to predict binding affinities. 269 crystal complexes, listed in Table 4.3, with experimentally known affinities were used for training and testing the scoring functions. Terms are as defined in Table 4.1. Linear regression constant a, and coefficients 6 through 1) were determined to best fit the binding affinity values. Scoring Function Expression f1 (Id-BA f2 a+flB f3 (Id-BC f4 a+flD f5 O+flE fa a-i-BF f7 a-i-flG f3 a-l-BH f9 0+3] fro 0+1” fit 0+5K in 01-514 f13 0+5M fti 0+5N fits 04-50 fie 0+3!) In 04-50 he 0+3}? fro 04'53 Continued on next page 61 Table 4.2 — continued from previous page Scoring Function Expression fro a+BT f21 0+5U fn 0+3V f23 01-5-7443 f24 a+l6(J+K)+vB f25 0+5(H+I)+’YB fee 04-130 f2? 0+5D+7I fzs 0+BD+7(I+K) 129 a+flI+7(G+H)+6(J+K+L+M)+eP+(R [30 a+flI+7(G+H)+6(J+K+L+M)+eP f3, a+flI+7(G+H)+6(P) f32 a+flB+7(I+G+H)+6P+eR f33 a+flB+7(I+G+H)+6R+e I34 0+flB+7I+6R+E fss Owl-33+?! fag a+flO+7(I+G+H)+6(J+K+L+M) I37 a+BO+7(I+G+H)+6(J+K+L+M)+eR I33 a+flO+7(I+G+H)+6(J+K+L+M)+eP f39 a+BS+71+6(G+H)+e(J+K+L+M)+(P+nR f4o a+BS+7I+6(G+H)+6(J+K+L+M)+CP f“ a+fiS+7I+6(G+H)+eP I42 a+flS+7(I+G+H)+6P+eR I43 a+fiS+7(I+G+H)+6P f“ a+flS+7(I+G+H)+6R f45 a+flS+7I+6R f4s 044934-71 I47 a+flS+7(I-i-G+H)+6(J+K+L+M) I48 a+flS+ty(I+G+H)+6(J+K+L+M)+eR f“ a+flS+7(I-i-G+H)+6(J+K+L+M)+eP I50 a+flB+7I+dG+€H f51 O+BS+7I+6G+€H f52 a+flB+7I+6(G+H) f53 a+BS+7I+6(G+H) f54 0+BT+VI+6(G+H) f55 a+flT+7I+6(G+H)+cP I53 a+flT+7(I-l-G+H)+6(J+K+L+M)+eR I57 a+flS+7G+6(H+I)+eL+(J f“ a+flT+7G+6(H+I)+eL+(J [59 a+flS+7G+6(H+I)+eL+(J+nV foo a+flU+7G+6(H+I)+eL+ -b us-s-s ~96 G -s -s a - ab .0 us mos «s «as 05 030° 5 — .gb .‘ e' a' a" J'e' "BR 1.1% _ 0% e' ' 05 es .Qj 1.400 .. 0' \B/ I“) . Higher score I I l I I l I I -1.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 Side-chain RMSD relative to ligand-bound conformation (A) Figure 5.8: This image is presented in color. Selection or rejection reasons for each of the sampled rotamers for the unsatisfied side chains in the best dockings of the 24 structures used in experiments. X-axis represents the RMSD between the docked side-chain conformation and the ligand-bound conformation. Y-axis entries are the cases of unsatisfied side chains; their names have been omitted for clarity. For each side chain, the sampled rotamers are represented as filled circles. Symbol ‘b’ represents side-chain conformation after the induced-fit docking and before the rotamer sampling, ‘3’ indicates the rotamer was rejected due to unresolvable steric overlaps, while ‘*’ means that steric overlaps, if any, were resolved. The color of the filled circle represents the change in score - from brown towards blue indicates the rotameric conformation enhanced the score compared to starting or the base conformation ‘b’, while brown towards red depicts a decrease in score. 96 score for each overlap-free rotamer. In such cases, the likelihood of a rotamer being chosen depends on the backbone probability used in the score—based mean- field maximization. A rotamer further from the ligand-bound conformation may be chosen due to having a high probability. 5.5 Conclusion One of the most important observation involves “B” cases. Representing the side-chain conformation after preliminary induced-fit docking, these conformations are already very close, within 1.0 A, of the ligand—bound conformations in most of the cases. This is observed to be true for almost all of the unsatisfied polar side chains. This implies that while rotamers help sample a larger 3D space, instead a local 3D sampling of rotations should first be explored to satisfy unsatisfied polar groups, as this approach is more likely to stay close to the true ligand-bound conformation. This also conveys that nature too may resist big rotameric shifts in conformation, since that involves overcoming high strain energy barriers due for the rotamer tran- sitions, and substantial repacking of the interface. Hence though rotamer sampling helped improve scores and hydrogen-bond interactions, across a substantial number of dockings, it was observed that side-chain conformation prediction actually wors- ened due to ambiguity in choice of the correct rotamer given that several rotamers could result in the same score improvement. Furthermore, even the most positionally correct rotamer was often further from the ligand-bound position than was the initial, collision-resolved side chain. The metrics for choosing a rotamer were to maximize the interactions as measured by the new scoring function, and to choose rotamers that are observed the most frequently in nature. There can be no a priori knowledge within rotamer sampling or optimization to recognize which side-chain conformations are close to ligand-bound conformations. Recognizing that the target side—chain con- 97 formations are typically within 1 A RMSD of the initial position, finer and more local conformational sampling may be all that is required. 98 Chapter 6 Summary and Future Directions Even though enhancing protein side-chain interactions remains the goal of this work, the ultimate goal of docking tools is to model motions to predict crystal complex configuration using apo protein conformation and unbound ligands. Towards these ends, our key contributions have been: 0 Discovering that the intra—protein hydrogen bond network in the binding site largely remains conserved upon ligand binding. Most of the rearrangements in the hydrogen bond network increase the number of hydrogen bonds or serve as a ligand-recognition mechanism by replacing protein-water hydrogen bonds with protein-ligand hydrogen bonds. 0 Observing, and corroborating, that SLIDE’s paradigm of minimal rotations for induced-fit docking[50] performs well for preserving most of the hydrogen bond network. However, favoring minimal rotations misses 15% of the cases where larger motions are needed to improve docking chemistry, as found from comparing ligand-free and ligand-bound crystal structures. 0 Analyzing ligand-free and ligand-bound crystal structures for the reasons be- hind large—angle rotations in side chains upon ligand binding. For half of the 99 cases, large rotations aided in satisfying buried polar groups which would have remained unpaired in the protein-ligand interface if they had remained in their ligand-free conformations. The current criteria for selecting side chains for conformational optimization - steric complementarity and satisfying unpaired hydrogen-bond donors and acceptors - could be easily expanded to include other cases of suboptimal chemical complementarity like side chains with repulsive contacts or side chains with solvent exposed hydrophobic groups. 0 Implementing in SLIDE an infrastructure for modeling these larger motions through the use of rotamer libraries. 0 Determining that rotamer-based sampling typically moves side chains too much, and it can be difficult to determine the correct rotamer when several rotamer choices improve the protein-ligand complementarity score equivalently. In gen- eral, the induced-fit docked conformation proved to be closer to the ligand- bound conformation than were any of the choices in the rotamer library. Thus, local rotational sampling (going beyond minimal rotations for removing van der Waals collisions) is likely to be a better approach for repositioning interfacial side chains to satisfy their hydrogen-bonding potential. 6.1 Future Directions Future work can build upon the insights and algorithmic and geometric frame- work that has resulted from this work. Specifically: 0 Local, directed rotations. Akin to directed rotations in induced-fit to resolve collisions, local rotations can be used to explore potential hydrogen bonds and salt bridges with nearby protein and ligand atoms. 100 e Ligand flexibility. While optimizing side-chain placement has been the focus in this work, the ligand’s rotatable bonds present equally probable opportuni- ties to enhance binding-site chemical complementarity. Exploring motions in the ligand in balance with those in the protein would also help model ligand flexibility in the docking process. 0 Unbiased protein template. While we conducted experiments using an active- site template derived from ligands known to bind the protein in nature, SLIDE’s typical use is to find new potential ligands. Hence unbiased templates (which include no knowledge of known ligands’ interactions) should also be used to validate performance improvements from such local sampling. 0 Residue selection criteria. As mentioned before, about half of the large side- chain motions in crystal structures were not detected by the current criteria to select residues for flexible modeling. These motions can be investigated further to learn the reasons for their motions and define criteria to identify such residues. 0 Handling water molecules. Experiments in this work did not consider water molecules during the docking process. Including interfacial conserved waters should help in correctly identifying if side chains have unsatisfied polar groups or not. This should improve results by avoiding search for better geometries of polar side chains already satisfied by water, as well as recognizing water-exposed hydrophobic side chains in the protein-ligand interface. 0 Scoring function discrimination between alternative conformations. The current scoring function was trained on crystal complexes with known binding affini- ties. These complexes are unlikely to offer many cases with repulsive contacts or unsatisfied polar atoms left due to rotations or water displacements, even though such scenarios are bound to occur frequently during docking. A scoring function, trained on data that includes favorable as well as unfavorable con- 101 tacts may better differentiate good conformations from bad ones in the same neighborhood. With these thoughts on future opportunities, we would like to conclude this thesis. This work is still in progress, and we are excited to pursue the above directions and continue working towards improving state of the art in protein-ligand docking. 102 BIBLIOGRAPHY [1] Adrian A. Canutescu, Andrew A. Shelenkov, and Roland L. Dunbrack Jr. A graph-theory algorithm for rapid protein side-chain prediction. Protein Science, 12:2001—2014, 2003. [2] Heather A. Carlson, Kevin M. Musukawa, Kathleen Rubins, Fredric D. Bush- man, William L. Jorgensen, Robert D. Lins, James M. Briggs, and J. Andrew McCammon. Developing a Dynamic Pharamacophore Model for HIV -1 Integrase. Journal of Medicinal Chemistry, 43:2100—2114, 2000. [3] H. Claussen, C. Buning, M. Rarey, and T. Lengauer. FlexE: Efficient molecular docking considering protein structure variations. Journal of Molecular Biology, 308(2):377—395, 2001. [4] Bassil I. Dahiyat, D. Benjamin Gordon, and Stephen L. Mayo. Automated design of the surface positions of protein helices. Protein Science, 6:1333—1337, 1997. [5] R. L. DesJarlais, R. P. Sheridan, G. L. Seibel, J. S. Dixon, 1. D. Kuntz, and R. Venkataraghavan. Using shape complementarity as an initial screen in de- signing ligands for a receptor binding site of known three-dimensional structure. Journal of Medicinal Chemistry, 31(4):722—729, 1988. [6] Johan Desmet, Marc De Maeyer, and Ignace Lasters. The Dead-End Elimination Theorem: A New Approach to the Side-Chain Packing Problem. The Protein Folding Problem and Tertiary Structure Prediction, Birkhauser, pages 307—338, 1994. [7] I. D.Kuntz, J. M. Blaney, S. J. Oatley, R. Langridge, and T. E. Ferrin. A geo- metric approach to macromolecule—ligand interactions. Journal of Computational Chemistry, 19:269—288, 1982. [8] Matthew D. Eldridge, Christopher W. Murray, Timothy R. Anton, Gaia V. Paolini, and Roger P. Mee. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. Journal of Computer-Aided Molecular Design, 11:425—445, 1997. [9] T. Ewing and I. D. Kuntz. Critical evaluation of search algortihms for automated molecular docking and database screening. Journal of Computational Chemistry, 18:1175—1189, 1997. [10] D. Fischer, S. L. Lin, H. L. Wolfson, and R. Nussinov. A geometry based suite of molecular docking processes. Journal of Molecular Biology, 248:459—477, 1995. 103 [11] D. K. Gehlhaar, G. M. Verkhivker, P. A. chto, C. J. Sherman, D. B. Fogel, L. J. Fogel, and S. T. Freer. Molecular recognition of the inhibitor AG—1343 by HIV-1 protease: Conformationally flexible docking by evolutionary prograrn- ming. Chem. Biol, 2(5):317—324, 1995b. [12] Holger Gohlke, Manfred Hendlich, and Gerhard Klebe. Knowledge-based Scoring anction to Predict Protein-Ligand Interactions. Journal of Molecular Biology, 295:337—356, 2000. [13] P. J. Goodford. A computational procedure for determining energetically favor- able binding sites on biologically important macromolecules. Journal of Medici- nal Chemistry, 13(7):849—857, 1985. [14] Jeep Heringa and Patrick Argos. Strain in Protein Structures as Viewed Through Nonrotameric Side Chains: 1. Their Position and Interaction. Proteins: Struc- ture, Function, and Genetics, 37:30—43, 1999. [15] J aap Heringa and Patrick Argos. Strain in Protein Structures as Viewed Through Nonrotameric Side Chains: II. Upon Ligand Binding. Proteins: Structure, Func- tion, and Genetics, 37:44—55, 1999. [16] Brandon M. Hespenheide, A. J. Rader, M. F. Thorpe, and Leslie A. Kuhn. Identifying protein folding cores from the evolution of flexible regions during unfolding. Journal of Molecular Graphics and Modelling, 21:195—207, 2002. [17] Thomas Huber, Andrew E. Torda, and Wilfred F. van Gunsteren. Optimization Methods for Conformational Sampling Using a Boltzmann-Weighted Mean Field Approach. Biopolymers, 39:103—114, 1996. [18] OpenEye Scientific Software Inc. Quality Atomic Charges, Proton Assignment and Canonicalization . http://www.eyesopen.com/docs/html/quacpac, 2004. [19] Richard M. Jackson, Henry A. Gabb, and Michael J. E. Sternberg. Rapid Refine ment of Protein Interfaces Incorporating Solvation: Application to the Docking Problem. Journal of Molecular Biology, 276:265—285, 1998. [20] D. J. Jacobs, A. J. Rader, L. A. Kuhn, and M. F. Thorpe. Protein flexibility predictions using graph theory. Proteins: Structure, Function and Genetics, 44:150—165, 2001. [21] G. Jones, P. Willett, and R. C. Glen. Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation. Journal of Molecular Biology, 245:43—53, 1995. [22] G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development and validation of a genetic algorithm for flexible docking. Journal of Molecular Biology, 3:727—748, 1997. 104 [23] D. E. Koshland Jr. Application of a Theory of Enzyme Specificity to Protein Synthesis. Proceedings of National Academy of Science, USA, 44:98-123, 1958. [24] Roland L. Dunbrack Jr. Rotamer libraries in the 21" century. Current Opinion in Structural Biology, 12:431—440, 2002. [25] D. Keller, M. Shibata, E. Markus, R. Ornstein, and R. Rein. Finding the global minimum: A fuzzy end elimination implementation. Protein Engineering, 8:893— 904, 1995. [26] P. Koch] and M. Delarue. Mean-field minimization methods for biological macro- molecules. Current Opinion in Structural Biology, 6:222—226, 1996. [27] L. A. Kuhn, C. A. Swanson, M. E. Pique, J. A. Tainer, , and E. D. Getzofl'. Atomic and Residue Hydrophilicity in the Context of Folded Protein Structures. Proteins: Structure, Function, and Genetics, 23:536—547, 1995. [28] I. Lasters and J. Desmet. The fuzzy-end elimination theorem: correctly imple- menting the side chain placement algorithm based on the dead-end elimination theorem. Protein Engineering, 6:717—722, 1993. [29] Ming Lei, Maria I. Zavodszky, Leslie A. Kuhn, and Michael F. Thorpe. Sampling Protein Conformations and Pathways. Journal of Computational Chemistry, 25(9):]133—48, 2004. [30] D. M. Lorber and B. K. Shoichet. Flexible ligand docking using conformational ensembles. Protein Science, 7:938—950, 1998. [31] I. K. McDonald and J. M. Thornton. Satisfying hydrogen bonding protein in proteins. J. Mol. Biol, 238:777—793, 1994. [32] E. C. Meng, B. K. Shoichet, and I. D. Kuntz. A geometric approach to macromolecule—ligand interactions. Journal of Computational Chemistry, 13(6):505-524, 1992. [33] G. M. Morris, D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew, and A. J. Olson. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. Journal of Computational Chemistry, 19:1639—1662, 1998. [34] Rafael Najmanovich, Joseph Kuttner, Vladimir Sobolev, and Marvin Edelman. Side-Chain Flexibility in Proteins Upon Ligand Binding. Proteins: Structure, Function, and Genetics, 39:261-268, 2000. [35] R. Norel, D. Fischer, H. Wolfson H, and R. N ussinov. Molecular surface recogni- tion by a computer vision based technique. Protein Engineering, 7:39—46, 1994. [36] Protein Structural Analysis and Design Lab, MSU. Quick Guide to SLIDE Version 2.30a. http://www.bch.msu.edu/kuhn/projects/slide/user_manual.html, 2005. 105 [37] Olivier Roche, Ryuichi Kiyama, and III Charles L. Brooks. Ligand-Protein DataBase: Linking Protein-Ligand Complex Structures to Binding Data. Jour- nal of Medicinal Chemistry, 44:3592—3598, 2001. [38] Adrian Roitberg and Ron Elber. Modeling side chains in peptides and proteins: Application of the locally enhanced sampling and the simulated annealing meth- ods to find the minimum energy conformations. Journal of Chemical Physics, 95:9277—9287, 1991. [39] Paul C. Sanschagrin. Computational Techniques for Modeling Protein-Ligand Interactions and their Application to Serine Proteases and Asparaginyl-tRN A Synthetase. Ph. D. dissertation, Michigan State University, 2001. [40] Lana Schaffer and Gennady M. Verkhivker. Predicting Structural Effects on HIV-1 Protease Mutant Complexes With Flexible Ligand Docking and Protein Side-Chain Optimization. Proteins: Structure, Function, and Genetics, 33:295- 310, 1998. [41] Volker Schnecke and Leslie A. Kuhn. Flexibly Screening for Molecules Interacting with Proteins. Rigidity in Theory and Applications, Plenum Publishing, New York, 385-400, 1999. [42] Volker Schnecke and Leslie A. Kuhn. Virtual screening with salvation and ligand- induced complementarity. Perspectives in Drug Discovery and Design, 20:171- 190, 2000. [43] Volker Schnecke, Craig A. Swanson, Elizabeth D. Getzoff, John A. Tainer, and Leslie A. Kuhn. Screening a Peptidyl Database for Potential Ligands to Proteins With Side-Chain Flexibility. Proteins: Structure, Function, and Genetics, 33:74- 87, 1998. [44] B. K. Shoichet, D. L. Bodian, and I. D. Kuntz. Molecular docking using shape descriptors. Journal of Computational Chemistry, 13:380—397, 1992. [45] D. Stickle, L. Presta, K. Dill, and G. Rose. Hydrogen bonding in globular proteins. J. Mol. Biol, 226:1143—1159, 1992. [46] P. Tuffrey, C. Etchebest, S. Hazout, and R. Lavery. A new approach to the rapid determination of protein side-chain conformations. Journal of Biomolecular Structural Dynamics, 8(6):1267—89, 1991. [47] G. Vriend. WHAT IF: A molecular modeling and drug design program. Journal of Mololecular Graphics, 8:52-56, 1990. [48] Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. The PDBbind Database: Collection of Binding Affinities for Protein-Ligand Complexes with Known Three-Dimensional Structures. Journal of Medicinal Chemistry, 47:2977— 2980, 2004. 106 [49] H. J. Wolfson and Y. Lamdan. Geometric hashing: A general and efficient model- based recognition scheme. IEEE Internation Conference on Computer Vision, Tampa, Fl, pages 238-249, 1988. [50] Maria I. Zavodszky and Leslie A. Kuhn. Side-Chain Flexibility in Protein—Ligand Binding: The Minimal Rotation Hypothesis. Protein Science, 1421104 — 1114, 2005. [51] Maria I. Zavodszky, Paul C. Sanschagrin, Rajesh S. Korde, and Leslie A. Kuhn. Distilling the Essential Features of a Protein Surface for Improving Protein- Ligand Docking, Scoring, and Virtual Screening. Journal of Computer-Aided Molecular Design, 16:883—902, 2002. 107 [[[r[[[[[[[[[[[[[[ii