. u...
”mm
:m
.11:
I
.x: {.1
:5:
.95
V .
69$
, T :J
u s. I 1 .
urxdxa J
a
.. a
u» a.
.fi
...
Ltd
a. “an“. 3. 1
.6. {9.3.07
3%”) 3.19:.
ti!!!-
3 .
Jpn "gasp-qty";
Jame-eta.
I c
4..."... V .
. .. , . H V $23 m . ‘ _ .. . ‘
‘ . . . . . ‘ . . . . ,aéufim ggfifig .fiafiégaw? is: . flawlhwau.‘ .u
7%! 7355’
LIBRARY
Michigan State
University
This is to certify that the
thesis entitled
OPTIMIZING SIDE-CHAIN INTERACTIONS IN
PROTEIN-LIGAND INTERFACES
presented by
SAMEER ARORA
has been accepted towards fulfillment
of the requirements for the
MS. degree in Computer Science
5/44,
Major Professor’s Signature
2/ f/ of
Date
MSU is an Aflirmative Action/Equal Opportunity Institution
PlACE lN RETURN BOX to remove this checkout from your record.
To AVOID FINES return on or before date due.
MAY BE RECALLED with earlier due date if requested.
DATE DUE
DATE DUE
DATE DUE
2/05 chlRC/DatoDqundd—pts
OPTIMIZING SIDE-CHAIN INTERACTIONS
IN PROTEIN-LIGAND INTERFACES
By
Sameer Arara
A THESIS
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
MASTER OF SCIENCE
Department of Computer Science
2005
ABSTRACT
OPTIMIZING SIDE-CHAIN INTERACTIONS IN PROTEIN-LIGAND
INTERFACES
By
Sameer Arora
Proteins bind to other proteins or small molecules to perform essential cellular
functions. Protein side-chain flexibility is crucial for binding and molecular recog-
nition. Hence modeling side-chain flexibility in protein-ligand docking algorithms
to predict the optimal inter- and intra-molecular interactions is extremely desirable.
However, modeling side-chain flexibility in docking and screening is computationally
expensive due to the numerous side chains and their many degrees of freedom.
Our research indicates that direct, intra-protein hydrogen bonds and hydrophobic
interactions are preserved to a significant extent upon ligand binding. This provides
guidance to restrict the number of side-chain candidates for conformational sam-
pling in docking. While these bond-preservation tendencies limit side-chain move-
ments, large side-chain motions are also observed in the protein-ligand interface.
Subsequently, the extent of these large side-chain motions from ligand-free to ligand-
bound crystal-structure conformations are characterized, as is the suitability of using
backbone-dependent rotamers for sampling these larger motions.
The ability to accurately identify which side chains move significantly upon ligand
binding as well as their optimal conformations is crucial for docking and screening. A
new scoring function, having good linear correlation with experimentally determined
protein-ligand binding affinities, is presented for scoring dockings and side-chain in-
teractions by SLIDE. Using the new scoring function as a cost measure, a mean-field
based algorithm, exploiting rotamer-based side-chain flexibility modeling, is proposed
and tested for optimizing interactions in protein-ligand interfaces.
Copyright by
Sameer Arora
2005
To my Mummyji and Papajz'. It is your endless love, sacrifices, committment and
support, that I am here, forever indebted.
iv
ACKNOWLEDGEMENTS
First, I would like to express my deep gratitude for my advisor, Dr. Leslie A.
Kuhn, for all hers support, guidance and encouragement throughout my graduate
education. Her boundless energy and infectious enthusiasm for science helped me
embark and persist on this research work. I am grateful for her mentorship, and
feel lucky to have been associated with such a person, whose optimism, patience and
dynamism continue to set high standards for me.
I want to thank my co-advisor, Dr. Bill Punch. His advice and guidance strength-
ened my decision and faith in pursuing this interdisciplinary research. And his support
and patience helped me continue forward with a calm and positive mind.
I extend my thanks to Dr. Phil Duxbury. My discussions with him always left
me with valuable insight and a physicist’s perspective on our research problem.
I would also like to thank the past and present members of the Kuhn lab, Litian
He, Chetan Sukuru, Rajesh Korde, Dr. Brandon Hespenheide, and Sandeep Nami-
likonda for creating an enjoyable and stimulating research environment. I am espe-
cially thankful to Dr. Maria Zévodszky, who helped me grow, besides teaching me
many things over the years, including Biochemistry and SLIDE. She was always ready
to help, discuss ideas and answer countless (quick) questions.
Finally, I would like to thank my friends (especially Kantha Kumar) and family,
without whose support and encouragement, it would be hard to imagine myself here.
Table of Contents
List of Tables viii
List of Algorithms viii
List of Figures viii
1 Flexibility in Protein-Ligand Docking 1
1.1 Introduction: Protein-Ligand Docking .................. 1
1.2 Overview of Protein-Structure Prediction Methods ........... 2
1.3 Overview of Current Side-Chain Modeling Approaches in Docking Tools 7
1.4 Side-Chain Modeling in SLIDE ..................... 10
1.5 Motivation and Focus in this Thesis Work ............... 11
1.6 Organization of the Thesis ........................ 13
2 Hydrogen Bond Preservation in Protein Binding Sites 15
2.1 Motivation ................................. 15
2.2 Methods .................................. 16
2.3 Results and Discussion .......................... 22
2.4 Conclusions ................................ 27
3 Sampling Side-Chain Positions in Protein-Ligand Interfaces in
SLIDE 29
3.1 Side-Chain Displacements upon Ligand Binding ............ 30
3.2 Current Rotation Paradigm in SLIDE .................. 32
3.2.1 Motions Modeling in SLIDE ................... 39
3.3 Employing Hydrogen-Bond Preservation Bias in Mean-Field Optimiza-
tion .................................... 43
3.4 Sampling Large Side-Chain Motions ................... 45
3.4.1 Rotamer Libraries ......................... 49
3.5 Conclusions ................................ 56
4 Scoring SLIDE Dockings 58
4.1 Introduction ................................ 58
4.2 SLIDE Scoring Function ......................... 60
4.3 Methods .................................. 61
4.3.1 Calculation of Terms for Scoring Function ........... 62
4.3.2 Preparation of Test Set ...................... 66
4.3.3 Training and Testing ....................... 67
4.4 Results ................................... 69
4.5 Conclusions ................................ 73
vi
5 Guiding Sampling by Score 74
5.1 Sampling Choices for Maximizing Score ................. 75
5.2 Methods .................................. 81
5.3 Results ................................... 82
5.4 Analysis .................................. 94
5.5 Conclusion ................................. 97
6 Summary and Future Directions 99
6.1 Future Directions ............................. 100
Bibliography 103
vii
2.1
2.2
3.1
4.2
4.1
4.3
4.4
List of Tables
List of protein-structure pairs studied for hydrogen bond preservation
upon ligand binding. ...........................
Rules for categorization of atom types .................
Visual analysis of reasons behind large dihedral rotations in interfacial
side chains .................................
Scoring function variants evaluated to predict binding affinities . . . .
Potential terms for new scoring function for SLIDE ..........
PDB codes of crystal complexes used for training and testing the scor-
ing functions ................................
Predicted and experimental binding-affinity linear correlation coeffi-
cients and average weights derived from linear multiple regression
viii
50
61
63
68
70
List of Algorithms
1 Pseudo-code for rotamer sampling for unsatisfied side chains and mean-
field optimization for maximizing score ................. 79
ix
1.1
1.2
2.1
2.2
2.3
2.4
2.5
2.6
3.1
3.2
3.3
3.4
3.5
3.6
3.7
3.8
3.9
List of Figures
Example of a protein-ligand crystal complex .............. 3
Screening and docking algorithm implemented in SLIDE ....... 12
Geometric parameters used to identify hydrogen bonds and measure
their energy ................................ 18
Percentages and displacements, by atom category, of preserved direct
intra-protein hydrogen bonds in protein-ligand interfaces ....... 23
Percentages, by residue type, of preserved direct intra-protein hydrogen
bonds in protein-ligand interfaces .................... 24
Percentages, by residue type, of preserved interfacial intra-protein hy-
drogen bonds mediated by one water molecule ............. 25
Percentages, by residue type, of preserved interfacial intra-protein hy-
drogen bonds mediated by two water molecules ............ 26
Percentages, by atom category, of preserved interfacial intrarprotein
hydrophobic interactions ......................... 28
Frequency distribution of RMSD ligand-bound side-chain conforma-
tions from ligand-free conformations ................... 31
Change in X1 and x2 dihedral angles of preserved interfacial all side
chains within 4.0 A of the ligand, across 30 structures listed in Table 2.1. 33
Change in x3 and x4 dihedral angles of preserved interfacial all side
chains within 4.0 A of the ligand, across 30 structures listed in Table 2.1. 34
Rotation angle choices for resolving an inter-atomic steric clash . . . 36
The algorithm for resolving side-chain collisions using the mean-field
Optimization technique. When there are still collisions exceeding the
threshold after 10 iterations of the outer loop, this ligand orientation
is discarded ................................. 38
Frequency distributions of RMSD between ligand-free and ligand-
bound side-chain conformation, as well as RMSD between ligand-free
conformation and conformation in the best docking generated by SLIDE. 40
Comparison of side-chain RMSDs between ligand-free to ligand-bound
conformation, and between ligand-free conformation and conformation
in the top docking. ............................ 42
Comparison of the number of intra-protein hydrogen bonds before and
after ligand binding in crystal complexes, in best dockings by SLIDE
and the hydrogen-bond-preservationist version SLIDE. ........ 46
Comparison of the number of intra-protein hydrogen-bonds lost upon
ligand binding in nature (between ligand-free and ligand-bound struc-
tures), in best dockings by SLIDE, and in the best dockings by
hydrogen-bond-preservationist version SLIDE. ............. 47
3.10 Comparison of the number of intra-protein hydrogen bonds gained
upon ligand binding in crystal complexes, in the best dockings by
SLIDE and by the hydrogen-bond—preservationist version of SLIDE. .
3.11 Illustration of bond-rotation angles associated with single bonds in a
side chain. .................................
3.12 Number of rotamers approximating dihedral angles of target confor-
mations for 25 interfacial side chains undergoing large rotation upon
ligand binding. ..............................
3.13 Cartesian space search for rotamers close to ligand-bound conforma-
tions of selected side chains using Dunbrack rotamer library .....
4.1 Correlation between experimentally known binding affinity and score
values determined by SLIDE scoring function 61 ............
4.2 Correlation between experimentally known binding affinity and score
values determined by old SLIDE scoring function ...........
4.3 Correlation between affinity values known experimentally and those
determined by DrugScore .........................
5.1 Flowchart of algorithm used to guide selection of a rotamer from a pool
of rotamers ................................
5.2 Distribution of change in score across all dockings with triangle
matches in common, generated by both old and new SLIDE versions .
5.3 Comparison of scores of best dockings by old and new SLIDE ver-
sions for each protein, as well as scores evaluated for the corresponding
crystal complexes .............................
5.4 Comparison of the number of interfacial unsatisfied polar groups in the
best dockings generated by old and new SLIDE versions .......
5.5 Comparison of displacement and direction of side-chain motions per-
formed for resolving protein-ligand collisions ..............
5.6 Side-chain motions performed in new SLIDE to optimize interactions
in the same 24 complexes .........................
5.7 Large natural side-chain motions undetected by old or new SLIDE
versions ..................................
5.8 Selection or rejection reasons for each of the sampled rotamers for
unsatisfied side chains in the best dockings ...............
Images in this thesis/dissertation are presented in color.
xi
48
52
54
57
69
71
72
80
84
86
87
89
90
96
Chapter 1
Flexibility in Protein-Ligand
Docking
1.1 Introduction: Protein-Ligand Docking
Like people interact and cooperate with one another to perform many functions
for the sustenance and growth of life at the social level, proteins too interact and
cooperate with each other as well as with other molecules for the sustenance and
growth of life at the cellular level. Proteins perform various vital functions in a
cell - they provide cellular structure, bind and transport other proteins or organic
compounds, and catalyze or inhibit reactions. Underlying all stable bindings between
a protein and another molecule lies the mechanism for the two molecules to recognize
each other and achieve a bound state more stable than their individual unbound
states. Understanding the mechanism of protein binding is key understanding their
function as well providing valuable insights for discovering novel compounds that can
bind to specific proteins for designing therapeutic drugs.
Molecular docking is a term used to describe computational techniques that at-
tempt to find the “best” mode binding between two molecules. Protein-ligand docking
aims at finding the optimal binding between a protein and small molecule to a spe-
cific site of the protein. In protein-ligand docking, the atomic structures of the two
molecules are given as input in the most general form, no additional data is provided.
However, in practice, additional biochemical information can be given, specifically,
information about the location of the binding site.
Docking methods can be categorized in multiple ways. “Rcdocking” attempts to
reconstruct a complex using bound structures of the receptor and the ligand. More
challenging is the “unbound” docking, which attempts to reconstruct a complex us-
ing unbound structures of the receptor and the ligand. In this case, some degree of
conformational change in the protein and the ligand must be modeled or accommo-
dated. Another categorization may be based on flexibility: “rigid” docking keep the
structures rigid during the docking, while flexible docking techniques allow flexibility
in the receptor or ligand or both.
Typically, there are three key ingredients in docking algorithms:
1. representation of the molecular system
2. conformational and orientational space search (“sampling”)
3. ranking of potential solutions (“scoring”)
The present work focuses on modeling protein side-chain motion in docking, and
the subsequent sections of this Introduction cover how flexibility has been modeled
by others in a variety of protein modeling methods.
1.2 Overview of Protein-Structure Prediction
Methods
Over several years, various techniques have been developed to predict protein
structures from their amino acid sequences. These prediction techniques attempt to
Figure 1.1: This image is presented in color. Example of a protein-ligand crys-
tal complex. Crystal structure of a-momorcharin complexed with formycin 5'-
monophosphate. Green ribbon represents the protein backbone while the ligand is
displayed with Connolly solvent-accessible molecular surface, colored by atom type.
Only binding-site side chains are displayed (in tubes) with interfacial side-chains col-
ored cyan and side chains beyond the interface colored by atom type.
define the protein structure in the native state from the known sequence of amino
acids, and in some cases aim to capture dynamics of processes like protein folding
and protein-ligand or protein-protein docking. While each of these methods is used
for structure prediction, they have been used in context of dockings as well.
Molecular Dynamics (MD) Simulation is one such technique that serves as an im-
portant tool in protein structural dynamics and refinement. It uses torsional degrees
of freedom in proteins and ligands and key physical properties at the atomic level
to solve Newton’s equations of motion. Forces on individual atoms are represented
as sums of potential terms (e.g. electrostatic and van der Waals) during the entire
simulation. Each simulation run is divided into time steps, which are typically scaled
small enough (1-2 femtoseconds) to ensure that the physical interactions are modeled
accurately. While smaller time steps, combined with an accurate force field, lead to a
better quality of simulation results, the large number of steps required makes the sim-
ulation extremely slow for docking and screening purposes, as each time step involves
thousands of degrees of freedom in the molecule(s), requires evaluation of computa-
tionally expensive potential energy terms, and must also account for the surrounding
solvent. The computational intensity also limits the amount of conformational space
sampled for both protein and ligands. To optimize side-chain placement in docking,
sometimes less computationally demanding energy minimization is used, but this in-
volves very local motions. Carlson and colleagues[2] have used MD to determine
relatively rigid regions from multiple crystal structures and focus on these immobile
regions as templates for drug design.
Monte Carlo (MC) sampling is a stochastic sampling method applied to a num-
ber of diverse domains, including economics, physics, traffic regulation and others. In
molecular simulations, an MC simulation is often guided by simulated annealing. In
simulated annealing, the “temperature” of the system is raised and then system is al-
lowed to cool down gradually. Here, temperature implies any parameter of the system
that controls the magnitude of sampled motions. Like high an input of energy helps
electrons jump to higher energy orbital in atoms, high “temperature” in MC simu-
lations helps torsional rotations overcome high energetic barriers. Once the systems
has been allowed to sample a variety of states, temperature is dropped in a step-wise
manner to allow the system to relax into a low energy state. This relaxation may be
guided by a force field or could result from a random move. New conformations are
accepted if the energy drops. the simulation stops when the temperature falls below
a threshold. Since there is some randomness to the generation of the final molecular
complex, and since there are usually many protein-ligand conformations with similar
energy, an ensemble of many structures is usually calculated to observe which binding
modes appear most often. This means that this computationally intensive program
must be repeated many times.
The Dead-End Elimination (DEE) approach relies on a potential energy function
to evaluate discrete side—chain conformations. A DEE run searches for a combination
of side-chain conformations that result in global energy minimum for the protein or
protein-ligand complex[6]. Typically, a set of side-chains is identified for conforma-
tional sampling. The rest of the atoms provide a representation of the immediate
environment and additional constraints for the sampling. The DEE implementations
typically use rotamer libraries to generate side-chain conformations. The relatively
high efficiency of this relies on a mathematical expression, the DEE criterion, which
rules out infeasible rotamers that cannot be part of the global minimum conforma-
tions. One example is that if a certain x angle value of a particular side chain is
observed to cause unresolvable collisions with neighboring residues, then this x value
is ruled out in all future moves. This criterion helps reduce the number of rotamers
to be considered for side chains in the modeling set, resulting in a number that is
sufficiently small to analyze by means of “brute force” combinatorial analysis. The
DEE approach is useful for homology modeling, where backbone structure is usually
known from a related protein. For docking and screening purposes, DEE scales well
as long as the number of side-chains sampled is small. Schaffer et a1 [40] have success-
fully applied DEE for side-chain Optimization after docking HIV-1 protease mutant
complexes using Monte Carlo/simulated annealing sampling.
The latest version of Dunbrack’s SCRWL algorithm[1] for side-chain modeling
uses results from graph theory to solve the combinatorial problem encountered in
side-chain prediction. In this method, side chains are represented as vertices in an
undirected graph. Any two residues that have rotamers with nonzero interaction
energies are considered to have an edge in the graph. The resulting graph can be
partitioned into connected subgraphs with no edges between them. These subgraphs
can in turn be broken into biconnected components, which are graphs that cannot be
disconnected by removal of a single vertex. The combinatorial problem is reduced to
finding the minimum energy of these small biconnected components representing sets
of mutually acceptable rotamers for different side chains and combining the results
to identify the global minimum energy conformation. While SCRWL’s main usage
is in homology modeling, ab initio protein structure prediction, and protein design
applications, it can be used for side-chain modeling in the presence of ligands too.
While such an approach may help develop induced fit, the ligand largely remains rigid
in the process.
In mean-field optimization methods[38, 26], the part of the protein that requires
modeling, like the side chains, is oversampled by replacing single copies by multiple
different copies. The copies of each side chain do not interact with each other but “see”
the other side chains as an average. The multiple copies essentially correspond to a
distribution function of one side chain, represented by a number of discrete structures
with assigned, often equal, weights. Mean—field optimization has been used both for
side-chain modeling in homology modeling [38] as well as modeling flexible side chains
during docking in SLIDE, a ligand screening and docking software developed in our
laboratory [42, 41]. Jackson et al[19] also attempt side-chain optimization in protein-
protein interfaces through iterative cycles of mean-field optimization and rigid-body
energy minimization. Rotamers from the rotamer library of Thffrey et al [46] are used
for discrete sampling of side-chain conformational space while the main-chain atoms
have fixed coordinates. The self-consistent mean field approach is used to determine
the most probable set of rotamers from an ensemble of rotamers. In this closed
system, the potential mean force on each rotamer is based on the internal energy of
the rotamer itself, rotamer-backbone interactions, interactions between the rotamer
and rotamers of other residues, and the rotamer’s interactions with the surrounding
solvent. Each rotamer’s interactions are calculated prior to the Optimization, thus
saving time during the mean-field optimization step.
Raplacing a system of side chains by a multiple copies per side chain forming
a mean-field system has two major advantages. First, while mean-field optimization
complexity is lower than the exponential complexity of searching the entire conforma-
tional space, the global energy minimum of the new mean-field system is the same[38].
Secondly, the barriers separating the minima in the mean-field calculation are lower
than in the original system, making annealing easier. However, the result of opti-
mization does depend on initial conditions, like the probability distributions for the
multiple copies of the side chains.
1.3 Overview of Current Side-Chain Modeling Ap-
proaches in Docking Tools
Accounting for side-chain reorientation during docking is similar to predicting
side-chain conformations in homology modeling. The search for candidate solutions in
a docking problem is addressed by two essentially different approaches: (1) a gradual
guided search through solution space, or (2) a full solution space search. The first
either scans only part of the solution space in a partially random and partially criteria:
guided manner. This approach consists mainly of Monte Carlo (MC), simulated
annealing, molecular dynamics (MD), and evolutionary algorithms such as genetic
algorithms (GA). In contrast, the second scans the entire solution space in a predefined
systematic manner such as using rotamer-library or geometric-hashing based searches
of conformations.
The classical algorithm implemented for computational docking is that Of DOCK,
the first ligand docking tool [7, 5, 9, 44]. DOCK operates by generating a set of spheres
to describe the volume, or negative image, of the binding site and uses the centers
Of these spheres as sites for matching to ligand atoms. Sets of receptor spheres are
matched to sets of ligand atoms to generate a ligand orientation, which can then
be scored according to their complementarity with the protein. The early internal
DOCK scoring function, GRID [32], is a grid-based scoring function in the method of
Goodford and colleagues [13]. Later implementations have used more robust scoring
functions which are also grid-based. It is possible to use the ligand docking method
of DOCK with an externally supplied scoring function. The initial implementation
of DOCK used only steric fit and electrostatics as a determinant for ligand docking,
but later versions implemented chemical type matching to better model chemical
complementarity between ligand and receptor groups, including hydrogen bonding
interactions. It should be noted that DOCK uses only rigid-body translations and
rotations, including no internal molecular flexibility within the docking algorithm.
Recently, Shoichet and colleagues have used conformational ensembles as input to
DOCKBm.
Another popular docking algorithm is AutoDock[33], which employs a Monte-
Carlo simulated annealing method to sample binding orientations and ligand con-
formations, by randomized rotation of torsional angles in the ligand. AutoDock is
inexpensive, easy to use, achieves reasonable dockings, and has been applied success-
fully to predict ligand binding modes in several cases[33].
Another stochastic approach based on the use of a genetic algorithms (GA) was
developed by Jones and colleagues[21]. Their approach uses a simple GA operating
on rotational angles in the protein, rotational angles in the ligand, and on hydro-
gen bonds between protein and ligand. The fitness or scoring function encompasses
terms for the hydrogen bond energy between protein and ligand, for the van der
Waals energy between protein and ligand, and for the internal van der Waals energy
Of contacts within the ligand. Improvements to this algorithm resulted in the de-
velopment of the GOLD algorithm[22], with changes in the representation of angles
and hydrogen bonds and inclusion of a more robust scoring function. The GOLD
algorithm achieved “acceptable” dockings for 71 of 100 test cases. Similar to GA
approaches is the evolutionary programming approach AGDOCK [11], developed at
Agouron Pharmaceuticals, which is able to correctly dock (within 1.5 A of correct po-
sition), methotrexate into dihydrofolate reductase (DHFR) and a proprietary ligand,
AG-l343, into HIV protease.
To address side-chain and limited main-chain flexibility of the receptor structure
in docking, FlexE [3], was created. FlexE docks ligands into an ensemble of structures
of the receptor instead of a single receptor binding site. The binding site ensembles can
be from different crystallographic structures, as in [3], from a homology model with
uncertain side-chain positions, from a series molecular dynamics time steps, or from
another source. The key component of this algorithm is that multiple conformations
of the protein can be used as a docking target simultaneously. In this algorithm,
the receptor structures are merged, with regions of similar conformations reduced
to a single structure and regions with dissimilar conformation constituting alternate
positions. While this algorithm may handle some backbone movement in addition
to side-chain rotations, the authors claim it is not able to work with large domain
movements and limit their test set to protein ensembles with similar backbone traces.
1.4 Side-Chain Modeling in SLIDE
The docking and screening software SLIDE (Screening for Ligands by Induced-fit
Docking, Efficiently) was developed in our laboratory for screening and docking small
molecules into specified binding sites of target proteins [43, 41, 42]. Figure 1.2 pictori-
ally describes the algorithm used in SLIDE for screening and docking small molecules.
Surfaces of both the protein and the molecule being screened are represented by sets
of strategically located interaction points, each point representing the binding site’s
or the ligand’s polar atom or hydrophobic characteristics. The protein’s interaction
points are called template points, while the potential ligand’s interaction points are
called interaction centers. Each of these points is assigned one of the descriptor types
: hydrogen-bond donor if the protein or ligand atom can donate a proton in an hy-
drogen bond, acceptor if the atom can accept a hydrogen bond, donor/acceptor if
the atom can donate and/or accept a hydrogen bond, and hydrophobic if the atom
or point represents a hydrophobic site in the ligand or protein. Unlike the potential
ligand’s interaction-centers, template points represent the binding-site’s “negative”
image, or ideal ligand atom types to bind at the position. Screening and docking is
based on matching triplets of ligand’s interaction centers onto triangles Of template
points based on geometry as well as descriptor types. Since there can be 0(n!) tem-
plate triangles from n template points, SLIDE uses a multi-level geometric hashing
for matching interaction-point triangles to pre-computed template triangles stored in
the hash table, indexed using the triangle’s descriptor types and geometry. Geomet-
ric hashing was originally applied in Object matching and identification in computer
vision [49]. Combined with an adequate molecular surface representation, geometric
hashing yields a state-of-the-art toolkit for docking[10, 35]. This indexing empowers
SLIDE to exhaustively explore orientational space for the ligand with respect to the
protein speedily and efficiently. The part of the ligand within the triangle of inter-
action centers is treated as rigid, and called an anchor fragment. It serves as the
10
anchor for docking the ligand into the binding site. Any remaining fragments of the
ligand are considered flexible. Single bonds in the flexible parts of both the protein
and ligand candidates are rotated as needed to remove inter-atomic collisions and
to generate a shape-complementary interface, before the complex is scored by the
number of intermolecular hydrogen bonds and the hydrophobic complementarity of
the contact surfaces.
SLIDE was the first method to balance protein and ligand flexibility in dock-
ing while developing induced fit between the protein and ligand surfaces during the
docking step. It has identified and correctly docked diverse, known ligands into the
ligand-free conformation of the binding site for a variety of proteins, e.g., subtilisin,
cyclodextrin glycosyltransferase, uracil DNA glycosylase, rhizopuspepsin, HIV pro-
tease, estrogen receptor, and Asn tRNA synthetase [43, 41, 42].
1.5 Motivation and Focus in this Thesis Work
Various groups have studied protein side-chain flexibility modeling in docking.
Docking techniques using MD or MC simulations can model side-chain flexibility
with good accuracy, however they are slow. Besides, MD is poor at crossing energetic
barriers at reasonable temperatures, hence limited in its sampling conformational
space. Genetic algorithms encoding flexibility tend to be too slow for use in screening
many molecules by docking. Exhaustive search space sampling methods using dis-
crete rotamer libraries are fast and efficient if the number of side chains are limited.
Methods like SCRWL using rotamer libraries have good accuracies for predicting 20.2
x angles, however such predictions have been confined to homology modeling or ab
initio protein design rather than docking. Because rotamer libraries specify rotamers
as average conformations representing clusters of similar side-chain conformations,
a realistic use of libraries requires expanding the rotamers by sampling around the
11
For All Possible Anchor Fragments Defined by All Trlplets of
lntersctlon Centers In Each of the Screened Molecules
Identliy Chemically and
Geornelrlcally Feasible
Superposition of Ligand Triangle
onto Template TriangIeA
————> .
Dock Rigid Anchor
Model Induced Md Ligand Side Fragment based on
Complementarity Chains in Triengle'e
with
Figure 1.2: This image is presented in color. Screening and docking algorithm im-
plemented in SLIDE [39]. SLIDEs docking of potential ligands into the binding site
is based on mapping triplets of ligand interaction centers (hydrogen-bond donors,
acceptors donor/ acceptors, or hydrophobic atom centers) onto triangles of template
points located above the protein surface. Feasible template triangles for each possible
triplet in a screened molecule are directly accessed via a multi-level hash table, and
the corresponding mapping is used to dock the rigid anchor fragment of the potential
ligand. Single bonds in the flexible parts of both molecules are rotated to generate a
shape-complementary interface, before the complex is scored by the number of inter-
molecular hydrogen bonds and hydrophobic complementarity of the contact surfaces.
In all steps, the ligand triplets or dockings that do not meet a particular threshold
are discarded.
12
dihedral angles within specified deviations[24]. DEE-based side-chain conformational
search techniques are efficient but can miss global energy minima. The fuzzy-end
elimination theorem[28, 25] corrects this problem but the search space becomes huge.
Some groups have studied specific side-chain flexibility aspects, such as the num-
ber of interfacial side chains undergoing large conformational changes or whether the
side chains remain rotameric after ligand binding. For instance, [14, 15] claim that
interfacial side chains are not always rotameric as ligand binding could induce non-
rotamericity in the binding-site residues. Najmanovich and colleagues[34] show that
for 85% of the binding pockets they studied, 3 or fewer side chains underwent a dihe-
dral rotation of more than 30°. Knowledge about such proclivity might help restrict
the conformational sampling in an exhaustive search without loosing the ability to
model substantial side-chain conformational changes; and hence is very valuable for
to improve docking algorithms.
This research was motivated to develop techniques that accurately model side
chain positioning during docking to optimize the interactions in the protein-ligand
interfaces. Optimal interactions, which may be defined as interactions observed in
crystal complexes, play significant role in molecular recognition. Ability to predict
Optimal interactions in SLIDE would require developing an algorithm that not only
samples larger side-chain conformational space than sampled by the current induced-
fit collision resolution paradigm, but can also correctly identify those side chains for
which bigger motions need to be sampled.
1.6 Organization of the Thesis
This thesis addresses the questions of which side chains are observed to undergo
large rotations upon ligand binding (and why), how to sample such motions and the
extent to which rotameric sampling and protein-ligand scoring facilitate or limit our
13
ability to model these motions accurately. Chapter 2 presents ana analysis of intra-
protein noncovalent bond preservation upon ligand binding, conclusions from which
help identify the side chains which undergo large motions during docking. Chapter 3
analyzes the rotamer-based sampling for these side chains. Of the many conformations
available for a collection of side chains, a docking tool should be able to identify
conformations that Optimize interactions and rank them highly. Chapter 4 presents
developing a new scoring function to predict the affinity of a protein-ligand complex
with state-of-the-art accuracy and applies it to select from high-probability side-chain
rotamers. Chapter 5 presents the side-chain interaction optimization algorithm in
SLIDE and subsequent results. Chapter 6 concludes the thesis with a brief summary
and a discussion on future directions of this work.
14
Chapter 2
Hydrogen Bond Preservation in
Protein Binding Sites
2.1 Motivation
Predicting the structure of the complex of a small ligand with a protein is still
a complex task, two major problems being the definition of an appropriate scoring
function to discriminate good binding modes of the ligand among all possible binding
modes and the huge size of the binding-mode search space itself. Assuming that one
knows the correct scoring function, a successful search procedure should consider the
three factors that give rise to the size of search space:
0 sampling the relative orientations of ligand and receptor,
0 sampling the low-energy ligand conformations, and
e sampling the low-energy protein conformations
SLIDE can sample different relative positions of ligand. Moreover it also ad-
dresses small-scale receptor and ligand flexibility through resolving steric overlaps
and developing a shape complementarity between the two. However, while enhancing
I5
shape complementarity, it does not yet take into account how chemical complemen-
tarity can also be enhanced during the docking. Enhancing chemical complementarity
in the binding site through exhaustive conformational search for both protein and lig-
and leads to a very large conformational search space. If there are ‘n’ configurations
for each of ‘m’ side chains in the protein-ligand interface, then there are 12'" possible
configurations for the protein alone to be assessed by scoring.
Such expensive spatial sampling is infeasible for high throughput screening algo-
rithms. To perform such affinity enhancing sampling, we first focus on what can be
learned from known changes in a receptor’s non-covalent bond network upon ligand
binding in order to limit the choices to those that are reasonable. This can help
us identify circumstances in which side chains do not undergo much rearrangement.
Hence, the knowledge of the circumstances under which side-chain motions occur,
and their extent of motion, can provide guidance for developing improved docking
prediction algorithms.
2.2 Methods
We selected 30 non-homologous protein crystal structures, both ligand free (apo-
protein) and ligand bound (holo—protein) from the PDB macromolecular structural
database.
16
Table 2.1: List Of protein-structure pairs studied for hydrogen bond preservation
upon ligand binding.
PDB CodeI Protein / Ligand Complex R.ee(A)2 RMSD3
1ahc/lahb a-momorcharin / formycin 2.0/2.2 0.23
5’ -monophosphate
lapm/latp CAMP-dependent protein kinase / 2.0/2.2 0.35
MnATP
1ajz/ 1aj2 dihydropteroate synthsse / 2.0/2.0 0.29
diphospbate
1ca2/1bcd carbonic anhydrase ll / 2.0/ 1.9 0.20
trifluoromethane sulphonamide
1cgf/1hfc fibroblast collagenase / HAP4 2.1 / 1.56 0.39
lgmq/ lgmr ribonuclease/2’ -guanosine- 1.8 / 1.77 0.25
monophosphate
lkem/lkel catalytic antibody/hapten 2.2/1.90 0.89
2hvm/lllo hevamine a/allosamidin 1.8/ 1.85 0.12
1swa/1swd apo-core-streptavidin / biotin 2.0/ 1.90 0.97
2ptn/1tps trypsin / a90720a 1.55/1.9 0.3
1xib/1xid d-xylose isomerase / l-ascorbic acid 1.6/ 1.7 0.19
lydc/lydb carbonic anhydrase ii / acetamlamide 1.95/ 1.9 0.13
1tli/3tmn thermolysin / val-trp 2.05/ 1.7 0.22
6taa/7taa family 13 alpha amylase / acarbose 2.1 / 1.98 0.30
lgta/lgtb glutathione S-transferase / 2.4/2.6 0.20
praziquantel
lhel/lmlc hen egg—white lysozyme / monoclonal 1.7/2.1 0.49
antibody Fab D44.1
llib/llic adipocyte lipid-binding protein / 1.7/1.6 0.32
hexadecanesulfonic acid
lnsb/lnsc neuraminidase / N-acetyl neuraminic 2.2/ 1.7 0.12
acid
lpoa/lpob phospholipase A2 / transition-state 1.5/2.0 0.72
analogue
lsyc/ lsyd staphylococcal nuclease / 1.8/ 1.7 0.41
2’ -deoxy-3’ -5’ -diphosphothymidine
ludg/ludh uracil-DNA glycosylase / uracil 1.75/1.75 0.19
2sct/1aec actinidin / E645 1.7/1.86 0.11
2apr/3apr acid proteinase / reduced peptide 1.8/1.8 0.13
inhibitor
2cla/3cla chloramphenicol acetyltransferase / 2.35/ 1.75 0.41
chloramphenicol
2ctv/5cna concanavalin A / 1.95/20 0.42
a-methyl-D—mannopyranoside
2sgs/5sga proteinase A / tetrapeptide 1.5/ 1.8 0.08
AcePro-AlapPro-Tyr
2wrp/1tro 'Ii'p repressor / synthetic operator 1.65/ 1.9 2.18
3cox/lcoy cholesterol oxidase / 1.8/1.8 0.24
3-B-hydrmry-5—androsten-17—one
3dni/2dnj deoxyribonuclease I / DNA 2.0/2.0 0.37
3enl/5enl enolase / 2-phospho-D-glyceric acid 2.25/22 0.21
3grs/1gra glutathione reductase / glutathione 1.54/20 0.12
disulfide
5cpa/6cpa carboxypeptidase A / phosphonate 1.54 /2.0 0.36
‘ Ligand-free/ligand-bound
2Resolution of the crystallographic structures in Angstroms.
3Main-chain RMS positional deviation from superposition of the ligand-bound and free structures.
‘HAP is (N-(2-hydrouramatemethylene-4—methyl-pentoyl)phenylalanyl) methylarnine.
5E64 is [N-(l-3-trans-carboxyoxirane-2-carbonyl)-l-leucyl]-amido(4-guanido)butane.
17
These structures are a set of diverse, non-homologous structures with resolution
better than 2.2 A . Positional root-mean-square deviation after superimposition of
ligand-free and ligand-bound structures was less than 1.0 A (mostly smaller than 0.5
A ) . The structures had no missing residues in the binding site.
Hydrogen bonds were identified between donor and acceptor groups according to
the following geometric criteria [45, 31], shown graphically in Figure 2.1:
1. Donor-Acceptor distance, d S 3.6A
2. Hydrogen-Acceptor distance, r 5 2.6A
3. Donor-Hydrogen-Acceptor angle, 90° 3 0 5 180°
Figure 2.1: This image is presented in color. Geometric parameters used to identify
hydrogen bonds and measure their energy. The hydrogen bond is depicted as a
dashed line between the hydrogen and the acceptor oxygen. r is the hydrogen-acceptor
distance, d is the donor-acceptor distance, 0 is the donor-hydrogen-acceptor angle and
d: is the hydrogen-acceptor-base atom angle, where the carbon is the base atom in
this example.
The energy of each hydrogen bond was measured using a modified Mayo poten-
tial [16, 4]. The function evaluates the favorability of the observed hydrogen-bond
18
length relative to the optimal, equilibrium length for that pair Of atoms based on
their electron orbital hybridization, as well as the favorability of the angles between
the donor and acceptor groups. The modification avoids non-physical H-bonds with
angles near 90 deg (e.g., between C=O(i) and NH(i+3), rather than the important
C=O(i)4—>NH(i+4) interactions in the middle of a-helices). The energies of hydrogen
bonds, Egg were calculated using equations 2.1.
Ens = Vo {5 (%)12 - 6 (%)lo} F09. «A w) (2-1)
with
V0 = 8 kcal/mol R0 = 2.80 A
sp3 donor - sp3 acceptor F = cos2l9e'(*‘“)°cos2 (qi — 109.5)
sp3 donor - sp2 acceptor F = cos”0e"*‘”l°cos"’¢
sp2 donor - sp3 acceptor F = c0340(e‘2(*‘9)6)
sp2 donor - sp2 acceptor F = cos20e’(""’lficos2 (max [(1),
rer “XX
XXX)“
m:
X0 0 0mm
>181.
Dihedral Angle Change (degrees)
0
I
X (X ram
mo 0
-50—l 5
o o X
.350
400- o x 33
.X.
38
-150...
O o
“ZOOIIIIIIIIIIIIIIIIIII
02amzagurarnr-ruarrra.
easesalIesassserséi
ResidueType
Figure 3.2: Change in X1 and x2 dihedral angles of all interfacial side chains within
4.0 A of the ligand, across 30 structures listed in Table 2.1. The change is calculated
as the difference of x angle of interfacial side chains between their ligand-free and
ligand-bound conformations. The standard deviation for X1 changes is 25.8 and for
x2 changes is 34.7. Note that most changes in x”) fall within the range of —20° to
20°.
33
200
150"" O O
' 8
A100- " 9
I x O
8’ o
2.50- 52 O X
3 o.
s 8 o
6 o- i M g.
0 .
i“ x ’8 :
§-50- g 00
5-100-
‘ %
450- 6 .
1 O L
'200 TiillTIIIlllllliliI
02am aggawr-urrraco.
rasaésx-ssessaefi
Figure 3.3: Change in x3 and x4 dihedral angles of all interfacial side chains within
4.0 A of the ligand, across 30 structures listed in Table 2.1. The change is calculated
as the difference of x angle of interfacial side chains between their ligand-free and
ligand-bound conformations. The standard deviation for x3 changes is 42.9 while
for x4 changes is 46.4. Note the somewhat broader range of change in x3 and x4
compared to changes in X1 and x2 dihedral angles in Figure 3.2.
34
these steric clashes involve the flexible portion of the ligand, it is possible to resolve
each steric clash independently by rotating either the ligand’s flexible side chain, or
a flexible group in the protein with which it overlaps. In this context, a ligand side
chain is any group connected to the anchor fragment by a rotatable bond. When the
target (protein) atom clashing with the ligand belongs to a flexible protein side chain,
that side chain’s rotation also presents another option to resolve the steric clash.
Geometrically, any collision between two atoms, which are modeled as va der
Waals spheres can be resolved in multiple ways, given that each atom as 3 degrees of
translational freedom. However, the colliding atoms are themselves bonded to other
atoms by single or double bonds. For each single-bond rotation that can help resolve
a steric clash, overlapping atom displaced by the bond rotation is considered mobile,
while the other overlapping atom is considered fixed. When both the overlapping
atoms are independently connected to single bonds, than their overlap resolution is
evaluated with each of them being considered mobile and fixed, one at a time.
Single bond rotations allow moving the rotating atoms in a circular trajectory in
planes perpendicular to the rotation axis, which is along the single bond. To resolve
a collision through a specific bond rotation, the rotatable single bond is transformed
along the Y-axis, with the atom of the bond further from the backbone forming the
new origin. The same transformation is applied to all the atoms in the side-chain
beyond, including both the atoms having the steric clash. The transformations are
calculated in a way such that, on application, they would move the fixed atom into
the X-Y plane, as shown in Figure 3.4. This later helps calculations for determining
the rotatable bond’s rotation angle for collision resolution.
While there are potentially infinite positions on the trajectory where mobile
atom can be placed to resolve a clash, most of these are usually infeasible due to the
high probability of causing new collisions. SLIDE’s paradigm to choose a position
to resolve collisions during docking is based on developing “induced fit” between the
35
Circular trajectory available to
mobile atom through rotation around
a rotatable bond aligned along Y- “
axis. which points downward.
perpendicular to the plane or the Z
'a' reflects the minimum
angular rotation for mobile \ , ~ Sterlc
atom to resolve the steric ‘ \ ‘ overlq:
clash \ Mobile Atom I
0 Fixed
Atom X
Figure 3.4: Induced fit development mechanism implemented in SLIDE performs di-
rected rotations to resolve atomic collisions. Directed rotations are performed around
a rotatable bond not adjacent to the mobile atom. The rotations are performed in a
plane parallel to X—Z plane, by aligning rotatable bond along the Y-axis. The hinge-
atom of the bond is new origin for the rotation. The fixed atom is transformed into
the positive quadrant of the X-Y plane, while the mobile atom is rotated through the
minimum angle of rotation, ‘a’ to resolve the steric overlap.
36
ligand and the protein binding site[23]. To mimic induced fit, SLIDE chooses the
smallest angle through which the mobile atom can be rotated to resolve the collision.
However, choosing the smallest rotation angle for a bond rotation is not sufficient
since there may be multiple single bonds in a side chain in residues like lysine, arginine
which could be rotated to resolve a collision. SLIDE’s approach for deciding which
rotations to apply for resolving a set of collisions is based on mean-field theory[26, 19],
implemented to optimize bond rotations[42]. The key feature of this approach is a
probability matrix P(i, j), which describes the probability that a particular collision
2' will be resolved by a rotation of bond j . First, all intermolecular collisions in the
complex are identified. They form one dimension of the matrix. Only complexes
with up to 20 collisions between atom pairs undergo side-chain collision resolution in
SLIDE; ligand dockings with more collisions are discarded. All rotatable bonds that
can be used to resolve at least one of the collisions build the other dimension of the
matrix. Only those rotations are considered that do not result in an intramolecular
collision based on the current conformation. Note that there is no differentiation
between ligand and protein side chains in this matrix. All rotations that can resolve
a particular collision are initialized with equal probability values. For each probability
entry P(z', j), a cost value E’ (z', j) is computed that reflects the cost of rotating bond
j to resolve collision i. This cost is simply the product of the number of displaced
non-hydrogen atoms and the absolute value of the rotation angle. This makes large
rotation anng or the rotation of large side chains more costly, as they are more likely
to cause steric problems elsewhere.
During the iterative cycles of the mean-field optimization process, the probability
matrix P is updated to converge to high probabilities for those rotations that provide
the lowest-cost conformational change of both molecules to resolve a maximal number
of the observed collisions (Figure ??). In each cycle a mean cost, E (i, j), is computed
37
Find all side-chain collisions 2' and all rotatable bonds j;
While there are between 1 and 20 side-chain collisions do:
Compute probability matrix P and cost matrix E’;
For 10 cycles do:
Compute mean cost E(i, j);
update probability matrix P;
Do feasible highest probability rotations;
Find all remaining side-chain collisions i;
Figure 3.5: The algorithm for resolving side-chain collisions using the mean-field
optimization technique. When there are still collisions exceeding the threshold after
10 iterations of the outer loop, this ligand orientation is discarded.
for each rotation, as follows:
E(i.j) = E'(z°.j) + Z depi(t,j),(h,k)i-P(h.k)~E’(h,k)
haéi, I:
The value of dep[(i, j), (h, k)] is set to -1.0 if j = k, i.e., both entries refer to the
same bond and both rotations are in the same direction. Hence, two collisions can
be resolved at once by a rotation of this bond, and the —1.0 value results in a lower
mean cost E(i, j). The value of dep[(z', 3'), (h, k)] is set to 1.0, with a resulting increase
in E(z’, j), in two cases. The first case is when both entries refer to the same bond
(j = 1:), but the corresponding rotation directions are opposed to each other. In
the other case, bond j lies on the path from bond 1: to the anchor fragment or the
main chain, respectively. Here, the mean cost of rotating bond j is increased, since if
rotation (i, j) were applied, then bond k would be moved, and this would invalidate
the assumptions made in the current iteration regarding rotations involving this bond.
At the end of each cycle, the entries in the probability matrix are updated based
38
on the mean costs E (2', j) using the Boltzmann principle:
. . e-Efiajvl‘
P(2,J) = Tl: e-E(i.k)/u
where p is the average value of all computed mean costs. Convergence of the values in
the probability matrix is usually observed in fewer than ten cycles, and those rotations
with the highest probability are chosen to resolve the collisions. At this point, it is
again necessary to check for negative correlations between bonds. Although this
was already considered during the computation of the mean cost, two correlated
bonds can receive high probabilities if they are the only bonds to resolve particular
collisions, or if alternative rotations are much more expensive. During the mean-
field optimization process, it is not possible to anticipate complex dependencies, e.g.,
which ligand rotations influence protein bonds related to other collisions. Rotations
are only accepted if they do not cause any intramolecular collisions. Since it is likely
that not all collisions can be resolved in one application of the mean-field Optimization
technique, up to 10 iterations of this process are executed, as outlined in Figure ??.
Ligand conformations are discarded if they have more than 20 collisions at any time
during the optimization or have collisions exceeding the threshold after 10 iterations.
3.2.1 Motions Modeling in SLIDE
Mean-field optimization helps choose bond rotations to minimize the confor-
mational cost of developing shape complementarity. Figures 3.6 and 3.7 present
a displacement comparison of side chains that were moved by SLIDE for collision
resolution with displacements of corresponding side chains from their ligand-free to
ligand-bound conformations. (Hence, these figures present a subset of interfacial side
chains which were found to have moved in Figure 3.1).
To examine how well SLIDE models known side—chain motions, the distribution
39
Ligand-free to ligand-bound
side-chain RMSD
40 " Ligand-free to SLIDE's best docking's
side—chain RMSD
Z
/ V
a A
E ZS
M .
/\/§
/§/\
Zfifi
/§%\
/\/\
/\/\
Zhé§ a .
Mt
as, \T . l3“. .
-2 0 2 4 6
RMSD (A)
(negaMmRMSD:MienSUDEmvedasldednhmmermyfiunllgmdbwndomt)
Figure 3.6: Frequency distributions of RMSD between ligand-free and ligand-bound
side-chain conformation, as well as RMSD between ligand-free side-chain conforma-
tion and sidechain conformation in the best docking generated by SLIDE. The best
docking was determined using the RMSD of the docked ligand relative to the crystal
structure orientation. Positive values indicate the RMSD by which SLIDE’s con-
formation of side chain got closer to crystal complex conformation, while negative
values indicates that SLIDE moved the side chain further from crystal complex con-
formation. Observations are derived from SLIDE’s best dockings of 24 structures,
comparing only those interfacial side chains that were moved by SLIDE.
40
of root-mean-square deviations (RMSDs) between ligand-free and docked side-chain
conformations found in the best dockings by SLIDE were compared with the distribu-
tion of RMSDs between ligand-free and ligand-bound side-chain conformations from
crystal structures (Figure 3.6). This distribution data was derived from 24 of the 30
structures specified in Table 2.1. The remaining 6 structures did not permit dock-
ing without an increase in inter- atomic interpenetration parameters, so they were
excluded from this analysis. The RMSD value of a docked side-chain conformation
relative to its ligand-free conformation is displayed negative when it is greater than
the RMSD value between the ligand-free and ligand-bound conformation in crystal
structures. Hence, this figure compares side-chain displacement distributions between
modeled and observed structures, both in magnitude and direction. While SLIDE’s
distribution of RMSD magnitudes was quite similar to nature, as most of the motions
were restricted between 0-1 A, a few large displacements in ligand-bound conforma-
tions are also observed. These observations are in agreement with [50], where it has
been found that in 85% of the cases, interfacial side chains rotate through less than
45° upon ligand binding.
While Figure 3.6 compares displacement distributions, Figure 3.7 compares dis-
placements on individual side-chain basis. Presented as a scatter plot is the compar-
ison of side chain RMSDs between ligand-free and ligand-bound conformations, and
between the ligand-free conformation and the conformation in the best dockings for
24 structures by SLIDE. Comparing with Figure 3.1, it is clear that SLIDE moves
very few side chains compared to nature. The correlation between the ligand-free to
ligand-bound side chain RMSD and ligand-free to docked side chain RMSD is small
(0.28), indicating the extent to which SLIDE misjudges the magnitudes of motion,
as is clear by the degree of scatter. The color coding indicates the quality of motion,
answering the question: “ did SLIDE move the side chain closer to ligand-bound
conformation or further away ? ” SLIDE, while removing steric overlaps, tends to
41
5 Closer to
:, llg-bound oonf.
O . -1
I 0 0.6000
‘0 0
c: 4 o
g ‘ 0.3375 5
g 5
l 0 07500 9..
§ 5 §.
2 3‘ ,-
u, 0.1875 8.
i " m.
g 0.4500
.2 .E 2 'l .0 g
.3 “ . ° . ' c.7125 _
_ E . . 3
g . . 0 0.9750 g
1 - . e re
0 0 ..:~ 4.238
(O O .9. 0
z '0 ' ° 1 500
m 0 -l - '
Further from
I - I . I ' I . I . liq-bound cont.
0 1 2 3 4 5
RMSD between Ligand-free and SLIDE‘s best docking's
side-chain conformation
Figure 3.7: This image is presented in color. Comparison of side chain RMSDs
between ligand-free to ligand-bound conformation, and between ligand-free confor-
mation and conformation in the best docking. The best docking was determined
using the RMSD of the docked ligand versus the crystal structure ligand orientation.
The color scale indicates the quality of motion. Positive values indicate the RMSD
by which the SLIDE-generated best docking’s side-chain conformation became closer
to crystal complex conformation, while negative values indicate that SLIDE moved
the side chain further from the crystal complex conformation. Observations are de-
rived from the best dockings of 24 structures, with only those interfacial side chains
being compared which were moved by SLIDE. Correlation between the two RMSD
distributions is 0.28.
42
move the side chains away from the ligand-bound conformation (colors ranging from
cyan to red). For only 18 of 58 observations did SLIDE move side chain closer to the
ligand-bound conformation (blue).
Having observed that most of the SLIDE motions are minimal to resolve colli-
sions, that each collision can usually be resolved by either atom’s rotation around a
rotatable bond, and that often, SLIDE moves atoms further away from ligand-bound
positions, it is postulated that both directions of rotation should be explored for col-
lision resolution. While the angle of rotation in one direction will be bigger than the
angle in the opposite direction, the ability to make better interactions in one direc-
tion versus the other should influence the direction of rotation. Furthermore, from the
observed intra—protein hydrogen-bond preservation tendency presented in Chapter 2,
any rotations which might disrupt direct hydrogen bonds between the binding-site
side chains may be discouraged so that SLIDE’s dockings will have a similar bias
towards preserving most of the intra-protein hydrogen bonds.
3.3 Employing Hydrogen-Bond Preservation Bias
in Mean-Field Optimization
As explained in previous section 3.2, the key feature of mean-field optimization
process is the probability matrix P(i, j), which describes the probability that a par-
ticular collision i will be resolved by a rotation of bond j. At the beginning of the
optimization, each rotation that can resolve a collision is assigned equal probabil-
ity. Subsequently, these probabilities are updated iteratively during the optimization
process, depending upon the conformational cost of rotation and dependencies of re-
solving other collisions. To bias against those specific side-chain rotations that disrupt
a hydrogen bond, initial probabilities can be reduced according to the hydrogen-bond
preservation probabilities derived from the statistics shown in Figure 2.2. This should
43
effectively discourage rotations which can break a hydrogen bond. Nevertheless, de-
spite low probability, a particular rotation that disrupts a hydrogen bond can still be
effected if it is the only one that can resolve a particular collision, hence ensuring that
final docking will be free from atomic overlaps. Biasing against breaking intra-protein
hydrogen bonds would likely improve the chemical complementarity of the docking
as well.
To measure the effect of incorporating hydrogen-bond preservation bias, a pilot
study of 5 ligand-free and ligand-bound structure pairs was conducted. The templates
representing the binding sites of the ligand-free structures included random sampling
of potential interaction points as well as known interaction points with the ligand.
Known points were included to ensure that at least one docking would result in placing
the ligand close to its native position and orientation, so that side-chain motion would
not be required to compensate for inaccurate ligand placement.
Figures 3.8, 3.9, and 3.10 present a comparison of SLIDE and the new SLIDE
variant with a hydrogen-bond preservation bias based on the statistics from Figure 2.2.
Data presented for both SLIDE versions include only the best dockings according to
the ligand RMSD (relative to its crystal complex position), which must be less than
1.0 A. The native ligand conformations from ligand-bound crystal complexes were
used for docking into apo structures. Figure 3.8 presents the total number of intra-
target (protein binding site) hydrogen bonds. The crystal complexes have more intra-
protein hydrogen bonds upon ligand binding compared to the ligand-free structures.
Both SLIDE versions preserve a similar number of hydrogen bonds, but fewer than
what nature preserves in the crystal complexes. This indicates that new intra-target
hydrogen-bonding opportunities are missed in both versions of SLIDE. Figure 3.9
presents intra—target hydrogen bonds that were lost upon ligation in the crystal com-
plex and in the best dockings by SLIDE. Interestingly, while more hydrogen bonds
were lost in the crystal complexes, Figure 3.8 indicates that these complexes form
44
new intra-target hydrogen bonds that were not present in the ligand-free structures,
as shown in Figure 3.10. Even though the sample set of 5 structures is small, it nev-
ertheless suggests that nature breaks and makes more intretarget hydrogen bonds
upon ligation through optimal arrangement of side chains. SLIDE, on the other hand,
due to its minimal rotation collision resolution model, tends to keep the binding site
side-chain positions and their interactions more or less the same, breaking and making
fewer intra-target hydrogen bonds compared to nature. Considering that each hydro-
gen bond has an energy of about -5 Kcal/mol, and that typical complex formation
involves a favorable energy change of only ~3 times this number, careful modeling of
interfacial hydrogen bonds is likely to be crucial to correctly sample and assess the
best dockings.
3.4 Sampling Large Side-Chain Motions
Comparison of the extent of displacement in Figures 3.6 and 3.7, as well as
hydrogen-bond preservation results from best dockings by SLIDE presented in Fig-
ures 3.8, 3.9, and 3.10 indicate that aside from developing good shape comple-
mentarity through small side chain adjustments, optimal rearrangement of the intra-
target hydrogen-bond network is important for favorable binding affinity between the
molecules. This rearrangement may involve not only small but also large rotations,
which SLIDE’s induced-fit mechanism does not encourage.
To explore if there are any prominent reasons for rearrangements through large
side-chain rotations, all interfacial residues that had undergone a side—chain dihedral
rotation greater than 60° upon ligand binding were analyzed by molecular graphics
to understand the reasons for these large rotations. Any binding sites having gaps
between residues were excluded from this analysis using the molecular graphics tool
InsightII, since those mobile residues could influence motions in unpredictable ways.
45
lntra-protein direct hydrogen—bond comparison
between ligand-free, ligand-bound structures
and best dockings by old and H-bond-preservationist SLIDE
+0
3 rs H-bond-preservatlonist SLIDE
9 +1 223 Old SLIDE
E Ligand-bound crsytal complex
DID Ligand-free structure
+0
1syc +3
1nsb
Complex
1 ajz
1ahb
0 1O 20 30 4O 50 60
Total number of intra-protein hydrogen bonds in the binding site
Figure 3.8: Comparison of the number of intra-protein hydrogen bonds before and af-
ter ligand binding in crystal complexes, in best dockings by SLIDE and the hydrogen-
bond-preservationist version SLIDE.
46
Comparison of intra-protein direct hydrogen bonds
lost upon docking by old SLIDE and H-bond-preservationist
SLIDE versions, compared against H-bonds lost in crystal complexes
H-bond preservationist SLIDE
agrs Old SLIDE
E Crystal complex
1 eye
1nsb
Complex
1ajz
Iahb
0 1 2 3 4 5 6 7 8 9
Number of intra-protein hydrogen bonds lost in the binding site
Figure 3.9: Comparison of the number of intra—protein hydrogen-bonds lost upon
ligand binding in nature (between ligand-free and ligand-bound structures), in best
dockings by SLIDE, and in the best dockings by hydrogen-bond-preservationist ver-
sion SLIDE.
47
Comparison of intra-protein direct hydrogen bonds
gained upon docking by SLIDE and by H-bond-preservationist SLIDE,
compared to hydrogen bonds lost in the crystal complexes
H-bond preservationist SLIDE
398 I222 Old SLIDE
E Crystal complex
1syc
1ahb
O 1 2 3 4 5 6 7 8 9
Number of intra-protein hydrogen bonds gained in the binding site
Figure 3.10: Comparison of the number of intra—protein hydrogen bonds gained upon
ligand binding in crystal complexes, in the best dockings by SLIDE and by the
hydrogen-bond—preservationist version of SLIDE.
48
Table 3.1 presents the conclusions from visual inspection of such large rotations. The
process involved super-imposing the ligand-free binding site onto ligand-bound site,
followed by analyzing steric and chemical factors that could encourage large rotations.
Several causes for rotations or displacements were found - main-chain movements
of more than 1.0 A (for residues like L33 in PDB entry lkel L33, residues A45 and
A49 in lswd), side-chain movements caused by motion in adjacent residues, aromatic
interaction with the ligand (lahb, residue 70), 1r—cation interaction with the ligand
(lgmr residue B40, lsyd residue 115 ), surface exposure or even crystal packing
effects (ludh residue 87). Nevertheless, in two-thirds of the cases, side chains moved
not to resolve any steric clashes upon ligation. Rather, side chains moved to satisfy
polar atoms (in 10 cases, excluding cases with aromatic or 1r-cation interactions)
which would have remained unsatisfied and buried if they had remained in ligand-
free conformations.
Visual inspection and results trends indicate that, in 50% of the cases, the reason
to satisfying the hydrogen—bond potential of a polar buried atom encouraged large
side-chain rotations in the interface upon ligand-binding. Exhaustive modeling of pos-
sible large side-chain rotations in SLIDE around single bonds would mean sampling
300° of dihedral-angular space for each single bond. Given that a side chain can have
from zero to four rotatable bonds, the combinatorial sampling, depending upon the
fineness of sampling and the number of side chains to model, would be prohibitively
time-consuming to do in docking.
3.4.1 Rotamer Libraries
Fortunately, for most of the bond-rotation angles, protein side chains adopt pri-
marily a staggered dihedral angle such that covalently bonded, sp3-hybridized atoms
tend to stagger, rather than eclipse, their substituent atoms. As the number of
solved protein structures has increased over the years, statistical distributions of the
49
Table 3.1: Visual analysis of reasons behind large dihedral rotations (> 60°) in inter-
facial side chains.
Structure Residue# Residue Type Did side chain clash Would side-chain
with the ligand ? H-bond potential
have remained if side
chain had not moved
I?
1aj2 221 LYS N 0 YES
latp El 20 MET NO YES
1 atp E53 SER NO YES
lcgu 257 GLU NO YES
lgmr B40 ARG NO YES
lpob A52 ASN N 0 YES
lsyd 43 GLU N 0 YES
lsyd 1 1 5 TYR NO YES
ltps 192 GLN NO YES
7taa 296 HIS NO YES
lgmr B38 GLN N O NO
1ke1 L33 ASN NO NO
lbib 1 83 LYS NO YES
llic 58 LYS NO NO
1ahb 70 TYR YES YES
lgmr 865 ARG YES YES
1kel H56 LYS YES YES
lpob A30 ARG YES NO
laj2 115 ASN YES NO
llic 57 PHE YES NO
lpob A2 LEU YES NO
lcoy 122 MET YES NO
5O
side chain bond-rotational angles for each residue type have been characterized[24].
These bond-rotation angles are labelled x angles by convention, as explained in Fig-
ure 3.11. Side-chain X angles have been found to occur in tight clusters around certain
values. This is both because of the hybridization of the bonded atoms, as well as to
prevent collisions with the main-chain atoms of the residue and its neighbors. Each
single bond in the side chain can sample its dihedral degrees of freedom subject to
these constraints. The resulting side chain conformation is called ‘rotational isomer’
of the side chain. Abbreviated as ‘rotamer’, this favored orientation is represented
as a set of values, one for each dihedral angle degree of freedom. Since bond angles
and bond lengths in proteins have rather small variances, they are usually not in-
cluded in the definition. A library of such favored sets of side-chain x-angle values for
each residue type is called a rotamer library. Rotamer libraries usually also contain
information about the frequency of each rotamer, as some conformations are more
likely than others due to side chain stereochemistry, besides the information about
the variance around the dihedral angle mean or mode.
Using rotamers to sample side chain conformation can help avoid exhaustive
spatial sampling and make tractable the identification of a suitable combination of
dihedral angles. With known geometries derived from experimentally solved crystal
structures, not only can time be saved during sampling, it is also more likely to arrive
at a side chain conformation that is natural and low in energy.
Using rotamer library for side chain modeling does have certain drawbacks.
Firstly, rotamers are typically averaged conformations representing a cluster of con-
formations; they may not be real conformation themselves. Besides, granularity of
dihedral space sampling of side chains within the rotamer library can itself be a limit,
especially because rotamers largely reflect favored side-chain orientations outside of
interfaces. Another aspect is that even though rotamers are thought to represent
local energy minima on a potential energy landscape, not all rotamers may be local
51
Figure 3.11: Illustration of bond-rotation angles associated with single bonds in an
arginine side chain. The number of single bonds in a side chain, ranging from 0 to 5,
depends on the residue type. A single bond is free to sample dihedral rotations and
its associated dihedral angle is called a x angle. The x angles are labelled X1» x2, x3
and so on, according to the level of the single bond from the residue’s backbone.
52
extrema since distributions Of side chain dihedral angles are fairly broad (i20° or 30°
relative to the average value presented).
Nevertheless, rotameric conformations can definitely be considered candidate
conformations to sample, and the extent to which they approach Optimal confor-
mation can be assessed. For sampling larger conformational space for side chains,
rotamer libraries Offer time savings while providing ready-tO-use, pre-calculated rO-
tameric side chain conformations.
Rotamer libraries can be backbone-independent, backbone-dependent, or even
secondary-structure-dependent. Backbone-dependent rotamers have dihedral angles
and/ or frequencies that are binned according to the local backbone conformation.
Backbone-independent rotamers are calculated using all available side chains of a
residue type. Backbone-dependent rotamers have the advantage that no rotameric
side chain can have steric clashes with its own backbone.
TO see if one can find rotamers from a library similar to ligand-bound conforma-
tions Of few side chains which rotated through more than 60° on ligand-binding, the
Dunbrack May 2002 backbone dependent rotamer library was searched (downloaded
from http://dunbrack.fccc.edu/bbdep). Rotamer searches were done in incremental
fashion - first comparing only X1, then X1 and x2 and so on up to x4, depending
upon the number Of x angles in the side chain. A rotamer was considered similar to
a side chain if each Of the side-chain x angles were within the range x :l: a, where a is
the standard deviation of respective x angle specified for the rotamer in the rotamer
library.
As is clear from search results presented in Figure 3.12, for side chains with x3
and x4 angles, fewer suitably close rotamers were found. Furthermore, for a few Of the
side chains, like Tyr70 in 1ahc and Tyr115 in lsyc, no rotamer was found. This result,
in effect, places an upper bound on how close, in x-angular space, tO a ligand-bound
side chain conformation can rotamers from this particular library reach.
53
Rotamers found by x—angie search for selected
ligand-bound side chain conformations
1ahc_ TYR _701
1syc_ TYR_ 1151
1swa_ A _SER _451
1apm: E _SER _531
1Iib_ ”PHE _571
3cox_ MET_ 1221
1apm_ E_ MET: 1201
1kem_ H L_YS _561
1ajz_LYS_2211
1bia_ us: 1831
1Iib_ LYS _581
1kem_ L_ LYS 501
6taa:_ HIS _2961
1syc: GLU _431
1cgt_ GLU _2571
1udg_GLN_ 871
1gmq__BG GLN _381
GLN_ 1921
1swa_ A_ ASN _491
1kem: L _ASN :331
1p0a_ ASN :521
1poa_ ARG :301
1gmq_B_ ARG_ 651
13m mcLB _ARG: 401
chs_ A_ ARG :901
WMmemk
Matching upto 13
Matching upto x4
4
Matching upto x,
4
00290-0000...
DO
4
Sidechain
4
B 8
4
3'
A V
'l'l'Ijjinl'l'l'lejT
510152025303540455055
Number Of rotamers found In the Dunbrack May 2002 rotamer library
010090000000.
4
-5
Figure 3.12: Number Of rotamers, from Dunbrack May 2002 backbone-dependent
rotamer library (http://dunbrack.fccc.edu/bbdep), approximating dihedral angles of
target conformations for 25 interfacial side chains undergoing large rotation upon
ligand binding. Selected side chains across structures are those that had rotated
more than 60° from the ligand-free tO ligand-bound conformation. Side chain names
(Y-axis) include the PDB code, chain ID (if present), residue type and residue number.
Backbone d, 21) angles were used to locate the bins in which rotamers were searched,
according the x angles Of the side chain in question. In the rotamer library, the d, w
resolution is 10°. The neighboring 8 bins, using 45 :t 10 , 1,0 :l: 10, were also searched.
Rotamer searches were done in incremental fashion - first comparing only X1, then X1
and x2, and so on, up tO x4, depending upon the number Of x angles in the side chain.
A rotamer was considered similar to a side chain if each Of the side-chain x angles
were within the range x i a, where a is the standard deviation Of respective x angle
specified for the rotamer in the rotamer library. Rotamers searched consisted of only
those rotamers which had the probability Of at least 0.05 times the probability Of the
highest probable rotamer within the same 45, ll) bin. This helps exclude rotamers that
were very rare, since many Of them could be poorly resolved side-chain conformations
in PDB.
54
Another worthwhile aspect to explore is to determine how close rotameric side
chains, which are the 3D side-chain orientations generated from the rotamer library’s
05, w and x angles, can approach ligand-bound conformations in terms Of RMSD.
The reason for this is that single-bond rotations through adjacent x-angle changes
can either compensate for each other or augment each other in terms of the resulting
side-chain motion. The closest rotamers found, in 3D Cartesian space, for ligand-
bound conformations are presented Figure 3.13. Rotarneric conformations within
1 A Of the ligand-bound conformations were found for about half Of the 22 side
chain cases presented here. This again effectively defines the limit on how close
rotameric modeling approach can approach ligand-bound side-chain conformations
using Dunbrack’s backbone-dependent rotamer library.
In Figure 3.13, the relationship between B-values for the large-motion side chains
and the probability Of their having a rotameric conformation is presented. The moti-
vation for this is that poor resolution Of high B-value residues could be reflected in the
unlikelihood Of finding these conformations in the rotamer library. On a color-scale
representing the maximum B—factor value found for any atom Of the side chain, black
to yellow indicate that the atom coordinates are low in mobility, while light-green to
grey indicate that the side chain had at least one atom whose position was highly mo-
bile and thus likely to be poorly resolved in the crystal complex. SO, while rotamers
close to ligand-bound conformations were found (RMSD < 0.7 A) for lbib-Ly8183
and 1gmr-ArgB40, high B-factor values indicate that these side chains do not have
a well-defined conformation. Hence reliable conclusions may not be drawn about the
efl'ectiveness Of a rotamer library for its ability or inability to find a close rotamer for
side chains that consist Of atoms having high B-factor values.
However, for most (13 out of 19) side chains having low B-factor values, a ro-
tamer within an RMSD rang of 0.28 A to 0.83 A was found in each case. For 5 side
chains having low B-factor values for which no close rotamers were found, three (1ahb-
55
Tyr70, 1cgu-G1u257 and lpoa-A52 ) moved to interact with the ligand, while 1poa—A2
moved due tO excessive steric clashes with the ligand, indicating that their favored
conformations are influenced by the ligand as well as the side-chain conformational
energetics.
3.5 Conclusions
In this chapter, the bond-preservation bias discovered in 2 chapter was im-
plemented in SLIDE. However, on further experiments it was found that SLIDE
models small motions well enough during docking to preserve most Of the hydrogen-
bond interactions and, thus, additional bias towards preserving hydrogen bonds is
not warranted. Relatively large side—chain rotations in ligand-bound conformations
and causes for them were further investigated. Trying to satisfy buried polar groups
was determined as a predominant reason. Investigating how large rotations can be
modeled in SLIDE, the effectiveness Of using Dunbrack backbone-dependent rotamer
library to sample larger conformational space was analyzed. For X1 and x2 angles, the
rotamer libraries always contained entries reflecting the conformations Of side chains
that underwent large side-chain motions upon ligand binding. However, for long side
chains there were Often no rotamers that also reflected their final ligand-bound x3 and
x4 angles. In Cartesian RMSD space, 13 out Of221igand-bound side-chain conformer
were represented in the rotamer library tO within 1.0 A RMSD and 8 Of the remain-
der to within 2.0 A RMSD. For all cases where side-chain mobility values were low,
rotamers close to the ligand-bound conformations were found. At other times, side
chains were found to be non-rotameric either because Of ligand-induced strain[14, 15]
in the interface or because they moved to enhance interactions with the ligand.
56
RMSD of the ligand—bound side chain conformations from
closest rotamers found in Dunbrack's backbone dependent rotamer library
120 a. i
1001
so-
q
so-
401 / a\
a m1z/Aaj:A/\\z/./X:z\n
E - An
I:
g ’ . +clloast—rotamuzerMSDmiativeto
a l.
8 25‘ +MaximumB-factorofanysidechainatom
3 —A—AverageB—factorofdlsIde—ctuinatonu
s 2.0- 3250
I .
o \ I
(I) _
2 1.5 /\I_
I!
0.5- u/'
an
"" IIIrIIIIIIIrIIIIIIII
9:» fié $5069 3’ éeée o 06’ ~89
“its“ .9. 3W . 33:55:55 73* 1...:ij ti 5’5: :«t
Large dihedral rotation residues
Figure 3.13: This image is presented in color. RMSD of rotamers from Dunbrack’s
May 2002 backbone-dependent rotamer library found to be closest to selected ligand-
bound side-chain conformations in Cartesian space. Selected side chains across struc-
tures are those that had dihedral rotations more than 60° from the ligand-free to
ligand-bound conformation. Side-chain observation names include the PDB code,
chain ID (if present), residue type, and the residue number. Backbone dz, i/) angles
were used to locate the bins in which rotamers were searched, using a it, w resolution
of 10°. The neighboring 8 bins, using :15 :l: 10 , 1,!) d: 10 were also searched. Each ro-
tamer in these bins was converted to a side—chain orientation in Cartesian space and
transformed into the reference frame of the selected side chain. The rotamer library
consisted of only those rotamers which had a probability of at least 0.05 times the
probability of the highest probable rotamer within the same (1), 1/1 bin. This helped
exclude rotamers that were rare, since they could be poorly resolved side-chain con-
formations in the PDB. Maximum and average B-factor values over all the side-chain
atoms are also shown in the top half of the figure. SO, while close rotamers (RMSD
< 0.7 A) were found for lbib-Ly5183 and 1gmr-ArgB40, high B-factor values convey
that some target side-chain positions are in fact poorly defined. In general, there is
no trend between the B-value mobility of a side-chain and the probability that this
conformation is present in the rotamer library.
57
Chapter 4
Scoring SLIDE Dockings
4.1 Introduction
Computer-aided structure-based methods for virtual screening are aimed at pre-
dicting the binding mode Of a ligand in the binding site of a protein or molecular
target, and estimating the resulting binding affinity. A conformational search algo-
rithm can produce an immense number Of solutions from which the right solution(s)
must be selected by an energy-based or other scoring procedure. Thus, it is extremely
valuable tO predict binding affinity accurately so the user may select good conforma-
tions and orientations from poor ones with confidence.
For ranking protein-ligand dockings, a class Of programs, namely ‘scoring func-
tions’ has been actively researched for many years. Since docking algorithms search
for potentially good binding modes between a specified target and, and in some cases,
thousands Of small molecules, they must have the ability to discriminate not only be-
tween promising and poor binders, but also between different binding modes Of the
same ligand. Theoretically, free—energy simulations can be a reliable check for dis-
criminating good from bad dockings, but this approach is too expensive for screening
large libraries Of small molecules and remains a techniques performed by relatively
58
few experts.
Scoring functions can be broadly categorized into three categories: force—field
based, knowledge-based and empirical. Force-field based scoring functions apply clas-
sical molecular-mechanics based energy functions to approximate the binding free en-
ergy Of the protein-ligand using a sum Of van der Waals, electrostatic, bond length,
and bond angle terms. Solvation is usually taken into account using a distance de-
pendent dielectric or solvent-accessible surface term, although explicit modeling of
discrete water molecules is also done. Non-polar contributions are usually assumed
to be proportional to the solvent-accessible surface area. A drawback is that energy
landscapes associated with the force-field potentials are usually rugged, and therefore
minimization is required prior to any energy evaluation. Furthermore, small inac-
curacies in positioning atoms during docking can lead to large discrepancies in the
energy score.
Knowledge-based scoring functions represent the binding affinity as a sum Of
protein-ligand pairwise atomic interactions. Using the protein-ligand complexes de-
posited in the Protein Data Bank (PDB), knowledge-based potentials are derived
from the Observed preferences for particular atom-pair interactions tO occur at given
distances pr distance ranges between ligands and proteins. However, structures in
PDB do not provide a thermodynamic ensemble at equilibrium. Hence knowledge-
based potentials should be considered as a statistical preference rather than an actual
potential.
Recently, many empirical functions have emerged as alternatives to force-field-
based and knowledge-based scoring functions. Unlike force fields, the weights Of
terms in empirical scoring functions are directly calibrated using a set Of protein-
ligand complexes with experimentally determined structures and binding affinities
through multivariate regression analysis. These scoring functions may also include
derived terms that include interactions that would otherwise involve several terms in
59
a force-field based function, e.g. terms for hydrogen bonds and hydrophobic inter-
actions. Empirical scoring functions have several appealing features. They can be
calibrated against a set Of diverse protein-ligand complexes. Secondly, each term in
an empirical function has an intuitive physical meaning. The resulting regression co-
efficients before each term conveys the relative contribution Of each interaction in the
ligand-binding process. Thirdly, empirical scoring functions are extremely fast with
reasonable accuracy, which makes them acceptable for structure-based drug design
applications like virtual database screening and de novo ligand generation. Finally,
most force-field-based energy functions have been parameterized solely tO fit pep-
tidyl data, and are just beginning to be parameterized to recognize and accurately
represent atom and bond types that occur in non-protein ligand molecules.
4.2 SLIDE Scoring Function
The current SLIDE scoring function is an empirical scoring function trained tO
score a collision-free complex using a linear combination of hydrophobic and hydrogen-
bond interactions terms:
SCORE(P, L) = A - HPHOB(P, L) + B - HBONDS(P, L)
The weights A and B have been fit to Optimize SCORE(P, L) to match the
affinities Of 89 high-resolution complexes listed in reference [8]. For more information
about the terms and their tuning in current version Of SLIDE, please see reference
[42].
While the current SLIDE scoring function is based on hydrogen-bonding and
hydrophobic terms, more detailed terms tO represent both stabilizing and destabilizing
thermodynamic interactions can be explored for developing new scoring functions for
SLIDE. Inclusion Of such terms is also desirable to help discriminate among different
60
binding orientations generated while sampling rotamers for buried unsatisfied side
chains and to detect interactions that could be improved through further Optimization.
4.3 Methods
Empirical scoring functions are typically trained and tested against series of
experimentally determined protein-ligand crystal complexes whose binding affinities
are known. Unfortunately, no experimental data exists for suboptimal orientations Of
ligands with proteins. For designing a new empirical scoring function for SLIDE to be
used during and after docking, we developed and tested a series of scoring functions
resulting from linear combinations Of individual terms listed in Table 4.1. The choice
Of which terms to combine was driven by low-correlation among the terms, as well
their ability to represent different interaction types.
Table 4.2: Scoring function variants evaluated to predict binding affinities. 269 crystal complexes,
listed in Table 4.3, with experimentally known affinities were used for training and testing the
scoring functions. Terms are as defined in Table 4.1. Linear regression constant a, and coefficients
6 through 1) were determined to best fit the binding affinity values.
Scoring Function Expression
f1 (Id-BA
f2 a+flB
f3 (Id-BC
f4 a+flD
f5 O+flE
fa a-i-BF
f7 a-i-flG
f3 a-l-BH
f9 0+3]
fro 0+1”
fit 0+5K
in 01-514
f13 0+5M
fti 0+5N
fits 04-50
fie 0+3!)
In 04-50
he 0+3}?
fro 04'53
Continued on next page
61
Table 4.2 — continued from previous page
Scoring Function Expression
fro a+BT
f21 0+5U
fn 0+3V
f23 01-5-7443
f24 a+l6(J+K)+vB
f25 0+5(H+I)+’YB
fee 04-130
f2? 0+5D+7I
fzs 0+BD+7(I+K)
129 a+flI+7(G+H)+6(J+K+L+M)+eP+(R
[30 a+flI+7(G+H)+6(J+K+L+M)+eP
f3, a+flI+7(G+H)+6(P)
f32 a+flB+7(I+G+H)+6P+eR
f33 a+flB+7(I+G+H)+6R+e
I34 0+flB+7I+6R+E
fss Owl-33+?!
fag a+flO+7(I+G+H)+6(J+K+L+M)
I37 a+BO+7(I+G+H)+6(J+K+L+M)+eR
I33 a+flO+7(I+G+H)+6(J+K+L+M)+eP
f39 a+BS+71+6(G+H)+e(J+K+L+M)+(P+nR
f4o a+BS+7I+6(G+H)+6(J+K+L+M)+CP
f“ a+fiS+7I+6(G+H)+eP
I42 a+flS+7(I+G+H)+6P+eR
I43 a+fiS+7(I+G+H)+6P
f“ a+flS+7(I+G+H)+6R
f45 a+flS+7I+6R
f4s 044934-71
I47 a+flS+7(I-i-G+H)+6(J+K+L+M)
I48 a+flS+ty(I+G+H)+6(J+K+L+M)+eR
f“ a+flS+7(I-i-G+H)+6(J+K+L+M)+eP
I50 a+flB+7I+dG+€H
f51 O+BS+7I+6G+€H
f52 a+flB+7I+6(G+H)
f53 a+BS+7I+6(G+H)
f54 0+BT+VI+6(G+H)
f55 a+flT+7I+6(G+H)+cP
I53 a+flT+7(I-l-G+H)+6(J+K+L+M)+eR
I57 a+flS+7G+6(H+I)+eL+(J
f“ a+flT+7G+6(H+I)+eL+(J
[59 a+flS+7G+6(H+I)+eL+(J+nV
foo a+flU+7G+6(H+I)+eL+ -b us-s-s ~96 G -s -s
a - ab .0 us mos «s «as 05 030°
5 — .gb .‘ e' a' a" J'e' "BR 1.1%
_ 0% e'
' 05 es .Qj 1.400
.. 0' \B/ I“)
. Higher score
I I l I I l I I
-1.0 0.0 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0
Side-chain RMSD relative to ligand-bound conformation (A)
Figure 5.8: This image is presented in color. Selection or rejection reasons for
each of the sampled rotamers for the unsatisfied side chains in the best dockings of
the 24 structures used in experiments. X-axis represents the RMSD between the
docked side-chain conformation and the ligand-bound conformation. Y-axis entries
are the cases of unsatisfied side chains; their names have been omitted for clarity.
For each side chain, the sampled rotamers are represented as filled circles. Symbol
‘b’ represents side-chain conformation after the induced-fit docking and before the
rotamer sampling, ‘3’ indicates the rotamer was rejected due to unresolvable steric
overlaps, while ‘*’ means that steric overlaps, if any, were resolved. The color of
the filled circle represents the change in score - from brown towards blue indicates
the rotameric conformation enhanced the score compared to starting or the base
conformation ‘b’, while brown towards red depicts a decrease in score.
96
score for each overlap-free rotamer. In such cases, the likelihood of a rotamer
being chosen depends on the backbone probability used in the score—based mean-
field maximization. A rotamer further from the ligand-bound conformation may
be chosen due to having a high probability.
5.5 Conclusion
One of the most important observation involves “B” cases. Representing the
side-chain conformation after preliminary induced-fit docking, these conformations
are already very close, within 1.0 A, of the ligand—bound conformations in most of
the cases. This is observed to be true for almost all of the unsatisfied polar side chains.
This implies that while rotamers help sample a larger 3D space, instead a local 3D
sampling of rotations should first be explored to satisfy unsatisfied polar groups, as
this approach is more likely to stay close to the true ligand-bound conformation.
This also conveys that nature too may resist big rotameric shifts in conformation,
since that involves overcoming high strain energy barriers due for the rotamer tran-
sitions, and substantial repacking of the interface. Hence though rotamer sampling
helped improve scores and hydrogen-bond interactions, across a substantial number
of dockings, it was observed that side-chain conformation prediction actually wors-
ened due to ambiguity in choice of the correct rotamer given that several rotamers
could result in the same score improvement. Furthermore, even the most positionally
correct rotamer was often further from the ligand-bound position than was the initial,
collision-resolved side chain. The metrics for choosing a rotamer were to maximize
the interactions as measured by the new scoring function, and to choose rotamers
that are observed the most frequently in nature. There can be no a priori knowledge
within rotamer sampling or optimization to recognize which side-chain conformations
are close to ligand-bound conformations. Recognizing that the target side—chain con-
97
formations are typically within 1 A RMSD of the initial position, finer and more local
conformational sampling may be all that is required.
98
Chapter 6
Summary and Future Directions
Even though enhancing protein side-chain interactions remains the goal of this
work, the ultimate goal of docking tools is to model motions to predict crystal complex
configuration using apo protein conformation and unbound ligands. Towards these
ends, our key contributions have been:
0 Discovering that the intra—protein hydrogen bond network in the binding site
largely remains conserved upon ligand binding. Most of the rearrangements in
the hydrogen bond network increase the number of hydrogen bonds or serve
as a ligand-recognition mechanism by replacing protein-water hydrogen bonds
with protein-ligand hydrogen bonds.
0 Observing, and corroborating, that SLIDE’s paradigm of minimal rotations
for induced-fit docking[50] performs well for preserving most of the hydrogen
bond network. However, favoring minimal rotations misses 15% of the cases
where larger motions are needed to improve docking chemistry, as found from
comparing ligand-free and ligand-bound crystal structures.
0 Analyzing ligand-free and ligand-bound crystal structures for the reasons be-
hind large—angle rotations in side chains upon ligand binding. For half of the
99
cases, large rotations aided in satisfying buried polar groups which would have
remained unpaired in the protein-ligand interface if they had remained in their
ligand-free conformations. The current criteria for selecting side chains for
conformational optimization - steric complementarity and satisfying unpaired
hydrogen-bond donors and acceptors - could be easily expanded to include other
cases of suboptimal chemical complementarity like side chains with repulsive
contacts or side chains with solvent exposed hydrophobic groups.
0 Implementing in SLIDE an infrastructure for modeling these larger motions
through the use of rotamer libraries.
0 Determining that rotamer-based sampling typically moves side chains too much,
and it can be difficult to determine the correct rotamer when several rotamer
choices improve the protein-ligand complementarity score equivalently. In gen-
eral, the induced-fit docked conformation proved to be closer to the ligand-
bound conformation than were any of the choices in the rotamer library. Thus,
local rotational sampling (going beyond minimal rotations for removing van der
Waals collisions) is likely to be a better approach for repositioning interfacial
side chains to satisfy their hydrogen-bonding potential.
6.1 Future Directions
Future work can build upon the insights and algorithmic and geometric frame-
work that has resulted from this work. Specifically:
0 Local, directed rotations. Akin to directed rotations in induced-fit to resolve
collisions, local rotations can be used to explore potential hydrogen bonds and
salt bridges with nearby protein and ligand atoms.
100
e Ligand flexibility. While optimizing side-chain placement has been the focus
in this work, the ligand’s rotatable bonds present equally probable opportuni-
ties to enhance binding-site chemical complementarity. Exploring motions in
the ligand in balance with those in the protein would also help model ligand
flexibility in the docking process.
0 Unbiased protein template. While we conducted experiments using an active-
site template derived from ligands known to bind the protein in nature, SLIDE’s
typical use is to find new potential ligands. Hence unbiased templates (which
include no knowledge of known ligands’ interactions) should also be used to
validate performance improvements from such local sampling.
0 Residue selection criteria. As mentioned before, about half of the large side-
chain motions in crystal structures were not detected by the current criteria to
select residues for flexible modeling. These motions can be investigated further
to learn the reasons for their motions and define criteria to identify such residues.
0 Handling water molecules. Experiments in this work did not consider water
molecules during the docking process. Including interfacial conserved waters
should help in correctly identifying if side chains have unsatisfied polar groups
or not. This should improve results by avoiding search for better geometries of
polar side chains already satisfied by water, as well as recognizing water-exposed
hydrophobic side chains in the protein-ligand interface.
0 Scoring function discrimination between alternative conformations. The current
scoring function was trained on crystal complexes with known binding affini-
ties. These complexes are unlikely to offer many cases with repulsive contacts
or unsatisfied polar atoms left due to rotations or water displacements, even
though such scenarios are bound to occur frequently during docking. A scoring
function, trained on data that includes favorable as well as unfavorable con-
101
tacts may better differentiate good conformations from bad ones in the same
neighborhood.
With these thoughts on future opportunities, we would like to conclude this
thesis. This work is still in progress, and we are excited to pursue the above directions
and continue working towards improving state of the art in protein-ligand docking.
102
BIBLIOGRAPHY
[1] Adrian A. Canutescu, Andrew A. Shelenkov, and Roland L. Dunbrack Jr. A
graph-theory algorithm for rapid protein side-chain prediction. Protein Science,
12:2001—2014, 2003.
[2] Heather A. Carlson, Kevin M. Musukawa, Kathleen Rubins, Fredric D. Bush-
man, William L. Jorgensen, Robert D. Lins, James M. Briggs, and J. Andrew
McCammon. Developing a Dynamic Pharamacophore Model for HIV -1 Integrase.
Journal of Medicinal Chemistry, 43:2100—2114, 2000.
[3] H. Claussen, C. Buning, M. Rarey, and T. Lengauer. FlexE: Efficient molecular
docking considering protein structure variations. Journal of Molecular Biology,
308(2):377—395, 2001.
[4] Bassil I. Dahiyat, D. Benjamin Gordon, and Stephen L. Mayo. Automated design
of the surface positions of protein helices. Protein Science, 6:1333—1337, 1997.
[5] R. L. DesJarlais, R. P. Sheridan, G. L. Seibel, J. S. Dixon, 1. D. Kuntz, and
R. Venkataraghavan. Using shape complementarity as an initial screen in de-
signing ligands for a receptor binding site of known three-dimensional structure.
Journal of Medicinal Chemistry, 31(4):722—729, 1988.
[6] Johan Desmet, Marc De Maeyer, and Ignace Lasters. The Dead-End Elimination
Theorem: A New Approach to the Side-Chain Packing Problem. The Protein
Folding Problem and Tertiary Structure Prediction, Birkhauser, pages 307—338,
1994.
[7] I. D.Kuntz, J. M. Blaney, S. J. Oatley, R. Langridge, and T. E. Ferrin. A geo-
metric approach to macromolecule—ligand interactions. Journal of Computational
Chemistry, 19:269—288, 1982.
[8] Matthew D. Eldridge, Christopher W. Murray, Timothy R. Anton, Gaia V.
Paolini, and Roger P. Mee. Empirical scoring functions: I. The development of
a fast empirical scoring function to estimate the binding affinity of ligands in
receptor complexes. Journal of Computer-Aided Molecular Design, 11:425—445,
1997.
[9] T. Ewing and I. D. Kuntz. Critical evaluation of search algortihms for automated
molecular docking and database screening. Journal of Computational Chemistry,
18:1175—1189, 1997.
[10] D. Fischer, S. L. Lin, H. L. Wolfson, and R. Nussinov. A geometry based suite of
molecular docking processes. Journal of Molecular Biology, 248:459—477, 1995.
103
[11] D. K. Gehlhaar, G. M. Verkhivker, P. A. chto, C. J. Sherman, D. B. Fogel,
L. J. Fogel, and S. T. Freer. Molecular recognition of the inhibitor AG—1343
by HIV-1 protease: Conformationally flexible docking by evolutionary prograrn-
ming. Chem. Biol, 2(5):317—324, 1995b.
[12] Holger Gohlke, Manfred Hendlich, and Gerhard Klebe. Knowledge-based Scoring
anction to Predict Protein-Ligand Interactions. Journal of Molecular Biology,
295:337—356, 2000.
[13] P. J. Goodford. A computational procedure for determining energetically favor-
able binding sites on biologically important macromolecules. Journal of Medici-
nal Chemistry, 13(7):849—857, 1985.
[14] Jeep Heringa and Patrick Argos. Strain in Protein Structures as Viewed Through
Nonrotameric Side Chains: 1. Their Position and Interaction. Proteins: Struc-
ture, Function, and Genetics, 37:30—43, 1999.
[15] J aap Heringa and Patrick Argos. Strain in Protein Structures as Viewed Through
Nonrotameric Side Chains: II. Upon Ligand Binding. Proteins: Structure, Func-
tion, and Genetics, 37:44—55, 1999.
[16] Brandon M. Hespenheide, A. J. Rader, M. F. Thorpe, and Leslie A. Kuhn.
Identifying protein folding cores from the evolution of flexible regions during
unfolding. Journal of Molecular Graphics and Modelling, 21:195—207, 2002.
[17] Thomas Huber, Andrew E. Torda, and Wilfred F. van Gunsteren. Optimization
Methods for Conformational Sampling Using a Boltzmann-Weighted Mean Field
Approach. Biopolymers, 39:103—114, 1996.
[18] OpenEye Scientific Software Inc. Quality Atomic Charges, Proton Assignment
and Canonicalization . http://www.eyesopen.com/docs/html/quacpac, 2004.
[19] Richard M. Jackson, Henry A. Gabb, and Michael J. E. Sternberg. Rapid Refine
ment of Protein Interfaces Incorporating Solvation: Application to the Docking
Problem. Journal of Molecular Biology, 276:265—285, 1998.
[20] D. J. Jacobs, A. J. Rader, L. A. Kuhn, and M. F. Thorpe. Protein flexibility
predictions using graph theory. Proteins: Structure, Function and Genetics,
44:150—165, 2001.
[21] G. Jones, P. Willett, and R. C. Glen. Molecular recognition of receptor sites
using a genetic algorithm with a description of desolvation. Journal of Molecular
Biology, 245:43—53, 1995.
[22] G. Jones, P. Willett, R. C. Glen, A. R. Leach, and R. Taylor. Development
and validation of a genetic algorithm for flexible docking. Journal of Molecular
Biology, 3:727—748, 1997.
104
[23] D. E. Koshland Jr. Application of a Theory of Enzyme Specificity to Protein
Synthesis. Proceedings of National Academy of Science, USA, 44:98-123, 1958.
[24] Roland L. Dunbrack Jr. Rotamer libraries in the 21" century. Current Opinion
in Structural Biology, 12:431—440, 2002.
[25] D. Keller, M. Shibata, E. Markus, R. Ornstein, and R. Rein. Finding the global
minimum: A fuzzy end elimination implementation. Protein Engineering, 8:893—
904, 1995.
[26] P. Koch] and M. Delarue. Mean-field minimization methods for biological macro-
molecules. Current Opinion in Structural Biology, 6:222—226, 1996.
[27] L. A. Kuhn, C. A. Swanson, M. E. Pique, J. A. Tainer, , and E. D. Getzofl'.
Atomic and Residue Hydrophilicity in the Context of Folded Protein Structures.
Proteins: Structure, Function, and Genetics, 23:536—547, 1995.
[28] I. Lasters and J. Desmet. The fuzzy-end elimination theorem: correctly imple-
menting the side chain placement algorithm based on the dead-end elimination
theorem. Protein Engineering, 6:717—722, 1993.
[29] Ming Lei, Maria I. Zavodszky, Leslie A. Kuhn, and Michael F. Thorpe. Sampling
Protein Conformations and Pathways. Journal of Computational Chemistry,
25(9):]133—48, 2004.
[30] D. M. Lorber and B. K. Shoichet. Flexible ligand docking using conformational
ensembles. Protein Science, 7:938—950, 1998.
[31] I. K. McDonald and J. M. Thornton. Satisfying hydrogen bonding protein in
proteins. J. Mol. Biol, 238:777—793, 1994.
[32] E. C. Meng, B. K. Shoichet, and I. D. Kuntz. A geometric approach
to macromolecule—ligand interactions. Journal of Computational Chemistry,
13(6):505-524, 1992.
[33] G. M. Morris, D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew,
and A. J. Olson. Automated docking using a Lamarckian genetic algorithm and
an empirical binding free energy function. Journal of Computational Chemistry,
19:1639—1662, 1998.
[34] Rafael Najmanovich, Joseph Kuttner, Vladimir Sobolev, and Marvin Edelman.
Side-Chain Flexibility in Proteins Upon Ligand Binding. Proteins: Structure,
Function, and Genetics, 39:261-268, 2000.
[35] R. Norel, D. Fischer, H. Wolfson H, and R. N ussinov. Molecular surface recogni-
tion by a computer vision based technique. Protein Engineering, 7:39—46, 1994.
[36] Protein Structural Analysis and Design Lab, MSU. Quick Guide to SLIDE
Version 2.30a. http://www.bch.msu.edu/kuhn/projects/slide/user_manual.html,
2005.
105
[37] Olivier Roche, Ryuichi Kiyama, and III Charles L. Brooks. Ligand-Protein
DataBase: Linking Protein-Ligand Complex Structures to Binding Data. Jour-
nal of Medicinal Chemistry, 44:3592—3598, 2001.
[38] Adrian Roitberg and Ron Elber. Modeling side chains in peptides and proteins:
Application of the locally enhanced sampling and the simulated annealing meth-
ods to find the minimum energy conformations. Journal of Chemical Physics,
95:9277—9287, 1991.
[39] Paul C. Sanschagrin. Computational Techniques for Modeling Protein-Ligand
Interactions and their Application to Serine Proteases and Asparaginyl-tRN A
Synthetase. Ph. D. dissertation, Michigan State University, 2001.
[40] Lana Schaffer and Gennady M. Verkhivker. Predicting Structural Effects on
HIV-1 Protease Mutant Complexes With Flexible Ligand Docking and Protein
Side-Chain Optimization. Proteins: Structure, Function, and Genetics, 33:295-
310, 1998.
[41] Volker Schnecke and Leslie A. Kuhn. Flexibly Screening for Molecules Interacting
with Proteins. Rigidity in Theory and Applications, Plenum Publishing, New
York, 385-400, 1999.
[42] Volker Schnecke and Leslie A. Kuhn. Virtual screening with salvation and ligand-
induced complementarity. Perspectives in Drug Discovery and Design, 20:171-
190, 2000.
[43] Volker Schnecke, Craig A. Swanson, Elizabeth D. Getzoff, John A. Tainer, and
Leslie A. Kuhn. Screening a Peptidyl Database for Potential Ligands to Proteins
With Side-Chain Flexibility. Proteins: Structure, Function, and Genetics, 33:74-
87, 1998.
[44] B. K. Shoichet, D. L. Bodian, and I. D. Kuntz. Molecular docking using shape
descriptors. Journal of Computational Chemistry, 13:380—397, 1992.
[45] D. Stickle, L. Presta, K. Dill, and G. Rose. Hydrogen bonding in globular
proteins. J. Mol. Biol, 226:1143—1159, 1992.
[46] P. Tuffrey, C. Etchebest, S. Hazout, and R. Lavery. A new approach to the
rapid determination of protein side-chain conformations. Journal of Biomolecular
Structural Dynamics, 8(6):1267—89, 1991.
[47] G. Vriend. WHAT IF: A molecular modeling and drug design program. Journal
of Mololecular Graphics, 8:52-56, 1990.
[48] Renxiao Wang, Xueliang Fang, Yipin Lu, and Shaomeng Wang. The PDBbind
Database: Collection of Binding Affinities for Protein-Ligand Complexes with
Known Three-Dimensional Structures. Journal of Medicinal Chemistry, 47:2977—
2980, 2004.
106
[49] H. J. Wolfson and Y. Lamdan. Geometric hashing: A general and efficient model-
based recognition scheme. IEEE Internation Conference on Computer Vision,
Tampa, Fl, pages 238-249, 1988.
[50] Maria I. Zavodszky and Leslie A. Kuhn. Side-Chain Flexibility in Protein—Ligand
Binding: The Minimal Rotation Hypothesis. Protein Science, 1421104 — 1114,
2005.
[51] Maria I. Zavodszky, Paul C. Sanschagrin, Rajesh S. Korde, and Leslie A. Kuhn.
Distilling the Essential Features of a Protein Surface for Improving Protein-
Ligand Docking, Scoring, and Virtual Screening. Journal of Computer-Aided
Molecular Design, 16:883—902, 2002.
107
[[[r[[[[[[[[[[[[[[ii