A COMPARATIVE STUDY OF MACHINE-LEARNING-BASED SCORING FUNCTIONS IN PREDICTING PROTEIN-LIGAND BINDING AFFINITY

By

Hossam Mohamed Farg Ashtawy

A THESIS

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

Electrical Engineering

2011

ABSTRACT

A COMPARATIVE STUDY OF MACHINE-LEARNING-BASED SCORING FUNCTIONS IN PREDICTING PROTEIN-LIGAND BINDING AFFINITY

By Hossam Mohamed Farg Ashtawy

Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes is a key challenge in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Since a scoring function (SF) is used to score, rank, and identify drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site and its computational complexity have a significant bearing on the accuracy and throughput of virtual screening. Despite intense efforts in this area, so far there is no universal SF that consistently outperforms others. Therefore, in this work, we explore a range of novel SFs employing different machine-learning (ML) approaches in conjunction with a diverse feature set characterizing protein-ligand complexes. We assess the scoring and ranking accuracies and computational complexity of these new ML-based SFs, as well as those of conventional SFs, in the context of the 2007 and 2010 PDBbind benchmark datasets. We also investigate the influence of the size of the training dataset, the number of features, the protein family, and the novelty of the protein target on scoring accuracy. Furthermore, we examine the interpretive power of the best ML-based SFs. We find that the best performing ML-based SF has a Pearson correlation coefficient of 0.797 between predicted and measured binding affinities, compared to 0.644 achieved by a state-of-the-art conventional SF. Finally, we show that there is potential for further improvement in our proposed ML-based SFs.

Table of Contents

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Background
1.1.1 Drug Design
1.1.2 Virtual Screening
1.1.3 Molecular Docking and Scoring Functions
1.1.4 Types of Scoring Functions
1.1.5 Scoring Challenges
1.2 Motivation and Problem Statement
1.3 Related Work
1.4 Thesis Organization
2 Materials and Methods
2.1 Compounds Database
2.2 Compounds Characterization
2.3 Binding Affinity
2.4 Training and Test Datasets
2.5 Machine Learning Methods
2.5.1 Single Models
2.5.2 Ensemble Models
2.6 Scoring Functions under Comparative Assessment
3 Results and Discussion
3.1 Evaluation of Scoring Functions
3.2 Parameter Tuning and Cross-Validation
3.2.1 Multivariate Adaptive Regression Splines (MARS)
3.2.2 k-Nearest Neighbors (kNN)
3.2.3 Support Vector Machines (SVM)
3.2.4 Random Forests (RF)
3.2.5 Neural Forests (NF)
3.2.6 Boosted Regression Trees (BRT)
3.3 Tuning and Evaluating Models on Overlapped Datasets
3.4 Machine-Learning vs. Conventional Scoring Functions
3.5 Dependence of Scoring Functions on the Size of Training Set
3.6 Dependence of Scoring Functions on the Number of Features
3.7 Family Specific Scoring Functions
3.8 Performance of Scoring Functions on Novel Targets
3.9 Interpreting Scoring Models
3.9.1 Relative Importance of Protein-Ligand Interactions on Binding Affinity
3.9.2 Partial Dependence Plots
3.10 Scoring Throughput
4 Concluding Remarks and Future Work
4.1 Thesis Summary
4.2 Findings and Suggestions
4.3 Looking Ahead
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Protein-ligand interactions used to model binding affinity.
Table 2.2 Training and test sets used in the study.
Table 3.1 Optimal parameters for MARS, kNN, SVM, RF, NF and BRT models.
Table 3.2 Results of tuning and evaluating models on overlapped datasets.
Table 3.3 Comparison of the scoring power of 23 scoring functions on the core test set Cr.
Table 3.4 Comparison of the ranking power of scoring functions on the core test set Cr.
Table 3.5 Performance of several machine-learning-based scoring functions on a compound test set described by different features.
Table 3.6 Ranking accuracy of scoring functions on the four additional test sets. The test samples overlap the training data.
Table 3.7 Ranking accuracy of scoring functions on the four additional test sets. The test samples are independent of the training data.
Table 3.8 Performance of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes for different cutoff ranges.
Table 3.9 Feature extraction time (in seconds) and performance (RP) of some prediction techniques fitted to these features.
Table 3.10 Training and prediction times of different scoring techniques.

LIST OF FIGURES

Figure 1.1 Time and cost involved in drug discovery [2]. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis.
Figure 1.2 A binding pose of a ligand in the active site of a protein [8].
Figure 1.3 Crystal structure of HIV-1 protease in complex with inhibitor AHA024 [9].
Figure 1.4 The drug design process.
Figure 2.1 Data preparation workflow.
Figure 2.2 Multi-layered perceptron, feed-forward neural network.
Figure 2.3 BRT algorithm.
Figure 3.1 Workflow of scoring function construction.
Figure 3.2 10-fold cross-validation.
Figure 3.3 Parameter tuning algorithm.
Figure 3.4 Tuning the parameter GCV-penalty in MARS.
Figure 3.5 Tuning the parameters k and q in kNN.
Figure 3.6 Tuning the parameters C, ε and σ in SVM.
Figure 3.7 Tuning the parameter mtry in RF.
Figure 3.8 Tuning the parameters mtry and the weight decay (λ) in NF.
Figure 3.9 Tuning the parameter Interaction Depth in BRT.
Figure 3.10 Algorithm for tuning and evaluating models on overlapped datasets.
Figure 3.11 Dependence of XAR-based scoring functions on the size of training set.
Figure 3.12 Dependence of X-based scoring functions on the size of training set.
Figure 3.13 Algorithm to assess the impact of data dimension on performance of scoring functions.
Figure 3.14 Data dimension impact on performance of scoring functions.
Figure 3.15 Dependence of scoring functions on the number of known binders to HIV protease in the training set.
Figure 3.16 Performance of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes.
Figure 3.17 Sensitivity of RF::XAR scoring model accuracy to size of training dataset and BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes.
Figure 3.18 The most important ten features identified by RF and BRT algorithms. Refer to Table 2.1 for feature description.
Figure 3.19 The least important five features identified by RF and BRT algorithms. Refer to Table 2.1 for feature description.
Figure 3.20 Marginal dependence plots of most relevant four features.
Figure 3.21 Marginal dependence plots of least relevant four features.
Figure 3.22 The RF::XAR scoring model's performance as a function of inclusion of increasing numbers of the most important features.
Figure 4.1 Performance of consensus scoring functions on the core test set Cr.
CHAPTER 1

Introduction

Protein-ligand binding affinity is the principal determinant of many vital processes, such as cellular signaling, gene regulation, metabolism, and immunity, which depend upon proteins binding to some substrate molecule. Accurately predicting the binding affinities of large sets of diverse protein-ligand complexes remains one of the most challenging unsolved problems in computational biomolecular science, with applications in drug discovery, chemical biology, and structural biology. Computational molecular docking involves docking tens of thousands to millions of ligand candidates into a target protein receptor's binding site and using a suitable scoring function to evaluate the binding affinity of each candidate in order to identify the top candidates as leads or promising protein inhibitors. Since a scoring function is used to score, rank, and identify drug leads, the fidelity with which it predicts the affinity of a ligand candidate for a protein's binding site and its computational complexity have a significant bearing on the accuracy and throughput of virtual screening. Despite intense efforts in this area, so far there is no universal scoring function that consistently outperforms others. Therefore, in this work, we explore a range of novel scoring functions employing different machine-learning (ML) approaches in conjunction with a diverse feature set characterizing protein-ligand complexes, with a view toward significantly improving scoring accuracy compared to that achieved by existing conventional scoring functions.

In this chapter, we first briefly describe the process of designing new drugs and introduce techniques used to expedite this task. These techniques include high-throughput screening and computational methods such as virtual screening and molecular docking. Scoring functions, which are a key component of molecular docking, are then presented, and their types and limitations are briefly covered. The challenges associated with conventional scoring functions have motivated us and other researchers to build more accurate and robust scoring models. A more detailed motivation behind this work and a review of related studies are discussed at the end of the chapter.

1.1 Background

1.1.1 Drug Design

The first step in designing a new drug for many diseases is accurately identifying the enzyme, receptor, or other protein responsible for altering the biological functioning of the signaling pathway associated with the disease. This protein must be a critical component in the activity of the cell linked to the disease. The main idea of drug discovery is then to find chemical molecules that stimulate the protein so as to modulate the cell's function. Such molecules (drugs) should be able to inhibit or activate the cell's cycle or a specific function. Furthermore, proteins stimulated by such drugs should be unique, and their function should not be substitutable by other proteins in the cell.

In the early days, natural products derived from animals and plants (e.g., alkaloids and terpenoids) were the dominant sources of drugs. Over the years, new synthetic and semi-synthetic methodologies have been developed to generate large numbers of competitive and effective drug-like compounds. The field under which these methodologies lie is known as combinatorial chemistry, which is widely utilized by pharmaceutical companies to help produce their in-house compound libraries.
Automated devices and robots are typically used to accelerate the process of synthesizing and screening chemical compounds from these large databases. This process is generally known as High Throughput Screening (HTS), which in the context of drug design is used to rapidly synthesize and verify the activity of small molecules (ligands) against a given protein. Although HTS technology has significantly shortened the time and reduced the cost of discovering new drugs, it is still not fast and cheap enough to screen the large and ever-growing compound databases.

Figure 1.1 Time and cost involved in drug discovery [2]: target identification (2.5 years, 4% of cost), lead generation and lead optimization (3 years, 15%), preclinical development (1 year, 10%), Phase I, II, and III clinical trials (6 years, 68%), and FDA review and approval (1.5 years, 3%), for a total of about 14 years and $880 M to bring a drug to market. For interpretation of the references to color in this and all other figures, the reader is referred to the electronic version of this thesis.

Discovery and development of a drug requires more than 10 years of research and development on average and a budget that varies from hundreds of millions to tens of billions of dollars [1], [2]. Even with this prohibitive cost in time and money, the clinical success rate does not exceed 25% [3]. Such high expenses are mainly due to screening large numbers of drug-like compounds against the given target, follow-up assays for investigating toxicological and pharmacokinetic properties, and clinical trials. Figure 1.1 shows the cost and time associated with the main stages of drug discovery, based on data collected by PARAXEL several years ago [2].

1.1.2 Virtual Screening

In order to reduce laborious and costly hit-identification (or lead optimization) search campaigns, computational methods are widely employed to filter large in-house compound libraries into smaller and more focused ones. Such virtual screening methods are used as an integral component of HTS. Virtual screening narrows down the number of ligands to be tested experimentally by orders of magnitude, which in turn can dramatically shorten the drug discovery timeline and reduce the associated costs. In virtual screening, information about the ligand, the protein, or both, such as structure and physicochemical properties, is used to infer the activity of interest that would likely take place in a real physical setting. The essential biological interaction between a ligand and a target protein in the early phases of drug design is normally represented by the binding affinity (or binding free energy). Binding affinity determines whether a ligand can bind to the given protein and, if so, with how much efficacy and potency. The most popular approach to predicting binding energy in virtual screening is structure-based, in which the physicochemical interactions between a ligand and the target protein are extracted from the 3D structures of both molecules. Three-dimensional structures of proteins are typically obtained using X-ray crystallography [4], NMR spectroscopy [5], or homology models [6], [7]. This in silico method is also known as protein-based, as opposed to the alternative approach, ligand-based, in which only ligands similar to the ones known to bind to the target are screened. Such ligand-based screening ignores valuable information about the target and exclusively considers compounds exhibiting common biochemical features. This limited and biased (toward the known ligands) search may not identify novel chemicals as hits.
However, it is the method of choice when a 3D structure of the target is not available.

Figure 1.2 A binding pose of a ligand in the active site of a protein [8].

Figure 1.3 Crystal structure of HIV-1 protease in complex with inhibitor AHA024 [9].

1.1.3 Molecular Docking and Scoring Functions

In this work, the focus will be on protein-based drug design, where ligands are placed into the active site of a receptor, as depicted in Figures 1.2 and 1.3 [8], [9]. Figure 1.2 is a solvent-accessible surface representation of a ligand's pose in the protein's active site, and Figure 1.3 illustrates a ribbon diagram of an inhibitor (AHA024) binding to HIV-1 protease (the protein). The orientation of a molecule and its position in the active region of the protein's 3D structure is known as a conformation. The process in which a set of ligand conformations is generated and placed in the binding pocket is generally known as docking. Every ligand conformation (or pose) generated during the course of docking is evaluated, and the binding mode of the ligand is identified. Typically, an internal engine built into the docking tool, known as a scoring function, is used to discriminate promising binding modes from poor ones. Such a function takes a specific ligand conformation in the protein's binding cavity and estimates a score that represents the binding free energy of that orientation. The outcome of the docking run is therefore a set of ligands ranked according to their predicted binding scores. Top-ranked ligands are considered the most promising drug candidates, so they proceed to the next phase, where they will be synthesized and physically screened using HTS. Figure 1.4 depicts a simplified view of the drug design process.

The success of a docking simulation relies upon the effectiveness of the docking algorithm and the quality of the utilized scoring function. Various docking protocols adopt different strategies for exploring the conformational space of ligands and their targets. The treatment of the ligand and protein molecules in terms of their flexibility is one of the distinguishing features of a docking method. The most rapid docking techniques usually assume a rigid ligand interacting with a rigid target, while some tools at the other extreme take the flexibility of both molecules into consideration. Although the latter paradigm yields more accurate results, it is computationally demanding and therefore fails to provide one of the most attractive features of virtual screening: speed. To trade off accuracy against throughput, mainstream docking systems assume the target is held fixed or minimally flexible in its crystal structure and simulate the conformational flexibility of the ligands [10], [11], [12].

Figure 1.4 The drug design process: starting from a disease, a target is identified; if a 3D structure of the target is known, compounds from a database are docked against it (protein-based screening), otherwise known actives are used for similarity searching (ligand-based screening); the resulting hits then pass through high-throughput screening, toxicological and pharmacokinetic analysis, and clinical trials on the way to a drug.

1.1.4 Types of Scoring Functions

No matter how exhaustive and efficient the docking algorithm is in exploring the conformational space, the entire in silico run may not reveal interesting hits unless the scoring engine is effective enough. The scoring function should be precise and fast in predicting the binding energy of protein-ligand snapshots from the vast conformational space.
Precision and speed are inherently conflicting requirements. To make scoring functions more accurate, they have to capture deeper details of the interacting forces between a protein and its ligand. Thorough accommodation of the factors involved in determining binding affinity is computationally intensive and therefore practically infeasible for virtual screening. Several scoring schemes have been formulated that attempt to calculate accurate binding energies within reasonable time limits. The most popular schemes can be categorized under three major families: force-field, empirical, and knowledge-based scoring functions.

In force-field methods, the potential of a system is formulated by calculating the physical atomic interactions between the protein and ligand, which include van der Waals forces, bonding and nonbonding interactions, and electrostatic energies. The parameters that define these intermolecular interactions are derived from experimental data and ab initio simulations [13]. The limitations of this class of functions arise from inaccurate implicit accounting for solvation and enthalpy, and from the absence of entropy [8], [14]. Examples of well-known docking programs that employ this class of scoring functions are DOCK [10], [15], [16], [17] and AutoDock [18], [19].

Empirical scoring functions adopt a different philosophy for calculating the free energy of the system. The total energy in this type of function is assumed to be composed of weighted energy terms, each of which describes a chemically intuitive interaction such as hydrogen bonding, the desolvation effect, van der Waals interactions, hydrophobicity, or entropy. The weights are typically calibrated via linear regression by fitting to a training dataset of known experimental binding energies [20]; a minimal illustration of this calibration is given at the end of this subsection. Due to its intrinsic simplicity, empirical scoring is the method of choice when screening speed is of great importance [21].

Finally, knowledge-based scoring is founded on the idea that large databases of protein-ligand complexes can be statistically mined to deduce rules and models that are implicitly embedded in the data. Distance-dependent occurrences of certain atom pair types are counted from experimentally determined protein-ligand structures and used as descriptors or free energies. There are two common approaches to utilizing these descriptors in predicting the binding energy. The first is to sum the free energies of the interatomic contacts obtained according to the inverse Boltzmann relation [22], [23]; in this case, experimental measurements of binding affinity are not required. Examples of this approach are PMF and DrugScore [24], [25], [26], [27], [28], [29]. The second type of knowledge-based scoring function is inspired by empirical models: the descriptors are used by a machine-learning method to fit the binding affinities of a training set. Ballester and Mitchell adopted this technique in building their RF-Score function [30]. The accuracy and speed of knowledge-based scoring functions are comparable to those of empirical functions [1].
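To make the empirical scheme concrete, the following minimal sketch (Python, with made-up term names and values; it is not the calibration procedure of any particular published scoring function) fits the weights of such a weighted-sum model to measured affinities by ordinary least squares:

import numpy as np

# Hypothetical energy terms for five training complexes (rows); the four columns stand for
# H-bonding, van der Waals, hydrophobic contact, and rotor-count terms. Values are illustrative only.
terms = np.array([
    [4.0, 120.5, 35.0, 3.0],
    [2.0,  90.1, 22.0, 6.0],
    [6.0, 150.3, 48.0, 2.0],
    [1.0,  60.7, 15.0, 8.0],
    [3.0, 110.9, 30.0, 4.0],
])
measured_affinity = np.array([7.2, 5.1, 8.9, 3.8, 6.4])   # e.g., experimental pKd values

# Append a constant column for the intercept and solve for the weights by least squares.
A = np.hstack([terms, np.ones((terms.shape[0], 1))])
weights, *_ = np.linalg.lstsq(A, measured_affinity, rcond=None)

# The score of a new complex is the weighted sum of its terms (last entry is the intercept's 1).
new_terms = np.array([3.5, 100.0, 28.0, 5.0, 1.0])
predicted_affinity = float(new_terms @ weights)

Conventional empirical scoring functions differ mainly in which terms enter this sum and in how each term is computed; the linear form of the model itself is the assumption that the nonparametric approaches explored later in this thesis relax.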
1.1.5 Scoring Challenges

Despite the aforementioned variety of docking methods and scoring functions, several recent studies report that there is no superior docking tool (docking algorithm and scoring function) that outperforms its counterparts [31], [32]. This can be attributed to imperfections in the docking process as well as deficiencies in scoring. In addition to the limited amount of protein flexibility considered during docking, which restricts predictive power, the other source of the problem can be traced to the capability of the employed scoring function. A lack of full and accurate accounting of intermolecular physicochemical interactions and improper solvent modeling are believed to be the main causes of errors in computing binding energies. The experimental data to which regression techniques are fitted also pose a challenge to the quality of the resulting scoring models. More specifically, uncertainties may be associated with the collected binding affinities of the protein-ligand complexes that are typically used to build scoring functions: binding affinities are usually assembled from several research groups who obtain such values under different experimental conditions [8].

1.2 Motivation and Problem Statement

It has been difficult to capture the unknown relationship correlating binding affinity with the interactions between a ligand and its receptor. Conventional empirical scoring functions rest on the hypothesis that a linear regression model is capable of computing the binding affinity. Such an assumption fails to capture the intrinsic nonlinearities latent in the data on which such models are calibrated. Instead of assuming a predetermined theoretical function that governs the unknown relationship between different energetic terms and binding affinity, one can employ more flexible methodologies that are generally referred to as nonparametric models. In these techniques, the underlying unknown function is "learnt" from the data itself, and no assumption regarding its statistical distribution is made. Various nonparametric machine-learning methods inspired by statistical learning theory are examined in this work to model the unknown function that maps the structural and physicochemical information of a protein-ligand complex to a corresponding binding energy value. In order to make the derived models capture more information about protein-ligand interactions, we considered a variety of descriptors that are normally employed either for knowledge-based or for empirical scoring functions. Namely, these include the hydrogen bonding, hydrophobic forces, and van der Waals interactions employed in empirical scoring functions, and QSAR (quantitative structure-activity relationship) descriptors such as the pairwise atomic statistics used in knowledge-based scoring functions.

We seek to advance structure-based drug design by designing scoring functions that significantly improve upon the protein-ligand binding affinity prediction accuracy of conventional scoring functions. Our approach is to couple the modeling power of flexible machine learning algorithms with training datasets comprising thousands of protein-ligand complexes with known high-resolution 3D crystal structures and experimentally determined binding affinities, and with a variety of features characterizing the complexes. We will compare the predictive accuracies of several machine-learning-based scoring functions and existing conventional scoring functions of all three types (force-field-based, empirical, and knowledge-based) on diverse and independent test sets.

In similar studies reported in the literature, the focus was more or less on the bioactivity of a ligand when it binds to its receptor. In such cases, the given ligand is classified as either an active or an inactive molecule for the target protein.
Although in other works the binding affinity as a numerical value was of interest, the data used to train and test the predictive power of the different techniques was either restricted to very few protein families or not as ample and diverse as the datasets we exploit in our experiments, or both. Furthermore, the abundance of different techniques for extracting features that characterize protein-ligand complexes has motivated us to experiment with building models trained on diverse sets of descriptors.

More specifically, in this work, we will evaluate and compare the performance of different conventional and machine-learning-based scoring functions. The issues we will explore include:

• Assessment of the scoring and ranking accuracies of scoring functions based on comparisons between predicted and experimentally determined binding affinities.
• Dependence of scoring function accuracy on the size of the training data.
• Dependence of scoring function accuracy on the number of features describing protein-ligand complexes.
• Performance of scoring functions on family-specific proteins and novel targets.
• Descriptive power of scoring functions (i.e., their interpretability).
• Throughput of scoring functions in terms of the time required to extract the features and calculate binding affinity.

1.3 Related Work

Several comparative studies of different prediction methods applied in virtual screening show that there is no universal best method. This includes both classification- and regression-based approaches. Even when the task is to predict the binding energy of protein-ligand complexes using the same family of scoring functions, say empirical techniques, there are cases where some functions do better and other scenarios in which those functions are outperformed by others. Here we review the most relevant comparative studies that have assessed the performance of different prediction methodologies in virtual screening and drug design.

Cheng and co-workers recently conducted an extensive test of 16 scoring functions employed in mainstream docking tools and researched in academia [32]. The team analyzed the performance of each scoring function in predicting the binding energy of protein-ligand complexes whose high-resolution structures and binding constants are experimentally known. The main test set used in this study consisted of 195 diverse protein-ligand complexes and four other protein-specific sets. Considering different evaluation criteria and test datasets, they concluded that no single scoring function was superior to the others in every aspect. In fact, the best scoring function in terms of predicting binding constants most correlated with the experimentally determined ones was not even in the top five when the goal was to identify the correct binding pose of the ligand. These findings agree to some extent with an earlier study conducted by Wang et al. that considered very similar data sets and scoring functions [33].

In both studies mentioned above, the scoring functions examined were force-field-based, empirical, or knowledge-based, but none was based on sophisticated machine learning algorithms. Recently, Ballester and Mitchell took advantage of the diverse test set collected by Cheng et al. as a benchmark to evaluate their RF-Score scoring function [30].
It is a random forest model fitted to geometric features that count the number of occurrences of particular atom pairs within a predefined distance [34]. The results of this work indicate that the random forest model is significantly more accurate than the 16 scoring functions considered in Cheng et al.'s work in predicting the binding energies of the 195 diverse complexes. In quantitative terms, RF-Score predictions of binding affinities have a Pearson correlation coefficient of RP = 0.776 with the actual values, whereas the correlation coefficient of the top scoring function among the 16 mentioned earlier is just 0.644. It should be noted that this high gain in performance is mainly a result of the characteristics of the modeling approach rather than the data itself, since both sets of scoring functions were trained and evaluated on identical training and test datasets.

The use of machine learning methods in identifying lead drugs is not restricted to the scoring task, which is a crucial step in molecular docking. In fact, machine learning methods have also been applied to QSAR modeling, where the modeling is based exclusively on ligand molecular properties. As an example, Svetnik and co-workers studied the ability of random forests to predict a compound's quantitative and/or categorical biological activity [35]. Their random forest model and other machine learning schemes were trained and evaluated on six publicly available QSAR datasets. Datasets such as BBB [36], Estrogen [37], P-gp [38], and MDRR [39] were treated as classification problems, while Dopamine [40] was treated as a regression problem and COX-2 [41] was present in both contexts. The researchers reported the accuracy in classification and the correlation coefficient in regression for the modeling algorithms: random forests (RF), decision trees (DT), partial least squares (PLS) [42], [43], artificial neural networks (ANN), support vector machines (SVM), and k-nearest neighbors (kNN). On both the classification and regression datasets, RF models were always among the top performers. SVM and an enhanced version of kNN showed satisfactory performance, and in some tests they outperformed the random forest approach.

Plewczynski and co-workers carried out an assessment of the prediction capabilities of the models SVM, ANN, NB, kNN, RF, DT, and trend vectors (TV) [44]. These machine-learning techniques were used in the context of classification, where a large database of compounds is partitioned into active and inactive compounds for five biological targets. The targets considered are HIV-reverse transcriptase, COX-2, dihydrofolate reductase, estrogen receptor, and thrombin, which were retrieved from the MDDR (MDL Drug Data Report) database [45]. For each target, several ligands known to be active are included in the compound database to be screened. The team showed that different QSAR methods exhibit different strengths and weaknesses. For example, SVM was found to be superior when the goal is to retrieve a high number of true positives. When the goal was to identify a small subset with the highest enrichment, TV was found to be the most suitable method. When the aim was to increase precision, TV, RF, and SVM are recommended. SVM, kNN, and RF achieve high recall (the percentage of true actives that are correctly retrieved). Thus, the results of that research suggest that the choice of the appropriate classification model should be application dependent.
Other studies have been conducted that compare different nonlinear regression methods in virtual screening. For example, in a comparison between ANN and SVM for predicting the apoptosis-inducing activity of several 4-aryl-4-H-chromenes, Fatemi et al. demonstrated that SVM gives more accurate predictions than ANN [46]. Goodarzi and co-workers reached the same conclusion when they considered the capabilities of SVM and ANN to predict the inhibitory activity of glycogen synthase kinase-3β [47]. SVM was also the winner when it was evaluated against kNN for identifying inhibitors of histone deacetylases (HDAC) [48].

There is a plethora of other comparative assessment studies addressing different QSAR modeling and ligand-based drug design techniques [49], [50], [51], [52], [53], [54], [55], [56]. However, to our knowledge, no such comparative studies have been conducted for machine-learning-based scoring functions. Therefore, in this work we aim to pragmatically analyze and compare several well-studied machine learning approaches in the context of molecular docking and binding affinity scoring.

Prior to this work, S.K. Namilikonda and N.R. Mahapatra, in collaboration with L.A. Kuhn and with assistance from M.I. Zavodszky, had performed preliminary investigations with a view toward improving the original X-Score [57] and original AffiScore [11], [57], [58], [59], [60] scoring functions. They performed feature subset selection separately on X-Score features and AffiScore features to design two sets of multiple linear regression (MLR) based empirical scoring functions: one based on a subset of X-Score features and the other based on a subset of AffiScore features. This investigation was done using the 2005 PDBbind database [61] and yielded some improvements in scoring accuracy. The setup used in this prior work served as a helpful starting point for this thesis work. However, the focus and scope of our work are different. All our experiments have been performed using the 2007 and 2010 PDBbind refined sets. In the initial stages of this thesis research, feature subset selection was comprehensively explored using not only X-Score and AffiScore features, but also RF-Score features [30] and their combinations. For the larger datasets (the 2007 and 2010 PDBbind refined sets) that we used, feature subset selection was, in the main, found to be unnecessary for obtaining improved MLR-based empirical scoring functions from X-Score, AffiScore, and RF-Score features and their combinations. Additionally, it was quite computationally time-intensive to perform feature subset selection on a larger number of features. Therefore, we do not report any results for feature subset selection in this work. More importantly, the focus in this work is on machine-learning-based scoring functions, in contrast to the MLR-based empirical scoring functions that have been studied in the past.

1.4 Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 describes the compound database used in our study and presents the tools used to extract the features that characterize the interaction factors between proteins and their ligands. The other half of the chapter explains the mechanics of each machine-learning algorithm considered in this work. Chapter 3 starts with a discussion of the common evaluation criteria employed to judge the performance of scoring functions. These criteria are then investigated in subsequent sections, where several experiments addressing different aspects are conducted.
Chapter 4 concludes this work and discusses some potential future improvements.

CHAPTER 2

Materials and Methods

In this chapter, we discuss the materials and methods used to build and evaluate different scoring functions. First, we introduce the protein-ligand complex database that serves as our data source. Then, as sketched out in Figure 2.1, we discuss the features used to characterize the 3D structure of a protein-ligand complex and how they are calculated. Essentially, this step transforms complex information from a geometrical coordinate representation into tuples of features. These features will be used to construct training and testing datasets for the predictive learning models. The second half of this chapter presents the machine-learning methods that will be trained and evaluated on the datasets alluded to above. We explain the mathematical models behind the prediction algorithms and the software packages that implement them. Finally, we list the 16 existing scoring functions that will be compared against the machine-learning models considered in this work.

Figure 2.1 Data preparation workflow: protein-ligand complexes from the PDB pass through PDBbind's filters and binding affinity collection; features are then calculated using the X-Score, AffiScore, and RF-Score tools; and the resulting descriptors are assembled into the primary training dataset Pr, the secondary training dataset Sc, and the core test dataset Cr, each in seven feature versions (X, A, R, XA, XR, AR, and XAR).

2.1 Compounds Database

We used the same complex database that Cheng et al. benchmarked in their recent comparative assessment of sixteen popular scoring functions [32]. They obtained the data from PDBbind, which is a selective compilation of the Protein Data Bank (PDB) database [61], [31]. Both databases are publicly accessible and continually updated. The PDB is periodically mined, and only complexes that can be utilized in drug discovery are filtered into the PDBbind database. PDBbind has been used as a benchmark for many scoring-function-related studies due to its diversity, reliability, and high quality [26], [30], [62], [63], [64], [65], [66].

In PDBbind, a number of filters are imposed to obtain high-quality protein-ligand complexes with both experimentally known binding affinity and three-dimensional structure from the PDB. First, complexes that consist of only one protein and one valid ligand attached to it are collected. The validity of a ligand is determined by its molecular weight (which should not exceed 1000) and the number of its non-hydrogen atoms (which must be 6 or more). The binding affinities of these protein-ligand complexes are then assembled from the literature. The binding affinities considered are the dissociation constant (Kd), the inhibition constant (Ki), and the concentration at 50% inhibition (IC50). For these measurements to be accepted, they must be reported at normal room temperature and neutral pH values. Second, a high-quality subset of the complexes identified in the first step is compiled into a list referred to as the refined set.
Each protein-ligand complex in the refined set must meet the following requirements: (i) it must have a stable experimental binding affinity value such as Kd or Ki (complexes with only known IC50 values are excluded because IC50 depends on the binding assay conditions); (ii) only one ligand is bound to the protein; (iii) the protein and ligand molecules must be non-covalently bound to each other; (iv) the resolution of the complex crystal structure must not exceed 2.5 Angstrom (Å); and (v) for compatibility with molecular modeling software, neither the residues in the protein binding pocket nor the ligand are allowed to have organic elements other than C, N, O, P, S, F, Cl, Br, I, and H. Finally, hydrogenation of both molecules in the complex structures is carried out using the Sybyl tool [67]. Protonation and deprotonation of specific groups were performed for proteins and ligands under a neutral pH scheme.

The PDBbind curators compiled another list out of the refined set. It is called the core set and is mainly intended to be used for benchmarking docking and scoring systems. The core set is composed of diverse protein families and diverse binding affinities. BLAST was employed to cluster the refined set according to protein sequence similarity with a 90% cutoff [68]. From each resultant cluster, three protein-ligand complexes were chosen as its representatives in the core set. A cluster must fulfill the following criteria to be admitted into the core set: (i) it has at least four members, and (ii) the binding affinity of the highest-affinity complex is at least 100-fold that of the lowest-affinity one. The representatives were then chosen based on their binding affinity rank, i.e., the complex having the highest rank, the middle one, and the one with the lowest rank (a minimal sketch of this selection procedure is given at the end of this section). This approach to constructing the core set guarantees an unbiased, reliable, and biologically rich test set of complexes.

In order to be consistent with the comparative framework used to assess the sixteen scoring functions mentioned above, we consider the 2007 version of PDBbind. This version has a 1300-complex refined set and a 195-complex core set (with 65 clusters). We also take advantage of the newly deposited complexes in the 2010 version of the database. Considering larger data samples makes our conclusions regarding the performance of scoring functions in some experiments more statistically sound. The refined set of 2010 PDBbind has 2061 entries, of which the 2007 refined list is a subset. A small portion of the 2007-version set is not included in the newly released one due to some modifications in the filtering criteria.
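As an illustration of the core-set selection rules described above, the sketch below (Python; the cluster input is a hypothetical data structure that would come from BLAST clustering of the refined set at the 90% sequence-similarity cutoff) picks the highest-, median-, and lowest-affinity member of each qualifying cluster:

# clusters: {cluster_id: [(pdb_code, kd_or_ki_in_molar), ...]} -- hypothetical input format.
def select_core_set(clusters):
    core = []
    for members in clusters.values():
        if len(members) < 4:
            continue                                   # rule (i): at least four members
        by_kd = sorted(members, key=lambda m: m[1])    # ascending Kd/Ki = descending affinity
        min_kd, max_kd = by_kd[0][1], by_kd[-1][1]
        if max_kd < 100 * min_kd:
            continue                                   # rule (ii): at least a 100-fold affinity spread
        # representatives: highest-affinity, middle, and lowest-affinity complexes
        core.extend([by_kd[0], by_kd[len(by_kd) // 2], by_kd[-1]])
    return core

For the 2007 refined set, this kind of selection yields 65 clusters and hence the 195-complex core set of that release.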
2.2 Compounds Characterization

To accurately estimate the binding affinity of a protein-ligand complex, one needs to collect as much useful information about the compound as possible. We attempt to satisfy this requirement by considering features from two different scoring families. The first is the empirical scheme, in which the extracted features correspond to energy terms. The second is the knowledge-based scheme, which mainly employs QSAR descriptors. For the empirical features, the X-Score and AffiScore algorithms are used to calculate the energy terms [11], [57], [58], [59], [60]. X-Score is implemented by Wang et al. as a standalone scoring function, whereas AffiScore was developed for the SLIDE docking tool and also supports a standalone scoring mode. The knowledge-based geometrical descriptors were obtained from a tool developed by Ballester and Mitchell; it serves the same purpose for their RF-Score scoring function [30]. X-Score, AffiScore (SLIDE), and RF-Score are publicly available, and in this section we briefly discuss the types of features extracted by each of them. For a detailed review of these tools, we refer the interested reader to the relevant work by their authors [11], [30], [57], [58], [59], [60].

The X-Score utility outputs the binding free energy for a given three-dimensional protein-ligand structure as a sum of weighted terms. The terms calculated in X-Score are: (i) van der Waals interactions between the ligand and the protein; (ii) hydrogen bonding between the ligand and the protein; (iii) the deformation effect, which is expressed as a contribution of the number of rotors (rotatable bonds) of the ligand (the protein contribution to this term is ignored); and (iv) the hydrophobic effect. The last term is computed in three different ways, which in turn leads to three different versions of the scoring function. One version models the hydrophobic effect by counting pairwise hydrophobic contacts between the protein and the ligand. The second computes this term by examining the fit of hydrophobic ligand atoms inside the hydrophobic binding site of the protein. The last version accounts for hydrophobicity by calculating the hydrophobic surface area of the ligand buried upon binding. We consider the three versions of the hydrophobic interaction as distinct features, and accordingly we obtain a total of six descriptors for each protein-ligand complex.

SLIDE's scoring function, AffiScore, calculates the binding free energy in a similar fashion to X-Score. AffiScore is also an empirical scoring function, whose energy terms include: (i) a term capturing hydrophobic complementarity between the protein and the ligand; (ii) a polar term, which is the sum of the number of protein-ligand H-bonds, the number of protein-ligand salt bridges, and the number of metal-ligand bonds; and (iii) the number of interfacial unsatisfied polar atoms. We did not limit ourselves to these terms only; we also considered the other 25 features that AffiScore calculates. Most of these contribute to the main terms listed above, and Table 2.1 provides a description of each of them. The twenty-five features are not directly accounted for in the final AffiScore model, perhaps to prevent the model from overfitting: several features are linear combinations of one another, and there is a significant amount of redundancy among them. Our experiments suggest that such intercorrelation and redundancy are not a big issue for some of the machine learning models we consider.

The features listed above (from X-Score and AffiScore) are computed mathematically from the geometrical coordinates and chemical properties of the protein-ligand atoms in 3D space. Usually, such sophisticated terms are naturally interpretable and have solid biochemical meaning. A simpler approach, inspired by knowledge-based scoring functions, does not require deriving such complex terms. Instead, a straightforward algorithm outputs a number of terms, each of which represents the number of occurrences of a certain protein-ligand atom pair within a predefined distance. Atoms of nine chemical elements are considered, which translates to a total of (9 x 9 =) 81 atom-pair frequencies. The nine common elements taken into account in the protein and ligand molecules are C, N, O, F, P, S, Cl, Br, and I.
As pointed out by the algorithm's author (and confirmed by our experiments), 45 of the resultant features are always zero across all PDBbind compounds. Such terms were discarded, and we include only the 36 nonzero atom-pair occurrence counts to achieve maximal data compactness; see Table 2.1. The distance cutoff we use here is 12 Å. This value is recommended by PMF [24] as being adequately large to implicitly capture solvation effects [30]. A minimal sketch of this kind of atom-pair counting is given after Table 2.1.

Table 2.1 Protein-ligand interactions used to model binding affinity.

X-Score Features

Feature name | Description
vdw_interactions | van der Waals interactions between the ligand and the protein.
hyd_bonding | Hydrogen bonding between the ligand and the protein.
frozen_rotor | The deformation effect, expressed as the contribution of the number of rotors for the ligand.
hphob_p | Hydrophobic effect, accounted for by counting pairwise hydrophobic contacts between the protein and the ligand.
hphob_s | Hydrophobic effect, accounted for by examining the fit of hydrophobic ligand atoms inside the hydrophobic binding site.
hphob_m | Hydrophobic effect, accounted for by calculating the hydrophobic surface area of the ligand buried upon binding.

AffiScore Features

Feature name | Description
number_of_ligand_nonh_atoms | Number of non-hydrogen atoms in the ligand.
total_sum_hphob_hphob | The total sum of hydrophobicity values for all hydrophobic protein-ligand atom pairs.
total_diff_hphob_hphob | The total difference in hydrophobicity values between all hydrophobic protein-ligand atom pairs.
total_diff_hphob_hphil | The total difference in hydrophobicity values between all protein-ligand hydrophobic/hydrophilic mismatches.
total_sum_hphil_hphil | The total sum of hydrophobicity values for all hydrophilic protein-ligand atom pairs.
total_diff_hphil_hphil | The total difference in hydrophobicity values between all hydrophilic protein-ligand atom pairs.
number_of_salt_bridges | The total number of protein-ligand salt bridges.
number_of_hbonds | The total number of regular hydrogen bonds.
number_of_unsat_polar | The total number of uncharged polar atoms at the protein-ligand interface which do not have a bonding partner.
number_of_unsat_charge | The total number of charged atoms at the protein-ligand interface which do not have a bonding partner.
ratio_of_interfacial_ligand_heavy_atoms | The fraction of ligand heavy atoms (non-hydrogen) which are at the protein-ligand interface (defined as atoms within 4.5 Angstrom of opposite protein or ligand atoms).
number_of_interfacial_ligand_atoms | The total number of ligand atoms at the interface.
number_of_exposed_hydro_ligand_atoms | The number of hydrophilic ligand atoms which are not at the interface.
number_of_ligand_flexible_bonds | The number of rotatable single bonds in the ligand molecule.
number_of_all_interfacial_flexible_bonds | The number of protein and ligand rotatable single bonds at the interface.
total_avg_hphob_hphob | The sum of the average hydrophobicity values for all ligand atoms. The average hydrophobicity value for a ligand atom is the average hydrophobicity value of all its neighboring hydrophobic protein atoms.
old_slide_total_sum_hphob_hphob | The sum of hydrophobicity values for all hydrophobic protein-ligand atom pairs (as computed in the old AffiScore version).
number_of_flexible_interfacial_ligand_bonds | The number of rotatable single bonds of the ligand at the interface.
total_hphob_comp | The sum of the degree of similarity between protein-ligand hydrophobic atom contacts. A higher value indicates that the protein and ligand atoms had more similar raw hydrophobicity values.
total_hphob_hphob_comp | A normalized version of the degree of similarity between the ligand atoms and their hydrophobic protein neighbors. This is the version currently used in AffiScore.
contact_hphob_hphob | The pairwise count of hydrophobic-hydrophobic protein-ligand contacts.
contact_hphil_hphil | The pairwise count of hydrophilic-hydrophilic protein-ligand contacts.
contact_hphob_hphil | The pairwise count of hydrophobic-hydrophilic protein-ligand contacts.
total_target_hphob_hphob | The total of protein hydrophobicity values for protein atoms involved in hydrophobic-hydrophobic contacts.
increase_hphob_environ | For hydrophobic/hydrophilic mismatch pairs, the sum of the hydrophobic atoms' hydrophobicity values.
total_target_hydro | The total hydrophilicity of all protein interfacial atoms.
norm_target_hphob_hphob | A distance-normalized version of total_target_hphob_hphob.
total_target_hphob_contact | The total of protein hydrophobicity values for protein atoms at the interface.
norm_target_hphob_contact | A distance-normalized version of total_target_hphob_contact.
Total_ligand_hydro | The total of ligand hydrophilicity values for interfacial ligand atoms.

RF-Score Features

Feature name | Description (the number of protein-ligand ...)
C.C | Carbon-Carbon contacts.
N.C | Nitrogen-Carbon contacts.
O.C | Oxygen-Carbon contacts.
S.C | Sulfur-Carbon contacts.
C.N | Carbon-Nitrogen contacts.
N.N | Nitrogen-Nitrogen contacts.
O.N | Oxygen-Nitrogen contacts.
S.N | Sulfur-Nitrogen contacts.
C.O | Carbon-Oxygen contacts.
N.O | Nitrogen-Oxygen contacts.
O.O | Oxygen-Oxygen contacts.
S.O | Sulfur-Oxygen contacts.
C.F | Carbon-Fluorine contacts.
N.F | Nitrogen-Fluorine contacts.
O.F | Oxygen-Fluorine contacts.
S.F | Sulfur-Fluorine contacts.
C.P | Carbon-Phosphorus contacts.
N.P | Nitrogen-Phosphorus contacts.
O.P | Oxygen-Phosphorus contacts.
S.P | Sulfur-Phosphorus contacts.
C.S | Carbon-Sulfur contacts.
N.S | Nitrogen-Sulfur contacts.
O.S | Oxygen-Sulfur contacts.
S.S | Sulfur-Sulfur contacts.
C.Cl | Carbon-Chlorine contacts.
N.Cl | Nitrogen-Chlorine contacts.
O.Cl | Oxygen-Chlorine contacts.
S.Cl | Sulfur-Chlorine contacts.
C.Br | Carbon-Bromine contacts.
N.Br | Nitrogen-Bromine contacts.
O.Br | Oxygen-Bromine contacts.
S.Br | Sulfur-Bromine contacts.
C.I | Carbon-Iodine contacts.
N.I | Nitrogen-Iodine contacts.
O.I | Oxygen-Iodine contacts.
S.I | Sulfur-Iodine contacts.
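The following minimal sketch (plain Python; the atom records are assumed to come from a hypothetical structure parser, and the actual RF-Score tool may differ in its details) illustrates the occurrence counting behind the RF-Score features in Table 2.1, using the 12 Å cutoff adopted above:

from itertools import product

ELEMENTS = ["C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]
CUTOFF = 12.0  # Angstrom; large enough to implicitly capture solvation effects

def atom_pair_counts(protein_atoms, ligand_atoms, cutoff=CUTOFF):
    """Count protein-ligand atom pairs, per element pair, within the distance cutoff.

    Each atom is an (element, x, y, z) tuple; hydrogens and elements outside ELEMENTS
    are assumed to have been filtered out upstream. Returns a dict keyed like 'C.N'.
    """
    counts = {f"{a}.{b}": 0 for a, b in product(ELEMENTS, ELEMENTS)}  # 81 possible pairs
    cutoff_sq = cutoff * cutoff
    for p_elem, px, py, pz in protein_atoms:
        for l_elem, lx, ly, lz in ligand_atoms:
            dist_sq = (px - lx) ** 2 + (py - ly) ** 2 + (pz - lz) ** 2
            if dist_sq <= cutoff_sq:
                counts[f"{p_elem}.{l_elem}"] += 1
    return counts

Over the PDBbind complexes, only 36 of the 81 possible element pairs are ever nonzero, which is why Table 2.1 lists exactly 36 RF-Score features.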
2.3 Binding Affinity

The dissociation constant (Kd) is commonly used to quantify the strength with which a protein binds a ligand. This equilibrium constant measures the tendency of a complex to separate into its constituent components. We use it here to describe how tightly proteins are bound to their binders: complexes whose components are more likely to dissociate (high dissociation constants) are interpreted as loosely bound (low binding affinities), and vice versa. We can write the binding reaction for a protein-ligand complex as

Protein-Ligand complex ↔ Protein + Ligand,

and the dissociation constant, in units of mol/L, is given by

Kd = [P][L] / [PL],

where [P], [L], and [PL] are the equilibrium molar concentrations of the protein, the ligand, and the protein-ligand complex, respectively.
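As a small numerical illustration (the concentrations are made up), the sketch below evaluates this expression and applies the logarithmic pKd transform introduced later in this section:

import math

# Hypothetical equilibrium molar concentrations of free protein, free ligand, and complex.
P, L, PL = 1e-9, 5e-8, 2.5e-8      # mol/L
Kd = P * L / PL                    # = 2e-9 mol/L
pKd = -math.log10(Kd)              # about 8.7; a larger pKd means tighter binding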
The higher the concentration of the ligand needed to form a stable protein-ligand complex, the lower the binding affinity. Put differently, as the concentration of the ligand increases, the value of Kd also increases, which corresponds to weaker binding.

In some experimental settings, such as competition binding assays, another measure of binding affinity is widely employed. This measure is known as the inhibition constant and is denoted by Ki. The inhibition constant is defined as the concentration of a competing ligand in a competition assay that would occupy 50% of the receptor in the absence of radioligand. It is expressed by the Cheng-Prusoff equation [69]:

Ki = IC50 / (1 + [L]/Kd),

where IC50 is the half-maximal inhibitory concentration, which measures the effectiveness of the ligand at inhibiting a biochemical or biological function by half, Kd is the dissociation constant of the radioligand, and [L] is the concentration of the ligand L. IC50 can also be thought of as the concentration of the ligand required to displace 50% of the radioligand's binding. The IC50 value is typically determined using a dose-response curve [70], and it should be noted that it is highly dependent on the experimental conditions under which it is measured. To overcome this dependency, the inhibition constant Ki is formulated so that it has an absolute value; an increase in its value means a higher concentration of the inhibitor is needed to block a given cellular activity.

It is customary to convert the dissociation and inhibition constants logarithmically into numbers that are more convenient to interpret and work with. These constants are scaled according to

pKd = −log10(Kd),  pKi = −log10(Ki),

where higher values of pKd and pKi reflect exponentially greater binding affinity. The data we use in our work are restricted to complexes for which either Kd or Ki values are available. Therefore, all regression models that will be built are fitted so that the response variable is either pKd or pKi. In terms of pKd and pKi, the binding affinities of the complexes in the PDBbind database range from 0.49 to 13.96, spanning more than 13 orders of magnitude.

2.4 Training and Test Datasets

Several training and test datasets are used in our experiments in order to evaluate scoring functions in various ways. As discussed in Section 2.1, two versions of the PDBbind database are employed in our study. We consider the "refined" and "core" sets of the 2007 version, which include 1300 and 195 complexes, respectively. The latest version, 2010, whose refined set contains 2061 complexes, is also used. For each complex in the three complex sets, we extracted its X-Score, AffiScore, and RF-Score features. From these features, many datasets were then constructed that serve different functions (i.e., training, testing, or both); see Figure 2.1. Each dataset can be thought of as a table of several rows and columns, where each row represents a unique protein-ligand complex and each column corresponds to a particular feature. One dataset derived from the 2007 refined set is referred to as the primary training set and is denoted by Pr. It is composed of the refined complexes of 2007, excluding those complexes present in the core set of the same year's version. There are also nine complexes in the refined set that we excluded from the primary training set due to difficulty in generating their descriptors. Thus the primary training set consists of a total of 1300 − 195 − 9 = 1096 complexes.
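As a concrete illustration of how such a training table is assembled (a minimal sketch with hypothetical object and column names, not our actual preprocessing code), the primary set and its pKd/pKi response could be built in R as follows:

```r
## Minimal sketch (hypothetical objects): assemble the primary training set Pr
## from the 2007 refined set and convert the measured affinities to pKd/pKi.
## 'refined2007' and 'core2007' are assumed data frames with one row per complex,
## a 'pdb_id' column, a 'kd_or_ki' column in mol/L, and the descriptor columns.

failed_ids <- c("cplx1", "cplx2")   # placeholder for the 9 complexes whose
                                    # descriptors could not be generated
keep <- !(refined2007$pdb_id %in% core2007$pdb_id) &   # drop the 195 core-set complexes
        !(refined2007$pdb_id %in% failed_ids)          # drop complexes that failed extraction
pr <- refined2007[keep, ]
pr$pK <- -log10(pr$kd_or_ki)        # response variable: pKd or pKi

nrow(pr)                            # 1096 under the counts reported above
```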
We also extracted features for the complexes in the core set and stored them in a dataset called the core test set, denoted by Cr. The number of complexes in this dataset is 192: we were not able to extract features for one protein-ligand complex, so that problematic complex and the other two complexes in its cluster were excluded from the core set (195 − 3 = 192). The last main dataset was built from complexes in the refined set of the 2010 version of the PDBbind database. We refer to this set as the secondary training set and denote it by Sc. Not all of the 2061 complexes in the 2010 refined set were compiled into the Sc dataset, because several complexes could not be processed by the feature-extraction tools. Furthermore, the complexes that appear in the core set Cr (of the 2007 version) were not considered in Sc. As a result, the total number of records (complexes) in the Sc dataset was reduced to 1986. It should be noted that the dataset Pr is a subset of Sc (except that, as noted in Section 2.1, some complexes present in Pr are not included in Sc due to modifications in the filtering criteria in 2010).

Table 2.2 Training and test sets used in the study.

Feature set                        Primary (2007) Pr   Secondary (2010) Sc   Core (2007) Cr   # of features
X-Score                            PrX                 ScX                   CrX              6
AffiScore                          PrA                 ScA                   CrA              30
RF-Score                           PrR                 ScR                   CrR              36
X-Score + AffiScore                PrXA                ScXA                  CrXA             36
X-Score + RF-Score                 PrXR                ScXR                  CrXR             42
AffiScore + RF-Score               PrAR                ScAR                  CrAR             66
X-Score + AffiScore + RF-Score     PrXAR               ScXAR                 CrXAR            72
Number of complexes                1096                1986                  192 (64 clusters)

For each dataset, Pr, Sc, and Cr, seven versions are generated (see the white boxes toward the bottom of Figure 2.1). Each version is associated with one feature type (or set). These types refer to features extracted using X-Score, AffiScore, and RF-Score, and all their possible combinations (seven in total). For instance, when the features extracted using the X-Score and RF-Score tools for the complexes in the Pr set are combined in one dataset, we obtain the version PrXR of Pr. Note also that we distinguish versions from each other using subscripts that stand for the first letter(s) of the tool(s) employed to calculate the respective feature set. One should keep in mind that features generated using the RF-Score tool do not imply that only a Random-Forests-based scoring function will be applied to such descriptors. In fact, we assume these features can be employed to build any machine-learning-based scoring function, even though they were originally used by Ballester and Mitchell in conjunction with Random Forests to build their RF-Score scoring function. Table 2.2 summarizes the datasets discussed above and the numbers of their constituent features and complexes.

2.5 Machine Learning Methods

The collected protein-ligand training data is assumed to be independently sampled from the joint distribution (population) of (X, Y) and consists of N (P+1)-tuples (or vectors) {(X_1, Y_1), ..., (X_i, Y_i), ..., (X_N, Y_N)}. Here X_i is the i-th descriptor vector that characterizes the i-th protein-ligand complex, and Y_i is the binding affinity of the i-th complex. These tuples form the training and test datasets on which the machine learning models of interest will be fitted and validated. A dataset can be denoted more concisely by

$D = \{(X_i, Y_i)\}_{i=1}^{N}, \quad X_i \in \mathbb{R}^P, \; Y_i \in \mathbb{R}.$

We categorize the models to be compared into single machine learning models and ensemble learners.
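Before describing the individual learners, it may help to picture these tables concretely (a minimal sketch with hypothetical object names): each dataset is simply a numeric matrix whose rows are complexes and whose columns are descriptors, paired with the measured pKd/pKi response, and the combined feature versions are plain column-wise concatenations.

```r
## Minimal sketch (hypothetical objects): forming combined feature versions of the
## primary training set. 'pr_x' (6 cols), 'pr_a' (30 cols), and 'pr_r' (36 cols)
## hold the X-Score, AffiScore, and RF-Score descriptors for the same 1096
## complexes in the same row order; 'pk' is the vector of measured pKd/pKi values.

pr_xr  <- cbind(pr_x, pr_r)            # X-Score + RF-Score: 42 features
pr_xar <- cbind(pr_x, pr_a, pr_r)      # X-Score + AffiScore + RF-Score: 72 features

d_xar <- data.frame(pr_xar, pK = pk)   # one row per complex: descriptors + response
dim(d_xar)                             # 1096 x 73
```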
The single models investigated in this work are Multiple Linear Regression (MLR), Multivariate Adaptive Regression Splines (MARS), k-Nearest Neighbors (kNN), and Support Vector Machines (SVM). Ensemble models include Random Forests (RF), Neural Forests (NF), and Boosted Regression Trees (BRT). Consensus models built from single learners, from ensemble learners, and from both will also be briefly investigated at the end of the thesis.

2.5.1 Single Models

In this category, we construct one parametric model, MLR, and three non-parametric models: MARS, kNN, and SVM. Generally, single-learner algorithms are faster to train than their ensemble counterparts and require less memory. They also tend to have more intuitive descriptive power, at the cost of lower predictive performance.

2.5.1.1 Multiple Linear Regression (MLR)

A multiple linear regression (MLR) model is built on the assumption that a linear function relates the explanatory variables (features) X to the outcome Y (binding affinity). This linear relationship takes the form

$\hat{Y} = \hat{f}(X) = \beta_0 + \sum_{j=1}^{P} \beta_j X_j.$   (2.1)

The MLR algorithm estimates the coefficients β_0, ..., β_P from the training data by minimizing the overall deviation of the fitted values from the measured observations, a fitting methodology widely known as linear least-squares. Although MLR is simple to construct and fast in prediction, it suffers from overfitting when tackling high-dimensional data. Moreover, linear models have a relatively large bias error due to their rigidity. The reason for developing simple linear models in this study is two-fold: first, they are useful in approximating the behavior of conventional empirical scoring functions; second, they serve as baseline benchmarks for the nonlinear models we investigate.

2.5.1.2 Multivariate Adaptive Regression Splines (MARS)

Instead of fitting one linear function to the training data as in MLR, a more flexible approach is to fit several additive splines. Such a technique, known as Multivariate Adaptive Regression Splines (MARS), was proposed by Friedman two decades ago to solve regression problems [71]. Nonlinearity in MARS is addressed automatically by partitioning the feature space into S regions, each of which is fitted by one spline. The final nonlinear model is obtained by combining the S resulting splines (or terms). Every term is composed of one or more two-sided basis functions that are generated in pairs from the training data. Each pair (a.k.a. a hinge function) is characterized by an input variable X_j and a knot value k. The knot acts as a pivot around which one function of the pair is positive on one side and zero on the other. More formally,

$h_+(X_j, k) = \max(0, X_j - k) \quad \text{and} \quad h_-(X_j, k) = \max(0, k - X_j).$   (2.2)

The +/− sign in the subscript denotes that the function takes the value X_j − k (respectively k − X_j) to the right (respectively left) of the knot and 0 on the other side. The MARS algorithm adaptively determines both the number of knots and their values from the training data. It also employs least-squares to optimize the weights β_i associated with each term (i = 1, ..., S) in the final MARS model, which can be written as

$\hat{Y} = \hat{f}(X) = \beta_0 + \sum_{i=1}^{S} \beta_i H_i(X),$   (2.3)

where $\hat{Y}$ is the estimated binding affinity for the complex X, β_0 is the model's intercept (or bias), and S is the number of splines (or terms).
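As a small illustration of the hinge pair in Equation 2.2 and the additive form of Equation 2.3 (toy knot and coefficient values, not tied to any particular descriptor):

```r
## Hinge-function pair of Equation 2.2, evaluated element-wise.
h_plus  <- function(x, k) pmax(0, x - k)
h_minus <- function(x, k) pmax(0, k - x)

x <- 0:10
h_plus(x, 5)    # 0 0 0 0 0 0 1 2 3 4 5
h_minus(x, 5)   # 5 4 3 2 1 0 0 0 0 0 0

## A toy additive MARS-style model (Equation 2.3) with two terms and a knot at 5:
f_hat <- function(x) 2.0 + 0.8 * h_plus(x, 5) - 0.3 * h_minus(x, 5)
f_hat(c(3, 5, 8))   # 1.4 2.0 4.4
```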
Note that the function $\hat{f}(X)$ can be an additive combination of "piecewise" splines, an interactive "pairwise" mix of hinge functions, or a hybrid of both. Thus each term H_i is expressed as

$H_i(X) = \prod_{l=1}^{d} h_l(X, k),$   (2.4)

where d is the interaction degree of the term. A value of one for d implies that the term H_i is a linear spline and contributes in an additive fashion to the final model, whereas d ≥ 2 implies an interaction depth of d and thus a nonlinear term.

The MARS algorithm first builds a complex model consisting of a large number of terms. To prevent the model from overfitting, this step is followed by a pruning round in which several terms are dropped. An iterative backward elimination according to the generalized cross-validation (GCV) criterion is typically performed to determine which terms to remove. The GCV statistic penalizes models with a large number of terms and a high degree when their (least-squares) goodness-of-fit is measured. During the training process, a greedy search is performed in both the forward (term-adding) and backward (term-elimination) passes. The MARS models investigated in this work were allowed to grow in the forward pass (before pruning) to as many terms as the number of descriptors plus one, or 21, whichever is larger, i.e., max(P + 1, 21). The MARS implementation used in this comparison is the publicly available R [72] package EARTH (version 2.4-5), which provides default values for all parameters [73]. We used these values except for the degree of each term and the GCV penalty, whose optimal values are determined using 10-fold cross-validation. A more detailed discussion of the optimization procedure is presented in Section 3.2.

2.5.1.3 k-Nearest Neighbors (kNN)

An intuitive strategy to predict the binding energy of a protein-ligand complex is to scan the training database for a similar protein-ligand compound and output its binding affinity as an estimate for the complex in question. The most similar complex is referred to as the nearest neighbor of the protein-ligand compound whose binding energy we attempt to predict. This simple algorithm is typically augmented to take into account the binding affinities of the k nearest neighbors, instead of just the nearest one. In the kNN algorithm used here, each neighbor contributes to the predicted binding energy by an amount that depends on its distance from the target complex, with closer neighbors weighted more heavily. The distance measure employed in this algorithm is the Minkowski distance,

$D(X_i, X_j) = \left( \sum_{m=1}^{P} |x_{im} - x_{jm}|^q \right)^{1/q},$   (2.5)

and the final kNN model is established as

$\hat{Y} = \hat{f}(X) = \frac{1}{K} \sum_{i=1}^{K} W(D, X, X_i) \, Y_i,$   (2.6)

where X and X_i are the target complex and the i-th neighbor complex, respectively, and the weighting function W is a function of the distance metric D. The parameter q in Equation 2.5 determines the distance function to be calculated: Euclidean distance is obtained when q = 2, whereas Manhattan distance is realized when q is set to 1. Both the number of nearest neighbors and the Minkowski parameter q are user defined. We ran a 10-fold cross-validation to obtain the optimal values of these two parameters that minimize the overall mean-squared error. No feature selection was conducted, since the number of features we use is not very large. It should be noted that the learning process in kNN is lazy and local. It is lazy because learning is deferred until a complex's binding affinity needs to be predicted,
and local because only a subset of the training instances, the neighbors, is involved in the final output, as opposed to global participation of the whole training dataset.

We use the R package KKNN (version 1.0-8), which implements a weighted version of k-Nearest Neighbors [74]. It is publicly available and also supported by a publication [75].

2.5.1.4 Support Vector Machines (SVM)

Support Vector Machines is a supervised learning technique developed by Vapnik et al. to solve classification and regression problems [76]. The first step in constructing an SVM model is to map the data to a high-dimensional space using a nonlinear mapping function. Several such transforming functions are employed by SVM algorithms; the two most popular are polynomial and radial basis functions. In this work we choose the radial basis function (RBF) kernel due to its effectiveness and efficiency in the training process [77]. Another attractive feature of this kernel is that it has only one parameter, namely the radial basis width, which can be calibrated to reflect the layout and distribution of the training data. After mapping to the new high-dimensional space, a linear model can be constructed. In the regression formulation, SVM algorithms induce from the training data an optimal hyperplane that approximates the underlying relationship f(X) between the covariates (in the new feature space) and the output variable. The predicting function f(X) can be defined more formally as

$\hat{Y} = f(X) = \langle w, G(X) \rangle + b,$   (2.7)

where ⟨·,·⟩ represents the dot product, G(X) is the nonlinear function that maps the input vector to the new feature space, and the rightmost term b is the model's bias. The norm of the vector w is minimized in order for the model to generalize well to unseen data. At the same time, the model should fit the training data by minimizing the ε-insensitive loss function, defined as

$L_\varepsilon(X) = \max(0, |y - f(X)| - \varepsilon).$   (2.8)

This loss function defines a region around the hyperplane outside of which data points are penalized. The penalty is introduced through slack variables ξ and ξ*. The SVM algorithm can therefore be formulated as an optimization problem that minimizes the functional

$\frac{1}{2}\|w\|^2 + C \sum_{i=1}^{N} (\xi_i + \xi_i^*),$   (2.9)

where C is a user-defined parameter. This problem can be solved more easily by applying quadratic programming to its dual formulation. The solution to Equation 2.9 is then given by

$\max_{\alpha, \alpha^*} W(\alpha, \alpha^*) = \sum_{i=1}^{N} \alpha_i^* (y_i - \varepsilon) - \sum_{i=1}^{N} \alpha_i (y_i + \varepsilon) - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(X_i, X_j),$   (2.10)

subject to the constraints

$\alpha_i, \alpha_i^* \in [0, C], \; i = 1, \ldots, N, \qquad \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0,$   (2.11)

where α_i and α_i* are Lagrange multipliers obtained by solving Equation 2.10 under the constraints given in Equation 2.11. After obtaining the Lagrange multipliers, we can calculate the right-hand terms of Equation 2.7 as follows:

$\langle w, G(X) \rangle = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) K(X, X_i)$   (2.12)

and

$b = -\frac{1}{2} \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \left[ K(X_i, X_r) + K(X_i, X_s) \right].$

In our work we choose the kernel K(X, X̃) to be the radial basis function (RBF), which takes the Gaussian form

$K(X, \tilde{X}) = e^{-\|X - \tilde{X}\|^2 / 2\sigma^2},$   (2.13)

where σ is the kernel width. The data points X_r and X_s in Equation 2.12 are arbitrary support vectors located on either side of the hyperplane. Support vectors are data instances located outside the ε-insensitive region; as this region gets wider (larger ε), fewer support vectors are obtained and the model becomes flatter.
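A minimal sketch of fitting such an ε-SVR model through the E1071 interface follows (hypothetical data objects and placeholder parameter values, not the tuned values reported in Section 3.2; note that LIBSVM parameterizes the RBF width through gamma, where gamma = 1/(2σ²) relative to Equation 2.13):

```r
## Minimal sketch (hypothetical objects): epsilon-regression SVM with an RBF
## kernel via the e1071 interface to LIBSVM. 'x_train' holds the descriptors and
## 'y_train' the measured pKd/pKi values.
library(e1071)

svm_fit <- svm(x = x_train, y = y_train,
               type    = "eps-regression",
               kernel  = "radial",
               cost    = 1,      # C: trade-off between flatness and tolerated deviations
               epsilon = 0.1,    # width of the epsilon-insensitive tube
               gamma   = 0.01)   # RBF width; gamma = 1 / (2 * sigma^2) in LIBSVM terms

y_pred <- predict(svm_fit, newdata = x_test)   # predicted binding affinities
```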
The width of the ε-tube is not the sole parameter that controls the model complexity. The variable C (Equations 2.9 and 2.11) acts as a trade-off between the model's flatness and the degree to which deviations larger than ε are tolerated. Very small C values may produce an under-fitted model, while a very large C increases the model complexity and may lead to overfitting. Most SVM implementations treat the method's meta-parameters as user-defined inputs, and their optimal values are usually not known beforehand; they should be carefully tuned to ensure proper generalization and predictive performance. A 10-fold cross-validation is performed to select the optimal complexity parameter C, the ε value, and the width σ of the RBF kernel. Section 3.2 provides more details about the tuning procedure and the resulting optimal values.

We use the well-known SVM implementation built into the C++ library LIBSVM (version 2.6) [78]. LIBSVM supports both supervised (classification and regression) and unsupervised learning modes. An R interface to this engine is provided by the package E1071 (version 1.5-24) [79]. Among many other features, this package supports automatic parameter selection using cross-validation and prediction on new data, and it provides an intuitive interface to LIBSVM.

2.5.2 Ensemble Models

In the context of regression analysis, ensemble learning is based on the simple concept of combining the predictions of a set of individual learners, either by averaging (co-operative) or by weighting (competitive). Such an approach has proved to be powerful in machine learning. In order to gain maximum performance from ensemble schemes, the constituent base learners should be accurate and diverse [80]. The first requirement is achieved by adopting nonlinear techniques such as prediction trees. As for the second requirement, several methods have been proposed in the literature to maximize the learners' diversity by minimizing the inter-correlation among them. The most popular ones promote such diversity by manipulating the training dataset from which the regressors are induced: each learner is fit to a data sample drawn from the main training set according to some random mechanism. Bagging, adaptive weight-boosting of training examples, sampling without replacement, and stacking are commonly applied approaches to inject randomness whenever a tree (or any base learner) is built. Other perturbation techniques that target the input feature space are also applied in ensemble learning [81], [82], [34]. In this approach, a random subset of features is chosen each time a hypothesis is constructed; the ensemble's i-th learner is trained on data formed by the i-th random subset of the whole feature space. In this work we build ensemble models based on manipulating both the training set and the input features.

2.5.2.1 Random Forests (RF)

Random Forests is a powerful ensemble tool developed by Leo Breiman to solve classification and regression problems [34]. It has been applied in many drug-discovery-related studies, where it showed excellent results [30], [44], [35], [83], [84]. A Random Forests model consists of an ensemble of prediction trees (typically a few hundred). Each tree is fitted to a bootstrap sample drawn from the training set. When a tree is constructed, at each node a binary split is made on mtry features randomly sampled from the P-dimensional feature space (mtry ≤ P). Each tree is grown independently to its full size without pruning.
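A minimal sketch of fitting such a forest with the RANDOMFOREST package (hypothetical data objects; the mtry value shown is merely illustrative, the tuned values are reported in Section 3.2):

```r
## Minimal sketch (hypothetical objects): a Random-Forests scoring model.
## 'x_train' holds the descriptors and 'y_train' the measured pKd/pKi values.
library(randomForest)

rf_fit <- randomForest(x = x_train, y = y_train,
                       ntree = 1000,        # trees grown in the forest
                       mtry  = 24,          # features sampled at each node split
                       importance = TRUE)   # collect variable-importance measures

head(rf_fit$mse)                              # out-of-bag MSE after 1, 2, ... trees
y_pred <- predict(rf_fit, newdata = x_test)   # average of the individual tree predictions
```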
When the task is to predict the binding affinity of a new protein-ligand complex, the output is the average of the predictions of the constituent individual trees. This mechanism can be written formally as

$\hat{Y} = \hat{f}(X) = \frac{1}{L} \sum_{l=1}^{L} T_l(X, \Theta_l),$   (2.14)

where X is the protein-ligand complex whose binding affinity needs to be predicted, L is the number of generated trees, and T_l is the l-th tree in the ensemble. Each tree in the ensemble is characterized by the values Θ_l of its terminal nodes.

When building an RF model, the two parameters that must be assigned are L and mtry. As pointed out earlier, mtry is the number of random features sampled each time a node is split. It is a user-defined parameter and practically the only variable that needs to be tuned, although experience suggests that the performance of RF models is not very sensitive to its value. This is in fact what we observed when searching for the mtry that minimizes the root of the out-of-bag mean-squared error (OOBMSE); see Section 3.2 for more details. Out-of-bag refers to observations that are not sampled from the original training set when the bootstrap sets are drawn. On average, about a third of the training set is left out in each tree-generating iteration. These out-of-sample examples are valuably utilized to estimate the generalization error of the RF model as follows:

$\text{OOBMSE} = \frac{1}{N} \sum_{i=1}^{N} (Y_i - \hat{Y}_i^{OOB})^2,$   (2.15)

where $\hat{Y}_i^{OOB}$ is the average prediction for the i-th example over all the trees for which this example is out-of-bag.

The second parameter in an RF model is the number of trees (L) grown in the forest. We fixed L at 1000 in our experiments, even though the OOBMSE statistic stabilized several hundred trees before this value. This behavior conforms to the Strong Law of Large Numbers, which assures such convergence, and it helps explain why RF models do not overfit as more trees are added. Computationally, growing extra trees does not significantly hurt the algorithm's speed in training or prediction, which can be partially attributed to the rapidity of tree prediction and the speed of building regression trees on only a subset of the feature space.

Besides the ease of tuning and the internal, unbiased generalization-error estimate (OOBMSE), random forests have other attractive features, such as built-in variable-importance estimation, partial dependence analysis, handling of a large number of features without degradation in accuracy, and low bias and variance. Variable importance is a very useful metric in prediction models: it allows drug designers, for instance, to identify the most critical protein-ligand interactions behind the predictive accuracy. Breiman [34] proposed a simple method to calculate this metric using OOB examples. It is achieved by permuting the input variables (one at a time) and then computing the corresponding OOBMSE after this noising step. By comparing the intact OOBMSE to the OOBMSE computed after permutation, the percentage increase in MSE is evaluated. For each tree t in the forest, the regular OOBMSE is determined as

$\text{OOBMSE}_t = \frac{1}{\tilde{N}} \sum_{i=1}^{\tilde{N}} (Y_i - \hat{Y}_i)^2,$   (2.16)

where $\tilde{N}$ is the number of OOB examples for the t-th tree and the iterator i indexes them. The criterion for the same tree when the input variable X_j is permuted is defined as

$\text{OOBMSE}_t^j = \frac{1}{\tilde{N}} \sum_{i=1}^{\tilde{N}} (Y_i - \hat{Y}_i^j)^2.$   (2.17)

For each variable X_j in each tree t in the forest, the increase in MSE is evaluated.
The result is then averaged over all the trees in the forest, and the overall increase in MSE for an input feature X_j is regarded as its influence on predicting the binding affinity. In virtual screening, input variables can be sorted by their influence and plotted to give visual insight into the different protein-ligand interactions.

Another powerful visualization tool for drug designers provided by RF is the partial dependence plot. Such plots are produced by calculating the effect of an input variable X_j on the outcome variable while the rest of the input variables X_{\j} are integrated out. More formally, the dependence of the RF-approximated function f(X) on the input variable X_j is calculated as

$\tilde{f}_j(x) = \frac{1}{N} \sum_{i=1}^{N} f(x, x_{i, \setminus j}).$   (2.18)

For any input value x of the variable X_j, the summand in (2.18) is iterated over all the data patterns (typically the training data) to account for the average effect of the rest of the input variables X_{\j}. It should be noted that the resulting plots do not show interplays between different input variables; nevertheless, such plots are an excellent basis for interpretation. In our experiments we use the RANDOMFOREST package (version 4.5-36) of the R language [85]. The package is an interface to the Fortran implementation of RF written by the technique's author, Breiman, and colleagues.

2.5.2.2 Neural Forests (NF)

We propose here an ensemble method based on Artificial Neural Networks (ANNs). The algorithm is inspired by the Random Forests approach to forming the ensemble, but instead of decision trees we employ multi-layered perceptron (MLP) ANNs as base learners. So far, the focus in ensemble learning has been biased toward decision trees. Choosing trees as base learners is mainly due to their high flexibility and variance (low stability); high variance decreases the inter-correlation between trees and therefore increases the overall model's accuracy.

Artificial neural networks such as the MLP share several characteristics with prediction trees: they are non-parametric, nonlinear, and have high variance. In addition, both techniques are very fast in prediction, although neural networks are relatively slower to train. Our model uses hundreds of back-propagation iterations to train each neural network in the ensemble. These networks are composed of mtry input neurons, an arbitrary number of neurons (20 in our experiments) in the hidden layer, and one output neuron in the output layer. The neurons are interconnected via weighted links, as shown in Figure 2.2. The outputs of the input neurons are directed to all the neurons in the hidden layer, and the outputs of the hidden-layer neurons are in turn directed forward to the output neuron.

Figure 2.2 Multi-layered perceptron, feed-forward neural network.

The output of each network l in the ensemble is calculated at the output neuron according to the formula

$\hat{Y}_l = f(X) = f_o\left( \sum_h w_{ho} \, f_h\left( \sum_i w_{ih} x_i \right) \right),$   (2.19)

where f(X) is the function that maps the input protein-ligand complex X to the binding affinity $\hat{Y}_l$, f_o is the activation function of the output neuron (linear in our case), f_h is the activation function of the hidden-layer neurons (sigmoid in this work), w_{ho} are the weights associated with the links connecting the hidden layer to the output neuron, w_{ih} are the weights of the input-to-hidden-layer links, and x_i is the i-th component of the input pattern X.
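Purely as an illustration of how one base learner of this ensemble can be grown with the NNET package (hypothetical objects and illustrative parameter values, not our exact implementation; the weight-decay term used below belongs to the training criterion discussed next):

```r
## Minimal sketch (hypothetical objects): one base learner of the proposed Neural
## Forests ensemble -- an MLP fitted to a bootstrap sample of the training data
## restricted to a random subset of mtry features. Inputs are assumed to be
## rescaled to [0, 1] beforehand, as discussed below.
library(nnet)

mtry      <- 15
boot_rows <- sample(nrow(x_train), replace = TRUE)   # bootstrap sample of complexes
feat_cols <- sample(ncol(x_train), mtry)             # random subset of the features

net <- nnet(x = x_train[boot_rows, feat_cols],
            y = y_train[boot_rows],
            size   = 20,      # neurons in the single hidden layer
            decay  = 0.001,   # weight-decay regularization (lambda)
            linout = TRUE,    # linear output activation, as needed for regression
            maxit  = 500)     # back-propagation iterations

## Growing L such networks and averaging their predict() outputs gives the
## ensemble prediction of Equation 2.21 below.
```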
The network weights, w_{ih} and w_{ho}, are optimized such that they minimize the fitting criterion E, defined as

$E = (Y - \hat{Y})^2 + \lambda \sum_{ij} w_{ij}^2,$   (2.20)

where Y and $\hat{Y}$ are, respectively, the measured and predicted binding affinities, and λ is a regularization parameter. The latter parameter is also known as the weight decay; it prevents the weights from taking large values and thereby avoids saturation at the outputs of the hidden-layer neurons. We scaled the input variables to the range [0, 1] so that the weights are optimized effectively when regularization is used [86].

Learning the ensemble in NF is identical to the Random Forests approach, except that MLP networks rather than decision trees are fitted to the training data. Each network Net_l in the ensemble is fitted to a bootstrap sample of the training data described by a random subset of features. The size of this subset, mtry, and the number of networks L to train are user-defined variables, as in the RF method. After building the model, the binding affinity of a new protein-ligand complex X is computed by applying the formula

$\hat{Y} = \hat{f}(X) = \frac{1}{L} \sum_{l=1}^{L} Net_l(X, w_l),$   (2.21)

where each Net_l in the ensemble is defined by its weight set w_l. As given in (2.21), the mean of the outputs of these networks (Equation 2.19) is taken to calculate the final value of the binding affinity. This value depends on the values of the weight decay λ and mtry; we tuned these parameters such that the root of the OOB mean-squared error (OOB RMSE) is minimized. More details about this procedure and the resulting optimal values for λ and mtry are given in Section 3.2.

The proposed NF model addresses five important issues associated with traditional MLP networks: weight optimization, dealing with high-dimensional data, training speed, model interpretability, and generalization. Firstly, if we were to build a scoring function based on a single MLP network, there would always be a chance that the network weights yield very poor predictions due to improper optimization; they may get trapped in local minima when optimized to minimize the loss function E. When several neural networks are generated, the chance that all of them end up with poorly optimized weights is almost zero. For instance, assume there is a probability of 0.7 that the final weight values yield a poor model (i.e., are trapped in a local minimum); then an ensemble of just 20 networks has a probability of only 0.7^20 ≈ 0.0008 that all of its networks realize such imperfect weights.

Secondly, dealing with high-dimensional data can be problematic for MLP networks. A large number of inputs means building networks with many neurons and connecting links, and the weights associated with such links need a sufficiently large number of training patterns before the network can reliably predict future examples. In many cases the available training data is limited, so feature selection or reduction becomes necessary for fitting a robust MLP model. Such a preprocessing procedure may fail when, for instance, the input variables are of equal importance and/or it is desirable to use the original feature space. The Neural Forests approach to handling the input space is very effective when the data have a large number of dimensions: by selecting only a random subset of an acceptable size, one can avoid the possibility of immature weight optimization.
Since this random selection of features is repeated many times (1000 in our experiments), it is very likely that every feature is taken into account in more than one network.

Thirdly, training MLP networks with back-propagation is a time-consuming process. The time required for the weights to converge to their optimal values scales with their number, and, as discussed above, the number of weights depends on the number of input neurons (given a fixed number of hidden-layer neurons). NF, which uses only a subset of the inputs, is faster to train than a single-network model that accounts for the whole input space, even though hundreds of networks are fitted. The parameter mtry therefore directly influences the training speed when the number of hidden units is fixed.

Fourthly, neural networks are not transparent and are typically regarded as black boxes when used for prediction. The lack of interpretability of neural-network models limits their use in solving nonlinear regression and classification problems. Similar to RF models, NF provides end users with a way to identify the inputs most relevant to the response variable: in both RF and NF, the influence of each input feature on the response variable can be expressed as an absolute number. NF can also, like any other prediction method studied in this work, apply Equation 2.18 to produce marginal (partial) dependence plots.

Finally, estimating the generalization ability of a predictive model is essential for judging its effectiveness on future data patterns. Typically, an out-of-sample portion of the training data is used to evaluate the model's performance, but in many cases the data available for training are already scarce and further reducing them may render the model unreliable. To use all the available data both for training and for reliable validation, cross-validation or bootstrap techniques are widely applied. Despite their effectiveness, these techniques are computationally expensive, and the problem is even more pronounced when they are used to tune the model's hyper-parameters. Like RF, NF offers an efficient and reliable way to estimate the model's generalization error: out-of-bag (OOB) data patterns serve as out-of-sample test examples that are evaluated only on networks whose training data did not include them. This process is very efficient because testing can run simultaneously with training, and once the ensemble is built its estimated generalization error is readily available. We use the OOB RMSE statistic as an estimate of the NF generalization error.

The NNET R package is used to build the NF networks [87]. This package implements an algorithm for training single-hidden-layer, feed-forward, back-propagation networks. NNET allows users to fit both classification and regression networks with different loss and activation functions, and it also provides a function to plot the fitted networks.

2.5.2.3 Boosted Regression Trees (BRT)

BRT is an ensemble machine-learning technique based on a stage-wise fitting of regression trees. As the name implies, the technique attempts to minimize the overall loss by boosting the examples with the highest prediction errors; that is, by fitting regression trees to the (accumulated) residuals made by the previous ensemble learners. There are several different implementations of the boosting concept in the literature.
The differences mainly arise from the employed loss functions and the treatment of the most erroneous predictions. In this work we employ the stochastic gradient boosting strategy proposed by Friedman [88]. This approach builds a stage-wise model as listed in Figure 2.3 below. The algorithm starts by fitting the first regression tree to all the training data. Step 3 (in Figure 2.3) shows that the tree constructed in Steps 1 and 2 is the first term in the BRT additive model. In each subsequent stage, a tree is fitted to the residuals obtained by applying the entire model to a random subset of the training data (Steps 4.a, b, and c). It should be noted that the split variables and split values of the internal nodes of every tree (except the first) are decided from the current residuals, whereas the constant values at the terminal nodes are optimized to minimize the loss of the entire model. Tree generation continues as long as the number of trees does not exceed a predefined limit. At the end of each tree-generating iteration, the BRT model is updated according to the formula in Step 4.e: a tree joins the ensemble as a shrunken version of itself. The variable that controls its contribution to the final model is known as the shrinkage, or learning rate, λ. The smaller the learning rate, the more confidently we approach a minimum of the error surface; however, a reduction in the learning rate should be compensated by increasing the number of fitted trees by roughly the same factor to obtain the same performance. In our experiments we fixed the shrinkage at 0.005 and evaluated the corresponding optimal number of trees using cross-validation.

The other parameter in the algorithm (listed in Figure 2.3) that requires assignment is the number of terminal nodes created in every tree. This parameter directly affects the size of the trees and therefore controls the interaction depth between the explanatory variables. In this text we refer to this variable as the interaction depth (which equals the number of terminal nodes minus one). An interaction depth of one yields an ensemble of stumps, two-level trees are built when the interaction depth is set to two, and so on. The interaction depth, the number of trees, and the learning rate are related to each other: larger trees (larger interaction depth) usually require fewer trees to be fitted for a fixed learning rate. Therefore, these parameters should be optimized as a combination to generate accurate models. Section 3.2 discusses the details of the tuning procedure. Another parameter employed in the BRT algorithm is the bag fraction, which controls the size of the subset of data selected at random each time a tree is built. This data perturbation has two advantages. The first has to do with accuracy: it is known that stochasticity lowers the variance of the overall model. The other is computational, since constructing a tree on the whole training data takes longer than fitting a tree to just a subset of it. In our analysis we choose the subset size to be 50% of the total training data; from our experience, subset sizes of 40% to 75% work equally well.

1. T_0 = fit an S-terminal-node tree to the data {(X_i, Y_i)}_{i=1}^{N}.
2. Θ_0 = argmin_Θ Σ_{i=1}^{N} Ψ(Y_i, T_0(X_i, Θ)).
3. F_0(X) = T_0(X, Θ_0).
4. For l = 1 to L do
   a. {(X_i^l, Y_i^l)}_{i=1}^{Ñ} = random_subset_without_replacement({(X_i, Y_i)}_{i=1}^{N}, 50%).
   b. {Ỹ_i^l}_{i=1}^{Ñ} = {Y_i^l − F_{l−1}(X_i^l)}_{i=1}^{Ñ}   (current residuals).
   c. T_l = fit an S-terminal-node tree to the data {(X_i^l, Ỹ_i^l)}_{i=1}^{Ñ}.
   d. Θ_l = argmin_Θ Σ_{i=1}^{Ñ} Ψ(Y_i^l, F_{l−1}(X_i^l) + T_l(X_i^l, Θ)).
   e. F_l(X) = F_{l−1}(X) + λ·T_l(X, Θ_l).
5. End for.

Figure 2.3 The BRT algorithm (Ψ denotes the loss function).

Since the BRT algorithm involves creating a large number of trees, the resulting model may seem very complex to interpret. However, similar to RF, BRT offers a method to compute the relative influence of the individual input variables on the prediction. Unlike RF, BRT determines this criterion from the splits made on the input variable X_j rather than by perturbing it. The importance estimate I for the variable X_j in the t-th tree is calculated as

$I_t^j = \sum_{n=1}^{S-1} i_n^2 \, 1(v_n = j),$   (2.22)

where v_n denotes the non-terminal nodes of the S-terminal-node tree t, and the statistic i_n denotes the improvement in performance (reduction of the squared error) when the split at node n is made on the variable j (v_n = j). To generalize the importance metric over all the trees, we take the average over the ensemble:

$I^j = \frac{1}{T} \sum_{t=1}^{T} I_t^j.$   (2.23)

BRT also generates partial dependence plots in the same way RF produces them. We will present and analyze the results of this feature in more detail in Chapter 3. BRT models were also built in R, using the publicly available GBM package (version 1.6-3.1) [89]. GBM implements Friedman's gradient boosting machine [90]. It has a built-in cross-validation tool that allows the optimal number of trees to be chosen automatically, and it supplies default values for all the parameters. Furthermore, GBM implements an algorithm for computing and presenting the relative importance of the explanatory variables and produces partial dependence plots; these functions are similar to the ones implemented in the RF package.

2.6 Scoring Functions under Comparative Assessment

A total of sixteen popular scoring functions are compared to machine-learning-based scoring functions in this study. The sixteen functions are either used in mainstream commercial docking tools and/or have been developed in academia, and they were recently compared against each other in a study conducted by Cheng et al. [32]. This set includes five scoring functions in the Discovery Studio software (version 2.0) [91]: LigScore [92], PLP [93], [94], PMF [24], [25], [27], [29], Jain [95], and LUDI [20], [96]; five scoring functions in the SYBYL software (version 7.2) [67]: D-Score, PMF-Score, G-Score, ChemScore, and F-Score; three scoring functions from the GOLD software (version 3.2) [97], [98]: GoldScore, ChemScore [99], [100], and ASP [101]; and GlideScore [12], [102] in the Schrodinger software (version 8.0) [103]. In addition, two standalone scoring functions developed in academia are also assessed, namely DrugScore [26], [28] and X-Score version 1.2 [57]. Some of the scoring functions have several options or versions; these include LigScore (LigScore1 and LigScore2), PLP (PLP1 and PLP2), and LUDI (LUDI1, LUDI2, and LUDI3) in Discovery Studio; GlideScore (GlideScore-SP and GlideScore-XP) in the Schrodinger software; DrugScore (DrugScore-PDB and DrugScore-CSD); and X-Score (HPScore, HMScore, and HSScore). For brevity, we report only the version and/or option that yields the best performance on the PDBbind benchmark considered by Cheng et al. [31], [32], [104].
CHAPTER 3
Results and Discussion

Having discussed the compound datasets and the machine-learning methods we will use, in this chapter we focus on the construction and evaluation of scoring functions. First, in Section 3.1, we discuss the criteria that will be used in later sections to evaluate the effectiveness of scoring functions. Next, in Sections 3.2 and 3.3, we discuss how we tune the modeling parameters of the different scoring functions. Figure 3.1 depicts the process of building tuned models, which are then trained on a training dataset and subsequently used for binding-affinity prediction using features extracted from independent test protein-ligand complexes. Finally, we describe a number of experiments and present results in Sections 3.4 through 3.10 to assess the scoring and ranking accuracies, interpretive capability, and computational complexity of scoring functions.

Figure 3.1 Workflow of scoring function construction (data collection, feature generation from the protein 3D structure and the ligand, training-set formation, parameter tuning by 10-fold cross-validation, and model building and validation, yielding a scoring function that outputs a predicted binding affinity).

3.1 Evaluation of Scoring Functions

Perhaps the most important step in the docking process is scoring the ligands' conformations in their respective binding sites. This core step affects the outcome of a drug-search campaign, because the scoring function's predictions determine which binding orientation/conformation is best, the absolute binding affinity, and which ligand from a database is likely to be the most effective drug. Correspondingly, three main characteristics that a reliable scoring function should have are: (i) the ability to identify the correct binding modes of ligands, (ii) accuracy in predicting the binding affinity of a protein-ligand complex, and (iii) the capability of ranking active ligands above inactive ones. These three performance attributes were referred to by Cheng et al. as docking power, scoring power, and ranking power, respectively [32]. It should be noted that all three aspects are related to each other and depend on the prediction of the absolute binding affinity.

Docking power measures the ability of a scoring function to distinguish a native binding mode from a poor one. Typically, generated conformations are ranked in descending order according to their predicted binding affinities, and ligand poses that are very close to the experimentally determined ones should be ranked high. The closeness is measured in Angstroms (Å) as the root-mean-square deviation (RMSD) from the true binding pose; generally, in docking, a pose with an RMSD of 2 Å or less is considered successful. This criterion is simple and effective when the number of generated poses is in the range of hundreds to thousands. However, we did not attempt to evaluate the docking power of scoring functions in this work, mainly because almost all the scoring functions investigated by Cheng et al. showed high docking power [32]. Therefore, we focus on the other two criteria, for which the performance of these scoring functions showed high variance and left room for major improvement.

The second aspect of scoring function performance is its ability to accurately estimate the binding energy of a protein-ligand complex. The binding affinity represents the degree of tightness between a ligand and a protein.
Accurate estimation of the binding energy/affinity is a very challenging task and is much more difficult than identifying the correct binding mode. In our experiments, we measure the scoring power of every machine-learning-based scoring function using the Pearson and Spearman correlation coefficients and the standard deviation (SD) of the errors between predicted and measured binding affinities. The Pearson correlation coefficient is widely used to describe the linear dependence between two random variables and is defined as

$R_P = \frac{\sum_{i=1}^{N} (\hat{Y}_i - \bar{\hat{Y}})(Y_i - \bar{Y})}{\sqrt{\left[\sum_{i=1}^{N} (\hat{Y}_i - \bar{\hat{Y}})^2\right]\left[\sum_{i=1}^{N} (Y_i - \bar{Y})^2\right]}},$

where N is the number of complexes and $\hat{Y}_i$ and $Y_i$ are the predicted and measured binding affinities of the i-th complex, respectively. The average values of the predicted and experimentally measured affinities over all complexes are $\bar{\hat{Y}}$ and $\bar{Y}$, respectively. In order to evaluate the correlation between the predicted and measured binding affinities in terms of their ranks, the Spearman coefficient is used instead, defined as

$R_S = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)},$

where d_i is the difference between the rankings of the predicted and measured binding affinities for the i-th complex; these rankings are assigned by ordering the complexes according to their binding affinities. Both R_P and R_S range between −1 and 1, and they are closely related in practice. The scoring function that achieves the highest correlation coefficient on some dataset is considered more accurate (in terms of scoring power) than its counterparts that realize smaller R_P (and/or R_S) values.

Another accuracy measure reported here is the standard deviation (SD) of the errors between predicted and measured binding affinities (in −log Kd or −log Ki units). For a given scoring function, a linear model that correlates the predicted scores $\hat{Y}$ to the measured ones Y is first fitted,

$Y = \beta_0 + \beta_1 \hat{Y},$

and the SD statistic is then computed as

$SD = \sqrt{\frac{\sum_{i=1}^{N}\left[Y_i - (\beta_0 + \beta_1\hat{Y}_i)\right]^2}{N-2}},$

where β_0 and β_1 are, respectively, the intercept and the slope of the linear model. For most of our experiments on machine-learning-based scoring functions, we report their scoring power by at least one of the above statistics (i.e., by at least one of R_P, R_S, and SD).

The third performance criterion for assessing scoring functions is ranking power. Ideally, a scoring function should give high binding-affinity scores to ligands that physically bind tightly to the given target; by the same token, ligands that do not bind (or bind loosely) to a receptor should receive low binding-affinity scores. By ranking ligands according to their estimated binding scores, one can choose the top ligands as the ones that have a higher chance of binding tightly to the target protein in HTS. This is the main idea used when screening a large database of ligands: ligands with high ranks proceed to the next stages of drug design. In this work, the ranking power of all machine-learning-based scoring functions is calculated across a set of more than 60 different proteins. These protein sets can be regarded as clusters of complexes, where each cluster consists of three complexes with the same protein family. The ligands corresponding to the complexes within a cluster are chosen such that, when they bind to their protein, they maximize the range of binding-affinity values for that common protein.
More specifically, for each protein family we consider three complexes: those with the highest, lowest, and median experimentally determined binding affinities among all complexes corresponding to the given protein. Calculating the ranking power of a scoring function is then straightforward. First, each protein-ligand complex is scored. Then, for each protein cluster (i.e., the three protein-ligand complexes associated with a common protein), the complexes are ordered according to their predicted binding-affinity scores. A given cluster is considered correctly ranked if its predicted-affinity-based order matches its measured-affinity-based order. The ratio of the number of correctly ranked clusters to the overall number of clusters is taken in this work as an estimate of the ranking power of a given scoring function. The ranking power of scoring functions is calculated only for the core test set, which is composed of 64 protein families. For test sets that are not composed of clusters, the Spearman correlation coefficient (R_S) can be interpreted as an alternative measure of ranking power.

3.2 Parameter Tuning and Cross-Validation

Most of the machine-learning approaches we employ for constructing scoring functions in this study need some form of parameter selection. Specifically, the kNN method has two parameters that require optimization: the neighborhood size k and the degree q of the Minkowski distance. MARS's most important parameters are the number of terms in the model, the degree of each term, and the penalty associated with adding new terms. For the SVM model, we have three parameters: the complexity constant C, the width ε of the ε-insensitive zone, and the width σ of the radial basis function. The Random Forests (RF) algorithm has effectively only one important parameter, mtry, which is the number of features randomly selected at each node split when growing the forest's trees. In addition to mtry, Neural Forests (NF) employs the weight-decay parameter, which needs to be tuned. Boosted regression trees (BRT), on the other hand, has several parameters; the one we tune is the interaction depth.

The performance of some of these predictive techniques is quite sensitive to the values assigned to their meta-parameters. Generally, the optimal values of such parameters are unknown for any given application and need to be estimated from the available data. One might think that parameter values could be selected such that the respective model best fits the training data, i.e., to minimize the training error. Although this is an intuitive way to select meta-parameter values, its biggest drawback is that the resulting model may overfit when applied to external data. Therefore, we must optimize the model's parameters so that the expected generalization error, rather than the training error, is minimized. For this purpose, we employ a simple method known as cross-validation, which is widely applied in parameter selection. This technique utilizes all the available training data efficiently for parameter tuning, and models built using the resulting optimal parameters usually generalize well to unseen data. In K-fold cross-validation, the training data (Pr in our case) is partitioned into K non-overlapping and approximately equal-sized regions.
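For instance (reusing the hypothetical pr data frame of Section 2.4), the K = 10 partitions can be formed by randomly assigning a fold label to each complex:

```r
## Assigning the K = 10 fold labels for the Pr complexes (base R).
set.seed(1)                                        # arbitrary seed for reproducibility
folds <- sample(rep(1:10, length.out = nrow(pr)))  # shuffled, ~equal-sized partitions
table(folds)                                       # sizes of the 10 non-overlapping folds
```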
During the k-th iteration of the K-fold cross-validation-based parameter optimization, k = 1 through K, a model f with a candidate parameter set α_j is fitted to K − 1 parts of the data and validated on the remaining k-th part, as shown in Figure 3.2. In each iteration, the loss (typically the RMSE) of the k-th model is calculated on the k-th validation part and accumulated.

Figure 3.2 10-fold cross-validation (in each fold k = 1, ..., 10, one partition serves for validation and the remaining partitions for training).

These accumulated values are averaged over all K folds at the end of the K-th iteration (the 10th in our case). The resulting quantity avg(CV_k) is the cross-validation estimate of the prediction error for the parameter set α_j. The same process is repeated for the next parameter set α_{j+1}, and the new avg(CV_k) is recorded. We stop when j reaches θ (i.e., the last parameter set to examine). Once the given parameter space has been traversed, the parameter set associated with the minimum overall cross-validation error is selected as the optimal parameter set α_opt. The parameter optimization algorithm is listed in more detail in Figure 3.3.

1. Initialize the arguments of the optimization algorithm:
   a. Let Γ be the machine-learning method whose parameters need to be tuned.
   b. Let K = 10.
   c. Let the training set Tr = Shuffle(Pr), i.e., randomly shuffle the training data.
   d. Split Tr into K "equal" and non-overlapping partitions.
   e. Let the parameter space be α = {α_j}_{j=1}^{θ}.
   f. Let the set CV_α = { } be storage for the cross-validation estimates of the prediction error associated with every parameter set α_j ∈ α.
2. For j = 1 to θ do:
   a. Let k = 1.
   b. Let the set CV_k = { } be a buffer for the K RMSEs.
   c. For k = 1 to K do:
      i. Let Ñ be the number of data instances in the k-th partition.
      ii. Let the validation set be denoted by Ts_k = {(X_i, Y_i)}_{i=1}^{Ñ}.
      iii. Let the k-th training data be Tr_k = Tr \ Ts_k, i.e., exclude the k-th partition from the original dataset Tr.
      iv. Fit the model f to the training data Tr_k using the fitting method Γ and the parameter set α_j.
      v. CV_k ← CV_k ∪ { [ (1/Ñ) Σ_{i=1}^{Ñ} (Y_i − f(X_i))² ]^{1/2} }.
   d. End For loop of Step 2.c.
   e. CV_α ← CV_α ∪ { avg(CV_k) }.
3. End For loop of Step 2.
4. The optimal parameter set is α_opt = argmin(CV_α), i.e., the α_j associated with the lowest value in the set CV_α.

Figure 3.3 Parameter tuning algorithm.

An important question that arises is how to decide on the optimal value of K. Unfortunately, there is no precise answer to this question; however, according to some studies, values of 5 to 10 usually yield good results [105], [106]. If larger or smaller values of K are desired, one should be cautious about the model's variance and bias. Typically, larger values of K yield high-variance models because of the large similarity between the datasets on which the models are fitted. The largest value K can take is the number of training instances, N; in that scenario, the technique is called leave-one-out (LOO) cross-validation. On the other hand, when K is too small, biased models are likely to be generated, because such models are trained on smaller datasets and therefore do not "learn enough."

We tuned the MARS and kNN models with the 10-fold cross-validation algorithm presented above. SVM and the ensemble methods RF, NF, and BRT have built-in techniques that can be utilized to tune their parameters, so we used a modified version of the tuning algorithm discussed above.
The most significant modification is to replace the internal cross-validation loop of the algorithm with the similar tools built into the SVM, RF, and BRT packages. For all machine-learning methods, however, we use a grid search to locate the optimal parameters [107]. In a grid search, tuples of parameter values are tried and the one with the lowest overall loss is picked. The parameter space for any method is constructed from linearly increasing sequences of parameter values; the only exception to this rule is when we optimize SVM's meta-parameters, in which case the parameter sequences are grown exponentially. Next, we discuss this approach further and present the optimal parameters obtained for each machine-learning method.

3.2.1 Multivariate Adaptive Regression Splines (MARS)

The MARS algorithm has several parameters, many of which are left at their default values as suggested by the authors of the MARS implementation we use in our study [73]. Among these parameters, for instance, is the maximum number of terms allowed in the forward pass, which takes the value 21 if the number of descriptors in the dataset is less than 10; otherwise, the default for this parameter is twice the number of descriptors plus one. Another parameter that could have been tweaked is the maximum number of terms in the final model (after pruning), which is useful when it is desirable to impose an upper limit on the model's size. However, we leave it to the algorithm to decide on the optimal model size using the more important parameters. The two parameters we do optimize in MARS are the interaction degree between variables and the GCV (generalized cross-validation) penalty associated with each knot during the pruning pass of the algorithm. We performed a nested sweep over these two parameters to construct the parameter space: the degree parameter was swept linearly from one (i.e., an additive model) to three (each term in the final model may have up to three knots) in steps of one, and for each degree value the penalty parameter was swept from one to ten with a step size of one. By applying this procedure to all optimization datasets (PrX, PrA, etc.), we noticed that when models are trained and tuned on relatively high-dimensional datasets, a degree of one was always selected automatically. The overall cross-validation error, on the other hand, was not very sensitive to changes in the penalty value. This conclusion is illustrated in Figure 3.4, where the model parameters were tuned on the PrXAR dataset: although the curve looks quite flat, the graph shows that a penalty of six yields the lowest overall cross-validation RMSE when the degree of the model is set to one. The optimal values of the MARS parameters on all datasets are listed in Table 3.1.

Figure 3.4 Tuning the parameter GCV-penalty in MARS.

3.2.2 k-Nearest Neighbors (kNN)

The kNN algorithm we use has only two parameters: the number of neighbors (k) and the order of the Minkowski distance (q). We swept k from one to twenty and q from one to five to build the parameter grid to be searched. It turns out that the Manhattan distance (q = 1) produces the most accurate models for all datasets. In addition, the optimal value of the neighborhood size (k) lies somewhere beyond ten. This finding agrees with the general rule of thumb that a
The left panel of Figure 3.5 shows the average cross-validation error as the number of neighbors varies from one to twenty with the q parameter fixed at its optimal value of one. The right panel shows the effect of changing the order of the Minkowski distance (q) on the models' performance when the neighborhood size is set to eighteen; low values of q give slightly better-performing models. Both plots result from tuning the kNN model parameters on the dataset PrXAR; refer to Table 3.1 for the other datasets.

Figure 3.5 Tuning the parameters k and q in kNN.

3.2.3 Support Vector Machines (SVM)

Perhaps the machine-learning technique most sensitive to parameter tuning (compared with the other methods considered in this study) is SVM. The optimization algorithm targets three parameters: the cost parameter C, the thickness of the ε-insensitive zone (ε), and the width of the radial basis function (σ). An exhaustive grid search for the optimal combination of parameters is conducted by first expanding the parameter space with sequences of exponentially increasing values for the three parameters. More specifically, C is swept from $2^{-10}$ to $2^{10}$; for each C value, ε iterates from $2^{-15}$ to $2$; and for each C–ε pair, the last parameter, σ, is also varied from $2^{-15}$ to $2$. In each case the exponent is incremented in steps of two. Figure 3.6 shows the resulting cross-validation error when the parameters are optimized for the PrXAR dataset. The values C = 1 ($2^{0}$), ε = 0.25 ($2^{-2}$), and σ = 0.03125 ($2^{-5}$) generate the model with the lowest expected generalization error. The upper panel displays the cost parameter C plotted against the cross-validation RMSE when ε and σ are fixed at their optimal values of $2^{-2}$ and $2^{-5}$, respectively: the cross-validation error quickly declines as C increases and then rises slowly after reaching its minimum at C = 1. The middle panel shows the cross-validation error recorded at different ε values with C = 1 and σ = $2^{-5}$ fixed; the resulting curve shows that the SVM model is insensitive to changes in ε so long as it is below 0.5. The width of the radial basis function, on the other hand, is quite an influential parameter for the performance of the SVM model: the bottom panel shows a sharp increase in the error as we move away from a σ value of $2^{-5}$. We will use the optimal values C = 1, ε = 0.25, and σ = 0.03125 in our experiments on XAR-based data, or the values listed in Table 3.1 for the other datasets.

Figure 3.6 Tuning the parameters C, ε and σ in SVM.

3.2.4 Random Forests (RF)

In contrast to the other prediction approaches investigated in this work, this ensemble method is the least sensitive to changes in its parameters, of which there are only two. The number of trees grown in the model has no significant impact on performance so long as it is set to several hundred; we fixed it at 1000 for all datasets. The other parameter is the number of features randomly sampled at each node split when growing the model's 1000 trees (known as mtry). Breiman, the author of the Random Forests algorithm, proposes that mtry be set to a value around one third of the dataset dimension [34], and suggests that the model's accuracy is roughly flat around that region. When mtry was swept from 2 to 72 for the dataset PrXAR, we obtained the data depicted in Figure 3.7.
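The sketch below shows, under scikit-learn conventions, how such an mtry sweep can be scored with the out-of-bag (OOB) error rather than cross-validation. It is an illustrative approximation, not the implementation used in this thesis.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def sweep_mtry(X, y, mtry_values, n_trees=1000, seed=0):
        """Return the OOB RMSE for each candidate mtry (max_features) value."""
        oob_rmse = {}
        for mtry in mtry_values:
            rf = RandomForestRegressor(n_estimators=n_trees,
                                       max_features=mtry,
                                       oob_score=True,
                                       bootstrap=True,
                                       random_state=seed)
            rf.fit(X, y)
            # oob_prediction_ holds the OOB prediction for every training instance
            resid = y - rf.oob_prediction_
            oob_rmse[mtry] = float(np.sqrt(np.mean(resid ** 2)))
        return oob_rmse

    # e.g. sweep_mtry(X_prxar, y_prxar, mtry_values=range(2, 73, 5)) on a hypothetical PrXAR matrix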
The out-of-bag error does not appear to be affected by changes in mtry. Similar conclusions were drawn for the other datasets with the mtry values listed in Table 3.1. This is yet another powerful characteristic of the random forests algorithm: the technique's small number of parameters and its insensitivity to their fluctuations make RF an appealing off-the-shelf prediction tool.

Figure 3.7 Tuning the parameter mtry in RF.

Unlike the procedure of Figure 3.3, cross-validation was not performed to tune the RF parameters; instead, we employed the out-of-bag statistic to estimate the generalization error.

3.2.5 Neural Forests (NF)

The NF model was allowed to learn 1000 MLP networks. The number of hidden-layer neurons in these networks was set to 20, and the number of inputs is controlled by the variable mtry. The weights of each network in the ensemble were optimized using back-propagation and regularization techniques. As discussed in Section 2.5.2, the weight-decay parameter λ needs to be tuned for the weights to converge to their optimum values. We conducted a grid search for the optimal pair of mtry and λ values that minimizes the OOB root-mean-squared error (RMSE). The parameter mtry was swept linearly between two boundary values a and b with a step size of c. For most datasets (PrXA, PrR, etc.), a, b, and c were set to 10, 35, and 5, respectively. For datasets with a smaller number of features, such as X and A, the values of a, b, and c were set to 2, 6, and 1 for PrX and to 10, 25, and 5 for PrA. The weight-decay parameter λ was varied exponentially between $2^{-15}$ and $2$ with the exponent incremented in steps of two.

Figure 3.8 Tuning the parameter mtry and the weight decay (λ) in NF.

The optimal values of mtry and λ for complexes described by XAR features are 15 and ≈ 0.001, respectively. As shown in Figure 3.8, NF performance is slightly more sensitive to the value of mtry than the RF model is; this curve was obtained with λ fixed at its optimal value (0.001). The right panel of Figure 3.8 also shows that the OOB RMSE is fairly insensitive to the exact value of λ as long as it is less than ≈ 0.06 ($2^{-4}$); the data in the right panel were collected with mtry fixed at 15. The optimal values of mtry and λ are almost the same for the other feature sets, as listed in Table 3.1.

3.2.6 Boosted Regression Trees (BRT)

The BRT algorithm has several parameters that determine the model's goodness-of-fit and generalization performance. Among the most influential are the number of trees to fit, the shrinkage factor that determines the contribution of each tree to the whole ensemble, the interaction depth of each tree, and, less importantly, the bag fraction. The last parameter, the bag fraction, controls the number of instances randomly drawn from the original training data to grow the next tree. The algorithm's shrinkage parameter, or learning rate, was fixed at 0.005. The interaction depth of the grown trees was swept from one to twenty with a step size of one, and for each interaction depth we obtained the optimal number of trees. The optimal number of trees is calculated internally by the BRT algorithm, which employs a k-fold cross-validation procedure to find the number of trees (up to a maximum of 8000) that minimizes the overall error.
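A minimal sketch of this nested search is given below. It approximates the built-in cross-validated tree selection with a single held-out split and uses scikit-learn's GradientBoostingRegressor as a stand-in for the BRT package; the bag fraction of 0.5 and the split are assumptions of the sketch, not values taken from the thesis.

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error

    def tune_brt(X, y, depths=range(1, 21), max_trees=8000, lr=0.005, seed=0):
        X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1,
                                                    random_state=seed)
        best = None
        for depth in depths:                              # interaction depth swept 1..20
            gbm = GradientBoostingRegressor(n_estimators=max_trees,
                                            learning_rate=lr,    # shrinkage fixed at 0.005
                                            max_depth=depth,
                                            subsample=0.5,       # bag fraction (assumed value)
                                            random_state=seed)
            gbm.fit(X_tr, y_tr)
            # staged_predict yields predictions after 1, 2, ..., max_trees trees
            rmse = [np.sqrt(mean_squared_error(y_val, p))
                    for p in gbm.staged_predict(X_val)]
            n_opt = int(np.argmin(rmse)) + 1
            if best is None or min(rmse) < best[2]:
                best = (depth, n_opt, min(rmse))
        return best   # (optimal depth, optimal number of trees, validation RMSE)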
This nested procedure therefore yields the optimal interaction depth and the optimal number of trees for that depth (16 and 2213, respectively, for PrXAR). The results of the tuning run on the PrXAR dataset are summarized in Figure 3.9. The cross-validation error holds steady even when relatively large trees are grown. However, the optimal number of trees for each interaction depth drops as the trees get larger, as can be seen in the right panel of Figure 3.9: the model's size (in number of trees) at an interaction depth of 5 (~6000 trees) is about three times its size when the interaction depth climbs to 20 (~2000 trees).

Figure 3.9 Tuning the parameter Interaction Depth in BRT.

Table 3.1 Optimal parameters for the MARS, kNN, SVM, RF, NF and BRT models when complexes are described by the features X (X-Score), A (AffiScore), R (RF-Score), and their combinations.

Method  Parameter    X      A      R      XA     XR     AR     XAR
MARS    Degree       2      1      1      1      1      1      1
MARS    Penalty      2      6      5      7      2      7      6
kNN     k            15     13     14     9      19     19     18
kNN     q            1      1      1      1      1      1      1
SVM     C            2      2      1      2      4      2      1
SVM     ε            0.5    0      0.25   0.125  0.125  0.25   0.25
SVM     σ            1      0.25   0.125  0.125  0.031  0.031  0.031
RF      mtry         3      18     8      31     5      10     14
NF      mtry         5      15     15     15     15     15     15
NF      λ            0.001  0.004  0.001  0.002  0.001  0.001  0.001
BRT     Int. depth   15     17     18     19     15     19     16
BRT     N. trees     1114   1523   1573   1371   2113   2950   2213

3.3 Tuning and Evaluating Models on Overlapped Datasets

In some experiments we carry out, we use the Pr and Sc datasets for building scoring functions and Cr for testing them. In other cases, though, Pr and Sc are used for both training and testing. As pointed out in Section 3.2, the entire Pr dataset was already used to tune the free parameters of the machine-learning models we intend to use as scoring functions. Consequently, there is some overlap between the data used (in one way or another) to build a model and the data on which it is evaluated. To judge the performance of a machine-learning-based scoring function fairly, parameter tuning should be performed on a dataset that is independent of the test set, as is the case for training. However, in some situations it is computationally expensive to tune a model's parameters every time we train and test it, particularly if the experiment involves building ensembles of models (in order to obtain unbiased and robust results, for instance). For the prediction techniques under study, we therefore ask: how much performance is gained when the tuning and test datasets overlap? If the gain is insignificant, we may relax the constraint that parameter selection and model testing be conducted on disjoint datasets.

To estimate the performance gain for kNN, SVM, and MARS, we applied the algorithm shown in Figure 3.10. The algorithm starts by splitting the training data Pr into 10 non-overlapping training (90%) and test (10%) datasets. It proceeds to create two models: one is tuned on the whole Pr dataset (the overlapping model), while the other is tuned only on the k-th training part of the data (the disjoint model). The optimal parameters are subsequently used to fit both models to the k-th training partition, and the scoring functions are then evaluated on the k-th test set. When the last fold is reached, the cross-validation errors of the overlapping and disjoint models are averaged and compared. Table 3.2 shows the results (gain in performance) for the kNN, MARS, and SVM scoring functions across the seven Pr datasets.
1. Initialize the arguments of the algorithm:
   a. Let Γ be the machine-learning method whose parameters need to be tuned.
   b. Let K = 10.
   c. Let the training set Tr = Shuffle(Pr), i.e., randomly shuffle the training data.
   d. Split Tr into K "equal" and non-overlapping partitions.
   e. $\alpha_{ovr} = \mathrm{tune}(Tr, \Gamma)$: tune the parameters for the "overlapping" model $f_{ovr}$.
   f. Let the sets $CV_{ovr} = \{\}$ and $CV_{dis} = \{\}$ be RMSE buffers for the overlapping ($f_{ovr}$) and disjoint ($f_{dis}$) models, respectively.
2. For k = 1 to K do:
   a. Let $\tilde{N}$ be the number of data instances in the k-th partition.
   b. Let the validation set be $Ts_k = \{(X_i, Y_i)\}_{i=1}^{\tilde{N}}$.
   c. Let the k-th training data be $Tr_k = Tr \setminus Ts_k$, i.e., exclude the k-th partition from the original dataset Tr.
   d. $\alpha_{dis} = \mathrm{tune}(Tr_k, \Gamma)$: tune the parameters for the "disjoint" model $f_{dis}$.
   e. Fit the $f_{ovr}$ and $f_{dis}$ models to the training data $Tr_k$ using the fitting method Γ and the parameter sets $\alpha_{ovr}$ and $\alpha_{dis}$, respectively.
   f. $CV_{ovr} \leftarrow CV_{ovr} \cup \left\{\left[\tfrac{1}{\tilde{N}}\sum_{i=1}^{\tilde{N}}\left(Y_i - f_{ovr}(X_i)\right)^2\right]^{1/2}\right\}$
   g. $CV_{dis} \leftarrow CV_{dis} \cup \left\{\left[\tfrac{1}{\tilde{N}}\sum_{i=1}^{\tilde{N}}\left(Y_i - f_{dis}(X_i)\right)^2\right]^{1/2}\right\}$
3. End For loop.
4. $RMSE_{ovr} = \mathrm{avg}(CV_{ovr})$
5. $RMSE_{dis} = \mathrm{avg}(CV_{dis})$
6. Gain in performance (%) $= \dfrac{RMSE_{dis} - RMSE_{ovr}}{RMSE_{dis}} \times 100$

Figure 3.10 Algorithm for tuning and evaluating models on overlapped datasets.

Based on the numbers listed in Table 3.2, it would not be a subjective claim to conclude that tuning the parameters of MARS, kNN, or SVM models on overlapping datasets does not significantly help their performance. Most of the enhancements in performance are less than 1%, and some models even perform slightly better when tuned and tested on independent datasets. Therefore, it should not be an issue when models are tuned on datasets that share instances with the datasets on which they are tested.

Table 3.2 Results of tuning and evaluating models on overlapped datasets (cross-validation RMSE; gain in %).

Method  Criteria    X      A      R      XA     XR     AR     XAR
MARS    Disjoint    1.708  1.653  1.706  1.625  1.604  1.695  1.689
MARS    Overlap     1.679  1.642  1.679  1.628  1.600  1.704  1.680
MARS    Gain (%)    1.7    0.7    1.6    -0.2   0.2    -0.5   0.5
kNN     Disjoint    1.576  1.580  1.603  1.523  1.548  1.504  1.495
kNN     Overlap     1.563  1.572  1.603  1.515  1.547  1.501  1.495
kNN     Gain (%)    0.8    0.5    0      0.5    0.1    0.2    0
SVM     Disjoint    1.599  1.613  1.625  1.581  1.546  1.571  1.561
SVM     Overlap     1.596  1.629  1.622  1.584  1.556  1.575  1.569
SVM     Gain (%)    0.2    -1.0   0.2    -0.2   -0.6   -0.3   -0.5

3.4 Machine-Learning vs. Conventional Scoring Functions

We have created several scoring functions using different regression algorithms in conjunction with different feature sets extracted from the same compound set Pr. More specifically, we considered the primary training set described in seven different forms: PrX, PrA, PrR, PrXA, PrXR, PrAR, and PrXAR. Seven machine-learning models (MLR, MARS, kNN, SVM, BRT, RF, and NF) were then fitted to these datasets to generate 49 (7 × 7) different scoring functions. We distinguish each of them using the notation "machine-learning technique::tools used to calculate the features". For instance, kNN::XA indicates a k-nearest-neighbors model trained and tested on datasets (the primary training set and the core set, respectively) described by XA features, i.e., features extracted using the tools X-Score and AffiScore. We compared the performance of these 49 scoring functions against each other and against the 16 presented in Section 2.6. Similar notation is used for those 16 scoring functions, which were originally studied by Cheng et al. [32].
For example, when referring to the scoring function LigScore1 in the Discovery Studio software, the notation DS::LigScore1 is used. If we considered all scoring functions that could result from the modeling tools mentioned in Section 2.6, we would end up with a very large number of them, because every modeling tool typically supplies more than one version/option of the same scoring function, and each such variation could be regarded as a different scoring function. Therefore, only the version/option that yields the best performing scoring function was considered. LigScore, for example, comes in two options (LigScore1 and LigScore2). We follow the same approach for selecting a smaller set from the 49 scoring functions mentioned above: for each machine-learning technique, only the best performing model was selected out of the possible seven.

Table 3.3 lists the performance of the machine-learning-based scoring functions and the other 16 on the core test set Cr, ordered by scoring power in terms of the Pearson correlation coefficient. Most machine-learning-based scoring functions significantly outperform the 16 conventional scoring functions, not only in Pearson correlation coefficient but also in Spearman rank correlation coefficient and prediction standard error (SD). The BRT, NF, and RF models top the list of 23 scoring functions, with correlation coefficients close to 0.8. When the best performing machine-learning model (BRT) is compared, in terms of its Pearson correlation coefficient, against the best performing conventional scoring function of the 16 (X-Score::HMScore), we realize about a 24% improvement in scoring power. Such a boost in performance comes purely from the powerful fitting algorithm and the diversity of features describing protein-ligand complexes, since features used in both empirical and knowledge-based scoring functions are employed by the best machine-learning-based approaches. Moreover, the primary training dataset on which the machine-learning models were fitted does not overlap with the core set, whereas the training datasets of some of the 16 existing scoring functions share examples with the test set. Such overlap may, although not necessarily, benefit scoring functions that were trained and evaluated on non-independent datasets; it should nevertheless be avoided when possible. It is impossible to guarantee this requirement for all 16 scoring functions considered by Cheng et al., because some of them were built in different studies and details about their training data are poorly reported. Cheng and co-workers calibrated two versions of their X-Score scoring function. One, X-Score v1.2 (reported in Table 3.3), was trained and tested on partially overlapping datasets; the other, X-Score v1.3, was fitted to the primary training data and evaluated on the core set. The performance of the two functions was almost the same: the overlapping scoring function yielded RP = 0.644, while X-Score v1.3 yielded RP = 0.649.
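The three scoring-power statistics reported in the next table can be computed as in the hedged sketch below. SD is computed here as the residual standard error of a linear fit of the measured affinities on the predicted scores, which is one common convention for this benchmark; the exact definition used in Section 3.1 should be consulted before comparing numbers directly.

    import numpy as np
    from scipy.stats import pearsonr, spearmanr

    def scoring_power(y_measured, y_predicted):
        rp, _ = pearsonr(y_predicted, y_measured)       # Pearson correlation (RP)
        rs, _ = spearmanr(y_predicted, y_measured)      # Spearman rank correlation (RS)
        slope, intercept = np.polyfit(y_predicted, y_measured, deg=1)
        resid = y_measured - (slope * y_predicted + intercept)
        sd = float(np.sqrt(np.sum(resid ** 2) / (len(y_measured) - 2)))
        return rp, rs, sd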
Table 3.3 Comparison of the scoring power of 23 scoring functions on the core test set Cr.

Scoring function        N     RP     RS     SD
BRT::XAR                192   0.797  0.781  1.43
NF::XAR                 192   0.796  0.784  1.43
RF::XR                  192   0.794  0.788  1.44
SVM::XAR                192   0.768  0.792  1.51
KNN::AR                 192   0.735  0.736  1.60
MARS::XR                192   0.704  0.732  1.68
MLR::XAR                192   0.672  0.756  1.75
X-Score1.2::HMScore     195   0.644  0.705  1.83
DrugScoreCSD            195   0.569  0.627  1.96
SYBYL::ChemScore        195   0.555  0.585  1.98
DS::PLP1                195   0.545  0.588  2.00
GOLD::ASP               193   0.534  0.577  2.02
SYBYL::G-Score          195   0.492  0.536  2.08
DS::LUDI3               195   0.487  0.478  2.09
DS::LigScore2           193   0.464  0.507  2.12
GlideScore-XP           178   0.457  0.435  2.14
DS::PMF                 193   0.445  0.448  2.14
GOLD::ChemScore         178   0.441  0.452  2.15
SYBYL::D-Score          195   0.392  0.447  2.19
DS::Jain                189   0.316  0.346  2.24
GOLD::GoldScore         169   0.295  0.322  2.29
SYBYL::PMF-Score        190   0.268  0.273  2.29
SYBYL::F-Score          185   0.216  0.243  2.35

One may expect the ranking power of scoring functions to reflect their scoring capabilities. After all, ranking power is merely the ordering of ligands that bind to a common protein according to their affinity scores; moreover, to identify the correct ranking of ligands, the predicted binding scores do not have to correlate linearly with the measured affinity constants. Nevertheless, Cheng et al.'s results and ours indicate that scoring functions perform less satisfactorily in ranking than in scoring. Table 3.4 shows that the top scoring function succeeded in correctly ranking the three ligands in fewer than two-thirds of the targets (40 out of 64).

Table 3.4 Comparison of the ranking power of scoring functions on the core test set Cr.

Scoring function        N     Success rate (%)
BRT::XA                 192   62.5
RF::XAR                 192   60.9
NF::XAR                 192   59.4
X-Score1.2::HSScore     195   58.5
SVM::XAR                192   57.8
DS::PLP2                195   53.8
KNN::AR                 192   53.1
DrugScoreCSD            195   52.3
MLR::XAR                192   51.6
MARS::XR                192   50.0
SYBYL::ChemScore        195   47.7
SYBYL::G-Score          195   46.2
SYBYL::D-Score          195   46.2
GOLD::ASP               193   43.1
DS::LUDI3               195   43.1
DS::LigScore2           193   43.1
DS::PMF                 193   41.5
DS::Jain                189   41.5
SYBYL::PMF-Score        190   38.5
GOLD::ChemScore         178   36.9
GlideScore-XP           178   33.8
SYBYL::F-Score          185   29.2
GOLD::GoldScore         169   23.1

The difference in ranking power between the top machine-learning-based scoring function (BRT::XA) and X-Score::HSScore is less than 15%, compared with the 24% improvement in scoring power. In terms of ranking power, the top three scoring functions were again obtained with the ensemble methods (BRT, RF, and NF). The surprising result, however, is that the other machine-learning models that significantly outperform the X-Score scoring function in scoring power lag behind it in ranking power.

3.5 Dependence of Scoring Functions on the Size of Training Set

Experimental information about the structures and binding affinities of new protein-ligand complexes is regularly determined and made publicly available. The Protein Data Bank (PDB), for instance, is a large macromolecular archive that stores extensive and diverse information about proteins, nucleic acids, and complexes. The size of this database, of other similar public biochemical repositories, and of corporate compound banks is growing steadily. In this work, we use an annually updated database called PDBbind [31], [104]. As explained in Section 2.1, PDBbind provides binding affinities for high-quality 3D structures of protein-ligand complexes that can readily be used to build and test new scoring functions. More than 750 new protein-ligand complexes were deposited into this database between 2007 and 2010.
Here, we take advantage of this abundance of data to assess the impact of growing the training set on the effectiveness of scoring functions. We use the secondary training set Sc to build and test different machine-learning-based scoring models. Every model is trained on an increasing subset of Sc and validated on two independent test sets. Models are first fit to small datasets (starting with 178 instances), and over ten steps this number grows gradually to 1780 instances (an increase of 178 instances per step). At each training size, 206 instances are randomly drawn (without any overlap with the sampled training data) to constitute one of the test sets; the second test set is the core test set Cr. To obtain reliable results, this process is repeated 50 times for each training size, with the training data selected by resampling without replacement in each iteration. After averaging the correlation coefficients of the 50 generated scoring functions, we obtained the data shown in Figures 3.11 and 3.12.

Figure 3.11 Dependence of XAR-based scoring functions on the size of the training set: (a) core test set; (b) random test set.

Figure 3.12 Dependence of X-based scoring functions on the size of the training set: (a) core test set; (b) random test set.

Note that NF is not included in this experiment because of its long training time. The two panels of Figure 3.11 show the results when the datasets are characterized using XAR features, while the plots in Figure 3.12 are for complexes described by X features. In all four cases (two test sets and two characterization schemes), the performance of all models improves as the size of the training data increases. When the number of features is high (XAR = 72) and the training set is relatively small, some scoring functions, such as MLR and SVM, show poor accuracy. This is probably a consequence of overfitting, since the ratio of training-set size to number of features is small. The performance of MLR and SVM improves when the number of data patterns rises and/or lower-dimensional data is used (as shown in Figure 3.12). RF-, BRT-, and kNN-based scoring functions are almost always the leading models, regardless of the data dimensionality and the size of the training set.

It is evident that the scoring power of the machine-learning models is greater when they are tested on the core set. One might expect that testing scoring functions on randomly drawn instances would yield results similar to those obtained on the diverse test set Cr; however, comparing the curves in the top panels with those in the bottom panels shows that all scoring functions do a better job of predicting the binding affinities of the core-set complexes. This gap in performance can be attributed to the large variance of the binding energies of the protein-ligand complexes constituting the core test set. Recall that the core set was built with an emphasis on diversity in terms of protein families and on maximizing the range of binding constants. It is therefore easier for scoring functions to estimate the binding constants of very different protein-ligand structures whose binding constants are far apart. Such variation is not guaranteed for randomly drawn protein-ligand complexes, whose binding constants are therefore less distinctive.
Additionally, given the way the core set is constructed, there is at least one complex in Pr for each cluster represented in Cr, which implies that Pr contains training complexes relevant to the complexes in Cr.

Looking at the curves in Figure 3.11, one may argue that the scoring functions will keep improving because the ratio of data size to number of dimensions (features) is increasing, making the models less susceptible to overfitting. Although this argument seems valid, it is still worth investigating the behavior of scoring functions when the ratio of data size to dimension becomes very large. For that purpose, we carried out a related experiment on two datasets characterized by different numbers of features (ScX and ScXAR). We assume that differences in performance between predictive models derived from ScX and those derived from ScXAR result mainly from the difference in the number of features rather than from the feature types. This assumption is reasonable to some extent, since both datasets share the descriptors extracted with the X-Score tool; moreover, the focus here is on performance trends rather than on the absolute values of the correlation statistic.

As stated earlier, X-based models (RF::X, BRT::X, etc.) tend to improve as the training set grows. The SVM and MLR curves in Figure 3.12 show that these models no longer suffer from overfitting. Furthermore, the scoring power of the MLR model does not change significantly as the training set grows, while the improvement for the rest of the models is similar to that seen for the corresponding XAR-based scoring functions. According to the findings summarized in Figures 3.11 and 3.12, we can conclude that most machine-learning-based scoring functions have potential for improvement as more training data becomes available. That is the case whether the ratio of training-set size to number of dimensions is low (178/72 ≈ 2.47 to 1780/72 ≈ 24.72) or relatively high (178/6 ≈ 29.67 to 1780/6 ≈ 296.67).

Scoring-function designers can conduct similar experiments to estimate the accuracy gains to expect when their proposed functions are recalibrated on larger numbers of data instances. Take, for example, the BRT model. We can roughly project its scoring power on the core test set a few years from now: averaging its improvement slope over the last three increments in training size gives roughly a 0.01 increase in correlation for each additional 178 training patterns. Assuming that another 750 protein-ligand complexes will be deposited into PDBbind over the next three years, one can expect the scoring power of BRT::XAR to rise to about RP = 0.822 (from the current RP = 0.779, i.e., a 5% increase) by 2013. This may not seem a great leap, but such an enhancement would have considerable impact when the goal is to select only promising drug candidates from databases containing millions of drug-like molecules. The MLR::X model, which resembles empirical scoring functions, has very little potential for future improvement. This is due to the rigidity of linear models, whose performance tends to "saturate." Many of the 16 scoring functions compared with the machine-learning-based models are empirical, and they are therefore also likely to suffer from the same limitation. In fact, the best performing of those 16 in terms of scoring power, X-Score::HPScore, is very similar to MLR::X.
That is because both scoring functions use almost the same features and both are fit linearly to the same training data. Therefore, one should consider better prediction approaches to derive accurate models from the training data at hand and from future updates.

3.6 Dependence of Scoring Functions on the Number of Features

The binding affinity of a protein-ligand complex depends on many physicochemical interaction factors involving a protein and its ligand. Capturing all relevant factors is not possible, and calculating some of them is not a trivial task. Furthermore, quantifying these factors into values known as features or descriptors involves many assumptions and approximations. To minimize the effect of employing such imperfect features, we utilize more than one tool to calculate them. As detailed in Section 2.2, we considered three tools to extract features for the scoring functions: X-Score, AffiScore, and RF-Score. Each tool generates different features, and even the features shared by more than one tool are computed differently. Utilizing all the features provided by these tools (even the redundant ones) provides a fuller picture of the various protein-ligand interaction effects.

Table 3.5 below shows the correlation coefficients of machine-learning models trained on the primary dataset Pr; all resulting scoring functions were tested on the core test set Cr. The first three columns of the table show the performance when the protein-ligand complexes are described using X, A, and R features; the last column lists the performance when the complexes are characterized by all these features combined, i.e., XAR. Evidently, better results are obtained when the regression models are fit to datasets described by more numerous and diverse features. Similar results were seen when the impact of training-set size was investigated in the previous section (see Figures 3.11 and 3.12); in that case, the scoring functions were induced from the secondary dataset Sc. The difference in performance for all functions is very clear on both test sets (random and core).

For some machine-learning methods, the generalization error may increase dramatically when more dimensions are added to the data, because there is a higher chance that irrelevant information is taken into account. Noise associated with features only weakly related to the outcome variable can also degrade the models. Overfitting is another serious problem, more likely to arise when the training data is inadequate in size.

Table 3.5 Performance (RP) of several machine-learning-based scoring functions on a compound test set described by different features.

Method   X (X-Score)   A (AffiScore)   R (RF-Score)   XAR
BRT      0.707         0.772           0.741          0.797
NF       0.713         0.771           0.751          0.796
RF       0.732         0.755           0.763          0.794
SVM      0.672         0.703           0.733          0.768
kNN      0.685         0.701           0.707          0.732
MLR      0.631         0.663           0.59           0.672
MARS     0.63          0.669           0.63           0.628

We propose a simple experiment to investigate how our machine-learning models (except NF) behave when the data dimension is increased. The experiment is conducted on the secondary dataset characterized by XAR features. The R features in this experiment are calculated slightly differently than in the previous sections. Recall that the RF-Score tool extracts 36 features, each of which counts the occurrences of a specific protein-ligand atom pair within a predefined distance range. In the previous XAR sets, the distance range was set to 12 Å, whereas in this experiment we use several distance ranges, or bins.
Atom-pair statistics are counted over five contiguous four-angstrom bins, covering an overall radius of 5 × 4 = 20 Å; there are therefore 36 features emerging from each bin. By combining the new R features (36 × 5 = 180) with the XA features (36), we gather a feature pool of 180 + 36 = 216 features in total. The bin size and number are chosen arbitrarily; the quality of the features in this experiment matters less than their number, so long as they are relevant to the binding affinity. Note that the aim of this experiment is to investigate the impact of increasing data dimensionality on the performance of the different scoring models.

After generating the feature pool, we proceed to evaluate the machine-learning-based scoring functions on datasets with an increasing number of features. We start with k = 20 features randomly chosen from the feature pool. Next, the data (the complexes in Sc) is characterized by the selected features, and training and test datasets are randomly sampled from Sc without replacement and without overlap. For every random feature set and data sample, the scoring power of the scoring functions is evaluated, and ensembles of models trained on fifty bootstrap data samples of the same features are averaged. Likewise, another random feature set of the same size is drawn from the feature pool; this outer loop is also repeated fifty times. We then increase k by 20 and rerun the same steps, terminating when k reaches 200. The procedure is listed in more detail in Figure 3.13. Applying this algorithm to RF, BRT, SVM, kNN, MARS, and MLR produced the results depicted in Figure 3.14.

All performance statistics of the scoring functions rise relatively sharply when the data dimensionality increases from 20 to 40 features. The scoring power of most predictive models keeps improving as the number of features grows beyond 40, although at a lower rate than before. The performance of the SVM and MLR models, on the other hand, appears to degrade as the data dimension rises. We suspect that had the SVM parameters been tuned for every random set of features, the SVM model could have delivered better results; recall from Section 3.2.3 that SVM performance is very sensitive to the appropriate selection of its meta-parameters. However, tuning these parameters for each value of k is computationally expensive, and we therefore did not attempt to search for the optimal parameter set for every feature-set size. As for the MLR-based scoring function, the plot indicates that the larger the number of features, the more it overfits. Consequently, dimension reduction is necessary before building linear models when there is a plethora of features describing protein-ligand complexes.

1. Select and initialize the algorithm arguments:
   A. Let Γ be the machine-learning method to be tested.
   B. Let the set $C_k = \{\}$ be a buffer for the correlation-coefficient statistic for each value of k.
2. For k = 20 to 200, in increments of 20, do:
   A. For i = 1 to 50 do (counter for the feature samples drawn from the feature pool):
      i. $C_i = 0$ (accumulator for the correlation statistics calculated for each value of i).
      ii. $Fs_i^k = \mathrm{random\_wo\_replacement}(\mathrm{FeaturePool}, k)$: choose the i-th set of k random features from the feature pool.
      iii. For j = 1 to 50 do (counter for the models fit to the j training sets characterized by $Fs_i^k$):
         a. $C_j = 0$ (accumulator for the correlation statistics calculated for each value of j).
{Tr j , Ts j }  Bootstrap Sc, Fsik ,2 / 3) // Sample the j training and ( test sets. The size of Trj= 2/3 of Sc, Tsj is the remaining third. N N/ c. Tr j  { X r , Yr }21/ 3 Ts j  { X s , Ys }s 13 r d. rs f j ( X )  Fit (Tr j , ) // Fit the model fj to Trj using the fitting technique Γ.    N/ e. C j  C j  correlatio Ys s 13, f j ( X s ) N / 3 n s 1 iv. End For loop of Step 2.a.iii v. Ci = Ci + Cj/50. B. End For loop of Step 2.a. C. Ck  Ck  Ci / 50 3. End For loop of Step 2. 4. Return Ck.  Figure 3.13 Algorithm to assess the impact of data dimension on performance of scoring functions. 97 Correlation RP Figure 3.14 Data dimension impact on performance of scoring functions. 98 3.7 Family Specific Scoring Functions In the previous section, performance of scoring functions was assessed on a diverse test set Cr. The core set consists of more than sixty different protein families. One may ask how these scoring functions perform on homogeneous datasets. In other words, how accurate are scoring functions in predicting binding affinities of ligands that form a complex with a common protein family? To answer this question, four additional test sets were constructed. Protein-ligand complexes in each set share a common protein family. The protein families are HIV protease, trypsin, carbonic anhydrase, and thrombin. The findings are shown in Table 3.6, where the results are ordered based on Spearman correlation coefficient. Thus the emphasis here is on the ranking correlation between observed binding scores and predicted ones. From the data in the table, however, we observe that scoring functions whose ranking correlation is high, their linear correlation is usually high as well, and vice versa. It is interesting to see that scoring functions that perform well in case of predicting diverse protein-ligand complexes, are also among the best achievers in predicting binding scores of certain protein families. Scoring functions derived using BRT, RF, NF, SVM, and kNN approaches were superior in accurately predicting binding affinities for the heterogeneous test set. In the case of specific family protein-ligand complexes, these models’ predictions are even more accurate. Across the four additional test sets there is no single model that always preserves remains on top. However, the top four machine-learning models are always the same across the four sets. The correlation coefficients of BRT, RF, NF, SVM, and kNN models approach excellent values (RS and RP = 0.91 to 0.98) with relatively low error variance (SD < 0.85 in -log Kd units given that the range of the binding affinities is from 1.01 to 12.1). 99 Table 3.6 Ranking accuracy of scoring functions on the four additional tests sets. The test samples overlap the training data. 
HIV protease (N = 112)
Scoring function            RS     RP     SD
BRT::AR                     0.980  0.973  0.580
RF::XA                      0.976  0.960  0.393
SVM::A                      0.963  0.955  0.458
NF::XA                      0.920  0.921  0.577
KNN::XA                     0.867  0.890  0.508
MARS::A                     0.529  0.584  0.840
MLR::R                      0.458  0.413  0.926
X-Score::HPScore            0.339  0.341  0.990
SYBYL::ChemScore            0.228  0.276  0.945
DS::PMF04                   0.200  0.183  1.000
DrugScorePDB::PairSurf      0.170  0.225  1.090

Trypsin (N = 73)
Scoring function            RS     RP     SD
SVM::A                      0.974  0.954  0.509
RF::XA                      0.940  0.958  0.484
BRT::AR                     0.934  0.946  0.548
KNN::XA                     0.927  0.943  0.563
NF::XA                      0.918  0.935  0.598
X-Score::HSScore            0.824  0.817  0.970
MLR::X                      0.809  0.803  1.007
MARS::XR                    0.808  0.820  0.967
DS::Ludi2                   0.791  0.823  0.960
DS::PLP2                    0.774  0.797  1.020
SYBYL::ChemScore            0.773  0.829  0.950

Carbonic anhydrase (N = 44)
Scoring function            RS     RP     SD
SVM::A                      0.929  0.909  0.376
BRT::AR                     0.919  0.959  0.457
RF::A                       0.910  0.944  0.487
KNN::XA                     0.910  0.910  0.638
NF::XAR                     0.889  0.931  0.748
DS::PLP2                    0.772  0.800  1.331
MLR::XAR                    0.709  0.747  1.493
SYBYL::G-Score              0.646  0.706  1.540
MARS::AR                    0.645  0.734  1.580
SYBYL::ChemScore            0.631  0.699  1.610
SYBYL::PMF-Score            0.618  0.627  1.600

Thrombin (N = 38)
Scoring function            RS     RP     SD
BRT::AR                     0.977  0.953  0.639
RF::XA                      0.972  0.952  0.647
SVM::A                      0.946  0.920  0.831
NF::XAR                     0.943  0.921  0.825
KNN::XA                     0.939  0.918  0.839
MARS::XR                    0.733  0.784  1.317
DS::PLP1                    0.672  0.692  1.530
MLR::R                      0.653  0.699  1.515
SYBYL::G-Score              0.626  0.667  1.580
DrugScoreCSD::Pair          0.622  0.586  0.651
X-Score::HSScore            0.586  0.666  1.580

Clearly, the sophisticated machine-learning-based scoring functions exhibit high performance compared with the less flexible models. This is obvious in the case of HIV protease, where the BRT::AR scoring function delivers accurate estimates (RS = 0.98) as opposed to the poor correlation value (RS = 0.339) attained by X-Score::HPScore. This gap in performance is realized despite the fact that both models were trained on the same dataset (Pr) with all proteins of the four families included. Complete overlap between the family-specific datasets and the Pr set cannot be verified for the other 15 scoring functions.

Table 3.7 Ranking accuracy of scoring functions on the four additional test sets. The test samples are independent of the training data.
HIV protease (N = 112)
Scoring function            RS     RP     SD
MARS::X                     0.484  0.458  1.250
MLR::XR                     0.398  0.365  0.944
NF::AR                      0.373  0.422  0.990
X-Score::HPScore            0.339  0.341  1.056
BRT::A                      0.331  0.415  0.972
RF::A                       0.310  0.392  1.000
SYBYL::ChemScore            0.228  0.276  1.090
KNN::X                      0.204  0.247  1.158
SVM::X                      0.200  0.267  0.970
DS::PMF04                   0.200  0.183  1.214
DrugScorePDB::PairSurf      0.170  0.225  1.117

Trypsin (N = 73)
Scoring function            RS     RP     SD
X-Score::HSScore            0.824  0.817  0.970
MLR::X                      0.807  0.801  1.012
DS::Ludi2                   0.791  0.823  0.960
MARS::A                     0.784  0.739  1.138
DS::PLP2                    0.774  0.797  1.020
SYBYL::ChemScore            0.773  0.829  0.950
RF::XA                      0.771  0.794  1.027
KNN::XAR                    0.769  0.792  1.031
NF::AR                      0.767  0.763  1.045
BRT::XA                     0.766  0.780  1.058
SVM::XA                     0.737  0.787  1.044

Carbonic anhydrase (N = 44)
Scoring function            RS     RP     SD
DS::PLP2                    0.772  0.800  1.457
BRT::A                      0.698  0.735  1.526
SYBYL::G-Score              0.646  0.706  1.486
RF::XA                      0.632  0.652  1.540
MLR::XR                     0.631  0.716  1.491
SYBYL::ChemScore            0.631  0.699  1.508
SYBYL::PMF-Score            0.618  0.627  1.580
MARS::XAR                   0.566  0.555  1.588
KNN::XA                     0.531  0.717  1.579
SVM::AR                     0.426  0.490  1.610
NF::AR                      0.421  0.597  1.600

Thrombin (N = 38)
Scoring function            RS     RP     SD
MARS::RF                    0.721  0.667  1.580
KNN::X                      0.711  0.737  1.432
RF::XA                      0.704  0.730  1.448
NF::X                       0.696  0.749  1.404
DS::PLP1                    0.672  0.692  1.530
SVM::X                      0.667  0.720  1.471
BRT::XR                     0.656  0.718  1.475
MLR::R                      0.640  0.676  1.561
SYBYL::G-Score              0.626  0.667  1.580
DrugScoreCSD::Pair          0.622  0.651  1.610
X-Score::HSScore            0.586  0.666  1.580

In addition to their low generalization error, the BRT, RF, NF, SVM, and kNN models are known to predict well for the data patterns on which they are trained, perhaps because of their high degree of nonlinearity and flexibility. Linear models, on the other hand, are rigid and tend to be less responsive to the presence of particular data patterns. MARS combines characteristics of both families of models: its nonlinearity is achieved by combining several linear models in a piecewise fashion.

We retrained the BRT, RF, NF, SVM, kNN, MARS, and MLR models on the primary training dataset after removing the complexes whose proteins belong to the four protein families under consideration. More specifically, for scoring functions to be evaluated on a particular protein family, all complexes of that family were filtered out of the Pr set before training. Table 3.7 shows the performance statistics of the resulting scoring functions on the four protein-family datasets. RF, NF, BRT, SVM, and kNN are now inferior to the linear and MARS-based scoring functions. In fact, the correlation coefficients of almost all machine-learning models drop drastically when they are re-evaluated on disjoint datasets. This is particularly true for HIV protease, where none of the scoring functions reaches a value of 0.5 for either RP or RS. Such poor performance can perhaps be attributed to the difficulty of sampling conformational changes between HIV protease and its ligands [32]; besides, scoring functions are poor at accounting for the enthalpic and entropic factors involved in binding. By contrast, the scoring functions predicted binding energies more accurately for the other three protein-family datasets, where the correlation coefficients are not far from those obtained on the diverse test set.

What if the purpose of a screening campaign were to identify new ligands for HIV protease or to conduct lead optimization? We saw how scoring-function accuracy suffers when there are no known binders of this protein in the training set.
How much can the predictive power of the scoring functions change if we recalibrate them to take into account several ligands that are known to bind HIV protease? Figure 3.15 answers this question for different numbers of known binders, expressed as a percentage of the total number of complexes in the HIV protease dataset (112). For instance, 20% means that 22 protein-ligand complexes from the HIV protease dataset were added to the primary training set PrX. After reconstructing the scoring functions on the updated version of Pr, they are evaluated on the remaining 80% of the HIV data (i.e., 90 complexes). For each percentage (0% to 80% in steps of 5%), this procedure was repeated 50 times to obtain reliable results.

Figure 3.15 Dependence of scoring functions on the number of known binders to HIV protease in the training set.

When the training dataset includes none to very few HIV protease complexes (roughly fewer than 10), the curves in the plot are consistent, to some extent, with the data presented in Table 3.7: the MLR and MARS models outperform the other scoring functions. As the number of HIV complexes in the training set increases, the accuracy of both models shows no clear response to the change in training data. On the other hand, the RF, SVM, and kNN models gain sharply in performance as HIV protease patterns become more and more prevalent in the training data. For the BRT::X scoring function, the improvement is less pronounced, which is surprising given its high performance on the other test sets. It should be noted that some of the machine-learning-based scoring functions reported in Table 3.7 were fitted to datasets characterized by different feature types, whereas the curves plotted in Figure 3.15 show the performance of functions derived from the complex set Pr characterized by the X features only. Curves obtained on other feature sets are not very different (not reported in this text).

The analysis of the effectiveness of different regression approaches on specific protein families has very useful practical advantages in virtual screening and drug discovery. Typically, the task in drug design is to computationally screen large numbers of ligands against a certain protein family. Knowing the number of known binders to that protein present in the training set, and having a characteristic plot for the scoring functions (similar to the one in Figure 3.15), could help specialists choose the right scoring function.

3.8 Performance of Scoring Functions on Novel Targets

In Section 3.4, we investigated the ability of different scoring functions to predict the binding affinities of the diverse protein-ligand complexes that constitute the core test set Cr. This test set was constructed by Cheng et al. by including three representative protein-ligand complexes from every cluster containing at least four complexes in the 2007 PDBbind refined set, with clusters defined by BLAST sequence similarity at a 90% cutoff [32]. The complexes in each cluster typically consist of those formed by a particular type of protein. The primary training set Pr comprises the refined-set complexes not included in Cr, thus ensuring that no complex is common to both Pr and Cr.
The way the training-test set pair (Pr, Cr) is formed ensures that: (1) Pr has complexes relevant to the test set Cr, since Pr contains at least one complex from each cluster represented in Cr; (2) Cr has maximal diversity, in terms of both the protein and ligand molecules in a complex, and minimal redundancy: all clusters with four or more complexes are represented by only three complexes per cluster (or type of protein), the binding constants of the complexes span twelve orders of magnitude, and the ligand molecular weights and numbers of rotatable bonds range from 103 to 974 and from 0 to 32, respectively; and (3) Cr is suitable for assessing not only the scoring power of scoring functions but also their ranking power, since it contains three complexes from each cluster (or type of protein).

The training-test set pair (Pr, Cr) (or, similarly, (Sc, Cr)) is a useful benchmark when the aim is to evaluate the performance of scoring functions on targets that have some degree of sequence similarity to at least one protein present in the training-set complexes. This is typically the case in practice, since it has been observed that more than 92% of today's drug targets are similar to known proteins in the PDB [108], high-quality complexes from which are included in the PDBbind refined set [61].

When the goal is to assess scoring functions in the context of novel protein targets, however, (Pr, Cr) (or, similarly, (Sc, Cr)) is less suitable, since the training set has at least one complex whose protein is represented in the core set. We considered this issue to some extent in the previous section (see Table 3.7), where we investigated the performance of scoring functions on four protein-specific test sets after training them on complexes that excluded the protein under consideration. This resulted in a drop in performance for all scoring functions, especially in the case of HIV protease as the target. However, even if there are no proteins common to the training- and test-set complexes, different proteins may still show sequence and structural similarity at their binding sites, which influences protein-ligand binding affinity. To assess scoring-function performance on novel targets more rigorously and systematically, we performed a separate set of experiments in which we limited the BLAST sequence similarity between the binding sites of the proteins present in the training- and test-set complexes. Specifically, for each similarity cutoff S = 30%, 40%, 50%, ..., 100%, we constructed 100 different independent pairs of a 150-complex test set and a T-complex training set, trained the scoring models (MLR, MARS, kNN, SVM, BRT, and RF) using two different feature sets (XAR and X features) on the training set, evaluated them on the corresponding test set, and averaged their performance over the 100 rounds to obtain robust results. Since the prediction performance of a scoring function depends on both the similarity cutoff and the training-set size, and since the training-set size is constrained by the similarity cutoff (a larger S permits a larger feasible T), we investigated different ranges of S (30% to 100%, 40% to 100%, 50% to 100%, etc.) and, for each range, set T close to the largest value feasible for the smallest S in that range. Each test-training set pair was constructed as follows. We randomly sampled a test set of 150 protein-ligand complexes, without replacement, from all complexes at our disposal: 1986 in Sc + 192 in Cr = 2178 complexes.
The remaining 2028 complexes were randomly scanned until T different complexes were found whose protein binding sites had a similarity of S% or less with the protein binding sites of all complexes in the test set; if fewer than T such complexes were found, the process was repeated with a new 150-complex test set.

The accuracy of the scoring models in terms of Pearson correlation coefficient is depicted in Figure 3.16 for a similarity-cutoff range from 30% to 100% and a training-set size of 120 complexes, which was about the largest training-set size feasible for S = 30%. The upper plot shows the results when the protein-ligand complexes are described by XAR features, and the bottom plot when they are characterized by the X-Score (X) features alone. When XAR features are used, RF and kNN perform best across the entire range of similarity cutoffs; when only the six X features are used, MLR, kNN, RF, and BRT perform best. In both feature-set cases, no scoring function is able to predict the measured binding affinity with good accuracy (RP < 0.5 at a similarity cutoff of 30%). This can be attributed to three factors. First is the significant sequence, and hence structural, dissimilarity between the binding sites of the proteins in the training and test complexes. Second is the size of the training data; we saw in Section 3.5 how strongly training-set size influences the quality of scoring models. Finally, the scoring-model parameters used for this experiment were tuned with respect to Pr, as described in Section 3.2, rather than with respect to the training set used here, which is not only small but also contains a different mix of complexes governed by the small 30% similarity cutoff. This is evident in Figure 3.16 from the inferior performance of several scoring functions (MLR, SVM, and, to a limited extent, BRT) when XAR features are used instead of X features (a subset of the XAR features), signifying overfitting or suboptimal parameter values. The reason the RF model performs relatively well compared with the other generally competitive machine-learning models for both XAR and X features is most likely that it is less sensitive to parameter tuning (see Section 3.2.4). Because tuning the parameters of multiple scoring models for the many training scenarios considered in this work is time-intensive, we used the parameters tuned with respect to Pr as explained in Section 3.2.

Figure 3.16 Performance of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes: (a) complexes characterized using XAR features; (b) complexes characterized using X features.

Figure 3.16 provides only a partial portrait of how sensitive scoring-model accuracy is to the similarity cutoff, because of the small training dataset (only 120 complexes) used. To capture this sensitivity more fully, we need larger training datasets. We therefore determined the performance of the scoring models for a number of similarity-cutoff ranges (30% to 100%, 40% to 100%, etc., with increasingly larger training datasets); these results are tabulated in Table 3.8. For a given similarity cutoff, not only does the scoring performance of the models improve with the size of the training dataset, but the overfitting/parameter-tuning problem is also alleviated or minimized.
For example, for a similarity cutoff of 50%, BRT::XAR has inferior performance relative to BRT::X when the training dataset has 120 complexes (0.46 vs. 0.49), but the opposite is true when the training set has 520 complexes (0.54 vs. 0.50). Also, the performance of RF and the other generally competitive machine-learning models (BRT, kNN, and SVM) becomes comparable as the training dataset grows, signifying that parameter tuning becomes less of an issue. Based on the data in Table 3.8, we plot in Figure 3.17 the percentage improvement in the Pearson correlation coefficient of RF::XAR, relative to that for the smallest training dataset (120 complexes), as the size of the training set increases. The reason for focusing on RF::XAR is that it is not only among the best scoring functions but also, as noted earlier, less sensitive to parameter tuning, and it performs better relative to the other scoring functions when the entire range of similarity cutoffs is considered. As expected, for any given similarity cutoff, the accuracy of RF::XAR improves with the size of the training set. Moreover, increasing the training-set size helps accuracy somewhat more at larger than at smaller similarity cutoffs, although continuing gains in accuracy are evident for all similarity cutoffs at the largest training-set sizes considered.

Table 3.8 Performance (RP) of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes, for different cutoff ranges. For each row, the first six columns are for XAR features and the last six for X features.

              XAR features                          X features
Cutoff %   MLR   MARS  kNN   SVM   BRT   RF      MLR   MARS  kNN   SVM   BRT   RF

Training set size = 120
30         0.11  0.38  0.44  0.24  0.38  0.48    0.49  0.30  0.47  0.35  0.45  0.46
40         0.12  0.37  0.47  0.21  0.41  0.49    0.48  0.29  0.49  0.36  0.46  0.46
50         0.14  0.42  0.51  0.17  0.46  0.51    0.51  0.38  0.51  0.38  0.49  0.49
60         0.15  0.43  0.50  0.20  0.46  0.51    0.51  0.37  0.51  0.41  0.49  0.50
70         0.21  0.46  0.52  0.22  0.49  0.52    0.52  0.43  0.52  0.42  0.50  0.51
80         0.19  0.45  0.53  0.24  0.50  0.53    0.52  0.41  0.52  0.42  0.50  0.51
90         0.23  0.48  0.55  0.30  0.53  0.55    0.51  0.40  0.51  0.44  0.51  0.52
100        0.30  0.49  0.58  0.32  0.57  0.59    0.53  0.44  0.55  0.49  0.53  0.55

Training set size = 220
40         0.22  0.39  0.47  0.35  0.46  0.51    0.50  0.39  0.49  0.40  0.47  0.48
50         0.26  0.45  0.51  0.34  0.50  0.53    0.53  0.44  0.51  0.42  0.50  0.51
60         0.28  0.46  0.50  0.35  0.49  0.52    0.52  0.41  0.51  0.43  0.49  0.51
70         0.33  0.47  0.53  0.40  0.52  0.54    0.53  0.46  0.52  0.44  0.51  0.52
80         0.35  0.49  0.55  0.38  0.54  0.56    0.53  0.46  0.52  0.45  0.51  0.53
90         0.36  0.50  0.57  0.46  0.56  0.58    0.52  0.44  0.53  0.48  0.52  0.54
100        0.42  0.54  0.61  0.50  0.61  0.62    0.54  0.49  0.57  0.53  0.55  0.57

Training set size = 520
50         0.36  0.48  0.52  0.53  0.54  0.55    0.53  0.46  0.51  0.44  0.50  0.52
60         0.38  0.48  0.52  0.53  0.54  0.55    0.53  0.48  0.52  0.46  0.51  0.53
70         0.42  0.51  0.55  0.55  0.56  0.57    0.54  0.50  0.52  0.47  0.52  0.54
80         0.46  0.52  0.58  0.57  0.59  0.59    0.54  0.52  0.54  0.49  0.53  0.55
90         0.49  0.54  0.61  0.60  0.61  0.62    0.53  0.53  0.57  0.53  0.56  0.57
100        0.55  0.58  0.65  0.64  0.66  0.67    0.55  0.54  0.61  0.58  0.60  0.61

Training set size = 720
60         0.42  0.50  0.53  0.54  0.55  0.56    0.53  0.50  0.52  0.46  0.52  0.53
70         0.46  0.52  0.56  0.57  0.57  0.58    0.54  0.53  0.53  0.49  0.53  0.55
80         0.49  0.53  0.59  0.59  0.60  0.60    0.54  0.54  0.55  0.50  0.54  0.56
Table 3.8 (cont'd) Corresponding results for the larger training set sizes (size 720: similarity cutoffs of 90 and 100%; size 1020: 70-100%; size 1220: 80-100%; size 1620: 90-100%; size 2028: 100%).

Figure 3.17 Sensitivity of RF::XAR scoring model accuracy to size of training dataset and BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes. The percentage improvement in RP relative to RP for the smallest training dataset (120 complexes) is plotted against training set size (120 to 2028 complexes), with one curve per similarity cutoff (30% to 100%).

To summarize, imposing a sequence similarity cutoff between the binding sites of proteins in training and test set complexes has an expected adverse impact on the accuracy of all scoring models. However, increasing the number of training complexes helps improve accuracy at all similarity cutoffs. RF::XAR has the best accuracy over the entire range of similarity cutoffs. The other generally competitive machine-learning models (BRT, kNN, and SVM) may also provide comparable accuracy if parameter tuning is performed with respect to the training set being considered. The fact that only small training sets could be obtained from the 2007 and 2010 PDBbind refined sets (which are representative of the PDB) at low similarity cutoffs reflects the observation that real-world drug targets are most often similar to proteins already in the PDB [108].

3.9 Interpreting Scoring Models

Although predictive power determines the success or failure of screening campaigns, it is not the only characteristic of a scoring function we wish to improve. In some cases, the descriptive power of a model is as important as its accuracy. Self-explanatory models enable drug designers, for instance, to gain deeper insights into the interactions between a given protein and its ligand. The ensemble methods (BRT, RF, and NF) we employ in this study provide excellent tools for deciphering much of the relationship between protein-ligand interactions and binding affinity. As pointed out in Section 2.5.2, these methods compute the relative importance of each variable and its marginal effect on the response. In the following subsections we obtain this descriptive information using the BRT and RF methods.

3.9.1 Relative Importance of Protein-Ligand Interactions on Binding Affinity

Since BRT and RF employ different schemes to calculate relative variable relevance, we show the results of both techniques.
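To make this concrete, the kind of descriptive output used in this section can be obtained directly from fitted RF and BRT models with the randomForest and gbm R packages employed in this work. The sketch below is only illustrative: the data frame train, its response column pKd, the feature name "C.C", and the parameter values are placeholders rather than the exact settings of Table 3.1.

    ## Illustrative R sketch: relative feature importance and marginal (partial
    ## dependence) plots from RF and BRT models. The data frame `train`, its
    ## response column `pKd`, and the feature name "C.C" are placeholders.
    library(randomForest)
    library(gbm)
    set.seed(1)

    ## Random forest: permutation-based importance (%IncMSE for regression).
    rf <- randomForest(pKd ~ ., data = train, ntree = 1000, importance = TRUE)
    rf.imp <- importance(rf, type = 1)
    sort(rf.imp[, 1], decreasing = TRUE)[1:10]   # ten most important features
    varImpPlot(rf, n.var = 10)

    ## Boosted regression trees: relative influence accumulated over all splits.
    ## (Placeholder settings; the thesis uses the tuned values of Table 3.1.)
    brt <- gbm(pKd ~ ., data = train, distribution = "gaussian",
               n.trees = 2000, interaction.depth = 4, shrinkage = 0.01)
    summary(brt, cBars = 10)                     # bar chart of relative influence

    ## Marginal effect of a single feature, averaging out the others (Eq. 2.18).
    partialPlot(rf, pred.data = train, x.var = "C.C")
    plot(brt, i.var = "C.C", n.trees = 2000)

Note that the two packages quantify importance differently (permutation-based error increase for RF versus split-gain relative influence for BRT), which is why the magnitudes, though often not the rankings, differ.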
Out of the 72 features of the ScXAR dataset, we identify the ten most important features as well as the five least relevant ones. Figure 3.18 shows the ten most important features as estimated by both the RF (red, top) and BRT (blue, bottom) techniques. It is clear that the number of hydrophobic contacts has the strongest effect on binding affinity. It is interesting that this feature (the number of protein-ligand carbon-carbon contacts) is identified by both methods as the largest contributor to the change in binding affinity, even though BRT and RF estimate this contribution differently. In fact, the two methods overlap in four other features, which include the van der Waals interactions between the protein and the ligand, the hydrophilicity of all protein interfacial atoms, and two hydrophobicity-related features.

Figure 3.18 The ten most important features identified by the RF and BRT algorithms. Refer to Table 2.1 for feature descriptions.

Even more interestingly, the five least contributing features identified by BRT and RF are almost identical. As shown in Figure 3.19, the number of contacts between protein atoms (C, N, O, and S) and iodine atoms in ligands has no effect (0% relative importance) on the calculated binding affinity.

Figure 3.19 The five least important features identified by the RF and BRT algorithms. Refer to Table 2.1 for feature descriptions.

3.9.2 Partial Dependence Plots

Here we use the BRT tool to produce partial dependence plots for the most relevant features identified in the previous section. Figure 3.20 shows the marginal plots of the four most important features that overlap between the RF and BRT rankings. For each variable, the effect of the other 71 features was averaged out according to Equation 2.18. It is apparent that the binding affinity increases with the number of hydrophobic contacts and with the van der Waals interactions between proteins and ligands.

Figure 3.20 Marginal dependence plots of the four most relevant features: C.C, vdw_interactions, total_target_hydro, and hphob_b.

On the other hand, the marginal effect of the hydrophilic interfacial atoms of proteins is less clear. The binding affinity sharply declines when the hydrophilicity increases from 10. Once it reaches 15, the binding affinity rises almost linearly, peaking at a hydrophilicity of 40 before declining again.

The marginal plots of the least influential features corroborate the results obtained in the previous subsection. Figure 3.21 shows that no feature among the four least important ones has any effect on the estimated binding affinity.

Figure 3.21 Marginal dependence plots of the four least relevant features.

The question may arise: how much does the performance of the RF::XAR scoring function change when only the most important features are taken into account? To answer this question, we built 71 versions of the RF::XAR scoring function. The first version considered only the two most contributing features (the number of protein-ligand carbon-carbon contacts and the total hydrophilicity of all protein interfacial atoms). The three most important input features were accounted for in the second version, and we proceeded in this fashion until all 72 features were considered in the seventy-first scoring function (i.e., regular RF::XAR); a sketch of this procedure is given below.
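The following minimal R sketch illustrates this incremental-feature experiment. It assumes, purely for illustration, a training frame train (PrXAR), a test frame test (CrXAR), a response column pKd, and a vector imp.order holding the 72 feature names sorted by decreasing RF importance; these names and the parameter values are placeholders.

    ## Illustrative sketch: refit RF::XAR on the top-k features (k = 2..72)
    ## and record Pearson's RP on the core test set for each version.
    library(randomForest)
    set.seed(1)

    rp <- numeric(71)
    for (k in 2:72) {
      feats <- imp.order[1:k]                        # k most important features
      fit   <- randomForest(x = train[, feats], y = train$pKd, ntree = 1000)
      pred  <- predict(fit, newdata = test[, feats])
      rp[k - 1] <- cor(pred, test$pKd)               # Pearson correlation RP
    }
    plot(2:72, rp, type = "b",
         xlab = "Number of most important features used",
         ylab = expression(R[P]))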
Each resulting version of the scoring function was evaluated on the core test set CrXAR, and the corresponding Pearson correlation coefficient between the predicted and measured binding affinities was recorded. These statistics are represented by the blue curve in Figure 3.22. The red curve shows the relative influence of the least important feature taken into account in building each of the seventy-one versions. It is clear from the plots that adding more features to RF models renders them more accurate. The improvement is very noticeable in the early versions, where the performance (in terms of RP) increases by more than 20% when eighteen more features are added to the model that accounts for just the two most important ones. The improving trend still holds (but with a smaller slope) even when the least influential features (those with virtually zero influence) are included in the later RF::XAR versions. Since accounting for such features does not harm the scoring function's performance, we recommend keeping them as long as they are relevant to estimating the binding affinity. In fact, these features may contribute to the overall binding affinity through interactions with other features. We cannot elucidate the interaction effects between the input features (if there are any) from the results depicted in Figure 3.20 and Figure 3.21. Instead, techniques other than those discussed in Section 2.5.2 can be used to analyze and visualize the interplay between the input variables. Such techniques will be investigated in future work.

Figure 3.22 The RF::XAR scoring model's performance (correlation RP) as a function of the number of most important features included.

3.10 Scoring Throughput

Once scoring is isolated from docking, a scoring function's throughput depends on two factors. The first, which has the largest contribution, is feature extraction. The time required to calculate the features typically used by empirical scoring functions is the bottleneck of scoring speed. This conclusion comes from an experiment we conducted whose results are listed in Table 3.9. In this experiment, we recorded the average time taken to calculate the features of each protein-ligand complex in the primary training set Pr and the core test set Cr (1278 complexes). The extraction tools X-Score v1.2, SLIDE v3.3.5, and RF-Score were executed on a 2.53 GHz PC running Ubuntu Linux 10.04. Clearly, X-Score is the slowest: it spends 27 seconds on average to compute its six features (listed in Table 2.1) for a protein-ligand complex. It is about 77 times slower than AffiScore, which also utilizes empirical energy terms.

Table 3.9 Feature extraction time (in seconds) and performance (RP) of some prediction techniques fitted to these features.
Features             Time in sec.   MLR     RF      BRT     NF
X-Score              27             0.644   0.727   0.706   0.751
AffiScore (raw)      0.35           0.663   0.755   0.770   0.771
AffiScore (subset)   0.35           0.563   0.708   0.689   0.766
RF-Score             0.18           0.590   0.763   0.741   0.751

Interestingly, AffiScore is not only faster; scoring functions fitted to its features are also more accurate. That is true, however, only if the raw features computed by AffiScore are used to construct the predictive model. The linear regression model fitted to PrA (i.e., the primary training complexes described by AffiScore raw features) achieves an accuracy of 0.663 in terms of the Pearson correlation coefficient (RP). The original implementation of the AffiScore scoring function, which uses only a subset of the raw features, yields an accuracy of just RP = 0.563.
This latter scoring function is inferior to X-Score v1.2, whose correlation RP is 0.644. AffiScore (raw) features are better than their X-Score counterparts not only for linear models, but also for the best-performing ML techniques, as shown in the last three columns of Table 3.9. As pointed out in Section 2.2, the RF-Score tool employs the simplest algorithm to generate features. That explains why RF-Score's 36 features were the fastest to calculate: generating them took about half the time required for the 30 features of AffiScore. Performance-wise, RF-Score-based models lag slightly behind those fitted to AffiScore raw features (except in the case of RF), as shown in Table 3.9 above.

The second contributor to scoring throughput is prediction time. The generated features discussed here and in Section 2.2 are merely fed to a prediction machine to calculate the final binding affinity. This computation is practically instantaneous for almost all the machine-learning tools we investigate in this study. The third column of Table 3.10 shows the time each method spent predicting the binding affinities of the protein-ligand complexes in the CrXAR test set. The only method with a nonzero prediction time is kNN. As discussed in Section 2.5.1, this can be attributed to the "lazy learning" that kNN performs during prediction. kNN's time (0.6 seconds) covers predictions for all 192 complexes in CrXAR and would be almost zero if a per-complex time were recorded instead.

Unlike prediction time, the training time of some prediction schemes can be quite long. The second column in Table 3.10 lists the time elapsed (in seconds) to train each method on the primary training set PrXAR (1086 complexes and 72 features). The optimal parameters listed in Table 3.1 were used to construct the machine-learning models. The numbers show that ensemble techniques are relatively slow to learn the unknown function. NF is the slowest; in fact, BRT is about 180 times faster than NF. This is understandable since each of the 1000 neural networks grown in an NF model requires hundreds of epochs for its weights to converge. This time, however, can be reduced significantly if the NF algorithm is slightly modified to run on a multiprocessor system; recall that each network in NF is completely independent of the rest of the committee. This parallelization is also applicable to RF models, whose learning time is already relatively small. For the other techniques, we notice that MARS is not as rapid as SVM and MLR. It could be even slower if more exhaustive approaches were used during the addition/elimination of the model's terms in the forward/backward passes. Again, kNN does not involve a training phase, so its training time is not reported in the table.

Table 3.10 Training and prediction times of different scoring techniques.
Method   Training Time (in secs.) - PrXAR   Prediction Time (in secs.) - CrXAR
NF       9000                               0.0
BRT      48.7                               0.0
RF       16.7                               0.0
MARS     2.2                                0.0
kNN      -                                  0.6
SVM      0.9                                0.0
MLR      0.1                                0.0

In summary, training a scoring function is not an issue in virtual screening since scoring functions are typically constructed offline. The online part, on the other hand, is mixed: the time spent in prediction is almost zero, but feature extraction is the main factor determining scoring speed. Therefore, we recommend using only features that can be rapidly calculated, such as those provided by SLIDE (AffiScore - raw) and RF-Score.
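For completeness, timings of this kind can be collected in R with the system.time wrapper around model fitting and prediction. The sketch below is illustrative only; the data frames train (PrXAR) and test (CrXAR), the response column pKd, and the model settings are placeholders rather than the exact configuration used for Table 3.10.

    ## Illustrative sketch of how training and prediction times can be measured.
    library(randomForest)
    library(gbm)

    t.rf   <- system.time(rf  <- randomForest(pKd ~ ., data = train, ntree = 1000))
    t.brt  <- system.time(brt <- gbm(pKd ~ ., data = train,
                                     distribution = "gaussian", n.trees = 2000,
                                     interaction.depth = 4, shrinkage = 0.01))
    t.pred <- system.time(p   <- predict(rf, newdata = test))

    c(train.RF   = t.rf["elapsed"],      # training times in seconds
      train.BRT  = t.brt["elapsed"],
      predict.RF = t.pred["elapsed"])    # near zero for the whole test set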
It turns out that there is little difference in performance between the scoring functions BRT::XAR (i.e., X-Score features included) and BRT::AR (only AffiScore and RF-Score features considered): RP = 0.797 for BRT::XAR versus 0.795 for BRT::AR.

CHAPTER 4
Concluding Remarks and Future Work

4.1 Thesis Summary

The aim of this thesis was to empirically compare machine-learning-based scoring functions to their conventional counterparts. The machine-learning schemes investigated were random forests (RF), neural forests (NF), boosted regression trees (BRT), support vector machines (SVM), k-nearest neighbors (kNN), multivariate adaptive regression splines (MARS), and multiple linear regression (MLR). A diverse database of protein-ligand complexes was used, and a diverse set of features was extracted to characterize each complex in this database. The machine-learning models were then trained on these data and their predictive ability was compared to that of 16 state-of-the-art scoring functions. We next discuss our main findings and some areas for future work.

4.2 Findings and Suggestions

We found that ensemble prediction methods generally outperform the others on many performance metrics. If the task is to score or rank ligands for almost any protein (except HIV protease), then ensemble-based scoring functions should be employed. When large numbers of features are available and need to be exploited, ensemble models are again the most suitable, owing to their excellent immunity to overfitting. If the collected features are not noisy and are adequately relevant to binding affinity, then kNN-based scoring functions can be considered as well. SVM can also perform well, as its performance in other experiments shows, provided its parameters are carefully tuned.

Empirical- and MARS-based scoring functions, as well as RF and kNN, may be used when the target is a protein that is not present in the training dataset used to build the scoring model. In such scenarios, simple models can compete with sophisticated methods. This was evident when we attempted to predict the binding affinities of ligands interacting with HIV protease, trypsin, carbonic anhydrase, and thrombin (see Table 3.7). For novel targets whose binding sites have low sequence similarity to the binding sites of proteins in the training complexes, RF::XAR, kNN::XAR, and MLR::X were found to be the best choices (see Table 3.8).

Ensemble techniques were found to be not only accurate prediction methods but also excellent descriptive tools. Variable importance calculation and marginal plot generation are valuable visualization utilities. These capabilities turn BRT, RF, and NF from black-box tools into methods that enable specialists to decipher the influence of different protein-ligand interactions on the modeled binding energies. Computationally, ensemble techniques were shown to be as fast as other predictive methods in calculating binding affinity. Training ensemble-based scoring functions is, however, slower than training their single-model counterparts.

Our findings agree with previous work in that there is no single scoring function that consistently outperforms all others in every scoring aspect. However, we can conclude that BRT, RF, and NF are the best performers on the whole. Even in scenarios where these ensemble models do not take the lead, their performance statistics are not far from those of the top scoring functions.
4.3 Looking Ahead

Based on the experiments conducted in this work, machine-learning-based algorithms have proven very effective at predicting protein-ligand binding affinity. Accuracy-wise, BRT, RF, and NF are the best performers on average, and their descriptive power is also valuable in virtual screening.

Another powerful ensemble method that could be explored in the future is rule ensembles. Rule-based models are very intuitive to interpret. In addition, a study conducted by Friedman and colleagues suggests that the rule ensembles model surpasses other competitive ensemble techniques in both classification and regression problems [109]. Their proposed model, RuleFit, is a set of rules extracted from an ensemble of trees. In addition to the high accuracy reported for RuleFit and the inherent ease of rule interpretation, it is also capable of automatically identifying important variables. Despite these features, applying RuleFit to our problem did not take the first spot from the BRT, NF, and RF models. In terms of numbers, when the RuleFit model was trained on the PrXAR dataset, it yielded a Pearson correlation coefficient of 0.75 on the CrXAR test set. Nevertheless, this was a preliminary experiment that did not involve any model improvements or thorough parameter tuning. In the future, we aim to enhance the model's accuracy, present its powerful descriptive capabilities, and compare it to other competitive techniques.

According to the results of the feature-set-size experiment in Section 3.6, some machine-learning techniques, BRT and RF in particular, are highly immune to overfitting. Moreover, their accuracy tends to improve as the number of features relevant to binding affinity increases. This finding encourages us to pursue the path of collecting as many informative features as possible and utilizing them in building future scoring engines.

Collecting more training data is also shown to be important for enhancing the performance of scoring functions. In the future, we will consider more than one source of protein-ligand complexes to expand our training and test datasets. In fact, there are many publicly available databases of protein-ligand complexes, such as LPDB [110], PLD [111], BindingDB [112], PDBbind [31], [104], Binding MOAD [113], and AffinDB [114]. Complexes collected from these archives should undergo strict filtering to ensure that only complexes with high-quality structures and reliably measured binding affinities are used.

We pointed out earlier that BRT, NF, and RF yield the best-performing scoring functions. Besides combining a set of trees into one ensemble model, one could fuse different predictive techniques to form a consensus scoring function. Different machine-learning methodologies tend to generate diverse prediction surfaces for the same data patterns, and the average of these surfaces may be closer to the surface of the unknown function. That may explain why the prediction performance of combined models sometimes surpasses that of the component models applied alone. We conducted a preliminary experiment to explore the potential of consensus schemes, in which we averaged all possible combinations of RF, NF, BRT, SVM, kNN, MARS, and MLR predictions on the CrXAR test set; a minimal sketch of this scheme is given below. The results are depicted in Figure 4.1, where the x-axis shows the number of combined models and the vertical axis shows each combination's scoring power.
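The following R sketch illustrates the simple-averaging consensus experiment. It assumes a data frame preds whose columns (e.g., B, R, N, S, K, M, L) hold the per-complex BRT, RF, NF, SVM, kNN, MARS, and MLR predictions on the test set, and a vector y of measured affinities; all of these names are placeholders.

    ## Illustrative sketch of the simple-averaging consensus experiment.
    models  <- colnames(preds)
    results <- list()
    for (k in 2:length(models)) {
      for (combo in combn(models, k, simplify = FALSE)) {
        avg <- rowMeans(preds[, combo])                            # simple average
        results[[paste(sort(combo), collapse = "")]] <- cor(avg, y)  # Pearson RP
      }
    }
    head(sort(unlist(results), decreasing = TRUE), 10)             # best combinations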
The combinations are identified by the initials of their constituent components, except for MLR, which is represented by the letter L. The figure shows only the models whose Pearson correlation coefficient exceeds 0.80. Note that the BLNRS consensus scoring function (a combination of BRT, MLR, NF, RF, and SVM) outperforms the best single model (BRT), although the difference is very modest. It is clear that the best consensus scoring functions always include at least two ensemble models. One area of future work regarding consensus scoring is to investigate the optimal techniques for combining several scoring functions. Here we merely take the simple average; more sophisticated schemes, such as weighted aggregation, should be probed.

Figure 4.1 Performance (RP) of consensus scoring functions on the core test set Cr.

Finally, we showed in Section 3.4 that the ranking power of scoring functions is inferior to their scoring power, although the two statistics are related. Therefore, more emphasis could be placed on ranking power when scoring functions are designed in the future.

In conclusion, we believe that there is good room for further improvement in the performance of scoring functions. In addition to the high performance profile we have already obtained using machine-learning techniques, we think that the enhancement plans outlined above could result in consistently superior scoring functions.

BIBLIOGRAPHY

[1] P. D. Lyne, "Structure-based virtual screening: an overview," Drug Discovery Today, vol. 7, no. 20, pp. 1047-1055, 2002.
[2] M. P. Mathieu, PAREXEL's Pharmaceutical R&D Statistical Sourcebook. PAREXEL International Corp., 2001.
[3] L. Michielan and S. Moro, "Pharmaceutical Perspectives of Nonlinear QSAR Strategies," Journal of Chemical Information and Modeling, vol. 50, no. 6, pp. 961-978, 2010.
[4] A. M. Davis, S. J. Teague, and G. J. Kleywegt, "Application and limitations of X-ray crystallographic data in structure-based ligand and drug design," Angewandte Chemie International Edition, vol. 42, no. 24, pp. 2718-2736, 2003.
[5] G. Pintacuda, M. John, X. Su, and G. Otting, "NMR Structure Determination of Protein−Ligand Complexes by Lanthanide Labeling," Accounts of Chemical Research, vol. 40, no. 3, pp. 206-212, 2007, PMID: 17370992.
[6] A. Evers and G. Klebe, "Successful Virtual Screening for a Submicromolar Antagonist of the Neurokinin-1 Receptor Based on a Ligand-Supported Homology Model," Journal of Medicinal Chemistry, vol. 47, no. 22, pp. 5381-5392, 2004.
[7] A. Evers and T. Klabunde, "Structure-based Drug Discovery Using GPCR Homology Modeling: Successful Virtual Screening for Antagonists of the Alpha1A Adrenergic Receptor," Journal of Medicinal Chemistry, vol. 48, no. 4, pp. 1088-1097, 2005, PMID: 15715476.
[8] F. Sousa, M. Cerqueira, A. Fernandes, and J. Ramos, "Virtual Screening in Drug Design and Development," Combinatorial Chemistry & High Throughput Screening, vol. 13, no. 5, pp. 442-453, 2010.
[9] W. Schaal, A. Karlsson, A. Göran, J. Lindberg, H. Andersson, U. Danielson, B. Classon, T. Unge, B. Samuelsson, J. Hultén, A. Hallberg, and A. Karlén, "Synthesis and Comparative Molecular Field Analysis (CoMFA) of Symmetric and Nonsymmetric Cyclic Sulfamide HIV-1 Protease Inhibitors," Journal of Medicinal Chemistry, vol. 44, no. 2, pp. 155-169, 2001.
[10] T. J. Ewing, S. Makino, A. Skillman, and I. D. Kuntz, "DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases," Journal of Computer-Aided Molecular Design, vol. 15, no. 5, pp. 411-428, 2001.
Kuntz, "DOCK 4.0: search strategies for automated molecular docking of flexible molecule databases," Journal of Computer-Aided 132 Molecular Design, vol. 15, no. 5, pp. 411-428, 2001. [11] M. I. Zavodszky, P. C. Sanschagrin, L. A. Kuhn, and R. S. Korde, "Distilling the essential features of a protein surface for improving protein-ligand docking, scoring, and virtual screening," Journal of Computer-Aided Molecular Design, vol. 16, pp. 883-902, 2002, 10.1023/A:1023866311551. [12] R. A. Friesner, J. L. Banks, R. B. Murphy, T. A. Halgren, J. J. Klicic, D. T. Mainz, M. P. Repasky, E. H. Knoll, M. Shelley, J. K. Perry, D. E. Shaw, P. Francis, and P. S. Shenkin, "Glide:  A New Approach for Rapid, Accurate Docking and Scoring. 1. Method and Assessment of Docking Accuracy," Journal of Medicinal Chemistry, vol. 47, no. 7, pp. 1739-1749, 2004, PMID: 15027865. [13] S. Y. Huang, S.Z. Grinter, and X. Zou, "Scoring functions and their evaluation methods for protein--ligand docking: recent advances and future directions," Physical chemistry chemical physics: PCCP, vol. 12, pp. 12899-12908, 2010. [14] D. B. Kitchen, H. Decornez, J. R. Furr, and J. Bajorath, "Docking and scoring in virtual screening for drug discovery: methods and applications," Nature reviews Drug discovery, vol. 3, no. 11, pp. 935-949, 2004. [15] I. D. Kuntz, J. M. Blaney, S. J. Oatley, R. Langridge, and T. E. Ferrin, "A geometric approach to macromolecule-ligand interactions* 1," Journal of Molecular Biology, vol. 161, no. 2, pp. 269-288, 1982. [16] R. L. DesJarlais, R. P. Sheridan, G. L. Seibel, J. S. Dixon, I. D. Kuntz, and R. Venkataraghavan, "Using shape complementarity as an initial screen in designing ligands for a receptor binding site of known three-dimensional structure," Journal of medicinal chemistry, vol. 31, no. 4, pp. 722-729, 1988. [17] D. A. Gschwend and I. D. Kuntz, "Orientational sampling and rigid-body minimization in molecular docking revisited: on-the-fly optimization and degeneracy removal," Journal of computer-aided molecular design, vol. 10, no. 2, pp. 123-132, 1996. [18] D. S. Goodsell and A. J. Olson, "Automated docking of substrates to proteins by simulated annealing," Proteins: Structure, Function, and Bioinformatics, vol. 8, no. 3, pp. 195-202, 1990. [19] G. M. Morris, D. S. Goodsell, R. S. Halliday, R. Huey, W. E. Hart, R. K. Belew, and A. J. Olson, "Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function," Journal of Computational Chemistry, vol. 19, no. 14, pp. 1639-1662, 1998. [20] H. J. Böhm, "The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure," Journal of Computer-Aided Molecular Design, vol. 8, no. 3, pp. 243-256, 1994. 133 [21] M. Rarey, B. Kramer, T. Lengauer, and G. Klebe, "A fast flexible docking method using an incremental construction algorithm," Journal of Molecular Biology, vol. 261, no. 3, pp. 470-489, 1996. [22] P. D. Thomas and K. A. Dill, "An iterative method for extracting energy-like quantities from protein structures," Proceedings of the National Academy of Sciences of the United States of America, vol. 93, no. 21, p. 11628, 1996. [23] W. Koppensteiner and M. Sippl, "Knowledge-based potentials--back to the roots.," a, vol. 63, no. 3, p. 247, 1998. [24] I. Muegge, and Y. Martin, "A General and Fast Scoring Function for Protein−Ligand Interactions: A Simplified Potential Approach," Journal of Medicinal Chemistry, vol. 42, no. 5, pp. 
791-804, 1999, PMID: 10072678. [25] I. Muegge, "A knowledge-based scoring function for protein-ligand interactions: Probing the reference state," Perspectives in Drug Discovery and Design, vol. 20, pp. 99-114, 2000, 10.1023/A:1008729005958. [26] H. Gohlke, M. Hendlich, and G. Klebe, "Knowledge-based scoring function to predict protein-ligand interactions," Journal of Molecular Biology, vol. 295, no. 2, pp. 337-356, 2000. [27] I. Muegge, "Effect of ligand volume correction on PMF scoring," Journal of Computational Chemistry, vol. 22, no. 4, pp. 418-425, 2001. [28] H. F. Velec , H. Gohlke, and G. Klebe, "DrugScoreCSD Knowledge-Based Scoring Function Derived from Small Molecule Crystal Data with Superior Recognition Rate of Near-Native Ligand Poses and Better Affinity Prediction," Journal of Medicinal Chemistry, vol. 48, no. 20, pp. 6296-6303, 2005. [29] I. Muegge, "PMF Scoring Revisited," Journal of Medicinal Chemistry, vol. 49, no. 20, pp. 5895-5902, 2006, PMID: 17004705. [30] P. Ballester and J. Mitchell, "A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking," Bioinformatics, vol. 26, no. 9, p. 1169, 2010. [31] R. Wang, X. Fang, Y. Lu, and S. Wang, "The PDBbind Database:  Collection of Binding Affinities for Protein−Ligand Complexes with Known Three-Dimensional Structures," Journal of Medicinal Chemistry, vol. 47, no. 12, pp. 2977-2980, 2004, PMID: 15163179. [32] T. Cheng, X. Li, Y. Li, Z. Liu, and R. Wang, "Comparative assessment of scoring functions on a diverse test set," Journal of chemical information and modeling, vol. 49, no. 4, pp. 1079-1093, 2009. 134 [33] R. Wang, Y. Lu, X. Fang, S. Wang, "An Extensive Test of 14 Scoring Functions Using the PDBbind Refined Set of 800 Protein−Ligand Complexes," Journal of Chemical Information and Computer Sciences, vol. 44, no. 6, pp. 2114-2125, 2004, PMID: 15554682. [34] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5-32, 2001, 10.1023/A:1010933404324. [35] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, and B. P. Feuston, "Random forest: a classification and regression tool for compound classification and QSAR modeling," J. Chem. Inf. Comput. Sci, vol. 43, no. 6, pp. 1947-1958, 2003. [36] S. Doniger, T. Hofmann, and J. Yeh, "Predicting CNS permeability of drug molecules: comparison of neural network and support vector machine algorithms," Journal of computational biology, vol. 9, no. 6, pp. 849-864, 2002. [37] W. Tong, H. Hong, H. Fang, Q. Xie, and R. Perkins, "Decision Forest:  Combining the Predictions of Multiple Independent Decision Tree Models," Journal of Chemical Information and Computer Sciences, vol. 43, no. 2, pp. 525-531, 2003, PMID: 12653517. [38] J. E. Penzotti, M. L. Lamb, L. Michelle, E. Evensen, and J. D. Peter, "A Computational Ensemble Pharmacophore Model for Identifying Substrates of P-Glycoprotein," Journal of Medicinal Chemistry, vol. 45, no. 9, pp. 1737-1740, 2002. [39] G. A. Bakken, and P. C. Jurs, "Classification of Multidrug-Resistance Reversal Agents Using Structure-Based Descriptors and Linear Discriminant Analysis," Journal of Medicinal Chemistry, vol. 43, no. 23, pp. 4534-4541, 2000. [40] P. J. Gilligan, G. A. Cain, T. E. Christos, L. Cook, S. Drummond, A. L. Johnson, A. A. Kergaye, J. F. McElroy, and K. W. Rohrbach, "Novel piperidine .sigma. receptor ligands as potential antipsychotic drugs," Journal of Medicinal Chemistry, vol. 35, no. 23, pp. 4344-4361, 1992. [41] G. W. Kauffman and P. C. 
Jurs, "QSAR and k-Nearest Neighbor Classification Analysis of Selective Cyclooxygenase-2 Inhibitors Using Topologically-Based Numerical Descriptors," Journal of Chemical Information and Computer Sciences, vol. 41, no. 6, pp. 1553-1560, 2001. [42] C. H. Andrade, K. F. Pasqualoto, E. I. Ferreira, A. J. Hopfinger, "4D-QSAR: Perspectives in Drug Design," Molecules, vol. 15, no. 5, pp. 3281-3294, 2010. [43] C. Andersson, B. Chen, and A. Linusson, "Multivariate assessment of virtual screening experiments," Journal of Chemometrics, vol. 24, no. 11-12, pp. 757-767, 2010. [44] D. Plewczynski, S. Spieser, and U. Koch, "Assessing different classification methods for virtual screening," J. Chem. Inf. Model, vol. 46, no. 3, pp. 1098-1106, 2006. 135 [45] MDL Information Systems. MACCS Drug Data Report (MDDR). Database, 2006. [46] M. Fatemi and S. Gharaghani, "A novel QSAR model for prediction of apoptosis-inducing activity of 4-aryl-4-H-chromenes based on support vector machine," Bioorganic & Medicinal Chemistry, vol. 15, no. 24, pp. 7746-7754, 2007. [47] M. Goodarzi, M. Freitas, and R. Jensen, "Feature Selection and Linear\Nonlinear Regression Methods for the Accurate Prediction of Glycogen Synthase Kinase-3β Inhibitory Activities," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 824-832, 2009. [48] H. Tang, X. S. Wang, X. Huang, B. L. Roth, K. V. Butler, A. P. Kozikowski, M. Jung, and A. Tropsha, "Novel Inhibitors of Human Histone Deacetylase (HDAC) Identified by QSAR Modeling of Known Inhibitors, Virtual Screening, and Experimental Validation," Journal of Chemical Information and Modeling, vol. 49, no. 2, pp. 461-476, 2009. [49] H. Liu, E. Papa, J. Walker, and P. Gramatica, "In silico screening of estrogen-like chemicals based on different nonlinear classification models," Journal of Molecular Graphics and Modelling, vol. 26, no. 1, pp. 135-144, 2007. [50] H. Lin, L. Han, C. Yap, Y. Xue, X. Liu, F. Zhu, and Y. Chen, "Prediction of factor Xa inhibitors by machine learning methods," Journal of Molecular Graphics and Modelling, vol. 26, no. 2, pp. 505-518, 2007. [51] D. S. Chekmarev, V. Kholodovych, K. V. Balakin, Y. Ivanenkov, S. Ekins, and W. J. Welsh, "Shape Signatures: New Descriptors for Predicting Cardiotoxicity In Silico," Chemical Research in Toxicology, vol. 21, no. 6, pp. 1304-1314, 2008, PMID: 18461975. [52] R. Hu, J. Doucet, M. Delamar, and R. Zhang, "QSAR models for 2-amino-6arylsulfonylbenzonitriles and congeners HIV-1 reverse transcriptase inhibitors based on linear and nonlinear regression methods," European Journal of Medicinal Chemistry, vol. 44, no. 5, pp. 2158-2171, 2009. [53] N. Hernández, R. Kiralj, M. Ferreira, and I. Talavera, "Critical comparative analysis, validation and interpretation of SVM and PLS regression models in a QSAR study on HIV-1 protease inhibitors," Chemometrics and Intelligent Laboratory Systems, vol. 98, no. 1, pp. 65-77, 2009. [54] M. Goodarzi, P. Duchowicz, C. Wu, F. Fern ndez, and E. Castro, "New Hybrid Genetic Based Support Vector Regression as QSAR Approach for Analyzing FlavonoidsGABA(A) Complexes," Journal of Chemical Information and Modeling, vol. 49, no. 6, pp. 1475-1485, 2009, PMID: 19492793. [55] X. Yang, D. Chen, M. Wang, Y. Xue, and Y. Chen, "Prediction of antibacterial compounds by machine learning approaches," Journal of Computational Chemistry, vol. 30, no. 8, pp. 136 1202-1211, 2009. [56] Y. Cong, X. Yang, W. Lv, and Y. 
Xue, "Prediction of novel and selective TNF-alpha converting enzyme (TACE) inhibitors and characterization of correlative molecular descriptors by machine learning approaches," Journal of Molecular Graphics and Modelling, vol. 28, no. 3, pp. 236-244, 2009. [57] R. Wang, L. Lai, and S. Wang, "Further development and validation of empirical scoring functions for structure-based binding affinity prediction," Journal of Computer-Aided Molecular Design, vol. 16, pp. 11-26, 2002, 10.1023/A:1016357811882. [58] V. Schnecke and L. A. Kuhn, "Virtual screening with solvation and ligand-induced complementarity," in Virtual Screening: An Alternative or Complement to High Throughput Screening?, G. Klebe, Ed. Springer Netherlands, 2002, pp. 171-190, 10.1007/0-306-46883-2_10. [59] M. Zavodszky and L. Kuhn, "Side-chain flexibility in protein–ligand binding: The minimal rotation hypothesis," Protein Science, vol. 14, no. 4, pp. 1104-1114, 2005. [60] M. Tonero, M. Zavodszky, J. Van Voorst, L. He, S. Arora, S. Namilikonda, and L. Kuhn, "Effective Scoring Functions for Predicting Ligand Binding Mode and Affinity in Docking and High-Throughput Screening"," in preparation. [61] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, "The Protein Data Bank," Nucleic Acids Research, vol. 28, no. 1, pp. 235-242, 2000. [62] K. Raha, and K. Merz, "Large-Scale Validation of a Quantum Mechanics Based Scoring Function:  Predicting the Binding Affinity and the Binding Mode of a Diverse Set of Protein−Ligand Complexes," Journal of Medicinal Chemistry, vol. 48, no. 14, pp. 45584575, 2005, PMID: 15999994. [63] S. Huang and X. Zou, "An iterative knowledge-based scoring function to predict protein– ligand interactions: II. Validation of the scoring function," Journal of Computational Chemistry, vol. 27, no. 15, pp. 1876-1882, 2006. [64] A. Ruvinsky, "Calculations of protein-ligand binding entropy of relative and overall molecular motions," Journal of Computer-Aided Molecular Design, vol. 21, pp. 361-370, 2007, 10.1007/s10822-007-9116-0. [65] A. Ruvinsky, "Role of binding entropy in the refinement of protein–ligand docking predictions: Analysis based on the use of 11 scoring functions," Journal of Computational Chemistry, vol. 28, no. 8, pp. 1364-1372, 2007. [66] J. Lee, and C. Seok, "A statistical rescoring scheme for protein–ligand docking: Consideration of entropic effect," Proteins: Structure, Function, and Bioinformatics, vol. 137 70, no. 3, pp. 1074-1083, 2008. [67] Tripos Inc. The Sybyl Software. Software, 2006. [68] T. Madden, "The BLAST sequence analysis tool," The NCBI Handbook [Internet]. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda, MD, 2002. [69] Y. Cheng and W. Prusoff, "Relationship between the inhibition constant (Ki) and the concentration of inhibitor which causes 50% inhibition (IC50) of an enzymatic reaction," Biochem. Pharmacol, vol. 22, no. 23, pp. 3099-3108, 1973. [70] R. A. Copeland, D. Lombardo, J. Giannaras, and C. P. Decicco, "Estimating KI values for tight binding inhibitors from dose-response plots," Bioorganic & Medicinal Chemistry Letters, vol. 5, no. 17, pp. 1947-1952, 1995. [71] J. H. Friedman, "Multivariate Adaptive Regression Splines," The Annals of Statistics, vol. 19, no. 1, pp. 1-67, 1991. [72] R Core Development Team. R: A Language and Environment for Statistical Computing. Software, 2010. [73] S. Milborrow and R. Tibshirani, "earth: Multivariate Adaptive Regression Spline Models," 2010. 
[74] K. Schliep, "kknn: Weighted k-Nearest Neighbors," 2010. [75] K. Hechenbichler and K. Schliep, "Weighted k-nearest-neighbor techniques and ordinal classification," Citeseer, 2004. [76] V. Vapnik, Statistical learning theory. Wiley, New York, 1998. [77] X. J. Yao, A. Panaye, J. P. Doucet, R. S. Zhang, H. F. Chen, M. Liu, Z. D. Hu, and B. T. Fan, "Comparative Study of QSAR/QSPR Correlations Using Support Vector Machines, Radial Basis Function Neural Networks, and Multiple Linear Regression," Journal of Chemical Information and Computer Sciences, vol. 44, no. 4, pp. 1257-1266, 2004. [78] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," 2001. [79] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel, "e1071: Misc Functions of the Department of Statistics (e1071), TU Wien," 2010. 138 [80] L. K. Hansen and P. Salamon, "Neural Network Ensembles," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993-1001, 1990. [81] Y. Amit and D. Geman, "Shape Quantization and Recognition with Randomized Trees," Neural Computation, vol. 9, no. 7, pp. 1545-1588, 1997. [82] T. G. Dietterich, "An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization," Machine learning, vol. 40, no. 2, pp. 139-157, 2000. [83] M. R. Segal, "Machine learning benchmarks and random forest regression," 2004. [84] D. Plewczynski, M. Grotthuss, L. Rychlewski, and K. Ginalski, "Virtual High Throughput Screening Using Combined Random Forest and Flexible Docking," Combinatorial Chemistry & High Throughput Screening, vol. 12, no. 5, pp. 484-489, 2009, Not Available Yet. [85] A. Liaw and M. Wiener, "Classification and Regression by randomForest," R News, vol. 2, no. 3, pp. 18-22, 2002. [86] B. D. Ripley, Pattern recognition and neural networks. Cambridge Univ Pre, 2005. [87] W. N. Venables and B. D. Ripley, Modern Applied Statistics with S. Springer, 2002. [88] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367-378, 2002. [89] G. Ridgeway, "gbm: Generalized Boosted Regression Models," 2010. [90] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, pp. 1189-1232, 2001. [91] Accelrys Software Inc. The Discovery Studio Software. Software, 2001. [92] A. Krammer, P. Kirchhoff, X. Jiang, C. Venkatachalam, and M. Waldman, "LigScore: a novel scoring function for predicting binding affinities," Journal of Molecular Graphics and Modelling, vol. 23, no. 5, pp. 395-407, 2005. [93] D. K. Gehlhaar, G. M. Verkhivker, P. A. Rejto, C. J. Sherman, D. R. Fogel, L. J. Fogel, and S. T. Freer, "Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: conformationally flexible docking by evolutionary programming," Chemistry & Biology, 139 vol. 2, no. 5, pp. 317-324, 1995. [94] A. L. Parrill and M. R. Reddy, Rational Drug Design: novel methodology and practical applications. An American Chemical Society Publication, 1999. [95] A. N. Jain, "Scoring noncovalent protein-ligand interactions: A continuous differentiable function tuned to compute binding affinities," Journal of Computer-Aided Molecular Design, vol. 10, pp. 427-440, 1996, 10.1007/BF00124474. [96] H. Böhm, "Prediction of binding constants of protein ligands: A fast method for the prioritization of hits obtained from de novo design or 3D database search programs," Journal of Computer-Aided Molecular Design, vol. 12, pp. 
309-309, 1998, 10.1023/A:1007999920146. [97] G. Jones, P. Willett, and R. C. Glen, "Molecular recognition of receptor sites using a genetic algorithm with a description of desolvation," Journal of molecular biology, vol. 245, no. 1, pp. 43-53, 1995. [98] G. Jones, P. Willett, R. Glen, A. Leach, and R. Taylor, "Development and validation of a genetic algorithm for flexible docking," Journal of Molecular Biology, vol. 267, no. 3, pp. 727-748, 1997. [99] M. D. Eldridge, C. W. Murray, T. R. Auton, G. V. Paolini, and P. P. Mee, "Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes," Journal of Computer-Aided Molecular Design, vol. 11, pp. 425-445, 1997, 10.1023/A:1007996124545. [100] C. A. Baxter, C. W. Murray, D. E. Clark, D. R. Westhead, and M. D. Eldridge, "Flexible docking using tabu search and an empirical estimate of binding affinity," Proteins: Structure, Function, and Bioinformatics, vol. 33, no. 3, pp. 367-382, 1998. [101] W. Mooij, and M. Verdonk, "General and targeted statistical potentials for protein-ligand interactions.," Proteins, vol. 61, no. 2, p. 272, 2005. [102] R. Friesner, R. Murphy, M. Repasky, L. Frye, J. Greenwood, T. Halgren, P. Sanschagrin, and D. Mainz, "Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein- ligand complexes," J. Med. Chem, vol. 49, no. 21, pp. 6177-6196, 2006. [103] Schrodinger, LLC. The Schrodinger Software. Software, 2005. [104] R. Wang, X. Fang, Y. Lu, C. Yang, and S. Wang, "The PDBbind Database:  Methodologies and Updates," Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 41114119, 2005, PMID: 15943484. 140 [105] R. Kohavi, "A study of cross-validation and bootstrap for accuracy estimation and model selection," in Citeseer, vol. 14, 1995, pp. 1137-1145. [106] T. Hastie, R. Tibshirani, and J. Friedman, The elements of statistical learning. 2001. [107] C. J. Lin, C. Chang, and C. Hsu, "A practical guide to support vector classification," National Taiwan University, 2004. [108] J. Overington, B. Al-Lazikani, and A. Hopkins, "How many drug targets are there?," Nature reviews Drug discovery, vol. 5, no. 12, pp. 993--996, 2006. [109] J. H. Friedman and B. E. Popescu, "Predictive learning via rule ensembles," Annals, vol. 2, no. 3, pp. 916-954, 2008. [110] O. Roche, R. Kiyama, and C. L. Brooks, "Ligand−Protein DataBase:  Linking Protein−Ligand Complex Structures to Binding Data," Journal of Medicinal Chemistry, vol. 44, no. 22, pp. 3592-3598, 2001, PMID: 11606123. [111] D. Puvanendrampillai and B. John, "Protein Ligand Database (PLD): additional understanding of the nature and specificity of protein–ligand complexes," Bioinformatics, vol. 19, no. 14, pp. 1856-1857, 2003. [112] T. Liu, Y. Lin, X. Wen, R. Jorissen, M. Gilson, and K. Michael , "BindingDB: a webaccessible database of experimentally determined protein–ligand binding affinities," Nucleic Acids Research, vol. 35, no. suppl 1, pp. D198-D201, 2006. [113] M. L. Benson, R. D. Smith, N. A. Khazanov, B. Dimcheff, J. Beaver, P. Dresslar, J. Nerothin, and H. Carlson, "Binding MOAD, a high-quality protein–ligand database," Nucleic Acids Research, vol. 36, no. suppl 1, pp. D674-D678, 2008. [114] P. Block, C. Sotriffer, I. Dramburg, and G. Klebe, "AffinDB: a freely accessible database of affinities for protein–ligand complexes from the PDB," Nucleic Acids Research, vol. 34, no. suppl 1, pp. D522-D526, 2006. 141