DATA-DRIVEN AND TASK-SPECIFIC SCORING FUNCTIONS FOR PREDICTING LIGAND BINDING POSES AND AFFINITY AND FOR SCREENING ENRICHMENT

By

Hossam M. Ashtawy

A DISSERTATION

Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of

Electrical Engineering – Doctor of Philosophy

2017

ABSTRACT

DATA-DRIVEN AND TASK-SPECIFIC SCORING FUNCTIONS FOR PREDICTING LIGAND BINDING POSES AND AFFINITY AND FOR SCREENING ENRICHMENT

By

Hossam M. Ashtawy

Molecular modeling has become an essential tool to assist in early stages of drug discovery and development. Molecular docking, scoring, and virtual screening are three such modeling tasks of particular importance in computer-aided drug discovery. They are used to computationally simulate the interaction between small drug-like molecules, known as ligands, and a target protein whose activity is to be altered. Scoring functions (SF) are typically employed to predict the binding conformation (docking task), binary activity label (screening task), and binding affinity (scoring task) of ligands against a critical protein in the disease's pathway. In most molecular docking software packages available today, a generic binding affinity-based (BA-based) SF is invoked for the three tasks to solve three different, but related, prediction problems. The vast majority of these predictive models are knowledge-based, empirical, or force-field scoring functions. The fourth family of SFs, which has gained popularity recently and shown potential for improved accuracy, is based on machine-learning (ML) approaches. Despite intense efforts in developing conventional and current ML SFs, their limited predictive accuracies in these three tasks have been a major roadblock toward cost-effective drug discovery. Therefore, in this work we present (i) novel task-specific and multi-task SFs employing large ensembles of deep neural networks (NN) and other state-of-the-art ML algorithms in conjunction with (ii) data-driven multi-perspective descriptors (features) for accurate characterization of protein-ligand complexes (PLCs) extracted using our Descriptor Data Bank (DDB) platform. We assess the docking, screening, scoring, and ranking accuracies of the proposed task-specific SFs with DDB descriptors as well as several conventional approaches in the context of the 2007 and 2014 PDBbind benchmarks, which encompass a diverse set of high-quality PLCs. Our approaches substantially outperform conventional SFs based on BA and single-perspective descriptors in all tests. In terms of scoring accuracy, we find that the ensemble NN SFs, BsN-Score and BgN-Score, have more than 34% better correlation (0.844 and 0.840 vs. 0.627) between predicted and measured BAs compared to that achieved by X-Score, a top performing conventional SF. We further find that ensemble NN models surpass SFs based on other state-of-the-art ML algorithms. Similar results have been obtained for the ranking task. Within clusters of PLCs with different ligands bound to the same target protein, we find that the best ensemble NN SF is able to rank the ligands correctly 64.6% of the time compared to 57.8% obtained by X-Score. A substantial improvement in the docking task has also been achieved by our proposed docking-specific SFs. We find that the docking NN SF, BsN-Dock, has a success rate of 95% in identifying poses that are within 2 Å RMSD from the native poses of 65 different protein families.
This is in comparison to a success rate of only 82% achieved by the best conventional SF, ChemPLP, employed in the commercial docking software GOLD. As for the ability to distinguish active molecules from inactives, our screening-specific SFs showed excellent improvements over the conventional approaches. The proposed SF BsN-Screen achieved a screening enrichment factor of 33.90 as opposed to 19.54 obtained from the best conventional SF, GlideScore, employed in the docking software Glide. For all tasks, we observed that the proposed task-specific SFs benefit more than their conventional counterparts from increases in the number of descriptors and training PLCs. They also perform better on novel proteins that they were never trained on before. In addition to the three task-specific SFs, we propose a novel multi-task deep neural network (MT-Net) that is trained on data from three tasks to simultaneously predict binding poses, affinities, and activity labels. MT-Net is composed of shared hidden layers for the three tasks to learn common features, task-specific hidden layers for higher feature representation, and three outputs for the three tasks. We show that the performance of MT-Net is superior to conventional SFs and competitive with other ML approaches. Based on current results and potential improvements, we believe our proposed ideas will have a transformative impact on the accuracy and outcomes of molecular docking and virtual screening.

To the memory of my uncle and role model, Hassan Abuhafa (1955-2015).

ACKNOWLEDGMENTS

I am indebted to many people who played a significant role in the completion of this dissertation. First and foremost, I would like to extend my deepest thanks and appreciation to my research advisor and mentor Prof. Nihar Mahapatra for his time, ideas, patience, and guidance that made my Ph.D. experience productive and stimulating. He taught me how to think critically and pay attention to detail. I would also like to thank my committee members Prof. Jin Chen, Prof. Fathi Salem, and Prof. Yanni Sun for their time in reviewing this work and for their insightful questions and feedback. It goes without saying that none of this would have been possible without the support of my parents, Hasna and Mohamed, and the help of my wife Galya. I would like to thank them all for their encouragement and patience. I would also like to thank my 20-month-old daughter, Danya, for the joy that she brings into my life and the motivation she gives me to press forward.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
CHAPTER 1  INTRODUCTION
    1.1  Background
    1.2  Scoring function requirements
    1.3  Motivation: Limitations of existing scoring functions for accurate ligand docking, screening, scoring, and ranking
    1.4  Key contributions
CHAPTER 2  MATERIALS AND METHODS
    2.1  Main training and test protein-ligand complexes
        2.1.1  PDBbind core test set: novel complexes with known protein targets
        2.1.2  Multi-fold cross-validation: novel complexes with known and novel protein targets
        2.1.3  Leave clusters out: novel complexes with novel protein targets
    2.2  Machine learning algorithms
        2.2.1  Multiple linear regression (MLR)
        2.2.2  Multivariate adaptive regression splines (MARS)
        2.2.3  k-nearest neighbors (kNN)
        2.2.4  Support vector machines (SVM)
        2.2.5  Deep Neural Networks (DNN)
        2.2.6  Random forests (RF)
        2.2.7  Boosted regression trees (BRT)
        2.2.8  Extreme Gradient Boosting (XGB)
    2.3  Conventional scoring functions under assessment
CHAPTER 3  DESCRIPTOR DATA BANK (DDB): A PLATFORM FOR MULTI-PERSPECTIVE MODELING OF PROTEIN-LIGAND INTERACTIONS
    3.1  Improving machine-learning scoring functions
    3.2  The need for a library of descriptor extraction tools
    3.3  Methods
        3.3.1  Receptor-specific descriptors based on structure similarity and triangulation
        3.3.2  Multi-perspective modeling of protein-ligand interactions
            3.3.2.1  Descriptor Data Bank (DDB)
            3.3.2.2  Descriptor tools and data: deposition, extraction, and sharing
            3.3.2.3  Descriptor filtering and analysis
            3.3.2.4  Machine-learning scoring functions in DDB
            3.3.2.5  Multi-perspective descriptors extracted for PDBbind complexes
    3.4  Results and discussion
        3.4.1  Single vs. multi-perspective modeling of protein-ligand interactions
        3.4.2  Number of perspectives vs. number of training protein-ligand complexes
        3.4.3  Perspective filtering using automatic feature subset selection
    3.5  Conclusion
CHAPTER 4  BAGGING AND BOOSTING BASED ENSEMBLE NEURAL NETWORKS SCORING FUNCTIONS FOR ACCURATE BINDING AFFINITY PREDICTION
    4.1  Binding affinity
    4.2  Predicting binding affinity
    4.3  Key contributions
    4.4  Methodology
        4.4.1  Protein-ligand complex datasets and multi-perspective characterization
            4.4.1.1  The PDBbind 2007 benchmark
            4.4.1.2  The PDBbind 2014 benchmark
        4.4.2  Limitations of neural networks and our approach to tackling them
        4.4.3  BgN-Score: Ensemble neural networks through bagging
        4.4.4  BsN-Score: Ensemble neural networks through boosting
        4.4.5  Improving speed and accuracy of ensemble neural networks via transfer learning
        4.4.6  Neural networks and other machine-learning scoring functions
    4.5  Results and discussion
        4.5.1  Evaluation of scoring functions in binding affinity prediction and ligand ranking
        4.5.2  Ensemble neural networks vs. other approaches on a diverse test set
            4.5.2.1  Scoring performance on diverse protein families from PDBbind 2007
            4.5.2.2  Scoring performance on diverse protein families from PDBbind 2014
        4.5.3  Ensemble neural networks vs. other approaches on homogeneous test sets
            4.5.3.1  Scoring performance on four protein families from PDBbind 2007
            4.5.3.2  Scoring performance on four protein families from PDBbind 2014
        4.5.4  Performance of machine-learning scoring functions on novel targets
            4.5.4.1  Simulating novel targets using PDBbind 2007
            4.5.4.2  Simulating novel targets using PDBbind 2014
        4.5.5  Impact of the number of training protein-ligand complexes on the scoring and ranking accuracies
            4.5.5.1  Simulating increasing training set size using PDBbind 2007 & 2010
            4.5.5.2  Simulating increasing training set size using PDBbind 2014
        4.5.6  Impact of the type and number of descriptors on the scoring and ranking accuracies
            4.5.6.1  Type and number of descriptors characterizing PDBbind 2007 complexes
            4.5.6.2  Number of descriptor sets characterizing PDBbind 2014 complexes
    4.6  Conclusion
CHAPTER 5  DOCKING-SPECIFIC SCORING FUNCTIONS FOR ACCURATE LIGAND POSE PREDICTION
    5.1  Decoy generation of ligand poses and formation of training and test complexes
        5.1.1  Generating decoy poses for protein-ligand complexes in PDBbind 2007
        5.1.2  Generating decoy poses for protein-ligand complexes in PDBbind 2014
    5.2  Conventional scoring functions
        5.2.1  Conventional scoring functions evaluated on the PDBbind 2007 benchmark
        5.2.2  Conventional scoring functions evaluated on the PDBbind 2014 benchmark
    5.3  Docking-specific machine-learning scoring functions
        5.3.1  Generic ML scoring functions evaluated on the PDBbind 2007 benchmark
        5.3.2  Ensemble deep neural networks evaluated on the PDBbind 2014 benchmark
    5.4  Results and discussion
        5.4.1  Evaluation of the docking power of scoring functions
        5.4.2  Docking-specific scoring functions vs. conventional approaches on a diverse test set
            5.4.2.1  Docking performance on diverse protein families from PDBbind 2007
            5.4.2.2  Docking performance on diverse protein families from PDBbind 2014
        5.4.3  Docking-specific scoring functions vs. conventional approaches on homogeneous test sets
            5.4.3.1  Docking performance on four protein families from PDBbind 2007
            5.4.3.2  Docking performance on four protein families from PDBbind 2014
        5.4.4  Performance of docking-specific scoring functions on novel targets
            5.4.4.1  Simulating novel targets using PDBbind 2007
            5.4.4.2  Simulating novel targets using PDBbind 2014
        5.4.5  Impact of the number of training protein-ligand complexes on the docking accuracy
            5.4.5.1  Simulating increasing training set size using PDBbind 2007
            5.4.5.2  Simulating increasing training set size using PDBbind 2014
        5.4.6  Impact of the type and number of descriptors on the docking accuracy
            5.4.6.1  Type and number of raw descriptors characterizing PDBbind 2007 complexes
            5.4.6.2  Number of descriptor sets characterizing PDBbind 2014 complexes
    5.5  Conclusion
CHAPTER 6  SCREENING-SPECIFIC SCORING FUNCTIONS FOR ACCURATE LIGAND BIOACTIVITY PREDICTION
    6.1  Proposed screening-specific scoring functions
    6.2  Results and Discussion
        6.2.1  Evaluation of the screening power of scoring functions
        6.2.2  Screening-specific ML SFs vs. conventional approaches on a diverse test set
        6.2.3  Impact of the number of training protein targets on the screening accuracies
        6.2.4  Impact of the number of descriptor sets on the screening accuracies
    6.3  Conclusion
CHAPTER 7  MULTI-TASK DEEP NEURAL NETWORKS FOR SIMULTANEOUS DOCKING, SCREENING, AND SCORING OF PROTEIN-LIGAND COMPLEXES
    7.1  Introduction
    7.2  Network architecture
    7.3  Training the multi-task network
        7.3.1  Stochastic learning for imbalanced data
        7.3.2  Gradient computing for unavailable labels
        7.3.3  Network regularization
        7.3.4  Computational requirements
    7.4  Prediction and accuracy
    7.5  Conclusion
CHAPTER 8  CONCLUSION AND FUTURE WORK
    8.1  Proposed approaches and key findings
        8.1.1  Descriptor Data Bank (DDB) for multi-perspective characterization of protein-ligand interactions
        8.1.2  Boosted and bagged deep neural networks for task-specific scoring functions
        8.1.3  Multi-task deep neural networks for simultaneous binding pose, activity, and affinity prediction
    8.2  Future work
        8.2.1  Enriching DDB with more molecular descriptors
        8.2.2  Pipeline for self-learning scoring functions
APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 1.1: Performance of conventional scoring functions in scoring, ranking, docking, and screening the ligands of 65 different protein families
Table 2.1: The 19 conventional scoring functions and the molecular docking software in which they are implemented
Table 3.1: The 16 descriptor types and scoring functions and the molecular docking software in which they are implemented
Table 3.2: Comparison of the scoring, docking, and screening powers of single and multi-perspective scoring functions using the core test set (Cr), leave-cluster-out (LCO), and cross-validation (CV).
Table 3.3: The docking, screening, and scoring accuracies (A) of multi-perspective scoring functions constructed using raw (R) and noisy (R+N) descriptors. The number of descriptors (P) and accuracies (A) are reported when all features are used (Full) and after conducting feature subset selection (FSS).
Table 4.1: Optimal parameter values for MARS, kNN, SVM, RF, and BRT models
Table 4.2: Comparison of the scoring and ranking powers of the proposed and 17 conventional scoring functions on the diverse protein-ligand complexes of the PDBbind 2007 core (Cr) test set.
Table 4.3: Comparison of the scoring and ranking accuracies of scoring functions on four protein-family-specific test sets derived from the refined set of PDBbind 2007.
Table 4.4: Scoring and ranking accuracies of machine-learning scoring functions when protein-ligand complexes are characterized by different descriptor types
Table 5.1: Optimal parameter values for MARS, kNN, SVM, RF, and BRT models for the docking task
Table 5.2: Docking success rate S11 (in %) of ML SFs on complexes characterized by different descriptors
Table 6.1: Screening powers of proposed and conventional scoring functions on diverse complexes from the PDBbind 2014 core (Cr) test set. Training and test protein-ligand complexes are characterized using a combination of descriptors from at least one of the four types X, A, R, or G.
Table 7.1: The docking, screening, and scoring performance of multi-task and single-task SFs on the PDBbind 2014 core test set (Cr).
LIST OF FIGURES

Figure 1.1: The drug design process.
Figure 1.2: Protein-ligand docking, screening, scoring, and ranking workflow.
Figure 1.3: Time and cost involved in drug discovery.
Figure 2.1: A deep neural network for predicting the binding affinity of protein-ligand complexes characterized by a set of descriptors.
Figure 3.1: The workflow for generating the receptor-based similarity descriptors REPAST and RETEST.
Figure 3.2: The system architecture of Descriptor Data Bank.
Figure 3.3: The web interface for Descriptor Data Bank.
Figure 3.4: The graphical (left) and command-line (right) user interfaces for Descriptor Data Bank.
Figure 3.5: The scoring accuracy of multi-perspective scoring functions trained on varying numbers of perspectives and known targets.
Figure 3.6: The scoring, docking, and screening accuracy during the filtering of raw (left) and noisy (right) descriptors.
Figure 3.7: The relative influence of the top 15 (out of 2714) descriptors on predicting the ligand's binding affinity (left panel) and poses (right panel) for the core test set complexes. The x-axis shows abbreviated names for the descriptors in which the first letter denotes the symbolic identifier of the SF or tool generating the descriptor (see Table 3.1), the following ".X#" is the original SF's unique index of the descriptor, and the suffix is the descriptor short name as produced by the original SF or extraction tool.
Figure 4.1: BgN-Score: ensemble neural network SF using the bagging approach.
Figure 4.2: BsN-Score: ensemble neural network SF using the boosting approach.
Figure 4.3: The scoring accuracy of the proposed and conventional scoring functions when evaluated on test complexes with proteins that are either fully represented (Cr), partially represented (CV), or not represented (LCO) in the SFs' training data. Training and test protein-ligand complexes are from the refined set of PDBbind 2014.
Figure 4.4: The ranking accuracy of the proposed and conventional scoring functions when evaluated on the core (Cr) test set complexes of PDBbind 2014.
Figure 4.5: Dependence of scoring performance of machine-learning scoring functions on the number of HIV protease complexes in their training set. They are evaluated on out-of-sample HIV protease complexes from the refined set of PDBbind 2007.
Figure 4.6: Comparison of the scoring and ranking accuracies of scoring functions on four protein-family-specific test sets derived from the refined set of PDBbind 2014.
Figure 4.7: Dependence of scoring performance of machine-learning scoring functions on the number of HIV protease (left) and carbonic anhydrase (right) complexes in their training set. They are evaluated on out-of-sample HIV protease and carbonic anhydrase complexes from the refined set of PDBbind 2014.
Figure 4.8: Binding affinity prediction accuracy of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes of PDBbind 2007.
Figure 4.9: Binding affinity prediction accuracy of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes of PDBbind 2014.
Figure 4.10: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on training set size. The models are trained on complexes selected randomly (without replacement) from the 2007 and 2010 refined sets of PDBbind and tested on out-of-sample complexes from the core (Cr) test set of PDBbind 2007.
Figure 4.11: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on training set size. The models are trained on complexes selected randomly (without replacement) from the refined set of PDBbind 2014 and tested on out-of-sample complexes from the core (Cr) test set of PDBbind 2014.
Figure 4.12: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on the number of descriptors. The descriptors are drawn randomly (without replacement) from a pool of X, A, R, and G-type features and used to characterize the training complexes of the refined sets of PDBbind 2007 and 2010 and the out-of-sample complexes from the core (Cr) test set of PDBbind 2007.
Figure 4.13: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on the number of descriptor sets. The descriptor sets are randomly sampled from all feature types in Descriptor Data Bank. The selected descriptor types are used to characterize the training complexes of the refined set of PDBbind 2014 and the out-of-sample complexes from the core (Cr) test set of PDBbind 2014.
Figure 5.1: The decoy generation process of ligand poses to train and test scoring functions on complexes from the refined set of PDBbind 2014.
Figure 5.2: Success rates of scoring functions in identifying binding poses that are closest to native conformations of complexes in PDBbind 2007. The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when BA-based scoring is used and the ones on the right show the same results when docking-specific SFs predicted RMSD values directly. Accuracies of conventional SFs are re-depicted on the right panels for comparison convenience.
Figure 5.3: The docking accuracy of docking-specific (proposed) and binding-affinity-based (conventional) scoring functions when evaluated on test complexes with proteins that are either fully represented (Cr), partially represented (CV), or not represented (LCO) in the SFs' training data. The docking accuracy is expressed in terms of the success rates (S11, S21, and S31) of SFs in identifying binding poses that are closest to native ones for protein-ligand complexes derived from the refined set of PDBbind 2014.
Figure 5.4: The docking accuracy of docking-specific and the best binding-affinity-based (conventional) scoring functions on native and decoy conformations of ligands docked to diverse proteins from the core test set of PDBbind 2014. The docking accuracy is expressed in terms of the success rates (S21, S22, and S23) of SFs in identifying binding poses that are closest to the native conformation for each protein-ligand complex in the test set.
Figure 5.5: Success rates of scoring functions in identifying binding poses that are closest to native conformations observed in four protein families in PDBbind 2007: HIV protease (a-d), trypsin (e-h), carbonic anhydrase (i-l), and thrombin (m-p). The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when binding-affinity-based scoring is used and the ones on the right are for RMSD-based SFs.
Figure 5.6: Success rates of docking-specific (proposed) and binding-affinity-based (conventional) scoring functions in identifying binding poses that are closest to the native conformations observed in four protein families from PDBbind 2014: HIV protease, carbonic anhydrase (CAH), trypsin, and thrombin. The results show these rates by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2, 3} Å from its respective native pose. Panels on the left show in-sample success rates and the ones on the right show the out-of-sample docking accuracies.
Figure 5.7: Docking accuracy of docking-specific (proposed) and BA-based (conventional) scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes. The docking accuracy is expressed in terms of the S11 success rate obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C = 1 Å from its respective native pose. In panels (a)-(c), a single (native) pose is used per training complex, whereas in panels (d)-(f) 20 randomly-selected poses are used per training complex. Training and test complexes are sampled from the refined set of PDBbind 2007.
Figure 5.8: Docking accuracy of docking-specific (proposed) and BA-based (conventional) scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes. The docking accuracy is expressed in terms of S11 and S21 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose. The docking-specific SF BT-Dock is trained on 100 computer-generated poses for each of its 3000 native protein-ligand complexes. Training and test complexes are sampled from the refined set of PDBbind 2014.
Figure 5.9: Dependence of docking accuracy of docking-specific and binding-affinity-based scoring models on training set size when training complexes are selected randomly (without replacement) from the refined set of PDBbind 2007 and the models are tested on the out-of-sample core (Cr) set. The size of the training data was increased by including more protein-ligand complexes ((a) and (b)) or more computationally-generated poses for all complexes ((c) and (d)).
Figure 5.10: Dependence of docking accuracy of docking-specific and binding-affinity-based scoring models on training set size when training complexes are selected randomly (without replacement) from the refined set of PDBbind 2014 and the models are tested on the out-of-sample core (Cr) set. The docking accuracy is expressed in terms of S11 and S21 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose.
Figure 5.11: Dependence of docking accuracy of the docking-specific model BT-Dock on the number of training poses generated for each native conformation of 3000 protein-ligand complexes selected randomly (without replacement) from the refined set of PDBbind 2014. The three binding-affinity-based SFs BT-Score, RF-Score, and X-Score are only trained on the native conformations. The docking accuracy is expressed in terms of S11 and S21 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose in the core test set.
Figure 5.12: Dependence of docking accuracy of docking-specific (proposed) and binding-affinity-based (conventional) scoring models on the number of features. The features are drawn randomly (without replacement) from a pool of X, A, R, and G-type features and used to characterize the training and out-of-sample core Cr test set complexes of PDBbind 2007. In panels (a) and (b), a single pose (native pose in (a) and randomly-selected pose in (b)) is used per training complex, whereas in panel (c) 50 randomly-selected poses are used per training complex.
Figure 5.13: Dependence of docking accuracy of docking-specific (proposed) and binding-affinity-based (conventional) scoring models on the number of descriptor (feature) sets. The sets are drawn randomly (without replacement) from a pool of 16 feature types in Descriptor Data Bank (DDB) and used to characterize the training and out-of-sample core Cr test set complexes of PDBbind 2014. The docking accuracy is expressed in terms of S11 and S21 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose in the core test set.
Figure 6.1: Constructing training and test data sets of active and inactive protein-ligand complexes using the core sets of PDBbind 2007 and 2014.
Figure 6.2: The screening accuracy of screening-specific (proposed) and binding-affinity-based (conventional) scoring functions when evaluated on test complexes with proteins that are either fully represented (Cr), partially represented (CV), or not represented (LCO) in the SFs' training data. The screening accuracy is expressed in terms of the 1%, 5%, and 10% enrichment factors of SFs in classifying ligands as actives and inactives against a diverse set of proteins from the PDBbind 2014 benchmark.
Figure 6.3: Screening accuracy of screening-specific and top-performing conventional scoring functions in terms of 1%, 5%, and 10% enrichment factors of protein targets sampled from the core test set of PDBbind 2014.
Figure 6.4: Dependence of screening accuracy of screening-specific and binding-affinity-based scoring models on training set size when training complexes are selected randomly (without replacement) from the refined set of PDBbind 2014 and the core set of PDBbind 2007. The screening accuracy is expressed in terms of 1%, 5%, and 10% enrichment factors of the PDBbind 2014 core test set.
Figure 6.5: Dependence of screening accuracy of screening-specific (proposed) and binding-affinity-based (conventional) scoring models on the number of descriptor (feature) sets. The sets are drawn randomly (without replacement) from a pool of 16 feature types in Descriptor Data Bank (DDB) and used to characterize the training and out-of-sample core Cr test set complexes of PDBbind 2014. The screening accuracy is expressed in terms of 1%, 5%, and 10% enrichment factors of protein targets sampled from the core test set.
Figure 7.1: The architecture of the multi-task deep neural network SF MT-Net.

CHAPTER 1

INTRODUCTION

1.1 Background

Protein-ligand binding affinity is the principal determinant of many vital processes, such as cellular signaling, gene regulation, metabolism, and immunity, that depend upon proteins binding to some substrate molecule [1]. Consequently, it is extensively used in molecular docking for rational drug design. Due to prohibitive costs and delays associated with experimental drug discovery, academia and pharmaceutical and biotechnology companies rely on virtual screening using computational molecular docking [2, 3, 4]. Typically, this involves docking tens of thousands to millions of ligand candidates into a target protein receptor's binding site and using a suitable scoring function (SF) to evaluate the binding affinity of each candidate to identify the top candidates and their poses as drug leads and then to perform lead optimization [3]; it is also used for target identification [5]. Besides drug discovery, the bioactive molecules thus identified can be used as chemical probes to investigate the biochemical role of a target of interest [6]. Molecular docking also has applications in many structural bioinformatics problems, such as protein structure [7] and function prediction [8]. It has become attractive because of the ever-increasing number of available receptor protein structures and putative ligand drug candidates in publicly-accessible databases, such as the Protein Data Bank (PDB) [9], PDBbind [10], the Cambridge Structural Database (CSD) [11], and corporate repositories.

Drug design involves two main steps: first, the enzyme, receptor, or other protein responsible for a disease of interest is identified; second, a small molecule or ligand is found or designed that will bind to the target protein, modulate its behavior, and provide therapeutic benefit to the patient. Typically, high-throughput screening (HTS) facilities with automated devices and robots are used to synthesize and screen ligands against a target protein.
However, due to the large number of ligands that need to be screened, HTS is not fast and cost-effective enough as a lead identification method in the initial phases of drug discovery [12]. Therefore, computational methods referred to as virtual screening are employed to complement HTS by narrowing down the number of ligands to be physically screened. In virtual screening, information such as the structure and physicochemical properties of a ligand, protein, or both, is used to estimate binding affinity (or binding free energy), which represents the strength of association between the ligand and its receptor protein at the binding site. The most popular approach to predicting binding affinity (BA) in virtual screening is structure-based, in which physicochemical interactions between a ligand and receptor are deduced from the 3D structures of both molecules. This in silico method is also known as protein-based, as opposed to the alternative approach, ligand-based, in which only ligands that are biochemically similar to the ones known to bind to the target are screened. Since ligand-based screening does not directly take information about the target into account, it may not as easily identify novel chemicals as hits. Therefore, it is the method of choice when the 3D structure of the target is not available. With the unprecedented growth in the number of available 3D structures of protein-ligand complexes (PLCs) in the last decade, interest in the protein-based approach has increased. It is more accurate than the ligand-based approach due to the inclusion of shape and volume information extracted from the protein's 3D structure during the screening process [13, 14]. Figure 1.1 depicts a simplified view of the drug design process that involves both techniques.

Figure 1.1: The drug design process.

The focus of the dissertation will be on protein-based drug design, wherein ligands are placed into the active site of the receptor during computational molecular docking. The 3D structure of a ligand, when bound to a protein, is known as the ligand's active conformation. Binding mode refers to the orientation of a ligand relative to the target and the protein-ligand conformation in the bound state. A binding pose is simply a candidate binding mode. In molecular docking, a large number of binding poses are generated and then evaluated using a scoring function (SF), which is a mathematical or predictive model that produces a score representing the binding stability of the pose. The outcome of the docking run, therefore, is a ligand's top pose ranked according to its predicted binding score, as shown in Figure 1.2. As shown in the figure, the molecule's bioactivity against the target protein is determined in the following screening step. Typically, these docking and screening steps are performed iteratively over a database containing thousands to millions of ligand candidates. After predicting their binding modes and bioactivities, ranking and scoring rounds are usually performed to rank ligands according to their predicted binding free energies.
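To make the workflow of Figure 1.2 concrete, the following minimal Python sketch shows how the docking, screening, scoring, and ranking steps compose. The pose generator and the three scoring functions are passed in as hypothetical callables (they are placeholders, not functions from any particular docking package), and the convention that a higher predicted affinity means a stronger binder is an assumption of the sketch.

```python
def virtual_screen(protein, ligand_db, generate_poses, dock_sf, screen_sf, score_sf,
                   activity_threshold=0.5):
    """Rank the ligands in ligand_db against a single protein target.

    generate_poses, dock_sf, screen_sf, and score_sf are caller-supplied
    (hypothetical) functions: a pose generator plus docking-, screening-,
    and scoring-specific SFs. dock_sf is assumed to return lower = better.
    """
    hits = []
    for ligand in ligand_db:
        poses = generate_poses(protein, ligand)                    # candidate binding poses
        top_pose = min(poses, key=lambda p: dock_sf(protein, p))   # docking: pick the top pose
        if screen_sf(protein, top_pose) >= activity_threshold:     # screening: is the ligand active?
            affinity = score_sf(protein, top_pose)                 # scoring: predict binding affinity
            hits.append((ligand, top_pose, affinity))
    hits.sort(key=lambda h: h[2], reverse=True)                    # ranking: strongest predicted binders first
    return hits
```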
The top-ranked ligand, considered the most promising drug candidate, is synthesized and physically screened using HTS.

Figure 1.2: Protein-ligand docking, screening, scoring, and ranking workflow.

AIDS treatment provides a classic example of the promise of molecular docking. From the identification of HIV protease (receptor protein), an enzyme critical in the HIV virus' life cycle, as a drug target in 1988, and the determination of its X-ray crystallographic structure in 1989, to early 1996, it took less than 8 years to have three FDA-approved protease inhibitor drugs (ligands) on the market [15]. This contrasts sharply with the 10 to 15 years and hundreds of millions of dollars it typically takes using the traditional trial-and-error drug development process. Figure 1.3 shows the cost and time associated with the main stages of drug discovery based on data collected by PAREXEL several years ago [16].

Figure 1.3: Time and cost involved in drug discovery (target identification: 2.5 years, 4%; lead generation and lead optimization: 3 years, 15%; preclinical development: 1 year, 10%; Phase I, II, and III clinical trials: 6 years, 68%; FDA review and approval: 1.5 years, 3%; drug to market: 14 years, $880 M).

1.2 Scoring function requirements

The most important steps in structure-based drug design are identifying the correct conformation of ligands at their respective binding sites, discriminating active ligands from those that do not bind, predicting their binding affinities, and using these scores to rank ligands against each other. These core steps affect the outcome of the entire drug search campaign. That is because the predictions of scoring functions determine which binding orientation/conformation is deemed the best, the ligand's bioactivity, which ligand from a database is considered likely to be the most effective drug, and the estimated absolute binding affinity. Correspondingly, four main capabilities that a reliable scoring function should have are: (i) the ability to identify the correct binding mode of a ligand from among a set of (computationally-generated) poses, (ii) the ability to identify whether a ligand is active or not, (iii) the ability to produce binding scores that are (linearly) correlated to the experimentally-determined binding affinities of protein-ligand complexes (PLCs) with known 3D structures, and, finally, (iv) the ability to correctly rank a given set of ligands with known binding modes when bound to the same protein. These four performance attributes were referred to by Cheng et al. as docking power, screening power, scoring power, and ranking power, respectively [17]. We refer to the corresponding problems as the ligand docking, screening, scoring, and ranking problems, and to the SFs used to solve them as ligand docking, screening, scoring, and ranking SFs, respectively. We also refer to the corresponding proposed models as docking-specific, screening-specific, scoring-specific, and ranking-specific SFs, respectively.
In practice and in all existing work, a single SF is trained to predict protein-ligand BA and then used in all of the ligand docking, screening, and ranking stages to identify the top pose, the bioactivity, and the top ligand, respectively. Other requirements of SFs are that they should be fast enough for efficient virtual screening and be interpretable. It should be noted that the time taken to train SFs is not critical for high-throughput applications since training is performed offline. Their prediction speed, however, is relevant for fast virtual screening applications. Therefore, computationally-intensive approaches such as molecular dynamics and Monte Carlo simulations are not suitable for virtual screening [17]. The task-specific ensemble machine-learning SFs we propose are very computationally efficient to apply, easily parallelizable on multi-core processors, and also have excellent descriptive and interpretive power [18].

1.3 Motivation: Limitations of existing scoring functions for accurate ligand docking, screening, scoring, and ranking

Traditionally, the ligand docking, screening, scoring, and ranking steps depicted in Figure 1.2 are performed using the same SF built into a molecular modeling software package such as GOLD [19, 20] or Discovery Studio [21]. Such functions predict BA that is in turn used to rank-order hundreds of generated poses (docking) for thousands to millions of different ligands (ranking) for a certain target protein. We have collected the scoring, ranking, docking, and screening performance of seventeen conventional SFs used in academic and commercial molecular docking packages. This set of SFs was evaluated on one of the most well-regarded benchmark datasets, viz. the PDBbind core test set Cr [10], which is derived from the PDB and comprises 65 three-complex clusters corresponding to diverse protein families and drug-like molecules. Table 1.1 lists the scoring performance of the SFs in terms of the Pearson correlation coefficient between the predicted and true values of BA. Their standard deviation (SD) values from experimental BA are listed in the second column. In the next columns, we show ranking power in terms of the R1-2-3, R1-3-2, and R1-X-X success rates. The docking power statistics are reported in the subsequent three columns in terms of the S11, S21, and S31 success rates. The last three columns list the screening power in terms of the top 1%, 5%, and 10% enrichment factors.
These scoring, ranking, docking, and screening statistics, as well as details of the SFs, protein-ligand complexes, and features used, are discussed in more detail in the following chapters, where we will see that the higher these values are (except SD), the more accurate an SF is in terms of scoring, ranking, docking, and screening performance.

Table 1.1: Performance of conventional scoring functions in scoring, ranking, docking, and screening the ligands of 65 different protein families

                      Scoring Power    Ranking Power (%)        Docking Power (%)     Screening Power (E.F.)
Scoring Function       Rp     SD     R1-2-3  R1-3-2  R1-X-X    S11    S21    S31     top 1%  top 5%  top 10%
X-Score::HMScore      0.644   1.83    54.7    15.6    70.3     51.3   68.5   78.0      5.1     2.3     1.5
DrugScoreCSD          0.569   1.96    51.6    21.9    73.4     62.8   74.3   81.4      2.6     1.4     1.6
SYBYL::ChemScore      0.555   1.98    46.9    25.0    71.9     40.5   60.1   71.4      5.3     2.4     2.2
DS::PLP1              0.545   2.00    54.7    15.6    70.3     64.9   75.3   84.2      6.9     4.3     3.0
GOLD::ASP             0.534   2.02    43.8    21.9    65.6     69.5   82.4   88.6     12.4     6.2     3.8
AffiScore             0.521   2.06    38.5    12.3    50.8     31.8   46.2   52.3      3.6     2.0     1.9
SYBYL::G-Score        0.492   2.08    46.9    25.0    71.9     25.2   41.5   56.3      1.9     1.3     1.4
DS::LUDI3             0.487   2.09    45.3    21.9    67.2     41.6   57.3   67.2     12.5     4.3     2.8
DS::LigScore2         0.464   2.12    35.9    25.0    60.9     53.9   71.5   80.4     15.9     6.2     4.1
GlideScore::XP        0.457   2.14    34.4    29.7    64.1     54.6   73.2   84.7     19.5     6.3     4.1
DS::PMF               0.445   2.14    40.6    18.8    59.4     32.4   43.7   53.4      4.9     2.2     2.6
GOLD::ChemScore       0.441   2.15    35.9    21.9    57.8     54.6   70.5   79.2     18.9     6.8     4.1
SYBYL::D-Score        0.392   2.19    46.9    20.3    67.2     15.3   30.8   47.6      2.3     1.8     1.6
DS::Jain              0.316   2.24    42.2    31.3    73.4     25.7   44.8   64.6      5.9     2.5     1.8
GOLD::GoldScore       0.295   2.29    23.4    26.6    50.0     51.7   69.0   80.9      7.8     4.5     3.2
SYBYL::PMF-Score      0.268   2.29    39.1    21.9    60.9     37.2   48.1   56.8      5.4     2.2     1.9
SYBYL::F-Score        0.216   2.35    29.7    25.0    54.7     54.6   64.4   73.8      1.9     1.3     1.4

In the original table, the best value in each of the four performance metrics (scoring power Rp, ranking power R1-2-3, docking power S11 %, and screening power E.F. 1%) is highlighted in bold. Scoring functions are evaluated on the core set of PDBbind 2007 to estimate their scoring, ranking, and docking performance. The screening accuracy is obtained by testing scoring functions on the core set of PDBbind 2014. More details about these two benchmarks are provided in Section 2.1.

The numbers indicate that the predictions of almost all SFs have mediocre to weak correlation with true binding affinity values. This trend translates to poor scoring power for all SFs except X-Score, whose predictions seem to show moderate linear correlation with experimental values. However, due to uncertainties associated with BA prediction, a relatively higher scoring power of an SF at this correlation level (0.644) does not necessarily lead to corresponding improvements in ranking, docking, and screening performance. This is evident from Table 1.1. The top performing SF, X-Score, has a relatively better BA prediction accuracy (Rp = 0.644), but fails to identify poses that are within 1 Å of the native one for almost 49% of the proteins. Its ranking performance is also low: only 55% of protein families had their ligands ranked correctly. A similarly mediocre accuracy was also obtained for X-Score when its screening performance was evaluated. The table shows that this empirical SF achieved an enrichment factor of 5.1, whereas an ideal SF could achieve an enrichment factor of up to 66 on the database on which we tested it.
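The numbers in Table 1.1 follow standard definitions: Pearson correlation for scoring power, RMSD-within-cutoff success rates for docking power, and enrichment factors for screening power. The short NumPy sketch below illustrates these definitions; the variable names are illustrative and not taken from the dissertation's own code.

```python
import numpy as np

def scoring_power(predicted_ba, measured_ba):
    """Pearson correlation coefficient Rp between predicted and measured affinities."""
    return float(np.corrcoef(predicted_ba, measured_ba)[0, 1])

def docking_success_rate(top_pose_rmsd, cutoff=1.0):
    """Fraction of targets whose top-scored pose lies within `cutoff` Å RMSD of
    the native pose (e.g., the S11 column uses a 1 Å cutoff)."""
    rmsd = np.asarray(top_pose_rmsd, dtype=float)
    return float(np.mean(rmsd <= cutoff))

def enrichment_factor(scores, is_active, top_fraction=0.01):
    """EF at the top x%: hit rate among the top-scored x% of the ligand database
    divided by the hit rate of the whole database."""
    scores = np.asarray(scores, dtype=float)
    is_active = np.asarray(is_active, dtype=bool)
    order = np.argsort(scores)[::-1]               # assumes higher score = more likely active
    n_top = max(1, int(round(top_fraction * len(scores))))
    hit_rate_top = is_active[order][:n_top].mean()
    hit_rate_all = is_active.mean()
    return float(hit_rate_top / hit_rate_all)
```

Since the hit rate in the top fraction can be at most 1, the enrichment factor is bounded above by 1 / hit_rate_all for a given database, which is why a finite maximum enrichment (roughly 66 here) can be quoted for an ideal SF on this particular benchmark.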
The best performing scoring function in correctly classifying ligands as actives or inactives, GlideScore::XP, is not even among the top ten in scoring performance. Similarly, the most accurate SF in terms of docking power, GOLD::ASP, comes only fifth in scoring and eighth in ranking performance. SF design for solving the ligand scoring, ranking, docking, and screening problems has been investigated for over two decades now [17], but according to the 2007 PDBbind benchmark, the best conventional SF technique has a scoring power of only 0.644 (in terms of Rp), whereas our proposed boosted NN SF attains a much higher scoring power of 0.816, which is a significant result as we will see in Chapter 4. Furthermore, our top ML SF has the best ranking power of R1-2-3 = 64.3 vs. 54.7 for the best conventional SF. In terms of docking and screening power, however, the proposed scoring ML SFs lag behind conventional ones. Thus, as we will see later, using BA-based ML SFs has clearly worked well for scoring in both relative (to conventional SFs) and absolute terms, has worked in relative but not absolute terms in the case of ranking power, and has worked in neither absolute nor relative terms in the case of docking or screening power. Therefore, ligand scoring SF design that is BA-based is clearly not sufficient to solve the ligand docking and screening problems. Note that in structure-based drug design (SBDD) all four of the problems, ligand scoring, docking, screening, and ranking, are critical to solve well. Solving ligand scoring and ranking well, but not the docking and screening problems, will lead to ineffective virtual screening; this is what we noticed from the results above. Solving the ligand docking, screening, and ranking problems well will lead to effective relative screening of ligands, but if ligand scoring is not done properly, there will be uncertainty in the binding affinity (and hence stability) of the identified ligand(s). The proposed research seeks to address the ligand docking, screening, scoring, and ranking problems using completely novel approaches whose performance on multiple tests is promising and shows potential for transformative impact.

1.4 Key contributions

To address the ligand docking, screening, scoring, and ranking problems, we propose several innovative and potentially transformative concepts to advance the state-of-the-art in SF design in the following directions:

• Descriptor Data Bank (DDB): A platform for multi-perspective modeling of protein-ligand interactions. Protein-ligand (PL) interactions are essential biochemical processes for all living organisms. They play a key role in molecular recognition, molecular binding, signal transmission, and cell metabolism, to name a few. Accurate modeling of such interactions remains an unsolved challenge despite advances in molecular biology and the growing number of solved protein-ligand structures. Decades of effort have been devoted to understanding these intermolecular forces in the context of protein-based drug design. Such efforts have resulted in a large number of hypotheses and perspectives of interaction factors scattered in the literature that would have substantially improved the binding affinity prediction accuracy of protein-ligand complexes had they been utilized collectively.
Typically, only a small subset of hypotheses for a few interactions is utilized as descriptors in a scoring function (SF), without making use of existing solutions that may complement the proposed perspectives of the modeled intermolecular forces. In this work, we present Descriptor Data Bank (DDB), an online platform for facilitating multi-perspective modeling of PL interactions. DDB is an open-access hub for depositing, hosting, executing, and sharing descriptor extraction tools and data for a large number of interaction modeling hypotheses. The platform also implements a machine-learning (ML) toolbox for automatic descriptor filtering & analysis, and SF fitting & prediction. To enable local access to many of DDB's utilities, a command-line-based program and a PyMOL plug-in have also been developed and can be freely downloaded for offline and standalone multi-perspective modeling. We seed DDB with 16 diverse descriptor extraction tools developed in-house and collected from the literature. The tools combined generate over 2700 descriptors that characterize (i) proteins, (ii) ligands, and (iii) protein-ligand complexes. The in-house descriptors we extract are protein-specific and are based on pair-wise primary and tertiary alignment of protein structures followed by clustering and triangulation. We built and used DDB's ML library to fit SFs to the in-house descriptors and those collected from the literature. We then evaluated them on several data sets that were constructed to reflect real-world drug screening scenarios. We found that multi-perspective SFs that were constructed using a large and diverse number of DDB interaction models (descriptors) outperformed their single-perspective counterparts in all evaluation scenarios, with an average improvement of more than 15%. We also found that our proposed target-specific descriptors improve upon the accuracy of SFs that were trained without them. In addition, DDB's filtering module was able to exclude noisy and irrelevant descriptors when artificial noise was added as new features. We also observed that the tree-based ensemble ML SFs implemented in DDB are robust even in the presence of noisy and decoy descriptors. In Chapter 3, we describe DDB in more detail, and in the following chapters we provide results that demonstrate the excellent utility of multi-perspective interaction modeling.

• Task-specific scoring functions: Conventional empirical SFs rest on the hypothesis that a linear regression model is capable of computing binding affinity from a set of molecular descriptors. Additionally, these BA-based regression SFs can be used for ligand scoring (with a lot of room for improvement), but they are not suited for ligand docking, screening, and ranking (as we saw earlier in Sec. 1.3). To address this, we propose three task-specific approaches to SF design: (1) nonparametric models capable of learning the underlying unknown function of protein-ligand binding from the data itself with minimal assumptions about the data distribution. We will show in Chapter 4 the excellent accuracy of such models in the ligand scoring and ranking problems. (2) A docking SF that performs RMSD-based regression, where RMSD captures how close a given ligand binding pose is to the native pose, and enables us to leverage 3D structure data about all poses, native and non-native, instead of just the native pose as in the case of BA-based docking SFs. It also does not rely on BA values, which have inherent experimental measurement noise.
Pose prediction results provided in Chapter 5 show how this makes a significant difference in docking performance. A further enhancement considers both RMSD and BA in SF design. (3) A screening SF that performs classification to discriminate between active and inactive ligands instead of the BA-based regression used in conventional SFs, which makes use of data only about active ligands. This family of screening SFs is presented in Chapter 6 along with preliminary results showing excellent accuracy in enriching databases of large numbers of ligands for a diverse set of protein families.

• Multi-task scoring functions: In addition to the three task-specific models, we also propose a novel multi-task scoring function for simultaneously predicting a ligand's binding pose, its activity classification label (active/inactive), and its binding affinity with a target protein. The scoring function, MT-Net, is based on a wide and deep multi-task neural network with (i) general hidden layers shared among the three tasks to learn higher-level physicochemical features important for the three tasks, (ii) three specialized sets of hidden layers to learn features for each task separately and simultaneously in parallel, and (iii) three corresponding output layers for binding mode prediction (docking), ligand activity prediction (screening), and binding affinity prediction (scoring). When fitted to our current training sets, MT-Net performed equally well compared to single-task neural network SFs on out-of-sample test sets carefully designed to evaluate screening, docking, and scoring accuracy. Due to their high capacity, we anticipate that the accuracy of scoring functions based on multi-task deep networks will improve beyond those based on single-task NNs when larger and more diverse datasets are used for training. Currently, we are developing a pipeline for automated data retrieval from biochemical databases to continuously train the network on new protein-ligand complexes and screening assays as they become available. Details about the network's architecture and its training algorithm are provided in Chapter 7.

In Chapter 8, we summarize the findings of our work and discuss our future work that we believe will lead to further improvements in the performance of the proposed scoring functions.

CHAPTER 2
MATERIALS AND METHODS

In this chapter, we discuss the materials and methods used to build and evaluate the scoring functions proposed in the following chapters. First, we will describe their main training and test complexes. Then, we will present the machine-learning methods that we use as the basis for the proposed scoring functions and to reconstruct those published in the literature. We will explain the mathematical models behind these prediction algorithms and the software packages that implement them. Finally, we will discuss several conventional scoring functions that will be compared against the machine-learning models considered in this work.

2.1 Main training and test protein-ligand complexes

We use the molecular database PDBbind [22, 23] for building and testing task-specific scoring functions on protein-ligand complexes characterized by the proposed multi-perspective descriptors from DDB. PDBbind is a selective compilation of the Protein Data Bank (PDB) database [9]. Both databases are publicly accessible and regularly updated. The PDB is periodically mined and only complexes that are suitable for drug discovery are filtered into the PDBbind database.
We used the 2007 version of PDBbind during the early stages of this work to build and evaluate several novel machine-learning scoring functions. PDBbind 2007 contains a refined set of 1300 protein-ligand complexes with high-resolution 3D structures and binding affinity data. The curators of PDBbind partitioned the refined set into training and test sets as described in more detail in Section 2.1.1. We also considered the 2014 release of PDBbind to build more accurate versions of our boosted and bagged neural network scoring functions, which utilize a larger number of descriptors extracted using Descriptor Data Bank (DDB). The refined set of PDBbind 2014 contains 3446 high-quality protein-ligand complexes with experimental binding affinity data collected from the literature. From this set of complexes, we design three training-test sampling strategies to control the overlap in protein family between training and test data sets. The objective of such strategies is to estimate the generalization ability of scoring functions on familiar proteins represented in their training data or on novel targets that were not encountered before.

2.1.1 PDBbind core test set: novel complexes with known protein targets

From the protein-ligand complexes in the refined set of PDBbind (versions 2007 and 2014), the curators of the database built a test set with maximum diversity in terms of protein families and binding affinities. The test set was constructed as follows. (i) The refined set was first clustered based on a 90% primary-structure similarity cut-off of its protein families. (ii) Clusters with four (five for the 2014 version) proteins or more are then examined based on their ligands' binding affinity values. From each of these PLC clusters, three complexes are marked: the PLC whose ligand has the lowest BA, the one with the largest BA, and the one closest to the mean BA of that cluster. (iii) The three PLCs are added to the test set if the marked complexes with the largest and lowest BA values are at least 2 orders of magnitude apart in terms of dissociation ($K_d$) or inhibition ($K_i$) constants. This resulted in the selection of 65 different protein families where each protein family binds to three different ligands with varying BA values to form a set of 195 unique PLCs. This is called the core test set and has been a popular benchmarking test set for many recent docking and scoring systems [17, 24, 25, 26, 27]. Due to its popularity, the 2015 and 2016 core sets have been kept identical to their predecessor in PDBbind 2014 so that results of different studies can be compared. In addition to the core test set of PDBbind 2007, we also consider the 2014/2015/2016 version of the core test set Cr in our evaluation of the proposed SFs. The corresponding training set for Cr, referred to as the primary training set and denoted by Pr, was formed by removing all Cr complexes from the total 1300 and 3446 complexes in the refined sets of PDBbind 2007 and 2014, respectively. As a result, Pr contains (1300 − 195 =) 1105 and (3446 − 195 =) 3251 complexes that are disjoint from Cr complexes. In our experiments on PDBbind 2014 we fixed the base sizes of training and test sets to 3000 and 195 PLCs, respectively. The 3000 PLCs are randomly sampled from the 3251-complex set to help evaluate the variance of ML SFs. In subsequent chapters, we present results for two sets of our task-specific scoring functions. One set of models is trained and evaluated on complexes from the primary and core test sets of PDBbind 2007.
The second set uses the 2014 PDBbind primary training and core test set complexes for construction and validation. This set of SFs is also evaluated based on cross-validation and leave-clusters-out strategies as described in the next two sections. Scoring functions based on PDBbind 2007 were not evaluated using cross-validation and leave-clusters-out since those models were published before we developed these additional benchmarking strategies. The core test set complexes are considered known targets to scoring functions due to the overlap between their training and test proteins. More specifically, for each protein in the 65 protein clusters of Cr, there is at least one identical protein present in the primary Pr training data, albeit bound to a different ligand.

2.1.2 Multi-fold cross-validation: novel complexes with known and novel protein targets

Although the training Pr and test Cr sets are disjoint at the protein-ligand complex level, they overlap at the protein family level. As mentioned above, each three-PLC cluster in Cr was sampled from a larger (≥ 5 members) protein-family cluster in the refined set, and the remaining two or more PLCs are now part of Pr. More stringent test sets were also developed in terms of the protein-family overlap with training sets. One testing technique is based on 10-fold cross-validation (CV), where the refined set of PDBbind 2014 is shuffled randomly and then partitioned into 10 non-overlapping sets of equal size. In each round, one of the ten sets is used for testing and the remaining nine are combined for training. Once training and testing complete for this fold, the same process is repeated for the other nine folds one at a time. In a typical 10-fold CV experiment on a dataset of 3446 PLCs, the training and test sizes of the 10 folds are (3446 × 9/10 ≈) 3101 and (3446 × 1/10 ≈) 345 complexes, respectively. In order to neutralize the effect of dataset size on our results and have sizes similar to Pr and Cr, we sample 3000 from the 3101 training complexes and 195 PLCs from the corresponding 345 complexes in each fold. The performance results on the 10 test folds are then averaged and reported in the Results section. Unlike the overlap in Cr, due to the randomness of CV, not every protein family is necessarily present in both the training and test sets across the ten folds. Therefore, some test proteins are actually novel to the scoring function while others may be present in the training data. If it is present, the protein will be bound to a different ligand in the training set.

2.1.3 Leave clusters out: novel complexes with novel protein targets

Our third approach to constructing the training and test datasets is based on the novel-targets experiment by Ashtawy and Mahapatra [28] and the leave-one-cluster-out (LOCO) approach by Kramer et al. [29]. In this approach, the refined set complexes are partitioned based on the proteins' primary-structure similarity such that the training and test partitions have zero overlap in protein families. More specifically, the datasets were constructed as follows. The 3446 PLCs of the refined set were clustered based on a 90% similarity cut-off of the proteins' primary structures. A total of 1135 clusters were found, of which the largest five are: HIV protease (262 PLCs), carbonic anhydrase (170 PLCs), trypsin (98 PLCs), thrombin (79 PLCs), and factor XA (56 PLCs).
There are 9 clusters of size 30 PLCs or larger, including the aforementioned five, 43 of size 10 or larger, and 1092 clusters of size less than 10 PLCs, the majority of which (644 clusters) consist of a single PLC. We created 20 different training-test dataset pairs based on this distribution of protein families. For the nine most frequent protein families (with ≥ 30 PLCs), we created nine corresponding test sets, each consisting of a single family. For each family-specific test set of these nine, we sampled 3000 PLCs from the refined set, none of which contains the protein family under test. From the less populous clusters (i.e., those with < 30 PLCs), we randomly sampled eleven 195-PLC test datasets. These test sets could contain complexes from (1135 − 9 =) 1126 different protein families, mostly singletons (i.e., protein families that occur in one complex only). For each multi-family test set of these eleven, we again sampled another 3000 random PLCs from the refined set, none of which contains the protein families under test. The justification for selecting these particular numbers of datasets and sizes is two-fold. First, we wanted about half of the mean performance on these test sets to result from the single-family test sets and the other half from the multi-family sets. Second, taking the average of a larger number of test sets with reasonable sizes reduces the possibility of skewing the final performance due to potential outliers that could be obtained on a few small test sets. Test proteins are considered novel here due to the imposed sequence similarity constraint between them and training proteins, which must be less than 90%. Therefore, all test proteins in a test cluster are considered novel for scoring functions fitted to training proteins sampled using this leave-clusters-out strategy.

2.2 Machine learning algorithms

We utilize several regression and classification techniques in this work: multiple linear regression (MLR), multivariate adaptive regression splines (MARS), k-nearest neighbors (kNN), support vector machines (SVM), deep neural networks (DNN), random forests (RF), boosted regression trees (BRT), and extreme gradient boosting trees (XGB). These methods benefit from some form of parameter tuning prior to their use in prediction. We optimized these parameter values by performing a grid search over a suitable range in conjunction with 10-fold cross-validation over complexes independent of the test examples that we use in the following chapters. For every machine-learning method, we will be using these optimal values to build ML SFs in the subsequent chapters. Next, we provide a brief description of the theory of these algorithms.

2.2.1 Multiple linear regression (MLR)

A multiple linear regression (MLR) model is built on the assumption that there exists a linear function that relates a set of explanatory variables X (descriptors characterizing protein-ligand complexes) and the response variable Y. This linear relationship has the form:

$$\hat{Y} = f(X) = \beta_0 + \sum_{j=1}^{P} \beta_j X_j. \qquad (2.1)$$

The coefficients $\beta_0, \ldots, \beta_P$ are estimated from training data by minimizing a loss function, such as the sum of squared errors (referred to as the method of least squares), that measures the overall deviations of fitted values from experimentally-observed ones.
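To make Equation 2.1 concrete, the following is a minimal sketch of fitting such a model with the Scikit-learn "LinearRegression" class used for our PDBbind 2014 experiments; the descriptor matrix, affinity values, and dataset sizes are synthetic placeholders rather than the actual DDB descriptors or PDBbind data.

```python
# Minimal sketch of an MLR scoring model in the sense of Eq. 2.1.
# X, y, N, and P are synthetic placeholders standing in for descriptor
# vectors of protein-ligand complexes and their measured binding affinities.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
N, P = 1105, 86                          # hypothetical training-set and descriptor counts
X = rng.normal(size=(N, P))              # descriptor matrix (one row per complex)
y = X @ rng.normal(size=P) + rng.normal(scale=0.5, size=N)  # synthetic affinities

mlr = LinearRegression().fit(X, y)       # least-squares estimates of beta_0, ..., beta_P
print(mlr.intercept_, mlr.coef_[:5])     # beta_0 and the first few beta_j
y_hat = mlr.predict(X)                   # Eq. 2.1 evaluated for each complex
```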
Despite the fact that MLR models are simple to construct and fast in prediction, they tend to suffer from overfitting when tackling high-dimensional data (i.e., datasets with larger values of P/N, where P is the number of features and N is the number of training records) [30]. Moreover, these models have a relatively large bias error due to their rigidity. For large data sets with few dimensions, MLR models are also prone to underfitting the data. The reason for developing simple linear models in this study is twofold. First, they are useful in estimating the behavior of conventional empirical scoring functions. Second, they serve as baseline benchmarks for the nonlinear regression models we investigate. In our first set of experiments on PDBbind 2007 we use the built-in R language package stats to fit MLR models by invoking the function "lm" implemented therein [31]. For the second set of results collected on PDBbind 2014, we fit MLR models using the class "LinearRegression" implemented in the Python library Scikit-learn [32].

2.2.2 Multivariate adaptive regression splines (MARS)

MARS is an extension of MLR that can model nonlinearity in the data and account for interactions between related features [33]. Nonlinearity is modeled by partitioning the feature space into S regions, each of which is fitted by a nonlinear term known as a basis function. The MARS model is a weighted sum of an intercept (which is the mean of the response values in the training data) and the S basis functions. Every basis function is the product of one or more hinge functions that are generated in pairs from the training data. Each pair is characterized by an input variable $X_j$ and a knot variable $k$. The knot variable acts as a pivot around which the hinge function produces a positive value on one side and zero on the other side. More formally, a hinge function takes the form:

$$h(X_j, k)_+ = \max(0, X_j - k) \quad \text{or} \quad h(X_j, k)_- = \max(0, k - X_j). \qquad (2.2)$$

The $+$ (or $-$) sign in the subscript denotes that the function has the value $X_j - k$ (or $k - X_j$) to the right (or left) of the knot and 0 to its left (or right). The MARS algorithm we use adaptively determines both the number of knots and their values from the training data. It also employs the method of least squares to optimize the weights $\beta_i$ in the final MARS model, which can be written as:

$$\hat{Y} = f(X) = \beta_0 + \sum_{i=1}^{S} \beta_i H_i(X), \qquad (2.3)$$

where $\hat{Y}$ is the estimated biological activity for the complex X and $H_i$ is the i-th basis function, which is a product of up to d hinge functions; the degree parameter d models interaction between up to d features that appear in the basis functions. During the training phase of a MARS model, a greedy search is performed in both forward (terms added) and backward (terms eliminated) passes. In the forward pass, the MARS algorithm first builds a complex model consisting of a large number of terms. To prevent the model from overfitting, this step is followed by a pruning round where several terms are dropped. An iterative backward elimination procedure based on generalized cross-validation (GCV) is typically performed to determine which terms are to be removed. Terms that have the least effect in increasing the model's residual sum of squares are dropped. Additionally, models with a large number of terms and higher degrees are penalized according to a penalty criterion. The penalty criterion and the maximum degree of MARS terms are user-defined parameters.
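To illustrate how the hinge functions of Equation 2.2 and the degree and penalty parameters come together in practice, below is a small, hypothetical sketch using the Py-earth library (the package used for the unreported PDBbind 2014 MARS experiments mentioned next); the data and parameter values are illustrative assumptions only.

```python
# Illustrative MARS fit with py-earth; data and parameter values are assumptions.
# The response is built from hinge terms so the basis functions reported by
# summary() can be compared against the generating max(0, x - k) structure.
import numpy as np
from pyearth import Earth

rng = np.random.RandomState(1)
X = rng.uniform(size=(500, 20))                      # synthetic descriptor matrix
y = (np.maximum(0, X[:, 0] - 0.3)
     - 2.0 * np.maximum(0, 0.6 - X[:, 1])
     + rng.normal(scale=0.1, size=500))              # response made of hinge functions

mars = Earth(max_degree=2, penalty=3.0)              # interaction degree d and GCV penalty
mars.fit(X, y)
print(mars.summary())                                # selected basis functions H_i(X)
y_hat = mars.predict(X)                              # Eq. 2.3 applied to the data
```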
A grid search using 10-fold cross-validation has been conducted to tune these parameters. The R language package earth was used to construct our MARS models on complexes from PDBbind 2007 [34]. To avoid clutter and permit easier interpretation of the results obtained by other scoring functions, we do not report results of MARS models on the 2014 release of PDBbind. However, our experiments show that their performance is better than that of MARS scoring functions based on PDBbind 2007 due to the larger size of training data in PDBbind 2014 and the use of multi-perspective descriptors from DDB. The Python library Py-earth was used to conduct the (unreported) experiments of MARS models on PDBbind 2014 [35].

2.2.3 k-nearest neighbors (kNN)

An intuitive approach to predicting a certain biological property (e.g., binding affinity) of a protein-ligand complex is to search the training database for a similar protein-ligand complex and output the value of its response variable as an estimate for the complex in question. The most similar complex is referred to as the nearest neighbor of the protein-ligand compound whose biological property we are attempting to predict. This simple algorithm is usually augmented to take into account the biological property values of the k nearest neighbors, instead of just the nearest one. In the kNN algorithm we use here, each neighbor contributes to the final estimated property by an amount that is inversely proportional to its Minkowski distance from the target complex [36]. The Minkowski distance between two complexes $X_i$ and $X_j$ is defined as:

$$D(X_i, X_j) = \left( \sum_{p=1}^{P} |x_{i,p} - x_{j,p}|^q \right)^{1/q}. \qquad (2.4)$$

These distances are then normalized and transformed with an arbitrary kernel function $K(\cdot)$ into weights. The final weighted kNN model is formulated as:

$$\hat{Y} = f(X) = \sum_{i=1}^{k} K(D(X, X_i)) \cdot Y_i, \qquad (2.5)$$

where X and $X_i$ are the target and the i-th neighbor complexes, respectively, characterized by P features, and $Y_i$ is the experimental biological property of the i-th neighbor. In this work, we apply the optimal kernel formulated by Samworth in [37] to weight the neighbors according to their Minkowski distances D defined in Equation 2.4. The parameter q in Equation 2.4 determines the distance function to be calculated. Manhattan distance is obtained when q = 1, whereas Euclidean distance is realized when q is set to 2. A grid search using 10-fold cross-validation has been applied to find optimal values of the number of neighbors k and the distance parameter q. The R language package kknn was used to construct our kNN models on complexes from PDBbind 2007 [38]. To avoid clutter and permit easier interpretation of the results obtained by other scoring functions, we do not report results of kNN models on the 2014 release of PDBbind. However, our experiments show that their performance is better than that of kNN scoring functions based on PDBbind 2007 due to the larger size of training data in PDBbind 2014 and the use of multi-perspective descriptors from DDB. In the unreported experiments on PDBbind 2014, we build our kNN regressors and classifiers using the classes "KNeighborsRegressor" and "KNeighborsClassifier" implemented in the Python library Scikit-learn [32].

2.2.4 Support vector machines (SVM)

Support vector machines (SVM) constitute a supervised learning technique developed by Vapnik et al. to solve classification and regression problems [39].
The first step in constructing an SVM model is typically to transform the input data to a higher dimensional space using a nonlinear mapping function known as the kernel function [40]. After mapping to the new feature space, a linear model f(X) is constructed. In the regression formulation, the SVM algorithm induces from the training data an optimal hyperplane, defined by a vector w and a bias term b, that approximates the underlying unknown relationship between a set of physicochemical descriptors (X) and the protein-ligand interaction of interest (Y). The linear model f(X) can be written more formally as:

$$\hat{Y} = f(X) = \langle w, G(X) \rangle + b, \qquad (2.6)$$

where $\langle \cdot, \cdot \rangle$ denotes the dot product between w and the output of the kernel function G(X), which is then added to the model's bias b. The norm of w is minimized to ensure a small w in order for the model to be as "flat" as possible so that it generalizes well to previously unseen data. At the same time, the model should fit the training data well by not allowing its predictions to deviate more than ε from true values. This is achieved by minimizing the ε-insensitive loss function, which is defined as:

$$L_{\varepsilon}(X) = \max(0, |Y - f(X)| - \varepsilon). \qquad (2.7)$$

This loss function defines a region such that data points lying outside of it are penalized. The penalty is introduced by the slack variables $\zeta$ and $\zeta^*$. The SVM training process can therefore be formulated as the following optimization problem:

$$\begin{aligned}
\text{minimize} \quad & \frac{\|w\|^2}{2} + C \sum_{i=1}^{N} (\zeta_i + \zeta_i^*) \\
\text{subject to} \quad & y_i - \langle w, G(X_i) \rangle - b \le \varepsilon + \zeta_i \\
& \langle w, G(X_i) \rangle + b - y_i \le \varepsilon + \zeta_i^* \\
& \zeta_i, \zeta_i^* \ge 0,
\end{aligned} \qquad (2.8)$$

where C > 0 is a user-defined variable that can be employed to regularize against overfitting. The parameter serves as a trade-off between model complexity and its flatness by scaling up or down the penalty imposed on examples whose deviations are larger than ε. Equation 2.8 can be solved more easily by applying quadratic programming to its dual formulation as follows:

$$\max_{\alpha, \alpha^*} W(\alpha, \alpha^*) = \max_{\alpha, \alpha^*} \left[ \sum_{i=1}^{N} \left( \alpha_i^* (y_i - \varepsilon) - \alpha_i (y_i + \varepsilon) \right) - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(X_i, X_j) \right], \qquad (2.9)$$

where $\alpha_i$ and $\alpha_i^*$ are the standard Lagrange multipliers that can be obtained by solving Equation 2.9 under the constraints:

$$\sum_{i=1}^{N} (\alpha_i - \alpha_i^*) = 0 \quad \text{and} \quad \alpha_i, \alpha_i^* \in [0, C]. \qquad (2.10)$$

After obtaining the Lagrange multipliers we can calculate the terms of Equation 2.6 as follows:

$$\langle w, G(X) \rangle = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, K(X, X_i) \quad \text{and} \quad b = -\frac{1}{2} \sum_{i=1}^{N} (\alpha_i - \alpha_i^*) \left( K(X_i, X_r) + K(X_i, X_s) \right), \qquad (2.11)$$

where $K(X, \tilde{X})$ is the kernel function G. We choose K to be the radial basis function (RBF), which is calculated as:

$$K(X, \tilde{X}) = e^{-\frac{\|X - \tilde{X}\|^2}{2\sigma^2}}, \qquad (2.12)$$

where σ is the kernel width. The data points $X_r$ and $X_s$ in Equation 2.11 are any arbitrary support vectors located on both sides of the margin. Support vectors are data instances located outside the ε-insensitive region, and as this region gets wider (i.e., larger ε value), fewer support vectors are obtained and the model becomes flatter. The three parameters of the SVM model, namely C, ε, and σ, are optimized via a grid search using 10-fold cross-validation. The R language package e1071 was used to construct our SVM models on complexes from PDBbind 2007 [41]. To avoid clutter and permit easier interpretation of the results obtained by other scoring functions, we do not report results of SVM models on the 2014 release of PDBbind.
However, our experiments show that their performance is better than that of SVM scoring functions based on PDBbind 2007 due to the larger size of training data in PDBbind 2014 and the use of multi-perspective descriptors from DDB. In the unreported experiments on PDBbind 2014, we build our SVM regressors and classifiers using the classes "SVR" and "SVC" implemented in the Python library Scikit-learn [32].

2.2.5 Deep Neural Networks (DNN)

Artificial neural networks are computational algorithms inspired by networks of biological neurons to solve prediction problems in computer vision, natural language processing, drug discovery, etc. The neural network we fit here consists of an input layer for the raw descriptors $X_1$, L hidden layers for extracting higher-level representations of the molecular interactions and forces, and an output layer for prediction, as depicted in Figure 2.1. Upon prediction, new representations of the raw descriptors are first extracted from the hidden layers as follows:

$$X_{l+1} = H(W_l X_l + B_l), \quad l = 1, \ldots, L, \qquad (2.13)$$

where $W_l$ and $B_l$ are respectively the weight matrix and bias for the l-th hidden layer, and H is the activation function associated with it, which is selected to be a rectified linear unit (ReLU) in this work.

[Figure 2.1: A deep neural network for predicting the binding affinity of protein-ligand complexes characterized by a set of descriptors.]

The final output of the network is simply the following transformation of the features $X_{L+1}$ generated by the L-th hidden layer:

$$\hat{Y} = O(W_o X_{L+1} + B_o), \qquad (2.14)$$

where O is the output function, which is logistic for the screening problem and linear for docking and scoring. The weights of the network were learned from the training data using stochastic gradient descent. In our first set of experiments on PDBbind 2007, we use the R language package nnet to fit boosted and bagged ensembles of one-hidden-layer NN models [42]. For the second set of results collected on PDBbind 2014, we fit deeper and wider ensemble NN models based on the classes "DNNRegressor" and "DNNClassifier" implemented in the Python library TensorFlow [43].

2.2.6 Random forests (RF)

A Random Forests model consists of an ensemble of prediction trees that are typically on the order of several hundred to a few thousand [44]. Each tree is fitted to bootstrap data consisting of N samples chosen randomly with replacement from N training samples. When a tree is constructed, at each node a binary split is made on the "best" descriptor out of mtry features randomly sampled from the P-dimensional feature space (mtry ≤ P). Each tree is grown independently to its full size without pruning. When the task is to score a new protein-ligand complex, the output is computed as the average of the predictions of the comprising individual trees. This mechanism can be formally expressed as:

$$\hat{Y} = F(X) = \frac{1}{T} \sum_{t=1}^{T} f_t(X) = \frac{1}{T} \sum_{t=1}^{T} \hat{Y}^{(t)}, \qquad (2.15)$$

where X is a protein-ligand complex whose biological property needs to be predicted, T is the number of generated trees, and $f_t$ is the t-th tree in the ensemble. Upon building an RF model, the parameter that typically requires tuning is mtry. We choose the optimal value that minimizes the out-of-bag mean squared error (OOBMSE). Out-of-bag (OOB) refers to complexes that are not sampled from the original training set when bootstrap sets are drawn.
On average, about 34% of the training set is left out in each tree-generating iteration. These out-of-sample examples are typically utilized to tune mtry and estimate the generalization error of the RF model as follows:

$$\text{OOBMSE} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - \bar{\hat{y}}_i^{\,\text{OOB}} \right)^2, \qquad (2.16)$$

where $\bar{\hat{y}}_i^{\,\text{OOB}}$ is the average of the predictions for the i-th complex over all the trees in which it is out of bag. In our first set of experiments on PDBbind 2007, we use the R language package randomForest to fit Random Forest scoring functions [45]. For the second set of results collected on PDBbind 2014, we fit Random Forest regressors and classifiers using the classes "RandomForestRegressor" and "RandomForestClassifier" implemented in the Python library Scikit-learn [32].

2.2.7 Boosted regression trees (BRT)

BRT is an ensemble machine-learning technique based on a stage-wise fitting of regression trees. As the name implies, the technique attempts to minimize the overall loss by boosting the examples with the highest prediction errors, i.e., by fitting regression trees to residuals resulting from the existing ensemble trees. There are several different implementations of the boosting concept in the literature. The differences mainly arise from the employed loss function and the treatment of the most erroneous predictions. In this work, we employ the stochastic gradient boosting strategy proposed by Friedman [46] that builds on an earlier technique known as AdaBoost developed by Freund et al. [47]. The employed BRT approach builds a stage-wise model as listed in Algorithm 1 below.

Algorithm 1 BRT algorithm
1: Input: training data $D = \{X_i, y_i\}_{i=1}^{N}$
2: Start with an initial guess (e.g., the mean of y): $F(X) = f_0(X) = \arg\min_{\theta} \sum_{i=1}^{N} \psi(y_i, \theta)$
3: for t = 1 to T do
4:   Sample D without replacement (SWOR): $\{X_i, y_i\}_{i=1}^{\tilde{N}^{(t)}} = \mathrm{SWOR}(D, 50\%)$ (i.e., $\tilde{N} = N/2$)
5:   Calculate current residuals: $\{\hat{y}_i^{(t)}\}_{i=1}^{\tilde{N}^{(t)}} = \{y_i - F_{t-1}(X_i^{(t)})\}_{i=1}^{\tilde{N}^{(t)}}$
6:   Construct a new training set: $D^{(t)} = \{X_i^{(t)}, \hat{y}_i^{(t)}\}_{i=1}^{\tilde{N}^{(t)}}$
7:   Fit an S-terminal-node tree model $f_t$ to $D^{(t)}$: $f_t = \{R_s^{(t)}\}_{s=1}^{S}$
8:   Find the optimal values $\{\theta_s^{(t)}\}_{s=1}^{S}$ of the S regions of $f_t$ ($\{R_s^{(t)}\}_{s=1}^{S}$) that minimize the loss function ψ: $\theta_s^{(t)} = \arg\min_{\theta} \sum_{X_i^{(t)} \in R_s^{(t)}} \psi\big(y_i^{(t)}, F_{t-1}(X_i^{(t)}) + f_t(X_i^{(t)}, \theta)\big)$
9:   Add the tree $f_t$ to the model F(X): $F_t(X) = F_{t-1}(X) + \lambda f_t(X, \theta^{(t)})$
10: end for

Step 2 of Algorithm 1 shows that the first term in the model is a simple learner that minimizes the overall loss function ψ. Given that the loss criterion ψ we use here is a least-squares function, the outcome of this step is basically the mean value of the biological activities of interest ($y_i$) of the complexes in the training set D. In each subsequent stage t, a tree $f_t$ is fitted to the residuals $\hat{Y}^{(t)}$ obtained by applying the model $F_{t-1}(X)$ to a random subset of the training data $D^{(t)}$ (Steps 4-7). Split variables and values of the internal nodes of the tree $f_t$ are decided from the current residuals $\hat{Y}^{(t)}$. In contrast, the constant values $\theta^{(t)}$ at the S disjoint terminal regions $R^{(t)}$ (i.e., leaves) are optimized so that they minimize the loss of the entire model. Generation of trees continues as long as their number does not exceed a predefined limit T. At the end of each tree-generating iteration, the BRT model is updated according to the formula in Step 9. A tree joins the ensemble as a shrunk version of itself. The variable that controls its contribution to the final model is known as the shrinkage or learning rate λ.
The smaller the value of the learning rate, the smoother the building process of a BRT model. It is safer to move slowly down the gradient given that boosting is a form of functional gradient descent. However, the reduction in the learning rate should be compensated for by increasing the number of fitted trees by almost the same factor to achieve a comparable level of performance. In our experiments, we fixed the shrinkage at 0.005 and determined the corresponding optimal number of trees using cross-validation. The R language package gbm was used to construct our BRT models on complexes from PDBbind 2007 [48]. Since we primarily use the Python language to conduct our experiments on PDBbind 2014 datasets, a choice had to be made between the Python libraries Scikit-learn and XGBoost to fit BRT models. Scikit-learn provides the classes ensemble.GradientBoostingRegressor and ensemble.GradientBoostingClassifier, while XGBoost implements XGBRegressor and XGBClassifier to fit boosted regression and classification trees. The accuracies of models trained using the Scikit-learn and XGBoost algorithms are more or less identical according to several experiments we conducted on the 2014 version of PDBbind. However, the run time of our experiments is noticeably shorter using XGBoost's training algorithms. Therefore, we choose it over Scikit-learn's API to build several scoring functions in the following chapters. XGBoost builds a faster variant of boosted decision trees based on a more recent training algorithm developed by Chen et al. [49]. More details about XGBoost are provided in the following section.

2.2.8 Extreme Gradient Boosting (XGB)

In addition to BRT, we fit a faster variant of boosted-tree scoring functions to PDBbind 2014 protein-ligand complexes using the Extreme Gradient Boosting algorithm [49]. Similar to the BRT scoring functions discussed in the previous section, the XGB model consists of an ensemble of T trees whose individual outputs make up the final predicted score $\hat{y}$ according to the equation:

$$\hat{Y} = F(X) = \sum_{t=1}^{T} f_t(X),$$

where $X \in \mathbb{R}^P$ is a protein-ligand complex characterized by P molecular descriptors, and $f_t$ is the t-th tree in the ensemble. The trees in XGB are also grown sequentially one after the other in a greedy fashion. In each tree-fitting step, the training algorithm minimizes the overall loss by boosting the examples with the highest prediction error so far. Each tree in the ensemble is fitted to the residuals made by its predecessor trees. The training algorithm implements a regularized stochastic gradient boosting strategy proposed by Chen et al. [49], similar to BRT's algorithm described in the previous section. More specifically, the t-th tree in the ensemble is constructed to minimize the regularized loss $L^{(t)}$ as follows:

$$L^{(t)} = \sum_{i=1}^{N} \psi\!\left(y_i, \hat{y}_i^{(t-1)} + f_t(X_i)\right) + \Omega(f_t),$$

where ψ is the loss function that accounts for the error between the true output $y_i$ and its estimated value by the current ($f_t$) and past ($\hat{y}_i^{(t-1)}$) trees in the ensemble. The regularization term Ω uses the size of the tree in terms of the number of its nodes, $n^{(t)}$, and the values of its leaf nodes, $\theta^{(t)}$, to increase the loss as follows:

$$\Omega(f_t) = \gamma\, n^{(t)} + 0.5\, \lambda\, \|\theta^{(t)}\|^2.$$

We use the Python library XGBoost with its default parameter values of γ = 0 and λ = 1 to build models for the docking, screening, and scoring tasks.
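As an illustration of this setup, the following is a hypothetical sketch of fitting a BT-Score-style regressor with XGBoost's XGBRegressor and Scikit-learn's GridSearchCV; the data, grid values, and fold count are placeholders and not the exact settings used in this work.

```python
# Hypothetical sketch of a BT-Score-style XGBoost regressor. Default
# regularization (gamma=0, reg_lambda=1) is kept while the tree count, depth,
# and subsampling ratios are tuned by grid search. Data and grids are
# placeholders, not the settings actually used in this work.
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 500))         # e.g., a subset of DDB descriptors for training PLCs
y = rng.normal(size=3000)                # e.g., experimental binding affinities

param_grid = {
    "n_estimators": [500, 1000, 2000],   # number of trees T
    "max_depth": [4, 6, 8],              # maximum depth of each tree
    "subsample": [0.5, 0.8],             # subsample ratio of training instances
    "colsample_bytree": [0.5, 0.8],      # subsample ratio of descriptors per tree
}
search = GridSearchCV(XGBRegressor(gamma=0, reg_lambda=1), param_grid, cv=10)
search.fit(X, y)
bt_score_like = search.best_estimator_   # tuned boosted-tree regressor
```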
The remaining parameters, such as the number of trees, their maximum depth, and the subsample ratios of the training instances and features, were optimized using grid search for each of the three tasks. XGBoost-based scoring functions will be trained on their respective task-specific data sets, as will be discussed in more detail in the following chapters. Scoring functions based on the XGBoost algorithm and fitted to PDBbind 2014 complexes will have the prefix BT (for Boosted Trees for regression or classification), as in BT-Score, BT-Dock, and BT-Screen. On the other hand, scoring functions based on boosted trees trained using the R language package gbm will start with the prefix BRT, as in BRT::X or BRT::XAR, which are trained and evaluated on protein-ligand complexes from the 2007 version of PDBbind.

2.3 Conventional scoring functions under assessment

The proposed multi-perspective descriptors in DDB and task-specific scoring functions are assessed on benchmark datasets and tests carefully designed to reflect real-world drug discovery and design scenarios. Some of the benchmark datasets and testing strategies have also been used in the literature to profile the performance of other scoring functions developed in academia or used in commercial software packages. In order to study how the proposed approaches compare with other scoring functions, we focus on recent studies in which several scoring methods have been tested on the PDBbind benchmarks we use [17, 27, 50, 25]. From those studies, we compare a total of 19 popular conventional scoring functions to our proposed ML approaches on the 2007 version of PDBbind. The 19 functions are either used in mainstream commercial docking tools and/or have been developed in academia. Sixteen of the SFs (all except AffiScore, RF-Score, and NNScore) were recently compared against each other in two studies conducted by Cheng et al. [17] and Li et al. [25]. This set, listed in Table 2.1, includes five SFs in the Discovery Studio software version 2.0 [21]: LigScore, PLP, PMF, Jain, and LUDI; five SFs in the SYBYL software (version 7.2) [51]: D-Score, PMF-Score, G-Score, ChemScore, and F-Score; three SFs in the GOLD software version 3.2 [52]: GoldScore, ChemScore, and ASP; and GlideScore [53] in the Schrödinger software version 8.0 [54]. In addition, two standalone SFs developed in academia are also assessed, namely DrugScore [55] and X-Score version 1.2 [56]. In addition to the performance of the 16 scoring functions originally collected in the Cheng et al. [17] and Li et al. [25] studies, we consider three additional scoring functions published in different work. The first is AffiScore, an empirical scoring function used in the academic docking software SLIDE [57]. The other two are the machine-learning scoring functions RF-Score [27] and NNScore [50], based on the ensemble-trees algorithm Random Forests and on neural networks, respectively. We use AffiScore, RF-Score, and NNScore as benchmark methods and as sources of descriptors in our multi-perspective interaction modeling platform DDB and multi-task scoring functions (see Chapter 3 for more details). Since molecular modeling software programs often provide multiple scoring choices, we use notation to identify both the modeling program and the SF used.
Table 2.1: The 19 conventional scoring functions and the molecular docking software in which they are implemented

Scoring function (SF)   Software           Type of SF          Reference
Jain                    Discovery Studio   Empirical           [58]
LigScore                Discovery Studio   Knowledge based     [59]
Ludi                    Discovery Studio   Empirical           [60]
PLP                     Discovery Studio   Empirical           [61]
PMF                     Discovery Studio   Knowledge based     [62]
ChemScore               SYBYL              Empirical           [63]
D-Score                 SYBYL              Force-field based   [64]
G-Score                 SYBYL              Force-field based   [52]
F-Score                 SYBYL              Empirical           [51]
PMF-Score¹              SYBYL              Knowledge based     [62]
ASP                     GOLD               Empirical           [19]
ChemScore²              GOLD               Empirical           [63]
GoldScore³              GOLD               Force-field based   [52]
GlideScore              Glide              Empirical           [53]
AffiScore               SLIDE              Empirical           [57]
DrugScore               Standalone         Knowledge based     [55]
X-Score                 Standalone         Empirical           [56]
NNScore                 Standalone         Machine-learning    [50]
RF-Score                Standalone         Machine-learning    [27]

¹ SYBYL's implementation of PMF
² GOLD's implementation of ChemScore
³ GOLD's implementation of G-Score

For example, when referring to the SF LigScore1 in the Discovery Studio software, the notation DS::LigScore1 is used. Considering all SFs possible from the modeling tools mentioned in this subsection would result in a large number of them. That is because modeling tools typically supply more than one version/option of the same SF. Each such variation could be regarded as a different SF. LigScore, for example, comes in two different flavors (LigScore1 and LigScore2). For brevity, we only report the version and/or option that yields the best performance on the PDBbind benchmark.

CHAPTER 3
DESCRIPTOR DATA BANK (DDB): A PLATFORM FOR MULTI-PERSPECTIVE MODELING OF PROTEIN-LIGAND INTERACTIONS

Non-covalent binding of protein-ligand complexes involves numerous physicochemical and structural factors that bring the molecules together, such as hydrogen bonding, hydrophobic interactions, electrostatic charges, π-effects, van der Waals forces, molecular weight, the 3D structures of the molecules and their sizes, etc. Such factors are too complex to model correctly, and it is difficult to determine their contribution to the binding process, whether they are considered individually or collectively in groups of interacting forces. Accurate modeling of protein-ligand interactions remains an unsolved challenge despite advances in molecular biology and the growing number of solved protein-ligand structures. That is in part due to the limitations of current scoring functions in capturing and modeling protein-ligand interactions. Most of the knowledge-based, force-field, and to some extent empirical scoring functions in use today attempt to reduce a very complex system of intermolecular forces into a small number of terms that are combined in a simple additive manner to calculate the free energy of binding. The empirical approach improves upon its knowledge-based and force-field counterparts by relying on experimental data and linear regression to express binding affinity as a weighted sum of energetic terms. Recent studies have shown that empirical scoring functions have better accuracy on average in predicting binding affinity when compared to knowledge-based and force-field techniques [17, 25]. However, the empirical linear model is still too rigid and simple to reliably model the free energy of such a complex system. The rigidity of these models prevents them from fully taking advantage of new experimental data, whose volume is constantly growing. A new family of scoring functions based on machine learning has been introduced to the field in the past few years to fix the shortcomings of the simple empirical approaches [27, 50, 65, 24].
These scoring functions are typically based on highly non-linear predictive models, such as ensemble trees and neural networks, fitted to protein-ligand complexes with known experimental activity data. The complexes are typically characterized by a larger number of descriptors than those used by the three other categories of scoring functions. For example, the ensemble neural network SFs BgN-Score and BsN-Score (proposed in the following chapters) employ more than 86 descriptors that encompass several types and versions of intermolecular forces between the protein and the ligand [24]. These neural network scoring functions have shown substantial improvement in prediction accuracy when compared to 16 other conventional models. Several other studies have found that this new category of SFs is significantly more accurate than knowledge-based, force-field, and empirical approaches [65, 17, 27, 50]. In this work, we propose a novel approach to further improve the accuracy of machine-learning scoring functions.

3.1 Improving machine-learning scoring functions

Recent machine-learning scoring functions include almost all the energy terms and potentials that knowledge-based (KB), force-field (FF), and empirical (E) SFs use in one way or another. For example, the frequencies of different pairwise atomic types similar to those used in knowledge-based SFs are utilized in ML SFs [27, 50, 65, 24]. Van der Waals interactions, electrostatic contacts, solvation terms, hydrogen bonding, and hydrophobic forces that are employed in force-field and empirical approaches have also been considered in ML scoring functions [28]. In fact, all types and forms of energy terms that have been developed in the literature for empirical SFs can, in principle, be directly included as individual descriptors in ML SFs given enough training data. In this respect, ML SFs can be regarded as a superset of KB, FF, and E SFs. As opposed to the linear regression models used in empirical SFs, ML approaches have a higher capacity for learning and can better handle high-dimensional data due to their superior bias-variance error trade-off. With more interactions modeled and more experimental training data, ML SFs have every potential to make near-perfect predictions. Not only can they predict the bioactivity of interest for the target complex based on similar molecules in their training data, they can also rely on the large and diverse number of specialized descriptors to search for similar atomic environments and contacts encountered during training. Therefore, with large enough data and interaction modeling techniques (descriptors), ML SFs could truly one day become general-purpose SFs valid for almost all protein families and chemical compounds, even ones they have never seen before during training. Other fields such as computer vision and natural language processing have benefited greatly from using machine-learning approaches to perform on par with or surpass human precision on test subjects novel to them [66, 67]. The main reason for the successful application of ML in these and other domains with challenging problems is three-fold: (1) effective ML algorithms with a high capacity for learning, (2) the quantity, quality, and diversity of data on which they are trained, and (3) the availability of high-performance computers. We believe the domain of drug discovery and design has many of the ingredients to be another success story for machine-learning applications.
Many of the necessary building blocks for highly effective machine-learning scoring functions are available today. The field of machine and statistical learning is rapidly growing, with new theories and algorithms that can effectively model regression, classification, ranking, and clustering problems, among others. The computational resources on which these predictive models can be developed and deployed are becoming more powerful and cost-effective. A machine-learning scoring function that required expensive computer infrastructure to design a few years ago can now be built on a multi-processor PC or outsourced to a public computing cloud in a matter of minutes for a small fee. As is the case for other machine-learning models [66, 67], the accuracy of ML SFs has the potential to substantially improve as biochemical data continue to grow in size, quality, and diversity [28]. Thousands of new, high-quality 3D structures of proteins and protein-ligand complexes are resolved and publicly released every year [9]. Millions of annotated drug-like compounds are being added to public databases periodically [68, 69]. Hundreds of millions of assay results and experimental bioactivity data for large numbers of proteins and ligands are also maintained in free databases [70, 71]. Open-access binding affinity databases of protein-ligand complexes for training and benchmarking are also growing in number and size every year [22, 72, 73, 74, 10]. However, despite such advances in data availability, we still do not know how to comprehensively characterize the raw proteins, ligands, and protein-ligand complexes in these databases in terms of descriptors (features) suitable for producing accurate ML SFs. Developers of ML SFs typically build in-house tools to extract descriptors, which is the most time-consuming stage in SF design. In this work, we propose a platform that provides SF designers with the tools necessary to easily and efficiently calculate a large number of diverse descriptors.

3.2 The need for a library of descriptor extraction tools

Almost all databases of small compounds, proteins, or protein-ligand complexes host molecular data in their raw form. These molecules must be characterized comprehensively by informative descriptors before applying ML algorithms. Descriptors are models¹ for different molecular properties and protein-ligand interactions such as hydrogen bonding, hydrophobic forces, steric clashes, electrostatic contacts, etc. Such interactions are very complex to represent correctly and there is no consensus on deriving a standard model for them. As a result, a large number of hypotheses have been proposed in the literature to characterize the same types of intermolecular forces. A hypothesis is basically a single perspective on the interaction it attempts to model as a descriptor. Developers of empirical and recent ML SFs generally rely on their experience, intuition, and analysis of data sets to derive a hypothetical model for each descriptor. Despite the relatively large number of descriptors that could be computed efficiently from proteins, ligands, and protein-ligand complexes, most existing SFs utilize only a very small set of orthogonal descriptors characterizing the various interaction factors.
For example, among popular commercial and academia-developed SFs, D-Score [64] uses 2 descriptors; LigScore [59], PLP [61], and G-Score [51] use 3; LUDI [60] and ChemScore-2 [63] use 4; Jain [58], F-Score [60], and AffiScore [75] use 5; X-Score [56] uses 6; GoldScore [52] uses 7; ChemScore-1 [52] uses 9; and GlideScore [53] uses 10 descriptors. Furthermore, it is uncommon for an empirical SF to include more than one hypothetical model for each type of interaction it attempts to take into account. Also, the types of interactions employed by current scoring functions vary from one model to another. As an example, the empirical SF ChemPLP [76] accounts for the protein's solvation and flexibility characteristics among other factors. The SF X-Score, on the other hand, models neither of these interactions. Instead, it includes a hydrophobicity energy term which is not present in ChemPLP. In addition to employing different interactions, two SFs that account for the same type of interactions may model them differently. As an example, hydrogen bonding and hydrophobic contacts are modeled differently by X-Score [56], AffiScore [75], and RF-Score [27]. The variation in the characterization of the same interactions is usually due to differences in the theoretical model of the interactions, the parameters of the model, the level of detail or granularity of the modeling algorithm, etc. The lack of sufficient understanding of the forces that bring proteins and ligands together is the main reason behind such discrepancies. Although there is an overlap in the types of interactions the features in various SFs attempt to capture, they still represent different ways or perspectives of capturing protein-ligand binding. Current empirical SFs, such as X-Score, ChemPLP, and AffiScore, treat the hypothetical models of interactions as competing perspectives instead of utilizing them collectively as co-operative descriptors. This is evident from the small number of descriptors that is normally considered by traditional SFs. An important reason existing SFs use very few features is to avoid overfitting in the simple, often linear, models employed. Also, the interpretation of the linear SF becomes harder due to high dimensionality and possible multicollinearity as more factors are added, especially when they account for similar interactions. More importantly, implementing interaction terms from the literature is a very challenging task unless the modeling tool or software is easily accessible or the interaction model is simple and well described. Even with the availability of the software, obtaining and installing it could still be a challenge. To solve these problems, in this work we introduce a descriptor-specific platform that enables easy access, execution, and sharing of descriptor extraction tools via simple portals.

¹ Descriptor models should not be confused with the machine-learning models that we use to build scoring functions for prediction.

3.3 Methods

The main design goal of the descriptor library is to facilitate the extraction of high-quality and comprehensive descriptors for the three molecular entities: ligands, proteins, and protein-ligand complexes (PLCs). Based on our experience working with empirical and machine-learning scoring functions, we observed that the vast majority of them include descriptors specific to ligands and/or PLCs. Protein-specific descriptors are under-represented in the feature space and we felt the need to fix this imbalance.
Therefore, in this section, we present our novel approach for deriving similarity descriptors based on the primary and tertiary structures of proteins. We then describe the proposed Descriptor Data Bank (DDB) framework that facilitates integrating descriptors of various types and from various sources for multi-perspective modeling of protein-ligand interactions.

3.3.1 Receptor-specific descriptors based on structure similarity and triangulation

The well-known similar property principle in chemoinformatics states that "similar compounds have similar properties", i.e., molecules or compounds that have similarity in the descriptor space (which may be based on functional groups, structure, etc.) are likely to have similar biological activity [77]. This principle underpins the similarity-based virtual screening approach employed in ligand-based methods, wherein compounds similar to a query compound (a known active ligand) are retrieved from a compound database for enrichment [78]. We propose target-specific descriptors based on the primary and tertiary structure similarities of receptors. Homologous proteins are likely to have similar functions and hence to be triggered by related or the same biological signals. Proteins that are structurally and functionally similar are more likely to be expressed in similar cells. Cellular environmental factors, such as the presence and concentration of different molecules and charged ions, could have a bearing on the binding process. Such factors are not typically measured when the structures of proteins or protein-ligand complexes are solved and their binding affinity determined. We believe extracting protein-similarity-based features could implicitly model environmental factors and intelligently encode the location of the target receptor in the protein family space. We are not aware of any SFs in the literature or commercial software that consider such information about the target explicitly or implicitly. Distantly homologous proteins that have low sequence similarity may have similar tertiary structures and as a result might bind to analogous ligands. To capture such information, we extract two types of structure-specific descriptors: one based on receptor primary structure similarity via triangulation (or REPAST) and a second set based on receptor tertiary structure similarity via triangulation (or RETEST).

[Figure 3.1: The workflow for generating the receptor-based similarity descriptors REPAST and RETEST.]

Figure 3.1 illustrates the strategy we follow to extract REPAST and RETEST descriptors. First, a set of diverse protein families from PDBbind 2007 is prepared for all-against-all pair-wise alignment. We use BLAST [79] for primary sequence alignment and TM-Align [80] to compute the tertiary structural similarity between proteins. The similarity scores associated with the alignment are recorded in a similarity matrix, which is then clustered into a predefined number of groups. We use k-means to group the training proteins into 80 clusters. The centroid proteins of the clusters are chosen as cluster prototypes or representatives. Finally, for a test PLC, the similarity of the protein is determined w.r.t.
the cluster prototypes using BLAST and TM-Align to yield similarity descriptors. The distances (inverses of the BLAST and TM-Align similarity scores) of the target protein from the 80 protein prototypes determine its location in the protein family space. The family of the test protein is therefore determined indirectly via triangulation. Two structurally or functionally homologous proteins are likely to have similar REPAST and/or RETEST descriptor vectors and hence be separated by a small Euclidean distance. Representing large molecules such as proteins as descriptor vectors is a very efficient and useful technique for ML SFs. In addition to characterizing the protein as a whole molecule, this approach can also be applied to generate sub-molecular descriptors. We plan on extending this strategy in the future to also better characterize the sequence of the receptor's binding site and the secondary structure of the domains near the cavity region.

3.3.2 Multi-perspective modeling of protein-ligand interactions

To date, SF design has primarily focused on hypothesis-driven modeling of the binding mechanism using a set of descriptors specific to each SF. Consequently, a wide variety of descriptor sets representing diverse and overlapping hypotheses or perspectives have been developed over time. These perspectives represent valuable knowledge about the binding mechanism that has never been brought together in one place and systematically analyzed. Given the high impact of PLC repositories such as the RCSB PDB [81] and related ones like PDBbind [10] and CSAR [82] on drug discovery research and benchmarking, there is a clear need for such a resource for molecular descriptors representing multiple perspectives. We develop a Descriptor Data Bank (DDB) and resource that will facilitate multi-perspective, data-driven protein-ligand interaction model discovery as depicted in Figure 3.2.

3.3.2.1 Descriptor Data Bank (DDB)

Descriptor Data Bank (DDB) is designed to host and share descriptor calculation tools and data to develop diverse and comprehensive multi-perspective models of protein-ligand interactions. A machine-learning toolbox is also implemented in DDB to provide in-place automatic descriptor filtering and scoring function development. Its multi-modular architecture, illustrated in Figure 3.2, provides a framework for meeting DDB design goals in terms of versatile user interfaces, services, user programs and data, and system platforms. Descriptor Data Bank is accessible online via the web at www.descriptordb.com and can also be installed locally as a standalone PyMOL plug-in and command-line programs. These three separate front-end channels act on user commands and raw data and deliver processed data and results. Data processing, storage, and retrieval take place in the back-end subsystem, where databases of molecules, descriptor extraction programs, descriptors, ML algorithms, and results are maintained.

[Figure 3.2: The system architecture of Descriptor Data Bank.]

DDB's front-end and back-end subsystems communicate through a unified interface with two Python-based programs and system-dependent libraries and tools.
The two Python programs are for (i) executing descriptor extraction tools donated by DDB users via well-defined interfaces and (ii) handling descriptor filtering and SF development requests and results. We discuss the main features of DDB and its design in the following sections.

3.3.2.2 Descriptor tools and data: deposition, extraction, and sharing

DDB's online platform provides an upload portal to enable users to easily deposit their descriptor extraction programs and data. The descriptor extraction programs can be written in any programming language. To ensure consistency and quality, the donated programs and data must adhere to DDB's packaging and naming conventions, which are provided on the website. When a user deposits a descriptor extraction program, it is inspected for security and computational resource requirements. If it passes these tests, the program becomes available for use and download. The descriptor extraction tool will then appear on the descriptor extraction page (see Figure 3.3) as an additional available descriptor set that can be calculated for a given molecular (P, L, or PLC) dataset. Users have the choice to use sample molecule datasets hosted on DDB or to upload their own. Instructions about data formatting, naming conventions, and uploading of molecule datasets can be easily located on the website. Users can also download the deposited descriptor extraction programs, molecule datasets, or raw descriptors via the download page. Figure 3.2 illustrates the flow, storage, and handling of molecule datasets, descriptor extraction programs, and raw descriptors.

Figure 3.3: The web interface for Descriptor Data Bank.

Figure 3.4: The graphical (left) and command-line (right) user interfaces for Descriptor Data Bank.

The standalone PyMOL plug-in and command-line programs are also capable of descriptor extraction, as shown in Figure 3.4. Users can download descriptor extraction programs from DDB's website to extend the functionality of a local DDB installation and keep it up-to-date. The PyMOL plug-in and command-line programs feature, by default, some of the descriptor extraction tools available in DDB. We seed DDB with a rich collection of diverse descriptors and descriptor extraction tools—of all three types: those that depend on the protein only, the ligand only, and the protein-ligand complex—from three different sources that together more fully capture the complex nature of protein-ligand interaction: (a) descriptors used in existing SFs—for some of these (e.g., the AutoDock Vina [83], NNScore [50], and Cyscore [84] descriptors) we have access to descriptor extraction tools, and others we implement ourselves based on descriptions in the literature; (b) the new molecular and (in the future) sub-molecular similarity descriptors such as REPAST and RETEST discussed earlier; and (c) molecular descriptors and physicochemical properties that have been employed in ligand-based screening and QSAR, such as PaDEL [85] and ECFP [86].

3.3.2.3 Descriptor filtering and analysis

Due to its open-access nature, there is a possibility that some of the descriptors hosted on DDB may be noisy or irrelevant for drug design and discovery applications.
Improper modeling of some interactions, bugs in a donated descriptor extraction program, or other artifacts can introduce noise into the computed descriptors. To guard against noisy data and ensure the robustness of DDB's ML SFs, DDB provides an automatic descriptor filtering tool. Users have the option to experiment with DDB's raw descriptors or to filter them before fitting an ML SF. Descriptor filters can also be stored in DDB for later use or shared with other developers to build their SFs. DDB can build ensemble-based SFs that are less prone to overfitting than their empirical counterparts; such SFs are more forgiving when their training complexes are characterized by a large number of descriptors and/or when some of these features are noisy or not very relevant to the property being predicted. However, the predictive performance of ensemble ML SFs might still be enhanced by utilizing only the most relevant and informative descriptors. In addition to the accuracy gain, the other benefit of employing a compact set of features is computational efficiency. Speed is of utmost importance in molecular docking and virtual screening, where interaction modeling is perhaps the most time-consuming phase of the process. Therefore, substantial computational savings could be achieved by using fewer features and ignoring the ones that are slower to extract among redundant descriptors. The resulting savings could be utilized to explore the conformational space more thoroughly during docking or to screen more ligands.

Table 3.1: The 16 descriptor types and scoring functions and the molecular docking software in which they are implemented.

Descriptor set/SF   Symbol(1)  Molecule(s)(2)  Software        Type of SF         Reference
AffiScore           a          PLC             SLIDE           Empirical          [58]
AutoDock Vina       u          PLC             AutoDock Vina   Empirical          [83]
ChemGauss           h          PLC             OpenEye         Gaussian           [87]
Cyscore             c          PLC             Standalone      Empirical          [84]
DPOCKET             f          PLC             FPOCKET         Descriptors        [88]
DSX                 d          Preds.          Standalone      Knowledge-based    [89]
ECFP                e          L               RDKit           Descriptors        [86]
GOLD                g          PLC             GOLD            Descriptors        [20]
LigScore            l          Preds.          IMP             Knowledge-based    [59]
NNScore             n          PLC             Standalone      Machine-learning   [50]
PaDEL               p          L               Standalone      Descriptors        [85]
REPAST              b          P               Standalone      Descriptors        descriptordb.com [90]
RETEST              t          P               Standalone      Descriptors        descriptordb.com [90]
RF-Score            r          PLC             Standalone      Machine-learning   [27]
SMINA               s          PLC             Standalone      Empirical          [91]
X-Score             x          PLC             Standalone      Empirical          [56]

(1) The symbol of the SF or extraction tool is used to identify the source of descriptors in Figure 3.7.
(2) The entity characterized by the descriptors: P for protein, L for ligand, and PLC for protein-ligand complex. When access to the descriptors of an SF is not available, we use its predictions (denoted by Preds. in the table) as a surrogate for them.

The descriptor filtering algorithm implemented in DDB is based on embedded feature subset selection, in which the search for an optimal subset of descriptors is built into the ML algorithm. The algorithm relies on Random Forest's automatic variable importance calculation in conjunction with in-house iterative backward elimination to remove redundant or noisy descriptors and retain only the most relevant descriptor subset for the given prediction task. Our algorithm employs multi-objective fitness functions that allow task-specific descriptor subset selection.
The selected subset of descriptors can be optimized for any or all of the following drug-discovery tasks: identifying the native or near-native binding pose during docking, predicting binding affinity, and enriching compound databases with active compounds. After filtering, DDB outputs results in the form of summary tables and graphs. Such reports shed light on the effect of various interaction forces and other factors on the bioactivity of interest. Descriptor filtering and analysis can also be conducted using the PyMOL plug-in and command-line interface (CLI), as shown in Figure 3.4. We plan to extend the descriptor filtering algorithm so that it selects the descriptors that are faster to calculate when there are equally accurate but redundant features with different computational requirements. Such an improvement will result in the selection of descriptors that are both accurate and fast to calculate.

3.3.2.4 Machine-learning scoring functions in DDB

We provide several machine-learning algorithms in DDB to develop SFs using descriptors hosted on the platform. The available ML algorithms include Extreme Gradient Boosting trees (XGBoost) [49], Gradient Boosting Trees (GBT) [92], Deep Neural Networks (DNN) [24], Random Forests (RF) [45], Support Vector Machines (SVM) [40], and Multivariate Linear Regression (MLR). These algorithms are implemented in the Python libraries XGBoost [49], TensorFlow [43], and Scikit-learn [32]. DDB provides a unified and simple interface to construct ML SFs and hides the details associated with the underlying ML Python libraries. Users can select the desired learning algorithm, the PLC training and test datasets, and the sets of descriptors used to characterize these complexes. Furthermore, users can also choose a pre-calculated descriptor filter to fit the ML algorithm to only the most important descriptors. Upon completion of the ML fitting process, predictions are calculated for the training and test complexes, and summary performance and processing-time reports are produced. The results are saved in DDB for later use and can be downloaded and shared with other DDB users. ML SF fitting and prediction can also be performed locally on the user's PC via the PyMOL plug-in and/or CLI, as shown in Figure 3.4.

3.3.2.5 Multi-perspective descriptors extracted for PDBbind complexes

For each PDBbind protein-ligand complex prepared for the docking, screening, and scoring tasks, we extracted 16 descriptor sets covering a wide range of perspectives and hypothetical interaction models. The total number of descriptors we extracted using DDB is 2714, including 320 receptor-specific descriptors from the proposed REPAST and RETEST techniques, 1765 ligand-specific descriptors using PaDEL and ECFP, and 629 complex-specific interaction descriptors using the remaining 12 SFs listed in Table 3.1. All the scoring, docking, and screening PLC datasets (derived in Chapters 4, 5, and 6) along with the 16 descriptor types are available on DDB's website (www.descriptordb.com) for use and download.
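To make the fit/predict workflow of Section 3.3.2.4 concrete for descriptors like those described above, the following is a minimal sketch of fitting a scoring-task SF to a multi-perspective descriptor matrix with XGBoost. It is not DDB's actual interface; the randomly generated arrays merely stand in for DDB-exported descriptors and measured pKd/pKi labels, and the hyperparameters are illustrative.

# Sketch only: random arrays stand in for DDB descriptors and binding affinities.
import numpy as np
from xgboost import XGBRegressor
from scipy.stats import pearsonr

rng = np.random.default_rng(42)
X_train = rng.normal(size=(3000, 2714))   # training PLCs x multi-perspective descriptors
y_train = rng.uniform(2, 12, size=3000)   # experimental binding affinities (pKd/pKi)
X_test = rng.normal(size=(195, 2714))     # e.g., a core-test-set-sized hold-out
y_test = rng.uniform(2, 12, size=195)

sf = XGBRegressor(n_estimators=500, max_depth=6, learning_rate=0.05,
                  subsample=0.8, colsample_bytree=0.8)
sf.fit(X_train, y_train)                  # fit the scoring-task SF

rp = pearsonr(y_test, sf.predict(X_test))[0]   # scoring power (Rp)
print(f"Pearson correlation on the test set: {rp:.3f}")

The same pattern, with RMSD labels for docking or enrichment-oriented labels for screening, would yield the docking- and screening-specific SFs evaluated in the next section.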
3.4 Results and discussion

3.4.1 Single- vs. multi-perspective modeling of protein-ligand interactions

In this section, we systematically investigate the effect of multi-perspective modeling of protein-ligand interactions on the performance of scoring functions in different applications. We use the training and test datasets described in Section 2.1 for building and evaluating task-specific ML SFs for the scoring (Chapter 4), docking (Chapter 5), and screening (Chapter 6) tasks. Our proposed data-driven and multi-perspective (MP) modeling of PL interactions is compared against the conventional approach of PLC characterization that depends on single-perspective (SP) modeling. For MP modeling, we built two sets of SFs to evaluate the effectiveness of the proposed protein-based descriptors. In the first, we combined all 2714 descriptors in the 16 descriptor sets listed in Table 3.1 for each training dataset to build XGBoost SFs specific to the scoring, docking, and screening tasks. We tested the resulting task-specific SFs on the corresponding test PLCs, which are also characterized by the same 2714 descriptors. The second set of MP SFs is trained using (16 − 2 =) 14 descriptor types, where the 320 protein-based descriptors (REPAST and RETEST) are excluded from the 16 types used to train the full MP version. As for the SP approach, we considered each of the (16 − 4 =) 12 descriptor sets individually to characterize the training PLCs and then built XGBoost SFs for each descriptor set for the three tasks. We excluded the receptor-specific (REPAST & RETEST) and ligand-specific (ECFP & PaDEL) descriptor sets from the SP SFs since they only characterize the protein or the ligand molecule, not the protein-ligand complex as a whole. We then evaluated all SP-based SFs on out-of-sample validation PLCs whose interactions were modeled using the corresponding descriptor set. We selected the descriptor set associated with the best overall scoring, docking, and screening accuracies on the out-of-sample validation complexes to build the single-perspective (SP) SF. Our results indicate that the 218 descriptors used by the neural-network SF NNScore are the best-performing set of features across the three tasks.

Table 3.2: Comparison of the scoring, docking, and screening powers of single- and multi-perspective scoring functions using the core test set (Cr), leave-cluster-out (LCO), and cross-validation (CV).
Rows: Scoring (Rp), Docking (S%), Screening (EF1%); major columns: Best SP (using NNScore descriptors), MP (without REPAST & RETEST), MP (with REPAST & RETEST), each evaluated on Cr, CV, and LCO.
Cr Cr LCO Cr 0.612 75.48 28.64 0.827 0.825 82.05 81.78 34.00 37.33 CV Scoring (Rp) 0.779 0.773 Docking (S%) 69.23 68.86 Screening (EF1%) 28.00 32.92 LCO CV 0.571 0.801 0.815 64.24 80.05 81.80 24.38 33.50 35.82 CV LCO 0.637 77.89 29.79

Table 3.2 shows the scoring, docking, and screening powers of single- and multi-perspective SFs on known targets (using the core test set (Cr) and cross-validation (CV)) and on novel proteins via leave-cluster-out (LCO) validation. The three rows of the table correspond to the three tasks, the three major columns to the different PLC characterization techniques, and the three sub-columns to the different training-test sampling strategies. The SP XGBoost SF that obtained the best overall accuracy on the validation datasets used the 218 NNScore descriptors. The performance statistics in the table show that both sets of MP XGBoost SFs are significantly and consistently better than the SP SF for the three tasks, whether on known (Cr and CV) or novel (LCO) test protein targets. The average improvement of the full MP approach over SP in scoring accuracy (across the Cr, CV, and LCO test sets) is more than 8%, and more than 19% for the screening and docking tasks.
Also, by comparing the summary statistics in the two right-most major columns, we observe that MP SFs trained with REPAST and RETEST descriptors are consistently more accurate than MP SFs trained without them. This highlights the utility of the proposed receptor-based descriptors in improving overall accuracy, especially for the scoring task. It should be noted that all three SP and MP SFs were based on the same ML algorithm (XGBoost) with the same parameters, trained and tested on the same complexes, and evaluated using the same metrics. The only difference between the three sets of SFs is the descriptors they employ to characterize their training and test PLCs; all other factors that may affect the performance results are fixed. Given the substantial improvement we report in Table 3.2 between SP and the two versions of MP, it is evident that multi-perspective modeling of protein-ligand interactions is a very promising technique for enhancing the accuracy of ML SFs. In addition to XGBoost, we observed similar levels of improvement for other ML algorithms such as Random Forests, boosted and bagged ensemble neural networks [24], and Support Vector Machines.

3.4.2 Number of perspectives vs. number of training protein-ligand complexes

Public biochemical databases such as PDB [81], PubChem [71], PDBbind [22], ZINC [68], and others are the successful projects that inspired the creation of DDB. These public databases have had a tremendous impact on advancing virtual screening and other molecular modeling applications. We believe our database will be a very important complement in this space of public biochemical data and will significantly contribute to the advancement of machine-learning SFs. In this section, we quantitatively show how SFs benefit from training on increasing numbers of public complexes characterized by an increasing number of descriptor sets. In other words, we demonstrate how the performance of SFs improves as new PLCs and new perspectives of their interactions are added to public databases over time.

Figure 3.5: The scoring, docking, and screening accuracy of multi-perspective scoring functions trained on (a) varying numbers of known targets and (b) varying numbers of descriptor sets (perspectives).

To mimic the improvement of ML SFs in response to the release of new PLCs, we randomly sample training subsets of increasing size from the 3251-complex primary training set Pr. We randomly select x base PLCs from the 3251 complexes in Pr, where x ∈ {100, 680, 1260, 1840, 2420, 3000}, and use the selected PLCs to train XGBoost SFs for the three tasks. We then test them on the disjoint core set Cr for the scoring, docking, and screening tasks. This process is repeated 100 times to obtain robust average Rp, S11%, and EF1% values, which are plotted in Figure 3.5(a). In all three tasks, we observe that the performance of the XGBoost SFs improves as the size of the training dataset increases.
The slope of the three curves indicates that these ML SFs have further potential for improvement as more training data becomes available in the future.

To investigate the effect of an increasing number of descriptor sets, we randomly select a maximum of 100 combinations of x descriptor sets out of all possible combinations of the 16 types listed in Table 3.1, where x ∈ {1, 4, 7, 10, 13, 16}, and use them to characterize the training-set complexes, which we then use to train XGBoost SFs. These models are subsequently tested on the Cr dataset characterized by the same descriptors. For each number of descriptor sets x, the performance of the 100 SFs is averaged to obtain robust overall Rp, S11%, and EF1% statistics, which are plotted in Figure 3.5(b). Similar to the case of increasing the training data size in terms of new PLCs, the performance of the ML SFs steadily improves with an increasing number of protein-ligand interaction models in all three tasks. It is clear from the two panels of Figure 3.5 that the number of different models of protein-ligand interactions is as important as the number of training complexes in improving the quality of prediction. Based on these results, we believe that the public database of protein-ligand interactions we are proposing in this work will be of high value to ML SFs and just as important to their accuracy as the resources of raw structural and experimental data.

3.4.3 Perspective filtering using automatic feature subset selection

In addition to hosting descriptor extraction tools, DDB also offers a descriptor filtering and analysis toolbox. The toolbox is developed to fulfill the following needs: (i) automatically guard against noisy and irrelevant descriptors, (ii) provide insight about the effect of different intermolecular forces on the bioactivity of interest, and (iii) distill the information contained in all descriptors into a compact subset of accurate features for applications where high throughput is required. Descriptor filtering is essentially a feature subset selection (FSS) problem in machine learning. The naive approach to FSS is to search the feature space for the most relevant descriptors by brute force and drop the remaining, least important subset. Eliminating the least relevant features from a pool of thousands of descriptors using exhaustive enumeration is impractical because of the combinatorial explosion of the number of possible subsets (2^P − 1, where P is the pool size). Numerous alternative approaches have been proposed in the literature that attempt to find an accurate subset of features in a reasonable time. These approaches can be grouped into three major categories. First are filter-based approaches, in which one descriptor is considered at a time and its relevance to the property we wish to predict is scored; the descriptors with the lowest scores (e.g., the lowest correlation with the dependent variable) are filtered out. Although filter techniques are fast, they ignore interactions between descriptors since every feature is evaluated separately. The second category of FSS techniques are wrapper methods, in which a large number of descriptor subsets with varying sizes and combinations are constructed and examined according to some search heuristic. A classification or regression ML model is typically fitted to each descriptor subset to examine its accuracy on an out-of-sample test set.
After evaluating each subset, the wrapper search technique, typically an evolutionary algorithm, explores the descriptor space in the vicinity of the most accurate subsets by generating new combinations of descriptors derived from the best subsets found so far. The search continues until a predefined number of search rounds has been performed or a target fitness value has been reached or has plateaued. In either case, wrapper search algorithms such as Genetic Algorithms (GA) [93] and Particle Swarm Optimization (PSO) [94] require a large number of descriptor subsets and search rounds to arrive at an optimal descriptor subset. Based on our experiments using GA, we found that it is computationally very expensive to reliably search for the best descriptor subset when the number of descriptors exceeds a few hundred. The third FSS technique is based on the embedded paradigm. As the name implies, the search for an optimal subset of descriptors is embedded into, or built into, the ML algorithm itself, and it is therefore more computationally efficient than the wrapper approach. It is also more accurate than filter methods since descriptor interactions are considered by the ML model. Due to these properties, we choose the embedded method as the basis of our FSS algorithm. The algorithm is based on Random Forest's automatic variable importance calculation in conjunction with an in-house approach of iterative backward elimination to remove redundant, noisy, and irrelevant descriptors.

Automatic variable importance calculation is a very useful byproduct of ensemble models. In addition to its use for filtering, it helps identify the most critical interactions that contribute to the predictive accuracy of the model. Random Forest utilizes out-of-bag (OOB) PLCs—complexes that are not sampled from the training set when bootstrap sets are drawn to fit individual trees; on average, about 34% of the training set complexes are left out of each bootstrap sample—to calculate the effect of each descriptor on the predicted bioactivity of interest. This is achieved by monitoring the change in the model's accuracy while permuting (shuffling) input descriptors randomly, one at a time. With each random permutation of a descriptor's values across the OOB complexes (i.e., descriptor noising), the corresponding increase in mean-square error (MSE) on the OOB examples, OOBMSE(permute), is measured. By comparing the intact OOBMSE to the OOBMSE(permute) computed in this way, the average increase in MSE is evaluated. For each tree t in the ensemble model, the standard OOBMSE is determined as follows: OOBMSE^(t) = (1/Ñ) Σ_{i=1}^{Ñ} (y_i − ŷ_i^(t))², where Ñ is the number of OOB examples of tree t. The criterion OOBMSE for the same tree when permuting the input variable X_j is defined as OOBMSE_j^(t) = (1/Ñ) Σ_{i=1}^{Ñ} (y_i − ŷ_{i,j}^(t))². For each descriptor X_j in every tree t in the ensemble, the increase in MSE is calculated as ΔOOBMSE_j^(t) = OOBMSE_j^(t) − OOBMSE^(t). To calculate the importance I_j of descriptor j, the resulting ΔOOBMSE_j^(t) values are averaged and normalized over all the trees in the ensemble according to the formula I_j = μ_{ΔOOBMSE_j} / σ_{ΔOOBMSE_j}. The overall increase in MSE (I_j) for an input descriptor X_j can be interpreted as its influence on predicting the binding affinity or any other dependent variable.

Based on variable importance and backward elimination, we implement the descriptor filtering approach depicted in Algorithm 2. (i) First, the number of descriptor subsets (M) to test is defined by the user. The sizes of the M feature subsets start from the full dimension P and go down to 3 with a fixed step of size ∼(P − 3)/(M − 1).
Here, we arbitrarily choose three to be the lowest number of descriptors to test. (ii) Starting from the largest descriptor subset (whose size is M(1) = P), an RF model is built and its relative variable importance I(1) as well as its out-of-sample loss L1 are recorded. (iii) We then use the variable importance scores to select the M(2) most influential descriptors V(2) and eliminate the remaining, less important M(1) − M(2) features (V(1) − V(2)). Next, we build a new RF SF and record its out-of-sample loss L2 as well as its importance estimates I(2) of the descriptor subset of size M(2). (iv) We repeat this process of model construction, out-of-sample loss calculation, variable importance estimation, and backward elimination until all M descriptor subsets (down to size M(M) = 3) have been tested. Upon completion, the most informative descriptors are the subset associated with the lowest out-of-sample loss. The variable importance values of the best subset are a good estimate of the contribution of different molecular forces to the complex's binding affinity.

Algorithm 2 Descriptor filtering algorithm
1: Input 1: Training data D = {X_i, y_i}, i = 1..N
2: Input 2: The number of descriptor subset sizes to test: M
3: Subset sizes M = Sequence(from = P, down to = 3, decrement by ≈ (P − 3)/(M − 1))
4: Set the initial variable importance values of all P descriptors to some arbitrary value (e.g., 1): I(0) = {I_j = 1 | 1 ≤ j ≤ P}
5: for p = 1 to M do
6:   V(p) = the M(p) descriptors with the largest variable importance values according to I(p−1)
7:   Characterize the PLCs with the descriptor set V(p): D_p = {X_i^{V(p)}, y_i}, i = 1..N
8:   Train model F_p on data D_p: F_p = fit(D_p)
9:   Compute the variable importance of the set V(p): I(p) = Var.Imp.(F_p)
10:  Calculate the loss of model F_p: L_p = OOBMSE(F_p)
11: end for
12: Find the optimal subset of descriptors V(optimal) = V(p*), where p* = argmin_p L_p

Table 3.3: The docking, screening, and scoring accuracies (A) of multi-perspective scoring functions constructed using raw (R) and noisy (R+N) descriptors. The number of descriptors (P) and accuracies (A) are reported when all features are used (Full) and after feature subset selection (FSS).

                     Raw descriptors (R)                          Raw & noise descriptors (R+N)
Task (metric)        P_R^Full   A_R^Full   P_R^FSS   A_R^FSS      P_R+N^Full   A_R+N^Full   P_R+N^FSS   A_R+N^FSS
Scoring (Rp)         2714       0.816      216       0.804        5428         0.772        378         0.823
Docking (S1%)        2714       82.05      270       81.95        5428         79.49        216         81.54
Screening (EF1%)     2714       34.00      216       34.00        5428         32.00        541         33.50
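A compact sketch of Algorithm 2 is given below. It assumes a descriptor matrix X (N PLCs by P descriptors) and a label vector y, and it uses scikit-learn's built-in impurity-based importances as a stand-in for the permutation-based OOB importance described above; the iterative backward-elimination structure driven by variable importance and OOB loss is the same, and the function name filter_descriptors is illustrative.

# Sketch of the backward-elimination descriptor filter (Algorithm 2).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def filter_descriptors(X, y, n_subset_sizes=10, seed=0):
    P = X.shape[1]
    sizes = np.linspace(P, 3, n_subset_sizes).astype(int)   # P down to 3
    keep = np.arange(P)                                      # current candidate subset
    best_loss, best_subset = np.inf, keep
    for m in sizes:
        keep = keep[:m]                                      # top-m from previous ranking
        # enough trees are assumed so that every complex is out-of-bag at least once
        rf = RandomForestRegressor(n_estimators=300, oob_score=True,
                                   random_state=seed, n_jobs=-1)
        rf.fit(X[:, keep], y)
        oob_mse = np.mean((y - rf.oob_prediction_) ** 2)     # OOB loss of this subset
        if oob_mse < best_loss:
            best_loss, best_subset = oob_mse, keep.copy()
        keep = keep[np.argsort(rf.feature_importances_)[::-1]]  # re-rank by importance
    return best_subset, best_loss

The returned best_subset can then be used to re-characterize the PLCs before fitting the final task-specific SF (e.g., with XGBoost, as in Table 3.3).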
We conducted two types of experiments to show the effectiveness of DDB's descriptor filtering capabilities. In the first, we applied the iterative descriptor importance estimation and backward elimination algorithm to DDB's 2714 original (or raw, denoted by R) descriptors for the scoring, docking, and screening tasks. The goodness-of-fit metric for scoring and docking is the MSE between the predicted and true labels of OOB PLCs; the true PLC labels for scoring are the experimental BA values, and for docking they are the RMSD values of the PLC poses. The objective function for the screening task is the average enrichment factor of active ligands on ten out-of-sample test sets. After computing the optimal filtered subset for each task, we used it to build XGBoost SFs whose accuracy statistics are listed in Table 3.3. The second major column of Table 3.3 shows (from left to right) the number (P_R^Full) of the original (R) descriptors, the accuracy (A_R^Full) of XGBoost SFs trained on the raw descriptors, the number (P_R^FSS) of descriptors in the best feature subset, and the accuracy (A_R^FSS) of XGBoost SFs trained on this subset of filtered descriptors. We observe that the accuracy of the SFs did not change significantly (A_R^Full vs. A_R^FSS) despite using only (270/2714 ≈) 10% of the total number of descriptors (P_R^Full vs. P_R^FSS). We further found that the best descriptors selected for each task include subsets from all 16 descriptor types.

Figure 3.6: The scoring, docking, and screening accuracy during the filtering of raw (left) and noisy (right) descriptors, as a function of the size (P) of the most important descriptor subset.

Figure 3.7: The relative influence of the top 15 (out of 2714) descriptors on predicting the ligand's binding affinity (left panel) and poses (right panel) for the core test set complexes. The x-axis shows abbreviated names for the descriptors, in which the first letter denotes the symbolic identifier of the SF or tool generating the descriptor (see Table 3.1), the following ".X#" is the original SF's unique index of the descriptor, and the suffix is the descriptor's short name as produced by the original SF or extraction tool.
Note that the size of the best descriptor subset is comparable to that of NNScore, whose 218 descriptors were found to be the best performing when single perspectives were evaluated individually, as shown in Table 3.2. However, the performance of the two descriptor sets (as listed in Tables 3.2 and 3.3) is quite different, which underscores the effectiveness of multi-perspective modeling of interactions and of the feature filtering algorithm. Figure 3.6(a) shows the search path for the best subset of descriptors for the three prediction tasks. The x-axis of the plot is the descriptor subset size and the y-axis is the accuracy of that subset on out-of-sample PLCs. The tick marks (the small square, circle, and triangle) on the plot correspond to the sizes of the best subsets and their performance. The results in the plot agree with those in the table in that the performance of large descriptor subsets is not very different from that of smaller subsets. The gap in performance for the screening task could be due to the difference between the ML models used for filtering (RF in the plot) and for the final SF construction (XGBoost in the table). The results of the first experiment clearly indicate that valuable computational resources could be saved by extracting only the most important descriptors without affecting performance; this was one of the objectives of implementing descriptor filtering as part of DDB.

The approach of the second experiment is very similar to the first, but the objectives are different. The main objective is to simulate the possibility of depositing noisy descriptors into DDB. Noise could be introduced as a result of improper modeling of some interactions, bugs in a donated descriptor extraction program, or other artifacts. Therefore, in this experiment we evaluate the descriptor filtering algorithm on its ability to remove noise that is added in the form of artificial descriptors. We first added as many as 2714 random noise (N) variables to the 2714 raw (R) descriptors. We then fitted two sets of SFs. The first set of XGBoost SFs was trained directly on the 5428 mixed (R+N) descriptors. For the second, we first applied the descriptor filtering algorithm to the 5428 mixed (R+N) descriptors and then fitted another set of XGBoost SFs on the resulting filtered descriptors. The results of both experiments are shown in the right-most major column of Table 3.3. Although the difference is not very large, we notice a drop in the performance of the SF trained on the noisy descriptors compared to its counterpart constructed on the original features (i.e., A_R^Full vs. A_R+N^Full). This indicates the resilience of XGBoost SFs to such a large amount of noise (100%) in the data. After applying DDB filters, we notice a rebound in the performance of the ML SFs back to their original levels. We examined the final three subsets of descriptors and found that none of the 378, 216, and 541 selected descriptors belong to the noisy set we intentionally added. This clearly shows the accuracy of the embedded FSS algorithm in finding and removing noise.

Drug designers and medicinal chemists are often interested in understanding the influence of various intermolecular forces on the ligand's binding affinity and conformation. DDB's filtering and ML toolbox serves this need through its descriptor importance plots, such as the ones depicted in Figure 3.7.
Out of the 2714 descriptors in DDB, the bar plots illustrate the relative importance of the top 15 descriptors with the largest effect on the predicted binding affinities and poses of the PLCs in the core test set Cr. For binding affinity, RF's variable importance plot suggests that the descriptors with the most predictive power are related to hydrophobic contacts, followed by steric effects, hydrogen bonding, and ligand solvation, whereas descriptors that characterize the ligand's steric effects and clashes with the protein were found to play the biggest role in estimating its binding pose. The descriptor based on NNScore's term for Vina's affinity score is among the top predictors of both binding affinity and conformation. The volume of the pocket calculated by DPOCKET [88] appears to affect binding pose prediction but does not seem to have the same effect on BA estimation. On the other hand, the numbers of carbon-carbon atom pairs between the protein and ligand within 8 and 12 Å are very relevant for scoring but not for docking. The descriptors based on these simple atomic contacts are calculated by the SF RF-Score (descriptor codes with the prefix "r"). The other top-15 scoring and docking descriptors shown in Figure 3.7 are generated by the following diverse set of SFs and characterization tools: ChemGauss (prefix h), AffiScore (a), NNScore (n), PaDEL (p), RETEST (t), DPOCKET (f), AutoDock Vina (U), Cyscore (c), and X-Score (x). In some cases, these tools overlap in features that describe the same interactions but using different modeling approaches. For example, hydrophobic contacts are modeled by RF-Score's CC_8, CC_12, and CC_16 contacts, Cyscore's hydrophobic term, and AffiScore's hydrophobic score. Hydrogen-bonding-related descriptors are also generated by X-Score and ChemGauss using two different perspectives. Although the top 15 features capture a diverse collection of intermolecular forces, some of them characterize the same interaction type, which highlights the benefits of the proposed multi-perspective modeling paradigm.

3.5 Conclusion

In this chapter, we presented Descriptor Data Bank (DDB), an online platform for facilitating multi-perspective modeling of PL interactions. The online platform is an open-access hub for depositing, hosting, executing, and sharing tools and data arising from a diverse set of interaction modeling hypotheses. In addition to the descriptor extraction tools, the platform also implements a machine-learning (ML) toolbox that drives DDB's automatic descriptor filtering and evaluation utilities as well as its scoring function fitting and prediction functionalities. To enable local access to many of DDB's utilities, command-line programs and a PyMOL plug-in have also been developed and can be freely downloaded for offline multi-perspective modeling. We seed DDB with 16 diverse descriptor extraction tools developed in-house and collected from the literature. The tools altogether generate over 2700 descriptors that characterize (i) proteins, (ii) ligands, and (iii) protein-ligand complexes. The in-house descriptors we propose here, REPAST & RETEST, are target-specific and based on pair-wise sequence and structural alignment of proteins followed by target clustering and triangulation. We built and used the fit/predict service in DDB to fit ML SFs to the in-house descriptors and those collected from the literature. We then evaluated them on several datasets that were constructed to reflect real-world drug screening scenarios.
We found that multi-perspective SFs constructed using a large and diverse number of DDB interaction models outperformed their single-perspective counterparts in all evaluation scenarios, with an average improvement of 15%. We also found that our proposed target-specific descriptors improve upon the accuracy of SFs trained without them. In addition, DDB's filtering module was able to exclude noisy and irrelevant descriptors when artificial noise was included as new features. We also observed that the tree-based ensemble ML SFs implemented in DDB are robust even in the presence of noisy and decoy descriptors.

CHAPTER 4
BAGGING AND BOOSTING BASED ENSEMBLE NEURAL NETWORKS SCORING FUNCTIONS FOR ACCURATE BINDING AFFINITY PREDICTION

4.1 Binding affinity

In the context of drug discovery and development, binding affinity is the strength of the non-covalent association that brings a protein and a ligand together; it is typically quantified using the dissociation constant (Kd). Kd is an equilibrium constant that measures the tendency of a complex to separate into its constituent components (protein and ligand). We use it here to describe how tightly proteins bind their binders: complexes whose components are more likely to dissociate (high dissociation constants) are interpreted as loosely bound (low binding affinities), and vice versa. We can write the binding reaction for a protein-ligand complex as Protein-Ligand Complex ↔ Protein + Ligand, and the dissociation constant in terms of molars (mol/L) is given by Kd = [P][L]/[PL], where [P], [L], and [PL] are the equilibrium molar concentrations of the protein, ligand, and protein-ligand complex, respectively. The higher the concentration of the ligand needed to form a stable protein-ligand complex, the lower the binding affinity. Put differently, as the concentration of the ligand increases, the value of Kd also increases, and this corresponds to weaker binding.

In some experimental settings, such as competition binding assays, the inhibitor constant is used as a measure of binding affinity. The inhibitor constant (commonly denoted by Ki) is defined as the concentration of a competing ligand in a competition assay that would occupy 50% of the receptor in the absence of radioligand. It is expressed by the following Cheng-Prusoff equation [95]:

Ki = IC50 / (1 + [L]/Kd),   (4.1)

where IC50 is the half-maximal inhibitory concentration, which measures the effectiveness of the ligand in inhibiting a biochemical or biological function by half, Kd is the dissociation constant of the radioligand, and [L] is the concentration of the ligand L. IC50 can also be thought of as the concentration of the ligand required to displace 50% of the radioligand's binding. The IC50 value is typically determined using a dose-response curve [96], and it should be noted that it is highly dependent on the experimental conditions under which it is measured. To overcome such dependency, the inhibitor constant Ki is formulated so that it has an absolute value; an increase in its value means that a higher concentration of an inhibitor is needed to block a certain cellular activity. It is customary to logarithmically convert the dissociation and inhibitor constants to numbers that are more convenient to interpret and work with. These constants are scaled according to the formulas

pKd = −log(Kd) and pKi = −log(Ki),   (4.2)

where higher values of pKd and pKi reflect exponentially greater binding affinity.
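A small worked example of Equations 4.1 and 4.2 is given below; the concentrations are made up purely for illustration.

# Worked example of the Cheng-Prusoff equation (4.1) and the pK scaling (4.2);
# all concentrations here are illustrative, not experimental values.
import math

IC50 = 50e-9             # half-maximal inhibitory concentration, 50 nM
L = 10e-9                # radioligand concentration, 10 nM
Kd_radioligand = 5e-9    # dissociation constant of the radioligand, 5 nM

Ki = IC50 / (1 + L / Kd_radioligand)               # Eq. 4.1
pKi = -math.log10(Ki)                              # Eq. 4.2
print(f"Ki = {Ki*1e9:.1f} nM, pKi = {pKi:.2f}")    # Ki ~ 16.7 nM, pKi ~ 7.78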
The data we use in our work are restricted to complexes for which either Kd or Ki values are available. Therefore, the regression models we build are fitted with either pKd or pKi as the response variable. The binding affinities of the complexes in the PDBbind database considered in this work range from 2 to 11.96 in terms of pKd and pKi, spanning about 10 orders of magnitude.

4.2 Predicting binding affinity

Unknown binding affinities of protein-ligand complexes are typically estimated by predictive models known as scoring functions. Most existing scoring functions employed in commercial and free molecular docking software fall into one of three main categories: force-field-based [64], empirical [56], or knowledge-based [97] SFs. Many comparative studies have found that these types of SFs are not accurate enough for reliable and successful molecular docking and virtual screening. A recent study examined a total of 16 popular scoring functions in their ability to reproduce the experimental binding affinities of 195 protein-ligand complexes encompassing 65 different protein families [17]. Although these SFs are employed in mainstream commercial and academic molecular docking tools, the best-performing SF achieved only mediocre accuracy of less than 0.65 in terms of Pearson's correlation between its predictions and the measured binding affinities (BAs). These findings are in agreement with earlier work by Wang et al., in which a related benchmark and set of scoring functions were examined [98]. Several of the evaluated SFs were empirical models derived by fitting linear regression equations to training data, but none were based on nonlinear modeling approaches such as artificial neural networks (ANN) or ensembles of decision trees. Artificial neural networks have been used in computational drug development before, but they have mostly been applied to QSAR modeling problems or to predicting the biological activity of ligands (active or not) against a target protein [99, 100, 101]. Their application to predicting binding affinity has been rare and has only been reported in small-scale experiments in which just a handful of protein-ligand complexes were used for training and validation [102, 103, 50]. Neural networks' poor generalization performance when trained on small, high-dimensional datasets is perhaps the main reason for their limited use in scoring protein-ligand complexes in commercial docking tools. In this chapter, we propose novel SFs based on ensembles of neural networks to predict the binding affinity of protein-ligand complexes characterized using a large and diverse number of descriptors. We train and test our models on hundreds of high-quality protein-ligand complexes and compare their accuracies against conventional and state-of-the-art scoring functions. We show that our NN SFs are resilient to overfitting and generalize well even when predicting the BAs of complexes characterized by a large number of descriptors.

4.3 Key contributions

Conventional empirical SFs rest on the hypothesis that a linear regression function is sufficiently capable of modeling protein-ligand binding affinity [63, 56]. Instead of assuming a predetermined theoretical function that models the unknown relationship between different energetic terms and binding affinity, two accurate nonparametric SFs, BgN-Score and BsN-Score, based on ensemble and deep learning approaches, are introduced in this chapter.
We utilize the multi-perspective molecular descriptors proposed in Chapter 3 to build the SFs BgN-Score and BsN-Score by combining a large number of diverse neural networks using the bagging and boosting ensemble techniques, respectively. We show that BgN-Score and BsN-Score have scoring powers of 0.840 and 0.844, respectively, in terms of Pearson's correlation coefficient between predicted and true binding affinities of the PDBbind (version 2014) core test set complexes. This is in comparison to 0.627 for the best conventional SF, X-Score, on the same benchmark test set. In addition to this substantial 34% improvement, the same ensemble NN SFs are also more accurate than SFs based on a single neural network (0.840 and 0.844 vs. 0.803). We also compared our proposed models to an SF based on another popular machine-learning algorithm, Random Forests, and found that our ensemble NN SFs are at least 16% more accurate than RF-Score, an SF based on Random Forests. Scoring functions based on a boosted-trees model fitted using the XGBoost algorithm achieved a correlation of 0.827 with the true binding affinity values. Although NN and decision-tree-based ensemble approaches are competitive with each other, the significance of the NN ensemble SFs introduced in this work is two-fold. First, they represent a way to overcome the overfitting limitations of the single neural network models that have traditionally been used in drug-discovery applications [99, 100, 50]. Second, neural networks have the ability to approximate any underlying function smoothly [104, 105, 106], in contrast to decision trees, which model functions with step changes across decision boundaries [107].

We seek to advance structure-based drug design by designing SFs that significantly improve upon the protein-ligand binding affinity prediction accuracy of conventional SFs. Our approach is to couple the modeling power of ensemble algorithms and deep learning with training datasets comprising hundreds of protein-ligand complexes with known high-resolution 3D crystal structures and experimentally determined binding affinities, and with a variety of features characterizing the complexes. We compare the predictive accuracies of BgN-Score, BsN-Score, a single deep neural network SF (DNN-Score), and other machine-learning approaches, as well as existing conventional SFs of all three types (force-field, empirical, and knowledge-based), on diverse and homogeneous sets of protein families. The remainder of the chapter is organized as follows. We first describe the training and test datasets used to construct and evaluate several neural network and machine-learning scoring functions. The next section covers some of the limitations of artificial neural networks and our approach to tackling them. Then we present our proposed SFs based on these models. Next, we show results comparing the scoring and ranking powers of conventional and proposed SFs on diverse and homogeneous test sets. We also compare their performance on novel drug targets and analyze how they are impacted by the training set size and the number of descriptor sets characterizing protein-ligand complexes. Finally, we summarize our results and conclude the chapter.

4.4 Methodology

4.4.1 Protein-ligand complex datasets and multi-perspective characterization

For most of our experiments in this work, we use two versions of PDBbind to train and test the proposed scoring functions.
We build and evaluate the proposed scoring functions on complexes from PDBbind versions 2007 and 2014, as outlined in Section 2.1. These two versions were used in recent studies to evaluate the performance of several academic and commercial scoring functions [17, 25, 27, 50]. In order to objectively compare the predictive accuracy of the proposed scoring functions to many existing popular SFs, we too use the 2007 and 2014 PDBbind benchmarks. The experiments on the two versions were conducted during different stages of this work, and therefore there are some differences concerning the machine-learning methods (addition of XGBoost) and their libraries (Python & R), the descriptors we extract for each protein-ligand complex, and other variations that will be highlighted in the text.

4.4.1.1 The PDBbind 2007 benchmark

We use the 2007 version of PDBbind to train and test scoring functions based on the machine-learning algorithms NN, RF, BRT, SVM, kNN, MARS, and MLR. For each of the 1300 protein-ligand complexes in PDBbind 2007, we extracted the physicochemical features used in the empirical SFs X-Score [56] (a set of 6 features denoted by X) and AffiScore [108] (a set of 30 features denoted by A), the features calculated by GOLD [52] (a set of 14 features denoted by G), and the geometrical features used in the ML SF RF-Score [27] (a 36-feature set denoted by R). The software packages that calculate the X-Score, AffiScore (from SLIDE), and RF-Score features were available to us in open-source form from their authors, and a full list of these features is provided in the appendix. The GOLD docking suite provides a utility that calculates a set of general descriptors for both molecules as separate entities and in complexed form; the full set of these features can be accessed and calculated via the Descriptors menu in GOLD. By considering all fifteen combinations of these four types of features (i.e., X, A, R, G, X∪A, X∪R, X∪G, A∪R, A∪G, R∪G, X∪A∪R, X∪A∪G, X∪R∪G, A∪R∪G, and X∪A∪R∪G), we generated 15 corresponding versions of each scoring function (a short sketch of this enumeration is given below). Each scoring function is trained and evaluated on complexes characterized by one of the 15 descriptor combinations. We denote these models using the name of the machine-learning algorithm (as a prefix) and the feature combination (as a suffix). The scoring function BsN-Score::XR, for example, denotes the boosted ensemble neural network model fitted to complexes characterized using the descriptor set X∪R (referred to simply as XR).

4.4.1.2 The PDBbind 2014 benchmark

In addition to the 2007 release of PDBbind, we also take advantage of PDBbind 2014, which includes 3446 high-quality complexes—about 160% more than the 1300 PLCs in PDBbind 2007. We use protein-ligand datasets from PDBbind 2014 to train and test machine-learning scoring functions based on NN, RF, XGB, and MLR. All protein-ligand complexes are characterized using the proposed multi-perspective descriptors in Descriptor Data Bank (DDB). The set includes more than 2700 descriptors extracted using 16 different scoring functions and molecular modeling tools, developed in-house and collected from the literature, which are now part of DDB. The descriptor sets comprise the types X, A, R, and G, which were also considered in our experiments on PDBbind 2007, and 12 other types (listed in Table 3.1) extracted for PDBbind 2014 complexes only.
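For reference, the enumeration of the fifteen feature-group combinations and the corresponding scoring-function names used in the PDBbind 2007 experiments above can be sketched as follows; the zero-filled arrays are placeholders for the actual X-Score, AffiScore, RF-Score, and GOLD descriptor blocks.

# Sketch: enumerate the 15 combinations of the X, A, R, and G feature groups
# and the naming scheme of the corresponding SF variants (e.g., BsN-Score::XR).
from itertools import combinations
import numpy as np

groups = {"X": np.zeros((1300, 6)),    # X-Score features (placeholder)
          "A": np.zeros((1300, 30)),   # AffiScore features (placeholder)
          "R": np.zeros((1300, 36)),   # RF-Score features (placeholder)
          "G": np.zeros((1300, 14))}   # GOLD features (placeholder)

for r in range(1, len(groups) + 1):
    for combo in combinations("XARG", r):
        name = "BsN-Score::" + "".join(combo)        # model name with feature suffix
        features = np.hstack([groups[g] for g in combo])
        print(name, features.shape)                  # 15 combinations in total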
We fit only one version of each machine-learning scoring function to all 16 feature types, instead of creating as many versions as there are combinations of the 16 descriptor types. We distinguish these scoring functions by their underlying machine-learning algorithms; for example, BsN-Score refers to the scoring function built using the proposed boosted neural network model and fitted to more than 2700 multi-perspective descriptors from DDB.

4.4.2 Limitations of neural networks and our approach to tackling them

Although multi-layer (or deep) ANN models can theoretically approximate any nonlinear continuous function, their application to drug-discovery-related problems has always been complicated by several challenges. Data arising from bioinformatics and cheminformatics processes are typically high-dimensional. Since ANN models cannot handle a large number of features efficiently when data is scarce, a pre-processing step prior to fitting the data with an ANN model is usually necessary. Feature subset selection using evolutionary algorithms, or dimensionality reduction using, say, principal component analysis (PCA), is commonly applied to overcome this problem. However, valuable experimental information may be discarded when only a small subset of features is selected to build a prediction model. The dimensionality-reduction approach is also complicated by the fact that the underlying data distribution is unknown, which makes the right choice of dimensionality-reduction technique a tricky problem in itself. In addition to these pre-processing issues, training ANN models is also challenging because their weights cannot be guaranteed to converge to optimal values. This causes NN models to suffer from high variance errors, which translate into unreliable SFs.

The aforementioned problems can be avoided, and state-of-the-art performance achieved, by combining the predictions of hundreds of diverse, nonlinear NN models. We propose here ensemble methods based on ANNs. The ensemble itself is trained on all the features, although each network in the ensemble is fitted to only a subset of them. This approach relieves us from carrying out feature subset selection or dimensionality reduction prior to training. In fact, the performance of the ensemble can even be improved by describing the data with more relevant features. Moreover, it is no longer critical to tune the weights of each network in the ensemble to optimal values, as is the case for a single NN model. Suboptimal weight tuning of individual networks can contribute to decreasing the inter-correlation between them, which translates into a more diverse ensemble and therefore a more accurate model [45].

Our proposed NN ensemble models are inspired by the Random Forests [45] and Boosted Regression Trees [92] techniques in the formation of the ensembles. So far, the focus in ensemble learning has been more or less biased towards using decision trees as base learners. Trees are chosen as base learners mainly because of their high flexibility and variance (low stability); high variance decreases the inter-correlation between trees and therefore increases the overall ensemble model's accuracy. Instead of using decision trees as base learners, we employ multi-layered perceptron, or deep, neural networks (DNN). DNNs share several characteristics with prediction trees: they are non-parametric, nonlinear, and have high variance. Moreover, both techniques are very fast in prediction.
Deep neural networks, however, have the ability to model any arbitrary boundary smoothly, while decision trees can only learn rectangular-shaped boundaries. Decision trees are typically pruned after training to avoid overfitting, whereas DNNs use regularization while the network weights are optimized during learning. We next describe our two new ensemble NN models.

4.4.3 BgN-Score: Ensemble neural networks through bagging

Bootstrap aggregation, or bagging for short, is a popular approach to constructing an ensemble learning model. As the name implies, and as indicated in the third step of Algorithm 3, the ensemble is composed of neural networks that are fitted to bootstrap samples of the training data. To further increase the diversity of the ensemble and decrease its training time, the inputs to each network l are a random subset (p_l) of the total P features extracted for every protein-ligand complex (see Step 4). Feature sampling has proven effective in building tree-based ensemble algorithms such as Random Forests [45]. When the task is to predict the binding affinity of a new protein-ligand complex, the output is the average of the predictions of the individual networks, as shown in Algorithm 3 and depicted in Figure 4.1. This mechanism can be formally expressed as

ŷ = f(x^P) = (1/L) Σ_{l=1}^{L} f_l(x^{p_l}) = (1/L) Σ_{l=1}^{L} ŷ_l,   (4.3)

where x^P ∈ ℜ^|P| is a feature vector representing a protein-ligand complex characterized by the feature set P, f(x^P) is the function that maps it to a binding affinity ŷ ∈ ℜ, x^{p_l} ∈ ℜ^|p_l| is the same complex characterized by a random subset p_l of features (|p_l| < |P|), L is the number of networks in the ensemble, and ŷ_l is the prediction of each network l in the ensemble, which is calculated at the output neuron according to Equation 2.14. The resulting bagging-based ensemble SF is referred to as BgN-Score.

Algorithm 3 Algorithm for building BgN-Score: an ensemble NN SF using bagging
1: Input: training data D = {X^P, Y}, where X^P = {x^P_1, ..., x^P_N}, Y = {y_1, ..., y_N}, and N is the number of training complexes.
2: for l = 1 to L do
3:   Draw a bootstrap sample X^P_l from X^P.
4:   Characterize the complexes in the bootstrap sample X^P_l using a random subset p_l of features: X^{p_l}_l.
5:   From Y, draw the measured binding affinities of the complexes in the sample X^P_l: Y_l.
6:   Construct a new training set: D_l = {X^{p_l}_l, Y_l}.
7:   Learn the current binding affinities by training an FFBP NN model f_l on D_l.
8: end for
9: The final prediction for a protein-ligand complex x^P is: ŷ = f(x^P) = (1/L) Σ_{l=1}^{L} f_l(x^{p_l}) = (1/L) Σ_{l=1}^{L} ŷ_l.

Figure 4.1: BgN-Score: ensemble neural network SF using the bagging approach.
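The following is a minimal sketch of the bagging scheme in Algorithm 3, using scikit-learn's MLPRegressor as the base network. BgN-Score itself uses much larger networks trained with TensorFlow, but the bootstrap sampling, feature subsampling, and averaging structure is the same; the function and parameter names here are illustrative only.

# Sketch of a bagged neural-network ensemble (Algorithm 3).
import numpy as np
from sklearn.neural_network import MLPRegressor

def fit_bgn(X, y, n_nets=50, feat_frac=0.3, seed=0):
    rng = np.random.default_rng(seed)
    N, P = X.shape
    ensemble = []
    for _ in range(n_nets):
        rows = rng.integers(0, N, size=N)                            # bootstrap sample
        cols = rng.choice(P, size=max(1, int(feat_frac * P)), replace=False)
        net = MLPRegressor(hidden_layer_sizes=(64, 32), max_iter=300)
        net.fit(X[np.ix_(rows, cols)], y[rows])                      # fit one network
        ensemble.append((net, cols))
    return ensemble

def predict_bgn(ensemble, X):
    # average the member predictions (Equation 4.3)
    return np.mean([net.predict(X[:, cols]) for net, cols in ensemble], axis=0)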
4.4.4 BsN-Score: Ensemble neural networks through boosting

Boosting is an ensemble machine-learning technique based on stage-wise fitting of base learners. The technique attempts to minimize the overall loss by boosting the complexes with the highest prediction errors, i.e., by fitting NNs to the (accumulated) residuals made by the previous networks in the ensemble. There are several different implementations of the boosting concept in the literature; the differences mainly arise from the loss functions employed and the treatment of the most erroneous predictions. Our proposed NN boosting algorithm is a modified version of the boosting strategy developed by Cao et al. [109] and Friedman [92] in that we perform random feature subset sampling. This approach builds a stage-wise model as listed in Algorithm 4 and illustrated in Figure 4.2.

The algorithm starts by fitting the first network to all training complexes. A small fraction ($\nu < 1$) of the first network's predictions is used to calculate the first iteration of residuals $Y^{res}_1$, as shown in Step 3 of Algorithm 4. Step 3 also shows that the network $f_1$ is the first term in the boosting additive model. In each subsequent stage $l$, a network is trained on a bootstrap sample of the training complexes described by a random subset $p_l$ of features (Steps 5 and 6). The values of the dependent variable of the training data for network $l$ are the current residuals corresponding to the sampled protein-ligand complexes. The residuals for a network at each stage are the differences between the previous stage's residuals and a small fraction of its predictions. This fraction is controlled by the shrinkage parameter $\nu < 1$ to avoid overfitting. Network generation continues as long as the number of networks does not exceed a predefined limit $L$. Each network joins the ensemble as a shrunk version of itself. In our experiments, we fixed the shrinkage parameter to 0.001, which gave the lowest out-of-sample error. We refer to this boosting-based ensemble SF as BsN-Score.

Algorithm 4 Algorithm for building BsN-Score: an ensemble NN SF using boosting
1: Input: training data $D = \{X^{P}, Y\}$, where $X^{P} = \{\mathbf{x}^{P}_1, \ldots, \mathbf{x}^{P}_N\}$, $Y = \{y_1, \ldots, y_N\}$, and $N$ is the number of training complexes.
2: Construct $D_1 = \{X^{p_1}, Y\}$ from $X^{P}$ by selecting a random subset $p_1$ of features. Train an FFBP NN model $f_1$ on $D_1$ and use it to predict BAs ($\hat{Y}_1$) of the complexes $X^{p_1}$.
3: Calculate the residuals: $Y^{res}_1 = Y - \nu \hat{Y}_1$.
4: for $l = 2$ to $L$ do
5:   Draw a bootstrap sample $X^{P}_l$ from $X^{P}$.
6:   Characterize the complexes in the bootstrap sample $X^{P}_l$ using a random subset $p_l$ of features: $X^{p_l}_l$.
7:   From $Y^{res}_{l-1}$, draw the residuals corresponding to the complexes in the sample $X^{P}_l$: $Y^{res*}_{l-1}$.
8:   Construct a new training set: $D_l = \{X^{p_l}_l, Y^{res*}_{l-1}\}$.
9:   Learn the current residuals by training an FFBP NN model $f_l$ on $D_l$.
10:  Calculate the predictions $\hat{Y}_l$ of the NN model $f_l$ on all $X^{p_l}$ training complexes in the original training set $D$.
11:  Update the residuals: $Y^{res}_l = Y^{res}_{l-1} - \nu \hat{Y}_l$.
12: end for
13: The final prediction of a protein-ligand complex $\mathbf{x}^{P}$ is: $\hat{y} = f(\mathbf{x}^{P}) = \sum_{l=1}^{L} \nu f_l(\mathbf{x}^{p_l}) = \sum_{l=1}^{L} \nu \hat{y}_l$.

Figure 4.2: BsN-Score: ensemble neural network SF using the boosting approach.
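The stage-wise residual fitting of Algorithm 4 can be sketched in the same style as the bagging example above; again, MLPRegressor stands in for the FFBP networks, and the class name and defaults are ours for illustration under those assumptions.

import numpy as np
from sklearn.neural_network import MLPRegressor

class BoostedNNEnsemble:
    """Illustrative BsN-Score-style ensemble: networks fitted stage-wise to residuals."""

    def __init__(self, n_networks=300, shrinkage=0.001, feature_fraction=0.33, random_state=0):
        self.n_networks = n_networks
        self.shrinkage = shrinkage              # the nu parameter of Algorithm 4
        self.feature_fraction = feature_fraction
        self.rng = np.random.RandomState(random_state)
        self.members = []

    def fit(self, X, y):
        n_samples, n_features = X.shape
        n_sub = max(1, int(self.feature_fraction * n_features))
        residuals = y.astype(float).copy()
        for l in range(self.n_networks):
            # The first network sees all complexes; later stages use bootstrap samples.
            rows = (np.arange(n_samples) if l == 0
                    else self.rng.randint(0, n_samples, n_samples))
            cols = self.rng.choice(n_features, n_sub, replace=False)    # random subset p_l
            net = MLPRegressor(hidden_layer_sizes=(n_sub,), activation='relu', max_iter=500)
            net.fit(X[np.ix_(rows, cols)], residuals[rows])             # learn current residuals
            # Update the residuals with a shrunk version of this network's predictions.
            residuals -= self.shrinkage * net.predict(X[:, cols])
            self.members.append((net, cols))
        return self

    def predict(self, X):
        # The final prediction is the shrinkage-weighted sum over all stages (Algorithm 4, Step 13).
        return sum(self.shrinkage * net.predict(X[:, cols]) for net, cols in self.members)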
4.4.5 Improving speed and accuracy of ensemble neural networks via transfer learning

The bagged and boosted ensemble neural networks proposed in this work involve building a large number of wide and deep networks to accurately model the complex binding problem. Before training, the weights of each network in these ensemble models are initialized randomly and independently of one another. Tens of thousands of training steps are then applied to each network to learn the residuals made by the previous networks. We propose a more efficient boosting algorithm in which the optimized weights of a trained network are used as the initial weights of the next network to be trained ($W^{i}_{l} = W^{o}_{l-1}$). The process of transferring knowledge from one model to another on a related problem is known as transfer learning. Networks whose weights are initialized with values transferred from optimized NNs are faster to train than networks initialized randomly. They also tend to improve upon the performance of the networks from which the weights were transferred.

Depending on the particular training data, our neural networks typically require tens of thousands of training steps ($S$) for their weights to converge. In each step, a random subset of the training data is predicted by the network with its current weights, and the prediction errors made on these examples are used by the back-propagation algorithm to update the weights. These random subsets of protein-ligand complexes are known as mini-batches, and their size is set to $b = 50$ data points in our experiments. For a boosted ensemble model with $L$ neural networks, the training time is proportional to the number of steps $S$, the mini-batch size $b$, and the number of networks $L$: $T \propto SbL$. Using transfer learning, we can cut the number of training steps to a few thousand ($s < S$) for networks whose initial weights are transferred from their predecessors ($W^{i}_{l} = W^{o}_{l-1}$). If we initialize weights randomly once after every $t$ networks, we have $L/t$ networks initialized randomly and as many as $L - L/t$ networks initialized with optimal or suboptimal weights. As a result, the training time $T^{*}$ of a boosting model with weight transfer obeys $T^{*} \propto SbL/t + sb(L - L/t)$. Since the number of training steps ($s$) of optimally or sub-optimally initialized networks is a fraction $f$ (typically 10 to 15%) of that of the randomly initialized ones, we can substitute $s = fS$ in the expression for $T^{*}$. After this substitution, the speedup from transfer learning is:

\text{speedup} = \frac{T}{T^{*}} = \frac{t}{1 + f(t-1)}.

For a boosted neural network without weight transfer (i.e., when $t = 1$), we gain no speedup. The speedup is also one when the number of training steps is equal for both types of initialization (i.e., $s = S$ and therefore $f = 1$). On the other hand, setting $t = 30$ and $f = 0.2$ yields a speedup of more than 4x. A regular boosted neural network that takes 8 hours to train could thus be trained in less than 2 hours with transfer learning without affecting accuracy.

In our experiments, we attempt to strike a balance between model diversity and training speed via transfer learning. Networks whose weights are initialized randomly tend to be less correlated with each other (i.e., more diverse) than those that share their initial weights; however, the more diverse ensemble is slower to train. Therefore, our BsN-Score and BgN-Score SFs are composed of 300 networks, 20% of which have randomly initialized weights while the other 80% start from optimized weights. Networks benefiting from transfer learning must use the same input descriptors ($p_l$) as the networks donating the optimized weights ($p_l = p_{l-1}$); otherwise, the transferred initial weights would be effectively random and the training process would be slower.
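As a quick check of the speedup expression above, the small helper below (our own illustrative function, not part of the dissertation's code) reproduces the two cases discussed in the text.

def transfer_learning_speedup(t, f):
    """Speedup T / T* when weights are re-initialized randomly once every t networks
    and transferred networks need only a fraction f of the usual training steps."""
    return t / (1.0 + f * (t - 1))

# t = 1 (no weight transfer) gives no speedup; t = 30 with f = 0.2 gives ~4.4x.
print(transfer_learning_speedup(1, 0.2))    # 1.0
print(transfer_learning_speedup(30, 0.2))   # ~4.41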
4.4.6 Neural networks and other machine-learning scoring functions

In order to investigate the effectiveness of ensemble NN SFs in comparison to traditional NN models, ensemble decision-tree models, and linear regression, we trained and tested BgN-Score, BsN-Score, a single deep neural network SF referred to as DNN-Score, RF, BRT, XGB, SVM, kNN, MARS, and MLR SFs on training and test datasets sampled from PDBbind complexes according to the procedures described in Section 2.1. Our neural network and other machine-learning SFs are trained and tested on complexes characterized by the multi-perspective descriptors derived using the proposed DDB platform from 4 and 16 different sources (for PDBbind 2007 and PDBbind 2014, respectively), as described in Chapter 3. The parameters of these SFs were tuned in a consistent manner to optimize the mean-squared prediction errors on validation complexes sampled without replacement from the training set and independent of the test sets. Out-of-bag instances were used as validation complexes for BgN-Score and RF, while ten-fold cross-validation was conducted for BsN-Score, DNN-Score, and the other ML SFs. Out-of-bag (OOB) refers to complexes that are not sampled from the training set when bootstrap sets are drawn to fit individual NNs in BgN-Score models or decision trees in RF; on average, roughly one-third of the training set complexes are left out (or "out-of-bag") when each bootstrap set is drawn. The parameters that are tuned and their optimized values are as follows (a sketch of a member network with these settings follows the list):

1. L: the number of base learners (neural networks in ensemble NN SFs) was set to 300.
2. |p|: the size of the feature subset p randomly selected from the overall set of features P when constructing each neural network in ensemble NN SFs. This was set to 33% of the total number of features for ensemble NN SFs. The number of input neurons for DNN-Score is set to the full set of input descriptors since this model does not sample the input features. All NN SFs have one output neuron per network that produces the binding affinity score.
3. H + 1: the number of hidden-layer neurons in NN SFs was set to 150%, 100%, and 50% of the number of neurons in the input layer.
4. The activation function for hidden layers is the rectified linear unit [110].
5. ν: the shrinkage parameter for BsN-Score models was set to 0.001.
6. The initial weights of each layer were randomly sampled from a normal distribution with zero mean and variance of 1/(ni + no), where ni and no are the number of input and output neurons of the layer being initialized, respectively [111].
7. The Adaptive Moment Estimation (Adam) variant of stochastic gradient descent was used to update the weights during training [112]. We use TensorFlow's implementation of the Adam optimization algorithm [43].
8. A total of 50000 training batches (parameter optimization steps) and a batch size of 50 random complexes were used to optimize the weights of each network in the ensemble and single NN SFs. The training process of a network is terminated earlier if the moving average of the validation loss (RMSE) does not decrease for 1000 successive steps. Twenty percent of the training data of each network is held aside before training starts as an out-of-sample validation set to enable early stopping.
9. The number of regression trees for the Random Forest model was set to 3000.
10. When fitting each tree in the Random Forest ensemble, a random subset of 12 descriptors is considered when expanding each node in the tree; a random subset is sampled for each node from the 36 geometric descriptors of RF-Score.
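The following Keras sketch illustrates one member network configured with the settings in the list above (ReLU hidden layers sized at 150%, 100%, and 50% of the input width, variance 1/(ni + no) initialization, Adam, mini-batches of 50, and early stopping on a 20% validation split). The function name and the use of the Keras API are our own assumptions for illustration; the dissertation's SFs were implemented directly with TensorFlow.

import tensorflow as tf

def build_member_network(n_inputs):
    """One feed-forward network of the ensemble, following the hyperparameters listed above."""
    # Per-layer weight variance of 1/(fan_in + fan_out), via Keras' VarianceScaling initializer.
    init = tf.keras.initializers.VarianceScaling(scale=0.5, mode='fan_avg',
                                                 distribution='untruncated_normal')
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(int(1.5 * n_inputs), activation='relu',
                              kernel_initializer=init, input_shape=(n_inputs,)),
        tf.keras.layers.Dense(n_inputs, activation='relu', kernel_initializer=init),
        tf.keras.layers.Dense(int(0.5 * n_inputs), activation='relu', kernel_initializer=init),
        tf.keras.layers.Dense(1, kernel_initializer=init),   # single output neuron: the BA score
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(), loss='mse')
    return model

# Training would use mini-batches of 50 and early stopping on a 20% held-out validation split,
# e.g.: model.fit(X, y, batch_size=50, epochs=200, validation_split=0.2,
#                 callbacks=[tf.keras.callbacks.EarlyStopping(patience=20,
#                                                             restore_best_weights=True)])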
The optimal parameter values of the other ML SFs trained on PDBbind 2007 are listed in Table 4.1. We distinguish them from each other using the notation "ML model::tools used to calculate features". For instance, BsN-Score::XA implies that the SF is a boosted ensemble neural network model trained and tested on complex sets described by XA features. For brevity, for each of the DNN-Score, BgN-Score, BsN-Score, RF, BRT, SVM, kNN, MARS, and MLR models, we report results only for the feature combination (out of the fifteen possible) that yields the best performance on validation complexes sampled without replacement from the training data and independent of the test set.

Table 4.1: Optimal parameter values for MARS, kNN, SVM, RF, and BRT models (feature sets as rows)

Feature set | MARS Degree | MARS Penalty | kNN k | kNN q | SVM C | SVM ε | SVM σ | RF mtry | BRT Interaction depth | BRT Number of trees
X    | 2 | 2 | 15 | 1 | 2 | 0.5   | 1     | 3  | 15 | 1114
A    | 1 | 6 | 13 | 1 | 2 | 0     | 0.25  | 18 | 17 | 1523
R    | 1 | 5 | 14 | 1 | 1 | 0.250 | 0.125 | 8  | 18 | 1573
G    | 1 | 6 | 16 | 1 | 1 | 0.250 | 0.250 | 7  | 16 | 1208
XA   | 1 | 7 | 9  | 1 | 1 | 0.125 | 0.250 | 31 | 19 | 1371
XR   | 1 | 2 | 19 | 1 | 4 | 0.125 | 0.031 | 5  | 15 | 2113
XG   | 1 | 6 | 17 | 1 | 1 | 0.250 | 0.031 | 8  | 18 | 1610
AR   | 1 | 7 | 19 | 1 | 2 | 0.250 | 0.031 | 10 | 19 | 2950
AG   | 1 | 6 | 18 | 1 | 2 | 0.250 | 0.125 | 16 | 17 | 2181
RG   | 1 | 5 | 17 | 1 | 1 | 0.125 | 0.031 | 17 | 16 | 2303
XAR  | 1 | 6 | 18 | 1 | 1 | 0.250 | 0.125 | 14 | 16 | 2213
XAG  | 1 | 7 | 19 | 1 | 2 | 0.125 | 0.031 | 20 | 20 | 2590
XRG  | 1 | 6 | 17 | 1 | 2 | 0.125 | 0.031 | 21 | 18 | 2854
ARG  | 1 | 5 | 18 | 1 | 1 | 0.125 | 0.031 | 25 | 17 | 2921
XARG | 1 | 6 | 19 | 1 | 2 | 0.250 | 0.031 | 35 | 20 | 2859

4.5 Results and discussion

4.5.1 Evaluation of scoring functions in binding affinity prediction and ligand ranking

The scoring power of SFs quantifies their ability to accurately predict protein-ligand binding affinity, or to reproduce it for complexes with known experimental BA data. The similarity between the predicted and measured BAs is quantified using Pearson's ($R_p$) and Spearman's ($R_s$) correlation coefficients, the standard deviation (SD) of errors, and the root-mean-square error (RMSE). Pearson's correlation coefficient measures the linear relationship between two variables as follows:

R_p = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2 \; \sum_{i=1}^{N} (y_i - \bar{y})^2}},

where $N$ is the number of complexes and $\hat{y}_i$ and $y_i$ are the predicted and measured binding affinities of the $i$-th complex, respectively. The average values of the predicted and experimentally measured affinities over all complexes are $\bar{\hat{y}}$ and $\bar{y}$, respectively. Spearman's correlation coefficient evaluates the correlation between the predicted and measured BAs in terms of their ranks and is defined as:

R_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N (N^2 - 1)},

where $d_i$ is the difference in ranks of the predicted and measured affinities of the $i$-th complex. The SF that achieves the highest correlation coefficient (maximum is one) on some dataset is considered more accurate than its counterparts that realize smaller $R_p$ and/or $R_s$ values (minimum is negative one). Another measure of scoring power we report here is the standard deviation (SD) of errors between predicted and measured BAs (in $-\log K_i$ or $-\log K_d$ units). To calculate this statistic for a given SF, a linear model relating the predicted scores $\hat{Y}$ to the measured ones $Y$ is first fitted: $Y = \beta_0 + \beta_1 \hat{Y}$, where $\beta_0$ and $\beta_1$ are the intercept and the slope of the model, respectively. The SD statistic can then be computed as follows [98]:

SD = \sqrt{\frac{\sum_{i=1}^{N} \left[ y_i - (\beta_0 + \beta_1 \hat{y}_i) \right]^2}{N - 2}}.

The root-mean-square error (RMSE) of the predicted scores is calculated as:

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2}.
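The four scoring-power statistics follow directly from paired predicted and measured affinities; the helper below (our own illustration using NumPy and SciPy, not the dissertation's code) implements the definitions above.

import numpy as np
from scipy.stats import pearsonr, spearmanr

def scoring_power_metrics(y_true, y_pred):
    """Return Rp, Rs, SD, and RMSE between measured and predicted binding affinities."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    n = len(y_true)
    rp = pearsonr(y_pred, y_true)[0]
    rs = spearmanr(y_pred, y_true)[0]
    # SD: residual standard deviation of the linear fit y = b0 + b1 * y_pred.
    b1, b0 = np.polyfit(y_pred, y_true, 1)
    sd = np.sqrt(np.sum((y_true - (b0 + b1 * y_pred)) ** 2) / (n - 2))
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rp, rs, sd, rmse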
SFs that yield smaller SD and RMSE values usually realize higher $R_p$ and $R_s$ values, and therefore have higher scoring power than models with large SD and RMSE statistics.

Calculating the ranking power of an SF on the core test set is straightforward due to the set's construction (see Section 2.1.1 for more information). First, each protein-ligand complex is scored. Then, for each protein cluster (i.e., the three protein-ligand complexes associated with a common protein), the complexes are ordered according to their predicted binding affinity scores. A cluster is considered correctly ranked if its predicted-affinity-based order matches its measured-affinity-based order. We denote this order by "1-2-3", which implies that the strongest-binding ligand is ranked as the first candidate drug, the second-strongest binder is ranked second by the SF, and the weakest binder is ranked last; the number in the ordering corresponds to the true measured rank and the position in the ordering corresponds to the predicted rank. The percentage of clusters with "1-2-3" ordering is referred to as the 1-2-3 ranking rate, denoted by R1-2-3, and is used as a measure of the ranking power of a given scoring function, as in [17]. In order to more fully capture ranking accuracy, we also report rates for other orderings using similar notation. For example, "1-3-2" implies that the actual second- and third-ranked ligands are incorrectly ranked while the first one is correctly identified as the strongest binder among the three candidates; the percentage of clusters with such an ordering is referred to as the 1-3-2 ranking rate and denoted by R1-3-2. We also report the success rates of different SFs in identifying the strongest binder of each cluster (i.e., their 1-X-X ranking rates, R1-X-X). Finally, we use the Spearman correlation coefficient ($R_s$) to measure ranking accuracy on test samples other than the core test set.
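The per-cluster ranking rates can be computed as sketched below; the function name and the assumed input format (a list of three-ligand clusters, each with measured affinities and predicted scores) are our own illustration of the bookkeeping described above.

import numpy as np
from itertools import permutations

def ranking_rates(clusters):
    """Compute the 1-2-3, 1-3-2, and 1-X-X ranking rates over three-ligand clusters.

    `clusters` is a list of (measured_affinities, predicted_scores) pairs, one per cluster.
    """
    clusters = list(clusters)
    counts = {p: 0 for p in permutations((1, 2, 3))}
    top_correct = 0
    for measured, predicted in clusters:
        true_rank = np.argsort(np.argsort(-np.asarray(measured))) + 1   # rank 1 = strongest binder
        pred_order = np.argsort(-np.asarray(predicted))                 # predicted best first
        ordering = tuple(int(true_rank[i]) for i in pred_order)         # e.g. (1, 2, 3)
        counts[ordering] += 1
        top_correct += ordering[0] == 1                                  # contributes to R1-X-X
    n = len(clusters)
    return {"R1-2-3": counts[(1, 2, 3)] / n,
            "R1-3-2": counts[(1, 3, 2)] / n,
            "R1-X-X": top_correct / n}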
4.5.2 Ensemble neural networks vs. other approaches on a diverse test set

In this section, we examine the binding affinity prediction performance of several machine-learning and conventional scoring functions on protein-ligand complexes obtained from two versions of PDBbind. In the first set of experiments, we evaluate the performance of 9 machine-learning scoring functions (including neural network models) and 17 conventional approaches on the 2007 release of PDBbind. These experiments were conducted prior to the development of our multi-perspective descriptor extraction platform DDB; therefore, we only use a subset of 4 descriptor types based on X-Score, RF-Score, AffiScore, and GOLD-type features. In the second set of experiments, we train and evaluate scoring functions on complexes from PDBbind 2014 characterized by over 2700 descriptors extracted from 16 different sources (perspectives) in DDB.

4.5.2.1 Scoring performance on diverse protein families from PDBbind 2007

We trained three neural network SFs (DNN-Score, BgN-Score, and BsN-Score) and six other machine-learning scoring models (RF, BRT, SVM, kNN, MARS, MLR) on the primary (Pr) training set and evaluated their scoring performance on an independent test set of 195 diverse protein-ligand complexes from 65 different protein families (refer to Section 2.1 for more details about these sets). Both the training and test set complexes are obtained from the 2007 version of PDBbind.

Table 4.2 lists the scoring and ranking powers of these models and the same performance statistics for the seventeen conventional SFs evaluated on the same test set. We also report the scoring performance of the NN and the other ML SFs on the training set Pr using out-of-sample validation, to show how close the predicted BAs are to the experimentally measured ones in terms of RMSE; this statistic indicates whether SFs deemed accurate on training data will also be reliable scoring models on the test set Cr. This measure was not calculated for the conventional SFs (except X-Score) since we do not have access to the BA values of their training sets. All the scoring and ranking power metrics indicate that our proposed SFs and the ensemble decision-tree-based models are the most accurate in predicting the binding affinities and the correct ranks of the independent test set complexes and out-of-sample training data. BsN-Score outperforms the most accurate conventional SF, X-Score::HMScore, by at least 26% in terms of Pearson's correlation coefficient (0.816 vs. 0.644). BgN-Score also achieves excellent performance of 0.808, which is about a 25% improvement over X-Score::HMScore. As for the ranking power, we notice that all SFs are able to identify the top binder (see the 1-X-X ranking rate) in a majority of the clusters. However, ordering the three ligands correctly within a cluster (the 1-2-3 ranking rate) is much more challenging for scoring models. In fact, less than half of the SFs are capable of correctly ranking 50% or more of the protein-family clusters. Ensemble-learning-based scoring models (i.e., BsN-Score, BgN-Score, BRT, and RF) top the list, achieving 1-2-3 ranking rates from 60.0% (39 clusters) for BgN-Score up to 64.6% (42 clusters) for BsN-Score. The SVM model fitted to XAR features comes in fifth place by identifying the correct order of 37 clusters. BsN-Score, BgN-Score, BRT, RF, and SVM also yield the highest $R_s$ and $R_p$ values on the core set and the lowest RMSE values on the training data compared to the other models.

There are two main reasons for the superior performance of ensemble SFs. First, our proposed multi-perspective interaction modeling results in more numerous and varied features that more fully characterize protein-ligand interactions. Thus we find that the BsN-Score, BgN-Score, BRT, and RF SFs employing all four feature types considered (X, A, R, and G features) are more accurate than the same SFs employing fewer features. Second, the learning model of ensemble SFs is nonlinear and flexible and can exploit a large number of features while being resilient to overfitting. Thus we find that DNN-Score::X (for which $R_p$ = 0.675) is more accurate than the versions of DNN-Score employing only one of the A, R, or G feature types, as well as DNN-Score::XARG (for which $R_p$ = 0.517), because single neural network models overfit the training complexes when they are characterized by a large number of features.

Table 4.2: Comparison of the scoring and ranking powers of the proposed and 17 conventional scoring functions on the diverse protein-ligand complexes of the PDBbind 2007 core (Cr) test set.
(Scoring power: N, Rp, Rs, SD, RMSEtest, RMSEtrain. Ranking power, in %: R1-2-3, R1-3-2, R1-X-X.)

Scoring function      | N(1) | Rp(2) | Rs(3) | SD(4) | RMSEtest(5) | RMSEtrain(6) | R1-2-3 | R1-3-2 | R1-X-X
BsN-Score::XARG       | 195  | 0.816 | 0.800 | 1.38  | 1.386       | 1.366        | 64.6   | 13.8   | 78.4
BRT::XARG             | 195  | 0.810 | 0.798 | 1.40  | 1.485       | 1.425        | 63.1   | 10.8   | 69.3
BgN-Score::XARG       | 195  | 0.808 | 0.799 | 1.40  | 1.449       | 1.403        | 60.0   | 15.4   | 75.4
RF::XARG              | 195  | 0.804 | 0.790 | 1.43  | 1.498       | 1.442        | 63.8   | 6.2    | 70.0
SVM::XAR              | 195  | 0.773 | 0.793 | 1.51  | 1.490       | 1.580        | 56.9   | 21.9   | 78.8
kNN::XA               | 195  | 0.740 | 0.731 | 1.61  | 1.520       | 1.600        | 44.6   | 18.5   | 63.1
MARS::XAR             | 195  | 0.710 | 0.740 | 1.68  | 1.660       | 1.680        | 44.6   | 15.4   | 60.0
MLR::XA               | 195  | 0.689 | 0.730 | 1.73  | 1.700       | 1.790        | 41.5   | 21.5   | 63.0
DNN-Score::X          | 195  | 0.675 | 0.685 | 1.76  | 1.760       | 1.704        | 33.8   | 29.2   | 63.0
X-Score::HMScore      | 195  | 0.644 | 0.705 | 1.83  | 1.865       | 1.730        | 54.7   | 15.6   | 70.3
DrugScoreCSD          | 195  | 0.569 | 0.627 | 1.96  | –           | –            | 51.6   | 21.9   | 73.4
SYBYL::ChemScore      | 195  | 0.555 | 0.585 | 1.98  | –           | –            | 46.9   | 25.0   | 71.9
DS::PLP1              | 195  | 0.545 | 0.588 | 2.00  | –           | –            | 54.7   | 15.6   | 70.3
GOLD::ASP             | 195  | 0.534 | 0.577 | 2.02  | –           | –            | 43.8   | 21.9   | 65.6
AffiScore             | 195  | 0.521 | 0.505 | 2.06  | –           | –            | 38.5   | 12.3   | 50.8
SYBYL::G-Score        | 195  | 0.492 | 0.536 | 2.08  | –           | –            | 46.9   | 25.0   | 71.9
DS::LUDI3             | 195  | 0.487 | 0.478 | 2.09  | –           | –            | 45.3   | 21.9   | 67.2
DS::LigScore2         | 193  | 0.464 | 0.507 | 2.12  | –           | –            | 35.9   | 25.0   | 60.9
GlideScore-XP         | 178  | 0.457 | 0.435 | 2.14  | –           | –            | 34.4   | 29.7   | 64.1
DS::PMF               | 193  | 0.445 | 0.448 | 2.14  | –           | –            | 40.6   | 18.8   | 59.4
GOLD::ChemScore       | 178  | 0.441 | 0.452 | 2.15  | –           | –            | 35.9   | 21.9   | 57.8
SYBYL::D-Score        | 195  | 0.392 | 0.447 | 2.19  | –           | –            | 46.9   | 20.3   | 67.2
DS::Jain              | 189  | 0.316 | 0.346 | 2.24  | –           | –            | 42.2   | 31.3   | 73.4
GOLD::GoldScore       | 169  | 0.295 | 0.322 | 2.29  | –           | –            | 23.4   | 26.6   | 50.0
SYBYL::PMF-Score      | 190  | 0.268 | 0.273 | 2.29  | –           | –            | 39.1   | 21.9   | 60.9
SYBYL::F-Score        | 185  | 0.216 | 0.243 | 2.35  | –           | –            | 29.7   | 25.0   | 54.7

(1) Number of complexes in Cr with positive (favorable) binding scores using this SF [17].
(2) Rp is the Pearson correlation coefficient between predicted and measured BA values of complexes in Cr.
(3) Rs is the Spearman correlation coefficient between predicted and measured BA values of complexes in Cr.
(4) SD is the standard deviation of errors between predicted and measured BA values of complexes in Cr, based on Equation 3 in [98].
(5) RMSEtest is the root-mean-square of errors between predicted and measured BA values of the test complexes in Cr. Test RMSE is not available for conventional SFs except for X-Score::HMScore, which we have reconstructed.
(6) RMSEtrain is the root-mean-square of errors between predicted and measured BA values of out-of-sample complexes in the training set Pr. Training RMSE is not available for conventional SFs except for X-Score::HMScore, which we have reconstructed.

We attempted to decrease the effect of overfitting by performing feature reduction using PCA, which increased the performance of DNN-Score::XARG to $R_p$ = 0.667. However, the predictions of DNN-Score::XARG are still substantially less accurate than those of BsN-Score::XARG and BgN-Score::XARG, even though the first 10 principal components used to calculate the 10 new features explain more than 99.7% of the total variance in the raw XARG features. Further, the significance of the ensemble modeling approach can be gauged from the fact that, even with a single type of feature, BsN-Score::A and BgN-Score::A yield accuracies of $R_p$ = 0.780 and 0.775, respectively, which are within ~4% of the accuracies of BsN-Score::XARG and BgN-Score::XARG.

Table 4.2 also shows that the ensemble NN SFs BsN-Score::XARG and BgN-Score::XARG are more accurate than their decision-tree-based counterparts, BRT::XARG and RF::XARG, though the latter come a close second and fourth ($R_p$ = 0.816 and R1-2-3 = 64.6% achieved by BsN-Score vs.
0.810 and 63.8% realized by BRT and RF, respectively). Note that the boosting model BRT corresponds to BsN-Score, and RF is the bagging analog of BgN-Score. We believe the difference in performance between neural-network- and decision-tree-based ensemble models, although small, is mainly attributable to the way the base learners of these ensembles approximate the unknown function. Decision trees model the unknown function by partitioning the training data into smaller subsets from which a prediction is calculated. Such a procedure creates a series of non-overlapping regions with axis-parallel decision boundaries. The numerical value associated with each region is typically the average BA of the training data subset belonging to that partition, which can be significantly different from the neighboring regions. This can create a rough and abrupt approximation of the unknown function. On the other hand, NNs with hidden units can closely and smoothly model any nonlinear continuous function. In addition, hidden neurons may create new important features that would otherwise be impossible to extract directly from protein-ligand complexes. These two factors minimize the bias error of NN models, but may lead to increased variance or instability, as in the case of single neural network SFs. The proposed boosting and bagging ensemble learning approaches greatly reduce this variance error. Such simultaneous reduction in bias and variance errors makes the ensemble NN SFs the most accurate BA predictors among the scoring functions listed in Table 4.2.

4.5.2.2 Scoring performance on diverse protein families from PDBbind 2014

In the second set of experiments, we use the 2014 release of PDBbind training and test datasets described in Section 2.1 for building and evaluating three neural network SFs (DNN-Score, BgN-Score, and BsN-Score), the Random Forest SF RF-Score, and the empirical scoring function X-Score. The neural network scoring functions are trained and tested on complexes characterized by 2714 multi-perspective descriptors extracted using the proposed descriptor platform DDB. The Random Forest scoring function RF-Score uses 36 geometrical features of atomic pair-wise counts, while the empirical model X-Score employs 6 physiochemical descriptors as energy terms. After training, the five SFs are tested on the out-of-sample (OOS) core test set (Cr) of familiar (or known) targets, OOS cross-validation (CV) sets of partially familiar targets, and OOS leave-clusters-out (LCO) sets of novel targets. In these three training-test sampling scenarios, the protein-ligand complex as a whole is always out-of-sample and is never present in the training data of the five SFs. A scoring function is considered familiar with a protein target if that protein was part of its training data, even though the protein is bound to a ligand different from the one it binds in the test set. Figure 4.3 shows the scoring powers of the five SFs in terms of Pearson's and Spearman's correlation coefficients and the normalized root-mean-square error (NRMSE1). All three scoring power metrics indicate that our proposed SFs are the most accurate in predicting the binding affinities of the OOS complexes in the three training-test sampling scenarios.

1 NRMSE = RMSE/(BAmax − BAmin), where BAmax = 11.96 and BAmin = 2.00 are the largest and smallest BA values of the training complexes, respectively.
The average correlation coefficients of ten versions of the five SFs on the familiar-targets test set Cr reveal that BsN-Score and BgN-Score outperform the most accurate empirical SF, X-Score, by at least 34% in terms of Pearson's correlation coefficient $R_p$, which is 0.844, 0.840, and 0.627 for the three SFs, respectively. BsN-Score and BgN-Score also achieve better scoring accuracy than RF-Score, whose predictions have average correlations of 0.725 and 0.729 for the Cr and CV tests. The single (i.e., non-ensemble) deep neural network SF, DNN-Score, performs better than RF-Score (0.802 for Cr and 0.773 for CV) but second to the ensemble DNN SFs BsN-Score and BgN-Score. Similar accuracy trends for the five SFs can also be observed from the other two performance metrics, $R_s$ and NRMSE.

The scoring accuracy of the five SFs drops substantially when they are tested on protein targets that share no sequence similarity with their training proteins, as can be observed from the leave-clusters-out (LCO) evaluation scenario. However, BsN-Score and BgN-Score can still reproduce experimental affinity measurements (pKd and pKi) more accurately than the other three SFs. With a correlation coefficient of 0.637, BsN-Score is at least 19% more accurate than DNN-Score, RF-Score, and X-Score.

Figure 4.3: The scoring accuracy of the proposed and conventional scoring functions when evaluated on test complexes with proteins that are either fully represented (Cr), partially represented (CV), or not represented (LCO) in the SFs' training data. Training and test protein-ligand complexes are from the refined set of PDBbind 2014.

The ranking performance of the scoring functions is evaluated on the core test set of PDBbind 2014 and depicted in Figure 4.4. The results in the figure are in line with those summarized in Table 4.2 for the 2007 core test set. We observe here that all five SFs are able to identify the actual top (1-X-X ranking rate) and bottom (X-X-3) binders in more than two thirds of the protein-family clusters.

Figure 4.4: The ranking accuracy of the proposed and conventional scoring functions when evaluated on the core (Cr) test set complexes of PDBbind 2014.

Ordering the three ligands correctly within a cluster (the 1-2-3 ranking rate) is, however, much more challenging for scoring models. Neural network SFs top the list again by achieving 1-2-3 ranking rates of 61.5% and 60.0% (40 and 39 clusters), as opposed to 58.0% and 48.3% (38 and 32 clusters) achieved by X-Score and RF-Score, respectively.
The main factors behind the superior performance of BsN-Score and BgN-Score are the learning algorithm and the multi-perspective descriptors we use to characterize protein-ligand complexes. Deep neural networks are excellent at modeling very complex functions due to their nonlinearity and capacity, and ensemble learning through boosting and bagging significantly decreases their variance error and overfitting. In addition, our proposed multi-perspective interaction modeling results in more numerous and diverse features that more fully characterize protein-ligand interactions. Thus we find that BsN-Score, BgN-Score, and DNN-Score employing all 16 feature types are more accurate than scoring functions employing only one set of features, such as RF-Score and X-Score.

We conducted another set of experiments on PDBbind 2014 complexes in which we compare the performance of the proposed ensemble neural network models to ensemble tree SFs. The neural network and tree-based ensemble models were trained and evaluated using the same Cr, CV, and LCO training-test sampling strategies, whose complexes are also characterized by the same 16 descriptor sets. The only difference between the two types of SFs is the base learner they use to construct the ensemble (i.e., a neural network versus a decision tree). Our results indicate that the neural network SFs BsN-Score and BgN-Score are more accurate than their decision-tree counterparts based on XGBoost and Random Forests (RF). The boosted-trees SF, which we coin BT-Score, comes a close second to BsN-Score and BgN-Score with $R_p$ = 0.830 and R1-2-3 = 60% on the core test set Cr. Note that the boosting model BT-Score corresponds to BsN-Score and RF is the bagging analog of BgN-Score. The results on the 2014 version of PDBbind confirm our finding on the 2007 release that the performance gap between neural-network- and decision-tree-based ensemble models is mainly due to the superior approximation by neural networks of the unknown function that governs the molecular binding process.

4.5.3 Ensemble neural networks vs. other approaches on homogeneous test sets

4.5.3.1 Scoring performance on four protein families from PDBbind 2007

It has been observed that around 92% of existing drug targets are similar to proteins already present in the Protein Data Bank, which is the primary source of our training and validation complexes [113]. Based on this finding and the similar overlap relationship between training and test set proteins in the core test set experiment in Section 4.5.2.1, we believe that the scoring performance of the SFs listed in Table 4.2 should be expected in typical molecular docking and virtual screening campaigns. For each protein family in PDBbind's core test set, there is at least one protein family in the primary training set of our proposed NN and the six other ML SFs, but the two sets share no protein-ligand complexes when these pairs of compounds are considered as whole biological units. We describe here a more stringent experiment to assess the generalization of the NN, RF, BRT, SVM, kNN, MARS, and MLR SFs when they are applied to score ligands for novel drug targets. In this experiment, we evaluate the BA predictive accuracy of the NN SFs on four protein families not present in their training set. These protein families are the most frequent in the 2007 version of PDBbind, which includes 112 HIV protease, 73 trypsin, 44 carbonic anhydrase, and 38 thrombin complexes.
A test set for each of these protein families was constructed by sampling all complexes formed by that protein from the training (Pr) and test (Cr) sets. The training complexes corresponding to each of these four test sets are the remaining protein-ligand pairs in Pr. For each protein family, we fitted the proposed models to the corresponding independent training complexes and validated them on the test set complexes, which are formed between that type of protein and a unique set of co-crystallized ligands. The prediction accuracy of our proposed models and the top four conventional scoring functions on complexes formed by the four protein types is shown in Table 4.3.

Examining the upper portion of the table for the four families, where the test and training sets are disjoint for the NN and the other six ML SFs, we notice that the predictive accuracy of all SFs varies from poor to good depending on the protein family. The predictions of all SFs have mediocre correlation with the experimental binding affinities of the ligands that bind to HIV protease. The highest Pearson's correlation value between predicted and true BAs is 0.465, achieved by the ML scoring function MARS::X. Improper characterization of enthalpic and entropic forces for HIV protease complexes could be the main reason for these erroneous predictions [17]. The significant conformational changes observed during binding, as well as the lack of similar proteins in the training set, could also result in such inaccurate BA estimates. The scoring accuracy on the other three protein families is substantially better. The binding affinities of ligands bound to trypsin were predicted with an accuracy of at least $R_p$ = 0.842, again with a scoring function based on MARS. Discovery Studio's empirical SF PLP2 shows the highest accuracy on the carbonic anhydrase dataset, with a linear correlation value of 0.800. The most accurate models on the thrombin test set are the NN and several other ML-based SFs, with $R_p$ values of about 0.700 and better, followed by the conventional scoring functions.

It can be observed that the SF based on a single NN, DNN-Score, performs relatively poorly overall, except in one case. In some of these test sets, a few conventional SFs perform better than the ensemble NN SFs. This behavior can be attributed to possible overlap between the training complexes of the conventional approaches and the four family-specific test sets.

Table 4.3: Comparison of the scoring and ranking accuracies of scoring functions on four protein-family-specific test sets derived from the refined set of PDBbind 2007.
HIV protease (N = 112)
Upper portion (training and test sets disjoint for the ML SFs):
Scoring function       | Rp(1) | Rs(1) | SD(1) | RMSE(1) | D(2)
MARS::X                | 0.465 | 0.467 | 1.384 | 1.405   | Y
MLR::G                 | 0.463 | 0.470 | 1.452 | 1.486   | Y
BRT::A                 | 0.403 | 0.281 | 1.500 | 1.621   | Y
BgN-Score::A           | 0.388 | 0.340 | 1.511 | 1.817   | Y
kNN::XG                | 0.386 | 0.306 | 1.512 | 1.641   | Y
SVM::G                 | 0.350 | 0.313 | 1.535 | 1.706   | Y
X-Score::HPScore       | 0.341 | 0.339 | 1.540 | 1.509   | N
BsN-Score::XARG        | 0.290 | 0.230 | 1.560 | 1.705   | Y
RF::XARG               | 0.289 | 0.219 | 1.519 | 1.719   | Y
SYBYL::ChemScore       | 0.255 | 0.228 | 1.580 | –       | U
DrugScore::PairSurf    | 0.225 | 0.170 | 1.590 | –       | U
DS::PMF04              | 0.183 | 0.200 | 1.610 | –       | U
DNN-Score::X           | 0.039 | 0.048 | 1.640 | 2.255   | Y
Lower portion (ML SFs retrained on the original training set Pr, which overlaps the test set):
BRT::A                 | 0.989 | 0.996 | 0.243 | 0.263   | N
BsN-Score::XARG        | 0.979 | 0.980 | 0.405 | 0.410   | N
BgN-Score::XA          | 0.968 | 0.977 | 0.445 | 0.598   | N
RF::XARG               | 0.965 | 0.975 | 0.440 | 0.588   | N
SVM::A                 | 0.964 | 0.978 | 0.438 | 0.440   | N
kNN::XA                | 0.842 | 0.797 | 0.883 | 0.905   | N
DNN-Score::X           | 0.748 | 0.716 | 1.080 | 1.085   | N
MARS::XR               | 0.619 | 0.587 | 1.287 | 1.292   | N
MLR::RG                | 0.522 | 0.554 | 1.398 | 1.393   | N

Trypsin (N = 73)
Upper portion (training and test sets disjoint for the ML SFs):
Scoring function       | Rp(1) | Rs(1) | SD(1) | RMSE(1) | D(2)
MARS::XARG             | 0.842 | 0.825 | 0.910 | 1.269   | Y
kNN::XA                | 0.833 | 0.782 | 0.936 | 0.970   | Y
SYBYL::ChemScore       | 0.829 | 0.773 | 0.950 | –       | U
SVM::XA                | 0.826 | 0.821 | 0.953 | 1.029   | Y
DS::Ludi2              | 0.823 | 0.791 | 0.960 | –       | U
X-Score::HSScore       | 0.817 | 0.824 | 0.970 | 1.401   | N
DS::PLP2               | 0.797 | 0.774 | 1.020 | –       | U
BsN-Score::AR          | 0.793 | 0.727 | 1.080 | 1.119   | Y
MLR::XG                | 0.783 | 0.786 | 1.050 | 1.370   | Y
BgN-Score::XAR         | 0.778 | 0.728 | 1.060 | 1.070   | Y
RF::XAR                | 0.774 | 0.753 | 1.070 | 1.133   | Y
BRT::XA                | 0.771 | 0.748 | 1.076 | 1.128   | Y
DNN-Score::X           | 0.735 | 0.672 | 1.140 | 1.209   | Y
Lower portion (ML SFs retrained on the original training set Pr, which overlaps the test set):
BRT::XAG               | 0.969 | 0.961 | 0.420 | 0.425   | N
BsN-Score::XA          | 0.968 | 0.961 | 0.424 | 0.429   | N
SVM::A                 | 0.965 | 0.979 | 0.445 | 0.454   | N
BgN-Score::XAG         | 0.950 | 0.966 | 0.439 | 0.475   | N
kNN::XA                | 0.936 | 0.926 | 0.596 | 0.610   | N
RF::XAG                | 0.934 | 0.928 | 0.600 | 0.657   | N
MARS::XAR              | 0.861 | 0.828 | 0.861 | 0.953   | N
MLR::XARG              | 0.806 | 0.795 | 1.001 | 1.136   | N
DNN-Score::X           | 0.829 | 0.789 | 0.940 | 0.957   | N

Carbonic anhydrase (N = 44)
Upper portion (training and test sets disjoint for the ML SFs):
Scoring function       | Rp(1) | Rs(1) | SD(1) | RMSE(1) | D(2)
DS::PLP2               | 0.800 | 0.772 | 0.840 | –       | U
MLR::XR                | 0.729 | 0.634 | 0.954 | 2.969   | Y
BRT::A                 | 0.721 | 0.631 | 0.965 | 3.508   | Y
MARS::G                | 0.706 | 0.651 | 0.986 | 2.777   | Y
SYBYL::G-Score         | 0.706 | 0.646 | 0.990 | –       | U
SYBYL::ChemScore       | 0.699 | 0.631 | 1.000 | –       | U
SVM::G                 | 0.695 | 0.598 | 1.001 | 2.919   | Y
BsN-Score::X           | 0.674 | 0.434 | 1.030 | 3.418   | Y
BgN-Score::XA          | 0.669 | 0.527 | 1.030 | 3.541   | Y
kNN::XAG               | 0.661 | 0.616 | 1.045 | 3.793   | Y
DNN-Score::X           | 0.631 | 0.451 | 1.080 | 3.561   | Y
SYBYL::PMF-Score       | 0.627 | 0.618 | 1.090 | –       | U
RF::XARG               | 0.601 | 0.374 | 1.112 | 3.393   | Y
Lower portion (ML SFs retrained on the original training set Pr, which overlaps the test set):
BRT::A                 | 0.988 | 0.979 | 0.214 | 0.218   | N
BsN-Score::XARG        | 0.948 | 0.921 | 0.441 | 1.004   | N
SVM::A                 | 0.946 | 0.974 | 0.451 | 0.455   | N
RF::XARG               | 0.910 | 0.860 | 0.57  | 1.140   | N
kNN::RG                | 0.896 | 0.827 | 0.619 | 0.613   | N
BgN-Score::XARG        | 0.884 | 0.766 | 0.650 | 1.320   | N
MARS::XR               | 0.785 | 0.694 | 0.862 | 0.974   | N
MLR::AR                | 0.794 | 0.762 | 0.845 | 1.039   | N
DNN-Score::X           | 0.652 | 0.310 | 1.050 | 1.687   | N

Thrombin (N = 38)
Upper portion (training and test sets disjoint for the ML SFs):
Scoring function       | Rp(1) | Rs(1) | SD(1) | RMSE(1) | D(2)
DNN-Score::X           | 0.756 | 0.704 | 1.380 | 1.433   | Y
BRT::XG                | 0.738 | 0.722 | 1.429 | 1.432   | Y
BsN-Score::X           | 0.730 | 0.710 | 1.449 | 1.586   | Y
BgN-Score::XAG         | 0.723 | 0.726 | 1.465 | 1.562   | Y
kNN::XG                | 0.713 | 0.722 | 1.487 | 1.586   | Y
MARS::X                | 0.704 | 0.644 | 1.505 | 1.588   | Y
RF::XARG               | 0.697 | 0.693 | 1.520 | 1.674   | Y
MLR::XG                | 0.674 | 0.624 | 1.566 | 1.681   | Y
DS::PLP1               | 0.667 | 0.672 | 1.580 | –       | U
SYBYL::G-Score         | 0.667 | 0.626 | 1.580 | –       | U
X-Score::HSScore       | 0.666 | 0.586 | 1.580 | 1.737   | N
DrugScore::Pair        | 0.651 | 0.622 | 1.611 | –       | U
SVM::X                 | 0.647 | 0.575 | 1.615 | 1.607   | Y
Lower portion (ML SFs retrained on the original training set Pr, which overlaps the test set):
BRT::XG                | 0.975 | 0.989 | 0.471 | 0.548   | N
BsN-Score::XA          | 0.955 | 0.976 | 0.637 | 0.747   | N
SVM::XA                | 0.927 | 0.948 | 0.793 | 0.820   | N
RF::XARG               | 0.910 | 0.934 | 0.860 | 1.125   | N
BgN-Score::XA          | 0.901 | 0.935 | 0.920 | 1.113   | N
kNN::XA                | 0.859 | 0.873 | 1.084 | 1.175   | N
DNN-Score::X           | 0.761 | 0.756 | 1.371 | 1.374   | N
MARS::XR               | 0.746 | 0.691 | 1.412 | 1.449   | N
MLR::XRG               | 0.724 | 0.638 | 1.463 | 1.553   | N
(1) Rp and Rs are the Pearson and Spearman correlation coefficients between predicted and measured BA values of complexes in this protein-family-specific test set, respectively. SD and RMSE are the standard deviation and the root-mean-square of errors between predicted and measured BA values of complexes in this protein-family-specific test set, respectively.
(2) This indicates whether the test set complexes are disjoint from (D = Y) or overlap with (D = N) the training set complexes for the NN and the six other ML models. Any overlap between the training and test data of the conventional SFs is unknown (D = U) to us.

As discussed earlier, the protein families of the training and test complexes for the NN and the six other ML models do not overlap and are completely disjoint. When we retrain the ensemble NN and the six other ML SFs on the original training set (Pr), which overlaps with the family-specific test sets, and assess their scoring power on the four homogeneous test sets, we notice that the predictions of the proposed SFs are near perfect, as listed in the lower portion of Table 4.3.

The results listed in Tables 4.2 and 4.3 show the performance of the proposed and conventional SFs on target proteins that are partially or fully encountered in their training sets, or that are completely novel to them. Therefore, we believe that these results are very useful in estimating the accuracy of our scoring models given the number of solved structures of the drug target with other ligands and the availability of their binding data. We performed another experiment in which the ML SFs were calibrated with training sets containing some known binders to the protein target, and we assessed their scoring ability on an independent test set of complexes that all feature the protein target but bound to different ligands. Since HIV protease complexes are the most abundant in our dataset and proved to be the most challenging in the previous experiment (see the upper portion of Table 4.3), we focus on them. Also, we only consider ML SFs in this and subsequent experiments since, for the 17 conventional SFs, we do not have access to their full training set BA values; in these cases, the MLR model, due to its linear nature, can be considered representative of empirical SFs [63, 56].

The SFs based on RF, BRT, SVM, kNN, MARS, and MLR are trained using X, A, R, and XAR features on a training set that includes a randomly selected (without replacement) x% of the 112 complexes (more precisely, x × 112/100 complexes) from the HIV protease homogeneous test set.2 The remaining complexes in the training set are the non-HIV-protease complexes in Pr, except that as many of them as the number of added HIV complexes are randomly excluded from the training set.

2 This experiment was conducted before considering the G-type features and developing the neural-network-based models; therefore they are not included here. However, we expect the performance trends of BsN-Score and BgN-Score to be similar to those of the BRT and RF SFs, respectively.
This means that the size of the training set is fixed and equals the number of non-HIV-protease complexes in Pr. Another, different 20% of the 112 HIV protease complexes (more precisely, 0.2 × 112 ≈ 23 complexes) is randomly selected (without replacement) to form the test set. This process is repeated 1000 times to obtain a robust average Rp scoring measure for various values (0, 5, 10, ..., 80) of x (i.e., the percentage of HIV protease complexes) for the six ML models; the results are plotted in Figure 4.5.

Figure 4.5: Dependence of the scoring performance of machine-learning scoring functions on the number of HIV protease complexes in their training set, evaluated on out-of-sample HIV protease complexes from the refined set of PDBbind 2007. Panels: (a) complexes characterized with X features; (b) A features; (c) R features; (d) XAR features.

The performance behavior of the ML SFs is similar across the four feature sets, as shown in panels (a) through (d) of Figure 4.5. There are three other important observations that can be made from these four panels. First, when the training dataset includes none to very few HIV protease complexes (roughly up to 5-10% of the total), the simpler linear model MLR ties with or outperforms some of the more sophisticated nonlinear techniques such as SVM, kNN, BRT, MARS, and RF. This is somewhat similar to what was seen earlier in the upper portion of Table 4.3. Second, the RF model clearly performs the best across the entire range of numbers of HIV complexes in training data characterized with XAR features. Other ML SFs such as kNN, SVM, and BRT also show substantial improvement in performance when trained on larger numbers of HIV complexes. Finally, MLR, due to its rigid linear nature that resembles that of empirical SFs, is not able to effectively exploit increasing numbers of HIV complexes in the training data to improve its scoring accuracy, in contrast to the other flexible models, which show moderate (MARS) to significant improvement (kNN, RF, BRT, and SVM). For instance, for the XAR feature set (see Figure 4.5(d)), the scoring accuracy of some ensemble models improves by more than 90% (RF) or 140% (BRT) when the number of HIV complexes increases from 0% to 80%, whereas the corresponding increase in performance for the linear model does not even exceed 11%. A similar trend can also be observed for the other three feature sets in the other panels.

The analysis of the effectiveness of different regression approaches on specific families of proteins has very practical benefits in virtual screening and drug discovery. Typically, the task in drug design is to conduct computational screening of large numbers of ligands against a certain protein family. The knowledge of the number of known binders to that protein present in the training set, together with characteristic plots of several scoring models (similar to the ones in Figure 4.5), can aid in the selection of the most effective SF.
4.5.3.2 Scoring performance on four protein families from PDBbind 2014

The leave-clusters-out (LCO) training-test sampling strategy applied to the 2014 PDBbind complexes was a stringent experiment to assess the generalization of the proposed and conventional scoring functions when they are applied to score ligands for novel drug targets. The difficulty of the prediction was reflected in the performance drop that can be observed in the novel-targets panel of Figure 4.3. As explained in Section 2.1.3, the accuracy statistics reported in the right panel of Figure 4.3 are averages across several clusters with different protein families and types. In this experiment, we again evaluate the BA predictive accuracy of the NN and conventional SFs on the largest four protein families or clusters. These protein families are the most frequent in PDBbind 2014, which includes 262 HIV protease, 170 carbonic anhydrase (CAH), 98 trypsin, and 79 thrombin complexes. Similar to the experiment on PDBbind 2007 described in the previous section, a test set for each of these protein families was constructed by sampling all complexes formed by that protein from the 2014 PDBbind refined set. The training complexes corresponding to each of these four test sets are the remaining protein-ligand pairs. For each protein family, we fitted the proposed models to the corresponding independent training complexes and validated them on the test set complexes, which are formed between that type of protein and a unique set of co-crystallized ligands.

The prediction accuracy of our proposed models and the conventional scoring functions on complexes formed by the four protein types is shown in Figure 4.6. The right panels of the figure depict the scoring performance when the training and test sets are disjoint (i.e., out-of-sample) for the five scoring functions. It can be observed that the predictive accuracy of all SFs varies from poor to good depending on the protein family. The predictions of all SFs have mediocre correlation with the experimental binding affinities of the ligands that bind to HIV protease and carbonic anhydrase. The highest Pearson's correlation values between predicted and true BAs are 0.451 for HIV and 0.524 for CAH, both obtained by the ensemble NN scoring function BsN-Score. Complexes formed by HIV protease in the 2007 version of PDBbind were also challenging to score accurately. However, many of the scoring functions examined in the previous section were able to correctly reproduce the binding affinities of the 44 carbonic anhydrase complexes of PDBbind 2007; the vast majority of the new complexes in PDBbind 2014 were not part of the 2007 release and are apparently harder to predict. The scoring accuracy on the other two protein families in PDBbind 2014 is substantially better. The binding affinities of ligands bound to trypsin were predicted with an accuracy of at least Rp = 0.820, again with the BsN-Score scoring function. BgN-Score performs almost as well, followed by RF-Score and X-Score, whose accuracies are not far behind.
BsN-Score and BgN-Score also show the highest accuracy on the thrombin dataset, with a linear correlation value of 0.769. The correlation coefficients of RF-Score and X-Score, at 0.554 and 0.542 on this novel target, are substantially lower than those obtained by the three neural-network SFs.

Figure 4.6: Comparison of the scoring and ranking accuracies of scoring functions on four protein-family-specific test sets derived from the refined set of PDBbind 2014. For each family (HIV protease, carbonic anhydrase, trypsin, and thrombin), the left panel shows in-sample and the right panel out-of-sample performance.

In addition to the out-of-sample evaluation of the five SFs, we also examined their in-sample accuracy to empirically test their capacity to exploit their knowledge of known protein-ligand complexes and reproduce observed binding affinities. Unsurprisingly, the nonlinear SFs, especially those based on NNs, show near-perfect predictions, while X-Score performs very poorly on HIV and CAH complexes, as depicted in the left panels of Figure 4.6. This implies that the relationship between the descriptors employed by X-Score and binding affinity is nonlinear. In addition, our experiments show that the six X-Score descriptors do not appear to capture all the binding-related interactions between the protein and ligand molecules. We reached this conclusion after obtaining mediocre performance from a boosted NN model fitted to the six features used by X-Score; however, the performance of this boosted NN model was still substantially better than that of X-Score, in which a linear regression model is used.

The results shown in Figures 4.6 and 4.3 summarize the performance of the proposed and conventional SFs on target proteins that are encountered in their training sets or that are completely novel to them. Therefore, we believe that these results are very useful in estimating the accuracy of our scoring models given the number of solved structures of the drug target with other ligands and the availability of their binding data. We performed another experiment in which the ML SFs were calibrated with training sets containing some known binders to the protein target, and we assessed their scoring ability on an independent (OOS) test set of complexes that all feature the protein target but bound to different ligands. Since HIV protease and carbonic anhydrase complexes are the most abundant in our dataset and proved to be the most challenging in the previous experiment (see the out-of-sample panels of Figure 4.6), we focus on them.
Due to the large number of models required for this experiment and the computational resources needed to train deep neural networks, the boosting neural network SF, BsN-Score, is replaced with its sister boosted decision tree scoring function, BT-Score. The other two NN SFs, BgN-Score and DNN-Score, are not considered in this experiment for the same reason. BsN-Score's surrogate, BT-Score, is also used in the experiments in the following sections, since they too involve building a large number of SFs to obtain reliable results and trends. BT-Score is compared to our versions of the conventional ML SF RF-Score and the empirical SF X-Score.

The SF BT-Score is fitted to the 2714 multi-perspective descriptors, while RF-Score and X-Score are fitted to their original 36 geometrical and 6 physiochemical features, respectively. Before training an SF based on these models, half of the 262 HIV and 170 CAH PLCs are randomly sampled and held aside for testing, while the remaining 130 and 85 complexes, respectively, are available for training. The SFs are then trained on 3000 complexes sampled randomly (without replacement) from the PDBbind refined set plus a fraction x of the 130 HIV (or 85 CAH) training complexes. The overall size of the training set for any SF was fixed at 3000 complexes regardless of the number of added HIV or CAH complexes (by removing extra non-HIV or non-CAH PLCs from the original 3000 complexes). The scoring and ranking performance of the SFs is evaluated on the OOS 50% HIV (or CAH) test complexes. Then another 50% of the 262 HIV (or 170 CAH) complexes is randomly sampled to form the next training and test sets. This process is repeated 50 times to obtain robust average scoring (Rp) and ranking (Rs) measures for various values of x ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} (i.e., the fraction of the 130 HIV and 85 CAH training complexes) for the three models; the results are plotted in Figure 4.7.

The overall scoring and ranking performance trends of the three SFs appear to be similar for both HIV and CAH complexes, as shown in Figure 4.7. There are three important observations that can be made from these four panels. First, the XGBoost SF, BT-Score, clearly performs the best across the entire range of numbers of HIV complexes and over more than half of the range of CAH complexes in the training data. The Random Forest SF, RF-Score, also shows substantial improvement in performance when trained on larger numbers of HIV and CAH complexes. Second, when the training dataset includes less than half of the CAH complexes, RF-Score slightly outperforms BT-Score in both the scoring and ranking tests. Finally, X-Score, due to its rigid linear nature, is not able to effectively exploit increasing numbers of HIV and CAH complexes in the training data to improve its scoring and ranking accuracy, in contrast to the two other, flexible models, which show significant improvement.
Figure 4.7: Dependence of the scoring and ranking performance of machine-learning scoring functions on the number of HIV protease (left) and carbonic anhydrase (right) complexes in their training set, evaluated on out-of-sample HIV protease and carbonic anhydrase complexes from the refined set of PDBbind 2014. The horizontal axis is the fraction of family training complexes added to the main training set ({Refined Set} \ {All family complexes}).

To put that in perspective, the scoring and ranking accuracy of BT-Score improves by more than 140% when the number of HIV complexes in the training data increases from 0 to 208, whereas the corresponding increase in performance for the linear model X-Score does not even exceed 2%.

The analysis of the effectiveness of different regression approaches on specific families of proteins has very practical benefits in virtual screening and drug discovery. Typically, the task in drug design is to conduct computational screening of large numbers of ligands against a certain protein family. The knowledge of the number of known binders to that protein present in the training set, together with characteristic plots of several scoring models (similar to the ones in Figure 4.7), can aid in the selection of the most effective SF.

4.5.4 Performance of machine-learning scoring functions on novel targets

4.5.4.1 Simulating novel targets using PDBbind 2007

The training-test set pair (Pr, Cr) of PDBbind 2007 is a useful benchmark when the aim is to evaluate the performance of scoring functions on targets that have some degree of sequence similarity with at least one protein present in the complexes of the training set. This is typically the case since, as mentioned earlier, 92% of drug targets are similar to known proteins [113]. When the goal is to assess scoring functions in the context of novel protein targets, however, the training-test set pair (Pr, Cr) is not as suitable because of the partial overlap in protein families between Pr and Cr. We considered this issue to some extent in Section 4.5.3.1, where we investigated the performance of SFs on four different protein-specific test sets from PDBbind 2007 after training them on complexes that did not include the protein under consideration. This resulted in a drop in performance of all SFs, especially in the case of HIV protease as a target. However, even if there are no common proteins between training and test set complexes, different proteins may have sequence and structural similarities at their binding sites, which influence protein-ligand BA. To more rigorously and systematically assess the performance of ML SFs on novel targets, we performed a separate set of experiments in which we limited the BLAST sequence similarity between the binding sites of proteins present in the training and test set complexes in the refined set of PDBbind 2007. Specifically, for each similarity cut-off value S = 30%, 40%, 50%, ..., 100%, we constructed 100 different independent 150-complex test and T-complex training set pairs, trained the scoring models (MLR, MARS, kNN, SVM, BRT, and RF) using two different feature sets (XAR and X features) on the training set, evaluated them on the corresponding test set, and determined their average performance over the 100 rounds to obtain robust results.
Since SF prediction performance depends upon both similarity cut-off and training set size, and since training set size is constrained by the similarity cut-off (a larger S means a larger feasible T), we investigated different ranges of S (30% to 100%, 50% to 100%, and 70% to 100%), and for each range we set T close to the largest feasible value for the smallest S value in that range. Each test and training set pair was constructed as follows. We randomly sampled a test set of 150 protein-ligand complexes without replacement from the refined sets of PDBbind 2007 and 2010, which together provide 2182 high-quality protein-ligand complexes. The remaining (2182 - 150 =) 2032 complexes were randomly scanned until T different training complexes were found that had protein binding site similarity of S% or less with the protein binding sites of all complexes in the test set; if fewer than T such complexes were found, then the process was repeated with a new 150-complex test set.

The accuracy of the six scoring models in terms of the Pearson correlation coefficient Rp is depicted in Figure 4.8 for a variety of similarity cut-offs and training set sizes.

Figure 4.8: Binding affinity prediction accuracy of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes of PDBbind 2007. Panels: (a) T = 120 training complexes with XAR features; (b) T = 520 training complexes with XAR features; (c) T = 1020 training complexes with XAR features; (d) T = 120 training complexes with X features; (e) T = 520 training complexes with X features; (f) T = 1020 training complexes with X features.

The plots in the first column (Figure 4.8 (a) and (d)) are for similarity cut-offs 30% to 100%, for which T = 120 complexes is the largest training set size feasible at S = 30%. The upper plot (Figure 4.8 (a)) shows these results when protein-ligand complexes are described by XAR features, and the bottom plot when such compounds are characterized by X-Score (X) features alone. When XAR features are used, RF and kNN perform the best across the entire range of S when T = 120. In the corresponding bottom plot, in which the six X features are considered, it can be easily observed that MLR, kNN, RF, and BRT are the top four performing models. In both feature set cases, we see that no ML SF is able to predict the measured BA with good accuracy (Rp ≤ 0.5 when S = 30%). This can be attributed to three factors. First is the significant sequence, and hence structural, dissimilarity between the binding sites of proteins in training and test complexes. Second is the limited size of the training data. We will show how important the size of training data is in influencing the quality of scoring models in Section 4.5.5.
Finally, the scoring model parameters used for this experiment were based on parameter tuning with respect to Pr as described in Section 2.2 rather than with respect to the training set used for this experiment, which is not only small but also has a different mix of complexes governed by the small similarity cut-off of 30%. This is evident in Figure 4.8 (a) from the inferior performance of several SFs (MLR, SVM, MARS and, to a limited extent, BRT) when XAR features are used instead of X features (which are a subset of the XAR features), signifying overfitting or suboptimal parameter values. The reason the RF model had relatively good performance compared to the other generally-competitive ML models for both XAR and X features is most likely that it is resistant to overfitting and less sensitive to parameter tuning. Due to the time-intensive nature of tuning parameters of multiple scoring models for the many training scenarios we consider in this work, we used the parameters tuned with respect to Pr. For a given similarity cut-off, not only does the scoring performance of models improve with the size of the training dataset, the overfitting/parameter-tuning problem also gets alleviated or minimized. For example, for a similarity cut-off of 50%, BRT::XAR has inferior performance relative to BRT::X when the training dataset has 120 complexes (0.46 vs. 0.49), but the opposite is true when the training set has 520 complexes (0.54 vs. 0.51). Also, the performance of RF and the other generally-competitive ML models (BRT and kNN) becomes comparable as training dataset size increases, signifying that parameter tuning becomes less of an issue.

To summarize, imposing a sequence similarity cut-off between the binding sites of proteins in training and test set complexes has an expected adverse impact on the accuracy of all scoring models. However, increasing the number of training complexes helps improve accuracy for all similarity cut-offs. RF has the best accuracy considering the entire range of similarity cut-offs. The other generally-competitive ML models (BRT, kNN, and SVM) may also provide comparable accuracy if parameter tuning is performed with respect to the training set being considered.

4.5.4.2 Simulating novel targets using PDBbind 2014

In this section, we test the scoring ability of scoring functions on novel targets extracted from the 2014 release of PDBbind. We addressed this issue to some extent using our leave-cluster-out (LCO) test sets by restricting the similarity between training and test complexes to 90% or less using BLAST. Figure 4.3 shows the gap between the accuracies of SFs on Cr complexes as opposed to LCO PLCs. In Section 4.5.3.2, we also investigated the performance of SFs on four different PDBbind 2014 protein-specific test sets after training them on complexes that did not have the protein under consideration. This resulted in a drop in performance of all SFs, especially in the case of the HIV protease and carbonic anhydrase targets. As we explained in the previous section, even if there are no common proteins between training and test set complexes, different proteins may have sequence and structural similarities at their binding sites, which influence protein-ligand BA. To more rigorously and systematically assess the performance of ML SFs on novel targets from the 2014 release of PDBbind, we performed another set of experiments in which we limited the BLAST sequence similarity between the binding sites of proteins present in the training and test set complexes.
The procedure for simulating novel targets using PDBbind 2014 has some similarities with the novel-targets experiment performed on the 2007 release of PDBbind in Section 4.5.4.1. Specifically, for each similarity cut-off value S = 30%, 40%, 50%, ..., 100%, we constructed 100 different independent 100-complex test and 3000-complex training set pairs, trained the scoring models BT-Score, RF-Score, and X-Score using their respective feature sets (2714 multi-perspective, 36 geometric, and 6 physiochemical descriptors, respectively) on the training set, evaluated them on the corresponding test set, and determined their average performance over the 50 rounds to obtain robust results. Each test and training set pair was constructed as follows. We randomly sampled a test set of 100 protein-ligand complexes without replacement from the 3446 complexes of the 2014 PDBbind refined set. The remaining 3346 complexes were randomly scanned until 3000 different complexes were found that had protein binding site similarity of S% or less with the protein binding sites of all complexes in the test set; if fewer than 3000 such complexes were found after multiple trials, then the process was repeated with a new 100-complex test set.

The accuracy of the three scoring models in terms of Pearson (Rp) and Spearman (Rs) correlation coefficients is depicted in Figure 4.9 for a variety of similarity cut-offs.

Figure 4.9: Binding affinity prediction accuracy of scoring models as a function of BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes of PDBbind 2014.

BT-Score produces substantially better predictions than RF-Score and X-Score regardless of the level of similarity between the binding pockets of training and test proteins. The performance of BT-Score and RF-Score improves steadily as they are trained and tested on proteins with BLAST similarity of 70% or more. On the other hand, the improvement slope of X-Score appears flatter than those of BT-Score and RF-Score. The slow response of X-Score to increasing similarity between training and test proteins can be attributed to the high rigidity of its underlying linear model. The small number of descriptors used by X-Score also limits its ability to capture and utilize the similarity information between hundreds of different training protein families. The SF BT-Score, and to some extent RF-Score, does not suffer from model rigidity and descriptor limitations. We trained and tested boosted regression tree models in conjunction with the six physiochemical descriptors of X-Score and noticed better predictions and sharper responses to train-test similarity than those obtained using the original linear regression model. Due to the sufficiently large number of training complexes in PDBbind 2014, we notice that the performance of BT-Score is always better than that of RF-Score and X-Score despite its use of a much bigger feature set (2714 vs. the 36 and 6 descriptors used by RF-Score and X-Score, respectively). In the case of the 2007 PDBbind simulation in the previous section, we noticed that a larger number of descriptors had some negative effect on the predictive accuracy of scoring functions when trained on smaller datasets.
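The rejection-sampling construction of each similarity-constrained training/test pair, used here and in the previous section, can be sketched as follows. The pairwise BLAST binding-site similarities are assumed to be precomputed; in this sketch they are mocked with a simple family structure, and the dataset size, T, and S follow the 2007 experiment.

# --- begin sketch (Python) ---
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the 2182 complexes of the PDBbind 2007+2010 refined sets.  Pairwise
# binding-site similarities (in %) are mocked as near-identical within a hypothetical
# protein family and dissimilar across families.
N = 2182
family = rng.integers(0, 300, size=N)
sim = rng.uniform(10, 30, size=(N, N))
same = family[:, None] == family[None, :]
sim[same] = rng.uniform(90, 100, size=same.sum())

def sample_pair(S, T, n_test=150, max_tries=20):
    """Rejection sampling of a (training, test) index pair with all
    train-test binding-site similarities <= S%."""
    for _ in range(max_tries):
        order = rng.permutation(N)
        test_idx, candidates = order[:n_test], order[n_test:]
        eligible = candidates[sim[candidates][:, test_idx].max(axis=1) <= S]
        if len(eligible) >= T:
            return eligible[:T], test_idx      # stop scanning after T eligible complexes
        # otherwise: too few eligible training complexes, redraw the test set
    return None

pair = sample_pair(S=30, T=120)
print("feasible" if pair is not None else "infeasible",
      "for S = 30% with T = 120 training complexes")
# --- end sketch ---

Whether a given (S, T) pair is feasible depends entirely on how the similarities are distributed across the dataset, which is why the largest feasible T shrinks as the cut-off S decreases.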
In this section, our results show that larger training set sizes make it possible to incorporate more descriptors without incurring model overfitting. In the next section, we will investigate the effect of training set size on scoring accuracy in more detail.

4.5.5 Impact of the number of training protein-ligand complexes on the scoring and ranking accuracies

4.5.5.1 Simulating increasing training set size using PDBbind 2007 & 2010

Experimental information about 3D structures and binding affinities of new protein-ligand complexes is regularly determined. This contributes to the growing size of public biochemical repositories and corporate compound banks. To assess the impact that a larger training set size would have on the predictive accuracy of ML SFs, we consider the 2007 and 2010 PDBbind datasets; more than 750 new protein-ligand complexes were deposited into this database between 2007 and 2010. (This experiment was conducted prior to considering the 2014 version of PDBbind, therefore complexes from that set are not used here. However, we expect similar performance trends for all scoring functions.) After setting aside the core (Cr) test set of PDBbind 2007 for evaluation, we use the remaining complexes of the PDBbind 2007 and 2010 refined sets for training, which we denote by Tr. For a given number of training complexes x, x = 1 × 178, 2 × 178, . . . , 10 × 178, we select x complexes randomly (without replacement) from Tr to train the six ML models. We then test them on the disjoint core set Cr, which comprises 195 complexes. This process is repeated 100 times to obtain robust average Rp values, which are plotted in Figure 4.10.

Figure 4.10: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on training set size. The models are trained on complexes selected randomly (without replacement) from the 2007 and 2010 refined sets of PDBbind and tested on out-of-sample complexes from the core (Cr) test set of PDBbind 2007.

Figure 4.10 shows the scoring and ranking accuracies of the RF, BRT, SVM, kNN, MARS, and MLR SFs when the datasets are characterized using XAR features. (This simulation was performed before extracting the G-type features and developing the neural network based models, therefore they are not included here. The performance trends of BsN-Score and BgN-Score are expected to be similar to those of the BRT and RF SFs, respectively.) In both cases (scoring and ranking), we observe that the performance of almost all models improves as the size of the training dataset increases. When the number of features is high (|XAR| = 72) and the size of the training dataset is relatively small, we notice poor accuracy of some SFs such as MLR and SVM. This is perhaps a consequence of overfitting since the ratio of training dataset size to number of features is small. The performance of MLR and SVM improves substantially when the training dataset size increases and/or lower dimensional data is used. RF and BRT based SFs are almost always the leading models regardless of the size of the training set.
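A learning-curve experiment of this kind can be sketched as follows; random arrays stand in for the XAR-characterized complexes, scikit-learn's RandomForestRegressor stands in for the RF SF, and the repetition count is reduced. SF designers could adapt the same loop to their own models and data.

# --- begin sketch (Python) ---
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)

# Stand-ins for the XAR-characterized training pool (Tr) and the 2007 core set (Cr).
X_tr, y_tr = rng.normal(size=(1780, 72)), rng.normal(size=1780)
X_cr, y_cr = rng.normal(size=(195, 72)), rng.normal(size=195)

for x in range(178, 1781, 178):                  # training-set sizes 178, 356, ..., 1780
    rp = []
    for _ in range(10):                          # 100 repetitions in the actual experiment
        idx = rng.choice(len(y_tr), size=x, replace=False)
        rf = RandomForestRegressor(n_estimators=200).fit(X_tr[idx], y_tr[idx])
        rp.append(pearsonr(rf.predict(X_cr), y_cr)[0])
    print(f"{x:4d} training complexes: mean Rp = {np.mean(rp):.3f}")
# --- end sketch ---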
From Figure 4.10, we can conclude that most ML SFs have a potential for improvement as more training data becomes available. SF designers can conduct similar experiments to estimate the accuracy enhancement expected when their proposed functions are recalibrated on larger numbers of data instances. The non-linear ML approaches fitted to the 6 X-type features (whose accuracy plots are not shown) were also found to improve as the number of their training complexes increases. The MLR::X model, on the other hand, which resembles empirical SFs, showed minimal response to increasing the training set size, which is an indication of a very small potential for improvement in the future. This is due to the rigidity of linear models, whose performance tends to saturate. Many of the 16 conventional SFs considered in this study are empirical and thus most likely suffer from the same limitation. In fact, the best performing model of those 16 in terms of scoring power, X-Score::HMScore, is very similar to MLR::X. That is because both SFs use almost the same features and both are linearly fit to the same training data. Therefore, one should consider better prediction approaches to derive accurate models from the training data available on hand and from future updates.

4.5.5.2 Simulating increasing training set size using PDBbind 2014

In this section, we consider a larger number of complexes obtained from the 2014 version of PDBbind to simulate the performance of scoring functions on increasing sizes of training data. More than 2100 new protein-ligand complexes were deposited into this database between 2007 and 2014. (This experiment was conducted prior to the release of the 2015 and 2016 versions of PDBbind, therefore complexes from the new releases are not used here. However, we expect similar performance trends for all scoring functions.) To be able to use a greater range of training set sizes, we choose the refined (Ref) set of PDBbind 2014 to build and test three different scoring models. For a given number of training complexes x, where x ∈ {100, 680, 1260, 1840, 2420, 3000}, we select x complexes randomly (without replacement) from Ref to train the three ML models. We then test them on the out-of-sample core set Cr comprising 195 complexes. This process is repeated 50 times to obtain robust average scoring (Rp) and ranking (R1-2-3) values, which are plotted in Figure 4.11.

Figure 4.11: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on training set size. The models are trained on complexes selected randomly (without replacement) from the refined set of PDBbind 2014 and tested on out-of-sample complexes from the core (Cr) test set of PDBbind 2014.

Figure 4.11 shows the scoring and ranking accuracies of BT-Score and our versions of RF-Score and X-Score. The figure clearly shows that the scoring and ranking accuracies of BT-Score and RF-Score improve as the size of the training dataset increases. The scoring performance of X-Score is substantially lower than both ensemble SFs and its rate of improvement is almost zero beyond 680 training complexes. Its ranking accuracy, however, is on par with BT-Score, and both show similar gains in ranking performance as they are exposed to more training complexes.
A similar improvement was also achieved using Random Forest (whose accuracy plots are not shown) when it was fitted to our multi-perspective descriptors instead of RF-Score's original 36 geometric descriptors. Another set of Random Forest models fitted to the 6 features used by X-Score was also found to improve in scoring accuracy as the number of training complexes increases. This indicates that ensemble SFs have a potential for improvement as more training data becomes available. The linear model used in X-Score (and other empirical SFs), on the other hand, showed minimal rise in scoring power with increasing training set size, which is an indication of a very small potential for improvement in the future. Overall, the improvement plots of BT-Score, RF-Score, and X-Score on PDBbind 2014 complexes are similar to the improvement trends of the corresponding models BRT, RF, and MLR on the PDBbind 2007 benchmark, as can be observed by comparing Figures 4.10 and 4.11.

4.5.6 Impact of the type and number of descriptors on the scoring and ranking accuracies

4.5.6.1 Type and number of descriptors characterizing PDBbind 2007 complexes

The BA of a protein-ligand complex depends on many physicochemical interaction factors that are too complex to be accurately captured by any one approach. Therefore, we perform three different experiments to investigate how utilizing different types of features from different scoring tools (X-Score, AffiScore, RF-Score, and GOLD) and considering an increasing number of features affect the performance of the various ML models. We also test how increasing the number of descriptor sets affects the performance of three scoring models. In the first experiment, six ML models (RF, BRT, SVM, kNN, MARS, and MLR) were fitted to the PDBbind primary Pr training complexes characterized by X, A, R, G, and XARG features and tested on the corresponding core test set Cr characterized by the same features. (The training (Pr) and test (Cr) sets were based on the 2007 PDBbind datasets; this experiment was conducted before the release of the 2014-2016 PDBbind datasets and prior to developing the neural network based models. We expect the performance trends of BsN-Score and BgN-Score to be similar to those of the BRT and RF SFs.) Table 4.4 reports the scoring (Rp) and ranking (R1-2-3) performance statistics for the resulting 36 SFs.

Table 4.4: Scoring and ranking accuracies of machine-learning scoring functions when protein-ligand complexes are characterized by different descriptor types

                    Scoring power (Rp)                      Ranking power (R1-2-3 in %)
Model        X      A      R      G      XARG        X      A      R      G      XARG
BsN-Score    0.694  0.737  0.710  0.728  0.816        52.3   47.7   41.5   41.5   64.3
BgN-Score    0.664  0.783  0.750  0.550  0.808        53.8   53.8   44.6   41.5   60.0
DNN-Score    0.675  0.583  0.503  0.501  0.631        33.8   35.7   35.3   35.1   38.9
RF           0.743  0.763  0.777  0.726  0.804        56.6   46.5   52.1   49.9   63.8
BRT          0.719  0.787  0.746  0.720  0.810        60.6   51.7   50.8   44.6   63.1
SVM          0.693  0.716  0.739  0.603  0.779        44.6   46.2   38.5   36.9   56.9
kNN          0.703  0.716  0.717  0.613  0.753        52.3   50.8   44.6   33.8   53.8
MARS         0.641  0.675  0.625  0.591  0.681        41.5   44.6   44.6   35.4   41.5
MLR          0.648  0.681  0.602  0.555  0.684        49.2   44.6   50.8   44.6   50.8

We notice that the scoring and ranking accuracies of almost all models, except MARS and MLR, improve by considering more than one type of features rather than just X, A, R, or G features alone. The results of Table 4.4 are useful in assessing the relative benefit of different types of features for the various ML models.
A pertinent issue when considering a variety of features is how well different SF models exploit an increasing number of features. The features we consider are the X, A, and G sets, and a larger set of geometrical features than the R feature set available from the RF-Score tool. RF-Score counts the number of occurrences of 36 different protein-ligand atom pairs within a distance of 12 Å. In order to have more features of this kind for this experiment, we produce 36 such counts for each of five contiguous distance intervals of 4 Å: (0 Å, 4 Å], (4 Å, 8 Å], . . . , (16 Å, 20 Å]. This provides 6 X, 30 A, and 14 G features and (36 × 5 =) 180 geometrical features, for a total of 230 features. We randomly select (without replacement) x features from this pool, where x = 20, 40, 60, . . . , 220, and use them to characterize the 2007 and 2010 PDBbind datasets, which we then use to train the six ML models. These models are subsequently tested on the out-of-sample 2007 Cr dataset characterized by the same features. (This experiment was conducted before the release of the 2014-2016 PDBbind datasets and prior to developing the neural network based models, therefore they are not included here; however, we expect the performance trends of BsN-Score and BgN-Score to be similar to those of the BRT and RF SFs.) This process is repeated 100 times to obtain robust average Rp statistics, which are plotted in Figure 4.12.

Figure 4.12: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on the number of descriptors. The descriptors are drawn randomly (without replacement) from a pool of X, A, R, and G-type features and used to characterize the training complexes of the refined sets of PDBbind 2007 and 2010 and the out-of-sample complexes from the core (Cr) test set of PDBbind 2007.

Clearly, the RF, BRT, SVM, kNN, MARS, and MLR SFs have very different responses to an increase in the number of features. For several of them, peak performance is attained at 60 (kNN and MLR) or 120 (SVM) features, and then there generally tends to be a drop or saturation in performance at larger numbers of features. Although the features we used are distinct, they have varying degrees of correlation between them. This, combined with the larger number of features, leads to overfitting problems for some of the SFs. Further, ML model meta-parameters are not tuned for every number of features chosen. This especially affects the performance of SVM, which we have found to be very sensitive to its parameter values. However, tuning SVM parameters for every number of features is computationally intensive, and therefore we did not attempt to search for the optimal parameter values for every feature set size for it. In contrast to these models, the performance of RF and BRT benefits from an increasing number of features. Based on these results, utilizing as many relevant features as possible in conjunction with ensemble-based approaches like BRT and RF that are resilient to overfitting appears to be the best option.
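The distance-binned pair counts described above can be computed with a few lines of code. The sketch below only shows the shell-counting logic on synthetic coordinates; the actual descriptors are computed from parsed protein and ligand structures over RF-Score's fixed set of heavy-atom element pairs, which is omitted here.

# --- begin sketch (Python) ---
import numpy as np

SHELLS = [(0, 4), (4, 8), (8, 12), (12, 16), (16, 20)]      # five 4-Angstrom intervals

def pair_counts(prot_xyz, prot_elem, lig_xyz, lig_elem):
    """Count protein-ligand element pairs falling in each distance shell."""
    counts = {}
    # Pairwise distances between every protein atom and every ligand atom.
    d = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    for i in range(len(prot_elem)):
        for j in range(len(lig_elem)):
            for lo, hi in SHELLS:
                if lo < d[i, j] <= hi:
                    key = (prot_elem[i], lig_elem[j], f"({lo},{hi}]")
                    counts[key] = counts.get(key, 0) + 1
                    break
    return counts

# Tiny synthetic example: 3 protein atoms and 2 ligand atoms with random coordinates.
rng = np.random.default_rng(3)
print(pair_counts(rng.uniform(0, 15, (3, 3)), ["C", "N", "O"],
                  rng.uniform(0, 15, (2, 3)), ["C", "O"]))
# --- end sketch ---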
4.5.6.2 Number of descriptor sets characterizing PDBbind 2014 complexes

To investigate the effect of increasing the number of descriptor sets, we randomly select a maximum of 100 combinations of x descriptor sets from all possible (16 choose x) combinations of the 16 DDB types listed in Table 3.1, where x ∈ {1, 4, 7, 10, 13, 16}, and use them to characterize the training set complexes in the PDBbind 2014 database, which we then use to train the XGBoost, Random Forest, and MLR models to build the scoring functions BT-Score, RF-Score, and X-Score, respectively. These models are subsequently tested on the 2014 PDBbind Cr benchmark dataset characterized by the same descriptors. For each number of descriptor sets x, the performance of the 100 SFs is averaged to obtain robust overall scoring (Rp) and ranking (R1-2-3) statistics, which are plotted in Figure 4.13.

Figure 4.13: Dependence of scoring (left) and ranking (right) accuracies of machine-learning scoring functions on the number of descriptor sets. The descriptor sets are randomly sampled from all feature types in Descriptor Data Bank. The selected descriptor types are used to characterize the training complexes of the refined set of PDBbind 2014 and the out-of-sample complexes from the core (Cr) test set of PDBbind 2014.

Similar to the cases of increasing training data in terms of new PLCs and of raw feature counts in the previous sections, the ensemble SFs clearly benefit from increasing the number of protein-ligand interaction models in the scoring and ranking tasks. It is clear from the left and right plots that the number of different models of protein-ligand interactions (descriptor sets) is as important as the number of training complexes in improving the quality of prediction. Based on these results, we believe that the public database of protein-ligand interactions we are proposing in this work will be of high value to ML SFs and just as important to their scoring and ranking accuracy as the resources of raw structural and experimental data.

4.6 Conclusion

Our experiments have shown that the proposed neural network SFs, BsN-Score and BgN-Score, achieved the best results in reproducing experimental binding affinities for a large and diverse set of protein-ligand complexes. We further found that ensemble models based on NNs surpass SFs based on the decision-tree ensemble techniques Boosted Regression Trees and Random Forests as well as other state-of-the-art ML algorithms such as SVM. SFs that were trained on a single neural network, which have traditionally been used in drug-discovery applications, showed a linear correlation (Rp) of only 0.627 between observed and predicted binding affinities. On the other hand, BsN-Score and BgN-Score outperform the best of the other ML and existing conventional knowledge-based, force-field-based, and empirical SFs (Rp = 0.844 and 0.840 vs. 0.725 and 0.627, respectively) and those based on a single neural network whose correlation coefficient is 0.803. Similar results have been obtained for the ranking task.
Within clusters of protein-ligand complexes with different ligands bound to the same target protein, we find that the best ML-based SF is able to rank the ligands correctly based on their experimentally-determined binding affinities 61.5% of the time and identify the top binding ligand 81.5% of the time. Given the challenging nature of the ranking problem and that SFs are used to screen millions of ligands, this represents a significant improvement over the best conventional SF we studied, for which the corresponding ranking performance values are 58.0% and 69.1%. The accuracies of ensemble NN SFs are even higher when they predict binding affinities for protein-ligand complexes that are related to their training sets. The high predictive accuracy of the ensemble SFs BsN-Score and BgN-Score is due to the following three factors: (i) the low bias error of the highly-nonlinear neural network base learners, (ii) the low variance error achieved using bagging and boosting, and (iii) the diverse set of multi-perspective descriptors we extract for protein-ligand complexes. We also observed that the ensemble scoring functions benefit more than their conventional counterparts from increases in the number of descriptors and the size of the training dataset. This means that the proposed scoring models will be even more accurate in the future when 3D structures of more protein-ligand complexes are resolved and larger numbers of descriptors are modeled.

CHAPTER 5
DOCKING-SPECIFIC SCORING FUNCTIONS FOR ACCURATE LIGAND POSE PREDICTION

Molecular docking is a computational approach widely used in structure-based drug design to simulate the behavior of drug-like ligands interacting with a target protein to bind with it and form a stable protein-ligand complex. An essential component of molecular docking programs is a scoring function (SF) that is used to identify the most stable binding pose of a ligand, when bound to a receptor protein, from among a large set of candidate poses. Scoring functions employed in most commercial docking software rank ligand poses based on the binding affinities they predict. Although some scoring functions have achieved some success in identifying the correct poses for many protein targets, their accuracy is still inadequate for reliable virtual screening. When a poor ligand pose is mistakenly predicted as the conformation the ligand would take when it interacts with the protein, the ligand could be incorrectly prioritized over other compounds in the database as a better drug candidate. This possibility is not implausible given the current limitations of the binding affinity predictions that conventional scoring functions use to rank-order ligand poses. In this chapter, we introduce a novel type of ensemble neural network scoring function optimized specifically for the docking task. Various other nonparametric ML methods inspired by statistical learning theory are also examined in this chapter to model the unknown function that maps structural and physicochemical information of a protein-ligand complex to a corresponding distance from the native pose (in terms of RMSD value). Ours is the first work to develop and perform a comprehensive assessment of the docking accuracies of task-specific and conventional SFs across both diverse and homogeneous test sets of protein families.
We show that the best docking-specific SF has a success rate of ∼80% compared to ∼70% for the best conventional SF when the goal is to find poses within an RMSD of 1 Å from the native ones for 195 different protein-ligand complexes in the PDBbind 2007 benchmark. Furthermore, ensemble neural network scoring functions optimized for the docking task correctly identified the ligand poses for 95% of the protein targets in the PDBbind 2014 benchmark, as opposed to 82% obtained by the best commercial scoring function, ChemPLP. In this test, the model's top-scoring pose is considered native or near-native (i.e., correctly identified) if it lies within 2 Å RMSD from the ligand's native conformation observed in the crystallographic structure. Such a significant improvement (> 14%) in docking power will lead to better quality drug hits and ultimately help reduce the costs associated with drug discovery.

We seek to advance ligand pose prediction by designing SFs that significantly improve upon the protein-ligand docking performance of conventional SFs. Our approach is to couple the modeling power of flexible machine learning algorithms with training datasets comprising hundreds of protein-ligand complexes with native poses of known high-resolution 3D crystal structures and experimentally-determined binding affinities. In addition, we computationally generate a large number of decoy poses and utilize their RMSD values from the native pose and a variety of features characterizing each complex. We compare the docking accuracies of the proposed ensemble deep neural networks and several other ML models to existing conventional SFs of all three types, force-field, empirical, and knowledge-based, on diverse and independent test sets. We also perform a systematic analysis of the ability of the proposed SFs to identify native poses of ligands that are docked to novel protein targets. Further, we assess the impact of training set size on the docking performance of the conventional BA-based SFs and the proposed RMSD-based models.

The remainder of the chapter is organized as follows. The next section presents the procedure for decoy generation and formation of training and test datasets. Then, we present results comparing the docking powers of conventional and ML SFs on diverse and homogeneous test sets. We also compare the performance of the ML techniques on novel drug targets and analyze how they are impacted by training set size. Finally, we summarize our findings.

5.1 Decoy generation of ligand poses and formation of training and test complexes

We conducted our experiments on two versions of the protein-ligand complex repository PDBbind. In the first set of experiments, we considered the 2007 version of PDBbind, which contains 1300 complexes with experimental binding affinity data. The primary reason for selecting this release was the availability of studies related to this work that used its primary and core datasets for training and benchmarking purposes [17, 27]. This dataset enables us to objectively compare the performance of our docking-specific ML scoring models to conventional SFs published by Cheng et al. [17] and Ballester et al. [27] since the proposed and traditional models are tested on the same benchmark using the same evaluation criteria. Upon the development of our boosted and bagged deep-learning SFs for docking, the 2014 version of PDBbind was released with about 250% more protein-ligand complexes than the 1300 PLCs in the 2007 PDBbind.
Shortly after that, another comparative study of SFs was published which uses the 2014 PDBbind as a benchmark for several popular scoring functions. To evaluate the accuracy of our deep neural network based SFs and compare them consistently with the SFs published therein, we too use the same release of PDBbind and the evaluation criteria employed in the latter comparative study. It should be noted that both comparative studies, based on the 2007 and 2014 releases of PDBbind, are similar in scope and methodology. In this chapter, we will show results on both versions for every experiment we conduct.

5.1.1 Generating decoy poses for protein-ligand complexes in PDBbind 2007

The proteins of both the 2007 PDBbind primary (Pr) training and core (Cr) test sets form complexes with ligands that were observed bound to them during 3D structure identification. These ligands are commonly known as native ligands, and the conformation in which they were found at their respective binding sites is referred to as the true or native pose. In order to assess the docking power of SFs in distinguishing true poses from random ones, a decoy set was generated for each protein-ligand complex in Pr and Cr. We utilize the decoy set produced for the core set Cr by Cheng et al. [17] using four popular docking tools: LigandFit in Discovery Studio, Surflex in SYBYL, FlexX in SYBYL (currently in LeadIT [114]), and GOLD. From each tool, a diverse set of binding poses was generated by controlling docking parameters as described in [17]. This process generated a total of ∼2000 poses for each protein-ligand complex from the four docking protocols combined. Binding poses that are more than 10 Å away, in terms of RMSD (root-mean-square deviation), from the native pose are discarded. The remaining poses are then grouped into ten 1 Å bins based on their RMSD values from the native binding pose. Binding poses within each bin were further clustered into ten clusters based on their similarities [17]. From each such subcluster, the pose with the lowest noncovalent interaction energy with the protein was selected as a representative of that cluster and the remaining poses in that cluster were discarded. Therefore, at the end of this process, decoy sets consisting of (10 bins × 10 representatives =) 100 diverse poses were generated for each protein-ligand complex. Since we have access to the original Cr decoy set, we used it as is, and we followed the same procedure to generate the decoy set for the training data Pr. Since we did not have access to the Discovery Studio software, we did not use the LigandFit protocol for the training data. In order to keep the size of the training set reasonable, we generated 50 decoys for each protein-ligand complex instead of 100 as is the case for Cr complexes. Due to geometrical constraints during decoy generation, the final number of resultant decoys for some complexes does not add up exactly to 50 for Pr and 100 for Cr. It should be noted that the decoys in the training set are completely independent of those in the test set since both datasets share no ligands from which these decoys are generated.

As noted earlier, in the ligand docking problem with which we are concerned in this chapter, the task is to identify the correct binding mode of a ligand from among a set of (computationally-generated) poses. The closer a pose is, in terms of RMSD, to the experimentally-determined native pose, the better [17]. We develop two types of ML SFs in this work to identify poses close to the native one.
The first type is trained on training complexes with their (known, experimentally-determined) binding affinity (BA) as the response variable. To assess their docking accuracy, their predicted BA on a separate set of test complexes is used to distinguish promising poses from less promising ones. Note that for the test complexes, the experimentally-determined BA and actual RMSD values are not used during BA prediction; the actual RMSD value for test complexes is only used to assess docking accuracy. In all previous work, BA has been used for identifying near-native poses, which carries with it the implicit assumption that higher predicted BA implies lower RMSD for a pose. We believe a better approach is to model RMSD instead of BA. Therefore, the second set of SFs we build is trained on training complexes with their (known) RMSD as the response variable. The accuracy of this approach, as in the case of BA-based SFs, is assessed on a separate set of test complexes by ranking poses according to predicted RMSD values: the lower the predicted RMSD, the more likely a pose is closer to the native pose. Note that for the test complexes the experimental BA and actual RMSD values are not used during RMSD prediction; the actual RMSD value is used only for docking accuracy assessment after prediction.

RMSD-based SFs have three advantages over BA-based SFs. First, RMSD-based SFs model the same parameter (RMSD) that is used for pose ranking instead of relying on a related parameter (BA). Second, BA-based SFs are trained on experimental BA values, which are inherently noisy, whereas RMSD-based SFs use computationally-determined RMSD values during training, which makes them less error prone. Third, during training, multiple decoys with different RMSD values can be computationally generated per complex. Therefore, the number of training records that can be utilized by RMSD-based SFs is the product of the number of different training complexes and the average number of computationally-generated poses per training complex. This training set size can be much larger compared to that available to BA-based SFs, which is limited to as many training records as the number of different training complexes because BA values can be experimentally determined only for native poses, not for decoys. As will be shown later, our novel RMSD-based approach provides significantly superior accuracy compared to conventional BA-based prediction.

For the two types of SFs, two versions of training and test datasets are created. The first version uses BA as the dependent variable (Y = BA); the size of Pr remains fixed at 1105 while Cr includes 16554 complex conformations because it consists of native poses and a decoy set for each native pose. The dependent variable of the second version is RMSD (Y = RMSD), and because both training and test sets consist of native and decoy poses, the size of Pr expands to 39,085 while Cr still retains the same 16554 complex conformations. For all protein-ligand complexes, for both native poses and computationally-generated decoys, we extracted X, A, R, and G features. By considering all fifteen combinations of these four types of features (i.e., X, A, R, G, X ∪ A, X ∪ R, X ∪ G, A ∪ R, A ∪ G, R ∪ G, X ∪ A ∪ R, X ∪ A ∪ G, X ∪ R ∪ G, A ∪ R ∪ G, and X ∪ A ∪ R ∪ G), we generated 30 versions of scoring functions. Each scoring function is trained and evaluated on complexes characterized by one of the 15 descriptor combinations and uses RMSD or binding affinity (BA) as the dependent variable (Y).
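The fifteen descriptor combinations can be enumerated programmatically, as in the short sketch below; the per-type descriptor counts are those quoted in the previous chapter (6 X, 30 A, 36 R, and 14 G) and are listed here only to show the size of each combined set.

# --- begin sketch (Python) ---
from itertools import combinations

feature_sets = {"X": 6, "A": 30, "R": 36, "G": 14}    # descriptor types and their sizes

# All fifteen non-empty unions of the four descriptor types (X, A, R, G, XA, ..., XARG).
combos = [c for r in range(1, 5) for c in combinations(feature_sets, r)]
for c in combos:
    print("".join(c), "->", sum(feature_sets[t] for t in c), "descriptors")
print(len(combos), "feature combinations in total")
# --- end sketch ---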
We denote these models using the name of the machine-learning algorithm (as a prefix) and the feature combination (as a suffix). The scoring function BsN-Score::XR, for example, denotes the boosted ensemble neural network model fitted to complexes characterized using the set of descriptors X ∪ R (referred to simply as XR). RMSD- and BA-based scoring functions will be distinguished from each other in the text when necessary.

5.1.2 Generating decoy poses for protein-ligand complexes in PDBbind 2014

The methodology for generating computational poses for the 2014 version of PDBbind is very similar to the process discussed in the previous section for PDBbind 2007. The main differences are the software packages used to perform the docking and the number of poses we generate for each protein-ligand complex. For this release, we use the open-source software AutoDock Vina [83] to conduct docking instead of Surflex in SYBYL [51], FlexX in LeadIT [114], and GOLD [52] due to the expiration of their licenses during the time we conducted this study. We also retain the full number of 100 computational poses for each protein-ligand complex in PDBbind 2014 instead of 50 as in the case of PDBbind 2007 since we have access to more computational resources.

Figure 5.1 summarizes the steps we performed to generate the training poses and calculate their distance from their respective native conformations for PDBbind 2014. First, we randomly translated and rotated the native pose and then docked it into the binding site of the target, which is marked by the original location of the crystallized ligand. We use AutoDock Vina to generate 100 binding modes with its exhaustiveness search argument set to 25. We repeated this random re-orientation and docking process 20 times to obtain a total of (100 × 20 =) 2000 diverse poses around the native conformation. We then calculated the symmetry-corrected root-mean-square deviation of each computationally-generated pose from the native conformation. All poses with RMSD greater than 10 Å from the native conformation were discarded. The remaining poses were then clustered into 10 one-Angstrom-wide bins, where each bin contains 10 representative poses uniformly spaced from each other. Therefore, for every native PLC we obtained a total of 100 poses spanning the [0-10 Å] range, where each pose is approximately 0.1 Å from its two closest neighboring conformations. The pose with 0 Å RMSD is simply the native conformation. This pose-generation strategy was applied to each protein-ligand complex in our training and test datasets formed based on the sampling strategies we describe in Section 2.1. The multi-perspective descriptors were then extracted for each native, near-native, and decoy complex.

Similar to the two versions of the six ML SFs evaluated on PDBbind 2007, here we also fit two versions of the proposed ensemble deep neural network scoring functions. Namely, we build the docking-specific SFs BsN-Dock and BgN-Dock that we train on PLCs with native and computationally-generated poses characterized by the 2714 DDB multi-perspective descriptors and labeled with the RMSD values. We also construct their binding-affinity-based counterparts BsN-Score and BgN-Score that are fitted to complexes of native poses only and characterized by the same 2714 multi-perspective descriptors but labeled using their measured binding affinity values.
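Assuming the symmetry-corrected RMSDs of the roughly 2000 Vina-generated poses have already been computed, the binning and selection step can be sketched as follows; the docking itself and the RMSD calculation are outside the scope of this sketch.

# --- begin sketch (Python) ---
import numpy as np

def select_decoys(rmsds, per_bin=10, n_bins=10):
    """Pick about `per_bin` poses per 1-Angstrom RMSD bin, roughly uniformly spaced.

    `rmsds` holds the RMSD (in Angstroms) of every generated pose from the native
    conformation; poses beyond `n_bins` Angstroms are discarded.
    """
    rmsds = np.asarray(rmsds)
    keep = []
    for b in range(n_bins):
        in_bin = np.where((rmsds > b) & (rmsds <= b + 1))[0]
        if in_bin.size == 0:
            continue
        # Uniformly spaced RMSD targets inside the bin (offsets 0.05, 0.15, ..., 0.95 A).
        targets = b + (np.arange(per_bin) + 0.5) / per_bin
        for t in targets:
            keep.append(in_bin[np.argmin(np.abs(rmsds[in_bin] - t))])
    return sorted(set(keep))                      # de-duplicated pose indices

rng = np.random.default_rng(4)
pose_rmsds = rng.uniform(0, 14, size=2000)        # stand-in for computed pose RMSDs
selected = select_decoys(pose_rmsds)
print(len(selected), "decoy poses retained (plus the native pose at 0 A)")
# --- end sketch ---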
5.2 Conventional scoring functions

The proposed docking-specific models presented in this work are evaluated on the 2007 and 2014 versions of PDBbind and their performance is compared against popular SFs from the literature whose docking accuracy has been reported on the same benchmarks [17, 25].

Figure 5.1: The decoy generation process of ligand poses to train and test scoring functions on complexes from the refined set of PDBbind 2014.

5.2.1 Conventional scoring functions evaluated on the PDBbind 2007 benchmark

A total of sixteen popular conventional SFs are compared to ML SFs on the 2007 version of PDBbind. The sixteen functions are either used in mainstream commercial docking tools and/or have been developed in academia. The functions were recently compared against each other in a study conducted by Cheng et al. [17]. This set includes five SFs in the Discovery Studio software [21]: LigScore, PLP, PMF, Jain, and LUDI; five SFs in the SYBYL software [51]: D-Score, PMF-Score, G-Score, ChemScore, and F-Score; three SFs in the GOLD software [52]: GoldScore, ChemScore, and ASP; and GlideScore in the Schrödinger software [54]. Besides these, two standalone scoring functions developed in academia are also assessed, namely, DrugScore [55] and X-Score [56]. Some of the SFs have several options or versions; these include LigScore (LigScore1 and LigScore2), PLP (PLP1 and PLP2), and LUDI (LUDI1, LUDI2, and LUDI3) in Discovery Studio; GlideScore (GlideScore-SP and GlideScore-XP) in the Schrödinger software; DrugScore (DrugScore-PDB and DrugScore-CSD); and X-Score (HPScore, HMScore, and HSScore). For brevity, we only report the version and/or option that yields the best performance on the PDBbind benchmark considered by Cheng et al.

5.2.2 Conventional scoring functions evaluated on the PDBbind 2014 benchmark

In addition to the conventional methods evaluated on the core test set of PDBbind 2007, we compare the docking accuracy of the proposed deep neural network SFs to five conventional models on the 2014 release of PDBbind. Three of the five SFs are the empirical ChemPLP and ChemScore implemented in GOLD [52] and the empirical GlideScore-SP featured in the modeling software Schrödinger [54]. These three were the top performing models among 18 other scoring functions in identifying the correct binding poses of PDBbind 2014 complexes, as reported by Li et al. in a recent study [25]. That team also found X-Score to be the top performer in correctly predicting binding affinity, and therefore we consider it here too. The ML SF RF-Score is also reported here due to its popularity in the literature and its resemblance to the proposed bagged deep neural networks [27].

5.3 Docking-specific machine-learning scoring functions

5.3.1 Generic ML scoring functions evaluated on the PDBbind 2007 benchmark

For the 2007 version of PDBbind, we utilize a total of six regression techniques in our study: multiple linear regression (MLR), multivariate adaptive regression splines (MARS), k-nearest neighbors (kNN), support vector machines (SVM), random forests (RF), and boosted regression trees (BRT) [44]. We use the optimal parameters listed in Table 5.1 for both the BA and RMSD versions. The experiments presented in this chapter for the six ML methods on complexes from PDBbind 2007 were conducted prior to the development of our proposed ensemble neural network SFs; therefore, we include them separately in this chapter.
However, we expect the performance trends of BsN-Score and BgN-Score to be similar to those of the BRT and RF SFs, respectively. The absolute performance of the NN SFs may in fact be higher, as we demonstrated in the previous chapter when the task was to predict binding affinities. The six ML techniques are implemented in the following R language packages that we use [31]: the package stats readily available in R for MLR, earth for MARS [34], kknn for kNN [38], e1071 for SVM [41], randomForest for RF [45], and gbm for BRT [48]. These methods benefit from some form of parameter tuning prior to their use in prediction. For example, the most important parameters in MARS are the number of terms (or basis functions) in the model, the degree of each term, and the penalty associated with adding new terms. Here we only tune the degree and penalty parameters and leave the final number of terms of MARS models to be automatically selected by the MARS algorithm implementation we use [34]. The kNN method has two parameters that require optimization: the neighborhood size k and the degree of the Minkowski distance q [38]. For the SVM model, we have three parameters to optimize: the complexity constant C, the width of the ε-insensitive zone ε, and the width σ of the radial basis function that is used as a kernel [41]. The RF algorithm has effectively only one important parameter, mtry, which determines the number of features to be randomly selected at each node split when growing the forest's trees [45]. The number of unpruned trees in the forest was fixed at 2000. BRT, on the other hand, has several parameters in addition to the most important two we tune: the number of trees and the interaction depth between the features [48]. The number of trees is optimized automatically using a cross-validation scheme internally implemented in the BRT algorithm [48]. The number of trees is tuned simultaneously with the interaction depth that controls their sizes. The shrinkage (or learning) rate of the BRT algorithm is set to 0.005 in all our experiments.

The values of the aforementioned parameters were selected so as to optimize the mean-squared errors on validation complexes sampled without replacement from the training set and independent of the test data. Out-of-bag instances were used as validation complexes to select the optimal value for the RF parameter mtry. Out-of-bag (OOB) refers to complexes that are not sampled from the training set when bootstrap sets are drawn to fit individual trees in RF models. The parameter values for MARS, kNN, SVM, and BRT were optimized by performing a grid search over a suitable range in conjunction with 10-fold cross-validation over the training set Pr. The resulting optimal parameter values are provided in Table 5.1. This optimization was performed on the primary training (Pr) data for all 15 descriptor sets. For every machine-learning method, we will be using these values to build ML SFs in the subsequent experiments.
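The grid search with 10-fold cross-validation can be illustrated with the following sketch, shown here for the SVM in Python with scikit-learn rather than the R packages actually used; the grid values loosely follow Table 5.1, the data are random stand-ins, and scikit-learn's gamma plays the role of the RBF kernel width parameter.

# --- begin sketch (Python) ---
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

rng = np.random.default_rng(5)
# Stand-in for the Pr training complexes characterized by XAR features.
X_pr, y_pr = rng.normal(size=(1105, 72)), rng.normal(size=1105)

# Candidate SVM parameters (complexity C, epsilon-insensitive zone, RBF width).
grid = {"C": [1, 2, 4], "epsilon": [0.125, 0.25, 0.5], "gamma": [0.031, 0.125, 0.25, 1.0]}
search = GridSearchCV(SVR(kernel="rbf"), grid,
                      scoring="neg_mean_squared_error",
                      cv=KFold(n_splits=10, shuffle=True, random_state=0))
search.fit(X_pr, y_pr)
print("selected parameters:", search.best_params_)
# --- end sketch ---

The same pattern applies to the MARS, kNN, and BRT parameters, whereas mtry for RF is selected using the out-of-bag error instead of cross-validation.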
Table 5.1: Optimal parameter values for MARS, kNN, SVM, RF, and BRT models for the docking task (columns are the descriptor sets)

Model  Parameter          X      A      R      G      XA     XR     XG     AR     AG     RG     XAR    XAG    XRG    ARG    XARG
MARS   Degree             2      1      1      1      1      1      1      1      1      1      1      1      1      1      1
MARS   Penalty            2      6      5      6      7      2      6      7      6      5      6      7      6      5      6
kNN    k                  15     13     14     16     9      19     17     19     18     17     18     19     17     18     19
kNN    q                  1      1      1      1      1      1      1      1      1      1      1      1      1      1      1
SVM    C                  2      2      1      1      1      4      1      2      2      1      1      2      2      1      2
SVM    ε                  0.5    0      0.250  0.250  0.125  0.125  0.250  0.250  0.250  0.125  0.250  0.125  0.125  0.125  0.250
SVM    σ                  1      0.25   0.125  0.250  0.250  0.031  0.031  0.031  0.125  0.031  0.125  0.031  0.031  0.031  0.031
RF     mtry               3      18     8      7      31     5      8      10     16     17     14     20     21     25     35
BRT    Interaction depth  15     17     18     16     19     15     18     19     17     16     16     20     18     17     20
BRT    Number of trees    1114   1523   1573   1208   1371   2113   1610   2950   2181   2303   2213   2590   2854   2921   2859

5.3.2 Ensemble deep neural networks evaluated on the PDBbind 2014 benchmark

We trained and tested the ensemble neural network SFs BsN-Dock, BgN-Dock, BsN-Score, and BgN-Score, and updated (in-house) versions of RF-Score and X-Score, on protein-ligand complexes sampled from PDBbind 2014 according to the procedures described in Section 2.1. A boosted regression tree SF, BT-Dock, based on the XGBoost algorithm was also constructed to substitute for the neural network SFs in computationally expensive experiments that require fitting a large number of models. Our neural network SFs and BT-Dock are trained and tested on complexes characterized by the multi-perspective descriptors derived using the proposed DDB platform from 16 different sources as described in Chapter 3. The parameters of these SFs were tuned in a consistent manner to optimize the mean-squared prediction errors on validation complexes sampled without replacement from the training set and independent of the test sets. Out-of-bag instances were used as validation complexes for BgN-Score, while ten-fold cross-validation was conducted for BsN-Score. The parameters that are tuned and their optimized values are described in Section 4.4.6. The Python language libraries TensorFlow, XGBoost, and Scikit-learn were used to construct BsN-Dock, BgN-Dock, BsN-Score, BgN-Score, BT-Dock, and our versions of RF-Score and X-Score.

5.4 Results and discussion

5.4.1 Evaluation of the docking power of scoring functions

In contrast to our work presented in the previous chapter on improving and examining the scoring and ranking accuracies of different families of SFs, this chapter is devoted to enhancing and comparing SFs in terms of their docking powers. Docking power measures the ability of an SF to distinguish a promising binding mode from a less promising one. Typically, generated conformations are ranked in non-ascending order according to their predicted binding affinity (BA). Ligand poses that are very close to the experimentally-determined ones should be ranked high. Closeness is measured in terms of RMSD (in Å) from the true binding pose. Generally, in docking, a pose whose RMSD is within 2 Å of the true pose is considered a success or a hit. In this chapter, we use comparison criteria similar to those used by Cheng et al. to compare the docking accuracies of the sixteen popular conventional SFs. Doing so ensures a fair comparison of ML SFs to those examined in that study, in which each SF was assessed in terms of its ability to find the pose that is closest to the native one.
More specifically, docking ability is expressed in terms of a success rate statistic S that gives the percentage of times an SF is able to find a pose whose RMSD is within a predefined cutoff value C when only the N topmost poses ranked by their predicted scores are considered. Since success rates for various C (e.g., 0, 1, 2, and 3 Å) and N (e.g., 1, 2, 3, and 5) values are reported in this study, we use the notation SCN, in which the first subscripted value is the RMSD cutoff C and the second is the number of top-scoring poses N, to distinguish between these different statistics. For example, S12 is the percentage of protein-ligand complexes for which either one of the two best scoring poses is within 1 Å of the true pose of a given complex. It should be noted that S01 is the most stringent docking measure, in which an SF is considered successful only if the best scoring pose is the native pose. By the same token, and based on the C and N values listed earlier, the least strict docking performance statistic is S35, in which an SF is considered successful if at least one of the five best scoring poses is within 3 Å of the true pose.

5.4.2 Docking-specific scoring functions vs. conventional approaches on a diverse test set

5.4.2.1 Docking performance on diverse protein families from PDBbind 2007

In this experiment, we compare the docking performance of the six ML SFs to the sixteen conventional approaches on the core test set Cr of PDBbind 2007, which comprises thousands of protein-ligand complex conformations corresponding to 195 different native poses in 65 diverse protein families. As mentioned earlier, we conducted two experiments. In the first, BA values predicted using the conventional and ML SFs were used to rank poses in non-ascending order for each complex in Cr. In the other experiment, RMSD-based ML models directly predicted RMSD values that are used to rank the poses for the given complex in non-descending order. By examining the true RMSD values of the best N scoring ligands using the two prediction approaches, success rates of SFs are computed; these are shown in Figure 5.2. Panels (a) and (b) in the figure show the success rates S11, S21, and S31 for all 22 SFs. The SFs, as in the other panels, are sorted in non-ascending order from the most stringent docking test statistic value to the least stringent one. In the top two panels, for example, success rates are ranked based on S11, then S21 in case of a tie in S11, and finally S31 if two or more SFs tie in S21. In both BA- and RMSD-based scoring, we find that the 22 SFs vary significantly in their docking performance. The top three BA-based SFs, GOLD::ASP, DS::PLP1, and DrugScorePDB::PairSurf, have success rates of more than 60% in terms of the S11 measure. That is in comparison to the BA-based ML SFs, the best of which has an S11 value barely exceeding 50% (Fig. 5.2(a)).

Figure 5.2: Success rates of scoring functions in identifying binding poses that are closest to native conformations of complexes in PDBbind 2007. The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when BA-based scoring is used and the ones on the right show the same results when docking-specific SFs predict RMSD values directly; accuracies of conventional SFs are re-depicted in the right panels for comparison convenience. Panels: (a) C = 1, 2, and 3 Å, N = 1 pose, Y = BA; (b) C = 1, 2, and 3 Å, N = 1 pose, Y = RMSD; (c) C = 2 Å, N = 1 pose, Y = BA; (d) C = 2 Å, N = 1 pose, Y = RMSD; (e) C = 2 Å, N = 1, 2, and 3 poses, Y = BA; (f) C = 2 Å, N = 1, 2, and 3 poses, Y = RMSD; (g) C = 0 Å, N = 1, 3, and 5 poses, Y = BA; (h) C = 0 Å, N = 1, 3, and 5 poses, Y = RMSD.
5.4.2 Docking-specific scoring functions vs. conventional approaches on a diverse test set

5.4.2.1 Docking performance on diverse protein families from PDBbind 2007

In this experiment, we compare the docking performance of the six ML SFs to the sixteen conventional approaches on the core test set Cr of PDBbind 2007, which comprises thousands of protein-ligand complex conformations corresponding to 195 different native poses in 65 diverse protein families. As mentioned earlier, we conducted two experiments. In the first, BA values predicted using the conventional and ML SFs were used to rank poses in non-ascending order for each complex in Cr. In the other experiment, RMSD-based ML models directly predicted RMSD values that are used to rank the poses of a given complex in non-descending order. By examining the true RMSD values of the best N scoring ligands under the two prediction approaches, success rates of SFs are computed; these are shown in Figure 5.2. Panels (a) and (b) in the figure show the success rates S_1^1, S_2^1, and S_3^1 for all 22 SFs. The SFs, as in the other panels, are sorted in non-ascending order from the most stringent docking statistic to the least stringent one: in the top two panels, for example, success rates are ranked based on S_1^1, then on S_2^1 in case of a tie in S_1^1, and finally on S_3^1 if two or more SFs tie in S_2^1.

In both BA- and RMSD-based scoring, we find that the 22 SFs vary significantly in their docking performance. The top three BA-based SFs, GOLD::ASP, DS::PLP1, and DrugScorePDB::PairSurf, have success rates of more than 60% in terms of the S_1^1 measure. That is in comparison to the BA-based ML SFs, the best of which has an S_1^1 value barely exceeding 50% (Fig. 5.2(a)). On the other hand, the six ML SFs that directly predict RMSD values achieve success rates of over 70%, as shown in Figure 5.2(b).

Figure 5.2: Success rates of scoring functions in identifying binding poses that are closest to native conformations of complexes in PDBbind 2007. The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when BA-based scoring is used and the ones on the right show the same results when docking-specific SFs predicted RMSD values directly. Accuracies of conventional SFs are re-depicted on the right panels for comparison convenience.

The top performing of these ML SFs, MARS::XARG, has a success rate of ∼80%. This is a significant improvement (> 14%) over the best conventional SF, GOLD::ASP, whose S_1^1 value is ∼70%. Similar conclusions can also be made for the less stringent docking performance measures S_2^1 and S_3^1, in which the RMSD cut-off constraint is relaxed to 2 Å and 3 Å, respectively. The success rates plotted in the top two panels (Figure 5.2 (a) and (b)) are reported when native poses are included in the decoy sets. Panels (c) and (d) of the same figure show the impact of removing the native poses on the docking success rates of all SFs. It is clear that the performance of almost all SFs does not radically decrease, as can be seen from the difference in their S_2^1 statistics, which ranges from 0 to ∼5%.
This is due to the fact that some of the poses in the decoy sets are actually very close to the native ones. As a result, the impact of allowing native poses in the decoy sets is insignificant in most cases, and we therefore include such poses in all other tests in this chapter. In practice, more than one pose from the outcome of a docking run is usually carried forward to the next stages of drug design for further experimentation. It is therefore useful to assess the docking accuracy of SFs when more than one pose is considered (i.e., N > 1). Figure 5.2 (e) and (f) show the success rates of SFs when the RMSD values of the best 1, 2, and 3 scoring poses are examined. These rates correspond, respectively, to S_2^1, S_2^2, and S_2^3. The plots show a significant boost in performance for almost all SFs. By comparing S_2^1 to S_2^3, we observe a jump in accuracy from 82% to 92% for GOLD::ASP and from 87% to 96% for RF::RG, which models RMSD values directly. Such results signify the importance of examining an ensemble of top-scoring poses, because there is a very good chance it contains relevant conformations and hence good drug candidates.

Upon developing RMSD-based ML scoring models, we noticed an excellent improvement over their binding-affinity-based counterparts, as shown in Figure 5.2. We conducted an experiment to investigate whether they maintain a similar level of accuracy when ML SFs are examined for their ability to pinpoint the native poses from their respective 100-pose decoy sets. The bottom two panels, (g) and (h), plot the success rates in terms of S_0^1, S_0^3, and S_0^5 for the six ML SFs. By examining the five best-scoring poses, we notice that the top BA-based SF, MLR::X, was able to distinguish native binding poses in ∼60% of the 195 decoy sets, whereas the top RMSD-based SF, MARS::XARG, achieved a success rate of S_0^5 = 77% on the same protein-ligand complexes. It should be noted that both sets of ML SFs, the BA- and RMSD-based, were trained and tested on completely disjoint sets of complexes. Therefore, this gap in performance is largely due to the explicit modeling of RMSD values and the corresponding abundance of training data, which includes information from both native and computationally-generated poses.

5.4.2.2 Docking performance on diverse protein families from PDBbind 2014

In this section, the docking-specific SFs based on boosted and bagged deep neural networks, BsN-Dock and BgN-Dock, are compared against their scoring-specific counterparts BsN-Score and BgN-Score as well as the ML SF RF-Score and the empirical SFs X-Score, ChemPLP, ChemScore, and GlideScore-SP on PDBbind 2014. The docking-specific NN SFs are RMSD-based, while the remaining models are trained on complexes labeled with binding affinity data. As in the previous section, all nine SFs were evaluated based on their ability to find the native or near-native binding pose for each of the 195 protein-ligand complexes in the core test set Cr. In addition, six of these SFs were further tested using the stratified cross-validation (CV) and leave-clusters-out (LCO) approaches to assess their generalization abilities in identifying the correct binding poses for known and novel protein targets. Figure 5.3 shows the docking accuracy of the RMSD-based SFs BsN-Dock and BgN-Dock as well as the BA-based models BsN-Score, BgN-Score, RF-Score, and X-Score in finding native or near-native poses for ligands docked to known protein targets in the core (Cr) and cross-validation (CV) test sets and to novel proteins in the leave-clusters-out (LCO) test sets.
More specifically, for each SF and training-test sampling protocol (Cr, CV, and LCO), we report the SF's success rates (S_C^N, in %) in identifying the binding pose that is within C Å of the native conformation by considering the top N scoring poses after ordering the hundred poses of each protein-ligand complex based on the score predicted by the SF. We denote these success rates by S_1^1, S_2^1, and S_3^1 in Figure 5.3, and we order SFs based on their average S_1^1 value across the 195 PLCs in the core test set (Cr).

Figure 5.3: The docking accuracy of docking-specific (proposed) and binding-affinity-based (conventional) scoring functions when evaluated on test complexes with proteins that are either fully represented (Cr), partially represented (CV), or not represented (LCO) in the SFs' training data. The docking accuracy is expressed in terms of the success rates (S_1^1, S_2^1, and S_3^1) of SFs in identifying binding poses that are closest to native ones for protein-ligand complexes derived from the refined set of PDBbind 2014.

The figure clearly shows superior docking performance for the RMSD-based SFs BsN-Dock and BgN-Dock on known and novel targets alike. BsN-Dock and BgN-Dock are able to correctly find the native or near-native ligand pose for 87% of the test protein targets by only examining their top-scoring poses (i.e., by allowing them to output only one pose for each PLC). It should be noted that, on average, 90 of the poses for each protein-ligand complex in the test set are more than 1 Å RMSD away from the actual native conformation and are therefore considered non-native (or decoy) poses for this test (C = 1). Compare this to the BA-based SFs RF-Score and X-Score, whose success rates are less than 45%. This gap in performance, 42 absolute percentage points, is due to three main reasons. First is the superior ensemble deep learning models used to train the BsN-Dock and BgN-Dock SFs, as opposed to Random Forests (RF-Score) and linear regression (X-Score). Second, BsN-Dock and BgN-Dock are trained on a large number of native and computationally generated poses and optimized to directly reduce the RMSD between generated poses and the native ones. Third, BsN-Dock and BgN-Dock capture more important protein-ligand interactions through a comprehensive and diverse set of multi-perspective descriptors. RF-Score and X-Score, on the other hand, are trained only on native protein-ligand complexes characterized by a smaller set of descriptors and optimized to reduce the error between actual and estimated binding affinity data. BsN-Score and BgN-Score use the same underlying deep learning-based algorithms and multi-perspective descriptors employed by BsN-Dock and BgN-Dock, which contribute to their improved performance in comparison to RF-Score and X-Score.
Their substantial lag behind BsN-Dock and BgN-Dock, however, is attributed to the use of native protein-ligand complexes with BA labels as training data, instead of both native and computationally-generated poses labeled with RMSD values, on which BsN-Dock and BgN-Dock are trained. Therefore, enriching the original complexes in PDBbind 2014 with computationally generated poses and replacing their binding affinity labels with RMSD values boosted the docking performance of our ensemble deep learning SFs by at least 25 absolute percentage points in terms of the S_1^1 success rate (BsN-Dock = 87% vs. BsN-Score = 51%).

Figure 5.4: The docking accuracy of docking-specific and the best binding-affinity-based (conventional) scoring functions on native and decoy conformations of ligands docked to diverse proteins from the core test set of PDBbind 2014. The docking accuracy is expressed in terms of the success rates (S_2^1, S_2^2, and S_2^3) of SFs in identifying binding poses that are closest to the native conformation for each protein-ligand complex in the test set.

BsN-Dock and BgN-Dock are also compared against the empirical SFs ChemPLP and ChemScore used in the commercial software GOLD, and GlideScore-SP employed in the other popular commercial molecular modeling suite, Schrödinger. These three SFs were the top performing predictors of binding poses according to a recent comparative study by Li et al. [25]. In that study, the docking accuracies of ChemPLP, ChemScore, and GlideScore-SP, as well as 18 other popular SFs, were tested on the core test set of PDBbind 2014, which we also use to evaluate our models. Their performance was not reported for the cross-validation and leave-clusters-out tests, and therefore we do not show those results for all SFs in Figure 5.4. The figure shows the docking accuracy in terms of the success rate of identifying poses that are within 2 Å of the native poses by only examining the top 1, 2, and 3 scoring poses out of the 100 total conformations for each protein-ligand complex. BsN-Dock and BgN-Dock are at least 13 absolute percentage points more accurate than ChemPLP and 30 percentage points ahead of X-Score in terms of the S_2^1 success rate. This gap in performance confirms the superiority of BsN-Dock and BgN-Dock even against the best approaches used in commercial tools to solve real-world problems.

5.4.3 Docking-specific scoring functions vs. conventional approaches on homogeneous test sets

5.4.3.1 Docking performance on four protein families from PDBbind 2007

In the previous section, the performance of 6 ML and 16 conventional SFs was assessed on the diverse test set Cr of PDBbind 2007. The core set consists of more than sixty different protein families, each of which is related to a subset of protein families in Pr. That is, while the training and test set complexes were different (at least for all the ML SFs), proteins present in the core test set were also present in the training set, albeit bound to different ligands. A much more stringent test of SFs is their evaluation on a completely new protein, i.e., when all test set complexes feature a given protein (a homogeneous test set) and no training set complexes feature that protein.
To address this issue, four homogeneous test sets were constructed corresponding to the four most frequently occurring proteins in our data: HIV protease (112 complexes), trypsin (73), carbonic anhydrase (44), and thrombin (38). Each of these protein-specific test sets was formed by extracting the complexes containing the protein from Cr (one cluster, or three complexes) and Pr (the remaining complexes). For each test set, we retrained BRT, RF, SVM, kNN, MARS, and MLR models on the non-test-set complexes of Pr. Figure 5.5 shows the docking performance of the resulting BA- and RMSD-based ML scoring models on the four protein families. The plots clearly show that the success rates of SFs depend on the protein family under investigation. It is easier for some SFs to distinguish good poses for HIV protease and thrombin than for carbonic anhydrase. The best performing SFs on HIV protease and thrombin complexes, MLR::XRG and MLR::XG, respectively, achieve success rates of over 95% in terms of S_3^1, as shown in panels (b) and (n), whereas no SF exceeded a 65% success rate in the case of carbonic anhydrase, as demonstrated in panels (i) and (j). Finding the native poses is even more challenging for all SFs, although we notice that RMSD-based SFs outperform the models that rank poses using predicted BA. The exception is the SF MLR::XAR, whose performance exceeds all RMSD-based ML models in terms of the success rate in reproducing native poses, as illustrated in panels (c) and (d). The results also indicate that multivariate linear regression (MLR) models, which are basically empirical SFs, are the most accurate across the four families, whereas the ensemble learning models RF and BRT, in contrast to their good performance in Figure 5.2, appear inferior to simpler models in Figure 5.5. This can be attributed to the high rigidity of linear models compared to ensemble approaches. In other words, linear models are not as sensitive as ensemble techniques to the presence or absence of a certain protein family in the data on which they are trained. RF- and BRT-based SFs, on the other hand, are more flexible and adapt more strongly to their training data, and in some cases they fail to generalize well to completely different test proteins, as seen in Figure 5.5. In practice, however, it has been observed that more than 92% of today's drug targets are similar to known proteins in the PDB [113], an archive of high-quality complexes from which our training and test compounds originated. Therefore, if the goal of a docking run is to identify the most stable poses, it is important to consider sophisticated SFs (such as RF and BRT) calibrated with training sets containing some known binders to the target of interest. Simpler models, such as MLR and MARS, tend to be more accurate when docking to novel proteins that are not present in the training data.
Figure 5.5: Success rates of scoring functions in identifying binding poses that are closest to native conformations observed in four protein families in PDBbind 2007: HIV protease (a-d), trypsin (e-h), carbonic anhydrase (i-l), and thrombin (m-p). The results show these rates by examining the top N scoring ligands that lie within an RMSD cut-off of C Å from their respective native poses. Panels on the left show success rates when binding-affinity-based scoring is used and the ones on the right are for RMSD-based SFs.

Sophisticated ML algorithms are not the only critical element in building a capable SF. The features to which they are fitted also play an important role, as can be seen in Figure 5.5. By comparing the right panels to the ones on the left, we notice that X-Score features (X) are almost always present in BA-based SFs, while those provided by GOLD (G) are used more to model RMSD explicitly.
This implies that X-Score features are more accurate than other feature sets in predicting BA, while GOLD features are the best for estimating RMSD and hence poses close to the native one.

5.4.3.2 Docking performance on four protein families from PDBbind 2014

In this section, we present the docking accuracy of six SFs in identifying the correct binding pose for four protein families from the 2014 version of PDBbind. The four families are the same targets we considered in the previous section. However, the number of complexes formed by these families is larger in PDBbind 2014: 262 HIV protease, 170 carbonic anhydrase (CAH), 98 trypsin, and 79 thrombin complexes. The six SFs we evaluate here are the ensemble deep learning-based models BsN-Dock, BgN-Dock, BsN-Score, and BgN-Score, and two popular SFs from the literature, RF-Score and X-Score, which are based on Random Forests and linear regression, respectively. We build two sets of ensemble deep neural network models. BsN-Dock and BgN-Dock are trained on the 2014 refined set complexes with native and computationally generated poses characterized by DDB's multi-perspective descriptors and labeled using RMSD values from the native poses. The second set includes BsN-Score and BgN-Score, which are trained on the native poses of the PDBbind 2014 refined set complexes without adding any computationally-generated decoy poses. Therefore, the training complexes of BsN-Score and BgN-Score are labeled using their measured binding affinity values, just like the training data for RF-Score and X-Score. Similar to BsN-Dock and BgN-Dock, the training complexes of BsN-Score and BgN-Score are also characterized by DDB's multi-perspective descriptors. The SF RF-Score uses 36 geometric descriptors based on counts of different protein-ligand atom types, while X-Score employs six physiochemical descriptors. The docking performance of the six SFs is shown in Figure 5.6 for the four protein families.
Figure 5.6: Success rates of docking-specific (proposed) and binding-affinity-based (conventional) scoring functions in identifying binding poses that are closest to the native conformations observed in four protein families from PDBbind 2014: HIV protease, carbonic anhydrase (CAH), trypsin, and thrombin. The results show these rates by examining the top N = 1 scoring ligands that lie within an RMSD cut-off of C ∈ {1, 2, 3} Å from their respective native poses. Panels on the left show in-sample success rates and the ones on the right show the out-of-sample docking accuracies.

The left panels summarize the in-sample docking accuracies in terms of the S_1^1, S_2^1, and S_3^1 success rates. BsN-Dock and BgN-Dock recognize the correct poses nearly perfectly for the four protein families they encountered in their training data. The four other scoring functions failed in more than fifty percent of the cases (S_1^1), even though all the protein-ligand complexes are part of their training data. This illustrates how difficult it is for BA-based SFs to learn the relationship between BA and RMSD values. RMSD-based SFs rank the ligand poses correctly very well whether the four test protein families are part of their training data or out-of-sample, as shown in the right panels of Figure 5.6. We notice a slight drop in performance for BsN-Dock and BgN-Dock in this test when neither the PLCs nor even the proteins are part of their training data. The other four SFs, on the other hand, fail completely in identifying the native or near-native poses, even when poses that are 3 Å away are considered "near-native" (S_3^1). This experiment, yet again, highlights the potential of our RMSD-based approach to identify binding poses correctly even for novel protein targets that the SFs have never seen before.

5.4.4 Performance of docking-specific scoring functions on novel targets

5.4.4.1 Simulating novel targets using PDBbind 2007

The primary training-core test set pair (Pr, Cr) sampled from PDBbind 2007 PLCs is a useful benchmark when the aim is to evaluate the performance of SFs on targets that have some degree of sequence similarity with at least one protein present in the complexes of the training set. This is typically the case since, as mentioned earlier, 92% of drug targets are similar to known proteins [113]. When the goal is to assess SFs in the context of novel protein targets, however, the training-test set pair (Pr, Cr) is not as suitable because of the partial overlap in protein families between Pr and Cr. We considered this issue to some extent in the previous section, where we investigated the docking accuracy of SFs on four different protein-specific test sets after training them on complexes that did not contain the protein under consideration. This resulted in a drop in performance for all SFs, especially in the case of carbonic anhydrase as a target. However, even when there are no common proteins between training and test set complexes, different proteins may have sequence and structural similarities at their binding sites, which influence docking results. To more rigorously and systematically assess the performance of BA- and RMSD-based ML SFs on novel targets, we performed a separate set of experiments in which we limited the BLAST sequence similarity between the binding sites of proteins present in the training and test set complexes. Sequence similarity was used to construct the core test set, and it was also noted by Ballester and Mitchell as being relevant to testing the efficacy of SFs on a novel target [29]. Specifically, for each similarity cut-off value S = 30%, 40%, 50%, ..., 100%, we constructed 100 different independent 100-complex test and T-complex training set pairs. Two versions were created out of these training and test set pairs.
The first version uses BA as the response variable that SFs are fitted to, predict, and employ to assess poses. The response variable of the other version is the RMSD value of the true pose (RMSD = 0 Å) and of computer-generated decoys (with 0 Å < RMSD ≤ 10 Å) for each original protein-ligand complex in every training and test dataset pair. A total of 20 poses per complex were used in this second version. We then trained BA and RMSD scoring models (MLR, MARS, kNN, SVM, RF, and BRT) using XARG features on the training set, evaluated them on the corresponding test set, and averaged their performance over the 100 training-test set pairs to obtain robust results. Since SF docking performance depends on both the similarity cut-off and the training set size, and since the training set size is constrained by the similarity cut-off (a larger S means a larger feasible T), we investigated different ranges of S (30% to 100%, 50% to 100%, and 70% to 100%), and for each range we set T close to the largest feasible value for the smallest S value in that range. Each test and training set pair was constructed as follows. We randomly sampled a test set of 100 protein-ligand complexes without replacement from all complexes at our disposal: 1105 in Pr + 195 in Cr = 1300 complexes. The remaining 1200 complexes were randomly scanned until T different complexes were found that had a protein binding site similarity of S% or less with the protein binding sites of all complexes in the test set; if fewer than T such complexes were found, the process was repeated with a new 100-complex test set.
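A minimal sketch of this similarity-constrained sampling is shown below, assuming complexes are identified by hashable IDs (e.g., PDB codes) and a hypothetical `similarity(a, b)` function that returns the BLAST binding-site similarity (in %) between the proteins of two complexes; it differs from the scanning procedure described above only in bookkeeping details.

```python
import random

def sample_similarity_split(complexes, similarity, S=50.0, T=400, n_test=100, max_tries=1000):
    """Return a (train, test) pair in which every training complex has at most S%
    binding-site similarity to every test complex; resample the test set until
    T such training complexes can be found."""
    for _ in range(max_tries):
        test = random.sample(complexes, n_test)
        test_set = set(test)
        pool = [c for c in complexes if c not in test_set]
        random.shuffle(pool)
        train = [c for c in pool if all(similarity(c, t) <= S for t in test)][:T]
        if len(train) == T:
            return train, test
    raise RuntimeError("similarity constraint could not be satisfied; lower T or raise S")
```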
Figure 5.7: Docking accuracy of docking-specific (proposed) and BA-based (conventional) scoring models as a function of the BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes. The docking accuracy is expressed in terms of the S_1^1 success rate obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C = 1 Å from its respective native pose. In panels (a)-(c), a single (native) pose is used per training complex, whereas in panels (d)-(f) 20 randomly-selected poses are used per training complex. Training and test complexes are sampled from the refined set of PDBbind 2007.

The performance of the six scoring models in terms of their S_1^1 docking accuracy is depicted in Figure 5.7 for various similarity cut-offs and training set sizes. The plots in each column ((a) and (d), (b) and (e), and (c) and (f) of Figure 5.7) show docking power results for similarity cut-offs of 30% to 100%, 50% to 100%, and 70% to 100%, for which T = 100, 400, and 700 complexes is the largest training set size feasible for S = 30%, 50%, and 70%, respectively. The results in these plots are consistent with those obtained for the four protein families presented in the previous section and illustrated in Figure 5.5. More specifically, we notice that simpler models such as MLR::XARG and MARS::XARG perform the best across almost all values of the similarity cut-off (S = 30%, 50%, or 70%-100%), training set size (T = 100, 400, or 700 complexes), and response variable (Y = BA or RMSD). This is mainly due to their rigidity. The performance of such models does not suffer as much as that of the more flexible ML SFs when their training and test proteins have low sequence similarity. On the other hand, the SFs based on MLR and MARS are also less responsive to increasing similarity between the protein families in the training and test sets. Unlike those of the other four non-linear ML SFs, the performance curves of MLR and MARS are flat and do not seem to react to having more and more similar training and test proteins. This observation is clearer in the bottom row of plots in Figure 5.7, where the training set sizes are large enough (i.e., 2000 ligand poses or more). Plot (f) shows that the RMSD-based SFs RF and BRT are catching up with MLR and MARS and can eventually surpass them in performance as training set sizes become larger. Similar to RF and BRT, the other non-linear RMSD SFs, namely kNN and SVM, show the sharpest increase in docking performance as the similarity cut-off S increases. However, unlike the ensemble SFs RF and BRT, the kNN and SVM SFs are the least reliable models when ligand poses need to be scored for novel targets. To summarize, imposing a sequence similarity cut-off between the binding sites of proteins in training and test set complexes has the expected adverse impact on the accuracy of all scoring models. However, increasing the number of training complexes helps improve accuracy for all similarity cut-offs, as we will show in more detail in the next section. Scoring functions based on MLR and MARS have the best accuracy when training set sizes are small, which is typically the case when the response variable is binding affinity. The other generally competitive ML models are RF and BRT, whose accuracies surpass those of all other SFs when evaluated on targets that have some sequence similarity with their training proteins.

5.4.4.2 Simulating novel targets using PDBbind 2014

Similar to the objective of the experiment described in Section 5.4.4.1, the aim of this simulation is to compare the docking accuracy of the proposed docking-specific SFs to that of the generic, binding affinity-based approaches on novel protein targets. The scoring functions we consider here are trained and evaluated on data sets from the larger and more recent version of PDBbind. From the 3446 PLCs of PDBbind 2014, we randomly sample a test set of 100 complexes, each of which is associated with 100 decoy and native ligand poses. From the remaining 3346 complexes, we randomly sample 3000 training PLCs, each of which must not have a binding pocket with a BLAST similarity of more than S% to any of the 100 test set proteins.
If we are unable to find such complexes among the 3346 PLCs, we repeat this random sampling process several times until the similarity restriction is met; this is done for every similarity cut-off value S, where S ∈ {30%, 50%, 70%, 90%, 100%}. We then build four SFs using the 3000 training complexes and test their docking accuracy on the 100 novel targets whose binding sites have less than S% BLAST similarity to the binding pockets of the training complexes. This procedure is repeated 50 times for every similarity level S to obtain robust summary values of the docking success rates S_1^1 and S_2^1, which are shown in Figure 5.8. The four scoring functions tested here are: (i) BT-Dock, the docking-specific boosted decision tree model fitted to native and computationally-generated poses characterized by DDB's multi-perspective descriptors and labeled using RMSD values; (ii) BT-Score, the generic, scoring-specific counterpart of BT-Dock, which is also based on a boosted ensemble of decision trees fitted to complexes characterized by DDB's multi-perspective descriptors; however, BT-Score is trained only on the native poses of these complexes, labeled with BA data. The decision to replace deep neural networks with decision trees as base learners for the boosted ensemble model is based on computational considerations. This experiment involves training a large number of models to obtain statistically significant results. Since deep neural networks require hefty computational resources, training hundreds of them is impractical for this test and for the models in the following sections. We believe BT-Dock and BT-Score resemble BsN-Dock and BsN-Score, respectively, at least with respect to their performance trends across different similarity values. (iii) RF-Score, the Random Forests-based scoring function trained on complexes with native poses characterized by the 36 geometric features and labeled using BA data. (iv) X-Score, the empirical SF based on a linear regression model fitted to native complexes with binding affinity labels and 6 physiochemical descriptors.

Figure 5.8: Docking accuracy of docking-specific (proposed) and BA-based (conventional) scoring models as a function of the BLAST sequence similarity cutoff between binding sites of proteins in training and test complexes. The docking accuracy is expressed in terms of the S_1^1 and S_2^1 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose. The docking-specific SF BT-Dock is trained on 100 computer-generated poses for each of its 3000 native protein-ligand complexes. Training and test complexes are sampled from the refined set of PDBbind 2014.

The docking performance trends in Figure 5.8 across the various levels of the similarity cutoff resemble those obtained using PDBbind 2007 and depicted in Figure 5.7. BT-Dock achieves substantially better ranking accuracy of ligand poses for each novel test protein. The accuracy of BT-Dock gradually improves as the binding pocket similarity between its training and test proteins moves past the 70% cutoff. On the other hand, the S_1^1 success rate of the BA-based SFs BT-Score, X-Score, and RF-Score is less than 50% and does not show any sign of improvement even when tested on familiar targets (with BLAST similarity S ≥ 90%).
It should be noted that the only difference between the RMSD-based SF BT-Dock and its sister BA-based model BT-Score is the use of computational ligand poses during the training of BT-Dock, whose dependent variable is the RMSD distance of the ligand poses from their respective native conformations. This difference in modeling is responsible for the gap in performance shown in Figure 5.8.

5.4.5 Impact of the number of training protein-ligand complexes on the docking accuracy

5.4.5.1 Simulating increasing training set size using PDBbind 2007

An important factor influencing the accuracy of ML SFs is the size of the training dataset. In the case of BA-based ML SFs, the training dataset size can be increased by training on a larger set of protein-ligand complexes with known binding affinity values. In the case of RMSD-based SFs, on the other hand, the training dataset size can be increased not only by considering a larger number of protein-ligand complexes in the training set, but also by using a larger number of computationally-generated ligand poses per complex, since each pose provides a new training record corresponding to a different combination of features and/or RMSD value. Unlike experimental binding affinity values, which have inherent noise and require additional resources to obtain, the RMSD of a new ligand pose from the native conformation is computationally determined and is accurate. We carried out three different experiments to determine: (i) the response of BA-based ML SFs to an increasing number of training protein-ligand complexes, (ii) the response of RMSD-based ML SFs to an increasing number of training protein-ligand complexes while the number of poses for each complex is fixed at 50, and (iii) the response of RMSD-based ML SFs to an increasing number of computationally-generated poses while the number of protein-ligand complexes is fixed at 1105. In the first two experiments, we built 6 ML SFs, each of which was trained on a randomly sampled x% of the 1105 protein-ligand complexes in the PDBbind 2007 Pr dataset, where x = 10, 20, ..., 100. The dependent variable in the first experiment is binding affinity (Y = BA), and the performance of these BA-based ML SFs is shown in Figure 5.9(a) and partly in Figure 5.9(d) (MLR::XARG). The set of RMSD values from the native pose is used as the dependent variable for the ML SFs trained in the second experiment (Y = RMSD). For a given value of x, the number of conformations is fixed at 50 ligand poses for each protein-ligand complex. The docking accuracy of these RMSD-based ML models is shown in Figure 5.9(b). In the third experiment, all 1105 complexes in Pr were used for training the RMSD-based ML SFs (i.e., Y = RMSD) with x randomly sampled poses considered per complex, where x = 2, 6, 10, ..., 50; results for this are reported in Figure 5.9(c) and partly in Figure 5.9(d) (MARS::XARG). In all three experiments, the reported results are the average of 50 random runs, in order to ensure that all complexes and a variety of poses are equally represented. All training and test complexes in these experiments are characterized by the XARG (= X ∪ A ∪ R ∪ G) features.
Figure 5.9: Dependence of the docking accuracy of docking-specific and binding-affinity-based scoring models on training set size when training complexes are selected randomly (without replacement) from the refined set of PDBbind 2007 and the models are tested on the out-of-sample core (Cr) set. The size of the training data was increased by including more protein-ligand complexes ((a) and (b)) or more computationally-generated poses for all complexes ((c) and (d)).

From Figure 5.9(a), it is evident that increasing the training dataset size has a positive impact on docking accuracy (measured in terms of the S_1^1 success rate), although it is most appreciable in the case of MLR::XARG and MARS::XARG, two of the simpler models, MLR being linear and MARS piecewise linear. The performance of the other models, which are all highly nonlinear, seems to saturate at 60% of the maximum training dataset size used. The performance of all six models is quite modest, with MLR::XARG being the only one with a docking success rate (slightly) in excess of 50%. The explanation for these results is that binding affinity is not a very good response variable to learn for the docking problem, because the models are trained only on native poses (for which binding affinity data is available) although they need to be able to distinguish between native and non-native poses during testing. This means that the training data is not particularly well suited to the task for which these models are used. An additional reason is that experimental binding affinity data, though useful, is inherently noisy. The flexible, highly nonlinear models, RF, BRT, SVM, and kNN, are susceptible to this noise because the training dataset (arising only from native poses) is not particularly relevant to the test scenario (consisting of both native and non-native poses). Therefore, the more rigid MLR and MARS models fare better in this case. When RMSD is used as the response variable, the training set consists of data from both native and non-native poses and hence is more relevant to the test scenario, and the RMSD values, being computationally determined, are also accurate. Consequently, the docking accuracy of all SFs improves dramatically compared to their BA-based counterparts, as can be observed by comparing Figure 5.9(a) to Figure 5.9(b) and (c). We also notice that all SFs respond favorably to increasing training set size, whether by considering more training complexes (Figure 5.9(b)) or more computationally-generated training poses (Figure 5.9(c)). Even for the smallest training set sizes in Figure 5.9(b) and (c), the docking accuracy of most RMSD-based SFs is about 70% or more, which is far better than the roughly 50% success rate achieved at the largest training set size by the best BA-based SF, MLR::XARG.
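Because the RMSD labels are computed rather than measured, enriching the training set with additional poses is essentially free. The sketch below illustrates the labeling step for one complex under the simplifying assumptions that atom ordering is identical in all poses and that no symmetry correction is needed; it is not the pipeline used in this work.

```python
import numpy as np

def pose_rmsd(pose_xyz, native_xyz):
    """RMSD (in Å) of a generated ligand pose from the native conformation.
    Both arrays are (n_atoms, 3) with identical atom ordering; docked poses are
    already in the receptor frame, so no superposition is performed."""
    diff = np.asarray(pose_xyz, float) - np.asarray(native_xyz, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Label every generated pose of one complex for RMSD-based training.
native = np.random.rand(30, 3) * 10.0                      # placeholder native coordinates
decoys = [native + np.random.normal(scale=s, size=native.shape) for s in (0.2, 1.0, 3.0)]
labels = [0.0] + [pose_rmsd(p, native) for p in decoys]    # native pose gets RMSD = 0
```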
In Figure 5.9(d), we compare the top performing RMSD-based SF, MARS::XARG, to the best BA-based SFs, GOLD::ASP and MLR::XARG, to show how docking performance can be improved simply by increasing the number of computationally-generated poses, an important capability that RMSD-based SFs possess but that is lacking in their BA-based conventional counterparts. To raise the performance of these BA-based SFs to a comparable level, thousands of protein-ligand complexes with high-quality, experimentally-determined binding affinity data would need to be collected. Such a requirement is too expensive to meet in practice. Furthermore, RMSD-based SFs with the same training complexes would still likely outperform BA-based SFs.

5.4.5.2 Simulating increasing training set size using PDBbind 2014

The number of protein-ligand complexes in PDBbind has increased from 1300 in 2007 to 3446 in 2014. In this section, we use the updated dataset to re-evaluate the capacity of BA- and RMSD-based SFs to utilize an increasing number of complexes to improve their performance. We therefore follow an experimental setup very similar to the one described in the previous subsection (5.4.5.1). This section, however, focuses on the docking-specific (RMSD-based) SF BT-Dock and the three BA-based models BT-Score, RF-Score, and X-Score. We conduct two sets of experiments on the 2014 version of PDBbind similar to those performed using the 2007 release. In the first, we train the four SFs on training sets of x protein-ligand complexes randomly sampled from the 3446-complex refined set of PDBbind 2014, where x ∈ {100, 680, 1260, 1840, 2420, 3000}. The SFs are then tested on 195 PLCs randomly sampled from the remaining (3446 − x) complexes in the refined set. The training and test complexes are characterized using the 2714 proposed multi-perspective descriptors for BT-Dock and BT-Score. RF-Score extracts 36 geometrical features of atomic pair-wise counts, while X-Score relies on its 6 physiochemical descriptors to characterize the complexes. The three generic SFs BT-Score, RF-Score, and X-Score are trained on the native poses of the x training complexes, whose labels are the measured binding affinity values. The proposed docking-specific SF BT-Dock is trained on 100 native and computationally-generated poses for each training complex and uses their root-mean-square distances from the native conformations as the dependent variable. For each training set size x, the four SFs are tested on their ability to identify the native or near-native pose of the test protein-ligand complexes. Each test protein-ligand complex is associated with 100 poses, the vast majority (90%) of which are decoys. The success rates (in terms of S_1^1 and S_2^1) are then reported for each SF and averaged over 50 runs for every training set size x to obtain reliable results. The average success rates for BT-Dock, BT-Score, RF-Score, and X-Score are shown in Figure 5.10.

Figure 5.10: Dependence of the docking accuracy of docking-specific and binding-affinity-based scoring models on training set size when training complexes are selected randomly (without replacement) from the refined set of PDBbind 2014 and the models are tested on the out-of-sample core (Cr) set. The docking accuracy is expressed in terms of the S_1^1 and S_2^1 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose.
The plot shows that the accuracy of the docking-specific SF BT-Dock is superior to that of the three BA-based SFs even for 100 training PLCs. Increasing the training data consistently improves BT-Dock and BT-Score, followed by X-Score. RF-Score appears to be less responsive to increases in its training data size. We attribute this mainly to the type of features that RF-Score uses to characterize the protein-ligand molecules. When we fitted a Random Forests model (the algorithm that the SF RF-Score uses) to the 2714 multi-perspective descriptors generated using the proposed Descriptor Data Bank (DDB), we observed an improvement trend on par with BT-Score. In the second experiment, we test how the docking accuracy of the best performing SF, BT-Dock, reacts to different numbers of computationally generated poses per training complex. To answer this question, we trained BT-Dock on a fixed set of 3000 complexes with an increasing number of ligand poses x, where x ∈ {10, 20, 40, 60, 80, 100}. The three BA-based SFs BT-Score, RF-Score, and X-Score are trained on a fixed set of 3000 complexes with their native conformations. We use the BA-based SFs in this experiment as baseline models against which to compare the proposed BT-Dock model. The types of descriptors, the labels of the PLCs, and the test complexes are the same for every SF as those described for the previous experiment. The docking success rates S_1^1 and S_2^1 for the four SFs are summarized in Figure 5.11.

Figure 5.11: Dependence of the docking accuracy of the docking-specific model BT-Dock on the number of training poses generated for each native conformation of 3000 protein-ligand complexes selected randomly (without replacement) from the refined set of PDBbind 2014. The three binding-affinity-based SFs BT-Score, RF-Score, and X-Score are trained only on the native conformations. The docking accuracy is expressed in terms of the S_1^1 and S_2^1 success rates obtained by examining the top N = 1 scoring ligand that lies within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose in the core test set.

The accuracy of BT-Dock benefits substantially from generating more computational poses for each training protein-ligand complex. Improving the performance of an SF without using more native complexes is a very desirable outcome, since solving the native structure of a protein-ligand complex and determining its binding affinity value is an expensive task. Therefore, we believe virtual enrichment of the training data should be pursued when possible for applications in which experimental data is difficult to obtain.

5.4.6 Impact of the type and number of descriptors on the docking accuracy

5.4.6.1 Type and number of raw descriptors characterizing PDBbind 2007 complexes

The binding pose of a protein-ligand complex depends on many physiochemical interaction factors that are too complex to be accurately captured by any one approach. Therefore, we perform two different experiments to investigate how utilizing different types of features from different scoring tools, X-Score, AffiScore, RF-Score, and GOLD, and considering an increasing number of features affect the performance of the various ML models.
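The first experiment, described next, uses every non-empty combination of these four descriptor sources. For reference, a minimal sketch of enumerating the 15 combinations, in the same order in which they are labeled in Tables 5.1 and 5.2, is shown below.

```python
from itertools import combinations

sources = ["X", "A", "R", "G"]   # X-Score, AffiScore, RF-Score, and GOLD descriptor sets
combos = ["".join(c)
          for r in range(1, len(sources) + 1)
          for c in combinations(sources, r)]
print(len(combos))   # 15 non-empty combinations (2**4 - 1)
print(combos)        # ['X', 'A', 'R', 'G', 'XA', 'XR', 'XG', 'AR', 'AG', 'RG', 'XAR', 'XAG', 'XRG', 'ARG', 'XARG']
```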
In the first experiment, the 6 ML models were trained on the 2007 Pr dataset characterized by all 15 combinations of the X, A, R, and G feature types and tested on the corresponding PDBbind 2007 core test set Cr characterized by the same features. Table 5.2 reports the S_1^1 docking success rate for three groups of ML SFs. The first group (Table 5.2, top block) of 90 (6 methods × 15 feature combinations) BA-based SFs is trained on the 1105 Pr complexes. The second group (Table 5.2, middle block) of 90 RMSD-based SFs is again trained on the 1105 Pr complexes, with one randomly sampled pose out of the 50 poses generated per complex. Therefore, the training set size for these first two groups of SFs is identical and consists of 1105 training records, the only difference being the response variable that they are trained for: BA in the first case and RMSD in the second. The final group (Table 5.2, bottom block) of 90 RMSD-based SFs is trained on the 1105 Pr complexes with 50 poses per complex, so that its training set size is 1105 × 50 = 55,250 records.

Table 5.2: Docking success rate S_1^1 (in %) of ML SFs on complexes characterized by different descriptor sets. Within each block (response variable Y and training set size T), rows correspond to descriptor sets and columns to ML models; the last column and last row report averages.

Y = BA, T = 1105
Set    MARS   RF     SVM    MLR    BRT    kNN    Avg
X      28.72  23.08  26.67  43.08  23.85  17.95  27.22
A      20.00  23.08  30.26  18.97  25.39  11.28  21.50
R      5.13   10.77  4.62   9.74   9.24   8.72   8.04
G      18.46  24.10  19.49  28.21  39.23  16.41  24.32
XA     34.36  30.52  30.77  33.33  30.52  9.23   28.12
XR     36.92  30.77  23.08  46.67  30.26  18.46  31.03
XG     28.21  37.44  29.23  47.18  42.31  15.90  33.38
AR     21.03  21.54  20.51  21.03  22.82  11.28  19.71
AG     16.41  30.00  41.54  40.00  37.69  15.38  30.17
RG     8.21   25.65  30.26  33.85  33.85  17.95  24.96
XAR    37.95  31.03  26.15  46.15  33.85  15.38  31.75
XAG    28.72  34.12  42.05  49.23  44.36  17.44  35.99
XRG    40.00  39.75  37.44  49.74  44.62  21.03  38.76
ARG    8.21   25.90  38.97  40.00  36.67  14.36  27.35
XARG   36.41  32.31  41.54  51.28  43.08  17.44  37.01
Avg    24.58  28.00  29.51  37.23  33.18  15.21  27.95

Y = RMSD, T = 1105
Set    MARS   RF     SVM    MLR    BRT    kNN    Avg
X      45.64  32.72  29.74  44.62  38.77  31.79  37.21
A      42.05  32.31  28.72  35.90  34.87  24.10  32.99
R      26.15  13.85  6.15   6.15   8.10   3.08   10.58
G      72.31  67.39  66.15  70.26  70.98  55.38  67.08
XA     54.87  39.80  37.95  58.46  52.00  30.77  45.64
XR     52.31  42.26  30.77  54.36  41.13  17.44  39.71
XG     73.85  70.36  70.77  72.31  74.77  62.56  70.77
AR     33.85  32.10  32.31  40.51  35.59  15.90  31.71
AG     70.26  64.82  61.54  70.26  67.70  40.00  62.43
RG     71.28  68.10  56.41  65.64  69.02  38.97  61.57
XAR    52.82  41.95  45.64  56.92  48.92  22.05  44.72
XAG    70.26  64.82  64.10  75.90  71.39  44.62  65.18
XRG    72.82  72.00  62.05  70.26  72.51  45.13  65.79
ARG    67.18  64.51  58.97  67.69  65.33  37.95  60.27
XARG   75.90  67.18  63.08  69.23  67.49  41.03  63.98
Avg    58.77  51.61  47.62  57.23  54.57  34.05  50.64

Y = RMSD, T = 1105 × 50
Set    MARS   RF     SVM    MLR    BRT    kNN    Avg
X      44.10  39.39  36.41  45.64  46.15  36.41  41.35
A      31.79  48.10  43.08  36.41  36.92  46.15  40.41
R      5.13   22.46  10.26  6.15   13.33  23.59  13.49
G      72.82  70.77  66.15  70.26  71.59  61.03  68.77
XA     62.56  61.95  54.87  56.92  54.36  51.28  57.01
XR     49.74  54.77  43.08  52.31  54.87  41.03  49.30
XG     75.38  72.41  70.77  73.34  71.59  71.28  72.46
AR     34.87  50.97  43.08  37.95  42.56  45.13  42.43
AG     73.33  73.64  71.79  71.79  70.77  60.00  70.22
RG     70.77  75.18  65.13  71.28  70.77  53.33  67.69
XAR    64.10  63.79  57.95  57.44  56.92  49.23  58.24
XAG    76.92  76.00  74.36  72.82  70.26  61.54  72.05
XRG    72.31  75.49  72.31  71.79  72.31  60.51  70.68
ARG    72.31  75.28  69.74  70.77  71.28  53.33  69.23
XARG   78.97  76.90  70.77  73.33  71.28  55.90  70.63
Avg    59.01  62.47  56.65  57.88  58.33  51.32  57.61

We notice that the S_1^1 value of almost all models improves when more than one type of feature is considered rather than just the X, A, R, or G features alone. The table also shows that RMSD SFs are substantially more accurate than their BA counterparts for each feature type and ML method. By comparing the 180 RMSD SFs with the corresponding 90 BA SFs across all feature types and ML models, we find that the former are, on average, almost twice as accurate as the BA approaches (50.64% and 57.61% vs. 27.95%; see the overall averages in Table 5.2). In terms of feature types, we note that the most accurate SFs always include X-Score and GOLD features. SFs that are fitted to the individual X and G features only are more accurate than their A and R counterparts, whether they are BA or RMSD models.
By averaging the performance of all ML models across all feature types, we see that the simple linear approach MLR outperforms the other, more sophisticated ML SFs that are trained to predict binding affinity. MARS outperforms all other RMSD SFs that are trained on the same number of training records (1105) as their BA counterparts. The bottom block of the table shows that the ensemble SF RF, which predicts the binding pose directly, has the highest average docking accuracy (62.47%) across the 15 feature types, and MARS::XARG has the highest docking accuracy (78.97%) overall. Comparing the two versions of RMSD SFs in the middle and bottom blocks of the table, we notice that the largest gainers from increasing the training set size are the most nonlinear ML techniques (RF, BRT, SVM, and kNN). The results of Table 5.2 are useful in assessing the relative benefit of different types of features for the various ML models.

A pertinent issue when considering a variety of features is how well different SF models exploit an increasing number of features. The features we consider are the X, A, and G sets, and a larger set of geometrical features than the R feature set available from the RF-Score tool. Recall from the Compound Characterization subsection that RF-Score counts the number of occurrences of 36 different protein-ligand atom pairs within a distance of 12 Å. In order to have more features of this kind for this experiment, we produce 36 such counts for five contiguous distance intervals of 4 Å each: (0 Å, 4 Å], (4 Å, 8 Å], ..., (16 Å, 20 Å]. This provides 6 X, 30 A, 14 G, and (36 × 5 =) 180 geometrical features, for a total of 220 features.
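The following is a minimal sketch of such binned pair counts, assuming hypothetical coordinate arrays and element lists for the protein and the ligand; the heavy-atom types shown (4 protein × 9 ligand = 36 pairs) follow the RF-Score convention, but this is not the actual RF-Score implementation.

```python
import numpy as np

PROT_TYPES = ("C", "N", "O", "S")                             # 4 protein heavy-atom types
LIG_TYPES = ("C", "N", "O", "F", "P", "S", "Cl", "Br", "I")   # 9 ligand types -> 36 pairs
SHELLS = ((0.0, 4.0), (4.0, 8.0), (8.0, 12.0), (12.0, 16.0), (16.0, 20.0))  # five 4 Å intervals

def binned_pair_counts(prot_xyz, prot_elem, lig_xyz, lig_elem):
    """Count each protein-ligand element pair in each distance shell:
    36 pair types x 5 shells = 180 geometric features."""
    prot_xyz = np.asarray(prot_xyz, dtype=float)
    lig_xyz = np.asarray(lig_xyz, dtype=float)
    feats = {(p, l, s): 0 for p in PROT_TYPES for l in LIG_TYPES for s in range(len(SHELLS))}
    # All pairwise protein-ligand distances in Å.
    dists = np.linalg.norm(prot_xyz[:, None, :] - lig_xyz[None, :, :], axis=-1)
    for i, pe in enumerate(prot_elem):
        if pe not in PROT_TYPES:
            continue
        for j, le in enumerate(lig_elem):
            if le not in LIG_TYPES:
                continue
            for s, (lo, hi) in enumerate(SHELLS):
                if lo < dists[i, j] <= hi:
                    feats[(pe, le, s)] += 1
                    break
    return feats
```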
This feature-sampling process is repeated 100 times to obtain robust average S11 statistics, which are plotted in Figure 5.12. The performance of the BA SFs is depicted in Figure 5.12(a), whereas panels (b) and (c) of the same figure show the docking success rates for the RMSD versions of the scoring models. In order to fairly compare the docking performance of BA and RMSD SFs as the number of features increases, we fixed their training set sizes to 1105 complexes, as shown in Figures 5.12(a) and (b). We also show in Figure 5.12(c) the effect of an increasing number of features on the docking performance of RMSD SFs when trained on all Pr complexes, with 50 poses per complex. The plots clearly indicate that RMSD SFs benefit the most from characterizing complexes with more descriptors. This is the case regardless of the number of records used to train RMSD SFs (compare plots (b) and (c) in Figure 5.12). The only exception is the RMSD SF based on SVM, which appears to overfit the 1105 training records when they are characterized by more than 60 features. This ML scoring function, however, performs better when trained on a larger number of records and shows a slight increase in performance as more features are included in building the model. Other RMSD SFs, such as RF, BRT, MLR, and MARS, have much sharper slopes than SVM and kNN. Compare these SFs to their BA counterparts in Figure 5.12(a), where most show little to no improvement as the number of features increases due to overfitting. Not only are most RMSD SFs resilient to overfitting, they also improve dramatically when more relevant features are extracted. Adding more features may yield the highest gains in performance when more training complexes are included, as was discussed in the previous subsection.

5.4.6.2 Number of descriptor sets characterizing PDBbind 2014 complexes

Public biochemical databases such as PDB [81], PubChem [71], PDBbind [22], ZINC [68], and others are the successful projects that inspired the creation of Descriptor Data Bank (DDB). These public databases have had a tremendous impact on advancing virtual screening and other molecular modeling applications. We believe our descriptor database will be an important complement in this space of public biochemical data and will significantly contribute to the advancement of machine-learning scoring functions. We showed in Section 5.4.5 the substantial improvement in the docking accuracy of SFs when they are trained on more complexes obtained from the public databases PDBbind 2007 and 2014. In this section, we quantitatively show how SFs benefit from training on public complexes characterized by an increasing number of descriptor sets using the 2014 version of PDBbind. In other words, we demonstrate how the performance of SFs improves as new perspectives of protein-ligand interactions are added to public databases over time.

Figure 5.13: Dependence of docking accuracy of docking-specific (proposed) and binding-affinity-based (conventional) scoring models on the number of descriptor (feature) sets. The sets are drawn randomly (without replacement) from a pool of 16 feature types in Descriptor Data Bank (DDB) and used to characterize the training and out-of-sample core Cr test set complexes of PDBbind 2014. The docking accuracy is expressed in terms of the S11 and S21 success rates obtained by examining the top (N = 1) scoring pose, which must lie within an RMSD cut-off of C ∈ {1, 2} Å from its respective native pose in the core test set.
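The S11 and S21 success-rate statistics referred to in this figure and throughout the chapter can be computed along the following lines (a minimal sketch that assumes a per-complex list of (score, RMSD) pairs; it is not the evaluation code used in this work):

def success_rate(poses_by_complex, cutoff=1.0, lower_is_better=True):
    """poses_by_complex: dict mapping a complex id -> list of (score, true_rmsd) tuples.

    For each complex, the pose that the SF ranks best (lowest predicted RMSD for
    RMSD-based SFs, or highest predicted affinity for BA-based SFs) is checked
    against an RMSD cut-off from the native pose.
    """
    hits = 0
    for poses in poses_by_complex.values():
        best = min(poses, key=lambda p: p[0]) if lower_is_better else max(poses, key=lambda p: p[0])
        if best[1] <= cutoff:
            hits += 1
    return 100.0 * hits / len(poses_by_complex)

# S11 corresponds to cutoff=1.0 (top-1 pose within 1 Angstrom); S21 to cutoff=2.0.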
To investigate the effect of increasing the number of descriptor sets, we randomly select a maximum of 100 combinations of x descriptor sets from all possible (16 choose x) combinations of the 16 DDB types listed in Table 3.1, where x ∈ {1, 4, 7, 10, 13, 16}, and use them to characterize the training set complexes, which we then use to train the RMSD-based SF BT-Dock and its BA-based counterpart BT-Score. We also fit Random Forests and linear regression models to these descriptors, which we call RF-Score and X-Score for notational convenience. We choose these models in particular because they are the algorithms used to construct the SFs RF-Score and X-Score from the literature, which were originally fitted to different descriptor sets of 36 geometrical features and 6 physicochemical terms, respectively. The four SFs, BT-Dock, BT-Score, RF-Score, and X-Score, are subsequently tested on the Cr dataset characterized by the same descriptors on which they were trained. For each number of descriptor sets, x, the performance of the 100 SFs is averaged to obtain robust overall docking success rates S11 and S21, which are plotted in Figure 5.13. Similar to the case of increasing training data in terms of new PLCs, the ensemble SFs BT-Dock and RF-Score clearly benefit from increasing the number of protein-ligand interaction models. The level of improvement for the three SFs with an increasing number of descriptor sets is on par with or better than that observed when increasing their number of training complexes. The linear regression model suffers serious overfitting problems when trained on large and diverse sets of features, as can be observed in the plot. A similar behavior was also observed for the MLR model in Figure 5.12(b) when it was fitted to an increasing number of descriptors. For the proposed RMSD-based SF BT-Dock in particular, it is clear from the plots in Figures 5.13 and 5.10 that the number of different models of protein-ligand interactions is as important as the number of its training complexes in improving the quality of the SF's predictions. Based on these results, we believe that the public database of protein-ligand interactions we are proposing in this work will be of high value to ML SFs and just as important to their accuracy as the repositories of molecular structures and experimental assay data.

5.5 Conclusion

We found that docking-specific ML models trained to explicitly predict RMSD values significantly outperform all conventional SFs in almost all testing scenarios. The estimated RMSD values of such models have a correlation coefficient of 0.825, on average, with the true RMSD values. On the other hand, predicted binding affinities have a correlation of as low as -0.3 with the measured RMSD values. This difference in correlation explains the wide gap in docking performance between the top SFs of the two approaches. Our docking-specific SF based on boosted deep neural networks correctly identified the ligand binding poses for 95% of the protein targets in the core test set of PDBbind 2014. This is in comparison to the 82% success rate obtained by the empirical SF GOLD::ChemPLP.
On the 2007 release of PDBbind, the empirical SF GOLD::ASP, which is the best conventional model, achieved a success rate of 70% in identifying a pose that lies within 1 Å of the native pose for 195 different complexes. Our top RMSD-based SF, MARS::XARG, on the other hand, has a success rate of ∼80% on the same test set, which represents a significant improvement in docking performance.1 When using non-DDB descriptors for PDBbind 2007 complexes, the linear ML SF, MLR::XARG, and its nonlinear extension, MARS::XARG, showed relatively better accuracy than Random Forests and SVM when the target is a protein not present in the training dataset used to build the scoring model. Ensemble SFs fitted to the same complexes and descriptors, however, may prove more reliable when there is some similarity between the training set proteins and the target protein. Our RMSD-based, ensemble deep learning and decision tree SFs, on the other hand, outperformed all other approaches in docking accuracy in all testing scenarios when fitted to complexes characterized by a large number of multi-perspective descriptors generated using the proposed Descriptor Data Bank platform. We also observed steady gains in the performance of RMSD-based ML SFs as the training set size and the number of features were increased by considering more descriptors, more protein-ligand complexes, and/or more computationally-generated ligand poses for each complex.

1 Experiments on the 2007 version of PDBbind were conducted prior to the development of the proposed SFs based on deep neural networks. Therefore, the deep neural network SFs were not compared against the 16 conventional and 6 ML (RF, BRT, SVM, kNN, MARS, and MLR) scoring functions on PDBbind 2007.

CHAPTER 6 SCREENING-SPECIFIC SCORING FUNCTIONS FOR ACCURATE LIGAND BIOACTIVITY PREDICTION

One of the most important initial steps in drug discovery is the screening of very large databases of drug-like molecules to identify putative ligands and filter out the ones that are potentially inactive against the target protein. Predicting the binding affinity (BA) of all the database molecules accurately is not as critical in this phase as reliably discriminating active molecules from inactive ones. In addition, scoring functions have limited accuracy when predicting the binding affinity of inactive ligands simply because they are only trained on active molecules. Therefore, binding affinity prediction and relative ranking of molecules against each other should take place after the screening campaign, when the database has been enriched with putative active compounds. Almost all existing scoring functions in use today rely on binding affinity prediction to carry out database enrichment. These models are typically trained on datasets of proteins bound to one or more ligands that are known to be active. High-throughput screening assays typically yield some active molecules and a large number of inactive compounds against the target protein. The inactive compounds end up being discarded during scoring function training due to the lack of binding affinity data for them. In such cases, conventional BA-based scoring functions can fail to produce reliable binding affinity scores for inactive compounds since they are only exposed to active molecules. Therefore, using conventional scoring functions to rank the active and inactive ligands housed in large databases in order to enrich them is problematic, as will be demonstrated in our experiments in this chapter.
6.1 Proposed screening-specific scoring functions

To overcome the limitations of conventional approaches, we introduce screening-specific scoring functions that are trained on both active and inactive compounds for a diverse set of protein targets. Our proposed models are based on deep neural networks and other machine learning approaches that are optimized to directly reduce the active/inactive misclassification rate instead of minimizing the mean-squared error of BA predictions. The proposed models take advantage of the larger training set that is constructed by combining active molecules and inactives, vis-à-vis only the binders as is the case for the vast majority of conventional SFs. Taking into account both active and inactive molecules yields more diverse training complexes, which could in turn substantially improve the screening accuracy of scoring functions on diverse families of ligands. The screening-specific scoring functions are also resilient to the uncertainty associated with experimentally collected binding affinity data because they use binary labels instead of the measured BA values themselves, as is the case for traditional SFs. The common sources of uncertainty in binding affinity labels include measurement noise, data collection errors, and incomplete or missing data. In addition, there are more, and much larger, public databases with assay results of binary (active/inactive) outcomes than sources with binding affinity data. For example, the biochemical databases PubChem [71] and ChEMBL [70] host millions of bio-assay and bioactivity records, whereas the number of complexes with BA data in archives such as PDBbind [23] and BindingDB [74] does not exceed several thousand. As we will see in the following sections, the proposed methods for constructing screening-specific SFs result in substantial improvements in screening accuracy compared to those achieved by conventional models.

We use the 2014 PDBbind core set Cr as a benchmark for assessing the screening accuracy of the conventional and proposed SFs. As explained in Section 2.1.1, the core set consists of 65 protein families, each bound to three different ligands with known and diverse binding affinity values. These are the active compounds that an ideal scoring function would identify. We assume the inactive molecules for each protein target are the crystallized ligands of the remaining 64 receptors in the test set, as shown in Figure 6.1. Although this bioactivity assumption requires further investigation, it appears to be reasonable since all 65 proteins have a pair-wise sequence similarity of less than 90%. Furthermore, all (65 × (64 × 3) =) 12480 assumed inactive protein-ligand pairs were also checked in the ChEMBL database for possible activity and cross binding [70]. Forty of these protein-ligand pairs were found to be active, and therefore the final test set consists of a total of (195 + 40 =) 235 binders and (65 × 192 − 40 =) 12440 inactive molecules. This translates to a total of (235 + 12440 =) 12675 test protein-ligand pairs, and the active-to-inactive ratio is approximately 1:64 for each target.

Figure 6.1: Constructing training and test data sets of active and inactive protein-ligand complexes using the core sets of PDBbind 2007 and 2014.
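The construction illustrated in Figure 6.1 can be summarized in a few lines; the sketch below is purely illustrative (the container names and the ChEMBL cross-activity set are assumed inputs, not part of this work's software):

def build_screening_pairs(core_ligands, known_cross_actives):
    """core_ligands: dict mapping a protein id -> list of its 3 crystallographic ligand ids.
    known_cross_actives: set of (protein_id, ligand_id) pairs reported active in ChEMBL.
    Returns a list of (protein_id, ligand_id, label) with label 1 = active, 0 = inactive.
    """
    pairs = []
    for prot, own_ligs in core_ligands.items():
        for lig in own_ligs:
            pairs.append((prot, lig, 1))              # each target's own ligands are actives
        for other, other_ligs in core_ligands.items():
            if other == prot:
                continue
            for lig in other_ligs:                     # ligands of the other targets are decoys,
                label = 1 if (prot, lig) in known_cross_actives else 0   # unless ChEMBL says otherwise
                pairs.append((prot, lig, label))
    return pairs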
A similar approach was also followed by Li et al. [25] to construct a test set for screening purposes. Their test set was also derived from the 2014 version of the PDBbind core set, and it also consists of 65 proteins, each binding to 3 different ligands. This set overlaps with the 2007 PDBbind core set in only 10 protein families. We use the remaining 55 protein families of the 2007 PDBbind test set, including their active and (generated) inactive molecules, as the training set for our proposed screening-specific ML SFs. The overall size of this training set is 10726 protein-ligand pairs, without any overlap with the screening test set.

6.2 Results and Discussion

6.2.1 Evaluation of the screening power of scoring functions

In this work, we predict the binding affinity or activity probability score for every protein-ligand complex in our test sets. Our test sets are composed of clusters of protein families. Each cluster represents a protein and several ligands, and we are interested in classifying which of them are active. Therefore, we use the predicted binding affinity or activity probability score of the ligands to rank-order them in each cluster. The screening power of all SFs is then assessed based on their enrichment performance on the screening test set of protein clusters. The enrichment factor is a measure that accounts for the number of true binders among the top x% ranked molecules and is calculated as follows:

EFx% = NTBx% / (NTBtotal × x%),

where NTBx% refers to the number of actives in the top-ranked x% of molecules, and NTBtotal is the total number of binders for the given target protein, which is typically three. In our results, we report the average enrichment factor across the protein clusters of the given test set.

6.2.2 Screening-specific ML SFs vs. conventional approaches on a diverse test set

We conduct two sets of experiments in this section to demonstrate the screening accuracy of the proposed SFs. In the first, we build 8 machine-learning SFs trained on the 2007 protein-ligand complexes characterized with one or more feature sets of types X, A, R, and/or G. The ML SFs we construct are based on the ensemble deep neural networks BsN-Score and BgN-Score, the ensemble decision trees BRT and RF, SVM, kNN, MARS, and MLR. Two versions of each of these 8 models have been built: a scoring-specific version in which the training complexes (actives only) are labeled using the measured binding affinity data (Y ∈ R), and a screening-specific version in which the protein-ligand complexes are labeled with binary active/inactive values (Y ∈ {1, 0}). These two sets of 8 ML SFs are then compared against the conventional, BA-based SFs.
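A minimal sketch of this two-labeling setup, using scikit-learn gradient-boosting models as generic stand-ins for the ensemble SFs above (this is not the BsN/BgN implementation), fits the same descriptor matrix once as a regressor on measured affinities and once as a classifier on activity labels:

from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier

def fit_both_versions(X_train, y_affinity, y_activity):
    # Scoring-specific version: continuous binding affinity labels (Y in R).
    scoring_sf = GradientBoostingRegressor(n_estimators=500).fit(X_train, y_affinity)
    # Screening-specific version: binary active/inactive labels (Y in {0, 1}).
    screening_sf = GradientBoostingClassifier(n_estimators=500).fit(X_train, y_activity)
    return scoring_sf, screening_sf

# At screening time, ligands in each cluster are ranked either by
# scoring_sf.predict(X) or by screening_sf.predict_proba(X)[:, 1],
# and the resulting enrichment factors are compared.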
Table 6.1: Screening powers of proposed and conventional scoring functions on diverse complexes from the PDBbind 2014 core (Cr) test set. Training and test protein-ligand complexes are characterized using a combination of descriptors from at least one of the four types X, A, R, or G.

BA modeling
Scoring function         EF1%    EF5%    EF10%
Glide::GlideScore-XP     19.54   6.27    4.14
GOLD::ChemScore          18.91   6.83    4.08
Glide::GlideScore-SP     16.81   6.02    4.07
DS::LUDI3                12.53   4.28    2.80
GOLD::ASP                12.36   6.23    3.79
MLR::XR                   8.72   2.46    1.90
GOLD::GoldScore           7.95   4.52    3.16
SVM::A                    7.69   3.69    2.78
BsN-Score::XA             7.18   3.80    3.03
BRT::XARG                 7.18   3.60    2.51
BgN-Score::XAG            7.18   2.97    2.36
RF::XR                    7.15   2.67    2.15
DS::PLP1                  6.92   4.28    3.04
MARS::RG                  6.15   1.85    1.80
DS::Jain                  5.90   2.51    1.80
SYBYL::PMF-Score          5.38   2.21    1.90
SYBYL::ChemScore          5.26   2.38    2.18
DS::PMF                   4.87   2.87    2.63
AffiScore                 3.62   2.04    1.92
DNN-Score::XARG           3.08   1.54    1.44
DrugScoreCSD              2.62   1.48    1.16
X-Score::HMScore          2.31   2.14    1.41
SYBYL::D-Score            2.31   1.79    1.46
kNN::XR                   2.05   1.95    1.74
SYBYL::G-Score            1.92   1.26    1.44

Bio-activity modeling
Scoring function         EF1%    EF5%    EF10%
BsN-Score::XAG           33.85   8.72    5.03
SVM::XAG                 30.26   8.41    5.08
BgN-Score::XAG           28.72   8.51    4.92
MLR::XARG                28.72   8.10    4.77
MARS::XARG               28.72   7.90    4.82
BRT::XAG                 26.15   8.31    4.72
RF::XA                   24.10   7.80    4.97
DNN-Score::XAG           22.05   7.59    4.46
Glide::GlideScore-XP     19.54   6.27    4.14
GOLD::ChemScore          18.91   6.83    4.08
Glide::GlideScore-SP     16.81   6.02    4.07
kNN::XA                  14.85   7.69    4.56
DS::LUDI3                12.53   4.28    2.80
GOLD::ASP                12.36   6.23    3.79
GOLD::GoldScore           7.95   4.52    3.16
DS::PLP1                  6.92   4.28    3.04
DS::Jain                  5.90   2.51    1.80
SYBYL::PMF-Score          5.38   2.21    1.90
SYBYL::ChemScore          5.26   2.38    2.18
DS::PMF                   4.87   2.87    2.63
AffiScore                 3.62   2.04    1.92
DrugScoreCSD              2.62   1.48    1.16
X-Score::HMScore          2.31   2.14    1.41
SYBYL::D-Score            2.31   1.79    1.46
SYBYL::G-Score            1.92   1.26    1.44

In Table 6.1 we list the screening powers of the 8 ML and 16 conventional SFs in terms of their (x =) 1, 5, and 10 per cent enrichment factors. In the BA-modeling part of the table, we report screening performance when the ML SFs were trained to predict binding affinity instead of bioactivity (whether the ligand is active or not). We notice that all of the BA-based ML SFs are outperformed by the best conventional approaches. The bio-activity-modeling part of the table shows the screening accuracy of conventional scoring functions and the proposed task-specific models. The ML screening-specific SFs that were trained as bioactivity classifiers are the top performers (except for kNN::XA). Our proposed SF based on ensemble neural networks has the highest screening performance of 33.85 in terms of EF1%. This is more than an order of magnitude better than X-Score, whose EF1% is only 2.31. BsN-Score::XAG is also substantially more accurate (by about 73%) than the best performing conventional SF, GlideScore-XP, whose EF1% value is 19.54. In addition, the screening-specific version of BsN-Score achieved more than a 370% improvement in enrichment accuracy over its generic BA counterpart BsN-Score::XA, whose EF1% = 7.15. This boost in performance can be primarily attributed to the task-specific learning approach, as is apparent by comparing the performance of BsN-Score::XAG learned through classification to its best BsN-Score regression version, BsN-Score::XA (note that the regression version of BsN-Score::XAG has an even smaller EF1% value of 6.88). We are confident that a substantial further improvement in screening accuracy is achievable when all the ideas proposed in Chapter 8 are fully developed and applied.
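The enrichment factors reported in Table 6.1 follow the definition given in Section 6.2.1; a minimal sketch of the computation (not the evaluation code used in this work) is:

import numpy as np

def enrichment_factor(scores, labels, frac=0.01):
    """scores: predicted activity scores for one protein cluster (higher = more likely active).
    labels: 1 for actives, 0 for inactives.  frac: top fraction examined (0.01 for EF1%)."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                        # best-scored molecules first
    n_top = max(1, int(round(frac * len(scores))))
    ntb_top = labels[order[:n_top]].sum()              # actives recovered in the top x%
    return ntb_top / (labels.sum() * frac)             # EFx% = NTBx% / (NTBtotal * x%)

def mean_ef(clusters, frac=0.01):
    """clusters: iterable of (scores, labels) pairs, one per protein target."""
    return float(np.mean([enrichment_factor(s, l, frac) for s, l in clusters]))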
In the second experiment, we consider an updated version of the ensemble deep-learning models to build the scoring- and screening-specific SFs BsN-Score and BsN-Screen. We compare them to the conventional SFs RF-Score and X-Score, which are based on Random Forests and linear regression. All four SFs are trained and tested on complexes from PDBbind 2007 and 2014. The core test set of PDBbind 2014 remains one of the main validation sets for the new SFs. We also use two additional test sets based on cross-validation (CV) and leave-clusters-out (LCO), as described in Section 2.1. The methodology we followed in the previous section (6.1) to construct active and inactive complexes is also followed here for the three test sets Cr, CV, and LCO. The training and test complexes for BsN-Score and BsN-Screen are characterized by the 2714 multi-perspective descriptors generated using the proposed Descriptor Data Bank platform. RF-Score and X-Score utilize their original descriptor sets of 36 geometrical and 6 physicochemical features, respectively.

Figure 6.2: The screening accuracy of screening-specific (proposed) and binding-affinity-based (conventional) scoring functions when evaluated on test complexes with proteins that are either fully represented (Cr, known targets), partially represented (CV, known and novel targets), or not represented (LCO, novel targets) in the SFs' training data. The screening accuracy is expressed in terms of the 1%, 5%, and 10% enrichment factors of the SFs in classifying ligands into actives and inactives against a diverse set of proteins from the PDBbind 2014 benchmark.

The performance of the four SFs in terms of enrichment factors is shown in Figure 6.2 for known and novel test protein targets. The plots clearly show that the screening-specific SF BsN-Screen is considerably more accurate than the generic, binding-affinity-based methods regardless of the model's familiarity with the protein target. The gap in performance between BsN-Screen and BsN-Score is mainly attributed to the use of both active and inactive complexes to train BsN-Screen, rather than just the active compounds on which BsN-Score is trained. BsN-Screen and BsN-Score are otherwise identical in terms of the training protein targets, descriptors, and the underlying ensemble deep-learning method they use to fit the data. BsN-Screen is also more accurate than the best performing conventional SFs, as shown in Figure 6.3. The screening accuracy of GlideScore-SP, ChemScore, and GlideScore-XP has only been reported by Li et al. [25] on the known core test set complexes. However, we expect their performance on the other test sets to be similar to what we observed for the other BA-based SFs in Figure 6.2.
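The CV and LCO test sets used in Figure 6.2 differ in whether the protein families of the held-out complexes are seen during training; for the LCO case, the essential requirement is that no family appears on both sides of a split. A minimal sketch using scikit-learn's grouping utilities (an illustration only; the exact protocol is the one described in Section 2.1):

from sklearn.model_selection import GroupKFold

def leave_clusters_out_splits(X, y, protein_family_ids, n_folds=5):
    """Yield (train_idx, test_idx) pairs in which no protein family occurs on both
    sides, so every test fold consists of targets that are novel to the trained SF."""
    splitter = GroupKFold(n_splits=n_folds)
    yield from splitter.split(X, y, groups=protein_family_ids)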
Figure 6.3: Screening accuracy of screening-specific and top-performing conventional scoring functions in terms of 1%, 5%, and 10% enrichment factors of protein targets sampled from the core test set of PDBbind 2014.

6.2.3 Impact of the number of training protein targets on the screening accuracies

Experimental information about the 3D structures and bioassays of new protein-ligand complexes is determined regularly. This contributes to the growing size of public biochemical repositories and corporate compound databases. To assess the impact that a larger training set size would have on the predictive accuracy of generic and screening-specific SFs, we consider the 2007 and 2014 PDBbind datasets. We use the core (Cr) set of PDBbind 2014 as our main test set in this experiment. We use the independent core (Cr) set complexes of PDBbind 2007 and the primary (Pr) complexes of PDBbind 2014 to train generic and screening-specific SFs. Therefore, our training data consists of 3225 complexes in the primary (Pr) set as well as (55 × 3 =) 165 complexes in Cr, which yields a total of 3390 complexes. We use these complexes to simulate the effect of increasing training set size on the screening accuracy of the models under investigation. More specifically, for a given number of training complexes x, where x ∈ {100, 680, 1260, 1840, 2420, 3000}, we select x complexes randomly (without replacement) from the 3390 PLCs to train the models. We then test them on the out-of-sample core set Cr comprising 195 complexes. This process is repeated 50 times to obtain robust average enrichment factors EF1% and EF5%, which are plotted in Figure 6.4.

Figure 6.4: Dependence of screening accuracy of screening-specific and binding-affinity-based scoring models on training set size when training complexes are selected randomly (without replacement) from the refined set of PDBbind 2014 and the core set of PDBbind 2007. The screening accuracy is expressed in terms of the 1% and 5% enrichment factors on the PDBbind 2014 core test set.

The SFs we build here include the boosted decision tree SFs BT-Screen and BT-Score, based on XGBoost. BT-Screen is our screening-specific SF trained on active and inactive complexes characterized by our multi-perspective descriptors. BT-Score is trained only on active complexes; however, it uses the same descriptors and the same underlying boosted-trees learning algorithm as BT-Screen. We also train our versions of RF-Score and X-Score, which are constructed by fitting Random Forests and linear regression models to the training complexes characterized by the original RF-Score's 36 geometrical descriptors and X-Score's 6 physicochemical energy terms, respectively. Figure 6.4 shows the screening accuracies of BT-Screen, BT-Score, and our versions of RF-Score and X-Score. For both enrichment factors (EF1% and EF5%), we observe that BT-Screen and BT-Score improve as the size of the training dataset increases.
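A minimal sketch of the training-set-size experiment just described is given below; the fit_fn, score_fn, and test_clusters arguments are placeholders for a model-fitting routine, an enrichment-factor evaluator, and the fixed core-set clusters (this is not the code used in this work):

import numpy as np

def learning_curve_ef(fit_fn, score_fn, X_pool, y_pool, test_clusters,
                      sizes=(100, 680, 1260, 1840, 2420, 3000), repeats=50, seed=0):
    """For each training-set size, draw complexes at random without replacement,
    refit the model, evaluate it on the fixed test clusters, and average the
    enrichment factor over the repeated draws.  X_pool/y_pool are numpy arrays."""
    rng = np.random.default_rng(seed)
    results = {}
    for size in sizes:
        efs = []
        for _ in range(repeats):
            idx = rng.choice(len(X_pool), size=size, replace=False)
            model = fit_fn(X_pool[idx], y_pool[idx])
            efs.append(score_fn(model, test_clusters))    # e.g. mean EF1% over clusters
        results[size] = float(np.mean(efs))
    return results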
The screening performance of RF-Score and X-Score is substantially lower than that of both ensemble SFs, and their rate of improvement is almost zero beyond 680 training complexes. A similar improvement trend was also observed for Random Forests (whose accuracy plots are not shown) when fitted to our multi-perspective descriptors instead of RF-Score's original 36 geometric descriptors. Another set of Random Forest models fitted to the 6 features used by X-Score was also found to improve in screening accuracy as the number of training complexes increases. This indicates that ensemble SFs have a potential for improvement as more training data becomes available. On the other hand, the linear model used in X-Score and other empirical SFs showed a minimal rise in screening power in response to increasing the training set size, which is an indication of a very small potential for improvement in the future. This is due to the rigidity of linear models, whose performance tends to saturate. Many of the conventional SFs considered in this study (Table 1.1) are empirical, and thus they are also likely to suffer from the same limitation. In fact, the best performing model of those 16 in terms of scoring power is X-Score, whose performance we study here. Therefore, one should consider better prediction approaches to derive accurate models from the training data available on hand and from future updates. SF designers can conduct similar experiments to estimate the accuracy enhancement when their proposed functions are re-calibrated on a larger number of data instances.

6.2.4 Impact of the number of descriptor sets on the screening accuracies

To investigate the effect of increasing the number of descriptor sets on the screening accuracy of SFs, we randomly select a maximum of 100 combinations of x descriptor sets from all possible (16 choose x) combinations of the 16 types listed in Table 3.1, where x ∈ {1, 4, 7, 10, 13, 16}, and use them to characterize the training set complexes, which we then use to train the XGBoost, Random Forests, and MLR models to build BT-Screen and BT-Score, RF-Score, and X-Score analogs. These models are subsequently tested on the Cr dataset characterized by the same descriptors. For each number, x, of descriptor sets, the performance of the 100 SFs is averaged to obtain robust overall EF1% and EF5% statistics, which are plotted in Figure 6.5.

Figure 6.5: Dependence of screening accuracy of screening-specific (proposed) and binding-affinity-based (conventional) scoring models on the number of descriptor (feature) sets. The sets are drawn randomly (without replacement) from a pool of 16 feature types in Descriptor Data Bank (DDB) and used to characterize the training and out-of-sample core Cr test set complexes of PDBbind 2014. The screening accuracy is expressed in terms of the 1% and 5% enrichment factors of protein targets sampled from the core test set.

Similar to the case of increasing the training data in terms of new PLCs, the ensemble SFs clearly benefit from increasing the number of protein-ligand interaction models for both enrichment factor statistics. The number of descriptor sets (interaction perspectives) is as important as the number of training complexes in improving the quality of prediction.
Based on these results, and the similar improvement trends for the scoring and docking accuracies in the previous chapters, we believe that the public database of protein-ligand interactions we introduced in this work will be of high value to ML SFs and just as important to their screening accuracy as the resources of raw structural and experimental data.

6.3 Conclusion

In this work, we developed the screening-specific scoring functions BsN-Screen and BgN-Screen using novel ensemble models of boosted and bagged deep neural networks. The models are fitted to a large and diverse database of active and inactive complexes characterized by thousands of multi-perspective descriptors. BsN-Screen and BgN-Screen are optimized to model ligand activity directly as a classification problem to improve their accuracy in finding truly active compounds in databases of new ligands not seen in their training set. We compare BsN-Screen and BgN-Screen with their generic counterparts BsN-Score and BgN-Score, which are trained on (only) active complexes labeled with binding affinity data and characterized using our multi-perspective descriptors. The regression models BsN-Score and BgN-Score, as well as the vast majority of existing SFs, order test ligands based on their predicted binding affinity such that potential active compounds are ranked above the molecules that the models deem inactive. Our extensive experiments indicate that the screening-specific SFs are substantially more accurate in enriching ligand databases with active compounds than their generic, BA-based counterparts. BsN-Screen and BgN-Screen have top 1% enrichment factors of more than 35, as opposed to 11 obtained by their scoring-specific analogs BsN-Score and BgN-Score. The best conventional SF, GlideScore-XP, an empirical generic model implemented in the docking software Glide, achieves an enrichment factor of 19.54 when tested on the same benchmark complexes [25]. Furthermore, the proposed screening-specific SFs also excelled when their capacity for improvement was tested in response to more training data and descriptor sets.

CHAPTER 7 MULTI-TASK DEEP NEURAL NETWORKS FOR SIMULTANEOUS DOCKING, SCREENING, AND SCORING OF PROTEIN-LIGAND COMPLEXES

7.1 Introduction

Protein-ligand complexes fed to the proposed docking-, scoring-, and screening-specific SFs for prediction are typically characterized by the same descriptors and are formed with the same protein targets. The training datasets of these task-specific scoring functions may differ only in their sizes and in the type of the output (dependent) variable. When a boosted tree or any other ML model is fitted to the dataset of one task, it does not take advantage of the datasets available for the other tasks to improve its performance. Rather, the three scoring functions are trained and applied in isolation despite the commonalities between the three problems they model. In this chapter, we propose MT-Net, a novel multi-task deep neural network that can be effectively trained and applied to the three datasets simultaneously. MT-Net leverages information sharing across the three molecular modeling problems and uses abundant data for one problem to improve the performance on tasks with limited data availability. In a sense, optimizing the network for multiple tasks simultaneously self-regularizes it against overfitting the tasks with limited data, which could occur if they were learned separately.
Multi-task learning using deep neural networks has been successfully applied in recent studies related to ligand-based drug design [115, 116, 117]. Unlike the screening, scoring, and docking tasks studied in this work, each task in those studies is a predefined protein system for which compounds are classified as active or inactive molecules. As is the case for other QSAR models, such a multi-target neural network must be re-trained from scratch in order to be used for targets different from the training proteins. On the other hand, the multi-task SFs we develop in this work are applicable to any protein system without the need for extra training. During their construction, our task-specific and multi-task scoring functions take into account information about the ligand and the protein explicitly instead of implicitly via multi-target learning as in some QSAR models [115, 116, 117]. Therefore, the proposed scoring functions are applicable to any target, as was shown by our experiments in the previous chapters (refer to the SFs' performance on known and novel targets).

7.2 Network architecture

The neural network we propose here consists of an input layer for the raw descriptors x^s_1, L_s hidden layers shared (s) among the three tasks for extracting low-level representations of the molecular interactions, L_t task-specific hidden layers for each task t ∈ {dock, screen, score} to generate high-level representations specific to each task, and corresponding output layers Y^t, as depicted in Figure 7.1. During prediction, lower-level features universal to the three tasks are first extracted by the shared hidden layers as follows:

x^s_{l+1} = H(W^s_l x^s_l + b^s_l), l = 1, ..., L_s,

where W^s_l and b^s_l are, respectively, the weight matrix and the bias vector of the l-th shared hidden layer, and H is its activation function, which is selected to be a rectified linear unit (ReLU) in this work. The output of the final shared hidden layer is then fed as input (x^t_1 = x^s_{L_s+1}) to the task-specific layers to extract higher-level representations specialized for each task using the following transformations:

x^t_{l+1} = H(W^t_l x^t_l + b^t_l), l = 1, ..., L_t.

Here, each task-specific layer is parameterized by a weight matrix W^t_l and a bias vector b^t_l. The final outputs for the three tasks are simply the following transformations of the features x^t_{L_t+1} generated by the high-level hidden layers:

Y^t = O^t(W^t_o x^t_{L_t+1} + b^t_o),

where O^t is the output function for the t-th task, which is logistic for the screening problem and linear for the docking and scoring tasks.

Figure 7.1: The architecture of the multi-task deep neural network SF MT-Net: an input of 2700 raw descriptors; shared hidden layers of 2500, 1500, and 1000 ReLUs; task-specific hidden layers of 200/200 units for the scoring and docking heads and 200/100 units for the screening head (dropout rate 0.2 throughout); and linear output layers for the docking and scoring tasks with a logistic output layer for the screening task.

The number and sizes (number of hidden units) of the shared and task-specific hidden layers were tuned using cross-validation.
We tested multiple pyramidal configurations for the network and found the configuration of three hidden layers with sizes {2500, 1500, 1000} to be optimal for the shared layers, {200, 200} for the docking- and scoring-specific layers, and {200, 100} for the screening-specific layers. Further details about the network's training procedure are provided in the following section.

7.3 Training the multi-task network

Nowadays, training wide and deep neural networks is simpler than ever before, thanks to recent advances in deep learning and computational resources such as Graphics Processing Units (GPUs). However, even with such tools, building custom deep neural network scoring functions for multiple tasks is not a trivial machine-learning problem. The following sections summarize the main challenges we faced during the model training process and the methods we followed to solve them.

7.3.1 Stochastic learning for imbalanced data

The three molecular modeling tasks benefit from training sets of different sizes. For the scoring task, we have an average of 3000 complexes with valid binding affinity labels. The screening scoring function is trained with more than 15000 PLCs with binary labels indicating whether the complexes are active or not. The docking task benefits from the largest training data, which includes 300000 unique protein-ligand conformations. For the network to perform well across the three tasks, it should utilize all available training data while guarding its weights from overfitting the tasks with the larger training sets. We achieved this by performing stochastic gradient descent with equally-sized minibatches sampled randomly from the training data of the three tasks. More specifically, we sample 10 PLCs from each task for a total minibatch size of 30 examples during every training (weight-updating) step. We also experimented with larger batch sizes of 100 complexes sampled proportionally from the three tasks according to their sizes. To avoid the bias of over-optimizing the weights on tasks with larger training data, we weighted the gradients up or down for each task depending on the number of examples in each batch for that task. This resulted in a small improvement in training speed. However, we reverted to the former uniform sampling approach due to the increased implementation complexity of the weighted-gradient approach.

7.3.2 Gradient computing for unavailable labels

Despite having common training protein targets across the three molecular modeling tasks, the majority of proteins form unique complexes across tasks. Complexes with the same protein differ from each other in the bound ligands and/or the conformations of those ligands. Complexes available for one task are typically not present for the other two tasks, and consequently they do not share the same labels. The task-specific labels we use here are the binding affinity, the RMSD of a pose from its respective native conformation, or the active/inactive binary value. This poses a challenge during training when computing the gradients used to update the network's weights. Gradients are calculated at the output layer of each task based on the difference between the actual labels and the values predicted by that output layer. We use gradients of zero for the tasks with unlabeled examples so that the task-specific weights associated with missing labels do not get updated, while weights in the shared layers are updated with gradients calculated from labeled examples only.
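A minimal sketch of such a shared-trunk network in Keras is shown below. The layer widths follow the configuration stated above; everything else (optimizer, loss choices, and the sample-weight mechanism used to mimic the zero gradients of Section 7.3.2) is illustrative rather than the exact MT-Net implementation:

from tensorflow.keras import layers, Model

def build_mt_net(n_descriptors=2700, dropout_rate=0.2):
    """Shared trunk of 2500-1500-1000 ReLUs feeding three task-specific heads."""
    inputs = layers.Input(shape=(n_descriptors,))
    x = layers.Dropout(dropout_rate)(inputs)
    for width in (2500, 1500, 1000):              # shared low-level representation layers
        x = layers.Dense(width, activation="relu")(x)
        x = layers.Dropout(dropout_rate)(x)

    def task_head(shared, widths, activation, name):
        h = shared
        for width in widths:                      # task-specific high-level layers
            h = layers.Dense(width, activation="relu")(h)
            h = layers.Dropout(dropout_rate)(h)
        return layers.Dense(1, activation=activation, name=name)(h)

    outputs = [
        task_head(x, (200, 200), "linear", "dock"),     # predicted pose RMSD
        task_head(x, (200, 200), "linear", "score"),    # predicted binding affinity
        task_head(x, (200, 100), "sigmoid", "screen"),  # activity probability
    ]
    model = Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss={"dock": "mse", "score": "mse", "screen": "binary_crossentropy"})
    return model

# One way to reproduce the zero-gradient treatment of missing labels is to call
# model.fit(X, {"dock": y_rmsd, "score": y_ba, "screen": y_act},
#           sample_weight={"dock": w_dock, "score": w_score, "screen": w_screen})
# with per-example weights of 0 wherever a task's label is unavailable.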
It should be noted that each of the 30 training examples in any random minibatch has at least one available label and at most two missing labels.

7.3.3 Network regularization

Networks with wide and deep hidden layers and multiple output layers, such as the one we build here, have millions of parameters (weights and biases), which outnumber our training examples several fold. Therefore, regularizing the network is necessary to avoid overfitting. MT-Net has a powerful regularization mechanism in place, enabled by the simultaneous optimization of the shared layers on different training data for each task, which helps prevent the shared weights from overfitting individual tasks. To further guard against overfitting, we apply dropout to the hidden units by randomly selecting 20% of the network's activations and deliberately switching them off to zero for each protein-ligand complex in the current training minibatch. Random dropout was introduced by Hinton et al. to avoid overfitting by preventing co-adaptations among the hidden units of each layer [118]. A network trained with dropout can effectively be viewed as an average of an ensemble of multiple thinned networks. The cost of using dropout is a longer training time, since parameter updates are noisy due to the random replacement of activations with zeros.

7.3.4 Computational requirements

Training a wide and deep neural network with thousands of input descriptors, hidden units in several deep layers, and three output layers results in millions of parameters. A network of such scale requires a powerful computer system to train and execute. We use an Amazon AWS instance with an NVIDIA Kepler GK210 GPU featuring 2496 parallel processing cores and 12 gigabytes of graphics-dedicated memory. Training the network on this system takes five hours of wall-clock time. Predictions for a dataset of 1000 protein-ligand complexes can be computed in 5 milliseconds on the same machine. We use the Python machine-learning libraries TensorFlow and Keras to build our multi-task neural network scoring functions [43, 119].

7.4 Prediction and accuracy

In this section, we investigate the benefits of multi-task learning for simultaneously predicting the ligand binding pose, the binding affinity, and whether the ligand is active or not for any protein target. We build a scoring function based on the multi-task neural network architecture described in Section 7.2 and compare it with conventional approaches and with its single-task counterpart on the scoring, screening, and docking tasks. The SFs based on the multi-task (MT-Net) and single-task neural networks are trained on the primary training set of PDBbind 2014 and tested on the out-of-sample core test set Cr. The performance results of these scoring functions and the best conventional approaches are listed in Table 7.1.

Table 7.1: The docking, screening, and scoring performance of multi-task and single-task SFs on the PDBbind 2014 core test set (Cr).

SF type                 Docking (S11, %)   Screening (EF1%)        Scoring (Rp)
Multi-task (MT-Net)     79.97              29.17                   0.804
Single-task             80.51              22.87                   0.803
Best conventional (1)   – (ChemPLP)        19.54 (GlideScore-XP)   0.627 (X-Score)

(1) We report the screening and scoring accuracy of the top-performing conventional scoring functions GlideScore and X-Score on these tasks. We use the screening accuracy for GlideScore-XP from the recent comparative study by Li et al. [25]. The scoring performance of X-Score is calculated by us since we have access to its software.
We do not have access to the docking success rate statistic S11 for the most accurate conventional SF, ChemPLP. However, ChemPLP's S21 is 82%, while MT-Net and the single-task NN achieve 90% on this less stringent test. The screening and scoring performance of MT-Net is substantially higher than that of GlideScore-XP and X-Score, which are the best performing conventional approaches on these two tasks. MT-Net's binding pose prediction is also more accurate than that of the top performing conventional scoring function in the docking task, ChemPLP. MT-Net and ChemPLP correctly identified the binding modes of 90% and 82% of the protein-ligand complexes in the core test set of PDBbind 2014, respectively. The binding mode is considered correctly identified if the top-scoring pose of the ligand lies within 2 Å RMSD of the known native conformation. MT-Net achieves an 80.51% success rate on the more stringent test, in which the top-scoring pose must lie within 1 Å RMSD (instead of 2 Å RMSD) of the known binding mode to be considered native or near-native. We denote this success rate by S11 and the former, less stringent test by S21. We do not report the S11 success rate for ChemPLP since it was not available in the original publication [25] and we do not have access to that scoring function's predictions for the test complexes to compute its S11 value. The screening performance of the multi-task SF is substantially higher than that obtained by the single-task neural network. The gap in performance between the two functions is much narrower for the docking and scoring tasks, with a slight lead for the traditional single-task approach. The finding that docking performance does not benefit from the multi-task approach might be attributed to the fact that the docking dataset is already large, so the additional screening and scoring data are relatively too small to make a difference. The reason behind the small difference in the scoring task is not very clear and warrants further investigation. However, it might be due to the fact that the docking and screening datasets are generated from the scoring complexes, and therefore the derived complexes may not contribute new information. We believe that the gains in performance would be much greater if the scoring, screening, and docking data were more diverse and larger in size. To our knowledge, such a novel architecture for scoring functions has not been attempted before, and we believe it could be used as a blueprint for future scoring functions as more data become available.

7.5 Conclusion

We proposed a novel multi-task scoring function for simultaneously predicting a ligand's binding pose, its activity classification label (active/inactive), and its binding affinity with a target protein. The scoring function, MT-Net, is based on a wide and deep multi-task neural network with shared hidden layers and three sets of task-specific hidden and output layers for the docking, screening, and scoring tasks. When fitted to our current training sets, MT-Net performed on par with the single-task neural network SFs when tested on out-of-sample test sets carefully designed to evaluate screening, docking, and scoring accuracy. Due to their high capacity, we anticipate that the accuracy of scoring functions based on multi-task deep networks will improve beyond those based on single-task NNs when larger and more diverse datasets are used for training.
Currently we are developing a pipeline for automated data retrieval from biochemical databases to continuously train the network on new protein-ligand complexes and screening assays as they become available. When coupled with the proposed descriptor databank (DDB) platform, we think the active multi-task learning pipeline will produce an evolving, powerful scoring function. 168 CHAPTER 8 CONCLUSION AND FUTURE WORK 8.1 Proposed approaches and key findings The aim of this research is to develop next generation scoring functions for successful computeraided drug discovery. The work presented in this dissertation helped to achieve this goal by the development of Descriptor Data Bank and novel task-specific scoring functions based on machinelearning approaches. In the following sections, we briefly describe the two components and summarize their effectiveness quantitatively. 8.1.1 Descriptor Data Bank (DDB) for multi-perspective characterization of protein-ligand interactions In this work we presented Descriptor Data Bank (DDB), an online platform for facilitating multiperspective modeling of PL interactions. The online platform is an open-access hub for depositing, hosting, executing, and sharing tools and data arising from a diverse set of interaction modeling hypotheses. In addition to the descriptor extraction tools, the platform also implements a machinelearning (ML) toolbox that drives the DDB’s automatic descriptor filtering and evaluation utilities as well as scoring function fitting and prediction functionalities. To enable local access to many of DDB’s utilities, command-line programs and a PyMOL plug-in have also been developed and can be freely downloaded for offline multi-perspective modeling. We seed DDB with 16 diverse descriptor extraction tools developed in-house and collected from the literature. The tools combined generate over 2700 descriptors that characterize (i) proteins, (ii) ligands, and (iii) protein-ligand complexes. The in-house descriptors we propose here, REPAST & RETEST, are target-specific and based on pair-wise sequence and structural alignment of proteins followed by target clustering and triangulation. We built and used the fit/predict service in DDB to fit ML SFs to the in-house descriptors and those collected from the literature. We then evaluated them on several 169 data sets that were constructed to reflect real-world drug screening scenarios. We found that multiperspective SFs that were constructed using large and diverse number of DDB interaction models outperformed their single-perspective counterparts in all evaluation scenarios with an average improvement of 15%. We also found that our proposed target-specific descriptors improve upon the accuracy of SFs that were trained without them. In addition, DDB’s filtering module was able to exclude noisy and irrelevant descriptors when artificial noise was included as new features. We also observed that the tree-based ensemble ML SFs implemented in DDB are robust even with the presence of noisy and decoy descriptors. 8.1.2 Boosted and bagged deep neural networks for task-specific scoring functions The proposed descriptor data bank platform made it possible to efficiently capture thousands of important protein-ligand interactions in the form of molecular descriptors or features for entire databases of complexes. However, current scoring functions lack the flexibility and capacity to take advantage of such data fully. 
Moreover, conventional modeling strategies have been centered around building generic scoring functions based on simple linear regression models that attempt to predict ligand binding poses (docking task) and binary class labels of activity (screening task) indirectly via binding affinity prediction (scoring task). In the previous chapters, we showed the limitations of traditional scoring functions not only in the docking and screening tasks, but also the scoring task for which they are optimized to model. Therefore, we presented a novel set of task-specific scoring functions based on large ensemble of deep neural networks to model the three tasks. The boosted and bagged neural network scoring functions are BsN-Score and BgNScore for the scoring task, BsN-Dock and BgN-Dock for the docking task, and BsN-Screen and BgN-Screen for the screening task. We evaluated the docking, screening, scoring, and ranking accuracies of the proposed scoring functions, as well as several conventional approaches in the context of the 2007 and 2014 versions of the PDBbind benchmark that encompasses a diverse set of high-quality protein families and drug-like molecules. In terms of scoring accuracy, we found that the ensemble NN SFs, BgN-Score 170 and BsN-Score, have more than 35% better Pearson’s correlation coefficient (0.844 and 0.840 vs. 0.627) between predicted and measured binding affinities compared to that achieved by a state-ofthe-art conventional SF. We further found that ensemble NN models surpass SFs based on other state-of-the-art ML algorithms such as Boosted Regression Trees, Random Forests, and SVM. Similar results have been obtained for the ranking task. Within clusters of protein-ligand complexes with different ligands bound to the same target protein, we found that the best ensemble NN SF is able to rank the ligands correctly based on their experimentally-determined binding affinities 64.6% of the time and identify the top binding ligand 78.4% of the time. Given the challenging nature of the ranking problem and that SFs are used to screen millions of ligands, this represents a significant improvement over the the best conventional SF we studied, for which the corresponding ranking performance values are 57.8% and 73.4%. A substantial improvement in the docking task has also been achieved by our proposed docking-specific SFs. We found that the best performing ML SF has a success rate of 95% in identifying poses that are within 2 Å root-mean-square deviation from the native poses of 65 different protein families. This is in comparison to a success rate of only 82% achieved by the best conventional SF, ChemPLP, employed in the commercial docking software GOLD. As for the ability to distinguish active molecules from inactives, our screeningspecific SFs showed excellent improvements over the conventional approaches. The proposed SF BsN-Screen achieved a screening enrichment factor of 33.90 as opposed to 19.54 obtained from the best conventional SF which is employed in GOLD. For all tasks, we observed that the proposed task-specific NN SFs benefit more than their conventional counterparts from increases in the number of features and the size of training dataset. They also perform better than the conventional SFs on novel proteins that they were never trained on before. The accuracies of ensemble NN SFs are even higher when they predict for protein-ligand complexes that are related to their training sets. 
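As an illustration of the bagging component of the BgN-style ensembles summarized above (a sketch only, not the BgN-Score implementation), several small neural networks can be fitted to bootstrap resamples of the training complexes and their predictions averaged to reduce the variance component of error:

import numpy as np
from tensorflow.keras import layers, Sequential

def make_base_net(n_descriptors):
    # A small fully-connected regressor; the actual base learners are deeper networks.
    return Sequential([
        layers.Dense(512, activation="relu", input_shape=(n_descriptors,)),
        layers.Dense(128, activation="relu"),
        layers.Dense(1),                                  # predicted binding affinity
    ])

def fit_bagged_ensemble(X, y, n_members=25, epochs=30, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
        net = make_base_net(X.shape[1])
        net.compile(optimizer="adam", loss="mse")
        net.fit(X[idx], y[idx], epochs=epochs, verbose=0)
        members.append(net)
    return members

def predict_bagged(members, X):
    # Averaging the members' predictions lowers the ensemble's variance error.
    return np.mean([m.predict(X, verbose=0).ravel() for m in members], axis=0)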
The high predictive accuracy of ensemble SFs BsN-Score and BgN-Score is due to the following three factors: (i) the low bias error of the highly-nonlinear neural network base learners, (ii) the low variance error achieved by using bagging and boosting, and (iii) the employed multi-perspective set of diverse descriptors we extract for protein-ligand complexes using Descriptor Data Bank. 171 8.1.3 Multi-task deep neural networks for simultaneous binding pose, activity, and affinity prediction In addition to the three task-specific models, we developed a novel multi-task scoring function to simultaneously predict ligand binding poses, their activity class labels (active/inactive), and binding affinities against the target protein. The scoring function, MT-Net, is based on wide and deep multi-task neural network with shared hidden layers and three sets of task-specific hidden and output layers for the docking, screening, and scoring tasks. When fitted to our current training sets, MT-Net outperformed conventional scoring functions and matched single-task neural network SFs when tested on out-of-sample test sets carefully designed to evaluate screening, docking, and scoring accuracy. Due to their high-capacity, we expect the accuracy of scoring functions based on multi-task deep networks to improve beyond those based on single-task NNs when larger and more diverse datasets are used for training. Currently we are developing a pipeline for automated data retrieval from biochemical databases to continuously train the network on new protein-ligand complexes and screening assays as they become available. When coupled with the proposed descriptor databank (DDB) platform, we think the active multi-task learning pipeline will produce an evolving, powerful scoring function. 8.2 Future work In addition to the excellent predictions we have achieved in ligand scoring, ranking, docking, and screening, we are working on the following ideas that we believe will make the predictions even more accurate. 8.2.1 Enriching DDB with more molecular descriptors Protein-ligand complexes have been characterized by 16 feature types in the experiments we presented in this work. Currently, these are the only descriptor types hosted in Descriptor Data Bank (DDB). After publishing the platform in the literature, it will gain exposure and we believe the wider research community will quickly enrich it with much larger number of descriptor sets and 172 feature extraction tools. The collected descriptors will then be used in conjunction with ensemble deep neural networks to build task-specific scoring functions. DDB has automated task-specific filters in-place to select the most relevant descriptors for the given task. 8.2.2 Pipeline for self-learning scoring functions The proposed descriptor data bank and the task-specific scoring functions based on ensembles of wide and deep neural networks are two important components for building evolving scoring models. 
8.2 Future work

In addition to the excellent predictions we have achieved in ligand scoring, ranking, docking, and screening, we are working on the following ideas that we believe will make the predictions even more accurate.

8.2.1 Enriching DDB with more molecular descriptors

Protein-ligand complexes have been characterized by 16 feature types in the experiments we presented in this work. Currently, these are the only descriptor types hosted in Descriptor Data Bank (DDB). Once the platform is published in the literature, it will gain exposure, and we believe the wider research community will quickly enrich it with a much larger number of descriptor sets and feature-extraction tools. The collected descriptors will then be used in conjunction with ensemble deep neural networks to build task-specific scoring functions. DDB has automated task-specific filters in place to select the most relevant descriptors for the given task.

8.2.2 Pipeline for self-learning scoring functions

The proposed Descriptor Data Bank and the task-specific scoring functions based on ensembles of wide and deep neural networks are two important components for building evolving scoring models. More specifically, we are planning to develop a pipeline with the following automated stages to create self-learning scoring functions: (i) retrieval of raw data on proteins and genes, ligands, protein-ligand complexes, and screening assays from public databases such as PDB [9], ZINC [68], PDB-Ligand [120], PDBbind [10], BindingDB [74], DUD [121], CSAR [82], ChEMBL [70], and PubChem [71]; (ii) data preprocessing, integration, and descriptor extraction using the Descriptor Data Bank platform; and (iii) training and updating of the boosted and multi-task deep neural networks introduced in this work. We believe such a pipeline will result in a scoring function that constantly improves as newly released molecules and experimental data become available.
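As a rough outline of the automation we have in mind, the sketch below expresses the three stages as Python functions wired into a single update cycle. All stage bodies are placeholders with hypothetical names (fetch_new_entries, extract_ddb_descriptors, update_models); none of them refer to an existing, released API.

```python
# Hypothetical skeleton of the self-learning pipeline described above. The
# three stage functions are placeholders (illustrative names only); they do
# not correspond to any released DDB or database API.
DATA_SOURCES = ["PDB", "ZINC", "PDB-Ligand", "PDBbind", "BindingDB",
                "DUD", "CSAR", "ChEMBL", "PubChem"]

def fetch_new_entries(source, since=None):
    """Stage (i): retrieve complexes and assays deposited after `since`."""
    raise NotImplementedError("database-specific retrieval goes here")

def extract_ddb_descriptors(entries):
    """Stage (ii): preprocess, integrate, and run DDB descriptor extraction."""
    raise NotImplementedError("Descriptor Data Bank extraction goes here")

def update_models(descriptors, models):
    """Stage (iii): incrementally retrain the boosted and multi-task networks."""
    raise NotImplementedError("model updating goes here")

def run_update_cycle(models, last_update=None):
    """One pass of the self-learning loop: gather, featurize, retrain."""
    new_entries = [entry for source in DATA_SOURCES
                   for entry in fetch_new_entries(source, since=last_update)]
    if not new_entries:
        return models
    descriptors = extract_ddb_descriptors(new_entries)
    return update_models(descriptors, models)
```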
APPENDIX

FEATURES USED TO CHARACTERIZE PROTEIN-LIGAND COMPLEXES

Below we list the descriptors we used to characterize protein-ligand complexes.

• X-Score (X) features [56]
(The software packages used to calculate the X-Score and AffiScore features can be obtained from their respective authors via the links http://sw16.im.med.umich.edu/software/xtool/ and http://www.bch.msu.edu/~kuhn/software/slide/. The packages were modified so that the features are written to a file, since they are not written out by default.)
1. van der Waals interactions between the ligand and the protein.
2. Hydrogen bonding between the ligand and the protein.
3. Hydrophobic effect: accounted for by counting pairwise hydrophobic contacts between the protein and the ligand.
4. Hydrophobic effect: accounted for by examining the fit of hydrophobic ligand atoms inside the hydrophobic binding site.
5. Hydrophobic effect: accounted for by calculating the hydrophobic surface area of the ligand buried upon binding.
6. The deformation effect, expressed as the contribution of the number of rotors for the ligand.

• AffiScore (A) features [75, 57, 122]
1. The number of non-hydrogen atoms in the ligand.
2. The total sum of hydrophobicity values for all hydrophobic protein-ligand atom pairs.
3. The total difference in hydrophobicity values between all hydrophobic protein-ligand atom pairs.
4. The total difference in hydrophobicity values between all protein-ligand hydrophobic/hydrophilic mismatches.
5. The total sum of hydrophobicity values for all hydrophilic protein-ligand atom pairs.
6. The total difference in hydrophobicity values between all hydrophilic protein-ligand atom pairs.
7. The total number of protein-ligand salt bridges.
8. The total number of regular hydrogen bonds.
9. The total number of uncharged polar atoms at the protein-ligand interface which do not have a bonding partner.
10. The total number of charged atoms at the protein-ligand interface which do not have a bonding partner.
11. The fraction of ligand heavy atoms (non-hydrogen) which are at the protein-ligand interface (defined as atoms within 4.5 Å of opposite protein or ligand atoms).
12. The number of ligand atoms at the interface.
13. The number of hydrophilic ligand atoms which are not at the interface.
14. The number of rotatable single bonds in the ligand molecule.
15. The number of protein and ligand rotatable single bonds at the interface.
16. The sum of the average hydrophobicity values for all ligand atoms. The average hydrophobicity value for the ligand atoms was determined by calculating the average hydrophobicity value of all neighboring hydrophobic protein atoms.
17. The sum of hydrophobicity values for all hydrophobic protein-ligand atom pairs (as computed in the old AffiScore version).
18. The number of rotatable single bonds of the ligand at the interface.
19. The sum of the degree of similarity between protein-ligand hydrophobic atom contacts. A higher value indicates the protein and ligand atoms had more similar raw hydrophobicity values.
20. A normalized version of the degree of similarity between the ligand atoms and their hydrophobic protein neighbors. This is the version currently used in AffiScore.
21. Pairwise count of hydrophobic-hydrophobic protein-ligand contacts.
22. Pairwise count of hydrophilic-hydrophilic protein-ligand contacts.
23. Pairwise count of hydrophobic-hydrophilic protein-ligand contacts.
24. Total of protein hydrophobicity values for protein atoms involved in hydrophobic-hydrophobic contacts.
25. For hydrophobic/hydrophilic mismatch pairs, sum of the hydrophobic atom hydrophobicity values.
26. Total hydrophilicity of all protein interfacial atoms.
27. A distance-normalized version of feature number 24.
28. Total of protein hydrophobicity values for protein atoms at the interface.
29. A distance-normalized version of feature number 28.
30. Total of ligand hydrophilicity values for interfacial ligand atoms.

• Gold (G) features [52]
1. The number of ligand exposed hydrophobic atoms.
2. The number of ligand acceptor atoms.
3. The number of ligand donor atoms.
4. The number of ligand atoms that clash with protein atoms.
5. The number of hydrogen bonds.
6. The molecular weight of the ligand.
7. The number of occluded ligand acceptor atoms.
8. The number of occluded ligand donor atoms.
9. The number of occluded ligand polar atoms.
10. The number of occluded protein acceptor atoms.
11. The number of occluded protein donor atoms.
12. The number of occluded protein polar atoms.
13. The number of ligand rotatable bonds.
14. The buried surface area of the ligand upon docking.

• RF-Score or geometrical (R) features [27]
(These features can be readily calculated and output by the open-source software package RF-Score, which can be found in the supplementary data of [27] or downloaded directly from http://bioinformatics.oxfordjournals.org/content/suppl/2010/03/18/btq112.DC1/bioinf-2010-0060-File007.zip.)
1. The number of protein-ligand carbon-carbon (C-C) contacts.
2. The number of protein-ligand nitrogen-carbon (N-C) contacts.
3-36. Similarly, the number of protein-ligand contacts for the pairs: O-C, S-C, C-N, N-N, O-N, S-N, C-O, N-O, O-O, S-O, C-F, N-F, O-F, S-F, C-P, N-P, O-P, S-P, C-S, N-S, O-S, S-S, C-Cl, N-Cl, O-Cl, S-Cl, C-Br, N-Br, O-Br, S-Br, C-I, N-I, O-I, and S-I.
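For concreteness, the sketch below shows one straightforward way to compute element-pair contact counts of the kind listed in this last descriptor group. The data layout and function name are illustrative assumptions; the 12 Å neighborhood follows the distance cutoff described in the original RF-Score publication [27].

```python
# Illustrative computation of RF-Score-style element-pair contact counts.
# Inputs are assumed to be lists of (element_symbol, x, y, z) tuples for the
# protein and the ligand; the function name and data layout are hypothetical.
import itertools
import math

PROTEIN_ELEMENTS = ["C", "N", "O", "S"]                       # 4 protein atom types
LIGAND_ELEMENTS  = ["C", "N", "O", "S", "F", "P", "Cl", "Br", "I"]  # 9 ligand atom types

def contact_counts(protein_atoms, ligand_atoms, cutoff=12.0):
    """Count protein-ligand heavy-atom pairs of each element combination
    whose interatomic distance is below `cutoff` (in Angstrom)."""
    counts = {(p, l): 0 for p in PROTEIN_ELEMENTS for l in LIGAND_ELEMENTS}
    for (pe, px, py, pz), (le, lx, ly, lz) in itertools.product(protein_atoms,
                                                                ligand_atoms):
        if (pe, le) not in counts:
            continue            # skip element types outside the 4 x 9 table
        d = math.sqrt((px - lx) ** 2 + (py - ly) ** 2 + (pz - lz) ** 2)
        if d < cutoff:
            counts[(pe, le)] += 1
    return counts               # 36 descriptors, e.g. counts[("C", "C")]
```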
BIBLIOGRAPHY

[1] L. Kavraki, "Rigid receptor-flexible ligand docking: Overview and examples," Connexions, 2006.
[2] X. Fradera and J. Mestres, "Guided docking approaches to structure-based design and screening," Current Topics in Medicinal Chemistry, vol. 4, pp. 687–700, 2004.
[3] R. A. Friesner, J. L. Banks, R. B. Murphy, T. A. Halgren, J. J. Klicic, D. T. Mainz, M. P. Repasky, E. H. Knoll, M. Shelley, J. K. Perry, D. E. Shaw, P. Francis, and P. S. Shenkin, "Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy," Journal of Medicinal Chemistry, vol. 47, no. 7, pp. 1739–1749, 2004.
[4] G. Warren, C. Andrews, A.-M. Capelli, B. Clarke, J. LaLonde, M. Lambert, M. Lindavall, N. Nevins, S. Semus, S. Senger, G. Tedesco, I. Wall, J. Woolven, C. Peishoff, and M. Head, "A critical assessment of docking programs and scoring functions," Journal of Medicinal Chemistry, 2005.
[5] M. Cases and J. Mestres, "A chemogenomic approach to drug discovery: focus on cardiovascular diseases," Drug Discovery Today, vol. 14, no. 9-10, pp. 479–485, 2009.
[6] X. Xu, M. Kasembeli, X. Jiang, B. Tweardy, and D. Tweardy, "Chemical probes that competitively and selectively inhibit Stat3 activation," PLoS One, vol. 4, no. 3, 2009.
[7] K. Simons, R. Bonneau, I. Ruczinski, and D. Baker, "Ab initio protein structure prediction of CASP III targets using ROSETTA," Proteins: Structure, Function, and Genetics, vol. 37, no. S3, pp. 171–176, 1999.
[8] A. Favia, I. Nobeli, F. Glaser, and J. Thornton, "Molecular docking for substrate identification: The short-chain dehydrogenases/reductases," Journal of Molecular Biology, vol. 375, no. 3, pp. 855–874, 2008.
[9] H. M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T. N. Bhat, H. Weissig, I. N. Shindyalov, and P. E. Bourne, "The Protein Data Bank," Nucleic Acids Research, vol. 28, no. 1, pp. 235–242, 2000.
[10] R. Wang, X. Fang, Y. Lu, and S. Wang, "The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures," Journal of Medicinal Chemistry, vol. 47, no. 12, pp. 2977–2980, 2004, PMID: 15163179.
[11] F. Allen and O. Kennard, "Cambridge Structural Database (CSD)," Chemical Design Automation News, vol. 8, pp. 1–31, 1993.
[12] P. D. Lyne, "Structure-based virtual screening: An overview," Drug Discovery Today, vol. 7, no. 20, pp. 1047–1055, 2002.
[13] N. Singh, G. Chevé, D. Ferguson, and C. McCurdy, "A combined ligand-based and target-based drug design approach for G-protein coupled receptors: Application to salvinorin A, a selective kappa opioid receptor agonist," Journal of Computer-Aided Molecular Design, vol. 20, no. 7, pp. 471–493, 2006.
[14] T. Marrone, J. Briggs, and J. McCammon, "Structure-based drug design: Computational advances," Annual Review of Pharmacology and Toxicology, vol. 37, no. 1, pp. 71–90, 1997.
[15] National Institutes of Health, "Structure-based drug design: from the computer to the clinic," available at: http://publications.nigms.nih.gov/structlife/chapter4.html.
[16] M. Mathieu, PAREXEL's Pharmaceutical R&D Statistical Sourcebook 2001. PAREXEL International Corp., 2001.
[17] T. Cheng, X. Li, Y. Li, Z. Liu, and R. Wang, "Comparative assessment of scoring functions on a diverse test set," Journal of Chemical Information and Modeling, vol. 49, no. 4, pp. 1079–1093, 2009.
[18] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. Springer, 2009, vol. 2.
[19] W. Mooij and M. Verdonk, "General and targeted statistical potentials for protein-ligand interactions," Proteins, vol. 61, no. 2, p. 272, 2005.
[20] M. L. Verdonk, J. C. Cole, M. J. Hartshorn, C. W. Murray, and R. D. Taylor, "Improved protein–ligand docking using GOLD," Proteins: Structure, Function, and Bioinformatics, vol. 52, no. 4, pp. 609–623, 2003.
[21] Accelrys Software Inc., The Discovery Studio Software, San Diego, CA, 2001, version 2.0.
[22] Y. Li, Z. Liu, J. Li, L. Han, J. Liu, Z.-X. Zhao, and R. Wang, "Comparative assessment of scoring functions on an updated benchmark: 1. Compilation of the test set," Journal of Chemical Information and Modeling, 2014.
[23] R. Wang, X. Fang, Y. Lu, C.-Y. Yang, and S. Wang, "The PDBbind database: Methodologies and updates," Journal of Medicinal Chemistry, vol. 48, no. 12, pp. 4111–4119, 2005, PMID: 15943484.
[24] H. M. Ashtawy and N. R. Mahapatra, "BgN-Score and BsN-Score: Bagging and boosting based ensemble neural networks scoring functions for accurate binding affinity prediction of protein-ligand complexes," BMC Bioinformatics, vol. 16, no. Suppl 4, p. S8, 2015.
[25] Y. Li, L. Han, Z. Liu, and R. Wang, "Comparative assessment of scoring functions on an updated benchmark: 2. Evaluation methods and general results," Journal of Chemical Information and Modeling, 2014.
[26] H. M. Ashtawy and N. R. Mahapatra, "Machine-learning scoring functions for identifying native poses of ligands docked to known and novel proteins," BMC Bioinformatics, vol. 16, no. Suppl 6, p. S3, 2015.
[27] P. Ballester and J. Mitchell, "A machine learning approach to predicting protein-ligand binding affinity with applications to molecular docking," Bioinformatics, vol. 26, no. 9, p. 1169, 2010.
[28] H. M. Ashtawy and N. R. Mahapatra, "A comparative assessment of predictive accuracies of conventional and machine learning scoring functions for protein-ligand binding affinity prediction," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. PP, no. 99, pp. 1–1, 2014.
[29] P. Ballester and J. Mitchell, "Comments on 'Leave-cluster-out cross-validation is appropriate for scoring functions derived from diverse protein data sets': Significance for the validation of scoring functions," Journal of Chemical Information and Modeling, vol. 51, no. 8, pp. 1739–1741, 2011.
[30] D. M. Hawkins, "The problem of overfitting," Journal of Chemical Information and Computer Sciences, vol. 44, no. 1, pp. 1–12, 2004.
[31] R Development Core Team, R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, 2010, ISBN 3-900051-07-0.
[32] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, no. Oct, pp. 2825–2830, 2011.
[33] J. H. Friedman, "Multivariate adaptive regression splines," The Annals of Statistics, vol. 19, no. 1, pp. 1–67, 1991.
[34] S. Milborrow, T. Hastie, and R. Tibshirani, earth: Multivariate Adaptive Regression Spline Models, 2010, R package version 2.4-5.
[35] J. Rudy, "Py-earth: GitHub repository," 2016.
[36] K. Hechenbichler and K. Schliep, "Weighted k-nearest-neighbor techniques and ordinal classification," Citeseer, Tech. Rep., 2004.
[37] R. J. Samworth, "Optimal weighted nearest neighbour classifiers," The Annals of Statistics, vol. 40, no. 5, pp. 2733–2763, 2012.
[38] K. Schliep and K. Hechenbichler, kknn: Weighted k-Nearest Neighbors, 2010, R package version 1.0-8.
[39] V. Vapnik, Statistical Learning Theory. Wiley, New York, 1998.
[40] A. J. Smola and B. Schölkopf, "A tutorial on support vector regression," Statistics and Computing, vol. 14, no. 3, pp. 199–222, 2004.
[41] E. Dimitriadou, K. Hornik, F. Leisch, D. Meyer, and A. Weingessel, e1071: Miscellaneous Functions of the Department of Statistics (e1071), TU Wien, 2010, R package version 1.5-24.
[42] B. Ripley, "nnet: Feed-forward neural networks and multinomial log-linear models," R package version, vol. 7, no. 5, 2011.
[43] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., "TensorFlow: Large-scale machine learning on heterogeneous systems, 2015," Software available from tensorflow.org, vol. 1, 2015.
[44] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2001.
[45] L. Breiman, "Random forests," Machine Learning, vol. 45, pp. 5–32, 2001.
[46] J. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, pp. 1189–1232, 2001.
[47] Y. Freund and R. Schapire, "A decision-theoretic generalization of on-line learning and an application to boosting," in Computational Learning Theory. Springer, 1995, pp. 23–37.
[48] G. Ridgeway, gbm: Generalized Boosted Regression Models, 2010, R package version 1.63.1.
[49] T. Chen and C. Guestrin, "XGBoost: A scalable tree boosting system," CoRR, vol. abs/1603.02754, 2016.
[50] J. D. Durrant and J. A. McCammon, "NNScore: A neural-network-based scoring function for the characterization of protein-ligand complexes," Journal of Chemical Information and Modeling, vol. 50, no. 10, pp. 1865–1871, 2010.
[51] Tripos Inc., The SYBYL Software, 1699 South Hanley Rd., St. Louis, Missouri, 63144, USA, 2006, version 7.2.
[52] G. Jones, P. Willett, R. Glen, A. Leach, and R. Taylor, "Development and validation of a genetic algorithm for flexible docking," Journal of Molecular Biology, vol. 267, no. 3, pp. 727–748, 1997.
[53] R. A. Friesner, J. L. Banks, R. B. Murphy, T. A. Halgren, J. J. Klicic, D. T. Mainz, M. P. Repasky, E. H. Knoll, M. Shelley, J. K. Perry, D. E. Shaw, P. Francis, and P. S. Shenkin, "Glide: A new approach for rapid, accurate docking and scoring. 1. Method and assessment of docking accuracy," Journal of Medicinal Chemistry, vol. 47, no. 7, pp. 1739–1749, 2004, PMID: 15027865.
[54] Schrödinger, LLC, The Schrödinger Software, New York, 2005, version 8.0.
[55] H. F. G. Velec, H. Gohlke, and G. Klebe, "DrugScore CSD - Knowledge-based scoring function derived from small molecule crystal data with superior recognition rate of near-native ligand poses and better affinity prediction," Journal of Medicinal Chemistry, vol. 48, no. 20, pp. 6296–6303, 2005.
[56] R. Wang, L. Lai, and S. Wang, "Further development and validation of empirical scoring functions for structure-based binding affinity prediction," Journal of Computer-Aided Molecular Design, vol. 16, pp. 11–26, 2002, 10.1023/A:1016357811882.
[57] V. Schnecke and L. A. Kuhn, "Virtual screening with solvation and ligand-induced complementarity," in Virtual Screening: An Alternative or Complement to High Throughput Screening?, G. Klebe, Ed. Springer Netherlands, 2002, pp. 171–190.
[58] A. N. Jain, "Scoring noncovalent protein-ligand interactions: A continuous differentiable function tuned to compute binding affinities," Journal of Computer-Aided Molecular Design, vol. 10, pp. 427–440, 1996.
[59] A. Krammer, P. D. Kirchhoff, X. Jiang, C. Venkatachalam, and M. Waldman, "LigScore: A novel scoring function for predicting binding affinities," Journal of Molecular Graphics and Modelling, vol. 23, no. 5, pp. 395–407, 2005.
[60] H. Bohm, "The development of a simple empirical scoring function to estimate the binding constant for a protein-ligand complex of known three-dimensional structure," Journal of Computer-Aided Molecular Design, vol. 8, no. 3, pp. 243–256, 1994.
[61] D. K. Gehlhaar, G. M. Verkhivker, P. A. Rejto, C. J. Sherman, D. R. Fogel, L. J. Fogel, and S. T. Freer, "Molecular recognition of the inhibitor AG-1343 by HIV-1 protease: Conformationally flexible docking by evolutionary programming," Chemistry & Biology, vol. 2, no. 5, pp. 317–324, 1995.
[62] I. Muegge, "Effect of ligand volume correction on PMF scoring," Journal of Computational Chemistry, vol. 22, no. 4, pp. 418–425, 2001.
[63] M. D. Eldridge, C. W. Murray, T. R. Auton, G. V. Paolini, and R. P. Mee, "Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes," Journal of Computer-Aided Molecular Design, vol. 11, pp. 425–445, 1997, 10.1023/A:1007996124545.
[64] T. Ewing, S. Makino, A. Skillman, and I. Kuntz, "DOCK 4.0: Search strategies for automated molecular docking of flexible molecule databases," Journal of Computer-Aided Molecular Design, vol. 15, no. 5, pp. 411–428, 2001.
[65] H. M. Ashtawy and N. R. Mahapatra, "A comparative assessment of conventional and machine-learning-based scoring functions in predicting binding affinities of protein-ligand complexes," in 2011 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2011, pp. 627–630.
[66] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[67] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, B. Kingsbury, and T. Sainath, "Deep neural networks for acoustic modeling in speech recognition," IEEE Signal Processing Magazine, vol. 29, pp. 82–97, November 2012.
[68] T. Sterling and J. J. Irwin, "ZINC 15 - ligand discovery for everyone," 2015.
[69] C. Knox, V. Law, T. Jewison, P. Liu, S. Ly, A. Frolkis, A. Pon, K. Banco, C. Mak, V. Neveu et al., "DrugBank 3.0: a comprehensive resource for 'omics' research on drugs," Nucleic Acids Research, vol. 39, no. suppl 1, pp. D1035–D1041, 2011.
[70] A. Gaulton, L. Bellis, A. Bento, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, D. Michalovich, B. Al-Lazikani et al., "ChEMBL: a large-scale bioactivity database for drug discovery," Nucleic Acids Research, vol. 40, no. D1, pp. D1100–D1107, 2012.
[71] Y. Wang, J. Xiao, T. Suzek, J. Zhang, J. Wang, Z. Zhou, L. Han, K. Karapetyan, S. Dracheva, B. Shoemaker et al., "PubChem's BioAssay database," Nucleic Acids Research, vol. 40, no. D1, pp. D400–D412, 2012.
[72] K. L. Damm-Ganamet, R. D. Smith, J. B. Dunbar Jr, J. A. Stuckey, and H. A. Carlson, "CSAR benchmark exercise 2011–2012: evaluation of results from docking and relative ranking of blinded congeneric series," Journal of Chemical Information and Modeling, vol. 53, no. 8, pp. 1853–1870, 2013.
[73] M. L. Benson, R. D. Smith, N. A. Khazanov, B. Dimcheff, J. Beaver, P. Dresslar, J. Nerothin, and H. A. Carlson, "Binding MOAD, a high-quality protein-ligand database," Nucleic Acids Research, vol. 36, no. suppl 1, pp. D674–D678, 2008.
[74] T. Liu, Y. Lin, X. Wen, R. Jorissen, and M. Gilson, "BindingDB: a web-accessible database of experimentally determined protein–ligand binding affinities," Nucleic Acids Research, vol. 35, no. suppl 1, pp. D198–D201, 2007.
[75] M. I. Zavodszky, P. C. Sanschagrin, L. A. Kuhn, and R. S. Korde, "Distilling the essential features of a protein surface for improving protein-ligand docking, scoring, and virtual screening," Journal of Computer-Aided Molecular Design, vol. 16, pp. 883–902, 2002.
[76] O. Korb, T. Stutzle, and T. Exner, "Empirical scoring functions for advanced protein-ligand docking with PLANTS," Journal of Chemical Information and Modeling, vol. 49, no. 1, pp. 84–96, 2009.
[77] M. A. Johnson and G. M. Maggiora, "Concepts and applications of molecular similarity," 1990.
[78] A. R. Leach and V. J. Gillet, An Introduction to Chemoinformatics. Springer Science & Business Media, 2007.
[79] T. Madden, "The BLAST sequence analysis tool," The NCBI Handbook. National Library of Medicine (US), National Center for Biotechnology Information, Bethesda, MD, 2002.
[80] Y. Zhang and J. Skolnick, "TM-align: a protein structure alignment algorithm based on the TM-score," Nucleic Acids Research, vol. 33, no. 7, pp. 2302–2309, 2005.
[81] H. Berman, K. Henrick, and H. Nakamura, "Announcing the worldwide Protein Data Bank," Nature Structural & Molecular Biology, vol. 10, no. 12, pp. 980–980, 2003.
[82] J. Dunbar Jr, R. Smith, C. Yang, P. Ung, K. Lexa, N. Khazanov, J. Stuckey, S. Wang, and H. Carlson, "CSAR benchmark exercise of 2010: Selection of the protein–ligand complexes," Journal of Chemical Information and Modeling, vol. 51, no. 9, pp. 2036–2046, 2011.
[83] O. Trott and A. J. Olson, "AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading," Journal of Computational Chemistry, vol. 31, no. 2, pp. 455–461, 2010.
[84] Y. Cao and L. Li, "Improved protein–ligand binding affinity prediction by using a curvature-dependent surface-area model," Bioinformatics, p. btu104, 2014.
[85] C. W. Yap, "PaDEL-descriptor: An open source software to calculate molecular descriptors and fingerprints," Journal of Computational Chemistry, vol. 32, no. 7, pp. 1466–1474, 2011.
[86] D. Rogers and M. Hahn, "Extended-connectivity fingerprints," Journal of Chemical Information and Modeling, vol. 50, no. 5, pp. 742–754, 2010.
[87] M. McGann, "FRED pose prediction and virtual screening accuracy," Journal of Chemical Information and Modeling, vol. 51, no. 3, pp. 578–596, 2011.
[88] P. Schmidtke, V. Le Guilloux, J. Maupetit, and P. Tufféry, "Fpocket: online tools for protein ensemble pocket detection and tracking," Nucleic Acids Research, vol. 38, no. suppl 2, pp. W582–W589, 2010.
[89] G. Neudert and G. Klebe, "DSX: a knowledge-based scoring function for the assessment of protein–ligand complexes," Journal of Chemical Information and Modeling, vol. 51, no. 10, pp. 2731–2745, 2011.
[90] H. M. Ashtawy and N. R. Mahapatra, "Descriptor Data Bank (DDB): A platform for multi-perspective modeling of protein-ligand interactions," To be published, check www.descriptordb.com, 2016.
[91] D. R. Koes, M. P. Baumgartner, and C. J. Camacho, "Lessons learned in empirical scoring with smina from the CSAR 2011 benchmarking exercise," Journal of Chemical Information and Modeling, vol. 53, no. 8, pp. 1893–1904, 2013.
[92] J. H. Friedman, "Stochastic gradient boosting," Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
[93] O. Soufan, D. Kleftogiannis, P. Kalnis, and V. B. Bajic, "DWFS: a wrapper feature selection tool based on a parallel genetic algorithm," PLoS One, vol. 10, no. 2, p. e0117988, 2015.
[94] B. Sahu and D. Mishra, "A novel feature selection algorithm using particle swarm optimization for cancer microarray data," Procedia Engineering, vol. 38, pp. 27–31, 2012.
[95] Y. Cheng and W. Prusoff, "Relationship between the inhibition constant (Ki) and the concentration of inhibitor which causes 50% inhibition (IC50) of an enzymatic reaction," Biochem. Pharmacol., vol. 22, no. 23, pp. 3099–3108, 1973.
[96] R. A. Copeland, D. Lombardo, J. Giannaras, and C. P. Decicco, "Estimating Ki values for tight binding inhibitors from dose-response plots," Bioorganic & Medicinal Chemistry Letters, vol. 5, no. 17, pp. 1947–1952, 1995.
[97] H. Gohlke, M. Hendlich, and G. Klebe, "Knowledge-based scoring function to predict protein-ligand interactions," Journal of Molecular Biology, vol. 295, no. 2, pp. 337–356, 2000.
[98] R. Wang, Y. Lu, X. Fang, and S. Wang, "An extensive test of 14 scoring functions using the PDBbind refined set of 800 protein-ligand complexes," Journal of Chemical Information and Computer Sciences, vol. 44, no. 6, pp. 2114–2125, 2004, PMID: 15554682.
[99] G. Schneider and P. Wrede, "Artificial neural networks for computer-based molecular design," Progress in Biophysics and Molecular Biology, vol. 70, no. 3, pp. 175–222, 1998.
[100] L. Douali, D. Villemin, A. Zyad, and D. Cherqaoui, "Artificial neural networks: Non-linear QSAR studies of HEPT derivatives as HIV-1 reverse transcriptase inhibitors," Molecular Diversity, vol. 8, no. 1, pp. 1–8, 2004.
[101] D. Winkler, "Neural networks as robust tools in drug lead discovery and development," Molecular Biotechnology, vol. 27, pp. 139–167, 2004, 10.1385/MB:27:2:139.
[102] R. D. Head, M. L. Smythe, T. I. Oprea, C. L. Waller, S. M. Green, and G. R. Marshall, "VALIDATE: A new method for the receptor-based prediction of binding affinities of novel ligands," Journal of the American Chemical Society, vol. 118, no. 16, pp. 3959–3969, 1996.
[103] S. So and M. Karplus, "A comparative study of ligand-receptor complex binding affinity prediction methods based on glycogen phosphorylase inhibitors," Journal of Computer-Aided Molecular Design, vol. 13, no. 3, pp. 243–258, 1999.
[104] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, no. 4, pp. 303–314, 1989.
[105] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[106] M. Stinchcombe and H. White, "Approximating and learning unknown mappings using multilayer feedforward networks with bounded weights," in 1990 IJCNN International Joint Conference on Neural Networks. IEEE, 1990, pp. 7–16.
[107] D. Steinberg and P. Colla, "CART: classification and regression trees," The Top Ten Algorithms in Data Mining, vol. 9, p. 179, 2009.
[108] V. Schnecke and L. A. Kuhn, "Virtual screening with solvation and ligand-induced complementarity," Perspectives in Drug Discovery and Design, vol. 20, no. 1, pp. 171–190, 2000.
[109] D.-S. Cao, Q.-S. Xu, Y.-Z. Liang, L.-X. Zhang, and H.-D. Li, "The boosting: A new idea of building models," Chemometrics and Intelligent Laboratory Systems, vol. 100, no. 1, pp. 1–11, 2010.
[110] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[111] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in AISTATS, vol. 9, 2010, pp. 249–256.
[112] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[113] J. Overington, B. Al-Lazikani, and A. Hopkins, "How many drug targets are there?" Nature Reviews Drug Discovery, vol. 5, no. 12, pp. 993–996, 2006.
[114] BioSolveIT, LeadIT, St. Augustin, Germany, 2012, version 2.1.
[115] B. Ramsundar, S. Kearnes, P. Riley, D. Webster, D. Konerding, and V. Pande, "Massively multitask networks for drug discovery," arXiv preprint arXiv:1502.02072, 2015.
[116] G. E. Dahl, N. Jaitly, and R. Salakhutdinov, "Multi-task neural networks for QSAR predictions," arXiv preprint arXiv:1406.1231, 2014.
[117] T. Unterthiner, A. Mayr, G. Klambauer, M. Steijaert, J. K. Wegner, H. Ceulemans, and S. Hochreiter, "Deep learning as an opportunity in virtual screening," Advances in Neural Information Processing Systems, vol. 27, 2014.
[118] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: a simple way to prevent neural networks from overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[119] F. Chollet, "Keras," 2015.
[120] J. Shin and D. Cho, "PDB-Ligand: a ligand database based on PDB for the automated and customized classification of ligand-binding structures," Nucleic Acids Research, vol. 33, no. suppl 1, pp. D238–D241, 2005.
[121] J. Irwin, "Community benchmarks for virtual screening," Journal of Computer-Aided Molecular Design, vol. 22, pp. 193–199, 2008, 10.1007/s10822-008-9189-4.
[122] M. I. Zavodszky and L. A. Kuhn, "Side-chain flexibility in protein-ligand binding: The minimal rotation hypothesis," Protein Science, vol. 14, no. 4, pp. 1104–1114, 2005.