INTEGRATION OF TOPOLOGICAL DATA ANALYSIS AND MACHINE LEARNING FOR SMALL MOLECULE PROPERTY PREDICTIONS

By

Kedi Wu

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Mathematics – Doctor of Philosophy

2018

ABSTRACT

INTEGRATION OF TOPOLOGICAL DATA ANALYSIS AND MACHINE LEARNING FOR SMALL MOLECULE PROPERTY PREDICTIONS

By

Kedi Wu

Accurate prediction of small molecule properties is of paramount importance to drug design and discovery. A variety of quantitative properties of small molecules have been studied in this thesis, including solvation free energy, partition coefficient, aqueous solubility, and toxicity endpoints. The highlight of this thesis is the introduction of an algebraic topology based method, called element specific persistent homology (ESPH), to predict small molecule properties. Essentially, ESPH describes molecular properties in terms of multiscale and multicomponent topological invariants, and differs from conventional chemical and physical representations. Based on ESPH and its modified version, element-specific topological descriptors (ESTDs) are constructed. The advantage of ESTDs is that they are systematic, comprehensive, and scalable with respect to variations in molecular size and composition, and are readily suitable for machine learning methods, rendering topological learning algorithms. Due to the inherent correlation between different small molecule properties, multi-task frameworks are further employed to predict related properties simultaneously. Deep neural networks, along with ensemble methods such as random forest and gradient boosting trees, are used to develop quantitative predictive models. Physical model based molecular descriptors and auxiliary descriptors are also used in addition to ESTDs. As a result, we obtain state-of-the-art results for various benchmark data sets of small molecule properties. We have also developed two online servers for predicting properties of small molecules, TopP-S and TopTox. TopP-S is a software for topological learning predictions of partition coefficient and aqueous solubility, and TopTox is a software for computing element-specific topological descriptors (ESTDs) for toxicity endpoint predictions. They are available at http://weilab.math.msu.edu/TopP-S/ and http://weilab.math.msu.edu/TopTox/, respectively.

Copyright by KEDI WU 2018

This thesis is dedicated to my parents, for their love.

ACKNOWLEDGEMENTS

First and foremost I would like to sincerely express my gratitude to my advisor, Dr. Guo-Wei Wei. It has been an honor to be his Ph.D. student. I am truly thankful for his unselfish contributions of time, ideas, and funding to make my Ph.D. study productive and enjoyable. The joy and enthusiasm he has for his research were motivational for me, and this work would not have come into existence without his supervision. He has shown me, by his own example, how a great scientist and person should be. I also want to thank Dr. Chichia Chiu, Dr. Moxun Tang and Dr. Yiying Tong for serving on my thesis committee and providing me with extensive guidance and helpful comments. My time at Michigan State was made enjoyable in large part due to my group members, David Bramer, Zixuan Cang, Yin Cao, Duc Nguyen, Bao Wang, Menglun Wang, Kelin Xia and Zhixiong Zhao, who became a part of my life. My time at Michigan State was also enriched by fellow graduate students, Tianbo Chen, Wenzhao Chen, Anqi Chen and Weicong Zhou, just to name a few.
I am grateful for the time spent with friends playing poker all night long, for the memorable conference trips to Boston and Columbus, and for many other precious memories. Last but not least, I would like to thank my family for all their love and encouragement. For my parents, Tenghua Wu and Lianxiang Pan, who unconditionally support me in all my pursuits throughout my life. For my grandma, who raised me as a little kid. And most of all for Yafei Gu, whose faithful support during the final stages of this Ph.D. is so appreciated. Thank you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS

CHAPTER 1 INTRODUCTION TO CHEMINFORMATICS
    1.1 An overview of small molecule modeling
    1.2 An overview of topological modeling and topological learning
    1.3 An overview of QSAR and machine learning
    1.4 Motivations and objectives

CHAPTER 2 TOPOLOGICAL MODELING OF SMALL MOLECULES
    2.1 Persistent homology
    2.2 Persistent homology for characterizing molecules
    2.3 Element specific persistent homology (ESPH)
        2.3.1 Limitations of persistent homology
        2.3.2 An introduction to ESPH
    2.4 Modified ESPH for characterizing intra-molecular interactions

CHAPTER 3 MACHINE LEARNING METHODS AND ALGORITHMS
    3.1 An overview of machine learning algorithms
    3.2 Ensemble methods
        3.2.1 Decision tree induction
            3.2.1.1 Basics of decision tree induction
        3.2.2 Random forest
            3.2.2.1 Bootstrap aggregating
            3.2.2.2 Feature bagging
        3.2.3 Gradient boosting decision tree
            3.2.3.1 Gradient boosting algorithm
            3.2.3.2 Gradient tree boosting
    3.3 Multi-task learning and deep neural network
        3.3.1 Single-task deep neural network (ST-DNN)
            3.3.1.1 Multi-task deep neural network
            3.3.1.2 Network parameters and training
    3.4 Pipelines for predicting properties of small molecules using MT-DNN
    3.5 Ranking, query construction and prediction using nearest neighbors
        3.5.1 Ranking algorithms
        3.5.2 Query construction
        3.5.3 Prediction using nearest neighbors

CHAPTER 4 DATA SETS, PREPROCESSING AND DESCRIPTORS
    4.1 Data sets
        4.1.1 Solvation free energy
        4.1.2 Partition coefficient and aqueous solubility
        4.1.3 Quantitative toxicity endpoints
    4.2 Data preprocessing
    4.3 Molecular descriptors
        4.3.1 Element specific topological descriptors
            4.3.1.1 ESNs for partition coefficient and aqueous solubility prediction
            4.3.1.2 ESNs for toxicity endpoint prediction
            4.3.1.3 A general workflow for computing ESTDs from ESNs
            4.3.1.4 The essence of ESTDs
            4.3.1.5 ESTDs for partition coefficient and aqueous solubility prediction
            4.3.1.6 ESTDs for toxicity endpoint prediction
        4.3.2 Physical model based descriptors
        4.3.3 Auxiliary molecular descriptors for partition coefficient and aqueous solubility prediction

CHAPTER 5 RESULTS
    5.1 Evaluation criteria
        5.1.1 Commonly used evaluation metrics
        5.1.2 Additional evaluation metrics for partition coefficient and aqueous solubility prediction
        5.1.3 Additional evaluation metrics for toxicity endpoint prediction
    5.2 Evaluation results
        5.2.1 Solvation free energy prediction
            5.2.1.1 Microscopic feature parametrization
            5.2.1.2 Polar and non-polar descriptors for solvation free energy
            5.2.1.3 Leave-one-out result
            5.2.1.4 Blind prediction of SAMPLx challenge molecules
            5.2.1.5 Wang's [1] dataset
        5.2.2 Partition coefficient and aqueous solubility prediction
            5.2.2.1 log P training set cross-validation
            5.2.2.2 FDA set
            5.2.2.3 Star set and non-star set
            5.2.2.4 Wang's 1708 set in ref. [2]
            5.2.2.5 Dataset in ref. [3]
        5.2.3 Toxicity endpoint prediction
            5.2.3.1 Fathead minnow LC50 test set
            5.2.3.2 Daphnia magna LC50 test set
            5.2.3.3 Tetrahymena pyriformis IGC50 test set
            5.2.3.4 Oral rat LD50 test set

CHAPTER 6 DISCUSSION
    6.1 Solvation free energy
        6.1.1 Descriptor importance analysis
    6.2 Partition coefficient and aqueous solubility
        6.2.1 ESTDs for small molecules
        6.2.2 Multitask learning
        6.2.3 Predictive power for log P and log S
    6.3 Toxicity endpoints prediction
        6.3.1 The impact of descriptor selection and potential overfitting
        6.3.2 The predictive power of ESTDs for toxicity
        6.3.3 Alternative element specific networks for generating ESTDs
        6.3.4 A potential improvement with consensus tools

CHAPTER 7 THESIS CONTRIBUTION AND FUTURE WORK
    7.1 Solvation free energy
    7.2 Partition coefficient and aqueous solubility
    7.3 Toxicity endpoints
    7.4 Future work

APPENDICES
    APPENDIX A SUPPLEMENTARY MATERIALS FOR SOLVATION FREE ENERGY PREDICTION
    APPENDIX B SUPPLEMENTARY MATERIALS FOR TOXICITY ENDPOINT PREDICTION

BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Proposed hyperparameters for MT-DNN
Table 4.1: Molecules in the test sets with large discrepancies in their experimental solvation free energies. Here "ID" refers to the ID of Table 3 of Ref. [1]
Table 4.2: Duplicated molecules in Ref. [1]
Table 4.3: Statistics of solvation free energy data sets. The numbers within parentheses represent the actual numbers of molecules used in this study.
Table 4.4: Summary of log P and log S data sets used
Table 4.5: Statistics of quantitative toxicity data sets
Table 4.6: ESNs for partition coefficient and aqueous solubility prediction
Table 4.7: Statistics of element occurrences for the partition coefficient training set
Table 4.8: ESNs for toxicity endpoint prediction
Table 4.9: ESTDs for partition coefficient and aqueous solubility prediction
Table 5.1: The RMSE and ME of the leave-one-out test in the solvation free energy prediction of 668 molecules with descriptor set 1 (the first position), descriptor set 2 (the second position) and the HPK model (the last position) [4]. All errors are in units of kcal/mol.
Table 5.2: Results of 10-fold cross validation on the partition coefficient training set, N = 8199.
Table 5.3: Performances and fluctuations of fifty 10-fold cross-validation test runs.
Table 5.4: Results of different log P prediction methods on 406 FDA-approved drugs [5], ranked by R2. Two molecules were dropped for our model evaluation due to feature generation failure of ChemoPy.
Table 5.5: Benchmark test results [6] on both the star and non-star sets.
Table 5.6: Leave-one-out test on the 1708 solubility data set.
Table 5.7: 10-fold cross-validation on the 1708 solubility data set.
Table 5.8: Results of Klopman's test set [3], where MUE was not reported.
Table 5.9: Results of Zhu's test set.
Table 5.10: 10-fold cross-validation results on the small Delaney set.
Table 5.11: 10-fold cross-validation results on the Huuskonen set.
Table 5.12: Comparison of prediction results for the fathead minnow LC50 test set.
Table 5.13: Comparison of prediction results for the Daphnia magna LC50 test set.
Table 5.14: Comparison of prediction results for the Tetrahymena pyriformis IGC50 test set.
Table 5.15: Comparison of prediction results for the oral rat LD50 test set.
Table 6.1: Results of selected descriptor groups for the LC50 set
Table 6.2: Alternative element specific networks used to characterize molecules
Table A.1: Microscopic polar features with high correlations to the solvation free energy used in this study
Table A.2: Molecules with corresponding query number, and leave-one-out results with AM1-BCC charge and MBondi2 radius
Table A.3: Solvation energy prediction results for SAMPL0 molecules using selected polar features
Table A.4: Solvation energy prediction results for SAMPL1 molecules using selected polar features
Table A.5: Solvation energy prediction results for SAMPL2 molecules using selected polar features
Table A.6: Solvation energy prediction results for SAMPL3 molecules using selected polar features
Table A.7: Solvation energy prediction results for SAMPL4 molecules using selected polar features
Table A.8: Solvation energy prediction results for SAMPL0 molecules using all features
Table A.9: Solvation energy prediction results for SAMPL1 molecules using all features
Table A.10: Solvation energy prediction results for SAMPL2 molecules using all features
Table A.11: Solvation energy prediction results for SAMPL3 molecules using all features
Table A.12: Solvation energy prediction results for SAMPL4 molecules using all features
Table B.1: Performances of different descriptor groups with importance threshold 2.5e-4
Table B.2: Performances of different descriptor groups with importance threshold 5e-4
Table B.3: Performances of different descriptor groups with importance threshold 7.5e-4
Table B.4: Performances of different descriptor groups with importance threshold 1e-3
Table B.5: Performances of RF on different datasets using only the ESTDs proposed in the Discussion section
Table B.6: Performances of RF on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors
Table B.7: Performances of GBDT on different datasets using only the ESTDs proposed in the Discussion section
Table B.8: Performances of GBDT on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors
Table B.9: Performances of MT-DNN on different datasets using only the ESTDs proposed in the Discussion section
Table B.10: Performances of MT-DNN on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors
Table B.11: Performances of Consensus (MT-DNN and GBDT) on different datasets using only the ESTDs proposed in the Discussion section
Table B.12: Performances of Consensus (MT-DNN and GBDT) on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors

LIST OF FIGURES

Figure 2.1: Examples of simplices of different dimensions. (a), (b), (c) and (d) represent a 0-simplex, 1-simplex, 2-simplex, and 3-simplex, respectively.
Figure 2.2: Different representations of cyclohexane and their persistent homology barcode plots. Subfigures (a) and (b) show complete cyclohexane and cyclohexane with only carbon atoms selected, respectively. In subfigures (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively.
Figure 2.3: Benzene, pyridine and their persistent homology barcode plots. Subfigures (a) and (b) show benzene and pyridine with hydrogen atoms neglected, respectively. In subfigures (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively, for benzene and pyridine.
Figure 2.4: Indazole and its persistent homology barcodes. Subfigures (a) and (b) show indazole with carbon and carbon-nitrogen atoms selected, respectively. In subfigures (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively.
Figure 3.1: An illustration of the ST-DNN architecture.
Figure 3.2: An illustration of the MT-DNN architecture.
Figure 3.3: Graphical pipeline for simultaneous prediction of partition coefficient and aqueous solubility.
Figure 5.1: Illustration of leave-one-out predictions for the whole set of 668 molecules. Left chart: correlation between experimental solvation free energies and predictions obtained by BCC charges and Amber MBondi2 using all polar-nonpolar features. Right chart: comparison of prediction RMSEs obtained by models with polar features and all features against HPK models. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.
Figure 5.2: Illustration of prediction RMSEs obtained with different molecular parametrizations by the model for the SAMPL0 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.
Figure 5.3: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL1 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.
Figure 5.4: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL2 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.
Figure 5.5: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL3 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.
Figure 5.6: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL4 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.
Figure 6.1: Descriptor importance cutoff versus RMSE for all test sets with AM1-BCC charge and MBondi2 radius parametrization. The larger the cutoff value, the smaller the number of selected features.
KEY TO ABBREVIATIONS

PH Persistent homology
ML Machine learning
DL Deep learning
ESN Element-specific network
ESPH Element-specific persistent homology
ESTD Element-specific topological descriptor
STL Single-task learning
MTL Multi-task learning
NN Neural network
CNN Convolutional neural network
DNN Deep neural network
ST-DNN Single-task deep neural network
MT-DNN Multi-task deep neural network
log P Logarithm of partition coefficient
log S Logarithm of aqueous solubility
QSAR Quantitative structure-activity relationship
QSPR Quantitative structure-property relationship
ADME Absorption, distribution, metabolism, and elimination
GBDT Gradient boosting decision tree
RF Random forest
RMSE Root mean square error
MUE Mean unsigned error
GAFF General AMBER force field
MIBPB Matched interface and boundary - Poisson-Boltzmann
ESES Eulerian solvent excluded surface
SAMPL Statistical assessment of the modeling of proteins and ligands
TEST Toxicity estimation software tool
HPK Hybrid physical and knowledge
AMBER Assisted model building with energy refinement

CHAPTER 1

INTRODUCTION TO CHEMINFORMATICS

1.1 An overview of small molecule modeling

The understanding of small molecule properties is key to drug design and discovery. Properties can be physical or chemical: physical properties may include partition coefficient, aqueous solubility, boiling point, density, etc., while chemical properties may refer to toxicity, chemical stability or flammability. This work is mainly focused on quantitative predictions of solvation free energy, partition coefficient, aqueous solubility, and toxicity endpoints.
The solvation process is fundamental to more sophisticated processes such as protein binding, protein-DNA and protein-RNA binding, protein-protein interaction, etc. [7, 8, 9, 10]. Many studies have been conducted on the modeling and analysis of solvation processes over the past few decades [8, 11, 9, 7, 12, 13], and accurate prediction of solvation free energies is one of the most popular and challenging topics [14], as it is intrinsically related to binding free energy.

Many approaches have been developed to predict small molecule solvation free energies. Categorically, there are physical models and knowledge-based models. The advantage of physical models mainly lies in their physical interpretability. Physical models can be further classified into explicit and implicit models. For explicit solvent models, molecular mechanics (MM) [15] and hybrid quantum mechanics/molecular mechanics (QM/MM) are commonly used [16]. On the other hand, implicit solvent models include the generalized Born (GB) model as well as many variations [17, 18, 19, 20, 21, 22] such as GBSA [23] and SMx [24, 25]. The most popular implicit solvent model is based on the Poisson-Boltzmann (PB) theory, which retains an atomistic description of the solute molecule while treating the solvent, together with any ions and cofactors, as a dielectric continuum [26, 27, 28, 29, 30, 31]. More recently, Gaussian-based smooth dielectric functions have also shown success in computing solvation energies of both small molecules and proteins [32, 33].

In classical implicit solvent models, the solvation free energy is split into polar and nonpolar contributions, which are computed separately. The aforementioned models can be used to calculate the polar part. For the nonpolar part, the solvent-accessible surface area (SASA) has been shown to be very useful [34, 35], although it still has some drawbacks [36, 37, 38, 39]. More recently, the coupling of polar and nonpolar components has been considered in several models [40, 41, 42]. One representative model for this coupling is based on differential geometry theory, a variational approach and geometric measure theory. These mathematical apparatuses give rise to an elegant dynamical coupling of polar and nonpolar solvation components [41, 43, 42, 44]. By applying constrained optimization to nonpolar parameter selection, this model provides some of the best solvation free energy fitting and cross-validation results for a large set of solute molecules [45].

The partition coefficient, denoted P and defined as the ratio of concentrations of a solute in a mixture of two immiscible solvents at equilibrium, is of great importance in pharmacology. It measures the drug-likeness of a compound as well as its hydrophobic effect on the human body. The logarithm of this coefficient, i.e., log P, has proved to be one of the key parameters in drug design and discovery. An optimal log P, along with low molecular weight and low polar surface area, plays an important role in governing the kinetic and dynamic aspects of drug action. In particular, Hansch et al. [46] gave a detailed description of how lipophilicity impacts pharmacodynamics. Indeed, surveys show that approximately half of all drug candidates fail to reach the market due to unsatisfactory pharmacokinetic properties or toxicity [47], which makes log P predictions all the more important. Moreover, reliable experimental log P data exist for only a negligible fraction of the enormous number of compounds for which log P values are needed in practice.
Therefore, computational prediction of the partition coefficient is an indispensable approach in modern drug design and discovery. Since the pioneering work of Hansch et al. [48, 49, 50], a large variety of octanol-water partition coefficient predictors have been developed over the past few decades. Many of these methods are generally called quantitative structure-activity relationship (QSAR) models. In general, these models can be categorized into atom-based additive methods, fragment/compound-based methods, and property-based methods.

One of the atom-based additive methods, first proposed by Crippen and co-workers [51], is essentially purely additive and effectively a table look-up per atom. Later on, XLOGP3, a refined version of the atom-based additive methods, was developed [5]. This approach considers various atom types, contributions from neighbors, and correction factors that help overcome known difficulties of purely atomistic additive methods. However, additivity may fail in cases where unexpected contributions to log P occur, especially for complicated structures.

Fragment/compound-based predictors, instead of employing information from single atoms, are built at the compound or fragment level. Fragment or compound contributions are then summed, together with correction factors. Popular fragment methods include KOWWIN [52, 53], CLOGP [54, 55], ACD/LOGP [56, 57], and KLOGP [58, 59]. A major challenge for fragment/compound-based methods is the optimal classification of "building blocks". The number of fragments and corrections involved in current methods ranges from hundreds to thousands, and could be even larger if remote atoms are also taken into account. This may lead to technical problems in practice and may also cause overfitting.

The third category is property-based. Property-based methods determine the partition coefficient using properties, empirical approaches, three-dimensional (3D) structures (e.g., implicit solvent models, molecular dynamics (MD) methods), and topological or electrostatic indices. Most of these methods are modeled using statistical tools such as associative neural networks (ALOGPS) [60, 61]. It is worth mentioning that property-based methods are relatively computationally expensive and depend largely on the choice of descriptors and the accuracy of the underlying computations. To some extent this has led to a preference for methods of the first two categories over those of the third.

Another closely related physical property is aqueous solubility, denoted by S, or its logarithm log S. In drug discovery and related pharmaceutical fields, it is of great significance to identify molecules with undesirable water solubility at early stages, as solubility affects the absorption, distribution, metabolism, and elimination (ADME) processes [62, 63]. QSPR models, along with atom/group additive models [64, 3, 65, 2], have been developed to predict solubility. For example, QSPR models assume that aqueous solubility correlates with experimental properties such as the aforementioned partition coefficient and melting point [66], or with molecular descriptors such as solvent-accessible area. However, because solubility is difficult to measure experimentally for certain compounds, the experimental data can contain errors of up to 1.5 log units [67, 68] and no less than 0.6 log units [69]. Such high variability poses a challenge to solubility prediction.

Among chemical properties, toxicity is one of the most significant.
Toxicity is a measure of the degree to which a chemical can adversely affect an organism. These adverse effects, called toxicity endpoints, can be measured either quantitatively or qualitatively by their effects on given targets. Qualitative toxicity classifies chemicals into toxic and nontoxic categories, while quantitative toxicity data sets record the minimal amount of a chemical required to reach a given lethal effect. Most toxicity tests aim to protect humans from the harmful effects of chemical substances and are traditionally conducted in an in vivo or in vitro manner. Nevertheless, such experiments are usually very time consuming and cost intensive, and even give rise to ethical concerns when animal tests are involved. Therefore, computer-aided methods, or in silico methods, have been developed to improve prediction efficiency without sacrificing too much accuracy.

1.2 An overview of topological modeling and topological learning

The key to successful prediction of small molecule properties lies in an accurate representation of a given molecule. In fact, geometric representations of molecules, particularly macromolecules, often involve too much structural detail and thus may become intractable for large and complex biomolecular data sets. On the contrary, topology offers the highest level of abstraction and truly metric-free representations of molecules, although in most cases traditional topology incurs too much geometric reduction to be practically useful for molecules. Persistent homology bridges classical geometry and topology, offering a multiscale representation of molecular systems [70, 71]. In doing so, it creates a family of topologies via a filtration parameter, which leads to one-dimensional topological invariants, i.e., barcodes of Betti numbers; the Betti-0, Betti-1 and Betti-2 numbers can be physically interpreted as the numbers of isolated components, circles, and cavities, respectively. Persistent homology has been successfully applied to the modeling and prediction of nanoparticles, proteins and other biomolecules [72, 73, 74, 75, 76]. Nonetheless, primitive persistent homology was found to have very limited predictive power in machine learning based classification of biomolecules [77], which motivates us to introduce a more sophisticated element specific persistent homology (ESPH) that retains crucial biological information during the topological simplification of geometric complexity [78, 79, 80]. ESPH has found success in the prediction of protein-ligand binding affinities [79, 80] and mutation induced protein stability changes [78, 80]. Topological tools thus certainly have potential for small molecule modeling.

1.3 An overview of QSAR and machine learning

The quantitative structure-activity relationship (QSAR) approach is one of the most popular and commonly used approaches in cheminformatics modeling. The basic QSAR assumption is that similar molecules have similar activities. Therefore, by studying the relationship between chemical structures and biological activities, it is possible to predict the activities of new molecules without actually conducting lab experiments. Several types of algorithms are used to generate QSAR models: linear models based on linear regression and linear discriminant analysis [81], and nonlinear models including nearest neighbor [82, 83], support vector machine [81, 84, 85] and random forest [86].
These methods have their own advantages and disadvantages [87] owing to their statistical nature. For instance, linear models overlook the relatedness between different features, while nearest neighbor methods depend largely on the choice of descriptors. To overcome these difficulties, more refined and advanced machine learning methods have been introduced. Multi-task learning (MTL) [88] was proposed partly to deal with the data sparsity problem, which is commonly encountered in QSAR applications. The idea of MTL is to learn the so-called "inductive bias" from related tasks to improve accuracy using the same representation. In other words, MTL aims at learning a shared and generalized feature representation from multiple tasks. Indeed, MTL strategies have brought new insights to bioinformatics, since compounds from related assays may share features at various feature levels, which is extremely helpful when a data set is small. Successful applications include splice-site and MHC-I binding prediction [89] in sequence biology, gene expression analysis, and systems biology [90]. MTL becomes more effective when it is combined with deep learning (DL) [91, 92] strategies. Deep neural networks (DNNs), particularly convolutional neural networks (CNNs), have emerged as a powerful paradigm, delivering a wide range of state-of-the-art results in signal and information processing fields such as speech recognition [93, 94] and natural language processing [95, 96], as well as in toxicity prediction [97, 98, 99, 100, 101] and aqueous solubility prediction [102]. The major advancement of DNN models over non-DNN models is that they consist of a larger number of layers and neurons, making it possible to extract more abstract features.

1.4 Motivations and objectives

We would like to study the descriptive and predictive power of PH and ESPH for small molecules. The difficulty of small molecule modeling is that small molecules involve a wide range of chemical elements, and their properties are very sensitive to their chemical constitution, symmetry and stereochemistry. It is therefore not clear a priori whether PH and ESPH are suitable descriptors for small molecules. The objective of this thesis is to explore the representability and predictive power of ESPH for small molecules, using state-of-the-art machine learning and deep learning algorithms. We focus on the analysis and prediction of several small molecule properties, both physical and chemical. Specifically, we aim to predict partition coefficient, aqueous solubility and four different toxicity endpoints. Due to their relevance to drug design and discovery, relatively large data sets have been collected in the literature for these problems, which provides a way to validate the representability of the proposed topological descriptors along with machine learning algorithms. To overcome the difficulty of predicting data sets with small training sets for certain problems, we construct topological learning models by integrating ESPH and multi-task deep learning. We show that ESPH provides a competitive description of relatively small drug-like molecules and that the MT-DNN architecture is capable of improving model performance on related tasks.

CHAPTER 2

TOPOLOGICAL MODELING OF SMALL MOLECULES

In this chapter, we focus on our approach to biomolecule modeling using topological tools.
First, we briefly review the background of persistent homology; then we introduce the so-called element specific persistent homology (ESPH) and its modified version. Several concrete examples are given to illustrate the motivation and predictive power of ESPH. Lastly, the essential idea of element specific topological descriptors (ESTDs) and how ESTDs are computed is discussed.

2.1 Persistent homology

For atomic coordinates in a molecule, algebraic groups can be defined via simplicial complexes, which are constructed from simplices, i.e., generalizations of the geometric notions of nodes, edges, triangles, tetrahedrons, etc. Homology associates a sequence of algebraic objects to topological spaces, and characterizes the topological connectivity of geometric objects in terms of topological invariants, i.e., Betti numbers. Betti-0, Betti-1 and Betti-2 represent the numbers of isolated connected components, rings and cavities, respectively. A filtration parameter, such as the radius of a ball, is varied continuously over an interval so as to generate a family of structures. Loosely speaking, the corresponding family of homology groups induced by the filtration is a persistent homology. The variation of the topological invariants, i.e., Betti numbers, gives rise to a unique characterization of physical objects such as protein complexes and small molecules.

Simplex Let u_0, u_1, ..., u_k be a set of points in R^d. A point x = Σ_{i=0}^k λ_i u_i is called an affine combination of the u_i if Σ_{i=0}^k λ_i = 1. The k+1 points are said to be affinely independent if and only if u_i − u_0, 1 ≤ i ≤ k, are linearly independent; we can find at most d linearly independent vectors, and hence at most d+1 affinely independent points, in R^d. An affine combination x = Σ_{i=0}^k λ_i u_i is a convex combination if all λ_i are nonnegative. A k-simplex, defined as the convex hull (the set of convex combinations) of k+1 affinely independent points, can be formally represented as

    σ = { Σ_{i=0}^k λ_i u_i | Σ_{i=0}^k λ_i = 1, λ_i ≥ 0, i = 0, 1, ..., k },   (2.1)

where {u_0, u_1, ..., u_k} ⊂ R^d is a set of affinely independent points. Examples of k-simplices for the first few dimensions are shown in Figure 2.1. Essentially, a 0-simplex is a vertex, a 1-simplex is an edge, a 2-simplex is a triangle, and a 3-simplex is a tetrahedron.

Figure 2.1: Examples of simplices of different dimensions. (a), (b), (c) and (d) represent a 0-simplex, 1-simplex, 2-simplex, and 3-simplex, respectively.

A face τ of σ is the convex hull of a non-empty subset of the u_i, and is proper if the subset does not contain all k+1 points. We write τ ≤ σ if τ is a face of σ, and τ < σ if τ is a proper face. The boundary of σ is defined as the union of all proper faces of σ.

Simplicial complex A simplicial complex K is a finite collection of simplices such that σ ∈ K and τ ≤ σ implies τ ∈ K, and σ, σ′ ∈ K implies that σ ∩ σ′ is either empty or a face of both. The dimension of K is defined as the maximum dimension of its simplices.

Chain complex Given a simplicial complex K and a dimension p, a p-chain is a formal sum of p-simplices in K, denoted c = Σ_i a_i σ_i, where the σ_i are the p-simplices and the a_i are coefficients, mostly taken to be 0 or 1 (modulo 2 coefficients) for computational considerations. p-chains can be added like polynomials: if c_0 = Σ_i a_i σ_i and c_1 = Σ_i b_i σ_i, then c_0 + c_1 = Σ_i (a_i + b_i) σ_i, where the coefficients follow Z_2 addition rules. The p-chains with this addition form an Abelian group, written (C_p, +). The boundary operator on a p-simplex σ = [u_0, u_1, ..., u_p] is defined as

    ∂_p σ = Σ_{j=0}^p (−1)^j [u_0, u_1, ..., û_j, ..., u_p],   (2.2)

where û_j indicates that vertex u_j is omitted. Given a p-chain c = Σ_i a_i σ_i, we have ∂_p c = Σ_i a_i ∂_p σ_i. Since ∂_p maps a p-chain to a (p−1)-chain and the boundary operation commutes with addition, it defines a boundary homomorphism ∂_p : C_p → C_{p−1}. The chain complex is then defined through these boundary homomorphisms:

    · · · --> C_{p+1} --∂_{p+1}--> C_p --∂_p--> C_{p−1} --> · · · --∂_1--> C_0 --∂_0--> 0.   (2.3)

Cycles and boundaries A p-cycle is defined as a p-chain c with empty boundary (∂_p c = 0), and the group of p-cycles of K is denoted Z_p = Z_p(K). In other words, Z_p is the kernel of the p-th boundary homomorphism, Z_p = ker ∂_p. A p-boundary is a p-chain c for which there exists d ∈ C_{p+1} with ∂_{p+1} d = c, and the group of p-boundaries is written B_p = B_p(K). Equivalently, B_p = im ∂_{p+1}, since the group of p-boundaries is the image of the (p+1)-st boundary homomorphism.

Homology groups The fundamental lemma of homology says that the composition ∂_p ∘ ∂_{p+1} is the zero map [103]. With this lemma we conclude that im ∂_{p+1} is a subgroup of ker ∂_p. The p-th homology group of a simplicial complex is then defined as the p-th cycle group modulo the p-th boundary group,

    H_p = Z_p / B_p,   (2.4)

and the p-th Betti number is the rank of this group, β_p = rank H_p. Geometrically, Betti numbers can be used to describe the connectivity of a given simplicial complex: intuitively, β_0, β_1 and β_2 are the numbers of connected components, tunnels, and cavities, respectively.
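As a quick worked example of these definitions, consider the boundary of a triangle: the complex K with vertices u_0, u_1, u_2 and edges [u_0, u_1], [u_1, u_2], [u_0, u_2], but without the 2-simplex itself. Over Z_2, the 1-chain c = [u_0, u_1] + [u_1, u_2] + [u_0, u_2] satisfies

    ∂_1 c = (u_0 + u_1) + (u_1 + u_2) + (u_0 + u_2) = 2(u_0 + u_1 + u_2) = 0 (mod 2),

so Z_1 = {0, c}, while B_1 = 0 because K contains no 2-simplices; hence β_1 = rank H_1 = 1. In dimension zero, Z_0 = C_0 has rank 3 and B_0 = im ∂_1 has rank 2, so β_0 = 3 − 2 = 1. The complex therefore has one connected component and one loop, exactly the geometric intuition behind β_0 and β_1.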
Filtration and persistence A filtration of a simplicial complex K is a nested sequence of subcomplexes of K,

    ∅ = K_0 ⊆ K_1 ⊆ · · · ⊆ K_n = K.   (2.5)

For each i ≤ j there exists an inclusion map from K_i to K_j, and therefore an induced homomorphism f_p^{i,j} : H_p(K_i) → H_p(K_j) for each dimension p. The filtration defined in Equation (2.5) thus corresponds to a sequence of homology groups connected by homomorphisms,

    0 = H_p(K_0) → H_p(K_1) → · · · → H_p(K_n) = H_p(K),   (2.6)

for each dimension p. The p-th persistent homology groups are defined as the images of the homomorphisms induced by inclusion,

    H_p^{i,j} = im f_p^{i,j},   (2.7)

where 0 ≤ i ≤ j ≤ n. In other words, H_p^{i,j} contains the homology classes of K_i that are still alive at K_j, for a given dimension p and each pair i, j. We can reformulate the p-th persistent homology group as

    H_p^{i,j} = Z_p(K_i) / ( B_p(K_j) ∩ Z_p(K_i) ).   (2.8)

The corresponding p-th persistent Betti numbers are the ranks of these groups, β_p^{i,j} = rank H_p^{i,j}. The birth, death and persistence of a Betti number carry important chemical and/or biological information, which is the basis of the present method.

2.2 Persistent homology for characterizing molecules

As introduced above, persistent homology reveals long-lasting properties of a given object and offers a practical method for computing topological invariants of a space, capturing the underlying features of the object directly from discrete point cloud data. An intuitive way to construct a simplicial complex from point cloud data is to adopt the Euclidean distance, for instance via the Vietoris-Rips complex with the Euclidean metric.
For a given value ε, the Vietoris-Rips complex is defined as the simplicial complex whose k-simplices correspond to unordered (k+1)-tuples of points that are pairwise within distance ε. However, a particular radius ε is not sufficient, since it is difficult to tell whether a hole is essential. It is therefore necessary to increase the radius ε systematically and observe how the homology groups and Betti numbers evolve. The persistence [103, 71] of each Betti number over the filtration can be recorded in barcodes [104, 105]. The persistence of topological invariants observed from barcodes offers an important characterization of small molecular structures. For instance, given the 3D coordinates of a small molecule, a short-lived Betti-0 bar may be the consequence of a strong covalent bond, while a long-lived Betti-0 bar can indicate a weak covalent bond. Similarly, a long-lived Betti-1 bar may represent a chemical ring. Such observations motivate us to design persistent homology based topological descriptors (TDs). It is important to note, however, that the filtration radius is not a chemical bond, and topological connectivity is not a physical relationship. In other words, persistent homology offers a representation of molecules that is entirely different from classical theories of chemical and/or physical bonds. Such a representation is systematic and comprehensive, and is thus able to unveil structure-activity relationships when coupled with advanced machine learning algorithms.

An example of PH Figure 2.2 gives a detailed example of how persistent homology can be applied to a simple molecule, cyclohexane. An all-element representation of cyclohexane is given in Fig. 2.2a, where carbon atoms are in green and hydrogen atoms are in white. As we can see from its barcodes in Fig. 2.2c, there are 18 Betti-0 bars at the very beginning, corresponding to the 18 atoms; 12 of them disappear when the filtration value increases to 1.08 Å. This indicates that each carbon atom has merged with its two closest hydrogen atoms once the filtration value exceeds the length of the C-H bond, so that these three atoms are regarded as one connected component. When the filtration value further increases to 1.44 Å, a Betti-1 bar emerges, which means that a hexagonal carbon ring is captured and only one connected component remains. As the filtration value eventually exceeds the radius of the hexagon, the ring structure disappears and the Betti-1 bar dies. The longest Betti-0 bar corresponds to the persisting connected component. When only carbon atoms are selected, the barcode plot is relatively straightforward to interpret: the cutoff at which 5 Betti-0 bars disappear corresponds to the C-C bond length, and the Betti-1 bar represents the hexagonal carbon ring.

Figure 2.2: Different representations of cyclohexane and their persistent homology barcode plots. Subfigures (a) and (b) show complete cyclohexane and cyclohexane with only carbon atoms selected, respectively. In subfigures (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively.
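The thesis does not name a software package for these computations; as a minimal illustrative sketch, barcodes of this kind can be reproduced with the open-source GUDHI library (an assumed tool choice here), using an idealized regular hexagon with side 1.54 Å in place of the cyclohexane carbon ring:

    import math
    import gudhi  # assumed persistent homology backend

    # Vertices of a regular hexagon with side 1.54 (a typical C-C bond length).
    side = 1.54
    points = [[side * math.cos(k * math.pi / 3), side * math.sin(k * math.pi / 3), 0.0]
              for k in range(6)]

    # Vietoris-Rips filtration: a simplex enters once all of its pairwise
    # distances fall below the filtration value.
    rips = gudhi.RipsComplex(points=points, max_edge_length=4.0)
    simplex_tree = rips.create_simplex_tree(max_dimension=2)

    # Each entry is (dimension, (birth, death)); death is inf for classes
    # that persist to the end of the filtration.
    for dim, (birth, death) in simplex_tree.persistence():
        print(f"Betti-{dim} bar: born {birth:.2f}, dies {death:.2f}")

For this idealized point cloud one expects six Betti-0 bars born at 0 (five dying at 1.54 when the ring connects, one persisting) and one Betti-1 bar born at 1.54 that dies at sqrt(3) x 1.54 ≈ 2.67, when second-neighbor edges fill in the hexagonal loop.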
Nevertheless, such information is not sufficient under most circumstances, especially for small molecules. For instance, it is not possible to distinguish a carbon-nitrogen ring from a all-carbon ring as primitive persistent homology can only capture the 12 persistence of a Betti-1 bar whereas there is no indication whether a nitrogen atom exists. Figure 2.3 shows why primitive persistent homology has limitations when dealing with some small molecules of similar structures. (a) Benzene (b) Pyridine (c) Barcodes of benzene (d) Barcodes of pyridine Figure 2.3: Benzene, pyridine and their persistent homology barcode plots. In subfigure(a) and (b), benzene and pyridine are shown with hydrogen atoms being neglected, respectively. In subfigure (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively, for benzene and pyrdine. As shown in Fig 2.3a and Fig 2.3b, both benzene and pyridine have a hexagonal ring except that pyridine has a nitrogen atom. From primitive persistent homology point of view, there is no difference between these two molecules except a slight difference between the lengths of Betti bars – the Betti-1 bar of benzene has length 1.08 Åwhile pyridine’s Betti-1 bar has length 0.96 Å– which is caused by the fact that the carbon-carbon bond is generally longer than carbon-nitrogen bond. However, Fig 2.3c and Fig 2.3d follow a very similar pattern. Both barcodes contain 6 Betti-0 bars and 1 Betti-1 bar, and it is nearly impossible for us to distinguish these two molecules purely 13 from barcodes calculated from their structures. Therefore it is necessary for us to introduce the idea of ESPH, where persistent homology is computed based on different combinations of specific element types. 2.3.2 An introduction to ESPH An example of ESPH representation Figure 2.4 depicts how ESPH modeling can be applied to small molecules. In the following case, indazole (PubChem id: 9221) is chosen. For simplicity, hydrogen atoms are neglected. Apparently there are two chemical rings within the indazole molecule - one hexagonal carbon ring and one pentagonal carbon-nitrogen ring. (a) C representation of indazole (b) C-N representation of indazole (c) Barcodes of the C representation (d) Barcodes of the C-N representation Figure 2.4: Indazole and its persistent homology barcodes. In subfigure(a) and (b), indazole is shown with carbon and carbon-nitrogen atoms selected, respectively. In subfigure (c) and (d), from top to bottom, the results are for Betti-0 and Betti-1, respectively Mathematically if we ignore element types, there are two 1-simplices (loops) for indazole. We have to notice, however, that the properties of these two rings are dramatically different and 14 they have to be handled very carefully. Thus if only carbon atoms are selected, we can observe 7 Betti-0 bars and 1 Betti-1 bar, which correspond to 7 carbon atoms and the hexagonal carbon ring, respectively. While carbon and nitrogen atoms are selected, there are 9 Betti-0 bars, 1 short Betti-1 bar and 1 longer Betti-0 bar. By comparing these two barcodes, we may conclude the following: • The difference between number of Betti-0 bars represents the number of nitrogen atoms (2 nitrogen atoms). • There exists a carbon-nitrogen ring, and its size is smaller than the previous carbon ring. Notice that the carbon-nitrogen ring is not captured when only carbon atoms are selected, and its Betti-1 bar is shorter than the other Betti-1 bar that corresponds to the carbon ring. 
2.4 Modified ESPH for characterizing intra-molecular interactions

Another important component of ESPH is the filtration matrix that defines the distance used in the persistent homology analysis. Traditionally, the distance between atom i at (x_i, y_i, z_i) and atom j at (x_j, y_j, z_j) is defined as the Euclidean distance between them,

    d_{i,j} = sqrt( (x_i − x_j)^2 + (y_i − y_j)^2 + (z_i − z_j)^2 ).   (2.9)

Indeed, using the Euclidean distance defined in Eq. (2.9), persistent homology easily captures information such as covalent bonds between different atom types. However, it does not necessarily reflect intramolecular interactions such as hydrogen bonding and van der Waals interactions, which is not ideal for the purpose of small molecule modeling: a Betti-0 bar between two atoms linked by hydrogen bonding or van der Waals interaction cannot be captured, because shorter Betti-0 bars (essentially covalent bonds) already connect them. To circumvent this deficiency, we redefine the distance between atoms i and j as

    M_{i,j} = d_{i,j},   if d_{i,j} ≥ r_i + r_j + |∆d|,
              d_∞,      otherwise,   (2.10)

where r_i and r_j are the atomic radii of atoms i and j, respectively, ∆d is the bond length deviation in the data set, and d_∞ is a large number set to be greater than the maximal filtration value. Since the distance between two covalently bonded atoms is pushed beyond the preset maximum filtration value, covalent bonds can never be built in the filtration, and the modified ESPH captures important intramolecular interactions instead.
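As an illustrative sketch (not code from the thesis), Eq. (2.10) translates directly into a modified distance matrix, which can then be supplied to a Rips complex in place of raw coordinates (GUDHI, for instance, accepts a distance_matrix argument):

    import numpy as np

    def modified_distance_matrix(coords, radii, delta_d, d_inf=1.0e3):
        # coords: (n, 3) array of atomic coordinates; radii: (n,) atomic radii.
        # Implements Eq. (2.10): pairs closer than r_i + r_j + |delta_d| are
        # treated as covalently bonded and pushed beyond the filtration range
        # (d_inf), so that only non-covalent contacts (hydrogen bonds, van der
        # Waals interactions) can create short Betti-0 bars.
        n = len(coords)
        M = np.full((n, n), d_inf)
        np.fill_diagonal(M, 0.0)
        for i in range(n):
            for j in range(i + 1, n):
                d = np.linalg.norm(coords[i] - coords[j])
                if d >= radii[i] + radii[j] + abs(delta_d):
                    M[i, j] = M[j, i] = d
        return M

    # e.g. gudhi.RipsComplex(distance_matrix=modified_distance_matrix(...).tolist(),
    #                        max_edge_length=10.0)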
CHAPTER 3

MACHINE LEARNING METHODS AND ALGORITHMS

In this chapter, we give an overview of the machine learning algorithms and multi-task deep learning architectures used in this study.

3.1 An overview of machine learning algorithms

The concept of machine learning was first proposed by Arthur Samuel [107]. Machine learning algorithms can learn from and make predictions on given data, and have the potential to overcome complicated computational problems when explicit solutions are difficult to determine. Broadly speaking, machine learning algorithms fall into three categories: supervised learning, unsupervised learning and reinforcement learning. The features of each category can be summarized as follows:

• Supervised learning: Each sample in the training set consists of a target value (categorical for classification, continuous for regression) and a given set of (independent) descriptors. The purpose of supervised learning is to learn a function that maps inputs to desired outputs. The training process continues until the model reaches a predefined level of accuracy on the training data.

• Unsupervised learning: The difference from supervised learning is that there is no target value for each training sample and no evaluation of the accuracy of the output. The purpose of unsupervised learning is to cluster the population into different groups.

• Reinforcement learning: Reinforcement learning trains a model to make specific decisions and is typically formulated as a Markov decision process. During the learning process, correct input/output pairs are never presented, nor are sub-optimal actions explicitly corrected. The machine is exposed to an environment in which it trains itself continually by trial and error, eventually finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge) [108].

In this thesis, the small molecule properties to be predicted are all quantitative, so we employ supervised learning algorithms for training and testing. More precisely, ensemble methods and supervised deep neural networks are discussed in the next sections.

3.2 Ensemble methods

In this section, we first review decision tree induction. We then introduce several ensemble methods that are essentially based on decision tree algorithms, including both random forest and gradient boosting decision tree algorithms.

3.2.1 Decision tree induction

3.2.1.1 Basics of decision tree induction

A decision tree model has three types of nodes: the root node, internal nodes, and terminal (leaf) nodes. The model solves a problem by answering a series of questions at each node and returns a conclusion when a terminal node is reached.

Hunt's algorithm Hunt's algorithm [109] is the basis of many existing decision tree induction algorithms, in which a decision tree is grown recursively by partitioning the training records into successive subsets. Let D_t denote the set of training records associated with node t. The general procedure is:

1. If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled y_t.

2. If D_t contains records that belong to more than one class, an attribute test condition is used to split the records into smaller subsets. The procedure continues until all the records in a subset belong to the same class.

Attribute test condition As mentioned above, there are different attribute test conditions depending on the attribute type. For instance, a binary attribute yields two potential outcomes, while a continuous attribute uses a comparison test to express the test condition. It is thus of great significance to select proper measures to evaluate the candidate splits at each node.

Measures for selecting the best split The measures for selecting the best split are based on the degree of impurity of the child nodes. For classification tasks, typical measures include

    Entropy(t) = − Σ_{i=0}^{c−1} p(i|t) log_2 p(i|t),   (3.1)

    Gini(t) = 1 − Σ_{i=0}^{c−1} [p(i|t)]^2,   (3.2)

    Information Gain(t) = Entropy(t) − weighted entropy of t's child nodes,   (3.3)

where t is a node of the decision tree, c is the number of classes, and p(i|t) is the fraction of records at node t belonging to class i. The weighted entropy of the child nodes of node t is calculated as the weighted average of the child node entropies, where the weight of each child node is the number of records at that child node divided by the number of records at the parent node. To select the best split, the impurity measure of each candidate split is calculated, and the best split is chosen among all candidates. For regression tasks, variance reduction can be used to split nodes. A commonly used measure of variance reduction is analogous to the information gain for classification in Eq. (3.3) and can be written as

    Variance Reduction(t) = Variance(t) − weighted variance of t's child nodes.   (3.4)

The weights are calculated in the same way as for information gain.
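As a small illustration (a sketch, not code from the thesis), Eqs. (3.1)-(3.3) can be computed directly:

    import numpy as np

    def entropy(labels):
        # Eq. (3.1): -sum_i p(i|t) log2 p(i|t) over the classes present at node t.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def gini(labels):
        # Eq. (3.2): 1 - sum_i p(i|t)^2.
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)

    def information_gain(parent, children):
        # Eq. (3.3): parent entropy minus the size-weighted entropy of the children.
        n = len(parent)
        weighted = sum(len(c) / n * entropy(c) for c in children)
        return entropy(parent) - weighted

    # Example: a 50/50 node split into two pure children yields a gain of 1 bit:
    # information_gain([0, 0, 1, 1], [[0, 0], [1, 1]])  ->  1.0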
Pruning of decision trees Decision trees, however, are often susceptible to overfitting, in the sense that a trained decision tree fits the training data so closely that the model can produce substantial errors on unseen data, as training data usually contain some degree of error or random noise. Pruning is a strategy to reduce the size of a decision tree. One way to perform pruning is to apply an early stopping rule during the tree-training process. Specifically, a node stops expanding into child nodes when the observed gain in the impurity measure falls below a predefined threshold, which helps to avoid constructing overly complicated subtrees that may cause overfitting issues [109]. However, the threshold is difficult to determine [109]. The other strategy is to perform post-training pruning. This can be done by replacing a subtree with 1) a new leaf node whose class is determined by the majority class of the records associated with the subtree, or 2) the most frequently used branch of the subtree. The pruning process stops when no further improvement can be observed.

3.2.2 Random forest

Random forest is an ensemble machine learning algorithm that can be used for regression and classification. It learns the training data by constructing a multitude of decision trees, and returns predictions by averaging the outputs of individual trees (regression) or by taking the majority vote of individual trees (classification). Typically random forest does not overfit the training data, and it is capable of reducing variance while maintaining the same level of bias [110] when compared with traditional decision tree models.

3.2.2.1 Bootstrap aggregating

Random forest takes advantage of the bootstrap aggregating technique when building individual tree learners. Given a training set {(x_i, y_i)}_{i=1}^{n} and the number of trees N, bootstrap aggregating repeatedly selects a random sample S^{(j)} with replacement from the training set and fits an individual tree f_j to each sample (j = 1, ..., N). For regression problems, the prediction \hat{f} for any unseen data point x' is then given by:

\hat{f} = \frac{1}{N} \sum_{j=1}^{N} f_j(x'). (3.5)

For classification problems, the predicted class is the class that the majority of trees vote for.

3.2.2.2 Feature bagging

The random forest algorithm differs from the aforementioned bootstrap aggregating technique in that a random feature subset is used for tree splitting in random forest; in other words, feature bagging is used to determine the best splits.

3.2.3 Gradient boosting decision tree

3.2.3.1 Gradient boosting algorithm

The gradient boosting algorithm was first observed by Leo Breiman [111], and was subsequently developed by Friedman [112]. Gradient boosting can also be viewed as an iterative functional gradient descent algorithm [113, 114]. The idea of this algorithm is to iteratively find a series of weighted weak learners that eventually form a strong learner, which can be expressed as the summation of the weak learners.

3.2.3.2 Gradient tree boosting

Gradient tree boosting is a combination of the aforementioned gradient boosting and decision trees of fixed size. Specifically, decision trees are used as the base learners h_m(x) that fit the pseudo-residuals at each iteration step m.
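The following minimal Python sketch, written under the assumption of a squared-error loss, illustrates the gradient tree boosting loop of Section 3.2.3.2: at every step a small regression tree is fitted to the current pseudo-residuals and added to the ensemble with a shrinkage factor. The data and hyperparameters are hypothetical.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_tree_boosting(X, y, n_steps=100, learning_rate=0.1, max_depth=3):
    # Initialize the model with the constant that minimizes squared error.
    pred = np.full(len(y), y.mean())
    trees = []
    for m in range(n_steps):
        residuals = y - pred            # pseudo-residuals for squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)          # h_m(x) fits the pseudo-residuals
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, y.mean()

def boosted_predict(trees, base, X, learning_rate=0.1):
    return base + learning_rate * sum(t.predict(X) for t in trees)

# Hypothetical regression data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.1, size=200)
trees, base = gradient_tree_boosting(X, y)
print(boosted_predict(trees, base, X[:5]))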
3.3 Multi-task learning and deep neural networks

3.3.1 Single-task deep neural network (ST-DNN)

A neural network acts as a transformation that maps an input feature vector to an output vector. It essentially models the way a biological brain solves problems, with numerous neuron units connected by axons. A typical shallow neural network consists of a few layers of neurons and uses backpropagation to update the weights of each layer. However, it is not able to construct hierarchical features and thus falls short in revealing more abstract properties, which makes it difficult to model complex nonlinear relationships. A single-task deep neural network, compared to a shallow network, has a wider and deeper architecture – it consists of more layers and more neurons in each layer, and reveals facets of the input features at different levels. A single-task deep learning algorithm is defined for each individual prediction task and only learns data from that specific task. A representation of such a single-task deep neural network (ST-DNN) can be found in Figure 3.1, where n represents the number of layers of a given ST-DNN, and k_i and N_{k_i} (i = 1, ..., n) denote the number of neurons and the nodes of the i-th hidden layer, respectively. Generally speaking, the objective of an ST-DNN is to minimize a given loss function, which depends on the problem one is trying to solve – such a loss function can be the cross-entropy loss for a multi-class classification problem, or the mean squared error for a regression problem.

Figure 3.1: An illustration of ST-DNN architecture.

3.3.1.1 Multi-task deep neural network

Multi-task learning is a machine learning technique which has shown success in various fields. The main advantage of MT learning is that it learns multiple tasks simultaneously and exploits commonalities as well as differences across different tasks. Another advantage of MT learning is that building an accurate predictive model on a small data set with an incomplete statistical distribution can often benefit significantly from relatively large data sets with more complete statistical distributions.

Suppose we have a total of T tasks and the training data for the t-th task are denoted as (x_i^t, y_i^t)_{i=1}^{N_t}, where t = 1, ..., T, i = 1, ..., N_t, N_t is the number of samples of the t-th task, and x_i^t and y_i^t are the topological descriptor vector (consisting of ESTDs) and the target value of the i-th molecule in the t-th task, respectively. The goal of MTL is to minimize the following loss function for all tasks simultaneously:

\operatorname{argmin}_{\{W^t, b^t\}} \sum_{i=1}^{N_t} L(y_i^t, f^t(x_i^t; \{W^t, b^t\})), (3.6)

where f^t is a functional of the topological descriptor vector x_i^t parametrized by a weight vector W^t and a bias term b^t, and L is the loss function. A typical cost function for quantitative regression is the mean squared error; thus the loss of the t-th task can be defined as:

Loss of Task t = \frac{1}{2} \sum_{i=1}^{N_t} L(x_i^t, y_i^t) = \frac{1}{2} \sum_{i=1}^{N_t} \left( y_i^t - f^t(x_i^t; \{W^t, b^t\}) \right)^2. (3.7)

To avoid overfitting, it is usually beneficial to augment the above loss function (3.7) with a regularization term on the weight vectors, giving an improved loss function for the t-th task:

Loss of Task t = \frac{1}{2} \sum_{i=1}^{N_t} \left( y_i^t - f^t(x_i^t; \{W^t, b^t\}) \right)^2 + \beta \|W^t\|_2^2, (3.8)

where \|\cdot\|_2 denotes the L2 norm and \beta represents a penalty constant. The goal of topology based MTL is to learn different small molecule properties jointly, and to potentially improve the overall performance of multiple models simultaneously.
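A minimal numpy sketch of the per-task regularized loss of Eq. (3.8) is given below; the predictor f is left abstract as any callable, and the arrays and the penalty beta are hypothetical.

import numpy as np

def task_loss(f, X_t, y_t, W_t, b_t, beta=1e-4):
    """Regularized squared loss of task t, Eq. (3.8)."""
    pred = f(X_t, W_t, b_t)
    data_term = 0.5 * ((y_t - pred) ** 2).sum()
    reg_term = beta * (W_t ** 2).sum()   # beta * ||W_t||_2^2
    return data_term + reg_term

# Hypothetical single-layer linear predictor and data for one task
f = lambda X, W, b: X @ W + b
X_t = np.random.rand(50, 145)   # 50 molecules, 145 ESTDs
y_t = np.random.rand(50)
W_t, b_t = np.zeros(145), 0.0
print(task_loss(f, X_t, y_t, W_t, b_t))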
More concretely, it is reasonable to assume that while different small molecules comprise distinct physical/chemical features, shared descriptors, such as the occurrence of certain chemical substructures, can result in similar physical/chemical properties. A simple representation of the multi-task deep neural network (MT-DNN) for our study is shown in Figure 3.2, where k_i (i = 1, ..., n) represents the number of neurons on the i-th hidden layer, N_{k_i} are the neurons on the i-th layer, and Prd1, ..., Prdt represent the predicted values for t different tasks.

Figure 3.2: An illustration of MT-DNN architecture.

3.3.1.2 Network parameters and training

Hyperparameter tuning of DNNs is known to be very complicated. In order to come up with a reasonable set of hyperparameters, we perform a grid search over each hyperparameter within a wide range. The hyperparameters in Table 3.1 are chosen so that we obtain a reasonable training speed and accuracy. It turns out that adding dropout or L2 decay does not necessarily increase the accuracy, and as a consequence we omit these two techniques. The underlying reason may be that ensembling the results of different DNN models is already capable of reducing the bias of individual predictions. A list of the hyperparameters used to train all models can be found in Table 3.1.

Table 3.1: Proposed hyperparameters for MT-DNN

Number of hidden layers:          7
Number of neurons on each layer:  1000 for the first 3 layers, and 100 for the next 4 layers
Number of epochs:                 1000
Optimizer:                        ADAM
Learning rate:                    0.001

In each training epoch, the molecules in each training set are randomly shuffled and then divided into mini-batches of size 200, which are used to update the parameters. When all mini-batches have been traversed, a training "epoch" is complete. All training processes were implemented with the Keras wrapper [115] using Theano (v0.8.2) [116] as the backend. All training runs were executed on an Nvidia Tesla K80 GPU, and the approximate training time for a total of 1000 epochs is about 80 minutes.
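The sketch below shows how an MT-DNN of this shape could be assembled. It is a minimal illustration written against the present-day TensorFlow/Keras API rather than the Keras/Theano stack used in this work; the ReLU activation, input dimension, task count, and data are assumptions, and for simplicity the sketch assumes every molecule carries a label for every task. Each task reads the shared hidden layers through its own linear output head, which is how commonalities across tasks are exploited.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_estd, n_tasks = 145, 2            # hypothetical ESTD length and task count
inputs = keras.Input(shape=(n_estd,))
x = inputs
for width in [1000, 1000, 1000, 100, 100, 100, 100]:   # 7 hidden layers, Table 3.1
    x = layers.Dense(width, activation="relu")(x)
# One linear regression head per task (e.g., log P and log S)
outputs = [layers.Dense(1, name=f"task_{t}")(x) for t in range(n_tasks)]
model = keras.Model(inputs, outputs)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001), loss="mse")

# Hypothetical training data; mini-batches of 200 as in the text
X = np.random.rand(1000, n_estd)
ys = [np.random.rand(1000) for _ in range(n_tasks)]
model.fit(X, ys, batch_size=200, epochs=5, verbose=0)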
3.4 Pipelines for predicting properties of small molecules using MT-DNN

In this section, we provide graphical pipelines for predicting properties of small molecules, to help readers understand how our MT-DNN architecture works together with ESPH. In Fig. 3.3, a general procedure for predicting log P and log S simultaneously is presented. Given any molecule, we use ESPH to extract information such as intra-molecular interactions and geometrical connections from the molecule, and construct ESTDs accordingly. Notice that the number of ESTDs is the same for every molecule; thus they can be directly fed into an MT-DNN architecture, where joint tasks can be learned and predicted simultaneously.

Figure 3.3: Graphical pipeline for simultaneous prediction of partition coefficient and aqueous solubility.

3.5 Ranking, query construction and prediction using nearest neighbors

The algorithms in this section are specifically used for solvation free energy, as a continuation of the work of Ref. [4]. A general procedure for solvation free energy prediction can be summarized as follows:

1. Molecules in the training set are divided into different queries based on element types, and molecules are ranked within each query.

2. Given any molecule in the test set, first determine the query that the molecule belongs to, then use the trained models to determine the order of the molecule within that specific query.

3. Finally, a predefined number of nearest neighbors (molecules) based on the ranking are selected and used for the solvation free energy prediction of the target molecule.

3.5.1 Ranking algorithms

The essential idea of ranking algorithms is to train on a list of data points with some partial orders (either numerical or ordinal scores) so that the learners are able to predict the order of an unseen item with respect to the training data. Clearly, if we use solvation free energy as a numerical score for each small molecule, a ranking algorithm can be directly applied to solvation free energy prediction. In this study, GBDT is used to rank a list of molecules.

3.5.2 Query construction

Query construction is an essential step for accurate prediction of solvation free energy. We follow the same principle as Ref. [117]. Basically, seven groups of molecules are constructed according to element types: i) H, C; ii) H, C, O; iii) H, C, N / H, C, N, O; iv) H, C, Cl; v) H, C, O, Cl; vi) H, S; and vii) anything else. Detailed information on the different queries can be found in the Appendix.

3.5.3 Prediction using nearest neighbors

After a given molecule is fed into the trained models and the order of the molecule within its query is returned, we select a number of nearest neighbors to predict its solvation free energy. Let m be the number of nearest neighbors. The purpose of local linear regression is to determine a weight vector w := (w_1, ..., w_n)^T and a bias b such that the training error on the nearest neighbors is minimized. Mathematically, the problem can be formulated in matrix form:

\begin{pmatrix} \Delta G_1 \\ \Delta G_2 \\ \vdots \\ \Delta G_m \end{pmatrix} = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1n} \\ x_{21} & x_{22} & \cdots & x_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{pmatrix} \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{pmatrix} + \begin{pmatrix} b \\ b \\ \vdots \\ b \end{pmatrix}, (3.9)

or equivalently

\Delta G = Xw + b\mathbf{1}, (3.10)

where \Delta G = (\Delta G_1, \Delta G_2, ..., \Delta G_m)^T are the experimental solvation free energies of the m nearest neighbors, w = (w_1, w_2, ..., w_n)^T, \mathbf{1} is a column unit vector of length m, and the matrix X = (x_{ij}) \in \mathbb{R}^{m \times n} contains the descriptor vectors of these m nearest neighbors. Since m is relatively small compared to n, such a local linear model is likely to overfit. Thus an L2 penalty term can be added to the training error, and Eq. (3.10) can be viewed as the following optimization problem:

\operatorname{argmin}_{w, b} F(w, b), \quad F(w, b) = \|\Delta G - Xw - b\mathbf{1}\|_2^2 + \lambda \|w\|_2^2, (3.11)

where \lambda is the regularization parameter, which is set to 1000 in this work, and \|\cdot\|_2 denotes the L2 norm. Setting \partial F / \partial w = 0, a direct computation returns:

w = (X^T X + \lambda I)^{-1} (X^T \Delta G - X^T (b\mathbf{1})), (3.12)

where I is the n × n identity matrix. Similarly, if we relax b\mathbf{1} to an arbitrary vector b = (b_1, b_2, ..., b_m)^T and set \partial F / \partial b = 0, we have

b = \Delta G - Xw. (3.13)

Thus the unbiased estimate of b can be written as

b = \frac{1}{m} \sum_{i=1}^{m} (\Delta G - Xw)_i, (3.14)

where (\Delta G - Xw)_i is the i-th component of the vector \Delta G - Xw. With Eq. (3.12) and Eq. (3.14), we may solve the optimization problem (3.11) in an iterative manner; the iteration continues until the solution converges. Let x' = (x'_1, ..., x'_n) be the descriptor vector of the target molecule and \Delta \hat{G} its predicted solvation free energy. We can now compute the solvation free energy of the target molecule, using the w and b calculated in the previous steps:

\Delta \hat{G} = x'w + b. (3.15)
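A minimal numpy sketch of this alternating solve of Eqs. (3.12) and (3.14) follows; the descriptor matrix and energies are hypothetical, and the convergence tolerance is an assumption.

import numpy as np

def local_ridge(X, dG, lam=1000.0, tol=1e-8, max_iter=100):
    """Alternately update w (Eq. 3.12) and b (Eq. 3.14) until convergence."""
    m, n = X.shape
    b = 0.0
    A = np.linalg.inv(X.T @ X + lam * np.eye(n))   # (X^T X + lambda I)^{-1}
    for _ in range(max_iter):
        w = A @ (X.T @ dG - X.T @ (b * np.ones(m)))
        b_new = (dG - X @ w).mean()                # Eq. (3.14)
        if abs(b_new - b) < tol:
            b = b_new
            break
        b = b_new
    return w, b

# Hypothetical: 10 nearest neighbors with 30 descriptors each
X = np.random.rand(10, 30)
dG = np.random.rand(10) * -10.0
w, b = local_ridge(X, dG)
x_target = np.random.rand(30)
print(x_target @ w + b)    # predicted solvation free energy, Eq. (3.15)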
CHAPTER 4
DATA SETS, PREPROCESSING AND DESCRIPTORS

In this chapter, we first introduce the data sets used to train and test the quantitative models. Second, detailed descriptions of the data preprocessing techniques for small molecules are discussed. Finally, we propose a variety of molecular descriptors, including element specific topological descriptors (ESTDs), physical descriptors and auxiliary descriptors. A detailed description of how they are calculated is also provided.

4.1 Data sets

4.1.1 Solvation free energy

In an earlier work [4], a data set that contains a total of 668 molecules was proposed; to the best of our knowledge, it is the largest data set for solvation free energies. The data set contains both monofunctional group and polyfunctional group molecules. Experimental solvation free energies were collected from the literature [118, 119, 120]. All the structures of this data set were downloaded from the PubChem project (https://pubchem.ncbi.nlm.nih.gov/). A more detailed description of the data set can be found in our earlier work [4].

SAMPL (Statistical Assessment of the Modeling of Proteins and Ligands) is a set of community-wide blind challenges aimed at advancing computational techniques as standard predictive tools in rational drug design [121, 122, 123, 124, 125]. One major subject of the SAMPL blind prediction challenges is to predict the solvation free energies of small molecules. In order to check how our models perform compared to others', we make sure that all molecules of SAMPL0, SAMPL1, SAMPL2, SAMPL3 and SAMPL4, except for 5-iodouracil in SAMPL2, are properly processed and predicted. By excluding the SAMPLx molecules from the training set, we can use the remaining molecules from the 668 training set to independently predict the solvation free energies of the SAMPLx molecules.

Moreover, we are also interested in knowing how our descriptors and algorithms can be extended to other data sets. An external data set [1] is used for validation purposes. Wang et al. [1] first introduced this data set to evaluate their weighted solvent accessible surface area based models. In their Model III, the authors divided 387 molecules (which exclude ions) into a training set (293) and a test set (94). In our study, a slightly smaller training set of 289 molecules is used, because 4 molecules in the training set have ambiguous chemical names in the PubChem database, while all molecules in the test set are included in our prediction, which enables us to compare to their results on an equal footing.
It should be noted that there exist discrepancies in the experimental solvation free energies of some molecules between Ref. [1] and the 668 set [4]. When such discrepancies occur, the experimental values reported by Wang et al. [1] are used for training and testing in the above comparison. Most experimental value differences lie within a very small range: only 23 differences are greater than 0.2 kcal/mol, and 5 of these 23 molecules are in the test set. These 5 compounds and their experimental values, with compound IDs corresponding to Table 3 of Wang et al. [1], are listed in Table 4.1. Additionally, 4 molecules listed in the training set of Ref. [1] are excluded from our training due to the absence of their structures in PubChem. These 4 molecules have compound IDs 363, 364, 385 and 388 in Table 3 of Ref. [1].

Table 4.1: Molecules in the test sets with large discrepancies in their experimental solvation free energies. Here "ID" refers to the ID of Table 3 of Ref. [1]

ID    Exp1 [1]   Exp2 [4]
46    0.29       0.01
67    -3.15      -3.4
97    -0.78      -1.73
103   -0.64      -1.4
352   -4.71      -5.22

Moreover, we have also noticed that there are 11 duplicates between the training set and the test set of Ref. [1]. Their compound IDs and duplicated IDs (Dup-IDs) in Table 3 of Ref. [1] are listed in Table 4.2. Molecules that are in the test set are marked with a superscript "b" to be consistent with the notation of Ref. [1]. Finally, we provide information about all data sets used for solvation free energy prediction in Table 4.3.

Table 4.2: Duplicated molecules in Ref. [1]

ID      Dup-ID        ID      Dup-ID
104b    119           333b    335
334b    336           384b    389
161b    202           82b     84
140b    142           184b    194
97b     116           58b     59
196b    203

Table 4.3: Statistics of solvation free energy data sets. The numbers within parentheses represent the actual numbers of molecules used in this study.

Data set               Number of molecules
SAMPL0 set             17
SAMPL1 set             63
SAMPL2 set             30 (20)
SAMPL3 set             36
SAMPL4 set             47
Wang's [1] train set   293 (289)
Wang's [1] test set    94

4.1.2 Partition coefficient and aqueous solubility

A primary aim of this thesis is to explore the proposed topology based multi-task methods for learning related tasks. The data sets naturally divide into two parts – one for partition coefficient prediction and the other for aqueous solubility prediction.

Partition coefficient data sets The training set used for partition coefficient prediction was originally compiled by Cheng et al. [5] and consists of 8199 compounds, based on Hansch et al.'s compilation [126]. These compounds are considered by Hansch to have reliable experimental log P values (marked with * or a checkmark). In addition, three sets were chosen as test sets. The first test set, which is completely independent of the training set, contains 406 small-molecule organic drugs approved by the Food and Drug Administration (FDA) of the United States and represents a variety of organic compounds of pharmaceutical interest. This set was also compiled by Cheng et al. [5]. The remaining two test sets, the Star set and the Non-star set, are publicly available and originate from a monograph of Avdeef [127]. The Star set comprises 223 compounds that are part of the BioByte Star set and have been widely used to develop log P prediction methods. The Non-star set contains 43 compounds that represent relatively new chemical structures and properties. The compound lists and corresponding partition coefficients are available for download at http://ochem.eu/article/17434.
We also made an attempt to expand our training set by searching the NIH database, as other software packages use much larger numbers of molecules for supervised learning. In this way, more than 3000 additional molecules were added to the training set.

Aqueous solubility data sets In order to develop and validate prediction models for aqueous solubility, several well-defined aqueous solubility data sets were used. First, a diverse data set of 1708 molecules proposed by Wang et al. [128] was used to verify the predictive power of the descriptors. Both leave-one-out and 10-fold cross-validation were carried out on this set. Furthermore, we also tested our models on a relatively small set with independent test sets [3]. As Hou [3] suggested, we also removed some molecules from the training set to ensure that the training set and the test set have no overlapping molecules. In addition, two more widely used, publicly available solubility data sets were used to train and evaluate our models. The first is the 'small' Delaney data set [129], which contains 1144 molecules and their measured aqueous solubilities (in log mol/L at 25 degrees Celsius). The second set was originally built by Huuskonen [130] from the AQUASOL database [131] and the PHYSPROP database [132]. It consists of 1026 organic molecules with their aqueous solubilities in log mol/L at 20-25 degrees Celsius. A summary of the data sets used for the proposed models is given in Table 4.4.

Table 4.4: Summary of log P and log S data sets used

logP data               Number of molecules
logP train set          8199
FDA test set            406
Star test set           223
Nonstar test set        43

logS data               Number of molecules
logS train set 1 [128]  1708
logS train set 2 [3]    1290 (1207 for test set 2)
logS test set 1 [3]     21
logS test set 2 [3]     120
Huuskonen logS set      1033 (1030)
Small Delaney logS set  1144 (1135)

4.1.3 Quantitative toxicity endpoints

Four different quantitative toxicity data sets, namely, the 96 hour fathead minnow LC50 data set (LC50 set), the 48 hour Daphnia magna LC50 data set (LC50-DM set), the 40 hour Tetrahymena pyriformis IGC50 data set (IGC50 set), and the oral rat LD50 data set (LD50 set), are studied in this work. Among them, the LC50 set reports the concentration of test chemicals in water, in mg/L, that causes 50% of fathead minnows to die after 96 hours. Similarly, the LC50-DM set records the concentration of test chemicals in water, in mg/L, that causes 50% of Daphnia magna to die after 48 hours. Both sets were originally downloaded from the ECOTOX aquatic toxicity database via the web site http://cfpub.epa.gov/ecotox/ and were preprocessed using filter criteria including media type, test location, etc. [133]. The third set, the IGC50 set, measures the 50% growth inhibitory concentration of the Tetrahymena pyriformis organism after 40 hours. It was obtained from Schultz and coworkers [134, 135]. The LD50 endpoint represents the amount of a chemical that kills half of the rats when orally ingested. The LD50 set was constructed from the ChemIDplus database (http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp) and then filtered according to several criteria [133]. The data sets used in this work are identical to those that were preprocessed and used to develop the Toxicity Estimation Software Tool (T.E.S.T.) at https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test [133]. T.E.S.T. was developed to estimate chemical toxicity using various QSAR methodologies and is very convenient to use, as it does not require any external programs.
It follows the general QSAR workflow: it first calculates 797 2D molecular descriptors and then predicts the toxicity of a given target by utilizing these precalculated molecular descriptors. All molecular structures and their toxicity endpoints are available on the T.E.S.T. website (https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test). It should be noted that we are particularly interested in predicting quantitative toxicity endpoints, so other data sets that contain qualitative endpoints or physical properties were not used. Moreover, different toxicity endpoints have different units. The units of the LC50, LC50-DM and IGC50 endpoints are − log10(T mol/L), where T represents the corresponding endpoint. For the LD50 set, the units are − log10(LD50 mol/kg). Although the units are not exactly the same, it should be pointed out that no additional attempt was made to rescale the values, since the endpoints are of the same order of magnitude. These four data sets also differ in size, ranging from hundreds to thousands of molecules, which essentially challenges the robustness of our methods. Detailed statistics of our data sets are presented in Table 4.5.

Table 4.5: Statistics of quantitative toxicity data sets

Set          Total # of mols   Train set size   Test set size   Max value   Min value
LC50 set     823               659              164             9.261       0.037
LC50-DM set  353               283              70              10.064      0.117
IGC50 set    1792              1434             358             6.36        0.334
LD50 set     7413 (7403)       5931 (5924)      1482 (1479)     7.201       0.291

4.2 Data Preprocessing

A major step toward successful small molecule modeling is to properly preprocess the input data. All small molecules are available in Tripos Mol2, SDF (Structure-Data File) or SMILES (Simplified Molecular Input Line Entry Specification) format. Essentially, these formats contain chemical structure information, and conversions can be done between the different formats. In order to consistently obtain well-prepared structures for descriptor generation, the LigPrep utility in the Schrödinger suites (https://www.schrodinger.com/) is used to obtain optimized 3D structures, which are readily suitable for ESTD and auxiliary physical descriptor computations. Additionally, SMILES files are also essential for preparing auxiliary molecular descriptors with ChemoPy [136].

4.3 Molecular Descriptors

In this section, we introduce the molecular descriptors used in this work. These molecular descriptors can be divided into several categories: element specific topological descriptors (ESTDs), physical model based descriptors, and auxiliary molecular descriptors calculated by ChemoPy [136]. The motivation for and details of our molecular descriptor construction process are also presented.

4.3.1 Element specific topological descriptors

Element specific networks The key to accurate prediction is to engineer ESTDs from corresponding element specific networks (ESNs), on which persistent homology is computed. It is therefore necessary to choose different element combinations in order to characterize the properties of a given molecule.

Filtration matrix The essential ideas of element specific persistent homology (ESPH) and its modified version have been discussed in Sections 2.3 and 2.4. Throughout this study, the definition of the distance between atom i and atom j is consistent with Eq. (2.10). The underlying reason has already been discussed in Section 2.4. Simply speaking, for small molecules, intramolecular effects such as van der Waals interactions play a much more significant role than covalent bonding effects.
This is essentially different from large biomolecules such as proteins.

Topological dimension We also need to consider the dimensions of the topological invariants. For large molecules such as proteins, it is important to compute the persistent homology of the first three dimensions (Betti-0, Betti-1 and Betti-2). The underlying reason is that proteins generally consist of thousands of atoms, and Betti-1 and Betti-2 bars usually contain very rich geometric information, such as internal loops and cavities. However, small molecules are geometrically simple, and their barcodes of higher dimensions are usually very sparse. Additionally, small molecules are chemically complex due to the involvement of many element types and oxidation states. As such, high dimensional barcodes of element specific networks carry little information. Therefore, we only consider Betti-0 bars for small molecule modeling.

4.3.1.1 ESNs for partition coefficient and aqueous solubility prediction

Inspired by classic atom-additive models for partition coefficient prediction, we utilize a total of 61 basic element types calculated by antechamber [137, 138] using the general amber force field (GAFF) [137]. Atoms of a given atom type and their appropriate combinations are selected to construct Vietoris-Rips complexes, and ESTDs are subsequently calculated.

Table 4.6: ESNs for partition coefficient and aqueous solubility prediction

Network type    Element specific networks
Single-element  {a_i}, where a_i ∈ A, A = {GAFF61}
Two-element     {b_i, b_j}, where b_i, b_j ∈ B, i, j ∈ {1 . . . 3}, and i < j. Here B = {C, N, O}.

Remark For the two-element type ESNs, we choose C, N, O for two reasons. The first is that rare elements are already included in the single-element group. The second is based on a statistical point of view. Specifically, we performed a statistical analysis of the data set, and the occurrences of different element types are shown in Table 4.7.

Table 4.7: Statistics of element occurrences for the partition coefficient training set

C     H     O     N     Cl    F     S    Br   P    I
8198  8172  6612  6212  1361  1360  672  308  188  115

4.3.1.2 ESNs for toxicity endpoint prediction

The ESTD construction for toxicity endpoint prediction is very similar. It may not make sense to continue to use the atom types generated by the GAFF force field. Instead, we focus on intra-molecular interactions over a wider range of element types. It is reasonable to assume that rare elements (such as Br or I) are more capable of indicating a higher level of toxicity than more frequently appearing elements (such as C or N). Thus these elements are considered when calculating ESTDs. Different combinations of ESNs were tested, and we propose the following ESTDs for toxicity endpoint prediction, although our search may not be exhaustive. A list of the ESNs used can be found in Table 4.8.

Table 4.8: ESNs for toxicity endpoint prediction

Network type    Element specific networks
Single-element  {a_i}, where a_i ∈ A, A = {H, C, N, O}
Two-element     {b_i, c_j}, where b_i ∈ B, c_j ∈ C, i ∈ {1 . . . 3}, j ∈ {1 . . . 9}, and i < j. Here B = {C, N, O} and C = {C, N, O, F, P, S, Cl, Br, I}.

4.3.1.3 A general workflow for computing ESTDs from ESNs

A general workflow for computing ESTDs from ESNs can be summarized as follows; a small code sketch is given after this list.

1. Given an ESN, the 3D coordinates of the atoms in the ESN are selected, and their Vietoris-Rips complexes are constructed. Note that the distance defined in Eq. (2.10) is used for persistent homology barcode generation in this study.

2. The maximum filtration size is set to a large number (12 Å for small molecules). After the barcodes are obtained, they are divided into several subintervals so that intra-molecular interactions of different strengths can be captured. For instance, ESTDs can be calculated based on the barcodes of the first 10 small subintervals Int_i = [0.5i, 0.5(i + 1)], i = 0, ..., 9.

• Within each Int_i, collect the Betti-0 bars whose birth times fall within this interval and the Betti-0 bars that die within Int_i, and denote these two sets of Betti-0 bars as Sbirth_i and Sdeath_i, respectively.

• Count the numbers of Betti-0 bars within Sbirth_i and Sdeath_i; these two counts yield 2 ESTDs for the interval Int_i.

3. In addition to the interval-wise descriptors, we also consider global ESTDs for the entire barcode. All Betti-0 bars' birth times and death times are collected and added into Sbirth and Sdeath, respectively. The maximum, minimum, mean and sum of each set of values are then computed as ESTDs. This step gives 8 more ESTDs.
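The sketch below illustrates steps 2 and 3 under the assumption that the Betti-0 barcode is already available as a list of (birth, death) pairs, for example from a persistent homology package such as GUDHI or Ripser; the example barcode is hypothetical.

import numpy as np

def betti0_estds(bars, bin_width=0.5, n_bins=10):
    """Interval-wise and global ESTDs from a Betti-0 barcode.

    bars : list of (birth, death) pairs in Angstroms
    """
    births = np.array([b for b, d in bars])
    deaths = np.array([d for b, d in bars])
    estds = []
    # Step 2: counts of births and deaths in Int_i = [0.5i, 0.5(i+1))
    for i in range(n_bins):
        lo, hi = i * bin_width, (i + 1) * bin_width
        estds.append(int(((births >= lo) & (births < hi)).sum()))
        estds.append(int(((deaths >= lo) & (deaths < hi)).sum()))
    # Step 3: global statistics of all birth and death values
    for values in (births, deaths):
        estds.extend([values.max(), values.min(), values.mean(), values.sum()])
    return estds   # 2 * n_bins + 8 descriptors

# Hypothetical barcode of one two-element ESN
bars = [(0.0, 1.4), (0.0, 2.9), (1.2, 3.5)]
print(len(betti0_estds(bars)), betti0_estds(bars)[:6])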
4.3.1.4 The essence of ESTDs

To summarize, we would like to emphasize the essential ideas behind our choice of ESTDs. In Step 2 of the ESTD generation process in Section 4.3.1.3, we collect all birth and death times of Betti-0 bars in order to capture intra-molecular effects such as hydrogen bonding and van der Waals interactions. These intra-molecular interactions are captured by eliminating the topological connectivity of covalent bonds. For instance, a birth position can signal the formation of a hydrogen bond, and the death position represents the disappearance of such an effect, which in turn reflects the strength of the effect. In Step 3 of the above process, we consider all potential element-specific intra-molecular effects together and use statistics of these effects as global descriptors for a given molecule. This helps us to better characterize small molecules.

4.3.1.5 ESTDs for partition coefficient and aqueous solubility prediction

Notice that the first 10 subintervals of length 0.5 Å of the Betti-0 barcodes are used for ESTD computation. Mathematically, these subintervals can be denoted as Int_i = [0.5i, 0.5(i + 1)], i = 0, ..., 9. We further denote the sets that contain all birth or death values of Betti-0 bars as Sbirth and Sdeath, respectively. The word "Statistics" in Table 4.9 refers to the maximum, minimum, mean and sum of the values in Sbirth and Sdeath.

Table 4.9: ESTDs for partition coefficient and aqueous solubility prediction

Element specific networks: {a_i}, where a_i ∈ {GAFF61}
ESTDs: Counts of Betti-0 bars for each of the 61 atom types

Element specific networks: {b_i, b_j}, where b_i, b_j ∈ {C, N, O} and b_i ≠ b_j
ESTDs: 1. Counts of Betti-0 bars with birth or death values falling within each Int_i, i = 0, ..., 9. 2. Statistics of the birth or death values of all Betti-0 bars (i.e., of Sbirth and Sdeath)

As Table 4.9 suggests, we have 1 ESTD from each single-element ESN (the count of Betti-0 bars). Meanwhile, 28 ESTDs are generated for each two-element ESN. Specifically, we have 2 ESTDs (birth and death counts) for each Int_i, and 4 global ESTDs for each of Sbirth and Sdeath. Thus in total we have 145 (61 + 28*3) ESTDs for each molecule.

Remark To see how different bin sizes affect the prediction accuracy, we also choose a slightly larger interval size to evaluate the ESTDs' predictive power. Simply put, an interval size of 1 Å is used to construct ESTDs, which results in 5 subintervals. We also want to include the effects of Betti-1 bars.
Such a construction results in 121 ESTDs (61 + 2*2*5*3) for each molecule, as we consider both birth and death values of the barcodes with 3 ESNs, 5 subintervals and 2 different topological dimensions. It turns out that this set of ESTDs performs very well for some specific aqueous solubility data sets (the Small Delaney set and the Huuskonen set).

4.3.1.6 ESTDs for toxicity endpoint prediction

ESTDs for toxicity endpoint prediction are constructed in a very similar way. The only difference is that for single-element ESNs, we no longer consider GAFF atom types. Instead, we calculate the same set of ESTDs in the way discussed in Section 4.3.1.5, which results in 28 ESTDs for each ESN. For the two-element type, we also consider more combinations than for solubility prediction. The reason is that it is necessary to include more element types, as they have not yet been included in the single-element group. Since we have 25 ESNs in Table 4.8, we have a total of 700 (25*28) ESTDs for each molecule.

4.3.2 Physical model based descriptors

In addition to the ESTDs discussed above, we are also interested in constructing a set of microscopic features based on physical models to describe molecular toxicity. This set of features should be convenient to use in different machine learning approaches, including deep learning and non deep learning, and single-task and multi-task ones. To make our feature generation feasible and robust for all compounds, we consider three types of basic physical information, i.e., atomic charges computed from quantum mechanics or molecular force fields, atomic surface areas calculated for the solvent excluded surface definition, and atomic electrostatic solvation free energies estimated from the Poisson model. The optimized 3D structures obtained in Section 4.2 are used to compute atomic properties, which can be summarized in several steps:

1. Charge Optimized 3D structures are fed into antechamber [138], using the parametrization: AM1-BCC charge, Amber mbondi2 radii and the general Amber force field (GAFF) [137]. This step leads to pqr files with the corresponding charge assignments.

2. Surface The ESES online server at http://weilab.math.msu.edu/ESES/ [139] is used to compute the atomic surface areas of each molecule, using the pqr files from the previous step. This step also yields the molecular solvent excluded surface information.

3. Energy The MIBPB online server at http://weilab.math.msu.edu/MIBPB/ [140] is used to calculate the atomic electrostatic solvation free energy of each molecule, using the surface and pqr files from the previous steps.

The physical descriptors come from Steps 1, 2 and 3 above. To make our method scalable and applicable to all kinds of molecules, we manually construct element-specific molecular descriptors so that they do not depend on atomic positions or the number of atoms. The essential idea of this construction is to derive atomic properties for each element type, which is very similar to the idea of ESPH. We consider 10 commonly occurring element types, i.e., H, C, N, O, F, P, S, Cl, Br, and I, and three different types of descriptors – charge, surface area and electrostatic solvation free energy. Given an element type and a descriptor type, we compute the statistics of the quantities obtained from the aforementioned physical model calculations, i.e., the summation, maximum, minimum, mean and variance, giving rise to 5 physical descriptors. To capture the absolute strengths of each element descriptor, we further generate 5 more physical descriptors after taking the absolute values of the same quantities. Consequently, we have a total of 10 physical descriptors for each given element type and descriptor type. Thus 300 (10 descriptors × 10 element types × 3 descriptor types) molecular descriptors can be generated at the element type level. Additionally, when all atoms are included in the computation, 10 more physical descriptors can be constructed in a similar way (5 statistical quantities of the original values, and another 5 of the absolute values) for each descriptor type (charge, surface area and electrostatic solvation free energy). This step yields another 30 molecular descriptors. As a result, we organize all of the above information into a 1D descriptor vector with 330 components, which can be directly fed into machine learning algorithms.
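A minimal sketch of this element-specific aggregation is shown below; the per-atom property table is a hypothetical stand-in for the antechamber/ESES/MIBPB outputs, and padding absent elements with zeros is an assumption.

import numpy as np

ELEMENTS = ["H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]

def stats(values):
    # 5 statistics of the raw values plus 5 of their absolute values
    if len(values) == 0:
        return [0.0] * 10   # assumption: absent elements contribute zeros
    v, a = np.asarray(values), np.abs(np.asarray(values))
    return [v.sum(), v.max(), v.min(), v.mean(), v.var(),
            a.sum(), a.max(), a.min(), a.mean(), a.var()]

def physical_descriptors(atom_elements, atom_values):
    """10 descriptors per element type plus 10 over all atoms,
    for one descriptor type (e.g., charge)."""
    desc = []
    for el in ELEMENTS:
        desc += stats([v for e, v in zip(atom_elements, atom_values) if e == el])
    desc += stats(atom_values)   # all-atom descriptors
    return desc                  # 110 values; x3 descriptor types = 330

# Hypothetical per-atom charges of a 5-atom molecule
elements = ["C", "C", "O", "H", "H"]
charges = [0.12, -0.05, -0.48, 0.21, 0.20]
print(len(physical_descriptors(elements, charges)))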
4.3.3 Auxiliary molecular descriptors for partition coefficient and aqueous solubility prediction

ChemoPy [136] is a popular software package for computing 2D and 3D molecular descriptors. However, preliminary results have shown that the 3D descriptors produced by ChemoPy do not perform well for our prediction tasks, due to inaccurate generation of 3D structures. Thus in this study, we only incorporate 2D ChemoPy molecular descriptors on top of our self-designed molecular descriptors. The ChemoPy descriptors can be categorized as follows: 30 molecular constitutional descriptors, 35 topological descriptors, 44 molecular connectivity indices, 7 Kappa shape descriptors, 64 Burden descriptors, 245 E-state indices, 21 Basak information indices, 96 autocorrelation descriptors, 6 molecular property descriptors, 25 charge descriptors, and 60 MOE-type descriptors. A more detailed description of the descriptors and the ChemoPy software is available online at https://code.google.com/archive/p/pychem/downloads. Also notice that the ChemoPy software only requires 2D SMILES as input. Specifically for partition coefficient and aqueous solubility prediction, we combine these features with ESTDs to create ESTD+, in order to improve the overall performance. For consistency reasons, only molecules whose descriptors can be calculated by both our ESTD software and the ChemoPy software are used for training. It is worth mentioning that our ESTD approaches are applicable to all molecules, whereas ChemoPy has difficulty in dealing with some molecules. Separate results and discussions for the different sets of descriptors are provided in later chapters.

CHAPTER 5
RESULTS

In this chapter, we first introduce the evaluation metrics used to evaluate model performance. Next, we present our results along with those in the literature to see whether the proposed models truly improve over other models. Specifically, we would like to predict solvation free energy, partition coefficient, aqueous solubility, and various toxicity endpoints. Both the results of different machine learning methods using the same set of molecular descriptors and the results of different molecular descriptors using the same machine learning algorithm are provided, with emphasis on both the machine learning algorithms and the predictive power of the molecular descriptors. Note that we may use different evaluation metrics/descriptors for different tasks. Since solvation free energy is isolated from the other small molecule properties mentioned above, a separate discussion is provided for solvation free energy. On the other hand, partition coefficient and aqueous solubility are closely related, as they both measure how chemicals dissolve in a given solvent, while different toxicity endpoints may share similarities. Thus we learn these properties jointly using a multi-task learning framework, and the same descriptors are used for related tasks.

5.1 Evaluation criteria

5.1.1 Commonly used evaluation metrics

Commonly used evaluation metrics in quantitative prediction are the Pearson correlation coefficient (R), the root mean squared error (RMSE), and the mean unsigned error (MUE). Mathematically, they are defined as follows:

R = \frac{\sum_{i=1}^{N} (X_i^{Pred} - \bar{X}^{Pred})(X_i^{Expl} - \bar{X}^{Expl})}{\sqrt{\sum_{i=1}^{N} (X_i^{Pred} - \bar{X}^{Pred})^2} \sqrt{\sum_{i=1}^{N} (X_i^{Expl} - \bar{X}^{Expl})^2}}, (5.1)

RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( X_i^{Pred} - X_i^{Expl} \right)^2}, (5.2)

MUE = \frac{1}{N} \sum_{i=1}^{N} \left| X_i^{Pred} - X_i^{Expl} \right|, (5.3)

where N is the total number of molecules in the test set, X_i^{Pred} and X_i^{Expl} stand for the predicted and experimental values of the i-th molecule, respectively, and \bar{X}^{Pred} and \bar{X}^{Expl} are the averages of the predicted and experimental values over the entire test set, respectively.
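These three metrics are straightforward to compute; a minimal numpy sketch, with hypothetical prediction and experiment arrays, is given below.

import numpy as np

def pearson_r(pred, expl):
    # Eq. (5.1)
    dp, de = pred - pred.mean(), expl - expl.mean()
    return (dp * de).sum() / np.sqrt((dp ** 2).sum() * (de ** 2).sum())

def rmse(pred, expl):
    return np.sqrt(((pred - expl) ** 2).mean())   # Eq. (5.2)

def mue(pred, expl):
    return np.abs(pred - expl).mean()             # Eq. (5.3)

pred = np.array([-3.1, -5.0, -0.8, -4.2])   # hypothetical predictions
expl = np.array([-3.4, -4.7, -1.1, -4.0])   # hypothetical experiments
print(pearson_r(pred, expl), rmse(pred, expl), mue(pred, expl))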
5.1.2 Additional evaluation metrics for partition coefficient and aqueous solubility prediction

In Tetko's review paper [6] on partition coefficient prediction for the Star and Non-star sets, an additional metric based on the difference between the experimental and predicted log P (∆ log P) was proposed. Specifically, the percentage of predictions within various error ranges is considered:

• If |∆ log P| < 0.5, the prediction is considered "acceptable";

• If 0.5 ≤ |∆ log P| < 1.0, the prediction is considered "disputable";

• If |∆ log P| ≥ 1.0, the prediction is considered "unacceptable".

This metric is exclusively used for the Star and Non-star sets proposed by Tetko [6].

5.1.3 Additional evaluation metrics for toxicity endpoint prediction

In the T.E.S.T. software, the developers referred to the protocol of Golbraikh et al. [141] to determine whether a QSAR model has predictive power:

q^2 > 0.5, (5.4)

R^2 > 0.6, (5.5)

\frac{R^2 - R_0^2}{R^2} < 0.1, (5.6)

0.85 \le k \le 1.15, (5.7)

where q^2 is the squared leave-one-out correlation coefficient for the training set, R^2 is the squared Pearson correlation coefficient between the experimental and predicted toxicities for the test set, and R_0^2 is the squared correlation coefficient between the experimental and predicted toxicities for the test set with the y-intercept set to zero, so that the regression is given by Y = kX. All these metrics are used to compare the performances of toxicity endpoint prediction models.
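A small Python sketch of this predictivity check follows. The exact convention for R_0^2 varies between papers; the version below, which scores the zero-intercept fit against the experimental variance, is one common choice and should be treated as an assumption, as are the example arrays.

import numpy as np

def golbraikh_predictive(q2, pred, expl):
    """Golbraikh-style predictivity check, Eqs. (5.4)-(5.7)."""
    dp, de = pred - pred.mean(), expl - expl.mean()
    R2 = ((dp * de).sum()) ** 2 / ((dp ** 2).sum() * (de ** 2).sum())
    k = (pred * expl).sum() / (pred ** 2).sum()   # slope of Y = kX
    # Assumed R0^2: fit forced through the origin, scored on the test set
    R0_2 = 1.0 - ((expl - k * pred) ** 2).sum() / (de ** 2).sum()
    return (q2 > 0.5 and R2 > 0.6
            and (R2 - R0_2) / R2 < 0.1
            and 0.85 <= k <= 1.15)

pred = np.array([2.1, 3.4, 1.2, 4.8, 2.9])   # hypothetical toxicities
expl = np.array([2.0, 3.1, 1.5, 4.6, 3.2])
print(golbraikh_predictive(q2=0.7, pred=pred, expl=expl))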
5.2 Evaluation results

5.2.1 Solvation free energy prediction

5.2.1.1 Microscopic feature parametrization

In the earlier hybrid physical and knowledge (HPK) model [4], three types of atomic radii (Amber 6, Amber bondi, and Amber mbondi2 [142]) and three types of charge assignments (OpenEye-AM1-BCC v1 parameters [143], Gasteiger [144], and Mulliken [142]) were used to test the sensitivity and accuracy of the models with respect to different parametrizations. For this reason, we continue to use these 9 different parametrizations in order to conveniently compare the current models with previous ones.

5.2.1.2 Polar and non-polar descriptors for solvation free energy

Implicit solvent models divide solvation free energies into polar and nonpolar additive contributions, whereas polar and nonpolar interactions are in fact inseparable and non-additive. Thus, in order to explore how much polar and nonpolar descriptors contribute to the solvation process, we use two different sets of descriptors to predict small molecule solvation free energies.

Descriptor set 1: Polar descriptors with high correlations to solvation free energy. The list of descriptors used can be found in the Appendix.

Descriptor set 2: Non-polar atomic surface area descriptors (part of the physical descriptors discussed in the previous chapter), in addition to the polar descriptors with high correlations to solvation free energy in descriptor set 1.

5.2.1.3 Leave-one-out results

Fig. 5.1 contains the leave-one-out results for the 668 set using the different descriptor sets and the HPK models, with different parametrizations. It is evident that the present model outperforms our earlier HPK model with most parametrizations.

Figure 5.1: Illustration of leave-one-out predictions for the whole set of 668 molecules. Left chart: Correlation between experimental solvation free energies and predictions obtained by BCC charges and Amber MBondi2 using all polar-nonpolar features. Right chart: Comparison of prediction RMSEs obtained by models with polar features and all features against HPK models. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.

Moreover, the leave-one-out method is also used to determine the number of nearest neighbors that should be used to predict a given molecule. For each parametrization, we perform leave-one-out prediction with 1 to 10 nearest neighbors. The best results can be found in Table 5.1. Although complete numerical results are not provided, it is worth mentioning that further increasing the number of descriptors does not necessarily increase the prediction accuracy. For consistency, we select 10 nearest neighbors for all following tests.

Table 5.1: The RMSE and ME of the leave-one-out test in the solvation free energy prediction of 668 molecules with descriptor set 1 (the first position), descriptor set 2 (the second position) and the HPK model (the last position) [4]. All errors are in units of kcal/mol.

Charge     | Amber 6 RMSE       | Amber 6 ME           | Amber Bondi RMSE   | Amber Bondi ME       | Amber MBondi2 RMSE | Amber MBondi2 ME
BCC        | 1.19, 1.07, 1.47   | -0.01, 0.01, -0.13   | 1.23, 1.09, 1.34   | -0.02, 0.01, -0.14   | 1.18, 1.05, 1.33   | 0.00, 0.03, -0.14
Mulliken   | 1.52, 1.28, 1.49   | -0.01, 0.00, -0.20   | 1.52, 1.31, 1.48   | -0.02, 0.00, -0.13   | 1.54, 1.30, 1.49   | 0.00, 0.00, -0.22
Gasteiger  | 1.42, 1.24, 1.65   | -0.03, -0.01, -0.19  | 1.35, 1.24, 1.66   | -0.01, 0.01, -0.21   | 1.33, 1.23, 1.68   | 0.0, 0.01, -0.22

5.2.1.4 Blind prediction of SAMPLx challenge molecules

SAMPL0 test First, let us consider the solvation free energy prediction for the SAMPL0 test set, which contains a total of 17 molecules. All structures of this test set are relatively simple; however, the molecular species of this set are quite diverse. Many researchers have reported their solvation free energy predictions for this challenge set [145, 146]. Prior to our work, the optimal prediction for this test set had an RMSE of 1.34 kcal/mol for the whole set [146]. Figure 5.2 depicts the present results for a total of 9 charge and radius combinations. When the BCC charge is used, the RMSEs of our predictions with all three radius parametrizations are smaller than 0.75 kcal/mol. Our optimal prediction has an RMSE of 0.61 kcal/mol, obtained from the Amber Bondi radius parametrization in conjunction with the BCC charge assignment, with polar features only.
When all polar and nonpolar features are used, the results become slightly worse, whereas the performance across all parametrizations turns out to be more stable, especially when the Mulliken charge assignment is used.

Figure 5.2: Illustration of prediction RMSEs obtained with different molecular parametrizations by the model for the SAMPL0 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.

SAMPL1 test Having demonstrated the superiority of the proposed model for the prediction of the SAMPL0 challenge set, we further consider the SAMPL1 test set, which is generally believed to be the most difficult one, for the following two reasons. First, the molecular structures of this test set are extremely complex compared to other molecules with known experimental solvation free energies. Second, the uncertainty of the SAMPL1 experimental data is very large; for some molecules the uncertainty is as large as 2.0 kcal/mol [147, 14]. Nevertheless, it is highly desirable to develop an accurate modeling paradigm for this test set, because most molecules in this test set are druggable. The best prediction for the whole set has an RMSE of 2.45 kcal/mol [146]. On a subset of the SAMPL1 test set that contains only 56 molecules, the best performance was shown to give an RMSE of 2.4 kcal/mol [14]. Figure 5.3 illustrates the results of our approach for the whole SAMPL1 test set. It is obvious that the present model is much more accurate. The optimal prediction with only polar descriptors has an RMSE as small as 2.07 kcal/mol, and adding nonpolar descriptors further improves the RMSE to 1.86 kcal/mol, which is, to the best of our knowledge, the best result. Additionally, the present model is very robust with respect to changes in force fields: the maximum and minimum prediction RMSEs over the 9 sets of parametrizations and 2 feature combinations are 2.82 and 1.86 kcal/mol, respectively. The difference between the maximum and the minimum is 0.96 kcal/mol, which is much smaller than the experimental uncertainty of 2 kcal/mol for this set [147, 14].

Figure 5.3: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL1 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.

SAMPL2 test Another difficult test set is SAMPL2, which contains a total of 30 molecules [148]. The experimental uncertainty for these molecules is much smaller than that of the SAMPL1 test set. Nevertheless, accurate solvation predictions for this set are rare. Using all-atom molecular dynamics simulations and multiple starting conformations for prediction, Klimovich and Mobley reported an RMSE of 2.82 kcal/mol over the whole set and 1.86 kcal/mol over all the molecules except several hydroxyl-rich compounds [148].
Some of the best reported predictions have an RMSE of 1.59 kcal/mol [146]. In our previous test, the molecule containing an I atom (5-iodouracil) was excluded from all calculations due to the lack of an appropriate charge force field. In this work, we also ignore this molecule for the same reason. The HPK model gives an optimal prediction with an RMSE of 1.96 kcal/mol. However, the RMSEs of its predictions vary over a large range, from 1.96 to 4.86 kcal/mol, when different charge and radius force fields are applied. A bar graph of the RMSEs of the predictions is given in Fig. 5.4. Parametrizations based on the AM1-BCC charge yield the best results among polar features, and adding nonpolar features offers a substantial improvement over polar features alone, as the first three yellow bars are lower than the first three blue bars. The optimal RMSE for the SAMPL2 molecules is 1.64 kcal/mol, obtained with the AM1-BCC charge and AMBER6 radius when all features are used to train the models. It is also worth noting that the variation of the RMSEs under different parametrizations is 1.42 kcal/mol (1.64 to 3.06 kcal/mol), which indicates the robustness of the present models compared to the HPK models.

Figure 5.4: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL2 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.

SAMPL3 test The SAMPL3 test set, which contains 36 molecules, is relatively easy for prediction. The structures of the SAMPL3 molecules are relatively simple, and most molecules in this set are chlorinated hydrocarbons [149]. The best prediction in the literature offers an RMSE of 1.29 kcal/mol [146]. Figure 5.5 depicts the RMSEs of the predictions by only polar features and by all features. Although the optimal result (RMSE of 0.86 kcal/mol) is generated by polar features with Gasteiger charges, the all-features combination turns out to be more stable over all parametrizations, as Figure 5.5 clearly shows. More specifically, the RMSEs using polar features span a small range of 0.48 kcal/mol (i.e., from 0.86 to 1.34 kcal/mol) across all 9 different parametrizations, while all features yield a variation of 0.24 kcal/mol. This further verifies the robustness of the current solvation model.

Figure 5.5: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL3 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.

SAMPL4 test Finally, we consider the SAMPL4 test set, which is a very popular one. Many explicit, implicit, integral equation, and hybrid QM/MM approaches [16] have been applied to this set [150]. As shown in Fig. 5.6, the overall performance is enhanced when all features are used, as the blue bars are consistently higher than the yellow bars, which indicates the predictive power of the nonpolar features. Our proposed model gives an optimal RMSE of 1.14 kcal/mol. It is also easy to see that our current approach is quite robust across different force field and charge parametrizations.

Figure 5.6: Illustration of prediction RMSEs obtained with different molecular parametrizations by the proposed model for the SAMPL4 test set. In the plot, GAS and MUL are abbreviations for Gasteiger and Mulliken charges, respectively.

5.2.1.5 Wang's [1] dataset

Finally, we validate our models on the external data set that Ref. [1] used to develop the weighted solvent accessible surface area approach. It is interesting to compare our results with theirs, since the compounds used for training and testing are essentially independent, which challenges the predictive power of solvation models. With their Model III, they achieved unsigned average errors of 0.50 and 0.66 kcal/mol for the training set and the test set, respectively [1], and an unsigned average error of 0.538 kcal/mol for the entire set.
Our models were also trained with 10 nearest neighbors and all polar/nonpolar descriptors; the average of 50 independent runs yields unsigned average errors of 0.00 and 0.57 kcal/mol for the training set and the test set, respectively, and an unsigned average error of 0.441 kcal/mol for the entire set. In fact, we used a slightly smaller training set of 289 molecules, because 4 molecules in the training set have ambiguous chemical names in the PubChem database, while all molecules in the test set are included in our prediction. Since there are 11 duplicates between the training set and the test set of Ref. [1] (see Table 4.2), we also train models with these 11 duplicated molecules excluded from the training set. The updated model achieves an RMSE of 0.61 kcal/mol for the test set, which is still smaller than that reported in Ref. [1] (i.e., 0.66 kcal/mol). Therefore, we conclude that our models have a competitive edge over the classic solvation model based on weighted SASA.

5.2.2 Partition coefficient and aqueous solubility prediction

In this section, we present the results of the proposed ESPH methods in conjunction with gradient boosting decision trees and multi-task deep neural networks for a variety of data sets, including partition coefficient and solubility test sets. Unless otherwise stated, different tasks are trained together in the same network. In addition, we introduce some notation for easier reference. ESTD-1 contains the 61 Betti-0 bar-based ESTDs, ESTD-2 contains all ESTDs listed in Table 4.9 (a total of 145 ESTDs), and ESTD-3 corresponds to the ESTDs discussed in the Remark of Section 4.3.1.5. When the molecular descriptors calculated by the ChemoPy software [136] are combined with ESTD-1, ESTD-2 and ESTD-3, respectively, three more descriptor sets are created, denoted ESTD+-1, ESTD+-2 and ESTD+-3, respectively.

5.2.2.1 log P training set cross-validation

In order to get an idea of how our topological representation works for partition coefficient, a 10-fold cross-validation is performed using the baseline method GBDT. Note that 50 runs were performed to achieve the final results, as randomness is involved; the results are summarized in Table 5.2. Notice that ChemoPy descriptors were not used here.

Table 5.2: Results of 10-fold cross validation on the partition coefficient training set, N = 8199.

Method          R2      RMSE   MUE
GBDT-ESTD-2     0.923   0.45   0.32
GBDT-ESTD-1     0.912   0.48   0.35
XLOGP3-AA [5]   0.904   0.50   0.39

It can be seen that our descriptors perform better than the XLOGP3 software [5] given the same training data, which demonstrates great predictive power. In addition, we also provide the fluctuations of the fifty 10-fold cross-validation runs in Table 5.3, even though such statistics are not available for the XLOGP3 software. It is clear that our ESTDs give quite consistent predictions and that their performance is independent of the random fold generation, since the RMSDs of R2, RMSE and MUE show essentially no fluctuation. Thus it is very interesting to see the performance of our MT-DNN compared to XLOGP3 and GBDT.

Table 5.3: Performances and fluctuations of fifty 10-fold cross-validation test runs.

Method        Mean R2 (RMSD)   Mean RMSE (RMSD)   Mean MUE (RMSD)
GBDT-ESTD-2   0.923 (0.001)    0.45 (0.003)       0.32 (0.002)
GBDT-ESTD-1   0.912 (0.001)    0.48 (0.002)       0.35 (0.001)

5.2.2.2 FDA set

The first test set to which we apply our model is the FDA test set. A molecule that contains Hg was dropped due to the difficulty of computation.
5.2.2.2 FDA set

The first test set to which we apply our model is the FDA test set. A molecule that contains Hg was dropped due to the difficulty of computation. A major challenge of this set is that its structures are more complex than those of the training set, and the partition coefficient range spans nearly 12 log units. A series of prediction methods [5], including our multi-task neural networks, are applied to this set, and their results are summarized in Table 5.4 for comparison with ours.

Table 5.4: Results of different logP prediction methods on 406 FDA-approved drugs [5], ranked by R2. Two molecules were dropped for our model evaluation due to feature generation failure of ChemoPy.
Method MT-ESTD-1 MT-ESTD-2-AD GBDT-ESTD+-1 GBDT-ESTD+-2 MT-ESTD+-1 MT-ESTD+-2 R2 GBDT-ESTD+-2-AD 0.935 GBDT-ESTD+-1-AD 0.932 0.930 MT-ESTD+-1-AD 0.929 MT-ESTD-1-AD 0.928 MT-ESTD+-2-AD 0.920 0.912 0.910 0.910 0.909 0.909 0.908 0.900 0.893 0.891 0.883 0.872 0.848 0.847 0.838 0.815 0.80 0.771 0.491 CLOGP TOPKAT ALOGP98 KowWIN GBDT-ESTD-2 XLOGP3-AA ALOGPS GBDT-ESTD+-1 GBDT-ESTD-1 MT-ESTD-2 GBDT-ESTD+-2 XLOGP3 HINT RMSE MUE 0.24 0.51 0.23 0.52 0.53 0.22 0.26 0.54 0.27 0.53 0.28 0.57 0.59 0.37 0.28 0.60 0.30 0.60 0.60 0.27 0.34 0.60 0.42 0.60 0.39 0.63 0.66 0.41 0.44 0.66 0.49 0.68 0.72 0.51 0.57 0.78 0.57 0.80 0.88 0.51 0.56 0.88 0.64 0.90 0.63 1.10 1.93 1.30

As we can see from Table 5.4, our multi-task model gives the best prediction in terms of R2, RMSE, and MUE. Specifically, the small MUE of our model indicates that our predictions are less biased than those of the other methods tested, except for some outliers. Also note that the training set is completely independent of the test set, which shows the applicability of our multi-task architecture. We also build models with the same architecture in which additional molecules gathered from the NIH database are included, since there is no guarantee that the training set of ALOGPS is completely independent of the test set. It turns out that the accuracy can be greatly improved; for instance, the performance of ESTD+-1 improves by more than 10% in terms of RMSE (0.60 log units to 0.53 log units). This demonstrates the potential of our MT-DNN architecture as more data become available, and it is discussed more carefully in a later section.

5.2.2.3 Star set and non-star set

The star set and non-star set were proposed by Tetko [6] as two benchmark sets for evaluating partition coefficient models, and over 20 different models have been tested on them. It should be emphasized that for these sets, different models are trained on different training sets whose overlap with the test sets is unknown. Thus it makes sense to merge our 8199-molecule training set with additional molecules from the NIH database and see how additional training data benefit the overall performance. Results of different models on these two sets can be found in Table 5.5; models trained with additional data from the NIH database are labeled with (-AD). For the star set, we achieve an RMSE of 0.49 log units, on par with popular commercial software packages such as ACD/logP and CLOGP, in addition to a high acceptable prediction percentage (77%, rank 2). For the non-star set, most methods do not give accurate predictions, as the structures in this set are relatively new and complex. Our 51% acceptable rate ranks number 2 among all predictors, though the RMSE is relatively high due to a few large outliers.
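The "acceptable prediction percentage" quoted above is simply the fraction of molecules predicted within 0.5 log unit of experiment, as in Ref. [6]. A small helper that computes the statistics reported in Table 5.5 might look as follows; the function name and the toy values are illustrative.

```python
# RMSE plus the percentage of molecules within 0.5 and 1.0 log units of
# experiment, the error-range statistics used in the star/non-star benchmark.
import numpy as np

def logp_benchmark_stats(y_true, y_pred):
    err = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return {
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "% <= 0.5": 100.0 * np.mean(err <= 0.5),     # "acceptable" predictions
        "% 0.5-1": 100.0 * np.mean((err > 0.5) & (err <= 1.0)),
        "% > 1": 100.0 * np.mean(err > 1.0),
    }

print(logp_benchmark_stats([1.2, 3.4, -0.5, 2.0], [1.0, 2.6, -0.1, 3.2]))
```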
The results are satisfactory, especially considering that commercial software packages generally use much larger training sets than ours. In general, when more molecules overlapping the test set are present in the training set, the test results improve significantly. Thus, as a baseline comparison, it is more meaningful to compare our results with the XLOGP3 software. As Table 5.5 indicates, our MT-ESTD+ models achieve a substantial improvement over XLOGP3 for the star set, while XLOGP3 achieves a lower RMSE for the non-star set, possibly owing to XLOGP3's correction terms for relatively new structures. The performances of our MT-ESTD+ models suggest that our models are able to predict log P accurately. We would also like to know whether the predictive power can be further improved once more molecules are incorporated into the training set. Thus we extend our original 8199-molecule training set with additional molecules from the NIH database; as shown in Table 5.5, the descriptor sets labeled with (-AD) generally offer better performances on both the star set and the non-star set.

Table 5.5: Benchmark test results [6] on both the star and non-star sets. For each set, the percentage columns give the percentage of molecules predicted within 0.5 log units, between 0.5 and 1 log unit, and above 1 log unit of experiment.

                      Star set (N = 223)         Non-star set (N = 43)
Method                RMSE  <=0.5  0.5-1  >1     RMSE  <=0.5  0.5-1  >1
AB/LogP               0.41  84     12     4      1.00  42     23     35
S+logP                0.45  76     22     3      0.87  40     35     26
MT-ESTD+-1-AD         0.49  77     16     7      0.98  49     19     33
MT-ESTD+-2            0.49  74     21     5      0.97  49     23     28
MT-ESTD+-2-AD         0.50  76     17     7      0.94  51     19     30
ACD/logP              0.50  75     17     7      1.00  44     32     23
GBDT-ESTD+-1-AD       0.51  76     17     6      1.03  44     30     25
GBDT-ESTD+-2-AD       0.51  75     17     7      1.04  41     30     27
CLOGP                 0.52  74     20     6      0.91  47     28     26
VLOGP OPS             0.52  64     21     7      1.07  33     28     26
ALOGPS                0.53  71     23     6      0.82  42     30     28
MT-ESTD+-1            0.53  75     17     8      0.97  47     28     26
MT-ESTD-1-AD          0.53  73     18     9      1.00  37     30     33
MT-ESTD-2-AD          0.53  71     19     9      1.01  47     19     35
MT-ESTD-1             0.55  72     18     10     1.01  33     28     40
MT-ESTD-2             0.56  66     23     11     1.06  35     33     33
MiLogP                0.57  69     22     9      0.86  49     30     21
GBDT-ESTD+-2          0.58  75     16     8      1.06  44     25     30
GBDT-ESTD+-1          0.60  74     15     9      1.02  46     23     30
XLOGP3                0.62  60     30     10     0.89  47     23     30
KowWIN                0.64  68     21     11     1.05  40     30     30
GBDT-ESTD-2-AD        0.65  62     26     11     1.15  46     16     37
CSLogP                0.65  66     22     12     0.93  58     19     23
GBDT-ESTD-1-AD        0.68  71     16     12     1.16  41     11     46
ALOGP                 0.69  60     25     16     0.92  28     40     33
MolLogP               0.69  61     25     14     0.93  40     25     26
ALOGP98               0.70  61     26     13     1.00  30     37     33
GBDT-ESTD-1           0.71  63     22     13     1.07  34     20     44
OsirisP               0.71  59     26     16     0.94  42     26     33
VLOGP                 0.72  65     22     14     1.13  40     28     33
GBDT-ESTD-2           0.73  52     30     17     1.23  44     16     39
TLOGP                 0.74  67     16     13     1.12  30     37     30
ABSOLV                0.75  53     30     17     1.02  49     28     23
QikProp               0.77  53     30     17     1.24  40     26     35
QuantlogP             0.80  47     30     22     1.17  35     26     40
SLIPPER-2002          0.80  62     22     15     1.16  35     23     42
COSMOFrag             0.84  48     26     19     1.23  26     40     23
XLOGP2                0.87  57     22     20     1.16  35     23     42
QLOGP                 0.96  48     26     25     1.42  21     26     53
VEGA                  1.04  47     27     26     1.24  28     30     42
CLIP                  1.05  41     25     30     1.54  33     9      49
LSER                  1.07  44     26     30     1.26  35     16     49
MLOGP(Sim+)           1.26  38     30     33     1.56  26     28     47
NC+NHET               1.35  29     26     45     1.71  19     16     65
SPARC                 1.36  45     22     32     1.70  28     21     49
HINTLOGP              1.80  34     22     44     2.72  30     5      65

Remark for aqueous solubility prediction To evaluate the performance of our models for aqueous solubility prediction, several datasets derived from Wang et al. [2] and Hou et al. [3] are used. For leave-one-out validation, only the baseline method is used. For 10-fold cross-validation with the MT-DNN architecture, the 9 remaining folds are trained together with the partition coefficient training set when evaluating the held-out fold.
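The joint training just described relies on a network with shared hidden layers and one output head per property. The following is a minimal sketch, assuming TensorFlow/Keras; the layer sizes, optimizer, and the label-masking trick are illustrative rather than the exact settings of Section 3.3.

```python
# Multi-task network sketch: a shared trunk over ESTD inputs with separate
# log P and log S heads. Masked sample weights let molecules that lack one of
# the two labels be stacked into a single joint training set.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

n_desc = 145                                   # e.g. the ESTD-2 descriptor length
inputs = keras.Input(shape=(n_desc,))
h = layers.Dense(1000, activation="relu")(inputs)
h = layers.Dense(500, activation="relu")(h)
out_logp = layers.Dense(1, name="logP")(h)     # partition coefficient head
out_logs = layers.Dense(1, name="logS")(h)     # aqueous solubility head
model = keras.Model(inputs, [out_logp, out_logs])
model.compile(optimizer="adam", loss="mse")

X = np.random.normal(size=(100, n_desc))       # toy stand-in for ESTD vectors
y_logp = np.random.normal(size=(100, 1))
y_logs = np.random.normal(size=(100, 1))
w_logp = np.ones(100)
w_logs = np.ones(100)
w_logs[:60] = 0.0                              # first 60 molecules lack log S labels
model.fit(X, {"logP": y_logp, "logS": y_logs},
          sample_weight={"logP": w_logp, "logS": w_logs},
          epochs=2, verbose=0)
```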
5.2.2.4 Wang's 1708 set in ref. [2]

For this dataset, both leave-one-out and 10-fold cross-validation are carried out in order to evaluate the performance of our models.

Leave-one-out As MT-DNN requires substantial computational resources, only the baseline method GBDT is used for leave-one-out prediction. We use 4000 trees and a 0.10 learning rate as training parameters, and the results in Table 5.6 are obtained.

Table 5.6: Leave-one-out test on the 1708 solubility data set.

Method             R2     RMSE   MUE
GBDT-ESTD+-1-AD    0.931  0.543  0.389
GBDT-ESTD+-2-AD    0.929  0.551  0.389
GBDT-ESTD+-2       0.910  0.621  0.457
ASMS-LOGP [2]      0.897  0.664  0.505
GBDT-ESTD+-1       0.893  0.683  0.494
ASMS [2]           0.884  0.707  0.547

10-fold cross-validation As MT-DNN and the baseline GBDT involve randomness, we run each of them 50 times and report the mean performance for all metrics; the results are summarized in Table 5.7. It is observed that our models yield more accurate and robust predictions than the ASMS and ASMS-LOGP models do, improving the R2 from 0.884 to 0.925. Additionally, we notice that the MT-ESTD models generally improve over the GBDT models, though not as significantly as in the partition coefficient prediction above.

Table 5.7: 10-fold cross-validation on the 1708 solubility data set.

Method          Mean R2 (RMSD)   Mean RMSE (RMSD)   Mean MUE (RMSD)
MT-ESTD+-1      0.925 (0.001)    0.568 (0.005)      0.393 (0.003)
MT-ESTD+-2      0.924 (0.003)    0.571 (0.010)      0.395 (0.004)
GBDT-ESTD+-1    0.924 (0.002)    0.572 (0.006)      0.408 (0.005)
GBDT-ESTD+-2    0.923 (0.002)    0.571 (0.006)      0.408 (0.005)
MT-ESTD-1       0.908 (0.002)    0.630 (0.005)      0.466 (0.003)
GBDT-ESTD-2     0.904 (0.002)    0.642 (0.008)      0.469 (0.005)
MT-ESTD-2       0.902 (0.002)    0.649 (0.007)      0.466 (0.005)
GBDT-ESTD-1     0.889 (0.003)    0.697 (0.009)      0.502 (0.005)
ASMS [2]        0.884 (0.021)    0.699 (0.054)      0.527 (0.034)
ASMS-LOGP [2]   0.869 (0.022)    0.742 (0.053)      0.570 (0.034)

5.2.2.5 Dataset in ref. [3]

We test our models on the dataset proposed by Hou et al. [3], where training and test sets were predefined to cover a variety of molecules. Klopman's test set contains 21 commonly used compounds of pharmaceutical and environmental interest [151]; models for it are trained on the original 1290 molecules. Zhu's test set contains 120 molecules that were used to develop Klopman and Zhu's group contribution model [152]. As Hou et al. [3] suggested, we remove 83 molecules that overlap with Zhu's test set from the training set to make the predictions independent and unbiased; this reduces the size of the training set for Zhu's test set to 1207.

Klopman's test set Table 5.8 shows the performances of different models on Klopman's test set. Our MT-ESTD models perform similarly to the Drug-LOGS method while improving on Klopman and Zhu's MLR method [152]. It is also evident that the MT-DNN method has an edge over the GBDT method, which is consistent with our previous experiments.

Zhu's test set The results for Zhu's test set are summarized in Table 5.9. For this dataset, our MT-ESTD models give satisfactory results, with a high Pearson correlation over 0.97 across all ESTD combinations. Such results indicate that our methods are applicable to a wide variety of molecules. Again, the MT-DNN method outperforms the GBDT method.

Table 5.8: Results on Klopman's test set [3], where MUE was not reported.

Method              R      RMSE
MT-ESTD+-1          0.94   0.69
Drug-LOGS [3]       0.94   0.64
GBDT+-2             0.94   0.71
MT-ESTD+-2          0.93   0.75
GBDT+-1             0.93   0.76
MT-ESTD-2           0.92   0.79
GBDT-2              0.92   0.86
MT-ESTD-1           0.92   0.85
GBDT-1              0.91   0.82
Klopman MLR [152]   0.84   1.07

Table 5.9: Results of Zhu's test set.
Method                     R      RMSE   MUE
MT-ESTD+-1                 0.97   0.65   0.47
MT-ESTD+-2                 0.97   0.67   0.48
MT-ESTD-1                  0.97   0.70   0.50
MT-ESTD-2                  0.97   0.71   0.53
GBDT+-2                    0.97   0.73   0.50
GBDT+-1                    0.96   0.76   0.52
Drug-LOGS [3]              0.96   0.79   0.57
GBDT-2                     0.96   0.79   0.60
GBDT-1                     0.96   0.82   0.58
Group contribution [152]   0.96   0.84   0.70

Small Delaney set The small Delaney set has been extensively tested with different approaches, such as ESOL [129] and GSE [153]. Table 5.10 shows the 10-fold cross-validation results of the MT-DNN-ESTD+ model and other methods. Overall, MT-DNN-ChemoPy and MT-DNN-ESTD+ give very similar results in terms of R2, RMSE and MUE, which demonstrates the predictive power of our MTL framework. In addition, it is encouraging that the MT-DNN-ESTD model, which uses purely ESTDs as input, slightly improves on the 2D kernel model [154]. By comparison, our baseline method (random forest) underperforms; the substantial accuracy improvement of the MT-DNN architecture over RF indicates that our MT-DNN models benefit from MTL and that there potentially exists a shared underlying feature representation for partition coefficient and aqueous solubility.

Table 5.10: 10-fold cross-validation results on the small Delaney set.

Method            R2     RMSE   MUE
MT-ESTD+-3        0.93   0.54   0.37
MT-ESTD-3         0.92   0.61   0.43
RF-ESTD+-3        0.91   0.63   0.45
RF-ESTD-3         0.88   0.71   0.52
GSE [153]         -      0.47   0.44
2D kernel [154]   0.91   0.61   -

Huuskonen set The Huuskonen set is also a popular solubility set. Similarly to the small Delaney set, a direct 10-fold cross-validation yields the results listed in Table 5.11. Again the results of MT-DNN turn out to be the best in terms of all metrics. When only the 121 ESTDs are used, the results of MT-DNN become slightly worse but still better than those of the RBF kernel approach and the random forest models.

Table 5.11: 10-fold cross-validation results on the Huuskonen set.

Method        R2     RMSE   MUE
MT-ESTD+-3    0.93   0.55   0.39
MT-ESTD-3     0.91   0.60   0.43
RF-ESTD+-3    0.91   0.61   0.45
RF-ESTD-3     0.89   0.69   0.51
RBF [155]     0.90   -      -

Remark We find that ESTD-1 and ESTD-2 do not perform as well as ESTD-3 for the Delaney and Huuskonen sets, so their results are omitted. For similar reasons, ESTD-3's results are not included for the other sets.

5.2.3 Toxicity endpoint prediction

5.2.3.1 Fathead minnow LC50 test set

The fathead minnow LC50 set was randomly divided into a training set (80% of the entire set) and a test set (20% of the entire set) [133], based on which a variety of TEST models were built. Table 5.12 shows the performances of five T.E.S.T. models, the TEST consensus obtained as the average of all independent TEST predictions, the four proposed methods, and consensus results obtained by averaging over the present RF, GBDT, ST-DNN and MT-DNN predictions. TEST consensus gives the best prediction [133] among TEST results, reporting a correlation coefficient of 0.728 and an RMSE of 0.768 log(mol/L). As Table 5.12 indicates, our MT-DNN model outperforms TEST consensus in terms of both R2 and RMSE with only ESTDs as input. When physical descriptors are used independently or combined with ESTDs, the prediction accuracy is further improved, with R2 of 0.771 and RMSE of 0.705 log(mol/L). The best result is generated by the consensus method using all descriptors, with R2 of 0.789 and RMSE of 0.677 log(mol/L).
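The consensus predictor referred to throughout this section is a plain average of the individual model predictions. A minimal sketch follows; to keep it self-contained, the two neural networks are replaced here by a second tree ensemble, whereas the thesis averages RF, GBDT, ST-DNN and MT-DNN.

```python
# Consensus prediction: train several learners on the same descriptors and
# average their outputs, which tends to cancel algorithm-specific errors.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

def consensus_predict(models, X_train, y_train, X_test):
    preds = [m.fit(X_train, y_train).predict(X_test) for m in models]
    return np.mean(preds, axis=0)          # simple average across learners

rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 20)), rng.normal(size=200)
X_test = rng.normal(size=(50, 20))
models = [RandomForestRegressor(random_state=0),
          GradientBoostingRegressor(random_state=0)]
y_consensus = consensus_predict(models, X, y, X_test)
```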
Table 5.12: Comparison of prediction results for the fathead minnow LC50 test set.

Method                     R2     (R2-R02)/R2   k      RMSE   MAE    Coverage
Hierarchical [133]         0.710  0.075         0.966  0.801  0.574  0.951
Single Model [133]         0.704  0.134         0.960  0.803  0.605  0.945
FDA [133]                  0.626  0.113         0.985  0.915  0.656  0.945
Group contribution [133]   0.686  0.123         0.949  0.810  0.578  0.872
Nearest neighbor [133]     0.667  0.080         1.001  0.876  0.649  0.939
TEST consensus [133]       0.728  0.121         0.969  0.768  0.545  0.951

Results with only auxiliary molecular descriptors
RF                         0.744  0.467         0.947  0.784  0.560  1.000
GBDT                       0.750  0.148         0.962  0.736  0.511  1.000
ST-DNN                     0.598  0.044         0.982  0.959  0.648  1.000
MT-DNN                     0.771  0.003         1.010  0.705  0.472  1.000
Consensus                  0.787  0.105         0.963  0.679  0.464  1.000

Results with ESTDs
RF                         0.661  0.364         0.946  0.858  0.638  1.000
GBDT                       0.672  0.103         0.958  0.857  0.612  1.000
ST-DNN                     0.675  0.031         0.995  0.862  0.601  1.000
MT-DNN                     0.738  0.012         1.015  0.763  0.514  1.000
Consensus                  0.740  0.087         0.956  0.755  0.518  1.000

Results with all descriptors
RF                         0.727  0.322         0.948  0.782  0.564  1.000
GBDT                       0.761  0.102         0.959  0.719  0.496  1.000
ST-DNN                     0.692  0.010         0.997  0.822  0.568  1.000
MT-DNN                     0.769  0.009         1.014  0.716  0.466  1.000
Consensus                  0.789  0.076         0.959  0.677  0.446  1.000

5.2.3.2 Daphnia magna LC50 test set

The Daphnia magna LC50 set is the smallest in terms of set size, with 283 training molecules and 70 test molecules. It nevertheless brings difficulties in building robust QSAR models, given the relatively large number of descriptors. Indeed, the five independent models in the TEST software give significantly different predictions, with the RMSEs shown in Table 5.13 ranging from 0.810 to 1.190 log units. Although the RMSE of Group contribution is the smallest, its coverage is only 0.657 (65.7%), which largely restricts the method's applicability; moreover, its R2 value is inconsistent with its RMSE and MAE. Since Ref. [133] states that "The consensus method achieved the best results in terms of both prediction accuracy and coverage", these unusually low RMSE and MAE values might be typos.

We also notice that our non-multitask models that contain ESTDs result in very large deviations from experimental values. Indeed, the overfitting issue challenges traditional machine learning approaches, especially when the number of samples is smaller than the number of descriptors. The advantage of the MT-DNN model is that it extracts information from related tasks, and our numerical results show that the predictions do benefit from the MTL architecture. For models using ESTDs, physical descriptors and all descriptors, the R2 is improved from around 0.5 to 0.788, 0.705, and 0.726, respectively. It is worth mentioning that our ESTDs yield the best results here, which demonstrates the power of persistent homology. This suggests that by learning related problems jointly and extracting shared information from different data sets, the MT-DNN architecture can perform multiple prediction tasks simultaneously and enhance performance, especially on small data sets.

Table 5.13: Comparison of prediction results for the Daphnia magna LC50 test set.
Method                     R2     (R2-R02)/R2   k      RMSE    MAE     Coverage
Hierarchical [133]         0.695  0.151         0.981  0.979   0.757   0.886
Single Model [133]         0.697  0.152         1.002  0.993   0.772   0.871
FDA [133]                  0.565  0.257         0.987  1.190   0.909   0.900
Group contribution [133]   0.671  0.049         0.999  0.803a  0.620a  0.657
Nearest neighbor [133]     0.733  0.014         1.015  0.975   0.745   0.871
TEST consensus [133]       0.739  0.118         1.001  0.911   0.727   0.900

Results with only auxiliary molecular descriptors
RF                         0.479  1.568         0.963  1.261   0.946   1.000
GBDT                       0.495  0.613         0.959  1.238   0.926   1.000
ST-DNN                     0.430  0.404         0.921  1.484   1.034   1.000
MT-DNN                     0.705  0.009         1.031  0.944   0.610   1.000
Consensus                  0.665  0.359         0.945  1.000   0.732   1.000

Results with ESTDs
RF                         0.441  1.177         0.957  1.300   0.995   1.000
GBDT                       0.467  0.440         0.972  1.311   0.957   1.000
ST-DNN                     0.446  0.315         0.927  1.434   0.939   1.000
MT-DNN                     0.788  0.008         1.002  0.805   0.592   1.000
Consensus                  0.681  0.266         0.970  0.977   0.724   1.000

Results with all descriptors
RF                         0.460  1.244         0.955  1.274   0.958   1.000
GBDT                       0.505  0.448         0.961  1.235   0.905   1.000
ST-DNN                     0.459  0.278         0.933  1.407   1.004   1.000
MT-DNN                     0.726  0.003         1.017  0.905   0.590   1.000
Consensus                  0.678  0.282         0.953  0.978   0.714   1.000

a These values are inconsistent with R2 = 0.671.

5.2.3.3 Tetrahymena pyriformis IGC50 test set

The IGC50 set is the second largest QSAR toxicity set studied here. The diversity of molecules in the IGC50 set is low, and the coverage of TEST methods is relatively high compared to the previous LC50 sets. As shown in Table 5.14, the R2 of different TEST methods fluctuates from 0.600 to 0.764, and the TEST consensus prediction again yields the best result for the TEST software, with R2 of 0.764. As for our models, the R2 of MT-DNN with different descriptors spans a range of only 0.038 (0.732 to 0.770), which indicates that our MT-DNN not only takes care of the overfitting problem but is also insensitive to datasets. Although ESTDs slightly underperform the physical descriptors, their MT-DNN results still beat most TEST methods except the FDA method. When all descriptors are used, the predictions by GBDT and MT-DNN outperform TEST consensus, with R2 of 0.787 and RMSE of 0.455 log(mol/L). The best result is again given by the consensus method using all descriptors, with R2 of 0.802 and RMSE of 0.438 log(mol/L).

Table 5.14: Comparison of prediction results for the Tetrahymena pyriformis IGC50 test set.

Method                     R2     (R2-R02)/R2   k      RMSE   MAE    Coverage
Hierarchical [133]         0.719  0.023         0.978  0.539  0.358  0.933
FDA [133]                  0.747  0.056         0.988  0.489  0.337  0.978
Group contribution [133]   0.682  0.065         0.994  0.575  0.411  0.955
Nearest neighbor [133]     0.600  0.170         0.976  0.638  0.451  0.986
TEST consensus [133]       0.764  0.065         0.983  0.475  0.332  0.983

Results with only auxiliary molecular descriptors
RF                         0.738  0.301         0.978  0.514  0.375  1.000
GBDT                       0.780  0.065         0.992  0.462  0.323  1.000
ST-DNN                     0.678  0.052         0.972  0.587  0.357  1.000
MT-DNN                     0.745  0.002         0.995  0.498  0.348  1.000
Consensus                  0.789  0.073         0.989  0.451  0.317  1.000

Results with ESTDs
RF                         0.625  0.469         0.966  0.603  0.428  1.000
GBDT                       0.705  0.099         0.984  0.538  0.374  1.000
ST-DNN                     0.708  0.011         1.000  0.537  0.374  1.000
MT-DNN                     0.723  0.000         1.002  0.517  0.378  1.000
Consensus                  0.745  0.121         0.980  0.496  0.356  1.000

Results with all descriptors
RF                         0.736  0.235         0.981  0.510  0.368  1.000
GBDT                       0.787  0.054         0.993  0.455  0.316  1.000
ST-DNN                     0.749  0.019         0.982  0.506  0.339  1.000
MT-DNN                     0.770  0.000         1.001  0.472  0.331  1.000
Consensus                  0.802  0.066         0.987  0.438  0.305  1.000
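Besides RMSE and MAE, the tables in this section report the slope k of the predicted-versus-experimental regression through the origin and the quantity (R2 - R02)/R2. A short sketch follows, assuming the standard definitions used with such QSAR validation statistics (Golbraikh-Tropsha): k = sum(y*p)/sum(p*p), and R02 is the determination coefficient of the regression through the origin.

```python
# Regression-through-origin validation statistics, assuming the standard
# Golbraikh-Tropsha definitions: k = sum(y*p)/sum(p*p) and
# R0^2 = 1 - sum((y - k*p)^2) / sum((y - mean(y))^2).
import numpy as np

def validation_stats(y, p):
    y, p = np.asarray(y, float), np.asarray(p, float)
    r2 = np.corrcoef(y, p)[0, 1] ** 2
    k = np.sum(y * p) / np.sum(p * p)
    r0_sq = 1.0 - np.sum((y - k * p) ** 2) / np.sum((y - y.mean()) ** 2)
    return {"R2": r2, "(R2-R0^2)/R2": (r2 - r0_sq) / r2, "k": k}

print(validation_stats([1.2, 2.5, 3.1, 4.0], [1.0, 2.7, 3.0, 4.2]))
```

Under the acceptance criteria commonly quoted with these statistics, a model is regarded as predictive when (R2 - R02)/R2 < 0.1 and 0.85 <= k <= 1.15.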
5.2.3.4 Oral rat LD50 test set

The oral rat LD50 set contains the largest molecule pool, with 7413 compounds; however, none of the methods is able to provide 100% coverage of this data set. Models for the single model and group contribution methods could not be properly built for the entire set [133], and the results of these two TEST approaches were therefore not reported for this problem, as shown in Table 5.15. It has also been noted that the LD50 values of this data set are relatively difficult to predict, as they carry a higher experimental uncertainty [156]. The TEST consensus result improves the overall prediction accuracy of the other TEST methods by about 10%, while all other non-consensus TEST methods yield low R2 and high RMSE.

For our models, all results outperform the non-consensus methods of TEST. In particular, GBDT and MT-DNN with all descriptors yield the best (and similar) results, slightly better than TEST consensus. Our predictions are also relatively stable for this particular set, as the R2 values do not fluctuate much. It should also be noted that our ESTDs have slightly higher coverage than the physical descriptors (and the combined descriptors), since 2 molecules in the test set that contain the element As cannot be properly optimized for the energy computation; this is not an issue for our persistent homology computation. The consensus method using all descriptors again yields the best results among all combinations, with an optimal R2 of 0.653 and RMSE of 0.568 log(mol/kg).

Table 5.15: Comparison of prediction results for the oral rat LD50 test set.

Method                   R2     (R2-R02)/R2   k      RMSE   MAE    Coverage
Hierarchical [133]       0.578  0.184         0.969  0.650  0.460  0.876
FDA [133]                0.557  0.238         0.953  0.657  0.474  0.984
Nearest neighbor [133]   0.557  0.243         0.961  0.656  0.477  0.993
TEST consensus [133]     0.626  0.235         0.959  0.594  0.431  0.984

Results with only auxiliary molecular descriptors
RF                       0.597  0.825         0.946  0.619  0.463  0.997
GBDT                     0.605  0.385         0.958  0.606  0.455  0.997
ST-DNN                   0.593  0.008         0.992  0.618  0.447  0.997
MT-DNN                   0.604  0.003         0.995  0.609  0.445  0.997
Consensus                0.637  0.350         0.957  0.581  0.433  0.997

Results with ESTDs
RF                       0.586  0.823         0.949  0.626  0.469  0.999
GBDT                     0.598  0.407         0.960  0.613  0.455  0.999
ST-DNN                   0.601  0.006         0.991  0.612  0.446  0.999
MT-DNN                   0.613  0.000         1.000  0.601  0.442  0.999
Consensus                0.631  0.384         0.956  0.586  0.432  0.999

Results with all descriptors
RF                       0.619  0.728         0.949  0.603  0.452  0.997
GBDT                     0.630  0.328         0.960  0.586  0.441  0.997
ST-DNN                   0.614  0.006         0.991  0.601  0.436  0.997
MT-DNN                   0.626  0.002         0.995  0.590  0.430  0.997
Consensus                0.653  0.306         0.959  0.568  0.421  0.997

CHAPTER 6

DISCUSSION

We split this discussion chapter into several parts, namely solvation free energy, partition coefficient and aqueous solubility, and various toxicity endpoints. Within each part, we review the model performances on different datasets and discuss how the current models can potentially be improved.

6.1 Solvation free energy

6.1.1 Descriptor importance analysis

An important concern for machine learning is descriptor importance. To analyze this issue, we rank all descriptors by their importance and consequently generate 40 different sets of feature combinations. Note that descriptor importance here refers to the Gini importance [157], weighted by the number of trees in the forest, as calculated by our baseline methods. We train models with different numbers of descriptors to examine their predictive performance on the test sets. More specifically, the protocol for selecting descriptors relies on a series of descriptor importance cutoffs, equally spaced between 0 and 0.01, with the features whose importance is greater than the given cutoff value being selected, as the sketch below illustrates. Figure 6.1 shows the RMSEs of the predicted solvation energies of the SAMPL molecules against different descriptor importance cutoffs. When the feature importance cutoff value is large, the number of selected features is small, and the RMSE is typically large as well; the performance improves as the cutoff value becomes relatively small.
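The cutoff sweep itself is straightforward. The following minimal sketch uses synthetic data in place of the real solvation descriptors; the forest size and data shapes are placeholders.

```python
# Descriptor selection by Gini-importance cutoff: rank descriptors with a
# baseline forest, then retrain on the subset passing each cutoff and record
# the test RMSE, as plotted in Figure 6.1.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(668, 200)); y = X[:, :5].sum(axis=1)
X_te = rng.normal(size=(40, 200)); y_te = X_te[:, :5].sum(axis=1)

base = RandomForestRegressor(n_estimators=500, random_state=0).fit(X, y)
importance = base.feature_importances_          # Gini importance, sums to 1

for cutoff in np.linspace(0.0, 0.01, 40):       # 40 equally spaced cutoffs
    keep = importance > cutoff
    if not keep.any():
        break
    model = RandomForestRegressor(n_estimators=500, random_state=0)
    model.fit(X[:, keep], y)
    rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te[:, keep])))
    print(f"cutoff={cutoff:.4f}  n_features={keep.sum():4d}  RMSE={rmse:.3f}")
```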
However, further reduction in the cutoff value does not necessarily improve the prediction accuracy and may result in worse performance. Indeed, a suitable cutoff value benefits the overall performance; a cutoff value of 2.5 x 10^-3 appears to be a good choice in our case according to the descriptor importance analysis.

Figure 6.1: Descriptor importance cutoff versus RMSE for all test sets ((a) SAMPL0, (b) SAMPL1, (c) SAMPL2, (d) SAMPL3, (e) SAMPL4) with the AM1-BCC charge and MBondi2 radius parametrization. The larger the cutoff value, the smaller the number of selected features.

6.2 Partition coefficient and aqueous solubility

6.2.1 ESTDs for small molecules

As the previous results indicate, there exists a common descriptor representation for partition coefficient and solubility prediction. Our descriptors come from two different categories: one that is computed solely by ESPH, and one that has been widely used in the development of QSAR models. Although the number of ESTDs is small, our topological descriptor representation of molecules turns out to have very strong predictive power compared to the baseline method. Our ESTDs highlight atom-type information and are able to retain intra-molecular interactions via a filtration process.

6.2.2 Multitask learning

The goal of multitask learning is to learn commonalities between different tasks and to simultaneously improve model performances. Partition coefficient and aqueous solubility are trained jointly, and substantial improvements over single-task models are observed. Our results suggest that there exists shared information across these two tasks that can benefit prediction accuracy. Indeed, the original motivation for predicting log P and log S together is that both coefficients closely relate to the extent to which a compound dissolves in solvents. By comparing our MT-DNN with gradient boosting trees, we find that it is beneficial to learn partition coefficient and aqueous solubility models together. Our MT-ESTD models achieve satisfactory results on various partition coefficient and aqueous solubility data sets, some of which are the state of the art to the best of our knowledge. Moreover, ESTDs alone can give very accurate predictions, bringing new insights from ESPH computations. In addition to ESTDs, commonly used 2D descriptors also help to improve the overall accuracy. Learning these two related properties together boosts the overall model performance.

6.2.3 Predictive power for log P and log S

We have shown that a common set of ESTDs can be used to accurately predict log P and log S. However, we also notice that the performance of ESTDs on log P is generally better than that on log S. One major reason is that the log S training set is small compared to the log P training set, which makes it difficult to realize the full potential of the MT-DNN algorithms. In addition, the descriptor-to-training-sample-size ratio of log S is much higher than that of log P, so MT-DNN and GBDT models are more likely to overfit due to the large number of fitting parameters. Nevertheless, MT-DNN is still able to take advantage of the log P prediction task: for Klopman's and Hou's test sets, MT-DNN achieves better results than GBDT. We believe that log P and log S can be predicted simultaneously using MT-DNN architectures, to the benefit of both prediction tasks.
6.3 Toxicity endpoint prediction

6.3.1 The impact of descriptor selection and potential overfitting

To deal with descriptor redundancy and overfitting, four different sets of high-importance descriptors are selected by thresholds to perform the prediction tasks, in a way similar to the protocol discussed in Section 6.1.1. More specifically, we rank all descriptors according to their feature importance and use several feature importance thresholds as the selection protocol. Four threshold values are chosen (2.5e-4, 5e-4, 7.5e-4 and 1e-3), and the results using MT-DNN on the LC50 set are shown in Table 6.1. Results for the three remaining sets are provided in the Appendix.

Table 6.1: Results of selected descriptor groups for the LC50 set.

Threshold  # of descriptors  R2     (R2-R02)/R2   k      RMSE   MAE    Coverage
0.0        1030              0.769  0.009         1.014  0.716  0.466  1.000
2.5e-4     411               0.784  0.051         0.971  0.685  0.459  1.000
5e-4       308               0.764  0.062         0.962  0.719  0.470  1.000
7.5e-4     254               0.772  0.064         0.958  0.708  0.468  1.000
1e-3       222               0.764  0.063         0.963  0.717  0.467  1.000

Table 6.1 shows the performance with respect to different numbers of descriptors. When the number of descriptors is increased from 222, 254, 308, and 411 to 1030, the RMSE does not increase and R2 does not change much. This behavior suggests that our models are essentially insensitive to the number of descriptors and thus that there is little overfitting. The MT-DNN architecture takes care of overfitting by successive feature abstraction, which naturally mitigates the noise generated by less important descriptors. The MT-DNN architecture can also take advantage of related tasks, which in turn reduces the potential overfitting on a single data set through the alternating training procedure. Similar behavior is also observed for the remaining three data sets, as presented in the Appendix. Therefore our MT-DNN architecture is very robust against descriptor selection and can avoid overfitting.

6.3.2 The predictive power of ESTDs for toxicity

One of the main objectives of this study is to understand the toxicity of small molecules from a topological point of view; it is important to see whether ESTDs alone can match the methods proposed in the T.E.S.T. software. When all ESTDs and the MT-DNN architecture are used for toxicity prediction, we observe the following results:

• LC50 set and LC50-DM set. Models using only ESTDs achieve higher accuracy than the T.E.S.T. consensus method.

• LD50 set. The consensus result of ESTDs tops the T.E.S.T. software in terms of both R2 and RMSE, and the MT-DNN results outperform all non-consensus T.E.S.T. methods.

• IGC50 set. ESTDs slightly underperform the T.E.S.T. consensus. However, MT-DNN with ESTDs still yields better results than most non-consensus T.E.S.T. methods except FDA.

It is evident that our ESTDs, together with the MT-DNN architecture, have strong predictive power for all kinds of toxicity endpoints. The ability of MT-DNN to learn from related toxicity endpoints results in a substantial improvement over ensemble methods such as GBDT. Along with the physical descriptors calculated by our in-house MIBPB, we obtain state-of-the-art results for all four quantitative toxicity endpoints.

6.3.3 Alternative element specific networks for generating ESTDs

Apart from the element specific networks proposed in Table 4.8, we also use the alternative element specific networks listed in Table 6.2 to perform the same prediction tasks. Instead of using two types of element-specific networks, we consider only two-element networks, which puts more emphasis on the intra-molecular interactions.

Table 6.2: Alternative element specific networks used to characterize molecules.

Network type   Element specific networks
Two-element    {bi, cj}, where bi ∈ B, cj ∈ C, i ∈ {1, . . . , 4}, j ∈ {1, . . . , 10}, and i < j, with B = {H, C, N, O} and C = {H, C, N, O, F, P, S, Cl, Br, I}.
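The pair construction of Table 6.2 can be enumerated directly, as the sketch below shows; the 28 ESTDs per network are taken from the 30 x 28 total quoted in the next paragraph.

```python
# Enumeration of the two-element networks of Table 6.2: each pair {b_i, c_j}
# with i < j defines one element-specific point cloud for the persistent
# homology computation; the count 9 + 8 + 7 + 6 = 30 matches the text.
B = ["H", "C", "N", "O"]
C = ["H", "C", "N", "O", "F", "P", "S", "Cl", "Br", "I"]

networks = [(b, c) for i, b in enumerate(B, 1)
            for j, c in enumerate(C, 1) if i < j]
print(len(networks))        # 30 two-element networks
print(len(networks) * 28)   # 840 ESTDs in total
```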
Eventually, this new construction yields 30 different element specific networks (9 + 8 + 7 + 6), and a total of 840 ESTDs (30 x 28) are calculated and used for prediction. On the LC50, IGC50 and LD50 sets, the overall performance of the new ESTDs improves slightly; on the LC50-DM set, the accuracy is somewhat lower, though still higher than that of the T.E.S.T. consensus. Detailed performances of these ESTDs are presented in the Appendix. Thus the predictive power of our ESTDs is not sensitive to the choice of element specific networks, as long as reasonable element types are included.

6.3.4 A potential improvement with consensus tools

In this work, we also propose the consensus method discussed in Section 5.2.3. The idea of consensus is to train different models on the same set of descriptors and average over all predicted values. The underlying mechanism is to exploit the systematic errors generated by different machine learning algorithms, with the potential to reduce the bias of the final prediction. As we observe in Section 5.2.3, the consensus method offers a considerable boost in prediction accuracy. For the reasonably large sets, that is, all except the LC50-DM set, the consensus models give the best predictions. For the small LC50-DM set, the consensus models perform worse than MT-DNN, likely because the large number of descriptors causes overfitting for most machine learning algorithms and consequently generates large deviations, which eventually result in a large error for the consensus method. Thus it is advisable to perform prediction tasks with both MT-DNN and consensus methods, depending on the size of the data set, to take advantage of both approaches.

CHAPTER 7

THESIS CONTRIBUTION AND FUTURE WORK

In this chapter, the thesis contributions to the prediction of small molecule properties are highlighted, in terms of topological modeling and machine learning algorithms. The thesis then ends with some perspectives on future work.

7.1 Solvation free energy

Implicit solvent models intuitively split the total solvation free energy into polar and nonpolar contributions. However, polar and nonpolar interactions are coupled and interdependent during the solvation process. A novel framework is proposed to break the polar-nonpolar division used in implicit solvent models and to treat polar and nonpolar contributions on an equal footing, based on the assumption that there exists a microscopic descriptor vector that can uniquely characterize a molecule and distinguish it from other molecules. To validate the proposed method, we adopt a large dataset of 668 molecules collected in our earlier work [4]. We propose two sets of descriptors to train the quantitative models: one set with polar features that are highly correlated with the solvation free energies of this dataset, and the other set with both polar and nonpolar features. Although nonpolar features such as atomic area are not highly correlated with solvation free energy, their inclusion improves the overall performance of the present method. Highly accurate solvation free energy prediction is confirmed both by the leave-one-out test over the 668 molecules and by the prediction of five SAMPL test sets, namely, SAMPL0, SAMPL1, SAMPL2, SAMPL3 and SAMPL4.
Finally, we consider a test set of 94 molecules and its associated training set [1] for a comparison between the present method and a classic solvation model based on the weighted solvent accessible surface area [1].

7.2 Partition coefficient and aqueous solubility

Partition coefficient and aqueous solubility are among the most important physical properties of small molecules and have significant applications to drug design and discovery in terms of lipophilic efficiency. Based on chemical and physical models, a wide variety of computational methods has been developed in the literature for the theoretical prediction of partition coefficient and aqueous solubility.

The present work introduces an algebraic topology based method, element specific persistent homology (ESPH), for simultaneous partition coefficient and aqueous solubility predictions. ESPH offers an unconventional representation of small molecules in terms of multiscale and multicomponent topological invariants. Here the multiscale representation is inherited from persistent homology, while the multicomponent formulation is developed to retain essential chemical information during the topological simplification of molecular geometric complexity. Therefore, the present ESPH gives a unique representation of small molecules that cannot be obtained by other methods. Although the ESPH representation of molecules cannot be literally translated into a physical interpretation, it systematically and comprehensively enciphers chemical and physical information of molecules into scalable topological invariants, and thus is ideally suited for machine learning and deep learning algorithms to decipher such information.

To predict partition coefficient and aqueous solubility, we integrate ESPH with advanced machine learning methods, including gradient boosting trees, random forests, and deep neural networks, to construct topological learning strategies. Since partition coefficient and aqueous solubility are highly correlated with each other, we develop a common set of ESPH based descriptors, called element specific topological descriptors (ESTDs), to represent both properties. This approach enables us to perform simultaneous predictions of partition coefficient and aqueous solubility using a topology based multi-task deep learning strategy.

To test the representational power of ESPH and the predictive power of the proposed topological multi-task deep learning strategy, we consider commonly used data sets, including two benchmark test sets for partition coefficient as well as additional solubility data sets. Extensive cross validations and benchmark tests indicate that the proposed topological multi-task strategy offers some of the most accurate predictions of partition coefficient and aqueous solubility.

7.3 Toxicity endpoints

Toxicity refers to the degree of damage a substance can cause to an organism, such as an animal, bacterium, or plant, and it can be measured qualitatively or quantitatively by experiments. Experimental measurement of quantitative toxicity is extremely valuable, but it is typically expensive and time consuming, in addition to raising potential ethical concerns. Theoretical prediction of quantitative toxicity has therefore become a useful alternative in pharmacology and environmental science. A wide variety of methods has been developed for toxicity prediction in the past. The performance of these methods depends not only on the descriptors but also on the machine learning algorithms, which makes model evaluation a difficult task.
We introduce a series of novel descriptors, called element specific topological descriptors (ESTDs), for the characterization and prediction of toxicity endpoints. Additionally, physical descriptors based on established physical models are developed to enhance the predictive power of the ESTDs. These new descriptors are then combined with a variety of advanced machine learning algorithms to demonstrate their capability in quantitative toxicity analysis. Four quantitative toxicity data sets are used in the present study: the 96 hour fathead minnow LC50 data set (LC50 set), the 48 hour Daphnia magna LC50 data set (LC50-DM set), the 40 hour Tetrahymena pyriformis IGC50 data set (IGC50 set), and the oral rat LD50 data set (LD50 set). Comparison is also made to the state-of-the-art approaches given on the T.E.S.T. website at https://www.epa.gov/chemical-research/toxicity-estimation-software-tool-test by the United States Environmental Protection Agency. Our numerical experiments indicate that the proposed ESTDs are as competitive as the individual methods in T.E.S.T. Aided by physical descriptors and the MT-DNN architecture, ESTDs are able to establish state-of-the-art predictions for quantitative toxicity data sets. Additionally, MT-DNN models are typically more accurate than ensemble methods. It is worth noting that the proposed new descriptors are very easy to generate and thus have almost 100% coverage for all molecules, indicating their broad applicability to practical toxicity analysis and prediction. In fact, our topological descriptors are much easier to construct than physical descriptors, which depend on physical models and force fields. The present work indicates that ESTDs are a new class of powerful descriptors for small molecules.

7.4 Future work

In this section, we propose future work for this thesis from several different aspects, following the cheminformatics, topological modeling, and machine learning points of view.

First of all, there are still many other useful chemical and physical properties to be studied. Properties such as the boiling point, viscosity, and melting point of a chemical can all potentially be predicted following the same QSAR pipeline. Furthermore, qualitative properties can also be incorporated into our current framework. For instance, the Tox21 data challenge (https://tripod.nih.gov/tox21/challenge/), while also focusing on toxicity prediction, proposes 12 different qualitative toxicity endpoints for participants to predict. After all, both quantitative and qualitative properties are essential for drug design and drug discovery.

Secondly, our current ESPH framework can be further improved. Starting from the idea of primitive persistent homology, we developed ESPH, and we further modified ESPH to better characterize small molecules by eliminating covalent bonding. The proposed ESPH, however, mainly captures geometric information. A potential approach is to incorporate electrostatic persistence or resolution based persistence to generate more ESTDs. Moreover, it may also be beneficial to construct other complexes (such as the alpha complex) in addition to the Vietoris-Rips complex, or to use different sets of distance metrics. We firmly believe that topological modeling has tremendous potential in cheminformatics and bioinformatics applications.

Last but not least, it is promising that our current molecular descriptors can benefit from the development of machine learning techniques.
Neural network based models, such as convolutional neural networks (CNN), generative adversarial networks (GAN) and capsule networks (CN), have all demonstrated success or achieved preliminary results when applied to biomolecular modeling. For example, one can take advantage of image processing techniques by splitting atoms of different element types into different channels (analogous to the RGB channels of an image) and analyzing molecular descriptors based on locations. The potential of machine learning can never be overestimated.

APPENDICES

APPENDIX A

SUPPLEMENTARY MATERIALS FOR SOLVATION FREE ENERGY PREDICTION

A.1 A list of polar descriptors used for solvation free energy prediction

Table A.1: Microscopic polar features with high correlations to the solvation free energy used in this study.

Feature Name
Sum of atomic reaction field energy
Sum of the absolute value of atomic reaction field energy
Sum of H atomic reaction field energy
Sum of the absolute value of H atomic reaction field energy
Sum of O atomic reaction field energy
Sum of the absolute value of O atomic reaction field energy
Minimum value of atomic reaction field energy
Maximum of the absolute value of reaction field energy
Minimum value of H atomic reaction field energy
Maximum of the absolute value of H atomic reaction field energy
Mean of atomic reaction field energy
Mean of the absolute value of atomic reaction field energy
Variance of atomic reaction field energy
Variance of the absolute value of reaction field energy
Variance of H atomic reaction field energy
Variance of the absolute value of H atomic reaction field energy
Sum of the absolute value of atomic charge
Sum of H atomic charge
Sum of the absolute value of H atomic charge
Sum of O atomic charge
Sum of the absolute value of O atomic charge
Minimum of atomic charge
Maximum of the absolute value of atomic charge
Maximum of H atomic charge
Maximum of the absolute value of H atomic charge
Mean of the absolute value of atomic charge
Variance of the atomic charge
Variance of the absolute value of atomic charge
Variance of the absolute value of H atomic charge
Variance of H atomic charge

A.2 Leave-one-out results

Table A.2 shows the query number of each molecule and the leave-one-out results obtained with the AM1-BCC charge and MBondi2 radius parametrization and selected polar features.
Table A.2: Molecules with corresponding query number, and eave one out results with AM1-BCC charge and MBondi2 radius PubChem ID Query Number Experimental Value Predicted Value 10008 10041 101 1031 1032 10326 10371 10399 10405 10422 10430 10461 10486 1049 10541 10553 10566 1068 10686 10687 10722 10740 107828 10822 10823 10824 10864 10870 10882 10899 10903 10907 10926 10943 10974 10977 1 1 2 2 2 1 3 2 2 3 2 5 5 3 3 1 1 6 1 2 2 1 3 3 3 3 1 1 2 5 2 1 5 5 3 5 -0.535 1.893 -7.801 -5.018 -6.244 -0.877 -7.277 -3.895 -2.505 -4.679 -6.315 -3.353 -0.379 -5.326 -10.266 0.767 1.337 -1.418 -0.935 -6.429 -4.715 -3.035 -18.016 -4.581 -4.856 -7.762 -0.442 -0.167 -2.445 -0.260 -2.414 2.773 -1.395 -1.125 -2.334 -0.515 -0.8 2.51 -9.52 -4.85 -6.46 -1.46 -7.1 -5.48 -2.65 -5.56 -6.09 -3.92 1.09 -4.69 -8.7 1.31 0.4 -1.61 -1.21 -6.16 -3.93 -2.78 -17.17 -4.73 -4.59 -7.29 -0.23 0.16 -2.49 -0.33 -2.1 2.89 -0.7 -0.98 -1.88 -0.07 79 10992 11 11000 11095291 11124 11126 11137 11182 11229 11239 11251 11260 11264 11269 11271 11304 11327 11335 11386 11387 1140 11416 11417 11420 11428 1146 11507 11513 11519 11523 11526 11542 11543 11565 11574 11582 11583 11587 11597 11598 11610 11611 11638 5 1 5 2 6 3 2 1 1 2 1 2 1 2 1 1 2 1 1 1 1 3 3 2 3 5 1 2 1 1 3 1 2 3 1 1 2 1 1 1 1 1 1 Table A.2 (cont’d) 0.0 -1.79 -0.99 0.1 -2.93 -1.38 -9.62 -1.66 2.56 1.83 -3.24 2.52 -3.88 2.56 -2.74 -2.82 -2.47 -5.26 -2.78 -2.63 -0.9 1.58 -5.22 -4.82 -4.35 -3.2 2.71 -4.01 2.97 2.11 -4.72 2.88 -3.92 -4.84 0.67 2.93 -3.28 0.93 1.58 1.01 1.66 1.68 -0.22 80 0.087 -1.228 -0.912 -1.943 -2.928 -1.475 -6.729 -1.559 2.535 1.502 -3.015 2.545 -3.905 2.870 -3.442 -2.428 -2.770 -5.531 -2.485 -2.800 -0.797 2.105 -4.831 -4.853 -4.182 -5.502 2.815 -3.224 2.870 1.670 -4.959 2.547 -3.999 -4.516 1.255 2.585 -3.102 1.170 1.760 1.170 1.680 1.675 1.047 11643 11670 11721 1174 11903 11904 11940 12009 12016 12083 12101 12160 12178 12180 12206 12217 12232 12245 12257 12264 12287 12291 12302 12309 12321 12332 12340 12348 12350 12370 12371 12375 12418 12463 12468 12508 12580 12586 126 12720 12724 12732 12309460 5 2 2 3 1 1 1 3 5 3 2 1 2 2 2 2 6 5 2 6 3 3 3 1 4 6 3 2 2 1 1 5 2 5 5 5 2 2 2 2 1 1 1 Table A.2 (cont’d) -0.77 -3.91 -2.01 -16.59 -0.85 -1.24 -3.78 -14.0 0.04 -6.88 -6.25 -0.95 -4.06 -2.83 -2.56 -2.11 -1.83 0.07 -2.02 -1.21 -4.01 -3.88 -3.09 0.01 -4.23 -1.43 -9.31 -1.81 -2.51 0.6 0.71 0.29 -3.54 -1.43 -1.34 -1.62 -2.22 -5.21 -4.1 -8.83 0.04 1.91 0.29 81 -0.775 -4.364 -2.215 -17.440 -0.535 -1.244 -2.800 -13.380 0.132 -6.300 -5.703 -1.170 -4.539 -2.650 -2.593 -2.116 -1.535 0.045 -2.802 -1.314 -4.811 -3.671 -2.827 0.248 -4.692 -2.143 -8.857 -1.912 -2.350 0.248 0.675 -0.232 -3.127 -1.047 -1.943 -1.262 -2.412 -6.061 -3.166 -9.038 -0.250 1.680 0.530 12741 12896 12986 13 13238893 13004 13019 13081 13187 13207 13263 13389 13394 13450 135191 13529 13567 13855 138747 138975 1390 14109 141897 14215 14276 14282 14315 144381 144702 15050 15413 15546 15600 15625 156391 15758 16003 16270 16295 16318 16441 16628 16666 2 3 1 2 3 7 2 2 3 3 3 3 3 5 2 2 3 2 3 3 3 1 2 3 3 3 1 3 4 2 2 3 1 4 2 3 3 1 1 5 2 1 2 Table A.2 (cont’d) -2.34 -11.24 1.47 -1.12 -4.61 -10.17 -6.5 -2.49 -3.9 -1.92 -7.65 -7.0 -9.65 -6.68 -20.52 -3.77 -2.09 -2.92 -4.29 -11.95 -8.41 -0.04 -5.73 -3.64 -4.4 -7.62 -2.4 -6.4 -4.59 -4.42 -2.21 -16.43 3.16 -3.37 -10.21 -17.74 -4.13 2.13 -0.18 -2.98 -4.09 1.58 -3.2 82 -2.765 -9.358 1.680 -1.047 -3.668 -9.746 -5.217 -2.709 -3.442 -6.087 -7.663 -6.837 -8.318 -8.389 -23.559 -4.357 -2.012 -1.993 -4.243 -7.774 -6.564 -0.535 -5.918 -4.378 -3.162 -7.206 -2.800 -7.051 -4.107 -4.368 -2.220 
-15.724 2.930 -3.758 -9.925 -16.703 -3.644 2.585 0.247 -0.978 -3.290 1.993 -3.507 18927701 21075956 21210 21269179 1672 16900 17190 1732 174 176 177 17739 17756 178 180 18636 18937 19041 19540 19878 20419 20528 20748 2078 20848 2160 220639 221525 22188 222 22227 222536 223106 22386 2244 2268 2319 2331 240 244 249266 25146 2519 3 2 3 4 2 2 2 2 2 3 2 5 1 1 1 1 3 3 3 2 3 3 3 2 1 3 3 5 3 2 2 7 1 2 2 7 3 3 2 2 5 7 3 Table A.2 (cont’d) -6.93 -3.43 -8.94 -6.79 -9.3 -6.69 -3.5 -5.26 -6.4 -9.71 -3.8 -4.53 -0.8 1.05 2.93 2.55 -4.46 -7.44 -14.21 -3.3 -8.21 -7.98 -4.95 -5.73 -1.29 -7.43 -2.82 -2.32 -11.14 -4.29 -3.75 -4.97 1.07 -4.39 -9.94 -10.03 -3.51 -11.0 -4.02 -6.62 -2.69 -5.74 -12.64 83 -7.207 -3.437 -11.688 -6.848 -7.136 -7.501 -3.250 -6.844 -5.745 -10.312 -3.945 -1.125 0.246 0.597 2.895 2.502 -4.072 -10.787 -9.851 -2.412 -8.459 -9.430 -5.145 -4.771 -0.768 -6.135 -2.896 -2.815 -9.860 -6.906 -3.752 -5.298 0.247 -4.420 -9.319 -6.451 -2.982 -10.803 -4.978 -5.745 -2.933 -8.238 -10.790 2566 261 263 26331 264 26447 2730 27588 2879 297 3017 30209 3030 3031 3048 3059 3100 31234 31242 31246 31249 31260 31265 31268 31272 31275 31276 31285 31289 31297 31347 31373 31420 31423 31645 32594 3283 3286 3301 33135 335 33500 3385 3 2 2 3 2 2 7 5 2 1 7 3 4 3 4 2 3 2 2 2 2 2 2 3 2 2 2 1 2 7 7 5 3 1 3 3 2 7 3 3 3 2 3 Table A.2 (cont’d) -9.61 -3.18 -4.72 -5.45 -6.35 -2.53 -5.04 -2.28 -6.13 2.0 -6.48 -1.66 -9.86 -4.71 -4.82 -9.4 -9.34 -6.92 -6.13 -2.92 -5.71 -4.42 -2.23 -5.48 -2.64 -5.06 -2.21 2.06 -2.07 -4.87 -8.61 0.1 -11.85 -4.52 -9.41 -3.68 -1.59 -6.1 -7.6 -4.4 -5.9 -2.45 -16.92 84 -9.117 -3.181 -4.731 -5.476 -6.268 -3.442 -6.853 -2.661 -6.548 1.680 -6.172 -1.928 -7.827 -3.673 -6.154 -9.804 -7.454 -6.060 -5.747 -2.986 -5.298 -4.568 -2.178 -4.395 -2.573 -3.895 -2.505 1.670 -2.303 -6.475 -9.740 -2.514 -9.586 -3.741 -10.178 -4.959 -2.125 -6.834 -6.557 -4.702 -6.011 -3.264 -18.056 3394 342 34468 34586 34591 35454 356 3589 36401 36613 3672 36980 37037 37207 37247 3776 38019 3825 38251 38252 38253 38254 38306 39253 398 4004 402 4044 40818 4101 4109 4116 4130 4156 442474 447466 447907 46174049 454 460 4684 4685 4790 2 2 3 4 3 4 1 5 5 4 2 5 5 4 5 2 5 4 2 4 4 4 5 5 3 7 6 3 3 3 3 5 7 7 2 2 3 2 2 4 4 5 7 Table A.2 (cont’d) -8.42 -5.49 -5.66 -3.56 -5.04 -3.81 2.88 -2.55 -3.48 -3.67 -7.0 -2.46 -4.4 -3.52 -2.16 -4.74 -3.04 -10.78 -3.71 -3.1 -4.05 -4.15 -3.17 -4.61 -3.13 -8.15 -0.7 -6.78 -5.73 -4.56 -10.65 -1.12 -7.19 -4.87 -2.49 -4.22 -10.91 -2.29 -5.94 -4.82 -7.03 -1.01 -4.37 85 -8.897 -6.429 -7.407 -4.243 -5.461 -4.010 2.930 -3.352 -2.516 -3.539 -6.699 -2.394 -2.973 -3.580 -2.517 -4.272 -3.352 -19.721 -3.920 -3.684 -3.757 -3.758 -3.610 -2.270 -3.178 -6.873 -1.207 -9.428 -4.518 -8.587 -9.717 -0.640 -6.462 -9.685 -3.586 -4.218 -9.807 -2.783 -5.930 -4.554 -6.792 -1.228 -6.853 4837 48889 4929 4933 5216 527 5281168 5283324 52997 52999 53167 5326160 5326161 53476 53479 5377791 5541 5569 56160 5793 5802 5853 5899 5943 5993 6027 6053 6054 6115 6129 61362 6184 6212 6213 6214 6228 6251 6276 6278 63079 63088 6324 6325 3 4 3 3 3 2 2 2 3 3 3 1 1 4 4 7 2 3 3 2 3 7 3 5 5 2 3 2 3 3 2 2 5 7 5 3 2 2 5 5 5 1 1 Table A.2 (cont’d) -7.4 -3.84 -8.43 -7.78 -10.22 -3.43 -3.68 -3.44 -20.25 -15.54 -7.77 1.31 1.34 -8.68 -7.78 -7.07 -8.84 -3.25 -14.01 -25.47 -18.17 -12.74 -15.46 0.08 -3.44 -20.52 -6.66 -6.79 -5.49 -9.45 -4.51 -2.81 -1.08 -10.08 -0.64 -7.81 -23.62 -4.57 -0.19 -4.38 -3.61 1.83 1.28 86 -7.050 -3.888 -6.704 -8.305 -9.381 -3.438 -4.445 -3.434 -14.017 -16.091 -6.563 1.217 0.850 -7.291 -8.449 -6.513 -15.756 -2.799 -17.548 -22.599 -17.100 -9.281 
-18.053 -1.396 -2.812 -23.556 -7.217 -6.588 -6.007 -10.469 -3.290 -2.505 -0.640 -6.889 -1.263 -7.456 -23.886 -4.633 -1.138 -2.972 -2.972 2.105 1.091 6326 6327 6329 6334 6335 6337 6338 6341 6342 6343 6344 6351 6360 6361 6365 6366 6368 6372 6373 6375 637564 637566 638186 6386 6391 6392 639662 6403 6405 6408 6409 64151 6416 6419 6423 6427 6428 6429 6430 643820 643833 6441 64689 1 5 3 1 1 5 5 3 3 6 5 1 1 5 5 5 1 5 1 3 2 2 5 2 5 5 1 1 2 5 2 2 2 5 5 5 5 5 5 2 5 2 2 Table A.2 (cont’d) -0.01 -0.55 -4.55 2.0 -0.48 -0.63 -0.59 -4.5 -3.88 -1.14 -1.31 0.75 2.3 -0.25 -0.84 0.25 -0.11 -0.5 0.81 -4.02 -4.63 -4.45 -0.78 -4.47 1.69 2.52 1.66 2.51 -4.43 0.06 -4.31 -4.46 -3.11 -1.23 -1.45 0.82 1.77 2.32 2.87 -4.78 -1.17 -4.88 -25.47 87 -0.024 -0.525 -5.738 2.105 -0.197 -0.455 -0.105 -4.712 -4.517 -1.125 -1.047 1.670 2.379 -0.188 -1.047 -0.168 -0.766 1.343 0.000 -4.096 -3.786 -4.802 -0.685 -5.047 0.994 1.717 1.680 2.502 -4.364 -1.394 -4.595 -3.226 -3.920 -0.845 -4.483 0.042 0.973 2.123 0.614 -4.485 -1.262 -5.480 -22.599 6497 6544 6556 6557 6560 6561 6563 6564 6568 6569 6574 6575 6578 6582 6584 6587 6589 6591 66750 6710 6720 6734 6736 674 679 6809 6845 68510 6853 688400 6895 6896 69027 6944 6946 6947 6950 69689 69720 6993809 6997 6998 7000 7 2 1 1 2 2 5 5 2 2 5 5 3 3 2 3 1 5 2 3 5 1 3 3 7 3 5 3 1 3 5 3 2 3 3 3 3 3 2 2 2 2 3 Table A.2 (cont’d) -5.1 -5.18 2.38 0.68 -4.5 -2.86 0.0 -1.27 -4.62 -3.71 -1.99 -0.44 -9.4 -10.0 -3.13 -3.71 2.34 -2.37 -7.75 -9.44 -5.22 -3.15 -5.88 -4.29 -9.28 -9.61 -3.32 -0.41 -3.35 -6.23 -1.24 -5.21 -2.4 -3.58 -7.37 -4.58 -6.23 -13.6 -4.04 -4.2 -5.66 -4.68 -6.12 88 -7.770 -3.819 2.547 0.597 -4.897 -3.194 -0.105 -2.370 -4.701 -3.763 -1.465 -1.138 -9.803 -9.778 -2.803 -3.265 2.259 -1.574 -8.895 -9.808 -0.585 -1.894 -5.578 -5.145 -4.596 -10.127 -2.810 -2.980 -2.427 -6.264 -2.670 -4.913 -3.058 -3.645 -7.436 -9.511 -6.003 -11.641 -5.281 -4.183 -6.348 -7.800 -7.456 7002 7005 702 7041 7043 7047 7057 7095 712 7144 7150 7165 7175 7184 7237 7238 7239 7240 7242 7245 7247 7249 7258 7267 727 7270 7282 7288 7295 7296 7298 7301 7304 73272 7351 7366 7368 7393 7406 7407 7410 7416 7422 1 2 2 2 2 3 3 1 2 2 2 2 2 2 1 5 5 3 3 4 1 2 4 2 5 5 1 2 4 1 2 2 3 3 2 1 1 2 1 1 2 3 3 Table A.2 (cont’d) -2.44 -7.67 -5.0 -6.96 -5.33 -5.72 -7.47 -2.7 -2.75 -5.8 -3.92 -3.64 -9.37 -8.72 -0.9 -1.14 -1.36 -4.91 -5.53 -4.55 -0.86 -6.5 -7.29 -5.91 -5.44 -1.34 2.51 -3.41 -4.0 1.59 -5.49 -3.3 -2.89 -15.83 -1.69 -0.44 -0.25 -5.91 -0.3 -1.24 -4.58 -4.12 -3.45 89 -2.800 -7.315 -4.837 -6.053 -4.366 -5.836 -6.834 -3.035 -4.071 -6.918 -4.183 -4.943 -9.042 -9.318 -0.935 -0.977 -1.315 -5.894 -5.736 -5.665 -1.232 -6.128 -6.731 -5.240 -2.660 -1.573 2.547 -3.102 -4.549 1.680 -4.980 -2.802 -3.436 -13.950 -2.481 -0.465 -0.000 -5.801 -0.797 -1.308 -4.714 -4.098 -3.675 7423 7456 74626 7463 7475 7476 7498 7500 7501 7503 7505 7506 7515 7519 7520 7523 753 7560 7580 7583 76122 7668 7674 7705 7732 7749 7761 7762 7765 77650 7770 7771 77918 7797 7803 7809 7812 7813 7818 7824 7843 7844 7845 3 2 2 1 3 2 6 1 1 5 3 3 3 2 6 3 2 3 1 2 2 1 2 1 3 2 2 2 3 2 2 2 3 2 2 1 3 3 3 2 1 1 1 Table A.2 (cont’d) -8.84 -9.51 -3.82 -0.68 -9.82 -4.4 -2.38 -0.79 -1.24 -1.93 -4.1 -6.02 -4.69 -2.45 -2.73 -4.33 -13.43 -9.13 -2.82 -2.87 -4.1 -0.53 -2.22 -0.4 -7.48 -2.68 -6.0 -2.49 -3.28 -9.8 -2.28 -6.01 -7.63 -2.3 -2.44 -0.8 -5.9 -5.57 -7.58 -2.49 2.1 1.38 0.56 90 -7.406 -8.900 -4.512 -0.935 -9.629 -4.541 -1.958 -0.397 -0.935 -2.668 -5.145 -6.781 -6.085 -2.599 -1.636 -4.703 -10.059 -10.800 -2.882 -3.393 -3.128 -0.513 -2.490 -0.475 -6.724 -2.650 -5.741 -2.540 -2.930 -7.340 
-2.540 -5.703 -8.403 -2.080 -2.481 -1.232 -5.462 -5.462 -6.381 -2.468 2.313 1.320 0.353 7846 7848 7850 7852 7854 7858 7865 7892 7895 79 7903 7907 7909 7910 7912 79123 7914 79143 7915 7929 7932 7933 7936 7937 7947 7948 795 7950 7956 7962 7963 79639 7964 7965 7966 7967 7970 7972 7975 7976 7977 7991 7997 1 6 5 3 3 2 2 1 2 3 1 2 2 3 3 2 3 2 1 3 3 4 3 3 1 2 5 3 3 1 3 3 5 3 2 2 3 3 3 3 3 2 2 Table A.2 (cont’d) -0.16 -1.1 -0.57 -4.39 -3.84 -5.03 -2.78 2.51 -3.52 -6.75 -3.34 2.83 -3.05 -3.73 -3.22 -1.82 -0.53 -18.4 -2.64 -0.83 -5.82 -6.62 -4.86 -4.59 -0.9 -6.27 -9.63 -0.78 -18.06 1.7 -4.93 -22.4 -1.12 -4.59 -5.46 -4.91 -4.77 -6.32 -4.63 -5.51 -4.39 -6.16 -2.79 91 -0.250 -1.135 -0.640 -4.516 -3.376 -5.480 -3.080 2.088 -3.275 -6.088 -3.563 2.625 -3.508 -3.427 -4.020 -2.083 -1.687 -20.045 -2.540 -0.535 -5.462 -6.891 -4.327 -4.517 -0.349 -5.915 -9.141 -1.110 -16.249 1.583 -4.703 -17.631 -1.395 -3.989 -5.018 -3.427 -4.651 -4.958 -5.007 -5.671 -5.241 -6.126 -2.650 8003 8004 8005 8007 8008 8012 8018 8019 8020 8021 8025 8027 8028 8030 8038 8051 8052 8058 8059 8060 8061 8063 8071 8076 8077 8078 8079 8082 8083 8091 8093 8095 8102 8103 8114 8115 8118 8121 81226 8125 8127 8129 8130 1 1 5 3 3 6 3 2 2 3 2 3 2 6 2 2 2 1 5 3 3 2 2 2 6 1 1 3 3 2 2 2 3 2 2 5 6 2 3 1 3 2 2 Table A.2 (cont’d) 2.3 1.68 -0.16 -4.24 -3.64 -0.99 -6.55 -6.62 -2.93 -4.07 -2.56 -4.78 -3.47 -1.4 -2.36 -3.04 -2.13 2.48 -2.32 -4.09 -3.52 -3.03 -4.84 -6.69 -1.64 1.23 0.14 -5.11 -7.17 -2.04 -2.88 -5.31 -3.95 -4.4 -1.16 -4.23 -1.28 -6.34 -5.99 2.08 -3.79 -4.21 -2.67 92 1.893 1.338 0.186 -4.441 -3.192 -1.162 -5.543 -6.428 -4.000 -4.858 -2.783 -4.412 -3.081 -1.535 -2.650 -3.077 -2.540 2.693 -2.272 -4.543 -3.436 -2.842 -2.650 -6.588 -1.418 1.137 1.378 -4.287 -6.226 -2.282 -2.842 -3.943 -4.286 -4.110 -1.509 -2.270 -1.418 -5.969 -7.609 1.818 -3.376 -4.540 -2.599 8133 8141 8143 8148 8163 81713 8174 8252 8254 8255 8263 8302 8323 8341 8370 84179 8418 8434 84440 8452 8454 8471 8500 85254 8606 8640 8663 8680 8723 878 8857 887 8881 8882 8892 8894 8900 8902 8908 8909 8914 9005 9007 2 1 3 3 2 3 2 1 2 1 1 1 3 3 5 3 1 2 2 2 3 3 2 3 3 3 2 1 2 6 2 2 5 1 2 2 1 3 2 2 2 2 2 Table A.2 (cont’d) -6.25 3.13 -3.65 -3.24 -2.16 -11.01 -3.64 1.32 -1.91 1.16 3.43 2.31 -9.53 -11.53 -2.33 -9.76 -3.95 -9.2 -5.23 -4.7 -3.98 -3.22 -4.7 -8.18 -9.01 -7.28 -8.11 -0.45 -4.42 -1.2 -2.94 -5.1 -1.89 0.56 -6.21 -3.12 2.67 -3.65 -2.26 -0.83 -3.89 -4.42 -7.66 93 -6.222 2.870 -3.989 -3.222 -2.283 -10.472 -3.507 1.335 -2.350 1.320 1.552 1.680 -8.664 -11.107 -0.684 -9.008 -3.883 -9.591 -4.566 -4.365 -3.888 -2.980 -4.709 -9.703 -6.731 -7.497 -6.914 -0.349 -4.485 -0.983 -2.883 -5.064 -2.428 0.665 -6.347 -3.167 2.547 -3.260 -2.125 -1.687 -3.443 -4.455 -6.348 9033 91662 91729 9216 9253 9265 9266 931 9321 93462 9411 94221 949 957 9609 962 96257 9707 9774 9775 980 9818 9872 9893 991 995 996 9570071 9589 9595287 -5.782 -2.970 -16.916 -3.509 1.046 1.553 1.091 -3.539 -4.566 -6.922 -10.890 -4.485 -4.793 -4.150 -9.119 -6.727 -13.603 -1.372 -5.047 -4.351 -5.746 -4.193 2.074 -8.973 0.806 -4.408 -3.016 -6.860 -3.558 -6.347 3 5 3 2 1 1 1 1 7 4 3 2 3 3 2 3 3 6 2 3 2 2 5 3 1 2 2 7 1 2 Table A.2 (cont’d) -5.42 -1.96 -16.23 -3.15 1.2 0.8 0.86 -2.4 -9.3 -6.44 -9.73 -4.44 -3.45 -4.09 -9.84 -8.26 -10.18 -1.46 -6.3 -4.8 -5.29 -4.16 2.51 -10.64 1.07 -4.15 -1.1 -6.74 -3.88 -6.61 94 223106 31275 7765 8020 222536 7761 8121 5541 12375 74626 795 84179 81713 7503 8115 9609 7498 0.155 -4.735 -2.783 -3.574 -5.138 -6.423 -5.941 -10.141 -3.057 -4.409 -9.872 -9.798 -10.798 -2.429 -2.964 -1.372 -1.958 A.3 SAMPLx challenge 
Table A.3: Solvation energy prediction results for SAMPL0 molecules using selected polar features

PubChem ID  Experimental Value  Predicted Value
223106        1.07     0.155
31275        -5.06    -4.735
7765         -3.28    -2.783
8020         -2.93    -3.574
222536       -4.97    -5.138
7761         -6.0     -6.423
8121         -6.34    -5.941
5541         -8.84   -10.141
12375        -3.54    -3.057
74626        -3.82    -4.409
795          -9.63    -9.872
84179        -9.76    -9.798
81713       -11.01   -10.798
7503         -1.93    -2.429
8115         -4.23    -2.964
9609         -1.46    -1.372
7498         -2.38    -1.958

Table A.4: Solvation energy prediction results for SAMPL1 molecules using selected polar features

PubChem ID  Experimental Value  Predicted Value
6544         -5.18    -4.032
14215        -3.64    -4.749
16003        -4.13    -4.768
6053         -6.66    -7.696
3031         -4.71    -3.257
34468        -5.66    -7.334
5569         -3.25    -2.982
2319         -3.51    -2.982
33500        -2.45    -2.982
56160       -14.01   -16.171
15546       -16.43   -18.067
107828      -17.17   -18.016
91729       -16.23   -18.021
52997       -20.25   -16.157
52999       -15.54   -14.734
9411         -9.73   -10.395
10974        -1.88    -3.220
13567        -2.09    -2.984
79123        -1.82    -2.698
40818        -5.73    -4.807
21075956     -4.95    -5.463
7560         -9.13   -10.910
17190        -8.94   -14.287
20419        -7.44   -11.639
12896       -11.24   -10.456
2566         -9.61    -9.836
6950         -6.23    -6.174
81226        -5.99    -7.206
85254        -8.18    -9.963
31645        -9.41   -11.945
6129         -9.45   -10.908
2078         -8.21    -9.919
20848        -7.98    -9.437
13263        -7.65    -8.699
13450        -6.68    -9.082
8606         -9.01    -7.367
4929         -8.43    -8.577
4933         -7.78    -7.700
22188       -11.14   -10.395
9570071      -9.84    -9.373
9595287     -10.18   -13.643
4109        -10.65   -10.412
12309460    -10.22   -10.923
5216         -4.23    -4.181
3030         -9.86    -7.827
3048         -4.82    -4.326
46174049     -4.82    -4.328
3589         -2.55    -3.583
5993         -3.44    -3.395
727          -5.44    -2.790
6423         -1.45    -4.702
3286         -6.1     -9.321
4004         -8.15    -9.307
4790         -4.37    -9.313
25146        -5.74    -9.308
13081        -6.5     -9.315
2730         -5.04    -9.311
5853        -12.74    -9.309
2268         -7.07    -9.301
991         -10.03    -9.316
4130         -6.74    -9.310
3017         -7.19    -9.313
5377791      -6.48    -9.308

Table A.5: Solvation energy prediction results for SAMPL2 molecules using selected polar features

PubChem ID  Experimental Value  Predicted Value
18927701     -0.8      0.530
8263          3.43     2.312
3394         -8.42    -6.588
3672         -7.0     -6.217
753         -13.43    -8.779
3825        -10.78    -7.796
3059         -9.4     -9.119
156391      -10.21    -7.508
2244         -9.94    -7.084
135191      -20.52   -19.263
5793        -25.47   -20.759
7175         -9.37    -7.797
7184         -8.72    -7.797
7456         -9.51    -7.797
8434         -9.2     -7.797
15758       -17.74   -17.666
73272       -15.83   -17.666
3385        -16.92   -17.268
5899        -15.46   -17.262
5802        -18.17   -17.675
1174        -16.59   -16.315
7956        -18.06   -17.619
6809         -9.61   -10.142
2519        -12.64    -9.295
10541        -8.7    -11.448
8370         -2.33    -2.969
6720         -5.22    -3.463
31347        -0.64     0.063
6214         -8.61    -9.867

Table A.6: Solvation energy prediction results for SAMPL3 molecules using selected polar features

PubChem ID  Experimental Value  Predicted Value
6324          1.83     2.035
7095         -2.7     -2.428
9216         -3.15    -4.715
37207        -3.52    -4.694
38252        -3.1     -4.694
34586        -3.56    -4.695
36613        -3.67    -4.695
38253        -4.05    -4.695
15625        -3.37    -4.696
35454        -3.81    -4.696
48889        -3.84    -4.696
38251        -3.71    -4.634
38254        -4.15    -4.632
6214         -0.64     0.087
12418        -1.43    -1.443
6419         -1.23    -0.757
6337         -0.63    -0.515
11           -1.79    -2.225
6365         -0.84    -0.657
6574         -1.99    -1.372
6591         -2.37    -1.372
249266       -2.69    -1.940
36980        -2.46    -2.115
37247        -2.16    -2.115
27588        -2.28    -2.225
36401        -3.48    -3.722
63088        -3.61    -3.307
91662        -1.96    -2.225
38019        -3.04    -2.380
63079        -4.38    -2.910
37037        -4.4     -2.380
38306        -3.17    -3.726
39253        -4.61    -3.726
6278         -0.19     0.187
16318        -2.98    -4.206
18636        -4.53    -4.715
Table A.7: Solvation energy prediction results for SAMPL4 molecules using selected polar features

PubChem ID  Experimental Value  Predicted Value
8079          0.14     1.337
10740        -2.78    -2.780
8418         -3.95    -3.726
11903        -0.85    -0.475
11940        -3.78    -2.780
31275        -5.06    -3.427
9216         -3.15    -4.465
10722        -3.93    -4.940
8908         -2.26    -2.275
26447        -2.53    -4.200
16666        -3.2     -3.166
6251        -23.62   -23.886
6997         -5.66    -6.390
8894         -3.12    -3.166
7583         -2.87    -3.245
7043         -5.33    -6.849
16441        -4.09    -2.845
6998         -4.68    -5.932
442474       -2.49    -1.839
8095         -5.31    -3.667
22227        -3.75    -3.165
61362        -4.51    -3.783
94221        -4.44    -4.530
637566       -4.45    -4.715
643820       -4.78    -4.271
21210        -5.73    -4.825
460          -5.94    -6.267
7144         -5.8     -6.635
17739        -5.26    -6.245
7041         -6.96    -5.184
8082         -5.11    -4.020
2160         -7.43    -5.108
77918        -7.63    -9.305
4044         -6.78    -8.255
30209        -1.66    -1.928
96257        -4.8     -5.623
138747       -4.29    -5.093
31420       -11.85    -8.441
6710         -9.44   -10.139
8341        -11.53   -10.140
20528       -14.21   -10.151
8323         -9.53    -7.613
3100         -9.34    -8.408
53479        -7.78    -7.944
53476        -8.68    -7.883
7258         -7.29    -7.346
93462        -6.44    -4.693

Table A.8: Solvation energy prediction results for SAMPL0 molecules using all features

PubChem ID  Experimental Value  Predicted Value
223106        1.07    -0.137
31275        -5.06    -4.717
7765         -3.28    -2.650
8020         -2.93    -3.370
222536       -4.97    -5.924
7761         -6.0     -6.265
8121         -6.34    -4.709
5541         -8.84   -25.855
12375        -3.54    -2.533
74626        -3.82    -4.686
795          -9.63   -10.133
84179        -9.76    -9.596
81713       -11.01    -9.598
7503         -1.93    -0.662
8115         -4.23    -2.170
9609         -1.46    -1.330
7498         -2.38    -1.372

Table A.9: Solvation energy prediction results for SAMPL1 molecules using all features

PubChem ID  Experimental Value  Predicted Value
6544         -5.18    -4.385
14215        -3.64    -4.262
16003        -4.13    -4.262
6053         -6.66    -8.247
3031         -4.71    -2.672
34468        -5.66    -3.900
5569         -3.25    -2.695
2319         -3.51    -2.703
33500        -2.45    -2.727
56160       -14.01   -16.387
15546       -16.43   -16.068
107828      -17.17   -18.361
91729       -16.23   -18.285
52997       -20.25   -15.588
52999       -15.54   -14.088
9411         -9.73   -11.020
10974        -1.88    -3.029
13567        -2.09    -2.159
79123        -1.82    -3.022
40818        -5.73    -4.990
21075956     -4.95    -5.421
7560         -9.13   -11.112
17190        -8.94   -12.631
20419        -7.44   -11.561
12896       -11.24   -10.571
2566         -9.61   -10.656
6950         -6.23    -6.238
81226        -5.99    -8.910
85254        -8.18    -8.932
31645        -9.41   -12.282
6129         -9.45   -10.462
2078         -8.21    -9.347
20848        -7.98   -11.169
13263        -7.65    -9.850
13450        -6.68    -8.214
8606         -9.01    -7.741
4929         -8.43    -8.199
4933         -7.78    -8.772
22188       -11.14   -11.029
9570071      -9.84    -9.651
9595287     -10.18   -13.044
4109        -10.65    -9.827
12309460    -10.22    -9.443
5216         -4.23    -3.702
3030         -9.86    -7.995
3048         -4.82    -7.429
46174049     -4.82    -7.451
3589         -2.55    -4.050
5993         -3.44    -3.070
727          -5.44    -2.285
6423         -1.45    -0.243
3286         -6.1     -4.572
4004         -8.15    -4.822
4790         -4.37    -4.503
25146        -5.74    -4.772
13081        -6.5     -4.658
2730         -5.04    -4.649
5853        -12.74    -4.825
2268         -7.07    -4.783
991         -10.03    -4.789
4130         -6.74    -4.916
3017         -7.19    -5.018
5377791      -6.48    -4.373

Table A.10: Solvation energy prediction results for SAMPL2 molecules using all features

PubChem ID  Experimental Value  Predicted Value
18927701     -0.8      0.603
8263          3.43     1.318
3394         -8.42    -6.511
3672         -7.0     -5.764
753         -13.43    -9.164
3825        -10.78   -13.958
3059         -9.4     -8.420
156391      -10.21    -8.337
2244         -9.94    -7.683
135191      -20.52   -20.937
5793        -25.47   -23.679
7175         -9.37    -7.329
7184         -8.72    -6.887
7456         -9.51    -7.729
8434         -9.2     -7.702
15758       -17.74   -16.623
73272       -15.83   -13.920
3385        -16.92   -16.831
5899        -15.46   -15.115
5802        -18.17   -17.318
1174        -16.59   -17.474
7956        -18.06   -14.373
6809         -9.61   -11.710
2519        -12.64   -11.015
10541        -8.7    -14.316
8370         -2.33    -3.180
6720         -5.22    -3.070
31347        -0.64    -0.723
6214         -8.61    -9.778
Table A.11: Solvation energy prediction results for SAMPL3 molecules using all features

PubChem ID  Experimental Value  Predicted Value
6324          1.83     2.310
7095         -2.7     -3.083
9216         -3.15    -4.577
37207        -3.52    -4.479
38252        -3.1     -4.479
34586        -3.56    -4.531
36613        -3.67    -4.532
38253        -4.05    -4.579
15625        -3.37    -4.631
35454        -3.81    -4.623
48889        -3.84    -4.630
38251        -3.71    -4.564
38254        -4.15    -4.675
6214         -0.64    -1.590
12418        -1.43    -2.416
6419         -1.23    -1.545
6337         -0.63    -0.608
11           -1.79    -1.115
6365         -0.84    -0.801
6574         -1.99    -2.376
6591         -2.37    -2.414
249266       -2.69    -3.299
36980        -2.46    -2.137
37247        -2.16    -2.194
27588        -2.28    -2.024
36401        -3.48    -2.217
63088        -3.61    -3.091
91662        -1.96    -2.293
38019        -3.04    -2.805
63079        -4.38    -3.006
37037        -4.4     -2.733
38306        -3.17    -2.719
39253        -4.61    -2.626
6278         -0.19    -1.679
16318        -2.98    -2.528
18636        -4.53    -1.355

Table A.12: Solvation energy prediction results for SAMPL4 molecules using all features

PubChem ID  Experimental Value  Predicted Value
8079          0.14     0.870
10740        -2.78    -3.063
8418         -3.95    -3.848
11903        -0.85    -0.888
11940        -3.78    -3.011
31275        -5.06    -3.679
9216         -3.15    -5.498
10722        -3.93    -4.864
8908         -2.26    -2.107
26447        -2.53    -3.283
16666        -3.2     -2.917
6251        -23.62   -25.954
6997         -5.66    -6.143
8894         -3.12    -2.278
7583         -2.87    -3.669
7043         -5.33    -4.852
16441        -4.09    -3.572
6998         -4.68    -6.869
442474       -2.49    -2.027
8095         -5.31    -4.956
22227        -3.75    -3.463
61362        -4.51    -3.635
94221        -4.44    -4.034
637566       -4.45    -4.185
643820       -4.78    -4.031
21210        -5.73    -4.109
460          -5.94    -5.257
7144         -5.8     -7.423
17739        -5.26    -6.888
7041         -6.96    -5.976
8082         -5.11    -5.291
2160         -7.43    -5.097
77918        -7.63    -9.472
4044         -6.78    -8.697
30209        -1.66    -2.491
96257        -4.8     -4.599
138747       -4.29    -3.502
31420       -11.85    -9.133
6710         -9.44    -9.773
8341        -11.53   -11.820
20528       -14.21   -11.259
8323         -9.53    -8.767
3100         -9.34    -7.481
53479        -7.78    -8.435
53476        -8.68    -9.121
7258         -7.29    -7.877
93462        -6.44    -4.586

APPENDIX B

SUPPLEMENTARY MATERIALS FOR TOXICITY ENDPOINT PREDICTION

This part of the supplementary materials contains MT-DNN results with selected descriptors for all four datasets. Tables B.1–B.4 correspond to results calculated with descriptors whose importance is higher than 2.5e-4, 5e-4, 7.5e-4, and 1e-3, respectively. Tables B.5–B.12 contain the results obtained with the ESTDs proposed in the Discussion section using different algorithms. The validation statistics reported in all of these tables are written out below.
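The columns of Tables B.1–B.12 follow common QSAR validation conventions: R² is the squared correlation between experimental values and predictions, k and R₀² come from the regression of experimental on predicted values through the origin, and Coverage is the fraction of molecules for which a prediction is made. A minimal write-up of these statistics, assuming the standard Golbraikh–Tropsha formulation (the thesis may instead use the complementary variant with the roles of y and ŷ exchanged):

```latex
% Validation statistics of Tables B.1--B.12, assuming Golbraikh--Tropsha
% conventions: y_i experimental, \hat{y}_i predicted, \bar{y} and
% \bar{\hat{y}} the corresponding means, n the number of molecules.
\begin{align*}
R^2 &= \frac{\left(\sum_i (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})\right)^2}
            {\sum_i (y_i - \bar{y})^2 \, \sum_i (\hat{y}_i - \bar{\hat{y}})^2},
&
k &= \frac{\sum_i y_i \hat{y}_i}{\sum_i \hat{y}_i^2},
\\
R_0^2 &= 1 - \frac{\sum_i (y_i - k \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2},
&
\mathrm{RMSE} &= \Bigl(\tfrac{1}{n}\sum_i (y_i - \hat{y}_i)^2\Bigr)^{1/2},
\\
\mathrm{MAE} &= \tfrac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert . & &
\end{align*}
```

In that validation literature, (R² − R₀²)/R² < 0.1 together with 0.85 ≤ k ≤ 1.15 is a common acceptability criterion, which is why these two quantities are tabulated alongside RMSE and MAE.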
Table B.1: Performances of different descriptor groups with importance threshold 2.5e-4

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.784  0.051         0.971  0.685  0.459  1.000
LC50DM    0.760  0.145         0.943  0.850  0.550  1.000
IGC50     0.760  0.078         0.981  0.482  0.334  1.000
LD50      0.617  0.306         0.954  0.598  0.433  0.997

Table B.2: Performances of different descriptor groups with importance threshold 5e-4

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.764  0.062         0.962  0.719  0.470  1.000
LC50DM    0.742  0.152         0.951  0.877  0.552  1.000
IGC50     0.757  0.075         0.984  0.486  0.338  1.000
LD50      0.612  0.325         0.954  0.601  0.437  0.997

Table B.3: Performances of different descriptor groups with importance threshold 7.5e-4

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.772  0.064         0.958  0.708  0.468  1.000
LC50DM    0.729  0.157         0.945  0.899  0.565  1.000
IGC50     0.751  0.076         0.983  0.491  0.341  1.000
LD50      0.600  0.341         0.953  0.611  0.439  0.997

Table B.4: Performances of different descriptor groups with importance threshold 1e-3

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.764  0.063         0.963  0.717  0.467  1.000
LC50DM    0.677  0.204         0.945  0.981  0.611  1.000
IGC50     0.746  0.075         0.982  0.497  0.342  1.000
LD50      0.607  0.325         0.955  0.605  0.437  0.997
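The importance thresholds that define Tables B.1–B.4 amount to a simple filter on model-derived descriptor importances before retraining on the surviving columns. A minimal sketch of such a filter, assuming a scikit-learn random forest as the importance source (the names X, y, and select_by_importance are illustrative, not from the original pipeline):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def select_by_importance(X, y, threshold=2.5e-4, n_estimators=500, seed=0):
    """Fit a random forest and keep the descriptor columns whose
    impurity-based importance exceeds the given threshold."""
    rf = RandomForestRegressor(n_estimators=n_estimators, random_state=seed)
    rf.fit(X, y)
    keep = np.flatnonzero(rf.feature_importances_ > threshold)
    return X[:, keep], keep

# Illustrative usage on random data standing in for a descriptor matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = rng.normal(size=200)
X_sel, kept = select_by_importance(X, y, threshold=2.5e-4)
print(X_sel.shape, kept[:10])
```

Raising the threshold from 2.5e-4 to 1e-3 shrinks the descriptor set, which is the trade-off the four tables above probe.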
Table B.5: Performances of RF on different datasets using only the ESTDs proposed in the Discussion section

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.685  0.324         0.947  0.829  0.608  1.000
LC50DM    0.418  1.749         0.960  1.327  1.023  1.000
IGC50     0.649  0.407         0.967  0.584  0.414  1.000
LD50      0.585  0.936         0.947  0.629  0.471  0.999

Table B.6: Performances of RF on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.732  0.314         0.948  0.776  0.561  1.000
LC50DM    0.449  1.517         0.957  1.289  0.973  1.000
IGC50     0.737  0.245         0.979  0.510  0.370  1.000
LD50      0.615  0.766         0.948  0.607  0.455  0.997

Table B.7: Performances of GBDT on different datasets using only the ESTDs proposed in the Discussion section

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.687  0.119         0.959  0.829  0.583  1.000
LC50DM    0.490  0.489         0.969  1.261  0.917  1.000
IGC50     0.737  0.072         0.987  0.508  0.348  1.000
LD50      0.609  0.431         0.958  0.604  0.449  0.999

Table B.8: Performances of GBDT on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.764  0.103         0.960  0.715  0.485  1.000
LC50DM    0.498  0.563         0.957  1.236  0.913  1.000
IGC50     0.791  0.053         0.994  0.451  0.313  1.000
LD50      0.635  0.328         0.959  0.583  0.438  0.997

Table B.9: Performances of MT-DNN on different datasets using only the ESTDs proposed in the Discussion section

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.738  0.012         1.004  0.763  0.497  1.000
LC50DM    0.723  0.001         0.993  0.907  0.629  1.000
IGC50     0.736  0.001         1.008  0.506  0.365  1.000
LD50      0.611  0.000         0.998  0.602  0.442  0.999

Table B.10: Performances of MT-DNN on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.771  0.012         1.017  0.716  0.467  1.000
LC50DM    0.725  0.000         0.999  0.903  0.597  1.000
IGC50     0.768  0.000         1.002  0.473  0.335  1.000
LD50      0.632  0.000         1.000  0.585  0.427  0.997

Table B.11: Performances of Consensus (MT-DNN and GBDT) on different datasets using only the ESTDs proposed in the Discussion section

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.745  0.084         0.962  0.745  0.496  1.000
LC50DM    0.653  0.297         0.968  1.017  0.736  1.000
IGC50     0.765  0.105         0.979  0.476  0.336  1.000
LD50      0.635  0.401         0.956  0.584  0.431  0.999

Table B.12: Performances of Consensus (MT-DNN and GBDT) on different datasets using the ESTDs proposed in the Discussion section along with physical descriptors

Dataset   R²     (R²−R₀²)/R²   k      RMSE   MAE    Coverage
LC50      0.792  0.074         0.958  0.674  0.444  1.000
LC50DM    0.674  0.288         0.960  0.985  0.711  1.000
IGC50     0.802  0.068         0.987  0.437  0.304  1.000
LD50      0.657  0.321         0.957  0.565  0.420  0.997
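Tables B.11 and B.12 report a consensus of MT-DNN and GBDT. This appendix does not spell out the combination rule; the unweighted element-wise mean sketched below is one minimal assumption (the two prediction arrays are illustrative placeholders):

```python
import numpy as np

def consensus(pred_dnn, pred_gbdt):
    # Average two models' predictions compound by compound.
    return (np.asarray(pred_dnn, dtype=float)
            + np.asarray(pred_gbdt, dtype=float)) / 2.0

# Illustrative placeholder predictions for five compounds.
pred_dnn = np.array([-2.1, -3.4, -0.9, -5.0, -1.7])
pred_gbdt = np.array([-2.5, -3.0, -1.1, -4.6, -1.9])
print(consensus(pred_dnn, pred_gbdt))
```

A weighted mean, with weights tuned on a validation set, would be a natural refinement when the two base models differ markedly in accuracy.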