ALGEBRAIC TOPOLOGY AND MACHINE LEARNING FOR
BIOMOLECULAR MODELING
By
Zixuan Cang
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Applied Mathematics - Doctor of Philosophy
2018
ABSTRACT
ALGEBRAIC TOPOLOGY AND MACHINE LEARNING FOR BIOMOLECULAR MODELING
By
Zixuan Cang
Data is expanding at an unprecedented speed in both quantity and size. Topological data
analysis provides excellent tools for analyzing high-dimensional and highly complex data.
Inspired by topological data analysis's capacity for robust, multiscale characterization of
data, and motivated by the demand for practical predictive tools in computational biology
and biomedical research, this dissertation extends the capability of persistent homology
toward quantitative and predictive data analysis tools with an emphasis on biomolecular
systems.
Although persistent homology is almost parameter free, careful treatment is still needed
to build practically useful prediction models for realistic systems. This dissertation carefully
assesses the representability of persistent homology for biomolecular systems and introduces
a collection of characterization tools for both macromolecules and small molecules, focusing on
intra- and intermolecular interactions, chemical complexity, electrostatics, and geometry.
These representations are then coupled with deep learning and machine learning methods for
several problems in drug design and biophysical research.
In real-world applications, data often come with heterogeneous dimensions and components.
For example, in addition to location, atoms of biomolecules can also be labeled with
chemical types, partial charges, and atomic radii. While persistent homology is powerful in
analyzing the geometry of data, it lacks the ability to handle non-geometric information.
Based on cohomology, we introduce a method that attaches non-geometric information
to the topological invariants in persistent homology analysis. This method is useful not only
for biomolecules but also in general situations where the data carry
both geometric and non-geometric information.
In addition to describing biomolecular systems as static frames, we are often interested
in the dynamics of these systems. An efficient approach is to assign an oscillator to each atom
and study the coupled dynamical system induced by atomic interactions. To this end, we
propose a persistent homology based method for analyzing the trajectories generated by
the coupled dynamical system.
The methods developed in this dissertation have been applied to several problems, namely,
prediction of protein stability changes upon mutation, protein-ligand binding affinity
prediction, virtual screening, and protein flexibility analysis. The tools have shown top
performance in both commonly used validation benchmarks and community-wide blind prediction
challenges in drug design.
Copyright by
ZIXUAN CANG
2018
To my parents, for their love.
ACKNOWLEDGMENTS
I would particularly like to thank my advisor, Professor Guo-Wei Wei, for his enlightening
guidance and tremendous support, which have shaped the results in this thesis. His passion for
research, visionary insights, and bold ideas have greatly encouraged me and influenced my
path.

I want to thank Professor Elizabeth Munch and Professor Yiying Tong for their mathematical
instruction, fruitful discussions, and encouragement.

The interdisciplinary research experience I obtained in collaboration with Professor
Ke Dong, Professor Heedeok Hong, and Professor Jian Hu is invaluable to the application
part of this thesis. I would like to thank Professor Gunnar Carlsson for useful discussions
and support. In addition, I want to express my gratitude to Professor Peter Bates and
Professor Changyi Wang for their guidance, to Dr. Lin Mu for offering me the opportunity of
collaborating at ORNL, and to Professor Di Liu for serving on my committee. I thank
my undergraduate advisors, Professor Zhong Tan and Professor Jianyong Wang at Xiamen
University, for motivating and supporting me. I would also like to thank my friends and
colleagues who have supported me and enriched my life during my study here.

Last but not least, I would like to thank my parents, Wei Shi and Ping Cang, for their
unchanging love and trust, and my girlfriend, Yiqing Yang, for her support and understanding
throughout my research career.
TABLE OF CONTENTS

LIST OF TABLES . . . x

LIST OF FIGURES . . . xii

Chapter 1  Introduction . . . 1
  1.1 Applied algebraic topology . . . 1
  1.2 Machine learning and deep learning . . . 5
  1.3 Biomolecular modeling . . . 8
  1.4 Motivation . . . 10
  1.5 Outline . . . 11

Chapter 2  Background . . . 13
  2.1 Applied algebraic topology . . . 13
    2.1.1 Simplicial homology . . . 13
    2.1.2 Filtration and persistence . . . 15
    2.1.3 Barcode space metrics . . . 19
  2.2 Coupled dynamical systems . . . 20
    2.2.1 Oscillators and coupling . . . 20
    2.2.2 Stability and controllability . . . 21
  2.3 Machine learning . . . 21
    2.3.1 K-nearest neighbor algorithm . . . 22
    2.3.2 Decision tree . . . 23
    2.3.3 Ensemble of trees . . . 25
    2.3.4 Deep learning . . . 26
  2.4 Biomolecular modeling . . . 29
    2.4.1 Proteins and small molecules . . . 29
    2.4.2 Physical modeling . . . 29
    2.4.3 Sequence tools . . . 32

Chapter 3  TopologyNet: Deep convolutional neural networks based on topology for biomolecular property predictions . . . 33
  3.1 Introduction . . . 33
  3.2 Methods . . . 38
    3.2.1 Persistent homology . . . 38
    3.2.2 Topological representation of biomolecules . . . 39
    3.2.3 Neuron for persistence barcode . . . 49
    3.2.4 Multichannel topological convolutional neural network . . . 50
  3.3 Results . . . 56
    3.3.1 Deep learning prediction of protein-ligand binding affinities . . . 56
    3.3.2 Deep learning prediction of protein folding free energy changes upon mutation . . . 59
    3.3.3 Multi-task deep learning prediction of membrane protein mutation impacts . . . 61
  3.4 Discussion and conclusion . . . 63

Chapter 4  Persistent cohomology for data with heterogeneous dimensions . . . 68
  4.1 Introduction . . . 68
  4.2 Methods . . . 72
    4.2.1 Cohomology . . . 72
    4.2.2 Smoothed cocycle . . . 73
    4.2.3 Enriched persistent barcode . . . 75
    4.2.4 Preprocessing of the input function . . . 77
    4.2.5 Modified Wasserstein distance . . . 79
  4.3 Examples and results . . . 81
    4.3.1 A minimalist example . . . 81
    4.3.2 Example datasets . . . 82
    4.3.3 Wasserstein distance based similarity . . . 86
    4.3.4 Analysis of molecules . . . 87
    4.3.5 An application to protein-ligand binding . . . 89
  4.4 Discussion and conclusion . . . 95

Chapter 5  Evolutionary homology for coupled dynamical systems . . . 97
  5.1 Introduction . . . 97
  5.2 Methods . . . 99
    5.2.1 Coupled dynamical systems . . . 100
      5.2.1.1 Oscillators and coupling . . . 100
      5.2.1.2 Stability and controllability . . . 101
      5.2.1.3 Topological learning . . . 104
    5.2.2 Evolutionary homology (EH) and the EH barcodes . . . 106
      5.2.2.1 Filtration function defined for coupled dynamical systems . . . 107
      5.2.2.2 Definition of evolutionary homology . . . 110
    5.2.3 Protein residue flexibility analysis . . . 110
  5.3 Results . . . 114
    5.3.1 Disordered and flexible protein regions . . . 114
    5.3.2 Protein B-factor prediction . . . 115
  5.4 Conclusion . . . 119

Chapter 6  Topological characterization of static macrobiomolecules and small molecules . . . 121
  6.1 Introduction . . . 121
  6.2 Biological considerations . . . 122
  6.3 Methods . . . 125
    6.3.1 Element specific persistent homology . . . 125
    6.3.2 Construction of distance matrix . . . 126
      6.3.2.1 Multi-level persistent homology . . . 126
      6.3.2.2 Interactive persistent homology . . . 128
      6.3.2.3 Correlation function based persistent homology . . . 129
      6.3.2.4 Electrostatic persistence . . . 130
    6.3.3 Feature generation from topological invariants . . . 132
    6.3.4 Machine learning algorithms . . . 136
  6.4 Results . . . 141
    6.4.1 Ligand based protein-ligand binding affinity prediction . . . 143
    6.4.2 Complex based protein-ligand binding affinity prediction . . . 145
    6.4.3 Structure-based virtual screening . . . 148
  6.5 Discussion . . . 156
    6.5.1 Ligand based protein-ligand binding affinity prediction . . . 156
    6.5.2 Complex based protein-ligand binding affinity prediction . . . 163
      6.5.2.1 Robustness of GBT algorithm against redundant element combination features and potential overfitting . . . 163
    6.5.3 Structure-based virtual screening . . . 171
  6.6 Conclusion . . . 173

Chapter 7  Dissertation contribution . . . 176

BIBLIOGRAPHY . . . 180
LIST OF TABLES

Table 3.1: Topological representations of protein-ligand complexes. . . . 46

Table 3.2: Topological representations for the protein mutation problem. . . . 47

Table 3.3: Performance comparisons of TNet-BP and other methods. . . . 58

Table 3.4: Performance comparisons of TNet-MP and other methods. . . . 61

Table 3.5: Performance comparisons of TNet-MMP and other methods. . . . 63

Table 4.1: Candidate values for hyperparameters of the gradient boosting trees model. . . . 95

Table 4.2: The predictor performance is evaluated by training on the PDBBind refined
set excluding the core set and testing on the core set of a certain year's
version. The median Pearson's correlation coefficient (root mean squared
error) among 10 repeated experiments is reported. . . . 95

Table 5.1: The averaged Pearson correlation coefficients (RP) between the computed
values (blind prediction for the topological features and regression for the
rest of the models) and the experimental B-factors for a set of 364 proteins
[146] (Left: Prediction RPs based on EH barcodes. Right: A comparison
of the RPs of predictions from different methods.). Here, EH is the linear
regression using EH∞,0, EH∞,1, EH1,0, EH1,1, EH2,0, and EH2,1 within
each protein. For a few large and multi-chain proteins (i.e., 1F8R, 1H6V,
1KMM, 2D5W, 3HHP, 1QKI, and 2Q52), to reduce the computational time
and as a good approximation, we compute their EH barcodes on separated
(protein) chains. We see from the table at right that the proposed EH
barcode method outperforms other methods in this application. . . . 118

Table 6.1: Pearson correlation coefficients (RMSE in kcal/mol) of the ligand based
topological model on the S1322 dataset. . . . 144

Table 6.2: Description of the PDBBind datasets. . . . 146

Table 6.3: Pearson correlation coefficients (RMSE in kcal/mol) of different protein-ligand
complex based approaches on PDBBind datasets. . . . 146

Table 6.4: Parameters used in machine learning. . . . 151

Table 6.5: Performance on each protein in the DUD dataset. . . . 154

Table 6.6: AUC comparison of different methods on the DUD dataset. . . . 155

Table 6.7: Experiments for ligand-based protein-ligand binding affinity prediction of
7 protein clusters and 1322 protein-ligand complexes. . . . 161

Table 6.8: Performance of different approaches on the S1322 dataset. . . . 162

Table 6.9: Experiments for protein-ligand-complex-based protein-ligand binding affinity
prediction for the PDBBind datasets. . . . 164

Table 6.10: Performance of different protein-ligand complex based approaches on the
PDBBind datasets. . . . 167

Table 6.11: The AUC for AutoDock Vina, TopVS-ML with only compound features,
TopVS-ML with only protein-compound complex features, and TopVS-ML
with all features. The targets with high quality results by AutoDock Vina
(AUC > 0.8) are reported. . . . 172

Table 6.12: The AUC for AutoDock Vina, TopVS-ML with only compound features,
TopVS-ML with only protein-compound complex features, and TopVS-ML
with all features. The targets with low quality results by AutoDock Vina
(AUC < 0.5) are reported. . . . 173
LIST OF FIGURES

Figure 2.1: Persistence barcodes of alpha complex filtration (bottom left) and Vietoris-Rips
complex filtration (bottom right) for the point cloud (top). The top
and bottom panels of barcodes are H1 and H0 barcodes. . . . 18

Figure 2.2: Example regression decision tree. . . . 24

Figure 2.3: a. Dense layer. b. Convolution layer. . . . 27

Figure 2.4: A practical neural network architecture for multi-task learning. . . . 28

Figure 3.1: An illustration of barcode changes from wild type to mutant proteins. [26] . . . 41

Figure 3.2: Energy cycle of protein-ligand binding free energy modeling. [26] . . . 45

Figure 3.3: Mutation induced protein folding free energy changes. [26] . . . 48

Figure 3.4: An illustration of the 1D convolutional neural network. [26] . . . 51

Figure 3.5: The deep learning architecture for the application to globular proteins. [26] . . . 53

Figure 3.6: The multi-task deep learning architecture for membrane proteins. [26] . . . 54

Figure 3.7: A comparison of behaviors of the GBT based method and the neural
network based method. [26] . . . 66

Figure 4.1: A simple example loop. [25] . . . 82

Figure 4.2: a: A point cloud sampled from two adjacent annuli. b: The corresponding
H1 barcode using alpha complex filtration. [25] . . . 83

Figure 4.3: Example of smoothed H1 cocycle. [25] . . . 83

Figure 4.4: a and b: Two datasets with similar geometry but different information
given on the nodes. c and d: The differences are revealed in the enriched
H1 barcodes. [25] . . . 84

Figure 4.5: Persistent cohomology enriched barcode example of data points sampled
from a porous cuboid. [25] . . . 85

Figure 4.6: D3 dataset sampled from an annulus with randomly assigned values on
the points and the corresponding H1 enriched barcode. . . . 86

Figure 4.7: Wasserstein characteristics curve. . . . 87

Figure 4.8: a: The cucurbit[8]uril molecule viewed from two different angles. The
hydrogen, carbon, nitrogen, and oxygen atoms are colored in white, grey,
blue, and red. b, c, and d: The H1 enriched barcodes obtained by assigning
1 to nodes of the selected atom types (carbon, nitrogen, and oxygen)
and 0 elsewhere. [25] . . . 88

Figure 4.9: a: A structure of the B24N24 cage. The nitrogen and boron atoms are
colored in blue and grey. b: The enriched barcodes obtained by assigning
1 to boron atoms and 0 elsewhere. H1 and H2 barcodes are plotted in
bottom and top panels. [25] . . . 89

Figure 4.10: Enriched barcodes focusing on atomic partial charges. [25] . . . 91

Figure 5.1: a: Chaotic trajectory of one oscillator without coupling. b: The 70
synchronized oscillators associated with the carbon Cα atoms of protein
PDB:1E68 are plotted together. [22] . . . 103

Figure 5.2: The filtration of the simplicial complex associated to three 1-dimensional
trajectories. [22] . . . 109

Figure 5.3: An example of the construction of the evolutionary homology barcode. [22] . . . 111

Figure 5.4: The result of perturbing residue 31 in protein (PDB:1ABA). [22] . . . 112

Figure 5.5: Left: partially disordered protein, model 1 of PDB:2RVQ. Right: well-folded
protein, PDB:1UBQ. [22] . . . 114

Figure 5.6: (a) Models 1-3 of PDB:2ME9 with the disordered region colored in blue,
red, and yellow for the three models. (b) Similar plot as (a) for PDB:2MT6.
(c) Topological features for PDB:2ME9 whose large disordered region is
from residue 28 to residue 85. (d) Topological features for PDB:2MT6
whose large disordered region is from residue 118 to residue 151. [22] . . . 116

Figure 5.7: Barcode plots for two residues. (a) Residue 6 of PDB:2NUH with a
B-factor of 12.13 Å². (b) Residue 49 of PDB:2NUH with a B-factor of
33.4 Å². [22] . . . 117

Figure 5.8: B-factors and the computed topological features. EH shows the linear
regression with EH∞,0, EH∞,1, EH1,0, EH1,1, EH2,0, and EH2,1 within
each protein. (a) PDB:3PSM with 94 residues. (b) PDB:3SZH with 697
residues. [22] . . . 119

Figure 6.1: Multi-level persistent homology on simple small molecules. [20] . . . 128

Figure 6.2: The network architecture of TopBP-DL. [20] . . . 139

Figure 6.3: The network architecture of TopVS-DL. [20] . . . 140

Figure 6.4: An illustration of the topology based machine learning algorithms used in
scoring and virtual screening. [20] . . . 142

Figure 6.5: Statistics of ligands in 7 protein clusters in the S1322 dataset. [20] . . . 157

Figure 6.6: An illustration of similarities between ligands measured by their barcode
space Wasserstein distances. [20] . . . 158

Figure 6.7: Plot of performance against the number of element combinations used. [20] . . . 160

Figure 6.8: Feature robustness tests on PDBBind datasets. [20] . . . 166

Figure 6.9: Assessment of performance of the model on samples with elements that
are rare in the datasets. [20] . . . 170
Chapter 1
Introduction
1.1 Applied algebraic topology
Topology delivers an abstraction of a space by studying the properties that persist as the
space deforms continuously. Topology thus yields a concise description of a space, and methods
from algebraic topology retain geometric information at different levels of detail. Because
of this brevity and capacity for information preservation, algebraic topology methods have
been applied to data analysis, leading to the emergence of a field called topological data
analysis. Topological data analysis is especially powerful at handling high-dimensional and
highly complex data sets and has been applied to many fields such as image processing,
network analysis, and genomics. The emphasis on the application side of this thesis is given
to molecular biology, where quantitative and predictive models are developed.
Homology can distinguish topological spaces by assigning algebraic structures to the
spaces that characterize holes of various dimensions. Intuitively speaking, 0-dimensional
homology characterizes connected components, 1-dimensional homology counts circular
structures, 2-dimensional homology addresses voids or cavities, and higher-dimensional
homology concerns higher-dimensional holes. Given a data set and a chosen parameter
reflecting the geometric scale, one can compute homology by building a complex upon the
data and representing the algebraic structures as vector spaces. There are also different ways
to represent data. For example, a simplicial complex, consisting of points, edges, triangles,
and their higher-dimensional counterparts, can be built upon a point cloud, while a cubical
complex can naturally represent volumetric data. Efficient algorithms have been proposed,
together with several implementations, enabling the analysis of large datasets. Computational
homology has been applied both within mathematics and in other fields. For example, the
Conley index for real-world systems can be computed using computational homology [98, 208].
It has also been applied to topological characterizations of spatiotemporal chaos [76].
The ability to automatically analyze data at multiple scales is important for a practical
data analysis method. If this property holds, fewer assumptions about the given data are
needed, which indicates greater robustness. Persistent homology adds another dimension to
conventional homology by introducing a filtration of the topological space to achieve a
multiscale characterization of data. Instead of computing the homology of a fixed topological
space, persistent homology scans along a filtration of a space in which the building blocks
are added sequentially, ordered by their associated filtration parameter values. The
computation then records not only the homology of each frame of the filtration but also
how the homology generators appear, disappear, and persist along the course of the filtration.
A method named the size function, a 0th dimensional version of persistent homology, was
introduced for applications in computer vision [72, 163]. Persistent homology theory
and practical algorithms were formulated and developed by Edelsbrunner et al. [62]. A
more general theory was later introduced by Zomorodian and Carlsson [216]. The outputs
of a persistent homology computation are collections of homology generators of different
dimensions, each paired with its “birth” and “death” values along the course of the filtration.
They are usually visualized via barcode plots [33, 79], where stacked horizontal line segments
each correspond to a homology generator, or persistence diagrams [43, 36], where each generator
is plotted as a point in the 2-dimensional plane. A quantitative comparison of two persistent
homology computation results can be realized by computing their bottleneck distance or,
more generally, their pth Wasserstein distance [44, 29]. As a topological method, persistent
homology massively reduces the original dimension of data to one while retaining crucial
geometric information through the filtration. These features make persistent homology a
suitable tool for data analysis, especially for complex and high-dimensional data. Persistent
homology also has the potential to be paired with predictive models such as deep learning
and manifold learning.
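The birth-death pairing described above is easiest to see in dimension 0, where persistence along a Vietoris-Rips filtration reduces to a union-find sweep over edges sorted by length: every point is born at filtration value 0, and each edge that merges two connected components kills one bar. The sketch below is a toy illustration for small point clouds (function and variable names are our own, and it is no substitute for the optimized software packages discussed later):

```python
import itertools
import math

def h0_barcode(points):
    """0-dimensional persistence barcode of the Vietoris-Rips filtration
    of a point cloud, via union-find: every point is born at filtration
    value 0; when an edge first merges two connected components, one
    component dies at that edge's length."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Edges enter the filtration ordered by pairwise distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj            # merge: one component dies at d
            bars.append((0.0, d))
    bars.append((0.0, math.inf))       # one component persists forever
    return sorted(bars, key=lambda b: b[1])

# Two well-separated clusters: exactly one long finite bar spans the gap.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(h0_barcode(pts))
```

The long finite bar, whose death value is the distance between the two clusters, is the multiscale signal that distinguishes genuine structure from short-lived noise.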
Since the introduction of persistent homology, there have been tremendous advances
in algorithms, implementations, and theory. The theory of multidimensional persistent
homology has been formulated to address situations where multiple parameters are involved
in building the increasing spaces of the filtration [32]. A practical implementation for the
2-dimensional case was developed, enabling interactive visualization [113]. Zigzag persistent
homology, together with practical algorithms, was introduced to allow traveling in both
directions along the filtration, facilitating the analysis of real-valued functions on topological
spaces by associating homology groups to level sets [30]. Introduced to study persistence
diagram stability, vineyards can be used to analyze time-varying data [45]. Many systems can
be recorded as graphs. Clique complexes can be efficiently constructed for undirected graphs
[215], and the path complex was introduced for directed graphs [84]. The theory of path
complexes was extended with persistence theory, formulating persistent path homology for the
analysis of directed graphs [41]. Efforts have also been made on the dual side of persistent
homology to make use of the richer information carried by cohomology. A more detailed
description of 1-dimensional homology generators was derived by assigning circular coordinates
to the input space using persistent cohomology [54]. Cohomology was also used for
coordinatization in higher dimensional cases [155].
There is a collection of software packages for persistent homology with a wide range of
utilities. Dionysus [129] is able to compute zigzag persistence, persistent cohomology, vineyards,
alpha complex filtration, circular coordinates, and bottleneck and Wasserstein distances.
Ripser [8] focuses on Rips complex filtration and is extremely fast. Perseus [136] speeds up
computation by using discrete Morse theory [127] and provides utilities for cubical complex
filtration. DIPHA provides efficient algorithms enabling distributed computing [9], and PHAT
provides fast matrix reduction implementations [11]. JavaPlex is easy to use through MATLAB
and can conveniently illustrate the concepts [2]; it also includes some approximate
constructions for faster computation. Gudhi [122] is another comprehensive library, with a
Python interface. There are also efficient implementations of both exact and approximate
algorithms for computing the bottleneck and Wasserstein distances [101]. Both computer
science techniques and mathematical properties are used to accelerate computations, and one
of the most important theoretical foundations is the duality between persistent homology and
persistent cohomology [53].
Another major method in topological data analysis is mapper, which builds a graph to
represent the topology of a dataset [171]. Given a point cloud dataset, each data point
is first assigned a value, which can be its eccentricity or one of many other choices; this
assignment is called a filter function. Then, a cover of the range of the filter function
is chosen based on user-defined parameters such as interval length and overlap percentage.
Finally, the data points falling in each cover interval are clustered, each cluster is represented
by a node in the graph, and an edge connects two nodes whose clusters share data points.
Compared to persistent homology, mapper reduces the data dimension while still
keeping a relatively explicit representation of the topology of the data, making it a good
method for visualization, especially for high dimensional datasets. More quantitative
descriptions can be derived by running persistent homology upon mapper graphs. A library
with a GUI that can be called from Python is available [132]. Mapper has found applications
in the biological and biomedical sciences [142, 143, 206].
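The filter-cover-cluster pipeline just described can be sketched in a few lines. The sketch below is an illustrative simplification: it uses the x-coordinate as the filter function (in place of eccentricity) and single-linkage clustering at a fixed threshold `eps`; the function and parameter names (`mapper_graph`, `n_intervals`, `overlap`) are our own, not those of any mapper library.

```python
import math
from itertools import combinations

def mapper_graph(points, n_intervals=4, overlap=0.3, eps=1.5):
    """A minimal mapper sketch: filter each point by its x-coordinate,
    cover the filter range with overlapping intervals, cluster the
    points of each interval by single-linkage at threshold eps, and
    connect two clusters whenever they share data points."""

    def clusters_of(idx):
        # single-linkage connected components via union-find
        parent = {i: i for i in idx}
        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i
        for i, j in combinations(idx, 2):
            if math.dist(points[i], points[j]) <= eps:
                parent[find(i)] = find(j)
        groups = {}
        for i in idx:
            groups.setdefault(find(i), set()).add(i)
        return list(groups.values())

    f = [p[0] for p in points]                  # filter function
    lo, hi = min(f), max(f)
    length = (hi - lo) / n_intervals
    nodes = []                                  # each node = a cluster of point indices
    for k in range(n_intervals):
        a = lo + k * length - overlap * length  # widened interval of the cover
        b = lo + (k + 1) * length + overlap * length
        nodes.extend(clusters_of([i for i in range(len(points)) if a <= f[i] <= b]))
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]            # shared points give an edge
    return nodes, edges

# A circle sampled at 60 points: the mapper graph recovers a cycle.
circle = [(5 * math.cos(2 * math.pi * t / 60), 5 * math.sin(2 * math.pi * t / 60))
          for t in range(60)]
nodes, edges = mapper_graph(circle)
```

On the sampled circle, the middle cover intervals each split into a top arc and a bottom arc, so the resulting graph is itself a cycle, illustrating how mapper keeps an explicit, visualizable image of the data's topology.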
Computational homology and persistent homology have been applied to various fields,
including image and signal analysis [31, 150, 172, 14, 73, 157], chaotic dynamics characterization
[126, 98], sensor networks [80], complex networks [112, 91], shape recognition [57, 66],
and computational biology [99, 75, 48, 156].
1.2 Machine learning and deep learning
Machine learning models are able to automatically extract information from data and
subsequently make predictions or inferences. The main paradigms are supervised, semi-supervised,
and unsupervised learning. In supervised learning, collections of data entries with descriptions
and labels are given to the model, and once the model has learned from the data, it predicts
the labels of new data given only their descriptions. A simple example of supervised
learning is the linear regression model. There are also situations where a large dataset is given
but only a small portion of the data has known labels. Instead of learning only on the small
amount of labeled data using supervised learning techniques, semi-supervised learning models
also utilize the unlabeled data, which usually helps determine the underlying distribution of
the data. One example is manifold learning, where unlabeled data are used to approximate
the underlying manifold and the similarities between data entries are measured by their
distances on the manifold. Unsupervised learning derives useful information from unlabeled
data by inferring the underlying structure of the data. In this thesis, we mainly focus on
supervised learning with an emphasis on molecular biology.
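The linear regression example mentioned above captures supervised learning in its simplest form: fit parameters to labeled pairs, then predict labels for new inputs. A minimal one-variable ordinary least squares fit (illustrative code using the standard closed-form formulas) looks like:

```python
def fit_line(xs, ys):
    """Ordinary least squares for the one-variable linear model
    y = a*x + b: learn (a, b) from labeled training pairs, then
    predict labels for new inputs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Labels generated by y = 2x + 1; the fit recovers slope 2, intercept 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
predict = lambda x: a * x + b
print(a, b, predict(10.0))  # 2.0 1.0 21.0
```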
There are many competitive machine learning methods, and we list just a few here. The
support vector machine [46] builds a hyperplane to separate samples by solving an optimization
problem. It can also be extended to nonlinear cases by using a nonlinear kernel, which tends
to separate data in a higher dimensional space without the need to explicitly embed the data
in that space. Support vector machines can be used for both supervised
learning and clustering. Another popular class of machine learning methods is ensemble
learning. Ensemble learning relies on the assumption that combining multiple weak learners
can improve prediction performance. A well-known model of this kind is the random forest
[19], where independent decision trees are built and an average of these trees is used as the
final model. Since the decision trees are constructed independently, random forests are usually
good at lowering variance. Another way of ensembling is gradient boosting, such as gradient
tree boosting, where one decision tree at a time is added to the model according to the error
at the current stage [70, 71]. Due to the nature of gradient boosting, it is good at reducing
bias. Inspired by biological neural networks, artificial neural networks [125] were proposed
to mimic the arrangement of biological neurons by stacking nonlinear functions, forming a
complicated composite function with tunable parameters. A common problem for machine
learning models, especially those with large numbers of tunable parameters, is overfitting,
where a model's performance on new data worsens even as its performance on the training
data improves. Many techniques have been developed to prevent overfitting.
For example, each weak learner in an ensemble can be trained on randomly selected
parts of the training data, introducing randomness into the training process. Also, a
regularization term containing norms of the model parameters can be added to the objective
function to avoid making the model too specialized.
Neural networks have potential for complicated situations owing to their highly flexible structure. After advances in dealing with the vanishing gradient problem, where the impact of the back-propagated error on parameters decays too fast across layers, deep learning has flourished. Many different models have been introduced and have delivered state-of-the-art performance in many realistic applications. A stack of regular neural network layers makes a deep neural network, and the number of layers can reach hundreds. Convolutional neural networks significantly reduce the number of parameters compared to dense deep neural networks by taking advantage of locality properties of the input data and applying the same filter to only local connections [111]. Convolutional neural networks are very good at processing image data [107]. Recurrent neural networks allow directed connections between neurons in the same layer following the flow of an input sequence and are good at processing sequential data such as gene sequences, natural languages, and time series. In addition to models that learn functions of inputs, there are also generative models that learn to generate data. When there are related learning tasks, multitask learning or transfer learning takes advantage of the shared properties of the tasks to improve predictive power. Practical multitask learning models can be constructed thanks to the flexibility and hierarchical structure of neural networks. In general, one of the major advantages of machine learning, and especially deep learning methods, is the ability to handle large and diverse datasets. Deep learning methods have established the state of the art in many computational biology problems. For example, one of the most accurate sequence-based protein structure prediction tools assessed by blind prediction challenges is based on a residual network with tens of convolutional layers [191]. The winner of a blind prediction challenge on toxicity prediction (Tox21), in which tens of groups participated, was based on a deep neural network paired with multitask learning [124].
1.3 Biomolecular modeling
Understanding structure-function relationships is a major challenge in molecular-level computational biology. Studying the relationship between the structures of biomolecules and their functions not only helps us understand nature but also aids biomedical research such as drug design. For a biomolecular system, many important properties are related to the function of the system, for example, the binding affinity of protein-ligand complexes, the stability changes induced by amino acid mutations in a protein, and the flexibility of protein residues. The ability to model these basic physical properties further enables the study of functions closer to real-world applications, such as whether a candidate small molecule potentially binds to a protein target and the impact on drug effect when a specific mutation occurs in the target protein.
Physics-based methods build models according to physical laws and are indispensable for molecular modeling, providing predictions and revealing underlying mechanisms. Examples include quantum mechanics calculations, molecular dynamics simulations, and Monte Carlo sampling. There are also more efficient approximations of atomic systems that use a continuum for part of the system. For example, the Poisson-Boltzmann model delivers an efficient description of the electrostatics in the solvation process of molecules by using a continuum solvent. A more complete and affordable parameterization of complex biomolecular systems can be achieved with force fields, usually derived from quantum chemistry computations. Such models are able to treat larger realistic systems such as protein-ligand binding [147, 207]. There are also descriptions of a system that are potentially related to the target property but for which rigorous connections to the target property may be hard to derive. In this case, an empirical model can be constructed combining each component with weights determined from experimental data or from more expensive but accurate models. A widely used setup for empirical models is to combine molecular mechanics energies with the polar part of the solvation process modeled by the Poisson-Boltzmann or Generalized Born model and the nonpolar part of the solvation process reflected by surface areas; these are usually called MM/PBSA or MM/GBSA models [189]. MM/PBSA and MM/GBSA models have been applied to the energy modeling of various biomolecular systems including protein-ligand binding [77] and protein-DNA binding [153].
When there are more detailed descriptions of the systems, possibly hundreds or thousands of descriptors of various types, it is hard to use them effectively in a linear-regression-based empirical model. Machine learning models with more capacity may help in this situation. High capacity usually comes with a larger number of model parameters to be determined, which requires more data. Generally, the performance of a knowledge-based model heavily depends on the quantity and quality of the available data. Knowledge-based models using machine learning methods can achieve better performance than physics-based and empirical methods given sufficient data. At the same time, the descriptors derived from physical models can be crucial for knowledge-based methods.
In this work, we are interested in several biological applications: structure-based protein-ligand binding affinity prediction, where binding affinity is predicted given the structure of the protein-ligand complex; prediction of protein stability changes upon mutation, where structures of the wild-type proteins are given; virtual screening, which aims to determine whether a pair of a target protein and a candidate small molecule could potentially bind; and protein flexibility analysis, where the relative flexibility of protein residues or atoms is to be determined.
Persistent homology has been applied to computational biology [99, 75, 48], including the mathematical modeling and prediction of nanoparticles, proteins, and other biomolecules [198, 195, 75]. Quantitative topological analysis has been cultivated to predict the curvature energy of fullerene isomers [195, 186], characterize protein folding [198], and quantify immunohistochemical effects [176]. Differential-geometry-based persistent homology [186] and multiresolution persistent homology [201] have been proposed to better characterize biomolecular data, detect protein cavities [117], and resolve ill-posed inverse problems in cryo-EM structure determination [200].
1.4 Motivation
The exponential growth of biological data has set the stage for data-driven discovery of structure-function relationships. Indeed, the Protein Data Bank (PDB) has accumulated nearly 130,000 tertiary structures. The availability of 3D structural data enables knowledge-based approaches to offer complementary and competitive predictions of structure-function relationships. Recent advances in machine learning algorithms have made data-driven approaches more competitive and powerful than ever. Arguably, machine learning is one of the most important developments in data analysis, and it has become an indispensable tool in biomolecular data analysis and prediction. Virtually every problem in computational biology and biophysics, such as the prediction of solvation free energies, protein-ligand binding affinities, mutation impacts, pKa values, etc., has a class of knowledge-based approaches that are either parallel or complementary to physics-based approaches.
On the other hand, persistent homology as a data analysis tool can be applied both to data sampled from underlying manifolds and to naturally discrete data such as molecular structures. Indeed, persistent homology has been applied to both qualitative and quantitative analysis of complex molecular structures [99, 75, 48, 198, 195, 176].

Encouraged by persistent homology's ability to extract information and reduce dimension, and by the advantages of machine learning and especially deep learning methods for predictive modeling, we aim at developing competitive predictive models using persistent homology and machine learning, with a focus on molecular biology. Despite the excellent out-of-the-box performance of powerful machine learning methods in many applications, proper featurization of biomolecular systems is crucial to the success of a model. To this end, persistent homology can serve as an appropriate featurization tool. Though persistent homology is good at describing geometry, and is especially powerful compared to other methods in high dimensions, special treatment is still needed to address the chemical and biological complexities when it is applied to predictive modeling of biomolecules.
1.5 Outline
In Chapter 2, we review the background of persistent homology, coupling of dynamical systems, some relevant machine learning and deep learning methods, and several biological applications we worked on. In Chapter 3, we develop a persistent homology based deep learning model using convolutional neural networks and multitask learning for the prediction of protein-ligand binding affinity and mutation induced protein stability change. A construction of a neuron which can directly take persistence barcodes as input is also introduced. We extend the capacity of the topological characterizations in Chapter 4 by embedding physical properties of the molecules into persistence barcodes using cohomology. This extension is also useful in general situations where data come with heterogeneous dimensions. In Chapter 5, we further consider dynamical properties of molecules using coupled dynamical systems and introduce a construction of persistent homology to analyze the resulting trajectories. In Chapter 6, we introduce several descriptions based on persistent homology for both macromolecules and small molecules, and we discuss in detail the representability of persistent homology for biomolecular systems. The dissertation's contributions are summarized in Chapter 7.
Chapter 2
Background
2.1 Applied algebraic topology
2.1.1 Simplicial homology
Topological spaces can be approximated, represented, and discretized by simplicial complexes. An (abstract) simplicial complex is a (finite) collection of sets K = {σi}i where each σi is a subset of a (finite) set K0 called the vertex set. We require that this collection satisfies the following condition: if σi ∈ K and τ is a face of σi (that is, if τ ⊆ σi, commonly denoted τ ≤ σi), then τ ∈ K. If σi has k + 1 distinct vertices {v0, v1, ..., vk}, then σi is called a k-simplex. The k-skeleton of a simplicial complex K is the subcomplex of K consisting of simplices of dimension k and below. While the simplices of the abstract simplicial complexes we build will not have an obvious geometric meaning, there is a more geometric viewpoint from which we often reference simplices. A geometric k-simplex can be regarded as the convex hull of k + 1 affinely independent points in R^d, and because of this we often call a 0-simplex a point, a 1-simplex an edge, a 2-simplex a triangle, and a 3-simplex a tetrahedron. Without confusion, we will use the same symbols for geometric and abstract simplices.
The homology group for a fixed simplicial complex gives a topological characterization which encodes holes of different dimensions. Homology groups are built using linear transformations called boundary operators. A k-chain of the simplicial complex K is a finite formal sum of the k-simplices in K, α = Σ_i a_i σ_i, with coefficients a_i ∈ G where G is a chosen group, for example the widely used Z2, or more generally Zp with a prime p. The group of all k-chains with addition given by the addition of the coefficients is called the k-th chain group and is denoted by Ck(K), or simply Ck when the choice of complex is obvious. Note that when Z2 is used, since it is a field, Ck(K) is in fact a vector space.
The boundary operator ∂k : Ck → Ck−1 is the linear transformation generated by mapping any k-simplex to the signed sum of its codimension-1 faces; namely,

∂k({v0, v1, ..., vk}) = Σ_{i=0}^{k} (−1)^i {v0, ..., v̂i, ..., vk},

where v̂i means that vi is absent. The k-th cycle group, Zk(K), is the kernel of the boundary operator ∂k, with elements called k-cycles. The k-th boundary group, Bk(K), is the image of the boundary operator ∂k+1, and its elements are called k-boundaries. Since ∂k ∘ ∂k+1 = 0, Bk(K) is a subgroup of Zk(K). Thus we can define the k-th homology group, Hk(K), to be the quotient group Zk(K)/Bk(K). Two k-cycles are called homologous if they differ by a boundary; equivalently, if they are in the same equivalence class of Hk(K). Intuitively, if two k-cycles differ from each other by the boundary of a subcomplex, they can roughly be deformed into one another continuously through the subcomplex. Each equivalence class in Hk(K) can be thought of as corresponding to a k-dimensional "loop" in K going around a (k+1)-dimensional "hole": 1-dimensional classes give information about loops going around 2D voids, 2-dimensional classes give information about enclosures of 3D voids, etc. While the analogy is not as nice, 0-dimensional classes give information about connected components of the space.
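To make the boundary operator concrete, the following is a minimal Python sketch (illustrative only; the helper names and the toy complex are our own choices) that assembles the Z2 boundary matrices of a hollow triangle and reads off its Betti numbers βk = dim Zk(K) − dim Bk(K) = dim Hk(K):

```python
import numpy as np
from itertools import combinations

def boundary_matrix(k_simplices, km1_simplices):
    """Z2 boundary matrix: column j marks the codimension-1 faces of simplex j."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    D = np.zeros((len(km1_simplices), len(k_simplices)), dtype=int)
    for j, s in enumerate(k_simplices):
        for face in combinations(s, len(s) - 1):
            D[index[face], j] = 1  # over Z2 the signs (-1)^i all become 1
    return D

def rank_gf2(M):
    """Rank of a 0/1 matrix over Z2 by Gaussian elimination."""
    M = M.copy() % 2
    rank = 0
    rows, cols = M.shape
    for c in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, c]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]       # swap pivot row into place
        for r in range(rows):
            if r != rank and M[r, c]:
                M[r] = (M[r] + M[rank]) % 2       # eliminate the column
        rank += 1
    return rank

# Hollow triangle: three vertices and three edges, no 2-simplex.
verts = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
d1 = boundary_matrix(edges, verts)
r1 = rank_gf2(d1)
betti0 = len(verts) - r1   # dim ker d0 - rank d1, with d0 = 0
betti1 = len(edges) - r1   # dim ker d1 - rank d2, with no 2-simplices
```

The hollow triangle has one connected component and one 1-dimensional hole, so β0 = β1 = 1; adding the 2-simplex (0, 1, 2) would kill the 1-dimensional class.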
2.1.2 Filtration and persistence
We now turn to the case where we have a changing simplicial complex and want to understand something about its structure. Consider a finite simplicial complex K and let f be a real-valued function on the simplices of K which satisfies the following: f(τ) ≤ f(σ) for all simplices τ ≤ σ in K. We will refer to this function as the filtration function. For any x ∈ R, the sublevel set of K associated to x is defined as

K(x) = {σ ∈ K | f(σ) ≤ x}.

Note first that because of our assumptions on f, K(x) is always a simplicial complex, and second that K(x) ⊆ K(y) for any x ≤ y. Further, as x varies, K(x) only changes at the function values defined on the simplices. Since K is assumed to be finite, let {x1 < x2 < ... < xℓ} be the sorted range of f. The filtration of K with respect to f is the ordered sequence of its subcomplexes,

∅ ⊂ K(x1) ⊂ K(x2) ⊂ ... ⊂ K(xℓ) = K.    (2.1)

The filtration of a simplicial complex sets the stage for a thorough topological examination of the space at multiple scales of the filtration parameter, which is the output value of the filtration function f.
The definition of homology is valid for a fixed simplicial complex; however, we are interested in studying the structure of a filtration like that of Eq. (2.1). Functoriality of homology means that such a sequence of inclusions induces linear transformations on the sequence of vector spaces

Hk(K(x1)) → Hk(K(x2)) → ... → Hk(K(xℓ)).    (2.2)

Persistent homology not only characterizes each frame K(xi) in the filtration, but also tracks the appearance and disappearance (commonly referred to as births and deaths) of nontrivial homology classes as the filtration progresses. A collection of vector spaces {Vi} and linear transformations fi : Vi → Vi+1 is called a persistence module, of which Eq. (2.2) is an example. It is a special case of a much more general theorem of Gabriel [74] that sufficiently nice persistence modules can be decomposed uniquely into a finite collection of interval modules [37, 149]. An interval module I_[b,d) is a persistence module for which Vi = Z2 if i ∈ [b, d) and 0 otherwise; fi is the identity when possible, and 0 otherwise. So, given the persistence module of Eq. (2.2), we can decompose it as ⊕_{[b,d)∈B} I_[b,d), and thus fully represent the algebraic information by the discrete collection B. These intervals exactly encode when homology classes appear and disappear in the persistence module. The collection of such intervals can be visualized by plotting points in the 2D half plane {(x, y) | y ≥ x}, which is known as a persistence diagram, or by stacking the horizontal intervals, which is known as a barcode. In this work, for no reason other than convenience, we represent our information using barcodes. We call the barcode resulting from a sequence of trivial homology groups the empty barcode and denote it by ∅. For every interval [b, d) ∈ B, we call b the birth time and d the death time.
We review two widely used constructions of filtrations. The Vietoris-Rips (VR) complex is built upon the 1-skeleton induced by pairwise distances among given points. Given a distance function and a threshold distance, a simplex is in the VR complex if the distance between any pair of vertices in the simplex is smaller than or equal to the threshold. More formally, the VR complex on a finite point set X at threshold δ is defined as

VR(X, δ) = {σ ⊆ X | d(v, v′) ≤ δ, ∀ v, v′ ∈ σ}.    (2.3)

The distance function can naturally be the Euclidean distance but can also be another user-defined distance tailored for a specific application. Since it does not directly rely on the exact geometry, a more abstract usage is possible with distance functions that do not satisfy the triangle inequality. Another complex construction is the alpha complex, which is closely related to geometric modeling. On a finite point set X in Euclidean space, we can build the Voronoi diagram and let V(v) be the Voronoi cell associated to v ∈ X. Then the alpha complex associated to the parameter ε is defined as

A(X, ε) = {σ ⊆ X | ∩_{v∈σ} (V(v) ∩ B(v, ε)) ≠ ∅},    (2.4)

where B(v, ε) is the ball of radius ε centered at v. It should be noted that the alpha complex is a subset of the Delaunay triangulation and is thus very efficient in terms of computation. The alpha complex is a faithful representation of the geometry. Though usually more computationally costly, the Vietoris-Rips complex is useful when the input point set does not have natural coordinates (though a geometric embedding can be done) or when interaction scores are of interest instead of geometric distances. Since A(X, ε) ⊆ A(X, ε′) for ε ≤ ε′ and VR(X, δ) ⊆ VR(X, δ′) for δ ≤ δ′, these constructions of simplicial complexes induce proper filtrations, and ε and δ are called the filtration parameters. An example of persistence barcodes is shown in Figure 2.1.
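Eq. (2.3) translates almost directly into code. The sketch below (our own illustration; in practice one would use a dedicated library such as GUDHI) enumerates the Vietoris-Rips simplices of a small point set up to dimension 2:

```python
import numpy as np
from itertools import combinations

def vietoris_rips(points, delta, max_dim=2):
    """All simplices whose pairwise vertex distances are <= delta (Eq. 2.3)."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # pairwise Euclidean distance matrix
    dist = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    complex_ = [(i,) for i in range(n)]  # every point is a 0-simplex
    for k in range(1, max_dim + 1):
        for sigma in combinations(range(n), k + 1):
            if all(dist[u, v] <= delta for u, v in combinations(sigma, 2)):
                complex_.append(sigma)
    return complex_

# Corners of the unit square at threshold delta = 1.
square = [(0, 0), (1, 0), (0, 1), (1, 1)]
K = vietoris_rips(square, delta=1.0)
```

At δ = 1, the four corners are joined by the four sides of the square but not the diagonals, so the resulting complex is a combinatorial circle with four edges and no triangles.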
Figure 2.1: Persistence barcodes of the alpha complex filtration (bottom left) and the Vietoris-Rips complex filtration (bottom right) for the point cloud (top). The top and bottom panels of the barcodes are the H1 and H0 barcodes, respectively.
2.1.3 Barcode space metrics
The similarity between persistence barcodes can be quantified by barcode space distances. The most commonly used metrics are the bottleneck distance [43] and the p-Wasserstein distances [44]. The definitions of these two distances are summarized as follows.
The l∞ distance between two persistence bars I1 = [b1, d1) and I2 = [b2, d2) is defined to be

∆(I1, I2) = max{|b2 − b1|, |d2 − d1|}.

The existence of a bar I = [b, d) is measured as

λ(I) := (d − b)/2 = min_{x∈R} ∆(I, [x, x)).

This can be interpreted as measuring the distance from the bar to the closest degenerate bar, whose birth and death values are the same.
For two finite barcodes B1 = {I1α}α∈A and B2 = {I2β}β∈B, a partial bijection is defined to be a bijection θ : A′ → B′ from A′ ⊆ A to B′ ⊆ B. In order to define the p-Wasserstein distance, we define the following penalty for θ:

P(θ) = [ Σ_{α∈A′} ∆(I1α, I2θ(α))^p + Σ_{α∈A\A′} λ(I1α)^p + Σ_{β∈B\B′} λ(I2β)^p ]^{1/p}.

Then the p-Wasserstein distance is defined as

dW,p(B1, B2) = min_{θ∈Θ} P(θ),

where Θ is the set of all possible partial bijections from A to B. Intuitively, a partial bijection θ is mostly penalized for connecting two bars with a large difference, measured by ∆(·,·), and for connecting long bars to degenerate bars, measured by λ(·).
The bottleneck distance is an L∞ analogue of the p-Wasserstein distance. The bottleneck penalty of a partial matching θ is defined as

P(θ) = max{ max_{α∈A′} ∆(I1α, I2θ(α)), max_{α∈A\A′} λ(I1α), max_{β∈B\B′} λ(I2β) }.

The bottleneck distance is defined as

dW,∞(B1, B2) = min_{θ∈Θ} P(θ).
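For very small barcodes, these distances can be evaluated by brute force directly from the definitions. The sketch below (illustrative only; practical implementations use optimal-matching algorithms rather than enumeration) computes the p-Wasserstein distance by minimizing P(θ) over all partial bijections:

```python
from itertools import combinations, permutations

def bar_dist(I1, I2):
    """l-infinity distance Delta between two bars [b, d)."""
    return max(abs(I2[0] - I1[0]), abs(I2[1] - I1[1]))

def lam(I):
    """lambda(I): distance from I to the nearest degenerate bar [x, x)."""
    return (I[1] - I[0]) / 2

def wasserstein(B1, B2, p=1):
    """p-Wasserstein distance by brute force over all partial bijections.
    Feasible only for tiny barcodes."""
    best = float("inf")
    for k in range(min(len(B1), len(B2)) + 1):
        for sub1 in combinations(range(len(B1)), k):
            for sub2 in permutations(range(len(B2)), k):
                # matched bars pay Delta^p, unmatched bars pay lambda^p
                cost = sum(bar_dist(B1[i], B2[j]) ** p
                           for i, j in zip(sub1, sub2))
                cost += sum(lam(B1[i]) ** p
                            for i in range(len(B1)) if i not in sub1)
                cost += sum(lam(B2[j]) ** p
                            for j in range(len(B2)) if j not in sub2)
                best = min(best, cost ** (1.0 / p))
    return best

# Matching the two bars (cost 0.5) beats deleting both (cost 1.0 + 1.25).
d = wasserstein([(0.0, 2.0)], [(0.0, 2.5)], p=1)
```

Replacing the sums by a maximum in the same enumeration yields the bottleneck distance dW,∞.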
2.2 Coupled dynamical systems
The general theory of control of coupled dynamical systems has been well studied in the literature [148, 92, 192, 197]. A brief review is given in this section.
2.2.1 Oscillators and coupling
We consider a collection of N n-dimensional dynamical systems, each originally governed by the same equation

dui/dt = g(ui),  i = 1, 2, ..., N,

where ui = (ui,1, ui,2, ..., ui,n)^T is a column vector of size n.
The individual dynamical systems can be coupled with an N × N coupling matrix A by building an (N × n)-dimensional system. We denote (u1, u2, ..., uN)^T by u, where ui = (ui,1, ui,2, ..., ui,n)^T. The coupled system is an (N × n)-dimensional dynamical system modeled as

du/dt = G(u) + ε(A ⊗ Γ)u,    (2.5)

where G(u) = (g(u1), g(u2), ..., g(uN))^T, ε is a parameter reflecting the coupling strength, and Γ is an n × n predefined linking matrix specifying how the variables of the individual systems are coupled across the systems.
2.2.2 Stability and controllability
The coupled systems are said to be in a synchronous state if

u1(t) = u2(t) = ... = uN(t) = s(t).

The stability can be analyzed using v = (u1 − s, u2 − s, ..., uN − s)^T with the following equation obtained by linearizing Eq. (2.5):

dv/dt = [IN ⊗ Dg(s) + ε(A ⊗ Γ)]v,    (2.6)

where IN is the N × N identity matrix and Dg(s) is the Jacobian of g at s. The stability of the synchronous state can be studied by eigenvalue analysis of the coupling matrix A.
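The right-hand side of Eq. (2.5) can be assembled directly with a Kronecker product. The following sketch (our own toy example; the rotator dynamics, coupling matrix, and linking matrix are chosen purely for illustration) evaluates du/dt for two diffusively coupled 2-dimensional systems:

```python
import numpy as np

def coupled_rhs(u, g, A, Gamma, eps):
    """Right-hand side of Eq. (2.5): du/dt = G(u) + eps * (A kron Gamma) u."""
    N = A.shape[0]
    n = Gamma.shape[0]
    U = u.reshape(N, n)                       # split u into the N subsystems
    G = np.concatenate([g(Ui) for Ui in U])   # stack the individual dynamics
    return G + eps * (np.kron(A, Gamma) @ u)  # add the coupling term

def g(x):
    """Individual dynamics: a simple planar rotation."""
    return np.array([-x[1], x[0]])

A = np.array([[-1.0, 1.0], [1.0, -1.0]])    # diffusive coupling matrix
Gamma = np.array([[1.0, 0.0], [0.0, 0.0]])  # couple only the first variable
u0 = np.array([1.0, 0.0, 0.0, 0.5])         # initial state of both systems
du = coupled_rhs(u0, g, A, Gamma, eps=0.1)
```

Since each row of the diffusive coupling matrix A sums to zero, the coupling term vanishes exactly on the synchronous state u1 = u2.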
2.3 Machine learning
In this section, we review a few supervised learning methods. For each data entry, we assume we have a description of it, called the feature, and a label associated with the entry, which might be known or unknown. An example scenario is that a drug can have different effects on patients with the same disease, and we would like to predict the efficacy of the drug on new patients before administering it. Here, each patient is an entry and the drug effect is the label that we are interested in. What we have is labeled data on a set of past patients, meaning that we know the drug outcome for them. And for both past and new patients, we have descriptions of them such as height, body weight, and age. These descriptions are the features, which are usually much easier to measure than the label. Then, based on the experience with past patients, we can build a model that predicts labels for new patients. Basically, given a collection of feature-label pairs {(xi, yi)}_{i=1}^n, usually called the training data, a machine learning method tries to build a function that predicts the label of a newly given entry based on its feature. The feature can be in structured or unstructured form. The data can also come in a form where only a metric measuring similarities between data entries is given.

The basic assumption is that entries sharing similar features tend to have similar labels. Generally, a supervised machine learning model has hyperparameters that determine the basic model structure and tunable parameters that are optimized with the training data. Usually, there is a loss function which measures how well the model performs on the training data; the tunable parameters of the model are then optimized to minimize this loss function.
2.3.1 K-nearest neighbor algorithm
The k-nearest neighbor algorithm is a direct usage of the assumption that entries sharing similar features tend to have similar labels, and it relies on a reliable distance function on the input space. Let X be the input (feature) space and Y be the output (label) space, with a given distance function (or, more loosely speaking, a function reflecting distance) d(x, x′), x, x′ ∈ X. Given a collection of input-output pairs {(xi, yi)}_{i=1}^n, we can predict the output of a newly given input x by comparing it to the inputs of all the training examples. First, an ordering {(xαi, yαi)}_{i=1}^n of the training data with respect to the newly given input is determined such that

d(x, xα1) ≤ d(x, xα2) ≤ ... ≤ d(x, xαn−1) ≤ d(x, xαn).    (2.7)

Here, we consider the regression case where the labels are real numbers. Then, the output y paired with x can be predicted as

y = (1/k) Σ_{i=1}^{k} yαi,

where k is a positive integer chosen by the user. Weights can be added to accommodate the hierarchy in the ordering,

y = Σ_{i=1}^{k} wi yαi,

and a simple choice of weights is the inverse of the distance, wi = (1/d(x, xαi)) / (Σ_{j=1}^{k} 1/d(x, xαj)). There are also other, more complicated weights that are more robust and stable.
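The plain and distance-weighted predictions above can be sketched in a few lines of Python (illustrative code; a library implementation would handle ties and zero distances more carefully):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k, weighted=False):
    """Predict the label of x as a (possibly distance-weighted) mean of
    its k nearest training labels, following Eq. (2.7)."""
    d = np.linalg.norm(np.asarray(X_train) - np.asarray(x), axis=1)
    nearest = np.argsort(d)[:k]          # indices alpha_1, ..., alpha_k
    if not weighted:
        return float(np.mean(y_train[nearest]))
    w = 1.0 / np.maximum(d[nearest], 1e-12)  # inverse-distance weights
    return float(np.sum(w * y_train[nearest]) / np.sum(w))

X = np.array([[0.0], [1.0], [2.0], [10.0]])
y = np.array([0.0, 1.0, 2.0, 10.0])
pred = knn_predict(X, y, [1.2], k=2)  # two nearest neighbors: x = 1 and x = 2
```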
2.3.2 Decision tree
We review the idea of the decision tree in the case of regression. Suppose the feature is in the form of a vector, say x ∈ R^d. If the impacts of the elements of x on the output y are assumed to be additive, a linear regression model can be used. A generalization of the linear model to higher order polynomials can accommodate higher order interactions among the elements of the input. However, model complexity increases dramatically as higher order interactions are considered. The decision tree model provides an alternative that addresses higher order interactions by partitioning the feature space instead of exploring the space of high order polynomials with many variables.

Figure 2.2: Example regression decision tree.

The nodes that are not leaves in a decision tree are associated with elements of the input feature vector, and the leaf nodes are assigned prediction values of the label. When flowing from the root node to a leaf node, the value of the corresponding feature vector element is evaluated against predefined criteria to determine which child node to go to. Once a leaf node is reached, its assigned value is the prediction. Fig. 2.2 gives an example regression decision tree.

Decision trees can be constructed with a greedy algorithm from the top down. At each construction stage, a feature element with a splitting criterion is selected to most effectively reduce the loss function [166]. There are techniques to prevent overfitting by restricting model complexity. For example, one may set an upper bound on the number of branches or the tree depth. One can also set a lower bound on the number of training examples associated with each leaf node. One advantage of decision trees is that only one element of the feature vector is used at each node, so the features can have completely different meanings and magnitudes.
2.3.3 Ensemble of trees
In many problems, there can be thousands of features and large amounts of data that are too diverse and complicated for one decision tree to handle. A group of weak learners can be combined to increase model capacity; this is called an ensemble method. Here, a set of decision trees can be used instead of one single tree. There are different ensemble methods for decision trees, and we review two widely used constructions here: random forest and gradient boosting trees.

Random forest relies on bootstrap aggregation, where independent trees are built on random resamplings of the original training data with replacement. Then, a consensus model is constructed by taking the average output of the independent trees. There are additional techniques designed against overfitting, such as using a resampling of only a subset of the training data for each tree. Since the trees are trained independently, a random forest can be built efficiently by constructing the trees in parallel.
Another approach is gradient boosting, where one tree is added at a time to minimize the current loss. Let L(y, F(x)) be a loss function measuring the quality of a prediction F(x) against the true label y. A general gradient boosting procedure first initializes a model, denoted F0, which can be a constant function. Then, Fℓ is updated to Fℓ+1 by training a base model hℓ against the negative gradient of the current loss,

{ (xi, −[∂L(yi, F(xi))/∂F(xi)]|_{F=Fℓ}) }_i.

The model in the next iteration takes the form Fℓ+1 = Fℓ + γℓ hℓ, where γℓ is obtained by solving the minimization problem

min_γ Σ_i L(yi, Fℓ(xi) + γ hℓ(xi)).

A real number between 0 and 1, called the learning rate, is usually multiplied by γℓ when updating the model, for robustness.
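The procedure can be illustrated for the squared loss, where the negative gradient is simply the residual y − F(x). The sketch below (our own toy implementation, with depth-1 trees, i.e. stumps, as base learners and a fixed learning rate in place of the line search for γℓ) follows the update Fℓ+1 = Fℓ + γ hℓ:

```python
import numpy as np

def fit_stump(x, r):
    """Best depth-1 regression tree (stump) for 1-d inputs x and targets r."""
    best = (np.inf, None)
    for t in x:
        left, right = r[x <= t], r[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = ((r - pred) ** 2).sum()
        if sse < best[0]:
            best = (sse, (t, left.mean(), right.mean()))
    return best[1]

def boost(x, y, n_rounds=20, lr=0.5):
    """Gradient boosting for squared loss: each stump is fit to the
    negative gradient -dL/dF = y - F, i.e. the current residuals."""
    F = np.full_like(y, y.mean())   # F_0: constant model
    stumps = []
    for _ in range(n_rounds):
        t, lv, rv = fit_stump(x, y - F)         # fit the residuals
        F = F + lr * np.where(x <= t, lv, rv)   # F_{l+1} = F_l + gamma*h_l
        stumps.append((t, lv, rv))
    return F, stumps

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 0.0, 1.0, 1.0])
F, _ = boost(x, y)
```

With the learning rate 0.5, the residuals are halved at each round, so the training predictions F converge geometrically to y on this toy step function.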
2.3.4 Deep learning
A neural network is a generally nonlinear composite function constructed mainly from linear transformations and nonlinear activation functions.

A regular neuron is a processing unit that applies a nonlinear activation function on top of a linear function of the outputs of the neurons in the previous layer that are connected to it. A regular neural network is constructed by stacking densely connected layers, where every pair of neurons from adjacent layers is connected. Assume that we have a collection of n neurons in layer i and denote the output of the j-th neuron by N_j^(i); then the output of the k-th neuron in layer i + 1 is computed as

N_k^(i+1) = φ( Σ_{j=1}^{n} w_{j,k}^(i+1) N_j^(i) + b_k^(i+1) ),

where w_{j,k}^(i+1) and b_k^(i+1) are tunable parameters and φ is the chosen activation function. Commonly used activation functions include the rectified linear unit ( max(0, x) ), the sigmoid ( 1/(1 + e^−x) ), and the hyperbolic tangent ( (e^x − e^−x)/(e^x + e^−x) ). The number of parameters increases rapidly when more neurons and layers are added to the model.
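The layer formula above can be written as a single matrix-vector product over all k at once. A minimal sketch (illustrative; the weights and biases below are arbitrary) with the rectified linear unit as φ:

```python
import numpy as np

def dense_layer(N_prev, W, b, phi=lambda x: np.maximum(0.0, x)):
    """One densely connected layer:
    N_k^{(i+1)} = phi( sum_j w_{j,k} N_j^{(i)} + b_k ),
    computed for all k via a matrix-vector product.
    Default activation: rectified linear unit."""
    return phi(N_prev @ W + b)

x = np.array([1.0, -2.0])            # outputs of layer i (n = 2 neurons)
W = np.array([[1.0, 0.0, 2.0],       # shape (n_in, n_out)
              [0.0, 1.0, 1.0]])
b = np.array([0.0, 0.5, 0.0])
out = dense_layer(x, W, b)           # outputs of the 3 neurons in layer i+1
```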
Originally inspired by vision system in biology, convolutional neural network is a widely
used architecture especially good at dealing with image data or imagelike analogues. Only
local connections are allowed and the weights are shared across diﬀerent locations. The
setup of a single neuron is similar to the regular neural network but the information carried
by a neuron can be a vector instead of a scalar and the weight and bias parameters are
then higher dimensional counterparts. The activation function is applied elementwise on the
vector. Fig. 2.3 shows the architecture of a dense layer and a convolution layer. We discuss
more details of convolutional neural network in the 1dimensional case. Consider an n × m
26
Figure 2.3: a. Dense layer. b. Convolution layer.
second order tensor V, where n is the number of digits along the one dimension and m is the
number of channels for each digit. With a predefined window size w, a convolutional filter
F can be represented by a w × m second order tensor. By moving the window of size w
along the one dimension of V, a sequence of N_f second order tensors, which are subtensors
of V, is obtained and can be concatenated to form an N_f × w × m third order tensor T.
The filter F operated on T results in a first order tensor T_{ijk}F_{jk} by tensor contraction
over the last two indices. Concatenating the outputs of n_f filters gives an N_f × n_f second
order tensor. Generally speaking, a 1D convolution layer takes an n × m tensor and outputs an
N_f × n_f tensor, where N_f = n − w + 1 in the absence of padding. The bias terms and
activation functions are omitted in the illustration above for simplicity.
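The windowing-and-contraction construction above can be sketched with NumPy (the helper name conv1d and the toy shapes are our own):

```python
import numpy as np

def conv1d(V, filters):
    """1D convolution as tensor contraction, following the construction in the text.

    V       : (n, m) input with n positions and m channels
    filters : (nf, w, m) stack of nf filters of window size w
    returns : (Nf, nf) output with Nf = n - w + 1 window positions
    """
    n, m = V.shape
    nf, w, _ = filters.shape
    Nf = n - w + 1
    # T[i] is the window V[i:i+w, :], so T is the Nf x w x m third order tensor
    T = np.stack([V[i:i + w, :] for i in range(Nf)])
    # Contract the window and channel indices: out[i, f] = T[i, j, k] F[f, j, k]
    return np.einsum('ijk,fjk->if', T, filters)

V = np.arange(12, dtype=float).reshape(6, 2)  # n = 6 positions, m = 2 channels
F = np.ones((3, 2, 2))                        # nf = 3 filters, window w = 2
out = conv1d(V, F)
print(out.shape)  # (5, 3)
```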
Another technique we are going to make use of is multitask learning. A practical multitask
learning model can be achieved by branching out from the shared lower layers. An example
architecture of multitask learning neural network is shown in Fig. 2.4.
Feedforward neural networks are usually trained by backpropagation where the error of
the output layer is calculated and is propagated backward through the network to update its
weights. One popular approach to training a neural network is the stochastic gradient descent
Figure 2.4: A practical neural network architecture for multitask learning.
(SGD) method. Let Θ be the parameters in the network and L(Θ) be the loss function that
is to be minimized. SGD method updates Θi to Θi+1 from step i to step i + 1 as
Θi+1 = Θi − τ∇ΘL(Θi; Xs, Ys),
where τ is the learning rate, Xs and Ys are the input and target of the sth example
of the training set.
In practice, the training set (X, Y ) is often split into minibatches
{(Xs, Ys)}s∈S. SGD method then goes through each minibatch instead of going through
only one example at a time. When the landscape of the objective function is like a long
steep valley, momentum is added to accelerate convergence of the algorithm. The updating
scheme can therefore be changed to

ΔΘ_i = Θ_i − Θ_{i−1},
Θ_{i+1} = Θ_i − (1 − η) τ ∇_Θ L(Θ_i; X_s^i, Y_s^i) + η ΔΘ_i,

where 0 ≤ η ≤ 1 is a scalar coefficient for the momentum term. Many techniques have been
proposed to address the overfitting problem, such as dropout and regularization terms.
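The momentum update above can be sketched on a toy quadratic loss (the helper name and the toy minibatches are illustrative only):

```python
import numpy as np

def sgd_momentum(theta0, grad, batches, tau=0.1, eta=0.9, epochs=100):
    """Minibatch SGD with momentum, following the update scheme in the text:
    Theta_{i+1} = Theta_i - (1 - eta)*tau*grad(Theta_i; batch) + eta*(Theta_i - Theta_{i-1}).
    """
    theta_prev = theta0.copy()
    theta = theta0.copy()
    for _ in range(epochs):
        for batch in batches:           # one pass over the minibatches per epoch
            delta = theta - theta_prev  # momentum term Delta Theta_i
            theta_prev = theta.copy()
            theta = theta - (1.0 - eta) * tau * grad(theta, batch) + eta * delta
    return theta

# Toy loss per minibatch: L = 0.5 * ||theta - c||^2 with per-batch center c
centers = [np.array([1.0, 1.0]), np.array([3.0, 3.0])]
grad = lambda th, c: th - c  # gradient of the toy loss
theta = sgd_momentum(np.zeros(2), grad, centers)
print(theta)  # close to the mean of the batch centers, [2, 2]
```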
2.4 Biomolecular modeling
We review some basics of biomolecular modeling relevant to this dissertation in this section
involving both bioinformatics methods and physical models.
2.4.1 Proteins and small molecules
Proteins are made of sequences of amino acids and play very important roles in living organisms.
To list a few of their functions, they serve as structural supporting elements, drive
biological processes, and sense chemical changes. The amino acid sequence determines local
secondary structures including the alpha helix, beta sheet, and random coil. Regular proteins
usually fold into a relatively stable structure which we call the folded state. There are
also disordered or partially disordered proteins that do not have stable structures under certain
conditions.
Small molecules also play an important role in biological processes, often by interacting
with macromolecules such as proteins and nucleic acids. They may serve as triggering
signals for certain processes or inhibit certain processes. In fact, many working drugs are
small molecules. Being able to accurately describe small molecules and small molecule-
macromolecule complexes is important for computational molecular biology and drug design.
2.4.2 Physical modeling
Solvation free energy
Electrostatic forces play an essential role in protein folding. Electrostatic solvation free
energy is an important component of electrostatic interactions. Among all kinds of approaches,
continuum solvation models achieve a favorable tradeoff between accuracy and
efficiency. The Poisson-Boltzmann model is used to quantify the polar part of the energy change
in the solvation process, that is, the process of moving a molecule from vacuum to a solvent
environment.
The Poisson-Boltzmann model uses the Boltzmann distribution to model the ion density and the
equation is formulated as

−∇ · (ε(x)∇φ(x)) + k²(x) sinh(φ(x)) = Σ_{i=1}^{N} q_i δ(x − x_i),   (2.8)

φ(∞) = 0,

where φ is the electrostatic potential, ε is the dielectric constant, k is the modified Debye-
Hückel screening factor, δ is the Dirac delta function, and q_i and x_i are the partial charge
and location of each fixed atom in the molecule. The second term and the right hand side
of equation 2.8 model the charge densities of the mobile ions and the fixed molecule, respectively. The
value of the partial charges can be assigned by precomputed force ﬁelds. In computation,
a bounding box Ω is taken as computation domain and is composed of solvent domain Ωs
and solute domain Ωm. One deﬁnition of the dielectric constant is to take constant values
on Ωs and Ωm. The solution φ to equation 2.8 should satisfy the jump condition across the
interface Γ between Ωs and Ωm,
[φ]_Γ := φ_{Ωs}(x) − φ_{Ωm}(x) = 0,   x ∈ Γ,
[ε∇φ]_Γ := ε_s ∇φ_{Ωs}(x) · n − ε_m ∇φ_{Ωm}(x) · n = 0,   x ∈ Γ.   (2.9)
The solvation free energy modeled by the Poisson-Boltzmann model can then be computed as

ΔG^{sol}_{pol} = (1/2) Σ_{i=1}^{N} q_i (φ(x_i) − φ_{vac}(x_i)),   (2.10)

where ΔG^{sol}_{pol} is the polar component of the solvation free energy and φ_{vac} is the electrostatic
potential of the molecule in the vacuum environment, which is obtained by solving a Poisson
equation with the uniform dielectric constant for vacuum.
Electrostatic interaction
The electrostatics of pairwise interactions in homogeneous media can be modeled with
the Coulomb model. Given two atoms with partial charges q_i and q_j at a distance r_ij apart,
the electrostatic force F_clb between these two atoms can be described by Coulomb's law

F_clb = k_e q_i q_j / r_ij²,

where k_e is the Coulomb constant.
Van der Waals interaction
Van der Waals forces describe interactions between atoms that do not arise from covalent
bonds or electrostatic interactions. The Lennard-Jones potential is used to model the van der
Waals interaction energy between atoms, and the energy between the ith and the jth atoms
is defined as

G_{i,j} = ε_{ij} ( ((r_i + r_j)/|x_i − x_j|)^{12} − 2 ((r_i + r_j)/|x_i − x_j|)^{6} ),   (2.11)

where r_i and r_j are the van der Waals radii of the atoms and ε_{ij} is the depth of the potential
well, which differs for different atom pairs. Since atomic features are constructed and the
interactions between different pairs of atom types are modeled separately, the determination of the
parameter ε_{ij} is left to the machine learning algorithm.
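The two pairwise models above can be sketched as follows; the unit convention (a Coulomb constant in kcal·Å/(mol·e²), a common molecular mechanics choice) is our assumption for illustration and not a convention stated in this dissertation:

```python
import numpy as np

K_E = 332.0636  # assumed Coulomb constant in kcal*A/(mol*e^2)

def coulomb_force(qi, qj, rij):
    """Magnitude of the Coulomb force between two partial charges: k_e*qi*qj/rij^2."""
    return K_E * qi * qj / rij**2

def lennard_jones(eps_ij, ri, rj, xi, xj):
    """Lennard-Jones energy of Eq. (2.11); the well has depth -eps_ij at |xi - xj| = ri + rj."""
    r = np.linalg.norm(np.asarray(xi, float) - np.asarray(xj, float))
    s = (ri + rj) / r
    return eps_ij * (s**12 - 2.0 * s**6)

# At the sum of the van der Waals radii the energy equals the well depth -eps_ij
e_min = lennard_jones(0.1, 1.7, 1.55, [0.0, 0.0, 0.0], [3.25, 0.0, 0.0])
print(e_min)  # -0.1
```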
Molecule surface area
While the polar part of the solvation free energy is modeled with the Poisson-Boltzmann
theory, the nonpolar part is affected by several factors. The surface area of the molecule is
believed to correlate with the nonpolar part of the solvation free energy.
2.4.3 Sequence tools
BLOSUM matrix
A BLOSUM matrix is a substitution matrix for protein sequence alignments. It scores
the substitution of any pair of amino acids and can thus describe the favorability of a certain point
mutation. In this work, BLOSUM62, which models moderately related proteins, is used.
PSSM matrix
The BLOSUM matrix is a general scoring matrix which considers neither residue position
nor the alignment of the entire sequence. A position-specific scoring matrix (PSSM) is used
to fill this information gap. BLAST searches for the query sequence are performed iteratively,
and the substitution scores are updated according to the matched sequences, making the
scores position-specific along the sequence.
Chapter 3
TopologyNet: Deep convolutional
neural networks based on topology for
biomolecular property predictions.
3.1 Introduction
Understanding the structurefunction relationships of biomolecules is fundamentally impor
tant in computational biophysics and experimental biology. As such, methods that can
robustly predict biomolecular properties, such as proteinligand binding aﬃnity and protein
stability change upon mutation from threedimensional (3D) structures are important tools
to help us understand this relationship. Numerous approaches have been developed to un
veil the structurefunction relationship. Physics based models make use of fundamental laws
of physics, i.e., quantum mechanics, molecular mechanics, continuum mechanics, multiscale
modeling, statistical mechanics, thermodynamics, etc, to investigate structurefunction rela
tionships and predict function from structure. Physical methods provide important insights
and are indispensable for understanding the relationships between protein structure and
function.
Physical models for complex biomolecular systems are made computationally tractable
often by making simplifications to the real-world systems. However, many factors can affect
the properties of biomolecular systems, and a factor that is crucial to one type of system may
have only a minor effect on another. Therefore, it is hard to tune a physical
model to handle diverse systems. On the other hand, a model with the capability of handling
diverse systems is desirable in many realistic applications. For example, the computational
screening of drug candidates involves the scanning of a library containing millions of diverse
molecules. The large and diverse datasets create the opportunity for a model to learn from
data and automatically extract the relevant factors and the unknown relationships.
Data-driven machine learning models make relatively few assumptions about the
importance of the factors to the systems, and the models are usually flexible enough that
nonlinear and high-order interactions among features can be recognized. In this chapter,
we focus on deep learning models for their excellent ability to handle complex datasets and their
flexible architectures, which enable the combination of relevant learning tasks. Deep learning has
fueled the rapid growth in several areas of data science [110, 138]. Notably, deep learning can
automatically extract optimal high level features and discover intricate structures in large
data sets. The capability of hierarchically handling data with underlying spatial dimensions
has led to breakthroughs in deep convolutional neural networks in image processing, video,
audio, and computer vision [107, 170].
Given multiple learning tasks, multitask learning (MTL) [34] provides a powerful tool
to exploit the intrinsic relatedness among learning tasks, transfer predictive information
among tasks, and achieve better generalized performance than single task learning. Dur
ing the learning stage, MTL algorithms seek to learn a shared representation (e.g., shared
distribution of a given hyperparameter [65], shared lowrank subspace [64], shared feature
subset [118] and clustered task structure [213]), and use the shared representation to bridge
between tasks and transfer knowledge. MTL has applications to the bioactivity of small
molecular drugs [180, 120, 184] and genomics [49]. Linear regression based MTL depends
heavily on well crafted features, while neural network based MTL allows more flexible task
coupling and is able to deliver satisfactory results with a large number of low level features,
provided such features have sufficient representative power for the problem.
For complex 3D biomolecular data, the physical features used in machine learning vary
greatly in their nature. Typical features are generated from geometric properties, electro
statics, atom types, atomic partial charges, and graph theory based properties [188]. Such
manually extracted features can be used in a deep neural network, but the performance heavily
relies on feature engineering. In contrast, convolutional neural networks learn to extract
high level representations hierarchically from low level features while maintaining the underlying
spatial relationships. However, the cost of directly applying convolutional neural networks
to 3D biomolecules is huge, especially if long-range interactions are included. A
major obstacle in the development of deep learning nets for 3D biomolecular data is the
entanglement between geometric complexity and biological complexity.
Most theoretical models for the study of structurefunction relationships of biomolecules
are based on geometric modeling techniques. Mathematically, these approaches exploit local
geometric information, i.e., coordinates, distances, angles, areas, and sometimes curvatures
[139] for the physical modeling of biomolecular systems. Indeed, the importance of geometric
modeling for structural biology [67], and biophysics cannot be overemphasized. However,
geometry based models often contain too much structural detail and are frequently computationally
intractable for large structures or datasets. In many biological problems, such as
the opening or closing of ion channels, the association or dissociation of binding ligands, the
folding or unfolding of proteins, and the symmetry breaking or formation of virus capsids,
obvious topological changes exist. In fact, one only needs qualitative topological information to
understand many physical and biological functions. In short, topologyfunction relationships
exist in many biomolecular systems.
Topology oﬀers entirely diﬀerent approaches and could provide signiﬁcant simpliﬁcation
of biomolecular data [168, 216, 175, 51, 88, 56, 52, 169]. The study of topology deals with
the connectivity of diﬀerent components in a space, and characterizes independent entities,
rings and higher dimensional faces within the space [98]. Topological methods produce a
high level of abstraction to many biological processes. For example, the opening and closing
of ion channels, the assembly or disassembly of virus capsids, the folding and unfolding of
proteins, and the association or dissociation of ligands are reﬂected by topological changes.
The fundamental task of topological data analysis is to extract topological invariants, namely
the intrinsic features of the underlying space, of a given data set without additional structure
information such as covalent bonds, hydrogen bonds, and van der Waals interactions.
In this work, we use persistent homology, which is a relatively new branch of algebraic
topology that embeds multiscale geometric information in topological invariants to achieve
an interplay between geometry and topology. It creates a variety of topologies of a given
object by varying a ﬁltration parameter, such as the radii of balls centered at the nodes or
the level set of a surface function. As a result, persistent homology can capture topological
structures continuously over a range of spatial scales.
The objective of this chapter is to introduce a new framework for the structure based
biomolecular property predictions using element-specific persistent homology, and convolutional
and multitask neural networks. In this framework, element-specific persistent homology
reduces geometric and biological complexities and provides a sufficient and structured
low level representation for neural networks. Given this representation, convolutional neural
networks can then learn from data to extract high level representations of the biomolecular
systems, while retaining the spatial relationships, and construct mappings from these rep
resentations to the target properties. For the prediction problems whose available datasets
are small, multitask learning by jointly learning the related prediction problems with larger
available datasets helps to extract a proper high level representation for the target applications.
The element-specific treatment is inspired by the RF-Score method [115] for binding
affinity prediction. Element-specific persistent homology originated in our previous work
using classic machine learning methods [23, 24]. In this work, we further develop topology
based neural network (TopologyNet) models for the prediction of biomolecular structure-
function relationships. Specifically, we integrate element-specific persistent homology (ESPH)
and convolutional neural networks (CNNs) to improve modern methods for protein-ligand
binding affinity and protein mutation
impact predictions from 3D biomolecular data. In this approach, topological invariants are
used to reduce the dimensionality of 3D biomolecular data. Additionally, element-specific
persistent barcodes offer image-like topological representations to facilitate convolutional
deep neural networks. Moreover, biological information is retained by element-specific
topological fingerprints and described in multiple channels in our image-like representation.
Furthermore, convolutional neural networks uncover hidden relationships between biomolecular
topological invariants and biological functions. Finally, a multitask multichannel topological
convolutional neural network (MMTCNN) framework is introduced to exploit the relations
among various structurefunction predictions and enhance the prediction for problems with
small and noisy training data. Our hypothesis is that many biomolecular predictions share
a common set of topological fingerprint representations and are highly correlated with each
other. As a result, multitask deep learning by simultaneous training for globular proteins
and membrane proteins improves upon existing predictions for the mutation induced stability
changes of membrane proteins whose training data is relatively small.
3.2 Methods
In this section, we give a brief explanation of persistent homology before introducing
topological representations of protein-ligand binding and protein changes upon mutation.
Multichannel topological deep learning and multitask topological deep learning architectures are
constructed for binding aﬃnity and mutation impact predictions. The source codes with
examples of feature construction for the binding problem and the mutation problem can be
found in supplementary material (S3 Code and S4 Code) of [26]. The network architectures,
parameters, and training procedures are listed in S2 Text of [26]. Some auxiliary features
such as sequence features are used to enhance the models for the mutation applications. The
description of the auxiliary features together with pseudocode for the mutation application
are listed in S1 Text of [26].
3.2.1 Persistent homology
Simplicial homology gives a computable way to distinguish one space from another in topology
and is built on simplicial complexes, which can be used to extract topological invariants
in a given data set. A simplicial complex K is a topological space that is constructed from
geometric components of a data set, including discrete vertices (nodes or atoms in a protein),
edges (line segments or bonds in a biomolecule), triangles, tetrahedrons, and their
higher dimensional counterparts, under certain rules. Specifically, a 0-simplex is a vertex, a
1-simplex an edge, a 2-simplex a triangle, and a 3-simplex a tetrahedron. The
identiﬁcation of connectivity of a given data set can follow diﬀerent rules which leads to,
for example, the Vietoris-Rips (VR) complex, the Čech complex, and the alpha complex. The linear
combination of k-simplexes is called a k-chain, which is introduced to associate the topological
space, i.e., the simplicial complex, with algebraic groups, which further facilitates the computation
of the topological invariants (i.e., Betti numbers) in a given data set. Specifically, the set
of all k-chains of a simplicial complex K forms a chain group, which is an abelian
group with a modulo-2 addition operation rule. Loosely speaking, a boundary operator
systematically eliminates one vertex from a k-simplex at a time, which leads to a family of
abelian groups, including the kth cycle group and the kth boundary group. The quotient
group of the kth cycle group over the kth boundary group is called the kth homology group.
The kth Betti number is computed as the rank of the kth homology group.
Persistent homology is constructed via a ﬁltration process, in which the connectivity of
the given data set is systematically reset according to a scale parameter. More speciﬁcally,
a nested sequence of subcomplexes is deﬁned via a ﬁltration parameter, such as the growing
radius of protein atoms located at their initial coordinates. For each subcomplex, homology
groups and the corresponding Betti numbers can be computed. Therefore, the evolution
of topological invariants over the ﬁltration process can be recorded as a barcode [79] or a
persistence diagram. For a given data set, barcodes represent the persistence of its topological
features over diﬀerent spatial scales.
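For intuition, the 0-dimensional barcode of a Vietoris-Rips filtration on a point cloud can be computed with a union-find pass over edges sorted by length; this minimal sketch is for illustration only and is not the software used in this work.

```python
import itertools
import math

def betti0_barcode(points):
    """0-dimensional persistence barcode of a Vietoris-Rips filtration.

    Every connected component is born at filtration value 0; a component
    dies at the edge length where it merges into another component, and
    one bar persists forever (death = inf).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # The filtration: process edges in order of increasing length
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                 # two components merge: one bar dies at d
            parent[rj] = ri
            bars.append((0.0, d))
    bars.append((0.0, math.inf))     # the component that survives
    return bars

pts = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
print(betti0_barcode(pts))  # [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```

In practice higher-dimensional invariants (Betti-1, Betti-2) require dedicated persistent homology software, but the 0-dimensional case already shows how bar deaths record the spatial scales at which components merge.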
3.2.2 Topological representation of biomolecules
Topological ﬁngerprints
A basic assumption of persistent homology as applied to biomolecular function prediction
is that 1D biomolecular persistent barcodes are able to eﬀectively characterize 3D biomolec
ular structures. We call such barcodes topological ﬁngerprints (TFs) [198, 195]. Fig. 3.1
illustrates the TFs of a wild type protein (PDB:1hmk) and its mutant obtained from persistent
homology calculations using the VR complex. The mutation (W60A), which changes
residue 60 from Trp to Ala, is shown in Figs. 3.1a and b. A large residue (Trp) at the
protein surface is replaced by a relatively small one (Ala). The corresponding barcodes are
given in Figs. 3.1c and d, where three panels from top to bottom are for Betti0, Betti1,
and Betti2, respectively. The barcodes for the wild type are generated using heavy atoms
within 6˚A from the mutation site. The mutant barcodes are obtained with the same set of
heavy atoms in the protein except for those in the mutated residue. In two Betti0 panels,
the diﬀerence in the number of bars is equal to the diﬀerence in the number of heavy atoms
between the wild type and mutant. Broadly speaking, the lengths of short bars reﬂect the
bond length of the corresponding heavy atom. Therefore, in both the wild type protein
and the mutant, bond lengths for most heavy atoms are smaller than 1.8˚A. Additionally,
bars that end between 1.8˚A and 3.8 ˚A might correlate with hydrogen bonds. Comparing
Fig. 3.1c and d, one can easily note the increase in the number of bars that end in the range
of 1.8-3.8 ˚A in the mutant, which indicates a less compact atom arrangement. In Betti1
and Betti2 panels, the mutant has fewer bars than the wild type does because a smaller
surface residue at 60 creates fewer ring and cavity contacts with the rest of the protein.
Elementspeciﬁc persistent homology
The all heavy atom topological representation of proteins does not provide enough biological
information about protein structures, such as the bond length distribution of a given
type of atoms, hydrogen bonds, hydrophobic and hydrophilic effects, etc. Therefore, we use
the element-specific topological fingerprint (ESTF) to offer a more detailed characterization
of proteinligand binding and protein mutation. For example, Betti1 and Betti2 ESTFs
from carbon atoms are associated with hydrophobic interaction networks in biomolecules.
Figure 3.1: An illustration of barcode changes from wild type to mutant proteins.[26]
a: The wild type protein (PDB:1hmk) with residue 60 as Trp. b: The mutant with residue 60 as Ala. c:
Wild type protein barcodes for heavy atoms within 6 ˚A of the mutation site. Three panels from top to
bottom are Betti0, Betti1, and Betti2 barcodes, respectively. The horizontal axis is the ﬁltration radius
(˚A). d: Mutant protein barcodes obtained similarly to those of the wild type.
Similarly, ESTFs between nitrogen and oxygen atoms correlate with hydrophilic interactions
and/or hydrogen bonds in biomolecules. However, hydrogen atoms are typically absent from
structures in the PDB and thus are not used in our data driven ESTF description. For
proteins, commonly occurring heavy atom types include C, N, O, and S. For ligands, we use
9 commonly occurring atom types, namely C, N, O, S, P, F, Cl, Br, and I. To characterize the
interactions between protein and ligand binding, we construct cross proteinligand ESTFs
such that one type of heavy atoms is chosen from the protein and the other from the ligand.
Therefore, there are a total of 36 sets of ESTFs in each topological dimension. For mutation
characterization, we describe the interactions between mutated residue and the rest of the
protein and arrive at 9 sets of ESTFs in each topological dimension considering { C, N, O }
for protein atoms. Similarly, we generate 9 sets of cross ESTFs in each topological dimension
from the wild type protein to study the interactions between the residue to be mutated and
the rest of the protein. However, high dimensional Betti1 and Betti2 invariants require the
formation of high order complexes. As noncarbon atoms do not occur very often, Betti1
and Betti2 ESTFs are generated for all carbon atoms or all heavy atoms, unless otherwise specified.
The TFs and ESTFs are originally stored as collections of barcodes denoted by B(α,C,D)
with α labeling the selection of atoms depending on atom types and aﬃliations (i.e., protein,
ligand or mutated residue). C denotes the type of simplicial complex (i.e., VR complex
or alpha complex) and D indicates the dimension, such as Betti0, Betti1, or Betti2. A
collection of barcodes can have any number of barcodes and thus can not be directly fed to
deep learning models. Additionally, as shown in Fig. 3.1, it is important to keep track of the
birth, death, and persistence patterns of the barcodes, because this information is associated
with the bond length, ring or cavity size, flexibility, and steric effect. Moreover, Jeffrey
suggested that there are strong, moderate, and weak hydrogen bond interactions with
donor-acceptor distances of 2.2-2.5 ˚A, 2.5-3.2 ˚A, and 3.2-4.0 ˚A, respectively [96]. To this end, we
construct structured vectors Vb, Vd, and Vp to respectively describe the birth, death, and
persistent patterns of the barcodes in various spatial dimensions. Practically, the ﬁltration
interval [0, L] is divided into n equal length subintervals and the patterns are characterized
on each subinterval. The description vectors are deﬁned as
V^b_i = ‖{(b_j, d_j) ∈ B(α,C,D) | (i − 1)L/n ≤ b_j ≤ iL/n}‖,   1 ≤ i < n,
V^d_i = ‖{(b_j, d_j) ∈ B(α,C,D) | (i − 1)L/n ≤ d_j ≤ iL/n}‖,   1 ≤ i < n,
V^p_i = ‖{(b_j, d_j) ∈ B(α,C,D) | b_j ≤ (i − 1)L/n, d_j ≥ iL/n}‖,   1 ≤ i ≤ n,   (3.1)
where ‖ · ‖ is the cardinality of a set. Here b_j and d_j are the birth and death of bar j. The three
types of representation vectors are computed for sets of Betti1 and Betti2 bars. For Betti0
bars, since their birth positions are uniformly 0, only V^d needs to be addressed. To
characterize pairwise interactions between atoms, it is convenient to simply use pairwise
distance information between atoms. The corresponding imagelike representation, denoted
by Vr, can be constructed similarly to Vd by substituting the set of barcodes by a collection
of distances between the atom pairs of interest. It should be noted that Vr is not equivalent
to Vd in most simplicial complex setups. Generally speaking, Vr also reﬂects the 0th
order topological connectivity information. It is used as the characterization of 0th order
connectivity of the biomolecules in the applications shown in this work. Finally, we let Xs
denote all the feature vectors for the sth sample and let Ys denote the corresponding target
value.
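A minimal sketch of the binning in Eq. (3.1); the function and parameter names are ours, and bars are counted with the closed-interval conventions of the equation:

```python
import numpy as np

def barcode_vectors(bars, L, n):
    """Structured vectors V^b, V^d, V^p of Eq. (3.1).

    bars : list of (birth, death) pairs from one set of barcodes
    The filtration interval [0, L] is divided into n equal subintervals;
    each entry counts bars born in, dying in, or spanning a subinterval.
    """
    Vb = np.zeros(n, dtype=int)
    Vd = np.zeros(n, dtype=int)
    Vp = np.zeros(n, dtype=int)
    edges = np.linspace(0.0, L, n + 1)
    for b, d in bars:
        for i in range(n):
            lo, hi = edges[i], edges[i + 1]
            if lo <= b <= hi:
                Vb[i] += 1
            if lo <= d <= hi:
                Vd[i] += 1
            if b <= lo and d >= hi:  # the bar persists across the whole subinterval
                Vp[i] += 1
    return Vb, Vd, Vp

Vb, Vd, Vp = barcode_vectors([(0.5, 3.2), (1.1, 2.0)], L=4.0, n=4)
print(Vb, Vd, Vp)  # [1 1 0 0] [0 1 1 1] [0 1 1 0]
```

The pairwise-distance vector V^r mentioned above can be built the same way as V^d by feeding in a list of atom-pair distances in place of bar deaths.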
Imagelike multichannel topological representation
To feed the outputs of TFs into the convolutional neural network, the barcodes are
transformed to a 1D image-like representation with multiple channels. The topological feature
vectors V^b, V^d, and V^p can be viewed as one-dimensional (1D) images. Each subinterval
in the filtration axis represents a digit (or pixel) in the 1D image-like representation. Such
a treatment of topological features describes the topological information with an appropriately
chosen resolution of L/n. Meanwhile, the chemical information in the ESTFs of B(α,C,D)
is described by multiple channels in the 1D image-like representation, which is similar to
the RGB color image representation. However, in our description, each pixel is associated
with m channels describing different element types, protein mutation status (i.e., wild type
and mutant), topological dimension (i.e., Betti0, Betti1, and Betti2), and topological event
(i.e., birth, death, and persistence). Each element in the 1D image-like representation is
standardized to have zero mean and unit variance across the data sets. This 1D image-like
topological representation can be easily transferred among problems such as protein-ligand
binding affinity modeling and prediction of protein stability change upon mutation.
Traditional machine learning approaches require manual extraction of features for each domain of
application. When the convolutional neural network is applied, the convolution layers
identify local patterns of atomic interactions and the fully connected layers then extract higher
level descriptions of the system by combining local patterns at various distance scales.
Multichannel topological invariants for proteinligand binding prediction
In computation, the binding aﬃnity, or alternatively the binding free energy, can be
modeled via an energy cycle as shown in Fig. 3.2 where the main contributors to the process
are intermolecular interactions and solvation eﬀects. In this work, we consider the set of
element types Le = {C, N, O, S, P, F, Cl, Br, I} contained in ligands and Pe = {C, N, O, S}
Figure 3.2: Energy cycle of proteinligand binding free energy modeling.[26]
contained in proteins. We deﬁne an opposition distance between two atoms ai and aj as
d_op(a_i, a_j) = d(a_i, a_j)   if A(a_i) ≠ A(a_j),
d_op(a_i, a_j) = ∞             if A(a_i) = A(a_j),   (3.2)

where d(·, ·) is the Euclidean distance between two atoms and A(·) denotes the affiliation of an
atom, which is either protein or ligand.
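Eq. (3.2) can be sketched directly; representing an atom as a (position, affiliation) pair is an illustrative choice of ours:

```python
import math

def d_op(ai, aj):
    """Opposition distance of Eq. (3.2): Euclidean distance for atoms with
    different affiliations, infinity for atoms of the same molecule."""
    (xi, Ai), (xj, Aj) = ai, aj
    if Ai == Aj:
        return math.inf
    return math.dist(xi, xj)

a = ((0.0, 0.0, 0.0), 'protein')
b = ((3.0, 4.0, 0.0), 'ligand')
c = ((1.0, 0.0, 0.0), 'protein')
print(d_op(a, b), d_op(a, c))  # 5.0 inf
```

Filtering with d_op makes the 0-dimensional bars record only protein-ligand contacts, since all intramolecular distances are pushed to infinity.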
The ESTFs used in this application are summarized in Table 3.1. The structured description
vectors of the ESTFs are generated according to the definition given in Eq (3.1).
As shown in Table 3.1, five sets of ESTFs are constructed. The differences between the
description vectors arising from Set 2 and Set 3, and between those arising from Set 4 and
Set 5, are also employed as representation vectors to address the impact of ligand binding,
resulting in a total of 72 representation vectors (i.e., channels) forming the 1D image-like
representation of the protein-ligand complex. Pairwise interactions are characterized for the
36 element pairs with {C, N, O, S} for the protein and {C, N, O, S, F, P, Cl, Br, I} for
the ligand with Vd providing 36 channels. The birth (Vb), death (Vd), and persistence
(Vp) for Betti1 and Betti2 barcodes are computed for carbon atoms and all heavy atoms
of the protein and the proteinligand complex which results in 24 channels. The diﬀerence
between the characterization of the protein and the protein-ligand complex accounts for
another 12 channels. Thus, we have a total of 72 channels. Here, 0-dimensional TFs describe
intermolecular interactions between the protein and the ligand. All heavy atom TFs delineate
the geometric effect of protein-ligand binding. The TFs of carbon atoms account for
hydrophobic effects and also implicitly reflect the solvation effects. The distance scale interval
[0, 50] ˚A is divided into bins of length 0.25 ˚A.
Set  Atoms used                                                       Distance   Complex  Dim.
1    {a ∈ P | T(a) = eP} ∪ {a ∈ L | T(a) = eL}, eP ∈ Pe, eL ∈ Le      dop        -        0
2    {a ∈ P | T(a) ∈ Pe}                                              Euclidean  Alpha    1,2
3    {a ∈ P | T(a) ∈ Pe} ∪ {a ∈ L | T(a) ∈ Le}                        Euclidean  Alpha    1,2
4    {a ∈ P | T(a) = C}                                               Euclidean  Alpha    1,2
5    {a ∈ P | T(a) = C} ∪ {a ∈ L | T(a) = C}                          Euclidean  Alpha    1,2
Table 3.1: Topological representations of proteinligand complexes.
P and L are sets of atoms in protein and in ligand. T (·) denotes element type of an atom. eP is an element
type in protein and eL is an element type in ligand. “Complex” refers to the type of simplicial complex used
and “Dimension” refers to the dimensionality of a topological invariant.
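The channel counts above can be verified with a few lines of arithmetic (the grouping into atom sets is our reading of the text):

```python
# Channel bookkeeping for the protein-ligand 1D image-like representation
pairwise = 4 * 9              # {C,N,O,S} x {C,N,O,S,P,F,Cl,Br,I} pairs, V^d only
events = 3                    # birth, death, persistence
betti_dims = 2                # Betti-1 and Betti-2
atom_sets = 4                 # {protein, complex} x {carbon only, all heavy}
higher = events * betti_dims * atom_sets        # 24 channels
diffs = events * betti_dims * (atom_sets // 2)  # 12 complex-minus-protein channels
channels = pairwise + higher + diffs
bins = int(50 / 0.25)         # filtration interval [0, 50] A at 0.25 A resolution
print(channels, bins)  # 72 200
```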
Multichannel topological invariants for the prediction of protein folding free
energy change upon mutation
Modeling protein folding free energy change upon mutation basically involves the unfolded
states and folded structures of the mutant and the wild type, as shown in Fig. 3.3.
Since the unfolded states of proteins are highly dynamic, which significantly increases the
modeling cost due to the need to sample a large conformational space, we only analyze the
folded states of the mutants and the wild type proteins in this application. Similar to the
proteinligand binding aﬃnity prediction, atomic interactions between speciﬁc element types,
geometric effects, and hydrophobic effects are characterized. The persistent homology
analysis performed in this application is summarized in Table 3.2. The differences between the
description vectors arising from Sets 1 and 2, and between those arising from Sets 3 and 4,
are also included to account for changes caused by mutation. The 1D image-like representation
in this application thus has a channel size of 45. The pairwise interaction pattern is
characterized for 9 element pairs from the element set {C, N, O}; an example is the interaction
between the carbon atoms of the mutation site and the nitrogen atoms from the rest of the
protein. Such characterization for the mutant protein, the wild type protein, and the difference between
these characterizations account for 27 channels. The birth, death, and bar persistence are
characterized for Betti1 and Betti2 barcodes for all heavy atoms of both the wild type
protein and the mutant protein resulting in 12 channels. The diﬀerence between the mutant
and the wild type, which accounts for 6 channels, is also included. Thus, we have a total of
45 channels. The distance scale interval, [0, 12] ˚A is divided into bins of length 0.25 ˚A. An
example of the persistent homology barcodes of a mutant and its wild type is given in Fig.
3.1.
Set   Atoms selected                                                         Distance    Complex   Dim.
1     {a ∈ PW\MW : T(a) = eP} ∪ {a ∈ MW : T(a) = eM}, eP, eM ∈ Pe            dop         Rips      0
2     {a ∈ PM\MM : T(a) = eP} ∪ {a ∈ MM : T(a) = eM}, eP, eM ∈ Pe            dop         Rips      0
3     {a ∈ PW : T(a) ∈ Pe}                                                   Euclidean   Alpha     1, 2
4     {a ∈ PM : T(a) ∈ Pe}                                                   Euclidean   Alpha     1, 2
Table 3.2: Topological representations for the protein mutation problem.
Here PW, PM, MW, and MM are the sets of atoms of the wild type protein, the mutant protein, the mutation
site in the wild type protein, and the mutated site in the mutant protein. Here Pe = {C, N, O} and T(·) is the
same as defined in Table 3.1. The distance function dop is similar to the one defined in Eq. (3.2), while the
affiliation function A(·) returns either M or P\M.
Figure 3.3: Mutation induced protein folding free energy changes. [26]
3.2.3 Neuron for persistence barcode
In addition to structured representations of barcodes, we introduce a neuron construction
that directly takes persistence barcodes as inputs. The basic idea is to treat a barcode as a
collection of points in a plane. Let B = {[bi, di)}i∈I be a barcode, which can be regarded as
a collection of pairs of numbers. With a chosen filtering function, one such neuron takes B
as input and its output is defined to be
N(B; Θ, c) = ψ( ∑_{i∈I} φ(bi, di; Θ) + c ),    (3.3)
where ψ is an activation function, φ is a filtering function, and Θ is the set of parameters in
φ that are to be tuned by the training process of the neural network. One natural choice of
φ is a radial basis function, and a specific setup is to use the Gaussian,
NG(B; µb, µd, σb, σd, c) = ψ( ∑_{i∈I} exp( −(bi − µb)²/(2σb²) − (di − µd)²/(2σd²) ) + c ).    (3.4)
The basic idea of such a construction is that instead of scanning over a predetermined
regular region of the persistence diagram, we let the network learn where to examine the
pattern of the persistent homology output. Intuitively, when a radial basis
function is used for φ, the output of a neuron characterizes the population of homology
generators in a specific area. The location and coverage of each neuron are learned
by the neural network from the given training data. A collection of such neurons forms an input
layer of the neural network which directly takes persistence barcodes as input without
formulating them into a structured construction. Regular densely connected layers can be stacked
upon this input layer to form a multilayer neural network. The parameters in the filtering
function φ carried by the neurons are regarded as part of the parameters to be tuned during
training of the entire neural network. This is feasible as long as a proper derivative of φ
can be derived. There are many possible choices for φ; for example, in
addition to radial basis functions, a family of Hermite functions can be used to examine
higher order patterns.
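A minimal sketch of the Gaussian neuron of Eq. (3.4), with fixed (untrained) parameters and numpy standing in for the actual network implementation:

```python
import numpy as np

def gaussian_barcode_neuron(bars, mu_b, mu_d, sigma_b, sigma_d, c, psi=np.tanh):
    """Eq (3.4): sum a 2D Gaussian bump over the (birth, death) points of a
    barcode, add a bias c, and pass the result through an activation psi.

    mu_* and sigma_* are the learnable location and coverage of the neuron;
    here they are fixed numbers since this sketch omits the training loop.
    """
    b = np.array([bar[0] for bar in bars])
    d = np.array([bar[1] for bar in bars])
    s = np.sum(np.exp(-(b - mu_b) ** 2 / (2 * sigma_b ** 2)
                      - (d - mu_d) ** 2 / (2 * sigma_d ** 2)))
    return psi(s + c)

# a neuron centered at (1, 2) responds strongly to a bar born near 1 dying near 2
out = gaussian_barcode_neuron([(1.0, 2.0), (10.0, 30.0)],
                              mu_b=1.0, mu_d=2.0, sigma_b=0.5, sigma_d=0.5, c=0.0)
```

The second bar, far from the neuron's center, contributes essentially nothing, illustrating how each neuron counts generators only in its learned region of the persistence diagram.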
3.2.4 Multichannel topological convolutional neural network
The preprocessed multichannel topological image is standardized to mean 0 and standard
deviation 1 for use in the convolutional neural network. A convolutional neural network
with a few 1D convolution layers, followed by several fully connected layers, is used to
extract higher level features from multichannel topological images and to perform regression
with the learned features. An illustration of the convolutional neural network structure is
shown in Fig. 3.4. A brief review of multichannel topological convolutional neural network
concepts is given for the case of 1D image-like inputs. The convolution operation, the
optimization method for feedforward neural networks, and the dropout technique, which
prevents overfitting, are discussed. One advantage of multichannel topological convolutional
deep neural networks is their ability to extract features hierarchically from low level
topological representations.
Convolution operation
Consider an n × m second order tensor V, where n is the number of topological feature
pixels and m is the number of channels per pixel. In this approach, n corresponds to the
radius filtration dimension of the biomolecular topological analysis and m corresponds to the
number of representation vectors used, which are defined in Eq. (3.1). With a predefined
window size w, a convolutional filter F can be represented by a w × m second order tensor.
Figure 3.4: An illustration of the 1D convolutional neural network. [26]
The network consists of repeated convolution layers and pooling layers followed by several fully connected
layers.
By moving the window of size w along the radius filtration direction of V, a sequence of
Nf second order tensors, which are subtensors of V, is obtained and can be concatenated
to form an Nf × w × m third order tensor T. The filter F operated on T results in a first
order tensor TijkFjk by tensor contraction. Concatenating the outputs of nf filters gives
an Nf × nf second order tensor. Generally speaking, a 1D convolution layer takes an n × m
tensor and outputs an Nf × nf tensor.
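The windowing and contraction just described can be sketched with numpy's einsum. The shapes follow the text; stride 1 and no padding (so Nf = n − w + 1) are assumptions of this sketch:

```python
import numpy as np

def conv1d_multichannel(V, filters):
    """1D convolution as described in the text.

    V: (n, m) array -- n filtration bins, m channels.
    filters: (n_f, w, m) array -- n_f filters of window size w.
    Returns an (N_f, n_f) array with N_f = n - w + 1 (stride 1, no padding).
    """
    n, m = V.shape
    n_f, w, _ = filters.shape
    N_f = n - w + 1
    # T[i] is the i-th w x m window of V; the contraction T_ijk F_ljk
    # collapses the window and channel axes for every filter l.
    T = np.stack([V[i:i + w] for i in range(N_f)])   # (N_f, w, m)
    return np.einsum('ijk,ljk->il', T, filters)      # (N_f, n_f)

V = np.arange(12.0).reshape(6, 2)    # n=6 filtration bins, m=2 channels
F = np.ones((3, 2, 2))               # n_f=3 identical all-ones filters, w=2
out = conv1d_multichannel(V, F)
```

With all-ones filters each output entry is simply the sum of the corresponding window, which makes the contraction easy to verify by hand.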
Dropout
Neural networks with several convolution layers and fully connected layers possess a large
number of degrees of freedom, which can easily lead to overfitting. The dropout technique
is a simple way of preventing network overfitting [173]. During the training process, hidden
units are randomly chosen to feed zero values to their connected neighbors in the next layer.
Suppose that a fraction of the neurons at a certain layer is dropped during
training. Then, in the testing process, the output of this layer is multiplied by the
coefficient 1 − λ, where λ is the dropout rate, to approximate the average of the networks
sampled by dropout during training.
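A sketch of the train/test behavior just described (numpy, hypothetical shapes). Note that many modern libraries instead rescale at training time ("inverted dropout"), whereas the text describes test-time scaling by 1 − λ:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, rate, rng):
    """Training: zero each unit independently with probability `rate`."""
    mask = rng.random(x.shape) >= rate
    return x * mask

def dropout_test(x, rate):
    """Testing: scale by 1 - rate to approximate the average dropped network."""
    return x * (1.0 - rate)

x = np.ones(100_000)   # a large hypothetical layer, so averages are stable
rate = 0.5
train_mean = dropout_train(x, rate, rng).mean()
test_mean = dropout_test(x, rate).mean()
```

With many units, the mean activation under training-time dropout matches the deterministic test-time scaling closely, which is exactly the approximation the text invokes.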
Bagging (bootstrap aggregating)
In addition to the dropout technique, which regularizes each individual model, bagging is
a technique that combines the outputs of several separately trained models by averaging, to
reduce generalization error. It is based on the assumption that models with randomness
in the training process are likely to make different errors on testing data. Generally, bagging
trains different models on different subsets of the training set. However, since neural
networks have relatively high intrinsic randomness, caused by factors including random weight
initialization and random minibatch partition, they can benefit from bagging even if the
individual models are trained on the same dataset. In this work, bagging of neural network
models trained individually with the same architecture and training dataset is used.
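The variance-reduction effect of bagging by averaging can be sketched on synthetic data. The noise model below is a hypothetical stand-in for networks that differ only through training randomness:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: each "model" outputs the true value plus independent
# noise, mimicking networks that differ only in weight init and batching.
true_y = np.linspace(5.0, 9.0, 50)   # e.g. synthetic pKd values
models = [true_y + rng.normal(0.0, 0.8, true_y.shape) for _ in range(100)]

def bagged_prediction(models, k, rng):
    """Average the outputs of k randomly chosen models (bagging by averaging)."""
    idx = rng.choice(len(models), size=k, replace=False)
    return np.mean([models[i] for i in idx], axis=0)

single_rmse = np.sqrt(np.mean((models[0] - true_y) ** 2))
bagged_rmse = np.sqrt(np.mean((bagged_prediction(models, 50, rng) - true_y) ** 2))
```

Averaging 50 models shrinks the independent noise roughly by a factor of sqrt(50), which mirrors the 50-of-100 bagging procedure used later in this chapter.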
Incorporating non-image-like features
Deep learning architectures also allow the use of non-image-like features together with
image or image-like features. In this work, additional auxiliary features that are important
to mutation analysis are incorporated after the convolution layers, as shown in Fig. 3.5. This
approach leads to a 9% improvement in mutation prediction on the "S2648" data set.
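A schematic of the merge, with numpy and hypothetical layer sizes standing in for the actual Keras layers used in this work:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical shapes for one sample: the flattened output of the convolution
# stack, plus a handful of auxiliary (non-image-like) features.
conv_features = rng.normal(size=32)   # from the convolution layers
aux_features = rng.normal(size=5)     # e.g. electrostatics or sequence scores

# Merge by concatenation at a fully connected layer, then apply that layer.
merged = np.concatenate([conv_features, aux_features])   # length 37
W = rng.normal(size=(37, 16))
b = np.zeros(16)
hidden = np.maximum(merged @ W + b, 0.0)                 # ReLU dense layer
```

The key design point is that the auxiliary features bypass the convolution stack entirely and join the learned topological features only at a fully connected layer.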
Multi-task learning
We construct a multi-task multichannel topological convolutional neural network (MM-
TCNN) architecture to carry out simultaneous training and prediction. The common
topological attributes and underlying physical interactions in the features provide a basis for
multi-task predictions. Because the deep neural networks are jointly trained on multiple
prediction tasks, we expect the networks to generate robust high-level representations from
low-level TFs for the prediction problems. We also expect the refined representations to
lead to prediction models with improved generalization performance. From the proposed deep
learning models, we hope to gain insight into how the nonlinear and nonlocal interactions
Figure 3.5: The deep learning architecture for the application to globular proteins. [26]
The non-image-like features are incorporated into the multichannel topological convolutional deep neural
network by merging them into the network at one of the fully connected layers.
among topological features impact various prediction tasks, which could further lead to a
better understanding of the interactions among biomolecular prediction tasks. Finally,
tasks with insufficient training data are more likely to benefit from the information
collected from tasks with large training sets in a multi-task learning framework.
In the present mutation analysis, there are two data sets. The mutation data of the
large data set for globular proteins are more reliable, while those of the small data set for
membrane proteins are noisy and less reliable, because the current technologies for membrane
protein mutagenesis experiments are immature. The prediction for membrane proteins
benefits from joint learning with the prediction for globular proteins. The coupling of the two
predictions through a neural network is shown in Fig. 3.6.
Figure 3.6: The multi-task deep learning architecture for membrane proteins. [26]
The globular protein stability change upon mutation prediction is used as an auxiliary task to improve the
task of predicting membrane protein stability changes upon mutation. The solid arrows show the path of
information passing when the model is applied for prediction. The dotted and dashed arrows mark the
paths of backpropagation when the network is trained with the globular protein data set and the membrane
protein data set, respectively.

The general objective function to minimize for multi-task learning through neural networks
can be decomposed into a training loss, a similarity penalty for the shared layers, and a
regularization term:

L(Θ; X, Y) = ∑_{j=1}^{N} Jj(ΘSj, ΘBj; Xj, Yj) + P(ΘS1, · · · , ΘSN) + R(Θ),    (3.5)
where Θ is the collection of all parameters to be updated, ΘSj is the set of shared-layer
parameters for the jth task, ΘBj is the set of parameters of the branch of neurons dedicated
to the jth task, and (Xj, Yj) are the training data for the jth task. Here P is the penalty
function, which penalizes the differences among the N sets of shared parameters, R(·) is the
regularization term, which prevents overfitting, and Jj is the jth loss function. In this work,
we force the shared layers of the two problems to be the same, and the regularization of the
network is realized using dropout.
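A toy rendering of Eq. (3.5), with linear maps as hypothetical stand-ins for the shared and branch layers. In this work the penalty P vanishes because the shared layers are literally the same object, and R is realized by dropout rather than an explicit term:

```python
import numpy as np

def mse(pred, y):
    return np.mean((pred - y) ** 2)

def multitask_loss(theta_shared, theta_branches, tasks, penalty=0.0, reg=0.0):
    """Eq (3.5) with linear stand-ins: summed per-task losses J_j, plus a
    similarity penalty P on shared parameters, plus a regularization term R.

    Here theta_shared is the same object for every task (as in this work),
    so the similarity penalty is zero by construction.
    """
    total = 0.0
    for theta_b, (X, y) in zip(theta_branches, tasks):
        hidden = X @ theta_shared           # shared layers (linear stand-in)
        total += mse(hidden @ theta_b, y)   # task-specific branch loss J_j
    return total + penalty + reg

rng = np.random.default_rng(2)
X1, y1 = rng.normal(size=(40, 3)), rng.normal(size=40)   # larger task
X2, y2 = rng.normal(size=(8, 3)), rng.normal(size=8)     # smaller task
theta_S = rng.normal(size=(3, 4))                        # shared parameters
branches = [rng.normal(size=4), rng.normal(size=4)]      # one branch per task
loss = multitask_loss(theta_S, branches, [(X1, y1), (X2, y2)])
```

Because both task losses backpropagate into the same theta_S, the small task effectively borrows gradient signal from the large one, which is the mechanism exploited for membrane proteins below.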
Model training and prediction
Due to the complexity of the network for the mutation example with auxiliary features, a
brief parameter search is performed using Hyperopt [15] with only 50 trials, allowing flexibility
in the number of neurons, the activation function, and the weight initialization. In the
protein-ligand binding example, only around 10 sets of parameters are selected manually and
tested, because of the large input size of the problem.
In the protein-ligand binding affinity predictions, we train 100 single neural networks
individually. To test the performance of bagging, we randomly select 50 of the 100 trained
models and output the average of their outputs as the prediction. The performance is then
computed for the bagged predictor. This process is repeated 100 times and both the median
and the best results are reported.
In the mutation induced protein stability predictions, we use the same procedure as in
the protein-ligand binding prediction for the "S350" task, where the training and testing
split is predefined. In the case of cross validation, 10 sets of 5-fold splits are generated
randomly and 20 single models are trained for each split. The average prediction is taken
over the 20 models within each split and the median result over the 10 splits is reported.
Bagging of only 20 models is performed here because it is not valid to bag predictors across
different cross validation splits. Bagging 50 models would require 50 (individual models)
× 10 (cross validation splits) × 5 (folds) = 2500 training processes, which is too
computationally expensive.
Software
The Dionysus software [129] with the CGAL library [50] is used for persistent homology
computation on alpha complexes. The Javaplex [2] and Dipha [10] software packages are used
for persistent homology computation on Vietoris-Rips complexes. The neural networks are
realized using the Keras [40] wrapper with the Theano [177] backend. Various functions from
the Numpy and Scipy [185] packages are used to process data and evaluate performance.
3.3 Results
3.3.1 Deep learning prediction of protein-ligand binding affinities
Protein-ligand binding is a fundamental biological process in cells, involving detailed
molecular recognition, synergistic protein-ligand interactions, and possibly protein
conformational changes. Agonist binding is crucial to receptor function and typically triggers
a physiological response, such as transmitter-mediated signal transduction, hormone
and growth factor regulated metabolic pathways, stimulus-initiated gene expression, enzyme
production, and cell secretion. Understanding protein-ligand interactions has been a
fundamental issue in molecular biophysics, structural biology, and medicine. A specific task in
drug and protein design is to predict the protein-ligand binding affinity from given structural
information [82]. Protein-ligand binding affinity is a measure of the strength of binding, which
indicates the degree of occupancy of a ligand at the corresponding protein binding site and is
affected by several factors, including intermolecular interaction strength and solvation effects.
The ability to predict protein-ligand binding affinity to a desired accuracy is a prerequisite
for the success of many applications in biochemistry, such as protein-ligand docking and drug
discovery. In general, there are three types of binding affinity predictors (commonly called
scoring functions): physics based [147, 207], empirical [211, 183, 63, 190, 212, 131, 182, 94],
and knowledge based [114, 105, 6]. Physics based scoring functions invoke QM
and QM/MM approaches [121, 35] to provide unique insights into the molecular mechanism
of protein-ligand interactions. A prevalent view is that binding involves intermolecular
forces, such as steric contacts, ionic bonds, hydrogen bonds, hydrophobic effects, and van
der Waals interactions. Empirical scoring functions work well but require carefully selected
data sets and parametrization [183, 63, 190]. However, both physics based and empirical
scoring functions employ linear superposition principles that are not explicitly
designed to deal with exponentially growing and increasingly diverse experimental data sets.
Knowledge based scoring functions use modern machine learning techniques, which utilize
nonlinear regression and exploit large data sets to uncover underlying patterns within the
data. Given the current massive and complex data challenges, knowledge based scoring
functions outperform other scoring functions [211].
In this study, the proposed method is tested on the PDBBind 2007 data set [119]. The
PDBBind 2007 core set of 195 protein-ligand complexes is used as the test set, and the
PDBBind 2007 refined set, excluding the core set, is used as the training set with 1105
protein-ligand complexes. A comparison between our TNet binding predictor
(TNet-BP) and other binding affinity predictors is summarized in Table 3.3. TNet-BP
outperforms the other scoring functions reported by Li et al [115] on the task of binding
affinity prediction from structures.

TNet-BP is also validated on a larger dataset, the PDBBind v2016 refined set of 4057
complexes, where the training set contains the 3767 samples of the refined set minus the core
set, and the test set is the core set with 290 samples. All model parameters and training
procedures are the same as those used for the v2007 dataset, except that the number of epochs
is set to 500 instead of 2000 due to the larger data size. The median RP and RMSE are 0.81
and 1.34 pKd/pKi units, respectively.
Method                          RP      RMSE
TNet-BP                         0.826a  1.37
RF::VinaElem                    0.803   1.42
RF::Vina                        0.739   1.61
Cyscore                         0.660   1.79
X-Score::HMScore                0.644   1.83
MLR::Vina                       0.622   1.87
HYDE2.0::HbondsHydrophobic      0.620   1.89
DrugScore                       0.569   1.96
SYBYL::ChemScore                0.555   1.98
AutoDock Vina                   0.554   1.99
DS::PLP1                        0.545   2.00
GOLD::ASP                       0.534   2.02
SYBYL::G-Score                  0.492   2.08
DS::LUDI3                       0.487   2.09
DS::LigScore2                   0.464   2.12
GlideScore-XP                   0.457   2.14
DS::PMF                         0.445   2.14
GOLD::ChemScore                 0.441   2.15
PHOENIX                         0.616   2.16
SYBYL::D-Score                  0.392   2.19
DS::Jain                        0.316   2.24
IMP::RankScore                  0.322   2.25
GOLD::GoldScore                 0.295   2.29
SYBYL::PMF-Score                0.268   2.29
SYBYL::F-Score                  0.216   2.35
Table 3.3: Performance comparisons of TNet-BP and other methods.
Comparison of optimal Pearson correlation coefficients RP and RMSEs (pKd/pKi) of various scoring functions
for the prediction of protein-ligand binding affinity on the PDBBind 2007 core set. Except for the result
of our TNet-BP, all results are adopted from Li et al [115]. a Median result (the best RP = 0.828
and the best RMSE = 1.37 for this method).
3.3.2 Deep learning prediction of protein folding free energy changes
upon mutation
Apart from some exceptions, proteins fold into specific three-dimensional structures to
provide the structural basis for living organisms. Protein functions, e.g., acting as enzymes,
cell signaling mediators, ligand receptors, and structural supports, are typical consequences of
a delicate balance between protein structural stability and flexibility. Mutation that changes
protein amino acid sequences through non-synonymous single nucleotide substitutions
(nsSNPs) plays a fundamental role in selective evolution. Such substitutions may lead to the
loss or the modification of certain functions, and mutations are often associated with various
human diseases [210, 109]. For example, mutations in proteases and their natural inhibitors
result in more than 60 human hereditary diseases [160]. Additionally, mutation can also lead
to drug resistance [123]. Artificially designed mutations are used to understand mutation
impacts on protein structural stability, flexibility and function, as well as mutagenic diseases
and the evolution pathways of organisms [68]. However, mutagenesis experiments are typically
costly and time-consuming. Computational prediction of mutation impacts can systematically
explore protein structural instabilities, functions, disease connections, and organismal
evolution pathways [85], and provides an economical, fast, and potentially accurate alternative
to mutagenesis experiments. Many computational methods have been developed in the past
decade, including a support vector machine based approach [27], statistical potential based
approaches [55], a knowledge-modified MM/PBSA approach [78], Rosetta protocols [100],
FoldX (3.0, beta 6.1) [85], SDM [193], DUET [159], PPSC (Prediction of Protein Stability,
version 1.0) with the 8 (M8) and 47 (M47) feature sets [205], PROVEAN [39], ELASPIC [16],
STRUM [161], and EASE-MM [69].
The proposed method is tested on a data set of 2648 mutation instances in 131 proteins,
named the "S2648" data set [55], in a 5-fold cross validation task over the "S2648" set and in
a prediction task for the "S350" set, which is a subset of the "S2648" set. The "S2648" set,
excluding the "S350" subset, is used as the training set in the prediction of the "S350" set.
All thermodynamic data are obtained from the ProTherm database [12]. A comparison of
the performance of various methods is summarized in Table 3.4. Among them, STRUM
[161] is based on structural, evolutionary, and sequence information and achieves excellent
performance. We have therefore constructed two topology based neural network mutation
predictors (TNet-MPs). TNet-MP1 is solely based on topological information, while TNet-
MP2 is aided by auxiliary features characterizing electrostatics, evolutionary, and sequence
information, which are merged into the convolutional neural network at one of the fully
connected layers. TNet-MP2 significantly improves our original topological prediction,
indicating the importance of the aforementioned auxiliary information for mutation
prediction.
                       S350                      S2648
Method                 nd     RP     RMSE        nd     RP     RMSE
TNet-MP2               350    0.81   0.94        2648   0.77   0.94
STRUMb                 350    0.79   0.98        2647   0.77   0.94
TNet-MP1               350    0.74   1.07        2648   0.72   1.02
mCSMb,c                350    0.73   1.08        2643   0.69   1.07
INPSb,c                350    0.68   1.25        2648   0.56   1.26
PoPMuSiC 2.0b          350    0.67   1.16        2647   0.61   1.17
PoPMuSiC 1.0a          350    0.62   1.23        –      –      –
I-Mutant 3.0b          338    0.53   1.35        2636   0.60   1.19
Dmutanta               350    0.48   1.38        –      –      –
Automutea              315    0.46   1.42        –      –      –
CUPSATa                346    0.37   1.46        –      –      –
Erisa                  334    0.35   1.49        –      –      –
I-Mutant 2.0a          346    0.29   1.50        –      –      –
Table 3.4: Performance comparisons of TNet-MP and other methods.
Comparison of Pearson correlation coefficients (RP) and RMSEs (kcal/mol) of various methods on the
prediction task for the "S350" set and 5-fold cross validation of the "S2648" set. TNet-MP1 is our multichannel
topological convolutional neural network model that solely utilizes topological information. TNet-MP2 is
our model that complements TNet-MP1 with auxiliary features. a Data directly obtained from Worth et
al [193]. b Data obtained from Quan et al [161]. c The results reported in the publications are listed in the
table. According to Ref. [161], the data from the online servers have RP (RMSE) of 0.59 (1.28) and 0.70 (1.13)
for INPS and mCSM, respectively, on the S350 task. d Number of samples successfully processed.
3.3.3 Multi-task deep learning prediction of membrane protein mutation impacts
Multi-task learning offers an efficient way to improve predictions associated with small
data sets by taking advantage of other, larger data sets [214]. Although a large amount
of thermodynamic data is available for globular protein mutations, the mutation data set for
membrane proteins is relatively small, between 200 and 300 proteins [108]. The small size
of the membrane protein mutation data limits the success of data driven approaches, such as
ensembles of trees, while the popular multi-task learning framework built on linear regression
with regularization techniques lacks the ability to extract the relationship between very low
level descriptors and the target quantity. A neural network with a hierarchical structure
provides a promising option for such problems. We add the prediction of globular protein
stability changes upon mutation as an auxiliary task for the prediction of membrane protein
stability changes upon mutation. In the designed network architecture, the two tasks share
convolution layers, and the network then splits into two branches of fully connected layers,
one for each task. Intuitively, the task of globular protein mutation prediction helps to
extract higher level features from the low level topological representations, so that the branch
for membrane protein mutation prediction learns the feature-target relationship from the
learned high level features.
The proposed method is tested with 5-fold cross validation on a set of 223 mutation
instances of membrane proteins covering 7 protein families, named the "M223" data set [108]. A
comparison with other methods is shown in Table 3.5. TNet-MMP1 employs a multichannel
topological convolutional neural network with topological features from the "M223" data
set, while TNet-MMP2 is a multi-task multichannel topological convolutional neural network
(MM-TCNN) architecture. Unlike TNet-MP2, neither TNet-MMP1 nor TNet-MMP2
uses auxiliary features; our goal is to test the performance of the multi-task architecture
on the improvement of high level feature extraction from low level features. The Pearson
correlation coefficient of membrane protein mutation prediction is improved by 9.6%, i.e.,
from 0.52 to 0.57, by the multi-task algorithm that trains and predicts the present "M223"
data set together with the "S2648" data set. As noted by Kroncke et al, there are at present
no reliable methods for the prediction of membrane protein mutation impacts [108]. Our TNet
results, though still not practically useful, are the best among the methods tested on this
problem.
Method            RP     RMSE
TNet-MMP2d        0.57   1.09
TNet-MMP1c        0.52   1.15
RosettaMP         0.31   –
Rosetta (high)a   0.28   –
FoldX             0.26   2.56
PROVEAN           0.26   4.23
RosettaMPddG      0.19   –
Rosetta (low)b    0.18   –
SDM               0.09   2.40
Table 3.5: Performance comparisons of TNet-MMP and other methods.
Comparison of Pearson correlation coefficients (RP) and RMSEs (kcal/mol) for 5-fold cross validation on the
"M223" data set for various methods. Except for the present results for TNet-MMP1 and TNet-MMP2,
all results are adopted from Kroncke et al [108]. The results of the Rosetta methods are obtained from
Fig. S1 of Ref. [108], where the RMSE is not given. The results of the other methods are obtained from Table S1
of Ref. [108]. Many less competitive results of the machine learning based methods reported in Ref. [108]
are not listed, since those servers were not trained on the membrane protein data set. Among the methods
listed, only the Rosetta methods have terms describing the membrane protein system. a High resolution. b
Low resolution. c The multichannel topological convolutional neural network architecture with topological
features from the "M223" data set. d The multi-task multichannel topological convolutional neural network
(MM-TCNN) architecture trained with the auxiliary task of globular protein prediction using the "S2648"
data set.
3.4 Discussion and conclusion
The adoption of convolutional neural network concepts in this work is motivated by the
underlying spatial relationship along the distance scale (filtration) dimension. Properties
residing at different distance scales are heterogeneous, so unlike images or videos, there is
no obvious transferable property of the convolution filters along the convolution dimension in
the proposed method. To take this into account, the convolution layers were substituted
with "locally connected layers", where the local connection property is conserved whilst
the filters applied at different distance scales are allowed to differ. The RMSE is
in kcal/mol for the mutation problems and in pKd/pKi units for the protein-ligand binding
problem. The performance in RP (RMSE) significantly decreases from 0.81 (0.94) to 0.77
(1.02) for the "S350" prediction task in the mutation impact example. This shows
that the construction of lower level features in the lower sparse layers benefits from sharing
filters along the distance scale, and indicates the existence of common rules for feature
extraction at different distance scales.
Intuitively, the dimension 0 inputs describe pairwise atomic interactions, which clearly
contribute to the prediction of the target properties. In contrast, dimension 1 and dimension
2 topological features characterize the hydrophobic network and geometric rings and voids.
To understand to what extent the higher topological dimensions help the characterization
of biomolecules, we separate the dimension 0 inputs from the higher dimensional inputs in the
prediction of the "S350" set in the mutation impact example and in the protein-ligand binding
affinity prediction for the v2007 set. To compare the performance of the different feature
sets, 50 single models are trained for each set. Twenty of the 50 trained models are randomly
chosen and bagged, this procedure is repeated 100 times, and the median results are reported.
The individual performances measured by RP (RMSE) for the dimension 0 features are 0.73
(1.09) and 0.82 (1.40) for the mutation and binding predictions, respectively. For the
dimension 1 and 2 features, RP (RMSE) are 0.66 (1.21) and 0.78 (1.54), respectively. The
combination of features of all dimensions results in better RP (RMSE) of 0.74 (1.08) and 0.83
(1.37), respectively, showing that both sets of features contribute to the predictions. The
alpha complex is used for geometric characterization and is therefore in R3, with Betti
numbers up to dimension 2. It is possible that higher dimensional Betti numbers in a more
abstract setup, such as a Vietoris-Rips complex for the characterization of an interaction
network, would enrich the representation and deliver improved results.
Another popular class of machine learning methods is the ensemble of trees. Many
modern methods for biomolecular property prediction are based on random forests
(RF) and gradient boosting trees (GBTs). Ensembles of decision trees are capable of learning
complicated functions, but GBTs partition the feature space based on the training data,
which means that they cannot appropriately extrapolate the learned function to situations
beyond the provided training data. Additionally, data samples are generally unevenly
distributed. It has been observed in many applications that, when only a handful of samples
in the dataset have large absolute values of the target property, ensemble of trees methods
tend to overestimate (underestimate) the border cases with very negative (positive) target
values. Neural networks, owing to their different way of learning the underlying function,
appear to deliver better results for these border cases. Therefore, similar to the idea of
bagging, ensemble of trees methods and neural network based methods may exhibit different
error characteristics on different samples, and can potentially improve predictive power by
correcting each other's errors when the results of the different models are averaged. In
the example of prediction of the "S350" set, we obtained a performance of 0.82 (0.92) in
RP (RMSE) in our other work using handcrafted features with gradient boosting trees [23].
When the results of the two methods are averaged, the performance improves to 0.83
(0.89), which is better than either individual method. A similar improvement is observed for
the protein-ligand binding example with the v2007 set: our method based on handcrafted
features and gradient boosting trees, with performance 0.82 (1.40) [24], and the method
presented in this work, with performance 0.83 (1.37), achieve an improved performance of 0.84
(1.35) when the two results are combined by averaging. An intuitive illustration is shown in
Fig. 3.7. It can be seen from the plot that the neural network based method presented in this
work performs better than the GBT based method for samples with high ∆∆G or with
low ∆∆G. The slope of the linear fit of the predicted values to the experimental data is
Figure 3.7: A comparison of the behaviors of the GBT based method and the neural network
based method. [26]
The plot is for the prediction task of the S350 dataset. The linear fit for the GBT prediction [23] is y =
0.603x − 0.435, and for TNet-MP2, y = 0.657x − 0.422.
0.66 for the neural network based method and 0.60 for the GBT based method, which also
illustrates that the neural network based method handles the border cases better. The overall
observed improvement is nonetheless marginal, since it mainly concerns a small portion of
the samples.
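The averaging of a GBT-style and an NN-style predictor can be sketched on synthetic data. The slopes 0.603 and 0.657 below are borrowed from the linear fits quoted for Fig. 3.7; everything else (the data and the noise model) is a hypothetical stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)
true = np.linspace(-4.0, 2.0, 200)   # synthetic "experimental" ddG values

# Hypothetical predictors: both shrink extremes toward the mean (slope < 1),
# the GBT stand-in slightly more so, mimicking its weaker extrapolation.
gbt = 0.603 * true + rng.normal(0.0, 0.30, true.shape)
nn = 0.657 * true + rng.normal(0.0, 0.35, true.shape)
combined = 0.5 * (gbt + nn)          # bagging-style average of the two models

def rmse(pred):
    return float(np.sqrt(np.mean((pred - true) ** 2)))
```

Averaging partially cancels the independent noise of the two predictors while interpolating their biases, which is the mechanism behind the small but consistent improvement reported above.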
In conclusion, the approach introduced in this work utilizes element-specific persistent
homology to efficiently characterize 3D biomolecular structures in terms of multichannel
topological invariants. The convolutional neural network facilitates automatic feature
extraction from the multichannel topological invariant inputs. The flexible and hierarchical
structure of neural networks allows the seamless combination of automatically extracted
features and handcrafted features. It also makes it easy to implement multi-task learning,
by combining related tasks to a desired level of model sharing through the choice of the
layer at which the model branches. The proposed topology based neural network (TopologyNet)
methods have been shown to outperform other existing methods in protein-ligand binding
affinity prediction and mutation induced protein stability change prediction. The proposed
methods can easily be extended to other applications in the structural prediction of
biomolecular properties, and have the potential to further benefit from the rapidly accumulating
biomolecular data. The combination of the proposed methods with existing RF and GBT
based methods is expected to deliver improved results.
Chapter 4
Persistent cohomology for data with
heterogeneous dimensions
4.1 Introduction
With the advancements in sensor hardware, data collection software, and data organization
and storage frameworks, various datasets are expanding at an unprecedented speed, and a
large part of the newly accumulated data is high-dimensional, highly complex, diverse, and
often noisy. The rapid growth of datasets demands robust and automatic data analysis tools.
While many widely used data analysis methods make assumptions about data complexity and/or
the underlying dimensionality, and some other methods often require knowledge from domain
experts, an emerging family of data analysis methods called topological data analysis [28]
(TDA) makes minimal assumptions about the data. TDA characterizes the shapes of data in
various dimensions, scans over a wide range of scales, and is often robust against noise.
A point cloud dataset can be regarded as embedded in Euclidean space, which allows
the use of the radius filtration associated with the alpha complex [60]. A more general
distance filtration associated with a Vietoris-Rips complex [87] or a Čech complex can be
used to allow a predefined distance function suitable for specific applications [198, 200]. It
is also possible to use a more flexible construction by directly assigning filtration values to
simplices in a complex, which is considered the final structure at the end of the filtration.
In many applications, persistent homology is used to analyze the topological structures
of datasets with generalized but homogeneous information. For example, once the genetic
distance between genes is defined by the number of mutations, persistent homology can be
used to analyze the topological properties of a gene evolution dataset. When the information
in a dataset is heterogeneous, i.e., it involves multicomponent information, special
treatments are needed. For example, the vineyards method [45, 133] has been used to study
spatiotemporal data. Additionally, element-specific persistent homology was introduced to
deal with molecular datasets with chemical, physical, and biological information [23, 24, 26, 20].
In Chapter 3, we developed a deep learning model for biomolecular predictive modeling
based on topological representations of mainly the biomolecular geometries. When persistent
homology is applied to complex molecular structures, in addition to the point cloud
in Euclidean space representing the coordinates of atoms, there is additional physical
and chemical information such as element types, atomic partial charges, Coulomb and van
der Waals interactions between atoms, and hydrophobic interactions among carbon atoms.
Another general situation is that the data have multiple dimensions with heterogeneous
meanings, where persistent homology analysis is done on certain dimensions while the
information of the other dimensions is also to be reflected in the topological analysis.
Therefore, there is a need to incorporate multicomponent heterogeneous information into
topological representations. Although one could resort to the tight representative cycles
of homology generators [144], we prefer the cohomology framework because it is flexible,
and it is natural to view cochains as assigning weights on the simplicial complex, which
provides more quantitative representations. We consider cohomology theory with a graph
Laplacian or a Laplacian defined on simplicial complexes to localize and smooth the
representatives of (co)homology generators in the data, and describe the additional
information in the form of cochains (functions on chains) by computing inner products of
these cochains and the smoothed (persistent) cohomology representatives.
Cohomology provides a richer algebraic structure for a topological space. The cohomology
construction used in this work dualizes homology by building cochains as functions on chains
in homology theory. Cohomology theory has been applied in both mathematics and the
field of data analysis. One well-known cohomology theory is the de Rham cohomology,
which studies the topological features of smooth manifolds using differential forms. The de
Rham cohomology has led to further theoretical developments such as Hodge theory.
Recently, a discrete exterior calculus framework has been established [90] where manifolds
are approximated by mesh-like simplicial complexes and the discrete counterparts of
continuous concepts such as differential forms are defined thereafter. This framework has
wide applications; for example, the harmonic component of the discrete Hodge decomposition
has been used in the sensor network coverage problem to localize holes in a sensor network
[13]. Cohomology theory has also been applied in the field of persistent homology. A
1-dimensional cohomology was used to assign circular values to the input data associated with
a homology generator [54], which further led to applications in several fields including the
analysis of neural data [167] and the study of periodic motion [181]. Persistent cohomology
in higher dimensions has been used to produce coordinate representations that reduce
dimensionality while retaining the topological properties of data [155]. Weighted (co)homology
and weighted Laplacians were introduced with biological applications [194]. Computationally,
the duality between homology and cohomology [53] has laid the basis for constructing more
efficient algorithms that utilize cohomology to compute persistent homology. Several code
implementations, such as Dionysus [129] and Ripser [8], speed up the persistent homology
computation by taking advantage of this property.
In this work, we seek a formulation that can utilize a function fully or partially defined
on a simplicial complex constructed from the input data at locations associated with homology
generators. To this end, we need a representation that can locate homology generators. When
manifold-like simplicial complexes are available, we can look for harmonic (in the sense of
the Laplace-de Rham operator) cohomologous cocycles under the framework of discrete exterior
calculus [89]. A discrete version of the Hodge-de Rham theorem guarantees the uniqueness of
the harmonic cocycle if certain conditions are satisfied [89]. However, this method requires
the proper construction of the Hodge star operator, which usually relies on a well-defined
dual complex, while in general applications this is not always feasible. For example, when
a user-defined distance matrix is used with the Rips complex, the distance may not even
satisfy the triangle inequality. Therefore, we relax our requirement on the accuracy of geometric
localization and represent the set of simplices of a certain dimension in a simplicial complex
as a graph, where the simplices are represented by graph nodes and their adjacency is treated
as graph edges. We can also define a Laplacian on simplicial complexes by first introducing
an inner product of cochains and then constructing an adjoint of the coboundary operator.
Then, the smoothness of a cocycle can be measured by a Laplacian. Specifically, given
a representative cocycle of a homology generator, we look for a cohomologous cocycle that
minimizes the norm of its image under the Laplacian. We can then consider such smoothed
cocycles, which distribute smoothly around the holes of certain dimensions, as measures on
simplicial complexes, and describe the input functions defined on the simplicial complexes by
integrating with respect to these measures. The present formulation also utilizes the filtration
process to assign a function over the filtration interval to each bar in the barcode
representation, resulting in an enriched barcode representation of persistent cohomology. A
modified Wasserstein distance is defined and implemented subsequently to facilitate the
comparison of these enriched barcodes generated from data.
In the rest of this chapter, the basic background of cohomology is given and the proposed
method is described in detail in Section 4.2. In Section 4.3, we illustrate the proposed
method with simple examples, example datasets, and the characterization of molecules. We
also demonstrate the utility of the proposed persistent cohomology in the prediction of
protein-ligand binding affinities on large datasets.
4.2 Methods
We refer readers to Section 2.1 for the basics and deﬁnition of persistent homology.
4.2.1 Cohomology
Like homology, cohomology is a sequence of abelian groups associated to a topological
space X and is defined from a cochain complex, whose elements are functions on the chain
groups in homology theory. Specifically, a k-cochain is a function α : X_k → R, where R is a
commutative ring. The set of all k-cochains, with the addition inherited from R, is called
the k-th cochain group, denoted C^k(X, R). The coboundary operator d_k : C^k(X, R) → C^{k+1}(X, R)
maps a cochain to a cochain one dimension higher and is the counterpart of the boundary
operators for chains, namely

d_k(α)([v_0, ..., v_{k+1}]) = Σ_{i=0}^{k+1} (−1)^i α([v_0, ..., v̂_i, ..., v_{k+1}]),
for a k-cochain α. It should be noted that in the matrix representations of the two operators,
d_k and ∂_{k+1} are transposes of each other. When there is no ambiguity, we simply refer to
d_k as d. A k-cochain is called a coboundary if it is in the image of d_{k−1}. A k-cochain is
called a cocycle if its image under d_k is 0. The coboundary operators have the property that
d_k ◦ d_{k−1} = 0, which follows from d_k ◦ d_{k−1} = ∂_{k+1}^T ◦ ∂_k^T = (∂_k ◦ ∂_{k+1})^T. The k-th cohomology
group is defined to be the quotient group H^k(X, R) = Ker(d_k)/Im(d_{k−1}). Two cocycles are
called cohomologous if they differ by a coboundary.
In practice, a field is used instead of a ring in the computation of persistent
(co)homology. In this work, we consider the finite field Z_p for some prime p when computing
cohomology or persistent cohomology.
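Since the coboundary operators are transposes of the boundary operators, the identity d_k ◦ d_{k−1} = 0 can be checked numerically on a tiny complex. The sketch below uses a filled triangle with an assumed vertex ordering (an illustration, not an example from the text) and verifies that the coboundaries compose to zero.

```python
import numpy as np

# Boundary matrices of a filled triangle {v0, v1, v2} (vertex ordering assumed).
# Columns of B1 are the edges [v0,v1], [v0,v2], [v1,v2]; rows are the vertices.
B1 = np.array([[-1, -1,  0],
               [ 1,  0, -1],
               [ 0,  1,  1]])
# The single column of B2 is the 2-simplex [v0,v1,v2]; rows are the edges above.
B2 = np.array([[ 1],
               [-1],
               [ 1]])

# Coboundary operators are the transposes of the boundary operators.
d0 = B1.T   # 0-cochains -> 1-cochains
d1 = B2.T   # 1-cochains -> 2-cochains

assert (B1 @ B2 == 0).all()   # boundary of a boundary is zero
assert (d1 @ d0 == 0).all()   # hence d1 ∘ d0 = (∂1 ∘ ∂2)^T = 0
```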
Given a filtration of a simplicial complex, similarly to persistent homology, the persistent
cohomology can be derived from the following sequence:

H^k(X(x_0), Z_p) ← H^k(X(x_1), Z_p) ← ··· ← H^k(X(x_l), Z_p).

The universal coefficient theorem for cohomology (Theorem 3.2 in [86]) implies that there is
a natural isomorphism H^k(X, Z_p) ≅ Hom_{Z_p}(H_k(X, Z_p), Z_p), so that the cohomology group
can be considered as the dual space of the homology group. This property further implies
that rank(H^k(X, Z_p)) = rank(H_k(X, Z_p)), and thus persistent homology and persistent
cohomology have identical barcodes [53].
4.2.2 Smoothed cocycle
Some representative cocycles in persistent cohomology may not reflect the overall location
and structure associated with their cohomology generators. To better embed the additional
information in the data into the cohomology generators, we look for a smoothed representative
cocycle in each cohomology class. The smoothness of a function can usually be measured by a
Laplacian. We construct smoothed representative cocycles with a Laplacian in this section.
Laplacian on simplicial complex
A Laplacian for cochains can be defined by first defining an inner product and using the
induced adjoint operator. Assuming real coefficients, for α_1, α_2 ∈ C^k(X, R), the inner
product can be defined as

⟨α_1, α_2⟩_k = Σ_{σ∈X_k} α_1(σ) α_2(σ). (4.1)

Then, the adjoint d_k^* : C^{k+1}(X, R) → C^k(X, R) of the operator d_k with respect to this
inner product is defined by

⟨d_k α, β⟩_{k+1} = ⟨α, d_k^* β⟩_k, for α ∈ C^k(X, R), β ∈ C^{k+1}(X, R). (4.2)

Weights reflecting the sizes of simplices can be used to encode the geometry by defining a
weighted inner product,

⟨α_1, α_2⟩_k^w = Σ_{σ∈X_k} s_σ α_1(σ) α_2(σ), (4.3)

where s_σ is the size of σ, such as the area or volume if σ is a 2- or 3-simplex. Then, a
Laplacian on X_k can be defined by

L_sc = d_k^* d_k + d_{k−1} d_{k−1}^*. (4.4)

An inner product based on the wedge product can also be constructed if a manifold-like
simplicial complex is given.
Weighted graph Laplacian
We can also represent X_k as a graph where the nodes are simplices and the edges describe
adjacency. Consider a graph associated to X_k where each k-simplex is represented by a node
and there is an edge if two k-simplices have a nonempty intersection. Note that this is a
simple graph, and we define a weight matrix W = (w_ij) by

w_ij = v(σ_i) v(σ_j), if σ_i ∩ σ_j ≠ ∅; w_ij = 0, otherwise, (4.5)

where v(σ) is the size of σ, for example, the area of a 2-simplex or the volume of a 3-simplex.
The size of a 0-simplex is defined to be 1. Then, the W-weighted graph Laplacian [42] L_W
is defined as

(L_W)_{ij} = 1 − w_{ii}/w_i, if i = j and w_i ≠ 0;
(L_W)_{ij} = −w_{ij}/√(w_i w_j), if σ_i ∩ σ_j ≠ ∅ and i ≠ j;
(L_W)_{ij} = 0, otherwise, (4.6)

where w_i = Σ_j w_ji. A k-cochain α can be represented by a column vector given the basis
in Eq. (4.7). The matrix L_W measures the difference between the value of a cochain on a
simplex and its values on the neighbors of this simplex. A large penalty is given to prevent
rapid changes through smaller simplices.
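Eqs. (4.5) and (4.6) translate directly into a few lines of linear algebra. The sketch below is a minimal implementation assuming unit simplex sizes; the function name and the edge list are illustrative assumptions. Applied to the four unit edges of a square, it reproduces the matrix used in the minimalist example of Section 4.3.1.

```python
import numpy as np

def weighted_graph_laplacian(simplices, sizes=None):
    """Weighted graph Laplacian of Eq. (4.6) for a list of k-simplices.

    simplices: list of frozensets of vertex ids; sizes: v(sigma) per simplex,
    defaulting to 1 (as for 0-simplices or unit-size simplices).
    """
    n = len(simplices)
    v = np.ones(n) if sizes is None else np.asarray(sizes, dtype=float)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if simplices[i] & simplices[j]:   # nonempty intersection (includes i == j)
                W[i, j] = v[i] * v[j]
    w = W.sum(axis=0)                         # w_i = sum_j w_ji
    L = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j and w[i] != 0:
                L[i, j] = 1.0 - W[i, i] / w[i]
            elif i != j and W[i, j] != 0:
                L[i, j] = -W[i, j] / np.sqrt(w[i] * w[j])
    return L

# The four unit edges of a square: each edge meets itself and two neighbors,
# giving diagonal entries 2/3 and neighbor entries -1/3.
edges = [frozenset(e) for e in [(0, 1), (1, 2), (2, 3), (3, 0)]]
L = weighted_graph_laplacian(edges)
assert abs(L[0, 0] - 2 / 3) < 1e-12
assert abs(L[0, 1] + 1 / 3) < 1e-12
assert L[0, 2] == 0.0   # opposite edges of the square do not intersect
```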
4.2.3 Enriched persistent barcode
We describe the workflow in this section. Given a simplicial complex X of dimension n and
a function f : X_k → R with 0 ≤ k ≤ n, we seek a method to embed the information of f
in the persistence barcodes obtained with a chosen filtration of X. In other words, we seek
a representation of f on cohomology generators. To this end, smoothed representations are
first computed for the cohomology generators. Such a smoothed representation induces a
measure on the simplicial complex, which allows us to integrate f on X. We describe the
protocol of our approach below.
Dimension greater than 0
Consider a filtration of X, ∅ = X(x_0) ⊆ X(x_1) ⊆ ··· ⊆ X(x_n) = X, and an associated
persistent cohomology with a prime p other than 2. Let ω be a representative cocycle for a
persistence interval [x_i, x_j) of dimension k > 0. The cocycle ω is first lifted to a cocycle ω̂
with integer coefficients satisfying ω(σ) ≡ ω̂(σ) (mod p) and ω̂(σ) ∈ {i ∈ Z : −(p − 1)/2 ≤
i ≤ (p − 1)/2} for all σ ∈ X_k. This is almost always possible [54]. Now ω̂ is an integer
cocycle and thus also a real cocycle. We use the basis for k-cochains {α_{σ_i}}_i, where

α_{σ_i}(σ) = 1 if σ = σ_i, and 0 otherwise. (4.7)

Given a Laplacian L on cochains to measure smoothness, a smoothed cocycle ω̄ can be
obtained by solving the least squares problem

ᾱ = argmin_{α∈C^{k−1}(X,R)} ‖L(ω̂ + dα)‖_2^2, (4.8)

and letting ω̄ = ω̂ + dᾱ. This smoothed cocycle ω̄ induces a measure µ on X_k by setting

µ(σ) = ω̄(σ). (4.9)
To obtain a sequence of such smoothed real k-cocycles for the cohomology generator along
a persistence interval, we restrict the representative integer cocycle ω̂ to subcomplexes of X
and repeat the smoothing computation. Consider the integer k-cocycle ω̂|_{X(x)} at filtration
value x. The corresponding smoothed real k-cocycle ω̄_x can be obtained by solving the
optimization problem of Eq. (4.8) for ω̂|_{X(x)} on C^{k−1}(X(x), R), and it induces a measure µ_x
on X_k(x) as described in Eq. (4.9). It suffices to compute for all the distinct filtration values
in [x_i, x_j) because we have a finite filtration, which gives {µ_{x_ℓ}}_{ℓ=i}^{j−1}.
A function of filtration values f^* can be defined for each persistence interval [x_i, x_j) as

f^*(x) = ∫_{X_k(x)} f dµ_x / ∫_{X_k(x)} dµ_x (4.10)

for x ∈ [x_i, x_j). We call the collection of persistence intervals, each associated with one
such function f^*, an enriched persistent barcode.
Dimension 0
In the case of dimension 0, persistent homology tracks the appearance and merging of
connected components. It is convenient to assign a smoothed 0-cocycle to a persistence interval
by assigning 1 to the nodes in the connected component associated with the interval right
before the generator is killed by merging with another connected component. This is
implemented with a union-find algorithm.
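A minimal union-find sketch of this bookkeeping is shown below; the edge list and node count are made up for illustration, and the rule that the root with the smaller index survives merely stands in for "the older component survives" in a real filtration.

```python
class UnionFind:
    """Minimal union-find used to track merging of connected components."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path compression
            i = self.parent[i]
        return i

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri != rj:
            self.parent[max(ri, rj)] = min(ri, rj)  # smaller root survives

# Edges sorted by filtration value (a made-up example): the component {2, 3}
# dies when edge (1, 2) merges it into the component of node 0. For each dying
# component we record its member nodes just before the merge.
uf = UnionFind(4)
dying = []
for u, v in [(0, 1), (2, 3), (1, 2)]:
    ru, rv = uf.find(u), uf.find(v)
    if ru != rv:
        young = max(ru, rv)                      # the younger component dies
        members = [i for i in range(4) if uf.find(i) == young]
        uf.union(u, v)
        dying.append((young, members))

assert dying[-1] == (2, [2, 3])   # the 0-cocycle assigns 1 to nodes 2 and 3
```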
4.2.4 Preprocessing of the input function
Given the original input function associated with the input data, we first need to generate
a cochain of the dimension of interest from this input function. The procedures for several
situations are discussed in the rest of this section.
Case 1
Suppose we are given a function f_0 : X_{k_0} → R and we are interested in its behavior
associated with a k-dimensional homology where k_0 ≠ k. We then need to interpolate or
extrapolate f_0 to a function f : X_k → R. We assume that f_0 is unoriented, i.e.,
f_0(σ) = f_0(−σ). A simple way is to take unweighted averages,

f_a(σ) = (1/n_σ) Σ_{i=1}^{n_σ} f_0(σ'_i), (4.11)

where each σ'_i is a k_0-simplex satisfying σ'_i < σ if k > k_0 and σ'_i > σ if k < k_0, and
n_σ is the total number of such k_0-simplices. A weighted version based on geometry can be
defined as

f_w(σ) = Σ_{i=1}^{n_σ} w_i f_0(σ'_i) / Σ_{i=1}^{n_σ} w_i, (4.12)

where w_i is the reciprocal of the distance between the barycenters of σ and σ'_i.
An example of this situation is the pairwise interaction strengths between atoms of a
molecule which are naturally deﬁned on edges connecting the vertices representing the atoms.
Another example is the atomic partial charges deﬁned on the vertices representing the atoms
in a molecule or a molecular complex.
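As a sketch of Eqs. (4.11) and (4.12) for the common case k_0 = 0 and k = 1 (a vertex function, such as partial charges, extended to edges), one might write the following; the function and variable names are illustrative assumptions.

```python
import numpy as np

def average_to_edges(f0, edges, coords=None):
    """Extend a vertex function f0 to edges by (weighted) averaging.

    Without coords this is the unweighted average of Eq. (4.11). With coords,
    weights are reciprocals of barycenter distances as in Eq. (4.12); for a
    vertex of an edge this distance is half the edge length.
    """
    values = []
    for i, j in edges:
        if coords is None:
            values.append(0.5 * (f0[i] + f0[j]))
        else:
            mid = 0.5 * (np.asarray(coords[i]) + np.asarray(coords[j]))
            wi = 1.0 / np.linalg.norm(np.asarray(coords[i]) - mid)
            wj = 1.0 / np.linalg.norm(np.asarray(coords[j]) - mid)
            values.append((wi * f0[i] + wj * f0[j]) / (wi + wj))
    return np.array(values)

# Hypothetical partial charges on three atoms, extended to two bonds.
charges = np.array([-0.5, 0.3, 0.2])
f_edges = average_to_edges(charges, [(0, 1), (1, 2)])
assert np.allclose(f_edges, [-0.1, 0.25])
```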
Case 2
Suppose we are given a function f_0 : R^n → R with n ≥ k and a geometric simplicial complex.
We can integrate f_0 over every k-simplex in X to obtain a function f_i : X_k → R. For
simplicity, we require f_0 to be bounded. Then, f_i is defined as

f_i(σ) = ∫_σ f_0 dσ / ∫_σ dσ, (4.13)

for a k-simplex σ, where ∫_σ dσ computes the k-dimensional volume of σ. In many cases, f_0
is given as the result of a numerical simulation and is defined on grid points. Then, the
integrals can be computed with a chosen quadrature formula by interpolating f_0 to the
collocation points.
4.2.5 Modiﬁed Wasserstein distance
An enriched bar can be represented by three elements: a birth value b, a death value d, and
the function f^* constructed by Eq. (4.10). Given two enriched barcodes of the same dimension,
represented by B = {(b_i, d_i, f^*_i)}_{i∈I} and B' = {(b'_j, d'_j, f'^*_j)}_{j∈J}, we would
like to compute a distance between them analogous to the Wasserstein distance. We first define
two pairwise distances: ∆_b, which measures the distance between two persistence bars,

∆_b([b, d), [b', d')) = max{|b − b'|, |d − d'|}, (4.14)

and ∆_f, which measures the distance between f^* and f'^*,

∆_f(f^*, f'^*) = | (1/(d − b)) ∫_b^d f^*(x) dx − (1/(d' − b')) ∫_{b'}^{d'} f'^*(x) dx |. (4.15)
In practice, it is sometimes too costly to compute the values of f^* for all possible
filtration values, and only a subset of the possible filtration values is selected, such as
only the middle value of a bar. In this case, we use the middle Riemann sum to approximate
the integrals in Eq. (4.15). For a bijection θ : Ī → J̄, where Ī and J̄ are subsets of I and J,
the associated penalties are defined as
P_b(θ; q, B, B') = Σ_{i∈Ī} (∆_b([b_i, d_i), [b'_{θ(i)}, d'_{θ(i)})))^q
                + Σ_{i∈I\Ī} (∆_b([b_i, d_i), [(b_i + d_i)/2, (b_i + d_i)/2)))^q
                + Σ_{j∈J\J̄} (∆_b([b'_j, d'_j), [(b'_j + d'_j)/2, (b'_j + d'_j)/2)))^q, (4.16)

and

P_f(θ; q, B, B') = Σ_{i∈Ī} (∆_f(f^*_i, f'^*_{θ(i)}))^q
                + Σ_{i∈I\Ī} (∆_f(f^*_i, 0))^q
                + Σ_{j∈J\J̄} (∆_f(f'^*_j, 0))^q. (4.17)
The q-th modified Wasserstein distance is defined as

W_{q,γ}(B, B') = inf_{θ∈Θ} (γ P_b(θ; q, B, B') + (1 − γ) P_f(θ; q, B, B'))^{1/q}, (4.18)

where γ is a weight parameter, and we denote the minimizer by θ_{q,γ}. Similarly to a receiver
operating characteristic curve, instead of fixing γ we let it vary from 0 to 1, which results
in a function W_q : [0, 1] → R^2 defined as

W_q(γ) = [P_b(θ_{q,γ}; q, B, B')^{1/q}, P_f(θ_{q,γ}; q, B, B')^{1/q}], (4.19)

which we call a Wasserstein characteristic curve.
The optimization problem can be considered as an assignment problem and solved by the
Hungarian algorithm. Given two enriched barcodes B = {(b_i, d_i, f^*_i)}_{i=1}^m and
B' = {(b'_j, d'_j, f'^*_j)}_{j=1}^n, we first construct a pseudo barcode for each of them
to account for the situation where a bar is not paired with another. The pseudo barcodes
are B_{B'} = {((b'_j + d'_j)/2, (b'_j + d'_j)/2, 0)}_{j=1}^n and
B'_B = {((b_i + d_i)/2, (b_i + d_i)/2, 0)}_{i=1}^m. Then, the assignment problem between
B ∪ B_{B'} and B' ∪ B'_B is solved with the pairwise cost (γ P_b + (1 − γ) P_f)^{1/q}. The
linear_sum_assignment tool in the optimize module of the SciPy package [97] is used.
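Under the augmented formulation described above, the matching can be sketched with scipy.optimize.linear_sum_assignment as follows. Here each f^*_i is collapsed to a single scalar (the middle-point approximation), q = 1 by default, and a pseudo bar is free to pair with any unmatched bar, which is a minor relaxation of Eqs. (4.16) and (4.17); all names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def modified_wasserstein(B, Bp, gamma=0.5, q=1):
    """Sketch of the modified Wasserstein distance of Eq. (4.18).

    B, Bp: lists of (birth, death, f_avg), where f_avg stands in for the
    average of f* over the bar (middle-point approximation).
    """
    def aug(bars):  # pseudo bars: zero-length bars at midpoints, f = 0
        return [((b + d) / 2, (b + d) / 2, 0.0) for b, d, _ in bars]

    rows = list(B) + aug(Bp)     # B ∪ B_{B'}
    cols = list(Bp) + aug(B)     # B' ∪ B'_B
    C = np.zeros((len(rows), len(cols)))
    for i, (b, d, f) in enumerate(rows):
        for j, (bp, dp, fp) in enumerate(cols):
            db = max(abs(b - bp), abs(d - dp))   # ∆_b of Eq. (4.14)
            df = abs(f - fp)                     # scalar stand-in for ∆_f
            C[i, j] = gamma * db**q + (1 - gamma) * df**q
    r, c = linear_sum_assignment(C)              # Hungarian-type matching
    return C[r, c].sum() ** (1 / q)

# Identical enriched barcodes are at distance zero.
assert modified_wasserstein([(0.0, 1.0, 0.2)], [(0.0, 1.0, 0.2)]) == 0.0
```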
4.3 Examples and results
4.3.1 A minimalist example
Consider a simplicial complex X with four vertices and four edges of unit length that
form a square, as shown in Figure 4.1. The 1-cochain ω̂ = [1, 0, 0, 0]^T is a real cocycle. The
notation means that ω̂(e_0) = 1 and ω̂(e_1) = ω̂(e_2) = ω̂(e_3) = 0. When a uniform weight
of 1 is assigned to all edges, the weighted Laplacian matrix L_W defined in Eq. (4.6) for X_1 is

L_W = (1/3) [  2  −1   0  −1
              −1   2  −1   0
               0  −1   2  −1
              −1   0  −1   2 ].

Then, we obtain a smoothed cocycle ω̄ = ω̂ + dᾱ = [0.5, 0.5, 0.5, 0.5]^T with the 0-cochain
ᾱ = [1, 0.5, 1, 1.5]^T, which minimizes ‖L_W(ω̂ + dᾱ)‖_2^2 to 0.
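This computation can be reproduced with a single least squares solve of Eq. (4.8); the edge orientations below (e_0 = [v_0, v_1], e_1 = [v_1, v_2], e_2 = [v_2, v_3], e_3 = [v_0, v_3]) are an assumption chosen to be consistent with the stated result.

```python
import numpy as np

# Coboundary d0 for the square: rows are the edges e0..e3 (assumed orientations),
# columns are the vertices v0..v3; (d0 a)(e) = a(head) - a(tail).
D = np.array([[-1,  1,  0,  0],
              [ 0, -1,  1,  0],
              [ 0,  0, -1,  1],
              [-1,  0,  0,  1]], dtype=float)

# Weighted graph Laplacian of Eq. (4.6) with unit edge weights.
L = (1 / 3) * np.array([[ 2, -1,  0, -1],
                        [-1,  2, -1,  0],
                        [ 0, -1,  2, -1],
                        [-1,  0, -1,  2]], dtype=float)

omega = np.array([1.0, 0.0, 0.0, 0.0])     # representative cocycle
# Eq. (4.8): minimize ||L(omega + D a)||_2^2 over 0-cochains a.
a, *_ = np.linalg.lstsq(L @ D, -L @ omega, rcond=None)
omega_bar = omega + D @ a

assert np.allclose(omega_bar, 0.5)          # matches [0.5, 0.5, 0.5, 0.5]
```

Note that ᾱ is only determined up to a constant shift, but the resulting smoothed cocycle ω̄ is unique here, since the kernel of L_W consists of the constant cochains.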
Figure 4.1: A simple example loop.[25]
4.3.2 Example datasets
In this section, we show the smoothed representative 1- and 2-cocycles and the enriched
barcodes using artificial datasets. We create some example input functions defined on the
nodes and aim to reflect the information about these functions in the enriched barcodes.
Two adjacent annuli
We first consider a point cloud sampled from two adjacent annuli with radii 1, centered
at (0, 0) and (2, 2), as shown in Figure 4.2. There are two significant H1 bars associated
with the two major circles. Examples of the representative cocycles for the two long H1
bars are shown in Figures 4.3a and b. The associated smoothed cocycles, obtained by the
method described in Section 4.2.3, are shown in Figures 4.3c and d.
Given two datasets with similar geometry but different values on the nodes, we can use
enriched barcodes to distinguish between them; see Figure 4.4 for an example.
Figure 4.2: a: A point cloud sampled from two adjacent annuli. b: The corresponding H1
barcode using the alpha complex filtration. [25]
Figure 4.3: Example of smoothed H1 cocycle.[25]
a and b: Two representative 1cocycles corresponding to the two long H1 bars. The edges where the cocycles
take nonzero values are drawn in red. c and d: The smoothed 1cocycles associated to the representative
cocycles. The edges where the cocycles take values with magnitudes greater than or equal to 0.035 are drawn
in blue. The smoothing is done on the subcomplexes associated to the ﬁltration values at the middle of the
corresponding bars.
Figure 4.4: a and b: Two datasets with similar geometry but diﬀerent information given on
the nodes. c and d: The diﬀerences are revealed in the enriched H1 barcodes.[25]
Cuboid minus two balls
In this example, the object considered is a rectangular cuboid ([0, 4] × [0, 2] × [0, 2])
with two balls of radius 0.5, centered at (1, 1, 1) and (3, 1, 1), removed. Two thousand
points are first sampled from a uniform distribution over the cuboid, and those that fall
inside the balls are deleted. The dataset with values on the points, the two smoothed cocycles
corresponding to the two voids, and the enriched barcode are shown in Figure 4.5.
Figure 4.5: Persistent cohomology enriched barcode example of data points sampled from
porous cuboid.[25]
a: The points sampled from an object that is a box with two balls removed. b: The H2 enriched barcode
showing the two voids in the blue and red regions of the original dataset. c and d: The two smoothed
2cocycles. The faces where the cocycles take absolute values greater than or equal to 0.01 are drawn in
blue. The smoothing is done on the subcomplexes associated to the ﬁltration values at the middle of the
corresponding bars.
Figure 4.6: D3 dataset sampled from an annulus with randomly assigned values on the points
and corresponding H1 enriched barcode.
4.3.3 Wasserstein distance based similarity
We illustrate in this section the measurement of similarity among persistent cohomology
enriched barcodes. We use the enriched barcodes from three datasets, D1, D2, and D3. Here,
D1 and D2 are the two datasets shown in Fig. 4.4a and b, respectively, while D3 is shown
in the left chart of Fig. 4.6. The Wasserstein characteristic curves defined in Eq. (4.19)
for the three datasets are shown in Fig. 4.7. Here, D1 and D2 have the same geometry, and
thus their curve lies more to the left, which means a smaller distance between their
persistent homology barcodes. On the other hand, D3 has a value assignment on the points
similar to that of D2, so their curve lies near the bottom, which means a smaller distance
in the non-geometric information.
Figure 4.7: Wasserstein characteristic curves.
4.3.4 Analysis of molecules
Cyclic and cage-like structures often exist in complicated macromolecules at various scales.
They can be as small as benzene (a ring) containing 6 heavy atoms or adamantane (a cage)
containing 10 heavy atoms, and some macromolecules, such as buckminsterfullerene and
carbon nanotubes, consisting of tens or hundreds of atoms, have a global configuration of
cyclic or cage-like structures. Persistent homology is good at detecting these structures
at multiple scales, and when we label the atoms by their element types, we can also reveal
the element composition of the detected structures. Specifically, if oxygen is of interest,
we construct an input function f_0 (see Section 4.2.4) that is defined on the nodes
representing the atoms and outputs 1 on oxygen atoms and 0 elsewhere. We illustrate this
application in this section using a cyclic structure, cucurbit[8]uril, and a cage-like
structure, a B24N24 cage.
Cucurbituril
In this example, we consider the macrocyclic molecule cucurbit[8]uril from the cucurbituril
family. The molecule contains eight 6-membered rings and sixteen 5-membered rings
consisting of carbon and nitrogen atoms. The rings form a large cyclic structure with a
relatively tighter opening surrounded by oxygen atoms. The structure is taken from the one
provided in the SAMPL6 challenge [1], and the resulting H1 barcodes are shown in Figure 4.8.
Figure 4.8: a: The cucurbit[8]uril molecule viewed from two different angles. The hydrogen,
carbon, nitrogen, and oxygen atoms are colored in white, grey, blue, and red. b, c, and
d: The H1 enriched barcodes obtained by assigning 1 to nodes of the selected atom types
(carbon, nitrogen, and oxygen) and 0 elsewhere. [25]
Boron nitride cage
The fullerene-like boron nitride cages exhibit spherical structures similar to fullerenes
but consist of boron and nitrogen atoms. The global spherical structure is composed of a
collection of local rings containing several atoms. A possible structure of the B24N24 cage,
given in the supporting information of [209], is used in this example. The molecule and the
enriched barcode are shown in Figure 4.9.
Figure 4.9: a: A structure of the B24N24 cage. The nitrogen and boron atoms are colored
in blue and grey. b: The enriched barcodes obtained by assigning 1 to boron atoms and 0
elsewhere. The H1 and H2 barcodes are plotted in the bottom and top panels. [25]
In this application, the element type could be substituted with other information that
the user is interested in, such as partial charge, van der Waals potential, or electrostatic
solvation free energy.
4.3.5 An application to proteinligand binding
An important component of computer-aided drug design is the prediction of protein-ligand
binding affinity from given protein-ligand complex structures. Persistent homology is good
at identifying rings, tunnels, and cavities at various scales, which are crucial to
protein-ligand complex stability and instability. In addition to geometry and topology,
chemical and biological complexity also needs to be addressed toward a practically useful
method for this application. To this end, for example, the behavior of atoms of different
element types can be described by computing persistent homology for subsets of the atoms
of the molecule of certain element types. The interaction between protein and ligand can be
emphasized by prohibiting an edge from forming between two atoms that are both in the
protein or both in the ligand. And the electrostatic interactions can be revealed by tweaking
the distance matrix used for filtration to be the interaction strength computed with a
chosen physical model such as Coulomb's law. However, the approaches described above
disturb the original geometry and topology of the protein-ligand complexes. With the method
proposed in this work, we are able to naturally embed information such as atom types,
atomic partial charges, and electrostatic interactions into the barcodes without disturbing
the original geometric and topological setup of the molecular systems.
We compute the enriched barcodes for protein-ligand complexes, turn them into structured
features, and combine them with machine learning methods for the prediction of binding
affinity. The procedure is validated on datasets from the PDBBind database [119], which
includes experimentally derived protein-ligand complex structures and the associated binding
affinities. An example of an enriched barcode for atomic partial charges is shown in Fig. 4.10.
Enriched barcode generation
In addition to the traditional barcode obtained from the persistent homology computation,
we would also like to add descriptions of the electrostatic properties of the system. An
efficient characterization of this property is the Coulomb potential, where the interaction
between two point charges is described, up to a constant, by q_i q_j / r_ij, where q_i and
q_j are the point charges and r_ij is the distance between them. The atomic partial charges
of proteins are assigned by using the PDB2PQR software [58] with the CHARMM22 force field.
Two types of constructions of the physical information are used to characterize the systems.
For dimension 0, a collection of subsets of atoms is first identified according to atom
type. Specifically, 10 element types (C, N, O, S, P, F, Cl, Br, I, H) are considered for
ligands, 5 element types (C, N, O, S, H) are considered for proteins, and a total of 50
subsets of atoms are selected by choosing one element type from each component (protein
or ligand).
Figure 4.10: Enriched barcodes focusing on atomic partial charges. [25]
a: Ligand (as van der Waals spheres) and surrounding protein atoms (within 12 Å of the ligand, as thick sticks)
of PDB entry 1a94. The color reflects the atomic partial charges. b, c, and d: Enriched barcodes for partial
charges generated by computing persistent cohomology with the alpha complex filtration on all heavy atoms,
all carbon atoms, and the nitrogen and oxygen atoms, respectively. The top panel shows the H2 barcode and
the bottom one shows the H1 barcode.
The pairwise distance matrix based on Euclidean distance is tweaked by setting distances
between atoms both from protein or ligand to inﬁnity which emphasizes the interactions
between protein and ligand. Based on the tweaked distance matrix, persistent (co)homology
computation with Rips complex is performed. The electric potential is computed for each
atom with its nearest neighbor in the diﬀerent part of the proteinligand complex and is put
on this atom as the additional information. We deﬁne the input function f 0
0 : X0 → R to
take 0 on protein atoms and to take the value discussed above on ligand atoms. The average
potential over ligand atoms in each 0cocycle representative is used to generate features. In
this way, the favorability of the protein ligand electrostatic interactions is explicitly described.
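The distance-matrix construction above can be sketched directly. This is a minimal illustration with hypothetical coordinates, not the dissertation's pipeline; it also exploits the fact that the death values of the 0-dimensional Rips barcode are exactly the edge weights of a minimum spanning tree, so no persistence package is needed for this piece.

```python
import numpy as np

def bipartite_distance_matrix(protein_xyz, ligand_xyz):
    """Pairwise Euclidean distances with intra-protein and intra-ligand
    entries set to infinity, so only protein-ligand contacts drive the
    Rips filtration (the 'tweaked' matrix described in the text)."""
    pts = np.vstack([protein_xyz, ligand_xyz])
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    n_p = len(protein_xyz)
    d[:n_p, :n_p] = np.inf   # suppress protein-protein contacts
    d[n_p:, n_p:] = np.inf   # suppress ligand-ligand contacts
    return d

def h0_death_values(dist):
    """Death values of the 0-dimensional Rips barcode: the edge weights of
    a minimum spanning tree (Kruskal's algorithm with union-find). One
    component never dies and yields an infinite bar, omitted here."""
    n = len(dist)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    edges = sorted((dist[i, j], i, j)
                   for i in range(n) for j in range(i + 1, n)
                   if np.isfinite(dist[i, j]))
    deaths = []
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # merging two components: a bar dies at w
            parent[ri] = rj
            deaths.append(w)
    return deaths
```

In the full method these death values are what gets binned into the subintervals described in the featurization section below; a real run on protein-ligand complexes would typically call a persistence package such as Ripser instead.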
For dimensions 1 and 2, the input function f_0^1 : X_1 → R is defined to output the absolute value of the electric potential on an edge connecting two atoms, to characterize the interaction strength. The Coulomb potential is modeled as

E_ij = k_e q_i q_j / r_ij,

where k_e is Coulomb's constant, q_i and q_j are the partial charges of atoms i and j, and r_ij is the distance between the two atoms. Persistent (co)homology with the alpha complex is computed on three subsets of the protein-ligand complex: all heavy atoms, all carbon atoms, and all oxygen/nitrogen atoms. For simplicity, all enriched barcodes are computed only at the middle points of the bars.
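The edge function f_0^1 is simple enough to state in one line; this helper is a hypothetical illustration (the name and the default k_e = 1 are placeholders, and the actual partial charges come from PDB2PQR/CHARMM22 as described above):

```python
# Input function f_0^1 on an edge: the absolute Coulomb interaction
# between the two endpoint atoms. k_e is kept symbolic (1.0 here);
# units depend on the charge/distance conventions of the pipeline.
def edge_potential(q_i, q_j, r_ij, k_e=1.0):
    return abs(k_e * q_i * q_j / r_ij)
```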
Featurization of barcodes
Given an enriched barcode B = {(b_i, d_i, f*_i)}_{i∈I}, obtained by applying the proposed method to a dataset with an input function f_0 (see Section 4.2.4), we turn it into the fixed-shape array required by the machine learning algorithms we choose. Here, the input function is f_0^0 or f_0^1, described in the previous section, when computing 0-dimensional or higher-dimensional persistent (co)homology, respectively.
For dimension 0, we first identify a range of scales to focus on; in this application, we are interested in the interval [0, 12) Å. The interval is then divided into 6 subintervals {[l^0_j, r^0_j)}_j = {[0, 2.5), [2.5, 3), [3, 3.5), [3.5, 4.5), [4.5, 6), [6, 12)} to address different types of interactions. For dimension 0, we are interested in the death values of the bars. Therefore, a collection of index sets marking the bars whose death values fall into each subinterval is calculated as

I^0_j = { i ∈ I | l^0_j ≤ d_i < r^0_j }.
For dimensions 1 and 2, we are interested in the interval [0, 6) Å with the alpha complex filtration. The interval is then divided into 6 equal-length subintervals {[l^{1,2}_j, r^{1,2}_j)}_j. We then define a collection of index sets marking the bars that overlap with each subinterval,

I^{1,2}_j = { i ∈ I | b_i < r^{1,2}_j, d_i ≥ l^{1,2}_j }.
Given a collection of index sets {I_j}_j, a feature vector v_h(B) is defined as

( v_h(B) )_j = |I_j|.

When {I^0_j}_j is used, it characterizes the number of component-merging events in each filtration parameter interval. When {I^{1,2}_j}_j is used, it reflects the ranks of the homology groups at certain stages along the course of the filtration.
A feature vector v_f(B, f_0) can be generated subsequently to address the information of the predefined function on the homology generators,

( v_f(B, f_0) )_j = ( Σ_{i ∈ I_j} f̄*_i ) / |I_j|,

where f̄*_i = ( ∫_{b_i}^{d_i} f*_i(x) dx ) / (d_i − b_i).
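The two feature maps can be sketched as follows. This is an illustrative implementation, not the dissertation's code: each bar is represented as a (birth, death, fbar) triple, where fbar stands for the already-averaged enriched value f̄*_i, and the interval lists mirror the subintervals given above.

```python
import numpy as np

# Subintervals for the dimension-0 features, in Å, as given in the text.
DIM0_BINS = [(0.0, 2.5), (2.5, 3.0), (3.0, 3.5),
             (3.5, 4.5), (4.5, 6.0), (6.0, 12.0)]

def features_dim0(bars, bins=DIM0_BINS):
    """v_h counts bars whose death value falls in each subinterval;
    v_f averages the enriched values fbar over the same index sets."""
    vh, vf = [], []
    for l, r in bins:
        hits = [fbar for (b, d, fbar) in bars if l <= d < r]
        vh.append(len(hits))
        vf.append(sum(hits) / len(hits) if hits else 0.0)
    return np.array(vh), np.array(vf)

def features_dim12(bars, lo=0.0, hi=6.0, nbins=6):
    """For dimensions 1 and 2, a bar contributes to every subinterval it
    overlaps: b_i < r_j and d_i >= l_j."""
    edges = np.linspace(lo, hi, nbins + 1)
    vh, vf = [], []
    for l, r in zip(edges[:-1], edges[1:]):
        hits = [fbar for (b, d, fbar) in bars if b < r and d >= l]
        vh.append(len(hits))
        vf.append(sum(hits) / len(hits) if hits else 0.0)
    return np.array(vh), np.array(vf)
```

The concatenation of v_h and v_f over the chosen atom subsets yields the fixed-shape feature array fed into the learning model of the next section.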
Machine learning model
The application of predicting protein-ligand binding affinity based on structures can be regarded as a supervised learning problem. Generally speaking, we are given a collection of pairs of input and output {(x_i, y_i)}, and there is a chosen model, a function M(x; θ) with tunable parameters θ. The training process is to find a specific setting for the function M that globally or locally minimizes a penalty function which depends on the given data {(x_i, y_i)} and the parameter set θ. Once trained, the model can be used to predict the output for a newly given input. We choose the gradient boosting trees (GBT) method for its accuracy, robustness, and efficiency. GBT is an ensemble-of-trees method with single decision trees as building blocks. The training of a GBT model is done by adding one tree at a time so as to reduce the loss of the current model. In practice, different randomly selected subsets of the training data and features are used for each update of the model to reduce overfitting.
For every result reported in Table 4.2, a parameter search is done by cross-validation within the training set, where model performance is judged by Pearson's correlation coefficient. The candidate values tried for the hyperparameters are summarized in Table 4.1. Another hyperparameter, max_features, is set to sqrt because of the relatively large number of features. The GradientBoostingRegressor module in the scikit-learn (version 0.17.1) [152] software is used.
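As a rough illustration of this training setup (not the dissertation's actual pipeline: the synthetic data, the reduced hyperparameter grid, and the random seeds are placeholders, and a modern scikit-learn import path is used), a cross-validated search scored by Pearson's correlation might look like:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in data: rows are feature vectors (e.g. v_h and v_f
# concatenated over atom subsets), targets are binding affinities.
rng = np.random.RandomState(0)
X = rng.rand(200, 12)
y = 2.0 * X[:, 0] - X[:, 3] + 0.1 * rng.randn(200)

# A reduced version of the Table 4.1 grid so the sketch runs quickly;
# the dissertation searches much larger values (n_estimators up to 20000).
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [4, 8],
    "learning_rate": [0.01, 0.1],
    "subsample": [0.5],
}

def pearson(y_true, y_pred):
    return np.corrcoef(y_true, y_pred)[0, 1]

search = GridSearchCV(
    GradientBoostingRegressor(max_features="sqrt", random_state=0),
    param_grid,
    # score by Pearson's r, as stated in the text
    scoring=lambda est, Xv, yv: pearson(yv, est.predict(Xv)),
    cv=3,
)
search.fit(X, y)
model = search.best_estimator_
```

After the search, `model` is refit on the full training set and used to predict affinities for held-out complexes.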
Application
We test the improvement from the enriched barcodes with electrostatic information in the cases of the 0th dimension and higher dimensions using the PDBBind database. The predictor performance is improved by using the enriched barcodes embedding the electrostatics information. The results are listed in Table 4.2.

Hyperparameter       Candidates
n_estimators         5000, 10000, 20000
max_depth            4, 8, 16
min_samples_split    5, 10, 20
learning_rate        0.0025, 0.005, 0.01
subsample            0.25, 0.5, 0.75
min_samples_leaf     1, 3

Table 4.1: Candidate values for hyperparameters of the gradient boosting trees model.

Year   Dim 0 w/o elec.   Dim 0 w. elec.   Dim 1&2 w/o elec.   Dim 1&2 w. elec.
2007   0.802 (1.47)      0.796 (1.50)     0.726 (1.65)        0.738 (1.64)
2013   0.754 (1.56)      0.768 (1.53)     0.706 (1.67)        0.734 (1.60)
2015   0.745 (1.56)      0.763 (1.53)     0.718 (1.62)        0.737 (1.59)
2016   0.824 (1.32)      0.833 (1.31)     0.767 (1.46)        0.778 (1.44)

Table 4.2: The predictor performance is evaluated by training on the PDBBind refined set excluding the core set and testing on the core set of a given year's version. The median Pearson's correlation coefficient (root-mean-squared error in parentheses) over 10 repeated experiments is reported.
4.4 Discussion and conclusion
Utilizing the richer information carried by cohomology, we introduce a method that reflects in the barcodes the additional information from the dimensions that are not used for the persistent homology computation. This is achieved by finding a smoothed representative cocycle with respect to a Laplacian defined directly on the simplicial complexes or a weighted graph Laplacian. The smoothed cocycles then serve as measures on the simplicial complexes and allow us to integrate the additional information. As a result, in addition to the original persistence barcodes, functions of filtration values associated to each persistence pair are constructed, which enriches the information carried by the barcodes. A similarity score based on the Wasserstein distance is introduced to analyze these enriched barcodes. Returning to the motivating problem, that physical properties should be embedded in the persistence barcodes to better describe biomolecules, the method is shown to improve performance in the practical task of protein-ligand binding affinity prediction by adding electrostatics information to the barcodes.
This method is potentially useful for a wider range of applications where data come with multiple heterogeneous dimensions. For example, when analyzing a time series dataset in 3-dimensional space using persistent homology, a specific treatment such as Vineyards [45] is used instead of directly doing the computation in R^4. Computing persistence over multiple dimensions at the same time [32] also helps to address this general situation. For one specific dimension of a multidimensional dataset, there are also cases where we would like to embed the information carried in this dimension into the persistence barcodes computed for the other dimensions, rather than looking at the persistence for this dimension itself. For example, persistent homology can find loops and voids in biomolecular structures, and we may be interested in what kinds of charges these homology generators carry. In this case, the duality between homology and cohomology enables us to better localize the homology generators and to examine the charge distributions associated to each generator.
Chapter 5

Evolutionary homology for coupled dynamical systems

5.1 Introduction
In addition to analyzing static structures of biomolecules, we are also interested in analyzing the dynamics of biomolecular systems, which are often related to important biomolecular properties such as stability and instability. The time evolution of complex phenomena is often described by dynamical systems, i.e., mathematical models built on differential equations
for continuous dynamical systems or on diﬀerence equations for discrete dynamical systems
[197, 148, 92, 192]. Most dynamical systems have their origins in Newtonian mechanics.
However, these mathematical models typically only admit highly reduced descriptions of the
original complex physical systems, and thus their continuous forms do not have to satisfy the
Euler-Lagrange equation of the least action principle. Although a low-dimensional dynamical
system is not expected to describe the full dynamics of a complex physical system, its long-term behavior, such as the existence of steady states (i.e., fixed points) and/or chaotic states, offers a qualitative understanding of the underlying system. Focused on ergodic systems, dynamic mappings, bifurcation theory, and chaos theory, the study of dynamical systems is a mathematical subject in its own right, drawing on analysis, geometry, and topology. Dynamical systems are motivated by real-world applications and have a wide range of applications in physics, chemistry, biology, medicine, engineering, economics, and finance. Nevertheless,
essentially all of the analyses in these applications are qualitative and phenomenological in
nature.
In order to pass from qualitative to quantitative evaluation of these systems, we look
to the newly emerging ﬁeld of topological data analysis (TDA) [28, 61, 79, 98, 81, 134].
Specifically, we use persistent homology, which provides multiscale topological characterization of datasets. The use of homology for the analysis of dynamical systems and time series
analysis predates and intertwines with the beginnings of persistent homology [98, 126, 76, 3,
164, 163, 162]. More recently, there has been increased interest in the combination of persistent homology with time series analysis [165]. Some common methods include computing
the persistent homology of the Takens embedding [157, 156, 154, 103, 102, 104], studying
the sublevel set homology of movies [106, 178], and working with the additional structure afforded by persistent cohomology [54, 18, 181]. Wang and Wei have defined temporal persistent homology over the solution of a partial differential equation derived from differential
geometry [186]. This method encodes spatial connectivity into temporal persistence in the
Laplace-Beltrami flow, and offers accurate quantitative prediction of fullerene isomer stability in terms of total curvature energy for over 500 fullerene molecules. Closely related to our
work, Stolz et al. have recently constructed persistent homology from time-dependent functional networks associated with coupled time series [174]. This work uses a weight rank clique
ﬁltration over a deﬁned parameter reﬂecting similarities between trajectories to characterize
coupled dynamical systems.
The objective of the present work is to (1) deﬁne a new simplicial complex ﬁltration using
a coupled dynamical system as input, which encodes the time evolution and synchronization
of the system, and (2) use the persistent homology of this ﬁltration to study the system
itself. The resulting persistence barcode is what we call the evolutionary homology (EH)
barcode. We are particularly interested in the encoding and decoding of the topological
connectivity of a real physical system into a dynamical system. To this end, we regulate
the dynamical system by a generalized graph Laplacian matrix deﬁned on a physical system
with distinct topology. As such, the regulation encodes the topological information into
the time evolution of the dynamical system. We use a wellstudied dynamical system, the
Lorenz system, to illustrate our EH formulation. The Lorenz attractor is utilized to facilitate
the control and synchronization of chaotic oscillators by weighted graph Laplacian matrices
generated from protein Cα networks. We create features from the EH barcodes originating from protein networks by using the Wasserstein and bottleneck metrics. The resulting
outputs in various topological dimensions are directly correlated with physical properties
of protein residue networks. Finally, to demonstrate the quantitative analysis power of the
proposed EH, we apply the present method to the prediction of protein thermal ﬂuctuations
characterized by experimental B-factors. We show that the present EH provides some of the most accurate B-factor predictions for a set of 364 proteins. Our approach not only provides
a new tool for quantitatively analyzing the behavior of dynamical systems but also extends
the utility of dynamical systems to the quantitative modeling and prediction of important
physical/biological problems.
5.2 Methods
This section is devoted to the methods and algorithms. In Sec. 5.2.1, we give a brief discussion
of coupled dynamical systems and their stability control via a correlation (coupling) matrix
which embeds topological connectivity of a physical system into the dynamical system. For
background of persistent homology theory, we refer readers to Section 2.1. A concept of
topological learning is given in Section 5.2.1.3. We then deﬁne evolutionary homology on
coupled dynamical systems in Section 5.2.2. Finally, the full pipeline as applied to protein
ﬂexibility analysis is outlined in Section 5.2.3.
5.2.1 Coupled dynamical systems
The general control of coupled dynamical systems has been well-studied in the literature
[148, 92, 192, 197]. A brief review is given in this section.
5.2.1.1 Oscillators and coupling
We consider the coupling of N n-dimensional dynamical systems,

du_i/dt = g(u_i),  i = 1, 2, ..., N,

where u_i = {u_{i,1}, u_{i,2}, ..., u_{i,n}}^T is a column vector of size n. In our setup, each u_i is associated to a point r_i ∈ R^d, which will be used to determine influence in the coupling.
The coupling of the systems can be very general, but a specific selection can be an N × N graph Laplacian matrix A defined for pairwise interactions,

A_ij = I(i, j) for i ≠ j,  and  A_ii = −Σ_{l ≠ i} A_il,

where I(i, j) is a value describing the degree of influence on the i-th system induced by the j-th
system. We assume undirected graph edges, I(i, j) = I(j, i). Let u = {u_1, u_2, ..., u_N}^T be a column vector with u_i = {u_{i,1}, u_{i,2}, ..., u_{i,n}}^T. The coupled system is an (N × n)-dimensional dynamical system modeled as

du/dt = G(u) + ε (A ⊗ Γ) u,    (5.1)

where G(u) = {g(u_1), g(u_2), ..., g(u_N)}^T, ε is a coupling strength parameter, and Γ is an n × n predefined linking matrix.
One choice of g is the Lorenz oscillator defined as

g(u_i) = ( δ(u_{i,2} − u_{i,1}),  u_{i,1}(γ − u_{i,3}) − u_{i,2},  u_{i,1} u_{i,2} − β u_{i,3} )^T,    (5.2)

where δ, γ, and β are parameters determining the state of the Lorenz oscillator. This system is used in this work because of its relative simplicity, rich dynamics, and well-understood behavior.
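To make Eqs. (5.1) and (5.2) concrete, here is a minimal sketch of the coupled Lorenz system integrated with the fourth-order Runge-Kutta scheme used later in this chapter. The two-oscillator setup and the parameter values in the usage below are illustrative, not taken from the dissertation's experiments.

```python
import numpy as np

def lorenz(u, delta=10.0, gamma=60.0, beta=8.0 / 3.0):
    """Right-hand side g(u_i) of Eq. (5.2) for one 3-dimensional oscillator."""
    x, y, z = u
    return np.array([delta * (y - x),
                     x * (gamma - z) - y,
                     x * y - beta * z])

def coupled_rhs(U, A, eps, Gamma):
    """du/dt = G(u) + eps (A ⊗ Γ) u of Eq. (5.1); U has shape (N, 3).
    Row i of A @ U @ Gamma.T equals Γ · Σ_j A_ij u_j, matching the
    Kronecker-product coupling."""
    G = np.array([lorenz(u) for u in U])
    return G + eps * A @ U @ Gamma.T

def rk4_step(U, h, A, eps, Gamma):
    """One fourth-order Runge-Kutta step of size h."""
    k1 = coupled_rhs(U, A, eps, Gamma)
    k2 = coupled_rhs(U + 0.5 * h * k1, A, eps, Gamma)
    k3 = coupled_rhs(U + 0.5 * h * k2, A, eps, Gamma)
    k4 = coupled_rhs(U + h * k3, A, eps, Gamma)
    return U + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
```

With a sufficiently strong coupling (eps large relative to the Lyapunov exponent of a single oscillator), two oscillators started at different initial values converge to a common chaotic trajectory, which is the synchronization phenomenon discussed next.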
5.2.1.2 Stability and controllability
Let s(t) satisfy ds/dt = g(s). We say the coupled systems are in a synchronous state if

u_1(t) = u_2(t) = ··· = u_N(t) = s(t).
The stability can be analyzed using v = {u_1 − s, u_2 − s, ..., u_N − s}^T, with the following equation obtained by linearizing Eq. (5.1),

dv/dt = [ I_N ⊗ Dg(s) + ε (A ⊗ Γ) ] v,    (5.3)

where I_N is the N × N identity matrix and Dg(s) is the Jacobian of g at s.
The stability of the synchronous state in Eq. (5.3) can be studied by eigenvalue analysis of the graph Laplacian A. Since the graph Laplacian A of an undirected graph is symmetric, it only admits real eigenvalues. After diagonalizing A as

A φ_j = λ_j φ_j,  j = 1, 2, ..., N,

where λ_j is the j-th eigenvalue and φ_j is the j-th eigenvector, v can be represented by

v = Σ_{j=1}^{N} w_j(t) φ_j.
Then, the original problem on the coupled systems of dimension N × n can be studied independently on the n-dimensional systems

dw_j/dt = ( Dg(s) + ε λ_j Γ ) w_j,  j = 1, 2, ..., N.    (5.4)
Let L_max be the largest Lyapunov characteristic exponent of the j-th system governed by Eq. (5.4). It can be decomposed as L_max = L_g + L_c, where L_g is the largest Lyapunov exponent of the system ds/dt = g(s) and L_c depends on ε, λ_j, and Γ. In many numerical experiments in this work, we set Γ = I_n, an n × n identity matrix. Then the stability of
Figure 5.1: a: Chaotic trajectory of one oscillator without coupling. b: The 70 synchronized oscillators associated with the Cα carbon atoms of protein PDB:1E68 are plotted together. [22]
the coupled systems is determined by the second largest eigenvalue λ_2. The critical coupling strength ε_0 can, therefore, be derived as ε_0 = L_g / (−λ_2). A requirement for the coupled systems to synchronize is ε > ε_0, while ε ≤ ε_0 causes instability.
An example of chaos controlled by coupling is shown in Fig. 5.1. In this example, each alpha carbon atom (Cα) of protein PDB:1E68 is associated with a Lorenz oscillator, and the underlying locations of the oscillators are used to construct the coupling matrix. The specific coupling matrix A = A^geo + A^seq used in this example is a sum of a graph Laplacian matrix defined using the geometric coupling,

A^geo_ij = −1 if i ≠ j and d^org_ij < d_c,  and  A^geo_ii = −Σ_{l ≠ i} A^geo_il,
and another which takes the amino acid sequence into account,

A^seq_ij = ε_seq if (i + 1 + N) mod N = j,  A^seq_ij = −ε_seq if (i − 1 + N) mod N = j,  and A^seq_ij = 0 otherwise.
Here, d^org is the distance function in the original space, that is, the Euclidean distance between atoms in this example. The mod operator is used because the protein in this example is circular. The parameters used for the example of Fig. 5.1 are ε_seq = 0.7 for the sequence coupling, d_c = 4 Å for the spatial cutoff, and δ = 10, γ = 60, and β = 8/3 for the Lorenz system. The parameters in Eq. (5.1) are ε = 10 and

Γ = [ 1 0 0 ; 0 0 0 ; 0 0 0 ].

Initial values for all oscillators are randomly chosen.
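The example coupling matrix A = A^geo + A^seq can be sketched as follows. The function name is hypothetical, and the symbols eps_seq and d_c follow the reconstruction of the partly lost notation above, so treat them as assumptions.

```python
import numpy as np

def example_coupling(coords, d_c, eps_seq):
    """A = A_geo + A_seq from the Fig. 5.1 example: a geometric graph
    Laplacian with cutoff d_c plus a circular sequence coupling."""
    N = len(coords)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # geometric part: -1 for spatially close pairs; diagonal balances the row
    A_geo = np.where(dist < d_c, -1.0, 0.0)
    np.fill_diagonal(A_geo, 0.0)
    A_geo[np.diag_indices(N)] = -A_geo.sum(axis=1)
    # sequence part: circular neighbors get +/- eps_seq (the protein in
    # the example is circular, hence the mod arithmetic)
    A_seq = np.zeros((N, N))
    for i in range(N):
        A_seq[i, (i + 1) % N] = eps_seq
        A_seq[i, (i - 1) % N] = -eps_seq
    return A_geo + A_seq
```

Both summands have zero row sums, so the combined matrix preserves the synchronized state u_1 = ... = u_N as an invariant manifold of Eq. (5.1).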
5.2.1.3 Topological learning
The proposed method provides a vast but relatively abstract characterization of the objects of interest. It is potentially powerful in quantitative analysis, but is difficult to use with out-of-the-box machine learning or data analysis techniques. In regression analysis or the training part of supervised learning, with B_i being the collection of sets of barcodes corresponding to the i-th entry in the training data, the problem can be cast into the following minimization problem,

min_{θ_b ∈ Θ_b, θ_m ∈ Θ_m} Σ_{i ∈ I} L( y_i, F(B_i; θ_b); θ_m ),

where L is a scalar loss function, y_i is the target value of the i-th entry in the training set, F is a function that maps barcodes to suitable input for the learning models, and θ_b and θ_m are the parameters to be optimized within the search domains Θ_b and Θ_m, respectively. The form of the loss function also depends on the choice of metric and the machine learning/regression model.
A function F which translates barcodes into a structured representation (tensors with fixed dimension) can be used with popular machine learning models, including random forests, gradient boosting trees, and deep neural networks. Another popular class of models is kernel-based models, which depend on an abstract measurement of the similarity or distance between the entries.
Our choices for F, defined in Eq. (5.9) of Sec. 5.2.3, will arise from looking at the distance from the specified barcode to the empty barcode, and there is no tuning of θ_b. In Sec. 5.3.2, where we quantitatively analyze protein residue flexibility, we evaluate our method by checking the correlation between each topological feature defined in Eq. (5.9) and the experimental value (blind prediction), as well as the correlation between the output of a linear regression with multiple topological features and the experimental value (regression). In the former case, there is no parameter to be optimized, while in the latter case, the specific minimization problem can be written as
min_{θ_m ∈ R^{n+1}} Σ_{i ∈ I} ( y_i − ⟨ EH^{p_1,1}_i, ..., EH^{p_n,n}_i, 1 ⟩ · θ_m )²,
where EH^{p_k,k}_i is the topological feature obtained by computing the p_k-Wasserstein distance from the empty set to the k-th barcode associated with the EH computation of the i-th protein residue (node), I is the set of indices of all residues in the protein, and y_i is the experimental B-factor of the i-th protein residue, which quantitatively reflects flexibility.
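The latter minimization is an ordinary least-squares problem once the EH features are computed. A minimal numpy sketch, with a hypothetical feature matrix (one row per residue, one column per (p_k, k) pair) and an appended constant column for the intercept; `fit_bfactors` is not a name from the dissertation:

```python
import numpy as np

def fit_bfactors(eh_features, bfactors):
    """Solve min_theta sum_i (y_i - <features_i, 1> . theta)^2 with
    numpy's least-squares solver; returns theta and the fitted values."""
    X = np.hstack([eh_features, np.ones((len(eh_features), 1))])
    theta, *_ = np.linalg.lstsq(X, bfactors, rcond=None)
    return theta, X @ theta
```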
5.2.2 Evolutionary homology (EH) and the EH barcodes
Consider a system of N not-yet-synchronized oscillators {u_1, ..., u_N} associated to a collection of N embedded points {r_1, ..., r_N} ⊂ R^d. We assume the global synchronized state is a periodic orbit denoted s(t) for t ∈ [t_0, t_1], where s(t_0) = s(t_1). For flexibility and generality, we work on post-processed trajectories obtained by applying a transformation function to the original trajectories, û_i(t) := T(u_i(t)). The choice of the function T is flexible and should fit the application; in this work, we choose

T(u_i(t)) = min_{t′ ∈ [t_0, t_1]} ‖ u_i(t) − s(t′) ‖_2,    (5.5)

which gives 1-dimensional trajectories for simplicity. Further, in our specific example, ŝ(t) := T(s(t)) = 0, but, again, this is not necessary in general.
We wish to study the effects on the synchronized system of N oscillators (an (N × 3)-dimensional system) after perturbing one oscillator of interest. To this end, we set the initial values of all the oscillators except the i-th oscillator to s(t̄) for a fixed t̄ ∈ [t_0, t_1]. The initial value of the i-th oscillator is set to ρ(s(t̄)), where ρ is a predefined function playing the role of introducing a perturbation to the system. After the system starts running, some oscillators will be dragged away from and then go back to the periodic orbit as the perturbation is propagated and relaxed through the system. Let û^i_j(t) denote the modified trajectory of the j-th oscillator after perturbing the i-th oscillator at t = 0. We focus on the subset of nodes that are affected by the perturbation,

V^i = { n_j | max_{t>0} min_{t′ ∈ [t_0, t_1]} ‖ û^i_j(t) − ŝ(t′) ‖_2 ≥ ε_p },

for some fixed ε_p determining how much deviation from synchronization constitutes "being affected".
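The set V^i is a direct thresholding of the transformed trajectories. A small sketch under the example's convention ŝ = 0 (so deviation from synchronization is just the absolute trajectory value); the function name is illustrative:

```python
import numpy as np

def affected_nodes(traj, eps_p):
    """traj: array of shape (N, T) holding the transformed trajectories
    hat-u^i_j(t) for each node j, with hat-s = 0. A node is 'affected'
    if its trajectory ever deviates by at least eps_p."""
    return [j for j in range(traj.shape[0])
            if np.abs(traj[j]).max() >= eps_p]
```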
5.2.2.1 Filtration function deﬁned for coupled dynamical systems
Assuming we have perturbed the oscillator for node n_i, let M = |V^i|. We will now construct a function f on the complete simplicial complex with M vertices, denoted by K or K_M. Here, we abuse notation and write V^i = {n_1, ..., n_M}. The filtration function f : K_M → R is built to take into account the temporal pattern of the propagation of the perturbation through the coupled systems and the relaxation (return to synchronization) of the coupled systems. It requires the advance choice of three parameters:
• ε_p ≥ 0, mentioned above, which determines when a trajectory is far enough from the global synchronized state s(t) to be considered unsynchronized,
• ε_sync ≥ 0, which controls when two trajectories are close enough to be considered synchronized with each other, and
• d_c ≥ 0, which is a distance parameter in the space where the points r_i are embedded, giving control on when the objects represented by the oscillators are far enough apart to be considered insignificant to each other.
We will define the function f by giving its value on simplices in order of increasing dimension. Define

t^i_sync = min { t | ∫_t^∞ ‖ û^i_j(t′) − û^i_k(t′) ‖_2 dt′ ≤ ε_sync, ∀ j, k }.

That is, t^i_sync is the first time at which all oscillators have returned to the global synchronized state after perturbing the i-th oscillator. The value of the filtration function for the vertex n_j is defined as

f(n_j) = min ( { t | min_{t′ ∈ [t_0, t_1]} ‖ û^i_j(t) − ŝ(t′) ‖_2 ≥ ε_p } ∪ { t^i_sync } ).    (5.6)
Next, we give the function value f for the edges of K. To avoid the involvement of any insignificant interaction between oscillators, an edge between n_j and n_k, denoted e_jk, is allowed in the earlier stage of the filtration only if d^org_jk ≤ d_c, where d^org_jk is the distance between r_j and r_k in R^d. Specifically, the value of the filtration function for the edge e_jk is defined as

f(e_jk) = max{ min{ t | ∫_t^∞ ‖ û^i_j(t′) − û^i_k(t′) ‖_2 dt′ ≤ ε_sync }, f(n_j), f(n_k) },  if d^org_jk ≤ d_c,
f(e_jk) = t^i_sync,  if d^org_jk > d_c.    (5.7)

It should be noted that, to this point, f defines a filtration function: when d^org_jk ≤ d_c, we have f(n_j) ≤ f(e_jk) according to the definition given in Eq. (5.7). The property also holds when d^org_jk > d_c, because f(n_j) ≤ t^i_sync according to the definition in Eq. (5.6) and f(e_jk) equals t^i_sync in this case.
Figure 5.2: The filtration of the simplicial complex associated to three 1-dimensional trajectories. [22]
The trajectories are generated as defined in Sec. 5.2.2.1. Here, each vertex corresponds to the trajectory with the same color. A vertex is added when its trajectory value exceeds the parameter ε_p; an edge is added when its two associated trajectories become close enough together that the area between the curves after that time is below the parameter ε_sync. Triangles and higher-dimensional simplices are added when all necessary edges have been included in the filtration.
We extend the function to the higher-dimensional simplices using the definition on the 1-skeleton. A simplex σ of dimension higher than one is included in K(x) if all of its 1-dimensional faces are already included; that is, its filtration value is defined iteratively by dimension as

f(σ) = max_{τ ≤ σ} f(τ),

where the max is taken over all codimension-1 faces τ of σ. Taking the filtration of K using this function (c.f. Eq. (2.1)) means that topological changes only occur at the collection of function values {f(n_i)}_i ∪ {f(e_jk)}_{j ≠ k}. Fig. 5.2 shows the filtration constructed for an example consisting of three trajectories.
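Eqs. (5.6) and (5.7) can be transcribed into a discrete-time sketch for 1-dimensional post-processed trajectories with ŝ = 0, as in the example above. The tail integrals are approximated by rectangle sums, and the names eps_p, eps_sync, and d_c follow the reconstructed parameter notation, so this is an illustration of the definitions rather than the dissertation's implementation.

```python
import numpy as np

def filtration_values(traj, t, coords, eps_p, eps_sync, d_c):
    """Vertex and edge filtration values of Eqs. (5.6)-(5.7).
    traj: (M, T) transformed trajectories; t: (T,) time grid;
    coords: (M, d) embedded points."""
    M, T = traj.shape
    dt = np.gradient(t)

    def first_sync_time(a, b):
        # first t with integral_t^inf |a - b| dt' <= eps_sync
        tail = np.cumsum((np.abs(a - b) * dt)[::-1])[::-1]
        return t[np.argmax(tail <= eps_sync)]

    # t_sync: first time every pair has a small remaining tail area
    t_sync = max(first_sync_time(traj[j], traj[k])
                 for j in range(M) for k in range(j + 1, M))

    # Eq. (5.6): a vertex enters when its trajectory first deviates
    # by eps_p, or at t_sync if it never does
    f_vert = np.empty(M)
    for j in range(M):
        far = np.nonzero(np.abs(traj[j]) >= eps_p)[0]
        f_vert[j] = min(t[far[0]], t_sync) if len(far) else t_sync

    # Eq. (5.7): spatially distant pairs are postponed to t_sync
    f_edge = np.full((M, M), t_sync)
    for j in range(M):
        for k in range(j + 1, M):
            if np.linalg.norm(coords[j] - coords[k]) <= d_c:
                v = max(first_sync_time(traj[j], traj[k]),
                        f_vert[j], f_vert[k])
                f_edge[j, k] = f_edge[k, j] = v
    return f_vert, f_edge, t_sync
```

The returned values define a valid filtration: every edge value is at least as large as the values of its two endpoints, matching the argument in the text.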
5.2.2.2 Deﬁnition of evolutionary homology
The previous section gives a function f_i : K_{V^i} → R defined on the complete simplicial complex with |V^i| vertices for each i = 1, ..., N. From the filtration defined by f_i, we then compute the persistence barcode for homology dimension k, which we call the k-th EH barcode, denoted B^k_i. The persistent homology computation for dimension ≥ 1 on the filtered simplicial complex is done using the software package Ripser [8], using the fact that k-dimensional homology only requires knowledge of the k- and (k+1)-dimensional simplices. The 0-dimensional homology is computed with a modification of the union-find algorithm.
Fig. 5.3 gives an example of the geometric configurations of two sets of points associated to Lorenz oscillators and their resulting EH barcodes. The EH barcodes effectively examine the local properties of significant cycles in the original space, which is important when the data is intrinsically discrete instead of a discrete sampling of a continuous space. As a result, point clouds with different geometry but similar barcodes under traditional persistence methods^1 may be distinguished by the EH barcodes.
5.2.3 Protein residue ﬂexibility analysis
In this section, we combine all the methods to formulate realistic protein residue ﬂexibility
analysis using the EH barcodes. Consider a protein with N residues and let ri denote the
position of the alpha carbon (Cα) atom of the ith residue. The coupled systems deﬁned
in Eq. (5.1) are used to study protein flexibility, with each protein residue represented by a 3-dimensional Lorenz system. Define the distance for the atoms in the original space as the Euclidean distance between the Cα atoms, d^org(r_i, r_j) = ‖r_i − r_j‖_2. A weighted graph
1 Here, traditional means the Vietoris-Rips filtration on the point cloud induced by the embedding.
Figure 5.3: An example of the construction of the evolutionary homology barcode. [22]
The geometry of two embedded systems is shown in Figures (a) and (b). Specifically, (b) consists of the six vertices of a regular hexagon with side length e1; and (a) consists of the vertices in (b) with the addition of the vertices of hexagons with side length e2 ≪ e1 centered at each of the previous vertices; here, e1 = 8 and e2 = 1. Figures (c) and (d) are the EH barcodes corresponding to Figures (a) and (b), respectively. A collection of coupled Lorenz systems is used with parameters δ = 1, γ = 12, β = 8/3, µ = 8, κ = 2, Γ = I_3, and ε = 0.12; see Eqs. (5.2), (5.8), and (5.1). In the model for the i-th residue, marked in red, the system is perturbed from the synchronized state by setting u_{i,3} = 2 s_3, with s_3 being the value of the third variable of the dynamical system at the synchronized state, and is simulated with step size h = 0.01 from t = 0 using the fourth-order Runge-Kutta method. The calculation of persistent homology using the Vietoris-Rips filtration with the Euclidean distance on the point clouds delivers similar bars corresponding to the 1-dimensional holes in (a) and (b), namely [e1 − e2, 2(e1 − e2)) and [e1, 2e1).
Laplacian matrix is constructed based on the distance function d^org to prescribe the coupling strength between the oscillators, and is defined as

A_ij = exp( −( d^org(r_i, r_j) / µ )^κ ) for i ≠ j,  and  A_ii = −Σ_{l ≠ i} A_il,    (5.8)

where µ and κ are tunable parameters.
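Eq. (5.8) translates directly into code; a brief numpy sketch (the function name is illustrative, and the coordinates in the usage below are random placeholders rather than real Cα positions):

```python
import numpy as np

def coupling_laplacian(coords, mu, kappa):
    """Weighted graph Laplacian of Eq. (5.8): off-diagonal entries
    A_ij = exp(-(||r_i - r_j|| / mu)^kappa), diagonal A_ii = -sum_{l != i} A_il."""
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    A = np.exp(-(dist / mu) ** kappa)
    np.fill_diagonal(A, 0.0)
    A[np.diag_indices(len(A))] = -A.sum(axis=1)
    return A
```

Because the off-diagonal weights are positive and each row sums to zero, this matrix is symmetric negative semidefinite; its eigenvalues are real and nonpositive, consistent with the stability analysis of Sec. 5.2.1.2.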
Figure 5.4: The result of perturbing residue 31 in protein PDB:1ABA. [22]
(a) The modified trajectory, as defined in Eq. (5.5), is plotted for each residue after the perturbation at t = 0 as a heatmap. The residues are ordered by the (geometric) distance to the perturbed site, from the closest to the farthest. (b) The modified trajectory, as defined in Eq. (5.5), is plotted for each residue after the perturbation at t = 0 as line plots. The darker lines are closer to the perturbed site. The heatmap shows the filtration value for the edges as defined in Eq. (5.7), and the order of residues is the same as in (a). The parameters for the coupled Lorenz system and the perturbation method are the same as those of Fig. 5.3.

To quantitatively study the flexibility of a protein, one needs to extract topological information for each residue. To this end, we go through the process given in the previous
sections once for each residue. When addressing the ith residue, we perturb the ith oscillator
at a time point in a synchronized system and take this state as the initial condition for the
coupled systems. See Fig. 5.4 for an example of this procedure when perturbing the oscillator
attached to a residue for a given embedding of one particular protein.
A collection of modified trajectories {û_i(t)}_{i=1}^N is obtained with the transformation function defined in Eq. (5.5). The persistence over time for {û_i(t)}_{i=1}^N is computed following the filtration procedure defined in Sec. 5.2.2.1. Let B^k_i be the k-th EH barcode obtained from the experiment of perturbing the oscillator corresponding to residue i. We introduce the following topological features to relate to protein flexibility:

EH^{p,k}_i = d_{W,p}( B^k_i, ∅ ),    (5.9)
where d_{W,p} for 1 ≤ p < ∞ is the p-Wasserstein distance, and p = ∞ gives the bottleneck distance.
We will show that these features characterize the behavior of this particular collection of
barcodes, which in turn, captures the topological pattern of the coupled dynamical systems
arising from the underlying protein structure.
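The distance to the empty barcode has a simple closed form: with the empty diagram, every bar must be matched to the diagonal, which under the usual L∞ ground metric costs half the bar's persistence. A small sketch of the feature in Eq. (5.9) under that standard convention (the function name is illustrative):

```python
import math

def eh_feature(bars, p):
    """d_{W,p}(B, empty): each bar (b, d) is matched to the diagonal at
    cost (d - b)/2; p = math.inf yields the bottleneck distance."""
    costs = [(d - b) / 2.0 for (b, d) in bars]
    if p == math.inf:
        return max(costs, default=0.0)
    return sum(c ** p for c in costs) ** (1.0 / p)
```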
The flexibility of any given residue is reflected by how the perturbation-induced stress is propagated and relaxed through the interactions with its neighbors. Such a relaxation
process will induce the change in the states of the nearby oscillators. Therefore, the records
of the time evolution of this subset of coupled oscillators in terms of topological invariants
can be used to analyze and predict protein ﬂexibility.
The difference in the results of this procedure can be seen in the example of Fig. 5.5, where the control of chaotic oscillators attached to a partially disordered protein (PDB:2RVQ) and a well-folded protein (PDB:1UBQ) is demonstrated. Clearly, the folded part of protein
2RVQ has strong correlations or interactions among residues from residue 25 to residue
110, which leads to the synchronization of the associated chaotic oscillators. In contrast,
the random coil part of protein 2RVQ does not have much coupling or interaction among
residues. Consequently, the associated chaotic oscillators remain in chaotic dynamics during
the time evolution. For folded protein 1UBQ, the associated chaotic oscillators become
synchronized within a few steps of simulation, except for a small ﬂexible tail. This behavior
underpins the use of coupled dynamical systems for protein ﬂexibility analysis.
Figure 5.5: Left: partially disordered protein, model 1 of PDB:2RVQ. Right: well-folded protein, PDB:1UBQ. [22]
The u_{i,1} value of each dynamical system is plotted as a heatmap. The Lorenz system defined in Eq. (5.2) is used with the parameters δ = 10, γ = 28, β = 8/3. The coupling matrix A defined in Eq. (5.8) has parameters µ = 14, κ = 2. The coupled system defined in Eq. (5.1) has parameters Γ = I_3 and ε = 0.12. The system is initialized with random values between 0 and 1 and is simulated from t = 0 to t = 200 with step size h = 0.01. The system is numerically solved using the 4th-order Runge-Kutta method. It can be seen from the heatmaps that the oscillators corresponding to the disordered regions behave asynchronously.
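The simulation protocol described in the caption can be sketched as follows. The exact coupling matrix A of Eq. (5.8) is built from the protein residue network and is not reproduced here, so the test substitutes a hypothetical ring-shaped, Laplacian-style coupling; the Lorenz parameters, Γ = I_3, ε, and the step size follow the caption.

```python
import numpy as np

def lorenz(u, delta=10.0, gamma=28.0, beta=8.0 / 3.0):
    # Right-hand side of the classic Lorenz system for one 3-D oscillator.
    x, y, z = u
    return np.array([delta * (y - x), x * (gamma - z) - y, x * y - beta * z])

def coupled_rhs(U, A, eps=0.12):
    # U: (N, 3) array of oscillator states.  With Gamma = I_3 the coupling
    # term eps * A @ U acts identically on all three components.
    return np.array([lorenz(u) for u in U]) + eps * A @ U

def simulate(U0, A, h=0.01, steps=20000):
    # 4th-order Runge-Kutta integration, as in the caption (h = 0.01).
    U = U0.copy()
    for _ in range(steps):
        k1 = coupled_rhs(U, A)
        k2 = coupled_rhs(U + h / 2 * k1, A)
        k3 = coupled_rhs(U + h / 2 * k2, A)
        k4 = coupled_rhs(U + h * k3, A)
        U = U + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return U
```

Recording the full trajectories u_i(t) during such a run, rather than only the final state, supplies the input for the filtration of Sec. 5.2.2.1.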
5.3 Results
5.3.1 Disordered and ﬂexible protein regions
To illustrate the correlation between protein residue ﬂexibility and the topological features
defined in Eq. (5.9), we study several proteins with intrinsically disordered regions. Intrinsically disordered proteins lack stable 3-dimensional molecular structures. Partially disordered proteins refer to the intrinsically disordered proteins that contain both stable structures and flexible regions. In nature, the disordered regions may play important roles in biological processes which require flexibility.
In what follows, we always work with the coupled Lorenz system parameters, the perturbation method for the ith residue, and the simulation procedure described in Fig. 5.3. The simulation is stopped when all oscillators return to the synchronized state. This process is repeated for each
residue. Two NMR structures of partially disordered proteins PDB:2ME9 and PDB:2MT6
are studied. The topological features are computed for each model of the structures and are
averaged over the models. The results are plotted in Fig. 5.6. The disordered regions clearly
correlate to the peaks of EH∞,0 and the valleys of EH∞,1, EH1,0, and EH1,1. The topological features are also able to distinguish between relatively stable coils (the coils that are
consistent among the NMR models) and the disordered parts (the parts that diﬀer among
the NMR models).
5.3.2 Protein B-factor prediction
Protein B-factors quantitatively measure the relative thermal motion of each atom and reflect atomic flexibility. The X-ray crystal structures deposited to the Protein Data Bank contain experimentally derived B-factors which can be used to validate the proposed method [151, 145]. To analyze protein flexible regions, B-factor prediction is needed for protein structures built from computational models and for some experimentally solved structures using NMR or cryo-EM techniques. Normal mode analysis (NMA) is one of the first methods proposed for B-factor prediction [83]. The Gaussian network model (GNM) [7] is known for its better accuracy and efficiency compared to a variety of earlier methods [204]. The multiscale flexibility-rigidity index (FRI), which is about 20% more accurate than GNM, has been established as the state-of-the-art in B-factor prediction [146].
In this section, we compute the correlation between the topological features and the experimentally derived protein B-factors. We further test the proposed topological features by building a simple linear regression model with a least-squares penalty against the experimental B-factors.

Figure 5.6: (a) Models 1-3 of PDB:2ME9 with the disordered region colored in blue, red, and yellow for the three models. (b) Similar plot as (a) for PDB:2MT6. (c) Topological features for PDB:2ME9 whose large disordered region is from residue 28 to residue 85. (d) Topological features for PDB:2MT6 whose large disordered region is from residue 118 to residue 151. [22]

A collection of 364 diverse proteins reported in the literature is chosen as the validation data (the set of 365 proteins [145] minus PDB:1AGN, which is excluded due to issues in its reported B-factors [146]). The size of the proteins ranges from tens to thousands of amino acid residues.
residues. The topological features in the model are the same as the setup given in Sec. 5.3.1.
An example of the resulting persistence barcodes for a relatively rigid and a relatively flexible residue is shown in Fig. 5.7. It is seen that the residue with a relatively small B-factor has many H0, H1, and H2 bars. Compared to the residue having a large B-factor, it has a much richer dynamical response and barcodes with more bars. Additionally, its H0 bars are much shorter, indicating a stronger interaction with neighboring residues.
Figure 5.7: Barcode plots for two residues. (a) Residue 6 of PDB:2NUH with a B-factor of 12.13 Å². (b) Residue 49 of PDB:2NUH with a B-factor of 33.4 Å². [22]
The computed topological features are plotted for a relatively small protein and a relatively large protein in Fig. 5.8. Clearly, the 0th dimensional topological features, specifically EH∞,0, provide a reasonable approximation to experimental B-factors. The regression using all topological information, EH, offers a very good approximation to experimental B-factors. A summary of the results and a comparison to other methods is shown in Table 5.1 for the set of 364 proteins. It is seen that the present evolutionary-topology-based prediction outperforms other methods in computational biophysics. A possible reason for this excellent performance is that the proposed method gives a more detailed description of residue interactions in terms of three different topological dimensions and two distance metrics. This example indicates that the proposed EH has great potential for other important biophysical applications, including the prediction of protein-ligand binding affinities, mutation-induced protein stability changes, and protein-protein interactions.
Method   RP     Description
EH∞,0    0.586  Topological feature
EH∞,1    0.039  Topological feature
EH∞,2    0.097  Topological feature
EH1,0    0.477  Topological feature
EH1,1    0.381  Topological feature
EH1,2    0.104  Topological feature
EH2,0    0.188  Topological feature
EH2,1    0.258  Topological feature
EH2,2    0.100  Topological feature
EH       0.691  Topological features

Method   RP     Description
EH       0.691  Topological metrics
mFRI     0.670  Multiscale FRI [146]
pfFRI    0.626  Parameter-free FRI [145]
GNM      0.565  Gaussian network model [145]
Table 5.1: The averaged Pearson correlation coefficients (RP) between the computed values (blind prediction for the topological features and regression for the rest of the models) and the experimental B-factors for a set of 364 proteins [146]. Top: Prediction RP values based on EH barcodes. Bottom: A comparison of the RP values of predictions from different methods. Here, EH is the linear regression using EH∞,0, EH∞,1, EH1,0, EH1,1, EH2,0, and EH2,1 within each protein. For a few large and multi-chain proteins (i.e., 1F8R, 1H6V, 1KMM, 2D5W, 3HHP, 1QKI, and 2Q52), to reduce the computational time and as a good approximation, we compute their EH barcodes on separate (protein) chains. We see from the comparison that the proposed EH barcode method outperforms the other methods in this application.
Figure 5.8: B-factors and the computed topological features. EH shows the linear regression with EH∞,0, EH∞,1, EH1,0, EH1,1, EH2,0, and EH2,1 within each protein. (a) PDB:3PSM with 94 residues. (b) PDB:3SZH with 697 residues. [22]
5.4 Conclusion
Many dynamical systems are designed to understand time-dependent phenomena in the real world. The topological analysis of dynamical systems is scarce in general, partially due to the fact that the topological structure of most dynamical systems is typically simple. In this work, we have introduced evolutionary homology (EH) to analyze the topology of dynamical systems and its time evolution. We present a method to embed the external topology of a physical system into dynamical systems. EH examines the embedded topology and converts it into topological invariants over time. The resulting barcode representation of the topological persistence is able to unveil the quantitative topology-function relationship of the embedded physical system.
We have chosen the well-known Lorenz system as an example to illustrate our EH formulation. An important biophysical problem, protein flexibility analysis, is employed to demonstrate the proposed topological embedding of realistic physical systems into dynamical systems. Specifically, we construct weighted graph Laplacian matrices from protein networks to regulate the Lorenz system, which leads to the synchronization of the chaotic
oscillators associated with protein residue network nodes. Simplices, simplicial complexes, and homology groups are subsequently defined using the adjacent Lorenz oscillator trajectories. Topological invariants and their persistence are computed over the time evolution (filtration) of these oscillators, unveiling protein thermal fluctuations at each residue. The Wasserstein and bottleneck metrics are used to quantitatively discriminate EH barcodes from different protein residues. The resulting model using the EH barcodes is found to outperform both geometric-graph and spectral-graph-theory based methods in the protein B-factor predictions of a commonly used benchmark set of 364 proteins.
The proposed EH method can be used to study the topological structure of a general physical system. Moreover, the present method extends the utility of dynamical systems, which are usually designed for qualitative analysis, to the quantitative modeling and prediction of realistic physical systems. Finally, the proposed approach can be readily applied to the study of a wide variety of topology-function relationships, not only within computational biology, such as the role of topology in protein-ligand, protein-protein, protein-metal, and protein-nucleic acid interactions, but also in other interactive graphs and networks in science and engineering.
Chapter 6
Topological characterization of static
macrobiomolecules and small
molecules
6.1 Introduction
Despite the competitive out-of-the-box performance of persistent homology as a featurization tool in supervised learning tasks, careful designs tailored to the field of application can further improve the performance. To this end, we present a comprehensive assessment of the representability of persistent homology for both macromolecules and small molecules. We introduce several ways of characterizing macromolecules and small molecules whose quality is judged on the problem of protein-ligand binding affinity prediction. We propose multilevel persistent homology specifically designed for the characterization of small molecules, and we show that such a representation is able to capture subtle changes in small molecules. The best protocol benchmarked on the protein-ligand binding affinity prediction problem is then applied to virtual screening where more than a hundred thousand target-candidate pairs are involved, and our model achieves top performance in a benchmark using the DUD (directory of useful decoys) database.
The rest of this chapter is organized as follows. In Section 6.2, we discuss in detail the biology and chemistry that need to be addressed by a representation tool. We introduce several persistent homology based methods focusing on the geometric, chemical, and electrostatic properties of macromolecules and small molecules in Section 6.3. The results are shown in Section 6.4, followed by a detailed discussion of several model aspects of persistent homology based machine learning models for biomolecular systems in Section 6.5.
6.2 Biological considerations
The development of persistent homology was motivated by its potential in the dimensionality reduction, abstraction, and simplification of biomolecular complexity [62]. In the early applications of persistent homology to biomolecules, emphasis was placed on major or global features (long-persisting features) to derive descriptive tools. For example, persistent homology was used to identify the tunnel in a Gramicidin A channel [62] and to study membrane fusion [99]. For the predictive modeling of biomolecules, features across a wide range of scales might all be important to the target quantity [198]. At the global scale, the biomolecular conformation should be captured. At the intermediate scale, the smaller intra-domain cavities need to be identified. At the most local scale, the important substructures should be addressed, such as the pyrrolidine in the side chain of proline. Biomolecules are both structurally and biologically complex. Their geometric and biological complexities include covalent bonds, noncovalent interactions, effects of chirality, cis and trans distinctions, multileveled protein structures, and protein-ligand and protein-nucleic acid complexes. Covering a large range of spatial scales is not enough for a powerful model. The biological details should also be explored. We address the underlying biology and physics by modifying the distance function and selecting various sets of atoms according to element types to describe different interactions. Some biological considerations are discussed in this section.
Covalent bonds.
Covalent bonds are formed via shared electron pairs or bonding pairs. The lengths and
the number of covalent bonds can be easily detected from 0th dimensional barcodes. For
macromolecules, the same type of covalent bonds have very similar bond lengths and thus similar 0th dimensional barcode patterns.
Noncovalent interactions.
Noncovalent interactions play a critical role in maintaining the 3D structure of biomolecules and mediating chemical and biological processes, such as solvation, binding, protein-DNA specification, molecular self-assembly, etc. Physically, noncovalent interactions are due to electrostatics, van der Waals forces, hydrogen bonds, π-effects, hydrophobic effects, etc. The ability to characterize noncovalent interactions is an essential task in any methodological development. The 1st and 2nd dimensional barcodes are suitable for characterizing the arrangement of such interactions at larger scales. Additionally, we propose multilevel persistence and electrostatic persistence to reveal local and pairwise noncovalent interactions via 0th dimensional barcodes as well.
Chirality, cis eﬀect and trans eﬀect.
Chirality, cis and trans effects are geometric properties of many molecules. Among them, chirality is a symmetry property such that a chiral molecule cannot be superposed on its mirror image. Cis and trans effects are due to molecular steric and electronic effects. Chirality, cis and trans effects often play a role in molecular kinetics, activity, and catalysis, and thus their characterization is an important issue in developing topological methods. These effects should be reflected in barcodes of various dimensions.
Multileveled protein structures.
Protein structures are typically described in terms of primary, secondary, tertiary, and quaternary levels. The protein primary structure is the linear sequence of amino acids in the polypeptide chain. Protein secondary structure refers to the local 3D structure of protein segments, mainly α-helices and β-sheets, which are highly regular and can be easily detected by distinct Frenet-Serret frames. A tertiary structure refers to the 3D structure of a single polypeptide chain. Its formation involves various noncovalent and covalent interactions including salt bridges, hydrophobic effects, and often disulfide bonds. A quaternary structure refers to the aggregation of two or more individual folded protein subunits into a 3D multi-subunit complex. Protein structures are further complicated by functional domains, motifs, and particular folds. The protein structural diversity and complexity result in both challenges and opportunities for methodological developments.
Protein-ligand, protein-protein, and protein-nucleic acid complexes.
Topological characterization of proteins is further complicated by protein interactions or binding with ligands (drugs), other proteins, DNA, and/or RNA molecules. Although a normal protein involves only carbon (C), hydrogen (H), nitrogen (N), oxygen (O), and sulfur (S) atoms, its protein-ligand complexes bring a variety of other elements into play, including phosphorus (P), fluorine (F), chlorine (Cl), bromine (Br), iodine (I), and many important biometals, such as calcium (Ca), potassium (K), sodium (Na), iron (Fe), copper (Cu), cobalt (Co), zinc (Zn), manganese (Mn), chromium (Cr), vanadium (V), tin (Sn), and molybdenum (Mo). Each biological element has important biological functions, and its presence in biomolecules should not be treated uniformly with other elements as points in the point cloud data. The interaction of proteins and nucleic acids can be very intricate.
6.3 Methods
6.3.1 Element specific persistent homology
One important issue is how to protect chemical and biological information during the topological simplification. As mentioned earlier, one should not treat different types of atoms as homogeneous points in a point cloud. To this end, we propose element specific persistent homology, or multicomponent persistent homology, to retain biological information in topological analysis [24]. The element selection is similar to a predefined vertex coloring for graphs.

When all atoms are passed to persistent homology algorithms, the information extracted mainly reflects the overall geometric arrangement of a biomolecule at different scales. By passing only atoms of certain element types, or of certain roles, to the persistent homology analysis, different types of interactions or geometric arrangements can be revealed. In protein-ligand binding modeling, the selection of all carbon atoms characterizes the hydrophobic interaction network, whilst the selection of all nitrogen and/or oxygen atoms characterizes the hydrophilic network and the network of potential hydrogen bonds. In protein structural analysis, computation on all atoms can identify geometric voids inside the protein, which may suggest structural instability, and computation on only Cα atoms reveals the overall structure of the amino acid backbone. In addition, combining various selections of atoms based on element types provides a very detailed description of the biomolecular system, and the hidden relationships from structure to function can then be learned by machine learning algorithms. This may lead to the discovery of important interactions not realized a priori. This is realized by passing the set of atoms of the selected element types to the persistent homology computation. The concept can be used with various constructions of the distance matrix for persistent homology computation based on Vietoris-Rips complex filtration.
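A minimal sketch of the element selection step, assuming atoms are given as (element, coordinates) pairs; the combinations listed are illustrative examples from the text (all-carbon for hydrophobic networks, nitrogen/oxygen for hydrophilic ones), not the full set used in the experiments.

```python
def select_atoms(atoms, elements):
    """Return the coordinates of atoms whose element type is in
    `elements`.  `atoms` is a list of (element, (x, y, z)) pairs; the
    selected coordinates form the point cloud passed to a persistent
    homology computation."""
    wanted = set(elements)
    return [xyz for elem, xyz in atoms if elem in wanted]

# Example element combinations, each acting as one "color" in the
# multicomponent analysis (hypothetical subset for illustration).
COMBINATIONS = [("C",), ("N", "O"), ("C", "N", "O", "S")]
```

Each combination yields its own point cloud, its own filtration, and therefore its own set of barcodes.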
6.3.2 Construction of distance matrix
Biomolecular systems are complex not only in geometry, but also in chemistry and biology. To effectively describe complex biomolecular systems, it is necessary to modify the filtration process. There are three commonly used filtrations for biomolecules, namely, radius filtration, distance matrix filtration, and density filtration [198, 201]. The distance matrices can be used with more abstract constructions of simplicial complexes, such as the Vietoris-Rips complex.
6.3.2.1 Multilevel persistent homology.
Small molecules such as ligands in protein-ligand complexes usually contain fewer atoms than large biomolecules such as proteins. Bonded atoms stay closer than nonbonded ones in most cases. As a result, the collection of 0th dimensional bars will mostly provide information about the lengths of covalent bonds, and the higher dimensional barcodes will most likely be very sparse. It is difficult to capture noncovalent interactions among atoms, especially hydrogen bonds and van der Waals pairwise interactions, in 0th dimensional barcodes. In order to describe noncovalent interactions, we propose multilevel persistent homology, obtained by simply modifying the distance matrix. Given the original distance matrix M = (d_{ij}) with 1 ≤ i, j ≤ N, the modified distance matrix is defined as

\tilde{M}_{ij} = { d_∞, if atoms i and j are bonded; M_{ij}, otherwise },    (6.1)

where d_∞ is a large number which is set to be greater than the upper limit of the filtration value chosen by the persistent homology algorithm. Note that this matrix may fail to satisfy the triangle inequality while still satisfying the construction principle of the Rips complex.
The present multilevel persistent homology is able to describe any selected interactions of interest and delivers two benefits in characterizing biomolecules. Firstly, the pairwise noncovalent interactions can be reflected by the 0th dimensional barcodes. Secondly, such treatment generates more higher dimensional barcodes, so that small structural fluctuations among different conformations of the same molecule can be captured. The persistent barcode representation of the molecule can be significantly enriched to better distinguish between different molecular structures and isomers. As an illustration, we take the ligand from the protein-ligand complex with PDB code "1BCD", which only has 10 atoms. A different conformation of the ligand is generated by using the Frog2 web server [128]. The persistent barcodes generated using the Rips complex with the distance matrix M are identical for the two conformations and only have 0th dimensional bars due to the simple structure. In this case, the 0th dimensional bars only reflect the length of each bond and therefore fail to distinguish the two slightly different conformations of the same molecule. However, when the modified distance matrices \tilde{M} are employed, the barcode representation is significantly enriched and is able to capture the tiny structural perturbation between the conformations. An illustration of the outcome from the modified distance matrix \tilde{M} is shown in Fig. 6.1. A general nth level persistence characterization of molecules can be obtained with the distance matrix \tilde{M}^n as

\tilde{M}^n_{ij} = { d_∞, if D(i, j) ≤ n; M_{ij}, otherwise },    (6.2)

where D(i, j) is the smallest number of bonds to travel from atom i to atom j and d_∞ is some number greater than the upper limit of the filtration value.
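The construction of Eqs. (6.1) and (6.2) can be sketched directly: D(i, j) is computed by breadth-first search on the covalent bond graph, and all entries within n bonds are pushed beyond the filtration range. The code below is an illustrative implementation; keeping the diagonal at zero (rather than d_∞) is an assumption made so the matrix stays usable by standard Rips-complex software.

```python
import numpy as np
from collections import deque

def bond_distances(n_atoms, bonds):
    """D(i, j): the smallest number of covalent bonds between atoms i
    and j, via breadth-first search on the bond graph (inf if the atoms
    are in disconnected components)."""
    adj = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)
    D = np.full((n_atoms, n_atoms), np.inf)
    for s in range(n_atoms):
        D[s, s] = 0.0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if D[s, v] == np.inf:
                    D[s, v] = D[s, u] + 1
                    q.append(v)
    return D

def multilevel_matrix(M, bonds, n=1, d_inf=1000.0):
    """Eq. (6.2): entries with bond distance <= n are set to d_inf so
    that near-bonded atom pairs never merge during the Rips filtration.
    The zero diagonal is preserved (a practical convention)."""
    D = bond_distances(M.shape[0], bonds)
    Mn = M.copy()
    Mn[(D <= n) & (D > 0)] = d_inf
    return Mn
```

With n = 1 this reduces to Eq. (6.1); larger n progressively suppresses longer through-bond contacts.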
Figure 6.1: Multilevel persistent homology on simple small molecules. [20]
Illustration of the representation ability of \tilde{M} in reflecting structural perturbations among conformations of the same molecule. Left: The structural alignment of two conformations of the ligand in the protein-ligand complex (PDB:1BCD). Right: The persistence diagram showing the 1st and 2nd dimensional results generated using the Rips complex with \tilde{M} for the two conformations. It is worth noticing that the barcodes generated using the Rips complex with M are identical for the two conformations.
6.3.2.2 Interactive persistent homology.
In protein-ligand binding analysis, and in other analyses involving interactions, we are interested in the change of topological invariants induced by the interactions caused by binding or other processes. Similar to the idea of multilevel persistent homology, we can design a distance matrix to focus on the interactions of interest. Consider a set of atoms A = A_1 ∪ A_2 with A_1 ∩ A_2 = ∅, where only interactions between atoms from A_1 and atoms from A_2 are of interest [24]. The interactive distance matrix \hat{M} is defined as

\hat{M}_{ij} = { M_{ij}, if a_i ∈ A_1, a_j ∈ A_2 or a_i ∈ A_2, a_j ∈ A_1; d_∞, otherwise },    (6.3)

where M is the original distance matrix induced from the Euclidean metric or other correlation function based distances, a_i and a_j are atoms i and j, and d_∞ is a number greater than the upper limit of the filtration value. In applications, A_1 and A_2 can respectively be the set of atoms of the protein and the set of atoms of the ligand in a protein-ligand complex. In this case, the characterization of interactions between the ligand and the protein is an important task. In the modeling of point-mutation-induced protein stability changes, A_1 could be the set of atoms at the mutation site and A_2 could be the set of atoms of surrounding residues close to the mutation site. Similar treatments can be used for protein-protein and protein-nucleic acid interactions.
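A sketch of the interactive matrix of Eq. (6.3), with atom index sets standing in for A_1 and A_2 (e.g. protein atoms versus ligand atoms). As with the multilevel matrix, keeping the zero diagonal is a practical assumption for Rips-complex software rather than part of the formula.

```python
import numpy as np

def interactive_matrix(M, set1, set2, d_inf=1000.0):
    """Eq. (6.3): keep distances only between atoms of A1 and atoms of
    A2; all intra-set distances are pushed to d_inf so that only cross
    interactions can create topological features."""
    n = M.shape[0]
    in1 = np.zeros(n, dtype=bool)
    in1[list(set1)] = True
    in2 = np.zeros(n, dtype=bool)
    in2[list(set2)] = True
    cross = np.outer(in1, in2) | np.outer(in2, in1)
    Mhat = np.where(cross, M, d_inf)
    np.fill_diagonal(Mhat, 0.0)        # keep the zero diagonal
    return Mhat
```

Feeding \hat{M} to a Rips filtration then yields barcodes that reflect only protein-ligand (or site-environment) contacts.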
6.3.2.3 Correlation function based persistent homology.
For biomolecules, the interaction strength between a pair of atoms usually does not align linearly with their Euclidean distance. For example, the van der Waals interaction is often described by the Lennard-Jones potential. Therefore, a kernel function filtration can be used to emphasize certain geometric scales [198],

\bar{M}_{ij} = 1 − Φ(d_{ij}, η_{ij}),    (6.4)

where Φ(d_{ij}, η_{ij}) is a radial basis function and η_{ij} is a scale parameter. This filtration can be incorporated in the element specific persistent homology:

\grave{M}_{ij} = { d_∞, if atom i or atom j ∈ U; 1 − Φ(d_{ij}, η_{ij}), otherwise }.    (6.5)

Additionally, one can simultaneously use two or more correlation functions characterized by different scales to generate a multiscale representation of biomolecules [140].

One form of the correlation function based filtration matrix is constructed by the flexibility and rigidity index. In this case, the Lorentz function is used in Eq. (6.5),

Φ(d_{ij}; η_{ij}, ν) = 1 / (1 + (d_{ij}/η_{ij})^ν),    (6.6)

where d_{ij} is the Euclidean distance between point i and point j, and η_{ij} is a parameter controlling the scale, related to the radii of the two atoms. When distance matrices based on such correlation functions are used, patterns at different spatial scales can be addressed separately by altering the scale parameter η_{ij}. Note that the rigidity index is given by [196]

µ_i = Σ_j Φ(d_{ij}; η_{ij}, ν).    (6.7)

This expression is closely related to the rigidity density based volumetric filtration [201].
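Eqs. (6.4), (6.6), and (6.7) can be sketched as below. For simplicity the scale parameter η is taken as a single constant, whereas in the method η_ij depends on the radii of atoms i and j; the exponent ν is likewise a free parameter here.

```python
def lorentz(d, eta, nu=3.0):
    """Lorentz correlation function, Eq. (6.6):
    Phi(d; eta, nu) = 1 / (1 + (d / eta)^nu)."""
    return 1.0 / (1.0 + (d / eta) ** nu)

def rigidity_index(i, dist, eta, nu=3.0):
    """Rigidity index mu_i = sum_j Phi(d_ij; eta, nu), Eq. (6.7),
    summing over all atoms j other than i."""
    return sum(lorentz(dist[i][j], eta, nu)
               for j in range(len(dist)) if j != i)

def kernel_distance(dist, eta, nu=3.0):
    """Filtration matrix Mbar_ij = 1 - Phi(d_ij), Eq. (6.4), with a
    zero diagonal so the matrix remains a valid Rips input."""
    n = len(dist)
    return [[0.0 if i == j else 1.0 - lorentz(dist[i][j], eta, nu)
             for j in range(n)] for i in range(n)]
```

Note that 1 − Φ is monotone increasing in d, so the kernel matrix preserves the ordering of edge insertions while compressing large distances toward 1.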
6.3.2.4 Electrostatic persistence
Electrostatic effects are among the most important effects in biomolecular structure, function, and dynamics. The embedding of electrostatics in topological invariants is of particular interest and can be very useful in describing highly charged biomolecules such as nucleic acids and their complexes. We introduce an electrostatic-interaction-induced distance function in Eq. (6.8) to address the electrostatic interactions among charged atoms. The abstract distance between two charged particles is rescaled according to their charges and their geometric distance, and is modeled as

Φ(d_{ij}, q_i, q_j; c) = 1 / (1 + exp(−c q_i q_j / d_{ij})),    (6.8)

where d_{ij} is the distance between the two atoms, q_i and q_j are the partial charges of the two atoms, and c is a nonzero tunable parameter. c is set to a positive number if opposite-charge interactions are to be addressed and to a negative number if like-charge interactions are of interest. The form of the function is adopted from the sigmoid function, which is widely used as an activation function in artificial neural networks. Such a function regularizes the input signal to the [0, 1] interval. Other functions can be similarly used. This formulation can be extended to systems with dipole or higher-order multipole approximations to the electron density. Weak interactions due to long distances or nearly neutral charges result in correlation values close to 0.5. When c > 0, repulsive interactions and attractive interactions deliver correlation values in (0.5, 1) and (0, 0.5), respectively. The distances induced by Φ(d_{ij}, q_i, q_j; c) are used to characterize electrostatic effects. The parameter c is not strictly physical, but is chosen to effectively spread the computed values over the (0, 1) interval so that the results can be used by machine learning methods. Another simple choice of charge correlation function is

Φ(d_{ij}, η_{ij}, q_i, q_j) = q_i q_j exp(−d_{ij}/η_{ij}).

However, this choice will lead to a different filtration domain. Additionally, a charge density can be constructed,

µ_c(r) = Σ_j q_j exp(−‖r − r_j‖/η_j),    (6.9)

where r is a position vector, ‖r − r_j‖ is the Euclidean distance between r and the jth atom position r_j, and η_j is a scale parameter. Equation (6.9) can be used for electrostatic filtration as well. In this case, the filtration parameter can be the charge density value, and cubical complex based filtration can be used.
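The charge correlation of Eq. (6.8) and the charge density of Eq. (6.9) can be sketched as follows. The default c = 1.0 is purely illustrative, since the text only requires c to be nonzero and tuned to spread values over (0, 1).

```python
import math

def charge_correlation(d, qi, qj, c=1.0):
    """Eq. (6.8): sigmoid-rescaled electrostatic 'distance'.  Values
    near 0.5 indicate weak interaction; with c > 0, attraction
    (opposite charges) falls in (0, 0.5) and repulsion in (0.5, 1)."""
    return 1.0 / (1.0 + math.exp(-c * qi * qj / d))

def charge_density(r, positions, charges, eta=1.0):
    """Eq. (6.9): exponential-kernel charge density at position r,
    summed over atoms at `positions` with partial charges `charges`
    (a single scale eta is used here for simplicity)."""
    return sum(q * math.exp(-math.dist(r, rj) / eta)
               for q, rj in zip(charges, positions))
```

A cubical-complex filtration would then sample µ_c on a grid and filter by its value.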
6.3.3 Feature generation from topological invariants
Barcode representation of topological invariants oﬀers a visualization of persistent homology
analysis. In machine learning analysis, we convert the barcode representation of topological
invariants into structured feature arrays for machine learning. To this end, we introduce
several procedures to generate feature vectors from sets of barcodes. These methods are
discussed below.
Counts in bins.
For a given set of atoms A, we denote its barcodes as B = {I_α}_{α∈A} and represent each bar by an interval I_α = [b_α, d_α], where b_α and d_α are respectively the birth and death positions on the filtration axis. The length of each bar, or the persistence of the topological invariant, is given by p_α = d_α − b_α. To locate the positions of all bars and persistences, we further split the filtration axis into a predefined collection of N bins, Bin = {Bin_i}_{i=1}^N with Bin_i = [l_i, r_i], where l_i and r_i are the left and the right end positions of the ith bin. We generate features by counting the numbers of births, deaths, and persistences in each bin, which leads to three counting feature vectors, namely, counts of birth F^C_b, death F^C_d, and persistence F^C_p:

F^C_{b,i}(B) = ‖{[b_α, d_α] ∈ B | l_i ≤ b_α ≤ r_i}‖, 1 ≤ i ≤ N,
F^C_{d,i}(B) = ‖{[b_α, d_α] ∈ B | l_i ≤ d_α ≤ r_i}‖, 1 ≤ i ≤ N,    (6.10)
F^C_{p,i}(B) = ‖{[b_α, d_α] ∈ B | b_α ≤ r_i and l_i ≤ d_α}‖, 1 ≤ i ≤ N,

where ‖ · ‖ is the number of elements in a set. Note that the above discussion applies to three topological dimensions, i.e., barcodes of the 0th dimension (B0), the 1st dimension (B1), and the 2nd dimension (B2). In general, this approach enables the description of bond lengths, including the lengths of noncovalent interactions, in biomolecules.
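The counting features of Eq. (6.10) reduce to simple per-bin counts. In the sketch below, bins are given as (l_i, r_i) pairs, and a bar is counted as persisting in bin i when it is alive somewhere in that bin (its birth lies before the bin's right end and its death after the bin's left end); names are illustrative.

```python
def counting_features(barcode, bins):
    """Eq. (6.10): per-bin counts of births, deaths, and persisting
    bars.  `barcode` is a list of (birth, death) pairs and `bins` a
    list of (l_i, r_i) intervals on the filtration axis."""
    Fb = [sum(1 for b, d in barcode if l <= b <= r) for l, r in bins]
    Fd = [sum(1 for b, d in barcode if l <= d <= r) for l, r in bins]
    Fp = [sum(1 for b, d in barcode if b <= r and l <= d) for l, r in bins]
    return Fb, Fd, Fp
```

Applying this to the 0th, 1st, and 2nd dimensional barcodes yields nine counting vectors per atom selection.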
Barcode statistics.
Another method of feature vector generation from a set of barcodes is to extract important statistics of the barcode collection, such as maximum values and standard deviations. Given a set of bars B = {[b_α, d_α]}_{α∈A}, we define the sets Birth = {b_α}_{α∈A}, Death = {d_α}_{α∈A}, and Persistence = {d_α − b_α}_{α∈A}. Three statistic feature vectors F^S_b, F^S_d, and F^S_p can then be generated from the statistics of the collection of barcodes. For example, F^S_b consists of avg(Birth), std(Birth), max(Birth), min(Birth), sum(Birth), and cnt(Birth), where avg(·) is the average value of a set of numbers, std(·) is the standard deviation of a set of numbers, max(·) and min(·) are the maximum and minimum values in a set of numbers, sum(·) is the summation of elements in a set of numbers, and cnt(·) is the count of elements in a set. The generation of F^S_d is the same, examining the set Death instead. F^S_p contains the same statistics of the set Persistence, with two extra terms: the birth and death values of the longest bar. Statistics feature vectors are collected from barcodes of three topological dimensions, i.e., the 0th, 1st, and 2nd dimensions.
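The statistic feature vectors can be sketched without any dependencies; the entry ordering follows the text (avg, std, max, min, sum, cnt), with the birth and death of the longest bar appended to F^S_p.

```python
def barcode_statistics(barcode):
    """Statistic feature vectors F^S_b, F^S_d, F^S_p from a nonempty
    set of bars given as (birth, death) pairs."""
    births = [b for b, d in barcode]
    deaths = [d for b, d in barcode]
    pers = [d - b for b, d in barcode]

    def stats(xs):
        # avg, std (population), max, min, sum, cnt -- as in the text.
        n = len(xs)
        avg = sum(xs) / n
        std = (sum((x - avg) ** 2 for x in xs) / n) ** 0.5
        return [avg, std, max(xs), min(xs), sum(xs), n]

    Fb, Fd, Fp = stats(births), stats(deaths), stats(pers)
    longest = max(barcode, key=lambda bar: bar[1] - bar[0])
    Fp += list(longest)                # birth and death of longest bar
    return Fb, Fd, Fp
```

Whether std(·) is the population or sample standard deviation is not specified in the text; the population form is assumed here.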
2D representation.
The construction of multidimensional persistence is an interesting topic in persistent homology. In general, it is believed that multidimensional persistence has better representational power for complex systems described by multiple parameters [32]. Although multidimensional persistence is hard to compute, one can compute persistence for one parameter while fixing the rest of the parameters to a sequence of fixed values. In the case of two parameters, a bifiltration can be computed by fixing one parameter to each of a sequence of values in turn while computing persistence for the other parameter. For example, one can take a sequence of resolutions and compute persistence for distance at each fixed resolution. The sequence of outputs can be stacked to form a multidimensional representation [199].
Computing persistence multiple times and stacking the results is especially useful when the parameters not chosen as the filtration parameter are naturally discrete with underlying orders. For example, multicomponent or element-specific persistent homology results in many persistent homology computations over different selections of atoms. These results can be ordered by the percentage of the molecule's atoms used or by their importance scores in classical machine learning methods. Multiple underlying dimensions also exist in the element-specific persistent homology characterization of molecules. This property enables 2D or 3D topological representations of molecules. Based on the observation that predictor performance degenerates when too many element combinations are used, we order the element combinations according to their individual performance on the task using ensemble-of-trees methods. Combining the dimension of spatial scale with the dimension of element combinations, a 2D topological representation is obtained. Such a representation is expected to work better for complex geometry
such as protein-ligand complexes. With E = {E_j}_{j=1}^{N_E} denoting the collection of element combinations ordered by their individual importance scores on the task, and B_k(E_j) denoting the kth dimensional barcodes obtained with atoms of element combination E_j, eight 2D representations are defined as

{F^C_{d,i}(B_0(E_j)), F^C_{p,i}(B_0(E_j)), F^C_{b,i}(B_1(E_j)), F^C_{d,i}(B_1(E_j)), F^C_{p,i}(B_1(E_j)), F^C_{b,i}(B_2(E_j)), F^C_{d,i}(B_2(E_j)), F^C_{p,i}(B_2(E_j))}_{j=1,...,N_E; i=1,...,N},   (6.11)
where F^C_{γ,i} with γ = b, d, p is the barcode counting rule defined in Eq. (6.10). For the 0th dimension, since all bars start from zero, there is no need for F^C_{b,i}(B_0(E_j)). These eight 2D
representations are regarded as eight channels of a 2D topological image. In protein-ligand binding analysis, 2D topological features are generated for the barcodes of a protein-ligand complex and for the differences between the barcodes of the protein-ligand complex and those of the protein alone. Therefore, we have a total of 16 channels in a 2D image for the protein-ligand complex. This 16-channel image can be fed into the training or prediction of convolutional neural networks.

As an example, in the characterization of protein-ligand complexes using alpha complexes, 2D features are generated from alpha complex based persistent homology computations of the protein and of the protein-ligand complex. A total of 128 element combinations are considered. The [0, 12] Å interval is divided into 120 equal-length bins, which defines the resolution of the topological images. Therefore, the input feature for each sample is a 120 × 128 × 16 tensor.
When too few element combinations are considered to form a meaningful axis, the element-combination axis can instead be folded into the channels to form 1D representations suitable for a 1D CNN.
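To make the shape bookkeeping concrete, the following sketch assembles per-element-combination binned counts into the 120 × 128 × 16 topological image described above. The random counts are placeholders for actual barcode features.

```python
import numpy as np

# Sizes from the text: 120 filtration bins, 128 element combinations,
# and 16 channels (eight counting-rule channels for the complex plus
# eight for the complex-minus-protein differences).
n_bins, n_combos, n_channels = 120, 128, 16

rng = np.random.default_rng(0)
# counts[c][j] is the length-120 binned feature vector for channel c
# and element combination j (placeholder random data for illustration)
counts = rng.integers(0, 5, size=(n_channels, n_combos, n_bins))

# Stack into a 2D "topological image" of shape (bins, combos, channels),
# ready for a 2D convolutional network
image = np.transpose(counts, (2, 1, 0))
```

Folding the combination axis into the channel axis instead (`image.reshape(n_bins, -1)`) gives the 1D variant mentioned above.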
6.3.4 Machine learning algorithms
Three machine learning algorithms, namely k-nearest neighbors (KNN) regression, gradient boosting trees, and deep convolutional neural networks, are integrated with our topological representations to construct topological learning algorithms.
K-nearest neighbors algorithm via barcode space metrics.
One of the simplest machine learning algorithms is k-nearest neighbors (KNN) for classification or regression. In KNN regression, the property value of a given object is obtained as the average or weighted average of the values of its k nearest neighbors under a given similarity metric. The problem then becomes how to construct a metric on the dataset.
In the present work, instead of computing similarities from constructed feature vectors, the similarity between biomolecules is derived directly from distances between the sets of barcodes generated from different biomolecules. Popular barcode space metrics include the bottleneck distance [43] and, more generally, the Wasserstein metrics [44, 29]. The definitions of the two metrics can be found in Section 2.1.3.

The barcode space metrics can be used directly to assess the representation power of various persistent homology methods on biomolecules without being affected by the choice of machine learning models and hyperparameters. We show in the results section that the similarity measurement induced by barcode space metrics is significantly correlated with molecular function.
Wasserstein metric measures from biomolecules can also be directly implemented in a kernel-based method such as the nonlinear support vector machine algorithm for classification and regression tasks. However, this aspect is not explored in the present work.
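The KNN scheme can be sketched with scikit-learn's precomputed-metric interface. The toy 4×4 matrix below stands in for averaged Wasserstein distances between barcode sets (actual distances would come from a persistence toolkit), and k = 2 is used here instead of the k = 3 of the experiments.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy symmetric matrix standing in for averaged Wasserstein distances
# between the barcode sets of four molecules
D = np.array([[0.0, 1.0, 5.0, 6.0],
              [1.0, 0.0, 4.0, 5.0],
              [5.0, 4.0, 0.0, 1.0],
              [6.0, 5.0, 1.0, 0.0]])
y = np.array([2.0, 2.2, 8.0, 8.4])  # e.g., binding affinities

# KNN regression directly on the precomputed barcode-space metric
knn = KNeighborsRegressor(n_neighbors=2, metric="precomputed")
knn.fit(D, y)

# Predict for molecule 0 from its row of distances to the training set
pred = knn.predict(D[:1])
```

With uniform weights, the prediction for molecule 0 is the mean of its two nearest neighbors' values.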
Gradient boosting trees.
Gradient boosting trees is an ensemble method that combines individual decision trees to learn complex feature-target maps and can effectively prevent overfitting through the shrinkage technique. The gradient boosting trees method is realized using the GradientBoostingRegressor module in the scikit-learn software package [152] (version 0.17.1). A set of parameters found to be efficient in our previous study on protein-ligand binding affinity prediction [24] is used uniformly unless otherwise specified. The parameters used are n_estimators=20000, max_depth=8, learning_rate=0.005, loss='ls', subsample=0.7, max_features='sqrt'.
Deep convolutional neural networks.
The deep convolutional neural networks in this work are implemented using Keras [40]
(version 1.1.2) with Theano backend [177] (version 0.8.2).
For TopBP-DL(Complex), a widely used convolutional neural network architecture is employed, beginning with convolution layers followed by dense layers. Due to limited computational resources, parameter optimization is not performed; most parameters are adopted from our earlier work [26] and the remaining ones are assigned manually to reasonable values. The detailed architecture is shown in Fig. 6.2. The Adam optimizer with learning rate 0.0001 is used. The loss function is the mean squared error. The network is trained with a batch size of 16 for 150 epochs. The training data is shuffled for each epoch.
The network architecture of TopVS-DL is shown in Fig. 6.3. The Adam optimizer with learning rate 0.0001 is used. The loss function is binary cross-entropy. The network is trained with a batch size of 1024 for 10 epochs. The training data is shuffled for each epoch. The batch size is larger than that used in TopBP-DL because of the much larger training set in this problem. For the same reason, the training process converges to a small loss very quickly, within only a few training steps.
Figure 6.2: The network architecture of TopBP-DL [20].
The structured layers are shown in boxes/rectangles with sharp corners for 2D/1D image-like content, and the unstructured layers are shown in rectangles. The numbers in the convolution layers give the number of filters and the filter size, from left to right. The dense layers are drawn with the number of neurons and the activation function. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. Layers that are repeated n times are marked with an "×n" sign on the right side of the layer.
Figure 6.3: The network architecture of TopVS-DL [20].
The 1D image-like layers are shown in sharp-corner rectangles. The numbers in the convolution layers give the number of filters and the filter size, from left to right. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. Layers that are repeated n times are marked with an "×n" sign on the right side of the layer.
6.4 Results
Rational drug design and discovery have rapidly evolved into some of the most important and exciting research fields in medicine and biology, with a potentially profound impact on human health. The ultimate goal is to determine and predict whether a given drug candidate will bind to a target so as to activate or inhibit its function, resulting in a therapeutic benefit to the patient. Virtual screening is an important process in rational drug design and discovery that aims to identify actives of a given target from a library of small molecules. There are mainly two types of screening techniques: ligand-based and structure-based. Ligand-based approaches depend on the similarity among small-molecule candidates. Structure-based approaches dock a candidate molecule to the target protein and judge the candidate by the binding affinity modeled from docking poses. Various molecular docking software packages have been developed for these purposes. Molecular docking involves both pose generation and binding affinity scoring. Currently, pose generation is quite robust while scoring power is still limited. Therefore, knowledge-based rescoring methods using machine learning or deep learning approaches can improve scoring accuracy [158, 59, 4]. We also apply our topological learning method as a rescoring tool to rerank candidates based on docking poses generated by docking software.
This section explores the representational power of the proposed persistent homology methods for the prediction of protein-ligand binding affinities and the discrimination of actives from non-actives for protein targets. To this end, we use the present method to investigate three types of problems. First, we develop topological learning models for ligand-based protein-ligand binding affinity prediction. This problem is designed to examine the representability of the proposed topological methods for small molecules. Then, we develop
Figure 6.4: An illustration of the topology based machine learning algorithms used in scoring and virtual screening [20].
topological learning models for protein-ligand-complex-based binding affinity prediction. This problem enables us to understand the capability of the proposed topological learning methods for dealing with protein-ligand complexes. Finally, we examine the structure-based classification of active ligands versus decoys, which are highly likely to be non-actives, i.e., structure-based virtual screening (VS). The optimal selection of features and methods is determined by studying the first two applications, and this finding leads to the main application studied in this work, topological structure-based virtual screening. The computational algorithms used in this study are illustrated in Fig. 6.4.
6.4.1 Ligand-based protein-ligand binding affinity prediction
In this section, we address the representation of small molecules by element-specific persistent homology, especially the proposed multi-level persistent homology designed for small molecules.
Data set
To assess the representational ability of the present persistent homology algorithms on small molecules, we use a high-quality data set of 1322 protein-ligand complexes with binding affinity data involving the 7 protein clusters introduced earlier (denoted S1322) [187]. It is a subset of the PDBBind v2015 refined set, and its details are given in the Supplementary material 1 of Ref. [187]. We consider a ligand-based approach to predict the binding affinities of protein-ligand complexes in the various protein clusters; as such, only the ligand information is used in our topological analysis. The ligand structures are taken from the PDBBind database without modification. The number of ligands per protein cluster ranges from 94 to 333.
Models and performance
Two models, TopBP-KNN(Ligand) and TopBP-ML(Ligand), are constructed. TopBP-KNN(Ligand) is used to directly assess the representation power of persistent homology for small molecules, and TopBP-ML(Ligand) is the final practical model. The results are shown in Table 6.1. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees.

In TopBP-ML(Ligand), we process the geometry, shape, and covalent bond information of the small molecules using the alpha complex, and the noncovalent intramolecular interactions using multi-level persistent homology with the Rips complex. The features used are AB012ESGBT and RB012M1SGBT as described in Section Discussion/Ligand-based protein-ligand binding affinity prediction. The gradient boosting trees method is used.
In TopBP-KNN(Ligand), we represent the small molecules with a collection of sets of barcodes from element-specific persistent homology calculations. The Wasserstein distance with p = 2 is applied to measure the similarity between two sets of barcodes. The similarity between each pair of small molecules is then measured by averaging the Wasserstein distances over all sets of considered barcodes. K-nearest-neighbor (KNN) regression is then applied to the measured similarities. In detail, the 6 sets of barcodes considered are RB0EKNN, RB1EKNN, RB2EKNN, RB0M1KNN, RB1M1KNN, and RB2M1KNN, as described in Section Discussion/Ligand-based protein-ligand binding affinity prediction. Leave-one-out validation within each protein cluster with k = 3 is used for this model.
Methods       TopBP-KNN(Ligand)  TopBP-ML(Ligand) (5-fold)  FFT-BP (5-fold) [187]
CL 1 (333)    0.698 (1.66)       0.713 (1.60)               (1.93)
CL 2 (264)    0.817 (1.28)       0.843 (1.15)               (1.32)
CL 3 (219)    0.620 (1.68)       0.693 (1.51)               (2.01)
CL 4 (156)    0.645 (1.41)       0.670 (1.35)               (1.61)
CL 5 (134)    0.756 (1.68)       0.831 (1.34)               (2.02)
CL 6 (122)    0.658 (1.68)       0.698 (1.56)               (2.06)
CL 7 (94)     0.739 (1.31)       0.737 (1.26)               (1.71)
Average       0.705 (1.49)       0.741 (1.40)               (1.81)

Table 6.1: Pearson correlation coefficients (RMSE in kcal/mol) of the ligand-based topological models on the S1322 dataset.
The numbers in parentheses after the cluster labels show the number of entries in each protein cluster. The performance is reported as Pearson correlation coefficient (root mean squared error in kcal/mol). The median performance of 20 random 5-fold cross-validation runs is reported for TopBP-ML(Ligand). The results reported for TopBP-KNN(Ligand) are obtained by leave-one-out validation within each protein cluster with k = 3 for the KNN model.
In Table 6.1, the FFT-BP 5-fold cross-validation results were obtained using multiple additive regression trees and a set of physical descriptors, including geometry, charge, electrostatic interactions, and van der Waals interactions, for the S1322 set [187]. Since multiple additive regression trees is also an implementation of the GBT used in the present work, it is appropriate to compare the FFT-BP results with the GBT results in this work to assess the representation power of topological features. It is interesting to note that, judged by RMSE, both sets of the current topological descriptors have more predictive power than the physical descriptors built on protein-ligand complexes constructed in our earlier work [187]. Those physical descriptors were constructed from sophisticated surface areas, molecular volumes, van der Waals interactions, charges computed by quantum mechanics, and Poisson-Boltzmann theory based electrostatics [187]. The success of the topological descriptors implies the existence of an alternative and potentially more powerful description of the complex biomolecular world.
6.4.2 Complex-based protein-ligand binding affinity prediction
In this section, we focus on topological representations of protein-ligand complexes.
Data sets
The PDBBind database provides a comprehensive collection of structures of protein-ligand complexes and their binding affinity data [38, 119]. The original experimental data in the Protein Data Bank (PDB) [17] are selected into the PDBBind database based on certain quality requirements and are curated for applications. As shown in Table 6.2, this database is expanding on a yearly basis. It has become a common resource for benchmarking computational methods and algorithms for protein-ligand binding analysis and drug design. Popular data sets include version 2007 (v2007), v2013, and v2015; among them, the v2013 core set and the v2015 core set are identical. A large number of scoring functions have been tested on these data sets. The latest version, v2016, has an enlarged core set, which contains 290 protein-ligand complexes from 58 protein families. This test set should therefore be relatively easier than the v2015 core set, whose 195 complexes involve 65 protein families. The core sets were constructed by choosing 3 samples, with median, maximum, and minimum binding affinity, from each protein family for the v2007, v2013, and v2015 sets. The core set for v2016 was constructed similarly but with 5 samples from each protein family.
Version  Refined set  Training set  Core set (test set)  Protein families
v2007    1300         1105          195                  65
v2013    2959         2764          195                  65
v2015    3706         3511          195                  65
v2016    4057         3767          290                  58

Table 6.2: Description of the PDBBind datasets.
Number of complexes or number of protein families in the PDBBind data sets used in the present binding affinity prediction. Here, training sets are set to the corresponding refined sets, excluding the complexes in the corresponding test sets (i.e., core sets). Protein families refer to those in the corresponding core sets.
Model and performance
Two models, TopBP-ML(Complex) and TopBP-DL(Complex), are introduced. The results are shown in Table 6.3. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees.
Core set predictions
Methods             v2007               v2013               v2015         v2016         Average
TopBP(Complex)      0.827 (1.93)        0.808 (1.95)        0.812 (1.92)  0.861 (1.65)  0.827 (1.86)
TopBP-ML(Complex)   0.818 (2.01)        0.804 (2.00)        0.797 (1.99)  0.848 (1.74)  0.817 (1.94)
TopBP-DL(Complex)   0.806 (1.95)        0.781 (1.98)        0.799 (1.91)  0.848 (1.64)  0.809 (1.87)
RF::VinaElem        0.803 (1.94) [115]  0.752 (2.03) [116]  -             -             -
RI-Score [140]      0.803 (1.99)        -                   0.762 (2.05)  0.815 (1.85)  -

Refined set 5-fold cross validations
Methods             v2007         v2013         v2015         v2016         Average
TopBP-ML(Complex)   0.752 (1.95)  0.768 (1.75)  0.781 (1.71)  0.785 (1.71)  0.771 (1.78)
RI-Score [140]      -             -             -             0.747 (1.83)  -

Table 6.3: Pearson correlation coefficients (RMSE in kcal/mol) of different protein-ligand complex based approaches on PDBBind datasets.
Pearson correlation coefficients, with RMSE (kcal/mol) in parentheses, for predictions by different methods. For the tests on core sets, the models are trained with the corresponding refined set minus the core set. Five-fold cross validation is done on the refined sets. Results of TopBP-ML(Complex) are the medians of 50 repeated runs. For TopBP-DL(Complex), 100 independent models are generated first; a consensus model is built by randomly choosing 50 models out of the 100, and this process is repeated 1000 times with the median reported. TopBP(Complex) is a consensus model combining TopBP-ML(Complex) and TopBP-DL(Complex): each time, 50 single deep learning models are randomly selected to form TopBP-DL(Complex) and one TopBP-ML(Complex) model is randomly selected; the average of the two is taken as the output of TopBP(Complex). This process is repeated 1000 times with the median reported.
In TopBP-ML(Complex), the alpha complex is used to describe the arrangement of the carbon and heavy atom networks, while the Rips complex with different distance matrices is used to describe the protein-ligand interactions from the perspective of interaction distances and the strength of electrostatic interactions. In detail, the features used are RB0IC, RB0CIS, and AB12ES as described in Section Discussion/Complex-based protein-ligand binding affinity prediction, together with those used in TopBP-ML(Ligand).

With the idea that a sequence of element combinations ordered by their importance in gradient boosting trees models can serve as an extra dimension of the description, we build a 2D convolutional neural network with one spatial dimension and one dimension of element combinations. We combine this 2D CNN with a 1D CNN taking the pairwise interaction inputs. For the construction of the 2D input, the reader is referred to Section Feature generation from topological invariants. The 1D image-like inputs consist of two parts, both generated by the counts-in-bins method described in Section Feature generation from topological invariants. For the 0th dimensional barcodes from interactive persistent homology of the 36 pairs of atom types ({C, N, O, S} from protein and {C, N, O, S, P, F, Cl, Br, I} from ligand), the interval [0, 50] Å is divided into equal-length subintervals of length 0.25 Å. For the 0th dimensional barcodes from interactive persistent homology for electrostatics of the 50 pairs of atom types ({C, N, O, S, H} from protein and {C, N, O, S, P, F, Cl, Br, I, H} from ligand), the parameter interval [0, 1] is divided into equal-length subintervals of length 0.01. These two 1D image-like features have sizes 200 × 36 and 100 × 50, respectively. The network architecture is given in Section Methods/Machine learning algorithms/Deep convolutional neural networks.
The final model TopBP(Complex) takes the average of TopBP-ML(Complex) and TopBP-DL(Complex), under the assumption that the errors made by the two approaches are only partially correlated and that averaging may therefore cancel part of the errors. As a result, TopBP(Complex) delivers the best prediction performance on all four test sets.
6.4.3 Structure-based virtual screening
In this section, we examine the performance of the proposed method on the main application of this work, structure-based virtual screening, which involves protein-compound complexes obtained by docking the candidates to the target proteins. The dataset is much larger than those in the two protein-ligand binding affinity applications, which makes parameter tuning very time consuming. Therefore, the best-performing procedures from ligand-based binding affinity prediction and protein-ligand-complex-based binding affinity prediction are applied in this virtual screening application.
DUD data set
The directory of useful decoys (DUD) [93, 135] is used to benchmark our topological approach for virtual screening. The DUD data set contains 40 protein targets from six classes, i.e., nuclear hormone receptors, kinases, serine proteases, metalloenzymes, folate enzymes, and other enzymes. A total of about 3000 active ligands were identified from the literature. The number of ligands for each target ranges from tens to hundreds. At most 36 decoys were constructed for each ligand from the ZINC database of commercially available compounds [95]. In the first step, the ZINC database of 3.5 million compounds was reduced to a database of 1.5 million compounds with similarity less than 0.9 to the ligands, where similarity was measured by the Tanimoto coefficient on CACTVS type 2 fingerprints. The decoys were then selected so that they possess physical properties similar to the ligands but dissimilar molecular topology (topology in the chemical sense, not the mathematical one). A total of 32 physical properties were used, including molecular weight, partition coefficient, and number of hydrogen bonding groups. This results in a total of about 100000 compounds. A discrepancy between the calculated partial charges of the ligand and decoy sets was reported for the original release 2 of the DUD datasets, which made it trivial for virtual screening methods to distinguish between the two categories using those charges [5]. In this work, we use the data with recalculated Gasteiger charges for both the ligand and decoy sets given by Armstrong et al. [5] in AutoDock Vina and in our electrostatic persistence.
Data processing
In structure-based virtual screening, possible complex structures of the target protein and the small-molecule candidate are required. For the DUD dataset, the structures of the 40 protein targets, the ligands, and the decoys are given, and we generate the protein-compound complexes using docking software. To this end, we first add missing atoms to the proteins using the profix utility in the Jackal software package [203]. The receptors and the ligands or decoys are prepared using the scripts prepare_receptor4.py and prepare_ligand4.py provided by the AutoDockTools module in the MGLTools package (version 1.5.6) [130]. The bounding box of the binding site is defined as a cube with edge length 27 Å, centered at the geometric center of the crystal ligand. AutoDock Vina (version 1.1.2) [179] is used to dock the ligands or decoys to the receptors. The option exhaustiveness is set to 16 and all other parameters are set to their default values. In each docking experiment, the pose with the lowest binding free energy reported by AutoDock Vina is used by the reranking models.
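The search-box construction amounts to centering a 27 Å cube at the ligand's geometric center. A sketch follows; the coordinates and the helper name `binding_site_box` are illustrative, and real runs would pass these values to AutoDock Vina's center_x/.../size_z options.

```python
import numpy as np

def binding_site_box(ligand_coords, edge=27.0):
    """Center a cubic docking search box at the geometric center of
    the crystal ligand, as done when configuring AutoDock Vina."""
    coords = np.asarray(ligand_coords, dtype=float)
    center = coords.mean(axis=0)  # geometric (unweighted) center
    return {
        "center_x": center[0], "center_y": center[1], "center_z": center[2],
        "size_x": edge, "size_y": edge, "size_z": edge,
    }

# Illustrative ligand heavy-atom coordinates (angstroms)
box = binding_site_box([(1.0, 2.0, 3.0), (3.0, 2.0, 1.0), (2.0, 5.0, 2.0)])
```

Note the geometric center weights all atoms equally, unlike a mass-weighted centroid.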
Evaluation
Two measurements, the enrichment factor (EF) and the area under the receiver operating characteristic curve (AUC), are used to evaluate each method's ability to discriminate
actives from decoys. The AUC is defined as

AUC = 1 − (1/N_a) Σ_{i=1}^{N_a} N_d^i / N_d,   (6.12)

where N_a is the number of active ligands, N_d is the total number of decoys, and N_d^i is the number of decoys ranked higher than the ith ligand [158]. An AUC value of 0.5 is the expected value of a random selection, whereas a perfect prediction results in an AUC of 1. The EF at x%, denoted EF_{x%}, evaluates the quality of the set of top x% ranked compounds by comparing the percentage of actives among the top x% ranked compounds to the percentage of actives in the entire compound set. It is defined as

EF_{x%} = (N_a^{x%} / N^{x%}) · (N / N_a),   (6.13)

where N_a^{x%} is the number of active ligands among the top x% ranked compounds, N^{x%} is the number of top x% ranked compounds, N is the total number of compounds, and N_a is the total number of active ligands.
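Both measures are easy to compute from a ranked list of active/decoy labels; the sketch below implements Eqs. (6.12) and (6.13). The rounding convention for the top-x% cutoff is our assumption.

```python
def auc_from_ranking(ranked_labels):
    """AUC of Eq. (6.12). ranked_labels lists 1 (active) / 0 (decoy)
    sorted from best-scored to worst-scored compound."""
    n_a = sum(ranked_labels)
    n_d = len(ranked_labels) - n_a
    total, decoys_seen = 0, 0
    for lab in ranked_labels:
        if lab == 0:
            decoys_seen += 1
        else:
            total += decoys_seen  # N_d^i: decoys ranked above ligand i
    return 1.0 - total / (n_a * n_d)

def enrichment_factor(ranked_labels, x):
    """EF at x% of Eq. (6.13); the cutoff rounding is an assumption."""
    n = len(ranked_labels)
    n_a = sum(ranked_labels)
    n_top = max(1, int(round(n * x / 100.0)))
    n_a_top = sum(ranked_labels[:n_top])
    return (n_a_top / n_top) * (n / n_a)

labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]  # 3 actives among 10 compounds
auc = auc_from_ranking(labels)            # 1 - (1/3)(1/7) = 1 - 1/21
ef20 = enrichment_factor(labels, 20)      # (2/2) * (10/3)
```

A random ranking gives AUC near 0.5 and EF near 1 in expectation, matching the baselines quoted above.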
To evaluate the performance of various methods on the DUD data set, the entries associated with one protein target are used as the test set each time [158]. For the selection of the training set for a given protein target, we follow a procedure given in the literature [93]: the entries associated with the remaining proteins, excluding those within the same class as the test protein and those with reported positive cross-enrichment with the test protein, are taken as the training set. The 40 proteins are split into 6 classes, and the excluded list of training data for each protein follows [4].
Topology-based machine learning models
Our topology-based machine learning model, called TopVS-ML, relies on manually constructed features and utilizes ensemble-of-trees methods. For the complexes with the small molecules (i.e., ligands and decoys) docked to the receptor, the features RB0IBP, RB0CIS, and AB12ES are used (see Section Discussion/Complex-based protein-ligand binding affinity prediction), whereas the features RB012M1S and AB012ES (see Section Discussion/Ligand-based protein-ligand binding affinity prediction) are used for the small molecules. The gradient boosting trees, random forest, and extra trees methods are employed as voters. The class probabilities output by the three methods are averaged to decide the class of the testing samples. The modules GradientBoostingClassifier, RandomForestClassifier, and ExtraTreesClassifier in the scikit-learn package [152] (version 0.17.1) are used. The parameters for the three modules are listed in Table 6.4. TopVS-ML achieves a performance of AUC = 0.83, EF2% = 8.6, EF20% = 3.4. These values are the medians of 10 repeated experiments.
Method  Parameters
GBT     n=2000, s=0.5, cw=100:1, lr=0.01, mf=sqrt
RF      n=2000, cw=balanced_subsample
ET      n=2000, cw=balanced_subsample

Table 6.4: Parameters used in machine learning.
The parameters used for the ensemble-of-trees methods; all other parameters are set to their defaults. GBT: gradient boosting trees. RF: random forest. ET: extra trees. n: n_estimators. s: subsample. cw: class_weight. lr: learning_rate. mf: max_features.
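The voting scheme can be sketched as follows, on toy data. n_estimators is reduced from the 2000 of Table 6.4, and because scikit-learn's GradientBoostingClassifier takes no class_weight argument, the 100:1 weighting would in practice be supplied via sample_weight (omitted here).

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, ExtraTreesClassifier)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy active/decoy labels

# Much smaller n_estimators than the n=2000 of Table 6.4, just to
# show the probability-averaging scheme
voters = [
    GradientBoostingClassifier(n_estimators=50, subsample=0.5,
                               learning_rate=0.01, max_features="sqrt"),
    RandomForestClassifier(n_estimators=50,
                           class_weight="balanced_subsample"),
    ExtraTreesClassifier(n_estimators=50,
                         class_weight="balanced_subsample"),
]
for clf in voters:
    clf.fit(X, y)

# Average the predicted class probabilities of the three voters
proba = np.mean([clf.predict_proba(X) for clf in voters], axis=0)
pred = (proba[:, 1] > 0.5).astype(int)
acc = (pred == y).mean()  # training accuracy of the consensus
```

Ranking compounds by the averaged active-class probability is what feeds the EF and AUC evaluation above.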
Topology-based deep learning model
Our topology-based deep learning model, called TopVS-DL, relies on 1D image-like inputs for protein-compound complexes and manually constructed features for the compounds. The 2D representation used in the binding affinity problem is not used here due to the intractable data size. The manually constructed features for the compounds are RB012M1S and AB012ES as described in Section Discussion/Ligand-based protein-ligand binding affinity prediction. The 1D image-like inputs consist of three parts, all generated by the counts-in-bins method described in Section Feature generation from topological invariants. (1) For the 0th dimensional barcodes from interactive persistent homology of the 36 pairs of atom types ({C, N, O, S} from protein and {C, N, O, S, P, F, Cl, Br, I} from ligand), the interval [0, 25] Å is divided into equal-length subintervals of length 0.25 Å. The barcodes used here are identical to the barcodes in feature RB0IBP. This results in a 1D image-like feature of size 100 × 36. (2) For the 0th dimensional barcodes from interactive persistent homology for electrostatics of the 50 pairs of atom types ({C, N, O, S, H} from protein and {C, N, O, S, P, F, Cl, Br, I, H} from ligand), the parameter interval [0, 1] is divided into equal-length subintervals of length 0.01. The barcodes used are identical to the barcodes in feature RB0CIS. This results in a 1D image-like feature of size 100 × 50. (3) Alpha complex based persistent homology is applied to all carbon atoms and to all heavy atoms. The computation is done on the complex as well as on the protein alone, with a cutoff distance of 12 Å from the ligands. The interval [0, 12] Å is divided into equal-length subintervals of length 0.125 Å. The counts-in-bins method is applied to the 0th, 1st, and 2nd dimensional barcodes. The features are generated from the persistent homology computations of the complex and of the protein; the features for the complex and the differences between the features for the complex and for the protein are finally used. This results in a 1D image-like feature of size 96 × 32. The detailed network architecture is listed in Section Methods/Machine learning algorithms/Deep convolutional neural networks. A consensus model is constructed by taking the average over 25 single models trained independently. TopVS-DL achieves a performance of AUC = 0.81, EF2% = 9.1, EF20% = 3.2.
The final model
Following the idea of taking the average output of different ensemble-of-trees models as the final output in TopVS-ML, we add TopVS-DL as another voter to TopVS-ML to construct a final model, called TopVS. Such a consensus approach averages different models in the hope that they make partially uncorrelated errors that can cancel out when averaged. The performance on each of the 40 protein targets is reported in Table 6.5. We have also generated virtual screening results for AutoDock Vina (ADV) based on the binding free energy computed by ADV and compared them with those of the present TopVS in terms of enrichment factors and the area under the receiver operating characteristic curve (AUC). A comparison of average AUC with those from a large number of methods is given in Table 6.6.
[Table 6.5: EF2%, EF20%, and AUC of AutoDock Vina (ADV) and TopVS on each of the 40 DUD protein targets: ACE, AChE, ADA, ALR2, AmpC, AR, CDK2, COMT, COX1, COX2, DHFR, EGFr, ER agonist, ER antagonist, FGFr1, FXa, GART, GPB, GR, HIVPR, HIVRT, HMGR, HSP90, InhA, MR, NA, P38 MAP, PARP, PDE5, PDGFrb, PNP, PPARg, PR, RXRa, SAHH, SRC, thrombin, TK, trypsin, VEGFr2. Average row: ADV EF2% = 6.9, EF20% = 2.0, AUC = 0.64; TopVS EF2% = 9.5, EF20% = 3.5, AUC = 0.84.]
Table 6.5: Performance on each protein in the DUD dataset.
The median results of 10 repeated runs with diﬀerent random seeds (for the TopVS-ML part) are reported. The best AUC in each row is marked in bold. The left block of AutoDock Vina (ADV) results is acquired from the ADV runs with the binding free energy reported by ADV.
Method            AUC   Ref.
TopVS             0.84
DeepVS-ADV        0.81  [158]
ICM^a             0.79  [137]
NNScore1-ADV^b    0.78  [59]
Glide SP^a        0.77  [47]
DDFA-ALL          0.77  [4]
DDFA-RL           0.76  [4]
NNScore2-ADV^b    0.76  [59]
DDFA-ADV          0.75  [4]
DeepVS-Dock       0.74  [158]
DDFA-AD4          0.74  [4]
Glide HTVS^b      0.73  [59]
Surﬂex^a          0.72  [47]
Glide HTVS        0.72  [47]
ICM               0.71  [137]
RAW-ALL           0.70  [4]
AutoDock Vina^b   0.70  [59]
Surﬂex            0.66  [47]
Rosetta Ligand    0.65  [4]
AutoDock Vina     0.64  [4]
ICM               0.63  [47]
FlexX             0.61  [47]
Autodock4.2       0.60  [4]
PhDOCK            0.59  [47]
Dock4.0           0.55  [47]

Table 6.6: AUC comparison of diﬀerent methods on the DUD dataset.
^a Tuned by expert knowledge. ^b Determined using a diﬀerent data set of decoys.
6.5 Discussion
6.5.1 Ligand-based protein-ligand binding aﬃnity prediction
We conduct several experiments on ligand-based protein-ligand binding aﬃnity prediction in this section, which lead to the ﬁnal models. To examine the strengths and weaknesses of diﬀerent sets of features and models, we ﬁrst show summary statistics of the S1322 data set of 7 protein clusters in Fig. 6.5. The details of the S1322 data set are given in Section Results/Ligand-based protein-ligand binding aﬃnity prediction. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees.
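The evaluation protocol shared by the experiments below (repeated k-fold cross-validation with the median Pearson correlation and RMSE reported) can be sketched as follows. The regressor is pluggable, and the helper names are mine; the actual experiments use gradient boosting trees.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def rmse(a, b):
    """Root-mean-square error (in this chapter, reported in kcal/mol)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def repeated_kfold(X, y, fit_predict, k=10, repeats=20, seed=0):
    """fit_predict(X_train, y_train, X_test) -> predictions for X_test.
    Returns the median Pearson r and median RMSE over the repeats."""
    rng = random.Random(seed)
    rs, errs = [], []
    for _ in range(repeats):
        idx = list(range(len(y)))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        pred = [0.0] * len(y)
        for fold in folds:
            held = set(fold)
            train = [i for i in idx if i not in held]
            out = fit_predict([X[i] for i in train], [y[i] for i in train],
                              [X[i] for i in fold])
            for i, p in zip(fold, out):
                pred[i] = p
        rs.append(pearson(pred, y))
        errs.append(rmse(pred, y))
    return sorted(rs)[len(rs) // 2], sorted(errs)[len(errs) // 2]
```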
Feature vectors for gradient boosting trees.
In this test, Rips complex based and alpha complex based persistent homology computations up to the 2nd dimension are performed for a variety of atom collections with diﬀerent element types using the Euclidean metric and the multi-level distance deﬁned in Eq. (6.1). Two types of features are generated, denoted by F^C, which is a combination of F^C_b, F^C_d, and F^C_p, and F^S, which is a combination of F^S_b, F^S_d, and F^S_p. The construction of the features F^C and F^S is described in Section Feature generation from topological invariants. For sets of 0th dimensional bars, only F^C_d and F^S_d are computed. In each protein cluster, 10-fold or 5-fold cross validation is repeated 20 times for each subset of feature vectors depending on the selected element type. The median Pearson correlation coeﬃcients and the root-mean-square errors (RMSE) in kcal/mol are reported. For the Rips complex, both the level 0 computation with distance matrix M and the level 1 computation with distance matrix M̃1 as deﬁned in Eq. (6.2) are performed.
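As a concrete illustration of the two feature constructions referred to above, a minimal version of counts in bins and barcode statistics applied to (birth, death) bars might look as follows. The exact statistics entering F^C and F^S are specified in the Methods section, so the ones below are representative, not exhaustive.

```python
import statistics

def counts_in_bins(bars, bins):
    """bars: list of (birth, death) pairs; bins: list of (lo, hi) filtration
    intervals. A bar is counted in every bin that its span overlaps."""
    return [sum(1 for b, d in bars if b < hi and d > lo) for lo, hi in bins]

def barcode_stats(bars):
    """Summary statistics of birth, death, and persistence values."""
    feats = []
    for vals in ([b for b, _ in bars],          # births
                 [d for _, d in bars],          # deaths
                 [d - b for b, d in bars]):     # persistences
        feats += [min(vals), max(vals), sum(vals),
                  statistics.mean(vals), statistics.pstdev(vals)]
    return feats
```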
Figure 6.5: Statistics of ligands in the 7 protein clusters in the S1322 dataset. [20]
The average number of heavy atoms of a ligand in each protein cluster is shown in red and the standard deviation of the number of heavy atoms across each protein cluster is shown in blue. The number of ligands in each cluster is given in parentheses.
Barcode space metrics for k-nearest neighbor regression.
The barcodes generated using the Rips complex with distance matrices M and M̃1 are collected, and the distance between each pair of sets of barcodes is measured using the Wasserstein metric d2. Leave-one-out prediction for every sample is performed with k-nearest neighbor regression with k = 3 within each protein cluster based on the Wasserstein metric. The performance of the best performing and the worst performing protein clusters is shown in Fig. 6.6. The better the performance, the closer the lines are to the semicircle.
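The leave-one-out k-nearest-neighbor scheme just described reduces to a few lines once the pairwise Wasserstein distances are precomputed (e.g., with a persistent homology package). The sketch below assumes such a distance matrix and is not the dissertation's code.

```python
def knn_loo_predict(D, y, k=3):
    """Leave-one-out k-nearest-neighbor regression.
    D[i][j]: precomputed barcode-space distance (here, Wasserstein d2)
    between samples i and j; y[i]: binding affinity of sample i."""
    preds = []
    for i in range(len(y)):
        neighbors = sorted((j for j in range(len(y)) if j != i),
                           key=lambda j: D[i][j])[:k]
        preds.append(sum(y[j] for j in neighbors) / k)
    return preds
```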
The experiments done for this section are summarized in Table 6.7.
Performance of multi-component persistent homology.
It can be noticed from Table 6.8 that topological features generated from barcode statistics typically outperform those created from counts in bins. RB012-E-S-GBT and RB012-M1-S-GBT perform similarly in the majority of the protein clusters, whilst RB012-M1-S-GBT, which is based on M̃1, signiﬁcantly outperforms RB012-E-S-GBT, which is based on the Euclidean distance, in protein clusters 3 and 6. To assess in what circumstances multi-level persistent homology improves the original persistent homology characterization of small molecules, we analyze the statistics of the size of the ligands in Fig. 6.5. It turns out that protein cluster 3 has the smallest average number of heavy atoms and protein cluster 6 has the smallest standard deviation of the number of heavy atoms. This observation suggests that in cases where the small molecules are relatively simple and of relatively similar size, multi-level persistent homology is able to enrich the characterization of the small molecules, which further improves the robustness of the model. Such enrichment or improvement over the original persistent homology approach is mainly realized in higher dimensional barcodes, i.e., the 1st and 2nd dimensions. In Table 6.8, the results with IDs 7 through 12 conﬁrm that the 0th dimensional features from the computation with M̃1 are inferior to the results with the Euclidean distance, whilst the 1st and 2nd dimensional features based on M̃1 outperform the best results with the Euclidean distance in most cases.

Figure 6.6: An illustration of similarities between ligands measured by their barcode space Wasserstein distances. [20]
Ligands are ordered according to their binding aﬃnities and are represented as dots on a semicircle. Speciﬁcally, a sample with binding free energy x is plotted at the angle θ = π(Emax − x)/(Emax − Emin), where Emin and Emax are the lowest and the highest energies in the dataset. Each dot is connected with its two nearest neighbors based on their barcode space Wasserstein distances. An optimal prediction would be achieved if the lines stay close to the semicircle. The majority of the connections stay near the boundary of the upper half disk, demonstrating that the barcode space Wasserstein distance reﬂects similarity in function, i.e., the binding aﬃnity in this case. The protein clusters with the best and the worst performance are shown. Left: protein cluster 2. Right: protein cluster 3.

It is interesting to note that although the Wasserstein metric based KNN methods are not as accurate as the GBT approaches, the consensus result obtained by averaging over various predictions with the Wasserstein metric on diﬀerent sets of barcodes is quite accurate.
Robustness of topological learning models.
Certain elements such as Br are very rare in the data sets studied in this work. Considering only the elements of high occurrence would not hurt the performance on the validations performed. However, omitting the low occurrence elements would sacriﬁce the capability of the model to handle new data in which such elements play an important role. Therefore, we decide to keep the rare elements, which results in a large number of features and redundancy among features. For example, the element combinations CBrH and CH will probably deliver the same performance for most of the samples in the data sets studied in this work. To test whether this redundancy degrades the results of the model, the features of one element combination are added to the model at each step and the model is validated with the accumulation of the added features at each step. The performance of the model is measured with the Pearson correlation coeﬃcient and is plotted against the number of element combinations involved in Fig. 6.7. For most cases in Fig. 6.7, the model is robust against the inclusion of more element combinations.
Figure 6.7: Plot of performance against the number of element combinations used. [20]
The topological learning model performance against the number of element combinations involved in feature construction for the 7 protein clusters in S1322. The horizontal axis corresponds to the number of element combinations used for the features. From left to right, one extra element combination is added at each step. The features are then used in the gradient boosting trees method to test whether the model is robust against redundant information. The results related to the alpha complex are marked in red and those for the Rips complex in blue. The median Pearson correlation coeﬃcient between predicted and experimental results over 10-fold cross-validation within each protein cluster, repeated 20 times, is reported.
Experiment: AB012-E-C-GBT
Description: The barcodes are generated using the alpha complex on diﬀerent sets of atoms based on diﬀerent element combinations. The features are constructed using the 0th, 1st, and 2nd dimensional barcodes following the counts in bins method with bins equally dividing the interval [0, 5]. Here 32 diﬀerent element combinations are considered, including {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI, H, CH, NH, OH, SH, CNH, COH, CSH, NOH, NSH, OSH, CNOH, CNSH, COSH, NOSH, CNOSH, CNOSPFClBrIH}. Gradient boosting trees (GBT) with the structured feature matrix are used for this computation.

Experiment: AB012-E-S-GBT
Description: The same barcodes as those used in AB012-E-C-GBT are used. Instead of counts in bins, the barcode statistics method is used to generate features.

Experiment: AB012-E-SS-GBT
Description: The same barcodes as those used in AB012-E-C-GBT are used. The persistence diagram slice and statistics method is used to generate features. A uniform set of bins, obtained by dividing the interval [0, 5] into 10 equal length bins, is used to slice birth, death, and persistence values.

Experiment: RB012-E-S-GBT
Description: Barcodes are generated using the Rips complex with Euclidean distances. The features are generated following the barcode statistics method. Here 36 element combinations are considered, i.e., {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI, H, CH, NH, OH, SH, CNH, COH, CSH, NOH, NSH, OSH, CNOH, CNSH, COSH, NOSH, CNOSH, CNOSPFClBrIH, CCl, CClH, CBr, CBrH}.

Experiment: RB012-M1-S-GBT
Description: The result is obtained with the same setup as RB012-E-S-GBT except that the ﬁrst level enrichment distance matrix M̃1 is used instead of the Euclidean distance.

Experiment: RBn-E-KNN
Description: The nth dimensional barcodes from Rips complex computation with the Euclidean distance are used. K-nearest neighbor (KNN) regression is performed with the Wasserstein metric d2. The leave-one-out validation is performed individually with each element combination and the average prediction over these element combinations is taken as the output result. The element combinations considered are {CNOS, CNOSPFClBrI, NOH, CNO, CNOSPFClBrIH}. These combinations are selected based on their performance in the gradient boosting trees experiments.

Experiment: RBn-M1-KNN
Description: The result is obtained with the same setup as RBn-E-KNN except that the distance matrix M̃n is used instead of the Euclidean distance.

Table 6.7: Experiments for ligand-based protein-ligand binding aﬃnity prediction of 7 protein clusters and 1322 protein-ligand complexes.
ID  Experiment
1   AB012-E-C-GBT
2   AB012-E-S-GBT
3   AB012-E-SS-GBT
4   RB012-E-S-GBT
5   RB012-M1-S-GBT
6   2+5
7   RB0-E-KNN
8   RB1-E-KNN
9   RB2-E-KNN
10  RB0-M1-KNN
11  RB1-M1-KNN
12  RB2-M1-KNN
13  Cons(7+8+9+10+11+12)
14  2+5 (5-fold)

Protein clusters: CL 1 (333), CL 2 (264), CL 3 (219), CL 4 (156), CL 5 (134), CL 6 (122), CL 7 (94), plus the average.
[The per-cluster Pearson correlation coeﬃcients and RMSE values for experiments 1 through 14 could not be unambiguously recovered from the extracted text.]
Table 6.8: Performance of diﬀerent approaches on the S1322 dataset.
Pearson correlation coeﬃcients with RMSE (kcal/mol) in parentheses for binding aﬃnity predictions on the 7 protein clusters (CL) in S1322. In the title row, the numbers in parentheses denote the numbers of ligands in each cluster. The median results of 20 repeated runs are reported for the ensemble-of-trees based methods to account for randomness in the algorithm. For the experiment labels, the ﬁrst letter indicates the complex deﬁnition used, 'A' for alpha complex and 'R' for Rips complex. The second part, starting with 'B' followed by integers, indicates the dimensions of the barcodes used. The third part indicates the distance function used, 'E' for Euclidean and 'M1' for M̃1. For rows 1 through 5, the fourth part shows the way of feature construction, 'C' for counts in bins and 'S' for barcode statistics. The last part indicates the regression technique used, 'GBT' for gradient boosting trees and 'KNN' for k-nearest neighbors. The detailed descriptions of the experiments are given in Table 6.7. Row 6 gives the results using the features of both row 2 and row 5. Row 13 gives the consensus results obtained by taking the average of the predictions of rows 7 through 12. Unless otherwise speciﬁed, all results are obtained from 10-fold cross validations.
6.5.2 Complex-based protein-ligand binding aﬃnity prediction
Having demonstrated the representational power of the present topological learning method for characterizing small molecules, we further examine the method on the task of characterizing protein-ligand complexes. Biologically, we consider the same task, i.e., the prediction of protein-ligand binding aﬃnity, with a diﬀerent approach that is based on the structural information of the protein-ligand complexes. Only gradient boosting trees and deep convolutional neural network algorithms are used in this section. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees.

In the present topological learning study, we use four versions of the PDBBind core sets as our test sets. For each test set, the corresponding reﬁned set, excluding the core set, is used as the training set.
Groups of topological features and their performance in association with GBT.
The experiments for protein-ligand-complex-based binding aﬃnity prediction on the PDBBind datasets are summarized in Table 6.9.
6.5.2.1 Robustness of the GBT algorithm against redundant element combination features and potential overﬁtting.
It is intuitive that combinations of more than 2 element types are able to enrich the representation, especially in the case of higher dimensional barcodes. However, considering combinations of more element types rapidly increases the dimension of the feature space. In a high dimensional feature space, it is almost inevitable that there exist nonessential and redundant features. Additionally, the importance of a feature varies across diﬀerent problems and data sets. Therefore, it is preferable to keep all the potentially important features in a general model which is expected to cover a wide range of situations. To test the robustness of the model against unimportant features, we select a total of 128 element combinations (i.e., all possible paired choices of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in protein and another item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in ligand). The 0th, 1st, and 2nd dimensional barcodes are computed for all combinations using the alpha complex with the Euclidean distance. Features are generated following the barcode statistics method.

Experiment: RB0-I-C
Description: 0th dimensional barcodes from Rips complex computation with the interactive distance matrix based on the Euclidean distance are used. Features are generated following the counts in bins method with bins {[0, 2.5), [2.5, 3), [3, 3.5), [3.5, 4.5), [4.5, 6), [6, 12]}. The element combinations used are all possible paired choices of one item from {C, N, O, S, CN, CO, NO, CNO} in protein and another item from {C, N, O, S, P, F, Cl, Br, I, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS} in ligand, which results in a total of 160 combinations.

Experiment: RB0-I-BP
Description: The persistent homology computation and feature generation are the same as for RB0-I-C. However, the element combinations used are all possible paired choices of one item from {C, N, O, S} in protein and another item from {C, N, O, S, P, F, Cl, Br, I} in ligand, which results in a total of 36 element combinations.

Experiment: RB0-CI-C
Description: 0th dimensional barcodes from Rips complex computation with the interactive distance matrix based on the electrostatics correlation function deﬁned in Eq. (6.8) with the parameter c = 100 are used. The features are generated following the counts in bins method with bins {(0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4], (0.4, 0.5], (0.5, 0.6], (0.6, 0.7], (0.7, 0.8], (0.8, 0.9], (0.9, 1.0)}. The element combinations used are all possible paired choices of one item from {C, N, O, S, H} in protein and another item from {C, N, O, S, P, F, Cl, Br, I, H} in ligand, which results in a total of 50 element combinations.

Experiment: RB0-CI-BS
Description: The barcodes and element combinations are the same as those of RB0-CI-C. The features are generated following the barcode statistics method.

Experiment: AB12-E-S
Description: 1st and 2nd dimensional barcodes from alpha complex computation with the Euclidean distance are used. The element combinations considered are all heavy atoms and all carbon atoms. Features are generated following the barcode statistics method.

Table 6.9: Experiments for protein-ligand-complex-based binding aﬃnity prediction on the PDBBind datasets.
A general model with all the features is generated in the ﬁrst place. The element combinations are then sorted according to their importance scores in the general model. Starting from the most important element combination, one element combination is added to the feature vector at a time, and the resulting feature vector is passed to the machine learning training and testing procedure. The order of adding element combinations is based on their importance scores, so that a less important feature group is added at each step.
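The accumulation procedure can be sketched generically. Here `importances` stands in for the per-combination sums of GBT importance scores and `evaluate` for the train/test run; both are placeholders for the machinery described in the text.

```python
# Hypothetical sketch of the importance-guided feature accumulation test:
# rank element-combination feature groups by their total importance in the
# general model, then re-evaluate while adding one group at a time.

def accumulation_curve(groups, importances, evaluate):
    """groups: {name: list of feature column indices};
    importances: {name: summed importance score from the general model};
    evaluate(columns) -> performance (e.g., Pearson r) using those columns.
    Returns [(group_name, performance_after_adding_it), ...]."""
    ranked = sorted(groups, key=lambda g: -importances[g])
    cols, curve = [], []
    for g in ranked:
        cols += groups[g]              # add one element combination per step
        curve.append((g, evaluate(list(cols))))
    return curve
```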
Fig. 6.8 depicts the changes in the Pearson correlation coeﬃcient and RMSE (kcal/mol) with respect to the increase in element combinations when predicting the four PDBBind core sets. In all cases, the inclusion of the top combinations readily delivers very good models. The behavior of the present method on PDBBind v2007 is quite diﬀerent from that on the other data sets: its performance improves almost monotonically as the number of element combinations increases. In the other three cases, however, the improvement is unsteady. Nevertheless, the performance ﬂuctuates within a small range, which indicates that the present method is reasonably stable against the increase in element combinations. From a diﬀerent perspective, the increase in element combinations might lead to overﬁtting in machine learning. Since the model parameters are ﬁxed before the experiments, this shows that GBT algorithms are not very sensitive to redundant features and are robust against overﬁtting.
Usefulness of more than 2 element types for interactive 0th dimensional barcodes.
While using element combinations with more than 2 element types together with higher dimensional barcodes enriches the characterization of geometry, it remains to assess whether the interactive 0th dimensional characterization beneﬁts from element combinations with more element types. As an example, we denote the interactive 0th dimensional barcodes for carbon and nitrogen atoms from the protein and oxygen atoms from the ligand by B_{CN-O}, the barcodes for carbon atoms from the protein and oxygen atoms from the ligand by B_{C-O}, and the barcodes for nitrogen atoms from the protein and oxygen atoms from the ligand by B_{N-O}. In the persistent homology barcode representation, B_{CN-O} is not strictly the union of B_{C-O} and B_{N-O}.
Figure 6.8: Feature robustness tests on the PDBBind datasets. [20]
The performance of the topological learning model against the number of included element combinations when predicting on the PDBBind core sets and training on the PDBBind reﬁned sets minus the core sets. The 1st and 2nd dimensional barcodes computed with the alpha complex are used. Features are generated following the barcode statistics method. The element combinations are all possible paired choices of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in protein and another item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in ligand, which results in 128 element combinations. The horizontal straight lines represent the performance of the 2D representation with the deep convolutional neural network (row 10 in Table 6.10). The blue and red colors correspond to the Pearson correlation coeﬃcient and RMSE (kcal/mol), respectively. Each experiment is done by training on the reﬁned set minus the core set, with the median result of 20 repeated runs reported.
However, B_{CN-O} might be redundant given B_{C-O} and B_{N-O}. To address this concern, we test features from interactive 0th dimensional barcodes with the 36 element combinations (i.e., {C, N, O, S} for protein and {C, N, O, S, P, F, Cl, Br, I} for ligand) and features for the 160 selected element combinations (i.e., {C, N, O, S, CN, CO, NO, CNO} for protein and {C, N, O, S, P, F, Cl, Br, I, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS} for ligand), which are listed as feature group 2 and feature group 1 in Table 6.10. In all four cases, the features of the 36 combinations (feature group 2) slightly outperform or perform as well as the features of the 160 combinations (feature group 1), suggesting that element combinations with more than 2 element types are redundant given all the combinations with 2 element types in the case of the interactive 0th dimensional characterization.
ID  Experiment     v2007         v2013         v2015         v2016         Average
1   RB0-I-C        0.799 (2.01)  0.741 (2.14)  0.750 (2.11)  0.813 (1.82)  0.776 (2.02)
2   RB0-I-BP       0.816 (1.94)  0.741 (2.13)  0.750 (2.10)  0.825 (1.78)  0.783 (1.99)
3   RB0-CI-C       0.791 (2.05)  0.759 (2.10)  0.738 (2.13)  0.801 (1.87)  0.772 (2.04)
4   RB0-CI-S       0.773 (2.10)  0.762 (2.12)  0.749 (2.13)  0.810 (1.86)  0.774 (2.05)
5   AB12-E-S       0.736 (2.25)  0.709 (2.26)  0.695 (2.27)  0.752 (2.02)  0.723 (2.20)
6   1+4            0.815 (1.95)  0.780 (2.04)  0.774 (2.04)  0.833 (1.76)  0.801 (1.95)
7   2+4            0.806 (1.99)  0.787 (2.04)  0.770 (2.06)  0.834 (1.77)  0.799 (1.97)
8   1+4+5          0.810 (1.98)  0.792 (2.02)  0.786 (2.02)  0.831 (1.76)  0.805 (1.95)
9   2+4+5          0.802 (2.01)  0.796 (2.02)  0.782 (2.04)  0.822 (1.79)  0.801 (1.97)
10  2D-CNN-Alpha   0.787 (2.02)  0.781 (1.98)  0.785 (1.95)  0.837 (1.68)  0.798 (1.91)
11  1D2D-CNN       0.806 (1.95)  0.781 (1.98)  0.799 (1.91)  0.848 (1.64)  0.809 (1.87)

Table 6.10: Performance of diﬀerent protein-ligand complex based approaches on the PDBBind datasets.
Pearson correlation coeﬃcients with RMSE (kcal/mol) in parentheses for predictions by various groups of features on the four PDBBind core sets. The training sets are the PDBBind reﬁned sets minus the core sets of the same version year. Results of the ensemble-of-trees based methods (rows 1 through 9) are the median values of 50 repeated runs to account for randomness in the algorithm. For the deep learning based methods (rows 10 and 11), 100 independent models are generated in the ﬁrst place. A consensus model is built by randomly choosing 50 models out of the 100, and this process is repeated 1000 times with the median reported. The ﬁrst letter indicates the deﬁnition of complex, 'A' for alpha complex and 'R' for Rips complex. The second part indicates the dimensions of the barcodes used. The third part indicates the distance function used, 'I' for M̂ij deﬁned in Eq. (6.3), 'CI' for the one deﬁned in Eq. (6.8), and 'E' for Euclidean. The last part shows the way of feature construction, 'C' for counts in bins, 'S' for barcode statistics, and 'BP' for only pairs of two single elements. The results reported in rows 6 through 9 are obtained by combining the features of the rows with the corresponding numbers.
Importance of atomic charge in electrostatic persistence.
In element speciﬁc persistent homology, atoms of diﬀerent element types are characterized separately, which oﬀers a rough and implicit description of the electrostatics of the system. However, such an implicit treatment of electrostatics may lose important information because atoms behave diﬀerently at diﬀerent oxidation states. Therefore, we explicitly embed atomic charges in the interactive 0th dimensional barcodes as described in Eq. (6.8). The resulting topological features are given as feature group 4 in Table 6.10. It can be seen from Table 6.10 that the combination of feature group 4 with the Euclidean distance based interactive 0th dimensional barcodes (listed as feature groups 6 and 7) generally outperforms the results obtained with only the Euclidean distance based features. This observation suggests that electrostatics plays an important role and should be treated explicitly in the protein-ligand binding problem. Additionally, the inclusion of physical interactions in topological invariants opens a promising new direction in topological analysis.
Relevance of elements that are rare with respect to the data sets.
Since the majority of the samples in both the training and testing sets only contain atoms of element types C, N, O, and H, the performance of the model on samples with rarely occurring elements is hardly reﬂected by the overall performance statistics. For simplicity, we refer to such rarely occurring elements with respect to the data sets simply as rarely occurring elements in the discussion that follows. To assess the aspects of the model that potentially aﬀect the performance on samples containing rarely occurring elements, we pick the samples containing each rarely occurring element from the original testing set as a new testing set. Three experiments are carried out to address two questions: "Are the training samples containing the same rarely occurring element crucial?" and "Are features addressing the rarely occurring element important?". A short answer is yes to both according to the results shown in Fig. 6.9. Speciﬁcally, for each rarely occurring element, excluding the samples containing this element from the training set and excluding the features addressing this element both degrade the results. It is also shown that the exclusion of the samples containing the rarely occurring element leads to much worse results. Since both modiﬁcations of the model deliver worse results, we conclude that including training samples with compositions similar to the test sample is crucial to the success of the model on that speciﬁc test sample. Even though the inclusion of features of more element types or element combinations does not deliver better results on the general testing sets, such features should still be kept in the model in case a sample with a similar element composition comes in as a test sample.
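The three experiments can be phrased as a small ablation harness; every name here is a placeholder for the actual training pipeline, not the dissertation's code.

```python
def rare_element_ablation(train_ids, feature_ids, contains, involves, element, run):
    """contains(sample_id, element) and involves(feature_id, element) are
    membership predicates; run(sample_ids, feature_ids) trains a model and
    returns its test error. Returns the errors for experiments 1, 2, and 3."""
    full = run(train_ids, feature_ids)                                       # Exp. 1
    no_features = run(train_ids,
                      [f for f in feature_ids if not involves(f, element)])  # Exp. 2
    no_samples = run([s for s in train_ids if not contains(s, element)],
                     feature_ids)                                            # Exp. 3
    return full, no_features, no_samples
```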
2D persistence for topological deep convolutional neural networks.
Deep learning is potentially more powerful than many other machine learning algorithms when the data size is suﬃciently large. In the present work, it is natural to construct a 2D topological representation by incorporating the element combination as an additional dimension, resulting in 16 channels as deﬁned in Section Feature generation from topological invariants. Here 128 element combinations (i.e., all possible paired choices of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in protein and another item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in ligand) are used for the 2D analysis. The advantage of introducing this extra dimension with convolutional neural networks is that it prevents unimportant features from interacting with important ones at the lower levels of the model, whilst generally unimportant features are still kept in the model in case they are essential to speciﬁc problems or to a certain portion of the data set.
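Schematically (this is an assumed layout, not the exact one used), the 2D input can be assembled by stacking each element combination's binned features as one image row, with channels separating the diﬀerent feature types:

```python
def build_2d_input(per_combo_features, n_bins, n_channels):
    """per_combo_features: {combination_name: flat list of n_bins * n_channels
    values}. Returns a nested list of shape (n_combos, n_bins, n_channels),
    i.e., an image whose rows index element combinations."""
    image = []
    for combo in sorted(per_combo_features):      # fixed row order
        flat = per_combo_features[combo]
        assert len(flat) == n_bins * n_channels
        image.append([[flat[b * n_channels + c] for c in range(n_channels)]
                      for b in range(n_bins)])
    return image
```

Keeping each element combination in its own row is what lets the first convolutional layers process the combinations independently before they interact deeper in the network.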
As shown in Fig. 6.8, for all the data sets except the PDBBind v2007 set, the 2D topological deep learning with convolutional neural networks performs signiﬁcantly better. The inferior performance of convolutional neural networks on v2007 might be a result of the small data size. Note that the v2007 training set has 1105 protein-ligand complexes, whereas the other training sets have more than 2700 complexes. Consequently, topological deep convolutional neural networks are able to outperform the topological GBT algorithm in predicting the v2013, v2015, and v2016 core sets. Indeed, topological deep convolutional neural networks have advantages in dealing with large data sets.

Figure 6.9: Assessment of performance of the model on samples with elements that are rare in the data sets. [20]
For the four data sets PDBBind v2007, v2013, v2015, and v2016 [119], and for each element, the testing set is the subset of the original core sets with only ligands that contain atoms of the particular element type. The features used are those with ID = 7 in Table 6.10. The reported RMSE is the average taken over the four data sets. Experiment 1: the training set is the original training set and all features are used. Experiment 2: the training set is the original training set and only features that do not involve the particular element are used. Experiment 3: the training set is the original training set excluding the samples that contain atoms of the particular element type, and all features are used. For most of the elements, experiment 1 achieves the best result and experiment 3 yields the worst performance.
6.5.3 Structurebased virtual screening
In our final model TopVS reported in Table 6.6, we use topological descriptors of both protein-compound interactions and of the compounds alone (i.e., ligands and decoys), and take a consensus model on top of several ensemble-of-trees models and a deep learning model. We have also tested the behavior of our topological learning model TopVS-ML using either one of the aforementioned descriptions. The tests are done with TopVS-ML because TopVS-DL is much more time consuming. When only topological descriptors of small molecules are used, which falls into the category of ligand-based virtual screening, an AUC of 0.81 is achieved. For the topological learning model using only the descriptions of protein-ligand interactions, an AUC of 0.77 is achieved. An AUC of 0.83 is obtained with a model combining both sets of descriptors, which is better than either individual performance, suggesting that the two groups of descriptors are complementary to each other and are both important for achieving satisfactory results. The marginal improvement made by protein-compound complexes may be due to varying docking quality. A similar situation was encountered by a deep learning method [158]. For the targets with high-quality results by AutoDock Vina (AUC of ADV > 0.8), the ligand-based features achieve an AUC of 0.81 and the complex-based features achieve an AUC of 0.86. On the other hand, for the targets with low-quality
Target        ADV    LIG    COM    ALL
AR            0.81   0.83   0.93   0.90
COX2          0.86   0.97   0.80   0.97
DHFR          0.82   0.95   0.94   0.96
ER-agonist    0.84   0.69   0.91   0.81
MR            0.82   0.78   0.91   0.89
PPARg         0.82   0.70   0.72   0.72
RXRa          0.95   0.74   0.91   0.79
SAHH          0.80   0.81   0.72   0.84
Average       0.84   0.81   0.86   0.86

Table 6.11: The AUC for AutoDock Vina (ADV), TopVS-ML with only compound features (LIG), TopVS-ML with only protein-compound complex features (COM), and TopVS-ML with all features (ALL). The targets with high-quality results by AutoDock Vina (AUC > 0.8) are reported.
results by AutoDock Vina (AUC of ADV < 0.5), the ligand-based features achieve an AUC of 0.82 and the complex-based features achieve an AUC of 0.74. The results of these cases are listed in Table 6.11 and Table 6.12. This observation suggests that the performance of features describing the interactions and the geometry of protein-compound complexes highly depends on the quality of the docking results.
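The AUC values quoted throughout this section measure how well a score ranks active compounds above decoys. A minimal pure-Python sketch of its pairwise definition, with made-up scores (in practice a library routine such as scikit-learn's roc_auc_score would be used):

```python
def auc(active_scores, decoy_scores):
    """AUC of a ranking: the probability that a randomly chosen active
    compound is scored above a randomly chosen decoy (ties count 1/2).
    This pairwise form is equivalent to the area under the ROC curve."""
    pairs = len(active_scores) * len(decoy_scores)
    wins = sum(1.0 if a > d else 0.5 if a == d else 0.0
               for a in active_scores for d in decoy_scores)
    return wins / pairs

# Made-up screening scores; higher means "predicted active".
actives = [0.9, 0.8, 0.4]
decoys = [0.7, 0.3, 0.2, 0.1]
print(round(auc(actives, decoys), 2))  # 0.92
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect separation of actives from decoys.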
Our model with small-molecule descriptors delivers an AUC of 0.81, which is comparable to the other top-performing methods. The performance of this model is also competitive in the regime of protein-ligand binding affinity prediction based on experimentally solved complex structures, as is shown in Section Discussion/Ligand-based protein-ligand binding affinity prediction. These results suggest that the topology-based small-molecule characterization proposed in this work is potentially useful in other applications involving small molecules, such as predictions of toxicity, solubility, and partition coefficient of small molecules.
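The consensus step of TopVS described earlier in this section can be illustrated with a minimal sketch: the per-compound scores of the base models are simply averaged. The score lists below are hypothetical, and the actual consensus sits on top of several ensemble-of-trees models and a deep learning model:

```python
def consensus_scores(score_lists):
    """Combine per-compound scores from several base models by simple
    averaging, one elementary way to build a consensus prediction."""
    n_models, n_compounds = len(score_lists), len(score_lists[0])
    assert all(len(scores) == n_compounds for scores in score_lists)
    return [sum(scores[i] for scores in score_lists) / n_models
            for i in range(n_compounds)]

# Hypothetical scores from three base models (e.g., GBT, RF, CNN) for four compounds.
gbt = [0.9, 0.2, 0.6, 0.4]
rf = [0.8, 0.3, 0.7, 0.3]
cnn = [0.7, 0.1, 0.8, 0.5]
print([round(s, 2) for s in consensus_scores([gbt, rf, cnn])])  # [0.8, 0.2, 0.7, 0.4]
```

Averaging ranks instead of raw scores is a common variant when the base models produce scores on different scales.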
Target     ADV    LIG    COM    ALL
ACE        0.42   0.85   0.78   0.89
ADA        0.49   0.89   0.89   0.53
AmpC       0.34   0.56   0.37   0.95
FGFr1      0.44   0.97   0.71   0.71
GPB        0.48   0.70   0.69   0.84
NA         0.37   0.79   0.82   0.96
PDGFrb     0.32   0.98   0.90   0.81
Average    0.41   0.82   0.74   0.81

Table 6.12: The AUC for AutoDock Vina (ADV), TopVS-ML with only compound features (LIG), TopVS-ML with only protein-compound complex features (COM), and TopVS-ML with all features (ALL). The targets with low-quality results by AutoDock Vina (AUC < 0.5) are reported.
6.6 Conclusion
Persistent homology is a relatively new branch of algebraic topology and is one of the main tools in topological data analysis. The topological simplification of biomolecular systems was a major motivation of the earlier persistent homology development [62, 216]. Persistent homology has been applied to computational biology [99, 75, 48, 156, 198, 195, 202, 201, 200, 186, 117]. However, the predictive power of primitive persistent homology was limited in early topological learning applications [21]. To address this challenge, we have recently introduced element-specific persistent homology to retain chemical and biological information during the topological abstraction of biomolecules [23, 24, 26]. The resulting topological learning approach offers competitive predictions of protein-ligand binding affinity and mutation-induced protein stability changes. However, persistent homology based approaches for small molecules have not been developed, and their representability and predictive power for the interaction of small molecules with macromolecules have not been extensively studied.
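To make the idea of element-specific persistent homology concrete, the following sketch restricts a structure to atoms of chosen element types and computes the dimension-0 barcode of their Vietoris-Rips filtration. Higher homology dimensions require a persistent homology library such as Ripser or GUDHI, and the elements and coordinates below are made up for illustration:

```python
import math
from itertools import combinations

def betti0_barcode(points):
    """Dimension-0 persistence barcode of a Vietoris-Rips filtration.

    Every point is born at filtration value 0; a component dies when it
    merges with another one, so the finite death values are exactly the
    edge lengths of the Euclidean minimum spanning tree."""
    n = len(points)
    parent = list(range(n))

    def find(i):  # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted((math.dist(p, q), i, j)
                   for (i, p), (j, q) in combinations(enumerate(points), 2))
    deaths = []
    for d, i, j in edges:  # Kruskal's algorithm: each accepted edge kills one component
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(d)
    # One bar per merge, plus one essential bar for the surviving component.
    return [(0.0, d) for d in deaths] + [(0.0, math.inf)]

# Element-specific selection: keep only atoms of chosen element types
# before computing persistence (elements and coordinates are made up).
atoms = [("C", (0.0, 0.0)), ("N", (0.5, 0.0)), ("C", (1.0, 0.0)), ("O", (1.0, 3.0))]
carbon_oxygen = [xyz for element, xyz in atoms if element in ("C", "O")]
print(betti0_barcode(carbon_oxygen))  # [(0.0, 1.0), (0.0, 3.0), (0.0, inf)]
```

Running the same computation for each selected element combination produces the family of element-specific barcodes from which features are extracted.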
The present work further introduces multicomponent persistent homology, multilevel persistent homology and electrostatic persistence for chemical and biological characterization, analysis and modeling. Multicomponent persistent homology takes a combinatorial approach to create possible element-specific topological representations. Multilevel persistent homology allows tailored topological descriptions of any desirable interaction in biomolecules, which is especially useful for small molecules. Electrostatic persistence incorporates partial charges that are essential to biomolecules into topological invariants. These approaches are implemented via the appropriate construction of the distance matrix for filtration. The representation power and reduction power of multicomponent persistent homology, multilevel persistent homology and electrostatic persistence are validated by two databases, namely PDBBind [119] and DUD [93, 135]. PDBBind involves more than 4,000 high-quality protein-ligand complexes and DUD contains nearly 100,000 small compounds. Two classes of problems are used to test the proposed topological methods, namely the prediction of protein-ligand binding affinities and the discrimination of active ligands from decoys (virtual screening).
In both problems, we examine the representability of the proposed topological learning methods on small molecules, which are somewhat more difficult to describe by persistent homology due to their chemical diversity, variability and sensitivity. Additionally, these methods are tested on their ability to handle full protein-ligand complexes. Advanced machine learning methods, including Wasserstein metric based k-nearest neighbors (KNNs), gradient boosting trees (GBT), random forest (RF), extra trees (ET) and deep convolutional neural networks (CNN), are utilized in the present work to facilitate the proposed topological methods, rendering advanced topological learning algorithms for quantitative and qualitative biomolecular predictions. The thorough examination of the method on the prediction of binding affinity for experimentally solved protein-ligand complexes leads to a structure-based virtual screening method, TopVS, which outperforms other methods. The feature sets introduced in this work for small molecules and protein-ligand complexes can be extended to other applications such as 3D-structure based prediction of toxicity, solubility, and partition coefficient for small molecules and complex-structure based prediction of protein-nucleic acid binding and protein-protein binding affinities.
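The distance-matrix construction through which these approaches are implemented can be sketched as follows. The sigmoid weighting below is a hypothetical stand-in for illustration, not the exact electrostatic persistence formula of this work:

```python
import math

def charge_modified_distances(coords, charges, c=1.0):
    """Build a filtration distance matrix that folds partial charges into
    the geometry. With this illustrative sigmoid weight, opposite charges
    shrink the effective distance, like charges stretch it, and neutral
    pairs (weight 1/2) keep the Euclidean distance unchanged."""
    n = len(coords)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            w = 1.0 / (1.0 + math.exp(-c * charges[i] * charges[j]))  # in (0, 1)
            dist[i][j] = dist[j][i] = 2.0 * w * d
    return dist

# Neutral charges reproduce the plain Euclidean distance (here 5.0).
print(charge_modified_distances([(0.0, 0.0), (3.0, 4.0)], [0.0, 0.0])[0][1])  # 5.0
```

The resulting matrix can be supplied to any persistent homology package that accepts distance-matrix input, such as Ripser or GUDHI.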
Chapter 7
Dissertation contribution
The main contributions of this dissertation are listed as follows.
• In Chapter 3, we develop a quantitative and predictive model based on persistent homology and deep learning for protein-ligand binding affinity prediction and the prediction of protein stability change upon mutation. This model achieves top performance on benchmarks where many other methods have been tested. To the best of our knowledge, this work is the first competitive topological data analysis based predictive model in the field of molecular biology.
• In Chapter 4, motivated by embedding physical properties into persistent homology representations of biomolecules, we propose a method called the enriched barcode using cohomology and tools from graph theory. This method applies generally to situations where data come with multiple heterogeneous dimensions.
• In Chapter 5, we develop a persistent homology construction called evolutionary homology, tailored for the analysis of trajectories from coupled dynamical systems. This method is applied to coupled dynamical systems associated with molecular systems and shows competitive results in the modeling of protein flexibility.
• In Chapter 6, we carefully assess the representability of persistent homology for biomolecules. As a result, we propose persistent homology representations specifically designed for small molecules, which we call multilevel persistent homology, as well as representations for macromolecules addressing chemical and biological complexities. The method developed in this chapter also leads to a top-performing tool for structure-based virtual screening.
• In addition to the high performance of this work on benchmarks where the testing data are given to the researchers, we have also achieved top performance in worldwide challenges of blind predictions in drug design, where testing data are hidden from researchers before the prediction results are submitted [141].
• The works described in this dissertation are made accessible by attaching source code to the publications [26, 20]. Several accompanying webservers are made available to users: “TDL-MP” (http://weilab.math.msu.edu/TDL/TDL-MP/) and “TML-MP” (http://weilab.math.msu.edu/TML/TML-MP/) for predictors of protein stability changes upon mutations, and “TDL-BP” (http://weilab.math.msu.edu/TDL/TDL-BP/) and “TML-BP” (http://weilab.math.msu.edu/TML/TML-BP/).
The contents of this dissertation are mostly adapted from the following publications and preprints:
• Zixuan Cang and Guo-Wei Wei. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Computational Biology, 13(7):e1005690, 2017.
• Zixuan Cang and Guo-Wei Wei. Persistent cohomology for data with multicomponent heterogeneous information. Preprint, 2018.
• Zixuan Cang, Elizabeth Munch, and Guo-Wei Wei. Evolutionary homology on coupled dynamical systems. arXiv preprint arXiv:1802.04677, 2018.
• Zixuan Cang, Lin Mu, and Guo-Wei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Computational Biology, 14(1):e1005929, https://doi.org/10.1371/journal.pcbi.1005929, 2018.
This work led to the following publications/preprints that are not discussed in this dissertation:
• Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and Guo-Wei Wei. A topological approach for protein classification. Molecular Based Mathematical Biology, 3(1), 2015.
• Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics, 33(22):3549–3557, 2017.
• Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International Journal for Numerical Methods in Biomedical Engineering, 34(2):e2914, 2018.
The following publications/preprints are also related to this dissertation:
• Rundong Zhao, Zixuan Cang, Yiying Tong, and Guo-Wei Wei. Protein pocket detection via convex hull surface evolution and associated Reeb graph. Accepted by Bioinformatics/Proceedings of ECCB 2018, 2018.
• Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-Wei Wei. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. arXiv preprint arXiv:1804.10647, 2018.
BIBLIOGRAPHY
[1] SAMPL6 challenge. https://drugdesigndata.org/about/sampl6. Accessed: 2018-04-10.
[2] Henry Adams, Andrew Tausz, and Mikael Vejdemo-Johansson. javaPlex: A research software package for persistent (co)homology. In International Congress on Mathematical Software, pages 129–136. Springer, 2014.
[3] Mamiko Arai, Vicky Brandt, and Yuri Dabaghian. The eﬀects of theta precession
on spatial learning and simplicial complex dynamics in a topological model of the
hippocampal spatial map. PLoS Computational Biology, 10(6):e1003651, 2014.
[4] Marcelino Arciniega and Oliver F Lange. Improvement of virtual screening results by docking data feature analysis. Journal of chemical information and modeling, 54(5):1401–1411, 2014.
[5] M. Stuart Armstrong, Garrett M. Morris, Paul W. Finn, Raman Sharma, Loris Moretti, Richard I. Cooper, and W. Graham Richards. ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics. Journal of computer-aided molecular design, 24(9):789–801, 2010.
[6] Hossam M. Ashtawy and Nihar R. Mahapatra. A comparative assessment of ranking accuracies of conventional and machine-learning-based scoring functions for protein-ligand binding affinity prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5):1301–1313, 2012.
[7] Ivet Bahar, Ali Rana Atilgan, and Burak Erman. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding and Design, 2(3):173–181, 1997.
[8] Ulrich Bauer. Ripser: a lean C++ code for the computation of Vietoris-Rips persistence barcodes. Software available at https://github.com/Ripser/ripser, 2017.
[9] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Distributed computation of per
sistent homology. In 2014 proceedings of the sixteenth workshop on algorithm engineer
ing and experiments (ALENEX), pages 31–38. SIAM, 2014.
[10] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Distributed computation of per
sistent homology. In 2014 Proceedings of the Sixteenth Workshop on Algorithm Engi
neering and Experiments (ALENEX), pages 31–38. SIAM, 2014.
[11] Ulrich Bauer, Michael Kerber, Jan Reininghaus, and Hubert Wagner. PHAT: persistent homology algorithms toolbox. Journal of Symbolic Computation, 78:76–90, 2017.
[12] K. Abdulla Bava, M. Michael Gromiha, Hatsuho Uedaira, Koji Kitajima, and Akinori
Sarai. Protherm, version 4.0: thermodynamic database for proteins and mutants.
Nucleic acids research, 32(suppl 1):D120–D121, 2004.
[13] Nathan Bell and Anil N Hirani. PyDEC: software and algorithms for discretization of exterior calculus. ACM Transactions on Mathematical Software (TOMS), 39(1):3, 2012.
[14] Paul Bendich, Herbert Edelsbrunner, and Michael Kerber. Computing robustness and persistence for images. IEEE Transactions on Visualization and Computer Graphics, 16(6):1251–1260, 2010.
[15] James Bergstra, Daniel Yamins, and David D Cox. Making a science of model search:
Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML
(1), 28:115–123, 2013.
[16] Niklas Berliner, Joan Teyra, Recep Çolak, Sebastian Garcia Lopez, and Philip M. Kim. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PloS one, 9(9):e107353, 2014.
[17] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
[18] Jesse Berwald, Marian Gidea, and Mikael VejdemoJohansson. Automatic recognition
and tagging of topologically diﬀerent regimes in dynamical systems. arXiv preprint
arXiv:1312.2482, 2013.
[19] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.
[20] Zixuan Cang, Lin Mu, and GuoWei Wei. Representability of algebraic topology for
biomolecules in machine learning based scoring and virtual screening. Plos Computa
tional Biology, 14(1):e1005929. https://doi.org/10.1371/journal.pcbi.1005929, 2018.
[21] Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and GuoWei Wei. A
topological approach for protein classiﬁcation. Molecular Based Mathematical Biology,
3(1), 2015.
[22] Zixuan Cang, Elizabeth Munch, and GuoWei Wei. Evolutionary homology on coupled
dynamical systems. arXiv preprint arXiv:1802.04677, 2018.
[23] Zixuan Cang and GuoWei Wei. Analysis and prediction of protein folding energy
changes upon mutation by element speciﬁc persistent homology. Bioinformatics,
33(22):3549–3557, 2017.
[24] Zixuan Cang and GuoWei Wei. Integration of element speciﬁc persistent homology and
machine learning for proteinligand binding aﬃnity prediction. International journal
for numerical methods in biomedical engineering, 34(2):e2914, 2018.
[25] Zixuan Cang and GuoWei Wei. Persistent cohomology for data with multicomponent
heterogeneous information. preprint, 2018.
[26] Zixuan Cang and Guowei Wei. TopologyNet: Topology based deep convolutional and
multitask neural networks for biomolecular property predictions. Plos Computational
Biology, 13(7):e1005690, 2017.
[27] Emidio Capriotti, Piero Fariselli, and Rita Casadio. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic acids research, 33(suppl 2):W306–W310, 2005.
[28] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society,
46(2):255–308, 2009.
[29] Gunnar Carlsson. Topological pattern recognition for point cloud data. Acta Numerica,
23:289–368, 2014.
[30] Gunnar Carlsson, Vin De Silva, and Dmitriy Morozov. Zigzag persistent homology and real-valued functions. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 247–256. ACM, 2009.
[31] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the
local behavior of spaces of natural images. International journal of computer vision,
76(1):1–12, 2008.
[32] Gunnar Carlsson and Afra Zomorodian. The theory of multidimensional persistence.
Discrete & Computational Geometry, 42(1):71–93, 2009.
[33] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas J Guibas. Persistence
barcodes for shapes. International Journal of Shape Modeling, 11(02):149–187, 2005.
[34] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.
[35] D. A. Case, J. T. Berryman, R. M. Betz, D. S. Cerutti, T. E. Cheatham III, T. A.
Darden, R. E. Duke, T. J. Giese, H. Gohlke, A. W. Goetz, N. Homeyer, S. Izadi,
P. Janowski, J. Kaus, A. Kovalenko, T. S. Lee, S. LeGrand, P. Li, T. Luchko, R. Luo,
B. Madej, K. M. Merz, G. Monard, P. Needham, H. Nguyen, H. T. Nguyen, I. Omelyan,
A. Onufriev, D. R. Roe, A. Roitberg, R. SalomonFerrer, C. L. Simmerling, W. Smith,
J. Swails, R. C. Walker, J. Wang, R.M. Wolf, X. Wu, D. M. York, and P. A. Kollman.
Amber 2015. University of California, San Francisco, 2015.
[36] Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J Guibas, and Steve Y Oudot. Proximity of persistence modules and their diagrams. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 237–246. ACM, 2009.
[37] Frédéric Chazal, Vin De Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. Springer, 2016.
[38] Tiejun Cheng, Xun Li, Yan Li, Zhihai Liu, and Renxiao Wang. Comparative assess
ment of scoring functions on a diverse test set. Journal of chemical information and
modeling, 49(4):1079–1093, 2009.
[39] Yongwook Choi, Gregory E Sims, Sean Murphy, Jason R Miller, and Agnes P Chan.
Predicting the functional eﬀect of amino acid substitutions and indels. PloS one,
7(10):e46688, 2012.
[40] François Chollet. Keras. https://github.com/fchollet/keras, 2015.
[41] Samir Chowdhury and Facundo Mémoli. Persistent path homology of directed networks. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1152–1169. SIAM, 2018.
[42] Fan RK Chung. Spectral graph theory. Number 92. American Mathematical Soc., 1997.
[43] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.
[44] David Cohen-Steiner, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. Lipschitz functions have Lp-stable persistence. Foundations of computational mathematics, 10(2):127–139, 2010.
[45] David Cohen-Steiner, Herbert Edelsbrunner, and Dmitriy Morozov. Vines and vineyards by updating persistence in linear time. In Proceedings of the twenty-second annual symposium on Computational geometry, pages 119–126. ACM, 2006.
[46] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
[47] Jason B Cross, David C Thompson, Brajesh K Rai, J Christian Baber, Kristi Yi Fan, Yongbo Hu, and Christine Humblet. Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. Journal of chemical information and modeling, 49(6):1455–1474, 2009.
[48] Yuri Dabaghian, Facundo Mémoli, Loren Frank, and Gunnar Carlsson. A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS computational biology, 8(8):e1002581, 2012.
[49] George E Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multitask neural networks
for qsar predictions. arXiv preprint arXiv:1406.1231, 2014.
[50] Guillaume Damiand. Combinatorial maps. In CGAL User and Reference Manual. CGAL Editorial Board, 4.0 edition, 2012. www.cgal.org/Manual/4.0/doc_html/cgal_manual/packages.html#Pkg:CombinatorialMaps.
[51] I. K. Darcy and M. Vazquez. Determining the topology of stable protein-DNA complexes. Biochemical Society Transactions, 41:601–605, 2013.
[52] Bhaskar DasGupta and Jie Liang. Models and Algorithms for Biomolecules and Molecular Networks. John Wiley & Sons, 2016.
[53] Vin De Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson. Dualities in persistent (co)homology. Inverse Problems, 27(12):124003, 2011.
[54] Vin De Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson. Persistent cohomology and circular coordinates. Discrete & Computational Geometry, 45(4):737–759, 2011.
[55] Yves Dehouck, Aline Grosfils, Benjamin Folch, Dimitri Gilis, Philippe Bogaerts, and Marianne Rooman. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics, 25(19):2537–2543, 2009.
[56] Omar NA Demerdash, Michael D. Daily, and Julie C. Mitchell. Structurebased pre
dictive models for allosteric hot spots. PLoS computational biology, 5(10):e1000531,
2009.
[57] Barbara Di Fabio and Claudia Landi. A Mayer-Vietoris formula for persistent homology with an application to shape recognition in the presence of occlusions. Foundations of Computational Mathematics, 11(5):499, 2011.
[58] T. J. Dolinsky, J. E. Nielsen, J. A. McCammon, and N. A. Baker. PDB2PQR: An automated pipeline for the setup, execution, and analysis of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Research, 32:W665–W667, 2004.
[59] Jacob D Durrant, Aaron J Friedman, Kathleen E Rogers, and J Andrew McCammon. Comparing neural-network scoring functions and the state of the art: applications to common library screening. Journal of chemical information and modeling, 53(7):1726–1735, 2013.
[60] Herbert Edelsbrunner. Weighted alpha shapes. University of Illinois at Urbana-Champaign, Department of Computer Science, 1992.
[61] Herbert Edelsbrunner and John Harer. Computational topology: an introduction.
American Mathematical Soc., 2010.
[62] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 454–463. IEEE, 2000.
[63] Matthew D. Eldridge, Christopher W. Murray, Timothy R. Auton, Gaia V. Paolini, and Roger P. Mee. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. Journal of computer-aided molecular design, 11(5):425–445, 1997.
[64] A Evgeniou and Massimiliano Pontil. Multitask feature learning. Advances in neural
information processing systems, 19:41, 2007.
[65] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.
[66] Xin Feng and Yiying Tong. Choking loops on surfaces. IEEE transactions on visualization and computer graphics, 19(8):1298–1306, 2013.
[67] Xin Feng, Kelin Xia, Yiying Tong, and Guo-Wei Wei. Geometric modeling of subcellular structures, organelles, and multiprotein complexes. International Journal for Numerical Methods in Biomedical Engineering, 28(12):1198–1223, 2012.
[68] Alan R Fersht. Dissection of the structure and activity of the tyrosyl-tRNA synthetase by site-directed mutagenesis. Biochemistry, 26(25):8031–8037, 1987.
[69] Lukas Folkman, Bela Stantic, Abdul Sattar, and Yaoqi Zhou. EASE-MM: Sequence-based prediction of mutation-induced stability changes with feature-based multiple models. Journal of molecular biology, 428(6):1394–1405, 2016.
[70] Jerome H Friedman. Greedy function approximation: a gradient boosting machine.
Annals of statistics, pages 1189–1232, 2001.
[71] Jerome H Friedman. Stochastic gradient boosting. Computational Statistics & Data
Analysis, 38(4):367–378, 2002.
[72] Patrizio Frosini. A distance for similarity classes of submanifolds of a euclidean space.
Bulletin of the Australian Mathematical Society, 42(3):407–415, 1990.
[73] Patrizio Frosini and Claudia Landi. Persistent Betti numbers for a noise tolerant shape-based approach to image retrieval. Pattern Recognition Letters, 34(8):863–872, 2013.
[74] Peter Gabriel. Unzerlegbare darstellungen i. Manuscripta mathematica, 6(1):71–103,
1972.
[75] Marcio Gameiro, Yasuaki Hiraoka, Shunsuke Izumi, Miroslav Kramar, Konstantin Mischaikow, and Vidit Nanda. A topological measurement of protein compressibility. Japan Journal of Industrial and Applied Mathematics, 32(1):1–17, 2015.
[76] Marcio Gameiro, Konstantin Mischaikow, and William Kalies. Topological characterization of spatial-temporal chaos. Physical Review E, 70(3):035203, 2004.
[77] Samuel Genheden and Ulf Ryde. The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert opinion on drug discovery, 10(5):449–461, 2015.
[78] Ivan Getov, Marharyta Petukh, and Emil Alexov. SAAFEC: predicting the effect of single point mutations on protein folding free energy using a knowledge-modified MM/PBSA approach. International journal of molecular sciences, 17(4):512, 2016.
[79] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American
Mathematical Society, 45(1):61–75, 2008.
[80] Robert Ghrist and Abubakr Muhammad. Coverage and hole-detection in sensor networks via homology. In Information Processing in Sensor Networks, 2005. IPSN 2005. Fourth International Symposium on, pages 254–260. IEEE, 2005.
[81] Robert W Ghrist. Elementary applied topology. Createspace Seattle, 2014.
[82] Michael K Gilson and Huan-Xiang Zhou. Calculation of protein-ligand binding affinities. Annual review of biophysics and biomolecular structure, 36, 2007.
[83] Nobuhiro Go, Tosiyuki Noguti, and Tetsuo Nishikawa. Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proceedings of the National Academy of Sciences, 80(12):3696–3700, 1983.
[84] Alexander Grigor'yan, Yong Lin, Yuri Muranov, and Shing-Tung Yau. Homologies of path complexes and digraphs. arXiv preprint arXiv:1207.2834, 2012.
[85] Raphael Guerois, Jens Erik Nielsen, and Luis Serrano. Predicting changes in the
stability of proteins and protein complexes: a study of more than 1000 mutations.
Journal of molecular biology, 320(2):369–387, 2002.
[86] Allen Hatcher. Algebraic Topology. Cambridge University Press, 2001.
[87] Jean-Claude Hausmann et al. On the Vietoris-Rips complexes and a cohomology theory for metric spaces. Annals of Mathematics Studies, 138:175–188, 1995.
[88] Christine Heitsch and Svetlana Poznanović. Combinatorial insights into RNA secondary structure. In Discrete and topological models in molecular biology, pages 145–166. Springer, 2014.
[89] Anil N Hirani, Kaushik Kalyanaraman, Han Wang, and Seth Watts. Cohomologous
harmonic cochains. arXiv preprint arXiv:1012.2835, 2010.
[90] Anil Nirmal Hirani. Discrete exterior calculus. PhD thesis, California Institute of
Technology, 2003.
[91] Danijela Horak, Slobodan Maletić, and Milan Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2009(03):P03034, 2009.
[92] Gang Hu, Junzhong Yang, and Wenji Liu. Instability and controllability of linearly
coupled oscillators: Eigenvalue analysis. Phys. Rev. E, 58:4440– 4453, 1998.
[93] Niu Huang, Brian K. Shoichet, and John J. Irwin. Benchmarking sets for molecular
docking. Journal of medicinal chemistry, 49(23):6789–6801, 2006.
[94] ShengYou Huang and Xiaoqin Zou. An iterative knowledgebased scoring function to
predict protein–ligand interactions: I. derivation of interaction potentials. Journal of
computational chemistry, 27(15):1866–1875, 2006.
[95] John J. Irwin and Brian K. Shoichet. ZINC: a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1):177–182, 2005.
[96] George A. Jeffrey. An introduction to hydrogen bonding, volume 32. Oxford university press New York, 1997.
[97] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online].
[98] Tomasz Kaczynski, Konstantin Mischaikow, and Marian Mrozek. Computational homology, volume 157. Springer Science & Business Media, 2006.
[99] Peter M. Kasson, Afra Zomorodian, Sanghyun Park, Nina Singhal, Leonidas J. Guibas,
and Vijay S. Pande. Persistent voids: a new structural metric for membrane fusion.
Bioinformatics, 23(14):1753–1759, 2007.
[100] Elizabeth H Kellogg, Andrew LeaverFay, and David Baker. Role of conformational
sampling in computing mutationinduced changes in protein structure and stability.
Proteins: Structure, Function, and Bioinformatics, 79(3):830–838, 2011.
[101] Michael Kerber, Dmitriy Morozov, and Arnur Nigmetov. Geometry helps to compare
persistence diagrams. Journal of Experimental Algorithmics (JEA), 22:1–4, 2017.
[102] Firas A Khasawneh and Elizabeth Munch. Exploring equilibria in stochastic delay
diﬀerential equations using persistent homology. In ASME 2014 International Design
Engineering Technical Conferences and Computers and Information in Engineering
Conference, pages V008T11A034–V008T11A034. American Society of Mechanical En
gineers, 2014.
[103] Firas A. Khasawneh and Elizabeth Munch. Chatter detection in turning using persis
tent homology. Mechanical Systems and Signal Processing, 70:527–541, 2016.
[104] Firas A. Khasawneh and Elizabeth Munch. Utilizing topological data analysis for
studying signals of timedelay systems. In Time Delay Systems, pages 93–106. Springer,
2017.
[105] Sarah L. Kinnings, Nina Liu, Peter J. Tonge, Richard M. Jackson, Lei Xie, and Philip E. Bourne. A machine learning-based method to improve docking scoring functions and its application to drug repurposing. Journal of chemical information and modeling, 51(2):408–419, 2011.
[106] Miroslav Kramár, Rachel Levanger, Jeffrey Tithof, Balachandra Suri, Mu Xu, Mark Paul, Michael F Schatz, and Konstantin Mischaikow. Analysis of Kolmogorov flow and Rayleigh-Bénard convection using persistent homology. Physica D: Nonlinear Phenomena, 334:82–98, 2016.
[107] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[108] Brett M. Kroncke, Amanda M. Duran, Jeﬀrey L. Mendenhall, Jens Meiler, Jeﬀrey D.
Blume, and Charles R. Sanders. Documentation of an imperative to improve methods
for predicting membrane protein stability. Biochemistry, 55(36):5002–5009, 2016.
[109] Tugba G Kucukkal, Marharyta Petukh, Lin Li, and Emil Alexov. Structural and
physicochemical eﬀects of disease and nondisease nssnps on proteins. Current opinion
in structural biology, 32:18–24, 2015.
[110] Yann LeCun, Yoshua Bengio, and Geoﬀrey Hinton. Deep learning. Nature,
521(7553):436–444, 2015.
[111] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[112] Hyekyoung Lee, Hyejin Kang, Moo K Chung, BungNyun Kim, and Dong Soo Lee.
Persistent brain network homology from the perspective of dendrogram. IEEE trans
actions on medical imaging, 31(12):2267–2277, 2012.
[113] Michael Lesnick and Matthew Wright. Interactive visualization of 2-D persistence modules. arXiv preprint arXiv:1512.00180, 2015.
[114] Hongjian Li, KwongSak Leung, ManHon Wong, and Pedro J Ballester. Substituting
random forest for multiple linear regression improves binding aﬃnity prediction of
scoring functions: Cyscore as a case study. BMC bioinformatics, 15(1):291, 2014.
[115] Hongjian Li, KwongSak Leung, ManHon Wong, and Pedro J Ballester. Improving
autodock vina using random forest: the growing accuracy of binding aﬃnity prediction
by the eﬀective exploitation of larger data sets. Molecular informatics, 34(23):115–126,
2015.
[116] Hongjian Li, KwongSak Leung, ManHon Wong, and Pedro J. Ballester. Lowquality
structural and interaction data improves binding aﬃnity prediction via random forest.
Molecules, 20(6):10947–10962, 2015.
[117] Beibei Liu, Bao Wang, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. ESES: software for Eulerian solvent excluded surface. Journal of computational chemistry, 38(7):446–466, 2017.
[118] Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 339–348. AUAI Press, 2009.
[119] Zhihai Liu, Yan Li, Li Han, Jie Li, Jie Liu, Zhixiong Zhao, Wei Nie, Yuchen Liu, and Renxiao Wang. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31(3):405–412, 2014.
[120] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of chemical information and modeling, 53(7):1563–1575, 2013.
[121] A. D. MacKerell Jr., D. Bashford, M. Bellot, R. L. Dunbrack Jr., J. D. Evanseck, M. J. Field, S. Fischer, J. Gao, H. Guo, S. Ha, D. Joseph-McCarthy, L. Kuchnir, K. Kuczera, F. T. K. Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prodhom, W. E. Reiher III, B. Roux, M. Schlenkrich, J. C. Smith, R. Stote, J. Straub, M. Watanabe, J. Wiorkiewicz-Kuczera, D. Yin, and M. Karplus. All-atom empirical potential for molecular modeling and dynamics studies of proteins. Journal of Physical Chemistry B, 102(18):3586–3616, 1998.
[122] Clément Maria, Jean-Daniel Boissonnat, Marc Glisse, and Mariette Yvinec. The Gudhi library: Simplicial complexes and persistent homology. In International Congress on Mathematical Software, pages 167–174. Springer, 2014.
[123] JL Martinez and F Baquero. Mutation frequencies and antibiotic resistance. Antimicrobial agents and chemotherapy, 44(7):1771–1777, 2000.
[124] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.
[125] Marvin Minsky and Seymour A Papert. Perceptrons: an introduction to computational
geometry. MIT press, 2017.
[126] Konstantin Mischaikow, Marian Mrozek, J Reiss, and Andrzej Szymczak. Construction of symbolic dynamics from experimental time series. Physical Review Letters, 82(6):1144, 1999.
[127] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient computation of persistent homology. Discrete & Computational Geometry, 50(2):330–353, 2013.
[128] Maria A Miteva, Frederic Guyon, and Pierre Tufféry. Frog2: Efficient 3D conformation ensemble generator for small compounds. Nucleic acids research, 38(suppl 2):W622–W627, 2010.
[129] Dmitriy Morozov. Dionysus. http://www.mrzv.org/software/dionysus/, 2015.
[130] Garrett M. Morris, Ruth Huey, William Lindstrom, Michel F. Sanner, Richard K. Belew, David S. Goodsell, and Arthur J. Olson. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of computational chemistry, 30(16):2785–2791, 2009.
[131] Ingo Muegge and Yvonne C Martin. A general and fast scoring function for protein-ligand interactions: a simplified potential approach. Journal of medicinal chemistry, 42(5):791–804, 1999.
[132] Daniel Müllner and Aravindakshan Babu. Python Mapper: An open-source toolchain for data exploration, analysis and visualization. http://danifold.net/mapper, 2013.
[133] Elizabeth Munch. Applications of persistent homology to time varying systems. PhD
thesis, 2013.
[134] Elizabeth Munch. A user's guide to topological data analysis. Journal of Learning Analytics, 4(2):47–61, 2017.
[135] Michael M. Mysinger and Brian K. Shoichet. Rapid context-dependent ligand desolvation in molecular docking. Journal of chemical information and modeling, 50(9):1561–1573, 2010.
[136] Vidit Nanda. Perseus: the persistent homology software. Software available at http://www.sas.upenn.edu/~vnanda/perseus, 2012.
[137] Marco AC Neves, Maxim Totrov, and Ruben Abagyan. Docking and scoring with ICM: the benchmarking results and strategies for improvement. Journal of computer-aided molecular design, 26(6):675–686, 2012.
[138] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.
[139] Duc D. Nguyen and Guo-Wei Wei. The impact of surface area, volume, curvature, and Lennard-Jones potential to solvation modeling. Journal of computational chemistry, 38(1):24–36, 2017.
[140] Duc D. Nguyen, Tian Xiao, Menglun Wang, and Guo-Wei Wei. Rigidity strengthening: A mechanism for protein–ligand binding. Journal of chemical information and modeling, 57(7):1715–1721, 2017.
[141] Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-Wei Wei. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. arXiv preprint arXiv:1804.10647, 2018.
[142] Monica Nicolau, Arnold J Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.
[143] Jessica L Nielson, Jesse Paquette, Aiwen W Liu, Cristian F Guandique, C Amy Tovar, Tomoo Inoue, Karen-Amanda Irvine, John C Gensel, Jennifer Kloke, Tanya C Petrossian, et al. Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury. Nature communications, 6:8581, 2015.
[144] Ippei Obayashi. Volume optimal cycle: Tightest representative cycle of a generator on
persistent homology. arXiv preprint arXiv:1712.05103, 2017.
[145] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis. The Journal of chemical physics, 140(23):06B617 1, 2014.
[146] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Communication: Capturing protein multiscale thermal fluctuations, 2015.
[147] Angel R. Ortiz, M. Teresa Pisabarro, Federico Gago, and Rebecca C. Wade. Prediction of drug binding affinities by comparative binding energy analysis. Journal of medicinal chemistry, 38(14):2681–2691, 1995.
[148] Edward Ott, Celso Grebogi, and James A. Yorke. Controlling chaos. Physical review
letters, 64(11):1196, 1990.
[149] Steve Y. Oudot. Persistence theory: from quiver representations to data analysis, volume 209. American Mathematical Society, Providence, RI, 2015.
[150] Deepti Pachauri, Chris Hinrichs, Moo K Chung, Sterling C Johnson, and Vikas Singh. Topology-based kernels with application to inference problems in Alzheimer's disease. IEEE transactions on medical imaging, 30(10):1760–1770, 2011.
[151] Jun-Koo Park, Robert Jernigan, and Zhijun Wu. Coarse grained normal mode analysis vs. refined Gaussian network model for protein residue-level structural fluctuations. Bulletin of mathematical biology, 75(1):124–160, 2013.
[152] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[153] Yunhui Peng, Lexuan Sun, Zhe Jia, Lin Li, and Emil Alexov. Predicting protein–DNA binding free energy change upon missense mutations using modified MM/PBSA approach: SAMPDI webserver. Bioinformatics, 34(5):779–786, 2018.
[154] Jose A Perea. Persistent homology of toroidal sliding window embeddings. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 6435–6439. IEEE, 2016.
[155] Jose A Perea. Multiscale projective coordinates via persistent cohomology of sparse filtrations. Discrete & Computational Geometry, 59(1):175–225, 2018.
[156] Jose A Perea, Anastasia Deckard, Steve B Haase, and John Harer. SW1PerS: Sliding windows and 1-persistence scoring; discovering periodicity in gene expression time series data. BMC bioinformatics, 16(1):257, 2015.
[157] Jose A Perea and John Harer. Sliding windows and persistence: An application of
topological methods to signal analysis. Foundations of Computational Mathematics,
15(3):799–838, 2015.
[158] Janaina Cruz Pereira, Ernesto Raul Caffarena, and Cicero Nogueira dos Santos. Boosting docking-based virtual screening with deep learning. Journal of chemical information and modeling, 56(12):2495–2506, 2016.
[159] Douglas EV Pires, David B. Ascher, and Tom L. Blundell. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic acids research, 42(W1):W314–W319, 2014.
[160] Xose S Puente, Luis M Sánchez, Christopher M Overall, and Carlos López-Otín. Human and mouse proteases: a comparative genomic approach. Nature Reviews Genetics, 4(7):544, 2003.
[161] Lijun Quan, Qiang Lv, and Yang Zhang. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics, 32(19):2936–2946, 2016.
[162] V Robins, JD Meiss, and E Bradley. Computing connectedness: Disconnectedness and discreteness. Physica D: Nonlinear Phenomena, 139(3-4):276–300, 2000.
[163] Vanessa Robins. Towards computing homology from finite approximations. In Topology proceedings, volume 24, pages 503–532, 1999.
[164] Vanessa Robins, James D Meiss, and Elizabeth Bradley. Computing connectedness:
An exercise in computational topology. Nonlinearity, 11(4):913, 1998.
[165] Michael Robinson. Topological signal processing. Springer, 2016.
[166] Lior Rokach and Oded Maimon. Top-down induction of decision trees classifiers: a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476–487, 2005.
[167] Erik Rybakken, Nils Baas, and Benjamin Dunn. Decoding of neural data using cohomological learning. arXiv preprint arXiv:1711.07205, 2017.
[168] Tamar Schlick and Wilma K Olson. Trefoil knotting revealed by molecular dynamics simulations of supercoiled DNA. Science, 257(5073):1110–1115, 1992.
[169] X Shi and P Koehl. Geometry and topology for modeling biomolecular surfaces. Far
East J Applied Math, 50:1–34, 2011.
[170] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[171] Gurjeet Singh, Facundo Mémoli, and Gunnar E Carlsson. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In SPBG, pages 91–100, 2007.
[172] Gurjeet Singh, Facundo Mémoli, Tigran Ishkhanov, Guillermo Sapiro, Gunnar Carlsson, and Dario L Ringach. Topological analysis of population activity in visual cortex. Journal of vision, 8(8):11–11, 2008.
[173] Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
[174] Bernadette J. Stolz, Heather A. Harrington, and Mason A. Porter. Persistent homology of time-dependent functional networks constructed from coupled time series. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(4):047410, 2017.
[175] DW Sumners. Knot theory and DNA. In Proceedings of Symposia in Applied Mathematics, volume 45, pages 39–72, 1992.
[176] Akihiro Takiyama, Takashi Teramoto, Hiroaki Suzuki, Katsushige Yamashiro, and Shinya Tanaka. Persistent homology index as a robust quantitative measure of immunohistochemical scoring. Scientific reports, 7(1):14002, 2017.
[177] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.
[178] Christopher J. Tralie and Jose A. Perea. (Quasi)periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2):1049–1077, 2018.
[179] Oleg Trott and Arthur J Olson. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):455–461, 2010.
[180] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, and Sepp Hochreiter. Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015.
[181] Mikael Vejdemo-Johansson, Florian T Pokorny, Primoz Skraba, and Danica Kragic. Cohomological learning of periodic motion. Applicable Algebra in Engineering, Communication and Computing, 26(1-2):5–26, 2015.
[182] Hans FG Velec, Holger Gohlke, and Gerhard Klebe. DrugScoreCSD: knowledge-based scoring function derived from small molecule crystal data with superior recognition rate of near-native ligand poses and better affinity prediction. Journal of medicinal chemistry, 48(20):6296–6303, 2005.
[183] G Verkhivker, K Appelt, ST Freer, and JE Villafranca. Empirical free energy calculations of ligand-protein crystallographic complexes. I. Knowledge-based ligand-protein interaction potentials applied to the prediction of human immunodeficiency virus 1 protease binding affinity. Protein Engineering, Design and Selection, 8(7):677–691, 1995.
[184] Izhar Wallach, Michael Dzamba, and Abraham Heifets. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015.
[185] Stéfan van der Walt, S Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.
[186] Bao Wang and Guo-Wei Wei. Object-oriented persistent homology. Journal of computational physics, 305:276–299, 2016.
[187] Bao Wang, Zhixiong Zhao, Duc D Nguyen, and Guo-Wei Wei. Feature functional theory-binding predictor (FFT-BP) for the blind prediction of binding free energies. Theoretical Chemistry Accounts, 136(4):55, 2017.
[188] Bao Wang, Zhixiong Zhao, and Guo-Wei Wei. Automatic parametrization of non-polar implicit solvent models for the blind prediction of solvation free energies. The Journal of chemical physics, 145(12):124110, 2016.
[189] Changhao Wang, D'Artagnan Greene, Li Xiao, Ruxi Qi, and Ray Luo. Recent developments and applications of the MM/PBSA method. Frontiers in molecular biosciences, 4:87, 2017.
[190] Renxiao Wang, Luhua Lai, and Shaomeng Wang. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. Journal of computer-aided molecular design, 16(1):11–26, 2002.
[191] Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, and Jinbo Xu. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS computational biology, 13(1):e1005324, 2017.
[192] Guo-Wei Wei, Meng Zhan, and Choy Heng Lai. Tailoring wavelets for chaos control. Phys. Rev. Lett., 89:284103, 2002.
[193] Catherine L. Worth, Robert Preissner, and Tom L. Blundell. SDM: a server for predicting effects of mutations on protein stability and malfunction. Nucleic acids research, 39(suppl 2):W215–W222, 2011.
[194] Chengyuan Wu, Shiquan Ren, Jie Wu, and Kelin Xia. Weighted (co)homology and weighted Laplacian. arXiv preprint arXiv:1804.06990, 2018.
[195] Kelin Xia, Xin Feng, Yiying Tong, and Guo-Wei Wei. Persistent homology for the quantitative prediction of fullerene stability. Journal of computational chemistry, 36(6):408–422, 2015.
[196] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models: flexibility and rigidity. The Journal of chemical physics, 139(19):11B614 1, 2013.
[197] Kelin Xia and Guo-Wei Wei. Molecular nonlinear dynamics and protein thermal uncertainty quantification. Chaos: An Interdisciplinary Journal of Nonlinear Science, 24:013103, 2014.
[198] Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure, flexibility, and folding. International journal for numerical methods in biomedical engineering, 30(8):814–844, 2014.
[199] Kelin Xia and Guo-Wei Wei. Multidimensional persistence in biomolecular data. Journal of computational chemistry, 36(20):1502–1520, 2015.
[200] Kelin Xia and Guo-Wei Wei. Persistent topology for cryo-EM data analysis. International Journal for Numerical Methods in Biomedical Engineering, 31(8), 2015.
[201] Kelin Xia, Zhixiong Zhao, and Guo-Wei Wei. Multiresolution persistent homology for excessively large biomolecular datasets. The Journal of chemical physics, 143(13):10B603 1, 2015.
[202] Kelin Xia, Zhixiong Zhao, and Guo-Wei Wei. Multiresolution topological simplification. Journal of Computational Biology, 22(9):887–891, 2015.
[203] Zhexin Xiang and Barry Honig. Extending the accuracy limits of prediction for side-chain conformations. Journal of molecular biology, 311(2):421–430, 2001.
[204] Lee-Wei Yang and Choon-Peng Chng. Coarse-grained models reveal functional dynamics: I. Elastic network models - theories, comparisons and perspectives. Bioinformatics and Biology Insights, 2:BBI-S460, 2008.
[205] Yang Yang, Biao Chen, Ge Tan, Mauno Vihinen, and Bairong Shen. Structure-based prediction of the effects of a missense variant on protein stability. Amino Acids, 44(3):847–855, 2013.
[206] Yuan Yao, Jian Sun, Xuhui Huang, Gregory R Bowman, Gurjeet Singh, Michael Lesnick, Leonidas J Guibas, Vijay S Pande, and Gunnar Carlsson. Topological methods for exploring low-density states in biomolecular folding pathways. The Journal of chemical physics, 130(14):04B614, 2009.
[207] Shuangye Yin, Lada Biedermannova, Jiri Vondrasek, and Nikolay V Dokholyan. MedusaScore: an accurate force field-based scoring function for virtual drug screening. Journal of chemical information and modeling, 48(8):1656–1662, 2008.
[208] Piotr Zgliczynski and Konstantin Mischaikow. Rigorous numerics for partial differential equations: The Kuramoto-Sivashinsky equation. Foundations of Computational Mathematics, 1(3):255–288, 2001.
[209] Shunhong Zhang, Qian Wang, Yoshiyuki Kawazoe, and Puru Jena. Three-dimensional metallic boron nitride. Journal of the American Chemical Society, 135(48):18216–18221, 2013.
[210] Zhe Zhang, Maria A Miteva, Lin Wang, and Emil Alexov. Analyzing effects of naturally occurring missense mutations. Computational and mathematical methods in medicine, 2012, 2012.
[211] Zheng Zheng and Kenneth M Merz Jr. Ligand identification scoring algorithm (LISA). Journal of chemical information and modeling, 51(6):1296–1306, 2011.
[212] Zheng Zheng, Melek N Ucisik, and Kenneth M Merz. The movable type method applied
to protein–ligand binding. Journal of chemical theory and computation, 9(12):5526–
5538, 2013.
[213] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.
[214] Jiayu Zhou, Jianhui Chen, and Jieping Ye. MALSAR: Multi-task learning via structural regularization. Arizona State University, 21, 2011.
[215] Afra Zomorodian. Fast construction of the Vietoris-Rips complex. Computers & Graphics, 34(3):263–271, 2010.
[216] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete &
Computational Geometry, 33(2):249–274, 2005.