ALGEBRAIC TOPOLOGY AND MACHINE LEARNING FOR BIOMOLECULAR MODELING

By Zixuan Cang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics - Doctor of Philosophy

2018

ABSTRACT

ALGEBRAIC TOPOLOGY AND MACHINE LEARNING FOR BIOMOLECULAR MODELING

By Zixuan Cang

Data are expanding at an unprecedented speed in both quantity and size. Topological data analysis provides excellent tools for analyzing high-dimensional and highly complex data. Inspired by the ability of topological data analysis to characterize data robustly and at multiple scales, and motivated by the demand for practical predictive tools in computational biology and biomedical research, this dissertation extends the capability of persistent homology toward quantitative and predictive data analysis tools, with an emphasis on biomolecular systems. Although persistent homology is almost parameter free, careful treatment is still needed to obtain practically useful prediction models for realistic systems. This dissertation carefully assesses the representability of persistent homology for biomolecular systems and introduces a collection of characterization tools for both macromolecules and small molecules, focusing on intra- and inter-molecular interactions, chemical complexity, electrostatics, and geometry. The representations are then coupled with deep learning and machine learning methods for several problems in drug design and biophysical research.

In real-world applications, data often come with heterogeneous dimensions and components. For example, in addition to location, atoms of biomolecules can also be labeled with chemical types, partial charges, and atomic radii. While persistent homology is powerful in analyzing the geometry of data, it lacks the ability to handle such non-geometric information. Based on cohomology, we introduce a method that attaches the non-geometric information to the topological invariants in persistent homology analysis. This method is useful not only for biomolecules but also in general situations where the data carry both geometric and non-geometric information.

In addition to describing biomolecular systems as a static frame, we are often interested in the dynamics of the systems. An efficient approach is to assign an oscillator to each atom and study the coupled dynamical system induced by atomic interactions. To this end, we propose a persistent homology based method for the analysis of the trajectories resulting from the coupled dynamical system.

The methods developed in this dissertation have been applied to several problems, namely, prediction of protein stability changes upon mutation, protein-ligand binding affinity prediction, virtual screening, and protein flexibility analysis. The tools have shown top performance in both commonly used validation benchmarks and community-wide blind prediction challenges in drug design.

Copyright by ZIXUAN CANG 2018

To my parents, for their love.

ACKNOWLEDGMENTS

I would particularly like to thank my advisor Professor Guowei Wei for his enlightening guidance and tremendous support, which have shaped the results in this thesis. His passion for research, visionary insights, and bold ideas have greatly encouraged me and influenced my path. I want to thank Professor Elizabeth Munch and Professor Yiying Tong for their mathematical instruction, fruitful discussions, and encouragement.
The interdisciplinary research experience I obtained in collaboration with Professor Ke Dong, Professor Heedeok Hong, and Professor Jian Hu is invaluable to the application part of this thesis. I would like to thank Professor Gunnar Carlsson for useful discussions and support. In addition, I want to express my gratitude to Professor Peter Bates and Professor Changyi Wang for their guidance, to Dr. Lin Mu for offering me the opportunity to collaborate at ORNL, and to Professor Di Liu for serving on my committee. I thank my undergraduate advisors Professor Zhong Tan and Professor Jianyong Wang at Xiamen University for motivating and supporting me. I would also like to thank my friends and colleagues who have supported me and enriched my life during my study here. Finally yet importantly, I would like to thank my parents Wei Shi and Ping Cang for their unchanging love and trust, and my girlfriend Yiqing Yang for her support and understanding throughout my research career.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
Chapter 1 Introduction
   1.1 Applied algebraic topology
   1.2 Machine learning and deep learning
   1.3 Biomolecular modeling
   1.4 Motivation
   1.5 Outline
Chapter 2 Background
   2.1 Applied algebraic topology
      2.1.1 Simplicial homology
      2.1.2 Filtration and persistence
      2.1.3 Barcode space metrics
   2.2 Coupled dynamical systems
      2.2.1 Oscillators and coupling
      2.2.2 Stability and controllability
   2.3 Machine learning
      2.3.1 K-nearest neighbor algorithm
      2.3.2 Decision tree
      2.3.3 Ensemble of trees
      2.3.4 Deep learning
   2.4 Biomolecular modeling
      2.4.1 Proteins and small molecules
      2.4.2 Physical modeling
      2.4.3 Sequence tools
Chapter 3 TopologyNet: Deep convolutional neural networks based on topology for biomolecular property predictions
   3.1 Introduction
   3.2 Methods
      3.2.1 Persistent homology
      3.2.2 Topological representation of biomolecules
      3.2.3 Neuron for persistence barcode
      3.2.4 Multichannel topological convolutional neural network
   3.3 Results
      3.3.1 Deep learning prediction of protein-ligand binding affinities
      3.3.2 Deep learning prediction of protein folding free energy changes upon mutation
      3.3.3 Multi-task deep learning prediction of membrane protein mutation impacts
   3.4 Discussion and conclusion
Chapter 4 Persistent cohomology for data with heterogeneous dimensions
   4.1 Introduction
   4.2 Methods
      4.2.1 Cohomology
      4.2.2 Smoothed cocycle
      4.2.3 Enriched persistent barcode
      4.2.4 Preprocessing of the input function
      4.2.5 Modified Wasserstein distance
   4.3 Examples and results
      4.3.1 A minimalist example
      4.3.2 Example datasets
      4.3.3 Wasserstein distance based similarity
      4.3.4 Analysis of molecules
      4.3.5 An application to protein-ligand binding
   4.4 Discussion and conclusion
Chapter 5 Evolutionary homology for coupled dynamical systems
   5.1 Introduction
   5.2 Methods
      5.2.1 Coupled dynamical systems
         5.2.1.1 Oscillators and coupling
         5.2.1.2 Stability and controllability
         5.2.1.3 Topological learning
      5.2.2 Evolutionary homology (EH) and the EH barcodes
         5.2.2.1 Filtration function defined for coupled dynamical systems
         5.2.2.2 Definition of evolutionary homology
      5.2.3 Protein residue flexibility analysis
   5.3 Results
      5.3.1 Disordered and flexible protein regions
      5.3.2 Protein B-factor prediction
   5.4 Conclusion
Chapter 6 Topological characterization of static macrobiomolecules and small molecules
   6.1 Introduction
   6.2 Biological considerations
   6.3 Methods
      6.3.1 Element specific persistent homology
      6.3.2 Construction of distance matrix
         6.3.2.1 Multi-level persistent homology
         6.3.2.2 Interactive persistent homology
         6.3.2.3 Correlation function based persistent homology
         6.3.2.4 Electrostatic persistence
      6.3.3 Feature generation from topological invariants
      6.3.4 Machine learning algorithms
   6.4 Results
      6.4.1 Ligand based protein-ligand binding affinity prediction
      6.4.2 Complex based protein-ligand binding affinity prediction
      6.4.3 Structure-based virtual screening
   6.5 Discussion
      6.5.1 Ligand based protein-ligand binding affinity prediction
      6.5.2 Complex based protein-ligand binding affinity prediction
         6.5.2.1 Robustness of GBT algorithm against redundant element combination features and potential overfitting
      6.5.3 Structure-based virtual screening
   6.6 Conclusion
Chapter 7 Dissertation contribution
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Topological representations of protein-ligand complexes.
Table 3.2: Topological representations for the protein mutation problem.
Table 3.3: Performance comparisons of TNet-BP and other methods.
Table 3.4: Performance comparisons of TNet-MP and other methods.
Table 3.5: Performance comparisons of TNet-MMP and other methods.
Table 4.1: Candidate values for hyper-parameters of the gradient boosting trees model.
Table 4.2: The predictor performance is evaluated by training on the PDBBind refined set excluding the core set, and testing on the core set of a given year's version. The median Pearson's correlation coefficient (root mean squared error) among 10 repeated experiments is reported.
Table 5.1: The averaged Pearson correlation coefficients (RP) between the computed values (blind prediction for the topological features and regression for the rest of the models) and the experimental B-factors for a set of 364 proteins [146]. Left: prediction RPs based on EH barcodes. Right: a comparison of the RPs of predictions from different methods. Here, EH is the linear regression using EH∞,0, EH∞,1, EH1,0, EH1,1, EH2,0, and EH2,1 within each protein. For a few large and multi-chain proteins (i.e., 1F8R, 1H6V, 1KMM, 2D5W, 3HHP, 1QKI, and 2Q52), to reduce the computational time and as a good approximation, we compute their EH barcodes on separate (protein) chains. We see from the table at right that the proposed EH barcode method outperforms the other methods in this application.
Table 6.1: Pearson correlation coefficients (RMSE in kcal/mol) of the ligand based topological model on the S1322 dataset.
Table 6.2: Description of the PDBBind datasets.
Table 6.3: Pearson correlation coefficients (RMSE in kcal/mol) of different protein-ligand complex based approaches on the PDBBind datasets.
Table 6.4: Parameters used in machine learning.
Table 6.5: Performance on each protein in the DUD dataset.
Table 6.6: AUC comparison of different methods on the DUD dataset.
Table 6.7: Experiments for ligand-based protein-ligand binding affinity prediction of 7 protein clusters and 1322 protein-ligand complexes.
Table 6.8: Performance of different approaches on the S1322 dataset.
Table 6.9: Experiments for protein-ligand-complex-based protein-ligand binding affinity prediction for the PDBBind datasets.
Table 6.10: Performance of different protein-ligand complex based approaches on the PDBBind datasets.
Table 6.11: The AUC for Autodock Vina, TopVS-ML with only compound features, TopVS-ML with only protein-compound complex features, and TopVS-ML with all features. The targets with high quality results by Autodock Vina (AUC > 0.8) are reported.
Table 6.12: The AUC for Autodock Vina, TopVS-ML with only compound features, TopVS-ML with only protein-compound complex features, and TopVS-ML with all features. The targets with low quality results by Autodock Vina (AUC < 0.5) are reported.

LIST OF FIGURES

Figure 2.1: Persistence barcodes of alpha complex filtration (bottom left) and Vietoris-Rips complex filtration (bottom right) for the point cloud (top). The top and bottom panels of the barcodes are the H1 and H0 barcodes.
Figure 2.2: Example regression decision tree.
Figure 2.3: a. Dense layer. b. Convolution layer.
Figure 2.4: A practical neural network architecture for multitask learning.
Figure 3.1: An illustration of barcode changes from wild type to mutant proteins. [26]
Figure 3.2: Energy cycle of protein-ligand binding free energy modeling. [26]
Figure 3.3: Mutation induced protein folding free energy changes. [26]
Figure 3.4: An illustration of the 1D convolutional neural network. [26]
Figure 3.5: The deep learning architecture for the application to globular proteins. [26]
Figure 3.6: The multi-task deep learning architecture for membrane proteins. [26]
Figure 3.7: A comparison of behaviors of the GBT based method and the neural network based method. [26]
Figure 4.1: A simple example loop. [25]
Figure 4.2: a: A point cloud sampled from two adjacent annuli. b: The corresponding H1 barcode using alpha complex filtration. [25]
Figure 4.3: Example of a smoothed H1 cocycle. [25]
Figure 4.4: a and b: Two datasets with similar geometry but different information given on the nodes. c and d: The differences are revealed in the enriched H1 barcodes. [25]
Figure 4.5: Persistent cohomology enriched barcode example of data points sampled from a porous cuboid. [25]
Figure 4.6: D3 dataset sampled from an annulus with randomly assigned values on the points, and the corresponding enriched H1 barcode.
Figure 4.7: Wasserstein characteristics curve.
Figure 4.8: a: The cucurbit[8]uril molecule viewed from two different angles. The hydrogen, carbon, nitrogen, and oxygen atoms are colored in white, grey, blue, and red. b, c, and d: The H1 enriched barcodes obtained by assigning 1 to nodes of the selected atom types (carbon, nitrogen, and oxygen) and 0 elsewhere. [25]
Figure 4.9: a: A structure of the B24N24 cage. The nitrogen and boron atoms are colored in blue and grey. b: The enriched barcodes obtained by assigning 1 to boron atoms and 0 elsewhere. H1 and H2 barcodes are plotted in the bottom and top panels. [25]
Figure 4.10: Enriched barcodes focusing on atomic partial charges. [25]
Figure 5.1: a: Chaotic trajectory of one oscillator without coupling. b: The 70 synchronized oscillators associated with the Cα carbon atoms of protein PDB:1E68 plotted together. [22]
Figure 5.2: The filtration of the simplicial complex associated to three 1-dimensional trajectories. [22]
Figure 5.3: An example of the construction of the evolutionary homology barcode. [22]
Figure 5.4: The result of perturbing residue 31 in protein PDB:1ABA. [22]
Figure 5.5: Left: partially disordered protein, model 1 of PDB:2RVQ. Right: well folded protein, PDB:1UBQ. [22]
Figure 5.6: (a) Models 1-3 of PDB:2ME9 with the disordered region colored in blue, red, and yellow for the three models. (b) Similar plot as (a) for PDB:2MT6. (c) Topological features for PDB:2ME9, whose large disordered region spans residues 28 to 85. (d) Topological features for PDB:2MT6, whose large disordered region spans residues 118 to 151. [22]
Figure 5.7: Barcode plots for two residues. (a) Residue 6 of PDB:2NUH with a B-factor of 12.13 Å². (b) Residue 49 of PDB:2NUH with a B-factor of 33.4 Å². [22]
Figure 5.8: B-factors and the computed topological features. EH shows the linear regression with EH∞,0, EH∞,1, EH1,0, EH1,1, EH2,0, and EH2,1 within each protein. (a) PDB:3PSM with 94 residues. (b) PDB:3SZH with 697 residues. [22]
Figure 6.1: Multi-level persistent homology on simple small molecules. [20]
Figure 6.2: The network architecture of TopBP-DL. [20]
Figure 6.3: The network architecture of TopVS-DL. [20]
Figure 6.4: An illustration of the topology based machine learning algorithms used in scoring and virtual screening. [20]
Figure 6.5: Statistics of ligands in the 7 protein clusters in the S1322 dataset. [20]
Figure 6.6: An illustration of similarities between ligands measured by their barcode space Wasserstein distances. [20]
Figure 6.7: Plot of performance against the number of element combinations used. [20]
Figure 6.8: Feature robustness tests on the PDBBind datasets. [20]
Figure 6.9: Assessment of the performance of the model on samples with elements that are rare in the data sets. [20]

Chapter 1
Introduction

1.1 Applied algebraic topology

Topology delivers an abstraction of a space by studying the properties that persist as the space deforms continuously. A concise description of a space can be obtained using topology, and geometric information at different levels of detail can be retained using methods from algebraic topology. Because of this brevity and ability to preserve information, algebraic topology methods have been applied to data analysis, leading to the advancement of an emerging field called topological data analysis. Topological data analysis is especially powerful at handling high-dimensional and highly complex data sets and has been applied to many fields such as image processing, network analysis, and genomics. The emphasis on the application side of this thesis is given to molecular biology, where quantitative and predictive models are developed.

Homology can distinguish topological spaces by assigning algebraic structures to the spaces that characterize the holes of various dimensions. Intuitively speaking, 0-dimensional homology characterizes connected components, 1-dimensional homology counts circular structures, 2-dimensional homology addresses voids or cavities, and higher-dimensional homology concerns the higher-dimensional holes. Given a data set with a chosen parameter reflecting the geometric scale, one can compute homology by building a complex upon the data and representing the algebraic structures as vector spaces. There are also different ways to represent data. For example, a simplicial complex consisting of points, edges, triangles, and their higher-dimensional counterparts can be built upon a point cloud, and a cubical complex can naturally represent volumetric data. Efficient algorithms have been proposed together with several implementations, enabling the analysis of large datasets. Computational homology has been applied both within mathematics and in other fields. For example, the Conley index for real-world systems can be computed using computational homology [98, 208]. It has also been applied to topological characterizations of spatial-temporal chaos [76].

The ability to automatically analyze data at multiple scales is important for a practical data analysis method: the fewer assumptions that must be made about the given data, the more robust the method. Persistent homology adds another dimension to conventional homology by introducing a filtration of the topological space to achieve a multiscale characterization of data. Instead of computing the homology of a fixed topological space, persistent homology scans along a filtration of a space in which the building blocks are added sequentially, ordered by their associated filtration parameter values.
The computation then records not only the homology of each frame of the filtration but also how the homology generators appear, disappear, and persist along the course of the filtration. A method named the size function, a 0-dimensional version of persistent homology, was introduced for applications in computer vision [72, 163]. Persistent homology theory and practical algorithms were formulated and developed by Edelsbrunner et al. [62]. A more general theory was later introduced by Zomorodian and Carlsson [216]. The outputs of a persistent homology computation are collections of homology generators of different dimensions, each paired with its "birth" and "death" values along the course of the filtration. They are usually visualized via barcode plots [33, 79], where stacked horizontal line segments each correspond to a homology generator, or via persistence diagrams [43, 36], where each generator is plotted as a point in the 2-dimensional plane. A quantitative comparison of two persistent homology computation results can be realized by computing their bottleneck distance or, more generally, their pth Wasserstein distance [44, 29]. As a topological method, persistent homology massively reduces the original dimension of the data to one while retaining crucial geometric information through the filtration. These features make persistent homology a suitable tool for data analysis, especially for complex and high-dimensional data. Persistent homology also has the potential to be paired with predictive models such as deep learning and manifold learning.

Since the introduction of persistent homology, there have been tremendous advancements in algorithms, implementations, and theories. The theory of multidimensional persistent homology has been formulated to address situations where multiple parameters are involved in building the increasing spaces of the filtration [32], and a practical implementation for the 2-dimensional case was developed, enabling interactive visualization [113]. Zigzag persistent homology, together with practical algorithms, was introduced to allow traveling in both directions along the filtration, facilitating the analysis of real-valued functions on topological spaces by associating homology groups to level sets [30]. Introduced to study persistence diagram stability, vineyards can be used to analyze time-varying data [45]. Many systems can be recorded as graphs: the clique complex can be efficiently constructed for undirected graphs [215], and the path complex was introduced for directed graphs [84]; the theory of path complexes was extended with persistence theory, formulating persistent path homology for the analysis of directed graphs [41]. Efforts have also been made on the dual side of persistent homology to make use of the richer information carried by cohomology. A more detailed description of 1-dimensional homology generators was derived by assigning circular coordinates to the input space using persistent cohomology [54]. Cohomology was also used for coordinatization in higher-dimensional cases [155].

There is a collection of software for persistent homology with a wide range of utilities. Dionysus [129] is able to compute zigzag persistence, persistent cohomology, vineyards, alpha complex filtrations, circular coordinates, and bottleneck and Wasserstein distances. Ripser [8] focuses on Rips complex filtrations and is extremely fast. Perseus [136] speeds up computation by using discrete Morse theory [127] and provides utilities for cubical complex filtrations. DIPHA provides efficient algorithms enabling distributed computing [9], and PHAT provides fast matrix reduction implementations [11]. JavaPlex is easy to use through MATLAB and can conveniently illustrate the concepts [2]; it also includes some approximate constructions for faster computation. Gudhi [122] is another comprehensive library, with a Python interface. There are also efficient implementations of both exact and approximate algorithms for computing the bottleneck and Wasserstein distances [101]. Both computer science techniques and mathematical properties are used to accelerate computations, and one of the most important theoretical foundations is the duality between persistent homology and persistent cohomology [53].

Another major method in topological data analysis is mapper, which builds a graph to represent the topology of a dataset [171]. Given a point cloud dataset, each data point is first assigned a value, which can be its eccentricity or one of many other choices; this assignment is called a filter function. Then, a cover of the range of the filter function is chosen based on user-defined parameters such as cover length and overlap percentage. Finally, data points in the same cover element are grouped and represented by a node in the graph, while an edge in the graph means there is an overlap between the intervals associated to the two nodes. Compared to persistent homology, mapper reduces the data dimension while keeping a relatively explicit representation of the topology of the data, making it a good method for visualization, especially for high-dimensional datasets. More quantitative descriptions can be derived by running persistent homology on mapper graphs. A library that can be called from Python, with a GUI, is available [132]. Mapper has found applications in biological and biomedical sciences [142, 143, 206].

Computational homology and persistent homology have been applied to various fields, including image and signal analysis [31, 150, 172, 14, 73, 157], characterization of chaotic dynamics [126, 98], sensor networks [80], complex networks [112, 91], shape recognition [57, 66], and computational biology [99, 75, 48, 156].

1.2 Machine learning and deep learning

Machine learning models are able to automatically extract information from data and subsequently make predictions or inferences. The main paradigms are supervised, semi-supervised, and unsupervised learning. In supervised learning, collections of data entries with descriptions and labels are given to the model; once the model has learned from the data, it predicts the labels of new data given only the descriptions. A simple example of supervised learning is the linear regression model. There are also situations where a large dataset is given but only a small portion of the data has known labels. Instead of learning from the small amount of labeled data using supervised learning techniques, semi-supervised learning models also utilize the unlabeled data, which usually helps determine the underlying distribution of the data. One example is manifold learning, where unlabeled data are used to approximate the underlying manifold and the similarities between data entries are measured by their distance on the manifold. Unsupervised learning derives useful information from unlabeled data by inferring the underlying structure of the data. In this thesis, we mainly focus on supervised learning with an emphasis on molecular biology. There are many competitive machine learning methods, and we list just a few here.
Support vector machines [46] build a hyperplane to separate samples by solving an optimization problem. They can also be extended to nonlinear cases by using a nonlinear kernel, which tends to separate the data in a higher-dimensional space without the need to explicitly embed the data in that space. Support vector machines can be used for both supervised learning and clustering. Another popular class of machine learning methods is ensemble learning, which relies on the assumption that combining multiple weak learners can improve prediction performance. A well-known model of this kind is the random forest [19], where independent decision trees are built and the average of these trees is used as the final model. Since the decision trees are constructed independently on resampled data, the random forest is usually good at reducing variance. Another way of building an ensemble is by gradient boosting, such as gradient tree boosting, where one decision tree is added to the model at a time according to the error at that stage [70, 71]; by its nature, gradient boosting is good at reducing bias. Inspired by biological neural networks, artificial neural networks [125] were proposed to mimic the arrangement of biological neurons by stacking nonlinear functions, forming a complicated composite function with tunable parameters. A common problem for machine learning models, especially those with a large number of tunable parameters, is overfitting, where a model's performance on new data worsens while its performance on the training data keeps improving. Many techniques have been developed to prevent overfitting. For example, each weak learner in an ensemble can be trained on randomly selected parts of the training data, introducing randomness into the training process. Also, a regularization term containing norms of the model parameters can be added to the objective function to avoid making the model too specialized.

Neural networks have the potential to handle complicated and complex situations thanks to their highly flexible structure. After advancements in dealing with the vanishing gradient problem, where the impact of the backpropagated error on the parameters decays too fast across layers, deep learning has flourished. Many different models have been introduced, and these models deliver state-of-the-art performance in many realistic applications. A stacking of regular neural network layers makes a deep neural network, and the number of layers can reach hundreds. Convolutional neural networks significantly reduce the number of parameters compared to dense deep neural networks by taking advantage of locality properties of the input data and applying the same filter over local connections only [111]; they are very good at processing image data [107]. Recurrent neural networks allow directed connections between neurons in the same layer following the flow of an input sequence and are good at processing sequential data such as gene sequences, natural language, and time series. In addition to models that learn functions of inputs, there are also generative models that learn to generate data. When there are related learning tasks, multitask learning or transfer learning takes advantage of the shared properties of the tasks to improve predictive power. Practical multitask learning models can be constructed due to the flexibility and hierarchical structure of neural networks.

In general, one of the major advantages of machine learning, and especially of deep learning methods, is the ability to handle large and diverse datasets. Deep learning methods have established the state of the art in many computational biology problems. For example, one of the most accurate sequence-based protein structure prediction tools, as assessed by blind prediction challenges, is based on a residual network with tens of convolutional layers [191]. The winner of a blind toxicity prediction challenge (Tox21), in which tens of groups participated, was based on a deep neural network paired with the multitask learning technique [124].

1.3 Biomolecular modeling

Understanding structure-function relationships is a major challenge in molecular-level computational biology. Studying the relationship between the structures of biomolecules and their functions not only helps us understand nature but also aids biomedical research such as drug design. For a biomolecular system, there are many important properties related to the function of the system, for example, the binding affinity of protein-ligand complexes, the stability changes induced by amino acid mutations in a protein, and the flexibility of protein residues. The ability to model these basic physical properties further enables the study of functions closer to real-world applications, such as whether a candidate small molecule potentially binds to a protein target, or the impact on drug effect when a specific mutation happens in the target protein.

Physics-based methods build models according to physical laws and are indispensable for molecular modeling; they provide predictions and reveal underlying mechanisms. Examples include quantum mechanics calculations, molecular dynamics simulations, and Monte Carlo sampling. There are also more efficient approximations of atomic systems that use a continuum for part of the system. For example, the Poisson-Boltzmann model delivers an efficient description of the electrostatics in the solvation processes of molecules by using a continuum solvent. A more complete and affordable parameterization of complex biomolecular systems can be achieved by using force fields, usually derived from quantum chemistry computations. Such models are able to model larger realistic systems such as protein-ligand binding [147, 207]. There are also descriptions of the systems that are potentially related to the target property for which rigorous connections to the target property may be hard to derive. In this case, an empirical model can be constructed that combines each component with weights determined from experimental data or from more expensive but accurate models. A widely used setup for empirical models is to combine molecular mechanics energies with the polar part of the solvation process modeled by the Poisson-Boltzmann model or the Generalized Born model, and the nonpolar part of the solvation process reflected by surface areas; these are usually called MM/PBSA or MM/GBSA models [189]. MM/PBSA and MM/GBSA models have been applied to the energy modeling of various biomolecular systems, including protein-ligand binding [77] and protein-DNA binding [153].

When there are more detailed descriptions of the systems, possibly hundreds or thousands of descriptors of various types, it is hard to effectively put them into a linear regression based empirical model. Machine learning models with more capacity may help with this situation. High capacity usually comes with a larger number of model parameters to be determined, which requires more data. Generally, the performance of a knowledge-based model depends heavily on the quantity and quality of the available data. Knowledge-based models using machine learning methods can achieve better performance than physics-based and empirical methods given sufficient data. At the same time, the descriptors derived from physical models can be crucial for knowledge-based methods.

In this work, we are interested in several biological applications: structure-based protein-ligand binding affinity prediction, where the binding affinity is to be predicted given the structure of the protein-ligand complex; prediction of protein stability changes upon mutation, where the structures of the wild-type proteins are given; virtual screening, which aims to determine whether a pair of a target protein and a candidate small molecule could potentially bind; and protein flexibility analysis, where the relative flexibility of protein residues or atoms is to be determined.

Persistent homology has been applied to computational biology [99, 75, 48] and to the mathematical modeling and prediction of nanoparticles, proteins, and other biomolecules [198, 195, 75]. Quantitative topological analysis has been cultivated to predict the curvature energy of fullerene isomers [195, 186], characterize protein folding [198], and quantify immunohistochemical effects [176]. Differential geometry based persistent homology [186] and multiresolutional persistent homology [201] have been proposed to better characterize biomolecular data, detect protein cavities [117], and resolve ill-posed inverse problems in cryo-EM structure determination [200].

1.4 Motivation

The exponential growth of biological data has set the stage for data-driven discovery of structure-function relationships. Indeed, the Protein Data Bank (PDB) has accumulated nearly 130,000 tertiary structures. The availability of 3D structural data enables knowledge-based approaches to offer complementary and competitive predictions of structure-function relationships. Recent advances in machine learning algorithms have made data-driven approaches more competitive and powerful than ever. Arguably, machine learning is one of the most important developments in data analysis, and it has become an indispensable tool in biomolecular data analysis and prediction. Virtually every computational problem in computational biology and biophysics, such as the prediction of solvation free energies, protein-ligand binding affinities, mutation impacts, pKa values, etc., has a class of knowledge-based approaches that are either parallel or complementary to the physics-based approaches.

On the other hand, persistent homology as a data analysis tool can be applied both to data sampled from underlying manifolds and to naturally discrete data such as molecular structures. Indeed, persistent homology has been applied to both qualitative and quantitative analysis of complex molecular structures [99, 75, 48, 198, 195, 176]. Encouraged by persistent homology's capacity for information extraction and dimension reduction, and by the advantages in predictive modeling of machine learning and especially deep learning methods, we aim at developing competitive predictive models using persistent homology and machine learning, with a focus on molecular biology. Despite the excellent out-of-the-box performance of powerful machine learning methods in many applications, proper featurization of biomolecular systems is crucial to the success of a model. To this end, persistent homology can serve as an appropriate featurization tool. Though persistent homology is good at describing geometry, and is especially powerful compared to other methods in high dimensions, special treatment is still needed to address the chemical and biological complexities when it is applied to predictive modeling of biomolecules.

1.5 Outline

In Chapter 2, we review the background of persistent homology, coupling of dynamical systems, some relevant machine learning and deep learning methods, and the several biological applications we worked on. In Chapter 3, we develop a persistent homology based deep learning model using convolutional neural networks and multitask learning for the prediction of protein-ligand binding affinities and mutation induced protein stability changes; a construction of a neuron that can directly take persistence barcodes as input is also introduced. We extend the capacity of the topological characterizations in Chapter 4 by embedding physical properties of the molecules into persistence barcodes using cohomology; this extension is also useful in general situations where data come with heterogeneous dimensions. In Chapter 5, we further consider dynamical properties of molecules using coupled dynamical systems and introduce a construction of persistent homology to analyze the resulting trajectories. In Chapter 6, we introduce several persistent homology based descriptions for both macrobiomolecules and small molecules, and we discuss in detail the representability of persistent homology for biomolecular systems. The dissertation contributions are summarized in Chapter 7.

Chapter 2
Background

2.1 Applied algebraic topology

2.1.1 Simplicial homology

Topological spaces can be approximated, represented, and discretized by simplicial complexes. An (abstract) simplicial complex is a (finite) collection of sets K = {σi}i where each σi is a subset of a (finite) set K0 called the vertex set. We require that this collection satisfies the following condition: if σi ∈ K and τ is a face of σi (that is, if τ ⊆ σi, commonly denoted τ ≤ σi), then τ ∈ K. If σi has k + 1 distinct vertices, {v0, v1, ..., vk}, then σi is called a k-simplex. The k-skeleton of a simplicial complex K is the subcomplex of K consisting of the simplices of dimension k and below. While the simplices of the abstract simplicial complexes we build will not have an obvious geometric meaning, there is a more geometric viewpoint from which we often reference simplices. A geometric k-simplex can be regarded as the convex hull of k + 1 affinely independent points in R^d, and because of this we often call a 0-simplex a point, a 1-simplex an edge, a 2-simplex a triangle, and a 3-simplex a tetrahedron. Without confusion, we will use the same symbols for geometric and abstract simplices.

The homology group of a fixed simplicial complex gives a topological characterization which encodes holes of different dimensions. Homology groups are built using linear transformations called boundary operators. A k-chain of the simplicial complex K is a finite formal sum of the k-simplices in K, α = Σi aiσi, with coefficients ai ∈ G where G is a chosen group, for example the widely used Z2 or, more generally, Zp with a prime p.
The group of all k-chains, with addition given by the addition of the coefficients, is called the k-th chain group and is denoted by Ck(K), or simply Ck when the choice of complex is obvious. Note that when Z2 is used, since it is a field, Ck(K) is in fact a vector space. The boundary operator ∂k : Ck → Ck−1 is the linear transformation generated by mapping any k-simplex to the sum of its codimension-1 faces; namely,

\[ \partial_k(\{v_0, v_1, \cdots, v_k\}) = \sum_{i=0}^{k} (-1)^i \{v_0, \cdots, \hat{v}_i, \cdots, v_k\}, \]

where \(\hat{v}_i\) means that \(v_i\) is absent.

The kth cycle group, Zk(K), is the kernel of the boundary operator ∂k, with elements called k-cycles. The kth boundary group, Bk(K), is the image of the boundary operator ∂k+1, and its elements are called k-boundaries. Since ∂k ∘ ∂k+1 = 0, Bk(K) is a subgroup of Zk(K). Thus we can define the kth homology group, Hk(K), to be the quotient group Zk(K)/Bk(K). Two k-cycles are called homologous if they differ by a boundary; equivalently, if they are in the same equivalence class of Hk(K). Intuitively, if two k-cycles differ from each other by the boundary of a subcomplex, they can roughly be deformed from one to the other continuously through the subcomplex. Each equivalence class in Hk(K) can be thought of as corresponding to a k-dimensional "loop" in K going around a (k+1)-dimensional "hole": 1-dimensional classes give information about loops going around 2D voids, 2-dimensional classes give information about enclosures of 3D voids, etc. While the analogy is not as nice, 0-dimensional classes give information about the connected components of the space.
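These definitions reduce to linear algebra over Z2: over that field, βk = dim Hk(K) = dim ker ∂k − dim im ∂k+1, and the boundary operators are binary matrices. The following minimal sketch, our illustration rather than any software from this dissertation, assembles the boundary matrix of a hollow triangle and recovers β0 = β1 = 1; only numpy is assumed.

```python
import numpy as np
from itertools import combinations

def rank_mod2(M):
    """Rank of a binary matrix over Z2 via Gaussian elimination."""
    M = M.copy() % 2
    rank, rows, cols = 0, M.shape[0], M.shape[1]
    for col in range(cols):
        pivot = next((r for r in range(rank, rows) if M[r, col]), None)
        if pivot is None:
            continue
        M[[rank, pivot]] = M[[pivot, rank]]     # move pivot row into place
        for r in range(rows):
            if r != rank and M[r, col]:
                M[r] = (M[r] + M[rank]) % 2     # eliminate the column
        rank += 1
    return rank

def boundary_matrix(k_simplices, km1_simplices):
    """Matrix of the boundary operator from k-chains to (k-1)-chains over Z2."""
    idx = {s: i for i, s in enumerate(km1_simplices)}
    D = np.zeros((len(km1_simplices), len(k_simplices)), dtype=int)
    for j, s in enumerate(k_simplices):
        for face in combinations(s, len(s) - 1):  # the codimension-1 faces
            D[idx[face], j] = 1                   # signs vanish over Z2
    return D

# Hollow triangle: three vertices, three edges, no 2-simplex.
verts = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
r1 = rank_mod2(boundary_matrix(edges, verts))   # dim of im(boundary_1)
beta0 = len(verts) - r1        # Z0 = C0 since boundary_0 = 0
beta1 = (len(edges) - r1) - 0  # ker(boundary_1); B1 = 0 (no triangles)
print(beta0, beta1)            # 1 connected component, 1 loop
```

One boundary matrix per dimension is all that a persistent homology implementation maintains as the filtration grows, which is why the matrix reduction libraries mentioned in Chapter 1 are so central.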
2.1.2 Filtration and persistence

We now turn to the case where we have a changing simplicial complex and want to understand something about its structure. Consider a finite simplicial complex K and let f be a real-valued function on the simplices of K which satisfies f(τ) ≤ f(σ) for all simplices τ ≤ σ in K. We will refer to this function as the filtration function. For any x ∈ R, the sublevel set of K associated to x is defined as K(x) = {σ ∈ K | f(σ) ≤ x}. Note first that, because of our assumptions on f, K(x) is always a simplicial complex, and second that K(x) ⊆ K(y) for any x ≤ y. Further, as x varies, K(x) only changes at the function values defined on the simplices. Since K is assumed to be finite, let {x1 < x2 < ··· < xℓ} be the sorted range of f. The filtration of K with respect to f is the ordered sequence of its subcomplexes,

\[ \emptyset \subset K(x_1) \subset K(x_2) \subset \cdots \subset K(x_\ell) = K. \tag{2.1} \]

The filtration of a simplicial complex sets the stage for a thorough topological examination of the space under multiple scales of the filtration parameter, which is the output value of the filtration function f.

The definition of homology is valid for a fixed simplicial complex; however, we are interested in studying the structure of a filtration like that of Eq. (2.1). Functoriality of homology means that such a sequence of inclusions induces linear transformations on the sequence of vector spaces

\[ H_k(K(x_1)) \rightarrow H_k(K(x_2)) \rightarrow \cdots \rightarrow H_k(K(x_\ell)). \tag{2.2} \]

Persistent homology not only characterizes each frame in the filtration {K(xi)}i, but also tracks the appearance and disappearance (commonly referred to as births and deaths) of nontrivial homology classes as the filtration progresses. A collection of vector spaces {Vi} and linear transformations fi : Vi → Vi+1 is called a persistence module, of which Eq. (2.2) is an example. It is a special case of a much more general theorem of Gabriel [74] that sufficiently nice persistence modules can be decomposed uniquely into a finite collection of interval modules [37, 149]. An interval module I^[b,d) is a persistence module for which Vi = Z2 if i ∈ [b, d) and 0 otherwise, and fi is the identity when possible and 0 otherwise. So, given the persistence module of Eq. (2.2), we can decompose it as

\[ \bigoplus_{[b,d) \in B} I^{[b,d)}, \]

and thus fully represent the algebraic information by the discrete collection B. These intervals exactly encode when homology classes appear and disappear in the persistence module. The collection of such intervals can be visualized by plotting points in the 2D half plane {(x, y) | y ≥ x}, which is known as a persistence diagram, or by stacking the horizontal intervals, which is known as a barcode. In this dissertation, for no reason other than convenience, we represent our information using barcodes. We call the barcode resulting from a sequence of trivial homology groups the empty barcode and denote it by ∅. For every interval [b, d) ∈ B, we call b the birth time and d the death time.

We review two widely used constructions of filtrations. The Vietoris-Rips (VR) complex is built upon the 1-skeleton induced by pairwise distances among the given points. Given a distance function and a threshold distance, a simplex is in the VR complex if the distance between any pair of vertices in the simplex is smaller than or equal to the threshold. More formally, the VR complex on a finite point set X at threshold δ is defined as

\[ VR(X, \delta) = \{\sigma \subseteq X \mid d(v, v') \leq \delta, \ \forall v, v' \in \sigma\}. \tag{2.3} \]

The distance function can naturally be the Euclidean distance, but it can also be another user-defined distance tailored for a specific application. Since the VR complex does not directly rely on the exact geometry, a more abstract usage is possible with distance functions that do not satisfy the triangle inequality. Another construction is the alpha complex, which is closely related to geometric modeling. On a finite point set X in Euclidean space, we can build a Voronoi diagram; let V(v) be the Voronoi cell associated to v ∈ X. Then the alpha complex associated to the parameter ε is defined as

\[ A(X, \varepsilon) = \{\sigma \subseteq X \mid \bigcap_{v \in \sigma} \left( V(v) \cap B(v, \varepsilon) \right) \neq \emptyset\}, \tag{2.4} \]

where B(v, ε) is the ε-ball centered at v. It should be noted that the alpha complex is a subcomplex of the Delaunay triangulation and is thus very efficient in terms of computation. The alpha complex is a faithful representation of the geometry. Though usually more computationally costly, the Vietoris-Rips complex is useful when the input point set does not have natural coordinates (though a geometric embedding can be done) or when interaction scores are of interest instead of geometric distances. Since A(X, ε) ⊆ A(X, ε′) for ε ≤ ε′ and VR(X, δ) ⊆ VR(X, δ′) for δ ≤ δ′, these constructions of simplicial complexes induce proper filtrations, and ε and δ are called the filtration parameters. An example of persistence barcodes is shown in Figure 2.1.

Figure 2.1: Persistence barcodes of alpha complex filtration (bottom left) and Vietoris-Rips complex filtration (bottom right) for the point cloud (top). The top and bottom panels of the barcodes are the H1 and H0 barcodes.
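In practice, filtrations like Eqs. (2.3) and (2.4) are built with the software surveyed in Chapter 1. As a hedged illustration, the sketch below uses the Python interface of Gudhi [122] (assuming a standard installation) to build a Vietoris-Rips filtration on a noisy sampled circle and print its long-lived bars; the point count, noise level, and persistence cutoff are arbitrary choices made for this example.

```python
import numpy as np
import gudhi

# Sample a noisy circle as a point cloud.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 60)
points = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.05, (60, 2))

# Vietoris-Rips filtration up to 2-simplices (cf. Eq. 2.3).
rips = gudhi.RipsComplex(points=points, max_edge_length=2.0)
st = rips.create_simplex_tree(max_dimension=2)

# Each (dimension, (birth, death)) pair is one bar of the barcode.
for dim, (birth, death) in st.persistence():
    if death - birth > 0.3:        # keep only the persistent bars
        print(f"H{dim}: [{birth:.3f}, {death:.3f})")
# Expect one long H0 bar (one component) and one long H1 bar (the circle).
```

Replacing `gudhi.RipsComplex` with `gudhi.AlphaComplex` would give the alpha complex filtration of Eq. (2.4), which is the more efficient choice for point clouds embedded in low-dimensional Euclidean space.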
2.1.3 Barcode space metrics

The similarity between persistence barcodes can be quantified by barcode space distances. The most commonly used metrics are the bottleneck distance [43] and the p-Wasserstein distances [44]. The definitions of the two distances are summarized as follows. The l∞ distance between two persistence bars I1 = [b1, d1) and I2 = [b2, d2) is defined to be

\[ \Delta(I_1, I_2) = \max\{|b_2 - b_1|, |d_2 - d_1|\}. \]

The existence of a bar I = [b, d) is measured as

\[ \lambda(I) := (d - b)/2 = \min_{x \in \mathbb{R}} \Delta(I, [x, x)). \]

This can be interpreted as measuring the smallest distance from the bar to the closest degenerate bar whose birth and death values are the same. For two finite barcodes B1 = {I¹_α}_{α∈A} and B2 = {I²_β}_{β∈B}, a partial bijection is defined to be a bijection θ : A′ → B′ from A′ ⊆ A to B′ ⊆ B. In order to define the p-Wasserstein distance, we have the following penalty for θ:

\[ P(\theta) = \left( \sum_{\alpha \in A'} \Delta(I^1_\alpha, I^2_{\theta(\alpha)})^p + \sum_{\alpha \in A \setminus A'} \lambda(I^1_\alpha)^p + \sum_{\beta \in B \setminus B'} \lambda(I^2_\beta)^p \right)^{1/p}. \]

Then the p-Wasserstein distance is defined as

\[ d_{W,p}(B_1, B_2) = \min_{\theta \in \Theta} P(\theta), \]

where Θ is the set of all possible partial bijections from A to B. Intuitively, a partial bijection θ is mostly penalized for connecting two bars with a large difference, measured by ∆(·), and for connecting long bars to degenerate bars, measured by λ(·). The bottleneck distance is an L∞ analogue of the p-Wasserstein distance. The bottleneck penalty of a partial matching θ is defined as

\[ P(\theta) = \max\left\{ \max_{\alpha \in A'} \Delta\left(I^1_\alpha, I^2_{\theta(\alpha)}\right), \ \max_{\alpha \in A \setminus A'} \lambda(I^1_\alpha), \ \max_{\beta \in B \setminus B'} \lambda(I^2_\beta) \right\}. \]

The bottleneck distance is defined as

\[ d_{W,\infty}(B_1, B_2) = \min_{\theta \in \Theta} P(\theta). \]

2.2 Coupled dynamical systems

The general theory of control of coupled dynamical systems has been well studied in the literature [148, 92, 192, 197]. A brief review is given in this section.

2.2.1 Oscillators and coupling

We consider a collection of N n-dimensional dynamical systems originally governed by the same equation,

\[ \frac{du_i}{dt} = g(u_i), \quad i = 1, 2, \cdots, N, \]

where u_i = {u_{i,1}, u_{i,2}, ..., u_{i,n}}^T is a column vector of size n. The individual dynamical systems can be coupled with an N × N coupling matrix A by building an (N × n)-dimensional system. We denote {u_1, u_2, ..., u_N}^T by u. The coupled system is an (N × n)-dimensional dynamical system modeled as

\[ \frac{du}{dt} = G(u) + \epsilon (A \otimes \Gamma) u, \tag{2.5} \]

where G(u) = {g(u_1), g(u_2), ..., g(u_N)}^T, ε is a parameter reflecting the coupling strength, and Γ is an n × n predefined linking matrix specifying how the variables of the individual systems are coupled across the systems.

2.2.2 Stability and controllability

The coupled systems are said to be in a synchronous state if u_1(t) = u_2(t) = ··· = u_N(t) = s(t). The stability of this state can be analyzed using v = {u_1 − s, u_2 − s, ..., u_N − s}^T with the following equation, obtained by linearizing Eq. (2.5):

\[ \frac{dv}{dt} = \left[ I_N \otimes Dg(s) + \epsilon (A \otimes \Gamma) \right] v, \tag{2.6} \]

where I_N is the N × N identity matrix and Dg(s) is the Jacobian of g evaluated at s. The stability of the synchronous state can be studied by an eigenvalue analysis of the coupling matrix A.
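To make Eq. (2.5) concrete, the sketch below integrates N coupled oscillators with SciPy. The use of the Lorenz system as g, the chain (graph-Laplacian style) coupling matrix A, the linking matrix Γ restricted to the first coordinate, and the coupling strength are all arbitrary choices for this illustration, not values taken from this dissertation.

```python
import numpy as np
from scipy.integrate import solve_ivp

N, n = 5, 3       # five 3-dimensional oscillators
eps = 8.0         # coupling strength (epsilon in Eq. 2.5), chosen arbitrarily

def g(u):
    """Common single-oscillator dynamics: the classic Lorenz system."""
    x, y, z = u
    return np.array([10.0 * (y - x), x * (28.0 - z) - y, x * y - 8.0 / 3.0 * z])

# Chain coupling matrix A with zero row sums (Laplacian-style) and a
# linking matrix Gamma that couples only the first coordinate.
A = -2.0 * np.eye(N)
for i in range(N - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0
A[0, 0] = A[-1, -1] = -1.0
Gamma = np.diag([1.0, 0.0, 0.0])

def rhs(t, u_flat):
    u = u_flat.reshape(N, n)                 # row i holds oscillator i
    drift = np.array([g(ui) for ui in u])    # G(u)
    coupling = eps * (A @ u) @ Gamma.T       # (A kron Gamma) u, blockwise
    return (drift + coupling).ravel()

u0 = np.random.default_rng(1).normal(size=N * n)
sol = solve_ivp(rhs, (0.0, 20.0), u0, max_step=0.01)
print(sol.y.shape)   # trajectories of all N*n coordinates over time
```

For a sufficiently strong coupling the trajectories synchronize toward a common s(t), in line with the eigenvalue criterion above; trajectories of this kind are the raw input to the evolutionary homology construction of Chapter 5.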
2.3 Machine learning

In this section, we review a few supervised learning methods. For each data entry, we assume we have a description of it, called the feature, and a label associated to it, which may be known or unknown. An example scenario: a drug can have different effects on patients with the same disease, and we would like to predict the efficacy of the drug on new patients before administering it. Here, each patient is an entry, and the drug effect is the label we are interested in. What we have is labeled data on a set of past patients, meaning that we know the drug outcome for them. For both past and new patients, we have descriptions such as height, body weight, and age. These descriptions are the features, which are usually much easier to measure than the label. Then, based on the experience with past patients, we can build a model that predicts labels for new patients. Basically, given a collection of feature-label pairs {(x_i, y_i)}_{i=1}^n, usually called the training data, a machine learning method tries to build a function that is able to predict the label of a newly given entry based on its feature. The feature can be in structured or unstructured form. The data can also come in a form where only a metric measuring similarities between data entries is given. The basic assumption is that entries sharing similar features tend to have similar labels. Generally, a supervised machine learning model has hyperparameters, which determine the basic model structure, and tunable parameters, which are optimized on the training data. Usually, there is a loss function which measures how well the model performs on the training data, and the tunable parameters of the model are optimized to minimize this loss function.

2.3.1 K-nearest neighbor algorithm

The k-nearest neighbor algorithm is a direct usage of the assumption that entries sharing similar features tend to have similar labels, and it relies on a reliable distance function on the input space. Let X be the input (feature) space and Y the output (label) space, and suppose there is a given distance function, or, more loosely speaking, a function reflecting distance, d(x, x′), x, x′ ∈ X. Given a collection of input-output pairs {(x_i, y_i)}_{i=1}^n, we can predict the output of a newly given input x by comparing it to the inputs of all the training examples. First, an ordering {(x_{α_i}, y_{α_i})}_{i=1}^n of the training data with respect to the newly given input is determined such that

\[ d(x, x_{\alpha_1}) \leq d(x, x_{\alpha_2}) \leq \cdots \leq d(x, x_{\alpha_{n-1}}) \leq d(x, x_{\alpha_n}). \tag{2.7} \]

Here, we consider the regression case, where the labels are real numbers. The output y paired with x can then be predicted as

\[ \frac{1}{k} \sum_{i=1}^{k} y_{\alpha_i}, \]

where k is a positive integer chosen by the user. Weights can be added to accommodate the hierarchy in the ordering,

\[ \sum_{i=1}^{k} w_i y_{\alpha_i}, \]

and a simple choice of weights is the normalized inverse distance,

\[ w_i = \frac{1}{d(x, x_{\alpha_i})} \Big/ \sum_{i=1}^{k} \frac{1}{d(x, x_{\alpha_i})}. \]

There are also other, more complicated weights that are more robust and stable.
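The weighted prediction rule above fits in a few lines. The sketch below is our minimal illustration, with the Euclidean distance assumed for d(·,·) and a toy regression target.

```python
import numpy as np

def knn_predict(x, X_train, y_train, k=3, weighted=True):
    """Predict the label of x from its k nearest training examples."""
    dists = np.linalg.norm(X_train - x, axis=1)        # d(x, x_i) for all i
    nearest = np.argsort(dists)[:k]                    # indices alpha_1..alpha_k
    if not weighted:
        return y_train[nearest].mean()                 # (1/k) * sum of y_alpha_i
    inv = 1.0 / (dists[nearest] + 1e-12)               # inverse-distance weights
    return np.sum(inv / inv.sum() * y_train[nearest])  # w_i normalized to sum 1

# Toy usage: learn y = sum of the coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 2))
y = X.sum(axis=1)
print(knn_predict(np.array([0.5, 0.5]), X, y, k=5))    # approximately 1.0
```

The small constant added to the distances is one simple guard against a query point coinciding with a training point; the more robust weighting schemes mentioned above serve the same purpose.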
2.3.2 Decision tree

We review the idea of decision tree learning in the case of regression. Suppose the feature is in the form of a vector, say $x \in \mathbb{R}^d$. If the impacts of the elements of x on the output y are assumed to be additive, a linear regression model can be used. A generalization of the linear model to higher order polynomials can accommodate higher order interactions among the elements of the input; however, model complexity increases dramatically as higher order interactions are considered. The decision tree model provides an alternative that addresses higher order interactions by partitioning the feature space instead of exploring the space of high order polynomials with many variables. The nodes that are not leaves in a decision tree are associated with elements of the input feature vector, and the leaf nodes are assigned prediction values of the label. When flowing from the root node to a leaf node, the value of the corresponding feature vector element is evaluated against predefined criteria to determine which child node to go to. Once a leaf node is reached, its assigned value is returned as the prediction. Fig. 2.2 gives an example regression decision tree.

Figure 2.2: Example regression decision tree.

Decision trees can be constructed with a greedy algorithm from the top down. At each construction stage, a feature element with a splitting criterion is selected to most effectively reduce the loss function [166]. There are techniques to prevent overfitting by restricting model complexity. For example, one may set an upper bound on the number of branches or the tree depth, or a lower bound on the number of training examples associated with each leaf node. One advantage of decision trees is that a single element of the feature vector is used at each node, so that the features can have completely different meanings and magnitudes.

2.3.3 Ensemble of trees

In many problems, there can be thousands of features and large amounts of data that are too diverse and complicated for one decision tree to handle. A group of weak learners can be combined to increase model capacity; this is called an ensemble method. Here, a set of decision trees can be used instead of a single tree. There are different ensemble methods for decision trees, and we review two widely used constructions, random forest and gradient boosted trees.

Random forest relies on bootstrap aggregation, where independent trees are built on random resamplings of the original training data with replacement. A consensus model is then constructed by taking the average output of the independent trees. There are techniques designed against overfitting, such as resampling only a subset of the training data for each tree. Since the trees are trained independently, a random forest can be efficiently built by constructing the trees in parallel.

Another approach is gradient boosting, where one tree is added at a time to minimize the current loss. Let L(y, F(x)) be a loss function measuring the quality of a prediction F(x) against the true label y. A general gradient boosting procedure first initializes a model, denoted $F_0$, which can be a constant function. Then, $F_\ell$ is updated to $F_{\ell+1}$ by training a base model $h_\ell$ against the gradient of the current loss,
\[ \Big\{ \Big( x_i, \; -\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \Big|_{F = F_\ell} \Big) \Big\}_i. \]
The model in the next iteration takes the form $F_{\ell+1} = F_\ell + \gamma_\ell h_\ell$, where $\gamma_\ell$ is obtained by solving the minimization problem
\[ \min_{\gamma} \sum_i L\big(y_i, F_\ell(x_i) + \gamma h_\ell(x_i)\big). \]
A real number between 0 and 1, called the learning rate, is usually multiplied with $\gamma_\ell$ when updating the model for robustness.
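The following sketch illustrates the boosting loop above for the squared loss, where the negative gradient is simply the residual; it uses scikit-learn regression trees as base learners and absorbs the line-search factor into a fixed learning rate, which is a simplification of the general procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, learning_rate=0.1, max_depth=2):
    """Gradient boosting for the squared loss L(y, F) = (y - F)^2 / 2.

    The negative gradient -dL/dF = y - F is the residual, so each round
    fits a small tree h_l to the current residuals and updates
    F <- F + learning_rate * h_l.
    """
    base = y.mean()                                  # F_0: a constant model
    F = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residual = y - F                             # negative gradient of the loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += learning_rate * h.predict(X)
        trees.append(h)

    def predict(X_new):
        out = np.full(len(X_new), base)
        for h in trees:
            out += learning_rate * h.predict(X_new)
        return out
    return predict

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
model = gradient_boost(X, y)
print(model(np.array([[np.pi / 2]])))                # roughly sin(pi/2) = 1
```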
2.3.4 Deep learning

A neural network is a generally nonlinear composite function constructed mainly from linear transformations and nonlinear activation functions. A regular neuron is a processing unit that applies a nonlinear activation function on top of a linear function of the outputs of the neurons in the previous layer that are connected to it. A regular neural network is constructed by stacking densely connected layers, where every pair of neurons from adjacent layers is connected. Assume that we have a collection of n neurons in layer i and denote the output of the jth neuron by $N^{(i)}_j$; then the output of the kth neuron in layer i + 1 is computed as
\[ N^{(i+1)}_k = \phi\Big( \sum_{j=1}^{n} w^{(i+1)}_{j,k} N^{(i)}_j + b^{(i+1)}_k \Big), \]
where $w^{(i+1)}_{j,k}$ and $b^{(i+1)}_k$ are tunable parameters and $\phi$ is the chosen activation function. Commonly used activation functions include the rectified linear unit ($\max(0, x)$), the sigmoid ($1/(1 + e^{-x})$), and the hyperbolic tangent ($(e^x - e^{-x})/(e^x + e^{-x})$). The number of parameters increases rapidly when more neurons and layers are added to the model.

Originally inspired by the vision system in biology, the convolutional neural network is a widely used architecture that is especially good at dealing with image data or image-like analogues. Only local connections are allowed and the weights are shared across different locations. The setup of a single neuron is similar to the regular neural network, but the information carried by a neuron can be a vector instead of a scalar, and the weight and bias parameters are then higher dimensional counterparts. The activation function is applied elementwise to the vector. Fig. 2.3 shows the architecture of a dense layer and a convolution layer.

Figure 2.3: a. Dense layer. b. Convolution layer.

We discuss more details of the convolutional neural network in the 1-dimensional case. Consider an n × m second order tensor V, where n is the number of digits along the one spatial dimension and m is the number of channels for each digit. With a predefined window size w, a convolutional filter F can be represented by a w × m second order tensor. By moving the window of size w along the one dimension of V, a sequence of $N_f$ second order tensors, which are subtensors of V, is obtained and can be concatenated to form an $N_f \times w \times m$ third order tensor T. The filter F operated on T results in a first order tensor $T_{ijk} F_{jk}$ by tensor contraction. Concatenating the outputs of $n_f$ filters gives an $N_f \times n_f$ second order tensor. Generally speaking, a 1D convolution layer takes an n × m tensor and outputs an $N_f \times n_f$ tensor. The bias terms and activation functions are omitted in the illustration above for simplicity.

Another technique we are going to make use of is multitask learning. A practical multitask learning model can be achieved by branching out from shared lower layers. An example architecture of a multitask learning neural network is shown in Fig. 2.4.

Figure 2.4: A practical neural network architecture for multitask learning.

Feedforward neural networks are usually trained by backpropagation, where the error of the output layer is calculated and propagated backward through the network to update its weights. One popular approach to training a neural network is the stochastic gradient descent (SGD) method. Let $\Theta$ be the parameters in the network and $L(\Theta)$ be the loss function that is to be minimized. The SGD method updates $\Theta_i$ to $\Theta_{i+1}$ from step i to step i + 1 as
\[ \Theta_{i+1} = \Theta_i - \tau \nabla_\Theta L(\Theta_i; X_s, Y_s), \]
where $\tau$ is the learning rate, and $X_s$ and $Y_s$ are the input and target of the sth example of the training set. In practice, the training set (X, Y) is often split into mini-batches $\{(X_s, Y_s)\}_{s \in S}$, and the SGD method then goes through each mini-batch instead of only one example at a time. When the landscape of the objective function resembles a long steep valley, momentum is added to accelerate the convergence of the algorithm. The updating scheme can then be changed to
\[ \Delta\Theta_i = \Theta_i - \Theta_{i-1}, \qquad \Theta_{i+1} = \Theta_i - (1 - \eta)\,\tau \nabla_\Theta L(\Theta_i; X^i_s, Y^i_s) + \eta\,\Delta\Theta_i, \]
where $0 \leq \eta \leq 1$ is a scalar coefficient for the momentum term. Many techniques have been proposed to address the overfitting problem, such as dropout and regularization terms.
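The 1D convolution described above, with biases and activations omitted, amounts to a single tensor contraction $T_{ijk} F_{jk}$; the NumPy sketch below makes this explicit via einsum. The array sizes are illustrative only.

```python
import numpy as np

def conv1d(V, filters):
    """1D convolution as a tensor contraction, following the text.

    V: (n, m) input with n pixels and m channels.
    filters: (n_f, w, m) stack of n_f filters of window size w.
    Returns an (N_f, n_f) tensor with N_f = n - w + 1 window positions.
    Bias terms and the activation function are omitted, as in the text.
    """
    n, m = V.shape
    nf, w, _ = filters.shape
    Nf = n - w + 1
    # T has shape (Nf, w, m): the sequence of subtensors of V.
    T = np.stack([V[i:i + w] for i in range(Nf)])
    # Contract over the window and channel axes: out[i, f] = T_ijk F_fjk.
    return np.einsum('ijk,fjk->if', T, filters)

rng = np.random.default_rng(2)
V = rng.normal(size=(100, 8))          # 100 pixels, 8 channels
F = rng.normal(size=(16, 5, 8))        # 16 filters of window size 5
print(conv1d(V, F).shape)              # (96, 16)
```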
2.4 Biomolecular modeling

In this section, we review some basics of biomolecular modeling relevant to this dissertation, involving both bioinformatics methods and physical models.

2.4.1 Proteins and small molecules

Proteins are made of sequences of amino acids and play very important roles in living organisms. To list just a few of their functions, they serve as structural supporting elements, drive biological processes, and sense chemical changes. The amino acid sequence determines local secondary structures, including the alpha helix, beta sheet, and random coil. Regular proteins usually fold into a relatively stable structure, which we call the folded state. There are also disordered or partially disordered proteins that do not have stable structures under certain conditions.

Small molecules also play an important role in biological processes, often by interacting with macromolecules such as proteins and nucleic acids. They may serve as triggering signals for certain processes or inhibit others. Indeed, many working drugs are small molecules. Being able to accurately describe small molecules and small molecule-macromolecule complexes is important for computational molecular biology and drug design.

2.4.2 Physical modeling

Solvation free energy

Electrostatic forces play an essential role in protein folding, and the electrostatic solvation free energy is an important component of electrostatic interactions. Among all kinds of approaches, continuum solvation models achieve a favorable trade-off between accuracy and efficiency. The Poisson-Boltzmann model is used to quantify the polar part of the energy change in the solvation process, that is, the process of moving a molecule from vacuum to a solvent environment. The Poisson-Boltzmann model uses the Boltzmann distribution to model the ion density, and the equation is formulated as
\[ -\nabla \cdot \big( \epsilon(\mathbf{x}) \nabla \phi(\mathbf{x}) \big) + \bar{k}^2(\mathbf{x}) \sinh(\phi(\mathbf{x})) = \sum_{i=1}^{N} q_i \delta(\mathbf{x} - \mathbf{x}_i), \qquad \phi(\infty) = 0, \quad (2.8) \]
where $\phi$ is the electrostatic potential, $\epsilon$ is the dielectric constant, $\bar{k}$ is the modified Debye-Hückel screening factor, $\delta$ is the Dirac delta function, and $q_i$ and $\mathbf{x}_i$ are the partial charge and location of each fixed atom in the molecule. The second term and the right-hand side of Eq. (2.8) model the charge densities of the mobile ions and the fixed molecule, respectively. The values of the partial charges can be assigned by precomputed force fields. In computation, a bounding box $\Omega$ is taken as the computational domain and is composed of the solvent domain $\Omega_s$ and the solute domain $\Omega_m$. One definition of the dielectric constant $\epsilon$ takes constant values on $\Omega_s$ and $\Omega_m$. The solution $\phi$ of Eq. (2.8) should satisfy the jump conditions across the interface $\Gamma$ between $\Omega_s$ and $\Omega_m$:
\[ [\phi]_\Gamma := \phi|_{\Omega_s}(\mathbf{x}) - \phi|_{\Omega_m}(\mathbf{x}) = 0, \quad \mathbf{x} \in \Gamma, \]
\[ [\epsilon \nabla \phi]_\Gamma := \epsilon_s \nabla \phi|_{\Omega_s}(\mathbf{x}) \cdot \mathbf{n} - \epsilon_m \nabla \phi|_{\Omega_m}(\mathbf{x}) \cdot \mathbf{n} = 0, \quad \mathbf{x} \in \Gamma. \quad (2.9) \]
The solvation free energy modeled by the Poisson-Boltzmann model can then be computed as
\[ \Delta G^{\text{sol}}_{\text{pol}} = \frac{1}{2} \sum_{i=1}^{N} q_i \big( \phi(\mathbf{x}_i) - \phi_{\text{vac}}(\mathbf{x}_i) \big), \quad (2.10) \]
where $\Delta G^{\text{sol}}_{\text{pol}}$ is the polar component of the solvation free energy and $\phi_{\text{vac}}$ is the electrostatic potential of the molecule in a vacuum environment, obtained by solving a Poisson equation with the uniform dielectric constant of vacuum.

Electrostatic interaction

The electrostatics of pairwise interactions in a homogeneous medium can be modeled with the Coulomb model. Consider two atoms with partial charges $q_i$ and $q_j$ at a distance $r_{ij}$ from each other. The electrostatic force $F_{\text{clb}}$ between these two atoms is described by Coulomb's law,
\[ F_{\text{clb}} = k_e \frac{q_i q_j}{r_{ij}^2}, \]
where $k_e$ is the Coulomb constant.
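A small sketch of pairwise Coulomb electrostatics follows; it evaluates the potential energy $k_e q_i q_j / r_{ij}$ associated with the force above for all atom pairs. The constant 332.0636 kcal·Å/(mol·e²) and the unit dielectric are illustrative unit choices of ours, not values prescribed by this work.

```python
import numpy as np

KE = 332.0636  # Coulomb constant in kcal*A/(mol*e^2); an illustrative unit choice

def coulomb_energy(coords, charges):
    """Total pairwise Coulomb interaction energy k_e q_i q_j / r_ij in a
    homogeneous medium with dielectric constant 1.

    coords: (N, 3) atomic positions in Angstrom; charges: (N,) partial charges.
    """
    total = 0.0
    n = len(charges)
    for i in range(n):
        for j in range(i + 1, n):
            r = np.linalg.norm(coords[i] - coords[j])
            total += KE * charges[i] * charges[j] / r
    return total

coords = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
charges = np.array([-0.5, 0.25, 0.25])
print(coulomb_energy(coords, charges))
```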
Van der Waals interaction

Van der Waals forces describe interactions between atoms that are not attributable to covalent bonds or electrostatic interactions. The Lennard-Jones potential is used to model the van der Waals interaction energy between atoms, and the energy between the ith and the jth atoms is defined as
\[ G_{i,j} = \epsilon_{ij} \left( \Big( \frac{r_i + r_j}{|\mathbf{x}_i - \mathbf{x}_j|} \Big)^{12} - 2 \Big( \frac{r_i + r_j}{|\mathbf{x}_i - \mathbf{x}_j|} \Big)^{6} \right), \quad (2.11) \]
where $r_i$ and $r_j$ are the van der Waals radii of the atoms and $\epsilon_{ij}$ is the depth of the potential well, which differs for different atom pairs. Since atomic features are constructed and the interactions between different pairs of atom types are modeled separately, the determination of the parameter $\epsilon_{ij}$ is left to the machine learning algorithm.

Molecule surface area

While the polar part of the solvation free energy is modeled with the Poisson-Boltzmann theory, the nonpolar part is affected by several factors. The surface area of the molecule is believed to relate to the nonpolar part of the solvation free energy.

2.4.3 Sequence tools

BLOSUM matrix

A BLOSUM matrix is a substitution matrix for protein sequence alignments. It scores the substitution of any pair of amino acids and can thus describe the favorability of a given point mutation. In this work, BLOSUM62, which models moderately related proteins, is used.

PSSM matrix

The BLOSUM matrix is a general scoring matrix which does not consider residue position or the alignment of the entire sequence. A position-specific scoring matrix (PSSM) is used to fill this information gap. BLAST searches for the query sequence are performed iteratively, and the substitution scores are updated according to the matched sequences found, position-specifically along the sequence.
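As a concrete instance of the pairwise energy terms of Sec. 2.4.2, the following sketch evaluates the Lennard-Jones energy of Eq. (2.11); note that at a separation of exactly $r_i + r_j$ the energy equals $-\epsilon_{ij}$, the bottom of the well. The numerical well depth used in the example is an arbitrary illustrative value.

```python
import numpy as np

def lennard_jones(xi, xj, ri, rj, eps_ij=1.0):
    """Lennard-Jones interaction energy between atoms i and j, Eq. (2.11).

    The potential minimum sits at |x_i - x_j| = r_i + r_j with depth eps_ij.
    ri, rj: van der Waals radii; xi, xj: 3D coordinates.
    """
    sigma = (ri + rj) / np.linalg.norm(np.asarray(xi) - np.asarray(xj))
    return eps_ij * (sigma ** 12 - 2.0 * sigma ** 6)

# At exactly the sum of the van der Waals radii the energy is -eps_ij.
print(lennard_jones([0, 0, 0], [3.4, 0, 0], 1.7, 1.7, eps_ij=0.238))  # -0.238
```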
Chapter 3

TopologyNet: Deep convolutional neural networks based on topology for biomolecular property predictions

3.1 Introduction

Understanding the structure-function relationships of biomolecules is fundamentally important in computational biophysics and experimental biology. As such, methods that can robustly predict biomolecular properties, such as protein-ligand binding affinity and protein stability change upon mutation, from three-dimensional (3D) structures are important tools that help us understand these relationships. Numerous approaches have been developed to unveil the structure-function relationship. Physics based models make use of fundamental laws of physics, i.e., quantum mechanics, molecular mechanics, continuum mechanics, multiscale modeling, statistical mechanics, thermodynamics, etc., to investigate structure-function relationships and predict function from structure. Physical methods provide important insights and are indispensable for understanding the relationships between protein structure and function.

Physical models for complex biomolecular systems are often made computationally tractable by making simplifications of the real-world systems. However, there are many factors that can affect the properties of biomolecular systems, and a factor that is crucial to one type of system can have a minor effect on another. Therefore, it is hard to tune a physical model to handle diverse systems. On the other hand, a model with the capability of handling diverse systems is desirable in many realistic applications; for example, the computational screening of drug candidates involves scanning a library containing millions of diverse molecules. Large and diverse datasets bring the opportunity of letting a model learn from data to automatically extract the relevant factors and the unknown relationships.

Machine learning models driven by data make relatively few assumptions about the importance of the factors in the systems, and there is usually high flexibility in the models so that nonlinear and high-order interactions among features can be recognized. In this chapter, we focus on deep learning models for their excellent ability to handle complex datasets and their flexible architectures enabling the combination of related learning tasks. Deep learning has fueled the rapid growth of several areas of data science [110, 138]. Notably, deep learning can automatically extract optimal high level features and discover intricate structures in large data sets. The capability of handling data with underlying spatial dimensions hierarchically has led to breakthroughs of deep convolutional neural networks in image processing, video, audio, and computer vision [107, 170].

Given multiple learning tasks, multi-task learning (MTL) [34] provides a powerful tool to exploit the intrinsic relatedness among learning tasks, transfer predictive information among tasks, and achieve better generalized performance than single task learning. During the learning stage, MTL algorithms seek to learn a shared representation (e.g., a shared distribution of a given hyper-parameter [65], a shared low-rank subspace [64], a shared feature subset [118], and clustered task structure [213]), and use the shared representation to bridge between tasks and transfer knowledge. MTL has applications to the bioactivity of small molecular drugs [180, 120, 184] and to genomics [49]. Linear regression based MTL depends heavily on well crafted features, while neural network based MTL allows more flexible task coupling and is able to deliver satisfactory results from a large number of low level features, provided such features have the representative power for the problem.

For complex 3D biomolecular data, the physical features used in machine learning vary greatly in their nature. Typical features are generated from geometric properties, electrostatics, atom types, atomic partial charges, and graph theory based properties [188]. Such manually extracted features can be used in a deep neural network, but the performance heavily relies on feature engineering. In contrast, convolutional neural networks learn to extract high level representations hierarchically from low level features while maintaining the underlying spatial relationships. However, the cost of directly applying convolutional neural networks to 3D biomolecules is huge, especially if long-range interactions are included. A major obstacle in the development of deep learning nets for 3D biomolecular data is the entanglement between geometric complexity and biological complexity.

Most theoretical models for the study of structure-function relationships of biomolecules are based on geometric modeling techniques. Mathematically, these approaches exploit local geometric information, i.e., coordinates, distances, angles, areas, and sometimes curvatures [139], for the physical modeling of biomolecular systems. Indeed, the importance of geometric modeling for structural biology [67] and biophysics cannot be overemphasized. However, geometry based models often contain too much structural detail and are frequently computationally intractable for large structures or datasets. In many biological problems, such as the opening or closing of ion channels, the association or dissociation of binding ligands, the folding or unfolding of proteins, and the symmetry breaking or formation of virus capsids, obvious topological changes exist.
In many biological problems, such as the opening or closing of ion channels, the association or dissociation of binding ligands, the folding or unfolding of proteins, and the symmetry breaking or formation of virus capsids, ob- 35 vious topological changes exist. In fact, one only needs qualitative topological information to understand many physical and biological functions. In short, topology-function relationships exist in many biomolecular systems. Topology offers entirely different approaches and could provide significant simplification of biomolecular data [168, 216, 175, 51, 88, 56, 52, 169]. The study of topology deals with the connectivity of different components in a space, and characterizes independent entities, rings and higher dimensional faces within the space [98]. Topological methods produce a high level of abstraction to many biological processes. For example, the opening and closing of ion channels, the assembly or disassembly of virus capsids, the folding and unfolding of proteins, and the association or dissociation of ligands are reflected by topological changes. The fundamental task of topological data analysis is to extract topological invariants, namely the intrinsic features of the underlying space, of a given data set without additional structure information. Examples include covalent bonds, hydrogen bonds, van der Waals interactions, etc. In this work, we use persistent homology which is a relatively new branch of algebraic topology that embeds multiscale geometric information in topological invariants to achieve an interplay between geometry and topology. It creates a variety of topologies of a given object by varying a filtration parameter, such as the radii of balls centered at the nodes or the level set of a surface function. As a result, persistent homology can capture topological structures continuously over a range of spatial scales. The objective of this chapter is to introduce a new framework for the structure based biomolecular property predictions using element-specific persistent homology, and convolu- tional and multi-task neural networks. In this framework, element-specific persistent homol- ogy reduces geometric and biological complexities and provides a sufficient and structured low level representation for neural networks. Given this representation, convolutional neural 36 networks can then learn from data to extract high level representations of the biomolecular systems, while retaining the spatial relationships, and construct mappings from these rep- resentations to the target properties. For the prediction problems whose available datasets are small, multi-task learning by jointly learning the related prediction problems with larger available datasets helps to extract a proper high level representation for the target applica- tions. The element-specific treatment is inspired by the RF-score method [115] for binding affinity prediction. Element-specific persistent homology is originated in our previous work using classic machine learning methods. [23, 24] In this work, we further develop topology based neural network (TopologyNet) models for the predictions of biomolecular structure- function relationships. Specifically, we integrate ESPH and convolutional neural networks (CNNs) to improve modern methods for protein-ligand binding affinity and protein mutation impact predictions from 3D biomolecular data. In this approach, topological invariants are used to reduce the dimensionality of 3D biomolecular data. 
Additionally, element-specific persistent barcodes offer image-like topological representations to facilitate convolutional deep neural networks. Moreover, biological information is retained by element-specific topological fingerprints and described in multiple channels of our image-like representation. Furthermore, convolutional neural networks uncover hidden relationships between biomolecular topological invariants and biological functions. Finally, a multi-task multichannel topological convolutional neural network (MM-TCNN) framework is introduced to exploit the relations among various structure-function predictions and enhance the prediction for problems with small and noisy training data. Our hypothesis is that many biomolecular predictions share a common set of topological fingerprint representations and are highly correlated with each other. As a result, multi-task deep learning by simultaneous training for globular proteins and membrane proteins improves upon existing predictions of the mutation induced stability changes of membrane proteins, whose training data set is relatively small.

3.2 Methods

In this section, we give a brief explanation of persistent homology before introducing topological representations of protein-ligand binding and protein changes upon mutation. Multichannel topological deep learning and multi-task topological deep learning architectures are constructed for binding affinity and mutation impact predictions. The source code with examples of feature construction for the binding problem and the mutation problem can be found in the supplementary material (S3 Code and S4 Code) of [26]. The network architectures, parameters, and training procedures are listed in S2 Text of [26]. Some auxiliary features, such as sequence features, are used to enhance the models for the mutation applications. The description of the auxiliary features, together with pseudocode for the mutation application, is given in S1 Text of [26].

3.2.1 Persistent homology

Simplicial homology gives a computable way to distinguish one space from another in topology and is built on simplicial complexes, which can be used to extract topological invariants from a given data set. A simplicial complex K is a topological space constructed, under certain rules, from geometric components of a data set, including discrete vertices (nodes, or atoms in a protein), edges (line segments, or bonds in a biomolecule), triangles, tetrahedrons, and their higher dimensional counterparts. Specifically, a 0-simplex is a vertex, a 1-simplex an edge, a 2-simplex a triangle, and a 3-simplex a tetrahedron. The identification of the connectivity of a given data set can follow different rules, which leads to, for example, the Vietoris-Rips (VR) complex, the Čech complex, and the alpha complex. A linear combination of k-simplexes is called a k-chain, which is introduced to associate the topological space, i.e., the simplicial complex, with algebraic groups, which in turn facilitates the computation of the topological invariants (i.e., Betti numbers) of a given data set. Specifically, the set of all k-chains of a simplicial complex K, equipped with a modulo-2 addition operation, forms an abelian group, the chain group. Loosely speaking, a boundary operator systematically eliminates one vertex from a k-simplex at a time, which leads to a family of abelian groups, including the kth cycle group and the kth boundary group. The quotient of the kth cycle group by the kth boundary group is called the kth homology group.
The kth Betti number is computed as the rank of the kth homology group. Persistent homology is constructed via a filtration process, in which the connectivity of the given data set is systematically reset according to a scale parameter. More specifically, a nested sequence of subcomplexes is defined via a filtration parameter, such as the growing radius of balls centered at the protein atoms located at their initial coordinates. For each subcomplex, homology groups and the corresponding Betti numbers can be computed. Therefore, the evolution of the topological invariants over the filtration process can be recorded as a barcode [79] or a persistence diagram. For a given data set, barcodes represent the persistence of its topological features over different spatial scales.

3.2.2 Topological representation of biomolecules

Topological fingerprints

A basic assumption of persistent homology as applied to biomolecular function prediction is that 1D biomolecular persistent barcodes are able to effectively characterize 3D biomolecular structures. We call such barcodes topological fingerprints (TFs) [198, 195]. Fig. 3.1 illustrates the TFs of a wild type protein (PDB:1hmk) and its mutant, obtained from persistent homology calculations using the VR complex. The mutation (W60A), which changes residue 60 from Trp to Ala, is shown in Figs. 3.1a and b: a large residue (Trp) at the protein surface is replaced by a relatively small one (Ala). The corresponding barcodes are given in Figs. 3.1c and d, where the three panels from top to bottom are for Betti-0, Betti-1, and Betti-2, respectively. The barcodes for the wild type are generated using heavy atoms within 6 Å of the mutation site. The mutant barcodes are obtained with the same set of heavy atoms in the protein, except for those in the mutated residue. In the two Betti-0 panels, the difference in the number of bars is equal to the difference in the number of heavy atoms between the wild type and the mutant. Broadly speaking, the lengths of the short bars reflect the bond lengths of the corresponding heavy atoms. Therefore, in both the wild type protein and the mutant, the bond lengths of most heavy atoms are smaller than 1.8 Å. Additionally, bars that end between 1.8 Å and 3.8 Å might correlate with hydrogen bonds. Comparing Figs. 3.1c and d, one can easily note the increase in the number of bars ending in the range of 1.8-3.8 Å in the mutant, which indicates a less compact atom arrangement. In the Betti-1 and Betti-2 panels, the mutant has fewer bars than the wild type does, because a smaller surface residue at position 60 creates fewer ring and cavity contacts with the rest of the protein.

Figure 3.1: An illustration of barcode changes from wild type to mutant proteins [26]. a: The wild type protein (PDB:1hmk) with residue 60 as Trp. b: The mutant with residue 60 as Ala. c: Wild type protein barcodes for heavy atoms within 6 Å of the mutation site. Three panels from top to bottom are Betti-0, Betti-1, and Betti-2 barcodes, respectively. The horizontal axis is the filtration radius (Å). d: Mutant protein barcodes obtained similarly to those of the wild type.
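For intuition about how such barcodes arise, the following sketch computes the dimension-0 barcode of a Vietoris-Rips filtration with a union-find structure: components merge at the edge length where the connecting 1-simplex enters the complex. This is a standard textbook construction for illustration only, not the software pipeline used in this work.

```python
import numpy as np

def betti0_barcode(points):
    """Dimension-0 persistence barcode of a Vietoris-Rips filtration.

    Every point is born at filtration value 0; two components merge at
    the pairwise distance at which the corresponding edge appears
    (single-linkage clustering). Returns (birth, death) pairs, with one
    bar persisting to infinity for the final connected component.
    """
    n = len(points)
    parent = list(range(n))
    def find(i):                       # union-find with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    # Sort all edges by length: the filtration order of the 1-simplices.
    edges = sorted((np.linalg.norm(points[i] - points[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    bars = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                   # two components merge: one bar dies
            parent[ri] = rj
            bars.append((0.0, d))
    bars.append((0.0, np.inf))
    return bars

pts = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
print(betti0_barcode(pts))             # [(0.0, 1.0), (0.0, 4.0), (0.0, inf)]
```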
Element-specific persistent homology

The all heavy atom topological representation of proteins does not provide enough biological information about protein structures, such as the bond length distribution of a given type of atom, hydrogen bonds, and hydrophobic and hydrophilic effects. Therefore, we use element-specific topological fingerprints (ESTFs) to offer a more detailed characterization of protein-ligand binding and protein mutation. For example, Betti-1 and Betti-2 ESTFs from carbon atoms are associated with hydrophobic interaction networks in biomolecules. Similarly, ESTFs between nitrogen and oxygen atoms correlate with hydrophilic interactions and/or hydrogen bonds in biomolecules. However, hydrogen atoms are typically absent from structures in the PDB and thus are not used in our data driven ESTF description. For proteins, commonly occurring heavy atom types include C, N, O, and S. For ligands, we use 9 commonly occurring atom types, namely C, N, O, S, P, F, Cl, Br, and I. To characterize the interactions in protein-ligand binding, we construct cross protein-ligand ESTFs such that one type of heavy atom is chosen from the protein and the other from the ligand; therefore, there are a total of 36 sets of ESTFs in each topological dimension. For mutation characterization, we describe the interactions between the mutated residue and the rest of the protein and arrive at 9 sets of ESTFs in each topological dimension, considering {C, N, O} for protein atoms. Similarly, we generate 9 sets of cross ESTFs in each topological dimension from the wild type protein to study the interactions between the residue to be mutated and the rest of the protein. However, the higher dimensional Betti-1 and Betti-2 invariants require the formation of high order complexes. As non-carbon atoms do not occur very often, Betti-1 and Betti-2 ESTFs are generated for all carbon atoms or all heavy atoms, unless otherwise specified.

The TFs and ESTFs are originally stored as collections of barcodes denoted by $B^{(\alpha, C, D)}$, with $\alpha$ labeling the selection of atoms depending on atom types and affiliations (i.e., protein, ligand, or mutated residue), C denoting the type of simplicial complex (i.e., VR complex or alpha complex), and D indicating the dimension, such as Betti-0, Betti-1, or Betti-2. A collection of barcodes can contain any number of bars and thus cannot be directly fed to deep learning models. Additionally, as shown in Fig. 3.1, it is important to keep track of the birth, death, and persistence patterns of the barcodes, because this information is associated with bond lengths, ring or cavity sizes, flexibility, and steric effects. Moreover, Jeffrey suggested that there are strong, moderate, and weak hydrogen bond interactions with donor-acceptor distances of 2.2-2.5 Å, 2.5-3.2 Å, and 3.2-4.0 Å, respectively [96]. To this end, we construct structured vectors $V^b$, $V^d$, and $V^p$ to respectively describe the birth, death, and persistence patterns of the barcodes at various spatial scales. Practically, the filtration interval [0, L] is divided into n subintervals of equal length and the patterns are characterized on each subinterval. The description vectors are defined as
\[ V^b_i = \big\| \{ (b_j, d_j) \in B^{(\alpha, C, D)} \mid (i-1)L/n \leq b_j \leq iL/n \} \big\|, \quad 1 \leq i \leq n, \]
\[ V^d_i = \big\| \{ (b_j, d_j) \in B^{(\alpha, C, D)} \mid (i-1)L/n \leq d_j \leq iL/n \} \big\|, \quad 1 \leq i \leq n, \]
\[ V^p_i = \big\| \{ (b_j, d_j) \in B^{(\alpha, C, D)} \mid b_j \leq (i-1)L/n, \; d_j \geq iL/n \} \big\|, \quad 1 \leq i \leq n, \quad (3.1) \]
where $\|\cdot\|$ denotes the cardinality of a set and $b_j$ and $d_j$ are the birth and death of bar j. The three types of representation vectors are computed for the sets of Betti-1 and Betti-2 bars. For Betti-0 bars, since their birth positions are uniformly 0, only $V^d$ needs to be addressed.
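A direct Python transcription of the binning in Eq. (3.1) follows; the default interval length and bin count are placeholders, not the settings used in the applications below.

```python
import numpy as np

def barcode_to_vectors(bars, L=50.0, n=200):
    """Structured description vectors of Eq. (3.1) from a barcode.

    bars: list of (birth, death) pairs; the filtration interval [0, L]
    is divided into n equal subintervals. Vb and Vd count bars born or
    dying in each subinterval; Vp counts bars spanning the whole
    subinterval.
    """
    Vb, Vd, Vp = np.zeros(n), np.zeros(n), np.zeros(n)
    edges = np.linspace(0.0, L, n + 1)
    for b, d in bars:
        for i in range(n):
            lo, hi = edges[i], edges[i + 1]
            if lo <= b <= hi:
                Vb[i] += 1
            if lo <= d <= hi:
                Vd[i] += 1
            if b <= lo and d >= hi:    # the bar covers the whole subinterval
                Vp[i] += 1
    return Vb, Vd, Vp

Vb, Vd, Vp = barcode_to_vectors([(0.0, 1.5), (0.2, 3.7), (1.1, 12.0)], L=12.0, n=48)
print(Vb.sum(), Vd.sum(), Vp.max())
```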
To characterize pairwise interactions between atoms, it is convenient to simply use the pairwise distance information between atoms. The corresponding image-like representation, denoted $V^r$, can be constructed similarly to $V^d$ by substituting the set of barcodes with a collection of distances between the atom pairs of interest. It should be noted that $V^r$ is not equivalent to $V^d$ in most simplicial complex setups. Generally speaking, $V^r$ also reflects the 0th order topological connectivity information, and it is used as the characterization of the 0th order connectivity of the biomolecules in the applications shown in this work. Finally, we let $X_s$ denote all the feature vectors of the sth sample and let $Y_s$ denote the corresponding target value.

Image-like multichannel topological representation

To feed the outputs of the TFs into the convolutional neural network, the barcodes are transformed into a 1D-image-like representation with multiple channels. The topological feature vectors $V^b$, $V^d$, and $V^p$ can be viewed as one-dimensional (1D) images: each subinterval of the filtration axis represents a digit (or pixel) in the 1D-image-like representation. Such a treatment describes the topological information with an appropriately chosen resolution of L/n. Meanwhile, the chemical information in the ESTFs of $B^{(\alpha, C, D)}$ is described by multiple channels in the 1D-image-like representation, similar to the RGB color image representation. However, in our description, each pixel is associated with m channels describing different element types, protein mutation status (i.e., wild type and mutant), topological dimension (i.e., Betti-0, Betti-1, and Betti-2), and topological event (i.e., birth, death, and persistence). Each element of the 1D-image-like representation is standardized to have zero mean and unit variance across the data sets. This 1D-image-like topological representation can be easily transferred among problems such as protein-ligand binding affinity modeling and the prediction of protein stability change upon mutation. Traditional machine learning approaches require manual extraction of features for each domain of application. When the convolutional neural network is applied, the convolution layers identify local patterns of atomic interactions, and the fully connected layers then extract higher level descriptions of the system by combining local patterns at various distance scales.

Multichannel topological invariants for protein-ligand binding prediction

In computation, the binding affinity, or alternatively the binding free energy, can be modeled via an energy cycle, as shown in Fig. 3.2, where the main contributors to the process are intermolecular interactions and solvation effects.

Figure 3.2: Energy cycle of protein-ligand binding free energy modeling [26].

In this work, we consider the set of element types $L_e$ = {C, N, O, S, P, F, Cl, Br, I} contained in ligands and $P_e$ = {C, N, O, S} contained in proteins. We define an opposition distance between two atoms $a_i$ and $a_j$ as
\[ d_{op}(a_i, a_j) = \begin{cases} d(a_i, a_j), & A(a_i) \neq A(a_j), \\ \infty, & A(a_i) = A(a_j), \end{cases} \quad (3.2) \]
where d(·,·) is the Euclidean distance between two atoms and A(·) denotes the affiliation of an atom, which is either the protein or the ligand.
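A one-to-one sketch of Eq. (3.2), using simple 'P'/'L' affiliation tags of our own choosing:

```python
import numpy as np

def opposition_distance(xi, xj, affil_i, affil_j):
    """Opposition distance d_op of Eq. (3.2): the Euclidean distance when
    the two atoms belong to different molecules ('P' for protein, 'L' for
    ligand), and infinity otherwise, so that intra-molecular pairs never
    form connections in the filtration.
    """
    if affil_i == affil_j:
        return np.inf
    return float(np.linalg.norm(np.asarray(xi) - np.asarray(xj)))

print(opposition_distance([0, 0, 0], [1, 2, 2], 'P', 'L'))  # 3.0
print(opposition_distance([0, 0, 0], [1, 2, 2], 'P', 'P'))  # inf
```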
The ESTFs used in this application are summarized in Table 3.1. The structured description vectors of the ESTFs are generated according to the definition given in Eq. (3.1). As shown in Table 3.1, five sets of ESTFs are constructed. The differences between the description vectors arising from Set 2 and Set 3, and between those arising from Set 4 and Set 5, are also employed as representation vectors to address the impact of ligand binding, resulting in a total of 72 representation vectors (i.e., channels) forming the 1D-image-like representation of the protein-ligand complex. Pairwise interactions are characterized for the 36 element pairs, with {C, N, O, S} for the protein and {C, N, O, S, F, P, Cl, Br, I} for the ligand, with $V^d$ providing 36 channels. The birth ($V^b$), death ($V^d$), and persistence ($V^p$) vectors for Betti-1 and Betti-2 barcodes are computed for the carbon atoms and all heavy atoms of the protein and of the protein-ligand complex, which results in 24 channels. The difference between the characterization of the protein and that of the protein-ligand complex accounts for another 12 channels. Thus, we have a total of 72 channels. Here, the 0-dimensional TFs describe the intermolecular interactions between the protein and the ligand. The all heavy atom TFs delineate the geometric effect of protein-ligand binding. The TFs of carbon atoms account for hydrophobic effects and also implicitly reflect solvation effects. The distance scale interval [0, 50] Å is divided into bins of length 0.25 Å.

Table 3.1: Topological representations of protein-ligand complexes.

Set | Atoms used | Distance | Complex | Dim.
1 | {a ∈ P | T(a) = e_P} ∪ {a ∈ L | T(a) = e_L}, e_P ∈ P_e, e_L ∈ L_e | d_op | - | 0
2 | {a ∈ P | T(a) ∈ P_e} | Euclidean | Alpha | 1, 2
3 | {a ∈ P | T(a) ∈ P_e} ∪ {a ∈ L | T(a) ∈ L_e} | Euclidean | Alpha | 1, 2
4 | {a ∈ P | T(a) = C} | Euclidean | Alpha | 1, 2
5 | {a ∈ P | T(a) = C} ∪ {a ∈ L | T(a) = C} | Euclidean | Alpha | 1, 2

P and L are the sets of atoms in the protein and in the ligand. T(·) denotes the element type of an atom. e_P is an element type in the protein and e_L is an element type in the ligand. "Complex" refers to the type of simplicial complex used and "Dim." refers to the dimensionality of the topological invariant.

Multichannel topological invariants for the prediction of protein folding free energy change upon mutation

Modeling the protein folding free energy change upon mutation basically involves the unfolded states and the folded structures of the mutant and the wild type, as shown in Fig. 3.3. Since unfolded states of proteins are highly dynamic, which significantly increases the modeling cost due to the need for sampling over a large conformation space, we only analyze the folded states of the mutants and the wild type proteins in this application. Similar to the protein-ligand binding affinity prediction, atomic interactions between specific element types, geometric effects, and hydrophobic effects are characterized. The persistent homology analysis performed in this application is summarized in Table 3.2. The differences between the description vectors arising from Sets 1 and 2, and between those arising from Sets 3 and 4, are also included to account for changes caused by the mutation. The 1D-image-like representation in this application thus has a channel size of 45. The pairwise interaction pattern is characterized for the 9 element pairs from the element set {C, N, O}, for example, the interactions between the carbon atoms of the mutation site and the nitrogen atoms from the rest of the protein. Such characterizations for the mutant protein and the wild type protein, and the difference between these characterizations, account for 27 channels. The birth, death, and bar persistence are characterized for Betti-1 and Betti-2 barcodes for all heavy atoms of both the wild type protein and the mutant protein, resulting in 12 channels.
The difference between the mutant and the wild type, which accounts for 6 channels, is also included. Thus, we have a total of 45 channels. The distance scale interval [0, 12] Å is divided into bins of length 0.25 Å. An example of the persistent homology barcodes of a mutant and its wild type is given in Fig. 3.1.

Table 3.2: Topological representations for the protein mutation problem.

Set | Atoms selected | Distance | Complex | Dim.
1 | {a ∈ P_W \ M_W | T(a) = e_P} ∪ {a ∈ M_W | T(a) = e_M}, e_P, e_M ∈ P_e | d_op | - | 0
2 | {a ∈ P_M \ M_M | T(a) = e_P} ∪ {a ∈ M_M | T(a) = e_M}, e_P, e_M ∈ P_e | d_op | - | 0
3 | {a ∈ P_W | T(a) ∈ P_e} | Euclidean | Alpha | 1, 2
4 | {a ∈ P_M | T(a) ∈ P_e} | Euclidean | Alpha | 1, 2

Here P_W, P_M, M_W, and M_M are the sets of atoms of the wild type protein, the mutant protein, the mutation site in the wild type protein, and the mutated site in the mutant protein. Here P_e = {C, N, O} and T(·) is the same as defined in Table 3.1. The distance function d_op is similar to the one defined in Eq. (3.2), while the affiliation function A(·) returns either M or P\M.

Figure 3.3: Mutation induced protein folding free energy changes [26].

3.2.3 Neuron for persistence barcode

In addition to structured representations of barcodes, we introduce a neuron construction that directly takes a persistence barcode as input. The basic idea is to treat a barcode as a collection of points in a plane. Let $B = \{[b_i, d_i)\}_{i \in I}$ be a barcode, which can be regarded as a collection of pairs of numbers. With a chosen filtering function, one such neuron takes B as input and its output is defined to be
\[ N(B; \Theta, c) = \psi\Big( \sum_{i \in I} \phi(b_i, d_i; \Theta) + c \Big), \quad (3.3) \]
where $\psi$ is an activation function, $\phi$ is a filtering function, and $\Theta$ is the set of parameters in $\phi$ that are to be tuned during the training of the neural network. One natural choice for $\phi$ is a radial basis function, and a specific setup uses the Gaussian,
\[ N_G(B; \mu_b, \mu_d, \sigma_b, \sigma_d, c) = \psi\Bigg( \sum_{i \in I} \exp\Big( -\frac{(b_i - \mu_b)^2}{2\sigma_b^2} - \frac{(d_i - \mu_d)^2}{2\sigma_d^2} \Big) + c \Bigg). \quad (3.4) \]
The basic idea of this construction is that instead of scanning over a predetermined regular region of the persistence diagram, we let the network learn where to examine the patterns of the persistent homology computation. Intuitively, when a radial basis function is used for $\phi$, the output of a neuron characterizes the population of homology generators in a specific area, and the location and coverage of each neuron are learned by the neural network from the given training data. A collection of such neurons forms an input layer of the neural network which directly takes the persistence barcode as input without formulating it into a structured construction. Regular densely connected layers can be stacked upon this input layer to form a multi-layer neural network. The parameters of the filtering function $\phi$ carried by the neurons are regarded as part of the parameters to be tuned in the entire neural network via the training process. This is feasible as long as a proper derivative of $\phi$ can be derived. There are many possible choices for $\phi$; for example, in addition to radial basis functions, a family of Hermite functions can be used to examine higher order patterns.
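A forward-pass sketch of the Gaussian barcode neuron of Eq. (3.4) follows; in a real network the five parameters would be trainable variables updated by backpropagation, and the choice of tanh for ψ here is ours.

```python
import numpy as np

def gaussian_barcode_neuron(bars, mu_b, mu_d, sigma_b, sigma_d, c=0.0, psi=np.tanh):
    """Forward pass of the Gaussian persistence-barcode neuron, Eq. (3.4).

    The neuron sums a Gaussian bump evaluated at every (birth, death)
    point of the barcode, so its output reflects how many homology
    generators fall near (mu_b, mu_d); the location and width parameters
    are meant to be learned during training.
    """
    bars = np.asarray(bars, dtype=float)
    b, d = bars[:, 0], bars[:, 1]
    s = np.sum(np.exp(-(b - mu_b) ** 2 / (2 * sigma_b ** 2)
                      - (d - mu_d) ** 2 / (2 * sigma_d ** 2)))
    return psi(s + c)

barcode = [(0.0, 1.5), (0.2, 1.6), (1.0, 6.0)]
print(gaussian_barcode_neuron(barcode, mu_b=0.1, mu_d=1.5, sigma_b=0.5, sigma_d=0.5))
```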
3.2.4 Multichannel topological convolutional neural network

The preprocessed multichannel topological image is standardized to mean 0 and standard deviation 1 for use in the convolutional neural network. A convolutional neural network with a few 1D convolution layers, followed by several fully connected layers, is used to extract higher level features from the multichannel topological images and to perform regression with the learned features. An illustration of the convolutional neural network structure is shown in Fig. 3.4. A brief review of multichannel topological convolutional neural network concepts is given for the case of 1D-image-like inputs: the convolution operation, the optimization method for feedforward neural networks, and the dropout technique, which prevents overfitting, are discussed. One advantage of multichannel topological convolutional deep neural networks is their ability to extract features hierarchically from low level topological representations.

Figure 3.4: An illustration of the 1D convolutional neural network [26]. The network consists of repeated convolution layers and pooling layers followed by several fully connected layers.

Convolution operation

Consider an n × m second order tensor V, where n is the number of topological feature pixels and m is the number of channels for each pixel. In this approach, n corresponds to the radius filtration dimension of the biomolecular topological analysis and m corresponds to the number of representation vectors used, which are defined in Eq. (3.1). With a predefined window size w, a convolutional filter F can be represented by a w × m second order tensor. By moving the window of size w along the radius filtration direction of V, a sequence of $N_f$ second order tensors, which are subtensors of V, is obtained and can be concatenated to form an $N_f \times w \times m$ third order tensor T. The filter F operated on T results in a first order tensor $T_{ijk} F_{jk}$ by tensor contraction. Concatenating the outputs of $n_f$ filters gives an $N_f \times n_f$ second order tensor. Generally speaking, a 1D convolution layer takes an n × m tensor and outputs an $N_f \times n_f$ tensor.

Dropout

Neural networks with several convolution layers and fully connected layers possess a large number of degrees of freedom, which can easily lead to overfitting. The dropout technique is an easy way of preventing network overfitting [173]. During the training process, hidden units are randomly chosen to feed zero values to their connected neighbors in the next layer. Suppose that a fraction of the neurons at a certain layer is dropped during training. Then, in the testing process, the output of this layer is multiplied by a coefficient such as 1 − λ, where λ is the dropout rate, to approximate the average of the networks obtained after dropout in each training step.

Bagging (bootstrap aggregating)

In addition to the dropout technique, which regularizes each individual model, bagging is a technique that combines the outputs of several separately trained models by averaging, in order to reduce the generalization error. It is based on the assumption that models with randomness in the training process are likely to make different errors on testing data. Generally, bagging trains different models on different subsets of the training set. Since neural networks have relatively high underlying randomness, caused by factors including the random weight initialization and the random mini-batch partition, they can benefit from bagging even if the individual models are trained on the same dataset. In this work, bagging of neural network models trained individually with the same architecture and training dataset is used.
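A minimal sketch of the dropout behavior described above, in its original (non-inverted) form, with the train/test asymmetry made explicit:

```python
import numpy as np

def dropout_forward(x, rate, training, rng=None):
    """Dropout as described in the text: units are zeroed at random during
    training; at test time the full output is scaled by (1 - rate) to
    approximate the average over the dropped sub-networks.
    """
    if training:
        rng = rng or np.random.default_rng()
        mask = rng.random(x.shape) >= rate      # keep each unit with prob 1 - rate
        return x * mask
    return x * (1.0 - rate)

x = np.ones(8)
print(dropout_forward(x, rate=0.5, training=True))   # random zeros
print(dropout_forward(x, rate=0.5, training=False))  # uniformly scaled by 0.5
```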
Incorporating non-image-like features

The deep learning architecture also allows the use of non-image-like features together with image or image-like features. In this work, additional auxiliary features, which are important to mutation analysis, are incorporated after the convolution layers, as shown in Fig. 3.5. This approach leads to a 9% improvement in the mutation prediction on the "S2648" data set.

Figure 3.5: The deep learning architecture for the application to globular proteins [26]. The non-image-like features are incorporated in the multichannel topological convolutional deep neural network by merging the features into the network at one of the fully connected layers.

Multi-task learning

We construct a multi-task multichannel topological convolutional neural network (MM-TCNN) architecture to carry out simultaneous training and prediction. The common topological attributes and underlying physical interactions in the features provide a basis for multi-task predictions. Because the deep neural networks are jointly trained on multiple prediction tasks, we expect the networks to generate robust high-level representations from the low level TFs for the prediction problems, and we expect the refined representation to lead to prediction models with improved generalized performance. From the proposed deep learning models, we hope to gain insights into how the nonlinear and nonlocal interactions among topological features impact various prediction tasks, which could further lead to a better understanding of the interactions among biomolecular prediction tasks. Finally, tasks with insufficient training data are the most likely to benefit from the information collected from tasks with large training sets in a multi-task learning framework.

In the present mutation analysis, there are two data sets. The mutation data of the large data set for globular proteins are more reliable, while those of the small data set for membrane proteins are noisy and less reliable, due to the fact that current technologies for membrane protein mutagenesis experiments are immature. The prediction for membrane proteins benefits from joint learning with the prediction for globular proteins. The coupling of the two predictions through a neural network is shown in Fig. 3.6.

Figure 3.6: The multi-task deep learning architecture for membrane proteins [26]. The globular protein stability change upon mutation prediction is used as an auxiliary task to improve the task of predicting membrane protein stability changes upon mutation. The solid arrows show the path of information passing when the model is applied for predictions. The dotted and dashed arrows mark the paths of backpropagation when the network is trained with the globular protein data set and the membrane protein data set, respectively.

The general objective function to minimize in multi-task learning through neural networks can be decomposed into a training loss, a similarity penalty for the shared layers, and a regularization term:
\[ L(\Theta; X, Y) = \sum_{j=1}^{N} J_j(\Theta_{S_j}, \Theta_{B_j}; X_j, Y_j) + P(\Theta_{S_1}, \cdots, \Theta_{S_N}) + R(\Theta), \quad (3.5) \]
where $\Theta$ is the collection of all parameters to be updated, $\Theta_{S_j}$ is the set of parameters of the shared layers for the jth task, $\Theta_{B_j}$ is the set of parameters of the branch of neurons dedicated to the jth task, and $(X_j, Y_j)$ are the training data for the jth task. Here P is the penalty function, which penalizes differences among the N sets of shared parameters, R(·) is the regularization term, which prevents overfitting, and $J_j$ is the loss function of the jth task. In this work, we force the shared layers of the two problems to be identical, and the regularization of the network is realized using dropout.
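The following is a schematic sketch of the shared-trunk, two-branch topology of Fig. 3.6, written with the modern tensorflow.keras functional API rather than the Keras/Theano stack used in this work. All layer sizes are placeholders, and the alternating-dataset training indicated by the dashed arrows in Fig. 3.6 is not reproduced here.

```python
from tensorflow import keras
from tensorflow.keras import layers

n_pixels, n_channels = 48, 45          # illustrative sizes, not the thesis settings

inputs = keras.Input(shape=(n_pixels, n_channels))
shared = layers.Conv1D(32, 5, activation='relu')(inputs)   # shared convolution layers
shared = layers.MaxPooling1D(2)(shared)
shared = layers.Flatten()(shared)

globular = layers.Dense(64, activation='relu')(shared)     # branch: globular proteins
globular_out = layers.Dense(1, name='globular')(globular)

membrane = layers.Dense(64, activation='relu')(shared)     # branch: membrane proteins
membrane_out = layers.Dense(1, name='membrane')(membrane)

model = keras.Model(inputs, [globular_out, membrane_out])
model.compile(optimizer='adam', loss={'globular': 'mse', 'membrane': 'mse'})
model.summary()
```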
Model training and prediction

Due to the complexity of the network for the mutation example with auxiliary features, a brief parameter search is performed using Hyperopt [15] with only 50 trials, allowing flexibility in the number of neurons, the activation function, and the weight initialization. In the protein-ligand binding example, only around 10 sets of parameters are selected manually and tested, because of the large input size of the problem.

In the protein-ligand binding affinity predictions, we repeatedly train 100 single neural networks individually. To test the performance of bagging, we randomly select 50 trained models from the 100 individually trained networks and output the average of the outputs of the 50 selected models as the prediction. The performance is then computed for the bagged model. This process is repeated 100 times, and both the median and the best results are reported.

In the mutation induced protein stability predictions, we use the same procedure as in the protein-ligand binding prediction for the "S350" task, where the training and testing split is predefined. In the case of cross validation, 10 sets of 5-fold splits are generated randomly and 20 single models are generated for each split. The average prediction is taken over the 20 models within each split, and the median result over the 10 splits is reported. Bagging of only 20 models is performed here because it is not valid to bag predictors across different cross validation splits, and bagging 50 models would result in 50 (individual models) × 10 (cross validation splits) × 5 (folds) = 2500 training processes, which is too computationally expensive.

Software

The Dionysus software [129] with the CGAL library [50] is used for persistent homology computation on alpha complexes. The Javaplex [2] and Dipha [10] software packages are used for persistent homology computation on Vietoris-Rips complexes. The neural networks are implemented using the Keras [40] wrapper with the Theano [177] backend. Various functions from the Numpy and Scipy [185] packages are used to process the data and evaluate the performance.

3.3 Results

3.3.1 Deep learning prediction of protein-ligand binding affinities

Protein-ligand binding is a fundamental biological process in cells that involves detailed molecular recognition and synergistic protein-ligand interaction, and it may involve protein conformational changes. Agonist binding is crucial to receptor functions and typically triggers a physiological response, such as transmitter-mediated signal transduction, hormone and growth factor regulated metabolic pathways, stimulus-initiated gene expression, enzyme production, and cell secretion. Understanding protein-ligand interactions has been a fundamental issue in molecular biophysics, structural biology, and medicine. A specific task in drug and protein design is to predict protein-ligand binding affinity from given structural information [82]. Protein-ligand binding affinity is a measurement of the strength of binding, which indicates the degree of occupancy of a ligand at the corresponding protein binding site, and it is affected by several factors including the intermolecular interaction strength and solvation effects.
The ability to predict protein-ligand binding affinity to a desired accuracy is a prerequisite for the success of many applications in biochemistry, such as protein-ligand docking and drug discovery. In general, there are three types of binding affinity predictors (commonly called scoring functions): physics based [147, 207], empirical [211, 183, 63, 190, 212, 131, 182, 94], and knowledge based [114, 105, 6]. Physics based scoring functions invoke QM and QM/MM approaches [121, 35] to provide unique insights into the molecular mechanism of protein-ligand interactions. A prevalent view is that binding involves intermolecular forces, such as steric contacts, ionic bonds, hydrogen bonds, hydrophobic effects, and van der Waals interactions. Empirical scoring functions work well but require carefully selected data sets and parametrization [183, 63, 190]. However, both physics based and empirical scoring functions employ linear superposition principles that are not explicitly designed to deal with exponentially growing and increasingly diverse experimental data sets. Knowledge based scoring functions use modern machine learning techniques, which utilize nonlinear regression and exploit large data sets to uncover underlying patterns within the data. Given the current massive and complex data challenges, knowledge based scoring functions outperform the other types of scoring functions [211].

In this study, the proposed method is tested on the PDBBind 2007 data set [119]. The PDBBind 2007 core set of 195 protein-ligand complexes is used as the test set, and the PDBBind 2007 refined set excluding the core set, with 1105 protein-ligand complexes, is used as the training set. A comparison between our TNet binding predictor (TNet-BP) and other binding affinity predictors is summarized in Table 3.3. TNet-BP outperforms the other scoring functions reported by Li et al. [115] on the task of binding affinity prediction from structures.

TNet-BP is also validated on a larger dataset, the PDBBind v2016 refined set of 4057 complexes, where the training set contains 3767 samples (the refined set minus the core set) and the testing set is the core set with 290 samples. All model parameters and training procedures are the same as those used for the v2007 dataset, except that the number of epochs is set to 500 instead of 2000 due to the larger data size. The median $R_P$ and RMSE are 0.81 and 1.34 pKd/pKi units, respectively.

Table 3.3: Performance comparisons of TNet-BP and other methods. Comparison of optimal Pearson correlation coefficients $R_P$ and RMSEs (pKd/pKi) of various scoring functions for the prediction of protein-ligand binding affinity on the PDBBind 2007 core set. Except for the result of our TNet-BP, all other results are adopted from Li et al. [115].

Method | R_P | RMSE
TNet-BP | 0.826^a | 1.37
RF::VinaElem | 0.803 | 1.42
RF::Vina | 0.739 | 1.61
Cyscore | 0.660 | 1.79
X-Score::HMScore | 0.644 | 1.83
MLR::Vina | 0.622 | 1.87
HYDE2.0::HbondsHydrophobic | 0.620 | 1.89
DrugScore | 0.569 | 1.96
SYBYL::ChemScore | 0.555 | 1.98
AutoDock Vina | 0.554 | 1.99
DS::PLP1 | 0.545 | 2.00
GOLD::ASP | 0.534 | 2.02
SYBYL::G-Score | 0.492 | 2.08
DS::LUDI3 | 0.487 | 2.09
DS::LigScore2 | 0.464 | 2.12
GlideScore-XP | 0.457 | 2.14
DS::PMF | 0.445 | 2.14
GOLD::ChemScore | 0.441 | 2.15
PHOENIX | 0.616 | 2.16
SYBYL::D-Score | 0.392 | 2.19
DS::Jain | 0.316 | 2.24
IMP::RankScore | 0.322 | 2.25
GOLD::GoldScore | 0.295 | 2.29
SYBYL::PMF-Score | 0.268 | 2.29
SYBYL::F-Score | 0.216 | 2.35

^a Median results (the best R_P = 0.828 and the best RMSE = 1.37 for this method).
3.3.2 Deep learning prediction of protein folding free energy changes upon mutation

Apart from some exceptions, proteins fold into specific three-dimensional structures to provide the structural basis for living organisms. Protein functions, i.e., acting as enzymes, cell signaling mediators, ligand receptors, and structural supports, are typical consequences of a delicate balance between protein structural stability and flexibility. Mutation that changes protein amino acid sequences through non-synonymous single nucleotide substitutions (nsSNPs) plays a fundamental role in selective evolution. Such substitutions may lead to the loss or the modification of certain functions, and mutations are often associated with various human diseases [210, 109]. For example, mutations in proteases and their natural inhibitors result in more than 60 human hereditary diseases [160]. Additionally, mutation can also lead to drug resistance [123]. Artificially designed mutations are used to understand mutation impacts on protein structural stability, flexibility, and function, as well as mutagenic diseases and the evolution pathways of organisms [68]. However, mutagenesis experiments are typically costly and time-consuming. Computational prediction of mutation impacts makes it possible to systematically explore protein structural instabilities, functions, disease connections, and organismal evolution pathways [85], and provides an economical, fast, and potentially accurate alternative to mutagenesis experiments. Many computational methods have been developed in the past decade, including a support vector machine based approach [27], statistical potential based approaches [55], a knowledge-modified MM/PBSA approach [78], Rosetta protocols [100], FoldX (3.0, beta 6.1) [85], SDM [193], DUET [159], PPSC (Prediction of Protein Stability, version 1.0) with the 8 (M8) and 47 (M47) feature sets [205], PROVEAN [39], ELASPIC [16], STRUM [161], and EASE-MM [69].

The proposed method is tested on a data set of 2648 mutation instances of 131 proteins, named the "S2648" data set [55], in a 5-fold cross validation task over the "S2648" set and in a prediction task on the "S350" set, which is a subset of the "S2648" set. The "S2648" set, excluding the "S350" subset, is used as the training set in the prediction of the "S350" set. All thermodynamic data are obtained from the ProTherm database [12]. A comparison of the performance of various methods is summarized in Table 3.4. Among them, STRUM [161], which is based on structural, evolutionary, and sequence information, achieves excellent performance. We therefore have constructed two topology based neural network mutation predictors (TNet-MPs). TNet-MP-1 is based solely on topological information, while TNet-MP-2 is aided by auxiliary features characterizing electrostatics, evolutionary, and sequence information, which are merged into the convolutional neural network at one of the fully connected layers. TNet-MP-2 significantly improves our original topological prediction, indicating the importance of the aforementioned auxiliary information for mutation prediction.
                         S350                   S2648
Method                   n^d   RP    RMSE       n^d   RP    RMSE
TNet-MP-2                350   0.81  0.94       2648  0.77  0.94
STRUM^b                  350   0.79  0.98       2647  0.77  0.94
TNet-MP-1                350   0.74  1.07       2648  0.72  1.02
mCSM^b,c                 350   0.73  1.08       2643  0.69  1.07
INPS^b,c                 350   0.68  1.25       2648  0.56  1.26
PoPMuSiC 2.0^b           350   0.67  1.16       2647  0.61  1.17
PoPMuSiC 1.0^a           350   0.62  1.23       -     -     -
I-Mutant 3.0^b           338   0.53  1.35       2636  0.60  1.19
Dmutant^a                350   0.48  1.38       -     -     -
Automute^a               315   0.46  1.42       -     -     -
CUPSAT^a                 346   0.37  1.46       -     -     -
Eris^a                   334   0.35  1.49       -     -     -
I-Mutant 2.0^a           346   0.29  1.50       -     -     -

Table 3.4: Performance comparisons of TNet-MP and other methods. Comparison of Pearson correlation coefficients (RP) and RMSEs (kcal/mol) of various methods on the prediction task of the “S350” set and 5-fold cross validation of the “S2648” set. TNet-MP-1 is our multichannel topological convolutional neural network model that solely utilizes topological information. TNet-MP-2 is our model that complements TNet-MP-1 with auxiliary features. ^a Data directly obtained from Worth et al [193]. ^b Data obtained from Quan et al [161]. ^c The results reported in the publications are listed in the table. According to Ref. [161], the data from the online server has RP (RMSE) of 0.59 (1.28) and 0.70 (1.13) for INPS and mCSM, respectively, in the task of the S350 set. ^d Number of samples successfully processed.

3.3.3 Multi-task deep learning prediction of membrane protein mutation impacts
Multi-task learning offers an efficient way to improve the predictions associated with small data sets by taking advantage of other, larger data sets [214]. Although a large amount of thermodynamic data is available for globular protein mutations, the mutation data set for membrane proteins is relatively small, with between 200 and 300 instances [108]. The small size of membrane protein mutation data limits the success of data driven approaches such as ensembles of trees, while the popular multi-task learning framework built on linear regression with regularization techniques lacks the ability to extract the relationship between very low level descriptors and the target quantity. A neural network with a hierarchical structure provides a promising option for such problems. We add the prediction of globular protein stability changes upon mutation as an auxiliary task for the prediction of membrane protein stability changes upon mutation. In the designed network architecture, the two tasks share convolution layers, and the network splits into two branches of fully connected layers for the two tasks. Intuitively, the task of globular protein mutation prediction helps to extract higher level features from low level topological representations, so that the branch for membrane protein mutation prediction learns the feature-target relationship from the learned high level features.
The proposed method is tested on a set of 223 mutation instances of membrane proteins covering 7 protein families, named the “M223” data set [108], with 5-fold cross validation. A comparison with other methods is shown in Table 3.5. TNet-MMP-1 employs multichannel topological convolutional neural networks with topological features from the “M223” data set, while TNet-MMP-2 is a multi-task multichannel topological convolutional neural network (MM-TCNN) architecture. Unlike TNet-MP-2, neither TNet-MMP-1 nor TNet-MMP-2 uses auxiliary features. Our goal is to test the performance of the multi-task architecture on the improvement of high level feature extraction from low level features.
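The shared-trunk, two-branch idea described above can be sketched as follows. This is a minimal PyTorch illustration (PyTorch is not the framework used in this work); the layer sizes, channel counts, and names are hypothetical and do not reproduce the actual TNet-MMP architecture.

import torch
import torch.nn as nn

class MultiTaskTCNN(nn.Module):
    def __init__(self, in_channels=8, filtration_len=50):
        super().__init__()
        # convolution layers shared by both tasks (the trunk)
        self.shared = nn.Sequential(
            nn.Conv1d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten())
        feat = 32 * filtration_len
        # task-specific fully connected branches
        self.membrane_head = nn.Sequential(nn.Linear(feat, 64), nn.ReLU(), nn.Linear(64, 1))
        self.globular_head = nn.Sequential(nn.Linear(feat, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        h = self.shared(x)   # high level features extracted jointly for both tasks
        return self.membrane_head(h), self.globular_head(h)

# hypothetical batch: 4 samples, 8 topological channels, 50 filtration bins
model = MultiTaskTCNN()
ddg_membrane, ddg_globular = model(torch.randn(4, 8, 50))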
The Pearson correlation coefficient of membrane protein mutation prediction is improved by 9.6%, i.e., from 0.52 to 0.57, by the multi-task algorithm that trains on and predicts the present “M223” data set together with the “S2648” data set. As noted by Kroncke et al, there are at present no reliable methods for the prediction of membrane protein mutation impacts [108]. Our TNet results, though still not practically useful, are the best among the methods tested on this problem.

Method              RP    RMSE
TNet-MMP-2^d        0.57  1.09
TNet-MMP-1^c        0.52  1.15
Rosetta-MP          0.31  -
Rosetta (high)^a    0.28  -
FoldX               0.26  2.56
PROVEAN             0.26  4.23
Rosetta-MPddG       0.19  -
Rosetta (low)^b     0.18  -
SDM                 0.09  2.40

Table 3.5: Performance comparisons of TNet-MMP and other methods. Comparison of Pearson correlation coefficients (RP) and RMSEs (kcal/mol) on 5-fold cross validation for the “M223” data set for various methods. Except for the present results for TNet-MMP-1 and TNet-MMP-2, all other results are adopted from Kroncke et al [108]. The results of Rosetta methods are obtained from Fig. S1 of Ref. [108], where the RMSE is not given. The results of the other methods are obtained from Table S1 of Ref. [108]. Many less competitive results of the machine learning based methods reported in Ref. [108] are not listed, since these servers were not trained on the membrane protein data set. Among the methods listed, only the Rosetta methods have terms describing the membrane protein system. ^a High resolution. ^b Low resolution. ^c The multichannel topological convolutional neural network architecture with topological features from the “M223” data set. ^d The multi-task multichannel topological convolutional neural network (MM-TCNN) architecture trained with an auxiliary task of globular protein prediction using the “S2648” data set.

3.4 Discussion and conclusion
The adoption of convolutional neural network concepts in this work is motivated by the underlying spatial relationship along the distance scale (filtration) dimension. Properties that reside at different distance scales are heterogeneous, so unlike images or videos, there is no obvious transferable property of the convolution filters along the convolution dimension in the proposed method. To take this into consideration, the convolution layers are substituted with “locally connected layers”, where the local connection property is conserved whilst the filters applied to different distance scales are allowed to differ; a minimal illustration of the distinction is given below. With this substitution, the performance in RP (RMSE) significantly decreases from 0.81 (0.94) to 0.77 (1.02) for the task of “S350” set prediction in the mutation impact example (here and below, the RMSE is in kcal/mol for the mutation problems and in pKd/pKi units for the protein-ligand binding problem). This shows that the construction of lower level features in the lower sparse layers benefits from sharing filters along the distance scale and indicates the existence of some common rules for feature extraction at different distance scales.
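The following NumPy sketch contrasts a 1D convolution (one filter shared across all positions) with a locally connected layer (an independent filter per position). The array sizes are hypothetical, and this is only an illustration of the layer type, not the actual TopologyNet code.

import numpy as np

L, k = 10, 3                      # input length and filter width
x = np.random.rand(L)
n_out = L - k + 1                 # number of output positions (no padding)

# convolution: a single filter w is shared across all positions
w = np.random.rand(k)
conv_out = np.array([x[i:i+k] @ w for i in range(n_out)])

# locally connected: each output position i has its own filter W[i]
W = np.random.rand(n_out, k)
local_out = np.array([x[i:i+k] @ W[i] for i in range(n_out)])

# both layers preserve locality; only the weight sharing differs
print(conv_out.shape, local_out.shape)   # (8,) (8,)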
Intuitively, the dimension 0 inputs describe pairwise atomic interactions, which clearly contribute to the prediction of the target properties. In contrast, the dimension 1 and dimension 2 topological features characterize the hydrophobic network and geometric rings and voids. To understand to what extent the higher topological dimensions help the characterization of biomolecules, we separate the dimension 0 inputs from the higher dimensional inputs in the prediction of the “S350” set in the mutation impact example and in the protein-ligand binding affinity prediction for the v2007 set example.
To compare the performance of different sets of features, 50 single models are trained for each feature set. Twenty of the 50 trained models are randomly chosen and bagged, this procedure is repeated 100 times, and the median results are reported. The individual performances measured by RP (RMSE) for the dimension 0 features are 0.73 (1.09) and 0.82 (1.40) for the mutation and binding predictions, respectively. For the dimension 1 and 2 features, the RP (RMSE) are 0.66 (1.21) and 0.78 (1.54), respectively. The combination of features of all dimensions results in better RP (RMSE) of 0.74 (1.08) and 0.83 (1.37), respectively, showing that the two sets of features both contribute to the predictions. The alpha complex is used for geometric characterization and is therefore embedded in R^3, with Betti numbers up to dimension 2. It is possible that higher dimensional Betti numbers in a more abstract setup, such as a Vietoris-Rips complex built on an interaction network, would enrich the representation and deliver improved results.
Another popular class of machine learning methods is the ensemble of trees. Many modern methods for biomolecular property prediction are based on random forest (RF) and gradient boosting trees (GBTs). An ensemble of decision trees is capable of learning complicated functions, but GBTs learn to partition the feature space based on the training data, which means that they cannot appropriately extrapolate the learned function to situations broader than those covered by the training data. Additionally, data samples are generally unevenly distributed. It has been observed in many applications that, when a dataset contains only a handful of samples with large absolute values of the target property, ensemble of trees methods tend to overestimate (underestimate) the border cases with very negative (positive) target values. Neural networks, owing to their different way of learning the underlying function, appear to deliver better results for these border cases. Therefore, similar to the idea of bagging, ensemble of trees methods and neural network based methods may exhibit different error characteristics for different samples and can potentially improve the predictive power by correcting each other's errors when the results from different models are averaged.
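The consensus protocol described above (train 50 models, repeatedly average random subsets of 20, report the median score) can be written compactly as in the following minimal NumPy sketch, with random arrays standing in for the 50 trained models' predictions and the experimental values.

import numpy as np
rng = np.random.default_rng(0)

n_models, n_samples = 50, 350
preds = rng.normal(size=(n_models, n_samples))   # placeholder model predictions
expt = rng.normal(size=n_samples)                # placeholder experimental values

scores = []
for _ in range(100):                             # 100 repeated baggings
    subset = rng.choice(n_models, size=20, replace=False)
    bagged = preds[subset].mean(axis=0)          # average 20 randomly chosen models
    scores.append(np.corrcoef(bagged, expt)[0, 1])
print(np.median(scores))                         # median RP over the repetitions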
In the example of prediction of the “S350” set, we obtained a performance of 0.82 (0.92) in RP (RMSE) in our other work using handcrafted features with gradient boosting trees [23]. When the results of the two methods are averaged, the performance improves to 0.83 (0.89), which is better than either individual method. A similar improvement is observed for the protein-ligand binding example with the v2007 set: our method based on handcrafted features and gradient boosting trees, with performance 0.82 (1.40) [24], and the method presented in this work, with performance 0.83 (1.37), achieve an improved performance of 0.84 (1.35) when the two results are combined by averaging. An intuitive illustration is shown in Fig. 3.7. It can be seen from the plot that the neural network based method presented in this work performs better than the GBT based method for samples with high or low ∆∆G. The slope of the linear fit of the predicted values to the experimental data is 0.66 for the neural network based method and 0.60 for the GBT based method, which also illustrates that the neural network based method handles border cases better. The observed improvement is marginal, since it mainly concerns a small portion of the samples.

Figure 3.7: A comparison of behaviors of the GBT based method and the neural network based method [26]. The plot is for the prediction task of the S350 dataset. The linear fit for the GBT prediction [23] is y = 0.603x − 0.435 and, for TNet-MP-2, y = 0.657x − 0.422.

In conclusion, the approach introduced in this work utilizes element-specific persistent homology to efficiently characterize 3D biomolecular structures in terms of multichannel topological invariants. Convolutional neural networks facilitate the automatic feature extraction from the multichannel topological invariant inputs. The flexible and hierarchical structure of neural networks allows a seamless combination of automatically extracted features and handcrafted features. It also makes it easy to implement multi-task learning by combining related tasks at a desired level of model sharing, tuned by the layer at which the model branches. The proposed topology based neural network (TopologyNet) methods have been shown to outperform other existing methods in protein-ligand binding affinity prediction and mutation induced protein stability change prediction. The proposed methods can be easily extended to other applications in the structural prediction of biomolecular properties. They have the potential to further benefit from the fast accumulating biomolecular data. The combination of the proposed methods and existing RF and GBT based methods is expected to deliver improved results.

Chapter 4

Persistent cohomology for data with heterogeneous dimensions

4.1 Introduction
With the advancements in sensor hardware, data collection software, and data organization and storage frameworks, various datasets are expanding at an unprecedented speed, and a large part of the newly accumulated data is high-dimensional, highly complex, diverse, and often noisy. The rapid growth of datasets demands robust and automatic data analysis tools. While many widely used data analysis methods make assumptions about data complexity and/or the underlying dimensionality, and some other methods require knowledge from domain experts, an emerging family of data analysis methods called topological data analysis (TDA) [28] makes minimal assumptions about the data. TDA characterizes the shapes of data in various dimensions, scans over a wide range of scales, and is often robust against noise. A point cloud dataset can be regarded as embedded in Euclidean space, which allows the use of the radius filtration associated with the alpha complex [60]. A more general distance filtration associated with a Vietoris-Rips complex [87] or a Čech complex can be used to allow a predefined distance function suitable for specific applications [198, 200]. It is also possible to use a more flexible construction by directly assigning filtration values to the simplices of a complex, which is considered as the final structure at the end of the filtration.
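As an illustration of the filtration machinery just described, the following sketch computes the persistence barcode of a point cloud under the alpha complex filtration using the GUDHI Python library. GUDHI is not the software used in this dissertation, and note that its alpha complex filtration values are squared radii; the point cloud here is random placeholder data.

import numpy as np
import gudhi

points = np.random.rand(100, 3)            # a point cloud in R^3
alpha = gudhi.AlphaComplex(points=points)
st = alpha.create_simplex_tree()           # filtered simplicial complex
diag = st.persistence()                    # list of (dimension, (birth, death))

# collect the H1 barcode, i.e., the persistence intervals of 1-dimensional cycles
h1_bars = [(b, d) for dim, (b, d) in diag if dim == 1]
print(len(h1_bars), "H1 bars")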
In many applications, persistent homology is used to analyze the topological structures of datasets with generalized but homogeneous information. For example, once the genetic distance between genes is defined by the number of mutations, persistent homology can be used to analyze the topological properties of a gene evolution dataset. When the information of a dataset is heterogeneous, i.e., it involves multicomponent information, special treatments are needed. For example, vineyards [45, 133] are used to study spatiotemporal data. Additionally, element specific persistent homology has been introduced to deal with molecular datasets carrying chemical, physical and biological information [23, 24, 26, 20].
In Chapter 3, we developed a deep learning model for biomolecular predictive modeling based on topological representations of mainly the biomolecular geometries. When persistent homology is applied to complex molecular structures, in addition to the point cloud in Euclidean space representing the coordinates of atoms, there is additional physical and chemical information such as element types, atomic partial charges, Coulomb and van der Waals interactions between atoms, and hydrophobic interactions among carbon atoms. Another general situation is that the data has multiple dimensions with heterogeneous meanings, persistent homology analysis is done on certain dimensions, and the information of the other dimensions should also be reflected in the topological analysis. Therefore, there is a need to incorporate multicomponent heterogeneous information into topological representations. Although one can resort to the tight representative cycles of homology generators [144], we prefer the cohomology framework because it is flexible and because it is natural to view cochains as assigning weights on the simplicial complex, which provides more quantitative representations. We consider cohomology theory with a graph Laplacian, or a Laplacian defined on simplicial complexes, to localize and smooth the representatives of (co)homology generators in the data, and we describe the additional information in the form of cochains (functions on chains) by computing inner products of these cochains with the smoothed (persistent) cohomology representatives.
Cohomology provides a richer algebraic structure for a topological space. The cohomology construction used in this work dualizes homology by building cochains as functions on chains in homology theory. Cohomology theory has been applied in both mathematics and the field of data analysis. One well known cohomology theory is the de Rham cohomology, which studies the topological features of smooth manifolds using differential forms. The de Rham cohomology has led to further theoretical developments such as Hodge theory. Recently, a discrete exterior calculus framework has been established [90], where manifolds are approximated by mesh-like simplicial complexes and the discrete counterparts of continuous concepts such as differential forms are defined thereafter. This framework has wide applications; for example, the harmonic component of the discrete Hodge decomposition has been used in the sensor network coverage problem to localize holes in a sensor network [13]. Cohomology theory has also been applied in the field of persistent homology. A 1-dimensional cohomology was used to assign circular values to the input data associated to a homology generator [54], which further led to applications in several fields including the analysis of neural data [167] and the study of periodic motion [181].
Persistent cohomology in higher dimensions has been used to produce coordinate representations which reduce dimensionality while retaining the topological properties of data [155]. Weighted (co)homology and a weighted Laplacian have been introduced with biological applications [194]. Computationally, the duality between homology and cohomology [53] has set the basis for constructing more efficient algorithms that utilize cohomology to compute persistent homology. Several code implementations, such as Dionysus [129] and Ripser [8], speed up the persistent homology computation by taking advantage of this property.
In this work, we seek a formulation that can utilize a function fully or partially defined on a simplicial complex constructed from the input data at the locations associated to homology generators. To this end, we need a representation that can locate homology generators. When manifold-like simplicial complexes are available, we can look for harmonic (in the sense of the Laplace-de Rham operator) cohomologous cocycles under the framework of discrete exterior calculus [89]. A discrete version of the Hodge-de Rham theorem guarantees the uniqueness of the harmonic cocycle if certain conditions are satisfied [89]. However, this method requires the proper construction of the Hodge star operator, which usually relies on a well-defined dual complex; in general applications, this is not always feasible. For example, when a user-defined distance matrix is used with the Rips complex, the distance may not even satisfy the triangle inequality. Therefore, we relax our requirement on the accuracy of the geometric localization and represent the set of simplices of a certain dimension in a simplicial complex as a graph, where the simplices are represented by graph nodes and their adjacency is encoded by graph edges. We can also define a Laplacian on simplicial complexes by first introducing an inner product of cochains and then constructing an adjoint of the coboundary operator. Then, the smoothness of a cocycle can be measured by a Laplacian. Specifically, given a representative cocycle of a homology generator, we look for a cohomologous cocycle that minimizes the norm of its image under the Laplacian. We can then consider such smoothed cocycles, which distribute smoothly around the holes of certain dimensions, as measures on the simplicial complexes, and describe the input functions defined on the simplicial complexes by integrating with respect to these measures. The present formulation also utilizes the filtration process to assign a function over the filtration interval associated to each bar in the barcode representation, resulting in an enriched barcode representation of persistent cohomology. A modified Wasserstein distance is defined and implemented subsequently to facilitate the comparison of these enriched barcodes generated from data.
In the rest of this chapter, basic background on cohomology is given and the proposed method is described in detail in Section 4.2. In Section 4.3, we illustrate the proposed method with simple examples, example datasets, and the characterization of molecules. We also demonstrate the utility of the proposed persistent cohomology by the prediction of protein-ligand binding affinities from large datasets.

4.2 Methods
We refer readers to Section 2.1 for the basics and definition of persistent homology.
4.2.1 Cohomology
Like homology, cohomology is also a sequence of abelian groups associated to a topological space X and is defined from a cochain complex, whose elements are functions on the chain groups of homology theory. Specifically, a k-cochain is a function α : X_k → R, where R is a commutative ring. The set of all k-cochains, with the addition inherited from R, is called the kth cochain group, denoted C^k(X, R). The coboundary operator d_k : C^k(X, R) → C^{k+1}(X, R) maps a cochain to a cochain one dimension higher and is the counterpart of the boundary operator for chains, namely

d_k(α)([v_0, · · · , v_{k+1}]) = Σ_{i=0}^{k+1} (−1)^i α([v_0, · · · , v̂_i, · · · , v_{k+1}]),

for a k-cochain α, where v̂_i denotes the omission of the vertex v_i. It should be noted that in the matrix representations of the two operators, d_k and ∂_{k+1} are transposes of each other. When there is no ambiguity, we simply refer to d_k as d. A k-cochain is called a coboundary if it is in the image of d_{k−1}. A k-cochain is called a cocycle if its image under d_k is 0. The coboundary operators have the property that d_k ◦ d_{k−1} = 0, which follows from d_k ◦ d_{k−1} = ∂_{k+1}^T ◦ ∂_k^T = (∂_k ◦ ∂_{k+1})^T. The kth cohomology group is defined to be the quotient group H^k(X, R) = Ker(d_k)/Im(d_{k−1}). Two cocycles are called cohomologous if they differ by a coboundary. In practice, a field is used instead of a ring due to the computation of persistent (co)homology. In this work, we consider the finite fields Z_p with some prime p when computing cohomology or persistent cohomology. Given a filtration of a simplicial complex, similar to persistent homology, persistent cohomology can be derived with the following relationship:

H^k(X(x_0), Z_p) ← H^k(X(x_1), Z_p) ← · · · ← H^k(X(x_l), Z_p).

The universal coefficient theorem for cohomology (Theorem 3.2 in [86]) implies that there is a natural isomorphism H^k(X, Z_p) ≅ Hom_{Z_p}(H_k(X, Z_p), Z_p), so that the cohomology group can be considered as the dual space of the homology group. This property further implies that rank(H^k(X, Z_p)) = rank(H_k(X, Z_p)), and thus persistent homology and persistent cohomology have identical barcodes [53].

4.2.2 Smoothed cocycle
Some representative cocycles in persistent cohomology may not reflect the overall location and structure associated with their cohomology generators. To better embed the additional information in the data into cohomology generators, we look for a smoothed representative cocycle in each cohomology class. The smoothness of functions can usually be measured by a Laplacian. We construct smoothed representative cocycles with a Laplacian in this section.

Laplacian on simplicial complexes
A Laplacian for cochains can be defined by first defining an inner product and using the induced adjoint operator. Assuming real coefficients, for α_1, α_2 ∈ C^k(X, R), the inner product can be defined as

⟨α_1, α_2⟩_k = Σ_{σ∈X_k} α_1(σ) α_2(σ). (4.1)

Then, the adjoint d_k^* : C^{k+1}(X, R) → C^k(X, R) of the operator d_k with respect to this inner product is defined by

⟨d_k α, β⟩_{k+1} = ⟨α, d_k^* β⟩_k, for α ∈ C^k(X, R), β ∈ C^{k+1}(X, R). (4.2)

Weights reflecting the sizes of simplices can be used to incorporate the geometry by defining a weighted inner product,

⟨α_1, α_2⟩_k^w = Σ_{σ∈X_k} s_σ α_1(σ) α_2(σ), (4.3)

where s_σ is the size of σ, such as the area or volume if σ is a 2- or 3-simplex. Then, a Laplacian on X_k can be defined by

L_sc = d_k^* d_k + d_{k−1} d_{k−1}^*. (4.4)

An inner product based on the wedge product can also be constructed if a manifold-like simplicial complex is given.
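Since the coboundary matrix is the transpose of the boundary matrix, the operators above are easy to realize numerically. The following is a minimal NumPy sketch for a single triangle [v0, v1, v2] with real coefficients; the ordering of the edge basis is a hypothetical choice made for the illustration.

import numpy as np

# boundary matrix of edges: columns = [v0,v1], [v0,v2], [v1,v2], rows = vertices
bd1 = np.array([[-1, -1,  0],
                [ 1,  0, -1],
                [ 0,  1,  1]])
# boundary matrix of the triangle [v0,v1,v2]: rows = edges in the order above
bd2 = np.array([[1], [-1], [1]])

d0 = bd1.T            # coboundary on 0-cochains = transpose of boundary_1
d1 = bd2.T            # coboundary on 1-cochains = transpose of boundary_2

# d1 ∘ d0 = (boundary_1 ∘ boundary_2)^T = 0, i.e., coboundaries are cocycles
print(d1 @ d0)        # [[0 0 0]]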
Weighted graph Laplacian
We can also represent X_k as a graph where the nodes are simplices and the edges describe adjacency. Consider the graph associated to X_k where each k-simplex is represented by a node and there is an edge between two nodes if the corresponding k-simplices have nonempty intersection. Note that this is a simple graph, and we define a weight matrix W = (w_ij) by

w_ij = { v(σ_i) v(σ_j), if σ_i ∩ σ_j ≠ ∅,
       { 0, otherwise, (4.5)

where v(σ) is the size of σ, for example, the area of a 2-simplex or the volume of a 3-simplex. The size of a 0-simplex is defined to be 1. Then, the W-weighted graph Laplacian [42] L_W is defined as

L_W(i, j) = { 1 − w_ij/w_i, if i = j and w_i ≠ 0,
            { −w_ij/√(w_i w_j), if σ_i ∩ σ_j ≠ ∅ and i ≠ j,
            { 0, otherwise, (4.6)

where w_i = Σ_j w_ji. A k-cochain α can be represented by a column vector given the basis in Eq. (4.7). The matrix L_W measures the difference between the value of a cochain on a simplex and its values on the neighbors of that simplex, and a large penalty is given to prevent rapid changes through smaller simplices.

4.2.3 Enriched persistence barcode
We describe the work-flow in this section. Given a simplicial complex X of dimension n and a function f : X_k → R with 0 ≤ k ≤ n, we seek a method to embed the information of f in the persistence barcodes obtained with a chosen filtration of X. In other words, we seek a representation of f on cohomology generators. To this end, smoothed representations are first computed for the cohomology generators. Each such smoothed representation induces a measure on the simplicial complex, which allows us to integrate f on X. We describe the protocol of our approach below.

Dimension greater than 0
Consider a filtration of X, ∅ = X(x_0) ⊆ X(x_1) ⊆ · · · ⊆ X(x_n) = X, and an associated persistent cohomology with a prime p other than 2. Let ω be a representative cocycle for a persistence interval [x_i, x_j) of dimension k > 0. The cocycle ω is first lifted to a cocycle ω̂ with integer coefficients satisfying ω(σ) ≡ ω̂(σ) (mod p) and ω̂(σ) ∈ {i ∈ Z : −(p − 1)/2 ≤ i ≤ (p − 1)/2} for all σ ∈ X_k. This is almost always possible [54]. Now ω̂ is an integer cocycle and thus also a real cocycle. A basis for the k-cochains is {α_{σ_i}}_i, where

α_{σ_i}(σ) = { 1, σ = σ_i,
             { 0, otherwise. (4.7)

Given a Laplacian L on cochains to measure smoothness, a smooth cocycle ω̄ can be obtained by solving the least squares problem

ᾱ = argmin_{α∈C^{k−1}(X,R)} ‖L(ω̂ + dα)‖_2^2, (4.8)

and letting ω̄ = ω̂ + dᾱ (a numerical sketch is given at the end of this subsection). This smoothed cocycle ω̄ induces a measure µ on X_k by setting

µ(σ) = |ω̄(σ)|. (4.9)

To obtain a sequence of such smoothed real k-cocycles for the cohomology generator along a persistence interval, we restrict the representative integer cocycle ω̂ to subcomplexes of X and repeat the smoothing computation. Consider the integer k-cocycle ω̂|_{X(x)} at filtration value x. The corresponding smoothed real k-cocycle ω̄_x can be obtained by solving the optimization problem of Eq. (4.8) for ω̂|_{X(x)} over C^{k−1}(X(x), R), and it induces a measure µ_x on X_k(x) as described in Eq. (4.9). Since we have a finite filtration, it suffices to compute for all distinct filtration values in [x_i, x_j), which gives {µ_{x_ℓ}}_{ℓ=i}^{j−1}. A function of filtration values f* can then be defined for each persistence interval [x_i, x_j) as

f*(x) = ( ∫_{X_k(x)} f dµ_x ) / ( ∫_{X_k(x)} dµ_x ) (4.10)

for x ∈ [x_i, x_j). We call the collection of persistence intervals, each associated with one such function f*, an enriched persistence barcode.
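The minimization in Eq. (4.8) is an ordinary linear least squares problem once L and d are given as matrices. A minimal NumPy sketch, with small random matrices standing in for an actual Laplacian and coboundary operator:

import numpy as np

rng = np.random.default_rng(1)
n_k, n_km1 = 12, 8                 # numbers of k- and (k-1)-simplices (hypothetical)
L = rng.normal(size=(n_k, n_k))    # placeholder Laplacian on k-cochains
d = rng.normal(size=(n_k, n_km1))  # placeholder coboundary matrix C^{k-1} -> C^k
omega_hat = rng.integers(-2, 3, size=n_k).astype(float)  # lifted integer cocycle

# minimize ||L(omega_hat + d alpha)||_2^2, i.e., solve (L d) alpha ≈ -L omega_hat
alpha_bar, *_ = np.linalg.lstsq(L @ d, -L @ omega_hat, rcond=None)
omega_bar = omega_hat + d @ alpha_bar   # smoothed representative cocycle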
Dimension 0
In the case of dimension 0, persistent homology tracks the appearance and merging of connected components. It is convenient to assign a smooth 0-cocycle to a persistence interval by assigning 1 to the nodes in the connected component associated with the interval right before the generator is killed by merging with another connected component. This is implemented with a union-find algorithm.

4.2.4 Preprocessing of the input function
Given the original input function associated with the input data, we first need to generate a cochain of the dimension of interest out of this input function. The procedures for several situations are discussed in the rest of this section.

Case 1
Suppose we are given a function f_0 : X_{k_0} → R and we are interested in its behavior associated with a k-dimensional homology, where k_0 ≠ k. We need to interpolate or extrapolate f_0 to a function f : X_k → R. We assume that f_0 is unoriented, i.e., f_0(σ) = f_0(−σ). A simple way is to take unweighted averages,

f_a(σ) = (1/n_σ) Σ_{i=1}^{n_σ} f_0(σ′_i), (4.11)

where each σ′_i is a k_0-simplex satisfying σ′_i < σ if k > k_0 and σ′_i > σ if k < k_0, and n_σ is the total number of such k_0-simplices. A weighted version based on geometry can be defined as

f_w(σ) = ( Σ_{i=1}^{n_σ} w_i f_0(σ′_i) ) / ( Σ_{i=1}^{n_σ} w_i ), (4.12)

where w_i is the reciprocal of the distance between the barycenters of σ and σ′_i. An example of this situation is the pairwise interaction strengths between the atoms of a molecule, which are naturally defined on the edges connecting the vertices representing the atoms. Another example is the atomic partial charges defined on the vertices representing the atoms in a molecule or a molecular complex.

Case 2
Suppose we are given a function f_0 : R^n → R with n ≥ k and a geometric simplicial complex. We can integrate f_0 on every k-simplex in X to obtain a function f_i : X_k → R. For simplicity, we require f_0 to be bounded. Then, f_i is defined as

f_i(σ) = ( ∫_σ f_0 dσ ) / ( ∫_σ dσ ), (4.13)

for a k-simplex σ, where ∫_σ dσ computes the k-dimensional volume of σ. In many cases, f_0 is given as the result of numerical simulations and is often defined on grid points. Then, the integrals can be computed by some chosen quadrature formula after interpolating f_0 to the collocation points.

4.2.5 Modified Wasserstein distance
An enriched bar can be represented by three elements: the birth value b, the death value d, and the function f* constructed by Eq. (4.10). Given two enriched barcodes of the same dimension, represented by B = {{b_i, d_i, f*_i}}_{i∈I} and B′ = {{b′_j, d′_j, f′*_j}}_{j∈J}, we would like to compute their distance analogously to the Wasserstein distance. We first define two pairwise distances: Δ_b, which measures the distance between two persistence bars,

Δ_b([b, d), [b′, d′)) = max{|b − b′|, |d − d′|}, (4.14)

and Δ_f, which measures the distance between f* and f′*,

Δ_f(f*, f′*) = | (1/(d − b)) ∫_b^d f*(x) dx − (1/(d′ − b′)) ∫_{b′}^{d′} f′*(x) dx |. (4.15)

In practice, it is sometimes too costly to compute the output values of f* for all possible filtration values, and only a subset of the filtration values is selected, such as only the middle value of a bar. In this case, we use the middle Riemann sum to approximate the integration in Eq. (4.15).
For a bijection θ : Ī → J̄, where Ī and J̄ are subsets of I and J, the associated penalties are defined as

P_b(θ; q, B, B′) = Σ_{i∈Ī} ( Δ_b([b_i, d_i), [b′_{θ(i)}, d′_{θ(i)})) )^q
                 + Σ_{i∈I∖Ī} ( Δ_b([b_i, d_i), [(b_i + d_i)/2, (b_i + d_i)/2)) )^q
                 + Σ_{i∈J∖J̄} ( Δ_b([b′_i, d′_i), [(b′_i + d′_i)/2, (b′_i + d′_i)/2)) )^q (4.16)

and

P_f(θ; q, B, B′) = Σ_{i∈Ī} ( Δ_f(f*_i, f′*_{θ(i)}) )^q + Σ_{i∈I∖Ī} ( Δ_f(f*_i, 0) )^q + Σ_{i∈J∖J̄} ( Δ_f(f′*_i, 0) )^q. (4.17)

The qth modified Wasserstein distance is defined as

W_{q,γ}(B, B′) = inf_{θ∈Θ} ( γ P_b(θ; q, B, B′) + (1 − γ) P_f(θ; q, B, B′) )^{1/q}, (4.18)

where γ is a weight parameter, and we denote the minimizer by θ_{q,γ}. Similar to a receiver operating characteristic curve, instead of fixing γ, we let it change from 0 to 1, which results in a function W_q : [0, 1] → R^2 defined as

W_q(γ) = [ P_b(θ_{q,γ}; q, B, B′)^{1/q}, P_f(θ_{q,γ}; q, B, B′)^{1/q} ], (4.19)

which we call a Wasserstein characteristic curve.
The optimization problem can be considered as an assignment problem and solved by the Hungarian algorithm. Given two enriched barcodes B = {{b_i, d_i, f*_i}}_{i=1}^m and B′ = {{b′_j, d′_j, f′*_j}}_{j=1}^n, we first construct a pseudo barcode for each of them to account for the situation where a bar is not paired with another. The pseudo barcodes are B_{B′} = {{(b′_j + d′_j)/2, (b′_j + d′_j)/2, 0}}_{j=1}^n and B′_B = {{(b_i + d_i)/2, (b_i + d_i)/2, 0}}_{i=1}^m. Then the assignment problem between B ∪ B_{B′} and B′ ∪ B′_B is solved with the cost (γ P_b + (1 − γ) P_f)^{1/q}. The linear_sum_assignment tool in the optimize module of the SciPy package [97] is used.
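A minimal sketch of this matching with scipy.optimize.linear_sum_assignment, for q = 1; the bar data are hypothetical, and each bar is given here with its time-averaged f* of Eq. (4.15) precomputed, so Δ_f reduces to an absolute difference:

import numpy as np
from scipy.optimize import linear_sum_assignment

def modified_wasserstein(B1, B2, gamma=0.5):
    """B1, B2: lists of (birth, death, f_avg) triples; returns the q = 1 distance."""
    def pseudo(b, d):                    # zero-length bar at the midpoint
        return ((b + d) / 2, (b + d) / 2, 0.0)
    A = B1 + [pseudo(b, d) for b, d, _ in B2]    # B ∪ B_{B'}
    C = B2 + [pseudo(b, d) for b, d, _ in B1]    # B' ∪ B'_B
    cost = np.zeros((len(A), len(C)))
    for i, (b, d, f) in enumerate(A):
        for j, (b2, d2, f2) in enumerate(C):
            delta_b = max(abs(b - b2), abs(d - d2))   # Eq. (4.14)
            delta_f = abs(f - f2)                     # Eq. (4.15) on averages
            cost[i, j] = gamma * delta_b + (1 - gamma) * delta_f
    rows, cols = linear_sum_assignment(cost)          # Hungarian algorithm
    return cost[rows, cols].sum()

print(modified_wasserstein([(0.0, 2.0, 1.5)], [(0.5, 1.8, 0.7)]))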
4.3 Examples and results

4.3.1 A minimalist example
Consider a simplicial complex X with four vertices and four edges of unit length that form a square, as shown in Figure 4.1. The 1-cochain ω̂ = [1, 0, 0, 0]^T is a real cocycle; the notation means that ω̂(e_0) = 1 and ω̂(e_1) = ω̂(e_2) = ω̂(e_3) = 0. When a uniform weight of 1 is assigned to all edges, the weighted graph Laplacian matrix L_W defined in Eq. (4.6) for X_1 is

L_W = (1/3) [  2  −1   0  −1
              −1   2  −1   0
               0  −1   2  −1
              −1   0  −1   2 ].

Then, we obtain a smoothed cocycle ω̄ = ω̂ + dᾱ = [0.5, 0.5, 0.5, 0.5]^T with a 0-cochain ᾱ = [1, 0.5, 1, 1.5]^T, which minimizes ‖L_W(ω̂ + dᾱ)‖_2^2 to 0.

Figure 4.1: A simple example loop [25].

4.3.2 Example datasets
In this section, we show the smoothed representative 1- and 2-cocycles and the enriched barcodes using artificial datasets. We create some example input functions defined on the nodes and aim to reflect the information about these functions on the enriched barcodes.

Two adjacent annuli
We first consider a point cloud sampled from two adjacent annuli with radii 1 centered at (0, 0) and (2, 2), as shown in Figure 4.2. There are two significant H1 bars associated to the two major circles. Examples of the representative cocycles for the two long H1 bars are shown in Figure 4.3a and b. The associated smoothed cocycles obtained by the method described in Section 4.2.3 are shown in Figure 4.3c and d. Given two datasets with similar geometry but different values on the nodes, we can use the enriched barcodes to distinguish between them; see Figure 4.4 for an example.

Figure 4.2: a: A point cloud sampled from two adjacent annuli. b: The corresponding H1 barcode using the alpha complex filtration [25].

Figure 4.3: Example of smoothed H1 cocycles [25]. a and b: Two representative 1-cocycles corresponding to the two long H1 bars. The edges where the cocycles take nonzero values are drawn in red. c and d: The smoothed 1-cocycles associated to the representative cocycles. The edges where the cocycles take values with magnitudes greater than or equal to 0.035 are drawn in blue. The smoothing is done on the subcomplexes associated to the filtration values at the middle of the corresponding bars.

Figure 4.4: a and b: Two datasets with similar geometry but different information given on the nodes. c and d: The differences are revealed in the enriched H1 barcodes [25].

Cuboid minus two balls
In this example, the object considered is a rectangular cuboid ([0, 4] × [0, 2] × [0, 2]) minus two balls of radius 0.5 centered at (1, 1, 1) and (3, 1, 1). Two thousand points are first sampled from a uniform distribution over the cuboid, and the ones inside the balls are deleted. The dataset with values on the points, the two smoothed cocycles corresponding to the two voids, and the enriched barcode are shown in Figure 4.5.

Figure 4.5: Persistent cohomology enriched barcode example of data points sampled from a porous cuboid [25]. a: The points sampled from an object which is a box minus two balls. b: The H2 enriched barcode showing the two voids in the blue and red regions of the original dataset. c and d: The two smoothed 2-cocycles. The faces where the cocycles take absolute values greater than or equal to 0.01 are drawn in blue. The smoothing is done on the subcomplexes associated to the filtration values at the middle of the corresponding bars.

Figure 4.6: The D3 dataset sampled from an annulus with randomly assigned values on the points, and the corresponding H1 enriched barcode.

4.3.3 Wasserstein distance based similarity
We illustrate in this section the measurement of similarities among persistent cohomology enriched barcodes. We use the enriched barcodes from three datasets, D1, D2, and D3. Here, D1 and D2 are the two datasets shown in Fig. 4.4a and b, respectively, while D3 is shown in the left chart of Fig. 4.6. The Wasserstein characteristic curves defined in Eq. (4.19) for the three datasets are shown in Fig. 4.7. Here, D1 and D2 have the same geometry, and thus their curve lies more to the left, which means a smaller distance between their persistence barcodes. On the other hand, D3 has a similar value assignment on the points as D2, so their curve lies at the bottom, which means a smaller distance in the non-geometric information.

Figure 4.7: Wasserstein characteristic curves.

4.3.4 Analysis of molecules
Cyclic and cage-like structures often exist in complicated macromolecules at various scales. They can be as small as a benzene (a ring) containing 6 heavy atoms or an adamantane (a cage) containing 10 heavy atoms, and some macromolecules, such as buckminsterfullerene and carbon nanotubes, have a global configuration of cyclic or cage-like structure consisting of tens or hundreds of atoms. Persistent homology is good at detecting these structures at multiple scales, and when we label the atoms by their element types, we can also reveal the element composition of the detected structures.
Specifically, if oxygen is of interest, we construct an input function f_0 (see Section 4.2.4) that is defined on the nodes representing the atoms and outputs 1 on oxygen atoms and 0 elsewhere. We illustrate this application in this section using a cyclic structure, cucurbit[8]uril, and a cage-like structure, a B24N24 cage.

Cucurbituril
In this example, we consider a macrocyclic molecule, cucurbit[8]uril, from the cucurbituril family. The molecule contains eight 6-membered rings and sixteen 5-membered rings consisting of carbon and nitrogen atoms. The rings form a big cyclic structure with a relatively tighter opening surrounded by oxygen atoms. The structure is taken from the one provided in the SAMPL6 challenge [1], and the resulting H1 barcodes are shown in Figure 4.8.

Figure 4.8: a: The cucurbit[8]uril molecule viewed from two different angles. The hydrogen, carbon, nitrogen, and oxygen atoms are colored in white, grey, blue, and red. b, c, and d: The H1 enriched barcodes obtained by assigning 1 to nodes of the selected atom types (carbon, nitrogen, and oxygen) and 0 elsewhere [25].

Boron nitride cage
The fullerene-like boron nitride cages exhibit spherical structures similar to fullerenes but consist of boron and nitrogen atoms. The global spherical structure is composed of a collection of local rings containing several atoms. A possible structure of the B24N24 cage given in the supporting information of [209] is used in this example. The molecule and the enriched barcode are shown in Figure 4.9. In this application, the element type could be substituted by other information that the user is interested in, such as partial charge, van der Waals potential, and electrostatic solvation free energy.

Figure 4.9: a: A structure of the B24N24 cage. The nitrogen and boron atoms are colored in blue and grey. b: The enriched barcodes obtained by assigning 1 to boron atoms and 0 elsewhere. The H1 and H2 barcodes are plotted in the bottom and top panels [25].

4.3.5 An application to protein-ligand binding
An important component of computer-aided drug design is the prediction of protein-ligand binding affinity based on given protein-ligand complex structures. Persistent homology is good at identifying rings, tunnels, and cavities at various scales, which are crucial to protein-ligand complex stability and instability. In addition to geometry and topology, chemical and biological complexity also needs to be addressed toward a practically useful method for this application. To this end, for example, the behavior of atoms of different element types can be described by computing persistent homology for subsets of atoms of certain element types. The interaction between the protein and the ligand can be emphasized by prohibiting an edge from forming between two atoms that are both in the protein or both in the ligand, and the electrostatic interactions can be revealed by tweaking the distance matrix used for the filtration to be the interaction strength computed with a chosen physical model such as Coulomb's law. However, these approaches disturb the original geometry and topology of the protein-ligand complexes. With the method proposed in this work, we are able to naturally embed information such as atom type, atomic partial charges, and electrostatic interactions into the barcodes without disturbing the original geometric and topological setup of the molecular systems.
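For concreteness, the interaction-emphasizing modification of the distance matrix mentioned above can be written as in the following minimal NumPy sketch. The coordinate arrays are hypothetical placeholders, and a large finite constant stands in for infinity so that downstream Rips computations remain well behaved.

import numpy as np
from scipy.spatial.distance import cdist

def interaction_distance_matrix(protein_xyz, ligand_xyz, big=1.0e6):
    """Euclidean distances, with intra-protein and intra-ligand pairs disabled."""
    xyz = np.vstack([protein_xyz, ligand_xyz])
    dist = cdist(xyz, xyz)
    n_p = len(protein_xyz)
    dist[:n_p, :n_p] = big     # no edges within the protein
    dist[n_p:, n_p:] = big     # no edges within the ligand
    np.fill_diagonal(dist, 0.0)
    return dist

# hypothetical coordinates: 5 protein atoms and 3 ligand atoms
D = interaction_distance_matrix(np.random.rand(5, 3) * 10, np.random.rand(3, 3) * 10)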
We compute the enriched barcodes for protein-ligand complexes, turn them into structured features, and combine them with machine learning methods for the prediction of binding affinity. The procedure is validated on datasets from the PDBBind database [119], which includes experimentally derived protein-ligand complex structures and the associated binding affinities. An example of an enriched barcode for atomic partial charges is shown in Fig. 4.10.

Figure 4.10: Enriched barcodes focusing on atomic partial charges [25]. a: Ligand (as van der Waals spheres) and surrounding protein atoms (within 12 Å of the ligand, as thick sticks) of PDB entry 1a94. The color reflects the atomic partial charges. b, c, and d: Enriched barcodes for partial charges generated by computing persistent cohomology with the alpha complex filtration on all heavy atoms, all carbon atoms, and nitrogen and oxygen atoms, respectively. The top panel shows the H2 barcode and the bottom one shows the H1 barcode.

Enriched barcodes generation
In addition to the traditional barcode obtained from the persistent homology computation, we would also like to add descriptions of the electrostatic properties of the system. An efficient characterization of this property is the Coulomb potential, where the interaction between two point charges is relatively described by q_i q_j / r_ij, where q_i and q_j are the point charges at a distance of r_ij. The atomic partial charges of the proteins are assigned using the PDB2PQR software [58] with the CHARMM22 force field. Two types of construction of the physical information are used to characterize the systems.
For dimension 0, a collection of subsets of atoms is first identified according to atom type. Specifically, 10 element types (C, N, O, S, P, F, Cl, Br, I, H) are considered for ligands and 5 element types (C, N, O, S, H) are considered for proteins, and a total of 50 subsets of atoms are selected by choosing one element type from each component (protein or ligand). The pairwise distance matrix based on the Euclidean distance is tweaked by setting the distances between atoms both from the protein or both from the ligand to infinity, which emphasizes the interactions between the protein and the ligand. Based on the tweaked distance matrix, the persistent (co)homology computation with the Rips complex is performed. The electric potential is computed for each atom with its nearest neighbor in the other component of the protein-ligand complex and is put on this atom as the additional information. We define the input function f_0^0 : X_0 → R to take 0 on protein atoms and to take the value discussed above on ligand atoms. The average potential over the ligand atoms in each representative 0-cocycle is used to generate features. In this way, the favorability of the protein-ligand electrostatic interactions is explicitly described.
For dimensions 1 and 2, the input function f_0^1 : X_1 → R is defined to output the absolute value of the electric potential on the edge connecting two atoms, to characterize the interaction strengths. The Coulomb potential is modeled as

E_ij = k_e q_i q_j / r_ij,

where k_e is Coulomb's constant, q_i and q_j are the partial charges of atoms i and j, and r_ij is the distance between the two atoms. Persistent (co)homology with the alpha complex is computed on three subsets of the protein-ligand complex: all heavy atoms, all carbon atoms, and all oxygen/nitrogen atoms. For simplicity, all enriched barcodes are computed only at the middle points of the bars.
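The edge input function f_0^1 can be assembled directly from the charges and coordinates. A minimal sketch; the charge and coordinate arrays are hypothetical placeholders, and k_e is kept as 1 since only relative strengths matter for the illustration.

import numpy as np

def coulomb_edge_function(edges, xyz, q, ke=1.0):
    """Return |E_ij| = |ke * q_i * q_j / r_ij| for each edge (i, j)."""
    values = []
    for i, j in edges:
        r_ij = np.linalg.norm(xyz[i] - xyz[j])
        values.append(abs(ke * q[i] * q[j] / r_ij))
    return np.array(values)

# hypothetical system: 6 atoms, 3 edges of a simplicial complex
xyz = np.random.rand(6, 3) * 5
q = np.random.uniform(-0.5, 0.5, size=6)       # partial charges
f1 = coulomb_edge_function([(0, 1), (1, 2), (3, 5)], xyz, q)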
Featurization of barcodes
Given an enriched barcode B = {{b_i, d_i, f*_i}}_{i∈I}, obtained by applying the proposed method to a dataset with an input function f_0 (see Section 4.2.4), we turn it into the fixed-shape arrays required by the machine learning algorithms we choose. Here, the input function is f_0^0 or f_0^1, described in the previous section, for the 0th dimensional or the higher dimensional persistent (co)homology, respectively.
For dimension 0, we first identify a range of scales to focus on; in this application, we are interested in the interval [0, 12) Å. The interval is divided into 6 subintervals {[l_j^0, r_j^0)}_j = {[0, 2.5), [2.5, 3), [3, 3.5), [3.5, 4.5), [4.5, 6), [6, 12)} to address the different types of interactions. For dimension 0, we are interested in the death values of the bars. Therefore, a collection of index sets marking the death values of the bars that fall into each subinterval is calculated as

I_j^0 = {i ∈ I | l_j^0 ≤ d_i < r_j^0}.

For dimensions 1 and 2, we are interested in the interval [0, 6) Å with the alpha complex filtration. The interval is divided into 6 equal-length subintervals {[l_j^{1,2}, r_j^{1,2})}_j. We then define a collection of index sets marking the bars that overlap with each subinterval,

I_j^{1,2} = {i ∈ I | b_i < r_j^{1,2}, d_i ≥ l_j^{1,2}}.

Given a collection of index sets {I_j}_j, a feature vector v_h(B) is defined as

(v_h(B))_j = |I_j|.

When {I_j^0}_j is used, this characterizes the number of component merging events in each filtration parameter subinterval; when {I_j^{1,2}}_j is used, it reflects the ranks of the homology groups at certain stages along the course of the filtration. A feature vector v_f(B, f_0) can be generated subsequently to address the information of the predefined function on the homology generators,

(v_f(B, f_0))_j = ( Σ_{i∈I_j} f̄*_i ) / |I_j|, where f̄*_i = ( ∫_{b_i}^{d_i} f*_i(x) dx ) / (d_i − b_i).

Machine learning model
The application of predicting protein-ligand binding affinity based on structures can be regarded as a supervised learning problem. Generally speaking, we are given a collection of input-output pairs {(x_i, y_i)} and a chosen model, which is a function M(x; θ) with tunable parameters θ. The training process finds a specific setting of M that globally or locally minimizes a penalty function depending on the given data {(x_i, y_i)} and the parameter set θ. Once trained, the model can be used to predict the output for a newly given input. We choose the gradient boosting trees (GBT) method for its accuracy, robustness, and efficiency. GBT is an ensemble of trees method with single decision trees as building blocks. A GBT model is trained by adding one tree at a time so as to reduce the loss of the current model. In practice, different randomly selected subsets of the training data and features are used for each update of the model to reduce overfitting. For every result reported in Table 4.2, a parameter search is done by cross-validation within the training set, where model performance is judged by the Pearson correlation coefficient. The candidate values tried for the hyper-parameters are summarized in Table 4.1. Another hyper-parameter, max_features, is set to sqrt because of the relatively large number of features. The GradientBoostingRegressor module in the scikit-learn (version 0.17.1) [152] software is used.

Hyper-parameter         Candidates
n_estimators            5000, 10000, 20000
max_depth               4, 8, 16
min_samples_split       5, 10, 20
learning_rate           0.0025, 0.005, 0.01
subsample               0.25, 0.5, 0.75
min_samples_leaf        1, 3

Table 4.1: Candidate values for the hyper-parameters of the gradient boosting trees model.
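A minimal scikit-learn sketch of this cross-validated hyper-parameter search over the Table 4.1 grid; the data are random placeholders, and the grid is reduced to a toy subset so the snippet runs quickly (the full search would simply list all of the Table 4.1 candidates).

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

X = np.random.rand(200, 30)          # placeholder feature vectors v_h, v_f
y = np.random.rand(200) * 10         # placeholder binding affinities

grid = {"n_estimators": [100, 200],  # Table 4.1 uses 5000/10000/20000
        "max_depth": [4, 8],
        "learning_rate": [0.005, 0.01],
        "subsample": [0.5]}
gbt = GradientBoostingRegressor(max_features="sqrt")
search = GridSearchCV(gbt, grid, cv=5)
search.fit(X, y)
print(search.best_params_)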
Application
We test the improvement from the enriched barcodes with electrostatic information in the cases of the 0th dimension and the higher dimensions using the PDBBind database. The predictor performance is improved by using the enriched barcodes embedding the electrostatics information. The results are listed in Table 4.2.

                    2007          2013          2015          2016
Dim 0 w/o elec.     0.802 (1.47)  0.754 (1.56)  0.745 (1.56)  0.824 (1.32)
Dim 0 w. elec.      0.796 (1.50)  0.768 (1.53)  0.763 (1.53)  0.833 (1.31)
Dim 1&2 w/o elec.   0.726 (1.65)  0.706 (1.67)  0.718 (1.62)  0.767 (1.46)
Dim 1&2 w. elec.    0.738 (1.64)  0.734 (1.60)  0.737 (1.59)  0.778 (1.44)

Table 4.2: The predictor performance is evaluated by training on the PDBBind refined set excluding the core set and testing on the core set of a certain year's version. The median Pearson's correlation coefficient (root mean squared error) among 10 repeated experiments is reported.

4.4 Discussion and conclusion
Utilizing the richer information carried by cohomology, we introduce a method to reflect in the barcodes the additional information from the dimensions that are not used for the persistent homology computation. This is achieved by finding a smoothed representative cocycle with respect to a Laplacian directly defined on the simplicial complexes or a weighted graph Laplacian. The smoothed cocycles then serve as measures on the simplicial complexes and allow us to integrate the additional information. As a result, in addition to the original persistence barcodes, functions of filtration values associated to each persistence pair are constructed, which enriches the information carried by the barcodes. A similarity score based on the Wasserstein distance is introduced to analyze these enriched barcodes. Returning to the motivating problem, that physical properties should be embedded in the persistence barcodes to better describe biomolecules, the method is shown to improve the performance in the practical task of protein-ligand binding affinity prediction by adding electrostatics information to the barcodes.
This method is potentially useful for a wider range of applications where data come with multiple heterogeneous dimensions. For example, when analyzing a time series dataset in 3-dimensional space using persistent homology, a specific treatment such as vineyards [45] is used instead of directly doing the computation in R^4. Computing persistence over multiple dimensions at the same time [32] also helps to address this general situation. For one specific dimension of a multidimensional dataset, there are also cases where we would like to embed the information carried in this dimension into the persistence barcodes computed for the other dimensions, rather than looking at the persistence of this dimension itself. For example, persistent homology can find loops and voids in biomolecular structures, and we are interested in what kind of charges these homology generators carry. In this case, the duality between homology and cohomology enables us to better localize the homology generators and to examine the charge distributions associated to each generator.
Chapter 5

Evolutionary homology for coupled dynamical systems

5.1 Introduction
In addition to analyzing static structures of biomolecules, we are also interested in analyzing the dynamics of biomolecular systems, which is often related to important biomolecular properties such as stability and instability. The time evolution of complex phenomena is often described by dynamical systems, i.e., mathematical models built on differential equations for continuous dynamical systems or on difference equations for discrete dynamical systems [197, 148, 92, 192]. Most dynamical systems have their origins in Newtonian mechanics. However, these mathematical models typically only admit highly reduced descriptions of the original complex physical systems, and thus their continuous forms do not have to satisfy the Euler-Lagrange equation of the least action principle. Although a low-dimensional dynamical system is not expected to describe the full dynamics of a complex physical system, its long-term behavior, such as the existence of steady states (i.e., fixed points) and/or chaotic states, offers a qualitative understanding of the underlying system. Focused on ergodic systems, dynamic mappings, bifurcation theory, and chaos theory, the study of dynamical systems is a mathematical subject in its own right, drawing on analysis, geometry, and topology. Dynamical systems are motivated by real-world applications, having a wide range of applications to physics, chemistry, biology, medicine, engineering, economics, and finance. Nevertheless, essentially all of the analyses in these applications are qualitative and phenomenological in nature. In order to pass from qualitative to quantitative evaluation of these systems, we look to the newly emerging field of topological data analysis (TDA) [28, 61, 79, 98, 81, 134]. Specifically, we use persistent homology, which provides a multiscale topological characterization of datasets.
The use of homology for the analysis of dynamical systems and time series predates and intertwines with the beginnings of persistent homology [98, 126, 76, 3, 164, 163, 162]. More recently, there has been increased interest in the combination of persistent homology with time series analysis [165]. Some common methods include computing the persistent homology of the Takens embedding [157, 156, 154, 103, 102, 104], studying the sublevelset homology of movies [106, 178], and working with the additional structure afforded by persistent cohomology [54, 18, 181]. Wang and Wei have defined temporal persistent homology over the solution of a partial differential equation derived from differential geometry [186]. This method encodes spatial connectivity into temporal persistence in the Laplace-Beltrami flow and offers accurate quantitative prediction of fullerene isomer stability in terms of total curvature energy for over 500 fullerene molecules. Closely related to our work, Stolz et al. have recently constructed persistent homology from time-dependent functional networks associated with coupled time series [174]. This work uses a weight rank clique filtration over a defined parameter reflecting similarities between trajectories to characterize coupled dynamical systems.
The objective of the present work is to (1) define a new simplicial complex filtration using a coupled dynamical system as input, which encodes the time evolution and synchronization of the system, and (2) use the persistent homology of this filtration to study the system itself.
The resulting persistence barcode is what we call the evolutionary homology (EH) barcode. We are particularly interested in the encoding and decoding of the topological connectivity of a real physical system into a dynamical system. To this end, we regulate the dynamical system by a generalized graph Laplacian matrix defined on a physical system with distinct topology. As such, the regulation encodes the topological information into the time evolution of the dynamical system. We use a well-studied dynamical system, the Lorenz system, to illustrate our EH formulation. The Lorenz attractor is utilized to facilitate the control and synchronization of chaotic oscillators by weighted graph Laplacian matrices generated from protein Cα networks. We create features from the EH barcodes originating from protein networks by using the Wasserstein and bottleneck metrics. The resulting outputs in various topological dimensions are directly correlated with physical properties of protein residue networks. Finally, to demonstrate the quantitative analysis power of the proposed EH, we apply the present method to the prediction of protein thermal fluctuations characterized by experimental B-factors. We show that the present EH provides some of the most accurate B-factor predictions for a set of 364 proteins. Our approach not only provides a new tool for quantitatively analyzing the behavior of dynamical systems but also extends the utility of dynamical systems to the quantitative modeling and prediction of important physical/biological problems.

5.2 Methods
This section is devoted to the methods and algorithms. In Section 5.2.1, we give a brief discussion of coupled dynamical systems and their stability control via a correlation (coupling) matrix, which embeds the topological connectivity of a physical system into the dynamical system. For background on persistent homology theory, we refer readers to Section 2.1. A concept of topological learning is given in Section 5.2.1.3. We then define evolutionary homology on coupled dynamical systems in Section 5.2.2. Finally, the full pipeline as applied to protein flexibility analysis is outlined in Section 5.2.3.

5.2.1 Coupled dynamical systems
The general control of coupled dynamical systems has been well studied in the literature [148, 92, 192, 197]. A brief review is given in this section.

5.2.1.1 Oscillators and coupling
We consider the coupling of N n-dimensional dynamical systems

du_i/dt = g(u_i), i = 1, 2, · · · , N,

where u_i = {u_{i,1}, u_{i,2}, · · · , u_{i,n}}^T is a column vector of size n. In our setup, each u_i is associated to a point r_i ∈ R^d, which will be used to determine influence in the coupling. The coupling of the systems can be very general, but a specific selection is an N × N graph Laplacian matrix A defined for pairwise interactions,

A_ij = { I(i, j), i ≠ j,
       { −Σ_{l≠i} A_il, i = j,

where I(i, j) is a value describing the degree of influence on the ith system induced by the jth system. We assume undirected graph edges, I(i, j) = I(j, i). Let u = {u_1, u_2, · · · , u_N}^T be a column vector with u_i = {u_{i,1}, u_{i,2}, · · · , u_{i,n}}^T. The coupled system is an (N × n)-dimensional dynamical system modeled as

du/dt = G(u) + ε(A ⊗ Γ)u, (5.1)

where G(u) = {g(u_1), g(u_2), · · · , g(u_N)}^T, ε is a parameter, and Γ is an n × n predefined linking matrix.
One choice of $g$ is the Lorenz oscillator defined as
$$g(u_i) = \begin{pmatrix} \delta(u_{i,2} - u_{i,1}) \\ u_{i,1}(\gamma - u_{i,3}) - u_{i,2} \\ u_{i,1}u_{i,2} - \beta u_{i,3} \end{pmatrix}, \qquad (5.2)$$
where $\delta$, $\gamma$, and $\beta$ are parameters determining the state of the Lorenz oscillator. This system is used in this work because of its relative simplicity, rich dynamics, and well-understood behavior.

5.2.1.2 Stability and controllability

Let $s(t)$ satisfy $ds/dt = g(s)$. We say the coupled systems are in a synchronous state if
$$u_1(t) = u_2(t) = \cdots = u_N(t) = s(t).$$
The stability can be analyzed using $v = \{u_1 - s, u_2 - s, \cdots, u_N - s\}^T$ with the following equation obtained by linearizing Eq. (5.1),
$$\frac{dv}{dt} = [I_N \otimes Dg(s) + \epsilon (A \otimes \Gamma)]v, \qquad (5.3)$$
where $I_N$ is the $N \times N$ unit matrix and $Dg(s)$ is the Jacobian of $g$ at $s$. The stability of the synchronous state in Eq. (5.3) can be studied by eigenvalue analysis of the graph Laplacian $A$. Since the graph Laplacian $A$ of an undirected graph is symmetric, it only admits real eigenvalues. After diagonalizing $A$ as
$$A\phi_j = \lambda_j \phi_j, \quad j = 1, 2, \cdots, N,$$
where $\lambda_j$ is the $j$th eigenvalue and $\phi_j$ is the $j$th eigenvector, $v$ can be represented as
$$v = \sum_{j=1}^{N} w_j(t)\phi_j.$$
Then, the original problem on the coupled systems of dimension $N \times n$ can be studied independently on the $n$-dimensional systems
$$\frac{dw_j}{dt} = (Dg(s) + \epsilon\lambda_j \Gamma) w_j, \quad j = 1, 2, \cdots, N. \qquad (5.4)$$
Let $L_{\max}$ be the largest Lyapunov characteristic exponent of the $j$th system governed by Eq. (5.4). It can be decomposed as $L_{\max} = L_g + L_c$, where $L_g$ is the largest Lyapunov exponent of the system $ds/dt = g(s)$ and $L_c$ depends on $\epsilon\lambda_j$ and $\Gamma$. In many numerical experiments in this work, we set $\Gamma = I_n$, the $n \times n$ identity matrix. Then the stability of the coupled systems is determined by the second largest eigenvalue $\lambda_2$. The critical coupling strength $\epsilon_0$ can, therefore, be derived as $\epsilon_0 = L_g/(-\lambda_2)$. A requirement for the coupled systems to synchronize is that $\epsilon > \epsilon_0$, while $\epsilon \leq \epsilon_0$ causes instability.

An example of chaos controlled by coupling is shown in Fig. 5.1. In this example, each alpha carbon atom (Cα) of protein PDB:1E68 is associated with a Lorenz oscillator, and the underlying locations of the oscillators are used to construct the coupling matrix.

Figure 5.1: (a) Chaotic trajectory of one oscillator without coupling. (b) The 70 synchronized oscillators associated with the Cα atoms of protein PDB:1E68 plotted together. [22]

The specific coupling matrix $A = A^{\text{geo}} + A^{\text{seq}}$ used in this example is the sum of a graph Laplacian matrix defined using the geometric coupling,
$$A^{\text{geo}}_{ij} = \begin{cases} -1, & \text{if } i \neq j \text{ and } d^{\text{org}}_{ij} < \epsilon_d, \\ -\sum_{l \neq i} A^{\text{geo}}_{il}, & \text{if } i = j, \\ 0, & \text{otherwise}, \end{cases}$$
and another which takes the amino acid sequence into account,
$$A^{\text{seq}}_{ij} = \begin{cases} \epsilon_{\text{seq}}, & \text{if } (i + 1 + N) \bmod N = j, \\ -\epsilon_{\text{seq}}, & \text{if } (i - 1 + N) \bmod N = j, \\ 0, & \text{otherwise}. \end{cases}$$
Here, $d^{\text{org}}$ is the distance function in the original space; that is, the Euclidean distance between atoms in this example. The mod operator is used because the protein in this example is circular. The parameters used for the example of Fig. 5.1 are $\epsilon_{\text{seq}} = 0.7$ for the sequence coupling, $\epsilon_d = 4$ Å for the spatial cutoff, and $\delta = 10$, $\gamma = 60$, and $\beta = 8/3$ for the Lorenz system. The parameters in Eq. (5.1) are $\epsilon = 10$ and
$$\Gamma = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}.$$
Initial values for all oscillators are randomly chosen.
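The coupled system of Eqs. (5.1)-(5.2) can be integrated directly. The sketch below is a minimal illustration, not the original implementation: it uses the fourth-order Runge-Kutta scheme mentioned throughout this chapter, with the Lorenz parameters of the PDB:1E68 example as defaults, and assumes the coupling matrix A has already been built as above.

import numpy as np

def lorenz(u, delta=10.0, gamma=60.0, beta=8.0 / 3.0):
    """Lorenz right-hand side g(u_i) of Eq. (5.2) for one oscillator u = (x, y, z)."""
    x, y, z = u
    return np.array([delta * (y - x), x * (gamma - z) - y, x * y - beta * z])

def coupled_rhs(U, A, Gamma, eps):
    """du/dt = G(u) + eps * (A kron Gamma) u of Eq. (5.1); U has shape (N, n),
    and row i of A @ U @ Gamma.T equals sum_j A_ij * Gamma @ u_j."""
    G = np.array([lorenz(u) for u in U])
    return G + eps * A @ U @ Gamma.T

def rk4(U, A, Gamma, eps, h=0.01, steps=5000):
    """Fourth-order Runge-Kutta integration of the coupled system."""
    traj = [U.copy()]
    for _ in range(steps):
        k1 = coupled_rhs(U, A, Gamma, eps)
        k2 = coupled_rhs(U + 0.5 * h * k1, A, Gamma, eps)
        k3 = coupled_rhs(U + 0.5 * h * k2, A, Gamma, eps)
        k4 = coupled_rhs(U + h * k3, A, Gamma, eps)
        U = U + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(U.copy())
    return np.array(traj)

# Example: N oscillators with random initial values, Gamma = diag(1, 0, 0)
# U0 = np.random.rand(N, 3); traj = rk4(U0, A, np.diag([1.0, 0.0, 0.0]), eps=10.0)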
5.2.1.3 Topological learning

The proposed method provides a vast but relatively abstract characterization of the objects of interest. It is potentially powerful in quantitative analysis, but is difficult to use with out-of-the-box machine learning or data analysis techniques. In regression analysis or the training part of supervised learning, with $B_i$ being the collection of sets of barcodes corresponding to the $i$th entry in the training data, the problem can be cast into the following minimization problem,
$$\min_{\theta_b \in \Theta_b,\, \theta_m \in \Theta_m} \sum_{i \in I} \mathcal{L}\left(y_i, F(B_i; \theta_b); \theta_m\right),$$
where $\mathcal{L}$ is a scalar loss function, $y_i$ is the collection of target values in the training set, $F$ is a function that maps barcodes to suitable input for the learning models, and $\theta_b$ and $\theta_m$ are the parameters to be optimized within the search domains $\Theta_b$ and $\Theta_m$, respectively. The form of the loss function also depends on the choice of metric and machine learning/regression model. A function $F$ which translates barcodes to a structured representation (tensors with fixed dimension) can be used with popular machine learning models including random forest, gradient boosting trees, and deep neural networks. Another popular class of models is the kernel based models, which depend on an abstract measurement of the similarity or distance between the entries.

Our choices for $F$, defined in Eq. (5.9) of Sec. 5.2.3, will arise from looking at the distance from the specified barcode to the empty barcode, and there is no tuning of $\theta_b$. In Sec. 5.3.2, where we quantitatively analyze protein residue flexibility, we evaluate our method by checking the correlation between each topological feature defined in Eq. (5.9) and the experimental value (blind prediction), as well as the correlation between the output of a linear regression with multiple topological features and the experimental value (regression). In the former case, there is no parameter to be optimized, while in the latter case, the specific minimization problem can be written as
$$\min_{\theta_m \in \mathbb{R}^{n+1}} \sum_{i \in I} \left( y_i - \left[ \mathrm{EH}^{p_1,1}_i, \cdots, \mathrm{EH}^{p_n,n}_i, 1 \right] \cdot \theta_m \right)^2,$$
where $\mathrm{EH}^{p_k,k}_i$ is the topological feature obtained by computing the $p_k$-Wasserstein distance from the empty set to the $k$th barcode associated with the EH computation of the $i$th protein residue (node), $I$ is the set of indices of all residues in the protein, and $y_i$ is the experimental B-factor for the $i$th protein residue, which quantitatively reflects flexibility.

5.2.2 Evolutionary homology (EH) and the EH barcodes

Consider a system of $N$ not yet synchronized oscillators $\{u_1, \cdots, u_N\}$ associated to a collection of $N$ embedded points, $\{r_1, \cdots, r_N\} \subset \mathbb{R}^d$. We assume the global synchronized state is a periodic orbit denoted $s(t)$ for $t \in [t_0, t_1]$, where $s(t_0) = s(t_1)$. For flexibility and generality, we work on post-processed trajectories obtained by applying a transformation function to the original trajectories, $\hat{u}_i(t) := T(u_i(t))$. The choice of function $T$ is flexible and should fit the application; in this work, we choose
$$T(u_i(t)) = \min_{t' \in [t_0, t_1]} \| u_i(t) - s(t') \|_2, \qquad (5.5)$$
which gives 1-dimensional trajectories for simplicity. Further, in our specific example, $\hat{s}(t) := T(s(t)) = 0$, but, again, this is not necessary in general.

We wish to study the effects on the synchronized system of $N$ oscillators (an $(N \times 3)$-dimensional system) after perturbing one oscillator of interest. To this end, we set the initial values of all the oscillators except that of the $i$th oscillator to $s(\bar{t})$ for a fixed $\bar{t} \in [t_0, t_1]$. The initial value of the $i$th oscillator is set to $\rho(s(\bar{t}))$, where $\rho$ is a predefined function playing the role of introducing a perturbation to the system.
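As an implementation note, the transformation of Eq. (5.5) amounts to a nearest-point query against a sampled periodic orbit. A minimal sketch, assuming the orbit s(t) has been densely sampled beforehand:

import numpy as np

def transform_trajectory(u, s_orbit):
    """Eq. (5.5): map each trajectory sample u(t) (shape (T, n)) to its minimal
    Euclidean distance from a sampled periodic orbit s_orbit (shape (M, n)),
    producing a 1-dimensional post-processed trajectory."""
    # Pairwise distances between trajectory samples and orbit samples
    d = np.linalg.norm(u[:, None, :] - s_orbit[None, :, :], axis=-1)
    return d.min(axis=1)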
After the system starts running, some oscillators will be dragged away from and then return to the periodic orbit as the perturbation is propagated and relaxed through the system. Let $\hat{u}^i_j(t)$ denote the modified trajectory of the $j$th oscillator after perturbing the $i$th oscillator at $t = 0$. We focus on the subset of nodes that are affected by the perturbation,
$$V^i = \left\{ n_j \;\middle|\; \max_{t>0}\left\{ \min_{t' \in [t_0,t_1]} \| \hat{u}^i_j(t) - \hat{s}(t') \|_2 \right\} \geq \epsilon_p \right\}$$
for some fixed $\epsilon_p$ determining how much deviation from synchronization constitutes "being affected".

5.2.2.1 Filtration function defined for coupled dynamical systems

Assuming we have perturbed the oscillator for node $n_i$, let $M = |V^i|$. We will now construct a function $f$ on the complete simplicial complex, denoted by $K$ or $K_M$, with $M$ vertices. Here, we abuse notation and write $V^i = \{n_1, \cdots, n_M\}$. The filtration function $f : K_M \to \mathbb{R}$ is built to take into account the temporal pattern of the propagation of the perturbation through the coupled systems and the relaxation (return to synchronization) of the coupled systems. It requires the advance choice of three parameters:

• $\epsilon_p \geq 0$, mentioned above, which determines when a trajectory is far enough from the global synchronized state $s(t)$ to be considered unsynchronized;

• $\epsilon_{\text{sync}} \geq 0$, which controls when two trajectories are close enough to be considered synchronized with each other; and

• $\epsilon_d \geq 0$, which is a distance parameter in the space where the points $r_i$ are embedded, giving control on when the objects represented by the oscillators are far enough apart to be considered insignificant to each other.

We will define the function $f$ by giving its value on simplices in order of increasing dimension. Define
$$t^i_{\text{sync}} = \min\left\{ t \;\middle|\; \int_t^{\infty} \| \hat{u}^i_j(t') - \hat{u}^i_k(t') \|_2 \, dt' \leq \frac{\epsilon_{\text{sync}}}{2}, \; \forall j, k \right\}.$$
That is, $t^i_{\text{sync}}$ is the first time at which all oscillators have returned to the global synchronized state after perturbing the $i$th oscillator. The value of the filtration function for the vertex $n_j$ is defined as
$$f(n_j) = \min\left( \left\{ t \;\middle|\; \min_{t' \in [t_0,t_1]} \| \hat{u}^i_j(t) - \hat{s}(t') \|_2 \geq \epsilon_p \right\} \cup \left\{ t^i_{\text{sync}} \right\} \right). \qquad (5.6)$$

Next, we give the function value $f$ for the edges of $K$. To avoid the involvement of any insignificant interaction between oscillators, an edge between $n_j$ and $n_k$, denoted by $e_{jk}$, is allowed in the earlier stage of the filtration only if $d^{\text{org}}_{jk} \leq \epsilon_d$, where $d^{\text{org}}_{jk}$ is the distance between $r_j$ and $r_k$ in $\mathbb{R}^d$. Specifically, the value of the filtration function for the edge $e_{jk}$ is defined as
$$f(e_{jk}) = \begin{cases} \max\left\{ \min\left\{ t \,\middle|\, \int_t^{\infty} \| \hat{u}^i_j(t') - \hat{u}^i_k(t') \|_2 \, dt' \leq \epsilon_{\text{sync}} \right\},\; f(n_j),\; f(n_k) \right\}, & \text{if } d^{\text{org}}_{jk} \leq \epsilon_d, \\ t^i_{\text{sync}}, & \text{if } d^{\text{org}}_{jk} > \epsilon_d. \end{cases} \qquad (5.7)$$

It should be noted that, to this point, $f$ defines a filtration function because when $d^{\text{org}}_{jk} \leq \epsilon_d$, $f(n_j) \leq f(e_{jk})$ according to the definition given in Eq. (5.7). The property also holds when $d^{\text{org}}_{jk} > \epsilon_d$ because $f(n_j) \leq t^i_{\text{sync}}$ according to the definition in Eq. (5.6) and $f(e_{jk})$ equals $t^i_{\text{sync}}$ in this case.

Figure 5.2: The filtration of the simplicial complex associated to three 1-dimensional trajectories [22]. The trajectories are generated as defined in Sec. 5.2.2.1. Here, each vertex corresponds to the trajectory with the same color.
A vertex is added when its trajectory value exceeds the parameter $\epsilon_p$; an edge is added when its two associated trajectories become close enough together that the area between the curves after that time is below the parameter $\epsilon_{\text{sync}}$. Triangles and higher dimensional simplices are added when all necessary edges have been included in the filtration.

We extend the function to the higher dimensional simplices using the definition on the 1-skeleton. A simplex $\sigma$ of dimension higher than one is included in $K(x)$ if all of its 1-dimensional faces are already included; that is, its filtration value is defined iteratively by dimension as
$$f(\sigma) = \max_{\tau \leq \sigma} f(\tau),$$
where the max is taken over all codimension-1 faces of $\sigma$. Taking the filtration of $K$ using this function (c.f. Eq. (2.1)) means that topological changes only occur at the collection of function values $\{f(n_i)\}_i \cup \{f(e_{jk})\}_{j \neq k}$. Fig. 5.2 shows the filtration constructed for an example consisting of three trajectories.

5.2.2.2 Definition of evolutionary homology

The previous section gives a function $f_i : K_{|V^i|} \to \mathbb{R}$ defined on the complete simplicial complex with $|V^i|$ vertices for each $i = 1, \cdots, N$. From the filtration defined by $f_i$, we then compute the persistence barcode for homology dimension $k$, which we call the $k$th EH barcode, denoted $B^k_i$. The persistent homology computation for dimension $\geq 1$ on the filtered simplicial complex is done using the software package Ripser [8], using the fact that $k$-dimensional homology only requires knowledge of the $k$- and $(k+1)$-dimensional simplices. The 0-dimensional homology is computed with a modification of the union-find algorithm. Fig. 5.3 gives an example of the geometric configurations of two sets of points associated to Lorenz oscillators and their resulting EH barcodes. The EH barcodes effectively examine the local properties of significant cycles in the original space, which is important when the data is intrinsically discrete instead of a discrete sampling of a continuous space. As a result, point clouds with different geometry but similar barcodes under traditional persistence methods (here, traditional means the Vietoris-Rips filtration on the point cloud induced by the embedding) may be distinguished by EH barcodes.

5.2.3 Protein residue flexibility analysis

In this section, we combine all the methods to formulate realistic protein residue flexibility analysis using the EH barcodes. Consider a protein with $N$ residues and let $r_i$ denote the position of the alpha carbon (Cα) atom of the $i$th residue. The coupled systems defined in Eq. (5.1) are used to study protein flexibility, with each protein residue represented by a 3-dimensional Lorenz system. Define the distance for the atoms in the original space as the Euclidean distance between the Cα atoms, $d^{\text{org}}(r_i, r_j) = \| r_i - r_j \|_2$.

Figure 5.3: An example of the construction of the evolutionary homology barcode [22]. The geometry of two embedded systems is shown in (a) and (b). Specifically, (b) consists of six vertices of a regular hexagon with side length $e_1$, and (a) consists of the vertices in (b) with the addition of the vertices of hexagons with side length $e_2 \ll e_1$ centered at each of the previous vertices; here, $e_1 = 8$ and $e_2 = 1$. Figures (c) and (d) are the EH barcodes corresponding to (a) and (b), respectively. A collection of coupled Lorenz systems is used with parameters $\delta = 1$, $\gamma = 12$, $\beta = 8/3$, $\mu = 8$, $\kappa = 2$, $\Gamma = I_3$, and $\epsilon = 0.12$; see Eqs. (5.2), (5.8) and (5.1).
In the model for the $i$th residue, marked in red, the system is perturbed from the synchronized state by setting $u_{i,3} = 2s_3$, with $s_3$ being the value of the third variable of the dynamical system at the synchronized state, and is simulated with step size $h = 0.01$ from $t = 0$ using the fourth-order Runge-Kutta method. The calculation of persistent homology using the Vietoris-Rips filtration with the Euclidean distance on the point clouds delivers similar bars corresponding to the 1-dimensional holes in (a) and (b), which are $[e_1 - e_2, 2(e_1 - e_2))$ and $[e_1, 2e_1)$.

A weighted graph Laplacian matrix is constructed based on the distance function $d^{\text{org}}$ to prescribe the coupling strength between the oscillators, and is defined as
$$A_{ij} = \begin{cases} e^{-\left( d^{\text{org}}(r_i, r_j)/\mu \right)^{\kappa}}, & i \neq j, \\ -\sum_{l \neq i} A_{il}, & i = j, \end{cases} \qquad (5.8)$$
where $\mu$ and $\kappa$ are tunable parameters.

To quantitatively study the flexibility of a protein, one needs to extract topological information for each residue. To this end, we go through the process given in the previous sections once for each residue. When addressing the $i$th residue, we perturb the $i$th oscillator at a time point in a synchronized system and take this state as the initial condition for the coupled systems. See Fig. 5.4 for an example of this procedure when perturbing the oscillator attached to a residue for a given embedding of one particular protein.

Figure 5.4: The result of perturbing residue 31 in protein PDB:1ABA [22]. (a) The modified trajectories as defined in Eq. (5.5) are plotted for each residue after the perturbation at $t = 0$ as a heatmap. The residues are ordered by the (geometric) distance to the perturbed site, from the closest to the farthest. (b) The modified trajectories as defined in Eq. (5.5) are plotted for each residue after the perturbation at $t = 0$ as line plots. The darker lines are closer to the perturbed site. The heatmap shows the filtration value for the edges as defined in Eq. (5.7), and the order of residues is the same as in (a). The parameters for the coupled Lorenz system and the perturbation method are the same as those of Fig. 5.3.

A collection of modified trajectories $\{\hat{u}^i(t)\}_{i=1}^N$ is obtained with the transformation function defined in Eq. (5.5). The persistence over time for $\{\hat{u}^i(t)\}_{i=1}^N$ is computed following the filtration procedure defined in Sec. 5.2.2.1. Let $B^k_i$ be the $k$th EH barcode obtained from the experiment of perturbing the oscillator corresponding to residue $i$. We introduce the following topological features to relate to protein flexibility:
$$\mathrm{EH}^{p,k}_i = d_{W,p}(B^k_i, \emptyset), \qquad (5.9)$$
where $d_{W,p}$ for $1 \leq p < \infty$ is the $p$-Wasserstein distance and $p = \infty$ is the bottleneck distance. We will show that these features characterize the behavior of this particular collection of barcodes, which in turn captures the topological pattern of the coupled dynamical systems arising from the underlying protein structure. The flexibility of any given residue is reflected by how the perturbation-induced stress is propagated and relaxed through the interaction with the neighbors. Such a relaxation process will induce changes in the states of the nearby oscillators. Therefore, the records of the time evolution of this subset of coupled oscillators in terms of topological invariants can be used to analyze and predict protein flexibility.
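To make the pipeline concrete, the sketch below assembles the filtration of Eqs. (5.6)-(5.7) and evaluates the features of Eq. (5.9). Two substitutions are assumptions rather than the original implementation: GUDHI's SimplexTree stands in for the Ripser/union-find combination used in the text, and one common convention is adopted for the Wasserstein distance to the empty barcode (each bar matched to the diagonal under the L-infinity ground metric).

import numpy as np
import gudhi

def eh_barcodes(f_vertex, f_edge, max_dim=2):
    """Persistence of the flag filtration defined by vertex values f(n_j)
    (Eq. (5.6)) and edge values f(e_jk) (Eq. (5.7))."""
    st = gudhi.SimplexTree()
    m = len(f_vertex)
    for j in range(m):
        st.insert([j], filtration=f_vertex[j])
    for j in range(m):
        for k in range(j + 1, m):
            st.insert([j, k], filtration=f_edge[j][k])
    st.expansion(max_dim + 1)  # flag complex: f(sigma) = max over its edges
    st.persistence()           # compute all persistence pairs
    return [st.persistence_intervals_in_dimension(k) for k in range(max_dim + 1)]

def eh_feature(barcode, p=2):
    """Eq. (5.9): p-Wasserstein distance from a barcode to the empty barcode,
    i.e., the total cost of moving every diagram point to the diagonal."""
    pers = np.array([d - b for b, d in barcode if np.isfinite(d)])
    return float(np.sum((pers / 2.0) ** p) ** (1.0 / p)) if pers.size else 0.0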
The difference in the results of this procedure can be seen in the example of Fig. 5.5, where the control of chaotic oscillators attached to a partially disordered protein (PDB:2RVQ) and a well-folded protein (PDB:1UBQ) is demonstrated. Clearly, the folded part of protein 2RVQ has strong correlations or interactions among residues from residue 25 to residue 110, which leads to the synchronization of the associated chaotic oscillators. In contrast, the random coil part of protein 2RVQ does not have much coupling or interaction among residues. Consequently, the associated chaotic oscillators remain in chaotic dynamics during the time evolution. For the folded protein 1UBQ, the associated chaotic oscillators become synchronized within a few steps of simulation, except for a small flexible tail. This behavior underpins the use of coupled dynamical systems for protein flexibility analysis.

Figure 5.5: Left: partially disordered protein, model 1 of PDB:2RVQ. Right: well-folded protein, PDB:1UBQ [22]. The $u_{i,1}$ value of each dynamical system is plotted as a heatmap. The Lorenz system defined in Eq. (5.2) is used with the parameters $\delta = 10$, $\gamma = 28$, $\beta = 8/3$. The coupling matrix $A$ defined in Eq. (5.8) has parameters $\mu = 14$, $\kappa = 2$. The coupled system defined in Eq. (5.1) has parameters $\Gamma = I_3$ and $\epsilon = 0.12$. The system is initialized with random values between 0 and 1 and is simulated from $t = 0$ to $t = 200$ with step size $h = 0.01$. The system is numerically solved using the fourth-order Runge-Kutta method. It can be seen from the heatmaps that the oscillators corresponding to the disordered regions behave asynchronously.

5.3 Results

5.3.1 Disordered and flexible protein regions

To illustrate the correlation between protein residue flexibility and the topological features defined in Eq. (5.9), we study several proteins with intrinsically disordered regions. Intrinsically disordered proteins lack stable 3-dimensional molecular structures. Partially disordered proteins refer to the intrinsically disordered proteins that contain both stable structures and flexible regions. In nature, the disordered regions may play important roles in biological processes which require flexibility.

In what follows, we always work with the coupled Lorenz system parameters, the perturbation method for the $i$th residue, and the simulation described in Fig. 5.3. The simulation is stopped when all oscillators return to the synchronized state. This process is repeated for each residue. Two NMR structures of partially disordered proteins, PDB:2ME9 and PDB:2MT6, are studied. The topological features are computed for each model of the structures and are averaged over the models. The results are plotted in Fig. 5.6. The disordered regions clearly correlate with the peaks of $\mathrm{EH}^{\infty,0}$ and the valleys of $\mathrm{EH}^{\infty,1}$, $\mathrm{EH}^{1,0}$, and $\mathrm{EH}^{1,1}$. The topological features are also able to distinguish between relatively stable coils (the coils that are consistent among the NMR models) and the disordered parts (the parts that differ among the NMR models).

5.3.2 Protein B-factor prediction

Protein B-factors quantitatively measure the relative thermal motion of each atom and reflect atomic flexibility. The x-ray crystal structures deposited in the Protein Data Bank contain experimentally derived B-factors which can be used to validate the proposed method [151, 145]. To analyze protein flexible regions, B-factor prediction is needed for protein structures built from computational models and for some experimentally solved structures obtained with NMR or cryo-EM techniques. Normal mode analysis (NMA) is one of the first methods proposed for B-factor prediction [83].
The Gaussian network model (GNM) [7] was known for its better accuracy and efficiency compared to a variety of earlier methods [204]. The multiscale flexibility-rigidity index (FRI), which is about 20% more accurate than GNM, has been established as the state of the art in B-factor prediction [146].

In this section, we compute the correlation between the topological features and the experimentally derived protein B-factors. We further test the proposed topological features by building a simple linear regression model with a least squares penalty against the experimental B-factors. A collection of 364 diverse proteins reported in the literature is chosen as the validation data (the set of 365 proteins of [145], excluding PDB:1AGN due to issues in the reported B-factors [146]). The sizes of the proteins range from tens to thousands of amino acid residues. The topological features in the model are the same as in the setup given in Sec. 5.3.1.

Figure 5.6: (a) Models 1-3 of PDB:2ME9 with the disordered region colored in blue, red, and yellow for the three models. (b) Similar plot as (a) for PDB:2MT6. (c) Topological features for PDB:2ME9, whose large disordered region is from residue 28 to residue 85. (d) Topological features for PDB:2MT6, whose large disordered region is from residue 118 to residue 151. [22]

An example of the resulting persistence barcodes for a relatively rigid and a relatively flexible residue is shown in Fig. 5.7. It is seen that the residue with a relatively small B-factor has many H0, H1 and H2 bars. Compared to the residue having a large B-factor, it has a much richer dynamical response and barcodes with more bars. Additionally, its H0 bars are much shorter, indicating a stronger interaction with neighboring residues.

Figure 5.7: Barcode plots for two residues. (a) Residue 6 of PDB:2NUH with a B-factor of 12.13 Å². (b) Residue 49 of PDB:2NUH with a B-factor of 33.4 Å². [22]

The computed topological features are plotted for a relatively small protein and a relatively large protein in Fig. 5.8. Clearly, the 0-dimensional topological features, specifically $\mathrm{EH}^{\infty,0}$, provide a reasonable approximation to experimental B-factors. The regression using all topological information, EH, offers a very good approximation to experimental B-factors. A summary of the results and a comparison to other methods is shown in Table 5.1 for the set of 364 proteins. It is seen that the present evolutionary topology based prediction outperforms the other methods in computational biophysics. A possible reason for this excellent performance is that the proposed method gives a more detailed description of residue interactions in terms of three different topological dimensions and two distance metrics. This example indicates that the proposed EH has great potential for other important biophysical applications, including the prediction of protein-ligand binding affinities, mutation induced protein stability changes, and protein-protein interactions.
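Before the summary table, the regression protocol described above (a least squares fit of EH features against experimental B-factors within one protein, scored by the Pearson correlation R_P) can be prototyped as follows; this is a schematic reading of the minimization in Sec. 5.2.1.3, not the original code.

import numpy as np
from scipy.stats import pearsonr

def bfactor_regression(features, b_factors):
    """Least-squares fit of experimental B-factors against EH features within
    one protein: features has shape (n_residues, n_features); a constant
    column is appended, matching the [EH..., 1] . theta_m formulation."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    theta, *_ = np.linalg.lstsq(X, b_factors, rcond=None)
    predicted = X @ theta
    return pearsonr(predicted, b_factors)[0]  # Pearson correlation R_P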
Table 5.1: The averaged Pearson correlation coefficients (R_P) between the computed values (blind prediction for the topological features and regression for the other models) and the experimental B-factors for a set of 364 proteins [146].

Left: prediction R_P based on EH barcodes.

    Method       Description             R_P
    EH^{∞,0}     Topological feature      0.586
    EH^{∞,1}     Topological feature     -0.039
    EH^{∞,2}     Topological feature     -0.097
    EH^{1,0}     Topological feature     -0.477
    EH^{1,1}     Topological feature     -0.381
    EH^{1,2}     Topological feature     -0.104
    EH^{2,0}     Topological feature      0.188
    EH^{2,1}     Topological feature     -0.258
    EH^{2,2}     Topological feature     -0.100
    EH           Topological features     0.691

Right: comparison of the R_P of predictions from different methods.

    Method   Description                    R_P
    EH       Topological metrics            0.691
    mFRI     Multiscale FRI [146]           0.670
    pfFRI    Parameter-free FRI [145]       0.626
    GNM      Gaussian network model [145]   0.565

Here, EH is the linear regression using EH^{∞,0}, EH^{∞,1}, EH^{1,0}, EH^{1,1}, EH^{2,0}, and EH^{2,1} within each protein. For a few large and multi-chain proteins (i.e., 1F8R, 1H6V, 1KMM, 2D5W, 3HHP, 1QKI, and 2Q52), to reduce the computational time and as a good approximation, we compute their EH barcodes on separated (protein) chains. We see from the second table that the proposed EH barcode method outperforms the other methods in this application.

Figure 5.8: B-factors and the computed topological features. EH shows the linear regression with EH^{∞,0}, EH^{∞,1}, EH^{1,0}, EH^{1,1}, EH^{2,0}, and EH^{2,1} within each protein. (a) PDB:3PSM with 94 residues. (b) PDB:3SZH with 697 residues. [22]

5.4 Conclusion

Many dynamical systems are designed to understand time-dependent phenomena in the real world. The topological analysis of dynamical systems is scarce in general, partially due to the fact that the topological structure of most dynamical systems is typically simple. In this work, we have introduced evolutionary homology (EH) to analyze the topology of dynamical systems and its time evolution. We present a method to embed the external topology of a physical system into dynamical systems. EH examines the embedded topology and converts it into topological invariants over time. The resulting barcode representation of the topological persistence is able to unveil the quantitative topology-function relationship of the embedded physical system.

We have chosen the well-known Lorenz system as an example to illustrate our EH formulation. An important biophysical problem, protein flexibility analysis, is employed to demonstrate the proposed topological embedding of realistic physical systems into dynamical systems. Specifically, we construct weighted graph Laplacian matrices from protein networks to regulate the Lorenz system, which leads to the synchronization of the chaotic oscillators associated with protein residue network nodes. Simplices, simplicial complexes, and homology groups are subsequently defined using the adjacent Lorenz oscillator trajectories. Topological invariants and their persistence are computed over the time evolution (filtration) of these oscillators, unveiling protein thermal fluctuations at each residue. The Wasserstein and bottleneck metrics are used to quantitatively discriminate the EH barcodes of different protein residues. The resulting model using the EH barcodes is found to outperform both geometric graph and spectral graph theory based methods in the protein B-factor prediction for a commonly used benchmark set of 364 proteins. The proposed EH method can be used to study the topological structure of a general physical system.
Moreover, the present method extends the utility of dynamical systems, which are usually designed for qualitative analysis, to the quantitative modeling and prediction of realistic physical systems. Finally, the proposed approach can be readily applied to the study of a wide variety of topology-function relationships, both within computational biology, such as the role of topology in protein-ligand, protein-protein, protein-metal and protein-nucleic acid interactions, and in other interactive graphs and networks in science and engineering.

Chapter 6

Topological characterization of static macrobiomolecules and small molecules

6.1 Introduction

Despite the competitive out-of-the-box performance of persistent homology as a featurization tool in supervised learning tasks, careful designs tailored to the field of application can further improve the performance. To this end, we present a comprehensive assessment of the representability of persistent homology for both macromolecules and small molecules. We introduce several ways of characterizing macromolecules and small molecules, whose quality is judged by the problem of protein-ligand binding affinity prediction. We propose multi-level persistent homology, specifically designed for the characterization of small molecules, and we show that such a representation is able to capture subtle changes in small molecules. The best protocol benchmarked on the protein-ligand binding affinity prediction problem is then applied to virtual screening, where more than a hundred thousand target-candidate pairs are involved; our model achieved top performance in a benchmark using the DUD (directory of useful decoys) database.

The rest of this chapter is organized as follows. In Section 6.2, we discuss in detail the biology and chemistry that need to be addressed by a representation tool. We introduce several persistent homology based methods focusing on the geometry, chemistry, and electrostatic properties of macromolecules and small molecules in Section 6.3. The results are shown in Section 6.4, followed by a detailed discussion of several model aspects of persistent homology based machine learning models for biomolecular systems in Section 6.5.

6.2 Biological considerations

The development of persistent homology was motivated by its potential in the dimensionality reduction, abstraction, and simplification of biomolecular complexity [62]. In the early applications of persistent homology to biomolecules, emphasis was given to major or global features (long-persisting features) to derive descriptive tools. For example, persistent homology was used to identify the tunnel in a Gramicidin A channel [62] and to study membrane fusion [99]. For the predictive modeling of biomolecules, features at a wide range of scales might all be important to the target quantity [198]. At the global scale, the biomolecular conformation should be captured. At the intermediate scale, the smaller intra-domain cavities need to be identified. At the most local scale, the important substructures should be addressed, such as the pyrrolidine in the side chain of proline. Biomolecules are both structurally and biologically complex. Their geometric and biological complexities include covalent bonds, non-covalent interactions, effects of chirality, cis and trans distinctions, multi-leveled protein structures, and protein-ligand and protein-nucleic acid complexes. Covering a large range of spatial scales is not enough for a powerful model. The biological details should also be explored.
We address the underlying biology and physics by modifying the distance function and by selecting various sets of atoms according to element types to describe different interactions. Some biological considerations are discussed in this section.

Covalent bonds. Covalent bonds are formed via shared electron pairs or bonding pairs. The lengths and the number of covalent bonds can be easily detected from 0th dimensional barcodes. For macromolecules, the same type of covalent bond has very similar bond lengths and thus similar 0th dimensional barcode patterns.

Non-covalent interactions. Non-covalent interactions play a critical role in maintaining the 3D structure of biomolecules and in mediating chemical and biological processes, such as solvation, binding, protein-DNA specification, molecular self-assembly, etc. Physically, non-covalent interactions are due to electrostatics, van der Waals forces, hydrogen bonds, π-effects, hydrophobic effects, etc. The ability to characterize non-covalent interactions is an essential task in any methodological development. The 1st and 2nd dimensional barcodes are suitable for characterizing the arrangement of such interactions at a larger scale. Additionally, we propose multi-level persistence and electrostatic persistence to reveal local and pairwise non-covalent interactions via 0th dimensional barcodes as well.

Chirality, cis effect and trans effect. Chirality, cis and trans effects are geometric properties of many molecules. Among them, chirality is a symmetry property such that a chiral molecule cannot be superposed on its mirror image. Cis and trans effects are due to molecular steric and electronic effects. Chirality, cis and trans effects often play a role in molecular kinetics, activity, and catalysis, and thus their characterization is an important issue in developing topological methods. These effects should be reflected in barcodes of various dimensions.

Multi-leveled protein structures. Protein structures are typically described in terms of primary, secondary, tertiary and quaternary levels. The protein primary structure is the linear sequence of amino acids in the polypeptide chain. Protein secondary structure refers to the local 3D structure of protein segments, containing mainly α-helices and β-sheets, which are highly regular and can be easily detected by distinct Frenet-Serret frames. A tertiary structure refers to the 3D structure of a single polypeptide chain. Its formation involves various non-covalent and covalent interactions, including salt bridges, hydrophobic effects, and often disulfide bonds. A quaternary structure refers to the aggregation of two or more individually folded protein subunits into a 3D multi-subunit complex. Protein structures are further complicated by functional domains, motifs, and particular folds. The protein structural diversity and complexity result in both challenges and opportunities for methodological development.

Protein-ligand, protein-protein, and protein-nucleic acid complexes. Topological characterization of proteins is further complicated by protein interactions or binding with ligands (drugs), other proteins, DNA and/or RNA molecules.
Although a normal protein involves only carbon (C), hydrogen (H), nitrogen (N), oxygen (O) and sulfur (S) atoms, its protein-ligand complexes bring a variety of other elements into play, including phosphorus (P), fluorine (F), chlorine (Cl), bromine (Br), iodine (I), and many important biometals, such as calcium (Ca), potassium (K), sodium (Na), iron (Fe), copper (Cu), cobalt (Co), zinc (Zn), manganese (Mn), chromium (Cr), vanadium (V), tin (Sn), and molybdenum (Mo). Each biological element has important biological functions, and its presence in biomolecules should not be treated uniformly as a set of indistinguishable points in the point cloud data. The interaction of proteins and nucleic acids can be very intricate.

6.3 Methods

6.3.1 Element specific persistent homology

One important issue is how to protect chemical and biological information during the topological simplification. As mentioned earlier, one should not treat different types of atoms as homogeneous points in point cloud data. To this end, we propose element specific persistent homology, or multi-component persistent homology, to retain biological information in the topological analysis [24]. The element selection is similar to a predefined vertex color configuration for graphs. When all atoms are passed to persistent homology algorithms, the information extracted mainly reflects the overall geometric arrangement of a biomolecule at different scales. By passing only atoms of certain element types or of certain roles to the persistent homology analysis, different types of interactions or geometric arrangements can be revealed. In protein-ligand binding modeling, the selection of all carbon atoms characterizes the hydrophobic interaction network, whilst the selection of all nitrogen and/or oxygen atoms characterizes the hydrophilic network and the network of potential hydrogen bonds. In protein structural analysis, computation on all atoms can identify geometric voids inside the protein, which may suggest structural instability, and computation on only Cα atoms reveals the overall structure of the amino acid backbone. In addition, the combination of various selections of atoms based on element types provides a very detailed description of the biomolecular system, and the hidden relationships from structure to function can then be learned by machine learning algorithms. This may lead to the discovery of important interactions not realized a priori. This can be realized by passing the set of atoms of the selected element types to the persistent homology computation, as sketched below. This concept can be used with various constructions of the distance matrix for persistent homology computation based on Vietoris-Rips complex filtration.
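A minimal sketch of this selection step, assuming atoms are given as (element, coordinate) pairs and using the ripser package for the Vietoris-Rips computation:

import numpy as np
from ripser import ripser

def element_specific_persistence(atoms, elements, maxdim=2):
    """Element specific persistent homology: restrict the point cloud to atoms
    of the selected element types, then compute Vietoris-Rips persistence.
    E.g., elements={'C'} probes the hydrophobic network, while {'N', 'O'}
    probes the hydrophilic / hydrogen-bond network."""
    pts = np.array([xyz for elem, xyz in atoms if elem in elements])
    return ripser(pts, maxdim=maxdim)['dgms']  # one diagram per dimension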
6.3.2 Construction of distance matrix

Biomolecular systems are not only complex in geometry, but also in chemistry and biology. To effectively describe complex biomolecular systems, it is necessary to modify the filtration process. There are three commonly used filtrations, namely, radius filtration, distance matrix filtration, and density filtration, for biomolecules [198, 201]. The distance matrices can be used with a more abstract construction of simplicial complexes, such as the Vietoris-Rips complex.

6.3.2.1 Multi-level persistent homology.

Small molecules such as ligands in protein-ligand complexes usually contain fewer atoms than large biomolecules such as proteins. Bonded atoms stay closer than non-bonded ones in most cases. As a result, the collection of 0th dimensional bars will mostly provide information about the lengths of covalent bonds, and the higher dimensional barcodes will most likely be very sparse. It is difficult to capture non-covalent interactions among atoms, especially hydrogen bonds and van der Waals pairwise interactions, in 0th dimensional barcodes. In order to describe non-covalent interactions, we propose multi-level persistent homology, obtained by simply modifying the distance matrix. Given the original distance matrix $M = (d_{ij})$ with $1 \leq i, j \leq N$, the modified distance matrix is defined as
$$\widetilde{M}_{ij} = \begin{cases} d_\infty, & \text{if atoms } i \text{ and } j \text{ are bonded}, \\ M_{ij}, & \text{otherwise}, \end{cases} \qquad (6.1)$$
where $d_\infty$ is a large number which is set to be greater than the upper limit of the filtration value chosen by a persistent homology algorithm. Note that this matrix may fail to satisfy the triangle inequality whilst still satisfying the construction principle of the Rips complex.

The present multi-level persistent homology is able to describe any selected interactions of interest and delivers two benefits in characterizing biomolecules. Firstly, the pairwise non-covalent interactions can be reflected by the 0th dimensional barcodes. Secondly, such treatment generates more higher dimensional barcodes, and the small structural fluctuations among different conformations of the same molecule can be captured. The persistent barcode representation of the molecule can be significantly enriched to better distinguish between different molecular structures and isomers.

As an illustration, we take the ligand from the protein-ligand complex with PDB code 1BCD, which only has 10 atoms. A different conformation of the ligand is generated by using the Frog2 web server [128]. The persistent barcodes generated using the Rips complex with the distance matrix $M$ are identical for the two conformations and only have 0th dimensional bars due to the simple structure. In this case, the 0th dimensional bars only reflect the length of each bond and therefore fail to distinguish the two slightly different conformations of the same molecule. However, when the modified distance matrices $\widetilde{M}$ are employed, the barcode representation is significantly enriched and is able to capture the tiny structural perturbation between the conformations. An illustration of the outcome from the modified distance matrix $\widetilde{M}$ is shown in Fig. 6.1. A general $n$th level persistence characterization of molecules can be obtained with the distance matrix $\widetilde{M}^n$ as
$$\widetilde{M}^n_{ij} = \begin{cases} d_\infty, & D(i, j) \leq n, \\ M_{ij}, & \text{otherwise}, \end{cases} \qquad (6.2)$$
where $D(i, j)$ is the smallest number of bonds to travel from atom $i$ to atom $j$ and $d_\infty$ is some number greater than the upper limit of the filtration value; a sketch of this construction follows Fig. 6.1.

Figure 6.1: Multi-level persistent homology on simple small molecules [20]. Illustration of the representation ability of $\widetilde{M}$ in reflecting structural perturbations among conformations of the same molecule. Left: the structural alignment of two conformations of the ligand in the protein-ligand complex (PDB:1BCD). Right: the persistence diagram showing the 1st and 2nd dimensional results generated using the Rips complex with $\widetilde{M}$ for the two conformations. It is worth noticing that the barcodes generated using the Rips complex with $M$ are identical for the two conformations.
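The promised sketch of Eq. (6.2) follows, using a breadth-first search over a covalent bond graph to evaluate D(i, j); representing the bond graph as an adjacency list is an assumption for illustration.

import numpy as np
from collections import deque

def multilevel_matrix(M, bonds, n=1, d_inf=1000.0):
    """Eq. (6.2): copy the distance matrix M but set entries to d_inf for atom
    pairs within n covalent bonds of each other, so that 0th dimensional bars
    reflect non-covalent rather than covalent proximity. `bonds` is an
    adjacency list of the covalent bond graph; d_inf must exceed the upper
    filtration limit."""
    num = len(M)
    Mn = np.array(M, dtype=float)
    for start in range(num):
        # Breadth-first search for all atoms with D(start, j) <= n
        dist = {start: 0}
        queue = deque([start])
        while queue:
            a = queue.popleft()
            if dist[a] == n:
                continue
            for b in bonds[a]:
                if b not in dist:
                    dist[b] = dist[a] + 1
                    queue.append(b)
        for j, d in dist.items():
            if j != start:
                Mn[start, j] = d_inf
    return Mn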
6.3.2.2 Interactive persistent homology.

In protein-ligand binding analysis and other analyses involving interactions, we are interested in the change of topological invariants induced by interactions that are caused by binding or other processes. Similar to the idea of multi-level persistent homology, we can design a distance matrix that focuses on the interactions of interest. Consider a set of atoms $A = A_1 \cup A_2$ with $A_1 \cap A_2 = \emptyset$, where only interactions between atoms from $A_1$ and atoms from $A_2$ are of interest [24]. The interactive distance matrix $\widehat{M}$ is defined as
$$\widehat{M}_{ij} = \begin{cases} M_{ij}, & \text{if } a_i \in A_1, a_j \in A_2 \text{ or } a_i \in A_2, a_j \in A_1, \\ d_\infty, & \text{otherwise}, \end{cases} \qquad (6.3)$$
where $M$ is the original distance matrix induced from the Euclidean metric or other correlation function based distances, $a_i$ and $a_j$ are atoms $i$ and $j$, and $d_\infty$ is a number greater than the upper limit of the filtration value. In applications, $A_1$ and $A_2$ can respectively be a set of atoms of the protein and a set of atoms of the ligand in a protein-ligand complex. In this case, the characterization of interactions between ligand and protein is an important task. In the modeling of point mutation induced protein stability changes, $A_1$ could be the set of atoms at the mutation site and $A_2$ could be the set of atoms of the surrounding residues close to the mutation site. Similar treatments can be used for protein-protein and protein-nucleic acid interactions.

6.3.2.3 Correlation function based persistent homology.

For biomolecules, the interaction strength between a pair of atoms usually does not align linearly with their Euclidean distance. For example, the van der Waals interaction is often described by the Lennard-Jones potential. Therefore, a kernel function filtration can be used to emphasize certain geometric scales [198],
$$\bar{M}_{ij} = 1 - \Phi(d_{ij}, \eta_{ij}), \qquad (6.4)$$
where $\Phi(d_{ij}, \eta_{ij})$ is a radial basis function and $\eta_{ij}$ is a scale parameter. This filtration can be incorporated in the element specific persistent homology,
$$\grave{M}_{ij} = \begin{cases} d_\infty, & \text{if atom } i \text{ or atom } j \in U, \\ 1 - \Phi(d_{ij}, \eta_{ij}), & \text{otherwise}. \end{cases} \qquad (6.5)$$
Additionally, one can simultaneously use two or more correlation functions characterized by different scales to generate a multiscale representation of biomolecules [140]. One form of the correlation function based filtration matrix is constructed with the flexibility and rigidity index. In this case, the Lorentz function is used in Eq. (6.5),
$$\Phi(d_{ij}; \eta_{ij}, \nu) = \frac{1}{1 + \left( d_{ij}/\eta_{ij} \right)^{\nu}}, \qquad (6.6)$$
where $d_{ij}$ is the Euclidean distance between point $i$ and point $j$, and $\eta_{ij}$ is a parameter controlling the scale, related to the radii of the two atoms. When distance matrices based on such correlation functions are used, patterns at different spatial scales can be addressed separately by altering the scale parameter $\eta_{ij}$. Note that the rigidity index is given by [196]
$$\mu_i = \sum_j \Phi(d_{ij}; \eta_{ij}, \nu). \qquad (6.7)$$
This expression is closely related to the rigidity density based volumetric filtration [201].
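The interactive matrix of Eq. (6.3) and the Lorentz kernel filtration of Eqs. (6.4) and (6.6) are equally short to prototype; the uniform scale parameter below is a simplifying assumption (the text allows a pairwise eta_ij).

import numpy as np

def interactive_matrix(M, in_A1, d_inf=1000.0):
    """Eq. (6.3): keep distances only between the two atom sets (e.g. protein
    atoms vs. ligand atoms); all other entries are pushed beyond the filtration
    range. `in_A1` is a boolean array marking membership in A1."""
    cross = np.logical_xor.outer(in_A1, in_A1)  # True iff atoms lie in different sets
    out = np.where(cross, M, d_inf)
    np.fill_diagonal(out, 0.0)  # vertices themselves enter the filtration at 0
    return out

def lorentz_matrix(M, eta, nu=3.0):
    """Eqs. (6.4) and (6.6): correlation function based filtration matrix
    1 - 1 / (1 + (d_ij / eta)^nu), mapping distances into [0, 1)."""
    return 1.0 - 1.0 / (1.0 + (M / eta) ** nu)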
6.3.2.4 Electrostatic persistence

Electrostatic effects are some of the most important effects in biomolecular structure, function, and dynamics. The embedding of electrostatics in topological invariants is of particular interest and can be very useful in describing highly charged biomolecules such as nucleic acids and their complexes. We introduce electrostatic interaction induced distance functions in Eq. (6.8) to address the electrostatic interactions among charged atoms. The abstract distance between two charged particles is rescaled according to their charges and their geometric distance, and is modeled as
$$\Phi(d_{ij}, q_i, q_j; c) = \frac{1}{1 + \exp(-c q_i q_j / d_{ij})}, \qquad (6.8)$$
where $d_{ij}$ is the distance between the two atoms, $q_i$ and $q_j$ are the partial charges of the two atoms, and $c$ is a nonzero tunable parameter. The parameter $c$ is set to a positive number if opposite-charge interactions are to be addressed, and to a negative number if like-charge interactions are of interest. The form of the function is adopted from the sigmoid function, which is widely used as an activation function in artificial neural networks. Such a function regularizes the input signal to the $[0, 1]$ interval. Other functions can be used similarly. This formulation can be extended to systems with dipole or higher order multipole approximations to the electron density. Weak interactions due to long distances or neutral charges result in correlation values close to 0.5. When $c > 0$, repulsive interactions and attractive interactions deliver correlation values in $(0.5, 1)$ and $(0, 0.5)$, respectively. The distances induced by $\Phi(d_{ij}, q_i, q_j; c)$ are used to characterize electrostatic effects. The parameter $c$ is rather physical, but it is chosen to effectively spread the computed values over the $(0, 1)$ interval so that the results can be used by machine learning methods. Another simple choice of charge correlation function is $\Phi(d_{ij}, \eta_{ij}, q_i, q_j) = q_i q_j \exp(-d_{ij}/\eta_{ij})$. However, this choice will lead to a different filtration domain. Additionally, a charge density can be constructed,
$$\mu_c(r) = \sum_j q_j \exp(-\| r - r_j \| / \eta_j), \qquad (6.9)$$
where $r$ is a position vector, $\| r - r_j \|$ is the Euclidean distance between $r$ and the $j$th atom position $r_j$, and $\eta_j$ is a scale parameter. Equation (6.9) can be used for electrostatic filtration as well. In this case, the filtration parameter can be the charge density value and a cubical complex based filtration can be used.

6.3.3 Feature generation from topological invariants

The barcode representation of topological invariants offers a visualization of persistent homology analysis. For machine learning analysis, we convert the barcode representation of topological invariants into structured feature arrays. To this end, we introduce several procedures to generate feature vectors from sets of barcodes. These methods are discussed below.

Counts in bins. For a given set of atoms $A$, we denote its barcodes as $B = \{I_\alpha\}_{\alpha \in A}$ and represent each bar by an interval $I_\alpha = [b_\alpha, d_\alpha]$, where $b_\alpha$ and $d_\alpha$ are respectively the birth and death positions on the filtration axis. The length of each bar, or the persistence of the topological invariant, is given by $p_\alpha = d_\alpha - b_\alpha$. To locate the positions of all bars and persistences, we further split the set of barcodes on the filtration axis into a predefined collection of $N$ bins, $\mathrm{Bin} = \{\mathrm{Bin}_i\}_{i=1}^N$ with $\mathrm{Bin}_i = [l_i, r_i]$, where $l_i$ and $r_i$ are the left and right endpoints of the $i$th bin. We generate features by counting the numbers of births, deaths, and persistences in each bin, which leads to three counting feature vectors, namely, counts of birth $F^C_b$, death $F^C_d$, and persistence $F^C_p$:
$$F^C_{b,i}(B) = \| \{ [b_\alpha, d_\alpha] \in B \mid l_i \leq b_\alpha \leq r_i \} \|, \quad 1 \leq i \leq N,$$
$$F^C_{d,i}(B) = \| \{ [b_\alpha, d_\alpha] \in B \mid l_i \leq d_\alpha \leq r_i \} \|, \quad 1 \leq i \leq N,$$
$$F^C_{p,i}(B) = \| \{ [b_\alpha, d_\alpha] \in B \mid b_\alpha \leq r_i \text{ and } l_i \leq d_\alpha \} \|, \quad 1 \leq i \leq N, \qquad (6.10)$$
where $\| \cdot \|$ is the number of elements in a set. Note that the above discussion applies to three topological dimensions, i.e., barcodes of the 0th dimension ($B^0$), 1st dimension ($B^1$), and 2nd dimension ($B^2$). In general, this approach enables the description of bond lengths, including the lengths of non-covalent interactions, in biomolecules.
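A direct transcription of Eq. (6.10), with hypothetical bin and barcode values for illustration:

import numpy as np

def binned_counts(barcode, bins):
    """Eq. (6.10): birth, death, and persistence counts of a barcode over a
    list of bins [(l_i, r_i)]; a bar counts toward a persistence bin whenever
    it is alive somewhere inside that bin."""
    fb, fd, fp = [], [], []
    for l, r in bins:
        fb.append(sum(1 for b, d in barcode if l <= b <= r))
        fd.append(sum(1 for b, d in barcode if l <= d <= r))
        fp.append(sum(1 for b, d in barcode if b <= r and l <= d))
    return np.array(fb), np.array(fd), np.array(fp)

# Example: three bars over ten 1.2 Angstrom bins covering [0, 12]
bins = [(1.2 * i, 1.2 * (i + 1)) for i in range(10)]
fb, fd, fp = binned_counts([(0.0, 1.5), (0.0, 3.8), (2.2, 6.0)], bins)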
Barcode statistics. Another method of feature vector generation from a set of barcodes is to extract important statistics of the barcode collections, such as maximum values and standard deviations. Given a set of bars $B = \{[b_\alpha, d_\alpha]\}_{\alpha \in A}$, we define the sets $\mathrm{Birth} = \{b_\alpha\}_{\alpha \in A}$, $\mathrm{Death} = \{d_\alpha\}_{\alpha \in A}$, and $\mathrm{Persistence} = \{d_\alpha - b_\alpha\}_{\alpha \in A}$. Three statistic feature vectors, $F^S_b$, $F^S_d$, and $F^S_p$, can then be generated from the statistics of the collection of barcodes. For example, $F^S_b$ consists of avg(Birth), std(Birth), max(Birth), min(Birth), sum(Birth), and cnt(Birth), where avg(·) is the average value of a set of numbers, std(·) is the standard deviation, max(·) and min(·) are the maximum and minimum values, sum(·) is the summation of the elements, and cnt(·) is the count of the elements. The generation of $F^S_d$ is the same, examining the set Death instead. $F^S_p$ contains the same information with two extra terms, the birth and death values of the longest bar. Statistics feature vectors are collected from barcodes of three topological dimensions, i.e., the 0th, 1st, and 2nd dimensions.
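A sketch of these statistics vectors, with the two extra longest-bar terms appended to F^S_p as described:

import numpy as np

def barcode_statistics(values):
    """The six statistics underlying F^S_b, F^S_d, and F^S_p: apply to the
    Birth, Death, or Persistence set of a barcode."""
    v = np.asarray(values, dtype=float)
    if v.size == 0:
        return np.zeros(6)
    return np.array([v.mean(), v.std(), v.max(), v.min(), v.sum(), v.size])

# Example with a hypothetical barcode
bars = [(0.0, 1.5), (0.2, 3.8), (2.2, 6.0)]
births, deaths = zip(*bars)
pers = [d - b for b, d in bars]
f_b = barcode_statistics(births)        # F^S_b from the Birth set
f_d = barcode_statistics(deaths)        # F^S_d from the Death set
longest = bars[int(np.argmax(pers))]    # extra terms appended to F^S_p
f_p = np.concatenate([barcode_statistics(pers), longest])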
2D representation. The construction of multi-dimensional persistence is an interesting topic in persistent homology. In general, it is believed that multi-dimensional persistence has better representational power for complex systems described by multiple parameters [32]. Although multidimensional persistence is hard to compute, one can compute persistence for one parameter while fixing the rest of the parameters to a sequence of fixed values. In the case where there are two parameters, a bifiltration can be done by taking turns fixing one parameter to a sequence of fixed values while computing persistence for the other parameter. For example, one can take a sequence of resolutions and compute persistence for distance at each fixed resolution. The sequence of outputs can be stacked to form a multidimensional representation [199].

Computing persistence multiple times and stacking the results is especially useful when the parameters that are not chosen as the filtration parameter are naturally discrete with underlying orders. For example, multi-component or element specific persistent homology results in many persistent homology computations over different selections of atoms. These results can be ordered by the percentage of atoms used of the whole molecule, or by their importance scores in classical machine learning methods. Also, multiple underlying dimensions exist in the element specific persistent homology characterization of molecules. This property enables 2D or 3D topological representations of molecules. Based on the observation that the performance of the predictor degenerates when too many element combinations are used, we order the element combinations according to their individual performance on the task using ensemble-of-trees methods. Combining the dimension of spatial scale and the dimension of element combinations, a 2D topological representation is obtained. Such a representation is expected to work better in the case of complex geometry, such as protein-ligand complexes.

With $E = \{E_j\}_{j=1}^{N_E}$ denoting the collection of element combinations ordered by their individual importance scores on the task, and $B^k(E_j)$ being the $k$th dimensional barcodes obtained with atoms of element combination $E_j$, eight 2D representations are defined as
$$\left\{ F^C_{d,i}(B^0(E_j)),\; F^C_{p,i}(B^0(E_j)),\; F^C_{b,i}(B^1(E_j)),\; F^C_{d,i}(B^1(E_j)),\; F^C_{p,i}(B^1(E_j)),\; F^C_{b,i}(B^2(E_j)),\; F^C_{d,i}(B^2(E_j)),\; F^C_{p,i}(B^2(E_j)) \right\}_{i=1,\cdots,N}^{j=1,\cdots,N_E}, \qquad (6.11)$$
where $F^C_{\gamma,i}$ with $\gamma = b, d, p$ is the barcode counting rule defined in Eq. (6.10). For the 0th dimension, since all bars start from zero, there is no need for $F^C_{b,i}(B^0(E_j))$. These eight 2D representations are regarded as eight channels of a 2D topological image. In protein-ligand binding analysis, 2D topological features are generated for the barcodes of a protein-ligand complex and for the differences between the barcodes of the protein-ligand complex and those of the protein alone. Therefore, we have a total of 16 channels in a 2D image for the protein-ligand complex. This 16-channel image can be fed into the training or the prediction of convolutional neural networks.

As an example, in the characterization of protein-ligand complexes using alpha complexes, 2D features are generated from the alpha complex based persistent homology computations of the protein and the protein-ligand complex. A total of 128 element combinations are considered. The $[0, 12]$ Å interval is divided into 120 equal-length bins, which defines the resolution of the topological images. Therefore, the input feature for each sample is a $120 \times 128 \times 16$ tensor. When fewer element combinations are considered, which can hardly form another axis, the axis of element combinations can be folded into the original channels to form 1D representations that can be used in a 1D CNN.

6.3.4 Machine learning algorithms

Three machine learning algorithms, namely k-nearest neighbors (KNN) regression, gradient boosting trees, and deep convolutional neural networks, are integrated with our topological representations to construct topological learning algorithms.

K-nearest neighbors algorithm via barcode space metrics. One of the simplest machine learning algorithms is k-nearest neighbors (KNN) for classification or regression. In KNN regression, for a given object, its property value is obtained by the average or weighted average of the values of its k nearest neighbors induced by a given metric of similarity. The problem then becomes how to construct a metric on the dataset. In the present work, instead of computing similarities from constructed feature vectors, the similarity between biomolecules can be derived directly from distances between the sets of barcodes generated from different biomolecules. Popular barcode space metrics include the bottleneck distance [43] and, more generally, the Wasserstein metrics [44, 29]. The definitions of the two metrics can be found in Section 2.1.3. The barcode space metrics can be used directly to assess the representation power of various persistent homology methods on biomolecules without being affected by the choice of machine learning models and hyperparameters. We show in the results section that the similarity measurement induced by the barcode space metrics is significantly correlated to molecular functions. Wasserstein metric measures from biomolecules can also be directly implemented in a kernel based method, such as a nonlinear support vector machine, for classification and regression tasks. However, this aspect is not explored in the present work.
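A minimal sketch of this barcode-metric KNN regression, using the persim package's Wasserstein distance as a stand-in for the metrics of Section 2.1.3 and averaging over the considered sets of barcodes (infinite bars should be dropped or capped beforehand):

import numpy as np
from persim import wasserstein

def knn_predict(query_barcodes, train_barcodes, train_values, k=3):
    """KNN regression over a barcode space metric: the similarity between two
    molecules is the average Wasserstein distance over their paired sets of
    barcodes (one set per element selection and homology dimension)."""
    dists = [
        np.mean([wasserstein(np.array(q), np.array(t))
                 for q, t in zip(query_barcodes, entry)])
        for entry in train_barcodes
    ]
    nearest = np.argsort(dists)[:k]
    return np.mean([train_values[j] for j in nearest])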
Gradient boosting trees. Gradient boosting trees is an ensemble method which combines individual decision trees to achieve the capability of learning complex feature-target maps, and it can effectively prevent overfitting by using a shrinkage technique. The gradient boosting trees method is realized using the GradientBoostingRegressor module in the scikit-learn software package [152] (version 0.17.1). A set of parameters found to be efficient in our previous study on protein-ligand binding affinity prediction [24] is used uniformly unless specified otherwise. The parameters used are n_estimators=20000, max_depth=8, learning_rate=0.005, loss='ls', subsample=0.7, and max_features='sqrt'.

Deep convolutional neural networks. The deep convolutional neural networks in this work are implemented using Keras [40] (version 1.1.2) with the Theano backend [177] (version 0.8.2). For TopBP-DL(Complex), a widely used convolutional neural network architecture is employed, beginning with convolution layers followed by dense layers. Due to limited computational resources, parameter optimization is not performed; most parameters are adopted from our earlier work [26] and the remaining parameters are assigned manually to reasonable values. The detailed architecture is shown in Fig. 6.2. The Adam optimizer with learning rate 0.0001 is used. The loss function is the mean squared error function. The network is trained with a batch size of 16 and 150 epochs. The training data is shuffled for each epoch.

The network architecture of TopVS-DL is shown in Fig. 6.3. The Adam optimizer with learning rate set to 0.0001 is used. The loss function is binary cross-entropy. The network is trained with a batch size of 1024 and 10 epochs. The training data is shuffled for each epoch. The batch size is larger than that used in TopBP-DL due to the much larger training set in this problem. For the same reason, the training process converges to a small loss very quickly, within only a few training steps.

Figure 6.2: The network architecture of TopBP-DL [20]. The structured layers are shown in sharp-corner boxes for 2D/1D image-like content and the unstructured layers are shown in rectangles. The numbers in the convolution layers denote the number of filters and the filter size, from left to right. The dense layers are drawn with the number of neurons and the activation function. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. Layers that are repeated n times are marked with an "×n" sign on the right side of the layer.

Figure 6.3: The network architecture of TopVS-DL [20]. The 1D image-like layers are shown in sharp-corner rectangles. The numbers in the convolution layers denote the number of filters and the filter size, from left to right. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. Layers that are repeated n times are marked with an "×n" sign on the right side of the layer.
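For concreteness, the gradient boosting configuration quoted above corresponds to the following scikit-learn instantiation; note that loss='ls' matches the version 0.17.1 API cited in the text, while recent releases spell the same option 'squared_error'.

from sklearn.ensemble import GradientBoostingRegressor

# The gradient boosting trees setup described in the text, as an estimator.
gbt = GradientBoostingRegressor(
    n_estimators=20000,
    max_depth=8,
    learning_rate=0.005,
    loss='ls',
    subsample=0.7,
    max_features='sqrt',
)
# Usage: gbt.fit(X_train, y_train); predictions = gbt.predict(X_test)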
Deep convolutional neural networks.

The deep convolutional neural networks in this work are implemented using Keras [40] (version 1.1.2) with the Theano backend [177] (version 0.8.2). For TopBP-DL(Complex), a widely used convolutional neural network architecture is employed, beginning with convolution layers followed by dense layers. Due to limited computational resources, parameter optimization is not performed; most parameters are adopted from our earlier work [26] and the remaining parameters are assigned manually to reasonable values. The detailed architecture is shown in Fig. 6.2. The Adam optimizer with learning rate 0.0001 is used. The loss function is the mean squared error function. The network is trained with a batch size of 16 for 150 epochs. The training data is shuffled for each epoch. The network architecture of TopVS-DL is shown in Fig. 6.3. The Adam optimizer with learning rate set to 0.0001 is used. The loss function is binary cross-entropy. The network is trained with a batch size of 1024 for 10 epochs. The training data is shuffled for each epoch. The batch size is larger than that used in TopBP-DL due to the much larger training set in this problem. For the same reason, the training process converges to a small loss very quickly with only a few training steps.

Figure 6.2: The network architecture of TopBP-DL.[20] The structured layers are shown in boxes with sharp corners for 2D/1D image-like content and the unstructured layers are shown in rectangles. The numbers in the convolution layers give the number of filters and the filter size from left to right. The dense layers are drawn with the number of neurons and the activation function. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. The layers that are repeated n times are marked with an "×n" sign on the right side of the layer.

Figure 6.3: The network architecture of TopVS-DL.[20] The 1D image-like layers are shown in sharp-corner rectangles. The numbers in the convolution layers give the number of filters and the filter size from left to right. The pooling size of the pooling layers and the dropout rate of the dropout layers are listed. The layers that are repeated n times are marked with an "×n" sign on the right side of the layer.

6.4 Results

Rational drug design and discovery have rapidly evolved into some of the most important and exciting research fields in medicine and biology. These approaches potentially have a profound impact on human health. The ultimate goal is to determine and predict whether a given drug candidate will bind to a target so as to activate or inhibit its function, resulting in a therapeutic benefit to the patient. Virtual screening is an important process in rational drug design and discovery which aims to identify actives of a given target from a library of small molecules.

There are mainly two types of screening techniques, ligand-based and structure-based. Ligand-based approaches depend on the similarity among small molecule candidates. Structure-based approaches try to dock a candidate molecule to the target protein and judge the candidate by the modeled binding affinity based on docking poses. Various molecular docking software packages have been developed for these purposes. Molecular docking involves both pose generation and binding affinity scoring. Currently, pose generation is quite robust while scoring power is still limited. Therefore, knowledge based rescoring methods using machine learning or deep learning approaches can improve scoring accuracy [158, 59, 4]. We also apply our topological learning method as a rescoring tool to rerank the candidates based on docking poses generated by docking software. This section explores the representational power of the proposed persistent homology methods for the prediction of protein-ligand binding affinities and the discrimination of actives from non-actives for protein targets. To this end, we use the present method to investigate three types of problems. First, we develop topological learning models for ligand based protein-ligand binding affinity prediction. This problem is designed to examine the representability of the proposed topological methods for small molecules. Then, we develop topological learning models for protein-ligand complex based binding affinity prediction. This problem enables us to understand the capability of the proposed topological learning methods for dealing with protein-ligand complexes. Finally, we examine the structure-based classification of active ligands and decoys, which are highly likely to be non-actives, i.e., structure-based virtual screening (VS). The optimal selection of features and methods is determined by studying the first two applications, and this finding leads to the main application studied in this work, the topological structure-based virtual screening. The computational algorithms used in this study are illustrated in Fig. 6.4.

Figure 6.4: An illustration of the topology based machine learning algorithms used in scoring and virtual screening.[20]

6.4.1 Ligand based protein-ligand binding affinity prediction

In this section, we address the representation of small molecules by element specific persistent homology, especially the proposed multi-level persistent homology designed for small molecules.

Data set

To assess the representational ability of the present persistent homology algorithms on small molecules, we use a high quality data set of 1322 protein-ligand complexes with binding affinity data involving 7 protein clusters introduced earlier (denoted as S1322) [187]. It is a subset of the PDBBind v2015 refined set and its details are given in the Supplementary material 1 of Ref. [187]. We consider a ligand based approach to predict the binding affinities of protein-ligand complexes in various protein clusters. As such, only the ligand information is used in our topological analysis. The ligand structures are taken from the PDBBind database without modification. The numbers of ligands in the protein clusters range from 94 to 333.

Models and performance

Two models, i.e., TopBP-KNN(Ligand) and TopBP-ML(Ligand), are constructed. TopBP-KNN(Ligand) is used to directly assess the representation power of persistent homology for small molecules and TopBP-ML(Ligand) is the final practical model. The results are shown in Table 6.1.
All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees. In TopBP-ML(Ligand), we process the geometry, the shape, and the covalent bond information of the small molecules using the alpha complex, and the non-covalent intramolecular interactions using multi-level persistent homology with the Rips complex. The features used are A-B012-E-S-GBT and R-B012-M1-S-GBT as described in Section Discussion/Ligand based protein-ligand binding affinity prediction. The gradient boosting trees method is used. In TopBP-KNN(Ligand), we represent the small molecules with a collection of sets of barcodes from element specific persistent homology calculations. The Wasserstein distance with p = 2 is applied to measure the similarity between two sets of barcodes. The similarity between each pair of small molecules is then measured by taking the average of the Wasserstein distances over all sets of considered barcodes. K-nearest-neighbor (KNN) regression is then applied to the measured similarity. In detail, the 6 sets of barcodes considered are R-B0-E-KNN, R-B1-E-KNN, R-B2-E-KNN, R-B0-M1-KNN, R-B1-M1-KNN, and R-B2-M1-KNN as described in Section Discussion/Ligand based protein-ligand binding affinity prediction. Leave-one-out validation within each protein cluster with k = 3 is used for this model.

Methods | CL 1 (333) | CL 2 (264) | CL 3 (219) | CL 4 (156) | CL 5 (134) | CL 6 (122) | CL 7 (94) | Average
TopBP-KNN(Ligand) | 0.698(1.66) | 0.817(1.28) | 0.620(1.68) | 0.645(1.41) | 0.756(1.68) | 0.658(1.68) | 0.739(1.31) | 0.705(1.49)
TopBP-ML(Ligand) (5-fold) | 0.713(1.60) | 0.843(1.15) | 0.693(1.51) | 0.670(1.35) | 0.831(1.34) | 0.698(1.56) | 0.737(1.26) | 0.741(1.40)
FFT-BP (5-fold) [187] | (1.93) | (1.32) | (2.01) | (1.61) | (2.02) | (2.06) | (1.71) | (1.81)

Table 6.1: Pearson correlation coefficients (RMSE in kcal/mol) of the ligand based topological models on the S1322 dataset. The numbers in the header row give the number of entries in each protein cluster. The performance is reported as Pearson correlation coefficient (root mean squared error in kcal/mol). The median performance of 20 random 5-fold cross validation runs is reported for TopBP-ML(Ligand). The results reported for TopBP-KNN(Ligand) are obtained by leave-one-out validation within each protein cluster with k = 3 for the KNN model.

In Table 6.1, the FFT-BP 5-fold cross validation results were obtained based on multiple additive regression trees and a set of physical descriptors, including geometry, charge, electrostatic interactions, and van der Waals interactions, for the S1322 set [187]. Since multiple additive regression trees is also an implementation of the GBT used in the present work, it is appropriate to compare the FFT-BP results with the GBT results in this work to assess the representation power of the topological features. It is interesting to note that, judged by RMSE, both sets of the current topological descriptors have more predictive power than the physical descriptors built on protein-ligand complexes constructed in our earlier work [187]. These physical descriptors were constructed from sophisticated surface areas, molecular volumes, van der Waals interactions, charges computed by quantum mechanics, and Poisson-Boltzmann theory based electrostatics [187]. The success of topological descriptors implies the existence of an alternative and potentially more powerful description of the complex biomolecular world.
6.4.2 Complex based protein-ligand binding affinity prediction

In this section, we focus on topological representations of protein-ligand complexes.

Data sets

The PDBBind database provides a comprehensive collection of structures of protein-ligand complexes and their binding affinity data [38, 119]. The original experimental data in the Protein Data Bank (PDB) [17] are selected into the PDBBind database based on certain quality requirements and are curated for applications. As shown in Table 6.2, this database is expanding on a yearly basis. It has become a common resource for benchmarking computational methods and algorithms for protein-ligand binding analysis and drug design. Popular data sets include version 2007 (v2007), v2013, and v2015. Among them, the v2013 core set and the v2015 core set are identical. A large number of scoring functions have been tested on these data sets. The latest version, v2016, has an enlarged core set, which contains 290 protein-ligand complexes from 58 protein families. Therefore, this test set should be relatively easier than the v2015 core set, whose 195 complexes involve 65 protein families. The core sets were constructed by choosing 3 samples with the median, maximum, and minimum binding affinity from each protein family for the v2007, v2013, and v2015 sets. The core set for v2016 was constructed similarly but with 5 samples from each protein family.

Version | Refined set | Training set | Core set (test set) | Protein families
v2007 | 1300 | 1105 | 195 | 65
v2013 | 2959 | 2764 | 195 | 65
v2015 | 3706 | 3511 | 195 | 65
v2016 | 4057 | 3767 | 290 | 58

Table 6.2: Description of the PDBBind datasets. Number of complexes or number of protein families in the PDBBind data sets used in the present binding affinity prediction. Here the training sets are the corresponding refined sets, excluding the complexes in the corresponding test sets (i.e., core sets). Protein families refer to those in the corresponding core sets.

Model and performance

Two models, TopBP-ML(Complex) and TopBP-DL(Complex), are introduced. The results are shown in Table 6.3. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees.

Core set predictions:
Methods | v2007 | v2013 | v2015 | v2016 | Average
TopBP(Complex) | 0.827 (1.93) | 0.808 (1.95) | 0.812 (1.92) | 0.861 (1.65) | 0.827 (1.86)
TopBP-ML(Complex) | 0.818 (2.01) | 0.804 (2.00) | 0.797 (1.99) | 0.848 (1.74) | 0.817 (1.94)
TopBP-DL(Complex) | 0.806 (1.95) | 0.781 (1.98) | 0.799 (1.91) | 0.848 (1.64) | 0.809 (1.87)
RF::VinaElem | 0.803 (1.94) [115] | 0.752 (2.03) [116] | - | - | -
RI-Score [140] | 0.803 (1.99) | - | 0.762 (2.05) | 0.815 (1.85) | -

Refined set 5-fold cross validations:
Methods | v2007 | v2013 | v2015 | v2016 | Average
TopBP-ML(Complex) | 0.752 (1.95) | 0.768 (1.75) | 0.781 (1.71) | 0.785 (1.71) | 0.771 (1.78)
RI-Score [140] | - | - | - | 0.747 (1.83) | -

Table 6.3: Pearson correlation coefficients (RMSE in kcal/mol) of different protein-ligand complex based approaches on the PDBBind datasets. Pearson correlation coefficients with RMSE (kcal/mol) in parentheses for predictions by different methods are listed. For the tests on core sets, the models are trained with the corresponding refined set minus the core set. Five-fold cross validation is done on the refined sets. Results of TopBP-ML(Complex) are the medians of 50 repeated runs. For TopBP-DL(Complex), 100 independent models are generated at first. A consensus model is built by randomly choosing 50 models out of the 100, and this process is repeated 1000 times with the median reported.
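The consensus protocol described in the caption of Table 6.3 (and elaborated below) can be sketched as follows; `dl_preds`, `ml_preds`, and `y_true` are hypothetical arrays holding the stored predictions of the 100 deep learning models, the repeated GBT runs, and the experimental affinities.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)

def topbp_consensus_performance(dl_preds, ml_preds, y_true,
                                n_pick=50, n_repeat=1000):
    """Median performance of the TopBP(Complex) consensus, a sketch."""
    scores = []
    for _ in range(n_repeat):
        # randomly pick 50 of the 100 DL models and one GBT run
        dl = dl_preds[rng.choice(len(dl_preds), n_pick, replace=False)].mean(axis=0)
        ml = ml_preds[rng.integers(len(ml_preds))]
        consensus = (dl + ml) / 2.0          # average the two predictors
        scores.append(pearsonr(consensus, y_true)[0])
    return np.median(scores)                 # median over the 1000 repeats
```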
TopBP(Complex) is a consensus model combining TopBP-ML(Complex) and TopBP-DL(Complex). Each time, 50 single deep learning models are randomly selected to form TopBP-DL(Complex) and a TopBP-ML(Complex) model is randomly selected. The average of the two is taken as the output of TopBP(Complex). This process is repeated 1000 times with the median reported.

In TopBP-ML(Complex), the alpha complex is used to describe the arrangement of the carbon and heavy atom networks, while the Rips complex with different distance matrices is used to describe the protein-ligand interactions from the perspective of interaction distances and the strength of electrostatic interactions. In detail, the features used are R-B0-I-C, R-B0-CI-S, and A-B12-E-S as described in Section Discussion/Complex based protein-ligand binding affinity prediction, together with those used in TopBP-ML(Ligand).

With the idea that a sequence of element combinations ordered by their importance in gradient boosting trees models can make an extra dimension of the description, we build a 2D convolutional neural network with one spatial dimension and one dimension of element combinations. We combine this 2D CNN with a 1D CNN taking the pairwise interaction inputs. For the construction of the 2D input, the reader is referred to Section Feature generation from topological invariants. The 1D image-like inputs consist of two parts, both generated by the counts in bins method described in Section Feature generation from topological invariants. For the 0th dimensional barcodes from interactive persistent homology of the 36 pairs of atom types ({C,N,O,S} from protein and {C,N,O,S,P,F,Cl,Br,I} from ligand), the interval [0, 50] Å is divided into equal length subintervals of length 0.25 Å. For the 0th dimensional barcodes from interactive persistent homology for electrostatics of the 50 pairs of atom types ({C,N,O,S,H} from protein and {C,N,O,S,P,F,Cl,Br,I,H} from ligand), the parameter interval [0, 1] is divided into equal length subintervals of length 0.01. These two 1D image-like features have sizes 200 × 36 and 100 × 50. The network architecture is given in Section Methods/Machine learning algorithms/Deep convolutional neural networks.

The final model TopBP(Complex) takes the average of TopBP-ML(Complex) and TopBP-DL(Complex) with the assumption that the errors made by the two approaches are only partially correlated, so averaging over them may cancel part of the errors. As a result, TopBP(Complex) delivers the best prediction performance on all four testing sets.

6.4.3 Structure-based virtual screening

In this section, we examine the performance of the proposed method on the main application in this paper, structure-based virtual screening, which involves protein-compound complexes obtained by attempting to dock the candidates to the target proteins. The dataset is much larger than those of the two binding affinity prediction applications, which makes parameter tuning very time consuming. Therefore, the best performing procedures from ligand-based binding affinity prediction and protein-ligand-complex-based binding affinity prediction are applied in this virtual screening application.

DUD data set

The directory of useful decoys (DUD) [93, 135] is used to benchmark our topological approach for virtual screening. The DUD data set contains 40 protein targets from six classes, i.e., nuclear hormone receptors, kinases, serine proteases, metalloenzymes, folate enzymes, and other enzymes.
A total of about 3000 active ligands were identified from the literature. The number of ligands for each target ranges from tens to hundreds. At most 36 decoys were constructed for each ligand from the ZINC database of commercially available compounds [95]. In the first step, the ZINC database of 3.5 million compounds was reduced to a database of 1.5 million compounds with similarity less than 0.9 to the ligands. The similarity was measured by the Tanimoto coefficient on CACTVS type 2 fingerprints. The decoys were selected so that they possess physical properties similar to the ligands but have dissimilar molecular topology (topology in the sense of chemistry, not mathematical topology). A total of 32 physical properties were used, including molecular weight, partition coefficient, and number of hydrogen bonding groups. This results in a total of about 100000 compounds. A discrepancy between the calculated partial charges for the ligand and decoy sets was reported for the original release 2 of the DUD datasets, which makes it trivial for virtual screening methods to distinguish between the two categories using those charges [5]. In this work, we use the data with recalculated Gasteiger charges for both ligand and decoy sets given by Armstrong et al. [5] in AutoDock Vina and in our electrostatic persistence.

Data processing

In structure-based virtual screening, the possible complex structures of the target protein and the small molecule candidate are required. For the DUD dataset, the structures of the 40 protein targets, the ligands, and the decoys are given, and we generate the protein-compound complexes by using docking software. To this end, we first add missing atoms to the proteins by using the profix utility in the Jackal software package [203]. The receptors and the ligands or decoys are prepared using the scripts prepare_receptor4.py and prepare_ligand4.py provided by the AutoDockTools module in the MGLTools package (version 1.5.6) [130]. The bounding box of the binding site is defined as a cube with edge size equal to 27 Å, centered at the geometric center of the crystal ligand. AutoDock Vina (version 1.1.2) [179] is used to dock the ligands or decoys to the receptors. The option exhaustiveness is set to 16 and all the other parameters are set to their default values. In each docking experiment, the pose with the lowest binding free energy reported by AutoDock Vina is used by the reranking models.

Evaluation

Two measurements, the enrichment factor (EF) and the area under the receiver operating characteristic curve (AUC), are used to evaluate each method's ability to discriminate actives from decoys. The AUC is defined as

$$\mathrm{AUC} = 1 - \frac{1}{N_a} \sum_{i=1}^{N_a} \frac{N_d^i}{N_d}, \qquad (6.12)$$

where $N_a$ is the number of active ligands, $N_d$ is the total number of decoys, and $N_d^i$ is the number of decoys that are ranked higher than the $i$th ligand [158]. An AUC value of 0.5 is the expected value of a random selection, whereas a perfect prediction results in an AUC of 1. The EF at $x\%$, denoted by $\mathrm{EF}_{x\%}$, evaluates the quality of the set of top $x\%$ ranked compounds by comparing the percentage of actives in the top $x\%$ ranked compounds to the percentage of actives in the entire compound set. It is defined as

$$\mathrm{EF}_{x\%} = \frac{N_a^{x\%}}{N^{x\%}} \cdot \frac{N}{N_a}, \qquad (6.13)$$

where $N_a^{x\%}$ is the number of active ligands in the top $x\%$ ranked compounds, $N^{x\%}$ is the number of top $x\%$ ranked compounds, $N$ is the total number of compounds, and $N_a$ is the total number of active ligands.
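The two measurements translate directly into code; the sketch below implements Eqs. (6.12) and (6.13) for a hypothetical array of ranking scores and active/decoy labels.

```python
import numpy as np

def auc_and_ef(scores, is_active, x=2.0):
    """AUC per Eq. (6.12) and EF at x% per Eq. (6.13).

    `scores` are assumed to rank compounds so that larger means more
    likely active; `is_active` is a boolean array of the same length.
    """
    order = np.argsort(-np.asarray(scores))   # best-ranked first
    active = np.asarray(is_active)[order]
    n = len(active)
    n_a, n_d = active.sum(), (~active).sum()

    # Eq. (6.12): for each active, count the decoys ranked above it.
    decoys_above = np.cumsum(~active)
    auc = 1.0 - decoys_above[active].sum() / (n_a * n_d)

    # Eq. (6.13): enrichment among the top x% ranked compounds.
    n_top = max(1, int(round(n * x / 100.0)))
    ef = (active[:n_top].sum() / n_top) * (n / n_a)
    return auc, ef
```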
To evaluate the performance of various methods on the DUD data set, the entries associated with one protein target are used as the test set each time [158]. For the selection of the training set for a given protein target, we follow a procedure given in the literature [93], where the entries associated with the rest of the proteins, excluding those that are within the same class as the testing protein and those that have reported positive cross-enrichment with the testing protein, are taken as the training set. The 40 proteins are split into 6 classes and the excluded list of training data for each protein follows [4].

Topology based machine learning models

Our topology based machine learning model, called TopVS-ML, relies on manually constructed features and utilizes ensemble of trees methods. For the complex with a small molecule (i.e., a ligand or a decoy) docked to the receptor, the features R-B0-I-BP, R-B0-CI-S, and A-B12-E-S are used (see Section Discussion/Complex based protein-ligand binding affinity prediction), whereas the features R-B012-M1-S and A-B012-E-S (see Section Discussion/Ligand based protein-ligand binding affinity prediction) are used for the small molecules. The gradient boosting trees method, the random forest method, and the extra trees method are employed as voters. The probabilities output by the three methods are averaged, and the averaged probability is used by the classifier to decide the class of a testing sample; a minimal sketch of this voting scheme follows Table 6.4. The modules GradientBoostingClassifier, RandomForestClassifier, and ExtraTreesClassifier in the scikit-learn package [152] (version 0.17.1) are used. The parameters for the three modules are listed in Table 6.4. TopVS-ML achieves a performance of AUC = 0.83, EF2% = 8.6, and EF20% = 3.4. These values are the medians of 10 repeated experiments.

Method | Parameters
GBT | n=2000, s=0.5, cw=100:1, lr=0.01, mf=sqrt
RF | n=2000, cw=balanced_subsample
ET | n=2000, cw=balanced_subsample

Table 6.4: Parameters used in machine learning. The parameters used for the ensemble of trees methods; the other parameters are set to their defaults. GBT: gradient boosting trees. RF: random forest. ET: extra trees. n: n_estimators. s: subsample. cw: class_weight. lr: learning_rate. mf: max_features.
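The sketch below expresses the three-voter scheme with the Table 6.4 parameters. Note that scikit-learn's GradientBoostingClassifier takes no class_weight argument, so the 100:1 setting is rendered here through per-sample weights at fit time, which is one possible reading of that entry; X_train, y_train, and X_test are hypothetical placeholders.

```python
import numpy as np
from sklearn.ensemble import (ExtraTreesClassifier,
                              GradientBoostingClassifier,
                              RandomForestClassifier)

gbt = GradientBoostingClassifier(n_estimators=2000, subsample=0.5,
                                 learning_rate=0.01, max_features='sqrt')
rf = RandomForestClassifier(n_estimators=2000,
                            class_weight='balanced_subsample')
et = ExtraTreesClassifier(n_estimators=2000,
                          class_weight='balanced_subsample')

# 100:1 weighting of actives vs. decoys, expressed as sample weights.
w = np.where(y_train == 1, 100.0, 1.0)

gbt.fit(X_train, y_train, sample_weight=w)
rf.fit(X_train, y_train)
et.fit(X_train, y_train)

# Average the predicted probabilities of the three voters and rank the
# compounds by the probability of the active class.
probs = [clf.predict_proba(X_test)[:, 1] for clf in (gbt, rf, et)]
score = np.mean(probs, axis=0)
```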
Topology based deep learning model

Our topology based deep learning model, called TopVS-DL, relies on 1D image-like inputs for the protein-compound complexes and manually constructed features for the compounds. The 2D representation used in the binding affinity problem is not used here due to the intractable data size. The manually constructed features for the compounds are R-B012-M1-S and A-B012-E-S as described in Section Discussion/Ligand based protein-ligand binding affinity prediction. The 1D image-like inputs consist of three parts, all generated by the counts in bins method described in Section Feature generation from topological invariants.

(1) For the 0th dimensional barcodes from interactive persistent homology of the 36 pairs of atom types ({C, N, O, S} from protein and {C, N, O, S, P, F, Cl, Br, I} from ligand), the interval [0, 25] Å is divided into equal length subintervals of length 0.25 Å. The barcodes used here are identical to the barcodes in feature R-B0-I-BP. This results in a 1D image-like feature of size 100 × 36.

(2) For the 0th dimensional barcodes from interactive persistent homology for electrostatics of the 50 pairs of atom types ({C, N, O, S, H} from protein and {C, N, O, S, P, F, Cl, Br, I, H} from ligand), the parameter interval [0, 1] is divided into equal length subintervals of length 0.01. The barcodes used are identical to the barcodes in feature R-B0-CI-S. This results in a 1D image-like feature of size 100 × 50.

(3) Alpha complex based persistent homology is applied to all carbon atoms and to all heavy atoms. The computation is done on the complex as well as on only the protein, with a cutoff distance of 12 Å from the ligands. The interval [0, 12] Å is divided into equal length subintervals of length 0.125 Å. The counts in bins method is applied to the 0th, 1st, and 2nd dimensional barcodes. The features are generated for the persistent homology computations of the complex and of the protein. The features for the complex and the difference between the features for the complex and the protein are finally used. This results in a 1D image-like feature of size 96 × 32.

The detailed network architecture is listed in Section Methods/Machine learning algorithms/Deep convolutional neural networks. A consensus model is constructed by taking the average over 25 single models trained independently. TopVS-DL achieves a performance of AUC = 0.81, EF2% = 9.1, and EF20% = 3.2.

The final model

In the same spirit as taking the average output of the different ensemble of trees models as the final output in TopVS-ML, we add TopVS-DL as another voter to TopVS-ML to construct a final model, called TopVS. Such a consensus approach takes the average over different models with the hope that the different models make partially uncorrelated errors which can cancel out when averaged. The performance on each of the 40 protein targets is reported in Table 6.5. We have also generated virtual screening results for AutoDock Vina (ADV) based on the binding free energies computed by ADV and compared them with those of the present TopVS in terms of enrichment factors and the area under the receiver operating characteristic curve (AUC). A comparison of average AUC with those from a large number of methods is given in Table 6.6.
Target | ADV EF2% | ADV EF20% | ADV AUC | TopVS EF2% | TopVS EF20% | TopVS AUC
ACE | 4.1 | 1.4 | 0.42 | 5.1 | 3.1 | 0.81
AChE | 4.7 | 2.8 | 0.67 | 1.4 | 1.9 | 0.65
ADA | 0.0 | 0.4 | 0.49 | 7.8 | 4.5 | 0.90
ALR2 | 2.0 | 2.7 | 0.74 | 4.9 | 1.5 | 0.68
AmpC | 2.4 | 0.2 | 0.34 | 0.0 | 1.0 | 0.58
AR | 17.0 | 3.8 | 0.81 | 20.1 | 4.2 | 0.90
CDK2 | 9.0 | 2.4 | 0.64 | 7.6 | 4.1 | 0.88
COMT | 13.1 | 1.4 | 0.56 | 17.4 | 2.9 | 0.73
COX1 | 9.9 | 2.8 | 0.76 | 11.8 | 3.6 | 0.86
COX2 | 20.7 | 3.9 | 0.86 | 23.3 | 4.9 | 0.97
DHFR | 6.4 | 2.8 | 0.82 | 12.6 | 4.7 | 0.96
EGFr | 3.4 | 1.6 | 0.63 | 16.4 | 4.8 | 0.95
ERagonist | 17.8 | 3.3 | 0.84 | 10.0 | 2.8 | 0.81
ERantagonist | 10.2 | 2.3 | 0.70 | 1.3 | 2.8 | 0.83
FGFr1 | 0.4 | 0.8 | 0.44 | 15.1 | 4.8 | 0.95
FXa | 1.0 | 1.3 | 0.63 | 2.1 | 4.4 | 0.89
GART | 0.0 | 1.9 | 0.75 | 2.6 | 0.7 | 0.48
GPB | 0.0 | 0.9 | 0.48 | 1.4 | 1.5 | 0.66
GR | 5.7 | 1.2 | 0.57 | 1.3 | 3.4 | 0.84
HIVPR | 5.6 | 2.6 | 0.74 | 8.9 | 4.4 | 0.91
HIVRT | 8.2 | 1.9 | 0.64 | 11.7 | 4.0 | 0.88
HMGR | 0.0 | 0.9 | 0.53 | 14.4 | 5.0 | 0.96
HSP90 | 0.0 | 0.9 | 0.64 | 9.6 | 4.5 | 0.93
InhA | 13.4 | 1.9 | 0.56 | 22.7 | 4.5 | 0.95
MR | 16.7 | 4.0 | 0.82 | 0.0 | 4.3 | 0.87
NA | 0.0 | 0.3 | 0.37 | 1.5 | 3.8 | 0.87
P38 MAP | 1.4 | 1.7 | 0.59 | 18.4 | 4.5 | 0.94
PARP | 4.2 | 2.7 | 0.71 | 0.0 | 1.7 | 0.71
PDE5 | 8.0 | 1.9 | 0.61 | 6.9 | 3.4 | 0.86
PDGFrb | 3.5 | 0.5 | 0.32 | 26.5 | 4.9 | 0.97
PNP | 0.0 | 0.7 | 0.59 | 7.9 | 4.3 | 0.89
PPARg | 17.7 | 3.4 | 0.82 | 0.6 | 1.8 | 0.72
PR | 1.9 | 1.1 | 0.52 | 9.4 | 4.1 | 0.91
RXRa | 28.2 | 4.8 | 0.95 | 12.8 | 3.2 | 0.83
SAHH | 10.4 | 3.0 | 0.80 | 4.5 | 3.9 | 0.84
SRC | 5.6 | 2.3 | 0.71 | 24.6 | 4.9 | 0.98
thrombin | 8.3 | 2.6 | 0.72 | 4.1 | 2.4 | 0.79
TK | 0.0 | 0.9 | 0.56 | 6.9 | 2.5 | 0.78
trypsin | 3.1 | 1.9 | 0.58 | 0.0 | 2.0 | 0.65
VEGFr2 | 10.2 | 2.2 | 0.63 | 24.9 | 4.7 | 0.96
Average | 6.9 | 2.0 | 0.64 | 9.5 | 3.5 | 0.84

Table 6.5: Performance on each protein in the DUD dataset. The median results of 10 repeated runs with different random seeds (for the TopVS-ML part) are reported. The AutoDock Vina (ADV) columns are acquired from the ADV runs ranked by the binding free energy reported by ADV.

Method | AUC | Ref.
TopVS | 0.84 | this work
DeepVS-ADV | 0.81 | [158]
ICM^a | 0.79 | [137]
NNScore1-ADV^b | 0.78 | [59]
Glide SP^a | 0.77 | [47]
DDFA-ALL | 0.77 | [4]
DDFA-RL | 0.76 | [4]
NNScore2-ADV^b | 0.76 | [59]
DDFA-ADV | 0.75 | [4]
DeepVS-Dock | 0.74 | [158]
DDFA-AD4 | 0.74 | [4]
Glide HTVS^b | 0.73 | [59]
Surflex^a | 0.72 | [47]
Glide HTVS | 0.72 | [47]
ICM | 0.71 | [137]
RAW-ALL | 0.70 | [4]
AutoDock Vina^b | 0.70 | [59]
Surflex | 0.66 | [47]
Rosetta Ligand | 0.65 | [4]
AutoDock Vina | 0.64 | [4]
ICM | 0.63 | [47]
FlexX | 0.61 | [47]
Autodock4.2 | 0.60 | [4]
PhDOCK | 0.59 | [47]
Dock4.0 | 0.55 | [47]

Table 6.6: AUC comparison of different methods on the DUD dataset. ^a Tuned by expert knowledge. ^b Determined using a different data set of decoys.

6.5 Discussion

6.5.1 Ligand based protein-ligand binding affinity prediction

We conduct several experiments on ligand based protein-ligand binding affinity prediction in this section, which lead to the final models. To examine the strengths and weaknesses of different sets of features and models, we first show statistics of the S1322 data set of 7 protein clusters in Fig. 6.5. The details of the S1322 data set are given in Section Results/Ligand based protein-ligand binding affinity prediction. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees.

Feature vectors for gradient boosting trees.

In this test, Rips complex based and alpha complex based persistent homology computations up to the 2nd dimension are performed for a variety of atom collections with different element types, using the Euclidean metric and the multi-level distance defined in Eq. (6.1). Two types of features are generated: $F^C$, which is a combination of $F^C_b$, $F^C_d$, and $F^C_p$, and $F^S$, which is a combination of $F^S_b$, $F^S_d$, and $F^S_p$. The construction of the features $F^C$ and $F^S$ is described in Section Feature generation from topological invariants; a schematic implementation is sketched below. For sets of 0th dimensional bars, only $F^C_d$ and $F^S_d$ are computed.
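The following minimal sketch illustrates an $F^S$-type feature vector for one set of bars. The exact statistics used in this work are those of Section Feature generation from topological invariants; the ones below, taken over births, deaths, and persistences, are illustrative stand-ins.

```python
import numpy as np

def barcode_statistics(bars):
    """Sketch of a barcode statistics (F^S-type) feature vector.

    `bars` is assumed to be an (n, 2) array of (birth, death) pairs for
    one element combination and one homology dimension.
    """
    if len(bars) == 0:
        return np.zeros(12)          # fixed-length vector for empty sets
    birth, death = bars[:, 0], bars[:, 1]
    persistence = death - birth
    feats = []
    for values in (birth, death, persistence):
        feats += [values.min(), values.max(), values.mean(), values.std()]
    return np.array(feats)
```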
In each protein cluster, 10-fold or 5-fold cross validation is repeated 20 times for each subset of feature vectors depending on the selected element type. The median Pearson correlation coefficients and the root-mean-square errors (RMSE) in kcal/mol are reported. For the Rips complex, both the level 0 computation with distance matrix $M$ and the level 1 computation with distance matrix $\widetilde{M}_1$ as defined in Eq. (6.2) are performed.

Figure 6.5: Statistics of the ligands in the 7 protein clusters of the S1322 dataset.[20] The average number of heavy atoms of a ligand in each protein cluster is shown in red and the standard deviation of the number of heavy atoms across each protein cluster is shown in blue. The number of ligands in each cluster is given in parentheses.

Barcode space metrics for k-nearest neighbor regression.

The barcodes generated using the Rips complex with distance matrices $M$ and $\widetilde{M}_1$ are collected, and the distance between each pair of sets of barcodes is measured using the Wasserstein metric $d_2$. Leave-one-out prediction for every sample is performed with k-nearest neighbor regression with k = 3 within each protein cluster based on the Wasserstein metric. The performance of the best performing and the worst performing protein clusters is shown in Fig. 6.6. The better the performance, the closer the lines are to the semicircle. The experiments done for this section are summarized in Table 6.7.

Figure 6.6: An illustration of similarities between ligands measured by their barcode space Wasserstein distances.[20] Ligands are ordered according to their binding affinities and are represented as dots on the semicircle. Specifically, a sample of binding free energy x is plotted at the angle θ = π(E_max − x)/(E_max − E_min), where E_min and E_max are the lowest and the highest energies in the dataset. Each dot is connected with its two nearest neighbors based on their barcode space Wasserstein distances. An optimal prediction would be achieved if the lines stay close to the semicircle. The majority of the connections stay near the semicircular boundary, demonstrating that the barcode space Wasserstein distance based measurement reflects the similarity in function, i.e., the binding affinity in this case. The protein clusters with the best and the worst performance are shown. Left: Protein cluster 2. Right: Protein cluster 3.

Performance of multi-component persistent homology.

It can be noticed from Table 6.8 that topological features generated from barcode statistics typically outperform those created from counts in bins. R-B012-E-S-GBT and R-B012-M1-S-GBT perform similarly in the majority of the protein clusters, whilst R-B012-M1-S-GBT, which is based on $\widetilde{M}_1$, significantly outperforms R-B012-E-S-GBT, which is based on the Euclidean distance, in protein clusters 3 and 6. To assess in what circumstances the multi-level persistent homology improves the original persistent homology characterization of small molecules, we analyze the statistics of the size of the ligands in Fig. 6.5. It turns out that protein cluster 3 has the smallest average number of heavy atoms and protein cluster 6 has the smallest standard deviation of the number of heavy atoms.
This observation partially answers the question: in cases where the small molecules are relatively simple and of relatively similar size, multi-level persistent homology is able to enrich the characterization of the small molecules, which further improves the robustness of the model. Such enrichment or improvement over the original persistent homology approach is realized mainly in the higher dimensional barcodes, i.e., the 1st and 2nd dimensions. In Table 6.8, the results with IDs 7 through 12 confirm that the 0th dimensional features from the computation with $\widetilde{M}_1$ are inferior to the results with the Euclidean distance, whilst the 1st and 2nd dimensional features based on $\widetilde{M}_1$ outperform the best results with the Euclidean distance in most cases. It is interesting to note that although the Wasserstein metric based KNN methods are not as accurate as the GBT approaches, the consensus result obtained by averaging over various predictions with the Wasserstein metric on different sets of barcodes is quite accurate.

Robustness of topological learning models.

Certain elements such as Br are very rare in the data sets studied in this work. Considering only the elements of high occurrence will not hurt the performance on the validations performed. However, omitting the low occurrence elements would sacrifice the capability of the model to handle new data in which such elements play an important role. Therefore, we decide to keep the rare elements, which results in a large number of features and redundancy among features. For example, the element combinations CBrH and CH will probably deliver the same performance for most of the samples in the data sets studied in this work. To test whether this redundancy causes degenerated results, the features of one element combination are added to the model at each step and the model is validated with the accumulation of the added features at each step. The performance of the model is measured with the Pearson correlation coefficient and is plotted against the number of element combinations involved in Fig. 6.7. For most cases in Fig. 6.7, the model is robust against the inclusion of more element combinations.

Figure 6.7: Plot of performance against the number of element combinations used.[20] The topological learning model performance against the number of element combinations involved in feature construction for the 7 protein clusters in S1322. The horizontal axis corresponds to the number of element combinations used for the features. From left to right, one extra element combination is added at each step. The features are then used in the gradient boosting trees method to test whether the model is robust against redundant information. The results related to the alpha complex are marked in red and those related to the Rips complex in blue. The median Pearson correlation coefficients between predicted and experimental results over 10-fold cross-validation within each protein cluster, repeated 20 times, are reported.

Experiment descriptions (Table 6.7):

A-B012-E-C-GBT: The barcodes are generated using the alpha complex on different sets of atoms based on different element combinations. The features are constructed using the 0th, 1st, and 2nd dimensional barcodes following the counts in bins method with bins equally dividing the interval [0, 5]. Here 32 different element combinations are considered, including {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI, H, CH, NH, OH, SH, CNH, COH, CSH, NOH, NSH, OSH, CNOH, CNSH, COSH, NOSH, CNOSH, CNOSPFClBrIH}.
Gradient boosting trees (GBT) with the structured feature matrix are used for this computation.

A-B012-E-S-GBT: The same barcodes as those used in A-B012-E-C-GBT are used. Instead of counts in bins, the barcode statistics method is used to generate the features.

A-B012-E-SS-GBT: The same barcodes as those used in A-B012-E-C-GBT are used. The persistence diagram slice and statistics method is used to generate the features. A uniform set of bins, obtained by dividing the interval [0, 5] into 10 equal length bins, is used to slice the birth, death, and persistence values.

R-B012-E-S-GBT: Barcodes are generated using the Rips complex with Euclidean distances. The features are generated following the barcode statistics method. Here 36 element combinations are considered, i.e., {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI, H, CH, NH, OH, SH, CNH, COH, CSH, NOH, NSH, OSH, CNOH, CNSH, COSH, NOSH, CNOSH, CNOSPFClBrIH, CCl, CClH, CBr, CBrH}.

R-B012-M1-S-GBT: The result is obtained with the same setup as R-B012-E-S-GBT except that the first level enrichment distance matrix $\widetilde{M}_1$ is used instead of the Euclidean distance.

R-Bn-E-KNN: The nth dimensional barcodes from the Rips complex computation with the Euclidean distance are used. K-nearest neighbor (KNN) regression is performed with the Wasserstein metric $d_2$. The leave-one-out validation is performed individually for each element combination and the average prediction over these element combinations is taken as the output result. The element combinations considered are {CNOS, CNOSPFClBrI, NOH, CNO, CNOSPFClBrIH}. These combinations are selected based on their performance in the gradient boosting trees experiments.

R-Bn-M1-KNN: The result is obtained with the same setup as R-Bn-E-KNN except that the distance matrix $\widetilde{M}_n$ is used instead of the Euclidean distance.

Table 6.7: Experiments for ligand-based protein-ligand binding affinity prediction on the 7 protein clusters and 1322 protein-ligand complexes.
ID | Experiments | CL 1 (333) | CL 2 (264) | CL 3 (219) | CL 4 (156) | CL 5 (134) | CL 6 (122) | CL 7 (94) | Average
1 | A-B012-E-C-GBT | 0.695(1.63) | 0.836(1.18) | 0.690(1.52) | 0.642(1.38) | 0.840(1.30) | 0.647(1.65) | 0.730(1.27) | 0.726(1.42)
2 | A-B012-E-S-GBT | 0.695(1.63) | 0.845(1.14) | 0.678(1.54) | 0.692(1.31) | 0.828(1.35) | 0.702(1.54) | 0.739(1.25) | 0.740(1.39)
3 | A-B012-E-SS-GBT | 0.704(1.62) | 0.846(1.15) | 0.681(1.53) | 0.668(1.35) | 0.834(1.34) | 0.715(1.53) | 0.741(1.25) | 0.741(1.40)
4 | R-B012-E-S-GBT | 0.712(1.60) | 0.837(1.17) | 0.659(1.57) | 0.683(1.32) | 0.808(1.41) | 0.635(1.67) | 0.757(1.22) | 0.727(1.42)
5 | R-B012-M1-S-GBT | 0.714(1.59) | 0.848(1.13) | 0.699(1.50) | 0.692(1.31) | 0.831(1.34) | 0.717(1.52) | 0.747(1.24) | 0.750(1.38)
6 | 2+5 | 0.716(1.59) | 0.836(1.17) | 0.706(1.48) | 0.672(1.34) | 0.822(1.37) | 0.708(1.53) | 0.746(1.24) | 0.744(1.39)
7 | R-B0-E-KNN | 0.648(1.73) | 0.761(1.39) | 0.544(1.76) | 0.616(1.42) | 0.700(1.70) | 0.487(1.89) | 0.641(1.43) | 0.628(1.62)
8 | R-B1-E-KNN | 0.547(1.91) | 0.684(1.55) | 0.444(1.88) | 0.536(1.52) | 0.535(2.01) | 0.634(1.67) | 0.649(1.42) | 0.576(1.71)
9 | R-B2-E-KNN | 0.474(2.01) | 0.494(1.87) | 0.202(2.14) | 0.298(1.79) | 0.126(2.49) | 0.331(2.09) | 0.609(1.47) | 0.362(1.98)
10 | R-B0-M1-KNN | 0.581(1.85) | 0.771(1.35) | 0.516(1.80) | 0.601(1.44) | 0.672(1.76) | 0.485(1.90) | 0.644(1.43) | 0.610(1.65)
11 | R-B1-M1-KNN | 0.663(1.70) | 0.784(1.33) | 0.652(1.59) | 0.555(1.50) | 0.786(1.49) | 0.610(1.71) | 0.731(1.30) | 0.683(1.52)
12 | R-B2-M1-KNN | 0.675(1.67) | 0.803(1.28) | 0.577(1.72) | 0.531(1.52) | 0.655(1.81) | 0.617(1.72) | 0.648(1.42) | 0.644(1.59)
13 | Cons(7+8+9+10+11+12) | 0.698(1.66) | 0.817(1.28) | 0.620(1.68) | 0.645(1.41) | 0.756(1.68) | 0.658(1.68) | 0.739(1.31) | 0.705(1.49)
14 | 2+5 (5-fold) | 0.713(1.60) | 0.843(1.15) | 0.693(1.51) | 0.670(1.35) | 0.831(1.34) | 0.698(1.56) | 0.737(1.26) | 0.741(1.40)

Table 6.8: Performance of different approaches on the S1322 dataset. Pearson correlation coefficients with RMSE (kcal/mol) in parentheses for binding affinity predictions on the 7 protein clusters (CL) in S1322. In the header row, the numbers in parentheses denote the numbers of ligands in each cluster. The median results of 20 repeated runs are reported for the ensemble of trees based methods to account for randomness in the algorithm. For the experiment labels, the first letter indicates the complex used, 'A' for alpha complex and 'R' for Rips complex. The second part, starting with 'B' followed by integers, indicates the dimensions of the barcodes used. The third part indicates the distance function used, 'E' for Euclidean and 'M1' for $\widetilde{M}_1$. For rows 1 through 5, the fourth part shows the way of feature construction, 'C' for counts in bins and 'S' for barcode statistics. The last part indicates the regression technique used, 'GBT' for gradient boosting trees and 'KNN' for k-nearest neighbors. Detailed descriptions of the experiments are given in Table 6.7. Row 6 gives the results using the features of both row 2 and row 5. Row 13 gives the consensus results obtained by taking the average of the predictions of rows 7 through 12. Unless otherwise specified, all results are obtained from 10-fold cross validations.

6.5.2 Complex based protein-ligand binding affinity prediction

Having demonstrated the representational power of the present topological learning method for characterizing small molecules, we further examine the method on the task of characterizing protein-ligand complexes. Biologically, we consider the same task, i.e., the prediction of protein-ligand binding affinity, with a different approach that is based on the structural information of the protein-ligand complexes.
Only the gradient boosting trees and deep convolutional neural network algorithms are used in this section. All the gradient boosting trees models take the setup described in Section Methods/Machine learning algorithms/Gradient boosting trees. In the present topological learning study, we use the four versions of the PDBBind core sets as our test sets. For each test set, the corresponding refined set, excluding the core set, is used as the training set.

Groups of topological features and their performance in association with GBT.

The experiments of protein-ligand-complex-based protein-ligand binding affinity prediction for the PDBBind datasets are summarized in Table 6.9.

Experiment descriptions (Table 6.9):

R-B0-I-C: 0th dimensional barcodes from the Rips complex computation with the interactive distance matrix based on the Euclidean distance are used. Features are generated following the counts in bins method with bins {[0, 2.5), [2.5, 3), [3, 3.5), [3.5, 4.5), [4.5, 6), [6, 12]}. The element combinations used are all possible paired choices of one item from {C, N, O, S, CN, CO, NO, CNO} in protein and another item from {C, N, O, S, P, F, Cl, Br, I, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS} in ligand, which results in a total of 160 combinations.

R-B0-I-BP: The persistent homology computation and feature generation are the same as for R-B0-I-C. However, the element combinations used are all possible paired choices of one item from {C, N, O, S} in protein and another item from {C, N, O, S, P, F, Cl, Br, I} in ligand, which results in a total of 36 element combinations.

R-B0-CI-C: 0th dimensional barcodes from the Rips complex computation with the interactive distance matrix based on the electrostatics correlation function defined in Eq. (6.8) with the parameter c = 100 are used. The features are generated following the counts in bins method with bins {(0, 0.1], (0.1, 0.2], (0.2, 0.3], (0.3, 0.4], (0.4, 0.5], (0.5, 0.6], (0.6, 0.7], (0.7, 0.8], (0.8, 0.9], (0.9, 1.0)}. The element combinations used are all possible paired choices of one item from {C, N, O, S, H} in protein and another item from {C, N, O, S, P, F, Cl, Br, I, H} in ligand, which results in a total of 50 element combinations.

R-B0-CI-S: The barcodes and element combinations are the same as those of R-B0-CI-C. The features are generated following the barcode statistics method.

A-B12-E-S: 1st and 2nd dimensional barcodes from the alpha complex computation with the Euclidean distance are used. The atom collections considered are all heavy atoms and all carbon atoms. Features are generated following the barcode statistics method.

Table 6.9: Experiments for protein-ligand-complex-based protein-ligand binding affinity prediction for the PDBBind datasets.

6.5.2.1 Robustness of the GBT algorithm against redundant element combination features and potential overfitting.

It is intuitive that combinations of more than 2 element types are able to enrich the representation, especially in the case of higher dimensional barcodes. However, considering combinations of more element types rapidly increases the dimension of the feature space. In the high dimensional feature space, it is almost inevitable that there exist nonessential and redundant features. Additionally, the importance of a feature varies across different problems and data sets. Therefore, it is preferable to keep all the potentially important features in a general model which is expected to cover a wide range of situations.
To test the robustness of the model against unimportant features, we select a total of 128 element combinations (i.e., all possible paired choices of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in protein and another item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in ligand). The 0th, 1st, and 2nd dimensional barcodes are computed for all combinations using the alpha complex with the Euclidean distance. Features are generated following the barcode statistics method. A general model with all the features is generated in the first place. The element combinations are then sorted according to their importance scores in the general model. Starting from the most important element combination, one element combination is added to the feature vector at each step and the resulting feature vector is passed to the machine learning training and testing procedure. The order of adding element combinations follows their importance scores, so a less important combination is added at each step; a sketch of this procedure is given below. Fig. 6.8 depicts the changes of the Pearson correlation coefficient and the RMSE (kcal/mol) with respect to the increase of element combinations in predicting the four PDBBind core sets. In all cases, the inclusion of the top combinations readily delivers very good models. The behavior of the present method on PDBBind v2007 is quite different from that on the other data sets: its performance improves almost monotonically as the number of element combinations increases, whereas in the other three cases the improvement is unsteady. Nevertheless, the performance fluctuates within a small range, which indicates that the present method is reasonably stable against the increase in element combinations. From a different perspective, the increase in element combinations might lead to overfitting in machine learning. Since the model parameters are fixed before the experiments, this shows that GBT algorithms are not very sensitive to redundant features and are robust against overfitting.
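A hedged sketch of the importance-ordered incremental test follows; `groups` and `eval_fn` are hypothetical stand-ins for the mapping from element combinations to feature columns and for the train/validate callback.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def incremental_combination_test(X, y, groups, eval_fn):
    """Sketch of the robustness test described above.

    `groups` maps each element combination to the column indices of its
    features in X; `eval_fn(columns)` trains and validates a model
    restricted to those columns and returns a performance score.
    """
    # 1. Fit a general model on all features and score each combination
    #    by the total importance of its columns.
    general = GradientBoostingRegressor(n_estimators=2000).fit(X, y)
    imp = general.feature_importances_
    ranked = sorted(groups, key=lambda g: imp[groups[g]].sum(), reverse=True)

    # 2. Add one combination at a time, most important first, and
    #    re-validate with the accumulated feature set.
    scores, columns = [], []
    for comb in ranked:
        columns.extend(groups[comb])
        scores.append(eval_fn(sorted(columns)))
    return ranked, scores
```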
Usefulness of more than 2 element types for interactive 0th dimensional barcodes.

While using element combinations with more than 2 element types with higher dimensional barcodes enriches the characterization of geometry, it remains to assess whether the interactive 0th dimensional characterization benefits from element combinations with more element types. As an example, we denote the interactive 0th dimensional barcodes for carbon and nitrogen atoms from the protein and oxygen atoms from the ligand by $B_{CN-O}$, the barcodes for carbon atoms from the protein and oxygen atoms from the ligand by $B_{C-O}$, and the barcodes for nitrogen atoms from the protein and oxygen atoms from the ligand by $B_{N-O}$. In the persistent homology barcode representation, $B_{CN-O}$ is not strictly the union of $B_{C-O}$ and $B_{N-O}$.

Figure 6.8: Feature robustness tests on the PDBBind datasets.[20] The performance of the topological learning model against the number of included element combinations for predicting on the PDBBind core sets with training on the PDBBind refined sets minus the core sets. The 1st and 2nd dimensional barcodes computed with the alpha complex are used. Features are generated following the barcode statistics method. Element combinations are all possible paired choices of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in protein and another item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in ligand, which results in 128 element combinations. The horizontal straight lines represent the performance of the 2D representation with the deep convolutional neural network (row 10 in Table 6.10). The blue and red colors correspond to the Pearson correlation coefficient and the RMSE (kcal/mol), respectively. Each experiment is done by training on the refined set minus the core set, with the median result of 20 repeated runs reported.

However, $B_{CN-O}$ might be redundant given $B_{C-O}$ and $B_{N-O}$. To address this concern, we test features from interactive 0th dimensional barcodes with the 36 element combinations (i.e., {C, N, O, S} for protein and {C, N, O, S, P, F, Cl, Br, I} for ligand) and features for the 160 selected element combinations (i.e., {C, N, O, S, CN, CO, NO, CNO} for protein and {C, N, O, S, P, F, Cl, Br, I, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS} for ligand), which are listed as feature group 2 and feature group 1 in Table 6.10. In all four cases, the features of the 36 combinations (feature group 2) slightly outperform or perform as well as the features of the 160 combinations (feature group 1), suggesting that element combinations with more than 2 element types are redundant relative to the combinations of 2 element types in the case of the interactive 0th dimensional characterization.

ID | Experiments | v2007 | v2013 | v2015 | v2016 | Average
1 | R-B0-I-C | 0.799 (2.01) | 0.741 (2.14) | 0.750 (2.11) | 0.813 (1.82) | 0.776 (2.02)
2 | R-B0-I-BP | 0.816 (1.94) | 0.741 (2.13) | 0.750 (2.10) | 0.825 (1.78) | 0.783 (1.99)
3 | R-B0-CI-C | 0.791 (2.05) | 0.759 (2.10) | 0.738 (2.13) | 0.801 (1.87) | 0.772 (2.04)
4 | R-B0-CI-S | 0.773 (2.10) | 0.762 (2.12) | 0.749 (2.13) | 0.810 (1.86) | 0.774 (2.05)
5 | A-B12-E-S | 0.736 (2.25) | 0.709 (2.26) | 0.695 (2.27) | 0.752 (2.02) | 0.723 (2.20)
6 | 1+4 | 0.815 (1.95) | 0.780 (2.04) | 0.774 (2.04) | 0.833 (1.76) | 0.801 (1.95)
7 | 2+4 | 0.806 (1.99) | 0.787 (2.04) | 0.770 (2.06) | 0.834 (1.77) | 0.799 (1.97)
8 | 1+4+5 | 0.810 (1.98) | 0.792 (2.02) | 0.786 (2.02) | 0.831 (1.76) | 0.805 (1.95)
9 | 2+4+5 | 0.802 (2.01) | 0.796 (2.02) | 0.782 (2.04) | 0.822 (1.79) | 0.801 (1.97)
10 | 2D-CNN-Alpha | 0.787 (2.02) | 0.781 (1.98) | 0.785 (1.95) | 0.837 (1.68) | 0.798 (1.91)
11 | 1D2D-CNN | 0.806 (1.95) | 0.781 (1.98) | 0.799 (1.91) | 0.848 (1.64) | 0.809 (1.87)

Table 6.10: Performance of different protein-ligand complex based approaches on the PDBBind datasets. Pearson correlation coefficients with RMSE (kcal/mol) in parentheses for predictions by various groups of features on the four PDBBind core sets. The training sets are the PDBBind refined sets minus the core sets of the same version year. Results of the ensemble of trees based methods (rows 1 through 9) are the median values of 50 repeated runs to account for randomness in the algorithm. For the deep learning based methods (rows 10 and 11), 100 independent models are generated in the first place; a consensus model is built by randomly choosing 50 models out of the 100, and this process is repeated 1000 times with the median reported. The first letter indicates the complex used, 'A' for alpha complex and 'R' for Rips complex. The second part indicates the dimensions of the barcodes used. The third part indicates the distance function used, 'I' for $\widehat{M}_{ij}$ defined in Eq. (6.3), 'CI' for the one defined in Eq. (6.8), and 'E' for Euclidean. The last part shows the way of feature construction, 'C' for counts in bins, 'S' for barcode statistics, and 'BP' for only pairs of two single elements. The results reported in rows 6 through 9 are obtained by combining the features of the rows with the corresponding numbers.

Importance of atomic charge in electrostatic persistence.
In element specific persistent homology, atoms of different element types are characterized separately, which offers a rough and implicit description of the electrostatics of the system. However, such an implicit treatment of electrostatics may lose important information because atoms behave differently at different oxidation states. Therefore, we explicitly embed atomic charges in the interactive 0th dimensional barcodes as described in Eq. (6.8). The resulting topological features are given as feature group 4 in Table 6.10. It can be seen from Table 6.10 that the combination of feature group 4 and the Euclidean distance based interactive 0th dimensional barcodes (listed as feature groups 6 and 7) generally outperforms the results obtained with only the Euclidean distance based features. This observation suggests that electrostatics plays an important role and should be handled explicitly for the protein-ligand binding problem. Additionally, the inclusion of physical interactions in topological invariants opens a promising new direction in topological analysis.

Relevance of elements that are rare with respect to the data sets.

Since the majority of the samples in both training and testing sets only contain atoms of the element types C, N, O, and H, the performance of the model on the samples with rarely occurring elements (with respect to the data sets) is hardly reflected by the overall performance statistics. For simplicity, we refer to such rarely occurring elements with respect to the data sets simply as rarely occurring elements in the discussion that follows. To assess the aspects of the model that potentially affect the performance on the samples containing rarely occurring elements, we pick the samples containing each rarely occurring element from the original testing set as a new testing set. Three experiments are carried out to address two questions: "Are the training samples containing the same rarely occurring element crucial?" and "Are features addressing the rarely occurring element important?". A short answer is yes to both, according to the results shown in Fig. 6.9. Specifically, for each rarely occurring element, the exclusion of the samples containing this element from the training set and the exclusion of the features addressing this element both cause degenerated results. It is also shown that the exclusion of the samples with the rarely occurring element leads to much worse results. Since both modifications of the model deliver worse results, we conclude that including training samples with compositions similar to the test sample is crucial to the success of the model on that specific test sample. Even though the inclusion of features of more element types or element combinations does not deliver better results on the general testing sets, such features should still be kept in the model in case a sample with a similar element composition comes in as a test sample.

2D persistence for topological deep convolutional neural networks.

Deep learning is potentially more powerful than many other machine learning algorithms when the data size is sufficiently large. In the present work, it is natural to construct a 2D topological representation by incorporating the element combinations as an additional dimension, resulting in 16 channels as defined in Section Feature generation from topological invariants. Here 128 element combinations (i.e., all possible paired choices of one item from {C, N, O, CN, CO, NO, CNO, CNOS} in protein and another item from {C, N, O, S, CN, CO, CS, NO, NS, OS, CNO, CNS, COS, NOS, CNOS, CNOSPFClBrI} in ligand) are used for the 2D analysis.
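The assembly of such a 2D topological image can be sketched as follows. Since the barcode counting rule of Eq. (6.10) is not reproduced in this section, plain histograms of births, deaths, and persistences are used below as an illustrative stand-in, and all names are hypothetical.

```python
import numpy as np

def topological_image(barcode_sets, n_bins=120, r_max=12.0, n_comb=128):
    """Sketch of the 2D topological image of Eq. (6.11).

    `barcode_sets` maps (combination index j, homology dimension k) to
    an (n, 2) array of (birth, death) bars. Bins over [0, 12] A give a
    120 x 128 x 8 tensor per structure; stacking the complex-minus-
    protein differences as well yields the 16 channels of the text.
    """
    img = np.zeros((n_bins, n_comb, 8))
    edges = np.linspace(0.0, r_max, n_bins + 1)
    # channel layout: dim 0 -> (death, persistence); dims 1, 2 ->
    # (birth, death, persistence)
    channels = {0: (None, 0, 1), 1: (2, 3, 4), 2: (5, 6, 7)}
    for (j, k), bars in barcode_sets.items():
        birth, death = bars[:, 0], bars[:, 1]
        for values, c in zip((birth, death, death - birth), channels[k]):
            if c is None:          # 0th dimension: all births are zero
                continue
            hist, _ = np.histogram(values, bins=edges)
            img[:, j, c] += hist
    return img
```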
The advantage of introducing this extra dimension with convolutional neural networks is to prevent unimportant features from interacting with important ones at the lower levels of the model, whilst generally unimportant features are still kept in the model in case they are essential to specific problems or to a certain portion of the data set. As shown in Fig. 6.8, for all the data sets except the PDBBind v2007 set, the 2D topological deep learning with convolutional neural networks performs significantly better.

Figure 6.9: Assessment of the performance of the model on samples with elements that are rare in the data sets.[20] For the four data sets PDBBind v2007, v2013, v2015, and v2016 [119], and for each element, the testing set is the subset of the original core sets with only the ligands that contain atoms of the particular element type. The features used are the features with ID=7 in Table 6.10. The reported RMSE is the average taken over the four data sets. Experiment 1: the training set is the original training set and all the features are used. Experiment 2: the training set is the original training set and only features that do not involve the particular element are used. Experiment 3: the training set is the original training set excluding the samples that contain atoms of the particular element type, and all features are used. For most of the elements, experiment 1 achieves the best result and experiment 3 yields the worst performance.

The inferior performance of the convolutional neural networks on v2007 might be a result of the small data size. Note that the v2007 training set has 1105 protein-ligand complexes, whereas the other training sets have more than 2700 complexes. Consequently, the topological deep convolutional neural networks are able to outperform the topological GBT algorithm in predicting the v2013, v2015, and v2016 core sets. Indeed, topological deep convolutional neural networks have advantages in dealing with large data sets.

6.5.3 Structure-based virtual screening

In our final model TopVS reported in Table 6.6, we use topological descriptors of both the protein-compound interactions and the compounds themselves (i.e., ligands and decoys), and we take a consensus model on top of several ensemble of trees models and a deep learning model. We have also tested the behavior of our topological learning model TopVS-ML using either one of the aforementioned descriptions. The tests are done with TopVS-ML because TopVS-DL is much more time consuming. When only the topological descriptors of the small molecules are used, which falls into the category of ligand-based virtual screening, an AUC of 0.81 is achieved. For the topological learning model using only the descriptions of the protein-ligand interactions, an AUC of 0.77 is achieved. An AUC of 0.83 is obtained with a model combining both sets of descriptors, which is better than either individual performance, suggesting that the two groups of descriptors are complementary to each other and are both important for achieving satisfactory results. The marginal improvement made by the protein-compound complexes may be due to the varying docking quality. A similar situation was encountered by a deep learning method [158].
For the targets with high quality results by AutoDock Vina (AUC of ADV > 0.8), the ligand-based features achieve an AUC of 0.81 and the complex-based features achieve an AUC of 0.86. On the other hand, for the targets with low quality results by AutoDock Vina (AUC of ADV < 0.5), the ligand-based features achieve an AUC of 0.82 and the complex-based features achieve an AUC of 0.74. The results of these cases are listed in Table 6.11 and Table 6.12. This observation suggests that the performance of the features describing the interactions and the geometry of the protein-compound complexes highly depends on the quality of the docking results. Our model with only the small molecule descriptors delivers an AUC of 0.81, which is comparable to the other top performing methods. The performance of this model is also competitive in the regime of protein-ligand binding affinity prediction based on experimentally solved complex structures, as shown in Section Discussion/Ligand based protein-ligand binding affinity prediction. These results suggest that the topology based small molecule characterization proposed in this work is potentially useful in other applications involving small molecules, such as predictions of toxicity, solubility, and partition coefficient.

Target | ADV | LIG | COM | ALL
AR | 0.81 | 0.83 | 0.93 | 0.90
COX2 | 0.86 | 0.97 | 0.80 | 0.97
DHFR | 0.82 | 0.95 | 0.94 | 0.96
ERagonist | 0.84 | 0.69 | 0.91 | 0.81
MR | 0.82 | 0.78 | 0.91 | 0.89
PPARg | 0.82 | 0.70 | 0.72 | 0.72
RXRa | 0.95 | 0.74 | 0.91 | 0.79
SAHH | 0.80 | 0.81 | 0.72 | 0.84
Average | 0.84 | 0.81 | 0.86 | 0.86

Table 6.11: The AUC for AutoDock Vina (ADV), TopVS-ML with only compound features (LIG), TopVS-ML with only protein-compound complex features (COM), and TopVS-ML with all features (ALL). The targets with high quality results by AutoDock Vina (AUC > 0.8) are reported.

Target | ADV | LIG | COM | ALL
ACE | 0.42 | 0.85 | 0.78 | 0.81
ADA | 0.49 | 0.89 | 0.89 | 0.89
AmpC | 0.34 | 0.56 | 0.37 | 0.53
FGFr1 | 0.44 | 0.97 | 0.71 | 0.95
GPB | 0.48 | 0.70 | 0.69 | 0.71
NA | 0.37 | 0.79 | 0.82 | 0.84
PDGFrb | 0.32 | 0.98 | 0.90 | 0.96
Average | 0.41 | 0.82 | 0.74 | 0.81

Table 6.12: The AUC for AutoDock Vina (ADV), TopVS-ML with only compound features (LIG), TopVS-ML with only protein-compound complex features (COM), and TopVS-ML with all features (ALL). The targets with low quality results by AutoDock Vina (AUC < 0.5) are reported.

6.6 Conclusion

Persistent homology is a relatively new branch of algebraic topology and is one of the main tools in topological data analysis. The topological simplification of biomolecular systems was a major motivation of the earlier persistent homology development [62, 216]. Persistent homology has been applied to computational biology [99, 75, 48, 156, 198, 195, 202, 201, 200, 186, 117]. However, the predictive power of primitive persistent homology was limited in early topological learning applications [21]. To address this challenge, we have recently introduced element specific persistent homology to retain chemical and biological information during the topological abstraction of biomolecules [23, 24, 26]. The resulting topological learning approach offers competitive predictions of protein-ligand binding affinities and mutation induced protein stability changes. However, persistent homology based approaches for small molecules have not been developed, and their representability and predictive power for the interaction of small molecules with macromolecules have not been extensively studied.
The present work further introduces multi-component persistent homology, multi-level persistent homology, and electrostatic persistence for chemical and biological characterization, analysis, and modeling. Multi-component persistent homology takes a combinatorial approach to create possible element specific topological representations. Multi-level persistent homology allows tailored topological descriptions of any desirable interaction in biomolecules, which is especially useful for small molecules. Electrostatic persistence incorporates partial charges, which are essential to biomolecules, into topological invariants. These approaches are implemented via the appropriate construction of the distance matrix for filtration. The representation power and reduction power of multi-component persistent homology, multi-level persistent homology, and electrostatic persistence are validated on two databases, namely PDBBind [119] and DUD [93, 135]: PDBBind involves more than 4,000 high quality protein-ligand complexes, and DUD contains nearly 100,000 small compounds. Two classes of problems are used to test the proposed topological methods, namely, the prediction of protein-ligand binding affinities and the discrimination of active ligands from decoys (virtual screening). In both problems, we examine the representability of the proposed topological learning methods on small molecules, which are somewhat more difficult to describe by persistent homology due to their chemical diversity, variability, and sensitivity. Additionally, these methods are tested on their ability to handle full protein-ligand complexes. Advanced machine learning methods, including Wasserstein metric based k-nearest neighbors (KNN), gradient boosting trees (GBT), random forest (RF), extra trees (ET), and deep convolutional neural networks (CNN), are utilized in the present work to facilitate the proposed topological methods, rendering advanced topological learning algorithms for quantitative and qualitative biomolecular predictions. The thorough examination of the method on the prediction of binding affinity for experimentally solved protein-ligand complexes leads to a structure-based virtual screening method, TopVS, which outperforms other methods. The feature sets introduced in this work for small molecules and protein-ligand complexes can be extended to other applications, such as 3D-structure based prediction of toxicity, solubility, and partition coefficient for small molecules, and complex structure based prediction of protein-nucleic acid binding and protein-protein binding affinities.

Chapter 7

Dissertation contribution

The main contributions of this dissertation are listed as follows.

• In Chapter 3, we develop a quantitative and predictive model based on persistent homology and deep learning for protein-ligand binding affinity prediction and the prediction of protein stability changes upon mutation. This model achieves top performance on benchmarks where many other methods have been tested. To the best of our knowledge, this work is the first competitive topological data analysis based predictive model in the field of molecular biology.

• In Chapter 4, motivated by the goal of embedding physical properties into persistent homology representations of biomolecules, we propose a method called the enriched barcode, which uses cohomology and tools from graph theory. This method applies generally to situations where data come with multiple heterogeneous dimensions.
• In Chapter 5, we develop a persistent homology construction called evolutionary homology, tailored for the analysis of trajectories from coupled dynamical systems. This method is applied to coupled dynamical systems associated with molecular systems and shows competitive results in the modeling of protein flexibility.

• In Chapter 6, we carefully assess the representability of persistent homology for biomolecules. As a result, we propose persistent homology representations specifically designed for small molecules, which we call multi-level persistent homology, as well as representations for macromolecules addressing chemical and biological complexities. The method developed in this chapter also leads to a top-performing tool for structure-based virtual screening.

• In addition to the high performance of this work on benchmarks where the testing data are given to the researchers, we have also achieved top performance in worldwide blind prediction challenges in drug design, where testing data are hidden from researchers before the prediction results are submitted [141].

• The works described in this dissertation are made accessible by releasing source code with the publications [26, 20]. Several accompanying webservers are made available to users: “TDL-MP” (http://weilab.math.msu.edu/TDL/TDL-MP/) and “TML-MP” (http://weilab.math.msu.edu/TML/TML-MP/) for the prediction of protein stability changes upon mutations, and “TDL-BP” (http://weilab.math.msu.edu/TDL/TDL-BP/) and “TML-BP” (http://weilab.math.msu.edu/TML/TML-BP/) for protein-ligand binding affinity prediction.

The contents of this dissertation are mostly adopted from the following publications and preprints:

• Zixuan Cang and Guo-Wei Wei. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Computational Biology, 13(7):e1005690, 2017.

• Zixuan Cang and Guo-Wei Wei. Persistent cohomology for data with multicomponent heterogeneous information. Preprint, 2018.

• Zixuan Cang, Elizabeth Munch, and Guo-Wei Wei. Evolutionary homology on coupled dynamical systems. arXiv preprint arXiv:1802.04677, 2018.

• Zixuan Cang, Lin Mu, and Guo-Wei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Computational Biology, 14(1):e1005929, https://doi.org/10.1371/journal.pcbi.1005929, 2018.

This work led to the following publications/preprints that are not discussed in this dissertation:

• Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and Guo-Wei Wei. A topological approach for protein classification. Molecular Based Mathematical Biology, 3(1), 2015.

• Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics, 33(22):3549–3557, 2017.

• Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International Journal for Numerical Methods in Biomedical Engineering, 34(2):e2914, 2018.

The following publications/preprints are also related to this dissertation:

• Rundong Zhao, Zixuan Cang, Yiying Tong, and Guo-Wei Wei. Protein pocket detection via convex hull surface evolution and associated Reeb graph. Accepted by Bioinformatics/Proceedings of ECCB 2018, 2018.

• Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-Wei Wei. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. arXiv preprint arXiv:1804.10647, 2018.
BIBLIOGRAPHY

[1] SAMPL6 challenge. https://drugdesigndata.org/about/sampl6. Accessed: 2018-04-10.

[2] Henry Adams, Andrew Tausz, and Mikael Vejdemo-Johansson. javaPlex: A research software package for persistent (co)homology. In International Congress on Mathematical Software, pages 129–136. Springer, 2014.

[3] Mamiko Arai, Vicky Brandt, and Yuri Dabaghian. The effects of theta precession on spatial learning and simplicial complex dynamics in a topological model of the hippocampal spatial map. PLoS Computational Biology, 10(6):e1003651, 2014.

[4] Marcelino Arciniega and Oliver F. Lange. Improvement of virtual screening results by docking data feature analysis. Journal of chemical information and modeling, 54(5):1401–1411, 2014.

[5] M. Stuart Armstrong, Garrett M. Morris, Paul W. Finn, Raman Sharma, Loris Moretti, Richard I. Cooper, and W. Graham Richards. ElectroShape: fast molecular similarity calculations incorporating shape, chirality and electrostatics. Journal of computer-aided molecular design, 24(9):789–801, 2010.

[6] Hossam M. Ashtawy and Nihar R. Mahapatra. A comparative assessment of ranking accuracies of conventional and machine-learning-based scoring functions for protein-ligand binding affinity prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 9(5):1301–1313, 2012.

[7] Ivet Bahar, Ali Rana Atilgan, and Burak Erman. Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential. Folding and Design, 2(3):173–181, 1997.

[8] Ulrich Bauer. Ripser: a lean C++ code for the computation of Vietoris-Rips persistence barcodes. Software available at https://github.com/Ripser/ripser, 2017.

[9] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Distributed computation of persistent homology. In 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 31–38. SIAM, 2014.

[10] Ulrich Bauer, Michael Kerber, and Jan Reininghaus. Distributed computation of persistent homology. In 2014 Proceedings of the Sixteenth Workshop on Algorithm Engineering and Experiments (ALENEX), pages 31–38. SIAM, 2014.

[11] Ulrich Bauer, Michael Kerber, Jan Reininghaus, and Hubert Wagner. PHAT: persistent homology algorithms toolbox. Journal of Symbolic Computation, 78:76–90, 2017.

[12] K. Abdulla Bava, M. Michael Gromiha, Hatsuho Uedaira, Koji Kitajima, and Akinori Sarai. ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucleic acids research, 32(suppl 1):D120–D121, 2004.

[13] Nathan Bell and Anil N. Hirani. PyDEC: software and algorithms for discretization of exterior calculus. ACM Transactions on Mathematical Software (TOMS), 39(1):3, 2012.

[14] Paul Bendich, Herbert Edelsbrunner, and Michael Kerber. Computing robustness and persistence for images. IEEE transactions on visualization and computer graphics, 16(6):1251–1260, 2010.

[15] James Bergstra, Daniel Yamins, and David D. Cox. Making a science of model search: Hyperparameter optimization in hundreds of dimensions for vision architectures. ICML (1), 28:115–123, 2013.

[16] Niklas Berliner, Joan Teyra, Recep Çolak, Sebastian Garcia Lopez, and Philip M. Kim. Combining structural modeling with ensemble machine learning to accurately predict protein fold stability and binding affinity effects upon mutation. PloS one, 9(9):e107353, 2014.

[17] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
[18] Jesse Berwald, Marian Gidea, and Mikael Vejdemo-Johansson. Automatic recognition and tagging of topologically different regimes in dynamical systems. arXiv preprint arXiv:1312.2482, 2013.

[19] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[20] Zixuan Cang, Lin Mu, and Guo-Wei Wei. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Computational Biology, 14(1):e1005929, https://doi.org/10.1371/journal.pcbi.1005929, 2018.

[21] Zixuan Cang, Lin Mu, Kedi Wu, Kristopher Opron, Kelin Xia, and Guo-Wei Wei. A topological approach for protein classification. Molecular Based Mathematical Biology, 3(1), 2015.

[22] Zixuan Cang, Elizabeth Munch, and Guo-Wei Wei. Evolutionary homology on coupled dynamical systems. arXiv preprint arXiv:1802.04677, 2018.

[23] Zixuan Cang and Guo-Wei Wei. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics, 33(22):3549–3557, 2017.

[24] Zixuan Cang and Guo-Wei Wei. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International journal for numerical methods in biomedical engineering, 34(2):e2914, 2018.

[25] Zixuan Cang and Guo-Wei Wei. Persistent cohomology for data with multicomponent heterogeneous information. Preprint, 2018.

[26] Zixuan Cang and Guo-Wei Wei. TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS Computational Biology, 13(7):e1005690, 2017.

[27] Emidio Capriotti, Piero Fariselli, and Rita Casadio. I-Mutant2.0: predicting stability changes upon mutation from the protein sequence or structure. Nucleic acids research, 33(suppl 2):W306–W310, 2005.

[28] Gunnar Carlsson. Topology and data. Bulletin of the American Mathematical Society, 46(2):255–308, 2009.

[29] Gunnar Carlsson. Topological pattern recognition for point cloud data. Acta Numerica, 23:289–368, 2014.

[30] Gunnar Carlsson, Vin De Silva, and Dmitriy Morozov. Zigzag persistent homology and real-valued functions. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 247–256. ACM, 2009.

[31] Gunnar Carlsson, Tigran Ishkhanov, Vin De Silva, and Afra Zomorodian. On the local behavior of spaces of natural images. International journal of computer vision, 76(1):1–12, 2008.

[32] Gunnar Carlsson and Afra Zomorodian. The theory of multidimensional persistence. Discrete & Computational Geometry, 42(1):71–93, 2009.

[33] Gunnar Carlsson, Afra Zomorodian, Anne Collins, and Leonidas J. Guibas. Persistence barcodes for shapes. International Journal of Shape Modeling, 11(02):149–187, 2005.

[34] Rich Caruana. Multitask learning. In Learning to learn, pages 95–133. Springer, 1998.

[35] D. A. Case, J. T. Berryman, R. M. Betz, D. S. Cerutti, T. E. Cheatham III, T. A. Darden, R. E. Duke, T. J. Giese, H. Gohlke, A. W. Goetz, N. Homeyer, S. Izadi, P. Janowski, J. Kaus, A. Kovalenko, T. S. Lee, S. LeGrand, P. Li, T. Luchko, R. Luo, B. Madej, K. M. Merz, G. Monard, P. Needham, H. Nguyen, H. T. Nguyen, I. Omelyan, A. Onufriev, D. R. Roe, A. Roitberg, R. Salomon-Ferrer, C. L. Simmerling, W. Smith, J. Swails, R. C. Walker, J. Wang, R. M. Wolf, X. Wu, D. M. York, and P. A. Kollman. Amber 2015. University of California, San Francisco, 2015.
[36] Frédéric Chazal, David Cohen-Steiner, Marc Glisse, Leonidas J. Guibas, and Steve Y. Oudot. Proximity of persistence modules and their diagrams. In Proceedings of the twenty-fifth annual symposium on Computational geometry, pages 237–246. ACM, 2009.

[37] Frédéric Chazal, Vin De Silva, Marc Glisse, and Steve Oudot. The structure and stability of persistence modules. Springer, 2016.

[38] Tiejun Cheng, Xun Li, Yan Li, Zhihai Liu, and Renxiao Wang. Comparative assessment of scoring functions on a diverse test set. Journal of chemical information and modeling, 49(4):1079–1093, 2009.

[39] Yongwook Choi, Gregory E. Sims, Sean Murphy, Jason R. Miller, and Agnes P. Chan. Predicting the functional effect of amino acid substitutions and indels. PloS one, 7(10):e46688, 2012.

[40] François Chollet. Keras. https://github.com/fchollet/keras, 2015.

[41] Samir Chowdhury and Facundo Mémoli. Persistent path homology of directed networks. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 1152–1169. SIAM, 2018.

[42] Fan R. K. Chung. Spectral graph theory. Number 92. American Mathematical Soc., 1997.

[43] David Cohen-Steiner, Herbert Edelsbrunner, and John Harer. Stability of persistence diagrams. Discrete & Computational Geometry, 37(1):103–120, 2007.

[44] David Cohen-Steiner, Herbert Edelsbrunner, John Harer, and Yuriy Mileyko. Lipschitz functions have Lp-stable persistence. Foundations of computational mathematics, 10(2):127–139, 2010.

[45] David Cohen-Steiner, Herbert Edelsbrunner, and Dmitriy Morozov. Vines and vineyards by updating persistence in linear time. In Proceedings of the twenty-second annual symposium on Computational geometry, pages 119–126. ACM, 2006.

[46] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.

[47] Jason B. Cross, David C. Thompson, Brajesh K. Rai, J. Christian Baber, Kristi Yi Fan, Yongbo Hu, and Christine Humblet. Comparison of several molecular docking programs: pose prediction and virtual screening accuracy. Journal of chemical information and modeling, 49(6):1455–1474, 2009.

[48] Yuri Dabaghian, Facundo Mémoli, Loren Frank, and Gunnar Carlsson. A topological paradigm for hippocampal spatial map formation using persistent homology. PLoS computational biology, 8(8):e1002581, 2012.

[49] George E. Dahl, Navdeep Jaitly, and Ruslan Salakhutdinov. Multi-task neural networks for QSAR predictions. arXiv preprint arXiv:1406.1231, 2014.

[50] Guillaume Damiand. Combinatorial maps. In CGAL User and Reference Manual. CGAL Editorial Board, 4.0 edition, 2012. www.cgal.org/Manual/4.0/doc html/cgal manual/packages.html#Pkg:CombinatorialMaps.

[51] I. K. Darcy and M. Vazquez. Determining the topology of stable protein-DNA complexes. Biochemical Society Transactions, 41:601–605, 2013.

[52] Bhaskar DasGupta and Jie Liang. Models and Algorithms for Biomolecules and Molecular Networks. John Wiley & Sons, 2016.

[53] Vin De Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson. Dualities in persistent (co)homology. Inverse Problems, 27(12):124003, 2011.

[54] Vin De Silva, Dmitriy Morozov, and Mikael Vejdemo-Johansson. Persistent cohomology and circular coordinates. Discrete & Computational Geometry, 45(4):737–759, 2011.

[55] Yves Dehouck, Aline Grosfils, Benjamin Folch, Dimitri Gilis, Philippe Bogaerts, and Marianne Rooman. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: PoPMuSiC-2.0. Bioinformatics, 25(19):2537–2543, 2009.
[56] Omar N. A. Demerdash, Michael D. Daily, and Julie C. Mitchell. Structure-based predictive models for allosteric hot spots. PLoS computational biology, 5(10):e1000531, 2009.

[57] Barbara Di Fabio and Claudia Landi. A Mayer–Vietoris formula for persistent homology with an application to shape recognition in the presence of occlusions. Foundations of Computational Mathematics, 11(5):499, 2011.

[58] T. J. Dolinsky, J. E. Nielsen, J. A. McCammon, and N. A. Baker. PDB2PQR: An automated pipeline for the setup, execution, and analysis of Poisson-Boltzmann electrostatics calculations. Nucleic Acids Research, 32:W665–W667, 2004.

[59] Jacob D. Durrant, Aaron J. Friedman, Kathleen E. Rogers, and J. Andrew McCammon. Comparing neural-network scoring functions and the state of the art: applications to common library screening. Journal of chemical information and modeling, 53(7):1726–1735, 2013.

[60] Herbert Edelsbrunner. Weighted alpha shapes. University of Illinois at Urbana-Champaign, Department of Computer Science, 1992.

[61] Herbert Edelsbrunner and John Harer. Computational topology: an introduction. American Mathematical Soc., 2010.

[62] Herbert Edelsbrunner, David Letscher, and Afra Zomorodian. Topological persistence and simplification. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pages 454–463. IEEE, 2000.

[63] Matthew D. Eldridge, Christopher W. Murray, Timothy R. Auton, Gaia V. Paolini, and Roger P. Mee. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. Journal of computer-aided molecular design, 11(5):425–445, 1997.

[64] A. Evgeniou and Massimiliano Pontil. Multi-task feature learning. Advances in neural information processing systems, 19:41, 2007.

[65] Theodoros Evgeniou and Massimiliano Pontil. Regularized multi-task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 109–117. ACM, 2004.

[66] Xin Feng and Yiying Tong. Choking loops on surfaces. IEEE transactions on visualization and computer graphics, 19(8):1298–1306, 2013.

[67] Xin Feng, Kelin Xia, Yiying Tong, and Guo-Wei Wei. Geometric modeling of subcellular structures, organelles, and multiprotein complexes. International Journal for Numerical Methods in Biomedical Engineering, 28(12):1198–1223, 2012.

[68] Alan R. Fersht. Dissection of the structure and activity of the tyrosyl-tRNA synthetase by site-directed mutagenesis. Biochemistry, 26(25):8031–8037, 1987.

[69] Lukas Folkman, Bela Stantic, Abdul Sattar, and Yaoqi Zhou. EASE-MM: Sequence-based prediction of mutation-induced stability changes with feature-based multiple models. Journal of molecular biology, 428(6):1394–1405, 2016.

[70] Jerome H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.

[71] Jerome H. Friedman. Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4):367–378, 2002.

[72] Patrizio Frosini. A distance for similarity classes of submanifolds of a Euclidean space. Bulletin of the Australian Mathematical Society, 42(3):407–415, 1990.

[73] Patrizio Frosini and Claudia Landi. Persistent Betti numbers for a noise tolerant shape-based approach to image retrieval. Pattern Recognition Letters, 34(8):863–872, 2013.

[74] Peter Gabriel. Unzerlegbare Darstellungen I. Manuscripta mathematica, 6(1):71–103, 1972.
[75] Marcio Gameiro, Yasuaki Hiraoka, Shunsuke Izumi, Miroslav Kramar, Konstantin Mischaikow, and Vidit Nanda. A topological measurement of protein compressibility. Japan Journal of Industrial and Applied Mathematics, 32(1):1–17, 2015.

[76] Marcio Gameiro, Konstantin Mischaikow, and William Kalies. Topological characterization of spatial-temporal chaos. Physical Review E, 70(3):035203, 2004.

[77] Samuel Genheden and Ulf Ryde. The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert opinion on drug discovery, 10(5):449–461, 2015.

[78] Ivet Getov, Marharyta Petukh, and Emil Alexov. SAAFEC: predicting the effect of single point mutations on protein folding free energy using a knowledge-modified MM/PBSA approach. International journal of molecular sciences, 17(4):512, 2016.

[79] Robert Ghrist. Barcodes: the persistent topology of data. Bulletin of the American Mathematical Society, 45(1):61–75, 2008.

[80] Robert Ghrist and Abubakr Muhammad. Coverage and hole-detection in sensor networks via homology. In Information Processing in Sensor Networks, 2005. IPSN 2005. Fourth International Symposium on, pages 254–260. IEEE, 2005.

[81] Robert W. Ghrist. Elementary applied topology. Createspace Seattle, 2014.

[82] Michael K. Gilson and Huan-Xiang Zhou. Calculation of protein-ligand binding affinities. Annual review of biophysics and biomolecular structure, 36, 2007.

[83] Nobuhiro Go, Tosiyuki Noguti, and Tetsuo Nishikawa. Dynamics of a small globular protein in terms of low-frequency vibrational modes. Proceedings of the National Academy of Sciences, 80(12):3696–3700, 1983.

[84] Alexander Grigor’yan, Yong Lin, Yuri Muranov, and Shing-Tung Yau. Homologies of path complexes and digraphs. arXiv preprint arXiv:1207.2834, 2012.

[85] Raphael Guerois, Jens Erik Nielsen, and Luis Serrano. Predicting changes in the stability of proteins and protein complexes: a study of more than 1000 mutations. Journal of molecular biology, 320(2):369–387, 2002.

[86] Allen Hatcher. Algebraic Topology. Cambridge University Press, 2001.

[87] Jean-Claude Hausmann et al. On the Vietoris-Rips complexes and a cohomology theory for metric spaces. Annals of Mathematics Studies, 138:175–188, 1995.

[88] Christine Heitsch and Svetlana Poznanović. Combinatorial insights into RNA secondary structure. In Discrete and topological models in molecular biology, pages 145–166. Springer, 2014.

[89] Anil N. Hirani, Kaushik Kalyanaraman, Han Wang, and Seth Watts. Cohomologous harmonic cochains. arXiv preprint arXiv:1012.2835, 2010.

[90] Anil Nirmal Hirani. Discrete exterior calculus. PhD thesis, California Institute of Technology, 2003.

[91] Danijela Horak, Slobodan Maletić, and Milan Rajković. Persistent homology of complex networks. Journal of Statistical Mechanics: Theory and Experiment, 2009(03):P03034, 2009.

[92] Gang Hu, Junzhong Yang, and Wenji Liu. Instability and controllability of linearly coupled oscillators: Eigenvalue analysis. Phys. Rev. E, 58:4440–4453, 1998.

[93] Niu Huang, Brian K. Shoichet, and John J. Irwin. Benchmarking sets for molecular docking. Journal of medicinal chemistry, 49(23):6789–6801, 2006.

[94] Sheng-You Huang and Xiaoqin Zou. An iterative knowledge-based scoring function to predict protein–ligand interactions: I. Derivation of interaction potentials. Journal of computational chemistry, 27(15):1866–1875, 2006.

[95] John J. Irwin and Brian K. Shoichet. ZINC: a free database of commercially available compounds for virtual screening. Journal of chemical information and modeling, 45(1):177–182, 2005.
[96] George A. Jeffrey. An introduction to hydrogen bonding, volume 32. Oxford University Press, New York, 1997.

[97] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. [Online].

[98] Tomasz Kaczynski, Konstantin Mischaikow, and Marian Mrozek. Computational homology, volume 157. Springer Science & Business Media, 2006.

[99] Peter M. Kasson, Afra Zomorodian, Sanghyun Park, Nina Singhal, Leonidas J. Guibas, and Vijay S. Pande. Persistent voids: a new structural metric for membrane fusion. Bioinformatics, 23(14):1753–1759, 2007.

[100] Elizabeth H. Kellogg, Andrew Leaver-Fay, and David Baker. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins: Structure, Function, and Bioinformatics, 79(3):830–838, 2011.

[101] Michael Kerber, Dmitriy Morozov, and Arnur Nigmetov. Geometry helps to compare persistence diagrams. Journal of Experimental Algorithmics (JEA), 22:1–4, 2017.

[102] Firas A. Khasawneh and Elizabeth Munch. Exploring equilibria in stochastic delay differential equations using persistent homology. In ASME 2014 International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, pages V008T11A034–V008T11A034. American Society of Mechanical Engineers, 2014.

[103] Firas A. Khasawneh and Elizabeth Munch. Chatter detection in turning using persistent homology. Mechanical Systems and Signal Processing, 70:527–541, 2016.

[104] Firas A. Khasawneh and Elizabeth Munch. Utilizing topological data analysis for studying signals of time-delay systems. In Time Delay Systems, pages 93–106. Springer, 2017.

[105] Sarah L. Kinnings, Nina Liu, Peter J. Tonge, Richard M. Jackson, Lei Xie, and Philip E. Bourne. A machine learning-based method to improve docking scoring functions and its application to drug repurposing. Journal of chemical information and modeling, 51(2):408–419, 2011.

[106] Miroslav Kramár, Rachel Levanger, Jeffrey Tithof, Balachandra Suri, Mu Xu, Mark Paul, Michael F. Schatz, and Konstantin Mischaikow. Analysis of Kolmogorov flow and Rayleigh–Bénard convection using persistent homology. Physica D: Nonlinear Phenomena, 334:82–98, 2016.

[107] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[108] Brett M. Kroncke, Amanda M. Duran, Jeffrey L. Mendenhall, Jens Meiler, Jeffrey D. Blume, and Charles R. Sanders. Documentation of an imperative to improve methods for predicting membrane protein stability. Biochemistry, 55(36):5002–5009, 2016.

[109] Tugba G. Kucukkal, Marharyta Petukh, Lin Li, and Emil Alexov. Structural and physico-chemical effects of disease and non-disease nsSNPs on proteins. Current opinion in structural biology, 32:18–24, 2015.

[110] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[111] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[112] Hyekyoung Lee, Hyejin Kang, Moo K. Chung, Bung-Nyun Kim, and Dong Soo Lee. Persistent brain network homology from the perspective of dendrogram. IEEE transactions on medical imaging, 31(12):2267–2277, 2012.

[113] Michael Lesnick and Matthew Wright. Interactive visualization of 2-D persistence modules. arXiv preprint arXiv:1512.00180, 2015.
[114] Hongjian Li, Kwong-Sak Leung, Man-Hon Wong, and Pedro J. Ballester. Substituting random forest for multiple linear regression improves binding affinity prediction of scoring functions: Cyscore as a case study. BMC bioinformatics, 15(1):291, 2014.

[115] Hongjian Li, Kwong-Sak Leung, Man-Hon Wong, and Pedro J. Ballester. Improving AutoDock Vina using random forest: the growing accuracy of binding affinity prediction by the effective exploitation of larger data sets. Molecular informatics, 34(2-3):115–126, 2015.

[116] Hongjian Li, Kwong-Sak Leung, Man-Hon Wong, and Pedro J. Ballester. Low-quality structural and interaction data improves binding affinity prediction via random forest. Molecules, 20(6):10947–10962, 2015.

[117] Beibei Liu, Bao Wang, Rundong Zhao, Yiying Tong, and Guo-Wei Wei. ESES: software for Eulerian solvent excluded surface. Journal of computational chemistry, 38(7):446–466, 2017.

[118] Jun Liu, Shuiwang Ji, and Jieping Ye. Multi-task feature learning via efficient l2,1-norm minimization. In Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence, pages 339–348. AUAI Press, 2009.

[119] Zhihai Liu, Yan Li, Li Han, Jie Li, Jie Liu, Zhixiong Zhao, Wei Nie, Yuchen Liu, and Renxiao Wang. PDB-wide collection of binding data: current status of the PDBbind database. Bioinformatics, 31(3):405–412, 2014.

[120] Alessandro Lusci, Gianluca Pollastri, and Pierre Baldi. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. Journal of chemical information and modeling, 53(7):1563–1575, 2013.

[121] A. D. MacKerell Jr., D. Bashford, M. Bellott, R. L. Dunbrack Jr., J. D. Evanseck, M. J. Field, S. Fischer, J. Gao, H. Guo, S. Ha, D. Joseph-McCarthy, L. Kuchnir, K. Kuczera, F. T. K. Lau, C. Mattos, S. Michnick, T. Ngo, D. T. Nguyen, B. Prodhom, W. E. Reiher III, B. Roux, M. Schlenkrich, J. C. Smith, R. Stote, J. Straub, M. Watanabe, J. Wiorkiewicz-Kuczera, D. Yin, and M. Karplus. All-atom empirical potential for molecular modeling and dynamics studies of proteins. Journal of Physical Chemistry B, 102(18):3586–3616, 1998.

[122] Clément Maria, Jean-Daniel Boissonnat, Marc Glisse, and Mariette Yvinec. The GUDHI library: Simplicial complexes and persistent homology. In International Congress on Mathematical Software, pages 167–174. Springer, 2014.

[123] J. L. Martinez and F. Baquero. Mutation frequencies and antibiotic resistance. Antimicrobial agents and chemotherapy, 44(7):1771–1777, 2000.

[124] Andreas Mayr, Günter Klambauer, Thomas Unterthiner, and Sepp Hochreiter. DeepTox: toxicity prediction using deep learning. Frontiers in Environmental Science, 3:80, 2016.

[125] Marvin Minsky and Seymour A. Papert. Perceptrons: an introduction to computational geometry. MIT press, 2017.

[126] Konstantin Mischaikow, Marian Mrozek, J. Reiss, and Andrzej Szymczak. Construction of symbolic dynamics from experimental time series. Physical Review Letters, 82(6):1144, 1999.

[127] Konstantin Mischaikow and Vidit Nanda. Morse theory for filtrations and efficient computation of persistent homology. Discrete & Computational Geometry, 50(2):330–353, 2013.

[128] Maria A. Miteva, Frederic Guyon, and Pierre Tufféry. Frog2: Efficient 3D conformation ensemble generator for small compounds. Nucleic acids research, 38(suppl 2):W622–W627, 2010.

[129] Dmitriy Morozov. Dionysus. http://www.mrzv.org/software/dionysus/, 2015.
[130] Garrett M. Morris, Ruth Huey, William Lindstrom, Michel F. Sanner, Richard K. Belew, David S. Goodsell, and Arthur J. Olson. AutoDock4 and AutoDockTools4: Automated docking with selective receptor flexibility. Journal of computational chemistry, 30(16):2785–2791, 2009.

[131] Ingo Muegge and Yvonne C. Martin. A general and fast scoring function for protein-ligand interactions: a simplified potential approach. Journal of medicinal chemistry, 42(5):791–804, 1999.

[132] Daniel Müllner and Aravindakshan Babu. Python Mapper: An open-source toolchain for data exploration, analysis and visualization. See http://danifold.net/mapper, 2013.

[133] Elizabeth Munch. Applications of persistent homology to time varying systems. PhD thesis, 2013.

[134] Elizabeth Munch. A user's guide to topological data analysis. Journal of Learning Analytics, 4(2):47–61, 2017.

[135] Michael M. Mysinger and Brian K. Shoichet. Rapid context-dependent ligand desolvation in molecular docking. Journal of chemical information and modeling, 50(9):1561–1573, 2010.

[136] Vidit Nanda. Perseus: the persistent homology software. Software available at http://www.sas.upenn.edu/~vnanda/perseus, 2012.

[137] Marco A. C. Neves, Maxim Totrov, and Ruben Abagyan. Docking and scoring with ICM: the benchmarking results and strategies for improvement. Journal of computer-aided molecular design, 26(6):675–686, 2012.

[138] Jiquan Ngiam, Aditya Khosla, Mingyu Kim, Juhan Nam, Honglak Lee, and Andrew Y. Ng. Multimodal deep learning. In Proceedings of the 28th international conference on machine learning (ICML-11), pages 689–696, 2011.

[139] Duc D. Nguyen and Guo-Wei Wei. The impact of surface area, volume, curvature, and Lennard–Jones potential to solvation modeling. Journal of computational chemistry, 38(1):24–36, 2017.

[140] Duc D. Nguyen, Tian Xiao, Menglun Wang, and Guo-Wei Wei. Rigidity strengthening: A mechanism for protein–ligand binding. Journal of chemical information and modeling, 57(7):1715–1721, 2017.

[141] Duc Duy Nguyen, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao, and Guo-Wei Wei. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R Grand Challenges. arXiv preprint arXiv:1804.10647, 2018.

[142] Monica Nicolau, Arnold J. Levine, and Gunnar Carlsson. Topology based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival. Proceedings of the National Academy of Sciences, 108(17):7265–7270, 2011.

[143] Jessica L. Nielson, Jesse Paquette, Aiwen W. Liu, Cristian F. Guandique, C. Amy Tovar, Tomoo Inoue, Karen-Amanda Irvine, John C. Gensel, Jennifer Kloke, Tanya C. Petrossian, et al. Topological data analysis for discovery in preclinical spinal cord injury and traumatic brain injury. Nature communications, 6:8581, 2015.

[144] Ippei Obayashi. Volume optimal cycle: Tightest representative cycle of a generator on persistent homology. arXiv preprint arXiv:1712.05103, 2017.

[145] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis. The Journal of chemical physics, 140(23):06B617 1, 2014.

[146] Kristopher Opron, Kelin Xia, and Guo-Wei Wei. Communication: Capturing protein multiscale thermal fluctuations, 2015.

[147] Angel R. Ortiz, M. Teresa Pisabarro, Federico Gago, and Rebecca C. Wade. Prediction of drug binding affinities by comparative binding energy analysis. Journal of medicinal chemistry, 38(14):2681–2691, 1995.
[148] Edward Ott, Celso Grebogi, and James A. Yorke. Controlling chaos. Physical review letters, 64(11):1196, 1990.

[149] Steve Y. Oudot. Persistence theory: from quiver representations to data analysis, volume 209. American Mathematical Society, Providence, RI, 2015.

[150] Deepti Pachauri, Chris Hinrichs, Moo K. Chung, Sterling C. Johnson, and Vikas Singh. Topology-based kernels with application to inference problems in Alzheimer's disease. IEEE transactions on medical imaging, 30(10):1760–1770, 2011.

[151] Jun-Koo Park, Robert Jernigan, and Zhijun Wu. Coarse grained normal mode analysis vs. refined Gaussian network model for protein residue-level structural fluctuations. Bulletin of mathematical biology, 75(1):124–160, 2013.

[152] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[153] Yunhui Peng, Lexuan Sun, Zhe Jia, Lin Li, and Emil Alexov. Predicting protein-DNA binding free energy change upon missense mutations using modified MM/PBSA approach: SAMPDI webserver. Bioinformatics, 34(5):779–786, 2018.

[154] Jose A. Perea. Persistent homology of toroidal sliding window embeddings. In Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on, pages 6435–6439. IEEE, 2016.

[155] Jose A. Perea. Multiscale projective coordinates via persistent cohomology of sparse filtrations. Discrete & Computational Geometry, 59(1):175–225, 2018.

[156] Jose A. Perea, Anastasia Deckard, Steve B. Haase, and John Harer. SW1PerS: Sliding windows and 1-persistence scoring; discovering periodicity in gene expression time series data. BMC bioinformatics, 16(1):257, 2015.

[157] Jose A. Perea and John Harer. Sliding windows and persistence: An application of topological methods to signal analysis. Foundations of Computational Mathematics, 15(3):799–838, 2015.

[158] Janaina Cruz Pereira, Ernesto Raul Caffarena, and Cicero Nogueira dos Santos. Boosting docking-based virtual screening with deep learning. Journal of chemical information and modeling, 56(12):2495–2506, 2016.

[159] Douglas E. V. Pires, David B. Ascher, and Tom L. Blundell. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic acids research, 42(W1):W314–W319, 2014.

[160] Xose S. Puente, Luis M. Sánchez, Christopher M. Overall, and Carlos López-Otín. Human and mouse proteases: a comparative genomic approach. Nature Reviews Genetics, 4(7):544, 2003.

[161] Lijun Quan, Qiang Lv, and Yang Zhang. STRUM: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics, 32(19):2936–2946, 2016.

[162] V. Robins, J. D. Meiss, and E. Bradley. Computing connectedness: Disconnectedness and discreteness. Physica D: Nonlinear Phenomena, 139(3-4):276–300, 2000.

[163] Vanessa Robins. Towards computing homology from finite approximations. In Topology proceedings, volume 24, pages 503–532, 1999.

[164] Vanessa Robins, James D. Meiss, and Elizabeth Bradley. Computing connectedness: An exercise in computational topology. Nonlinearity, 11(4):913, 1998.

[165] Michael Robinson. Topological signal processing. Springer, 2016.

[166] Lior Rokach and Oded Maimon. Top-down induction of decision trees classifiers - a survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 35(4):476–487, 2005.
[167] Erik Rybakken, Nils Baas, and Benjamin Dunn. Decoding of neural data using cohomological learning. arXiv preprint arXiv:1711.07205, 2017.

[168] Tamar Schlick and Wilma K. Olson. Trefoil knotting revealed by molecular dynamics simulations of supercoiled DNA. Science, 257(5073):1110–1115, 1992.

[169] X. Shi and P. Koehl. Geometry and topology for modeling biomolecular surfaces. Far East J Applied Math, 50:1–34, 2011.

[170] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[171] Gurjeet Singh, Facundo Mémoli, and Gunnar E. Carlsson. Topological methods for the analysis of high dimensional data sets and 3D object recognition. In SPBG, pages 91–100, 2007.

[172] Gurjeet Singh, Facundo Memoli, Tigran Ishkhanov, Guillermo Sapiro, Gunnar Carlsson, and Dario L. Ringach. Topological analysis of population activity in visual cortex. Journal of vision, 8(8):11–11, 2008.

[173] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[174] Bernadette J. Stolz, Heather A. Harrington, and Mason A. Porter. Persistent homology of time-dependent functional networks constructed from coupled time series. Chaos: An Interdisciplinary Journal of Nonlinear Science, 27(4):047410, 2017.

[175] D. W. Sumners. Knot theory and DNA. In Proceedings of Symposia in Applied Mathematics, volume 45, pages 39–72, 1992.

[176] Akihiro Takiyama, Takashi Teramoto, Hiroaki Suzuki, Katsushige Yamashiro, and Shinya Tanaka. Persistent homology index as a robust quantitative measure of immunohistochemical scoring. Scientific reports, 7(1):14002, 2017.

[177] Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016.

[178] Christopher J. Tralie and Jose A. Perea. (Quasi)periodicity quantification in video data, using topology. SIAM Journal on Imaging Sciences, 11(2):1049–1077, 2018.

[179] Oleg Trott and Arthur J. Olson. AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of computational chemistry, 31(2):455–461, 2010.

[180] Thomas Unterthiner, Andreas Mayr, Günter Klambauer, and Sepp Hochreiter. Toxicity prediction using deep learning. arXiv preprint arXiv:1503.01445, 2015.

[181] Mikael Vejdemo-Johansson, Florian T. Pokorny, Primoz Skraba, and Danica Kragic. Cohomological learning of periodic motion. Applicable Algebra in Engineering, Communication and Computing, 26(1-2):5–26, 2015.

[182] Hans F. G. Velec, Holger Gohlke, and Gerhard Klebe. DrugScoreCSD knowledge-based scoring function derived from small molecule crystal data with superior recognition rate of near-native ligand poses and better affinity prediction. Journal of medicinal chemistry, 48(20):6296–6303, 2005.

[183] G. Verkhivker, K. Appelt, S. T. Freer, and J. E. Villafranca. Empirical free energy calculations of ligand-protein crystallographic complexes. I. Knowledge-based ligand-protein interaction potentials applied to the prediction of human immunodeficiency virus 1 protease binding affinity. Protein Engineering, Design and Selection, 8(7):677–691, 1995.

[184] Izhar Wallach, Michael Dzamba, and Abraham Heifets. AtomNet: A deep convolutional neural network for bioactivity prediction in structure-based drug discovery. arXiv preprint arXiv:1510.02855, 2015.
[185] Stéfan van der Walt, S. Chris Colbert, and Gael Varoquaux. The NumPy array: a structure for efficient numerical computation. Computing in Science & Engineering, 13(2):22–30, 2011.

[186] Bao Wang and Guo-Wei Wei. Object-oriented persistent homology. Journal of computational physics, 305:276–299, 2016.

[187] Bao Wang, Zhixiong Zhao, Duc D. Nguyen, and Guo-Wei Wei. Feature functional theory-binding predictor (FFT-BP) for the blind prediction of binding free energies. Theoretical Chemistry Accounts, 136(4):55, 2017.

[188] Bao Wang, Zhixiong Zhao, and Guo-Wei Wei. Automatic parametrization of non-polar implicit solvent models for the blind prediction of solvation free energies. The Journal of chemical physics, 145(12):124110, 2016.

[189] Changhao Wang, D'Artagnan Greene, Li Xiao, Ruxi Qi, and Ray Luo. Recent developments and applications of the MMPBSA method. Frontiers in molecular biosciences, 4:87, 2017.

[190] Renxiao Wang, Luhua Lai, and Shaomeng Wang. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. Journal of computer-aided molecular design, 16(1):11–26, 2002.

[191] Sheng Wang, Siqi Sun, Zhen Li, Renyu Zhang, and Jinbo Xu. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS computational biology, 13(1):e1005324, 2017.

[192] Guo-Wei Wei, Meng Zhan, and Choy Heng Lai. Tailoring wavelets for chaos control. Phys. Rev. Lett., 89:284103, 2002.

[193] Catherine L. Worth, Robert Preissner, and Tom L. Blundell. SDM: a server for predicting effects of mutations on protein stability and malfunction. Nucleic acids research, 39(suppl 2):W215–W222, 2011.

[194] Chengyuan Wu, Shiquan Ren, Jie Wu, and Kelin Xia. Weighted (co)homology and weighted Laplacian. arXiv preprint arXiv:1804.06990, 2018.

[195] Kelin Xia, Xin Feng, Yiying Tong, and Guo-Wei Wei. Persistent homology for the quantitative prediction of fullerene stability. Journal of computational chemistry, 36(6):408–422, 2015.

[196] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. Multiscale multiphysics and multidomain models: flexibility and rigidity. The Journal of chemical physics, 139(19):11B614 1, 2013.

[197] Kelin Xia and Guo-Wei Wei. Molecular nonlinear dynamics and protein thermal uncertainty quantification. Chaos: An Interdisciplinary Journal of Nonlinear Science, 24:013103, 2014.

[198] Kelin Xia and Guo-Wei Wei. Persistent homology analysis of protein structure, flexibility, and folding. International journal for numerical methods in biomedical engineering, 30(8):814–844, 2014.

[199] Kelin Xia and Guo-Wei Wei. Multidimensional persistence in biomolecular data. Journal of computational chemistry, 36(20):1502–1520, 2015.

[200] Kelin Xia and Guo-Wei Wei. Persistent topology for cryo-EM data analysis. International Journal for Numerical Methods in Biomedical Engineering, 31(8), 2015.

[201] Kelin Xia, Zhixiong Zhao, and Guo-Wei Wei. Multiresolution persistent homology for excessively large biomolecular datasets. The Journal of chemical physics, 143(13):10B603 1, 2015.

[202] Kelin Xia, Zhixiong Zhao, and Guo-Wei Wei. Multiresolution topological simplification. Journal of Computational Biology, 22(9):887–891, 2015.

[203] Zhexin Xiang and Barry Honig. Extending the accuracy limits of prediction for side-chain conformations. Journal of molecular biology, 311(2):421–430, 2001.
[204] Lee-Wei Yang and Choon-Peng Chng. Coarse-grained models reveal functional dynamics - I. Elastic network models: theories, comparisons and perspectives. Bioinformatics and Biology Insights, 2:BBI–S460, 2008.

[205] Yang Yang, Biao Chen, Ge Tan, Mauno Vihinen, and Bairong Shen. Structure-based prediction of the effects of a missense variant on protein stability. Amino Acids, 44(3):847–855, 2013.

[206] Yuan Yao, Jian Sun, Xuhui Huang, Gregory R. Bowman, Gurjeet Singh, Michael Lesnick, Leonidas J. Guibas, Vijay S. Pande, and Gunnar Carlsson. Topological methods for exploring low-density states in biomolecular folding pathways. The Journal of chemical physics, 130(14):04B614, 2009.

[207] Shuangye Yin, Lada Biedermannova, Jiri Vondrasek, and Nikolay V. Dokholyan. MedusaScore: an accurate force field-based scoring function for virtual drug screening. Journal of chemical information and modeling, 48(8):1656–1662, 2008.

[208] Piotr Zgliczynski and Konstantin Mischaikow. Rigorous numerics for partial differential equations: The Kuramoto-Sivashinsky equation. Foundations of Computational Mathematics, 1(3):255–288, 2001.

[209] Shunhong Zhang, Qian Wang, Yoshiyuki Kawazoe, and Puru Jena. Three-dimensional metallic boron nitride. Journal of the American Chemical Society, 135(48):18216–18221, 2013.

[210] Zhe Zhang, Maria A. Miteva, Lin Wang, and Emil Alexov. Analyzing effects of naturally occurring missense mutations. Computational and mathematical methods in medicine, 2012, 2012.

[211] Zheng Zheng and Kenneth M. Merz Jr. Ligand identification scoring algorithm (LISA). Journal of chemical information and modeling, 51(6):1296–1306, 2011.

[212] Zheng Zheng, Melek N. Ucisik, and Kenneth M. Merz. The movable type method applied to protein–ligand binding. Journal of chemical theory and computation, 9(12):5526–5538, 2013.

[213] Jiayu Zhou, Jianhui Chen, and Jieping Ye. Clustered multi-task learning via alternating structure optimization. In Advances in neural information processing systems, pages 702–710, 2011.

[214] Jiayu Zhou, Jianhui Chen, and Jieping Ye. MALSAR: Multi-task learning via structural regularization. Arizona State University, 21, 2011.

[215] Afra Zomorodian. Fast construction of the Vietoris-Rips complex. Computers & Graphics, 34(3):263–271, 2010.

[216] Afra Zomorodian and Gunnar Carlsson. Computing persistent homology. Discrete & Computational Geometry, 33(2):249–274, 2005.