ALGEBRAIC TOPOLOGY AND GRAPH THEORY BASED

APPROACHES FOR PROTEIN FLEXIBILITY ANALYSIS AND B

FACTOR PREDICTION

By

David Bramer

A DISSERTATION

Submitted to

Michigan State University

in partial fulﬁllment of the requirements

for the degree of

Mathematics - Doctor of Philosophy

2019

ABSTRACT

ALGEBRAIC TOPOLOGY AND GRAPH THEORY BASED APPROACHES FOR

PROTEIN FLEXIBILITY ANALYSIS AND B FACTOR PREDICTION

By

David Bramer

Protein ﬂuctuation, measured by B factors, has been shown to highly correlate to protein

ﬂexibility and function. Several methods have been developed to predict protein B factor as

well as related applications. While many B factor methods exist, reliable B factor prediction

continues to be an ongoing challenge and there is much room for improvement.

This work introduces a paradigm shifting geometric graph based model called the multi-

scale weighted colored graph (MWCG) model. The MWCG model is a new computational

algorithm that greatly improves the current landscape of protein structural ﬂuctuation anal-

ysis. The MWCG model treats each protein as a colored graph where colored nodes corre-

spond to atomic element types and edges are weighted by a generalized centrality metric.

Each graph contains multiple subgraphs based on interaction types between graphic nodes.

Protein rigidity is represented by generalized centralities of subgraphs. MWCGs predict B

factors of protein residues and accurately analyze the ﬂexibility of all atoms in a protein

simultaneously. The MWCG model presented here captures element speciﬁc interactions

across multiple scales and is a novel visual tool for identifying various protein secondary

structures. This work also demonstrates MWCG protein hinge detection using a variety of

proteins.

Cross-protein prediction of B factors has previously been an unsolved problem in terms

of B factor prediction. Many proteins are diﬃcult to crystallize, and for some it is likely

impossible, so models that can cross predict protein B factor are absolutely necessary. Using

machine learning and the MWCG method, this work provides a robust cross protein B factor

prediction using a set of known proteins to predict the B factors of a protein previously

unseen to the algorithm. The algorithm connects diﬀerent proteins using global protein

features such as the resolution of the X-ray crystallography data. The combination of global

and local features results in successful cross protein B factor prediction. To test and validate

these results this work considers several machine learning approaches such as random forest,

gradient boosted trees, and deep convolutional neural networks.

Recently, persistent homology has had tremendous success in biomolecular data analysis.

It works by examining the topological relationship or connectivity of a group of atoms in a

molecule at a variety of scales, then rendering a family of topological representations of the

molecule. Persistent homology is rarely employed for analysis of atomic properties, such as

protein ﬂexibility analysis or B factor prediction. This work introduces atom speciﬁc per-

sistent homology (ASPH) to provide a local atomic level representation of a molecule via a

global topological tool. This is achieved through the construction of a pair of conjugated sets

of atoms and corresponding conjugated simplicial complexes, as well as conjugated topolog-

ical spaces. The diﬀerence between the topological invariants of the pair of conjugated sets

is measured by Bottleneck and Wasserstein metrics and leads to an atom speciﬁc topological

representation of individual atomic properties in a molecule. Atom speciﬁc topological fea-

tures are integrated with various machine learning algorithms, including gradient boosting

trees and convolutional neural network for protein thermal ﬂuctuation analysis and blind

cross protein B factor prediction.

Extensive numerical testing indicates the proposed methods provide novel and powerful

graph theory and algebraic topology based tools for analyzing and predicting atom speciﬁc,

localized protein ﬂexibility information.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii

LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

KEY TO ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv

Chapter 1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1

Chapter 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.1 Computing Protein Flexibility and Dynamics . . . . . . . . . . . . . . . . . .
2.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Chapter 3 Multiscale Weighted Colored Graphs . . . . . . . . . . . . . . . .
3.1 Weighted colored graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.2 WCG Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
3.3 Weighted Colored Graph Flexibility Analysis . . . . . . . . . . . . . . . . . .
3.4 Multiscale Weighted Colored Graph Flexibility Analysis . . . . . . . . . . . .
3.5 Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Chapter 4 Atom Speciﬁc Persistent Homology . . . . . . . . . . . . . . . . .
4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.2 Simplex & Simplicial Complex . . . . . . . . . . . . . . . . . . . . . . . . . .
4.3 Homology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.4 Filtration & Persistence
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.5 Similarity and distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.6 Vietoris-Rips Complex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
4.7 Atom Speciﬁc Persistent Homology & Element Speciﬁc Persistent Homology

5.1.2 Neural Networks

Chapter 5 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1.1 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1.1.1 Random forest
. . . . . . . . . . . . . . . . . . . . . . . . .
5.1.1.2 Gradient boosted trees . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.1.2.1 Convolutional Neural Network . . . . . . . . . . . . . . . . .
5.1.3 Consensus methods . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2 General Machine Learning Features . . . . . . . . . . . . . . . . . . . . . . .
5.2.1 Global features
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.2.2 Local features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Image-like MWCG Features . . . . . . . . . . . . . . . . . . . . . . .

5.3 MWCG Features

5.3.1

7
8
11

13
13
15
16
17
19

22
22
24
25
27
27
29
29

34
35
35
36
37
37
38
39
39
40
41
42
43

iv

5.4 ASPH & ESPH Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.4.1
Image-like ASPH & ESPH Features . . . . . . . . . . . . . . . . . . .
5.4.2 Cutoﬀ Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.5 Machine Learning Model Parameters . . . . . . . . . . . . . . . . . . . . . .
5.5.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.5.1.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . .
5.5.1.2 Gradient Boosted Trees
. . . . . . . . . . . . . . . . . . . .
5.5.1.3 Deep Convolutional Neural Network . . . . . . . . . . . . .
5.5.2 ASPH & ESPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.5.2.1 Gradient Boosted Trees
. . . . . . . . . . . . . . . . . . . .
5.5.2.2 Deep Convolutional Neural Network . . . . . . . . . . . . .
5.6 Machine Learning Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . .
5.6.1 Training set and test set . . . . . . . . . . . . . . . . . . . . . . . . .

44
45
47
48
48
48
49
49
51
52
52
54
54

Chapter 6 Workﬂow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

56

7.4 Machine Learning Results

Chapter 7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . .
7.1 Visualization of Element Speciﬁc Correlation Maps
7.2 Hinge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3.1 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.3.2 Fitting Results
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.4.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.4.1.1 Eﬃciency comparison . . . . . . . . . . . . . . . . . . . . .
7.4.1.2 Machine learning performance . . . . . . . . . . . . . . . . .
7.4.1.3 Relative feature importance . . . . . . . . . . . . . . . . . .
7.5 ASPH & ESPH B Factor Prediction . . . . . . . . . . . . . . . . . . . . . . .
7.5.1 Least Squares Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . .
7.5.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

59
59
59
66
66
67
72
72
73
74
80
81
81
82

Chapter 8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.1 Element Speciﬁc Heat Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8.2 Hinge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
8.3 Fitting Models
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.3.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
8.3.2 ASPH & ESPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
. . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.4.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
8.4.2 ASPH & ESPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

8.4 Machine Learning Models

Chapter 9 Conclusions and Future Directions . . . . . . . . . . . . . . . . . 120
9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
. . . . . . . . . . . . . . . . . . . . . . . . . . 127

Software Development

9.2.1

v

9.2.2
9.2.3
9.2.4 Other general approaches

Inclusion of other datasets . . . . . . . . . . . . . . . . . . . . . . . . 128
Speciﬁc applications in drug design and docking pose . . . . . . . . . 128
. . . . . . . . . . . . . . . . . . . . . . . . 129

BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130

vi

LIST OF TABLES

Table 2.1: Notable molecular mechanic techniques and the year of introduction. . . .

Table 3.1: Element pair combinations used in weighted colored graph.

. . . . . . . .

Table 3.2: Parameters used for correlation kernels in a parameter-free MWCG. Pa-
. . .

rameter optimization results originally published in Bramer et al [1].

Table 4.1: Topological invariants displayed as Betti numbers. Betti-0 represents the
number of connected components, Betti-1 the number of tunnels or circles,
and Betti-2 the number of cavities or voids. Two auxiliary rings are added
to the torus to illustrate that Betti-1=2. . . . . . . . . . . . . . . . . . . .

Table 5.1: The packing density distance parameters (d ˚A) used for generating short
. . . . . . .

medium, and long packing density machine learning features.

Table 5.2: Correlation kernel parameters used to generate parameter-free MWCG ma-
. . . . .

chine learning features. Parameters based on previous results.[1]

Table 5.3: Parameters used for topological feature generation. All features used a cut-
oﬀ of 11˚A. Both lorentz (Lor) and exponential (exp) kernels and Bottleneck
(B) and Wasserstein (W) distance metrics were used. . . . . . . . . . . . .

Table 5.4: Parameters used for the element speciﬁc persistent homology features with
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

a cutoﬀ of 11 ˚A.

Table 5.5: Boosted gradient tree parameters used for testing MWCG based B factor
prediction. These parameters were determined using a grid search. Any
hyper parameters not listed below were taken to be the default values
provided by the python scikit-learn package. MWCG based GBT machine
. . . .
learning prediction results originally published in Bramer et al [2].

Table 5.6: MWCG based deep Convolutional Neural Network (CNN) hyper-parameters
used for testing. These hyper-parameters were determined using a grid
search. Any hyper parameters not listed below were taken to be the de-
fault values provided by python with the Keras package. MWCG machine
learning prediction results originally published in Bramer et al [2].
. . . .

Table 5.7: Boosted gradient tree parameters used for persistent homology based pre-
diction testing. Parameters were determined using a grid search. Any hyper
parameters not listed below were taken to be the default values provided
by the python scikit-learn package. . . . . . . . . . . . . . . . . . . . . . .

11

14

20

23

42

43

46

48

49

51

52

vii

Table 5.8: Convolutional Neural Network (CNN) parameters used for testing persis-
tent homology based features. Parameters were determined using a grid
search. Any hyper-parameters not listed below were taken to be the de-
fault values provided by python with the Keras package. . . . . . . . . . .

Table 7.4: Correlation coeﬃcients for B factor prediction obtained by MWCG, optimal
FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode
(GNM) for a set of 364 proteins. GNM scores reported here are the result
of tests with a processed set of PDB ﬁles as described in Chapter 2.2.
MWCG results originally published in Bramer et al [1].
. . . . . . . . . .

Table 7.12: Pearson correlation coeﬃcients for cross protein heavy atom blind B factor
prediction obtained by random forest (RF), boosted gradient (GBT), and
convolutional neural network (CNN) for the Superset. Results reported use
heavy atoms in both training and prediction. MWCG machine learning
. . . . . . . . . . . . . . .
results originally published in Bramer et al [2].

Table 7.21: Pearson correlation coeﬃcients of persistent homology based least squares
ﬁtting Cα B factor prediction of all proteins using 11˚A cutoﬀ. Two Bot-
tleneck (B) and Wasserstein (W) metrics using various kernel choices are
included.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Table 7.22: Persistent homology based Pearson correlation coeﬃcients for cross protein
Cα atom blind B factor prediction obtained by boosted gradient (GBT),
convolutional neural network (CNN), and consensus method (CON) for the
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Superset.

Table 7.1: Correlation coeﬃcients for B factor prediction obtained by optimal FRI
(opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM)
for small-size structures. Results for opFRI, pfFRI are taken from Opron
et al [3]. GNM and NMA values are taken from the coarse-grained (Cα)
results reported in Park et al [4]. MWCG results are parameter free and
use all C, N, and O to predict Cα. MWCG results originally published in
Bramer et al [1].
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Table 7.2: Correlation coeﬃcients for B factor prediction obtained by optimal FRI
(opFRI), parameter free FRI (pfFRI) and Gaussian normal mode (GNM)
for medium-size structures. Results for opFRI, pfFRI are taken from Opron
et al [3]. GNM and NMA values are taken from the coarse-grained (Cα)
results reported in Park et al [4]. MWCG results are parameter free and
use all C, N, and O to predict Cα. MWCG results originally published in
Bramer et al [1].
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

54

67

75

83

92

98

99

viii

Table 7.3: Correlation coeﬃcients for B factor prediction obtained by optimal FRI
(opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM)
for large-size structures. Results for opFRI, pfFRI are taken from Opron
et al [3]. GNM and NMA values are taken from the coarse-grained (Cα)
results reported in Park et al [4]. MWCG results are parameter free and
use all C, N, and O to predict Cα. MWCG results originally published in
Bramer et al [1].

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

Table 7.5: Average pearson correlation coeﬃcients for Cα B factor prediction with
FRI, GNM and NMA for three structure sets from Park et al.
[4] and a
superset of 364 structures. Results for opFRI, pfFRI are taken from Opron
et al [3]. GNM and NMA values are taken from the coarse-grained (Cα)
results reported in Park et al.
[4] MWCG results are parameter free and
use all C, N, and O to predict Cα. MWCG Results originally published in
Bramer et al [1].

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Table 7.6: Pearson Correlation coeﬃcients for Cα, non Cα carbon, nitrogen, oxygen,
and sulfur using parameter free MWCG. Only 215 of the 364 proteins
contain sulfur atoms. MWCG results originally published in Bramer et al
[1].

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

Table 7.7: CPU execution times, in seconds, from eﬃciency comparison between GNM

[3], RF, GBT, and CNN. Results originally reported in Bramer et al [2]

. 102

Table 7.8: Average Pearson correlation coeﬃcients (PCC) both of all heavy atom and
Cα only B factor predictions for small-, medium-, and large-sized protein
sets along with the entire superset of the 364 protein dataset. Predictions
of random forest (RF), gradient boosted tree (GBT), and convolutional
neural network (CNN) are obtained by leave-one-protein-out (blind), while
predictions of parameter-free ﬂexibility-rigidity index (pfFRI), Gaussian
network model (GNM) and normal mode analysis (NMA) were obtained via
the least squares ﬁtting of individual proteins. All machine learning models
use all heavy atom information for training. MWCG machine learning B
factor prediction results originally reported in Bramer et al [2].

. . . . . . 103

Table 7.9: Pearson correlation coeﬃcients for cross protein heavy atom blind MWCG
B factor prediction obtained by random forest (RF), boosted gradient
(GBT), and convolutional neural network (CNN) for the small-sized pro-
tein set. Results reported use heavy atoms in both training and prediction.
Originally published in Bramer et al [2].

. . . . . . . . . . . . . . . . . . 104

ix

Table 7.10: Pearson correlation coeﬃcients for cross protein heavy atom blind MWCG
B factor prediction obtained by random forest (RF), boosted gradient
(GBT), and convolutional neural network (CNN) for the medium-sized pro-
tein set. Results reported use heavy atoms in both training and prediction.
Originally published in Bramer et al [2].

. . . . . . . . . . . . . . . . . . . 105

Table 7.11: Pearson correlation coeﬃcients for cross protein heavy atom blind MWCG
B factor prediction obtained by random forest (RF), boosted gradient
(GBT), and convolutional neural network (CNN) for the large-sized pro-
tein set. Results reported use heavy atoms in both training and prediction.
Originally published in Bramer et al [2].

. . . . . . . . . . . . . . . . . . . 106

Table 7.13: ASPH and ESPH average Pearson correlation coeﬃcients Cα B factor pre-
dictions for small-, medium-, and large-sized protein sets along with the
entire superset of the 364 protein dataset. Gradient boosted tree (GBT),
convolutional neural network, and consensus(CON) results are obtained
by leave-one-protein-out (blind). Predictions of parameter-free ﬂexibility-
rigidity index (pfFRI), Gaussian network model (GNM) and normal mode
analysis (NMA) were obtained via the least squares ﬁtting of individual
proteins.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

Table 7.14: ASPH and ESPH Pearson correlation coeﬃcients for cross protein Cα atom
blind B factor prediction obtained by boosted gradient (GBT), convolu-
tional neural network (CNN), and consensus (CON) for the small-sized
protein set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

Table 7.15: ASPH and ESPH Pearson correlation coeﬃcients for cross protein Cα atom
blind B factor prediction obtained by boosted gradient (GBT), convolu-
tional neural network (CNN), and consensus (CON) for the medium-sized
protein set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

Table 7.16: ASPH and ESPH Pearson correlation coeﬃcients for cross protein Cα atom
blind B factor prediction obtained boosted gradient (GBT), convolutional
neural network (CNN), and consensus (CON) for the large-sized protein set.110

Table 7.17: ASPH and ESPH Pearson correlation coeﬃcients of least squares ﬁtting
Cα B factor prediction of small proteins using 11˚A cutoﬀ. Two Bottleneck
(B) and Wasserstein (W) metrics using various kernel choices are included. 111

Table 7.18: ASPH and ESPH Pearson correlation coeﬃcients of least squares ﬁtting Cα
B factor prediction of medium proteins using 11˚A cutoﬀ. Two Bottleneck
(B) and Wasserstein (W) metrics using various kernel choices are included. 112

x

Table 7.19: ASPH and ESPH Pearson correlation coeﬃcients of least squares ﬁtting
Cα B factor prediction of large proteins using 11˚A cutoﬀ. Two Bottleneck
(B) and Wasserstein (W) metrics using various kernel choices are included. 113

Table 7.20: ASPH and ESPH average Pearson correlation coeﬃcients of least squares
ﬁtting Cα B factor prediction of small, medium, large, and superset using
11˚A cutoﬀ. Two Bottleneck (B) and Wasserstein (W) metrics using various
kernel choices are included. Results for pFRI are taken from Opron et al[3].
GNM and NMA value are taken from the course grained Cα results reported
in Park et al[4].

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114

xi

LIST OF FIGURES

Figure 3.1: The average Pearson correlation coeﬃcient (PCC) as found by optimizing
individual kernels in the range of ηn = 1, . . . , 40. Parameter optimization
results originally published in Bramer et al [1]. . . . . . . . . . . . . . . .

Figure 4.1: From left to right an example of a 0-simplex, 1-simplex, 2-simplex, and
3-simplex.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
(a) An example of 5 points in R2 and (b) the corresponding topological
barcode. The length of each barcode corresponds to the persistence of
. . . . . . . .
each topological object (β0,β1,β2,etc..) over the ﬁltration.

Figure 4.2:

Figure 4.3:

Figure 4.4:

Illustration of Atom-speciﬁc persistent homology point clouds. Top: the
original point cloud. The atom of interest is at the center of the circle.
Second row: a pair of conjugated sets of point clouds for atom-speciﬁc
persistent homology. The rest: Four pairs of conjugated point clouds for
atom-speciﬁc and element-speciﬁc persistent homology.
. . . . . . . . .

Illustration of residue 338 Cα atom-speciﬁc persistent homology in the CC
element-speciﬁc point cloud of protein PDBID 1AIE. For this example
residues 332-339 are used and are shown on the left. The Cα location
used to generate the barcodes (right) is highlighted in red in the left
chart. Conjugated persistence barcodes are generated with and without
the selected Cα. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

20

24

28

30

33

Figure 5.1: An example of a perceptron, the basic functional unit of a neural network. 38

Figure 5.2: An illustration of a fully connected deep neural network. Circles repre-
sent neurons and connections between neurons are indicated by arrows.
Each connection has an associated weight. A neural network is considered
. . . . . . . . . . . . . . . . .
“deep” when it uses several hidden layers.

Figure 5.3: Frequency of the number of heavy elements from the 364 protein dataset.
Figure originally published in Bramer et al [2]. . . . . . . . . . . . . . . .

Figure 5.4:

Illustration of modiﬁed persistence diagrams used in distance calculations.
(a) Unchanged. (b) Rotated 30◦. (c) rotated 60◦. Black dots are Betti-0
. . . . . . . . . . . . . . . . . . .
events and triangles are Betti-1 events.

Figure 5.5: Average Pearson correlation coeﬃcient over the entire protein dataset ﬁt-
ting all 24 persistent homology features using various cutoﬀ distances. . .

Figure 5.6: The MWCG based deep convolutional neural network architecture used
for B factor prediction. The plus symbol represents the concatenation of
data sets. Figure originally published in Bramer et al [2]. . . . . . . . . .

38

41

45

47

50

xii

Figure 5.7: The deep learning architecture using a convolutional neural network com-
bined with a deep neural network to predict B factor using PH based
features. The plus symbol represents the concatenation of features.
. . .

Figure 6.1: Workﬂow for procedure in MWCG feature construction.

. . . . . . . . .

Figure 6.2: Workﬂow for procedure in ASPH and ESPH feature construction.

. . . .

Figure 6.3: Workﬂow for procedure MWCG, ASPH, and ESPH based machine learn-
ing B factor prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 7.1:

Figure 7.2:

Figure 7.3:

Figure 7.4:

Figure 7.5:

(a) VMD representation of PBD ID 1AIE. (b) Correlation maps for nitrogen-
nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 1AIE.
The thicker band along the main diagonal of (b) and (c) corresponds to
the alpha helix secondary structure in 1AIE. Figure originally published
in Bramer et al [1].
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

(a) VMD representation of PBD ID 1KGM. (b) Correlation maps for
nitrogen-nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for pro-
tein 1KGM. The bands perpendicular to the main diagonal of (b) and
(c) correspond to the anti parallel beta sheet present in 1KGM. Figure
. . . . . . . . . . . . . . . . . .
originally published in Bramer et al [1].

(a) VMD representation of PBD ID 5IIV. (b) Correlation maps for nitrogen-
nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 5IIV.
The presence of the two distinct thick bands along the main diagonal of
(b) and (c) corresponds to the two alpha helices present in 5IIV. The
oﬀ diagonal bands correspond to the bonding interaction between alpha
. . . . . . . . . .
helices. Figure originally published in Bramer et al [1].

(a) A visual comparison of experimental B factors , (b) WCG predicted
B factors, (c) and GNM predicted B factors for the ribosomal protein
L14 (PDB ID:1WHI). (d) The experimental and predicted B factor values
plotted per residue. GNM represents predicted B factors using GNM with
a cutoﬀ distance of 7 ˚A. WCG is parametrized using CC, CN, CO kernels
of the exponential type with ﬁxed parameters κ = 1, and η = 3 ˚A. Figure
originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . .

(a) The structure of calmodulin (PDB ID: 1CLL) visualized in Visual
Molecular Dynamics (VMD)18 and colored by experimental B factors,
(b) MWCG predicted B factors, (c) WCG predicted B factors, (d) and
GNM predicted B factors with red representing the most ﬂexible regions.
Figure originally published in Bramer et al [1]. . . . . . . . . . . . . . . .

53

56

57

58

61

62

63

64

65

xiii

Figure 7.5:

(Continued) (e) The experimental (Exp) and predicted B factor values
plotted per residue for PDB ID 1CLL. The GNM is for the GNM method
with a cutoﬀ distance of 7 ˚A. We see that GNM clearly misses the ﬂexible
hinge region. WCG is parametrized using CC, CN, CO kernels of the
exponential type with ﬁxed parameters κ = 1, and η = 3 ˚A. MWCG
represents B factor predictions determined from the MWCG method using
the ﬁxed parameters listed in Table 3.2. Figure originally published in
Bramer et al [1].
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

Figure 7.6: A visual comparison of experimental B factors (a), WCG predicted B
factors (b), and GNM predicted B factors (c) for the engineered cyan
ﬂuorescent protein, mTFP1 (PDB ID:2HQK). (d) The experimental (Exp)
and predicted B factor values plotted per residue for PDB ID 2HQK. The
GNM is for the GNM method with a cutoﬀ distance of 7 ˚A. WCG is
parametrized using CC, CN, CO kernels of the exponential type with ﬁxed
parameters κ = 1, and η = 3 ˚A. Figure originally published in Bramer et
al [1].
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

97

Figure 7.7: CPU Eﬃciency comparison between GNM [3], RF, GBT, and CNN al-
gorithms for MWCG B factor prediction. Execution times in seconds
(s) versus number of residues. A set of 34 proteins, listed in Table 7.7,
were used to evaluate the computational complexity. Result originally
published in Bramer et al [2].

. . . . . . . . . . . . . . . . . . . . . . . . 101

Figure 7.8:

Individual feature importance for the MWCG random forest model aver-
aged over the data set. Reported feature selection includes the use heavy
atoms in the model. Figure originally published in Bramer et al [2].

. . . 103

Figure 7.9: Average feature importance for the MWCG random forest model with
the angle, secondary, MWCG, atom type, protein size, amino acid, and
packing density features aggregated. Reported feature selection includes
the use heavy atoms in the model. Figure originally published in Bramer
et al [2].

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

xiv

KEY TO ABBREVIATIONS

aFRI

Anisotropic Flexibility Rigidity Index

ANM

Anisotropic Network Model

ASPH

Atom Speciﬁc Persistent Homology

CNN

Convolutional Neural Network

DNN

Deep Neural Network

ESPH

Element Speciﬁc Persistent Homology

fFRI

Fast Flexibility Rigidity Index

FRI

Flexibility Rigidity Index

GBT

Gradient Boosting Trees

GNM

Gaussian Network Model

mFRI

Multiscale Flexibility Rigidity Index

MWCG Multiscale Weighted Colored Graph

NMA

Normal Mode Analysis

NMR

Nuclear Magnetic Resonance

MD

Molecular Dynamics

PDB

Protein Data Bank

PH

RF

Persistent Homology

Random Forest

WCG Weighted Colored Graph

xv

Chapter 1

Overview

X-ray crystallography is an impressive experimental tool that provides three dimensional

(3D) spatial coordinates and thermal ﬂuctuation data of atoms within a crystallized molecule

in the form of a PDB data ﬁle. Using data contained in a protein PDB ﬁle, one can validate

mathematical models to understand protein dynamics and ﬂexibility. The protein data

bank is massive, containing over 140,000 structures as of March 2019, with more structures

submitted annually.

Even with the solution of many protein structures, there is still an important need for

robust and accurate mathematical models. Many important classes of proteins are diﬃcult

to crystallize and some may even prove to be impossible. Protein crystallization diﬃculty

increases proportionally with the size of a protein. Highly ﬂexible proteins represent another

class of proteins that are diﬃcult to crystallize due to their resistance in forming a crystal

lattice structure. Other examples of proteins which are diﬃcult to crystallize include small

heat shock proteins, transmembrane and membrane proteins, and intrinsically disordered

proteins. Heat shock proteins are an important class of proteins related to cardiovascular

function, immunity, and cancer. Transmembrane and Membrane proteins are targets for

the majority of modern drugs. Intrinsically disordered proteins are also vitally important

to understand as they have been implicated in a number of diseases such as Bovine Spongi-

form Encephalopathy (mad cow disease), Creutzfeldt-Jakob disease, Alzheimer’s disease, and

1

Parkinson’s disease.

In this work new and eﬃcient methods for protein analysis are introduced that improve

upon existing methods in several ways. These methods are the ﬁrst protein B factor pre-

diction methods to incorporate additional protein information from non-Cα atoms in the

form of element speciﬁc interaction pairs. Moreover, this work introduces methods that are

entirely new to B factor prediction. These methods are capable of successful cross protein

B factor prediction using only information from other proteins. The methods presented use

advanced graph theory based techniques, machine learning algorithms, and the ﬁrst known

topological data analysis based persistent homology method, to successfully analyze protein

ﬂexibility and dynamics. Lastly, the methods provide the best predictive results, to date, for

both protein B factor prediction within a protein and cross protein B factor prediction. The

results are validated through extensive testing on a large and diverse set of proteins. Using

these methods many protein analysis tools can be constructed. In addition to the protein B

factor prediction, several applications of these methods are provided in this work. Examples

include hinge detection, element speciﬁc protein correlation maps, and protein model relative

feature importance ranking.

This work ﬁrst introduces an eﬃcient and accurate advanced graph theory based mul-

tiscale weighted colored graph (MWCG) method for analyzing protein ﬂexibility and dy-

namics. The weighted colored graph (WCG) theory is based on the hypothesis that the

most fundamental properties of proteins are determined by the geometric structure of the

protein. The WCG method does not require costly matrix diagonalization like other com-

monly used methods such as Normal Mode Analysis (NMA) and Gaussian Network Model

(GNM). Given a protein of N atoms, the computational complexity of the WCG method is
approximately O(N 2) whereas methods like Normal Mode Analysis and Gaussian Network

2

Model are O(N 3) due to the fact that they require diagonalization of a large matrix.

Next a multiscale formulation of the WCG method is introduced to incorporate the

multiscale interactions that occur within a protein into the model. Protein interactions take

place over a variety of diﬀerent scales, so any reliable model should take this property into

account. To reduce computational complexity, most elastic network models include a pre-

deﬁned cutoﬀ distance. However, the computational cost saved by using a cutoﬀ in ENM

incurs a cost in the overall accuracy of such models. By prescribing a distance based cutoﬀ

these models fail to capture protein interactions that take place across multiple characteristic

length scales. The MWCG model was developed to capture the multiscale behavior of protein

interactions. To capture various interaction scales within a protein the MWCGs used in this

work were parameterized using three correlation kernels parameterized at diﬀerent length

scales. However the method is general and adaptable, so the number of correlation kernels

can be adjusted to ﬁt the users performance needs.

To test the eﬃcacy of the WCG approach, the method is tested on a set of over 300 protein

structures taken from X-ray crystallography data provided by the Protein Data Bank. The

accuracy of B factor prediction between MWCG is compared to the most commonly used

approaches, parameter-free FRI (pfFRI), NMA, and GNM. Averaged over a large and diverse

set of over 300 proteins the results demonstrate a signiﬁcant improvement. Averaged over

the entire protein test set, the MWCG method is over 28% more accurate than the best

previous method opFRI and 42% more accurate than GNM. To further demonstrate the

utility of the WCG method, applications such as element speciﬁc protein heat maps and

hinge detection visualizations are included.

Accurate identiﬁcation of hinge regions and hinge motion is an important topic that has

been highly studied[5, 6, 7, 8, 9]. Hinge residue detection is integral for molecules that are

3

too large for MD simulation over meaningful time scales. In the past, methods such as GNN

and NMA have been used to detect hinges for proteins where MD is intractable. This work

compares the ability of GNM, WCG, and MWCG methods to identify the hinge regions

of several proteins. The work demonstrates several instances where WCG and MWCG

accurately identify hinge regions where GNM at the same time fails to do so. This highlights

the overall eﬃcacy of this method and the multiscale behavior captured by MWCG.

Element speciﬁc correlation maps provide a new way to visualize secondary and tertiary

protein structure using a two dimensional (2D) image where ﬂexibility is represented by the

color of each pixel of the image. These correlation maps have been introduced in the past

for Cα atoms [3]. In this work we introduce more general element speciﬁc correlation maps.

Examples of nitrogen-nitrogen and oxygen-oxygen element interaction correlation maps are

provided for several proteins. This demonstrates the adaptability of the WCG and MWCG

methods presented here. The provided examples clearly demonstrate important secondary

structures such as alpha helix and beta sheets as well as their primary and secondary inter-

actions.

Previous protein B factor prediction methods are not capable of accurate prediction of

B factor across proteins. The MWCG method, along with other engineered features are

used to create machine learning based B factor prediction models. The model captures

various interaction scales within an individual protein. To capture distinctions between

proteins other global features such as protein resolution are included as feature inputs. The

machine learning algorithms used in this work are trained using nine MWCG kernels with

various parameterizations. Other local and global features are also included to improve the

robustness of the feature set. The algorithms were trained using leave-one-protein-out cross

validation, where the algorithm trains on all protein data except the protein of interest,

4

then the test set is taken to be the protein of interest. Extensive numerical testing indicates

that the MWCG cross B factor predictions obtained are more accurate than any B factor

prediction using existing traditional methods. The approach introduced here is particularly

notable because it accurately predicts cross-protein B factors.

In recent years topological data analysis (TDA) has been successfully applied to protein

analysis in a variety of areas. The basic idea of TDA is to use tools from topology to analyze

high dimensional datasets that may be noisy or incomplete. Techniques from TDA reduce

the dimensionality of the dataset, and allow the user their choice of metric. These techniques

are a good ﬁt for protein analysis, where one wants to infer high dimensional structure from

low dimensional representations, capture multiple scales, and assemble discrete point data

into a global structure. The point cloud of 3D spatial coordinates provided for proteins in

the protein databank PDB ﬁles can be converted into a family of simplicial complexes. These

simplicial complexes are indexed by a proximity parameter. Then, converting the dataset

into global topological objects, tools from algebraic topology can be applied for protein

analysis.

Persistent homology theory allows the persistent homology of a ﬁltered simplicial com-

plex to be uniquely represented with a barcode. In this work protein data is encoded into a

barcode by taking a ﬁltration over simplicial complexes that have been constructed from ele-

ment speciﬁc protein spatial data. The protein barcodes provide global invariant topological

features of the protein. By comparing two related barcodes for each atom of interest, this

technique can be used to predict local atomic ﬂexibility. The two barcodes are constructed

such that one barcode is constructed using a point cloud that includes the atom of interest,

and another is constructed using the same point cloud but without the atom of interest. The

similarity or diﬀerence between barcodes is compared using bottleneck or wasserstein met-

5

rics. This provides atomic speciﬁc persistent homology protein ﬂexibility analysis. Including

various element interaction pairs one may also generate element speciﬁc persistent homology

(ESPH) to capture element speciﬁc interactions. To the author’s knowledge no previous

protein ﬂexibility models have used persistent homology to predict B factors of atoms in a

protein in this way.

In this work ASPH and ESPH features are generated for each Cα of a given protein.

However, this method is a general framework that can be applied to any element in a protein,

including hydrogen. The method allows for several parameterizations that can be tuned by

the user. To validate this approach several PH features are generated and used to ﬁt B

factor prediction models using linear least squares ﬁtting. Later, the features are used with

machine learning techniques. Both cases are validated using a large and diverse data set

of proteins from the protein data bank. These results provide good predictions that are

comparable to the aforementioned MWCG results.

6

Chapter 2

Background

Currently Nuclear Magnetic Resonance (NMR) Spectroscopy and X-ray crystallography are

the two major experimental techniques used for protein dynamics and ﬂexibility analysis.

Techniques for NMR were previously very challenging but are now becoming more routine.

At a basic level, NMR works by mapping the magnitude or intensity of magnetic resonance

signals as a magnetic ﬁeld is applied to a protein sample. X-ray crystallography determines

protein structure by measuring the diﬀraction patterns of an intense beam of X-rays of a

crystallized protein. The crystal is rotated many times and a with each rotation a new set of

diﬀraction patterns is collected. After tens of thousands of rotations, the data is combined

and computationally processed into a ﬁnal atomic arrangement known as the protein crystal

structure.

At the time of this dissertation, over 90% of the protein data bank (PDB) ﬁles have

been solved using X-ray crystallography while less than 10% have been solved using NMR.

Unlike X-ray crystallography, NMR results do not provide atomic ﬂexibility information. In

contrast, X-ray crystallography data includes ﬂexibility information in the form of atomic

B factor (temperature factor, B value, or Debye-Waller factor), which is a measurement of

the X-ray scattering of atoms or groups of atoms in a protein. Atomic B factor has been

observed to correlate with atomic ﬂexibility from Molecular dynamics (MD) and Normal

mode analysis (NMA) experiments thus it provides a good experimental gold standard to

7

compare theoretical methods.

2.1 Computing Protein Flexibility and Dynamics

Many methods exist for studying protein structure and function; however, there is room for

substantial improvement. Algorithms which require X-ray crystallography are limited by

the availability of previously crystallized proteins. Surely the protein databank will continue

to grow as scientists crystallize proteins with ever increasing eﬃciency. However, for many

types of proteins, crystallization is very diﬃcult or impossible. This calls for new approaches

to theoretical protein analysis.

MD simulation is one method for protein analysis that has made a serious contribution

to our understanding of the conformational landscapes of proteins. It has been particularly

helpful in understanding proteins that are diﬃcult to study experimentally such as amyloid

ﬁbrils, intrinsically disordered proteins, and partially disordered proteins. Even so, the dy-

namics of large proteins generally takes place over long time scales that are inaccessible to

modern MD simulations. MD simulations are computationally intractable for larger macro-

molecules and in systems of multiple molecules as the time scales required are unreasonable

for current technology. As such MD continues to be limited to systems of low complexity

due to the methods high degree of freedom.

To address the limitations of time-dependant MD approaches several time independent

approaches to protein dynamics and ﬂexibility analysis have been developed. NMA was one

of the ﬁrst successful time-independent methods used for protein analysis[10, 11, 12, 13, 14].

NMA achieves time-independence by adopting an interaction Hamiltonian based on pro-

tein molecular mechanics. In this approach bond lengths and angles are ﬁxed, and NMA

8

is computed by the diagonalization of a Hamiltonian on an energy minimized structure.

Normal modes are the orthogonal resonant patterns of the molecular mechanic system. A

superposition of the normal modes provides the collective motion of the protein. Low fre-

quency modes correspond to cooperative motions and are meaningful in applications like

hinge detection and MD where slow, collective motion is relevant. The transition pathways

of macromolecules are also highly correlated with the low-frequency modes of NMA[14].

NMA provides good coarse grained deformation motion of supramolecular complexes. The

success of NMA has resulted in several related methods that improve the computational cost

and quality of the generated results.

The elastic network model (ENM) was proposed in 1996 as a simpliﬁed NMA approach[15].

The ENM is based on a statistical mechanics approach where a molecule is treated as a sys-

tem of N nodes with each node corresponding to an atom or residue within the molecular

network[16]. This approach provides good prediction of global motions but does not re-

liably predict local motion and requires costly diagonalization of the large corresponding

Hessian matrix. The Anisotropic network model (ANM) was model introduced using the

ENM framework to account for 3D directionality. The ANM uses a spring network with a
simple spring potential between Cα atoms[17]. Given N atoms, ANM requires a 3N × 3N

matrix Diagonalization of the resulting Hessian. This provides the modes of the system

that correspond to cooperative motions. Lower eigenvalue and eigenvectors can be used to

estimate protein ﬂexibility.

In ANM all springs use the same force constant. The ANM

provides good insight into the protein dynamics at a lower computational cost than other

normal mode analysis based methods.

The Gaussian network model (GNM) is a related ENM developed around the same time

as ANM that provides a good course grained, isotropic, low cost approach[18, 19]. In GNM

9

the Hessian is replace by a Kirchoﬀ matrix. The diagonalization of the Kirchoﬀ matrix gives

rise to eigenmodes and eigenvalues for describing protein ﬂuctuations that correspond to B

factors. GNM is both accurate and eﬃcient compared to other previous approaches[20].

To bypass costly large matrix diagonalization the ﬂexibility-rigidity index (FRI) was

more recently introduced[21, 3, 22]. FRI is a mathematical method based on geometric

graphs, that makes use of protein graph connectivity and node centrality to analyze protein

ﬂexibility. The method is based on the hypothesis that protein interactions and protein

structure are inextricably linked in a given environment. That is, protein ﬂexibility and

function are determined by protein structure and environment. Since the FRI approach

is not based on molecular mechanics it does not require a protein interaction Hamiltonian

like those used in spectral graph theory, to analyze protein ﬂexibility. The FRI approach

works well as long as the accurate structure of the protein and its environment is known. As

such FRI is restricted to proteins with solved 3D X-ray crystal structures. The FRI method

provided a signiﬁcant improvement in computational speed compared to previous protein
analysis methods. The ﬁrst FRI method [21] is of computational complexity O(N 2) [21].

Later fast FRI (fFRI) [3] was introduced to reduce computational cost further. The fFRI
method is of computational complexity O(N ). Anisotropic FRI (aFRI) [3] and generalized

FRI (gFRI) [23] have also since been developed. To capture the multiscale interactions seen

in macromolecules the multiscale FRI (mFRI) method was introduced[24]. Compared to

GNM, the mFRI algorithm was shown to be approximately 20%, more accurate averaged

over a large and diverse set proteins [24]. The fFRI algorithm was shown to be signiﬁcantly

faster than GNM[3]. Generalized GNM (gGNM), generalized ANM (gANM), multiscale

GNM (mGNM), and multiscale ANM (mANM) methods have been recently constructed

using FRI matrices [25]. These generalized algorithms provide major improvements to the

10

accuracy of original algorithms for protein ﬂexibility analysis. A summary of when the

diﬀerent approaches to protein ﬂexibility and dynamics were ﬁrst introduced is provided in

Table 2.1.

Table 2.1: Notable molecular mechanic techniques and the year of introduction.

Molecular Mechanics Technique
Molecular Dynamics (MD)
Normal Mode Analysis (NMA)
Elastic Network Model (ENM)
Gaussian Network Model (GNM)
Anisotropic Network Model (ANM)
Flexibility Rigidity Index (FRI)

Year of Introduction
1977[26]
1982[11]
1996[15]
1996[18]
2001[17]
2014[3]

While the previous methods provide good results, there is still room for signiﬁcant im-

provement. The average pearson correlation coeﬃcient of the B factor predictions of the

aforementioned methods is generally below 0.7. Knowing the importance of protein ﬂexi-

bility analysis, it is crucial to improve these results. Moreover the above methods do not

provide satisfactory results when predicting cross protein B factor. Given the the many

classes of proteins with no X-ray crystal structure this is an important problem with no

existing reliable solutions.

2.2 Data

Two data sets are used for testing and validation in this work: one from Refs.

[3, 24] and

the other from Park, Jernigan, and Wu [4]. The ﬁrst data set contains 364 proteins [3, 24],

and the second contains 3 subsets of small, medium, and large sized proteins [4]. All protein

PDB structures have a resolution of 3 ˚A or higher and an average resolution of 1.3 ˚A. The

PDB data sets include proteins that range in size from 4 to 3912 residues [4]. This work

excludes protein 1AGN due to known data issues. Proteins 1NKO, 2OCT, and 3FVA are

11

also excluded as these proteins have PDB ﬁles with residues whose B factors are reported

as zero which is nonphysical. For all machine learning results provided in this work, the

STRIDE software is unable to provide the required secondary features for proteins 1OB4,

1OB7, 2OLX, and 3MD5 so these also excluded.

12

Chapter 3

Multiscale Weighted Colored Graphs

3.1 Weighted colored graphs

For this approach, each protein is considered to be a network in the form of a mathematical

graph. That is, a protein a network where atoms represent nodes or vertices of the graph and

edges are weighted connections between nodes that are determined by a distance based radial

function. Colored graphs are constructed based on heavy element (carbon, nitrogen, oxygen,

sulfur) interaction pairs. Provided it is available, one may even include hydrogen atoms.

Hydrogen atoms have a high degree of uncertainty, and cannot be accurately measured by

X-ray crystallography so we exclude them from this work. A graph is denoted as G(V, E)

where V represents a set of nodes called vertices and E the set of edges of the graph that

relate vertices pairwise. This work deﬁnes a protein network to be a graph whose nodes

and edges have speciﬁc attributes corresponding to the protein.

In particular, individual

atoms correspond to graph nodes, and the edges to a distance based correlation metric. This

approach makes sense from a biophysical point of view since interaction strength is inversely

proportional to distance. Further, many existing B factor prediction methods use three-

dimensional (3D) networks of spatial atomic coordinate data from the protein databank.

The most basic component of this method is a weighted colored graph. A WCG converts

3D geometric protein spatial information, provided as atomic coordinates by a PDB data

13

ﬁle, into a protein connectivity network. All existing previous methods only take Cα atoms

into consideration when constructing graph theoretic approaches. However, in this work all

N atoms in a protein are considered. Given the colored graph G(V, E), the ith atom is

labeled by its element type αj and position rj and thus

V = {(rj, αj)|rj ∈ IR3; αj ∈ C; j = 1, 2, . . . , N},

where C ={C, N, O, S } is the set containing the chosen element types of interest in a
protein. The set of edges, P, in a colored protein graph is deﬁned to be the set of all element
speciﬁc pairs of C. This choice of C results in 16 element directed interaction pairs. Table
3.1 illustrates the 16 possible element interaction pairs. For this work P is deﬁned to be

Table 3.1: Element pair combinations used in weighted colored graph.

C

N

O

S

C CC CN CO CS

N NC NN NO NS

O OC ON OO OS

S

SC SN SO SS

P = {CC, CN, CO, CS, NC, NN, NO, NS, OC, ON, OO, OS, SC, SN, SO, SS}.

For example, the subset P3 ={CO} contains all directed CO pairs in the protein such that

the ﬁrst atom is a carbon and the second one is a oxygen. Mathematically, E is the set of

weighted directed edges describing the potential interaction pairs of atoms given by

E =(cid:8)Φk(||ri − rj||; ηij)(cid:12)(cid:12)(αiαj) ∈ Pk; k = 1, 2, . . . , 16; i, j = 1, 2, . . . , N(cid:9),

(3.1)

14

where ||ri − rj|| is deﬁned to be the Euclidean distance between the ith and jth atoms, ηij

a characteristic distance between the atoms, and (αiαj) a directed pair of element types. In

this work Φk is a correlation function with the following properties [3]

Φk(||ri − rj||; ηij) = 1, as ||ri − rj|| → 0

(αiαj) ∈ Pk,

Φk(||ri − rj||; ηij) = 0 as ||ri − rj|| → ∞,

(αiαj) ∈ Pk.

(3.2)

(3.3)

Previous work by Opron et al[3] has shown that generalized exponential functions of the

form,

Φk(||ri − rj||; ηij) = e

−(||ri−rj||/ηij )κ

(αiαj) ∈ Pk;

,

κ > 0,

(3.4)

and generalized Lorentz functions of the form,

Φk(||ri − rj||; ηij) =

1

1 + (||ri − rj||/ηij)ν ,

(αiαj) ∈ Pk;

ν > 0,

(3.5)

are good choices for correlation functions that satisfy the above properties.

3.2 WCG Centrality

Given a graph, centrality provides a measure of the importance of a node. Centrality is an

important concept in graph theory that has a wide variety of applications including social

network analysis, identiﬁcation of critical genes, traﬃc ﬂows, and epidemics[27, 28, 29]. There

are several types of centrality measures. For example, the normalized closeness centrality

[30] of node ri is deﬁned as

1(cid:80)
j ||ri − rj||

15

and the Harmonic centrality [31] of node ri in a connected graph is deﬁned as

1

||ri − rj||.

(cid:88)

j

In this work the notion of Harmonic centrality is extended to subgraphs with weighted edges

deﬁned by generalized correlation functions. The generalized centrality metric used in this

work is deﬁned as

µk
i =

N(cid:88)

j=1

wijΦk(||ri − rj||; ηij),

(αiαj) ∈ Pk,

∀i = 1, 2, . . . , N,

(3.6)

where wij is a weight function related to the element type. The WCG centrality in Equation

(3.6) provides the atom speciﬁc rigidity index of the ith atom. This is a measure of the

stiﬀness of the ith atom that corresponds to the kth set of contact atoms.

3.3 Weighted Colored Graph Flexibility Analysis

Given a rigidity index, its reciprocal function provides a corresponding measure of ﬂexibility,

or ﬂexibility index. Thus the general ﬂexibility index on subgraphs is given by

f k
i =

,

1
µk
i

(αiαj) ∈ Pk,

∀i = 1, 2, . . . , N.

(3.7)

Previous work by Ngyuen et al shows that other ﬂexibility index forms work equally as well

[23]. At each atom, the ﬂexibility index corresponds to temperature ﬂuctuation. Thus we

16

can model the B factor of the ith atom as

(cid:88)

Bt

i =

ckf k

i + b,

∀i = 1, 2, . . . , N,

(3.8)

k

where Bt

i represents the theoretically predicted B factor of the ith atom. The coeﬃcients ck

and b are determined by minimizing the linear system given by

(cid:26) N(cid:88)

i=1

(cid:12)(cid:12)(cid:12)(cid:12)Bt

min
ck,b

i − Be

i

(cid:12)(cid:12)(cid:12)(cid:12)2(cid:27)

,

(3.9)

where Be

i is the experimentally measured B factor of the ith atom.

3.4 Multiscale Weighted Colored Graph Flexibility Anal-

ysis

Macromolecular interactions consist of a complex interplay of short, medium, and long range

interactions. Covalent bonds dominate short-range type interactions. Medium-range interac-

tions consist mainly of hydrogen bonds, electrostatics and van der Waals interactions. Lastly,

hydrophobicity is the main contributor to long-range molecular interactions. As such, a pro-

tein’s ﬂexibility is inherently connected to multiple characteristic length scales. This work

proposes multiscale weighted colored graphs to characterize the multiscale interactions that

exist within a protein. The ﬂexibility of ith atom at nth scale corresponding to the kth set

of interaction atoms is given by

(cid:80)N

f k,n
i =

1

j=1 wn

ijΦk(||ri − rj||; ηn
ij)

(αiαj) ∈ Pk,

,

(3.10)

17

where wn

ij is an atomic type dependent parameter, Φk(||ri − rj||; ηn

ij) a correlation kernel,

and ηn

ij a scale parameter. Minimization takes the form

(cid:26)(cid:88)

i

(cid:12)(cid:12)(cid:12)(cid:12)(cid:88)

k,n

(cid:12)(cid:12)(cid:12)(cid:12)2(cid:27)

,

(3.11)

min
cn
,b
k

i + b − Be
k f k,n
cn

i

where Be

i are experimental B factors. In this work we construct three correlation kernels us-

ing two generalized Lorentz kernels and a generalized exponential kernel to capture multiple

length scales. The method provided here is made parameter free by choosing appropriate

values for η, ν, and κ.

Sulfur atoms play an important role in proteins but they are also very sparse in proteins.

As such, this work provides some results using sulfur atoms but for most of the testing

provided sulfur atoms are excluded as they have a negligible overall eﬀect on the model.
Thus, unless otherwise noted this works considers the following subset of P for the lion’s

share of computations.

ˆP =(cid:8)CC, CN, CO, NC, NN, NO, OC, ON, OO(cid:9).

(3.12)

This work chooses to focus on C, N, and O due to their high occurrence in proteins and im-

portant biological relevance. However, it should be noted that the general method presented

here can be adapted to include any element the user chooses. For WCG calculations of B

factor predictions all possible element pairs, SC, SN, SO, and SS are considered.

This method is unique compared to other B factor prediction methods. The WCG method

considers not only Cα interactions but the eﬀects of interactions between nitrogen, oxygen,

and other non-Cα carbon atoms. For this work, three element speciﬁc correlation kernels

18

are constructed for all carbon-carbon (CC), carbon-nitrogen (CN), and carbon-oxygen (CO)

interactions within a protein. To capture multiscale interactions this work also includes three

diﬀerent scale parameterizations for each kernel. In total this generates 9 correlation kernels

to characterize element speciﬁc multiscale protein interactions in terms of their corresponding

graph centralities and atomic ﬂexibility. The result of this method can be used directly, ﬁtted

using linear least squares, or as a machine learning feature. Previously existing methods such

as mFRI, GNM, and NMA fail to take into account the element speciﬁc interactions that

the WCG method presented here captures. Since this method provides a general framework

for any element, in addition to carbon, WCG can also be used to predict the B factor of any

heavy element.

3.5 Parameterization

In this work a total of 9 unique correlation kernels are used based on the CC, CN, and CO

element speciﬁc correlation kernels described in Eq.

(3.10). For simpliﬁcation purposes,

all B factor prediction computed in this work through ﬁtting and machine learning uses

wij = wn

ij = 1 and ηn

ij = ηn.

A basic grid search over the 364 dataset determined the near optimal parameters for
MWCG based Cα B factor predictions. Three kernels are used with ν = {1, 3} for Lorentz

kernels and κ = 1 for the Exponential kernel, respectively. To improve the eﬃciency, a radial

cutoﬀ distance may be used. However, the WCG ﬁtting and MWCG based machine learning

results presented in this work do not use a cutoﬀ.

The ﬁrst kernel considered is a Lorentz function, and its near optimal η1 was found to

be η1 = 16 as shown in Fig. 3.1. Then, ﬁxing η1 = 16, a parameter grid search is used

19

Figure 3.1: The average Pearson correlation coeﬃcient (PCC) as found by optimizing in-
dividual kernels in the range of ηn = 1, . . . , 40. Parameter optimization results originally
published in Bramer et al [1].

Table 3.2: Parameters used for correlation kernels in a parameter-free MWCG. Parameter
optimization results originally published in Bramer et al [1].

Kernel Type
Lorentz (n = 1)
Lorentz (n = 2)
Exponential (n = 3)

κ ηn
16
-
-
2
31
1

ν
3
1
-

to determine optimal η2 for a second Lorentz kernel. The second Lorentz kernel was found

to provide optimal predictions for η2 = 2 as shown in Fig. 3.1. Lastly, ﬁxing η1 = 16 and

η2 = 2, a parameter search is used to determine optimal values for η3 used in an exponential

kernel. Given the ﬁxed parameters of the Lorentz kernel the average Pearson correlation

coeﬃcient (PCC) does not decay even for very large values of η3 as indicated in Fig. 3.1.

Given the multiscale nature of these three parameters this behavior is reasonable. With

only a single kernel, the strongest interactions, which provide good approximations, can
be obtained for 12 ≤ η ≤ 17. To capture close range interactions, the second η provides

the best results for small values. The large values seen in the third η appear to capture

20

large scale interactions. This result corresponds to the dominance of these length dependent

interaction types. Because it decays so quickly, the exponential kernel is used to capture

large scale interaction eﬀect. Of course large η values are very costly due to the structure of

the kernel. So for η3 a value of 31 is used in the testing published in Bramer et al [1, 2] for

the parameter-free MWCG method as listed in Table 3.2.

21

Chapter 4

Atom Speciﬁc Persistent Homology

4.1 Overview

Most existing protein analysis methods are structure or geometry based models. Many of

these models struggle with the high dimensional space of protein data. Put another way, any

model that is too ﬁne grained will inherently fail in the high dimensional protein data space

due to the associated computational complexity. The study of topology provides the con-

nectivity of components, and characterizes independent entities, rings, and high dimensional

topological faces within a space. Applied to proteins, topology provides a powerful tool

for analysis of several important biological processes. Examples include hot spot detection,

assembly/disassembly of virus capsids, ligand binding state, ion channel state, and protein

folding[32, 33, 34, 35, 36, 37, 38, 39]. Topology provides a high level of abstraction and in

its purely mathematical form is free of metrics of coordinates which can be problematic for

the study of biological macro-molecules. Topological data analysis allows the extraction of

invariant features that are embedded in the high dimensional data space of biomolecules.

Persistent homology is one component of TDA that provides useful bridge between the high

dimensional protein data space and the abstract low dimensional topological analysis of the

protein data space. PH embeds multiscale geometric information into topological invariants,

this works well for the aforementioned examples but oversimpliﬁes the atomic properties of

22

Table 4.1: Topological invariants displayed as Betti numbers. Betti-0 represents the number
of connected components, Betti-1 the number of tunnels or circles, and Betti-2 the number
of cavities or voids. Two auxiliary rings are added to the torus to illustrate that Betti-1=2.

Example
Betti-0
Betti-1
Betti-2

Point

Circle

Sphere

Torus

1
0
0

1
1
0

1
0
1

1
2
1

macro-molecules making it challenging to use directly for atomic level analysis. In this work

we provide a new approach that uses techniques from topological data analysis to provide

element speciﬁc protein analysis at atomic resolution.

To apply TDA techniques, data must ﬁrst be described as a simplicial complex or a graph

network. Speciﬁcally, simplicial homology is concerned with the identiﬁcation of topological

invariants from a set of discrete nodes such as the atomic coordinates of a protein. Given a

point cloud, Betti numbers describe the topological variants of connected components, rings,

and cavities. Table 4.1 provides examples of the Betti-0, Betti-1, and Betti-2 numbers of a

point, circle, sphere, and torus.

To determine topological invariants, a simplicial complex, such as Vietoris-Rips (VR)

complex, ˇCech complex, or an alpha complex is constructed using a ﬁxed ﬁltration parameter.

The simplicial complex is made up of vertices, edges, triangles, and tetrahedrons, denoted

0-simplex, 1-simplex, 2-simplex, and 3-simplex respectively. Basic examples are provided in

Figure 4.1. By varying the ﬁltration parameter over an interval a persistence diagram can be

generated from a simplicial complex. A persistence diagram, or barcode, provides the birth

and death (appearance and cessation) of Betti features for each node. The diﬀerence between

23

Figure 4.1: From left to right an example of a 0-simplex, 1-simplex, 2-simplex, and 3-simplex.

two persistence diagrams can be compared using Bottleneck and Wasserstein distances.

The main idea of atom-speciﬁc persistent homology and element-speciﬁc persistent ho-

mology is to extract atomic molecular information using global persistent homology tech-

niques. To generate an atom-speciﬁc description using a global topological description we

construct a pair of conjugated point clouds for each atom of interest. One point cloud is

centered about the original atom of interest and all nearby atoms within a prescribed radial

cutoﬀ. The conjugate point cloud consists of the same point cloud minus the atom of interest.

Then for each atom of interest, Bottleneck and Wasserstein distances are computed between

the corresponding conjugate pairs which provides the desired topological information of each

atom.

4.2 Simplex & Simplicial Complex

A simplex is a generalization of a triangle or tetrahedron to arbitrary dimensions. A k-

simplex is a convex hull of k + 1 vertices represented by a set of aﬃnely independent points

σ = {λ0u0 + λ1u1 + . . . + λkuk |(cid:88)

λi = 1, λi ≥ 0, i = 0, 1, . . . , k},

(4.1)

where {u0, u1, . . . , uk} ⊂ Rk is the set of points, σ is the k-simplex, and constraints on λi’s

ensure the formation of a convex hull. A convex combination of points can have at most

24

k + 1 points in Rk. For example a 1-simplex is a line segment, a 2-simplex a triangle, and a

3-simplex a tetrahedron. A subset of the k + 1 vertices of a k simplex with m + 1 vertices

forms a convex hull in a lower dimension and is called an m-face of the k-simplex. An m-face

is proper for m < k. The boundary of a k-simplex σ, is deﬁned as the formal sum of its
(k − 1) faces. Given as

k(cid:88)

∂kσ =

[u0, . . . , ˆui, . . . , uk]k(−1)i[u0, . . . , ˆui, . . . , uk]k,

(4.2)

i=0

where [u0, . . . , ˆui, . . . , uk] denotes the convex hull formed by vertices of σ with the vertex

ui excluded and ∂k is called the boundary operator. A collection of ﬁnitely many simplicies
forms a simplicial complex denoted by K. All simplicial complexes satisfy the following

conditions.

1. Faces of any simplex in K are also simplices in K.

2. The intersection of any two simplicies σ1, σ2 ∈ K is a face of both σ1 and σ2.

4.3 Homology

Given a simplicial complex K, a k-chain ck of K is a formal sum of the k-simplices in K with

k no greater than dimension of K and is deﬁned as ck =(cid:80) aiσi where σi are the k-simplices

and ai’s coeﬃcients. Generally, ai can be in any ﬁeld such as R, Q, or Z. Here we choose
ai to be in Z2 for simplicity. Let the group of k-chains in K be denoted by Ck. Then
(Ck, Z2) forms an Abelian group under addition in modulo two. This allows us to extend

the deﬁnition of the boundary operator introduced in Equation 4.2 to chains.

25

The boundary operator applied to a k-chain ck is deﬁned as

(cid:88)

∂kck =

ai∂kσi,

(4.3)

where σi’s are k-simplices. The boundary operator is a map from Ck to Ck−1, which is also

known as a boundary map for chains. Note that operator ∂k satisﬁes the property that
∂k ◦ ∂k+1σ = 0 for any (k + 1)-simplex σ following the fact that any (k − 1)-face of σ is

contained in exactly two k-faces of σ. The chain complex is deﬁned as a sequence of chains

connected by boundary maps with decreasing dimension and is denoted

. . . → Cn(K)

∂n−−→ Cn−1(K)

∂n−1−−−−→ . . .

∂1−→ C0(K)

∂0−→ 0.

(4.4)

The k-cycle group and k-boundary group are then deﬁned as kernel and image of ∂k and

∂k+1 respectively, and

Zk = Ker∂k = {c ∈ Ck | ∂kc = 0},
Bk = Im∂k = {∂kc | c ∈ Ck},

(4.5)

(4.6)

where Zk is the k-cycle group and Bk is the k-boundary group. Since ∂k ◦ ∂k+1 = 0, we
have Bk ⊂ Zk ⊂ Ck. Then the k-homology group is deﬁned to be the quotient group of the

k-cycle group modulo the k-boundary group,

Hk = Zk/Bk

(4.7)

where Hk is the k-homology group. The kth Betti number is deﬁned to be rank of the

26

k-homology group as βk = rank(Hk).

4.4 Filtration & Persistence

For a simplicial complex K, we deﬁne a ﬁltration of K as a nested sequence of sub-complexes
of K,

∅ ⊆ K0 ⊆ K1 . . . ⊆ Kn = K

(4.8)

In persistent homology, the nested sequence of sub-complexes usually depends on a ﬁl-

tration parameter. The persistence of a topological feature is denoted graphically by its life

span with respect to ﬁltration parameter. Sub-complexes corresponding to various ﬁltration

parameters oﬀer the topological ﬁngerprints over multiple scales. The kth persistent Betti
numbers Bi,j
represent the ranks of the kth homology groups of Ki that are alive and are

k

deﬁned as

Bi,j
k = rank(Hi,j

k ) = rank(Zk(Ki)/(Bk(Kj) ∩ Zk(Ki))).

(4.9)

respectively where X and Y are persistence barcodes and Bij(X, Y ) the collection of all

bijections from X to Y . An example of a barcode is provided in Figure 4.2.

4.5 Similarity and distance

In this work, both Bottleneck and Wasserstein distances are used to compare conjugate

persistence diagrams. This provides the models with atom-speciﬁc topological information

and facilitates atom-speciﬁc persistent homology. Let X and Y be multisets of data points,

27

(a) Example Points

(b) Barcode

Figure 4.2: (a) An example of 5 points in R2 and (b) the corresponding topological bar-
code. The length of each barcode corresponds to the persistence of each topological object
(β0,β1,β2,etc..) over the ﬁltration.

the Bottleneck and Wasserstein distances of X and Y are given by [40]

dB(X, Y ) =

inf

γ∈B(X,Y )

sup
x∈X

|| x − γ(x) ||∞,

(4.10)

28

and [41]

 inf

γ∈B(X,Y )

(cid:88)

x∈X

1/p

,

|| x − γ(x) ||p∞

dp
W (X, Y ) =

(4.11)

respectively. Here B(X, Y ) is the collection of all bijections from X to Y .

In this work

topological invariants of diﬀerent dimensions are compared separately.

4.6 Vietoris-Rips Complex

Given a metric space M and a cutoﬀ distance d, a simplex is formed if all points have

pairwise distances no greater than d. All such simplices form the Vietoris-Rips complex.

The abstract nature of the VR complex allows the construction of simplicial complexes for

correlation function based metric spaces, which models pairwise interaction of atoms using

correlation functions versus more standard spatial metrics.

4.7 Atom Speciﬁc Persistent Homology & Element Spe-

ciﬁc Persistent Homology

To embed the chemical biological protein information into topological invariants, element-

speciﬁc persistent homology was introduced by Cang et al[42, 43]. The basic idea of ESPH

is to use subset of atoms of various element types within a protein to construct topological

representations. The corresponding persistence diagrams then represent diﬀerent interac-

tions that occur within a protein. For example selecting all carbon atoms would result in

barcodes that coded the network and strength of the hydrophobicity in the protein.

29

Figure 4.3: Illustration of Atom-speciﬁc persistent homology point clouds. Top: the original
point cloud. The atom of interest is at the center of the circle. Second row: a pair of
conjugated sets of point clouds for atom-speciﬁc persistent homology. The rest: Four pairs
of conjugated point clouds for atom-speciﬁc and element-speciﬁc persistent homology.

To represent the topological importance of a given atom, atom-speciﬁc persistent ho-

mology is introduced. This works by constructing two conjugated point clouds centered

about a given atom of interest within a biomolecule. The point clouds consists of one that

includes the atom of interest and all nearby atoms within a prescribed cutoﬀ, and another

identical point cloud minus the atom of interest. Then, conjugated simplicial complexes,

conjugated homology groups and conjugated topological invariants are generated for each

30

conjugate pair of points clouds. Wasserstein and Bottleneck distances can then be used to

measure the diﬀerence between conjugated topological invariants which provides a topologi-

cal representation of the atom of interest. Figure 4.3 provides and example of atom-speciﬁc

and element-speciﬁc conjugated point clouds can be constructed for a given toy dataset.

This work generates only Cα B factor predictions however the method is general and can

be used to predict the B factor of any atom. To create a diverse topological representation

for each Cα element speciﬁc persistent homology is used. Atom-speciﬁc persistent homology

is also used to contribute a precise topological representation at each Cα atom. Using the

conjugate pair subsets, Vietoris-Rips complexes are constructed by contact maps or matrix

ﬁltration [44].

To capture element-speciﬁc interactions three subsets of carbon-carbon, carbon-nitrogen,

and carbon-oxygen point clouds are used. This gives the following element speciﬁc pairs,

P = {CC, CN, CO}.

(4.12)

For a given Protein Data Bank (PDB) ﬁle, persistence barcodes are calculated as follows.

i ∈ Pk in an element speciﬁc set Pk (P1 = CC,P2 = CN,
Given a speciﬁc Cα of interest, rk
and P3 = CO) , a point cloud consisting of all atoms within a pre-deﬁned cutoﬀ radius rc is

deﬁned as

Rk
i = {rk

j

(cid:12)(cid:12) ||rk

i − rk

j|| < rc,

rk
i , rk

j ∈ Pk,∀ j ∈ 1, 2, . . . N},

(4.13)

where N is the number of atoms in the kth element pair Pk. A conjugated set of point
cloud, ˆRk
point clouds Rk

i , includes the same set of atoms, except for rk

i , conjugated simplicial complexes, conjugated homology groups,

i and ˆRk

i . For a given pair of conjugated

and conjugated persistence barcodes are computed. Euclidean distance based ﬁltration is

31

computed using the Vietoris-Rips complex. Given set of atoms selected according to atom-

speciﬁc and element speciﬁc constructions, a family of multi-resolution persistence barcodes

is generated by a resolution controlled ﬁltration matrix given by [44]

Mnm(ϑ) = 1 − Φ(||rn − rm||; ϑ),

(4.14)

where ϑ denotes a set of kernel parameters. We have used both exponential kernels

Φ(||rn − rm||; η, κ) = e−(||rn−rm||/η)κ

,

κ > 0

(4.15)

and Lorentz kernels

Φ(||rn − rm||; η, ν) =

1 +(cid:0)||rn − rm||/η(cid:1)ν ,

1

ν > 0

(4.16)

where η κ, and ν are pre-deﬁned constants. This ﬁltration matrix is used in association with

the Vietoris-Rips complex to generate persistence barcodes or persistence diagrams. These

topological invariants are then compared using both Bottleneck and Wasserstein distances.

An example of the conjugated persistence barcode pair generated for a Cα atom is illustrated

in Figure 4.4.

32

(a) 1AIE Subunit

(b) Barcode

Figure 4.4:
Illustration of residue 338 Cα atom-speciﬁc persistent homology in the CC
element-speciﬁc point cloud of protein PDBID 1AIE. For this example residues 332-339 are
used and are shown on the left. The Cα location used to generate the barcodes (right) is
highlighted in red in the left chart. Conjugated persistence barcodes are generated with and
without the selected Cα.

33

Chapter 5

Machine Learning

Machine learning is a subset of artiﬁcial intelligence based on statistical and probabilistic

methods to “learn” patterns in data given a training set. This means that unlike other

mathematical models, the structure of the algorithm is not known a priori. Broadly speak-

ing, machine learning tasks are classiﬁed into supervised, semi-supervised, or unsupervised

learning. Supervised learning involves training on data that contains both input data and

some desired output data, semi-supervised training on data where some of the outputs are

unknown, and unsupervised training on data without known output. Supervised and semi-

supervised algorithms can then be trained for regression or classiﬁcation tasks depending on

the desired output. Since they have no target output, unsupervised algorithms can only ﬁnd

structure in data such as in the clustering or grouping data.

Machine learning algorithms diﬀer by their internal representation. These algorithms

are ﬁrst classiﬁed as parametric or non-parametric depending on whether they have ﬁxed

number of parameters regardless of sample size, or whether the number of parameters is

allowed to grow with sample size respectively. In practice parametric machine learning al-

gorithms are computationally fast, require less data, and easy to implement compared to

non-parametric machine learning algorithms. However, parametric machine learning algo-

rithms can suﬀer from poor ﬁtting due to overly strong assumptions about the underlying

mapping function.

In contrast non-parametric machine learning models are able to ﬁt a

34

larger variety of functional forms and can thus produce more robust models.

The work by Wolpert et al suggests that learning algorithms cannot be universally

good[45]. That is, a machine learning algorithm that provides a good model for one problem

may not work for a diﬀerent problem. As such, it is standard practice when using machine

learning, to test several diﬀerent machine learning algorithms to determine which of the

algorithms are best suited to the problem.

The task of B factor prediction is a supervised regression task. It is supervised because

B factors are known from experimental data and the prediction task is regression because

B factor takes continuous values. Taking the aforementioned considerations into mind, this

work considers several non-parametric machine learning algorithms. In particular, random

forests, gradient boosted trees, convolutional neural networks, and deep neural networks are

all considered in this work. All machine learning results are reported in Chapter 7. The

following sections provide a detailed description of the algorithms, feature inputs, parame-

terizations, and datasets used for testing.

5.1 Machine Learning Algorithms

The following subsections provide a brief overview of each of type of machine learning algo-

rithm used in this work.

5.1.1 Ensemble Methods

Ensemble methods are a class of machine learning algorithms that generate a strong pre-

dictive model based on a large number of simple weak learning models. The basic idea

is that taken together, a large number of weak learners, those who do only slightly better

35

than chance, can generate a robust predictive model. Two of the most popular ensemble

algorithms, which are used in this work, are random forests of trees and gradient boosting

trees[46, 47, 48, 49].

5.1.1.1 Random forest

Random forests are an ensemble machine learning method used for classiﬁcation or regres-

sion tasks. For regression tasks random forests train many decision trees then output the

mean prediction of the individual trees. Compared to other machine learning algorithms,

random forests are advantageous because they have few hyper-parameters, are generally

robust against overﬁtting, and invariant to scaling.

Machine learning approaches are commonly criticized as “black box” approaches. That

is, while the input and output of a machine learning algorithm are well known the internal

model the algorithm is using is generally hidden to the user. Ensemble methods like random

forests address this issue in part by providing variable importance of the trained model.

Variable importance is one important way that users can understand which features give the

model the most predictive power. Random forests are invariant to scaling, so they do not

require the feature data to be pre-processed.

Random forests require minimal hyperparameter tuning. The only hyper parameter

required is the number of n decision trees. While random forests are generally robust to

overﬁtting if n is chosen to be too large it is possible for a random forests to overﬁt a

dataset. Thus too few trees and the model will have poor predictive power and too many

trees may lead to overﬁtting and be computationally costly. The user must take special care

to determine the right amount of decision trees. For this work, the choice of decision trees

was determined by testing various values of n to strike a balance between performance and

36

cost.

5.1.1.2 Gradient boosted trees

Like random forests, gradient boosting trees (GBTs) are an ensemble method. GBTs in-

corporate boosting to reduce bias and variance and utilize a number of “weak learners” to

iteratively construct a predictive model. The algorithm is optimized using gradient descent,

minimizing the residual of a predeﬁned loss function. At each step, GBTs incorporate de-

cision trees to improve their predictive power. Gradient boosting trees and other related

ensemble methods are useful because they have strong predictive power, do not require

normalization of the dataset, and are typically robust to outliers and overﬁtting.

5.1.2 Neural Networks

Recent advances in GPU computing have allowed neural networks to be computationally

tractable machine learning models. Modeled after neurons in the brain, neural networks

apply layers of activation functions, called perceptrons, to inputs. Weights of the neural

network are trained to minimize a loss function over many passes of a training dataset.

Many neural networks utilize back-propagation, which allows the error to propagate to the

previous layer, to adjust neuron weights and improve output error until it is below a preset

threshold. In short, neural networks begin with an initial random guess at an output then

repeatedly adjust the neuronal weights until the output error is satisfactorily reduced. Neural

networks with several “hidden” layers of perceptrons are known as deep neural networks

(DNNs). Figures 5.1 and 5.2 provide examples of the basic perceptron and deep neural

network framework.

37

Inputs Weights

Neuron

Output

w1

w2

w3

wn

n(cid:88)

i=1

xiwi + b

(cid:19)

xiwi + b

(cid:18) n(cid:88)

f

i

x1

x2

x3
...

xn

Figure 5.1: An example of a perceptron, the basic functional unit of a neural network.

Input
Layer

Hidden
Layer

Hidden
Layer

Output
Layer

Output

Figure 5.2: An illustration of a fully connected deep neural network. Circles represent
neurons and connections between neurons are indicated by arrows. Each connection has an
associated weight. A neural network is considered “deep” when it uses several hidden layers.

5.1.2.1 Convolutional Neural Network

Convolutional neural networks (CNNs) are a type of neural network that have recently had

great success in the ﬁeld of image classiﬁcation. CNNs work by applying convolutional

38

ﬁlters over several layers, and by doing so extract successively higher-level features from

input images. For image data CNNs are more advantageous than fully connected neural

networks because they can often outperform fully connected neural networks with a fraction

of training parameters.

5.1.3 Consensus methods

It is often the case that one machine learning model model may outperform others in certain

areas but do worse in others. As such a consensus model can provide a useful tool that

may improve overall results. As such, for PH based Cα only B factor prediction, this work

also includes B factor prediction results using a consensus model. The consensus model

prediction used here is generated by combining the B factor predictions of the two PH based

machine learning models. In particular, the consensus prediction for each Cα is the average

of Cα B factor values predicted from the PH based GBT and deep CNN B factor prediction.

5.2 General Machine Learning Features

3D spatial atomic coordinates of each atom in a protein are provided by Protein Databank

(PDB) .pdb ﬁles. The PDB ﬁles also provide additional experimental data that can be used

as local and global input features for machine learning algorithms. All machine learning

algorithms used in this work make use of both global and local protein features described

in the sections 5.2.1 and 5.2.2. To study the impact of the MWCG, ESPH, and ASPH

methods these features are tested separately in diﬀerent machine learning algorithms. The

parameters used to generate these machine learning features in this work are described in

detail in sections 5.3 and 5.4 and below.

39

5.2.1 Global features

The global protein features described in this section were used in all the machine learning

models in this work. The global features that were used in this work are R-value, resolution,

and total number of heavy atoms. These features are obtained via the experimental data

recorded in PDB ﬁle of each protein. Both R-value and resolution provide measures of the

quality of the atomic model obtained from the X-ray crystallography. Also included as a

global feature is the total protein size which is determined as the sum of heavy elements

(carbon, nitrogen, oxygen, and sulfur) present in the protein. To code the protein size data,

it is organized into one of 10 discrete size classes using one hot encoding. The size ranges

are given based on the distribution of total number of heavy elements of each protein. For

this work we use the following size classes. A frequency distribution of the size categories is

provided in Figure 5.3.

[500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 30000]

Using one-hot coding, a protein element feature size will take on 1 if the number of heavy

atoms (carbon, nitrogen, or oxygen) of the protein is less than or equal to the corresponding

size and zero for the other sizes. For example, a protein with 600 heavy elements would have

the feature size vector for all of its atoms given by

[0, 1, 0, 0, 0, 0, 0, 0, 0, 0].

The maximum size bin is 30,000 since all proteins in the dataset have less than 30,000 heavy

elements.

40

Figure 5.3: Frequency of the number of heavy elements from the 364 protein dataset. Figure
originally published in Bramer et al [2].

5.2.2 Local features

In addition to the features discussed above, PDB ﬁles contain the amino acid corresponding

to each heavy element. Like the protein size feature, amino acid information is included by

using one hot encoding for each heavy element which results in twenty amino acid features.

More locally, each of the the four diﬀerent heavy element types carbon, nitrogen, oxygen,

and sulfur for each element are one hot coded which results in another four features.

To explicitly take the density of nearby atoms into account, this work includes packing

density as an additional model feature. Short, medium and long packing density features for

each heavy atom are generated and included in all the machine learning models used in this

work. Mathematically, the packing density of the ith atom is deﬁned as

pd
i =

Nd
N

,

where d is the given cutoﬀ in angstroms, Nd is the number of atoms within the Euclidean

distance of the cutoﬀ to the ith atom, and N the total number of heavy atoms of the protein.

41

Table 5.1 provides the packing density cutoﬀs used in this work.

Table 5.1: The packing density distance parameters (d ˚A) used for generating short medium,
and long packing density machine learning features.

Short Medium Long
5 ≤ d
d < 3

3 ≤ d < 5

Secondary structures also play an important role in protein interactions. This work in-

cludes several secondary structural machine learning features for all the machine learning

models used. Several software packages exist for the prediction of secondary protein struc-

tures. All secondary protein machine learning features used in this work were generated

using the STRIDE software. This software returns secondary structure results that are in

maximal agreement with X-ray crystallography data through the use of an optimized knowl-

edge based algorithm. STRIDE takes 3D atomic coordinates in the form of protein PDB

ﬁles as input and assigns each atom to a corresponding secondary structural group. STRIDE

assigns each atom as belonging to a alpha helix, 3-10 helix, PI-helix, extended conformation,

isolated bridge, turn, or a coil. Solvent accessible surface area, φ and ψ angle information

are also generated by the software. This provides a total of 12 secondary structure features

that are used in all the machine learning models in this work.

5.3 MWCG Features

The MWCG ﬂexibility index described in Chapter 3 is used to create feature vectors for

carbon, nitrogen, and oxygen interactions with each heavy element. To capture multiscale

interactions 3 diﬀerent kernel parameterizations are used for each interaction type. This

provides a total of nine MWCG machine learning features for each heavy element. The

kernel parameters used in this work are based oﬀ previous results. Speciﬁc parameters for

42

the kernels used here were originally published in Bramer et al and are provided in Table

5.2.[1]

Table 5.2: Correlation kernel parameters used to generate parameter-free MWCG machine
learning features. Parameters based on previous results.[1]

Kernel Type
Lorentz (n = 1)
Lorentz (n = 2)
Exponential (n = 3)

κ ηn
16
-
2
-
1
31

ν
3
1
-

5.3.1 Image-like MWCG Features

Convolutional neural networks make use of the large amount of data provided in images

by applying a convolution operation. Due to the massive amount of trainable parameters,

fully connected feed forward neural networks are computationally prohibitive for images.

Convolutional operations greatly reduce the number of free parameters, thereby striking

good balance between deep predictive power and computational cost. For this work MWCG

images are generated for every heavy atom in the data set then used in a deep CNN model.

Multiscale images are generated using both Lorentz and exponential radial basis functions

for all heavy atoms in the data set. The generated images capture multiscale interactions by

using a number of diﬀerent parameterizations of κ, ν, and η in their kernels. To capture a

large range of protein atomic interaction scales this work uses the following values are used

for κ, ν, and η.

η = {1, 2, 3, 4, 5, 10, 15, 20}

κ, ν = {2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 9, 10, 11}.

Taken together as a matrix, each generated “image” results in three 2D MWCG images

43

of dimension (8, 30) for each heavy atom in the data set. For this work MWCG images are

generated for all carbon, nitrogen, and oxygen interactions for each heavy atom. This results

in a total of three channels for each image and a ﬁnal image dimension of (8, 30, 3) for each

atom used the MWCG deep CNN testing.

The image matrix is given by F k
i

i (l, m, n) represents
the ﬂexibility index of the ith atom, and kth atom interaction (C, N, or O), l = η, m = {κ, ν},

in equation 5.1, where each atom f k

and n the type of radial basis function. Values of n = 1 and n = 2 correspond to exponential

and Lorentz radial basis functions respectively.

f k
i (1, 2, 1)
f k
i (2, 2, 1)

...

f k
i (1, 2.5, 1)
f k
i (2, 2.5, 1)

...
...

f k
i (1, 11, 1)
f k
i (2, 11, 1)

f k
i (1, 2, 2)
f k
i (2, 2, 2)

f k
i (1, 2.5, 2)
f k
i (2, 2.5, 2)

f k
i (15, 2, 1) f k

i (15, 2.5, 1)

... f k

i (15, 11, 1) f k

i (15, 2, 2) f k

i (15, 2.5, 2)

...
...
...
... f k

F k
i =

f k
i (1, 11, 2)
f k
i (2, 11, 2)

i (15, 11, 2)

(5.1)

 η


(cid:125)



(cid:124)

f k
i (20, 2, 1) f k

i (20, 2.5, 1)

... f k

f k
i (20, 2, 2) f k

i (20, 2.5, 2)

... f k

i (20, 11, 2)

(cid:123)(cid:122)

κ

(cid:125)
i (20, 11, 1) (cid:124)

(cid:123)(cid:122)

ν

5.4 ASPH & ESPH Features

A variety of element-speciﬁc and atom-speciﬁc persistent homology features, as described in

Chapter 4, are generated as local machine learning features. The ASPH and ESPH features

are generated in several ways by varying kernels (Lorentz and exponential), element-speciﬁc

pairs (CC, CN, CO), and distance metrics (Wasserstein-0 and Wasserstein-1, Bottleneck-0

and Bottleneck-1). For this work, all persistent homology features were generated with a

radial cutoﬀ of 11˚A.

The distances determined by Wasserstein and Bottleneck metrics are dependent on the

boundary of the corresponding persistence diagrams. In other words any events from one

44

diagram that do not match an event on the other diagram can contribute to the ﬁnal Wasser-

stein or Bottleneck distance by their distances from the boundary. Considering these eﬀects,

this work includes two additional persistence diagrams. The additional diagrams are con-
structed by rotating the y-axis is rotated clockwise by 30◦ or 60◦, respectively. Figure 5.4

provides an example of these modiﬁcations. By introducing this modiﬁcation, the Bottleneck

and Wasserstein distances correspondingly allow the model to recognize elements that have

a short persistence, or lifespan. As a ﬁnal consideration, a feature is generated by reﬂecting

the original persistence diagram about the diagonal axis. An example of this modiﬁcation is

provided in Figure 5.4. A list of kernels, kernel parameters, y-axis change, distance metric,

and element-speciﬁc pairs used to generate features in machine learning models is provided

in Table 5.3.

(a)

(b)

(c)

Figure 5.4: Illustration of modiﬁed persistence diagrams used in distance calculations. (a)
Unchanged. (b) Rotated 30◦. (c) rotated 60◦. Black dots are Betti-0 events and triangles
are Betti-1 events.

5.4.1 Image-like ASPH & ESPH Features

2D image-like persistent homology (PH) features for each Cα of the proteins are generated

using the process described in Section 4.7. The images-like features are generated by taking

various values of η and κ using the kernel function. An exponential kernel is used with a

45

Table 5.3: Parameters used for topological feature generation. All features used a cutoﬀ of
11˚A. Both lorentz (Lor) and exponential (exp) kernels and Bottleneck (B) and Wasserstein
(W) distance metrics were used.

No. features Kernel Kernel parameter

Diagram

12
12
12
12
12

Lor
Exp
Exp
Exp
Exp

η = 21, ν = 5
η = 10, κ = 1
η = 2, κ = 1 Diagonal reﬂection
η = 2, κ = 1
η = 2, κ = 1

Unchanged
Unchanged
Rotated 30◦
Rotated 60◦

Distance metric Element pair
CC, CN, CO
CC, CN, CO
CC, CN, CO
CC, CN, CO
CC, CN, CO

B, W
B, W
B, W
B, W
B, W

radial cutoﬀ of 11˚A. Diﬀerent values of η and κ are used to capture multiple interaction

scales. The values used in this work are

and

η = {1, 2, 3, 4, 5, 10, 15, 20},

κ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}.

This results in an image-like matrix given by PHk
i

in Eq.

(5.2). Each atom PHk

i (l, m)

represents the PH feature of the ith Cα atom, and kth atom interaction (C, N, or O), l = η,

and m = κ.

PHk

i =



(cid:124)

f k
i (1, 1)

f k
i (1, 2)

f k
i (2, 1)

f k
i (2, 2)

...

. . .

. . .
...

f k
i (1, 9)

f k
i (1, 10)

f k
i (2, 9)

f k
i (2, 10)

f k
i (15, 1) f k
f k
i (20, 1) f k

i (15, 2)

. . . f k

i (20, 2)

. . . f k

i (15, 9) f k
i (20, 9) f k

i (15, 10)

i (20, 10)

(cid:123)(cid:122)

κ




(cid:125)

η

(5.2)

This generates 2D PH image-like features of dimension (8,10). Compared to MWCG images,

46

the PH images have lower resolution than the MWCG images due to the cost of calculating

PH features. Images are generated for carbon, nitrogen, and oxygen element-speciﬁc inter-

actions with each Cα atom. As a result, the ﬁnal image feature input has a dimension of

(8,10,3) for each Cα atom.

5.4.2 Cutoﬀ Distance

For this work a cutoﬀ of 11˚A is used to generate all persistent homology machine learning

features. The cutoﬀ was determined using a basic grid search over various cutoﬀ distances.

Figure 5.5 displays the average Pearson correlation coeﬃcient, obtained via ﬁtting with

experimental B factors, over the entire dataset using all persistent homology metrics with

various point cloud distance cutoﬀs. The parameters listed in Table 5.4 are used to generate

Figure 5.5: Average Pearson correlation coeﬃcient over the entire protein dataset ﬁtting all
24 persistent homology features using various cutoﬀ distances.

PH features for each protein. These parameters were determined using a grid search over

various ν, η, and κ.

47

Table 5.4: Parameters used for the element speciﬁc persistent homology features with a
cutoﬀ of 11 ˚A.

Kernel Type
Lorentz (n = 1)
Exponential (n = 2)

ν
5
-

ηn κ
21
-
1
10

5.5 Machine Learning Model Parameters

For this work several machine learning models were generated. All machine learning models

used in this study include the global and local features described sections 5.2.1 and 5.2.2.

Two classes of machine models are generated for this work. The ﬁrst includes random

forest, gradient boosted tree, and deep convolutional neural networks that use MWCG input

features in addition the general global and local features mentioned above. The second class

of machine learning models use the ASPH and ESPH input features in addition to the general

and local features. Each model has speciﬁc parameters than can be tuned. The following

sections outline the parameters used in this work.

5.5.1 MWCG

5.5.1.1 Random Forest

Random forests only require the user to determine the amount of n trees. The predictive

power of random forests generally increases with the number of trees used and these models

are robust to over ﬁtting. However increasing the number of trees comes at a computational

cost. To balance performance with computational cost, this work uses n = 500 trees for all

MWCG based random forest B factor prediction.

48

5.5.1.2 Gradient Boosted Trees

Several hyperparameters within the gradient boosted tree method can be tuned. The MWCG

based GBT hyperparamters used in this work are determined using the standard practice

of a grid search. Testing parameters are provided in Table 5.5. Any hyper parameters not

listed below were taken to be the default values provided by the python scikit-learn package.

Table 5.5: Boosted gradient tree parameters used for testing MWCG based B factor pre-
diction. These parameters were determined using a grid search. Any hyper parameters not
listed below were taken to be the default values provided by the python scikit-learn package.
MWCG based GBT machine learning prediction results originally published in Bramer et al
[2].

Parameter
Loss Function
Alpha
Estimators
Learning Rate
Max Depth
Min Samples Leaf
Min Samples Split

Setting
Quantile
0.95
1000
0.001
4
9
9

5.5.1.3 Deep Convolutional Neural Network

This work uses 3 channel (8,30) MWCG based image-like correlation maps, as described in

Section 5.3, as CNN input data for each image. The CNN output is ﬂattened and concate-

nated with global and local protein features, as described in Sections 5.2.1 and 5.2.2, then

input into a deep neural network to predict atomic B factor. A diagram of the MWCG based

deep CNN architecture is provided in Figure 5.6.

The CNN input image used for MWCG based B factor in this work is a three-channel

MWCG image of dimension (8,30,3). The deep CNN applies two convolutional layers with

2x2 ﬁlters, a dropout layer of 0.5, a dense layer, then ﬂattens the resulting output. The

49

Figure 5.6: The MWCG based deep convolutional neural network architecture used for
B factor prediction. The plus symbol represents the concatenation of data sets. Figure
originally published in Bramer et al [2].

50

ﬂattened output from the CNN is concatenated with the other global and local features

into a dense layer of 59 neurons followed by a dropout layer of 0.5, another dense layer of

100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and ﬁnishes with a dense

output layer. This results in a total of 21,584 trainable parameters for the deep CNN used

in MWCG based B factor prediction. A diagram of the deep CNN architecture is illustrated

in Figure 5.6.

Convolutional neural networks have several hyper-parameters. The hyper parameters for

the MWCG based deep CNN used in this work are optimized using a grid search. Table 5.6

provides a list of the hyper-parameter values used for testing. Any hyper parameters not

listed below were taken to be the default values provided by the python Keras package.

Table 5.6: MWCG based deep Convolutional Neural Network (CNN) hyper-parameters used
for testing. These hyper-parameters were determined using a grid search. Any hyper pa-
rameters not listed below were taken to be the default values provided by python with the
Keras package. MWCG machine learning prediction results originally published in Bramer
et al [2].

Parameter
Learning Rate
Epoch
Batch Size
Loss
Optimizer

Setting
0.001
100
100
Mean Absolute Error
Adam

5.5.2 ASPH & ESPH

The generated ASPH & ESPH features described in section 4.7 are used for prediction of

protein B factor using both least squares ﬁtting and machine learning as described in the

following sections.

51

5.5.2.1 Gradient Boosted Trees

The persistent homology based GBT hyper-parameters used in this work are optimized using

a grid search. The parameters used for testing are provided in 5.7. Any hyper-parameters

not listed in the table were taken to be the default values provided by the python scikit-learn

package.

Table 5.7: Boosted gradient tree parameters used for persistent homology based prediction
testing. Parameters were determined using a grid search. Any hyper parameters not listed
below were taken to be the default values provided by the python scikit-learn package.

Parameter
Loss Function
Alpha
Estimators
Learning Rate
Max Depth
Min Samples Leaf
Min Samples Split

Setting
Quantile
0.975
500
0.25
4
9
9

5.5.2.2 Deep Convolutional Neural Network

The deep CNN used in this work uses input images generated from an image-like correlation

map. These images are generated by using a range of kernel parameters for atom-speciﬁc

and element-speciﬁc persistent homology as described in Section 5.4.1. The CNN output

is ﬂattened and then input into a DNN along with global and local protein features. This

allows the deep CNN to use the same feature set as the boosted gradient method to be used

as well as the generated PH image-like data. Figure 5.7 provides a diagram of the CNN

architecture used for the PH based B factor prediction in this work.

The CNN is passed a three-channel persistent homology image of dimension (8,10,3) for

each Cα of the training set. The model used in this work takes the input image data and

52

Figure 5.7: The deep learning architecture using a convolutional neural network combined
with a deep neural network to predict B factor using PH based features. The plus symbol
represents the concatenation of features.

applies two convolutional layers with 2x2 ﬁlters, followed by a dropout of 0.5. The image

data is then passed through a dense layer, ﬂattened, then joined with the other global and

local features to form a dense layer of 218 neurons. This is followed by a dropout layer of

0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons,

and ﬁnishes with a dense layer of the B factor prediction output. Figure 5.7 provides an

illustration of the deep CNN used in this work.

Several hyper-parameters of the deep convolutional neural network can be tuned. The

deep convolutional neural network hyper-parameters are optimized using a basic grid search.

Table 5.8 provides the parameters used for testing. Any hyper-parameters not listed in the

provided table were taken to be the default values provided by the python Keras package.

53

Table 5.8: Convolutional Neural Network (CNN) parameters used for testing persistent
homology based features. Parameters were determined using a grid search. Any hyper-
parameters not listed below were taken to be the default values provided by python with the
Keras package.

Parameter
Learning Rate
Epoch
Batch Size
Loss
Optimizer

Setting
0.001
1000
1000
Mean Squared Error
Adam

5.6 Machine Learning Datasets

The image like features used in all convolutional neural networks were standardized with

mean 0 and variance of 1. Because the STRIDE software is unable to provide features for

these proteins, 1OB4, 1OB7, 2OLX, and 3MD5 are excluded from the data set. Protein

1AGN is also excluded due to known problems with this protein data. Lastly, proteins

1NKO, 2OCT, and 3FVA are excluded because they have residues with B factors reported

as zero, which is unphysical.

5.6.1 Training set and test set

The PH and MWCG based machine learning algorithms used in this work are all trained and

tested using a leave-one-protein-out approach. For each protein a machine learning model

is built using the entire dataset but excluding data from the protein whose B factors are to

be predicted. The dataset contains over 620,000 atoms in total which provides a training

set of roughly 600,000 data points (i.e., atoms) for each protein. Each heavy atom in the

training set has an associated set of input features, as described in Sections 5.3 and 4.7, and

a B factor output. The feature inputs and the outputs in the training set are used to train

each machine learning model. Since the predictions are leave-one-protein-out, data from

54

each protein is taken as a test set when its B factors are to be blindly predicted.

All random forest models and boosted gradient models are implemented using the scikit-

learn python package. All deep CNN models are implemented using the python package

Keras with tensorﬂow as a backend.

55

Chapter 6

Workﬂow

Read PDB Data

Pre-process PDB

Select element pairs and construct corresponding point clouds

Select kernel functions Φn and scale parameters ηn

Calculate rigidity un

i and ﬂexibility index f n
i

Figure 6.1: Workﬂow for procedure in MWCG feature construction.

56

Read PDB Data

Pre-process PDB

Select element pair, cutoﬀ distance, and kernel

Construct corresponding point clouds

Calculate kernel based distances matrices

Construct Rips complexes for each local Cα point cloud

Construct Persistence Diagrams using Rips complexes

Calculate Bottleneck distances

Figure 6.2: Workﬂow for procedure in ASPH and ESPH feature construction.

57

Choose ML model

Read and clean PDB ﬁle Data

Mine PDB ﬁle data for local and global features

Generate engineered features (MWCG, ASPH,
ESPH, CNN images, packing density, etc. . .)

Parameterize (MWCG, ASPH, ESPH,
CNN Images, cutoﬀ distance(s), etc. . .)

Normalize features

Tune hyperparameters

Validate model (leave-one-protein-out)

Figure 6.3: Workﬂow for procedure MWCG, ASPH, and ESPH based machine learning B
factor prediction.

58

Chapter 7

Results

7.1 Visualization of Element Speciﬁc Correlation Maps

In this result the radial basis functions are used in the MWCG method to construct various

element speciﬁc correlation heat maps of a given protein. For this study we consider carbon,

nitrogen, and oxygen interactions and create correlation heat maps using both nitrogen-

nitrogen and carbon-carbon interaction pairs. Only one spatial scale is used to illustrate the

element speciﬁc feature of the MWCG method. This is abbreviated as WCG in the related

tables. Given an element pair, each map was calculated used the average of the three kernels

described in Chapter 3. Axes of each correlation map correspond to individual atoms of each

carbon, nitrogen, or oxygen atom in the given protein. In this work correlation heat maps are

generated using the three proteins with PDB ID 3TYS, 1AIE, and 3PSM. Nitrogen-nitrogen

and oxygen-oxygen correlation heat maps are provided in Figures 7.1, 7.2, and 7.3. Each

ﬁgure also includes a 3D representation, generated using Visual Molecular Dynamics (VMD)

software, of each protein for reference.

7.2 Hinge Detection

Accurate and robust identiﬁcation of hinge regions is an ongoing problem. An important

application of hinge region detection is domain identiﬁcation. Hinge regions of proteins also

59

play an essential role in enzymatic catalysis due to their ability to allow conformational

changes to the protein. Binding by ligands can be accommodated by a ﬂexible active site

as seen in hinge regions. With these considerations in mind, hinge prediction cannot be

overlooked when developing methods for protein ﬂexibility and dynamics analysis. The

MWCG presented here can be used as a hinge detection tool. In this work we consider three

interesting examples. Calmodulin provides an example of a protein hinge that eﬀects both

the structure and function of the protein. For this result experimental protein B factors of

Cα atoms are compared with predictions from the WCG method and GNM for calmodulin

(PDB ID 1CLL), ribosomal protein (PDB ID 1WHI), and engineered ﬂuorescent cyan protein

(PDB ID 2HQK). To highlight the value of the element speciﬁc feature of the MWCG only

one scale is used so that the method is simply WCG. For comparison protein PDB ID 1CLL

includes MWCG and WCG predictions to compare and contrast the element speciﬁc and

multiscale nature of the MWCG method. Results are generated with carbon-carbon, carbon-

nitrogen, and carbon-oxygen interaction pairs. Exponential type kernels are used with ﬁxed

parameters κ = 1, and η = 3 ˚A. The results are displayed in Figures 7.5, 7.4, and 7.6.

60

(a) 1AIE

(b) Amine Nitrogens

(c) Double Bonded Carboxyl Oxygens

Figure 7.1: (a) VMD representation of PBD ID 1AIE. (b) Correlation maps for nitrogen-
nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 1AIE. The thicker band
along the main diagonal of (b) and (c) corresponds to the alpha helix secondary structure
in 1AIE. Figure originally published in Bramer et al [1].

61

(a) 1KGM

(b) Amine Nitrogens

(c) Double Bonded Carboxyl Oxygens

Figure 7.2: (a) VMD representation of PBD ID 1KGM. (b) Correlation maps for nitrogen-
nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 1KGM. The bands per-
pendicular to the main diagonal of (b) and (c) correspond to the anti parallel beta sheet
present in 1KGM. Figure originally published in Bramer et al [1].

62

(a) 5IIV

(b) Amine Nitrogen

(c) Double Bonded Carboxyl Oxygens

Figure 7.3: (a) VMD representation of PBD ID 5IIV. (b) Correlation maps for nitrogen-
nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 5IIV. The presence of
the two distinct thick bands along the main diagonal of (b) and (c) corresponds to the two
alpha helices present in 5IIV. The oﬀ diagonal bands correspond to the bonding interaction
between alpha helices. Figure originally published in Bramer et al [1].

63

(a)

(b)

(c)

(d)

Figure 7.4: (a) A visual comparison of experimental B factors , (b) WCG predicted B factors,
(c) and GNM predicted B factors for the ribosomal protein L14 (PDB ID:1WHI). (d) The
experimental and predicted B factor values plotted per residue. GNM represents predicted
B factors using GNM with a cutoﬀ distance of 7 ˚A. WCG is parametrized using CC, CN, CO
kernels of the exponential type with ﬁxed parameters κ = 1, and η = 3 ˚A. Figure originally
published in Bramer et al [1].

64

(a)

(b)

(c)

(d)

Figure 7.5: (a) The structure of calmodulin (PDB ID: 1CLL) visualized in Visual Molecular
Dynamics (VMD)18 and colored by experimental B factors, (b) MWCG predicted B factors,
(c) WCG predicted B factors, (d) and GNM predicted B factors with red representing the
most ﬂexible regions. Figure originally published in Bramer et al [1].

65

(e)

Figure 7.5: (Continued) (e) The experimental (Exp) and predicted B factor values plotted
per residue for PDB ID 1CLL. The GNM is for the GNM method with a cutoﬀ distance of
7 ˚A. We see that GNM clearly misses the ﬂexible hinge region. WCG is parametrized using
CC, CN, CO kernels of the exponential type with ﬁxed parameters κ = 1, and η = 3 ˚A.
MWCG represents B factor predictions determined from the MWCG method using the ﬁxed
parameters listed in Table 3.2. Figure originally published in Bramer et al [1].

7.3 MWCG

7.3.1 Validation

The Pearson correlation coeﬃcient is used to quantitatively assess the prediction results.

The Pearson correlation coeﬃcient for B factor prediction used in this work is given by

66

N(cid:88)

i=1

PCC =

(cid:20) N(cid:88)

i=1

i − ¯Bt)

(Be

i − ¯Be)(Bt
N(cid:88)

(Be

i − ¯Be)2

(Bt

i − ¯Bt)2

i=1

(cid:21)1/2

,

(7.1)

where Bt

i , i = 1, 2, . . . , N are predicted B factors using the proposed method and Be

i , i =

1, 2, . . . , N are experimental B factors from the PDB ﬁle. The terms Bt

i and Be

i represent

the ith theoretical and experimental B factors respectively. Here ¯Be and ¯Bt are averaged B

factors.

7.3.2 Fitting Results

Tables 7.1-7.6, and 7.4 provide the average Pearson correlation coeﬃcient obtained using the

MWCG method as outlined in Chapter 3. The MWCG method is compared to other com-

monly used protein B factor prediction methods. The MWCG B factor Pearson correlation

coeﬃcient results for all 364 proteins in the dataset are provided in table 7.4. The proposed

MWCG method, optimal FRI (opFRI), parameter free FRI (pFRI), and GNM methods are

all compared. The same comparison for proteins of relatively, small, medium, and large sizes

are provided in tables 7.1, 7.2, and 7.3.

Table 7.4: Correlation coeﬃcients for B factor prediction obtained by MWCG, optimal FRI
(opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for a set of 364
proteins. GNM scores reported here are the result of tests with a processed set of PDB ﬁles
as described in Chapter 2.2. MWCG results originally published in Bramer et al [1].

PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM
0.826 0.808
0.465
0.27
0.654 0.685
0.781 0.843
0.726 0.656
0.672
0.62

1PEF
18
1PEN 16
1PMY 123
114
1PZ4
1Q9B
43
1QAU 112

1ABA 87
1AHO 64
1AIE
31
1AKG 16
1ATG 231
1BGF 124

0.698 0.613
0.625 0.562
0.416 0.155
0.35
0.185
0.578 0.497
0.539 0.543

0.727
0.698
0.588
0.373
0.613
0.603

0.989
0.957
0.701
0.921
0.957
0.786

0.888
0.516
0.671
0.828
0.746
0.678

0.855
0.768
0.969
0.945
0.843
0.834

67

PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM
0.751 0.645
0.809
0.52
0.334
0.543
0.631 0.556
0.65
0.621 0.368
0.789
0.447 0.431
0.517
0.372 0.529
0.435
0.671 0.596
0.742
0.711 0.714
0.72
0.82
0.841
0.837
0.429 0.434
0.474
0.762 0.691
0.778
0.577 0.522
0.6
0.665 0.638
0.726
0.661 0.742
0.665
0.639
0.594 0.495
0.713 0.653 0.671
0.438
0.146 -0.142
0.832 0.809 0.798
0.61
0.691
0.538
0.629
0.599 0.632
0.622
0.492 0.162
0.792 0.695 0.677
0.691
0.564 0.397
0.591 0.577 0.549
0.539
0.601
0.27
0.695
0.679 0.666
0.634 0.577 0.417
0.559 0.654
0.6
0.645 0.447
0.832
0.619
0.57
0.562
0.524 0.366
0.596
0.333 0.434
0.375
0.834 0.901
0.842
0.662
0.638 0.433
0.77
0.808 0.757
0.756 0.579
0.69
0.524 0.281
0.564
0.694 0.512
0.705
0.593 0.521
0.684
0.639
0.603 0.467
0.555 0.551 0.477
0.606
0.68

0.726 0.623 0.706
0.543
0.491 0.552
0.58
0.512 0.351
0.751 0.702 0.741
0.912
0.889 0.832
0.746
0.732 0.859
0.653 0.638 0.677
0.609 0.628
0.71
0.544
0.393 0.432
0.023 -0.274
0.089
0.644 0.547
0.65
0.859 0.738
0.878
0.718
0.613 0.674
0.568 0.485
0.59
0.766
0.693 0.646
0.845 0.773 0.821
0.732 0.591
0.781
0.634 0.421
0.748
0.488
0.429 0.306
0.811 0.686 0.616
0.516 0.549
0.549
0.715
0.735
0.69
0.697
0.689 0.637
0.553 0.531 0.378
0.749
0.744 0.558
0.547 0.536 0.512
0.635
0.612 0.466
0.763 0.754
0.796
0.62
0.679
0.657
0.687 0.681
0.7
0.609 0.497
0.651
0.75
0.703 0.631
0.619 0.535 0.368
0.53
0.669
0.523
0.789 0.631
0.795
0.622
0.604 0.615
0.746 0.622 0.523
0.874 0.844
0.91
0.333 0.309
0.562
0.776
0.763
0.75
0.545 0.652
0.737
0.555
0.409 0.398

1QKI
3912
1QTO 122
122
1R29
90
1R7J
1RJU
36
1RRO 112
1SAU 114
1TGR 104
1TZV 141
55
1U06
1U7I
267
1U9C 221
1UHA 83
1UKU 102
1ULR 87
1UOY 64
1USE
40
1USM 77
1UTG 70
1V05
96
105
1V70
21
1VRZ
1W2L
97
1WBE 204
1WHI
122
1WLY 322
1WPA 107
1X3O 80
18
1XY1
1XY2
8
87
1Y6X
1YJO
6
1YZM 46
1Z21
96
1ZCE 146
75
1ZVA
2A50
457
2AGK 233
2AH1
939
2B0A 186
2BCM 413
2BF9
36

51
1BX7
224
1BYI
1CCR 111
1CYO 88
1DF4
57
1E5K 188
260
1ES5
1ETL
12
1ETM 12
1ETN 12
1EW4
106
1F8R 1932
1FF4
65
1FK5
93
1GCO 1044
1GK7
39
1GVD 52
1GXU 88
1H6V 2927
1HJE
13
1I71
83
1IDP 441
1IFR 113
1K8U 89
1KMM 1499
1KNG 144
1KR4
110
1KYC 15
73
1LR7
1MF7
194
1N7E
95
1NKD 59
1NKO 122
1NLS
238
1NNX 93
1NOA 113
1NOT 13
20
1O06
221
1O08
1OB4
16
1OB7
16
1OPD 85

0.896
0.600
0.741
0.860
0.941
0.848
0.700
0.932
0.941
0.949
0.804
0.504
0.933
0.648
0.839
0.984
0.849
0.901
0.133
0.931
0.798
0.827
0.875
0.856
0.740
0.810
0.892
0.971
0.929
0.757
0.812
0.911
0.831
0.799
0.834
0.808
0.937
0.988
0.516
1.000
1.000
0.607

0.554

68

Table 7.4 (cont’d)

0.508
0.809
0.787
0.859
0.805
0.748
0.819
0.810
0.869
0.774
0.885
0.764
0.838
0.765
0.718
0.769
0.960
0.819
0.745
0.841
0.854
0.995
0.747
0.767
0.804
0.728
0.797
0.787
0.933
1.000
0.838
1.000
0.970
0.725
0.898
0.911
0.704
0.821
0.462
0.805
0.695
0.714

PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM

Table 7.4 (cont’d)

29
1P9I
99
2CE0
2CG7
90
2COV 534
2CWS 227
2D5W 1214
2DKO 253
2DPL 565
2DSX 52
439
2E10
2E3H
81
2EAQ 89
2EHP 248
2EHS
75
2ERW 53
2ETX 389
116
2FB6
157
2FG1
560
2FN9
2FQ3
85
2G69
99
2G7O 68
2G7S
190
2GKG 122
2GOM 121
2GXG 140
2GZQ 191
2HQK 213
2HYK 238
2I24
113
398
2I49
108
2IBL
2IGD
61
2IMF 203
87
2IP6
88
2IVY
2J32
244
2J9W 200
2JKU 35
2JLI
100
2JLJ
115
2MCM 113

0.841
0.824
0.738
0.895
0.756
0.448
0.873
0.721
0.704
0.808
0.794
0.817
0.832
0.805
0.513
0.854
0.850
0.719
0.704
0.844
0.850
0.888
0.756
0.748
0.874
0.901
0.462
0.897
0.728
0.672
0.766
0.919
0.865
0.798
0.841
0.837
0.878
0.741
0.926
0.937
0.811
0.867

2BRF 100
0.754 0.742 0.625
2C71
205
0.706 0.598 0.529
2OLX
4
0.551 0.539 0.379
2PKT 93
0.846 0.823 0.812
2PLT
99
0.64
0.647
0.696
2PMR 76
0.689
0.682 0.681
2POF 440
0.816 0.812
0.69
2PPN 107
0.596 0.538 0.658
2PSF
608
0.337
0.333 0.127
2PTH 193
0.796 0.692
0.798
2Q4N 153
0.682 0.605
0.692
412
2Q52
0.69
0.753
0.695
2QJL
99
0.804
0.804 0.773
2R16
176
0.713 0.747
0.72
2R6Q 138
0.253 0.199
0.461
0.58
0.556 0.632
2RB8
93
2RE2
238
0.791 0.786
0.74
2RFR 154
0.617 0.584
0.62
2V9V 135
0.595 0.611
0.607
2VE8
515
0.719
0.692 0.348
2VH7
94
0.436
0.59
0.622
2VIM 104
0.784
0.785
0.66
2VPA 204
0.67
0.644 0.649
106
2VQ4
0.646 0.711
0.688
2VY8
149
0.584 0.491
0.586
2VYO 210
0.78
0.847
0.52
2W1V 548
0.505
0.382 0.369
2W2A 350
0.809 0.365
0.824
2W6A 117
0.575
0.585
0.51
2WJ5
96
0.593
0.498 0.494
0.683 0.601
0.714
2WUJ
100
0.625 0.352
0.629
2WW7 150
0.481 0.386 2WWE 111
0.585
0.652
0.625 0.514
2X1Q 240
2X25
0.654 0.578 0.572
168
2X3M 166
0.544 0.483 0.271
0.863
0.848 0.855
2X5Y 171
2X9Z
0.705 0.662
0.716
262
2XHF 310
0.805
0.695 0.656
0.779 0.613 0.622
2Y0T 101
170
2Y72
0.741
0.527
0.789 0.713 0.639
2Y7L
319

0.72

69

0.873
0.773
1.000
0.762
0.635
0.799
0.743
0.673
0.641
0.901
0.846
0.510
0.611
0.640
0.915
0.840
0.711
0.826
0.697
0.698
0.851
0.859
0.757
0.776
0.759
0.777
0.761
0.819
0.804
0.821
0.919
0.629
0.903
0.505
0.710
0.875
0.799
0.726
0.830
0.834
0.926
0.939

0.71
0.764
0.795
0.56
0.649
0.658
0.888 0.885
0.917
0.162
0.003 0.193
0.508 0.484 0.509
0.693
0.682 0.619
0.651 0.589
0.682
0.638 0.668
0.677
0.526
0.5
0.565
0.784 0.767
0.822
0.667
0.711
0.74
0.756
0.748 0.621
0.594
0.584 0.594
0.582 0.495 0.618
0.54
0.603
0.529
0.727
0.614 0.517
0.613 0.673
0.652
0.671 0.753
0.693
0.555
0.548 0.528
0.744 0.643 0.616
0.726 0.596
0.775
0.413
0.393 0.212
0.763 0.755 0.576
0.679 0.555
0.68
0.77
0.724 0.533
0.675 0.648 0.729
0.68
0.571
0.706 0.638 0.589
0.823 0.748 0.647
0.484
0.357
0.44
0.739
0.598 0.598
0.499 0.471 0.356
0.582 0.628
0.692
0.534
0.478 0.443
0.598 0.403
0.632
0.717 0.655
0.744
0.718
0.705 0.694
0.583 0.578 0.574
0.591 0.569
0.606
0.778
0.774 0.798
0.754 0.766
0.78
0.928
0.797 0.747

0.68

Table 7.4 (cont’d)

0.422

PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM
0.762 0.664
0.771
0.807 0.675
0.807
0.813 0.804 0.706
0.458
0.42
0.689 0.672 0.653
0.807
0.712 0.392
0.663 0.756
0.713
0.669 0.581
0.675
0.614
0.608 0.637
0.629 0.601
0.644
0.781 0.914
0.86
0.413 -0.218
0.649
0.669
0.669
0.6
0.205
0.119 0.193
0.661 0.641 0.567
0.614
0.583 0.517
0.706 0.645 0.659
0.689 0.719
0.696
0.653 0.677
0.702
0.721
0.617 0.597
0.583 0.506
0.627
0.706 0.734
0.727
0.734
0.698
0.63
0.645 0.655
0.649
0.686 0.637
0.698
0.767
0.774
0.81
0.77
0.723 0.589
0.642 0.683
0.697
0.531 0.642
0.537
0.758
0.744 0.717
0.597 0.568
0.625
0.79
0.876
0.745
0.541 0.468
0.543
0.718
0.667 0.568
0.709 0.678
0.709
0.945 0.922
0.95
0.904 0.866
0.922
0.577 0.605
0.622
0.745 0.568
0.78
0.587
0.442 0.503
0.688 0.669 0.495
0.452 0.419 0.286

0.559
0.53
0.605
0.785 0.727
0.803
0.835 0.691 0.771
0.799 0.651
0.814
0.456 0.458
0.571
0.567
0.55
0.54
0.614 0.539 0.475
0.433 0.411 0.336
0.909
0.904 0.689
0.624 0.621
0.661
0.82
0.845
0.684
0.481 0.297
0.634
0.532
0.516 0.466
0.45
0.485
0.6
0.549 0.488
0.559
0.707
0.661 0.547
0.489 0.296
0.502
0.687 0.642
0.706
0.427 0.577
0.431
0.824
0.792
0.74
0.803 0.811
0.812
0.64
0.606 0.632
0.583 0.533 0.276
0.476 0.435
0.525
0.655 0.556
0.701
0.532
0.44 -0.126
0.831
0.817 0.793
0.722 0.713 0.634
0.825 0.789
0.835
0.771
0.7
0.63
0.747
0.82
0.51
0.511 0.196
0.732
0.691
0.67
0.518
0.72
0.716 0.683
0.793 0.723 0.758
0.534
0.573
0.754 0.748 0.841
0.966
0.867
0.617 0.502 0.475
0.486
0.441 0.301
0.599 0.317
0.613
0.735
0.714 0.738

36
2NLS
194
2NR7
2NUH 104
2O6X 306
2OA2
132
2OCT 192
2OHW 256
2OKT 342
2OL9
6
3BA1
312
3BED 261
3BQX 139
3BZQ 99
3BZZ
100
3DRF 547
3DWV 325
3E5T 228
3E7R
40
3EUR 140
3F2Z
149
3F7E 254
3FCN 158
3FE7
91
3FKE 250
3FMY 66
3FOD 48
3FSO 221
3FTD 240
6
3FVA
3G1S
418
3GBW 161
3GHJ
116
3HFO 197
3HHP 1234
3HNY 156
3HP4
183
3HWU 144
7
3HYD
3HZ8
192
3I2V 124
3I2Z
138
3I4O 135

2Y9F
149
2YLB 400
2YNY 315
2ZCM 357
2ZU1
360
3A0M 148
3A7L
128
3AMC 614
3AUB 116
3B5O 230
12
3MD4
3MD5
12
3MEA 166
3MGN 348
3MRE 383
3N11
325
3NE0
208
3NGG 94
3NPV 495
3NVG
6
3NZL
73
3O0P 194
3O5P 128
3OBQ 150
3OQY 234
125
3P6J
3PD7
188
3PES
165
3PID 387
3PIW 154
3PKV 221
3PSM 94
3PTL 289
3PVE 347
357
3PZ9
12
3PZZ
3Q2X
6
131
3Q6L
3QDS
284
3QPA 197
3R6D 221
3R87
132

0.937
0.885
0.922
0.825
0.703
0.673
0.743
0.779
1.000
0.827
0.874
0.900
0.848
0.783
0.781
0.754
0.731
0.769
0.874
0.877
0.847
0.741
0.914
0.755
0.857
0.725
0.906
0.818
1.000
0.879
0.864
0.864
0.825
0.830
0.885
0.690
0.905
1.000
0.857
0.879
0.732
0.767

0.769
0.820
0.836
0.723
0.753
0.916
0.806
0.758
0.650
0.729
0.999
0.998
0.872
0.742
0.675
0.736
0.859
0.867
0.855
1.000
0.713
0.898
0.787
0.877
0.807
0.689
0.848
0.861
0.677
0.772
0.731
0.914
0.611
0.785
0.758
0.998
1.000
0.723
0.782
0.616
0.854
0.861

0.5

0.95

70

Table 7.4 (cont’d)

0.73

PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM
0.403 0.242
0.51
0.47
0.616 0.606
0.8
0.784 0.849
0.562 0.524 0.526
0.523
0.421 0.237
0.801 0.712 0.826
0.709 0.658 0.712
0.666
0.675
0.63
0.619
0.611 0.624
0.633 0.567
0.644
0.815 0.697
0.817
0.808
0.775 0.694
0.796 0.748 0.735
0.527 0.447
0.592
0.419
0.458
0.24
0.578
0.556 0.571
0.658 0.588
0.665
0.8
0.853
0.791
0.738 0.716
0.776
0.68
0.68
0.674
0.701 0.688
0.74
0.625 0.551
0.648
0.57
0.529 0.405
0.633
0.372 0.688
0.617 0.598 0.551
0.655 0.501
0.671
0.467
0.323 0.356
0.76
0.755 0.758
0.786 0.754 0.743
0.591
0.528
0.688 0.587 0.624
0.528 0.485 0.406
0.678 0.628
0.55
0.522 0.547
0.544
0.806 0.689
0.81
0.682
0.588 0.596
0.745 0.728 0.615
0.649
0.703
0.51
0.622 0.499
0.638
0.446
0.404 0.316
0.562 0.401
0.62
0.793
0.757 0.777

162
3RQ9
0.667 0.635 0.695
128
3RY0
0.586
0.565 0.409
3RZY 139
0.797 0.693
0.817
3S0A 119
0.535 0.301
0.586
3SD2
86
0.704 0.611
0.705
3SEB 238
0.784 0.775
0.68
3SED 124
0.611 0.475
0.647
150
3SO6
0.718
0.716 0.669
3SR3
637
0.827 0.647 0.659
3SUK 248
0.734
0.584
3SZH 697
0.658 0.614 0.589
3T0H 208
0.612 0.608 0.551
3T3K 122
0.584
0.554 0.338
3T47
141
0.73
0.728 0.628
3TDN 357
0.639 0.574 0.296
0.591
0.471
0.51
3TOW 152
3TUA 210
0.664
0.591 0.451
75
3TYS
0.635 0.632 0.526
4DT4
160
0.753 0.736 0.712
4EK3
287
0.589
0.526 0.495
4ERY 318
0.666
0.652 0.597
4ES1
95
0.698 0.586 0.553
4EUG 225
0.531
0.487 0.583
448
4F01
0.596 0.491
0.604
143
4F3J
0.837 0.812
0.84
4FR9
141
0.557 0.484
0.602
4G14
15
0.625
0.61
0.607
4G2E 151
0.833 0.741 0.753
4G5X 550
0.785 0.749 0.695
4G6C 658
0.737
0.725 0.649
0.651 0.516 0.632
4G7X 194
4GA2
144
0.404 0.392
0.43
4GMQ 92
0.562 0.391
0.59
4GS3
90
0.691
0.687 0.526
236
4H4J
0.46
0.524
0.448
4H89
168
0.709 0.728
0.746
4HDE 168
0.618
0.516 0.303
0.748
4HJP 281
0.746 0.759
0.724 0.717 0.717 4HWM 117
0.619
0.674
85
4IL7
4J11
0.551 0.536
357
0.68
0.605
4J5O 220

3I7M 134
3IHS
169
3IVV 149
3K6Y 227
3KBE 140
3KGK 190
3KZD 85
3L41
220
3LAA 169
3LAX 106
833
3LG3
3LJI
272
3M3P 249
178
3M8J
3M9J
210
3M9Q 176
3MAB 173
3U6G 248
77
3U97
3UCI
72
637
3UR8
148
3US6
3V1A
48
285
3V75
3VN0
193
3VOR 182
3VUB 101
3VVV 108
3VZ9
163
3W4Q 773
3ZBD 213
3ZIT 152
3ZRX 221
3ZSL
138
3ZZP
74
3ZZY 226
4A02
166
167
4ACJ
186
4AE7
4AM1
345
4ANN 176
4AVR 188

0.604
0.807
0.866
0.817
0.743
0.798
0.789
0.776
0.880
0.924
0.701
0.720
0.697
0.813
0.867
0.851
0.770
0.808
0.819
0.689
0.832
0.668
0.811
0.674
0.889
0.686
0.852
0.951
0.887
0.798
0.891
0.641
0.639
0.903
0.692
0.804
0.730
0.827
0.862
0.796
0.562
0.759

0.711
0.790
0.867
0.713
0.842
0.879
0.870
0.747
0.633
0.721
0.860
0.897
0.803
0.759
0.668
0.722
0.696
0.918
0.784
0.830
0.801
0.820
0.592
0.883
0.879
0.806
1.000
0.835
0.822
0.834
0.840
0.782
0.794
0.698
0.866
0.624
0.783
0.730
0.807
0.719
0.726
0.817

0.59

0.46
0.47
0.65

71

PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM
0.742 0.689
0.648 0.608
0.736 0.543
0.697 0.553
0.682 0.538
0.53
0.324
0.421 0.331
0.574 0.594

4AXY 54
4B6G 558
4B9G 292
4DD5
387
4DKN 423
4DND 95
109
4DPZ
4DQ7
328

4J5Q 146
4J78
305
4JG2
185
4JVU 207
4JYP 534
4KEF 133
5CYT 103
6RXN 45

0.7
0.765
0.844
0.615
0.781
0.763
0.73
0.69

0.623
0.72
0.756 0.669
0.816 0.763
0.596 0.351
0.761 0.539
0.75
0.582
0.726 0.651
0.683 0.376

0.973
0.804
0.855
0.850
0.786
0.829
0.837
0.776

Table 7.4 (cont’d)

0.851
0.729
0.889
0.800
0.800
0.704
0.548
0.583

0.742
0.658
0.746
0.723
0.688
0.58
0.441
0.614

As reported in Bramer et al [1], the Pearson correlation coeﬃcients for small, medium and

large proteins, as well as the average Pearson correlation coeﬃcient of the protein superset,

are provided in Table 7.5. In addition to MWCG, the average Pearson correlation coeﬃcients

for opFRI, pFRI, GNM, and NMA are also included for comparison. As determined by Park

et al, GNM is more accurate than NMA, as analyzed by Park el al [4]. Moreover, opFRI and

pfFRI are more accurate than GNM and the MWCG method presented in this work is on

average approximately 28% more accurate than pfFRI and 42% more accurate than GNM.

Table 7.6 provides the average Pearson correlation coeﬃcient obtained from MWCG

linear least square ﬁtting for Cα, non Cα carbon, nitrogen, oxygen, and sulfur atom based

B factor prediction. It is notable that these predictions were not available to earlier GNM

and FRI methods, thus no comparison can be provided for this result.

7.4 Machine Learning Results

7.4.1 MWCG

B factors of all carbon, nitrogen, and oxygen atoms present in a given protein were blindly

predicted using a leave-one-(protein)-out approach. Results for predicted Cα B factors are

72

also included for comparison between other methods. These results are predicted in the same

way as other heavy atoms. The machine learning B factor prediction models were trained

using the generated input feature and B factor data from a training data set as described

in Sections 5.6.1 and 5.6. After training the model is used to predict B factors for all heavy

atoms in a given protein using only its feature input data.

7.4.1.1 Eﬃciency comparison

It is important for any algorithmic approach to consider the computational eﬃciency of the

method. For B factor predictions this is a particularly important consideration for large

proteins. The running times of the GNM, RF, GBT, and CNN models used for testing the

MWCG method are provided in Table 7.7 . Figure 7.7 provides a log-log comparison of these

times. The protein set used to test the computational complexity were the same as those

used by Opron et all [3]. Because GNM only provides Cα B factor predictions, only B factors

of Cα atoms are predicted by the RF, GBT, and CNN models for this comparison. Because

GNM is computational prohibitive for large proteins, several proteins were excluded from

the test set for GNM predictions. All testing excludes the time it takes to load PDB ﬁles and

feature data. The RF, GBT, and CNN times exclude the training of the model which can

be used for the prediction of all proteins once they are trained. The results agree with the
theoretical complexity O(N 3) for GNM. This is due to the matrix diagonalization required
for GNM. In contrast the machine learning algorithms are close to O(N ), where N is the
number of atoms. The lines of best ﬁt for CPU time (t) are t ≈ (4× 10−8)∗ N 3.09 for GNM,
t ≈ (9 × 10−6) ∗ N 0.78 for RF, t ≈ (4 × 10−6) ∗ N 0.87 for GBT, and t ≈ (1.1 × 10−3) ∗ N 0.97

for CNN.

73

7.4.1.2 Machine learning performance

Table 7.8 provides the results for the blind prediction of all heavy atoms over the protein

dataset. Overall the convolutional neural network method performs best with average Pear-

son correlation coeﬃcient of 0.69. Both gradient boosted and random forest perform similarly

with Pearson correlation coeﬃcients of 0.63 and 0.59 respectively. Table 7.8 provides the

results of the average Pearson correlation coeﬃcients for Cα only B factor predictions, which

are obtained in the same manner as other heavy atoms. This allows a comparison between

previous methods. For comparison, the parameter-free ﬂexibility-rigidity index (pfFRI),

Gaussian network model (GNM) and normal mode analysis (NMA) are all included. The

predictions of these previous methods are all obtained via the least squares ﬁtting of each

protein.

B factor prediction results are also included in Tables 7.9, 7.10, and 7.11 for the small-

, medium-, and large-sized protein data subsets [4]. The results B factor predictions of

all proteins in the protein Superset are provided in Table 7.12. The averages over the data

subsets and superset is provided in Table 7.8. Over the diﬀerent subsets all methods provided

similar performance in terms of Pearson correlation coeﬃcient. The deep convolutional

neural network performed best on the protein Superset for both Cα only and all heavy atom

B factor predictions.

The blind cross protein B factor prediction obtained in this work is particularly notable

because it improves upon the best existing ﬁtting methods. Previous work by Opron et al

used the single protein linear least squares parameter-free FRI (pfFRI) method to obtain an

average Pearson correlation coeﬃcient of 0.63 averaged over the superset [3]. GNM performs

worse with an overall Pearson correlation coeﬃcient of 0.57 averaged over the superset [3].

74

Cross protein blind prediction is a much more diﬃcult task than linear ﬁtting. Table 7.12

shows that none fo the machine learning methods out perform one another over the entire

data set. Averaged over the superset, the Pearson correlation coeﬃcient for the all heavy

atom B factor prediction of the convolutional neural network out performed the boosted

gradient and random forest by 10% and 17% respectively.

Table 7.12: Pearson correlation coeﬃcients for cross protein heavy atom blind B factor
prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional
neural network (CNN) for the Superset. Results reported use heavy atoms in both training
and prediction. MWCG machine learning results originally published in Bramer et al [2].

PDB ID N
1ABA
1AHO
1AIE
1AKG
1ATG
1BGF
1BX7
1BYI
1CCR
1CYO
1DF4
1E5K
1ES5
1ETL
1ETM
1ETN
1EW4
1F8R
1FF4
1FK5
1GCO
1GK7
1GVD
1GXU
1H6V
1HJE
1I71
1IDP
1IFR

728
482
235
108
1689
1018
345
1540
837
697
463
1423
1912
76
80
77
863
15291
477
626
7888
321
401
694
22514
73
683
3661
878

0.77
0.71
0.53
0.51
0.66
0.63
0.67
0.63
0.67
0.68
0.75
0.73
0.68
0.03
0.13
0.25
0.71
0.64
0.59
0.71
0.61
0.73
0.69
0.67
0.40
0.46
0.62
0.74
0.74

RF GBT CNN PDB ID N
0.74
0.62
0.62
0.41
0.61
0.58
0.55
0.59
0.70
0.66
0.79
0.70
0.63
0.27
0.46
0.33
0.70
0.64
0.55
0.62
0.64
0.53
0.66
0.65
0.39
-0.07
0.57
0.69
0.72

2X5Y
2X9Z
2XHF
2Y0T
2Y72
2Y7L
2Y9F
2YLB
2YNY
2ZCM
2ZU1
3A0M
3A7L
3AMC
3AUB
3B5O
3BA1
3BED
3BQX
3BZQ
3BZZ
3DRF
3DWV
3E5T
3E7R
3EUR
3F2Z
3F7E
3FCN

0.73
0.76
0.60
0.70
0.63
0.63
0.63
0.59
0.66
0.76
0.64
0.74
0.66
0.48
0.48
0.20
0.61
0.83
0.76
0.63
0.71
0.72
0.71
0.66
0.58
0.37
0.66
0.83
0.73

RF GBT CNN
0.72
0.75
0.71
0.76
0.70
0.65
0.73
0.59
0.73
0.75
0.62
0.81
0.64
0.72
0.63
0.60
0.67
0.68
0.44
0.41
0.17
0.59
0.65
0.74
0.81
0.66
0.62
0.72
0.74
0.63
0.53
0.65
0.44
0.65
0.70
0.73
0.52
0.85
0.43
0.60
0.77
0.45
0.81
0.67
0.60
0.87
0.75
0.71
0.81
0.60
0.47
0.82
0.88
0.78
0.69
0.61
0.68
0.73

0.79
0.72
0.71
0.75
0.80
0.82
0.77
0.69
0.71
0.45
0.73
0.47
0.75
0.75
0.62
0.55
0.64
0.73
0.59
0.61
0.45
0.66
0.67
0.72
0.60
0.50
0.78
0.67
0.71

1352
1956
2432
757
1171
2398
1212
3065
2364
2959
2794
823
963
5174
782
1510
2391
1570
1028
742
773
4101
2363
1543
295
1059
1160
1912
1039

75

Table 7.12 (cont’d)

PDB ID N
1K8U
1KMM
1KNG
1KR4
1KYC
1LR7
1MF7
1N7E
1NKD
1NLS
1NNX
1NOA
1NOT
1O06
1O08
1OPD
1P9I
1PEF
1PEN
1PMY
1PZ4
1Q9B
1QAU
1QKI
1QTO
1R29
1R7J
1RJU
1RRO
1SAU
1TGR
1TZV
1U06
1U7I
1U9C
1UHA
1UKU
1ULR
1UOY
1USE
1USM
1UTG

686
11632
1016
906
138
522
1551
700
426
1746
674
778
96
142
1722
642
203
153
109
937
874
303
812
31154
934
971
729
257
846
830
749
1051
432
1988
1712
623
873
677
452
290
631
548

0.68
0.70
0.56
0.76
0.30
0.70
0.68
0.65
0.59
0.64
0.73
0.57
0.81
0.64
0.58
0.60
0.77
0.64
0.24
0.65
0.73
0.67
0.58
0.27
0.55
0.73
0.70
0.75
0.52
0.68
0.65
0.77
0.68
0.75
0.64
0.80
0.75
0.71
0.56
0.50
0.78
0.55

RF GBT CNN PDB ID N
0.65
0.65
0.61
0.73
0.43
0.53
0.68
0.62
0.56
0.61
0.69
0.52
-0.18
0.51
0.51
0.55
0.73
0.60
0.34
0.64
0.73
0.41
0.57
0.44
0.61
0.61
0.71
0.71
0.56
0.62
0.61
0.75
0.55
0.73
0.61
0.74
0.74
0.69
0.55
0.25
0.59
0.58

3FE7
3FKE
3FMY
3FOD
3FSO
3FTD
3G1S
3GBW
3GHJ
3HFO
3HHP
3HNY
3HP4
3HWU
3HYD
3HZ8
3I2V
3I2Z
3I4O
3I7M
3IHS
3IVV
3K6Y
3KBE
3KGK
3KZD
3L41
3LAA
3LAX
3LG3
3LJI
3M3P
3M8J
3M9J
3M9Q
3MAB
3MD4
3MEA
3MGN
3MRE
3N11
3NE0

0.74
0.87
0.55
0.72
0.32
0.71
0.70
0.71
0.63
0.56
0.53
0.57
0.63
0.65
0.55
0.62
0.77
0.76
0.21
0.67
0.74
0.75
0.57
0.84
0.63
0.72
0.65
0.73
0.54
0.60
0.67
0.75
0.61
0.77
0.58
0.75
0.70
0.68
0.55
0.68
0.67
0.62

RF GBT CNN
0.83
0.62
0.57
0.76
0.84
0.73
0.78
0.30
0.85
0.71
0.75
0.69
0.72
0.74
0.68
0.75
0.66
0.44
0.70
0.65
0.62
0.71
0.58
0.73
0.61
0.65
0.51
0.51
0.60
-0.05
0.51
0.76
0.81
0.50
0.75
0.63
0.87
0.66
0.56
0.87
0.81
0.66
0.85
0.72
0.62
0.90
0.86
0.75
0.87
0.75
0.64
0.74
0.88
0.73
0.89
0.54
0.89
0.69
0.57
0.91
0.50
0.46
0.68
0.57
0.78
0.68
0.50
0.66
0.48
0.52
0.59
0.63
0.36
0.79
0.93
0.58
0.82
0.15
0.57
0.84
0.85
0.52
0.68
0.85

0.71
0.56
0.75
0.45
0.73
0.75
0.76
0.76
0.71
0.72
0.74
0.73
0.63
0.69
0.28
0.54
0.54
0.64
0.64
0.60
0.65
0.81
0.65
0.76
0.78
0.70
0.76
0.46
0.71
0.59
0.54
0.62
0.77
0.74
0.53
0.65
0.61
0.64
0.03
0.56
0.57
0.69

710
1938
470
328
197
1795
3196
1275
808
1432
8495
1351
1322
934
52
1459
929
1039
969
928
1120
1097
1617
829
1492
605
1735
1112
753
6061
1946
1858
1396
1329
1359
1311
81
1236
2236
2598
2501
1551

76

PDB ID N
17
1V05
1V70
784
66
1VRZ
746
1W2L
1542
1WBE
1WHI
937
2430
1WLY
906
1WPA
1X3O
622
124
1XY1
62
1XY2
669
1Y6X
1YJO
55
361
1YZM
771
1Z21
1ZCE
1100
551
1ZVA
3493
2A50
1867
2AGK
2AH1
7215
1454
2B0A
3002
2BCM
2BF9
287
735
2BRF
1446
2C71
2CE0
714
536
2CG7
4366
2COV
1624
2CWS
2D5W
9772
1933
2DKO
4454
2DPL
2DSX
386
3416
2E10
589
2E3H
705
2EAQ
2EHP
1875
590
2EHS
385
2ERW
2ETX
3018
766
2FB6
2FG1
1021

Table 7.12 (cont’d)

RF GBT CNN PDB ID N
-0.20
0.70
0.38
0.62
0.59
0.74
0.65
0.64
0.53
0.58
0.16
0.44
0.36
0.51
0.63
0.77
0.59
0.64
0.61
0.65
0.66
0.51
0.39
0.76
0.59
0.62
0.47
0.76
0.63
0.71
0.71
0.49
0.36
0.50
0.70
0.63
0.75
0.55
0.47
0.56
0.63
0.55

3NGG
3NPV
3NVG
3NZL
3O0P
3O5P
3OBQ
3OQY
3P6J
3PD7
3PES
3PID
3PIW
3PKV
3PSM
3PTL
3PVE
3PZ9
3PZZ
3Q2X
3Q6L
3QDS
3QPA
3R6D
3R87
3RQ9
3RY0
3RZY
3S0A
3SD2
3SEB
3SED
3SO6
3SR3
3SUK
3SZH
3T0H
3T3K
3T47
3TDN
3TOW
3TUA

0.02
0.67
-0.17
0.68
0.61
0.77
0.71
0.66
0.52
0.19
0.27
0.53
0.12
0.60
0.66
0.81
0.56
0.48
0.68
0.57
0.68
0.62
0.52
0.78
0.61
0.65
0.54
0.83
0.60
0.75
0.72
0.53
0.44
0.64
0.73
0.61
0.74
0.71
0.50
0.61
0.65
0.65

RF GBT CNN
0.83
0.63
0.70
0.84
0.88
-0.08
0.63
0.59
0.63
0.55
0.53
0.70
0.84
0.61
0.76
0.57
0.57
0.88
0.85
0.70
0.84
0.72
0.86
0.49
0.72
0.87
0.81
0.66
0.80
0.62
0.61
0.72
0.46
0.56
0.60
0.63
0.85
0.47
0.29
0.76
0.75
0.71
0.71
0.71
0.43
0.71
0.59
0.31
0.53
0.39
0.32
0.66
0.53
0.66
0.64
0.69
0.61
0.55
0.38
0.71
0.57
0.61
0.72
0.70
0.69
0.01
0.45
0.69
0.59
0.62
0.44
0.74
0.78
0.65
0.62
0.56
0.74
0.54
0.55
0.58
0.66
0.53
0.63
0.70

0.75
0.75
0.08
0.65
0.65
0.63
0.61
0.62
0.70
0.72
0.73
0.56
0.75
0.68
0.68
0.62
0.61
0.76
0.25
0.59
0.67
0.72
0.44
0.69
0.51
0.47
0.65
0.69
0.61
0.52
0.71
0.71
0.75
0.69
0.65
0.80
0.81
0.68
0.62
0.55
0.66
0.66

702
3655
50
567
1452
819
1195
1772
857
1354
1240
3078
1223
1688
729
2101
2656
2913
76
43
1022
2234
1348
1550
1007
1174
964
985
884
527
1948
933
1119
4891
1761
5074
1627
922
1116
2703
1193
1510

0.60
0.62
0.09
0.69
0.63
0.71
0.68
0.74
0.63
0.47
0.55
0.46
0.02
0.56
0.63
0.73
0.58
0.68
0.44
0.67
0.72
0.85
0.70
0.86
0.83
0.90
0.79
0.78
0.78
0.75
0.72
0.73
0.56
0.61
0.38
0.58
0.74
0.38
0.32
0.58
0.52
0.68

77

Table 7.12 (cont’d)

PDB ID N
2FN9
2FQ3
2G69
2G7O
2G7S
2GKG
2GOM
2GXG
2GZQ
2HQK
2HYK
2I24
2I49
2IBL
2IGD
2IMF
2IP6
2IVY
2J32
2J9W
2JKU
2JLI
2JLJ
2MCM
2NLS
2NR7
2NUH
2O6X
2OA2
2OHW
2OKT
2OL9
2PKT
2PLT
2PMR
2POF
2PPN
2PSF
2PTH
2Q4N
2Q52
2QJL

4362
721
744
537
1258
706
987
1132
1402
1582
1832
872
3109
815
431
1564
702
727
1935
1626
229
708
889
735
269
1556
806
2415
970
2074
2587
51
666
719
590
3418
701
4983
1437
9496
26784
734

0.60
0.75
0.61
0.63
0.60
0.60
0.70
0.75
0.60
0.76
0.65
0.52
0.77
0.53
0.68
0.62
0.67
0.59
0.78
0.68
0.63
0.54
0.70
0.73
0.49
0.70
0.72
0.82
0.53
0.62
0.59
0.51
0.17
0.67
0.66
0.66
0.68
0.55
0.72
0.39
0.62
0.60

RF GBT CNN PDB ID N
0.37
0.67
0.60
0.52
0.60
0.63
0.61
0.67
0.59
0.76
0.60
0.52
0.78
0.46
0.58
0.62
0.62
0.47
0.79
0.66
0.57
0.58
0.66
0.71
0.45
0.71
0.64
0.76
0.54
0.55
0.56
0.65
0.06
0.62
0.63
0.58
0.50
0.54
0.68
0.45
0.63
0.61

3TYS
3U6G
3U97
3UCI
3UR8
3US6
3V1A
3V75
3VN0
3VOR
3VUB
3VVV
3VZ9
3W4Q
3ZBD
3ZIT
3ZRX
3ZSL
3ZZP
3ZZY
4A02
4ACJ
4AE7
4AM1
4ANN
4AVR
4AXY
4B6G
4B9G
4DD5
4DKN
4DND
4DPZ
4DQ7
4DT4
4EK3
4ERY
4ES1
4EUG
4F01
4F3J
4FR9

0.61
0.76
0.87
0.89
0.81
0.70
0.92
0.86
0.90
0.90
0.85
0.91
0.90
0.88
0.82
0.47
0.64
0.62
0.70
0.73
0.35
0.73
0.68
0.60
0.70
0.66
0.19
0.63
0.92
0.81
0.89
0.84
0.76
0.70
0.63
0.85
0.83
0.79
0.79
0.85
0.77
0.42

RF GBT CNN
0.71
0.67
0.52
0.60
0.27
0.57
0.56
0.44
0.83
0.63
0.62
0.01
0.76
0.36
0.83
0.63
0.69
0.76
0.81
0.41
0.78
0.64
0.84
0.62
0.70
0.66
0.65
0.66
0.78
0.54
0.51
0.71
0.60
0.38
0.69
0.61
0.56
0.40
0.64
0.69
0.75
0.62
0.75
0.64
0.64
0.61
0.56
0.64
0.72
0.53
0.62
0.64
0.75
0.45
0.84
0.78
0.83
0.79
0.63
0.87
0.88
0.76
0.85
0.66
0.65
0.83
0.78
0.58
0.73
0.71
0.73
0.70
0.70
0.83
0.81
0.63
0.79
0.59
0.55
0.77
0.53
0.58
0.61
0.62

0.68
0.51
0.66
0.51
0.66
0.64
0.36
0.65
0.76
0.50
0.70
0.69
0.72
0.73
0.54
0.54
0.67
0.64
0.46
0.69
0.65
0.67
0.74
0.67
0.60
0.61
0.64
0.76
0.81
0.66
0.77
0.73
0.66
0.69
0.73
0.72
0.74
0.64
0.66
0.54
0.62
0.64

556
1658
524
536
5033
1156
319
1974
1469
1077
787
869
1366
5406
1718
1192
1654
925
585
1741
1281
1210
1458
2605
1180
1437
317
4504
2226
2618
3356
755
865
2526
1163
2147
2357
737
1789
3374
1116
956

78

PDB ID N
2R16
2R6Q
2RB8
2RE2
2RFR
2V9V
2VE8
2VH7
2VIM
2VPA
2VQ4
2VY8
2VYO
2W1V
2W2A
2W6A
2WJ5
2WUJ
2WW7
2WWE
2X1Q
2X25
2X3M

1262
903
723
1559
1019
986
3967
749
781
1524
800
1058
1589
4223
2918
826
630
828
915
54
1852
1289
1267

Table 7.12 (cont’d)

RF GBT CNN PDB ID N
39
0.52
1178
0.59
4002
0.61
4814
0.66
1315
0.54
873
0.64
678
0.65
737
0.74
0.62
1470
1127
0.63
1288
0.72
2112
0.71
0.53
799
527
0.68
2658
0.56
0.66
1406
1062
0.49
2443
0.55
1294
0.35
1615
0.23
4063
0.58
1002
0.65
0.66
800
345

4G14
4G2E
4G5X
4G6C
4G7X
4GA2
4GMQ
4GS3
4H4J
4H89
4HDE
4HJP
4HWM
4IL7
4J11
4J5O
4J5Q
4J78
4JG2
4JVU
4JYP
4KEF
5CYT
6RXN

0.53
0.53
0.64
0.66
0.58
0.61
0.59
0.70
0.61
0.68
0.76
0.74
0.65
0.72
0.62
0.76
0.53
0.55
0.43
0.22
0.53
0.68
0.70

0.50
0.57
0.42
0.54
0.66
0.63
0.66
0.82
0.75
0.61
0.78
0.63
0.61
0.72
0.63
0.69
0.77
0.55
0.61
0.12
0.77
0.80
0.75

RF GBT CNN
0.55
0.28
0.73
0.76
0.65
0.74
0.61
0.47
0.80
0.49
0.51
0.55
0.54
0.56
0.56
0.56
0.69
0.70
0.62
0.55
0.70
0.73
0.76
0.65
0.50
0.81
0.74
0.35
0.94
0.47
0.64
0.91
0.87
0.73
0.86
0.71
0.88
0.70
0.69
0.89
0.93
0.70
0.68
0.65
0.68
0.74
0.82
0.56

0.50
0.73
0.75
0.60
0.56
0.55
0.72
0.60
0.80
0.61
0.79
0.70
0.57
0.43
0.58
0.63
0.75
0.75
0.73
0.68
0.78
0.62
0.70
0.71

Several proteins have low Pearson correlation coeﬃcients indicating a poor model pre-

diction. In these cases we see that if one model performs poorly the other models generally

perform satisfactorily. Taking the consensus of the maximum correlation coeﬃcient for each

protein among the three machine learning methods results in an average all heavy atom

correlation coeﬃcient of 0.73 and an average Cα only correlation coeﬃcient of 0.72. This

result is similar to that of the parameter-optimized FRI (opFRI) reported in earlier work by

Opron et al [3].

79

7.4.1.3 Relative feature importance

Ensemble methods provide relative feature importance of the features used in the resulting

models. This is an important tool to help understand which features are most signiﬁcant in

a model. Figure 7.8 shows the individual feature importance for the random forest averaged

over the protein superset.

Since several of the features are related, Figure 7.9 provides a plot of the aggregated

feature importance. The feature importance of the individual angle, secondary, MWCG,

atom type, protein size, amino acid, and packing density features are all summed together

to illustrate the overall eﬀect of each feature type.

Figure 7.8 shows the most important MWCG feature is the carbon-carbon interaction.

This MWCG feature uses a Lorentz radial basis function as with η = 16 and ν = 3 as

detailed in Section 5.3. The remaining eight MWCG features all rank similarly with the

carbon-oxygen interaction ranked as the second most signiﬁcant MWCG feature. This result

validates that the model beneﬁts from the multi-scale property of the MWCG feature, which

uses three diﬀerent kernels to capture interactions at various length scales. Since all MWCG

have signiﬁcance in the feature ranking it follows that the element speciﬁc property of the

MWCG method is also a meaningful model feature.

Figure 7.8 shows that that the individual MWCG, amino acid type, and packing density

feature have low relative importance, however, considering their aggregate importance as

seen in Figure 7.9, we see that they contribute to the model. Figure 7.9 shows that the

medium density protein packing density feature was twice as important to the model as

the short and long density features. The medium packing density may be capturing semi-

local side chain interactions which are important in protein ﬂexibility. The short packing

80

density likely captures only adjacent backbone information while the long packing density

is only adding weak atomic interaction information to the model. Protein resolution is the

most signiﬁcant relative feature followed by MWCG features and the STRIDE generated

residue solvent accessible area feature. This also highlights the importance of the quality of

X-ray crystal structures and diﬃculty in cross-protein B factor prediction. Protein angles,

secondary structures, and size play a less signiﬁcant role in the model compared to the other

features. Atom type has the lowest signiﬁcance relative to the other features implemented

in the model. Not surprisingly, we see that global features such as resolution and R-value

are important components in the ensemble model. The global feature of protein size has a

small role in the model.

Care must be taken to use feature ranking to understand feature importance. The feature

ranking provided by these models is a relative ordering of features that the models ﬁnd most

important. So features with high correlation may be redundant giving one of them a lower

rank even though they may have signiﬁcant prediction power. For example, R-value highly

correlates with resolution so it is likely a meaningful feature. However, the use of resolution

reduces the relative importance ranking of R-value in the model.

7.5 ASPH & ESPH B Factor Prediction

7.5.1 Least Squares Fitting

The Pearson correlation coeﬃcients using least squares ﬁtting for Cα B factor prediction

of small, medium, and large protein subsets are provided in tables 7.17, 7.18, and 7.19

respectively. Results for the all proteins in the dataset are provided in table 7.21. The

average Pearson correlation coeﬃcients for small, medium, large, and superset data sets

81

is provided in table 7.20. Table 7.20 includes ﬁtting results using only Bottleneck, only

Wasserstein, and using both Bottleneck and Wasserstein metrics. Results using only an

exponential kernel, only a lorentz kernel, or both an exponential and lorentz kernel for ﬁtting

are also included. All results reported here PH features generated with a cutoﬀ of 11˚A and

include three pairwise interactions (carbon-carbon, carbon-nitrogen, carbon-oxygen).

7.5.2 Machine Learning

ASPH and ESPH Pearson correlation coeﬃcients using boosted gradient (GBT), convolu-

tional neural network (CNN), and consensus method (CON) for Cα B factor prediction of

small, medium, and large protein subsets is provided in tables 7.14, 7.15, and 7.16 respec-

tively. Parameters for GBT and CNN methods can be found in Tables 5.7 and 5.8. The

global and local features used for training and testing are provided in chapter 5. Results for

all proteins are provided in table 7.22. The average Pearson correlation coeﬃcients for small,

medium, large, and superset data sets is provided in table 7.13. All results reported here

use a cutoﬀ of 11˚A and include three pairwise interactions (carbon-carbon, carbon-nitrogen,

carbon-oxygen). Kernel parameters for both exponential and lorentz kernels are provided

in Table 5.4. Results from previously existing Cα B factor prediction methods are included

for comparison in Table 7.13. Overall both GBT and CNN algorithms perform similarly.

As expected, the CNN method out performs the GBT with average correlation coeﬃcients

over the superset of 0.60 and 0.59 respectively. The consensus method improves upon both

results with an average Pearson correlation coeﬃcient of 0.61 over the superset. Table 7.13

shows that the blind prediction machine learning models perform better than ﬁtting models

GNM and NMA and similar to the pFRI ﬁtting model.

82

Table 7.21: Pearson correlation coeﬃcients of persistent homology based least squares ﬁtting
Cα B factor prediction of all proteins using 11˚A cutoﬀ. Two Bottleneck (B) and Wasserstein
(W) metrics using various kernel choices are included.

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.70
0.75
0.96
0.87
0.51
0.75
0.82
0.54
0.63
0.67
0.94
0.69
0.60
1.00
1.00
1.00
0.62
0.65
0.76
0.55
0.65
0.94
0.66
0.77
0.30
1.00
0.59
0.48
0.62
0.75
0.57
0.53
0.54
1.00
0.58
0.59
0.73
0.75

1ABA
87
1AHO
66
1AIE
31
1AKG
16
1ATG
231
1BGF
124
1BX7
51
1BYI
238
1CCR
109
1CYO
88
1DF4
57
1E5K
188
1ES5
260
1ETL
12
1ETM
12
1ETN
12
1EW4
106
1F8R
1932
1FF4
65
1FK5
93
1GCO
1044
1GK7
39
1GVD
56
1GXU
89
1H6V
2927
1HJE
13
1I71
83
1IDP
441
1IFR
113
1K8U
87
1KMM 1499
144
1KNG
1KR4
107
15
1KYC
73
1LR7
194
1MF7
1N7E
95
59
1NKD

0.56
0.53
0.90
0.53
0.38
0.68
0.81
0.44
0.43
0.65
0.88
0.63
0.44
0.95
0.70
0.70
0.55
0.50
0.68
0.49
0.53
0.88
0.61
0.69
0.23
0.67
0.38
0.39
0.47
0.65
0.36
0.43
0.45
0.88
0.46
0.50
0.54
0.55

0.68
0.79
0.90
0.72
0.53
0.75
0.82
0.49
0.65
0.68
0.95
0.68
0.58
1.00
0.86
0.99
0.55
0.63
0.75
0.58
0.63
0.95
0.69
0.75
0.30
1.00
0.56
0.47
0.65
0.71
0.54
0.51
0.53
0.99
0.63
0.59
0.72
0.63

0.67
0.75
0.97
0.82
0.50
0.75
0.86
0.50
0.65
0.71
0.93
0.67
0.58
1.00
1.00
1.00
0.58
0.61
0.77
0.53
0.63
0.95
0.75
0.75
0.29
1.00
0.44
0.48
0.65
0.72
0.57
0.52
0.57
0.96
0.61
0.56
0.67
0.73

0.67
0.78
0.88
0.66
0.50
0.70
0.74
0.51
0.66
0.69
0.92
0.68
0.57
1.00
1.00
1.00
0.60
0.63
0.72
0.59
0.64
0.94
0.68
0.78
0.31
1.00
0.66
0.47
0.59
0.74
0.54
0.51
0.48
0.99
0.62
0.59
0.71
0.69

0.63
0.65
0.77
0.56
0.48
0.61
0.69
0.48
0.58
0.59
0.91
0.67
0.56
0.98
0.83
0.92
0.55
0.62
0.68
0.50
0.63
0.92
0.62
0.72
0.29
0.57
0.58
0.46
0.53
0.67
0.53
0.50
0.45
0.88
0.56
0.58
0.63
0.65

0.62
0.73
0.64
0.53
0.47
0.54
0.68
0.46
0.56
0.58
0.89
0.67
0.55
0.87
0.74
0.92
0.51
0.62
0.65
0.50
0.63
0.93
0.63
0.61
0.29
0.79
0.46
0.45
0.54
0.64
0.53
0.47
0.47
0.93
0.55
0.57
0.68
0.58

0.76
0.88
0.99
1.00
0.61
0.82
0.89
0.58
0.71
0.78
0.97
0.74
0.65
1.00
1.00
1.00
0.73
0.70
0.80
0.71
0.66
0.98
0.84
0.82
0.33
1.00
0.76
0.55
0.73
0.85
0.59
0.61
0.60
1.00
0.71
0.67
0.80
0.89

0.54
0.72
0.78
0.60
0.45
0.64
0.79
0.41
0.53
0.66
0.92
0.66
0.51
0.68
0.45
0.96
0.52
0.59
0.70
0.49
0.59
0.91
0.67
0.72
0.28
0.72
0.41
0.43
0.56
0.67
0.49
0.43
0.39
0.92
0.57
0.55
0.54
0.56

83

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.82
0.86
0.59
1.00
0.98
0.48
1.00
1.00
0.36
0.89
0.96
0.94
0.71
0.88
0.71
0.58
0.40
0.56
0.72
0.86
0.72
0.45
0.76
0.75
0.74
0.55
0.71
0.67
0.73
0.80
0.61
0.73
0.64
0.65
0.56
0.65
0.62
1.00
0.69
0.48
0.55

1NLS
1NNX
1NOA
1NOT
1O06
1O08
1OB4
1OB7
1OPD
1P9I
1PEF
1PEN
1PMY
1PZ4
1Q9B
1QAU
1QKI
1QTO
1R29
1R7J
1RJU
1RRO
1SAU
1TGR
1TZV
1U06
1U7I
1U9C
1UHA
1UKU
1ULR
1UOY
1USE
1USM
1UTG
1V05
1V70
1VRZ
1W2L
1WBE
1WHI

238
93
113
13
22
221
5
5
85
29
18
16
123
113
44
112
3912
122
122
90
36
108
123
111
157
55
259
220
82
102
87
64
47
77
70
96
105
13
97
206
122

0.83
0.83
0.63
1.00
0.97
0.50
1.00
1.00
0.36
0.92
0.96
0.83
0.67
0.89
0.69
0.58
0.41
0.53
0.69
0.87
0.81
0.45
0.75
0.74
0.77
0.52
0.70
0.64
0.74
0.80
0.59
0.69
0.72
0.66
0.60
0.65
0.66
1.00
0.69
0.55
0.57

0.86
0.88
0.72
1.00
1.00
0.56
1.00
1.00
0.57
0.98
1.00
1.00
0.76
0.93
0.94
0.66
0.45
0.65
0.76
0.91
0.91
0.56
0.81
0.83
0.83
0.72
0.73
0.74
0.82
0.84
0.68
0.83
0.91
0.81
0.68
0.72
0.75
1.00
0.79
0.63
0.63

0.75
0.81
0.60
0.82
0.96
0.44
1.00
1.00
0.25
0.87
0.88
0.60
0.62
0.86
0.58
0.57
0.34
0.48
0.55
0.83
0.75
0.31
0.70
0.72
0.73
0.37
0.62
0.61
0.69
0.78
0.49
0.65
0.50
0.57
0.51
0.60
0.56
0.92
0.60
0.43
0.42

84

0.81
0.84
0.63
1.00
0.98
0.46
1.00
1.00
0.35
0.89
0.96
0.96
0.71
0.88
0.79
0.59
0.38
0.59
0.71
0.88
0.81
0.39
0.76
0.77
0.76
0.50
0.71
0.66
0.70
0.80
0.56
0.73
0.66
0.62
0.57
0.67
0.64
1.00
0.72
0.53
0.57

0.78
0.84
0.65
1.00
0.97
0.48
1.00
1.00
0.29
0.88
0.97
0.90
0.70
0.82
0.76
0.61
0.42
0.59
0.56
0.86
0.74
0.35
0.75
0.76
0.78
0.52
0.71
0.65
0.75
0.81
0.53
0.72
0.75
0.61
0.53
0.66
0.65
1.00
0.72
0.47
0.55

0.65
0.79
0.57
0.86
0.92
0.42
1.00
1.00
0.21
0.82
0.94
0.67
0.59
0.74
0.59
0.55
0.38
0.46
0.35
0.76
0.69
0.23
0.73
0.70
0.71
0.36
0.68
0.57
0.68
0.80
0.50
0.66
0.52
0.53
0.49
0.61
0.60
0.92
0.63
0.38
0.44

0.80
0.81
0.53
0.86
0.97
0.37
1.00
1.00
0.29
0.87
0.92
0.47
0.68
0.85
0.69
0.55
0.32
0.55
0.69
0.81
0.62
0.33
0.68
0.74
0.69
0.46
0.53
0.61
0.67
0.74
0.44
0.65
0.46
0.61
0.49
0.52
0.51
0.77
0.56
0.36
0.34

0.72
0.81
0.57
0.81
0.94
0.45
1.00
1.00
0.19
0.84
0.94
0.73
0.69
0.76
0.57
0.57
0.38
0.52
0.43
0.79
0.65
0.19
0.74
0.73
0.70
0.39
0.67
0.60
0.69
0.80
0.50
0.69
0.53
0.58
0.49
0.61
0.58
0.85
0.61
0.42
0.43

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.64
0.70
0.67
0.99
1.00
0.56
1.00
0.90
0.72
0.82
0.86
0.67
0.67
0.48
0.63
0.45
0.92
0.75
0.48
0.79
0.41
0.67
0.55
0.53
0.72
0.37
0.55
0.62
0.78
0.82
0.64
0.73
0.49
0.54
0.74
0.57
0.55
0.78
0.60
0.87
0.63

322
1WLY
107
1WPA
80
1X3O
16
1XY1
8
1XY2
86
1Y6X
6
1YJO
46
1YZM
96
1Z21
139
1ZCE
75
1ZVA
469
2A50
233
2AGK
939
2AH1
191
2B0A
415
2BCM
35
2BF9
103
2BRF
225
2C71
109
2CE0
110
2CG7
534
2COV
2CWS
235
2D5W 1214
2DKO
253
565
2DPL
52
2DSX
439
2E10
2E3H
81
89
2EAQ
246
2EHP
2EHS
75
53
2ERW
390
2ETX
129
2FB6
2FG1
176
560
2FN9
85
2FQ3
2G69
99
68
2G7O
2G7S
206

0.67
0.79
0.72
1.00
1.00
0.62
1.00
0.95
0.82
0.88
0.94
0.70
0.69
0.49
0.69
0.50
0.97
0.76
0.56
0.86
0.63
0.70
0.66
0.54
0.79
0.41
0.78
0.65
0.82
0.86
0.71
0.81
0.84
0.57
0.76
0.62
0.62
0.82
0.76
0.95
0.66

0.64
0.69
0.66
0.96
1.00
0.53
1.00
0.90
0.73
0.83
0.85
0.63
0.65
0.47
0.60
0.41
0.73
0.73
0.38
0.79
0.44
0.64
0.55
0.52
0.72
0.36
0.50
0.59
0.71
0.77
0.65
0.73
0.41
0.54
0.66
0.56
0.49
0.76
0.65
0.91
0.60

0.54
0.66
0.62
0.81
0.91
0.50
1.00
0.86
0.64
0.81
0.83
0.41
0.55
0.33
0.48
0.35
0.89
0.72
0.23
0.71
0.30
0.57
0.40
0.41
0.68
0.24
0.41
0.43
0.56
0.77
0.52
0.69
0.31
0.47
0.65
0.52
0.41
0.68
0.47
0.76
0.54

0.62
0.56
0.64
0.89
0.91
0.52
1.00
0.84
0.69
0.78
0.81
0.58
0.63
0.46
0.59
0.39
0.71
0.72
0.30
0.77
0.33
0.64
0.52
0.52
0.69
0.33
0.36
0.57
0.69
0.76
0.62
0.71
0.28
0.51
0.63
0.54
0.46
0.75
0.45
0.82
0.59

0.62
0.70
0.66
0.97
1.00
0.56
1.00
0.87
0.70
0.84
0.85
0.64
0.65
0.45
0.59
0.46
0.94
0.74
0.45
0.77
0.32
0.66
0.59
0.52
0.75
0.35
0.54
0.60
0.66
0.81
0.63
0.75
0.62
0.54
0.71
0.55
0.51
0.78
0.59
0.89
0.63

0.63
0.71
0.65
0.87
1.00
0.59
1.00
0.88
0.64
0.85
0.92
0.67
0.65
0.46
0.62
0.40
0.78
0.74
0.42
0.80
0.36
0.67
0.54
0.52
0.75
0.35
0.56
0.61
0.76
0.81
0.65
0.74
0.60
0.56
0.69
0.58
0.55
0.79
0.66
0.88
0.63

0.62
0.52
0.60
0.66
0.95
0.49
1.00
0.72
0.63
0.77
0.78
0.60
0.64
0.45
0.58
0.39
0.65
0.71
0.33
0.73
0.31
0.64
0.52
0.52
0.69
0.32
0.30
0.58
0.69
0.72
0.62
0.71
0.26
0.53
0.63
0.52
0.47
0.75
0.50
0.79
0.58

0.59
0.61
0.62
0.73
0.99
0.50
1.00
0.82
0.61
0.83
0.84
0.54
0.61
0.42
0.50
0.39
0.70
0.74
0.29
0.75
0.29
0.63
0.53
0.49
0.72
0.30
0.37
0.51
0.62
0.78
0.58
0.72
0.33
0.52
0.67
0.54
0.44
0.75
0.42
0.85
0.59

85

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.78
0.53
0.73
0.31
0.81
0.60
0.49
0.58
0.71
0.74
0.64
0.78
0.57
0.77
0.62
0.88
0.86
0.76
0.82
0.69
0.77
0.80
0.75
0.67
0.38
0.46
1.00
1.00
0.43
0.66
0.69
0.63
0.63
0.44
0.72
0.61
0.65
0.51
0.52
0.67
0.81

150
2GKG
121
2GOM
140
2GXG
203
2GZQ
232
2HQK
237
2HYK
113
2I24
399
2I49
108
2IBL
61
2IGD
203
2IMF
87
2IP6
89
2IVY
244
2J32
203
2J9W
38
2JKU
112
2JLI
2JLJ
121
2MCM 112
36
2NLS
193
2NR7
2NUH
104
309
2O6X
2OA2
140
2OHW 257
377
2OKT
6
2OL9
4
2OLX
2PKT
93
98
2PLT
83
2PMR
2POF
428
122
2PPN
608
2PSF
193
2PTH
2Q4N
1208
3296
2Q52
107
2QJL
2R16
185
149
2R6Q
2RB8
93

0.76
0.44
0.69
0.24
0.68
0.43
0.45
0.41
0.65
0.61
0.59
0.64
0.34
0.73
0.51
0.83
0.85
0.74
0.75
0.49
0.71
0.75
0.65
0.60
0.27
0.22
1.00
1.00
0.36
0.52
0.65
0.44
0.44
0.37
0.61
0.55
0.63
0.41
0.45
0.62
0.74

0.65
0.42
0.68
0.34
0.74
0.55
0.40
0.51
0.67
0.64
0.56
0.58
0.45
0.68
0.59
0.65
0.70
0.65
0.77
0.32
0.72
0.63
0.73
0.49
0.32
0.36
1.00
1.00
0.35
0.59
0.62
0.55
0.59
0.44
0.70
0.55
0.56
0.46
0.45
0.68
0.75

0.83
0.64
0.79
0.60
0.83
0.63
0.69
0.62
0.75
0.84
0.71
0.82
0.69
0.85
0.70
0.95
0.90
0.80
0.85
0.88
0.79
0.85
0.78
0.70
0.48
0.49
1.00
1.00
0.69
0.72
0.80
0.66
0.74
0.53
0.77
0.68
0.70
0.63
0.66
0.76
0.84

0.74
0.42
0.71
0.38
0.70
0.51
0.40
0.43
0.66
0.61
0.59
0.66
0.35
0.73
0.55
0.85
0.82
0.71
0.78
0.61
0.74
0.73
0.70
0.55
0.29
0.31
1.00
1.00
0.40
0.57
0.59
0.48
0.51
0.41
0.65
0.58
0.62
0.42
0.46
0.66
0.78

86

0.77
0.47
0.74
0.45
0.80
0.59
0.47
0.54
0.69
0.67
0.61
0.72
0.43
0.77
0.59
0.89
0.87
0.78
0.80
0.75
0.75
0.77
0.74
0.63
0.35
0.43
1.00
1.00
0.44
0.66
0.69
0.62
0.57
0.43
0.71
0.65
0.65
0.45
0.50
0.71
0.81

0.71
0.52
0.72
0.40
0.79
0.58
0.44
0.53
0.71
0.72
0.65
0.66
0.53
0.72
0.60
0.75
0.81
0.75
0.80
0.66
0.75
0.74
0.75
0.64
0.39
0.37
1.00
1.00
0.39
0.63
0.68
0.56
0.61
0.45
0.71
0.62
0.66
0.52
0.51
0.72
0.78

0.78
0.45
0.72
0.48
0.80
0.59
0.48
0.56
0.70
0.74
0.60
0.73
0.48
0.77
0.64
0.88
0.85
0.74
0.81
0.76
0.76
0.81
0.75
0.60
0.35
0.40
1.00
1.00
0.55
0.67
0.68
0.60
0.63
0.45
0.73
0.59
0.64
0.50
0.51
0.70
0.80

0.67
0.47
0.68
0.29
0.76
0.54
0.40
0.49
0.68
0.66
0.59
0.64
0.42
0.68
0.59
0.60
0.78
0.71
0.77
0.47
0.73
0.66
0.73
0.63
0.34
0.33
1.00
1.00
0.36
0.59
0.65
0.54
0.57
0.42
0.69
0.57
0.57
0.49
0.46
0.65
0.76

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.63
0.74
0.62
0.47
0.54
0.40
0.74
0.58
0.49
0.70
0.68
0.60
0.60
0.64
0.71
0.49
0.73
0.47
0.64
0.60
0.69
0.51
0.63
0.70
0.71
0.69
0.74
0.59
0.66
0.43
0.63
0.78
0.67
0.57
0.37
0.57
0.70
0.61
0.51
0.59
0.59

2RE2
2RFR
2V9V
2VE8
2VH7
2VIM
2VPA
2VQ4
2VY8
2VYO
2W1V
2W2A
2W6A
2WJ5
2WUJ
2WW7
2WWE
2X1Q
2X25
2X3M
2X5Y
2X9Z
2XHF
2Y0T
2Y72
2Y7L
2Y9F
2YLB
2YNY
2ZCM
2ZU1
3A0M
3A7L
3AMC
3AUB
3B5O
3BA1
3BED
3BQX
3BZQ
3BZZ

0.70
0.80
0.66
0.55
0.68
0.52
0.78
0.64
0.58
0.77
0.77
0.65
0.64
0.79
0.79
0.60
0.83
0.54
0.73
0.69
0.71
0.54
0.67
0.83
0.78
0.72
0.78
0.63
0.75
0.49
0.68
0.84
0.78
0.64
0.53
0.66
0.72
0.67
0.54
0.69
0.68

0.61
0.74
0.56
0.44
0.63
0.41
0.73
0.56
0.46
0.72
0.70
0.59
0.54
0.68
0.65
0.50
0.75
0.46
0.64
0.64
0.64
0.42
0.60
0.68
0.72
0.69
0.71
0.52
0.63
0.40
0.63
0.72
0.59
0.54
0.41
0.63
0.68
0.56
0.51
0.61
0.61

0.59
0.72
0.55
0.40
0.42
0.24
0.68
0.35
0.38
0.59
0.56
0.54
0.52
0.59
0.67
0.33
0.61
0.34
0.57
0.57
0.53
0.38
0.55
0.56
0.66
0.58
0.58
0.34
0.56
0.24
0.45
0.61
0.62
0.37
0.32
0.46
0.60
0.44
0.41
0.47
0.45

0.60
0.59
0.50
0.43
0.49
0.31
0.73
0.46
0.42
0.68
0.64
0.57
0.56
0.53
0.59
0.43
0.58
0.37
0.57
0.57
0.58
0.39
0.62
0.64
0.70
0.69
0.70
0.49
0.63
0.32
0.58
0.73
0.54
0.51
0.32
0.55
0.65
0.53
0.46
0.55
0.50

0.64
0.73
0.60
0.46
0.59
0.38
0.73
0.56
0.47
0.68
0.69
0.60
0.59
0.63
0.69
0.44
0.71
0.48
0.62
0.61
0.67
0.50
0.62
0.69
0.71
0.68
0.75
0.55
0.63
0.42
0.61
0.74
0.69
0.54
0.36
0.55
0.67
0.61
0.52
0.57
0.60

0.65
0.66
0.51
0.48
0.54
0.33
0.75
0.54
0.46
0.70
0.67
0.59
0.59
0.55
0.68
0.48
0.71
0.44
0.61
0.61
0.63
0.42
0.62
0.68
0.71
0.70
0.72
0.52
0.67
0.39
0.61
0.76
0.61
0.53
0.41
0.58
0.66
0.55
0.50
0.62
0.63

249
166
149
515
94
114
217
106
162
207
551
350
139
110
103
161
120
240
167
175
185
266
310
111
183
323
149
418
326
348
360
146
128
614
124
249
312
262
136
99
103

0.59
0.57
0.48
0.41
0.49
0.28
0.71
0.49
0.42
0.66
0.63
0.56
0.52
0.52
0.52
0.42
0.62
0.39
0.57
0.55
0.59
0.38
0.56
0.61
0.69
0.68
0.69
0.49
0.62
0.35
0.58
0.70
0.45
0.50
0.26
0.56
0.65
0.53
0.48
0.55
0.58

0.57
0.68
0.53
0.42
0.52
0.29
0.72
0.43
0.38
0.64
0.63
0.57
0.51
0.59
0.62
0.40
0.62
0.38
0.56
0.60
0.60
0.37
0.58
0.60
0.69
0.66
0.65
0.46
0.60
0.34
0.53
0.68
0.52
0.47
0.31
0.52
0.64
0.53
0.47
0.50
0.51

87

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.34
0.65
0.55
0.88
0.47
0.78
0.69
0.67
0.70
0.45
0.71
0.48
0.82
0.60
0.49
0.81
0.72
0.74
0.63
0.56
0.62
0.68
1.00
0.58
0.53
0.61
0.66
0.64
0.62
0.79
0.55
0.61
0.80
0.78
0.67
0.77
0.82
0.41
0.56
0.48
0.73

567
3DRF
359
3DWV
268
3E5T
40
3E7R
150
3EUR
148
3F2Z
261
3F7E
185
3FCN
89
3FE7
250
3FKE
75
3FMY
48
3FOD
238
3FSO
257
3FTD
418
3G1S
3GBW 170
129
3GHJ
216
3HFO
3HHP
1314
170
3HNY
201
3HP4
3HWU
155
8
3HYD
200
3HZ8
3I2V
127
140
3I2Z
154
3I4O
145
3I7M
3IHS
173
168
3IVV
227
3K6Y
3KBE
166
190
3KGK
94
3KZD
219
3L41
3LAA
176
118
3LAX
846
3LG3
3LJI
270
244
3M3P
3M8J
178

0.32
0.63
0.52
0.86
0.46
0.78
0.65
0.65
0.65
0.42
0.69
0.47
0.82
0.57
0.51
0.78
0.71
0.72
0.62
0.56
0.61
0.69
1.00
0.59
0.58
0.59
0.64
0.62
0.67
0.80
0.53
0.61
0.80
0.72
0.62
0.66
0.81
0.38
0.53
0.44
0.72

0.30
0.62
0.50
0.82
0.42
0.77
0.63
0.59
0.63
0.36
0.66
0.35
0.81
0.52
0.45
0.71
0.68
0.69
0.59
0.49
0.56
0.61
1.00
0.54
0.48
0.57
0.63
0.58
0.60
0.74
0.50
0.60
0.79
0.66
0.59
0.60
0.78
0.37
0.52
0.35
0.70

0.33
0.66
0.56
0.81
0.47
0.78
0.65
0.64
0.67
0.49
0.70
0.55
0.77
0.59
0.51
0.79
0.72
0.75
0.62
0.57
0.64
0.63
1.00
0.56
0.61
0.56
0.60
0.58
0.60
0.83
0.52
0.62
0.81
0.77
0.66
0.76
0.83
0.40
0.58
0.58
0.73

0.22
0.54
0.38
0.73
0.31
0.69
0.47
0.54
0.54
0.34
0.66
0.38
0.77
0.41
0.38
0.51
0.65
0.65
0.52
0.42
0.43
0.50
1.00
0.52
0.40
0.56
0.56
0.49
0.58
0.68
0.42
0.52
0.68
0.47
0.57
0.69
0.77
0.32
0.45
0.25
0.67

0.32
0.67
0.55
0.81
0.49
0.76
0.66
0.60
0.69
0.47
0.71
0.48
0.82
0.60
0.44
0.77
0.71
0.75
0.61
0.59
0.60
0.60
1.00
0.58
0.57
0.58
0.63
0.58
0.62
0.80
0.53
0.62
0.79
0.79
0.61
0.70
0.81
0.40
0.53
0.47
0.74

0.29
0.62
0.51
0.77
0.43
0.76
0.64
0.59
0.60
0.36
0.64
0.33
0.74
0.52
0.45
0.74
0.67
0.63
0.59
0.52
0.54
0.61
1.00
0.53
0.53
0.54
0.59
0.55
0.54
0.76
0.49
0.60
0.79
0.68
0.60
0.56
0.76
0.37
0.52
0.40
0.69

0.38
0.69
0.60
0.96
0.53
0.84
0.71
0.75
0.76
0.52
0.79
0.82
0.85
0.67
0.68
0.84
0.81
0.82
0.68
0.64
0.72
0.81
1.00
0.66
0.66
0.65
0.73
0.71
0.74
0.89
0.60
0.65
0.84
0.83
0.71
0.80
0.86
0.41
0.62
0.69
0.75

0.27
0.62
0.51
0.78
0.39
0.75
0.61
0.56
0.58
0.40
0.66
0.42
0.77
0.49
0.41
0.64
0.65
0.70
0.57
0.47
0.57
0.57
1.00
0.55
0.51
0.52
0.58
0.53
0.58
0.75
0.48
0.57
0.77
0.55
0.59
0.68
0.80
0.36
0.47
0.40
0.69

88

Table 7.21 (cont’d)

B & W

B

W

250
190
180
13
14
170
277
446
325
208
97
500
6
70
197
147
150
236
145
216
166
387
161
229
94
289
363
357
12
6
131
284
212
222
148
165
128
151
132
100
238

0.57
0.53
0.57
1.00
1.00
0.58
0.33
0.40
0.43
0.77
0.80
0.44
1.00
0.68
0.62
0.64
0.59
0.71
0.75
0.65
0.70
0.50
0.66
0.50
0.83
0.50
0.45
0.51
1.00
1.00
0.39
0.63
0.68
0.65
0.48
0.51
0.44
0.65
0.39
0.65
0.63

0.54
0.51
0.47
0.94
0.93
0.57
0.28
0.36
0.44
0.70
0.74
0.42
1.00
0.49
0.62
0.57
0.49
0.64
0.71
0.60
0.63
0.48
0.63
0.48
0.77
0.49
0.39
0.38
0.90
1.00
0.31
0.59
0.45
0.63
0.44
0.44
0.40
0.54
0.34
0.63
0.61

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.56
0.51
0.56
1.00
1.00
0.59
0.39
0.41
0.45
0.82
0.80
0.47
1.00
0.67
0.64
0.64
0.58
0.72
0.75
0.65
0.66
0.51
0.72
0.57
0.79
0.50
0.45
0.50
1.00
1.00
0.42
0.64
0.65
0.69
0.47
0.56
0.47
0.59
0.37
0.67
0.67

3M9J
3M9Q
3MAB
3MD4
3MD5
3MEA
3MGN
3MRE
3N11
3NE0
3NGG
3NPV
3NVG
3NZL
3O0P
3O5P
3OBQ
3OQY
3P6J
3PD7
3PES
3PID
3PIW
3PKV
3PSM
3PTL
3PVE
3PZ9
3PZZ
3Q2X
3Q6L
3QDS
3QPA
3R6D
3R87
3RQ9
3RY0
3RZY
3S0A
3SD2
3SEB

0.56
0.53
0.55
1.00
1.00
0.64
0.30
0.40
0.45
0.77
0.78
0.44
1.00
0.66
0.64
0.60
0.58
0.70
0.73
0.65
0.70
0.53
0.70
0.53
0.83
0.50
0.44
0.42
1.00
1.00
0.37
0.65
0.47
0.65
0.48
0.52
0.47
0.65
0.38
0.69
0.68

0.56
0.52
0.56
1.00
1.00
0.58
0.32
0.38
0.45
0.79
0.81
0.44
1.00
0.61
0.64
0.60
0.59
0.66
0.73
0.66
0.72
0.49
0.67
0.52
0.78
0.50
0.45
0.45
1.00
1.00
0.44
0.62
0.66
0.66
0.47
0.47
0.45
0.65
0.43
0.67
0.66

0.59
0.59
0.62
1.00
1.00
0.68
0.47
0.45
0.51
0.84
0.85
0.50
1.00
0.84
0.71
0.71
0.66
0.73
0.81
0.72
0.79
0.56
0.78
0.63
0.88
0.53
0.59
0.57
1.00
1.00
0.56
0.69
0.78
0.73
0.55
0.61
0.54
0.84
0.52
0.77
0.77

0.53
0.50
0.52
0.91
0.98
0.57
0.26
0.32
0.42
0.75
0.72
0.40
1.00
0.53
0.59
0.55
0.46
0.63
0.69
0.62
0.58
0.44
0.60
0.43
0.79
0.49
0.37
0.36
0.95
1.00
0.33
0.59
0.45
0.62
0.41
0.41
0.40
0.59
0.33
0.64
0.62

89

0.39
0.46
0.56
0.93
0.94
0.48
0.16
0.24
0.38
0.70
0.74
0.36
1.00
0.59
0.53
0.53
0.53
0.60
0.61
0.60
0.52
0.37
0.56
0.35
0.68
0.43
0.41
0.34
0.94
1.00
0.34
0.51
0.59
0.53
0.40
0.39
0.41
0.57
0.32
0.56
0.61

0.53
0.50
0.51
0.99
0.92
0.57
0.29
0.35
0.44
0.76
0.76
0.43
1.00
0.55
0.62
0.56
0.56
0.64
0.71
0.61
0.60
0.46
0.63
0.50
0.76
0.49
0.42
0.39
0.80
1.00
0.37
0.59
0.59
0.64
0.45
0.45
0.42
0.51
0.31
0.63
0.62

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.40
0.70
0.44
0.57
0.53
0.76
0.68
0.54
0.49
0.65
0.54
0.75
0.48
0.80
0.63
0.53
0.59
0.68
0.53
0.89
0.63
0.64
0.58
0.67
0.64
0.36
0.52
0.56
0.37
0.40
0.64
0.60
0.60
0.69
0.55
0.47
0.57
0.62
0.73
0.83
0.66

126
3SED
157
3SO6
657
3SR3
254
3SUK
753
3SZH
209
3T0H
122
3T3K
145
3T47
3TDN
359
3TOW 155
226
3TUA
3TYS
78
276
3U6G
85
3U97
3UCI
72
637
3UR8
159
3US6
59
3V1A
3V75
294
193
3VN0
219
3VOR
3VUB
101
112
3VVV
163
3VZ9
3W4Q
826
213
3ZBD
157
3ZIT
241
3ZRX
3ZSL
165
74
3ZZP
226
3ZZY
4A02
169
182
4ACJ
189
4AE7
359
4AM1
4ANN
210
189
4AVR
56
4AXY
4B6G
559
292
4B9G
4DD5
412

0.55
0.78
0.48
0.64
0.57
0.78
0.72
0.78
0.53
0.74
0.63
0.86
0.54
0.84
0.72
0.60
0.67
0.95
0.57
0.90
0.70
0.71
0.79
0.70
0.68
0.74
0.59
0.63
0.54
0.47
0.69
0.66
0.75
0.74
0.59
0.57
0.70
0.76
0.75
0.85
0.71

0.29
0.69
0.41
0.48
0.51
0.69
0.62
0.45
0.42
0.61
0.50
0.48
0.37
0.73
0.53
0.51
0.49
0.53
0.46
0.87
0.52
0.56
0.48
0.55
0.59
0.28
0.39
0.52
0.22
0.27
0.63
0.52
0.58
0.61
0.52
0.43
0.51
0.48
0.69
0.80
0.59

0.39
0.67
0.45
0.53
0.53
0.76
0.66
0.54
0.47
0.66
0.57
0.78
0.44
0.78
0.67
0.52
0.60
0.74
0.50
0.87
0.64
0.65
0.64
0.65
0.61
0.36
0.51
0.56
0.39
0.40
0.65
0.61
0.55
0.69
0.57
0.50
0.57
0.55
0.70
0.81
0.60

0.45
0.71
0.44
0.54
0.53
0.73
0.66
0.54
0.43
0.65
0.55
0.58
0.39
0.78
0.64
0.53
0.56
0.57
0.49
0.88
0.58
0.60
0.64
0.64
0.60
0.47
0.47
0.56
0.39
0.30
0.67
0.56
0.59
0.67
0.54
0.48
0.57
0.60
0.71
0.82
0.63

0.28
0.63
0.43
0.46
0.51
0.72
0.55
0.45
0.43
0.58
0.52
0.67
0.39
0.77
0.48
0.49
0.55
0.51
0.48
0.86
0.56
0.60
0.55
0.60
0.56
0.24
0.36
0.49
0.28
0.19
0.63
0.49
0.55
0.63
0.53
0.42
0.53
0.47
0.67
0.78
0.57

90

0.38
0.73
0.45
0.54
0.52
0.74
0.68
0.62
0.44
0.66
0.55
0.73
0.45
0.80
0.57
0.55
0.62
0.77
0.53
0.88
0.63
0.61
0.65
0.63
0.61
0.34
0.47
0.53
0.40
0.31
0.64
0.57
0.61
0.65
0.53
0.48
0.59
0.63
0.72
0.81
0.63

0.33
0.55
0.39
0.47
0.45
0.68
0.48
0.43
0.38
0.53
0.45
0.70
0.27
0.77
0.55
0.45
0.53
0.39
0.47
0.79
0.53
0.61
0.57
0.60
0.47
0.25
0.47
0.46
0.31
0.12
0.59
0.31
0.51
0.63
0.46
0.36
0.49
0.47
0.60
0.71
0.51

0.33
0.64
0.43
0.49
0.52
0.71
0.60
0.47
0.43
0.60
0.52
0.46
0.35
0.76
0.56
0.52
0.46
0.46
0.47
0.88
0.55
0.57
0.49
0.60
0.60
0.31
0.41
0.52
0.24
0.22
0.63
0.51
0.59
0.65
0.53
0.42
0.53
0.50
0.69
0.82
0.61

Table 7.21 (cont’d)

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.61
0.74
0.69
0.46
0.73
0.60
0.65
0.83
0.62
0.39
0.60
0.64
1.00
0.80
0.72
0.45
0.53
0.60
0.72
0.70
0.69
0.42
0.67
0.62
0.70
0.62
0.66
0.77
0.64
0.53
0.63
0.67
0.61
0.56
0.50
0.76

423
4DKN
93
4DND
113
4DPZ
338
4DQ7
170
4DT4
313
4EK3
318
4ERY
96
4ES1
225
4EUG
459
4F01
143
4F3J
145
4FR9
5
4G14
155
4G2E
584
4G5X
676
4G6C
216
4G7X
183
4GA2
94
4GMQ
90
4GS3
278
4H4J
175
4H89
167
4HDE
4HJP
308
4HWM 129
99
377
268
162
305
202
207
550
145
103
45

0.54
0.64
0.64
0.44
0.68
0.56
0.59
0.73
0.60
0.34
0.59
0.58
1.00
0.61
0.64
0.44
0.31
0.53
0.66
0.64
0.64
0.37
0.52
0.55
0.60
0.59
0.61
0.62
0.56
0.44
0.61
0.58
0.57
0.42
0.46
0.48

0.56
0.75
0.67
0.49
0.72
0.58
0.64
0.78
0.62
0.37
0.61
0.70
1.00
0.74
0.74
0.46
0.47
0.57
0.72
0.68
0.75
0.39
0.69
0.59
0.68
0.62
0.63
0.77
0.66
0.50
0.64
0.66
0.61
0.49
0.54
0.76

0.42
0.61
0.62
0.29
0.70
0.53
0.52
0.57
0.51
0.22
0.47
0.58
1.00
0.68
0.64
0.24
0.51
0.49
0.67
0.51
0.57
0.35
0.59
0.58
0.68
0.57
0.63
0.75
0.59
0.38
0.58
0.59
0.38
0.27
0.43
0.49

0.55
0.64
0.64
0.40
0.70
0.59
0.59
0.74
0.58
0.34
0.58
0.57
1.00
0.61
0.67
0.43
0.37
0.53
0.63
0.66
0.66
0.40
0.51
0.58
0.63
0.61
0.61
0.66
0.57
0.47
0.60
0.60
0.58
0.45
0.48
0.49

4IL7
4J11
4J5O
4J5Q
4J78
4JG2
4JVU
4JYP
4KEF
5CYT
6RXN

0.59
0.75
0.68
0.45
0.76
0.58
0.61
0.76
0.61
0.38
0.57
0.65
1.00
0.75
0.71
0.43
0.53
0.55
0.73
0.65
0.67
0.39
0.63
0.62
0.69
0.63
0.66
0.77
0.65
0.48
0.63
0.67
0.59
0.52
0.53
0.74

0.58
0.66
0.70
0.46
0.74
0.63
0.60
0.77
0.61
0.37
0.63
0.62
1.00
0.64
0.69
0.44
0.47
0.56
0.77
0.68
0.67
0.50
0.55
0.61
0.66
0.63
0.63
0.76
0.63
0.48
0.63
0.64
0.60
0.49
0.52
0.63

0.63
0.82
0.79
0.51
0.78
0.65
0.67
0.86
0.67
0.47
0.66
0.78
1.00
0.85
0.80
0.50
0.61
0.70
0.84
0.74
0.82
0.67
0.75
0.65
0.71
0.65
0.68
0.82
0.75
0.56
0.74
0.75
0.69
0.65
0.65
0.86

0.52
0.67
0.65
0.37
0.70
0.55
0.59
0.69
0.54
0.32
0.52
0.63
1.00
0.59
0.69
0.40
0.41
0.52
0.68
0.60
0.63
0.33
0.59
0.57
0.66
0.60
0.62
0.71
0.57
0.43
0.61
0.57
0.52
0.40
0.49
0.59

91

Table 7.22: Persistent homology based Pearson correlation coeﬃcients for cross protein Cα
atom blind B factor prediction obtained by boosted gradient (GBT), convolutional neural
network (CNN), and consensus method (CON) for the Superset.

PDB ID N GBT CNN CON PDB ID N GBT CNN CON
0.76
1ABA
1AHO
0.52
0.58
1AIE
0.74
1AKG
1ATG
0.69
0.68
1BGF
0.76
1BX7
0.7
1BYI
1CCR
0.69
0.36
1CYO
0.68
1DF4
1E5K
0.59
0.53
1ES5
0.69
1ETL
0.55
1ETM
0.52
1ETN
0.63
1EW4
0.5
1F8R
1FF4
0.58
0.49
1FK5
0.44
1GCO
0.52
1GK7
1GVD
0.63
0.46
1GXU
0.77
1H6V
1HJE
0.38
0.75
1I71
0.68
1IDP
0.66
1IFR
1K8U
0.54
1KMM
0.54
0.68
1KNG
1KR4
0.54
0.74
1KYC
0.67
1LR7
0.61
1MF7
1N7E
0.75
0.59
1NKD

2X5Y
2X9Z
2XHF
2Y0T
2Y72
2Y7L
2Y9F
2YLB
2YNY
2ZCM
2ZU1
3A0M
3A7L
3AMC
3AUB
3B5O
3BA1
3BED
3BQX
3BZQ
3BZZ
3DRF
3DWV
3E5T
3E7R
3EUR
3F2Z
3F7E
3FCN
3FE7
3FKE
3FMY
3FOD
3FSO
3FTD
3G1S
3GBW
3GHJ

0.68
0.52
0.57
0.71
0.71
0.66
0.75
0.66
0.71
0.38
0.66
0.6
0.61
0.64
0.5
0.55
0.59
0.53
0.55
0.53
0.51
0.45
0.55
0.48
0.66
0.42
0.76
0.69
0.65
0.55
0.51
0.67
0.57
0.75
0.68
0.57
0.74
0.56

87
66
31
16
231
124
51
238
109
88
57
188
260
12
12
12
106
1932
65
93
1044
39
56
89
2927
13
83
441
113
87
1499
144
107
15
73
194
95
59

0.73
0.66
0.75
0.27
0.55
0.61
0.74
0.61
0.55
0.64
0.85
0.74
0.65
0.37
0.37
0.07
0.59
0.52
0.61
0.59
0.47
0.77
0.71
0.67
0.26
0.84
0.53
0.62
0.7
0.57
0.64
0.5
0.56
0.62
0.62
0.65
0.63
0.7

0.71
0.66
0.7
0.32
0.51
0.58
0.74
0.5
0.6
0.7
0.85
0.72
0.62
0.82
0.63
0.48
0.6
0.54
0.66
0.6
0.47
0.9
0.55
0.68
0.34
0.75
0.58
0.6
0.64
0.6
0.51
0.52
0.71
0.69
0.61
0.66
0.58
0.7

185
266
310
111
183
323
149
418
326
348
360
146
128
614
124
249
312
262
136
99
103
567
359
268
40
150
148
261
185
89
250
75
48
238
257
418
170
129

0.76
0.49
0.58
0.71
0.65
0.66
0.74
0.67
0.65
0.33
0.66
0.53
0.44
0.68
0.5
0.49
0.62
0.45
0.56
0.45
0.38
0.51
0.63
0.44
0.72
0.36
0.73
0.65
0.63
0.52
0.51
0.65
0.45
0.72
0.64
0.6
0.74
0.58

0.74
0.7
0.78
0.29
0.55
0.62
0.76
0.6
0.59
0.68
0.88
0.74
0.66
0.55
0.43
0.13
0.61
0.54
0.64
0.61
0.5
0.82
0.69
0.69
0.34
0.9
0.56
0.63
0.7
0.59
0.63
0.51
0.63
0.66
0.64
0.67
0.65
0.72

92

Table 7.22 (cont’d)

PDB ID N GBT CNN CON PDB ID N GBT CNN CON
0.54
1NLS
0.65
1NNX
0.62
1NOA
0.58
1NOT
0.62
1O06
0.99
1O08
0.48
1OPD
0.48
1P9I
1PEF
0.6
0.66
1PEN
0.49
1PMY
0.64
1PZ4
1Q9B
0.84
0.58
1QAU
0.6
1QKI
1QTO
0.78
0.6
1R29
0.64
1R7J
0.42
1RJU
0.74
1RRO
0.5
1SAU
0.58
1TGR
1TZV
0.57
0.74
1U06
0.56
1U7I
0.45
1U9C
0.65
1UHA
0.96
1UKU
0.63
1ULR
1UOY
0.09
0.57
1USE
0.52
1USM
1UTG
0.71
0.75
1V05
0.54
1V70
0.71
1VRZ
1W2L
0.57
0.71
1WBE
0.61
1WHI
1WLY
0.59
0.66
1WPA
1X3O
0.69

3HFO
3HHP
3HNY
3HP4
3HWU
3HYD
3HZ8
3I2V
3I2Z
3I4O
3I7M
3IHS
3IVV
3K6Y
3KBE
3KGK
3KZD
3L41
3LAA
3LAX
3LG3
3LJI
3M3P
3M8J
3M9J
3M9Q
3MAB
3MD4
3MEA
3MGN
3MRE
3N11
3NE0
3NGG
3NPV
3NVG
3NZL
3O0P
3O5P
3OBQ
3OQY
3P6J

238
93
113
13
22
221
85
29
18
16
123
113
44
112
3912
122
122
90
36
108
123
111
157
55
259
220
82
102
87
64
47
77
70
96
105
13
97
206
122
322
107
80

216
1314
170
201
155
8
200
127
140
154
145
173
168
227
166
190
94
219
176
118
846
270
244
178
250
190
180
13
170
277
446
325
208
97
500
6
70
197
147
150
236
145

0.51
0.61
0.61
0.56
0.58
0.99
0.45
0.44
0.6
0.62
0.44
0.61
0.83
0.56
0.56
0.76
0.55
0.61
0.35
0.74
0.45
0.57
0.53
0.72
0.56
0.4
0.63
0.88
0.62
0.08
0.54
0.51
0.67
0.72
0.51
0.51
0.56
0.68
0.6
0.59
0.66
0.66

0.57
0.65
0.6
0.58
0.65
0.74
0.54
0.52
0.6
0.72
0.57
0.62
0.82
0.57
0.64
0.8
0.67
0.64
0.49
0.69
0.51
0.55
0.59
0.71
0.52
0.48
0.63
0.96
0.63
0.09
0.54
0.47
0.73
0.75
0.5
0.63
0.58
0.72
0.59
0.57
0.59
0.72

0.55
0.78
0.55
0.69
0.94
0.49
0.42
0.73
0.79
0.36
0.59
0.72
0.59
0.51
0.34
0.53
0.56
0.71
0.6
0.4
0.54
0.66
0.74
0.44
0.71
0.57
0.71
0.75
0.54
0.72
0.05
0.73
0.62
0.6
0.63
0.54
0.43
0.6
0.59
0.64
0.65
0.41

0.57
0.79
0.53
0.96
0.93
0.47
0.34
0.73
0.82
0.74
0.7
0.8
0.85
0.59
0.45
0.48
0.59
0.77
0.46
0.45
0.66
0.69
0.77
0.4
0.74
0.59
0.74
0.76
0.53
0.7
0.32
0.72
0.64
0.64
0.62
0.34
0.5
0.56
0.56
0.62
0.65
0.43

0.57
0.8
0.56
0.8
0.95
0.49
0.41
0.74
0.82
0.44
0.65
0.77
0.67
0.57
0.38
0.54
0.59
0.75
0.58
0.43
0.59
0.69
0.76
0.45
0.74
0.59
0.73
0.77
0.56
0.76
0.12
0.75
0.66
0.63
0.64
0.54
0.47
0.6
0.6
0.66
0.67
0.44

93

PDB ID N GBT CNN CON PDB ID N GBT CNN CON
0.71
1XY1
0.57
1XY2
0.45
1Y6X
0.75
1YJO
0.53
1YZM
0.82
1Z21
0.55
1ZCE
0.63
1ZVA
2A50
0.63
0.93
2AGK
0.93
2AH1
0.52
2B0A
2BCM
0.63
0.59
2BF9
0.69
2BRF
2C71
0.48
0.46
2CE0
0.46
2CG7
0.66
2COV
0.54
2CWS
0.57
2D5W
0.63
2DKO
2DPL
0.55
0.66
2DSX
0.5
2OCT
0.6
2E3H
0.71
2EAQ
0.73
2EHP
0.78
2EHS
2ERW
0.57
0.49
2ETX
0.63
2FB6
2FG1
0.63
0.72
2FN9
0.52
2FQ3
0.71
2G69
2G7O
0.43
0.64
2G7S
0.64
2GKG
2GOM
0.55
0.53
2GXG
2GZQ
0.86

3PD7
3PES
3PID
3PIW
3PKV
3PSM
3PTL
3PVE
3PZ9
3PZZ
3Q2X
3Q6L
3QDS
3QPA
3R6D
3R87
3RQ9
3RY0
3RZY
3S0A
3SD2
3SEB
3SED
3SO6
3SR3
3SUK
3SZH
3T0H
3T3K
3T47
3TDN
3TOW
3TUA
3TYS
3U6G
3U97
3UCI
3UR8
3US6
3V1A
3V75
3VN0

0.75
0.82
0.46
-0.06
0.64
0.65
0.74
0.7
0.54
0.63
0.55
0.59
0.51
0.79
0.77
0.6
0.66
0.32
0.72
0.47
0.64
0.78
0.36
0.34
0.67
0.68
0.63
0.62
0.67
0.24
0.48
0.75
0.61
0.54
0.82
0.5
0.86
0.58
0.64
0.59
0.67
0.4

16
8
86
6
46
96
139
75
469
233
939
191
415
35
103
225
109
110
534
235
1214
253
565
52
439
81
89
246
75
53
390
129
176
560
85
99
68
206
150
121
140
203

0.82
0.79
0.5
0.7
0.69
0.68
0.7
0.7
0.6
0.67
0.48
0.62
0.5
0.48
0.72
0.57
0.6
0.3
0.74
0.61
0.54
0.78
0.41
0.34
0.64
0.65
0.57
0.66
0.62
0.12
0.49
0.73
0.57
0.57
0.77
0.62
0.72
0.55
0.56
0.69
0.65
0.34

216
166
387
161
229
94
289
363
357
12
6
131
284
212
222
148
165
128
151
132
100
238
126
157
657
254
753
209
122
145
359
155
226
78
276
85
72
637
159
59
294
193

0.68
0.56
0.48
0.72
0.52
0.8
0.53
0.61
0.61
0.94
0.95
0.47
0.62
0.55
0.65
0.47
0.46
0.41
0.65
0.53
0.56
0.63
0.53
0.65
0.5
0.58
0.69
0.71
0.76
0.51
0.47
0.61
0.62
0.66
0.53
0.67
0.42
0.64
0.61
0.57
0.49
0.85

0.7
0.54
0.3
0.77
0.51
0.77
0.55
0.61
0.58
0.85
0.72
0.53
0.62
0.67
0.74
0.45
0.4
0.49
0.62
0.49
0.56
0.6
0.52
0.65
0.46
0.59
0.67
0.7
0.76
0.62
0.49
0.63
0.56
0.74
0.46
0.72
0.42
0.6
0.63
0.27
0.56
0.85

Table 7.22 (cont’d)

0.83
0.81
0.51
0.57
0.7
0.69
0.73
0.71
0.6
0.67
0.54
0.63
0.52
0.58
0.75
0.6
0.64
0.32
0.75
0.6
0.59
0.8
0.42
0.36
0.67
0.67
0.61
0.67
0.65
0.16
0.51
0.75
0.59
0.58
0.81
0.6
0.8
0.58
0.59
0.69
0.68
0.37

94

PDB ID N GBT CNN CON PDB ID N GBT CNN CON
0.48
2HQK
0.59
2HYK
0.57
2I24
0.72
2I49
0.66
2IBL
0.55
2IGD
0.5
2IMF
0.58
2IP6
2IVY
0.53
0.42
2J32
0.68
2J9W
0.62
2JKU
2JLI
0.64
0.68
2JLJ
0.55
2MCM
2NLS
0.45
0.56
2NR7
0.62
2NUH
0.71
2O6X
0.76
2OA2
0.63
2OHW
0.68
2OKT
2OL9
0.65
0.72
2PKT
0.57
2PLT
0.69
2PMR
0.61
2POF
0.59
2PPN
0.71
2PSF
2PTH
0.58
0.33
2Q4N
0.62
2Q52
2QJL
0.61
0.04
2R16
0.76
2R6Q
0.74
2RB8
2RE2
0.58
0.45
2RFR
0.61
2V9V
2VE8
0.76
0.61
2VH7
2VIM
0.77

3VOR
3VUB
3VVV
3VZ9
3W4Q
3ZBD
3ZIT
3ZRX
3ZSL
3ZZP
3ZZY
4A02
4ACJ
4AE7
4AM1
4ANN
4AVR
4AXY
4B6G
4B9G
4DD5
4DKN
4DND
4DPZ
4DQ7
4DT4
4EK3
4ERY
4ES1
4EUG
4F01
4F3J
4FR9
4G14
4G2E
4G5X
4G6C
4G7X
4GA2
4GMQ
4GS3
4H4J

0.47
0.59
0.56
0.72
0.65
0.55
0.52
0.54
0.49
0.38
0.65
0.59
0.62
0.65
0.54
0.44
0.56
0.59
0.69
0.74
0.61
0.66
0.62
0.7
0.55
0.67
0.6
0.57
0.69
0.56
0.35
0.58
0.6
-0.28
0.75
0.71
0.56
0.45
0.61
0.76
0.61
0.75

232
237
113
399
108
61
203
87
89
244
203
38
112
121
112
36
193
104
309
140
257
377
6
93
98
83
428
122
608
193
1208
3296
107
185
149
93
249
166
149
515
94
114

0.77
0.65
0.44
0.65
0.65
0.57
0.53
0.6
0.51
0.75
0.64
0.57
0.62
0.71
0.71
0.23
0.78
0.72
0.76
0.54
0.56
0.42
0.94
0.01
0.52
0.6
0.62
0.64
0.42
0.69
0.44
0.55
0.54
0.44
0.63
0.67
0.65
0.61
0.53
0.55
0.75
0.44

0.48
0.55
0.57
0.64
0.6
0.49
0.42
0.6
0.57
0.48
0.65
0.65
0.66
0.7
0.52
0.43
0.53
0.65
0.68
0.74
0.62
0.64
0.67
0.74
0.6
0.69
0.58
0.59
0.69
0.55
0.26
0.63
0.56
0.45
0.72
0.73
0.54
0.4
0.53
0.67
0.56
0.74

219
101
112
163
826
213
157
241
165
74
226
169
182
189
359
210
189
56
559
292
412
423
93
113
338
170
313
318
96
225
459
143
145
5
155
584
676
216
183
94
90
278

Table 7.22 (cont’d)

0.77
0.63
0.46
0.61
0.66
0.56
0.58
0.66
0.45
0.79
0.58
0.71
0.68
0.71
0.77
0.47
0.76
0.56
0.76
0.55
0.46
0.42
0.85
-0.04
0.53
0.63
0.6
0.54
0.42
0.7
0.43
0.28
0.57
0.49
0.62
0.7
0.66
0.69
0.52
0.55
0.56
0.47

0.78
0.65
0.46
0.66
0.67
0.58
0.56
0.63
0.51
0.79
0.64
0.66
0.65
0.74
0.75
0.29
0.79
0.7
0.78
0.56
0.54
0.43
0.94
-0.01
0.54
0.63
0.66
0.63
0.43
0.71
0.45
0.52
0.56
0.46
0.65
0.7
0.68
0.66
0.54
0.58
0.73
0.47

95

Table 7.22 (cont’d)

PDB ID N GBT CNN CON PDB ID N GBT CNN CON
0.56
2VPA
0.7
2VQ4
0.67
2VY8
0.57
2VYO
0.56
2W1V
0.58
2W2A
0.69
2W6A
0.74
2WJ5
2WUJ
0.64
0.73
2WW7
0.72
2WWE
0.65
2X1Q
2X25
0.51
0.39
2X3M
0.61

4H89
4HDE
4HJP
4HWM
4IL7
4J11
4J5O
4J5Q
4J78
4JG2
4JVU
4JYP
4KEF
5CYT
6RXN

0.58
0.72
0.6
0.6
0.55
0.49
0.68
0.74
0.6
0.72
0.7
0.67
0.53
0.34
0.6

0.53
0.66
0.68
0.54
0.55
0.58
0.67
0.72
0.63
0.72
0.7
0.59
0.48
0.39
0.59

175
167
308
129
99
377
268
162
305
202
207
550
145
103
45

217
106
162
207
551
350
139
110
103
161
120
240
167
175

0.66
0.7
0.77
0.6
0.64
0.59
0.71
0.45
0.35
0.36
0.49
0.44
0.5
0.64

0.75
0.75
0.68
0.63
0.69
0.6
0.69
0.53
0.54
0.35
0.55
0.5
0.57
0.65

0.71
0.72
0.76
0.63
0.66
0.61
0.72
0.48
0.45
0.37
0.53
0.47
0.55
0.65

96

(a)

(b)

(c)

(d)

Figure 7.6: A visual comparison of experimental B factors (a), WCG predicted B factors (b),
and GNM predicted B factors (c) for the engineered cyan ﬂuorescent protein, mTFP1 (PDB
ID:2HQK). (d) The experimental (Exp) and predicted B factor values plotted per residue
for PDB ID 2HQK. The GNM is for the GNM method with a cutoﬀ distance of 7 ˚A. WCG
is parametrized using CC, CN, CO kernels of the exponential type with ﬁxed parameters
κ = 1, and η = 3 ˚A. Figure originally published in Bramer et al [1].

97

Table 7.1: Correlation coeﬃcients for B factor prediction obtained by optimal FRI (opFRI),
parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for small-size structures.
Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken
from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter
free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer
et al [1].

PDB ID N MWCG opFRI pfFRI GNM NMA

1AIE
31
1AKG 16
51
1BX7
1ETL
12
1ETM 12
12
1ETN
1FF4
65
1GK7
39
1GVD 52
13
1HJE
1KYC
15
13
1NOT
20
1O06
1OB4
16
16
1OB7
29
1P9I
18
1PEF
1PEN
16
43
1Q9B
36
1RJU
1U06
55
1UOY 64
40
1USE
21
1VRZ
1XY2
8
1YJO
6
1YZM 46
2DSX
52
35
2JKU
36
2NLS
2OL9
6
2OLX
4
6RXN 45

0.969
0.945
0.896
0.932
0.941
0.949
0.933
0.984
0.849
0.931
0.971
0.937
0.988
1.000
1.000
0.841
0.989
0.957
0.957
0.805
0.774
0.769
0.960
0.995
1.000
1.000
0.970
0.704
0.926
0.937
1.000
1.000
0.583

0.416
0.35
0.623
0.609
0.393
0.023
0.613
0.773
0.732
0.686
0.763
0.622
0.874
0.763
0.545
0.742
0.826
0.465
0.726
0.447
0.429
0.653
0.146
0.695
0.57
0.333
0.834
0.333
0.695
0.559
0.904
0.888
0.574

0.155
0.185
0.706
0.628
0.432
-0.274
0.674
0.821
0.591
0.616
0.754
0.523
0.844
0.750
0.652
0.625
0.808
0.270
0.656
0.431
0.434
0.671
-0.142
0.677
0.562
0.434
0.901
0.127
0.656
0.530
0.689
0.885
0.594

0.712
-0.229
0.868
0.355
0.027
-0.573
0.555
0.822
0.570
0.562
0.784
0.567
0.900
0.930
0.952
0.603
0.888
0.056
0.646
0.235
0.377
0.628
-0.399
-0.203
0.458
0.445
0.939
0.433
0.850
0.088
0.886
0.776
0.304

0.588
0.373
0.726
0.71
0.544
0.089
0.718
0.845
0.781
0.811
0.796
0.746
0.91
0.776
0.737
0.754
0.888
0.516
0.746
0.517
0.474
0.713
0.438
0.792
0.619
0.375
0.842
0.337
0.805
0.605
0.909
0.917
0.614

98

Table 7.2: Correlation coeﬃcients for B factor prediction obtained by optimal FRI (opFRI),
parameter free FRI (pfFRI) and Gaussian normal mode (GNM) for medium-size structures.
Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken
from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter
free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer
et al [1].

PDB ID N MWCG opFRI pfFRI GNM NMA

87
1ABA
88
1CYO
93
1FK5
88
1GXU
83
1I71
73
1LR7
95
1N7E
93
1NNX
1NOA 113
1OPD
85
1QAU 112
1R7J
90
83
1UHA
1ULR
87
1USM 77
96
1V05
97
1W2L
80
1X3O
1Z21
96
75
1ZVA
36
2BF9
2BRF
100
99
2CE0
81
2E3H
89
2EAQ
2EHS
75
85
2FQ3
2IP6
87
2MCM 113
2NUH 104
93
2PKT
99
2PLT
2QJL
99
93
2RB8
99
3BZQ
5CYT
103

0.855
0.860
0.648
0.901
0.798
0.929
0.812
0.834
0.808
0.607
0.786
0.859
0.838
0.718
0.819
0.841
0.747
0.787
0.725
0.911
0.714
0.873
0.824
0.794
0.817
0.805
0.844
0.841
0.867
0.922
0.762
0.635
0.611
0.840
0.848
0.548

0.698
0.702
0.568
0.634
0.516
0.657
0.609
0.789
0.604
0.409
0.672
0.621
0.665
0.594
0.809
0.599
0.564
0.559
0.638
0.579
0.554
0.764
0.598
0.682
0.690
0.713
0.692
0.578
0.713
0.691
0.003
0.484
0.584
0.614
0.516
0.421

0.613
0.741
0.485
0.421
0.549
0.620
0.497
0.631
0.615
0.398
0.620
0.368
0.638
0.495
0.798
0.632
0.397
0.654
0.433
0.690
0.680
0.710
0.529
0.605
0.695
0.747
0.348
0.572
0.639
0.771
-0.193
0.509
0.594
0.517
0.466
0.331

0.057
0.774
0.362
0.581
0.380
0.795
0.385
0.517
0.485
0.796
0.533
0.078
0.308
0.223
0.780
0.389
0.432
0.453
0.289
0.579
0.521
0.535
0.628
0.632
0.688
0.565
0.508
0.826
0.643
0.685
-0.165
0.187
0.497
0.485
0.351
0.102

0.727
0.751
0.590
0.748
0.549
0.679
0.651
0.795
0.622
0.555
0.678
0.789
0.726
0.639
0.832
0.629
0.691
0.600
0.662
0.756
0.606
0.795
0.706
0.692
0.753
0.720
0.719
0.654
0.789
0.835
0.162
0.508
0.594
0.727
0.532
0.441

99

Table 7.3: Correlation coeﬃcients for B factor prediction obtained by optimal FRI (opFRI),
parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for large-size structures.
Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken
from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter
free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer
et al [1].

PDB ID N MWCG opFRI pfFRI GNM NMA

1AHO
64
1ATG 231
224
1BYI
111
1CCR
1E5K
188
106
1EW4
1IFR
113
1NKO 122
1NLS
238
1O08
221
1PMY 123
1PZ4
114
1QTO 122
1RRO 112
1UKU 102
1V70
105
204
1WBE
1WHI
122
1WPA 107
2AGK 233
205
2C71
2CG7
90
2CWS
227
2HQK 213
2HYK 238
2I24
113
203
2IMF
107
2PPN
2R16
176
2V9V
135
2VIM 104
2VPA
204
2VYO 210
238
3SEB
3VUB
101

0.768
0.843
0.600
0.741
0.848
0.804
0.875
0.831
0.799
0.516
0.701
0.921
0.809
0.748
0.765
0.854
0.767
0.804
0.797
0.821
0.773
0.738
0.756
0.897
0.728
0.672
0.798
0.673
0.640
0.697
0.859
0.757
0.777
0.879
0.852

0.625
0.578
0.491
0.512
0.732
0.644
0.689
0.535
0.530
0.333
0.654
0.781
0.520
0.372
0.661
0.492
0.577
0.539
0.577
0.694
0.649
0.539
0.640
0.809
0.575
0.498
0.625
0.638
0.495
0.548
0.393
0.755
0.648
0.712
0.610

0.562
0.497
0.552
0.351
0.859
0.547
0.637
0.368
0.523
0.309
0.685
0.843
0.334
0.529
0.742
0.162
0.549
0.270
0.417
0.512
0.560
0.379
0.696
0.365
0.510
0.494
0.514
0.668
0.618
0.528
0.212
0.576
0.729
0.826
0.607

0.339
0.154
0.133
0.530
0.620
0.447
0.330
0.322
0.385
0.616
0.702
0.844
0.725
0.546
0.720
0.285
0.574
0.414
0.380
0.514
0.584
0.308
0.524
0.743
0.593
0.441
0.401
0.468
0.411
0.594
0.221
0.594
0.739
0.720
0.365

0.698
0.613
0.543
0.580
0.746
0.650
0.697
0.619
0.669
0.562
0.671
0.828
0.543
0.435
0.665
0.622
0.591
0.601
0.634
0.705
0.658
0.551
0.647
0.824
0.585
0.593
0.652
0.677
0.582
0.555
0.413
0.763
0.675
0.801
0.625

100

Table 7.5: Average pearson correlation coeﬃcients for Cα B factor prediction with FRI,
GNM and NMA for three structure sets from Park et al. [4] and a superset of 364 structures.
Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken
from the coarse-grained (Cα) results reported in Park et al. [4] MWCG results are parameter
free and use all C, N, and O to predict Cα. MWCG Results originally published in Bramer
et al [1].

PDB set MWCG opFRI[3] pfFRI[3]

Small

Medium

Large

Superset

0.921
0.795
0.775
0.803

0.667
0.664
0.636
0.673

0.594
0.605
0.591
0.626

GNM NMA[4]
0.541[4]
0.550[4]
0.529[4]
0.565 [3]

0.480
0.482
0.494
NA

Table 7.6: Pearson Correlation coeﬃcients for Cα, non Cα carbon, nitrogen, oxygen, and
sulfur using parameter free MWCG. Only 215 of the 364 proteins contain sulfur atoms.
MWCG results originally published in Bramer et al [1].

Subset
Average

No of proteins

Cα
0.803
364

Non Cα Carbon Nitrogen Oxygen Sulfur
0.903
215

0.789
364

0.744
364

0.812
364

Figure 7.7: CPU Eﬃciency comparison between GNM [3], RF, GBT, and CNN algorithms
for MWCG B factor prediction. Execution times in seconds (s) versus number of residues.
A set of 34 proteins, listed in Table 7.7, were used to evaluate the computational complexity.
Result originally published in Bramer et al [2].

101

Table 7.7: CPU execution times, in seconds, from eﬃciency comparison between GNM [3],
RF, GBT, and CNN. Results originally reported in Bramer et al [2]

N
PDB
125
3P6J
132
3R87
3KBE
140
1TZV 141
149
2VY8
152
3ZIT
2FG1
157
2X3M 166
3LAA 169
3M8J
178
2GZQ 191
4G7X 194
2J9W 200
3TUA 210
1U9C
221
3ZRX 221
3K6Y 227
3OQY 234
244
2J32
249
3M3P
1U7I
267
4B9G 292
4ERY 318
3MGN 348
2ZU1
360
412
2Q52
448
4F01
3DRF
547
637
3UR8
2AH1
939
1GCO 1044
1F8R 1932
1H6V 2927
1QKI
3912

GNM[3]

0.141
0.156
0.187
0.203
0.219
0.234
0.265
0.312
0.327
0.375
0.468
0.499
0.546
0.655
0.733
0.718
0.765
0.873
0.967
1.029
1.263
1.669
2.122
2.902
3.136
4.696
6.178
11.154
17.409
61.012
75.801
654.127
2085.842
6365.668

RF

0.000455
0.000464
0.000505
0.000473
0.000486
0.000519
0.000518
0.000526
0.000514
0.000548
0.000647
0.000631
0.000554
0.000602
0.000592
0.000654
0.000619
0.000619
0.000625
0.000621
0.000647
0.000693
0.000775
0.000655
0.000816
0.000900
0.001016
0.001131
0.001307
0.001716
0.001936
0.003343
0.005205
0.006261

GBT

0.000358
0.000339
0.000384
0.000365
0.000359
0.000365
0.000403
0.000382
0.000405
0.000412
0.000454
0.000445
0.000424
0.000472
0.000486
0.000515
0.000490
0.000502
0.000556
0.000525
0.000551
0.000574
0.000619
0.000552
0.000675
0.000750
0.000878
0.001033
0.001136
0.001605
0.001814
0.003163
0.004739
0.006198

CNN
0.130
0.138
0.149
0.163
0.156
0.148
0.174
0.182
0.155
0.178
0.195
0.209
0.208
0.217
0.198
0.216
0.189
0.211
0.225
0.220
0.237
0.256
0.289
0.267
0.337
0.369
0.401
0.512
0.583
0.800
0.905
1.745
2.543
3.560

102

Table 7.8: Average Pearson correlation coeﬃcients (PCC) both of all heavy atom and Cα only
B factor predictions for small-, medium-, and large-sized protein sets along with the entire
superset of the 364 protein dataset. Predictions of random forest (RF), gradient boosted
tree (GBT), and convolutional neural network (CNN) are obtained by leave-one-protein-
out (blind), while predictions of parameter-free ﬂexibility-rigidity index (pfFRI), Gaussian
network model (GNM) and normal mode analysis (NMA) were obtained via the least squares
ﬁtting of individual proteins. All machine learning models use all heavy atom information for
training. MWCG machine learning B factor prediction results originally reported in Bramer
et al [2].

Prediction Of Only Cα

Protein Set RF GBT CNN pfFRI [3] GNM [3] NMA [3]

Small

Medium

Large

Superset

0.25
0.47
0.50
0.49

0.39
0.59
0.57
0.57

0.53
0.55
0.62
0.66

0.60
0.61
0.59
0.63

0.54
0.55
0.53
0.57

0.48
0.48
0.49
NA

Prediction Of All Heavy Atom

Protein Set RF GBT CNN pfFRI [3] GNM [3] NMA [3]

Small

Medium

Large

Superset

0.44
0.59
0.62
0.59

0.49
0.64
0.65
0.63

0.56
0.62
0.68
0.69

NA
NA
NA
NA

NA
NA
NA
NA

NA
NA
NA
NA

Figure 7.8: Individual feature importance for the MWCG random forest model averaged
over the data set. Reported feature selection includes the use heavy atoms in the model.
Figure originally published in Bramer et al [2].

103

Table 7.9: Pearson correlation coeﬃcients for cross protein heavy atom blind MWCG B factor
prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional
neural network (CNN) for the small-sized protein set. Results reported use heavy atoms in
both training and prediction. Originally published in Bramer et al [2].

PDB ID N
235
1AIE
1AKG 108
345
1BX7
1ETL
76
1ETM 80
77
1ETN
477
1FF4
1GK7
321
1GVD 401
73
1HJE
138
1KYC
1NOT
96
142
1O06
203
1P9I
1PEF
153
109
1PEN
303
1Q9B
1RJU
257
1U06
432
1UOY 452
290
1USE
1VRZ
66
62
1XY2
1YJO
55
1YZM 361
386
2DSX
229
2JKU
2NLS
269
2OL9
51
6RXN 345

RF GBT CNN
0.62
0.60
0.70
0.41
0.63
0.55
0.48
0.27
0.46
0.48
0.20
0.33
0.76
0.55
0.53
0.72
0.71
0.66
0.37
-0.07
0.32
0.43
-0.18
0.63
0.65
0.51
0.77
0.73
0.60
0.76
0.21
0.34
0.75
0.41
0.71
0.73
0.61
0.55
0.55
0.55
0.68
0.25
0.38
0.09
0.55
0.16
0.02
0.36
0.51
0.56
0.56
0.36
0.35
0.57
0.70
0.45
0.65
0.84
0.82
0.56

0.53
0.51
0.67
0.03
0.13
0.25
0.59
0.73
0.69
0.46
0.30
0.81
0.64
0.77
0.64
0.24
0.67
0.75
0.68
0.56
0.50
-0.17
0.27
0.12
0.60
0.44
0.63
0.49
0.51
0.71

104

Table 7.10: Pearson correlation coeﬃcients for cross protein heavy atom blind MWCG B
factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolu-
tional neural network (CNN) for the medium-sized protein set. Results reported use heavy
atoms in both training and prediction. Originally published in Bramer et al [2].

PDB ID N
1ABA 728
1CYO 697
1FK5
626
1GXU 694
683
1I71
522
1LR7
1N7E
700
1NNX 674
1NOA 778
1OPD 642
1QAU 812
1R7J
729
1UHA 623
1ULR
677
1USM 631
17
1V05
746
1W2L
1X3O
622
771
1Z21
551
1ZVA
2BF9
287
735
2BRF
714
2CE0
2E3H
589
2EAQ 705
590
2EHS
721
2FQ3
2IP6
702
2MCM 735
2NUH 806
666
2PKT
2PLT
719
734
2QJL
2RB8
723
3BZQ 742
800
5CYT

RF GBT CNN
0.74
0.73
0.76
0.66
0.63
0.62
0.65
0.66
0.66
0.57
0.71
0.53
0.71
0.62
0.69
0.53
0.57
0.52
0.62
0.55
0.57
0.57
0.65
0.71
0.75
0.74
0.68
0.69
0.59
0.67
0.60
-0.20
0.69
0.62
0.53
0.63
0.63
0.63
0.58
0.59
0.39
0.70
0.86
0.76
0.90
0.62
0.38
0.70
0.63
0.58
0.38
0.55
0.76
0.67
0.62
0.64
0.60
0.71
0.19
0.64
0.76
0.06
0.62
0.70
0.42
0.61
0.42
0.61
0.60
0.43
0.74
0.68

0.77
0.68
0.71
0.67
0.62
0.70
0.65
0.73
0.57
0.60
0.58
0.70
0.80
0.71
0.78
0.02
0.68
0.52
0.66
0.56
0.52
0.78
0.65
0.73
0.61
0.71
0.75
0.67
0.73
0.72
0.17
0.67
0.60
0.64
0.61
0.70

105

Table 7.11: Pearson correlation coeﬃcients for cross protein heavy atom blind MWCG B fac-
tor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional
neural network (CNN) for the large-sized protein set. Results reported use heavy atoms in
both training and prediction. Originally published in Bramer et al [2].

N
PDB ID
1AHO
482
1ATG 1689
1BYI
1540
837
1CCR
1423
1E5K
863
1EW4
1IFR
878
1746
1NLS
1722
1O08
1PMY
937
874
1PZ4
934
1QTO
846
1RRO
1UKU
873
784
1V70
1542
1WBE
1WHI
937
1WPA
906
2AGK 1867
2C71
1446
536
2CG7
2CWS
1624
2HQK 1582
2HYK 1832
872
2I24
1564
2IMF
2PPN
701
1262
2R16
2V9V
986
2VIM 781
2VPA
1524
2VYO 1589
1948
3SEB
3VUB
787

RF GBT CNN
0.76
0.62
0.63
0.61
0.59
0.59
0.66
0.70
0.74
0.70
0.61
0.70
0.72
0.73
0.56
0.61
0.55
0.51
0.64
0.67
0.74
0.73
0.63
0.61
0.54
0.56
0.74
0.70
0.62
0.70
0.63
0.59
0.74
0.71
0.74
0.64
0.44
0.61
0.59
0.83
0.79
0.47
0.78
0.63
0.90
0.76
0.60
0.85
0.91
0.52
0.47
0.62
0.50
0.83
0.50
0.52
0.63
0.64
0.75
0.62
0.63
0.61
0.61
0.53
0.57
0.61
0.64
0.78

0.71
0.66
0.63
0.67
0.73
0.71
0.74
0.64
0.58
0.65
0.73
0.55
0.52
0.75
0.67
0.61
0.77
0.66
0.68
0.61
0.54
0.60
0.76
0.65
0.52
0.62
0.68
0.53
0.61
0.61
0.68
0.65
0.71
0.70

106

Figure 7.9: Average feature importance for the MWCG random forest model with the an-
gle, secondary, MWCG, atom type, protein size, amino acid, and packing density features
aggregated. Reported feature selection includes the use heavy atoms in the model. Figure
originally published in Bramer et al [2].

Table 7.13: ASPH and ESPH average Pearson correlation coeﬃcients Cα B factor predictions
for small-, medium-, and large-sized protein sets along with the entire superset of the 364
protein dataset. Gradient boosted tree (GBT), convolutional neural network, and consen-
sus(CON) results are obtained by leave-one-protein-out (blind). Predictions of parameter-
free ﬂexibility-rigidity index (pfFRI), Gaussian network model (GNM) and normal mode
analysis (NMA) were obtained via the least squares ﬁtting of individual proteins.

CNN GBT CON pFRI GNM NMA
0.63
0.48
0.48
Medium 0.60
0.49
0.58
0.60
NA

0.59
0.61
0.59
0.63

0.62
0.61
0.58
0.61

0.58
0.58
0.59
0.59

Small

Large

Superset

0.54
0.55
0.53
0.57

107

Table 7.14: ASPH and ESPH Pearson correlation coeﬃcients for cross protein Cα atom
blind B factor prediction obtained by boosted gradient (GBT), convolutional neural network
(CNN), and consensus (CON) for the small-sized protein set.

PDB ID N GBT CNN CON
0.78
0.29
0.76
0.55
0.43
0.13
0.64
0.82
0.69
0.9
0.66
0.8
0.95
0.74
0.82
0.44
0.67
0.58
0.45
0.76
0.12
0.54
0.81
0.57
0.7
0.36
0.66
0.29
0.94
0.61

1AIE
31
1AKG 16
51
1BX7
1ETL
12
1ETM 12
12
1ETN
65
1FF4
1GK7
39
1GVD 56
13
1HJE
15
1KYC
1NOT
13
22
1O06
29
1P9I
1PEF
18
16
1PEN
44
1Q9B
36
1RJU
1U06
55
1UOY 64
47
1USE
1VRZ
13
8
1XY2
1YJO
6
1YZM 46
2DSX
52
38
2JKU
36
2NLS
2OL9
6
6RXN 45

0.7
0.32
0.74
0.82
0.63
0.48
0.66
0.9
0.55
0.75
0.69
0.96
0.93
0.73
0.82
0.74
0.85
0.46
0.4
0.7
0.32
0.34
0.82
-0.06
0.64
0.34
0.71
0.47
0.85
0.6

0.75
0.27
0.74
0.37
0.37
0.07
0.61
0.77
0.71
0.84
0.62
0.69
0.94
0.73
0.79
0.36
0.59
0.6
0.44
0.72
0.05
0.54
0.79
0.7
0.69
0.34
0.57
0.23
0.94
0.59

108

Table 7.15: ASPH and ESPH Pearson correlation coeﬃcients for cross protein Cα atom
blind B factor prediction obtained by boosted gradient (GBT), convolutional neural network
(CNN), and consensus (CON) for the medium-sized protein set.

PDB ID N GBT CNN CON
87
1ABA
0.74
88
0.68
1CYO
93
0.61
1FK5
89
0.69
1GXU
83
1I71
0.56
73
0.64
1LR7
0.65
95
1N7E
93
1NNX
0.8
0.56
1NOA 113
0.41
1OPD
85
0.57
1QAU 112
1R7J
90
0.75
0.73
82
1UHA
0.56
1ULR
87
1USM 77
0.75
0.63
96
1V05
0.47
97
1W2L
1X3O
80
0.44
0.69
96
1Z21
0.71
75
1ZVA
0.58
35
2BF9
2BRF
103
0.75
0.64
109
2CE0
0.67
81
2E3H
2EAQ
89
0.61
0.65
75
2EHS
0.81
85
2FQ3
0.63
2IP6
87
2MCM 112
0.75
0.7
2NUH 104
-0.01
93
2PKT
2PLT
98
0.54
0.56
107
2QJL
0.7
93
2RB8
0.49
99
3BZQ
5CYT
103
0.39

0.71
0.7
0.6
0.68
0.58
0.61
0.58
0.79
0.53
0.34
0.59
0.77
0.74
0.53
0.72
0.64
0.5
0.43
0.65
0.7
0.79
0.77
0.66
0.68
0.63
0.67
0.82
0.66
0.77
0.56
-0.04
0.53
0.57
0.7
0.53
0.34

0.73
0.64
0.59
0.67
0.53
0.62
0.63
0.78
0.55
0.42
0.51
0.71
0.71
0.54
0.73
0.6
0.43
0.41
0.68
0.7
0.48
0.72
0.6
0.65
0.57
0.62
0.77
0.6
0.71
0.72
0.01
0.52
0.54
0.67
0.45
0.39

109

Table 7.16: ASPH and ESPH Pearson correlation coeﬃcients for cross protein Cα atom blind
B factor prediction obtained boosted gradient (GBT), convolutional neural network (CNN),
and consensus (CON) for the large-sized protein set.

PDB ID N GBT CNN CON
0.7
1AHO
66
0.55
1ATG 231
0.6
238
1BYI
1CCR
109
0.59
0.74
188
1E5K
0.61
106
1EW4
1IFR
113
0.7
0.57
238
1NLS
0.49
1O08
221
0.65
1PMY 123
1PZ4
113
0.77
0.54
1QTO 122
0.43
1RRO 108
1UKU 102
0.77
0.64
105
1V70
0.6
206
1WBE
1WHI
122
0.6
0.67
1WPA 107
0.67
2AGK 233
0.6
225
2C71
2CG7
110
0.32
0.6
2CWS
235
0.78
2HQK 232
2HYK 237
0.65
0.46
113
2I24
0.56
203
2IMF
0.63
122
2PPN
2R16
185
0.46
0.54
2V9V
149
0.47
2VIM 114
2VPA
217
0.71
0.63
2VYO 207
0.63
238
3SEB
3VUB
101
0.59

0.66
0.51
0.5
0.6
0.72
0.6
0.64
0.57
0.47
0.7
0.8
0.48
0.45
0.76
0.62
0.56
0.56
0.65
0.63
0.6
0.32
0.47
0.77
0.63
0.46
0.58
0.54
0.49
0.52
0.47
0.75
0.63
0.6
0.55

0.66
0.55
0.61
0.55
0.74
0.59
0.7
0.55
0.49
0.59
0.72
0.53
0.4
0.75
0.63
0.6
0.59
0.65
0.67
0.57
0.3
0.61
0.77
0.65
0.44
0.53
0.64
0.44
0.53
0.44
0.66
0.6
0.63
0.59

110

Table 7.17: ASPH and ESPH Pearson correlation coeﬃcients of least squares ﬁtting Cα B
factor prediction of small proteins using 11˚A cutoﬀ. Two Bottleneck (B) and Wasserstein
(W) metrics using various kernel choices are included.

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.96
0.87
0.82
1.00
1.00
1.00
0.76
0.94
0.66
1.00
1.00
1.00
0.98
0.89
0.96
0.94
0.71
0.72
0.55
0.73
0.64
1.00
1.00
1.00
0.90
0.55
0.88
0.69
1.00
0.76

31
1AIE
1AKG 16
51
1BX7
1ETL
12
1ETM 12
12
1ETN
65
1FF4
1GK7
39
1GVD 56
13
1HJE
1KYC
15
13
1NOT
22
1O06
29
1P9I
1PEF
18
16
1PEN
44
1Q9B
1RJU
36
1U06
55
1UOY 64
47
1USE
1VRZ
13
8
1XY2
1YJO
6
1YZM 46
52
2DSX
38
2JKU
2NLS
36
2OL9
6
6RXN 45

0.90
0.53
0.81
0.95
0.70
0.70
0.68
0.88
0.61
0.67
0.88
0.86
0.97
0.87
0.92
0.47
0.69
0.62
0.46
0.65
0.46
0.77
0.91
1.00
0.86
0.41
0.83
0.49
1.00
0.49

0.64
0.53
0.68
0.87
0.74
0.92
0.65
0.93
0.63
0.79
0.93
0.86
0.92
0.82
0.94
0.67
0.59
0.69
0.36
0.66
0.52
0.92
0.95
1.00
0.72
0.30
0.65
0.32
1.00
0.48

0.90
0.72
0.82
1.00
0.86
0.99
0.75
0.95
0.69
1.00
0.99
1.00
0.97
0.92
0.96
0.83
0.69
0.81
0.52
0.69
0.72
1.00
1.00
1.00
0.88
0.56
0.88
0.76
1.00
0.76

0.77
0.56
0.69
0.98
0.83
0.92
0.68
0.92
0.62
0.57
0.88
0.81
0.94
0.84
0.94
0.73
0.57
0.65
0.39
0.69
0.53
0.85
0.91
1.00
0.84
0.36
0.60
0.47
1.00
0.49

0.97
0.82
0.86
1.00
1.00
1.00
0.77
0.95
0.75
1.00
0.96
1.00
0.98
0.89
0.96
0.96
0.79
0.81
0.50
0.73
0.66
1.00
1.00
1.00
0.87
0.54
0.89
0.75
1.00
0.74

0.88
0.66
0.74
1.00
1.00
1.00
0.72
0.94
0.68
1.00
0.99
1.00
0.97
0.88
0.97
0.90
0.76
0.74
0.52
0.72
0.75
1.00
1.00
1.00
0.90
0.50
0.75
0.66
1.00
0.63

0.99
1.00
0.89
1.00
1.00
1.00
0.80
0.98
0.84
1.00
1.00
1.00
1.00
0.98
1.00
1.00
0.94
0.91
0.72
0.83
0.91
1.00
1.00
1.00
0.95
0.78
0.95
0.88
1.00
0.86

0.78
0.60
0.79
0.68
0.45
0.96
0.70
0.91
0.67
0.72
0.92
0.82
0.96
0.87
0.88
0.60
0.58
0.75
0.37
0.65
0.50
0.92
0.99
1.00
0.82
0.37
0.85
0.61
1.00
0.59

111

Table 7.18: ASPH and ESPH Pearson correlation coeﬃcients of least squares ﬁtting Cα B
factor prediction of medium proteins using 11˚A cutoﬀ. Two Bottleneck (B) and Wasserstein
(W) metrics using various kernel choices are included.

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
1ABA
0.70
87
0.67
1CYO
88
0.55
1FK5
93
1GXU
0.77
89
0.59
1I71
83
0.58
1LR7
73
0.73
1N7E
95
1NNX
0.86
93
0.59
1NOA 113
0.36
1OPD
85
1QAU 112
0.58
0.86
90
1R7J
0.73
82
1UHA
1ULR
87
0.61
0.65
1USM 77
0.65
96
1V05
0.69
97
1W2L
1X3O
80
0.67
0.72
96
1Z21
0.86
75
1ZVA
0.92
2BF9
35
0.75
103
2BRF
0.79
109
2CE0
0.78
81
2E3H
2EAQ
89
0.82
0.73
75
2EHS
0.78
85
2FQ3
2IP6
87
0.78
0.82
2MCM 112
0.80
2NUH 104
0.43
93
2PKT
2PLT
98
0.66
107
2QJL
0.51
0.81
93
2RB8
3BZQ
99
0.59
0.50
103
5CYT

0.62
0.58
0.50
0.61
0.46
0.55
0.68
0.79
0.57
0.21
0.55
0.76
0.68
0.50
0.53
0.61
0.63
0.60
0.63
0.78
0.65
0.71
0.73
0.69
0.72
0.71
0.75
0.58
0.77
0.63
0.35
0.59
0.46
0.75
0.55
0.46

0.56
0.65
0.49
0.69
0.38
0.46
0.54
0.81
0.53
0.29
0.55
0.81
0.67
0.44
0.61
0.52
0.56
0.62
0.64
0.83
0.89
0.72
0.71
0.56
0.77
0.69
0.68
0.64
0.75
0.75
0.36
0.52
0.41
0.74
0.47
0.43

0.76
0.78
0.71
0.82
0.76
0.71
0.80
0.88
0.72
0.57
0.66
0.91
0.82
0.68
0.81
0.72
0.79
0.72
0.82
0.94
0.97
0.76
0.86
0.82
0.86
0.81
0.82
0.82
0.85
0.85
0.69
0.72
0.63
0.84
0.69
0.65

0.67
0.71
0.53
0.75
0.44
0.61
0.67
0.84
0.63
0.35
0.59
0.88
0.70
0.56
0.62
0.67
0.72
0.66
0.70
0.85
0.94
0.74
0.77
0.66
0.81
0.75
0.78
0.72
0.80
0.77
0.44
0.66
0.45
0.81
0.57
0.53

0.67
0.69
0.59
0.78
0.66
0.62
0.71
0.84
0.65
0.29
0.61
0.86
0.75
0.53
0.61
0.66
0.72
0.66
0.73
0.85
0.73
0.73
0.79
0.71
0.77
0.73
0.76
0.66
0.80
0.74
0.39
0.63
0.52
0.78
0.62
0.52

0.68
0.68
0.58
0.75
0.56
0.63
0.72
0.83
0.63
0.36
0.58
0.87
0.74
0.59
0.66
0.65
0.69
0.65
0.64
0.92
0.78
0.74
0.80
0.76
0.81
0.74
0.79
0.73
0.81
0.81
0.55
0.67
0.50
0.80
0.61
0.54

0.63
0.59
0.50
0.72
0.58
0.56
0.63
0.81
0.57
0.19
0.57
0.79
0.69
0.50
0.58
0.61
0.61
0.64
0.69
0.81
0.71
0.72
0.77
0.69
0.76
0.71
0.75
0.64
0.77
0.66
0.36
0.59
0.49
0.76
0.55
0.48

0.54
0.66
0.49
0.72
0.41
0.57
0.54
0.81
0.60
0.26
0.57
0.83
0.69
0.49
0.57
0.60
0.60
0.62
0.61
0.84
0.70
0.74
0.75
0.62
0.79
0.72
0.75
0.67
0.78
0.73
0.39
0.57
0.42
0.78
0.50
0.49

112

Table 7.19: ASPH and ESPH Pearson correlation coeﬃcients of least squares ﬁtting Cα B
factor prediction of large proteins using 11˚A cutoﬀ. Two Bottleneck (B) and Wasserstein
(W) metrics using various kernel choices are included.

B & W

B

W

PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both
0.75
1AHO
66
0.51
1ATG 231
1BYI
238
0.54
0.63
109
1CCR
0.69
188
1E5K
1EW4
106
0.62
0.62
113
1IFR
0.82
238
1NLS
0.48
1O08
221
1PMY 123
0.71
0.88
1PZ4
113
0.56
1QTO 122
0.45
1RRO 108
0.80
1UKU 102
0.62
105
1V70
0.48
1WBE
206
0.55
1WHI
122
0.70
1WPA 107
0.67
2AGK 233
2C71
225
0.48
0.41
110
2CG7
0.55
2CWS
235
0.81
2HQK 232
0.60
2HYK 237
113
2I24
0.49
0.64
203
2IMF
2PPN
122
0.63
0.52
185
2R16
0.62
2V9V
149
0.40
2VIM 114
2VPA
217
0.74
0.70
2VYO 207
0.67
238
3SEB
3VUB
101
0.64

0.73
0.47
0.46
0.56
0.67
0.51
0.54
0.65
0.42
0.59
0.74
0.46
0.23
0.80
0.60
0.38
0.44
0.52
0.64
0.33
0.31
0.52
0.74
0.55
0.40
0.56
0.59
0.45
0.48
0.28
0.71
0.66
0.61
0.56

0.88
0.61
0.58
0.71
0.74
0.73
0.73
0.86
0.56
0.76
0.93
0.65
0.56
0.84
0.75
0.63
0.63
0.79
0.69
0.56
0.63
0.66
0.83
0.63
0.69
0.71
0.74
0.66
0.66
0.52
0.78
0.77
0.77
0.71

0.78
0.50
0.51
0.66
0.68
0.60
0.59
0.78
0.48
0.70
0.82
0.59
0.35
0.81
0.65
0.47
0.55
0.69
0.65
0.38
0.44
0.55
0.79
0.58
0.44
0.65
0.61
0.51
0.51
0.33
0.75
0.70
0.66
0.60

0.75
0.50
0.50
0.65
0.67
0.58
0.65
0.81
0.46
0.71
0.88
0.59
0.39
0.80
0.64
0.53
0.57
0.70
0.65
0.45
0.32
0.59
0.80
0.59
0.47
0.61
0.57
0.50
0.60
0.38
0.73
0.68
0.63
0.65

0.79
0.53
0.49
0.65
0.68
0.55
0.65
0.83
0.50
0.67
0.89
0.53
0.45
0.80
0.66
0.55
0.57
0.71
0.65
0.42
0.36
0.54
0.80
0.59
0.48
0.60
0.63
0.51
0.56
0.41
0.73
0.72
0.68
0.61

0.53
0.38
0.44
0.43
0.63
0.55
0.47
0.80
0.37
0.68
0.85
0.55
0.33
0.74
0.51
0.36
0.34
0.66
0.55
0.23
0.30
0.40
0.68
0.43
0.45
0.59
0.44
0.45
0.55
0.24
0.68
0.59
0.61
0.61

0.65
0.48
0.48
0.58
0.67
0.55
0.53
0.72
0.45
0.69
0.76
0.52
0.19
0.80
0.58
0.42
0.43
0.56
0.63
0.30
0.33
0.52
0.76
0.54
0.40
0.59
0.57
0.46
0.50
0.31
0.73
0.68
0.62
0.57

0.72
0.45
0.41
0.53
0.66
0.52
0.56
0.75
0.44
0.62
0.86
0.48
0.31
0.78
0.56
0.43
0.42
0.61
0.61
0.29
0.29
0.53
0.70
0.51
0.40
0.59
0.51
0.46
0.53
0.29
0.72
0.64
0.62
0.60

113

Table 7.20: ASPH and ESPH average Pearson correlation coeﬃcients of least squares ﬁtting
Cα B factor prediction of small, medium, large, and superset using 11˚A cutoﬀ. Two Bot-
tleneck (B) and Wasserstein (W) metrics using various kernel choices are included. Results
for pFRI are taken from Opron et al[3]. GNM and NMA value are taken from the course
grained Cα results reported in Park et al[4].

B & W

B

W

Exp Lor Both Exp Lor Both Exp Lor Both pFRI GNM NMA
0.87
0.48
0.48
Medium 0.68
0.49
0.61
0.65
NA

0.74
0.60
0.51
0.55

0.54
0.55
0.53
0.57

0.73
0.63
0.55
0.59

0.74
0.62
0.54
0.58

0.72
0.61
0.54
0.58

0.84
0.68
0.60
0.64

0.94
0.78
0.70
0.73

Small

Large

Superset

0.85
0.69
0.61
0.65

0.86
0.69
0.62
0.66

0.59
0.61
0.59
0.63

114

Chapter 8

Discussion

8.1 Element Speciﬁc Heat Maps

One useful application of the WCG method is the generation of element speciﬁc correlation

heat maps. These maps provide a two dimensional visualization of important secondary

and tertiary components of a given protein. Of course, maps of this kind are not new, for

example see Opron et al for past use. However, the correlation maps provided here are the

ﬁrst of their kind. Previous correlation maps have only considered Cα intereactions. The

maps provided here in Figures 7.1, 7.2, and 7.3 illustrate that the more general frame work of

the WCG method is a valid and useful approach to furthering our understanding of protein

structure and ﬂexibility. The results presented here generate correlation maps using PDB

ID 3TYS, 1AIE, and 3PSM to demonstrate the applicability of this approach.

Protein PDB ID 1AIE consists of a random coil attached to a single alpha helix. The

provided amine nitrogen and double bonded carboxyl correlation heat maps in Figure 7.1

clearly show the alpha helix as indicated by a thick band along the diagonal. This thick

band corresponds to the rigidity imposed by the local interactions of nearby residues within

the alpha helix.

Protein PDB ID 1KGM is made up of various random coils and two anti-parallel beta

sheets. The provided amine nitrogen and double bonded carboxyl correlation heat maps in

115

Figure 7.2 illustrate the interaction between residues in the anti parallel beta sheet with thick

bands perpendicular to the main diagonal. These perpendicular bands correspond physically

to the rigidity imposed by the interactions between the anti parallel beta sheets.

Protein PDB ID 5IIV presents two parallel alpha helices. The provided amine nitrogen

and double bonded carboxyl correlation heat maps in Figure 7.3 illustrate both the short

range interactions within a single alpha helix and interactions between alpha helices. The

two alpha helices are represented clearly as two distinct thick bands along the diagonal.

Thick oﬀ diagonal bands illustrate interactions between alpha helices. The diagrams also

illustrate the strength of each type of bonding. Bonding within an alpha helix is stronger

and thus the main diagonal of the correlation heat maps is warmer than the oﬀ diagonal

which corresponds to the weaker alpha helix to alpha helix interaction.

8.2 Hinge Detection

Figures 7.5-7.6 show the B factor prediction comparison of protein PDB ID 1CLL, 1WHI,

and 2HQK. Figure 7.5 clearly indicates GNM misses the hinge region present in calmodulin

(PDB ID 1CLL) around residue 75. The WCG method clearly agree with the experimental

results as indicated in the provided results. The MWCG method is also included in this re-

sult to demonstrate the ability of the MWCG method to capture multiple scales and improve

the overall B factor prediction. In the ribosomal protein L14 (PDB ID 1WHI) the results

demonstrate that WCG provides a more reliable prediction compared to GNM as seen in

Figure 7.4. In particular, GNM incorrectly predicts a large ﬂexible region around the 75th

residue that does not exist. Lastly, the engineered cyan ﬂuorescent protein mTFP1 (PDB

ID 2HQK) is also considered. Figure 7.6 shows that GNM predicts a highly ﬂexible region

116

incorrectly around the 60th residue whereas the WCG method agrees with the experimental

results of low ﬂexibility in that region. The results presented in this work demonstrate that

GNM consistently misses hinge regions and predicts hinge regions where none exist. Com-

paratively, the WCG method is more accurate than GNM, and MWCG the most accurate

of all the hinge prediction techniques studied here.

8.3 Fitting Models

8.3.1 MWCG

The MWCG method is used to predict B factor of a large and diverse set of over 300 proteins.

Results for Cα B factor prediction are provided in Tables 7.1-7.6, and 7.4. Results of protein

subsets of small, medium, and large sizes are considered in 7.1-7.6 and their overall average

Pearson correlation coeﬃcients are provided in 7.5. In all cases of Cα B factor prediction,

the MWCG method outperforms previously existing methods in terms of average Pearson

correlation coeﬃcient. The MWCG method is notable in that, averaged over the superset

of proteins, it provides a 19% improvement over the best existing method opFRI and a 42%

improvement over GNM. Table 7.6 provides results for B factor prediction of other heavy

elements such as non Cα carbon, nitrogen, oxygen, and sulfur atoms. This is also notable

because to date no other previous method has included B factor prediction of elements other

than Cα. These predictions also have a similar average correlation coeﬃcient to the Cα

results indicating the robustness of the model.

117

8.3.2 ASPH & ESPH

ASPH and ESPH methods are used for Cα only B factor using the same protein dataset

as MWCG. Results for Cα only B factor prediction are provided in Tables 7.17-7.20, and

7.21. Results of protein subsets of small, medium, and large sizes are considered in 7.17-

7.19 and their overall average Pearson correlation coeﬃcients are provided in 7.20. Overall

ﬁtting methods using the various ASPH and ESPH features performed similarly. The best

results came from using features generated by both exponential and lorentz kernels and

both Bottleneck and Wasserstein distances. Using both kernels, ASPH and ESPH distance

metrics resulted in an average correlation coeﬃcient of 0.73 for the superset.

8.4 Machine Learning Models

8.4.1 MWCG

Among the three methods considered for MWCG based B factor prediction, the convolutional

neural network method outperforms the boosted gradient tree and random forest by 10%

and 17%, respectively. As reported in Table 7.12, no machine learning method outperforms

any other method for every protein.

Compared to the deep CNN, the ensemble methods do not require as much parameter

tuning. The random forest method is the simplest and most robust with only one hyper-

parameter. Overall the boosted gradient tree method outperforms the random forest for

MWCG based B factor prediction for all data sets. To balance cost, time, and quality,

500 trees were used for the random forest and 1000 trees were used for the boosted gradient

method for the MWCG B factor prediction. It’s possible that this may account for the perfor-

118

mance diﬀerence between the boosted gradient tree method compared to the random forest.

Ensemble methods are generally robust to overﬁtting, and adding more features would likely

improve their results [42]. Moreover, boosted gradient trees use several hyperparameters so

the model could be improved by further tuning these hyperparameters. The image-like heat

map data used in the deep CNN provides additional data compared to the dataset used for

the ensemble methods. This very likely explains the improved performance as compared to

the ensemble methods. Providing more reﬁned images, and other novel image types, would

undoubtedly improve the results further but would come at a computational cost.

Applying several dropout layers prevents the CNNs from overﬁtting the data. Much like

the GBT the CNN contains several hyperparameters. Thus, the CNN model would beneﬁt

from more careful parameter tuning as well. Incorporating a larger dataset, more features,

and higher resolutions images would also improve the CNN performance.

In general the

results of the machine learning methods generated in this work could be further improved

by reﬁning features, exploring new features, and further tuning the hyperparameters.

8.4.2 ASPH & ESPH

Machine learning results for ASPH and ESPH can be found in 7.14-7.16 and 7.22. Overall

both GBT and CNN algorithms perform similarly. As expected, the CNN method out

performs the GBT with average correlation coeﬃcients over the superset of 0.60 and 0.59

respectively. The consensus method improves upon both results with an average Pearson

correlation coeﬃcient of 0.61 over the superset. Table 7.13 shows that the blind prediction

machine learning models perform better than ﬁtting models GNM and NMA and similar to

the pFRI ﬁtting model.

119

Chapter 9

Conclusions and Future Directions

9.1 Conclusions

Protein ﬂexibility and dynamics are important tools for understanding the function, con-

formational states, folding, binding, and molecular mechanisms of proteins.

It is a well

known paradigm that protein ﬂexibility strongly correlates with protein function. Protein

interactions span multiple interactions scales and their complexity and large number of de-

grees of freedom make quantitative understanding a great challenge. Molecular dynamics

oﬀers a useful tool but is limited in scope due to the computational cost involved with large

biomolecules or long time scales. Several successful time-independent methods have been

developed that provide good B factor analysis at low computational cost. Examples include

NMA [12, 10, 13, 11], ENM [15], GNM [18, 19, 50], and FRI methods [51, 3, 24, 22]. However,

none of these methods can blindly predict cross protein B factors of an unknown protein.

The guiding principle of this work is that intrinsic physics lies in a low-dimensional space

that is embedded in a high dimensional data space. Based on this, the results of this work

introduces graph theory based MWCG, ASPH, and ESPH[52, 53]. Moreover these methods

are combined with advanced machine learning techniques to provide models that provide

eﬃcient and accurate tools for protein ﬂexibility analysis and prediction. This work also

outlines methods to successfully blindly predict cross-protein B factors.

120

First, this work introduces WCGs that eﬃciently reduce the protein structural complex-

ity while accurately predicting protein ﬂexibility. This work shows that weighted colored

graphs are a useful and novel tool for investigating ﬂexibility and dynamics of proteins. In

section 7.2 the WCG approach was compared to experimental and GNM predicted B factor

results. Nitrogen-nitrogen and oxygen-oxygen element speciﬁc correlation heat maps were

constructed for several proteins using the WCG technique described in this work. As seen

in Figures 7.1-7.3 these maps provide a clear picture of the various secondary and tertiary

structures presented in these proteins.

The correlation heat maps presented demonstrate a fresh approach to representing protein

ﬂexibility and interactions visually. Previous approaches only use data from Cα atoms

whereas the WCG method allows previously unused protein PDB data to be utilized. This

provides a viable alternative method and makes such heat maps more robust as multiple

heat maps can be constructed for each residue using diﬀerent elements. Using double bonded

carboxyl oxygens and amine nitrogens the work presented here demonstrates generality of

the WCG approach. The WCG method introduced a unique opportunity for alternative

approaches and allows for redundancy since the method is able to make use of non Cα

atoms. This method can also include hydrogen atoms without any modiﬁcations, which

may prove useful in future work as empirical methods inevitably become more accurate and

robust.

Several proteins are tested to demonstrate the eﬃcacy of WCGs to predict hinge regions

of proteins. In this study we use proteins with well known ﬂexibility to compare the ability of

the GNM and the WCG method to predict ﬂexible residues. Figures 7.5-7.6 show the B factor

prediction comparison of protein PDB ID 1CLL, 1WHI, and 2HQK. The examples provided

in this work demonstrate that WCG is an improvement upon the commonly used method of

121

GNM for hinge prediction. In proteins calmodulin (PDB ID 1CLL) and ribosomal protein

(PDB ID 1WHI) the results show that prediction using GNM completely misses highly

ﬂexible hinge regions. The results using the engineered cyan ﬂuorescent protein (PDB ID

2HQK) show that GNM incorrectly predicts a highly ﬂexible region where none exists. In

all the cases tested in this work the WCG method was able to correctly capture all the hinge

regions and did not identify any false positive hinge regions. For further comparison the

MWCG ﬂexibility prediction is included in the calmodulin protein (PDB ID 1CLL) results

seen in Figure 7.5. This result highlights the predictive power of the multiscale information

that the MWCG method captures as seen with the excellent agreement with experimental

results. Overall these results demonstrate the WCG and MWCG methods are superior tools

to the commonly used GNM method in terms of the accuracy of hinge prediction for the

provided examples.

The WCG method is used to predict B factor of a large and diverse set of over 300

proteins. Results for Cα B factor prediction are provided in Tables 7.1-7.6, and 7.4. Results

of protein subsets of small, medium, and large sizes are considered in 7.1-7.6 and their

overall B factor averages are provided in 7.5.

In all cases of Cα B factor prediction, the

MWCG method outperforms previously existing methods. The MWCG method is notable

in that, averaged over the superset of proteins, it provides a 19% improvement over the best

existing method opFRI and a 42% improvement over GNM. Table 7.6 provides results for

B factor prediction of other heavy elements such as non Cα carbon, nitrogen, oxygen, and

sulfur atoms. This is also notable because to date no other previous method has included B

factor prediction of elements other than Cα. These predictions also have a similar average

correlation coeﬃcient to the Cα only prediction results indicating the robustness of the

model.

122

To capture the multiscale protein interactions that occur over several characteristic length

scales multiscale weighted colored graphs are constructed. The MWCGs are successfully

used to construct models by linear least square ﬁtting and a variety of machine learning

techniques.

Several machine learning approaches were considered in this work for blind cross protein B

factor prediction. In particular this work considered random forest, gradient boosting, and

a deep convolutional neural network machine learning models for MWCG based B factor

prediction. By using MWCG based features along with several local and global features this

work uses advanced machine learning approaches to blindly predict protein ﬂexibility and

B factors. Moreover, unlike previous methods, this approach is able to predict B factors of

any element the user desires provided 3D spatial coordinates are available. MWCG based

images were engineered for the deep convolutional neural network. Overall the MWCG

feature based deep convolutional neural networks provide the strongest predictive power in

terms of B factor prediction which is likely accounted for by the additional data provided by

the MWCG based image-like heat map features.

Several local, semi-local, and global features were included as machine learning features.

MWCGs capture local structural properties corresponding to the intrinsic ﬂexibility of the

given protein. X-ray crystallography resolution and R-value provide global structures that

allow the algorithms the ability to compare B factor across proteins. Packing density is a

semi-local feature that captures several protein interaction scales.

Ensemble methods include relative feature importance used in the model which is pro-

vided in Figures 7.8 and 7.9. As seen in the ﬁgures both local and global features play

an important role in the model. Overall the most meaningful global features are protein

resolution and surface accessible area. On average, the most meaningful local feature of the

123

random forest model was the set of 9 MWCG features with the carbon-carbon kernel having

most signiﬁcance. Machine learning models often suﬀer from the black box problem. That is,

once the model has trained, the user is unable to explicitly see how the model is determining

predictions. Feature importance provides important insight into the underlying mechanics

of the machine learning model. The feature importance results are in good agreement with

our expectations within the context of protein ﬂexibility analysis.

Both MWCG based ﬁtting and machine learning B factor prediction demonstrate that

MWCG based B factor prediction is more accurate in terms of Pearson correlation coeﬃcient

than previous ﬁtting based methods such as GNM and NMA. For B factor prediction of Cα

only atoms, the ﬁtting model provided a 20% improvement over the next best B factor

prediction method, opFRI, with a Pearson correlation coeﬃcient of 0.80 averaged over the

superset. The MWCG based deep CNN also outperformed opFRI, with a Pearson correlation

coeﬃcient of 0.66 averaged over the superset.

The working hypothesis is explored further by creating a B factor predictor using tools

from algebraic topology. To the author’s knowledge, this is the ﬁrst time persistent homol-

ogy has been used to predict the B factor of atoms in proteins. This is a novel approach

because topology is a global property and on its own cannot be directly used to describe

local atomic information. This unique approach allows a localized topological representation

to be constructed using a global mathematical tool. This approach accounts for multiple

spatial interaction scales and element speciﬁc interactions. These results demonstrate that

this is an accurate and robust topological approach.

This work introduces atom-speciﬁc topology and atom-speciﬁc persistent homology to

construct localized topological representations for individual atoms from global topological

tools. This approach works by constructing two conjugated sets of atoms. The ﬁrst set

124

of atoms is centered around the given atom of interest while the other set is identical but

excludes the atom of interest. To embed biological information into atom-speciﬁc persistent

homology, element-speciﬁc selections are implemented. The topological distance between

topological invariants generated from these conjugated sets of atoms provides a local topo-

logical representation of the atom of interest. To estimate the topological distances between

conjugated barcodes both Bottleneck and Wasserstein metrics are utilized. For topolog-

ical barcode generation, the Vietoris-Rips complex is employed. Atom-speciﬁc persistent

homology features are generated using several element-speciﬁc interactions, kernel choices,

parametrizations, and barcode distance metrics.

In this work ASPH and ESPH B factor prediction results are validated in two ways. First,

topological features are used to ﬁt protein B factors using linear least squares. The ﬁtting

model outperformed previous ﬁtting models with an average Pearson correlation coeﬃcient

of 0.73 over the superset of proteins. These results show that the method is comparable to

existing commonly used methods such as GNM and NMA.

Secondly, ASPH and ESPH features are used in machine learning models to blindly

predict protein B factors of Cα atoms. Two machine learning models are used, a gradient

boosted tree (GBT) and deep convolutional neural network (CNN). Additionally the Cα

predictions from the two models are averaged to generate a more robust consensus model.

In addition to the generated topological features, a variety of local and global features were

included. The blind prediction consensus model provided the best results, outperforming

both GNM and NMA ﬁtting models and produced results similar to those of the pFRI ﬁtting

model. These results demonstrate that this is a robust model that is more accurate than

existing GNM and NMA predictions. There are many other machine learning approaches

available and testing those approaches is certainly worth exploring. Moreover, these results

125

could easily be improved by including a larger dataset, ﬁne tuning parameters, and exploring

diﬀerent machine learning algorithms.

The proposed methods are tested and validated using a set of over 300 diverse proteins,

or more than 600,000 B factors. For all machine learning models, a leave-one-protein-out

approach is used to blindly predict protein B factors of all heavy atoms as well as only Cα

atoms.

The work presented in this study is a ﬁrst step using the recent advances in MWCG,

ASPH, and ESPH based machine learning techniques to blindly predict cross protein B

factors. These approaches are particularly notable compared to previous methods because

of their ability to blindly predict protein B factors across proteins.

The MWCG, ASPH, and ESPH based machine learning results provided in this work

are eﬃcient and accurate compared to previous methods. This work provides clear evidence

that machine learning approaches are useful and eﬃcient for protein ﬂexibility analysis.

Nonetheless, many new and compelling features can be implemented in future work. With-

out a doubt these results can be improved by experimenting with various advanced machine

learning approaches, larger datasets, and designing better mathematical descriptions of in-

trinsic ﬂexibility.

The methods presented here can be applied to a variety of interesting applications related

to molecules and biomolecules. Examples include allosteric site detection, hinge detection,

hot spot identiﬁcation, chemical shift analysis, atomic spectroscopy interpretation, and pre-

diction of protein folding stability changes upon mutation. More generally these methods

may be amenable to problems outside chemistry and biology such as network dynamics and

social network centrality measure.

126

9.2 Future Directions

This work provides a rich basis for further exploration of mathematical approaches to protein

analysis and ﬂexibility. The following sections provide several areas of potential future

research based on this work.

9.2.1 Software Development

To provide awareness and accessibility of the methods provided here, an online tool could

be developed consisting of PDB ﬁles from the PDB database or uploads of a compatible

structural ﬁle. Ideally users would be able to do any of the following:

• Choose MWCG based or ASPH/ESPH based models.

• Choose the number of kernels, type of kernel, and kernel parameterization.

• Choose element speciﬁc pairs and element speciﬁc heat maps.

• Choose machine learning algorithm and training features.

• Predict hinge regions based on user deﬁned B factor cutoﬀ.

• Predict the B factor of any atoms in a protein.

• Provide an interactive B factor colored 3D representation of the protein with down-

loadable image or gif ﬁles.

To host the website and run the required computations a server or cloud resources would be

required.

127

9.2.2 Inclusion of other datasets

The protein databank currently contains over 138,000 protein structures whereas the work

presented here used around 350 protein structures. The machine learning models presented

here would undoubtedly beneﬁt by including a larger dataset. More data would provide

better validation and a more general framework for protein B factor prediction. However,

there are enough proteins available at this time that even restricting to only Cα atoms would

result in a data set of roughly 116,610,000 data points. So care would need to be taken to

balance the amount of data used with the computational cost of training such models.

In addition to using larger datasets for training, models could be trained using more

speciﬁc data. For example datasets could be selected based on speciﬁc types of proteins such

as enzymes, structural proteins, signaling proteins, regulatory proteins, transport proteins,

sensory proteins, motor proteins, defense proteins, and storage proteins.

9.2.3 Speciﬁc applications in drug design and docking pose

The applications provided here demonstrate the validity of the proposed method. Future

work could apply the method to a variety of interesting problems. Drug design is an im-

portant and open problem where accurate and robust prediction of protein ﬂexibility and

dynamics are essential. Docking pose is another area where reliable B factor prediction

may improve existing methods. Molecular docking programs common modeling tools for

predicting ligand binding modes and structure based virtual screening.

128

9.2.4 Other general approaches

These methods could easily be developed to predict anisotropic B factors of a protein. Pairing

this method with a local or global Hessian would allow the Hessian matrix to be local or

global and by deﬁnition adaptive depending on the physical problem. Moreover, the methods

provided here could be used for the following related work.

• Integrate these methods to include genetic sequence information for a more compre-

hensive model.

• Predict protein ﬂexibility and dynamics from mutations.

• Investigation these tools as a general centrality measure in areas outside of biology.

• Investigation related topological data analysis techniques to understand proteins and

protein networks.

• Test other advanced learning approaches such as reinforcement learning and long short

term memory algorithms.

129

BIBLIOGRAPHY

130

BIBLIOGRAPHY

[1] D. Bramer and G.-W. Wei, “Multiscale weighted colored graphs for protein ﬂexibility
and rigidity analysis,” The Journal of Chemical Physics, vol. 148, no. 5, p. 054103, 2018.

[2] D. Bramer and G.-W. Wei, “Blind prediction of protein b-factor and ﬂexibility,” The

Journal of Chemical Physics, vol. 149, no. 13, p. 134107, 2018.

[3] K. Opron, K. L. Xia, and G. W. Wei, “Fast and anisotropic ﬂexibility-rigidity index
for protein ﬂexibility and ﬂuctuation analysis,” Journal of Chemical Physics, vol. 140,
p. 234105, 2014.

[4] J. K. Park, R. Jernigan, and Z. Wu, “Coarse grained normal mode analysis vs. reﬁned
gaussian network model for protein residue-level structural ﬂuctuations,” Bulletin of
Mathematical Biology, vol. 75, pp. 124 –160, 2013.

[5] U. Emekli, S. Dina, H. Wolfson, R. Nussinov, and T. Haliloglu, “HingeProt: automated
prediction of hinges in protein structures.,” Proteins, vol. 70, no. 4, pp. 1219–1227, 2008.

[6] S. Flores and M. Gerstein, “FlexOracle: predicting ﬂexible hinges by identiﬁcation of

stable domains,” BMC bioinformatics, vol. 8, no. 1, 2007.

[7] S. Flores, L. Lu, J. Yang, N. Carriero, and M. Gerstein, “Hinge atlas: relating protein

sequence to sites of structural ﬂexibility.,” BMC bioinformatics, vol. 8, 2007.

[8] K. S. Keating, S. C. Flores, M. B. Gerstein, and L. A. Kuhn, “StoneHinge: hinge
prediction by network analysis of individual protein structures,” Protein Science, vol. 18,
no. 2, pp. 359–371, 2009.

[9] M. Shatsky, R. Nussinov, and H. J. Wolfson, “FlexProt: alignment of ﬂexible protein
structures without a predeﬁnition of hinge regions,” Journal of Computational Biology,
vol. 11, no. 1, pp. 83–8106, 2004.

[10] N. Go, T. Noguti, and T. Nishikawa, “Dynamics of a small globular protein in terms
of low-frequency vibrational modes,” Proc. Natl. Acad. Sci., vol. 80, pp. 3696 – 3700,
1983.

[11] M. Tasumi, H. Takenchi, S. Ataka, A. M. Dwidedi, and S. Krimm, “Normal vibrations

of proteins: Glucagon,” Biopolymers, vol. 21, pp. 711 – 714, 1982.

[12] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. States, S. Swaminathan, and
M. Karplus, “Charmm: A program for macromolecular energy, minimization, and dy-
namics calculations,” J. Comput. Chem., vol. 4, pp. 187–217, 1983.

131

[13] M. Levitt, C. Sander, and P. S. Stern, “Protein normal-mode dynamics: Trypsin in-
hibitor, crambin, ribonuclease and lysozyme.,” J. Mol. Biol., vol. 181, no. 3, pp. 423 –
447, 1985.

[14] J. P. Ma, “Usefulness and limitations of normal mode analysis in modeling dynamics of

biomolecular complexes.,” Structure, vol. 13, pp. 373 – 180, 2005.

[15] M. M. Tirion, “Large amplitude elastic motions in proteins from a single-parameter,

atomic analysis.,” Phys. Rev. Lett., vol. 77, pp. 1905 – 1908, 1996.

[16] H. Goldstein, Classical Mechanics. Cambridge: Addison-Wesley, 1953.

[17] A. R. Atilgan, S. R. Durrell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Ba-
har, “Anisotropy of ﬂuctuation dynamics of proteins with an elastic network model.,”
Biophys. J., vol. 80, pp. 505 – 515, 2001.

[18] I. Bahar, A. R. Atilgan, and B. Erman, “Direct evaluation of thermal ﬂuctuations
in proteins using a single-parameter harmonic potential.,” Folding and Design, vol. 2,
pp. 173 – 181, 1997.

[19] I. Bahar, A. R. Atilgan, M. C. Demirel, and B. Erman, “Vibrational dynamics of pro-
teins: Signiﬁcance of slow and fast modes in relation to function and stability.,” Phys.
Rev. Lett, vol. 80, pp. 2733 – 2736, 1998.

[20] L. W. Yang and C. P. Chng, “Coarse-grained models reveal functional dynamics–I.
elastic network models–theories, comparisons and perspectives.,” Bioinformatics and
Biology Insights, vol. 2, pp. 25 – 45, 2008.

[21] K. L. Xia, K. Opron, and G. W. Wei, “Multiscale multiphysics and multidomain models

— Flexibility and rigidity,” Journal of Chemical Physics, vol. 139, p. 194109, 2013.

[22] K. Opron, K. L. Xia, Z. Burton, and G. W. Wei, “Flexibility-rigidity index for protein-
nucleic acid ﬂexibility and ﬂuctuation analysis,” Journal of Computational Chemistry,
vol. 37, pp. 1283–1295, 2016.

[23] D. D. Nguyen, K. L. Xia, and G. W. Wei, “Generalized ﬂexibility-rigidity index,” Jour-

nal of Chemical Physics, vol. 144, p. 234106, 2016.

[24] K. Opron, K. L. Xia, and G. W. Wei, “Communication: Capturing protein multiscale

thermal ﬂuctuations,” Journal of Chemical Physics, vol. 142, no. 211101, 2015.

[25] K. L. Xia, K. Opron, and G. W. Wei, “Multiscale Gaussian network model (mGNM) and
multiscale anisotropic network model (mANM),” Journal of Chemical Physics, vol. 143,
p. 204106, 2015.

132

[26] J. A. McCammon, B. R. Gelin, and M. Karplus, “Dynamics of folded proteins,” Nature,

vol. 267, pp. 585–590, 1977.

[27] M. Newman, Networks: an introduction. Oxford university press, 2010.

[28] C. Ambedkar, “Application of centrality measures in the identiﬁcation of critical genes

in diabetes mellitus,” Bioinformation, vol. 11, no. 2, pp. 90–5, 2015.

[29] W. Y. G. Y. . L. Y. Gao, S., “Understanding urban traﬃc-ﬂow characteristics: A
rethinking of betweenness centrality.,” Environment and Planning B: Planning and De-
sign, vol. 40, no. 1, p. 135153, 2013.

[30] A. Bavelas, “Communication patterns in task-oriented groups,” The Journal of the

Acoustical Society of America, vol. 22, no. 6, pp. 725–730, 1950.

[31] A. Dekker, “Conceptual distance in social network analysis,” Journal of Social Structure

(JOSS), vol. 6, 2005.

[32] A. Zomorodian and G. Carlsson, “Computing persistent homology,” Discrete Comput.

Geom., vol. 33, pp. 249–274, 2005.

[33] T. Schlick and W. K. Olson, “Trefoil knotting revealed by molecular dynamics simula-

tions of supercoiled DNA,” Science, vol. 257, no. 5073, pp. 1110–1115, 1992.

[34] D. W. Sumners, “Knot theory and DNA,” in Proceedings of Symposia in Applied Math-

ematics, vol. 45, pp. 39–72, 1992.

[35] I. K. Darcy and M. Vazquez, “Determining the topology of stable protein-DNA com-

plexes,” Biochemical Society Transactions, vol. 41, pp. 601–605, 2013.

[36] C. Heitsch and S. Poznanovic, “Combinatorial insights into rna secondary structure,
in N. Jonoska and M. Saito, editors,” Discrete and Topological Models in Molecular
Biology, vol. Chapter 7, pp. 145–166, 2014.

[37] O. N. A. Demerdash, M. D. Daily, and J. C. Mitchell, “Structure-based predictive
models for allosteric hot spots,” PLOS Computational Biology, vol. 5, p. e1000531,
2009.

[38] B. DasGupta and J. Liang, Models and Algorithms for Biomolecules and Molecular

Networks. John Wiley & Sons, 2016.

[39] X. Shi and P. Koehl, “Geometry and topology for modeling biomolecular surfaces,” Far

East J. Applied Math., vol. 50, pp. 1–34, 2011.

[40] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, “Stability of persistence diagrams,”

Discrete & Computational Geometry, vol. 37, no. 1, pp. 103–120, 2007.

133

[41] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko, “Lipschitz functions
have Lp-stable persistence,” Foundations of computational mathematics, vol. 10, no. 2,
pp. 127–139, 2010.

[42] Z. X. Cang and G. W. Wei, “Analysis and prediction of protein folding energy changes
upon mutation by element speciﬁc persistent homology,” Bioinformatics, vol. 33,
pp. 3549–3557, 2017.

[43] Z. X. Cang, L. Mu, and G. W. Wei, “Representability of algebraic topology for
biomolecules in machine learning based scoring and virtual screening ,” PLOS Computa-
tional Biology, vol. 14(1), pp. e1005929, https://doi.org/10.1371/journal.pcbi.1005929,
2018.

[44] K. L. Xia and G. W. Wei, “Persistent homology analysis of protein structure, ﬂexibility
and folding,” International Journal for Numerical Methods in Biomedical Engineering,
vol. 30, pp. 814–844, 2014.

[45] M. W. Wolpert, D.H., “No free lunch theorems for optimization,” IEEE Transactions

on Evolutionary Computation, vol. 1, no. 67, 1997.

[46] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[47] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE
transactions on pattern analysis and machine intelligence, vol. 20, no. 8, pp. 832–844,
1998.

[48] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals

of statistics, pp. 1189–1232, 2001.

[49] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Anal-

ysis, vol. 38, no. 4, pp. 367–378, 2002.

[50] B. Brooks and M. Karplus, “Harmonic dynamics of proteins: normal modes and ﬂuc-
tuations in bovine pancreatic trypsin inhibitor,” Proceedings of the National Academy
of Sciences, vol. 80, no. 21, pp. 6571–6575, 1983.

[51] K. L. Xia and G. W. Wei, “A stochastic model for protein ﬂexibility analysis,” Physical

Review E, vol. 88, p. 062709, 2013.

[52] D. Bramer and G. W. Wei, “Weighted multiscale colored graphs for protein ﬂexibility

and rigidity analysis,” Journal of Chemical Physics, vol. 148, p. 054103, 2018.

[53] D. Bramer and G. W. Wei, “Blind prediction of protein b-factor and ﬂexibility,” Journal

of Chemical Physics, vol. 149, p. 021837, 2018.

134