ALGEBRAIC TOPOLOGY AND GRAPH THEORY BASED APPROACHES FOR PROTEIN FLEXIBILITY ANALYSIS AND B FACTOR PREDICTION By David Bramer A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Mathematics - Doctor of Philosophy 2019 ABSTRACT ALGEBRAIC TOPOLOGY AND GRAPH THEORY BASED APPROACHES FOR PROTEIN FLEXIBILITY ANALYSIS AND B FACTOR PREDICTION By David Bramer Protein fluctuation, measured by B factors, has been shown to highly correlate to protein flexibility and function. Several methods have been developed to predict protein B factor as well as related applications. While many B factor methods exist, reliable B factor prediction continues to be an ongoing challenge and there is much room for improvement. This work introduces a paradigm shifting geometric graph based model called the multi- scale weighted colored graph (MWCG) model. The MWCG model is a new computational algorithm that greatly improves the current landscape of protein structural fluctuation anal- ysis. The MWCG model treats each protein as a colored graph where colored nodes corre- spond to atomic element types and edges are weighted by a generalized centrality metric. Each graph contains multiple subgraphs based on interaction types between graphic nodes. Protein rigidity is represented by generalized centralities of subgraphs. MWCGs predict B factors of protein residues and accurately analyze the flexibility of all atoms in a protein simultaneously. The MWCG model presented here captures element specific interactions across multiple scales and is a novel visual tool for identifying various protein secondary structures. This work also demonstrates MWCG protein hinge detection using a variety of proteins. Cross-protein prediction of B factors has previously been an unsolved problem in terms of B factor prediction. Many proteins are difficult to crystallize, and for some it is likely impossible, so models that can cross predict protein B factor are absolutely necessary. Using machine learning and the MWCG method, this work provides a robust cross protein B factor prediction using a set of known proteins to predict the B factors of a protein previously unseen to the algorithm. The algorithm connects different proteins using global protein features such as the resolution of the X-ray crystallography data. The combination of global and local features results in successful cross protein B factor prediction. To test and validate these results this work considers several machine learning approaches such as random forest, gradient boosted trees, and deep convolutional neural networks. Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. Persistent homology is rarely employed for analysis of atomic properties, such as protein flexibility analysis or B factor prediction. This work introduces atom specific per- sistent homology (ASPH) to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topolog- ical spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom specific topological representation of individual atomic properties in a molecule. Atom specific topological fea- tures are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural network for protein thermal fluctuation analysis and blind cross protein B factor prediction. Extensive numerical testing indicates the proposed methods provide novel and powerful graph theory and algebraic topology based tools for analyzing and predicting atom specific, localized protein flexibility information. TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . vii LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii KEY TO ABBREVIATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv Chapter 1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Chapter 2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2.1 Computing Protein Flexibility and Dynamics . . . . . . . . . . . . . . . . . . 2.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 3 Multiscale Weighted Colored Graphs . . . . . . . . . . . . . . . . 3.1 Weighted colored graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2 WCG Centrality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.3 Weighted Colored Graph Flexibility Analysis . . . . . . . . . . . . . . . . . . 3.4 Multiscale Weighted Colored Graph Flexibility Analysis . . . . . . . . . . . . 3.5 Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chapter 4 Atom Specific Persistent Homology . . . . . . . . . . . . . . . . . 4.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Simplex & Simplicial Complex . . . . . . . . . . . . . . . . . . . . . . . . . . 4.3 Homology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Filtration & Persistence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.5 Similarity and distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.6 Vietoris-Rips Complex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.7 Atom Specific Persistent Homology & Element Specific Persistent Homology 5.1.2 Neural Networks Chapter 5 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1 Machine Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1.1 Random forest . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.1.2 Gradient boosted trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.1.2.1 Convolutional Neural Network . . . . . . . . . . . . . . . . . 5.1.3 Consensus methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2 General Machine Learning Features . . . . . . . . . . . . . . . . . . . . . . . 5.2.1 Global features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.2.2 Local features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Image-like MWCG Features . . . . . . . . . . . . . . . . . . . . . . . 5.3 MWCG Features 5.3.1 7 8 11 13 13 15 16 17 19 22 22 24 25 27 27 29 29 34 35 35 36 37 37 38 39 39 40 41 42 43 iv 5.4 ASPH & ESPH Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.4.1 Image-like ASPH & ESPH Features . . . . . . . . . . . . . . . . . . . 5.4.2 Cutoff Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5 Machine Learning Model Parameters . . . . . . . . . . . . . . . . . . . . . . 5.5.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.1.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.1.2 Gradient Boosted Trees . . . . . . . . . . . . . . . . . . . . 5.5.1.3 Deep Convolutional Neural Network . . . . . . . . . . . . . 5.5.2 ASPH & ESPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.5.2.1 Gradient Boosted Trees . . . . . . . . . . . . . . . . . . . . 5.5.2.2 Deep Convolutional Neural Network . . . . . . . . . . . . . 5.6 Machine Learning Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5.6.1 Training set and test set . . . . . . . . . . . . . . . . . . . . . . . . . 44 45 47 48 48 48 49 49 51 52 52 54 54 Chapter 6 Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 7.4 Machine Learning Results Chapter 7 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.1 Visualization of Element Specific Correlation Maps 7.2 Hinge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.1 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.3.2 Fitting Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.4.1.1 Efficiency comparison . . . . . . . . . . . . . . . . . . . . . 7.4.1.2 Machine learning performance . . . . . . . . . . . . . . . . . 7.4.1.3 Relative feature importance . . . . . . . . . . . . . . . . . . 7.5 ASPH & ESPH B Factor Prediction . . . . . . . . . . . . . . . . . . . . . . . 7.5.1 Least Squares Fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . 7.5.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 59 59 66 66 67 72 72 73 74 80 81 81 82 Chapter 8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 8.1 Element Specific Heat Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . 115 8.2 Hinge Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 8.3 Fitting Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.3.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 8.3.2 ASPH & ESPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 8.4.1 MWCG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118 8.4.2 ASPH & ESPH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119 8.4 Machine Learning Models Chapter 9 Conclusions and Future Directions . . . . . . . . . . . . . . . . . 120 9.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 9.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 . . . . . . . . . . . . . . . . . . . . . . . . . . 127 Software Development 9.2.1 v 9.2.2 9.2.3 9.2.4 Other general approaches Inclusion of other datasets . . . . . . . . . . . . . . . . . . . . . . . . 128 Specific applications in drug design and docking pose . . . . . . . . . 128 . . . . . . . . . . . . . . . . . . . . . . . . 129 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 vi LIST OF TABLES Table 2.1: Notable molecular mechanic techniques and the year of introduction. . . . Table 3.1: Element pair combinations used in weighted colored graph. . . . . . . . . Table 3.2: Parameters used for correlation kernels in a parameter-free MWCG. Pa- . . . rameter optimization results originally published in Bramer et al [1]. Table 4.1: Topological invariants displayed as Betti numbers. Betti-0 represents the number of connected components, Betti-1 the number of tunnels or circles, and Betti-2 the number of cavities or voids. Two auxiliary rings are added to the torus to illustrate that Betti-1=2. . . . . . . . . . . . . . . . . . . . Table 5.1: The packing density distance parameters (d ˚A) used for generating short . . . . . . . medium, and long packing density machine learning features. Table 5.2: Correlation kernel parameters used to generate parameter-free MWCG ma- . . . . . chine learning features. Parameters based on previous results.[1] Table 5.3: Parameters used for topological feature generation. All features used a cut- off of 11˚A. Both lorentz (Lor) and exponential (exp) kernels and Bottleneck (B) and Wasserstein (W) distance metrics were used. . . . . . . . . . . . . Table 5.4: Parameters used for the element specific persistent homology features with . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . a cutoff of 11 ˚A. Table 5.5: Boosted gradient tree parameters used for testing MWCG based B factor prediction. These parameters were determined using a grid search. Any hyper parameters not listed below were taken to be the default values provided by the python scikit-learn package. MWCG based GBT machine . . . . learning prediction results originally published in Bramer et al [2]. Table 5.6: MWCG based deep Convolutional Neural Network (CNN) hyper-parameters used for testing. These hyper-parameters were determined using a grid search. Any hyper parameters not listed below were taken to be the de- fault values provided by python with the Keras package. MWCG machine learning prediction results originally published in Bramer et al [2]. . . . . Table 5.7: Boosted gradient tree parameters used for persistent homology based pre- diction testing. Parameters were determined using a grid search. Any hyper parameters not listed below were taken to be the default values provided by the python scikit-learn package. . . . . . . . . . . . . . . . . . . . . . . 11 14 20 23 42 43 46 48 49 51 52 vii Table 5.8: Convolutional Neural Network (CNN) parameters used for testing persis- tent homology based features. Parameters were determined using a grid search. Any hyper-parameters not listed below were taken to be the de- fault values provided by python with the Keras package. . . . . . . . . . . Table 7.4: Correlation coefficients for B factor prediction obtained by MWCG, optimal FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for a set of 364 proteins. GNM scores reported here are the result of tests with a processed set of PDB files as described in Chapter 2.2. MWCG results originally published in Bramer et al [1]. . . . . . . . . . . Table 7.12: Pearson correlation coefficients for cross protein heavy atom blind B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the Superset. Results reported use heavy atoms in both training and prediction. MWCG machine learning . . . . . . . . . . . . . . . results originally published in Bramer et al [2]. Table 7.21: Pearson correlation coefficients of persistent homology based least squares fitting Cα B factor prediction of all proteins using 11˚A cutoff. Two Bot- tleneck (B) and Wasserstein (W) metrics using various kernel choices are included. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table 7.22: Persistent homology based Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained by boosted gradient (GBT), convolutional neural network (CNN), and consensus method (CON) for the . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Superset. Table 7.1: Correlation coefficients for B factor prediction obtained by optimal FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for small-size structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Table 7.2: Correlation coefficients for B factor prediction obtained by optimal FRI (opFRI), parameter free FRI (pfFRI) and Gaussian normal mode (GNM) for medium-size structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 67 75 83 92 98 99 viii Table 7.3: Correlation coefficients for B factor prediction obtained by optimal FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for large-size structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 Table 7.5: Average pearson correlation coefficients for Cα B factor prediction with FRI, GNM and NMA for three structure sets from Park et al. [4] and a superset of 364 structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al. [4] MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG Results originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Table 7.6: Pearson Correlation coefficients for Cα, non Cα carbon, nitrogen, oxygen, and sulfur using parameter free MWCG. Only 215 of the 364 proteins contain sulfur atoms. MWCG results originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 Table 7.7: CPU execution times, in seconds, from efficiency comparison between GNM [3], RF, GBT, and CNN. Results originally reported in Bramer et al [2] . 102 Table 7.8: Average Pearson correlation coefficients (PCC) both of all heavy atom and Cα only B factor predictions for small-, medium-, and large-sized protein sets along with the entire superset of the 364 protein dataset. Predictions of random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) are obtained by leave-one-protein-out (blind), while predictions of parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM) and normal mode analysis (NMA) were obtained via the least squares fitting of individual proteins. All machine learning models use all heavy atom information for training. MWCG machine learning B factor prediction results originally reported in Bramer et al [2]. . . . . . . 103 Table 7.9: Pearson correlation coefficients for cross protein heavy atom blind MWCG B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the small-sized pro- tein set. Results reported use heavy atoms in both training and prediction. Originally published in Bramer et al [2]. . . . . . . . . . . . . . . . . . . 104 ix Table 7.10: Pearson correlation coefficients for cross protein heavy atom blind MWCG B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the medium-sized pro- tein set. Results reported use heavy atoms in both training and prediction. Originally published in Bramer et al [2]. . . . . . . . . . . . . . . . . . . . 105 Table 7.11: Pearson correlation coefficients for cross protein heavy atom blind MWCG B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the large-sized pro- tein set. Results reported use heavy atoms in both training and prediction. Originally published in Bramer et al [2]. . . . . . . . . . . . . . . . . . . . 106 Table 7.13: ASPH and ESPH average Pearson correlation coefficients Cα B factor pre- dictions for small-, medium-, and large-sized protein sets along with the entire superset of the 364 protein dataset. Gradient boosted tree (GBT), convolutional neural network, and consensus(CON) results are obtained by leave-one-protein-out (blind). Predictions of parameter-free flexibility- rigidity index (pfFRI), Gaussian network model (GNM) and normal mode analysis (NMA) were obtained via the least squares fitting of individual proteins. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 Table 7.14: ASPH and ESPH Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained by boosted gradient (GBT), convolu- tional neural network (CNN), and consensus (CON) for the small-sized protein set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108 Table 7.15: ASPH and ESPH Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained by boosted gradient (GBT), convolu- tional neural network (CNN), and consensus (CON) for the medium-sized protein set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109 Table 7.16: ASPH and ESPH Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained boosted gradient (GBT), convolutional neural network (CNN), and consensus (CON) for the large-sized protein set.110 Table 7.17: ASPH and ESPH Pearson correlation coefficients of least squares fitting Cα B factor prediction of small proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. 111 Table 7.18: ASPH and ESPH Pearson correlation coefficients of least squares fitting Cα B factor prediction of medium proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. 112 x Table 7.19: ASPH and ESPH Pearson correlation coefficients of least squares fitting Cα B factor prediction of large proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. 113 Table 7.20: ASPH and ESPH average Pearson correlation coefficients of least squares fitting Cα B factor prediction of small, medium, large, and superset using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. Results for pFRI are taken from Opron et al[3]. GNM and NMA value are taken from the course grained Cα results reported in Park et al[4]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 xi LIST OF FIGURES Figure 3.1: The average Pearson correlation coefficient (PCC) as found by optimizing individual kernels in the range of ηn = 1, . . . , 40. Parameter optimization results originally published in Bramer et al [1]. . . . . . . . . . . . . . . . Figure 4.1: From left to right an example of a 0-simplex, 1-simplex, 2-simplex, and 3-simplex. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (a) An example of 5 points in R2 and (b) the corresponding topological barcode. The length of each barcode corresponds to the persistence of . . . . . . . . each topological object (β0,β1,β2,etc..) over the filtration. Figure 4.2: Figure 4.3: Figure 4.4: Illustration of Atom-specific persistent homology point clouds. Top: the original point cloud. The atom of interest is at the center of the circle. Second row: a pair of conjugated sets of point clouds for atom-specific persistent homology. The rest: Four pairs of conjugated point clouds for atom-specific and element-specific persistent homology. . . . . . . . . . Illustration of residue 338 Cα atom-specific persistent homology in the CC element-specific point cloud of protein PDBID 1AIE. For this example residues 332-339 are used and are shown on the left. The Cα location used to generate the barcodes (right) is highlighted in red in the left chart. Conjugated persistence barcodes are generated with and without the selected Cα. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 24 28 30 33 Figure 5.1: An example of a perceptron, the basic functional unit of a neural network. 38 Figure 5.2: An illustration of a fully connected deep neural network. Circles repre- sent neurons and connections between neurons are indicated by arrows. Each connection has an associated weight. A neural network is considered . . . . . . . . . . . . . . . . . “deep” when it uses several hidden layers. Figure 5.3: Frequency of the number of heavy elements from the 364 protein dataset. Figure originally published in Bramer et al [2]. . . . . . . . . . . . . . . . Figure 5.4: Illustration of modified persistence diagrams used in distance calculations. (a) Unchanged. (b) Rotated 30◦. (c) rotated 60◦. Black dots are Betti-0 . . . . . . . . . . . . . . . . . . . events and triangles are Betti-1 events. Figure 5.5: Average Pearson correlation coefficient over the entire protein dataset fit- ting all 24 persistent homology features using various cutoff distances. . . Figure 5.6: The MWCG based deep convolutional neural network architecture used for B factor prediction. The plus symbol represents the concatenation of data sets. Figure originally published in Bramer et al [2]. . . . . . . . . . 38 41 45 47 50 xii Figure 5.7: The deep learning architecture using a convolutional neural network com- bined with a deep neural network to predict B factor using PH based features. The plus symbol represents the concatenation of features. . . . Figure 6.1: Workflow for procedure in MWCG feature construction. . . . . . . . . . Figure 6.2: Workflow for procedure in ASPH and ESPH feature construction. . . . . Figure 6.3: Workflow for procedure MWCG, ASPH, and ESPH based machine learn- ing B factor prediction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 7.1: Figure 7.2: Figure 7.3: Figure 7.4: Figure 7.5: (a) VMD representation of PBD ID 1AIE. (b) Correlation maps for nitrogen- nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 1AIE. The thicker band along the main diagonal of (b) and (c) corresponds to the alpha helix secondary structure in 1AIE. Figure originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (a) VMD representation of PBD ID 1KGM. (b) Correlation maps for nitrogen-nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for pro- tein 1KGM. The bands perpendicular to the main diagonal of (b) and (c) correspond to the anti parallel beta sheet present in 1KGM. Figure . . . . . . . . . . . . . . . . . . originally published in Bramer et al [1]. (a) VMD representation of PBD ID 5IIV. (b) Correlation maps for nitrogen- nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 5IIV. The presence of the two distinct thick bands along the main diagonal of (b) and (c) corresponds to the two alpha helices present in 5IIV. The off diagonal bands correspond to the bonding interaction between alpha . . . . . . . . . . helices. Figure originally published in Bramer et al [1]. (a) A visual comparison of experimental B factors , (b) WCG predicted B factors, (c) and GNM predicted B factors for the ribosomal protein L14 (PDB ID:1WHI). (d) The experimental and predicted B factor values plotted per residue. GNM represents predicted B factors using GNM with a cutoff distance of 7 ˚A. WCG is parametrized using CC, CN, CO kernels of the exponential type with fixed parameters κ = 1, and η = 3 ˚A. Figure originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . (a) The structure of calmodulin (PDB ID: 1CLL) visualized in Visual Molecular Dynamics (VMD)18 and colored by experimental B factors, (b) MWCG predicted B factors, (c) WCG predicted B factors, (d) and GNM predicted B factors with red representing the most flexible regions. Figure originally published in Bramer et al [1]. . . . . . . . . . . . . . . . 53 56 57 58 61 62 63 64 65 xiii Figure 7.5: (Continued) (e) The experimental (Exp) and predicted B factor values plotted per residue for PDB ID 1CLL. The GNM is for the GNM method with a cutoff distance of 7 ˚A. We see that GNM clearly misses the flexible hinge region. WCG is parametrized using CC, CN, CO kernels of the exponential type with fixed parameters κ = 1, and η = 3 ˚A. MWCG represents B factor predictions determined from the MWCG method using the fixed parameters listed in Table 3.2. Figure originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Figure 7.6: A visual comparison of experimental B factors (a), WCG predicted B factors (b), and GNM predicted B factors (c) for the engineered cyan fluorescent protein, mTFP1 (PDB ID:2HQK). (d) The experimental (Exp) and predicted B factor values plotted per residue for PDB ID 2HQK. The GNM is for the GNM method with a cutoff distance of 7 ˚A. WCG is parametrized using CC, CN, CO kernels of the exponential type with fixed parameters κ = 1, and η = 3 ˚A. Figure originally published in Bramer et al [1]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 97 Figure 7.7: CPU Efficiency comparison between GNM [3], RF, GBT, and CNN al- gorithms for MWCG B factor prediction. Execution times in seconds (s) versus number of residues. A set of 34 proteins, listed in Table 7.7, were used to evaluate the computational complexity. Result originally published in Bramer et al [2]. . . . . . . . . . . . . . . . . . . . . . . . . 101 Figure 7.8: Individual feature importance for the MWCG random forest model aver- aged over the data set. Reported feature selection includes the use heavy atoms in the model. Figure originally published in Bramer et al [2]. . . . 103 Figure 7.9: Average feature importance for the MWCG random forest model with the angle, secondary, MWCG, atom type, protein size, amino acid, and packing density features aggregated. Reported feature selection includes the use heavy atoms in the model. Figure originally published in Bramer et al [2]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 xiv KEY TO ABBREVIATIONS aFRI Anisotropic Flexibility Rigidity Index ANM Anisotropic Network Model ASPH Atom Specific Persistent Homology CNN Convolutional Neural Network DNN Deep Neural Network ESPH Element Specific Persistent Homology fFRI Fast Flexibility Rigidity Index FRI Flexibility Rigidity Index GBT Gradient Boosting Trees GNM Gaussian Network Model mFRI Multiscale Flexibility Rigidity Index MWCG Multiscale Weighted Colored Graph NMA Normal Mode Analysis NMR Nuclear Magnetic Resonance MD Molecular Dynamics PDB Protein Data Bank PH RF Persistent Homology Random Forest WCG Weighted Colored Graph xv Chapter 1 Overview X-ray crystallography is an impressive experimental tool that provides three dimensional (3D) spatial coordinates and thermal fluctuation data of atoms within a crystallized molecule in the form of a PDB data file. Using data contained in a protein PDB file, one can validate mathematical models to understand protein dynamics and flexibility. The protein data bank is massive, containing over 140,000 structures as of March 2019, with more structures submitted annually. Even with the solution of many protein structures, there is still an important need for robust and accurate mathematical models. Many important classes of proteins are difficult to crystallize and some may even prove to be impossible. Protein crystallization difficulty increases proportionally with the size of a protein. Highly flexible proteins represent another class of proteins that are difficult to crystallize due to their resistance in forming a crystal lattice structure. Other examples of proteins which are difficult to crystallize include small heat shock proteins, transmembrane and membrane proteins, and intrinsically disordered proteins. Heat shock proteins are an important class of proteins related to cardiovascular function, immunity, and cancer. Transmembrane and Membrane proteins are targets for the majority of modern drugs. Intrinsically disordered proteins are also vitally important to understand as they have been implicated in a number of diseases such as Bovine Spongi- form Encephalopathy (mad cow disease), Creutzfeldt-Jakob disease, Alzheimer’s disease, and 1 Parkinson’s disease. In this work new and efficient methods for protein analysis are introduced that improve upon existing methods in several ways. These methods are the first protein B factor pre- diction methods to incorporate additional protein information from non-Cα atoms in the form of element specific interaction pairs. Moreover, this work introduces methods that are entirely new to B factor prediction. These methods are capable of successful cross protein B factor prediction using only information from other proteins. The methods presented use advanced graph theory based techniques, machine learning algorithms, and the first known topological data analysis based persistent homology method, to successfully analyze protein flexibility and dynamics. Lastly, the methods provide the best predictive results, to date, for both protein B factor prediction within a protein and cross protein B factor prediction. The results are validated through extensive testing on a large and diverse set of proteins. Using these methods many protein analysis tools can be constructed. In addition to the protein B factor prediction, several applications of these methods are provided in this work. Examples include hinge detection, element specific protein correlation maps, and protein model relative feature importance ranking. This work first introduces an efficient and accurate advanced graph theory based mul- tiscale weighted colored graph (MWCG) method for analyzing protein flexibility and dy- namics. The weighted colored graph (WCG) theory is based on the hypothesis that the most fundamental properties of proteins are determined by the geometric structure of the protein. The WCG method does not require costly matrix diagonalization like other com- monly used methods such as Normal Mode Analysis (NMA) and Gaussian Network Model (GNM). Given a protein of N atoms, the computational complexity of the WCG method is approximately O(N 2) whereas methods like Normal Mode Analysis and Gaussian Network 2 Model are O(N 3) due to the fact that they require diagonalization of a large matrix. Next a multiscale formulation of the WCG method is introduced to incorporate the multiscale interactions that occur within a protein into the model. Protein interactions take place over a variety of different scales, so any reliable model should take this property into account. To reduce computational complexity, most elastic network models include a pre- defined cutoff distance. However, the computational cost saved by using a cutoff in ENM incurs a cost in the overall accuracy of such models. By prescribing a distance based cutoff these models fail to capture protein interactions that take place across multiple characteristic length scales. The MWCG model was developed to capture the multiscale behavior of protein interactions. To capture various interaction scales within a protein the MWCGs used in this work were parameterized using three correlation kernels parameterized at different length scales. However the method is general and adaptable, so the number of correlation kernels can be adjusted to fit the users performance needs. To test the efficacy of the WCG approach, the method is tested on a set of over 300 protein structures taken from X-ray crystallography data provided by the Protein Data Bank. The accuracy of B factor prediction between MWCG is compared to the most commonly used approaches, parameter-free FRI (pfFRI), NMA, and GNM. Averaged over a large and diverse set of over 300 proteins the results demonstrate a significant improvement. Averaged over the entire protein test set, the MWCG method is over 28% more accurate than the best previous method opFRI and 42% more accurate than GNM. To further demonstrate the utility of the WCG method, applications such as element specific protein heat maps and hinge detection visualizations are included. Accurate identification of hinge regions and hinge motion is an important topic that has been highly studied[5, 6, 7, 8, 9]. Hinge residue detection is integral for molecules that are 3 too large for MD simulation over meaningful time scales. In the past, methods such as GNN and NMA have been used to detect hinges for proteins where MD is intractable. This work compares the ability of GNM, WCG, and MWCG methods to identify the hinge regions of several proteins. The work demonstrates several instances where WCG and MWCG accurately identify hinge regions where GNM at the same time fails to do so. This highlights the overall efficacy of this method and the multiscale behavior captured by MWCG. Element specific correlation maps provide a new way to visualize secondary and tertiary protein structure using a two dimensional (2D) image where flexibility is represented by the color of each pixel of the image. These correlation maps have been introduced in the past for Cα atoms [3]. In this work we introduce more general element specific correlation maps. Examples of nitrogen-nitrogen and oxygen-oxygen element interaction correlation maps are provided for several proteins. This demonstrates the adaptability of the WCG and MWCG methods presented here. The provided examples clearly demonstrate important secondary structures such as alpha helix and beta sheets as well as their primary and secondary inter- actions. Previous protein B factor prediction methods are not capable of accurate prediction of B factor across proteins. The MWCG method, along with other engineered features are used to create machine learning based B factor prediction models. The model captures various interaction scales within an individual protein. To capture distinctions between proteins other global features such as protein resolution are included as feature inputs. The machine learning algorithms used in this work are trained using nine MWCG kernels with various parameterizations. Other local and global features are also included to improve the robustness of the feature set. The algorithms were trained using leave-one-protein-out cross validation, where the algorithm trains on all protein data except the protein of interest, 4 then the test set is taken to be the protein of interest. Extensive numerical testing indicates that the MWCG cross B factor predictions obtained are more accurate than any B factor prediction using existing traditional methods. The approach introduced here is particularly notable because it accurately predicts cross-protein B factors. In recent years topological data analysis (TDA) has been successfully applied to protein analysis in a variety of areas. The basic idea of TDA is to use tools from topology to analyze high dimensional datasets that may be noisy or incomplete. Techniques from TDA reduce the dimensionality of the dataset, and allow the user their choice of metric. These techniques are a good fit for protein analysis, where one wants to infer high dimensional structure from low dimensional representations, capture multiple scales, and assemble discrete point data into a global structure. The point cloud of 3D spatial coordinates provided for proteins in the protein databank PDB files can be converted into a family of simplicial complexes. These simplicial complexes are indexed by a proximity parameter. Then, converting the dataset into global topological objects, tools from algebraic topology can be applied for protein analysis. Persistent homology theory allows the persistent homology of a filtered simplicial com- plex to be uniquely represented with a barcode. In this work protein data is encoded into a barcode by taking a filtration over simplicial complexes that have been constructed from ele- ment specific protein spatial data. The protein barcodes provide global invariant topological features of the protein. By comparing two related barcodes for each atom of interest, this technique can be used to predict local atomic flexibility. The two barcodes are constructed such that one barcode is constructed using a point cloud that includes the atom of interest, and another is constructed using the same point cloud but without the atom of interest. The similarity or difference between barcodes is compared using bottleneck or wasserstein met- 5 rics. This provides atomic specific persistent homology protein flexibility analysis. Including various element interaction pairs one may also generate element specific persistent homology (ESPH) to capture element specific interactions. To the author’s knowledge no previous protein flexibility models have used persistent homology to predict B factors of atoms in a protein in this way. In this work ASPH and ESPH features are generated for each Cα of a given protein. However, this method is a general framework that can be applied to any element in a protein, including hydrogen. The method allows for several parameterizations that can be tuned by the user. To validate this approach several PH features are generated and used to fit B factor prediction models using linear least squares fitting. Later, the features are used with machine learning techniques. Both cases are validated using a large and diverse data set of proteins from the protein data bank. These results provide good predictions that are comparable to the aforementioned MWCG results. 6 Chapter 2 Background Currently Nuclear Magnetic Resonance (NMR) Spectroscopy and X-ray crystallography are the two major experimental techniques used for protein dynamics and flexibility analysis. Techniques for NMR were previously very challenging but are now becoming more routine. At a basic level, NMR works by mapping the magnitude or intensity of magnetic resonance signals as a magnetic field is applied to a protein sample. X-ray crystallography determines protein structure by measuring the diffraction patterns of an intense beam of X-rays of a crystallized protein. The crystal is rotated many times and a with each rotation a new set of diffraction patterns is collected. After tens of thousands of rotations, the data is combined and computationally processed into a final atomic arrangement known as the protein crystal structure. At the time of this dissertation, over 90% of the protein data bank (PDB) files have been solved using X-ray crystallography while less than 10% have been solved using NMR. Unlike X-ray crystallography, NMR results do not provide atomic flexibility information. In contrast, X-ray crystallography data includes flexibility information in the form of atomic B factor (temperature factor, B value, or Debye-Waller factor), which is a measurement of the X-ray scattering of atoms or groups of atoms in a protein. Atomic B factor has been observed to correlate with atomic flexibility from Molecular dynamics (MD) and Normal mode analysis (NMA) experiments thus it provides a good experimental gold standard to 7 compare theoretical methods. 2.1 Computing Protein Flexibility and Dynamics Many methods exist for studying protein structure and function; however, there is room for substantial improvement. Algorithms which require X-ray crystallography are limited by the availability of previously crystallized proteins. Surely the protein databank will continue to grow as scientists crystallize proteins with ever increasing efficiency. However, for many types of proteins, crystallization is very difficult or impossible. This calls for new approaches to theoretical protein analysis. MD simulation is one method for protein analysis that has made a serious contribution to our understanding of the conformational landscapes of proteins. It has been particularly helpful in understanding proteins that are difficult to study experimentally such as amyloid fibrils, intrinsically disordered proteins, and partially disordered proteins. Even so, the dy- namics of large proteins generally takes place over long time scales that are inaccessible to modern MD simulations. MD simulations are computationally intractable for larger macro- molecules and in systems of multiple molecules as the time scales required are unreasonable for current technology. As such MD continues to be limited to systems of low complexity due to the methods high degree of freedom. To address the limitations of time-dependant MD approaches several time independent approaches to protein dynamics and flexibility analysis have been developed. NMA was one of the first successful time-independent methods used for protein analysis[10, 11, 12, 13, 14]. NMA achieves time-independence by adopting an interaction Hamiltonian based on pro- tein molecular mechanics. In this approach bond lengths and angles are fixed, and NMA 8 is computed by the diagonalization of a Hamiltonian on an energy minimized structure. Normal modes are the orthogonal resonant patterns of the molecular mechanic system. A superposition of the normal modes provides the collective motion of the protein. Low fre- quency modes correspond to cooperative motions and are meaningful in applications like hinge detection and MD where slow, collective motion is relevant. The transition pathways of macromolecules are also highly correlated with the low-frequency modes of NMA[14]. NMA provides good coarse grained deformation motion of supramolecular complexes. The success of NMA has resulted in several related methods that improve the computational cost and quality of the generated results. The elastic network model (ENM) was proposed in 1996 as a simplified NMA approach[15]. The ENM is based on a statistical mechanics approach where a molecule is treated as a sys- tem of N nodes with each node corresponding to an atom or residue within the molecular network[16]. This approach provides good prediction of global motions but does not re- liably predict local motion and requires costly diagonalization of the large corresponding Hessian matrix. The Anisotropic network model (ANM) was model introduced using the ENM framework to account for 3D directionality. The ANM uses a spring network with a simple spring potential between Cα atoms[17]. Given N atoms, ANM requires a 3N × 3N matrix Diagonalization of the resulting Hessian. This provides the modes of the system that correspond to cooperative motions. Lower eigenvalue and eigenvectors can be used to estimate protein flexibility. In ANM all springs use the same force constant. The ANM provides good insight into the protein dynamics at a lower computational cost than other normal mode analysis based methods. The Gaussian network model (GNM) is a related ENM developed around the same time as ANM that provides a good course grained, isotropic, low cost approach[18, 19]. In GNM 9 the Hessian is replace by a Kirchoff matrix. The diagonalization of the Kirchoff matrix gives rise to eigenmodes and eigenvalues for describing protein fluctuations that correspond to B factors. GNM is both accurate and efficient compared to other previous approaches[20]. To bypass costly large matrix diagonalization the flexibility-rigidity index (FRI) was more recently introduced[21, 3, 22]. FRI is a mathematical method based on geometric graphs, that makes use of protein graph connectivity and node centrality to analyze protein flexibility. The method is based on the hypothesis that protein interactions and protein structure are inextricably linked in a given environment. That is, protein flexibility and function are determined by protein structure and environment. Since the FRI approach is not based on molecular mechanics it does not require a protein interaction Hamiltonian like those used in spectral graph theory, to analyze protein flexibility. The FRI approach works well as long as the accurate structure of the protein and its environment is known. As such FRI is restricted to proteins with solved 3D X-ray crystal structures. The FRI method provided a significant improvement in computational speed compared to previous protein analysis methods. The first FRI method [21] is of computational complexity O(N 2) [21]. Later fast FRI (fFRI) [3] was introduced to reduce computational cost further. The fFRI method is of computational complexity O(N ). Anisotropic FRI (aFRI) [3] and generalized FRI (gFRI) [23] have also since been developed. To capture the multiscale interactions seen in macromolecules the multiscale FRI (mFRI) method was introduced[24]. Compared to GNM, the mFRI algorithm was shown to be approximately 20%, more accurate averaged over a large and diverse set proteins [24]. The fFRI algorithm was shown to be significantly faster than GNM[3]. Generalized GNM (gGNM), generalized ANM (gANM), multiscale GNM (mGNM), and multiscale ANM (mANM) methods have been recently constructed using FRI matrices [25]. These generalized algorithms provide major improvements to the 10 accuracy of original algorithms for protein flexibility analysis. A summary of when the different approaches to protein flexibility and dynamics were first introduced is provided in Table 2.1. Table 2.1: Notable molecular mechanic techniques and the year of introduction. Molecular Mechanics Technique Molecular Dynamics (MD) Normal Mode Analysis (NMA) Elastic Network Model (ENM) Gaussian Network Model (GNM) Anisotropic Network Model (ANM) Flexibility Rigidity Index (FRI) Year of Introduction 1977[26] 1982[11] 1996[15] 1996[18] 2001[17] 2014[3] While the previous methods provide good results, there is still room for significant im- provement. The average pearson correlation coefficient of the B factor predictions of the aforementioned methods is generally below 0.7. Knowing the importance of protein flexi- bility analysis, it is crucial to improve these results. Moreover the above methods do not provide satisfactory results when predicting cross protein B factor. Given the the many classes of proteins with no X-ray crystal structure this is an important problem with no existing reliable solutions. 2.2 Data Two data sets are used for testing and validation in this work: one from Refs. [3, 24] and the other from Park, Jernigan, and Wu [4]. The first data set contains 364 proteins [3, 24], and the second contains 3 subsets of small, medium, and large sized proteins [4]. All protein PDB structures have a resolution of 3 ˚A or higher and an average resolution of 1.3 ˚A. The PDB data sets include proteins that range in size from 4 to 3912 residues [4]. This work excludes protein 1AGN due to known data issues. Proteins 1NKO, 2OCT, and 3FVA are 11 also excluded as these proteins have PDB files with residues whose B factors are reported as zero which is nonphysical. For all machine learning results provided in this work, the STRIDE software is unable to provide the required secondary features for proteins 1OB4, 1OB7, 2OLX, and 3MD5 so these also excluded. 12 Chapter 3 Multiscale Weighted Colored Graphs 3.1 Weighted colored graphs For this approach, each protein is considered to be a network in the form of a mathematical graph. That is, a protein a network where atoms represent nodes or vertices of the graph and edges are weighted connections between nodes that are determined by a distance based radial function. Colored graphs are constructed based on heavy element (carbon, nitrogen, oxygen, sulfur) interaction pairs. Provided it is available, one may even include hydrogen atoms. Hydrogen atoms have a high degree of uncertainty, and cannot be accurately measured by X-ray crystallography so we exclude them from this work. A graph is denoted as G(V, E) where V represents a set of nodes called vertices and E the set of edges of the graph that relate vertices pairwise. This work defines a protein network to be a graph whose nodes and edges have specific attributes corresponding to the protein. In particular, individual atoms correspond to graph nodes, and the edges to a distance based correlation metric. This approach makes sense from a biophysical point of view since interaction strength is inversely proportional to distance. Further, many existing B factor prediction methods use three- dimensional (3D) networks of spatial atomic coordinate data from the protein databank. The most basic component of this method is a weighted colored graph. A WCG converts 3D geometric protein spatial information, provided as atomic coordinates by a PDB data 13 file, into a protein connectivity network. All existing previous methods only take Cα atoms into consideration when constructing graph theoretic approaches. However, in this work all N atoms in a protein are considered. Given the colored graph G(V, E), the ith atom is labeled by its element type αj and position rj and thus V = {(rj, αj)|rj ∈ IR3; αj ∈ C; j = 1, 2, . . . , N}, where C ={C, N, O, S } is the set containing the chosen element types of interest in a protein. The set of edges, P, in a colored protein graph is defined to be the set of all element specific pairs of C. This choice of C results in 16 element directed interaction pairs. Table 3.1 illustrates the 16 possible element interaction pairs. For this work P is defined to be Table 3.1: Element pair combinations used in weighted colored graph. C N O S C CC CN CO CS N NC NN NO NS O OC ON OO OS S SC SN SO SS P = {CC, CN, CO, CS, NC, NN, NO, NS, OC, ON, OO, OS, SC, SN, SO, SS}. For example, the subset P3 ={CO} contains all directed CO pairs in the protein such that the first atom is a carbon and the second one is a oxygen. Mathematically, E is the set of weighted directed edges describing the potential interaction pairs of atoms given by E =(cid:8)Φk(||ri − rj||; ηij)(cid:12)(cid:12)(αiαj) ∈ Pk; k = 1, 2, . . . , 16; i, j = 1, 2, . . . , N(cid:9), (3.1) 14 where ||ri − rj|| is defined to be the Euclidean distance between the ith and jth atoms, ηij a characteristic distance between the atoms, and (αiαj) a directed pair of element types. In this work Φk is a correlation function with the following properties [3] Φk(||ri − rj||; ηij) = 1, as ||ri − rj|| → 0 (αiαj) ∈ Pk, Φk(||ri − rj||; ηij) = 0 as ||ri − rj|| → ∞, (αiαj) ∈ Pk. (3.2) (3.3) Previous work by Opron et al[3] has shown that generalized exponential functions of the form, Φk(||ri − rj||; ηij) = e −(||ri−rj||/ηij )κ (αiαj) ∈ Pk; , κ > 0, (3.4) and generalized Lorentz functions of the form, Φk(||ri − rj||; ηij) = 1 1 + (||ri − rj||/ηij)ν , (αiαj) ∈ Pk; ν > 0, (3.5) are good choices for correlation functions that satisfy the above properties. 3.2 WCG Centrality Given a graph, centrality provides a measure of the importance of a node. Centrality is an important concept in graph theory that has a wide variety of applications including social network analysis, identification of critical genes, traffic flows, and epidemics[27, 28, 29]. There are several types of centrality measures. For example, the normalized closeness centrality [30] of node ri is defined as 1(cid:80) j ||ri − rj|| 15 and the Harmonic centrality [31] of node ri in a connected graph is defined as 1 ||ri − rj||. (cid:88) j In this work the notion of Harmonic centrality is extended to subgraphs with weighted edges defined by generalized correlation functions. The generalized centrality metric used in this work is defined as µk i = N(cid:88) j=1 wijΦk(||ri − rj||; ηij), (αiαj) ∈ Pk, ∀i = 1, 2, . . . , N, (3.6) where wij is a weight function related to the element type. The WCG centrality in Equation (3.6) provides the atom specific rigidity index of the ith atom. This is a measure of the stiffness of the ith atom that corresponds to the kth set of contact atoms. 3.3 Weighted Colored Graph Flexibility Analysis Given a rigidity index, its reciprocal function provides a corresponding measure of flexibility, or flexibility index. Thus the general flexibility index on subgraphs is given by f k i = , 1 µk i (αiαj) ∈ Pk, ∀i = 1, 2, . . . , N. (3.7) Previous work by Ngyuen et al shows that other flexibility index forms work equally as well [23]. At each atom, the flexibility index corresponds to temperature fluctuation. Thus we 16 can model the B factor of the ith atom as (cid:88) Bt i = ckf k i + b, ∀i = 1, 2, . . . , N, (3.8) k where Bt i represents the theoretically predicted B factor of the ith atom. The coefficients ck and b are determined by minimizing the linear system given by (cid:26) N(cid:88) i=1 (cid:12)(cid:12)(cid:12)(cid:12)Bt min ck,b i − Be i (cid:12)(cid:12)(cid:12)(cid:12)2(cid:27) , (3.9) where Be i is the experimentally measured B factor of the ith atom. 3.4 Multiscale Weighted Colored Graph Flexibility Anal- ysis Macromolecular interactions consist of a complex interplay of short, medium, and long range interactions. Covalent bonds dominate short-range type interactions. Medium-range interac- tions consist mainly of hydrogen bonds, electrostatics and van der Waals interactions. Lastly, hydrophobicity is the main contributor to long-range molecular interactions. As such, a pro- tein’s flexibility is inherently connected to multiple characteristic length scales. This work proposes multiscale weighted colored graphs to characterize the multiscale interactions that exist within a protein. The flexibility of ith atom at nth scale corresponding to the kth set of interaction atoms is given by (cid:80)N f k,n i = 1 j=1 wn ijΦk(||ri − rj||; ηn ij) (αiαj) ∈ Pk, , (3.10) 17 where wn ij is an atomic type dependent parameter, Φk(||ri − rj||; ηn ij) a correlation kernel, and ηn ij a scale parameter. Minimization takes the form (cid:26)(cid:88) i (cid:12)(cid:12)(cid:12)(cid:12)(cid:88) k,n (cid:12)(cid:12)(cid:12)(cid:12)2(cid:27) , (3.11) min cn ,b k i + b − Be k f k,n cn i where Be i are experimental B factors. In this work we construct three correlation kernels us- ing two generalized Lorentz kernels and a generalized exponential kernel to capture multiple length scales. The method provided here is made parameter free by choosing appropriate values for η, ν, and κ. Sulfur atoms play an important role in proteins but they are also very sparse in proteins. As such, this work provides some results using sulfur atoms but for most of the testing provided sulfur atoms are excluded as they have a negligible overall effect on the model. Thus, unless otherwise noted this works considers the following subset of P for the lion’s share of computations. ˆP =(cid:8)CC, CN, CO, NC, NN, NO, OC, ON, OO(cid:9). (3.12) This work chooses to focus on C, N, and O due to their high occurrence in proteins and im- portant biological relevance. However, it should be noted that the general method presented here can be adapted to include any element the user chooses. For WCG calculations of B factor predictions all possible element pairs, SC, SN, SO, and SS are considered. This method is unique compared to other B factor prediction methods. The WCG method considers not only Cα interactions but the effects of interactions between nitrogen, oxygen, and other non-Cα carbon atoms. For this work, three element specific correlation kernels 18 are constructed for all carbon-carbon (CC), carbon-nitrogen (CN), and carbon-oxygen (CO) interactions within a protein. To capture multiscale interactions this work also includes three different scale parameterizations for each kernel. In total this generates 9 correlation kernels to characterize element specific multiscale protein interactions in terms of their corresponding graph centralities and atomic flexibility. The result of this method can be used directly, fitted using linear least squares, or as a machine learning feature. Previously existing methods such as mFRI, GNM, and NMA fail to take into account the element specific interactions that the WCG method presented here captures. Since this method provides a general framework for any element, in addition to carbon, WCG can also be used to predict the B factor of any heavy element. 3.5 Parameterization In this work a total of 9 unique correlation kernels are used based on the CC, CN, and CO element specific correlation kernels described in Eq. (3.10). For simplification purposes, all B factor prediction computed in this work through fitting and machine learning uses wij = wn ij = 1 and ηn ij = ηn. A basic grid search over the 364 dataset determined the near optimal parameters for MWCG based Cα B factor predictions. Three kernels are used with ν = {1, 3} for Lorentz kernels and κ = 1 for the Exponential kernel, respectively. To improve the efficiency, a radial cutoff distance may be used. However, the WCG fitting and MWCG based machine learning results presented in this work do not use a cutoff. The first kernel considered is a Lorentz function, and its near optimal η1 was found to be η1 = 16 as shown in Fig. 3.1. Then, fixing η1 = 16, a parameter grid search is used 19 Figure 3.1: The average Pearson correlation coefficient (PCC) as found by optimizing in- dividual kernels in the range of ηn = 1, . . . , 40. Parameter optimization results originally published in Bramer et al [1]. Table 3.2: Parameters used for correlation kernels in a parameter-free MWCG. Parameter optimization results originally published in Bramer et al [1]. Kernel Type Lorentz (n = 1) Lorentz (n = 2) Exponential (n = 3) κ ηn 16 - - 2 31 1 ν 3 1 - to determine optimal η2 for a second Lorentz kernel. The second Lorentz kernel was found to provide optimal predictions for η2 = 2 as shown in Fig. 3.1. Lastly, fixing η1 = 16 and η2 = 2, a parameter search is used to determine optimal values for η3 used in an exponential kernel. Given the fixed parameters of the Lorentz kernel the average Pearson correlation coefficient (PCC) does not decay even for very large values of η3 as indicated in Fig. 3.1. Given the multiscale nature of these three parameters this behavior is reasonable. With only a single kernel, the strongest interactions, which provide good approximations, can be obtained for 12 ≤ η ≤ 17. To capture close range interactions, the second η provides the best results for small values. The large values seen in the third η appear to capture 20 large scale interactions. This result corresponds to the dominance of these length dependent interaction types. Because it decays so quickly, the exponential kernel is used to capture large scale interaction effect. Of course large η values are very costly due to the structure of the kernel. So for η3 a value of 31 is used in the testing published in Bramer et al [1, 2] for the parameter-free MWCG method as listed in Table 3.2. 21 Chapter 4 Atom Specific Persistent Homology 4.1 Overview Most existing protein analysis methods are structure or geometry based models. Many of these models struggle with the high dimensional space of protein data. Put another way, any model that is too fine grained will inherently fail in the high dimensional protein data space due to the associated computational complexity. The study of topology provides the con- nectivity of components, and characterizes independent entities, rings, and high dimensional topological faces within a space. Applied to proteins, topology provides a powerful tool for analysis of several important biological processes. Examples include hot spot detection, assembly/disassembly of virus capsids, ligand binding state, ion channel state, and protein folding[32, 33, 34, 35, 36, 37, 38, 39]. Topology provides a high level of abstraction and in its purely mathematical form is free of metrics of coordinates which can be problematic for the study of biological macro-molecules. Topological data analysis allows the extraction of invariant features that are embedded in the high dimensional data space of biomolecules. Persistent homology is one component of TDA that provides useful bridge between the high dimensional protein data space and the abstract low dimensional topological analysis of the protein data space. PH embeds multiscale geometric information into topological invariants, this works well for the aforementioned examples but oversimplifies the atomic properties of 22 Table 4.1: Topological invariants displayed as Betti numbers. Betti-0 represents the number of connected components, Betti-1 the number of tunnels or circles, and Betti-2 the number of cavities or voids. Two auxiliary rings are added to the torus to illustrate that Betti-1=2. Example Betti-0 Betti-1 Betti-2 Point Circle Sphere Torus 1 0 0 1 1 0 1 0 1 1 2 1 macro-molecules making it challenging to use directly for atomic level analysis. In this work we provide a new approach that uses techniques from topological data analysis to provide element specific protein analysis at atomic resolution. To apply TDA techniques, data must first be described as a simplicial complex or a graph network. Specifically, simplicial homology is concerned with the identification of topological invariants from a set of discrete nodes such as the atomic coordinates of a protein. Given a point cloud, Betti numbers describe the topological variants of connected components, rings, and cavities. Table 4.1 provides examples of the Betti-0, Betti-1, and Betti-2 numbers of a point, circle, sphere, and torus. To determine topological invariants, a simplicial complex, such as Vietoris-Rips (VR) complex, ˇCech complex, or an alpha complex is constructed using a fixed filtration parameter. The simplicial complex is made up of vertices, edges, triangles, and tetrahedrons, denoted 0-simplex, 1-simplex, 2-simplex, and 3-simplex respectively. Basic examples are provided in Figure 4.1. By varying the filtration parameter over an interval a persistence diagram can be generated from a simplicial complex. A persistence diagram, or barcode, provides the birth and death (appearance and cessation) of Betti features for each node. The difference between 23 Figure 4.1: From left to right an example of a 0-simplex, 1-simplex, 2-simplex, and 3-simplex. two persistence diagrams can be compared using Bottleneck and Wasserstein distances. The main idea of atom-specific persistent homology and element-specific persistent ho- mology is to extract atomic molecular information using global persistent homology tech- niques. To generate an atom-specific description using a global topological description we construct a pair of conjugated point clouds for each atom of interest. One point cloud is centered about the original atom of interest and all nearby atoms within a prescribed radial cutoff. The conjugate point cloud consists of the same point cloud minus the atom of interest. Then for each atom of interest, Bottleneck and Wasserstein distances are computed between the corresponding conjugate pairs which provides the desired topological information of each atom. 4.2 Simplex & Simplicial Complex A simplex is a generalization of a triangle or tetrahedron to arbitrary dimensions. A k- simplex is a convex hull of k + 1 vertices represented by a set of affinely independent points σ = {λ0u0 + λ1u1 + . . . + λkuk |(cid:88) λi = 1, λi ≥ 0, i = 0, 1, . . . , k}, (4.1) where {u0, u1, . . . , uk} ⊂ Rk is the set of points, σ is the k-simplex, and constraints on λi’s ensure the formation of a convex hull. A convex combination of points can have at most 24 k + 1 points in Rk. For example a 1-simplex is a line segment, a 2-simplex a triangle, and a 3-simplex a tetrahedron. A subset of the k + 1 vertices of a k simplex with m + 1 vertices forms a convex hull in a lower dimension and is called an m-face of the k-simplex. An m-face is proper for m < k. The boundary of a k-simplex σ, is defined as the formal sum of its (k − 1) faces. Given as k(cid:88) ∂kσ = [u0, . . . , ˆui, . . . , uk]k(−1)i[u0, . . . , ˆui, . . . , uk]k, (4.2) i=0 where [u0, . . . , ˆui, . . . , uk] denotes the convex hull formed by vertices of σ with the vertex ui excluded and ∂k is called the boundary operator. A collection of finitely many simplicies forms a simplicial complex denoted by K. All simplicial complexes satisfy the following conditions. 1. Faces of any simplex in K are also simplices in K. 2. The intersection of any two simplicies σ1, σ2 ∈ K is a face of both σ1 and σ2. 4.3 Homology Given a simplicial complex K, a k-chain ck of K is a formal sum of the k-simplices in K with k no greater than dimension of K and is defined as ck =(cid:80) aiσi where σi are the k-simplices and ai’s coefficients. Generally, ai can be in any field such as R, Q, or Z. Here we choose ai to be in Z2 for simplicity. Let the group of k-chains in K be denoted by Ck. Then (Ck, Z2) forms an Abelian group under addition in modulo two. This allows us to extend the definition of the boundary operator introduced in Equation 4.2 to chains. 25 The boundary operator applied to a k-chain ck is defined as (cid:88) ∂kck = ai∂kσi, (4.3) where σi’s are k-simplices. The boundary operator is a map from Ck to Ck−1, which is also known as a boundary map for chains. Note that operator ∂k satisfies the property that ∂k ◦ ∂k+1σ = 0 for any (k + 1)-simplex σ following the fact that any (k − 1)-face of σ is contained in exactly two k-faces of σ. The chain complex is defined as a sequence of chains connected by boundary maps with decreasing dimension and is denoted . . . → Cn(K) ∂n−−→ Cn−1(K) ∂n−1−−−−→ . . . ∂1−→ C0(K) ∂0−→ 0. (4.4) The k-cycle group and k-boundary group are then defined as kernel and image of ∂k and ∂k+1 respectively, and Zk = Ker∂k = {c ∈ Ck | ∂kc = 0}, Bk = Im∂k = {∂kc | c ∈ Ck}, (4.5) (4.6) where Zk is the k-cycle group and Bk is the k-boundary group. Since ∂k ◦ ∂k+1 = 0, we have Bk ⊂ Zk ⊂ Ck. Then the k-homology group is defined to be the quotient group of the k-cycle group modulo the k-boundary group, Hk = Zk/Bk (4.7) where Hk is the k-homology group. The kth Betti number is defined to be rank of the 26 k-homology group as βk = rank(Hk). 4.4 Filtration & Persistence For a simplicial complex K, we define a filtration of K as a nested sequence of sub-complexes of K, ∅ ⊆ K0 ⊆ K1 . . . ⊆ Kn = K (4.8) In persistent homology, the nested sequence of sub-complexes usually depends on a fil- tration parameter. The persistence of a topological feature is denoted graphically by its life span with respect to filtration parameter. Sub-complexes corresponding to various filtration parameters offer the topological fingerprints over multiple scales. The kth persistent Betti numbers Bi,j represent the ranks of the kth homology groups of Ki that are alive and are k defined as Bi,j k = rank(Hi,j k ) = rank(Zk(Ki)/(Bk(Kj) ∩ Zk(Ki))). (4.9) respectively where X and Y are persistence barcodes and Bij(X, Y ) the collection of all bijections from X to Y . An example of a barcode is provided in Figure 4.2. 4.5 Similarity and distance In this work, both Bottleneck and Wasserstein distances are used to compare conjugate persistence diagrams. This provides the models with atom-specific topological information and facilitates atom-specific persistent homology. Let X and Y be multisets of data points, 27 (a) Example Points (b) Barcode Figure 4.2: (a) An example of 5 points in R2 and (b) the corresponding topological bar- code. The length of each barcode corresponds to the persistence of each topological object (β0,β1,β2,etc..) over the filtration. the Bottleneck and Wasserstein distances of X and Y are given by [40] dB(X, Y ) = inf γ∈B(X,Y ) sup x∈X || x − γ(x) ||∞, (4.10) 28 and [41]  inf γ∈B(X,Y ) (cid:88) x∈X 1/p , || x − γ(x) ||p∞ dp W (X, Y ) = (4.11) respectively. Here B(X, Y ) is the collection of all bijections from X to Y . In this work topological invariants of different dimensions are compared separately. 4.6 Vietoris-Rips Complex Given a metric space M and a cutoff distance d, a simplex is formed if all points have pairwise distances no greater than d. All such simplices form the Vietoris-Rips complex. The abstract nature of the VR complex allows the construction of simplicial complexes for correlation function based metric spaces, which models pairwise interaction of atoms using correlation functions versus more standard spatial metrics. 4.7 Atom Specific Persistent Homology & Element Spe- cific Persistent Homology To embed the chemical biological protein information into topological invariants, element- specific persistent homology was introduced by Cang et al[42, 43]. The basic idea of ESPH is to use subset of atoms of various element types within a protein to construct topological representations. The corresponding persistence diagrams then represent different interac- tions that occur within a protein. For example selecting all carbon atoms would result in barcodes that coded the network and strength of the hydrophobicity in the protein. 29 Figure 4.3: Illustration of Atom-specific persistent homology point clouds. Top: the original point cloud. The atom of interest is at the center of the circle. Second row: a pair of conjugated sets of point clouds for atom-specific persistent homology. The rest: Four pairs of conjugated point clouds for atom-specific and element-specific persistent homology. To represent the topological importance of a given atom, atom-specific persistent ho- mology is introduced. This works by constructing two conjugated point clouds centered about a given atom of interest within a biomolecule. The point clouds consists of one that includes the atom of interest and all nearby atoms within a prescribed cutoff, and another identical point cloud minus the atom of interest. Then, conjugated simplicial complexes, conjugated homology groups and conjugated topological invariants are generated for each 30 conjugate pair of points clouds. Wasserstein and Bottleneck distances can then be used to measure the difference between conjugated topological invariants which provides a topologi- cal representation of the atom of interest. Figure 4.3 provides and example of atom-specific and element-specific conjugated point clouds can be constructed for a given toy dataset. This work generates only Cα B factor predictions however the method is general and can be used to predict the B factor of any atom. To create a diverse topological representation for each Cα element specific persistent homology is used. Atom-specific persistent homology is also used to contribute a precise topological representation at each Cα atom. Using the conjugate pair subsets, Vietoris-Rips complexes are constructed by contact maps or matrix filtration [44]. To capture element-specific interactions three subsets of carbon-carbon, carbon-nitrogen, and carbon-oxygen point clouds are used. This gives the following element specific pairs, P = {CC, CN, CO}. (4.12) For a given Protein Data Bank (PDB) file, persistence barcodes are calculated as follows. i ∈ Pk in an element specific set Pk (P1 = CC,P2 = CN, Given a specific Cα of interest, rk and P3 = CO) , a point cloud consisting of all atoms within a pre-defined cutoff radius rc is defined as Rk i = {rk j (cid:12)(cid:12) ||rk i − rk j|| < rc, rk i , rk j ∈ Pk,∀ j ∈ 1, 2, . . . N}, (4.13) where N is the number of atoms in the kth element pair Pk. A conjugated set of point cloud, ˆRk point clouds Rk i , includes the same set of atoms, except for rk i , conjugated simplicial complexes, conjugated homology groups, i and ˆRk i . For a given pair of conjugated and conjugated persistence barcodes are computed. Euclidean distance based filtration is 31 computed using the Vietoris-Rips complex. Given set of atoms selected according to atom- specific and element specific constructions, a family of multi-resolution persistence barcodes is generated by a resolution controlled filtration matrix given by [44] Mnm(ϑ) = 1 − Φ(||rn − rm||; ϑ), (4.14) where ϑ denotes a set of kernel parameters. We have used both exponential kernels Φ(||rn − rm||; η, κ) = e−(||rn−rm||/η)κ , κ > 0 (4.15) and Lorentz kernels Φ(||rn − rm||; η, ν) = 1 +(cid:0)||rn − rm||/η(cid:1)ν , 1 ν > 0 (4.16) where η κ, and ν are pre-defined constants. This filtration matrix is used in association with the Vietoris-Rips complex to generate persistence barcodes or persistence diagrams. These topological invariants are then compared using both Bottleneck and Wasserstein distances. An example of the conjugated persistence barcode pair generated for a Cα atom is illustrated in Figure 4.4. 32 (a) 1AIE Subunit (b) Barcode Figure 4.4: Illustration of residue 338 Cα atom-specific persistent homology in the CC element-specific point cloud of protein PDBID 1AIE. For this example residues 332-339 are used and are shown on the left. The Cα location used to generate the barcodes (right) is highlighted in red in the left chart. Conjugated persistence barcodes are generated with and without the selected Cα. 33 Chapter 5 Machine Learning Machine learning is a subset of artificial intelligence based on statistical and probabilistic methods to “learn” patterns in data given a training set. This means that unlike other mathematical models, the structure of the algorithm is not known a priori. Broadly speak- ing, machine learning tasks are classified into supervised, semi-supervised, or unsupervised learning. Supervised learning involves training on data that contains both input data and some desired output data, semi-supervised training on data where some of the outputs are unknown, and unsupervised training on data without known output. Supervised and semi- supervised algorithms can then be trained for regression or classification tasks depending on the desired output. Since they have no target output, unsupervised algorithms can only find structure in data such as in the clustering or grouping data. Machine learning algorithms differ by their internal representation. These algorithms are first classified as parametric or non-parametric depending on whether they have fixed number of parameters regardless of sample size, or whether the number of parameters is allowed to grow with sample size respectively. In practice parametric machine learning al- gorithms are computationally fast, require less data, and easy to implement compared to non-parametric machine learning algorithms. However, parametric machine learning algo- rithms can suffer from poor fitting due to overly strong assumptions about the underlying mapping function. In contrast non-parametric machine learning models are able to fit a 34 larger variety of functional forms and can thus produce more robust models. The work by Wolpert et al suggests that learning algorithms cannot be universally good[45]. That is, a machine learning algorithm that provides a good model for one problem may not work for a different problem. As such, it is standard practice when using machine learning, to test several different machine learning algorithms to determine which of the algorithms are best suited to the problem. The task of B factor prediction is a supervised regression task. It is supervised because B factors are known from experimental data and the prediction task is regression because B factor takes continuous values. Taking the aforementioned considerations into mind, this work considers several non-parametric machine learning algorithms. In particular, random forests, gradient boosted trees, convolutional neural networks, and deep neural networks are all considered in this work. All machine learning results are reported in Chapter 7. The following sections provide a detailed description of the algorithms, feature inputs, parame- terizations, and datasets used for testing. 5.1 Machine Learning Algorithms The following subsections provide a brief overview of each of type of machine learning algo- rithm used in this work. 5.1.1 Ensemble Methods Ensemble methods are a class of machine learning algorithms that generate a strong pre- dictive model based on a large number of simple weak learning models. The basic idea is that taken together, a large number of weak learners, those who do only slightly better 35 than chance, can generate a robust predictive model. Two of the most popular ensemble algorithms, which are used in this work, are random forests of trees and gradient boosting trees[46, 47, 48, 49]. 5.1.1.1 Random forest Random forests are an ensemble machine learning method used for classification or regres- sion tasks. For regression tasks random forests train many decision trees then output the mean prediction of the individual trees. Compared to other machine learning algorithms, random forests are advantageous because they have few hyper-parameters, are generally robust against overfitting, and invariant to scaling. Machine learning approaches are commonly criticized as “black box” approaches. That is, while the input and output of a machine learning algorithm are well known the internal model the algorithm is using is generally hidden to the user. Ensemble methods like random forests address this issue in part by providing variable importance of the trained model. Variable importance is one important way that users can understand which features give the model the most predictive power. Random forests are invariant to scaling, so they do not require the feature data to be pre-processed. Random forests require minimal hyperparameter tuning. The only hyper parameter required is the number of n decision trees. While random forests are generally robust to overfitting if n is chosen to be too large it is possible for a random forests to overfit a dataset. Thus too few trees and the model will have poor predictive power and too many trees may lead to overfitting and be computationally costly. The user must take special care to determine the right amount of decision trees. For this work, the choice of decision trees was determined by testing various values of n to strike a balance between performance and 36 cost. 5.1.1.2 Gradient boosted trees Like random forests, gradient boosting trees (GBTs) are an ensemble method. GBTs in- corporate boosting to reduce bias and variance and utilize a number of “weak learners” to iteratively construct a predictive model. The algorithm is optimized using gradient descent, minimizing the residual of a predefined loss function. At each step, GBTs incorporate de- cision trees to improve their predictive power. Gradient boosting trees and other related ensemble methods are useful because they have strong predictive power, do not require normalization of the dataset, and are typically robust to outliers and overfitting. 5.1.2 Neural Networks Recent advances in GPU computing have allowed neural networks to be computationally tractable machine learning models. Modeled after neurons in the brain, neural networks apply layers of activation functions, called perceptrons, to inputs. Weights of the neural network are trained to minimize a loss function over many passes of a training dataset. Many neural networks utilize back-propagation, which allows the error to propagate to the previous layer, to adjust neuron weights and improve output error until it is below a preset threshold. In short, neural networks begin with an initial random guess at an output then repeatedly adjust the neuronal weights until the output error is satisfactorily reduced. Neural networks with several “hidden” layers of perceptrons are known as deep neural networks (DNNs). Figures 5.1 and 5.2 provide examples of the basic perceptron and deep neural network framework. 37 Inputs Weights Neuron Output w1 w2 w3 wn n(cid:88) i=1 xiwi + b (cid:19) xiwi + b (cid:18) n(cid:88) f i x1 x2 x3 ... xn Figure 5.1: An example of a perceptron, the basic functional unit of a neural network. Input Layer Hidden Layer Hidden Layer Output Layer Output Figure 5.2: An illustration of a fully connected deep neural network. Circles represent neurons and connections between neurons are indicated by arrows. Each connection has an associated weight. A neural network is considered “deep” when it uses several hidden layers. 5.1.2.1 Convolutional Neural Network Convolutional neural networks (CNNs) are a type of neural network that have recently had great success in the field of image classification. CNNs work by applying convolutional 38 filters over several layers, and by doing so extract successively higher-level features from input images. For image data CNNs are more advantageous than fully connected neural networks because they can often outperform fully connected neural networks with a fraction of training parameters. 5.1.3 Consensus methods It is often the case that one machine learning model model may outperform others in certain areas but do worse in others. As such a consensus model can provide a useful tool that may improve overall results. As such, for PH based Cα only B factor prediction, this work also includes B factor prediction results using a consensus model. The consensus model prediction used here is generated by combining the B factor predictions of the two PH based machine learning models. In particular, the consensus prediction for each Cα is the average of Cα B factor values predicted from the PH based GBT and deep CNN B factor prediction. 5.2 General Machine Learning Features 3D spatial atomic coordinates of each atom in a protein are provided by Protein Databank (PDB) .pdb files. The PDB files also provide additional experimental data that can be used as local and global input features for machine learning algorithms. All machine learning algorithms used in this work make use of both global and local protein features described in the sections 5.2.1 and 5.2.2. To study the impact of the MWCG, ESPH, and ASPH methods these features are tested separately in different machine learning algorithms. The parameters used to generate these machine learning features in this work are described in detail in sections 5.3 and 5.4 and below. 39 5.2.1 Global features The global protein features described in this section were used in all the machine learning models in this work. The global features that were used in this work are R-value, resolution, and total number of heavy atoms. These features are obtained via the experimental data recorded in PDB file of each protein. Both R-value and resolution provide measures of the quality of the atomic model obtained from the X-ray crystallography. Also included as a global feature is the total protein size which is determined as the sum of heavy elements (carbon, nitrogen, oxygen, and sulfur) present in the protein. To code the protein size data, it is organized into one of 10 discrete size classes using one hot encoding. The size ranges are given based on the distribution of total number of heavy elements of each protein. For this work we use the following size classes. A frequency distribution of the size categories is provided in Figure 5.3. [500, 750, 1000, 1500, 2000, 2500, 3000, 4000, 5000, 30000] Using one-hot coding, a protein element feature size will take on 1 if the number of heavy atoms (carbon, nitrogen, or oxygen) of the protein is less than or equal to the corresponding size and zero for the other sizes. For example, a protein with 600 heavy elements would have the feature size vector for all of its atoms given by [0, 1, 0, 0, 0, 0, 0, 0, 0, 0]. The maximum size bin is 30,000 since all proteins in the dataset have less than 30,000 heavy elements. 40 Figure 5.3: Frequency of the number of heavy elements from the 364 protein dataset. Figure originally published in Bramer et al [2]. 5.2.2 Local features In addition to the features discussed above, PDB files contain the amino acid corresponding to each heavy element. Like the protein size feature, amino acid information is included by using one hot encoding for each heavy element which results in twenty amino acid features. More locally, each of the the four different heavy element types carbon, nitrogen, oxygen, and sulfur for each element are one hot coded which results in another four features. To explicitly take the density of nearby atoms into account, this work includes packing density as an additional model feature. Short, medium and long packing density features for each heavy atom are generated and included in all the machine learning models used in this work. Mathematically, the packing density of the ith atom is defined as pd i = Nd N , where d is the given cutoff in angstroms, Nd is the number of atoms within the Euclidean distance of the cutoff to the ith atom, and N the total number of heavy atoms of the protein. 41 Table 5.1 provides the packing density cutoffs used in this work. Table 5.1: The packing density distance parameters (d ˚A) used for generating short medium, and long packing density machine learning features. Short Medium Long 5 ≤ d d < 3 3 ≤ d < 5 Secondary structures also play an important role in protein interactions. This work in- cludes several secondary structural machine learning features for all the machine learning models used. Several software packages exist for the prediction of secondary protein struc- tures. All secondary protein machine learning features used in this work were generated using the STRIDE software. This software returns secondary structure results that are in maximal agreement with X-ray crystallography data through the use of an optimized knowl- edge based algorithm. STRIDE takes 3D atomic coordinates in the form of protein PDB files as input and assigns each atom to a corresponding secondary structural group. STRIDE assigns each atom as belonging to a alpha helix, 3-10 helix, PI-helix, extended conformation, isolated bridge, turn, or a coil. Solvent accessible surface area, φ and ψ angle information are also generated by the software. This provides a total of 12 secondary structure features that are used in all the machine learning models in this work. 5.3 MWCG Features The MWCG flexibility index described in Chapter 3 is used to create feature vectors for carbon, nitrogen, and oxygen interactions with each heavy element. To capture multiscale interactions 3 different kernel parameterizations are used for each interaction type. This provides a total of nine MWCG machine learning features for each heavy element. The kernel parameters used in this work are based off previous results. Specific parameters for 42 the kernels used here were originally published in Bramer et al and are provided in Table 5.2.[1] Table 5.2: Correlation kernel parameters used to generate parameter-free MWCG machine learning features. Parameters based on previous results.[1] Kernel Type Lorentz (n = 1) Lorentz (n = 2) Exponential (n = 3) κ ηn 16 - 2 - 1 31 ν 3 1 - 5.3.1 Image-like MWCG Features Convolutional neural networks make use of the large amount of data provided in images by applying a convolution operation. Due to the massive amount of trainable parameters, fully connected feed forward neural networks are computationally prohibitive for images. Convolutional operations greatly reduce the number of free parameters, thereby striking good balance between deep predictive power and computational cost. For this work MWCG images are generated for every heavy atom in the data set then used in a deep CNN model. Multiscale images are generated using both Lorentz and exponential radial basis functions for all heavy atoms in the data set. The generated images capture multiscale interactions by using a number of different parameterizations of κ, ν, and η in their kernels. To capture a large range of protein atomic interaction scales this work uses the following values are used for κ, ν, and η. η = {1, 2, 3, 4, 5, 10, 15, 20} κ, ν = {2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6, 6.5, 7, 8, 9, 10, 11}. Taken together as a matrix, each generated “image” results in three 2D MWCG images 43 of dimension (8, 30) for each heavy atom in the data set. For this work MWCG images are generated for all carbon, nitrogen, and oxygen interactions for each heavy atom. This results in a total of three channels for each image and a final image dimension of (8, 30, 3) for each atom used the MWCG deep CNN testing. The image matrix is given by F k i i (l, m, n) represents the flexibility index of the ith atom, and kth atom interaction (C, N, or O), l = η, m = {κ, ν}, in equation 5.1, where each atom f k and n the type of radial basis function. Values of n = 1 and n = 2 correspond to exponential and Lorentz radial basis functions respectively. f k i (1, 2, 1) f k i (2, 2, 1) ... f k i (1, 2.5, 1) f k i (2, 2.5, 1) ... ... f k i (1, 11, 1) f k i (2, 11, 1) f k i (1, 2, 2) f k i (2, 2, 2) f k i (1, 2.5, 2) f k i (2, 2.5, 2) f k i (15, 2, 1) f k i (15, 2.5, 1) ... f k i (15, 11, 1) f k i (15, 2, 2) f k i (15, 2.5, 2) ... ... ... ... f k F k i = f k i (1, 11, 2) f k i (2, 11, 2) i (15, 11, 2) (5.1)  η  (cid:125)  (cid:124) f k i (20, 2, 1) f k i (20, 2.5, 1) ... f k f k i (20, 2, 2) f k i (20, 2.5, 2) ... f k i (20, 11, 2) (cid:123)(cid:122) κ (cid:125) i (20, 11, 1) (cid:124) (cid:123)(cid:122) ν 5.4 ASPH & ESPH Features A variety of element-specific and atom-specific persistent homology features, as described in Chapter 4, are generated as local machine learning features. The ASPH and ESPH features are generated in several ways by varying kernels (Lorentz and exponential), element-specific pairs (CC, CN, CO), and distance metrics (Wasserstein-0 and Wasserstein-1, Bottleneck-0 and Bottleneck-1). For this work, all persistent homology features were generated with a radial cutoff of 11˚A. The distances determined by Wasserstein and Bottleneck metrics are dependent on the boundary of the corresponding persistence diagrams. In other words any events from one 44 diagram that do not match an event on the other diagram can contribute to the final Wasser- stein or Bottleneck distance by their distances from the boundary. Considering these effects, this work includes two additional persistence diagrams. The additional diagrams are con- structed by rotating the y-axis is rotated clockwise by 30◦ or 60◦, respectively. Figure 5.4 provides an example of these modifications. By introducing this modification, the Bottleneck and Wasserstein distances correspondingly allow the model to recognize elements that have a short persistence, or lifespan. As a final consideration, a feature is generated by reflecting the original persistence diagram about the diagonal axis. An example of this modification is provided in Figure 5.4. A list of kernels, kernel parameters, y-axis change, distance metric, and element-specific pairs used to generate features in machine learning models is provided in Table 5.3. (a) (b) (c) Figure 5.4: Illustration of modified persistence diagrams used in distance calculations. (a) Unchanged. (b) Rotated 30◦. (c) rotated 60◦. Black dots are Betti-0 events and triangles are Betti-1 events. 5.4.1 Image-like ASPH & ESPH Features 2D image-like persistent homology (PH) features for each Cα of the proteins are generated using the process described in Section 4.7. The images-like features are generated by taking various values of η and κ using the kernel function. An exponential kernel is used with a 45 Table 5.3: Parameters used for topological feature generation. All features used a cutoff of 11˚A. Both lorentz (Lor) and exponential (exp) kernels and Bottleneck (B) and Wasserstein (W) distance metrics were used. No. features Kernel Kernel parameter Diagram 12 12 12 12 12 Lor Exp Exp Exp Exp η = 21, ν = 5 η = 10, κ = 1 η = 2, κ = 1 Diagonal reflection η = 2, κ = 1 η = 2, κ = 1 Unchanged Unchanged Rotated 30◦ Rotated 60◦ Distance metric Element pair CC, CN, CO CC, CN, CO CC, CN, CO CC, CN, CO CC, CN, CO B, W B, W B, W B, W B, W radial cutoff of 11˚A. Different values of η and κ are used to capture multiple interaction scales. The values used in this work are and η = {1, 2, 3, 4, 5, 10, 15, 20}, κ = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}. This results in an image-like matrix given by PHk i in Eq. (5.2). Each atom PHk i (l, m) represents the PH feature of the ith Cα atom, and kth atom interaction (C, N, or O), l = η, and m = κ. PHk i =  (cid:124) f k i (1, 1) f k i (1, 2) f k i (2, 1) f k i (2, 2) ... . . . . . . ... f k i (1, 9) f k i (1, 10) f k i (2, 9) f k i (2, 10) f k i (15, 1) f k f k i (20, 1) f k i (15, 2) . . . f k i (20, 2) . . . f k i (15, 9) f k i (20, 9) f k i (15, 10) i (20, 10) (cid:123)(cid:122) κ   (cid:125) η (5.2) This generates 2D PH image-like features of dimension (8,10). Compared to MWCG images, 46 the PH images have lower resolution than the MWCG images due to the cost of calculating PH features. Images are generated for carbon, nitrogen, and oxygen element-specific inter- actions with each Cα atom. As a result, the final image feature input has a dimension of (8,10,3) for each Cα atom. 5.4.2 Cutoff Distance For this work a cutoff of 11˚A is used to generate all persistent homology machine learning features. The cutoff was determined using a basic grid search over various cutoff distances. Figure 5.5 displays the average Pearson correlation coefficient, obtained via fitting with experimental B factors, over the entire dataset using all persistent homology metrics with various point cloud distance cutoffs. The parameters listed in Table 5.4 are used to generate Figure 5.5: Average Pearson correlation coefficient over the entire protein dataset fitting all 24 persistent homology features using various cutoff distances. PH features for each protein. These parameters were determined using a grid search over various ν, η, and κ. 47 Table 5.4: Parameters used for the element specific persistent homology features with a cutoff of 11 ˚A. Kernel Type Lorentz (n = 1) Exponential (n = 2) ν 5 - ηn κ 21 - 1 10 5.5 Machine Learning Model Parameters For this work several machine learning models were generated. All machine learning models used in this study include the global and local features described sections 5.2.1 and 5.2.2. Two classes of machine models are generated for this work. The first includes random forest, gradient boosted tree, and deep convolutional neural networks that use MWCG input features in addition the general global and local features mentioned above. The second class of machine learning models use the ASPH and ESPH input features in addition to the general and local features. Each model has specific parameters than can be tuned. The following sections outline the parameters used in this work. 5.5.1 MWCG 5.5.1.1 Random Forest Random forests only require the user to determine the amount of n trees. The predictive power of random forests generally increases with the number of trees used and these models are robust to over fitting. However increasing the number of trees comes at a computational cost. To balance performance with computational cost, this work uses n = 500 trees for all MWCG based random forest B factor prediction. 48 5.5.1.2 Gradient Boosted Trees Several hyperparameters within the gradient boosted tree method can be tuned. The MWCG based GBT hyperparamters used in this work are determined using the standard practice of a grid search. Testing parameters are provided in Table 5.5. Any hyper parameters not listed below were taken to be the default values provided by the python scikit-learn package. Table 5.5: Boosted gradient tree parameters used for testing MWCG based B factor pre- diction. These parameters were determined using a grid search. Any hyper parameters not listed below were taken to be the default values provided by the python scikit-learn package. MWCG based GBT machine learning prediction results originally published in Bramer et al [2]. Parameter Loss Function Alpha Estimators Learning Rate Max Depth Min Samples Leaf Min Samples Split Setting Quantile 0.95 1000 0.001 4 9 9 5.5.1.3 Deep Convolutional Neural Network This work uses 3 channel (8,30) MWCG based image-like correlation maps, as described in Section 5.3, as CNN input data for each image. The CNN output is flattened and concate- nated with global and local protein features, as described in Sections 5.2.1 and 5.2.2, then input into a deep neural network to predict atomic B factor. A diagram of the MWCG based deep CNN architecture is provided in Figure 5.6. The CNN input image used for MWCG based B factor in this work is a three-channel MWCG image of dimension (8,30,3). The deep CNN applies two convolutional layers with 2x2 filters, a dropout layer of 0.5, a dense layer, then flattens the resulting output. The 49 Figure 5.6: The MWCG based deep convolutional neural network architecture used for B factor prediction. The plus symbol represents the concatenation of data sets. Figure originally published in Bramer et al [2]. 50 flattened output from the CNN is concatenated with the other global and local features into a dense layer of 59 neurons followed by a dropout layer of 0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and finishes with a dense output layer. This results in a total of 21,584 trainable parameters for the deep CNN used in MWCG based B factor prediction. A diagram of the deep CNN architecture is illustrated in Figure 5.6. Convolutional neural networks have several hyper-parameters. The hyper parameters for the MWCG based deep CNN used in this work are optimized using a grid search. Table 5.6 provides a list of the hyper-parameter values used for testing. Any hyper parameters not listed below were taken to be the default values provided by the python Keras package. Table 5.6: MWCG based deep Convolutional Neural Network (CNN) hyper-parameters used for testing. These hyper-parameters were determined using a grid search. Any hyper pa- rameters not listed below were taken to be the default values provided by python with the Keras package. MWCG machine learning prediction results originally published in Bramer et al [2]. Parameter Learning Rate Epoch Batch Size Loss Optimizer Setting 0.001 100 100 Mean Absolute Error Adam 5.5.2 ASPH & ESPH The generated ASPH & ESPH features described in section 4.7 are used for prediction of protein B factor using both least squares fitting and machine learning as described in the following sections. 51 5.5.2.1 Gradient Boosted Trees The persistent homology based GBT hyper-parameters used in this work are optimized using a grid search. The parameters used for testing are provided in 5.7. Any hyper-parameters not listed in the table were taken to be the default values provided by the python scikit-learn package. Table 5.7: Boosted gradient tree parameters used for persistent homology based prediction testing. Parameters were determined using a grid search. Any hyper parameters not listed below were taken to be the default values provided by the python scikit-learn package. Parameter Loss Function Alpha Estimators Learning Rate Max Depth Min Samples Leaf Min Samples Split Setting Quantile 0.975 500 0.25 4 9 9 5.5.2.2 Deep Convolutional Neural Network The deep CNN used in this work uses input images generated from an image-like correlation map. These images are generated by using a range of kernel parameters for atom-specific and element-specific persistent homology as described in Section 5.4.1. The CNN output is flattened and then input into a DNN along with global and local protein features. This allows the deep CNN to use the same feature set as the boosted gradient method to be used as well as the generated PH image-like data. Figure 5.7 provides a diagram of the CNN architecture used for the PH based B factor prediction in this work. The CNN is passed a three-channel persistent homology image of dimension (8,10,3) for each Cα of the training set. The model used in this work takes the input image data and 52 Figure 5.7: The deep learning architecture using a convolutional neural network combined with a deep neural network to predict B factor using PH based features. The plus symbol represents the concatenation of features. applies two convolutional layers with 2x2 filters, followed by a dropout of 0.5. The image data is then passed through a dense layer, flattened, then joined with the other global and local features to form a dense layer of 218 neurons. This is followed by a dropout layer of 0.5, another dense layer of 100 neurons, a dropout layer of 0.25, a dense layer of 10 neurons, and finishes with a dense layer of the B factor prediction output. Figure 5.7 provides an illustration of the deep CNN used in this work. Several hyper-parameters of the deep convolutional neural network can be tuned. The deep convolutional neural network hyper-parameters are optimized using a basic grid search. Table 5.8 provides the parameters used for testing. Any hyper-parameters not listed in the provided table were taken to be the default values provided by the python Keras package. 53 Table 5.8: Convolutional Neural Network (CNN) parameters used for testing persistent homology based features. Parameters were determined using a grid search. Any hyper- parameters not listed below were taken to be the default values provided by python with the Keras package. Parameter Learning Rate Epoch Batch Size Loss Optimizer Setting 0.001 1000 1000 Mean Squared Error Adam 5.6 Machine Learning Datasets The image like features used in all convolutional neural networks were standardized with mean 0 and variance of 1. Because the STRIDE software is unable to provide features for these proteins, 1OB4, 1OB7, 2OLX, and 3MD5 are excluded from the data set. Protein 1AGN is also excluded due to known problems with this protein data. Lastly, proteins 1NKO, 2OCT, and 3FVA are excluded because they have residues with B factors reported as zero, which is unphysical. 5.6.1 Training set and test set The PH and MWCG based machine learning algorithms used in this work are all trained and tested using a leave-one-protein-out approach. For each protein a machine learning model is built using the entire dataset but excluding data from the protein whose B factors are to be predicted. The dataset contains over 620,000 atoms in total which provides a training set of roughly 600,000 data points (i.e., atoms) for each protein. Each heavy atom in the training set has an associated set of input features, as described in Sections 5.3 and 4.7, and a B factor output. The feature inputs and the outputs in the training set are used to train each machine learning model. Since the predictions are leave-one-protein-out, data from 54 each protein is taken as a test set when its B factors are to be blindly predicted. All random forest models and boosted gradient models are implemented using the scikit- learn python package. All deep CNN models are implemented using the python package Keras with tensorflow as a backend. 55 Chapter 6 Workflow Read PDB Data Pre-process PDB Select element pairs and construct corresponding point clouds Select kernel functions Φn and scale parameters ηn Calculate rigidity un i and flexibility index f n i Figure 6.1: Workflow for procedure in MWCG feature construction. 56 Read PDB Data Pre-process PDB Select element pair, cutoff distance, and kernel Construct corresponding point clouds Calculate kernel based distances matrices Construct Rips complexes for each local Cα point cloud Construct Persistence Diagrams using Rips complexes Calculate Bottleneck distances Figure 6.2: Workflow for procedure in ASPH and ESPH feature construction. 57 Choose ML model Read and clean PDB file Data Mine PDB file data for local and global features Generate engineered features (MWCG, ASPH, ESPH, CNN images, packing density, etc. . .) Parameterize (MWCG, ASPH, ESPH, CNN Images, cutoff distance(s), etc. . .) Normalize features Tune hyperparameters Validate model (leave-one-protein-out) Figure 6.3: Workflow for procedure MWCG, ASPH, and ESPH based machine learning B factor prediction. 58 Chapter 7 Results 7.1 Visualization of Element Specific Correlation Maps In this result the radial basis functions are used in the MWCG method to construct various element specific correlation heat maps of a given protein. For this study we consider carbon, nitrogen, and oxygen interactions and create correlation heat maps using both nitrogen- nitrogen and carbon-carbon interaction pairs. Only one spatial scale is used to illustrate the element specific feature of the MWCG method. This is abbreviated as WCG in the related tables. Given an element pair, each map was calculated used the average of the three kernels described in Chapter 3. Axes of each correlation map correspond to individual atoms of each carbon, nitrogen, or oxygen atom in the given protein. In this work correlation heat maps are generated using the three proteins with PDB ID 3TYS, 1AIE, and 3PSM. Nitrogen-nitrogen and oxygen-oxygen correlation heat maps are provided in Figures 7.1, 7.2, and 7.3. Each figure also includes a 3D representation, generated using Visual Molecular Dynamics (VMD) software, of each protein for reference. 7.2 Hinge Detection Accurate and robust identification of hinge regions is an ongoing problem. An important application of hinge region detection is domain identification. Hinge regions of proteins also 59 play an essential role in enzymatic catalysis due to their ability to allow conformational changes to the protein. Binding by ligands can be accommodated by a flexible active site as seen in hinge regions. With these considerations in mind, hinge prediction cannot be overlooked when developing methods for protein flexibility and dynamics analysis. The MWCG presented here can be used as a hinge detection tool. In this work we consider three interesting examples. Calmodulin provides an example of a protein hinge that effects both the structure and function of the protein. For this result experimental protein B factors of Cα atoms are compared with predictions from the WCG method and GNM for calmodulin (PDB ID 1CLL), ribosomal protein (PDB ID 1WHI), and engineered fluorescent cyan protein (PDB ID 2HQK). To highlight the value of the element specific feature of the MWCG only one scale is used so that the method is simply WCG. For comparison protein PDB ID 1CLL includes MWCG and WCG predictions to compare and contrast the element specific and multiscale nature of the MWCG method. Results are generated with carbon-carbon, carbon- nitrogen, and carbon-oxygen interaction pairs. Exponential type kernels are used with fixed parameters κ = 1, and η = 3 ˚A. The results are displayed in Figures 7.5, 7.4, and 7.6. 60 (a) 1AIE (b) Amine Nitrogens (c) Double Bonded Carboxyl Oxygens Figure 7.1: (a) VMD representation of PBD ID 1AIE. (b) Correlation maps for nitrogen- nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 1AIE. The thicker band along the main diagonal of (b) and (c) corresponds to the alpha helix secondary structure in 1AIE. Figure originally published in Bramer et al [1]. 61 (a) 1KGM (b) Amine Nitrogens (c) Double Bonded Carboxyl Oxygens Figure 7.2: (a) VMD representation of PBD ID 1KGM. (b) Correlation maps for nitrogen- nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 1KGM. The bands per- pendicular to the main diagonal of (b) and (c) correspond to the anti parallel beta sheet present in 1KGM. Figure originally published in Bramer et al [1]. 62 (a) 5IIV (b) Amine Nitrogen (c) Double Bonded Carboxyl Oxygens Figure 7.3: (a) VMD representation of PBD ID 5IIV. (b) Correlation maps for nitrogen- nitrogen (NN) and (c) oxygen-oxygen (OO) interactions for protein 5IIV. The presence of the two distinct thick bands along the main diagonal of (b) and (c) corresponds to the two alpha helices present in 5IIV. The off diagonal bands correspond to the bonding interaction between alpha helices. Figure originally published in Bramer et al [1]. 63 (a) (b) (c) (d) Figure 7.4: (a) A visual comparison of experimental B factors , (b) WCG predicted B factors, (c) and GNM predicted B factors for the ribosomal protein L14 (PDB ID:1WHI). (d) The experimental and predicted B factor values plotted per residue. GNM represents predicted B factors using GNM with a cutoff distance of 7 ˚A. WCG is parametrized using CC, CN, CO kernels of the exponential type with fixed parameters κ = 1, and η = 3 ˚A. Figure originally published in Bramer et al [1]. 64 (a) (b) (c) (d) Figure 7.5: (a) The structure of calmodulin (PDB ID: 1CLL) visualized in Visual Molecular Dynamics (VMD)18 and colored by experimental B factors, (b) MWCG predicted B factors, (c) WCG predicted B factors, (d) and GNM predicted B factors with red representing the most flexible regions. Figure originally published in Bramer et al [1]. 65 (e) Figure 7.5: (Continued) (e) The experimental (Exp) and predicted B factor values plotted per residue for PDB ID 1CLL. The GNM is for the GNM method with a cutoff distance of 7 ˚A. We see that GNM clearly misses the flexible hinge region. WCG is parametrized using CC, CN, CO kernels of the exponential type with fixed parameters κ = 1, and η = 3 ˚A. MWCG represents B factor predictions determined from the MWCG method using the fixed parameters listed in Table 3.2. Figure originally published in Bramer et al [1]. 7.3 MWCG 7.3.1 Validation The Pearson correlation coefficient is used to quantitatively assess the prediction results. The Pearson correlation coefficient for B factor prediction used in this work is given by 66 N(cid:88) i=1 PCC = (cid:20) N(cid:88) i=1 i − ¯Bt) (Be i − ¯Be)(Bt N(cid:88) (Be i − ¯Be)2 (Bt i − ¯Bt)2 i=1 (cid:21)1/2 , (7.1) where Bt i , i = 1, 2, . . . , N are predicted B factors using the proposed method and Be i , i = 1, 2, . . . , N are experimental B factors from the PDB file. The terms Bt i and Be i represent the ith theoretical and experimental B factors respectively. Here ¯Be and ¯Bt are averaged B factors. 7.3.2 Fitting Results Tables 7.1-7.6, and 7.4 provide the average Pearson correlation coefficient obtained using the MWCG method as outlined in Chapter 3. The MWCG method is compared to other com- monly used protein B factor prediction methods. The MWCG B factor Pearson correlation coefficient results for all 364 proteins in the dataset are provided in table 7.4. The proposed MWCG method, optimal FRI (opFRI), parameter free FRI (pFRI), and GNM methods are all compared. The same comparison for proteins of relatively, small, medium, and large sizes are provided in tables 7.1, 7.2, and 7.3. Table 7.4: Correlation coefficients for B factor prediction obtained by MWCG, optimal FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for a set of 364 proteins. GNM scores reported here are the result of tests with a processed set of PDB files as described in Chapter 2.2. MWCG results originally published in Bramer et al [1]. PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM 0.826 0.808 0.465 0.27 0.654 0.685 0.781 0.843 0.726 0.656 0.672 0.62 1PEF 18 1PEN 16 1PMY 123 114 1PZ4 1Q9B 43 1QAU 112 1ABA 87 1AHO 64 1AIE 31 1AKG 16 1ATG 231 1BGF 124 0.698 0.613 0.625 0.562 0.416 0.155 0.35 0.185 0.578 0.497 0.539 0.543 0.727 0.698 0.588 0.373 0.613 0.603 0.989 0.957 0.701 0.921 0.957 0.786 0.888 0.516 0.671 0.828 0.746 0.678 0.855 0.768 0.969 0.945 0.843 0.834 67 PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM 0.751 0.645 0.809 0.52 0.334 0.543 0.631 0.556 0.65 0.621 0.368 0.789 0.447 0.431 0.517 0.372 0.529 0.435 0.671 0.596 0.742 0.711 0.714 0.72 0.82 0.841 0.837 0.429 0.434 0.474 0.762 0.691 0.778 0.577 0.522 0.6 0.665 0.638 0.726 0.661 0.742 0.665 0.639 0.594 0.495 0.713 0.653 0.671 0.438 0.146 -0.142 0.832 0.809 0.798 0.61 0.691 0.538 0.629 0.599 0.632 0.622 0.492 0.162 0.792 0.695 0.677 0.691 0.564 0.397 0.591 0.577 0.549 0.539 0.601 0.27 0.695 0.679 0.666 0.634 0.577 0.417 0.559 0.654 0.6 0.645 0.447 0.832 0.619 0.57 0.562 0.524 0.366 0.596 0.333 0.434 0.375 0.834 0.901 0.842 0.662 0.638 0.433 0.77 0.808 0.757 0.756 0.579 0.69 0.524 0.281 0.564 0.694 0.512 0.705 0.593 0.521 0.684 0.639 0.603 0.467 0.555 0.551 0.477 0.606 0.68 0.726 0.623 0.706 0.543 0.491 0.552 0.58 0.512 0.351 0.751 0.702 0.741 0.912 0.889 0.832 0.746 0.732 0.859 0.653 0.638 0.677 0.609 0.628 0.71 0.544 0.393 0.432 0.023 -0.274 0.089 0.644 0.547 0.65 0.859 0.738 0.878 0.718 0.613 0.674 0.568 0.485 0.59 0.766 0.693 0.646 0.845 0.773 0.821 0.732 0.591 0.781 0.634 0.421 0.748 0.488 0.429 0.306 0.811 0.686 0.616 0.516 0.549 0.549 0.715 0.735 0.69 0.697 0.689 0.637 0.553 0.531 0.378 0.749 0.744 0.558 0.547 0.536 0.512 0.635 0.612 0.466 0.763 0.754 0.796 0.62 0.679 0.657 0.687 0.681 0.7 0.609 0.497 0.651 0.75 0.703 0.631 0.619 0.535 0.368 0.53 0.669 0.523 0.789 0.631 0.795 0.622 0.604 0.615 0.746 0.622 0.523 0.874 0.844 0.91 0.333 0.309 0.562 0.776 0.763 0.75 0.545 0.652 0.737 0.555 0.409 0.398 1QKI 3912 1QTO 122 122 1R29 90 1R7J 1RJU 36 1RRO 112 1SAU 114 1TGR 104 1TZV 141 55 1U06 1U7I 267 1U9C 221 1UHA 83 1UKU 102 1ULR 87 1UOY 64 1USE 40 1USM 77 1UTG 70 1V05 96 105 1V70 21 1VRZ 1W2L 97 1WBE 204 1WHI 122 1WLY 322 1WPA 107 1X3O 80 18 1XY1 1XY2 8 87 1Y6X 1YJO 6 1YZM 46 1Z21 96 1ZCE 146 75 1ZVA 2A50 457 2AGK 233 2AH1 939 2B0A 186 2BCM 413 2BF9 36 51 1BX7 224 1BYI 1CCR 111 1CYO 88 1DF4 57 1E5K 188 260 1ES5 1ETL 12 1ETM 12 1ETN 12 1EW4 106 1F8R 1932 1FF4 65 1FK5 93 1GCO 1044 1GK7 39 1GVD 52 1GXU 88 1H6V 2927 1HJE 13 1I71 83 1IDP 441 1IFR 113 1K8U 89 1KMM 1499 1KNG 144 1KR4 110 1KYC 15 73 1LR7 1MF7 194 1N7E 95 1NKD 59 1NKO 122 1NLS 238 1NNX 93 1NOA 113 1NOT 13 20 1O06 221 1O08 1OB4 16 1OB7 16 1OPD 85 0.896 0.600 0.741 0.860 0.941 0.848 0.700 0.932 0.941 0.949 0.804 0.504 0.933 0.648 0.839 0.984 0.849 0.901 0.133 0.931 0.798 0.827 0.875 0.856 0.740 0.810 0.892 0.971 0.929 0.757 0.812 0.911 0.831 0.799 0.834 0.808 0.937 0.988 0.516 1.000 1.000 0.607 0.554 68 Table 7.4 (cont’d) 0.508 0.809 0.787 0.859 0.805 0.748 0.819 0.810 0.869 0.774 0.885 0.764 0.838 0.765 0.718 0.769 0.960 0.819 0.745 0.841 0.854 0.995 0.747 0.767 0.804 0.728 0.797 0.787 0.933 1.000 0.838 1.000 0.970 0.725 0.898 0.911 0.704 0.821 0.462 0.805 0.695 0.714 PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM Table 7.4 (cont’d) 29 1P9I 99 2CE0 2CG7 90 2COV 534 2CWS 227 2D5W 1214 2DKO 253 2DPL 565 2DSX 52 439 2E10 2E3H 81 2EAQ 89 2EHP 248 2EHS 75 2ERW 53 2ETX 389 116 2FB6 157 2FG1 560 2FN9 2FQ3 85 2G69 99 2G7O 68 2G7S 190 2GKG 122 2GOM 121 2GXG 140 2GZQ 191 2HQK 213 2HYK 238 2I24 113 398 2I49 108 2IBL 2IGD 61 2IMF 203 87 2IP6 88 2IVY 2J32 244 2J9W 200 2JKU 35 2JLI 100 2JLJ 115 2MCM 113 0.841 0.824 0.738 0.895 0.756 0.448 0.873 0.721 0.704 0.808 0.794 0.817 0.832 0.805 0.513 0.854 0.850 0.719 0.704 0.844 0.850 0.888 0.756 0.748 0.874 0.901 0.462 0.897 0.728 0.672 0.766 0.919 0.865 0.798 0.841 0.837 0.878 0.741 0.926 0.937 0.811 0.867 2BRF 100 0.754 0.742 0.625 2C71 205 0.706 0.598 0.529 2OLX 4 0.551 0.539 0.379 2PKT 93 0.846 0.823 0.812 2PLT 99 0.64 0.647 0.696 2PMR 76 0.689 0.682 0.681 2POF 440 0.816 0.812 0.69 2PPN 107 0.596 0.538 0.658 2PSF 608 0.337 0.333 0.127 2PTH 193 0.796 0.692 0.798 2Q4N 153 0.682 0.605 0.692 412 2Q52 0.69 0.753 0.695 2QJL 99 0.804 0.804 0.773 2R16 176 0.713 0.747 0.72 2R6Q 138 0.253 0.199 0.461 0.58 0.556 0.632 2RB8 93 2RE2 238 0.791 0.786 0.74 2RFR 154 0.617 0.584 0.62 2V9V 135 0.595 0.611 0.607 2VE8 515 0.719 0.692 0.348 2VH7 94 0.436 0.59 0.622 2VIM 104 0.784 0.785 0.66 2VPA 204 0.67 0.644 0.649 106 2VQ4 0.646 0.711 0.688 2VY8 149 0.584 0.491 0.586 2VYO 210 0.78 0.847 0.52 2W1V 548 0.505 0.382 0.369 2W2A 350 0.809 0.365 0.824 2W6A 117 0.575 0.585 0.51 2WJ5 96 0.593 0.498 0.494 0.683 0.601 0.714 2WUJ 100 0.625 0.352 0.629 2WW7 150 0.481 0.386 2WWE 111 0.585 0.652 0.625 0.514 2X1Q 240 2X25 0.654 0.578 0.572 168 2X3M 166 0.544 0.483 0.271 0.863 0.848 0.855 2X5Y 171 2X9Z 0.705 0.662 0.716 262 2XHF 310 0.805 0.695 0.656 0.779 0.613 0.622 2Y0T 101 170 2Y72 0.741 0.527 0.789 0.713 0.639 2Y7L 319 0.72 69 0.873 0.773 1.000 0.762 0.635 0.799 0.743 0.673 0.641 0.901 0.846 0.510 0.611 0.640 0.915 0.840 0.711 0.826 0.697 0.698 0.851 0.859 0.757 0.776 0.759 0.777 0.761 0.819 0.804 0.821 0.919 0.629 0.903 0.505 0.710 0.875 0.799 0.726 0.830 0.834 0.926 0.939 0.71 0.764 0.795 0.56 0.649 0.658 0.888 0.885 0.917 0.162 0.003 0.193 0.508 0.484 0.509 0.693 0.682 0.619 0.651 0.589 0.682 0.638 0.668 0.677 0.526 0.5 0.565 0.784 0.767 0.822 0.667 0.711 0.74 0.756 0.748 0.621 0.594 0.584 0.594 0.582 0.495 0.618 0.54 0.603 0.529 0.727 0.614 0.517 0.613 0.673 0.652 0.671 0.753 0.693 0.555 0.548 0.528 0.744 0.643 0.616 0.726 0.596 0.775 0.413 0.393 0.212 0.763 0.755 0.576 0.679 0.555 0.68 0.77 0.724 0.533 0.675 0.648 0.729 0.68 0.571 0.706 0.638 0.589 0.823 0.748 0.647 0.484 0.357 0.44 0.739 0.598 0.598 0.499 0.471 0.356 0.582 0.628 0.692 0.534 0.478 0.443 0.598 0.403 0.632 0.717 0.655 0.744 0.718 0.705 0.694 0.583 0.578 0.574 0.591 0.569 0.606 0.778 0.774 0.798 0.754 0.766 0.78 0.928 0.797 0.747 0.68 Table 7.4 (cont’d) 0.422 PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM 0.762 0.664 0.771 0.807 0.675 0.807 0.813 0.804 0.706 0.458 0.42 0.689 0.672 0.653 0.807 0.712 0.392 0.663 0.756 0.713 0.669 0.581 0.675 0.614 0.608 0.637 0.629 0.601 0.644 0.781 0.914 0.86 0.413 -0.218 0.649 0.669 0.669 0.6 0.205 0.119 0.193 0.661 0.641 0.567 0.614 0.583 0.517 0.706 0.645 0.659 0.689 0.719 0.696 0.653 0.677 0.702 0.721 0.617 0.597 0.583 0.506 0.627 0.706 0.734 0.727 0.734 0.698 0.63 0.645 0.655 0.649 0.686 0.637 0.698 0.767 0.774 0.81 0.77 0.723 0.589 0.642 0.683 0.697 0.531 0.642 0.537 0.758 0.744 0.717 0.597 0.568 0.625 0.79 0.876 0.745 0.541 0.468 0.543 0.718 0.667 0.568 0.709 0.678 0.709 0.945 0.922 0.95 0.904 0.866 0.922 0.577 0.605 0.622 0.745 0.568 0.78 0.587 0.442 0.503 0.688 0.669 0.495 0.452 0.419 0.286 0.559 0.53 0.605 0.785 0.727 0.803 0.835 0.691 0.771 0.799 0.651 0.814 0.456 0.458 0.571 0.567 0.55 0.54 0.614 0.539 0.475 0.433 0.411 0.336 0.909 0.904 0.689 0.624 0.621 0.661 0.82 0.845 0.684 0.481 0.297 0.634 0.532 0.516 0.466 0.45 0.485 0.6 0.549 0.488 0.559 0.707 0.661 0.547 0.489 0.296 0.502 0.687 0.642 0.706 0.427 0.577 0.431 0.824 0.792 0.74 0.803 0.811 0.812 0.64 0.606 0.632 0.583 0.533 0.276 0.476 0.435 0.525 0.655 0.556 0.701 0.532 0.44 -0.126 0.831 0.817 0.793 0.722 0.713 0.634 0.825 0.789 0.835 0.771 0.7 0.63 0.747 0.82 0.51 0.511 0.196 0.732 0.691 0.67 0.518 0.72 0.716 0.683 0.793 0.723 0.758 0.534 0.573 0.754 0.748 0.841 0.966 0.867 0.617 0.502 0.475 0.486 0.441 0.301 0.599 0.317 0.613 0.735 0.714 0.738 36 2NLS 194 2NR7 2NUH 104 2O6X 306 2OA2 132 2OCT 192 2OHW 256 2OKT 342 2OL9 6 3BA1 312 3BED 261 3BQX 139 3BZQ 99 3BZZ 100 3DRF 547 3DWV 325 3E5T 228 3E7R 40 3EUR 140 3F2Z 149 3F7E 254 3FCN 158 3FE7 91 3FKE 250 3FMY 66 3FOD 48 3FSO 221 3FTD 240 6 3FVA 3G1S 418 3GBW 161 3GHJ 116 3HFO 197 3HHP 1234 3HNY 156 3HP4 183 3HWU 144 7 3HYD 3HZ8 192 3I2V 124 3I2Z 138 3I4O 135 2Y9F 149 2YLB 400 2YNY 315 2ZCM 357 2ZU1 360 3A0M 148 3A7L 128 3AMC 614 3AUB 116 3B5O 230 12 3MD4 3MD5 12 3MEA 166 3MGN 348 3MRE 383 3N11 325 3NE0 208 3NGG 94 3NPV 495 3NVG 6 3NZL 73 3O0P 194 3O5P 128 3OBQ 150 3OQY 234 125 3P6J 3PD7 188 3PES 165 3PID 387 3PIW 154 3PKV 221 3PSM 94 3PTL 289 3PVE 347 357 3PZ9 12 3PZZ 3Q2X 6 131 3Q6L 3QDS 284 3QPA 197 3R6D 221 3R87 132 0.937 0.885 0.922 0.825 0.703 0.673 0.743 0.779 1.000 0.827 0.874 0.900 0.848 0.783 0.781 0.754 0.731 0.769 0.874 0.877 0.847 0.741 0.914 0.755 0.857 0.725 0.906 0.818 1.000 0.879 0.864 0.864 0.825 0.830 0.885 0.690 0.905 1.000 0.857 0.879 0.732 0.767 0.769 0.820 0.836 0.723 0.753 0.916 0.806 0.758 0.650 0.729 0.999 0.998 0.872 0.742 0.675 0.736 0.859 0.867 0.855 1.000 0.713 0.898 0.787 0.877 0.807 0.689 0.848 0.861 0.677 0.772 0.731 0.914 0.611 0.785 0.758 0.998 1.000 0.723 0.782 0.616 0.854 0.861 0.5 0.95 70 Table 7.4 (cont’d) 0.73 PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM 0.403 0.242 0.51 0.47 0.616 0.606 0.8 0.784 0.849 0.562 0.524 0.526 0.523 0.421 0.237 0.801 0.712 0.826 0.709 0.658 0.712 0.666 0.675 0.63 0.619 0.611 0.624 0.633 0.567 0.644 0.815 0.697 0.817 0.808 0.775 0.694 0.796 0.748 0.735 0.527 0.447 0.592 0.419 0.458 0.24 0.578 0.556 0.571 0.658 0.588 0.665 0.8 0.853 0.791 0.738 0.716 0.776 0.68 0.68 0.674 0.701 0.688 0.74 0.625 0.551 0.648 0.57 0.529 0.405 0.633 0.372 0.688 0.617 0.598 0.551 0.655 0.501 0.671 0.467 0.323 0.356 0.76 0.755 0.758 0.786 0.754 0.743 0.591 0.528 0.688 0.587 0.624 0.528 0.485 0.406 0.678 0.628 0.55 0.522 0.547 0.544 0.806 0.689 0.81 0.682 0.588 0.596 0.745 0.728 0.615 0.649 0.703 0.51 0.622 0.499 0.638 0.446 0.404 0.316 0.562 0.401 0.62 0.793 0.757 0.777 162 3RQ9 0.667 0.635 0.695 128 3RY0 0.586 0.565 0.409 3RZY 139 0.797 0.693 0.817 3S0A 119 0.535 0.301 0.586 3SD2 86 0.704 0.611 0.705 3SEB 238 0.784 0.775 0.68 3SED 124 0.611 0.475 0.647 150 3SO6 0.718 0.716 0.669 3SR3 637 0.827 0.647 0.659 3SUK 248 0.734 0.584 3SZH 697 0.658 0.614 0.589 3T0H 208 0.612 0.608 0.551 3T3K 122 0.584 0.554 0.338 3T47 141 0.73 0.728 0.628 3TDN 357 0.639 0.574 0.296 0.591 0.471 0.51 3TOW 152 3TUA 210 0.664 0.591 0.451 75 3TYS 0.635 0.632 0.526 4DT4 160 0.753 0.736 0.712 4EK3 287 0.589 0.526 0.495 4ERY 318 0.666 0.652 0.597 4ES1 95 0.698 0.586 0.553 4EUG 225 0.531 0.487 0.583 448 4F01 0.596 0.491 0.604 143 4F3J 0.837 0.812 0.84 4FR9 141 0.557 0.484 0.602 4G14 15 0.625 0.61 0.607 4G2E 151 0.833 0.741 0.753 4G5X 550 0.785 0.749 0.695 4G6C 658 0.737 0.725 0.649 0.651 0.516 0.632 4G7X 194 4GA2 144 0.404 0.392 0.43 4GMQ 92 0.562 0.391 0.59 4GS3 90 0.691 0.687 0.526 236 4H4J 0.46 0.524 0.448 4H89 168 0.709 0.728 0.746 4HDE 168 0.618 0.516 0.303 0.748 4HJP 281 0.746 0.759 0.724 0.717 0.717 4HWM 117 0.619 0.674 85 4IL7 4J11 0.551 0.536 357 0.68 0.605 4J5O 220 3I7M 134 3IHS 169 3IVV 149 3K6Y 227 3KBE 140 3KGK 190 3KZD 85 3L41 220 3LAA 169 3LAX 106 833 3LG3 3LJI 272 3M3P 249 178 3M8J 3M9J 210 3M9Q 176 3MAB 173 3U6G 248 77 3U97 3UCI 72 637 3UR8 148 3US6 3V1A 48 285 3V75 3VN0 193 3VOR 182 3VUB 101 3VVV 108 3VZ9 163 3W4Q 773 3ZBD 213 3ZIT 152 3ZRX 221 3ZSL 138 3ZZP 74 3ZZY 226 4A02 166 167 4ACJ 186 4AE7 4AM1 345 4ANN 176 4AVR 188 0.604 0.807 0.866 0.817 0.743 0.798 0.789 0.776 0.880 0.924 0.701 0.720 0.697 0.813 0.867 0.851 0.770 0.808 0.819 0.689 0.832 0.668 0.811 0.674 0.889 0.686 0.852 0.951 0.887 0.798 0.891 0.641 0.639 0.903 0.692 0.804 0.730 0.827 0.862 0.796 0.562 0.759 0.711 0.790 0.867 0.713 0.842 0.879 0.870 0.747 0.633 0.721 0.860 0.897 0.803 0.759 0.668 0.722 0.696 0.918 0.784 0.830 0.801 0.820 0.592 0.883 0.879 0.806 1.000 0.835 0.822 0.834 0.840 0.782 0.794 0.698 0.866 0.624 0.783 0.730 0.807 0.719 0.726 0.817 0.59 0.46 0.47 0.65 71 PDB ID N MWCG opFRI pfFRI GNM PDB ID N MWCG opFRI pfFRI GNM 0.742 0.689 0.648 0.608 0.736 0.543 0.697 0.553 0.682 0.538 0.53 0.324 0.421 0.331 0.574 0.594 4AXY 54 4B6G 558 4B9G 292 4DD5 387 4DKN 423 4DND 95 109 4DPZ 4DQ7 328 4J5Q 146 4J78 305 4JG2 185 4JVU 207 4JYP 534 4KEF 133 5CYT 103 6RXN 45 0.7 0.765 0.844 0.615 0.781 0.763 0.73 0.69 0.623 0.72 0.756 0.669 0.816 0.763 0.596 0.351 0.761 0.539 0.75 0.582 0.726 0.651 0.683 0.376 0.973 0.804 0.855 0.850 0.786 0.829 0.837 0.776 Table 7.4 (cont’d) 0.851 0.729 0.889 0.800 0.800 0.704 0.548 0.583 0.742 0.658 0.746 0.723 0.688 0.58 0.441 0.614 As reported in Bramer et al [1], the Pearson correlation coefficients for small, medium and large proteins, as well as the average Pearson correlation coefficient of the protein superset, are provided in Table 7.5. In addition to MWCG, the average Pearson correlation coefficients for opFRI, pFRI, GNM, and NMA are also included for comparison. As determined by Park et al, GNM is more accurate than NMA, as analyzed by Park el al [4]. Moreover, opFRI and pfFRI are more accurate than GNM and the MWCG method presented in this work is on average approximately 28% more accurate than pfFRI and 42% more accurate than GNM. Table 7.6 provides the average Pearson correlation coefficient obtained from MWCG linear least square fitting for Cα, non Cα carbon, nitrogen, oxygen, and sulfur atom based B factor prediction. It is notable that these predictions were not available to earlier GNM and FRI methods, thus no comparison can be provided for this result. 7.4 Machine Learning Results 7.4.1 MWCG B factors of all carbon, nitrogen, and oxygen atoms present in a given protein were blindly predicted using a leave-one-(protein)-out approach. Results for predicted Cα B factors are 72 also included for comparison between other methods. These results are predicted in the same way as other heavy atoms. The machine learning B factor prediction models were trained using the generated input feature and B factor data from a training data set as described in Sections 5.6.1 and 5.6. After training the model is used to predict B factors for all heavy atoms in a given protein using only its feature input data. 7.4.1.1 Efficiency comparison It is important for any algorithmic approach to consider the computational efficiency of the method. For B factor predictions this is a particularly important consideration for large proteins. The running times of the GNM, RF, GBT, and CNN models used for testing the MWCG method are provided in Table 7.7 . Figure 7.7 provides a log-log comparison of these times. The protein set used to test the computational complexity were the same as those used by Opron et all [3]. Because GNM only provides Cα B factor predictions, only B factors of Cα atoms are predicted by the RF, GBT, and CNN models for this comparison. Because GNM is computational prohibitive for large proteins, several proteins were excluded from the test set for GNM predictions. All testing excludes the time it takes to load PDB files and feature data. The RF, GBT, and CNN times exclude the training of the model which can be used for the prediction of all proteins once they are trained. The results agree with the theoretical complexity O(N 3) for GNM. This is due to the matrix diagonalization required for GNM. In contrast the machine learning algorithms are close to O(N ), where N is the number of atoms. The lines of best fit for CPU time (t) are t ≈ (4× 10−8)∗ N 3.09 for GNM, t ≈ (9 × 10−6) ∗ N 0.78 for RF, t ≈ (4 × 10−6) ∗ N 0.87 for GBT, and t ≈ (1.1 × 10−3) ∗ N 0.97 for CNN. 73 7.4.1.2 Machine learning performance Table 7.8 provides the results for the blind prediction of all heavy atoms over the protein dataset. Overall the convolutional neural network method performs best with average Pear- son correlation coefficient of 0.69. Both gradient boosted and random forest perform similarly with Pearson correlation coefficients of 0.63 and 0.59 respectively. Table 7.8 provides the results of the average Pearson correlation coefficients for Cα only B factor predictions, which are obtained in the same manner as other heavy atoms. This allows a comparison between previous methods. For comparison, the parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM) and normal mode analysis (NMA) are all included. The predictions of these previous methods are all obtained via the least squares fitting of each protein. B factor prediction results are also included in Tables 7.9, 7.10, and 7.11 for the small- , medium-, and large-sized protein data subsets [4]. The results B factor predictions of all proteins in the protein Superset are provided in Table 7.12. The averages over the data subsets and superset is provided in Table 7.8. Over the different subsets all methods provided similar performance in terms of Pearson correlation coefficient. The deep convolutional neural network performed best on the protein Superset for both Cα only and all heavy atom B factor predictions. The blind cross protein B factor prediction obtained in this work is particularly notable because it improves upon the best existing fitting methods. Previous work by Opron et al used the single protein linear least squares parameter-free FRI (pfFRI) method to obtain an average Pearson correlation coefficient of 0.63 averaged over the superset [3]. GNM performs worse with an overall Pearson correlation coefficient of 0.57 averaged over the superset [3]. 74 Cross protein blind prediction is a much more difficult task than linear fitting. Table 7.12 shows that none fo the machine learning methods out perform one another over the entire data set. Averaged over the superset, the Pearson correlation coefficient for the all heavy atom B factor prediction of the convolutional neural network out performed the boosted gradient and random forest by 10% and 17% respectively. Table 7.12: Pearson correlation coefficients for cross protein heavy atom blind B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the Superset. Results reported use heavy atoms in both training and prediction. MWCG machine learning results originally published in Bramer et al [2]. PDB ID N 1ABA 1AHO 1AIE 1AKG 1ATG 1BGF 1BX7 1BYI 1CCR 1CYO 1DF4 1E5K 1ES5 1ETL 1ETM 1ETN 1EW4 1F8R 1FF4 1FK5 1GCO 1GK7 1GVD 1GXU 1H6V 1HJE 1I71 1IDP 1IFR 728 482 235 108 1689 1018 345 1540 837 697 463 1423 1912 76 80 77 863 15291 477 626 7888 321 401 694 22514 73 683 3661 878 0.77 0.71 0.53 0.51 0.66 0.63 0.67 0.63 0.67 0.68 0.75 0.73 0.68 0.03 0.13 0.25 0.71 0.64 0.59 0.71 0.61 0.73 0.69 0.67 0.40 0.46 0.62 0.74 0.74 RF GBT CNN PDB ID N 0.74 0.62 0.62 0.41 0.61 0.58 0.55 0.59 0.70 0.66 0.79 0.70 0.63 0.27 0.46 0.33 0.70 0.64 0.55 0.62 0.64 0.53 0.66 0.65 0.39 -0.07 0.57 0.69 0.72 2X5Y 2X9Z 2XHF 2Y0T 2Y72 2Y7L 2Y9F 2YLB 2YNY 2ZCM 2ZU1 3A0M 3A7L 3AMC 3AUB 3B5O 3BA1 3BED 3BQX 3BZQ 3BZZ 3DRF 3DWV 3E5T 3E7R 3EUR 3F2Z 3F7E 3FCN 0.73 0.76 0.60 0.70 0.63 0.63 0.63 0.59 0.66 0.76 0.64 0.74 0.66 0.48 0.48 0.20 0.61 0.83 0.76 0.63 0.71 0.72 0.71 0.66 0.58 0.37 0.66 0.83 0.73 RF GBT CNN 0.72 0.75 0.71 0.76 0.70 0.65 0.73 0.59 0.73 0.75 0.62 0.81 0.64 0.72 0.63 0.60 0.67 0.68 0.44 0.41 0.17 0.59 0.65 0.74 0.81 0.66 0.62 0.72 0.74 0.63 0.53 0.65 0.44 0.65 0.70 0.73 0.52 0.85 0.43 0.60 0.77 0.45 0.81 0.67 0.60 0.87 0.75 0.71 0.81 0.60 0.47 0.82 0.88 0.78 0.69 0.61 0.68 0.73 0.79 0.72 0.71 0.75 0.80 0.82 0.77 0.69 0.71 0.45 0.73 0.47 0.75 0.75 0.62 0.55 0.64 0.73 0.59 0.61 0.45 0.66 0.67 0.72 0.60 0.50 0.78 0.67 0.71 1352 1956 2432 757 1171 2398 1212 3065 2364 2959 2794 823 963 5174 782 1510 2391 1570 1028 742 773 4101 2363 1543 295 1059 1160 1912 1039 75 Table 7.12 (cont’d) PDB ID N 1K8U 1KMM 1KNG 1KR4 1KYC 1LR7 1MF7 1N7E 1NKD 1NLS 1NNX 1NOA 1NOT 1O06 1O08 1OPD 1P9I 1PEF 1PEN 1PMY 1PZ4 1Q9B 1QAU 1QKI 1QTO 1R29 1R7J 1RJU 1RRO 1SAU 1TGR 1TZV 1U06 1U7I 1U9C 1UHA 1UKU 1ULR 1UOY 1USE 1USM 1UTG 686 11632 1016 906 138 522 1551 700 426 1746 674 778 96 142 1722 642 203 153 109 937 874 303 812 31154 934 971 729 257 846 830 749 1051 432 1988 1712 623 873 677 452 290 631 548 0.68 0.70 0.56 0.76 0.30 0.70 0.68 0.65 0.59 0.64 0.73 0.57 0.81 0.64 0.58 0.60 0.77 0.64 0.24 0.65 0.73 0.67 0.58 0.27 0.55 0.73 0.70 0.75 0.52 0.68 0.65 0.77 0.68 0.75 0.64 0.80 0.75 0.71 0.56 0.50 0.78 0.55 RF GBT CNN PDB ID N 0.65 0.65 0.61 0.73 0.43 0.53 0.68 0.62 0.56 0.61 0.69 0.52 -0.18 0.51 0.51 0.55 0.73 0.60 0.34 0.64 0.73 0.41 0.57 0.44 0.61 0.61 0.71 0.71 0.56 0.62 0.61 0.75 0.55 0.73 0.61 0.74 0.74 0.69 0.55 0.25 0.59 0.58 3FE7 3FKE 3FMY 3FOD 3FSO 3FTD 3G1S 3GBW 3GHJ 3HFO 3HHP 3HNY 3HP4 3HWU 3HYD 3HZ8 3I2V 3I2Z 3I4O 3I7M 3IHS 3IVV 3K6Y 3KBE 3KGK 3KZD 3L41 3LAA 3LAX 3LG3 3LJI 3M3P 3M8J 3M9J 3M9Q 3MAB 3MD4 3MEA 3MGN 3MRE 3N11 3NE0 0.74 0.87 0.55 0.72 0.32 0.71 0.70 0.71 0.63 0.56 0.53 0.57 0.63 0.65 0.55 0.62 0.77 0.76 0.21 0.67 0.74 0.75 0.57 0.84 0.63 0.72 0.65 0.73 0.54 0.60 0.67 0.75 0.61 0.77 0.58 0.75 0.70 0.68 0.55 0.68 0.67 0.62 RF GBT CNN 0.83 0.62 0.57 0.76 0.84 0.73 0.78 0.30 0.85 0.71 0.75 0.69 0.72 0.74 0.68 0.75 0.66 0.44 0.70 0.65 0.62 0.71 0.58 0.73 0.61 0.65 0.51 0.51 0.60 -0.05 0.51 0.76 0.81 0.50 0.75 0.63 0.87 0.66 0.56 0.87 0.81 0.66 0.85 0.72 0.62 0.90 0.86 0.75 0.87 0.75 0.64 0.74 0.88 0.73 0.89 0.54 0.89 0.69 0.57 0.91 0.50 0.46 0.68 0.57 0.78 0.68 0.50 0.66 0.48 0.52 0.59 0.63 0.36 0.79 0.93 0.58 0.82 0.15 0.57 0.84 0.85 0.52 0.68 0.85 0.71 0.56 0.75 0.45 0.73 0.75 0.76 0.76 0.71 0.72 0.74 0.73 0.63 0.69 0.28 0.54 0.54 0.64 0.64 0.60 0.65 0.81 0.65 0.76 0.78 0.70 0.76 0.46 0.71 0.59 0.54 0.62 0.77 0.74 0.53 0.65 0.61 0.64 0.03 0.56 0.57 0.69 710 1938 470 328 197 1795 3196 1275 808 1432 8495 1351 1322 934 52 1459 929 1039 969 928 1120 1097 1617 829 1492 605 1735 1112 753 6061 1946 1858 1396 1329 1359 1311 81 1236 2236 2598 2501 1551 76 PDB ID N 17 1V05 1V70 784 66 1VRZ 746 1W2L 1542 1WBE 1WHI 937 2430 1WLY 906 1WPA 1X3O 622 124 1XY1 62 1XY2 669 1Y6X 1YJO 55 361 1YZM 771 1Z21 1ZCE 1100 551 1ZVA 3493 2A50 1867 2AGK 2AH1 7215 1454 2B0A 3002 2BCM 2BF9 287 735 2BRF 1446 2C71 2CE0 714 536 2CG7 4366 2COV 1624 2CWS 2D5W 9772 1933 2DKO 4454 2DPL 2DSX 386 3416 2E10 589 2E3H 705 2EAQ 2EHP 1875 590 2EHS 385 2ERW 2ETX 3018 766 2FB6 2FG1 1021 Table 7.12 (cont’d) RF GBT CNN PDB ID N -0.20 0.70 0.38 0.62 0.59 0.74 0.65 0.64 0.53 0.58 0.16 0.44 0.36 0.51 0.63 0.77 0.59 0.64 0.61 0.65 0.66 0.51 0.39 0.76 0.59 0.62 0.47 0.76 0.63 0.71 0.71 0.49 0.36 0.50 0.70 0.63 0.75 0.55 0.47 0.56 0.63 0.55 3NGG 3NPV 3NVG 3NZL 3O0P 3O5P 3OBQ 3OQY 3P6J 3PD7 3PES 3PID 3PIW 3PKV 3PSM 3PTL 3PVE 3PZ9 3PZZ 3Q2X 3Q6L 3QDS 3QPA 3R6D 3R87 3RQ9 3RY0 3RZY 3S0A 3SD2 3SEB 3SED 3SO6 3SR3 3SUK 3SZH 3T0H 3T3K 3T47 3TDN 3TOW 3TUA 0.02 0.67 -0.17 0.68 0.61 0.77 0.71 0.66 0.52 0.19 0.27 0.53 0.12 0.60 0.66 0.81 0.56 0.48 0.68 0.57 0.68 0.62 0.52 0.78 0.61 0.65 0.54 0.83 0.60 0.75 0.72 0.53 0.44 0.64 0.73 0.61 0.74 0.71 0.50 0.61 0.65 0.65 RF GBT CNN 0.83 0.63 0.70 0.84 0.88 -0.08 0.63 0.59 0.63 0.55 0.53 0.70 0.84 0.61 0.76 0.57 0.57 0.88 0.85 0.70 0.84 0.72 0.86 0.49 0.72 0.87 0.81 0.66 0.80 0.62 0.61 0.72 0.46 0.56 0.60 0.63 0.85 0.47 0.29 0.76 0.75 0.71 0.71 0.71 0.43 0.71 0.59 0.31 0.53 0.39 0.32 0.66 0.53 0.66 0.64 0.69 0.61 0.55 0.38 0.71 0.57 0.61 0.72 0.70 0.69 0.01 0.45 0.69 0.59 0.62 0.44 0.74 0.78 0.65 0.62 0.56 0.74 0.54 0.55 0.58 0.66 0.53 0.63 0.70 0.75 0.75 0.08 0.65 0.65 0.63 0.61 0.62 0.70 0.72 0.73 0.56 0.75 0.68 0.68 0.62 0.61 0.76 0.25 0.59 0.67 0.72 0.44 0.69 0.51 0.47 0.65 0.69 0.61 0.52 0.71 0.71 0.75 0.69 0.65 0.80 0.81 0.68 0.62 0.55 0.66 0.66 702 3655 50 567 1452 819 1195 1772 857 1354 1240 3078 1223 1688 729 2101 2656 2913 76 43 1022 2234 1348 1550 1007 1174 964 985 884 527 1948 933 1119 4891 1761 5074 1627 922 1116 2703 1193 1510 0.60 0.62 0.09 0.69 0.63 0.71 0.68 0.74 0.63 0.47 0.55 0.46 0.02 0.56 0.63 0.73 0.58 0.68 0.44 0.67 0.72 0.85 0.70 0.86 0.83 0.90 0.79 0.78 0.78 0.75 0.72 0.73 0.56 0.61 0.38 0.58 0.74 0.38 0.32 0.58 0.52 0.68 77 Table 7.12 (cont’d) PDB ID N 2FN9 2FQ3 2G69 2G7O 2G7S 2GKG 2GOM 2GXG 2GZQ 2HQK 2HYK 2I24 2I49 2IBL 2IGD 2IMF 2IP6 2IVY 2J32 2J9W 2JKU 2JLI 2JLJ 2MCM 2NLS 2NR7 2NUH 2O6X 2OA2 2OHW 2OKT 2OL9 2PKT 2PLT 2PMR 2POF 2PPN 2PSF 2PTH 2Q4N 2Q52 2QJL 4362 721 744 537 1258 706 987 1132 1402 1582 1832 872 3109 815 431 1564 702 727 1935 1626 229 708 889 735 269 1556 806 2415 970 2074 2587 51 666 719 590 3418 701 4983 1437 9496 26784 734 0.60 0.75 0.61 0.63 0.60 0.60 0.70 0.75 0.60 0.76 0.65 0.52 0.77 0.53 0.68 0.62 0.67 0.59 0.78 0.68 0.63 0.54 0.70 0.73 0.49 0.70 0.72 0.82 0.53 0.62 0.59 0.51 0.17 0.67 0.66 0.66 0.68 0.55 0.72 0.39 0.62 0.60 RF GBT CNN PDB ID N 0.37 0.67 0.60 0.52 0.60 0.63 0.61 0.67 0.59 0.76 0.60 0.52 0.78 0.46 0.58 0.62 0.62 0.47 0.79 0.66 0.57 0.58 0.66 0.71 0.45 0.71 0.64 0.76 0.54 0.55 0.56 0.65 0.06 0.62 0.63 0.58 0.50 0.54 0.68 0.45 0.63 0.61 3TYS 3U6G 3U97 3UCI 3UR8 3US6 3V1A 3V75 3VN0 3VOR 3VUB 3VVV 3VZ9 3W4Q 3ZBD 3ZIT 3ZRX 3ZSL 3ZZP 3ZZY 4A02 4ACJ 4AE7 4AM1 4ANN 4AVR 4AXY 4B6G 4B9G 4DD5 4DKN 4DND 4DPZ 4DQ7 4DT4 4EK3 4ERY 4ES1 4EUG 4F01 4F3J 4FR9 0.61 0.76 0.87 0.89 0.81 0.70 0.92 0.86 0.90 0.90 0.85 0.91 0.90 0.88 0.82 0.47 0.64 0.62 0.70 0.73 0.35 0.73 0.68 0.60 0.70 0.66 0.19 0.63 0.92 0.81 0.89 0.84 0.76 0.70 0.63 0.85 0.83 0.79 0.79 0.85 0.77 0.42 RF GBT CNN 0.71 0.67 0.52 0.60 0.27 0.57 0.56 0.44 0.83 0.63 0.62 0.01 0.76 0.36 0.83 0.63 0.69 0.76 0.81 0.41 0.78 0.64 0.84 0.62 0.70 0.66 0.65 0.66 0.78 0.54 0.51 0.71 0.60 0.38 0.69 0.61 0.56 0.40 0.64 0.69 0.75 0.62 0.75 0.64 0.64 0.61 0.56 0.64 0.72 0.53 0.62 0.64 0.75 0.45 0.84 0.78 0.83 0.79 0.63 0.87 0.88 0.76 0.85 0.66 0.65 0.83 0.78 0.58 0.73 0.71 0.73 0.70 0.70 0.83 0.81 0.63 0.79 0.59 0.55 0.77 0.53 0.58 0.61 0.62 0.68 0.51 0.66 0.51 0.66 0.64 0.36 0.65 0.76 0.50 0.70 0.69 0.72 0.73 0.54 0.54 0.67 0.64 0.46 0.69 0.65 0.67 0.74 0.67 0.60 0.61 0.64 0.76 0.81 0.66 0.77 0.73 0.66 0.69 0.73 0.72 0.74 0.64 0.66 0.54 0.62 0.64 556 1658 524 536 5033 1156 319 1974 1469 1077 787 869 1366 5406 1718 1192 1654 925 585 1741 1281 1210 1458 2605 1180 1437 317 4504 2226 2618 3356 755 865 2526 1163 2147 2357 737 1789 3374 1116 956 78 PDB ID N 2R16 2R6Q 2RB8 2RE2 2RFR 2V9V 2VE8 2VH7 2VIM 2VPA 2VQ4 2VY8 2VYO 2W1V 2W2A 2W6A 2WJ5 2WUJ 2WW7 2WWE 2X1Q 2X25 2X3M 1262 903 723 1559 1019 986 3967 749 781 1524 800 1058 1589 4223 2918 826 630 828 915 54 1852 1289 1267 Table 7.12 (cont’d) RF GBT CNN PDB ID N 39 0.52 1178 0.59 4002 0.61 4814 0.66 1315 0.54 873 0.64 678 0.65 737 0.74 0.62 1470 1127 0.63 1288 0.72 2112 0.71 0.53 799 527 0.68 2658 0.56 0.66 1406 1062 0.49 2443 0.55 1294 0.35 1615 0.23 4063 0.58 1002 0.65 0.66 800 345 4G14 4G2E 4G5X 4G6C 4G7X 4GA2 4GMQ 4GS3 4H4J 4H89 4HDE 4HJP 4HWM 4IL7 4J11 4J5O 4J5Q 4J78 4JG2 4JVU 4JYP 4KEF 5CYT 6RXN 0.53 0.53 0.64 0.66 0.58 0.61 0.59 0.70 0.61 0.68 0.76 0.74 0.65 0.72 0.62 0.76 0.53 0.55 0.43 0.22 0.53 0.68 0.70 0.50 0.57 0.42 0.54 0.66 0.63 0.66 0.82 0.75 0.61 0.78 0.63 0.61 0.72 0.63 0.69 0.77 0.55 0.61 0.12 0.77 0.80 0.75 RF GBT CNN 0.55 0.28 0.73 0.76 0.65 0.74 0.61 0.47 0.80 0.49 0.51 0.55 0.54 0.56 0.56 0.56 0.69 0.70 0.62 0.55 0.70 0.73 0.76 0.65 0.50 0.81 0.74 0.35 0.94 0.47 0.64 0.91 0.87 0.73 0.86 0.71 0.88 0.70 0.69 0.89 0.93 0.70 0.68 0.65 0.68 0.74 0.82 0.56 0.50 0.73 0.75 0.60 0.56 0.55 0.72 0.60 0.80 0.61 0.79 0.70 0.57 0.43 0.58 0.63 0.75 0.75 0.73 0.68 0.78 0.62 0.70 0.71 Several proteins have low Pearson correlation coefficients indicating a poor model pre- diction. In these cases we see that if one model performs poorly the other models generally perform satisfactorily. Taking the consensus of the maximum correlation coefficient for each protein among the three machine learning methods results in an average all heavy atom correlation coefficient of 0.73 and an average Cα only correlation coefficient of 0.72. This result is similar to that of the parameter-optimized FRI (opFRI) reported in earlier work by Opron et al [3]. 79 7.4.1.3 Relative feature importance Ensemble methods provide relative feature importance of the features used in the resulting models. This is an important tool to help understand which features are most significant in a model. Figure 7.8 shows the individual feature importance for the random forest averaged over the protein superset. Since several of the features are related, Figure 7.9 provides a plot of the aggregated feature importance. The feature importance of the individual angle, secondary, MWCG, atom type, protein size, amino acid, and packing density features are all summed together to illustrate the overall effect of each feature type. Figure 7.8 shows the most important MWCG feature is the carbon-carbon interaction. This MWCG feature uses a Lorentz radial basis function as with η = 16 and ν = 3 as detailed in Section 5.3. The remaining eight MWCG features all rank similarly with the carbon-oxygen interaction ranked as the second most significant MWCG feature. This result validates that the model benefits from the multi-scale property of the MWCG feature, which uses three different kernels to capture interactions at various length scales. Since all MWCG have significance in the feature ranking it follows that the element specific property of the MWCG method is also a meaningful model feature. Figure 7.8 shows that that the individual MWCG, amino acid type, and packing density feature have low relative importance, however, considering their aggregate importance as seen in Figure 7.9, we see that they contribute to the model. Figure 7.9 shows that the medium density protein packing density feature was twice as important to the model as the short and long density features. The medium packing density may be capturing semi- local side chain interactions which are important in protein flexibility. The short packing 80 density likely captures only adjacent backbone information while the long packing density is only adding weak atomic interaction information to the model. Protein resolution is the most significant relative feature followed by MWCG features and the STRIDE generated residue solvent accessible area feature. This also highlights the importance of the quality of X-ray crystal structures and difficulty in cross-protein B factor prediction. Protein angles, secondary structures, and size play a less significant role in the model compared to the other features. Atom type has the lowest significance relative to the other features implemented in the model. Not surprisingly, we see that global features such as resolution and R-value are important components in the ensemble model. The global feature of protein size has a small role in the model. Care must be taken to use feature ranking to understand feature importance. The feature ranking provided by these models is a relative ordering of features that the models find most important. So features with high correlation may be redundant giving one of them a lower rank even though they may have significant prediction power. For example, R-value highly correlates with resolution so it is likely a meaningful feature. However, the use of resolution reduces the relative importance ranking of R-value in the model. 7.5 ASPH & ESPH B Factor Prediction 7.5.1 Least Squares Fitting The Pearson correlation coefficients using least squares fitting for Cα B factor prediction of small, medium, and large protein subsets are provided in tables 7.17, 7.18, and 7.19 respectively. Results for the all proteins in the dataset are provided in table 7.21. The average Pearson correlation coefficients for small, medium, large, and superset data sets 81 is provided in table 7.20. Table 7.20 includes fitting results using only Bottleneck, only Wasserstein, and using both Bottleneck and Wasserstein metrics. Results using only an exponential kernel, only a lorentz kernel, or both an exponential and lorentz kernel for fitting are also included. All results reported here PH features generated with a cutoff of 11˚A and include three pairwise interactions (carbon-carbon, carbon-nitrogen, carbon-oxygen). 7.5.2 Machine Learning ASPH and ESPH Pearson correlation coefficients using boosted gradient (GBT), convolu- tional neural network (CNN), and consensus method (CON) for Cα B factor prediction of small, medium, and large protein subsets is provided in tables 7.14, 7.15, and 7.16 respec- tively. Parameters for GBT and CNN methods can be found in Tables 5.7 and 5.8. The global and local features used for training and testing are provided in chapter 5. Results for all proteins are provided in table 7.22. The average Pearson correlation coefficients for small, medium, large, and superset data sets is provided in table 7.13. All results reported here use a cutoff of 11˚A and include three pairwise interactions (carbon-carbon, carbon-nitrogen, carbon-oxygen). Kernel parameters for both exponential and lorentz kernels are provided in Table 5.4. Results from previously existing Cα B factor prediction methods are included for comparison in Table 7.13. Overall both GBT and CNN algorithms perform similarly. As expected, the CNN method out performs the GBT with average correlation coefficients over the superset of 0.60 and 0.59 respectively. The consensus method improves upon both results with an average Pearson correlation coefficient of 0.61 over the superset. Table 7.13 shows that the blind prediction machine learning models perform better than fitting models GNM and NMA and similar to the pFRI fitting model. 82 Table 7.21: Pearson correlation coefficients of persistent homology based least squares fitting Cα B factor prediction of all proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.70 0.75 0.96 0.87 0.51 0.75 0.82 0.54 0.63 0.67 0.94 0.69 0.60 1.00 1.00 1.00 0.62 0.65 0.76 0.55 0.65 0.94 0.66 0.77 0.30 1.00 0.59 0.48 0.62 0.75 0.57 0.53 0.54 1.00 0.58 0.59 0.73 0.75 1ABA 87 1AHO 66 1AIE 31 1AKG 16 1ATG 231 1BGF 124 1BX7 51 1BYI 238 1CCR 109 1CYO 88 1DF4 57 1E5K 188 1ES5 260 1ETL 12 1ETM 12 1ETN 12 1EW4 106 1F8R 1932 1FF4 65 1FK5 93 1GCO 1044 1GK7 39 1GVD 56 1GXU 89 1H6V 2927 1HJE 13 1I71 83 1IDP 441 1IFR 113 1K8U 87 1KMM 1499 144 1KNG 1KR4 107 15 1KYC 73 1LR7 194 1MF7 1N7E 95 59 1NKD 0.56 0.53 0.90 0.53 0.38 0.68 0.81 0.44 0.43 0.65 0.88 0.63 0.44 0.95 0.70 0.70 0.55 0.50 0.68 0.49 0.53 0.88 0.61 0.69 0.23 0.67 0.38 0.39 0.47 0.65 0.36 0.43 0.45 0.88 0.46 0.50 0.54 0.55 0.68 0.79 0.90 0.72 0.53 0.75 0.82 0.49 0.65 0.68 0.95 0.68 0.58 1.00 0.86 0.99 0.55 0.63 0.75 0.58 0.63 0.95 0.69 0.75 0.30 1.00 0.56 0.47 0.65 0.71 0.54 0.51 0.53 0.99 0.63 0.59 0.72 0.63 0.67 0.75 0.97 0.82 0.50 0.75 0.86 0.50 0.65 0.71 0.93 0.67 0.58 1.00 1.00 1.00 0.58 0.61 0.77 0.53 0.63 0.95 0.75 0.75 0.29 1.00 0.44 0.48 0.65 0.72 0.57 0.52 0.57 0.96 0.61 0.56 0.67 0.73 0.67 0.78 0.88 0.66 0.50 0.70 0.74 0.51 0.66 0.69 0.92 0.68 0.57 1.00 1.00 1.00 0.60 0.63 0.72 0.59 0.64 0.94 0.68 0.78 0.31 1.00 0.66 0.47 0.59 0.74 0.54 0.51 0.48 0.99 0.62 0.59 0.71 0.69 0.63 0.65 0.77 0.56 0.48 0.61 0.69 0.48 0.58 0.59 0.91 0.67 0.56 0.98 0.83 0.92 0.55 0.62 0.68 0.50 0.63 0.92 0.62 0.72 0.29 0.57 0.58 0.46 0.53 0.67 0.53 0.50 0.45 0.88 0.56 0.58 0.63 0.65 0.62 0.73 0.64 0.53 0.47 0.54 0.68 0.46 0.56 0.58 0.89 0.67 0.55 0.87 0.74 0.92 0.51 0.62 0.65 0.50 0.63 0.93 0.63 0.61 0.29 0.79 0.46 0.45 0.54 0.64 0.53 0.47 0.47 0.93 0.55 0.57 0.68 0.58 0.76 0.88 0.99 1.00 0.61 0.82 0.89 0.58 0.71 0.78 0.97 0.74 0.65 1.00 1.00 1.00 0.73 0.70 0.80 0.71 0.66 0.98 0.84 0.82 0.33 1.00 0.76 0.55 0.73 0.85 0.59 0.61 0.60 1.00 0.71 0.67 0.80 0.89 0.54 0.72 0.78 0.60 0.45 0.64 0.79 0.41 0.53 0.66 0.92 0.66 0.51 0.68 0.45 0.96 0.52 0.59 0.70 0.49 0.59 0.91 0.67 0.72 0.28 0.72 0.41 0.43 0.56 0.67 0.49 0.43 0.39 0.92 0.57 0.55 0.54 0.56 83 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.82 0.86 0.59 1.00 0.98 0.48 1.00 1.00 0.36 0.89 0.96 0.94 0.71 0.88 0.71 0.58 0.40 0.56 0.72 0.86 0.72 0.45 0.76 0.75 0.74 0.55 0.71 0.67 0.73 0.80 0.61 0.73 0.64 0.65 0.56 0.65 0.62 1.00 0.69 0.48 0.55 1NLS 1NNX 1NOA 1NOT 1O06 1O08 1OB4 1OB7 1OPD 1P9I 1PEF 1PEN 1PMY 1PZ4 1Q9B 1QAU 1QKI 1QTO 1R29 1R7J 1RJU 1RRO 1SAU 1TGR 1TZV 1U06 1U7I 1U9C 1UHA 1UKU 1ULR 1UOY 1USE 1USM 1UTG 1V05 1V70 1VRZ 1W2L 1WBE 1WHI 238 93 113 13 22 221 5 5 85 29 18 16 123 113 44 112 3912 122 122 90 36 108 123 111 157 55 259 220 82 102 87 64 47 77 70 96 105 13 97 206 122 0.83 0.83 0.63 1.00 0.97 0.50 1.00 1.00 0.36 0.92 0.96 0.83 0.67 0.89 0.69 0.58 0.41 0.53 0.69 0.87 0.81 0.45 0.75 0.74 0.77 0.52 0.70 0.64 0.74 0.80 0.59 0.69 0.72 0.66 0.60 0.65 0.66 1.00 0.69 0.55 0.57 0.86 0.88 0.72 1.00 1.00 0.56 1.00 1.00 0.57 0.98 1.00 1.00 0.76 0.93 0.94 0.66 0.45 0.65 0.76 0.91 0.91 0.56 0.81 0.83 0.83 0.72 0.73 0.74 0.82 0.84 0.68 0.83 0.91 0.81 0.68 0.72 0.75 1.00 0.79 0.63 0.63 0.75 0.81 0.60 0.82 0.96 0.44 1.00 1.00 0.25 0.87 0.88 0.60 0.62 0.86 0.58 0.57 0.34 0.48 0.55 0.83 0.75 0.31 0.70 0.72 0.73 0.37 0.62 0.61 0.69 0.78 0.49 0.65 0.50 0.57 0.51 0.60 0.56 0.92 0.60 0.43 0.42 84 0.81 0.84 0.63 1.00 0.98 0.46 1.00 1.00 0.35 0.89 0.96 0.96 0.71 0.88 0.79 0.59 0.38 0.59 0.71 0.88 0.81 0.39 0.76 0.77 0.76 0.50 0.71 0.66 0.70 0.80 0.56 0.73 0.66 0.62 0.57 0.67 0.64 1.00 0.72 0.53 0.57 0.78 0.84 0.65 1.00 0.97 0.48 1.00 1.00 0.29 0.88 0.97 0.90 0.70 0.82 0.76 0.61 0.42 0.59 0.56 0.86 0.74 0.35 0.75 0.76 0.78 0.52 0.71 0.65 0.75 0.81 0.53 0.72 0.75 0.61 0.53 0.66 0.65 1.00 0.72 0.47 0.55 0.65 0.79 0.57 0.86 0.92 0.42 1.00 1.00 0.21 0.82 0.94 0.67 0.59 0.74 0.59 0.55 0.38 0.46 0.35 0.76 0.69 0.23 0.73 0.70 0.71 0.36 0.68 0.57 0.68 0.80 0.50 0.66 0.52 0.53 0.49 0.61 0.60 0.92 0.63 0.38 0.44 0.80 0.81 0.53 0.86 0.97 0.37 1.00 1.00 0.29 0.87 0.92 0.47 0.68 0.85 0.69 0.55 0.32 0.55 0.69 0.81 0.62 0.33 0.68 0.74 0.69 0.46 0.53 0.61 0.67 0.74 0.44 0.65 0.46 0.61 0.49 0.52 0.51 0.77 0.56 0.36 0.34 0.72 0.81 0.57 0.81 0.94 0.45 1.00 1.00 0.19 0.84 0.94 0.73 0.69 0.76 0.57 0.57 0.38 0.52 0.43 0.79 0.65 0.19 0.74 0.73 0.70 0.39 0.67 0.60 0.69 0.80 0.50 0.69 0.53 0.58 0.49 0.61 0.58 0.85 0.61 0.42 0.43 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.64 0.70 0.67 0.99 1.00 0.56 1.00 0.90 0.72 0.82 0.86 0.67 0.67 0.48 0.63 0.45 0.92 0.75 0.48 0.79 0.41 0.67 0.55 0.53 0.72 0.37 0.55 0.62 0.78 0.82 0.64 0.73 0.49 0.54 0.74 0.57 0.55 0.78 0.60 0.87 0.63 322 1WLY 107 1WPA 80 1X3O 16 1XY1 8 1XY2 86 1Y6X 6 1YJO 46 1YZM 96 1Z21 139 1ZCE 75 1ZVA 469 2A50 233 2AGK 939 2AH1 191 2B0A 415 2BCM 35 2BF9 103 2BRF 225 2C71 109 2CE0 110 2CG7 534 2COV 2CWS 235 2D5W 1214 2DKO 253 565 2DPL 52 2DSX 439 2E10 2E3H 81 89 2EAQ 246 2EHP 2EHS 75 53 2ERW 390 2ETX 129 2FB6 2FG1 176 560 2FN9 85 2FQ3 2G69 99 68 2G7O 2G7S 206 0.67 0.79 0.72 1.00 1.00 0.62 1.00 0.95 0.82 0.88 0.94 0.70 0.69 0.49 0.69 0.50 0.97 0.76 0.56 0.86 0.63 0.70 0.66 0.54 0.79 0.41 0.78 0.65 0.82 0.86 0.71 0.81 0.84 0.57 0.76 0.62 0.62 0.82 0.76 0.95 0.66 0.64 0.69 0.66 0.96 1.00 0.53 1.00 0.90 0.73 0.83 0.85 0.63 0.65 0.47 0.60 0.41 0.73 0.73 0.38 0.79 0.44 0.64 0.55 0.52 0.72 0.36 0.50 0.59 0.71 0.77 0.65 0.73 0.41 0.54 0.66 0.56 0.49 0.76 0.65 0.91 0.60 0.54 0.66 0.62 0.81 0.91 0.50 1.00 0.86 0.64 0.81 0.83 0.41 0.55 0.33 0.48 0.35 0.89 0.72 0.23 0.71 0.30 0.57 0.40 0.41 0.68 0.24 0.41 0.43 0.56 0.77 0.52 0.69 0.31 0.47 0.65 0.52 0.41 0.68 0.47 0.76 0.54 0.62 0.56 0.64 0.89 0.91 0.52 1.00 0.84 0.69 0.78 0.81 0.58 0.63 0.46 0.59 0.39 0.71 0.72 0.30 0.77 0.33 0.64 0.52 0.52 0.69 0.33 0.36 0.57 0.69 0.76 0.62 0.71 0.28 0.51 0.63 0.54 0.46 0.75 0.45 0.82 0.59 0.62 0.70 0.66 0.97 1.00 0.56 1.00 0.87 0.70 0.84 0.85 0.64 0.65 0.45 0.59 0.46 0.94 0.74 0.45 0.77 0.32 0.66 0.59 0.52 0.75 0.35 0.54 0.60 0.66 0.81 0.63 0.75 0.62 0.54 0.71 0.55 0.51 0.78 0.59 0.89 0.63 0.63 0.71 0.65 0.87 1.00 0.59 1.00 0.88 0.64 0.85 0.92 0.67 0.65 0.46 0.62 0.40 0.78 0.74 0.42 0.80 0.36 0.67 0.54 0.52 0.75 0.35 0.56 0.61 0.76 0.81 0.65 0.74 0.60 0.56 0.69 0.58 0.55 0.79 0.66 0.88 0.63 0.62 0.52 0.60 0.66 0.95 0.49 1.00 0.72 0.63 0.77 0.78 0.60 0.64 0.45 0.58 0.39 0.65 0.71 0.33 0.73 0.31 0.64 0.52 0.52 0.69 0.32 0.30 0.58 0.69 0.72 0.62 0.71 0.26 0.53 0.63 0.52 0.47 0.75 0.50 0.79 0.58 0.59 0.61 0.62 0.73 0.99 0.50 1.00 0.82 0.61 0.83 0.84 0.54 0.61 0.42 0.50 0.39 0.70 0.74 0.29 0.75 0.29 0.63 0.53 0.49 0.72 0.30 0.37 0.51 0.62 0.78 0.58 0.72 0.33 0.52 0.67 0.54 0.44 0.75 0.42 0.85 0.59 85 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.78 0.53 0.73 0.31 0.81 0.60 0.49 0.58 0.71 0.74 0.64 0.78 0.57 0.77 0.62 0.88 0.86 0.76 0.82 0.69 0.77 0.80 0.75 0.67 0.38 0.46 1.00 1.00 0.43 0.66 0.69 0.63 0.63 0.44 0.72 0.61 0.65 0.51 0.52 0.67 0.81 150 2GKG 121 2GOM 140 2GXG 203 2GZQ 232 2HQK 237 2HYK 113 2I24 399 2I49 108 2IBL 61 2IGD 203 2IMF 87 2IP6 89 2IVY 244 2J32 203 2J9W 38 2JKU 112 2JLI 2JLJ 121 2MCM 112 36 2NLS 193 2NR7 2NUH 104 309 2O6X 2OA2 140 2OHW 257 377 2OKT 6 2OL9 4 2OLX 2PKT 93 98 2PLT 83 2PMR 2POF 428 122 2PPN 608 2PSF 193 2PTH 2Q4N 1208 3296 2Q52 107 2QJL 2R16 185 149 2R6Q 2RB8 93 0.76 0.44 0.69 0.24 0.68 0.43 0.45 0.41 0.65 0.61 0.59 0.64 0.34 0.73 0.51 0.83 0.85 0.74 0.75 0.49 0.71 0.75 0.65 0.60 0.27 0.22 1.00 1.00 0.36 0.52 0.65 0.44 0.44 0.37 0.61 0.55 0.63 0.41 0.45 0.62 0.74 0.65 0.42 0.68 0.34 0.74 0.55 0.40 0.51 0.67 0.64 0.56 0.58 0.45 0.68 0.59 0.65 0.70 0.65 0.77 0.32 0.72 0.63 0.73 0.49 0.32 0.36 1.00 1.00 0.35 0.59 0.62 0.55 0.59 0.44 0.70 0.55 0.56 0.46 0.45 0.68 0.75 0.83 0.64 0.79 0.60 0.83 0.63 0.69 0.62 0.75 0.84 0.71 0.82 0.69 0.85 0.70 0.95 0.90 0.80 0.85 0.88 0.79 0.85 0.78 0.70 0.48 0.49 1.00 1.00 0.69 0.72 0.80 0.66 0.74 0.53 0.77 0.68 0.70 0.63 0.66 0.76 0.84 0.74 0.42 0.71 0.38 0.70 0.51 0.40 0.43 0.66 0.61 0.59 0.66 0.35 0.73 0.55 0.85 0.82 0.71 0.78 0.61 0.74 0.73 0.70 0.55 0.29 0.31 1.00 1.00 0.40 0.57 0.59 0.48 0.51 0.41 0.65 0.58 0.62 0.42 0.46 0.66 0.78 86 0.77 0.47 0.74 0.45 0.80 0.59 0.47 0.54 0.69 0.67 0.61 0.72 0.43 0.77 0.59 0.89 0.87 0.78 0.80 0.75 0.75 0.77 0.74 0.63 0.35 0.43 1.00 1.00 0.44 0.66 0.69 0.62 0.57 0.43 0.71 0.65 0.65 0.45 0.50 0.71 0.81 0.71 0.52 0.72 0.40 0.79 0.58 0.44 0.53 0.71 0.72 0.65 0.66 0.53 0.72 0.60 0.75 0.81 0.75 0.80 0.66 0.75 0.74 0.75 0.64 0.39 0.37 1.00 1.00 0.39 0.63 0.68 0.56 0.61 0.45 0.71 0.62 0.66 0.52 0.51 0.72 0.78 0.78 0.45 0.72 0.48 0.80 0.59 0.48 0.56 0.70 0.74 0.60 0.73 0.48 0.77 0.64 0.88 0.85 0.74 0.81 0.76 0.76 0.81 0.75 0.60 0.35 0.40 1.00 1.00 0.55 0.67 0.68 0.60 0.63 0.45 0.73 0.59 0.64 0.50 0.51 0.70 0.80 0.67 0.47 0.68 0.29 0.76 0.54 0.40 0.49 0.68 0.66 0.59 0.64 0.42 0.68 0.59 0.60 0.78 0.71 0.77 0.47 0.73 0.66 0.73 0.63 0.34 0.33 1.00 1.00 0.36 0.59 0.65 0.54 0.57 0.42 0.69 0.57 0.57 0.49 0.46 0.65 0.76 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.63 0.74 0.62 0.47 0.54 0.40 0.74 0.58 0.49 0.70 0.68 0.60 0.60 0.64 0.71 0.49 0.73 0.47 0.64 0.60 0.69 0.51 0.63 0.70 0.71 0.69 0.74 0.59 0.66 0.43 0.63 0.78 0.67 0.57 0.37 0.57 0.70 0.61 0.51 0.59 0.59 2RE2 2RFR 2V9V 2VE8 2VH7 2VIM 2VPA 2VQ4 2VY8 2VYO 2W1V 2W2A 2W6A 2WJ5 2WUJ 2WW7 2WWE 2X1Q 2X25 2X3M 2X5Y 2X9Z 2XHF 2Y0T 2Y72 2Y7L 2Y9F 2YLB 2YNY 2ZCM 2ZU1 3A0M 3A7L 3AMC 3AUB 3B5O 3BA1 3BED 3BQX 3BZQ 3BZZ 0.70 0.80 0.66 0.55 0.68 0.52 0.78 0.64 0.58 0.77 0.77 0.65 0.64 0.79 0.79 0.60 0.83 0.54 0.73 0.69 0.71 0.54 0.67 0.83 0.78 0.72 0.78 0.63 0.75 0.49 0.68 0.84 0.78 0.64 0.53 0.66 0.72 0.67 0.54 0.69 0.68 0.61 0.74 0.56 0.44 0.63 0.41 0.73 0.56 0.46 0.72 0.70 0.59 0.54 0.68 0.65 0.50 0.75 0.46 0.64 0.64 0.64 0.42 0.60 0.68 0.72 0.69 0.71 0.52 0.63 0.40 0.63 0.72 0.59 0.54 0.41 0.63 0.68 0.56 0.51 0.61 0.61 0.59 0.72 0.55 0.40 0.42 0.24 0.68 0.35 0.38 0.59 0.56 0.54 0.52 0.59 0.67 0.33 0.61 0.34 0.57 0.57 0.53 0.38 0.55 0.56 0.66 0.58 0.58 0.34 0.56 0.24 0.45 0.61 0.62 0.37 0.32 0.46 0.60 0.44 0.41 0.47 0.45 0.60 0.59 0.50 0.43 0.49 0.31 0.73 0.46 0.42 0.68 0.64 0.57 0.56 0.53 0.59 0.43 0.58 0.37 0.57 0.57 0.58 0.39 0.62 0.64 0.70 0.69 0.70 0.49 0.63 0.32 0.58 0.73 0.54 0.51 0.32 0.55 0.65 0.53 0.46 0.55 0.50 0.64 0.73 0.60 0.46 0.59 0.38 0.73 0.56 0.47 0.68 0.69 0.60 0.59 0.63 0.69 0.44 0.71 0.48 0.62 0.61 0.67 0.50 0.62 0.69 0.71 0.68 0.75 0.55 0.63 0.42 0.61 0.74 0.69 0.54 0.36 0.55 0.67 0.61 0.52 0.57 0.60 0.65 0.66 0.51 0.48 0.54 0.33 0.75 0.54 0.46 0.70 0.67 0.59 0.59 0.55 0.68 0.48 0.71 0.44 0.61 0.61 0.63 0.42 0.62 0.68 0.71 0.70 0.72 0.52 0.67 0.39 0.61 0.76 0.61 0.53 0.41 0.58 0.66 0.55 0.50 0.62 0.63 249 166 149 515 94 114 217 106 162 207 551 350 139 110 103 161 120 240 167 175 185 266 310 111 183 323 149 418 326 348 360 146 128 614 124 249 312 262 136 99 103 0.59 0.57 0.48 0.41 0.49 0.28 0.71 0.49 0.42 0.66 0.63 0.56 0.52 0.52 0.52 0.42 0.62 0.39 0.57 0.55 0.59 0.38 0.56 0.61 0.69 0.68 0.69 0.49 0.62 0.35 0.58 0.70 0.45 0.50 0.26 0.56 0.65 0.53 0.48 0.55 0.58 0.57 0.68 0.53 0.42 0.52 0.29 0.72 0.43 0.38 0.64 0.63 0.57 0.51 0.59 0.62 0.40 0.62 0.38 0.56 0.60 0.60 0.37 0.58 0.60 0.69 0.66 0.65 0.46 0.60 0.34 0.53 0.68 0.52 0.47 0.31 0.52 0.64 0.53 0.47 0.50 0.51 87 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.34 0.65 0.55 0.88 0.47 0.78 0.69 0.67 0.70 0.45 0.71 0.48 0.82 0.60 0.49 0.81 0.72 0.74 0.63 0.56 0.62 0.68 1.00 0.58 0.53 0.61 0.66 0.64 0.62 0.79 0.55 0.61 0.80 0.78 0.67 0.77 0.82 0.41 0.56 0.48 0.73 567 3DRF 359 3DWV 268 3E5T 40 3E7R 150 3EUR 148 3F2Z 261 3F7E 185 3FCN 89 3FE7 250 3FKE 75 3FMY 48 3FOD 238 3FSO 257 3FTD 418 3G1S 3GBW 170 129 3GHJ 216 3HFO 3HHP 1314 170 3HNY 201 3HP4 3HWU 155 8 3HYD 200 3HZ8 3I2V 127 140 3I2Z 154 3I4O 145 3I7M 3IHS 173 168 3IVV 227 3K6Y 3KBE 166 190 3KGK 94 3KZD 219 3L41 3LAA 176 118 3LAX 846 3LG3 3LJI 270 244 3M3P 3M8J 178 0.32 0.63 0.52 0.86 0.46 0.78 0.65 0.65 0.65 0.42 0.69 0.47 0.82 0.57 0.51 0.78 0.71 0.72 0.62 0.56 0.61 0.69 1.00 0.59 0.58 0.59 0.64 0.62 0.67 0.80 0.53 0.61 0.80 0.72 0.62 0.66 0.81 0.38 0.53 0.44 0.72 0.30 0.62 0.50 0.82 0.42 0.77 0.63 0.59 0.63 0.36 0.66 0.35 0.81 0.52 0.45 0.71 0.68 0.69 0.59 0.49 0.56 0.61 1.00 0.54 0.48 0.57 0.63 0.58 0.60 0.74 0.50 0.60 0.79 0.66 0.59 0.60 0.78 0.37 0.52 0.35 0.70 0.33 0.66 0.56 0.81 0.47 0.78 0.65 0.64 0.67 0.49 0.70 0.55 0.77 0.59 0.51 0.79 0.72 0.75 0.62 0.57 0.64 0.63 1.00 0.56 0.61 0.56 0.60 0.58 0.60 0.83 0.52 0.62 0.81 0.77 0.66 0.76 0.83 0.40 0.58 0.58 0.73 0.22 0.54 0.38 0.73 0.31 0.69 0.47 0.54 0.54 0.34 0.66 0.38 0.77 0.41 0.38 0.51 0.65 0.65 0.52 0.42 0.43 0.50 1.00 0.52 0.40 0.56 0.56 0.49 0.58 0.68 0.42 0.52 0.68 0.47 0.57 0.69 0.77 0.32 0.45 0.25 0.67 0.32 0.67 0.55 0.81 0.49 0.76 0.66 0.60 0.69 0.47 0.71 0.48 0.82 0.60 0.44 0.77 0.71 0.75 0.61 0.59 0.60 0.60 1.00 0.58 0.57 0.58 0.63 0.58 0.62 0.80 0.53 0.62 0.79 0.79 0.61 0.70 0.81 0.40 0.53 0.47 0.74 0.29 0.62 0.51 0.77 0.43 0.76 0.64 0.59 0.60 0.36 0.64 0.33 0.74 0.52 0.45 0.74 0.67 0.63 0.59 0.52 0.54 0.61 1.00 0.53 0.53 0.54 0.59 0.55 0.54 0.76 0.49 0.60 0.79 0.68 0.60 0.56 0.76 0.37 0.52 0.40 0.69 0.38 0.69 0.60 0.96 0.53 0.84 0.71 0.75 0.76 0.52 0.79 0.82 0.85 0.67 0.68 0.84 0.81 0.82 0.68 0.64 0.72 0.81 1.00 0.66 0.66 0.65 0.73 0.71 0.74 0.89 0.60 0.65 0.84 0.83 0.71 0.80 0.86 0.41 0.62 0.69 0.75 0.27 0.62 0.51 0.78 0.39 0.75 0.61 0.56 0.58 0.40 0.66 0.42 0.77 0.49 0.41 0.64 0.65 0.70 0.57 0.47 0.57 0.57 1.00 0.55 0.51 0.52 0.58 0.53 0.58 0.75 0.48 0.57 0.77 0.55 0.59 0.68 0.80 0.36 0.47 0.40 0.69 88 Table 7.21 (cont’d) B & W B W 250 190 180 13 14 170 277 446 325 208 97 500 6 70 197 147 150 236 145 216 166 387 161 229 94 289 363 357 12 6 131 284 212 222 148 165 128 151 132 100 238 0.57 0.53 0.57 1.00 1.00 0.58 0.33 0.40 0.43 0.77 0.80 0.44 1.00 0.68 0.62 0.64 0.59 0.71 0.75 0.65 0.70 0.50 0.66 0.50 0.83 0.50 0.45 0.51 1.00 1.00 0.39 0.63 0.68 0.65 0.48 0.51 0.44 0.65 0.39 0.65 0.63 0.54 0.51 0.47 0.94 0.93 0.57 0.28 0.36 0.44 0.70 0.74 0.42 1.00 0.49 0.62 0.57 0.49 0.64 0.71 0.60 0.63 0.48 0.63 0.48 0.77 0.49 0.39 0.38 0.90 1.00 0.31 0.59 0.45 0.63 0.44 0.44 0.40 0.54 0.34 0.63 0.61 PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.56 0.51 0.56 1.00 1.00 0.59 0.39 0.41 0.45 0.82 0.80 0.47 1.00 0.67 0.64 0.64 0.58 0.72 0.75 0.65 0.66 0.51 0.72 0.57 0.79 0.50 0.45 0.50 1.00 1.00 0.42 0.64 0.65 0.69 0.47 0.56 0.47 0.59 0.37 0.67 0.67 3M9J 3M9Q 3MAB 3MD4 3MD5 3MEA 3MGN 3MRE 3N11 3NE0 3NGG 3NPV 3NVG 3NZL 3O0P 3O5P 3OBQ 3OQY 3P6J 3PD7 3PES 3PID 3PIW 3PKV 3PSM 3PTL 3PVE 3PZ9 3PZZ 3Q2X 3Q6L 3QDS 3QPA 3R6D 3R87 3RQ9 3RY0 3RZY 3S0A 3SD2 3SEB 0.56 0.53 0.55 1.00 1.00 0.64 0.30 0.40 0.45 0.77 0.78 0.44 1.00 0.66 0.64 0.60 0.58 0.70 0.73 0.65 0.70 0.53 0.70 0.53 0.83 0.50 0.44 0.42 1.00 1.00 0.37 0.65 0.47 0.65 0.48 0.52 0.47 0.65 0.38 0.69 0.68 0.56 0.52 0.56 1.00 1.00 0.58 0.32 0.38 0.45 0.79 0.81 0.44 1.00 0.61 0.64 0.60 0.59 0.66 0.73 0.66 0.72 0.49 0.67 0.52 0.78 0.50 0.45 0.45 1.00 1.00 0.44 0.62 0.66 0.66 0.47 0.47 0.45 0.65 0.43 0.67 0.66 0.59 0.59 0.62 1.00 1.00 0.68 0.47 0.45 0.51 0.84 0.85 0.50 1.00 0.84 0.71 0.71 0.66 0.73 0.81 0.72 0.79 0.56 0.78 0.63 0.88 0.53 0.59 0.57 1.00 1.00 0.56 0.69 0.78 0.73 0.55 0.61 0.54 0.84 0.52 0.77 0.77 0.53 0.50 0.52 0.91 0.98 0.57 0.26 0.32 0.42 0.75 0.72 0.40 1.00 0.53 0.59 0.55 0.46 0.63 0.69 0.62 0.58 0.44 0.60 0.43 0.79 0.49 0.37 0.36 0.95 1.00 0.33 0.59 0.45 0.62 0.41 0.41 0.40 0.59 0.33 0.64 0.62 89 0.39 0.46 0.56 0.93 0.94 0.48 0.16 0.24 0.38 0.70 0.74 0.36 1.00 0.59 0.53 0.53 0.53 0.60 0.61 0.60 0.52 0.37 0.56 0.35 0.68 0.43 0.41 0.34 0.94 1.00 0.34 0.51 0.59 0.53 0.40 0.39 0.41 0.57 0.32 0.56 0.61 0.53 0.50 0.51 0.99 0.92 0.57 0.29 0.35 0.44 0.76 0.76 0.43 1.00 0.55 0.62 0.56 0.56 0.64 0.71 0.61 0.60 0.46 0.63 0.50 0.76 0.49 0.42 0.39 0.80 1.00 0.37 0.59 0.59 0.64 0.45 0.45 0.42 0.51 0.31 0.63 0.62 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.40 0.70 0.44 0.57 0.53 0.76 0.68 0.54 0.49 0.65 0.54 0.75 0.48 0.80 0.63 0.53 0.59 0.68 0.53 0.89 0.63 0.64 0.58 0.67 0.64 0.36 0.52 0.56 0.37 0.40 0.64 0.60 0.60 0.69 0.55 0.47 0.57 0.62 0.73 0.83 0.66 126 3SED 157 3SO6 657 3SR3 254 3SUK 753 3SZH 209 3T0H 122 3T3K 145 3T47 3TDN 359 3TOW 155 226 3TUA 3TYS 78 276 3U6G 85 3U97 3UCI 72 637 3UR8 159 3US6 59 3V1A 3V75 294 193 3VN0 219 3VOR 3VUB 101 112 3VVV 163 3VZ9 3W4Q 826 213 3ZBD 157 3ZIT 241 3ZRX 3ZSL 165 74 3ZZP 226 3ZZY 4A02 169 182 4ACJ 189 4AE7 359 4AM1 4ANN 210 189 4AVR 56 4AXY 4B6G 559 292 4B9G 4DD5 412 0.55 0.78 0.48 0.64 0.57 0.78 0.72 0.78 0.53 0.74 0.63 0.86 0.54 0.84 0.72 0.60 0.67 0.95 0.57 0.90 0.70 0.71 0.79 0.70 0.68 0.74 0.59 0.63 0.54 0.47 0.69 0.66 0.75 0.74 0.59 0.57 0.70 0.76 0.75 0.85 0.71 0.29 0.69 0.41 0.48 0.51 0.69 0.62 0.45 0.42 0.61 0.50 0.48 0.37 0.73 0.53 0.51 0.49 0.53 0.46 0.87 0.52 0.56 0.48 0.55 0.59 0.28 0.39 0.52 0.22 0.27 0.63 0.52 0.58 0.61 0.52 0.43 0.51 0.48 0.69 0.80 0.59 0.39 0.67 0.45 0.53 0.53 0.76 0.66 0.54 0.47 0.66 0.57 0.78 0.44 0.78 0.67 0.52 0.60 0.74 0.50 0.87 0.64 0.65 0.64 0.65 0.61 0.36 0.51 0.56 0.39 0.40 0.65 0.61 0.55 0.69 0.57 0.50 0.57 0.55 0.70 0.81 0.60 0.45 0.71 0.44 0.54 0.53 0.73 0.66 0.54 0.43 0.65 0.55 0.58 0.39 0.78 0.64 0.53 0.56 0.57 0.49 0.88 0.58 0.60 0.64 0.64 0.60 0.47 0.47 0.56 0.39 0.30 0.67 0.56 0.59 0.67 0.54 0.48 0.57 0.60 0.71 0.82 0.63 0.28 0.63 0.43 0.46 0.51 0.72 0.55 0.45 0.43 0.58 0.52 0.67 0.39 0.77 0.48 0.49 0.55 0.51 0.48 0.86 0.56 0.60 0.55 0.60 0.56 0.24 0.36 0.49 0.28 0.19 0.63 0.49 0.55 0.63 0.53 0.42 0.53 0.47 0.67 0.78 0.57 90 0.38 0.73 0.45 0.54 0.52 0.74 0.68 0.62 0.44 0.66 0.55 0.73 0.45 0.80 0.57 0.55 0.62 0.77 0.53 0.88 0.63 0.61 0.65 0.63 0.61 0.34 0.47 0.53 0.40 0.31 0.64 0.57 0.61 0.65 0.53 0.48 0.59 0.63 0.72 0.81 0.63 0.33 0.55 0.39 0.47 0.45 0.68 0.48 0.43 0.38 0.53 0.45 0.70 0.27 0.77 0.55 0.45 0.53 0.39 0.47 0.79 0.53 0.61 0.57 0.60 0.47 0.25 0.47 0.46 0.31 0.12 0.59 0.31 0.51 0.63 0.46 0.36 0.49 0.47 0.60 0.71 0.51 0.33 0.64 0.43 0.49 0.52 0.71 0.60 0.47 0.43 0.60 0.52 0.46 0.35 0.76 0.56 0.52 0.46 0.46 0.47 0.88 0.55 0.57 0.49 0.60 0.60 0.31 0.41 0.52 0.24 0.22 0.63 0.51 0.59 0.65 0.53 0.42 0.53 0.50 0.69 0.82 0.61 Table 7.21 (cont’d) B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.61 0.74 0.69 0.46 0.73 0.60 0.65 0.83 0.62 0.39 0.60 0.64 1.00 0.80 0.72 0.45 0.53 0.60 0.72 0.70 0.69 0.42 0.67 0.62 0.70 0.62 0.66 0.77 0.64 0.53 0.63 0.67 0.61 0.56 0.50 0.76 423 4DKN 93 4DND 113 4DPZ 338 4DQ7 170 4DT4 313 4EK3 318 4ERY 96 4ES1 225 4EUG 459 4F01 143 4F3J 145 4FR9 5 4G14 155 4G2E 584 4G5X 676 4G6C 216 4G7X 183 4GA2 94 4GMQ 90 4GS3 278 4H4J 175 4H89 167 4HDE 4HJP 308 4HWM 129 99 377 268 162 305 202 207 550 145 103 45 0.54 0.64 0.64 0.44 0.68 0.56 0.59 0.73 0.60 0.34 0.59 0.58 1.00 0.61 0.64 0.44 0.31 0.53 0.66 0.64 0.64 0.37 0.52 0.55 0.60 0.59 0.61 0.62 0.56 0.44 0.61 0.58 0.57 0.42 0.46 0.48 0.56 0.75 0.67 0.49 0.72 0.58 0.64 0.78 0.62 0.37 0.61 0.70 1.00 0.74 0.74 0.46 0.47 0.57 0.72 0.68 0.75 0.39 0.69 0.59 0.68 0.62 0.63 0.77 0.66 0.50 0.64 0.66 0.61 0.49 0.54 0.76 0.42 0.61 0.62 0.29 0.70 0.53 0.52 0.57 0.51 0.22 0.47 0.58 1.00 0.68 0.64 0.24 0.51 0.49 0.67 0.51 0.57 0.35 0.59 0.58 0.68 0.57 0.63 0.75 0.59 0.38 0.58 0.59 0.38 0.27 0.43 0.49 0.55 0.64 0.64 0.40 0.70 0.59 0.59 0.74 0.58 0.34 0.58 0.57 1.00 0.61 0.67 0.43 0.37 0.53 0.63 0.66 0.66 0.40 0.51 0.58 0.63 0.61 0.61 0.66 0.57 0.47 0.60 0.60 0.58 0.45 0.48 0.49 4IL7 4J11 4J5O 4J5Q 4J78 4JG2 4JVU 4JYP 4KEF 5CYT 6RXN 0.59 0.75 0.68 0.45 0.76 0.58 0.61 0.76 0.61 0.38 0.57 0.65 1.00 0.75 0.71 0.43 0.53 0.55 0.73 0.65 0.67 0.39 0.63 0.62 0.69 0.63 0.66 0.77 0.65 0.48 0.63 0.67 0.59 0.52 0.53 0.74 0.58 0.66 0.70 0.46 0.74 0.63 0.60 0.77 0.61 0.37 0.63 0.62 1.00 0.64 0.69 0.44 0.47 0.56 0.77 0.68 0.67 0.50 0.55 0.61 0.66 0.63 0.63 0.76 0.63 0.48 0.63 0.64 0.60 0.49 0.52 0.63 0.63 0.82 0.79 0.51 0.78 0.65 0.67 0.86 0.67 0.47 0.66 0.78 1.00 0.85 0.80 0.50 0.61 0.70 0.84 0.74 0.82 0.67 0.75 0.65 0.71 0.65 0.68 0.82 0.75 0.56 0.74 0.75 0.69 0.65 0.65 0.86 0.52 0.67 0.65 0.37 0.70 0.55 0.59 0.69 0.54 0.32 0.52 0.63 1.00 0.59 0.69 0.40 0.41 0.52 0.68 0.60 0.63 0.33 0.59 0.57 0.66 0.60 0.62 0.71 0.57 0.43 0.61 0.57 0.52 0.40 0.49 0.59 91 Table 7.22: Persistent homology based Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained by boosted gradient (GBT), convolutional neural network (CNN), and consensus method (CON) for the Superset. PDB ID N GBT CNN CON PDB ID N GBT CNN CON 0.76 1ABA 1AHO 0.52 0.58 1AIE 0.74 1AKG 1ATG 0.69 0.68 1BGF 0.76 1BX7 0.7 1BYI 1CCR 0.69 0.36 1CYO 0.68 1DF4 1E5K 0.59 0.53 1ES5 0.69 1ETL 0.55 1ETM 0.52 1ETN 0.63 1EW4 0.5 1F8R 1FF4 0.58 0.49 1FK5 0.44 1GCO 0.52 1GK7 1GVD 0.63 0.46 1GXU 0.77 1H6V 1HJE 0.38 0.75 1I71 0.68 1IDP 0.66 1IFR 1K8U 0.54 1KMM 0.54 0.68 1KNG 1KR4 0.54 0.74 1KYC 0.67 1LR7 0.61 1MF7 1N7E 0.75 0.59 1NKD 2X5Y 2X9Z 2XHF 2Y0T 2Y72 2Y7L 2Y9F 2YLB 2YNY 2ZCM 2ZU1 3A0M 3A7L 3AMC 3AUB 3B5O 3BA1 3BED 3BQX 3BZQ 3BZZ 3DRF 3DWV 3E5T 3E7R 3EUR 3F2Z 3F7E 3FCN 3FE7 3FKE 3FMY 3FOD 3FSO 3FTD 3G1S 3GBW 3GHJ 0.68 0.52 0.57 0.71 0.71 0.66 0.75 0.66 0.71 0.38 0.66 0.6 0.61 0.64 0.5 0.55 0.59 0.53 0.55 0.53 0.51 0.45 0.55 0.48 0.66 0.42 0.76 0.69 0.65 0.55 0.51 0.67 0.57 0.75 0.68 0.57 0.74 0.56 87 66 31 16 231 124 51 238 109 88 57 188 260 12 12 12 106 1932 65 93 1044 39 56 89 2927 13 83 441 113 87 1499 144 107 15 73 194 95 59 0.73 0.66 0.75 0.27 0.55 0.61 0.74 0.61 0.55 0.64 0.85 0.74 0.65 0.37 0.37 0.07 0.59 0.52 0.61 0.59 0.47 0.77 0.71 0.67 0.26 0.84 0.53 0.62 0.7 0.57 0.64 0.5 0.56 0.62 0.62 0.65 0.63 0.7 0.71 0.66 0.7 0.32 0.51 0.58 0.74 0.5 0.6 0.7 0.85 0.72 0.62 0.82 0.63 0.48 0.6 0.54 0.66 0.6 0.47 0.9 0.55 0.68 0.34 0.75 0.58 0.6 0.64 0.6 0.51 0.52 0.71 0.69 0.61 0.66 0.58 0.7 185 266 310 111 183 323 149 418 326 348 360 146 128 614 124 249 312 262 136 99 103 567 359 268 40 150 148 261 185 89 250 75 48 238 257 418 170 129 0.76 0.49 0.58 0.71 0.65 0.66 0.74 0.67 0.65 0.33 0.66 0.53 0.44 0.68 0.5 0.49 0.62 0.45 0.56 0.45 0.38 0.51 0.63 0.44 0.72 0.36 0.73 0.65 0.63 0.52 0.51 0.65 0.45 0.72 0.64 0.6 0.74 0.58 0.74 0.7 0.78 0.29 0.55 0.62 0.76 0.6 0.59 0.68 0.88 0.74 0.66 0.55 0.43 0.13 0.61 0.54 0.64 0.61 0.5 0.82 0.69 0.69 0.34 0.9 0.56 0.63 0.7 0.59 0.63 0.51 0.63 0.66 0.64 0.67 0.65 0.72 92 Table 7.22 (cont’d) PDB ID N GBT CNN CON PDB ID N GBT CNN CON 0.54 1NLS 0.65 1NNX 0.62 1NOA 0.58 1NOT 0.62 1O06 0.99 1O08 0.48 1OPD 0.48 1P9I 1PEF 0.6 0.66 1PEN 0.49 1PMY 0.64 1PZ4 1Q9B 0.84 0.58 1QAU 0.6 1QKI 1QTO 0.78 0.6 1R29 0.64 1R7J 0.42 1RJU 0.74 1RRO 0.5 1SAU 0.58 1TGR 1TZV 0.57 0.74 1U06 0.56 1U7I 0.45 1U9C 0.65 1UHA 0.96 1UKU 0.63 1ULR 1UOY 0.09 0.57 1USE 0.52 1USM 1UTG 0.71 0.75 1V05 0.54 1V70 0.71 1VRZ 1W2L 0.57 0.71 1WBE 0.61 1WHI 1WLY 0.59 0.66 1WPA 1X3O 0.69 3HFO 3HHP 3HNY 3HP4 3HWU 3HYD 3HZ8 3I2V 3I2Z 3I4O 3I7M 3IHS 3IVV 3K6Y 3KBE 3KGK 3KZD 3L41 3LAA 3LAX 3LG3 3LJI 3M3P 3M8J 3M9J 3M9Q 3MAB 3MD4 3MEA 3MGN 3MRE 3N11 3NE0 3NGG 3NPV 3NVG 3NZL 3O0P 3O5P 3OBQ 3OQY 3P6J 238 93 113 13 22 221 85 29 18 16 123 113 44 112 3912 122 122 90 36 108 123 111 157 55 259 220 82 102 87 64 47 77 70 96 105 13 97 206 122 322 107 80 216 1314 170 201 155 8 200 127 140 154 145 173 168 227 166 190 94 219 176 118 846 270 244 178 250 190 180 13 170 277 446 325 208 97 500 6 70 197 147 150 236 145 0.51 0.61 0.61 0.56 0.58 0.99 0.45 0.44 0.6 0.62 0.44 0.61 0.83 0.56 0.56 0.76 0.55 0.61 0.35 0.74 0.45 0.57 0.53 0.72 0.56 0.4 0.63 0.88 0.62 0.08 0.54 0.51 0.67 0.72 0.51 0.51 0.56 0.68 0.6 0.59 0.66 0.66 0.57 0.65 0.6 0.58 0.65 0.74 0.54 0.52 0.6 0.72 0.57 0.62 0.82 0.57 0.64 0.8 0.67 0.64 0.49 0.69 0.51 0.55 0.59 0.71 0.52 0.48 0.63 0.96 0.63 0.09 0.54 0.47 0.73 0.75 0.5 0.63 0.58 0.72 0.59 0.57 0.59 0.72 0.55 0.78 0.55 0.69 0.94 0.49 0.42 0.73 0.79 0.36 0.59 0.72 0.59 0.51 0.34 0.53 0.56 0.71 0.6 0.4 0.54 0.66 0.74 0.44 0.71 0.57 0.71 0.75 0.54 0.72 0.05 0.73 0.62 0.6 0.63 0.54 0.43 0.6 0.59 0.64 0.65 0.41 0.57 0.79 0.53 0.96 0.93 0.47 0.34 0.73 0.82 0.74 0.7 0.8 0.85 0.59 0.45 0.48 0.59 0.77 0.46 0.45 0.66 0.69 0.77 0.4 0.74 0.59 0.74 0.76 0.53 0.7 0.32 0.72 0.64 0.64 0.62 0.34 0.5 0.56 0.56 0.62 0.65 0.43 0.57 0.8 0.56 0.8 0.95 0.49 0.41 0.74 0.82 0.44 0.65 0.77 0.67 0.57 0.38 0.54 0.59 0.75 0.58 0.43 0.59 0.69 0.76 0.45 0.74 0.59 0.73 0.77 0.56 0.76 0.12 0.75 0.66 0.63 0.64 0.54 0.47 0.6 0.6 0.66 0.67 0.44 93 PDB ID N GBT CNN CON PDB ID N GBT CNN CON 0.71 1XY1 0.57 1XY2 0.45 1Y6X 0.75 1YJO 0.53 1YZM 0.82 1Z21 0.55 1ZCE 0.63 1ZVA 2A50 0.63 0.93 2AGK 0.93 2AH1 0.52 2B0A 2BCM 0.63 0.59 2BF9 0.69 2BRF 2C71 0.48 0.46 2CE0 0.46 2CG7 0.66 2COV 0.54 2CWS 0.57 2D5W 0.63 2DKO 2DPL 0.55 0.66 2DSX 0.5 2OCT 0.6 2E3H 0.71 2EAQ 0.73 2EHP 0.78 2EHS 2ERW 0.57 0.49 2ETX 0.63 2FB6 2FG1 0.63 0.72 2FN9 0.52 2FQ3 0.71 2G69 2G7O 0.43 0.64 2G7S 0.64 2GKG 2GOM 0.55 0.53 2GXG 2GZQ 0.86 3PD7 3PES 3PID 3PIW 3PKV 3PSM 3PTL 3PVE 3PZ9 3PZZ 3Q2X 3Q6L 3QDS 3QPA 3R6D 3R87 3RQ9 3RY0 3RZY 3S0A 3SD2 3SEB 3SED 3SO6 3SR3 3SUK 3SZH 3T0H 3T3K 3T47 3TDN 3TOW 3TUA 3TYS 3U6G 3U97 3UCI 3UR8 3US6 3V1A 3V75 3VN0 0.75 0.82 0.46 -0.06 0.64 0.65 0.74 0.7 0.54 0.63 0.55 0.59 0.51 0.79 0.77 0.6 0.66 0.32 0.72 0.47 0.64 0.78 0.36 0.34 0.67 0.68 0.63 0.62 0.67 0.24 0.48 0.75 0.61 0.54 0.82 0.5 0.86 0.58 0.64 0.59 0.67 0.4 16 8 86 6 46 96 139 75 469 233 939 191 415 35 103 225 109 110 534 235 1214 253 565 52 439 81 89 246 75 53 390 129 176 560 85 99 68 206 150 121 140 203 0.82 0.79 0.5 0.7 0.69 0.68 0.7 0.7 0.6 0.67 0.48 0.62 0.5 0.48 0.72 0.57 0.6 0.3 0.74 0.61 0.54 0.78 0.41 0.34 0.64 0.65 0.57 0.66 0.62 0.12 0.49 0.73 0.57 0.57 0.77 0.62 0.72 0.55 0.56 0.69 0.65 0.34 216 166 387 161 229 94 289 363 357 12 6 131 284 212 222 148 165 128 151 132 100 238 126 157 657 254 753 209 122 145 359 155 226 78 276 85 72 637 159 59 294 193 0.68 0.56 0.48 0.72 0.52 0.8 0.53 0.61 0.61 0.94 0.95 0.47 0.62 0.55 0.65 0.47 0.46 0.41 0.65 0.53 0.56 0.63 0.53 0.65 0.5 0.58 0.69 0.71 0.76 0.51 0.47 0.61 0.62 0.66 0.53 0.67 0.42 0.64 0.61 0.57 0.49 0.85 0.7 0.54 0.3 0.77 0.51 0.77 0.55 0.61 0.58 0.85 0.72 0.53 0.62 0.67 0.74 0.45 0.4 0.49 0.62 0.49 0.56 0.6 0.52 0.65 0.46 0.59 0.67 0.7 0.76 0.62 0.49 0.63 0.56 0.74 0.46 0.72 0.42 0.6 0.63 0.27 0.56 0.85 Table 7.22 (cont’d) 0.83 0.81 0.51 0.57 0.7 0.69 0.73 0.71 0.6 0.67 0.54 0.63 0.52 0.58 0.75 0.6 0.64 0.32 0.75 0.6 0.59 0.8 0.42 0.36 0.67 0.67 0.61 0.67 0.65 0.16 0.51 0.75 0.59 0.58 0.81 0.6 0.8 0.58 0.59 0.69 0.68 0.37 94 PDB ID N GBT CNN CON PDB ID N GBT CNN CON 0.48 2HQK 0.59 2HYK 0.57 2I24 0.72 2I49 0.66 2IBL 0.55 2IGD 0.5 2IMF 0.58 2IP6 2IVY 0.53 0.42 2J32 0.68 2J9W 0.62 2JKU 2JLI 0.64 0.68 2JLJ 0.55 2MCM 2NLS 0.45 0.56 2NR7 0.62 2NUH 0.71 2O6X 0.76 2OA2 0.63 2OHW 0.68 2OKT 2OL9 0.65 0.72 2PKT 0.57 2PLT 0.69 2PMR 0.61 2POF 0.59 2PPN 0.71 2PSF 2PTH 0.58 0.33 2Q4N 0.62 2Q52 2QJL 0.61 0.04 2R16 0.76 2R6Q 0.74 2RB8 2RE2 0.58 0.45 2RFR 0.61 2V9V 2VE8 0.76 0.61 2VH7 2VIM 0.77 3VOR 3VUB 3VVV 3VZ9 3W4Q 3ZBD 3ZIT 3ZRX 3ZSL 3ZZP 3ZZY 4A02 4ACJ 4AE7 4AM1 4ANN 4AVR 4AXY 4B6G 4B9G 4DD5 4DKN 4DND 4DPZ 4DQ7 4DT4 4EK3 4ERY 4ES1 4EUG 4F01 4F3J 4FR9 4G14 4G2E 4G5X 4G6C 4G7X 4GA2 4GMQ 4GS3 4H4J 0.47 0.59 0.56 0.72 0.65 0.55 0.52 0.54 0.49 0.38 0.65 0.59 0.62 0.65 0.54 0.44 0.56 0.59 0.69 0.74 0.61 0.66 0.62 0.7 0.55 0.67 0.6 0.57 0.69 0.56 0.35 0.58 0.6 -0.28 0.75 0.71 0.56 0.45 0.61 0.76 0.61 0.75 232 237 113 399 108 61 203 87 89 244 203 38 112 121 112 36 193 104 309 140 257 377 6 93 98 83 428 122 608 193 1208 3296 107 185 149 93 249 166 149 515 94 114 0.77 0.65 0.44 0.65 0.65 0.57 0.53 0.6 0.51 0.75 0.64 0.57 0.62 0.71 0.71 0.23 0.78 0.72 0.76 0.54 0.56 0.42 0.94 0.01 0.52 0.6 0.62 0.64 0.42 0.69 0.44 0.55 0.54 0.44 0.63 0.67 0.65 0.61 0.53 0.55 0.75 0.44 0.48 0.55 0.57 0.64 0.6 0.49 0.42 0.6 0.57 0.48 0.65 0.65 0.66 0.7 0.52 0.43 0.53 0.65 0.68 0.74 0.62 0.64 0.67 0.74 0.6 0.69 0.58 0.59 0.69 0.55 0.26 0.63 0.56 0.45 0.72 0.73 0.54 0.4 0.53 0.67 0.56 0.74 219 101 112 163 826 213 157 241 165 74 226 169 182 189 359 210 189 56 559 292 412 423 93 113 338 170 313 318 96 225 459 143 145 5 155 584 676 216 183 94 90 278 Table 7.22 (cont’d) 0.77 0.63 0.46 0.61 0.66 0.56 0.58 0.66 0.45 0.79 0.58 0.71 0.68 0.71 0.77 0.47 0.76 0.56 0.76 0.55 0.46 0.42 0.85 -0.04 0.53 0.63 0.6 0.54 0.42 0.7 0.43 0.28 0.57 0.49 0.62 0.7 0.66 0.69 0.52 0.55 0.56 0.47 0.78 0.65 0.46 0.66 0.67 0.58 0.56 0.63 0.51 0.79 0.64 0.66 0.65 0.74 0.75 0.29 0.79 0.7 0.78 0.56 0.54 0.43 0.94 -0.01 0.54 0.63 0.66 0.63 0.43 0.71 0.45 0.52 0.56 0.46 0.65 0.7 0.68 0.66 0.54 0.58 0.73 0.47 95 Table 7.22 (cont’d) PDB ID N GBT CNN CON PDB ID N GBT CNN CON 0.56 2VPA 0.7 2VQ4 0.67 2VY8 0.57 2VYO 0.56 2W1V 0.58 2W2A 0.69 2W6A 0.74 2WJ5 2WUJ 0.64 0.73 2WW7 0.72 2WWE 0.65 2X1Q 2X25 0.51 0.39 2X3M 0.61 4H89 4HDE 4HJP 4HWM 4IL7 4J11 4J5O 4J5Q 4J78 4JG2 4JVU 4JYP 4KEF 5CYT 6RXN 0.58 0.72 0.6 0.6 0.55 0.49 0.68 0.74 0.6 0.72 0.7 0.67 0.53 0.34 0.6 0.53 0.66 0.68 0.54 0.55 0.58 0.67 0.72 0.63 0.72 0.7 0.59 0.48 0.39 0.59 175 167 308 129 99 377 268 162 305 202 207 550 145 103 45 217 106 162 207 551 350 139 110 103 161 120 240 167 175 0.66 0.7 0.77 0.6 0.64 0.59 0.71 0.45 0.35 0.36 0.49 0.44 0.5 0.64 0.75 0.75 0.68 0.63 0.69 0.6 0.69 0.53 0.54 0.35 0.55 0.5 0.57 0.65 0.71 0.72 0.76 0.63 0.66 0.61 0.72 0.48 0.45 0.37 0.53 0.47 0.55 0.65 96 (a) (b) (c) (d) Figure 7.6: A visual comparison of experimental B factors (a), WCG predicted B factors (b), and GNM predicted B factors (c) for the engineered cyan fluorescent protein, mTFP1 (PDB ID:2HQK). (d) The experimental (Exp) and predicted B factor values plotted per residue for PDB ID 2HQK. The GNM is for the GNM method with a cutoff distance of 7 ˚A. WCG is parametrized using CC, CN, CO kernels of the exponential type with fixed parameters κ = 1, and η = 3 ˚A. Figure originally published in Bramer et al [1]. 97 Table 7.1: Correlation coefficients for B factor prediction obtained by optimal FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for small-size structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer et al [1]. PDB ID N MWCG opFRI pfFRI GNM NMA 1AIE 31 1AKG 16 51 1BX7 1ETL 12 1ETM 12 12 1ETN 1FF4 65 1GK7 39 1GVD 52 13 1HJE 1KYC 15 13 1NOT 20 1O06 1OB4 16 16 1OB7 29 1P9I 18 1PEF 1PEN 16 43 1Q9B 36 1RJU 1U06 55 1UOY 64 40 1USE 21 1VRZ 1XY2 8 1YJO 6 1YZM 46 2DSX 52 35 2JKU 36 2NLS 2OL9 6 2OLX 4 6RXN 45 0.969 0.945 0.896 0.932 0.941 0.949 0.933 0.984 0.849 0.931 0.971 0.937 0.988 1.000 1.000 0.841 0.989 0.957 0.957 0.805 0.774 0.769 0.960 0.995 1.000 1.000 0.970 0.704 0.926 0.937 1.000 1.000 0.583 0.416 0.35 0.623 0.609 0.393 0.023 0.613 0.773 0.732 0.686 0.763 0.622 0.874 0.763 0.545 0.742 0.826 0.465 0.726 0.447 0.429 0.653 0.146 0.695 0.57 0.333 0.834 0.333 0.695 0.559 0.904 0.888 0.574 0.155 0.185 0.706 0.628 0.432 -0.274 0.674 0.821 0.591 0.616 0.754 0.523 0.844 0.750 0.652 0.625 0.808 0.270 0.656 0.431 0.434 0.671 -0.142 0.677 0.562 0.434 0.901 0.127 0.656 0.530 0.689 0.885 0.594 0.712 -0.229 0.868 0.355 0.027 -0.573 0.555 0.822 0.570 0.562 0.784 0.567 0.900 0.930 0.952 0.603 0.888 0.056 0.646 0.235 0.377 0.628 -0.399 -0.203 0.458 0.445 0.939 0.433 0.850 0.088 0.886 0.776 0.304 0.588 0.373 0.726 0.71 0.544 0.089 0.718 0.845 0.781 0.811 0.796 0.746 0.91 0.776 0.737 0.754 0.888 0.516 0.746 0.517 0.474 0.713 0.438 0.792 0.619 0.375 0.842 0.337 0.805 0.605 0.909 0.917 0.614 98 Table 7.2: Correlation coefficients for B factor prediction obtained by optimal FRI (opFRI), parameter free FRI (pfFRI) and Gaussian normal mode (GNM) for medium-size structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer et al [1]. PDB ID N MWCG opFRI pfFRI GNM NMA 87 1ABA 88 1CYO 93 1FK5 88 1GXU 83 1I71 73 1LR7 95 1N7E 93 1NNX 1NOA 113 1OPD 85 1QAU 112 1R7J 90 83 1UHA 1ULR 87 1USM 77 96 1V05 97 1W2L 80 1X3O 1Z21 96 75 1ZVA 36 2BF9 2BRF 100 99 2CE0 81 2E3H 89 2EAQ 2EHS 75 85 2FQ3 2IP6 87 2MCM 113 2NUH 104 93 2PKT 99 2PLT 2QJL 99 93 2RB8 99 3BZQ 5CYT 103 0.855 0.860 0.648 0.901 0.798 0.929 0.812 0.834 0.808 0.607 0.786 0.859 0.838 0.718 0.819 0.841 0.747 0.787 0.725 0.911 0.714 0.873 0.824 0.794 0.817 0.805 0.844 0.841 0.867 0.922 0.762 0.635 0.611 0.840 0.848 0.548 0.698 0.702 0.568 0.634 0.516 0.657 0.609 0.789 0.604 0.409 0.672 0.621 0.665 0.594 0.809 0.599 0.564 0.559 0.638 0.579 0.554 0.764 0.598 0.682 0.690 0.713 0.692 0.578 0.713 0.691 0.003 0.484 0.584 0.614 0.516 0.421 0.613 0.741 0.485 0.421 0.549 0.620 0.497 0.631 0.615 0.398 0.620 0.368 0.638 0.495 0.798 0.632 0.397 0.654 0.433 0.690 0.680 0.710 0.529 0.605 0.695 0.747 0.348 0.572 0.639 0.771 -0.193 0.509 0.594 0.517 0.466 0.331 0.057 0.774 0.362 0.581 0.380 0.795 0.385 0.517 0.485 0.796 0.533 0.078 0.308 0.223 0.780 0.389 0.432 0.453 0.289 0.579 0.521 0.535 0.628 0.632 0.688 0.565 0.508 0.826 0.643 0.685 -0.165 0.187 0.497 0.485 0.351 0.102 0.727 0.751 0.590 0.748 0.549 0.679 0.651 0.795 0.622 0.555 0.678 0.789 0.726 0.639 0.832 0.629 0.691 0.600 0.662 0.756 0.606 0.795 0.706 0.692 0.753 0.720 0.719 0.654 0.789 0.835 0.162 0.508 0.594 0.727 0.532 0.441 99 Table 7.3: Correlation coefficients for B factor prediction obtained by optimal FRI (opFRI), parameter free FRI (pfFRI), and Gaussian normal mode (GNM) for large-size structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al [4]. MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG results originally published in Bramer et al [1]. PDB ID N MWCG opFRI pfFRI GNM NMA 1AHO 64 1ATG 231 224 1BYI 111 1CCR 1E5K 188 106 1EW4 1IFR 113 1NKO 122 1NLS 238 1O08 221 1PMY 123 1PZ4 114 1QTO 122 1RRO 112 1UKU 102 1V70 105 204 1WBE 1WHI 122 1WPA 107 2AGK 233 205 2C71 2CG7 90 2CWS 227 2HQK 213 2HYK 238 2I24 113 203 2IMF 107 2PPN 2R16 176 2V9V 135 2VIM 104 2VPA 204 2VYO 210 238 3SEB 3VUB 101 0.768 0.843 0.600 0.741 0.848 0.804 0.875 0.831 0.799 0.516 0.701 0.921 0.809 0.748 0.765 0.854 0.767 0.804 0.797 0.821 0.773 0.738 0.756 0.897 0.728 0.672 0.798 0.673 0.640 0.697 0.859 0.757 0.777 0.879 0.852 0.625 0.578 0.491 0.512 0.732 0.644 0.689 0.535 0.530 0.333 0.654 0.781 0.520 0.372 0.661 0.492 0.577 0.539 0.577 0.694 0.649 0.539 0.640 0.809 0.575 0.498 0.625 0.638 0.495 0.548 0.393 0.755 0.648 0.712 0.610 0.562 0.497 0.552 0.351 0.859 0.547 0.637 0.368 0.523 0.309 0.685 0.843 0.334 0.529 0.742 0.162 0.549 0.270 0.417 0.512 0.560 0.379 0.696 0.365 0.510 0.494 0.514 0.668 0.618 0.528 0.212 0.576 0.729 0.826 0.607 0.339 0.154 0.133 0.530 0.620 0.447 0.330 0.322 0.385 0.616 0.702 0.844 0.725 0.546 0.720 0.285 0.574 0.414 0.380 0.514 0.584 0.308 0.524 0.743 0.593 0.441 0.401 0.468 0.411 0.594 0.221 0.594 0.739 0.720 0.365 0.698 0.613 0.543 0.580 0.746 0.650 0.697 0.619 0.669 0.562 0.671 0.828 0.543 0.435 0.665 0.622 0.591 0.601 0.634 0.705 0.658 0.551 0.647 0.824 0.585 0.593 0.652 0.677 0.582 0.555 0.413 0.763 0.675 0.801 0.625 100 Table 7.5: Average pearson correlation coefficients for Cα B factor prediction with FRI, GNM and NMA for three structure sets from Park et al. [4] and a superset of 364 structures. Results for opFRI, pfFRI are taken from Opron et al [3]. GNM and NMA values are taken from the coarse-grained (Cα) results reported in Park et al. [4] MWCG results are parameter free and use all C, N, and O to predict Cα. MWCG Results originally published in Bramer et al [1]. PDB set MWCG opFRI[3] pfFRI[3] Small Medium Large Superset 0.921 0.795 0.775 0.803 0.667 0.664 0.636 0.673 0.594 0.605 0.591 0.626 GNM NMA[4] 0.541[4] 0.550[4] 0.529[4] 0.565 [3] 0.480 0.482 0.494 NA Table 7.6: Pearson Correlation coefficients for Cα, non Cα carbon, nitrogen, oxygen, and sulfur using parameter free MWCG. Only 215 of the 364 proteins contain sulfur atoms. MWCG results originally published in Bramer et al [1]. Subset Average No of proteins Cα 0.803 364 Non Cα Carbon Nitrogen Oxygen Sulfur 0.903 215 0.789 364 0.744 364 0.812 364 Figure 7.7: CPU Efficiency comparison between GNM [3], RF, GBT, and CNN algorithms for MWCG B factor prediction. Execution times in seconds (s) versus number of residues. A set of 34 proteins, listed in Table 7.7, were used to evaluate the computational complexity. Result originally published in Bramer et al [2]. 101 Table 7.7: CPU execution times, in seconds, from efficiency comparison between GNM [3], RF, GBT, and CNN. Results originally reported in Bramer et al [2] N PDB 125 3P6J 132 3R87 3KBE 140 1TZV 141 149 2VY8 152 3ZIT 2FG1 157 2X3M 166 3LAA 169 3M8J 178 2GZQ 191 4G7X 194 2J9W 200 3TUA 210 1U9C 221 3ZRX 221 3K6Y 227 3OQY 234 244 2J32 249 3M3P 1U7I 267 4B9G 292 4ERY 318 3MGN 348 2ZU1 360 412 2Q52 448 4F01 3DRF 547 637 3UR8 2AH1 939 1GCO 1044 1F8R 1932 1H6V 2927 1QKI 3912 GNM[3] 0.141 0.156 0.187 0.203 0.219 0.234 0.265 0.312 0.327 0.375 0.468 0.499 0.546 0.655 0.733 0.718 0.765 0.873 0.967 1.029 1.263 1.669 2.122 2.902 3.136 4.696 6.178 11.154 17.409 61.012 75.801 654.127 2085.842 6365.668 RF 0.000455 0.000464 0.000505 0.000473 0.000486 0.000519 0.000518 0.000526 0.000514 0.000548 0.000647 0.000631 0.000554 0.000602 0.000592 0.000654 0.000619 0.000619 0.000625 0.000621 0.000647 0.000693 0.000775 0.000655 0.000816 0.000900 0.001016 0.001131 0.001307 0.001716 0.001936 0.003343 0.005205 0.006261 GBT 0.000358 0.000339 0.000384 0.000365 0.000359 0.000365 0.000403 0.000382 0.000405 0.000412 0.000454 0.000445 0.000424 0.000472 0.000486 0.000515 0.000490 0.000502 0.000556 0.000525 0.000551 0.000574 0.000619 0.000552 0.000675 0.000750 0.000878 0.001033 0.001136 0.001605 0.001814 0.003163 0.004739 0.006198 CNN 0.130 0.138 0.149 0.163 0.156 0.148 0.174 0.182 0.155 0.178 0.195 0.209 0.208 0.217 0.198 0.216 0.189 0.211 0.225 0.220 0.237 0.256 0.289 0.267 0.337 0.369 0.401 0.512 0.583 0.800 0.905 1.745 2.543 3.560 102 Table 7.8: Average Pearson correlation coefficients (PCC) both of all heavy atom and Cα only B factor predictions for small-, medium-, and large-sized protein sets along with the entire superset of the 364 protein dataset. Predictions of random forest (RF), gradient boosted tree (GBT), and convolutional neural network (CNN) are obtained by leave-one-protein- out (blind), while predictions of parameter-free flexibility-rigidity index (pfFRI), Gaussian network model (GNM) and normal mode analysis (NMA) were obtained via the least squares fitting of individual proteins. All machine learning models use all heavy atom information for training. MWCG machine learning B factor prediction results originally reported in Bramer et al [2]. Prediction Of Only Cα Protein Set RF GBT CNN pfFRI [3] GNM [3] NMA [3] Small Medium Large Superset 0.25 0.47 0.50 0.49 0.39 0.59 0.57 0.57 0.53 0.55 0.62 0.66 0.60 0.61 0.59 0.63 0.54 0.55 0.53 0.57 0.48 0.48 0.49 NA Prediction Of All Heavy Atom Protein Set RF GBT CNN pfFRI [3] GNM [3] NMA [3] Small Medium Large Superset 0.44 0.59 0.62 0.59 0.49 0.64 0.65 0.63 0.56 0.62 0.68 0.69 NA NA NA NA NA NA NA NA NA NA NA NA Figure 7.8: Individual feature importance for the MWCG random forest model averaged over the data set. Reported feature selection includes the use heavy atoms in the model. Figure originally published in Bramer et al [2]. 103 Table 7.9: Pearson correlation coefficients for cross protein heavy atom blind MWCG B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the small-sized protein set. Results reported use heavy atoms in both training and prediction. Originally published in Bramer et al [2]. PDB ID N 235 1AIE 1AKG 108 345 1BX7 1ETL 76 1ETM 80 77 1ETN 477 1FF4 1GK7 321 1GVD 401 73 1HJE 138 1KYC 1NOT 96 142 1O06 203 1P9I 1PEF 153 109 1PEN 303 1Q9B 1RJU 257 1U06 432 1UOY 452 290 1USE 1VRZ 66 62 1XY2 1YJO 55 1YZM 361 386 2DSX 229 2JKU 2NLS 269 2OL9 51 6RXN 345 RF GBT CNN 0.62 0.60 0.70 0.41 0.63 0.55 0.48 0.27 0.46 0.48 0.20 0.33 0.76 0.55 0.53 0.72 0.71 0.66 0.37 -0.07 0.32 0.43 -0.18 0.63 0.65 0.51 0.77 0.73 0.60 0.76 0.21 0.34 0.75 0.41 0.71 0.73 0.61 0.55 0.55 0.55 0.68 0.25 0.38 0.09 0.55 0.16 0.02 0.36 0.51 0.56 0.56 0.36 0.35 0.57 0.70 0.45 0.65 0.84 0.82 0.56 0.53 0.51 0.67 0.03 0.13 0.25 0.59 0.73 0.69 0.46 0.30 0.81 0.64 0.77 0.64 0.24 0.67 0.75 0.68 0.56 0.50 -0.17 0.27 0.12 0.60 0.44 0.63 0.49 0.51 0.71 104 Table 7.10: Pearson correlation coefficients for cross protein heavy atom blind MWCG B factor prediction obtained by random forest (RF), boosted gradient (GBT), and convolu- tional neural network (CNN) for the medium-sized protein set. Results reported use heavy atoms in both training and prediction. Originally published in Bramer et al [2]. PDB ID N 1ABA 728 1CYO 697 1FK5 626 1GXU 694 683 1I71 522 1LR7 1N7E 700 1NNX 674 1NOA 778 1OPD 642 1QAU 812 1R7J 729 1UHA 623 1ULR 677 1USM 631 17 1V05 746 1W2L 1X3O 622 771 1Z21 551 1ZVA 2BF9 287 735 2BRF 714 2CE0 2E3H 589 2EAQ 705 590 2EHS 721 2FQ3 2IP6 702 2MCM 735 2NUH 806 666 2PKT 2PLT 719 734 2QJL 2RB8 723 3BZQ 742 800 5CYT RF GBT CNN 0.74 0.73 0.76 0.66 0.63 0.62 0.65 0.66 0.66 0.57 0.71 0.53 0.71 0.62 0.69 0.53 0.57 0.52 0.62 0.55 0.57 0.57 0.65 0.71 0.75 0.74 0.68 0.69 0.59 0.67 0.60 -0.20 0.69 0.62 0.53 0.63 0.63 0.63 0.58 0.59 0.39 0.70 0.86 0.76 0.90 0.62 0.38 0.70 0.63 0.58 0.38 0.55 0.76 0.67 0.62 0.64 0.60 0.71 0.19 0.64 0.76 0.06 0.62 0.70 0.42 0.61 0.42 0.61 0.60 0.43 0.74 0.68 0.77 0.68 0.71 0.67 0.62 0.70 0.65 0.73 0.57 0.60 0.58 0.70 0.80 0.71 0.78 0.02 0.68 0.52 0.66 0.56 0.52 0.78 0.65 0.73 0.61 0.71 0.75 0.67 0.73 0.72 0.17 0.67 0.60 0.64 0.61 0.70 105 Table 7.11: Pearson correlation coefficients for cross protein heavy atom blind MWCG B fac- tor prediction obtained by random forest (RF), boosted gradient (GBT), and convolutional neural network (CNN) for the large-sized protein set. Results reported use heavy atoms in both training and prediction. Originally published in Bramer et al [2]. N PDB ID 1AHO 482 1ATG 1689 1BYI 1540 837 1CCR 1423 1E5K 863 1EW4 1IFR 878 1746 1NLS 1722 1O08 1PMY 937 874 1PZ4 934 1QTO 846 1RRO 1UKU 873 784 1V70 1542 1WBE 1WHI 937 1WPA 906 2AGK 1867 2C71 1446 536 2CG7 2CWS 1624 2HQK 1582 2HYK 1832 872 2I24 1564 2IMF 2PPN 701 1262 2R16 2V9V 986 2VIM 781 2VPA 1524 2VYO 1589 1948 3SEB 3VUB 787 RF GBT CNN 0.76 0.62 0.63 0.61 0.59 0.59 0.66 0.70 0.74 0.70 0.61 0.70 0.72 0.73 0.56 0.61 0.55 0.51 0.64 0.67 0.74 0.73 0.63 0.61 0.54 0.56 0.74 0.70 0.62 0.70 0.63 0.59 0.74 0.71 0.74 0.64 0.44 0.61 0.59 0.83 0.79 0.47 0.78 0.63 0.90 0.76 0.60 0.85 0.91 0.52 0.47 0.62 0.50 0.83 0.50 0.52 0.63 0.64 0.75 0.62 0.63 0.61 0.61 0.53 0.57 0.61 0.64 0.78 0.71 0.66 0.63 0.67 0.73 0.71 0.74 0.64 0.58 0.65 0.73 0.55 0.52 0.75 0.67 0.61 0.77 0.66 0.68 0.61 0.54 0.60 0.76 0.65 0.52 0.62 0.68 0.53 0.61 0.61 0.68 0.65 0.71 0.70 106 Figure 7.9: Average feature importance for the MWCG random forest model with the an- gle, secondary, MWCG, atom type, protein size, amino acid, and packing density features aggregated. Reported feature selection includes the use heavy atoms in the model. Figure originally published in Bramer et al [2]. Table 7.13: ASPH and ESPH average Pearson correlation coefficients Cα B factor predictions for small-, medium-, and large-sized protein sets along with the entire superset of the 364 protein dataset. Gradient boosted tree (GBT), convolutional neural network, and consen- sus(CON) results are obtained by leave-one-protein-out (blind). Predictions of parameter- free flexibility-rigidity index (pfFRI), Gaussian network model (GNM) and normal mode analysis (NMA) were obtained via the least squares fitting of individual proteins. CNN GBT CON pFRI GNM NMA 0.63 0.48 0.48 Medium 0.60 0.49 0.58 0.60 NA 0.59 0.61 0.59 0.63 0.62 0.61 0.58 0.61 0.58 0.58 0.59 0.59 Small Large Superset 0.54 0.55 0.53 0.57 107 Table 7.14: ASPH and ESPH Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained by boosted gradient (GBT), convolutional neural network (CNN), and consensus (CON) for the small-sized protein set. PDB ID N GBT CNN CON 0.78 0.29 0.76 0.55 0.43 0.13 0.64 0.82 0.69 0.9 0.66 0.8 0.95 0.74 0.82 0.44 0.67 0.58 0.45 0.76 0.12 0.54 0.81 0.57 0.7 0.36 0.66 0.29 0.94 0.61 1AIE 31 1AKG 16 51 1BX7 1ETL 12 1ETM 12 12 1ETN 65 1FF4 1GK7 39 1GVD 56 13 1HJE 15 1KYC 1NOT 13 22 1O06 29 1P9I 1PEF 18 16 1PEN 44 1Q9B 36 1RJU 1U06 55 1UOY 64 47 1USE 1VRZ 13 8 1XY2 1YJO 6 1YZM 46 2DSX 52 38 2JKU 36 2NLS 2OL9 6 6RXN 45 0.7 0.32 0.74 0.82 0.63 0.48 0.66 0.9 0.55 0.75 0.69 0.96 0.93 0.73 0.82 0.74 0.85 0.46 0.4 0.7 0.32 0.34 0.82 -0.06 0.64 0.34 0.71 0.47 0.85 0.6 0.75 0.27 0.74 0.37 0.37 0.07 0.61 0.77 0.71 0.84 0.62 0.69 0.94 0.73 0.79 0.36 0.59 0.6 0.44 0.72 0.05 0.54 0.79 0.7 0.69 0.34 0.57 0.23 0.94 0.59 108 Table 7.15: ASPH and ESPH Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained by boosted gradient (GBT), convolutional neural network (CNN), and consensus (CON) for the medium-sized protein set. PDB ID N GBT CNN CON 87 1ABA 0.74 88 0.68 1CYO 93 0.61 1FK5 89 0.69 1GXU 83 1I71 0.56 73 0.64 1LR7 0.65 95 1N7E 93 1NNX 0.8 0.56 1NOA 113 0.41 1OPD 85 0.57 1QAU 112 1R7J 90 0.75 0.73 82 1UHA 0.56 1ULR 87 1USM 77 0.75 0.63 96 1V05 0.47 97 1W2L 1X3O 80 0.44 0.69 96 1Z21 0.71 75 1ZVA 0.58 35 2BF9 2BRF 103 0.75 0.64 109 2CE0 0.67 81 2E3H 2EAQ 89 0.61 0.65 75 2EHS 0.81 85 2FQ3 0.63 2IP6 87 2MCM 112 0.75 0.7 2NUH 104 -0.01 93 2PKT 2PLT 98 0.54 0.56 107 2QJL 0.7 93 2RB8 0.49 99 3BZQ 5CYT 103 0.39 0.71 0.7 0.6 0.68 0.58 0.61 0.58 0.79 0.53 0.34 0.59 0.77 0.74 0.53 0.72 0.64 0.5 0.43 0.65 0.7 0.79 0.77 0.66 0.68 0.63 0.67 0.82 0.66 0.77 0.56 -0.04 0.53 0.57 0.7 0.53 0.34 0.73 0.64 0.59 0.67 0.53 0.62 0.63 0.78 0.55 0.42 0.51 0.71 0.71 0.54 0.73 0.6 0.43 0.41 0.68 0.7 0.48 0.72 0.6 0.65 0.57 0.62 0.77 0.6 0.71 0.72 0.01 0.52 0.54 0.67 0.45 0.39 109 Table 7.16: ASPH and ESPH Pearson correlation coefficients for cross protein Cα atom blind B factor prediction obtained boosted gradient (GBT), convolutional neural network (CNN), and consensus (CON) for the large-sized protein set. PDB ID N GBT CNN CON 0.7 1AHO 66 0.55 1ATG 231 0.6 238 1BYI 1CCR 109 0.59 0.74 188 1E5K 0.61 106 1EW4 1IFR 113 0.7 0.57 238 1NLS 0.49 1O08 221 0.65 1PMY 123 1PZ4 113 0.77 0.54 1QTO 122 0.43 1RRO 108 1UKU 102 0.77 0.64 105 1V70 0.6 206 1WBE 1WHI 122 0.6 0.67 1WPA 107 0.67 2AGK 233 0.6 225 2C71 2CG7 110 0.32 0.6 2CWS 235 0.78 2HQK 232 2HYK 237 0.65 0.46 113 2I24 0.56 203 2IMF 0.63 122 2PPN 2R16 185 0.46 0.54 2V9V 149 0.47 2VIM 114 2VPA 217 0.71 0.63 2VYO 207 0.63 238 3SEB 3VUB 101 0.59 0.66 0.51 0.5 0.6 0.72 0.6 0.64 0.57 0.47 0.7 0.8 0.48 0.45 0.76 0.62 0.56 0.56 0.65 0.63 0.6 0.32 0.47 0.77 0.63 0.46 0.58 0.54 0.49 0.52 0.47 0.75 0.63 0.6 0.55 0.66 0.55 0.61 0.55 0.74 0.59 0.7 0.55 0.49 0.59 0.72 0.53 0.4 0.75 0.63 0.6 0.59 0.65 0.67 0.57 0.3 0.61 0.77 0.65 0.44 0.53 0.64 0.44 0.53 0.44 0.66 0.6 0.63 0.59 110 Table 7.17: ASPH and ESPH Pearson correlation coefficients of least squares fitting Cα B factor prediction of small proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.96 0.87 0.82 1.00 1.00 1.00 0.76 0.94 0.66 1.00 1.00 1.00 0.98 0.89 0.96 0.94 0.71 0.72 0.55 0.73 0.64 1.00 1.00 1.00 0.90 0.55 0.88 0.69 1.00 0.76 31 1AIE 1AKG 16 51 1BX7 1ETL 12 1ETM 12 12 1ETN 65 1FF4 1GK7 39 1GVD 56 13 1HJE 1KYC 15 13 1NOT 22 1O06 29 1P9I 1PEF 18 16 1PEN 44 1Q9B 1RJU 36 1U06 55 1UOY 64 47 1USE 1VRZ 13 8 1XY2 1YJO 6 1YZM 46 52 2DSX 38 2JKU 2NLS 36 2OL9 6 6RXN 45 0.90 0.53 0.81 0.95 0.70 0.70 0.68 0.88 0.61 0.67 0.88 0.86 0.97 0.87 0.92 0.47 0.69 0.62 0.46 0.65 0.46 0.77 0.91 1.00 0.86 0.41 0.83 0.49 1.00 0.49 0.64 0.53 0.68 0.87 0.74 0.92 0.65 0.93 0.63 0.79 0.93 0.86 0.92 0.82 0.94 0.67 0.59 0.69 0.36 0.66 0.52 0.92 0.95 1.00 0.72 0.30 0.65 0.32 1.00 0.48 0.90 0.72 0.82 1.00 0.86 0.99 0.75 0.95 0.69 1.00 0.99 1.00 0.97 0.92 0.96 0.83 0.69 0.81 0.52 0.69 0.72 1.00 1.00 1.00 0.88 0.56 0.88 0.76 1.00 0.76 0.77 0.56 0.69 0.98 0.83 0.92 0.68 0.92 0.62 0.57 0.88 0.81 0.94 0.84 0.94 0.73 0.57 0.65 0.39 0.69 0.53 0.85 0.91 1.00 0.84 0.36 0.60 0.47 1.00 0.49 0.97 0.82 0.86 1.00 1.00 1.00 0.77 0.95 0.75 1.00 0.96 1.00 0.98 0.89 0.96 0.96 0.79 0.81 0.50 0.73 0.66 1.00 1.00 1.00 0.87 0.54 0.89 0.75 1.00 0.74 0.88 0.66 0.74 1.00 1.00 1.00 0.72 0.94 0.68 1.00 0.99 1.00 0.97 0.88 0.97 0.90 0.76 0.74 0.52 0.72 0.75 1.00 1.00 1.00 0.90 0.50 0.75 0.66 1.00 0.63 0.99 1.00 0.89 1.00 1.00 1.00 0.80 0.98 0.84 1.00 1.00 1.00 1.00 0.98 1.00 1.00 0.94 0.91 0.72 0.83 0.91 1.00 1.00 1.00 0.95 0.78 0.95 0.88 1.00 0.86 0.78 0.60 0.79 0.68 0.45 0.96 0.70 0.91 0.67 0.72 0.92 0.82 0.96 0.87 0.88 0.60 0.58 0.75 0.37 0.65 0.50 0.92 0.99 1.00 0.82 0.37 0.85 0.61 1.00 0.59 111 Table 7.18: ASPH and ESPH Pearson correlation coefficients of least squares fitting Cα B factor prediction of medium proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 1ABA 0.70 87 0.67 1CYO 88 0.55 1FK5 93 1GXU 0.77 89 0.59 1I71 83 0.58 1LR7 73 0.73 1N7E 95 1NNX 0.86 93 0.59 1NOA 113 0.36 1OPD 85 1QAU 112 0.58 0.86 90 1R7J 0.73 82 1UHA 1ULR 87 0.61 0.65 1USM 77 0.65 96 1V05 0.69 97 1W2L 1X3O 80 0.67 0.72 96 1Z21 0.86 75 1ZVA 0.92 2BF9 35 0.75 103 2BRF 0.79 109 2CE0 0.78 81 2E3H 2EAQ 89 0.82 0.73 75 2EHS 0.78 85 2FQ3 2IP6 87 0.78 0.82 2MCM 112 0.80 2NUH 104 0.43 93 2PKT 2PLT 98 0.66 107 2QJL 0.51 0.81 93 2RB8 3BZQ 99 0.59 0.50 103 5CYT 0.62 0.58 0.50 0.61 0.46 0.55 0.68 0.79 0.57 0.21 0.55 0.76 0.68 0.50 0.53 0.61 0.63 0.60 0.63 0.78 0.65 0.71 0.73 0.69 0.72 0.71 0.75 0.58 0.77 0.63 0.35 0.59 0.46 0.75 0.55 0.46 0.56 0.65 0.49 0.69 0.38 0.46 0.54 0.81 0.53 0.29 0.55 0.81 0.67 0.44 0.61 0.52 0.56 0.62 0.64 0.83 0.89 0.72 0.71 0.56 0.77 0.69 0.68 0.64 0.75 0.75 0.36 0.52 0.41 0.74 0.47 0.43 0.76 0.78 0.71 0.82 0.76 0.71 0.80 0.88 0.72 0.57 0.66 0.91 0.82 0.68 0.81 0.72 0.79 0.72 0.82 0.94 0.97 0.76 0.86 0.82 0.86 0.81 0.82 0.82 0.85 0.85 0.69 0.72 0.63 0.84 0.69 0.65 0.67 0.71 0.53 0.75 0.44 0.61 0.67 0.84 0.63 0.35 0.59 0.88 0.70 0.56 0.62 0.67 0.72 0.66 0.70 0.85 0.94 0.74 0.77 0.66 0.81 0.75 0.78 0.72 0.80 0.77 0.44 0.66 0.45 0.81 0.57 0.53 0.67 0.69 0.59 0.78 0.66 0.62 0.71 0.84 0.65 0.29 0.61 0.86 0.75 0.53 0.61 0.66 0.72 0.66 0.73 0.85 0.73 0.73 0.79 0.71 0.77 0.73 0.76 0.66 0.80 0.74 0.39 0.63 0.52 0.78 0.62 0.52 0.68 0.68 0.58 0.75 0.56 0.63 0.72 0.83 0.63 0.36 0.58 0.87 0.74 0.59 0.66 0.65 0.69 0.65 0.64 0.92 0.78 0.74 0.80 0.76 0.81 0.74 0.79 0.73 0.81 0.81 0.55 0.67 0.50 0.80 0.61 0.54 0.63 0.59 0.50 0.72 0.58 0.56 0.63 0.81 0.57 0.19 0.57 0.79 0.69 0.50 0.58 0.61 0.61 0.64 0.69 0.81 0.71 0.72 0.77 0.69 0.76 0.71 0.75 0.64 0.77 0.66 0.36 0.59 0.49 0.76 0.55 0.48 0.54 0.66 0.49 0.72 0.41 0.57 0.54 0.81 0.60 0.26 0.57 0.83 0.69 0.49 0.57 0.60 0.60 0.62 0.61 0.84 0.70 0.74 0.75 0.62 0.79 0.72 0.75 0.67 0.78 0.73 0.39 0.57 0.42 0.78 0.50 0.49 112 Table 7.19: ASPH and ESPH Pearson correlation coefficients of least squares fitting Cα B factor prediction of large proteins using 11˚A cutoff. Two Bottleneck (B) and Wasserstein (W) metrics using various kernel choices are included. B & W B W PDB ID N Exp Lor Both Exp Lor Both Exp Lor Both 0.75 1AHO 66 0.51 1ATG 231 1BYI 238 0.54 0.63 109 1CCR 0.69 188 1E5K 1EW4 106 0.62 0.62 113 1IFR 0.82 238 1NLS 0.48 1O08 221 1PMY 123 0.71 0.88 1PZ4 113 0.56 1QTO 122 0.45 1RRO 108 0.80 1UKU 102 0.62 105 1V70 0.48 1WBE 206 0.55 1WHI 122 0.70 1WPA 107 0.67 2AGK 233 2C71 225 0.48 0.41 110 2CG7 0.55 2CWS 235 0.81 2HQK 232 0.60 2HYK 237 113 2I24 0.49 0.64 203 2IMF 2PPN 122 0.63 0.52 185 2R16 0.62 2V9V 149 0.40 2VIM 114 2VPA 217 0.74 0.70 2VYO 207 0.67 238 3SEB 3VUB 101 0.64 0.73 0.47 0.46 0.56 0.67 0.51 0.54 0.65 0.42 0.59 0.74 0.46 0.23 0.80 0.60 0.38 0.44 0.52 0.64 0.33 0.31 0.52 0.74 0.55 0.40 0.56 0.59 0.45 0.48 0.28 0.71 0.66 0.61 0.56 0.88 0.61 0.58 0.71 0.74 0.73 0.73 0.86 0.56 0.76 0.93 0.65 0.56 0.84 0.75 0.63 0.63 0.79 0.69 0.56 0.63 0.66 0.83 0.63 0.69 0.71 0.74 0.66 0.66 0.52 0.78 0.77 0.77 0.71 0.78 0.50 0.51 0.66 0.68 0.60 0.59 0.78 0.48 0.70 0.82 0.59 0.35 0.81 0.65 0.47 0.55 0.69 0.65 0.38 0.44 0.55 0.79 0.58 0.44 0.65 0.61 0.51 0.51 0.33 0.75 0.70 0.66 0.60 0.75 0.50 0.50 0.65 0.67 0.58 0.65 0.81 0.46 0.71 0.88 0.59 0.39 0.80 0.64 0.53 0.57 0.70 0.65 0.45 0.32 0.59 0.80 0.59 0.47 0.61 0.57 0.50 0.60 0.38 0.73 0.68 0.63 0.65 0.79 0.53 0.49 0.65 0.68 0.55 0.65 0.83 0.50 0.67 0.89 0.53 0.45 0.80 0.66 0.55 0.57 0.71 0.65 0.42 0.36 0.54 0.80 0.59 0.48 0.60 0.63 0.51 0.56 0.41 0.73 0.72 0.68 0.61 0.53 0.38 0.44 0.43 0.63 0.55 0.47 0.80 0.37 0.68 0.85 0.55 0.33 0.74 0.51 0.36 0.34 0.66 0.55 0.23 0.30 0.40 0.68 0.43 0.45 0.59 0.44 0.45 0.55 0.24 0.68 0.59 0.61 0.61 0.65 0.48 0.48 0.58 0.67 0.55 0.53 0.72 0.45 0.69 0.76 0.52 0.19 0.80 0.58 0.42 0.43 0.56 0.63 0.30 0.33 0.52 0.76 0.54 0.40 0.59 0.57 0.46 0.50 0.31 0.73 0.68 0.62 0.57 0.72 0.45 0.41 0.53 0.66 0.52 0.56 0.75 0.44 0.62 0.86 0.48 0.31 0.78 0.56 0.43 0.42 0.61 0.61 0.29 0.29 0.53 0.70 0.51 0.40 0.59 0.51 0.46 0.53 0.29 0.72 0.64 0.62 0.60 113 Table 7.20: ASPH and ESPH average Pearson correlation coefficients of least squares fitting Cα B factor prediction of small, medium, large, and superset using 11˚A cutoff. Two Bot- tleneck (B) and Wasserstein (W) metrics using various kernel choices are included. Results for pFRI are taken from Opron et al[3]. GNM and NMA value are taken from the course grained Cα results reported in Park et al[4]. B & W B W Exp Lor Both Exp Lor Both Exp Lor Both pFRI GNM NMA 0.87 0.48 0.48 Medium 0.68 0.49 0.61 0.65 NA 0.74 0.60 0.51 0.55 0.54 0.55 0.53 0.57 0.73 0.63 0.55 0.59 0.74 0.62 0.54 0.58 0.72 0.61 0.54 0.58 0.84 0.68 0.60 0.64 0.94 0.78 0.70 0.73 Small Large Superset 0.85 0.69 0.61 0.65 0.86 0.69 0.62 0.66 0.59 0.61 0.59 0.63 114 Chapter 8 Discussion 8.1 Element Specific Heat Maps One useful application of the WCG method is the generation of element specific correlation heat maps. These maps provide a two dimensional visualization of important secondary and tertiary components of a given protein. Of course, maps of this kind are not new, for example see Opron et al for past use. However, the correlation maps provided here are the first of their kind. Previous correlation maps have only considered Cα intereactions. The maps provided here in Figures 7.1, 7.2, and 7.3 illustrate that the more general frame work of the WCG method is a valid and useful approach to furthering our understanding of protein structure and flexibility. The results presented here generate correlation maps using PDB ID 3TYS, 1AIE, and 3PSM to demonstrate the applicability of this approach. Protein PDB ID 1AIE consists of a random coil attached to a single alpha helix. The provided amine nitrogen and double bonded carboxyl correlation heat maps in Figure 7.1 clearly show the alpha helix as indicated by a thick band along the diagonal. This thick band corresponds to the rigidity imposed by the local interactions of nearby residues within the alpha helix. Protein PDB ID 1KGM is made up of various random coils and two anti-parallel beta sheets. The provided amine nitrogen and double bonded carboxyl correlation heat maps in 115 Figure 7.2 illustrate the interaction between residues in the anti parallel beta sheet with thick bands perpendicular to the main diagonal. These perpendicular bands correspond physically to the rigidity imposed by the interactions between the anti parallel beta sheets. Protein PDB ID 5IIV presents two parallel alpha helices. The provided amine nitrogen and double bonded carboxyl correlation heat maps in Figure 7.3 illustrate both the short range interactions within a single alpha helix and interactions between alpha helices. The two alpha helices are represented clearly as two distinct thick bands along the diagonal. Thick off diagonal bands illustrate interactions between alpha helices. The diagrams also illustrate the strength of each type of bonding. Bonding within an alpha helix is stronger and thus the main diagonal of the correlation heat maps is warmer than the off diagonal which corresponds to the weaker alpha helix to alpha helix interaction. 8.2 Hinge Detection Figures 7.5-7.6 show the B factor prediction comparison of protein PDB ID 1CLL, 1WHI, and 2HQK. Figure 7.5 clearly indicates GNM misses the hinge region present in calmodulin (PDB ID 1CLL) around residue 75. The WCG method clearly agree with the experimental results as indicated in the provided results. The MWCG method is also included in this re- sult to demonstrate the ability of the MWCG method to capture multiple scales and improve the overall B factor prediction. In the ribosomal protein L14 (PDB ID 1WHI) the results demonstrate that WCG provides a more reliable prediction compared to GNM as seen in Figure 7.4. In particular, GNM incorrectly predicts a large flexible region around the 75th residue that does not exist. Lastly, the engineered cyan fluorescent protein mTFP1 (PDB ID 2HQK) is also considered. Figure 7.6 shows that GNM predicts a highly flexible region 116 incorrectly around the 60th residue whereas the WCG method agrees with the experimental results of low flexibility in that region. The results presented in this work demonstrate that GNM consistently misses hinge regions and predicts hinge regions where none exist. Com- paratively, the WCG method is more accurate than GNM, and MWCG the most accurate of all the hinge prediction techniques studied here. 8.3 Fitting Models 8.3.1 MWCG The MWCG method is used to predict B factor of a large and diverse set of over 300 proteins. Results for Cα B factor prediction are provided in Tables 7.1-7.6, and 7.4. Results of protein subsets of small, medium, and large sizes are considered in 7.1-7.6 and their overall average Pearson correlation coefficients are provided in 7.5. In all cases of Cα B factor prediction, the MWCG method outperforms previously existing methods in terms of average Pearson correlation coefficient. The MWCG method is notable in that, averaged over the superset of proteins, it provides a 19% improvement over the best existing method opFRI and a 42% improvement over GNM. Table 7.6 provides results for B factor prediction of other heavy elements such as non Cα carbon, nitrogen, oxygen, and sulfur atoms. This is also notable because to date no other previous method has included B factor prediction of elements other than Cα. These predictions also have a similar average correlation coefficient to the Cα results indicating the robustness of the model. 117 8.3.2 ASPH & ESPH ASPH and ESPH methods are used for Cα only B factor using the same protein dataset as MWCG. Results for Cα only B factor prediction are provided in Tables 7.17-7.20, and 7.21. Results of protein subsets of small, medium, and large sizes are considered in 7.17- 7.19 and their overall average Pearson correlation coefficients are provided in 7.20. Overall fitting methods using the various ASPH and ESPH features performed similarly. The best results came from using features generated by both exponential and lorentz kernels and both Bottleneck and Wasserstein distances. Using both kernels, ASPH and ESPH distance metrics resulted in an average correlation coefficient of 0.73 for the superset. 8.4 Machine Learning Models 8.4.1 MWCG Among the three methods considered for MWCG based B factor prediction, the convolutional neural network method outperforms the boosted gradient tree and random forest by 10% and 17%, respectively. As reported in Table 7.12, no machine learning method outperforms any other method for every protein. Compared to the deep CNN, the ensemble methods do not require as much parameter tuning. The random forest method is the simplest and most robust with only one hyper- parameter. Overall the boosted gradient tree method outperforms the random forest for MWCG based B factor prediction for all data sets. To balance cost, time, and quality, 500 trees were used for the random forest and 1000 trees were used for the boosted gradient method for the MWCG B factor prediction. It’s possible that this may account for the perfor- 118 mance difference between the boosted gradient tree method compared to the random forest. Ensemble methods are generally robust to overfitting, and adding more features would likely improve their results [42]. Moreover, boosted gradient trees use several hyperparameters so the model could be improved by further tuning these hyperparameters. The image-like heat map data used in the deep CNN provides additional data compared to the dataset used for the ensemble methods. This very likely explains the improved performance as compared to the ensemble methods. Providing more refined images, and other novel image types, would undoubtedly improve the results further but would come at a computational cost. Applying several dropout layers prevents the CNNs from overfitting the data. Much like the GBT the CNN contains several hyperparameters. Thus, the CNN model would benefit from more careful parameter tuning as well. Incorporating a larger dataset, more features, and higher resolutions images would also improve the CNN performance. In general the results of the machine learning methods generated in this work could be further improved by refining features, exploring new features, and further tuning the hyperparameters. 8.4.2 ASPH & ESPH Machine learning results for ASPH and ESPH can be found in 7.14-7.16 and 7.22. Overall both GBT and CNN algorithms perform similarly. As expected, the CNN method out performs the GBT with average correlation coefficients over the superset of 0.60 and 0.59 respectively. The consensus method improves upon both results with an average Pearson correlation coefficient of 0.61 over the superset. Table 7.13 shows that the blind prediction machine learning models perform better than fitting models GNM and NMA and similar to the pFRI fitting model. 119 Chapter 9 Conclusions and Future Directions 9.1 Conclusions Protein flexibility and dynamics are important tools for understanding the function, con- formational states, folding, binding, and molecular mechanisms of proteins. It is a well known paradigm that protein flexibility strongly correlates with protein function. Protein interactions span multiple interactions scales and their complexity and large number of de- grees of freedom make quantitative understanding a great challenge. Molecular dynamics offers a useful tool but is limited in scope due to the computational cost involved with large biomolecules or long time scales. Several successful time-independent methods have been developed that provide good B factor analysis at low computational cost. Examples include NMA [12, 10, 13, 11], ENM [15], GNM [18, 19, 50], and FRI methods [51, 3, 24, 22]. However, none of these methods can blindly predict cross protein B factors of an unknown protein. The guiding principle of this work is that intrinsic physics lies in a low-dimensional space that is embedded in a high dimensional data space. Based on this, the results of this work introduces graph theory based MWCG, ASPH, and ESPH[52, 53]. Moreover these methods are combined with advanced machine learning techniques to provide models that provide efficient and accurate tools for protein flexibility analysis and prediction. This work also outlines methods to successfully blindly predict cross-protein B factors. 120 First, this work introduces WCGs that efficiently reduce the protein structural complex- ity while accurately predicting protein flexibility. This work shows that weighted colored graphs are a useful and novel tool for investigating flexibility and dynamics of proteins. In section 7.2 the WCG approach was compared to experimental and GNM predicted B factor results. Nitrogen-nitrogen and oxygen-oxygen element specific correlation heat maps were constructed for several proteins using the WCG technique described in this work. As seen in Figures 7.1-7.3 these maps provide a clear picture of the various secondary and tertiary structures presented in these proteins. The correlation heat maps presented demonstrate a fresh approach to representing protein flexibility and interactions visually. Previous approaches only use data from Cα atoms whereas the WCG method allows previously unused protein PDB data to be utilized. This provides a viable alternative method and makes such heat maps more robust as multiple heat maps can be constructed for each residue using different elements. Using double bonded carboxyl oxygens and amine nitrogens the work presented here demonstrates generality of the WCG approach. The WCG method introduced a unique opportunity for alternative approaches and allows for redundancy since the method is able to make use of non Cα atoms. This method can also include hydrogen atoms without any modifications, which may prove useful in future work as empirical methods inevitably become more accurate and robust. Several proteins are tested to demonstrate the efficacy of WCGs to predict hinge regions of proteins. In this study we use proteins with well known flexibility to compare the ability of the GNM and the WCG method to predict flexible residues. Figures 7.5-7.6 show the B factor prediction comparison of protein PDB ID 1CLL, 1WHI, and 2HQK. The examples provided in this work demonstrate that WCG is an improvement upon the commonly used method of 121 GNM for hinge prediction. In proteins calmodulin (PDB ID 1CLL) and ribosomal protein (PDB ID 1WHI) the results show that prediction using GNM completely misses highly flexible hinge regions. The results using the engineered cyan fluorescent protein (PDB ID 2HQK) show that GNM incorrectly predicts a highly flexible region where none exists. In all the cases tested in this work the WCG method was able to correctly capture all the hinge regions and did not identify any false positive hinge regions. For further comparison the MWCG flexibility prediction is included in the calmodulin protein (PDB ID 1CLL) results seen in Figure 7.5. This result highlights the predictive power of the multiscale information that the MWCG method captures as seen with the excellent agreement with experimental results. Overall these results demonstrate the WCG and MWCG methods are superior tools to the commonly used GNM method in terms of the accuracy of hinge prediction for the provided examples. The WCG method is used to predict B factor of a large and diverse set of over 300 proteins. Results for Cα B factor prediction are provided in Tables 7.1-7.6, and 7.4. Results of protein subsets of small, medium, and large sizes are considered in 7.1-7.6 and their overall B factor averages are provided in 7.5. In all cases of Cα B factor prediction, the MWCG method outperforms previously existing methods. The MWCG method is notable in that, averaged over the superset of proteins, it provides a 19% improvement over the best existing method opFRI and a 42% improvement over GNM. Table 7.6 provides results for B factor prediction of other heavy elements such as non Cα carbon, nitrogen, oxygen, and sulfur atoms. This is also notable because to date no other previous method has included B factor prediction of elements other than Cα. These predictions also have a similar average correlation coefficient to the Cα only prediction results indicating the robustness of the model. 122 To capture the multiscale protein interactions that occur over several characteristic length scales multiscale weighted colored graphs are constructed. The MWCGs are successfully used to construct models by linear least square fitting and a variety of machine learning techniques. Several machine learning approaches were considered in this work for blind cross protein B factor prediction. In particular this work considered random forest, gradient boosting, and a deep convolutional neural network machine learning models for MWCG based B factor prediction. By using MWCG based features along with several local and global features this work uses advanced machine learning approaches to blindly predict protein flexibility and B factors. Moreover, unlike previous methods, this approach is able to predict B factors of any element the user desires provided 3D spatial coordinates are available. MWCG based images were engineered for the deep convolutional neural network. Overall the MWCG feature based deep convolutional neural networks provide the strongest predictive power in terms of B factor prediction which is likely accounted for by the additional data provided by the MWCG based image-like heat map features. Several local, semi-local, and global features were included as machine learning features. MWCGs capture local structural properties corresponding to the intrinsic flexibility of the given protein. X-ray crystallography resolution and R-value provide global structures that allow the algorithms the ability to compare B factor across proteins. Packing density is a semi-local feature that captures several protein interaction scales. Ensemble methods include relative feature importance used in the model which is pro- vided in Figures 7.8 and 7.9. As seen in the figures both local and global features play an important role in the model. Overall the most meaningful global features are protein resolution and surface accessible area. On average, the most meaningful local feature of the 123 random forest model was the set of 9 MWCG features with the carbon-carbon kernel having most significance. Machine learning models often suffer from the black box problem. That is, once the model has trained, the user is unable to explicitly see how the model is determining predictions. Feature importance provides important insight into the underlying mechanics of the machine learning model. The feature importance results are in good agreement with our expectations within the context of protein flexibility analysis. Both MWCG based fitting and machine learning B factor prediction demonstrate that MWCG based B factor prediction is more accurate in terms of Pearson correlation coefficient than previous fitting based methods such as GNM and NMA. For B factor prediction of Cα only atoms, the fitting model provided a 20% improvement over the next best B factor prediction method, opFRI, with a Pearson correlation coefficient of 0.80 averaged over the superset. The MWCG based deep CNN also outperformed opFRI, with a Pearson correlation coefficient of 0.66 averaged over the superset. The working hypothesis is explored further by creating a B factor predictor using tools from algebraic topology. To the author’s knowledge, this is the first time persistent homol- ogy has been used to predict the B factor of atoms in proteins. This is a novel approach because topology is a global property and on its own cannot be directly used to describe local atomic information. This unique approach allows a localized topological representation to be constructed using a global mathematical tool. This approach accounts for multiple spatial interaction scales and element specific interactions. These results demonstrate that this is an accurate and robust topological approach. This work introduces atom-specific topology and atom-specific persistent homology to construct localized topological representations for individual atoms from global topological tools. This approach works by constructing two conjugated sets of atoms. The first set 124 of atoms is centered around the given atom of interest while the other set is identical but excludes the atom of interest. To embed biological information into atom-specific persistent homology, element-specific selections are implemented. The topological distance between topological invariants generated from these conjugated sets of atoms provides a local topo- logical representation of the atom of interest. To estimate the topological distances between conjugated barcodes both Bottleneck and Wasserstein metrics are utilized. For topolog- ical barcode generation, the Vietoris-Rips complex is employed. Atom-specific persistent homology features are generated using several element-specific interactions, kernel choices, parametrizations, and barcode distance metrics. In this work ASPH and ESPH B factor prediction results are validated in two ways. First, topological features are used to fit protein B factors using linear least squares. The fitting model outperformed previous fitting models with an average Pearson correlation coefficient of 0.73 over the superset of proteins. These results show that the method is comparable to existing commonly used methods such as GNM and NMA. Secondly, ASPH and ESPH features are used in machine learning models to blindly predict protein B factors of Cα atoms. Two machine learning models are used, a gradient boosted tree (GBT) and deep convolutional neural network (CNN). Additionally the Cα predictions from the two models are averaged to generate a more robust consensus model. In addition to the generated topological features, a variety of local and global features were included. The blind prediction consensus model provided the best results, outperforming both GNM and NMA fitting models and produced results similar to those of the pFRI fitting model. These results demonstrate that this is a robust model that is more accurate than existing GNM and NMA predictions. There are many other machine learning approaches available and testing those approaches is certainly worth exploring. Moreover, these results 125 could easily be improved by including a larger dataset, fine tuning parameters, and exploring different machine learning algorithms. The proposed methods are tested and validated using a set of over 300 diverse proteins, or more than 600,000 B factors. For all machine learning models, a leave-one-protein-out approach is used to blindly predict protein B factors of all heavy atoms as well as only Cα atoms. The work presented in this study is a first step using the recent advances in MWCG, ASPH, and ESPH based machine learning techniques to blindly predict cross protein B factors. These approaches are particularly notable compared to previous methods because of their ability to blindly predict protein B factors across proteins. The MWCG, ASPH, and ESPH based machine learning results provided in this work are efficient and accurate compared to previous methods. This work provides clear evidence that machine learning approaches are useful and efficient for protein flexibility analysis. Nonetheless, many new and compelling features can be implemented in future work. With- out a doubt these results can be improved by experimenting with various advanced machine learning approaches, larger datasets, and designing better mathematical descriptions of in- trinsic flexibility. The methods presented here can be applied to a variety of interesting applications related to molecules and biomolecules. Examples include allosteric site detection, hinge detection, hot spot identification, chemical shift analysis, atomic spectroscopy interpretation, and pre- diction of protein folding stability changes upon mutation. More generally these methods may be amenable to problems outside chemistry and biology such as network dynamics and social network centrality measure. 126 9.2 Future Directions This work provides a rich basis for further exploration of mathematical approaches to protein analysis and flexibility. The following sections provide several areas of potential future research based on this work. 9.2.1 Software Development To provide awareness and accessibility of the methods provided here, an online tool could be developed consisting of PDB files from the PDB database or uploads of a compatible structural file. Ideally users would be able to do any of the following: • Choose MWCG based or ASPH/ESPH based models. • Choose the number of kernels, type of kernel, and kernel parameterization. • Choose element specific pairs and element specific heat maps. • Choose machine learning algorithm and training features. • Predict hinge regions based on user defined B factor cutoff. • Predict the B factor of any atoms in a protein. • Provide an interactive B factor colored 3D representation of the protein with down- loadable image or gif files. To host the website and run the required computations a server or cloud resources would be required. 127 9.2.2 Inclusion of other datasets The protein databank currently contains over 138,000 protein structures whereas the work presented here used around 350 protein structures. The machine learning models presented here would undoubtedly benefit by including a larger dataset. More data would provide better validation and a more general framework for protein B factor prediction. However, there are enough proteins available at this time that even restricting to only Cα atoms would result in a data set of roughly 116,610,000 data points. So care would need to be taken to balance the amount of data used with the computational cost of training such models. In addition to using larger datasets for training, models could be trained using more specific data. For example datasets could be selected based on specific types of proteins such as enzymes, structural proteins, signaling proteins, regulatory proteins, transport proteins, sensory proteins, motor proteins, defense proteins, and storage proteins. 9.2.3 Specific applications in drug design and docking pose The applications provided here demonstrate the validity of the proposed method. Future work could apply the method to a variety of interesting problems. Drug design is an im- portant and open problem where accurate and robust prediction of protein flexibility and dynamics are essential. Docking pose is another area where reliable B factor prediction may improve existing methods. Molecular docking programs common modeling tools for predicting ligand binding modes and structure based virtual screening. 128 9.2.4 Other general approaches These methods could easily be developed to predict anisotropic B factors of a protein. Pairing this method with a local or global Hessian would allow the Hessian matrix to be local or global and by definition adaptive depending on the physical problem. Moreover, the methods provided here could be used for the following related work. • Integrate these methods to include genetic sequence information for a more compre- hensive model. • Predict protein flexibility and dynamics from mutations. • Investigation these tools as a general centrality measure in areas outside of biology. • Investigation related topological data analysis techniques to understand proteins and protein networks. • Test other advanced learning approaches such as reinforcement learning and long short term memory algorithms. 129 BIBLIOGRAPHY 130 BIBLIOGRAPHY [1] D. Bramer and G.-W. Wei, “Multiscale weighted colored graphs for protein flexibility and rigidity analysis,” The Journal of Chemical Physics, vol. 148, no. 5, p. 054103, 2018. [2] D. Bramer and G.-W. Wei, “Blind prediction of protein b-factor and flexibility,” The Journal of Chemical Physics, vol. 149, no. 13, p. 134107, 2018. [3] K. Opron, K. L. Xia, and G. W. Wei, “Fast and anisotropic flexibility-rigidity index for protein flexibility and fluctuation analysis,” Journal of Chemical Physics, vol. 140, p. 234105, 2014. [4] J. K. Park, R. Jernigan, and Z. Wu, “Coarse grained normal mode analysis vs. refined gaussian network model for protein residue-level structural fluctuations,” Bulletin of Mathematical Biology, vol. 75, pp. 124 –160, 2013. [5] U. Emekli, S. Dina, H. Wolfson, R. Nussinov, and T. Haliloglu, “HingeProt: automated prediction of hinges in protein structures.,” Proteins, vol. 70, no. 4, pp. 1219–1227, 2008. [6] S. Flores and M. Gerstein, “FlexOracle: predicting flexible hinges by identification of stable domains,” BMC bioinformatics, vol. 8, no. 1, 2007. [7] S. Flores, L. Lu, J. Yang, N. Carriero, and M. Gerstein, “Hinge atlas: relating protein sequence to sites of structural flexibility.,” BMC bioinformatics, vol. 8, 2007. [8] K. S. Keating, S. C. Flores, M. B. Gerstein, and L. A. Kuhn, “StoneHinge: hinge prediction by network analysis of individual protein structures,” Protein Science, vol. 18, no. 2, pp. 359–371, 2009. [9] M. Shatsky, R. Nussinov, and H. J. Wolfson, “FlexProt: alignment of flexible protein structures without a predefinition of hinge regions,” Journal of Computational Biology, vol. 11, no. 1, pp. 83–8106, 2004. [10] N. Go, T. Noguti, and T. Nishikawa, “Dynamics of a small globular protein in terms of low-frequency vibrational modes,” Proc. Natl. Acad. Sci., vol. 80, pp. 3696 – 3700, 1983. [11] M. Tasumi, H. Takenchi, S. Ataka, A. M. Dwidedi, and S. Krimm, “Normal vibrations of proteins: Glucagon,” Biopolymers, vol. 21, pp. 711 – 714, 1982. [12] B. R. Brooks, R. E. Bruccoleri, B. D. Olafson, D. States, S. Swaminathan, and M. Karplus, “Charmm: A program for macromolecular energy, minimization, and dy- namics calculations,” J. Comput. Chem., vol. 4, pp. 187–217, 1983. 131 [13] M. Levitt, C. Sander, and P. S. Stern, “Protein normal-mode dynamics: Trypsin in- hibitor, crambin, ribonuclease and lysozyme.,” J. Mol. Biol., vol. 181, no. 3, pp. 423 – 447, 1985. [14] J. P. Ma, “Usefulness and limitations of normal mode analysis in modeling dynamics of biomolecular complexes.,” Structure, vol. 13, pp. 373 – 180, 2005. [15] M. M. Tirion, “Large amplitude elastic motions in proteins from a single-parameter, atomic analysis.,” Phys. Rev. Lett., vol. 77, pp. 1905 – 1908, 1996. [16] H. Goldstein, Classical Mechanics. Cambridge: Addison-Wesley, 1953. [17] A. R. Atilgan, S. R. Durrell, R. L. Jernigan, M. C. Demirel, O. Keskin, and I. Ba- har, “Anisotropy of fluctuation dynamics of proteins with an elastic network model.,” Biophys. J., vol. 80, pp. 505 – 515, 2001. [18] I. Bahar, A. R. Atilgan, and B. Erman, “Direct evaluation of thermal fluctuations in proteins using a single-parameter harmonic potential.,” Folding and Design, vol. 2, pp. 173 – 181, 1997. [19] I. Bahar, A. R. Atilgan, M. C. Demirel, and B. Erman, “Vibrational dynamics of pro- teins: Significance of slow and fast modes in relation to function and stability.,” Phys. Rev. Lett, vol. 80, pp. 2733 – 2736, 1998. [20] L. W. Yang and C. P. Chng, “Coarse-grained models reveal functional dynamics–I. elastic network models–theories, comparisons and perspectives.,” Bioinformatics and Biology Insights, vol. 2, pp. 25 – 45, 2008. [21] K. L. Xia, K. Opron, and G. W. Wei, “Multiscale multiphysics and multidomain models — Flexibility and rigidity,” Journal of Chemical Physics, vol. 139, p. 194109, 2013. [22] K. Opron, K. L. Xia, Z. Burton, and G. W. Wei, “Flexibility-rigidity index for protein- nucleic acid flexibility and fluctuation analysis,” Journal of Computational Chemistry, vol. 37, pp. 1283–1295, 2016. [23] D. D. Nguyen, K. L. Xia, and G. W. Wei, “Generalized flexibility-rigidity index,” Jour- nal of Chemical Physics, vol. 144, p. 234106, 2016. [24] K. Opron, K. L. Xia, and G. W. Wei, “Communication: Capturing protein multiscale thermal fluctuations,” Journal of Chemical Physics, vol. 142, no. 211101, 2015. [25] K. L. Xia, K. Opron, and G. W. Wei, “Multiscale Gaussian network model (mGNM) and multiscale anisotropic network model (mANM),” Journal of Chemical Physics, vol. 143, p. 204106, 2015. 132 [26] J. A. McCammon, B. R. Gelin, and M. Karplus, “Dynamics of folded proteins,” Nature, vol. 267, pp. 585–590, 1977. [27] M. Newman, Networks: an introduction. Oxford university press, 2010. [28] C. Ambedkar, “Application of centrality measures in the identification of critical genes in diabetes mellitus,” Bioinformation, vol. 11, no. 2, pp. 90–5, 2015. [29] W. Y. G. Y. . L. Y. Gao, S., “Understanding urban traffic-flow characteristics: A rethinking of betweenness centrality.,” Environment and Planning B: Planning and De- sign, vol. 40, no. 1, p. 135153, 2013. [30] A. Bavelas, “Communication patterns in task-oriented groups,” The Journal of the Acoustical Society of America, vol. 22, no. 6, pp. 725–730, 1950. [31] A. Dekker, “Conceptual distance in social network analysis,” Journal of Social Structure (JOSS), vol. 6, 2005. [32] A. Zomorodian and G. Carlsson, “Computing persistent homology,” Discrete Comput. Geom., vol. 33, pp. 249–274, 2005. [33] T. Schlick and W. K. Olson, “Trefoil knotting revealed by molecular dynamics simula- tions of supercoiled DNA,” Science, vol. 257, no. 5073, pp. 1110–1115, 1992. [34] D. W. Sumners, “Knot theory and DNA,” in Proceedings of Symposia in Applied Math- ematics, vol. 45, pp. 39–72, 1992. [35] I. K. Darcy and M. Vazquez, “Determining the topology of stable protein-DNA com- plexes,” Biochemical Society Transactions, vol. 41, pp. 601–605, 2013. [36] C. Heitsch and S. Poznanovic, “Combinatorial insights into rna secondary structure, in N. Jonoska and M. Saito, editors,” Discrete and Topological Models in Molecular Biology, vol. Chapter 7, pp. 145–166, 2014. [37] O. N. A. Demerdash, M. D. Daily, and J. C. Mitchell, “Structure-based predictive models for allosteric hot spots,” PLOS Computational Biology, vol. 5, p. e1000531, 2009. [38] B. DasGupta and J. Liang, Models and Algorithms for Biomolecules and Molecular Networks. John Wiley & Sons, 2016. [39] X. Shi and P. Koehl, “Geometry and topology for modeling biomolecular surfaces,” Far East J. Applied Math., vol. 50, pp. 1–34, 2011. [40] D. Cohen-Steiner, H. Edelsbrunner, and J. Harer, “Stability of persistence diagrams,” Discrete & Computational Geometry, vol. 37, no. 1, pp. 103–120, 2007. 133 [41] D. Cohen-Steiner, H. Edelsbrunner, J. Harer, and Y. Mileyko, “Lipschitz functions have Lp-stable persistence,” Foundations of computational mathematics, vol. 10, no. 2, pp. 127–139, 2010. [42] Z. X. Cang and G. W. Wei, “Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology,” Bioinformatics, vol. 33, pp. 3549–3557, 2017. [43] Z. X. Cang, L. Mu, and G. W. Wei, “Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening ,” PLOS Computa- tional Biology, vol. 14(1), pp. e1005929, https://doi.org/10.1371/journal.pcbi.1005929, 2018. [44] K. L. Xia and G. W. Wei, “Persistent homology analysis of protein structure, flexibility and folding,” International Journal for Numerical Methods in Biomedical Engineering, vol. 30, pp. 814–844, 2014. [45] M. W. Wolpert, D.H., “No free lunch theorems for optimization,” IEEE Transactions on Evolutionary Computation, vol. 1, no. 67, 1997. [46] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001. [47] T. K. Ho, “The random subspace method for constructing decision forests,” IEEE transactions on pattern analysis and machine intelligence, vol. 20, no. 8, pp. 832–844, 1998. [48] J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001. [49] J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Anal- ysis, vol. 38, no. 4, pp. 367–378, 2002. [50] B. Brooks and M. Karplus, “Harmonic dynamics of proteins: normal modes and fluc- tuations in bovine pancreatic trypsin inhibitor,” Proceedings of the National Academy of Sciences, vol. 80, no. 21, pp. 6571–6575, 1983. [51] K. L. Xia and G. W. Wei, “A stochastic model for protein flexibility analysis,” Physical Review E, vol. 88, p. 062709, 2013. [52] D. Bramer and G. W. Wei, “Weighted multiscale colored graphs for protein flexibility and rigidity analysis,” Journal of Chemical Physics, vol. 148, p. 054103, 2018. [53] D. Bramer and G. W. Wei, “Blind prediction of protein b-factor and flexibility,” Journal of Chemical Physics, vol. 149, p. 021837, 2018. 134