APPLICATION OF TOPOLOGICAL DATA ANALYSIS AND MACHINE LEARNING FOR MUTATION INDUCED PROTEIN PROPERTY CHANGE PREDICTION

By

Menglun Wang

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Applied Mathematics – Doctor of Philosophy

2021

ABSTRACT

APPLICATION OF TOPOLOGICAL DATA ANALYSIS AND MACHINE LEARNING FOR MUTATION INDUCED PROTEIN PROPERTY CHANGE PREDICTION

By

Menglun Wang

Mutagenesis is a process by which the genetic information of an organism is changed, resulting in a mutation. Many diseases are caused by protein mutations, including cystic fibrosis, Alzheimer's disease, and most cancers. To better understand mutation-induced protein property changes, accurate and efficient computational models are urgently needed. Algebraic topology, a champion in recent worldwide competitions for protein-ligand binding affinity prediction, is a promising approach for simplifying the complexity of biological structures. In this thesis, we introduce element-specific and site-specific persistent homology, a new branch of algebraic topology, to simplify the structural complexity of protein-protein complexes and embed crucial biological information into topological invariants. Additionally, we propose a new deep learning algorithm, NetTree, that takes advantage of convolutional neural networks and gradient boosting trees. A topology-based network tree (TopNetTree) is constructed by integrating the topological representation and NetTree for predicting the protein-protein interaction ∆∆G. Tests on major benchmark datasets indicate that the proposed TopNetTree significantly improves the current state of the art in ∆∆G prediction. For mutation-induced protein folding energy changes, we propose a machine learning model based on a local topological predictor (LTP). To characterize the molecular structure, the Hessian matrix of the local surface is generated from exponential and Lorentz density kernels. The eigenvalues of the Hessian matrix are calculated as local topological predictors, which are then fed into a gradient boosting machine learning model as features. Our LTP model obtains state-of-the-art results on various benchmark datasets of mutation-induced protein folding energy change.

Copyright by MENGLUN WANG 2021

This thesis is dedicated to my parents, for their love.

ACKNOWLEDGEMENTS

First and foremost, I am extremely grateful to my advisor, Dr. Guo-Wei Wei, for his invaluable advice, continuous support, and patience during my PhD study. His passion for research, visionary insights, and positive attitude toward life have guided me to become a better researcher and a better man. I would like to thank Dr. Moxun Tang, Dr. Ming Yan, and Dr. Yiying Tong for serving on my thesis committee and providing me with extensive guidance and helpful comments. I also want to thank Dr. Irfan and Dr. Saleh for offering me the opportunity to collaborate at the John D. Dingell VA Medical Center. In addition, I want to thank Dr. Hao Zhu and Dr. Ruihao Huang for offering me the research fellowship at the U.S. Food and Drug Administration. I want to thank my fellow group members, Zixuan Cang, Yin Cao, Duc Nguyen, Kelin Xia, Kedi Wu, Rui Wang, Jiahui Chen, and Yuchi Qiu, for their kind help with my research projects. I also want to thank my fellow graduate students in the Math Department, Anqi Chen, Wenzhao Chen, and Weicong Zhou, just to name a few, for their companionship on this long journey. Last but not least, I would like to thank my family.
I am so grateful to my grandmother, who raised me as a child. I want to express thanks to my parents, Zhenfu Wang and Zhifeng Xu. Their unconditional support and love give me the energy to pursue higher goals in life. Thank you.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS
KEY TO ABBREVIATIONS
CHAPTER 1 BIOCHEMICAL BACKGROUND
  1.1 An overview of biomolecular modeling
  1.2 An overview of binding
  1.3 An overview of protein folding
  1.4 An overview of mutation
  1.5 Motivation
CHAPTER 2 MACHINE LEARNING METHODS AND ALGORITHMS
  2.1 An overview of machine learning algorithms
  2.2 Ensemble methods
    2.2.1 Decision Tree
    2.2.2 Random Forest
    2.2.3 Gradient Boosting Decision Tree
    2.2.4 Feature importance
  2.3 Neural Network
    2.3.1 Layers and process of Neural Network
    2.3.2 Convolutional Neural Network (CNN)
    2.3.3 Other Neural Networks
CHAPTER 3 TOPNETTREE: A TOPOLOGY-BASED NETWORK TREE FOR THE PREDICTION OF PROTEIN-PROTEIN BINDING AFFINITY CHANGES UPON MUTATION
  3.1 Introduction
  3.2 Dataset
    3.2.1 AB-Bind dataset
    3.2.2 SKEMPI and SKEMPI 2.0 datasets
    3.2.3 Preprocessing of the datasets
  3.3 Topological Modelling
    3.3.1 Persistent homology
      3.3.1.1 Simplicial complex and filtration
      3.3.1.2 Homology and persistence
    3.3.2 Topological representation of protein-protein interactions
  3.4 Auxiliary features
    3.4.1 Atom-level features
    3.4.2 Residue-level features
  3.5 Machine learning architecture
    3.5.1 TopGBT: Topology-based gradient boosting tree model
    3.5.2 TopCNN: Topology-based convolutional neural network model
    3.5.3 TopNetTree: Topology-based network tree model
    3.5.4 Model parametrization and software used
CHAPTER 4 LTP MODEL: APPLICATION OF LOCAL TOPOLOGICAL CHARACTERISTICS IN PROTEIN FOLDING ENERGY CHANGE UPON MUTATION
  4.1 Introduction
  4.2 Dataset
    4.2.1 ProTherm protein mutation databases S2648 and S350
    4.2.2 Preprocessing of the dataset
  4.3 Local topological characterization of molecules
    4.3.1 Molecular surface representations and molecular density function
    4.3.2 Evaluation of curvature
    4.3.3 Critical points and the Poincaré-Hopf index theorem
    4.3.4 Mutation-site-based element-specific density field generation
    4.3.5 Explicit expression of the Hessian matrix
  4.4 Prediction model design
    4.4.1 Eigenvalue learning feature generation
    4.4.2 Auxiliary features
    4.4.3 Machine learning model
CHAPTER 5 RESULTS
  5.1 Evaluation criteria
  5.2 Model performance of TopNetTree on PPI binding free energy changes
    5.2.1 Prediction results on the AB-Bind dataset
      5.2.1.1 Overall results
      5.2.1.2 Protein-level non-overlapping test on the AB-Bind 645 dataset
      5.2.1.3 Protein-level leave-one-out validation test
    5.2.2 Prediction results on the SKEMPI dataset
    5.2.3 Prediction results on the SKEMPI 2.0 dataset
  5.3 Performance of the LTP model on ProTherm protein mutation databases S2648 and S350
    5.3.1 General performance
    5.3.2 Inter/intra-protein-level cross-validation
    5.3.3 Prediction results for different density kernels
CHAPTER 6 DISCUSSION
  6.1 Prediction result analysis for different mutation types
    6.1.1 Analysis of TopNetTree prediction results on S645
    6.1.2 Analysis of LTP prediction results on S2648
  6.2 Prediction result analysis for different mutation regions
    6.2.1 Definition of mutation region
    6.2.2 Analysis of TopNetTree prediction results on S645
    6.2.3 Analysis of LTP prediction results on S2648
  6.3 Alanine scanning test of 1AK4
CHAPTER 7 THESIS CONTRIBUTION
  7.1 Protein-protein interaction energy changes upon mutation
  7.2 Protein folding energy changes upon mutation
APPENDICES
  APPENDIX A SUPPLEMENTARY MATERIALS FOR THE TOPNETTREE MODEL
  APPENDIX B SUPPLEMENTARY MATERIALS FOR THE LTP MODEL
BIBLIOGRAPHY

LIST OF TABLES

Table 3.1: Summary of topological descriptors. The barcodes are generated for mutant and wild-type complexes.
Table 4.1: Properties of critical points.
Table 5.1: Comparison of the Pearson correlation coefficients of various methods for the AB-Bind S645 set. Except for those from the present TopNetTree and TopGBT, the results are adopted from Ref. [68].
Table 5.2: Results of the non-overlapping protein-level test on the AB-Bind dataset, including Pearson correlation coefficients and RMSEs in kcal/mol.
Table 5.3: Results of the protein-level leave-one-out validation test on the AB-Bind dataset, including Pearson correlation coefficients and RMSEs in kcal/mol.
Table 5.4: Comparison of the Pearson correlation coefficients of various methods for the single-point mutations in the SKEMPI dataset of 1131 mutations. Except for those from TopNetTree and SAAMBE, the results are adopted from Ref. [91].
Table 5.5: Prediction results for S350 and 5-fold cross-validation results for S2648. All results of other methods listed in the table are from the papers cited.
Table 5.6: Intra-protein-level cross-validation results for S2648. Mutations in each protein are 5-fold cross-validated within the protein.
Table 6.1: Criteria of residue regions [55].
Table 6.2: Alanine mutation test on 1AK4 chain A.
Table A.1: TopNetTree cross-validation results on S645.
Table B.1: Comparison of Pearson correlations (Rp) and RMSEs (kcal/mol) of various methods for the prediction of mutation-induced protein stability changes on the S350 set and for 5-fold cross-validation of mutation-induced protein stability changes on S2648. n represents the number of samples successfully processed. LTP1 is our topology-based mutation predictor that solely utilizes structural information; LTP2 is our model that complements LTP1 with auxiliary features. The results reported in the publications are listed in the table [70]. 50 repeated runs were conducted, and the median values of the metrics were picked for the LTP2 and LTP1 methods.
Table B.2: 5-fold cross-validation results for Q3421 with respect to Rp and RMSE.
Comparison of Pearson correlations (Rp) and RMSEs (kcal/mol) of various methods for 5-fold cross-validation of mutation-induced protein stability changes on Q3421. n represents the number of samples successfully processed. In the LTP1 method, two-scale models are considered by coupling two sets of Hessian eigenvalue features. An Rp/RMSE (kcal/mol) of 0.79/1.2 was reported for the STRUM method [70]. 50 repeated runs were conducted, and the median values of the metrics were picked for the LTP2 and LTP1 methods.

LIST OF FIGURES

Figure 1.1: Example of a single-site mutation from the DNA genome to amino acids.
Figure 2.1: An example of a fully connected neural network.
Figure 2.2: An example of a convolution operation on a 4 × 4 matrix by a 2 × 2 kernel.
Figure 3.1: Counts of mutation types in the AB-Bind dataset.
Figure 3.2: Illustration of the filtration and persistence diagram of a set of points on a plane.
Figure 3.3: Topological barcode change associated with a mutation. Residue leucine in the wild type is mutated into alanine. Barcodes are generated for carbon atoms within a cutoff of 12 Å of the mutant residue.
Figure 3.4: Illustration of the point cloud generation of the antibody-antigen complex 1DQJ.
Figure 3.5: Illustration of the proposed TopNetTree model.
Figure 3.6: Illustration of CNN parameters.
Figure 4.1: Illustration of the van der Waals surface (vdWS, yellow region), solvent accessible surface (SAS, red dotted margin), and solvent excluded surface (SES, blue dotted margin).
Figure 4.2: An example of a 2-D exponential density function generated surface.
Figure 4.3: Examples of different critical points in a standard cube: nucleic critical point (NCP) in red, bond critical point (BCP) in yellow, ring critical point (RCP) in cyan, and cage critical point (CCP) in green.
Figure 4.4: Density field of naphthalene.
Figure 4.5: Gradient field of naphthalene.
Figure 4.6: An example of two eigenvalue maps of naphthalene using two different density fields.
Figure 4.7: An example of three eigenvalue isosurface maps of naphthalene using the exponential density field.
Figure 4.8: Flowchart of the eigenvalue learning model of protein folding energy change upon mutation, with the example 1A3J A 12 F A. First, the mutation site is selected for both the wild-type and mutant structures, and the corresponding exponential/Lorentz density fields are generated. Then the Hessian matrix and its eigenvalues are evaluated in the near-mutation region with respect to different element groups. Finally, eigenvalue features and auxiliary features are fed into the GBDT model to obtain the predicted energy change.
Figure 5.1: (a) Scatter plot of the TopNetTree cross-validation prediction results on S645. (b) Scatter plot of the TopNetTree cross-validation prediction results on S645 excluding the 27 non-binders.
Figure 5.2: Scatter plot of TopNetTree blind-test predictions on homology models.
Figure 5.3: Performance evaluation of the 10-fold cross-validation on set S1131.
Figure 5.4: (a) Scatter plot of the TopNetTree cross-validation prediction results on S4947. (b) Scatter plot of the TopNetTree cross-validation prediction results on S4169.
Figure 5.5: Scatter plot of the TopNetTree cross-validation prediction results on S8338.
Figure 5.6: (a) Scatter plot of the prediction results on S350. (b) Scatter plot of the 5-fold cross-validation results on S2648.
Figure 5.7: Parameter selection heatmap for the prediction model of S350.
Figure 6.1: Comparison of average experimental and predicted binding affinity changes upon mutation associated with different amino acid types for the AB-Bind dataset. The x-axis labels the residue type of the original, while the y-axis labels the residue type of the mutant. For a reverse mutation, its ∆∆G is taken with the same magnitude as the original value and the opposite sign. (a) Average binding affinity changes upon mutation (kcal/mol). (b) Variance of binding affinity changes upon mutation (kcal/mol).
Figure 6.2: Comparison of (a) average experimental and (b) predicted binding affinity changes upon mutation associated with different amino acid types for the S2648 dataset.
Figure 6.3: Prediction results for different residue region types in the S645 dataset.
Figure 6.4: Scatter plots of the 5-fold cross-validation results for the S2648 set in (a) the buried mutation region, with Rp = 0.79 and RMSE = 0.94; (b) the exposed mutation region, with Rp = 0.78 and RMSE = 0.92; and (c) the intermediate mutation region, with Rp = 0.78 and RMSE = 0.92.
Figure 6.5: Structure of protein complex 1AK4, chain A in blue and chain D in red.
Figure B.1: Predictive behaviors of the LTP1 model on protein folding energy changes upon mutation.

LIST OF ALGORITHMS

Algorithm 1: Hunt's algorithm
Algorithm 2: Gradient Boosting

KEY TO ABBREVIATIONS

ML Machine learning
DL Deep learning
PH Persistent homology
ESPH Element-specific persistent homology
RF Random forest
GBDT Gradient boosting decision tree
NN Neural network
CNN Convolutional neural network
GD Gradient descent
SGD Stochastic gradient descent
PPI Protein-protein interaction
RMSE Root mean square error
MIBPB Matched interface and boundary Poisson-Boltzmann
SAS Solvent accessible surface
SES Solvent excluded surface
ESES Eulerian solvent excluded surface
CP Critical point
NCP Nucleic critical point
BCP Bond critical point
CCP Cage critical point
RCP Ring critical point
PDB Protein Data Bank
SKEMPI Structural database of Kinetics and Energetics of Mutant Protein Interactions
AB Antibody
AG Antigen
WT Wild type
MT Mutant
SNPs Single nucleotide polymorphisms

CHAPTER 1 BIOCHEMICAL BACKGROUND

1.1 An overview of biomolecular modeling

Biomolecules, such as DNA, RNA, and proteins, are fundamental and essential to the human body. Understanding structure-function relationships is a major challenge at the molecular level.
Research on the relationship between the structures of biomolecules and their functions is one of the hottest topics in drug design and the pharmaceutical industry. Among all the properties of a biomolecular system, thermodynamic properties are crucial to the functionality of the system. These thermodynamic properties include the binding affinity of protein-ligand complexes, the stability changes induced by amino acid mutations in proteins, and the flexibility of protein residues. To better understand the properties mentioned above, effective prediction models are urgently needed.

There are three major types of biomolecular prediction models. Physics-based methods build models according to physical laws and are indispensable for molecular modeling, as they provide predictions and reveal underlying mechanisms. Examples include quantum mechanics calculations, molecular dynamics simulations, and Monte Carlo sampling. There are also more efficient approximations of atomic systems that use a continuum for part of the system. For example, the Poisson-Boltzmann model delivers an efficient description of the electrostatics in the solvation processes of molecules by using a continuum solvent. Empirical models use weights determined by experimental data. A widely used setup for empirical models combines molecular mechanics energies with the polar part of the solvation process, modeled by the Poisson-Boltzmann or Generalized Born model, and the nonpolar part of the solvation process, reflected by surface areas; such models are usually called MM/PBSA or MM/GBSA models [29, 30]. Machine learning based models use machine learning algorithms to learn from existing training data. With machine learning, a prediction model can handle more detailed descriptions of the systems and descriptors of various types. Machine learning based algorithms also adapt better to large datasets.

In this work, we are interested in designing machine learning based prediction models for biomolecular properties, including protein-protein binding energy changes upon mutation and protein folding energy changes upon mutation. Detailed features and model architectures are discussed in later chapters of the thesis.

1.2 An overview of binding

In physics and chemistry, the binding energy is the smallest amount of energy required to remove a particle from a system of particles or to disassemble a system of particles into individual parts. In this section, we introduce the basic concepts and measures of binding energy.

K_D: In chemistry, biochemistry, and pharmacology, a dissociation constant (K_D) is a specific type of equilibrium constant that measures the propensity of a larger object to separate (dissociate) reversibly into smaller components, as when a complex falls apart into its component molecules or when a salt splits up into its component ions. For a general reaction

A_x B_y \rightleftharpoons xA + yB    (1.1)

in which a complex A_xB_y breaks down into x A subunits and y B subunits, the dissociation constant is defined as

K_D = \frac{[A]^x [B]^y}{[A_x B_y]}    (1.2)

In the specific case of antibodies (Ab) binding to antigens (Ag), the term affinity constant usually refers to the association constant:

K_A = \frac{[\mathrm{AbAg}]}{[\mathrm{Ab}][\mathrm{Ag}]} = \frac{1}{K_D}    (1.3)

K_i: The inhibitory constant (K_i) also represents a dissociation constant, but one specific to the binding of an inhibitor (I) to an enzyme (E):
E + I \rightleftharpoons EI    (1.4)

IC50: The half-maximal inhibitory concentration (IC50) is a measure of the potency of a substance in inhibiting a specific biological or biochemical function. More specifically, IC50 is the concentration of inhibitor required to reduce the biological activity of interest to half of the uninhibited value. Although IC50 is not a direct indicator of affinity, for enzymatic reactions the binding affinity of the inhibitor (K_i) can be obtained from the Cheng-Prusoff equation:

K_i = \frac{IC_{50}}{1 + [S]/K_m}    (1.5)

Here [S] is the fixed substrate concentration and K_m is the Michaelis constant.

EC50: The half-maximal effective concentration (EC50) refers to the concentration of a drug, antibody, or toxicant that induces a response halfway between the baseline and the maximum after a specified exposure time. For inhibition constants at cellular receptors, one can obtain the binding affinity from the following equation:

K_i = \frac{IC_{50}}{1 + [A]/EC_{50}}    (1.6)

where [A] is the fixed concentration of agonist.

∆G: For the binding of receptor and ligand molecules in solution, the molar Gibbs free energy ∆G, or the binding affinity, is related to the dissociation constant K_D via

\Delta G = RT \ln \frac{K_D}{c^{\theta}}    (1.7)

in which R is the ideal gas constant, T is the temperature, and the standard reference concentration is c^θ = 1 mol/L.
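As a quick illustration of Eq. (1.7), the following minimal Python sketch converts a dissociation constant into a binding free energy in kcal/mol. The function name and the example K_D value are ours, chosen only for illustration:

```python
import math

def binding_free_energy(kd, temperature=298.15):
    """Binding free energy (kcal/mol) from a dissociation constant K_D (mol/L),
    via Delta G = RT ln(K_D / c_theta) with c_theta = 1 mol/L (Eq. 1.7)."""
    R = 8.314 / 4184.0  # ideal gas constant converted to kcal/(mol*K)
    return R * temperature * math.log(kd / 1.0)

# A nanomolar binder, K_D = 1e-9 mol/L, gives roughly -12.3 kcal/mol at 25 C.
print(binding_free_energy(1e-9))
```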
1.3 An overview of protein folding

Protein folding is the process by which a protein chain attains its folded conformation. The folding process is mainly guided by hydrophobic interactions, the formation of intramolecular hydrogen bonds, and van der Waals forces, and it is opposed by conformational entropy [2]. The process of protein folding can be divided into the following steps:

• Primary structure. The primary structure of a protein is its linear amino-acid sequence. This sequence is the essential starting point of the folding process, specifying both the native structure and the pathway to attain the final state.

• Secondary structure. The secondary structure is the first step the protein takes toward its native structure. There are two major types of secondary structure, alpha helices and beta sheets, both stabilized by intramolecular hydrogen bonds.

• Tertiary structure. Secondary structure hierarchically gives way to the formation of a protein's tertiary structure. The major force generating tertiary structure is the hydrophobic interaction. Once the protein's tertiary structure is formed and stabilized by the hydrophobic interactions, covalent bonding may also occur in the form of disulfide bridges formed between two cysteine residues.

Protein misfolding can lead to protein aggregation. Aggregated proteins are associated with prion-related illnesses such as Creutzfeldt–Jakob disease and bovine spongiform encephalopathy (mad cow disease), amyloid-related illnesses such as Alzheimer's disease and familial amyloid cardiomyopathy or polyneuropathy [42], as well as intracellular aggregation diseases such as Huntington's and Parkinson's diseases [80, 17]. Mutation of the protein sequence is one of the causes of protein misfolding, and the protein folding energy change is an essential thermodynamic quantity. In Chapter 4 of this thesis, we set up a model to predict the protein folding energy change upon mutation.

1.4 An overview of mutation

In biology, a mutation is an alteration in the nucleotide sequence of the genome of an organism, a virus, or extrachromosomal DNA. Mutations play a part in both normal and abnormal biological processes, including evolution, cancer, and the development of the immune system (e.g., junctional diversity). Mutation is the ultimate source of all genetic variation, providing the raw material on which evolutionary forces such as natural selection can act. There are four major classes of mutation:

• spontaneous mutations (molecular decay)
• mutations due to error-prone replication bypass of naturally occurring DNA damage
• errors introduced during DNA repair
• induced mutations caused by mutagens

Figure 1.1: Example of a single-site mutation from the DNA genome to amino acids

Gene mutations can impact human health and functioning in several ways. In this work, we mainly focus on the impact of mutations on protein sequences. More specifically, we focus on mutations that occur in coding regions of the genome, which are more likely to alter the protein product. To better understand the mechanism and impact of mutation, site-directed mutagenesis is applied in research. Site-directed mutagenesis is an invaluable tool to modify genes and study the structural and functional properties of a protein, based on the structure, function, catalytic mechanism, and catalytic residues of enzymes. Site-directed mutagenesis includes two classes: single-site mutations and combinatorial mutations.

Since a mutation can alter the protein sequence and structure, in binding processes such as antibody neutralization and protein-ligand binding, the binding affinity of the wild type can differ substantially from that of the mutant. According to recent research [16], the binding free energy changes of SARS-CoV-2 mutations are closely related to the infectivity of the virus. Therefore, computational estimation of mutation-induced binding free energy changes is crucial to unraveling many biological challenges. In this thesis, we combine topological tools and machine learning techniques to build models for mutation-induced binding energy prediction.

1.5 Motivation

Biological data are growing at an astonishing speed; for example, the Protein Data Bank (PDB) has accumulated nearly 130,000 tertiary structures. Thus, machine learning models are becoming more and more powerful in structure-based biomolecular property prediction. The evolution of machine learning itself also boosts the capacity for biomolecular property prediction. Algorithms such as gradient boosting trees and convolutional neural networks allow researchers to build more accurate models for specific types of tasks. On the other hand, topology, a branch of mathematics, has proved useful in characterizing molecular structure. With powerful topological tools such as persistent homology, one can reduce the dimensionality of biomolecular structural data and obtain more learnable input features. In this thesis, we incorporate topology-based features into advanced machine learning algorithms for protein property prediction, including protein-protein binding energy changes upon mutation and protein folding energy changes upon mutation. Our models prove accurate and efficient on the given datasets, including AB-Bind, SKEMPI, and ProTherm.

CHAPTER 2 MACHINE LEARNING METHODS AND ALGORITHMS

In this chapter, we give an overview of the major machine learning algorithms and architectures used in this thesis. For each algorithm, we introduce its process, applications, and pros and cons.
2.1 An overview of machine learning algorithms

Machine learning algorithms build a model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so. Depending on the label information of the training data, machine learning approaches can be divided into the following categories:

• Supervised learning. In supervised learning, the target value (label) of the training set is explicitly given. In the mathematical model, a supervised learning algorithm maps a vector of features to a target value. Depending on the predicted target value, supervised learning can be divided into two types: classification and regression. For classification, the target value must be categorical; for regression, the target value can be any continuous numerical value. Many machine learning algorithms, for example the support vector machine (SVM), random forest, and gradient boosting tree, are supervised learning algorithms.

• Unsupervised learning. In contrast to supervised learning, unsupervised learning algorithms do not have label information in the training set. The main purpose of unsupervised learning is to find structural characteristics and commonalities of the data. A central application of unsupervised learning is density estimation in statistics, such as finding the probability density function.

• Semi-supervised learning. Semi-supervised learning covers the problems between supervised and unsupervised learning. Typically, the training labels are noisy, limited, or imprecise. The main purpose of semi-supervised learning algorithms is to improve learning accuracy from the limited training labels.

• Reinforcement learning. Reinforcement learning is an area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. Different from supervised or unsupervised learning, reinforcement learning is mainly used to find strategies rather than particular results. Many other disciplines are closely related to reinforcement learning, including game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, statistics, and genetic algorithms. There are several famous and successful applications of reinforcement learning, such as AlphaGo (an AI-based Go program) and autonomous driving software.

2.2 Ensemble methods

Ensemble methods are a general category of machine learning algorithms based on decision tree algorithms. In the following sections we briefly introduce decision trees, random forests, and gradient boosting decision trees.

2.2.1 Decision Tree

Decision trees are a non-parametric supervised learning method used for classification and regression. The goal is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. There are three basic types of nodes in a decision tree: decision nodes, chance nodes, and end nodes.

Hunt's algorithm. Hunt's algorithm is considered one of the most widely used algorithms for building decision trees. It recursively partitions the training dataset into successively purer subsets. Algorithm 1 shows the procedure.

Algorithm 1 Hunt's algorithm
while not all the records in the subset belong to the same class do
    if D_t contains records that belong to the same class y_t then
        t is labeled as y_t
    else if D_t is not empty but Attlist is empty then
        t is labeled with the majority class of the records in the dataset
    else if D_t belongs to more than one class and Attlist is not empty then
        use the attribute selection method to choose the next best attribute from Attlist and remove that attribute from Attlist
        use the attribute and its condition as the next test condition
    end if
end while

Here D is the training dataset with a number of attributes, D_t is the subset of records associated with node t, Attlist is the list of candidate splitting attributes, and the attribute selection method is the procedure used to determine the best split. For different target values, there are different measures to determine the best way to split the records. The principle of the splitting algorithm is to achieve higher purity. The following measures are the most commonly used to define a split's purity.

Entropy:

E(x) = -\sum_{i=1}^{n} p_i \log_2(p_i)    (2.1)

Gini index:

GINI(x) = 1 - \sum_{i=1}^{n} p_i^2    (2.2)

Classification error:

\text{Classification error}(x) = 1 - \max_i p_i    (2.3)

With the measures above, one can choose the best split based on impurity.
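To make Eqs. (2.1)-(2.3) concrete, here is a minimal Python sketch (the function name and example counts are ours, for illustration only) that computes the three impurity measures from the class counts at a node:

```python
import numpy as np

def impurities(counts):
    """Entropy, Gini index, and classification error from class counts at a node."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()                            # class probabilities p_i
    p_nz = p[p > 0]                            # skip empty classes to avoid log2(0)
    entropy = -np.sum(p_nz * np.log2(p_nz))    # Eq. (2.1)
    gini = 1.0 - np.sum(p ** 2)                # Eq. (2.2)
    error = 1.0 - p.max()                      # Eq. (2.3)
    return entropy, gini, error

# A pure node has zero impurity under all three measures;
# a 50/50 node is maximally impure (entropy 1, Gini 0.5, error 0.5).
print(impurities([10, 0]))
print(impurities([5, 5]))
```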
2.2.2 Random Forest

As its name suggests, a random forest is an ensemble method that constructs a multitude of decision trees. Random decision forests correct for decision trees' habit of overfitting their training set [33]. Bootstrap aggregating, or bagging, is the general technique for generating a random forest model. Suppose we have a training set X; using bootstrapping, i.e., selecting random samples with replacement, we generate m decision trees f_1, ..., f_m. The random forest prediction for an unseen sample x' is then defined by averaging the predictions of all the individual regression trees:

f = \frac{1}{m} \sum_{n=1}^{m} f_n(x')    (2.4)

For classification problems, the predicted class f is the class that the majority of the trees vote for. To further avoid overfitting, random forests also include another type of bagging scheme: feature bagging. During the training process, only a random subset of the features is used for each decision tree.

2.2.3 Gradient Boosting Decision Tree

The gradient boosting decision tree is another popular ensemble method for regression and classification tasks. The gradient boosting algorithm was first observed by Leo Breiman [9] and was subsequently developed by Friedman [32]. The algorithm iteratively generates weak learners by minimizing a loss function using the gradient descent method. For a training set of n samples with features and labels {(x_i, y_i)}_{i=1}^n, a differentiable loss function is defined as L(y, F(x)). With a preset stopping iteration M, the gradient boosted tree algorithm is shown in Algorithm 2.

Algorithm 2 Gradient Boosting
Initialize F_0(x) with a constant value:
    F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, \gamma)
while m < M do
    r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)},  i = 1, ..., n
    fit a tree h_m(x) to the residuals {(x_i, r_{im})}_{i=1}^{n}
    \gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + \gamma h_m(x_i))
    F_m(x) = F_{m-1}(x) + \gamma_m h_m(x)
end while
Output F_M(x)

In the algorithm, there are several important hyperparameters that heavily impact the model's performance. The most important hyperparameters are the number of estimators and the learning rate. The learning rate controls the speed of the loss-function optimization in gradient descent, and the number of estimators controls the number of boosting stages to perform.
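As a hedged illustration (the synthetic dataset and parameter values below are ours, not from the thesis), scikit-learn's GradientBoostingRegressor exposes exactly these two hyperparameters:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for a mutation feature matrix.
X, y = make_regression(n_samples=500, n_features=20, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of boosting stages M; learning_rate scales each stage.
model = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                  max_depth=3, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```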
2.2.4 Feature importance

Generally, feature importance provides a score that indicates how useful or valuable each feature was in the construction of the boosted decision trees within the model. Feature importance is calculated as the decrease in node impurity weighted by the probability of reaching that node. The node probability can be calculated as the number of samples that reach the node divided by the total number of samples. The higher the value, the more important the feature. We use Scikit-learn's built-in function to obtain the feature importance; the implementation in Scikit-learn is as follows. For each decision tree, the Gini importance of node j is calculated as

ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)}    (2.5)

Here i indexes features and j indexes nodes, ni is the node importance, w is the weighted number of samples reaching a node, and C is the impurity value of the node. The importance of feature i on a single decision tree is calculated as

fi_i = \frac{\sum_{j:\,\text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \in \text{all nodes}} ni_k}    (2.6)

\hat{fi}_i = \frac{fi_i}{\sum_{j \in \text{all features}} fi_j}    (2.7)

The overall feature importance is the average over all T decision trees:

FI_i = \frac{\sum_{j \in \text{all trees}} \hat{fi}_{ij}}{T}    (2.8)

2.3 Neural Network

Neural networks (NNs) are among the most popular machine learning algorithms and are inspired by the neural networks that constitute human brains. The history of neural networks traces back to 1943, when Warren McCulloch and Walter Pitts [49] opened the subject by creating a computational model for neural networks. In 1958, the psychologist Frank Rosenblatt invented the perceptron [74, 73, 87, 43], the first artificial neural network. The first functional networks with many layers were published by Ivakhnenko and Lapa [78, 45, 46] in 1965, as the Group Method of Data Handling. As its name implies, a neural network is based on a collection of connected units or nodes called artificial neurons. The connections between the neurons are called edges, and each edge is assigned a specific weight. According to the connection relations, neurons can be grouped into layers. Neurons of one layer connect only to neurons of the immediately preceding and immediately following layers. In the following sections we briefly introduce the basic processes and layers used in this work.

2.3.1 Layers and process of Neural Network

In this section, we introduce the components of a basic neural network, including different types of layers and processes.

Input and output layers. Input and output layers are the basic components of a neural network. The input layer is determined by the feature size of the training sample and can be either one-dimensional or high-dimensional with multiple channels. The output layer is determined by the type of target value: either a value for a regression problem or a class for a classification problem.

Activation layers. Activation layers apply an activation function to the given input layer. Several common activation functions are listed below (a minimal code sketch follows this list):

• Rectified Linear Unit (ReLU): f(x) = max(0, x)
• Logistic (Sigmoid): f(x) = 1/(1 + e^{-x})
• Hyperbolic Tangent (tanh): f(x) = tanh(x)

In general, activation functions are nonlinear.
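A minimal NumPy sketch of these three activations (ours, for illustration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # max(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # 1 / (1 + e^-x)

def tanh(x):
    return np.tanh(x)

x = np.array([-2.0, 0.0, 2.0])
print(relu(x), sigmoid(x), tanh(x))
```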
With these activation layers, the network gains more nonlinearity and better overall performance.

Dense layers. Dense layers, also called fully connected layers, connect all neurons of the previous layer to the current layer. Mathematically, a dense layer performs a matrix multiplication and can be used to change the dimension of a vector or matrix. In general, a dense layer has a large weight matrix and is relatively computationally expensive.

Figure 2.1: An example of a fully connected neural network with an input layer, three hidden layers, and an output layer.

Pooling layers. Pooling layers are used to reduce the dimensions of feature maps. There are two typical pooling methods: max pooling and average pooling. Pooling layers apply a sliding window to the previous layer and return either the maximum or the average value of the window. Pooling layers are usually used in image-processing neural networks.

Dropout layers. Dropout layers randomly set input neurons to 0. The rate of setting neurons to 0 is called the dropout rate. Inputs not set to 0 are scaled up by 1/(1 - dropout rate), which keeps the sum over all inputs unchanged. This dropout process can reduce overfitting of the neural network.

Optimization and backpropagation. Backpropagation is the process that calculates the gradient of the loss function with respect to the neural network's weights backwards through the network. Backpropagation was invented in the 1970s [78] as a general optimization method for performing automatic differentiation of complex nested functions [26, 60]. However, it was not until 1986, with the publication of a paper by Rumelhart, Hinton, and Williams [76], that it gained widespread recognition. Most optimization algorithms for neural networks are variants of gradient descent. We list several commonly used algorithms below:

• Gradient descent (GD):

\theta = \theta - a \nabla J(\theta)    (2.9)

• Stochastic gradient descent (SGD):

\theta = \theta - a \nabla J(\theta; x^{(i)}, y^{(i)}), \text{ where } (x^{(i)}, y^{(i)}) \text{ is a training sample}    (2.10)

Stochastic gradient descent is a variant of gradient descent: instead of updating the parameters based on the whole training set, SGD updates the model parameters based on a randomly chosen training sample. In general, SGD converges faster than GD.

• Adaptive Moment Estimation (Adam): To reduce the high variance of SGD and accelerate convergence, momentum terms are introduced in Adam:

\hat{m}_t = \frac{m_t}{1 - \beta_1^t}    (2.11)

\hat{v}_t = \frac{v_t}{1 - \beta_2^t}    (2.12)

The model is updated as follows:

\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t    (2.13)

In the remainder of this thesis, we use the Adam algorithm as the optimizer for all neural networks.

Hyperparameters. Hyperparameters are the parameters of a neural network other than the weights determined by the training set. Common hyperparameters include the number of layers, the types of layers, the learning rate of the optimizer, the channel dimension, etc. Tuning hyperparameters for better performance is always a headache for researchers. For some hyperparameters, one can use Bayesian optimization; for hyperparameters such as the number of layers, the only way is grid search.

2.3.2 Convolutional Neural Network (CNN)

The convolutional neural network (CNN) is a class of artificial neural networks. The neocognitron, introduced by Kunihiko Fukushima in the 1980s [36], was the first model to introduce convolutional layers and downsampling layers. The first modern application of convolutional neural networks was implemented in the 1990s by Yann LeCun et al. [53]. The core part of a CNN is the convolution operation and its related layer. The convolution operation applies a tensor product over a sliding window of the input data. This operation can efficiently capture local patterns in data and avoids the curse of dimensionality of fully connected layers. An example of a 2D convolution is shown in Figure 2.2.

Figure 2.2: An example of a convolution operation on a 4 × 4 matrix by a 2 × 2 kernel.
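The following NumPy sketch (ours, for illustration) reproduces the operation of Figure 2.2: a 2 × 2 kernel slides over a 4 × 4 input with stride 1 and no padding, producing a 3 × 3 output. As in most deep learning libraries, "convolution" here is implemented as cross-correlation:

```python
import numpy as np

def conv2d(x, k, stride=1):
    """Valid 2-D convolution: slide kernel k over input x and sum the
    elementwise products at each window position."""
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * k)
    return out

x = np.arange(16).reshape(4, 4)      # a 4x4 input
k = np.array([[1, 0], [0, -1]])      # a 2x2 kernel
print(conv2d(x, k))                  # 3x3 output
```

With total zero padding equal to the kernel dimension minus 1, as discussed next, the output would regain the input's 4 × 4 size.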
The first modern application of convolutional neural networks was implemented in the 90s by Yann LeCun etc.[53] The core part of CNN is convolution operation and its related layer. Convolution operation applies tensor production by sliding window over input data. This operation can efficiently capture local patterns of data and avoid the curse of dimensionality of fully connected layer. An example of 2D convolution is shown in figure2.2 15 Figure 2.2: An example of convolution operation on a 4 × 4 matrix by 2 × 2 kernel There are several conceptions and parameters related to convolution operation, we listed below • Kernel size: the size of sliding window, for example 2 × 2 • Padding: Padding is the operation related to the margin of input data. If no padding to input data, the output dimension after convolution will decrease. To keep same dimensionality, adding 0 valued on the borders of input data is needed, which is called zero padding. Padding is typically set to the kernel dimension -1. • Stride: Stride is the step size of sliding window moves on each iteration. Beside convolution layer, a complete convolutional neural network also contains the layers we mentioned in previous section, including dense layers, pooling layers etc. We will show the CNN structure of TopNetTree model in the later section of the work. 2.3.3 Other Neural Network Other than CNN, there are a lot of artificial neural network structures, each structure are adapted to certain types of tasks. We listed a few below: • Recurrent neural network: Recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes form a directed graph along a temporal sequence. 16 The RNN networks were based on David Rumelhart’s work in 1986[75]. Long short-term memory (LSTM) network is one of the most commonly used RNN architecture. In general, Recurrent neural networks have outstanding performance in sequence and time related data • Graph neural network: Graph neural network (GNN) is a type of neural network for processing graph data structures. It has been mathematically proven that GNNs are a weak form of the Weisfeiler–Lehman graph isomorphism test[77], so any GNN model is at least as powerful as this test. • Residual neural network: Residual neural network is a type of neural network which envolves shipping process or shortcut between layers. It was first proposed by Kaiming He etc in 2015[44]. 17 CHAPTER 3 TOPNETTREE: A TOPOLOGY-BASED NETWORK TREE FOR THE PREDICTION OF PROTEIN-PROTEIN BINDING AFFINITY CHANGES UPON MUTATION 3.1 Introduction Protein-protein interactions (PPIs) are crucial to a wide range of biological activities and functions in the human body, including cell metabolism, signal transduction, muscle contraction, and immune systems. Antibody-antigen is one of the most essential systems among all PPIs and plays a unique role in studying PPIs. Antibodies (Abs) are large proteins serving important roles in the immune system by counteracting antigens which are chemicals recognized as alien by the human body. On the tip of an antibody, there is an antigen-binding fragment (Fab) that contains a paratope for recognizing a unique antigen via its epitope. More specifically, a paratope consists of a set of complementarity-determining regions (CDRs) which have the highest conformational flexibility among sites on an antibody [19]. 
The high selectivity of the antibody-antigen recognition mechanism and the flexibility of antibodies as large proteins make antibodies a suitable platform for designing counteractants of target molecules. Antibodies have been widely used as therapeutic agents to treat human diseases. Antibody therapy has several advantages over traditional therapy, including a longer serum half-life, higher avidity and selectivity, and the ability to invoke desired immune responses [15, 23, 81]. Antibody therapy also brings hope to several previously incurable diseases, and there are ongoing efforts in the direction of HIV vaccine development [3] and cancer therapeutic antibodies [41, 6].

Three-dimensional (3D) structural information and thermodynamic measurements are two essential components for understanding the molecular mechanism of PPIs. Many experimental methods have been developed to determine the structure of protein-protein complexes. Among them, X-ray crystallography, nuclear magnetic resonance (NMR), and cryo-electron microscopy (cryo-EM) are the main workhorses [38]. The Protein Data Bank (PDB) [7], one of the largest protein structure databases, includes tens of thousands of protein-protein complex structures and is expanding at an unprecedented rate. Site-directed mutation is a key technology for probing PPI thermodynamic properties, including binding affinities of antibody-antigen interactions. Sirin et al. [83] collected the AB-Bind database of mutation-induced antibody-antigen complex binding free energy changes. This database contains 1101 mutation data entries, including 645 single-point mutations on 32 different antibody-antigen complexes. SKEMPI is a more general database for protein-protein binding affinity changes upon mutation (∆∆G) [61]. It contains 3047 mutation data entries for protein-protein heterodimeric complexes with experimentally determined structures.

The aforementioned databases have been widely used as benchmark tests for evaluating the predictive power of computational methods, which are indispensable for the investigation of PPIs, especially for the systematic screening of mutations [64, 24]. There are many reliable computational methods that can predict mutant structures from the wild type, including Rosetta [50] and Jackal [90]. Computational methods for generating protein structures from sequences (e.g., MODELLER [85]) and predicting docking poses of protein-protein complexes (e.g., BioLuminate [98]) are also available. For the thermodynamic properties of PPIs, the information is usually interpreted as the binding affinity or binding free energy, ∆G. Given their importance, a variety of computational methods have been developed for the structure-based prediction of antibody-antigen binding affinities. DFIRE [94] relies on an all-atom, distance-scaled, pairwise potential derived from a database of high-quality, diverse protein structures. STATIUM uses a pairwise statistical potential that scores how well a protein complex can accommodate different pairs of residues in the parent complex geometry. Force fields for proteins can also be used to compute the binding free energy, representing van der Waals interactions, hydrophobic packing, electrostatics, and solvation effects; these approaches include FoldX (FOLDEF) [79], Discovery Studio (CHARMMPLR) [8], and Rosetta [50]. Typically, physics-based methods provide mechanistic interpretations but are not designed for handling large and diverse datasets.
Pires et al. optimized their graph-based CSM method for predicting the antibody-antigen affinity changes upon mutation given in the AB-Bind database [68]. This method, called mCSM-AB, was shown to outperform the aforementioned physical methods, but it achieved only a Pearson correlation coefficient (Rp) of 0.53 with 10-fold cross-validation on a set of 645 single-point mutations. Therefore, the limited performance of current methods highlights a pressing need for a new generation of ∆∆G predictors that are constructed with entirely new design principles and/or innovative machine learning algorithms. While the physics-based methods assume potential functions of certain forms and the graph-based method only considers pairwise interactions, we seek an approach that makes fewer assumptions and allows a systemic description of the protein-protein interaction.

Persistent homology [35, 27, 99, 100], a new branch of algebraic topology, is able to bridge geometry and topology, leading to a new, efficient approach for the simplification of biological structural complexity [88, 37, 89, 10, 93, 51]. However, it neglects critical chemical/biological information when directly applied to complex biomolecular structures. Element-specific persistent homology can retain critical biological information during the topological abstraction. Paired with advanced machine learning, such as a convolutional neural network (CNN), this new topological method gives rise to some of the best predictions for protein-ligand binding affinities [14], protein folding free energy changes upon mutation [11, 13], and drug virtual screening [12]. This approach has won many contests in the D3R Grand Challenges, a worldwide competition series in computer-aided drug design [63]. However, the techniques designed for protein-ligand binding analysis cannot be directly applied to PPIs due to biological differences and the different characteristics of the available datasets. In this work, we introduce site-specific persistent homology tailored for PPI analysis. We hypothesize that a topological approach that generates intrinsically low-dimensional representations of PPIs could dramatically reduce the dimensionality of antibody-antigen complexes, leading to reliable high-throughput screening in the search for valuable mutants in protein design.

3.2 Dataset

3.2.1 AB-Bind dataset

The AB-Bind dataset includes 1101 mutation data points with experimentally determined binding affinities [83]. Following Pires et al. [68], we consider only the 645 single mutations across 29 antibody-antigen complexes. Among them, 87 mutations are on 5 complexes with homology structures. This dataset, called the AB-Bind S645 set, consists of about 20% stabilizing mutations and 80% destabilizing ones. In addition, the whole dataset contains 27 non-binders, which are variants determined not to bind within the sensitivity of the assay. The binding affinity changes upon mutation of these non-binders were set to -8 kcal/mol. These non-binders can be regarded as outliers in the database and have a strongly negative impact on prediction model accuracy.
Figure 3.1 shows the counts of each mutation type in the AB-Bind database.

Figure 3.1: Counts of mutation types in the AB-Bind dataset, with the wild-type residue on one axis and the mutant residue on the other.

3.2.2 SKEMPI and SKEMPI 2.0 datasets

The SKEMPI dataset [61] contains 3047 binding free energy changes upon mutation, assembled from the scientific literature, for protein-protein heterodimeric complexes with experimentally determined structures. It includes both single-point and multi-point mutations. Within the whole database, there are 2317 single-point mutation data entries, called the SKEMPI S2317 set. Recently, Xiong et al. selected a subset of 1131 non-redundant interface single-point mutations, denoted set S1131, from the SKEMPI S2317 set [91]. The same authors applied several methods to the SKEMPI S1131 set [91], including BindProfX [91], Profile-score [54, 84], FoldX [79], BeAtMuSiC [22], SAAMBE [66], and Dcomplex [57].

The SKEMPI 2.0 database [47] is the updated version of the SKEMPI database and contains new mutations collected after the first version was released. There are 7085 mutations in the SKEMPI 2.0 dataset. We choose only single-point mutations with full energy change information, called set S4947. Since binding energy changes upon mutation (∆∆G) are not directly given in the SKEMPI 2.0 database, the following formulas are used to obtain the ∆∆G value for each mutation with a given k_d value:

\Delta G = \frac{8.314}{4184} \times (273.15 + 25) \times \ln(k_d)

\Delta\Delta G = \Delta G_{MT} - \Delta G_{WT}

Set S4169 is directly adopted from the mCSM-PPI2 paper [72] and is also derived from the SKEMPI 2.0 dataset. Set S8338 is derived from the S4169 set by including each reverse mutation with the energy change taken with a negative sign [72].

3.2.3 Preprocessing of the datasets

For the aforementioned databases, the crystal structure of the wild type, the mutation information, and the binding affinity change are given for each data entry. To calculate our structure-based topological features, the structures of the mutant type are also needed. The scap utility in the Jackal package [90] is used to generate mutant structures; this utility predicts side-chain conformations on a given backbone. To fix missing atoms and residues, the profix utility in the Jackal package [90] is applied to all raw PDB files.

3.3 Topological Modelling

3.3.1 Persistent homology

In algebraic topology, the atomic coordinates of protein structures are organized into simplicial complexes, which are the basic elements of chains and homology groups, enabling the topological description of biomolecular datasets. Persistent homology further introduces a filtration parameter to examine biomolecular datasets at a variety of spatial scales.
3.3 Topological Modelling

3.3.1 Persistent homology

In algebraic topology, the atomic coordinates of protein structures are organized into simplicial complexes, which are the basic elements of chains and homology groups, enabling the topological description of biomolecular datasets. Persistent homology further introduces a filtration parameter to examine biomolecular datasets at a variety of spatial scales. Element-specific persistent homology embeds chemical and biological information in topological representations by controlling certain atomic types in each simplicial complex.

3.3.1.1 Simplicial complex and filtration

A (geometric) simplicial complex is a finite collection of sets of affinely independent points (i.e., atomic positions) K = {σ_i}_i, where the elements of σ_i are called vertices and σ_i is called a k-simplex if it has k + 1 distinct vertices. If τ ⊆ σ_i, τ is called a face of σ_i. A simplicial complex K is valid if τ ⊆ σ_i for σ_i ∈ K implies τ ∈ K, and if the intersection of two simplices is either a simplex in K or empty. In practice, it is preferable to characterize point clouds or atomic positions at various spatial scales rather than with a single fixed-scale simplicial complex. To construct a scale-changing simplicial complex, consider a function f : K → R satisfying f(τ) ≤ f(σ) whenever τ ⊆ σ. Given a real value x, f induces a subcomplex of K via the sub-level set K(x) = {σ ∈ K | f(σ) ≤ x}. Since K is finite, the range of f is also finite, and the induced subcomplexes, when ordered, form a filtration of K,

∅ ⊂ K(x_1) ⊂ K(x_2) ⊂ · · · ⊂ K(x_ℓ) = K. (3.1)

There are many constructions of f; a widely used one for point clouds is the Vietoris-Rips complex. Given K as the collection of all possible simplices from a set of atomic coordinates up to a fixed dimension, the filtration function is defined as f_rips(σ) = max{d(v_i, v_j) | v_i, v_j ∈ σ} for σ ∈ K, where d is a predefined distance function between vertices, such as the Euclidean distance. In practice, an upper bound on the filtration value is set to avoid excessively large simplicial complexes. Another efficient construction, called the alpha complex, is often used to characterize geometry; we denote its filtration function by f_α : DT(X) → R, where DT(X) is the simplicial complex induced by the Delaunay triangulation of the set of atomic coordinates X. The filtration function is defined as f_α(σ) = max{(1/2) D_e(v_i, v_j) | v_i, v_j ∈ σ} for σ ∈ DT(X), where D_e is the Euclidean distance. Returning to molecular structures, the filtration of simplicial complexes describes the topological characteristics of interaction hypergraphs under various interaction range assumptions.

3.3.1.2 Homology and persistence

The homology group (in singular homology) of a simplicial complex topologically depicts hole-like structures of different dimensions. Given a simplicial complex K, a k-chain is a finite formal sum of k-simplices in K, Σ_i a_i σ_i. There are many choices for the coefficients a_i, and we choose a_i ∈ Z_2 for simplicity. The kth chain group, denoted C_k(K), consists of all k-chains under the addition induced by the addition of coefficients. A boundary operator ∂_k : C_k(K) → C_{k−1}(K) connects chain groups of different dimensions by mapping a chain to the alternating sum of its codimension-1 faces. It suffices to define the boundary operator on simplices,

∂_k({v_0, · · · , v_k}) = Σ_{i=0}^{k} (−1)^i {v_0, · · · , v̂_i, · · · , v_k}, (3.2)

where v̂_i denotes the absence of vertex v_i. The kth cycle group, denoted Z_k(K), is defined as the kernel of ∂_k, whose members are called k-cycles. The kth boundary group, denoted B_k(K), is the image of ∂_{k+1}. It follows that B_k(K) is a subgroup of Z_k(K) by the property of boundary maps, ∂_k ∘ ∂_{k+1} = 0. The kth homology group H_k(K) is defined as the quotient group Z_k(K)/B_k(K).
The equivalence classes in H_k(K) correspond to k-dimensional holes in K that cannot be deformed into each other by adding/subtracting the boundary of a subcomplex. Given a filtration as in Eq. (3.1), in addition to characterizing the homology group at each frame, H_k(K(x_i)), we also want to track how topological features persist along the sequence. Viewing the H_k(K(x_i)) as vector spaces, together with the linear transformations induced by the inclusion maps, gives a persistence module,

H_k(K(x_1)) → H_k(K(x_2)) → · · · → H_k(K(x_ℓ)). (3.3)

An interval module with respect to [b, d), denoted I_[b,d), is defined as a collection of vector spaces {V_i} connected by linear maps f_i : V_i → V_{i+1}, where V_i = Z_2 for i ∈ [b, d) and V_i = 0 elsewhere, and f_i is the identity map when possible and 0 otherwise. The persistence module in Eq. (3.3) can be decomposed as a direct sum of interval modules, ⊕_{[b,d)∈B} I_[b,d). Each I_[b,d) corresponds to a homology class that appears at filtration value b and disappears at filtration value d; b and d are usually called the birth and death values. The collection B of these pairs encodes the evolution of k-dimensional holes as the filtration parameter varies, and thus records the topological configuration of the input point cloud under different interaction ranges when a distance-based filtration is used. Fig. 3.2 illustrates filtration and persistence.

[Figure: filtration of a planar point cloud and the resulting birth-death persistence diagram]
Figure 3.2: Illustration of filtration and persistence diagram of a set of points on a plane.

3.3.2 Topological representation of protein-protein interactions

The pairwise interactions between atoms are characterized by the 0th homology group, H_0 (also known as the size function [34]). The higher-dimensional homology groups encode higher-order patterns in PPI complexes. The 1st homology group (H_1), generated with a Euclidean distance-based filtration, characterizes loop- or tunnel-like structures, as shown in Fig. 3.3, whereas the H_2 homology group describes cavity structures in PPI complexes. Combining the various dimensions, we obtain a comprehensive topological description of PPIs.

[Figure: H_0 and H_1 barcodes for the wild type and the mutant]
Figure 3.3: Topological barcode change associated with a mutation. Residue Leucine in the wild type is mutated into Alanine. Barcodes are generated for carbon atoms within a cutoff of 12 Å of the mutant residue.

Given a PPI system represented by a set of atomic coordinates (i.e., a point cloud), a topological representation should be able to extract patterns of different biological or chemical aspects, such as hydrogen bonds between oxygen and nitrogen atoms, hydrophobicity, polarizability, etc. To achieve this goal, we construct simplicial complexes using selected subsets of atomic coordinates and modified distance matrices. For constructing element-specific and site-specific persistent homology, we classify the atoms in a PPI complex into various subsets: (1) A_m: atoms of the mutation site. (2) A_mn(r): atoms in the neighborhood of the mutation site within a cutoff distance r. (3) A_ab(r): antibody atoms within a distance r of the binding site. (4) A_ag(r): antigen atoms within a distance r of the binding site. (5) A_e(E): atoms in the system of element type E. When characterizing interactions between atoms a_i and a_j in set A and/or set B, we use a modified distance matrix that excludes interactions between atoms from the same set,

D_m(a_i, a_j) = ∞, if a_i, a_j ∈ A or a_i, a_j ∈ B;
D_m(a_i, a_j) = D_e(a_i, a_j), if a_i ∈ A and a_j ∈ B, (3.4)

where D_e is the Euclidean distance.
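The following is a minimal sketch of Eq. (3.4) and a Vietoris-Rips H_0 barcode computation. It uses the gudhi library for illustration (this work itself used Javaplex and the R TDA package, as listed in Section 3.5.4), and random coordinates stand in for the element-specific atom subsets of a real complex.

```python
import numpy as np
import gudhi  # illustration only; not the software used in this work

def modified_distance_matrix(coords_a, coords_b):
    """Eq. (3.4): Euclidean distance between atoms from different sets,
    and a large sentinel ("infinity") between atoms of the same set, so
    that only A-B interactions contribute to the barcodes."""
    pts = np.vstack([coords_a, coords_b])
    n_a = len(coords_a)
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    same = np.zeros(d.shape, dtype=bool)
    same[:n_a, :n_a] = True
    same[n_a:, n_a:] = True
    d[same] = 1e6  # finite stand-in for infinity, above any filtration cutoff
    np.fill_diagonal(d, 0.0)
    return d

# H0 barcodes from a Rips filtration on the modified distances,
# truncated at a 12 A filtration cutoff as in the text.
d = modified_distance_matrix(np.random.rand(10, 3) * 10, np.random.rand(8, 3) * 10)
lower = [list(d[i, :i]) for i in range(len(d))]  # lower-triangular form
rips = gudhi.RipsComplex(distance_matrix=lower, max_edge_length=12.0)
tree = rips.create_simplex_tree(max_dimension=1)
h0 = [(birth, death) for dim, (birth, death) in tree.persistence() if dim == 0]
```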
Specific designations for sets A and B are given in Table 3.1, which summarizes the various topological barcodes.

Table 3.1: Summary of topological descriptors. The barcodes are generated for both the mutant and wild type complexes.

A | B | Dis. | Complex | Dim.
A_m ∩ A_e(E1) | A_mn(r) ∩ A_e(E2) | D_mod | Rips | H_0
A_m ∩ A_e(E1) | A_mn(r) ∩ A_e(E2) | D_e | alpha | H_1, H_2
A_ab(r) ∩ A_e(E1) | A_ag(r) ∩ A_e(E2) | D_mod | Rips | H_0
A_ab(r) ∩ A_e(E1) | A_ag(r) ∩ A_e(E2) | D_e | alpha | H_1, H_2

Fig. 3.4 illustrates the point cloud generation for the antibody-antigen complex 1DQJ.

[Figure]
Figure 3.4: Illustration of the point cloud generation of antibody-antigen complex 1DQJ

3.4 Auxiliary features

As mentioned in the previous section, element-specific and site-specific persistent homology is able to embed chemical information into topological representations. However, there is other important chemical and physical information that has not been incorporated into persistent homology but could improve the predictive power of the present topological model. In this work, all non-topological features are called auxiliary features. These features are appended to the machine learning model at the last step of GBT training or at the dense layer of a neural network. In general, auxiliary features are categorized into atom-level and residue-level features.

3.4.1 Atom-level features

Atoms can be categorized into different groups for feature generation according to several criteria. First, with respect to atom type, we divide atoms into 7 groups: C, N, O, S, H, all heavy atoms, and all atoms. Additionally, with respect to distance to the mutation site, atoms are grouped into 3 groups: mutation site atoms, near-mutation-site atoms (within 10 Å of the mutation site), and all atoms. Finally, similar to the treatment in topological feature generation, 3 cases are considered: wild type, mutant type, and their difference.

• Surface areas. Atom-level solvent excluded surface areas are computed with our in-house software ESES [56]. All atom areas within the same group are summed into one feature. In this manner, a total of 7*3*3 = 63 features is generated.

• Partial charges. The partial charge of each atom is generated with the pdb2pqr software [25] using the Amber force field. After this procedure, the radius and the partial charge of each atom are available. The sum of the partial charges and the sum of the absolute values of the partial charges in each atomic group are counted as partial charge features. In this way, a total of 7*3*3*2 = 126 features is generated.

• Coulomb interactions. The Coulomb energy of the ith atom is calculated as the sum of the pairwise Coulomb energies with every other atom,

C_i = Σ_{j, j≠i} k_e q_i q_j / r_ij. (3.5)

Here, k_e is Coulomb's constant. Since multiplying by a constant coefficient has no effect on the machine learning result, we use k_e = 1 in our calculation. In Coulomb interaction feature generation, only 5 groups (C, N, O, S, and all heavy atoms) are counted. Both the Coulomb interaction energy and its absolute value are counted. In this manner, a total of 5*3*3*2 = 90 features is generated.

• van der Waals interactions. The van der Waals energy of the ith atom is modeled as the sum of pairwise Lennard-Jones potentials with every other atom. Only 5 groups (C, N, O, S, and all heavy atoms) are counted:

V_i = Σ_{j, j≠i} ε [ ((r_i + r_j)/r_ij)^12 − 2 ((r_i + r_j)/r_ij)^6 ]. (3.6)

Here, ε is the depth of the potential well.
Since multiplying by a constant coefficient has no effect on the machine learning result, we use ε = 1 in our calculation. In this manner, a total of 5*3*3 = 45 features is generated.

• Electrostatic solvation free energy. The electrostatic solvation free energy of each atom is calculated with the Poisson-Boltzmann model using our in-house software MIBPB [97, 96, 39]. Summing the solvation free energies within each atom group generates 7*3*3 = 63 features.

3.4.2 Residue-level features

• Mutation site neighborhood amino acid composition. The residues within 10 Å of the mutation site are regarded as neighbor residues; distances between residues are measured between their alpha carbon atoms. Amino acid residues are divided into 5 groups: hydrophobic, polar, positively charged, negatively charged, and special cases. The count and percentage of the 5 groups of amino acids in the neighborhood are taken as the environment composition features of the mutation site, which leads to 5*2 = 10 features. In addition, the sum, average, and variance of the residue volumes, surface areas, weights, and hydropathy scores are generated as environment chemical and physical features of the mutation site, which leads to 3*4 = 12 features. In this manner, 10+12 = 22 features are generated.

• pKa shifts. The pKa values of the 7 ionizable amino acids, namely ASP, GLU, ARG, LYS, HIS, CYS, and TYR, are calculated using the PROPKA software [4]. The differences in pKa values between a wild type and its mutant are taken as pKa shifts. The maximum, minimum, sum, sum of absolute values, and minimum of the absolute values of the total pKa shifts are calculated, which leads to 5 features. In addition to the shifts over all groups, the sum and the sum of the absolute values of the pKa shifts for each of the 7 ionizable amino acid groups are calculated, which leads to 2*7 = 14 features. In this manner, 5+14 = 19 features are generated.

• Secondary structures. Using the SPIDER2 software [92], the probability scores of the mutation site residue being coil, helix, or strand are calculated, as well as the torsion angles. The wild type, the mutant type, and their difference are used as secondary structure features. In this manner, 4*3 = 12 features are generated.

3.5 Machine learning architecture

A major challenge in the prediction of binding affinity changes upon mutation for PPIs is that the data are highly complex, owing to the 3D structures, while the datasets are relatively small. To overcome this difficulty, we designed a hybrid machine learning algorithm combining a CNN and GBTs. The topologically simplified description of the 3D structures is converted into concise features by the CNN module. The GBT module then builds robust predictors with effective control of overfitting.

3.5.1 TopGBT: Topology based gradient boosting tree model

Ensemble methods are a class of machine learning algorithms that build a powerful model from weak learners. They improve on the performance of the weak learners under the assumption that the individual learners are likely to make different mistakes, so that aggregating the weak learners reduces the overall error. In this work, we use GBTs, which add a tree to the ensemble according to the current prediction error on the training data. This method performs well when there is a moderate number of features and is relatively robust against hyperparameter tuning and overfitting. The implementation provided by the scikit-learn package (version 0.18.1) [65] is used.
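A minimal sketch of this GBT setup is given below. The hyperparameters are those reported in Section 3.5.4, while the feature matrix and labels are random placeholders for the real topological and auxiliary features.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

# Hypothetical stand-ins for the real feature matrix and the
# experimental ddG labels of the S645 set.
X = np.random.rand(645, 540)
y = np.random.randn(645)

# Hyperparameters as reported in Section 3.5.4; reduce n_estimators for
# quick experiments, since 20000 trees train slowly.
gbt = GradientBoostingRegressor(n_estimators=20000, max_depth=6,
                                min_samples_split=3, learning_rate=0.001)
scores = cross_val_score(gbt, X, y, cv=10)  # 10-fold cross-validation
```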
3.5.2 TopCNN: Topology based convolutional neural network model

The CNN is one of the most successful deep learning architectures. A regular CNN is a special case of a multilayer artificial neural network in which only local connections are allowed between convolution layers and the weights are shared across different locations. We use a topology-based CNN (TopCNN) as an intermediate model. Specifically, we feed vectorized 0th-dimensional topological barcode (H_0) features into CNNs to extract higher-level features for the downstream model.

3.5.3 TopNetTree: Topology based network tree model

The CNN automatically extracts high-level features from the 0th-dimensional topological barcodes (H_0). These CNN-extracted features are combined with features constructed from the higher-dimensional topological barcodes, H_1 and H_2, as inputs to GBTs. Specifically, we build a supervised CNN model with the PPI ∆∆G as labels. After the model is trained, we feed the flatten-layer outputs into a GBT model to rank their importance. Based on this importance, a subset of CNN features is combined with other features, such as the statistics of the H_1 and H_2 barcodes, for the final GBT model, as shown in Fig. 3.5. The GBT is used for its robustness against overfitting, its good performance on moderately small datasets, and its model interpretability.

[Figure: flowchart from the protein complex structure, mutation position, and mutation type, through alpha/Rips complex topological feature generation; H_0 features pass through a CNN (convolution, pooling and dropout, flattening) with importance selection, and are combined with H_1/H_2 features and atom/protein-level auxiliary features in a gradient boosting tree that outputs the predicted result]
Figure 3.5: Illustration of the proposed TopNetTree model.

3.5.4 Model parametrization and software used

The details of the model parameters and software packages are given below.

TopGBT: Topology based GBT model
• H_1 and H_2 features. Element-specific persistent homology H_1 and H_2 barcodes are constructed as described in Table 3.1 with cutoff value r = 12 Å. We consider the wild type and mutant complexes. For each barcode, we extract birth, death, and persistence information. Statistical values, namely the sum, min, max, mean, and standard deviation, are computed from these barcodes to generate the H_1 and H_2 features, giving rise to a total of 540 features.

TopCNN: Topology based CNN model
• H_0 features. The same as described above, except for a finer bin size of 0.25 Å, which leads to a total of 1296 features for the CNN.
• Four 1D convolutional layers and one dropout layer are used in the CNN model.

TopNetTree: Topology based network tree model
• H_0 features. The top 300 high-level CNN features are selected according to their feature importance.
• The H_1 and H_2 auxiliary features are the same as those in the TopGBT model.

Model parameters
• The CNN network structure and parameters are shown in Fig. 3.6. The network contains two 1D convolutional layers of 64 channels, two 1D convolutional layers of 128 channels, and one flatten layer. On the convolutional dimension, a 12 Å cutoff and a 0.25 Å bin size were chosen, so that dimension has 48 bins. Other parameters of the CNN are kernel_initializer = lecun_uniform, optimizer = adam, and epochs = 2000.

[Figure]
Figure 3.6: Illustration of CNN parameters.

• GBT parameters: n_estimators = 20000, max_depth = 6, min_samples_split = 3, and learning_rate = 0.001.
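To make the TopCNN architecture concrete, here is a minimal Keras sketch. The channel count is inferred so that 48 bins × 27 channels = 1296 matches the stated H_0 feature count, and the kernel size and dropout rate are assumptions, not reported values.

```python
from tensorflow import keras
from tensorflow.keras import layers

N_BINS = 48      # 12 A cutoff / 0.25 A bin size, as stated above
N_CHANNELS = 27  # inferred so that 48 * 27 = 1296 matches the H0 feature count

model = keras.Sequential([
    layers.Input(shape=(N_BINS, N_CHANNELS)),
    # Two 64-channel and two 128-channel 1D convolutions; the kernel size
    # of 3 is an assumption.
    layers.Conv1D(64, 3, activation="relu", kernel_initializer="lecun_uniform"),
    layers.Conv1D(64, 3, activation="relu", kernel_initializer="lecun_uniform"),
    layers.Conv1D(128, 3, activation="relu", kernel_initializer="lecun_uniform"),
    layers.Conv1D(128, 3, activation="relu", kernel_initializer="lecun_uniform"),
    layers.Dropout(0.5),   # dropout rate is an assumption
    layers.Flatten(),      # flatten-layer outputs feed the downstream GBT
    layers.Dense(1),       # ddG regression head used to supervise the CNN
])
model.compile(optimizer="adam", loss="mse")
# model.fit(X_h0, y, epochs=2000)  # X_h0: binned H0 barcodes (hypothetical name)
```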
Software used
• GBT. scikit-learn (version 0.18.1) [65] is used for the gradient boosting regressor.
• CNN. The Keras package (version 2.0.2) [18] is used for the convolutional neural network model.
• Persistent homology features. Javaplex [1] is used to generate the H_0 barcodes, and the TDA package in R [28] is used to generate the H_1 and H_2 barcodes.

Time and memory cost
• All models are generated and tested on the computing facilities of Michigan State University's High Performance Computing Center (HPCC). 8 GB of memory and 5 CPU cores are requested for each feature generation job.
• The average running time for generating the topological features of one sample is 1.01 min (including the time for generating the mutant structure).
• The average running time for generating the auxiliary features is 9.21 min.

CHAPTER 4
LTP MODEL: APPLICATION OF LOCAL TOPOLOGICAL CHARACTERISTICS IN PROTEIN FOLDING ENERGY CHANGE UPON MUTATION

4.1 Introduction

Mutagenesis is a process by which the genetic information of an organism is changed, resulting in a mutation. It may occur due to exposure to natural mutagens such as ultraviolet (UV) light, to industrial or environmental mutagens such as benzene or asbestos, or through deliberate mutagenesis for the purposes of genetic research. Single nucleotide polymorphisms (SNPs) are one of the most important kinds of mutation in genetic research. Each human individual carries around 10000-20000 non-synonymous single-nucleotide polymorphisms (nsSNPs) in the genome [71]. Some of these nsSNPs do no harm to protein function, but the remaining loss-of-function nsSNPs are regarded as the most common cause of human heritable diseases [95, 52, 69]. Given the close relation between mutagenesis and certain types of disease, research on mutagenesis has always been at the cutting edge of molecular biology. In thermodynamics, a mutation can be assessed by the stability change ∆∆G: ∆G is the free energy change from the unfolded state to the folded state, and the stability change is

∆∆G = ∆G_w − ∆G_m.

Because existing experimental methods for determining the ∆∆G value of a mutation are expensive and time-consuming, fast computational methods for calculating mutation stability changes are urgently needed. Current computational approaches can be categorized into three types: physics-based methods, empirical methods, and machine learning methods. In physics-based methods, molecular mechanics (MM) is typically used to model the mutation, as in the knowledge-modified MM/PBSA approach [40] and EASE-MM [31]. In empirical models, empirical functions and potential terms are used to model the stability change, with weights determined from experimental data, as in the Rosetta (high) protocols [48]. The last category is the knowledge-based machine learning method, e.g., STRUM [70]. In this type of method, various features that cannot easily be modeled by physical or potential terms are taken as input to the machine learning protocol. In the following chapter, we apply a local topological descriptor to characterize protein structure and build a machine learning model to predict protein folding energy changes upon mutation. Our model achieves a performance of Rp = 0.78 on the S2648 dataset, which exceeds most state-of-the-art models.

4.2 Dataset

In this section we introduce the ProTherm protein mutation database. The selection criteria and preprocessing steps are described below.

4.2.1 ProTherm protein mutation database S2648 and S350

ProTherm [5] is a protein mutation database.
From the database, 2648 different point mutations (S2648) in 131 proteins are chosen according to the following criteria:
• Only mutations in globular proteins were considered.
• Only mutant proteins whose experimental structure is available were taken.
• Only single-site mutations were considered.
• Mutations in heme proteins were considered only if the stability measurements were performed on the apo form of the protein and the structure of this apo form is available; the interactions between residues and the heme are thus not taken into account.
• Mutations that destabilize the structure by more than 5 kcal/mol and mutations involving a proline were not considered.

A subset of 350 mutants corresponding to 67 different proteins was randomly selected as the evaluation set, namely the S350 set [21].

4.2.2 Preprocessing of dataset

For the aforementioned database, the crystal structure of the wild type, the mutation type, and the folding energy change are given for each data entry. To calculate our structure-based topological features, the structures of the mutant type are also needed. The scap utility in the Jackal package [90] is used to generate mutant structures; this utility predicts side-chain conformations on a given backbone. To fix missing atoms and residues, the profix utility in the Jackal package [90] is applied to all raw pdb files.

4.3 Local topological characterization of molecules

Geometric modeling is one of the crucial parts of molecular property prediction. With the power of differential geometry tools and the Poincare-Hopf index theorem, a molecule can be interpreted as a set of topological variables, which can then be used as input features for a machine learning model. In this section, we introduce and discuss the topological tools for molecular characterization.

4.3.1 Molecular surface representations and molecular density function

With the assistance of computational meshing, molecular area, volume, and more detailed structural information become available through molecular surface modeling. Moreover, with the generated molecular surface, many molecular interaction properties, including protein-ligand binding energy, protein B-factors, and protein folding energy change upon mutation, can be predicted with higher accuracy. Several molecular surface models have been proposed. The van der Waals surface (vdWS) of a molecule is a surface representation based on hard cutoffs at the van der Waals radii of the individual atoms; by this definition, the vdWS is not smooth at the intersection of two or more atoms. The solvent accessible surface (SAS) is defined by tracing the center of a probe sphere rolling over the van der Waals surface of the molecule. Similarly, the solvent excluded surface (SES) is obtained by tracing the inward-facing contact of the spherical probe rolling over the vdWS of the molecule. By these definitions, the vdWS lies inside the SES, the SES is smaller than the SAS, and the SES is relatively smoother than the other surfaces, although it still has singularities in the intersecting regions. An example of the vdWS, SAS, and SES is shown in Fig. 4.1.

[Figure]
Figure 4.1: Illustration of the van der Waals surface (vdWS, yellow region), solvent accessible surface (SAS, red dotted margin), and solvent excluded surface (SES, blue dotted margin).

Instead of using hard cutoffs at the atomic radii, smooth biomolecular surfaces can be generated using smooth density functions.
The rigidity index is a smooth function that describes the interaction of two particles, decaying with distance, via the generalized exponential kernel

Φ(r_ij; η_ij) = e^(−(r_ij/η_ij)^κ), κ > 0, (4.1)

or the generalized Lorentz kernel

Φ(r_ij; η_ij) = 1 / (1 + (r_ij/η_ij)^v), v > 0. (4.2)

Here η_ij is a constant characteristic distance between the particles. By summing all pairwise rigidity indices, one obtains a continuous rigidity density,

µ(r) = Σ_{j=1}^{N} w_j Φ(r_ij; η_ij). (4.3)

An example of a 2D exponential density function with different η values is shown in Fig. 4.2; subfigures (a), (b), (c), and (d) are generated from density functions with η = 1, 2, 3, and 4, respectively, for three atoms centered at (0,0), (3,0), and (0,3). With a larger characteristic distance constant η, the atomic probes tend to merge into one connected surface.

[Figure: panels (a)-(d)]
Figure 4.2: An example of surfaces generated by a 2-D exponential density function.

With this smooth density function, the topological characteristics can be calculated explicitly, as described in the following sections.
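A minimal sketch of Eqs. (4.1)-(4.3) is given below; the function names and the example coordinates are illustrative.

```python
import numpy as np

def exponential_kernel(r, eta, kappa=2.0):
    """Generalized exponential kernel of Eq. (4.1)."""
    return np.exp(-(r / eta) ** kappa)

def lorentz_kernel(r, eta, v=2.0):
    """Generalized Lorentz kernel of Eq. (4.2)."""
    return 1.0 / (1.0 + (r / eta) ** v)

def rigidity_density(points, atoms, eta, weights=None, kernel=exponential_kernel):
    """Continuous rigidity density of Eq. (4.3), evaluated at each point."""
    if weights is None:
        weights = np.ones(len(atoms))
    r = np.linalg.norm(points[:, None, :] - atoms[None, :, :], axis=-1)
    return (weights[None, :] * kernel(r, eta)).sum(axis=1)

# Three atoms as in Fig. 4.2 (embedded in 3D); larger eta merges the probes.
atoms = np.array([[0., 0., 0.], [3., 0., 0.], [0., 3., 0.]])
print(rigidity_density(np.array([[1.5, 1.5, 0.]]), atoms, eta=2.0))
```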
4.3.2 Evaluation of curvature

The curvature of a surface is defined by the relationship between small positional changes on the surface and the resulting changes in the surface normal. Differential geometry tools are used to evaluate the curvature. In differential geometry, the first fundamental form, denoted by the Roman numeral I, is the inner product on the tangent space of a surface in three-dimensional Euclidean space. For a parametric surface f(u, v) = (x(u, v), y(u, v), z(u, v)) in R³, the inner product of two tangent vectors is

I(a f_u + b f_v, c f_u + d f_v) = ac ⟨f_u, f_u⟩ + (ad + bc) ⟨f_u, f_v⟩ + bd ⟨f_v, f_v⟩ = E ac + F(ad + bc) + G bd. (4.4)

The coefficients of the first fundamental form are often written as a metric tensor,

(g_ij) = | E F |
         | F G |,  g_ij = f_i · f_j. (4.5)

The second fundamental form of a parametric surface is defined as follows. Let r = r(u, v) be a regular parametrization of a surface in R³, so that r_u and r_v are linearly independent for all (u, v) in the domain of r. The unit normal vector is

n = (r_u × r_v) / |r_u × r_v|. (4.6)

The second fundamental form can then be written as

II = L du² + 2M du dv + N dv², (4.7)

and its coefficient matrix in the basis {r_u, r_v} of the tangent plane is

| L M |
| M N |, (4.8)

with

L = r_uu · n, M = r_uv · n, N = r_vv · n. (4.9)

With the first and second fundamental forms, the Gaussian curvature is the ratio of their determinants,

K = det II / det I = (LN − M²) / (EG − F²), (4.10)

and the mean curvature is half the trace of (II)(I⁻¹),

H = (1/2) Trace((II)(I⁻¹)). (4.11)

For a surface parametrized as S = (x, y, f(x, y)), the definitions of K and H give, in terms of f,

K = (f_xx f_yy − f_xy²) / (1 + f_x² + f_y²)², (4.12)

H = [(1 + f_x²) f_yy − 2 f_x f_y f_xy + (1 + f_y²) f_xx] / [2 (1 + f_x² + f_y²)^(3/2)]. (4.13)

For a given density function Φ(x, y, z), the level set surface is obtained by setting Φ(x, y, z) = S₀. The implicit function theorem then states that, locally, there exists a function z = f(x, y) that parametrizes the surface as (x, y, f(x, y)). By differentiating Φ(x, y, f(x, y)) = S₀ with respect to x and y, one obtains

Φ_x + Φ_z f_x = 0, f_x = −Φ_x / Φ_z,
Φ_y + Φ_z f_y = 0, f_y = −Φ_y / Φ_z. (4.14)

Differentiating again with respect to x and y, we can express f_xx, f_xy, and f_yy through the density function Φ:

f_xx = −[(Φ_xx + Φ_xz f_x) Φ_z − Φ_x (Φ_xz + Φ_zz f_x)] / Φ_z²,
f_yy = −[(Φ_yy + Φ_yz f_y) Φ_z − Φ_y (Φ_yz + Φ_zz f_y)] / Φ_z²,
f_xy = −[(Φ_xy + Φ_xz f_y) Φ_z − Φ_x (Φ_yz + Φ_zz f_y)] / Φ_z², (4.15)

with f_x = −Φ_x/Φ_z and f_y = −Φ_y/Φ_z. Plugging Eqs. (4.14) and (4.15) into Eqs. (4.12) and (4.13) yields explicit formulas for H and K in terms of the derivatives of the density function. As these formulas show, H and K depend only on the gradient ∇Φ = (∂Φ/∂x, ∂Φ/∂y, ∂Φ/∂z), the Hessian Hess(Φ), and the adjoint of the Hessian, Hess*(Φ), where

Hess(Φ) = | Φ_xx Φ_xy Φ_xz |
          | Φ_xy Φ_yy Φ_yz |
          | Φ_xz Φ_yz Φ_zz |, (4.16)

Hess*(Φ) = | Φ_yy Φ_zz − Φ_yz Φ_zy   Φ_yz Φ_zx − Φ_yx Φ_zz   Φ_yx Φ_zy − Φ_yy Φ_zx |
           | Φ_xz Φ_zy − Φ_xy Φ_zz   Φ_xx Φ_zz − Φ_xz Φ_zx   Φ_xy Φ_zx − Φ_xx Φ_zy |
           | Φ_xy Φ_yz − Φ_xz Φ_yy   Φ_yx Φ_xz − Φ_xx Φ_yz   Φ_xx Φ_yy − Φ_xy Φ_yx |. (4.17)

In terms of ∇Φ, Hess(Φ), and Hess*(Φ),

K = ∇Φ · Hess*(Φ) ∇Φᵀ / |∇Φ|⁴, (4.18)

H = (∇Φ · Hess(Φ) ∇Φᵀ − |∇Φ|² Trace(Hess(Φ))) / (2 |∇Φ|³). (4.19)

4.3.3 Critical points and the Poincare-Hopf index theorem

A critical point (CP) of a field ρ is a point where ∇ρ = 0. Gradient paths always originate at one critical point and terminate at another. The properties of a critical point can be determined from the Hessian matrix. The sum of the eigenvalues of the Hessian equals the Laplacian of the density ρ,

∇²ρ = γ₁ + γ₂ + γ₃ = ∂²ρ/∂x² + ∂²ρ/∂y² + ∂²ρ/∂z². (4.20)

Since the Hessian matrix is symmetric, all of its eigenvalues are real. Based on the signs of the three eigenvalues of the Hessian, one can determine whether a critical point is a local maximum, a local minimum, or a saddle point:

• if the Hessian is positive definite at point x (three positive eigenvalues), then x is an isolated local minimum;
• if the Hessian is negative definite at point x (three negative eigenvalues), then x is an isolated local maximum;
• if the Hessian has both positive and negative eigenvalues at x, then x is a saddle point.

If the determinant of the Hessian at x is zero, x is called a degenerate critical point. For brevity, the rank of a critical point is defined as the number of non-zero eigenvalues of the Hessian, and the signature of a critical point as the algebraic sum of the signs (+1 or −1) of its eigenvalues. For a non-degenerate critical point, the signature can be 3, 1, −1, or −3.

[Figure]
Figure 4.3: Example of different critical points in a standard cube. Nucleic critical points (NCP) in red, bond critical points (BCP) in yellow, ring critical points (RCP) in cyan, and cage critical points (CCP) in green.
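The eigenvalue-sign criteria above translate directly into code. The sketch below classifies a critical point using a finite-difference Hessian of an arbitrary scalar field; it is an illustration under the assumption of a non-degenerate critical point, not software used in this work.

```python
import numpy as np

def hessian_fd(phi, p, h=1e-4):
    """Central finite-difference Hessian of a scalar field phi at point p."""
    p = np.asarray(p, dtype=float)
    H = np.zeros((3, 3))
    for i in range(3):
        for j in range(3):
            ei, ej = np.eye(3)[i] * h, np.eye(3)[j] * h
            H[i, j] = (phi(p + ei + ej) - phi(p + ei - ej)
                       - phi(p - ei + ej) + phi(p - ei - ej)) / (4 * h * h)
    return H

def classify_critical_point(phi, p):
    """Classify a non-degenerate critical point by Hessian eigenvalue signs."""
    eigvals = np.linalg.eigvalsh(hessian_fd(phi, p))
    if np.all(eigvals > 0):
        return "local minimum"
    if np.all(eigvals < 0):
        return "local maximum"
    return "saddle point"

# Example: density of two atom-like Gaussian probes; the midpoint between
# the two centers behaves like a bond critical point (a saddle).
phi = lambda p: (np.exp(-np.sum((p - np.array([0., 0., 0.])) ** 2))
                 + np.exp(-np.sum((p - np.array([3., 0., 0.])) ** 2)))
print(classify_critical_point(phi, np.array([1.5, 0.0, 0.0])))  # saddle point
```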
Applying the previous definitions to a molecular density field, we can define the following molecular critical points:
• Nucleic critical point (NCP): the nuclear center of an atom, which is a local maximum of the field.
• Bond critical point (BCP): a saddle point at the bond center between two atoms.
• Ring critical point (RCP): the center of a ring structure, which is a saddle point of the field.
• Cage critical point (CCP): the center of a cage, which is a local minimum of the field.

Examples of the different types of critical points are shown in Figs. 4.4 and 4.5 for the structure of naphthalene (C10H8), with NCPs in red, BCPs in yellow, and RCPs in cyan.

[Figure]
Figure 4.4: Density field of naphthalene

[Figure]
Figure 4.5: Gradient field of naphthalene

The properties of critical points are also closely related to the Euler characteristic through the Poincare-Hopf theorem:

Theorem 1 (Poincare-Hopf) Let v be a vector field with only isolated zeros on a manifold M without boundary. Then the sum of the indices of the zeros of the vector field equals the Euler characteristic of the manifold.

According to the theorem, the Poincare indices of NCPs, BCPs, RCPs, and CCPs are 1, −1, 1, and −1, respectively, so the Euler characteristic χ(φ) can be calculated from the numbers of critical points as

N_n − N_b + N_r − N_c = χ(φ). (4.21)

Combining all the properties of critical points, we obtain the following table of molecular critical points of a density field:

Table 4.1: Properties of critical points

Critical point type | rank | signature | Poincare index | max/min/saddle
Nucleic | 3 | 1 | 1 | local maximum
Bond | 3 | −1 | −1 | saddle
Ring | 3 | 1 | 1 | saddle
Cage | 3 | −1 | −1 | local minimum

4.3.4 Mutation site based element specific density field generation

In predicting protein folding energy change upon mutation, we want to focus on the interactions involving the mutation and exclude unrelated interactions; to this end, mutation-site-based density field generation is applied. Two sets of atoms are used: the density generation set and the eigenvalue evaluation set. We choose all atom centers in the mutated residue as the density generation set D. The corresponding density field is then defined as

µ(r_i) = Σ_{j∈D} Φ(r_ij; η_ij). (4.22)

After generating the density field, we evaluate the eigenvalues of the Hessian matrix of the density field at certain points, namely the eigenvalue evaluation set. Here, we take all atom centers near the mutation site within a cutoff (excluding the mutated residue's atoms) as the evaluation set E,

⟨γ_i1, γ_i2, γ_i3⟩ = eig(Hess(µ(r_i))), r_i ∈ E. (4.23)

To incorporate chemical information into the density field, we further partition the density generation set and the eigenvalue evaluation set by atom type over {C, N, O, S}. The raw feature set for atom type a1 in D and atom type a2 in E can then be expressed as

F_{a1,a2} = {eig(Hess(µ(r_i)))}_{i∈E, A_i=a2}, with µ(r_i) = Σ_{j∈D, A_j=a1} Φ(r_ij; η_ij). (4.24)
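The sketch below illustrates Eqs. (4.22)-(4.23): the mutation-site density field is built from set D and its Hessian eigenvalues are evaluated at each atom of set E. A finite-difference Hessian is used for simplicity (the closed-form Hessian is derived in the next section), and the coordinates are hypothetical placeholders.

```python
import numpy as np

def mutation_site_eigen_features(coords_D, coords_E, eta=5.5, kappa=2.0, h=1e-4):
    """Hessian eigenvalues of the mutation-site density field (Eq. 4.22),
    evaluated at each atom of the evaluation set E (Eq. 4.23)."""
    def mu(r):
        d = np.linalg.norm(coords_D - r, axis=1)
        return np.exp(-(d / eta) ** kappa).sum()

    feats = []
    for p in coords_E:
        H = np.zeros((3, 3))
        for i in range(3):
            for j in range(3):
                ei, ej = np.eye(3)[i] * h, np.eye(3)[j] * h
                H[i, j] = (mu(p + ei + ej) - mu(p + ei - ej)
                           - mu(p - ei + ej) + mu(p - ei - ej)) / (4 * h * h)
        feats.append(np.sort(np.linalg.eigvalsh(H)))
    return np.array(feats)  # shape (len(E), 3): three eigenvalues per atom

coords_D = np.random.rand(8, 3) * 5   # hypothetical mutated-residue atoms
coords_E = np.random.rand(20, 3) * 5  # hypothetical neighborhood atoms
print(mutation_site_eigen_features(coords_D, coords_E).shape)
```

The five statistics (sum, max, min, average, standard deviation) described in Section 4.4.1 would then be computed over these per-atom eigenvalues for each atom-type pair (a1, a2).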
4.3.5 Explicit expression of the Hessian matrix

The Hessian matrix of a density function Φ depends on the second derivatives of Φ. To obtain the eigenvalues of the Hessian, we calculate the explicit form of the Hessian matrix for each density kernel. For the exponential kernel

Φ(r_ij; η_ij) = e^(−(r_ij/η_ij)^κ), κ > 0, (4.25)

the density in terms of the 3D coordinate (x, y, z) with density generation set D is

Φ(x, y, z; η) = Σ_{i∈D} e^(−([(x−x_i)² + (y−y_i)² + (z−z_i)²]^(1/2)/η)^κ), κ > 0. (4.26)

For a single center i, abbreviate

c₁ = e^(−(c₂^(1/2)/η)^κ), c₂ = (x−x_i)² + (y−y_i)² + (z−z_i)² = r²; (4.27)

the Hessian entries of Φ are obtained by summing the following per-center terms over i ∈ D. The diagonal terms ∂²Φ/∂x², ∂²Φ/∂y², ∂²Φ/∂z² are

∂²Φ/∂x² = −κ [−κ c₁ c₂^(κ−2) (x−x_i)² + 2 c₁ (0.5κ−1) c₂^(0.5κ−2) (x−x_i)² η^κ + c₁ c₂^(0.5κ−1) η^κ] / η^(2κ),
∂²Φ/∂y² = −κ [−κ c₁ c₂^(κ−2) (y−y_i)² + 2 c₁ (0.5κ−1) c₂^(0.5κ−2) (y−y_i)² η^κ + c₁ c₂^(0.5κ−1) η^κ] / η^(2κ),
∂²Φ/∂z² = −κ [−κ c₁ c₂^(κ−2) (z−z_i)² + 2 c₁ (0.5κ−1) c₂^(0.5κ−2) (z−z_i)² η^κ + c₁ c₂^(0.5κ−1) η^κ] / η^(2κ). (4.28)

The mixed terms ∂²Φ/∂x∂y, ∂²Φ/∂x∂z, ∂²Φ/∂y∂z are

∂²Φ/∂x∂y = −κ (x−x_i)(y−y_i) [−κ c₁ c₂^(κ−2) + (κ−2) c₁ c₂^(0.5κ−2) η^κ] / η^(2κ),
∂²Φ/∂x∂z = −κ (x−x_i)(z−z_i) [−κ c₁ c₂^(κ−2) + (κ−2) c₁ c₂^(0.5κ−2) η^κ] / η^(2κ),
∂²Φ/∂y∂z = −κ (y−y_i)(z−z_i) [−κ c₁ c₂^(κ−2) + (κ−2) c₁ c₂^(0.5κ−2) η^κ] / η^(2κ). (4.29)

Similarly, for the Lorentz kernel

Φ(r_ij; η_ij) = 1 / (1 + (r_ij/η_ij)^v), v > 0, (4.30)

the density in terms of (x, y, z) with density generation set D is

Φ(x, y, z; η) = Σ_{i∈D} η^v / (η^v + [(x−x_i)² + (y−y_i)² + (z−z_i)²]^(0.5v)). (4.31)

For a single center i, abbreviate

c₁ = η^v + c₂^(0.5v), c₂ = (x−x_i)² + (y−y_i)² + (z−z_i)² = r². (4.32)

The diagonal terms are

∂²Φ/∂x² = −η^v v [c₁ c₂^(0.5v−1) + (v−2) c₁ c₂^(0.5v−2) (x−x_i)² − 2v c₂^(v−2) (x−x_i)²] / c₁³,
∂²Φ/∂y² = −η^v v [c₁ c₂^(0.5v−1) + (v−2) c₁ c₂^(0.5v−2) (y−y_i)² − 2v c₂^(v−2) (y−y_i)²] / c₁³,
∂²Φ/∂z² = −η^v v [c₁ c₂^(0.5v−1) + (v−2) c₁ c₂^(0.5v−2) (z−z_i)² − 2v c₂^(v−2) (z−z_i)²] / c₁³, (4.33)

and the mixed terms are

∂²Φ/∂x∂y = −η^v v (x−x_i)(y−y_i) [(v−2) c₁ c₂^(0.5v−2) − 2v c₂^(v−2)] / c₁³,
∂²Φ/∂x∂z = −η^v v (x−x_i)(z−z_i) [(v−2) c₁ c₂^(0.5v−2) − 2v c₂^(v−2)] / c₁³,
∂²Φ/∂y∂z = −η^v v (y−y_i)(z−z_i) [(v−2) c₁ c₂^(0.5v−2) − 2v c₂^(v−2)] / c₁³. (4.34)

With the explicit Hessian matrix, the eigenvalues can easily be calculated by matrix diagonalization. Fig. 4.6 shows two eigenvalue maps of naphthalene (C10H8) using density fields with (a) η = 0.6 and (b) η = 1.0. Comparing (a1) with (a2), the small-eigenvalue map captures more bond information, while the large-eigenvalue map captures more nuclear-center information. Comparing (a) with (b), a smaller η value makes the eigenvalue maps capture more detailed topological information about the molecule. By controlling the value of η, we can extract topological information at a scale suited to the specific biomolecular property.

[Figure: panels (a1), (a2), (b1), (b2)]
Figure 4.6: An example of two eigenvalue maps of naphthalene using two different density fields

Fig. 4.7 shows three eigenvalue isosurface maps of naphthalene (C10H8) using the exponential density field with η = 0.6. For the smallest eigenvalue map (a), the isovalues are −3, −0.5, and 0 for red, green, and cyan, respectively. For the second smallest eigenvalue map (b), the isovalues are −3, −0.3, and 0. For the largest eigenvalue map (c), the isovalues are 2.5, 1, and 0.3.

[Figure: panels (a), (b), (c)]
Figure 4.7: An example of three eigenvalue isosurface maps of naphthalene using the exponential density field
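As a sketch of Eqs. (4.27)-(4.29), the following implements the analytic exponential-kernel Hessian and diagonalizes it; the result can be cross-checked against a finite-difference Hessian such as the one shown earlier. The example coordinates are hypothetical.

```python
import numpy as np

def hessian_exponential(p, centers, eta, kappa):
    """Analytic Hessian of the exponential-kernel density (Eqs. 4.27-4.29),
    summed over the density generation set. Assumes the evaluation point
    does not coincide with an atom center (c2 > 0)."""
    H = np.zeros((3, 3))
    for c in centers:
        d = p - c
        c2 = float(d @ d)                              # squared distance r^2
        c1 = np.exp(-(np.sqrt(c2) / eta) ** kappa)     # per-center kernel value
        ek, e2k = eta ** kappa, eta ** (2 * kappa)
        for i in range(3):
            for j in range(3):
                if i == j:  # diagonal terms, Eq. (4.28)
                    H[i, j] += -kappa * (
                        -kappa * c1 * c2 ** (kappa - 2) * d[i] ** 2
                        + 2 * c1 * (0.5 * kappa - 1) * c2 ** (0.5 * kappa - 2) * d[i] ** 2 * ek
                        + c1 * c2 ** (0.5 * kappa - 1) * ek
                    ) / e2k
                else:       # mixed terms, Eq. (4.29)
                    H[i, j] += -kappa * d[i] * d[j] * (
                        -kappa * c1 * c2 ** (kappa - 2)
                        + (kappa - 2) * c1 * c2 ** (0.5 * kappa - 2) * ek
                    ) / e2k
    return H

# Eigenvalues via symmetric diagonalization, as in the text.
centers = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0]])
H = hessian_exponential(np.array([1.0, 0.5, 0.2]), centers, eta=1.0, kappa=2.0)
print(np.linalg.eigvalsh(H))
```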
4.4 Prediction model design

Machine learning is a powerful tool for molecular property prediction. It can incorporate different types of features and learn the internal relations between features and properties. As shown in the previous section, many topological characteristics can be derived from the Hessian matrix. Instead of directly using those topological characteristics as the features of the machine learning model, we use the eigenvalues of the Hessian matrix as the initial features. We believe that, with the power of a machine learning model, eigenvalue features can capture deeper relations that are not limited to the known topological characteristics.

4.4.1 Eigenvalue learning feature generation

Machine learning requires the features to be comparable between samples, i.e., the number of features must be the same for every sample. Even with the same cutoff value around the mutation site, however, the number of atoms at which the density eigenvalues are evaluated can vary considerably, which leads to raw feature sets of different sizes. To make them suitable as machine learning input features, we calculate five statistics for each raw feature set F_{a1,a2}: sum, max, min, average, and standard deviation. The final feature set thus has 3 eigenvalues, 5 statistical values per set, and 4*4 = 16 combinations of atom types, for a total of 240 features per structure. We consider wild type features, mutant type features, and their difference, which brings the feature set size to 240*3 = 720.

4.4.2 Auxiliary features

Although our eigenvalue features capture much of the geometric information of the molecule, some crucial chemical and physical properties cannot be represented by geometric features alone. To further improve the quality of our prediction models, several auxiliary features are added, including charge information, electrostatic solvation free energy, secondary structure information, etc. A detailed description of the auxiliary features can be found in Chapter 3 of this thesis.

4.4.3 Machine learning model

Gradient boosting is a machine learning technique for regression problems that produces a prediction model in the form of an ensemble of weak prediction models. We use gradient boosted decision trees (GBDT) as the machine learning algorithm and feed both the eigenvalue features and the auxiliary features into the model. scikit-learn (version 0.18.1) [65] is used for the gradient boosting regressor with the following parameters: n_estimators = 20000, max_depth = 6, min_samples_split = 3, and learning_rate = 0.001. The pipeline of our model is shown below.

[Figure]
Figure 4.8: Flowchart of the eigenvalue learning model for protein folding energy change upon mutation, illustrated with the example mutation 1A3J A 12 F A (chain A, residue 12, Phe to Ala). First, the mutation site is selected in both the wild type and mutant structures, and the corresponding exponential/Lorentz density fields are generated. Then the Hessian matrix and its eigenvalues are evaluated in the near-mutation region with respect to the different element groups. Finally, the eigenvalue features and auxiliary features are fed into the GBDT model to obtain the predicted energy change.

CHAPTER 5
RESULT

5.1 Evaluation criteria

Two evaluation metrics, the Pearson correlation coefficient (Rp) and the root-mean-squared error (RMSE), are used to assess the quality of predictions.
Let x and y be the vectors of predicted values and ground truth over the n samples, respectively. Rp is defined as

Rp = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / [ √(Σ_{i=1}^{n} (x_i − x̄)²) √(Σ_{i=1}^{n} (y_i − ȳ)²) ], (5.1)

where x̄ and ȳ are the means of x and y, respectively. The RMSE is computed as

RMSE = √( Σ_{i=1}^{n} (x_i − y_i)² / n ). (5.2)

For cross-validation, the Rp and RMSE of all folds are averaged.

5.2 Model performance of TopNetTree on PPI binding free energy changes

In this section, the performance of TopNetTree on PPI binding free energy changes is evaluated on the AB-Bind and SKEMPI datasets.

5.2.1 Prediction results on the AB-Bind dataset

5.2.1.1 Overall result

Our model achieved a Pearson correlation coefficient (Rp) of 0.65 on the AB-Bind S645 dataset, which is significantly better than those of other existing methods, as shown in Table 5.1. Compared to non-machine-learning methods such as Rosetta and bASA, our method is over 100% more accurate in terms of Rp, indicating that our topology-based machine learning method has better predictive power for PPI systems. Compared to the best existing score of Rp = 0.53, given by mCSM-AB, our method is about 22% more accurate, demonstrating the power of TopNetTree. Both GBTs and neural networks are quite sensitive to systematic errors, since model training is based on optimizing the mean squared error in the loss function. The ∆∆G values of the 27 non-binders (−8 kcal/mol) do not follow the distribution of the whole dataset. Pires et al. [68] found that excluding the non-binders from the dataset significantly increases the performance of a prediction model. In our case, Rp increased from 0.65 (Fig. 5.1a) to 0.68 with the same treatment (Fig. 5.1b). We also applied a blind test on the homology structures, using the remaining samples as the training set, and achieved Rp = 0.55, as shown in Fig. 5.2.

Table 5.1: Comparison of the Pearson correlation coefficients of various methods on the AB-bind S645 set. Except for those of the present TopNetTree, TopGBT, and TopCNN, the results are adopted from Ref. [68]. Values marked * exclude the 27 non-binders.

Method | Rp
TopNetTree | 0.65/0.68*
TopGBT | 0.56
mCSM-AB | 0.53/0.56*
TopCNN | 0.53
Discovery Studio | 0.45
mCSM-PPI | 0.31
FoldX | 0.34
STATIUM | 0.32
DFIRE | 0.31
bASA | 0.22
dDFIRE | 0.19
Rosetta | 0.16

[Figure: panels (a) and (b)]
Figure 5.1: (a) Scatter plot of the TopNetTree cross-validation predictions on S645. (b) Scatter plot of the TopNetTree cross-validation predictions on S645 excluding the 27 non-binders.

[Figure]
Figure 5.2: Scatter plot of the TopNetTree blind-test predictions on the homology models.
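For completeness, Eqs. (5.1) and (5.2) can be written directly in Python; this is a minimal sketch with illustrative function names.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient, Eq. (5.1)."""
    x, y = np.asarray(x), np.asarray(y)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum())

def rmse(x, y):
    """Root-mean-squared error, Eq. (5.2)."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sqrt(((x - y) ** 2).mean())
```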
5.2.1.2 Protein-level non-overlapping test on the AB-bind S645 dataset

To further test the predictive power of our model, we applied a protein-level non-overlapping test to the AB-bind dataset. The 645 mutations in the dataset can be separated into 24 different protein-protein complexes (a complex and its homology model are merged into one category, since they are very similar). In the non-overlapping test, all mutations in one specific protein complex are used as the test set, and all other mutations form the training set. The results for the 24 protein complexes are shown in Table 5.2.

Table 5.2: Results of the non-overlapping protein-level test on the AB-bind dataset, including the Pearson correlation coefficient and the RMSE in kcal/mol.

Name | Counts | Rp | RMSE (kcal/mol)
1AK4 | 16 | 0.528 | 0.837
1BJ1 | 19 | 0.103 | 1.502
1CZ8 | 19 | 0.506 | 1.077
1DQJ | 21 | 0.568 | 1.877
1DVF | 26 | 0.553 | 1.163
1FFW | 9 | -0.043 | 1.052
1JRH | 2 | 1 | 0.812
1JTG | 5 | 0.757 | 0.549
1KTZ/HM_1KTZ | 44 | 0.866 | 0.496
1MHP | 68 | 0.505 | 3.216
1MLC | 11 | 0.397 | 1.508
1N8Z | 34 | 0.626 | 2.273
1VFB | 41 | 0.689 | 1.653
1YY9/HM_1YY9 | 21 | -0.068 | 1.531
2JEL | 43 | 0.818 | 0.954
2NYY/HM_2NYY | 53 | 0.589 | 1.238
2NZ9/HM_2NZ9 | 35 | 0.665 | 1.367
3BDY | 34 | 0.615 | 0.692
3BE1 | 34 | 0.474 | 0.941
3BN9/HM_3BN9 | 43 | 0.368 | 1.743
3HFM | 22 | 0.262 | 2.69
3K2M | 7 | 0.705 | 1.416
3NGB | 11 | 0.459 | 1.147
3NPS | 27 | 0.242 | 0.953
Total | 645 | |
Average | 27 | 0.508 | 1.362
Median | 24 | 0.541 | 1.201

5.2.1.3 Protein-level leave-one-out validation test

To further test the predictive power of our model, we also applied a protein-level leave-one-out cross-validation test to the AB-bind dataset. As above, the 645 mutations are separated into 24 protein-protein complexes (merging each complex with its homology model). Each protein complex is then treated individually as a set on which leave-one-out cross-validation is performed. The results for the 24 protein complexes are shown in Table 5.3.

Table 5.3: Results of the protein-level leave-one-out validation test on the AB-bind dataset, including the Pearson correlation coefficient and the RMSE in kcal/mol.

Name | Counts | Rp | RMSE (kcal/mol)
1AK4 | 16 | 0.139 | 0.977
1BJ1 | 19 | 0.392 | 1.350
1CZ8 | 19 | 0.671 | 0.921
1DQJ | 21 | 0.318 | 1.885
1DVF | 26 | -0.103 | 1.345
1FFW | 9 | 0.149 | 0.512
1JRH | 2 | -1 | 0.13
1JTG | 5 | -0.998 | 0.689
1KTZ/HM_1KTZ | 44 | 0.973 | 0.222
1MHP | 68 | 0.145 | 3.432
1MLC | 11 | 0.606 | 0.418
1N8Z | 34 | 0.029 | 3.023
1VFB | 41 | 0.533 | 1.755
1YY9/HM_1YY9 | 21 | 0.547 | 0.225
2JEL | 43 | 0.707 | 0.968
2NYY/HM_2NYY | 53 | 0.514 | 1.175
2NZ9/HM_2NZ9 | 35 | 0.121 | 1.732
3BDY | 34 | 0.604 | 0.548
3BE1 | 34 | 0.054 | 1.077
3BN9/HM_3BN9 | 43 | 0.281 | 1.938
3HFM | 22 | -0.121 | 2.726
3K2M | 7 | -0.993 | 1.234
3NGB | 11 | 0.411 | 1.186
3NPS | 27 | 0.099 | 0.766
Total | 645 | |
Average | 27 | 0.170 | 1.218
Median | 24 | 0.215 | 1.027

For this test, our model reached an average/median Rp of 0.170/0.215, which is significantly lower than the 10-fold cross-validation result over the entire dataset. One possible reason for this behavior is that the training set for each complex is too small, with an average of only 27 samples per complex. This result also implies that our model needs a diversity of training samples to achieve stable and consistent prediction quality.

5.2.2 Prediction results on the SKEMPI dataset

Table 5.4 shows the Pearson correlation coefficients for the 10-fold cross-validations. The proposed TopNetTree is about 15% more accurate than the best existing method.

Table 5.4: Comparison of the Pearson correlation coefficients of various methods for the 1131 single-point mutations in the SKEMPI dataset. Except for those of TopNetTree and SAAMBE, the results are adopted from Ref. [91].

Method | Rp
TopNetTree | 0.850
BindProfX | 0.738
Profile-score+FoldX | 0.738
Profile-score | 0.675
SAAMBE [66] | 0.624
FoldX | 0.457
BeAtMuSic | 0.272
Dcomplex | 0.056

A scatter plot of the S1131 cross-validation results is shown in Fig. 5.3; the TopNetTree model achieves an Rp of 0.85 and an RMSE of 1.55 kcal/mol.

[Figure: experimental vs. predicted ∆∆G in kcal/mol]
Figure 5.3: Performance evaluation of the 10-fold cross-validation on set S1131.
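The protein-level non-overlapping test above corresponds to a leave-one-group-out split. The following is a minimal sketch using scikit-learn's LeaveOneGroupOut, with random placeholders standing in for the real features, labels, and complex assignments.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import LeaveOneGroupOut

# Hypothetical stand-ins: features, labels, and one complex label per
# mutation (homology models merged with their parent complex).
X, y = np.random.rand(645, 100), np.random.randn(645)
groups = np.random.randint(0, 24, size=645)

results = {}
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    model = GradientBoostingRegressor().fit(X[train_idx], y[train_idx])
    pred = model.predict(X[test_idx])
    g = groups[test_idx][0]
    rp = np.corrcoef(y[test_idx], pred)[0, 1]           # per-complex Rp
    rmse = np.sqrt(((y[test_idx] - pred) ** 2).mean())  # per-complex RMSE
    results[g] = (rp, rmse)
```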
5.2.3 Prediction results on the SKEMPI 2.0 dataset

For the SKEMPI 2.0 dataset, we tested our model on the S4947, S4169, and S8338 sets. For set S4947, we carried out the regular 10-fold cross-validation 10 times. For the S4169 and S8338 sets, we followed the 10-fold stratified cross-validation used in the mCSM-PPI2 paper [72]. The TopNetTree model achieved the following cross-validation results: Fig. 5.4 shows the cross-validation results for set S4947, with an Rp of 0.82 and an RMSE of 1.11 kcal/mol, and for set S4169, with an Rp of 0.78 and an RMSE of 1.13 kcal/mol; Fig. 5.5 shows the cross-validation result for set S8338, with an Rp of 0.85 and an RMSE of 1.09 kcal/mol.

[Figure: panels (a) and (b); experimental vs. predicted ∆∆G in kcal/mol]
Figure 5.4: (a) Scatter plot of the TopNetTree cross-validation predictions on S4947. (b) Scatter plot of the TopNetTree cross-validation predictions on S4169.

[Figure: experimental vs. predicted ∆∆G in kcal/mol]
Figure 5.5: Scatter plot of the TopNetTree cross-validation predictions on S8338.

5.3 Performance of the LTP model on the ProTherm protein mutation database (S2648 and S350)

5.3.1 General performance

Fig. 5.6 shows the prediction results for S350 and the 5-fold cross-validation results for S2648. Our LTP model achieved (a) Rp = 0.79 and RMSE = 0.94 kcal/mol for the S350 prediction and (b) Rp = 0.78 and RMSE = 0.92 kcal/mol for the S2648 cross-validation.

[Figure: panels (a) and (b)]
Figure 5.6: (a) Scatter plot of the prediction results on S350. (b) Scatter plot of the 5-fold cross-validation results on S2648.

We also compare our results with other methods in Table 5.5. Our method reached an Rp of 0.80 and an RMSE of 0.94 kcal/mol on the S350 set and an Rp of 0.78 and an RMSE of 0.92 kcal/mol on the S2648 set, which places it among the top state-of-the-art methods.

Table 5.5: Prediction results on S350 and 5-fold cross-validation results on S2648. All results for the other methods are taken from the cited papers.

Method | S350: n | Rp | RMSE | S2648: n | Rp | RMSE
LTP | 350 | 0.80 | 0.94 | 2648 | 0.78 | 0.92
STRUM | 350 | 0.79 | 0.98 | 2647 | 0.77 | 0.94
mCSM | 350 | 0.73 | 1.08 | 2643 | 0.69 | 1.07
INPS | 350 | 0.68 | 1.25 | 2648 | 0.56 | 1.26
PoPMuSiC 2.0 | 350 | 0.67 | 1.16 | 2647 | 0.61 | 1.17
PoPMuSiC 1.0 | 350 | 0.62 | 1.23 | - | - | -
I-Mutant 3.0 | 338 | 0.53 | 1.35 | 2636 | 0.60 | 1.19
Dmutant | 350 | 0.48 | 1.28 | - | - | -
Automute | 315 | 0.46 | 1.42 | - | - | -
CUPSAT | 346 | 0.37 | 1.46 | - | - | -
Eris | 324 | 0.35 | 1.49 | - | - | -
I-Mutant 2.0 | 346 | 0.29 | 1.50 | - | - | -

5.3.2 Inter/intra-protein-level cross-validation

To validate our model with respect to protein type, we performed inter- and intra-protein-level cross-validation on the S2648 dataset. For inter-protein-level cross-validation, all mutations within one protein are taken as the test set, while the rest of the dataset is used as the training set. For intra-protein-level cross-validation, each protein with more than 20 mutations is selected individually and cross-validated within the protein; since too small a sample size leads to extreme overfitting, proteins with fewer than 20 mutations are not included in the intra-protein-level test.
For inter-protein-level cross-validation, we obtained Rp = 0.55 and RMSE = 1.23 kcal/mol (Rp and RMSE are evaluated by combining all inter-protein-level test results into a single test result for the whole set). The intra-protein-level cross-validation results are shown in Table 5.6.

Table 5.6: Intra-protein-level cross-validation results on S2648. The mutations in each protein are 5-fold cross-validated within the protein.

Pdbid | Sample size | Rp | RMSE
1aj3 | 63 | 0.781 | 0.752
1aps | 21 | -0.154 | 1.095
1bni | 153 | 0.645 | 0.962
1bvc | 41 | 0.512 | 0.662
1c9o | 25 | 0.03 | 0.581
1csp | 21 | 0.724 | 0.893
1cun | 26 | 0.726 | 0.851
1e65 | 21 | 0.558 | 1.073
1ey0 | 482 | 0.728 | 0.937
1fkj | 36 | 0.898 | 0.639
1fna | 34 | 0.067 | 1.413
1ftg | 27 | 0.764 | 0.867
1h7m | 26 | -0.132 | 0.62
1hfz | 22 | 0.08 | 1.465
1hmk | 25 | 0.789 | 0.993
1lni | 43 | 0.795 | 1.088
1lz1 | 110 | 0.659 | 0.954
1qlp | 43 | 0.386 | 1.125
1rn1 | 31 | 0.867 | 0.983
1rop | 20 | 0.562 | 0.931
1rtb | 37 | 0.673 | 1.124
1sak | 28 | 0.126 | 0.81
1shf | 36 | 0.168 | 1.074
1ten | 28 | 0.5 | 1.284
1uzc | 45 | 0.492 | 0.779
1vqb | 90 | 0.693 | 0.805
1wq5 | 44 | 0.703 | 1.329
1yyj | 37 | 0.321 | 1.295
2abd | 29 | 0.231 | 1.264
2ci2 | 75 | 0.707 | 0.744
2lzm | 98 | 0.673 | 1.178
2rn2 | 70 | 0.62 | 1.101
4lyz | 57 | 0.393 | 1.359
5dfr | 102 | 0.709 | 0.687
5pti | 21 | 0.364 | 1.307
Sum | 2067 | N/A | N/A
Mean | 59 | 0.505 | 1.001
Median | 36 | 0.62 | 0.983

Compared with the 5-fold cross-validation result on S2648 of Rp = 0.78 (with randomly selected folds), the Rp of the inter-protein-level cross-validation is much lower. The reason is that machine learning methods depend heavily on the relations between the training and testing data; in inter-protein-level cross-validation, the test protein is not represented in the training set, which makes the energy change hard to predict. In the intra-protein-level cross-validation, the proteins with sample sizes larger than 100 (1ey0, Rp = 0.728; 1bni, Rp = 0.645; 1lz1, Rp = 0.659; 5dfr, Rp = 0.709) achieve a higher Rp than the mean and median values over all proteins. For some proteins, such as 1c9o and 1h7m, the Rp is very low but the RMSE is very good; for these proteins, all energy changes upon mutation are very close to 0 kcal/mol, so our model can predict the energy change values but struggles to track the relatively small perturbations in energy change between similar mutations.

5.3.3 Prediction results for different density kernels

By the properties of the exponential and Lorentz kernels, a larger η extends the range of a single atomic probe, while a larger κ or v sharpens its decay. Since we only include {C, N, O, S} in our calculation, and those four atom types have very similar van der Waals radii, the same characteristic distance η is used throughout our model. In Fig. 5.7, we use a grid search over η and κ/v to find the best-performing prediction model: (a) for the exponential kernel Φ(r_ij; η) = e^(−(r_ij/η)^κ), κ > 0, the model reaches its highest Rp = 0.731 at η = 5.5, κ = 2.0; (b) for the Lorentz kernel Φ(r_ij; η) = 1/(1 + (r_ij/η)^v), v > 0, the model reaches its best Rp = 0.728 at η = 5.5, v = 2.0.

[Figure: panels (a) and (b)]
Figure 5.7: Parameter selection heatmaps for the S350 prediction model.

As the figure shows, the two kernels exhibit very similar patterns with respect to the parameter choice. In general, the model performs relatively well for η between 4.0 Å and 5.5 Å; considering that the van der Waals radius of C is 1.7 Å, the best choice of characteristic distance is around 2.4 to 3.2 times the van der Waals radius.
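A minimal sketch of this kernel-parameter grid search is given below. The feature-building helper is a hypothetical stand-in for regenerating the 720 eigenvalue features at each (η, κ) pair, and the default cross_val_score score is used only as an illustrative selection criterion.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

def build_features(eta, kappa, n_samples=350, n_features=720):
    """Hypothetical stand-in: in practice this would regenerate the 720
    eigenvalue features with the given kernel parameters."""
    rng = np.random.default_rng(int(eta * 100 + kappa * 10))
    return rng.random((n_samples, n_features))

y = np.random.randn(350)  # placeholder ddG labels for S350

# Coarse grid over the characteristic distance eta and kernel power kappa;
# shrink the grid and the ensemble size for quick tests.
best_params, best_score = None, -np.inf
for eta in np.arange(1.0, 7.0, 0.5):
    for kappa in np.arange(0.5, 5.5, 0.5):
        X = build_features(eta, kappa)
        score = cross_val_score(GradientBoostingRegressor(n_estimators=100),
                                X, y, cv=5).mean()
        if score > best_score:
            best_params, best_score = (eta, kappa), score
```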
CHAPTER 6
DISCUSSION

6.1 Prediction result analysis for different mutation types

The pattern of PPI binding affinity changes and protein folding energy changes across different mutation types is important for protein design. In this section, we test how well the model predictions resemble the distributions in the experimental data. A reverse mutation from "B" to "A" is considered the same mutation type as the mutation from "A" to "B", with the associated energy change taking the opposite sign.

6.1.1 Analysis of the TopNetTree predictions on S645

Overall, our predicted patterns are remarkably similar to those of the experimental data in terms of both the average binding energy changes and the variance of the binding energy changes, as shown in Fig. 6.1. It is interesting to note that all mutations to Alanine have positive energy changes. A possible reason is that mutations from a large residue to a small one can have a stabilizing effect on the whole system. Beyond the size of the amino acids, we also categorized amino acids into charged, polar, hydrophobic, and special-case groups. In terms of binding affinity changes, we find that most mutations from polar to hydrophobic residues have a positive free energy change (for example, S to M), meaning that mutations from polar residues to hydrophobic residues tend to make the whole PPI system more stable. We also observed that mutations from charged residues to uncharged polar residues can lead to a negative energy change, for example Lysine to Serine (K to S), which suggests that such mutations break charge-charge interaction pairs. Although our model shares a similar pattern in the variance of the energy changes with the experimental data, the variance of the model predictions is generally lower than that of the experimental data, as shown in Fig. 6.1. It remains a challenging task to produce predictions with a diversity level matching that of the experimental data.

[Figure: experimental and predicted heatmaps; amino acids grouped as charged, polar, hydrophobic, and special cases]
Figure 6.1: Comparison of the average experimental and predicted binding affinity changes upon mutation associated with different amino acid types for the AB-Bind dataset. The x-axis labels the residue type of the original residue, and the y-axis labels the residue type of the mutant. For a reverse mutation, the ∆∆G is taken with the same magnitude as the original value and the opposite sign. (a) Average binding affinity changes upon mutation (kcal/mol). (b) Variance of the binding affinity changes upon mutation (kcal/mol).

6.1.2 Analysis of the LTP predictions on S2648

The pattern of protein folding energy changes across different mutation types is likewise important for protein design.

[Figure: panels (a)-(d)]
Figure 6.2: Comparison of the average experimental (a) and predicted (b) folding energy changes upon mutation associated with different amino acid types for the S2648 dataset.

Overall, our predicted patterns are remarkably similar to those of the experimental data in terms of both the average energy changes and the variance of the energy changes, as shown in Fig. 6.2. It is again interesting that all mutations to Alanine have positive energy changes; a possible reason is that mutations from a large residue to a small one can have a stabilizing effect on the whole system. Beyond the size of the amino acids, we also categorized amino acids into charged, polar, hydrophobic, and special-case groups.
In terms of folding energy changes, we find that most mutations from polar to hydrophobic residues have a positive free energy change (for example, S to M), which means mutations from polar residues to hydrophobic residues would make the protein more stable. We also observe that a mutation from a charged residue to an uncharged polar residue could lead to a negative energy change, for example, lysine to serine (K to S), which suggests such mutations might break some charge-charge interaction pairs.

6.2 Prediction result analysis for different mutation regions

6.2.1 Definition of mutation regions

Mutant residue locations were classified into interface and non-interface regions. Interface residues were further classified as rim, support, and core, while non-interface residues were further classified as surface and interior, based on the classification approach of Levy [55]. Residue classification is mainly based on the change of relative residue accessible surface area (rASA) between the protein-protein complex (rASAc) and the individual protein component of the complex (rASAm), as shown in Table 6.1. ASA was calculated with AREAIMOL from the CCP4 suite [20], and relative solvent accessibility was obtained by normalizing the absolute value with that of the same amino acid in a Gly-X-Gly peptide [59]. Here, ∆rASA = rASAm − rASAc.

Table 6.1: Criteria of residue regions [55]

Region     ∆rASA   rASAc    rASAm
Interior   0       –        < 25%
Surface    0       –        > 25%
Rim        > 0     > 25%    –
Support    > 0     –        < 25%
Core       > 0     < 25%    > 25%

A sketch of this classification rule as a small function is given below.
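The criteria in Table 6.1 translate directly into a small decision rule. The following Python sketch is our reading of that table, assuming rASA values are given as fractions (so 25% = 0.25) and using a small tolerance to decide ∆rASA = 0; it is an illustration, not the thesis's implementation.

```python
def classify_region(rasa_c: float, rasa_m: float, tol: float = 1e-6) -> str:
    """Assign a residue to interior/surface/support/rim/core following Table 6.1.

    rasa_c: relative ASA of the residue in the protein-protein complex.
    rasa_m: relative ASA of the residue in the isolated protein component.
    """
    delta = rasa_m - rasa_c            # delta rASA = rASA_m - rASA_c
    if delta <= tol:                   # rASA unchanged: non-interface residue
        return "interior" if rasa_m < 0.25 else "surface"
    if rasa_m < 0.25:                  # buried even in the isolated component
        return "support"
    if rasa_c > 0.25:                  # still exposed in the complex
        return "rim"
    return "core"                      # exposed in the monomer, buried in the complex

# A residue that is 60% exposed in the monomer but only 10% exposed in the
# complex is a core interface residue.
print(classify_region(rasa_c=0.10, rasa_m=0.60))  # -> core
```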
6.2.2 Analysis of TopNetTree prediction results on S645

The locations of the site mutations can be categorized into five regions: interior, surface, rim, support, and core. The detailed definitions are given in Section 6.2.1. In the experimental data, mutations in the core or support region have a higher average energy change of around 1.8 kcal/mol (1.72 kcal/mol and 1.91 kcal/mol, respectively), while mutations in the rim or interior region have an average energy change of around 0.8 kcal/mol (0.82 kcal/mol and 0.83 kcal/mol, respectively), as shown in Fig. 6.3. The surface mutations, on the other hand, have an average energy change of less than 0.2 kcal/mol. Similar patterns relating mutation sites to energy changes have been reported in the literature [67]. A possible reason for these patterns is that different mutation regions vary in their accessibility to water: in general, the surface, interior, and rim regions are much more accessible to water than the core and support regions.

Figure 6.3: Prediction results for different residue region types in the S645 dataset.

Fig. 6.3 shows our predictions for the different mutation regions. Average Rp values of 0.60, 0.66, 0.66, 0.65, and 0.48 were achieved for the core, rim, support, interior, and surface regions, respectively. This result shows that the performance is consistent among the mutation regions except for the surface region. We believe that the relatively inferior performance for surface mutations is due to the small sample size and to the fact that the energy disturbance caused by surface mutations is small on average.

6.2.3 Analysis of LTP prediction results on S2648

Fig. 6.4 shows the 5-fold cross-validation results on the S2648 set for the different mutation regions.

Figure 6.4: Scatter plots of the 5-fold cross-validation results for the S2648 set in (a) the buried mutation region, with Rp = 0.79 and RMSE = 0.94; (b) the exposed mutation region, with Rp = 0.78 and RMSE = 0.92; and (c) the intermediate mutation region, with Rp = 0.78 and RMSE = 0.92.

From the scatter plots, we can see that mutations in the buried region are more likely to produce large folding energy changes, while mutations in the exposed region lead to much smaller energy perturbations. The reason is that residues in the exposed region are more accessible to water. Despite the different distributions of energy changes, our model works well on all three regions with similar Rp values, which means our model is stable with respect to the residue regions and is able to predict energy change values for all three of them.

6.3 Alanine scanning test of 1AK4

In molecular biology, alanine scanning is a site-directed mutagenesis technique used to determine the contribution of a specific residue to the stability or function of a given protein [62, 86, 82]. Alanine is used because of its non-bulky, chemically inert methyl functional group, which nevertheless mimics the secondary-structure preferences that many of the other amino acids possess. We performed an alanine scanning test on chain A of 1AK4, using the TopNetTree model trained on the AB-Bind data; all ∆∆G values are in kcal/mol. In total there are 165 residues in chain A of 1AK4. The results of the alanine mutations are also separated into five region groups: interior, surface, rim, support, and core.

Figure 6.5: Structure of protein complex 1AK4, with chain A in blue and chain D in red.

Table 6.2: Alanine mutation test on 1AK4 chain A. For each residue type, the average (Avg) and variance (Var) of the predicted ∆∆G (kcal/mol) are listed per region; "–" means no residue of that type occurs in that region.

         Interior        Surface         Rim             Support         Core            All
Residue  Avg     Var     Avg     Var     Avg     Var     Avg     Var     Avg     Var     Avg     Var
Arg      0.8435  0       1.7463  0.5962  1.6302  0       –       –       1.5676  0       1.5466  0.4017
Asn      1.0064  0.1031  1.3790  1.2943  1.4870  0       1.6727  0       –       –       1.2581  0.5352
Asp      1.0726  0.1059  0.8942  0.0837  –       –       –       –       –       –       0.9707  0.1010
Cys      0.8236  0.0172  0.6246  0       –       –       –       –       –       –       0.7739  0.0203
Gln      –       –       0.9425  0       –       –       2.8500  0.0041  –       –       2.2142  0.8114
Glu      0.8466  0.0174  1.2533  0.3848  –       –       –       –       –       –       1.1794  0.3426
Gly      1.0956  0.4595  0.6091  0.1135  –       –       –       –       1.3921  0       0.9322  0.3761
His      1.2941  0       1.1008  0       –       –       1.7063  0.0410  –       –       1.4519  0.0899
Ile      1.0031  0.1280  0.1595  0       –       –       0.8719  0       –       –       0.9056  0.1658
Leu      1.2601  0.1447  1.8473  0.8087  –       –       1.5133  0       –       –       1.4641  0.3798
Lys      –       –       0.6113  0.1443  –       –       –       –       –       –       0.6113  0.1443
Met      2.2145  0.4070  0.9892  0       –       –       1.9440  0.2721  –       –       1.9153  0.4696
Phe      2.1307  0.9728  1.0778  0.0046  –       –       1.7958  0.2721  –       –       1.9457  0.8788
Pro      0.8306  0       0.7735  0.3486  –       –       –       –       –       –       0.7831  0.2909
Ser      1.0374  0.1150  0.3301  0.0010  –       –       –       –       –       –       0.8606  0.1803
Thr      1.1284  0.1641  0.8129  0.0432  1.1802  0       –       –       –       –       0.9898  0.1205
Trp      –       –       –       –       1.3124  0       –       –       –       –       1.3124  0
Tyr      2.5878  0       1.0924  0       –       –       –       –       –       –       1.8401  0.5590
Val      1.0002  0.0229  0.4613  0       –       –       –       –       –       –       0.9403  0.0490
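The scan summarized in Table 6.2 amounts to enumerating one X-to-A mutation per non-alanine residue of the chain and scoring each with the trained model. A minimal sketch, assuming the chain sequence is available as a string of one-letter codes (the fragment below is hypothetical, not the real 165-residue chain A of 1AK4):

```python
def alanine_scan(sequence: str, chain: str = "A", start: int = 1):
    """Yield mutation descriptors like 'A:H2A' for every non-alanine residue.

    `start` is the residue number of the first position; real PDB numbering can
    have gaps or insertion codes, which this toy enumeration ignores.
    """
    for offset, res in enumerate(sequence):
        if res != "A":  # alanine positions are skipped
            yield f"{chain}:{res}{start + offset}A"

fragment = "MHVSDG"  # hypothetical six-residue stretch
for mutation in alanine_scan(fragment, chain="A"):
    print(mutation)   # A:M1A, A:H2A, A:V3A, A:S4A, A:D5A, A:G6A
```

Each generated descriptor would then be passed to the ∆∆G predictor, and the predictions aggregated per residue type and per region as in Table 6.2.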
CHAPTER 7

THESIS CONTRIBUTION

In this chapter, the thesis contributions are highlighted for our prediction models of mutation induced energy changes of protein-protein interactions and of protein folding, respectively.

7.1 Protein-protein interaction energy changes upon mutation

The importance of protein-protein interactions (PPIs) is evident from the intensive efforts to study them from many perspectives, including quantum mechanics, molecular mechanics, biochemistry, biophysics, and molecular biology. Nevertheless, the accurate prediction of mutation induced PPI binding affinity changes remains difficult: for example, the Pearson correlation coefficient (Rp) between predicted ∆∆G values and experimental data in cross-validations of a commonly used PPI database, AB-Bind [83], is only 0.53. Recently, topology has been shown to be surprisingly effective in simplifying biomolecular structural complexity [88, 37, 10]. It has been devised to win worldwide competitions in computer-aided drug design [63]. Therefore, it is of enormous importance to exploit topology for understanding PPIs. In this work, we propose topology-based network trees (TopNetTrees) for ∆∆G prediction. Specifically, element-specific and site-specific persistent homology is introduced to characterize PPIs. Additionally, we propose a machine learning algorithm, convolutional neural network (CNN)-assisted gradient boosting trees (GBTs), to pair with the topological method for the prediction of PPI ∆∆G. We demonstrate that the proposed TopNetTree achieves an Rp of 0.65, which is about 22% better than the previous best result for the AB-Bind dataset. For another benchmark PPI dataset, SKEMPI, the present method also significantly outperforms the state-of-the-art in the literature.

7.2 Protein folding energy changes upon mutation

Protein folding is the process by which a polypeptide chain folds into a biologically active protein in its native 3D structure; it is a fundamental basis of living organisms. Mutations affecting the protein folding process are related to several diseases [52], and mutations can also lead to drug resistance [58]. To support and complement time-consuming and costly mutagenesis experiments, computational models for protein folding energy changes are strongly needed. In this work, we propose the LTP prediction model for protein folding energy changes upon mutation. To characterize the structural information, we use a density function to generate a local surface. From the eigenvalues of the Hessian matrix of this surface, we obtain local topological descriptors. Grouping the eigenvalues by element type and integrating them with a gradient boosting decision tree model gives our LTP model. We demonstrate that the proposed LTP model achieves an Rp of 0.78 on the S2648 dataset and an Rp of 0.80 on the S350 dataset, which outperforms the state-of-the-art in the literature.

APPENDICES

APPENDIX A

SUPPLEMENTARY MATERIALS FOR TOPNETTREE MODEL

A.1 Code and Data availability

Code availability: All source codes and models are publicly available through a Code Ocean compute capsule (https://doi.org/10.24433/CO.0537487.v1).

Data availability: Copyright and credits of the databases used in this thesis belong to their original authors.
The databases are publicly available at the listed web pages • AB-Bind database https://github.com/sarahsirin/AB-Bind-Database • SKEMPI database https://life.bsc.es/pid/skempi2 • Pro-Therm database https://web.iitm.ac.in/bioinfo2/prothermdb/index.html 79 A.2 Cross validation result on AB-Bind S645 dataset Table A.1: TopNetTree crossvalidation result on S645 PDBID Mutation Site Experiment value Predicted value 1AK4 D:A488G 2.49 1.772247901 1AK4 D:A488V 0.6 1.70987115 1AK4 D:A492G 0.41 0.895272322 1AK4 D:A492V 0.19 1.097418158 1AK4 D:G489A 1.91 1.173092962 1AK4 D:G489V 2.86 1.587869628 1AK4 D:H487A 0.84 1.460316679 1AK4 D:H487Q 0.8 1.007496241 1AK4 D:H487R 1.48 1.388147881 1AK4 D:I491A 0.07 0.841468807 1AK4 D:I491V -0.17 0.434347712 1AK4 D:P485A 0.92 0.513298461 1AK4 D:P490A 2 1.625125531 1AK4 D:P490V 2.86 2.903634847 1AK4 D:P493A 0.51 0.659453086 1AK4 D:V486A 0.82 0.817046 1BJ1 V:F17A 0 0.667371629 1BJ1 V:Y21A 0 0.386553596 1BJ1 W:E93A 0.82 2.052724853 1BJ1 W:G88A 2.76 1.824412366 1BJ1 W:G92A 3.69 2.265639133 1BJ1 W:H86A 0 0.495278556 1BJ1 W:H90A 0 0.55636088 80 Table A.1 (cont’d) 1BJ1 W:I80A 0.82 1.006377789 1BJ1 W:I83A 3.69 0.919007453 1BJ1 W:I91A 0.41 1.255807724 1BJ1 W:K48A 0.41 1.475578215 1BJ1 W:K84A 0.65 0.617726683 1BJ1 W:M81A 3.69 1.373023595 1BJ1 W:M94A 1.42 0.826574734 1BJ1 W:Q79A 0 1.782212133 1BJ1 W:Q87A 0 2.410079235 1BJ1 W:Q89A 1.75 4.259019508 1BJ1 W:R82A 3.69 1.520556303 1BJ1 W:Y45A 0.82 1.980918971 1CZ8 V:F17A 0 2.712193372 1CZ8 V:Y21A 0 0.593485188 1CZ8 W:E93A 1.15 1.266828729 1CZ8 W:G88A 2.67 1.354456702 1CZ8 W:G92A 4.1 2.891246561 1CZ8 W:H86A 0 0.877283637 1CZ8 W:H90A 0 0.702263438 1CZ8 W:I80A 0.95 1.187477651 1CZ8 W:I83A 1.3 1.14279147 1CZ8 W:I91A 1.06 1.525802887 1CZ8 W:K48A 0 0.733548424 1CZ8 W:K84A 1.36 1.004540774 1CZ8 W:M81A 4.1 1.518174577 1CZ8 W:M94A 1.91 0.969634812 81 Table A.1 (cont’d) 1CZ8 W:Q79A 0.65 1.184333944 1CZ8 W:Q87A 0 0.52369491 1CZ8 W:Q89A 1.06 2.106217314 1CZ8 W:R82A 0.82 1.989554698 1CZ8 W:Y45A 1.94 1.539212479 1DQJ C:D101A 1.3 1.785713715 1DQJ C:K96A 6.1 1.753633758 1DQJ C:K97A 3.5 1.981746346 1DQJ C:L75A 1.5 1.763301137 1DQJ C:N93A 0.6 1.90652708 1DQJ C:R21A 1.3 1.182521681 1DQJ C:S100A 0.8 0.969499244 1DQJ C:T89A 0.8 1.435309483 1DQJ C:W62A 0.8 0.784356207 1DQJ C:W63A 1.3 1.304424197 1DQJ C:Y20A 3.3 1.677039544 1DQJ H:D32A 2 1.583783786 1DQJ H:W98A 4.9 3.350083739 1DQJ H:Y33A 5.5 2.707997194 1DQJ H:Y50A 6.9 2.536556738 1DQJ H:Y53A 1.2 0.759781316 1DQJ L:N31A 2 1.84136089 1DQJ L:N32A 4.1 2.614977004 1DQJ L:S91A 1.4 1.216659126 1DQJ L:Y50A 2.7 1.43641669 1DQJ L:Y96A 1.1 1.552117066 82 Table A.1 (cont’d) 1DVF A:H30A 1.67 0.971417375 1DVF A:S93A 1.17 1.761136416 1DVF A:W92A 0.34 0.918307718 1DVF A:Y32A 2.05 1.281131406 1DVF A:Y49A 1.64 0.113538779 1DVF A:Y49A 1.75 2.067569689 1DVF A:Y50A 0.69 1.475205049 1DVF B:D100A 2.82 3.20042784 1DVF B:D54A 4.32 2.256333782 1DVF B:D58A 1.62 0.512496378 1DVF B:E98A 4.23 1.961528555 1DVF B:N56A 1.17 1.553101451 1DVF B:R99A 1.89 1.888709858 1DVF B:T30A 0.92 1.292450289 1DVF B:W52A 4.17 2.08684492 1DVF B:Y101F 2.03 2.358112289 1DVF B:Y32A 1.85 0.992016256 1DVF C:Y49A 1.88 2.38382718 1DVF D:D52A 1.7 1.241237441 1DVF D:H33A 1.88 0.8326038 1DVF D:I101A 2.71 2.05527356 1DVF D:K30A 1.01 1.649048839 1DVF D:N55A 1.88 0.450320114 1DVF D:Q104A 1.64 2.146662618 1DVF D:R106A 4.13 2.304954282 1DVF D:Y102A 4.79 3.255549884 83 Table A.1 (cont’d) 1FFW A:A90V 0.09 0.704165614 1FFW A:D13K 0.05 0.575279005 1FFW A:E117K 0.71 0.42333309 1FFW A:E93K 0.82 1.249733698 1FFW A:F111V 1.26 0.342520128 1FFW A:T112I 0.56 0.350163226 1FFW 
A:T87I -0.32 0.906796345 1FFW A:V108M 1.13 0.522969592 1FFW A:Y106W 0.71 3.141303285 1JRH I:E45Q 0.11 1.209955109 1JRH I:T14V -0.02 0.734714999 1JTG A:K234A 0.82 0.76336782 1JTG A:R243A 0.56 1.28977497 1JTG A:S130A 0.5 1.100434164 1JTG A:S235A 1.32 0.684225643 1JTG B:D49A 1.98 3.260976281 1KTZ B:D118A 0.9 1.574396523 1KTZ B:D32A 1.4 1.183164012 1KTZ B:D32N 1.7 1.537020647 1KTZ B:E119A 1.4 0.923715298 1KTZ B:E119Q 1.7 1.819594258 1KTZ B:E55A 1.3 0.674391378 1KTZ B:E75A 1 1.275212918 1KTZ B:F110A 0.9 0.381708447 1KTZ B:F30A 2.9 2.890359842 1KTZ B:H79A 0.4 0.438193601 84 Table A.1 (cont’d) 1KTZ B:I125A 0.5 0.805530864 1KTZ B:I50A 1.9 2.507838141 1KTZ B:I53A 1.2 1.10341562 1KTZ B:L27A 1.9 1.452828559 1KTZ B:M112A 0.8 0.771273735 1KTZ B:N47A 0.3 0.408989806 1KTZ B:S49A 0.5 0.488502427 1KTZ B:S52A 0.4 0.400807647 1KTZ B:S52L 3.7 2.998590038 1KTZ B:T51A 1.3 1.32474016 1KTZ B:V62A 0.7 0.862856594 1KTZ B:V77A 0.5 0.729256061 1MHP H:F99W 0 1.511132256 1MHP H:F99Y 0.76 2.600869319 1MHP H:G100F 0.57 2.375511513 1MHP H:G100I 8 3.232489423 1MHP H:G100L 2.73 2.803968286 1MHP H:G100M 0.1 1.978972993 1MHP H:G100S 0.56 0.277025289 1MHP H:G100V 8 2.588756997 1MHP H:G102S 8 3.075302826 1MHP H:G53A 0.51 1.085337422 1MHP H:G53N 1.26 2.193903725 1MHP H:G53Q 8 2.859680867 1MHP H:G53S 0.68 0.29963622 1MHP H:G53W 8 2.610651222 85 Table A.1 (cont’d) 1MHP H:G54I 1.36 2.053849691 1MHP H:G54M 8 3.263506111 1MHP H:G54N 8 3.956349362 1MHP H:G54T 1.17 1.083141006 1MHP H:G54Y 0 1.361053347 1MHP H:H56Y 8 2.793931876 1MHP H:K64D -0.12 0.758579301 1MHP H:K64E -0.16 2.852913642 1MHP H:K64N -0.07 1.004445379 1MHP H:K64Q -0.16 0.752612944 1MHP H:L60D -0.12 1.762583453 1MHP H:R31Q 1.31 1.856987057 1MHP H:S35A 1.58 1.399227777 1MHP H:S35Q 2.73 1.844218833 1MHP H:S35V 8 0.902972107 1MHP H:S52M 2.08 1.507612619 1MHP H:S52T 8 2.160307508 1MHP H:T33N 2.73 1.164517369 1MHP H:T33Q 8 2.011373376 1MHP H:T33V 0.43 1.269510301 1MHP H:T50E 8 2.721531952 1MHP H:T50Q 8 2.060513216 1MHP H:T50V -0.26 3.066232312 1MHP H:Y58E 2.08 1.589362497 1MHP H:Y58Q 2.08 3.235696527 1MHP H:Y58W 0.82 1.681574377 86 Table A.1 (cont’d) 1MHP H:Y59E 0.54 3.81896283 1MHP L:G92Q 1.58 2.356168725 1MHP L:G92S 0.51 1.226035214 1MHP L:H31K 1.17 1.317764394 1MHP L:H31R 1.12 1.492543745 1MHP L:H31W 1.26 1.481678655 1MHP L:L49F 0.71 1.007443222 1MHP L:L49K 0.93 0.461418046 1MHP L:L49W 1.91 1.939848971 1MHP L:L49Y 8 4.188789486 1MHP L:N30K 8 1.347084757 1MHP L:N30V 1.05 1.760356356 1MHP L:N30W 1.58 4.39788735 1MHP L:N30Y 1.91 2.446100825 1MHP L:N52D 0.5 1.226667242 1MHP L:N52E -0.18 1.516983952 1MHP L:N52K 0.66 1.186057489 1MHP L:N52R 0.85 1.011625931 1MHP L:N52Y -0.1 0.685355368 1MHP L:N93D 8 2.950172584 1MHP L:S24R -0.07 0.482326814 1MHP L:S28Q -0.67 0.514440248 1MHP L:S91K 8 3.037964039 1MHP L:S91Q 1.78 2.340801141 1MHP L:S91R 2.73 2.673891363 1MHP L:S91T 2.32 1.408651946 87 Table A.1 (cont’d) 1MHP L:S91W 8 2.142880094 1MHP L:W90Q 2.08 1.973060762 1MLC H:K65D 0.02 0.772076764 1MLC H:S57A -0.38 0.238518223 1MLC H:S57V -0.49 1.67920641 1MLC H:T28D -0.15 0.279988554 1MLC H:T31A 0.45 1.389369554 1MLC H:T31V 0.53 1.239429551 1MLC H:T31W 0.13 1.895184435 1MLC H:T58D -0.56 0.666008019 1MLC L:N32G -0.85 -0.175578736 1MLC L:N32Y 0 1.126775376 1MLC L:N92A -1.25 0.686745352 1N8Z H:D31A 0.25 0.695942219 1N8Z H:D98A 1.21 1.821430834 1N8Z H:D98W -0.69 2.672934996 1N8Z H:F100A 1.17 1.554348682 1N8Z H:K30A 0.61 0.499821357 1N8Z H:N54A -0.15 0.715626098 1N8Z H:R50A 8 3.0454875 1N8Z H:T32A 0.38 0.411223308 1N8Z H:T53A 0.78 1.166937169 1N8Z H:W95A 8 2.781804442 1N8Z H:Y100aA 8 2.794566125 
1N8Z H:Y100aF -0.05 0.54905435 1N8Z H:Y100aF 0.82 1.266112137 88 Table A.1 (cont’d) 1N8Z H:Y102V 0.22 0.776690057 1N8Z H:Y33A -0.09 0.453388764 1N8Z H:Y52A 0.23 2.384646317 1N8Z H:Y56A 0.87 1.728654937 1N8Z L:D28N -0.28 0.949429435 1N8Z L:F53N 1.21 0.99378604 1N8Z L:H91A 8 3.035039866 1N8Z L:H91F -0.43 0.113538779 1N8Z L:H91F 0.03 0.422061501 1N8Z L:N30A 1.12 1.712643518 1N8Z L:N30S 0.06 0.31163856 1N8Z L:R66G 0.22 0.787922453 1N8Z L:S50A -0.07 1.307035447 1N8Z L:S52A -0.31 0.711932707 1N8Z L:T31A 0.8 0.624779485 1N8Z L:T93A 0.82 0.832968299 1N8Z L:T94A -0.12 0.913693287 1N8Z L:T94S 0.31 0.741801789 1N8Z L:Y49A 1.05 1.83223489 1N8Z L:Y92A 1.36 0.985871557 1N8Z L:Y92F -0.21 1.032859711 1VFB C:D119A 1 0.960852994 1VFB C:D18A 0.3 0.738494684 1VFB C:I124A 1.2 0.874839065 1VFB C:K116A 0.7 1.295609012 1VFB C:L129A 0.2 0.553255228 89 Table A.1 (cont’d) 1VFB C:N19A 0.3 0.665018923 1VFB C:Q121A 2.9 2.145117407 1VFB C:R125A 1.8 2.107791787 1VFB C:S24A 0.8 0.888296621 1VFB C:T118A 0.8 0.602773202 1VFB C:V120A 0.9 0.806104381 1VFB C:Y23A 0.4 0.589905404 1VFB H:D100A 2.9 2.118798539 1VFB H:D54A 1 0.767356465 1VFB H:D58E 0.08 0.965146096 1VFB H:D58N -0.13 0.528588246 1VFB H:G31A 0.3 0.733750556 1VFB H:G31E -0.51 0.717865289 1VFB H:G31W 0.01 0.481503215 1VFB H:R99W 0.71 0.85370306 1VFB H:R99Y 1.26 1.889616403 1VFB H:S28D 0 0.151328361 1VFB H:S28E -0.1 0.756153385 1VFB H:S28N 0.15 0.169814022 1VFB H:S28Q 0.08 0.806274008 1VFB H:W52A 0.9 1.635290525 1VFB H:Y101A 4 2.32718193 1VFB H:Y101F 1.6 2.709148226 1VFB H:Y32A 1.1 1.395961726 1VFB H:Y32E 1.91 1.696019146 1VFB L:L46D 8 3.927971666 90 Table A.1 (cont’d) 1VFB L:L46E 8 2.726775795 1VFB L:N31W 0.17 0.330030168 1VFB L:T52F 0.47 1.46582676 1VFB L:T53R 1.67 1.930301498 1VFB L:W92A 3.3 1.422575552 1VFB L:Y32A 1.7 1.893113252 1VFB L:Y32W 8 3.95789319 1VFB L:Y50A 0.5 1.482213498 1VFB L:Y50K 1.67 1.316597932 1VFB L:Y50R 0.85 1.083387413 1YY9 H:N56A -0.06 0.44529548 1YY9 H:T61E -0.06 1.004986158 1YY9 L:N93A -0.74 0.852318085 1YY9 L:S26D -0.2 0.416451333 1YY9 L:T31E -0.52 1.000581844 2JEL P:A82S 0 0.62358458 2JEL P:D69E 0.96 0.758253135 2JEL P:E5D 0.41 0.793075738 2JEL P:E5Q 0.72 0.568361032 2JEL P:E66K 4.13 1.826121282 2JEL P:E68A 0.41 1.095711209 2JEL P:E70A 2.75 1.324838712 2JEL P:E70K 4.13 3.142100614 2JEL P:E75R 2.75 2.484719967 2JEL P:E83A 0 0.015917697 2JEL P:E85A 0 0.755415193 91 Table A.1 (cont’d) 2JEL P:E85D 0 0.138530725 2JEL P:E85K 0 0.664799549 2JEL P:E85Q 0 0.477949894 2JEL P:F2W 2.64 1.677778688 2JEL P:F2Y 0 0.690321795 2JEL P:H76A -0.41 0.800214403 2JEL P:H76D -0.66 0.162703811 2JEL P:K24E 0 0.076843167 2JEL P:K27E 0 1.075453115 2JEL P:K72E 0.41 1.015558596 2JEL P:K72R 0 1.344040071 2JEL P:K79E 0.41 0.941212343 2JEL P:N12D 0 0.53221519 2JEL P:N38T 0 0.650443812 2JEL P:P11E 0 0.438658069 2JEL P:Q3K 4.13 1.899204206 2JEL P:Q4K 1.38 0.967431396 2JEL P:Q57E -0.41 0.586016362 2JEL P:Q71E 2.75 2.228121794 2JEL P:R17G 0 0.506077878 2JEL P:R17K 0 -0.218487614 2JEL P:S41C 1.51 1.753646505 2JEL P:S43C 0 0.251503369 2JEL P:S46C 0 0.262453297 2JEL P:S64T 4.13 0.786654792 2JEL P:T34Q 0 0.595213209 92 Table A.1 (cont’d) 2JEL P:T36Q 0.41 0.81999459 2JEL P:T62A 0 0.702519448 2JEL P:T62N 0 0.738162379 2JEL P:T7N 0.41 0.654010822 2JEL P:T7S 0 0.372232377 2JEL P:V6F 0 0.445051654 2NYY A:D1058A 0.01 0.525443185 2NYY A:D1062A 2.53 0.532715101 2NYY A:E920A 2.84 2.000718398 2NYY A:F917A 0.42 0.794587442 2NYY A:F953A 4.06 1.684166884 2NYY A:H1064A 7.32 3.824604807 2NYY A:I956A -0.01 0.870105035 2NYY A:K1056A -0.03 0.515474786 2NYY A:K903A 0.29 0.74346297 2NYY A:K923A -0.14 
0.076819464 2NYY A:L919A 2.59 2.208152106 2NYY A:N918A 0.89 1.056838524 2NYY A:N954A 0.1 0.914624936 2NYY A:Q915A 0.52 0.992770347 2NYY A:R1061A 0.82 0.917340792 2NYY A:R1294A 0.3 1.082284119 2NYY A:S902A -0.23 0.987631227 2NYY A:S955A 0 0.653515584 2NYY A:T1063A 1.62 2.436781712 2NYY H:K30R 0.07 0.690428166 93 Table A.1 (cont’d) 2NYY H:M34Q -0.06 1.095716335 2NYY H:Y31D 0.07 1.015319328 2NYY H:Y31Q 0.13 1.439321457 2NYY H:Y57Q 0.52 1.562704507 2NYY L:D30Y 0.65 2.386377244 2NYY L:H34R 0.1 1.081628988 2NYY L:S28Q 0.42 1.037599326 2NYY L:S31N -0.36 1.198890994 2NZ9 A:D1058A -0.03 0.018030161 2NZ9 A:D1062A 2.34 0.972425747 2NZ9 A:E920A 2.77 1.907371345 2NZ9 A:F917A -0.05 1.499934326 2NZ9 A:F953A 3.34 1.477590993 2NZ9 A:H1064A 7.42 3.686355671 2NZ9 A:I956A 0.07 0.521508086 2NZ9 A:K1056A -0.01 0.761719976 2NZ9 A:K903A 0.54 1.079718724 2NZ9 A:K923A -0.3 1.195504875 2NZ9 A:L919A 2.28 1.756742478 2NZ9 A:N918A 2.16 0.707199903 2NZ9 A:N954A -0.15 0.982384186 2NZ9 A:Q915A 0.1 0.957360586 2NZ9 A:R1061A 0.29 2.460852085 2NZ9 A:R1294A 0.39 0.270213282 2NZ9 A:S902A -0.12 1.063609589 2NZ9 A:S955A -0.08 1.241966479 94 Table A.1 (cont’d) 2NZ9 A:T1063A 2.37 2.895512163 3BDY H:D31A 0.2 0.462999503 3BDY H:D98A 0 0.816723653 3BDY H:F100A 0.7 0.39798716 3BDY H:G96A 0.2 0.038339515 3BDY H:G97A -0.1 0.21307577 3BDY H:G99A 0.8 0.83948409 3BDY H:K30A 0.2 0.589573872 3BDY H:N54A -0.2 0.955567274 3BDY H:R50A -0.3 0.932303598 3BDY H:R58A -0.2 0.390652519 3BDY H:T32A -0.4 0.222194885 3BDY H:T53A -0.5 0.304702079 3BDY H:W95A 2 1.765017977 3BDY H:Y100aA 0.5 0.41832798 3BDY H:Y33A 0.3 1.363225032 3BDY H:Y52A 0.4 0.066635255 3BDY H:Y56A 0.2 0.693914709 3BDY L:D27aA 0 0.942937671 3BDY L:G31A 1.2 1.386320861 3BDY L:G51A 1 1.093660591 3BDY L:H91A 2 2.241317608 3BDY L:I27bA 1.1 1.339230158 3BDY L:I29A 0.8 0.509603382 3BDY L:P27cA 0.2 0.688282228 3BDY L:R27dA -0.2 1.436376744 95 Table A.1 (cont’d) 3BDY L:S28A 1 1.189582364 3BDY L:S30A 1.3 1.257045566 3BDY L:S52A 0.1 0.864860272 3BDY L:T93A -0.3 1.074481741 3BDY L:T94A 0 1.795663852 3BDY L:W50A 1.3 0.81352744 3BDY L:Y32A 1.7 0.575994608 3BDY L:Y53A 0.7 1.429535516 3BDY L:Y92A 1.4 1.305772173 3BE1 H:D31A 0.4 0.780318088 3BE1 H:D98A -0.1 0.052540047 3BE1 H:F100A 1.9 0.187278535 3BE1 H:G96A 0.1 -0.262728687 3BE1 H:G97A 0.3 1.12837483 3BE1 H:G99A 1.8 1.868551839 3BE1 H:K30A -0.3 0.817743524 3BE1 H:N54A -0.8 0.176916878 3BE1 H:R50A 2.4 2.146816267 3BE1 H:R58A 2.5 1.486981348 3BE1 H:T32A -0.4 0.815500345 3BE1 H:T53A 0.1 1.232608627 3BE1 H:W95A 1.8 1.947953267 3BE1 H:Y100aA 1.2 0.830610968 3BE1 H:Y33A 2.4 1.186536328 3BE1 H:Y52A -0.8 0.257574045 3BE1 H:Y56A 1.8 0.867268608 96 Table A.1 (cont’d) 3BE1 L:D27aA 0 1.018711282 3BE1 L:G31A 0.3 1.95817431 3BE1 L:G51A -0.3 0.513404332 3BE1 L:H91A 0.9 0.482463821 3BE1 L:I27bA -0.2 0.955205659 3BE1 L:I29A 0.9 1.265787158 3BE1 L:P27cA 0.1 0.00789498 3BE1 L:R27dA -0.2 0.446215309 3BE1 L:S28A 0.3 0.287493294 3BE1 L:S30A -0.2 0.236613237 3BE1 L:S52A 0.3 0.25247753 3BE1 L:T93A -0.4 0.644443182 3BE1 L:T94A 1.4 1.194623934 3BE1 L:W50A 1.4 0.743452596 3BE1 L:Y32A -0.8 1.369839294 3BE1 L:Y53A 0.9 1.107454542 3BE1 L:Y92A 0.5 0.415530772 3BN9 A:D217A 0.57 1.344612036 3BN9 A:D60aA 0.42 -0.013424513 3BN9 A:D60bA 0.31 0.696714558 3BN9 A:D96A 6.7 0.839577235 3BN9 A:E169A 0.37 0.927464902 3BN9 A:F60eA -0.05 0.696510849 3BN9 A:F94A 0.64 0.796281704 3BN9 A:F97A 6.7 1.803046157 3BN9 A:H143A 0.09 1.326110308 97 Table A.1 (cont’d) 3BN9 A:I41A 0 0.503108154 3BN9 A:I60A 0.84 1.004853709 3BN9 A:K224A 0.79 0.452139457 3BN9 A:L153A 0.34 0.489134099 3BN9 A:N95A 0.77 
0.466347123 3BN9 A:Q145A 0.13 0.342717139 3BN9 A:Q174A -0.03 0.532123895 3BN9 A:Q175A 2.51 1.808248335 3BN9 A:Q221aA 0.71 1.03577532 3BN9 A:Q38A -0.42 0.609958937 3BN9 A:R222A -0.09 0.610519354 3BN9 A:R60cA -0.05 0.220826913 3BN9 A:R60fA -0.07 0.889075772 3BN9 A:R87A -0.16 0.278420036 3BN9 A:T150A 0.29 0.335359473 3BN9 A:T98A 1.13 0.431271266 3BN9 A:Y146A 1.09 0.631743368 3BN9 A:Y60gA 0.02 0.183751899 3BN9 H:P100H 8 2.90160336 3BN9 H:Q100aV 1.36 2.01919693 3BN9 H:S30G 1.36 1.092683778 3BN9 H:S30N 1.36 1.176220347 3BN9 H:T28R 0.3 1.094147801 3BN9 H:T98Q 0 1.203092267 3BN9 H:T98R -0.06 1.729186005 3BN9 H:Y99S 0.41 1.466513323 98 Table A.1 (cont’d) 3HFM H:C95A 5.52 1.594987571 3HFM H:C95F 3.25 3.115214161 3HFM H:D32A 1.9 2.232648204 3HFM H:D32N 0.17 1.615671996 3HFM H:S31A 0.17 1.149026922 3HFM H:Y33A 6 1.579874167 3HFM H:Y50A 8 2.442873652 3HFM L:N31A 5.22 2.510529204 3HFM L:N31D 1.34 1.764424941 3HFM L:N31E 5.67 3.673517204 3HFM L:N32A 5.11 1.946142038 3HFM L:Q53A 0.95 2.273443746 3HFM L:Y50A 4.57 4.287597028 3HFM L:Y50F 2.36 1.943871252 3HFM L:Y50L 4.4 2.647111021 3HFM L:Y96A 2.71 2.960946398 3HFM L:Y96F 1.4 2.952138972 3HFM Y:K96A 6.83 2.747755689 3HFM Y:K97A 6.18 2.104647427 3HFM Y:K97M 1.09 2.476272635 3HFM Y:R21A 1.03 1.71014554 3HFM Y:Y20A 4.87 2.155665707 3K2M D:E52A 2.5 2.400196772 3K2M D:M88A 4 2.177564766 3K2M D:R38A 2.5 1.580952505 3K2M D:W80A 4 2.293547317 99 Table A.1 (cont’d) 3K2M D:Y35A 1 1.146032541 3K2M D:Y36A 3.7 2.745885379 3K2M D:Y87A 4 3.222661154 3NGB H:A56G -0.06 0.14552782 3NGB H:G54S 0.73 1.536982886 3NGB H:I30T 0.08 1.484114774 3NGB H:K52N 0.18 1.200330023 3NGB H:P62K 0.23 1.115225538 3NGB H:R53N 0.05 1.148569033 3NGB H:R61Q 0.1 0.591696346 3NGB H:T33Y 0.08 0.832934832 3NGB H:V57T 0.14 0.378418438 3NGB H:V73T 0.16 0.850283401 3NGB H:Y74S 0.33 0.54945593 3NPS A:D214A 1.48 0.735888768 3NPS A:D46A 0.34 0.751372318 3NPS A:D47A 1.08 -0.005291115 3NPS A:D91A 1.52 1.360870246 3NPS A:E163A 0.62 0.466130221 3NPS A:F50A 0.22 1.403497591 3NPS A:F89A 1.61 1.66551116 3NPS A:F92A 0.47 1.437897506 3NPS A:H138A 1.89 0.907842784 3NPS A:I26A 0.65 0.982534852 3NPS A:I45A -0.34 1.051965331 3NPS A:K221A -0.1 0.56344418 100 Table A.1 (cont’d) 3NPS A:L147A 0.3 0.595928346 3NPS A:N90A 0.26 0.804218843 3NPS A:Q140A 0.3 0.05200386 3NPS A:Q168A -0.06 0.97202142 3NPS A:Q169A 0.75 1.077402593 3NPS A:Q218A -0.04 0.783809807 3NPS A:Q23A 0.03 1.317106432 3NPS A:R219A -0.08 0.631615871 3NPS A:R48A -1.07 0.423264568 3NPS A:R51A 0.14 0.853977131 3NPS A:R82A -0.15 0.372059036 3NPS A:T144A 0.18 0.238918702 3NPS A:T93A 0.73 0.500676638 3NPS A:Y141A 1.79 1.823759807 3NPS A:Y52A 0.46 1.409161225 HM_1KTZ B:D118A 0.8 0.857318869 HM_1KTZ B:D32A 1.5 0.672650201 HM_1KTZ B:D32N 2 1.365241973 HM_1KTZ B:E119A 1.5 2.103855065 HM_1KTZ B:E119Q 1.6 1.723233644 HM_1KTZ B:E55A 1.2 0.545773096 HM_1KTZ B:E75A 1.1 1.240564568 HM_1KTZ B:F110A 0.9 0.113538779 HM_1KTZ B:F30A 3 2.395796743 HM_1KTZ B:H79A 0.3 -0.146093759 HM_1KTZ B:I125A 0.5 0.596465756 101 Table A.1 (cont’d) HM_1KTZ B:I50A 1.9 2.295236759 HM_1KTZ B:I53A 1.4 2.866933899 HM_1KTZ B:L27A 1.8 0.888414405 HM_1KTZ B:M112A 0.9 0.676475044 HM_1KTZ B:N47A 0.3 1.074625229 HM_1KTZ B:S49A 0.3 0.270350336 HM_1KTZ B:S52A 0.2 0.912370354 HM_1KTZ B:S52L 4 3.351510141 HM_1KTZ B:T51A 1.5 1.216283828 HM_1KTZ B:V62A 0.7 0.515725847 HM_1KTZ B:V77A 0.4 0.407517511 HM_1YY9 H:A98W -0.23 0.418109098 HM_1YY9 H:G33D -0.89 0.05958884 HM_1YY9 H:I51G -0.51 0.044470256 HM_1YY9 H:R97D -0.99 0.442808427 HM_1YY9 H:T100Y -0.45 0.975050876 HM_1YY9 H:T57G -0.46 0.074634377 HM_1YY9 H:T57P -0.48 
0.667425841 HM_1YY9 H:V50L -0.6 -0.330566734 HM_1YY9 H:V50Q -0.64 0.330782419 HM_1YY9 H:Y32R -0.13 0.968267084 HM_1YY9 L:A25V -0.62 1.606320727 HM_1YY9 L:G30Y -0.92 0.771856501 HM_1YY9 L:Q27Y -0.83 1.455638579 HM_1YY9 L:T97C -0.51 0.320064 HM_1YY9 L:T97D -0.66 0.640174187 102 Table A.1 (cont’d) HM_1YY9 L:T97S -0.7 0.865009006 HM_2NYY A:D1062A -0.14 0.351028066 HM_2NYY A:D902A -0.49 0.575859888 HM_2NYY A:E920A 0.5 2.064501081 HM_2NYY A:F953A 1.53 1.209780382 HM_2NYY A:I917A 0.95 0.788091299 HM_2NYY A:K915A -0.31 0.450918691 HM_2NYY A:K955A -0.5 0.152298061 HM_2NYY A:L919A 0.41 1.443340344 HM_2NYY A:N918A 0.3 0.816280603 HM_2NYY A:N957A 0.01 0.287045341 HM_2NYY A:P1063A 0.24 0.619803278 HM_2NYY A:R1061A 0.58 0.541139471 HM_2NYY A:R1064A 4.44 3.124877033 HM_2NYY A:R903A -0.02 0.354346073 HM_2NYY A:S954A -0.66 0.027106388 HM_2NYY A:T923A 0.62 1.054733231 HM_2NYY H:K30R 0.51 0.565871848 HM_2NYY H:M34Q -0.01 1.20044627 HM_2NYY H:Y31D 0.97 1.443423688 HM_2NYY H:Y31Q 1.33 1.995269269 HM_2NYY H:Y57Q 1.19 1.562775376 HM_2NYY L:D30Y 0.6 0.980379677 HM_2NYY L:H34R 0.63 1.12922711 HM_2NYY L:S28Q 0.7 1.168373331 HM_2NYY L:S31N 1.2 1.293459172 103 Table A.1 (cont’d) HM_2NZ9 A:D1062A -0.14 0.11737132 HM_2NZ9 A:D902A -0.49 -0.241554685 HM_2NZ9 A:E920A 0.5 1.035459785 HM_2NZ9 A:F953A 1.53 1.031164807 HM_2NZ9 A:I917A 0.95 1.372849249 HM_2NZ9 A:K915A -0.31 0.162128377 HM_2NZ9 A:K955A -0.5 0.619860501 HM_2NZ9 A:L919A 0.41 0.577141695 HM_2NZ9 A:N918A 0.3 1.316139699 HM_2NZ9 A:N957A 0.01 0.006647221 HM_2NZ9 A:P1063A 0.24 0.187907852 HM_2NZ9 A:R1061A 0.58 0.695574023 HM_2NZ9 A:R1064A 4.44 1.932082274 HM_2NZ9 A:R903A -0.02 2.923683823 HM_2NZ9 A:S954A -0.66 0.852797077 HM_2NZ9 A:T923A 0.62 1.028792168 HM_3BN9 H:P100H 1.36 2.322189072 HM_3BN9 H:Q100aV 0.41 0.43859132 HM_3BN9 H:S30G 0.71 0.850440861 HM_3BN9 H:S30N 0.71 2.00668105 HM_3BN9 H:T28R -0.24 2.270706962 HM_3BN9 H:T98Q -0.16 1.495593848 HM_3BN9 H:T98R -1.57 0.343438481 HM_3BN9 H:Y99S 0.54 1.790362979 104 APPENDIX B SUPPLEMENTARY MATERIALS FOR LTP MODEL B.1 Supplementary result on S2648 and S350 dataset For quantitative prediction of protein folding energy change upon mutation, we first apply our LTP1 and LTP2 methods to S2648 and S350 datasets. S350 set is tested with models trained on S2648 excluding S350 set and 5-fold cross validation is conducted for S2648. Table B.1: Comparison of Pearson correlation (Rp ) and RMSEs (kcal/mol) of various methods on prediction of mutation induced protein stability changes of the S350 set and 5-fold cross validation of mutation induced protein stability changes of the S2648. n represents number of samples successfully processed. LTP1 is our topological based mutation predictor that solely utilizes structural information. LTP2 is our model that complements LTP1 with auxiliary features. The results reported in the publications are listed in the table [70].50 repeated runs are conducted and median values of metrics are picked for LTP2 and LTP1 methods. 
                               S350                    S2648
Method                         n      Rp     RMSE      n      Rp     RMSE
LTP2 (EIG^E_{11.0,0.5})        350    0.81   0.93      2648   0.78   0.92
LTP2 (EIG^L_{12.0,2.0})        350    0.80   0.96      2648   0.78   0.92
STRUM                          350    0.79   0.98      2647   0.77   0.94
LTP1 (EIG^E_{4.0,1.3})         350    0.76   1.05      2648   0.72   1.03
LTP1 (EIG^L_{6.0,1.3})         350    0.75   1.07      2648   0.72   1.02
mCSM                           350    0.73   1.08      2643   0.69   1.07
INPS                           350    0.68   1.25      2648   0.56   1.26
PoPMuSiC 2.0                   350    0.67   1.16      2647   0.61   1.17
PoPMuSiC 1.0                   350    0.62   1.23      -      -      -
I-Mutant 3.0                   338    0.53   1.35      2636   0.60   1.19
Dmutant                        350    0.48   1.28      -      -      -
Automute                       315    0.46   1.42      -      -      -
CUPSAT                         346    0.37   1.46      -      -      -
Eris                           324    0.35   1.49      -      -      -
I-Mutant 2.0                   346    0.29   1.50      -      -      -

B.2 Supplementary results on the Q3421 dataset

In 5-fold cross-validation on the Q3421 set, an Rp/RMSE (kcal/mol) of 0.79/1.2 was reported for the STRUM method [70].

Table B.2: 5-fold cross-validation results on the Q3421 set with respect to Rp and RMSE (kcal/mol). In the LTP1 method, two-scale models are also considered by coupling two sets of Hessian eigenvalue features. For the LTP1 and LTP2 methods, 50 repeated runs were conducted and the median values of the metrics are reported.

Method   Features                                  Rp     RMSE
LTP2     EIG^E_{14.0,0.6}                          0.79   1.22
LTP2     EIG^L_{10.0,0.6}                          0.79   1.22
LTP1     EIG^E_{2.0,1.5}                           0.68   1.44
LTP1     EIG^L_{3.0,1.3}                           0.68   1.44
LTP1     EIG^E_{2.0,1.5}; EIG^E_{5.0,0.7}          0.70   1.42
LTP1     EIG^E_{2.0,1.5}; EIG^L_{22.0,0.9}         0.70   1.41

B.3 Supplementary results on optimal parameter selection

We searched for the optimal parameters (i.e., the kernel type α and the parameters β and τ) of the Hessian eigenvalue features EIG^α_{β,τ} for both datasets and both methods. Figure B.1 shows the predictive behavior of the LTP1 model for protein folding energy changes upon mutation: median values of the Pearson correlation (Rp) are plotted against β and τ for predictions on S350 with (A) the exponential kernel and (B) the Lorentz kernel, and for 5-fold cross-validation on S2648 with (C) the exponential kernel and (D) the Lorentz kernel. In (A)-(B), Pearson correlations below 0.7 take the same color as 0.7; in (C)-(D), Pearson correlations below 0.68 take the same color as 0.68. The parameters with the highest Pearson correlation are: (A) EIG^E_{4.0,1.4} with Rp/RMSE of 0.76/1.05; (B) EIG^L_{6.0,1.3} with Rp/RMSE of 0.75/1.07; (C) EIG^E_{4.0,1.4} with Rp/RMSE of 0.72/1.03; (D) EIG^L_{6.0,1.3} with Rp/RMSE of 0.72/1.02.

Figure B.1: Predictive behaviors of the LTP1 model on protein folding energy changes upon mutation.
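To make the EIG^α_{β,τ} features concrete, here is a minimal numerical sketch: a density built from the exponential kernel, a finite-difference 3×3 Hessian at a probe point, and its eigenvalues. This is an illustration under assumptions (toy coordinates, central finite differences rather than whatever differentiation scheme the thesis code uses, η and κ standing in for the subscript parameters, and no element-type grouping of the eigenvalues, which the LTP model performs before feeding them to the gradient boosting trees).

```python
import numpy as np

def density(x, atoms, eta=4.0, kappa=1.3):
    """Exponential-kernel density at point x: sum_j exp(-(|x - x_j|/eta)^kappa)."""
    r = np.linalg.norm(atoms - x, axis=1)
    return np.exp(-(r / eta) ** kappa).sum()

def hessian_eigenvalues(x, atoms, h=1e-4, **kw):
    """Eigenvalues of the 3x3 Hessian of the density at x, via central differences."""
    H = np.zeros((3, 3))
    e = np.eye(3)
    f0 = density(x, atoms, **kw)
    for i in range(3):
        # Second derivative along axis i.
        H[i, i] = (density(x + h * e[i], atoms, **kw) - 2.0 * f0
                   + density(x - h * e[i], atoms, **kw)) / h**2
        for j in range(i + 1, 3):
            # Mixed partial derivative for axes (i, j).
            Hij = (density(x + h * e[i] + h * e[j], atoms, **kw)
                   - density(x + h * e[i] - h * e[j], atoms, **kw)
                   - density(x - h * e[i] + h * e[j], atoms, **kw)
                   + density(x - h * e[i] - h * e[j], atoms, **kw)) / (4.0 * h**2)
            H[i, j] = H[j, i] = Hij
    return np.linalg.eigvalsh(H)  # three local descriptors of the surface

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    atoms = rng.uniform(0.0, 10.0, size=(30, 3))  # toy heavy-atom coordinates
    probe = np.array([5.0, 5.0, 5.0])             # e.g., a point near the mutation site
    print(hessian_eigenvalues(probe, atoms, eta=4.0, kappa=1.3))
```

In the full LTP pipeline, eigenvalues collected over probe points near the mutation site, grouped by the element type of the contributing atoms, become the feature vector for the gradient boosting model.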
BIBLIOGRAPHY

[1] Adams, Henry, Andrew Tausz & Mikael Vejdemo-Johansson. 2014. Javaplex: A research software package for persistent (co)homology. In International congress on mathematical software, 129–136. Springer.

[2] Anfinsen, Christian B. 1973. Principles that govern the folding of protein chains. Science 181(4096). 223–230.

[3] Barouch, Dan H, James B Whitney, Brian Moldt, Florian Klein, Thiago Y Oliveira, Jinyan Liu, Kathryn E Stephenson, Hui-Wen Chang, Karthik Shekhar, Sanjana Gupta et al. 2013. Therapeutic efficacy of potent neutralizing hiv-1-specific monoclonal antibodies in shiv-infected rhesus monkeys. Nature 503(7475). 224.

[4] Bas, Delphine C, David M Rogers & Jan H Jensen. 2008. Very fast prediction and rationalization of pKa values for protein–ligand complexes. Proteins: Structure, Function, and Bioinformatics 73(3). 765–783.

[5] Bava, K Abdulla, M Michael Gromiha, Hatsuho Uedaira, Koji Kitajima & Akinori Sarai. 2004. ProTherm, version 4.0: thermodynamic database for proteins and mutants. Nucleic acids research 32(suppl 1). D120–D121.

[6] Ben-Kasus, Tsipi, Bilha Schechter, Michael Sela & Yosef Yarden. 2007. Cancer therapeutic antibodies come of age: targeting minimal residual disease. Molecular oncology 1(1). 42–54.

[7] Berman, Helen M, John Westbrook, Zukang Feng, Gary Gilliland, Talapady N Bhat, Helge Weissig, Ilya N Shindyalov & Philip E Bourne. 2000. The protein data bank. Nucleic acids research 28(1). 235–242.

[8] Biovia, Dassault Systemes. 2017. Discovery studio modeling environment.

[9] Breiman, Leo. 1997. Arcing the edge. Technical Report 486, Statistics Department, University of California at Berkeley.

[10] Cang, Z. X., Lin Mu, Kedi Wu, Kris Opron, Kelin Xia & Guo-Wei Wei. 2015. A topological approach to protein classification. Molecular based Mathematical Biology 3. 140–162.

[11] Cang, Z. X. & G. W. Wei. 2017. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics 33. 3549–3557.

[12] Cang, Zixuan, Lin Mu & Guo-Wei Wei. 2018. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS computational biology 14(1). e1005929.

[13] Cang, Zixuan & Guo-Wei Wei. 2017. Topologynet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions. PLoS computational biology 13(7). e1005690.

[14] Cang, Zixuan & Guo-Wei Wei. 2018. Integration of element specific persistent homology and machine learning for protein-ligand binding affinity prediction. International journal for numerical methods in biomedical engineering 34(2). e2914.

[15] Carter, Paul J. 2006. Potent antibody therapeutics by design. Nature reviews immunology 6(5). 343.

[16] Chen, Jiahui, Rui Wang, Menglun Wang & Guo-Wei Wei. 2020. Mutations strengthened sars-cov-2 infectivity. Journal of Molecular Biology 432(19). 5212–5226. doi:10.1016/j.jmb.2020.07.009.

[17] Chiti, Fabrizio & Christopher M Dobson. 2006. Protein misfolding, functional amyloid, and human disease. Annu. Rev. Biochem. 75. 333–366.

[18] Chollet, François. 2015. Keras. https://github.com/fchollet/keras.

[19] Chothia, Cyrus, Arthur M Lesk, Anna Tramontano, Michael Levitt, Sandra J Smith-Gill, Gillian Air, Steven Sheriff, Eduardo A Padlan, David Davies, William R Tulip et al. 1989. Conformations of immunoglobulin hypervariable regions. Nature 342(6252). 877.

[20] Collaborative Computational Project, Number 4. 1994. The CCP4 suite: programs for protein crystallography. Acta crystallographica. Section D, Biological crystallography 50(Pt 5). 760.

[21] Dehouck, Yves, Aline Grosfils, Benjamin Folch, Dimitri Gilis, Philippe Bogaerts & Marianne Rooman. 2009. Fast and accurate predictions of protein stability changes upon mutations using statistical potentials and neural networks: Popmusic-2.0. Bioinformatics 25. 2537–2543.

[22] Dehouck, Yves, Jean Marc Kwasigroch, Marianne Rooman & Dimitri Gilis. 2013. Beatmusic: prediction of changes in protein–protein binding affinity on mutations. Nucleic acids research 41(W1). W333–W339.
[23] Demarest, Stephen J & Scott M Glaser. 2008. Antibody therapeutics, antibody engineering, and the merits of protein stability. Current opinion in drug discovery & development 11(5). 675–687.

[24] Demerdash, O. N. A., M. D. Daily & J. C. Mitchell. 2009. Structure-based predictive models for allosteric hot spots. PLOS Computational Biology 5. e1000531.

[25] Dolinsky, Todd J, Jens E Nielsen, J Andrew McCammon & Nathan A Baker. 2004. Pdb2pqr: an automated pipeline for the setup of poisson–boltzmann electrostatics calculations. Nucleic acids research 32(suppl_2). W665–W667.

[26] Dreyfus, Stuart E. 1990. Artificial neural networks, back propagation, and the kelley-bryson gradient procedure. Journal of guidance, control, and dynamics 13(5). 926–928.

[27] Edelsbrunner, H., D. Letscher & A. Zomorodian. 2002. Topological persistence and simplification. Discrete Comput. Geom. 28. 511–533.

[28] Fasy, Brittany Terese, Jisu Kim, Fabrizio Lecci & Clément Maria. 2014. Introduction to the r package tda. arXiv preprint arXiv:1411.1830.

[29] Fogolari, F., A. Brigo & H. Molinari. 2002. The Poisson-Boltzmann equation for biomolecular electrostatics: a tool for structural biology. Journal of Molecular Recognition 15(6). 377–92. http://dx.doi.org/10.1002/jmr.577.

[30] Fogolari, F., A. Brigo & H. Molinari. 2003. Protocol for mm/pbsa molecular dynamics simulations of proteins. Biophysical Journal 85(1). 159–166. http://www.biophysj.org/cgi/content/full/85/1/159.

[31] Folkman, L., B. Stantic, A. Sattar & Y. Zhou. 2016. EASEMM: sequence-based prediction of mutation-induced stability changes with feature-based multiple models. J. Mol. Biol. 428. 1394–1405.

[32] Friedman, Jerome H. 2001. Greedy function approximation: a gradient boosting machine. Annals of statistics 1189–1232.

[33] Friedman, Jerome H. 2017. The elements of statistical learning: Data mining, inference, and prediction. Springer.

[34] Frosini, Patrizio. 1990. A distance for similarity classes of submanifolds of a euclidean space. Bulletin of the Australian Mathematical Society 42(3). 407–415.

[35] Frosini, Patrizio & Claudia Landi. 1999. Size theory as a topological tool for computer vision. Pattern Recognition and Image Analysis 9(4). 596–603.

[36] Fukushima, Kunihiko & Sei Miyake. 1982. Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and cooperation in neural nets, 267–285. Springer.

[37] Gameiro, M., Y. Hiraoka, S. Izumi, M. Kramar, K. Mischaikow & V. Nanda. 2014. Topological measurement of protein compressibility via persistence diagrams. Japan Journal of Industrial and Applied Mathematics 32. 1–17.

[38] Geng, Cunliang, Li C Xue, Jorge Roel-Touris & Alexandre MJJ Bonvin. 2019. Finding the δδg spot: Are predictors of binding affinity changes upon mutations in protein–protein interactions ready for it? Wiley Interdisciplinary Reviews: Computational Molecular Science e1410.

[39] Geng, Weihua, Sining Yu & G. W. Wei. 2007. Treatment of charge singularities in implicit solvent models. Journal of Chemical Physics 127. 114106.

[40] Getov, Ivan, Marharyta Petukh & Emil Alexov. 2016. Saafec: Predicting the effect of single point mutations on protein folding free energy using a knowledge-modified mm/pbsa approach. International Journal of Molecular Sciences 17. 512.

[41] Glennie, Martin J & Jan GJ van de Winkel. 2003. Renaissance of cancer therapeutic antibodies. Drug discovery today 8(11). 503–510.
[42] Hammarström, Per, R Luke Wiseman, Evan T Powers & Jeffery W Kelly. 2003. Prevention of transthyretin amyloid disease by changing protein misfolding energetics. Science 299(5607). 713–716.

[43] Haykin, Simon. 2010. Neural networks and learning machines, 3/e. Pearson Education India.

[44] He, Kaiming, Xiangyu Zhang, Shaoqing Ren & Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the ieee conference on computer vision and pattern recognition, 770–778.

[45] Ivakhnenko, Alekseı̆. n.d. Cybernetic predicting devices. Tech. rep.

[46] Ivakhnenko, Alekseı̆. n.d. Cybernetics and forecasting techniques.

[47] Jankauskaitė, Justina, Brian Jiménez-García, Justas Dapkūnas, Juan Fernández-Recio & Iain H Moal. 2018. Skempi 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics 35(3). 462–469.

[48] Kellogg, E. H., A. Leaver-Fay & D. Baker. 2011. Role of conformational sampling in computing mutation-induced changes in protein structure and stability. Proteins: Struct., Funct., Genet. 79. 830–838.

[49] Kleene, Stephen Cole. 2016. Representation of events in nerve nets and finite automata. Princeton University Press.

[50] Kortemme, Tanja, Alexandre V Morozov & David Baker. 2003. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes. Journal of molecular biology 326(4). 1239–1259.

[51] Kovacev-Nikolic, Violeta, Peter Bubenik, Dragan Nikolić & Giseon Heo. 2016. Using persistent homology and dynamical distances to analyze protein binding. Stat. Appl. Genet. Mol. Biol. 15(1). 19–38.

[52] Kucukkal, Tugba G, Marharyta Petukh, Lin Li & Emil Alexov. 2015. Structural and physicochemical effects of disease and non-disease nssnps on proteins. Current opinion in structural biology 32. 18–24.

[53] LeCun, Yann, Bernhard Boser, John S Denker, Donnie Henderson, Richard E Howard, Wayne Hubbard & Lawrence D Jackel. 1989. Backpropagation applied to handwritten zip code recognition. Neural computation 1(4). 541–551.

[54] Lensink, Marc F & Shoshana J Wodak. 2013. Docking, scoring, and affinity prediction in capri. Proteins: Structure, Function, and Bioinformatics 81(12). 2082–2095.

[55] Levy, Emmanuel D. 2010. A simple definition of structural regions in proteins and its use in analyzing interface evolution. Journal of molecular biology 403(4). 660–670.

[56] Liu, Beibei, Bao Wang, Rundong Zhao, Yiying Tong & Guo Wei Wei. 2017. ESES: software for Eulerian solvent excluded surface. Journal of Computational Chemistry 38. 446–466.

[57] Liu, Song, Chi Zhang, Hongyi Zhou & Yaoqi Zhou. 2004. A physical reference state unifies the structure-derived potential of mean force for protein folding and binding. Proteins: Structure, Function, and Bioinformatics 56(1). 93–101.

[58] Martinez, JL & F Baquero. 2000. Mutation frequencies and antibiotic resistance. Antimicrobial agents and chemotherapy 44(7). 1771–1777.

[59] Miller, Susan, Joel Janin, Arthur M Lesk & Cyrus Chothia. 1987. Interior and surface of monomeric proteins. Journal of molecular biology 196(3). 641–656.

[60] Mizutani, Eiji, Stuart E Dreyfus & Kenichi Nishio. 2000. On derivation of mlp backpropagation from the kelley-bryson optimal-control gradient formula and its application. In Proceedings of the ieee-inns-enns international joint conference on neural networks (ijcnn 2000), vol. 2, 167–172. IEEE.
[61] Moal, Iain H & Juan Fernández-Recio. 2012. Skempi: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28(20). 2600–2607.

[62] Morrison, Kim L & Gregory A Weiss. 2001. Combinatorial alanine-scanning. Current opinion in chemical biology 5(3). 302–307.

[63] Nguyen, Duc Duy, Zixuan Cang, Kedi Wu, Menglun Wang, Yin Cao & Guo-Wei Wei. 2018. Mathematical deep learning for pose and binding affinity prediction and ranking in d3r grand challenges. Journal of Computer Aided Molecular Design, in press. https://doi.org/10.1007/s10822-018-0146-6.

[64] Patil, Sachin P, Pedro J Ballester & Cassidy R Kerezsi. 2014. Prospective virtual screening for novel p53–mdm2 inhibitors using ultrafast shape recognition. Journal of computer-aided molecular design 28(2). 89–97.

[65] Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg et al. 2011. Scikit-learn: Machine learning in python. Journal of machine learning research 12(Oct). 2825–2830.

[66] Petukh, Marharyta, Luogeng Dai & Emil Alexov. 2016. Saambe: webserver to predict the charge of binding free energy caused by amino acids mutations. International journal of molecular sciences 17(4). 547.

[67] Petukh, Marharyta, Minghui Li & Emil Alexov. 2015. Predicting binding free energy change caused by point mutations with knowledge-modified mm/pbsa method. PLoS computational biology 11(7). e1004276.

[68] Pires, Douglas EV & David B Ascher. 2016. mcsm-ab: a web server for predicting antibody–antigen affinity changes upon mutation with graph-based signatures. Nucleic acids research 44(W1). W469–W473.

[69] Puente, Xose S, Luis M Sánchez, Christopher M Overall & Carlos López-Otín. 2003. Human and mouse proteases: a comparative genomic approach. Nature Reviews Genetics 4(7). 544–558.

[70] Quan, Lijun, Qiang Lv & Yang Zhang. 2016. Strum: structure-based prediction of protein stability changes upon single-point mutation. Bioinformatics 32(19). 2936–2946.

[71] Ramensky, Vasily, Peer Bork & Shamil Sunyaev. 2002. Human non-synonymous snps: server and survey. Nucleic acids research 30(17). 3894–3900.

[72] Rodrigues, Carlos H M, Yoochan Myung, Douglas E V Pires & David B Ascher. 2019. mcsm-ppi2: predicting the effects of mutations on protein–protein interactions. Nucleic acids research.

[73] Rosenblatt, Frank. 1957. The perceptron, a perceiving and recognizing automaton (Project Para). Cornell Aeronautical Laboratory.

[74] Rosenblatt, Frank. 1958. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review 65(6). 386.

[75] Rumelhart, David E, Geoffrey E Hinton & Ronald J Williams. 1985. Learning internal representations by error propagation. Tech. rep. California Univ San Diego La Jolla Inst for Cognitive Science.

[76] Rumelhart, David E, Geoffrey E Hinton & Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature 323(6088). 533–536.

[77] Scarselli, Franco, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner & Gabriele Monfardini. 2008. The graph neural network model. IEEE transactions on neural networks 20(1). 61–80.

[78] Schmidhuber, Jürgen. 2015. Deep learning in neural networks: An overview. Neural Networks 61. 85–117.

[79] Schymkowitz, Joost, Jesper Borg, Francois Stricher, Robby Nys, Frederic Rousseau & Luis Serrano. 2005. The foldx web server: an online force field. Nucleic acids research 33(suppl_2). W382–W388.
[80] Selkoe, Dennis J. 2003. Folding proteins in fatal ways. Nature 426(6968). 900–904.

[81] Shire, Steven J, Zahra Shahrokh & Jun Liu. 2004. Challenges in the development of high protein concentration formulations. Journal of pharmaceutical sciences 93(6). 1390–1402.

[82] Simonsen, Shane M, Lillian Sando, K Johan Rosengren, Conan K Wang, Michelle L Colgrave, Norelle L Daly & David J Craik. 2008. Alanine scanning mutagenesis of the prototypic cyclotide reveals a cluster of residues essential for bioactivity. Journal of biological chemistry 283(15). 9805–9813.

[83] Sirin, Sarah, James R Apgar, Eric M Bennett & Amy E Keating. 2016. Ab-bind: Antibody binding mutational database for computational affinity predictions. Protein Science 25(2). 393–409.

[84] Szilagyi, Andras & Yang Zhang. 2014. Template-based structure modeling of protein–protein interactions. Current opinion in structural biology 24. 10–23.

[85] Webb, Benjamin & Andrej Sali. 2014. Comparative protein structure modeling using modeller. Current protocols in bioinformatics 47(1). 5–6.

[86] Weiss, Gregory A, Colin K Watanabe, Alan Zhong, Audrey Goddard & Sachdev S Sidhu. 2000. Rapid mapping of protein functional epitopes by combinatorial alanine scanning. Proceedings of the National Academy of Sciences 97(16). 8950–8954.

[87] Werbos, Paul. 1974. Beyond regression: New tools for prediction and analysis in the behavioral sciences. Ph.D. dissertation, Harvard University.

[88] Xia, K. L. & G. W. Wei. 2014. Persistent homology analysis of protein structure, flexibility and folding. International Journal for Numerical Methods in Biomedical Engineering 30. 814–844.

[89] Xia, K. L. & G. W. Wei. 2015. Persistent topology for cryo-EM data analysis. International Journal for Numerical Methods in Biomedical Engineering 31. e02719.

[90] Xiang, Jason Z & B Honig. 2002. Jackal: A protein structure modeling package. Columbia University and Howard Hughes Medical Institute, New York.

[91] Xiong, Peng, Chengxin Zhang, Wei Zheng & Yang Zhang. 2017. Bindprofx: assessing mutation-induced binding affinity change by protein interface profiles with pseudo-counts. Journal of molecular biology 429(3). 426–434.

[92] Yang, Yuedong, Rhys Heffernan, Kuldip Paliwal, James Lyons, Abdollah Dehzangi, Alok Sharma, Jihua Wang, Abdul Sattar & Yaoqi Zhou. 2017. Spider2: A package to predict secondary structure, accessible surface area, and main-chain torsional angles by deep neural networks. In Prediction of protein secondary structure, 55–63. Springer.

[93] Yao, Yuan, Jian Sun, Xuhui Huang, Gregory R Bowman, Gurjeet Singh, Michael Lesnick, Leonidas J Guibas, Vijay S Pande & Gunnar Carlsson. 2009. Topological methods for exploring low-density states in biomolecular folding pathways. The Journal of chemical physics 130(14). 04B614.

[94] Zhang, Chi, Song Liu & Yaoqi Zhou. 2004. Accurate and efficient loop selections by the dfire-based all-atom statistical potential. Protein science 13(2). 391–399.

[95] Zhang, Zhe, Maria A Miteva, Lin Wang & Emil Alexov. 2012. Analyzing effects of naturally occurring missense mutations. Computational and mathematical methods in medicine 2012.

[96] Zhou, Y. C., M. Feig & G. W. Wei. 2008. Highly accurate biomolecular electrostatics in continuum dielectric environments. Journal of Computational Chemistry 29. 87–97.

[97] Zhou, Y. C., Shan Zhao, Michael Feig & G. W. Wei. 2006. High order matched interface and boundary method for elliptic equations with discontinuous coefficients and singular sources. J. Comput. Phys. 213(1). 1–30.
[98] Zhu, Kai, Tyler Day, Dora Warshaviak, Colleen Murrett, Richard Friesner & David Pearlman. 2014. Antibody structure determination using a combination of homology modeling, energy-based refinement, and loop prediction. Proteins: Structure, Function, and Bioinformatics 82(8). 1646–1655.

[99] Zomorodian, A. & G. Carlsson. 2005. Computing persistent homology. Discrete Comput. Geom. 33. 249–274.

[100] Zomorodian, Afra & Gunnar Carlsson. 2008. Localized homology. Computational Geometry - Theory and Applications 41(3). 126–148.