AUTO-PARAMETRIZED KERNEL METHODS FOR BIOMOLECULAR MODELING

By

Timothy Andrew Szocinski

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Mathematics – Doctor of Philosophy

2021

ABSTRACT

AUTO-PARAMETRIZED KERNEL METHODS FOR BIOMOLECULAR MODELING

By

Timothy Andrew Szocinski

Being able to predict various physical quantities of biomolecules is of great importance to biologists, chemists, and pharmaceutical companies. By applying machine learning techniques to develop such predictive models, we have found much success in our endeavors. Advanced mathematical techniques involving graph theory, algebraic topology, differential geometry, etc. have been very profitable in generating first-rate biomolecular representations that are used to train a variety of machine learning models. Some of these representations are dependent on a choice of kernel function along with parameters that determine its shape. These kernel-based methods of producing features require careful tuning of the kernel parameters, and the tuning cost increases exponentially as more kernels are involved. This limitation largely restricts us to machine learning models with fewer hyper-parameters, such as random forest (RF) and gradient-boosting trees (GBT), thus precluding the use of neural networks for kernel-based representations. To alleviate these concerns, we have developed the auto-parametrized weighted element-specific graph neural network (AweGNN), which uses kernel-based geometric graph features in which the kernel parameters are automatically updated throughout training to reach an optimal combination of kernel parameters. The AweGNN models have proven to be particularly successful in toxicity and solvation predictions, especially when a multi-task approach is taken. Although the AweGNN introduced hundreds of parameters that were automatically tuned, the ability to include multiple kernel types simultaneously was hindered by the computational expense. In response, the GPU-enhanced AweGNN was developed to tackle this issue.

By working with GPU architecture, the AweGNN's computation speed was greatly enhanced. To achieve a more comprehensive representation, we proposed a network consisting of fixed topological and spectral auxiliary features to bolster the original AweGNN's success. The proposed network was tested on new hydration and solubility datasets, with excellent results. To extend the auto-parametrized kernel technique to features of a different type, we introduced the theoretical foundation for building an auto-parametrized spectral layer, which uses kernel-based spectral features to represent biomolecular structures.

In this dissertation, we explore some underlying notions of mathematics useful in our models, review important topics in machine learning, discuss techniques and models used in molecular biology, detail the AweGNN architecture and results, and test and expand new concepts pertaining to these auto-parametrized kernel methods.

Copyright by TIMOTHY ANDREW SZOCINSKI 2021

To my Lord Jesus Christ, for his infinite love and grace.

ACKNOWLEDGEMENTS

First of all, I would like to thank Dr. Guowei Wei for being my advisor. His guidance and patience throughout my doctoral journey are greatly appreciated. I would also like to thank Dr. Duc Nguyen for his help and advice in all of my projects. He has been an indispensable resource, frequently taking time out of his schedule for my benefit.
I also thank my officemates, Menglun Wang and Rui Wang, for their willingness to help, their encouraging work ethic, and their openness to friendly conversation. I would like to thank my parents, Hanna and Roman Szocinski, for their great love and support throughout my time here, along with the rest of my loving family. Their encouragement was unyielding and steadfast. May their love for me bear fruit in my future endeavors, and bring much joy in their lives as I strive for success in the time to come. For those who have revealed to me the possibility of a better future, I thank Christen and Trevor Pollo. Where I thought much was lost, I could find hope in a true mission. Their leadership in the greatest cause of our time is truly inspiring and heartwarming. It will not go without notice. To the friends of my life: Daniel Stewart, Matthew Bartels, Joseph Tuski, J.M. Hoyt, Joseph Matthias, and others; who have led me to a path of truth and honor, I thank you whole-heartedly. Though my eyes could not see, and my heart could not feel, these souls have brought me clarity of vision, peace of mind, and renewal of spirit. To the men of the Order of the Cube, who have pledged themselves to a great enter- prise, I thank you. Our destinies are now bound together as we journey on in life, and thus we must rejoice together in our accomplishments. To Laura Durocher, the wonderful lady who has provided me with many joyful dis- tractions and motivation for success, I give you my utmost love and affection. Above all, I thank God, the Maker of all things. To Him I dedicate this dissertation, vi for all my work should be for His glory. vii TABLE OF CONTENTS LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii LIST OF ALGORITHMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Machine Learning and its Applications . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Biomolecular Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Abstract Mathematical Representations . . . . . . . . . . . . . . . . . . . . . 3 1.4 Kernel-based Methods and Auto-parametrization . . . . . . . . . . . . . . . 4 1.5 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.6 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 CHAPTER 2 MATHEMATICAL TECHNIQUES . . . . . . . . . . . . . . . . . . . . 8 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Graph Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 Basic Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.2 Adjacency and Laplacian Matrices . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Adjacency and Laplacian Matrices for Weighted Graphs . . . . . . . 15 2.3 Algebraic Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.3.1 Simplicial Complexes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.3.2 Simplicial Homology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.3.3 Persistent Homology . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.3.4 Čech and Vietoris-Rips Complexes . . . . . . . . . . . . . . . . . . . . 23 CHAPTER 3 MACHINE LEARNING . . . . . . . . . 
. . . . . . . . . . . . . . . . . 25 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.2 Fundamental Concepts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.1 Data Representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26 3.2.2 Loss Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.2.3 Evaluation Metrics and Data . . . . . . . . . . . . . . . . . . . . . . . 31 3.2.4 Model Weights and Gradient Descent . . . . . . . . . . . . . . . . . . 35 3.2.5 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 3.3 Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 3.4 Decision Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 3.5 Ensemble Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 3.5.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.5.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 3.5.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 3.5.4 Gradient Boosting Trees . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.6 Artificial Neural Networks and Deep Learning . . . . . . . . . . . . . . . . . 48 viii 3.6.1 Activation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 3.6.2 Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 3.6.3 Gradient Descent and Adaptive Learning Rates . . . . . . . . . . . . 54 3.6.4 Multi-task Networks and Transfer Learning . . . . . . . . . . . . . . . 57 CHAPTER 4 BIOMOLECULAR MODELING AND KERNEL METHODS . . . . . 60 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 4.2 Physical Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2.1 Electrostatic Interaction . . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2.2 Van der Waals Interaction . . . . . . . . . . . . . . . . . . . . . . . . . 67 4.2.3 Molecular Surface Area . . . . . . . . . . . . . . . . . . . . . . . . . . 68 4.3 Biomolecular Datasets of Small Molecules . . . . . . . . . . . . . . . . . . . . 69 4.3.1 Toxicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 4.3.2 Solubility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71 4.3.3 Solvation Free Energy and Hydration . . . . . . . . . . . . . . . . . . 72 4.4 Basic Kernel Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72 4.5 Kernel Methods applied to biology . . . . . . . . . . . . . . . . . . . . . . . . 74 4.5.1 Element-specific Groups . . . . . . . . . . . . . . . . . . . . . . . . . . 74 4.5.2 Element-specific Graph Representations . . . . . . . . . . . . . . . . . 75 4.5.3 Geometric Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76 4.5.4 Algebraic Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 4.5.5 Algebraic Topology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78 4.6 Multi-scale Kernel Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80 CHAPTER 5 AWEGNN: AUTO-PARAMETRIZED WEIGHTED ELEMENT-SPECIFIC GRAPH NEURAL NETWORKS FOR SMALL MOLECULES . . . . . 82 5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82 5.2 Theory and Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
85 5.2.1 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85 5.2.2 Overview of AweGNN . . . . . . . . . . . . . . . . . . . . . . . . . . . 88 5.2.3 Normalization Function . . . . . . . . . . . . . . . . . . . . . . . . . . 92 5.2.4 Biomolecular Geometric Graph Representations . . . . . . . . . . . . 92 5.2.5 Parameter Adjustment and Initialization . . . . . . . . . . . . . . . . 96 5.2.6 Derivatives of the Representation Function . . . . . . . . . . . . . . . 98 5.2.7 How to Update the Parameters . . . . . . . . . . . . . . . . . . . . . . 100 5.2.8 Multi-scale Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101 5.2.9 Model Architectures and Hyper-parameters . . . . . . . . . . . . . . 101 5.3 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104 5.3.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106 5.3.2 Toxicity Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107 5.3.2.1 LC50 -DM (Daphnia Magna) Set . . . . . . . . . . . . . . . . 108 5.3.2.2 Fathead Minnow LC50 Set . . . . . . . . . . . . . . . . . . . . 109 5.3.2.3 Tetraphymena Pyriformis IGC50 Set . . . . . . . . . . . . . . 111 5.3.2.4 Oral Rat LD50 Set . . . . . . . . . . . . . . . . . . . . . . . . . 112 5.3.3 Solvation Data Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114 ix 5.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 5.4.1 Impact of Automating Selection of Kernel Parameters . . . . . . . . . 116 5.4.2 Feature Importance and Analyzing the Trajectories of the Kernel Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117 5.4.3 Limitations and Advantages . . . . . . . . . . . . . . . . . . . . . . . . 119 5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122 CHAPTER 6 GENERAL AUTO-PARAMETRIZED KERNEL METHODS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2 GPU-Enhanced AweGNN and Multi-scale Methods . . . . . . . . . . . . . . 124 6.2.1 GPU programming . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124 6.2.2 Data Structure and Algorithm . . . . . . . . . . . . . . . . . . . . . . . 125 6.2.3 New Capabilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127 6.3 Comprehensive AweGNN with Fixed Topological and Spectral Descriptors 127 6.3.1 Model Features and Architecture . . . . . . . . . . . . . . . . . . . . . 128 6.3.2 FreeSolv and ESOL Data Set . . . . . . . . . . . . . . . . . . . . . . . . 129 6.3.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 6.4 Auto-parametrized Spectral Layer . . . . . . . . . . . . . . . . . . . . . . . . 132 6.4.1 Differentiability of Eigenvalue and Eigenvector Functions . . . . . . 132 6.4.2 Derivatives of the Eigenvalue Functions and the Update Rule . . . . 135 6.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137 6.5 Auto-parametrized Kernel Models and Future Possibilities . . . . . . . . . . 138 CHAPTER 7 DISSERTATION CONTRIBUTION . . . . . . . . . . . . . . . . . . . . 141 BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143 x LIST OF TABLES Table 5.1: Set statistics for quantitative toxicity data . . . . . . . . . . . . . . . . . . . 
108 Table 5.2: Comparison of prediction results for the LC50DM test set . . . . . . . . . 110 Table 5.3: Comparison of prediction results for the LC50 test set . . . . . . . . . . . . 111 Table 5.4: Comparison of prediction results for the IGC50 test set . . . . . . . . . . . 113 Table 5.5: Comparison of prediction results for the LD50 test set . . . . . . . . . . . 114 Table 5.6: Comparison of prediction results for the solvation test set . . . . . . . . . 115 Table 5.7: Supplementary results for solvation data set from Nguyen et al.[81] . . . 116 xi LIST OF FIGURES Figure 2.1: This figure displays a vertex, edge, triangle, and tetrahedron from left to right, which are representations of the 0, 1, 2, and 3-simplices respectively. These are all realizations of the simplices by embedding them into 3 dimensional space [93]. . . . . . . . . . . . . . . . . . . . . . . 18 Figure 2.2: This figure displays the calculation of a homology group. The sim- plicial complex, K, is made up of one 2-simplex (triangle), abc, 3 1- simplices (lines), ab, bc, and ac, and 3 0-simplices (points), a, b, and c. If we focus on the 1-dimensional chain groups, we notice that there is one cycle represented geometrically by the loop going from a to b to c. This loop, however, is cancelled by the boundary of the sim- plex, abc, by "closing" the loop, thereby displaying the topological idea that the loop can be deformed continuously across the triangle surface. Therefore, there are no non-trivial loops in the complex, and so dim H1 (K) = 0. If the simplicial complex did not contain the 2- simplex, then the complex, K, would have one non-trivial loop and we would have dim H1 (K) = 1. . . . . . . . . . . . . . . . . . . . . . . . . 20 Figure 2.3: This figure displays the geometric construction of a Čech complex from a point cloud, S ⊂ R2 , based on the radius parameter, r. The simplices are added based on the number of balls that have non- empty intersection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 Figure 3.1: This figure captures the challenge of finding just the right fit for a polynomial whose tunable parameters include the degree of the poly- nomial being fit. The true function that generated the distribution is shown in orange and the models that are fit to the data set are shown in blue. We can see that as we increase the degree of the polynomial, we more accurately fit the training data, but the model is becomes less able to generalize to new data points. We must balance the com- plexity of our model based on complexity of the distribution of our data [51]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 Figure 3.2: In this figure we show how varying the regularization constant when training models using ridge regularization affects how the model fits the data. On the left we see linear regression, and on the right we see polynomial regression. As the regularization constant increases, the model is "flattened". This "flattening" decreases the variance of the model, but increases bias [36]. . . . . . . . . . . . . . . . . . . . . . . . . . 40 xii Figure 3.3: In this figure, we show a decision tree model making a prediction for a new data point, x = (x1 , x2 , x3 , x4 , x5 ). The threshold values for the nodes are denoted by T1 , T2 , and T3 . At each node, one of the features is compared to the threshold at that node, determining the direction that the data point will travel down the tree. 
Notice that only the features, x1 , x2 , and x5 , are used for any prediction. Decision tree models do not always consider all features of data point when making a prediction, especially when they are pruned or limited by a maximum depth parameter during construction. . . . . . . . . . . . . . 43 Figure 3.4: This figure depicts the architecture of a simple feed-forward neural network with an input layer, hidden layers, and an output layer with one output. The neurons in the input layer, hold the values of the feature vectors of a sample from the data set, and the connections between the layers have weights associated with them, which deter- mine how the values are transformed as they move to the neurons in the next layer of the network. The yellow lines above show the flow of a sample through the network that yields a prediction at the out- put layer, while the turquoise arrows pointing backwards show the flow of the calculations of the gradients at each layer before an update. . 49 Figure 3.5: An illustration of a general multi-task network with three tasks. The input data from datasets 1, 2, and 3 are fed forward simultaneously, but through the hidden layers corresponding to their task. Then the last layers of each individual task’s hidden layers are pushed through the shared layers. Finally, the data points are separated again accord- ing to their task, and pushed through their corresponding second set of hidden layers and outputs to get the predictions. The average loss of the predictions is calculated, and the weights are then updated si- multaneously. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 Figure 4.1: This figure illustrates the solvent accessible surface compared to the Van der Waal surface. The solid red and gray colors represent the Van der Waals radii of the atoms in the molecule, and the striped yellow area represents the solvent accessible surface that is "rolled out" by the center of the translucent blue probe sphere (or solvent). . . . . . . . . 69 Figure 4.2: This figure shows a barcode representation being extracted from a biomolecular structure. On the left, we have the biomolecule in which the point cloud is taken by looking at the positions of all the atoms in the structure. On the right, we have the persistence information for the first 3 dimensions. The 0-dimensional persistence (points) for each class is on top, the 1-dimensional persistence (rings) is in the middle, and the 2-dimensional persistence (cavities) is on the bottom. The persistence is measure in Angtroms (Å) [22]. . . . . . . . . . . . . . . 79 xiii Figure 4.3: This figure depicts the featurization of the barcode data from persis- tent homology calculations. On the top, the barcode representation is shown, with 4 bars representing the "lives" of 4 different homology classes as a simplicial complex "evolves" through a filtration. Below the barcode data, the birth, death, and persistence feature vectors are constructed using a max length of 4 and a resolution of 0.5. . . . . . . . . 80 Figure 5.1: Pictorial visualization of the AweGNN training process. The first stage is splitting up the molecules in the data set into the element- specific groups, with initialized η and κ kernel parameters for each group. Then, a molecular representation is generated based on the kernel parameters, given by F in the diagram. 
This representation is fed into the neural network, then the kernel parameters are updated by back propagation through the neural network and through the representation function, F. . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 Figure 5.2: Diagram depicting the structure and automation of the MT-AweGNN. The model is trained by collecting samples from each data set pro- portionally and then generating their representations according to shared parameters. The representations are normalized together, then pushed through the artificial neural network and then to their corre- sponding outputs. Back propagation goes through the feature ma- trix, normalization function, and finally the representation function to update the kernel parameters. . . . . . . . . . . . . . . . . . . . . . . . 91 Figure 5.3: This shows an element specific subgraph corresponding to the ele- ment types C and O of aminopropanoic acid, C3 H6 NO2 (with the hydrogens omitted). The dashed orange edges represent the edges of the subgraph that are weighted by the chosen kernel function, Φ, while the solid blue lines represent covalent bonds. Notice the 3rd carbon atom does not connect to the oxygen atoms since it is cova- lently bonded to both of them. . . . . . . . . . . . . . . . . . . . . . . . . 95 Figure 5.4: Evolution of the kernel function of the C-N group for a MT-AweGNN. The picture below shows a series of kernel functions that were used to evaluate the 4 features corresponding to the C-N element-specific group at different points in the training of an MT-AweGNN. We choose specific snapshots during training that show a smooth change in the kernel function as the η-κ pair is updated. . . . . . . . . . . . . . . . . . . 98 xiv Figure 5.5: Internal architecture of our Auto-parametrized weight element-specific Graph Neural Network (AweGNN) model for the toxicity data sets. The input layer consists of our novel Geometric Graph Representa- tion (GGR) layer and a batch norm with no affine transformation. Hidden layers include a linear transformation followed by regular batch normalization and ReLU activation, while the output layer has the standard linear activation. . . . . . . . . . . . . . . . . . . . . . . . . . 103 Figure 5.6: Pictorial representation of the overall strategy. The arrows show the flow of the processes of training, extracting representations from trained AweGNNs, and making predictions. Blue corresponds to molecular data sets, light green corresponds to machine learning models, or- ange corresponds to molecular representations, and dark green cor- responds to predictions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105 Figure 5.7: Feature importance of MT-AweGNNs when applied to different data sets. We generate heat maps based on the average of feature impor- tance across 42 RF models trained on the corresponding NEARs of the MT-AweGNNs. The average feature importance of the 4 features of each element-specific group are summed together to get the final scores shown above in the maps. Figure 5.7a shows the results for the LD50 MT-RF models and figure 5.7b shows the results for the IGC50 MT-RF models. Note that the maps are asymmetric due to the way that the electrostatics-based features of a group are generated. . . . . . . 119 Figure 5.8: Average trajectories of kernel parameters for 42 MT-AweGNN mod- els. The graph on the left shows how the average of the η values of our 42 MT models for 8 different element-specific groups changes as the models are trained. 
On the right, we see the κ counterpart of this analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120 Figure 6.1: This figure gives a graphical representation of the parallel addition algorithm for calculating the sum of an array. We start with an initial array, a, of length, n. Then we begin by doing pairwise additions, ai := ai + a2i , for i = 1, ..., n/2. Then the next step includes pairwise additions, again ai := ai +a2i , but now for i = 1, ..., n/4. The pairwise additions, ai := ai + a2i , continue for i = 1, ..., n/2k , at step k, until there is only one number remaining to be summed, and then the ar- ray sum is extract from the first element, a0 . The image uses green to show elements used in the sum at each step, grey to show inactive elements, and the red circle highlighting the final answer after all the steps are completed. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126 xv Figure 6.2: This figure shows the general structure of the comprehensive net- work described in this section. The geometric graph representation (GGR) layer from the AweGNN takes center stage with auxiliary topo- logical and spectral features all merging together into one shared layer. Each set of features can have their own hidden layers to ef- fectively process the features before merging. . . . . . . . . . . . . . . . . 128 Figure 6.3: This figure displays the results gathered from the MoleculeNet pa- per. 8 model types were tested on a validation set and a test set and the average performance using the RMSE metric were recorded. The error bars record the standard deviation of the model results [111]. . . . 131 xvi LIST OF ALGORITHMS Algorithm 1: Gradient Descent . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 Algorithm 2: Batch Normalization . . . . . . . . . . . . . . . . . . . . . . . . . 53 Algorithm 3: Descent with Momentum . . . . . . . . . . . . . . . . . . . . . . . 55 Algorithm 4: AdaGrad . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Algorithm 5: RMSProp . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 Algorithm 6: Adam . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 xvii CHAPTER 1 INTRODUCTION 1.1 Machine Learning and its Applications Machine learning can be defined as a field of computer science that uses algorithms that help a computer learn from data. More precisely, an algorithm is said to be a machine learning algorithm if for some task, T, that can be measured with a performance measure, P, the algorithm directs the computer in some way that will use experience, E, so that it will improve the performance on task T measured by P [36]. A simpler way to describe a machine learning algorithm is to say that the algorithm directs a computer to improve its performance on a task without being explicitly programmed to do so. This is in contrast to programming that would have to find an output response for each and every input response, which, in many cases, is totally infeasible. When data becomes too complex for the human mind to generate a simple set of rules to decide what to do in a given situation, then we might consider applying machine learning methods. Machine learning algorithms have been applied in a variety of instances: computer vision, speech recognition, spam filters, medicine, self-driving vehicles, amongst others [84, 24, 29, 91, 35]. 
As the years pass by, many more more applications spring up every year as new ways of training machine learning models are discovered and computational capabilities naturally increase over time. Accessibility to massive amounts of data too, has added to the excitement of developing machine learning models, in which common people have access to millions and millions of pieces of data that are ready for use, or require little cleaning. In the field of molecular biology, interest has heightened as de- tailed molecular data has become available and demand for medical and bio-technologies has increased. Many successful machine learning models have been employed in the biomolecular field, and have caught the attention of many pharmaceutical companies, 1 which have been relying more and more on machine learning models to avoid costly experimentation to narrow down candidates for drugs. 1.2 Biomolecular Modeling Molecular modeling is the science and art of studying molecular structure and func- tion through model building and computation. The model building can be as simple as plastic templates or metal rods, or as sophisticated as interactive, animated color stere- ographics and laser-made wooden sculptures. The computations encompass ab initio and semi-empirical quantum mechanics, empirical (molecular) mechanics, molecular dy- namics, Monte Carlo, free energy and solvation methods, quantitative structure/activity relationships (QSAR), chemical/biochemical information and databases, and many other established procedures. The refinement of experimental data, such as from nuclear mag- netic resonance (NMR) or X-ray crystallography, is also a component of biomolecular modeling. In particular, we want to create models that will help us predict various quantities or qualities of different biomolecules or biomolecular structures. For example, predictions of solvation free energy, toxicity, and solubility of molecular compounds are highly useful for biomedical purposes. The prediction of the binding free energy of protein-ligand complexes is also extremely important for drug development and design, saving precious time and money on costly experimentation. As there has recently been a shift from the qualitative and phenomenological analysis of biomolecular systems to more detailed and quantitative approaches, we would like to follow in that path. We seek to apply machine learning algorithms to obtain models that can help us de- velop these predictive models, but this requires deep analysis and plenty of data. For- tunately, there has been an explosion in the availability of detailed biomolecular data, and methods of computer modeling have improved dramatically, making it possible to extract important information from the molecular structures to construct accurate pre- 2 dictive models. The computational capabilities of modern computers has spurred much advancement in the area of biomolecular modeling, and has shown no signs of stopping. 1.3 Abstract Mathematical Representations Advanced mathematics can be used to understand many real-world problems through modeling and approximation techniques. In particular, we would like to apply these mathematical tools to the problem of biology. Algebraic topology, geometry, graph theory, and other techniques have been applied before to obtain successful predictive models that have state-of-the-art performance [78, 110, 22, 80, 81, 79, 108]. 
These mathematical tools are used to describe the geometry, structure, and underlying physics of the biomolecules and biomolecular constructs through calculation of abstract features. These features are then combined into representation vectors that describe each molecule in the data set, from which we may apply a multitude of machine learning algorithms to create predictive models. Persistent homology, which measures persistence of the connected components, rings, and cavities of a cloud of points, can be used to extract features that provide a picture of the topological and geometric landscape of a biomolecular structure. Differential geom- etry can be used to measure the curvature of the interactions of the elements within a biomolecule or between the molecules in a biomolecular complex. Geometric graph the- ory can be used to describe the pairwise interactions between elements, and algebraic graphs can be used to provide spectral information, which may reveal information about the connectivity and other abstract properties about the arrangement of atoms. One can imagine that there are a plethora of possibilities when attempting to build representa- tions for the various qualities of biomolecules that would be useful for applying machine learning algorithms, so we explore them ardently. 3 1.4 Kernel-based Methods and Auto-parametrization Much of these mathematical representations of molecules and biomolecular systems rely on a chosen kernel function, and a careful tuning of parameters that determine the shape of the kernel function by setting the center of the cutoff and its sharpness. For instance, geometric graph features, as in the flexibility-rigidity index used in a previous paper, rely on the choice of a kernel function such as the Lorentz or exponential kernel function [78]. These are tuned by adjusting the τ and κ values that determine the dis- tance of the cutoff and the sharpness of the cutoff. Representations based on curvature are also based on kernel choice and kernel parameter tuning, as well as the represen- tations extracted from spectral data of adjacency and Laplacian matrices. Topological representations in previous works also did introduce kernel-based features, providing an opportunity to tune the models as well. When using kernel-based features, we may develop multi-scale representations, where we combine multiple choices of kernels and kernel parameters that increase the number of features that we get for our representation. Multi-scale methods reveal extra detail that cannot be captured by the limited scope of just one kernel function and parameter choice, and have shown to be very successful in previous attempts [78, 80, 81], while single-scale models are very limited in comparison. The downfall of the multi-scale models is in the increase of kernel parameters that must be tuned. Because of this increase in the amount of parameter choices when optimizing, it be- comes increasingly time-consuming to tune them. It is computationally infeasible to go beyond a certain point without using techniques that miss key parameter combinations. In the case of element-specific features, the kernel and parameter choices per scale too must be fixed across all pairs, missing out on many possibilities for capturing various biological effects based on the interactions between different elements. 
This limits us significantly in how we can develop our models in that we are virtually constrained to non-deep learning models, such as gradient boosting trees and random forest algorithms, 4 since they are more easily trained with fewer hyper-parameters to adjust. What we may do to bypass the problems above is to automate the process of parameter selection. By automating this process, we may tune the parameters during the training of our models. We can then take care of the parameter tuning with just one model in- stead of hundreds that must be compared to obtain the best result. Also, the parameters may be tuned specifically for each individual feature or small group of features. This means that hundreds of parameters may be tuned with one model instead of tuning 2 parameters after training hundreds of models. To automate the parameter selection, we utilized the back propagation of a neural network to tune these parameters, thereby intro- ducing the possibility of using a neural network with kernel-based representations and solving the problem of the high cost for kernel parameter tuning. This yielded the Auto- parametrized weighted element-specific Graph Neural Network (AweGNN) [100], which used an element-specific kernel-based geometric graph representation for the molecules. This auto-parametrized kernel model was found to be very successful, and was later im- proved by increasing computational efficiency and adding additional features. Also, the idea of the AweGNN can be extended to other molecular representations using various abstract mathematical topics described previously. 1.5 Motivation There has been exponential growth in the availability of data, even to the general population, and the rapid advances of machine learning and data science have provided us with a plethora of tools and techniques for modeling a variety of data sets and con- structing very successful predictive models that can aid us in almost any situation. This has been increasingly applied to tackle the big problems in the biological sciences in re- cent years. The increase in detailed biomolecular datasets give us a large pool of data from which to develop these machine learning models. Due to this fountain of high- quality biomolecular data, machine learning has become the most powerful tool in com- 5 putational biology. Solvation free energy and hydration free energy, toxicity, solubility, protein-ligand binding free energy, etc. are all important physical quantities associated with biomolecules or biomolecular complexes that are useful in industry, medicine, and science. Machine learning models are commonly utilized to predict these values, in which advanced mathematical techniques can aid increasing performance greatly. Kernel-based mathematical representations have been used to develop state-of-the-art machine learning models for a variety of molecular datasets. Despite the success of these models, the kernel-based models have a drawback. The kernels are dependent on param- eters that have to be hand-tuned. This manual tuning can be somewhat time consuming, especially when multi-scale models are used. For multi-scale models, a parameter grid search is required, and when the number of kernels becomes too large, the grid search may be infeasible. Because of the computational cost of the grid search, neural networks are generally omitted from experimentation, and random forest and/or gradient boosting trees are applied in place of them. 
To remedy this issue, we sought to automate the kernel parameter-tuning by using the backpropagation of a neural network to update the pa- rameters during training. This auto-parametrization technique, however, requires careful construction and computational efficiency to be applied successfully to these molecular datasets. 1.6 Outline In chapter 2, we provide an some mathematical background in graph theory and per- sistent homology, providing vital theoretical detail to understand how it may be applied to extract molecular descriptors for molecular datasets. In chapter 3, we cover funda- mental concepts in machine learning, including some details on specific machine learn- ing algorithms, which we apply in later chapters. Chapter 4 introduces the concept of biomolecular modeling and describes the changes in technology and data availability that are spurring on the field. Various techniques in biomolecular modeling are covered, 6 including physical modeling, and kernel methods applied to machine learning. Chapter 5 details the entire AweGNN paper, which introduces the first auto-parametrized kernel model. Chapter 6 details the general idea of the auto-parametrized kernel methods and introduces improvements and theories that extend the original AweGNN concept. Some brief results for a solubility and hydration dataset are recorded and compared to other models on a benchmark dataset. 7 CHAPTER 2 MATHEMATICAL TECHNIQUES 2.1 Introduction Mathematics has been applied to many different kinds of problems. Some of the first applications were found in commerce, land measurement, architecture, and astronomy. In our current times, we see that all sciences use a variety of mathematical tools to advance their subject in terms of modeling, understanding, and applying techniques to solve var- ious problems, and many of these tools are also used in industrial applications. In many cases, there are teams that will enlist the help of a mathematician for the purposes of un- derstanding various elements of problems arising within the many scientific disciplines due to the specialized nature of the matter. For an example of applying advance mathe- matical topics to scientific modeling, we may look to the physicist Richard Feynman, who invented the path integral formulation of quantum mechanics by combining his under- standing of theoretical mathematical concepts and his insight in the field of physics [66]. Today’s research on string theory, which is a scientific theory that attempts to unify the four fundamental forces of nature, continues to inspire mathematics and spur on many creative mathematicians to participate. There are many examples apart from this, such as the application of representation theory in chemistry, but the field of physics draws the most attention for its focus on mathematical modeling. Some topics in mathematics are relevant only in the area that inspired it, and are ap- plied to solve further problems in that particular area, while many times, the problems being solved even create new areas of mathematics that can be studied independently. Often times, mathematics inspired by one field proves useful in many different fields, and so merits the generalized study of a particular area of mathematics. 
There is often made a distinction between the study of pure mathematics, in which there is an emphasis 8 on strict proof-writing with a disregard for real-world application, and applied mathe- matics, where application of techniques are the focus and how to apply them to a variety of situations in the real world. There are many instances, however, of topics which were considered to be in the sphere of pure mathematics, but were applied to great effect in nu- merous problems, for example, the application of number theory in cryptography. Due to the explosion of knowledge in the modern scientific era, there have been scores of spe- cializations within the field of mathematics, and there are certainly more to come. While the prospect of having well developed fields of mathematics and an understanding of the abstract mathematical objects therein is appealing for a number of reasons, we cannot lose sight of the understanding that there are important problems to tackle in this world, and abstractions do little on their own to pave the way to prevail against them. Although there has been much effort to apply mathematics and computational tech- niques to the understanding of biological systems, we have not seen so many advances in the use of more abstract techniques as we have seen in physics and chemistry. Thus, we attempt to present some useful abstract concepts that may be useful in the field. In this chapter, we discuss special topics in two fields of abstract mathematics, graph theory and algebraic topology, and focus mainly on the applied elements of the discussed con- cepts. A base knowledge of linear algebra, abstract algebra, and topology are required to understand the details in this chapter. The theory described in this chapter will later be applied to topics in computational biology, as we will see in the later chapters. 2.2 Graph Theory A simple graph is composed of a set of vertices and a set of edges in which the ver- tices represent objects of interest or points in space, and the edges represent some sort of relationship between the vertices, or simply a line between the points. This very basic definition can be extended to include additional information that describes the system in more detail. These additions can include a direction of the edges indicating that there is a 9 relationship from one vertex to another or that there is a mutual relationship. The edges might also take on values to represent some quantitative relationship between the two vertices, or the vertices themselves may have values of their own. The edges or vertices may be colored to categorize components of the graph that may represent a type of re- lationship or a type of vertex. Graphs can take on as deep of a complexity as they need, but in any case they are very useful and natural constructs for analyzing many different problems in our world. Graph theory has many ties to different mathematical fields and sciences due to its simplicity and natural definitions. Set theory, number theory, algebra, topology, compu- tational biology, computer security, geography, and many more [32, 40]. A deep tie be- tween graph theory and linear algebra is present by virtue of the notion of the adjacency and Laplacian graphs that can be obtained from a graph. In our work, we use graphs to represent molecular structures, and relationships between the atoms that make up the molecular structures, where we can obtain numerical information from the matrices as- sociated with the graph structure. 
This natural representation of biomolecular structures has been particularly fruitful [80, 78, 81, 79, 90], and so we will revisit the concepts learned in this section and apply them to biomolecular models. The theorems and definitions are mainly retrieved from the book, "Modern Graph Theory", by Bollobás [18].

2.2.1 Basic Definitions

An undirected graph, G, is a pair of disjoint sets, (V, E), such that E ⊂ V^(2), where V^(2) is the set of unordered pairs of elements of V. Since an element of the edge set, E, can be described by an unordered pair of vertices, x, y ∈ V, we denote an edge connecting those two vertices by the two-element set {x, y} ∈ E. We say that G′ = (V′, E′) is a subgraph of G = (V, E), denoted G′ ⊂ G, if we have V′ ⊂ V and E′ ⊂ E. The neighborhood of a vertex, x ∈ V, denoted by Γ(x), is defined as Γ(x) = {y ∈ V : {x, y} ∈ E}, the set of all vertices, y, that are connected to x by an edge. The degree of the vertex, d(x), is then naturally defined by d(x) = |Γ(x)|, the number of edges that are connected to the vertex, x.

A graph, G = (V, E), is called a bipartite graph if there are two vertex sets, V1, V2 ⊂ V, with V = V1 ∪ V2, V1 ∩ V2 = ∅, and the condition that for every {x, y} ∈ E, either x ∈ V1 and y ∈ V2, or y ∈ V1 and x ∈ V2. One important, yet trivial, type of graph is the complete graph, which is a graph that has every vertex connected to every other vertex by an edge. More specifically, a graph G of n vertices is called a complete graph on n vertices if the edge set of G, namely E, is exactly E = {{x, y} ∈ V^(2) : x, y ∈ V, x ≠ y}. In this case, we denote G by G = K_n.

There are many properties that are important to graphs. One of the most basic properties of a graph is the number of connected components. For a graph, G = (V, E), two vertices, x, y ∈ V, are connected vertices if there is a path from x to y. A path from x to y is defined to be a series of edges, ({x1, y1}, {x2, y2}, ..., {xm, ym}), such that y_{i−1} = x_i for all i = 2, ..., m, with x = x1 and y = ym. The graph, G, is a connected graph if every pair of vertices, x, y ∈ V, with x ≠ y is connected. A maximal connected subgraph of the graph, G, is called a connected component of G, or simply a component of G.

One useful property of a graph is its connectivity. To define this, we need some additional language. Let W ⊂ V be a subset of vertices of the graph, G. Then we define the subtraction of G by W as G − W = (V − W, {{x, y} ∈ E : x, y ∈ V − W}). If G − W is disconnected, then we say that W separates G. If we let k ≥ 2, then the graph, G, is k-connected if either G = K_{k+1} or if G has at least k + 2 vertices and there is no set of k − 1 vertices that separates G. A graph is called 1-connected if it is a connected graph. The maximal k such that the graph, G, is k-connected is defined to be the connectivity of G, and is denoted by κ(G). This value gives us an idea of the strength of a graph's overall connection.

In many cases, we may wish to extend the basic notion of a graph to give us extra detail. If a graph has an edge set of ordered pairs instead of unordered pairs, i.e. elements of the edge set, E, have a direction from one vertex to another that is represented by the ordered tuple (x, y) ∈ E going from the vertex x to the vertex y, then the graph is called a directed graph, or digraph. Direction in a graph may be very important, for example, if you are working with fluid flow or modeling the popularity of internet stars (you may follow an individual online, but they may not follow you back). A weighted graph is a graph in which each edge in the edge set has a corresponding value associated with it, or in other words, the graph G comes with a weight function, w : E → X, from the edge set to a set of your choosing, where common choices are X = R, Z, or N. Weights are essential for encoding pertinent information that represents the strength of a connection between two vertices. This can be the strength of the flow within pipes at different junctions, the strength of an interaction between two atoms, etc.
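The definitions above translate almost directly into code. The following sketch is a minimal illustration written for this section (the five-vertex graph and the helper functions neighborhood, degree, and connected_components are made up for the example, not part of the dissertation's software); it stores an undirected graph as a vertex set and a set of two-element edge sets, computes Γ(x) and d(x), and finds the connected components by traversal.

```python
# Undirected graph G = (V, E) with edges stored as two-element frozensets.
V = {"a", "b", "c", "d", "e"}
E = {frozenset(p) for p in [("a", "b"), ("b", "c"), ("a", "c"), ("d", "e")]}

def neighborhood(x):
    """Gamma(x): the set of all vertices joined to x by an edge."""
    return {y for y in V if frozenset((x, y)) in E}

def degree(x):
    """d(x) = |Gamma(x)|."""
    return len(neighborhood(x))

def connected_components():
    """Maximal sets of pairwise-connected vertices, found by breadth-first search."""
    unseen, components = set(V), []
    while unseen:
        frontier = {unseen.pop()}
        component = set()
        while frontier:
            component |= frontier
            frontier = {y for x in frontier for y in neighborhood(x)} - component
        unseen -= component
        components.append(component)
    return components

print(neighborhood("a"), degree("a"))   # {'b', 'c'} (order may vary), 2
print(len(connected_components()))      # 2 components: {a, b, c} and {d, e}
```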
2.2.2 Adjacency and Laplacian Matrices

In graph theory, an adjacency matrix is a square matrix that describes the important connective information of a graph. The entries of the matrix express whether the pairs of a set of vertices are adjacent in the graph or not. Many key features of the graph can be extracted, such as information about the connected components of the graph, calculations of paths and cycles, shortest paths, and other important information. Another useful matrix associated with a graph is the Laplacian matrix. This matrix contains very similar details, but it has other properties, especially related to the spectrum of the matrix, that can provide additional information that the adjacency matrix cannot.

The adjacency matrix, A, of an undirected graph, G, is defined by the entries

A_ij = 1 if {i, j} ∈ E, and A_ij = 0 if {i, j} ∉ E.

It is easy to see that the adjacency matrix is a real symmetric matrix, i.e. A has real entries and A = A^T. This gives us useful properties that we will use later involving the spectrum of the matrix, which is the set of eigenvalues of the matrix. Namely, we have the important spectral theorem associated with real symmetric matrices:

Spectral Theorem. Let A be a real symmetric matrix. Then there is an orthogonal matrix P, i.e. P^T P = I = P P^T, such that D = P^T A P is a diagonal matrix.

The corollary of this theorem most importantly shows us that the eigenvalues and eigenvectors of a real symmetric matrix are real-valued. More precisely, D is a diagonal matrix with real entries, so the eigenvalues of D are real, say λ1, λ2, ..., λn, with the corresponding eigenvectors being the standard basis vectors e1, e2, ..., en. Now, let us define vi = P ei for every i. The theorem gives us that D = P^T A P with P P^T = I, so P D = A P. Thus, we have that for every i, λi vi = λi P ei = P(λi ei) = P(D ei) = (P D) ei = (A P) ei = A vi. This means that the eigenvalues of A are the real eigenvalues of D, and the eigenvectors of A are the columns of the orthogonal matrix P (which has real entries), so they form an orthogonal basis of R^n, with n being the size of the square matrix A. Therefore, any real symmetric matrix has real eigenvalues and real eigenvectors.
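As a quick numerical illustration of the spectral theorem, the sketch below (written for this text with NumPy; the 4-cycle is an arbitrary choice) diagonalizes the adjacency matrix of a cycle on four vertices and confirms that the eigenvalues are real and that the eigenvectors assemble into an orthogonal matrix P with A = P D P^T.

```python
import numpy as np

# Adjacency matrix of the 4-cycle on vertices 0-1-2-3-0.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

# eigh is intended for symmetric matrices: real eigenvalues, orthonormal eigenvectors.
eigenvalues, P = np.linalg.eigh(A)

print(np.round(eigenvalues, 6))                        # [-2.  0.  0.  2.]
print(np.allclose(P.T @ P, np.eye(4)))                 # True: P is orthogonal
print(np.allclose(P @ np.diag(eigenvalues) @ P.T, A))  # True: A = P D P^T
```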
The other important matrix that we can associate with a graph is the Laplacian matrix, as was mentioned earlier. The Laplacian matrix is very similar to the adjacency matrix, but the difference is that the Laplacian matrix has the degree of each vertex encoded on the diagonal. More precisely, we define the Laplacian, L, as the matrix with the entries

L_ij = −1 if {i, j} ∈ E, L_ij = 0 if {i, j} ∉ E and i ≠ j, and L_ii = d(i).

The Laplacian can also be defined in a more concise way. Let us define ∆ to be the diagonal matrix whose (i, i) entries are equal to the degree, d(i), of vertex i of the graph G. Then, if A is the adjacency matrix associated with the graph G, we define the Laplacian, L, as the difference of matrices, L = ∆ − A. We note that the Laplacian is also a real symmetric matrix, so the theorem above applies as well, namely that the eigenvalues are real numbers. The benefit of the Laplacian matrix comes from the results of the theorems associated with it, which reveal more about the spectrum of the Laplacian and the relation of the spectrum to the graph. This theorem gives us some additional information about the spectrum of the Laplacian [18]:

Theorem. The Laplacian matrix L of a graph, G = (V, E), is a positive semi-definite matrix, i.e. ⟨Lx, x⟩ ≥ 0 for all x ∈ C^n, where n = |V| and ⟨x, y⟩ = Σ_{i=1}^{n} x_i ȳ_i for x, y ∈ C^n.

With this theorem, we can easily see that all the eigenvalues of the Laplacian are non-negative, since for any eigenvalue λ of L and eigenvector v ∈ C^n normalized to unit length, i.e. ||v|| = 1, we have 0 ≤ ⟨Lv, v⟩ = ⟨λv, v⟩ = λ⟨v, v⟩ = λ||v||² = λ. Thus all eigenvalues of the Laplacian matrix of a graph are non-negative, and we also note that the zero eigenvalue occurs with multiplicity at least 1, since the vector v = (1, 1, ..., 1) is always an eigenvector with eigenvalue λ = 0, because the sum of the entries in each row of L is zero.

A second theorem that may be useful and reveals information about the graph is given below [18]:

Theorem. Let G be a graph with Laplacian L. Then the dimension of the nullspace of L is the number of connected components of G.

So, by applying computational techniques in linear algebra to find the dimension of the nullspace of a matrix, we may find the basic property of the number of connected components in a graph. We can also find more complex information from a deeper theorem dealing with the connectivity of a graph [18]:

Theorem. The vertex connectivity of an incomplete graph G, i.e. G ≠ K_n, is at least as large as the second smallest eigenvalue of the Laplacian of G.

Now we have a tool that can give us a lower bound for the connectivity of any graph just by using a standard concept from linear algebra, namely the eigenvalues of a matrix. Having calculated this value, we can understand how strongly connected a network or structure is once it has been put into the framework of a mathematical graph. This second smallest eigenvalue is commonly referred to as the Fiedler value, and it has many applications, such as the stability analysis of dynamical systems [5]. More generally, the theorem reveals that there may be important information locked within the spectrum of a Laplacian, and so it behooves us to analyze it.
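These spectral facts are easy to verify on a small example. The following sketch (an illustration constructed for this discussion; the five-vertex graph is arbitrary) builds the Laplacian L = ∆ − A of a graph with two connected components, checks that its eigenvalues are non-negative, counts the zero eigenvalues to recover the number of components, and reads off the Fiedler value.

```python
import numpy as np

# Graph on 5 vertices with two connected components:
# a triangle {0, 1, 2} and a single edge {3, 4}.
edges = [(0, 1), (1, 2), (0, 2), (3, 4)]
n = 5

A = np.zeros((n, n))              # adjacency matrix
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

D = np.diag(A.sum(axis=1))        # diagonal degree matrix, Delta
L = D - A                         # graph Laplacian, L = Delta - A

eigvals = np.linalg.eigvalsh(L)   # real eigenvalues, since L is symmetric
print(np.round(eigvals, 6))       # all non-negative: [0. 0. 2. 3. 3.]

num_components = int(np.sum(np.isclose(eigvals, 0.0)))
print("zero eigenvalues (= connected components):", num_components)  # 2

fiedler = np.sort(eigvals)[1]     # second smallest eigenvalue
print("Fiedler value:", round(fiedler, 6))  # 0 here, since the graph is disconnected
```

For this disconnected example the Fiedler value is 0, consistent with the connectivity bound above; for a connected graph the Fiedler value is strictly positive.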
2.2.3 Adjacency and Laplacian Matrices for Weighted Graphs

We consider now the notion of the adjacency matrix and the Laplacian for a weighted graph, G = (V, E, φ), where φ : E → R is the weight function that assigns the weights to the edges of the graph. Now, instead of constructing matrices with discrete values representing the presence or absence of edges, we construct similar matrices that account for the weights of the existing edges. More precisely, we define the weighted adjacency matrix of G, denoted by A_φ^G, with entries

(A_φ^G)_ij = φ({i, j}) if {i, j} ∈ E, and (A_φ^G)_ij = 0 if {i, j} ∉ E.

We then define in a similar way the weighted Laplacian of G, denoted by L_φ^G, with entries

(L_φ^G)_ij = −φ({i, j}) if {i, j} ∈ E, (L_φ^G)_ij = 0 if {i, j} ∉ E and i ≠ j, and (L_φ^G)_ii = Σ_{j : {i, j} ∈ E} φ({i, j}).

Now that we have encoded the weights of the graph into the linear algebraic structures derived from it, we may turn our attention once again to the spectra of these matrices. In the case of the weighted adjacency and Laplacian matrices, we do not have the benefit of the aforementioned theorems, but there are various instances of scholars revealing connections between the spectra of weighted Laplacians and geometry and graph theory [80, 85, 98, 20, 86].

2.3 Algebraic Topology

Although topology can be quite an abstract discipline, topological spaces can represent physical objects and notions so that they may be applied to problems in the real world [4, 1]. The biggest hurdle in applying topology is the effort required to take the smooth and stretchable nature of topological objects and transform them into something discrete or tangible that can be used for computation. One way to do this is to decompose the topological space into simpler topological pieces that can ultimately be encoded into a machine to make concrete computations. The spaces are, in a sense, approximated by these topological pieces, but the pieces are able to give us accurate topological information because the structure produced is topologically the same. In fact, the field of algebraic topology gives us the means by which to extract discrete algebraic information from continuous topological data. A whole array of computational techniques in topology can be found in the literature, including homology, cohomology, Morse theory, and more [34].

We want to ultimately be able to apply computational topology to our problem of biomolecular modeling. Persistent homology is a geometry-sensitive extension of the idea of the homology of a topological space. The homology of a space captures the presence of n-dimensional holes, and persistent homology looks at the evolution of the homology of the space as it is being constructed over time. Persistent homology has seen widespread use in our group, providing a useful tool for building successful machine learning models [22, 90, 110, 70]. We seek to lay out the theoretical foundation of these ideas in this section, starting with the notion of the simplex, the building block that we use to construct simplicial complexes. These complexes allow us to apply the rich theory of homology in a manner that is suitable for computation.

2.3.1 Simplicial Complexes

We now wish to formally present the idea of a simplicial complex, which can be used to approximate topological spaces so that we can obtain relevant topological information from computations. We will start with the geometric notion of a simplex, as outlined by Edelsbrunner [34]. Let x0, x1, ..., xk be points in R^d. We consider linear combinations, x = Σ_{i=0}^{k} λ_i x_i, in which each λ_i ∈ R and Σ_{i=0}^{k} λ_i = 1. These linear combinations are called affine combinations.
The affine hull of these points is then defined to be the set of all their affine combinations, and we notice that this hull is a $k$-plane if the $k + 1$ points are affinely independent, meaning that any two affine combinations, $a = \sum_{i=0}^{k} \lambda_i x_i$ and $b = \sum_{i=0}^{k} \mu_i x_i$, with $a = b$ must have $\lambda_i = \mu_i$ for all $i$. It can be seen that the $k + 1$ points are affinely independent iff the $k$ vectors $x_i - x_0$ are linearly independent for $i = 1, 2, \ldots, k$. Since $\mathbb{R}^d$ has at most $d$ linearly independent vectors, there are at most $d + 1$ affinely independent points in $\mathbb{R}^d$.

An affine combination, $x = \sum_{i=0}^{k} \lambda_i x_i$, is a convex combination if $\lambda_i \geq 0$ for all $i$, and the convex hull of the points is the set of all convex combinations of those points. A $k$-simplex, $\sigma$, is then defined to be the convex hull of $k + 1$ affinely independent points, denoted by $\sigma = \mathrm{conv}\{x_0, x_1, \ldots, x_k\}$, with its dimension defined to be $\dim \sigma = k$. On this note, we observe that the first four dimensions of simplices can be represented by a point, line segment, triangle, and tetrahedron respectively. This fact is depicted in figure 2.1 below. Consider the points whose convex hull describes a simplex, $\sigma = \mathrm{conv}\{x_0, x_1, \ldots, x_k\}$. The convex hull of any non-empty subset, $T \subseteq \{x_0, x_1, \ldots, x_k\}$, defines another simplex, and we call this simplex a face of the simplex $\sigma$. A face of a simplex is called proper if the face is not the entire simplex, and we write $\tau \leq \sigma$ for any face, $\tau$, of the simplex, $\sigma$, and $\tau < \sigma$ if it is a proper face.

Figure 2.1: This figure displays a vertex, edge, triangle, and tetrahedron from left to right, which are representations of the 0-, 1-, 2-, and 3-simplices respectively. These are all realizations of the simplices by embedding them into 3-dimensional space [93].

A simplicial complex, $K$, is a finite collection of simplices such that $\sigma \in K$ and $\tau \leq \sigma$ implies that $\tau \in K$, while $\sigma, \sigma' \in K$ implies that $\sigma \cap \sigma'$ is empty or a face of both $\sigma$ and $\sigma'$. In many cases, we want to represent these simplicial complexes in a more abstract way. We define an abstract simplicial complex, $A$, as a finite collection of sets such that $X \in A$ with $\emptyset \neq Y \subseteq X$ implies that $Y \in A$. The dimension of an abstract simplex is $\dim X = |X| - 1$. A face of the simplex, $X$, is a non-empty subset $Y \subseteq X$, and it is proper if $Y \neq X$. We find that for any geometric simplicial complex, we can produce an abstract simplicial complex, and vice-versa [34]. Thus, we have an intimate connection between the geometric and abstract constructs, allowing us to make discrete computations with the abstract complexes, which represent geometric realities. The main computations that we make come from an algebraic object associated with a topological space called the homology of the space.
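As a small illustration of the abstract point of view, the sketch below (an illustrative helper written for this example, not part of any cited construction) generates the abstract simplicial complex spanned by a single 2-simplex {a, b, c} by listing all of its non-empty faces, and checks the closure property that every face of a simplex in the complex is again in the complex.

from itertools import combinations

def closure(simplices):
    """All non-empty faces of the given simplices, as a set of frozensets."""
    complex_ = set()
    for simplex in simplices:
        for k in range(1, len(simplex) + 1):
            complex_.update(frozenset(face) for face in combinations(simplex, k))
    return complex_

def is_simplicial_complex(complex_):
    """Check that the collection is closed under taking non-empty faces."""
    return all(
        frozenset(face) in complex_
        for simplex in complex_
        for k in range(1, len(simplex) + 1)
        for face in combinations(simplex, k)
    )

K = closure([("a", "b", "c")])        # the triangle abc together with all of its faces
print(sorted(tuple(sorted(s)) for s in K))
print(is_simplicial_complex(K))                              # True
print(is_simplicial_complex(K - {frozenset({"a", "b"})}))    # False: a missing edge breaks closure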
2.3.2 Simplicial Homology

The homology groups of a topological space provide an understanding of the $k$-dimensional holes that are present within the space. By looking at how $k$-simplices are continuously mapped into the space, we can extract a group structure that yields numerical information. To be able to make these calculations with a computer, we use an equivalent version of homology that exploits the simplicial structure of the space. We begin by defining the objects of interest. For a simplicial complex, $K$, we define a $p$-chain to be a formal sum of $p$-simplices in $K$, $c = \sum_i a_i \sigma_i$, where the $\sigma_i$ are the $p$-simplices in $K$ and the $a_i$ are coefficients from a particular group. So as to allow for simple computations and intuitive understanding, we use the group $\mathbb{Z}/2\mathbb{Z}$, but other finite groups can also be used. In a theoretical setting, infinite groups such as $\mathbb{Z}$ are used, but such groups may not allow us to obtain information from a computational source. The operation of addition can be defined for two $p$-chains by component-wise addition, i.e. for $p$-chains $c_1 = \sum a_i \sigma_i$ and $c_2 = \sum b_i \sigma_i$, we have $c_1 + c_2 = \sum (a_i + b_i) \sigma_i$. The set of $p$-chains of $K$ under addition then defines a group, $C_p(K)$, called the chain group of $K$ of dimension $p$.

Now we wish to define a map between these chain groups, so that we can relate simplices of different dimensions. Let us define the boundary of a $p$-simplex, $\sigma = [x_0, x_1, \ldots, x_p]$, as the sum of its faces, namely
$$\partial_p \sigma = \sum_{j=0}^{p} [x_0, \ldots, \hat{x}_j, \ldots, x_p],$$
where the hat over the $x_j$ term means that it is omitted. Then, we can extend the idea of the boundary of one simplex to the whole chain group, $C_p(K)$, by defining $\partial_p : C_p(K) \to C_{p-1}(K)$ linearly, i.e. for a chain, $c = \sum_i a_i \sigma_i$, we have $\partial_p c = \sum_i a_i \partial_p(\sigma_i)$. In fact, this defines a homomorphism from the chain group of $p$-simplices to the chain group of $(p-1)$-simplices, and we call this the boundary map of chains. An important algebraic structure that comes from these boundary maps is the chain complex, which is the sequence of chain groups of all dimensions connected by the boundary maps, namely
$$\cdots \xrightarrow{\ \partial_{p+2}\ } C_{p+1}(K) \xrightarrow{\ \partial_{p+1}\ } C_p(K) \xrightarrow{\ \partial_{p}\ } C_{p-1}(K) \xrightarrow{\ \partial_{p-1}\ } \cdots$$
To obtain the homology groups that we seek from this chain complex, we must define two special types of chains, namely $p$-cycles and $p$-boundaries. A $p$-cycle is a $p$-chain whose boundary is the $0$-chain, i.e. $\partial_p c = 0$. Since the kernel of the homomorphism, $\partial_p$, is a subgroup of $C_p(K)$, the set of $p$-cycles under addition, which we denote by $Z_p(K) = \ker \partial_p$, is indeed a group. Next, we define a $p$-boundary as a $p$-chain that is the boundary of a $(p+1)$-chain, i.e. $c$ is a $p$-boundary if there is a $d \in C_{p+1}(K)$ such that $\partial_{p+1} d = c$. Then, $B_p(K) = \mathrm{im}\, \partial_{p+1}$ is the group of $p$-boundaries in $C_p(K)$. An important result that allows us to calculate the homology groups of a chain complex is the fact that $\partial_p \partial_{p+1} = 0$. This means that $B_p(K) = \mathrm{im}\, \partial_{p+1} \subseteq \ker \partial_p = Z_p(K)$, and so we may define the $p$-th homology group of the simplicial complex $K$ as $H_p(K) = Z_p(K)/B_p(K)$. We then define the dimension of the homology group, denoted by $\dim H_p(K)$, to be the rank of $H_p(K)$ (or the dimension of the group when viewed as a vector space). The dimension of $H_p(K)$ is also known as the $p$-th Betti number, denoted by $\beta_p = \dim H_p(K)$.

Figure 2.2: This figure displays the calculation of a homology group. The simplicial complex, $K$, is made up of one 2-simplex (triangle), $abc$, three 1-simplices (edges), $ab$, $bc$, and $ac$, and three 0-simplices (vertices), $a$, $b$, and $c$. If we focus on the 1-dimensional chain group, we notice that there is one cycle, represented geometrically by the loop going from $a$ to $b$ to $c$. This loop, however, is cancelled by the boundary of the simplex $abc$, which "closes" the loop, thereby displaying the topological idea that the loop can be deformed continuously across the triangle surface. Therefore, there are no non-trivial loops in the complex, and so $\dim H_1(K) = 0$. If the simplicial complex did not contain the 2-simplex, then the complex, $K$, would have one non-trivial loop and we would have $\dim H_1(K) = 1$.
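The calculation in figure 2.2 can be reproduced numerically. With $\mathbb{Z}/2\mathbb{Z}$ coefficients, each boundary map is a binary matrix, and $\beta_p = \dim C_p - \mathrm{rank}\,\partial_p - \mathrm{rank}\,\partial_{p+1}$. The sketch below is a minimal illustration for the triangle complex, assuming only NumPy; the mod-2 rank routine is a small helper written for this example.

import numpy as np

def rank_mod2(M):
    """Rank of a 0/1 matrix over the field Z/2Z, by Gaussian elimination."""
    M = np.array(M, dtype=int) % 2
    rank = 0
    for col in range(M.shape[1]):
        pivots = np.nonzero(M[rank:, col])[0]
        if pivots.size == 0:
            continue
        M[[rank, rank + pivots[0]]] = M[[rank + pivots[0], rank]]   # move a pivot row up
        for row in range(M.shape[0]):
            if row != rank and M[row, col]:
                M[row] = (M[row] + M[rank]) % 2                     # clear the pivot column
        rank += 1
        if rank == M.shape[0]:
            break
    return rank

# Boundary matrices of the complex in figure 2.2 (vertices a, b, c; edges ab, bc, ac; triangle abc).
d1 = np.array([[1, 0, 1],     # vertex a appears in edges ab and ac
               [1, 1, 0],     # vertex b appears in edges ab and bc
               [0, 1, 1]])    # vertex c appears in edges bc and ac
d2 = np.array([[1], [1], [1]])    # the boundary of abc is ab + bc + ac

# Betti numbers: beta_p = dim C_p - rank d_p - rank d_(p+1), with rank d_0 = 0.
beta0 = 3 - 0 - rank_mod2(d1)
beta1_filled = 3 - rank_mod2(d1) - rank_mod2(d2)    # triangle abc included
beta1_hollow = 3 - rank_mod2(d1) - 0                # triangle abc removed
print(beta0, beta1_filled, beta1_hollow)            # 1, 0, 1, matching figure 2.2

In practice one never assembles these matrices by hand; the persistent homology software mentioned below performs this bookkeeping for filtered complexes.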
We notice that the cycles represent loops, cavities, and higher-dimensional "holes" that can be found in a complex, while boundaries represent the higher-dimensional simplicial structures that "fill in" the space inside the holes. This idea is illustrated in figure 2.2, which shows the calculation of the 1-dimensional homology group. The Betti numbers that we retrieve from the calculation of these homology groups give us numerical information about the topology of the complex, which may be useful in some applications. We will extend the ideas in this section to reveal deeper information about the structure of topological spaces, information that is far more useful than the Betti numbers of a single fixed simplicial complex.

2.3.3 Persistent Homology

We now wish to obtain information from an evolving complex that gradually increases the number of simplices that form it. As new pieces enter the complex, the homology of the complex will change. Holes will come into and out of existence as classes of cycles merge, and we would like to capture the persistence of each of these topological features. We quantify this persistence by first explaining what is known as the elder rule. This rule dictates how persistence is understood by giving precedence to the older of two cycle classes when they merge as a disrupting boundary is introduced: the older cycle persists as the younger cycle dies.

We consider a simplicial complex, $K$, with a monotonic function, $f : K \to \mathbb{R}$, in which $f(\tau) \leq f(\sigma)$ whenever $\tau \leq \sigma$. With this definition of monotonic, we notice that the sublevel sets, $K(a) = f^{-1}(-\infty, a] \subseteq K$, are subcomplexes of $K$. Since $K$ is finite, $f$ takes a finite number of values, $a_0 < a_1 < \cdots < a_n$ in $\mathbb{R}$, and we can define $K_i = f^{-1}(-\infty, a_i]$ for each $i$ so that we get an increasing sequence of subcomplexes,
$$\emptyset = K_0 \subseteq K_1 \subseteq \cdots \subseteq K_n = K.$$
We call this sequence of subcomplexes a filtration of $f$. Now, for every $i \leq j$, we have an inclusion map $\iota^{i,j} : K_i \hookrightarrow K_j$, which induces a homomorphism, $\iota_p^{i,j} : H_p(K_i) \to H_p(K_j)$, for every dimension $p$. If we consider the consecutive induced inclusion maps, we obtain a sequence of homology groups for every dimension, $p$,
$$0 = H_p(K_0) \longrightarrow H_p(K_1) \longrightarrow \cdots \longrightarrow H_p(K_n) = H_p(K).$$
As we go through the homology groups of the filtration along the induced inclusion maps, we find that non-trivial cycles in one homology group get merged with others as new boundaries are introduced. The length for which these non-trivial cycles persist is then captured by calculations of Betti numbers. We define the $p$-th persistent homology groups to be the images of the induced maps, $H_p^{i,j}(K) = \mathrm{im}\, \iota_p^{i,j}$, and the $p$-th persistent Betti numbers to be the dimensions of the corresponding groups, $\beta_p^{i,j} = \dim H_p^{i,j}$. First, notice that $H_p^{i,i} = H_p(K_i)$. Then notice that the persistent homology group, $H_p^{i,j}$, consists of the homology classes of $K_i$ that have survived to the complex $K_j$; more formally, $H_p^{i,j} = Z_p(K_i)/(B_p(K_j) \cap Z_p(K_i))$. For $\gamma \in H_p(K_i)$, we say that $\gamma$ is born at $K_i$ if $\gamma \notin H_p^{i-1,i}$. If $\gamma$ is born at $K_i$ and then merges with another class at $K_j$, i.e. $\iota_p^{i,j-1}(\gamma) \notin H_p^{i-1,j-1}$ but $\iota_p^{i,j}(\gamma) \in H_p^{i-1,j}$, then we say that $\gamma$ dies entering $K_j$. This is precisely the elder rule mentioned earlier. We now have a way to obtain information about the persistence of homology classes, or cycles, with respect to our filtration function, $f$.
If $\gamma$ is a class born at $K_i$ that dies entering $K_j$, then we define the persistence of the class to be $\mathrm{pers}(\gamma) = a_j - a_i$, where $a_i$ and $a_j$ are the values that determined the sublevel sets, $K_i = f^{-1}(-\infty, a_i]$ and $K_j = f^{-1}(-\infty, a_j]$, of the filtration. We can similarly define the birth and death values, i.e. $\mathrm{birth}(\gamma) = a_i$ and $\mathrm{death}(\gamma) = a_j$. These values are calculated with sophisticated packages such as Ripser, Gudhi, and Dionysus [16, 101, 77]. The filtration functions and the companion complexes are usually obtained from the evolution of various complexes springing from a point cloud. In the case of computational biology, these point clouds are derived from the positions of atoms in a molecule. Next, we explore the Čech complex, which gives us a physically relevant filtration in which we can apply the idea of persistent homology to a given point cloud.

2.3.4 Čech and Vietoris-Rips Complexes

Suppose we have a finite set of points, $S \subset \mathbb{R}^n$. Let us define the closed ball around a point $x \in S$ with radius $r$ by $B_x(r) = \{y \in \mathbb{R}^n : \|x - y\| \leq r\}$. Then we define the Čech complex as:
$$\text{Čech}(r) = \Big\{\sigma \subseteq S : \bigcap_{x \in \sigma} B_x(r) \neq \emptyset\Big\}.$$
An abstract $p$-simplex is included in the Čech complex if there are $p + 1$ points whose closed balls of radius $r$ have a non-trivial common intersection. Although the Čech complex is defined using abstract simplices, we can imagine the process using the notion of geometric simplices. This idea is illustrated in figure 2.3.

Figure 2.3: This figure displays the geometric construction of a Čech complex from a point cloud, $S \subset \mathbb{R}^2$, based on the radius parameter, $r$. The simplices are added based on the number of balls that have non-empty intersection.

For computational and geometric reasons, another complex was introduced, namely the Vietoris-Rips complex. The Vietoris-Rips complex of $S$ and $r$ is defined as:
$$\text{Vietoris-Rips}(r) = \{\sigma \subseteq S : \mathrm{diam}(\sigma) \leq 2r\}.$$
Instead of checking all subcollections of points to find out which simplices are in our complex, as we do for the Čech complex, we may simply check pairs of points and add the 2-dimensional and higher simplices whenever we can. It can also be shown that we have the inclusions $\text{Čech}(r) \subseteq \text{Vietoris-Rips}(r) \subseteq \text{Čech}(\sqrt{2}\,r)$ [34], so we have a close relationship between the two complexes. In practice, the Vietoris-Rips complex is used extensively in applications because of the ease of its computation in relation to persistent homology.

We can see that the Čech and Vietoris-Rips complexes depend on the choice of a point cloud, $S$, and a choice of parameter, $r$, determining the radius of the balls around those points. If we define $\mathcal{P}(S)$ to be the set of all non-empty subsets of $S$, then $\mathcal{P}(S)$ is a simplicial complex, and we can define a filtration function, $f : \mathcal{P}(S) \to \mathbb{R}$, where for a simplex $\sigma$ we set $f(\sigma) = r_\sigma$, with $r_\sigma \geq 0$ being the value such that $\sigma \in \text{Čech}(r)$ for $r \geq r_\sigma$ but $\sigma \notin \text{Čech}(r)$ for $r < r_\sigma$. This can be defined equivalently for the Vietoris-Rips complex. In both cases, these filtration functions give us a way to obtain persistent homological data from a point cloud of atoms in a molecule, which provides important physical features for their analysis. In addition, there is a very popular complex called the alpha complex, which is more complicated to describe but is more physically representative of a molecular structure than the Čech and Vietoris-Rips complexes. This construction, however, will not be discussed here.
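As a concrete illustration of this pipeline, the sketch below computes Vietoris-Rips persistence diagrams for a small synthetic point cloud. It assumes the ripser.py package (a Python interface to the Ripser library cited above); the circle-shaped cloud and the noise level are arbitrary choices made only so that a single long-lived 1-dimensional class appears.

import numpy as np
from ripser import ripser   # assumes the ripser.py package is installed

# A noisy circle: persistent H1 should contain one class with large persistence.
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 60)
cloud = np.c_[np.cos(theta), np.sin(theta)] + 0.05 * rng.normal(size=(60, 2))

diagrams = ripser(cloud, maxdim=1)["dgms"]   # one array of (birth, death) pairs per dimension

for p, dgm in enumerate(diagrams):
    finite = dgm[np.isfinite(dgm[:, 1])]
    longest = (finite[:, 1] - finite[:, 0]).max() if len(finite) else 0.0
    print(f"H{p}: {len(dgm)} classes, longest finite persistence {longest:.3f}")

In our applications the input cloud would be atomic coordinates rather than a synthetic circle, but the call is otherwise identical.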
CHAPTER 3

MACHINE LEARNING

3.1 Introduction

Machine learning (ML) is a field of computer science that deals with algorithms that create models which improve automatically through experience and/or the application of fixed data. ML is generally seen as a subfield of artificial intelligence (AI) because, on the surface, it seems as though the computer being programmed is "learning" from the data that is fed into it. In fact, the computer learns to perform a particular task without being explicitly programmed to do so, giving it the appearance of "learning" from real-world data. In many cases, this might be the only feasible way to create a reasonable program, since directly programming every single action and response would be far too much for one or even many human beings to handle. For example, one might consider the difficulty of producing an explicit algorithm that can identify and differentiate between the various objects contained in an image. ML algorithms have a wide variety of applications: computer vision, speech recognition, spam filters, medicine, and self-driving vehicles, amongst others [84, 24, 29, 91, 35]. More applications are discovered every year as new ways of training ML models are invented and computational efficiency increases over time.

There are three main categories of machine learning that explore different ways of learning from the training data: supervised learning, unsupervised learning, and reinforcement learning. Some might include semi-supervised learning as a category of its own, but semi-supervised learning is really just supervised learning with some extra techniques that can handle unlabeled data. The most common type of learning is supervised learning, which learns from labeled data, in the sense that inputs are paired with desired outputs, usually hand-chosen by a human being through either manual affirmation or scientific experimentation. Within supervised learning, there are two types of tasks that models are trained for, namely regression and classification. Regression makes predictions that may take a continuum of values, e.g. the prediction of the price of a home, whereas classification produces outputs that place a piece of data into a particular category, e.g. the categorization of objects in an image. Unsupervised learning learns something about a dataset without any labels as described above. In this type of learning, the goal is to understand the structure of the data and find hidden patterns or insights that can be used, for instance, to understand how the data should be labeled. Clustering algorithms are commonly used to categorize a given set of data, such as in the case of search results provided by search engines [2]. In reinforcement learning, the computer interacts with a dynamic environment through actions and inputs, and learns through trial and error how to maximize its reward. An example of this is the training of DeepMind's AlphaZero [102], a chess-playing algorithm that learned how to play by playing an enormous number of games against itself.

In this chapter we describe various concepts in machine learning, focusing solely on supervised learning algorithms, and going over basic and more advanced algorithms along with their advantages and disadvantages.
3.2 Fundamental Concepts

3.2.1 Data Representations

When implementing most machine learning algorithms, we represent each individual item in the data set as a vector whose components are numerical values that describe the object in some way. These vectors are commonly referred to as feature vectors, and the entries of the vectors are called features. In some cases, we use the term descriptors instead of features, or refer to a feature vector as a representation of a data point, but these all refer to the same concepts. Although every entry of a feature vector must ultimately be a real number, we can have different categories of feature types. Common categories include numerical, binary, and categorical feature types.

The numerical feature type is the simplest to explain, in that the entry in the feature vector is the exact number that describes that feature; e.g., the entry for the height of a particular person in a data set would be a numerical feature of that person. Binary features are entered as 1 or 0 based on whether the data point is in or out of a category; e.g., a 1 may represent that someone has a bachelor's degree or a higher degree, while a 0 may represent that the person has not completed a bachelor's degree. Categorical features are originally represented by an integer taken from the range 0 to k − 1, where k is the number of categories for that particular feature. This choice of category is then usually converted into a vector of length k by a one-hot encoder, which outputs a vector with 0s everywhere except for a 1 in the i-th place, where i is the chosen category. All of the information from each feature type is concatenated into one long feature vector, and the collection of these feature vectors from all data points is sometimes called a design matrix, whose rows are the feature vectors of the individual data points.

There are some shortcomings to representing data points as the concatenation of a collection of numerical, binary, and categorical features as described above. The first thing to note is that categorical features that are one-hot encoded into a vector of 1s and 0s cannot capture correlations between two different categories. The representation conveys that all categories are equally dissimilar, which is a big assumption that is unrealistic in most cases. Another problem is that the weight of each feature is not appropriately captured. Some features are incredibly important and contribute much to the construction of the model, while others do very little and may even disrupt the process and ultimately result in a less accurate model. If some numerical values are very large in comparison to other features, the large values may outweigh the other features in terms of their influence in computation, especially categorical or binary features, which do not exceed a value of 1. A problem that can occur specifically with categorical features, when they are prevalent in a representation, is that concatenating one-hot encoded categories to the feature vector can make the feature vector sparse, wasting memory and increasing computing time significantly. Finally, we are not always able to easily figure out which features are going to be irrelevant to our analysis, so we must do our best to eliminate redundant or uncorrelated features to save time and memory and to create more accurate models.
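As a minimal sketch of the encoding just described (the toy data and column choices are invented purely for illustration), the snippet below one-hot encodes a categorical column with scikit-learn and concatenates it with a numerical and a binary column to form a small design matrix.

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Three samples described by a numerical, a binary, and a categorical feature.
height = np.array([[1.80], [1.65], [1.75]])          # numerical (meters)
has_degree = np.array([[1], [0], [1]])               # binary
blood_type = np.array([["A"], ["O"], ["B"]])         # categorical

one_hot = OneHotEncoder().fit_transform(blood_type).toarray()   # one column per category
X = np.hstack([height, has_degree, one_hot])                    # design matrix: one row per sample
print(X)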
To deal with these shortcomings, several techniques are applied. The main technique used is rescaling, but techniques like embedding are sometimes applied as well to deal with some of the remaining issues. Rescaling transforms each individual feature based on the values of that particular feature across all data points. One way to do this is to take the maximum and minimum of the values of each individual feature and scale them so that every value lies on or between 0 and 1; more specifically, if $k_{\min}$ and $k_{\max}$ are the minimum and maximum values respectively of the $k$-th feature, then a feature value is transformed by the function
$$N(x) = \frac{x - k_{\min}}{k_{\max} - k_{\min}}.$$
The most popular method of rescaling sets each feature to have a distribution of mean 0 and standard deviation 1. In this case, for the $k$-th feature we take the mean, $\mu_k$, and standard deviation, $\sigma_k$, with respect to all the values of the $k$-th feature in the data set, then transform the values with
$$N(x) = \frac{x - \mu_k}{\sigma_k}.$$
There are also other popular rescaling methods, such as term frequency-inverse document frequency (tf-idf) [62], which is applied in natural language processing, but the above techniques are usually sufficient. This rescaling of all the features, also referred to as normalizing the features, removes differences in scale between the different features and makes sure that the computations do not become too large during training before they have a chance to converge.

Aside from rescaling features, feature embedding is a technique that reduces the size of feature vectors, removing sparse or redundant elements and retaining mainly the important information. Many ways of doing feature embedding have been studied. Popular global dimensionality reduction algorithms include principal component analysis (PCA), independent component analysis (ICA), and multi-dimensional scaling (MDS) [47, 42, 54]. Local dimensionality reduction algorithms, which are good for finding local patterns, include t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP) [58, 68]. Other types of feature embedding are representations learned from neural networks, such as those produced by auto-encoders, Word2Vec, FastText, BERT, etc. [60, 39, 61, 31]. Finally, mixture models have also been used, such as Gaussian mixture models (GMM), the Dirichlet multinomial mixture (DMM), etc. [88, 41]. These dimensionality reduction algorithms can be incredibly useful for cleaning up your data, reducing computational weight, and ultimately creating effective predictive models.

Data extraction, preparation, and cleaning can be just as important as, if not more important than, choosing the right machine learning algorithm. "Garbage in, garbage out" is a saying commonly repeated in the machine learning sphere: your model can only be as good as the data that you train it on, no matter what machine learning algorithms you use. The most important thing you can do is gather a lot of data and extract relevant and important features to represent it. The rest of the process is finding a sufficient model to understand your data.
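Both rescaling schemes above are available in scikit-learn; the sketch below, with an arbitrary toy feature matrix, applies min-max scaling and standardization to the columns of a small design matrix.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[180.0, 1.0],      # two features on very different scales
              [165.0, 0.0],
              [175.0, 1.0]])

X_minmax = MinMaxScaler().fit_transform(X)       # each column mapped onto [0, 1]
X_standard = StandardScaler().fit_transform(X)   # each column rescaled to mean 0, std 1

print(X_minmax)
print(X_standard)
print(X_standard.mean(axis=0), X_standard.std(axis=0))   # approximately [0, 0] and [1, 1]

In practice the scaler is fit on the training set only and then applied unchanged to the validation and test sets, so that no information leaks from the held-out data.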
3.2.2 Loss Functions

Loss functions are used during training to assess the progress being made in fitting the machine learning model to the data. Different loss functions have different advantages and disadvantages, and so should be chosen based on the goals that you have in mind regarding the accuracy of particular data points in your data set. The distribution of the data at hand and the type of machine learning task, either regression or classification, may also guide your decision. The two most common loss functions for regression tasks are the mean squared error (MSE) and the mean absolute error (MAE). These are given below:
$$L_{MSE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \qquad (3.1)$$
and
$$L_{MAE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} |\hat{y}_i - y_i| \qquad (3.2)$$
where the vector $\hat{y}$ represents the predictions from the machine learning model and the vector $y$ represents the true labels. These two loss functions have their own advantages and disadvantages. The MSE loss is commonly used because it is a convex function, which in many cases allows us to find a global minimum. It is also sometimes possible to find a closed-form solution to the minimization of the loss when considering the entire model, as in the linear regression case. The MSE is, however, not good at dealing with data sets that contain many outliers, because squaring the error lets those points affect the loss much more significantly, so MSE is generally used as long as the data has a relatively normal distribution. The MAE is also convex, but it is not differentiable at zero and generally admits no closed-form solution, which makes it less convenient to optimize; it does, however, have the advantage of being far less sensitive to outlier data points than the MSE loss.

For classification problems, the most common loss function is the cross entropy loss (CEL). This is given below:
$$L_{CEL}(y, \hat{y}) = -\sum_{i=1}^{n} \sum_{j=1}^{m} y_{ij} \ln \hat{y}_{ij} \qquad (3.3)$$
where in this case $y$ and $\hat{y}$ are $n \times m$ matrices representing the probability that the $i$-th data point is in the $j$-th category, so each row adds up to one. $\hat{y}$ holds the predicted probabilities, while $y$ holds the true values, meaning that each row of $y$ has a single 1 with the remaining entries being 0, since each data point is certainly in exactly one of the categories and absolutely not in any of the others. The entropy of a random variable is the uncertainty inherent in its possible outcomes. A random variable that is evenly distributed has high entropy, because one can never tell which value it might take; when the random variable is concentrated at one or a handful of points, the entropy is significantly reduced. With the cross entropy loss, we simultaneously measure the entropy and the accuracy: we are penalized both for giving high probability to incorrect labels and for being uncertain in our predictions. This allows the models that we train to be optimized in a way that provides accurate and confident predictions.

There are other loss functions used in classification problems, but we will not cover those explicitly. When dealing with machine learning algorithms trained with loss functions, we must choose the loss according to the problem at hand. The distribution of the data, the algorithm being used, etc. can all affect our choice of loss function, but ultimately there is no one-size-fits-all solution to this dilemma. We must use our experience and intuition to decide which loss to use, then apply it to the situation and see whether it is an effective measure for our task.
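The three losses above are one-liners in NumPy; the sketch below evaluates them on small invented prediction vectors just to make the definitions concrete.

import numpy as np

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)             # equation (3.1)

def mae(y_hat, y):
    return np.mean(np.abs(y_hat - y))            # equation (3.2)

def cross_entropy(y, y_hat, eps=1e-12):
    return -np.sum(y * np.log(y_hat + eps))      # equation (3.3); eps guards against log(0)

# Regression example: one prediction is an outlier and dominates the MSE.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 9.0])
print(mse(y_pred, y_true), mae(y_pred, y_true))

# Classification example: rows are data points, columns are class probabilities.
labels = np.array([[1, 0, 0], [0, 1, 0]])
probs  = np.array([[0.8, 0.1, 0.1], [0.3, 0.6, 0.1]])
print(cross_entropy(labels, probs))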
3.2.3 Evaluation Metrics and Data

While loss functions are used to train models, evaluation metrics are used to evaluate a model after it has been trained. Although in many cases it is appropriate to use the loss function to make this evaluation, another metric may be more desirable or insightful in certain circumstances. For instance, when building binary classification models, we are interested in analyzing the precision and recall of our classifiers. Each prediction is a true positive (TP), true negative (TN), false positive (FP), or false negative (FN). The precision and recall scores are given by:
$$\text{precision} = \frac{TP}{TP + FP} \qquad (3.4)$$
and
$$\text{recall} = \frac{TP}{TP + FN} \qquad (3.5)$$
The precision measures the accuracy of the classifier on the set of all positively identified data points, and the recall measures the ability to detect the positively labeled data points. Depending on our situation, we may value precision over recall or vice-versa. For example, suppose we are creating an app that tells you whether a plant you found in the wilderness is safe to eat (positive) or not (negative) by looking at an image of it. In this case, you would like to make sure that there is an extremely low chance that the app is wrong when it labels a plant safe to eat; that is, we would like a very high precision for the model. On the other hand, a lower recall would be an acceptable trade-off: we would not be too upset if the app labeled many things that were actually safe to eat as potentially dangerous, because the alternative would be an app that is very good at finding exotic plants that are safe to eat but once in a while gives someone the okay to ingest a poisonous substance. If we consider instead a model that alerts a group of security guards that a particular location may be infiltrated by intruders, it is easy to see that we would prefer the reverse situation, in which recall is much more important than precision. In both of these illustrations, we note that although we most likely trained our model with the cross-entropy loss function, we did not use that loss to evaluate the model.

In this work, we are mainly interested in regression tasks, so we focus on evaluation metrics for regression predictions. The most common metric used for regression is the root mean squared error (RMSE). This is just the square root of the MSE loss, i.e. for true labels, $y$, and predicted labels, $\hat{y}$, we have:
$$\text{RMSE}(\hat{y}, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2} \qquad (3.6)$$
The main difference between the MSE and the RMSE is that the units of the RMSE are the same as those of the labels, which makes the results more readable. In some cases, we may also use the MAE if we want to understand a model's tendency to be thrown off by outliers. These metrics can often be compared side-by-side to get a full picture of a model's performance.
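scikit-learn provides these metrics directly; the sketch below evaluates a toy binary classifier with precision and recall and a toy regressor with the RMSE (the label vectors are made up for illustration).

import numpy as np
from sklearn.metrics import precision_score, recall_score, mean_squared_error

# Binary classification: 1 = "safe to eat", 0 = "not safe".
y_true = np.array([1, 0, 1, 1, 0, 0, 1])
y_pred = np.array([1, 0, 0, 1, 1, 0, 0])
print("precision:", precision_score(y_true, y_pred))    # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))       # TP / (TP + FN)

# Regression: RMSE as the square root of the MSE, in the units of the labels.
y_true_reg = np.array([2.0, 3.5, 4.0, 5.5])
y_pred_reg = np.array([2.2, 3.0, 4.3, 5.0])
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
print("RMSE:     ", rmse)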
A metric that is commonly used, but is not as intuitive a measure of success as the RMSE, is the Pearson correlation coefficient. This metric measures the linear correlation between the predicted labels and the true labels, and so gives us some information about how well the model fit the data set. We let
$$\mathrm{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) \quad \text{and} \quad \sigma_z = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2}$$
be the covariance and standard deviation of the distributions $x$, $y$, and $z$, where $\bar{x}$, $\bar{y}$, and $\bar{z}$ are the respective means. Then we may define the Pearson correlation coefficient for predictions, $\hat{y}$, and true labels, $y$, by:
$$r_{\hat{y}, y} = \frac{\mathrm{cov}(\hat{y}, y)}{\sigma_{\hat{y}} \, \sigma_y} \qquad (3.7)$$
We note that the value of $r_{\hat{y},y}$ always satisfies $-1 \leq r_{\hat{y},y} \leq 1$, where $|r_{\hat{y},y}| \to 1$ means that the relationship between the predicted and true values is getting closer to linear. A perfect prediction, i.e. $\hat{y} = y$, gives a Pearson correlation of $r_{\hat{y},y} = 1$, while predictions that are entirely uncorrelated with the labels, i.e. $\mathrm{cov}(\hat{y}, y) = 0$, give a Pearson correlation of $r_{\hat{y},y} = 0$. The advantage of using this metric is that it has no units associated with it, so we get a rough measure of how well our model is performing based on how close the correlation coefficient is to 1. The problem with using this metric alone, however, is that there are instances in which the Pearson correlation is equal to 1 but the predictions are very inaccurate. For this reason, we still combine this metric with the RMSE and/or MAE. Other correlation coefficients also exist, such as Spearman's rank correlation coefficient and Kendall's tau rank correlation coefficient. The Spearman coefficient measures how well the relationship between two distributions can be described by a monotonic function, and Kendall's tau measures the proportion of ranks that match between two data sets. Although these two metrics are somewhat different from the Pearson correlation, they also provide an understanding of model performance without analyzing the error in the units of the labels.

When we use these metrics to evaluate our model, we are ultimately trying to measure the generalization error, i.e. the error that we would incur, with our chosen metric, on new data. For this purpose, we generally split our data set into a train set and a test set, in which the test set is made up of around 10-20% of the data points of the whole dataset, chosen at random, and the train set is made up of the remaining data points. The model is then trained with specific model parameters on the train set to get an estimate of the generalization error from the test set. Now, if you have many parameters that you can tune for your model and are able to run many tests, you could use this test set to determine the optimal choice of model parameters. A problem that you may run into when doing this is that you may be choosing the best model parameters for that specific test set, so that the chosen parameters do not generalize as well as you may want. The remedy is to introduce a validation set, which is usually an additional 10-20% of the whole dataset, taken specifically from the train set, that is used to tune the model parameters. In this case, we would have 60-80% of the dataset as the train set, 10-20% as the validation set, and 10-20% as the test set; the models are trained on the train set, validated on the validation set to decide the optimal parameters based on the scores of the chosen performance metric, and then finally tested on the test set to determine the generalization error.
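The sketch below computes the Pearson correlation with NumPy and carves an invented dataset into 80/10/10 train, validation, and test portions with two calls to scikit-learn's train_test_split; the split fractions are the illustrative choices discussed above, not fixed rules.

import numpy as np
from sklearn.model_selection import train_test_split

# Pearson correlation between predictions and labels.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.2, 1.9, 3.3, 3.8, 5.1])
r = np.corrcoef(y_pred, y_true)[0, 1]     # off-diagonal entry of the 2x2 correlation matrix
print("Pearson r:", r)

# 80/10/10 train/validation/test split of a toy dataset.
X, y = np.random.randn(100, 5), np.random.randn(100)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))    # 80, 10, 10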
An alternative way to tune model parameters is k-fold cross-validation, in which the dataset is broken into k pieces; a model is tested on each piece after being trained on the remaining k − 1 pieces, and the average of the errors over the k trained instances for a particular set of model parameters gives an estimate of the generalization error. The more folds, the more accurate our estimate of the generalization error will be, but also the more computationally heavy the process becomes. The number of folds used when applying cross-validation is typically about 5 to 10.

Different metrics give us different sorts of information, and we must ultimately use our intuition to figure out which evaluation metric is best for assessing the success of our models. Each metric has its advantages and disadvantages, and so in many cases we may consider using more than one to better understand performance. Our main goal is to improve the generalization error so that we have a model that can handle new instances of data, and so we must use statistical techniques to split the data up in a way that allows us to measure this error.
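A minimal k-fold sketch with scikit-learn, using an arbitrary 5-fold split and a plain linear model as the estimator under evaluation (the scoring string requires a reasonably recent scikit-learn):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = np.random.randn(200, 10), np.random.randn(200)

# 5-fold cross-validation; each fold is held out once while the model trains on the other 4.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=5, scoring="neg_root_mean_squared_error")
print("RMSE per fold:", -scores)
print("estimated generalization RMSE:", -scores.mean())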
3.2.4 Model Weights and Gradient Descent

Many machine learning models make predictions based on intermediate or direct calculations involving scalar values that are adjusted throughout training as the model "fits" the data it is being trained on. We refer to these adjustable scalar values as weights, and we differentiate them from the model parameters (or hyperparameters) mainly by the two properties mentioned: their direct use in the calculations that produce a prediction, and the automatic adjustments made to them to fit the data at hand. Model parameters stay fixed throughout the training of a model, and so have to be hand-tuned by comparing their performance across many separately trained models. Many machine learning models do not have weights that are adjusted throughout training, but instead build up a model through various criteria, with many manually set parameters that determine how these models are built. Machine learning models that rely on decision trees generally have no weights associated with them, while simple algorithms like linear regression do have model weights that are adjusted throughout training, although it turns out that linear regression models can be trained in one step in special circumstances (see the section below). Neural networks are perhaps the most complicated models involving weights, and they involve a lot of them; neural network models tend to have millions of weights or more!

The most common way to adjust these weights is through a process called gradient descent. This process applies the idea of the gradient of a scalar function of multiple variables. The gradient of a function, $f$, is defined to be the vector
$$\nabla f(x_1, \ldots, x_l) = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_l} \right).$$
The gradient of a function at a given point is a vector in the direction of greatest ascent, meaning that the directional derivative is maximal in the direction of the gradient. Since the gradient gives us the direction of greatest ascent, the opposite of the gradient gives us the direction of greatest descent. Thus, if we follow the opposite of the gradient in small enough steps, we will eventually reach a local minimum of the function. Suppose we choose a scalar value, $\alpha > 0$, where the smaller $\alpha$ is, the shorter the step we take in the direction of steepest descent. If we do $n$ iterations of gradient descent after choosing an initial point, $(z_1, \ldots, z_l)$, then the following algorithm describes the gradient descent procedure quite concisely:

Algorithm 1 Gradient Descent
1: $(x_1, \ldots, x_l) \leftarrow (z_1, \ldots, z_l)$
2: for $i = 1, \ldots, n$ do
3:     $(x_1, \ldots, x_l) \leftarrow (x_1, \ldots, x_l) - \alpha \cdot \nabla f(x_1, \ldots, x_l)$

If we have chosen enough iterations, i.e. a large enough $n$, then we should have reached a local minimum. Now, if we think of the function $f$ as the loss function with respect to the weights of our model, i.e. $L(w_1, \ldots, w_l)$, then this gradient descent procedure gives us weights that minimize our loss locally, if not globally. After we reach a local minimum, we consider our model fully trained.

If a model has weights, then as long as the loss function is differentiable (or almost everywhere differentiable) with respect to those weights, we are able to use gradient descent. In most cases this is a feasible procedure, but sometimes the loss function is such that the gradient descent procedure would take too many iterations to effectively settle on a local minimum. Although a large learning rate may speed up the training process, if the learning rate is too large, we will not be able to settle at a local minimum. The loss function might have valleys that are too deep, and the landscape of the function may be too erratic, so that a very small learning rate is required to traverse the parameter space and properly settle at a point. There are ways to find local minima more efficiently, for example by using momentum and other refinements, so we can usually avoid the pitfalls of inefficient training times or the inability to settle in a local minimum. The idea of a local minimum can be troubling in general when using a model based on weights. Ideally, we want to find global minima, and so convex loss functions are sought after because they guarantee that any local minimum is global. This is unrealistic, however, especially if you wish to consider relatively complex models such as neural networks. Despite any disadvantages that models with weights may have, they are the easiest to implement and have seen widespread use.
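A minimal NumPy sketch of Algorithm 1, applied to a simple convex quadratic loss so that the iterates can be checked against the known minimizer; the function, starting point, learning rate, and iteration count are all arbitrary illustrative choices.

import numpy as np

def loss(w):
    # f(w) = (w1 - 3)^2 + 2*(w2 + 1)^2, minimized at w = (3, -1)
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2

def grad(w):
    return np.array([2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)])

w = np.array([0.0, 0.0])    # initial point (z1, z2)
alpha = 0.1                 # learning rate
for _ in range(200):        # n iterations of Algorithm 1
    w = w - alpha * grad(w)

print(w, loss(w))           # w is close to (3, -1) and the loss is close to 0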
3.2.5 Regularization

There are two important problems that arise when applying machine learning algorithms to a data set to train a predictive model: underfitting and overfitting. Underfitting occurs when a model is fit to the data set but does not perform well even on the data it was trained on. Overfitting occurs when a model fits the data it is trained on so well that it cannot perform well on new data: the model is so tailored to the particular data set, through patterns that are very specific and not generalizable, that it cannot effectively predict anything that is merely similar to the training data. Overfitting is therefore characterized by very high performance on the training data with low performance on the test data. Although underfitting is important to avoid, the main focus for machine learning practitioners is on overfitting, since most machine learning algorithms can create models that almost perfectly fit the data they are trained on, but it is hard to find an algorithm that generalizes well. In figure 3.1 below, we see a classic example of choosing an appropriate model to avoid both overfitting and underfitting.

A theoretical result with important applications in machine learning is that a model's generalization error (a measure of how accurately the model can predict previously unseen data) can be broken up into the sum of three very different components: bias, variance, and irreducible error.

Figure 3.1: This figure captures the challenge of finding just the right fit for a polynomial whose tunable parameters include the degree of the polynomial being fit. The true function that generated the distribution is shown in orange and the models that are fit to the data set are shown in blue. We can see that as we increase the degree of the polynomial, we more accurately fit the training data, but the model becomes less able to generalize to new data points. We must balance the complexity of our model against the complexity of the distribution of our data [51].

Bias is the error that comes from incorrect assumptions about the shape of the data, which usually implies underfitting. Variance is the error that comes from sensitivity to variations in the training data. High variance generally comes from models with too many parameters, and usually results in overfitting. Irreducible error is the error that comes from the noisiness of the data itself and cannot be reduced by tuning the model; it must be dealt with by retrieving better data or by cleaning the data. As the complexity of the model increases, the variance typically increases while the bias decreases. This is customarily referred to as the bias/variance trade-off.

Regularization in machine learning refers to any attempt to avoid overfitting while training a machine learning model. In many cases, we try to use the simplest models that we can, to improve generalizability while still being able to fit both the train and test data, but we can also use a combination of techniques generally classified as regularization techniques. One common regularization technique is to modify the loss function used to train the model so that the loss takes into account the complexity of the model in some way, thereby penalizing a model based on how complex it is with its current parameters. The easiest way to do this in most models is to make sure that the magnitudes of the weights are kept small, by incorporating the weights into the loss function in a way that penalizes the model if the weights are too large. The main assumption behind penalizing large weights is the belief that smaller weights create a less complicated model.

The two most popular ways to put the above idea into practice are ridge regularization and the least absolute shrinkage and selection operator (LASSO). Ridge regularization, also known as $\ell_2$ regularization, focuses on penalizing excessively large weights and has a convenient form for differentiation purposes. LASSO regularization, or $\ell_1$ regularization, can help with feature selection, but does not have some of the desirable features of ridge regularization.
Suppose that we have weights associated with a particular machine learning model, say $\{\theta_l\}_{l=1}^{d}$; then we have the following terms for the ridge and LASSO techniques respectively:
$$L_{ridge} = \sum_{l=1}^{d} \theta_l^2 \qquad (3.8)$$
and
$$L_{lasso} = \sum_{l=1}^{d} |\theta_l| \qquad (3.9)$$
The loss, $L_{error}$, is then modified by the ridge or LASSO term to obtain a new loss: $L = L_{error} + \alpha L_{ridge}$ or $L = L_{error} + \alpha L_{lasso}$, where $\alpha \geq 0$ is a tunable parameter called the regularization constant, which determines how much the magnitudes of the weights contribute to the total loss. If $\alpha$ is very small, then the training is effectively the same as with no regularization. If $\alpha$ is very large, then our model will be close to a constant function. Ultimately there is no single value that always gives the best performance; we must tune this parameter when training our models to see which value allows our model to generalize most effectively. In figure 3.2, we observe the effects of applying ridge regularization to linear and polynomial regression.

Figure 3.2: In this figure we show how varying the regularization constant when training models using ridge regularization affects how the model fits the data. On the left we see linear regression, and on the right we see polynomial regression. As the regularization constant increases, the model is "flattened". This "flattening" decreases the variance of the model, but increases bias [36].

There are also other forms of regularization, used particularly in neural networks, such as weight decay, dropout, and other indirect forms of regularization. Regularization is perhaps the most important addition to any machine learning model, since it allows us to increase our model's generalizability when we lack a large amount of data.

3.3 Linear Regression

The linear regression algorithm is perhaps the simplest algorithm in the machine learning toolbox, but it is among the most reliable, the quickest to train, and the easiest to tune. The task of linear regression is to create a linear model that fits the dataset most closely. The formulation is as follows. A loss function, $L$, is chosen, usually the mean squared error (MSE), which is again defined by
$$L_{MSE}(\hat{y}, y) = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \qquad (3.10)$$
The predictions, $\hat{y}$, are given by the linear predictor:
$$p(x; a) = \sum_{j=0}^{m} a_j x_j = a \cdot x \qquad (3.11)$$
The vector $a$ is the set of coefficients of the linear function, and $x$ is the vector of variables, in which the component $x_0$ is held to be a constant 1. In the case of the loss $L = L_{MSE}$, the optimization problem becomes:
$$\min_{a} \sum_{i=1}^{n} (p(x^{(i)}; a) - y_i)^2 = \min_{a} \sum_{i=1}^{n} (a \cdot x^{(i)} - y_i)^2 \qquad (3.12)$$
where $x^{(i)}$ is the feature vector of the $i$-th data point in the data set, which is described by the design matrix, $X$. Now, by applying basic optimization techniques using derivatives, we can obtain a solution through matrix multiplication:
$$a = (X^T X)^{-1} X^T y \qquad (3.13)$$
This gives us a solution that requires the inversion of a matrix. In most practical situations, we need not worry about the existence of the inverse, and in the cases where we do, we have tools like the pseudo-inverse [14] that can take care of that particular issue. For matrices of a size below a certain threshold, this is all computationally feasible, but for larger matrices there are other techniques, such as gradient descent, that can give us an extremely accurate solution.
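A brief sketch of the closed-form solution (3.13) on synthetic data, checked against scikit-learn's LinearRegression; the data-generating coefficients are invented for the example, and the pseudo-inverse is used in place of an explicit inverse for numerical safety.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, m = 200, 3
X_raw = rng.normal(size=(n, m))
y = 2.0 + X_raw @ np.array([1.5, -0.7, 0.3]) + 0.05 * rng.normal(size=n)

# Design matrix with the constant column x0 = 1, as in equation (3.11).
X = np.hstack([np.ones((n, 1)), X_raw])

a = np.linalg.pinv(X.T @ X) @ X.T @ y        # equation (3.13), via the pseudo-inverse
print("closed form:", np.round(a, 3))         # approximately [2.0, 1.5, -0.7, 0.3]

model = LinearRegression().fit(X_raw, y)      # the same fit with ScikitLearn
print("sklearn:    ", round(model.intercept_, 3), np.round(model.coef_, 3))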
In the case that we use a different loss function, such as the mean absolute error (MAE), we do not have a closed-form equation and so must in fact use gradient descent. The same situation can occur if we wish to apply some form of regularization to the model, for which ridge and LASSO regression are common choices. ScikitLearn provides as many options as you would need for any linear regression task, and it is a useful tool for a myriad of other machine learning algorithms [83].

3.4 Decision Trees

Decision trees in machine learning are tree structures, constructed by fitting the data, that provide a regression- or classification-based predictive model which arrives at its prediction through a series of comparisons performed at each node until it reaches a leaf of the tree. These trees are constructed greedily, using an appropriate cost function to determine the best split. Decision trees can be configured by choosing a few options, such as the cost function, the number of splits per node, the minimum number of data points needed to make a split, the maximum depth of the tree, and others. There are not too many parameters to consider with decision trees, so they are relatively easy to tune. In a regression task, a prediction is made by navigating a new data point through the nodes of the tree according to the split criteria until a leaf node is hit, then averaging the labels of all the training data points in that leaf node. In a classification task, the most common label in the batch at the leaf determines the final prediction. Regression tasks most commonly use the MSE cost function to determine the best split, while the Gini index (a measure of the purity of a set of data) is applied during classification tasks to reduce the overall impurity at each split in the tree. The other parameters provide rules for creating robust trees that help prevent overfitting, such as pruning, which removes branches that make use of features with low importance and do not improve the accuracy of the model. An example of a simple decision tree model is given in figure 3.3.

Figure 3.3: In this figure, we show a decision tree model making a prediction for a new data point, $x = (x_1, x_2, x_3, x_4, x_5)$. The threshold values for the nodes are denoted by $T_1$, $T_2$, and $T_3$. At each node, one of the features is compared to the threshold at that node, determining the direction that the data point will travel down the tree. Notice that only the features $x_1$, $x_2$, and $x_5$ are used for any prediction. Decision tree models do not always consider all features of a data point when making a prediction, especially when they are pruned or limited by a maximum depth parameter during construction.

The greatest advantage of decision trees is the ease with which we can determine feature importance and the relations between the data points at each split. The model is natural and easy to understand, as a prediction for a new data point is made by analyzing its features at each node along the path down to a leaf node. Decision trees also require little data preparation before being trained, e.g. no normalization of the features, and nonlinear relationships between the features do not affect the performance of the tree, since the splits are always done by comparison rather than being based on some form of abstract distance.
There are, however, several disadvantages to decision tree models. For one, there is a large tendency to overfit if the tree is not properly pruned and/or the maximum depth, minimum samples to split, or other options are not wisely chosen. Secondly, there is a lot of variance involved in creating these trees, since dropping or adding a feature may create a very different tree. Thirdly, greedy algorithms do not always give an optimal model, so we prefer to use other techniques when possible. Finally, the trees may be unbalanced, by which we mean that they are dominated by one particular class, thus creating a model with high bias.

There are many ways to combat the disadvantages of decision trees. First, for practical use, we almost always choose a binary split at each node, since it is faster and just as effective, and it also reduces variance and overfitting. Also, balancing the data before training can reduce the effect of bias, which in fact should be practiced for any model. Next, methods such as bagging may be used: statistical techniques that use randomness and a consensus of multiple models to combat overfitting and variance, and to help us get closer to the global minimum that the greedy nature of the algorithm usually cannot find for us. Finally, boosting is another technique that uses multiple trees to make a prediction, where the trees are built sequentially based on the performance of earlier trees; the concepts of bagging and boosting are covered in detail below.
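A minimal scikit-learn sketch of a single regression tree on synthetic data; the max_depth and min_samples_split values are arbitrary examples of the options discussed above.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(300, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)    # the second feature is irrelevant

tree = DecisionTreeRegressor(max_depth=4, min_samples_split=10).fit(X, y)

print("training R^2:", round(tree.score(X, y), 3))
print("feature importances:", np.round(tree.feature_importances_, 3))   # roughly [1, 0]
print("prediction at (1.0, 0.0):", round(float(tree.predict([[1.0, 0.0]])[0]), 3))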
3.5 Ensemble Methods

Ensemble methods are techniques that combine different models into one in order to improve the accuracy of the results produced by the consensus of all the models as a whole. In many cases, ensemble methods improve accuracy by a significant amount, which has contributed to their popularity. The most popular ensemble methods are boosting, bagging, and stacking. There are also two main categories of ensembles, namely sequential and parallel: sequential ensemble methods combine models that depend on the previously built models, while parallel methods combine models that are built independently and rely solely on the random seed to establish a variety of models. Ensemble methods have been shown to reduce both the variance and the bias that exist within the single models.

Below we detail two main forms of ensemble learning, bagging and boosting, and then go over concrete examples of each that use decision trees as weak learners, namely random forest (RF) and gradient boosting trees (GBT), which have been used in many of our previous works. Random forest models are classified as parallel ensemble models and gradient boosting trees are classified as sequential ensemble models.

3.5.1 Bagging

Bagging stands for bootstrap aggregating. Bootstrapping is a sampling technique in which the samples are chosen from the data set using randomized sampling with replacement. A model is trained on the chosen samples, in which we have allowed repetition. In some cases, we may even choose a randomized subset of the set of features to increase the variety of the models put forth. This procedure is repeated a specified number of times, giving us a number of weaker models that can be combined, or aggregated, to produce better results than if we simply had one model trained on all the data at once, namely with a far lower variance. The bagged model is made up of these $m$ weaker models:
$$\hat{f}_{bag}(X) = \frac{1}{m} \left( \hat{f}_1(X) + \hat{f}_2(X) + \cdots + \hat{f}_m(X) \right)$$
Because of the aggregating process, the variance is lowered naturally by the averaging of these models. Bagging, however, does not do as well when the models being aggregated already have low variance. Decision trees and kNN models with a low k value are good candidates for the bagging technique because of the ease with which they can be modified to create weak learners with high variance.

3.5.2 Boosting

Boosting is a sequential ensemble technique that learns from the mistakes of the previous models in the sequence to create a new model that is combined with the previous ones, yielding a very successful final model. The main boosting techniques are AdaBoost, Gradient Boosting, and XGBoost. AdaBoost uses stump splits, which partition the data set based on a threshold on a single feature value, and creates models sequentially, adjusting the weights assigned to the data points and to each weak learner after each round of training. Gradient boosting uses the notion of a gradient to reduce the errors made by previous models according to a chosen loss function, adding the new models to the ensemble. XGBoost implements gradient boosted trees in a computationally efficient way, making it feasible to train on larger data sets.

3.5.3 Random Forest

There are many problems that require representations consisting of thousands of features and tens of thousands or even hundreds of thousands of data points. This can easily become too much for a single decision tree to handle. Random forest (RF) is a parallel ensemble method that combines a group of weak learners, specifically decision trees, that would normally perform poorly as individual models into one single model, which increases performance dramatically. Random forest uses the bagging ensemble technique to deliver stellar performance and is known to be quite robust to overfitting, while also having few hyperparameters to tune, thereby minimizing the time it takes to optimize the models. More specifically, we start by choosing the number of trees in our forest to use in our final predictions. Then, for a fixed sample size (which in practice is chosen to be the size of the whole dataset, e.g. in the ScikitLearn implementation [83]), we randomly choose that many data points with replacement for each individual tree. Next, for each of these trees, we also choose a subset of the features (this time without replacement) that will be used for the construction of that tree. The size of the subset of features used for each tree is usually set to the floor of the square root of the number of features, which is the default of the ScikitLearn random forest model. Finally, with each of these randomly determined data sets, we construct a decision tree. Our model then operates by making predictions from either a consensus of the predictions of the trees in the forest (in the case of classification) or the average of all their predictions (in the case of regression).

3.5.4 Gradient Boosting Trees

Gradient boosting trees (GBT) are sequential ensemble models that use decision trees as weak learners.
3.5.4 Gradient Boosting Trees

Gradient Boosting Trees (GBT) are sequential ensemble models that use decision trees as weak learners. The procedure involves adding one tree at a time to minimize a chosen loss function by following a gradient based on the performance of the previous trees. Gradient boosting trees share advantages with the RF algorithm in that they are also quite robust to overfitting and have relatively few hyperparameters to tune. After the hyperparameters are optimized, however, gradient boosting trees generally perform far better than RF models.

We now detail how the gradient boosting algorithm works. We choose a loss function, $L$, which is usually the standard MSE loss. Suppose that $F$ is our combined model, which gives us predictions based on input representations, $X$ is the matrix representing the data set, and $y$ is the set of ground truth labels for each data point. Then $L(F(X), y)$ is the loss quantity measuring the error of our predictions, $F(X)$, when compared to the true labels, $y$. Let $F$ also be considered as a sequence of functions, $F_l$, in which we begin with an initial weak learner, $F_0$, usually chosen to be a constant predictor that predicts the mean label value, $\frac{1}{n}\sum_{i=1}^{n} y_i$. We then follow in sequence with a series of base learners, $h_l$, that are combined with the previous base learners to create an improved model, leaving us with a final model, $F = F_m$, after $m$ iterations of improvement.

We wish to ultimately minimize our loss $L(F(X), y)$ by reducing the loss at each iteration. To reduce the loss, we train a new base learner, $h_l$, a simple decision tree trained on the negative gradient, $-\frac{\partial L(F_l(X), y)}{\partial F_l(X)}$, and then we follow the gradient by adding a scalar multiple of this $h_l$ to the previous function, $F_l$. Thus, we update the model, $F$, by a simple rule of the form $F_{l+1} = F_l + \gamma_l \cdot h_l$, where the value $\gamma_l$ is chosen by solving the following optimization problem at each step:
\[
\min_{\gamma} \sum_{i=1}^{n} L\big(y_i,\, F_l(X_i) + \gamma \cdot h_l(X_i)\big) \tag{3.14}
\]
A learning rate, $\alpha$, taking a value between 0 and 1, is generally applied, so that the actual update rule is $F_{l+1} = F_l + \alpha \cdot \gamma_l \cdot h_l$.
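The following is a minimal sketch of the boosting loop just described for the MSE loss, where the negative gradient reduces to the residual $y - F_l(X)$ and the step size $\gamma_l$ is absorbed into the learning rate; the tree depth, number of rounds, and learning rate are illustrative, and this is not the implementation used in our experiments.

```python
# Sketch of gradient boosting with squared-error loss, following the update
# F_{l+1} = F_l + alpha * h_l, where each h_l is fit to the current residual.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=100, alpha=0.1, max_depth=3):
    f0 = y.mean()                          # F_0: constant mean predictor
    pred = np.full(len(y), f0, dtype=float)
    trees = []
    for _ in range(n_rounds):
        residual = y - pred                # negative gradient of the MSE loss
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        pred += alpha * h.predict(X)       # damped step along the gradient
        trees.append(h)
    return f0, trees

def boosted_predict(f0, trees, X, alpha=0.1):
    return f0 + alpha * sum(t.predict(X) for t in trees)
```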
3.6 Artificial Neural Networks and Deep Learning

Neural networks are predictive models consisting of layers made up of neurons, in which each neuron is connected to every neuron in the next layer of the sequence. A weight is associated with each connection that determines how much a neuron contributes to the input of the neuron in the next layer, and a bias term is introduced to shift that input. Activation functions, such as the logistic sigmoid, grant the network a level of non-linearity that gives it sufficient complexity to set it apart from a linear predictor. A feature vector, in which each feature corresponds to a neuron in the input layer, can be pushed through the layers of the network in sequence all the way to the output layer in a process called forward propagation. Neural networks can be trained as classification models that categorize data, or as regression models that predict continuous quantities.

Neural networks are trained by updating the weights and biases of the network in a series of steps through gradient descent, in which an appropriately chosen loss function is minimized. The process of calculating the errors of a neural network is called back propagation, in which derivatives are calculated in the backwards direction starting from the output layer, where the loss (or error) is calculated, and propagated backward through each layer in the opposite order all the way to the input layer. The architecture and training process are depicted in Figure 3.4.

Figure 3.4: This figure depicts the architecture of a simple feed-forward neural network with an input layer, hidden layers, and an output layer with one output. The neurons in the input layer hold the values of the feature vector of a sample from the data set, and the connections between the layers have weights associated with them, which determine how the values are transformed as they move to the neurons in the next layer of the network. The yellow lines show the flow of a sample through the network, yielding a prediction at the output layer, while the turquoise arrows pointing backwards show the flow of the gradient calculations at each layer before an update.

Artificial Neural Networks (ANNs) were originally inspired by vision systems and brain structure. The artificial neurons in the ANN model are loosely analogous to the neurons of a brain, and the connections between them are analogous to the synapses that transmit information throughout the brain. The concept of the ANN was first introduced in 1943 by McCulloch and Pitts [67], and the idea was expanded by researchers such as Rosenblatt [89] until a roadblock was hit in the late 1960s with a paper written by Minsky and Papert [71]. Interest in neural networks eventually increased again as many factors changed over the decades. Progress in the field picked up as computational capabilities grew and computer technology became more advanced. Vast quantities of data became available, making it possible to train neural networks more effectively. Algorithms improved, and theoretical hurdles were found to be harmless in practice. Graphical processing units (GPUs), originally produced for graphical enhancement in the gaming industry, were applied to neural networks for an incredible computational speed boost, allowing very large networks to be trained in a reasonable amount of time. Finally, a generous cycle of funding for neural network-related research stimulated advancement in neural network technology and attracted many talented individuals.

Deep neural networks (DNNs) are defined to be networks with two or more hidden layers, but many contemporary networks contain more than a dozen layers, and sometimes even a hundred layers or more. Virtually all neural network research is now referred to as deep learning research, and it generally involves more complicated architectures, such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), multi-task (MT) networks, and many more. Below, we review many important concepts involved in deep learning, especially those used for the networks described in this work.

3.6.1 Activation Functions

Activation functions in neural networks provide the necessary non-linearity that makes them far more successful than a linear model. They are generally applied at each layer of the network to create multiple levels of non-linearity that enhance the ability of the network to fit detailed data sets. Various activation functions are commonly used in deep learning today, including the logistic sigmoid function, the hyperbolic tangent function, and the Rectified Linear Unit (ReLU).
They are described below:
\[
\sigma(x) = \frac{1}{1 + e^{-x}} \tag{3.15}
\]
\[
\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} \tag{3.16}
\]
\[
\mathrm{ReLU}(x) = \begin{cases} x & : x \ge 0 \\ 0 & : x < 0 \end{cases} \tag{3.17}
\]
As can be seen, the sigmoid function keeps its output between 0 and 1, while the hyperbolic tangent function keeps its output between -1 and 1, so both functions have outputs within a relatively small band. This keeps the values small at each layer so that the gradients do not get out of proportion, which is especially likely to happen in very deep networks. This normalization from the activation functions, however, comes with the problem of vanishing gradients when the activations are applied many times throughout a network. The problem of vanishing gradients can be mitigated by using the ReLU function, which does not suffer from vanishing gradients to the same extent as the sigmoid and tanh functions and has many other benefits.

The ReLU function has a derivative of 1 wherever its output is non-zero, so the gradient remains a constant 1 at each application and does not shrink in magnitude as the function is applied consecutively, thereby assuaging the deleterious effects of the vanishing gradient. The ReLU function also provides a sparsity to the neurons that fire in the network, in which neurons that have no use are effectively turned off if they are not contributing to the improvement of the model. Finally, the computational requirements of ReLU are far lower than those of the sigmoid or tanh functions, because ReLU involves no exponential calculations, thus reducing the computational cost of activation.

There are also drawbacks to the ReLU activation function. The first is the unbounded nature of the output in the positive direction, which can create an unstable network if the initialization is not done properly. Another possible issue is the dying ReLU problem, which occurs when too many of the weights corresponding to a neuron get close to zero and the ReLU activation cannot recover from repeatedly outputting zero; when this happens to too many neurons overall, it leads to a dead network. Third, small negative outputs can benefit a neural network model, but ReLU outputs zero for all non-positive values and so does not allow that extra nuance within the model. Lastly, the ReLU function is not differentiable at all points and therefore creates a rocky optimization process, even though gradient descent will still be able to find desirable local minima.

To combat some of these issues, various activation functions have been developed that mimic the shape of the ReLU but address its problems, ultimately leading to better network performance. Leaky ReLU, Swish, and SELU are common replacements for ReLU and have their own strengths and weaknesses while performing better than ReLU itself in many cases. We mainly stick to ReLU in our experiments for ease of use and known effectiveness, but other activation functions have also been tested and have been very effective. A careful choice of activation function at each layer can make quite a difference in a network model and may improve performance noticeably. The non-linearity is essential for networks to be able to fit complex data sets and give accurate predictions.
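As a quick numerical check of equations (3.15)-(3.17), the sketch below evaluates the explicit formulas and compares them against PyTorch's built-in activations; the input values are arbitrary.

```python
# Verify equations (3.15)-(3.17) against PyTorch's built-in activations.
import torch

x = torch.linspace(-3.0, 3.0, steps=7)

sigmoid = 1.0 / (1.0 + torch.exp(-x))                                    # eq. (3.15)
tanh = (torch.exp(x) - torch.exp(-x)) / (torch.exp(x) + torch.exp(-x))   # eq. (3.16)
relu = torch.where(x >= 0, x, torch.zeros_like(x))                       # eq. (3.17)

# The library versions agree with the explicit formulas above.
assert torch.allclose(sigmoid, torch.sigmoid(x))
assert torch.allclose(tanh, torch.tanh(x))
assert torch.allclose(relu, torch.relu(x))
```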
3.6.2 Batch Normalization

Batch normalization is a technique applied to the layers of a deep neural network to prevent the instability that comes with depth and to allow much faster convergence to a local minimum. Batch normalization layers normalize the outputs of each layer to speed up training, improve performance, enable deeper networks, and reduce internal covariate shift. Internal covariate shift is a main source of instability in deep networks: as the weights are updated, the distribution of each layer's inputs shifts, and the effect is exacerbated by the depth of the network because of the number of updates involved.

The batch normalization technique is applied in a similar fashion to the normalization of features, except that the normalization occurs at each layer and is itself updated throughout the training. Suppose we have a batch of the dataset, $B$, of size $m$, described by the matrix $x$ with rows $\{x^{(i)}\}_{i=1}^{m}$ and a total of $l$ features. The batch normalization procedure is described in Algorithm 2 below.

Algorithm 2 Batch Normalization
1: for $k = 1, \ldots, l$ do
2:   $\mu_k \leftarrow \frac{1}{m}\sum_{i=1}^{m} x_k^{(i)}$
3:   $\sigma_k^2 \leftarrow \frac{1}{m}\sum_{i=1}^{m} \big(x_k^{(i)} - \mu_k\big)^2$
4:   for $i = 1, \ldots, m$ do
5:     $\hat{x}_k^{(i)} \leftarrow \dfrac{x_k^{(i)} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}}$
6:     $z_k^{(i)} \leftarrow \gamma_k \hat{x}_k^{(i)} + \beta_k$
7: return $z$

The parameters $\mu_k$ and $\sigma_k^2$ are the mean and variance of the $k$th feature over the batch, $\epsilon$ is a small constant used for numerical stability, and $\gamma_k$ and $\beta_k$ are scaling and shifting values for the $k$th feature that can be updated throughout training if we choose to use them (the PyTorch default is to use this scaling). The output is an $m \times l$ matrix, $z$, in which $z^{(i)}$ is the $i$th row of the matrix, representing the $i$th data point in the batch, with its $k$th component denoted by $z_k^{(i)}$.

There are different ways to implement this transformation over the various batches. The simplest is to calculate the batch statistics after each individual forward pass and use those, as shown in Algorithm 2. Another way is to average the batch statistics over the batches, i.e. at each step $t$ we update the mean and variance vectors, $\mu$ and $\sigma^2$ respectively, by the update rules
\[
\mu^{t+1} = \frac{t\,\mu^{t} + \frac{1}{m}\sum_{i=1}^{m} x^{(i)}}{t+1}, \qquad
(\sigma^2)^{t+1} = \frac{t\,(\sigma^2)^{t} + \frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)} - \mu^{t}\big)^2}{t+1}.
\]
The most popular way, and the default for PyTorch, is to use a momentum parameter to keep a running average over the batches as training progresses, i.e. we choose a momentum parameter, $\alpha$, such that at each step $t$ we update the statistics by
\[
\mu^{t+1} = \alpha\,\mu^{t} + (1-\alpha)\,\frac{1}{m}\sum_{i=1}^{m} x^{(i)}, \qquad
(\sigma^2)^{t+1} = \alpha\,(\sigma^2)^{t} + (1-\alpha)\,\frac{1}{m}\sum_{i=1}^{m}\big(x^{(i)} - \mu^{t}\big)^2.
\]

Batch normalization provides many benefits to a neural network. Aside from improving the stability of training by reducing covariate shift, it has also been shown to reduce training time dramatically and to improve performance in general due to the regularization inherent in the batch normalization process, thereby reducing the need for other types of regularization. These positive qualities have opened up many more possibilities for the use of deep networks.
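The sketch below applies Algorithm 2 by hand to a random batch and compares the result with PyTorch's nn.BatchNorm1d in training mode; the batch size, feature count, and epsilon are illustrative.

```python
# Sketch of Algorithm 2, checked against PyTorch's nn.BatchNorm1d.
import torch
import torch.nn as nn

m, l, eps = 32, 8, 1e-5
x = torch.randn(m, l)

mu = x.mean(dim=0)                       # per-feature batch mean
var = x.var(dim=0, unbiased=False)       # per-feature batch variance
x_hat = (x - mu) / torch.sqrt(var + eps)
gamma, beta = torch.ones(l), torch.zeros(l)   # default scale and shift
z = gamma * x_hat + beta

bn = nn.BatchNorm1d(l, eps=eps, momentum=0.1)  # momentum keeps running statistics
bn.train()                                      # training mode uses batch statistics
assert torch.allclose(z, bn(x), atol=1e-5)
```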
3.6.3 Gradient Descent and Adaptive Learning Rates

There are three main types of gradient descent: batch, stochastic, and mini-batch gradient descent. Batch gradient descent uses all training samples at each epoch to calculate the gradient, which gives the exact gradient, but it is usually slower to converge, does not explore new local minima well, and is sometimes infeasible to compute if the data set is extremely large. Stochastic gradient descent takes one random sample at a time and calculates the gradient before moving on to the next one until it reaches all samples, at which point an epoch is completed. In this case, the training time is usually decreased, many local minima are explored, and the gradient can be computed without any hardware considerations since there is only one sample at a time. The downside of this technique is that the descent becomes extremely unstable, because only a small portion of the true gradient is calculated at each step, even though it is statistically similar over many iterations. Mini-batch gradient descent is a middle ground between batch and stochastic descent that addresses the hardware concerns, training time, local minimum exploration, and gradient accuracy and stability. Mini-batch gradient descent takes batches of random samples from the whole data set of size greater than 1, commonly powers of 2 such as 16, 32, 64, 128, or 256, and descends along the gradient calculated for each batch. This gives us the erratic movements that allow us to explore the loss surface's landscape, gives us more updates per epoch to decrease training time, avoids large batches that might put a strain on hardware, and is close enough to the actual gradient so as not to introduce an uncomfortable level of instability.

To help speed up gradient descent, we use adaptive learning rates in addition to the various batch techniques. Adaptive learning rates are learning rates that change throughout the training process in a way that helps us find local minima more quickly than a constant learning rate would. Common techniques use notions of momentum, acceleration, and knowledge of previous gradients. Popular adaptive learning rate techniques include simple momentum, Nesterov accelerated gradient, Adagrad, Adadelta, RMSProp, Adam, Nadam, and others. For example, for $N$ iterations, a chosen learning rate, $\eta$, and loss function, $J$, a simple momentum algorithm saves the velocity, $v$, at each step $t$ and updates the given model parameters, $\theta$, as shown in Algorithm 3 below.

Algorithm 3 Descent with Momentum
1: $v \leftarrow 0$
2: for $t = 1, \ldots, N$ do
3:   $v \leftarrow \beta v - \eta\, \nabla_{\theta} J(\theta)$
4:   $\theta \leftarrow \theta + v$
5: return $\theta$

This provides an effective way for stochastic gradient descent to drastically reduce training time. Another type of adaptive learning rate is AdaGrad, which uses a decaying learning rate driven by an accumulation of squared gradients. The update rule is detailed in Algorithm 4.

Algorithm 4 AdaGrad
1: $G \leftarrow 0$
2: for $t = 1, \ldots, N$ do
3:   $G \leftarrow G + \nabla_{\theta} J(\theta) \cdot \nabla_{\theta} J(\theta)$
4:   $\theta \leftarrow \theta - \eta\, \nabla_{\theta} J(\theta) \oslash \sqrt{G + \epsilon}$
5: return $\theta$

Here the $\epsilon$ term is a smoothing term that avoids division by zero (usually set to around $\epsilon = 10^{-8}$), and the $\oslash$ operator denotes component-wise division, i.e. the $i$th component is defined as
\[
\Big[\nabla_{\theta} J(\theta) \oslash \sqrt{G + \epsilon}\,\Big]_i = \frac{\partial J(\theta)/\partial \theta_i}{\sqrt{G_i + \epsilon}}.
\]
This algorithm decays the learning rate and helps move the updates more toward the global minimum by scaling down the gradient in the steepest dimensions. The scaling helps us move faster toward the global minimum as we encounter less steep slopes, and it also reduces the emphasis on tuning the learning rate, which is why we call this an adaptive learning rate algorithm. AdaGrad was one of the first successful adaptive learning rate algorithms, but it has one major flaw: a vanishing learning rate, since the accumulation of the squares of the gradients in the denominator eventually stops the gradient descent entirely if it does not reach the bottom in time.
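A toy sketch of Algorithms 3 and 4 on the simple quadratic loss $J(\theta) = \frac{1}{2}\|\theta\|^2$ is given below; the learning rates, momentum coefficient, and step counts are arbitrary choices for illustration.

```python
# Toy demonstration of the momentum (Algorithm 3) and AdaGrad (Algorithm 4)
# updates on J(theta) = 0.5 * ||theta||^2, whose gradient is simply theta.
import numpy as np

def grad(theta):
    return theta                          # gradient of the toy quadratic loss

def momentum(theta, eta=0.1, beta=0.9, n_steps=100):
    v = np.zeros_like(theta)
    for _ in range(n_steps):
        v = beta * v - eta * grad(theta)  # accumulate velocity
        theta = theta + v
    return theta

def adagrad(theta, eta=0.5, eps=1e-8, n_steps=100):
    G = np.zeros_like(theta)
    for _ in range(n_steps):
        g = grad(theta)
        G = G + g * g                     # accumulate squared gradients
        theta = theta - eta * g / np.sqrt(G + eps)
    return theta

theta0 = np.array([3.0, -2.0])
print(momentum(theta0), adagrad(theta0))  # both move toward the minimum at 0
```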
RMSProp was designed to take care of the problem of the vanishing learning rate by using a moving average of the squares of the gradients. This gives us the update rule detailed in Algorithm 5.

Algorithm 5 RMSProp
1: $v \leftarrow 0$
2: for $t = 1, \ldots, N$ do
3:   $v \leftarrow \beta v + (1 - \beta)\, \nabla_{\theta} J(\theta) \cdot \nabla_{\theta} J(\theta)$
4:   $\theta \leftarrow \theta - \eta\, \nabla_{\theta} J(\theta) \oslash \sqrt{v + \epsilon}$
5: return $\theta$

This process avoids the unbounded accumulation of gradients because the old terms of a moving average decay exponentially, so the benefits of the AdaGrad algorithm are present in RMSProp without the penalty of the vanishing learning rate. Next, we move on to the most popular adaptive learning rate algorithm, Adam, which combines the success of RMSProp with the idea of momentum. In this case, we have two values to tune, $\beta_1$ and $\beta_2$, which control the momentum and the moving average of the squared gradients respectively. The Adam procedure is detailed below in Algorithm 6.

Algorithm 6 Adam
1: $v_1 \leftarrow 0$
2: $v_2 \leftarrow 0$
3: for $t = 1, \ldots, N$ do
4:   $v_1 \leftarrow \beta_1 v_1 - (1 - \beta_1)\, \nabla_{\theta} J(\theta)$
5:   $v_2 \leftarrow \beta_2 v_2 + (1 - \beta_2)\, \nabla_{\theta} J(\theta) \cdot \nabla_{\theta} J(\theta)$
6:   $\hat{v}_1 \leftarrow \dfrac{v_1}{1 - \beta_1^t}$
7:   $\hat{v}_2 \leftarrow \dfrac{v_2}{1 - \beta_2^t}$
8:   $\theta \leftarrow \theta + \eta\, \hat{v}_1 \oslash \big(\sqrt{\hat{v}_2} + \epsilon\big)$
9: return $\theta$

Note that steps 6 and 7 of the algorithm are bias-correction steps, used to give the updates a boost in the beginning stages of the gradient descent. Since Adam combines the best qualities of each of these algorithms and has had much success in training neural networks, it is, along with its variants, the most popular optimization algorithm available. It is also used in most of our works involving neural networks.
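In practice we rarely implement Adam by hand; a minimal sketch of the usual PyTorch training loop with torch.optim.Adam is shown below, where the network architecture, data, and hyperparameters are illustrative placeholders.

```python
# Minimal training loop using PyTorch's Adam optimizer (cf. Algorithm 6)
# on a small regression network; all sizes and rates are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), eps=1e-8)
loss_fn = nn.MSELoss()

X = torch.randn(256, 10)        # placeholder features
y = torch.randn(256, 1)         # placeholder labels

for epoch in range(50):
    optimizer.zero_grad()       # clear accumulated gradients
    loss = loss_fn(model(X), y) # forward propagation and loss
    loss.backward()             # back propagation
    optimizer.step()            # Adam update of all parameters
```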
3.6.4 Multi-task Networks and Transfer Learning

Most neural networks in use are single-task models, trained to accomplish one particular task at a time. Knowledge in the human mind, by contrast, is acquired over time, and new tasks are initially approached with an intuition resulting from performing similar tasks. Some of the data sets that we analyze are very similar in nature, and it would seem that we could take advantage of this fact using a machine learning technique. Two ways to apply this concept are to take a model that is trained to perform well on one task and retrain it to perform well on a very similar task, or alternatively to train a model to perform well on multiple tasks simultaneously. These two techniques are transfer learning and multi-task learning respectively, and they allow us to benefit from learning on multiple data sets. This is especially helpful if we have sparse access to data, or if we have a large data set that we can use to bolster our performance on a task for which data is sparse.

Transfer learning is a technique that takes a model already trained on one set of data and then trains it on a second, similar data set. The final weights of the first model are used to initialize the second bout of training, putting us in a position on the cost function surface that is more beneficial than the random weight initialization we would have otherwise. Transfer learning works best if there is a large data set for the initial pre-training and the labels are similar enough that the initial training provides a suitable set of initial weights.

The multi-task procedure is more involved than transfer learning and is carried out with a neural network architecture in the following way. Let $T$ be the number of tasks and let $\{(x_i^t, y_i^t)\}_{i=1}^{N_t}$ be the training data for the $t$th task, where $N_t$ is the number of samples of the $t$th task, $x_i^t$ is the feature vector of the $i$th sample of the $t$th task, and $y_i^t$ is the label for the $i$th sample of the $t$th task. The training strategy is to minimize the chosen loss function, $L$, for all tasks simultaneously. We define the loss for task $t$, $L_t$, below:
\[
L_t = \sum_{i=1}^{N_t} L\Big(y_i^t,\; f_2^t\big(f^s\big(f_1^t(x_i^t; \{W_1^t, b_1^t\});\, \{W^s, b^s\}\big);\, \{W_2^t, b_2^t\}\big)\Big), \tag{3.18}
\]
In this equation, $f_1^t$ and $f_2^t$ are the parts of the network before and after the shared layers respectively, parametrized by the respective weights, $W_1^t$, $W_2^t$, and biases, $b_1^t$, $b_2^t$; and $f^s$ is the part of the network that has shared weights, $W^s$, and biases, $b^s$. To train all models simultaneously, we provide a single loss function to minimize, so our problem is then:
\[
\arg\min \frac{1}{T} \sum_{t=1}^{T} \Big( L_t + \alpha \cdot \|W^s\|_2^2 + \beta_t \cdot \|W_1^t\|_2^2 + \gamma_t \cdot \|W_2^t\|_2^2 \Big), \tag{3.19}
\]
where $\|\cdot\|_2$ denotes the $L^2$ norm and $\alpha$, the $\beta_t$s, and the $\gamma_t$s are the regularization constants that set the magnitudes of the weight decay during training. In practice, if weight decay is used, we generally set $\alpha = \beta_t = \gamma_t$ for all $t$. The structure of a general multi-task network is illustrated in Figure 3.5 below.

Figure 3.5: An illustration of a general multi-task network with three tasks. The input data from datasets 1, 2, and 3 are fed forward simultaneously, each through the hidden layers corresponding to its task. The last layers of each individual task's hidden layers are then pushed through the shared layers. Finally, the data points are separated again according to their task and pushed through their corresponding second set of hidden layers and outputs to get the predictions. The average loss of the predictions is calculated, and the weights are then updated simultaneously.

These two techniques are indispensable, and in some cases necessary, for training effective models. It is not always easy to gather data, so we must always look for ways to use the data that we already have and to build better models from ones already constructed.
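The sketch below shows one way the architecture behind equation (3.18) might be arranged in PyTorch, with task-specific input layers, a shared trunk, and task-specific heads; the layer sizes, number of tasks, and data are illustrative, and the weight-decay terms of equation (3.19) are omitted.

```python
# Sketch of a multi-task network: task-specific layers f1_t, shared trunk f_s,
# and task-specific heads f2_t; the per-task losses are averaged as in (3.19).
import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dims, shared_dim=64):
        super().__init__()
        self.f1 = nn.ModuleList(
            [nn.Sequential(nn.Linear(d, shared_dim), nn.ReLU()) for d in in_dims])
        self.fs = nn.Sequential(nn.Linear(shared_dim, shared_dim), nn.ReLU())
        self.f2 = nn.ModuleList([nn.Linear(shared_dim, 1) for _ in in_dims])

    def forward(self, x, task):
        return self.f2[task](self.fs(self.f1[task](x)))

net = MultiTaskNet(in_dims=[10, 12, 8])          # three hypothetical tasks
loss_fn = nn.MSELoss()
batches = [(torch.randn(16, d), torch.randn(16, 1)) for d in (10, 12, 8)]
loss = sum(loss_fn(net(x, t), y) for t, (x, y) in enumerate(batches)) / 3
loss.backward()                                   # simultaneous update of all weights
```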
CHAPTER 4
BIOMOLECULAR MODELING AND KERNEL METHODS

4.1 Introduction

Biomolecules, particularly proteins and nucleic acids (DNA and RNA), are essential for all forms of life, and so they naturally carry importance for us as living beings. Proteins have a wide range of functions within living organisms: membrane channel transport, signal transduction, structural support, enzymatic catalysis for transcription and the cell cycle, and more. Nucleic acids function in association with proteins and are essential for encoding, transmitting, and expressing genetic information, which is stored in nucleic acid sequences and transmitted by transcription and translation processes. Understanding biomolecular structure, function, dynamics, and transport mechanisms is paramount to understanding any biological system or organism. Understanding the protein-function relationship is crucial to understanding protein specification, protein-protein interactions, and protein-drug binding, which are the keys to solving problems in drug design and discovery.

Life science is one of the most prominent fields of natural science in the 21st century, as opportunities abound for depth of understanding and application to human health and technology. Molecular biology is the foundation of the biological sciences and biomolecular mechanisms, and as technology advances, the capability for analyzing biomolecular structures improves dramatically. Many traditional disciplines, such as epidemiology, neuroscience, zoology, physiology, and population biology, are shifting from a macroscopic and phenomenological framework of analysis to a molecular-based strategy. More generally, the field of biology is moving from a focus on qualitative and descriptive models to quantitative and predictive modeling. These changes over the decades have provided ample opportunity for the application of mathematics and machine learning to the biological sciences [108].

Biomolecular modeling is the science of studying the structure and function of molecules through modeling and computation. These computations include semi-empirical quantum mechanics, empirical molecular mechanics, Monte Carlo methods, free energy and solvation methods, quantitative structure-activity relationships (QSAR), biochemical databases and information, and various other procedures. Techniques such as nuclear magnetic resonance (NMR) and X-ray crystallography are also categorized as components of biomolecular modeling. Our models may often represent a very simplified form of the complex molecular structures that we study, but we may use these simplified models to glean important patterns and insights that would otherwise not be seen in the full intricacy of a more complete model of the biological structure. The key is to develop models that are as simple as possible for us to understand, yet incorporate all of the elements required to solve the problem at hand.

Most of the data concerning the structure and function of biomolecules has more recently come from experiments using techniques such as X-ray crystallography, NMR, cryo-electron microscopy (cryo-EM) [103], electron paramagnetic resonance (EPR), multi-angle light scattering, and others. The two techniques that have seen the most use are NMR and X-ray crystallography, which have been developed extensively over the past few decades. X-ray crystallography offers structural information at angstrom and even sub-angstrom precision, while NMR technology provides structural information in a physiological context. X-ray crystallography and NMR are limited, however, in their application to proteasomes, subcellular structures, organelles, cells, tissues, and the like. Cryo-EM has been able to perform imaging in many of the areas that were not suitable for either NMR or X-ray crystallography, and many other techniques are also applied to deal with more specialized situations as they arise, allowing us to gather large amounts of biological data.

The rapid advances in experimental technology in recent history have led to massive stores of three-dimensional biomolecular structural data and spurred large-scale development of biotechnology and pharmaceutical products.
This 3-D structural information is applied in many tasks such as virtual screening of drugs, computer-aided drug design, binding pocket descriptors, quantitative structure-activity relationships (QSAR), protein design, RNA design, and molecular machine design, among other tasks. This substantial increase in molecular data has catalyzed the transition from the traditional qualitative approach to the quantitative and predictive approach that now pervades the biological sciences. The incredible growth of computer technology, which has now found applica- tion in every field of science, perhaps deserves the lions share of credit. Most quantitative endeavors would have little weight without the computational capacity of our modern computers, which made possible the advancement of biomolecular modeling and was able to bridge the gap between biological data and mathematical models [72, 27, 10, 11, 44, 56, 59, 9, 94, 6]. Studying structure-activity function in a biomolecular system is a major challenge in the field of computational biology. There are many quantities relating to bio-chemicals and biomolecular complexes that are of great interest to biologists, chemists, profession- als in the pharmaceutical field, and other scientists. Quantitative and qualitative toxicity, solvation energy, solubility, protein-ligand binding affinity, and mutation effects of pro- teins are all various quantities associated with biomolecular systems that are important to know for a variety of purposes. FDA and EPA officials may wish to know whether a particular chemical is toxic to humans or to a surrounding environment, chemists or chemical engineers may be interested in the solubility of a range of biomolecules, phar- maceutical companies may be interested in the binding energy of a series of drugs in development to see if they should undergo further experimentation, and medical experts want to understand how the effects of a mutation in a protein affects its function. These are all common examples of the applications of biomolecular modeling that showcase the 62 importance of the field. In many cases, biomolecular modeling involves creating models that can be used to predict the target quantities in question. Measuring these quantities may require tedious lab work which might be very time-consuming and expensive when considering a large set of chemical candidates. This may be prohibitive to small-scale companies and uni- versities with a smaller budget. In the cases of larger organizations that may be able to afford the incredible costs and hire the specialized personnel that are required, there is al- ways an incentive to reduce costs, bypass stringent laboratory regulations, and avoid any ethical concerns through use of in-silico methods. This can decrease waste significantly, provide cheaper options for consumers, and spur more innovation in the pharmaceutical and medical fields. One can think of a multitude of opportunities that effective predictive models can bring to provide serious benefits to our society. The simplest geometric model that has been proposed is the space-filling Corey Pauling- Koltun (CPK) theory, which represents each atom as a solid sphere with Van der Waals(VDW) radius, in which the surface is referred to as the Van der Waals surface. 
Solvent accessi- ble surfaces (SAS) and solvent-excluded surfaces (SES) were then introduced to create smooth modeling surfaces, where SES has been applied in the cases of protein folding, protein surface topography, protein-protein interactions, DNA binding, DNA bending, docking, enzyme catalysis, drug classification, and solvation energy prediction. The SES model is also used in dynamic simulations and understanding ion channel transports. SAS and SES, however, are not differentiable due to the cusps and tips that are prevalent on their surfaces. Gaussian surfaces were proposed to deal with this problem, which introduces a smooth Gaussian function to remove any singularities from the surface. With this fix, many analytical techniques were applied, such as differential geometry, that had found much success. Other basic physical properties of biomolecules are also applied to bolster the geometric information, including electrostatics, lipophilicity, elas- ticity, and some properties derived from molecular dynamics. These geometric models 63 have provided excellent predictions of the solvation free energy, protein-ligand binding affinities, protein mutation energy changes, and protein-protein interaction hotspots of various biomolecules and biomolecular complexes. More advanced models have been constructed using advanced theoretical concepts in physics. Theoretical modeling of structure-function relationships generally based on fun- damental laws of physics, such as quantum mechanics (QM), molecular mechanics (MM), continuum mechanics, statistical mechanics, thermodynamics, among other subfields of physics. QM methods are great for understanding chemical reactions and protein degra- dation, while MD is useful for understanding molecular conformational landscapes and collective motion/fluctuations. MD is at the forefront of theoretical biomolecular model- ing and simulations, however, the modeling of large-scale biomolecules or biomolecular complexes can introduce an excessive amount of degrees of freedom involved, and can easily become prohibitively expensive in it’s computational cost. For example, in some instances it might even take months to develop a computer simulation of protein folding, and usually is a poor comparison to the base truth. There are many ways to create predictive models involving high-level physical and mathematical methods that flesh out the details of the biological system, but also there are more crude methods may be used if they can provide sufficient predictions for the problem at hand. More recently, in-silico methods are used that involve the applica- tion of machine learning algorithms to create accurate models that can predict all sorts of important physical quantities without any experimentation. Many advanced mathe- matical techniques have been used to generate molecular representations that have been very useful in training a number of machine learning models, from simple regression and other basic models with few hyper-parameters, to the more complicated deep-learning models with a wide range of hyper-parameters that require extensive tuning. These rep- resentations, which involve the use of many different fields in mathematics, have created extremely successful models which have provided us with state-of-the-art predictive ca- 64 pabilities. Algebraic topology, differential geometry, graph theory, and other fields have all been applied to model the complex interactions involved in these biomolecular sys- tems. 
These three fields; geometry, topology, and graph theory have been essential for the understanding of biomolecular function, dynamics, and transport. Geometry provides structural and surface representation, which helps to understand molecular structure and internal and external interactions between or within molecules. The application of geometry-based mathematics connects the physical molecular structure to theoretical mathematical models. Topology helps us understand atomic critical points, connectivity, and information about important topological invariants that represent physical details, such as independent components, rings (or pockets), and cavities, where the notion of the "size" of these components can be captured. Graph theory analyzes biomolecular interactions and reveals structure function relationship. Aside from those three mathe- matical fields that we discussed, there are many other tools that can be used. There are two main categories that we can place them in, discrete and continuous. Discrete ap- proaches include graph theory, Gaussian network models, anisotropic network models, normal mode analysis, quasi-harmonic analysis, flexibility-rigidity index (FRI), molecu- lar nonlinear dynamics, spectral graph theory, and persistent homology. The continuous approaches include discrete to continuum mapping, high dimensional persistent homol- ogy, biomolecular geometric modeling, differential geometric theory of surfaces, curva- ture evaluation, variational derivation of minimal molecular surfaces, atoms in molecular theory and quantum chemical topology. In many of our works, we use kernel functions that measure similarity between two points on a continuous scale from 0 to 1 to obtain features for our molecular represen- tations. These kernel functions are applied in feature extraction because they provide a normalization of the features in our representation, they provide an important aspect of non-linearity to the feature generation which adds complexity to the representation, 65 and they allow for further parameter tuning, which gives us the ability to refine our fea- tures in a simple way. The shapes of these kernel functions are based on parameters that determine the inclusiveness of the similarity and also the sharpness of the decay of the similarity measure at a point of cutoff. As stated earlier, the parameters that deter- mine the shape of the function can be naturally tuned, allowing a greater chance for more successful models to be found. The above concepts of flexibility-rigidity index (FRI), ho- mology, spectral graph theory, and differential geometry all have versions in which we have applied kernels to abstract the features and allow additional tuning. Why should one care for the application of abstract topics mathematical techniques to model these biological situations? Given the abstract nature of the models, we will usu- ally not be able to determine a natural explanation about each piece of the model, but we may speculate about the proximate reasons for their success. For example, features gener- ated by the spectrum of Laplacian matrices may not give us an immediate understanding of what they may mean to a particular molecular system, but there may be hidden at- tributes, such as information about the connectivity of the molecule when viewed as a graph, that might prove useful in the context of the solubility of a molecule. 
One important observation is that for larger and more complex biomolecular systems, the more precise physical models do not scale well as predictive models. As the system complexity increases, the physical calculations necessary to reach appropriate accuracy become computationally infeasible. When abstracting the properties of the molecular systems, the fine details are not extracted, but a more general qualitative description of the system and its workings is presented. Important information is gathered while unnecessary detail is omitted. For example, we may obtain topological, geometric, and spectral data from a molecule, representing the system as a whole with the intrinsic physics and geometry captured through abstraction.

In this chapter, we explore simple concepts in biomolecular modeling and then dive deeper into more advanced topics. The application of kernel functions, graph representations of molecules, and specific mathematical modeling methods will be covered.

4.2 Physical Modeling

Physics provides us with many tools to analyze physical systems in a concrete mathematical way. The mathematical models that have been experimentally validated are indispensable in a variety of contexts. Our focus is specifically on the use of physical modeling in biology and biochemistry. In this section, we cover some basic physical notions that are pertinent for understanding biomolecular systems.

4.2.1 Electrostatic Interaction

The electrostatics of pairwise interactions between molecules in a homogeneous medium is frequently modeled by Coulomb's law, which measures the electrostatic force between two atoms by taking into account their distance and partial charges. If we let the distance between two atoms, $i$ and $j$, be denoted by $d_{ij}$, and let $q_i$ and $q_j$ be the charges of atom $i$ and atom $j$ respectively, then the electrostatic force between the two atoms can be described by the equation:
\[
F_{clb} = k_e \frac{q_i q_j}{d_{ij}^2}, \tag{4.1}
\]
where $k_e$ is Coulomb's constant. This is important because it can tell us where there might be attraction or repulsion between atoms or molecules, which is incredibly useful if we are trying to model how a system of molecules will react with one another.

4.2.2 Van der Waals Interaction

Van der Waals forces are weak forces between two atoms or molecules induced by electrical interactions. They are the weakest of the main intermolecular forces, such as hydrogen bonds, hydrophobic interactions, and ionic interactions, but they can be very strong if the forces accumulate from multiple sources. There are two main types: the weak London dispersion forces and the stronger dipole-dipole forces. The chance that an electron is in a certain area is determined by a probability cloud, and so the density of electrons within a molecule or atom is changing at all times. The tendency of electrons to concentrate on one side of the molecule or atom, momentarily changing its polarity and subsequently exerting a force on other molecules or atoms, is generally referred to as the London dispersion force. Dipole-dipole interactions are interactions that result simply from the natural polarity of a molecule. The Lennard-Jones potential models the Van der Waals interaction energy, $E_{ij}$, between the $i$th and $j$th atoms in a system.
This combines two forces: an attractive force that comes from the London dispersion forces, and a shorter-range repulsive force that comes from the Pauli exclusion principle. This mathematical model is given by the equation:
\[
E_{ij} = \epsilon_{ij} \left[ \left( \frac{r_i + r_j}{d_{ij}} \right)^{12} - 2 \left( \frac{r_i + r_j}{d_{ij}} \right)^{6} \right], \tag{4.2}
\]
where $r_i$ and $r_j$ are the Van der Waals radii of atoms $i$ and $j$ respectively, $d_{ij}$ is the distance between them, and $\epsilon_{ij}$ is the depth of the potential well, which is given based on the pair of atoms.

4.2.3 Molecular Surface Area

The solvent-accessible surface area is the area of a molecule that is accessible by a solvent or other type of molecule, usually measured in square Angstroms (Å$^2$). This surface area is usually calculated using the rolling ball method, which uses a sphere of a particular radius to probe the surface of the molecule [95]. This process is illustrated in Figure 4.1 below. There are other algorithms that can calculate this surface, such as the LCPO method and the power diagram method, which are much faster than the rolling ball algorithm [109, 7]. The solvent accessible surface area is used in calculating the transfer free energy and in molecular dynamics simulations, and it has recently been suggested that the surface area can even be used to improve the prediction of protein secondary structures [76, 8].

Figure 4.1: This figure illustrates the solvent accessible surface compared to the Van der Waals surface. The solid red and gray colors represent the Van der Waals radii of the atoms in the molecule, and the striped yellow area represents the solvent accessible surface that is "rolled out" by the center of the translucent blue probe sphere (or solvent).

4.3 Biomolecular Datasets of Small Molecules

Small molecules play an important role in all biological processes, mainly through interactions with larger biomolecules such as proteins and nucleic acids. They bind with proteins as parts of complex chemical reactions and processes that belong to larger processes important for the preservation of biological lifeforms. Small molecules often act as triggering signals or inhibitors for these processes, and in fact many of these small molecules are prepared as drugs, alleviating problems in the human body and in animals as well. Being able to accurately describe the effects of these small molecules on larger biological complexes is of utmost importance to drug design, and understanding this is attractive to pharmaceutical companies and medical professionals. In this section, we explore some problems involving small molecules and the important physical quantities associated with them. We cover toxicity, solubility, and the prediction of solvation and hydration free energy. Data sets of this type will be explored in later chapters.

4.3.1 Toxicity

Toxicity is a measure of the degree to which a chemical can harm an organism. The harmful effects can be measured qualitatively or quantitatively. A qualitative approach categorizes chemicals as toxic or nontoxic, while quantitative toxicity data records the minimal amount of a substance that would produce lethal effects. Experimental measurement of toxicity can be expensive, time-consuming, and subject to ethical constraints. Machine learning models can alleviate all of these concerns, and so they are extremely useful in toxicity prediction.
The working principle of the machine learning approach to toxicity analysis is that molecules with similar structures have similar activities, which is called the quantitative structure-activity relationship (QSAR) approach. By analyzing the relationships between molecular structures, one can predict their biological functions. The toxicity of a small molecule is a fundamental concern for drug design, simply because a drug is generally designed to alleviate or heal problems arising in the human body, and so toxic drugs are naturally avoided. Toxicity analysis is also useful for environmental concerns. If a chemical is applied to our environment for purposes of protection, we may wish to know how it will affect the area in which it is applied. Environmental damage may cause concern for a number of reasons, and government agencies like the EPA have a great interest in understanding the level of toxicity of various chemicals. There are many different organisms that may be affected by different chemicals, and the complexity of many organisms can present challenges to creating predictive models; nevertheless, our group has seen much success in this area [110, 45, 81].

4.3.2 Solubility

Solubility is the ability of a chemical to form a solution with another chemical, in which the chemicals are referred to as the solute and the solvent respectively. If a solute cannot form a solution with a solvent, the solute is said to be insoluble in the solvent. The solubility of a molecule is generally measured in terms of the concentration of the solute in a saturated solution, which is a solution in which no more solute can be dissolved and the solute and solvent are said to be in equilibrium. Quantifying and predicting the solubility of a solute has many important applications in industry and in sciences besides chemistry, such as geology, biology, engineering, medicine, agriculture, paint production, brewing, and many other fields. The most common solvent used in industrial processes and scientific experimentation is water, so understanding how a solute behaves in water is of utmost importance.

In drug design, pharmaceutical professionals are interested in understanding the solubility of a drug candidate during the delivery of a drug to the body. A drug must have some level of aqueous solubility (solubility in water) for it to be properly absorbed into systemic circulation. There are many factors that affect the solubility of a drug, among them shape, temperature, pH, and crystal characteristics. As far as shape is concerned, we have tools to create representations revealing the geometric properties of solutes and methods for implementing them in machine learning models to great effect. The QSAR approach that we take for toxicity may also be fruitful in determining solubility.

4.3.3 Solvation Free Energy and Hydration

Solvation describes the interaction of a solvent with dissolved molecules, in which the strength of the interaction from both ionized and uncharged molecules is measured. The strength of these interactions influences many properties of the solute, such as solubility and reactivity, while also influencing properties of the solvent, such as viscosity and density [12]. Solvation is the process of reorganizing the molecules of the solvent and solute into solvation complexes, which are formed by having molecules of the solvent form shells around ions in the solution.
Solvation involves physical interactions and processes such as bond formation, hydrogen bonding, and Van der Waals forces. A special case, common enough to have its own name, is the solvation of a solute in water; this solvation process is called hydration.

Given the physical processes involved, we would like to look for representations that can encode the intrinsic physics of the system. Solvent polarity is perhaps the most important factor in solvation analysis. Polar solvents have molecular dipoles and are more susceptible to forming solvent shells, so when modeling, we must include partial charges and polarity information. The quantity that we want to measure in solvation analysis is the solvation free energy. This is the change in the Gibbs free energy of the system, a thermodynamic quantity, measured in joules, given by the change in enthalpy minus the temperature times the change in entropy, i.e. $\Delta G = \Delta H - T\Delta S$. Solvation free energy is a very important quantity in solvation analysis, and it can be very beneficial in studying other complex or biological processes [63, 74, 105, 55].

4.4 Basic Kernel Functions

A kernel function in machine learning usually refers to a function that is used to map data points in the feature space into another feature space so that the data becomes linearly separable, or at least more separable than before. The functions we use are based on a measure of similarity and are usually limited to values between 0 and 1, where the closer the distance, the more similar the two data points are. More precisely, for a kernel function, $\Phi : \mathbb{R}_{\ge 0} \to [0, 1]$, we require the following conditions:
\[
\Phi(x) \to 0 \ \text{as} \ x \to \infty, \qquad
\Phi(x) \to 1 \ \text{as} \ x \to 0, \qquad
\Phi(x) \ge \Phi(y) \ \text{for} \ x \le y.
\]
These conditions help us choose functions that weigh more heavily the similarity of data points that are close together in the feature space and eliminate the influence of data points that are much further away. This gives us a more accurate categorization or grouping of the data points that would otherwise frustrate our models' attempts to fit the data nicely. In essence, the kernel function often acts as a low-pass filter, with a soft cutoff centered at a parameter, $\eta$, and a quality control parameter, $\kappa$, that determines the sharpness of the cutoff. Two kernel functions are very popular and have been used extensively in research coming from our group: the Lorentz kernel and the exponential (generalized Gaussian) kernel,
\[
\Phi_L(x; \eta, \kappa) = \frac{1}{1 + (x/\eta)^{\kappa}}, \qquad
\Phi_E(x; \eta, \kappa) = e^{-(x/\eta)^{\kappa}}.
\]
Both have proven to be excellent kernel functions in practice and have aided in the creation of many successful models, with multi-kernel methods having the greatest success. Another special function that has been tested briefly and performed well is the piecewise linear function defined as:
\[
\Phi_{\text{plin}}(x; \eta, s) =
\begin{cases}
1 - \dfrac{s}{\eta}\, x & : x \le \eta \\[1ex]
s\left(2 - \dfrac{x}{\eta}\right) & : x > \eta
\end{cases}
\]
where we usually set $s = 0.5 \cdot e^{-\kappa^2}$ so that no values cut the graph off from its intended shape, since the $s$ value should lie between 0 and 0.5; $\kappa$ can then be adjusted without worry. This kernel gives us a very sharp cutoff and allows us to adjust the contribution near the cutoff point while totally eliminating the contribution of very distant points.
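The kernels above translate directly into code; in the sketch below, the clamping of the piecewise linear kernel to the interval [0, 1] is our reading of the sharp cutoff described in the text, and the parameter values are illustrative.

```python
# Sketch of the Lorentz, exponential, and piecewise-linear kernels defined above.
import numpy as np

def lorentz_kernel(x, eta, kappa):
    return 1.0 / (1.0 + (x / eta) ** kappa)

def exponential_kernel(x, eta, kappa):
    return np.exp(-((x / eta) ** kappa))

def piecewise_linear_kernel(x, eta, kappa):
    s = 0.5 * np.exp(-kappa**2)               # keeps s within (0, 0.5]
    x = np.asarray(x, dtype=float)
    out = np.where(x <= eta, 1.0 - s * x / eta, s * (2.0 - x / eta))
    return np.clip(out, 0.0, 1.0)             # clamp: eliminate distant contributions (our assumption)

d = np.linspace(0.0, 10.0, 5)                 # hypothetical pairwise distances (Angstrom)
print(lorentz_kernel(d, eta=3.0, kappa=2.0))
print(exponential_kernel(d, eta=3.0, kappa=2.0))
print(piecewise_linear_kernel(d, eta=3.0, kappa=1.0))
```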
In our experiments we use the kernel functions for different applications. We use them to obtain features, usually by applying them to atomic distances to capture interaction strength and electrostatics information. They work as low-pass filters that put more importance on atoms that are closer and drown out the interactions of more distant atoms that should not contribute much to the overall behavior of the system.

4.5 Kernel Methods Applied to Biology

4.5.1 Element-specific Groups

The concept of an element-specific group is a key concept in many papers published by our research group. Element-specific groups are a way of grouping the atoms of a system by element type to provide multiple avenues for feature extraction, supplying additional relevant information about the internal or external interactions of the biomolecules of a system. Different elements interact with each other in different ways, so analyzing the restricted sets of atoms defined by element pairs can give important insights for machine learning models. This has been a powerful tool for feature extraction and has been responsible for many successful models.

More specifically, if we wish to choose element-specific groups to analyze, we must choose a set of elements that will determine our groups. If we are looking at single molecules, we will generally analyze the element pairs within the molecules themselves. If we are dealing with a biomolecular complex, usually of a small molecule and a large protein, the element pairs will refer to atoms of one element type in the first biomolecule and atoms of another element type in the second biomolecule. If there are more than two biomolecules, then we may begin to look at triples, quadruples, etc. In this work, we focus on the case of pairs of elements, so we choose two sets of element types, $T_1$ and $T_2$; e.g. $T_1 = T_2 = \{\mathrm{H, C, N, O, S, P, F, Cl, Br, I}\}$ in the case of the internal element pair interactions of small molecules, or $T_1 = \{\mathrm{C, N, O, S}\}$ and $T_2 = \{\mathrm{H, C, N, O, S, P, F, Cl, Br, I}\}$ for the external interactions between a protein and a ligand. The choices of the element type sets are based on many factors, but commonly hydrogen atoms are withheld from large molecules because of the unreliable nature of the 3-dimensional hydrogen positions obtained through biomolecular software when applied to structures like proteins, and other elements are withheld because they do not appear regularly in molecules and so do not add any relevant information.

4.5.2 Element-specific Graph Representations

In the previous section, we discussed the concept of element-specific groups consisting of atoms with particular element types. One useful way of organizing the structure of a molecule or a biomolecular system is to place this information into a graph structure. We can naturally obtain a set of vertices by letting the atoms in the molecule or biomolecular complex be the vertices. To capture enough detail, we consider a vertex-labeled and weighted graph whose vertices are labeled by the element type of the atoms (and possibly other information), and whose edges are weighted by the distance between the atoms, or by a measure of their closeness as judged by a chosen kernel function. This gives us a complete graph, which can be quite large in most applications. By restricting ourselves to the subgraphs that consist of the element pairs of our choosing, we may then extract the useful features that we desire.

To make pairs for element-specific groups, we combine the two sets of element types, $T_1$ and $T_2$, into a set of pairs: $T = T_1 \times T_2$.
We then describe each atom in the biomolecular complex as a pair consisting of a position and an element type, $(r, t) \in \mathbb{R}^3 \times (T_1 \cup T_2)$, and let $V$ be the vertex set containing all the atoms in the system. We may also wish to include another descriptor along with each atom, such as the partial charge, so in some cases we regard atoms as triples, i.e. $(r, t, c) \in \mathbb{R}^3 \times (T_1 \cup T_2) \times \mathbb{R}$. This provides us with additional features to be extracted per element-specific group. The edge set, $E$, connects all the vertices, forming a complete graph. In fact, we define $E$ to be the weighted edge set
\[
E = \Big\{ \big((r_1, t_1), (r_2, t_2), d\big) \in V \times V \times \mathbb{R}_{\ge 0} \,:\, d = \Phi(\|r_1 - r_2\|) \Big\},
\]
where $\Phi$ is either the identity (so that $d$ is the Euclidean distance itself) or a chosen kernel function. Now, for each element pair, $\mathcal{P} = (t_1, t_2) \in T$, we choose all the vertices that have either element type of the pair, so our vertex set for the element-specific group is $V_{\mathcal{P}} = \{(r, t) \in V : t = t_1 \text{ or } t = t_2\}$. Then we define the element-specific edge set as $E_{\mathcal{P}} = \big\{ \big((r_1, t_1), (r_2, t_2)\big) \in E : (t_1, t_2) = \mathcal{P} \big\}$. This gives us a complete description of the element-specific subgraph, $G_{\mathcal{P}} = (V_{\mathcal{P}}, E_{\mathcal{P}})$.

4.5.3 Geometric Graphs

Our first example of kernel methods uses the element-specific subgraphs described above, in which we extract features for each subgraph by summing the weights of the edges, possibly along with other statistics. An example of this process was described in a paper by Nguyen et al. [78]. To be more precise, we start with an element-specific subgraph, $G$, within our molecule or biomolecular complex. We then have a choice of kernel function, $\Phi$, which is usually the Lorentz or the exponential kernel, with chosen parameters, $\eta$ and $\kappa$. Then we obtain a value, $\mu_i$, at each atom $i$ that represents the strength of the interactions between atom $i$ and all the other atoms in the subgraph, with covalent bonds usually excluded:
\[
\mu_i = \sum_{j=1}^{d} w_j \, \Phi(\|x_i - x_j\|; \eta, \kappa), \tag{4.3}
\]
where $d$ is the number of edges in the element-specific graph connected to $i$, $x_i$ and $x_j$ are the coordinates of atoms $i$ and $j$ respectively, and the $w_j$s are either all set to the value 1 or set to a physical quantity related to atom $j$. The value $\mu_i$ can be referred to as the rigidity index at atom $i$. To get a feature from the group, we can simply take the sum of the rigidity indices of all the atoms in the subgraph, i.e.
\[
RI_G = \sum_{i=1}^{N} \mu_i, \tag{4.4}
\]
where $N$ is the number of atoms in the subgraph, $G$. If we wish to get additional features from the group, we may consider taking various statistics of the $\mu_i$s, such as the maximum, minimum, standard deviation, mean, etc. For every element-specific group, we obtain one or more features and combine them together to describe each data point in our data set.
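A minimal sketch of the rigidity-index features of equations (4.3) and (4.4) for a single element pair is given below, with all weights $w_j$ set to 1 and no treatment of covalent-bond exclusion; the coordinates, element labels, and kernel parameters are illustrative placeholders.

```python
# Sketch of geometric-graph features (4.3)-(4.4) for one element-specific pair:
# each mu_i sums kernel-weighted distances from an atom of the first type to
# all atoms of the second type; summary statistics give the group's features.
import numpy as np

def exponential_kernel(x, eta, kappa):
    return np.exp(-((x / eta) ** kappa))

def rigidity_features(coords, elements, pair=("C", "O"), eta=3.0, kappa=2.0):
    t1, t2 = pair
    idx1 = [i for i, e in enumerate(elements) if e == t1]
    idx2 = [j for j, e in enumerate(elements) if e == t2]
    mu = []
    for i in idx1:
        d = np.linalg.norm(coords[idx2] - coords[i], axis=1)
        d = d[d > 0]                                          # skip self-pairs
        mu.append(exponential_kernel(d, eta, kappa).sum())    # eq. (4.3), w_j = 1
    mu = np.array(mu) if mu else np.zeros(1)
    # eq. (4.4) plus simple additional statistics for the group
    return np.array([mu.sum(), mu.mean(), mu.max(), mu.min(), mu.std()])

coords = np.array([[0.0, 0.0, 0.0], [1.2, 0.0, 0.0], [2.4, 0.4, 0.0]])
elements = ["C", "O", "C"]                                    # toy molecule
print(rigidity_features(coords, elements))
```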
Then we can form atom-specific features for each atom i:

\mu_i^L = \sum_{l=1}^{n} \frac{1}{\lambda_l} \left[ u_l u_l^T \right]_{ii}.   (4.5)

As before, we get the element-specific descriptor by summing the µ_i^L values:

RI_G^L = \sum_{i=1}^{N} \mu_i^L.   (4.6)

Additional statistics can also be gathered as before. For additional molecular descriptors, we may wish to draw features from another graph, one whose edges are based on the covalent bonds between the atoms. This may reveal more about the structural properties of the molecule since the physical structure is encoded in the graph.

4.5.5 Algebraic Topology

Persistent homology can be applied to molecular structures to gain an understanding of the topological structure of a molecule or a biomolecular complex. We have seen much success in its application, and in some cases a kernel was applied to provide an opportunity for parametrization and optimization of those parameters to construct more refined models [22, 110]. We want to consider the case of small molecules, in which for each molecule we may obtain a point cloud from the positions of the atoms in the molecule. The point cloud from a molecule can then yield persistence values based on the distance matrix. The Vietoris-Rips complex, Čech complex, or alpha complex may then be used to produce a filtration. The births and deaths of the homology classes define a barcode representation of the molecule. This representation is illustrated in Figure 4.2 from a paper by Zixuan et al. [22].

Figure 4.2: This figure shows a barcode representation being extracted from a biomolecular structure. On the left, we have the biomolecule in which the point cloud is taken by looking at the positions of all the atoms in the structure. On the right, we have the persistence information for the first 3 dimensions. The 0-dimensional persistence (points) for each class is on top, the 1-dimensional persistence (rings) is in the middle, and the 2-dimensional persistence (cavities) is on the bottom. The persistence is measured in Angstroms (Å) [22].

Since we want to apply kernels to our features, we define the correlation between two atoms, i and j, with positions x_i and x_j respectively, by the formula 1 − Φ(||x_i − x_j||). This gives us an alternative to the distance between two atoms by using the correlation value instead, yielding a modified distance matrix. We then apply the same method as described above to get the barcode representation.

From these barcode representations, i.e. persistence information, we can extract topological features to use in machine learning models. We start by choosing a maximum length, L, and a resolution value, m. Then, for a given dimension, p, we may define 3 feature vectors which describe the barcode data: the birth, death, and persistence vectors. The main technique is to split the maximum length into a number of bins, N = L/m, in which bin i corresponds to the interval from (i − 1)m to im (equivalently, the fraction (i − 1)m/L to im/L of the maximum length). We then place homology classes (i.e. barcodes) into the bins and record the number of items placed in each bin to get a feature vector. For births and deaths, we place a homology class in bin i if the homology class is born or dies, respectively, within the interval corresponding to that bin. We can alternatively state this as placing a barcode in a bin if its tail is contained in the interval corresponding to the bin (birth), or if its head is contained in the interval corresponding to the bin (death).
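A minimal sketch of this binning (with hypothetical (birth, death) pairs, not code from this work) is given below; it also fills a persistence vector following the rule described in the next paragraph:

```python
import numpy as np

def barcode_features(bars, max_length=4.0, resolution=0.5):
    """Bin (birth, death) pairs into birth, death, and persistence count vectors."""
    n_bins = int(max_length / resolution)
    births = np.zeros(n_bins, dtype=int)
    deaths = np.zeros(n_bins, dtype=int)
    persistence = np.zeros(n_bins, dtype=int)
    edges = [(i * resolution, (i + 1) * resolution) for i in range(n_bins)]
    for birth, death in bars:
        for i, (lo, hi) in enumerate(edges):
            if lo <= birth < hi:
                births[i] += 1          # the bar's tail lies in this bin
            if lo <= death < hi:
                deaths[i] += 1          # the bar's head lies in this bin
            if birth <= lo and death >= hi:
                persistence[i] += 1     # the bar covers the whole bin (see next paragraph)
    return births, deaths, persistence

# Hypothetical 0-dimensional barcodes, using L = 4 and m = 0.5 as in Figure 4.3.
bars = [(0.0, 1.2), (0.3, 2.6), (0.0, 3.9), (1.1, 1.6)]
b, d, p = barcode_features(bars)
```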
The persistence feature vector is calculated in a slightly different way, in that a homology class (or barcode) is placed into every bin whose interval is entirely included within the stretch from birth to death of that class (or alternatively, every bin whose interval the barcode covers entirely). This process is illustrated in Figure 4.3.

Figure 4.3: This figure depicts the featurization of the barcode data from persistent homology calculations. On the top, the barcode representation is shown, with 4 bars representing the "lives" of 4 different homology classes as a simplicial complex "evolves" through a filtration. Below the barcode data, the birth, death, and persistence feature vectors are constructed using a max length of 4 and a resolution of 0.5.

For each dimension, then, we obtain 3 feature vectors of N = L/m (the number of bins) features each. If we use 3 dimensions, then we get 3 sets of 3 feature vectors, for a total of 9L/m features. This can be a lot, since we would usually want to set the length large enough to capture all of the atoms, and the resolution to something that is certainly smaller than 1, but not so small that the feature vectors are too sparse. Generally, the length will be set to 10 or lower for small molecules, and up to 30 for large biomolecules. The resolution is usually set to 0.2 or 0.25, but should probably not be set to anything below 0.1. We might also be willing to look at element-specific topological features and obtain many more features, given the number of element-specific groups that we can usually analyze, but we may lose some physical meaning if we do this, since the intuition of these homological features is to measure components, rings, cavities, etc. It has, however, been shown to be fruitful in either case [22].

4.6 Multi-scale Kernel Methods

Multi-scale kernel methods refer to methods involving the use of multiple kernel functions or parameter sets to generate additional features. For instance, we can generate representations based on a choice of kernel, such as the Lorentz and exponential kernels, each with a choice of η and κ parameters. Let us call these representations R(L, η1, κ1) and R(E, η2, κ2). Then, we can simply combine the two representations by concatenating them to get a new representation, R(L, η1, κ1) : R(E, η2, κ2). This can be extended to an arbitrary number of kernel choices and parameters, i.e. for a sequence of kernel types, (K1, ..., Kn), and parameter choices, (η1, η2, ..., ηn) and (κ1, κ2, ..., κn), we can get a multi-scale representation from their concatenated representations, R(K1, η1, κ1) : R(K2, η2, κ2) : · · · : R(Kn, ηn, κn). Each kernel type has its own benefits and drawbacks, and the different parameters give the kernels various shapes which might pick up unique features, so the combination of these different kernel types and parameters gives us an edge when building models. We have seen multi-scale models show great success in application [78, 80, 81].

CHAPTER 5
AWEGNN: AUTO-PARAMETRIZED WEIGHTED ELEMENT-SPECIFIC GRAPH NEURAL NETWORKS FOR SMALL MOLECULES

This chapter includes much of the AweGNN paper by Szocinski et al. [100], and should be read as a self-contained work.

5.1 Introduction

Automated feature extraction techniques, like those used in convolutional neural networks (CNNs) and recurrent neural networks (RNNs), have been very successful in deep learning applications [53, 92].
They have made inroads into a variety of fields, producing state-of-the-art results in signal and information processing [30], such as speech recognition, image recognition [25], and natural language processing [99]. These kinds of automated feature extraction techniques, however, are only highly successful on data that is relatively simple in structure, such as images, text, etc. For data sets with complex internal structures, such as molecules or macromolecules, hand-crafted descriptors, or representations, are indispensable for developing top-quality predictive models [79].

There are many physical, biological, and man-made objects that have intricate internal structures. For example, proteins, chromosomes, human bodies, and cities have very complex structures. Tools from abstract mathematics such as algebraic topology, differential geometry, and combinatorics can be utilized to simplify the structural complexities of data [79, 69, 107]. For molecules and macromolecules, molecular fingerprint representations obtained from persistent homology, curvature analysis of surface electrostatic potentials, and eigenvalues of weighted adjacency matrices derived from atomic distances have all been used for this endeavor [79]. From the point of view of machine learning, molecular representations which include detailed chemical and physical information can be extremely high-dimensional, especially when the biomolecules in question are made of thousands of atoms, such as in the case of protein-ligand complexes [19, 21, 26, 13]. By using lower-dimensional mathematical representations to train simple machine learning models, such as gradient-boosting tree (GBT) and random forest (RF) models, one can achieve stellar performance.

Some of our hand-crafted mathematical features are somewhat limited in their scope [110], by which we mean they are generated by a fixed procedure, with little possibility for making fine adjustments to produce more favorable representations. Other mathematical features that we have used are based on a choice of kernel function or functions along with adjustable parameters, in which a careful tuning of these parameters can deliver optimal performance [78, 80]. This process of tuning, however, can be very time-consuming and requires experience. Although the kernel parameters introduce new dimensions to enhance our models, the cost of tuning becomes tedious when more than a small handful are introduced.

In this work, motivated by the automation of simple feature-extraction techniques as in CNNs, we seek to automate the parameter optimization for our mathematical representations. Specifically, we automate the selection process of kernel parameters either partially or fully, meaning that we would like to have some or all of the parameters automatically chosen for us during training. Neural networks are trained in a series of epochs where the weights of the network are updated at each step through gradient descent, and so they are a perfect starting point for exploring this idea of automated parameter selection. In our approach, back propagation is extended further to include the kernel parameters of our molecular representations, leading to the simultaneous update of our kernel parameters and neural weights at each batch of training. To this end, we introduce auto-parametrized weighted element-specific graph neural networks (AweGNNs).
The AweGNN implements the above scheme of parameter updates by using a representation based on kernel-weighted geometric subgraphs, with features similar to the correlation functions of the flexibility-rigidity index (FRI) used in previous work [112, 78]. The representations resulting from updating the kernel parameters through training the network are then referred to as network-enabled automatic representations (NEARs). To validate these NEARs, we implement them on RF and GBT models, leading to accurate predictions. We further show that ensemble learning can in some cases bolster our network predictions through consensus.

To test our AweGNN models, we employ four toxicity data sets of various sizes. Toxicity is a measure of the degree to which a chemical can harm an organism [110]. The harmful effects can be measured qualitatively or quantitatively. A qualitative approach only categorizes chemicals as toxic or nontoxic, while quantitative toxicity data records the minimal amount of a substance that would produce lethal effects. Experimental measurement of toxicity is expensive, time-consuming, and subject to ethical constraints. In this light, machine learning models are extremely useful in that they do not face these same challenges. The working principle of the machine learning approach for toxicity analysis is that molecules with similar structures have similar activities, which is called the quantitative structure-activity relationship (QSAR) approach. By analyzing the relationships between molecular structures, one can predict their biological functions. We can create AweGNN models that predict these toxicity endpoints without having to conduct any lab experiments [15, 96, 57, 48].

To further improve our AweGNN predictions in the presence of data sparsity [46], we construct multi-task (MT) AweGNNs. MT learning, or transfer learning [23], learns from related tasks to improve performance, on the smaller data sets in particular. It is frequently used to compensate for small data sets when similar data sets are at hand. We find that AweGNN models perform well on these data sets when we compare our results to state-of-the-art QSAR techniques, such as the ones that were pursued previously by our group [110, 81] and by others [64, 65], with particularly excellent performance on the larger data sets.

To further showcase our AweGNN, we apply our models to a different data set: a small data set concerning the solvation of molecules. Solvation free energy is a very important quantity in solvation analysis, which can be very beneficial in studying other complex or biological processes [63, 74, 105, 55]. This data set is relatively small compared to the toxicity data sets, but we found that the AweGNN was able to perform extremely well, beating all other methods in the literature [81]. This demonstrates that the AweGNN is useful for molecular problems that are not restricted to toxicity analysis. In addition, the concept of the AweGNN can be extended further to other kernel-based methods, in which we automate the kernel parameters as we train a network, such as with the differential geometry work done by Nguyen et al. [81]. Finally, we note that although we have not used multi-scale approaches in this work, we have seen them perform well in previous work [78, 81], so they are likely worth exploring in future work.

5.2 Theory and Methods

In this section, we will briefly review single and multi-task neural networks, then describe our AweGNN models.
We discuss the AweGNN through the frameworks of element-specific geometric graph representations, dynamic normalization functions, explicit derivative calculations, and instructions on how to update the parameters. Some variations on the generation of features are included even though these variations might not have been used to train the final models.

5.2.1 Neural Networks

Neural networks are predictive models that consist of layers made up of neurons, in which each neuron is connected to every neuron in the next layer of the layer sequence. A weight is associated with each connection that determines how much a neuron contributes to the input of the neuron in the next layer, and a bias term is introduced to shift the activations. Activation functions, such as the logistic sigmoid, grant the network a level of non-linearity for additional complexity. A feature vector, in which each feature corresponds to a neuron in the input layer, can be pushed through the layers of the network all the way to the output layer in a process called forward propagation. Neural networks can be trained as classification models that categorize data, or regression models that predict continuous quantities. In this work, we only develop regression models.

Neural networks are trained by updating the weights and biases of the network in a series of steps through gradient descent, in which an appropriate loss function is minimized. The process of calculating the errors of a neural network is called back propagation, in which derivatives are calculated in the backwards direction starting from the output layer, where the loss (or error) is calculated, and propagating backward through each layer all the way back to the input layer, after which the weights are updated by following the negative direction of the gradient. The gradient descent process finds a local minimum of the loss function when viewed with respect to the weights. This process fits the network to the data set in question.

A general neural network consists of an input layer, an output layer, and any number of hidden layers in between. Deep neural networks are characterized by having many more layers and neurons, allowing the network to be more complex. There are many additions to the neural network architecture that can be included for training a model. Activation functions bring non-linearity to a network, and a proper choice of activation function can have an effect on the training and performance of a network. Dropout [97] layers prevent over-fitting by dropping out random neurons during each step of training. Weight decay is another technique that regularizes the network to prevent over-fitting by decreasing the magnitude of each weight throughout the training. Batch normalization [43] layers normalize the outputs of each layer to speed up the training time, improve performance, and allow deep networks to be trained with stability by reducing internal covariate shift. Adaptive learning rates are learning rates that change during the training of the network, such as in a momentum approach to gradient descent that "gains speed as it descends down a valley," which can find local minima faster and sometimes has the ability to jump out of "undesirable" local minima. We use batch normalization, ReLU activation, and the AMSGrad [87] variant of the popular Adam [50] optimizer with an adaptive learning rate in our work.
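As a point of reference, the following sketch (illustrative layer sizes only; the actual architecture is described later in Section 5.2.9) shows how such a hidden block and the AMSGrad option look in PyTorch:

```python
import torch
from torch import nn

# A single hidden block of the kind discussed above: a linear map, batch
# normalization, then ReLU activation (the sizes here are placeholders).
block = nn.Sequential(nn.Linear(400, 400), nn.BatchNorm1d(400), nn.ReLU())

# AMSGrad is enabled through a flag on PyTorch's standard Adam optimizer.
optimizer = torch.optim.Adam(block.parameters(), lr=0.1, amsgrad=True)
```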
Dropout and weight decay increased the training time dramatically and so were not used in our final models, but they may be explored further in the future.

Single-task (ST) neural networks are networks trained on a single data set and have a single output (at least in the regression case). An ST network is trained through the minimization problem

\min_{W,b} \left( \alpha \cdot \|W\|_2^2 + \sum_{i=1}^{N} L\big(y_i, f(x_i; \{W, b\})\big) \right),   (5.1)

where || · ||_2 denotes the L2 norm, α represents the regularization constant, f is a function parametrized by the weights, W, and biases, b, that represents the output of the network, and x_i and y_i are the feature vector and the label of the ith data point, respectively, of a data set with N samples.

Single-task networks are, however, limited in their usefulness when being trained on small data sets. Multi-task (MT) networks have been developed to take advantage of large data sets to bolster the training of a network on a smaller data set. The idea is to train the models on multiple tasks simultaneously by sharing weights, thus highlighting relevant features that are important across all the related tasks. The details of their training are as follows. Let T be the number of tasks and let {(x_i^t, y_i^t)}_{i=1}^{N_t} be the training data for the tth task, where N_t is the number of samples of the tth task, x_i^t is the feature vector of the ith sample of the tth task, and y_i^t is the label for the ith sample of the tth task. The training strategy is to minimize the chosen loss function, L, for all tasks simultaneously. We define the loss for task t, L_t, as

L_t = \sum_{i=1}^{N_t} L\big(y_i^t, f^t(f^s(x_i^t; \{W_s, b_s\}); \{W_t, b_t\})\big),   (5.2)

where f^t is the part of the network that predicts the labels for the tth task, parametrized by weights, W_t, and bias, b_t; and f^s is the part of the network that has shared weights, W_s, and bias, b_s. To train all models simultaneously, we provide a single loss function to minimize, so our problem is then

\arg\min \left( \beta \cdot \|W_s\|_2^2 + \frac{1}{T} \sum_{t=1}^{T} \big( L_t + \alpha \cdot \|W_t\|_2^2 \big) \right),   (5.3)

where || · ||_2 denotes the L2 norm and α and β represent the regularization constants that set the magnitudes of the weight decay during training. In practice, if weight decay is used, we generally set α = β.

5.2.2 Overview of AweGNN

We wish to describe the concept of AweGNNs applied to molecular data sets. As in most machine learning endeavors, we seek to frame the task at hand as a minimization problem. To detail our supervised learning algorithm, we first start with a biomolecular data set, χ, where χ_i will represent the ith element of the training data set. We want a function F(χ_i; {η, κ}) that encodes the geometric and chemical information into an abstract representation parametrized by a set of parameters, {η, κ}. Then, the minimization problem becomes

\min_{\eta, \kappa, W, b} \left( \alpha \cdot \|W\|_2^2 + \sum_{i \in I} L\big(y_i, f(F(\chi_i; \{\eta, \kappa\}); \{W, b\})\big) \right),   (5.4)

where L is the chosen scalar loss function to be minimized, α is the regularization constant that determines the magnitude of the weight decay, and y_i is the label of the ith data point in our biomolecular data set. The function, f, is the function parametrized by the weights, W, and biases, b, that represents the output of the neural network. Notice the difference from the formulations in the previous section. The output of the function, F, on the data χ_i acts as the feature vector, x_i, given previously. The parameters, {η, κ}, that parametrize F are included in the minimization process.
This simultaneous update of the parameters of the representation function is the novel idea driving this work. The AweGNNs are trained for the regression task of predicting the toxicity endpoints and the solvation free energy of various small molecules, with the standard choice of mean squared error (MSE) as our loss function, L. The regularization constant, α, as noted before, is omitted in our work for both the single and multi-task models due to a lack of noticeable increase in performance, but with a moderate increase in training time. F will be based on element-specific calculations that are controlled by pairs of tunable kernel parameters that capture different geometric features of the molecules based on the values of the kernels chosen. Each element-specific group will be assigned a pair of these parameters, which will be updated throughout the training by back propagation. The representations generated by F after the training of the AweGNN are called network-enabled automatic representations (NEARs), and will be used later for training other models to demonstrate their performance. The procedure is summarized in Figure 5.1.

Figure 5.1: Pictorial visualization of the AweGNN training process. The first stage is splitting up the molecules in the data set into the element-specific groups, with initialized η and κ kernel parameters for each group. Then, a molecular representation is generated based on the kernel parameters, given by F in the diagram. This representation is fed into the neural network, then the kernel parameters are updated by back propagation through the neural network and through the representation function, F.

More specifically, we wish to understand how to back propagate through F. This will require us to, at each batch or epoch, back propagate throughout the entire network, then use that information to back propagate through F all the way to the kernel parameters that F depends on. In reality, F should be a composition of a function that outputs vector representations of biomolecules with the normalization of the output of that function. Thus, we can represent F as F = N ◦ R, where N is the normalization function and R is the raw representation function. The normalization function is important because it avoids instability in the network. When considering back propagation through the function, F, we have to keep in mind the chain rule, ∂F/∂x = (∂N/∂R)(∂R/∂x), as we calculate our partial derivatives. If we want to be able to calculate these derivatives, then we must make sure that N and R are differentiable functions. Some normalization functions will not be differentiable everywhere, but they will usually still be differentiable almost everywhere and continuous almost everywhere, so they are practical functions to use. Representation functions themselves can be very discontinuous and non-differentiable depending on the method of calculating features, so we must be especially careful in selecting the function, R.
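As a toy illustration of this idea (not the implementation used in this work), the sketch below registers η and κ as trainable PyTorch parameters inside a simplified, Lorentz-kernel-based representation layer, so that automatic differentiation carries gradients from the loss through F = N ◦ R back to the kernel parameters; the group count, layer sizes, and distance values are hypothetical:

```python
import torch
from torch import nn

class ToyGGR(nn.Module):
    """Toy representation layer: one kernel-weighted feature per element-specific group."""
    def __init__(self, n_groups):
        super().__init__()
        # Kernel parameters registered as trainable tensors (one eta, kappa per group).
        self.eta = nn.Parameter(torch.full((n_groups,), 3.0))
        self.kappa = nn.Parameter(torch.full((n_groups,), 6.0))

    def forward(self, dists):
        # dists: (batch, n_groups, n_pairs) pairwise distances per group.
        phi = 1.0 / (1.0 + (dists / self.eta[None, :, None]) ** self.kappa[None, :, None])
        return phi.sum(dim=-1)                      # sum of kernel weights per group

rep = ToyGGR(n_groups=4)                            # R
net = nn.Sequential(nn.BatchNorm1d(4), nn.Linear(4, 1))   # N followed by a tiny network
dists = torch.rand(8, 4, 10) * 10.0                 # hypothetical distances
loss = ((net(rep(dists)) - torch.zeros(8, 1)) ** 2).mean()
loss.backward()                                     # fills rep.eta.grad and rep.kappa.grad
```

After loss.backward(), rep.eta.grad and rep.kappa.grad hold the kernel-parameter gradients, so a single optimizer step updates them together with the network weights.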
We would also like both the representation function and the normalization function to be easy to calculate, since we must calculate the normalized features at each epoch of training. We will later describe in detail the geometric graph representation function, R, that we used for our AweGNN, which is continuous and differentiable everywhere, and easy to calculate.

We will be testing both single-task (ST) and multi-task (MT) AweGNN models that follow much the same structure, again as outlined in Figure 5.1. The ST model takes batches from a data set, generates a representation, then normalizes it with batch normalization before feeding it into the artificial neural network portion of the AweGNN and updating the parameters as above. The MT-AweGNN consists of a common set of kernel parameters that are used for all 4 toxicity data sets, shared weights for the hidden layers of the artificial neural network portion, but separate weights for the 4 output layers. The normalization is also done in batches, by placing a number of data points into the batch from each data set proportional to the size of that data set with respect to all the data sets combined. For example, if the batch size is 100, and there are data sets of size 200, 400, 600, and 800, then each batch would have 10 samples from the first data set, 20 from the second, 30 from the third, and 40 from the last data set. This MT structure is illustrated in Figure 5.2.

Figure 5.2: Diagram depicting the structure and automation of the MT-AweGNN. The model is trained by collecting samples from each data set proportionally and then generating their representations according to shared parameters. The representations are normalized together, then pushed through the artificial neural network and then to their corresponding outputs. Back propagation goes through the feature matrix, normalization function, and finally the representation function to update the kernel parameters.
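The proportional mini-batch construction described above can be sketched as follows (a minimal example with hypothetical data set sizes, not the code used in this work):

```python
import random

def proportional_batch(datasets, batch_size=100):
    """Draw a mini-batch with a number of samples from each data set
    proportional to that data set's share of all samples combined."""
    total = sum(len(d) for d in datasets)
    batch = []
    for t, d in enumerate(datasets):
        n = round(batch_size * len(d) / total)      # sizes 200/400/600/800 -> 10/20/30/40
        batch.extend((t, i) for i in random.sample(range(len(d)), n))
    return batch                                     # list of (task index, sample index)

# Hypothetical data sets, represented here only by their sizes.
datasets = [list(range(200)), list(range(400)), list(range(600)), list(range(800))]
batch = proportional_batch(datasets)
```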
The normalization procedure is similar to batch normalization in neural networks, except that the affine transformation is omitted, and the expression for ∂N/∂R can be found in the paper by Ioffe and Szegedy [43], along with the rest of the details for batch normalization. Although the above procedure omits the affine transformations, we chose to apply the affine transformations in our work because the models performed well with them. In practical terms, the PyTorch [82] library is utilized for applying this batch normalization. A BatchNorm1d layer is placed in front of the molecular representation layer with parameters eps=1e-05, momentum=0.1, affine=True, and track_running_stats=True, which are the default settings in PyTorch [82].

5.2.4 Biomolecular Geometric Graph Representations

We seek a general formulation of a geometric graph representation of biomolecular structures from which we can extract features to create our sought-after representation function, R. We use this method to generate descriptors for the toxicity analysis of small molecules, but the idea extends beyond this setting quite naturally to provide useful representations. In our work, we want to use a weighted and vertex-labeled graph from which we can generate descriptors using special subgraphs of the graph representation of each molecular structure.

We begin by looking for a set, T, of element types that are suitable for our analysis. Generally, the element types will be chosen by looking at the commonly occurring element types found in a given data set, or we might choose to omit certain element types to avoid elements that have uncertain positioning or negligible influence in molecular interactions. Usually, we will have at least C, N, O, S ∈ T, but in most cases we will also include H, P, Cl, and Br. We define

V = \{ (r_j, \alpha_j) \in \mathbb{R}^3 \times T \mid r_j \in \mathbb{R}^3;\ \alpha_j \in T;\ j = 1, 2, ..., N \}

to be the vertex set, in which N denotes the number of atoms in the molecule that are of an element type that is a member of T. This vertex set has a labeling that describes the atom coordinates and the element type of the atom at each vertex. We will use these labels to determine element-specific subgraphs from which to extract features. Our edge set, E, is described by a choice of parameters, {η_{kk'}} and {κ_{kk'}}, and a choice of kernel function, Φ : R → R, which is parametrized by them:

E = \{ \Phi(\|r_i - r_j\|; \eta_{kk'}, \kappa_{kk'}) \mid \alpha_i = k \in T,\ \alpha_j = k' \in T;\ i, j = 1, 2, ..., N;\ \|r_i - r_j\| > v_i + v_j + \sigma \},

where || · || denotes the Euclidean distance, v_i and v_j denote the Van der Waals radii of the ith and jth atoms respectively, r_i and r_j represent the ith and jth atom coordinates, and σ is the average of the standard deviations of v_i and v_j in the data set. Notice that the distance constraint eliminates covalent interactions from being used in our calculations. As noted in Nguyen et al., many biomolecular properties are known to be determined by non-covalent interactions, so we remove the covalent interactions because we do not want their contribution to overwhelm or otherwise interfere with the contributions of the non-covalent interactions. The parameters, η_{kk'} and κ_{kk'}, are values tied to the element-specific pair, kk', and can be tied to the properties of the element types in question.
The types of kernels that determine the weights of our edges are always chosen to be decreasing functions with the properties

\Phi(x) \to 1 \text{ as } x \to 0   (5.5)

and

\Phi(x) \to 0 \text{ as } x \to \infty.   (5.6)

This characterization gives the property that the closest atoms will contribute the most, and atoms that are too far away will contribute almost nothing. Most radial basis functions can be used, but the most popular choices of kernel type are the generalized exponential and the generalized Lorentz functions, given by

\Phi_E(x; \eta, \kappa) = e^{-(x/\eta)^{\kappa}}   (5.7)

and

\Phi_L(x; \eta, \kappa) = \frac{1}{1 + (x/\eta)^{\kappa}},   (5.8)

respectively. These kernel types have been used to create successful models for protein-ligand binding prediction, predicting toxicity endpoints, and many other applications. They act as low-pass filters that capture the most important element interactions, where the η value determines the cutoff point and the κ value determines the sharpness of the cutoff. In summary, we have procured a vertex-labeled, weighted graph, G(V, E), from which we can extract features via special element-specific subgraphs.

Let G_{kk'} be the subgraph of G whose vertex set, V_{kk'}, contains the vertices that are labeled with element type k or k', and whose edge set, E_{kk'}, contains only the edges connecting a vertex labeled with element type k to a vertex labeled with element type k'. This is called the element-specific subgraph generated by the element pair kk'. For a given element-specific subgraph, G_{kk'}, we define the following descriptor associated with that subgraph:

\mu_{G_{kk'}, \Phi} = \sum_{(i,j) \in E_{kk'}} w_j \cdot \Phi(\|r_i - r_j\|; \eta_{kk'}, \kappa_{kk'}),   (5.9)

where the w_j term is usually a constant 1 or a value associated with the jth atom, such as a partial charge. The η_{kk'} and κ_{kk'} values represent the respective η and κ values for the kk' group, while r_i and r_j represent the atom coordinates as before. We get a sum over the weights of the subgraph, in which additional atom-specific weights can be applied to enhance the meaning of the feature. This abstractly defines the total strength of the non-covalent interactions between the atoms of element type k and the atoms of element type k'. This is illustrated in Figure 5.3.

Figure 5.3: This shows an element-specific subgraph corresponding to the element types C and O of aminopropanoic acid, C3H6NO2 (with the hydrogens omitted). The dashed orange edges represent the edges of the subgraph that are weighted by the chosen kernel function, Φ, while the solid blue lines represent covalent bonds. Notice the 3rd carbon atom does not connect to the oxygen atoms since it is covalently bonded to both of them.

With the descriptors detailed above, we can describe a representation function by obtaining one or more features from each element-specific subgraph. We can extract multiple features per subgraph by considering different statistics of our features, such as the mean of the sum above, the standard deviation of the terms in the sum, choosing various values of w_j above according to chemical or physical properties of specific atoms, etc. In summary, the features that we generate using this approach measure the aggregate interaction strength between groups of atoms of specific element types, which we call element-specific groups. This gives us a representation of the internal structure of a molecule that fits into the QSAR paradigm, which promotes the idea that similar molecular structures have similar physical properties.
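A minimal sketch of the descriptor in Eq. (5.9) is given below, assuming the Lorentz kernel and an illustrative table of Van der Waals radii; the radius values, σ, and atom data are hypothetical and only serve to show the covalent-exclusion cutoff and the kernel-weighted sum:

```python
import numpy as np

def lorentz(r, eta, kappa):
    return 1.0 / (1.0 + (r / eta) ** kappa)

# Illustrative Van der Waals radii in Angstroms (approximate, for this sketch only).
VDW = {"H": 1.2, "C": 1.7, "N": 1.55, "O": 1.52}

def group_descriptor(atoms, pair, eta, kappa, sigma=0.1, weights=None):
    """Sum kernel weights over edges between element types k and k',
    excluding pairs close enough to be covalently bonded (cf. Eq. 5.9)."""
    k, kp = pair
    A = [(np.array(r), t) for r, t in atoms if t == k]
    B = [(np.array(r), t) for r, t in atoms if t == kp]
    mu = 0.0
    for ri, ti in A:
        for j, (rj, tj) in enumerate(B):
            d = np.linalg.norm(ri - rj)
            if d > VDW[ti] + VDW[tj] + sigma:        # keep non-covalent interactions only
                w = 1.0 if weights is None else weights[j]
                mu += w * lorentz(d, eta, kappa)
    return mu

# Hypothetical atoms: (coordinates, element type).
atoms = [((0.0, 0.0, 0.0), "C"), ((3.1, 0.2, 0.0), "O"),
         ((1.4, 1.2, 0.3), "C"), ((4.0, 2.2, 1.1), "O")]
print(group_descriptor(atoms, ("C", "O"), eta=3.0, kappa=5.0))
```

Repeating this over all element pairs, with several statistics or weight choices per pair, yields the descriptor layout described next.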
The representation function used in our work for the toxicity data sets, which consist of small molecules, considers the element type set T = {H, C, N, O, S, P, F, Cl, Br, I}. For each element-specific pair, we calculate 4 descriptors: the first is calculated with each w_j = 1, the next with the w_j set equal to the partial charge of the jth atom, and the remaining 2 descriptors are generated by dividing the first 2 descriptors by the number of edges in the element-specific group, giving us a measure of the mean interaction of the elements. Since we have 10 different element types, we can form 100 element-specific groups; and since we have 4 descriptors per element-specific group, we have 400 descriptors for every set of parameters and choice of kernel. In our work, we only use the Lorentz functions, Φ^L_{η,κ}, for our feature generation and results.

5.2.5 Parameter Adjustment and Initialization

When considering our molecular representation function in the context of the parameter automation task, we can make a few choices in terms of the parameters that we will be adjusting at each epoch or batch. Below are 3 possible avenues to consider:

η-adjustable. Here, we set the η values for each element-specific pair randomly within some range. We note that the assignments made for each element-specific pair are common to all of the biomolecular structures in the data set. We can combine several sets of features to make multi-scale models that would have multiple initializations of the η values, but we will be assuming a 1-scale model at present. If we set the η values this way, these will be the parameters that we update in our gradient descent.

τ-adjustable. For an element-specific pair, kk', we set the characteristic value to be η_{kk'} = τ(v_k + v_{k'}), where v_k is the Van der Waals radius of element type k and v_{k'} is the Van der Waals radius of element type k'. Similar to the previous situation, the τ value assigned determines the element-specific η values for the entire data set. Again, if we consider multi-scale models, then we may choose different values of τ for each set. In the τ-adjustable case, we seek to update the τ value at each step of the training.

κ-adjustable. In the previous two situations, we were assuming a fixed κ value, but we can also update this parameter in conjunction with the others. There are two ways to introduce this technique. The first way is to introduce one κ and update it with respect to all of the element-specific groups. The second way is to introduce a κ value for each element-specific group and update those separately. Multi-scale models are handled in a similar way as above.

Aside from choosing which parameters to update throughout the training of the AweGNNs, we must also consider how we want to initialize each parameter before training. We notice that τ, κ, and η all have to take values greater than zero to maintain continuity and/or satisfy the conditions of a radial basis function. Also, we want to make certain that the values are not too large. A large η value will capture all of the atoms in a molecule, and the kernel value will be effectively a constant 1 at all measured distances. A large κ value will make the kernel function approach the form of an ideal low-pass filter, where it is either a constant 0 or a constant 1. In both cases, the magnitude of the derivatives will be extremely small and the model will not be able to update effectively.
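To make this concern concrete, the sketch below (illustrative parameter values only) evaluates the Lorentz kernel and a finite-difference slope in η over typical interatomic distances; a very large η leaves the kernel near a constant 1, and a very large κ produces a nearly ideal step filter, in both cases flattening the derivatives that the training relies on:

```python
import numpy as np

def lorentz(r, eta, kappa):
    return 1.0 / (1.0 + (r / eta) ** kappa)

r = np.linspace(0.5, 12.0, 6)                  # typical interatomic distances in Angstroms

moderate = lorentz(r, eta=3.0, kappa=6.0)      # smooth low-pass profile
huge_eta = lorentz(r, eta=100.0, kappa=6.0)    # ~1 everywhere: no useful signal
huge_kappa = lorentz(r, eta=3.0, kappa=200.0)  # nearly an ideal step filter

def d_eta(r, eta, kappa, h=1e-4):
    """Central finite-difference slope of the kernel with respect to eta."""
    return (lorentz(r, eta + h, kappa) - lorentz(r, eta - h, kappa)) / (2 * h)

print(d_eta(r, 3.0, 6.0).round(4))     # moderate, non-vanishing slopes
print(d_eta(r, 100.0, 6.0).round(4))   # essentially zero: the parameter cannot update
```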
In our experimentation, we use 100 η-adjustable and 100 κ-adjustable parameters per scale (1 η and 1 κ value per element-specific group), giving us 200 total adjustable variables per scale for our molecular representation function. The η-initialization for each scale is chosen by drawing a single random τ value from a uniform distribution with a range of 0.5 to 1.5, from which all the η values are set according to the Van der Waals radii as described above. The κ-initialization for each scale is done by choosing a value of κ for each element-specific group from a uniform distribution with a range of 5 to 8.

The adjustment of parameters throughout the training of our AweGNN acts like an evolving kernel function at each element-specific group that attempts to create an optimal filter providing the best representation for the network. The η values determine the center of the cutoff region and the κ values control the sharpness of the cutoff. In Figure 5.4, we see the evolution of the kernel function of an MT-AweGNN model, where the kernel functions are graphed at various numbers of epochs of training, so that the transformation of the kernel function can be clearly seen.

Figure 5.4: Evolution of the kernel function of the C-N group for an MT-AweGNN. The picture shows a series of kernel functions that were used to evaluate the 4 features corresponding to the C-N element-specific group at different points in the training of an MT-AweGNN. We choose specific snapshots during training that show a smooth change in the kernel function as the η-κ pair is updated.

5.2.6 Derivatives of the Representation Function

Our main goal is to update the parameters of our feature representation during the training of our neural network. We must first show how to calculate the derivatives of the representation functions with respect to the κ, η, and τ parameters. We will then use these derivatives to show how to complete the back propagation all the way to the parameters. We begin with a representation function, R : R^n → R^{d×m}, and focus on a given element-specific subgraph; this could be a single or multi-scale representation function. Suppose that ν is the set of updatable parameters for R. Then we can represent R by

R(\nu)_{i,j} = \sum_{(a,b) \in G} q_b \cdot \Phi(\|r_a - r_b\|; \nu_G),   (5.10)

where G is the element-specific subgraph which is used in the calculation of the jth descriptor of the ith sample, q_b represents a value that may be associated with the atom labeled b, r_a and r_b represent the coordinates of the atoms labeled a and b respectively, and ν_G is the subset of ν that corresponds to the calculations associated with G. Now, since differential operators are linear and our features are calculated by a summation of differentiable functions, we need only know how to calculate the derivative of each term, i.e., for any variable, x, we have

\frac{\partial R(\nu)_{i,j}}{\partial x} = \sum_{(a,b) \in G} \frac{\partial}{\partial x} \big( q_b \cdot \Phi(\|r_a - r_b\|; \nu_G) \big).   (5.11)

Thus, we want to find the derivatives of the functions Φ^E_{η,κ} and Φ^L_{η,κ} with respect to the parameters η, τ, and κ. We start by finding the derivative with respect to η.
The derivative of the generalized exponential function is

\frac{\partial \Phi^E_{\eta,\kappa}(r)}{\partial \eta} = \frac{\kappa}{\eta} \left( \frac{r}{\eta} \right)^{\kappa} e^{-(r/\eta)^{\kappa}},   (5.12)

and for the generalized Lorentz function, we have the derivative

\frac{\partial \Phi^L_{\eta,\kappa}(r)}{\partial \eta} = \frac{\frac{\kappa}{\eta} \left( \frac{r}{\eta} \right)^{\kappa}}{\left( 1 + (r/\eta)^{\kappa} \right)^2}.   (5.13)

We also calculate the derivatives with respect to the κ parameter, starting with the generalized exponential function:

\frac{\partial \Phi^E_{\eta,\kappa}(r)}{\partial \kappa} = \left( \frac{r}{\eta} \right)^{\kappa} \ln\!\left( \frac{\eta}{r} \right) e^{-(r/\eta)^{\kappa}},   (5.14)

and for the generalized Lorentz function, we have the derivative

\frac{\partial \Phi^L_{\eta,\kappa}(r)}{\partial \kappa} = \frac{\left( \frac{r}{\eta} \right)^{\kappa} \ln\!\left( \frac{\eta}{r} \right)}{\left( 1 + (r/\eta)^{\kappa} \right)^2}.   (5.15)

Now, for the derivative with respect to the τ parameter, we can just apply the chain rule with η = τ(v_1 + v_2) being a function of τ. Note that ∂η/∂τ = (v_1 + v_2) = τ(v_1 + v_2)/τ = η/τ. Then, for the exponential kernel, we have

\frac{\partial \Phi^E_{\eta,\kappa}}{\partial \tau} = \frac{\partial \eta}{\partial \tau} \cdot \frac{\partial \Phi^E_{\eta,\kappa}}{\partial \eta} = \frac{\eta}{\tau} \cdot \frac{\kappa}{\eta} \left( \frac{r}{\eta} \right)^{\kappa} e^{-(r/\eta)^{\kappa}} = \frac{\kappa}{\tau} \left( \frac{r}{\eta} \right)^{\kappa} e^{-(r/\eta)^{\kappa}},   (5.16)

and for the Lorentz kernel, we have

\frac{\partial \Phi^L_{\eta,\kappa}}{\partial \tau} = \frac{\partial \eta}{\partial \tau} \cdot \frac{\partial \Phi^L_{\eta,\kappa}}{\partial \eta} = \frac{\frac{\eta}{\tau} \cdot \frac{\kappa}{\eta} \left( \frac{r}{\eta} \right)^{\kappa}}{\left( 1 + (r/\eta)^{\kappa} \right)^2} = \frac{\frac{\kappa}{\tau} \left( \frac{r}{\eta} \right)^{\kappa}}{\left( 1 + (r/\eta)^{\kappa} \right)^2}.   (5.17)

Thus, we now know the derivatives ∂R/∂η, ∂R/∂τ, and ∂R/∂κ, and we can use them to calculate the gradient for our gradient descent when training our neural network.

5.2.7 How to Update the Parameters

Now that we have the derivatives of the representation function, we will be able to calculate the derivatives of the loss function that we have chosen, L, with respect to our parameters, η, τ, and κ. We must remember that in our process of feature calculation, we have decided to normalize the features. We chose to use the standard normalization, N_σ, and we denoted the normalized representation function by F = N ◦ R, with R being the un-normalized representation function. By the chain rule, we obtain ∂F/∂x = (∂N/∂R)(∂R/∂x), where x = η, κ, or τ. We have also discussed what ∂N/∂R looks like, so in fact we should now have a full description of the derivatives of the normalized representation function with respect to our parameters.

Now, to get our final update rule, we have to extend the back propagation of our neural network to the features, obtaining the derivatives ∂L/∂F. Then the derivatives that we calculated earlier, ∂F/∂x, will be used in the chain rule again to get ∂L/∂x = (∂L/∂F)(∂F/∂x), where x = η, κ, or τ. This gives us the update rule for our parameters. In more detail, the update rule actually looks like ∂L/∂x = Σ_{i,j} (∂L/∂F_{ij})(∂F_{ij}/∂x). Remember that the τ, κ, and η parameters are all constrained to be greater than zero. As a precaution, we propose that after every epoch the parameters be clipped so that they are constrained to values of 0.01 or higher, to avoid forbidden values.

5.2.8 Multi-scale Models

Multi-scale models are models that are trained on features generated by multiple sets of parameters. More specifically, for 2 representation functions, R_1 : R^n → R^{d×m} and R_2 : R^n → R^{d×m}, we can get the 2-scale representation function, [R_1, R_2] : R^n → R^{d×2m}, by concatenating the outputs of R_1 and R_2 so that the features of each data point match up. We can extend this idea of a 2-scale model to any scale. Let R_1 : R^n → R^{d×m}, ..., R_k : R^n → R^{d×m} be representation functions with the same parameter types. Then, combining the functions together one at a time as above, we get a new multi-scale representation function, [R_1, R_2, ..., R_k] : R^n → R^{d×(k·m)}. These multi-scale methods have been shown to be very effective in improving the performance of single-scale models [81, 78].
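As a quick illustration of this concatenation (a minimal sketch with random placeholder features, not code from this work):

```python
import numpy as np

d, m = 5, 400                       # d samples, m descriptors per scale (illustrative)
R1 = np.random.rand(d, m)           # features from kernel choice (K1, eta1, kappa1)
R2 = np.random.rand(d, m)           # features from kernel choice (K2, eta2, kappa2)

# The 2-scale representation [R1, R2]: rows (data points) are matched up and the
# feature columns are concatenated, giving a (d x 2m) feature matrix.
R_multiscale = np.concatenate([R1, R2], axis=1)
assert R_multiscale.shape == (d, 2 * m)
```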
5.2.9 Model Architectures and Hyper-parameters

We now wish to describe the specific network architecture of the AweGNN model. For the artificial neural network (ANN) portion of the AweGNN, we choose a very simple 4-layer network with 400 neurons in each of the first two hidden layers and 20 neurons in each of the last two hidden layers. The ANN architecture and parameters were chosen through a quick parameter search on the multi-task network, where the models were tested on a randomized validation set of 10% of the training data of each toxicity data set to obtain good average predictions across all 4 data sets, and were modified to ensure relative convergence of the kernel parameters. The convergence assures that we have stable choices for the parameters of the kernel function, which we use for later analysis. Dropout and weight decay are omitted since they did not notably increase performance, but increased the training time significantly. The AMSGrad [87] variant of the Adam optimizer is used because it has been shown to improve the convergence of the kernel parameters while still achieving large gradients that allow the model to explore a larger parameter space, and it allows a more interesting analysis of the parameter trajectories.

A large learning rate of 0.1 is also applied, with a learning rate decay of 0.999 per epoch of training, for a total of 2000 epochs, leaving the final learning rate at a value of approximately 0.01352 at the end of the training. The decay gives the network and kernel parameters an opportunity to settle down in a local minimum, while the large learning rate gives the models a chance to explore more parameter choices while also regularizing the network. The network was further improved by applying batch normalization, which is known to speed up the training of a network and provides an additional source of regularization that reduces the need for dropout [43]. Finally, the batch size used was 100 for the ST models, while the MT models used a total of 40 batches per epoch, with sizes as even as possible and the number of samples from each data set proportional to its respective size.

For all of the network models, we use the PyTorch [82] API in conjunction with Cython [17] and Numba [52] to speed up the calculation of the geometric graph descriptors. All hidden layers are actually composed of 3 different layers: a linear unit, followed by a batch normalization layer with affine transformations active and momentum set to 0.1 (the default PyTorch setting), then a ReLU activation. The input layer is constructed with 2 layers: a Geometric Graph Representation (GGR) layer followed by a batch normalization layer, again with affine transformations and the same momentum setting. The GGR layer is the representation function, R, and the batch normalization layer is the N_σ normalization function, as described in detail previously, generating our normalized representation function, F = N_σ ◦ R. Finally, the output layer is simply a linear activation into either 1 or 4 outputs, depending on whether we are using a single-task or multi-task model. The details of the full AweGNN architecture are illustrated in Figure 5.5.

Figure 5.5: Internal architecture of our auto-parametrized weighted element-specific graph neural network (AweGNN) model for the toxicity data sets.
The input layer consists of our novel Geometric Graph Representation (GGR) layer and a batch norm with no affine transformation. Hidden layers include a linear transformation followed by regular batch normalization and ReLU activation, while the output layer has the standard linear activation.

As discussed earlier, the network-enabled automatic representations (NEARs) generated from training an AweGNN can be used as inputs for ensemble models. By examining the performance of these models, we can see whether the representations generated by AweGNNs produce meaningful features for a broad range of machine learning models. We use two ensemble models, a random forest regressor (RF) and a gradient boosting regressor (GBT), from the scikit-learn v0.23.2 package [83]. A search for reasonable parameter choices was made on the same validation sets as above. For the RF models, we use the parameters n_estimators = 8000, max_depth = 27, min_samples_split = 3, and max_features = sqrt, with any remaining parameters set to the default. For the GBT models, we use the same parameters as the RF models above along with the additional parameters loss = 'ls', learning_rate = 0.1, and subsample = 0.2, with any remaining parameters set to the default choice.

5.3 Results

In this section, we give a description of the different quantitative toxicity data sets, and then we compare the results of our best models to the models generated by other published methods [64, 65, 110, 81]. We note that from the work by Nguyen et al. [81], we only report the results of the one consensus model, trained and tested on the Tetrahymena pyriformis IGC50 data set, that performed the best, which is labeled Nguyen-best. For the work done by Wu et al. [110], we only report the results for the models that utilized all the descriptors detailed in that paper. The results of the random forest models are referred to as Wu-RF, the gradient boosting models are referred to as Wu-GBT, the ST models are Wu-ST, the MT models are Wu-MT, and the consensus of Wu-GBT and Wu-MT is Wu-consensus. Except in the case of the Daphnia magna LC50, the best scoring Wu models were those that relied on all descriptors, but any such important detail will be mentioned when relevant.

We also introduce a solvation data set, Model III, as in the paper by Nguyen et al. [81]. The performance of our AweGNN models, when trained on this data set, is compared to those of the differential geometry based models developed earlier [81] and other state-of-the-art methods [106, 104]. The models in this particular section are named as they are in the previously referenced paper [81], where the names refer to various choices of kernel options and kernel parameters used in the calculation of differential geometric descriptors. As for our own models, we report them in many ways and in many combinations.

Figure 5.6: Pictorial representation of the overall strategy. The arrows show the flow of the processes of training, extracting representations from trained AweGNNs, and making predictions.
Blue corresponds to molecular data sets, light green corresponds to machine learning models, orange corresponds to molecular representations, and dark green corresponds to predictions.

We train multiple models for each data set and each model type, and we then combine those models by averaging their predictions to get one consensus prediction. This is done to reduce variance from random weight initializations and to improve our final results. We also combine the predictions of different model types in consensus to see if those will possibly yield better results, as we see in Wu [110]. We note that models were trained in sets of 42 instead of sets of 40 or 50 models due to the constraints of our computing architecture. If there are computational concerns, we could also significantly reduce the number of models trained (perhaps to 5 or 10 models) and still obtain similar results.

For each data set, we train 42 ST and MT AweGNNs with randomized initial weights and obtain a consensus prediction for both types, labeled ST-network and MT-network respectively. Then, we train one ensemble model for each of the NEARs generated by training the networks. Thus, for random forest, we train 42 models based on the NEARs resulting from the ST models and 42 more on the NEARs resulting from the MT models, from which we get the corresponding consensus predictions, ST-RF and MT-RF. Similarly, we get predictions from the GBT models, ST-GBT and MT-GBT. Finally, we obtain various consensus predictions by combining the predictions mentioned above. First, we average the predictions ST-network and MT-network to get the consensus, network-consensus. Next, we combine ST-network and ST-GBT to get ST-consensus. Similarly, we obtain MT-consensus by combining MT-network and MT-GBT. Finally, we get the prediction final-consensus by taking the consensus of ST-consensus and MT-consensus. These predictions and their performances on the test sets are recorded in the tables in Section 5.3.2, and a summary of this entire process is depicted in Figure 5.6.

5.3.1 Evaluation Metrics

A protocol was proposed by Golbraikh et al. [38] to determine if a QSAR model has true predictive power, as summarized in the following four points:

1. q^2 > 0.5
2. R^2 > 0.6
3. (R^2 − R_0^2)/R^2 < 0.1
4. 0.85 ≤ k ≤ 1.15

Here, q^2 is the square of the leave-one-out correlation coefficient for the training set; R^2 is the square of the Pearson correlation coefficient between the experimental and predicted toxicity endpoints of the test set; and R_0^2 is the square of the correlation coefficient between the experimental and predicted toxicity for the test set when the intercept is set to zero, i.e., for a linear regression of the experimental labels, Y, on the predicted labels, X, of the form Y = kX (from which the k value is also obtained).

For each numerical experiment involving toxicity, we record the metrics above, although we omit the q^2 coefficient due to the computational cost. We also record the root-mean-squared error (RMSE) and the mean absolute error (MAE), since these are standard metrics for evaluating the performance of a model. The final metric to report is coverage, or the fraction of chemicals predicted, which is important because a lower coverage can unfairly improve prediction accuracy. For the solvation data set, we only report the MAE, RMSE, and the R^2 score, as in Nguyen et al. [81].

5.3.2 Toxicity Data Sets

This work analyzes 4 different quantitative toxicity data sets [64].
These are the 96h fathead minnow LC50 data set, the 48h Daphnia magna LC50-DM data set, the 40h Tetrahymena pyriformis IGC50 data set, and the oral rat LD50 data set. The LC50 (LC50-DM) set reports the concentration in milligrams per liter of test chemicals placed in water that it takes to cause 50% of fathead minnows (Daphnia magna) to die after 96 (48) hours. These were downloaded from the ECOTOX aquatic toxicity database via the web site http://cfpub.epa.gov/ecotox/ and were preprocessed using filter criteria including media type, test location, etc. The IGC50 data set measures the 50% growth inhibitory concentration of the Tetrahymena pyriformis organism after 40 hours and was obtained by Schultz and co-workers [3, 113]. The last data set, LD50, represents the concentration of chemicals that kill half of rats when orally ingested. This data set was constructed from the ChemIDplus database (http://chem.sis.nlm.nih.gov/chemidplus/chemidheavy.jsp) and then filtered according to several criteria [64].

The final sets in this work are identical to the sets preprocessed to develop the Toxicity Estimation Software Tool (TEST) [64]. The 2D sdf format molecular structures and toxicity endpoints are available on the TEST website. 3D mol2 format molecular structures were created with the Schrödinger software in an earlier work [110]. We should note that the units of the toxicity endpoints are not uniform between the data sets. The LD50 set endpoints are in -log10(T mol/kg), while the endpoints of the remaining sets are in units of -log10(T mol/L). No attempt has been made to rescale the values. Finally, the data sets all have differing sizes and compositions, and so the effectiveness of our method varies greatly between them.

Statistics of each data set are detailed in Table 5.1. Numbers inside parentheses indicate the actual number of molecules that were used for training and evaluating models. The first 3 data sets include all available molecules, but for the last data set (LD50), some molecules were dropped due to force field failures when applying the Schrödinger software. Despite this, our coverage is greater than that of any of the TEST models, and so our method is more widely applicable.

Table 5.1: Set statistics for quantitative toxicity data

data set  | # of molecules | train set size | test set size | max value | min value
LC50-DM   | 353            | 283            | 70            | 10.064    | 0.117
LC50      | 823            | 659            | 164           | 9.261     | 0.037
IGC50     | 1792           | 1434           | 358           | 6.36      | 0.334
LD50      | 7413 (7398)    | 5931 (5919)    | 1482 (1479)   | 7.201     | 0.291

5.3.2.1 LC50-DM (Daphnia Magna) Set

The Daphnia magna LC50 set is the smallest data set, with 283 molecules in the training set and 70 molecules in the test set. Given the small size of the data set, it can be difficult to train robust QSAR models; thus, multi-task models are extremely important for obtaining reasonable results. Table 5.2 shows the results of various QSAR models on the LC50-DM data set. The TEST consensus had the highest R^2 score (R^2 = 0.739) out of all the models shown, although Wu et al. [110] reported a much higher R^2 score of 0.788, which does not appear in the table, when using only topological descriptors for a multi-task model. This high result with fewer descriptors may be due to the tendency of neural networks to overfit when trained on small data sets with many descriptors. When comparing the result of Wu-MT as reported in the table (using all descriptors) to the TEST consensus, we notice that although the R^2 score was lower, the RMSE and MAE are better.
The group contribution model scored exceptionally well in RMSE and MAE, but doubt was cast on the accuracy of those results[110]. The performance of our ensemble models was fairly poor in comparison to the results from Wu. Our RF-ST and RF-MT models scored 0.439 and 0.443, respectively, vs. the 0.460 R² score achieved by Wu-RF. Our MT-GBT model performed a bit better than the Wu-RF model, but it could not beat the Wu-GBT model. We do see, however, that our MT-GBT model noticeably outperformed the ST-GBT model, going from an R² score of 0.457 to 0.471, showing a slight benefit from using the MT-NEAR for GBT models. As for the network models, we see a decent showing for our ST-network model vs. the Wu-ST model. Although ST-network has an R² score of 0.448 vs. 0.459 for Wu-ST, the ST-network RMSE of 1.315 quite decisively beats the RMSE of 1.407 for Wu-ST. Our MT-network model unfortunately performed much worse than the Wu-MT model, even though its results are comparable with many of the TEST models reported. The Wu-MT model R² score of 0.726 is significantly higher than the score of 0.664 from MT-network. Our other models do not perform well enough to comment on.

Table 5.2: Comparison of prediction results for the LC50-DM test set
model | R² | (R²−R₀²)/R² | k | RMSE | MAE | coverage
Results with TEST models
hierarchical[64] | 0.695 | 0.151 | 0.981 | 0.979 | 0.757 | 0.886
single model[64] | 0.697 | 0.152 | 1.002 | 0.993 | 0.772 | 0.871
FDA[64] | 0.565 | 0.257 | 0.987 | 1.190 | 0.909 | 0.900
group contribution[64] | 0.671 | 0.049 | 0.999 | 0.803 | 0.620 | 0.657
nearest neighbor[64] | 0.733 | 0.014 | 1.015 | 0.975 | 0.745 | 0.871
TEST consensus[64] | 0.739 | 0.118 | 1.001 | 0.911 | 0.727 | 0.900
Results with previous methods from our group
Wu-RF[110] | 0.460 | 1.244 | 0.955 | 1.274 | 0.958 | 1.000
Wu-GBT[110] | 0.505 | 0.448 | 0.961 | 1.235 | 0.905 | 1.000
Wu-ST[110] | 0.459 | 0.278 | 0.933 | 1.407 | 1.004 | 1.000
Wu-MT[110] | 0.726 | 0.003 | 1.017 | 0.905 | 0.590 | 1.000
Wu-consensus[110] | 0.678 | 0.282 | 0.953 | 0.978 | 0.714 | 1.000
Ensemble models
ST-RF | 0.439 | 0.014 | 0.956 | 1.312 | 0.983 | 1.000
ST-GBT | 0.457 | 0.004 | 0.966 | 1.280 | 0.954 | 1.000
MT-RF | 0.443 | 0.012 | 0.960 | 1.304 | 0.985 | 1.000
MT-GBT | 0.471 | 0.003 | 0.970 | 1.261 | 0.953 | 1.000
AweGNN models
ST-network | 0.448 | 0.060 | 0.959 | 1.315 | 0.959 | 1.000
MT-network | 0.664 | 0.000 | 0.983 | 1.002 | 0.741 | 1.000
Consensus models
network-consensus | 0.583 | 0.005 | 0.984 | 1.112 | 0.826 | 1.000
ST-consensus | 0.465 | 0.015 | 0.933 | 1.272 | 0.933 | 1.000
MT-consensus | 0.602 | 0.000 | 0.981 | 1.090 | 0.820 | 1.000
final-consensus | 0.541 | 0.001 | 0.979 | 1.171 | 0.872 | 1.000

5.3.2.2 Fathead Minnow LC50 Set

The fathead minnow LC50 set was randomly divided into a training set (80% of the entire set) and a test set (20% of the entire set)[64]. Table 5.3 shows the performance of all of the various models trained and tested on the LC50 data. This is the second smallest data set that we analyze in this work. The best TEST model is again the TEST consensus, with an R² score of 0.728. Notice that this is lower than the TEST consensus R² of 0.739 on the LC50-DM set, although the coverage increased from 0.900 to 0.951. For the Wu models, the best result is a whopping 0.789 R² score for Wu-consensus. Wu-consensus also boasts the best overall RMSE and MAE. The other 2 high-scoring models from that category are Wu-MT and Wu-GBT, which come in at R² scores of 0.769 and 0.761, respectively. Our ensemble models do comparably well when compared to the TEST models, except for the TEST consensus. Our ST-network model performed similarly to many of the TEST models, notably FDA, group contribution, and nearest neighbor.
Our best model, MT-network, with an R² score of 0.749 and an RMSE of 0.735, beats all TEST models, and all of our consensus models do as well as or better than the TEST consensus model, except for ST-consensus, which was weighed down by the oddly poor performance of the ST-GBT model (ST-GBT performed more poorly than the ST-RF model, with an R² score of 0.687 vs. 0.697). In comparison to the Wu models, we notice that all of our ensemble models are far behind either Wu-RF or Wu-GBT in performance, with our best R² score of 0.706, from our MT-GBT model, overtaken by the lowest Wu ensemble score of 0.727 from the Wu-RF model. The performance of our ST-network model is comparable to Wu-ST, with an R² of 0.682 vs. 0.692 and even closer RMSE and MAE scores. As far as the MT models go, we see a much greater difference between MT-network and Wu-MT. We cannot come close to the R² score of 0.769 from the Wu-MT model, let alone the 0.789 from Wu-consensus, which is the best of the models from Wu for this data set[110].

Table 5.3: Comparison of prediction results for the LC50 test set
model | R² | (R²−R₀²)/R² | k | RMSE | MAE | coverage
Results with TEST models
hierarchical[64] | 0.710 | 0.075 | 0.966 | 0.810 | 0.574 | 0.951
single model[64] | 0.704 | 0.134 | 0.960 | 0.803 | 0.605 | 0.945
FDA[64] | 0.626 | 0.113 | 0.985 | 0.915 | 0.656 | 0.945
group contribution[64] | 0.686 | 0.123 | 0.949 | 0.810 | 0.578 | 0.872
nearest neighbor[64] | 0.667 | 0.080 | 1.001 | 0.876 | 0.649 | 0.939
TEST consensus[64] | 0.728 | 0.121 | 0.969 | 0.768 | 0.545 | 0.951
Results with previous methods from our group
Wu-RF[110] | 0.727 | 0.322 | 0.948 | 0.782 | 0.564 | 1.000
Wu-GBT[110] | 0.761 | 0.102 | 0.959 | 0.719 | 0.496 | 1.000
Wu-ST[110] | 0.692 | 0.010 | 0.997 | 0.822 | 0.568 | 1.000
Wu-MT[110] | 0.769 | 0.009 | 1.014 | 0.716 | 0.466 | 1.000
Wu-consensus[110] | 0.789 | 0.076 | 0.959 | 0.677 | 0.446 | 1.000
Ensemble models
ST-RF | 0.697 | 0.028 | 1.018 | 0.836 | 0.589 | 1.000
ST-GBT | 0.687 | 0.000 | 0.995 | 0.820 | 0.562 | 1.000
MT-RF | 0.703 | 0.029 | 1.019 | 0.829 | 0.582 | 1.000
MT-GBT | 0.706 | 0.001 | 0.998 | 0.795 | 0.543 | 1.000
AweGNN models
ST-network | 0.682 | 0.003 | 0.987 | 0.830 | 0.566 | 1.000
MT-network | 0.749 | 0.000 | 0.999 | 0.735 | 0.481 | 1.000
Consensus models
network-consensus | 0.739 | 0.000 | 0.996 | 0.748 | 0.505 | 1.000
ST-consensus | 0.711 | 0.000 | 0.994 | 0.788 | 0.540 | 1.000
MT-consensus | 0.747 | 0.001 | 1.001 | 0.739 | 0.494 | 1.000
final-consensus | 0.737 | 0.001 | 0.998 | 0.753 | 0.507 | 1.000

5.3.2.3 Tetrahymena Pyriformis IGC50 Set

The IGC50 data set is the second largest of the data sets we analyze. The diversity of the molecules is relatively low compared to the other data sets, which allows for more coverage in the TEST models. The number of data points in the set is large enough to train robust models, which translates into the high R² scores for the models compared in this section. We notice that the TEST models are far more variable in their results than in the previous 2 data sets, with R² scores ranging from 0.600 to 0.764. Again, the TEST consensus gets the highest R² score (0.764). Among the Wu models, the Wu-consensus model is again the supreme champion with an R² score of 0.802. As usual, the Wu-MT model also did very well, trumping all TEST models with respect to every metric. We also include the best IGC50 model from the work by Nguyen et al.[81], in which GBT models were trained on representations derived from differential geometry based descriptors. Though the model was trained without help from other data sets, it was able to defeat the Wu-MT model in all metrics. The GBT Nguyen-best model is, however, narrowly defeated by the Wu-GBT model, with an R² score of 0.781 compared to 0.787.
Our ensemble models perform relatively poorly in comparison to the rest of the models, although our MT-GBT model does perform better than all TEST models except for the TEST consensus model. Our ST-network model outperforms all TEST models single-handedly with an R² score of 0.778. All of our remaining models outperform our own ST-network model, and thus every TEST model as well. Our ensemble models under-perform yet again in comparison to the RF and GBT models set forth by Wu. When comparing our network models, we see that our ST-network model strongly defeats the Wu-ST model, which has an R² score of 0.749, and even outperforms the Wu-MT model, which has an R² of 0.770. ST-network is, however, beaten by Wu-consensus, and even by the model put forth by Nguyen, that is, Nguyen-best. Our MT-network model is the best scoring model in the whole table in all metrics except for MAE, with an R² score of 0.803, RMSE of 0.436, and MAE of 0.310. Only Wu-consensus has a lower MAE of 0.305, but that MAE is beaten by our model network-consensus, with an MAE of 0.304. Although MT-network technically performs slightly better than Wu-consensus, they are effectively identical in performance.

Table 5.4: Comparison of prediction results for the IGC50 test set
model | R² | (R²−R₀²)/R² | k | RMSE | MAE | coverage
Results with TEST models
hierarchical[110] | 0.719 | 0.023 | 0.978 | 0.539 | 0.358 | 0.933
FDA[110] | 0.747 | 0.056 | 0.988 | 0.489 | 0.337 | 0.978
group contribution[110] | 0.682 | 0.065 | 0.994 | 0.575 | 0.411 | 0.955
nearest neighbor[110] | 0.600 | 0.170 | 0.976 | 0.638 | 0.451 | 0.986
TEST consensus[110] | 0.764 | 0.065 | 0.983 | 0.475 | 0.332 | 0.983
Results with previous methods from our group
Wu-RF[110] | 0.736 | 0.235 | 0.981 | 0.510 | 0.368 | 1.000
Wu-GBT[110] | 0.787 | 0.054 | 0.993 | 0.455 | 0.316 | 1.000
Wu-ST[110] | 0.749 | 0.019 | 0.982 | 0.506 | 0.339 | 1.000
Wu-MT[110] | 0.770 | 0.000 | 1.001 | 0.472 | 0.331 | 1.000
Wu-consensus[110] | 0.802 | 0.066 | 0.987 | 0.438 | 0.305 | 1.000
Nguyen-best[81] | 0.781 | 0.004 | 1.003 | 0.463 | 0.324 | 1.000
Ensemble models
ST-RF | 0.713 | 0.013 | 1.006 | 0.535 | 0.377 | 1.000
ST-GBT | 0.745 | 0.000 | 0.994 | 0.496 | 0.329 | 1.000
MT-RF | 0.716 | 0.013 | 1.006 | 0.532 | 0.377 | 1.000
MT-GBT | 0.753 | 0.000 | 0.995 | 0.489 | 0.327 | 1.000
AweGNN models
ST-network | 0.778 | 0.000 | 1.003 | 0.463 | 0.309 | 1.000
MT-network | 0.803 | 0.000 | 0.999 | 0.436 | 0.310 | 1.000
Consensus models
network-consensus | 0.799 | 0.000 | 1.001 | 0.440 | 0.304 | 1.000
ST-consensus | 0.777 | 0.000 | 1.000 | 0.464 | 0.309 | 1.000
MT-consensus | 0.795 | 0.000 | 0.998 | 0.445 | 0.305 | 1.000
final-consensus | 0.789 | 0.000 | 0.999 | 0.451 | 0.306 | 1.000

5.3.2.4 Oral Rat LD50 Set

The oral rat LD50 set contains the most molecules. The data set is quite large (7413 molecular compounds), so naturally full coverage is not available for any of the methods compared, although most models still have very high coverage. The labels of this data set are quite difficult to predict because of the high experimental uncertainty in obtaining the toxicity endpoints, as noted in Zhu et al.[114]. Table 5.5 shows the results. As in Wu et al.[110], we omit the single model and group contribution TEST methods from the table and from our analysis. As always, the TEST consensus provides the best results amongst the TEST models. For this data set, the effectiveness of the Wu models wanes, with the Wu-MT model failing to perform decisively better than the TEST consensus (although Wu-MT does do slightly better in RMSE and MAE than the TEST consensus). For this data set, we see the best performance from our models. Even our ensemble models do quite well.
In fact, our ST-GBT and MT-GBT models, with respective R² scores of 0.643 and 0.641, outperform every TEST model and every Wu model aside from the Wu-consensus model. All of our non-ensemble models outperform every other model, including Wu-consensus, with our best performing model, final-consensus, having an R² of 0.667, RMSE of 0.557, and MAE of 0.405. This decisively beats the Wu-consensus model, which has an R² score of 0.653, RMSE of 0.568, and MAE of 0.421. We notice that for this data set, we finally required the consensus with the GBT models to get our best prediction, as opposed to the other data sets, for which we obtained the best result from our MT-network models. Our AweGNN models seem to shine relative to other groups' models when the data sets are largest, in which case the networks are able to generate their most generalizable representations and the predictions are the most accurate.

Table 5.5: Comparison of prediction results for the LD50 test set
model | R² | (R²−R₀²)/R² | k | RMSE | MAE | coverage
Results with TEST models
hierarchical[64] | 0.578 | 0.184 | 0.969 | 0.650 | 0.460 | 0.876
FDA[64] | 0.557 | 0.238 | 0.953 | 0.657 | 0.474 | 0.984
nearest neighbor[64] | 0.557 | 0.243 | 0.961 | 0.656 | 0.477 | 0.993
TEST consensus[64] | 0.626 | 0.235 | 0.959 | 0.594 | 0.431 | 0.984
Results with previous methods from our group
Wu-RF[110] | 0.619 | 0.728 | 0.949 | 0.603 | 0.452 | 0.997
Wu-GBT[110] | 0.630 | 0.328 | 0.960 | 0.586 | 0.441 | 0.997
Wu-ST[110] | 0.614 | 0.006 | 0.991 | 0.601 | 0.436 | 0.997
Wu-MT[110] | 0.626 | 0.002 | 0.995 | 0.590 | 0.430 | 0.997
Wu-consensus[110] | 0.653 | 0.306 | 0.959 | 0.568 | 0.421 | 0.997
Ensemble models
ST-RF | 0.606 | 0.013 | 1.003 | 0.612 | 0.452 | 0.998
ST-GBT | 0.643 | 0.002 | 0.995 | 0.578 | 0.424 | 0.998
MT-RF | 0.596 | 0.012 | 1.003 | 0.619 | 0.456 | 0.998
MT-GBT | 0.641 | 0.002 | 0.995 | 0.580 | 0.426 | 0.998
AweGNN models
ST-network | 0.660 | 0.001 | 0.995 | 0.563 | 0.406 | 0.998
MT-network | 0.658 | 0.001 | 0.993 | 0.565 | 0.411 | 0.998
Consensus models
network-consensus | 0.665 | 0.000 | 0.995 | 0.558 | 0.404 | 0.998
ST-consensus | 0.665 | 0.001 | 0.997 | 0.559 | 0.406 | 0.998
MT-consensus | 0.664 | 0.001 | 0.996 | 0.560 | 0.409 | 0.998
final-consensus | 0.667 | 0.001 | 0.997 | 0.557 | 0.405 | 0.998

5.3.3 Solvation Data Set

In this section, we explore the results on the solvation data set, Model III. Model III has a total of 387 molecules (excluding ions) and is split into a training set of 293 molecules and a test set of 94 molecules. We use the same training set, but we omit molecules based on obscure chemical names in the PubChem database and due to some difficulties with the Schrödinger software in generating mol2 files. The test set is unaltered, so these difficulties, which lead to our smaller training set of 280 molecules, should disfavor our method. In addition, there is only one data set for solvation that we analyze, so we cannot apply our MT method. We simply train ST models and report one consensus of the GBT model and the network model, following the same procedure as before. The results are shown below in Table 5.6, where they are compared to other models mentioned previously[81, 106, 104].
Table 5.6: Comparison of prediction results for the solvation test set
model | MAE (kcal/mol) | RMSE (kcal/mol) | R²
Results from outside sources
WSAS[106] | 0.66 | - | -
FFT[104] | 0.57 | - | -
Results from Nguyen et al.[81]
EIC^H_{E,3.5,0.3} | 0.575 | 0.921 | 0.904
EIC^H_{E,3.5,0.3;E,2.5,1.3} | 0.558 | 0.857 | 0.920
EIC^H_{L,3,1.3} | 0.592 | 0.931 | 0.906
EIC^H_{L,3,1.3;L,6.5,0.3} | 0.608 | 0.919 | 0.907
Consensus^H | 0.567 | 0.862 | 0.920
Ensemble models
ST-RF | 0.698 | 1.001 | 0.893
ST-GBT | 0.496 | 0.666 | 0.951
AweGNN model
ST-network | 0.373 | 0.569 | 0.963
Consensus model
ST-consensus | 0.401 | 0.583 | 0.962

We see that in this instance, we get our best performance relative to other groups' models. This is quite surprising, as we found that with the toxicity data, the AweGNN seemed to perform very well only when the data sets were very large. Our greatest model, the ST-network model, significantly outperforms every other model in every metric. Even our GBT model is able to greatly outperform all other models from outside this group, though our RF model performs only satisfactorily. In fact, the supplementary material from Nguyen et al.[81] contains models whose kernel parameters were optimized specifically on the test data, giving them a further advantage. These can be seen in Table 5.7. Again, we see that even our GBT model is able to win in every metric against every model in this table, let alone our ST-network, which has an MAE gap of 0.145, an RMSE gap of 0.229, and an R² score gap of 0.032 when compared to the best metrics of any chosen opposing model. This analysis shows that there is great promise in applying the AweGNN method to solvation free energy prediction.

Table 5.7: Supplementary results for the solvation data set from Nguyen et al.[81]
Method | MAE (kcal/mol) | RMSE (kcal/mol) | R²
*EIC^H_{E,3.5,0.3} | 0.575 | 0.921 | 0.904
*EIC^H_{E,3.5,0.3;E,4.0,1.3} | 0.518 | 0.812 | 0.929
*EIC^H_{L,5,0.3} | 0.579 | 0.862 | 0.917
*EIC^H_{L,5,0.3;L,0.5,0.9} | 0.559 | 0.842 | 0.922
*Consensus^H | 0.524 | 0.798 | 0.931

5.4 Discussion

In this section, we will discuss the impact of automated parameter selection, analyze some elements of feature importance, and discuss a new paradigm of auto-parametrized kernel-based networks that could lead to many more possibilities.

5.4.1 Impact of Automating Selection of Kernel Parameters

When selecting kernel parameters to optimize the choice of representation, our group previously used the grid search method to determine which parameters were optimal[79]. This is not much of a hindrance if there are only a handful of parameters to tune, but as the number of parameters increases, the number of models that need to be trained increases exponentially. In addition, grid searches require a discrete set of parameters to experiment with, which is a coarse way to tune parameters and leads to some inefficiency. These problems can severely restrict the potential of these kernel-based representations, and largely restrict us to machine learning algorithms that do not have many tunable parameters, such as RF or GBT.

One of our goals in this work was to alleviate these issues. Automatically updating kernel parameters is a great solution in theory and in practice, as we have seen with the success of our AweGNN in this work. The AweGNN seeks the optimal choice of parameters within a continuous space, so the parameter selection is done in a smooth way and can select values that are not present in a grid search. In our case, we incorporate the parameter selection into the gradient descent process of training a neural network, as sketched below.
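The following is a minimal, illustrative PyTorch sketch of this idea, not the GGR layer implementation itself: the parameters η and κ of an exponential kernel are registered as learnable parameters, so ordinary back-propagation updates them together with the network weights. The class name and the flat distance/group-index inputs are hypothetical, and the initialization ranges are borrowed from the settings quoted in section 6.3.1.

```python
import torch
import torch.nn as nn

class AutoKernelFeatures(nn.Module):
    """Toy auto-parametrized kernel layer: one exponential kernel per feature group.

    Given flat pairwise distances d and a group id per pair, it outputs one feature
    per group, sum_pairs exp(-(d/eta)^kappa), with eta and kappa learned by backprop.
    """
    def __init__(self, n_groups):
        super().__init__()
        self.eta = nn.Parameter(torch.rand(n_groups) * 1.5 + 0.5)     # init in [0.5, 2.0)
        self.kappa = nn.Parameter(torch.rand(n_groups) * 10.0 + 2.0)  # init in [2, 12)

    def forward(self, distances, group_idx):
        # distances: (n_pairs,) float tensor; group_idx: (n_pairs,) long tensor of group ids
        eta = self.eta[group_idx].clamp(min=1e-3)
        kappa = self.kappa[group_idx].clamp(min=1e-3)
        weights = torch.exp(-(distances / eta) ** kappa)
        # Accumulate one feature per group (a scatter-style segment sum).
        return torch.zeros_like(self.eta).index_add(0, group_idx, weights)
```

Placed in front of a standard feed-forward network and trained with an optimizer such as Adam, every step nudges η and κ continuously, which is the smooth counterpart of the discrete grid search described above.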
Tying the parameter selection to gradient descent allows us to update a great many kernel parameters at the same time as the weights of the neural network, at a moderate computational cost commensurate with that of a moderately sized network. More specifically, a quick analysis on the IGC50 data shows that a network consisting of the GGR layer (producing 400 features) with normalization takes, on average, the same amount of time to cycle through an epoch as the network with the GGR+normalization layer followed by a 4-layer network of 1600 neurons each; thus the GGR+normalization layer is equivalent to a 4-layer, 1600-neuron network in terms of computational cost. This cost is certainly not prohibitive, and ultimately saves us an inordinate amount of time, especially since we are able to tune 200 kernel parameters simultaneously (which would be impossible with a grid search). As kernel methods require careful tuning to reach their full potential, neural networks were effectively off limits for training effective models with kernel-based representations. The AweGNN proves that kernel-based representations can be used to train high-performing models and opens many new possibilities for developing excellent predictive models for biomolecular data.

5.4.2 Feature Importance and Analyzing the Trajectories of the Kernel Parameters

To find the most important features that influenced our predictions, we look at the feature importance ranking of the random forest (RF) models that were trained on the network-enabled automatic representations (NEARs) generated by the AweGNNs. This analysis pinpoints the specific element-pair interactions that contributed the most to the creation of our predictive models. Then, we can analyze the average trajectories and final states of the kernel functions from the element-specific groups that generate the most important features. The average η values can tell us about the most beneficial interaction distances between those element pairs, and the average κ values can tell us about the most favorable degree of sharpness of the cutoff for including the edge cases centered around those η values.

By closely examining the feature importance over all data sets and all models, we notice that, in general, the most important interactions tend to be the hydrogen-hydrogen and carbon-carbon interactions, along with the interaction between those two elements. This makes sense, since carbons and hydrogens are the most numerous elements in these structures and will interact most with the environment. The second most important features are generated from the hydrogen-oxygen and carbon-oxygen pairs. This again seems to be a natural result, for it may indicate the importance of the polar bonds. Aside from these groups, however, there seems to be no broader pattern. Different data sets seem to prioritize different element-specific groups, especially the largest data set, LD50, for which there is a great diversity of important element-pair interactions, as can be seen in Figure 5.7a, which illustrates the feature importance associated with each pair through a heat map. A more contained display of the feature importance is portrayed in Figure 5.7b, based on the RF models trained to predict the toxicity of the IGC50 data set with the NEARs generated from the multi-task (MT) AweGNN models.
The feature importance values are concentrated within the pairs consisting of the hydrogen, carbon, nitrogen, and oxygen elements. For the complete set of feature importance heat maps for all of the data sets, including feature importance measured by the gradient boosting models, see the Supporting Information, Figures S1-S16. To go into more detail about what is happening in each element-specific group, we analyze how the parameters of the kernel functions of the top 10 groups evolve on average. This evolution can be seen in Figure 5.8.

Figure 5.7: Feature importance of MT-AweGNNs when applied to different data sets: (a) LD50 feature importance map; (b) IGC50 feature importance map. We generate heat maps based on the average of feature importance across 42 RF models trained on the corresponding NEARs of the MT-AweGNNs. The average feature importances of the 4 features of each element-specific group are summed together to get the final scores shown in the maps. Figure 5.7a shows the results for the LD50 MT-RF models and Figure 5.7b shows the results for the IGC50 MT-RF models. Note that the maps are asymmetric due to the way that the electrostatics-based features of a group are generated.

As the training progresses, the kernel parameters of the element pairs tend towards various η-κ combinations. Some element pairs have high η values and moderate κ values, such as the H-H and C-N groups, while others tend towards low η and low κ values (e.g., H-O), or even low η and high κ (O-C). Many physical interpretations could be valid; for example, the low η and κ values for the H-O group could be a measure of the potential for hydrogen bond interactions with the environment, or even of internal short-range interactions that assess the stability of the molecule. In any case, we may note the convergence of the η values especially, for they might reveal important cutoff distances that contribute to the measurement of toxicity.

Figure 5.8: Average trajectories of kernel parameters for 42 MT-AweGNN models. The graph on the left shows how the average of the η values of our 42 MT models for 8 different element-specific groups changes as the models are trained. On the right, we see the κ counterpart of this analysis.

5.4.3 Limitations and Advantages

One of the limitations present in this study regards applying the AweGNN to very large molecules or systems of molecules. If too many atoms are involved in the calculation of these features, then the computational cost may become unbearable. Some provisions or rules can be made to avoid heavy computational costs while dealing with large-scale biomolecular structures or systems of molecules. For example, our method may be naturally extended to the problem of protein-ligand binding. We can measure the interaction strength between atoms of a specific element type in the protein and atoms of another element type in the ligand, instead of the interactions between the atoms of two element types contained within a single molecule. A standard cutoff distance from the ligand is applied to focus on the strongest interactions, thereby leaving a reduced number of atoms for the analysis. A cutoff that is very large might include too many atoms from the protein and thereby incur a heavy computational cost far beyond that of the small molecular data sets that we have used in this study.
Small cutoffs can be feasible and perhaps still quite effective, since long-range interactions may not carry as much weight in terms of performance.

The advantage of the AweGNN is its ability to shape the kernel function for each element-specific group so as to provide the optimal filter for capturing the most important interactions between elements within a molecule. Though there is much potential in this approach, there is a danger of over-fitting with smaller data sets. This is perhaps why we see the greatest performance with the largest LD50 data set, and so practically we may want to train on the largest data sets that we can. However, we notice also that the AweGNN was incredibly successful with the small solvation energy data set; and for the two smallest data sets, LC50-DM and LC50, the results for the single-task network are similar to the RF models, which are known to be quite robust against over-fitting. Thus, the danger of over-fitting may not be as great as originally thought. Also, it may be the case that the multi-task method is less effective when applying the AweGNN, for it may fit its features too specifically to a particular task, and therefore be influenced by the larger data sets more so than other methods, like Wu's[110].

The computational cost associated with training our multi-task (MT) AweGNN model, which is our largest computational endeavor, is under 135 minutes. The largest single-task model (the LD50 model) takes under 120 minutes, while the second largest IGC50 data set is trained in approximately 20 minutes, with the training time for the smaller data sets being significantly lower (a handful of minutes). While the training time of our MT-AweGNN is almost 70% longer than, for instance, the MT-Wu model (80 minute training time)[110], the MT-AweGNN can still perform better on the larger data sets, so the 70% increase in training time may be worth the performance boost when provided with more samples. In addition to this, we note that our features are simple kernel-based features and do not require auxiliary physical descriptors to achieve stellar performance. The AweGNN can in fact use those same auxiliary features to bolster its own predictions, and is more flexible in that it can also combine multiple kernel types to generate multi-scale representations that have been highly successful in previous works[78, 81]. Finally, as our features are kernel-based, we noted in section 5.4.1 that these features would usually require a grid search to optimize the kernel parameters, but in our case, we are able to automate this process while simultaneously training our network, thereby saving time in that respect.

5.5 Conclusion

Recent years have witnessed much effort in developing mathematical representations for the machine learning prediction of chemical and biological properties that are crucial to drug discovery. Advanced mathematical tools from fields such as graph theory[78, 80], differential geometry[81], algebraic topology[110], and other mathematical areas have been developed and have demonstrated superb performance. Many of these representations are generated with a choice of kernel functions that depend on a choice of parameters. These kernel-based molecular representations must be fine-tuned to optimize their effectiveness for training machine learning models. Increasing the number of parameter choices leads to a rapid increase in the computational cost of optimization.
This problem is exacerbated in the case of deep neural networks because of the already cumbersome task of choosing network hyperparameters. Motivated by the automated feature extraction in convolutional neural networks (CNNs), we propose the auto-parametrized weighted element-specific graph neural network (AweGNN) to automate and optimize the parameter selection in our geometric graph approach. The resulting representation from training the AweGNN is called a network-enabled automatic representation (NEAR). NEARs can be used as input features for other machine learning models, such as random forest (RF) and gradient-boosting tree (GBT) models.

AweGNN and NEAR-based ensemble methods are validated on five data sets from quantitative toxicity and solvation predictions. Both toxicity and solvation are important for lead identification and structure optimization in drug discovery. Four quantitative toxicity data sets were used in our work: the 96 h fathead minnow LC50 set, the 48 h Daphnia magna LC50 set (or LC50-DM), the 40 h Tetrahymena pyriformis IGC50 set, and the oral rat LD50 set. Our models were compared to state-of-the-art models from the literature, i.e., the Toxicity Estimation Software Tool (TEST)[64] listed by the United States Environmental Protection Agency (EPA), a previous work by Wu et al.[110], and another work by Nguyen et al.[81]. Given that we had 4 data sets of various sizes with similar prediction tasks, we were able to employ the multi-task (MT) learning method to greatly improve our performance. Our top models were able to out-compete all TEST models on all but the smallest data sets. When comparing with the top Wu models, our top models were able to perform just as well or better when restricting to the largest 2 data sets, although our general models for the smaller data sets were still able to perform relatively well when compared to many of the earlier models. To broaden our application, we also tested the AweGNN on the solvation data set, Model III, used in the work by Nguyen et al.[81]. For this instance, we were not able to apply MT learning because we only had one data set to work with. Despite this, we were able to greatly outperform all other models that we compared against[81, 106, 104]. Although there is a positive correlation between the relative effectiveness of the AweGNN and the data set size with respect to the 4 toxicity data sets, we see that in the case of solvation, the AweGNN can perform exceptionally well even with a very small data set. Our work showcases the impressive predictive capabilities of the AweGNN and ultimately introduces new potential to improve the effectiveness of previous mathematical methods.

CHAPTER 6

GENERAL AUTO-PARAMETRIZED KERNEL METHODS AND FUTURE WORK

6.1 Introduction

In this chapter we discuss various extensions of the original AweGNN work and present future possibilities that can advance the auto-parametrized kernel method paradigm.

6.2 GPU-Enhanced AweGNN and Multi-scale Methods

The original AweGNN did not have a geometric graph representation (GGR) layer that was compatible with graphics processing unit (GPU) technology. This hampered the feasibility of applying the multi-scale kernel methods that have been shown to be wildly successful in the past. Although computational speed was improved through the use of the Cython and Numba packages, the GGR layer did not have the necessary speed to place the computation time within the proper scope.
This led to the implementation of the GPU-enhanced AweGNN, which uses a different data structure that allows the GPU to make quick calculations of the element-specific geometric graph features. We will discuss the new structure, algorithm, and performance.

6.2.1 GPU programming

Graphics processing units, or GPUs, have immense computing power due to the thousands of tiny processing cores contained within their architectures. These cores can run thousands of threads in parallel, and so benefit greatly, in terms of low computational time, when applied to algorithms that rely heavily on computations that can be done simultaneously. GPUs were originally used to display graphics in video games, and later found use in machine learning, especially in application to deep neural networks. Since deep neural networks involve the multiplication of many large matrices, and matrix multiplication involves many operations that can be performed in parallel, GPUs have been extremely beneficial for the advancement of these networks. NVIDIA CUDA is the most popular programming model at the moment, although there are many packages that allow various programming languages to interact with the GPU programming interface and improve the ease of use. OpenMP is a popular framework that helps facilitate GPU programming with the C++ language, while Numba and PyCUDA are used for GPU programming with the Python programming language.

GPUs launch kernels one at a time, which are GPU tasks that consist of many blocks of computing threads. Threads are individual computations that are completed independently of one another and can be synchronized with other computations occurring within the same block. Threads are scheduled by CUDA in warps of 32 threads, which are executed SIMD (single instruction, multiple data) style across all threads in a warp. Blocks usually contain around 1024 threads grouped in 32 warps per block. When programming a GPU, one must be mindful of the way in which the execution of parallel instructions is scheduled in order to achieve optimal speed. For additional efficiency from parallelism, one may consider running multiple streams of kernels, which allow multiple whole tasks to be executed in parallel. GPUs are quite flexible despite the restrictions that one may have to work with.

6.2.2 Data Structure and Algorithm

For the original AweGNN, the element-specific biomolecular data, i.e., the distances between the atoms and the electrostatics information, were stored in Python lists and NumPy matrices, which could then be accessed easily for Numba operations. To make use of GPU programming with Numba, the data had to be stored without the use of lists, and so the data were stored only in arrays, with indexing arrays that described the location of the distances corresponding to specific element-specific groups of the molecules in the data set.

More precisely, we have a single 1-D array detailing the pairwise distances of all the atoms in each element-specific group in each sample. An electrostatics information array of the same size is also present, detailing the electrostatic information relevant to the corresponding position in the distance array. Two index arrays, again of the same length, index the element-specific group within the sequence of the data set in one and the distances within that element-specific group in the other. A minimal sketch of this flat layout and the per-group accumulation is given below.
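The following Numba CUDA sketch illustrates the idea, but it is not the dissertation's implementation: one thread is assigned per stored pairwise distance, and each thread atomically adds its kernel value into the running sum of its element-specific group. The function and array names are hypothetical, and an atomic-add accumulation is used here instead of the tree reduction described next.

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def group_kernel_sum(distances, group_idx, eta, kappa, out):
    """One thread per pairwise distance: add exp(-(d/eta)^kappa) to its group's sum."""
    i = cuda.grid(1)
    if i < distances.shape[0]:
        g = group_idx[i]
        val = math.exp(-math.pow(distances[i] / eta[g], kappa[g]))
        cuda.atomic.add(out, g, val)

# Hypothetical launch: flat arrays as described above, one group id per distance.
# n_groups = int(group_idx.max()) + 1
# out = cuda.to_device(np.zeros(n_groups, dtype=np.float64))
# threads = 256
# blocks = (distances.shape[0] + threads - 1) // threads
# group_kernel_sum[blocks, threads](distances, group_idx, eta, kappa, out)
```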
With this data structure, we can apply a common parallel reduction algorithm that sums the elements of an array, with some slight modifications, to obtain the sum for all of our element-specific groups at once. This algorithm is illustrated in Figure 6.1.

Figure 6.1: A graphical representation of the parallel addition algorithm for calculating the sum of an array. We start with an initial array a of length n. At the first step, we perform the pairwise additions a_i := a_i + a_{i+n/2} for i = 0, ..., n/2 − 1. The next step performs the pairwise additions a_i := a_i + a_{i+n/4}, now for i = 0, ..., n/4 − 1. At step k, the pairwise additions a_i := a_i + a_{i+n/2^k} continue for i = 0, ..., n/2^k − 1, until only one number remains to be summed, and the array sum is then extracted from the first element, a_0. The image uses green to show the elements used in the sum at each step, grey to show inactive elements, and a red circle to highlight the final answer after all the steps are completed.

6.2.3 New Capabilities

In addition to the new algorithm adapted for the GPU, it was decided that the GGR layer of the AweGNN should output features based on a choice of either distance, electrostatics, average distance, or average electrostatics information for each element-specific group. This is in contrast to the original GGR layer, which would output all 4 categories per element-specific group. This was done to allow the kernels to fit the data based on the different types of features; i.e., a different kernel function may be appropriate for capturing the electrostatic interactions of an element-specific group as opposed to capturing the distance data within an element-specific group. We also have a new kernel that is now operational, namely the piece-wise linear kernel detailed in section 4.4, which has performed well in brief experimentation, as can be seen in the section below. Finally, the multi-scale capabilities are operational due to the increase in speed, and have been briefly tested, again as reported in the section below.

6.3 Comprehensive AweGNN with Fixed Topological and Spectral Descriptors

The GGR layer can perform very well on its own, as has been shown by the performance of the AweGNN. There is, however, more improvement to be made. The GGR layer lacks descriptors that detail the topological and geometric information of the biomolecules. In addition, spectral information can be gleaned from the graph structure inherent in the molecular structure. It seems reasonable that the combination of molecular descriptors from all of these categories can produce an effective cumulative model. The descriptors generated by the AweGNN provide element-specific features that measure relative distance and electrostatic interaction strength. Topological descriptors measuring the persistence of rings and cavities present within the molecules give us a representation of the geometric properties. Finally, the spectral descriptors of the molecular graph can give us information about the connectivity of the molecule, among other geometric information. We seek to create and test a cumulative model that incorporates all these descriptors in a neural network structure.
6.3.1 Model Features and Architecture

We follow a simple architecture that involves the geometric graph representation (GGR) layer from the AweGNN and one or two auxiliary layers which supply the topological and/or spectral descriptors, which are then merged into the main part of the network. The idea for this architecture is illustrated in Figure 6.2.

Figure 6.2: The general structure of the comprehensive network described in this section. The geometric graph representation (GGR) layer from the AweGNN takes center stage, with auxiliary topological and spectral features all merging together into one shared layer. Each set of features can have its own hidden layers to process the features before merging.

In the brief set of experiments that we ran using this architecture, we omit the spectral features, and neither the topological feature input nor the GGR layer has any corresponding hidden layers. The GGR layer output and the topological features are concatenated from the beginning. The topological barcode data are calculated using the Ripser software [16], and the descriptors are generated following the procedure in section 4.5.5, with a maximum length of 10 Å and a resolution of 0.25 Å. The GGR layer is a multi-scale layer consisting of 6 kernel types: the 3 kernel types described in section 4.4, the Lorentz, exponential, and piece-wise linear kernels, each with a distance-based kernel and an electrostatics-based kernel, yielding our 6 kernel types. Each kernel type yields 100 features and 200 kernel parameters, for a total of 600 features and 1,200 kernel parameters. The hidden layers of our network are composed of 6 linear layers of 1024, 512, 256, 128, 64, and 32 neurons, in that sequence, down to the single output which gives us our prediction. After each hidden layer, we have a batch normalization layer followed by a ReLU activation layer. A constant learning rate of 0.01 with the AMSGrad version of the Adam optimizer with default PyTorch settings (β1 = 0.9 and β2 = 0.999), a batch size of 100 samples, and τ and κ initialization ranges of 0.5-2 and 2-12, respectively, were used. These hyperparameters were chosen by following intuitive guidelines and experience from past experiments to make reasonable predictions. Each model was trained for 2000 epochs, and 50 models were trained per data set. The 50 predictions were averaged to get a final prediction, and the results were recorded. A sketch of this architecture is given below.
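A minimal PyTorch sketch of the shared part of this architecture follows; it is an illustration rather than the exact implementation. The GGR layer is passed in as a black box, and the number of topological features (400 here) is a placeholder, since the text above does not fix it.

```python
import torch
import torch.nn as nn

class ComprehensiveNet(nn.Module):
    """Sketch: concatenate multi-scale GGR features with fixed topological features,
    then pass them through 6 hidden layers (1024-512-256-128-64-32), each followed
    by batch normalization and ReLU, down to a single prediction."""
    def __init__(self, ggr_layer, n_ggr_feats=600, n_topo_feats=400):
        super().__init__()
        self.ggr = ggr_layer  # e.g., an auto-parametrized multi-scale kernel layer
        dims = [n_ggr_feats + n_topo_feats, 1024, 512, 256, 128, 64, 32]
        blocks = []
        for d_in, d_out in zip(dims[:-1], dims[1:]):
            blocks += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out), nn.ReLU()]
        self.trunk = nn.Sequential(*blocks, nn.Linear(dims[-1], 1))

    def forward(self, ggr_inputs, topo_feats):
        x = torch.cat([self.ggr(*ggr_inputs), topo_feats], dim=1)
        return self.trunk(x)

# Optimizer settings quoted above: AMSGrad Adam with a constant learning rate of 0.01.
# model = ComprehensiveNet(my_ggr_layer)
# optimizer = torch.optim.Adam(model.parameters(), lr=0.01, amsgrad=True)
```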
6.3.2 FreeSolv and ESOL Data Sets

We describe 2 data sets from the MoleculeNet paper [111], FreeSolv and ESOL, on which we have trained the comprehensive neural network described in the previous section. ESOL is a small dataset consisting of water solubility data for 1128 compounds [28]. The dataset has been used to train models that estimate solubility directly from chemical structures (as encoded in SMILES strings) [33, 111]. Although the ESOL dataset originally consisted of 2D SMILES strings, we used the Schrödinger software to generate mol2 files that contain the 3D structures of the molecules, so that we could apply our own machine learning models, which require them. The data set is split into a benchmark test set of 113 molecules, while the remaining data points form the training set of 1015 molecules. We trained our models on this training set before testing on the benchmark set.

The Free Solvation Database (FreeSolv) provides experimental and calculated hydration free energies of small molecules in water [73]. A subset of the compounds in the dataset is also used in the SAMPL blind prediction challenge [75]. The calculated values are derived from alchemical free energy calculations using molecular dynamics simulations. The experimental values are included in the benchmark collection, with the calculated values used for comparison [111]. This data set contains 642 molecules and has 65 molecules set aside for the benchmark test set. The remaining 577 molecules make up the training set, on which we trained our models before testing on the test set.

6.3.3 Results and Discussion

For both data sets, we consider the standard metrics of the RMSE and the Pearson correlation coefficient described in section 3.2.3. We compare these results to the benchmark test results recorded in the MoleculeNet paper. The charts for the ESOL and FreeSolv datasets are shown in Figure 6.3. For our ESOL experiments, we achieved an RMSE of 0.568 and a Pearson correlation coefficient of 0.966. For the FreeSolv experiments, we achieved an RMSE of 0.762 and a Pearson correlation coefficient of 0.978. The best results for the ESOL and FreeSolv datasets from the MoleculeNet paper are 0.58 and 1.15 RMSE, respectively, both coming from their MPNN model. The MPNN model is the message passing neural network proposed by Gilmer et al. [37]. Our models outperform all of the proposed models in MoleculeNet on both datasets and so are clearly excellent models in and of themselves. The gap is especially large on the FreeSolv dataset, where it is more than 0.42 in RMSE. This is quite large considering the scale of the prediction evaluations, in which the worst predictions are below 2.2 RMSE.

Figure 6.3: Results gathered from the MoleculeNet paper. 8 model types were tested on a validation set and a test set, and the average performance using the RMSE metric was recorded. The error bars show the standard deviation of the model results [111].

In this case, we see another instance in which the GGR layer performs extremely well, just as we saw stellar performance on the solvation dataset in the original AweGNN paper [100]. Although we can safely say that the comprehensive AweGNN model proposed here is a superb model in comparison, we admit that there are many improvements to be made before its performance is optimized. First, the model architecture could be optimized in a more formal fashion, and there was no feature cleaning or feature importance analysis to clean up the topological data. Second, we did not apply any separate hidden layers to the auxiliary topological features or to the GGR layer itself, which would provide an opportunity for the features to be pre-processed before being concatenated and sent down through the shared layers. Finally, we did not include the spectral auxiliary features that would complete our goal of making a truly comprehensive network.

6.4 Auto-parametrized Spectral Layer

We wish to introduce a new layer that has not been tested, but whose theory is important for the future development of the technique. We want to create a layer that updates the kernel parameters of a weighted adjacency or Laplacian matrix, which extends the original idea of the AweGNN from a geometric graph layer that obtains descriptors from edge weights to an algebraic graph layer that obtains descriptors from the spectrum of Laplacian and adjacency matrices.
This section applies perturbation theory to develop a formula for the derivative of an eigenvalue function with respect to the kernel parameters, so that the kernel parameters can be updated throughout the training.

6.4.1 Differentiability of Eigenvalue and Eigenvector Functions

In this section, we use "Perturbation Theory for Linear Operators" by Tosio Kato [49] as a helpful resource. In particular, we have Theorem 6.1 from the book:

Theorem. If the holomorphic family T(χ) is symmetric, the eigenvalues λ_h(χ) and the eigenprojections P_h(χ) are holomorphic on the real axis, whereas the eigennilpotents D_h(χ) vanish identically.

In this theorem, the holomorphic family T(χ) is a family of symmetric matrices parametrized by χ. The eigenvalues λ_h(χ) refer to functions λ_h : D → R on some domain D ⊂ R, defined by mapping the value χ to an eigenvalue of T(χ). In a similar vein, the eigenprojections P_h(χ) are functions P_h : D → M_n(R), where M_n(R) is the space of all n × n real matrices with the standard topology generated by the Euclidean norm and n is the dimension of the matrices T(χ); each such function maps the value χ to an eigenprojection of T(χ), which is a projection (i.e., P² = P) that projects onto the eigenspace of its corresponding eigenvalue. Since we eventually want to work with a function that maps to individual eigenvectors, we will instead consider eigenvector functions, v_h, in place of these eigenprojection functions. We will not bother with the definition of the eigennilpotents, since these are irrelevant to the discussion.

We notice that there are many questions that we must ask about the aforementioned eigenvalue and eigenvector functions, λ_h and v_h. First, we observe that they are both multi-valued functions and must be made well-defined before they can be discussed seriously and used in application. The next point is that the eigenvector functions are multi-valued in 3 ways: there is a concern about the consistency of the eigenvalue, then a concern about a multiplicity greater than one for the same eigenvalue, and finally, even if each eigenvalue has multiplicity 1, we can still make an arbitrary choice for the eigenvector, because any vector parallel to an eigenvector is also an eigenvector. The eigenvector functions must also be differentiable based on the differentiability of the eigenprojections. The last thing we wish to note is how we choose the domain, D. We will answer these questions shortly, but we begin by including more definitions.

Suppose we have a function T : D → H_n(R), where D ⊂ R and H_n(R) is the space of all real symmetric n × n matrices. We now wish to define the eigenvalue functions on the matrix space in a way that keeps them well-defined and allows us to obtain a derivative calculation. For a given matrix A ∈ H_n(R), we obtain its spectrum, S(A) = {λ_1, λ_2, ..., λ_n}, where λ_i ∈ R for each i. Suppose that we now let the entries of the matrix A vary, so that for any matrix M that is "close enough" to A, we have functions λ_i : B → R with B ⊂ H_n(R), in which λ_i(M) is defined to be the eigenvalue in the ith place of the spectrum. We need to be careful about how we define these eigenvalue functions. To be precise, we let M_0 ∈ H_n(R) be such that the spectrum of M_0 contains no repeated eigenvalues.
We define a ball of radius r centered about M_0, B(M_0, r) = {M ∈ H_n(R) : ||M_0 − M|| < r}, where || · || is the Euclidean norm in which the matrices are viewed as vectors, and r is small enough that the ranges λ_i(B(M_0, r)) = {λ_i(M) ∈ R : M ∈ B(M_0, r)} are disjoint. Because the matrices are symmetric, and so we have real-valued eigenvalues and eigenvectors, the spectrum of A can be ordered by the simple inequality λ_1 ≥ λ_2 ≥ ... ≥ λ_n. Thus, we may define the spectrum function as an ordered tuple, namely S(A) = (λ_1, λ_2, ..., λ_n), in which case we may now define λ_i(A) = S(A)_i. Then we may change the condition to λ_1(A) > λ_2(A) > ... > λ_n(A) for all A ∈ B(M_0, r), so that we have a well-defined function restricted to a sufficiently small ball of symmetric matrices. Now, we see that for the composition λ_i ∘ T : D → R, we have eigenvalue functions like those described at the beginning of the section, but now well-defined when we restrict D to an interval in which T(x) ∈ B(M_0, r) for all x ∈ D.

As for the eigenvector functions v_i, we can follow the same process by looking at the composition v_i ∘ T : D → R^n. We still have the concern that for each eigenvalue λ_i(T(x)), we can have multiple values for v_i(T(x)); however, since we are restricted by the domain D, we are guaranteed to have distinct eigenvalues for each T(x), and thus every eigenvalue has multiplicity 1. This eliminates the fear that there will be eigenspaces of dimension greater than 1 for a given eigenvalue, but it still leaves the choice of an infinite family of eigenvectors lingering. This can be mended by choosing a fixed half-space of R^n, so that we can always choose the output eigenvector to be the one normalized to unit length within the fixed half-space. There is of course a problem if the eigenvector is located on the dividing hyperplane, but we can further restrict our domain D to make sure that this does not happen. Thus, our discussion shows that it is possible to determine well-defined eigenvalue and eigenvector functions.

Our last note is to show that the differentiability of the eigenprojections implies the differentiability of the eigenvector functions under the well-defined construction above. This follows easily by understanding that an eigenvector function is the composition of a differentiable eigenprojection function and the function that takes n × 1 matrices to the unit vector in the chosen half-space, which is also differentiable except at the intersection of the dividing hyperplane and the unit hypersphere in R^n. This finally allows us to use the theorem stated earlier in the section to show that these are holomorphic functions, and so in this case are differentiable. In the next section, we show how we might calculate these derivatives and then use them to automatically parametrize our kernel functions. A small numerical illustration of the construction above is given below.
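As a concrete illustration of this construction, the following NumPy sketch orders the spectrum in descending order and pins down each unit eigenvector by forcing its first nonzero component to be positive (one particular choice of half-space). It assumes the eigenvalues are simple, as in the discussion above; the function name is hypothetical.

```python
import numpy as np

def ordered_eigenpairs(A):
    """Well-defined eigenvalue/eigenvector functions for a real symmetric A:
    eigenvalues sorted in descending order, each unit eigenvector normalized
    to lie in a fixed half-space (first nonzero component positive)."""
    w, V = np.linalg.eigh(A)          # ascending eigenvalues, orthonormal columns
    order = np.argsort(w)[::-1]       # descending: lambda_1 >= ... >= lambda_n
    w, V = w[order], V[:, order]
    for j in range(V.shape[1]):
        k = np.flatnonzero(np.abs(V[:, j]) > 1e-12)[0]
        if V[k, j] < 0:               # pick the representative inside the half-space
            V[:, j] = -V[:, j]
    return w, V
```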
6.4.2 Derivatives of the Eigenvalue Functions and the Update Rule

We start with the assumption that there is a matrix M_0 ∈ H_n(R) and a ball B(M_0, r) on which the functions λ_i, along with the normalized eigenvector functions v_i, are differentiable. Then, we want to find out what the explicit derivative with respect to each component of the matrix should be. Let us consider a matrix in this domain, say A, with eigenvalue λ corresponding to the eigenvector v. Then we have the defining equation of the eigenvalue:

Av = λv    (6.1)

Implicitly differentiating the equation, we obtain:

∂A v + A ∂v = ∂λ v + λ ∂v    (6.2)

Now we include the condition that the eigenvector v is of unit length, i.e., ||v|| = 1. Multiplying both sides on the left by v^T, we get:

v^T ∂A v + v^T A ∂v = v^T ∂λ v + v^T λ ∂v    (6.3)

We notice that v^T ∂λ v = ∂λ v^T v = ∂λ ||v||² = ∂λ and v^T λ ∂v = λ v^T ∂v = λ ∂(½ v^T v) = ½ λ ∂(||v||²) = 0, both due to the condition ||v|| = 1. Thus, the right side of equation 6.3 simplifies, leaving us with:

v^T ∂A v + v^T A ∂v = ∂λ    (6.4)

Next, we apply the properties of a real symmetric matrix to simplify the left side of equation 6.4. We have v^T A ∂v = v^T A^T ∂v = (Av)^T ∂v = (λv)^T ∂v = λ v^T ∂v = ½ λ ∂(||v||²) = 0, where we use our preliminary assumptions that Av = λv, A = A^T, and ||v|| = 1. This yields the final simplified equation from which we will obtain our desired derivative:

v^T ∂A v = ∂λ    (6.5)

We now wish to obtain an expression for ∂λ/∂A_ij for any index (i, j). The definition of matrix multiplication shows us that:

v^T ∂A v = Σ_{s=1}^{n} Σ_{t=1}^{n} v_s ∂A_st v_t = ∂λ

Dividing both sides of the second equality by ∂A_ij, we have:

∂λ/∂A_ij = Σ_{s=1}^{n} Σ_{t=1}^{n} v_s v_t (∂A_st/∂A_ij) = v_i v_j

where the last equality comes from the fact that ∂A_st/∂A_ij equals 1 if i = s and j = t, and 0 otherwise.

We then let E = v v^T, which has components E_ij = v_i v_j, so that the derivatives can be written in the concise form ∂λ/∂A = E. Therefore, we have succeeded in finding the derivative rule for use in back-propagation with respect to any given matrix entry. As an important note, we can also see from the description of the matrix of derivatives, E, that E is a real symmetric matrix, and so applying an update rule directly to the matrix A, i.e., A := A − αE for some learning rate α ∈ R_{>0}, leaves the updated matrix A as another real symmetric matrix, because the space of real symmetric matrices is a vector space over the real numbers. Other update rules that treat each parameter individually will not necessarily update the matrix in such a way that the resulting matrix is real symmetric. Thus, most popular adaptive learning rate schemes, like Adam, may not be applied unless there is another way to prevent this from happening.

In practice, when we want to apply this to the biological setting using our kernel methods, we will have a kernel function that generates an adjacency or a Laplacian matrix, and it is the kernel parameters that have to be updated. We describe this in terms of an eigenvalue function λ. For a kernel function Φ(x; η, κ) and a weighted adjacency matrix or Laplacian matrix A that represents the distances between the atoms in a group, we can apply Φ to each entry of the matrix, i.e., we define Φ(A; η, κ) to be the matrix with entries Φ(A; η, κ)_ij = Φ(A_ij; η, κ). In this case, if we wish to update the η and κ values based on the loss through the eigenvalue function λ, then we calculate the derivatives with respect to η and κ by the chain rule, i.e.,

∂λ/∂x = Σ_{i=1}^{n} Σ_{j=1}^{n} (∂λ/∂Φ(A)_ij) (∂Φ(A)_ij/∂x), where x = η or x = κ.

The terms ∂Φ/∂x are easy to calculate and have been derived in an earlier work, while the terms ∂λ/∂Φ(A)_ij have been described above with the matrix E, so we have a full description of the derivatives of an eigenvalue function with respect to the parameters of a chosen kernel function that generates an adjacency/Laplacian matrix. A small numerical check of this derivative rule is sketched below.
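The following NumPy sketch numerically checks the rule ∂λ/∂A = vvᵀ combined with the chain rule above, for the largest eigenvalue of an exponential-kernel adjacency matrix built from random coordinates. The setup (random points, exponential kernel, specific values of η and κ) is purely illustrative; only the derivative formulas come from the derivation above.

```python
import numpy as np

def max_eig_and_vec(A):
    w, V = np.linalg.eigh(A)
    return w[-1], V[:, -1]

def kernel_matrix(D, eta, kappa):
    """Kernel-weighted adjacency Phi(D; eta, kappa) applied entrywise, zero diagonal."""
    A = np.exp(-(D / eta) ** kappa)
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(0)
X = rng.random((6, 3))                                   # 6 illustrative "atoms" in 3-D
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)     # pairwise distance matrix

eta, kappa, h = 1.0, 2.0, 1e-6
lam, v = max_eig_and_vec(kernel_matrix(D, eta, kappa))

# Analytic gradient: dlam/deta = sum_ij (v_i v_j) * dPhi(D_ij)/deta.
E = np.outer(v, v)
dPhi_deta = kernel_matrix(D, eta, kappa) * (kappa / eta) * (D / eta) ** kappa
analytic = np.sum(E * dPhi_deta)

# Finite-difference check of the same derivative.
lam_p, _ = max_eig_and_vec(kernel_matrix(D, eta + h, kappa))
numeric = (lam_p - lam) / h
print(analytic, numeric)   # the two values should agree closely
```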
We note that since the updates are made to the kernel parameters, and the kernel functions generate matrices that are real symmetric, the update rule will not produce any matrices outside of the real symmetric space, and thus the derivatives of the eigenvalue functions with respect to the entries of the matrices will remain valid throughout the training.

6.4.3 Discussion

We can implement the above ideas in any of the previously applied fixed-kernel methods involving the use of eigenvalues. A simple idea would be to analyze the graph consisting of atoms as vertices, with edges corresponding to the covalent bonds between them, calculate the eigenvalues of the Laplacian and adjacency matrices, and then record the sum (trace), average, non-zero minimum (Fiedler value), maximum, etc. as the features; a small sketch of such fixed spectral descriptors is given at the end of this section. This will generate a small number of features, but may provide a useful bolster to something like the geometric graph layer of the AweGNN. If more spectral features are desired, then we may consider element-specific features as in Nguyen et al. [80], but this may require a description of the derivatives of the eigenvector functions.

Although not much can be concluded about the performance of an auto-parametrized spectral layer, since it has not been tested, we can speculate that the performance will be good, given our experience with the AweGNN [100] and the performance of the comprehensive model in section 6.3, along with the success of spectral-based representations in previous works [80, 90]. The main concern would be the computational feasibility of the layer. The computational cost of calculating eigenvalues can become exorbitantly expensive for very large matrices, and so if we encounter large graph structures, we will have large Laplacian matrices, leading to potentially infeasible calculations. GPUs or other hardware/software that can provide a boost may be necessary.
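As a simple illustration, the NumPy sketch below computes fixed (non-learned) spectral descriptors from a kernel-weighted molecular graph; the covalent-bond variant mentioned above would use a 0/1 adjacency matrix instead. The exponential kernel, the default parameter values, and the function name are assumptions made for the example.

```python
import numpy as np

def spectral_descriptors(D, eta=1.0, kappa=2.0):
    """Simple spectral features from a kernel-weighted molecular graph:
    Laplacian trace, average eigenvalue, Fiedler value (smallest nonzero
    eigenvalue), and the largest eigenvalue."""
    A = np.exp(-(D / eta) ** kappa)            # kernel-weighted adjacency
    np.fill_diagonal(A, 0.0)
    L = np.diag(A.sum(axis=1)) - A             # weighted graph Laplacian
    w = np.linalg.eigvalsh(L)                  # real eigenvalues, ascending
    nonzero = w[w > 1e-10]
    fiedler = nonzero[0] if nonzero.size else 0.0
    return np.array([w.sum(), w.mean(), fiedler, w[-1]])
```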
The GGR layer performed well, but it was limited in scope by the computational cost of the calculations involved. The computational burden had to be lessened to further advance the possibilities of the GGR and to make multi-scale representations feasible. This was done with the GPU-enhanced AweGNN, in which the GGR layer was adapted to work with GPU technology. This gave us the opportunity to train a multi-scale model. To bolster the new GGR layer, we proposed a comprehensive neural network architecture which incorporates fixed topological and spectral features, drawing out the geometric and structural information of the molecules in conjunction with the element-specific features from the GGR layer. A heavily restricted implementation of this comprehensive network performed well on two datasets, ESOL and FreeSolv, which suggests that refining the network further could bring even greater success.

The theory for the auto-parametrized spectral layer was outlined, although no such layer has yet been built or tested. It is likely to do well, given the success of previous work with spectral features. There are two more kernel-based models that have not yet been explored with our auto-parametrization technique, namely the differential geometry-based models and the algebraic topology-based models. Given the promising theory laid down for the auto-parametrized spectral layer, we can be confident that in both cases we will find a way to automate the kernel parameter selection. There will always be an issue with large datasets and computation time, but we have improved our models before using parallel algorithms and GPUs, so it is likely that we can tackle this problem as it arises.

CHAPTER 7 DISSERTATION CONTRIBUTION

The contributions of this dissertation are as follows:

• In Chapter 3, we propose a new kernel function, the piece-wise linear kernel, which attempts to approximate the ideal low-pass filter more closely than the Lorentz and exponential kernels do (see the sketch after this list). This new piece-wise linear kernel has been tested in a multi-scale model including the Lorentz and exponential kernels on two datasets, ESOL and FreeSolv [111]; the results of these experiments are recorded in Section 6.3.3.

• Chapter 5 details the auto-parametrized weighted element-specific graph neural network (AweGNN), which was developed to address the limitations of kernel-based representations, and also to present a new and interesting model for experimentation. The AweGNN was tested on a series of toxicity datasets, using both single-task and multi-task architectures, and then on a small dataset involving solvation free energy predictions, all with excellent results [100].

• The GPU-enhanced AweGNN was developed to alleviate the computational expense of the original AweGNN model. The new structure and algorithm are detailed in Chapter 6.

• A new comprehensive architecture, including multi-scale auto-parametrized geometric graph features and fixed topological and spectral auxiliary features, is also proposed in Chapter 6. Limited experiments are conducted on the ESOL and FreeSolv datasets [111], with promising results.

• In Chapter 6, we also lay a theoretical foundation for the construction of an auto-parametrized kernel-based layer that produces representations based on the spectrum of Laplacian and adjacency matrices.
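For reference, the sketch below compares the kernel shapes mentioned in the first contribution. It assumes the commonly used generalized Lorentz and exponential forms, 1/(1 + (r/eta)^kappa) and exp(-(r/eta)^kappa); the piece-wise linear form shown is only an assumed trapezoidal approximation of the ideal low-pass filter, not the definition given in Chapter 3, and the names lorentz_kernel, exponential_kernel, piecewise_linear_kernel, a, and b are illustrative.

import numpy as np

def lorentz_kernel(r, eta, kappa):
    # Generalized Lorentz kernel: 1 / (1 + (r/eta)^kappa).
    return 1.0 / (1.0 + (r / eta) ** kappa)

def exponential_kernel(r, eta, kappa):
    # Generalized exponential kernel: exp(-(r/eta)^kappa).
    return np.exp(-((r / eta) ** kappa))

def piecewise_linear_kernel(r, a, b):
    # Assumed trapezoidal form: 1 for r <= a, linear decay to 0 over [a, b], 0 beyond.
    # This is a hypothetical stand-in for the piece-wise linear kernel of Chapter 3.
    return np.clip((b - r) / (b - a), 0.0, 1.0)

r = np.linspace(0.0, 10.0, 201)
print(lorentz_kernel(r, eta=3.0, kappa=2.0)[:5])
print(exponential_kernel(r, eta=3.0, kappa=2.0)[:5])
print(piecewise_linear_kernel(r, a=2.0, b=6.0)[:5])

A trapezoidal profile is flat near zero distance and drops linearly to zero at a finite cutoff, which is one natural piece-wise linear way to mimic an ideal low-pass filter.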
This dissertation was centered mainly around the concepts introduced and work done in the publication:

Szocinski, Timothy, Nguyen, Duc, and Wei, Guo-Wei (2021). “AweGNN: Auto-parametrized weighted element-specific graph neural networks for molecules”. Computers in Biology and Medicine 134, 104460. DOI: 10.1016/j.compbiomed.2021.104460.

BIBLIOGRAPHY

[1] S. Shivkumar, A. Speranzon, and R. Ghrist. “Localization Exploiting Topological Constraints”. In: Proc. Amer. Control Conf. (ACC) (2020).

[2] Ralph Abbey et al. Search Engines and Data Clustering.

[3] Kevin S Akers, Glendon D Sinks, and T Wayne Schultz. “Structure–toxicity relationships for selected halogenated aliphatic chemicals”. In: Environ. Toxicol. Pharmacol. 7.1 (1999), pp. 33–39.

[4] A. Sizemore et al. “The importance of the whole: topological data analysis for the network neuroscientist”. In: Network Neurosci. 3.3 (2019), pp. 656–673.

[5] Guo-Wei Wei et al. “Tailoring Wavelets for Chaos Control”. In: Phys. Rev. Lett. 89.284103 (2002).

[6] K. L. Xia et al. “Multiscale geometric modeling of macromolecules I: Cartesian representation”. In: Journal of Computational Physics 275 (2014), pp. 912–936.

[7] K. Klenin et al. “Derivatives of molecular surface area and volume: Simple and exact analytical formulas”. In: Journal of Computational Chemistry 32.12 (2011), pp. 2647–2653. DOI: 10.1002/jcc.21844.

[8] R. Adamczak et al. “Combining prediction of secondary structure and solvent accessibility in proteins”. In: Proteins 59.3 (2005), pp. 467–475. DOI: 10.1002/prot.20441.

[9] W. Rocchia et al. “Rapid grid-based construction of the molecular surface and the use of induced surface charge to calculate reaction field energies: Applications to the molecular systems and geometric objects”. In: Journal of Computational Chemistry 23 (2002), pp. 128–137.

[10] X. Feng et al. “Multiscale geometric modeling of macromolecules II: Lagrangian representation”. In: Journal of Computational Chemistry 34 (2013), pp. 2100–2120.

[11] Xin Feng et al. “Geometric modeling of subcellular structures, organelles and large multiprotein complexes”. In: Int. J. Numer. Method Biomed. Eng. 28 (2012), pp. 1198–1223.

[12] Marat Andreev et al. “Influence of Ion Solvation on the Properties of Electrolyte Solutions”. In: The Journal of Physical Chemistry B 122.14 (2018). PMID: 29611710, pp. 4029–4034. DOI: 10.1021/acs.jpcb.8b00518.

[13] Pedro J Ballester, Adrian Schreyer, and Tom L Blundell. “Does a more precise chemical description of protein–ligand complexes lead to more accurate prediction of binding affinity?” In: Journal of Chemical Information and Modeling 54.3 (2014), pp. 944–955.

[14] João Carlos Alves Barata and Mahir Saleh Hussein. “The Moore–Penrose Pseudoinverse: A Tutorial Review of the Theory”. In: Brazilian Journal of Physics 42.1-2 (Dec. 2011), pp. 146–165. ISSN: 1678-4448. DOI: 10.1007/s13538-011-0052-z.

[15] Maciej Barycki et al. “Multi-objective genetic algorithm (MOGA) as a feature selecting strategy in the development of ionic liquids’ quantitative toxicity–toxicity relationship models”. In: Journal of Chemical Information and Modeling 58.12 (2018), pp. 2467–2476.

[16] Ulrich Bauer. “Ripser: a lean C++ code for the computation of Vietoris–Rips persistence barcodes”. In: (2017). URL: https://github.com/Ripser/ripser.
[17] Stefan Behnel et al. “Cython: The best of both worlds”. In: Comput. Sci. Eng. 13.2 (2010), pp. 31–39.

[18] Béla Bollobás. Modern Graph Theory. 1st ed. Graduate Texts in Mathematics 184. Springer-Verlag New York, 1998. ISBN: 978-0-387-98488-9, 978-1-4612-0619-4.

[19] Simon Brandt et al. “Machine learning of biomolecular reaction coordinates”. In: The Journal of Physical Chemistry Letters 9.9 (2018), pp. 2144–2150.

[20] R. Brooks. “Constructing isospectral manifolds”. In: Amer. Math. Monthly 95 (1988), pp. 823–839.

[21] Keith T Butler et al. “Machine learning for molecular and materials science”. In: Nature 559.7715 (2018), pp. 547–555.

[22] Z. X. Cang and Guo-Wei Wei. “TopologyNet: Topology based deep convolutional and multi-task neural networks for biomolecular property predictions”. In: PLOS Computational Biology 13.7 (2017).

[23] Rich Caruana. Learning to Learn. Springer, 1998, pp. 95–133.

[24] Anupam Choudhary and Ravi Kshirsagar. “Process Speech Recognition System using Artificial Intelligence Technique”. In: International Journal of Soft Computing and Engineering (IJSCE) 2.5 (Nov. 2012). Ed. by Dr. Shiv Kumar, pp. 239–242. ISSN: 2231-2307. URL: https://www.ijsce.org/wp-content/uploads/papers/v2i5/E1054102512.pdf.

[25] Dan Ciregan, Ueli Meier, and Jürgen Schmidhuber. “Multi-column deep neural networks for image classification”. In: 2012 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2012, pp. 3642–3649.

[26] Steven J Darnell, Laura LeGault, and Julie C Mitchell. “KFC Server: interactive forecasting of protein interaction hot spots”. In: Nucleic Acids Research 36.suppl_2 (2008), W265–W269.

[27] S. Decherchi and W. Rocchia. “A general and robust ray-casting-based algorithm for triangulating surfaces at the nanoscale”. In: PLoS ONE 8:e59744 (2013).

[28] J.S. Delaney. 2004.

[29] Sarah Jane Delany and Pádraig Cunningham. “ECUE: A spam filter that uses machine learning to track concept drift”. In: Proceedings of the 2006 Conference on ECAI 2006: 17th European Conference on Artificial Intelligence, August 29–September 1, 2006, Riva del Garda, Italy. IOS Press, 2006, pp. 627–631.

[30] Li Deng, Geoffrey Hinton, and Brian Kingsbury. “New types of deep neural network learning for speech recognition and related applications: An overview”. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013, pp. 8599–8603.

[31] Jacob Devlin et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. arXiv: 1810.04805. URL: http://arxiv.org/abs/1810.04805.

[32] Ashay Dharwadker and Shariefuddin Pirzada. “Applications of Graph Theory”. In: Journal of the Korean Society for Industrial and Applied Mathematics 11.4 (2007).

[33] D. Duvenaud et al. arXiv preprint, 2015. arXiv: 1509.09292 [cs.LG].

[34] Herbert Edelsbrunner and John Harer. Computational Topology: An Introduction. American Mathematical Society, 2010, pp. I–XII, 1–241. ISBN: 978-0-8218-4925-5.

[35] Lance Eliot and Michael Eliot. Autonomous Vehicle Driverless Self-Driving Cars and Artificial Intelligence: Practical Advances in AI and Machine Learning. 1st ed. LBE Press Publishing, 2017. ISBN: 0692051023.

[36] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. Sebastopol, CA: O’Reilly Media, 2017. ISBN: 978-1491962299.
[37] Justin Gilmer et al. “Neural Message Passing for Quantum Chemistry”. In: Proceedings of the 34th International Conference on Machine Learning. Ed. by Doina Precup and Yee Whye Teh. Vol. 70. Proceedings of Machine Learning Research. PMLR, June 2017, pp. 1263–1272. URL: https://proceedings.mlr.press/v70/gilmer17a.html.

[38] Alexander Golbraikh et al. “Rational selection of training and test sets for the development of validated QSAR models”. In: Journal of Computer-Aided Molecular Design 17.2 (2003), pp. 241–253.

[39] Yoav Goldberg and Omer Levy. word2vec Explained: deriving Mikolov et al.’s negative-sampling word-embedding method. 2014. arXiv: 1402.3722. URL: http://arxiv.org/abs/1402.3722.

[40] Allen Hatcher. Algebraic Topology. Cambridge: Cambridge University Press, 2002, pp. xii+544. ISBN: 0-521-79160-X; 0-521-79540-0.

[41] Ian Holmes, Keith Harris, and Christopher Quince. Dirichlet Multinomial Mixtures: Generative Models for Microbial Metagenomics.

[42] Aapo Hyvärinen and Erkki Oja. “Independent component analysis: algorithms and applications”. In: Neural Networks 13 (2000), pp. 411–430.

[43] Sergey Ioffe and Christian Szegedy. “Batch normalization: Accelerating deep network training by reducing internal covariate shift”. In: International Conference on Machine Learning. PMLR, 2015, pp. 448–456.

[44] J. Li, P. Mach, and P. Koehl. “Measuring the shapes of macromolecules - and why it matters”. In: Comput. Struct. Biotechnol. J. 8:e201309001 (2013).

[45] Jian Jiang, Rui Wang, and Guo-Wei Wei. “GGL-Tox: Geometric Graph Learning for Toxicity Prediction”. In: Journal of Chemical Information and Modeling 61.4 (2021). PMID: 33719422, pp. 1691–1700. DOI: 10.1021/acs.jcim.0c01294.

[46] Jian Jiang et al. “Boosting tree-assisted multitask deep learning for small scientific datasets”. In: Journal of Chemical Information and Modeling 60.3 (2020), pp. 1235–1244.

[47] I.T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.

[48] Abdul Karim et al. “Efficient toxicity prediction via simple features using shallow neural networks and decision trees”. In: ACS Omega 4.1 (2019), pp. 1874–1888.

[49] Tosio Kato. Perturbation Theory for Linear Operators. 1st ed. Grundlehren der mathematischen Wissenschaften in Einzeldarstellungen mit besonderer Berücksichtigung der Anwendungsgebiete, Bd. 132. Springer-Verlag New York, 1966.

[50] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).

[51] Volodymyr Kovenko and Vitalii Shevchuk. l1 and l2 regularization. 2020. URL: https://machine-learning-and-data-science-with-python.readthedocs.io/en/latest/assignment3_sup_ml.html (visited on 2020).

[52] Siu Kwan Lam, Antoine Pitrou, and Stanley Seibert. “Numba: A LLVM-based Python JIT compiler”. In: Proceedings of the Second Workshop on the LLVM Compiler Infrastructure in HPC. 2015, pp. 1–6.

[53] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. “Deep learning”. In: Nature 521.7553 (2015), pp. 436–444.

[54] J. De Leeuw. “Theory of multidimensional scaling”. In: Handbook of Statistics, Volume II. North Holland Publishing Company, 1980.

[55] Lin Li, Chuan Li, and Emil Alexov. “On the modeling of polar component of solvation energy using smooth Gaussian-based dielectric function”. In: Journal of Theoretical and Computational Chemistry 13.03 (2014), p. 1440002.
[56] Lin Li, Chuan Li, and Emil Alexov. “On the modeling of polar component of solvation energy using smooth Gaussian-based dielectric function”. In: Journal of Theoretical and Computational Chemistry 13 (2014). DOI: 10.1142/S0219633614400021.

[57] Ruifeng Liu et al. “Assessing deep and shallow learning methods for quantitative prediction of acute chemical toxicity”. In: Toxicological Sciences 164.2 (2018), pp. 512–526.

[58] Laurens van der Maaten and Geoffrey Hinton. “Visualizing Data using t-SNE”. In: Journal of Machine Learning Research 9 (2008), pp. 2579–2605. URL: http://www.jmlr.org/papers/v9/vandermaaten08a.html.

[59] P. Mach and P. Koehl. “Geometric measures of large biomolecules: Surface, volume, and pockets”. In: J. Comp. Chem. 32 (2011), pp. 3023–3038.

[60] Alireza Makhzani et al. Adversarial Autoencoders. 2015. arXiv: 1511.05644. URL: http://arxiv.org/abs/1511.05644.

[61] John Mannes. Facebook’s Artificial Intelligence Research lab releases open source fastText on GitHub. 2018.

[62] Christopher D. Manning and Hinrich Schütze. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press, 1999. ISBN: 978-0-262-13360-9.

[63] Aleksandr V Marenich, Christopher J Cramer, and Donald G Truhlar. “Performance of SM6, SM8, and SMD on the SAMPL1 test set for the prediction of small-molecule solvation free energies”. In: The Journal of Physical Chemistry B 113.14 (2009), pp. 4538–4543.

[64] T. M. Martin. User’s Guide for T.E.S.T. (version 4.2) (Toxicity Estimation Software Tool): A Program to Estimate Toxicity from Molecular Structure. USEPA, 2016.

[65] Todd M Martin et al. “A hierarchical clustering methodology for the estimation of toxicity”. In: Toxicology Mechanisms and Methods 18.2-3 (2008), pp. 251–266.

[66] Meinhard E. Mayer. “The Feynman Integral and Feynman’s Operational Calculus”. In: Physics Today 54.8 (2001).

[67] W.S. McCulloch and W. Pitts. “A logical calculus of the ideas immanent in nervous activity”. In: Bulletin of Mathematical Biophysics 5 (1943), pp. 115–133.

[68] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. 2018. arXiv: 1802.03426. URL: http://arxiv.org/abs/1802.03426.

[69] Zhenyu Meng et al. “Weighted persistent homology for biomolecular data analysis”. In: Scientific Reports 10.1 (2020), pp. 1–15.

[70] Menglun Wang, Z. X. Cang, and Guo-Wei Wei. “A topology-based network tree for the prediction of protein-protein binding free energy changes following mutation”. In: Nature Machine Intelligence 2 (2020), pp. 116–123.

[71] M. Minsky and S. Papert. An Introduction to Computational Geometry. 1969. ISBN: 978-0-262-63022-1.

[72] Minxin Chen, Bin Tu, and Benzhuo Lu. “Triangulated manifold meshing method preserving molecular surface topology”. In: J. Mol. Graph. Model. 3 (2012), pp. 411–418.

[73] D. L. Mobley et al. 2014.

[74] David L Mobley et al. “Blind prediction of solvation free energies from the SAMPL4 challenge”. In: Journal of Computer-Aided Molecular Design 28.3 (2014), pp. 135–150.

[75] D. L. Mobley and J. P. Guthrie. 2014.

[76] A. Momen-Roknabadi et al. “Impact of residue accessible surface area on the prediction of protein secondary structures”. In: BMC Bioinformatics 9 (2008), p. 357. DOI: 10.1186/1471-2105-9-357.

[77] Dmitriy Morozov. Dionysus. 2015. URL: http://www.mrzv.org/software/dionysus/.
[78] Duc D Nguyen et al. “Rigidity strengthening: A mechanism for protein–ligand binding”. In: Journal of Chemical Information and Modeling 57.7 (2017), pp. 1715–1721.

[79] Duc Duy Nguyen, Zixuan Cang, and Guo-Wei Wei. “A review of mathematical representations of biomolecular data”. In: Physical Chemistry Chemical Physics 22.8 (2020), pp. 4343–4367.

[80] Duc Duy Nguyen and Guo-Wei Wei. “AGL-Score: Algebraic graph learning score for protein–ligand binding scoring, ranking, docking, and screening”. In: Journal of Chemical Information and Modeling 59.7 (2019), pp. 3291–3304.

[81] Duc Duy Nguyen and Guo-Wei Wei. “DG-GL: Differential geometry-based geometric learning of molecular datasets”. In: International Journal for Numerical Methods in Biomedical Engineering 35.3 (2019), e3179.

[82] Adam Paszke et al. “PyTorch: An imperative style, high-performance deep learning library”. In: arXiv preprint arXiv:1912.01703 (2019).

[83] Fabian Pedregosa et al. “Scikit-learn: Machine learning in Python”. In: Journal of Machine Learning Research 12 (2011), pp. 2825–2830.

[84] P. Perona. “Vision of a Visipedia”. In: Proceedings of the IEEE 98.8 (Aug. 2010), pp. 1526–1534. ISSN: 0018-9219. DOI: 10.1109/JPROC.2010.2049621.

[85] G. Pólya and S. Szegö. “Isoperimetric inequalities in Mathematical Physics”. In: Annals of Math. Studies 27 (1951).

[86] R. Brooks, P. Perry, and P. Yang. “Isospectral sets of conformally equivalent metrics”. In: Duke Math. J. 58 (1989), pp. 131–150.

[87] Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. “On the convergence of Adam and beyond”. In: arXiv preprint arXiv:1904.09237 (2019).

[88] Douglas A. Reynolds. “Gaussian Mixture Models”. In: Encyclopedia of Biometrics. 2009.

[89] F. Rosenblatt. “The perceptron: A probabilistic model for information storage and organization in the brain”. In: Psychological Review 65.6 (1958), pp. 386–408. ISSN: 0033-295X. DOI: 10.1037/h0042519.

[90] Rui Wang, Duc D Nguyen, and Guo-Wei Wei. “Persistent spectral graph”. In: International Journal for Numerical Methods in Biomedical Engineering 36.9 (2020).

[91] Kallu Samatha et al. “Chronic Kidney Disease Prediction using Machine Learning Algorithms”. In: International Journal of Preventive Medicine and Health (IJPMH) 1.3 (July 2021). Ed. by Dr. Shiv Kumar, pp. 1–4. ISSN: 2582-7588. DOI: 10.35940/ijpmh.C1010.071321. URL: https://www.ijpmh.latticescipub.com/wp-content/uploads/papers/v1i3/C1010071321.pdf.

[92] Jürgen Schmidhuber. “Deep learning in neural networks: An overview”. In: Neural Networks 61 (2015), pp. 85–117.

[93] Jeffrey Seely. Topological Data Analysis. 2016. URL: https://jsseely.github.io/notes/TDA/ (visited on 04/13/2016).

[94] X. Shi and P. Koehl. “Geometry and topology for modeling biomolecular surfaces”. In: Far East J. Applied Math 50 (2011), pp. 1–34.

[95] A. Shrake and J. A. Rupley. “Environment and exposure to solvent of protein atoms. Lysozyme and insulin”. In: J. Mol. Biol. 79.2 (1973). DOI: 10.1016/0022-2836(73)90011-9.

[96] Nicoleta Spinu et al. “Quantitative adverse outcome pathway (qAOP) models for toxicity prediction”. In: Archives of Toxicology 94 (2020), pp. 1497–1510.

[97] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting”. In: The Journal of Machine Learning Research 15.1 (2014), pp. 1929–1958.

[98] T. Sunada. “Riemannian coverings and isospectral manifolds”. In: Ann. Math. 121 (1985), pp. 169–186.

[99] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. “Sequence to sequence learning with neural networks”. In: arXiv preprint arXiv:1409.3215 (2014).
[100] Timothy Szocinski, Duc Duy Nguyen, and Guo-Wei Wei. “AweGNN: Auto-parametrized weighted element-specific graph neural networks for molecules”. In: Computers in Biology and Medicine 134 (2021), p. 104460. ISSN: 0010-4825. DOI: 10.1016/j.compbiomed.2021.104460. URL: https://www.sciencedirect.com/science/article/pii/S0010482521002547.

[101] The GUDHI Project. GUDHI User and Reference Manual. 3.4.1. GUDHI Editorial Board, 2021. URL: https://gudhi.inria.fr/doc/3.4.1/.

[102] Nenad Tomašev et al. Assessing Game Balance with AlphaZero: Exploring Alternative Rule Sets in Chess. 2020. arXiv: 2009.04374 [cs.AI].

[103] N. Volkmann. “Methods for segmentation and interpretation of electron tomographic reconstructions”. In: 483 (2010), pp. 31–46.

[104] Bao Wang et al. “Breaking the polar-nonpolar division in solvation free energy prediction”. In: Journal of Computational Chemistry 39.4 (2018), pp. 217–233.

[105] Ercheng Wang et al. “End-point binding free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug design”. In: Chemical Reviews 119.16 (2019), pp. 9478–9508.

[106] Junmei Wang et al. “Solvation model based on weighted solvent accessible surface area”. In: The Journal of Physical Chemistry B 105.21 (2001), pp. 5055–5067.

[107] Rui Wang, Duc Duy Nguyen, and Guo-Wei Wei. “Persistent spectral graph”. In: International Journal for Numerical Methods in Biomedical Engineering 36.9 (2020), e3376.

[108] Guo-Wei Wei. “Mathematical molecular bioscience and biophysics”. In: SIAM News 49.7 (2016).

[109] J. Weiser, P. S. Shenkin, and W. C. Still. “Approximate atomic surfaces from linear combinations of pairwise overlaps (LCPO)”. In: Journal of Computational Chemistry 20.2 (1999), pp. 217–230.

[110] Kedi Wu and Guo-Wei Wei. “Quantitative toxicity prediction using topology based multitask deep neural networks”. In: Journal of Chemical Information and Modeling 58.2 (2018), pp. 520–531.

[111] Zhenqin Wu et al. MoleculeNet: A Benchmark for Molecular Machine Learning. 2018. arXiv: 1703.00564 [cs.LG].

[112] Kelin Xia, Kristopher Opron, and Guo-Wei Wei. “Multiscale multiphysics and multidomain models—Flexibility and rigidity”. In: The Journal of Chemical Physics 139.19 (2013), 11B614_1.

[113] Hao Zhu et al. “Combinatorial QSAR modeling of chemical toxicants tested against Tetrahymena pyriformis”. In: Journal of Chemical Information and Modeling 48.4 (2008), pp. 766–784.

[114] Hao Zhu et al. “Quantitative structure–activity relationship modeling of rat acute toxicity by oral exposure”. In: Chemical Research in Toxicology 22.12 (2009), pp. 1913–1921.