COMPUTATIONAL METHODS FOR UNDERSTANDING ENVIRONMENTAL PROCESSES AND TOXICITY

By

Feng Gao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Crop and Soil Sciences - Environmental Toxicology – Doctor of Philosophy
Computational Mathematics, Science and Engineering – Dual Major

2019

ABSTRACT

COMPUTATIONAL METHODS FOR UNDERSTANDING ENVIRONMENTAL PROCESSES AND TOXICITY

By

Feng Gao

Understanding the environmental processes, risks, and toxicities of persistent organic pollutants (POPs) is needed to protect human and ecosystem health. The toxic effects of POPs on human health are usually assessed with a variety of time- and cost-intensive in vivo and in vitro experiments; in vivo evaluations utilize animals and are further complicated by ethical concerns. Computational models provide an alternative to laboratory-based experiments, and have recently become widely used to study reaction mechanisms, predict chemical toxicity, and support risk assessment. In this dissertation, I study two computational methods that could potentially be used to advance remediation of dioxin and assess chemical toxicity: 1) molecular dynamics simulation of dioxin adsorption in activated carbon pores, and 2) toxicity prediction with deep learning models, with a special focus on geometric scattering methods. Polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are ubiquitous environmental contaminants that resist chemical, biological, and physical routes of dissipation. They are well known for their toxicity, including adverse effects on reproductive health, impairment of mammalian immunity, and carcinogenicity. Adding activated carbons (ACs) to soils or sediments has been suggested as a means to promote the sequestration of PCDD/Fs in forms with reduced bioavailability, and hence reduced toxicity to humans and other mammals.
However, the mechanisms and processes of dioxin adsorption by ACs are not well understood. Thus, molecular dynamics simulations were used to study the mechanism of dioxin adsorption in AC pores and to evaluate the effects of pore size on dioxin adsorption. The results showed that smaller pores created a comparatively more hydrophobic sub-aqueous environment that promoted the adsorption of dioxins. Deep learning has achieved great success in image recognition, natural language processing, and many other tasks. Recently, the application of deep learning methods to toxicity prediction for organic molecules has gained increasing interest. Molecules can be treated as graphs, where atoms are nodes and bonds are edges. However, regular deep learning methods cannot be directly applied to data in a non-Euclidean space such as graphs. Therefore, geometric scattering methods that generalize regular scattering transforms to non-Euclidean spaces are proposed herein. The scattering transform was originally introduced to provide a mathematical understanding of convolutional networks. The results in this dissertation show that geometric scattering methods achieve near state-of-the-art results on multiple standard graph classification tasks and can be used for various explorations of biochemical data. Finally, geometric scattering was applied to toxicity prediction with real-world toxicity datasets. The results demonstrate that it has excellent potential as an alternative approach for toxicity prediction.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
CHAPTER 1 INTRODUCTION AND RESEARCH OBJECTIVES . . . . . . . . . . . 1
  1.1 Dioxin in the Environment . . . . . . . . . . . . . . . . . . . 1
  1.2 Molecular Dynamics Simulations . . . . . . . . . . . . . . . . 3
  1.3 Geometric Deep Learning . . . . . . . . . . . . . . . . . . . . 5
  1.4 Scattering Transform . . . . . . . . . . . . . . . . . . . . . 8
  1.5 QSAR and Toxicity Prediction . . . . . . . . . . . . . . . . . 9
  1.6 Research Objectives . . . . . . . . . . . . . . . . . . . . . . 13
CHAPTER 2 MOLECULAR DYNAMICS STUDY OF THE INFLUENCE OF ACTIVATED
          CARBON PORE-SIZE ON DIOXIN ADSORPTION . . . . . . . . . . . 14
  2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
  2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 14
  2.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  2.4 Computational Details . . . . . . . . . . . . . . . . . . . . . 18
  2.5 Results and Discussions . . . . . . . . . . . . . . . . . . . . 24
    2.5.1 Water in Activated Carbon Pores . . . . . . . . . . . . . 24
    2.5.2 Dioxin Sorption in Activated Carbon Pores . . . . . . . . 26
    2.5.3 PMF of Dioxin . . . . . . . . . . . . . . . . . . . . . . 27
    2.5.4 Sorption at Activated Carbon Edges . . . . . . . . . . . . 32
  2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 32
CHAPTER 3 GEOMETRIC SCATTERING FOR GRAPH DATA ANALYSIS . . . . . . . 35
  3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
  3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 35
  3.3 Graph Random Walks and Graph Wavelets . . . . . . . . . . . . . 38
  3.4 Geometric Scattering . . . . . . . . . . . . . . . . . . . . . 41
    3.4.1 Geometric Scattering Definitions . . . . . . . . . . . . . 41
    3.4.2 Stability and Capacity of Geometric Scattering . . . . . . 44
    3.4.3 Geometric Scattering Compared to Other Feed Forward Graph
          ConvNets . . . . . . . . . . . . . . . . . . . . . . . . . 45
  3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
    3.5.1 Detailed Dataset Descriptions . . . . . . . . . . . . . . 47
    3.5.2 Graph Classification on Social Networks and Biochemistry
          Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 50
    3.5.3 Classification with Low Training-data Availability . . . . 53
    3.5.4 Dimensionality Reduction . . . . . . . . . . . . . . . . . 54
    3.5.5 Data Exploration: Enzyme Class Exchange Preferences . . . 56
    3.5.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . 58
  3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 59
CHAPTER 4 TOXICITY PREDICTION OF SMALL ORGANIC MOLECULES . . . . . . 61
  4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
  4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 61
  4.3 Method Development and Evaluation . . . . . . . . . . . . . . . 63
    4.3.1 Geometric Scattering for Molecule Feature Generation . . . 64
    4.3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . 64
    4.3.3 Evaluation Criteria . . . . . . . . . . . . . . . . . . . 65
  4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

LIST OF TABLES

Table 3.1: Basic statistics of the biochemical graph classification databases . . . 49
Table 3.2: Basic statistics of the social network graph classification databases . . . 49
Table 3.3: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on social graph datasets . . . 51
Table 3.4: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on biochemistry graph datasets . . . 51
Table 3.5: Classification accuracy with different training/validation/test splits over scattering features (unnorm. moments) . . . 53
Table 3.6: Classification accuracy and dimensionality reduction with PCA over scattering features (unnorm. moments) . . . 56
Table 3.7: Dimensionality reduction with PCA over scattering features (unnorm. moments) . . . 56
Table 3.8: EC subspace analysis in scattering feature space of ENZYMES (Borgwardt et al., 2005) . . . 57
Table 3.9: Ablation study on five social network datasets using only one normalized moment and two normalized moments . . . 59
Table 3.10: Graph classification with FCLs and linear SVM classifiers . . . 59
Table 4.1: Training and prediction results with GBRT and LR for "Very Toxic" classification . . . 66
Table 4.2: Training and prediction results with GBRT and LR for "Nontoxic" classification . . . 66
Table 4.3: Sensitivity, specificity and overall prediction accuracy for "Very Toxic" classification . . . 66
Table 4.4: Sensitivity, specificity and overall prediction accuracy for "Nontoxic" classification . . . 67
Table 4.5: Training and prediction results for EPA hazard category classification . . . 67
Table 4.6: Training and prediction results for GHS hazard category classification . . . 67

LIST OF FIGURES

Figure 1.1: Data represented as graphs . . . 6
Figure 1.2: Brief summary of GCN history . . . 7
Figure 1.3: Signals on graph . . . 8
Figure 1.4: Scattering transform in Euclidean space . . . 10
Figure 2.1: Unit cell for a periodic model representing six unique layers of graphite that form an AC slit-pore embedded in water for the study of DD desorption processes . . . 15
Figure 2.2: Number of water molecules per nm^3 inside activated carbon pores of three different widths; the model is divided into 100 small bins along axis (1,0,0); the number density of water molecules in bulk water is around 33/nm^3 . . . 25
Figure 2.3: Potential of mean force for water molecules, in units of kT . . . 26
Figure 2.4: Dioxin moving trajectories in small AC pore . . . 27
Figure 2.5: Dioxin moving trajectories in medium AC pore . . . 28
Figure 2.6: Dioxin moving trajectories in large AC pore . . . 29
Figure 2.7: Side view of the moving trajectory of dioxin in large-pore AC model (water molecules removed for a clear view) . . . 30
Figure 2.8: The potential of mean force (kcal/mol) for desorption of DD from AC pores of three diameters. The AC pore edge is located at x = 0; x < 0 refers to the inside of the AC pore and x > 0 refers to the bulk water phase . . . 31
Figure 2.9: Dioxin moving trajectories at AC edge . . . 33
Figure 2.10: Radial distribution between oxygen atoms of dioxin and hydrogen atoms at activated carbon edges . . . 34
Figure 3.1: Wavelets Ψ_j for increasing scale 2^j left to right, applied to Diracs centered at two different locations (marked by red circles) in two different graphs. Vertex colors indicate wavelet values (corresponding to colorbars for each plot), ranging from yellow/green indicating positive values to blue indicating negative values. Both graphs are freely available from PyGSP (2018) . . . 41
Figure 3.2: Illustration of (a) the proposed scattering feature extraction (see Eqs. 3.4, 3.5, and 3.6), and (b) its application for graph data analysis . . . 43
Figure 3.3: (a) Box plot showing the drop in SVM classification accuracy over social graph datasets when reducing training set size (horizontal axis marks the portion of data used for testing); (b) relation between explained variance, SVM classification accuracy, and PCA dimensions over scattering features in the ENZYMES dataset . . . 54
Figure 3.4: Comparison of EC exchange preferences in enzyme evolution: (a) observed in Cuesta et al. (2015), and (b) inferred from scattering features via pref(EC-i, EC-j) := w_j · [min{D(i,j)/D(i,i), D(j,i)/D(j,j)}]^{-1}, where w_j = portion of enzymes in EC-j that choose another EC as their nearest subspace, and D(i,j) = mean distance of enzymes in EC-i from the PCA (90% explained variance) subspace of EC-j. Our inference (b) mainly recovers (a) . . . 57

CHAPTER 1

INTRODUCTION AND RESEARCH OBJECTIVES

1.1 Dioxin in the Environment

Dioxins and furans are a family of compounds bearing different numbers of chlorine atoms at various positions. Many of these compounds are considered toxic and can cause serious diseases. Among these chemicals, polychlorinated dibenzodioxins/furans (PCDD/Fs) have been studied extensively. They are usually generated through combustion of carbon-based materials or as by-products of chemical manufacturing processes. Most of these chemicals are lipophilic and hydrophobic (with high log Kow values of 5.6-8.0 for tetra- through octa-substituted chlorinated dibenzodioxins/furans). Thus, once they enter human or animal bodies, they are absorbed and stored in fat tissues (Sinkkonen and Paasivirta (2000)). In the natural environment, they usually originate from historical manufacturing releases and are found in sediments and soils. For example, Hilscherova et al.
(2003) reported that the concentrations of PCDD/Fs in sediments and flood-plain soils collected along the Tittabawassee River in Michigan ranged from 102 to 53,600 pg/g, dry wt in 2003, and such contamination can persist for decades. Researchers have studied the pathways by which dioxin affects human health for years (Diry et al. (2006)). Dioxin typically binds to an intracellular protein known as the aryl hydrocarbon receptor (AhR). Activated AhRs can alter the expression and/or function of genes, leading to the dysfunction of normal cells and eventually harming human and animal health. The resulting health effects include cancer, skin problems, liver damage, and reproductive issues, and depend on exposure time and amount. People today are exposed to dioxins through several routes. Consuming contaminated food (e.g., animal products) is one of the primary ways dioxins enter the human body: because they persist in fat tissues, they accumulate up the food chain. It is thus very important to find a remediation method to remove PCDD/Fs from sediments and soils to mitigate risks to human health and ecosystems. Recently, a new direction for remediation of soils contaminated with hydrophobic organic contaminants has emerged: in-situ sorbent amendment with carbonaceous geosorbents (CGs) such as activated carbons (ACs) and biochars (Ghosh et al. (2011)). Most CGs are found to have high sorption capacity for hydrophobic organic contaminants because of their relatively large surface areas and their strong interactions with contaminant molecules (Ghosh et al. (2011)). CGs added to soils/sediments can potentially sequester contaminants released from the contaminated soils/sediments, reduce the aqueous-phase concentration of the contaminants (thereby reducing their potential for exposure), and make the contaminants less bioavailable.
Therefore, it is very important to evaluate the capability of CGs for use as in-situ sorbent amendments to reduce the aqueous-phase concentration and bioavailability of PCDD/Fs in the natural environment. ACs are porous carbonaceous materials that are manufactured under different conditions. They have large pore surfaces and pore volumes, and are therefore ideal adsorbents for contaminants. In addition to large surface area, ACs usually carry many functional groups produced during manufacturing, which vary with conditions such as temperature. Depending on the nature of the precursor and on the activation process, oxygen-containing groups can be bound to the edges of the graphitic plates or to defect sites such as vacancies. These groups are usually polar in nature and can interact with polar species, increasing the affinity of the carbon for these substances. Oxygen functional groups can be present in several forms, such as carboxyl, carbonyl, and hydroxyl, and these can combine to form more complex arrangements. These functional groups also play an important role in the adsorption of contaminants. Although CGs have high sorption capacity for hydrophobic organic contaminants, the sorption process can be influenced by dissolved organic matter (DOM) or other organic molecules (e.g., small organic acids) in the natural environment, because these organic molecules are themselves sorbed by CGs. They can act as competitive sorbates or pore-blocking agents, reducing the sorption of target hydrophobic organic contaminants on CGs (Pignatello et al. (2006)). There are numerous examples where DOM such as humic and fulvic acids attenuated the sorption of organic compounds on carbonaceous materials (Cornelissen and Gustafsson (2004)). For example, sorption of phenanthrene to environmental black carbon (BC) was shown to be as much as one order of magnitude lower than sorption to pure BC.
This attenuation effect is due to native organic compounds and/or DOM molecules competing for, or blocking, CG sorption sites.

1.2 Molecular Dynamics Simulations

Molecular simulations have been widely used to study environmental problems, covering topics such as chemical reactions that take place in the environment (Liu et al. (2016)), tracking intermediate products during reactions (Fu et al. (2016)), and simulating complex environmental systems that are difficult to study via laboratory experiments alone (He et al. (2005)). These computational models aim to provide a theoretical foundation for many experimental phenomena, as well as to reveal processes that are hard or even impossible to observe experimentally. Density functional theory (DFT) and molecular dynamics (MD) simulations are two commonly used molecular simulation methods and have been widely explored in environmental science. DFT can perform relatively accurate energy calculations based on quantum chemistry. However, it demands enormous computational power, requires long run times, and is limited in system size; the systems it can treat are often too small to mimic the real environment (Cohen et al. (2008)). Molecular dynamics simulation, on the other hand, is a less computationally intense alternative. It can be adapted to simulate systems from the nano-scale to the macro-scale, which makes it suitable for relatively large environmental systems, and it provides molecular-level understanding of environmental issues. MD has been used to model complex environmental systems such as the self-assembly of nanotubes in water (Zou et al. (2006)), adsorption of contaminants on clays (Liu et al. (2015a)), C60 interaction with natural organic matter (Sun et al. (2013)), and many other environmental systems.
The basic idea of MD is to follow the classical equations of motion:

a = F / m    (1.1)

F = −∂U/∂r    (1.2)

to calculate the motion of the atoms at each time step. By calculating the forces, we obtain the velocities and positions of each atom at every time step and can then update the atoms' positions and velocities accordingly. To compute the force, we usually need the potential energy U first, since the force is the derivative of the potential energy. The potential energy U mainly includes two parts: non-bonded interactions U_non-bonded and bonding potentials U_intramolecule. The non-bonded interactions U_non-bonded are the interactions between non-bonded atoms. The Lennard-Jones potential is the most commonly used potential for this interaction:

v_LJ(r) = 4ε [ (σ/r)^12 − (σ/r)^6 ]    (1.3)

where σ is the distance at which the inter-particle potential is zero, r is the distance between the particles, and ε is the depth of the potential well, measuring how strongly the two particles attract each other. If electrostatic charges are present, Coulomb potentials should also be calculated:

v_Coulomb(r) = Q1 Q2 / (4π ε0 r)    (1.4)

Once the potential is calculated, one can easily calculate the forces according to Equation (1.2). The calculation of intramolecular bonding should include bonds, bend angles, and torsion angles, and usually depends on the chosen force field. Many algorithms are used to accelerate the updating of energies and forces in MD. For example, the Verlet algorithm (Verlet (1967)) is used to update positions and forces, and periodic boundary conditions (PBC) are used to mimic a larger periodic system. Despite the wide use of MD simulations, there are still limits on the achievable time and length scales. Simulation runs are typically short: typically t ∼ 10^3 − 10^6 MD steps.
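As a minimal illustration (not the simulation code used in Chapter 2), Equations (1.1)-(1.3) translate into an update loop along the following lines, written in reduced Lennard-Jones units; the parameter values and function names are hypothetical:

```python
import numpy as np

# Illustrative constants in reduced LJ units (hypothetical values).
SIGMA, EPS, DT = 1.0, 1.0, 1e-3

def lj_potential(r):
    """Lennard-Jones pair potential, Eq. (1.3)."""
    sr6 = (SIGMA / r) ** 6
    return 4.0 * EPS * (sr6 ** 2 - sr6)

def lj_force(r):
    """Magnitude of the pair force, F = -dU/dr (Eq. (1.2))."""
    sr6 = (SIGMA / r) ** 6
    return 24.0 * EPS * (2.0 * sr6 ** 2 - sr6) / r

def velocity_verlet_step(pos, vel, mass, forces_fn):
    """One velocity-Verlet integration step of a = F/m (Eq. (1.1))."""
    f = forces_fn(pos)
    vel_half = vel + 0.5 * DT * f / mass       # half-step velocity update
    pos_new = pos + DT * vel_half              # full-step position update
    vel_new = vel_half + 0.5 * DT * forces_fn(pos_new) / mass
    return pos_new, vel_new
```

Note that v_LJ vanishes at r = σ and reaches its minimum −ε at r = 2^(1/6) σ, where the force is zero; these identities are a convenient sanity check for any force-field implementation.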
The calculated energies also depend on the chosen force field and can vary considerably among different force fields.

1.3 Geometric Deep Learning

Deep learning utilizes multi-layer neural network models to learn patterns from data. It has gained great attention in recent years, and various deep learning algorithms have been proposed. Deep neural networks have proven to be powerful models for many machine learning tasks, e.g., natural language processing (Young et al. (2018)), object detection (Naik and Gandhi (2018)), and image classification (Rawat and Wang (2017)). Deep learning has also achieved great success in many other domains. For example, in chemistry it has been used to predict chemical reactions (Kayala and Baldi (2011)); in physics, to predict material energies; and in biochemistry, to find drug candidates (Chen et al. (2018)). The great success of deep learning is partially due to rapid increases in computational power and much cheaper storage. In particular, the wide adoption of GPU computing and the availability of more and more public datasets, such as LiDAR (Light Detection and Ranging) data and MNIST (Modified National Institute of Standards and Technology), allow researchers to explore more powerful neural network architectures on large-scale data. Among the many deep learning models, the Convolutional Neural Network (CNN) may be one of the most successful. CNNs have been widely used in image analysis, signal processing, and many other tasks where the data can be treated as functions in Euclidean space. CNNs can extract local features with learned filters to generate expressive representations of the data. Beyond data in Euclidean space, there are also many data that are better modeled by graphs. Graphs are data structures used to model pairwise relations between objects: a graph consists of vertices (nodes) connected by edges.
Indeed, graph-structured data are ubiquitous. For example, as shown in Figure 1.1, in chemistry molecules can be represented as graphs, where the atoms in the molecule are nodes and bonds are treated as edges. In social science, the relationships among individuals can be represented as a social network, where people are nodes connected according to different relationships. Another example, from biology, is protein-protein interactions, which can also be modeled as graphs. However, graph-structured data have no well-known Euclidean representation, and thus regular deep neural network models cannot be applied directly. Among the challenges such models face with graphs: there is no global parameterization or common system of coordinates, the vector space structure is not known, and properties such as shift-invariance do not exist (Bronstein et al. (2017)). Therefore, these deep learning models have to be adapted to deal efficiently with non-Euclidean data.

(a) Social network (b) Molecular graph (c) Protein-protein interaction network
Figure 1.1: Data represented as graphs

Recently there has been growing interest in generalizing regular deep learning methods to non-Euclidean spaces, an effort called geometric deep learning. Graph Neural Networks (GNNs) were proposed to generate embeddings of node features and graph structures. GNNs can be used for tasks such as graph classification, node classification, and link prediction. Many successful GNNs are inspired by the architectures of popular deep learning models such as the CNN, the recurrent neural network (RNN), and attention networks. In particular, the networks that generalize CNNs to graphs are called Graph Convolutional Networks (GCNs). The Graph Convolutional Network (GCN) proposed by Kipf and Welling (2017) is one of the most popular GCNs.
GCN generates a latent representation of each node by aggregating information from the node itself and its neighboring nodes. Other well-known graph neural networks include GraphSage (Hamilton et al. (2017)), the Graph Attention Network (GAT) (Veličković et al. (2018)), the Graph Isomorphism Network (GIN) (Xu et al. (2019)), and so on. There are roughly two ways to construct CNNs on graphs: spatial methods and spectral methods. A brief summary of graph convolutional networks in recent years is given in Figure 1.2. Though not a comprehensive survey of all GCNs, it shows the increasing interest in studying them.

Figure 1.2: Brief summary of GCN history

The architecture of spatial GCNs usually includes aggregation over node neighbors and pooling over the whole graph. Spectral approaches, on the other hand, work with a spectral representation of the graphs. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian; the operation can then be defined as the multiplication of a signal with a filter. The spectral methods of GCNs are closely related to graph signal processing. Graph signals are signals defined on graphs, as shown in Figure 1.3. The signals can be node features, edge features, or the structure of the whole graph. By constructing a filter directly on the eigenvalues of the graph Laplacian, we can work directly on the graph signals. Many GCNs are closely related to the Weisfeiler-Lehman (WL) test. The WL test was originally used for the graph isomorphism (GI) problem, which asks: given two graphs, are they topologically identical? The WL test iteratively aggregates the labels of nodes and their neighborhoods into unique new labels. The two graphs are considered non-isomorphic if, at some iteration, the node labels of the two graphs differ.
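The WL relabeling step just described can be sketched in a few lines; the adjacency structure is stored as plain dictionaries here, and `wl_iteration`/`wl_test` are illustrative names rather than a library API:

```python
from collections import Counter

def wl_iteration(adj, labels):
    """One Weisfeiler-Lehman relabeling pass.

    adj: dict mapping node -> list of neighbor nodes
    labels: dict mapping node -> current label
    A node's new label combines its own label with the sorted
    multiset of its neighbors' labels.
    """
    new_labels = {}
    for node, nbrs in adj.items():
        signature = (labels[node], tuple(sorted(labels[n] for n in nbrs)))
        new_labels[node] = hash(signature)  # compress signature to a label
    return new_labels

def wl_test(adj1, adj2, labels1, labels2, iterations=3):
    """Return False if the graphs are provably non-isomorphic."""
    for _ in range(iterations):
        labels1 = wl_iteration(adj1, labels1)
        labels2 = wl_iteration(adj2, labels2)
        if Counter(labels1.values()) != Counter(labels2.values()):
            return False  # label multisets differ -> non-isomorphic
    return True  # inconclusive: possibly isomorphic
```

For example, a 3-node path and a triangle (both with uniform initial labels) are separated after a single iteration, since the path's endpoints see one neighbor while every triangle node sees two.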
The WL test is an effective and computationally efficient method that distinguishes a broad class of graphs, though it fails in some specific cases such as regular graphs.

Figure 1.3: Signals on graph

1.4 Scattering Transform

A CNN consists of a series of convolution, non-linearity, and pooling operations followed by several fully connected layers, which makes it highly non-linear and makes it difficult to understand how the coefficient values relate to the signal properties. Therefore, in addition to the CNN, the scattering wavelet transform was proposed by Mallat to provide a mathematical understanding of convolutional networks (Mallat (2012)). In the scattering transform, however, the filters/wavelets are designed rather than learned. The scattering transform follows the general architecture of convolutional neural networks introduced by LeCun, which includes a cascade of convolutions and a "pooling" nonlinearity; in the scattering transform, the nonlinearity is the modulus of a complex number (Bruna and Mallat (2012)). Wavelets are oscillating waveforms that provide both time and frequency localization of a signal. A mother wavelet dilated by the scale 2^j has the form ψ_j(t) = 2^{-j} ψ(2^{-j} t). A wavelet transform

W_J f := {A_J f, Ψ_j f : j ≤ J} = {f ∗ φ_J, f ∗ ψ_j : j ≤ J}

computes signal convolutions with a filter bank of dilated and rotated wavelets. A_J is a low-pass filter, and A_J f extracts low-frequency features from the signal, while {Ψ_j f : j ≤ J} covers high frequencies up to scale J. The main difficulty is to compute translation-invariant scattering coefficients that remain stable under the action of diffeomorphisms and retain the high-frequency information provided by wavelets, since the wavelet transform itself is not translation invariant. To achieve translation invariance, the complex modulus nonlinearity and a low-pass filter are used.
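The convolve-modulus-average cascade can be sketched as follows for a 1-D signal. This is only an illustration of the architecture: the Gabor-like mother wavelet and the plain mean standing in for the low-pass filter A_J are simplifying assumptions, not the filter design used elsewhere in this dissertation:

```python
import numpy as np

def gabor_wavelet(n, scale):
    """A complex Gabor-like wavelet psi_j(t) = 2^-j psi(2^-j t),
    sampled on n points (illustrative choice of mother wavelet)."""
    t = np.arange(n) - n // 2
    s = 2.0 ** scale
    envelope = np.exp(-0.5 * (t / s) ** 2) / s
    return envelope * np.exp(1j * 3.0 * t / s)

def scattering_1d(f, max_scale=4):
    """First- and second-order scattering coefficients of a 1-D signal.

    Cascade per path: convolve with a wavelet, take the complex
    modulus, then average (a crude stand-in for the low-pass A_J).
    """
    n = len(f)
    coeffs = [f.mean()]                      # zeroth order: A_J f
    for j in range(1, max_scale + 1):
        u1 = np.abs(np.convolve(f, gabor_wavelet(n, j), mode="same"))
        coeffs.append(u1.mean())             # first-order coefficient
        for j2 in range(j + 1, max_scale + 1):
            u2 = np.abs(np.convolve(u1, gabor_wavelet(n, j2), mode="same"))
            coeffs.append(u2.mean())         # second-order coefficient
    return np.array(coeffs)
```

With max_scale = 4 this yields 1 + 4 + 6 = 11 coefficients per signal: one zeroth-order average, one first-order coefficient per scale, and one second-order coefficient per increasing scale pair (j < j2).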
With properly chosen wavelet filters, the scattering transform is provably invariant to the actions of certain Lie groups, such as the translation group, and is also provably Lipschitz-stable to small diffeomorphisms, where the size of a diffeomorphism is quantified by its deviation from a translation. The major differences between the scattering transform and the CNN are that CNN filters are learned with back-propagation while scattering filters are designed, and that wavelets can cover multiple scales within one layer via multiple dilations, whereas a CNN layer can only use fixed-size filters. They also share similarities: both have a cascade of convolutions and a non-linearity, which is pooling and ReLU activation in the CNN and the modulus in the scattering transform. Figure 1.4 shows the architecture of the scattering transform, where the blue text represents the scattering coefficients generated at each layer and the black text represents the intermediate process at each layer, where a new set of wavelets is added. Recently there has been increasing interest in generalizing the scattering transform to graphs and manifolds. Like the regular scattering transform, scattering transforms on graphs have been proven to have analogous properties, namely invariance and stability to small deformations (Gama et al. (2019)).

1.5 QSAR and Toxicity Prediction

The origin of QSAR can be traced back to the 1960s, when Hansch and his collaborators tried to understand the basis of the structure-activity relationships of plant growth regulators (Hansch et al. (1962)). Essentially, they tried to correlate biological activities with certain structural features of compounds. Nowadays QSAR is widely used to predict properties of molecules, on the premise that similar structures have similar properties. These properties include toxicity, energy, solubility, and many other physical and chemical properties. QSAR has indeed been used successfully to predict molecular properties; however, many QSAR models are not accurate enough to make reliable predictions and have many limitations. So experimental tests cannot be replaced in the
Indeed QSAR has been successfully used in predicting molecular properties, however many QSAR models are not accurate enough to make reliable predictions and have many limitations. So experimental tests cannot be replaced in the 9 Figure 1.4: Scattering transform in Euclidean space current situation. However, QSAR is a powerful tool to be a supplement method to lab experiments. Many QSAR models heavily depend on the choice of molecular representations. However, the representation of molecule structures in latent space is a challenging task: many chemical descriptors were proposed to represent molecular structures at different levels. For example, the representations could be the count of certain structures in the molecule, descriptors of molecular physical and chemical properties, and encoding of 2D and 3D structures. The molecular representations then can be further used for various tasks such as similarity searching (Klopmand (1992)), clustering (McGregor and Pallai (1997)) and classification (Wu et al. (2018)). The structure of a molecule can be represented by the connectivity of atoms according to chemical bonds. This kind of representation is usually considered as the molecular 2D-descriptor. The main advantage of this representation is that it contains straightforward information about molecular structure (connectivity), invariant to molecule rotation and translation, and finally doesn’t require structure optimization (bond information such as bond length could also be encoded as weights). In addition, more atom information such as atom types can also be encoded. However, 2D-descriptors do not contain 3D structural information and thus may not be able to reflect properties that rely on 3D structure. 10 One natural way to represent the connectivity of a molecule is to use a graph. Graph is used to model relations among multiple objects. A graph is usually denoted as G = (V, E), where V is a set of nodes and E is a set of edges. 
Nodes are the objects of interest, and they are connected by edges according to different relations. In addition to the graph structure, properties of the nodes can be taken into consideration as node features. For instance, a molecule can be represented as a graph where atoms are nodes and bonds are edges (possibly weighted edges, if bond lengths are considered); a protein can be considered a graph where the nodes are amino acids connected by edges reflecting the protein structure; a person's social network can also be constructed as a graph. A molecular graph is therefore a topological representation of a molecule. By transforming molecular structures into graphs, we can use many graph-theoretic methods to study these graph-structured data and extract graph features unique to each molecular graph. The connectivity information of a graph can be encoded by a square matrix called the adjacency matrix, denoted A. The off-diagonal entries of an adjacency matrix encode the connectivity of the graph, while the diagonal entries are all 0. For an unweighted adjacency matrix, the off-diagonal entries are either 0 or 1: if node i and node j are connected, the entry A_ij is 1; otherwise it is 0. A weighted adjacency matrix can also be used to encode bond information by replacing 1 with calculated bond distances. Another commonly used matrix is the degree matrix, a diagonal matrix containing the degree of each node in the graph. While 2D-descriptors are usually good enough to represent molecular graph structures and atom features, they may not be sufficient for tasks specifically related to 3D structure. Therefore, 3D-descriptors are also generated to reflect interactions that depend on 3D structure; for example, protein-ligand interactions depend on the orientations of the involved molecules.
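The adjacency and degree matrices described above can be built directly from a bond list. The three-atom chain below is a hypothetical toy example (not a molecule from any dataset used here):

```python
# Toy molecular graph: heavy atoms of a hypothetical 3-atom chain (e.g. C-C-O),
# nodes indexed 0..2; bonds are unweighted edges.
bonds = [(0, 1), (1, 2)]
n = 3

# Adjacency matrix A: off-diagonal 1s for bonded pairs; the diagonal stays 0.
A = [[0] * n for _ in range(n)]
for i, j in bonds:
    A[i][j] = A[j][i] = 1

# Degree matrix D: diagonal entries count the bonds incident to each atom.
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]
```

For a weighted molecular graph, the 1s in A would simply be replaced by calculated bond distances, as noted above.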
This is why 3D-descriptors such as CoMFA (Cramer et al. (1988)) were created early on, and their variants are still very popular today. In general, the representation of molecular structures should be invariant to rotation and translation: a rotated molecule should have the same representation as the original one, since the structure of the molecule does not change. One way to obtain rotation- and translation-invariant properties is data augmentation, including differently oriented structures with the same labels; however, this may not be efficient in certain situations. Another way is to follow a fixed rule when generating the representations, which may not be very flexible. Another property of a good molecular representation is that, ideally, a small change in molecular structure should cause only a small change in the representation. In other words, similar molecular structures should have similar representations. One application of QSAR is predicting the toxicity of organic compounds. Indeed, computational models are considered alternatives for toxicity assessment that avoid expensive and time-consuming in-vivo and in-vitro lab experiments. Lab experiments often include animal trials and raise ethical concerns that cannot be neglected. Another disadvantage of animal tests is that they may not provide sufficient guidance on human toxicity responses, owing to inter-species differences and differing disease models. Using computational models to predict the toxicity of organic compounds is also of great significance in drug development. During development, drugs must undergo clinical trials before they can legally enter the market. Specifically, drugs must prove to be nontoxic, or of acceptably low toxicity according to certain regulations, before they can be formally approved by the FDA. It is therefore of great importance to prevent toxic drug candidates from entering clinical trials.
NIH and EPA have been encouraging and supporting research on computational models for toxicity prediction through various projects such as the ToxCast (Dix et al. (2006)) and Tox21 (Tice et al. (2013)) projects. Large amounts of data have been collected over the past years, and several mainstream toxicity databases have been created. The Toxicology Data Network (TOXNET), created in 1985, is among the world's largest collections of toxicology databases. Toxicity ForeCaster (ToxCast) is also a widely used high-throughput toxicity database; it is a part of the Toxicology in the 21st Century (Tox21) program. Tox21 contains both acute and chronic toxicity information. There are also other well-known databases such as PubChem, DrugBank and EcoTox. In addition to these extensive collections of toxicity data, many computational models have been proposed, including statistical approaches such as linear regression analysis and multivariate analysis, early shallow neural network models, and newly developed deep learning methods. 1.6 Research Objectives Understanding environmental processes and toxicity is essential to protecting humans from toxic chemicals. Computational models are less expensive, much faster and more convenient, and they are one of the future directions of risk assessment. Molecular simulations have been used to study the fundamental mechanisms of many environmental issues. On the other hand, with more and more experimental data collected and organized into databases, researchers are aiming to find patterns in big data. There has been a new trend of applying machine learning and deep learning to risk assessment, and one common application is toxicity prediction.
Therefore, the research objectives are threefold: 1) to study the application of MD simulation to the adsorption of dioxin by activated carbons, elucidating the mechanism of dioxin adsorption on activated carbons, which can help us further understand the bioavailability of activated-carbon-adsorbed dioxin and provide important information for risk assessments; 2) following the recent trend of using deep learning and machine learning methods for toxicity prediction, to develop a state-of-the-art geometric scattering model based on graph wavelets for graph learning; and 3) to apply machine learning methods, including the proposed geometric scattering method, to toxicity prediction on publicly available datasets. CHAPTER 2 MOLECULAR DYNAMICS STUDY OF THE INFLUENCE OF ACTIVATED CARBON PORE-SIZE ON DIOXIN ADSORPTION 2.1 Abstract Adding activated carbons (ACs) to soils or sediments is rapidly gaining acceptance as a technology that promotes the sequestration of polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) and reduces their bioavailability to humans and other animals. Molecular dynamics simulations were performed to study the structural dynamics and free energies of dibenzo-p-dioxin desorption from hydrated AC pores to aqueous solution. An AC model with slit-pores of three different sizes, 0.4, 0.9, and 1.6 nm, was built to mimic AC pores spanning the micropore range (< 2 nm). By studying the adsorption of dioxin into these three pore sizes, our results revealed the geometry of dioxin adsorbed inside activated carbon pores and showed that the smallest pore had the strongest ability to adsorb dioxins. We propose that inside small pores, dioxins are restricted to a local area and can develop strong interactions with both carbon sheets. As the pore size increases, the interaction with both AC sheets is impaired by intercalated water molecules.
In addition, small pores create a more hydrophobic sub-environment that favors the accommodation of highly hydrophobic compounds such as dioxins. 2.2 Introduction Hydrophobic organic chemicals in the environment sorb to particle surfaces, can accumulate in soils and sediments, and raise potential issues for human health. In particular, polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are widespread contaminants well known for their toxicity, impairment of mammalian immunity, and persistence in the environment (Travis et al. (1989); Boyd et al. (2011); Kaplan et al. (2011)). Traditional remediation methods like dredging and disposal are expensive, destructive of ecosystems, and politically charged, so in-situ remediation methods are being developed and show potential for immobilizing contaminants through adsorption on soil/sediment amendments (Zhan et al. (2009, 2011); Athmer (2004)).

Figure 2.1: Unit cell for a periodic model representing six unique layers of graphite that form an AC slit-pore embedded in water for the study of DD desorption processes.

High surface-area carbonaceous geosorbents such as activated carbons (ACs) and biochars seem especially suitable for remediation of PCDD/Fs, which are both hydrophobic and highly resistant to microbial degradation (Cornelissen et al. (2006)). In fact, ACs seem capable of reducing (Kupryianchyk et al. (2013); Lohmann (2005); Cornelissen et al. (2005); Allen-King et al. (2002)) or even eliminating (Boyd et al. (2017)) the bioavailability of PCDD/Fs, and ACs have already been used in several pilot-scale remediation projects (Patmont et al. (2015)). Pyrolysis of organic biomass in low-oxygen environments yields a continuum of biochar materials depending on synthesis conditions, and high-temperature biochar is the basis for AC.
With increasing pyrolysis temperature, the proportion of polar functional groups and the O/C ratio decrease, with a concomitant increase in aromaticity and sp2 carbon relative to sp3, and an increased number and domain size of graphitic layers (Nguyen et al. (2010)). Thus, an idealized model for the AC matrix is graphite. However, AC is such an interesting and useful material because this graphitic matrix is highly porous. The porosity of ACs is typically characterized by adsorbing relatively inert gases at constant temperature across a wide range of partial pressures; the resulting adsorption isotherms are then used to fit models that yield estimates of total porosity and pore-size distributions (Dombrowski et al. (2000)). Various representations of AC pores have been used in such models, and it is reassuring that cylindrical-pore and slit-pore models yield similar pore-size distributions (Nguyen et al. (2007)). Further, similar results are obtained using the same slit-pore model with different probe gases (Dombrowski et al. (2000)). The results reveal that many ACs possess porosities of 200-700 mL/kg, with 40-80% of these pores in the microporous (pore diameter < 2 nm) and mesoporous (2 nm < pore diameter < 50 nm) size ranges (Dombrowski et al. (2000); Nguyen et al. (2007); Haro et al. (2011)). The aqueous concentration of dissolved PCDD/Fs in soils and sediments is exceptionally low, with typical values of 0.1 to 0.5 pg/L (Arp et al. (2009); Cornelissen et al. (2008b)). Thus, PCDD/Fs in soils or sediments are predominantly associated with solid surfaces. While AC-like phases are not dominant in soils on a mass basis, they are hypothesized to be the dominant sorbents for PCDD/Fs (Cornelissen et al. (2008b); Chai et al. (2011); Cornelissen et al. (2008a); Fagervold et al. (2010); Persson et al. (2002)). Though data are sparse, it appears that PCDD/Fs are concentrated in AC-like particles (Chai et al. (2011); Ghosh et al. (2000)).
The imputed hydrophobicity of the graphitic matrix, combined with the prevalence of micropores, implies a strong partitioning component to PCDD/F sorption to ACs (Ghosh et al. (2000); Dombrowski et al. (2000); Nguyen et al. (2007)). However, much of the total sorption must be adsorption to pore surfaces, with an unknown dependence on the pore-size distribution of a given AC. Generally, the stronger the adsorption, the less bioavailable PCDD/Fs would be expected to be, so understanding the fundamental mechanisms and energetics of adsorption/desorption is essential. The present study was undertaken to estimate adsorption energies and PCDD/F structural dynamics as a function of pore size. Molecular simulations have drawn increasing attention as a way of gaining environmental insight by revealing new mechanistic hypotheses at the molecular level. Both density functional theory (DFT) and molecular dynamics (MD) simulations are powerful tools for addressing complicated environmental issues, aided by increasing computational power and new models (Sedlak (2015)). Our experimental results show that ACs with higher micropore fractions at similar surface areas exhibit greater adsorption of dibenzo-p-dioxin (DD). In this study, we examine the mechanism of DD adsorption into AC micropores. The pore-size effects on DD adsorption are quantified by comparing adsorption free energies, which are estimated from the potential of mean force (PMF). The adaptive biasing force (ABF) method is used to measure the PMF (Liu et al. (2015b)), and the MD trajectory of dioxin movement is analyzed to characterize the behavior of dioxin inside AC pores. Thus, the simulations provide both structural and thermodynamic information for these complex processes. Our results show that small pores create a more hydrophobic environment that assists the adsorption of dioxin, and the smallest pore has a stronger estimated adsorption energy than the medium and large pores.
Our results suggest the hypothesis that pore-size effects of ACs should be taken into consideration in mechanistic descriptions of hydrophobic solute sorption by AC. For example, when applying AC to soils for in-situ dioxin remediation, it is plausible that a more microporous activated carbon may bind the dioxins more strongly and minimize the bioavailability of dioxins to organisms that may ingest the soil. 2.3 Method Commonly, gas-phase adsorption isotherms on AC are interpreted using a slit-pore model to represent the AC porosity (Dombrowski et al. (2000)). In accord with this convention, we created graphitic slit-pore models. These are idealized models of AC porosity in which all carbons are aromatic and no oxygenated functional groups are present on the graphitic surfaces; this simplifies the adsorption process as much as possible so that we can study the pore-size effects in isolation. Briefly, our slit-pore model was constructed using a packet of six graphene layers to represent the AC matrix. The spacing between adjacent carbon layers was initially set at 3.35 Å, the layer spacing of graphite. The space separating a six-layer packet from its neighboring image packet was considered a slit-pore (Figure 2.1). We set 9.0, 12.7 and 19.7 Å as the initial slit-pore sizes (interlayer distance between two neighboring carbon planes), referred to as the small-pore, medium-pore and large-pore ACs. For a molecule like dibenzo-p-dioxin (DD) in a pore within AC, desorption refers to the transfer of DD out of the AC pore into another bulk medium, typically water. To study DD adsorption/desorption processes in AC pores, we constructed two-phase systems to represent an AC slit-pore interacting with bulk water. To do so, some edges of the activated carbon must terminate; in nature, these edges may contain different functional groups such as carboxylate or hydroxyl groups, or even dangling bonds.
All of these groups may affect adsorption. The edges of the AC were saturated with hydrogen atoms (Figure 2.1) to minimize the effects of edge functional groups and focus on adsorption inside the pores. Each AC slit-pore was placed in a periodic orthorhombic box with rough dimensions 80 x 30 x h Å (where h, the height of the box, depends on the slit-pore thickness). A single DD molecule was added to each box, corresponding to a large but realistic sorbed concentration of 4600 mg DD/kg AC; we have measured sorption of DD by ACs, and this magnitude is in equilibrium with an aqueous solution of roughly 10 µg DD/L. The DD molecule was initially placed in the middle of the AC slit-pore (Figure 2.1). The length of the activated carbon and its slit-pore was 50 Å, leaving a box of 30 x 30 x h Å to hold and represent the "bulk" water phase of each system. Each system was then hydrated by overlaying a large box of equilibrated water molecules and adding water molecules at all positions within the unit cell not already occupied. After this process, there were 967 water molecules in the smallest-pore model, 1396 in the medium-pore model, and 2011 in the largest-pore model. These small-, medium-, and large-pore models are labeled ACS, ACM, and ACL, respectively. 2.4 Computational Details Molecular dynamics simulations were performed using the LAMMPS simulation package (Plimpton (1995)) with a time step of 1 fs. To estimate the adsorption energy of DD in AC slit-pores of different thicknesses, the potential of mean force (PMF) (Comer et al. (2015); Darve et al. (2002); Darve and Pohorille (2001)) was calculated using the adaptive biasing force (ABF) method implemented in the COLVARS package of LAMMPS. Our methods closely followed those of Liu et al. (2015a) in their study of PCB desorption from swelling clay. The SPC model was used for water, together with the SHAKE algorithm, which keeps the water bond lengths and angle fixed.
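The quoted sorbed concentration of 4600 mg DD/kg AC follows from one DD molecule per 3300-C-atom AC model; the following back-of-envelope check uses standard atomic masses rather than values taken from the simulation input files:

```python
# Hedged arithmetic check: one dibenzo-p-dioxin molecule (C12H8O2) per
# AC model of 3300 carbon atoms.
M_DD = 12 * 12.011 + 8 * 1.008 + 2 * 15.999   # g/mol for C12H8O2, ~184.2
M_AC = 3300 * 12.011                           # g per mole of 3300-C AC models
conc_mg_per_kg = M_DD / M_AC * 1e6             # (g/g) -> mg/kg
# conc_mg_per_kg is roughly 4600 mg DD/kg AC, matching the text
```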
The Lennard-Jones interactions were truncated at 10 Å with an analytical tail correction, and the PPPM (particle-particle particle-mesh) method with a 10 Å real-space cutoff and a relative accuracy of 0.0001 was used to compute long-range Coulombic interactions. To pre-equilibrate the system, simulations were first run in the NVE ensemble for 1 ns with all AC layers fixed, then for 10 ns in the NPT ensemble with no atoms fixed. These longer runs to equilibrate the micropore-bulk-water two-phase system were performed at 298 K, a pressure of 0.1 MPa, and a step size of 1 fs. In this study, we applied the PCFF force field to the AC system and to the DD solute. The principal PCFF (class 2) parameters are summarized below:

Masses (amu): type 1 (C) 12.01115; type 2 (aromatic H) 1.00797; type 3 (dioxin O) 15.99940; type 4 (water O) 15.99940; type 5 (water H) 1.00797.

Pair coefficients (ε in kcal/mol, σ in Å): type 1 (0.0640, 4.0100); type 2 (0.0200, 2.9950); type 3 (0.2400, 3.5350); type 4 (0.2740, 3.6080); type 5 (0.0130, 1.0980).

Bond coefficients (r0 in Å; K2, K3, K4): type 1 (1.4170, 470.8361, -627.6179, 1327.6345); type 2 (1.0982, 372.8251, -803.4526, 894.3173); type 3 (1.3768, 428.8798, -738.2351, 1114.9655); type 4 (0.9700, 563.2800, -1428.2200, 1902.1200).

Angle coefficients (θ0 in degrees; K2, K3, K4): type 1 (118.90, 61.0226, -34.9931, 0.0000); type 2 (117.94, 35.1558, -12.4682, 0.0000); type 3 (123.42, 73.6781, -21.6787, 0.0000); type 4 (115.07, 47.1131, -32.5592, 13.1257); type 5 (103.70, 49.8400, -11.6000, -8.0000).

The dihedral and improper coefficients, together with the cross-term coefficients (bond-bond, bond-angle, angle-angle, angle-angle-torsion, end-bond-torsion, middle-bond-torsion, bond-bond-1-3, and angle-torsion), follow the standard PCFF assignments for these atom, bond, and angle types.

During these NPT simulations, the unit-cell dimensions relaxed somewhat to achieve their equilibrium values.
The lengths (79.2 Å) and depths (29.2 Å) of each system were the same, while the equilibrated heights were 23.2 Å for the ACS system, 28.2 Å for ACM, and 35.2 Å for ACL. The center-to-center distances between nearest-neighbor "inner" graphene sheets across the slit were about 7.7, 12.4, and 19.7 Å for the ACS, ACM, and ACL systems, respectively. The pore sizes are smaller than these values by about 3.3 Å, because a pore-size probe would not measure down to the C-atom plane, only down to the van der Waals radii of the C atoms in graphene. Thus, the final pore sizes (as would be measured experimentally) were about 4.4, 9.1, and 16.4 Å for the ACS, ACM, and ACL systems, respectively. These three effective slit-pore thicknesses each correspond to an AC model containing 3300 C atoms, and thus represent ACs with micropore-sized porosities of 98, 202, and 364 mL/kg AC, respectively. Despite the simplicity of our slit-pore model, this study thus covered realistic porosities in the micropore size range that dominates the porosity of most ACs. For example, the experimental pore-size distribution of a well-studied AC was found to have a maximum at a slit-pore width of 12 Å (Dombrowski et al. (2000)), so the pores of this modeling study bracket that value nicely. Calculation of the potential of mean force (PMF) was performed in the NVT ensemble in two steps. First, an NVT steered molecular dynamics simulation (298 K, time step of 1 fs) was run with a reaction coordinate along the (1,0,0) direction: the pre-equilibrated DD molecule was slowly dragged along that direction at a rate of 1 Å/ns, spending the first 10 ns inside the pore and then passing into the bulk water for another 10 ns outside the pore, generating the original 20-ns trajectory along the reaction coordinate.
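The quoted effective pore widths and micropore porosities can be reproduced from the numbers above. The 50 x 29.2 Å slit-pore footprint is taken from the system dimensions given in the Method section; this is a hedged arithmetic check, not output of the simulations themselves:

```python
# Effective pore width = center-to-center C-plane distance minus ~3.3 A
# (the van der Waals correction described in the text).
NA = 6.022e23
centers = {"ACS": 7.7, "ACM": 12.4, "ACL": 19.7}       # C-plane distances, Angstrom
pores = {k: c - 3.3 for k, c in centers.items()}        # -> 4.4, 9.1, 16.4 Angstrom

# Porosity = pore volume / AC mass for the 3300-C-atom model.
mass_kg = 3300 * 12.011 / NA / 1000.0                   # mass of the AC model, kg
length, depth = 50.0, 29.2                              # slit-pore footprint, Angstrom
porosity = {k: w * length * depth * 1e-24 / mass_kg     # 1 A^3 = 1e-24 mL
            for k, w in pores.items()}
# porosity rounds to {ACS: 98, ACM: 202, ACL: 364} mL/kg, matching the text
```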
Second, that trajectory was divided into 20 1-ns windows, and in each window the adaptive biasing force was applied to calculate the change in energy in every small bin of width 0.05 Å. Each time window was sampled more than 1,000,000 times. The overall free energy profile for the desorption process was constructed by linking the potential of mean force (PMF) profiles of all adjacent windows. The moving trajectory and radial distribution functions between different groups were obtained by analyzing the movement of dioxin during the molecular dynamics runs after the adaptive biasing force was applied. We have used the same methods to study the adsorption energies of PCBs interacting with clay minerals (Liu et al. (2015a)). Thermodynamically, this modeled desorption process should be reversible, in that the PMF energy should be equal in magnitude but opposite in sign to that of an adsorption process in which DD begins in the bulk water phase and is then dragged into the AC pore. We tested this using anthracene (the same size as DD) and a cylindrical pore of diameter 12 Å; the desorption energy was +45 kJ/mol, while the adsorption energy was -40 kJ/mol. The latter was probably underestimated, since the anthracene molecule had drifted about 5 Å too close to the pore entrance during the MD equilibration before the ABF simulation began, pulling its initial potential energy down somewhat. Several other attempts at ABF simulations of adsorption were unsuccessful, because the anthracene or dioxin molecules got stuck at the entrances to cylindrical or slit pores and could not be pulled in. Thus, we focus on the process of dioxin desorption from AC slit-pores. 2.5 Results and Discussion 2.5.1 Water in Activated Carbon Pores Adsorption of DD from water into AC pores can be considered an exchange between water and DD for adsorptive sites. When DD moves from inside an AC pore to the outside bulk water, a corresponding volume of water molecules is expected to enter the pore.
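The step of linking the PMF profiles of adjacent windows can be sketched as follows. This is a minimal illustration of the stitching idea rather than the COLVARS implementation, and it assumes adjacent windows share one boundary bin:

```python
def stitch_pmf(windows):
    """Join per-window free-energy profiles into one continuous PMF by
    shifting each window so its first bin coincides with the previous
    window's last bin (the shared boundary bin is not duplicated)."""
    profile = list(windows[0])
    for w in windows[1:]:
        offset = profile[-1] - w[0]              # align the boundary bin
        profile.extend(v + offset for v in w[1:])
    return profile
```

Because each ABF window determines the free energy only up to an additive constant, aligning the overlapping bins in this way yields one continuous profile along the whole reaction coordinate.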
In this section, we first studied the relative water density inside pores of different sizes.

Figure 2.2: Number of water molecules per nm³ inside activated carbon pores of three different widths; the model is divided into 100 small bins along the (1,0,0) axis; the number of water molecules per nm³ in bulk water is around 33.

Figure 2.2 shows the number of water molecules per nm³ in AC pores along the (1,0,0) direction, which is the direction of dioxin moving out of the pore. The result reveals that the water density in AC pores increased with increasing pore size. This suggests that smaller AC pores hold less water per unit volume and create less hydrophilic environments. This sub-aqueous environment favors the adsorption of hydrophobic organic compounds such as DD in the smaller AC pores. Instead of performing the same PMF calculation described in the computational details for dioxin, we have a simpler way to obtain the PMF for water: analyzing the distribution of water before and after dioxin is adsorbed. The radial distribution function alone does not give the distribution along a particular dimension, so the most intuitive approach is to study the water density along one direction. In Figure 2.2 we analyze the relative water density along (1,0,0), which is the direction of dioxin moving out of the pore and thus the direction of water moving from the bulk into the pore. Relative water density is defined as the bin concentration divided by the concentration across the entire cell, and is dimensionless. Figure 2.2 shows the water density in our three activated carbon models along the x axis. Pore water density increases with pore size. This suggests that small pores create more hydrophobic environments, which in turn may imply that hydrophobic organic compounds such as dioxin adsorb more strongly from water into small AC pores.

Figure 2.3: Potential of mean force for water molecules, in units of kT.
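The water-density analysis of Figure 2.2, and its conversion to the water PMFs of Figure 2.3, can be sketched as follows. This is a minimal post-processing illustration with hypothetical function names, not the analysis scripts actually used:

```python
import math

def density_profile(x_coords, x_min, x_max, nbins, bin_volume_nm3):
    """Bin water oxygen x coordinates along (1,0,0) and convert the
    per-bin counts to molecules per nm^3."""
    width = (x_max - x_min) / nbins
    counts = [0] * nbins
    for x in x_coords:
        i = min(int((x - x_min) / width), nbins - 1)
        counts[i] += 1
    return [c / bin_volume_nm3 for c in counts]

def water_pmf(density, bulk_density, kT=1.0):
    """W(x) = -kT * ln(rho(x) / rho_bulk); empty bins are formally
    infinite (no water ever observed there)."""
    return [-kT * math.log(d / bulk_density) if d > 0 else math.inf
            for d in density]
```

Bins with water density below the bulk value (about 33 molecules per nm³) then show a positive, unfavorable PMF, as observed for the smallest pore.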
The potential of mean force of a water molecule can be calculated simply from the water density, W(x) = −k_B T ln ρ(x) + const, since the density represents the probability of finding a water molecule at a given position inside the pore; here ρ(x) is the local water density (Wang et al. (2015)) and the constant sets the zero level. Figure 2.3 shows the PMFs calculated in this manner. The largest pore shows the lowest (most favorable) energy, while the smallest pore shows the highest, suggesting that water molecules sorb most strongly to the largest pore, again supporting the hypothesis that small pores are more hydrophobic than larger pores. In fact, in the simulations with the smallest activated carbon pores, the number of water molecules inside the pore is quite limited: these pores can hold at most one layer of water molecules. This may be due not only to the limited pore size, but also to the hydrophobicity of the graphene surfaces. 2.5.2 Dioxin Sorption in Activated Carbon Pores When DD molecules are adsorbed inside pores, they tend to take positions parallel to the AC surfaces and in direct contact with those surfaces; in other words, DD partially dehydrates. The distance between the DD and AC surface atomic planes is 3.3 Å.

Figure 2.4: Dioxin moving trajectories in the small AC pore

In Figures 2.4, 2.5 and 2.6, the moving trajectories of dioxin during the simulations in the small, medium and large pores are marked with green lines. Dark red represents the initial position of dioxin and lighter colors denote later positions. Even when DD started in the middle of an AC pore, it quickly found its way to an AC surface in all cases, as illustrated in Figure 2.7, in which all water molecules are removed for a clearer view. 2.5.3 PMF of Dioxin The PMF of dioxin was calculated to estimate the adsorption/desorption energy for DD interacting with ACs of different pore sizes.
The PMF is the energy for pulling dioxin out of the AC pore into the bulk water phase.

Figure 2.5: Dioxin moving trajectories in the medium AC pore

As shown in Figure 2.8, the energy of DD moving inside the AC pores remains almost unchanged until a distance of 5 Å from the AC edge, and the energy shows no apparent difference among the ACs with three different pore sizes. When DD moved to within 5 Å of the AC edge, it interacted with the bulk water phase, resulting in a sharp increase of the PMF. There is a transition located 3 Å from the AC edge in the bulk water, at which the energy fluctuated because the DD molecule was moving along the AC edges before finally exiting the pore. When DD entered the water phase and reached 8 Å from the AC edge, the PMF approached a plateau owing to DD's distance from the AC.

Figure 2.6: Dioxin moving trajectories in the large AC pore

When the pore size is smallest (4.4 Å, just larger than the thickness of DD), DD can essentially only move in two dimensions (Figure 2.4), and it has the highest PMF at 15.8 kcal/mol. In this situation, dioxin has the maximum contact with AC surfaces (two) and minimal contact with water. In contrast, when dioxin is inside the largest micropore (16.4 Å), it is separated by water molecules from at least one AC surface, and the desorption energy is only 9.2 kcal/mol. The PMF of dioxin in the medium pore is in between, at 10.8 kcal/mol. All of these desorption energies are comparable to the experimental free energy change upon dissolving DD crystals in water, which is about +9 kcal/mol based upon converting a solubility of 2.74e-5 mol/L (Darve et al. (2002); Darve and Pohorille (2001)). These large, unfavorable desorption energies are consistent with the simulated dynamics shown in the previous section. Desorption of DD from AC increased the system free energy.
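The ~+9 kcal/mol figure can be recovered from the quoted solubility under the common assumption of ideal mole-fraction statistics. This is a hedged back-of-envelope conversion, not necessarily the exact method of the cited work:

```python
import math

# Free energy of transfer, DD crystal -> aqueous solution, from solubility,
# assuming an ideal mole-fraction reference (55.5 mol/L is the molarity of water).
R = 1.987e-3          # kcal/(mol K)
T = 298.0             # K
S = 2.74e-5           # mol/L, aqueous solubility of DD (from the text)
x = S / 55.5          # mole fraction of dissolved DD at saturation
dG = -R * T * math.log(x)
# dG comes out near +8.6 kcal/mol, consistent with the ~ +9 kcal/mol quoted
```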
This energy increase with desorption may be viewed from at least two perspectives: 1) for each system, one set of interactions between an AC graphene-like surface and DD is lost, with two sets lost for the small-pore AC system; and 2) the entire DD molecule and the AC surface(s) to which it was previously adsorbed become hydrated.

Figure 2.7: Side view of the moving trajectory of dioxin in the large-pore AC model (water molecules removed for a clearer view)

Figure 2.8: The potential of mean force (kcal/mol) for desorption of DD from AC pores of three widths. The AC pore edge is located at x = 0; x < 0 refers to the inside of the AC pore and x > 0 refers to the bulk water phase.

Compared to the adsorbed state, this change is largest for the small-pore AC system, because two new DD-sized hydrophobic surfaces on the AC become hydrated, plus both planar sides of the DD molecule itself. For the medium-pore and large-pore AC systems, just one new DD-sized hydrophobic surface on the AC becomes hydrated, plus just one planar side of the DD molecule (see Figures 2.4, 2.5 and 2.6 for visualization). Note that hydration of previously unhydrated DD molecular surface is thought to be energetically unfavorable. As cited above, transfer from bulk DD to bulk water has an energy penalty of about +9 kcal/mol, while transfer of DD from the ideal gas phase to water has an estimated free energy penalty of about +4.5 kJ/mol (Rogers and Hahn (2010)). The molecular simulations of the present study support the concept that DD hydration is unfavorable and contributes to the strongly unfavorable desorption of DD from AC. 2.5.4 Sorption at Activated Carbon Edges Another interesting phenomenon is the drop in energy at the edges. This energy change may be due to adsorption taking place at the edges. When dioxin is moving along the edges, it can even bend, presumably to maximize its interactions with the edges.
Strong AC-edge-dioxin interactions are also supported by the trajectory of dioxin moving at the edges (Figure 2.9). From the trajectory we can see that when dioxin moves out of the pore into the water, it does not move directly into the water phase. Instead, dioxin moves along the edges. The edges of activated carbons are usually saturated by various functional groups such as hydroxyl or carboxyl groups. In our model, we use the simplest termination, hydrogen atoms, to saturate the edges. Even so, this suggests that dioxin may be adsorbed onto the edges of activated carbons if certain functional groups are present. Our radial distribution functions (Figure 2.10) also support this possibility. Here we use the largest pore as an example; results for the other two pore sizes are similar. In Figure 2.10, a peak appears near 3 Å, suggesting interactions between dioxin and edge hydrogens. This interaction may arise from an edge-to-face (T-shaped) aromatic interaction, in which a hydrogen atom of one benzene ring points perpendicular to the plane of the other ring.

2.6 Conclusion

Activated carbons made from different materials, pyrolyzed under different conditions, and activated by different methods have different pore size distributions and different functional groups. In this work we studied the effects of pore size on the adsorption of water and dioxin by activated carbon surfaces. Since dioxin adsorption takes place from water, the competition between water and dioxin molecules for adsorption should be taken into account. We used a simplified model for AC with minimal functional groups so that the adsorption depends maximally on the pore sizes. Our results show that small pores can create a more hydrophobic environment that assists the adsorption of dioxin. The smallest pore has a stronger estimated adsorption energy than the medium and large pores. Adsorption not only takes place inside the pores, but may also occur at the activated carbon edges.
Within AC micropores, dioxin sorption appears quite strong, since the desorption thermodynamics are highly unfavorable due to a) loss of favorable DD-AC interactions such as pi-pi and other nonbonded interactions, and b) unfavorable hydration of DD upon transfer to bulk water.

Figure 2.9: Dioxin moving trajectories at the AC edge

Our results suggest the hypothesis that pore-size effects of ACs should be taken into consideration in mechanistic descriptions of hydrophobic solute sorption by AC. For example, when applying AC to soils for in-situ dioxin remediation, it is plausible that a more microporous activated carbon may bind the dioxins more strongly and minimize their bioavailability to organisms that may ingest the soil.

Figure 2.10: Radial distribution between oxygen atoms of dioxin and hydrogen atoms at activated carbon edges

CHAPTER 3

GEOMETRIC SCATTERING FOR GRAPH DATA ANALYSIS

3.1 Abstract

We explore the generalization of scattering transforms from traditional (e.g., image or audio) signals to graph data, analogous to the generalization of ConvNets in geometric deep learning, and the utility of the extracted graph features in graph data analysis. In particular, we focus on the capacity of these features to retain informative variability and relations in the data (e.g., between individual graphs, or in aggregate), while relating our construction to previous theoretical results that establish the stability of similar transforms to families of graph deformations. We demonstrate the application of our geometric scattering features in graph classification of social network data, and in data exploration of biochemistry data.

3.2 Introduction

Over the past decade, numerous examples have established that deep neural networks (i.e., cascades of linear operations and simple nonlinearities) typically outperform traditional "shallow" models in various modern machine learning applications, especially given the ever-increasing availability of big data.
Perhaps the best-known example of the advantages of deep networks is in computer vision, where the use of 2D convolutions enables network designs that learn cascades of convolutional filters, which have several advantages over fully connected network architectures, both computationally and conceptually. Indeed, in terms of supervised learning, convolutional neural networks (ConvNets) hold the current state of the art in image classification, and have become the standard machine learning approach for processing big structured-signal data, including audio and video processing. See, e.g., Goodfellow et al. (2016, Chapter 9) for a detailed discussion. Beyond their performance on specific tasks, pretrained ConvNet layers have been explored as image feature extractors, by freezing the first few pretrained convolutional layers and then retraining only the last few layers for specific datasets or applications (e.g., Yosinski et al., 2014; Oquab et al., 2014). Such transfer learning approaches provide evidence that suitably constructed deep filter banks should be able to extract task-agnostic semantic information from structured data, and in some sense mimic the operation of human visual and auditory cortices, thus supporting the neural terminology in deep learning. An alternative approach to such universal feature extraction was presented in Mallat (2012), where a deep filter bank, known as the scattering transform, is designed, rather than trained, based on predetermined families of disruptive patterns that should be eliminated to extract informative representations. The scattering transform is constructed as a cascade of linear wavelet transforms and nonlinear complex modulus operations that provides features with guaranteed invariance to a predetermined Lie group of operations such as rotations, translations, or scaling. Further, it also provides Lipschitz stability to small diffeomorphisms of the input signal.
Following recent interest in geometric deep learning approaches for processing graph-structured data (see, for example, Bronstein et al. (2017) and references therein), several attempts have been made to generalize the scattering transform to graphs (Zou and Lerman, 2018; Gama et al., 2019) and manifolds (Perlmutter et al., 2018), which we will generally term "geometric scattering". These works mostly follow in the footsteps of Mallat (2012) in establishing the stability of their respective constructions to deformations of input signals or graphs. Their results essentially characterize the type of disruptive information eliminated by geometric scattering, by providing upper bounds for distances between scattering features, phrased in terms of a deformation size. Here, we further explore the notion of geometric scattering features by considering the complementary question of how much information is retained by them, since stability alone does not ensure useful features in practice (e.g., a constant all-zero map would be stable to any deformation, but would clearly be useless). In other words, we examine whether a geometric scattering construction, defined and discussed in Sec. 3.4, can be used as an effective task-independent feature extractor from graphs, and whether the resulting representations are sufficiently rich to enable intelligible data analysis by applying traditional (Euclidean) methods. We note that for Euclidean scattering, while stability is established with rigorous theoretical results, the capacity of scattering features to form an effective data representation in practice has mostly been established via extensive empirical examination.
Indeed, scattering features have been shown to be effective in several audio (e.g., Bruna and Mallat, 2013a; Andén and Mallat, 2014; Lostanlen and Mallat, 2015; Andén et al., 2018) and image (e.g., Bruna and Mallat, 2013b; Sifre and Mallat, 2014; Oyallon and Mallat, 2015; Angles and Mallat, 2018) processing applications, and their advantages over learned features are especially relevant in applications with relatively low data availability, such as quantum chemistry and materials science (e.g., Hirn et al., 2017; Eickenberg et al., 2017, 2018; Brumwell et al., 2018). Similarly, our examination of geometric scattering capacity focuses on empirical results on several data analysis tasks, and on two commonly used graph data types. Our results in Sec. 3.5.2 show that on social network data, geometric scattering features enable a classic RBF-kernel SVM to match, if not outperform, leading graph kernel methods as well as most geometric deep learning ones. These experiments are augmented by additional results in Sec. 3.5.3 that show the geometric scattering SVM classification rate degrades only slightly when trained on far fewer graphs than are traditionally used in graph classification tasks. On biochemistry data, where graphs represent molecular structures of compounds (e.g., enzymes or proteins), we show in Sec. 3.5.4 that scattering features enable significant dimensionality reduction. Finally, to establish their descriptive qualities, in Sec. 3.5.5 we use geometric scattering features extracted from enzyme data (Borgwardt et al., 2005) to infer emergent patterns of enzyme commission (EC) exchange preferences in enzyme evolution, validated with established knowledge from Cuesta et al. (2015). Taken together, these results illustrate the power of the geometric scattering approach as both a relevant mathematical model for geometric deep learning, and as a suitable tool for modern graph data analysis.
3.3 Graph Random Walks and Graph Wavelets

The Euclidean scattering transform is constructed using wavelets defined on ℝ^d. In order to extend this construction to graphs, we define graph wavelets as the difference between lazy random walks that have propagated at different time scales, which mimics classical wavelet constructions found in Meyer (1993) and more recent constructions found in Coifman and Maggioni (2006). The underpinnings for this construction arise from graph signal processing, and in particular the properties of the graph Laplacian.

Let G = (V, E, W) be a weighted graph, consisting of n vertices V = {v1, . . . , vn}, edges E ⊆ {(vℓ, vm) : 1 ≤ ℓ, m ≤ n}, and weights W = {w(vℓ, vm) > 0 : (vℓ, vm) ∈ E}. Note that unweighted graphs are considered as a special case, by setting w(vℓ, vm) = 1 for each (vℓ, vm) ∈ E. Define the n × n (weighted) adjacency matrix A_G = A of G by A(vℓ, vm) = w(vℓ, vm) if (vℓ, vm) ∈ E and zero otherwise, where we use the notation A(vℓ, vm) to denote the (ℓ, m) entry of the matrix A, so as to emphasize the correspondence with the vertices in the graph and to reserve sub-indices for enumerating objects. Define the (weighted) degree of vertex vℓ as deg(vℓ) = Σ_m A(vℓ, vm), and the corresponding diagonal n × n degree matrix D by D(vℓ, vℓ) = deg(vℓ) and D(vℓ, vm) = 0 for ℓ ≠ m. Finally, the n × n graph Laplacian matrix L_G = L on G is defined as L = D − A, and its normalized version is N = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2}. We focus on the latter due to its close relationship with graph random walks. The normalized graph Laplacian is a symmetric, real-valued, positive semi-definite matrix, and thus has n non-negative eigenvalues. Furthermore, if we set 0 = (0, . . .
, 0)^T to be the n × 1 vector of all zeros, and d(vℓ) = deg(vℓ) to be the n × 1 degree vector, then one has N d^{1/2} = 0 (where the square root is taken entrywise). Therefore 0 is an eigenvalue of N, and we write the n eigenvalues of N as 0 = λ0 ≤ λ1 ≤ · · · ≤ λ_{n−1} ≤ 2, with corresponding n × 1 orthonormal eigenvectors ϕ0, ϕ1, . . . , ϕ_{n−1}. If the graph G is connected, then λ1 > 0. In order to simplify the following discussion we assume that this is the case, although the discussion below can be amended to include disconnected graphs as well. One can show that ϕ0 = d^{1/2}/‖d^{1/2}‖, meaning ϕ0 is non-negative. Since every other eigenvector is orthogonal to ϕ0 (and thus must take both positive and negative values), it is natural to view the eigenvectors ϕk as the Fourier modes of the graph G, with a frequency magnitude proportional to λk. The fact that ϕ0 is in general non-constant, as opposed to the zero-frequency mode on the torus or real line, reflects the non-uniform distribution of vertices in non-regular graphs.

Let x : V → ℝ be a signal defined on the vertices of the graph G, which we will consider as an n × 1 vector with entries x(vℓ). It follows that the Fourier transform of x can be defined as x̂(k) = x · ϕk, where x · y is the standard dot product. This analogy is one of the foundations of graph signal processing, and indeed we could use this correspondence to define wavelet operators on the graph G, as in Hammond et al. (2011). Rather than follow this path, though, we instead take a related path similar to Coifman and Maggioni (2006) and Gama et al. (2019), defining the graph wavelet operators in terms of random walks on G, which will avoid diagonalizing N and will allow us to control the "spatial" graph support of the filters directly.

Define the n × n lazy random walk matrix as P = (1/2)(I + A D^{−1}). Note that the column sums of P are all one.
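As a numerical sanity check on the definitions above, the following sketch builds N and P for a small graph (a hypothetical 4-vertex adjacency matrix, chosen only for illustration) and verifies the stated properties: N annihilates d^{1/2}, the spectrum of N lies in [0, 2], and P has unit column sums.

```python
import numpy as np

# Toy 4-vertex connected graph (hypothetical adjacency, illustration only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)                           # (weighted) degrees
D_inv_sqrt = np.diag(deg ** -0.5)
N = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
d_sqrt = np.sqrt(deg)                         # the vector d^(1/2)
eigvals = np.sort(np.linalg.eigvalsh(N))      # should lie in [0, 2]

P = 0.5 * (np.eye(len(A)) + A @ np.diag(1.0 / deg))  # lazy random walk matrix

print(np.allclose(N @ d_sqrt, 0))             # 0 is an eigenvalue of N
```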
It follows that P acts as a Markov operator, mapping probability distributions to probability distributions. We refer to P as a lazy random walk matrix since P^t governs the probability distribution of a lazy random walk after t steps. A single realization of a random walk is a walk (in the graph-theoretic sense) vℓ0, vℓ1, vℓ2, . . . in which the steps are chosen randomly; lazy random walks additionally allow vℓi = vℓi+1 (i.e., the walk may stay in place). More precisely, suppose that µ0(vℓ) ≥ 0 for each vertex vℓ and ‖µ0‖_1 = 1, so that µ0 is a probability distribution on G. We take µ0(vℓ) as the probability of a random walk starting at vertex vℓ0 = vℓ. One can verify that µ1 = Pµ0 is also a probability distribution; each entry µ1(vℓ) gives the probability of the random walk being located at vℓ1 = vℓ after one step. The probability distribution for the location of the random walk after t steps is µt = P^t µ0.

The operator P can be considered a low-pass operator, meaning that Px replaces x(vℓ) with localized averages of x(vℓ) for any x. Indeed, expanding Px(vℓ), one observes that Px(vℓ) is the weighted average of x(vℓ) and the values x(vm) for the neighbors vm of vℓ. Similarly, the value P^t x(vℓ) is the weighted average of x(vℓ) with all values x(vm) such that vm is within t steps of vℓ. Low-pass operators defined on Euclidean space retain the low frequencies of a function while suppressing the high frequencies, and the random walk matrix P behaves similarly. Indeed, P is diagonalizable with n eigenvectors φk = D^{1/2} ϕk and eigenvalues ωk = 1 − λk/2. Let yx = D^{−1/2} x be a density-normalized version of x and set xt = P^t x; then one can show

y_{xt} = ŷx(0) ϕ0 + Σ_{k=1}^{n−1} ωk^t ŷx(k) ϕk .   (3.1)

Thus, since 0 ≤ ωk < 1 for k ≥ 1, the operator P^t preserves the zero frequency of x while suppressing the high frequencies, up to a density normalization. High-frequency responses of x can be recovered in multiple different fashions, but we utilize multiscale wavelet transforms that group the non-zero frequencies of G into approximately dyadic bands. As shown in Mallat (2012, Lemma 2.12), wavelet transforms are provably stable operators in the Euclidean domain, and the proof of Zou and Lerman (2018, Theorem 5.1) indicates that similar results on graphs may be possible. Furthermore, the multiscale nature of wavelet transforms will allow the resulting geometric scattering transform (Sec. 3.4) to traverse the entire graph G in one layer, which is valuable for obtaining global descriptions of G. Following Coifman and Maggioni (2006), define the n × n wavelet matrix at the scale 2^j as

Ψj = P^{2^{j−1}} − P^{2^j} = P^{2^{j−1}} (I − P^{2^{j−1}}) .   (3.2)

A calculation similar to the one required for (3.1) shows that Ψj x partially recovers ŷx(k) for k ≥ 1. The value Ψj x(vℓ) aggregates the signal information x(vm) from the vertices vm that are within 2^j steps of vℓ, but does not average the information like the operator P^{2^j}. Instead, it responds to sharp transitions or oscillations of the signal x within the neighborhood of vℓ of radius 2^j (in terms of the graph path distance). The smaller the wavelet scale 2^j, the higher the frequencies Ψj x recovers in x. The wavelet coefficients up to the scale 2^J are

Ψ(J)x(vℓ) = [Ψj x(vℓ) : 1 ≤ j ≤ J] .   (3.3)

Figure 3.1 plots the wavelets on two different graphs.

Figure 3.1: Wavelets Ψj for increasing scale 2^j, left to right, applied to Diracs centered at two different locations (marked by red circles) in two different graphs: (a) a sample graph of the bunny manifold; (b) the Minnesota road network graph.
Vertex colors indicate wavelet values (corresponding to the colorbars for each plot), ranging from yellow/green for positive values to blue for negative values. Both graphs are freely available from PyGSP (2018).

3.4 Geometric Scattering

A geometric wavelet scattering transform follows a construction similar to the (Euclidean) wavelet scattering transform of Mallat (2012), but leverages a graph wavelet transform. In this chapter we utilize the wavelet transform defined in (3.3) of the previous section, but remark that in principle any graph wavelet transform could be used (see, e.g., Zou and Lerman, 2018). In Sec. 3.4.1 we define the graph scattering transform, in Sec. 3.4.2 we discuss its relation to other recently proposed graph scattering constructions (Gama et al., 2019; Zou and Lerman, 2018), and in Sec. 3.4.3 we describe several of its desirable properties as compared to other geometric deep learning algorithms on graphs.

3.4.1 Geometric Scattering Definitions

Machine learning algorithms that compare and classify graphs must be invariant to graph isomorphism, i.e., re-indexations of the vertices and corresponding edges. A common way to obtain invariant graph features is via summation operators, which act on a signal x = x_G that can be defined on any graph G, e.g., x(vℓ) = deg(vℓ). The geometric scattering transform, which is described in the remainder of this section, follows such an approach. The simplest summation operator computes the sum of the responses of the signal x. As described in Verma and Zhang (2018), this invariant can be complemented by higher-order summary statistics of x, the collection of which are statistical moments, and which are also referred to as "capsules" in that work.
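The invariance-versus-equivariance distinction above can be seen in a few lines of NumPy; the signal and permutation below are arbitrary toys, not tied to any dataset in this chapter.

```python
import numpy as np

# A summation over vertices is invariant to re-indexation, while the raw
# signal vector is only equivariant (its entries get shuffled). Toy data.
rng = np.random.default_rng(0)
x = rng.uniform(size=7)                      # a signal on 7 vertices
perm = np.array([3, 0, 6, 2, 5, 1, 4])       # a re-indexation of the vertices

same_sum = np.isclose(x.sum(), x[perm].sum())  # summation operator: invariant
same_vec = np.array_equal(x, x[perm])          # raw signal: merely shuffled
```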
For example, the unnormalized qth moments of x yield the following "zero"-order scattering moments:

Sx(q) = Σ_{ℓ=1}^{n} x(vℓ)^q ,  1 ≤ q ≤ Q .   (3.4)

We can also replace (3.4) with normalized (i.e., standardized) moments of x, in which case we store its mean (q = 1), variance (q = 2), skew (q = 3), kurtosis (q = 4), and so on. In what follows we discuss the unnormalized moments, since their presentation is simpler. The invariants Sx(q) do not capture the full variability of x, and hence of the graph G upon which the signal x is defined. We thus complement these moments with summary statistics derived from the wavelet coefficients of x, which will lead naturally to the graph ConvNet structure of the geometric scattering transform.

Observe, analogously to the Euclidean setting, that in computing Sx(1), which is the summation of x(vℓ) over V, we have captured the zero frequency of yx = D^{−1/2} x, since Σ_{ℓ=1}^{n} x(vℓ) = x · 1 = yx · d^{1/2} = ‖d^{1/2}‖ ŷx(0). Higher-order moments of x can incorporate the full range of frequencies in x, e.g. Sx(2) = Σ_{ℓ=1}^{n} x(vℓ)^2 = Σ_k x̂(k)^2, but they are mixed into one invariant coefficient. We can separate and recapture the high frequencies of x by computing its wavelet coefficients Ψ(J)x, which were defined in (3.3). However, Ψ(J)x is not invariant to permutations of the vertex indices; in fact, it is equivariant. Before summing the individual wavelet coefficient vectors Ψj x, though, we must first apply a pointwise nonlinearity. Indeed, define 1 = (1, . . . , 1)^T to be the n × 1 vector of all ones, and note that P^T 1 = 1, meaning that 1 is a left eigenvector of P with eigenvalue 1. It follows that Ψj^T 1 = 0, and thus Σ_{ℓ=1}^{n} Ψj x(vℓ) = Ψj x · 1 = 1^T Ψj x = 0. We thus apply the absolute value nonlinearity, to obtain nonlinear equivariant coefficients |Ψ(J)x| = {|Ψj x| : 1 ≤ j ≤ J}.
We use the absolute value because it is equivariant to vertex permutations, non-expansive, and, when combined with traditional wavelet transforms on Euclidean domains, yields a provably stable scattering transform for q = 1. Furthermore, initial theoretical results in Zou and Lerman (2018) and Gama et al. (2019) indicate that similar graph-based scattering transforms possess certain types of stability properties as well. As in (3.4), we extract invariant coefficients from |Ψj x| by computing its moments, which define the first-order geometric scattering moments:

Sx(j, q) = Σ_{ℓ=1}^{n} |Ψj x(vℓ)|^q ,  1 ≤ j ≤ J, 1 ≤ q ≤ Q .   (3.5)

These first-order scattering moments aggregate complementary multiscale geometric descriptions of G into a collection of invariant multiscale statistics. These invariants give a finer partition of the frequency responses of x. For example, whereas Sx(2) mixed all frequencies of x, we see that Sx(j, 2) only mixes the frequencies of x captured by Ψj.

First-order geometric scattering moments can be augmented with second-order geometric scattering moments by iterating the graph wavelet and absolute value transforms. These moments are defined as

Sx(j, j′, q) = Σ_{ℓ=1}^{n} |Ψj′ |Ψj x(vℓ)||^q ,  1 ≤ j < j′ ≤ J, 1 ≤ q ≤ Q ,   (3.6)

which consists of reapplying the wavelet transform operator Ψ(J) to each |Ψj x| and computing the summary statistics of the magnitudes of the resulting coefficients. The intermediate equivariant coefficients |Ψj′ |Ψj x|| and resulting invariant statistics Sx(j, j′, q) couple two scales 2^j and 2^{j′} within the graph G, creating features that bind patterns of smaller subgraphs within G with patterns of larger subgraphs (e.g., circles of friends of individual people with larger community structures in social network graphs). The transform can be iterated additional times, leading to third-order features and beyond, and thus has the general structure of a graph ConvNet.
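To make the cascade concrete, here is a minimal sketch that builds the wavelet filter bank of Eq. (3.2) on a small graph and computes the unnormalized moments of Eqs. (3.4)-(3.6) for the degree signal x(vℓ) = deg(vℓ). The 4-vertex adjacency matrix and the choices J = 3, Q = 4 are illustrative only, not taken from the experiments.

```python
import numpy as np

def scattering_moments(x, Psi, Q=4):
    """Unnormalized zeroth-, first-, and second-order scattering moments."""
    feats = [np.sum(x ** q) for q in range(1, Q + 1)]           # Sx(q), Eq. 3.4
    U = [np.abs(W @ x) for W in Psi]                            # |Psi_j x|
    for u in U:
        feats += [np.sum(u ** q) for q in range(1, Q + 1)]      # Sx(j,q), Eq. 3.5
    for j, u in enumerate(U):
        for jp in range(j + 1, len(Psi)):                       # j < j'
            v = np.abs(Psi[jp] @ u)                             # |Psi_j' |Psi_j x||
            feats += [np.sum(v ** q) for q in range(1, Q + 1)]  # Sx(j,j',q), Eq. 3.6
    return np.array(feats)

# Wavelet filter bank Psi_j = P^(2^(j-1)) - P^(2^j) on a toy 4-vertex graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = 0.5 * (np.eye(4) + A @ np.diag(1.0 / A.sum(axis=1)))        # lazy random walk
Psi, Ps = [], P
for _ in range(3):                     # J = 3 scales, squaring P each time
    Ps2 = Ps @ Ps
    Psi.append(Ps - Ps2)
    Ps = Ps2

Sx = scattering_moments(A.sum(axis=1), Psi)   # degree signal x(v) = deg(v)
```

With J = 3 and Q = 4 this yields 4 + 3·4 + 3·4 = 28 invariant features per graph, independent of vertex ordering.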
Figure 3.2: Illustration of (a) the proposed scattering feature extraction (see eqs. 3.4, 3.5, and 3.6), showing representative zeroth-, first-, and second-order cascades of the geometric scattering transform for an input graph signal x (with adjacency matrix A, lazy random walk P = (1/2)(I + A D^{−1}), and diffusion wavelets Ψj = P^{2^{j−1}} − P^{2^j}), and (b) the architecture for using geometric scattering of a graph G and signal x in graph data analysis, in which the extracted features Sx are passed to traditional Euclidean algorithms, as demonstrated in Sec. 3.5.

The collection of graph scattering moments Sx = {Sx(q), Sx(j, q), Sx(j, j′, q)} (illustrated in Fig. 3.2(a)) provides a rich set of multiscale invariants of the graph G.
These can be used in supervised settings as input to graph classification or regression models, or in unsupervised settings to embed graphs into a Euclidean feature space for further exploration, as demonstrated in Sec. 3.5.

In practice, the computation of the scattering features involves several design choices, akin to typical architecture choices in neural networks. Most importantly, it requires a choice of 1. which statistical moments to use (normalized or unnormalized), 2. the number of wavelet scales to use (given by J), and 3. the number of moments to use (denoted by Q). In general, J can be tuned automatically from the diameter of the considered graphs (e.g., setting it to the logarithm of the diameter), and the other choices can be tuned via cross-validation. However, we have found the impact of such tuning to be minor, and thus for simplicity we fix our configuration to use normalized moments, J = 5, and Q = 4 throughout this work.

3.4.2 Stability and Capacity of Geometric Scattering

In order to assess the utility of scattering features for representing graphs, two properties have to be considered: stability and capacity. First, the stability property aims to provide an upper bound on distances between similar graphs that differ only by types of deformations that can be treated as noise. This property has been the focus of both Zou and Lerman (2018) and Gama et al. (2019); in particular, the latter shows that a diffusion scattering transform yields features that are stable to graph structure deformations whose size can be computed via the diffusion framework (Coifman and Maggioni, 2006) that forms the basis for their construction. While there are some technical differences between the geometric scattering here and the diffusion scattering in Gama et al. (2019), these constructions are sufficiently similar that we can expect both of them to have analogous stability properties.
Therefore, we mainly focus here on the complementary property: the capacity of the scattering transform to provide a rich feature space for representing graph data without eliminating informative variance in it. We note that even in the classical Euclidean case, while the stability of scattering transforms to deformations can be established analytically (Mallat, 2012), their capacity is typically examined by empirical evidence when applied to machine learning tasks (e.g., Bruna and Mallat, 2011; Sifre and Mallat, 2012; Andén and Mallat, 2014). Similarly, in the graph processing setting, we examine the capacity of our proposed geometric scattering features via their discriminative power in graph data analysis tasks, which are described in detail in Sec. 3.5. We show that geometric scattering enables graph embedding in a relatively low-dimensional Euclidean space, while preserving insightful properties in the data. Beyond establishing the capacity of our specific construction, these results also indicate the viability of graph scattering transforms as universal feature extractors on graph data, and complement the stability results established in Zou and Lerman (2018) and Gama et al. (2019).

3.4.3 Geometric Scattering Compared to Other Feed-Forward Graph ConvNets

We give a brief comparison of geometric scattering with other graph ConvNets, with particular interest in isolating the key principles for building accurate graph ConvNet classifiers. We begin by remarking that, like several other successful graph neural networks, the graph scattering transform is equivariant to vertex permutations (i.e., commutes with them) until the final features are extracted. This idea has been discussed in depth in various articles, including Kondor et al.
(2018a), so we limit the discussion to observing that the geometric scattering transform thus propagates nearly all of the information in x through the multiple wavelet and absolute value layers, since only the absolute value operation removes information on x. As in Verma and Zhang (2018), we aggregate covariant responses via multiple summary statistics (i.e., moments), which are referred to there as a capsule. In the scattering context, at least, this idea is in fact not new, and has previously been used in the Euclidean setting for the regression of quantum mechanical energies in Eickenberg et al. (2018, 2017) and texture synthesis in Bruna and Mallat (2018). We also point out that, unlike many deep learning classifiers (graph included), a graph scattering transform extracts invariant statistics at each layer/order. These intermediate layer statistics, while necessarily losing some information in x (and hence G), provide important coarse geometric invariants that eliminate needless complexity in subsequent classification or regression. Furthermore, such layer-by-layer statistics have proven useful in characterizing signals of other types (e.g., texture synthesis in Gatys et al., 2015).

A graph wavelet transform Ψ(J)x decomposes the geometry of G through the lens of x, along different scales. Graph ConvNet algorithms also obtain multiscale representations of G, but several works, including Atwood and Towsley (2016) and Zhang et al. (2018), propagate information via a random walk. While random walk operators like P^t act at different scales on the graph G, per the analysis in Sec. 3.3 we see that P^t for any t will be dominated by the low-frequency responses of x. While subsequent nonlinearities may be able to recover this high-frequency information, the resulting transform will most likely be unstable due to the suppression and then attempted recovery of the high-frequency content of x.
Alternatively, features derived from P^t x may lose the high-frequency responses of x, which are useful in distinguishing similar graphs. The graph wavelet coefficients Ψ(J)x, on the other hand, respond most strongly within bands of nearly non-overlapping frequencies, each with a center frequency that depends on Ψj.

Finally, graph labels are often complex functions of both local and global subgraph structure within G. While graph ConvNets are adept at learning local structure within G, as detailed in Verma and Zhang (2018) they require many layers to obtain features that aggregate macroscopic patterns in the graph. This is due to the use of fixed-size filters, which often only incorporate information from the neighbors of a vertex. The training of such networks is difficult due to the limited size of many graph classification databases (see the supplementary information). Geometric scattering transforms have two advantages in this regard: (a) the wavelet filters are designed rather than trained; and (b) they are multiscale, thus incorporating macroscopic graph patterns in every layer/order.

3.5 Results

To establish the geometric scattering features as an effective graph representation for data analysis, we examine their performance here in four graph data analysis applications. First, we briefly introduce the thirteen datasets used in this study and their statistics. Then, in Sec. 3.5.2 we consider graph classification on social networks (from Yanardag and Vishwanathan, 2015) and various biochemistry datasets; in Sec. 3.5.3 we consider the impact of low training data availability on classification; in Sec. 3.5.4 we examine dimensionality reduction aspects of geometric scattering; in Sec. 3.5.5 we consider data exploration of enzyme graphs, where geometric scattering enables unsupervised (descriptive) recovery of EC change preferences in enzyme evolution; and finally, in Sec. 3.5.6, we conduct ablation studies.
A common theme in all these applications is the use of geometric scattering as an unsupervised, task-independent feature extractor that embeds input graphs of varying sizes (with associated graph signals) into a Euclidean space formed by scattering features. The extracted feature vectors are then passed to traditional (Euclidean) machine learning algorithms, such as SVM for classification or PCA for dimensionality reduction, to perform downstream analysis. Our results show that our scattering features provide a simplified representation (e.g., in dimensionality and extrapolation ability) of input graphs, which we conjecture is a result of their stability properties, while also being sufficiently rich to capture meaningful relations between graphs for predictive and descriptive purposes. Results of the methods compared in our paper come from the respective papers that introduced the methods, with the exception of: (1) social network results of WL, from Tixier et al. (2017); (2) biochemistry and social results of DCNN, from Verma and Zhang (2018); (3) biochemistry results (except for D&D) and social results of GK, from Yanardag and Vishwanathan (2015); (4) the D&D result of GK, from Niepert et al. (2016); and (5) for Graphlets, biochemistry results from Kriege et al. (2016) and social results from Tixier et al. (2017). All the geometric scattering and related classification code was implemented in Python. All experiments were performed in an HPC environment on an intel16-k80 cluster, with each job requesting one node with four processors and two Nvidia Tesla K80 GPUs.

3.5.1 Detailed Dataset Descriptions

The details of the datasets used in this work are as follows: NCI1 (Wale et al., 2008) contains 4,110 chemical compounds as graphs, with 37 node features. Each compound is labeled according to its activity against non-small cell lung cancer and ovarian cancer cell lines, and these labels serve as the classification goal on this data.
NCI109 (Wale et al., 2008) is similar to NCI1, but with 4,127 chemical compounds and 38 node features. MUTAG (Debnath et al., 1991) consists of 188 mutagenic aromatic and heteroaromatic nitro compounds (as graphs) with 7 node features. The classification here is binary (i.e., two classes), based on whether or not a compound has a mutagenic effect on a bacterium. PTC (Toivonen et al., 2003) is a dataset of 344 chemical compounds (as graphs) with nineteen node features that are divided into two classes depending on whether they are carcinogenic in rats. The PROTEINS (Borgwardt et al., 2005) dataset contains 1,113 proteins (as graphs) with three node features, where the goal of the classification is to predict whether or not the protein is an enzyme. The D&D (Dobson and Doig, 2003) dataset contains 1,178 protein structures (as graphs) that, similar to the previous dataset, are classified as enzymes or non-enzymes. ENZYMES (Borgwardt et al., 2005) is a dataset of 600 protein structures (as graphs) with three node features. These proteins are divided into six classes of enzymes (labelled by enzyme commission numbers) for classification. COLLAB (Yanardag and Vishwanathan, 2015) is a scientific collaboration dataset containing 5K graphs. The classification goal here is to predict which subfield of physics each graph belongs to. IMDB-B (Yanardag and Vishwanathan, 2015) is a movie collaboration dataset containing 1K graphs. The graphs are generated from two genres, Action and Romance, and the classification goal is to predict the correct genre for each graph. IMDB-M (Yanardag and Vishwanathan, 2015) is similar to IMDB-B, but with 1.5K graphs and three genres: Comedy, Romance, and Sci-Fi. REDDIT-B (Yanardag and Vishwanathan, 2015) is a dataset with 2K graphs, where each graph corresponds to an online discussion thread. The classification goal is to predict whether the graph belongs to a Q&A-based community or a discussion-based community.
REDDIT-5K (Yanardag and Vishwanathan, 2015) consists of 5K threads (as graphs) from five different subreddits. The classification goal is to predict the corresponding subreddit for each thread. REDDIT-12K (Yanardag and Vishwanathan, 2015) is similar to REDDIT-5K, but with 11,929 graphs from 11 different subreddits.

Table 3.1: Basic statistics of the biochemical graph classification databases

                          NCI1    NCI109  MUTAG  D&D     PTC    PROTEINS  ENZYMES
# of graphs in data       4110    4127    188    1178    344    1113      600
Max # of vertices         111     111     28     5748    109    620       126
Mean # of vertices        29.8    29.6    17.93  284.32  25.56  39.0      32.6
# of features per vertex  37      38      7      89      22     3         3
Mean # of edges           64.6    62.2    39.50  1431.3  51.90  72.82     124.2
# of classes              2       2       2      2       2      2         6

Table 3.2: Basic statistics of the social network graph classification databases

                          COLLAB   IMDB-B  IMDB-M  REDDIT-B  REDDIT-5K  REDDIT-12K
# of graphs in data       5000     1000    1500    2000      5000       11929
Max # of vertices         492      136     89      3783      3783       3782
Mean # of vertices        74.49    19.77   13      429.61    508.5      391.4
# of features per vertex  3        3       3       2         2          2
Mean # of edges           2457.78  96.53   65.94   497.75    594.87     456.89
# of classes              3        2       3       2         5          11

Table 3.1 and Table 3.2 summarize the size of the available graph data (i.e., number of graphs, and both max and mean number of vertices within graphs) in these datasets, as previously reported in the literature.

3.5.2 Graph Classification on Social Networks and Biochemistry Datasets

As a first application of geometric scattering, we apply it to graph classification of social network data taken from Yanardag and Vishwanathan (2015). In particular, that work introduced six social network datasets extracted from scientific collaborations (COLLAB), movie collaborations (IMDB-B & IMDB-M), and Reddit discussion threads (REDDIT-B, REDDIT-5K, REDDIT-12K). We then also apply our methods to biochemistry datasets that are often used in the graph classification literature. The classification results on the biochemistry datasets are shown in Table 3.4.
The social network data provided by Yanardag and Vishwanathan (2015) contains graph structures but no associated graph signals. Therefore we compute the eccentricity (for connected graphs) and clustering coefficient of each vertex, and use these as input signals to the geometric scattering transform. Specifically, in the case of COLLAB, IMDB-B, and IMDB-M, we use the eccentricity and clustering coefficients of each vertex as characteristic graph signals. In the case of REDDIT-B, REDDIT-5K and REDDIT-12K, on the other hand, we only use the clustering coefficient, due to the presence of disconnected graphs in these datasets. In principle, any general node characteristic could be used, although we remark that x = d, the vertex degree vector, is not useful in our construction since Ψ_j d = 0. After computing the scattering moments¹ of these two input signals, they are concatenated to form a single vector. This scattering feature vector is a consistent Euclidean representation of the graph, which is independent of the original graph sizes (i.e., number of vertices or edges), and thus we can apply any traditional classifier to it. In particular, we use here the standard SVM classifier with an RBF kernel, which is popular and effective in many applications and also performs well in this case. We evaluate the classification results of our SVM-based geometric scattering classification (GS-SVM) using ten-fold cross validation, which is standard practice in other graph classification works. First, the entire dataset is randomly split into ten subsets. Then, in each iteration (or “fold”), nine of them are used for training and validation, and the other one is used for testing classification

¹We use the normalized scattering moments for classification, since they perform slightly better than the un-normalized moments. We also use J = 5 and q = 4 for all scattering feature generations.
Table 3.3: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on social graph datasets.

                     COLLAB        IMDB-B        IMDB-M        REDDIT-B      REDDIT-5K     REDDIT-12K
Graph kernel methods:
  WL                 77.82 ± 1.45  71.60 ± 5.16  N/A           78.52 ± 2.01  50.77 ± 2.02  34.57 ± 1.32
  Graphlet           73.42 ± 2.43  65.40 ± 5.95  N/A           77.26 ± 2.34  39.75 ± 1.36  25.98 ± 1.29
  WL-OA              80.70 ± 0.10  N/A           N/A           89.30 ± 0.30  N/A           N/A
  GK                 72.84 ± 0.28  65.87 ± 0.98  43.89 ± 0.38  77.34 ± 0.18  41.01 ± 0.17  N/A
  DGK                73.00 ± 0.20  66.90 ± 0.50  44.50 ± 0.50  78.00 ± 0.30  41.20 ± 0.10  32.20 ± 0.10
Deep learning methods:
  DGCNN              73.76 ± 0.49  70.03 ± 0.86  47.83 ± 0.85  N/A           48.70 ± 4.54  N/A
  2D CNN             71.33 ± 1.96  70.40 ± 3.85  N/A           89.12 ± 1.70  52.21 ± 2.44  48.13 ± 1.47
  PSCN (k = 10)      72.60 ± 2.15  71.00 ± 2.29  45.23 ± 2.84  86.30 ± 1.58  49.10 ± 0.70  41.32 ± 0.42
  GCAPS-CNN          77.71 ± 2.51  71.69 ± 3.40  48.50 ± 4.10  87.61 ± 2.51  50.10 ± 1.72  N/A
  S2S-P2P-NN         81.75 ± 0.80  73.80 ± 0.70  51.19 ± 0.50  86.50 ± 0.80  52.28 ± 0.50  42.47 ± 0.10
  GIN-0 (MLP-SUM)    80.20 ± 1.90  75.10 ± 5.10  52.30 ± 2.80  92.40 ± 2.50  57.50 ± 1.50  N/A
  GS-SVM             79.94 ± 1.61  71.20 ± 3.25  48.73 ± 2.32  89.65 ± 1.94  53.33 ± 1.37  45.23 ± 1.25

Table 3.4: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on biochemistry graph datasets.
                     NCI1          NCI109        D&D           PROTEINS      MUTAG         PTC           ENZYMES
Graph kernel methods:
  WL                 84.46 ± 0.45  85.12 ± 0.29  78.34 ± 0.62  72.92 ± 0.56  84.11 ± 1.91  59.97 ± 1.60  55.22 ± 1.26
  Graphlet           70.50 ± 0.20  69.30 ± 0.20  79.70 ± 0.70  72.70 ± 0.60  85.20 ± 0.90  54.70 ± 2.00  30.60 ± 1.20
  WL-OA              86.10 ± 0.20  86.30 ± 0.20  79.20 ± 0.40  76.40 ± 0.40  84.5 ± 1.7    63.6 ± 1.5    59.9 ± 1.1
  GK                 62.28 ± 0.29  62.60 ± 0.19  78.45 ± 0.26  71.67 ± 0.55  81.39 ± 1.74  57.26 ± 1.41  26.61 ± 0.99
  DGK                80.30 ± 0.40  80.30 ± 0.30  73.09 ± 0.25  75.70 ± 0.50  87.40 ± 2.70  60.10 ± 2.50  53.40 ± 0.90
Deep learning methods:
  DGCNN              74.44 ± 0.47  N/A           79.37 ± 0.94  75.54 ± 0.94  85.83 ± 1.66  58.59 ± 2.47  51.00 ± 7.29
  2D CNN             N/A           N/A           N/A           77.12 ± 2.79  N/A           N/A           N/A
  CCN                76.27 ± 4.13  75.54 ± 3.36  N/A           N/A           91.64 ± 7.24  70.62 ± 7.04  N/A
  PSCN (k = 10)      76.34 ± 1.68  N/A           76.27 ± 2.15  75.00 ± 2.51  88.95 ± 4.37  62.29 ± 5.68  N/A
  GCAPS-CNN          82.72 ± 2.38  81.12 ± 1.28  77.62 ± 4.99  76.40 ± 4.17  N/A           66.01 ± 5.91  61.83 ± 5.39
  S2S-P2P-NN         83.72 ± 0.4   83.64 ± 0.3   N/A           76.61 ± 0.5   89.86 ± 1.1   64.54 ± 1.1   63.96 ± 0.6
  GIN-0 (MLP-SUM)    82.70 ± 1.60  N/A           N/A           76.20 ± 2.80  89.40 ± 5.60  64.60 ± 7.00  N/A
  GS-SVM             79.14 ± 1.28  77.95 ± 1.25  75.04 ± 3.64  74.11 ± 4.02  83.57 ± 6.75  63.94 ± 7.38  56.83 ± 4.97

accuracy. In total, after ten iterations, each of the subsets has been used once for testing, resulting in ten reported classification accuracy numbers for the examined dataset.
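The featurization behind the GS-SVM — graph-derived signals such as the clustering coefficient, summarized by q statistical moments per scattering channel — can be sketched as follows (a simplified, unnormalized variant; the helper names are ours):

```python
import numpy as np

def clustering_coefficients(A):
    """Per-vertex clustering coefficient computed from the adjacency matrix."""
    d = A.sum(axis=0)
    triangles = np.diag(A @ A @ A) / 2.0          # closed triangles at each vertex
    possible = d * (d - 1) / 2.0                  # possible neighbor pairs
    return np.where(possible > 0, triangles / possible, 0.0)

def moments(coeffs, q=4):
    """Summarize one channel of (absolute) coefficients by q moments."""
    return np.array([np.sum(np.abs(coeffs) ** k) for k in range(1, q + 1)])

# Toy graph: a triangle with a pendant vertex
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
x = clustering_coefficients(A)
print(x)                      # vertices 0 and 1 close all of their triangles
feature_chunk = moments(x)    # fixed length q, regardless of graph size
print(feature_chunk.shape)    # (4,)
```

Concatenating such per-channel moments over all wavelet scales and orders yields the fixed-length vector that is handed to the RBF-kernel SVM.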
Finally, the mean and standard deviation of these ten accuracies are computed and reported. It should be noted that during training, each iteration also performs automatic tuning of the trained classifier, as follows. First, nine iterations are performed, each time using eight subsets (i.e., folds) for training and the remaining one as a validation set, which is used to determine the optimal parameters for the SVM. After nine iterations, each of the training/validation subsets has been used once for validation, and we obtain nine classification models, which in turn produce nine predictions (i.e., class assignments) for each data point in the test subset of the main cross validation. To obtain the final predicted class of this cross validation iteration, we select the class with the most votes (from among the nine models) as our final classification result. These results are then compared to the true labels on the test subset to obtain the classification accuracy for this fold. We compare our results to prominent methods that report results for most, if not all, of the considered datasets. Among these are graph kernel methods, namely: Weisfeiler-Lehman graph kernels (WL, Shervashidze et al., 2011), the propagation kernel (PK, Neumann et al., 2012), Graphlet kernels (Shervashidze et al., 2009), random walk kernels (RW, Gärtner et al., 2003), deep graph kernels (DGK, Yanardag and Vishwanathan, 2015), and Weisfeiler-Lehman optimal assignment kernels (WL-OA, Kriege et al., 2016).
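The tuning scheme just described — nine inner folds, nine fitted models, and a majority vote over their test-set predictions — can be sketched as follows (with a stand-in nearest-centroid "model" on synthetic data, since any classifier can fill that slot):

```python
import numpy as np

def majority_vote(preds):
    """Most common label per test sample across several models."""
    preds = np.asarray(preds)                     # (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])

rng = np.random.default_rng(0)
n = 30
y = np.arange(n) % 2                              # balanced binary labels
X = rng.normal(size=(n, 2))
X[:, 0] += 4.0 * y                                # make the classes separable

accs = []
outer = np.array_split(rng.permutation(n), 10)    # outer ten-fold split
for i, test_idx in enumerate(outer):
    pool = np.concatenate([f for j, f in enumerate(outer) if j != i])
    inner = np.array_split(pool, 9)               # nine train/validation folds
    votes = []
    for k in range(9):                            # one model per inner split
        tr = np.concatenate([f for j, f in enumerate(inner) if j != k])
        c0 = X[tr][y[tr] == 0].mean(axis=0)       # toy model: class centroids
        c1 = X[tr][y[tr] == 1].mean(axis=0)
        d0 = np.linalg.norm(X[test_idx] - c0, axis=1)
        d1 = np.linalg.norm(X[test_idx] - c1, axis=1)
        votes.append((d1 < d0).astype(int))
    pred = majority_vote(votes)                   # consensus of the nine models
    accs.append(float((pred == y[test_idx]).mean()))
print(round(float(np.mean(accs)), 2))
```

The voting step is what turns the nine hyperparameter-tuning models into a single prediction per outer test fold.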
The others are recent geometric deep learning algorithms: deep graph convolutional neural networks (DGCNN, Zhang et al., 2018), Graph2vec (Narayanan et al., 2017), 2D convolutional neural networks (2D CNN, Tixier et al., 2017), covariant compositional networks (CCN, Kondor et al., 2018b), Patchy-san (PSCN, Niepert et al., 2016, with k = 10), diffusion convolutional neural networks (DCNN, Atwood and Towsley, 2016), graph capsule convolutional neural networks (GCAPS-CNN, Verma and Zhang, 2018), recurrent neural network autoencoders (S2S-N2N-PP, Taheri et al., 2018), and the graph isomorphism network (GIN, Xu et al., 2019). Following the standard format of reported classification performances for these methods, our results are reported in the form of average accuracy ± standard deviation (in percentages) over the ten cross-validation folds. We note that since some methods are not reported for all datasets, we mark N/A when appropriate. Table 3.3 reports the results. The geometric scattering transform and the related variants presented in Zou and Lerman (2018) and Gama et al. (2019) constitute a mathematical model for graph ConvNets. However, it is natural to ask if this model accurately reflects what is done in practice. A useful model need not obtain state of the art performance, but it should be competitive with the current state of the art, lest the model fail to capture the underlying complexity of the most powerful methods. Examining Table 3.3, one can see that the GS-SVM classifier matches or outperforms all but the two most recent methods, i.e., S2S-N2N-PP (Taheri et al., 2018) and GIN (Xu et al., 2019). With regard to these two approaches, the GS-SVM outperforms S2S-N2N-PP (Taheri et al., 2018) on 3/6 datasets. Finally, while GIN (Xu et al., 2019) outperforms geometric scattering on 5/6 datasets, the results on COLLAB and IMDB-B are not statistically significant, and on the REDDIT datasets the geometric scattering approach trails only GIN (Xu et al., 2019).
We thus conclude that the geometric scattering transform yields a rich set of invariant statistical moments, which have nearly the same capacity as the current state of the art in graph neural networks.

3.5.3 Classification with Low Training-data Availability

Many modern deep learning methods require large amounts of training data to generate representative features. In contrast, geometric scattering features are computed from each graph without any training process. In this section, we examine the performance of the GS-SVM under low training-data availability and show that the scattering features embed enough graph information that, even under extreme conditions (e.g., only 20% training data), they can still maintain relatively good classification results.

Table 3.5: Classification accuracy with different training/validation/test splits over scattering features (unnorm. moments)

Datasets: NCI1, NCI109, D&D, PROTEINS, MUTAG, PTC, ENZYMES, COLLAB, IMDB-B, IMDB-M, REDDIT-B, REDDIT-5K, REDDIT-12K
SVM accuracy under the splits 80%/10%/10%, 70%/10%/20%, 40%/10%/50%, 20%/10%/70%:
73.60 ± 0.68 79.80 ± 2.24 77.66 ± 1.78 72.36 ± 0.74 76.57 ± 3.76 75.58 ± 0.81 73.01 ± 1.94 74.03 ± 4.20 84.04 ± 6.71 77.47 ± 4.41 66.32 ± 7.54 56.75 ± 2.88 53.83 ± 6.71 36.38 ± 1.93 74.63 ± 1.05 76.88 ± 1.13 70.80 ± 3.54 67.81 ± 0.98 48.93 ± 4.77 44.28 ± 1.87 86.18 ± 0.32 88.30 ± 2.08 50.71 ± 2.27 48.37 ± 0.76 41.35 ± 1.05 37.71 ± 0.42 78.13 ± 2.07 77.54 ± 1.44 76.74 ± 2.32 74.30 ± 2.49 82.99 ± 6.97 64.83 ± 2.13 52.50 ± 5.35 76.98 ± 0.97 70.60 ± 2.85 49.00 ± 1.97 88.75 ± 0.96 50.87 ± 1.37 41.05 ± 0.70 76.37 ± 0.27 74.41 ± 0.14 76.32 ± 0.59 73.32 ± 1.68 78.72 ± 3.19 61.92 ± 1.45 44.50 ± 3.83 76.42 ± 0.82 69.10 ± 1.90 47.20 ± 1.47 86.40 ± 0.40 50.10 ± 0.41 39.36 ± 1.30

We performed graph classification under four training/validation/test splits: 80%/10%/10%, 70%/10%/20%, 40%/10%/50% and 20%/10%/70%. We performed 10-fold, 5-fold and 2-fold cross validation for the first three splits.
For the last split, we randomly formed a pool of 10 folds, from which we randomly selected 3 folds for training/validation and repeated this process ten times. Detailed classification results, using unnormalized moments, can be found in Table 3.5. Following Sec. 3.5.2, we discuss the classification accuracy on the six social datasets under these splits. When the training data is reduced from 90% to 80%, the classification accuracy in fact increased by 0.047%, which shows that the GS-SVM classification accuracy is hardly affected by this decrease in training size. Further reducing the training size to 50% results in an average decrease in classification accuracy of 1.40%, while going from 90% to 20% causes an average decrease of 3.00%. Fig. 3.3 gives a more nuanced statistical description of these results.

Figure 3.3: (a) Box plot showing the drop in SVM classification accuracy over social graph datasets when reducing training set size (horizontal axis marks portion of data used for testing); (b) Relation between explained variance, SVM classification accuracy, and PCA dimensions over scattering features in the ENZYMES dataset.

3.5.4 Dimensionality Reduction

We now consider the viability of scattering-based embedding for dimensionality reduction of graph data. As a representative example, we consider here the ENZYMES dataset introduced in Borgwardt et al. (2005), which contains 600 enzymes evenly split into six enzyme classes (i.e., 100 enzymes from each class). While the Euclidean notion of dimensionality is not naturally available in graph data, we note that graphs in this dataset have, on average, 124.2 edges, 29.8 vertices, and 3 features per vertex. Therefore, the data here can be considered significantly high dimensional in its original representation, which is not amenable to traditional dimensionality reduction techniques.
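Choosing the number of principal components to hit a target explained variance — the operation this section applies to scattering features — reduces to a cumulative sum over the PCA spectrum; a minimal numpy sketch, with synthetic data standing in for scattering feature vectors:

```python
import numpy as np

def pca_dims_for_variance(X, threshold=0.90):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    ratio = (s ** 2) / np.sum(s ** 2)             # per-component variance ratio
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)

rng = np.random.default_rng(1)
# Synthetic features: 3 dominant directions embedded in 50 dimensions
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
X = signal + 0.01 * rng.normal(size=(200, 50))
print(pca_dims_for_variance(X, 0.90))             # a small number (at most 3)
```

Applying a classifier in the resulting low-dimensional space is then straightforward, which is how the GS-PCA accuracies in this section are produced.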
To perform scattering-based dimensionality reduction, we applied PCA to the geometric scattering features extracted from the input enzyme graphs, while choosing the number of principal components to capture 99%, 90%, 80% or 50% of the explained variance. For each of these thresholds, we computed the mean classification accuracy (with ten-fold cross validation) of SVM applied to the resulting GS-PCA low dimensional space, as well as the dimensionality of this space. The relation between dimensionality, explained variance, and SVM accuracy is shown in Fig. 3.3, where we can observe that geometric scattering combined with PCA indeed enables significant dimensionality reduction (e.g., to R^16 with 90% explained variance) with only a small impact on classification accuracy. Finally, we also consider the PCA dimensionality of each individual enzyme class in the data (in the scattering feature space), as we expect scattering to reduce the variability in each class w.r.t. the full feature space. Indeed, in this case, individual classes have 90%-explained-variance PCA dimensionality ranging between 6 and 10, which is significantly lower than the 16 dimensions of the entire PCA space. We note that similar results can also be observed for the social network data discussed in previous sections, where on average 90% of the explained variance is captured by nine dimensions, yielding a drop of 3.81% in mean SVM accuracy; complete results are shown in Table 3.6 and Table 3.7. Table 3.6: Classification accuracy and dimensionality reduction with PCA over scattering features (unnorm.
moments)

SVM accuracy w.r.t. variance covered:

Dataset      50%            80%            90%            99%
NCI1         72.41 ± 2.36   73.89 ± 2.57   73.89 ± 1.33   78.22 ± 1.95
NCI109       70.85 ± 2.59   71.84 ± 2.38   72.33 ± 2.24   76.69 ± 1.02
D&D          75.21 ± 3.17   75.13 ± 3.68   74.87 ± 3.99   76.92 ± 3.37
PROTEINS     70.80 ± 3.43   74.20 ± 3.06   74.67 ± 3.33   74.57 ± 3.42
MUTAG        77.51 ± 10.42  80.32 ± 8.16   82.40 ± 10.92  84.09 ± 9.09
PTC          58.17 ± 8.91   60.50 ± 9.96   58.70 ± 6.93   63.68 ± 3.97
ENZYMES      29.67 ± 4.46   45.33 ± 6.62   50.67 ± 5.44   52.50 ± 8.89
COLLAB       62.86 ± 1.36   71.68 ± 2.06   73.22 ± 2.29   76.54 ± 1.41
IMDB-B       58.30 ± 3.44   66.10 ± 3.14   68.80 ± 4.31   68.40 ± 4.31
IMDB-M       41.00 ± 4.86   46.40 ± 4.48   45.93 ± 3.86   48.27 ± 3.23
REDDIT-B     71.05 ± 2.39   78.95 ± 2.42   83.75 ± 1.83   86.95 ± 1.78
REDDIT-5K    40.97 ± 2.06   45.71 ± 2.21   47.43 ± 1.90   49.65 ± 1.86
REDDIT-12K   28.22 ± 1.64   33.36 ± 0.93   34.71 ± 1.52   38.39 ± 1.54

PCA dimensions w.r.t. variance covered (50%, 80%, 90%): 43 18 43 19 44 10 10 2 4 13 21 7 16 3 9 2 2 8 8 2 8 2 10 2 2 9; (99%): 117 114 122 36 34 62 44 32 24 20 24 27 27 32 32 35 5 8 14 9 6 4 5 5 6 5

Table 3.7: Dimensionality reduction with PCA over scattering features (unnorm. moments)

Dataset: ENZYMES
SVM accuracy:  PCA 50.67 ± 5.44   Full 53.83 ± 6.71
PCA dimensions (> 90% variance):  All classes: 16   Per class: 8, 9, 10, 6, 9, 8

3.5.5 Data Exploration: Enzyme Class Exchange Preferences

Geometric scattering essentially provides a task-independent representation of graphs in a Euclidean feature space. Therefore, it is not limited to supervised learning applications, and can also be utilized for exploratory graph-data analysis, as we demonstrate in this section. We focus our discussion in particular on the ENZYMES dataset described in the previous section. Here, geometric scattering features can be considered as providing “signature” vectors for individual enzymes, which can be used to explore interactions between the six top level enzyme classes, labelled by their Enzyme Commission (EC) numbers (Borgwardt et al., 2005).
In order to emphasize the properties of scattering-based feature extraction, rather than downstream processing, we mostly limit our analysis of the scattering feature space to linear operations such as principal component analysis (PCA).

Table 3.8: EC subspace analysis in the scattering feature space of ENZYMES (Borgwardt et al., 2005). For enzyme classes EC-1 through EC-6, the table lists the mean distance to the subspace of each class, measured via PCA projection/reconstruction distance: 84.86 18.15 22.75 22.65 107.23 168.56 49.14 117.68 58.22 45.46 62.38 13.56 53.07 18.45 117.24 94.3 15.09 59.23 62.87 22.66 144.08 29.59 50.07 51.94 98.44 9.43 252.31 127.27 66.57 58.88 75.47 30.14 30.4 122.3 60 73.96; and the portion of each class with its true class as the 1st, 2nd, or 3rd-6th nearest subspace: 45% 28% 53% 24% 32% 7% 24% 12% 67% 21% 67% 21% 27% 23% 61% 64% 12% 12%

To explore the scattering feature space, and the richness of information captured by it, we use it to infer relations between EC classes. First, for each enzyme e, with scattering feature vector v_e (i.e., with Sx for all vertex features x), we compute its distance from class EC-j, with PCA subspace S_j, as the projection distance: dist(e, EC-j) = ||v_e − proj_{S_j} v_e||. Then, for each enzyme class EC-i, we compute the mean distance of enzymes in it from the subspace of each class EC-j as D(i, j) = mean{dist(e, EC-j) : e ∈ EC-i}.

Figure 3.4: Comparison of EC exchange preferences in enzyme evolution: (a) observed in Cuesta et al. (2015), and (b) inferred from scattering features via pref(EC-i, EC-j) := w_j · [min{D(i, j)/D(i, i), D(j, i)/D(j, j)}]^{-1}, where w_j is the portion of enzymes in EC-j that choose another EC as their nearest subspace, and D(i, j) is the mean distance of enzymes in EC-i from the PCA (90% explained variance) subspace of EC-j. Our inference (b) mainly recovers (a).
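The projection distance dist(e, EC-j) and class subspaces can be sketched as follows, under one straightforward reading in which each class subspace comes from the PCA of that class's feature vectors, centered at the class mean (the names and synthetic data are illustrative assumptions, not the chapter's exact computation):

```python
import numpy as np

def class_subspace(V, var=0.90):
    """PCA basis (columns) capturing `var` of one class's variance, plus the class mean."""
    mean = V.mean(axis=0)
    U, s, _ = np.linalg.svd((V - mean).T, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), var) + 1)
    return U[:, :k], mean

def dist_to_class(v, basis, mean):
    """Projection (reconstruction) distance of one feature vector to a class subspace."""
    r = v - mean
    return float(np.linalg.norm(r - basis @ (basis.T @ r)))

rng = np.random.default_rng(0)
# Two synthetic "classes" living on different 2D planes of a 5D feature space
plane0 = np.array([[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.]])
plane1 = np.array([[0., 0., 1., 0., 0.], [0., 0., 0., 1., 0.]])
c0 = rng.normal(size=(50, 2)) @ plane0 + 0.01 * rng.normal(size=(50, 5))
c1 = rng.normal(size=(50, 2)) @ plane1 + 0.01 * rng.normal(size=(50, 5))

b0, m0 = class_subspace(c0)
b1, m1 = class_subspace(c1)
v = c0[0]                               # a point from class 0
print(dist_to_class(v, b0, m0) < dist_to_class(v, b1, m1))   # True
```

Averaging such distances over all members of a class gives the D(i, j) entries used above.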
These distances are summarized in Table 3.8, as well as the proportion of points from each class that have their true EC as their nearest (or second nearest) subspace in the scattering feature space. In general, 48% of enzymes select their true EC as the nearest subspace (with an additional 19% selecting it as second nearest), but these proportions vary between individual EC classes. Finally, we use these scattering-based distances to infer EC exchange preferences during enzyme evolution, which are presented in Fig. 3.4 and validated with respect to the established preferences observed and reported in Cuesta et al. (2015). We note that the results there were observed independently of the ENZYMES dataset. In particular, the portion of enzymes considered from each EC differs between these data, since Borgwardt et al. (2005) took special care to ensure that each EC class in ENZYMES has exactly 100 enzymes in it. However, we notice that the portion of enzymes (in each EC) that choose the wrong EC as their nearest subspace, which can be considered as EC “incoherence” in the scattering feature space, correlates well with the proportion of evolutionary exchanges generally observed for each EC in Cuesta et al. (2015), and therefore we use these as EC weights (see Fig. 3.4). Our results in Fig. 3.4 demonstrate that scattering features are sufficiently rich to capture relations between enzyme classes, and indicate that geometric scattering has the capacity to uncover descriptive and exploratory insights in graph data analysis.

3.5.6 Ablation Study

To fully understand the power of our geometric scattering coefficients, we conduct an ablation study using five social network datasets, namely COLLAB, IMDB-B, IMDB-M, REDDIT-B, and REDDIT-5K, as representative examples.
Following the settings described above, here, instead of using four normalized moments for each order of scattering moments, we use only one normalized moment (the mean) or two normalized moments (the mean and variance), and compare the graph classification results in Table 3.9. We show that even with only one normalized moment our method still obtains relatively good results, and that using higher order moments helps us to match or outperform most state-of-the-art results. Generally, the results degrade by 1-6% on the social network datasets when reducing from four normalized moments to two or one normalized moment.

Table 3.9: Ablation study on five social network datasets using only one normalized moment and two normalized moments.

                     COLLAB  IMDB-B  IMDB-M  REDDIT-B  REDDIT-5K
One norm. moment     77.42   69.80   48.47   83.25     50.31
Two norm. moments    78.44   69.3    48.27   85.20     51.49

Finally, we perform graph classification with two different classifiers, a linear SVM and fully connected layers (FCLs)², to further demonstrate the usefulness of the geometric scattering coefficients and show that they perform well regardless of the choice of classifier. Our results in Table 3.10 show that, compared to the RBF SVM, the FCLs and linear SVM are somewhat worse (by 1-3%), but not by much.

Table 3.10: Graph classification with FCLs and linear SVM classifiers

                     COLLAB  IMDB-B  IMDB-M  REDDIT-B  REDDIT-5K
linear SVM           77.40   70.50   47.13   86.45     53.23
FCLs                 79.26   69.50   46.40   86.60     50.50

3.6 Conclusion

We presented the geometric scattering transform as a deep filter bank for feature extraction on graphs, which generalizes the Euclidean scattering transform. A reasonable criticism of the scattering theory approach to understanding geometric deep learning is that it is not clear whether the scattering model is a suitable facsimile for the powerful graph neural networks that are obtaining impressive results on graph classification tasks and related graph data analysis problems.
In this chapter we showed that, at least empirically, this line of criticism is unfounded, and that further theoretical study of geometric scattering transforms on graphs is indeed warranted. Our evaluation results on graph classification and data exploration show the potential of the produced scattering features to serve as universal representations of graphs. Indeed, classification using these features with relatively simple classifier models, dimension-reduced feature sets, and small training sets nevertheless reaches high accuracy results on the most commonly used graph classification datasets. Finally, the geometric scattering features provide a new way of computing and considering global graph representations, independent of specific learning tasks. They raise the possibility of embedding entire graphs in Euclidean space and computing meaningful distances between graphs, which can be used for both supervised and unsupervised learning, as well as exploratory analysis of graph-structured data.

²Hyperparameters of the FCLs are manually selected.

CHAPTER 4 TOXICITY PREDICTION OF SMALL ORGANIC MOLECULES

4.1 Abstract

Toxicity evaluation is of great significance in protecting humans from toxic compounds. It is also very important in drug development and approval. Usually a compound's effects on human health are assessed through a large number of time- and cost-intensive in vivo or in vitro experiments. These experiments also raise ethical concerns. Therefore, computational methods have been proposed for toxicity prediction, to avoid the expense of animal tests and the associated ethical issues. In recent years there has been a trend toward applying machine learning and deep learning for toxicity prediction. In this study, the proposed geometric scattering method is used to generate molecular representations, which are then used for toxicity prediction. The geometric scattering method is constructed from cascades of graph wavelets and non-linear absolute values.
It provides an unsupervised way of generating molecular graph features, where molecules are treated as graphs. Once the molecular graph features are generated, we can use any classifier, such as a neural network or support vector machine, to perform toxicity prediction. The acute oral systemic toxicity dataset is used in this study, and the results demonstrate that the geometric scattering method can achieve competitive results and thus provides a new way to perform toxicity prediction.

4.2 Introduction

Toxicity assessment has been a long-standing issue. In drug development, measuring the toxicity of potential drugs is essential before moving on to any clinical trials, which could save millions of dollars otherwise lost to drug failures. In addition, any chemicals that are put on the market are required to pass reliable tests to protect humans from being harmed by unknown toxic effects. Traditionally, the toxicity of a compound is evaluated by a large number of in vivo or in vitro experiments, where numerous methods rely on expensive and time-consuming animal tests. Therefore, there is an increasing demand for cost- and time-efficient toxicological screening methods. Computational models are considered able to screen large numbers of compounds in a short time at low cost (Rusyn and Daston, 2010) and thus can be a powerful alternative to laboratory experiments. In fact, there is a long history of applying computational methods to toxicity prediction. For example, quantitative structure-activity relationship (QSAR) models are used to predict toxicity based on chemical structure. However, traditional computational models often suffer from insufficient accuracy and are not as reliable as biological experiments. Therefore there is a great need for new computational models that could provide accurate predictions. In addition, more and more toxicity data are being collected and made available to the public.
The rapid growth of toxicity databases makes it possible to learn from data at large scale. To predict the toxicity of molecules, it is important to generate molecular representations. Molecular descriptors, such as Kow and other molecular physical or chemical properties that encode no or very limited structural information, were first used to generate molecular representations, which inevitably loses a great deal of information. Researchers have since developed various molecular representations that also encode structural information. Depending on the algorithm, there are different ways to encode molecular structures. For example, extended-connectivity fingerprints (ECFP), which are based on the Morgan algorithm, are widely used to generate molecular features (Rogers and Hahn, 2010). ECFP generates molecular features that represent each atom within larger and larger circular sub-structural neighborhoods. On the other hand, a molecule can be regarded as a graph: the atoms are treated as nodes while the bonds are regarded as edges. Thus one can directly use methods from graph theory to study molecular graphs and generate molecular features. Once molecular features are generated, they can be used as inputs to machine learning methods such as logistic regression (LR), support vector machines (SVM), random forests (RF), and gradient boosting regression trees (GBRT). Recently there has been growing interest in applying deep learning to learn molecular features during the training process. Deep learning utilizes multi-layer neural network models to learn patterns from data. It has achieved great successes in signal processing (Goodfellow et al., 2016, Chapter 9), computer vision (Rawat and Wang, 2017), speech recognition (Amodei et al., 2016), and natural language processing (Young et al., 2018). It can also be applied in many other domains. In chemistry, deep learning can be used to predict molecular energies.
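The circular-neighborhood idea behind ECFP mentioned above can be illustrated with a deliberately simplified sketch — hashing each atom's label together with its sorted neighbor labels for a few rounds and folding the identifiers into a fixed-size bit vector. This toy code is ours and is not the RDKit/Morgan implementation:

```python
import hashlib
import numpy as np

def toy_circular_fingerprint(A, atom_labels, radius=2, n_bits=64):
    """Fold iteratively grown neighborhood identifiers into an n_bits bit vector."""
    n = len(atom_labels)
    labels = [str(l) for l in atom_labels]
    fp = np.zeros(n_bits, dtype=int)
    for _ in range(radius + 1):
        for lab in labels:                      # set one bit per identifier
            h = int(hashlib.md5(lab.encode()).hexdigest(), 16)
            fp[h % n_bits] = 1
        labels = [labels[i] + "|" + ",".join(sorted(labels[j] for j in range(n) if A[i, j]))
                  for i in range(n)]            # grow each neighborhood by one bond
    return fp

# Heavy-atom skeleton of an ethanol-like molecule: C-C-O chain
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
fp = toy_circular_fingerprint(A, ["C", "C", "O"])
print(fp.sum() >= 1, len(fp))   # True 64
```

Each round corresponds to widening the circular substructure around every atom, which is the essence of the Morgan/ECFP construction.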
For example, the message passing neural network (MPNN) framework recently proposed by Gilmer et al. (2017) has achieved state-of-the-art results for molecular energy predictions. The MPNN architecture consists of two phases. The first is the message passing phase, in which each atom repeatedly aggregates messages from its bonded neighbors, incorporating geometry and bond information, and revises its hidden state through an update function; depending on the variant, the molecule can be treated as a graph or with its full 3D geometry. The second is the readout phase, in which the final atom states are mapped to a molecule-level prediction.

Deep learning is based on deep neural networks (DNNs) with multiple layers and large numbers of neurons per layer. A fully connected DNN consists of an input layer, several hidden layers and an output layer, and can be viewed as a function that maps an input vector to an output vector. The mapping is parameterized by weights that are optimized during training through backpropagation. In contrast to shallow neural networks, which have only a few hidden layers and neurons per layer, DNNs use many hidden layers with large numbers of neurons. In a fully connected DNN, each neuron is connected to all neurons of the previous layer; the outputs of the current layer, passed through an activation function, become the inputs of the following layer. Popular activation functions are ReLU, Tanh, and Sigmoid. Modern DNNs also utilize techniques such as dropout to avoid over-fitting. Finally, the output layer can be either a linear regression (for regression tasks) or a softmax function (for classification tasks). With this multi-layer architecture, the features of the input data are learned during the training process.

4.3 Method Development and Evaluation

Molecules can be represented as graphs, where atoms are nodes and bonds are edges. Therefore we can construct the molecular graph from the connectivity of the molecule.
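The fully connected architecture described above can be sketched in a few lines of numpy (forward pass only). The layer sizes are illustrative and the weights are random; in practice the weights would be learned by backpropagation, and frameworks such as PyTorch or TensorFlow would be used.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

# Toy weights for a fully connected network with two hidden layers: 8 -> 16 -> 16 -> 2.
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(16, 2)), np.zeros(2)

def forward(x):
    h1 = relu(x @ W1 + b1)        # hidden layer 1
    h2 = relu(h1 @ W2 + b2)       # hidden layer 2
    return softmax(h2 @ W3 + b3)  # softmax output: class probabilities

probs = forward(rng.normal(size=(5, 8)))  # batch of 5 eight-dimensional inputs
print(probs.shape)  # (5, 2); each row sums to 1
```

Each layer is a weighted linear map followed by an activation, and the softmax output layer matches the binary classification setting used later in this chapter.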
Furthermore, the features of each node can be chosen from a variety of atomic physical properties, such as atomic number, valence charge, mass, whether the atom belongs to an aromatic ring, and whether it is sp, sp2 or sp3 hybridized. In this study, we used the following eight properties to build the node features used as input signals on each graph: one-hot encodings of electron acceptor, electron donor, whether the atom belongs to an aromatic ring, whether it is sp, sp2 or sp3 hybridized, and the total charge and valence charge of each atom.

We utilize the acute oral systemic toxicity dataset publicly available from the NIEHS Toxicology Program, which consists of 8,893 molecular structures and their SMILES strings. Four classification tasks are considered in this study: (1) very toxic (LD50 < 50 mg/kg vs. all others); (2) nontoxic (LD50 > 2000 mg/kg vs. all others); (3) hazard categories under the EPA classification system (n = 4); and (4) hazard categories under the GHS classification system (n = 5; Category 5 and Not Classified combined into a single category).

4.3.1 Geometric Scattering for Molecule Feature Generation

To construct molecular representations that can be fed into various machine learning classifiers (e.g. random forest, support vector machine, neural network), we need to generate molecular features based on the structures and the molecular properties. Geometric scattering, introduced in Chapter 3, has shown promising results in generating representations for graphs, so in this study we perform geometric scattering on molecular graphs. The detailed calculations can be found in Chapter 3. As in Chapter 3, we only calculated 0th, 1st and 2nd order features according to Equations 3.4, 3.5, and 3.6. All three orders of features are then concatenated and fed into a classifier. In this study we use gradient boosting regression trees (GBRT), logistic regression (LR) and a two-layer neural network (NN).
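A simplified numpy sketch of the feature generation step is shown below, using the lazy-random-walk wavelet construction from Chapter 3. The normalizations are simplified and the second-order features of Equation 3.6 are omitted; the scales J and moments Q are illustrative choices.

```python
import numpy as np

def scattering_moments(A, x, J=3, Q=4):
    """Zeroth- and first-order geometric scattering moments of a node signal x
    on a graph with adjacency matrix A (simplified sketch)."""
    d = A.sum(axis=1)
    # Lazy random walk matrix P = (I + A D^{-1}) / 2 (column-stochastic).
    P = 0.5 * (np.eye(len(A)) + A / d)
    feats = [np.sum(x ** q) for q in range(1, Q + 1)]  # zeroth-order moments
    for j in range(1, J + 1):
        # Diffusion wavelet Psi_j = P^(2^(j-1)) - P^(2^j), applied to x.
        psi_x = (np.linalg.matrix_power(P, 2 ** (j - 1))
                 - np.linalg.matrix_power(P, 2 ** j)) @ x
        feats += [np.sum(np.abs(psi_x) ** q) for q in range(1, Q + 1)]
    return np.array(feats)

# Tiny 4-node path graph with a node signal (e.g. atomic charges).
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.array([1.0, -0.5, 0.2, 0.8])
f = scattering_moments(A, x)
print(f.shape)  # (16,) = Q zeroth-order + J*Q first-order moments
```

The resulting fixed-length vector is independent of node ordering and graph size, which is what allows molecules of different sizes to be fed into a single classifier.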
4.3.2 Data Preprocessing

The acute oral systemic toxicity dataset is very imbalanced. Among the 8,893 molecules, for the task of predicting whether a molecule is very toxic (LD50 < 50 mg/kg), 90% of the molecules are non-toxic and only 10% are considered very toxic. An imbalanced dataset biases the model toward the class with more data points and makes the model hard to train. To address this issue, small random noise is added to the geometric scattering features of the minority class to create synthetic "new" feature vectors that are not exactly the same as existing ones but do not change their labels. Finally, 80% of the data points from each class were combined and augmented (if a certain class was imbalanced) for training, 10% were used for validation, and the remaining 10% were used for testing.

4.3.3 Evaluation Criteria

To evaluate the effectiveness of our model, overall accuracy alone is not sufficient: on a test set containing 90% non-toxic compounds, a model can trivially achieve 90% accuracy by predicting everything nontoxic. Therefore, in addition to overall accuracy, sensitivity and specificity are also used to evaluate the performance of the model. They are defined as

    Specificity = TN / (TN + FP)                          (4.1)
    Sensitivity = TP / (TP + FN)                          (4.2)
    Overall Accuracy = (TP + TN) / (TP + FP + TN + FN)    (4.3)

Here TP stands for true positives, FP for false positives, TN for true negatives, and FN for false negatives.

4.4 Results

Classification was performed for four tasks: the "Very Toxic" and "Nontoxic" tasks are binary classification tasks, while the EPA and GHS hazard categories are multi-class classifications. Overall classification accuracies are also compared across classifiers (NN, GBRT, LR). The results of using GBRT and LR on the "Very Toxic" and "Nontoxic" tasks are shown in Tables 4.1 and 4.2.
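The evaluation criteria in Equations 4.1-4.3 can be computed directly from confusion-matrix counts; a minimal sketch on a hypothetical set of binary predictions (1 = positive class, e.g. "very toxic"):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Confusion-matrix metrics following Equations 4.1-4.3 for a binary task."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    return {
        "specificity": tn / (tn + fp),                # Eq. 4.1
        "sensitivity": tp / (tp + fn),                # Eq. 4.2
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. 4.3
    }

# Illustrative labels and predictions (not from the actual dataset).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
m = evaluate(y_true, y_pred)
print(m)  # sensitivity 2/3, specificity 4/5, accuracy 6/8
```

On this toy example the model is 75% accurate overall even though it misses one of three positives, which is exactly why sensitivity and specificity are reported alongside accuracy.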
Since the NN performs better than the other classifiers, in the following we further study sensitivity and specificity using only the NN model. The results in Table 4.3 show that for the "Very Toxic" task, we achieve 74% overall test accuracy, with 77% sensitivity and 73% specificity. For the "Nontoxic" task (Table 4.4), the performance is slightly worse, with 72% overall accuracy, 70% sensitivity and 72% specificity. For the EPA and GHS hazard category classifications there are multiple classes, and we only consider the overall accuracies in Tables 4.5 and 4.6: 53% for the four EPA classes and 52% for the five GHS classes. All of these results are much better than random guessing, which demonstrates the power of the geometric scattering features.

Table 4.1: Training and prediction results with GBRT and LR for "Very Toxic" classification

          Training   Test
    GBRT  73%        66%
    LR    68%        65%

Table 4.2: Training and prediction results with GBRT and LR for "Nontoxic" classification

          Training   Test
    GBRT  80%        64%
    LR    75%        60%

Table 4.3: Sensitivity, specificity and overall prediction accuracy for "Very Toxic" classification

                Overall Accuracy   Sensitivity   Specificity
    Training    87%                —             —
    Validation  84%                73%           85%
    Test        74%                77%           73%

Since many other molecular representations could be used for toxicity prediction, future work should compare the geometric scattering results with other molecular representations such as ECFP.
In addition, methods that use end-to-end learning can learn the molecular representations during training, and should also be considered for comparison in future work.

Table 4.4: Sensitivity, specificity and overall prediction accuracy for "Nontoxic" classification

                Overall Accuracy   Sensitivity   Specificity
    Training    82%                —             —
    Validation  75%                73%           76%
    Test        72%                70%           72%

Table 4.5: Training and prediction results for EPA hazard category classification

              Overall Accuracy
    Training  75%
    Test      53%

Table 4.6: Training and prediction results for GHS hazard category classification

              Overall Accuracy
    Training  71%
    Test      52%

BIBLIOGRAPHY

Allen-King, R. M., Grathwohl, P., and Ball, W. P. (2002). New modeling paradigms for the sorption of hydrophobic organic chemicals to heterogeneous carbonaceous matter in soils, sediments, and rocks. Advances in Water Resources, 25(8), 985–1016.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., and Zhu, Z. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 173–182, New York, New York, USA. PMLR.
Andén, J. and Mallat, S. (2014). Deep scattering spectrum.
IEEE Transactions on Signal Processing, 62(16), 4114–4128.
Andén, J., Lostanlen, V., and Mallat, S. (2018). Classification with joint time-frequency scattering. arXiv:1807.08869.
Angles, T. and Mallat, S. (2018). Generative networks as inverse problems with scattering transforms. In International Conference on Learning Representations.
Arp, H. P. H., Breedveld, G. D., and Cornelissen, G. (2009). Estimating the in situ sediment-porewater distribution of PAHs and chlorinated aromatic hydrocarbons in anthropogenically impacted sediments. Environmental Science & Technology, 43(15), 5576–5585. PMID: 19731647.
Athmer, C. (2004). In-situ remediation of TCE in clayey soils. Soil and Sediment Contamination: An International Journal, 13(5), 381–390.
Atwood, J. and Towsley, D. (2016). Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems 29, pages 1993–2001.
Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. (2005). Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1), i47–i56.
Boyd, S. A., Johnston, C. T., Pinnavaia, T. J., Kaminski, N. E., Teppen, B. J., Li, H., Khan, B., Crawford, R. B., Kovalova, N., Kim, S.-S., Shao, H., Gu, C., and Kaplan, B. L. (2011). Suppression of humoral immune responses by 2,3,7,8-tetrachlorodibenzo-p-dioxin intercalated in smectite clay. Environmental Toxicology and Chemistry, 30(12), 2748–2755.
Boyd, S. A., Sallach, J. B., Zhang, Y., Crawford, R., Li, H., Johnston, C. T., Teppen, B. J., and Kaminski, N. E. (2017). Sequestration of 2,3,7,8-tetrachlorodibenzo-p-dioxin by activated carbon eliminates bioavailability and the suppression of immune function in mice. Environmental Toxicology and Chemistry, 36(10), 2671–2678.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4), 18–42.
Brumwell, X., Sinz, P., Kim, K. J., Qi, Y., and Hirn, M. (2018). Steerable wavelet scattering for 3D atomic systems with application to Li-Si energy prediction. In NeurIPS Workshop on Machine Learning for Molecules and Materials. arXiv:1812.02320.
Bruna, J. and Mallat, S. (2011). Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566.
Bruna, J. and Mallat, S. (2012). Invariant scattering convolution networks. CoRR, abs/1203.1513.
Bruna, J. and Mallat, S. (2013a). Audio texture synthesis with scattering moments. arXiv:1311.0407.
Bruna, J. and Mallat, S. (2013b). Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1872–1886.
Bruna, J. and Mallat, S. (2018). Multiscale sparse microcanonical models. arXiv:1801.02013.
Chai, Y., Davis, J. W., Wilken, M., Martin, G. D., Mowery, D. M., and Ghosh, U. (2011). Role of black carbon in the distribution of polychlorinated dibenzo-p-dioxins/dibenzofurans in aged field-contaminated soils. Chemosphere, 82(5), 639–647.
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., and Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug Discovery Today, 23(6), 1241–1250.
Cohen, A. J., Mori-Sánchez, P., and Yang, W. (2008). Insights into current limitations of density functional theory. Science, 321(5890), 792–794.
Coifman, R. R. and Maggioni, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1), 53–94.
Comer, J., Gumbart, J. C., Hénin, J., Lelièvre, T., Pohorille, A., and Chipot, C. (2015). The adaptive biasing force method: Everything you always wanted to know but were afraid to ask. The Journal of Physical Chemistry B, 119(3), 1129–1151. PMID: 25247823.
Cornelissen, G. and Gustafsson, Ö.
(2004). Sorption of phenanthrene to environmental black carbon in sediment with and without organic matter and native sorbates. Environmental Science & Technology, 38(1), 148–155. PMID: 14740730.
Cornelissen, G., Gustafsson, Ö., Bucheli, T. D., Jonker, M. T. O., Koelmans, A. A., and van Noort, P. C. M. (2005). Extensive sorption of organic compounds to black carbon, coal, and kerogen in sediments and soils: mechanisms and consequences for distribution, bioaccumulation, and biodegradation. Environmental Science & Technology, 39(18), 6881–6895. PMID: 16201609.
Cornelissen, G., Breedveld, G. D., Kalaitzidis, S., Christanis, K., Kibsgaard, A., and Oen, A. M. P. (2006). Strong sorption of native PAHs to pyrogenic and unburned carbonaceous geosorbents in sediments. Environmental Science & Technology, 40(4), 1197–1203. PMID: 16572775.
Cornelissen, G., Cousins, I. T., Wiberg, K., Tysklind, M., Holmström, H., and Broman, D. (2008a). Black carbon-dominated PCDD/F sorption to soils at a former wood impregnation site. Chemosphere, 72(10), 1455–1461.
Cornelissen, G., Wiberg, K., Broman, D., Arp, H. P. H., Persson, Y., Sundqvist, K., and Jonsson, P. (2008b). Freely dissolved concentrations and sediment-water activity ratios of PCDD/Fs and PCBs in the open Baltic Sea. Environmental Science & Technology, 42(23), 8733–8739. PMID: 19192790.
Cramer, R. D., Patterson, D. E., and Bunce, J. D. (1988). Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. Journal of the American Chemical Society, 110(18), 5959–5967. PMID: 22148765.
Cuesta, S. M., Rahman, S. A., Furnham, N., and Thornton, J. M. (2015). The classification and evolution of enzyme function. Biophysical Journal, 109(6), 1082–1086.
Darve, E. and Pohorille, A. (2001). Calculating free energies using average force. The Journal of Chemical Physics, 115(20), 9169–9183.
Darve, E., Wilson, M. A., and Pohorille, A. (2002).
Calculating free energies using a scaled-force molecular dynamics algorithm. Molecular Simulation, 28(1-2), 113–144.
Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., and Hansch, C. (1991). Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2), 786–797.
Diry, M., Tomkiewicz, C., Koehle, C., Coumoul, X., Bock, K. W., Barouki, R., and Transy, C. (2006). Activation of the dioxin/aryl hydrocarbon receptor (AhR) modulates cell plasticity through a JNK-dependent mechanism. Oncogene, 25(40), 5570–5574.
Dix, D. J., Houck, K. A., Martin, M. T., Richard, A. M., Setzer, R. W., and Kavlock, R. J. (2006). The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicological Sciences, 95(1), 5–12.
Dobson, P. D. and Doig, A. J. (2003). Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4), 771–783.
Dombrowski, R. J., Hyduke, D. R., and Lastoskie, C. M. (2000). Pore size analysis of activated carbons from argon and nitrogen porosimetry using density functional theory. Langmuir, 16(11), 5041–5050.
Eickenberg, M., Exarchakis, G., Hirn, M., and Mallat, S. (2017). Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3D electronic densities. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6540–6549.
Eickenberg, M., Exarchakis, G., Hirn, M., Mallat, S., and Thiry, L. (2018). Solid harmonic wavelet scattering for predictions of molecule properties. Journal of Chemical Physics, 148, 241732.
Fagervold, S. K., Chai, Y., Davis, J. W., Wilken, M., Cornelissen, G., and Ghosh, U. (2010). Bioaccumulation of polychlorinated dibenzo-p-dioxins/dibenzofurans in E. fetida from floodplain soils and the effect of activated carbon amendment.
Environmental Science & Technology, 44(14), 5546–5552. PMID: 20560599.
Fu, Z., Wang, Y., Chen, J., Wang, Z., and Wang, X. (2016). How PBDEs are transformed into dihydroxylated and dioxin metabolites catalyzed by the active center of cytochrome P450s: A DFT study. Environmental Science & Technology, 50(15), 8155–8163. PMID: 27363260.
Gama, F., Ribeiro, A., and Bruna, J. (2019). Diffusion scattering transforms on graphs. In International Conference on Learning Representations. arXiv:1806.08829.
Gärtner, T., Flach, P., and Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. Springer.
Gatys, L., Ecker, A. S., and Bethge, M. (2015). Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, pages 262–270.
Ghosh, U., Gillette, J. S., Luthy, R. G., and Zare, R. N. (2000). Microscale location, characterization, and association of polycyclic aromatic hydrocarbons on harbor sediment particles. Environmental Science & Technology, 34(9), 1729–1736.
Ghosh, U., Luthy, R. G., Cornelissen, G., Werner, D., and Menzie, C. A. (2011). In-situ sorbent amendments: A new direction in contaminated sediment management. Environmental Science & Technology, 45(4), 1163–1168. PMID: 21247210.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272, International Convention Centre, Sydney, Australia. PMLR.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In NIPS.
Hammond, D. K., Vandergheynst, P., and Gribonval, R. (2011).
Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30, 129–150.
Hansch, C., Maloney, P. P., Fujita, T., and Muir, R. M. (1962). Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature, 194(4824), 178–180.
Haro, M., Cabal, B., Parra, J. B., and Ania, C. O. (2011). On the adsorption kinetics and equilibrium of polyaromatic hydrocarbons from aqueous solution. Adsorption Science & Technology, 29(5), 467–478.
He, H., Galy, J., and Gerard, J.-F. (2005). Molecular simulation of the interlayer structure and the mobility of alkyl chains in HDTMA+/montmorillonite hybrids. The Journal of Physical Chemistry B, 109(27), 13301–13306. PMID: 16852659.
Hilscherova, K., Blankenship, A., Kannan, K., Nie, M., Williams, L. L., Coady, K., Upham, B. L., Trosko, J. E., Bursian, S., and Giesy, J. P. (2003). Oxidative stress in laboratory-incubated double-crested cormorant eggs collected from the Great Lakes. Archives of Environmental Contamination and Toxicology, 45(4), 533–546.
Hirn, M., Mallat, S., and Poilvert, N. (2017). Wavelet scattering regression of quantum chemical energies. Multiscale Modeling and Simulation, 15(2), 827–863. arXiv:1605.04654.
Kaplan, B. L., Crawford, R. B., Kovalova, N., Arencibia, A., Kim, S. S., Pinnavaia, T. J., Boyd, S. A., Teppen, B. J., and Kaminski, N. E. (2011). TCDD adsorbed on silica as a model for TCDD-contaminated soils: Evidence for suppression of humoral immunity in mice. Toxicology, 282(3), 82–87.
Kayala, M. A. and Baldi, P. F. (2011). A machine learning approach to predict chemical reactions. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 747–755. Curran Associates, Inc.
Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks.
In International Conference on Learning Representations (ICLR).
Klopmand, G. (1992). Concepts and Applications of Molecular Similarity, by Mark A. Johnson and Gerald M. Maggiora, Eds., John Wiley & Sons, New York, 1990, 393 pp. Price: $65.00. Journal of Computational Chemistry, 13(4), 539–540.
Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. (2018). Covariant compositional networks for learning graphs. arXiv:1801.02144.
Kriege, N. M., Giscard, P.-L., and Wilson, R. (2016). On valid optimal assignment kernels and applications to graph classification. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1623–1631. Curran Associates, Inc.
Kupryianchyk, D., Rakowska, M. I., Roessink, I., Reichman, E. P., Grotenhuis, J. T. C., and Koelmans, A. A. (2013). In situ treatment with activated carbon reduces bioaccumulation in aquatic food chains. Environmental Science & Technology, 47(9), 4563–4571. PMID: 23544454.
Liu, C., Gu, C., Yu, K., Li, H., Teppen, B. J., Johnston, C. T., Boyd, S. A., and Zhou, D. (2015). Integrating structural and thermodynamic mechanisms for sorption of PCBs by montmorillonite. Environmental Science & Technology, 49(5), 2796–2805. PMID: 25629399.
Liu, T., Xue, L., Guo, X., Huang, Y., and Zheng, C. (2016). DFT and experimental study on the mechanism of elemental mercury capture in the presence of HCl on α-Fe2O3(001). Environmental Science & Technology, 50(9), 4863–4868. PMID: 27082260.
Lohmann, R., MacFarlane, J. K., and Gschwend, P. M. (2005).
Importance of black carbon to sorption of native PAHs, PCBs, and PCDDs in Boston and New York harbor sediments. Environmental Science & Technology.
Lostanlen, V. and Mallat, S. (2015). Wavelet scattering on the pitch spiral. In Proceedings of the 18th International Conference on Digital Audio Effects, pages 429–432.
Mallat, S. (2012). Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10), 1331–1398.
McGregor, M. J. and Pallai, P. V. (1997). Clustering of large databases of compounds: using the MDL “keys” as structural descriptors. Journal of Chemical Information and Computer Sciences, 37(3), 443–448.
Meyer, Y. (1993). Wavelets and Operators, volume 1. Cambridge University Press.
Naik, Z. K. and Gandhi, M. R. (2018). A review: Object detection using deep learning. International Journal of Computer Applications, 180(29), 46–48.
Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. (2017). graph2vec: Learning distributed representations of graphs. arXiv:1707.05005.
Neumann, M., Patricia, N., Garnett, R., and Kersting, K. (2012). Efficient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 378–393. Springer.
Nguyen, B. T., Lehmann, J., Hockaday, W. C., Joseph, S., and Masiello, C. A. (2010). Temperature sensitivity of black carbon decomposition and oxidation. Environmental Science & Technology, 44(9), 3324–3331. PMID: 20384335.
Nguyen, T. H., Cho, H.-H., Poster, D. L., and Ball, W. P. (2007). Evidence for a pore-filling mechanism in the adsorption of aromatic hydrocarbons to a natural wood char. Environmental Science & Technology, 41(4), 1212–1217. PMID: 17593721.
Niepert, M., Ahmed, M., and Kutzkov, K. (2016). Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014).
Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724.
Oyallon, E. and Mallat, S. (2015). Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1412.8659.
Patmont, C. R., Ghosh, U., LaRosa, P., Menzie, C. A., Luthy, R. G., Greenberg, M. S., Cornelissen, G., Eek, E., Collins, J., Hull, J., Hjartland, T., Glaza, E., Bleiler, J., and Quadrini, J. (2015). In situ sediment treatment using activated carbon: A demonstrated sediment cleanup technology. Integrated Environmental Assessment and Management, 11(2), 195–207.
Perlmutter, M., Wolf, G., and Hirn, M. (2018). Geometric scattering on manifolds. In NeurIPS Workshop on Integration of Deep Learning Theories. arXiv:1812.06968.
Persson, N. J., Gustafsson, Ö., Bucheli, T. D., Ishaq, R., and Broman, D. (2002). Soot-carbon influenced distribution of PCDD/Fs in the marine environment of the Grenlandsfjords, Norway. Environmental Science & Technology, 36(23), 4968–4974. PMID: 12523408.
Pignatello, J. J., Oliveros, E., and MacKay, A. (2006). Advanced oxidation processes for organic contaminant destruction based on the Fenton reaction and related chemistry. Critical Reviews in Environmental Science and Technology, 36(1), 1–84.
Plimpton, S. (1995). Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1), 1–19.
PyGSP (2018). Graph signal processing in Python (https://pygsp.readthedocs.io/en/stable/index.html).
Rawat, W. and Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449. PMID: 28599112.
Rogers, D. and Hahn, M. (2010). Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5), 742–754. PMID: 20426451.
Rusyn, I.
and Daston, G. P. (2010). Computational toxicology: Realizing the promise of the toxicity testing in the 21st century. Environmental Health Perspectives, 118(8), 1047–1050.
Sedlak, D. L. (2015). Professor Einstein and the quantum mechanics. Environmental Science & Technology, 49(5), 2585–2585. PMID: 25699689.
Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. (2009). Efficient graphlet kernels for large graph comparison. In D. van Dyk and M. Welling, editors, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 488–495, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA. PMLR.
Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep), 2539–2561.
Sifre, L. and Mallat, S. (2012). Combined scattering for rotation invariant texture analysis. In Proceedings of the ESANN 2012 conference.
Sifre, L. and Mallat, S. (2014). Rigid-motion scattering for texture classification. arXiv:1403.1687.
Sinkkonen, S. and Paasivirta, J. (2000). Degradation half-life times of PCDDs, PCDFs and PCBs for environmental fate modeling. Chemosphere, 40(9), 943–949.
Sun, Q., Xie, H.-B., Chen, J., Li, X., Wang, Z., and Sheng, L. (2013). Molecular dynamics simulations on the interactions of low molecular weight natural organic acids with C60. Chemosphere, 92(4), 429–434.
Taheri, A., Gimpel, K., and Berger-Wolf, T. (2018). Learning graph representations with recurrent neural network autoencoders. In KDD Deep Learning Day.
Tice, R. R., Austin, C. P., Kavlock, R. J., and Bucher, J. R. (2013). Improving the human hazard characterization of chemicals: A Tox21 update. Environmental Health Perspectives, 121(7), 756–765.
Tixier, A. J.-P., Nikolentzos, G., Meladianos, P., and Vazirgiannis, M. (2017).
Classifying graphs as images with convolutional neural networks. arXiv:1708.02218.
Toivonen, H., Srinivasan, A., King, R. D., Kramer, S., and Helma, C. (2003). Statistical evaluation of the predictive toxicology challenge 2000–2001. Bioinformatics, 19(10), 1183–1193.
Travis, C. C., Hattemer-Frey, H. A., and Silbergeld, E. (1989). Dioxin, dioxin everywhere. Environmental Science & Technology, 23(9), 1061–1063.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In International Conference on Learning Representations.
Verlet, L. (1967). Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Phys. Rev., 159, 98–103.
Verma, S. and Zhang, Z.-L. (2018). Graph capsule convolutional neural networks. In Joint ICML and IJCAI Workshop on Computational Biology. arXiv:1805.08090.
Wale, N., Watson, I. A., and Karypis, G. (2008). Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3), 347–375.
Wang, Y., Wohlert, J., Bergenstråhle-Wohlert, M., Kochumalayil, J. J., Berglund, L. A., Tu, Y., and Ågren, H. (2015). Molecular adhesion at clay nanocomposite interfaces depends on counterion hydration–molecular dynamics simulation of montmorillonite/xyloglucan. Biomacromolecules, 16(1), 257–265. PMID: 25389796.
Wu, Z., Ramsundar, B., Feinberg, E., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chem. Sci., 9, 513–530.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? In International Conference on Learning Representations.
Yanardag, P. and Vishwanathan, S. (2015). Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014).
How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27, pages 3320–3328.
Young, T., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13(3), 55–75.
Zhan, J., Sunkara, B., Le, L., John, V. T., He, J., McPherson, G. L., Piringer, G., and Lu, Y. (2009). Multifunctional colloidal particles for in situ remediation of chlorinated hydrocarbons. Environmental Science & Technology, 43(22), 8616–8621. PMID: 20028061.
Zhan, J., Kolesnichenko, I., Sunkara, B., He, J., McPherson, G. L., Piringer, G., and John, V. T. (2011). Multifunctional iron-carbon nanocomposites through an aerosol-based process for the in situ remediation of chlorinated hydrocarbons. Environmental Science & Technology, 45(5), 1949–1954. PMID: 21299241.
Zhang, M., Cui, Z., Neumann, M., and Chen, Y. (2018). An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, pages 4438–4445.
Zou, D. and Lerman, G. (2018). Graph convolutional neural networks via scattering. arXiv:1804.00099.
Zou, J., Ji, B., Feng, X.-Q., and Gao, H. (2006). Self-assembly of single-walled carbon nanotubes into multiwalled carbon nanotubes in water: molecular dynamics simulations. Nano Letters, 6(3), 430–434. PMID: 16522036.