COMPUTATIONAL METHODS FOR UNDERSTANDING ENVIRONMENTAL PROCESSES AND TOXICITY

By

Feng Gao

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Crop and Soil Sciences - Environmental Toxicology – Doctor of Philosophy
Computational Mathematics, Science and Engineering – Dual Major

2019

ABSTRACT

COMPUTATIONAL METHODS FOR UNDERSTANDING ENVIRONMENTAL PROCESSES AND TOXICITY

By

Feng Gao

Understanding the environmental processes, risks, and toxicities of persistent organic pollutants (POPs) is needed to protect human and ecosystem health. The toxic effects of POPs on human health are usually assessed with a variety of time- and cost-intensive in vivo and in vitro experiments; in vivo evaluations utilize animals and are further complicated by ethical concerns. Computational models provide an alternative to laboratory-based experiments, and have recently become widely used to study reaction mechanisms, predict chemical toxicity, and support risk assessment. In this dissertation, I study two computational methods that could potentially be used to advance remediation of dioxin and assess chemical toxicity: 1) molecular dynamics simulation of dioxin adsorption in activated carbon pores, and 2) toxicity prediction with deep learning models, with a special focus on geometric scattering methods. Polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are ubiquitous environmental contaminants that resist chemical, biological, and physical routes of dissipation. They are well known for their toxicity, including adverse effects on reproductive health, impairment of mammalian immunity, and carcinogenicity. Adding activated carbons (ACs) to soils or sediments has been suggested as a means to promote the sequestration of PCDD/Fs in forms with reduced bioavailability, and hence reduced toxicity to humans and other mammals.
However, the mechanisms and processes of dioxin adsorption by ACs are not well understood. Thus, molecular dynamics simulations were used to study the mechanism of dioxin adsorption in AC pores and to evaluate the effects of pore size on dioxin adsorption. The results showed that smaller pores created a comparatively more hydrophobic sub-aqueous environment that promoted the adsorption of dioxins. Deep learning has achieved great success in image recognition, natural language processing, and many other tasks. Recently, the application of deep learning methods to toxicity prediction for organic molecules has gained increasing interest. Molecules can be treated as graphs, where atoms are nodes and bonds are edges. However, regular deep learning methods cannot be directly applied to data in a non-Euclidean space such as graphs. Therefore, geometric scattering methods that generalize regular scattering transforms to non-Euclidean spaces are proposed herein. The scattering transform was originally introduced to provide a mathematical understanding of convolutional networks. The results in this dissertation show that geometric scattering methods achieve near state-of-the-art results on multiple standard graph classification tasks and can be used for various explorations of biochemical data. Finally, geometric scattering was applied to toxicity prediction with real-world toxicity datasets. The results demonstrate that it has excellent potential as an alternative approach for toxicity prediction.

TABLE OF CONTENTS

LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . vi
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . vii
CHAPTER 1 INTRODUCTION AND RESEARCH OBJECTIVES . . . . . . . . . . . 1
  1.1 Dioxin in the Environment . . . . . . . . . . . . . . . . . . . 1
  1.2 Molecular Dynamics Simulations . . . . . . . . . . . . . . . . 3
  1.3 Geometric Deep Learning . . . . . . . . . . . . . . . . . . . . 5
  1.4 Scattering Transform . . . . . . . . . . . . . . . . . . . . . 8
  1.5 QSAR and Toxicity Prediction . . . . . . . . . . . . . . . . . 9
  1.6 Research Objectives . . . . . . . . . . . . . . . . . . . . . . 13
CHAPTER 2 MOLECULAR DYNAMICS STUDY OF THE INFLUENCE OF ACTIVATED
          CARBON PORE-SIZE ON DIOXIN ADSORPTION . . . . . . . . . . . 14
  2.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
  2.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 14
  2.3 Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  2.4 Computational Details . . . . . . . . . . . . . . . . . . . . . 18
  2.5 Results and Discussions . . . . . . . . . . . . . . . . . . . . 24
    2.5.1 Water in Activated Carbon Pores . . . . . . . . . . . . . 24
    2.5.2 Dioxin Sorption in Activated Carbon Pores . . . . . . . . 26
    2.5.3 PMF of Dioxin . . . . . . . . . . . . . . . . . . . . . . 27
    2.5.4 Sorption at Activated Carbon Edges . . . . . . . . . . . . 32
  2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 32
CHAPTER 3 GEOMETRIC SCATTERING FOR GRAPH DATA ANALYSIS . . . . . . . 35
  3.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
  3.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 35
  3.3 Graph Random Walks and Graph Wavelets . . . . . . . . . . . . . 38
  3.4 Geometric Scattering . . . . . . . . . . . . . . . . . . . . . 41
    3.4.1 Geometric Scattering Definitions . . . . . . . . . . . . . 41
    3.4.2 Stability and Capacity of Geometric Scattering . . . . . . 44
    3.4.3 Geometric Scattering Compared to Other Feed Forward Graph
          ConvNets . . . . . . . . . . . . . . . . . . . . . . . . . 45
  3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
    3.5.1 Detailed Dataset Descriptions . . . . . . . . . . . . . . 47
    3.5.2 Graph Classification on Social Networks and Biochemistry
          Datasets . . . . . . . . . . . . . . . . . . . . . . . . . 50
    3.5.3 Classification with Low Training-data Availability . . . . 53
    3.5.4 Dimensionality Reduction . . . . . . . . . . . . . . . . . 54
    3.5.5 Data Exploration: Enzyme Class Exchange Preferences . . . 56
    3.5.6 Ablation Study . . . . . . . . . . . . . . . . . . . . . . 58
  3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 59
CHAPTER 4 TOXICITY PREDICTION OF SMALL ORGANIC MOLECULES . . . . . . 61
  4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
  4.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 61
  4.3 Method Development and Evaluation . . . . . . . . . . . . . . . 63
    4.3.1 Geometric Scattering for Molecule Feature Generation . . . 64
    4.3.2 Data Preprocessing . . . . . . . . . . . . . . . . . . . . 64
    4.3.3 Evaluation Criteria . . . . . . . . . . . . . . . . . . . 65
  4.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

LIST OF TABLES

Table 3.1: Basic statistics of the biochemical graph classification databases . . . 49
Table 3.2: Basic statistics of the social network graph classification databases . . . 49
Table 3.3: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on social graph datasets . . . 51
Table 3.4: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on biochemistry graph datasets . . . 51
Table 3.5: Classification accuracy with different training/validation/test splits over scattering features (unnorm. moments) . . . 53
Table 3.6: Classification accuracy and dimensionality reduction with PCA over scattering features (unnorm. moments) . . . 56
Table 3.7: Dimensionality reduction with PCA over scattering features (unnorm. moments) . . . 56
Table 3.8: EC subspace analysis in scattering feature space of ENZYMES (Borgwardt et al., 2005) . . . 57
Table 3.9: Ablation study on five social network datasets using only one normalized moment and two normalized moments . . . 59
Table 3.10: Graph classification with FCLs and linear SVM classifiers . . . 59
Table 4.1: Training and prediction results with GBRT and LR for "Very Toxic" classification . . . 66
Table 4.2: Training and prediction results with GBRT and LR for "Nontoxic" classification . . . 66
Table 4.3: Sensitivity, specificity and overall prediction accuracy for "Very Toxic" classification . . . 66
Table 4.4: Sensitivity, specificity and overall prediction accuracy for "Nontoxic" classification . . . 67
Table 4.5: Training and prediction results for EPA hazard category classification . . . 67
Table 4.6: Training and prediction results for GHS hazard category classification . . . 67

LIST OF FIGURES

Figure 1.1: Data represented as graphs . . . 6
Figure 1.2: Brief summary of GCN history . . . 7
Figure 1.3: Signals on graph . . . 8
Figure 1.4: Scattering transform in Euclidean space . . . 10
Figure 2.1: Unit cell for a periodic model representing six unique layers of graphite that form an AC slit-pore embedded in water for the study of DD desorption processes . . . 15
Figure 2.2: Number of water molecules per nm^3 inside activated carbon pores of three different widths; the model is divided into 100 small bins along axis (1,0,0); the number density of water molecules in bulk water is around 33/nm^3 . . . 25
Figure 2.3: Potential of mean force for water molecules, in units of kT . . . 26
Figure 2.4: Dioxin moving trajectories in small AC pore . . . 27
Figure 2.5: Dioxin moving trajectories in medium AC pore . . . 28
Figure 2.6: Dioxin moving trajectories in large AC pore . . . 29
Figure 2.7: Side view of the moving trajectory of dioxin in large-pore AC model (water molecules removed for a clear view) . . . 30
Figure 2.8: The potential of mean force (kcal/mol) for desorption of DD from AC pores of three diameters. The AC pore edge is located at x = 0; x < 0 refers to the inside of the AC pore and x > 0 refers to the bulk water phase . . . 31
Figure 2.9: Dioxin moving trajectories at AC edge . . . 33
Figure 2.10: Radial distribution between oxygen atoms of dioxin and hydrogen atoms at activated carbon edges . . . 34
Figure 3.1: Wavelets Ψ_j for increasing scale 2^j left to right, applied to Diracs centered at two different locations (marked by red circles) in two different graphs. Vertex colors indicate wavelet values (corresponding to colorbars for each plot), ranging from yellow/green indicating positive values to blue indicating negative values. Both graphs are freely available from PyGSP (2018) . . . 41
Figure 3.2: Illustration of (a) the proposed scattering feature extraction (see Eqs. 3.4, 3.5, and 3.6), and (b) its application for graph data analysis . . . 43
Figure 3.3: (a) Box plot showing the drop in SVM classification accuracy over social graph datasets when reducing training set size (horizontal axis marks the portion of data used for testing); (b) relation between explained variance, SVM classification accuracy, and PCA dimensions over scattering features in the ENZYMES dataset . . . 54
Figure 3.4: Comparison of EC exchange preferences in enzyme evolution: (a) observed in Cuesta et al. (2015), and (b) inferred from scattering features via pref(EC-i, EC-j) := w_j · [min{D(i,j)/D(i,i), D(j,i)/D(j,j)}]^{-1}, where w_j = portion of enzymes in EC-j that choose another EC as their nearest subspace, and D(i,j) = mean distance of enzymes in EC-i from the PCA (90% explained variance) subspace of EC-j. Our inference (b) mainly recovers (a) . . . 57

CHAPTER 1

INTRODUCTION AND RESEARCH OBJECTIVES

1.1 Dioxin in the Environment

Dioxins and furans are a family of compounds bearing different numbers of chlorine atoms at various positions. Many of these compounds are considered toxic and can cause serious diseases. Among these chemicals, polychlorinated dibenzodioxins/furans (PCDD/Fs) have been studied extensively. They are usually generated through combustion of carbon-based materials or as by-products of chemical manufacturing processes. Most of these chemicals are lipophilic and hydrophobic (with high log Kow values of 5.6-8.0 for tetra- through octa-substituted chlorinated dibenzodioxins/furans). Thus, once they enter human or animal bodies, they are absorbed and stored in fat tissues (Sinkkonen and Paasivirta (2000)). In the natural environment, they usually originate from historical manufacturing releases and are found in sediments and soils. For example, Hilscherova et al.
(2003) reported that the concentrations of PCDD/Fs in sediments and flood-plain soils collected along the Tittabawassee River in Michigan ranged from 102 to 53,600 pg/g, dry wt in 2003, and such contamination can persist for decades. Researchers have studied the pathways by which dioxin affects human health for years (Diry et al. (2006)). Dioxin typically binds to an intracellular protein known as the aryl hydrocarbon receptor (AhR). Activated AhRs can alter the expression and/or function of genes, leading to the dysfunction of normal cells and eventually harming human and animal health. The resulting health effects include cancer, skin problems, liver damage, and reproductive issues, and depend on exposure time and amount. People today are exposed to dioxins through several routes. Consuming contaminated food (e.g., animal products) is one of the primary ways dioxins enter the human body: because they persist in fat tissues, they accumulate up the food chain. It is thus very important to find a remediation method to remove PCDD/Fs from sediments and soils to mitigate risks to human health and ecosystems. Recently, a new direction for remediation of soils contaminated with hydrophobic organic contaminants has emerged: in-situ sorbent amendment with carbonaceous geosorbents (CGs) such as activated carbons (ACs) and biochars (Ghosh et al. (2011)). Most CGs are found to have high sorption capacity for hydrophobic organic contaminants because of their relatively large surface areas and their strong interactions with contaminant molecules (Ghosh et al. (2011)). CGs added to soils/sediments can potentially sequester contaminants released from the contaminated soils/sediments, reduce the aqueous-phase concentration of the contaminants (thereby reducing their potential for exposure), and make the contaminants less bioavailable.
Therefore, it is very important to evaluate the capability of CGs for use as in-situ sorbent amendments to reduce the aqueous-phase concentration and bioavailability of PCDD/Fs in the natural environment. ACs are porous carbonaceous materials that are manufactured under different conditions. They have large pore surfaces and pore volumes, and are therefore ideal adsorbents for contaminants. In addition to large surface area, ACs usually carry many functional groups produced during manufacturing, which vary with conditions such as temperature. Depending on the nature of the precursor and on the activation process, oxygen-containing groups can be bound to the edges of the graphitic plates or to defect sites such as vacancies. These groups are usually polar in nature and can interact with polar species, increasing the affinity of the carbon for these substances. Oxygen functional groups can be present in several forms, such as carboxyl, carbonyl, and hydroxyl, and these can combine to form more complex arrangements. These functional groups also play an important role in the adsorption of contaminants. Although CGs have high sorption capacity for hydrophobic organic contaminants, the sorption process can be influenced by dissolved organic matter (DOM) or other organic molecules (e.g., small organic acids) in the natural environment, because these organic molecules are themselves sorbed by CGs. They can act as competitive sorbates or pore-blocking agents, reducing the sorption of target hydrophobic organic contaminants on CGs (Pignatello et al. (2006)). There are numerous examples where DOM such as humic and fulvic acids attenuated the sorption of organic compounds on carbonaceous materials (Cornelissen and Gustafsson (2004)). For example, sorption of phenanthrene to environmental black carbon (BC) was shown to be as much as one order of magnitude lower than sorption to pure BC.
This attenuation effect is due to native organic compounds and/or DOM molecules competing for, or blocking, CG sorption sites.

1.2 Molecular Dynamics Simulations

Molecular simulations have been widely used to study environmental problems, covering topics such as chemical reactions that take place in the environment (Liu et al. (2016)), tracking intermediate products during reactions (Fu et al. (2016)), and simulating complex environmental systems that are difficult to study via laboratory experiments alone (He et al. (2005)). These computational models aim to provide a theoretical foundation for many experimental phenomena, as well as to reveal processes that are hard or even impossible to observe experimentally. Density functional theory (DFT) and molecular dynamics (MD) simulations are two commonly used molecular simulation methods and have been widely explored in environmental science. DFT can perform relatively accurate energy calculations based on quantum chemistry. However, it demands enormous computational power, requires long run times, and is limited in system size; the systems it can treat are often too small to mimic the real environment (Cohen et al. (2008)). Molecular dynamics simulation, on the other hand, is a less computationally intense alternative. It can be adapted to simulate systems from the nano-scale to the macro-scale, which makes it suitable for relatively large environmental systems, and it provides molecular-level understanding of environmental issues. MD has been used to model complex environmental systems such as the self-assembly of nanotubes in water (Zou et al. (2006)), adsorption of contaminants on clays (Liu et al. (2015a)), C60 interaction with natural organic matter (Sun et al. (2013)), and many other environmental systems.
The basic idea of MD is to follow the classical equations of motion:

a = F / m    (1.1)

F = −∂U/∂r    (1.2)

to calculate the motion of the atoms at each time step. By calculating the forces, we obtain the velocities and positions of each atom at every time step and can then update the atoms' positions and velocities accordingly. To compute the force, we usually need the potential energy U first, since the force is the derivative of the potential energy. The potential energy U mainly includes two parts: non-bonded interactions U_non-bonded and bonding potentials U_intramolecule. The non-bonded interactions U_non-bonded are the interactions between non-bonded atoms. The Lennard-Jones potential is the most commonly used potential for this interaction:

v_LJ(r) = 4ε [ (σ/r)^12 − (σ/r)^6 ]    (1.3)

where σ is the distance at which the inter-particle potential is zero, r is the distance between the particles, and ε is the depth of the potential well, measuring how strongly the two particles attract each other. If electrostatic charges are present, Coulomb potentials should also be calculated:

v_Coulomb(r) = Q1 Q2 / (4π ε0 r)    (1.4)

Once the potential is calculated, one can easily calculate the forces according to Equation (1.2). The calculation of intramolecular bonding should include bonds, bend angles, and torsion angles, and usually depends on the chosen force field. Many algorithms are used to accelerate the updating of energies and forces in MD. For example, the Verlet algorithm (Verlet (1967)) is used to update positions and forces, and periodic boundary conditions (PBC) are used to mimic a larger periodic system. Despite the wide use of MD simulations, there are still limits on the achievable time and length scales. Simulation runs are typically short: typically t ∼ 10^3 − 10^6 MD steps.
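As a minimal illustration (not the simulation code used in Chapter 2), Equations (1.1)-(1.3) translate into an update loop along the following lines, written in reduced Lennard-Jones units; the parameter values and function names are hypothetical:

```python
import numpy as np

# Illustrative constants in reduced LJ units (hypothetical values).
SIGMA, EPS, DT = 1.0, 1.0, 1e-3

def lj_potential(r):
    """Lennard-Jones pair potential, Eq. (1.3)."""
    sr6 = (SIGMA / r) ** 6
    return 4.0 * EPS * (sr6 ** 2 - sr6)

def lj_force(r):
    """Magnitude of the pair force, F = -dU/dr (Eq. (1.2))."""
    sr6 = (SIGMA / r) ** 6
    return 24.0 * EPS * (2.0 * sr6 ** 2 - sr6) / r

def velocity_verlet_step(pos, vel, mass, forces_fn):
    """One velocity-Verlet integration step of a = F/m (Eq. (1.1))."""
    f = forces_fn(pos)
    vel_half = vel + 0.5 * DT * f / mass       # half-step velocity update
    pos_new = pos + DT * vel_half              # full-step position update
    vel_new = vel_half + 0.5 * DT * forces_fn(pos_new) / mass
    return pos_new, vel_new
```

Note that v_LJ vanishes at r = σ and reaches its minimum −ε at r = 2^(1/6) σ, where the force is zero; these identities are a convenient sanity check for any force-field implementation.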
The calculated energies also depend on the chosen force field and can vary considerably among different force fields.

1.3 Geometric Deep Learning

Deep learning utilizes multi-layer neural network models to learn patterns from data. It has gained great attention in recent years, and various deep learning algorithms have been proposed. Deep neural networks have proven to be powerful models for many machine learning tasks, e.g., natural language processing (Young et al. (2018)), object detection (Naik and Gandhi (2018)), and image classification (Rawat and Wang (2017)). Deep learning has also achieved great success in many other domains. For example, in chemistry it has been used to predict chemical reactions (Kayala and Baldi (2011)); in physics, to predict material energies; and in biochemistry, to find drug candidates (Chen et al. (2018)). The great success of deep learning is partially due to rapid increases in computational power and much cheaper storage. In particular, the wide adoption of GPU computing and the availability of more and more public datasets, such as LiDAR (Light Detection and Ranging) data and MNIST (Modified National Institute of Standards and Technology), allow researchers to explore more powerful neural network architectures on large-scale data. Among the many deep learning models, the Convolutional Neural Network (CNN) may be one of the most successful. CNNs have been widely used in image analysis, signal processing, and many other tasks where the data can be treated as functions in Euclidean space. CNNs can extract local features with learned filters to generate expressive representations of the data. Beyond data in Euclidean space, there are also many data that are better modeled by graphs. Graphs are data structures used to model pairwise relations between objects: a graph consists of vertices (nodes) connected by edges.
Indeed, graph-structured data are ubiquitous. For example, as shown in Figure 1.1, in chemistry molecules can be represented as graphs, where the atoms in the molecule are nodes and bonds are treated as edges. In social science, the relationships among individuals can be represented as a social network, where people are nodes connected according to different relationships. Another example, from biology, is protein-protein interactions, which can also be modeled as graphs. However, graph-structured data have no well-known Euclidean representation, and thus regular deep neural network models cannot be applied directly. Among the challenges such models face with graphs: there is no global parameterization or common system of coordinates, the vector space structure is not known, and properties such as shift-invariance do not exist (Bronstein et al. (2017)). Therefore, these deep learning models have to be adapted to deal efficiently with non-Euclidean data.

(a) Social network (b) Molecular graph (c) Protein-protein interaction network
Figure 1.1: Data represented as graphs

Recently there has been growing interest in generalizing regular deep learning methods to non-Euclidean spaces, an effort called geometric deep learning. Graph Neural Networks (GNNs) were proposed to generate embeddings of node features and graph structures. GNNs can be used for tasks such as graph classification, node classification, and link prediction. Many successful GNNs are inspired by the architectures of popular deep learning models such as the CNN, the recurrent neural network (RNN), and attention networks. In particular, the networks that generalize CNNs to graphs are called Graph Convolutional Networks (GCNs). The Graph Convolutional Network (GCN) proposed by Kipf and Welling (2017) is one of the most popular GCNs.
GCN generates a latent representation of each node by aggregating information from the node itself and its neighboring nodes. Other well-known graph neural networks include GraphSage (Hamilton et al. (2017)), the Graph Attention Network (GAT) (Veličković et al. (2018)), the Graph Isomorphism Network (GIN) (Xu et al. (2019)), and so on. There are roughly two ways to construct CNNs on graphs: spatial methods and spectral methods. A brief summary of graph convolutional networks in recent years is given in Figure 1.2. Though not a comprehensive survey of all GCNs, it shows the increasing interest in studying them.

Figure 1.2: Brief summary of GCN history

The architecture of spatial GCNs usually includes aggregation over node neighbors and pooling over the whole graph. Spectral approaches, on the other hand, work with a spectral representation of the graphs. The convolution operation is defined in the Fourier domain by computing the eigendecomposition of the graph Laplacian; the operation can then be defined as the multiplication of a signal with a filter. The spectral methods of GCNs are closely related to graph signal processing. Graph signals are signals defined on graphs, as shown in Figure 1.3. The signals can be node features, edge features, or the structure of the whole graph. By constructing a filter directly on the eigenvalues of the graph Laplacian, we can work directly on the graph signals. Many GCNs are closely related to the Weisfeiler-Lehman (WL) test. The WL test was originally used for the graph isomorphism (GI) problem, which asks: given two graphs, are they topologically identical? The WL test iteratively aggregates the labels of nodes and their neighborhoods into unique new labels. The two graphs are considered non-isomorphic if, at some iteration, the node labels of the two graphs differ.
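The WL relabeling step just described can be sketched in a few lines; the adjacency structure is stored as plain dictionaries here, and `wl_iteration`/`wl_test` are illustrative names rather than a library API:

```python
from collections import Counter

def wl_iteration(adj, labels):
    """One Weisfeiler-Lehman relabeling pass.

    adj: dict mapping node -> list of neighbor nodes
    labels: dict mapping node -> current label
    A node's new label combines its own label with the sorted
    multiset of its neighbors' labels.
    """
    new_labels = {}
    for node, nbrs in adj.items():
        signature = (labels[node], tuple(sorted(labels[n] for n in nbrs)))
        new_labels[node] = hash(signature)  # compress signature to a label
    return new_labels

def wl_test(adj1, adj2, labels1, labels2, iterations=3):
    """Return False if the graphs are provably non-isomorphic."""
    for _ in range(iterations):
        labels1 = wl_iteration(adj1, labels1)
        labels2 = wl_iteration(adj2, labels2)
        if Counter(labels1.values()) != Counter(labels2.values()):
            return False  # label multisets differ -> non-isomorphic
    return True  # inconclusive: possibly isomorphic
```

For example, a 3-node path and a triangle (both with uniform initial labels) are separated after a single iteration, since the path's endpoints see one neighbor while every triangle node sees two.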
The WL test is an effective and computationally efficient method that distinguishes a broad class of graphs, though it fails in some specific cases such as regular graphs.

Figure 1.3: Signals on graph

1.4 Scattering Transform

A CNN consists of a series of convolution, non-linearity, and pooling operations followed by several fully connected layers, which makes it highly non-linear and makes it difficult to understand how the coefficient values relate to the signal properties. Therefore, in addition to the CNN, the scattering wavelet transform was proposed by Mallat to provide a mathematical understanding of convolutional networks (Mallat (2012)). In the scattering transform, however, the filters/wavelets are designed rather than learned. The scattering transform follows the general architecture of convolutional neural networks introduced by LeCun, which includes a cascade of convolutions and a "pooling" nonlinearity; in the scattering transform, the nonlinearity is the modulus of a complex number (Bruna and Mallat (2012)). Wavelets are oscillating waveforms that provide both time and frequency localization of a signal. A mother wavelet dilated by the scale 2^j has the form ψ_j(t) = 2^{-j} ψ(2^{-j} t). A wavelet transform

W_J f := {A_J f, Ψ_j f : j ≤ J} = {f ∗ φ_J, f ∗ ψ_j : j ≤ J}

computes signal convolutions with a filter bank of dilated and rotated wavelets. A_J is a low-pass filter, and A_J f extracts low-frequency features from the signal, while {Ψ_j f : j ≤ J} covers high frequencies up to scale J. The main difficulty is to compute translation-invariant scattering coefficients that remain stable under the action of diffeomorphisms and retain the high-frequency information provided by wavelets, since the wavelet transform itself is not translation invariant. To achieve translation invariance, the complex modulus nonlinearity and a low-pass filter are used.
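The convolve-modulus-average cascade can be sketched as follows for a 1-D signal. This is only an illustration of the architecture: the Gabor-like mother wavelet and the plain mean standing in for the low-pass filter A_J are simplifying assumptions, not the filter design used elsewhere in this dissertation:

```python
import numpy as np

def gabor_wavelet(n, scale):
    """A complex Gabor-like wavelet psi_j(t) = 2^-j psi(2^-j t),
    sampled on n points (illustrative choice of mother wavelet)."""
    t = np.arange(n) - n // 2
    s = 2.0 ** scale
    envelope = np.exp(-0.5 * (t / s) ** 2) / s
    return envelope * np.exp(1j * 3.0 * t / s)

def scattering_1d(f, max_scale=4):
    """First- and second-order scattering coefficients of a 1-D signal.

    Cascade per path: convolve with a wavelet, take the complex
    modulus, then average (a crude stand-in for the low-pass A_J).
    """
    n = len(f)
    coeffs = [f.mean()]                      # zeroth order: A_J f
    for j in range(1, max_scale + 1):
        u1 = np.abs(np.convolve(f, gabor_wavelet(n, j), mode="same"))
        coeffs.append(u1.mean())             # first-order coefficient
        for j2 in range(j + 1, max_scale + 1):
            u2 = np.abs(np.convolve(u1, gabor_wavelet(n, j2), mode="same"))
            coeffs.append(u2.mean())         # second-order coefficient
    return np.array(coeffs)
```

With max_scale = 4 this yields 1 + 4 + 6 = 11 coefficients per signal: one zeroth-order average, one first-order coefficient per scale, and one second-order coefficient per increasing scale pair (j < j2).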
With properly chosen wavelet filters, the scattering transform is provably invariant to the actions of certain Lie groups, such as the translation group, and is also provably Lipschitz-stable to small diffeomorphisms, where the size of a diffeomorphism is quantified by its deviation from a translation. The major differences between the scattering transform and the CNN are that CNN filters are learned with back-propagation while scattering filters are designed, and that wavelets can cover multiple scales within one layer via multiple dilations, whereas a CNN layer can only use fixed-size filters. They also share similarities: both have a cascade of convolutions and a non-linearity, which is pooling and ReLU activation in the CNN and the modulus in the scattering transform. Figure 1.4 shows the architecture of the scattering transform, where the blue text represents the scattering coefficients generated at each layer and the black text represents the intermediate process at each layer, where a new set of wavelets is added. Recently there has been increasing interest in generalizing the scattering transform to graphs and manifolds. Like the regular scattering transform, scattering transforms on graphs have been proven to have analogous properties, namely invariance and stability to small deformations (Gama et al. (2019)).

1.5 QSAR and Toxicity Prediction

The origin of QSAR can be traced back to the 1960s, when Hansch and his collaborators tried to understand the basis of the structure-activity relationships of plant growth regulators (Hansch et al. (1962)). Essentially, they tried to correlate biological activities with certain structural features of compounds. Nowadays QSAR is widely used to predict properties of molecules, on the premise that similar structures have similar properties. These properties include toxicity, energy, solubility, and many other physical and chemical properties. QSAR has indeed been used successfully to predict molecular properties; however, many QSAR models are not accurate enough to make reliable predictions and have many limitations. So experimental tests cannot be replaced in the
Indeed QSAR has been successfully used in predicting molecular properties, however many QSAR models are not accurate enough to make reliable predictions and have many limitations. So experimental tests cannot be replaced in the 9 Figure 1.4: Scattering transform in Euclidean space current situation. However, QSAR is a powerful tool to be a supplement method to lab experiments. Many QSAR models heavily depend on the choice of molecular representations. However, the representation of molecule structures in latent space is a challenging task: many chemical descriptors were proposed to represent molecular structures at different levels. For example, the representations could be the count of certain structures in the molecule, descriptors of molecular physical and chemical properties, and encoding of 2D and 3D structures. The molecular representations then can be further used for various tasks such as similarity searching (Klopmand (1992)), clustering (McGregor and Pallai (1997)) and classification (Wu et al. (2018)). The structure of a molecule can be represented by the connectivity of atoms according to chemical bonds. This kind of representation is usually considered as the molecular 2D-descriptor. The main advantage of this representation is that it contains straightforward information about molecular structure (connectivity), invariant to molecule rotation and translation, and finally doesn’t require structure optimization (bond information such as bond length could also be encoded as weights). In addition, more atom information such as atom types can also be encoded. However, 2D-descriptors do not contain 3D structural information and thus may not be able to reflect properties that rely on 3D structure. 10 One natural way to represent the connectivity of a molecule is to use a graph. Graph is used to model relations among multiple objects. A graph is usually denoted as G = (V, E), where V is a set of nodes and E is a set of edges. 
Nodes are the objects of interest, and they are connected by edges according to different relations. In addition to the graph structure, properties of the nodes can be taken into consideration as node features. For instance, a molecule can be represented as a graph where atoms are nodes and bonds are edges (possibly weighted edges, if bond lengths are considered); a protein can be considered a graph where the nodes are amino acids connected by edges reflecting the protein structure; a person's social network can also be constructed as a graph. A molecular graph is therefore a topological representation of a molecule. By transforming molecular structures into graphs, we can use many graph-theoretic methods to study these graph-structured data and extract graph features unique to each molecular graph. The connectivity information of a graph can be encoded by a square matrix called the adjacency matrix, denoted A. The off-diagonal entries of an adjacency matrix encode the connectivity of the graph, while the diagonal entries are all 0. For an unweighted adjacency matrix, the off-diagonal entries are either 0 or 1: if node i and node j are connected, the entry A_ij is 1; otherwise it is 0. A weighted adjacency matrix can also be used to encode bond information by replacing 1 with calculated bond distances. Another commonly used matrix is the degree matrix, a diagonal matrix containing the degree of each node in the graph. While 2D-descriptors are usually good enough to represent molecular graph structures and atom features, they may not be sufficient for tasks specifically related to 3D structure. Therefore, 3D-descriptors are also generated to reflect interactions that depend on 3D structure; for example, protein-ligand interactions depend on the orientations of the involved molecules.
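The adjacency and degree matrices described above can be built directly from a bond list. The three-atom chain below is a hypothetical toy example (not a molecule from any dataset used here):

```python
# Toy molecular graph: heavy atoms of a hypothetical 3-atom chain (e.g. C-C-O),
# nodes indexed 0..2; bonds are unweighted edges.
bonds = [(0, 1), (1, 2)]
n = 3

# Adjacency matrix A: off-diagonal 1s for bonded pairs; the diagonal stays 0.
A = [[0] * n for _ in range(n)]
for i, j in bonds:
    A[i][j] = A[j][i] = 1

# Degree matrix D: diagonal entries count the bonds incident to each atom.
D = [[sum(A[i]) if i == j else 0 for j in range(n)] for i in range(n)]
```

For a weighted molecular graph, the 1s in A would simply be replaced by calculated bond distances, as noted above.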
This is why 3D-descriptors such as CoMFA (Cramer et al. (1988)) were created early on, and their variants are still very popular today. In general, the representation of molecular structures should be invariant to rotation and translation: a rotated molecule should have the same representation as the original one, since the structure of the molecule does not change. One way to obtain rotation- and translation-invariant properties is data augmentation, including differently oriented structures with the same labels; however, this may not be efficient in certain situations. Another way is to follow a fixed rule when generating the representations, which may not be very flexible. Another property of a good molecular representation is that, ideally, a small change in molecular structure should cause only a small change in the representation. In other words, similar molecular structures should have similar representations. One application of QSAR is predicting the toxicity of organic compounds. Indeed, computational models are considered alternatives for toxicity assessment that avoid expensive and time-consuming in-vivo and in-vitro lab experiments. Lab experiments often include animal trials and raise ethical concerns that cannot be neglected. Another disadvantage of animal tests is that they may not provide sufficient guidance on human toxicity responses, owing to inter-species differences and differing disease models. Using computational models to predict the toxicity of organic compounds is also of great significance in drug development. During development, drugs must undergo clinical trials before they can legally enter the market. Specifically, drugs must prove to be nontoxic, or of acceptably low toxicity according to certain regulations, before they can be formally approved by the FDA. It is therefore of great importance to prevent toxic drug candidates from entering clinical trials.
NIH and EPA have been encouraging and supporting research on computational models for toxicity prediction through various projects such as the ToxCast (Dix et al. (2006)) and Tox21 (Tice et al. (2013)) projects. Large amounts of data have been collected over the past years, and several mainstream toxicity databases have been created. The Toxicology Data Network (TOXNET), created in 1985, is among the world's largest collections of toxicology databases. Toxicity ForeCaster (ToxCast) is also a widely used high-throughput toxicity database; it is a part of the Toxicology in the 21st Century (Tox21) program. Tox21 contains both acute and chronic toxicity information. There are also other well-known databases such as PubChem, DrugBank and EcoTox. In addition to these extensive collections of toxicity data, many computational models have been proposed, including statistical approaches such as linear regression analysis and multivariate analysis, early shallow neural network models, and newly developed deep learning methods. 1.6 Research Objectives Understanding environmental processes and toxicity is essential to protecting humans from toxic chemicals. Computational models are less expensive, much faster and more convenient, and they are one of the future directions of risk assessment. Molecular simulations have been used to study the fundamental mechanisms of many environmental issues. On the other hand, with more and more experimental data collected and organized into databases, researchers are aiming to find patterns in big data. There has been a new trend of applying machine learning and deep learning to risk assessment, and one common application is toxicity prediction.
Therefore, the research objectives are threefold: 1) to study the application of MD simulation to the adsorption of dioxin by activated carbons, elucidating the mechanism of dioxin adsorption on activated carbons, which can help us further understand the bioavailability of activated-carbon-adsorbed dioxin and provide important information for risk assessments; 2) following the recent trend of using deep learning and machine learning methods for toxicity prediction, to develop a state-of-the-art geometric scattering model based on graph wavelets for graph learning; and 3) to apply machine learning methods, including the proposed geometric scattering method, to toxicity prediction on publicly available datasets. CHAPTER 2 MOLECULAR DYNAMICS STUDY OF THE INFLUENCE OF ACTIVATED CARBON PORE-SIZE ON DIOXIN ADSORPTION 2.1 Abstract Adding activated carbons (ACs) to soils or sediments is rapidly gaining acceptance as a technology that promotes the sequestration of polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) and reduces their bioavailability to humans and other animals. Molecular dynamics simulations were performed to study the structural dynamics and free energies of dibenzo-p-dioxin desorption from hydrated AC pores to aqueous solution. An AC model with slit-pores of three different sizes, 0.4, 0.9, and 1.6 nm, was built to mimic AC pores spanning the micropore range (< 2 nm). By studying the adsorption of dioxin into these three pore sizes, our results revealed the geometry of dioxin adsorbed inside activated carbon pores and showed that the smallest pore had the strongest ability to adsorb dioxins. We propose that inside small pores, dioxins are restricted to a local area and can develop strong interactions with both carbon sheets. As the pore size increases, the interaction with both AC sheets is impaired by intercalated water molecules.
In addition, small pores create a more hydrophobic sub-environment that favors the accommodation of highly hydrophobic compounds such as dioxins. 2.2 Introduction Hydrophobic organic chemicals in the environment sorb to particle surfaces, can accumulate in soils and sediments, and raise potential issues for human health. In particular, polychlorinated dibenzo-p-dioxins/furans (PCDD/Fs) are widespread contaminants well known for their toxicity, impairment of mammalian immunity, and persistence in the environment (Travis et al. (1989); Boyd et al. (2011); Kaplan et al. (2011)). Traditional remediation methods like dredging and disposal are expensive, destructive of ecosystems, and politically charged, so in-situ remediation methods are being developed and show potential for immobilizing contaminants through adsorption on soil/sediment amendments (Zhan et al. (2009, 2011); Athmer (2004)).

Figure 2.1: Unit cell for a periodic model representing six unique layers of graphite that form an AC slit-pore embedded in water for the study of DD desorption processes.

High surface-area carbonaceous geosorbents such as activated carbons (ACs) and biochars seem especially suitable for remediation of PCDD/Fs, which are both hydrophobic and highly resistant to microbial degradation (Cornelissen et al. (2006)). In fact, ACs seem capable of reducing (Kupryianchyk et al. (2013); Lohmann (2005); Cornelissen et al. (2005); Allen-King et al. (2002)) or even eliminating (Boyd et al. (2017)) the bioavailability of PCDD/Fs, and ACs have already been used in several pilot-scale remediation projects (Patmont et al. (2015)). Pyrolysis of organic biomass in low-oxygen environments yields a continuum of biochar materials depending on synthesis conditions, and high-temperature biochar is the basis for AC.
With increasing pyrolysis temperature, the proportion of polar functional groups and the O/C ratio decrease, with a concomitant increase in aromaticity and sp2 carbon relative to sp3, and an increased number and domain size of graphitic layers (Nguyen et al. (2010)). Thus, an idealized model for the AC matrix is graphite. However, AC is such an interesting and useful material because this graphitic matrix is highly porous. The porosity of ACs is typically characterized by adsorbing relatively inert gases at constant temperature across a wide range of partial pressures; the resulting adsorption isotherms are then used to fit models that yield estimates of total porosity and pore-size distributions (Dombrowski et al. (2000)). Various representations of AC pores have been used in such models, and it is reassuring that cylindrical-pore and slit-pore models yield similar pore-size distributions (Nguyen et al. (2007)). Further, similar results are obtained using the same slit-pore model with different probe gases (Dombrowski et al. (2000)). The results reveal that many ACs possess porosities of 200-700 mL/kg, with 40-80% of these pores in the microporous (pore diameter < 2 nm) and mesoporous (2 nm < pore diameter < 50 nm) size ranges (Dombrowski et al. (2000); Nguyen et al. (2007); Haro et al. (2011)). The aqueous concentration of dissolved PCDD/Fs in soils and sediments is exceptionally low, with typical values of 0.1 to 0.5 pg/L (Arp et al. (2009); Cornelissen et al. (2008b)). Thus, PCDD/Fs in soils or sediments are predominantly associated with solid surfaces. While AC-like phases are not dominant in soils on a mass basis, they are hypothesized to be the dominant sorbents for PCDD/Fs (Cornelissen et al. (2008b); Chai et al. (2011); Cornelissen et al. (2008a); Fagervold et al. (2010); Persson et al. (2002)). Though data are sparse, it appears that PCDD/Fs are concentrated in AC-like particles (Chai et al. (2011); Ghosh et al. (2000)).
The imputed hydrophobicity of the graphitic matrix, combined with the prevalence of micropores, implies a strong partitioning component to PCDD/F sorption to ACs (Ghosh et al. (2000); Dombrowski et al. (2000); Nguyen et al. (2007)). However, much of the total sorption must be adsorption to pore surfaces, with an unknown dependence on the pore-size distribution of a given AC. Generally, the stronger the adsorption, the less bioavailable PCDD/Fs would be expected to be, so understanding the fundamental mechanisms and energetics of adsorption/desorption is essential. The present study was undertaken to estimate adsorption energies and PCDD/F structural dynamics as a function of pore size. Molecular simulations have drawn increasing attention as a way of gaining environmental insight by revealing new mechanistic hypotheses at the molecular level. Both density functional theory (DFT) and molecular dynamics (MD) simulations are powerful tools for addressing complicated environmental issues, aided by increasing computational power and new models (Sedlak (2015)). Our experimental results show that ACs with higher micropore fractions at similar surface areas exhibit greater adsorption of dibenzo-p-dioxin (DD). In this study, we examine the mechanism of DD adsorption into AC micropores. The pore-size effects on DD adsorption are quantified by comparing adsorption free energies, which are estimated from the potential of mean force (PMF). The adaptive biasing force (ABF) method is used to measure the PMF (Liu et al. (2015b)), and the MD trajectory of dioxin movement is analyzed to characterize the behavior of dioxin inside AC pores. Thus, the simulations provide both structural and thermodynamic information for these complex processes. Our results show that small pores create a more hydrophobic environment that assists the adsorption of dioxin, and the smallest pore has a stronger estimated adsorption energy than the medium and large pores.
Our results suggest the hypothesis that pore-size effects of ACs should be taken into consideration in mechanistic descriptions of hydrophobic solute sorption by AC. For example, when applying AC to soils for in-situ dioxin remediation, it is plausible that a more microporous activated carbon may bind the dioxins more strongly and minimize the bioavailability of dioxins to organisms that may ingest the soil. 2.3 Method Commonly, gas-phase adsorption isotherms on AC are interpreted using a slit-pore model to represent the AC porosity (Dombrowski et al. (2000)). In accord with this convention, we created graphitic slit-pore models. These are idealized models of AC porosity in which all carbons are aromatic and no oxygenated functional groups are present on the graphitic surfaces; this simplifies the adsorption process as much as possible so that we can study the pore-size effects in isolation. Briefly, our slit-pore model was constructed using a packet of six graphene layers to represent the AC matrix. The spacing between adjacent carbon layers was initially set at 3.35 Å, the layer spacing of graphite. The space separating a six-layer packet from its neighboring image packet was considered a slit-pore (Figure 2.1). We set 9.0, 12.7 and 19.7 Å as the initial slit-pore sizes (interlayer distance between two neighboring carbon planes), referred to as the small-pore, medium-pore and large-pore ACs. For a molecule like dibenzo-p-dioxin (DD) in a pore within AC, desorption refers to the transfer of DD out of the AC pore into another bulk medium, typically water. To study DD adsorption/desorption processes in AC pores, we constructed two-phase systems to represent an AC slit-pore interacting with bulk water. To do so, some edges of the activated carbon must terminate; in nature, these edges may contain different functional groups such as carboxylate or hydroxyl groups, or even dangling bonds.
All of these groups may affect adsorption. The edges of the AC were saturated with hydrogen atoms (Figure 2.1) to minimize the effects of edge functional groups and focus on adsorption inside the pores. Each AC slit-pore was placed in a periodic orthorhombic box with rough dimensions 80 x 30 x h Å (where h, the height of the box, depends on the slit-pore thickness). A single DD molecule was added to each box, corresponding to a large but realistic sorbed concentration of 4600 mg DD/kg AC; we have measured sorption of DD by ACs, and this magnitude is in equilibrium with an aqueous solution of roughly 10 µg DD/L. The DD molecule was initially placed in the middle of the AC slit-pore (Figure 2.1). The length of the activated carbon and its slit-pore was 50 Å, leaving a box of 30 x 30 x h Å to hold and represent the "bulk" water phase of each system. Each system was then hydrated by overlaying a large box of equilibrated water molecules and adding water molecules at all positions within the unit cell not already occupied. After this process, there were 967 water molecules in the smallest-pore model, 1396 in the medium-pore model, and 2011 in the largest-pore model. These small-, medium-, and large-pore models are labeled ACS, ACM, and ACL, respectively. 2.4 Computational Details Molecular dynamics simulations were performed using the LAMMPS simulation package (Plimpton (1995)) with a time step of 1 fs. To estimate the adsorption energy of DD in AC slit-pores of different thicknesses, the potential of mean force (PMF) (Comer et al. (2015); Darve et al. (2002); Darve and Pohorille (2001)) was calculated using the adaptive biasing force (ABF) method implemented in the COLVARS package of LAMMPS. Our methods closely followed those of Liu et al. (2015a) in their study of PCB desorption from swelling clay. The SPC model was used for water, together with the SHAKE algorithm, which keeps the water bond lengths and angle fixed.
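The quoted sorbed concentration of 4600 mg DD/kg AC follows from one DD molecule per 3300-C-atom AC model; the following back-of-envelope check uses standard atomic masses rather than values taken from the simulation input files:

```python
# Hedged arithmetic check: one dibenzo-p-dioxin molecule (C12H8O2) per
# AC model of 3300 carbon atoms.
M_DD = 12 * 12.011 + 8 * 1.008 + 2 * 15.999   # g/mol for C12H8O2, ~184.2
M_AC = 3300 * 12.011                           # g per mole of 3300-C AC models
conc_mg_per_kg = M_DD / M_AC * 1e6             # (g/g) -> mg/kg
# conc_mg_per_kg is roughly 4600 mg DD/kg AC, matching the text
```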
The Lennard-Jones interactions were truncated at 10 Å with an analytical tail correction, and the PPPM (particle-particle particle-mesh) method with a 10 Å real-space cutoff and a relative accuracy of 0.0001 was used to compute long-range Coulombic interactions. To pre-equilibrate the system, simulations were first run in the NVE ensemble for 1 ns with all AC layers fixed, then for 10 ns in the NPT ensemble with no atoms fixed. These longer runs to equilibrate the micropore-bulk-water two-phase system were performed at 298 K, a pressure of 0.1 MPa, and a step size of 1 fs. In this study, we applied the PCFF force field to the AC system and to the DD solute. The principal PCFF (class 2) parameters are summarized below:

Masses (amu): type 1 (C) 12.01115; type 2 (aromatic H) 1.00797; type 3 (dioxin O) 15.99940; type 4 (water O) 15.99940; type 5 (water H) 1.00797.

Pair coefficients (ε in kcal/mol, σ in Å): type 1 (0.0640, 4.0100); type 2 (0.0200, 2.9950); type 3 (0.2400, 3.5350); type 4 (0.2740, 3.6080); type 5 (0.0130, 1.0980).

Bond coefficients (r0 in Å; K2, K3, K4): type 1 (1.4170, 470.8361, -627.6179, 1327.6345); type 2 (1.0982, 372.8251, -803.4526, 894.3173); type 3 (1.3768, 428.8798, -738.2351, 1114.9655); type 4 (0.9700, 563.2800, -1428.2200, 1902.1200).

Angle coefficients (θ0 in degrees; K2, K3, K4): type 1 (118.90, 61.0226, -34.9931, 0.0000); type 2 (117.94, 35.1558, -12.4682, 0.0000); type 3 (123.42, 73.6781, -21.6787, 0.0000); type 4 (115.07, 47.1131, -32.5592, 13.1257); type 5 (103.70, 49.8400, -11.6000, -8.0000).

The dihedral and improper coefficients, together with the cross-term coefficients (bond-bond, bond-angle, angle-angle, angle-angle-torsion, end-bond-torsion, middle-bond-torsion, bond-bond-1-3, and angle-torsion), follow the standard PCFF assignments for these atom, bond, and angle types.

During these NPT simulations, the unit-cell dimensions relaxed somewhat to achieve their equilibrium values.
The lengths (79.2 Å) and depths (29.2 Å) of each system were the same, while the equilibrated heights were 23.2 Å for the ACS system, 28.2 Å for ACM, and 35.2 Å for ACL. The center-to-center distances between nearest-neighbor "inner" graphene sheets across the slit were about 7.7, 12.4, and 19.7 Å for the ACS, ACM, and ACL systems, respectively. The pore sizes are smaller than these values by about 3.3 Å, because a pore-size probe would not measure down to the C-atom plane, only down to the van der Waals radii of the C atoms in graphene. Thus, the final pore sizes (as would be measured experimentally) were about 4.4, 9.1, and 16.4 Å for the ACS, ACM, and ACL systems, respectively. These three effective slit-pore thicknesses each correspond to an AC model containing 3300 C atoms, and thus represent ACs with micropore-sized porosities of 98, 202, and 364 mL/kg AC, respectively. Despite the simplicity of our slit-pore model, this study thus covered realistic porosities in the micropore size range that dominates the porosity of most ACs. For example, the experimental pore-size distribution of a well-studied AC was found to have a maximum at a slit-pore width of 12 Å (Dombrowski et al. (2000)), so the pores of this modeling study bracket that value nicely. Calculation of the potential of mean force (PMF) was performed in the NVT ensemble in two steps. First, an NVT steered molecular dynamics simulation (298 K, time step of 1 fs) was run with a reaction coordinate along the (1,0,0) direction: the pre-equilibrated DD molecule was slowly dragged along that direction at a rate of 1 Å/ns, spending the first 10 ns inside the pore and then passing into the bulk water for another 10 ns outside the pore, generating the original 20-ns trajectory along the reaction coordinate.
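The quoted effective pore widths and micropore porosities can be reproduced from the numbers above. The 50 x 29.2 Å slit-pore footprint is taken from the system dimensions given in the Method section; this is a hedged arithmetic check, not output of the simulations themselves:

```python
# Effective pore width = center-to-center C-plane distance minus ~3.3 A
# (the van der Waals correction described in the text).
NA = 6.022e23
centers = {"ACS": 7.7, "ACM": 12.4, "ACL": 19.7}       # C-plane distances, Angstrom
pores = {k: c - 3.3 for k, c in centers.items()}        # -> 4.4, 9.1, 16.4 Angstrom

# Porosity = pore volume / AC mass for the 3300-C-atom model.
mass_kg = 3300 * 12.011 / NA / 1000.0                   # mass of the AC model, kg
length, depth = 50.0, 29.2                              # slit-pore footprint, Angstrom
porosity = {k: w * length * depth * 1e-24 / mass_kg     # 1 A^3 = 1e-24 mL
            for k, w in pores.items()}
# porosity rounds to {ACS: 98, ACM: 202, ACL: 364} mL/kg, matching the text
```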
Second, that trajectory was divided into 20 1-ns windows, and in each window the adaptive biasing force was applied to calculate the change in energy in every small bin of width 0.05 Å. Each time window was sampled more than 1,000,000 times. The overall free energy profile for the desorption process was constructed by linking the potential of mean force (PMF) profiles of all adjacent windows. The moving trajectory and radial distribution functions between different groups were obtained by analyzing the movement of dioxin during the molecular dynamics runs after the adaptive biasing force was applied. We have used the same methods to study the adsorption energies of PCBs interacting with clay minerals (Liu et al. (2015a)). Thermodynamically, this modeled desorption process should be reversible, in that the PMF energy should be equal in magnitude but opposite in sign to that of an adsorption process in which DD begins in the bulk water phase and is then dragged into the AC pore. We tested this using anthracene (the same size as DD) and a cylindrical pore of diameter 12 Å; the desorption energy was +45 kJ/mol, while the adsorption energy was -40 kJ/mol. The latter was probably underestimated, since the anthracene molecule had drifted about 5 Å too close to the pore entrance during the MD equilibration before the ABF simulation began, pulling its initial potential energy down somewhat. Several other attempts at ABF simulations of adsorption were unsuccessful, because the anthracene or dioxin molecules got stuck at the entrances to cylindrical or slit pores and could not be pulled in. Thus, we focus on the process of dioxin desorption from AC slit-pores. 2.5 Results and Discussion 2.5.1 Water in Activated Carbon Pores Adsorption of DD from water into AC pores can be considered an exchange between water and DD for adsorptive sites. When DD moves from inside an AC pore to the outside bulk water, a corresponding volume of water molecules is expected to enter the pore.
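The step of linking the PMF profiles of adjacent windows can be sketched as follows. This is a minimal illustration of the stitching idea rather than the COLVARS implementation, and it assumes adjacent windows share one boundary bin:

```python
def stitch_pmf(windows):
    """Join per-window free-energy profiles into one continuous PMF by
    shifting each window so its first bin coincides with the previous
    window's last bin (the shared boundary bin is not duplicated)."""
    profile = list(windows[0])
    for w in windows[1:]:
        offset = profile[-1] - w[0]              # align the boundary bin
        profile.extend(v + offset for v in w[1:])
    return profile
```

Because each ABF window determines the free energy only up to an additive constant, aligning the overlapping bins in this way yields one continuous profile along the whole reaction coordinate.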
In this section, we first studied the relative water density inside pores of different sizes.

Figure 2.2: Number of water molecules per nm³ inside activated carbon pores of three different widths; the model is divided into 100 small bins along the (1,0,0) axis; the number of water molecules per nm³ in bulk water is around 33.

Figure 2.2 shows the number of water molecules per nm³ in AC pores along the (1,0,0) direction, which is the direction of dioxin moving out of the pore. The result reveals that the water density in AC pores increased with increasing pore size. This suggests that smaller AC pores hold less water per unit volume and create less hydrophilic environments. This sub-aqueous environment favors the adsorption of hydrophobic organic compounds such as DD in the smaller AC pores. Instead of performing the same PMF calculation described in the computational details for dioxin, we have a simpler way to obtain the PMF for water: analyzing the distribution of water before and after dioxin is adsorbed. The radial distribution function alone does not give the distribution along a particular dimension, so the most intuitive approach is to study the water density along one direction. In Figure 2.2 we analyze the relative water density along (1,0,0), which is the direction of dioxin moving out of the pore and thus the direction of water moving from the bulk into the pore. Relative water density is defined as the bin concentration divided by the concentration across the entire cell, and is dimensionless. Figure 2.2 shows the water density in our three activated carbon models along the x axis. Pore water density increases with pore size. This suggests that small pores create more hydrophobic environments, which in turn may imply that hydrophobic organic compounds such as dioxin adsorb more strongly from water into small AC pores.

Figure 2.3: Potential of mean force for water molecules, in units of kT.
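The water-density analysis of Figure 2.2, and its conversion to the water PMFs of Figure 2.3, can be sketched as follows. This is a minimal post-processing illustration with hypothetical function names, not the analysis scripts actually used:

```python
import math

def density_profile(x_coords, x_min, x_max, nbins, bin_volume_nm3):
    """Bin water oxygen x coordinates along (1,0,0) and convert the
    per-bin counts to molecules per nm^3."""
    width = (x_max - x_min) / nbins
    counts = [0] * nbins
    for x in x_coords:
        i = min(int((x - x_min) / width), nbins - 1)
        counts[i] += 1
    return [c / bin_volume_nm3 for c in counts]

def water_pmf(density, bulk_density, kT=1.0):
    """W(x) = -kT * ln(rho(x) / rho_bulk); empty bins are formally
    infinite (no water ever observed there)."""
    return [-kT * math.log(d / bulk_density) if d > 0 else math.inf
            for d in density]
```

Bins with water density below the bulk value (about 33 molecules per nm³) then show a positive, unfavorable PMF, as observed for the smallest pore.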
The potential of mean force of a water molecule can be calculated simply from the water density, W(x) = −k_B T ln ρ(x) + const, since the density represents the probability of finding a water molecule at a given position inside the pore; here ρ(x) is the local water density (Wang et al. (2015)) and the constant sets the zero level. Figure 2.3 shows the PMFs calculated in this manner. The largest pore shows the lowest (most favorable) energy, while the smallest pore shows the highest, suggesting that water molecules sorb most strongly to the largest pore, again supporting the hypothesis that small pores are more hydrophobic than larger pores. In fact, in the simulations with the smallest activated carbon pores, the number of water molecules inside the pore is quite limited: these pores can hold at most one layer of water molecules. This may be due not only to the limited pore size, but also to the hydrophobicity of the graphene surfaces. 2.5.2 Dioxin Sorption in Activated Carbon Pores When DD molecules are adsorbed inside pores, they tend to take positions parallel to the AC surfaces and in direct contact with those surfaces; in other words, DD partially dehydrates. The distance between the DD and AC surface atomic planes is 3.3 Å.

Figure 2.4: Dioxin moving trajectories in the small AC pore

In Figures 2.4, 2.5 and 2.6, the moving trajectories of dioxin during the simulations in the small, medium and large pores are marked with green lines. Dark red represents the initial position of dioxin and lighter colors denote later positions. Even when DD started in the middle of an AC pore, it quickly found its way to an AC surface in all cases, as illustrated in Figure 2.7, in which all water molecules are removed for a clearer view. 2.5.3 PMF of Dioxin The PMF of dioxin was calculated to estimate the adsorption/desorption energy for DD interacting with ACs of different pore sizes.
The PMF is the energy for pulling dioxin out of the AC pore into the bulk water phase.

Figure 2.5: Dioxin moving trajectories in the medium AC pore

As shown in Figure 2.8, the energy of DD moving inside the AC pores remains almost unchanged until a distance of 5 Å from the AC edge, and the energy shows no apparent difference among the ACs with three different pore sizes. When DD moved to within 5 Å of the AC edge, it interacted with the bulk water phase, resulting in a sharp increase of the PMF. There is a transition located 3 Å from the AC edge in the bulk water, at which the energy fluctuated because the DD molecule was moving along the AC edges before finally exiting the pore. When DD entered the water phase and reached 8 Å from the AC edge, the PMF approached a plateau owing to DD's distance from the AC.

Figure 2.6: Dioxin moving trajectories in the large AC pore

When the pore size is smallest (4.4 Å, just larger than the thickness of DD), DD can essentially only move in two dimensions (Figure 2.4), and it has the highest PMF at 15.8 kcal/mol. In this situation, dioxin has the maximum contact with AC surfaces (two) and minimal contact with water. In contrast, when dioxin is inside the largest micropore (16.4 Å), it is separated by water molecules from at least one AC surface, and the desorption energy is only 9.2 kcal/mol. The PMF of dioxin in the medium pore is in between, at 10.8 kcal/mol. All of these desorption energies are comparable to the experimental free energy change upon dissolving DD crystals in water, which is about +9 kcal/mol based upon converting a solubility of 2.74e-5 mol/L (Darve et al. (2002); Darve and Pohorille (2001)). These large, unfavorable desorption energies are consistent with the simulated dynamics shown in the previous section. Desorption of DD from AC increased the system free energy.
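The ~+9 kcal/mol figure can be recovered from the quoted solubility under the common assumption of ideal mole-fraction statistics. This is a hedged back-of-envelope conversion, not necessarily the exact method of the cited work:

```python
import math

# Free energy of transfer, DD crystal -> aqueous solution, from solubility,
# assuming an ideal mole-fraction reference (55.5 mol/L is the molarity of water).
R = 1.987e-3          # kcal/(mol K)
T = 298.0             # K
S = 2.74e-5           # mol/L, aqueous solubility of DD (from the text)
x = S / 55.5          # mole fraction of dissolved DD at saturation
dG = -R * T * math.log(x)
# dG comes out near +8.6 kcal/mol, consistent with the ~ +9 kcal/mol quoted
```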
This energy increase with desorption may be viewed from at least two perspectives: 1) for each system, one set of interactions between an AC graphene-like surface and DD is lost, with two sets lost for the small-pore AC system; and 2) the entire DD molecule and the AC surface(s) to which it was previously adsorbed become hydrated.

Figure 2.7: Side view of the moving trajectory of dioxin in the large-pore AC model (water molecules removed for a clearer view)

Figure 2.8: The potential of mean force (kcal/mol) for desorption of DD from AC pores of three widths. The AC pore edge is located at x = 0; x < 0 refers to the inside of the AC pore and x > 0 refers to the bulk water phase.

Compared to the adsorbed state, this change is largest for the small-pore AC system, because two new DD-sized hydrophobic surfaces on the AC become hydrated, plus both planar sides of the DD molecule itself. For the medium-pore and large-pore AC systems, just one new DD-sized hydrophobic surface on the AC becomes hydrated, plus just one planar side of the DD molecule (see Figures 2.4, 2.5 and 2.6 for visualization). Note that hydration of previously unhydrated DD molecular surface is thought to be energetically unfavorable. As cited above, transfer from bulk DD to bulk water has an energy penalty of about +9 kcal/mol, while transfer of DD from the ideal gas phase to water has an estimated free energy penalty of about +4.5 kJ/mol (Rogers and Hahn (2010)). The molecular simulations of the present study support the concept that DD hydration is unfavorable and contributes to the strongly unfavorable desorption of DD from AC. 2.5.4 Sorption at Activated Carbon Edges Another interesting phenomenon is the drop in energy at the edges. This energy change may be due to adsorption taking place at the edges. When dioxin is moving along the edges, it can even bend, presumably to maximize its interactions with the edges.
Strong AC-edge-dioxin interactions are also supported by the trajectory of dioxin moving at the edges (Figure 2.9). From the trajectory we can see that when dioxin moves out of the pore into the water, it does not move directly into the water phase. Instead, dioxin moves along the edges. The edges of activated carbons are usually saturated by various functional groups such as hydroxyl or carboxyl groups. In our model, we use the simplest termination, hydrogen atoms, to saturate the edges. Even so, this suggests that dioxin may be adsorbed onto the edges of activated carbons if certain functional groups are present. Our radial distribution functions (Figure 2.10) also support this possibility. Here we use the largest pore as an example; results for the other two pore sizes are similar. In Figure 2.10, a peak appears near 3 Å, suggesting interactions between dioxin and edge hydrogens. This interaction may arise from an edge-to-face (T-shaped) aromatic interaction, in which a hydrogen atom of one benzene ring points perpendicular to the plane of the other ring.

2.6 Conclusion

Activated carbons made from different materials, pyrolyzed under different conditions, and activated by different methods have different pore size distributions and different functional groups. In this work we studied the effects of pore size on the adsorption of water and dioxin by activated carbon surfaces. Since dioxin adsorption takes place from water, the competition between water and dioxin molecules for adsorption should be taken into account. We used a simplified model for AC with minimal functional groups so that the adsorption depends maximally on the pore sizes. Our results show that small pores can create a more hydrophobic environment that assists the adsorption of dioxin. The smallest pore has a stronger estimated adsorption energy than the medium and large pores. Adsorption not only takes place inside the pores, but may also occur at the activated carbon edges.
Within AC micropores, dioxin sorption appears quite strong, since the desorption thermodynamics are highly unfavorable due to a) loss of favorable DD-AC interactions such as pi-pi and other nonbonded interactions, and b) unfavorable hydration of DD upon transfer to bulk water.

Figure 2.9: Dioxin moving trajectories at the AC edge

Our results suggest the hypothesis that pore-size effects of ACs should be taken into consideration in mechanistic descriptions of hydrophobic solute sorption by AC. For example, when applying AC to soils for in-situ dioxin remediation, it is plausible that a more microporous activated carbon may bind the dioxins more strongly and minimize their bioavailability to organisms that may ingest the soil.

Figure 2.10: Radial distribution between oxygen atoms of dioxin and hydrogen atoms at activated carbon edges

CHAPTER 3

GEOMETRIC SCATTERING FOR GRAPH DATA ANALYSIS

3.1 Abstract

We explore the generalization of scattering transforms from traditional (e.g., image or audio) signals to graph data, analogous to the generalization of ConvNets in geometric deep learning, and the utility of the extracted graph features in graph data analysis. In particular, we focus on the capacity of these features to retain informative variability and relations in the data (e.g., between individual graphs, or in aggregate), while relating our construction to previous theoretical results that establish the stability of similar transforms to families of graph deformations. We demonstrate the application of our geometric scattering features in graph classification of social network data, and in data exploration of biochemistry data.

3.2 Introduction

Over the past decade, numerous examples have established that deep neural networks (i.e., cascades of linear operations and simple nonlinearities) typically outperform traditional "shallow" models in various modern machine learning applications, especially given the ever-increasing availability of big data.
Perhaps the best-known example of the advantages of deep networks is in computer vision, where the use of 2D convolutions enables network designs that learn cascades of convolutional filters, which have several advantages over fully connected network architectures, both computationally and conceptually. Indeed, in terms of supervised learning, convolutional neural networks (ConvNets) hold the current state of the art in image classification, and have become the standard machine learning approach for processing big structured-signal data, including audio and video processing. See, e.g., Goodfellow et al. (2016, Chapter 9) for a detailed discussion. Beyond their performance on specific tasks, pretrained ConvNet layers have been explored as image feature extractors, by freezing the first few pretrained convolutional layers and then retraining only the last few layers for specific datasets or applications (e.g., Yosinski et al., 2014; Oquab et al., 2014). Such transfer learning approaches provide evidence that suitably constructed deep filter banks should be able to extract task-agnostic semantic information from structured data, and in some sense mimic the operation of human visual and auditory cortices, thus supporting the neural terminology in deep learning. An alternative approach to such universal feature extraction was presented in Mallat (2012), where a deep filter bank, known as the scattering transform, is designed, rather than trained, based on predetermined families of disruptive patterns that should be eliminated to extract informative representations. The scattering transform is constructed as a cascade of linear wavelet transforms and nonlinear complex modulus operations that provides features with guaranteed invariance to a predetermined Lie group of operations such as rotations, translations, or scaling. Further, it also provides Lipschitz stability to small diffeomorphisms of the input signal.
Following recent interest in geometric deep learning approaches for processing graph-structured data (see, for example, Bronstein et al. (2017) and references therein), several attempts have been made to generalize the scattering transform to graphs (Zou and Lerman, 2018; Gama et al., 2019) and manifolds (Perlmutter et al., 2018), which we will generally term "geometric scattering". These works mostly follow in the footsteps of Mallat (2012) in establishing the stability of their respective constructions to deformations of input signals or graphs. Their results essentially characterize the type of disruptive information eliminated by geometric scattering, by providing upper bounds for distances between scattering features, phrased in terms of a deformation size. Here, we further explore the notion of geometric scattering features by considering the complementary question of how much information is retained by them, since stability alone does not ensure useful features in practice (e.g., a constant all-zero map would be stable to any deformation, but would clearly be useless). In other words, we examine whether a geometric scattering construction, defined and discussed in Sec. 3.4, can be used as an effective task-independent feature extractor from graphs, and whether the resulting representations are sufficiently rich to enable intelligible data analysis by applying traditional (Euclidean) methods. We note that for Euclidean scattering, while stability is established with rigorous theoretical results, the capacity of scattering features to form an effective data representation in practice has mostly been established via extensive empirical examination.
Indeed, scattering features have been shown to be effective in several audio (e.g., Bruna and Mallat, 2013a; Andén and Mallat, 2014; Lostanlen and Mallat, 2015; Andén et al., 2018) and image (e.g., Bruna and Mallat, 2013b; Sifre and Mallat, 2014; Oyallon and Mallat, 2015; Angles and Mallat, 2018) processing applications, and their advantages over learned features are especially relevant in applications with relatively low data availability, such as quantum chemistry and materials science (e.g., Hirn et al., 2017; Eickenberg et al., 2017, 2018; Brumwell et al., 2018). Similarly, our examination of geometric scattering capacity focuses on empirical results on several data analysis tasks, and on two commonly used graph data types. Our results in Sec. 3.5.2 show that on social network data, geometric scattering features enable a classic RBF-kernel SVM to match, if not outperform, leading graph kernel methods as well as most geometric deep learning ones. These experiments are augmented by additional results in Sec. 3.5.3 that show the geometric scattering SVM classification rate degrades only slightly when trained on far fewer graphs than are traditionally used in graph classification tasks. On biochemistry data, where graphs represent molecular structures of compounds (e.g., enzymes or proteins), we show in Sec. 3.5.4 that scattering features enable significant dimensionality reduction. Finally, to establish their descriptive qualities, in Sec. 3.5.5 we use geometric scattering features extracted from enzyme data (Borgwardt et al., 2005) to infer emergent patterns of enzyme commission (EC) exchange preferences in enzyme evolution, validated with established knowledge from Cuesta et al. (2015). Taken together, these results illustrate the power of the geometric scattering approach as both a relevant mathematical model for geometric deep learning, and as a suitable tool for modern graph data analysis.
3.3 Graph Random Walks and Graph Wavelets

The Euclidean scattering transform is constructed using wavelets defined on ℝ^d. In order to extend this construction to graphs, we define graph wavelets as the difference between lazy random walks that have propagated at different time scales, which mimics classical wavelet constructions found in Meyer (1993) and more recent constructions found in Coifman and Maggioni (2006). The underpinnings for this construction arise from graph signal processing, and in particular the properties of the graph Laplacian.

Let G = (V, E, W) be a weighted graph, consisting of n vertices V = {v1, . . . , vn}, edges E ⊆ {(vℓ, vm) : 1 ≤ ℓ, m ≤ n}, and weights W = {w(vℓ, vm) > 0 : (vℓ, vm) ∈ E}. Note that unweighted graphs are considered as a special case, by setting w(vℓ, vm) = 1 for each (vℓ, vm) ∈ E. Define the n × n (weighted) adjacency matrix A_G = A of G by A(vℓ, vm) = w(vℓ, vm) if (vℓ, vm) ∈ E and zero otherwise, where we use the notation A(vℓ, vm) to denote the (ℓ, m) entry of the matrix A, so as to emphasize the correspondence with the vertices in the graph and to reserve sub-indices for enumerating objects. Define the (weighted) degree of vertex vℓ as deg(vℓ) = Σ_m A(vℓ, vm), and the corresponding diagonal n × n degree matrix D by D(vℓ, vℓ) = deg(vℓ) and D(vℓ, vm) = 0 for ℓ ≠ m. Finally, the n × n graph Laplacian matrix L_G = L on G is defined as L = D − A, and its normalized version is N = D^{−1/2} L D^{−1/2} = I − D^{−1/2} A D^{−1/2}. We focus on the latter due to its close relationship with graph random walks. The normalized graph Laplacian is a symmetric, real-valued, positive semi-definite matrix, and thus has n non-negative eigenvalues. Furthermore, if we set 0 = (0, . . .
, 0)^T to be the n × 1 vector of all zeros, and d(vℓ) = deg(vℓ) to be the n × 1 degree vector, then one has N d^{1/2} = 0 (where the square root is taken entrywise). Therefore 0 is an eigenvalue of N, and we write the n eigenvalues of N as 0 = λ0 ≤ λ1 ≤ · · · ≤ λ_{n−1} ≤ 2, with corresponding n × 1 orthonormal eigenvectors ϕ0, ϕ1, . . . , ϕ_{n−1}. If the graph G is connected, then λ1 > 0. In order to simplify the following discussion we assume that this is the case, although the discussion below can be amended to include disconnected graphs as well. One can show that ϕ0 = d^{1/2}/‖d^{1/2}‖, meaning ϕ0 is non-negative. Since every other eigenvector is orthogonal to ϕ0 (and thus must take both positive and negative values), it is natural to view the eigenvectors ϕk as the Fourier modes of the graph G, with a frequency magnitude proportional to λk. The fact that ϕ0 is in general non-constant, as opposed to the zero-frequency mode on the torus or real line, reflects the non-uniform distribution of vertices in non-regular graphs.

Let x : V → ℝ be a signal defined on the vertices of the graph G, which we will consider as an n × 1 vector with entries x(vℓ). It follows that the Fourier transform of x can be defined as x̂(k) = x · ϕk, where x · y is the standard dot product. This analogy is one of the foundations of graph signal processing, and indeed we could use this correspondence to define wavelet operators on the graph G, as in Hammond et al. (2011). Rather than follow this path, though, we instead take a related path similar to Coifman and Maggioni (2006) and Gama et al. (2019), defining the graph wavelet operators in terms of random walks on G, which will avoid diagonalizing N and will allow us to control the "spatial" graph support of the filters directly.

Define the n × n lazy random walk matrix as P = (1/2)(I + A D^{−1}). Note that the column sums of P are all one.
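As a numerical sanity check on the definitions above, the following sketch builds N and P for a small graph (a hypothetical 4-vertex adjacency matrix, chosen only for illustration) and verifies the stated properties: N annihilates d^{1/2}, the spectrum of N lies in [0, 2], and P has unit column sums.

```python
import numpy as np

# Toy 4-vertex connected graph (hypothetical adjacency, illustration only).
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
deg = A.sum(axis=1)                           # (weighted) degrees
D_inv_sqrt = np.diag(deg ** -0.5)
N = np.eye(len(A)) - D_inv_sqrt @ A @ D_inv_sqrt   # normalized Laplacian
d_sqrt = np.sqrt(deg)                         # the vector d^(1/2)
eigvals = np.sort(np.linalg.eigvalsh(N))      # should lie in [0, 2]

P = 0.5 * (np.eye(len(A)) + A @ np.diag(1.0 / deg))  # lazy random walk matrix

print(np.allclose(N @ d_sqrt, 0))             # 0 is an eigenvalue of N
```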
It follows that P acts as a Markov operator, mapping probability distributions to probability distributions. We refer to P as a lazy random walk matrix since P^t governs the probability distribution of a lazy random walk after t steps. A single realization of a random walk is a walk (in the graph-theoretic sense) vℓ0, vℓ1, vℓ2, . . . in which the steps are chosen randomly; lazy random walks additionally allow vℓi = vℓi+1 (i.e., the walk may stay in place). More precisely, suppose that µ0(vℓ) ≥ 0 for each vertex vℓ and ‖µ0‖_1 = 1, so that µ0 is a probability distribution on G. We take µ0(vℓ) as the probability of a random walk starting at vertex vℓ0 = vℓ. One can verify that µ1 = Pµ0 is also a probability distribution; each entry µ1(vℓ) gives the probability of the random walk being located at vℓ1 = vℓ after one step. The probability distribution for the location of the random walk after t steps is µt = P^t µ0.

The operator P can be considered a low-pass operator, meaning that Px replaces x(vℓ) with localized averages of x(vℓ) for any x. Indeed, expanding Px(vℓ), one observes that Px(vℓ) is the weighted average of x(vℓ) and the values x(vm) for the neighbors vm of vℓ. Similarly, the value P^t x(vℓ) is the weighted average of x(vℓ) with all values x(vm) such that vm is within t steps of vℓ. Low-pass operators defined on Euclidean space retain the low frequencies of a function while suppressing the high frequencies, and the random walk matrix P behaves similarly. Indeed, P is diagonalizable with n eigenvectors φk = D^{1/2} ϕk and eigenvalues ωk = 1 − λk/2. Let yx = D^{−1/2} x be a density-normalized version of x and set xt = P^t x; then one can show

y_{xt} = ŷx(0) ϕ0 + Σ_{k=1}^{n−1} ωk^t ŷx(k) ϕk .   (3.1)

Thus, since 0 ≤ ωk < 1 for k ≥ 1, the operator P^t preserves the zero frequency of x while suppressing the high frequencies, up to a density normalization. High-frequency responses of x can be recovered in multiple different fashions, but we utilize multiscale wavelet transforms that group the non-zero frequencies of G into approximately dyadic bands. As shown in Mallat (2012, Lemma 2.12), wavelet transforms are provably stable operators in the Euclidean domain, and the proof of Zou and Lerman (2018, Theorem 5.1) indicates that similar results on graphs may be possible. Furthermore, the multiscale nature of wavelet transforms will allow the resulting geometric scattering transform (Sec. 3.4) to traverse the entire graph G in one layer, which is valuable for obtaining global descriptions of G. Following Coifman and Maggioni (2006), define the n × n wavelet matrix at the scale 2^j as

Ψj = P^{2^{j−1}} − P^{2^j} = P^{2^{j−1}} (I − P^{2^{j−1}}) .   (3.2)

A calculation similar to the one required for (3.1) shows that Ψj x partially recovers ŷx(k) for k ≥ 1. The value Ψj x(vℓ) aggregates the signal information x(vm) from the vertices vm that are within 2^j steps of vℓ, but does not average the information like the operator P^{2^j}. Instead, it responds to sharp transitions or oscillations of the signal x within the neighborhood of vℓ of radius 2^j (in terms of the graph path distance). The smaller the wavelet scale 2^j, the higher the frequencies Ψj x recovers in x. The wavelet coefficients up to the scale 2^J are

Ψ(J)x(vℓ) = [Ψj x(vℓ) : 1 ≤ j ≤ J] .   (3.3)

Figure 3.1 plots the wavelets on two different graphs.

Figure 3.1: Wavelets Ψj for increasing scale 2^j, left to right, applied to Diracs centered at two different locations (marked by red circles) in two different graphs: (a) a sample graph of the bunny manifold; (b) the Minnesota road network graph.
Vertex colors indicate wavelet values (corresponding to the colorbars for each plot), ranging from yellow/green for positive values to blue for negative values. Both graphs are freely available from PyGSP (2018).

3.4 Geometric Scattering

A geometric wavelet scattering transform follows a construction similar to the (Euclidean) wavelet scattering transform of Mallat (2012), but leverages a graph wavelet transform. In this chapter we utilize the wavelet transform defined in (3.3) of the previous section, but remark that in principle any graph wavelet transform could be used (see, e.g., Zou and Lerman, 2018). In Sec. 3.4.1 we define the graph scattering transform, in Sec. 3.4.2 we discuss its relation to other recently proposed graph scattering constructions (Gama et al., 2019; Zou and Lerman, 2018), and in Sec. 3.4.3 we describe several of its desirable properties as compared to other geometric deep learning algorithms on graphs.

3.4.1 Geometric Scattering Definitions

Machine learning algorithms that compare and classify graphs must be invariant to graph isomorphism, i.e., re-indexations of the vertices and corresponding edges. A common way to obtain invariant graph features is via summation operators, which act on a signal x = x_G that can be defined on any graph G, e.g., x(vℓ) = deg(vℓ). The geometric scattering transform, which is described in the remainder of this section, follows such an approach. The simplest summation operator computes the sum of the responses of the signal x. As described in Verma and Zhang (2018), this invariant can be complemented by higher-order summary statistics of x, the collection of which are statistical moments, and which are also referred to as "capsules" in that work.
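The invariance-versus-equivariance distinction above can be seen in a few lines of NumPy; the signal and permutation below are arbitrary toys, not tied to any dataset in this chapter.

```python
import numpy as np

# A summation over vertices is invariant to re-indexation, while the raw
# signal vector is only equivariant (its entries get shuffled). Toy data.
rng = np.random.default_rng(0)
x = rng.uniform(size=7)                      # a signal on 7 vertices
perm = np.array([3, 0, 6, 2, 5, 1, 4])       # a re-indexation of the vertices

same_sum = np.isclose(x.sum(), x[perm].sum())  # summation operator: invariant
same_vec = np.array_equal(x, x[perm])          # raw signal: merely shuffled
```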
For example, the unnormalized qth moments of x yield the following "zero"-order scattering moments:

Sx(q) = Σ_{ℓ=1}^{n} x(vℓ)^q ,  1 ≤ q ≤ Q .   (3.4)

We can also replace (3.4) with normalized (i.e., standardized) moments of x, in which case we store its mean (q = 1), variance (q = 2), skew (q = 3), kurtosis (q = 4), and so on. In what follows we discuss the unnormalized moments, since their presentation is simpler. The invariants Sx(q) do not capture the full variability of x, and hence of the graph G upon which the signal x is defined. We thus complement these moments with summary statistics derived from the wavelet coefficients of x, which will lead naturally to the graph ConvNet structure of the geometric scattering transform.

Observe, analogously to the Euclidean setting, that in computing Sx(1), which is the summation of x(vℓ) over V, we have captured the zero frequency of yx = D^{−1/2} x, since Σ_{ℓ=1}^{n} x(vℓ) = x · 1 = yx · d^{1/2} = ‖d^{1/2}‖ ŷx(0). Higher-order moments of x can incorporate the full range of frequencies in x, e.g. Sx(2) = Σ_{ℓ=1}^{n} x(vℓ)^2 = Σ_k x̂(k)^2, but they are mixed into one invariant coefficient. We can separate and recapture the high frequencies of x by computing its wavelet coefficients Ψ(J)x, which were defined in (3.3). However, Ψ(J)x is not invariant to permutations of the vertex indices; in fact, it is equivariant. Before summing the individual wavelet coefficient vectors Ψj x, though, we must first apply a pointwise nonlinearity. Indeed, define 1 = (1, . . . , 1)^T to be the n × 1 vector of all ones, and note that P^T 1 = 1, meaning that 1 is a left eigenvector of P with eigenvalue 1. It follows that Ψj^T 1 = 0, and thus Σ_{ℓ=1}^{n} Ψj x(vℓ) = Ψj x · 1 = 1^T Ψj x = 0. We thus apply the absolute value nonlinearity, to obtain nonlinear equivariant coefficients |Ψ(J)x| = {|Ψj x| : 1 ≤ j ≤ J}.
We use the absolute value because it is equivariant to vertex permutations, non-expansive, and, when combined with traditional wavelet transforms on Euclidean domains, yields a provably stable scattering transform for q = 1. Furthermore, initial theoretical results in Zou and Lerman (2018) and Gama et al. (2019) indicate that similar graph-based scattering transforms possess certain types of stability properties as well. As in (3.4), we extract invariant coefficients from |Ψj x| by computing its moments, which define the first-order geometric scattering moments:

Sx(j, q) = Σ_{ℓ=1}^{n} |Ψj x(vℓ)|^q ,  1 ≤ j ≤ J, 1 ≤ q ≤ Q .   (3.5)

These first-order scattering moments aggregate complementary multiscale geometric descriptions of G into a collection of invariant multiscale statistics. These invariants give a finer partition of the frequency responses of x. For example, whereas Sx(2) mixed all frequencies of x, we see that Sx(j, 2) only mixes the frequencies of x captured by Ψj.

First-order geometric scattering moments can be augmented with second-order geometric scattering moments by iterating the graph wavelet and absolute value transforms. These moments are defined as

Sx(j, j′, q) = Σ_{ℓ=1}^{n} |Ψj′ |Ψj x(vℓ)||^q ,  1 ≤ j < j′ ≤ J, 1 ≤ q ≤ Q ,   (3.6)

which consists of reapplying the wavelet transform operator Ψ(J) to each |Ψj x| and computing the summary statistics of the magnitudes of the resulting coefficients. The intermediate equivariant coefficients |Ψj′ |Ψj x|| and resulting invariant statistics Sx(j, j′, q) couple two scales 2^j and 2^{j′} within the graph G, creating features that bind patterns of smaller subgraphs within G with patterns of larger subgraphs (e.g., circles of friends of individual people with larger community structures in social network graphs). The transform can be iterated additional times, leading to third-order features and beyond, and thus has the general structure of a graph ConvNet.
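To make the cascade concrete, here is a minimal sketch that builds the wavelet filter bank of Eq. (3.2) on a small graph and computes the unnormalized moments of Eqs. (3.4)-(3.6) for the degree signal x(vℓ) = deg(vℓ). The 4-vertex adjacency matrix and the choices J = 3, Q = 4 are illustrative only, not taken from the experiments.

```python
import numpy as np

def scattering_moments(x, Psi, Q=4):
    """Unnormalized zeroth-, first-, and second-order scattering moments."""
    feats = [np.sum(x ** q) for q in range(1, Q + 1)]           # Sx(q), Eq. 3.4
    U = [np.abs(W @ x) for W in Psi]                            # |Psi_j x|
    for u in U:
        feats += [np.sum(u ** q) for q in range(1, Q + 1)]      # Sx(j,q), Eq. 3.5
    for j, u in enumerate(U):
        for jp in range(j + 1, len(Psi)):                       # j < j'
            v = np.abs(Psi[jp] @ u)                             # |Psi_j' |Psi_j x||
            feats += [np.sum(v ** q) for q in range(1, Q + 1)]  # Sx(j,j',q), Eq. 3.6
    return np.array(feats)

# Wavelet filter bank Psi_j = P^(2^(j-1)) - P^(2^j) on a toy 4-vertex graph.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
P = 0.5 * (np.eye(4) + A @ np.diag(1.0 / A.sum(axis=1)))        # lazy random walk
Psi, Ps = [], P
for _ in range(3):                     # J = 3 scales, squaring P each time
    Ps2 = Ps @ Ps
    Psi.append(Ps - Ps2)
    Ps = Ps2

Sx = scattering_moments(A.sum(axis=1), Psi)   # degree signal x(v) = deg(v)
```

With J = 3 and Q = 4 this yields 4 + 3·4 + 3·4 = 28 invariant features per graph, independent of vertex ordering.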
Figure 3.2: Illustration of (a) the proposed scattering feature extraction (see eqs. 3.4, 3.5, and 3.6), showing representative zeroth-, first-, and second-order cascades of the geometric scattering transform for an input graph signal x (with adjacency matrix A, lazy random walk P = (1/2)(I + A D^{−1}), and diffusion wavelets Ψj = P^{2^{j−1}} − P^{2^j}), and (b) the architecture for using geometric scattering of a graph G and signal x in graph data analysis, in which the extracted features Sx are passed to traditional Euclidean algorithms, as demonstrated in Sec. 3.5.

The collection of graph scattering moments Sx = {Sx(q), Sx(j, q), Sx(j, j′, q)} (illustrated in Fig. 3.2(a)) provides a rich set of multiscale invariants of the graph G.
These can be used in supervised settings as input to graph classification or regression models, or in unsupervised settings to embed graphs into a Euclidean feature space for further exploration, as demonstrated in Sec. 3.5.

In practice, the computation of the scattering features involves several design choices, akin to typical architecture choices in neural networks. Most importantly, it requires a choice of 1. which statistical moments to use (normalized or unnormalized), 2. the number of wavelet scales to use (given by J), and 3. the number of moments to use (denoted by Q). In general, J can be tuned automatically from the diameter of the considered graphs (e.g., setting it to the logarithm of the diameter), and the other choices can be tuned via cross-validation. However, we have found the impact of such tuning to be minor, and thus for simplicity we fix our configuration to use normalized moments, J = 5, and Q = 4 throughout this work.

3.4.2 Stability and Capacity of Geometric Scattering

In order to assess the utility of scattering features for representing graphs, two properties have to be considered: stability and capacity. First, the stability property aims to provide an upper bound on distances between similar graphs that differ only by types of deformations that can be treated as noise. This property has been the focus of both Zou and Lerman (2018) and Gama et al. (2019); in particular, the latter shows that a diffusion scattering transform yields features that are stable to graph structure deformations whose size can be computed via the diffusion framework (Coifman and Maggioni, 2006) that forms the basis for their construction. While there are some technical differences between the geometric scattering here and the diffusion scattering in Gama et al. (2019), these constructions are sufficiently similar that we can expect both of them to have analogous stability properties.
Therefore, we mainly focus here on the complementary property: the capacity of the scattering transform to provide a rich feature space for representing graph data without eliminating informative variance in it. We note that even in the classical Euclidean case, while the stability of scattering transforms to deformations can be established analytically (Mallat, 2012), their capacity is typically examined by empirical evidence when applied to machine learning tasks (e.g., Bruna and Mallat, 2011; Sifre and Mallat, 2012; Andén and Mallat, 2014). Similarly, in the graph processing setting, we examine the capacity of our proposed geometric scattering features via their discriminative power in graph data analysis tasks, which are described in detail in Sec. 3.5. We show that geometric scattering enables graph embedding in a relatively low-dimensional Euclidean space, while preserving insightful properties in the data. Beyond establishing the capacity of our specific construction, these results also indicate the viability of graph scattering transforms as universal feature extractors on graph data, and complement the stability results established in Zou and Lerman (2018) and Gama et al. (2019).

3.4.3 Geometric Scattering Compared to Other Feed-Forward Graph ConvNets

We give a brief comparison of geometric scattering with other graph ConvNets, with particular interest in isolating the key principles for building accurate graph ConvNet classifiers. We begin by remarking that, like several other successful graph neural networks, the graph scattering transform is equivariant to vertex permutations (i.e., commutes with them) until the final features are extracted. This idea has been discussed in depth in various articles, including Kondor et al.
(2018a), so we limit the discussion to observing that the geometric scattering transform thus propagates nearly all of the information in x through the multiple wavelet and absolute value layers, since only the absolute value operation removes information on x. As in Verma and Zhang (2018), we aggregate covariant responses via multiple summary statistics (i.e., moments), which are referred to there as a capsule. In the scattering context, at least, this idea is in fact not new, and has previously been used in the Euclidean setting for the regression of quantum mechanical energies in Eickenberg et al. (2018, 2017) and texture synthesis in Bruna and Mallat (2018). We also point out that, unlike many deep learning classifiers (graph included), a graph scattering transform extracts invariant statistics at each layer/order. These intermediate layer statistics, while necessarily losing some information in x (and hence G), provide important coarse geometric invariants that eliminate needless complexity in subsequent classification or regression. Furthermore, such layer-by-layer statistics have proven useful in characterizing signals of other types (e.g., texture synthesis in Gatys et al., 2015).

A graph wavelet transform Ψ(J)x decomposes the geometry of G through the lens of x, along different scales. Graph ConvNet algorithms also obtain multiscale representations of G, but several works, including Atwood and Towsley (2016) and Zhang et al. (2018), propagate information via a random walk. While random walk operators like P^t act at different scales on the graph G, per the analysis in Sec. 3.3 we see that P^t for any t will be dominated by the low-frequency responses of x. While subsequent nonlinearities may be able to recover this high-frequency information, the resulting transform will most likely be unstable due to the suppression and then attempted recovery of the high-frequency content of x.
Alternatively, features derived from P^t x may lose the high-frequency responses of x, which are useful in distinguishing similar graphs. The graph wavelet coefficients Ψ(J)x, on the other hand, respond most strongly within bands of nearly non-overlapping frequencies, each with a center frequency that depends on Ψj.

Finally, graph labels are often complex functions of both local and global subgraph structure within G. While graph ConvNets are adept at learning local structure within G, as detailed in Verma and Zhang (2018) they require many layers to obtain features that aggregate macroscopic patterns in the graph. This is due to the use of fixed-size filters, which often only incorporate information from the neighbors of a vertex. The training of such networks is difficult due to the limited size of many graph classification databases (see the supplementary information). Geometric scattering transforms have two advantages in this regard: (a) the wavelet filters are designed rather than trained; and (b) they are multiscale, thus incorporating macroscopic graph patterns in every layer/order.

3.5 Results

To establish the geometric scattering features as an effective graph representation for data analysis, we examine their performance here in four graph data analysis applications. First, we briefly introduce the thirteen datasets used in this study and their statistics. Then, in Sec. 3.5.2 we consider graph classification on social networks (from Yanardag and Vishwanathan, 2015) and various biochemistry datasets; in Sec. 3.5.3 we consider the impact of low training data availability on classification; in Sec. 3.5.4 we examine dimensionality reduction aspects of geometric scattering; in Sec. 3.5.5 we consider data exploration of enzyme graphs, where geometric scattering enables unsupervised (descriptive) recovery of EC change preferences in enzyme evolution; and finally, in Sec. 3.5.6, we conduct ablation studies.
A common theme in all these applications is the use of geometric scattering as an unsupervised, task-independent feature extractor that embeds input graphs of varying sizes (with associated graph signals) into a Euclidean space formed by scattering features. The extracted feature vectors are then passed to traditional (Euclidean) machine learning algorithms, such as SVM for classification or PCA for dimensionality reduction, to perform downstream analysis. Our results show that our scattering features provide a simplified representation (e.g., in dimensionality and extrapolation ability) of input graphs, which we conjecture is a result of their stability properties, while also being sufficiently rich to capture meaningful relations between graphs for predictive and descriptive purposes. Results of the methods compared in our paper come from the respective papers that introduced the methods, with the exception of: (1) social network results of WL, from Tixier et al. (2017); (2) biochemistry and social results of DCNN, from Verma and Zhang (2018); (3) biochemistry results (except for D&D) and social results of GK, from Yanardag and Vishwanathan (2015); (4) the D&D result of GK, from Niepert et al. (2016); and (5) for Graphlets, biochemistry results from Kriege et al. (2016) and social results from Tixier et al. (2017). All the geometric scattering and related classification code was implemented in Python. All experiments were performed in an HPC environment on an intel16-k80 cluster, with each job requesting one node with four processors and two Nvidia Tesla K80 GPUs.

3.5.1 Detailed Dataset Descriptions

The details of the datasets used in this work are as follows: NCI1 (Wale et al., 2008) contains 4,110 chemical compounds as graphs, with 37 node features. Each compound is labeled according to its activity against non-small cell lung cancer and ovarian cancer cell lines, and these labels serve as the classification goal on this data.
NCI109 (Wale et al., 2008) is similar to NCI1, but with 4,127 chemical compounds and 38 node features. MUTAG (Debnath et al., 1991) consists of 188 mutagenic aromatic and heteroaromatic nitro compounds (as graphs) with 7 node features. The classification here is binary (i.e., two classes), based on whether or not a compound has a mutagenic effect on a bacterium. PTC (Toivonen et al., 2003) is a dataset of 344 chemical compounds (as graphs) with nineteen node features that are divided into two classes depending on whether they are carcinogenic in rats. The PROTEINS (Borgwardt et al., 2005) dataset contains 1,113 proteins (as graphs) with three node features, where the goal of the classification is to predict whether or not the protein is an enzyme. The D&D (Dobson and Doig, 2003) dataset contains 1,178 protein structures (as graphs) that, similar to the previous dataset, are classified as enzymes or non-enzymes. ENZYMES (Borgwardt et al., 2005) is a dataset of 600 protein structures (as graphs) with three node features. These proteins are divided into six classes of enzymes (labelled by enzyme commission numbers) for classification. COLLAB (Yanardag and Vishwanathan, 2015) is a scientific collaboration dataset containing 5K graphs. The classification goal here is to predict which subfield of physics each graph belongs to. IMDB-B (Yanardag and Vishwanathan, 2015) is a movie collaboration dataset containing 1K graphs. The graphs are generated from two genres, Action and Romance, and the classification goal is to predict the correct genre for each graph. IMDB-M (Yanardag and Vishwanathan, 2015) is similar to IMDB-B, but with 1.5K graphs and three genres: Comedy, Romance, and Sci-Fi. REDDIT-B (Yanardag and Vishwanathan, 2015) is a dataset with 2K graphs, where each graph corresponds to an online discussion thread. The classification goal is to predict whether the graph belongs to a Q&A-based community or a discussion-based community.
REDDIT-5K (Yanardag and Vishwanathan, 2015) consists of 5K threads (as graphs) from five different subreddits. The classification goal is to predict the corresponding subreddit for each thread. REDDIT-12K (Yanardag and Vishwanathan, 2015) is similar to REDDIT-5K, but with 11,929 graphs from 11 different subreddits.

Table 3.1: Basic statistics of the biochemical graph classification databases

                          NCI1    NCI109  MUTAG  D&D     PTC    PROTEINS  ENZYMES
# of graphs in data       4110    4127    188    1178    344    1113      600
Max # of vertices         111     111     28     5748    109    620       126
Mean # of vertices        29.8    29.6    17.93  284.32  25.56  39.0      32.6
# of features per vertex  37      38      7      89      22     3         3
Mean # of edges           64.6    62.2    39.50  1431.3  51.90  72.82     124.2
# of classes              2       2       2      2       2      2         6

Table 3.2: Basic statistics of the social network graph classification databases

                          COLLAB   IMDB-B  IMDB-M  REDDIT-B  REDDIT-5K  REDDIT-12K
# of graphs in data       5000     1000    1500    2000      5000       11929
Max # of vertices         492      136     89      3783      3783       3782
Mean # of vertices        74.49    19.77   13      429.61    508.5      391.4
# of features per vertex  3        3       3       2         2          2
Mean # of edges           2457.78  96.53   65.94   497.75    594.87     456.89
# of classes              3        2       3       2         5          11

Table 3.1 and Table 3.2 summarize the size of the available graph data (i.e., number of graphs, and both max and mean number of vertices within graphs) in these datasets, as previously reported in the literature.

3.5.2 Graph Classification on Social Networks and Biochemistry Datasets

As a first application of geometric scattering, we apply it to graph classification of social network data taken from Yanardag and Vishwanathan (2015). In particular, that work introduced six social network datasets extracted from scientific collaborations (COLLAB), movie collaborations (IMDB-B & IMDB-M), and Reddit discussion threads (REDDIT-B, REDDIT-5K, REDDIT-12K). We then also apply our methods to biochemistry datasets that are often used in the graph classification literature. The classification results on the biochemistry datasets are shown in Table 3.4.
The social network data provided by Yanardag and Vishwanathan (2015) contains graph structures but no associated graph signals. Therefore we compute the eccentricity (for connected graphs) and clustering coefficient of each vertex, and use these as input signals to the geometric scattering transform. Specifically, in the case of COLLAB, IMDB-B, and IMDB-M, we use the eccentricity and clustering coefficients of each vertex as characteristic graph signals. In the case of REDDIT-B, REDDIT-5K and REDDIT-12K, on the other hand, we only use the clustering coefficient, due to the presence of disconnected graphs in these datasets. In principle, any general node characteristic could be used, although we remark that x = d, the vertex degree vector, is not useful in our construction since Ψ_j d = 0. After computing the scattering moments¹ of these two input signals, they are concatenated to form a single vector. This scattering feature vector is a consistent Euclidean representation of the graph, which is independent of the original graph sizes (i.e., number of vertices or edges), and thus we can apply any traditional classifier to it. In particular, we use here the standard SVM classifier with an RBF kernel, which is popular and effective in many applications and also performs well in this case. We evaluate the classification results of our SVM-based geometric scattering classification (GS-SVM) using ten-fold cross validation, which is standard practice in other graph classification works. First, the entire dataset is randomly split into ten subsets. Then, in each iteration (or “fold”), nine of them are used for training and validation, and the other one is used for testing classification

¹We use the normalized scattering moments for classification, since they perform slightly better than the un-normalized moments. We also use J = 5 and q = 4 for all scattering feature generations.
Table 3.3: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on social graph datasets.

                     COLLAB        IMDB-B        IMDB-M        REDDIT-B      REDDIT-5K     REDDIT-12K
Graph kernel methods:
  WL                 77.82 ± 1.45  71.60 ± 5.16  N/A           78.52 ± 2.01  50.77 ± 2.02  34.57 ± 1.32
  Graphlet           73.42 ± 2.43  65.40 ± 5.95  N/A           77.26 ± 2.34  39.75 ± 1.36  25.98 ± 1.29
  WL-OA              80.70 ± 0.10  N/A           N/A           89.30 ± 0.30  N/A           N/A
  GK                 72.84 ± 0.28  65.87 ± 0.98  43.89 ± 0.38  77.34 ± 0.18  41.01 ± 0.17  N/A
  DGK                73.00 ± 0.20  66.90 ± 0.50  44.50 ± 0.50  78.00 ± 0.30  41.20 ± 0.10  32.20 ± 0.10
Deep learning methods:
  DGCNN              73.76 ± 0.49  70.03 ± 0.86  47.83 ± 0.85  N/A           48.70 ± 4.54  N/A
  2D CNN             71.33 ± 1.96  70.40 ± 3.85  N/A           89.12 ± 1.70  52.21 ± 2.44  48.13 ± 1.47
  PSCN (k = 10)      72.60 ± 2.15  71.00 ± 2.29  45.23 ± 2.84  86.30 ± 1.58  49.10 ± 0.70  41.32 ± 0.42
  GCAPS-CNN          77.71 ± 2.51  71.69 ± 3.40  48.50 ± 4.10  87.61 ± 2.51  50.10 ± 1.72  N/A
  S2S-P2P-NN         81.75 ± 0.80  73.80 ± 0.70  51.19 ± 0.50  86.50 ± 0.80  52.28 ± 0.50  42.47 ± 0.10
  GIN-0 (MLP-SUM)    80.20 ± 1.90  75.10 ± 5.10  52.30 ± 2.80  92.40 ± 2.50  57.50 ± 1.50  N/A
  GS-SVM             79.94 ± 1.61  71.20 ± 3.25  48.73 ± 2.32  89.65 ± 1.94  53.33 ± 1.37  45.23 ± 1.25

Table 3.4: Comparison of the proposed GS-SVM classifier with leading graph kernel and deep learning methods on biochemistry graph datasets.
                     NCI1          NCI109        D&D           PROTEINS      MUTAG         PTC           ENZYMES
Graph kernel methods:
  WL                 84.46 ± 0.45  85.12 ± 0.29  78.34 ± 0.62  72.92 ± 0.56  84.11 ± 1.91  59.97 ± 1.60  55.22 ± 1.26
  Graphlet           70.50 ± 0.20  69.30 ± 0.20  79.70 ± 0.70  72.70 ± 0.60  85.20 ± 0.90  54.70 ± 2.00  30.60 ± 1.20
  WL-OA              86.10 ± 0.20  86.30 ± 0.20  79.20 ± 0.40  76.40 ± 0.40  84.5 ± 1.7    63.6 ± 1.5    59.9 ± 1.1
  GK                 62.28 ± 0.29  62.60 ± 0.19  78.45 ± 0.26  71.67 ± 0.55  81.39 ± 1.74  57.26 ± 1.41  26.61 ± 0.99
  DGK                80.30 ± 0.40  80.30 ± 0.30  73.09 ± 0.25  75.70 ± 0.50  87.40 ± 2.70  60.10 ± 2.50  53.40 ± 0.90
Deep learning methods:
  DGCNN              74.44 ± 0.47  N/A           79.37 ± 0.94  75.54 ± 0.94  85.83 ± 1.66  58.59 ± 2.47  51.00 ± 7.29
  2D CNN             N/A           N/A           N/A           77.12 ± 2.79  N/A           N/A           N/A
  CCN                76.27 ± 4.13  75.54 ± 3.36  N/A           N/A           91.64 ± 7.24  70.62 ± 7.04  N/A
  PSCN (k = 10)      76.34 ± 1.68  N/A           76.27 ± 2.15  75.00 ± 2.51  88.95 ± 4.37  62.29 ± 5.68  N/A
  GCAPS-CNN          82.72 ± 2.38  81.12 ± 1.28  77.62 ± 4.99  76.40 ± 4.17  N/A           66.01 ± 5.91  61.83 ± 5.39
  S2S-P2P-NN         83.72 ± 0.4   83.64 ± 0.3   N/A           76.61 ± 0.5   89.86 ± 1.1   64.54 ± 1.1   63.96 ± 0.6
  GIN-0 (MLP-SUM)    82.70 ± 1.60  N/A           N/A           76.20 ± 2.80  89.40 ± 5.60  64.60 ± 7.00  N/A
  GS-SVM             79.14 ± 1.28  77.95 ± 1.25  75.04 ± 3.64  74.11 ± 4.02  83.57 ± 6.75  63.94 ± 7.38  56.83 ± 4.97

accuracy. In total, after ten iterations, each of the subsets has been used once for testing, resulting in ten reported classification accuracy numbers for the examined dataset.
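The featurization behind the GS-SVM — graph-derived signals such as the clustering coefficient, summarized by q statistical moments per scattering channel — can be sketched as follows (a simplified, unnormalized variant; the helper names are ours):

```python
import numpy as np

def clustering_coefficients(A):
    """Per-vertex clustering coefficient computed from the adjacency matrix."""
    d = A.sum(axis=0)
    triangles = np.diag(A @ A @ A) / 2.0          # closed triangles at each vertex
    possible = d * (d - 1) / 2.0                  # possible neighbor pairs
    return np.where(possible > 0, triangles / possible, 0.0)

def moments(coeffs, q=4):
    """Summarize one channel of (absolute) coefficients by q moments."""
    return np.array([np.sum(np.abs(coeffs) ** k) for k in range(1, q + 1)])

# Toy graph: a triangle with a pendant vertex
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])
x = clustering_coefficients(A)
print(x)                      # vertices 0 and 1 close all of their triangles
feature_chunk = moments(x)    # fixed length q, regardless of graph size
print(feature_chunk.shape)    # (4,)
```

Concatenating such per-channel moments over all wavelet scales and orders yields the fixed-length vector that is handed to the RBF-kernel SVM.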
Finally, the mean and standard deviation of these ten accuracies are computed and reported. It should be noted that during training, each iteration also performs automatic tuning of the trained classifier, as follows. First, nine iterations are performed, each time using eight subsets (i.e., folds) for training and the remaining one as a validation set, which is used to determine the optimal parameters for the SVM. After nine iterations, each of the training/validation subsets has been used once for validation, and we obtain nine classification models, which in turn produce nine predictions (i.e., class assignments) for each data point in the test subset of the main cross validation. To obtain the final predicted class of this cross validation iteration, we select the class with the most votes (from among the nine models) as our final classification result. These results are then compared to the true labels on the test subset to obtain the classification accuracy for this fold. We compare our results to prominent methods that report results for most, if not all, of the considered datasets. Among these are graph kernel methods, namely: Weisfeiler-Lehman graph kernels (WL, Shervashidze et al., 2011), the propagation kernel (PK, Neumann et al., 2012), Graphlet kernels (Shervashidze et al., 2009), random walk kernels (RW, Gärtner et al., 2003), deep graph kernels (DGK, Yanardag and Vishwanathan, 2015), and Weisfeiler-Lehman optimal assignment kernels (WL-OA, Kriege et al., 2016).
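The tuning scheme just described — nine inner folds, nine fitted models, and a majority vote over their test-set predictions — can be sketched as follows (with a stand-in nearest-centroid "model" on synthetic data, since any classifier can fill that slot):

```python
import numpy as np

def majority_vote(preds):
    """Most common label per test sample across several models."""
    preds = np.asarray(preds)                     # (n_models, n_samples)
    return np.array([np.bincount(col).argmax() for col in preds.T])

rng = np.random.default_rng(0)
n = 30
y = np.arange(n) % 2                              # balanced binary labels
X = rng.normal(size=(n, 2))
X[:, 0] += 4.0 * y                                # make the classes separable

accs = []
outer = np.array_split(rng.permutation(n), 10)    # outer ten-fold split
for i, test_idx in enumerate(outer):
    pool = np.concatenate([f for j, f in enumerate(outer) if j != i])
    inner = np.array_split(pool, 9)               # nine train/validation folds
    votes = []
    for k in range(9):                            # one model per inner split
        tr = np.concatenate([f for j, f in enumerate(inner) if j != k])
        c0 = X[tr][y[tr] == 0].mean(axis=0)       # toy model: class centroids
        c1 = X[tr][y[tr] == 1].mean(axis=0)
        d0 = np.linalg.norm(X[test_idx] - c0, axis=1)
        d1 = np.linalg.norm(X[test_idx] - c1, axis=1)
        votes.append((d1 < d0).astype(int))
    pred = majority_vote(votes)                   # consensus of the nine models
    accs.append(float((pred == y[test_idx]).mean()))
print(round(float(np.mean(accs)), 2))
```

The voting step is what turns the nine hyperparameter-tuning models into a single prediction per outer test fold.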
The others are recent geometric deep learning algorithms: deep graph convolutional neural networks (DGCNN, Zhang et al., 2018), Graph2vec (Narayanan et al., 2017), 2D convolutional neural networks (2D CNN, Tixier et al., 2017), covariant compositional networks (CCN, Kondor et al., 2018b), Patchy-san (PSCN, Niepert et al., 2016, with k = 10), diffusion convolutional neural networks (DCNN, Atwood and Towsley, 2016), graph capsule convolutional neural networks (GCAPS-CNN, Verma and Zhang, 2018), recurrent neural network autoencoders (S2S-N2N-PP, Taheri et al., 2018), and the graph isomorphism network (GIN, Xu et al., 2019). Following the standard format of reported classification performances for these methods, our results are reported in the form of average accuracy ± standard deviation (in percentages) over the ten cross-validation folds. We note that since some methods are not reported for all datasets, we mark N/A when appropriate. Table 3.3 reports the results. The geometric scattering transform and the related variants presented in Zou and Lerman (2018) and Gama et al. (2019) constitute a mathematical model for graph ConvNets. However, it is natural to ask if this model accurately reflects what is done in practice. A useful model need not obtain state of the art performance, but it should be competitive with the current state of the art, lest the model fail to capture the underlying complexity of the most powerful methods. Examining Table 3.3, one can see that the GS-SVM classifier matches or outperforms all but the two most recent methods, i.e., S2S-N2N-PP (Taheri et al., 2018) and GIN (Xu et al., 2019). With regard to these two approaches, the GS-SVM outperforms S2S-N2N-PP (Taheri et al., 2018) on 3/6 datasets. Finally, while GIN (Xu et al., 2019) outperforms geometric scattering on 5/6 datasets, the results on COLLAB and IMDB-B are not statistically significant, and on the REDDIT datasets the geometric scattering approach trails only GIN (Xu et al., 2019).
We thus conclude that the geometric scattering transform yields a rich set of invariant statistical moments, which have nearly the same capacity as the current state of the art in graph neural networks.

3.5.3 Classification with Low Training-data Availability

Many modern deep learning methods require large amounts of training data to generate representative features. In contrast, geometric scattering features are computed from each graph without any training process. In this section, we examine the performance of the GS-SVM under low training-data availability and show that the scattering features embed enough graph information that, even under extreme conditions (e.g., only 20% training data), they can still maintain relatively good classification results.

Table 3.5: Classification accuracy with different training/validation/test splits over scattering features (unnorm. moments)

Datasets: NCI1, NCI109, D&D, PROTEINS, MUTAG, PTC, ENZYMES, COLLAB, IMDB-B, IMDB-M, REDDIT-B, REDDIT-5K, REDDIT-12K
SVM accuracy under the splits 80%/10%/10%, 70%/10%/20%, 40%/10%/50%, 20%/10%/70%:
73.60 ± 0.68 79.80 ± 2.24 77.66 ± 1.78 72.36 ± 0.74 76.57 ± 3.76 75.58 ± 0.81 73.01 ± 1.94 74.03 ± 4.20 84.04 ± 6.71 77.47 ± 4.41 66.32 ± 7.54 56.75 ± 2.88 53.83 ± 6.71 36.38 ± 1.93 74.63 ± 1.05 76.88 ± 1.13 70.80 ± 3.54 67.81 ± 0.98 48.93 ± 4.77 44.28 ± 1.87 86.18 ± 0.32 88.30 ± 2.08 50.71 ± 2.27 48.37 ± 0.76 41.35 ± 1.05 37.71 ± 0.42 78.13 ± 2.07 77.54 ± 1.44 76.74 ± 2.32 74.30 ± 2.49 82.99 ± 6.97 64.83 ± 2.13 52.50 ± 5.35 76.98 ± 0.97 70.60 ± 2.85 49.00 ± 1.97 88.75 ± 0.96 50.87 ± 1.37 41.05 ± 0.70 76.37 ± 0.27 74.41 ± 0.14 76.32 ± 0.59 73.32 ± 1.68 78.72 ± 3.19 61.92 ± 1.45 44.50 ± 3.83 76.42 ± 0.82 69.10 ± 1.90 47.20 ± 1.47 86.40 ± 0.40 50.10 ± 0.41 39.36 ± 1.30

We performed graph classification under four training/validation/test splits: 80%/10%/10%, 70%/10%/20%, 40%/10%/50% and 20%/10%/70%. We performed 10-fold, 5-fold and 2-fold cross validation for the first three splits.
For the last split, we randomly formed a pool of 10 folds, from which we randomly selected 3 folds for training/validation and repeated this process ten times. Detailed classification results, using unnormalized moments, can be found in Table 3.5. Following Sec. 3.5.2, we discuss the classification accuracy on the six social datasets under these splits. When the training data is reduced from 90% to 80%, the classification accuracy in fact increased by 0.047%, which shows that the GS-SVM classification accuracy is hardly affected by this decrease in training size. Further reducing the training size to 50% results in an average decrease in classification accuracy of 1.40%, while going from 90% to 20% causes an average decrease of 3.00%. Fig. 3.3 gives a more nuanced statistical description of these results.

Figure 3.3: (a) Box plot showing the drop in SVM classification accuracy over social graph datasets when reducing training set size (horizontal axis marks portion of data used for testing); (b) Relation between explained variance, SVM classification accuracy, and PCA dimensions over scattering features in the ENZYMES dataset.

3.5.4 Dimensionality Reduction

We now consider the viability of scattering-based embedding for dimensionality reduction of graph data. As a representative example, we consider here the ENZYMES dataset introduced in Borgwardt et al. (2005), which contains 600 enzymes evenly split into six enzyme classes (i.e., 100 enzymes from each class). While the Euclidean notion of dimensionality is not naturally available in graph data, we note that graphs in this dataset have, on average, 124.2 edges, 29.8 vertices, and 3 features per vertex. Therefore, the data here can be considered significantly high dimensional in its original representation, which is not amenable to traditional dimensionality reduction techniques.
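Choosing the number of principal components to hit a target explained variance — the operation this section applies to scattering features — reduces to a cumulative sum over the PCA spectrum; a minimal numpy sketch, with synthetic data standing in for scattering feature vectors:

```python
import numpy as np

def pca_dims_for_variance(X, threshold=0.90):
    """Smallest number of principal components whose cumulative explained
    variance ratio reaches `threshold`."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)       # singular values
    ratio = (s ** 2) / np.sum(s ** 2)             # per-component variance ratio
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)

rng = np.random.default_rng(1)
# Synthetic features: 3 dominant directions embedded in 50 dimensions
signal = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
X = signal + 0.01 * rng.normal(size=(200, 50))
print(pca_dims_for_variance(X, 0.90))             # a small number (at most 3)
```

Applying a classifier in the resulting low-dimensional space is then straightforward, which is how the GS-PCA accuracies in this section are produced.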
To perform scattering-based dimensionality reduction, we applied PCA to the geometric scattering features extracted from the input enzyme graphs, while choosing the number of principal components to capture 99%, 90%, 80% or 50% of the explained variance. For each of these thresholds, we computed the mean classification accuracy (with ten-fold cross validation) of SVM applied to the resulting GS-PCA low dimensional space, as well as the dimensionality of this space. The relation between dimensionality, explained variance, and SVM accuracy is shown in Fig. 3.3, where we can observe that geometric scattering combined with PCA indeed enables significant dimensionality reduction (e.g., to R^16 with 90% explained variance) with only a small impact on classification accuracy. Finally, we also consider the PCA dimensionality of each individual enzyme class in the data (in the scattering feature space), as we expect scattering to reduce the variability in each class w.r.t. the full feature space. Indeed, in this case, individual classes have 90%-explained-variance PCA dimensionality ranging between 6 and 10, which is significantly lower than the 16 dimensions of the entire PCA space. We note that similar results can also be observed for the social network data discussed in previous sections, where on average 90% of the explained variance is captured by nine dimensions, yielding a drop of 3.81% in mean SVM accuracy; complete results are shown in Table 3.6 and Table 3.7. Table 3.6: Classification accuracy and dimensionality reduction with PCA over scattering features (unnorm.
moments)

SVM accuracy w.r.t. variance covered:

Dataset      50%            80%            90%            99%
NCI1         72.41 ± 2.36   73.89 ± 2.57   73.89 ± 1.33   78.22 ± 1.95
NCI109       70.85 ± 2.59   71.84 ± 2.38   72.33 ± 2.24   76.69 ± 1.02
D&D          75.21 ± 3.17   75.13 ± 3.68   74.87 ± 3.99   76.92 ± 3.37
PROTEINS     70.80 ± 3.43   74.20 ± 3.06   74.67 ± 3.33   74.57 ± 3.42
MUTAG        77.51 ± 10.42  80.32 ± 8.16   82.40 ± 10.92  84.09 ± 9.09
PTC          58.17 ± 8.91   60.50 ± 9.96   58.70 ± 6.93   63.68 ± 3.97
ENZYMES      29.67 ± 4.46   45.33 ± 6.62   50.67 ± 5.44   52.50 ± 8.89
COLLAB       62.86 ± 1.36   71.68 ± 2.06   73.22 ± 2.29   76.54 ± 1.41
IMDB-B       58.30 ± 3.44   66.10 ± 3.14   68.80 ± 4.31   68.40 ± 4.31
IMDB-M       41.00 ± 4.86   46.40 ± 4.48   45.93 ± 3.86   48.27 ± 3.23
REDDIT-B     71.05 ± 2.39   78.95 ± 2.42   83.75 ± 1.83   86.95 ± 1.78
REDDIT-5K    40.97 ± 2.06   45.71 ± 2.21   47.43 ± 1.90   49.65 ± 1.86
REDDIT-12K   28.22 ± 1.64   33.36 ± 0.93   34.71 ± 1.52   38.39 ± 1.54

PCA dimensions w.r.t. variance covered (50%, 80%, 90%): 43 18 43 19 44 10 10 2 4 13 21 7 16 3 9 2 2 8 8 2 8 2 10 2 2 9; (99%): 117 114 122 36 34 62 44 32 24 20 24 27 27 32 32 35 5 8 14 9 6 4 5 5 6 5

Table 3.7: Dimensionality reduction with PCA over scattering features (unnorm. moments)

Dataset: ENZYMES
SVM accuracy:  PCA 50.67 ± 5.44   Full 53.83 ± 6.71
PCA dimensions (> 90% variance):  All classes: 16   Per class: 8, 9, 10, 6, 9, 8

3.5.5 Data Exploration: Enzyme Class Exchange Preferences

Geometric scattering essentially provides a task-independent representation of graphs in a Euclidean feature space. Therefore, it is not limited to supervised learning applications, and can also be utilized for exploratory graph-data analysis, as we demonstrate in this section. We focus our discussion in particular on the ENZYMES dataset described in the previous section. Here, geometric scattering features can be considered as providing “signature” vectors for individual enzymes, which can be used to explore interactions between the six top level enzyme classes, labelled by their Enzyme Commission (EC) numbers (Borgwardt et al., 2005).
In order to emphasize the properties of scattering-based feature extraction, rather than downstream processing, we mostly limit our analysis of the scattering feature space to linear operations such as principal component analysis (PCA).

Table 3.8: EC subspace analysis in the scattering feature space of ENZYMES (Borgwardt et al., 2005). For enzyme classes EC-1 through EC-6, the table lists the mean distance to the subspace of each class, measured via PCA projection/reconstruction distance: 84.86 18.15 22.75 22.65 107.23 168.56 49.14 117.68 58.22 45.46 62.38 13.56 53.07 18.45 117.24 94.3 15.09 59.23 62.87 22.66 144.08 29.59 50.07 51.94 98.44 9.43 252.31 127.27 66.57 58.88 75.47 30.14 30.4 122.3 60 73.96; and the portion of each class with its true class as the 1st, 2nd, or 3rd-6th nearest subspace: 45% 28% 53% 24% 32% 7% 24% 12% 67% 21% 67% 21% 27% 23% 61% 64% 12% 12%

To explore the scattering feature space, and the richness of information captured by it, we use it to infer relations between EC classes. First, for each enzyme e, with scattering feature vector v_e (i.e., with Sx for all vertex features x), we compute its distance from class EC-j, with PCA subspace S_j, as the projection distance: dist(e, EC-j) = ||v_e − proj_{S_j} v_e||. Then, for each enzyme class EC-i, we compute the mean distance of enzymes in it from the subspace of each class EC-j as D(i, j) = mean{dist(e, EC-j) : e ∈ EC-i}.

Figure 3.4: Comparison of EC exchange preferences in enzyme evolution: (a) observed in Cuesta et al. (2015), and (b) inferred from scattering features via pref(EC-i, EC-j) := w_j · [min{D(i, j)/D(i, i), D(j, i)/D(j, j)}]^{-1}, where w_j is the portion of enzymes in EC-j that choose another EC as their nearest subspace, and D(i, j) is the mean distance of enzymes in EC-i from the PCA (90% explained variance) subspace of EC-j. Our inference (b) mainly recovers (a).
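The projection distance dist(e, EC-j) and class subspaces can be sketched as follows, under one straightforward reading in which each class subspace comes from the PCA of that class's feature vectors, centered at the class mean (the names and synthetic data are illustrative assumptions, not the chapter's exact computation):

```python
import numpy as np

def class_subspace(V, var=0.90):
    """PCA basis (columns) capturing `var` of one class's variance, plus the class mean."""
    mean = V.mean(axis=0)
    U, s, _ = np.linalg.svd((V - mean).T, full_matrices=False)
    k = int(np.searchsorted(np.cumsum(s ** 2) / np.sum(s ** 2), var) + 1)
    return U[:, :k], mean

def dist_to_class(v, basis, mean):
    """Projection (reconstruction) distance of one feature vector to a class subspace."""
    r = v - mean
    return float(np.linalg.norm(r - basis @ (basis.T @ r)))

rng = np.random.default_rng(0)
# Two synthetic "classes" living on different 2D planes of a 5D feature space
plane0 = np.array([[1., 0., 0., 0., 0.], [0., 1., 0., 0., 0.]])
plane1 = np.array([[0., 0., 1., 0., 0.], [0., 0., 0., 1., 0.]])
c0 = rng.normal(size=(50, 2)) @ plane0 + 0.01 * rng.normal(size=(50, 5))
c1 = rng.normal(size=(50, 2)) @ plane1 + 0.01 * rng.normal(size=(50, 5))

b0, m0 = class_subspace(c0)
b1, m1 = class_subspace(c1)
v = c0[0]                               # a point from class 0
print(dist_to_class(v, b0, m0) < dist_to_class(v, b1, m1))   # True
```

Averaging such distances over all members of a class gives the D(i, j) entries used above.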
These distances are summarized in Table 3.8, as well as the proportion of points from each class that have their true EC as their nearest (or second nearest) subspace in the scattering feature space. In general, 48% of enzymes select their true EC as the nearest subspace (with an additional 19% selecting it as second nearest), but these proportions vary between individual EC classes. Finally, we use these scattering-based distances to infer EC exchange preferences during enzyme evolution, which are presented in Fig. 3.4 and validated with respect to the established preferences observed and reported in Cuesta et al. (2015). We note that the results there were observed independently of the ENZYMES dataset. In particular, the portion of enzymes considered from each EC differs between these data, since Borgwardt et al. (2005) took special care to ensure that each EC class in ENZYMES has exactly 100 enzymes in it. However, we notice that the portion of enzymes (in each EC) that choose the wrong EC as their nearest subspace, which can be considered as EC “incoherence” in the scattering feature space, correlates well with the proportion of evolutionary exchanges generally observed for each EC in Cuesta et al. (2015), and therefore we use these as EC weights (see Fig. 3.4). Our results in Fig. 3.4 demonstrate that scattering features are sufficiently rich to capture relations between enzyme classes, and indicate that geometric scattering has the capacity to uncover descriptive and exploratory insights in graph data analysis.

3.5.6 Ablation Study

To fully understand the power of our geometric scattering coefficients, we conduct an ablation study using five social network datasets, namely COLLAB, IMDB-B, IMDB-M, REDDIT-B, and REDDIT-5K, as representative examples.
Following the settings described above, here, instead of using four normalized moments for each order of scattering moments, we use only one normalized moment (the mean) or two normalized moments (the mean and variance), and compare the graph classification results in Table 3.9. We show that even with only one normalized moment our method still obtains relatively good results, and that using higher order moments helps us to match or outperform most state-of-the-art results. Generally, the results degrade by 1-6% on the social network datasets when reducing from four normalized moments to two or one normalized moment.

Table 3.9: Ablation study on five social network datasets using only one normalized moment and two normalized moments.

                     COLLAB  IMDB-B  IMDB-M  REDDIT-B  REDDIT-5K
One norm. moment     77.42   69.80   48.47   83.25     50.31
Two norm. moments    78.44   69.3    48.27   85.20     51.49

Finally, we perform graph classification with two different classifiers, a linear SVM and fully connected layers (FCLs)², to further demonstrate the usefulness of the geometric scattering coefficients and show that they perform well regardless of the choice of classifier. Our results in Table 3.10 show that, compared to the RBF SVM, the FCLs and linear SVM are somewhat worse (by 1-3%), but not by much.

Table 3.10: Graph classification with FCLs and linear SVM classifiers

                     COLLAB  IMDB-B  IMDB-M  REDDIT-B  REDDIT-5K
linear SVM           77.40   70.50   47.13   86.45     53.23
FCLs                 79.26   69.50   46.40   86.60     50.50

3.6 Conclusion

We presented the geometric scattering transform as a deep filter bank for feature extraction on graphs, which generalizes the Euclidean scattering transform. A reasonable criticism of the scattering theory approach to understanding geometric deep learning is that it is not clear whether the scattering model is a suitable facsimile for the powerful graph neural networks that are obtaining impressive results on graph classification tasks and related graph data analysis problems.
In this chapter we showed that, at least empirically, this line of criticism is unfounded, and that further theoretical study of geometric scattering transforms on graphs is indeed warranted. Our evaluation results on graph classification and data exploration show the potential of the produced scattering features to serve as universal representations of graphs. Indeed, classification using these features with relatively simple classifier models, dimension-reduced feature sets, and small training sets nevertheless reaches high accuracy results on the most commonly used graph classification datasets. Finally, the geometric scattering features provide a new way of computing and considering global graph representations, independent of specific learning tasks. They raise the possibility of embedding entire graphs in Euclidean space and computing meaningful distances between graphs, which can be used for both supervised and unsupervised learning, as well as exploratory analysis of graph-structured data.

²Hyperparameters of the FCLs are manually selected.

CHAPTER 4 TOXICITY PREDICTION OF SMALL ORGANIC MOLECULES

4.1 Abstract

Toxicity evaluation is of great significance in protecting humans from toxic compounds. It is also very important in drug development and approval. Usually a compound's effects on human health are assessed through a large number of time- and cost-intensive in vivo or in vitro experiments. These experiments also raise ethical concerns. Therefore, computational methods have been proposed for toxicity prediction, to avoid the expense of animal tests and the associated ethical issues. In recent years there has been a trend toward applying machine learning and deep learning for toxicity prediction. In this study, the proposed geometric scattering method is used to generate molecular representations, which are then used for toxicity prediction. The geometric scattering method is constructed from cascades of graph wavelets and non-linear absolute values.
It provides an unsupervised way of generating molecular graph features, where molecules are treated as graphs. Once the molecular graph features are generated, we can use any classifier, such as a neural network or support vector machine, to perform toxicity prediction. The acute oral systemic toxicity dataset is used in this study, and the results demonstrate that the geometric scattering method can achieve competitive results and thus provides a new way to perform toxicity prediction.

4.2 Introduction

Toxicity assessment has been a long-standing issue. In drug development, measuring the toxicity of potential drugs is essential before moving on to any clinical trials, which could save millions of dollars otherwise lost to drug failures. In addition, any chemicals that are put on the market are required to pass reliable tests to protect humans from being harmed by unknown toxic effects. Traditionally, the toxicity of a compound is evaluated by a large number of in vivo or in vitro experiments, where numerous methods rely on expensive and time-consuming animal tests. Therefore, there is an increasing demand for cost- and time-efficient toxicological screening methods. Computational models are considered able to screen large numbers of compounds in a short time at low cost (Rusyn and Daston, 2010) and thus can be a powerful alternative to laboratory experiments. In fact, there is a long history of applying computational methods to toxicity prediction. For example, quantitative structure-activity relationship (QSAR) models are used to predict toxicity based on chemical structure. However, traditional computational models often suffer from insufficient accuracy and are not as reliable as biological experiments. Therefore there is a great need for new computational models that could provide accurate predictions. In addition, more and more toxicity data are being collected and made available to the public.
The rapid growth of toxicity databases makes it possible to learn from data at large scale. To predict the toxicity of molecules, it is important to generate molecular representations. Molecular descriptors, such as Kow and other molecular physical or chemical properties that encode no or very limited structural information, were first used to generate molecular representations, which inevitably loses a great deal of information. Researchers have since developed various molecular representations that also encode structural information. Depending on the algorithm, there are different ways to encode molecular structures. For example, extended-connectivity fingerprints (ECFP), which are based on the Morgan algorithm, are widely used to generate molecular features (Rogers and Hahn, 2010). ECFP generates molecular features that represent each atom within larger and larger circular sub-structural neighborhoods. On the other hand, a molecule can be regarded as a graph: the atoms are treated as nodes while the bonds are regarded as edges. Thus one can directly use methods from graph theory to study molecular graphs and generate molecular features. Once molecular features are generated, they can be used as inputs to machine learning methods such as logistic regression (LR), support vector machines (SVM), random forests (RF), and gradient boosting regression trees (GBRT). Recently there has been growing interest in applying deep learning to learn molecular features during the training process. Deep learning utilizes multi-layer neural network models to learn patterns from data. It has achieved great successes in signal processing (Goodfellow et al., 2016, Chapter 9), computer vision (Rawat and Wang, 2017), speech recognition (Amodei et al., 2016), and natural language processing (Young et al., 2018). It can also be applied in many other domains. In chemistry, deep learning can be used to predict molecular energies.
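The circular-neighborhood idea behind ECFP mentioned above can be illustrated with a deliberately simplified sketch — hashing each atom's label together with its sorted neighbor labels for a few rounds and folding the identifiers into a fixed-size bit vector. This toy code is ours and is not the RDKit/Morgan implementation:

```python
import hashlib
import numpy as np

def toy_circular_fingerprint(A, atom_labels, radius=2, n_bits=64):
    """Fold iteratively grown neighborhood identifiers into an n_bits bit vector."""
    n = len(atom_labels)
    labels = [str(l) for l in atom_labels]
    fp = np.zeros(n_bits, dtype=int)
    for _ in range(radius + 1):
        for lab in labels:                      # set one bit per identifier
            h = int(hashlib.md5(lab.encode()).hexdigest(), 16)
            fp[h % n_bits] = 1
        labels = [labels[i] + "|" + ",".join(sorted(labels[j] for j in range(n) if A[i, j]))
                  for i in range(n)]            # grow each neighborhood by one bond
    return fp

# Heavy-atom skeleton of an ethanol-like molecule: C-C-O chain
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
fp = toy_circular_fingerprint(A, ["C", "C", "O"])
print(fp.sum() >= 1, len(fp))   # True 64
```

Each round corresponds to widening the circular substructure around every atom, which is the essence of the Morgan/ECFP construction.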
For example, the message passing neural network (MPNN) framework recently proposed by Gilmer et al. (2017) has achieved state-of-the-art results for molecular energy predictions. The MPNN architecture consists of two phases. The first is the message passing phase, in which each atom repeatedly aggregates messages from its bonded neighbors, incorporating geometry and bond information, and revises its hidden state through an update function; depending on the variant, the molecule can be treated as a graph or with its full 3D geometry. The second is the readout phase, in which the final atom states are mapped to a molecule-level prediction.

Deep learning is based on deep neural networks (DNNs) with multiple layers and large numbers of neurons per layer. A fully connected DNN consists of an input layer, several hidden layers and an output layer, and can be viewed as a function that maps an input vector to an output vector. The mapping is parameterized by weights that are optimized during training through backpropagation. In contrast to shallow neural networks, which have only a few hidden layers and neurons per layer, DNNs use many hidden layers with large numbers of neurons. In a fully connected DNN, each neuron is connected to all neurons of the previous layer; the outputs of the current layer, passed through an activation function, become the inputs of the following layer. Popular activation functions are ReLU, Tanh, and Sigmoid. Modern DNNs also utilize techniques such as dropout to avoid over-fitting. Finally, the output layer can be either a linear regression (for regression tasks) or a softmax function (for classification tasks). With this multi-layer architecture, the features of the input data are learned during the training process.

4.3 Method Development and Evaluation

Molecules can be represented as graphs, where atoms are nodes and bonds are edges. Therefore we can construct the molecular graph from the connectivity of the molecule.
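The fully connected architecture described above can be sketched in a few lines of numpy (forward pass only). The layer sizes are illustrative and the weights are random; in practice the weights would be learned by backpropagation, and frameworks such as PyTorch or TensorFlow would be used.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

# Toy weights for a fully connected network with two hidden layers: 8 -> 16 -> 16 -> 2.
W1, b1 = rng.normal(size=(8, 16)), np.zeros(16)
W2, b2 = rng.normal(size=(16, 16)), np.zeros(16)
W3, b3 = rng.normal(size=(16, 2)), np.zeros(2)

def forward(x):
    h1 = relu(x @ W1 + b1)        # hidden layer 1
    h2 = relu(h1 @ W2 + b2)       # hidden layer 2
    return softmax(h2 @ W3 + b3)  # softmax output: class probabilities

probs = forward(rng.normal(size=(5, 8)))  # batch of 5 eight-dimensional inputs
print(probs.shape)  # (5, 2); each row sums to 1
```

Each layer is a weighted linear map followed by an activation, and the softmax output layer matches the binary classification setting used later in this chapter.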
Furthermore, the features of each node can be chosen from a variety of atomic physical properties, such as atomic number, valence charge, mass, whether the atom belongs to an aromatic ring, and whether it is sp, sp2 or sp3 hybridized. In this study, we used the following eight properties to build the node features used as input signals on each graph: one-hot encodings of electron acceptor, electron donor, whether the atom belongs to an aromatic ring, whether it is sp, sp2 or sp3 hybridized, and the total charge and valence charge of each atom.

We utilize the acute oral systemic toxicity dataset publicly available from the NIEHS Toxicology Program, which consists of 8,893 molecular structures and their SMILES strings. Four classification tasks are considered in this study: (1) very toxic (LD50 < 50 mg/kg vs. all others); (2) nontoxic (LD50 > 2000 mg/kg vs. all others); (3) hazard categories under the EPA classification system (n = 4); and (4) hazard categories under the GHS classification system (n = 5; Category 5 and Not Classified combined into a single category).

4.3.1 Geometric Scattering for Molecule Feature Generation

To construct molecular representations that can be fed into various machine learning classifiers (e.g. random forest, support vector machine, neural network), we need to generate molecular features based on the structures and the molecular properties. Geometric scattering, introduced in Chapter 3, has shown promising results in generating representations for graphs, so in this study we perform geometric scattering on molecular graphs. The detailed calculations can be found in Chapter 3. As in Chapter 3, we only calculated 0th, 1st and 2nd order features according to Equations 3.4, 3.5, and 3.6. All three orders of features are then concatenated and fed into a classifier. In this study we use gradient boosting regression trees (GBRT), logistic regression (LR) and a two-layer neural network (NN).
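A simplified numpy sketch of the feature generation step is shown below, using the lazy-random-walk wavelet construction from Chapter 3. The normalizations are simplified and the second-order features of Equation 3.6 are omitted; the scales J and moments Q are illustrative choices.

```python
import numpy as np

def scattering_moments(A, x, J=3, Q=4):
    """Zeroth- and first-order geometric scattering moments of a node signal x
    on a graph with adjacency matrix A (simplified sketch)."""
    d = A.sum(axis=1)
    # Lazy random walk matrix P = (I + A D^{-1}) / 2 (column-stochastic).
    P = 0.5 * (np.eye(len(A)) + A / d)
    feats = [np.sum(x ** q) for q in range(1, Q + 1)]  # zeroth-order moments
    for j in range(1, J + 1):
        # Diffusion wavelet Psi_j = P^(2^(j-1)) - P^(2^j), applied to x.
        psi_x = (np.linalg.matrix_power(P, 2 ** (j - 1))
                 - np.linalg.matrix_power(P, 2 ** j)) @ x
        feats += [np.sum(np.abs(psi_x) ** q) for q in range(1, Q + 1)]
    return np.array(feats)

# Tiny 4-node path graph with a node signal (e.g. atomic charges).
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]], float)
x = np.array([1.0, -0.5, 0.2, 0.8])
f = scattering_moments(A, x)
print(f.shape)  # (16,) = Q zeroth-order + J*Q first-order moments
```

The resulting fixed-length vector is independent of node ordering and graph size, which is what allows molecules of different sizes to be fed into a single classifier.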
4.3.2 Data Preprocessing

The acute oral systemic toxicity dataset is very imbalanced. Among the 8,893 molecules, for the task of predicting whether a molecule is very toxic (LD50 < 50 mg/kg), 90% of the molecules are non-toxic and only 10% are considered very toxic. An imbalanced dataset biases the model toward the class with more data points and makes the model hard to train. To address this issue, small random noise is added to the geometric scattering features of the minority class to create synthetic "new" feature vectors that are not exactly the same as existing ones but do not change their labels. Finally, 80% of the data points from each class were combined and augmented (if a certain class was imbalanced) for training, 10% were used for validation, and the remaining 10% were used for testing.

4.3.3 Evaluation Criteria

To evaluate the effectiveness of our model, overall accuracy alone is not sufficient: on a test set containing 90% non-toxic compounds, a model can trivially achieve 90% accuracy by predicting everything nontoxic. Therefore, in addition to overall accuracy, sensitivity and specificity are also used to evaluate the performance of the model. They are defined as

    Specificity = TN / (TN + FP)                          (4.1)
    Sensitivity = TP / (TP + FN)                          (4.2)
    Overall Accuracy = (TP + TN) / (TP + FP + TN + FN)    (4.3)

Here TP stands for true positives, FP for false positives, TN for true negatives, and FN for false negatives.

4.4 Results

Classification was performed for four tasks: the "Very Toxic" and "Nontoxic" tasks are binary classification tasks, while the EPA and GHS hazard categories are multi-class classifications. Overall classification accuracies are also compared across classifiers (NN, GBRT, LR). The results of using GBRT and LR on the "Very Toxic" and "Nontoxic" tasks are shown in Tables 4.1 and 4.2.
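The evaluation criteria in Equations 4.1-4.3 can be computed directly from confusion-matrix counts; a minimal sketch on a hypothetical set of binary predictions (1 = positive class, e.g. "very toxic"):

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Confusion-matrix metrics following Equations 4.1-4.3 for a binary task."""
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    return {
        "specificity": tn / (tn + fp),                # Eq. 4.1
        "sensitivity": tp / (tp + fn),                # Eq. 4.2
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # Eq. 4.3
    }

# Illustrative labels and predictions (not from the actual dataset).
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 1, 0])
m = evaluate(y_true, y_pred)
print(m)  # sensitivity 2/3, specificity 4/5, accuracy 6/8
```

On this toy example the model is 75% accurate overall even though it misses one of three positives, which is exactly why sensitivity and specificity are reported alongside accuracy.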
Since the NN performs better than the other classifiers, in the following we further study sensitivity and specificity using only the NN model. The results in Table 4.3 show that for the "Very Toxic" task, we achieve 74% overall test accuracy, with 77% sensitivity and 73% specificity. For the "Nontoxic" task (Table 4.4), the performance is slightly worse, with 72% overall accuracy, 70% sensitivity and 72% specificity. For the EPA and GHS hazard category classifications there are multiple classes, and we only consider the overall accuracies in Tables 4.5 and 4.6: 53% for the four EPA classes and 52% for the five GHS classes. All of these results are much better than random guessing, which demonstrates the power of the geometric scattering features.

Table 4.1: Training and prediction results with GBRT and LR for "Very Toxic" classification

          Training   Test
    GBRT  73%        66%
    LR    68%        65%

Table 4.2: Training and prediction results with GBRT and LR for "Nontoxic" classification

          Training   Test
    GBRT  80%        64%
    LR    75%        60%

Table 4.3: Sensitivity, specificity and overall prediction accuracy for "Very Toxic" classification

                Overall Accuracy   Sensitivity   Specificity
    Training    87%                —             —
    Validation  84%                73%           85%
    Test        74%                77%           73%

Since many other molecular representations could be used for toxicity prediction, future work should compare the geometric scattering results with other molecular representations such as ECFP.
In addition, methods that use end-to-end learning can learn the molecular representations during training, and should also be considered for comparison in future work.

Table 4.4: Sensitivity, specificity and overall prediction accuracy for "Nontoxic" classification

                Overall Accuracy   Sensitivity   Specificity
    Training    82%                —             —
    Validation  75%                73%           76%
    Test        72%                70%           72%

Table 4.5: Training and prediction results for EPA hazard category classification

              Overall Accuracy
    Training  75%
    Test      53%

Table 4.6: Training and prediction results for GHS hazard category classification

              Overall Accuracy
    Training  71%
    Test      52%

BIBLIOGRAPHY

Allen-King, R. M., Grathwohl, P., and Ball, W. P. (2002). New modeling paradigms for the sorption of hydrophobic organic chemicals to heterogeneous carbonaceous matter in soils, sediments, and rocks. Advances in Water Resources, 25(8), 985–1016.
Amodei, D., Ananthanarayanan, S., Anubhai, R., Bai, J., Battenberg, E., Case, C., Casper, J., Catanzaro, B., Cheng, Q., Chen, G., Chen, J., Chen, J., Chen, Z., Chrzanowski, M., Coates, A., Diamos, G., Ding, K., Du, N., Elsen, E., Engel, J., Fang, W., Fan, L., Fougner, C., Gao, L., Gong, C., Hannun, A., Han, T., Johannes, L., Jiang, B., Ju, C., Jun, B., LeGresley, P., Lin, L., Liu, J., Liu, Y., Li, W., Li, X., Ma, D., Narang, S., Ng, A., Ozair, S., Peng, Y., Prenger, R., Qian, S., Quan, Z., Raiman, J., Rao, V., Satheesh, S., Seetapun, D., Sengupta, S., Srinet, K., Sriram, A., Tang, H., Tang, L., Wang, C., Wang, J., Wang, K., Wang, Y., Wang, Z., Wang, Z., Wu, S., Wei, L., Xiao, B., Xie, W., Xie, Y., Yogatama, D., Yuan, B., Zhan, J., and Zhu, Z. (2016). Deep Speech 2: End-to-end speech recognition in English and Mandarin. In M. F. Balcan and K. Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 173–182, New York, New York, USA. PMLR.
Andén, J. and Mallat, S. (2014). Deep scattering spectrum.
IEEE Transactions on Signal Processing, 62(16), 4114–4128.
Andén, J., Lostanlen, V., and Mallat, S. (2018). Classification with joint time-frequency scattering. arXiv:1807.08869.
Angles, T. and Mallat, S. (2018). Generative networks as inverse problems with scattering transforms. In International Conference on Learning Representations.
Arp, H. P. H., Breedveld, G. D., and Cornelissen, G. (2009). Estimating the in situ sediment-porewater distribution of PAHs and chlorinated aromatic hydrocarbons in anthropogenically impacted sediments. Environmental Science & Technology, 43(15), 5576–5585. PMID: 19731647.
Athmer, C. (2004). In-situ remediation of TCE in clayey soils. Soil and Sediment Contamination: An International Journal, 13(5), 381–390.
Atwood, J. and Towsley, D. (2016). Diffusion-convolutional neural networks. In Advances in Neural Information Processing Systems 29, pages 1993–2001.
Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S., Smola, A. J., and Kriegel, H.-P. (2005). Protein function prediction via graph kernels. Bioinformatics, 21(suppl_1), i47–i56.
Boyd, S. A., Johnston, C. T., Pinnavaia, T. J., Kaminski, N. E., Teppen, B. J., Li, H., Khan, B., Crawford, R. B., Kovalova, N., Kim, S.-S., Shao, H., Gu, C., and Kaplan, B. L. (2011). Suppression of humoral immune responses by 2,3,7,8-tetrachlorodibenzo-p-dioxin intercalated in smectite clay. Environmental Toxicology and Chemistry, 30(12), 2748–2755.
Boyd, S. A., Sallach, J. B., Zhang, Y., Crawford, R., Li, H., Johnston, C. T., Teppen, B. J., and Kaminski, N. E. (2017). Sequestration of 2,3,7,8-tetrachlorodibenzo-p-dioxin by activated carbon eliminates bioavailability and the suppression of immune function in mice. Environmental Toxicology and Chemistry, 36(10), 2671–2678.
Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. (2017). Geometric deep learning: Going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4), 18–42.
Brumwell, X., Sinz, P., Kim, K. J., Qi, Y., and Hirn, M. (2018). Steerable wavelet scattering for 3D atomic systems with application to Li-Si energy prediction. In NeurIPS Workshop on Machine Learning for Molecules and Materials. arXiv:1812.02320.
Bruna, J. and Mallat, S. (2011). Classification with scattering operators. In 2011 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1561–1566.
Bruna, J. and Mallat, S. (2012). Invariant scattering convolution networks. CoRR, abs/1203.1513.
Bruna, J. and Mallat, S. (2013a). Audio texture synthesis with scattering moments. arXiv:1311.0407.
Bruna, J. and Mallat, S. (2013b). Invariant scattering convolution networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1872–1886.
Bruna, J. and Mallat, S. (2018). Multiscale sparse microcanonical models. arXiv:1801.02013.
Chai, Y., Davis, J. W., Wilken, M., Martin, G. D., Mowery, D. M., and Ghosh, U. (2011). Role of black carbon in the distribution of polychlorinated dibenzo-p-dioxins/dibenzofurans in aged field-contaminated soils. Chemosphere, 82(5), 639–647.
Chen, H., Engkvist, O., Wang, Y., Olivecrona, M., and Blaschke, T. (2018). The rise of deep learning in drug discovery. Drug Discovery Today, 23(6), 1241–1250.
Cohen, A. J., Mori-Sánchez, P., and Yang, W. (2008). Insights into current limitations of density functional theory. Science, 321(5890), 792–794.
Coifman, R. R. and Maggioni, M. (2006). Diffusion wavelets. Applied and Computational Harmonic Analysis, 21(1), 53–94.
Comer, J., Gumbart, J. C., Hénin, J., Lelièvre, T., Pohorille, A., and Chipot, C. (2015). The adaptive biasing force method: Everything you always wanted to know but were afraid to ask. The Journal of Physical Chemistry B, 119(3), 1129–1151. PMID: 25247823.
Cornelissen, G. and Gustafsson, Ö.
(2004). Sorption of phenanthrene to environmental black carbon in sediment with and without organic matter and native sorbates. Environmental Science & Technology, 38(1), 148–155. PMID: 14740730.
Cornelissen, G., Gustafsson, Ö., Bucheli, T. D., Jonker, M. T. O., Koelmans, A. A., and van Noort, P. C. M. (2005). Extensive sorption of organic compounds to black carbon, coal, and kerogen in sediments and soils: mechanisms and consequences for distribution, bioaccumulation, and biodegradation. Environmental Science & Technology, 39(18), 6881–6895. PMID: 16201609.
Cornelissen, G., Breedveld, G. D., Kalaitzidis, S., Christanis, K., Kibsgaard, A., and Oen, A. M. P. (2006). Strong sorption of native PAHs to pyrogenic and unburned carbonaceous geosorbents in sediments. Environmental Science & Technology, 40(4), 1197–1203. PMID: 16572775.
Cornelissen, G., Cousins, I. T., Wiberg, K., Tysklind, M., Holmström, H., and Broman, D. (2008a). Black carbon-dominated PCDD/F sorption to soils at a former wood impregnation site. Chemosphere, 72(10), 1455–1461.
Cornelissen, G., Wiberg, K., Broman, D., Arp, H. P. H., Persson, Y., Sundqvist, K., and Jonsson, P. (2008b). Freely dissolved concentrations and sediment-water activity ratios of PCDD/Fs and PCBs in the open Baltic Sea. Environmental Science & Technology, 42(23), 8733–8739. PMID: 19192790.
Cramer, R. D., Patterson, D. E., and Bunce, J. D. (1988). Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. Journal of the American Chemical Society, 110(18), 5959–5967. PMID: 22148765.
Cuesta, S. M., Rahman, S. A., Furnham, N., and Thornton, J. M. (2015). The classification and evolution of enzyme function. Biophysical Journal, 109(6), 1082–1086.
Darve, E. and Pohorille, A. (2001). Calculating free energies using average force. The Journal of Chemical Physics, 115(20), 9169–9183.
Darve, E., Wilson, M. A., and Pohorille, A. (2002).
Calculating free energies using a scaled-force molecular dynamics algorithm. Molecular Simulation, 28(1-2), 113–144.
Debnath, A. K., Lopez de Compadre, R. L., Debnath, G., Shusterman, A. J., and Hansch, C. (1991). Structure-activity relationship of mutagenic aromatic and heteroaromatic nitro compounds. Correlation with molecular orbital energies and hydrophobicity. Journal of Medicinal Chemistry, 34(2), 786–797.
Diry, M., Tomkiewicz, C., Koehle, C., Coumoul, X., Bock, K. W., Barouki, R., and Transy, C. (2006). Activation of the dioxin/aryl hydrocarbon receptor (AhR) modulates cell plasticity through a JNK-dependent mechanism. Oncogene, 25(40), 5570–5574.
Dix, D. J., Houck, K. A., Martin, M. T., Richard, A. M., Setzer, R. W., and Kavlock, R. J. (2006). The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicological Sciences, 95(1), 5–12.
Dobson, P. D. and Doig, A. J. (2003). Distinguishing enzyme structures from non-enzymes without alignments. Journal of Molecular Biology, 330(4), 771–783.
Dombrowski, R. J., Hyduke, D. R., and Lastoskie, C. M. (2000). Pore size analysis of activated carbons from argon and nitrogen porosimetry using density functional theory. Langmuir, 16(11), 5041–5050.
Eickenberg, M., Exarchakis, G., Hirn, M., and Mallat, S. (2017). Solid harmonic wavelet scattering: Predicting quantum molecular energy from invariant descriptors of 3D electronic densities. In Advances in Neural Information Processing Systems 30 (NIPS 2017), pages 6540–6549.
Eickenberg, M., Exarchakis, G., Hirn, M., Mallat, S., and Thiry, L. (2018). Solid harmonic wavelet scattering for predictions of molecule properties. Journal of Chemical Physics, 148, 241732.
Fagervold, S. K., Chai, Y., Davis, J. W., Wilken, M., Cornelissen, G., and Ghosh, U. (2010). Bioaccumulation of polychlorinated dibenzo-p-dioxins/dibenzofurans in E. fetida from floodplain soils and the effect of activated carbon amendment.
Environmental Science & Technology, 44(14), 5546–5552. PMID: 20560599.
Fu, Z., Wang, Y., Chen, J., Wang, Z., and Wang, X. (2016). How PBDEs are transformed into dihydroxylated and dioxin metabolites catalyzed by the active center of cytochrome P450s: A DFT study. Environmental Science & Technology, 50(15), 8155–8163. PMID: 27363260.
Gama, F., Ribeiro, A., and Bruna, J. (2019). Diffusion scattering transforms on graphs. In International Conference on Learning Representations. arXiv:1806.08829.
Gärtner, T., Flach, P., and Wrobel, S. (2003). On graph kernels: Hardness results and efficient alternatives. In Learning Theory and Kernel Machines, pages 129–143. Springer.
Gatys, L., Ecker, A. S., and Bethge, M. (2015). Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28, pages 262–270.
Ghosh, U., Gillette, J. S., Luthy, R. G., and Zare, R. N. (2000). Microscale location, characterization, and association of polycyclic aromatic hydrocarbons on harbor sediment particles. Environmental Science & Technology, 34(9), 1729–1736.
Ghosh, U., Luthy, R. G., Cornelissen, G., Werner, D., and Menzie, C. A. (2011). In-situ sorbent amendments: A new direction in contaminated sediment management. Environmental Science & Technology, 45(4), 1163–1168. PMID: 21247210.
Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., and Dahl, G. E. (2017). Neural message passing for quantum chemistry. In D. Precup and Y. W. Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272, International Convention Centre, Sydney, Australia. PMLR.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
Hamilton, W. L., Ying, R., and Leskovec, J. (2017). Inductive representation learning on large graphs. In NIPS.
Hammond, D. K., Vandergheynst, P., and Gribonval, R. (2011).
Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30, 129–150.
Hansch, C., Maloney, P. P., Fujita, T., and Muir, R. M. (1962). Correlation of biological activity of phenoxyacetic acids with Hammett substituent constants and partition coefficients. Nature, 194(4824), 178–180.
Haro, M., Cabal, B., Parra, J. B., and Ania, C. O. (2011). On the adsorption kinetics and equilibrium of polyaromatic hydrocarbons from aqueous solution. Adsorption Science & Technology, 29(5), 467–478.
He, H., Galy, J., and Gerard, J.-F. (2005). Molecular simulation of the interlayer structure and the mobility of alkyl chains in HDTMA+/montmorillonite hybrids. The Journal of Physical Chemistry B, 109(27), 13301–13306. PMID: 16852659.
Hilscherova, K., Blankenship, A., Kannan, K., Nie, M., Williams, L. L., Coady, K., Upham, B. L., Trosko, J. E., Bursian, S., and Giesy, J. P. (2003). Oxidative stress in laboratory-incubated double-crested cormorant eggs collected from the Great Lakes. Archives of Environmental Contamination and Toxicology, 45(4), 533–546.
Hirn, M., Mallat, S., and Poilvert, N. (2017). Wavelet scattering regression of quantum chemical energies. Multiscale Modeling and Simulation, 15(2), 827–863. arXiv:1605.04654.
Kaplan, B. L., Crawford, R. B., Kovalova, N., Arencibia, A., Kim, S. S., Pinnavaia, T. J., Boyd, S. A., Teppen, B. J., and Kaminski, N. E. (2011). TCDD adsorbed on silica as a model for TCDD-contaminated soils: Evidence for suppression of humoral immunity in mice. Toxicology, 282(3), 82–87.
Kayala, M. A. and Baldi, P. F. (2011). A machine learning approach to predict chemical reactions. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 747–755. Curran Associates, Inc.
Kipf, T. N. and Welling, M. (2017). Semi-supervised classification with graph convolutional networks.
In International Conference on Learning Representations (ICLR).
Klopmand, G. (1992). Concepts and Applications of Molecular Similarity, by Mark A. Johnson and Gerald M. Maggiora, Eds., John Wiley & Sons, New York, 1990, 393 pp. Price: $65.00. Journal of Computational Chemistry, 13(4), 539–540.
Kondor, R., Son, H. T., Pan, H., Anderson, B., and Trivedi, S. (2018). Covariant compositional networks for learning graphs. arXiv:1801.02144.
Kriege, N. M., Giscard, P.-L., and Wilson, R. (2016). On valid optimal assignment kernels and applications to graph classification. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1623–1631. Curran Associates, Inc.
Kupryianchyk, D., Rakowska, M. I., Roessink, I., Reichman, E. P., Grotenhuis, J. T. C., and Koelmans, A. A. (2013). In situ treatment with activated carbon reduces bioaccumulation in aquatic food chains. Environmental Science & Technology, 47(9), 4563–4571. PMID: 23544454.
Liu, C., Gu, C., Yu, K., Li, H., Teppen, B. J., Johnston, C. T., Boyd, S. A., and Zhou, D. (2015). Integrating structural and thermodynamic mechanisms for sorption of PCBs by montmorillonite. Environmental Science & Technology, 49(5), 2796–2805. PMID: 25629399.
Liu, T., Xue, L., Guo, X., Huang, Y., and Zheng, C. (2016). DFT and experimental study on the mechanism of elemental mercury capture in the presence of HCl on α-Fe2O3(001). Environmental Science & Technology, 50(9), 4863–4868. PMID: 27082260.
Lohmann, R., MacFarlane, J. K., and Gschwend, P. M. (2005).
Importance of black carbon to sorption of native PAHs, PCBs, and PCDDs in Boston and New York harbor sediments. Environmental Science & Technology.
Lostanlen, V. and Mallat, S. (2015). Wavelet scattering on the pitch spiral. In Proceedings of the 18th International Conference on Digital Audio Effects, pages 429–432.
Mallat, S. (2012). Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10), 1331–1398.
McGregor, M. J. and Pallai, P. V. (1997). Clustering of large databases of compounds: using the MDL “keys” as structural descriptors. Journal of Chemical Information and Computer Sciences, 37(3), 443–448.
Meyer, Y. (1993). Wavelets and Operators, volume 1. Cambridge University Press.
Naik, Z. K. and Gandhi, M. R. (2018). A review: Object detection using deep learning. International Journal of Computer Applications, 180(29), 46–48.
Narayanan, A., Chandramohan, M., Venkatesan, R., Chen, L., Liu, Y., and Jaiswal, S. (2017). graph2vec: Learning distributed representations of graphs. arXiv:1707.05005.
Neumann, M., Patricia, N., Garnett, R., and Kersting, K. (2012). Efficient graph kernels by randomization. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 378–393. Springer.
Nguyen, B. T., Lehmann, J., Hockaday, W. C., Joseph, S., and Masiello, C. A. (2010). Temperature sensitivity of black carbon decomposition and oxidation. Environmental Science & Technology, 44(9), 3324–3331. PMID: 20384335.
Nguyen, T. H., Cho, H.-H., Poster, D. L., and Ball, W. P. (2007). Evidence for a pore-filling mechanism in the adsorption of aromatic hydrocarbons to a natural wood char. Environmental Science & Technology, 41(4), 1212–1217. PMID: 17593721.
Niepert, M., Ahmed, M., and Kutzkov, K. (2016). Learning convolutional neural networks for graphs. In International Conference on Machine Learning, pages 2014–2023.
Oquab, M., Bottou, L., Laptev, I., and Sivic, J. (2014).
Learning and transferring mid-level image representations using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1717–1724.
Oyallon, E. and Mallat, S. (2015). Deep roto-translation scattering for object classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). arXiv:1412.8659.
Patmont, C. R., Ghosh, U., LaRosa, P., Menzie, C. A., Luthy, R. G., Greenberg, M. S., Cornelissen, G., Eek, E., Collins, J., Hull, J., Hjartland, T., Glaza, E., Bleiler, J., and Quadrini, J. (2015). In situ sediment treatment using activated carbon: A demonstrated sediment cleanup technology. Integrated Environmental Assessment and Management, 11(2), 195–207.
Perlmutter, M., Wolf, G., and Hirn, M. (2018). Geometric scattering on manifolds. In NeurIPS Workshop on Integration of Deep Learning Theories. arXiv:1812.06968.
Persson, N. J., Gustafsson, Ö., Bucheli, T. D., Ishaq, R., and Broman, D. (2002). Soot-carbon influenced distribution of PCDD/Fs in the marine environment of the Grenlandsfjords, Norway. Environmental Science & Technology, 36(23), 4968–4974. PMID: 12523408.
Pignatello, J. J., Oliveros, E., and MacKay, A. (2006). Advanced oxidation processes for organic contaminant destruction based on the Fenton reaction and related chemistry. Critical Reviews in Environmental Science and Technology, 36(1), 1–84.
Plimpton, S. (1995). Fast parallel algorithms for short-range molecular dynamics. Journal of Computational Physics, 117(1), 1–19.
PyGSP (2018). Graph signal processing in Python (https://pygsp.readthedocs.io/en/stable/index.html).
Rawat, W. and Wang, Z. (2017). Deep convolutional neural networks for image classification: A comprehensive review. Neural Computation, 29(9), 2352–2449. PMID: 28599112.
Rogers, D. and Hahn, M. (2010). Extended-connectivity fingerprints. Journal of Chemical Information and Modeling, 50(5), 742–754. PMID: 20426451.
Rusyn, I.
and Daston, G. P. (2010). Computational toxicology: Realizing the promise of the toxicity testing in the 21st century. Environmental Health Perspectives, 118(8), 1047–1050.
Sedlak, D. L. (2015). Professor Einstein and the quantum mechanics. Environmental Science & Technology, 49(5), 2585–2585. PMID: 25699689.
Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., and Borgwardt, K. (2009). Efficient graphlet kernels for large graph comparison. In D. van Dyk and M. Welling, editors, Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, pages 488–495, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA. PMLR.
Shervashidze, N., Schweitzer, P., Leeuwen, E. J. v., Mehlhorn, K., and Borgwardt, K. M. (2011). Weisfeiler-Lehman graph kernels. Journal of Machine Learning Research, 12(Sep), 2539–2561.
Sifre, L. and Mallat, S. (2012). Combined scattering for rotation invariant texture analysis. In Proceedings of the ESANN 2012 conference.
Sifre, L. and Mallat, S. (2014). Rigid-motion scattering for texture classification. arXiv:1403.1687.
Sinkkonen, S. and Paasivirta, J. (2000). Degradation half-life times of PCDDs, PCDFs and PCBs for environmental fate modeling. Chemosphere, 40(9), 943–949.
Sun, Q., Xie, H.-B., Chen, J., Li, X., Wang, Z., and Sheng, L. (2013). Molecular dynamics simulations on the interactions of low molecular weight natural organic acids with C60. Chemosphere, 92(4), 429–434.
Taheri, A., Gimpel, K., and Berger-Wolf, T. (2018). Learning graph representations with recurrent neural network autoencoders. In KDD Deep Learning Day.
Tice, R. R., Austin, C. P., Kavlock, R. J., and Bucher, J. R. (2013). Improving the human hazard characterization of chemicals: A Tox21 update. Environmental Health Perspectives, 121(7), 756–765.
Tixier, A. J.-P., Nikolentzos, G., Meladianos, P., and Vazirgiannis, M. (2017).
Classifying graphs as images with convolutional neural networks. arXiv:1708.02218.
Toivonen, H., Srinivasan, A., King, R. D., Kramer, S., and Helma, C. (2003). Statistical evaluation of the predictive toxicology challenge 2000–2001. Bioinformatics, 19(10), 1183–1193.
Travis, C. C., Hattemer-Frey, H. A., and Silbergeld, E. (1989). Dioxin, dioxin everywhere. Environmental Science & Technology, 23(9), 1061–1063.
Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph attention networks. In International Conference on Learning Representations.
Verlet, L. (1967). Computer "experiments" on classical fluids. I. Thermodynamical properties of Lennard-Jones molecules. Phys. Rev., 159, 98–103.
Verma, S. and Zhang, Z.-L. (2018). Graph capsule convolutional neural networks. In Joint ICML and IJCAI Workshop on Computational Biology. arXiv:1805.08090.
Wale, N., Watson, I. A., and Karypis, G. (2008). Comparison of descriptor spaces for chemical compound retrieval and classification. Knowledge and Information Systems, 14(3), 347–375.
Wang, Y., Wohlert, J., Bergenstråhle-Wohlert, M., Kochumalayil, J. J., Berglund, L. A., Tu, Y., and Ågren, H. (2015). Molecular adhesion at clay nanocomposite interfaces depends on counterion hydration–molecular dynamics simulation of montmorillonite/xyloglucan. Biomacromolecules, 16(1), 257–265. PMID: 25389796.
Wu, Z., Ramsundar, B., Feinberg, E., Gomes, J., Geniesse, C., Pappu, A. S., Leswing, K., and Pande, V. (2018). MoleculeNet: a benchmark for molecular machine learning. Chem. Sci., 9, 513–530.
Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2019). How powerful are graph neural networks? In International Conference on Learning Representations.
Yanardag, P. and Vishwanathan, S. (2015). Deep graph kernels. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1365–1374. ACM.
Yosinski, J., Clune, J., Bengio, Y., and Lipson, H. (2014).
How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems 27, pages 3320–3328.
Young, T., Hazarika, D., Poria, S., and Cambria, E. (2018). Recent trends in deep learning based natural language processing [review article]. IEEE Computational Intelligence Magazine, 13(3), 55–75.
Zhan, J., Sunkara, B., Le, L., John, V. T., He, J., McPherson, G. L., Piringer, G., and Lu, Y. (2009). Multifunctional colloidal particles for in situ remediation of chlorinated hydrocarbons. Environmental Science & Technology, 43(22), 8616–8621. PMID: 20028061.
Zhan, J., Kolesnichenko, I., Sunkara, B., He, J., McPherson, G. L., Piringer, G., and John, V. T. (2011). Multifunctional iron-carbon nanocomposites through an aerosol-based process for the in situ remediation of chlorinated hydrocarbons. Environmental Science & Technology, 45(5), 1949–1954. PMID: 21299241.
Zhang, M., Cui, Z., Neumann, M., and Chen, Y. (2018). An end-to-end deep learning architecture for graph classification. In AAAI Conference on Artificial Intelligence, pages 4438–4445.
Zou, D. and Lerman, G. (2018). Graph convolutional neural networks via scattering. arXiv:1804.00099.
Zou, J., Ji, B., Feng, X.-Q., and Gao, H. (2006). Self-assembly of single-walled carbon nanotubes into multiwalled carbon nanotubes in water: molecular dynamics simulations. Nano Letters, 6(3), 430–434. PMID: 16522036.