AI ACCELERATED COLLISIONAL CROSS SECTION PREDICTION FOR HIGH THROUGHPUT METABOLITE IDENTIFICATION

By

Kiyoto Aramis Tanemura

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Chemistry - Doctor of Philosophy

2022

ABSTRACT

Metabolomics refers to the collective characterization of small organic molecules in a biological sample. While instrumentation and software continue to improve for metabolomics studies, the fraction of annotated signals in untargeted metabolomics experiments remains small. Translating features to metabolite identities presents a major bottleneck, confounded by the lack of authentic standards with which to build comprehensive experimental databases. I illustrate the development of collisional cross section (CCS) prediction methods through deduction from theory and induction from available data. The theoretical CCS prediction involves multistep modeling of the conformational ensemble followed by simulation of ion mobility. The advanced computational chemistry operations were automated using the AutoGraph conformational clustering protocol and implementation of the workflow in Snakemake. In a complementary approach, I applied a graph convolutional deep Bayesian neural net to predict CCS values and their uncertainties. The quantified uncertainty was used to guide ab initio prediction of CCS values in an active learning strategy. The developed methodologies lay the foundation for a continuously refining in silico CCS library to aid in metabolite annotation.

Copyright by KIYOTO ARAMIS TANEMURA 2022

I dedicate this dissertation to my supportive East Lansing community, colleagues, and friends whose excellence has inspired me to continue learning and improving.

ACKNOWLEDGEMENTS

I thank my advisor, Prof. Kenneth M. Merz Jr., for his mentorship in research. I thank Prof. Guowei Wei for his guidance over my research project at weekly meetings. I thank the colleagues in my research group who have contributed their time and expertise to my research and publications, including Dr. Susanta Das, Dr. Jun Pei, Dr. Madushanka Manathunga, Diego Sierra Costa, Akhil Shajan, Laleh Dinpazhouh, Mithony Keng, and Zhen Li. I thank current and former members of the research group who have also been integral to my scholarly development, including Dr. Ali Rahnamoun, Dr. Gaurav Sharma, Dr. Lin Song, Hongni Jin, Anthony Seitz, Dr. Chi Jin, Dr. Arka Sengupta, Dr. Quan Jiang, Dr. Zhuoqin Yu, Majid Jafari, and Andrew Candia. I also thank Prof. Heedeok Hong and Prof. Angela Wilson for evaluating my research as guidance committee members. I would like to thank the staff members of the Institute for Cyber-Enabled Research (ICER) for their outstanding efforts in maintaining the High Performance Computing Cluster (HPCC) and addressing my queries and requests.

TABLE OF CONTENTS

Chapter 1: Introduction
1.1 Metabolomics
1.2 Spectral Prediction of Metabolites
1.3 Ion Mobility Spectrometry
1.4 Machine Learning
1.5 Overview

Chapter 2: Refinement of pairwise potentials via logistic regression to score protein-protein interactions
2.1 Introduction
2.2 Methods
2.3 Results and discussion
2.4 Conclusions

Chapter 3: AutoGraph: Autonomous Graph-Based Clustering of Small-Molecule Conformations
3.1 Introduction
3.2 Methods
3.3 Results and Discussion
3.4 Conclusion

Chapter 4: Rapid and Automated Ab Initio Metabolite Collisional Cross Section Prediction from SMILES Input
4.1 Introduction
4.2 Description of the Implementation
4.3 Description of Workflow
4.4 Snakemake Enabled Workflow Extensions
4.5 Conclusion

Chapter 5: Metabolite Collisional Cross Section and Uncertainty Prediction by Deep Bayesian Graph Convolutional Neural Network
5.1 Introduction
5.2 Methods
5.3 Results and Discussion
5.4 Conclusion

Chapter 6: Thesis Contribution and Future Directions

BIBLIOGRAPHY

APPENDIX

Chapter 1: Introduction

1.1 Metabolomics

The metabolome refers to the comprehensive set of small organic molecules in a biological sample. It encompasses endogenous metabolites produced through cellular transformations and exogenous metabolites consumed from the environment. The presence of endogenous metabolites may be inferred from knowledge of the proteins or genes of the sample. This is not the case for exogenous metabolites, or the exposome. Studying the metabolome provides a description of the phenotype and can yield insight into disease, therapy, and metabolic engineering. [1–7]

Metabolomics stands alongside genomics and proteomics in the scientific paradigm of omics, which relies on holistic characterization of biological samples to draw insight. Owing to the complexity of the biological mixture and the large volume of analytes to detect and quantify, omics studies rely on accurate, precise, and high-throughput analytical and informatic methodologies. Due to the relative standardization of the subunit structures of the macromolecules investigated, genomics and proteomics have realized comprehensive, organismal-level characterization of the genome and proteome. [8, 9] Despite the interest in comprehensive characterization of the metabolome, limitations remain in metabolomics.
Unlike other biomolecules with standard subunits, the number of possible metabolites is bounded only by the number of thermodynamically stable structures, estimated to be 10^33 to 10^60 chemical structures depending on the assumptions applied. [10, 11] The immense chemical diversity also implies a broad range of chemical and physical properties, such as polarity, volatility, and lability. Therefore, any one extraction method or instrument provides coverage over only a biased subset of the metabolome. [12, 13]

The disparity between presently available technology and the goal of comprehensive metabolite annotation may be understood through the Human Metabolome Database (HMDB). [14, 15] The HMDB contains less than 5% of the estimated metabolite space across multiple organisms. Meanwhile, only about 10% of the entries have authentic chemical standards commercially available. [16] While a large volume of metabolites remains to be identified, the majority of these unknown metabolites may evade confident identification using authentic standards because the standards are unavailable.

A particular bottleneck in the untargeted metabolomics workflow is the annotation of the detected features. The conventional untargeted metabolomics workflow is described as follows. Metabolites are extracted from biological samples. The complex mixture is separated by chromatography, then subjected to analytical methods such as mass spectrometry (MS) and nuclear magnetic resonance spectroscopy (NMR). [17] The collection of signals arising from a chemical structure is detected as a feature. The feature table is translated to metabolite identities by matching to databases of metabolites with known spectroscopic properties. The composition of the detected metabolites is then tied to biological insight. [18] While informatic approaches may suffice to annotate metabolites already deposited in public databases (known unknowns), these are often a small fraction of the total signals observed in untargeted metabolomics experiments. The majority of features, which remain unidentified, represent the "metabolic dark matter". [17] Such elusive unknown unknown metabolites require a standard-free methodology for identification using in silico predicted values.

1.2 Spectral Prediction of Metabolites

Standard-free metabolite identification warrants the construction of a library of computationally predicted spectral properties for metabolites. The library must be larger than the available collection of experimental spectra; computational methods must therefore achieve high throughput. Meanwhile, accuracy and precision must approach experimental performance to distinguish many overlapping signals.

In silico prediction can be broadly divided into theoretical and empirical approaches. Theoretical methods employ molecular modeling and simulation to deduce spectral properties from first principles. Because the computed results are supported by physical principles, the performance is not explicitly dependent on coverage of the sampled metabolite space. However, the many computational steps often require expertise and considerable computational resources. In contrast, empirical methods parameterize a mathematical model on available data to infer spectral values. While empirical methods are capable of rapid predictions by delegating the fine details of molecular modeling to statistical interpolation, their performance depends on the coverage, quality, and amount of spectral data available.
Among the challenges of studying the metabolome are throughput and the relative sparsity of currently available data. While theoretical methods may return accurate predictions over sparsely sampled regions of the metabolite space, their throughput is limited by large computational demands. Meanwhile, empirical methods rapidly predict spectral properties but are not guaranteed to generalize when extrapolating to the sparsely sampled regions expected in the metabolite space. Development of in silico libraries for metabolite annotation may depend on informed integration of these disparate approaches to overcome their respective shortcomings.

1.3 Ion Mobility Spectrometry

Among the analytical techniques which offer a high-value target for developing predictive tools for metabolite identification is ion mobility spectrometry (IMS). IMS is an analytical separation technique in which ionized molecules are separated by their mobility through an inert gas in a drift cell. The gas phase mobility of an ion depends on factors including its size, shape, charge, and polarity. [19] The mobility of an ion is its velocity through the drift cell normalized to the applied electric field, which is derived from the arrival time distribution. The mobility is instrument-dependent, so it is instead converted to the collisional cross section (CCS), which can be understood as the rotationally averaged cross sectional area of an ion. The Mason-Schamp equation is used to convert mobility to CCS. [20] The CCS is a physicochemical value specific to each structure, thereby offering a valuable analytical descriptor for improved metabolite annotation. [19] IMS is often coupled to MS in ion mobility mass spectrometry (IMMS) experiments, achieving an additional degree of separation beyond the individual instruments while obtaining independent metrics for metabolite identification in a high throughput fashion. [21, 22]

Theoretical CCS values can be obtained at various degrees of molecular modeling. The trajectory method (TM) models long and short range interactions between the ion and drift gas molecule to numerically integrate the equations of motion. [23] The method is computationally intensive due to the many force evaluations performed over a continuous collision event and a vast integration domain. Larger systems, for which long range potentials are less influential, may be modeled with only the short range potential using methods such as elastic hard sphere scattering (EHSS). [24] While the error in CCS introduced by omission of the long range potential has been reported to be less than 10% for select systems, the introduced uncertainty is not amenable to confident identification of metabolites from a mixture. The projection approximation (PA) method further simplifies CCS prediction by calculating the rotationally averaged geometric cross section. PA can yield CCS values with a high degree of agreement for proteins. [25] However, a sizeable error can be introduced in cases for which surface concavity affects mobility.

The high accuracy achieved by the TM is therefore attractive in the context of metabolite annotation. TM is available in modern, open source implementations through software such as High Performance Collision Cross Section Calculation (HPCCS). [26] While agreement with experimental measurements is observed for TM predicted CCS values, this involves sizable computational effort for modeling the conformational ensemble expected during the trajectory down the drift cell.
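For reference, the Mason-Schamp equation mentioned above relates the measured mobility to the CCS. A commonly quoted form (symbol conventions vary across the literature) is

\Omega = \frac{3ze}{16N} \left( \frac{2\pi}{\mu k_B T} \right)^{1/2} \frac{1}{K}

in which \Omega is the CCS, K the measured mobility, z the ion charge state, e the elementary charge, N the drift gas number density, \mu the reduced mass of the ion-neutral pair, k_B the Boltzmann constant, and T the drift gas temperature.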
The arrival time is on the scale of milliseconds, thus flexible metabolites are expected to explore all thermodynamically accessible conformers, with each conformer contributing to the final CCS value. Thorough identification of local minima in the potential energy surface (PES) requires computation-intensive geometry optimization of many conformers. ISiCLE is a major ab initio CCS prediction software which encapsulates the entire CCS prediction workflow, including ensemble modeling from two dimensional chemical structures. [27]

To circumvent the complex workflow of ensemble modeling and CCS prediction by TM, machine learning algorithms have been extensively developed for CCS prediction. Simple regression models employ the support vector machine (SVM) algorithm. AllCCS achieves a high degree of agreement with experimental measurements using an SVM with only up to fifteen selected features. [28] The model associated with CCSBase is an SVM using the 42 simple features of the molecular quantum number (MQN) descriptor. [29, 30] Deep learning algorithms have been deployed to directly learn structure/property relationships for CCS prediction from embeddings of molecular structures in the simplified molecular-input line-entry system (SMILES) representation. DeepCCS encodes the one-hot-encoded representation of the SMILES string using a convolutional neural net (CNN). [31] DarkChem utilizes CNNs in the encoder/decoder of a variational autoencoder, thereby enabling candidate structure prediction as well. [32]

Predicted CCS databases have been published for each of these CCS prediction models; however, limitations remain. The models exhibit high accuracy over the sampled metabolite space, but their performance over yet to be annotated regions is unknown. Therefore, the uncertainties of the predicted CCS values are heterogeneous. Because the uncertainty of a predicted value is not quantified, the confident assignment of a CCS value to a metabolite by prediction is limited. Furthermore, SMILES is only one machine readable representation of molecular structure. Interpreting the grammar of SMILES is a nontrivial task for deep learning, and more intuitive structural representations can be employed.

1.4 Machine Learning

Artificial intelligence (AI) is a field which aims to implement intelligent behavior in artificial systems by imitating natural intelligence. Machine learning (ML) is a branch of AI which uses data and algorithms to parameterize mathematical models to accomplish tasks and behaviors not explicitly encoded. The delegation of finding solutions to various tasks to the parameterization is anthropomorphized as learning by the machine. ML dates back to 1959, when the term was coined by Arthur Samuel of IBM. [33]

While ML as a field is not new, various technological advancements have enabled the widespread application of ML in what can be regarded as the Fourth Paradigm of Science. [34] The historic evolution of scientific paradigms, from empiricism, to analysis, to simulation, developed theoretical understanding of underlying mechanics to make predictions. The Fourth Paradigm of Data-Intensive Scientific Discovery instead seeks to uncover underlying rules and theories from large volumes of data. The reformulation of scientific discovery enabled by ML is therefore constrained not by domain knowledge of the subject under investigation, but rather by access to data and computational resources. The disparate algorithms in ML are related to one another by the general tasks they accomplish.
Simple, structured datasets are matrices of values, with each row corresponding to one sample (entry, instance, point, etc.) and each column to a feature (independent variable). Supervised learning algorithms operate over datasets with labels to predict, thereby producing models for quantitative prediction such as regression or classification. Unsupervised learning algorithms draw trends and generalizations from the distribution of unlabeled datasets. Other categories exist, such as semi-supervised learning or active learning tasks. The performance of these algorithms depends on the features extracted from each sample; therefore, traditional ML algorithms are not fully delineated from domain knowledge.

Artificial neural nets (ANN) free ML approaches from the dependency on domain knowledge imposed by feature engineering. ANN are composed of interconnected units (artificial neurons, perceptrons), which take the weighted sum of input features, pass the value to an activation function such as the logistic function, and output a value over the range of the activation function ((0.0, 1.0) for the logistic function). The single layer perceptron (SLP) model organizes N ∈ ℕ units in parallel to create a single layer. The weighted sum of the outputs of the individual units is the prediction returned by the model. The Universal Approximation Theorem states that any continuous function over an interval can be approximated by parameterizing an SLP model. [35] Therefore, given predictive features and enough data, an SLP can approximate any continuous function which maps the features to the label of interest. Identifying predictive features from raw data is a subject of domain knowledge. However, the procedure of embedding raw data into a feature vector is itself a function, which can be approximated using the SLP model. In this manner, stacking layers in an ANN can automate feature extraction from raw data. Stacking several standardized layers in an ANN yields the multilayer perceptron (MLP) model, which can compute increasingly abstract, hierarchical features from the raw data. This strategy of learning data representations through ANN of many layers is known as deep learning (DL). In chemistry, molecular structures determine all chemical and physical properties. Extracting predictive features by DL is therefore a general solution for many chemical problems.

Encoding chemical structures in machine readable form for DL determines the model architecture. Vector representation by molecular fingerprints provides a canonical representation in ML suited to an MLP model. Fingerprint representations depend on an algorithm defined a priori, and thereby are not guaranteed to capture the substructures most predictive for the task and data. The characters of the SMILES representation of molecules can be embedded as inputs to CNN or recurrent neural nets (RNN). Learning the representation of SMILES may be complicated because several correct SMILES representations exist for one chemical structure, and the model must learn the grammar of SMILES alongside extracting predictive features. The most practical means of communicating chemical structures among chemists is the graph representation using skeletal formulas. Even images of skeletal formulas have been used as inputs to CNN for property prediction. The graph convolutional net (GCN) has emerged as a prominent model for learning molecular graph representations; a minimal sketch of the layer operation follows.
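To make the layer operation concrete before the formal description that follows, here is a minimal sketch of one graph-convolution update in plain NumPy. The mean-over-neighbors normalization is one common choice among several, and all array names are illustrative rather than the interface of any specific GCN library.

```python
import numpy as np

def gcn_layer(adjacency, features, weights):
    """One graph-convolution layer: each atom aggregates the features of its
    bonded neighbors (plus itself, via self-loops), then applies a learned
    linear map followed by a ReLU nonlinearity."""
    n = adjacency.shape[0]
    a_hat = adjacency + np.eye(n)        # add self-loops
    degree = a_hat.sum(axis=1)           # node degrees
    a_norm = a_hat / degree[:, None]     # row-normalize: mean over neighbors
    return np.maximum(0.0, a_norm @ features @ weights)

# Toy molecule: a three-heavy-atom chain (e.g. C-C-O) with 4 features per atom.
adjacency = np.array([[0., 1., 0.],
                      [1., 0., 1.],
                      [0., 1., 0.]])
features = np.random.rand(3, 4)          # hypothetical initial atom features
weights = np.random.rand(4, 8)           # learned parameters during training
hidden = gcn_layer(adjacency, features, weights)
print(hidden.shape)                      # (3, 8): an 8-dim feature per atom
```

Stacking such layers lets each atom's feature vector incorporate progressively larger bonded neighborhoods, mirroring the hierarchical feature extraction described above for MLP models.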
GCN are generalizations of CNN, performing convolution over sets of non-standardized neighbors encoded in a bond adjacency matrix. The coherence of the graph representation, both with the chemical intuition of domain experts and as an exact input format for GCN, renders the approach particularly attractive for learning molecular structures.

1.5 Overview

I highlight applications of ML algorithms to chemical problems to increase accuracy, automation, and throughput. A logistic regression classifier based scoring function is shown to detect identifiers of protein-protein interaction hot spots. I developed a conformational clustering algorithm, AutoGraph, to remove hyperparameter dependencies from modeling conformational ensembles. AutoGraph was integrated into an ab initio CCS prediction workflow, which accelerates CCS prediction via incremental refinement of structures using molecular mechanics, deep learning, and quantum mechanics. Finally, a graph convolutional deep Bayesian neural net was used to quantify the uncertainty of predicted CCS values, thereby identifying high value metabolites for annotation by the ab initio workflow toward the production of an in silico CCS library. I conclude by describing future work to implement a continuously refining CCS library for improving metabolite annotation.

Chapter 2: Refinement of pairwise potentials via logistic regression to score protein-protein interactions

Kiyoto Aramis Tanemura, Jun Pei, Kenneth M. Merz Jr.

2.1 Introduction

Protein-protein interactions (PPI) are present in the underlying mechanisms of virtually all biochemical processes. In terms of drug discovery, they present an alternative drug target to the traditional small binding pockets of enzyme active sites. [36, 37] For instance, the 90 kDa heat shock protein Hsp90 has long been pursued as a possible therapeutic target in cancer research. [38] While five molecular scaffolds have been investigated as competitive inhibitors of the N-terminal substrate-binding domain, Hsp90 modulation by targeting the structurally relevant C-terminal domain (CTD) was underexplored. [39] Exploitation of the conformational dynamics natively induced by the binding of a client protein near the CTD led to the design of allosteric activators of Hsp90 ATPase activity. [40] Following this principle, the design of PPI inhibitors that mimic the client protein interface has diversified the possible modes of Hsp90 inhibition. [41]

Experimental techniques for protein structural characterization such as X-ray crystallography, nuclear magnetic resonance spectroscopy, and cryo-electron microscopy have made available high-resolution data for proteins, including those of protein-protein complexes. [42] Protein-protein docking prediction as a computational method complements these experimental techniques where experimental approaches do not suffice. Development of protein-protein docking methods advances the understanding of the mechanisms of biologically important functions, as well as the exploitation of the underlying PPI as targets for therapeutic agents.

Ab initio protein-protein docking is generally separated into two phases due to the complexity of the problem. These phases are the sampling of docking poses followed by their evaluation by means of a scoring function. [43] For the sampling phase, the protein subunits may be treated as rigid bodies, as is the case for the docking algorithms ZDOCK, FTDOCK, and GRAMM.
[44–46] Backbone flexibility can be modeled during the docking phase through normal mode analysis with modest computational cost, as exemplified by the flexible docking algorithms ATTRACT and SwarmDock. [47, 48] Soft surface or pseudoatomic representations are typically employed in both rigid-body and flexible docking protocols to smooth the potential energy surface and allow faster convergence to energy minima. Sidechain flexibility is commonly modeled in the refinement stage after sampling, as is the case in iATTRACT. [49]

Due to the copiousness of the predictions generated during the sampling phase, the scoring function must achieve high computational efficiency and must accurately assign low energy structures to low ranks in the scoring process. Scoring functions belong to the broad categories of physics-based, knowledge-based, and machine learning (ML)-based. Physics-based scoring functions are widely used and include those used for ZRANK, ATTRACT, FASTCONTACT, FireDock, GalaxyTongDock, HawkRank, HADDOCK, and ClusPro. [50–57] Energy terms commonly include van der Waals, electrostatic, and desolvation potentials. Knowledge-based methods instead tend to apply Boltzmann inversion to the frequency of observed interatomic/interresidue distances to approximate the relative energies of PPI docking predictions. This class of scoring function includes InterEvScore, SPIDER, and dDFIRE. [58–61]

PPI interfaces have been described as small patches of nonpolar and charged residues, with one residue pair near the center contributing the majority of the energetic stabilization upon binding. [62] These energetic hot spots, as structural signatures, have been used to detect PPI interfaces. In the scoring function Matrix of Low Coupling Energies (MLCE), nonbonding energies are calculated between each set of pairwise residues in the interacting protein pair, and likely PPI interfaces are identified by eigendecomposition. [63] Conservation and coevolution of residue pairs have been applied as a statistical approach to identify PPI hot spots in scoring functions including Evolutionary Trace (EVT) and InterEvScore. [58, 64] A recent comparison of MLCE and EVT revealed that the relative performance of each scoring function depended on the biological function of the protein. [65] This highlights the observation that the development of a general PPI scoring function depends on devising an informed strategy to handle terms belonging to disparate scoring functions.

ML-based models provide just such a solution for refining or integrating terms belonging to a variety of scoring functions. Compared with classical scoring functions, ML-based methods have the advantage that they do not require prior assumptions about the relationship between the structural data and protein-protein complex stability. This enables integrated processing of input data, as exemplified by the PPI scoring functions ProQDock and iScore. [66, 67] The rich set of supervised and unsupervised algorithms available in ML has been applied to rationalize feature selection recursively to distinguish native-like ensembles of binding modes from decoys. [68] Further, the deep three-dimensional convolutional neural net DOVE was trained on a PPI decoy set to automate the feature extraction process. [69] Meanwhile, ML-based models are frequently criticized as black-box alternatives to well-defined scoring functions. This highlights the need to evaluate not only the performance of an ML model but also the trends and insights it deduces from the data.
The random forest (RF) refinement methodology for native detection among decoys has been applied to protein-folding and protein-ligand decoy detection. [70, 71] In short, a dataset consisting of a native conformer and many decoy structures is featurized using conventional pairwise potentials such as the Assisted Model Building with Energy Refinement (AMBER) force field and the Knowledge-Based and Empirical Combined Scoring Algorithm (KECSA2) pairwise potentials. [72, 73] The extreme skew in the representation of decoy and native structures in the dataset is mitigated by comparing the descriptors between the native and decoy structures. The balanced dataset is then suitable to train an RF model for binary classification. This methodology outperformed conventional programs for decoy detection in protein folding and protein-ligand binding. The methodology is further explored here in the realm of scoring functions for PPI prediction.

2.2 Methods

2.2.1 Decoy Set

The Critical Assessment of Scoring Function (CASF) - PPI decoy set was employed in this study, which consists of 273 systems with 2000 decoys each. [74] The decoys were generated previously by rigid-body docking using the FTDock software. [45] The CASF-PPI decoy set was more suitable than other decoy sets to train the ML-based scoring function because it removed artifacts and complications due to the docking protocol by using subunits of the bound PPI crystal structure as inputs to rigid-body docking. Therefore each PPI system consisted of one high quality native structure and a decoy set consistent in all respects but the binding mode.

2.2.2 Featurization of PPI Complexes

Let any PPI complex be described as an n-body system, and let all independent pairwise probabilities be known. The overall probability of the PPI complex is described as

p_n = \prod_{i,j=1;\, i \neq j}^{n} c_{ij} \times p_{ij} \qquad (2.1)

in which p_{ij} is the independent probability of particle pair i and j, and c_{ij} is its empirical scaling constant. As a PPI complex, the overall probability can further be decomposed into bond, angle, torsion, and nonbonding interactions as follows:

p_{complex} = \left( \prod_{bond} c_{ij} \times p_{ij} \right) \left( \prod_{angle} c_{kl} \times p_{kl} \right) \left( \prod_{torsion} c_{mn} \times p_{mn} \right) \left( \prod_{nonbond} c_{pq} \times p_{pq} \right) \qquad (2.2)

in which c_{\alpha\beta} corresponds to the scaling constant and p_{\alpha\beta} to the pairwise probability of the bond (ij), angle (kl), torsion (mn), and nonbonding (pq) interactions. The present work deals with decoy structures generated by rigid-body docking, thus the bond, angle, and torsional probabilities are constant between a native structure and its decoys. Equation 2.2 is rewritten

p_{complex} = C \times \prod_{nonbond} c_{pq} \times p_{pq} \qquad (2.3)

for some constant C. Taking the natural logarithm yields

\ln p_{complex} = \ln C + \sum_{nonbond} \ln ( c_{pq} \times p_{pq} ). \qquad (2.4)

Pairwise nonbonding interaction probabilities were obtained using the KECSA2 pairwise potentials. [72] The potential of the nonbonding interaction between atoms A and B at distance r_i was described using the following Lennard-Jones type equation:

E_{AB}(r_i) = \epsilon_1 \left( \frac{\sigma}{r_i} \right)^{\alpha} - \epsilon_2 \left( \frac{\sigma}{r_i} \right)^{\beta} \qquad (2.5)

in which the parameters \epsilon_1, \epsilon_2, \sigma, \alpha, and \beta were obtained from the KECSA2 database. The potentials were then translated to relative probabilities by the Boltzmann distribution. The constant C and scaling factors c_{pq} cancel upon generation of the comparison descriptors. Thus each structure was represented by a vector of 14028 features, where each element represented a specific nonbonding interaction.
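A minimal sketch of this featurization, under stated assumptions: the parameters below are placeholders rather than actual KECSA2 values, and the Boltzmann conversion is written for a single interaction type; additive constants drop out later when the comparison descriptors are formed.

```python
import numpy as np

K_B_T = 0.593  # kcal/mol at roughly 298 K; an assumption for illustration

def lj_type_energy(r, eps1, eps2, sigma, alpha, beta):
    """Lennard-Jones type pairwise potential of Eq. 2.5."""
    return eps1 * (sigma / r) ** alpha - eps2 * (sigma / r) ** beta

def log_pair_probability(r, params):
    """Relative log-probability of one nonbonding contact via a Boltzmann
    factor; normalization constants cancel in the comparison descriptors."""
    return -lj_type_energy(r, *params) / K_B_T

# Hypothetical parameters for one atom-type pair; distances in Angstrom.
params = (0.25, 0.50, 3.4, 12, 6)
contact_distances = np.array([3.1, 3.8, 4.5])
# One element of the 14028-long feature vector: the summed log-probability
# over all observed contacts of this specific nonbonding interaction type.
feature_value = log_pair_probability(contact_distances, params).sum()
print(feature_value)
```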
2.2.3 Generation of Comparison Descriptors

The skewed representation of native and decoy PPI complexes is balanced using a comparison method previously applied to protein tertiary structure and protein-ligand decoy sets. [70, 71] The native structure was assumed to be more stable than the decoy; thus subtracting the logarithmic probability vector of the decoy from that of the native structure results in a more positive vector, and vice versa. By performing this subtraction, a balanced comparison dataset is generated, consisting of 4000 descriptors per system. A target label of '0' was appended to descriptors generated as decoy minus native, and '1' to comparison descriptors of the other direction. The comparison descriptors were used as a balanced dataset to train the LR model.

2.2.4 Training and Evaluation of the LR Classifier

The LR classifier model (sklearn.linear_model.LogisticRegression) was trained on various fractions of the shuffled dataset. [75] Default values were used for the hyperparameters. Training and validation sets were standardized to zero mean and unit variance. Each model was subjected to five-fold cross validation to obtain training and validation accuracies. Several replicates of the operation were performed to ensure sufficient coverage of the decoy set partitioned into the training set. Test accuracies were computed from the models refit on the training/validation sets. Accuracy is defined as

\mathrm{accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \qquad (2.6)

for the counts of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions. Accuracy was amenable for evaluating the performance of the LR classifier because the representation of the target labels was balanced.

2.2.5 Scoring and Ranking of Decoy Sets via LR Classifier

The PPI complexes for each system were ranked as follows. Let x_i be the descriptor of structure i. Comparison descriptors were generated by x_i − x_j for all j ≠ i, thereby generating 2000 comparison descriptors. These comparison descriptors were classified by the LR classifier as either '0' or '1'. Finally, these labels were summed to provide a score for x_i, in which a greater value was predicted to indicate the more native-like structure. Once all structures were scored, the structures were ranked by their score in descending order.

2.2.6 Evaluation of Ranks Assigned by Scoring Function

Metrics used to evaluate the scoring functions include success rate (SR), modified success rate (Y), native ranking, first root mean square deviation (RMSD), first decoy RMSD, and the Spearman correlation coefficient of ligand RMSD and rank. CAPRI criteria were used to define near-native structures. [76] Performances of the LR scoring function were compared to the scoring functions ATTRACT, dDFIRE, FASTCONTACT, and ZRANK. [50, 52, 60, 77] The scores for each scoring function were previously calculated and included with the CASF-PPI decoy set. [74]

Success Rate

SR is the probability of finding a near native structure in the top N predictions. Let X = {X_1, X_2, ..., X_n} be the set of all PPI systems in the dataset. Let h(X_i, N) be the number of near native structures in the top N predictions of system X_i. Then we write the success rate as

SR(N) = \frac{1}{n} \sum_{i=1}^{n} H(X_i, N), \qquad H(X_i, N) = \begin{cases} 1 & \text{if } h(X_i, N) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.7)

Modified Success Rate

Y takes into account the fraction of near-native structures identified in the top N predictions.
[78] Additionally, it assigns a higher score to lower rankings of near-native structures. It is defined as

Y = \frac{1}{n} \sum_{i=1}^{n} F(X_i), \qquad F(X_i) = \frac{\sum_{j=1}^{N} (1 + top_j)}{h(X_i, |X_i|)}, \qquad top_j = rank_j^{-1} \qquad (2.8)

Spearman Correlation Coefficient of Ligand RMSD and Rank

The ideal funnel shape between ligand RMSD and the rank assigned by a scoring function is quantified using the Spearman correlation coefficient (ρ). A more positive ρ is ideal, as it indicates that a lower rank is associated with a lower ligand RMSD and vice versa.

Fold Enrichment of Near-Native Predictions in Top Ranked Structures

We measure the representation of various qualities of predictions in the top ranked structures using the mean of the quotient of the observed near-native predictions divided by the expectation E, and refer to it as the fold enrichment FE. The expectation is the probability of arbitrarily choosing a near-native structure by sampling exactly one structure from all predictions of a given PPI system. We define the fold enrichment as

FE(X) = \frac{1}{n} \sum_{i=1}^{n} \frac{h(X_i, 1)}{E(X_i)}, \qquad E(X_i) = \frac{h(X_i, |X_i|)}{|X_i|} \qquad (2.9)

2.2.7 Analysis and Selection of Feature Coefficients

Median values of the coefficients assigned to each feature were obtained from LR classifiers trained on 0.99 of the data. The features consist of pairwise nonbonding interactions of the heavy atoms of residues. The residues were assigned to the following subjective categories: anionic, cationic, polar, nonpolar, aromatic, flexible, and small (Table A1). The interactions of the residues were labeled by their categories (e.g. cationic-anionic). Distributions of the types of interactions were assessed. Features with median coefficients of greater magnitude were considered more salient. Subsets of features were taken as fractions of the top salient features and were used to train LR classifiers. The performance of the LR classifiers with reduced dimensionality was assessed via the aforementioned metrics.

2.2.8 Flexible Docking of Weng Benchmark 5.0 by ATTRACT

An independent decoy set was generated as a benchmark to assess the performance of the various scoring functions. Unbound subunits of the Weng Benchmark 5.0 (BM5) were subjected to ATTRACT flexible docking with iATTRACT interface refinement. [47, 49, 79] A Bash script for performing the docking was obtained from the ATTRACT web interface. [80] Because the weights of the terms in the ZRANK scoring function were fitted on structures in the Weng Benchmark 1.0, structures from Benchmark 1.0 were removed for fair comparison between scoring functions. [50, 81] Likewise, systems in BM5 were omitted if either one of the subunits exhibited a sequence identity of greater than 30% to any protein subunit in the CASF-PPI data set, resulting in a maximum sequence identity of 30% between protein subunits in BM5 and CASF-PPI. One thousand structures were generated using five normal modes. The complexes with the following PDB IDs were not considered further because they required repair of missing atoms in the input files to perform docking: 1F51, 1F6M, 1RLB, 1SYX, 2CFH, 4FZA, and 4GAM. Out of the 69 systems considered, 40 contained at least one acceptable structure.

2.3 Results and discussion

2.3.1 LR Classifier Achieved High Accuracy on CASF-PPI Test Set

A learning curve was plotted for the LR classifier trained on various fractions of the decoy set (Figure 2.1). As the fraction of data used as the training set was increased to 0.99, the validation and test scores approached an accuracy of 0.99.
The small differences between training accuracy and validation or test accuracy illustrate that the proficient model performance generalized to the remainder of the CASF-PPI decoy set. The narrow range of each accuracy indicated that the model performance was stable. Near optimal performance of the LR classifier on the decoy set was observed with a training set fraction of 0.7 or greater.

Figure 2.1: The distributions of training (green), validation (blue), and test (red) accuracies of the LR classifiers trained on various fractions of the decoy set. The baseline is the accuracy achieved by a linear, unweighted sum of KECSA2 potentials with no LR refinement. Training and validation accuracies were obtained by taking the mean of the five-fold cross validation results. Test accuracy was calculated from LR classifiers refit on the training and validation set.

2.3.2 Analysis and Selection of Salient Features

The input data were standardized to zero mean and unit variance; thus the magnitude and sign of the coefficients provide a measure of how salient the features are in detecting native structures. The coefficients plotted in decreasing order displayed a logit shape, with the magnitudes rapidly decreasing toward the center of the distribution (Figure 2.2). This suggests there was a relatively small subset of highly salient features, while the majority of features contributed moderately to the classification task. The range of coefficients for each feature was narrow, indicating the similarity of the coefficients between models.

Figure 2.2: (top) Median coefficients for each feature in descending order. The curve displayed a logit shape, with the magnitude of the coefficients rapidly decreasing toward the center of the distribution. The maximum and minimum of each coefficient are plotted as a gray ribbon. (bottom) Density of various types of interactions applied to the coefficients ordered in decreasing order.

To generalize the types of interactions contributing to the classification, the interacting residues were categorized into broad classes and the types of interacting residues were recorded. The density plot displays the representation of a given type of interaction over the features ordered by their coefficient values (Figure 2.2). Qualitatively, interactions between charged residues displayed the highest representation at the ends of the distribution, representing coefficients of greater magnitude and corresponding to features with greater effect on the classification. This is consistent with ionic interactions being the strong, specific, and dynamic nonbonding interactions characteristic of PPI. [82] As expected, opposing charges were favored, having a greater density on the side of positive coefficients, while like charges had high representation among negative coefficients. Aromatic-small residue interactions were also notable on the positive end of the coefficients.

The shape complementarity achieved by hydrophobic residues of opposing sizes has been attributed as a key factor affecting PPI. [83] The relatively high representation may signify the importance of shape complementarity of hydrophobic residues between interacting proteins. This distribution contrasts with the fairly even distribution of nonpolar-nonpolar interactions, further emphasizing the importance of shape complementarity. Furthermore, the present LR classifiers represented polar-polar residue interactions with negative coefficients.
A greater change in potential energy upon the exclusion of water molecules from the PPI interface upon binding would be expected if favorable interactions between solvent and polar residues were absent. The higher density of polar-polar residue interactions among negative coefficients suggests polar residues are underrepresented in native PPI interfaces. Thus, the coefficients may implicitly suggest that the desolvation energy is a contributor to the detection of native PPI complexes. In summary, the LR classifier appeared to prioritize charge and geometric complementarity while disfavoring polar-polar interactions. A more complete plot is available in the Appendix (Figure A1).

Dimensionality reduction was pursued by training logistic regression classifiers with the features associated with coefficients of greater magnitude. Specifically, the features were ordered by the magnitude of their median coefficient, then various fractions of the top salient features were used as input data. The performance of models trained on 0.9 of the data set is reported (Table 2.1). Test accuracy and native ranks appeared to peak when around 0.1 of the features were selected. Meanwhile, the first decoy RMSD was constant across the fractions. The improvement in performance may be due to reduced noise arising from superfluous features. Refer to Figures A2-A4 for the full performance metrics.

Table 2.1: Performance metrics of the LR classifier trained on various fractions of the top salient features. For each quantity, the mean, 25th percentile (lower), and 75th percentile (upper) are reported. Four digits are reported for all values except the lower/upper quantiles of native rank, which are natural numbers.

fraction | test accuracy (mean / lower / upper) | native rank (mean / lower / upper) | first decoy RMSD, Å (mean / lower / upper)
1.00 | 0.9969 / 0.9968 / 0.9984 | 7.216 / 1 / 3 | 22.33 / 12.63 / 30.88
0.50 | 0.9993 / 0.9990 / 0.9997 | 2.344 / 1 / 1 | 23.68 / 13.03 / 31.92
0.40 | 0.9990 / 0.9992 / 0.9999 | 3.020 / 1 / 1 | 23.96 / 13.11 / 32.46
0.30 | 0.9994 / 0.9995 / 0.9999 | 2.285 / 1 / 1 | 23.96 / 13.80 / 31.92
0.20 | 0.9995 / 0.9997 / 0.9999 | 2.044 / 1 / 1 | 22.63 / 13.10 / 31.08
0.10 | 0.9997 / 0.9996 / 0.9998 | 1.665 / 1 / 1 | 22.34 / 12.91 / 31.10
0.05 | 0.9992 / 0.9990 / 0.9997 | 2.585 / 1 / 2 | 23.03 / 13.23 / 32.99

The LR scoring function trained on the 0.1 top features (consisting of 1403 features in total) was selected for further investigation. By use of a simple ML model coupled with only 0.1 of the most salient features, we arrive at a scoring function which ranks thousands of structures on the order of minutes. These features are summarized in Figure A5. The performance of the new scoring function was then benchmarked.

2.3.3 The LR Scoring Function Was Sensitive towards Native Structures While Less Responsive to Near-Native Structures

The performance of the LR scoring function trained on the 0.1 top features was compared to those of conventional scoring functions (Figure A2). The SR of the LR scoring function was relatively high for the most stringent threshold quality, which considered only the native structure as near-native. There was little improvement in SR as the threshold quality was relaxed down to acceptable predictions. Unlike the other scoring functions, the LR scoring function displayed an early saturation of SR at about N = 10. A plateau in Y accompanies this trend, in which the LR scoring function displayed little improvement above N = 10 for the native and high thresholds.

The sensitivity of the LR classifier to native structures was further exemplified in the distribution of native ranks between the various scoring functions (Table 2.2).
The LR scoring function yielded the lowest mean native rank compared to the other scoring functions. The RMSD of the low ranking structures illustrated a different trend. The mean of the first RMSD for the LR scoring function was low due to its superior native ranking compared to the other scoring functions. Yet the insensitivity of the LR scoring function to near native decoy structures was apparent in the distribution of the first decoy RMSD, in which the LR scoring function displayed the greatest mean compared to the other scoring functions. ZRANK notably exhibited the greatest performance for assigning near native structures to low ranks. Plots of the distributions are available in the SI (Figure A6).

Figure 2.3: Comparison of scoring functions by success rate (top) and modified success rate (bottom) for various thresholds of near-native structures.

Table 2.2: Performance metrics of the LR classifier trained on the 0.1 top salient features in comparison to other scoring functions (SF). For each quantity, the mean, 25th percentile (lower), and 75th percentile (upper) are reported. Four digits are reported for all values except the lower/upper quantiles of native rank, which are natural numbers.

SF | native rank (mean / lower / upper) | first RMSD, Å (mean / lower / upper) | first decoy RMSD, Å (mean / lower / upper)
LR | 1.665 / 1 / 1 | 5.343 / 0.000 / 0.000 | 22.34 / 12.91 / 31.10
ATTRACT | 16.70 / 1 / 3 | 7.256 / 0.000 / 12.98 | 19.90 / 9.952 / 28.89
dDFIRE | 68.06 / 1 / 109 | 12.11 / 0.000 / 24.07 | 15.58 / 1.496 / 27.85
FASTCONTACT | 129.2 / 5 / 116 | 15.34 / 2.064 / 26.43 | 15.90 / 2.744 / 26.58
ZRANK | 13.52 / 1 / 3 | 6.047 / 0.000 / 3.336 | 9.667 / 0.982 / 15.75

A similar comparison was present in the Spearman correlation coefficient between ligand RMSD and rank (Figure 2.4). If only structures with ligand RMSD up to 5 Å were considered, all scoring functions displayed ρ near 0.5. While the other scoring functions still preserved a distribution centered at a positive value when structures up to 10 Å were considered, ρ for the ATTRACT and LR scoring functions returned to a distribution centered at 0.0. All distributions were centered at 0.0 when structures up to 20 Å were considered. The relatively early erosion of correlation for the ATTRACT and LR scoring functions emphasizes that they are less responsive to near-native decoy structures than the other scoring functions. In the context of docking protocols which may generate near native predictions but not necessarily close matches to the native structure, the sensitivity to near native structures exhibited particularly by ZRANK may be more desirable in an accompanying scoring function for rigid-body docking. The superior performance of the LR scoring function in detecting the native structure suggested that its utility would lie in flexible docking predictions, in which the structures are predisposed to be of higher quality than their early stage rigid body counterparts. This led to the independent assessment of the LR scoring function on decoys generated by the ATTRACT flexible docking protocol with iATTRACT interface optimization. [47, 49]

2.3.4 The LR Scoring Function Performed Competitively on ATTRACT PPI Docking Predictions

We confirmed that the performance of the LR scoring function generalizes within the CASF-PPI dataset from its performance on the validation and test sets partitioned from CASF-PPI. The performance was further assessed on ATTRACT flexible docking predictions of PPI complexes in Weng Benchmark 5.0.
[79] The independent dataset serves as an assessment of whether the performance of the LR model was inflated by correlation in docking protocol between the train and test sets, or by biases arising from possible docking artifacts not present in the native structure.

Figure 2.4: Comparison of scoring functions via the Spearman correlation coefficient between ligand RMSD and the rank assigned by the scoring function under various maximum RMSD values.

The SR and modified SR were determined for predictions ranked by the LR scoring function, ATTRACT, and ZRANK (Figure 2.5). ATTRACT and ZRANK were selected for comparison because they performed competitively on the CASF-PPI data set. The LR scoring function performed comparably to ATTRACT and ZRANK, with a moderate lead in the top 100 candidates. The LR scoring function appears competitive with other scoring functions on a set of realistic docking predictions.

Figure 2.5: Comparison of scoring functions via success rate (top) and modified success rate (bottom) using various thresholds for near-native structures.

While metrics of the top predictions (low N) are crucial for the performance of scoring functions, the differences are obscured because the maximum value occurs at saturation (N = 1000). To assess the performance of the top predictions, we calculated the fold enrichment of the number of near-native structures detected in the top N predictions. The fold enrichment at a given quality threshold is the mean quotient of the fraction of observed near-native structures and the probability of arbitrarily choosing at least one near-native structure. The fold enrichment is summarized for each scoring function in Figure 2.6.

Figure 2.6: Fold enrichment in the top scoring structures for various scoring functions (SF) and prediction qualities. Sample sizes by quality: n_medium = 22, n_acceptable = 40, n_incorrect = 69.

For each scoring function, the elevation of medium and acceptable near-native predictions over incorrect structures illustrates the scoring function's sensitivity toward near-native structures. While the representation of near-native structures by LR and ZRANK is generally matched, the LR scoring function shows a greater representation of near-native structures as the top prediction. Combined with the SR, this illustrates that the LR scoring function has a sensitivity to near-native structures which is competitive with ZRANK, coupled with a greater propensity to find the near-native structure as the top prediction.

2.4 Conclusions

The LR scoring function for PPI prediction illustrates that the RF refinement of pairwise potentials extends to protein quaternary structure prediction and performs competitively with conventional scoring functions for the task of native detection among PPI decoys. The salient features were consistent with terms present in various physics-based scoring functions. The utility of the scoring function was highlighted on ATTRACT flexible docking predictions, in which the representation of high quality structures was greater among the top ranked predictions compared to other scoring functions.

Chapter 3: AutoGraph: Autonomous Graph-Based Clustering of Small-Molecule Conformations

Kiyoto Aramis Tanemura, Susanta Das, Kenneth M. Merz Jr.

3.1 Introduction

Accurately modeling the distribution of equilibrium conformers is prerequisite to predicting the microscopic and macroscopic properties of flexible molecules.
Obtaining a conformational ensemble is foundational to calculating average properties of molecular systems. [84] Methodologies such as ensemble docking of protein-ligand systems [85], three dimensional quantitative structure-activity relationships [86], and constructing Markov state models (MSM) from molecular dynamics trajectories [87] rely on sufficient sampling from the conformational ensemble.

Many methods, algorithms, and their variants exist for conformation generation. [88] The objective of conformation generation protocols is to identify many equilibrium conformers at local minima of the potential energy surface (PES), which would be major contributors among all thermally accessible conformations. This may involve sufficient sampling of nonredundant conformations, followed by refinement of those conformations to local energy minima. Sequential methods such as molecular dynamics and Monte Carlo simulated annealing combine sampling and scoring of conformers to return physically informed, low energy conformers, but are generally computationally intensive compared to knowledge based methods. [89, 90] Knowledge based methods such as the OMEGA and ETKDG algorithms narrow the search space by using the distributions of observed dihedral angles and ring structures from crystallographic databases. [91, 92] The rapid conformation generation by such algorithms should be followed by physically informed structure refinement.

Geometry optimization by ab initio calculations converges to low energy conformations; however, the high computational cost limits its applicability to the numerous conformations which need to be sampled. Recent development and benchmarking of machine learning based potentials such as ANI-2x have prompted their utilization for certain high throughput quantum chemical applications. [93, 94] For example, ANI-1ccx potentials were used to accelerate the refinement of generated conformers in a quantum mechanical (QM) NMR spectral prediction workflow. [95, 96] The conformers optimized by these models, however, generally do not converge to the same local minima as ab initio methods. Hence, conformational clustering becomes important in narrowing the number of "full" QM geometry optimization calculations required to obtain a representative set of conformers to estimate the conformational ensemble.

Clustering is a task in unsupervised machine learning in which individual data are coarse grained into disjoint groups. Clustering is used for purposes such as auto-label generation, dimensionality reduction, image segmentation, and visualization of data. To date, numerous clustering algorithms and their variants have been developed and deployed across disciplinary lines. [97, 98] Many conformational clustering algorithms have also been evaluated. [99–103] A majority of these algorithms require the number of clusters or threshold values to be defined a priori, though these hyperparameters vary with the data under evaluation and their choice may be nontrivial. [104] Unless automated, the iterative guess and check of hyperparameters can render the clustering protocol into one requiring supervision, thus limiting its throughput and its integrity against user bias.

Highly automated conformational clustering protocols which do not require the number of clusters or a threshold value to be predefined would be advantageous for applications such as high throughput metabolomics. NMRCLUST is one such algorithm, readily available in its implementation through UCSF Chimera.
[105, 106] NMRCLUST (NC) subjects conformations to the average linkage algorithm for hierarchical clustering. Clusters are merged to minimize a penalty function in order to balance controlling the average spread within clusters against populating the clusters with as many conformers as possible. The algorithm has been assessed for clustering small organic molecules. [100] The Dynamic Tree Cut (DTC) algorithm is an alternative strategy to define clusters on dendrograms using a dynamic threshold value. [107] The DTC algorithm was applied to identify conformational clusters among drug-like molecules. [102] Unlike the aforementioned hierarchical clustering methods, the Representative Conformer K-means (RCK) algorithm employs the popular K-means algorithm to cluster conformers. [102] The protocol automates the determination of the number of clusters K by iterative calculation to find the value of K which maximizes the mean squared distance between clusters.

Here we present an autonomous graph based conformational clustering algorithm named AutoGraph. AutoGraph processes the atomic root-mean-squared deviation (RMSD) matrix between conformers into an affinity matrix using a generic Gaussian kernel. The matrix is processed as a graph object and its nodes are clustered using the Louvain algorithm, which does not require the number of clusters or thresholds to be predefined. [108] We estimate the conformational ensembles for O-succinyl-L-homoserine and nicotinamide adenine dinucleotide as simple examples before exploring the conformational graphs of 200 representative metabolites.

3.2 Methods

3.2.1 Definitions

Let there be n ∈ ℕ conformers of exactly one molecule.

• A graph G = (V, E) consists of a set of vertices V and a set of edges E.
• A distance matrix is an n × n matrix recording the pairwise dissimilarity between all conformers.
• An affinity matrix is an n × n matrix storing the pairwise similarity between conformers.
• A binary adjacency matrix is an n × n matrix with 1 indicating the presence of an edge and 0 otherwise.
• We define a filtered matrix of matrix M to be the element-wise product between M and an adjacency matrix.

3.2.2 The AutoGraph Conformational Clustering Algorithm

The AutoGraph protocol is described in Figure A7. Atomic root mean squared deviations (RMSD) are computed comprehensively between the n structures and stored in a symmetric n × n distance matrix. The Kabsch algorithm is used for finding the minimum pairwise RMSD. [109] An affinity matrix is calculated by applying a generic radial basis function, φ(r) = exp(−r²), to each distance r. A threshold value is applied to remove edges with low valued weights from the affinity matrix. This is performed by a breadth first search on the affinity matrix to find the maximum threshold value that does not produce disjoint components (i.e. given any pair of conformers, there exists a path connecting the two nodes in the graph). An adjacency matrix is produced by applying this threshold to the affinity matrix. Let us refer to this threshold as the adaptive threshold, or simply the threshold, in this manuscript. The resulting filtered affinity matrix encodes an undirected, weighted graph G = (V, E) consisting of vertices V and edges E. Clusters are detected by applying the Louvain algorithm to the filtered affinity matrix. [108] The lowest energy conformer is reported for each cluster as its representative conformer.
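A compact sketch of the protocol just described, assuming a precomputed symmetric RMSD matrix. The Louvain step uses the implementation shipped with recent NetworkX releases as a stand-in for the published AutoGraph code, and the adaptive threshold is found by scanning candidate edge weights rather than by the exact breadth first search of the original implementation.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import louvain_communities

def autograph_like_clusters(rmsd, energies):
    """Cluster conformers: Gaussian affinity from RMSD, largest threshold
    that keeps the graph connected, then Louvain community detection."""
    affinity = np.exp(-rmsd ** 2)             # generic RBF kernel
    np.fill_diagonal(affinity, 0.0)           # no self-edges
    # Scan observed edge weights from high to low; keep the largest
    # threshold that does not split the graph into disjoint components.
    for t in sorted(np.unique(affinity), reverse=True):
        filtered = np.where(affinity >= t, affinity, 0.0)
        graph = nx.from_numpy_array(filtered)
        if nx.is_connected(graph):
            break
    clusters = louvain_communities(graph, weight="weight")
    # Report the lowest energy conformer of each cluster as representative.
    reps = [min(c, key=lambda i: energies[i]) for c in clusters]
    return clusters, reps

# Toy usage: random symmetric "RMSD" values for 20 conformers.
n = 20
d = np.random.rand(n, n) * 3.0
rmsd = (d + d.T) / 2.0
np.fill_diagonal(rmsd, 0.0)
energies = np.random.rand(n)
clusters, reps = autograph_like_clusters(rmsd, energies)
print(len(clusters), reps)
```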
3.2.3 Benchmark Conformational Clustering Algorithms

Among the goals of developing AutoGraph is to automate conformational clustering without introducing system specificity. For this reason, we selected highly automated conformational clustering algorithms which do not assume specific threshold values or numbers of clusters to compare against AutoGraph. NMRCLUST, Representative Conformer K-means, and the Ward hierarchical clustering/Dynamic Tree Cut algorithm were chosen by this criterion. We implemented each algorithm in a sequential, CPU manner using only common Python packages (NumPy, Pandas, SciPy). [110–112] Only general descriptions or modifications are provided. Their original publications should be referenced for full descriptions of each algorithm.

NMRCLUST

We implemented the previously described NMRCLUST algorithm. [105] The distance matrix is obtained in the same manner as for AutoGraph. The average linkage was calculated between all conformers at each step. Clusters having the minimum average linkage were merged. Once no singleton clusters were present, the average spread of clusters was calculated. Each iteration of clustering was evaluated using a penalty function, with the normalized average spread and number of clusters as inputs. The final clustering was determined by the iteration having the minimum penalty value. Unfortunately, the average spread as described in the original publication cannot be calculated if a singleton cluster is present. A few systems examined in this study retained a singleton cluster until the final merge into one cluster containing all conformers, which was then returned as the clustering result. We encourage use of the NMRCLUST algorithm implemented in UCSF Chimera as an alternative. [106]

Representative Conformer K-means

The previously described RCK-means algorithm was implemented as well. [102] Using the RMSD matrix as input, the K-medoid algorithm is iterated, starting from two clusters and increasing the number of clusters by one at each iteration, so that the number of clusters at the ith iteration is Ki = i + 1. Ki-medoid clustering is repeated 100 times, then the result with the minimum mean of the squared distance of the clique within clusters (MSQw) is retained. The mean of the squared distance of the clique between clusters (MSQb) is calculated and used to compute the simple moving average (SMA) of the ith iteration with a window of 10. Once the SMA begins decreasing, the clustering result having the maximum MSQb is returned. We modified the algorithm to use K-medoid rather than K-means so that the algorithm is compatible with an RMSD matrix as input.
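To illustrate how this stopping rule operates, the sketch below pairs a bare-bones K-medoid step with the moving average criterion. It is a simplified stand-in for the published RCK protocol rather than our benchmark implementation: it uses a single random restart instead of 100, approximates MSQb by the mean squared distance between medoids, and the helper names are ours. A precomputed RMSD matrix is assumed.

import numpy as np

def kmedoids(dist, k, n_iter=50, seed=0):
    # One bare-bones K-medoid run on a precomputed distance matrix
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(dist), size=k, replace=False)
    labels = np.argmin(dist[:, medoids], axis=1)
    for _ in range(n_iter):
        new_medoids = medoids.copy()
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members):  # keep the old medoid if a cluster empties
                within = dist[np.ix_(members, members)].sum(axis=1)
                new_medoids[j] = members[np.argmin(within)]
        if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
            break
        medoids = new_medoids
        labels = np.argmin(dist[:, medoids], axis=1)
    return labels, medoids

def rck_like(dist, window=10, k_max=50):
    # Increase K until the simple moving average of the between-cluster
    # spread begins decreasing, then return the max-MSQb clustering
    msqb, results = [], []
    for k in range(2, k_max):
        labels, medoids = kmedoids(dist, k)
        msqb.append((dist[np.ix_(medoids, medoids)] ** 2).mean())  # crude MSQb
        results.append(labels)
        if len(msqb) > window:
            if np.mean(msqb[-window:]) < np.mean(msqb[-window - 1:-1]):
                return results[int(np.argmax(msqb))]
    return results[-1]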
Ward/Dynamic Tree Cut

The previously described DTC algorithm was implemented and modified. [107] A previous publication demonstrated its application to conformational clustering. [102] A Ward dendrogram was obtained from the RMSD matrix, then subjected to the DTC algorithm. [113] The DTC algorithm takes a tree or subtree and applies up to three thresholds. Cuts are considered significant only if the number of previous nodes above the threshold is greater than or equal to a parameter τ ∈ ℕ, for which we chose τ = 5. The protocol is repeated on all subtrees until no new cuts are produced. Each cluster consists of the leaves of the corresponding subtree. Representative structures were chosen as the medoid of each cluster. To simplify the algorithm, we did not perform the initial cut near the root of the tree, thereby beginning the adaptive tree cut protocol with a full dendrogram.

Fixed K Clusters on the Ward Dendrogram

We employed hierarchical clustering by the Ward dendrogram as a negative control for the adaptive clustering strategies. When applied to the 200 metabolite test set, the mean number of clusters across metabolites was determined to be 18.075. Thus we calculated the Ward dendrogram on the RMSD matrix and applied a threshold to produce 18 clusters.

3.2.4 Performance Evaluation

Our test sets included up to 1000 conformers per molecule sampled using the ETKDG algorithm, which were geometry optimized using various methods. Our premise is that different Hamiltonians and methods of minimization give different energy surfaces and geometries, so our test set contains conformers which have been biased toward their respective local minima. Therefore, our goal in clustering these conformers is to identify sets of conformers which would converge to similar geometries and energy values if fully minimized by a high quality QM method. If this is achieved by conformational clustering, we expect the similarity in geometry to reflect similarity in energy values, and the cluster-wise variance in energy to be lower than if the conformers were randomly sampled.

Correlation of Actual Energy to Local Weighted Estimation

AutoGraph clusters metabolite conformers based on the conformational graph generated. To justify clustering by the conformational graph for this test set, we assessed the geometric/energetic correlation in the graphs. We measured the single point energies of all conformers, which yield the actual measured energies. We also compute a local weighted estimate of each node's energy by taking the average of its neighbors' energy values weighted by the incident edge weights. The Spearman correlation between the actual and local estimated energies therefore provides a metric for the geometric/energetic correlation implicit in the conformational graph.

Formally, let the conformational graph G = (V, E) consist of a set of vertices V connected by edges in set E. Let $v_i \in V$ and let its neighbors be $N_i = \{n_j \,|\, \{v_i, n_j\} \in E\} \subseteq V$. Note that $\{v_i, v_i\} \notin E$. Let $U(v)$ be the single point energy of the conformer assigned to vertex $v \in V$. Also, let $w(e)$ be the weight of edge $e \in E$. The actual energy of $v_i$ is $U(v_i)$. The local estimated energy of $v_i$ is given by,

$$\hat{U}(v_i) = \frac{\sum_{j=1}^{p} w(\{v_i, n_j\})\, U(n_j)}{\sum_{j=1}^{p} w(\{v_i, n_j\})}$$

in which $p = |N_i|$. The Spearman correlation coefficient was determined between $U$ and $\hat{U}$, calculated in R. [114]
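In code, the local weighted estimate is one weighted average per node over the filtered affinity matrix. A minimal sketch, assuming the filtered affinity matrix W (zero where an edge was removed) and the single point energies U as NumPy arrays:

import numpy as np
from scipy.stats import spearmanr

def local_energy_estimate(W, U):
    # W: (n, n) filtered affinity matrix; U: length-n single point energies
    W = W.copy()
    np.fill_diagonal(W, 0.0)        # {v_i, v_i} is not an edge
    return (W @ U) / W.sum(axis=1)  # weighted mean over each node's neighbors

# Geometric/energetic correlation implicit in the conformational graph:
# rho, _ = spearmanr(U, local_energy_estimate(W, U))

The division assumes every node retains at least one neighbor, which the adaptive threshold guarantees by construction.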
Cluster-wise Variance of Conformer Energy

Geometric similarity within clusters is achieved trivially by the objective function of the conformational clustering algorithm. We must instead consider the variance in the energies of the clustered conformers to evaluate the methodology. Energy as a metric informs us of the conformers' proximity in the PES and is independent of the clustering protocol, thus providing an independent metric for assessment. Note the AutoGraph algorithm considers conformer energy only after partitioning the conformers into clusters, thus its independence from the assessment method is not compromised.

For a distribution of values X = {xᵢ | i = 1, 2, . . . , N}, the sample variance σ² is given by,

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}, \qquad \bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Because our conformers are organized by clusters, we can decompose the total variance into the variance between clusters and the variance within clusters. Let cⱼ be the size of the jth cluster, for j from 1 to C. Let µⱼ be the mean energy of the jth cluster and µ be the mean energy of all conformers. The sizes of the clusters are heterogeneous, thus we take a weighted mean to compute each term as follows:

$$\sigma^2 = \sigma^2_{\mathrm{between}} + \sigma^2_{\mathrm{within}} = \sum_{j=1}^{C} \frac{c_j}{N} (\mu_j - \mu)^2 + \sum_{j=1}^{C} \frac{c_j}{N} \sigma_j^2$$

Suppose the conformers are clustered with no bias. Then $\sigma^2_{\mathrm{between}}$ is negligible, thus we would expect $\sigma^2 \approx \sum_{j=1}^{C} \frac{c_j}{N} \sigma_j^2$. Conversely, $\sigma^2 \neq \sum_{j=1}^{C} \frac{c_j}{N} \sigma_j^2$ implies a bias in the clustering with respect to energy. Our null hypothesis H₀ is $\sigma^2 = \sum_{j=1}^{C} \frac{c_j}{N} \sigma_j^2$. Our alternative hypothesis H₁ is $\sum_{j=1}^{C} \frac{c_j}{N} \sigma_j^2 < \sigma^2$. We detect a difference in variances using the F-test of equality of variances. [115]

To evaluate the energy variance among the 200 benchmark metabolites, we compute the variance in energy within clusters and the total variance for each of the 200 metabolites. We then apply the paired left-tailed Wilcoxon signed-rank test to determine the statistical significance of the difference between the two distributions of variance values. [116] We impose a relatively stringent significance level α of 0.001 for the following reasons. First, the F-test is biased toward type-I error because $0 \le \sigma^2_{\mathrm{between}}$, thus we consider α₀ = 0.005. Then we account for the inflated false positive rate due to multiple hypothesis testing on the same dataset over five algorithms using the Bonferroni correction. [117] By correcting for the number of null hypothesis significance tests, we obtain the final α of 0.001.
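Both variance terms and the corresponding test statistic are straightforward to compute. A minimal sketch, assuming conformer energies and integer cluster labels as NumPy arrays; the mapping onto an F statistic and its degrees of freedom is an illustrative choice on our part, not a transcription of the published analysis:

import numpy as np
from scipy.stats import f

def variance_decomposition(energies, labels):
    # Split the total energy variance into between- and within-cluster terms
    N = len(energies)
    mu = energies.mean()
    between = within = 0.0
    for j in np.unique(labels):
        member = energies[labels == j]
        c = len(member)
        between += (c / N) * (member.mean() - mu) ** 2
        if c > 1:  # a singleton cluster contributes no within-cluster variance
            within += (c / N) * member.var(ddof=1)
    return between, within

def f_test_within_vs_total(energies, labels):
    # One-sided F-test: is the within-cluster variance less than the total?
    total = energies.var(ddof=1)
    _, within = variance_decomposition(energies, labels)
    N = len(energies)
    F = total / within
    return F, f.sf(F, N - 1, N - 1)  # right-tail p-value of the variance ratio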
Within-Cluster RMSD

To obtain an intuition for the spread of geometries within clusters, we computed the mean RMSD for each cluster. The overall within-cluster RMSD was computed as a weighted mean of the mean RMSDs using the sizes of the clusters, so that each conformer is equally weighted. The within-cluster RMSD values depend not only on the clustering algorithm, but also on the conformational space of the metabolite. To standardize the comparison among metabolites, we subtract the global mean RMSD of the metabolite (the mean RMSD without considering clusters) from the overall within-cluster mean RMSD, and report the difference by algorithm and metabolite.

3.2.5 Case Studies

O-Succinyl-L-Homoserine

We clustered conformers generated for O-succinyl-L-homoserine (OSLH) to illustrate the use of the AutoGraph conformational clustering algorithm. Conformers were generated and refined in a previous study. [96] In summary, the MacroModel/ConfGen protocol (Schrödinger, Inc.) was used to generate 501 conformations unique to 0.1 Å in atomic RMSD. [118, 119] This was followed by ANI-1ccx calculations in the gas phase. [95] Only conformers with no imaginary vibrational frequencies were retained, narrowing the conformers to 485. Calculations using ANI potentials were performed using the Atomic Simulation Environment (ASE) interface. [120] Conformations were clustered by AutoGraph and the resultant graph was visualized using Gephi. [121] In addition, a charged structure was prepared using the PrepWizard tool (Schrödinger, Inc.). [122, 123] Single point energies were calculated using the Gaussian quantum chemistry software at the HF/6-31G(d) level of theory in the gas phase for both the neutral and charged conformers. [124] ANI-1ccx optimized neutral OSLH conformers were further refined using Gaussian at the B3LYP/6-31G(d,p) level of theory in the gas phase. Conformers were confirmed to be at local minima by vibrational analysis. The fully optimized geometries were subjected to clustering by AutoGraph.

Nicotinamide Adenine Dinucleotide

We also clustered conformers generated for nicotinamide adenine dinucleotide in the oxidized form (NAD+). We generated 1000 conformers from the SMILES of NAD+ using RDKit's implementation of the ETKDG algorithm, unique to 0.1 Å in atomic RMSD. [92, 125] Structures were optimized using the MMFF94 potential. [126] This was followed by the Austin Model 1 (AM1) semiempirical method in the gas phase using the MOPAC software interfaced through ASE. [120, 127, 128] Only structures with the original topology were retained, filtering to 785 total conformers. Conformations were clustered by AutoGraph and the graphs were visualized using Gephi. [121] Single point energies were calculated using the Psi4 quantum chemistry package at the B3LYP/6-31G(d) level of theory in the gas phase. [129]

3.2.6 Benchmark Dataset

Metabolite Curation

Molecules were selected from the Human Metabolome Database (HMDB), consisting of 114184 metabolites. [130] Those localized to the cytosol, nucleus, or mitochondria were kept, yielding 8249 molecules. We selected metabolites containing only elements which can be subjected to energy calculation by the ANI-2x potentials (CHONSFCl), yielding 7547 molecules. [93] The number of rotatable bonds was calculated for each metabolite to remove trivial or unfeasible cases for conformation generation. Metabolites with rotatable bond counts in the range of the 50th to 95th quantiles were retained, resulting in all metabolites having four to fourteen rotatable bonds. Out of the 3350 remaining metabolites, 200 representative structures were chosen using the following protocol. Morgan fingerprints were calculated for all molecules, using 2048 bits with a connectivity of three. A 3350 × 3350 matrix of Tanimoto distances between all fingerprints was calculated. A Ward dendrogram was calculated from the dissimilarity matrix and a threshold was applied to produce 200 clusters. Representative metabolites were selected from each cluster as those having the greatest in-cluster degree. The metabolites are given in the appendix (Table A2).

Conformer Generation

Up to 1000 conformers were generated for each selected metabolite using RDKit's implementation of the ETKDG algorithm, discarding any structure with an RMSD below 0.1 Å from any of the previous structures. [92, 125] All structures were optimized by the Merck Molecular Force Field 94 (MMFF94) in the gas phase. [126] Further, all structures were optimized using the ANI-2x potentials with the BFGS optimizer in the gas phase. [93] The final potential energies calculated with the ANI-2x potentials were recorded.
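The generation and initial refinement stages map directly onto RDKit calls. A minimal sketch of the ETKDG/MMFF94 portion (the ANI-2x refinement is omitted); we show the ETKDGv3 parameter object for convenience, whereas the runs above used the ETKDG variant, and the 0.1 Å pruning threshold mirrors the criterion described:

from rdkit import Chem
from rdkit.Chem import AllChem

def generate_conformers(smiles, n_max=1000):
    # Generate up to n_max ETKDG conformers pruned at 0.1 Å RMSD,
    # then relax each with MMFF94
    mol = Chem.AddHs(Chem.MolFromSmiles(smiles))
    params = AllChem.ETKDGv3()
    params.pruneRmsThresh = 0.1
    AllChem.EmbedMultipleConfs(mol, numConfs=n_max, params=params)
    results = AllChem.MMFFOptimizeMoleculeConfs(mol)  # (converged, energy) pairs
    energies = [energy for _, energy in results]
    return mol, energies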
3.3 Results and Discussion

3.3.1 Clustering O-succinyl-L-homoserine conformers by AutoGraph as an illustrative example

Among the strengths of using a graph to represent the conformers are that graphs offer intuitive summaries of the relationships in the data, and that an abundance of graph algorithms is available. AutoGraph was designed with the strategy of processing the RMSD matrix into the readily interpretable form of a graph, then applying existing graph clustering protocols. We highlight the conformational graph of OSLH, for which each node represents exactly one conformer, and edge weights are proportional to the structural similarity determined by atomic RMSD between conformers (Figure 3.1).

Figure 3.1: O-succinyl-L-homoserine (top) was subjected to conformer generation, geometry optimization, and clustering by AutoGraph. The conformational graphs colored by single point energy values for the neutral (middle left) and charged (middle right) systems are shown using a gradient of blue (low energy) to red (high energy). The conformational graph colored by assigned cluster provides intuitive results (middle center). Conformers within each cluster were superimposed to their centroid (bottom).

Since the conformers are geometry optimized, we suspect the densely connected subgraphs represent basins in the PES, and thus should be grouped in the same cluster. The AutoGraph protocol identified 28 clusters as shown. Nodes seem to have neighbors with similar energy, particularly in dense regions of the graph. The qualitative geometric/energetic correlation provides preliminary evidence that information regarding the PES is available implicitly in the conformational graph. The superimposed conformers illustrate that, while noisy, the overall molecular shape is similar within each cluster. The intuitive clusters on the graph appear to translate to qualitative conformational similarity for this system.

Unlike other clustering algorithms which take the RMSD matrix as the direct input, AutoGraph first processes the RMSD matrix into a graph representation. We assess whether the resulting conformational graph reasonably preserves the geometric/energetic similarities between conformers as expected for the PES. To do so, we measured the Spearman correlation coefficients ρ between the local estimated energy values on OSLH's conformational graph and their actual energy values, calculated by HF/6-31G(d) in the gas phase (Figure 3.2).

Figure 3.2: Locally weighted estimated relative energies were plotted against the measured relative single point energy values (n = 485). Graphs before and after filtering low weight edges were considered and Spearman correlation coefficients were calculated (charged: ρ_with = 0.916, ρ_without = 0.852; neutral: ρ_with = 0.465, ρ_without = 0.457). The line for x = y is plotted.

There is a positive correlation for both conformational graphs, suggesting the monotonic relationship between geometry and energy is preserved even after kernelizing and filtering the RMSD matrix input. The local estimated energy appears more responsive to the actual energy when the adaptive threshold is applied. We also observe an improvement in the correlation. The improvement is particularly pronounced for the case of the charged system. We suspect the magnitude of Coulombic interactions in the gas phase is much greater than that of the noise, such that we observe a strong relationship between the geometry and energy of the system. It is promising to observe a correlation in both the charged and neutral case even though their trends in relative energy differ. While the conformational graph represents purely geometric information, we observe we can infer energetic information because the refined conformations are biased to minima in the PES.

All ANI-1ccx optimized neutral OSLH conformers were subjected to geometry optimization at the B3LYP/6-31G(d,p) level of theory. The conformational graphs obtained by AutoGraph were visualized (Figure 3.3). In one of the conformational graphs, we indicate the conformers which were chosen as centroids by AutoGraph in the previous step for the ANI-1ccx optimized structures.

Figure 3.3: The conformational graph for B3LYP/6-31G(d,p) optimized neutral OSLH is colored by cluster assigned by AutoGraph, by relative energy values from low (blue) to high (red), and by centroids selected by AutoGraph from the ANI-1ccx optimized OSLH conformers (red). Sizes of nodes are proportional to the weighted degree of each node.
We observe that at least one structure from the centroids chosen from the ANI-1ccx optimized conformers appears in densely connected regions of the B3LYP/6-31G(d,p) optimized conformational graph. If we were to calculate an average property, we may select a representative conformer from each of the clusters of the B3LYP/6-31G(d,p) optimized conformational graph and take a Boltzmann average. We observe we can perform ab initio optimization on only the centroids selected from the ANI-1ccx optimized graph and obtain similar results as if we had subjected all conformers to full geometry optimization. In this case, we can thereby expedite the workflow by subjecting only 28 starting conformers to the expensive geometry optimization rather than the full collection of 485 conformers.

3.3.2 Clustering nicotinamide adenine dinucleotide conformers by AutoGraph

We next examine the clustering results for NAD+ conformations. NAD+ is a well-known cofactor that mediates the redox currency of the cell. [131] Importantly, NAD+ contains two phosphorus atoms, an atom type not represented in the ANI-2x potential. We therefore require an alternative protocol for refinement, and we can probe the robustness of the AutoGraph protocol to the method of conformer generation and refinement. After conformer generation and refinement with MMFF94, we optimized the NAD+ conformations using AM1 in the gas phase. The 785 conformers which retained the original topology were subjected to conformational clustering by AutoGraph.

The AutoGraph protocol identified 31 clusters as shown (Figure 3.4). Overall, the graph appeared more globally connected than in the OSLH example. The semiempirical parameterized model 7 (PM7) method agrees less closely with coupled cluster energy values than the ANI potentials do, so the AM1 energies, from a semiempirical method, may have exhibited less convergence to local optima for NAD+ than ANI-2x did for OSLH. [94] The graph colored by single point energy values shows an overall gradient, in which the densely connected region also appears to contain low energy conformers. Meanwhile, the higher energy side of the graph seems more sparsely connected. The superimposed conformers are also visually sound. While a positive correlation was observed between actual and local weighted energy estimates for the conformational graphs before and after applying an edge weight threshold (ρ_with = 0.845, ρ_without = 0.842), no notable change in the correlation coefficient was observed (Figure A8). Because energetic information is inferred from the conformational graph, which only encodes geometric information explicitly, the success of the clustering results may depend on the convergence of the optimized structures, which in turn depends on the accuracy of the energy calculation method. The throughput achieved by deep learning potentials like the ANI potentials provides a unique opportunity for AutoGraph to interface between deep learning and ab initio methods in high throughput applications.

The cluster-wise variance in energy values was considered as an evaluation metric.

Figure 3.4: Nicotinamide adenine dinucleotide (top) was subjected to conformer generation, geometry optimization, and clustering by AutoGraph. The conformational graphs colored by cluster (middle left) and single point energy values (middle right) are shown using a gradient of blue (low energy) to red (high energy). Conformers within clusters were superimposed to their centroids (bottom).
Table 3.1: Variance in energy within clusters (σ²_within) was compared against the variance among all conformers (σ²) for OSLH and NAD+ individually. The F-test of equality of variances was used to assess the statistical significance of the difference in variances. The numbers of total conformers N and of clusters C are shown. Variances have units of kcal² mol⁻². p-values were rounded to three decimal places. All other values were rounded to the second decimal place. Tests returning a p-value below 0.001 were considered significant, marked by *.

metabolite     N    C    σ²       σ²_within   p
OSLH neutral   485  28   4.59     3.94        0.048
OSLH charged   485  28   194.32   64.21       < 0.001*
NAD+           785  31   105.67   49.15       < 0.001*

The variance in energy within clusters was compared against the variance in energy among all conformers for OSLH and NAD+ individually (Table 3.1). We detect that the within-cluster variance in energy is less than the overall variance for charged OSLH and for NAD+. The trend for neutral OSLH was not validated, likely due to its negligible effect size of −0.65 kcal² mol⁻² for the difference in variances. While conformational clustering of OSLH and NAD+ both produced reasonable results, the results of these case studies are anecdotal. Validation of the AutoGraph protocol should be performed over a diverse data set chosen in a manner that minimizes bias. For this reason we constructed a benchmark conformation set for 200 metabolites chosen from the HMDB using a fairly automated protocol. [130]

3.3.3 Conformational graphs retain geometry/energy correlation for metabolites

The distributions of Spearman correlation coefficients obtained between actual and local weighted energy estimates on the conformational graphs of all 200 metabolites were plotted (Figure 3.5). While the distributions exhibited a large range, the majority of ρ values were positive, with median values of 0.36 before and 0.38 after applying the edge weight threshold. This indicates most graphs exhibited a positive monotonic relationship between actual and local estimated energy values to varying degrees. Proximity in the conformational graph therefore translates to similarity in energy. No significant enhancement was observed upon applying a threshold to the conformational graphs.

Figure 3.5: Distributions of Spearman correlation coefficients between actual and local estimated relative energy values, with and without applying a threshold value as part of the AutoGraph protocol to produce the conformational graphs (n = 200 for each distribution).

We observe energetic information was inferred from the conformational graphs over a large, representative set of metabolites; therefore, clustering by the conformational graphs may yield energetically informed partitions of the PES by thermally accessible local minima.

3.3.4 Autonomous clustering algorithms reduce mean within-cluster RMSD and cluster-wise variance in energy

We employed energy as a surrogate metric for the performance of the clustering algorithms. Each dynamic clustering algorithm chooses its hyperparameters (e.g., number of clusters, threshold values) by optimizing its geometry based loss function. Had we chosen a geometry-based assessment of the clustering algorithms, we would thereby risk inflating the performance of a given algorithm. We report the difference between within-cluster mean RMSD and global mean RMSD for reference (Figure A9).
However, we base our assessment on whether each clustering algorithm successfully clusters together partially optimized conformers with similar energy values, and thus exhibits low within-cluster variance in energy.

The variance in energy within clusters was determined to be smaller than the total variance for charged OSLH and NAD+. We assessed the difference between cluster-wise and total variance in energy across the 200 benchmark structures for the various algorithms (Table 3.2, Figure A10). We observe a significant decrease in cluster-wise variance in energy relative to the total variance for all dynamic clustering algorithms. The negative control failed to reject the null hypothesis (H₀: σ²_within − σ² = 0), highlighting the importance of automating the hyperparameter selection for the task of unsupervised clustering. Among the dynamic conformational clustering algorithms, AutoGraph exhibited the most negative mean and median discrepancy in variance of energy, σ²_within − σ². The negative discrepancy in the variance of energy implies the cluster-wise variance is less than the overall variance. Conformational clustering algorithms successfully minimizing the variance in energy within the clusters indicates successful partitioning based on the biases imposed on sampled conformers during geometry optimization. Given this metric, AutoGraph appears to most successfully cluster conformers which we expect to converge to similar structures and energies upon further geometry optimization.

We considered the possibility in which a superficially low within-cluster variance in energy was achieved by forming only negligibly small clusters, thereby failing to generalize the clustering result. For each algorithm, the mean number of clusters detected per metabolite, C̄, was much smaller than the original number of conformers (≈ 1000), thus each algorithm appeared to successfully narrow down candidate conformers. We should note the effect size on σ²_within − σ² is notably smaller than the results observed for charged OSLH and NAD+ (Table 3.1).

Table 3.2: The difference between the variance in energy within clusters and the variance among all conformers (σ²_within − σ²) for 200 representative metabolites was calculated (df = 199 per algorithm). The mean and median of each distribution are reported. A paired left-tailed Wilcoxon signed rank test was used to assess the statistical significance of the median discrepancy in variances in energy. The mean number of clusters detected, C̄, was also computed. Discrepancies in variance have units of kcal² mol⁻². p-values were rounded to three decimal places. Variance discrepancies in energy were rounded to the second decimal place or to have at least one significant figure. C̄ was rounded to the closest integer. Tests returning a p-value below 0.001 were considered significant, marked by *. Abbreviations: AutoGraph (AG), Ward/Dynamic Tree Cut (DTC), NMRCLUST (NC), Representative Conformer K-means (RCK), Ward hierarchical clustering with 18 clusters (fixed).

         AG         DTC        NC         RCK        fixed
mean     −8.18      −6.59      −2.57      −2.99      0.03
median   −1.09      −0.54      −0.38      −0.09      0.003
p        < 0.001*   < 0.001*   < 0.001*   < 0.001*   0.89
C̄        18         18         32         4          18

While we consistently observe that clustering by AutoGraph reduces the variance in energy
within clusters relative to the overall variance, the small effect size highlights AutoGraph as a tool to draw preliminary trends from a large number of metabolite conformations, reducing the search space for downstream processing and prediction; it is not intended to make predictions itself.

In this study, we employed atomic RMSD as the distance metric because it is a simple, commonly used metric for structural similarity, applicable to the geometry optimization results of small organic molecules. Other distance metrics may be more appropriate for other applications, such as atomic contacts or Fischer distance. [132] We implemented the AutoGraph protocol such that users can modify the dissimilarity/similarity metric used. This is achieved by populating the output directory with either the dissimilarity or affinity matrix (Figure A7). We welcome further assessment of the protocol applied to metrics other than atomic RMSD.

3.4 Conclusion

We presented use cases for an automated conformational clustering algorithm on OSLH and NAD+. Due to the throughput of the ANI-2x potential and the high degree of automation of AutoGraph, we could generate, refine, and cluster the conformers of 200 representative metabolites. Further validation of the algorithm as a strategy to obtain an approximation of the underlying conformational ensemble is underway. The AutoGraph protocol can be integrated into computational workflows handling metabolite or other small-molecule conformations using a short Python script, or run as an interactive program with no coding required. We anticipate the application of AutoGraph will narrow down the search space in order to generate a representative conformational ensemble for collections of small molecules.

Chapter 4: Rapid and Automated Ab Initio Metabolite Collisional Cross Section Prediction from SMILES Input

Kiyoto Aramis Tanemura, Susanta Das, Kenneth M. Merz Jr.

4.1 Introduction

Ion mobility spectrometry (IMS) is an analytical technique frequently coupled to mass spectrometry to separate analytes by size, shape, and charge for greater resolution. [19, 133] The collisional cross section (CCS) is a mobility measurement obtained by IMS, which provides a highly reproducible value for metabolite identification. [22] One untargeted metabolomics measurement can yield millions of signals, far more than are covered by databases of known CCS values. [134] High throughput metabolite identification therefore depends both on the construction of more comprehensive CCS databases and on the development of standard-free identification by in silico prediction of CCS values. [16]

Two main approaches are considered for computational prediction of CCS values: deduction from principle or inference from data. Ab initio methods model the gas phase conformational ensemble of charged species. The CCS can then be calculated for individual conformers and Boltzmann averaged. The in silico chemical library engine (ISiCLE) open source software is an example of this approach. [27] While the approach produces reliable CCS values justified by physical principles, molecular modeling involves computationally intensive simulation steps. In a complementary approach, data-driven methods parametrize a machine learning model on a database of molecules with known CCS values. AllCCS, DarkChem, and DeepCCS are representative models of this group. [28, 31, 32]
Such models are capable of rapid prediction of CCS values; however, their accuracy depends on sufficient representation of similar structures in the training set, which is not guaranteed given the sparsity of available CCS values relative to the vast number of possible metabolites.

To mitigate the computational bottleneck imposed by QM calculations, we augment an ab initio workflow for CCS prediction with incremental refinement by the ANI deep learning potentials. [93, 135] The developed workflow was validated against CCS values measured by traveling wave ion mobility-mass spectrometry (TWIMS) for a set of representative metabolites. [136] While the performance of the workflow was validated, execution of the workflow still presented a barrier.

The proper operation of computational chemistry software is required to execute this ab initio CCS prediction workflow, which is labor intensive and limits its use to expert computational chemists. The demand for concurrent development in the annotation of experimental data and in CCS prediction software further raises the barrier to accessibility in terms of expertise. Implementation of a highly automated workflow to execute advanced ab initio ensemble modeling and CCS calculation is therefore warranted for the metabolomics community, whose members specialize in experimental or computational methods. Accessibility further benefits from the usage of free, open source software at every step of the computational pipeline.

The computational chemistry landscape is very complex, with many packages and software suites vying for user attention. For all but the cognoscenti, this landscape is intimidating, making it hard for non-experts to work through a series of computational steps to get to a desired endpoint. Many decisions need to be made at each step in terms of the software to be used, and its quality, reliability, and accuracy. Thus, it is incumbent on the computational community to develop informatics infrastructure to support endpoints of import to experimental scientists. While the CCS prediction workflow is quite robust, its smooth implementation is a challenge even for expert users. Hence the need for a workflow strategy that encapsulates all the tools needed to go from a simple molecular structure representation like SMILES to an accurate estimation of the CCS value.

The present work describes the informatics infrastructure and workflow implementation and does not focus on the computational details, which have been reported in detail elsewhere. [136] In light of this we describe the implementation of our CCS prediction workflow using the Snakemake workflow manager, which allows users to interact with a standardized interface to execute the multistep computation. [137]

Figure 4.1: The present CCS prediction workflow was implemented in Snakemake such that a user can execute the workflow in an automated fashion, providing only the SMILES string as an input. We report the uncertainty introduced by stochastic steps in this workflow, the Conformer Generation and AutoGraph Clustering steps, by running ten replicates of independent CCS calculations from each step. Some visuals of the outputs are shown for the [M+H]+ adduct of L-carnosine to provide an intuition over each operation of the workflow. Briefly, three charge models are retained and up to 1000 conformers are generated for each. The conformational similarity graphs are shown after MMFF94 or ANI-2x refinement. We observe the conformational similarity graph becomes more modular as individual conformers approach minima in the potential energy surface. The AutoGraph clustering protocol then identifies clusters to maximize the modularity in the conformational similarity graph, and reports centroid conformers to be subjected to computationally intensive DFT geometry optimization and CCS calculation. The CCS values are Boltzmann averaged and written to a result file accessible to the user.
The high degree of automation allows users to provide only a simplified molecular-input line-entry system (SMILES) structure to predict CCS values for multiple protonated/deprotonated adducts and models. Furthermore, the automation allows for parallelized computation of CCS values on high performance computing (HPC) systems. Beyond describing the workflow implementation, we also show its value by assessing the workflow through replicate calculations starting from the two stochastic steps in the workflow: conformer generation and clustering (Figure 4.1). In what follows we go over some of the details of the Snakemake implementation and how it can be used to assess the reliability of the workflow.

4.2 Description of the Implementation

The complex operations, involving several open source software packages and unspecified numbers of files, are automated using Snakemake. In Snakemake, computational workflows are programmed in a script titled the Snakefile. The Snakefile contains a list of target files and rules defined in blocks. The rules specify the input/output files and the script to produce the output file from the input file. At execution, a directed acyclic graph (DAG) is computed based on the files present to plan the sequence of jobs to run. Snakemake executes the minimum set of jobs required to produce all files specified in the target. [137]

Let us highlight as an example the following Snakefile, which produces output files via the HPCCS collisional cross section prediction software. [26] It defines the paths to the target files, the Boltzmann averaged CCS values saved to the file ccs.txt, in a list. The operation takes coordinates and atomic charges from the QUICK QM output files and converts them to PQR format in rule sp_out2pqr. The CCS value is predicted from the PQR file using HPCCS, producing an output with extension .hpccs for the particular conformer. The HPCCS results are read from all conformers considered, Boltzmann averaged, and saved to ccs.txt through rule model_results.

# Target files are defined in a list by iterating through the metabolite ID,
# charge, and charge models considered
targets = []
for theID in ids:
    for thecharge in systems[theID]:
        if int(thecharge) not in args['consider_adducts_of_charges']:
            continue
        for themodel in systems[theID][thecharge]:
            targets.append('results/{}/{}/ensemble_fast/{}/ccs.txt'.format(theID, thecharge, themodel))

# The targets are passed to the rule titled all
rule all:
    input: targets

# Rules are defined in blocks. They specify input/output files, and the script
# to generate the output from the input.

# Write a PQR file from a QUICK single point output file
rule sp_out2pqr:
    input: "results/{label}/{charge}/ensemble_fast/{model}/{num}.out"
    output: "results/{label}/{charge}/ensemble_fast/{model}/{num}.pqr"
    shell:
        """
        src/sp_out2pqr.py {input}
        """
# Compute the CCS value using HPCCS
rule hpccs:
    input: "results/{label}/{charge}/ensemble_fast/{model}/{num}.pqr"
    output: "results/{label}/{charge}/ensemble_fast/{model}/{num}.hpccs"
    params:
        hpccs=paths['HPCCS_path'],
        wdpath=paths['wd_path']
    shell:
        """
        cd {params.hpccs}
        ./hpccs {params.wdpath}{input} > {params.wdpath}{output}
        """

# Boltzmann average the individual CCS values into the final value
rule model_results:
    input:
        lambda wildcards: expand(
            "results/{label}/{charge}/ensemble_fast/{model}/{num}.hpccs",
            label=wildcards.label,
            charge=wildcards.charge,
            model=wildcards.model,
            num=systems[wildcards.label][wildcards.charge][wildcards.model])
    output: "results/{label}/{charge}/ensemble_fast/{model}/ccs.txt"
    shell:
        """
        src/model_results_intermediate.py results/{wildcards.label}/{wildcards.charge}/ensemble_fast/{wildcards.model}/
        """

Complex computational operations can be implemented in Snakemake through rules. Individual rules may map input files to output files in one-to-one, one-to-many, or many-to-one fashion. As long as the numbers of files are specified, rules with many inputs or outputs can be encoded in one Snakefile. Workflows are allowed to diverge by using the same file as input for different rules, or converge by using files from different sources as input to one rule.

The number or identifiers of the intermediate files may not be defined at the beginning of the workflow, as is the case for our implementation, which flexibly generates and selects conformers. This prevents the DAG from being computed for the entire workflow at the start. Therefore, we broke the workflow into convenient modules over which the DAG could be computed with no ambiguity, and the workflows encoded in these Snakefiles can be run sequentially inside a bash script. Snakemake therefore presents an opportunity to customize workflows using other methodologies or open source software in executing the CCS prediction workflow. We plan to explore this advantage further by fully automating the Portal for Open Computational Metabolomics Tools (POMICS) webpage (www.pomics.org). Currently, the web site serves as a contact page to compute predicted CCS values via this workflow, thereby requiring human intervention to initiate the workflow.

4.3 Description of Workflow

The computational details of the workflow are described in detail elsewhere. [136] Let us expand on the particular uses of software to automate the protocol, organized by the sequential execution of workflows encoded in Snakefiles. The input files are stored at the path data/input.smi. Arguments such as the charges to consider are defined in arguments.json. Paths to various software are saved in paths.json.

4.3.1 Snakefile make model

The simplified molecular-input line-entry system (SMILES) structure is read. We employ Dimorphite-DL to protonate the functional groups present, thereby hypothesizing various protonation states. [138] Conformers are generated for each protonation model.
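The protonation step can be illustrated in a few lines. The sketch below assumes the PyPI dimorphite-dl package's DimorphiteDL class (the exact class and keyword names vary by release) and an illustrative pH window; the formal charge of each variant identifies which adduct it models:

from dimorphite_dl import DimorphiteDL  # API may differ between versions
from rdkit import Chem

def enumerate_charge_models(smiles, max_variants=16):
    # Hypothesize (de)protonation states and record their formal charges
    engine = DimorphiteDL(min_ph=0.0, max_ph=14.0, max_variants=max_variants)
    models = []
    for protonated in engine.protonate(smiles):
        mol = Chem.MolFromSmiles(protonated)
        if mol is not None:
            models.append((protonated, Chem.GetFormalCharge(mol)))
    return models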
4.3.2 Snakefile site screen

The generated conformers are subjected to single point energy calculations in the gas phase using B3LYP/6-311G(d,p) through QUICK. [139–141] Models within 10 kcal/mol relative energy are retained as viable charge models.

While the implementation of the (de)protonation model is simple, it suffers from several drawbacks. Dimorphite-DL protonates common functional groups expected in the aqueous phase, not the gas phase. Thus, the current implementation is prone to miss some viable gas-phase protonation states, such as protonated carbonyl groups or enolates. If the user has insight into how the metabolite might protonate, they can append the SMILES string for the expected charged species to the charge model hypothesis file, model.smi. A more comprehensive protocol for metabolite protonation should be considered in future versions of the workflow.

4.3.3 Snakefile conf gen

Conformers are generated from the viable charge models using the ETKDG algorithm in RDKit. [92] The generated conformers are incrementally refined, first using the Merck Molecular Force Field 94 (MMFF94) implemented in RDKit to address distorted bonds and angles. [126]

4.3.4 Snakefile AG

The refinement using MMFF94 is followed by refinement with the ANI-2x potential implemented in TorchANI. [93, 135] Following refinement, representative conformers are selected using AutoGraph conformational clustering. [142] Centroid structures are saved as the model of the conformational ensemble.

4.3.5 Snakefile geom opt

Centroid conformers are subjected to geometry optimization using B3LYP/6-31++G(d,p) in QUICK. The resulting geometries are a model of the gas-phase conformational ensemble, which we can use as input to predict average properties such as CCS values.

4.3.6 Snakefile hpccs

Conformer-wise CCS values are predicted using the High Performance Collision Cross Section (HPCCS) package. [26] The individual CCS values are Boltzmann averaged according to their optimized energies to yield the final CCS value, given the metabolite, charge, and charge model. The set of predicted CCS values is a metric that can be referenced against a measured CCS value to perform tasks such as metabolite identification or protonation model prediction.
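The final averaging is a few lines of NumPy. A minimal sketch (the constant and function names are ours); the energies are shifted by their minimum before exponentiation for numerical stability:

import numpy as np

KT_298K_KCAL = 0.593  # Boltzmann constant times 298 K, in kcal/mol

def boltzmann_average_ccs(ccs, energies_kcal, kT=KT_298K_KCAL):
    # ccs: per-conformer CCS values (Angstrom^2); energies_kcal: matching energies
    e = np.asarray(energies_kcal, dtype=float)
    weights = np.exp(-(e - e.min()) / kT)
    weights /= weights.sum()
    return float(weights @ np.asarray(ccs, dtype=float))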
4.4 Snakemake Enabled Workflow Extensions

Given the flexibility and reliability of the workflow, we can now innovate by using it in a number of ways, simply by modifying the Snakefile to generate new endpoints. In this vein we highlight its use to create a "fast" workflow, to assess workflow reliability, and to incorporate NMR shift computation.

4.4.1 The Fast Workflow

The QM step is a computational bottleneck, and it may be helpful to obtain preliminary results even at a lower resolution. For this reason, we explored and now offer the fast workflow, which performs a single point energy and charge calculation on the ANI-2x optimized centroid conformers rather than a full B3LYP/6-31++G(d,p) geometry optimization calculation. These conformers are subjected to CCS calculation and are Boltzmann weighted according to their B3LYP/6-31++G(d,p) single point energy, not the fully optimized energy. The Snakemake implementation allowed us to readily create a representative benchmark to test predictions for positive and negative ion modes, using both the fast and standard workflows (Table A3). We used the Snakefile geom opt as a template, which includes the geometry optimization step, and modified the source code such that the single point energy values were computed and used for Boltzmann averaging the CCS values. Due to the modular fashion of the Snakemake workflow, the modification presented minimal coding effort. The selection protocol of the benchmark set is further described in the appendix.

The mean relative error (MRE) and Pearson correlation coefficient between predicted and experimental CCS values were calculated for each group (Table 4.1, Figure A11). If the benchmark contained only similar CCS values, then we may get a low MRE regardless of the input structure by arbitrarily choosing a value near the mean of the distribution of CCS values. Therefore, as a negative control, we take the mean experimental CCS for each group and treat it as the predicted CCS for all compounds. This provides a best guess that does not distinguish input structures, thus having no predictive ability. In both workflows, we observe improved accuracy and precision relative to the negative control, highlighting that the workflow is sensitive to input structures. We observe that the standard workflow exhibited a lower MRE than the fast workflow for both positive and negative ion modes. Meanwhile, the fast workflow achieves satisfactory performance for CCS prediction, using ANI-2x optimized structures in place of QM optimized structures. A high degree of linearity was reflected across all groups except for the control. The control comprised one prediction rather than a distribution, and was thereby not applicable for calculating a correlation coefficient.

Table 4.1: The mean relative errors (MRE) of the fast and standard workflows over the benchmark set are reported, along with the negative control. The tests are grouped by charge and by whether or not they are the negative control. The Pearson correlation coefficient (ρ) between the predicted and experimental CCS values is reported. The correlation coefficient cannot be calculated for the control group, which contains a single value standing in for the prediction. All values are reported to two significant figures.

workflow   charge   MRE (µ ± 2σ) (%)   ρ
fast       −1       3.4 ± 4.7          0.98
fast       1        3.1 ± 4.4          0.99
standard   −1       2.5 ± 4.7          0.97
standard   1        2.5 ± 4.5          0.98
control    −1       13 ± 13            -
control    1        13 ± 15            -

4.4.2 Assessment of Reproducibility

The workflow was executed to predict the CCS value of L-carnosine. To assess the reproducibility of the stochastic steps in the workflow, all intermediate files generated subsequent to a given step were deleted and the workflow was rerun in ten replicate calculations. Conformer generation and conformational clustering are the stochastic steps involved in the workflow. The spread in CCS values produced thereafter represents the uncertainty introduced by the step and all subsequent operations (Table 4.2).

Table 4.2: The 95% confidence intervals in CCS values obtained by ten replicate runs are reported for L-carnosine in positive/negative ion mode for the viable models. The 95% confidence interval (CI) was calculated as two standard deviations, and is reported as: absolute (relative). All values were rounded to one significant figure. Abbreviations: conformation generation (conf gen), conformational clustering (clustering).

charge   model   from         95% CI
1        0       conf gen     0.9 Å² (0.6%)
1        1       conf gen     0.9 Å² (0.6%)
1        2       conf gen     1 Å² (0.7%)
−1       0       conf gen     2 Å² (1%)
1        0       clustering   0.4 Å² (0.3%)
1        1       clustering   0.4 Å² (0.2%)
1        2       clustering   0.2 Å² (0.1%)
−1       0       clustering   2 Å² (1%)

In all cases considered, the 95% confidence interval was within 1% of the measured CCS value. Overall, the uncertainty introduced by using stochastic methods is marginal in this workflow, exhibiting high precision even with the use of stochastic algorithms for conformation generation and clustering.
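The confidence intervals in Table 4.2 reduce to a one-line statistic over the replicate runs; a minimal sketch, with the function naming ours:

import numpy as np

def replicate_ci(ccs_replicates):
    # 95% CI taken as two standard deviations of replicate CCS predictions;
    # returns (mean, absolute CI, relative CI in percent)
    x = np.asarray(ccs_replicates, dtype=float)
    ci = 2.0 * x.std(ddof=1)
    return x.mean(), ci, 100.0 * ci / x.mean()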
4.4.3 NMR Calculation from Intermediate Files

One experiment is insufficient for tasks such as structure elucidation. Rather, we require convergent evidence to identify a metabolite from experimental spectra with confidence and integrity. In addition to measurements by IMS, nuclear magnetic resonance (NMR) spectroscopy on the sample can serve as another line of evidence for metabolite identification. It is therefore valuable to have predicted chemical shifts to compare against experimental values. We illustrate that we can use intermediate files from modeling the gas-phase ensemble as input to an in silico NMR chemical shift prediction workflow. [96]

A CCS prediction workflow was allowed to run to completion for O-succinyl-L-homoserine (OSLH). We then accessed a gas-phase optimized ensemble, located in the results/OSLH/-1/ensemble/model0/ directory as .xyz files. We selected the −1 charge because it is the expected zwitterionic species in aqueous solution.

1H and 13C NMR chemical shifts were calculated as previously described, but using the structures in the gas phase neutral ensemble as input. [96] Briefly, the gas-phase optimized structures were subjected to geometry optimization using M06-2X/6-31+G(d,p), with the integrated equation formalism polarized continuum model (IEFPCM) with deuterium as solvent. 1H and 13C NMR shielding tensors were calculated using B3LYP/6-311+G(2d,p). The resulting shielding tensors were converted to NMR chemical shifts by linear regression. Finally, the shifts were Boltzmann averaged. The calculations were performed using Gaussian 16. [124]

The predicted chemical shifts were compared to measured values as well as to those calculated by the original workflow (Table 4.3). We observe the mean absolute errors (MAE) to the experimental values are small for both sets of predicted chemical shift values; however, the automated workflow requires minimal intervention. Because the MAEs between the chemical shifts predicted by each method are small, we consider the original expert supervised workflow reproduced. Thereby, we confirm the intermediate files generated en route to CCS prediction can be used for another application in which the conformational ensemble must be modeled.

Table 4.3: Mean absolute errors (ppm, µ ± 2σ) of OSLH proton (upper right) and carbon (lower left) chemical shifts obtained by experiment, this workflow, and the expert supervised protocol.

             experiment    workflow      expert
experiment   -             0.14 ± 0.19   0.31 ± 0.75
workflow     1.1 ± 1.6     -             0.31 ± 0.70
expert       2.4 ± 2.2     3.0 ± 3.0     -

4.5 Conclusion

The described implementation of the CCS prediction workflow uses freely available software, taking a SMILES structure as input. In addition to the standard workflow, we offer a faster implementation which substitutes the computationally expensive QM geometry optimization with a single point energy calculation at a modest drop in accuracy. The reproducibility of the workflow has been validated for CCS calculation. Further, the intermediate files are useful for other applications which require a representative ensemble. The CCS prediction workflow is available at the following link: www.pomics.org
Chapter 5: Metabolite Collisional Cross Section and Uncertainty Prediction by Deep Bayesian Graph Convolutional Neural Network

Kiyoto Aramis Tanemura, Akhil Shajan, Susanta Das, Carter K. Asef, Facundo M. Fernández, Kenneth M. Merz Jr.

5.1 Introduction

Ab initio CCS prediction is a complementary approach to empirical CCS prediction. Quantitative structure property relationship (QSPR) modeling circumvents the complications involved in modeling and simulation, given sufficient data to parameterize a model. Meanwhile, the performance is dependent on the coverage and quality of the data, and detailed structural information cannot be learned from QSPR alone. Therefore, while ML based CCS prediction models can rapidly generate in silico CCS libraries to be used in metabolite annotation, assessing the quality of these values remains a challenge.

We employ a graph convolutional deep Bayesian neural net to predict CCS directly from structure. [143] The graph representation provides a more intuitive representation of chemical structure than the character embeddings of SMILES employed by DarkChem and DeepCCS. [31, 32] Additionally, the uncertainty quantitation achieved by a deep Bayesian neural net provides an informed method to handle the quality of values in the in silico library. The prior distribution can be estimated by introducing dropout at each parameter of the model. [144] The posterior distribution is then estimated through Bayesian inference by averaging the outcomes of Monte Carlo sampling T times.

Assuming a normal distribution for the output labels, the total uncertainty of the prediction is the predicted variance of the normal distribution, $\widehat{\mathrm{Var}}(\hat{y}\,|\,x, \Theta)$. The total uncertainty is decomposed into the epistemic and aleatoric uncertainty. Epistemic uncertainty arises from the model or missing information; the goal is therefore to reduce epistemic uncertainty. Aleatoric uncertainty, on the other hand, is uncertainty which is inherent to the data. If the aleatoric uncertainty is constant across the domain, it is considered homoscedastic, and can be represented by a single uncertainty parameter. We assume the aleatoric uncertainty can vary depending on the metabolite, and thereby model the aleatoric uncertainty as heteroscedastic. The epistemic uncertainty is given by $\frac{1}{T}\sum_{t=1}^{T} \hat{y}_t^2 - \left(\frac{1}{T}\sum_{t=1}^{T} \hat{y}_t\right)^2$. The aleatoric uncertainty predicted by the neural network is given by $\frac{1}{T}\sum_{t=1}^{T} \hat{\sigma}_t^2$.

CCS measurements have an uncertainty near 3% relative error. [136] Identical entries vary between experimental CCS databases by a mean relative error of 2.6%. Due to the sparsity of available CCS data relative to all possible metabolites, the uncertainty of the predicted values is likely heterogeneous across predictions. By use of the mean negative log likelihood (MNLL) as the loss function, entries in the training set with lower predicted uncertainty can be weighted more heavily than those with higher predicted uncertainty. Meanwhile, uncertainty prediction provides a means to identify regions in the metabolite space with high uncertainty due to poor coverage. Given the ab initio CCS prediction workflow is highly automated and independent of available data, we have the means to annotate regions of the metabolite space with less certainty in the CCS values. Combined with the uncertainty prediction achieved by the model, we can implement an active learning strategy to interplay between ab initio and deep learning CCS prediction methods.
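The uncertainty decomposition above can be expressed compactly in code. A minimal PyTorch sketch, assuming a hypothetical model whose forward pass returns a (mean, variance) pair and whose dropout layers remain stochastic in train() mode:

import torch

@torch.no_grad()
def mc_dropout_predict(model, x, T=100):
    # Monte Carlo dropout: sample T stochastic forward passes
    model.train()  # keep dropout active at inference time
    means, variances = [], []
    for _ in range(T):
        mu, var = model(x)
        means.append(mu)
        variances.append(var)
    means = torch.stack(means)          # (T, n)
    variances = torch.stack(variances)  # (T, n)
    predictive_mean = means.mean(dim=0)
    epistemic = means.pow(2).mean(dim=0) - predictive_mean.pow(2)
    aleatoric = variances.mean(dim=0)
    return predictive_mean, epistemic, aleatoric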
5.2 Methods

5.2.1 Data

Three datasets were employed in this study: the metabolite CCS set, the human metabolome database (HMDB) set, and the test set. Each dataset contained the deposited SMILES, protonation state, and carrier gas. All but the HMDB set also contained experimental CCS values. The datasets were filtered by the following criteria:

• number of heavy atoms no more than 360
• contains no elements other than C/H/O/N/S/F/Cl
• adducts formed solely by protonation or deprotonation

Finally, entries with exact matches were dropped; entries with identical structures or experimental conditions were retained if the CCS values were not exact matches.

Metabolite Set

Experimental metabolite CCS values were obtained from various databases, including AllCCS, the PNNL CCS Database, CCSBase, and the Unified CCS Compendium. [28, 29, 145, 146] For training, entries with stereochemistry (stereogenic centers or olefin geometry) were identified. The dataset was appended with duplicate entries of the stereogenic compounds, but with the stereochemistry unspecified.

HMDB Set

All entries from the HMDB were obtained. [130] Structures present in the Metabolite Set were removed, thereby producing a collection of metabolites with no recorded CCS values. Monoprotonation and monodeprotonation adducts in nitrogen gas were hypothesized.

Test Set

To construct an independent test set, we sought to measure CCS values for structures not included in the training set. We curated candidates from a laboratory chemical inventory sheet. Chemicals containing no elements other than C/H/O/N/S/F/Cl were retained, yielding 268 candidates. Structures which were also present in the training set were removed, leaving 195 candidates. The Morgan fingerprints of the candidates and the training set were computed with a connectivity of three and 2048 bits. The candidates were decorrelated by subjecting their Morgan fingerprints to Ward hierarchical clustering. A similarity threshold was applied to produce 50 clusters. For each cluster, the minimum Tanimoto distance of the candidates to all entries in the training set was calculated. The representative structure chosen from each cluster was the one having the maximum minimum distance to structures in the training set. Up to 20 candidates were manually selected based on availability and feasibility, producing the final 16 candidates. The CCS values for these compounds were measured by traveling wave IMS (TWIMS) (Table A4).

5.2.2 Featurization

Metabolites were featurized into the following arrays: the bond adjacency array, the atomic descriptor array, the experimental conditions array, and the CCS labels. Let n be the number of entries in a dataset. The bond adjacency array A is a binary n × 360 × 360 array in which, for each entry, the axes of the matrix correspond to the atom indices of a molecule, and the presence of a covalent bond is indicated by '1'. The diagonal is also populated by '1' to convey self-similarity. The atomic descriptor array X is an n × 360 × 16 array encoding the atom type, chirality, olefin geometry, and number of hydrogens for each atom. The experimental conditions array is an n × 11 array encoding the charge (−2 through +6), the carrier gas (helium/nitrogen), and whether or not the entry is derived from experiment or theoretical prediction. Finally, the CCS labels array is a flat array of size n containing the known CCS values.
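A minimal sketch of the adjacency construction with RDKit, padding to the fixed 360-atom size; the atomic descriptor, experimental conditions, and label arrays are omitted for brevity:

import numpy as np
from rdkit import Chem

MAX_ATOMS = 360  # padding size of the arrays described above

def bond_adjacency(smiles):
    # Padded binary adjacency with a self-similarity diagonal
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()  # heavy atoms only, consistent with the filter above
    adj = np.zeros((MAX_ATOMS, MAX_ATOMS), dtype=np.int8)
    adj[np.arange(n), np.arange(n)] = 1
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adj[i, j] = adj[j, i] = 1
    return adj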
5.2.3 Graph Convolutional Deep Bayesian Neural Net

We based our graph convolutional deep Bayesian neural net on an existing model with gate-skip augmentation, with minor modifications (Figure 5.1). [143] The model consists of six graph convolution layers, the readout of a molecular descriptor, and two sets of three dense layers for the regression of the mean and variance of the predicted CCS values, respectively. Concrete dropout is applied after each layer. [144, 147] The parametric rectified linear unit (PReLU) activation function was used for all layers, except for the gate skipping mechanism. [148]

Graph Convolution with Gate Augmentation

Each graph convolution sums the descriptors of an atom with those of its neighboring atoms, takes the dot product with trainable parameters, and passes the result through the PReLU activation function. Using the gate-skipping mechanism, a weighted average is computed between the present and previous convolution outputs, also using trainable parameters. The gate skipping mechanism employs the sigmoid activation function.

Figure 5.1: Graph convolutional deep Bayesian neural network with gate skipping augmentation, adapted from [143].

Readout of Molecular Descriptor

The atomic features passed through the layers of graph convolution are subjected to a single layer perceptron (SLP). The outputs for each atom are summed to produce a molecular descriptor vector, to which the one-hot-encoded experimental conditions are appended.

Feed Forward Regression of Mean and Variance

A standard multilayer perceptron (MLP) is used for the regression of the mean and variance of the CCS values. Each MLP outputs exactly one value, so the mean and the variance have dedicated MLPs.

5.2.4 Training of the Model

The models were trained incrementally using different datasets and loss functions. In this study, sequential training refers to training the model in a deterministic manner before training it as a Bayesian model on the same training set. Transfer learning refers to training the model on the Peptide Set before the Metabolite Set.

Deterministic Training

The model is trained in a deterministic manner by using the mean squared error (MSE) loss function, using the mean regression MLP and not the variance regression MLP. Let ŷ be the predicted label and y be the true label. The MSE is given by

\[ MSE = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \tag{5.1} \]

Bayesian Training

The model is trained as a Bayesian model by using the mean negative log likelihood (MNLL) loss function. Let a Bayesian model have a set of parameters $\Theta$ and compute the label from the features, $f_\Theta(x) = \hat{y}$. The negative log likelihood (NLL) is given by

\[ NLL = -\ln\left(P(y|x, \Theta)\right) \tag{5.2} \]

in which $P(a|b)$ is the probability of $a$ given $b$. Note that because $f_\Theta(x) = (\hat{y}, \hat{\sigma}^2)$ for the Bayesian model $f_\Theta$, we have $P(y|x, \Theta) = P(y|\hat{y}, \hat{\sigma}^2)$. We parameterize the uncertainty as a Gaussian distribution, for which the likelihood is expressed as

\[ P(y|\mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^{2}\right) \tag{5.3} \]

It follows that the negative log likelihood, up to an additive constant, is expressed as

\[ NLL = \frac{(y-\mu)^2}{2\sigma^2} + \frac{\ln \sigma^2}{2} \tag{5.4} \]

The mean negative log likelihood is the mean of the NLL over all entries:

\[ MNLL = \frac{\sum_{i=1}^{n} NLL_i}{n} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{(y_i - \hat{y}_i)^2}{2\hat{\sigma}_i^2} + \frac{\ln \hat{\sigma}_i^2}{2}\right] \tag{5.5} \]
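For concreteness, a minimal PyTorch sketch of this loss is shown below. It assumes, as is common for numerical stability though not stated above, that the variance head emits the log of the predicted variance; the function name and tensors are hypothetical.

```python
import torch

def mnll_loss(y_true, y_pred, log_var):
    """Mean negative log likelihood (Eq. 5.5), dropping the additive constant.

    Predicting log(sigma^2) keeps the variance positive without clamping;
    exp() recovers sigma^2 in the first term.
    """
    squared_error = (y_true - y_pred) ** 2
    return torch.mean(squared_error / (2.0 * torch.exp(log_var)) + 0.5 * log_var)

# Toy usage with stand-in CCS tensors
y_true = torch.tensor([150.0, 165.0, 140.0])
y_pred = torch.tensor([152.0, 160.0, 141.0])
log_var = torch.log(torch.tensor([25.0, 25.0, 25.0]))
print(mnll_loss(y_true, y_pred, log_var))
```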
5.2.5 CCS Mean and Variance Prediction

The mean and variance are determined using the Monte Carlo (MC) dropout method (100 samples). The aleatoric uncertainty is the mean of the predicted variances for each entry. The epistemic uncertainty is the variance of the predicted means.

5.2.6 Performance Assessment

We average the performance metrics over five replicates of five-fold cross validation. The database contains structural and experimental redundancy because entries with different measured CCS values were retained. To prevent inflating the assessment metrics, we based the partitions on the metabolite SMILES rather than on the indices of the training set.

For the CCS mean prediction, we report the mean absolute error (MAE), root mean squared error (RMSE), and mean relative error (MRE) from the experimental CCS values.

For the CCS variance prediction, we consider a metric we will refer to in this manuscript as the area under curve for variance (AUCV), to prevent confusion with the AUC of the receiver operating characteristic (ROC) curve. If the experimental CCS measurements were fairly sampled from the predicted distributions, we assume the z-scores of the experimental CCS values relative to the predicted Gaussian distributions to also exhibit a Gaussian distribution. The predicted probability of sampling within a threshold absolute z-score is the relative area under a Gaussian probability density function within that z-score. The observed probability is the fraction of experimental CCS values within the threshold absolute z-score. We plot the observed probability against the predicted probability to produce a curve from (0.0, 0.0) to (1.0, 1.0). If the two probabilities match perfectly, the AUCV of this curve has a value of 0.5. The AUCV is greater than 0.5 if the model is underconfident, predicting uncertainty values too large relative to the observed uncertainty. Conversely, the AUCV is below 0.5 if the model is overconfident, with predicted uncertainty smaller than the observed uncertainty.

5.2.7 Visualization of Molecular Encoding

To interpret the learned representation, the 512-feature molecular vector output by the readout layer was computed over the entire HMDB Set. The vectors were subjected to principal component analysis (PCA) to produce a three dimensional embedding (ratio of variance explained: PC0: 87%, PC1: 7%, PC2: 4%). Various molecular descriptors were computed for each structure using RDKit (rdkit.Chem.rdMolDescriptors). These well defined descriptors were overlaid on the PCA embedding of the molecular vectors.

5.2.8 Active Learning

A model trained on the entire metabolite training set predicts the mean and variance of the CCS values over the HMDB Set. The predictions are ordered by decreasing relative predicted uncertainty, and the CCS values for the entries with the highest uncertainty are predicted using the ab initio workflow in batches of 100. The theoretical CCS values augment the training set in five iterations of five fold cross validation to evaluate the effect of theoretical augmentation. The +1 and −1 charged adducts in nitrogen gas were hypothesized for the ab initio prediction. Only one charge model was considered for each metabolite/adduct pair; if no charge model was generated, the metabolite was removed from consideration. Therefore, while metabolites were considered in batches of 100, the number of theoretical CCS values added was not exactly 100.
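A minimal sketch of this uncertainty-guided batch selection is given below. The helper names are hypothetical, and run_ab_initio_workflow stands in for the automated Snakemake workflow of Chapter 4; this illustrates the loop's logic rather than the production implementation.

```python
def select_batch(predictions, batch_size=100):
    """Rank unannotated metabolites by relative predicted uncertainty
    (predicted sigma divided by predicted CCS) and return the top batch."""
    ranked = sorted(
        predictions,
        key=lambda p: (p["epistemic"] + p["aleatoric"]) ** 0.5 / p["ccs_mean"],
        reverse=True,
    )
    return ranked[:batch_size]

def active_learning_round(predictions, training_set, run_ab_initio_workflow):
    """One round: pick the most uncertain entries, annotate them ab initio,
    and add the successful predictions to the training set for retraining."""
    for entry in select_batch(predictions):
        ccs = run_ab_initio_workflow(entry["smiles"])  # may fail to build a charge model
        if ccs is not None:
            training_set.append((entry["smiles"], ccs))
    return training_set
```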
5.3 Results and Discussion

5.3.1 Fast convergence of training is achieved by combining deterministic and Bayesian training

Performance metrics were tracked for the five replicates of five fold cross validation, and the mean absolute error through training is shown (Figure 5.2). In Bayesian training, the model is trained with the MNLL loss function for the whole 2400 epochs. In Hybrid training, the first 1200 epochs were trained with the MSE loss by deterministic training, without training the variance prediction MLP; in the subsequent 1200 epochs, Bayesian training was used to train both the mean and variance prediction networks. The decrease in MAE to below 10 Å² occurs above 1000 epochs for the Bayesian training, but near 800 epochs for the Hybrid training. The final performance is similar between the two methods, but faster convergence was achieved by incremental training of the mean and variance networks using the Hybrid training.

Figure 5.2: The mean absolute error values of five iterations of five fold cross validation, using only MNLL as the loss function (Bayesian) or using deterministic training followed by Bayesian training.

Table 5.1: Test set performance metrics by adduct charge and whether or not stereochemistry information was included. Performance metrics include mean absolute error (MAE), mean relative error (MRE), mean absolute z-score (MAZ), and mean predicted uncertainty (MPU). All values reported to two significant figures.

charge | stereochemistry | MAE (Å²) | MRE (%) | MAZ | MPU (Å²)
+1 | yes | 4.9 | 2.6 | 0.18 | 24
−1 | yes | 8.3 | 5.7 | 0.39 | 26
+1/−1 | yes | 6.5 | 4.1 | 0.28 | 25
+1 | no | 9.4 | 4.3 | 0.29 | 27
−1 | no | 10. | 5.9 | 0.42 | 28
+1/−1 | no | 9.7 | 5.1 | 0.35 | 27

5.3.2 Missing stereochemistry is reflected in uncertainty of CCS prediction

The fully trained model was evaluated on the test set. To assess how the model treats systems with missing stereochemical information, molecules containing stereochemistry were assessed both with and without the stereochemical information (Table 5.1). In all cases, the error increases upon loss of the stereochemical information, and the loss of stereochemistry is accompanied by an increase in the mean predicted uncertainty. This illustrates that the model is capable of handling compounds with missing stereochemical information while reflecting the loss of information as an increase in predicted uncertainty.

5.3.3 Learned representation encodes information relevant for CCS prediction

The representation learned by graph convolution and readout was interpreted by visualization. We embedded the computed molecular vectors in three dimensions by PCA and overlaid values computed by well defined molecular descriptors. Plots which exhibited a recognizable pattern were selected and shown (Figure 5.3). The highlighted descriptors (exact molecular weight, hydrogen bond acceptor count, ring count, and number of rotatable bonds) can affect the size and three dimensional shape of the molecule in the gas phase, and are thus salient to CCS prediction. These features also vary along non-identical directions of the embedding, and thus are not completely correlated with one another. We observe that the graph convolutional model is capable of embedding molecules from a graph structure into molecular vectors which preserve information such as size, flexibility, and the potential for intramolecular hydrogen bonding interactions.

Figure 5.3: Exact molecular weight, hydrogen bond acceptor count, ring count, and number of rotatable bonds were overlaid on the PCA embedding of molecular vectors of the HMDB set.

5.3.4 Augmentation of training set with theoretical CCS values improves performance

The HMDB Set was sorted by decreasing predicted relative uncertainty to identify regions of the metabolite space with high uncertainty. The CCS values of the metabolites with the highest predicted relative uncertainty were predicted using the ab initio workflow in a high throughput and automated fashion, and the theoretical values were added to the training set of the model in batches of 100. The model was evaluated by five fold cross validation over solely the experimental CCS values; the theoretical values were added to the training set in all cases.

The five fold cross validation with no theoretical augmentation is shown at zero theoretical augmentation (Figure 5.4). As the number of theoretical CCS values in the training set increases, the validation mean relative error decreases from 3.8% to 3.4%.
The mean absolute error and root mean squared deviation both show a general decrease upon theoretical augmentation.

Figure 5.4: Performance metrics averaged over five iterations of five fold cross validation with augmentation of the training set with various numbers of theoretical CCS values. Abbreviations: area under curve of variance (AUCV, see methods), mean absolute error (MAE), mean relative error (MRE), root mean squared deviation (RMSD).

The area under curve of variance (AUCV) compares the expected and actual fractions of predictions within a z-score threshold, and is used in this study to assess the predicted uncertainty. An AUCV of 0.5 shows agreement between the observed and predicted uncertainty. We observe AUCV values above 0.5, which shows the model is underconfident, predicting uncertainty values greater than the observed errors warrant. There is no clear improvement of the validation AUCV upon inclusion of theoretical CCS values in the training set.

The effect size is not dramatic; however, the addition of merely 200 data points to a training set of 7331 points improved the performance. While the calculation of each theoretical CCS value is computationally nontrivial, involving multiple QM calculations per CCS value, the uncertainty quantitation by the deep Bayesian neural net provides a guided method to improve the theoretical annotation of CCS values.

5.4 Conclusion

The development of a gate augmented deep Bayesian graph convolutional neural net for CCS prediction offers an alternative representation of metabolite structure to those used by existing deep learning based CCS prediction models. It presents the first instance of uncertainty quantitation of machine learning predicted CCS values. The uncertainty quantitation provides an informed method to perform active learning over the metabolite space, in combination with the automated, high throughput ab initio CCS prediction workflow. The active learning can be automated in future work to create a continuously refining public database of in silico CCS values to aid in metabolite annotation.

Chapter 6: Thesis Contribution and Future Directions

In this body of research, I illustrate an increasing degree of automation achieved by the use of machine learning in computational chemistry workflow development. Refinement of pairwise potentials via a logistic regression classifier created a scoring function which evaluates protein-protein docking predictions with features that are consistent with the physical understanding of protein-protein interaction hot-spots. Inspired by the manual operations which prevented automation of the NMR chemical shift prediction workflow, an unsupervised conformational clustering protocol was developed through a novel reformulation of the RMSD distance into a similarity graph. The algorithm, AutoGraph, was implemented in the CCS prediction workflow using Snakemake, achieving a high degree of automation and throughput. Due to the automation achieved, the workflow could be evaluated in replicates, modified into a faster version, or extended to NMR chemical shift prediction. The workflow, in tandem with a graph convolutional deep Bayesian neural net, can perform active learning to predict CCS values of structures in regions of high uncertainty in the metabolite space.

This research contributes a collection of validated software toward the goal of comprehensive identification of metabolites from complex mixtures.
The software we contribute is distinct from previously available CCS prediction software. In comparison to ISiCLE, our Snakemake-based workflow integrates an intermediate refinement step using the ANI-2x deep learning potential followed by AutoGraph conformational clustering. The rapid refinement and adaptive narrowing of the conformational search space allow ab initio CCS prediction on the timescale of minutes on a high performance computing cluster. Meanwhile, the deep learning based CCS prediction distinguishes itself from DeepCCS or DarkChem by uncertainty prediction. We provide the first example of uncertainty quantitation over deep learning based CCS prediction, thereby enabling the engineering of an active learning cycle between ab initio and deep learning methodologies.

Future work should focus on implementing software to automate the active learning between the ab initio and deep learning predictions to create a continuously refining in silico database of CCS values. The workflows for the ab initio and deep learning based CCS predictions are individually automated; an infrastructure to manage their alternating execution can therefore be developed, with an orderly results output such that the latest CCS values can be publicly released. Such values can be integrated into other databases such as the Human Metabolome Database. [14, 15] To maximize the quality of the training set, web scraping of existing experimental CCS databases can also be implemented. Subsequent publications should include a detailed assessment of the active learning of CCS values, the release of the continuously refining public CCS database, and the deployment of the library in realistic metabolite annotation.

The present work provides the foundation for other possible research directions. The AutoGraph conformational clustering was tailored specifically to expedite the NMR chemical shift and CCS prediction workflows. Its application can be expanded to more challenging systems, such as macromolecules or molecular dynamics trajectories, by GPU-enabling the software through TensorFlow or PyTorch. The ab initio CCS prediction workflow was applied to metabolites; however, preliminary evidence demonstrates its utility for peptide and glycan CCS prediction as well. The scope of the workflow can be expanded by identifying and overcoming the specific challenges of the systems of interest. Finally, the deep learning based CCS workflow includes experimental conditions, unlike existing models, such that the predictions by a unified model can also be augmented by novel featurizers. Enhancement of the prediction performance can be pursued by convergent featurization of the molecule, possibly by including three dimensional features extracted by de Rham–Hodge analysis. [149]

In summary, I have developed through my PhD research a collection of software which was integrated into an active learning cycle for the continuous refinement of metabolite CCS values. Subsequent studies should pursue the complete automation of the active learning cycle such that a comprehensive in silico CCS database can be downloaded at the time of analysis for the most reliable predicted CCS values for metabolite annotation. Meanwhile, the developed software provides the foundation for alternative research directions. This body of work expands the capability of standard-free identification of metabolites by CCS prediction, thereby contributing to the effort of comprehensive structural elucidation in untargeted metabolomics experiments.
76 BIBLIOGRAPHY (1) Gowda, G. N.; Zhang, S.; Gu, H.; Asiago, V.; Shanaiah, N.; Raftery, D. Expert review of molecular diagnostics 2008, 8, 617–633. (2) Vinayavekhin, N.; Homan, E. A.; Saghatelian, A. ACS chemical biology 2010, 5, 91–103. (3) Mamas, M.; Dunn, W. B.; Neyses, L.; Goodacre, R. Archives of toxicology 2011, 85, 5–17. (4) Kaddurah-Daouk, R.; Kristal, B. S.; Weinshilboum, R. M., et al. Annual review of pharmacology and toxicology 2008, 48, 653–683. (5) Toya, Y.; Shimizu, H. Biotechnology advances 2013, 31, 818–826. (6) Dromms, R. A.; Styczynski, M. P. Metabolites 2012, 2, 1090–1122. (7) Presnell, K. V.; Alper, H. S. Biotechnology journal 2019, 14, 1800416. (8) Giani, A. M.; Gallo, G. R.; Gianfranceschi, L.; Formenti, G. Computational and Structural Biotechnology Journal 2020, 18, 9–19. (9) Mirza, S. P.; Olivier, M. Physiological genomics 2008, 33, 3–11. (10) Bohacek, R. S.; McMartin, C.; Guida, W. C. Medicinal research reviews 1996, 16, 3–50. (11) Polishchuk, P. G.; Madzhidov, T. I.; Varnek, A. Journal of computer-aided molecular design 2013, 27, 675–679. (12) Yanes, O.; Tautenhahn, R.; Patti, G. J.; Siuzdak, G. Analytical chemistry 2011, 83, 2152–2161. (13) Naser, F. J.; Mahieu, N. G.; Wang, L.; Spalding, J. L.; Johnson, S. L.; Patti, G. J. Analytical and bioanalytical chemistry 2018, 410, 1287–1297. (14) Wishart, D. S.; Tzur, D.; Knox, C.; Eisner, R.; Guo, A. C.; Young, N.; Cheng, D.; Jewell, K.; Arndt, D.; Sawhney, S., et al. Nucleic acids research 2007, 35, D521– D526. (15) Wishart, D. S.; Guo, A.; Oler, E.; Wang, F.; Anjum, A.; Peters, H.; Dizon, R.; Sayeeda, Z.; Tian, S.; Lee, B. L., et al. Nucleic Acids Research 2022, 50, D622– D631. (16) Borges, R. M.; Colby, S. M.; Das, S.; Edison, A. S.; Fiehn, O.; Kind, T.; Lee, J.; Merrill, A. T.; Merz Jr, K. M.; Metz, T. O., et al. Chemical reviews 2021, 121, 5633–5670. 77 (17) Markley, J. L.; Brüschweiler, R.; Edison, A. S.; Eghbalnia, H. R.; Powers, R.; Raftery, D.; Wishart, D. S. Current opinion in biotechnology 2017, 43, 34–40. (18) Sindelar, M.; Patti, G. J. Journal of the American Chemical Society 2020, 142, 9097– 9105. (19) Gabelica, V.; Marklund, E. Current opinion in chemical biology 2018, 42, 51–59. (20) Revercomb, H.; Mason, E. A. Analytical chemistry 1975, 47, 970–983. (21) May, J. C.; Goodwin, C. R.; Lareau, N. M.; Leaptrot, K. L.; Morris, C. B.; Kuru- lugama, R. T.; Mordehai, A.; Klein, C.; Barry, W.; Darland, E., et al. Analytical chemistry 2014, 86, 2107–2116. (22) Paglia, G.; Williams, J. P.; Menikarachchi, L.; Thompson, J. W.; Tyldesley-Worster, R.; Halldórsson, S.; Rolfsson, O.; Moseley, A.; Grant, D.; Langridge, J., et al. Ana- lytical chemistry 2014, 86, 3985–3993. (23) Mesleh, M.; Hunter, J.; Shvartsburg, A.; Schatz, G. C.; Jarrold, M. The Journal of Physical Chemistry 1996, 100, 16082–16086. (24) Shvartsburg, A. A.; Jarrold, M. F. Chemical physics letters 1996, 261, 86–91. (25) Marklund, E. G.; Degiacomi, M. T.; Robinson, C. V.; Baldwin, A. J.; Benesch, J. L. Structure 2015, 23, 791–799. (26) Zanotto, L.; Heerdt, G.; Souza, P. C.; Araujo, G.; Skaf, M. S. High performance collision cross section calculation—HPCCS, 2018. (27) Colby, S. M.; Thomas, D. G.; Nuñez, J. R.; Baxter, D. J.; Glaesemann, K. R.; Brown, J. M.; Pirrung, M. A.; Govind, N.; Teeguarden, J. G.; Metz, T. O., et al. Analytical chemistry 2019, 91, 4346–4356. (28) Zhou, Z.; Luo, M.; Chen, X.; Yin, Y.; Xiong, X.; Wang, R.; Zhu, Z.-J. Nature com- munications 2020, 11, 1–13. (29) Ross, D. H.; Cho, J. H.; Xu, L. Analytical chemistry 2020, 92, 4548–4557. 
(30) Nguyen, K. T.; Blum, L. C.; Van Deursen, R.; Reymond, J.-L. ChemMedChem: Chemistry Enabling Drug Discovery 2009, 4, 1803–1805. (31) Plante, P.-L.; Francovic-Fontaine, É.; May, J. C.; McLean, J. A.; Baker, E. S.; Lavi- olette, F.; Marchand, M.; Corbeil, J. Analytical chemistry 2019, 91, 5191–5199. (32) Colby, S. M.; Nuñez, J. R.; Hodas, N. O.; Corley, C. D.; Renslow, R. R. Analytical chemistry 2019, 92, 1720–1729. (33) Samuel, A. L. IBM Journal of Research and Development 1959, 3, 210–229. 78 (34) Tolle, K. M.; Tansley, D. S. W.; Hey, A. J. Proceedings of the IEEE 2011, 99, 1334– 1337. (35) Cybenko, G. Mathematics of control, signals and systems 1989, 2, 303–314. (36) Laraia, L.; McKenzie, G.; Spring, D. R.; Venkitaraman, A. R.; Huggins, D. J. Chem- istry & biology 2015, 22, 689–703. (37) Bakail, M.; Ochsenbein, F. Comptes Rendus Chimie 2016, 19, 19–27. (38) Jaeger, A. M.; Whitesell, L. Annual Review of Cancer Biology 2019, 3, 275–297. (39) Neckers, L.; Blagg, B.; Haystead, T.; Trepel, J. B.; Whitesell, L.; Picard, D. Cell Stress and Chaperones 2018, 23, 467–482. (40) Sattin, S.; Tao, J.; Vettoretti, G.; Moroni, E.; Pennati, M.; Lopergolo, A.; Morelli, L.; Bugatti, A.; Zuehlke, A.; Moses, M., et al. Chemistry–A European Journal 2015, 21, 13598–13608. (41) Paladino, A.; Woodford, M. R.; Backe, S. J.; Sager, R. A.; Kancherla, P.; Daneshvar, M. A.; Chen, V. Z.; Bourboulia, D.; Ahanin, E. F.; Prodromou, C., et al. Chemistry–A European Journal 2020, 26, 9459–9465. (42) Dobson, C. M. Annual review of biochemistry 2019, 88. (43) Vajda, S.; Hall, D. R.; Kozakov, D. Proteins: Structure, Function, and Bioinformatics 2013, 81, 1874–1884. (44) Pierce, B. G.; Hourai, Y.; Weng, Z. PloS one 2011, 6, e24657. (45) Gabb, H. A.; Jackson, R. M.; Sternberg, M. J. Journal of molecular biology 1997, 272, 106–120. (46) Tovchigrechko, A.; Vakser, I. A. Nucleic acids research 2006, 34, W310–W314. (47) De Vries, S.; Zacharias, M. Proteins: Structure, Function, and Bioinformatics 2013, 81, 2167–2174. (48) Moal, I. H.; Bates, P. A. International journal of molecular sciences 2010, 11, 3623– 3648. (49) Schindler, C. E.; de Vries, S. J.; Zacharias, M. Proteins: Structure, Function, and Bioinformatics 2015, 83, 248–258. (50) Pierce, B.; Weng, Z. Proteins: Structure, Function, and Bioinformatics 2007, 67, 1078–1086. (51) Fiorucci, S.; Zacharias, M. Proteins: Structure, Function, and Bioinformatics 2010, 78, 3131–3139. 79 (52) Camacho, C. J.; Zhang, C. Bioinformatics 2005, 21, 2534–2536. (53) Andrusier, N.; Nussinov, R.; Wolfson, H. J. Proteins: Structure, Function, and Bioin- formatics 2007, 69, 139–159. (54) Park, T.; Baek, M.; Lee, H.; Seok, C. Journal of computational chemistry 2019. (55) Feng, T.; Chen, F.; Kang, Y.; Sun, H.; Liu, H.; Li, D.; Zhu, F.; Hou, T. Journal of cheminformatics 2017, 9, 66. (56) De Vries, S. J.; van Dijk, A. D.; Krzeminski, M.; van Dijk, M.; Thureau, A.; Hsu, V.; Wassenaar, T.; Bonvin, A. M. Proteins: structure, function, and bioinformatics 2007, 69, 726–733. (57) Comeau, S. R.; Gatchell, D. W.; Vajda, S.; Camacho, C. J. Bioinformatics 2004, 20, 45–50. (58) Andreani, J.; Faure, G.; Guerois, R. Bioinformatics 2013, 29, 1742–1749. (59) Khashan, R.; Zheng, W.; Tropsha, A. Proteins: Structure, Function, and Bioinfor- matics 2012, 80, 2207–2217. (60) Zhou, H.; Zhou, Y. Protein science 2002, 11, 2714–2726. (61) Yang, Y.; Zhou, Y. Proteins: Structure, Function, and Bioinformatics 2008, 72, 793– 803. (62) Moreira, I. S.; Fernandes, P. A.; Ramos, M. J. 
Proteins: Structure, Function, and Bioinformatics 2007, 68, 803–812. (63) Scarabelli, G.; Morra, G.; Colombo, G. Biophysical journal 2010, 98, 1966–1975. (64) Mihalek, I.; Reš, I.; Lichtarge, O. Journal of molecular biology 2004, 336, 1265–1282. (65) Marchetti, F.; Capelli, R.; Rizzato, F.; Laio, A.; Colombo, G. The Journal of Physical Chemistry Letters 2019, 10, 1489–1497. (66) Basu, S.; Wallner, B. Bioinformatics 2016, 32, i262–i270. (67) Geng, C.; Jung, Y.; Renaud, N.; Honavar, V.; Bonvin, A. M.; Xue, L. C. BioRxiv 2018, 498584. (68) Pfeiffenberger, E.; Chaleil, R. A.; Moal, I. H.; Bates, P. A. Proteins: Structure, Func- tion, and Bioinformatics 2017, 85, 528–543. (69) Wang, X.; Terashi, G.; Christoffer, C. W.; Zhu, M.; Kihara, D. Bioinformatics 2020, 36, 2113–2118. (70) Pei, J.; Zheng, Z.; Merz Jr, K. M. Journal of chemical information and modeling 2019. 80 (71) Pei, J.; Zheng, Z.; Kim, H.; Song, L. F.; Walworth, S.; Merz, M. R.; Merz, K. M. Journal of Chemical Information and Modeling 2019. (72) Zheng, Z.; Merz Jr, K. M. Journal of chemical information and modeling 2013, 53, 1073–1083. (73) Wang, J.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Journal of computational chemistry 2004, 25, 1157–1174. (74) Han, L.; Yang, Q.; Liu, Z.; Li, Y.; Wang, R. Future medicinal chemistry 2018, 10, 1555–1574. (75) Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V., et al. Journal of machine learning research 2011, 12, 2825–2830. (76) Méndez, R.; Leplae, R.; De Maria, L.; Wodak, S. J. Proteins: Structure, Function, and Bioinformatics 2003, 52, 51–67. (77) May, A.; Zacharias, M. Proteins: Structure, Function, and Bioinformatics 2007, 69, 774–780. (78) Zhang, Q.; Feng, T.; Xu, L.; Sun, H.; Pan, P.; Li, Y.; Li, D.; Hou, T. Current drug targets 2016, 17, 1586–1594. (79) Vreven, T.; Moal, I. H.; Vangone, A.; Pierce, B. G.; Kastritis, P. L.; Torchala, M.; Chaleil, R.; Jiménez-Garcıa, B.; Bates, P. A.; Fernandez-Recio, J., et al. Journal of molecular biology 2015, 427, 3031–3041. (80) De Vries, S. J.; Schindler, C. E.; de Beauchêne, I. C.; Zacharias, M. Biophysical journal 2015, 108, 462–465. (81) Chen, R.; Mintseris, J.; Janin, J.; Weng, Z. Proteins: Structure, Function, and Bioin- formatics 2003, 52, 88–91. (82) Kumar, S.; Nussinov, R. ChemBioChem 2002, 3, 604–617. (83) Li, Y.; Zhang, X.; Cao, D. Scientific reports 2013, 3, 3271. (84) Kubo, R.; McQuarrie, D. A. Phys. Today 1965, 18, 74–75. (85) Amaro, R. E.; Baudry, J.; Chodera, J.; Demir, Ö.; McCammon, J. A.; Miao, Y.; Smith, J. C. Biophys. J. 2018, 114, 2271–2278. (86) Verma, J.; Khedkar, V.; Coutinho, E. Curr. Top. Med. Chem. 2010, 10, 95–115. (87) Husic, B. E.; Pande, V. S. J. Am. Chem. Soc. 2018, 140, 2386–2396. (88) Hawkins, P. C. J. Chem. Inf. Model. 2017, 57, 1747–1756. 81 (89) Pracht, P.; Bohle, F.; Grimme, S. Phys. Chem. Chem. Phys. 2020, 22, 7169–7192. (90) Wilson, S. R.; Cui, W.; Moskowitz, J. W.; Schmidt, K. E. J. Comput. Chem. 1991, 12, 342–349. (91) Hawkins, P. C.; Nicholls, A. J. Chem. Inf. Model. 2012, 52, 2919–2936. (92) Riniker, S.; Landrum, G. A. J. Chem. Inf. Model. 2015, 55, 2562–2574. (93) Devereux, C.; Smith, J. S.; Davis, K. K.; Barros, K.; Zubatyuk, R.; Isayev, O.; Roit- berg, A. E. J. Chem. Theory Comput. 2020, 16, 4192–4202. (94) Folmsbee, D.; Hutchison, G. Int. J. Quantum Chem. 2020, 121, e26381. (95) Smith, J. S.; Nebgen, B. T.; Zubatyuk, R.; Lubbers, N.; Devereux, C.; Barros, K.; Tretiak, S.; Isayev, O.; Roitberg, A. E. Nat. 
Commun. 2019, 10, 1–8.
(96) Das, S.; Edison, A. S.; Merz, K. M. Anal. Chem. 2020, 92, 10412–10419.
(97) Xu, D.; Tian, Y. Ann. Data. Sci. 2015, 2, 165–193.
(98) Solorio-Fernández, S.; Carrasco-Ochoa, J. A.; Martínez-Trinidad, J. F. Artif. Intell. Rev. 2019, 53, 907–948.
(99) Shao, J.; Tanner, S. W.; Thompson, N.; Cheatham, T. E. J. Chem. Theory Comput. 2007, 3, 2312–2334.
(100) Yongye, A. B.; Bender, A.; Martínez-Mayorga, K. J. Comput. Aided Mol. Des. 2010, 24, 675–686.
(101) Li, Y.; Dong, Z. J. Chem. Inf. Model. 2016, 56, 1205–1215.
(102) Kim, H.; Jang, C.; Yadav, D. K.; Kim, M.-h. J. Cheminform. 2017, 9, 21.
(103) Husic, B. E.; Pande, V. S. J. Chem. Theory Comput. 2017, 13, 963–967.
(104) Dubes, R. C. Pattern Recognit. 1987, 20, 645–663.
(105) Kelley, L. A.; Gardner, S. P.; Sutcliffe, M. J. Protein Eng. Des. Sel. 1996, 9, 1063–1065.
(106) Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E. J. Comput. Chem. 2004, 25, 1605–1612.
(107) Langfelder, P.; Zhang, B.; Horvath, S. Bioinformatics 2007, 24, 719–720.
(108) Blondel, V. D.; Guillaume, J.-L.; Lambiotte, R.; Lefebvre, E. J. Stat. Mech. 2008, 2008, P10008.
(109) Kabsch, W. Acta Cryst. Sect. A 1976, 32, 922–923.
(110) Harris, C. R. et al. Nature 2020, 585, 357–362.
(111) McKinney, W. In Proceedings of the 9th Python in Science Conference, ed. by van der Walt, S.; Millman, J., 2010, pp 56–61.
(112) Virtanen, P. et al. Nat. Methods 2020, 17, 261–272.
(113) Ward, J. H. J. Am. Stat. Assoc. 1963, 58, 236–244.
(114) R Core Team R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing, Vienna, Austria, 2017.
(115) Snedecor, G. W.; Cochran, W. G. Iowa State University Press, Ames, Iowa 1989.
(116) Wilcoxon, F. In Springer Series in Statistics; Springer New York: 1992, pp 196–202.
(117) Dunn, O. J. J. Am. Stat. Assoc. 1961, 56, 52–64.
(118) Watts, K. S.; Dalal, P.; Murphy, R. B.; Sherman, W.; Friesner, R. A.; Shelley, J. C. J. Chem. Inf. Model. 2010, 50, 534–546.
(119) ConfGen, Schrödinger LLC. Accessed 2021-02-03, 2020.
(120) Hjorth Larsen, A. et al. J. Phys.: Condens. Matter 2017, 29, 273002.
(121) Jacomy, M.; Venturini, T.; Heymann, S.; Bastian, M. PLoS One 2014, 9, e98679.
(122) Protein Preparation Wizard; Epik; Impact; Prime, Schrödinger LLC. Accessed 2021-02-03, 2020.
(123) Madhavi Sastry, G.; Adzhigirey, M.; Day, T.; Annabhimoju, R.; Sherman, W. J. Comput. Aided. Mol. Des. 2013, 27, 221–234.
(124) Frisch, M. et al. Gaussian 16 Revision C.01, Gaussian Inc. Wallingford CT, 2016.
(125) Ebejer, J.-P.; Morris, G. M.; Deane, C. M. J. Chem. Inf. Model. 2012, 52, 1146–1158.
(126) Tosco, P.; Stiefl, N.; Landrum, G. J. Cheminform. 2014, 6, 37.
(127) Dewar, M. J.; Zoebisch, E. G.; Healy, E. F.; Stewart, J. J. J. Am. Chem. Soc. 1985, 107, 3902–3909.
(128) Stewart, J. J. J. Comput. Aided Mol. Des. 1990, 4, 1–103.
(129) Smith, D. G. et al. J. Chem. Phys. 2020, 152, 184108.
(130) Wishart, D. S. et al. Nucleic Acids Res. 2017, 46, D608–D617.
(131) Fessel, J. P.; Oldham, W. M. Nicotine Adenine Dinucleotides: The Redox Currency of the Cell, 2018.
(132) Cukier, R. I. J. Phys. Chem. B 2015, 119, 3621–3634.
(133) Cumeras, R.; Figueras, E.; Davis, C. E.; Baumbach, J. I.; Gracia, I. Analyst 2015, 140, 1376–1390.
(134) Da Silva, R. R.; Dorrestein, P. C.; Quinn, R. A. Proc. Natl. Acad. Sci. USA 2015, 112, 12549–12550.
(135) Gao, X.; Ramezanghorbani, F.; Isayev, O.; Smith, J. S.; Roitberg, A. E. Journal of chemical information and modeling 2020, 60, 3408–3415.
(136) Das, S.; Tanemura, K. A.; Dinpazhoh, L.; Keng, M.; Schumm, C.; Leahy, L.; Asef, C. K.; Rainey, M.; Edison, A. S.; Fernández, F. M., et al. Journal of the American Society for Mass Spectrometry 2022, 33, 750–759.
(137) Mölder, F.; Jablonski, K. P.; Letcher, B.; Hall, M. B.; Tomkins-Tinch, C. H.; Sochat, V.; Forster, J.; Lee, S.; Twardziok, S. O.; Kanitz, A., et al. F1000Research 2021, 10.
(138) Ropp, P. J.; Kaminsky, J. C.; Yablonski, S.; Durrant, J. D. Journal of cheminformatics 2019, 11, 1–8.
(139) Manathunga, M.; Shajan, A.; Giese, T. J.; Cruzeiro, V. W. D.; Smith, J.; Miao, Y.; He, X.; Ayers, K.; Brothers, E.; Götz, A. W.; Merz, K. M. QUICK, version 22.03; University of California San Diego, CA, and Michigan State University, East Lansing, MI, 2022.
(140) Manathunga, M.; Miao, Y.; Mu, D.; Götz, A. W.; Merz Jr, K. M. Journal of Chemical Theory and Computation 2020, 16, 4315–4326.
(141) Miao, Y.; Merz Jr, K. M. Journal of chemical theory and computation 2015, 11, 1449–1462.
(142) Tanemura, K. A.; Das, S.; Merz Jr, K. M. Journal of Chemical Information and Modeling 2021, 61, 1647–1656.
(143) Ryu, S.; Kwon, Y.; Kim, W. Y. Chemical Science 2019, 10, 8438–8446.
(144) Gal, Y.; Ghahramani, Z. In International Conference on Machine Learning, 2016, pp 1050–1059.
(145) PNNL Collision Cross Section Database.
(146) Picache, J. A.; Rose, B. S.; Balinski, A.; Leaptrot, K. L.; Sherrod, S. D.; May, J. C.; McLean, J. A. Chemical science 2019, 10, 983–993.
(147) Gal, Y.; Hron, J.; Kendall, A. Advances in Neural Information Processing Systems 2017, 30.
(148) He, K.; Zhang, X.; Ren, S.; Sun, J. In Proceedings of the IEEE International Conference on Computer Vision, 2015, pp 1026–1034.
(149) Zhao, R.; Wang, M.; Chen, J.; Tong, Y.; Wei, G.-W. Bulletin of mathematical biology 2020, 82, 1–38.

APPENDIX

Table A1: Residue type assigned to each residue. Interaction type was assigned based on the type of residues interacting. '1' signifies the residue belongs to the category, while '0' signifies the residue is absent from the category. Flexible was defined as having a side chain with three consecutive freely rotating bonds between heavy atoms. Small was defined as having a side chain of one carbon or less.

residue anionic cationic polar nonpolar aromatic flexible small
Asp 1 0 0 0 0 0 0
Glu 1 0 0 0 0 1 0
Arg 0 1 0 0 0 1 0
Lys 0 1 0 0 0 1 0
His 0 0 1 0 1 0 0
Ser 0 0 1 0 0 0 1
Thr 0 0 1 0 0 0 0
Asn 0 0 1 0 0 0 0
Gln 0 0 1 0 0 1 0
Cys 0 0 1 0 0 0 1
Gly 0 0 0 1 0 0 1
Pro 0 0 0 1 0 0 0
Ala 0 0 0 1 0 0 1
Val 0 0 0 1 0 0 0
Ile 0 0 0 1 0 1 0
Leu 0 0 0 1 0 1 0
Met 0 0 0 1 0 1 0
Phe 0 0 0 1 1 0 0
Tyr 0 0 0 1 1 0 0
Trp 0 0 0 1 1 0 0

Table A2: Metabolites selected from the Human Metabolome Database for benchmarking the performance of AutoGraph. The name deposited in the HMDB was used without modification.
HMDB ID  name
HMDB0031506  4,4'-Diaminodibutylamine
HMDB0036938  1-O-Feruloylglucose
HMDB0033217  (2xi,6xi)-7-Methyl-3-methylene-1,2,6,7-octanetetrol
HMDB0037340  Cosmosiin
HMDB0032957  2-O-Feruloyltartronic acid
HMDB0032942  Momorcharaside A
HMDB0032921  Ethyl 2-furanpropionate
HMDB0037404  6-Glucopyranosylprocyanidin B1
HMDB0032752  R-2-Propenyl 1-propenesulfinothioate
HMDB0040659  Hoduloside VII
HMDB0031896  Erinacine D
HMDB0040706  7-Epi-12-hydroxyjasmonic acid glucoside
HMDB0031639  Glycerol 1-propanoate
HMDB0038265  3,3',5-Trihydroxy-4'-methoxy-6,7-methylenedioxyflavone 3-glucuronide
HMDB0031422  cis-Piceid
HMDB0031026  Calenduloside H
HMDB0030760  Pteroside B
HMDB0030693  Delphinidin 3,5-diglucoside
HMDB0030581  Perillanin
HMDB0040831  Rheinoside C
HMDB0030472  4-Ipomeanol
HMDB0030422  xi-Linalool 3-[rhamnosyl-(1→6)-glucoside]
HMDB0030406  Homomethionine
HMDB0037426  Astragalin 7-rhamnoside
HMDB0040960  Capsicoside B2
HMDB0040135  Hydroxydestruxin B
HMDB0039746  4''-O-Acetylafzelin
HMDB0029773  6S,9R-Dihydroxy-4,7E-megastigmadien-3-one 9-[apiosyl-(1→6)-glucoside]
HMDB0033391  C.I. Acid Green 5
HMDB0035475  Simmondsin 2'-ferulate
HMDB0039178  1,2-Digalloyl-beta-D-glucopyranose
HMDB0039110  (2S,3'S)-alpha-Amino-2-carboxy-5-oxo-1-pyrrolidinebutanoic acid
HMDB0034843  Acteoside
HMDB0039102  N-Carboxyacetyl-D-phenylalanine
HMDB0036339  25-Acetyl-6,7-didehydrofevicordin F 3-[glucosyl-(1→6)-glucoside]
HMDB0034361  Olitorin
HMDB0000094  Citric acid
HMDB0034081  Brassinolide
HMDB0034072  Gentianose
HMDB0033991  Daidzin
HMDB0036476  Araliasaponin I
HMDB0033865  Lotaustralin
HMDB0033839  Eudesmic acid
HMDB0039069  (R)-Byakangelicin 3'-glucoside
HMDB0033793  4-Heptenoic acid
HMDB0033750  D-Glycero-D-galacto-heptitol
HMDB0033741  Puddumin A
HMDB0036634  Phlorizin
HMDB0036933  1-Methyl 2-galloylgalactarate
HMDB0033508  Gonyautoxin IV
HMDB0039669  2-Hexyl-1,3-dioxolane-4-methanol
HMDB0033392  C.I. Acid Red 13
HMDB0030160  Natsudaidain
HMDB0041129  6''-O-(3-Hydroxy-3-methylglutaroyl)astragalin
HMDB0035745  Jujuboside B
HMDB0060833  N-Acetylserotonin glucuronide
HMDB0000210  Pantothenic acid
HMDB0013680  Caftaric acid
HMDB0000230  N-Acetylneuraminic acid
HMDB0060642  N-depropylpropafenone
HMDB0003339  D-Glutamic acid
HMDB0003265  Hesperidin
HMDB0003249  Rutin
HMDB0060797  6alpha,9alpha-Difluoroprednisolone-17-butyrate
HMDB0060813  Descarbonyl-lacosamide
HMDB0001846  Tetrahydrofolic acid
HMDB0038426  Butyl glucosinolate
HMDB0037892  Nomilinic acid 17-glucoside
HMDB0001451  (R)-lipoic acid
HMDB0001438  25-Hydroxyvitamin D2
HMDB0061054  Desacetylvinblastine
HMDB0060883  7-Hydroxy-R-acenocoumarol
HMDB0000625  Gluconic acid
HMDB0060890  Hydroxyterbinafine
HMDB0038502  Bisacurone B
HMDB0060921  3-Hydroxyibuprofen
HMDB0000201  L-Acetylcarnitine
HMDB0061108  3'-Hydroxybuspirone
HMDB0014720  Voriconazole
HMDB0014813  Tamoxifen
HMDB0037429  Astragalin
HMDB0041300  PRE
HMDB0041343  Chinenoside III
HMDB0029311  Deltonin
HMDB0041360  Prenyl arabinosyl-(1→6)-glucoside
HMDB0037478  4beta-(2-Aminoethylthio)epicatechin 3-gallate
HMDB0015548  Valganciclovir
HMDB0015542  Pivampicillin
HMDB0015303  Kanamycin
HMDB0015286  Gemifloxacin
HMDB0041515  Benzyl gentiobioside
HMDB0015132  Doxorubicin
HMDB0015054  Almotriptan
HMDB0000099  L-Cystathionine
HMDB0014950  Phenylbutazone
HMDB0014934  Candesartan
HMDB0000182  L-Lysine
HMDB0041665  3-Hydroxy-4-methoxyphenyllactic acid
HMDB0041707  Caffeic acid 4-O-glucuronide
HMDB0000448  Adipic acid
HMDB0039111  L-Acetopine
HMDB0038541  Goshonoside F5
HMDB0038751  Avenic acid B
HMDB0038992  Cistocardin
HMDB0038630  Niazimin
HMDB0038672  Lyciumin A
HMDB0038927  Isolariciresinol 9'-O-beta-D-glucoside
HMDB0038708  Citrusin C
HMDB0038562  8-Propanoylneosolaniol
HMDB0000067  Cholesterol
HMDB0039179  1,6-Digalloyl-beta-D-glucopyranose
HMDB0041159  Cyanidin 3-O-[b-D-Xylopyranosyl-(1→2)-[b-D-glucopyranosyl-(1→6)]-b-D-galactopyranoside]
HMDB0060935  4-trans-Hydroxyglipizide
HMDB0060910  8-Hydroxy-delta-9-THC
HMDB0060870  RPR112698
HMDB0060838  N-Desmethyl-p-hydroxyrosiglitazone
HMDB0060621  n-Demethylated piperazine
HMDB0060583  Trandolaprilat
HMDB0041728  (-)-Epicatechin 3'-O-glucuronide
HMDB0041552  2-[4-(3-Hydroxypropyl)-2-methoxyphenoxy]-1,3-propanediol 1-xyloside
HMDB0041513  Isopropyl apiosylglucoside
HMDB0041211  Sesaminol glucosyl-(1→2)-[glucosyl-(1→6)]-glucoside
HMDB0040937  Lactitol
HMDB0039293  Medicoside J
HMDB0040623  3'-(6''-Galloylglucosyl)-phloroacetophenone
HMDB0040575  Butyl formate
HMDB0040418  Majonoside R2
HMDB0040181  Fagopyritol A2
HMDB0040122  Madecassoside
HMDB0039640  Ustiloxin A
HMDB0039509  trans-p-Coumaric acid 4-glucoside
HMDB0039488  Chrysophanol 8-gentiobioside
HMDB0039458  Di-2-propenyl heptasulfide
HMDB0039332  2''-(6''-p-Coumaroylglucosyl)quercitrin
HMDB0038301  Apigenin 7-[feruloyl-(→2)-glucuronyl-(1→2)-glucuronide] 4'-glucuronide
HMDB0034249  Caryoptosidic acid
HMDB0038143  Calamin
HMDB0038042  Isovitexin 2''-O-(6'''-feruloyl)glucoside
HMDB0030388  9-(beta-D-Ribofuranosyl)zeatin
HMDB0030246  Narceine
HMDB0030104  Humulinic acid A
HMDB0029548  Diosmin
HMDB0029517  Limocitrin 3-glucoside
HMDB0029496  Fukiic acid
HMDB0029408  Betanin
HMDB0029258  Luteolin 6-C-glucoside 8-C-arabinoside
HMDB0015570  Fusidic Acid
HMDB0015169  Procainamide
HMDB0015002  Tacrolimus
HMDB0014911  Etoposide
HMDB0014906  Olopatadine
HMDB0014412  Cefmenoxime
HMDB0014344  Erythromycin
HMDB0010358  17-beta-estradiol 3-sulfate-17-(beta-D-glucuronide)
HMDB0006270  Linoelaidic acid
HMDB0003164  Chlorogenic acid
HMDB0001876  Epinephrine sulfate
HMDB0001852  all-trans-Retinoic acid
HMDB0001542  N-Acetyllactosamine
HMDB0001262  Maltotriose
HMDB0001220  Prostaglandin E2
HMDB0000570  Coproporphyrin III
HMDB0000488  (4E,15Z)-Bilirubin
HMDB0000482  Caprylic acid
HMDB0000446  N-alpha-Acetyl-L-lysine
HMDB0000417  3D,7D,11D-Phytanic acid
HMDB0000193  Isocitric acid
HMDB0030542  Neohesperidin dihydrochalcone
HMDB0031876  Prenyl glucoside
HMDB0031985  gamma-Glutamyl-S-methylcysteine
HMDB0034945  Stevioside
HMDB0037977  Cyanidin 3-galactoside
HMDB0037960  Quercetin 3-(2G-glucosylrutinoside)
HMDB0037635  xi-1-Ethoxy-1-butoxyethane
HMDB0037537  Quercetin 3-glucosyl-(1→2)-galactoside
HMDB0037499  3'-N'-Acetylfusarochromanone
HMDB0037491  Liquiritin apioside
HMDB0037427  Kaempferol 3-sophoroside 7-rhamnoside
HMDB0037415  Isomargaritene
HMDB0037332  8-Hydroxyluteolin 4'-methyl ether 8-glucoside
HMDB0036484  Trigoneoside Xb
HMDB0035872  Oleuropein
HMDB0035857  Ginsenoside A2
HMDB0035778  Ginsenoside Rd
HMDB0034862  4',6'-Dihydroxy-2'-methoxyacetophenone 6'-glucoside
HMDB0032590  Zingerone
HMDB0034649  Soyasaponin I
HMDB0034613  Torachrysone 8-beta-gentiobioside
HMDB0033908  (±)-2-Heptanol
HMDB0033838  Diethyl succinate
HMDB0033597  4-(4-Methoxyphenyl)-2-butanone
HMDB0033578  Aquifoliunine EIII
HMDB0033463  Subaphylline
HMDB0033320  Yessotoxin
HMDB0033282  8-Hydroxypinoresinol 4-glucoside
HMDB0033226  (1S,4R)-10-Hydroxyfenchone glucoside
HMDB0033201  9-Hydroxydecanoic acid
HMDB0033075  Methyl 1-(methylsulfinyl)propyl disulfide
HMDB0032796  Methyl 3-(2,3-dihydroxy-3-methylbutyl)-4-hydroxybenzoate
HMDB0061142  Fenoprofen glucuronide

Benchmark Selection

CCS values were collected from various databases. Only entries with no more than 30 heavy atoms were retained. The Morgan fingerprint was calculated for all entries with 2048 bits and a connectivity of 3. For each adduct, the Tanimoto similarity was calculated comprehensively between all entries. The resulting similarity matrix was then hierarchically clustered using the Ward method, and a threshold was applied to produce 20 distinct clusters. From each cluster, a centroid compound was selected as the compound with the greatest within-cluster weighted degree. The benchmark set was manually curated thereafter from the centroids, removing structures which lacked stereochemical information, to produce the final benchmark set.

Figure A1: Density of various types of interactions applied to the coefficients ordered in decreasing order. The coordinates for coefficient values 0.1, 0.0, and −0.1 are specified.

Figure A2: Distribution of test accuracies for LR classifiers trained on 0.9 of the dataset, employing various thresholds of top features. Light gray dots are the raw data. Mean (×) and median (•) for each fraction are reported.

Figure A3: Comparison of success rate (top) and modified success rate (bottom) of the LR scoring function using various coefficient thresholds and near-native structures.

Table A3: Representative structures were selected from available CCS databases.
name | CAS | charge | experimental CCS (Å²)
guanosine | 118-00-3 | 1 | 165
Lobendazole | 6306-71-4 | 1 | 141.7
L-tryptophan | 73-22-3 | 1 | 150.3
ethanone | 552-41-0 | 1 | 124.9
L-phenylalanine | 63-91-2 | 1 | 139.8
norfloxacin | 70458-96-7 | 1 | 184.2
acacetin | 480-44-4 | 1 | 163.7
L-isoleucine | 73-32-5 | 1 | 127.6
theophylline | 58-55-9 | 1 | 143.1
sulfapyridine | 144-83-2 | 1 | 151.6
R-proline | 344-25-2 | 1 | 125.4
pyridoxine | 65-23-6 | 1 | 134.7
adrenic acid | 28874-58-0 | 1 | 189.4
4-methylpentanoic acid | 646-07-1 | −1 | 127.5
L-tryptophan | 73-22-3 | −1 | 158.6
salicylic acid | 69-72-7 | −1 | 120.7
3-methoxy-L-tyrosine | 300-48-1 | −1 | 150.9
1-methylxanthine | 6136-37-4 | −1 | 140.9
L-histidine | 71-00-1 | −1 | 128.5
R-proline | 344-25-2 | −1 | 122.5
biochanin A | 491-80-5 | −1 | 170.4
R-hesperetin | 24604-97-5 | −1 | 178.5

Figure A4: Comparison of performance of the LR scoring function using various coefficient thresholds via distributions of native rank (top), first RMSD (middle), and first decoy RMSD (bottom). Mean (×) and median (•) are reported for each distribution.

Table A4: Test set metabolites with independently measured CCS values (Å²) in monoprotonated/monodeprotonated modes.

name | [M+H]+ | [M−H]−
folic acid | 193.94 | 187.30
3-hydroxybutyric acid | - | 129.91
oxypurinol | 124.88 | 116.95
L-threonine | 123.88 | 114.24
4-imidazoleacrylic acid | 128.89 | 114.43
D-arginine | 131.73 | 130.52
Fmoc-Val-OH | 172.99 | 185.74
α-cyclodextrin | 272.40 | 287.21
cyclo(Leu-Gly) | 138.28 | -
cis-urocanic acid | 124.25 | 115.90
azithromycin | 263.63 | 269.66
1,5-diaminonaphthalene | 137.43 | -
glutathione | 167.94 | 157.80
3-indolebutyric acid | 139.81 | 146.95
α-cyano-4-hydroxycinnamic acid | 140.43 | 129.80
3-hydroxycoumarin | - | 122.87
Fmoc-Pro-OH | 168.75 | -
artesunate | - | 183.43

Figure A5: The composition of the top 0.1 salient features (1403 features) was summarized using the categories defined in Table A1. The appearance of interactions in each category was counted (top). However, the number of features per category was not uniform. The fold enrichment of each category was therefore summarized to account for the heterogeneity in the number of features per category (bottom). The fold enrichment is the quotient of the observed fraction and the expected fraction, and describes the factor by which the category is overrepresented compared to sampling by chance.

Figure A6: Comparison of scoring functions via distributions of native rank (top), first RMSD (middle), and first decoy RMSD (bottom). Mean (×) and median (•) are reported for each distribution.

Figure A7: The AutoGraph protocol is summarized in a flowchart. In addition to the intermediate files shown, the output directory contains "cluster summary.csv" and "communityStats.csv". "cluster summary.csv" provides conformer-wise cluster assignment and centroid selection. "communityStats.csv" describes cluster-wise maximum and mean RMSD values.

Figure A8: NAD+ exhibited a positive, monotonic relationship between actual and locally weighted estimated energy values with or without a weight threshold applied to the conformational graph (n = 785, $\rho_{\mathrm{with}}$ = 0.845, $\rho_{\mathrm{without}}$ = 0.842). The line x = y is plotted.

Figure A9: The distributions of within-cluster mean RMSD minus global mean RMSD are reported for each metabolite assessed by the various conformational clustering algorithms (n = 200 per algorithm). Abbreviations: AutoGraph (AG), Dynamic Tree Cut/Ward hierarchical clustering (DTC), NMRCLUST (NC), Representative Conformer K-means (RCK), Ward hierarchical clustering with 18 clusters (fixed).
Figure A10: The distributions of the discrepancy between the variance of energy within clusters and overall ($\sigma_{\mathrm{within}}^2 - \sigma^2$) are reported for each metabolite assessed by the various conformational clustering algorithms (top). The plot leaves out data points beyond the values shown for clarity. The mean is indicated for each distribution (×), and was calculated including the data points beyond the visualized range. The distribution of the number of clusters detected is shown for each algorithm as well (bottom). Abbreviations: AutoGraph (AG), Dynamic Tree Cut/Ward hierarchical clustering (DTC), NMRCLUST (NC), Representative Conformer K-means (RCK), Ward hierarchical clustering with 18 clusters (fixed).

Figure A11: The fast and standard workflows were evaluated in terms of mean relative error against a negative control. The distributions were grouped by the charge of the ion considered. Each point corresponds to the relative error of exactly one structure included in the benchmark set.