NETWORK EMBEDDINGS FOR DATA CLUSTERING, TRANSITION STATE IDENTIFICATION, AND ENERGY LANDSCAPE ANALYSIS

By

Paula Mercurio

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Mathematics – Doctor of Philosophy

2022

ABSTRACT

Many chemical and biochemical systems can be intuitively modeled using networks. Due to the size and complexity of many biochemical networks, we require tools for efficient network analysis. Of particular interest are techniques that embed network vertices into vector spaces while preserving important properties of the original graph. In this dissertation, we investigate several aspects of node embedding, propose a novel method of generating node embeddings, and explore applications to biochemical systems. We introduce a new method for generating low-dimensional node embeddings for directed graphs using random walk sampling and demonstrate the usefulness of this method for identifying transition states of stochastic chemical reaction systems, detecting relationships between nodes, and studying the structure of the original network. In addition, we propose an efficient scheme for numerical implementation of network embedding based on deterministic computations of commute times rather than random walk trials. We develop a novel implementation of stochastic gradient descent (SGD) based on a low-dimensional sparse approximation of the original random walk on the graph, and show that this approach can improve the performance of node embeddings. This method can be further extended for entropy-sensitive adaptive network embedding by incorporating principles from metadynamics and hierarchical network embedding, allowing for applications to the analysis of molecular structures. By adjusting the edge weights of the network by a Gaussian term, similarly to the metadynamics approach, we ensure that areas that have already been explored extensively by the random walk (i.e., the edges with the largest weights) will be de-emphasized over time, allowing additional iterations of the embedding process to reveal details about other areas of the graph. We show that this approach lends itself well to systems that are influenced by entropy or temperature effects and biochemical systems where the potential energy landscape depends on the system's configuration at a given time, either by itself or in conjunction with transition path theory. We demonstrate the effectiveness and performance of each of our methods on several datasets and biochemical examples.

For Mom and Dad.

ACKNOWLEDGEMENTS

First, I would like to thank my advisor, Di Liu, whose support and advice have made this entire work possible. His guidance has been indispensable throughout my time in graduate school, both with respect to our research and the development of my career. I am also very grateful to Guo-Wei Wei, Jiayu Zhou, Chichia Chiu, and Christina Chan, for the advice and help they have given me as my committee members over the years. I would also like to thank the entire math department at MSU, for giving me a wonderful graduate school experience. I would like to thank my family, particularly my parents, Tillie and Dave Mercurio, whose unwavering encouragement has always given me support I can count on.
Thank you to my sister Erin, who led the way and who has given me excellent advice throughout my time in graduate school, and to my brother Ryan, whose creativity and humor have helped make this journey much more fun. Finally, I would like to thank my partner and best friend Whitney, whose companionship has always been vital. I'm looking forward to what comes next.

I would also like to acknowledge that parts of this work were supported by NSF-DMS 1720002, and that some of the databases of local minima and transition states were computed using the Pele, PATHSAMPLE, and disconnectionDPS software packages. The molecular graphics included in this research were produced with UCSF Chimera, developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, with support from NIH P41-GM103311. The content from Chapter 1 is from [23]. The published version is available at AIMS Press via: https://doi.org/10.3934/mbe.2021046.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
LIST OF ALGORITHMS

CHAPTER 1 IDENTIFYING TRANSITION STATES OF CHEMICAL KINETIC SYSTEMS USING NETWORK EMBEDDING TECHNIQUES
1.1 Introduction
1.2 Background
1.2.1 Network embedding for directed graphs
1.2.1.1 Random walk approaches
1.2.1.2 Neural network approaches
1.2.2 Transition path theory
1.3 Identifying transition states using network embedding
1.4 Examples and results
1.4.1 Diffusion process with energy barrier
1.4.2 Diffusion process with barriers
1.4.3 Three-dimensional toggle switch
1.4.4 Stochastic virus model
1.4.5 E. Coli σ-32 heat response circuit
1.5 Conclusions

CHAPTER 2 EFFICIENT NETWORK EMBEDDING BASED ON SPARSE APPROXIMATION OF A RANDOM WALK
2.1 Introduction
2.2 Background
2.2.1 The Skipgram algorithm and Skipgram-based embeddings
2.2.2 Network Embeddings with Commute times
2.2.3 Computing sparse approximations of Green functions via Schultz method
2.3 An implementation of the SkipGram model based on sparse approximation and commute times
2.3.1 Modified Johnson-Lindenstrauss Lemma
2.3.2 Relationship with diffusion maps
2.4 Numerical Details of Sparse Approximation
2.4.1 Diffusion Wavelets
2.4.2 SVD
2.5 Examples and performance comparisons
2.5.1 Butterfly dataset
2.5.2 Cora dataset
2.5.3 Amazon co-purchase photo dataset
2.5.4 Email database dataset
2.5.5 COVID-19 lung image classifier
2.6 Conclusion

CHAPTER 3 ADAPTIVE NETWORK EMBEDDING WITH METADYNAMICS
3.1 Introduction
3.2 Background
3.2.1 Network Embedding
3.2.1.1 Embedding Method
3.2.2 Energy Landscapes
3.2.3 Metadynamics
3.3 Lennard-Jones clusters with metadynamics
3.3.1 Adjacency matrix-based embeddings
3.3.1.1 8 Atom Lennard-Jones Embeddings
3.3.1.2 Lennard-Jones 38 Cluster Embeddings
3.4 Application: Human Telomere Folding
3.4.1 Combining Metadynamics with Transition Path Theory
3.5 Conclusions

APPENDIX
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1 Performance and clustering score for the karate club graph embeddings in Figure 2.3.
Table 2.2 Comparisons between three-dimensional diffusion wavelet-based embeddings, truncated SVD-based embeddings, and the commute time embeddings introduced in [19], on the butterfly dataset [43, 45]. Runtime represents the time required to compute sparse approximations for powers of T and G via the given algorithm.
Table 2.3 Butterfly dataset.
Table 2.4 Cora dataset.
Table 2.5 Amazon Co-Purchase Dataset.
Table 2.6 Email Dataset.
Table 2.7 Lung X-Ray Dataset.
Table 3.1 Commute times between nodes, 8-atom Lennard-Jones cluster.
LIST OF FIGURES

Figure 1.1 Diffusion process with energy barrier: plot assigning colors to each node in the discretized domain Ω, according to each node's similarity to the starting node A. Higher similarity values correspond to higher dot products between the two-dimensional embeddings of the nodes. The left plot corresponds to a parameter value of ε = 0.01, and the right plot corresponds to ε = 1.
Figure 1.2 Feature representations of nodes with high similarity to the reactant state A for the 2D diffusion process with energy barrier, with ε = 0.01.
Figure 1.3 Diffusion process with entropic bottleneck: surface plot of the similarity function between the reactant state A and nodes in Ω. Here, the discretization with h = 0.1 is used.
Figure 1.4 Feature representations of nodes with high similarity to the reactant state A for the entropic diffusion process.
Figure 1.5 Surface plot of the similarities to A of grid points in the domain Ω for the 3-D Toggle switch model. Results are for a 15 × 15 × 15 discretization (h = 3).
Figure 1.6 Feature representations of nodes with high similarity to the reactant state A for the 3D toggle switch.
Figure 1.7 Feature representations of nodes with high similarity to the initial state A = (0, 0, 0) for the virus propagation example.
Figure 1.8 Feature representations for the virus propagation example, with a logarithmic scale on the x-axis.
Figure 1.9 Feature representations of nodes with high similarity to the initial state, when initial state is taken to be A = (600, 600, 600, 600, 1500, 1500, 1500), for the E. Coli heat response circuit model.
Figure 2.1 a) The karate club network (created via [30]). b) Commute time embeddings found via the proposed method. Colors of points are determined by the true classification of each node.
Figure 2.2 2 and 3-dimensional embeddings of the karate club network, based on the first columns of the commute time embedding matrix from [19].
Figure 2.3 2 and 3-dimensional embeddings of the karate club network, created with the proposed method.
Figure 2.4 Three-dimensional and two-dimensional node embeddings for the butterfly species network, generated using the algorithm described in Section 3. Embeddings in the top row use sparse approximations produced via diffusion wavelets, and those in the bottom row use sparse approximations produced via truncated SVD. Each color represents the species to which that node belongs.
Figure 2.5 Three-dimensional node embeddings for the chest x-ray classifier, generated via the proposed method. Each color represents the diagnosis (normal, COVID-19, or viral pneumonia) associated with that x-ray. (a) shows the result of the method on a dataset consisting of 60 x-rays from each category, while (b) shows the result for a smaller dataset (20 from each category).
Figure 2.6 10 different trials of the embeddings for the chest x-ray classifier dataset. For each trial, the cross-entropy loss over 10 iterations of SGD is recorded.
In all trials, the cross-entropy loss initially drops steeply, converging in 5 iterations.
Figure 3.1 Disconnectivity tree and metadynamics-based embeddings for the Lennard-Jones cluster with 8 atoms. Left: Disconnectivity tree of all local minima. Right: Embeddings for the local minima after applying the adjustment in 3.1. Closely related minima here have very similar or identical embeddings, for example both yellow minima are embedded at the yellow point on the right. In all figures, color scheme denotes the potential energy of the system at the configuration represented by that node. Dark blue denotes low energies (e.g. the global minimum), and red denotes the highest-energy states.
Figure 3.2 Metadynamics-based embeddings for the 8-atom cluster at the temperatures T = 0.08 (left) and T = 1 (right). Color scheme denotes the potential energy of the system at the configuration represented by that node. Dark blue denotes low energies (e.g. the global minimum), and red denotes the highest-energy states.
Figure 3.3 The 8-atom Lennard-Jones network of local minima. Edge lengths are proportional to commute times. Node colors are chosen to match those in Figures 3.1 (a) and (b).
Figure 3.4 Disconnectivity tree for the 38-atom Lennard-Jones cluster.
Figure 3.5 Hierarchical embeddings for the Lennard-Jones cluster with 38 atoms. Pictured are the embeddings for the local minima originally clustered at the origin (with the global minimum) after applying metadynamics (equation 3.1). Color scheme denotes commute time from the global minimum, with dark blue denoting short distances, and red denoting the furthest distances.
Figure 3.6 Embeddings of the local minima of the 38-atom Lennard-Jones cluster with potential energies less than -170.9. The left figure shows the original embeddings of all 163 minima; the right figure shows stage 2 embeddings (embeddings of the nodes clustered around the origin, after adjusting the adjacency matrix according to the metadynamics adjustment in equation (3.1)). Color scheme denotes commute time from the global minimum, with dark blue denoting short distances, and red denoting the furthest distances.
Figure 3.7 Metadynamics-based embeddings for the 38-atom cluster at the temperatures T = 0.08 (left) and T = 1 (right). Color scheme denotes the commute time distance from the global minimum. Dark blue denotes low distances (e.g. the global minimum), and red denotes the highest.
Figure 3.8 The four-strand G-quadruplex structure (PDB structure 1KF1), with guanine nucleotides colored green. Image produced with Chimera [31].
Figure 3.9 Embeddings of the local minima network for the human telomere sequence, based on the adjacency matrix as in section 3.1 and the metadynamic adjustment in 3.1. Colors of embeddings denote distance from the global minimum, with dark blue denoting the shortest distances and red the furthest.
Figure 3.10 Two local minima from the telomere's energy landscape, with potential energies of −68.0 and −69.7 respectively. Images produced with Chimera [31].
Figure 3.11 Hierarchical embeddings of the local minima network for the 8 atom LJ cluster, based on a TPT-based subnetwork consisting of the nodes clustered around the global minimum (in dark blue), and the metadynamic adjustment in 3.1. Colors of embeddings denote potential energy at each configuration, with dark blue denoting the lowest energies and red the highest.
Figure 3.12 Hierarchical embeddings of the local minima network for the human telomere sequence, based on a TPT-based subnetwork consisting of the nodes clustered around the global minimum (in dark blue), and the metadynamic adjustment in 3.1. Colors of embeddings denote commute time distance from the global minimum, with dark blue denoting the shortest distances and red the furthest.

LIST OF ALGORITHMS

Algorithm 1.1 A random walk simulation method for directed graphs, given the transition matrix T.
Algorithm 2.1 SkipGram Model (stochastic gradient descent).
Algorithm 2.2 Truncated SVD: An approximate singular value decomposition of the N × N matrix T using the largest αN singular values.
Algorithm 2.3 Commute Time-based Network Embeddings with Truncated SVD.

CHAPTER 1
IDENTIFYING TRANSITION STATES OF CHEMICAL KINETIC SYSTEMS USING NETWORK EMBEDDING TECHNIQUES

1.1 Introduction

Networks provide a natural framework for many chemical and biochemical systems, including chemical kinetics and dynamical interactions between biomolecules. The possible stages of a reaction, for example, can be viewed as vertices in a directed graph. Frequently, real-world networks are high-dimensional and complex in structure, which leads to difficulties in interpretation and analysis. In order to take full advantage of the structural information contained in such networks, efficient and robust methods of network analysis are essential for producing low-dimensional representations while retaining key information about the nodes of the graph.

Network analysis is especially useful for understanding behaviors of nonlinear chemical processes such as biochemical switches. Biochemical switches are systems of chemical reactions with multiple lower-energy steady states, or metastable states. Without stochastic perturbations, the system would spend all its time at the metastable states, where energy is minimized, while the presence of stochastic noise in the dynamics can lead to transitions between metastable states. Difficulties arise in the simulation of such systems as a result of the multiple time scales at work; transitions between steady states are typically rare, but occur rapidly. The different possible states of the reaction, given by the varying concentrations of reactant and product species, can be viewed as the vertices of a weighted directed graph. For example, we may have an 'initial' state where only reactant species exist, a 'final' state where only product species remain, and any number of transitional states each representing a particular concentration of the different species. The edge weights of the graph can then be determined from the likelihoods of transitioning between any given pair of states. The progression of the reaction over time can then be viewed as a Markov jump process, where the jump probabilities are represented by the weights on the edges between each pair of nodes.
Then, methods for network analysis can be used to identify interactions and pathways between the components of the biochemical switch. One objective of this paper is to suggest a method to analyze this Markov process using node embedding techniques. This paper will focus on the A → B problem: A and B are taken as the reactant and product states, respectively. The aim of this paper is to investigate the mechanism by which this transition occurs.

Recently, methods for analyzing networks through feature representation and graph embedding have received increasing attention. For an overview on this subject, a number of recent review papers are available [17, 5]. While much of this research has been focused on undirected graphs, the directed case has also been investigated, for example in [9, 7, 51]. In particular, several methods have been proposed that utilize random walks for node embedding. Node embedding (or feature learning) methods aim to represent each node of a given network as a vector in a low-dimensional continuous vector space, in such a way that information about the nodes and relationships between them is retained. Specifically, nodes that are similar to each other according to some measure in the original graph will also have similar representations in the embedding space. Usually, this means that the embedding will be chosen in a way that maximizes the inner products of embedding vectors for embedded nodes that are similar to each other. Compared with the original complex high-dimensional networks, these low-dimensional continuous node representations have the benefit of being convenient to work with for downstream machine learning techniques. Compared with alternative methods that embed edges or entire graphs instead of nodes, node embeddings are more adaptive for numerous tasks, including node classification, clustering, and link prediction [5]. Node classification is a process by which class labels are assigned to nodes based on a small sample of labeled nodes, where similar nodes have the same labels. Clustering algorithms group the representations of similar nodes together in the target vector space. Link prediction seeks to predict missing edges in an incomplete graph.

The recent interest in feature learning has led to a collection of different node embedding methods, which can be broadly classified [17] by the analytical and computational techniques employed, such as matrix factorization, random walk sampling, and deep learning. The matrix factorization based methods generate embeddings by first using a matrix to represent the relationships between nodes, such as the Laplacian matrix in Laplacian eigenmaps [3], or the adjacency matrix in locally linear embedding (LLE) [33] and graph factorization (GF) [1]. Though the above methods focus on the case of undirected graphs, similar techniques for the directed case have also been suggested [9], where the development of the Laplacian matrix, a combinatorial Laplacian matrix, and a Cheeger inequality for directed graphs allows the above approaches to be extended. Factorization based methods such as eigenvalue decomposition can only be applied to certain matrices (e.g., positive semidefinite ones), and additional difficulties may arise with scaling for large data sets.

Random walk methods are often used for large graphs, or graphs that cannot be observed in their entirety. The general idea involves first simulating random walks on a network, then using the output to infer relationships between nodes.
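To make this general idea concrete (this is an illustrative sketch, not the exact sampling procedure of any particular method discussed here), the following Python snippet samples weight-biased random walks from a small directed graph stored as a dictionary of neighbor weights; the graph, the parameter values, and the function name are assumptions made only for this example.

```python
import random

def sample_walks(adj, walk_length=9, num_walks=100, seed=0):
    """Sample weight-biased random walks on a (directed) graph.

    adj maps each node to a dict {neighbor: edge weight}.  Co-occurrence of
    two nodes within the same short walk is the notion of similarity used by
    DeepWalk-style embedding methods.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adj:
            walk = [start]
            for _ in range(walk_length - 1):
                nbrs = adj[walk[-1]]
                if not nbrs:          # dead end: stop this walk early
                    break
                nodes = list(nbrs)
                weights = [nbrs[v] for v in nodes]
                walk.append(rng.choices(nodes, weights=weights)[0])
            walks.append(walk)
    return walks

# toy usage: a directed triangle with unequal edge weights
adj = {"a": {"b": 2.0, "c": 1.0}, "b": {"c": 1.0}, "c": {"a": 1.0}}
print(sample_walks(adj, walk_length=4, num_walks=1))
```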
Random walk methods can be used to study either undirected or directed graphs, although much of the previous work has focused on the undirected case. For example, DeepWalk [29] uses short unbiased random walks to find similarities between nodes, with node pairs that tend to co-occur in the same random walks having higher similarities. This method performs well for detecting homophily – similarity based on node adjacency and shared neighbors. Similar methods, such as node2vec [18], use biased random walks to capture similarity based on both homophily and structural equivalence – similarities in the nodes' structural roles in the graph. Both of these methods specifically address undirected graphs. A similar approach designed for the directed case is proposed in [7], which uses Markov random walks with transition probabilities given by ratios of edge weights to nodes' out-degrees, together with the stationary distribution of the walks, to measure local relationships between nodes.

Theoretical frameworks for the study of transition events of biochemical processes include Transition State Theory (TST), Transition Path Sampling (TPS), Transition Path Theory (TPT) and Forward Flux Sampling (FFS). The main idea of TST is that in order for the system to move from the reactant state to the product state, the system must pass through a saddle point on the potential energy surface, known as a Transition State [48]. The TPS technique uses Monte Carlo sampling of transition paths to study the full collection of transition paths of a given Markov process [4]. The basic idea, similar to importance sampling, is to perform a random walk in trajectory space, biased so that important regions are visited more often. Transition Path Theory (TPT) studies reactive trajectories of a system by considering statistical properties such as rate functions and probability currents, which measure the average rates of transitions between states [47]. Forward flux sampling [2], designed specifically to capture rare switching events in biochemical networks, uses a series of interfaces between the initial and final states to calculate rate constants and generate transition paths. This method has the advantage that knowledge of the phase space density is not required. For the purpose of this paper, we will be using TPT, but other such techniques can fit into the method presented here as well.

The contributions of this paper are a random walk network embedding approach for node classification and clustering on directed graphs, as well as a procedure for identifying transition states, in particular in the presence of entropy effects. The method presented reduces the dimension of networks, thereby allowing analysis and interpretation of the system. The remainder of this paper will be organized as follows. First, in Section 2 we will introduce some background regarding feature learning on networks, and provide a brief overview of some basic principles from TPT. Then Section 3 will introduce a method for node classification for directed graphs, based on random walk node embedding. Finally, in Section 4, we will study several examples, showing the effectiveness of the method on identifying transition states of entropic systems.

1.2 Background

In this paper, we focus on the mechanism by which a system such as a biochemical switch passes through its energy landscape following a reactive trajectory.
A trajectory is said to be reactive if it leaves the reactant state A and later arrives at the product state B without first returning to state A. Such a trajectory can be viewed as a sequence of transitions between metastable states where the energy is locally minimized. In particular, we treat these sequences as Markov jump processes, and study the probability space of these reactive trajectories in order to identify and understand the transition events of the system. Though this paper will be utilizing the techniques of Transition Path Theory for this purpose, it is also possible to adopt other frameworks, such as Transition Path Sampling, Forward Flux Sampling, etc., to investigate the transition events of a system.

Consider a Markov jump process on a countable state space S. Let L represent the infinitesimal generator of the process, as defined in [24]. In other words, L is an |S| × |S| matrix with entries L_{uv} such that for u, v ∈ S, u ≠ v, L_{uv}Δt + o(Δt) represents the probability that a process that is in state u at time t will jump to state v during the time interval [t, t + Δt]. Then the entries of L are transition rates, satisfying
$$L_{uv} \geq 0 \quad \forall\, u, v \in S,\ u \neq v, \qquad \sum_{v \in S} L_{uv} = 0 \quad \forall\, u \in S. \tag{1.1}$$
Let {X(t)}_{t∈ℝ} represent an equilibrium sample trajectory of the Markov process, such that {X(t)}_{t∈ℝ} is right-continuous with left limits. At time t, let the probability distribution of this process be denoted by μ(t) := (P(X(t) = u))^T_{u∈S}. Then the time evolution of μ(t) follows the forward Kolmogorov equation (or master equation)
$$\frac{d\mu^T}{dt} = \mu^T L, \qquad t \geq 0.$$
Denote the time-reversed process by {X̃(t)}_{t∈ℝ}, and the infinitesimal generator of this process by L̃. The stationary probability distribution π = (π_v)_{v∈S} of both {X(t)}_{t∈ℝ} and {X̃(t)}_{t∈ℝ} is given by the solution to
$$0 = \pi^T L.$$

1.2.1 Network embedding for directed graphs

Networks can have large numbers of vertices and complex structures, which can lead to challenges in network analysis. To overcome these difficulties and improve understanding of these networks, the nodes of a given network can be embedded into a low-dimensional vector space, a process called feature learning. The techniques used in this paper to investigate transition states and pathways combine neural network-based and random walk-based methods for learning latent representations of a network's nodes, as discussed in [18, 29, 7]. The idea behind these methods is that the nodes in a given network G = G(V, E) can be embedded into a low-dimensional vector space in such a way that nodes similar to each other according to some measure in the original graph will also have embeddings similar to each other in the embedding space.

Feature learning, or node embedding, requires three main components: an encoder function, a definition of node similarity, and a loss function. The encoder function is the function that maps each node to a vector in the embedding space. To find the optimal embeddings, the encoder function can be chosen by minimizing the loss function so that the similarity between nodes in the original network corresponds as closely as possible to the similarity between their vector representations in the continuous space. Node similarity can have different interpretations, depending on the network and the task to be accomplished. For example, nodes that are connected, share neighbors, or share a common structural role in the original graph can be considered similar, and should therefore be embedded close to each other in the vector space.
For the embedded vectors, a common approach is to use inner products as the similarity measure in the feature space; that is, if u and v are two nodes in a graph with representations e(u) and e(v) in a low-dimensional vector space, then e(u)^T e(v) should approximate the similarity between the nodes u and v in the original graph.

1.2.1.1 Random walk approaches

One technique for finding node embeddings is through the use of random walks. This technique has been applied to undirected graphs in methods such as DeepWalk and node2vec [29, 18]. In these methods, the general approach is to use short random walks to determine similarities for each pair of nodes. The notion of node similarity is co-occurrence within a random walk; two nodes are similar if there is a higher probability that a walk containing one node will also contain the other. The various methods using this approach differ in the implementation of this general idea: DeepWalk, for example, uses straightforward unbiased walks, while node2vec uses adjustable parameters to bias random walks toward breadth-first or depth-first sampling for a more flexible notion of similarity.

The first stage in the feature learning process is to simulate a certain number of random walks starting from each vertex. Using the results of these random walks, the neighbors of each node can be determined. The objective is to maximize the probability of observing the neighborhood of each node, conditioned on the vector representation of the node. In several of the random walk-based feature representation methods for undirected graphs, this is achieved through optimization of a loss function of the form
$$\max_e J(e) = \sum_{u \in V} \log Pr(N_S(u) \mid e(u)), \tag{1.2}$$
where V is the set of all vertices in the graph, N_S(u) is the neighborhood of the node u, and e(u) is the encoder function representing the embedded vector. If we assume that the probabilities of observing a particular neighbor are independent, we have
$$Pr(N_S(u) \mid e(u)) = \prod_{n_i \in N_S(u)} Pr(n_i \mid e(u)). \tag{1.3}$$
Since these methods typically aim to maximize the inner products for neighboring nodes, they often utilize a softmax function to model each conditional probability:
$$Pr(n_i \mid e(u)) = \frac{\exp\big(e(n_i) \cdot e(u)\big)}{\sum_{v \in V} \exp\big(e(v) \cdot e(u)\big)}.$$
For networks where |V| is large, techniques such as hierarchical softmax [29] or negative sampling [26] can be used to increase efficiency. To determine the optimal embedding function e(u), the loss function can be optimized using gradient descent with respect to the parameters of e(u).

In [7], a random walk-based method specifically intended for directed graphs is proposed. The function to be optimized in this method, when embedding into R^1, is
$$I = \sum_{u} \pi_u \sum_{v,\, u \to v} p(u, v)(y_u - y_v)^2,$$
where y_u is the one-dimensional embedding of u, π_u is the stationary probability for the node u, and p(u, v) represents the random walk's transition probability from u to v. The transition probabilities are calculated according to
$$p(u, v) = w(u, v) \Big/ \sum_{k,\, u \to k} w(u, k),$$
where w(u, v) is the weight on the edge (u, v). Note that this method, though it uses transition probabilities of a random walk, does not actually require simulation of random walks, since all the summations can be done deterministically.

According to spectral graph theory [9], a circulation on a directed graph is a function F such that, for each v ∈ V,
$$\sum_{u,\, u \to v} F(u, v) = \sum_{w,\, v \to w} F(v, w).$$
The Laplacian of a directed graph is therefore defined as
$$\mathcal{L} = I - \frac{\Phi^{1/2} P \Phi^{-1/2} + \Phi^{-1/2} P^T \Phi^{1/2}}{2},$$
where Φ is a diagonal matrix with diagonal entries given by Φ_{v,v} = φ(v) = Σ_{u, u→v} F(u, v) for some circulation function F. The following definition of the combinatorial Laplacian is also due to Chung [9]:
$$L = \Phi - \frac{\Phi P + P^T \Phi}{2},$$
where P is the transition matrix, i.e., its elements are given by P_{uv} = p(u, v), and Φ is the diagonal matrix of the stationary distribution, i.e., Φ = diag(π_1, . . . , π_n). Note that the combinatorial Laplacian is symmetric and positive semidefinite. It is shown in [7] that
$$\sum_{u} \pi_u \sum_{v,\, u \to v} p(u, v)(y_u - y_v)^2 = 2\, y^T L y,$$
where y = (y_1, y_2, . . . , y_n)^T. This result demonstrates that this method is analogous to the Laplacian-based methods for undirected graphs, e.g., Laplacian eigenmaps [17].

Random walk methods have been shown to perform well compared to other methods, and can be useful for a variety of different notions of node similarity. Previous work has shown them to be robust, efficient, and capable of completing a diverse range of tasks, including node classification, link prediction, clustering, and more [17, 5, 29, 18]. Because they do not examine the entire graph at once, these methods are also useful for very large networks and networks that cannot be observed in their entirety.

1.2.1.2 Neural network approaches

A third class of graph embedding methods involves the use of deep learning techniques, including neural networks [5, 17], e.g., the Graph Neural Network (GNN) [5, 35]. Such methods have been shown to perform well, particularly for dimension reduction and identifying nonlinear relationships among data. Random walk based methods, such as DeepWalk, can incorporate deep learning algorithms like SkipGram for graph embedding, where random walks serve to provide neighborhood information as input.

In general, neural network methods typically work by assigning a weight to each node representation, and then combining those terms via a transfer function. This result is then used as the input for an activation function, such as a sigmoid function, i.e., σ(x) = 1/(1 + exp(−x)). This process may then repeat for multiple layers. In a neural network method, the parameters to be optimized are the weights, which are updated after each layer, typically by gradient descent. In the GNN [35], the objective function to be minimized is of the form
$$W = \sum_{i} \sum_{j} \big(t_{i,j} - \varphi_w(G_i, n_{i,j})\big)^2,$$
where G_i is the learning set, the n_{i,j} and t_{i,j} are the nodes and target outputs, and the goal is to choose the parameter w in such a way that φ_w closely approximates the target outputs. This optimization occurs through gradient descent, where the gradient is computed with respect to the weights w.

1.2.2 Transition path theory

In order to identify the transition paths of a given Markov process, we will adopt the framework of Transition Path Theory (TPT). The following notations are mostly from [24]. Consider a state space S. For an initial state A ∈ S and final state B ∈ S, a trajectory X(t) is said to be reactive if it begins from state A and reaches state B before returning to state A. To study these reactive trajectories, we will need the following forward and backward committor functions. For each i ∈ S, the forward committor q_i^+ is the probability that a process initially at state i will reach state B before it reaches state A.
Similarly, the discrete backward committor q_i^- denotes the probability that a process arriving at state i was more recently in state A rather than in state B. The forward committors satisfy the following Dirichlet equation:
$$\sum_{j \in S} L_{ij} q_j^+ = 0 \quad \forall\, i \in (A \cup B)^C, \qquad q_i^+ = 0 \quad \forall\, i \in A, \qquad q_i^+ = 1 \quad \forall\, i \in B,$$
and similar equations are satisfied by the backward committors.

The probability current, or flux, of reactive trajectories gives the average rate at which reactive paths transition from state to state. For trajectories from A to B, the probability current is defined for all i ≠ j by
$$f_{ij}^{AB} = \lim_{s \to 0^+} \frac{1}{s} \lim_{T \to \infty} \frac{1}{2T} \int_{-T}^{T} \sum_{n \in \mathbb{Z}} \mathbf{1}_{\{i\}}(X(t))\, \mathbf{1}_{\{j\}}(X(t + s))\, \mathbf{1}_{(-\infty,\, t_B^n]}(t)\, \mathbf{1}_{[t_A^n,\, \infty)}(t + s)\, dt,$$
where 1_{(a,b)} denotes the indicator function on the interval (a, b). Also, for all i ∈ S, f_{ii}^{AB} = 0. It can be shown that
$$f_{ij}^{AB} = \begin{cases} \pi_i\, q_i^-\, L_{ij}\, q_j^+, & \text{if } i \neq j, \\ 0, & \text{if } i = j. \end{cases}$$
Since we are primarily interested in the flow from A to B, and the process can move in either direction between two adjacent nodes on the path, the following effective current accounts for the net average number of reactive paths that jump from i to j per time unit:
$$f_{ij}^+ = \max\big(f_{ij}^{AB} - f_{ji}^{AB},\ 0\big).$$
We can also define the total effective current for a node i:
$$C_i^+ = \sum_{j:(i,j) \in E} f_{ij}^+.$$
In [14], the effective current is generalized to a connected subnetwork of nodes Ω such that
$$C^+(\Omega) = \sum_{i \in \partial\Omega} C_i^+,$$
which leads to the following definition of transition states of a time-reversible process as subsets of the state space:
$$TS = \lim_{\sigma \to 0}\ \underset{\Omega}{\operatorname{argmax}}\ \Big\{ C^+(\Omega)\, \exp\big(-(q^+ - 0.5)^2/\sigma^2\big) \Big\}.$$
For the non-time-reversible case, transition states can be identified similarly, replacing the exponential argument −(q^+ − 0.5)^2/σ^2 with −((q^+ − 0.5)^2 + (q^- − 0.5)^2)/σ^2. The above definition gives the flexibility of identifying transition states as subsets of the state space, which will be very useful when dealing with complex dynamics such as entropy effects.

1.3 Identifying transition states using network embedding

We now introduce a method to identify transition states and paths of Markov processes, by random walk based network embedding techniques for directed networks. Consider a Markov jump process in a countable state space S. Using the infinitesimal generator L for the process, we can calculate the stationary probability distribution π and the forward and backward committors, based on which the probability current and effective currents f_{ij}^{AB} and f_{ij}^+ can then be computed. Once these quantities have been obtained, the discretized state space can be viewed as a weighted, directed graph G(V, E). The nodes v ∈ V are the grid points representing states of the system, and each pair of adjacent nodes is connected by an edge (u, v) ∈ E whose weight is given by the effective probability current, such that the direction of the edge is determined by the sign of the effective current.

Once G is constructed, we can apply feature learning techniques for node classification to identify transition states. Random walk trials for similarities between nodes will start from each node in the space, using the edge weights as transition probabilities. The outputs of the random walks can then be used to compute experimental "neighbor probabilities", i.e., the probability that a random walk of length d starting at node u will contain node v. If the neighbor probability for a node pair (u, v) is sufficiently high, then the node v is claimed as a neighbor of u. Note that this is a directed process; v can be a neighbor of u even if u is not a neighbor of v.
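For reference, the quantities used above to build the graph G (the stationary distribution, committors, and currents) can be computed deterministically by solving linear systems involving the generator. The following NumPy sketch is a minimal illustration for a small, dense generator matrix; the function name, the dense solvers, and the shortcut q^- = 1 − q^+ (exact only for time-reversible processes) are assumptions of this example rather than part of the method described above.

```python
import numpy as np

def tpt_quantities(L, A, B):
    """Stationary distribution, forward committor, and effective currents
    for a Markov jump process with generator L (rows sum to zero).

    L : (n, n) generator matrix; A, B : disjoint lists of state indices.
    Returns (pi, q_plus, f_plus), where f_plus[i, j] is the effective current.
    """
    n = L.shape[0]
    # stationary distribution: solve pi^T L = 0 together with sum(pi) = 1
    M = np.vstack([L.T, np.ones(n)])
    rhs = np.zeros(n + 1); rhs[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, rhs, rcond=None)

    # forward committor: sum_j L_ij q_j^+ = 0 outside A and B,
    # with q^+ = 0 on A and q^+ = 1 on B
    q = np.zeros(n)
    q[B] = 1.0
    interior = [i for i in range(n) if i not in set(A) | set(B)]
    Lii = L[np.ix_(interior, interior)]
    rhs_q = -L[np.ix_(interior, B)].sum(axis=1)
    q[interior] = np.linalg.solve(Lii, rhs_q)

    # probability current f_ij^AB = pi_i q_i^- L_ij q_j^+ (here q^- = 1 - q^+)
    q_minus = 1.0 - q
    F = pi[:, None] * q_minus[:, None] * L * q[None, :]
    np.fill_diagonal(F, 0.0)
    f_plus = np.maximum(F - F.T, 0.0)   # effective current
    return pi, q, f_plus
```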
After finding these conditional probabilities and identifying the neighborhoods of each node, the next step is to maximize an objective function in order to solve for the node embeddings. As in the node2vec and DeepWalk methods for the undirected case, the conditional probabilities are modeled using softmax functions. In order to adapt this technique to the case of directed graphs, we will use a modified version of the objective function:
$$V[e] = \sum_{u} \pi_u \sum_{v \in N(u)} Pr(v \mid e(u)). \tag{1.4}$$
Optimizing this function will give the encoder function e_opt that maximizes the probability of the neighborhood of each node, with the most relevant nodes having the most influence on the feature representations. In the original form of the objective function for the undirected graph methods described above (cf. (1.2)), the softmax functions depend only on whether nodes are neighbors, and are independent of the probability (frequency) with which a particular neighbor shows up in a random walk trial. In other words, all neighbors are treated equally, while in reality some node pairs are more likely to co-occur than others. To address this, the objective function used here incorporates the neighboring probabilities, denoted NP(·, ·), such that
$$Pr(n \mid e(u)) = \frac{\exp\big(e(n) \cdot e(u)\big)\, NP(n, u)}{\sum_{v \in V} \exp\big(e(v) \cdot e(u)\big)\, NP(v, u)}.$$
Assuming e(u) = E_2 u, with E_2 a 2 × d matrix, the objective defined above can then be maximized using gradient ascent to obtain the matrix E_2^opt and therefore the feature embeddings. In practice, the linear assumption on the map from nodes to embeddings appears to be insufficiently flexible, particularly for higher-dimensional problems. To improve upon this, a neural network can be incorporated into the feature learning process, with the sigmoid function σ(u) = 1/(1 + exp(−e(u))), where e(u) is the representation in the embedding space of the node u. The current algorithm uses a relatively small number of layers; raising this number results in node representations that are clustered more closely to their neighbors.

Algorithm 1.1 A random walk simulation method for directed graphs, given the transition matrix T.
for u ∈ V do
    walk(1) = u
    for step = 1 : walklength do
        for k = 2 : d do
            total = Σ_{v : walk(k−1)→v} t(walk(k−1), v)
            for v : walk(k−1) → v do
                P(walk(k−1), v) = t(walk(k−1), v) / total
            end for
            Use probability P(walk(k−1), ·) to choose walk(k) = v for some neighbor of walk(k−1).
            Set counter(walk(k), u) = counter(walk(k), u) + 1.
        end for
    end for
end for
for u, v ∈ V do
    neighborprobability(u, v) = counter(u, v) / Σ_k counter(u, k)
end for

To obtain transition states for a particular process, we will examine the similarities between the node representations of each state and the representation of the metastable state A. Here, by "similarities" we mean the conditional probabilities given by the softmax units discussed previously. By construction, these values will naturally be small for nodes that are further from node A, due to hopping distances that are longer than the length of the random walks. So, to find the transition states, we combine similarities of node pairs with shorter hopping distances via matrix propagation, e.g.,
$$sim(A, u) = \sum_{A \to v,\ v \to u} sim(A, v)\, sim(v, u), \tag{1.5}$$
as illustrated by the sketch below. Note that a node u for which this value is high is a higher-order neighbor of the reactant node A; therefore, such a node will have a higher probability of lying along a transition path.
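In matrix form, one propagation step of (1.5) is simply a product of the similarity matrix with itself (entries for non-adjacent pairs are zero, so only paths A → v → u contribute). A minimal NumPy sketch is given below; the row renormalization is an added assumption used here only to keep the propagated values on a comparable scale.

```python
import numpy as np

def propagate_similarity(sim, steps=1):
    """Propagate node similarities to higher-order neighbors, as in (1.5).

    sim : (n, n) array of neighbor probabilities, zero for non-neighbors.
    Each step replaces sim(A, u) with sum_v sim(A, v) * sim(v, u), i.e. one
    matrix product; rows are then renormalized (an illustrative choice).
    """
    out = sim.astype(float).copy()
    for _ in range(steps):
        out = out @ sim
        row_sums = out.sum(axis=1, keepdims=True)
        out = out / np.where(row_sums == 0.0, 1.0, row_sums)
    return out

# toy usage: a 3-node chain a -> b -> c; one step links a to c
sim = np.array([[0.0, 0.9, 0.0],
                [0.0, 0.0, 0.9],
                [0.0, 0.0, 0.0]])
print(propagate_similarity(sim, steps=1))
```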
We then define transition states as nodes with a high probability of lying on a transition path, which are not direct neighbors of the reactant or product states.

1.4 Examples and results

Next, we will illustrate the suggested method on several simple examples. In the examples presented here, we have used 100 random walk trials, each of length 9 steps. Smaller numbers of trials increase the impact of noise in the results. Due to the matrix propagation technique described above, changes to the walk length parameter do not have a noticeable effect.

Figure 1.1 Diffusion process with energy barrier: plot assigning colors to each node in the discretized domain Ω, according to each node's similarity to the starting node A. Higher similarity values correspond to higher dot products between the two-dimensional embeddings of the nodes. The left plot corresponds to a parameter value of ε = 0.01, and the right plot corresponds to ε = 1.

Figure 1.2 Feature representations of nodes with high similarity to the reactant state A for the 2D diffusion process with energy barrier, with ε = 0.01.

1.4.1 Diffusion process with energy barrier

First we will consider a two-dimensional diffusion process (x(t), y(t)) representing the position of a particle traversing a potential field with potential V. This process follows the SDE
$$\frac{dx}{dt} = -\frac{dV}{dx} + \frac{dW_1(t)}{dt}, \qquad \frac{dy}{dt} = -\frac{dV}{dy} + \frac{dW_2(t)}{dt},$$
where V is the potential
$$V = (x^2 - 1)^2 + \varepsilon y^4, \tag{1.6}$$
and W_1(t) and W_2(t) are independent Brownian motion processes. This potential V has two local minima at (−1, 0) and (1, 0), as well as a saddle point at the origin. As a result, the diffusion process (x(t), y(t)) will have metastable states at these energy minima. Here we will take (−1, 0) to be the reactant state A, and (1, 0) to be the product state B.

The diffusion process can be approximated using a Markov (birth-death) jump process on a discrete state space. For this example, we will study the diffusion process on the domain Ω = [−1, 1] × [−0.75, 0.75] by examining the Markov jump process on a grid D = ((−1 + hZ) × (−0.75 + hZ)) ∩ Ω, with h = 0.05. In order to examine the behavior of this Markov process, we should first compute the probability currents for the process. The infinitesimal transition matrix L can be found by using the jump rates L_{uv} of the birth-death process for each pair of adjacent nodes u, v [15]. We define the following constants for a state (x, y) with x ∈ (−1, 1):
$$k_x^+(x, y) = \frac{1}{2h^2} - \frac{1}{2h}\frac{dV}{dx}(x - h), \qquad k_x^-(x, y) = \frac{1}{2h^2} + \frac{1}{2h}\frac{dV}{dx}(x + h),$$
$$k_y^+(x, y) = \frac{1}{2h^2} - \frac{1}{2h}\frac{dV}{dy}(y - h), \qquad k_y^-(x, y) = \frac{1}{2h^2} + \frac{1}{2h}\frac{dV}{dy}(y + h).$$
Also let k_x^+(x + h, y) = 1/h if x = −1 and k_x^+(x + h, y) = 0 if x = 1, and let k_x^-(x − h, y) = 0 if x = −1 and k_x^-(x − h, y) = 1/h if x = 1, and similarly for k_y^+ and k_y^-. Then the infinitesimal generator can be expressed in terms of its action on a test function f such that
$$(Lf)(x, y) = k_x^+(x + h, y)\big(f(x + h, y) - f(x, y)\big) + k_x^-(x - h, y)\big(f(x - h, y) - f(x, y)\big) + k_y^+(x, y + h)\big(f(x, y + h) - f(x, y)\big) + k_y^-(x, y - h)\big(f(x, y - h) - f(x, y)\big).$$
Figure 1.1 assigns each point in the grid a color based on its similarity to the node of A = (−1, 0), for epsilon values ε = 0.01 and ε = 1 in Eq. (1.6), respectively. To account for the fact that nodes with greater hopping distances from A will have lower similarities to this node, we propagate the similarity matrix according to (1.5) in order to assign similarities to more distant nodes.
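For concreteness, a small NumPy sketch of assembling this discretized generator on the grid is given below. Substituting the definitions of k_x^± and k_y^± into the generator reduces the jump rates out of a grid point to 1/(2h^2) ∓ (1/(2h)) dV/dx (or dV/dy) evaluated at that point; the simple boundary treatment (dropping jumps that would leave the domain) and the function name build_generator are assumptions of this sketch rather than the exact construction above.

```python
import numpy as np

def build_generator(h=0.05, eps=0.01, xlim=(-1.0, 1.0), ylim=(-0.75, 0.75)):
    """Generator of the birth-death jump process approximating the diffusion
    in the potential V(x, y) = (x^2 - 1)^2 + eps * y^4 on a regular grid.
    Rows of L sum to zero; off-diagonal entries are jump rates."""
    dVdx = lambda x, y: 4.0 * x * (x ** 2 - 1.0)
    dVdy = lambda x, y: 4.0 * eps * y ** 3
    xs = np.arange(xlim[0], xlim[1] + 1e-9, h)
    ys = np.arange(ylim[0], ylim[1] + 1e-9, h)
    nx, ny = len(xs), len(ys)
    idx = lambda i, j: i * ny + j                    # flatten 2-D grid index
    L = np.zeros((nx * ny, nx * ny))
    for i, x in enumerate(xs):
        for j, y in enumerate(ys):
            # jump rates to the four neighboring grid points
            moves = [((i + 1, j), 1 / (2 * h ** 2) - dVdx(x, y) / (2 * h)),
                     ((i - 1, j), 1 / (2 * h ** 2) + dVdx(x, y) / (2 * h)),
                     ((i, j + 1), 1 / (2 * h ** 2) - dVdy(x, y) / (2 * h)),
                     ((i, j - 1), 1 / (2 * h ** 2) + dVdy(x, y) / (2 * h))]
            for (ii, jj), rate in moves:
                if 0 <= ii < nx and 0 <= jj < ny and rate > 0:
                    L[idx(i, j), idx(ii, jj)] = rate
            L[idx(i, j), idx(i, j)] = -L[idx(i, j)].sum()   # row sums to zero
    return L, xs, ys
```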
The red area passing from A to B represents the region that reactive trajectories will pass through with the highest probability for each value of epsilon. In Figure 1.2, the node embeddings (u, v) = e_opt(x, y) corresponding to a subset of nodes with higher similarity values to A are shown, for the case where ε = 0.01, where e_opt is the optimal linear encoding. Notice that in this figure, the embeddings for nodes with the highest probability of occurring in a reactive trajectory are clustered together, while nodes with lower probabilities tend to be grouped with other nodes with similar probabilities; more specifically, nodes with similarity values in the range 0.5–0.6 have embeddings clustered near the upper left or lower right of the figure, while nodes with values from 0.7–0.8 have embeddings that appear in one of the two orange clusters.

In the context of the diffusion process with the potential V, the entropy effect refers to changes in the system's observed dynamics in response to the change of the parameter ε. That is, as ε is decreased, dV/dy shrinks, and therefore the drift term of dy/dt becomes small. As a result, the shape and size of the saddle point located at the origin is altered, and we can expect the y-coordinate to take on a greater variety of values with higher probabilities in this case. In particular, around the origin, smaller values of ε should result in higher similarities between the nodes above and below the origin, since in this case the process is more likely to move up and down. For larger values of ε, the process is expected to move straight forward from A to B, with a smaller probability of moving up or down.

Figure 1.3 Diffusion process with entropic bottleneck: surface plot of the similarity function between the reactant state A and nodes in Ω. Here, the discretization with h = 0.1 is used.

1.4.2 Diffusion process with barriers

Next we consider a pure diffusion process (V(x, y) = 0) on the domain Ω = [−1, 1] × [−1, 1] with two barriers as depicted in Figure 1.3. This example was also discussed in [25]. Here we take the node at (0.6, 0.6) to be the initial state A, and (−0.6, −0.6) as the final state B. The particle spends most of its time between the reactant and product states, in the region between the barriers ([−1, 1] × [−0.4, 0.4]). The only factor affecting the probability that the particle will reach B before returning to A is its current distance to B, since there are no energetic obstacles to overcome. In order to travel from A to B, the process must find its way past the obstacles by chance, thereby overcoming an entropic barrier.

Figure 1.4 Feature representations of nodes with high similarity to the reactant state A for the entropic diffusion process.

Figure 1.4 shows the node embeddings for a subset of nodes with high similarity values to reactant A. Note that the color scheme used in this figure corresponds to the color scheme in Figure 1.3; node embeddings that are colored red in the scatter plot correspond to red areas in the surface plot. In Figure 1.4, node embeddings again tend to be grouped by the corresponding values of the similarity function: nodes that are very similar to the initial state A (similarity function value > 0.9) appear in the red cluster toward the lower right. Other clusters of different colors correspond to groups of nodes with lower probabilities of appearing in a reactive trajectory.
The yellow cluster contains embeddings of nodes with similarity function values of approximately 0.7–0.8, while the blue cluster corresponds to nodes with similarity values near 0.5–0.6.

1.4.3 Three-dimensional toggle switch

Now we will consider a higher-dimensional example, a stochastic model of a 3D genetic toggle switch consisting of three genes that each inhibit the others' expression [14]. We consider the production and degradation of the three gene products, S_1, S_2, and S_3:
$$* \underset{\alpha_4}{\overset{\alpha_1}{\rightleftharpoons}} S_1, \qquad * \underset{\alpha_5}{\overset{\alpha_2}{\rightleftharpoons}} S_2, \qquad * \underset{\alpha_6}{\overset{\alpha_3}{\rightleftharpoons}} S_3,$$
where the parameters α_i are defined:
$$\alpha_1 = \frac{c_{11}}{(65 + x_2^2)(65 + x_3^2)}, \qquad \alpha_2 = \frac{c_{12}}{(65 + x_1^2)(65 + x_3^2)}, \qquad \alpha_3 = \frac{c_{13}}{(65 + x_1^2)(65 + x_2^2)},$$
$$\alpha_4 = c_4 x_1, \qquad \alpha_5 = c_5 x_2, \qquad \alpha_6 = c_6 x_3,$$
with c_{11} = 2112.5, c_{12} = 845, c_{13} = 4225, c_4 = 0.0125, c_5 = 0.005, and c_6 = 0.025.

Figure 1.5 Surface plot of the similarities to A of grid points in the domain Ω for the 3-D Toggle switch model. Results are for a 15 × 15 × 15 discretization (h = 3).

From the stationary probability distribution, it can be seen that this model has three metastable states:
A = {x ∈ S | 35 ≤ x_1 ≤ 45, 0 ≤ x_2 ≤ 4, 0 ≤ x_3 ≤ 4},
B = {x ∈ S | 35 ≤ x_2 ≤ 45, 0 ≤ x_1 ≤ 4, 0 ≤ x_3 ≤ 4},
C = {x ∈ S | 35 ≤ x_3 ≤ 45, 0 ≤ x_2 ≤ 4, 0 ≤ x_1 ≤ 4}.
Here we will choose A to be the reactant state and B to be the product state. Figure 1.5 assigns colors to each node in Ω = {(ih, jh, kh) | i, j, k ∈ Z} ∩ ([0, 45] × [0, 45] × [0, 45]) based on their similarity to nodes in A, propagating the similarity matrix for nodes with larger hopping distance from A as described previously. This figure shows two potential transition states for this system. A reactive trajectory from A to B will be most likely to pass directly from A to B along the higher-probability trajectory, represented in red. However, such a trajectory may instead reach B after travelling through state C, as indicated by the fainter region connecting A and B via C.

Below, Figure 1.6 shows the two-dimensional node embeddings in R^2, with the embeddings of nodes with zero similarity to A omitted for clarity. There are several distinct clusters present in Figure 1.6, each representing a cluster of similar nodes. The nodes with the highest similarities to A
From [38] we know there are two steady states for this system: an unstable trivial solution at 20 Product State B 0.57 0.56 0.55 0.54 0.53 Transition States 0.52 0.51 0.5 0.49 Reactant State A 0.63 0.635 0.64 0.645 0.65 0.655 0.66 0.665 0.67 0.675 0.68 Figure 1.7 Feature representations of nodes with high similarity to the initial state A = (0, 0, 0) for the virus propagation example. 1 0.95 0.65 Reactant State A 0.9 0.645 0.85 0.8 0.64 0.75 0.635 Transition States 0.7 0.65 0.63 0.6 0.55 0.625 Product State B 0.59 0.595 0.6 0.605 0.61 0.615 0.62 Figure 1.8 Feature representations for the virus propagation example, with a logarithmic scale on the x-axis. tem = gen = struct = 0, and a stable steady state, which with the given parameter values is tem = 30, gen = 100, struct = 12000. Ignoring the stochastic effects, as a macro scale limit, this 21 system can be modeled by the ODEs: d[tem] = k1 [gen] − k2 [tem], dt d[gen] = k3 [tem] − k4 [gen][struct] − k1 [gen], dt d[struct] = k5 [tem] − k6 [struct] − k4 [gen][struct]. dt For a stochastic perspective, we model the system by a Markov process, where we follow the First Reactions method of the Gillespie algorithm[16], which at each time step assumes that the reaction that could take place in the shortest amount of time will occur next. From the upper left to the lower right, these clusters contain the representations for nodes near approximately (15, 0, 1000), (30, 66.7, 4000) and (30, 100, 8000). Direct simulations confirm that these are transition states: reactive trajectories will tend to pass through nodes in each of these clusters. Applying the feature learning method to this system produces a two-dimensional feature representation for each three-dimensional node (tem, gen, struct). Here, we take state A to be the origin and B = (30, 100, 12000). Figure 7 plots the two-dimensional feature representations for this system. In this figure, the nodes with higher similarities to the initial state (represented in red) and those with lower such similarities (represented in blue) appear in separate clusters. Figure 8 plots these feature representations with a logarithmic scale on the x-axis to better display the clusters. Five distinct clusters are discernible in Figure 1.8: the red node in the top left is the feature representation for the reactant state, the red node at the bottom right is the representation of the product state, and the three clusters between each represent a transition state that reactive trajectories pass through while traveling from A to B. From the upper left to the lower right, these three clusters contain the representations for nodes near approximately (15, 0, 1000), (30, 66.7, 4000) and (30, 100, 8000). Direct simulations confirm that these are transition states: reactive trajectories will tend to pass through nodes in each of these clusters. 22 1.4.5 E. Coli σ-32 heat response circuit Finally we will address a higher-dimension example, to illustrate the effectiveness of the method on reducing the dimension of a graph. Consider the E. coli σ-32 stress circuit, a network of regulatory pathways controlling the σ-32 protein, which is essential in the E. coli response to heat shock [37]. The systems consists of 10 processes: k k F tsH − →1 ∅, Eσ 32 − → 2 F tsH, k k GroEL − →3 ∅, Eσ 32 − → 4 GroEL, k k σ 32 + Jcomp − →5 σ 32 -Jcomp , σ 32 -Jcomp − →6 Jcomp + σ 32 , k k σ 32 -Jcomp − →7 F tsH, ∅− → 8 σ 32 , k k 0 E + σ 32 − →9 E-σ 32 , E-σ 32 −− → E + σ 32 . 
1 In this context, FtsH and GroEL are stress response proteins, E is a holoenzyme, and Jcomp (or J-complex) represents several chaperone proteins that are lumped as a simplification. Then E-σ 32 denotes the protein complex formed when E binds to σ 32 , which catalyzes downstream synthesis reactions. σ 32 is a product of translation, which we assume to occur at a rate corresponding to k8 = 0.007. It can also associate and dissociate of the J-complex. In accordance with [37] the other rate constants are taken to be: k1 = 7.4 × 10−11 , k2 = 4.41 × 106 , k3 = 1.80 × 10−8 , k4 = 5.69 × 106 , k5 = 3.27 × 105 , k6 = 4.4 × 10−4 , k7 = 1.28 × 103 , k9 = 0.7, and k10 = 0.13. For the above reaction rates, this system has a metastable state where the concentrations of FtsH, GroEL, σ 32 , and Jcomp are approximately 600, and the concentrations of the other three species are approximately 1500. Another metastable state occurs when the concentrations of FtsH, GroEL,σ 32 , and Jcomp are approximately 800. Here we take the former to be initial state A and the latter to be the product state B. Following a procedure similar to that in the above virus propagation example, three-dimensional representations are generated for each node. The plot of these node representations is given in Figure 9. The color scale in this figure is determined by the value of the similarity function between each node and the node A, with the most similar nodes shown in red. 23 Nodes with lower similarities have been omitted for clarity. Four distinct clusters can be seen in Figure 9. Figure 1.9 Feature representations of nodes with high similarity to the initial state, when initial state is taken to be A = (600, 600, 600, 600, 1500, 1500, 1500), for the E. Coli heat response circuit model. 1.5 Conclusions In this paper, we presented a method for analyzing metastable chemical reaction systems via feature learning on directed networks using random walk sampling. We have shown how this method may be used to identify transition states of Markov jump processes by interpreting such processes in terms of directed networks and taking advantage of Transition Path Theory. We have illustrated the efficacy of this method through several low-dimensional examples involving various energetic and entropic barriers. As noted above, for more complex, realistic examples, the method can be used for dimensional reduction, enabling the extraction and analysis of high-dimensional graph information. Further work will be necessary to fully understand this approach from a probabilistic standpoint. Another direction for future work is improving efficiency and development of faster 24 algorithms. However, this method still forms the groundwork for a potential new way of analyzing directed networks and jump processes, particularly in the context of chemical kinetics. 25 CHAPTER 2 EFFICIENT NETWORK EMBEDDING BASED ON SPARSE APPROXIMATION OF A RANDOM WALK 2.1 Introduction Graphs provide a versatile and intuitive representation for many real world situations. A wide variety of relational data, e.g., social networks, protein-protein interaction networks, biochemical systems, citation networks, and many others, can be interpreted as graphs. As a result, methods for analyzing these networks have consistently garnered much interest. 
Network embedding, or feature learning, which provides a major framework for network analysis, seeks to map each node in a given graph to a point in a low-dimensional vector space such that relational information from the original graph is preserved. In particular, the embeddings of nodes that are closely related in the original graph (e.g., those connected by an edge, or with mutual neighbors) will be close together in the embedding space. In many machine learning applications, the lower-dimensional vectors are much easier and more convenient than the original network representations. Although there also exist methods for embedding edges or whole networks, here we focus on node embedding methods. These frequently have the benefit of being more adaptive and useful for a variety of applications, such as node classification, clustering, and link prediction [5]. Node classification assigns class labels to unlabeled nodes based on a small sample of labeled nodes, while clustering algorithms group the representations of similar nodes together in the target vector space, and link prediction is used to predict edges based on a sample of data containing edge information. Most node embedding methods can be classified based on the general techniques used to compute representations. As seen in [17], the three major categories are methods based on matrix factorization, random walk sampling, and deep learning techniques. Matrix factorization methods involve representing network information in a matrix (e.g. an adjacency matrix or Laplacian), then factoring or decomposing that matrix to determine embeddings. Examples include Laplacian 26 eigenmaps [3], locally linear embedding (LLE) [33], and graph factorization [1]. Factorization methods are deterministic and often perform well on small graphs, but may be impractical on very large graphs. Some factorization methods can only be applied to matrices with specific properties; methods involving eigenvalue decompositions, for instance, require the matrix to be positive semidefinite. A benefit of random walk methods is that these methods are not typically subject to these restrictions, and can easily be applied to large networks. In general, these methods involve collecting information about node-node relationships by running short random walk trials on the network, then encoding this information into embeddings via gradient descent. DeepWalk [29] and node2vec [18] are two well-known methods of this type for undirected graphs, and a random walk approach for the directed case is proposed in [7]. In DeepWalk [29], closely-related nodes are embedded near each other based on the frequency of co-occurrence within unbiased, fixed length random walks on the graph. One method that builds on this idea is node2vec [18], which replaces the simple unbiased random walks with walks influenced by parameters that control the walks’ tendency to either backtrack and stay near the starting node, or explore higher-degree neighbors of the starting node. Here, we propose another improved implementation of SGD-based random walk methods, based instead upon a sparse approximation of the random walk on the graph. We use this sparse approximation to deterministically find embeddings optimizing the commute time [19], then finally apply SGD to minimize cross-entropy loss. This approach achieves efficiency by avoiding Monte Carlo simulations of random walks, instead leveraging commute time, which takes into account all paths of arbitrary length between nodes. 
The commute time is defined as the expected time for a random walk to travel from one node u to another v, and then back to the starting node u. It can also be viewed as an integration over time t of the diffusion distance introduced in [10], which is defined through the probability of the random walk traveling from u to v in fixed time t. Hence, the diffusion distance sums over all paths of finite time t between the pair of nodes (u, v). One advantage of utilizing commute time rather 27 than diffusion distance is that it removes the need to choose an appropriate value for the parameter of time t. Numerically, the commute time between two nodes can be computed from the Green function. Through the Schultz method, one can compute an efficient approximation of the Green function via a short product involving dyadic powers of the random walk’s transition matrix. The diffusion wavelets algorithm, introduced in [11], offers one way to compute increasingly low-dimensional, compressed representations for these dyadic powers. Diffusion wavelets can also be used for network embedding, as discussed in [44]. By construction, at each scale represented by a dyadic power of the diffusion operator, the diffusion wavelets algorithm produces a basis of scaling functions that span the same space as the eigenvectors corresponding to the largest eigenvalues. As a result, these scaling functions can be used to produce embeddings that are equivalent up to a rotation to those produced by kernel methods. Alternatively, the low-dimensional approximations of the dyadic powers (and subsequently the Green function) can be computed from the Markov matrix’s largest singular values and corresponding singular vectors. From these singular vectors, we can produce multiscale embeddings analogous to the scaling functions of diffusion wavelets. In this paper, we propose a random walk method for learning node embeddings for networks that preserves commute time distances between nodes. Rather than applying the SkipGram model directly to the random walk, we begin by finding a sparse, low-dimensional approximation of the random walk. Specifically, we take inspiration from diffusion wavelets [11] to compute compressed versions of the random walk and its dyadic powers. We then produce commute time-preserving node embeddings from the Green function G(u, v) associated with the sparse approximation. Finally, we use SGD to optimally weight each dimension of the embeddings so that cross-entropy loss is minimized. We show that these embeddings compare favorably to those produced by existing methods such as node2vec, LINE, and GraRep, with classification accuracy measured by F1 Macro scores. Additionally, we show that this truncated SVD algorithm allows for faster dimensional reduction than the original diffusion wavelet algorithm, leading to improved performance and accuracy on larger graphs. 28 Section 2 of this paper provides some necessary background on random walk and SkipGram- based node embeddings, commute time embeddings and an introduction to random walk sparse approximation techniques. In Section 3, we introduce the proposed new method for learning network embeddings, and discuss its implementation and its relationship to other kernel-based embeddings. In Section 4, we elaborate on the numerical aspects of two possible choices of algorithm (diffusion wavelets and SVD) for computing the sparse approximation of the random walk. 
Finally, Section 5 provides numerical experiments on several graphs, comparing the proposed method to some existing feature learning methods from the literature.

2.2 Background

Before presenting the implementation details of our method, we will first discuss in more depth the tools used at each stage. Our embeddings are based on the application of SGD to a commute time-preserving embedding of a compressed representation of the random walk on the graph. We will elaborate on these topics below, beginning with SGD, the SkipGram model, and some SkipGram-based embedding methods. Then, we will discuss commute times and their embeddings as introduced in [19], and finally the sparse approximation of the random walk.

2.2.1 The Skipgram algorithm and Skipgram-based embeddings

The SkipGram algorithm optimizes vector embeddings by minimizing the loss function
\[
L = -\sum_i \sum_{j \in \mathrm{walk}(i)} \log\big(P(j \mid i)\big).
\]
Here, i, j are nodes in the network, walk(i) denotes a random walk trial starting at node i, and P(j|i) denotes the conditional probability that a random walk starting at node i will include node j. In the SkipGram model, these conditional probabilities are modeled using softmax units as follows:
\[
P(j \mid i) = \frac{\exp(v_i \cdot v_j)}{\sum_k \exp(v_i \cdot v_k)},
\]
where vk denotes the vector embedding for node k. For large networks with many nodes, this denominator becomes impractical to compute. To overcome this, negative sampling [26] may be used instead. This replaces the softmax units with
\[
P(j \mid i) = \sigma(v_i \cdot v_j) \prod_{m=1}^{M} \sigma(-v_i \cdot v_{k_m}),
\]
where σ(x) = 1/(1 + e−x) and M is the chosen number of negative samples (sampled from a uniform or unigram distribution).

The simplest SkipGram-based method for network embedding is DeepWalk [29]. In DeepWalk, an unbiased random walk is simulated on the graph. Essentially, the walk has an equal chance of stepping to any node which is connected by an edge to the current node, so the transition matrix is a row-normalized adjacency matrix. If the graph is weighted, the weighted adjacency matrix (with entries given by edge weights) may be used instead. Stochastic gradient descent is then used to minimize the above SkipGram objective based on the random walk simulations; that is, at each data point from the random walk trials, the gradient of the loss function with respect to the parameters θ of the embedding function is computed, and the parameters are updated by α dL/dθ for some learning rate α (see Algorithm 2.1).

Algorithm 2.1 SkipGram Model (stochastic gradient descent).
for i do
    for j ∈ walk(i) do
        L(θ) = − log P(vj(θ) | i)
        θ = θ − α dL/dθ
    end for
end for
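As an illustration of the update performed by Algorithm 2.1 under the negative-sampling model, the following sketch (illustrative only; the function and variable names are ours, not from an existing implementation) applies one SGD step per observed walk pair, pulling the embeddings of co-occurring nodes together and pushing a few randomly sampled nodes away.

import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_step(V, i, j, M=5, lr=0.025):
    # One SGD step on -log P(j|i) under the negative-sampling model above.
    grad_i = np.zeros_like(V[i])
    g = sigmoid(V[i] @ V[j]) - 1.0                     # gradient factor for the positive pair
    grad_i += g * V[j]
    V[j] -= lr * g * V[i]
    for kneg in rng.integers(0, V.shape[0], size=M):   # negative samples (uniform here)
        g = sigmoid(V[i] @ V[kneg])                    # gradient factor for a negative pair
        grad_i += g * V[kneg]
        V[kneg] -= lr * g * V[i]
    V[i] -= lr * grad_i

# Toy usage: embeddings for 34 nodes, trained on placeholder random "walks".
num_nodes, dim = 34, 2
V = 0.1 * rng.standard_normal((num_nodes, dim))
walks = [rng.integers(0, num_nodes, size=10) for _ in range(200)]
for walk in walks:
    for i, jnode in zip(walk[:-1], walk[1:]):          # adjacent pairs as (i, j) samples
        skipgram_step(V, int(i), int(jnode))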
The node2vec [18] method follows a similar outline, but with a slightly more complex choice of random walk. In particular, two parameters p and q are used to bias the random walk simulations. The return parameter p is used to control the probability that a random walk will backtrack to immediately revisit the previous node (e.g., by stepping from node A to B, then immediately back to A). The in-out parameter q controls the likelihood that the walk will step to a node which is close to the previous node. That is, for smaller q, the random walk will be more likely to explore parts of the graph that are further away from the starting node. The resulting random walk's transition probabilities are defined as follows, where the current state of the random walk is n0 and the previous state was n−1. If n0 is connected to M other nodes m1, m2, ..., mM, one assigns each mj an unnormalized probability β(j):
\[
\beta(j) =
\begin{cases}
1/p & \text{if } m_j = n_{-1}, \\
1 & \text{if } m_j \text{ shares an edge with } n_{-1}, \\
1/q & \text{if } m_j \text{ does not share an edge with } n_{-1}.
\end{cases}
\]
To get the transition probabilities for the next step of the random walk, the β(j) are normalized so that they sum to one. Notice that this random walk depends on the two previous steps, and so is not a Markov chain. These parameters can be used to find the ideal balance between breadth-first sampling, which tends to stay nearer to the source node and get a more accurate picture of its immediate neighborhood, and depth-first sampling, which tends to visit parts of the graph further from the source. By finding optimal values of p and q, then, one can engineer a new random walk that will provide superior node embeddings. In this article, our goal is similar to that of node2vec: we aim to find a choice of random walk that will lead to better node embeddings.

2.2.2 Network Embeddings with Commute times

For a given network Γ, let S denote the set of nodes, E the set of edges, and let A be the adjacency matrix with entries given by the weights aij. Let D be the diagonal matrix with D(i, i) = di = Σj aij. The volume of the graph is given by vol = Σi di, and the random walk Laplacian is defined as L = I − D−1A. The normalized Laplacian has the form $\mathcal{L} = D^{-1/2} L D^{-1/2}$, with eigendecomposition $\mathcal{L} = \Phi \Lambda \Phi^T$, where Λ denotes the diagonal matrix of eigenvalues ordered so that 0 = λ1 ≤ λ2 ≤ ... ≤ λ|S|, and the eigenvectors are given by the columns of Φ = (ϕ1, ϕ2, ..., ϕ|S|). The discrete Green function is the left inverse of the operator $\mathcal{L}$ and can be given as
\[
G(i, j) = \sum_{k=2}^{|S|} \frac{1}{\lambda_k} \left( \frac{d_j}{d_i} \right)^{1/2} \phi_k(i)\,\phi_k(j).
\]
Here, we assume that the graph is connected. Otherwise λk = 0 for some k ≥ 2, and we can instead consider the Moore-Penrose pseudoinverse, L+ = (LT L)−1LT, to which the methods and results can be easily extended.

We can further define the commute time CT(i, j) to be the mean time for the Markov process on the graph prescribed by L to travel from i to j and back, which also satisfies
\[
CT(i, j) = \mathrm{vol} \left( \frac{G(i, i)}{d_i} + \frac{G(j, j)}{d_j} - \frac{G(i, j)}{d_i} - \frac{G(j, i)}{d_j} \right). \tag{2.1}
\]
It is shown in [19] that the coordinate matrix for embeddings that preserve commute time is then given by $\Theta = \sqrt{\mathrm{vol}}\,\Lambda^{-1/2}\Phi^T D^{-1/2}$, so ΘT Θ = vol · GD−1. Thus, commute time embeddings are equivalent to those produced by kernel PCA on a constant multiple of the matrix GD−1.

2.2.3 Computing sparse approximations of Green functions via Schultz method

As we will see, one key component of our method will involve computing an approximate, compressed version of the Green function corresponding to the random walk on the network. Here we use the Schultz method for this purpose. The Schultz method iteratively computes the Green function (or Moore-Penrose pseudoinverse) associated with L through Gk+1 = Gk(2I − LGk). The Green function G can be associated with a diffusion process (i.e., a Markov random walk) defined by the transition matrix T = I − L = D−1A. Using the fact that L = I − T, and taking G0 = I + T, we obtain:
\[
G_1 = G_0\big(2I - (I - T)(I + T)\big) = G_0(2I - I + T^2) = (I + T)(I + T^2),
\]
\[
G_2 = G_1\big(2I - (I - T)(I + T)(I + T^2)\big) = G_1\big(2I - (I - T^2)(I + T^2)\big) = (I + T)(I + T^2)(I + T^4),
\]
\[
\vdots
\]
\[
G_{K+1} = \prod_{k=0}^{K} (I + T^{2^k}).
\]
The Green's function G then has the following form on the complement of the eigenspace for the eigenvalue equal to 1, as shown in [11]:
\[
Gf = \sum_{k=1}^{\infty} T^k f = \prod_{k=0}^{\infty} (I + T^{2^k}) f.
\]
Compressed versions of these dyadic powers $T^{2^k}$ can be computed using either the singular value decomposition or diffusion wavelets [11], as we will see in Section 4. Since these matrices are all sparse under certain assumptions, this product can be approximated quickly, which in turn allows for the quick computation of the commute times via the formula (2.1).

Diffusion wavelets [11] extends ideas from wavelet theory to a more general setting. In particular, it extends the descriptions of wavelet bases in [13] for multiscale analysis of networks. The multiscale analysis includes a sequence of subspaces in L²(R) such that
\[
\cdots \supset V_{-1} \supset V_0 \supset V_1 \supset V_2 \supset \cdots, \qquad \bigcap_m V_m = \{0\}, \qquad \overline{\bigcup_m V_m} = L^2(\mathbb{R}).
\]
The Vm also have the property that f ∈ Vm ⇔ f(2·) ∈ Vm−1. Thus larger m correspond to increasingly coarser scales. Scaling functions are constructed from some Φ ∈ V0 by dilation and translation according to $\Phi_{m,n}(x) = 2^{-m/2}\Phi(2^{-m}x - n)$. The scaling functions at scale m, {Φm,n : n ∈ Z}, form an orthonormal basis of the space Vm. [13] also shows that a second basis of orthonormal functions Ψm can be constructed for each scale m. The general construction is as follows: given the scaling function Φ, choose wavelet coefficients cn, n ∈ Z, such that
\[
\Phi(x) = \sum_n c_n \Phi(2x - n).
\]
Then, we choose
\[
\Psi(x) = \sum_n (-1)^n c_{n+1} \Phi(2x + n).
\]
We can then define wavelet bases at each scale m from dilations and translations of the function Ψ, similarly to the scaling functions (that is, $\Psi_{m,n}(x) = 2^{-m/2}\Psi(2^{-m}x - n)$). The Ψm,n are orthonormal and span a space Wm with the properties Wm ⊥ Vm and Wm ⊕ Vm = Vm−1. Furthermore, the entire collection of Ψm,n for all m, n ∈ Z is an orthonormal wavelet basis for L²(R).

In the diffusion wavelets setting [11], the multiscale analysis includes a sequence of subspaces V0 ⊃ V1 ⊃ V2 ⊃ · · · such that V0 is the range of the operator T, and for m ∈ Z, Vm represents an approximation of the range of T at scale m. In particular, when {ζi}i∈Z+ and {λi}i∈Z+ denote respectively the eigenvectors and eigenvalues of T, Vm is defined as the span of $\{\zeta_\lambda : \lambda^{2^{m+1}-1} \geq \varepsilon\}$. Each Wm can then be defined as the orthogonal complement of Vm in Vm−1, as in the original wavelet setting. Note that the parts of the spectrum with $\lambda^{2^{m+1}-1} \geq \varepsilon$ can be viewed as "low-pass," so that the Vm correspond to the "approximation" or "smoothing" subspaces in classical wavelet theory, and the Wm correspond to the "detail" components. The scaling functions Φj at each level j then provide a basis for Vj, approximating a dyadic power of T. We leave the numerical details of the algorithm for Section 4.

2.3 An implementation of the SkipGram model based on sparse approximation and commute times

We now propose a method combining ideas from the previous section to produce network embeddings that preserve commute times while also optimizing the cross-entropy loss. Overall, the idea is to first obtain a sparse approximation of the random walk from a multiscale embedding process, and produce a commute time-preserving embedding basis from this approximation. Then, the final embeddings are produced via SGD, optimizing the weights on these basis vectors to minimize the cross-entropy loss.

Let T denote the transition matrix for the random walk on a network with N nodes. We here assume that T is local in the sense that its columns have small support, and that high powers of T will be of low rank. We start by computing the compressed representations of the dyadic powers of T, and call these representations T1 , . . .
, Tk , so that Tk denotes the representation of T 2 . To do this, we apply an orthogonalization procedure (the details of which will be explained in Section 4) to the columns of T , producing an orthogonal basis U1 = {u1 , u2 , . . . uK }, {K ≤ N }, that approximates the range of T . Then, we can write T1 := U1T T U1 , so that T1 is an approximation of T 2 expressed in terms of the basis U1 . We then repeat the process, orthogonalizing the columns of T1 to produce an approximate basis U2 for its range, using this new basis to compute a representation of T2 ≈ T 4 , and so on. More specifically, at each step we compute a truncated SVD of Tk . The singular vectors associated with the largest singular values then form a basis spanning approximately the range of Tk , 34 which we can view as a ”low-pass" approximation subspace analogous to the low-pass subspaces of classical wavelet theory. We can then use these singular vectors to compute a compressed k representation of the next dyadic power, Tk+1 ≈ T 2 , with respect to the basis Uk . Further details of this process will be reserved for Section 4. Using the Schultz method described in Section 2.3, we compute the unnormalized Green function G from T1 , . . . , Tk . From [19], network embeddings that preserve commute time relationships have the embedding matrix Θ satisfying ΘT Θ = vol · GD−1 . We produce such multiscale embeddings by repeating the same approximation process to produce compressed representations of Q = vol ·GD−1 . This process yields at each scale a basis of scaling functions Ũk that spans an approximation k of the range of Qk ≈ Q2 , as described in Section 4. Then, the embeddings for compression level k (with k = 1 representing the finest scale, and larger k representing increasingly coarse scales) are constructed from the scaling functions of Qk . Essentially, to obtain the k-level embedding for node x, we extend the truncated singular value decomposition at the kth level, Qk ≈ Ũk Σ̃k ṼkT , to 1/2 the original basis and take the matrix Θ = Σ̃k ŨkT to be the embedding matrix; that is, the nth column of Θ will provide the embedding coordinates for the nth node. (For symmetric G, note that vol · GD−1 ≈ ΘT Θ, as desired.) The singular vectors are a sensible choice for the embeddings because, similarly to the scaling functions in classical wavelet theory, the span of the columns of Uk is a low-pass approximation space for the range of Qk . Similar multiscale embeddings are also utilized in, for example, [11], [44], although these embeddings utilize the scaling functions of diffusion wavelets rather than singular vectors. It is also possible to replace the truncated SVD algorithm used here with the diffusion wavelet algorithm of [11]. A more detailed explanation and comparison of the two algorithms is given in Section 4. In the numerical experiments performed here, the SVD version of this algorithm outperformed the diffusion wavelet algorithm, so the remainder of the paper will focus on the SVD version. The same ideas can be implemented for either version. While the matrix Θ can itself be used as an effective node embedding, we can further improve 35 our results by using this sparse approximation as a starting point to obtain node embeddings that minimize cross entropy. The commute-time-based embedding results in a basis spanning the range of (a power of) the sparse approximation to the random walk. We can reweight the vectors in this basis spanning the same space such that cross entropy loss is minimized. 
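Before describing the reweighting step in detail, the pipeline just outlined can be summarized in a few lines of dense linear algebra. The following is only an illustrative reading of the construction: the variable names and the handling of the change of basis are ours, and np.linalg.svd stands in for the sparse or randomized SVD one would use on large graphs. Compressed dyadic powers of T are built by repeated truncated SVDs, the Schultz-style product gives an approximate Green function, and the embedding matrix is read off from the leading singular vectors of vol · G · D−1.

import numpy as np

def truncated_svd(M, alpha):
    # Keep the leading (roughly alpha * n) singular triples of M (dense, for illustration).
    U, s, Vt = np.linalg.svd(M)
    r = max(1, int(round(alpha * M.shape[0])))
    return U[:, :r], s[:r], Vt[:r, :]

def commute_time_embedding(A, alpha=0.5, levels=2):
    d = A.sum(axis=1)
    vol = d.sum()
    T = A / d[:, None]                       # random walk transition matrix D^{-1} A
    N = A.shape[0]

    # Compressed representations T_k of the dyadic powers; 'basis' maps each level back
    # to the original coordinates so the Schultz-style product accumulates in R^{N x N}.
    G = np.eye(N) + T
    Tk, basis = T, np.eye(N)
    for _ in range(levels):
        U, _, _ = truncated_svd(Tk, alpha)
        Tk = U.T @ Tk @ U                    # representation of the next dyadic power
        basis = basis @ U                    # change of basis back to original coordinates
        G = G @ (basis @ (np.eye(Tk.shape[0]) + Tk) @ basis.T)

    # Embeddings from the leading singular vectors of vol * G * D^{-1}; row i of the
    # returned matrix plays the role of theta_i in the text.
    Q = vol * G / d[None, :]
    Uq, sq, _ = truncated_svd(Q, alpha)
    return Uq * np.sqrt(sq)

# Toy usage on an 8-node ring graph.
A = np.roll(np.eye(8), 1, axis=1) + np.roll(np.eye(8), -1, axis=1)
print(commute_time_embedding(A).shape)

The reweighting by a diagonal matrix C and the probabilistic reintroduction of truncated singular vectors, described next, would then be applied on top of the returned matrix.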
Additionally, during the SGD stage we will, with some probability, reintroduce singular vectors that were removed in the previous truncations. In this way, we randomly reintroduce some of the information that has been lost in the approximation steps, without compromising computational efficiency. In particular, since taking powers of a Markov matrix is equivalent to running the Markov process forward in time, reintroducing columns from previous steps is analogous to preserving information about different time scales.

Cross-entropy loss is given by the following:
\[
\mathrm{Loss}_{CE} = -\sum_i \sum_{j : (i, j) \in E} \log\big(P(j \mid i)\big), \tag{2.2}
\]
where P(j|i) denotes the conditional probability of node j conditioned on i (the probability that a random walk will move from node i to node j). Here we model the conditional probabilities using the angular distances between nodes,
\[
AD(i, j) = 1 - \frac{1}{\pi} \arccos\left( \frac{\theta_i^T \theta_j}{\|\theta_i\|\,\|\theta_j\|} \right),
\]
by taking P(i|j) = (1/N) AD(i, j) if i ≠ j and P(j|j) = 1 − Σ_{i≠j} P(i|j).

Given the N × K sparse approximation Θ, where the ith row of Θ, θi, provides an embedding of node i, we aim to find constants C1, C2, ..., CK such that the embedding given by ΘC, where C = diag(C1, C2, ..., CK), minimizes the cross entropy (2.2). Each basis vector, given by a column of Θ, is therefore multiplied by a different weight constant. Using SGD, we choose the optimal constants so that the role of the more "important" columns (which correspond to particular singular values of the approximation to the random walk) will be emphasized.

After each step of SGD, with some small probability δ, we also extend the basis Θ with one of the indices from the basis in the previous step. That is, if $\Theta = \tilde{\Sigma}_k^{1/2}\tilde{U}_k^T$, we randomly select an index j with αkN ≤ j ≤ αk−1N. This index corresponds to a column uj of Uk−1. We can then project σj uj onto the basis Uk, and append this projection to Θ. This allows for the reintroduction of some information from the "high-pass" subspaces, removed during the approximation process, while still benefiting from the low-dimensional compression given by the low-pass bases.

The overall algorithm is as follows:

1. Produce the sparse approximation of the random walk (i.e., the Green function G and the commute time embeddings Θ) as described above, by computing the first αkN singular values and left singular vectors at the kth step.
2. Initialize C as the identity matrix, normalized so that its diagonal entries sum to 1.
3. Via stochastic gradient descent, find C optimizing the cross entropy loss function (2.2).
4. During the SGD process, at each step (with probability δ) append to the embedding matrix a vector corresponding to an index which was previously removed in the truncated SVD. Also extend C accordingly.

As a visualization aid, we consider Zachary's Karate Club network [50]. The scatter plot in Figure 2.1 (b) shows the two-dimensional embeddings generated by Commute Times with Diffusion Wavelets for the karate club network. The thirty-four nodes of this graph each represent a member of the karate club, with edges representing whether or not two people communicate outside of the club. Each node has been assigned a group label (indicated by the color of that node in the plot) corresponding to one of four clusters, depicted in Figure 2.1 (a). The above method results in embeddings that successfully distinguish all four clusters.

Table 2.1 Performance and clustering score for the karate club graph embeddings in Figure 2.3.
Network d Runtime F1 Macro Karate Club 3 0.85 s 0.9177 2.3.1 Modified Johnson-Lindenstrauss Lemma It can be shown that the embeddings produced via the proposed method (i.e., Green function-based commute time methods) satisfy a modified version of the Johnson-Lindenstrauss lemma. In 37 Figure 2.1 a) The karate club network (created via [30]). b) Commute time embeddings found via the proposed method. Colors of points are determined by the true classification of each node. particular, the Euclidean distances between these embeddings approximate up to ε the commute time distance (or, for diffusion wavelet-based embeddings, the square root of the commute time distance) between the original nodes. The commute time can be approximated with high probability by the product of G1/2 and a constant multiple of a random matrix with normally distributed entries: Theorem 1. Let S be the set of all vertices in a network with unnormalized Green function given by G, degree matrix D, and volume vol, and let CT (·, ·) denote the commute time. There exists Π ∈ Cm×d where m = O(log(1/δ)/ε2 ) and ε > 0 such that for all x, y ∈ S, (1 − ε)CT (x, y) ≤ ||Π(volGD−1 )1/2 δx − Π(volGD−1 )1/2 δy ||22 ≤ (1 + ε)CT (x, y) (2.3) with probability 1 − δ. Next, we give the main result of this section, which demonstrates that our embeddings preserve commute times between nodes. Theorem 2. Let S be the set of all vertices in the network, and let CT (·, ·) denote the commute time. 38 Figure 2.2 2 and 3-dimensional embeddings of the karate club network, based on the first columns of the commute time embedding matrix from [19]. Figure 2.3 2 and 3-dimensional embeddings of the karate club network, created with the proposed method. 1. If Φ(x) := ΦT δx denotes the embedding for node x obtained by applying the diffusion wavelet algorithm to the matrix volGD−1 with precision constant ε (that is, Φ is the orthonormal basis of scaling functions and Φ(x) is the xth row of Φ), then for all x, y ∈ S, q q −ε + CT (x, y) ≤ ||ΦT (δx − δy )|| ≤ ε + CT (x, y). (2.4) 39 2. Let U ΣV T ≈ volGD−1 represent the truncated SVD using the αN leading singular values σ1 ≥ ... ≥ σαN , for some α ≤ 1. Let Θ denote the embedding matrix obtained from the αN leading left singular vectors U , such that Θ = (Σ1/2 )T U T . Then for all x, y ∈ S, and ε = O(σαN +1 ), −ε + CT (x, y) ≤ ||ΘT (δx − δy )|| ≤ ε + CT (x, y). (2.5) See the Appendix for proofs. 2.3.2 Relationship with diffusion maps We here show that embeddings given by diffusion wavelet scaling functions are equivalent up to a rotation to the embeddings produced from a given kernel eigenmap method. In particular, the embeddings produced via the application of diffusion wavelets to the matrix vol · GD−1 are equivalent up to a rotation to the embeddings given by kernel PCA on vol · GD−1 , and so they are also equivalent up to a rotation to the commute time embeddings from [19]. In [10], it is shown that any kernel eigenmap method can be viewed as solving X min Q1 (f ), where Q1 (f ) = Qx (f ), Q2 (f )=1 x where Q2 , Qx are symmetric positive semi-definite quadratic forms with Qx local. For example, for a kernel k, we can take k(x, y)(f (x) − f (y))2 XX Q1 (f ) = x y v(x)f (x)2 . X Q2 (f ) = x Then, Q2−1 Q1 represents a discretization of a differential operator (i.e., a Laplacian). The following theorem and its proof are a slight generalization of the proof of Theorem 1 from [44] (see Appendix for proof). Theorem 3. Suppose we have a kernel-based method (e.g. 
Q1 , Q2 satisfying the properties in diffusion maps). Then the embeddings obtained from the diffusion wavelet scaling functions for 40 Q−12 Q1 at level j are equivalent, up to a rotation, to the pj -dimensional embeddings obtained from the diffusion maps embedding, where pj gives the number of scaling functions at scale j. In particular, since the commute time embeddings in [19] are equivalent to kernel PCA on the matrix vol · GD−1 , we can apply the above theorem with Q−1 2 Q1 = vol · GD −1 to see that the embeddings produced via the diffusion wavelet method are equivalent up to a rotation to the pj -dimensional commute time embeddings, and thus these embeddings will also preserve commute times between nodes. 2.4 Numerical Details of Sparse Approximation We now elaborate on two possible algorithms for the sparse approximation of the random walk. We first give details of the diffusion wavelet algorithm introduced in [11], and then propose an alternate implementation using an approximate SVD. 2.4.1 Diffusion Wavelets Diffusion wavelets generalizes the wavelet theory of [13] to diffusionlike operators. We can apply this theory for random walks on networks by taking the transition matrix T for a Markov process on the network to be such an operator. In this setting, the scaling functions Φm at level m are constructed by orthonormalizing the columns of the representation of T at the previous level, Tm−1 , up to precision ε–that is, the vectors Φm ε-span the range of Tm−1 , defined as follows in [11]. Definition. Let H be a Hilbert space such that {wj }j∈J ⊂ H. A set of vectors {vi }i∈I ε-spans the set of vectors {wj }j∈J if ||P⟨{vi }i∈I ⟩ wj − wj ||H ≤ ε, where P⟨{vi }i∈I ⟩ represents orthogonal projection onto the span of {vi }i∈I . We also say that ⟨{vi }i∈I ⟩ is an ε-span of ⟨{wj }j∈J ⟩. In fact, the subspace spanned by these scaling functions is an ε-approximation of Vm , which we 2 will denote Ṽm . As we will see, Tm is a representation of Tm−1 by construction, so each step of the diffusion wavelets algorithm dilates the scale by a factor of 2, as in the original wavelet setting. 41 In classical wavelet theory, the localization of wavelet bases can better approximate choppy signals and signals with different time scales compared to Fourier series, since using Fourier series in these cases would require one to break up the domain into pieces and analyze each piece separately. Using the local diffusion wavelet scaling function bases provides a similar advantage over global eigenvalues, particularly for graphs with irregular structures. In the following overview of the diffusion wavelet algorithm, we borrow some notation from [11]: we will use [A]B B1 to denote a matrix A that is written with respect to the basis B1 in the 2 domain, and with respect to the basis B2 in the range. For an operator T such that the numerical rank of its powers {T k }k=0,1,2,... decreases with k, the diffusion wavelets algorithm [11] computes compressed representations of the dyadic powers k T 2 , which can then be used to efficiently compute certain functions of T , in particular the Green function. In addition, when T is an |S| × |S| Markov matrix, taking higher powers of T is equivalent to running the Markov chain forward in time. Thus, diffusion wavelets allows for the analysis of the random walk at different time scales. 
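As a small illustration of the ε-span requirement (this is not the orthogonalization routine of [11]; the names and the threshold rule are ours), a column-pivoted QR factorization truncated where the diagonal of R drops below the precision yields a basis whose projection reproduces the columns of T up to roughly that precision:

import numpy as np
from scipy.linalg import qr

def eps_span_basis(T, eps=1e-3):
    # Column-pivoted QR; keep columns of Q while the corresponding |R_ii| exceeds eps.
    Q, R, piv = qr(T, pivoting=True)
    r = int(np.sum(np.abs(np.diag(R)) > eps))
    return Q[:, :max(r, 1)]

# Quick check of the eps-span property on a nearly low-rank matrix: projecting the
# columns of T onto the basis should reproduce them up to (roughly) eps.
rng = np.random.default_rng(2)
T = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 50))
T += 1e-6 * rng.standard_normal((50, 50))
B = eps_span_basis(T, eps=1e-3)
residual = T - B @ (B.T @ T)
print(B.shape[1], float(np.linalg.norm(residual, axis=0).max()))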
The diffusion wavelets algorithm DW(T, K, ϵ) is outlined as follows: As inputs, DW takes the operator T = T0 , a parameter K determining the number of iterations to be performed, and a precision parameter ϵ. The columns of T are originally expressed in the basis {δi }1≤i≤|S| =: Φ0 . First, the columns of T are orthonormalized up to ϵ using a modified QR algorithm (such as modified Gram-Schmidt with pivoting or rank-revealing QR; [11] elaborates on possible choices of orthogonalization technique). This results in an orthonormal matrix that gives an ε-span Φ1 of the columns of T written in terms of the original basis Φ0 , which we here denote [Φ1 ]Φ0 . Φ1 forms a basis for the subspace Ṽ1 , and gives an ϵ-approximation of V1 (that is, the range of T ). Additionally, the rank-revealing QR algorithm guarantees that the elements of this basis are well localized. We also obtain from the QR decomposition an upper triangular matrix representation of T , expressed with respect to Φ0 in the domain and Φ1 in the range, denoted [T ]Φ 1 Φ0 . Then, the product [Φ1 ]Φ0 [T 2 ]Φ ∗ 2 Φ1 2 Φ0 [Φ1 ]Φ0 = [T ]Φ1 =: T1 provides a representation of T0 in terms 0 of Φ1 . It is also possible, if desired, to compute a basis of wavelet functions spanning the orthogonal 42 complement of Ṽ1 in Ṽ0 , denoted W̃1 , by orthonormalizing the columns of I − [Φ1 ]Φ0 [Φ1 ]∗Φ0 . By k Φ iterating this entire process, we can compute representations of [T 2 ]Φk−1 k−1 =: Tk , scaling functions [Φk ]Φk−1 spanning the space Ṽk that ε-approximates Vk (i.e. the range of Tk ), and wavelet functions Ψk for levels k = 1, 2, ... 2.4.2 SVD In the numerical experiments presented here, we found that an SVD-based multiscale analysis was in general faster and produced more accurate node classifications compared to the original diffusion wavelet algorithm. This is demonstrated in Table 2.2 and Figure 2.4; on the butterfly dataset, three-dimensional embeddings produced via the SVD algorithm perform better than either the three-dimensional diffusion wavelet embeddings or the first three columns of the commute time embeddings described in [19]. Additionally, the SVD approach allows more control over the rate of dimensional reduction, since one can directly choose the proportion of singular values to be retained. This alternative formulation is as follows: Table 2.2 Comparisons between three-dimensional diffusion wavelet-based embeddings, truncated SVD-based embeddings, and the commute time embeddings introduced in [19], on the butterfly dataset [43, 45]. Runtime represents the time required to compute sparse approximations for powers of T and G via the given algorithm. Method Runtime F1 Macro Diffusion Wavelets 70 s 0.5827 SVD 3.5 s 0.7490 Commute Time Embeddings 0.5 s 0.6051 First, we compute a truncated SVD of T , retaining the j largest singular values and the corresponding singular vectors, so that T ≈ U ΣV where U ∈ R|S|×j , Σ ∈ Rj×j , and V ∈ Rj×|S| . Then, U1 := U provides an orthogonal basis for approximately the range of T , for some margin of error determined by the choice of j. To obtain a representation of T 2 with respect to the basis U1 , we use T1 := U1T T U1 . To justify this, note that the left singular matrix U has orthogonal columns, and furthermore these columns span the range of T . In particular, the left singular vectors corresponding 43 to the largest singular values span an approximation of the range of T , and therefore they play a role analogous to the diffusion wavelet scaling functions. 
k To obtain representations for T 2 for k = 2, 3, ..., we can simply iterate this process by finding the truncated SVD of Tk−1 , taking Uk to be the left singular matrix of that decomposition, and setting Tk = UkT Tk−1 Uk . Additionally, each Uk can be represented in the original basis using the change-of-basis formula: [Uj ]U0 = [Uj−1 ]U0 [Uj ]Uj−1 = · · · = [U0 ]U0 [U1 ]U0 · · · [Uj−1 ]Uj−2 [Uj ]Uj−1 , where [Uj ]Uk represents Uj expressed in terms of the columns of Uk , and U0 represents the original basis of the columns of T . (Note that in this notation, [Uj ]Uj−1 = Uj .) In the case where the random walk matrix T is symmetric, the connection to the diffusion wavelet-based algorithm is obvious; the singular values are the absolute values of the eigenvalues of T , and therefore the largest singular values are the largest-magnitude eigenvalues. The columns of the left singular matrix U then give the eigenvectors of T . If we select j such that ∀k ≤ j, m+1 −1 λ2k ≥ ε, then the first j columns of U (when the singular values are ordered from largest to m+1 −1 smallest) must span Vm := ⟨{ζλ : λ2 ≥ ϵ}⟩. In the asymmetric case, the basis given by the first j columns of U approximates the range of T according to the largest singular values, rather than the eigenvalues. Here, the multiscale analysis involves a sequence of subspaces X0 ⊃ X 1 ⊃ X2 ⊃ · · · such that X0 is the range of the operator T , and for m ∈ Z, Xm represents an approximation of the range of T at scale m, obtained via the largest singular values and their corresponding singular vectors. In particular, when {σi }i∈Z+ denotes the singular values of T , and uσi denotes the singular k+1 −1 vector corresponding to the ith singular value, we can define Xk := ⟨{uσ : σ 2 ≥ ε}⟩. These subspaces Xk can, as in the diffusion wavelet construction, be viewed as the "low-pass" portions of the range of T , corresponding to the "approximation" or "smoothing" subspaces in classical wavelet theory. 44 Algorithm 2.2 Truncated SVD: An approximate singular value decomposition of the N × N matrix T using the largest αN singular values. Input: Matrix T ∈ RN ×N , fraction of singular values retained α ∈ (0, 1) Result: Û , Σ̂, VˆT = TruncSVD(T , α) U, Σ, V T = SV D(Tk−1 ) ; /* Singular values sorted in decreasing order */ Û = first αN columns of U Σ̂ = first αN columns of Σ V̂ T = first αN rows of V T . Algorithm 2.3 Commute Time-based Network Embeddings with Truncated SVD. Input: T , K, K2 , 0 ≤ α ≤ 1 Result: Node embeddings for the network represented by the |S| × |S| Markov matrix T T0 = T for 1 ≤ k ≤ K do U, Σ, V =TruncSVD(Tk−1 , α) Tk = UkT Tk−1 Uk end for G = I + T0 G=G K Q 1 k=1 Uk (I + Tk ) ; /* Compute G */ for 1 ≤ k ≤ K2 do Ũ , Σ̃, Ṽ =TruncSVD(vol · G · D−1 , α) 1/2 θkT = Ũk Σ̃k ; /* Rows of θkT give embeddings at level k */ end for Embedding Matrix= θK2 . One possible disadvantage of this SVD method is that the orthogonalization here is not as careful as the orthogonalization step in diffusion wavelets [11], and as a result our bases are not guaranteed to be localized. While this has no apparent adverse effects on the applications discussed here, this method may not be suitable for other situations where diffusion wavelets is appropriate, particularly situations where the localization of the scaling functions is crucial. However, in practice (at least for the examples discussed here), the SVD basis vectors are "approximately" localized– that is, for any given basis vector, the majority of the nodes will be clustered near 0. 
Setting a tolerance such that these low-magnitude values is replaced with zero may improve performance slightly, in such cases. 45 This "approximate localization" property can be seen in Figure 2.4, which shows embeddings based on 2 and 3 singular vectors. Because this is smaller than the intrinsic dimension of the dataset, most of the nodes take values near 0 on the first 3 singular vectors. 2.5 Examples and performance comparisons The following examples illustrate the effectiveness of the method for multi-label classification tasks. In this setting, each node is assigned a class label. The goal is to correctly predict the nodes’ class labels based on their embeddings. In the following experiments, we record for each trial the computational time, and compare the F1 macro scores to judge the relative quality of each set of label predictions. We compare our method to the following existing methods: • GraRep [6], which creates multiscale embeddings by using the SVD of the k-step transition probability matrix to obtain k-step representations, and then concatenating the representations from each scale. Here we consider both a Matlab implementation [6] and a Python version available from [34]. • DeepWalk [29], which uses SkipGram to generate random walk-based embeddings from unbiased random walks. Here we use the code from the original paper, which uses hierarchical softmax to speed up computation of the conditional probabilities. • Node2vec [18], a method similar to DeepWalk that uses biased random walks to compute embeddings, based on parameters for breadth-first and depth-first sampling. • LINE [39], which uses an edge-sampling algorithm to optimize an objective function that is designed to preserve both first and second-order relationships. The Commute Time with SVD embeddings were computed using Algorithm 2.3 from Section 3. The choice of embedding size d for the commute time-based embeddings varies for each dataset, depending on both the size of the original network and the number of iterations of the compression algorithms used. For GraRep, LINE, and DeepWalk, all parameters except embedding size d 46 were left at their default values, and the sizes of the embeddings were chosen to match those of the Commute Time with Diffusion Wavelet embeddings. The F1 Macro scores reported here are averages based on 10 trials. All computations were performed with an Intel (R) HD Graphics 520 GPU. The F1 macro scores were computed by generating embeddings based on the data, then using the embeddings to train a 5-nearest neighbor classification routine. This classification model was then used to predict a class label for each node based on its embedding. F1 scores for each class were computed using the formula: TP F 1class i = 1 , TP + 2 (F P + F N) where T P denotes the number of true positives (nodes which were correctly assigned to class i), F P denotes the number of false positives (nodes which were correctly assigned to a different class), and F N denotes the number of false negatives (nodes which should have been assigned to class i, but were not). The overall F1 macro score is the sum over i of all F 1class i . For every dataset, the embeddings created with the method presented here obtained the highest or second-highest F1 macro score. The runtime for this method was below or on par with that of methods with comparable F1 scores, with the exception of the butterfly dataset. The small size of this dataset allowed the deterministic GraRep method to run efficiently and produce accurate embeddings. 
However, on the larger datasets, our method produced higher quality label predictions (as measured by F1 scores) compared to GraRep. 2.5.1 Butterfly dataset We first consider a network from [43, 45] consisting of 832 nodes which each represent different butterflies. Each of the organisms belongs to one of 10 species, and between 55 and 100 pictures of each species were collected from Google Images [43]. Edges indicate similarity between organisms; that is, two nodes will be connected by an edge if their respective images share features (such as wing color and pattern) in common. The data is available from [21]. The commute time with SVD 47 embeddings were based on 5 iterations of the algorithm, where 50% of the singular values were retained at each iteration. The embeddings that led to the best F1 macro scores for the butterfly network were Commute Times with SVD and GraRep (see Table 1). The computation time and F1 score of the Commute Time with SVD method were comparable to those of the Matlab and Python GraRep implementations. Table 2.3 Butterfly dataset. Method d Runtime F1 Macro Commute Times with SVD 27 35 s 0.9212 DeepWalk 27 20 s 0.8692 GraRep (Python) 27 10 s 0.9213 GraRep (Matlab) 27 7s 0.9103 node2vec (p=0.5, q=2.0) 27 3 min 3 s 0.8739 LINE 27 3.5 min 0.7662 Figure 2.4 plots the three-dimensional and two-dimensional embeddings of this network, respectively. The coordinates of the embeddings were determined by the first columns of the embedding matrix obtained above. Each color represents the correct classification of that node, that is, the butterfly species to which the corresponding image belongs. Although omitting the rest of the dimensions has decreased the quality of the embeddings (the F1 macro score for the three-dimensional embeddings pictured here are only 0.6287 and 0.7490), the clusters corresponding to three of the species are visible. Also note that, for these low-dimensional embeddings, the localized nature of the embedding vectors has caused the majority of the nodes to be clustered around the origin. Slightly higher-dimension embeddings (e.g. d = 27, as in Table 2.3) result in more accurate embeddings, as the majority of nodes will take a larger-magnitude value on at least one embedding vector. 2.5.2 Cora dataset The Cora dataset [36] (available from the Deep Graph Library, [46]) is a citation network consisting of 2708 papers, with edges representing citation relationships. For a given edge, the paper represented 48 Figure 2.4 Three-dimensional and two-dimensional node embeddings for the butterfly species network, generated using the algorithm described in Section 3. Embeddings in the top row use sparse approximations produced via diffusion wavelets, and those in the bottom row use sparse approximations produced via truncated SVD. Each color represents the species to which that node belongs. by the source node contained a citation of the target node’s paper. The papers are sorted into seven different topics, corresponding to seven distinct classes (Case Based, Genetic Algorithms, Neural Networks, Probabilistic Methods, Reinforcement Learning, Rule Learning, and Theory). The embedding size of the Commute Time with SVD embeddings is based on 4 iterations of this algorithm, retaining 50% of the singular values at each iteration. As shown in Table 2, for this network, the highest F1 score was achieved by LINE, with node2vec and the SVD-based commute time methods scoring only slightly lower. 
However, here the SGD-based methods and LINE had the longest runtimes, while GraRep required less than one 49 minute and still maintained F1 scores above 0.8. Overall, the commute time with SVD method, LINE and node2vec achieved the highest accuracy at the classification task while computing the embeddings efficiently. Table 2.4 Cora dataset. Method d Runtime F1 Macro Commute Times with SVD (α = 0.1) 170 3 min 0.8787 DeepWalk 170 81 s 0.8061 GraRep (Python) 170 37 s 0.8072 GraRep (Matlab) 170 65 s 0.8333 node2vec (p=0.5, q=2.0) 170 4 min 0.8749 LINE 170 2.5 min 0.8864 2.5.3 Amazon co-purchase photo dataset We next apply the methods to the Photo segment of the Amazon Co-Purchase network [22], available from [46]. The network consists of 7650 products for sale, each of which is assigned a label representing one of eight distinct product categories. Edge relationships are determined by which products were recommended when viewing the product page for a given node. As discussed in [22], these edges can therefore imply a variety of relationships: for instance, that customers frequently view the product pages for both items, or that customers frequently purchase both products, either simultaneously or one after the other. Depending on which type of relationship a given edge refers to, then, two connected nodes might be products that serve the same purpose, two different types of the same product, or products that can be used together. For the commute time with SVD embeddings, the embedding size is based on 4 iterations of this algorithm, where 50% of the singular values are retained at each iteration. The average F1 macro score over 10 trials was 0.9383. Results for each of the methods are outlined in Table 2.5. The Amazon Co-Purchase network is the largest of the networks considered here, leading to longer computational times for all methods. In these trials, the commute time with SVD method and node2vec had the best performance in terms of accurate labeling, with F1 scores of 0.9383 and 0.9335 respectively, but the commute 50 time with SVD method required approximately 20 minutes to run. In comparison, node2vec also had an F1 score above 0.9, but only required around 10 minutes. The Python implementation of GraRep performed similarly to the SVD-based method, while the Matlab version of GraRep failed to accurately classify the nodes. This is likely due to the structure of the network; specifically, the fact that the degree matrix of this network, which is inverted in the GraRep algorithm, is singular to machine precision. Table 2.5 Amazon Co-Purchase Dataset. Method d Runtime F1 Macro Commute Times with SVD (α = 1) 480 1189 s 0.9383 DeepWalk 480 570 s 0.9167 GraRep (Python) 480 1549 s 0.8972 GraRep (Matlab) 480 920 s 0.0451 node2vec (p=0.5, q=2.0) 480 9.5 min 0.9335 LINE 480 178 s 0.8203 2.5.4 Email database dataset Finally, we consider a network of email communications from a particular organization [49, 20], available from [21]. This graph contains 1005 nodes, each corresponding to a member of a certain European research institution. Edges indicate that a particular pair of members had exchanged emails during the 18-month data collection time period. Specifically, an edge from X to Y exists only if at least one email was sent from X to Y, and an email was sent from Y to X. Each person in the network belongs to one of 41 departments. The commute time with SVD embeddings were based on 5 iterations of the truncated SVD algorithm, where 75% of the singular values are retained at each iteration. 
The entire embedding process took 24 seconds on average, with an average F1 macro score of 0.6573. For all the methods used, the F1 scores are noticeably lower compared to the other examples, as indicated in Table 2.6. This may be a result of the structure of the network– the degree matrix for this graph is nearly singular. In particular, this causes the Matlab implementation of GraRep to 51 produce embeddings that lead to inaccurate classifications, similarly to in the Amazon Co-Purchase example. However, in this case we still see that the commute time-based embeddings predict the department labels more accurately than most other methods. The SVD algorithm, in particular, had the highest F1 macro scores of all the methods, while also requiring less time than any other method except for the Matlab implementation of GraRep, which had a much lower F1 score. Table 2.6 Email Dataset. Method d Runtime F1 Macro Commute Times with SVD (α = 1) 180 24 s 0.6573 DeepWalk 200 633 s 0.5169 GraRep (Python) 200 31.5 s 0.5431 GraRep (Matlab) 200 2.145 s 0.0048 node2vec (p=0.5, q=2.0) 200 2 min 20 s 0.6266 LINE 200 141 s 0.5423 2.5.5 COVID-19 lung image classifier In this example, we apply the methods to a dataset of chest x-rays, saved as 299 × 299-pixel images. The dataset contains x-rays from patients with COVID-19, viral pneumonia patients, and healthy people, and is available from [8, 32]. In the following experiments we use 60 x-rays from each category. To apply the methods, we reshaped each image into a 2992 -dimensional vector, then treated these vectors as the initial representations of each x-ray. We construct the matrix T using the Euclidean norm between each two vectors to represent the similarity between those nodes. The commute time with SVD embeddings were based on 6 iterations of this algorithm, where 50% of the singular values are retained at each iteration. For this dataset, we compared the F1 scores, averaged over 10 trials, obtained for this dataset using different amounts of training data. As shown in Table 2.7, the embeddings obtained from the method can be used to predict diagnoses on unlabeled data with some degree of accuracy, even with low amounts of training data. Using 90% of the dataset as training data, the method returns an F1 macro score of 0.8018 on average, while reducing the dimension of the vector representations 52 from 2992 to 3. As the amount of training data is decreased, the F1 macro score also decreases, but remains near 0.75 on average with only 50% training data. Even when training data is decreased to 10%, the average F1 score is above 0.5, at 0.5519. Figure 2.5 shows the method’s efficacy for clustering. In each subfigure, the low-dimensional embeddings show distinct grouping by color (here, yellow points represent images taken from patients with COVID-19, light blue represents images of healthy patients, and dark blue represents images from patients with viral pneumonia). The divisions between the clusters are particularly clear when a smaller dataset, containing only 20 points from each diagnostic category, is used. Figure 2.6 shows the decrease in cross entropy loss over 10 iterations of SGD, for 10 different trials. In all trials, cross-entropy loss drops steeply for the first few steps of gradient descent, quickly converging to low values within 5 iterations. Table 2.7 Lung X-Ray Dataset. 
Method: CT w/ SVD (d = 3, runtime 2.7 s)
Train %:    10%      30%      50%      70%      90%
F1 Macro:   0.5519   0.6892   0.7441   0.7767   0.8018

Figure 2.5 Three-dimensional node embeddings for the chest x-ray classifier, generated via the proposed method. Each color represents the diagnosis (normal, COVID-19, or viral pneumonia) associated with that x-ray. (a) shows the result of the method on a dataset consisting of 60 x-rays from each category, while (b) shows the result for a smaller dataset (20 from each category).

Figure 2.6 10 different trials of the embeddings for the chest x-ray classifier dataset. For each trial, the cross-entropy loss over 10 iterations of SGD is recorded. In all trials, the cross-entropy loss initially drops steeply, converging within 5 iterations.

2.6 Conclusion

This paper outlined a new method for embedding the vertices of a network while preserving commute time distances between nodes and optimizing cross-entropy loss. The method involves using truncated SVDs of a random walk matrix on the network to compute powers of the matrix and, via the Schultz method, an approximate discrete Green's function, and then obtaining multiscale network embeddings from the SVD of a constant multiple of the product of the Green's function and the degree matrix of the random walk. As shown in Section 5, this method performs favorably on several datasets when compared with the existing methods DeepWalk [29], node2vec [18], LINE [39], and GraRep [6].

Future work could include a more complete analysis of the method, and possibly further improvements to the algorithm. In particular, a full analysis of the numerical linear algebra will be the subject of future investigation. Additionally, it is likely possible to improve upon the stochastic gradient descent used for decreasing cross-entropy loss here. The SGD stage also makes the largest contribution to computation time. Further improvements could involve incorporating other randomized algorithms to more efficiently optimize cross-entropy loss.

CHAPTER 3

ADAPTIVE NETWORK EMBEDDING WITH METADYNAMICS

3.1 Introduction

Many applications in biology and chemistry, including protein folding and chemical reactions, involve systems which behave according to some potential energy landscape. The potential energy of the system at any given time depends on its state at that moment; for example, the concentrations of reactants and products in a chemical reaction, or the arrangement of molecules in a particular protein's structure. In general, these systems tend toward the lowest energy states available to them, but due to the frequently complex nature of these energy landscapes, the behavior of these systems is not always that simple. Additionally, to fully understand the behavior of such systems, we also want to identify the mechanisms (e.g. transition states) by which a particular system moves from one minimum to another.

A typical energy landscape for such a system may have many local minima, separated by energy barriers. In this case, we can view the system as a network of local minima, connected by edges weighted according to the energy barrier that must be overcome to transition between states. Taking this viewpoint allows us to utilize techniques for network embedding to analyze the behavior of these systems. One class of network analysis methods, node embedding methods, can be used to represent each node of a given network as a vector in a low-dimensional vector space, while retaining information about the relationships between nodes.
Specifically, nodes that tend to co-occur in random walk simulations ("similar" nodes) will be embedded near each other in the vector space. Hence, these embeddings can be used to visualize the dynamics of complicated systems, and identify the transition paths by which systems change from one state to another. However, another difficulty arises due to the presence of deep potential wells in these landscapes. While network embedding methods accurately identify energy minima where the system spends the majority of its time, other techniques are needed to get an improved, more detailed picture of 55 the energy landscape. One such technique is metadynamics. Metadynamics uses a non-Markovian random walk to explore an energy landscape, where the energy landscape (and therefore the transition probabilities of the random walk) is adjusted by adding a Gaussian term after each step so that over time, the energy wells in the landscape will be filled up. As this happens, the process is discouraged from revisiting the lowest-energy states repeatedly, and the random walk spends more time exploring other areas of the landscape. The end goal is to have a flat landscape, from which we can recreate the original system by subtracting the additive terms. In this article, we use network embedding techniques in combination with ideas from metady- namics to produce hierarchical embeddings that convey information about the system’s behavior at different scales. Here we adjust the edge weights of the network in a way that parallels the original metadynamics over time to encourage exploration away from the lowest-energy minima. We show that these embeddings provide an effective way of visualizing the system’s dynamics, including the entropy effect and internode relationships that may be obscured in other visualization methods. The remainder of this article is structured as follows. In Section 2, we provide some background on network embedding, energy landscapes, and metadynamics. In Section 3, we discuss Lennard- Jones clusters, and demonstrate our methodology on these test systems. Section 4 contains an application to a more complicated system: DNA folding in a human telomere. 3.2 Background 3.2.1 Network Embedding Real-world networks, particularly those representing biological and chemical systems, are often large and have complex structures, making them difficult to work with. To improve understanding of these networks, the nodes of a given network can be embedded into a low-dimensional vector space. There exists a wide variety of methods for network embedding, which may use random walks, matrix factorization, deep learning, or some combination of the three. The idea behind all these methods is that the nodes in a given network can be mapped into a low-dimensional vector space in such a way that closely related nodes have similar embeddings in the embedding space. 56 One broad category for node embedding techniques involves the use of a random walk on the network. This typically means building a random walk such that the possible states of the walk are the nodes of the network, and the transition rates are determined by the edges and edge weights. In methods such as DeepWalk and node2vec [29, 18], the general approach is to use short random walk simulations to determine similarities for each pair of nodes, then encode this information into embeddings via gradient descent. Two nodes are similar if there is a high probability that a random walk simulation containing one node will also contain the other. 
Then, the embeddings are assigned based on these conditional probabilities using the SkipGram algorithm, optimizing the loss function

\mathrm{Loss} = -\sum_{i} \sum_{j \in \mathrm{walk}(i)} \log\big(P(j \mid i)\big),

where walk(i) denotes a random walk trial starting at node i, and P(j|i) denotes the conditional probability that a random walk starting at node i will include node j, modeled by (where v_i denotes the embedding of i)

P(j \mid i) = \frac{\exp(v_i \cdot v_j)}{\sum_{k} \exp(v_i \cdot v_k)}.

The efficiency of this optimization can be greatly improved by using negative sampling [26], which instead models the probabilities as

P(j \mid i) = \sigma(v_i \cdot v_j) \prod_{m=1}^{M} \sigma(-v_i \cdot v_{k_m}),

where σ(x) = 1/(1 + e^{-x}) and M is the chosen number of negative samples (sampled from a uniform or unigram distribution). Some other random walk-based methods utilize matrix factorization of a random walk matrix, rather than performing random walk trials. In this article we use one such method, which we elaborate on in the next section.

3.2.1.1 Embedding Method

In the examples here, the embeddings are produced via a sparse approximation-based embedding method introduced in Chapter 2. The algorithm is as follows. First, we produce compressed approximations to the dyadic powers of the random walk on the network using its principal components; taking U_k and Σ_k to be the top left singular vectors and singular values of T_{k-1} ≈ T^{2^{k-1}}, we let T_k := U_k^T T_{k-1} U_k. From this, we produce a low-rank approximation to the Green function of the random walk, using the Schultz method:

Gf = \sum_{k=0}^{\infty} T^k f = \prod_{k=0}^{\infty} \left(I + T^{2^k}\right) f \quad \Rightarrow \quad \hat{G} = \prod_{k=0}^{K} \left(I + T_k\right).

The embedding matrix θ that satisfies θ^T θ = vGD^{-1}, where v gives the volume of the graph and D its degree matrix, is a commute time embedding [19]. Denoting the leading singular values and left singular vectors of the matrix v\hat{G}D^{-1} by Σ_G and U_G, we take θ^T := U_G Σ_G^{1/2}. Using this as a starting point, we introduce parameters by multiplying each column j of θ by a weight C_j, and optimize these weights to minimize cross-entropy loss via SGD.

3.2.2 Energy Landscapes

For many biological and chemical applications, the potential energy landscape provides insight into the dynamics of the system. Statistical analysis of the potential energy surface has been an active field of research for this reason, leading to results regarding global optimization, characterization of local minima and transition states, and other ways of gathering information about the energy landscape. These landscapes can be used both to understand the behavior of a system and to improve simulation techniques. Throughout the existing methods for energy landscape analysis, the central idea is to identify local minima via geometric optimization, and to find transition states connecting them via steepest-descent pathways [40]. One method for energy landscape exploration is basinhopping, introduced in [42]. Basinhopping is a Monte Carlo method that produces a database of local minima within an energy landscape. The basic idea of basinhopping is to simplify the energy landscape by mapping each point in the energy landscape to the local minimum it is closest to, and then constructing a random walk between these basins. In other words, a basinhopping run starts at a point on the energy landscape and performs a geometric optimization until a local minimum is reached. Then, each coordinate is shifted by a random amount, and another geometric optimization is done on the new point.
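The basinhopping loop just described can be written down compactly. The following is a minimal sketch, not the implementation used to build the databases in this chapter: it assumes only a callable potential function energy(x) and uses scipy.optimize.minimize for the local geometric optimization, with an illustrative step size and a simple Metropolis-style acceptance rule.

```python
import numpy as np
from scipy.optimize import minimize

def basinhopping_sketch(energy, x0, n_steps=500, step_size=0.3, beta=1.0, seed=0):
    """Toy basinhopping loop: local minimization plus random coordinate perturbation.

    energy: callable mapping a flat coordinate vector to a potential energy value.
    Returns a list of (coordinates, energy) pairs for the accepted local minima.
    """
    rng = np.random.default_rng(seed)
    # Map the starting point to its local minimum.
    res = minimize(energy, x0)
    x_min, e_min = res.x, res.fun
    minima = [(x_min.copy(), e_min)]
    for _ in range(n_steps):
        # Shift each coordinate by a random amount, then re-minimize.
        trial = x_min + rng.uniform(-step_size, step_size, size=x_min.shape)
        res = minimize(energy, trial)
        # Metropolis-style acceptance on the minimized energies.
        if res.fun < e_min or rng.random() < np.exp(-beta * (res.fun - e_min)):
            x_min, e_min = res.x, res.fun
            minima.append((x_min.copy(), e_min))
    return minima
```

The Lennard-Jones databases used in the examples below were generated with the Pele software rather than with a loop like this one.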
3.2.3 Metadynamics

Metadynamics was introduced in [27] as a technique to aid in the exploration of energy landscapes. The central idea is to create a non-Markovian (and in a sense, self-avoiding) random walk by adjusting the gradient of the energy landscape after each step by addition of a Gaussian term. Over time, these Gaussians eventually fill up the valleys in the energy potential, which allows the random walk to explore other areas of the landscape and leads to a more complete picture of the system dynamics. Specifically, after each step, the parameter ϕ_i, which represents the derivative of the energy with respect to the ith parameter, ϕ_i = -∂E/∂σ_i, is adjusted according to

\phi_i^{t+1} = \phi_i^{t} - \frac{\partial}{\partial \sigma_i} \left[ W \prod_i \exp\left( -\frac{|\sigma_i - \sigma_i^t|^2}{2\,\delta\sigma^2} \right) \right],

where W and δσ are the height and width of the Gaussian respectively, to be chosen based on prior knowledge of the energy landscape. The main goal of this note is to apply the central idea of metadynamics to network embedding. In particular, we use an iterative embedding process, adjusting the energy landscape at each iteration, to get a series of hierarchical embeddings that provide more information about the full energy landscape than a single embedding by itself. Since the energy landscape is represented by a network via an adjacency matrix A, at each iteration we update the entries of A according to

A(i, j) = A(i, j) - W \prod_i \exp\left( -\frac{\|\theta_i - \theta_j\|^2}{2\,\delta\sigma^2} \right). \qquad (3.1)

3.3 Lennard-Jones clusters with metadynamics

Lennard-Jones clusters are a test system used to model atomic or molecular dynamics within a fluid. While they are far simpler than more detailed models that portray these dynamics more realistically, their simplicity makes them an ideal starting point for global optimization and energy landscape analysis. Optimization techniques used with Lennard-Jones clusters may eventually provide insight into more complex optimization problems involving potential energy fields, such as protein folding, and the dynamics of more accurate fluid models. In Lennard-Jones clusters, we assume that the potential energy E of a given configuration of atoms depends only on distances between atoms:

E(r) = 4\epsilon \sum_{i<j} \left[ \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right],

where r_{ij} denotes the distance between atoms i and j, and the parameters σ and ϵ represent pair equilibrium separation and well depth. In the experiments below, we utilize reduced units (e.g. σ = ϵ = 1).

3.3.1 Adjacency matrix-based embeddings

To analyze the following Lennard-Jones clusters, we first generated a database of local energy minima using the Pele software available at [41]. The database was produced using a basinhopping run of 500 steps, and consists of 8797 local minima connected by 8099 transition states. The database also contains other thermodynamic information, including the potential energy at each local minimum. The embeddings here are based on a network we construct with nodes given by the local minima, and edges located between each pair of nodes where a transition state has been identified, weighted according to the potential energy. In the following examples, we used the adjacency matrix for this network to construct the embeddings, with entries given by the energy barriers between states. Initially, every node is embedded into the vector space. Since points on the transition path should be near the global minimum in terms of commute time, this typically results in most of the nodes being clustered around the global minimum's embedding.
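To make the update in Equation (3.1) concrete, the sketch below applies a per-edge version of the Gaussian adjustment to a weighted adjacency matrix, given the current node embeddings. The function name, the per-edge simplification (no product over nodes), and the choice to clip adjusted weights at zero are illustrative assumptions rather than part of the reference implementation; the default width and height mirror the values used for the telomere experiments later in this chapter.

```python
import numpy as np

def metadynamics_adjust(A, theta, W=1.0, delta_sigma=0.75):
    """Damp edge weights between nodes whose embeddings are already close.

    A: (n, n) symmetric weighted adjacency matrix.
    theta: (n, d) array whose rows are the current node embeddings.
    Returns an adjusted copy of A in which well-explored (nearby) pairs
    are de-emphasized by a Gaussian penalty, in the spirit of Equation (3.1).
    """
    # Pairwise squared distances between embeddings.
    diff = theta[:, None, :] - theta[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    penalty = W * np.exp(-sq_dist / (2.0 * delta_sigma ** 2))
    A_new = A.copy()
    mask = A_new > 0                      # only adjust existing edges
    A_new[mask] = np.maximum(A_new[mask] - penalty[mask], 0.0)
    return A_new
```

Alternating this adjustment with a re-embedding of the nodes of interest produces the hierarchical sequence of embeddings used in the examples that follow.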
Then, re-embedding only the nodes in that cluster reveals new, more detailed information about some of the remaining nodes. We do this by creating a new, smaller adjacency matrix including only those points, and adjusting the edge weights according to the Gaussian term from Section 2.3. Iterating this process leads to a sequence of embeddings, each of which provides a visualization of the energy landscape at a different scale. These embeddings can also be used to investigate the effects of entropy on a given system, here demonstrated by comparing embeddings for the same system at different temperatures.

3.3.1.1 8 Atom Lennard-Jones Embeddings

Figures 3.1 (a) and (b) represent the disconnectivity tree and embeddings (before and after applying the metadynamics adjustment in Equation 3.1) for the 8-atom Lennard-Jones cluster. The colors on each minimum on the disconnectivity tree are matched to those nodes' embeddings in Figure 3.1, with some of the nodes, e.g. those colored red, embedded to the same place. In the embeddings, nodes with closer dynamic relationships are clustered together, since they will have the smallest commute time distances between them. However, visual inspection of the embeddings reveals that they also seem to be clustered according to potential energy levels. Nodes that have similar energy levels (such as the three highest-energy nodes, for example) have been embedded almost indistinguishably near to each other. On a larger scale, we can also see how the higher energy nodes relate to the nodes with the two lowest energies. Specifically, the global minimum (in dark blue) is nearer to the nodes in the middle of the tree (in orange), while the second-lowest local minimum is much nearer to the three highest energy nodes. This allows us to draw conclusions about a lowest-commute-time path through these states: one of the orange-colored states might transition directly to the global minimum, while one of the higher-energy states would be more likely to transition to the second-lowest-energy state first, and then either remain there or transition on to the global minimum. Just as the temperature of a molecular system affects that system's behavior, embeddings obtained by this method will change with the introduction of entropic forces. Figure 3.2 shows the change resulting from an upward shift in temperature for this 8-atom cluster. The major change seen between these two temperatures is that at the higher temperature, the nodes' embeddings are more spread out: closely related nodes are embedded near each other, but not indistinguishably so, as in Fig. 3.2 (a). Although the overall groupings are similar in most respects, there are some changes. In particular, one of the higher-energy nodes (in red) is much more distant from the others at high temperatures.

Figure 3.1 Disconnectivity tree and metadynamics-based embeddings for the Lennard-Jones cluster with 8 atoms. Left: Disconnectivity tree of all local minima. Right: Embeddings for the local minima after applying the adjustment in 3.1. Closely related minima here have very similar or identical embeddings, for example both yellow minima are embedded at the yellow point on the right. In all figures, the color scheme denotes the potential energy of the system at the configuration represented by that node. Dark blue denotes low energies (e.g. the global minimum), and red denotes the highest-energy states.
At lower temperatures, the higher-energy nodes were all approximately equidistant from the two lowest-energy configurations, but at higher temperatures, we see that most of the nodes are much closer to one of the lowest-energy states, and further from the other.

Figure 3.2 Metadynamics-based embeddings for the 8-atom cluster at the temperatures T = 0.08 (left) and T = 1 (right). Color scheme denotes the potential energy of the system at the configuration represented by that node. Dark blue denotes low energies (e.g. the global minimum), and red denotes the highest-energy states.

For comparison, we also investigated the inter-node relationships by directly computing the commute times between each pair of nodes. The commute time between nodes i and j, or mean time for the Markov process on the graph prescribed by the Laplacian L to travel from i to j and back, is given by

CT(i, j) = \mathrm{vol} \sum_{k=2}^{N} \frac{1}{\lambda_k} \left( \phi_k(i) - \phi_k(j) \right)^2,

where 0 = λ_1 ≤ λ_2 ≤ λ_3 ≤ ... ≤ λ_N are the eigenvalues of L, and the ϕ_k are the corresponding eigenvectors. The off-diagonal entries of the Laplacian for a Lennard-Jones cluster with κ degrees of freedom are given by

L(i, j) = \sum_{k} \frac{O(i)\, v(i)^{\kappa}}{O_k(k)\, v_k(k)^{\kappa - 1}} \exp\left( -\beta \left( E_k(k) - E(i) \right) \right),

where O and O_k represent point group orders of the local minima and transition states, respectively, and similarly v and v_k represent mean vibrational frequencies and E and E_k represent potential energy levels at each configuration.

3.3.1.2 Lennard-Jones 38 Cluster Embeddings

Figure 3.5 shows the hierarchical embeddings for the 38-atom Lennard-Jones cluster. The disconnectivity tree [40] for this system is a little more complex, but we can still apply the same reasoning to compare it to the embeddings. Here, we used the adjacency matrix for this network to construct the embeddings, with entries given by the energy barriers between states. The first image shows the initial embeddings, colored by their commute time distance from the global minimum. We re-embedded the points of interest (the nodes nearest to the origin) and used two iterations of this process to obtain the second image.

Table 3.1 Commute times between nodes, 8-atom Lennard-Jones cluster.

Node 1 \ Node 2      1      2      3      4      5      6       7       8
1                    0   25.4   17.7   23.5   29.0   65.2    86.1    88.4
2                           0    7.7   13.5   19.0   55.2    60.7    63.0
3                                  0    5.8   11.3   47.5    68.4    70.7
4                                         0   17.1   53.3    74.3    76.5
5                                                0   58.8    79.7    82.0
6                                                       0   115.9   118.2
7                                                               0   123.7

Figure 3.3 The 8-atom Lennard-Jones network of local minima. Edge lengths are proportional to commute times. Node colors are chosen to match those in Figures 3.1 (a) and (b).

In the first stage embeddings (Figure 3.5 (a)), the nodes are arranged according to commute time distance from the global minimum, which corresponds loosely to potential energy level. In particular, the nodes with E > −170 (which correspond to the highest-energy clusters in the tree) are embedded further from the global minimum. In the next embedding, these nodes are removed due to their distance from the global minimum, and the more central cluster will be re-embedded. As we consider the second-stage and later embeddings, this pattern continues: each re-embedding reveals a new "layer" of nodes that are embedded further from the global minimum, and which correspond to nodes on the disconnectivity tree that are lower-energy than the nodes removed in the previous stage, but higher-energy than the more closely clustered nodes.
In particular, at the second level (Fig. 3.5 (b)), the embeddings have three small "spokes" originating from a central cluster containing the global minimum. However, these embeddings provide additional context: nodes that are embedded within a particular "spoke" are more closely related, which means the system is more likely to transition between these states. Since the nodes do not all come from the same group on the disconnectivity tree, these embeddings can also reveal interactions between nodes that aren't indicated on the disconnectivity tree. Additionally, since the spokes are connected to the cluster containing the global minimum, we can conclude that each spoke represents a potential transition pathway from the outer edge of the cluster to the center. In other words, if the system is at a state represented by the outer point of one of the spokes, its most likely path toward the global minimum will be to travel through the states represented by the other nodes in the spoke. We can apply similar reasoning to higher-level embeddings. Each stage reveals a more detailed picture of the dynamics of a different part of the system's energy landscape. The first levels give a coarse-grained picture, only identifying broad groups of high-energy and low-energy nodes, while later levels give a more fine-grained visualization of the nodes most closely related to the global minimum. Often, we want a more detailed, finer-grained visualization of the energy landscape than the disconnectivity tree in Figure 3.4 can provide. It is informative, therefore, to re-embed the parts of the network of greatest interest to gain further insight. Here we focus on the lowest-energy parts of the Lennard-Jones energy landscape. We repeated the above process using the subnetwork consisting only of the nodes with potential energy < −170.9, that is, the 163 lowest-energy nodes. Figure 3.6 shows the results of this experiment.

Figure 3.4 Disconnectivity tree for the 38-atom Lennard-Jones cluster.

Nodes in these embeddings (both Figure 3.5 and Figure 3.6) are clustered according to their similarity in terms of commute time. In other words, nodes that can be quickly and frequently reached from either the global minimum or the second lowest energy node will be grouped near them. As a result, most of the nodes we are most interested in end up in the same cluster, and the embeddings obtained from the first application of the embedding method only give useful clusters for nodes that are more distant in terms of commute time from the part of the graph of interest. As we progress through additional iterations, we pull apart the cluster containing the global minimum, positioned near the origin in the first iteration's embeddings. The final iteration's embeddings give us a clear picture of the dynamics of the process within the subnetwork defined by the nodes within this cluster. If two nodes share a cluster or are close in the embedding space, this indicates that the system can easily transition between those nodes, with a relatively low energy barrier. As a result, the groupings seen in these embeddings correspond to the groupings in the disconnectivity tree for this system [40, 41]. For instance, one of the clusters in the second-stage embeddings corresponds to the global minimum and its nearest neighbors (pictured directly right of center in the disconnectivity tree in Figure 3.4). Clusters can be mapped to the disconnectivity tree by comparing the potential energies of the nodes within the cluster to the tree.
In the third iteration, some of the clusters instead represent combinations of multiple disconnectivity tree groups; this is a result of re-embedding the nodes to spread out those nodes that were previously near the origin: those nodes end up being embedded in or near the clusters they are most closely related to, although they are not actually members of the respective tree groupings. For instance, the global minimum is embedded directly next to a node from a neighboring tree group. So, the first iteration's embeddings tell us about higher-energy nodes and those that are more distant from the global minima, and further iterations reveal information about parts of the graph nearer and nearer to those minima.

Figure 3.5 Hierarchical embeddings for the Lennard-Jones cluster with 38 atoms. Pictured are the embeddings for the local minima originally clustered at the origin (with the global minimum) after applying metadynamics (equation 3.1). Color scheme denotes commute time from the global minimum, with dark blue denoting short distances, and red denoting the furthest distances.

Additionally, these embeddings may be useful for identifying transition paths. The structure of the second-stage embeddings, in particular, reveals four transition paths: if the system is initialized from a node near the outside of one of these "spokes", its lowest-energy path to the global minimum will involve following the spoke into the center cluster. Adding up the distinct clusters in Figure 3.6, we see that we have identified about as many clusters as there are clusters on the disconnectivity tree. However, the clusters in the embeddings do not map directly onto tree clusters: clusters in the embedding space frequently contain closely related nodes from neighboring groups on the tree. To get a better idea of how these embeddings compare to the disconnectivity tree, we can consider the potential energy levels of each local minimum.

Figure 3.6 Embeddings of the local minima of the 38-atom Lennard-Jones cluster with potential energies less than −170.9. The left figure shows the original embeddings of all 163 minima; the right figure shows stage 2 embeddings (embeddings of the nodes clustered around the origin, after adjusting the adjacency matrix according to the metadynamics adjustment in equation (3.1)). Color scheme denotes commute time from the global minimum, with dark blue denoting short distances, and red denoting the furthest distances.

We can also use these embeddings to observe the results of entropic changes to the system. In Figure 3.7, the embeddings for this cluster under two additional temperature conditions are shown. The results of the temperature change are comparable to what was observed previously with the 8-atom cluster. Namely, at higher temperatures (Fig. 3.7 (b)), there is greater variation in the node embeddings, while at lower temperatures (Fig. 3.7 (a)) closely related nodes are more likely to be embedded much nearer to each other, creating the impression that there are fewer embeddings pictured.

3.4 Application: Human Telomere Folding

Here we attempt to apply this method to a more complex problem: DNA folding in a human telomere. Specifically, we consider a sequence of 22 nucleotide bases, A(G_3TTA)_3G_3, which repeats within human telomeres. This sequence is known to form a G-quadruplex, a type of secondary structure formed by groups of four guanine bases called G-tetrads.
Its structure and potential energy landscape were previously investigated in [12].

Figure 3.7 Metadynamics-based embeddings for the 38-atom cluster at the temperatures T = 0.08 (left) and T = 1 (right). Color scheme denotes the commute time distance from the global minimum. Dark blue denotes low distances (e.g. the global minimum), and red denotes the highest.

Figure 3.8 The four-strand G-quadruplex structure (PDB structure 1KF1), with guanine nucleotides colored green. Image produced with Chimera [31].

In [12], the potential energy landscape was modeled using the HiRE-RNA model for coarse-grained DNA modeling [28], which involves a summation of energy terms representing three different types of bonds between atoms: local interactions, non-bonded interactions, and hydrogen bonds. For each of the 22 nucleotides in the telomere, six or seven atoms are considered by the model. Discrete path sampling was then used to generate a database of local minima and their connecting transition states. We use this database as a starting point. In particular, we construct a network with nodes given by the 4000 lowest-energy local minima, and edges between nodes determined by the transition states connecting them. For these first experiments, we used a random walk based on the energy barriers between states, as in Section 3.1. Figure 3.9 shows the results of the initial embeddings and the second, third, and fourth iterations, based on an energy landscape adjusted by a Gaussian term with width 0.75 and height 1 after each successive embedding. As in the Lennard-Jones cluster example, the first embedding displays embeddings of all nodes, while each successive embedding shows a re-embedding of the subnetwork of nodes most closely related to the global minimum (that is, all nodes whose previous embeddings lie within a small tolerance of the global minimum). As with the Lennard-Jones clusters, each level of embedding represents "zooming in" on the part of the network around the global minimum. The first level gives us a global view: the global minimum and its nearest neighbors (with respect to commute times) are clustered at the origin, represented by a dark blue dot. The red and orange dots furthest away from the origin represent local minima which are more distantly related, requiring multiple steps or higher energies to transition to the global minimum. Potential transition paths can be identified by starting at one of these points, and moving toward the origin along nearby points. At the second level (and each of the following levels), the nodes embedded closest to the global minimum are re-embedded to give us a more detailed insight into the relationships of those nodes. The potential transition paths here can be constructed similarly.

3.4.1 Combining Metadynamics with Transition Path Theory

For energy landscapes with a large number of local minima, the corresponding networks contain large numbers of nodes. In fact, for the Lennard-Jones clusters, the number of minima increases exponentially with the number of atoms. In these situations, as we have seen, it is beneficial to have a method for network embedding that focuses on locally embedding regions of the graph that are of particular interest, for example the subnetwork consisting of the global minimum and its nearest neighbors.

Figure 3.9 Embeddings of the local minima network for the human telomere sequence, based on the adjacency matrix as in Section 3.1 and the metadynamic adjustment in 3.1. Colors of embeddings denote distance from the global minimum, with dark blue denoting the shortest distances and red the furthest.
Figure 3.10 Two local minima from the telomere's energy landscape, with potential energies of −68.0 and −69.7 respectively. Images produced with Chimera [31].

Since the states in this subnetwork are embedded near each other, the commute times between them are low, so the time required for the system to move between these configurations is low. The states in these subnetworks often represent molecular configurations that are similar, differing only by a simple change; therefore these subnetworks can have dramatically fewer degrees of freedom compared to the full network. We next present an alternate formulation, which replaces the adjacency matrix with the probability current matrix from transition path theory (TPT). TPT is a theoretical framework that can be used to study transition events (such as a molecule's transition from one configuration to another). In particular, TPT characterizes the transition paths of a system based on statistical properties such as rate functions and probability currents which measure the average rates of transitions between states. For large networks, particularly those with irregular structures, the committor functions needed to apply transition path theory to the full network may prove difficult or impractical to compute. Focusing our application of TPT onto a smaller, localized subnetwork avoids this difficulty, allowing us to take advantage of the additional information TPT offers. In the following experiments, we first embed all the local minima using the adjacency matrix, as in the previous section, then construct a subnetwork representing the telomere energy landscape nearest to the global minimum using a local transition path theory. Taking S′ ⊂ S to be the set of all nodes which are embedded within distance δ from the global minimum, we construct a subnetwork consisting only of the nodes in S′ by considering only the relevant portions of the adjacency matrix (adjusted according to the Gaussian term 3.1). Then, we compute the graph Laplacian L = D − A, and compute the committor functions and probability currents as follows. The forward and backward committors q⁺ and q⁻ solve systems of the form

\sum_{j \in S} L_{ij} q_j^+ = 0 \quad \forall i \in (A \cup B)^C, \qquad q_i^+ = 0 \quad \forall i \in A, \qquad q_i^+ = 1 \quad \forall i \in B.

In these experiments, we took the global minimum to be the initial state A, and chose the local minimum with the highest associated potential energy to be B. The probability current f and effective probability current f⁺ are given by

f_{ij} = \pi_i q_i^- L_{ij} q_j^+ \ \text{if } i \neq j, \qquad f_{ij} = 0 \ \text{if } i = j, \qquad f_{ij}^+ = \max(f_{ij} - f_{ji}, 0).

Finally, each subnetwork is embedded into R³ using the method described in Section 2.1.1, using the effective probability current in place of the adjacency matrix (a short computational sketch of these steps appears below). Here we apply the hierarchical embedding procedure to the subnetwork of nodes surrounding the global minimum (as we have done throughout this paper), but the same process could be used to examine other areas of the graph aside from the global minimum to obtain a fuller overall picture of the energy landscape. The distances between nodes within each subnetwork preserve the commute times between those nodes.
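The committor and current computations above reduce to a small linear solve on the subnetwork. The following is a minimal sketch, not the implementation used in the experiments: it assumes an undirected subnetwork given by a symmetric weighted adjacency matrix, uses the discrete-time random walk P = D⁻¹A with its degree-proportional stationary distribution for the current, and takes q⁻ = 1 − q⁺, which holds for reversible chains. The function and variable names are illustrative.

```python
import numpy as np

def committor_and_current(A, source, target):
    """Forward committor and effective current on a weighted subnetwork.

    A: (n, n) symmetric weighted adjacency matrix of the subnetwork.
    source, target: lists of node indices playing the roles of the sets A and B.
    """
    n = A.shape[0]
    deg = A.sum(axis=1)
    L = np.diag(deg) - A                       # graph Laplacian L = D - A
    q_plus = np.zeros(n)
    q_plus[target] = 1.0
    # Committor equations: (L q+)_i = 0 on interior nodes, q+ fixed on A and B.
    boundary = list(source) + list(target)
    interior = [i for i in range(n) if i not in set(boundary)]
    rhs = -L[np.ix_(interior, boundary)] @ q_plus[boundary]
    q_plus[interior] = np.linalg.solve(L[np.ix_(interior, interior)], rhs)
    q_minus = 1.0 - q_plus                     # reversible-chain assumption
    # Discrete-time transition matrix and its stationary distribution.
    P = A / deg[:, None]
    pi = deg / deg.sum()
    # Probability current and effective (net, nonnegative) current.
    f = pi[:, None] * q_minus[:, None] * P * q_plus[None, :]
    np.fill_diagonal(f, 0.0)
    f_plus = np.maximum(f - f.T, 0.0)
    return q_plus, f_plus
```

In the experiments above, the source set contains only the global minimum and the target set contains the local minimum with the highest potential energy in the subnetwork; the returned effective current matrix then replaces the adjacency matrix in the embedding step.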
Since the cross-entropy loss minimization is applied to the overall network at each step, we can expect that the distances between nodes correspond to likelihoods of a transition between those nodes, similarly to the previous examples, and therefore we can interpret the embeddings and identify possible transition paths in the same way. Though the main purpose of applying TPT in this way is to analyze larger networks, we can briefly demonstrate its efficacy on a smaller system–the 8-atom Lennard-Jones cluster from Section 3 (see Figure 3.11). We still see that the two nodes with potential energies near -19.2 (colored Figure 3.11 Hierarchical embeddings of the local minima network for the 8 atom LJ cluster, based on a TPT-based subnetwork consisting of the nodes clustered around the global minimum (in dark blue), and the metadynamic adjustment in 3.1. Colors of embeddings denote potential energy at each configuration, with dark blue denoting the lowest energies and red the highest. yellow in Figs 3.1 and 3.3) are closely connected, and the nodes represented in red and orange are closer to the lowest energy nodes than to each other, but in this case the short distance between the 73 2 lowest-energy nodes reflects a higher transition rate between them. The highest-energy nodes, embedded in red, are also embedded separately in this case, whereas their adjacency matrix-based embeddings were identical. These embeddings indicate that a transition path (shown in Fig. 3.11) from the highest-energy node to the slightly lower-energy node labeled 5 might pass through nodes 1 and 2 on the way. Direct simulations of this system confirm this transition path. Hence, these embeddings can be useful for predicting the mechanisms by which a molecule moves between two configurations. Having demonstrated our method on a simpler system, we now return to the human telomere molecule. Figure 3.12 shows the embeddings produced via the TPT process described above. The colors of local minima in Figure 3.12 are determined by the commute times between those nodes and the global minimum, which is embedded in dark blue. It is immediately apparent that the local minima are grouped according to these commute times, with separate clusters containing most of the red, yellow, and light blue nodes. Within each of these clusters, the nodes are relatively close in terms of commute times, indicating that these molecular configurations are similar up to some simple molecular change. As with the 8-atom Lennard-Jones cluster, these embeddings suggest possible transition paths. For example, if the system starts at one of the states furthest from the global minimum (colored in red, clustered in the upper left) one would expect a transition path to the global minimum to travel through the light blue and yellow clusters to reach the node in dark blue. 3.5 Conclusions This article presents a novel method for adaptive, hierarchical network embedding that combines the ideas of metadynamics with node embedding techniques. The usefulness of metadynamics for analyzing energy landscapes can then be extended to energy landscapes modeled by networks and other network applications. In this embedding scheme, the network itself – both its edge weights and the set of nodes under consideration – can be adjusted to more effectively focus on particular areas of the graph. In the context of energy landscapes, this allows us to focus on important parts of the graph, such as the lowest-energy minima and the nodes most closely connected to them. 
The same idea could be adapted to focus on other features of a given network as well. In some cases, 74 Figure 3.12 Hierarchical embeddings of the local minima network for the human telomere sequence, based on a TPT-based subnetwork consisting of the nodes clustered around the global minimum (in dark blue), and the metadynamic adjustment in 3.1. Colors of embeddings denote commute time distance from the global minimum, with dark blue denoting the shortest distances and red the furthest. the embeddings illuminate aspects of the system’s dynamics like temperature dependence that are not explored in other methods, such as disconnectivity graphs. To demonstrate the success of this method, we present the resulting hierarchical embeddings associated with a Lennard-Jones cluster of 38 atoms, as well as a more complicated system representing a DNA sequence from a human telomere. In the future, further work in this direction may lead to a scheme for analyzing larger, more complex biochemical systems such as protein folding. We also plan to develop a localized form of TPT, which could be applied to locally parts of a graph in order to make TPT more accessible for larger-dimension problems. 75 APPENDIX 76 APPENDIX PROOFS OF THEOREMS Theorem 1 Theorem. Let S be the set of all vertices in a network with unnormalized Green function given by G, degree matrix D, and volume vol, and let CT (·, ·) denote the commute time. There exists Π ∈ Cm×d where m = O(log(1/δ)/ε2 ) and ε > 0 such that for all x, y ∈ S, (1 − ε)CT (x, y) ≤ ||Π(volGD−1 )1/2 δx − Π(volGD−1 )1/2 δy ||22 ≤ (1 + ε)CT (x, y) (A.1) with probability 1 − δ. Proof. If we let λ0 , λ1 , ..., λ|S| and ζ0 , ζ1 , ..., ζ|S| denote the eigenvalues and eigenvectors of a matrix T , the diffusion distance at time t induced by T is defined by sX (t) d (x, y) = λt (ζλ (x) − ζλ (y))2 = ||T t/2 δx − T t/2 δy ||2 . λ Note that G is a left inverse of the random walk Laplacian L (that is, GL = I so long as the graph is connected), and so the eigenvalues of G are the reciprocals of the eigenvalues of L. (G and L also have the same eigenvectors.) Let λ0 , λ1 , ..., λ|S| and ζ0 , ζ1 , ..., ζ|S| denote the eigenvalues and eigenvectors of the normalized Laplacian L. Then |S| 1 ζi (x) ζi (y) )2 X CT (x, y) = vol (q −q (A.2) i=2 λ̂i D(x, x) D(x, x) |S| X ζi (x) ζi (y) 2 = volλi ( q −q ) (A.3) i=2 D(x, x) D(x, x) = ||(vol · GD−1 )1/2 δx − (vol · GD−1 )1/2 δy ||22 , (A.4) where the final equality comes from the fact that the eigenvalues of vol · G are given by volλi for all P|S| i, and by noticing that i=2 volλ̂i (ζi (x) − ζi (y))2 is equal to the square of the diffusion distance for t = 1, using the eigendecomposition of vol · G. 77 Then, let Π = √1 H where H has random entries hij ∼ N (0, 1). Fix x, y ∈ V . We can apply m the original Johnson-Lindenstrauss lemma to conclude that (1 − ε)||(vol · GD−1 )1/2 δx − (vol · GD−1 )1/2 δy ||22 ≤ ||Π(volGD−1 )1/2 δx − Π(volGD−1 )1/2 δy ||22 (A.5) ≤ (1 + ε)||(vol · GD−1 )1/2 δx − (vol · GD−1 )1/2 δy ||22 , (A.6) and therefore (1 − ε)CT (x, y) ≤ ||Π(volGD−1 )1/2 δx − Π(volGD−1 )1/2 δy ||22 ≤ (1 + ε)CT (x, y) (A.7) with probability 1 − δ. Theorem 2 Theorem. Let S be the set of all vertices in the network, |S| = N , and let CT (·, ·) denote the commute time. 1. 
If Φ(x) := ΦT δx denotes the embedding for node x obtained by applying the diffusion wavelet algorithm to the matrix volGD−1 with precision constant ε (that is, Φ is the orthonormal basis of scaling functions and Φ(x) is the xth row of Φ), then for all x, y ∈ S, q q −ε + CT (x, y) ≤ ||ΦT (δx − δy )|| ≤ ε + CT (x, y). (A.8) 2. Let U ΣV T ≈ volGD−1 represent the truncated SVD using the αN leading singular values σ1 ≥ ... ≥ σαN , for some α ≤ 1. Let Θ denote the embedding matrix obtained from the αN leading left singular vectors U , such that Θ = (Σ1/2 )T U T . Then for all x, y ∈ S, and ε = O(σαN +1 ), −ε + CT (x, y) ≤ ||ΘT (δx − δy )|| ≤ ε + CT (x, y). (A.9) 78 Proof. Proof of (1) Let Φ := [Φ1 ]Φ0 be the result of applying an iteration of the diffusion wavelet algorithm to vol · GD−1 . So the columns of Φ (denoted {ϕj }1≤j≤|S| ) are an ε-span of the range of volGD−1 . Let λ0 , λ1 , ..., λ|S| and ζ0 , ζ1 , ..., ζ|S| denote the eigenvalues and eigenvectors of the unnormalized Green function G. Then we can define the subspace: V0 := ⟨{ζλ : λ ≥ ε}⟩. The space spanned by the columns of Φ, Ṽ0 := ⟨Φ⟩, is ε-close to V0 . In particular, from Definition 12 in [11], we have for all ζ ∈ V0 : ||P⟨Φ⟩ ζ − ζ|| ≤ ε, where P⟨Φ⟩ is the orthogonal projection onto Ṽ0 . ||P⟨Φ⟩ (volGD−1 )1/2 (δx − δy )|| = ||P⟨Φ⟩ (volGD−1 )1/2 (δx − δy ) − (volGD−1 )1/2 (δx − δy ) + (volGD−1 )1/2 (δx − δy )|| ≤ ε + ||(volGD−1 )1/2 (δx − δy )|| q ≤ε+ CT (x, y). Also: ||ΦT (δx − δy )|| = ||ΦT (δx − δy ) + (volGD−1 )1/2 (δx − δy ) − (volGD−1 )1/2 (δx − δy )|| ≥ −||ΦT (δx − δy ) − (volGD−1 )1/2 (δx − δy )|| + ||(volGD−1 )1/2 (δx − δy )|| q ≥ −ε + CT (x, y). Therefore: q q −ε + CT (x, y) ≤ ||ΦT (δx − δy )|| ≤ ε + CT (x, y). Proof of (2) Note that for symmetric G, we have ΘT Θ ≈ volGD−1 . In general, using the first αN singular vectors of volGD−1 leads to an approximation with error σαN +1 : ||(volGD−1 )1/2 − (Σ1/2 )T U T ||2→2 = σαN +1 . 79 Hence ||ΘT (δx − δy ) − (volGD−1 )1/2 (δx − δy )||2 = O(σαN +1 ). ||ΘT (δx − δy )||2 ≤ ||ΘT (δx − δy ) − (volGD−1 )1/2 (δx − δy )||2 + ||(volGD−1 )1/2 (δx − δy )||2 ≤ CT (x, y) + O(σαN +1 ). Likewise, ||ΘT (δx − δy )|| ≥ −||ΘT (δx − δy ) − (volGD−1 )1/2 (δx − δy )|| + ||(volGD−1 )1/2 (δx − δy )|| ≥ −O(σαN +1 ) + CT (x, y). Therefore: −O(σαN +1 ) + CT (x, y) ≤ ||ΘT (δx − δy )|| ≤ O(σαN +1 ) + CT (x, y). The above proofs assume that the embeddings Φ and Θ are obtained directly from the exact Green function G, but in the method proposed here, the Green function is approximated using compressed powers of T . This introduces additional error in the following way: 1. If the Green function is approximated using the diffusion wavelet algorithm with precision ε, then we have YK Ĝ = (I + Tk ), k=0 k where Tk approximates T 2 up to ε. We have K such approximations, so Ĝ is a Kε- QK approximation of k=0 (I + Tk ). 2. If the Green function is instead approximated using truncated SVD, where for α < 1, α|S| singular values are retained at each step, then for each k, the approximation Tk introduces an QK error of magnitude σαk N +1 (Tk−1 ). The total approximation error is then k=0 σαk N +1 (Tk ). For either approximation, under the ideal assumptions on T from [11], there is some γ < 1 such j j−1 that the rank of T 2 up to precision ε is less than γrankε (T 2 ), for all j > 0. In this case, the tail Q∞ Q∞ K (I + Tk ) is of low rank – specifically, rankε ( K (I + Tk ) < γrankε (TK ). 80 Theorem 3 The following theorem and its proof are based on the proof of Theorem 1 from [44]. Theorem. 
Suppose we have a kernel-based method (e.g. Q1 , Q2 satisfying the properties in diffusion maps). Then the embeddings obtained from the diffusion wavelet scaling functions for Q−1 2 Q1 at level j are equivalent, up to a rotation, to the pj -dimensional embeddings obtained from the diffusion maps embedding, where pj gives the number of scaling functions at scale j. Proof. The diffusion maps algorithm embeds nodes by solving the eigenproblem Q−1 2 Q1 f = λf, and using Ψt (x) := (λt1 ψ1 (x) λt2 ψ2 (x) · · · λts(δ,t) ψs(δ,t) (x))T where s = max{l ∈ N such that |λl |t > δ|λ1 |t } and λk , ψk are the eigenvalues/eigenvectors of the transition matrix P . (With 1 = λ0 > |λ1 | ≥ · · · . Note that P and Q−1 2 Q1 have the same eigenvectors, and the leading eigenvectors of one correspond to the last eigenvectors of the other.) Let j = min{k : λt0k > ε}. Take the first pj eigenvectors of Q−1 2 Q1 , and let that be V1:pj . Let [ϕj ]ϕ0 represent the scaling functions of Q−1 2 Q1 at level j. The scaling functions of Q−1 2 Q1 at level j, by construction, give (an ε-approximation of) a localized base for Vj = ⟨ψλ : λ ∈ σ(T ), λtj ≥ ε⟩. (This is the space corresponding to the pj eigenvectors associated with the largest eigenvalues of P , or the eigenvectors associated to the pj least eigenvalues of Q−12 Q1 ). Note that the columns of V1:pj and [ϕj ]ϕ0 span the same space. Hence T V1:p V j 1:pj = I = [ϕj ]Tϕ0 [ϕj ]ϕ0 , T ⇒ V1:pj = V1:pj V1:p V j 1:pj = [ϕj ]ϕ0 ([ϕj ]Tϕ0 V1:pj ) 81 Call this last matrix in parentheses Q. Then it can be shown that QQT = QT Q = I, so Q is a rotation matrix. Thus the pj -dimensional diffusion map embeddings are equivalent up to a rotation to the jth-level scaling function-based embeddings. 82 BIBLIOGRAPHY 83 BIBLIOGRAPHY [1] A. Ahmed et al. “Distributed Large-scale Natural Graph Factorization”. In: WWW 2013 - Proceedings of the 22nd International Conference on World Wide Web. May 2013, pp. 37–48. doi: 10.1145/2488388.2488393. [2] R. Allen, P. Warren, and P. Wolde. “Sampling Rare Switching Events in Biochemical Networks”. In: Physical review letters 94 (Feb. 2005), p. 018104. doi: 10 . 1103 / PhysRevLett.94.018104. [3] M. Belkin and P. Niyogi. “Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering”. In: Advances in Neural Information Processing System 14 (Apr. 2002). doi: 10.5555/2980539.2980616. [4] P. Bolhuis et al. “Transition path sampling: throwing ropes over rough mountain passes in the dark”. In: Ann. Rev. Phys. Chem. 53 (2002), pp. 291–318. doi: 10.1146/annurev. physchem.53.082301.113146. [5] H. Cai, V. Zheng, and K. Chang. “A Comprehensive Survey of Graph Embedding: Problems, Techniques and Applications”. In: IEEE Transactions on Knowledge and Data Engineering (Sept. 2017). doi: 10.1109/TKDE.2018.2807452. [6] S. Cao, W. Lu, and Q. Xu. “GraRep”. In: CIKM ’15: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. Oct. 2015, pp. 891–900. doi: 10.1145/2806416.2806512. [7] M. Chen, Q. Yang, and X. Tang. “Directed Graph Embedding”. In: IJCAI International Joint Conference on Artificial Intelligence. Jan. 2007, pp. 2707–2712. [8] M. Chowdhury et al. “Can AI help in screening Viral and COVID-19 pneumonia?” In: IEEE Access 8 (2020), pp. 132665 –132676. doi: 10.1109/ACCESS.2020.3010287. [9] F. Chung. “Laplacians and the Cheeger Inequality for Directed Graphs”. In: Annals of Combinatorics 9 (Apr. 2005), pp. 1–19. doi: 10.1007/s00026-005-0237-z. [10] R. Coifman and S. Lafon. “Diffusion maps”. 
In: Applied and Computational Harmonic Analysis 21 (July 2006), pp. 5–30. doi: 10.1016/j.acha.2006.04.006. [11] R. Coifman and M. Maggioni. “Diffusion wavelets”. In: Applied and Computational Harmonic Analysis 21 (July 2006), pp. 53–94. doi: 10.1016/j.acha.2006.04.004. [12] T. Cragnolini et al. “Multifunctional energy landscape for a DNA G-quadruplex: An evolved molecular switch”. In: Journal of Chemical Physics 147 (Oct. 2017). doi: 10.1063/1.4997377.103 [13] I. Daubechies. “Orthonormal bases of compactly supported wavelets”. In: Comm. Pure Appl. Math. 41 (1988), pp. 909–996. doi: https://doi.org/10.1002/cpa.3160410705. 84 [14] J. Du and D. Liu. “Transitions states of stochastic chemical reaction networks”. In: Communi- cations in Computational Physics 29.2 (2020), pp. 606–627. doi: 10.4208/ cicp.OA-2019-0014. [15] C. Gardiner et al. “Handbook of Stochastic Methods for Physics, Chemistry and the Natural Sciences.” In: Biometrics 42 (Mar. 1986), p. 226. doi: 10.2307/2531274. [16] D. Gillespie. “Approximate Accelerated Stochastic Simulation of Chemically Reacting Systems”. In: Journal of Chemical Physics 115 (July 2001), pp. 1716–1733. doi: 10. 1063/1.1378322. [17] P. Goyal and E. Ferrara. “Graph Embedding Techniques, Applications, and Performance: A Survey”. In: Knowledge-Based Systems (May 2017). doi: 10.1016/j.knosys.2018. 03.022. [18] A. Grover and J. Leskovec. “node2vec: Scalable Feature Learning for Networks”. In: vol. 2016. July 2016, pp. 855–864. doi: 10.1145/2939672.2939754. [19] E. Hancock and H. Qiu. “Clustering and embedding using commute times”. In: IEEE Trans. Pattern Anal. Mach. Intell. 29 (2007), pp. 1873–1890. [20] J. Leskovec, J. Kleinberg, and C. Faloutsos. “Graph Evolution: Densification and Shrinking Diameters”. In: ACM Transactions on Knowledge Discovery from Data (ACM TKDD) 1(1) (2007). doi: 10.1145/1217299.1217301. [21] J. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data. June 2014. [22] J. McAuley et al. Image-based Recommendations on Styles and Substitutes. 2015. doi: 10.1145/2766462.2767755. url: http://arxiv.org/abs/1506.04757. [23] P. Mercurio and D. Liu. “Identifying transition states of chemical kinetic systems us- ing network embedding techniques”. In: Mathematical Biosciences and Engineering 18.1 (2021), pp. 868–887. issn: 1551-0018. doi: 10.3934/mbe.2021046. url: https: //www.aimspress.com/article/doi/10.3934/mbe.2021046. [24] P. Metzner, C. Schütte, and E. Vanden-Eijnden. “Transition Path Theory for Markov Jump Processes”. In: Mult. Mod. Sim. 7 (Nov. 2008), pp. 1192–1219. doi: 10.1137/ 070699500. [25] P. Metzner, C. Schütte, and E. Vanden-Eijnden. “Illustration of transition path theory on a collection of simple examples”. In: The Journal of chemical physics 125 (Sept. 2006), p. 084110. doi: 10.1063/1.2335447. [26] T. Mikolov et al. “Distributed Representations of Words and Phrases and their Composi- tionality”. In: Advances in Neural Information Processing Systems 26 (Oct. 2013). doi: 10.5555/2999792.2999959. [27] M. Parrinello and A. Laio. “Escaping Free-Energy Minima”. In: Proc. Natl. Acad. Sci. U.S.A. 99 (Nov. 2002), p. 12562. doi: 10.1073/pnas.202427399. 104 [28] S. Pasquali. “HiRE-RNA: a high resolution coarse-grained energy model for RNA”. In: The journal of physical chemistry. B 114 (Sept. 2010), pp. 11957–66. doi: 10.1021/ jp102497y. 85 [29] B. Perozzi, R. Al-Rfou, and S. Skiena. “DeepWalk: Online Learning of Social Representations”. 
In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Mar. 2014). doi: 10.1145/2623330.2623732. [30] L. Peshkin. MATLAB - GraphViz interface. May 2021. url: https://www.mathworks. com/matlabcentral/fileexchange/4518-matlab-graphviz-interface. [31] E. Pettersen et al. “UCSF Chimera–a visualization system for exploratory research and analysis”. In: J Comput Chem 25(13) (Oct. 2004), pp. 1605–12. doi: 10.1002/jcc. 20084. [32] T. Rahman et al. “Exploring the Effect of Image Enhancement Techniques on COVID- 19 Detection using Chest X-ray Images.” In: Comput Biol Med (2021). doi: 10.1016/ j.compbiomed.2021.104319. [33] S. Roweis and L. Saul. “Nonlinear Dimensionality Reduction by Locally Linear Embedding”. In: Science (New York, N.Y.) 290 (Jan. 2001), pp. 2323–6. doi: 10.1126/ science.290.5500.2323. [34] B. Rozemberczki. GraRep. 2019. url: https://github.com/benedekrozemberczki/ GraRep. [35] F. Scarselli et al. “The Graph Neural Network Model”. In: IEEE Transactions on Neural Networks 20.1 (Jan. 2009), pp. 61–80. issn: 1941-0093. doi: 10.1109/TNN.2008. 2005605. [36] P. Sen et al. Collective Classification in Network Data. 2008. doi: https://doi.org/ 10.1609/aimag.v29i3.2157. [37] R. Srivastava, M. S. Peterson, and W. E. Bentley. “Stochastic kinetic analysis of the Escherichia coli stress circuit using Sigma-32-targeted antisense”. In: Biotechnology and Bioengineering 75.1 (2001), pp. 120–129. doi: 10.1002/bit.1171. eprint: https: / / onlinelibrary . wiley . com / doi / pdf / 10 . 1002 / bit . 1171. url: https : / / onlinelibrary.wiley.com/doi/abs/10.1002/bit.1171. [38] R. Srivastava et al. “Stochastic vs. Deterministic Modeling of Intracellular Viral Kinetics”. In: Journal of theoretical biology 218 (Nov. 2002), pp. 309–21. doi: 10.1006/ jtbi.2002.3078. [39] J. Tang et al. “LINE: Large-scale Information Network Embedding”. In: Line: Large- Scale Information Network Embedding (Mar. 2015). doi: 10.1145/2736277.2741093. [40] D. J. Wales. “Exploring Energy Landscapes”. In: Annual Review of Physical Chemistry 69 (2018), pp. 401–25. doi: 10 . 1146 / annurev - physchem - 050317 - 021219. url: https://doi.org/10.1146/annurev-physchem-050317-021219. [41] D. J.Wales. The Cambridge Energy Landscape Database. https://www- wales.ch.cam.ac.uk/CCD.html. url: https://www-wales.ch.cam.ac.uk/CCD.html. 105 [42] D. J. Wales and J. P. K. Doye. “Global Optimization by Basin-Hopping and the Lowest Energy Structures of Lennard-Jones Clusters Containing up to 110 Atoms”. In: The Journal of Physical Chemistry A 101.28 (July 1997), 5111–5116. issn: 1520-5215. doi: 10.1021/jp970984n. url: http://dx.doi.org/10.1021/jp970984n. [43] B. Wang et al. “Network enhancement as a general method to denoise weighted biological networks”. In: Nature Communications (2018). doi: 10.1038/s41467-018-05469-x. 86 [44] C. Wang and S. Mahadevan. “Multiscale dimensionality reduction based on diffusion wavelets”. In: Tech. Rep., Univ. of Massachusetts (2009). [45] J. Wang, K. Markert, and M. Everingham. “Learning models for object recognition from natural language descriptions”. In: Proceedings of British Machine Vision Conference. 2009. doi: 10.5244/C.23.2. [46] M. Wang et al. Deep Graph Library: Towards Efficient and Scalable Deep Learning on Graphs. 2020. arXiv: 1909.01315 [cs.LG]. [47] E. Weinan and E. Vanden-Eijnden. “Towards a theory of transition paths”. In: J. Stat. Phys. 123 (2006), pp. 503–523. doi: 10.1007/s10955-005-9003-9. [48] E. Wigner. “The transition state method”. In: Trans. 
Faraday Soc. 34 (1938), pp. 29–41. doi: 10.1007/978-3-642-59033-7_16. [49] H. Yin et al. “Local Higher-order Graph Clustering”. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2017. doi: 10.1145/3097983.3098069. [50] W. Zachary. “An information flow model for conflict and fission in small groups”. In: Journal of anthropological research 33(4) (1977), 452–473. doi: 10.1086/jar.33.4.3629752. [51] D. Zhou, J. Huang, and B. Schoelkopf. “Learning from Labeled and Unlabeled Data on a Directed Graph”. In: Framework (Jan. 2005). doi: 10.1145/1102351.1102482.