MACHINE LEARNING-BASED STOCHASTIC REDUCED MODELING OF GLE AND
STATE-DEPENDENT-GLE
By
Zhiyuan She
A DISSERTATION
Submitted to
Michigan State University
in partial fulfillment of the requirements
for the degree of
Computational Mathematics, Science and Engineering—Doctor of Philosophy
2025
ABSTRACT
Predictive modeling of high-dimensional dynamical systems remains a central challenge in numerous
scientific fields, including biology, materials science, and fluid mechanics. When clear scale
separation is lacking, a reduced model must accurately capture the pronounced memory effects
arising from unresolved variables, making non-Markovian modeling essential. In this thesis, we
develop and analyze data-driven methods for constructing generalized Langevin equations (GLEs)
and extended stochastic differential equations that faithfully encode non-Markovian behaviors.
Building on the Mori–Zwanzig formalism, we first propose an approach to learn a set of non-
Markovian features—auxiliary variables that incorporate the history of the resolved coordinates—so
that the effective dynamics inherits long-time correlations. By matching evolution of correlation
functions in the extended variable space, our method systematically approximates the multi-
dimensional GLE without requiring direct estimates of complicated memory kernels. We show
that this approach yields stable, high-fidelity reduced models for molecular systems, enabling
significantly lower-dimensional simulations that nonetheless reproduce key statistical and dynamical
properties of the original system.
We then extend this framework to incorporate state-dependent memory kernels, facilitating
enhanced sampling across diverse regions of phase space. We demonstrate that constructing heteroge-
neous memory kernels—reflecting the local variations in unresolved degrees of freedom—improves
the model’s accuracy and robustness, especially in systems exhibiting multiple metastable states.
Through both numerical experiments and theoretical analysis, we highlight how these data-driven
non-Markovian models outperform traditional Markovian or fixed-memory approaches.
To address complex, multi-modal distributions in high-dimensional data, we then modify the
latent variable of a KRNet normalizing-flow architecture from a single Gaussian to a mixture-of-
Gaussians (MoG). This richer latent representation not only improves the model’s expressiveness
and training stability but also facilitates discovering collective variables (CVs), as the multi-modal
latent space reveals distinct modes corresponding to relevant metastable states or slow degrees of
freedom. Through both numerical experiments and theoretical analysis, we show that integrating a
MoG prior into KRNet yields superior density estimation, enhanced sampling of metastable basins,
and a more interpretable set of learned CVs.
Altogether, this thesis provides a comprehensive methodology for deriving scalable, memory-
embedded reduced dynamics augmented by advanced latent representations. Such models open new
possibilities for multi-scale simulations by merging fine-grained molecular fidelity with tractable
coarse-grained representations, all while systematically leveraging the benefits of multi-modal latent
spaces to identify key low-dimensional features. Our results underscore the practical advantages
of incorporating non-Markovian features and a mixture-based flow model in capturing the full
complexity of real-world molecular and dynamical systems.
Copyright by
ZHIYUAN SHE
2025
ACKNOWLEDGEMENTS
I extend my heartfelt gratitude to my advisor, Professor Huan Lei, whose unwavering support,
insightful guidance, and boundless patience have made this work possible. His passion for research
and dedication to academic excellence have continually inspired me to broaden my intellectual
horizons and to push myself beyond my comfort zone. I am truly grateful for the countless hours he
devoted to reading my drafts, sharing thoughtful feedback, and challenging me to become a better
researcher and problem-solver.
I also wish to acknowledge the encouragement and camaraderie of my groupmates, Pei Ge and
Liyao Lyu. Their collaborative spirit, readiness to exchange ideas, and commitment to building a
supportive research environment greatly enriched my work. Our thought-provoking discussions
often revealed new perspectives and solutions that shaped this thesis for the better. I deeply appreciate
their help both inside and outside the lab, from meticulous code reviews to the friendly conversations
that made my time in graduate school more rewarding.
I am especially thankful to my wife, Xia Li, whose unwavering love and understanding have been
an anchor throughout this demanding journey. Her belief in my abilities, willingness to celebrate
even the smallest milestones, and compassionate reassurance during times of doubt provided me
with the emotional resilience to persevere. I cannot overstate how much her steadfast support has
meant to me—her presence has been the guiding light that helped me navigate this path.
To each of you—my advisor, colleagues, and family—your contributions, generosity, and
devotion to this endeavor remain deeply appreciated. Your collective influence will continue to
shape my academic and personal development for years to come.
v
CHAPTER 1
OVERVIEW .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
1
TABLE OF CONTENTS
CHAPTER 2
Introduction .
DATA-DRIVEN CONSTRUCTION OF STOCHASTIC REDUCED
DYNAMICS ENCODED WITH NON-MARKOVIAN FEATURES . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
6
6
.
9
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
. 25
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
2.1
. .
2.2 Methods . .
2.3 Numerical results .
2.4 Summary . .
.
. .
. .
. .
CHAPTER 3
ENHANCED SAMPLING DATA-DRIVEN CONSTRUCTION OF
STOCHASTIC REDUCED DYNAMICS ENCODED WITH STATE-
DEPENDENT MEMORY . . . . . . . . . . . . . . . . . . . . . . . . . 28
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. 29
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
. 38
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
.
3.1
3.2 Methods . .
. .
3.3 Numerical results .
3.4 Summary . .
Introduction .
.
. .
. .
. .
CHAPTER 4
GENERATIVE MODEL BASED IDENTIFYING METASTABLE
STATES IN FULL MOLECULE SPACE . . . . . . . . . . . . . . . . . 41
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
. 43
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
. 52
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Introduction .
.
.
4.1
4.2 Method .
.
4.3 Numerical Result
. .
4.4 Summary.
.
.
. .
. .
. .
.
.
.
CHAPTER 5
CONCLUSION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
. 54
BIBLIOGRAPHY .
.
.
.
.
. .
. .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
APPENDIX A
MICROSCALE MODEL OF THE POLYMER MOLECULE . . . . . . 63
APPENDIX B
CONSTRUCTION OF THE FOUR-DIMENSIONAL FREE ENERGY
FUNCTION .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
APPENDIX C
FLUCTUATION-DISSIPATION THEOREM OF THE EXTENDED
DYNAMICS .
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
APPENDIX D
INVARIANT PROBABILITY DENSITY FUNCTION . . . . . . . . . 69
APPENDIX E
MEMORY KERNEL OF THE POLYMER MOLECULE SYSTEMS . . 70
vi
CHAPTER 1
OVERVIEW
The accurate modeling of high-dimensional dynamical systems remains a cornerstone challenge
across a range of scientific disciplines, from soft matter and biophysics to fluid mechanics and climate
science. Modern simulations of such systems often demand immense computational resources
due to the intricate interactions and multi-scale nature of the underlying processes. Although fully
resolved models, which include all microscopic variables and fine-grained details, can in principle
capture the relevant physics, they frequently prove infeasible in practice because of prohibitive
computational cost. As a result, significant efforts in coarse-grained modeling aim to reduce
dimensionality and complexity while preserving the essential statistics and long-time behaviors of
the original high-dimensional problem.
A central insight in coarse-graining is that purely Markovian models—those assuming in-
stantaneous and memoryless evolution—often fail to reproduce the observed time correlations
and transport properties in real systems. These discrepancies become particularly pronounced in
situations lacking clear time-scale separations, wherein so-called “fast” or unresolved degrees of
freedom exert non-negligible influences over extended time horizons. Consequently, one must
explicitly retain memory terms to produce physically accurate reduced dynamics. In theoretical
treatments, the Mori–Zwanzig (MZ) formalism provides a foundation for describing the projected
dynamics of a lower-dimensional set of variables, augmented with a time-dependent memory kernel
and stochastic fluctuation terms. However, direct numerical implementation of these ideas is seldom
straightforward because the memory kernel typically lacks a closed-form expression and may require
large volumes of data to estimate reliably.
In response, a growing body of literature has turned to data-driven approaches that learn reduced
equations or effective models directly from simulation trajectories or experimental data. Such
methods circumvent the direct computation of memory kernels by leveraging machine learning
techniques to discover the underlying structure of the system. This dissertation focuses on integrating
and extending three distinct strategies that address complementary aspects of non-Markovian
1
modeling for high-dimensional dynamical systems:
1. Non-Markovian Feature Learning
Our first strategy proposes a learning framework that sidesteps the need for explicit memory-
kernel estimation by introducing a set of auxiliary variables. These additional variables
encapsulate the “history” of the coarse-grained coordinates, effectively transforming what
would otherwise be a non-Markovian model into a higher-dimensional extended Markovian
system. By matching correlation functions between the full model and the extended reduced
model, we ensure that critical temporal dependencies are accurately preserved. This correlation-
matching step is instrumental: it encodes the slow relaxations and recurrent configurations that
are crucial for capturing long-time dynamics. Numerical experiments on molecular systems
demonstrate that non-Markovian feature learning can yield reduced-order simulations with
excellent fidelity to the reference trajectories, all while maintaining moderate computational
cost.
2. State-Dependent Memory Kernels
While the first approach offers a single global mechanism to embed memory effects, many
real systems exhibit state-dependent memory. For instance, macromolecular or fluid systems
may have multiple metastable basins, each with distinct relaxation times or energetic barriers.
In such cases, it is insufficient to assume a uniform memory kernel throughout the entire state
space. Our second strategy addresses this limitation by introducing heterogeneous memory
kernels that adapt to the local environment of the resolved variables. Rather than fitting a
single kernel function, we allow the memory to vary based on the instantaneous configuration
or thermodynamic state. This added flexibility is particularly advantageous for systems with
complex free-energy landscapes, as it enables more accurate modeling of basin-to-basin
transitions, barrier crossing, and other processes sensitive to local unresolved dynamics.
By learning these heterogeneous kernels from data, we capture nuanced variations in the
2
memory structure, significantly improving sampling efficiency and predictive performance in
multi-basin or multi-phase scenarios.
3. KRNet with a Mixture-of-Gaussians Latent Representation
Although robust memory modeling is critical for accurate dynamics, effectively capturing
distributional complexity in high-dimensional systems poses an additional challenge. Many
normalizing-flow methods, which learn invertible transformations from simple latent spaces
to complex data distributions, rely on a single Gaussian prior for the latent variables. Such
a unimodal assumption can limit the expressivity of the model, particularly when the
target distribution is multi-modal or exhibits heavy tails. Here, we introduce an advanced
KRNet architecture that employs a mixture-of-Gaussians (MoG) as the latent prior. By
allowing the latent space to be multi-modal, KRNet gains greater flexibility in approximating
intricate molecular or continuum distributions. Moreover, analyzing individual mixture
components provides insights into physically meaningful collective variables (CVs), such as
reaction coordinates or slow degrees of freedom that govern the system’s long-time behavior.
This MoG-based design not only improves density-estimation accuracy but also enhances
interpretability, offering an additional avenue for understanding how different metastable
states or configurations map to the underlying latent structure.
Taken as a whole, these three complementary approaches form a cohesive framework for
memory-aware, distribution-aware coarse-graining. In the first two, we focus on correctly capturing
time correlations and historical dependence—the hallmark of non-Markovian dynamics. In the third,
we emphasize handling complex data distributions and uncovering low-dimensional representations.
Implemented together, they enable practitioners to develop robust reduced-order models that do not
sacrifice crucial multi-scale or multi-modal characteristics of the original system.
Beyond theoretical importance, these techniques offer practical advantages: multi-scale simu-
lations become more feasible as the reduced models require fewer degrees of freedom to achieve
a similar predictive capability, potentially reducing wall-clock times while maintaining essential
3
physical fidelity. Moreover, our approaches naturally expose slow modes and transitions that might
otherwise remain obscured in fully detailed simulation data, thereby aiding in physical interpretation.
Researchers can identify meaningful collective variables or design specialized sampling protocols
targeting critical regions of phase space (e.g., near transition states or interfaces).
Ultimately, the methods presented here reflect a broader trend in computational science: as
machine learning and high-performance computing continue to advance, new opportunities emerge
for data-driven modeling of complex phenomena. These advances permit us to go beyond naively
discarding unresolved scales, instead systematically incorporating their effects through memory
terms, latent-variable modeling, or both. The chapters that follow detail each of these methods,
their theoretical underpinnings, and the empirical studies that demonstrate their utility. Collectively,
they underline the feasibility of incorporating memory effects and multi-modal representations
in next-generation coarse-grained simulation frameworks, bridging the gap between brute-force
full-resolution models and simpler—but often inaccurate—Markovian approximations.
In summary, the remainder of this dissertation proceeds as follows:
• We begin by examining non-Markovian feature learning and explain how auxiliary variables,
grounded in correlation-function matching, facilitate extended Markovian representations of
intrinsically non-Markovian processes.
• Next, we tackle state-dependent memory kernels as a direct way to incorporate local
environmental or configurational effects, significantly improving the realism of reduced
models in multi-basin systems.
• Finally, we present KRNet with a mixture-of-Gaussians (MoG) latent space, illustrating
how it enhances distribution modeling and offers a pathway to derive physically interpretable
collective variables. We highlight its synergy with memory-based approaches, demonstrating
an integrated methodology for advanced coarse-grained simulations.
Through these contributions, this dissertation seeks to illustrate the power and flexibility of
data-driven, memory-embedded modeling. By capturing temporal dependencies and multi-modal
4
structures simultaneously, researchers can generate high-fidelity reduced-order simulations capable of
exploring complex energy landscapes, long-time dynamical evolution, and rarely visited metastable
states. The approaches and results stand to benefit a wide array of fields, ranging from molecular
biophysics and materials science to geophysics and fluid mechanics, all of which confront the
challenges posed by limited computational budgets and intrinsically non-Markovian dynamics.
5
CHAPTER 2
DATA-DRIVEN CONSTRUCTION OF STOCHASTIC REDUCED DYNAMICS ENCODED
WITH NON-MARKOVIAN FEATURES
2.1
Introduction
Predictive modeling of multi-scale dynamic systems is a long-standing problem in many fields
such as biology, materials science, and fluid physics. One essential challenge arises from the high-
dimensionality; numerical simulations of the full models often show limitations in the achievable
spatio-temporal scales. Alternatively, reduced models in terms of a set of resolved variables are
often used to probe the evolution on the scale of interest. However, the construction of reliable
reduced models remains a highly non-trivial problem. In particular, for systems without a clear scale
separation, the reduced dynamics often exhibits non-Markovian memory effects, where the analytic
form is generally unknown. To close the reduced dynamics, existing methods are primarily based on
the following two approaches. The first approach seeks various numerical approximations of the
memory term by projecting the full dynamics onto the resolved variables based on frameworks such
as the Mori-Zwanzig formalism Mori (1965b); Zwanzig (1973) or canonical models such as the
generalized Langevin equation (GLE) Zwanzig (2001). Examples include the t-model approximation
Chorin et al. (2002), the Galerkin discretization Darve et al. (2009a), regularized integral equation
discretization Lange and Grubmüller (2006), the hierarchical construction Chen et al. (2014); Stinis
(2015); Zhu and Venturi (2018); Hudson and Li (2020); Price et al. (2021), and so on. Recent
studies Ma et al. (2018); Vlachas et al. (2018); Harlim et al. (2020); Wang et al. (2020a) based on
the recurrent neural networks Hochreiter and Schmidhuber (1997) provide a promising approach to
learn the memory term of deterministic dynamics. Yet, for ergodic dynamics, how to impose the
coherent noise term compensating for the unresolved variables remains open. The second approach
parameterizes the memory term by certain ansatz, e.g., the fictitious particle Davtyan et al. (2015a),
continued fraction Wall (1948); Mori (1965a), rational function Corless and Frazho (2003), such
that the memory and the noise terms can be embedded in an extended Markovian dynamics Mori
(1965a); Ceriotti et al. (2009); Baczewski and Bond (2013); Davtyan et al. (2015a); Lei et al. (2016a);
6
Jung et al. (2017a); Lee et al. (2019a); Russo et al. (2019); Ma et al. (2019); Grogan et al. (2020). In
addition, non-Markovian models are represented by discrete dynamics with exogenous inputs in
form of NARMAX (nonlinear autoregression moving average with exogenous input) Chorin and Lu
(2015); Lin and Lu (2021) and SINN (statistics information neural network) Zhu et al. (2022) and
parameterized for each specific time step. Recent work by Vroylandt et al. Vroylandt et al. (2022)
presents an expectation-maximization method to parameterize the reduced model from the full
model trajectories. Refs. Daldrop et al. (2017); Kowalik et al. (2019) present an efficient approach
based on analyzing the force correlation function to extract the memory function for the reduced
dynamics of aqueous molecules under quadratic confinement potential; see also recent review
Klippenstein et al. (2021) for further discussion. Despite the overall success, most studies focus on
the cases with a scalar memory function. Notably, the reduced model of a two-dimensional GLE is
constructed in Ref. Lee et al. (2019a). To the best of our knowledge, the systematic construction of
stochastic reduced dynamics of multi-dimensional resolved variables remains under-explored.
Ideally, to obtain a reliable reduced model, the construction needs to accurately retain the
non-Markovian features, enable certain modeling flexibility (e.g., the dimensionality of the resolved
variables) and adaptivity (e.g., the order of approximation), and guarantee the numerical stability
and robustness. In a recent study, we developed a Petrov-Galerkin approach Lei and Li (2021) to
construct the non-Markovian reduced dynamics by projecting the full dynamics into a subspace
spanned by a set of projection bases in form of the fractional derivatives of the resolved variables.
The obtained reduced model is parameterized as extended stochastic differential equations by
introducing a set of test bases. Different from most existing approaches, the construction does not
rely on the direct fitting of the memory function. Non-local statistical properties can be naturally
matched by choosing the appropriate bases, and the model accuracy can be systematically improved
by introducing more basis functions to expand the projection subspace. Despite these appealing
properties, the construction relies on the heuristic choices of the projection and test bases. Given the
target number of basis, how to choose the optimal basis functions for the best representation of the
non-Markovian dynamics remains an open problem. Furthermore, the numerical stability needs to be
7
treated empirically. These issues limit the applications in complex systems with multi-dimensional
resolved variables.
In this work, we aim to address the above issues by developing a new data-driven approach to
construct the stochastic reduced dynamics of multi-dimensional resolved variables. The method
is based on the joint learning of a set of non-Markovian features and the extended dynamic
equation in terms of both the resolved variables and these features. Unlike the empirically chosen
projection bases adopted in the previous work Lei and Li (2021), the non-Markovian features take
an interpretable form that encodes the history of the resolved variables, and are learned along with
the extended Markovian dynamic such that they are optimal for the reduced model representation.
In this sense, they represent the optimal subspace that embodies the non-Markovian nature of the
resolved variables. The learning process enables the adaptive choices of the number of features
and is easy to implement by matching the evolution of the correlation functions of the extended
variables. In particular, the explicit form of the encoder function enables us to obtain the correlation
functions of these features directly from the ones of the resolved variables rather than the time-series
samples. The constructed model automatically ensures numerical stability, strictly satisfies the
second fluctuation-dissipation theorem Kubo (1966), and retains the consistent invariant distribution
Español (2004); Noid et al. (2008).
We demonstrate the method by modeling the dynamics of a tagged particle immersed in solvents
and a polymer molecule. With the same number of features (or equivalently, the projection bases),
the present method yields more accurate reduced models than the previous methods Lei et al. (2016a);
Lei and Li (2021) due to the concurrent learning of the non-Markovian features. More importantly,
reduced models with respect to multi-dimensional resolved variables can be conveniently constructed
without the cumbersome efforts of matrix-valued kernel fitting and stabilization treatment. This is
well-suited for model reduction in practical applications, where the constructed reduced models
often need to retain the non-local correlations among the resolved variables. It provides a convenient
approach to construct meso-scale models encoded with molecular-level fidelity and paves the way
towards constructing reliable continuum-level transport model equations Lei et al. (2020); Fang
8
et al. (2022).
Finally, it is worthwhile to mention that the present study focuses on the model reduction of
ergodic dynamic systems where the full or part of the resolved variables are specified as known
quantities that either retain a clear physical interpretation (e.g., the tagged particle position), or are
experimentally accessible (e.g., the polymer end-to-end distance, the radius of gyration). Another
relevant direction focuses on learning the slow or Markovian dynamics from the complex dynamic
systems where the resolved variables are unknown a priori; we refer to Refs. Rohrdanz et al. (2011);
Pérez-Hernández et al. (2013); Li and Ma (2014); Krivov (2013); Lu and Vanden-Eijnden (2014);
Bittracher et al. (2018) on learning resolved variables that retain the Markovianity, Refs. Coifman
et al. (2008); Chiavazzo et al. (2017); Crosskey and Maggioni (2017); Ye et al. (2021); Feng et al.
(2022); Zieliński and Hesthaven (2022) on learning the slow dynamics on a non-linear manifold,
and Refs. Giannakis (2019); Klus et al. (2018); Dibak et al. (2018); Klus et al. (2020) on model
reduction of the transfer operator.
2.2 Methods
2.2.1 Problem Setup
Let us consider the full model as a Hamiltonian system represented by a 6𝑁-dimensional phase
vector Z = [Q; P ], where Q and P are the position and momentum vectors, respectively. The
equation of motion follows
(cid:164)Z = S∇𝐻 (Z),
(2.1)
where S = (cid:169)
(cid:173)
(cid:173)
(cid:171)
0
I
−I 0
(cid:170)
(cid:174)
(cid:174)
(cid:172)
is the symplectic matrix, and 𝐻 (Z) is the Hamiltonian function and initial
condition is given by Z (0) = Z0. For high-dimensional systems with 𝑁 ≫ 1, the numerical
simulation of Eq.
(2.1) can be computational expensive.
It is often desirable to construct a
reduced model with respect to a set of low-dimensional resolved variables z(𝑡) := 𝜙 (Z (𝑡)) where
𝜙 : R6𝑁 → R𝑚 represents the mapping from the full to the coarse-grained state space with 𝑚 ≪ 𝑁.
With the explicit form of 𝐻 (Z) and 𝜙(Z), the evolution of z(𝑡) can be mapped from the initial
values via the Koopman operator Koopman (1931), i.e., z(𝑡) = e𝑡Lz(0), where the Liouville
9
operator L𝜙(Z) = −((∇𝐻 (Z0))𝑇 S∇Z0)𝜙(Z) depends on the full-dimensional phase vector Z.
Using the Duhamel–Dyson formula, the evolution of z(𝑡) can be further formulated in terms of z
based on the Mori-Zwanzig (MZ) projection formalism Mori (1965b); Zwanzig (1973). However,
the numerical evaluation of the derived model relies on solving the full-dimensional orthogonal
dynamics Chorin et al. (2002), which can be still computational expensive.
In practice, the resolved variables are often defined by the position vector Q. The MZ-formed
reduced dynamics is often simplified into the GLEs, i.e.,
(cid:164)q = M −1p
(cid:164)p = −∇𝑈 (q) −
∫ 𝑡
0
θ(𝑡 − 𝜏) (cid:164)q(𝜏)d𝜏 + R(𝑡),
(2.2)
where q ∈ R𝑚 is the so-called collective variables, M is the mass matrix, 𝑈 (q) is the free energy
function, θ(𝑡) : R+ → R𝑚×𝑚 is a matrix-valued function representing the memory kernel, and R (𝑡)
is a stationary colored noise related to θ(𝑡) through the second fluctuation-dissipation condition
Kubo (1966), i.e., (cid:10)R(𝑡) R(0)𝑇 (cid:11) = 𝑘 𝐵𝑇θ(𝑡). It is worth mentioning that Eq. (2.2) is not exact
based on the MZ formalism. In particular, the memory function generally depends on the resolved
variables z and the noise term could be non-Gaussian; we refer to Ref. Ayaz et al. (2022) for further
discussion. Nevertheless, even for the simplified GLE form (2.2), the accurate construction of
the reduced model could remain highly-nontrivial. Specifically, the numerical simulation requires
the explicit knowledge of both the free energy 𝑈 (q) and the memory function θ(𝑡). Several
methods based on importance sampling Kumar et al. (1992); Darve and Pohorille (2001); Laio and
Parrinello (2002a) and temperature elevation Rosso et al. (2002); Maragliano and Vanden-Eijnden
(2006, 2008) have been developed to construct the multi-dimensional free energy function. In
real applications, the main challenge often lies in the treatment of the memory kernel θ(𝑡). In
particular, for multi-dimensional collective variables q, the efficient construction of numerically
stable matrix-valued memory function remains under-explored.
In this study, we develop an alternative approach to learn the reduced model. Rather than directly
constructing the memory function θ(𝑡) in Eq. (2.2), we seek a set of non-Markovian features from
the full model, denoted by {ζ𝑖}𝑛
𝑖=1, and establish a joint learning of the reduced Markovian dynamics
10
in terms of both the resolved variables and these features, i.e.,
d ˜z = g ( ˜z) d𝑡 + 𝚺dW𝑡,
(2.3)
where ˜z := [q; p; ζ1; · · · ; ζ𝑛] represents the extended variables and W𝑡 represents the standard
Wiener process. In principle, any such extended system would generally lead to a non-Markovian
dynamics for the resolved variables z = [q; p]. However, the essential challenge is to determine
{ζ𝑖}𝑛
𝑖=1 so that the non-local statistical properties of z can be preserved with sufficient accuracy.
Also, the form of g(·) and 𝚺 will need to be properly designed such that the reduced model retains the
consistent thermal fluctuations and density distribution. In particular, the introduction of auxiliary
variables can be loosely considered as approximating the full-dimensional Koopman operator in
a sub-space. However, different from Ref. Lei and Li (2021), the features {ζ𝑖}𝑛
𝑖=1 are not the
empirically-chosen projection bases; instead, they are learned along with model equation (2.3) for
the best approximation of the non-local statistics. This essential difference enables the present
method to generate more accurate reduced model and be easy to implement for multi-dimensional
resolved variables without empirical treatment for numerical stability.
2.2.2 Non-Markovian features and the extended dynamics
To illustrate the essential idea, let us consider a solute molecule immersed solvent particles. To
construct a reduced model (2.3) for the solute molecule, a main question is how to construct the
auxiliary variables ζ := [ζ1; ζ2; · · · ; ζ𝑛]. Intuitively, ζ𝑖 should depend on the full-dimensional vector
Z such that their evolution encodes certain unresolved dynamics orthogonal to the subspace spanned
by z(𝑡). A straightforward approach is to represent ζ (𝑡) as a function of Z (𝑡), i.e., ζ = h(Z).
However, the direct construction of the general formulation h(Z) would become impractical since
the learning generally involves sampling the individual solvent particles near the solute molecule;
the computational cost could become intractable due to the high-dimensionality of Z.
To circumvent the above challenges, the key ascribes to formulate ζ (𝑡) such that it properly
encodes the unresolved dynamics of Z (𝑡), and meanwhile, can be easily sampled. One important
observation is that the history of p(𝑡) naturally encodes the unresolved dynamics orthogonal to z(𝑡)
and the values can be conveniently obtained. To see this, we note that the dynamics follows (cid:164)p = Lp
11
where the Liouville operator L𝜙(Z) = −((∇𝐻 (Z0))𝑇 S∇Z0)𝜙(Z) depends on the full-dimensional
vector Z. Therefore, it is possible to construct ζ (𝑡) by some encoder functions in terms of the
time history of p(𝑡), i.e., p(𝜏) with 𝜏 ≤ 𝑡, such that they retain certain components orthogonal to
z(𝑡). This is somewhat similar to the study Lei et al. (2020) on modeling the non-local constitutive
dynamics of non-Newtonian fluids by learning a set of features from the polymer configuration
space. The main difference is that the present features ζ are non-Markovian in the temporal space.
Accordingly, we define a set of non-Markovian features {ζ𝑖}𝑛
𝑖=1 by
∫ +∞
ζ𝑖 (𝑡) =
ω𝑖 (𝜏)v(𝑡 − 𝜏)d𝜏
0
𝑁𝑤
∑︁
𝑗=1
≈
w𝑖 ( 𝑗 𝛿𝑡)v(𝑡 − 𝑗 𝛿𝑡)
(2.4)
where v := M −1p represents the generalized velocity, ω𝑖 : R+ → R𝑚×𝑚 represents the encoder
function represented by 𝑁𝑤 discrete weights {w𝑖 ( 𝑗 𝛿𝑡)}𝑁𝑤
𝑗=1 whose values will be determined later.
ζ𝑖 (𝑡) can be loosely viewed as a generalized convolution over the history of v(𝑡) whose evolution
encodes the orthogonal dynamics of z(𝑡). Therefore, it is possible to learn a set of ζ𝑖 (𝑡) such
that the joint dynamics in terms of both z(𝑡) and ζ𝑖 (𝑡) can be well approximated by the extended
Markovian model (2.3). Moreover, the linear form in terms of v(𝑡) ensures that the invariant density
function of ζ𝑖 (𝑡) retains the Gaussian distribution consistent with v(𝑡). We can further impose a
constraint such that v(𝑡) and ζ𝑖 (𝑡) are uncorrelated. This leads to an additional quadratic term in the
energy function of the extended system, i.e., 𝑊 (q, p, ζ) = 𝑈 (q) + 1
2p𝑇M −1p + 1
2ζ𝑇 ˆ𝚲
−1
ζ, where ˆ𝚲
represents the covariance matrix of the ζ1, · · · , ζ𝑛. The reduced dynamics can be written as
d
q
(cid:169)
(cid:173)
(cid:173)
p
(cid:173)
(cid:173)
(cid:173)
(cid:171)
ζ
= G∇𝑊 (q, p, ζ)d𝑡 + 𝚺dW𝑡,
(2.5)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
12
where the matrix G ∈ R(2+𝑛)𝑚×(2+𝑛)𝑚 takes the form
G =
0
(cid:169)
(cid:173)
(cid:173)
−I
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
0
...
0
I 0 · · · 0
J
I
0 0 · · · 0
0
0
...
0
I
ˆ𝚲
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
.
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(2.6)
The matrix J ∈ R𝑛𝑚×𝑛𝑚 further determines the statistical properties of the resolved variables and
will be learned along with the non-Markovian features {ω𝑖 (𝑡)}𝑛
𝑖=1 from the training samples as
discussed in next subsection. Given the reduced model in form of Eqs. (2.5) and (2.6), the noise
covariance matrix can be determined by
𝚺𝚺𝑇 = −𝛽−1(J 𝚲 + 𝚲𝑇 J 𝑇 ),
(2.7)
where 𝛽 = 1/𝑘 𝐵𝑇 and 𝚲 = d𝑖𝑎𝑔(I, ˆ𝚲). The form of 𝚲 implies that v and ξ𝑖 are uncorrelated and
is consistent with the energy function of the extended system 𝑊 (q, p, ζ). It also alleviates the
non-negative constraint of 𝚺𝚺𝑇 as discussed in Sec. 2.2.3. Furthermore, we can show that model
(2.5) strictly satisfies the second-fluctuation dissipation theorem. Specifically, the embedded kernel
in Eq. (2.5) takes the form
˜θ(𝑡) = −
(cid:16) ˜J11𝛿(𝑡) + ˜J12e
˜J22𝑡 ˜J21
(cid:17)
,
(2.8)
where ˜J11 = [ ˜J ]1:𝑚,1:𝑚, ˜J12 = [ ˜J ]1:𝑚,𝑚+1: and ˜J21 = [ ˜J ]𝑚+1:,1:𝑚 are the sub-blocks of the matrix
˜J := J 𝚲. The colored noise ˜R(𝑡) in terms of p(𝑡) is related to ˜θ(𝑡) by
(cid:10) ˜R(𝑡) ˜R(𝑡′)𝑇 (cid:11) = −𝛽−1 (cid:16) ˜J12e
˜J22 (𝑡−𝑡′) ˜J21 + ( ˜J11 + ˜J 𝑇
11)𝛿(𝑡 − 𝑡′)
(cid:17)
(2.9)
with 𝑡′ < 𝑡. Moreover, we can show that the reduce model retains the consistent invariant density
function with the full model, i.e.,
𝜌eq ∝ exp [−𝛽𝑊 (q, p, ζ)] .
(2.10)
We refer to Appendix C and D for details.
13
We conclude this subsection with two remarks: (I) In principle, the mass matrix M further
depends on q. Ref. Lee et al. (2019a) reports that the varying mass matrix plays a secondary
effect on the reduced dynamics of the molecular system therein; see also Ref. Ayaz et al. (2022)
for the cases of the nonlinear distance coordinate with constant mass. A constant mass matrix is
adopted in the present study; reduced models with the varying mass matrix can be constructed
by introducing an additional term in the conservative force and will be considered in the future
study. (II) The non-Markovian features {ζ𝑖}𝑛
the state-dependence, e.g., ζ𝑖 (𝑡) = ∫ +∞
𝑖=1 in form of Eq. (2.4) can be generalized to retain
0 ω𝑖 (𝜏, q(𝜏))v(𝑡 − 𝜏)d𝜏, which leads to a reduced model
with state-dependent non-Markovian memory. In this study, we demonstrate the proposed learning
framework by constructing the reduced model (2.5) that approximates the standard GLE (2.2) with
state-independent memory function θ(𝑡). As shown in the numerical examples, although θ(𝑡) is
not explicitly constructed, it is well approximated by the memory kernel embedded in the reduced
model (2.5) by matching the evolution of the correlation matrices for both the resolved and the
extended variables. The learning of reduced models with the heterogeneous memory term will be
systematically investigated in the future study.
2.2.3
Joint learning of the reduced dynamics
Construction of the above reduced models relies on the joint learning of the non-Markovian
features (2.4) in form of the encoder functions {ω𝑖 (𝑡)}𝑛
𝑖=1 and the reduced dynamics (2.5)(2.6)
determined by the free energy 𝑈 (q) and the matrix J .
In this study, we represent the multi-
dimensional free energy 𝑈 (q) using a neural network and parameterize it based on the molecular
dynamics Frenkel and Smit (2001); we refer to Appendix for details. Furthermore, the covariance of
the noise term specified by Eq. (2.7) implies that J and 𝚲 (i.e., the encoder functions ω𝑖 (𝑡)) need to
satisfy the following constraint
J 𝚲 + 𝚲J 𝑇 ≼ 0.
(2.11)
Directly imposing the condition (2.11) becomes cumbersome for the joint learning of J and
ω𝑖 (𝑡). Fortunately, this issue can be avoided by imposing an orthogonal constraint among the
14
non-Markovian features, i.e.,
(cid:2) ˆ𝚲(cid:3)
𝑖 𝑗 := 𝛽 (cid:10)ζ𝑖, ζ 𝑗 (cid:11)
= 𝛽 ∑︁
(cid:10)w𝑖 (𝑡 − 𝑘𝛿𝑡)v(𝑘𝛿𝑡), w 𝑗 (𝑡 − 𝑘′𝛿𝑡)v(𝑘′𝛿𝑡)(cid:11)
𝑘,𝑘 ′
= 𝛿𝑖 𝑗 I,
1 ≤ 𝑖, 𝑗 ≤ 𝑛,
(2.12)
where the inner product ⟨f (Z), g(Z)⟩ =
∫
f (Z)g(Z)𝑇 𝜌(Z)dZ is defined with respect to the
∫
equilibrium density function of the full model 𝜌(Z) = 𝑒−𝛽𝐻 (Z)/
𝑒−𝛽𝐻 (Z)dZ. In addition, we
also impose the orthogonal constraints such that ζ and p are uncorrelated. Therefore, condition
(2.11) can be transformed into seeking J s.t. J + J 𝑇 ≼ 0, and we represent J by
J = −LL𝑇 + J 𝐴,
(2.13)
where L ∈ R(𝑛+1)𝑚×(𝑛+1)𝑚 is the lower-triangle matrix with positive diagonal elements and LL𝑇
represents the Cholesky decomposition of a symmetric positive-definite matrix. J 𝐴 represents an
anti-symmetric matrix. Unlike the studies Mori (1965a); Ceriotti et al. (2009) based on the direct
kernel approximation, we note that J takes a more general form and is not restricted to be diagonal
or tri-diagonal.
With the proper form of J , we can cast the reduced dynamics into the evolution of the correlation
matrices by multiply v(0) to both sides of Eq. (2.5), i.e.,
⟨M v, v(0)⟩
⟨∇𝑈 (q), v(0)⟩
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
(cid:124)
d
d𝑡
⟨ζ1, v(0)⟩
...
+
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:125)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
(cid:124)
0
...
0
⟨ζ𝑛, v(0)⟩
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
C1 (𝑡)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
C0 (𝑡)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:125)
= J
,
⟨v, v(0)⟩
⟨ζ1, v(0)⟩
...
(cid:169)
(cid:170)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
(cid:173)
(cid:174)
⟨ζ𝑛, v(0)⟩
(cid:171)
(cid:172)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
(cid:124)
C2 (𝑡)
(2.14)
where the correlation matrices ⟨ζ𝑖 (𝑡), v(0)⟩ can be directly obtained from the correlation matrix of
the resolved variables ⟨v(𝑡), v(0)⟩ given the encoder weights, i.e.,
⟨ζ𝑖 (𝑡), v(0)⟩ =
𝑁𝑤
∑︁
𝑗=1
w𝑖 (𝑡 𝑗 ) (cid:10)v(𝑡 − 𝑡 𝑗 ), v(0)(cid:11) ,
15
where 𝑡 𝑗 = 𝑗 𝛿𝑡 and encoder weights w𝑖 (𝑡 𝑗 ) will be learned jointly. Therefore, we are able to construct
J from the pre-computed, noise-free correlation matrices instead of the on-the-fly computation
from the time-series samples of q and p. The reduced model can be trained by minimizing the
following loss function
𝐿𝐶 =
𝑁𝑡
∑︁
𝑗=1
(cid:13)
(cid:13)C1
′(𝑡 𝑗 ) + C0(𝑡 𝑗 ) − J C2(𝑡 𝑗 )(cid:13)
(cid:13)
2 𝐿Λ = ∥𝚲 − I ∥2 ,
(2.15)
𝐿 = 𝜆𝐶 𝐿𝐶 + 𝜆Λ𝐿Λ,
where C1 = [⟨M v, v(0)⟩ ; ⟨ζ1, v(0)⟩ ; · · · ; ⟨ζ𝑛, v(0)⟩], C0 and C2(𝑡) are defined similarly in Eq.
(2.14). 𝜆𝐶 and 𝜆Λ are the hyperparameters. 𝑡 𝑗 refers to the discrete time points and 𝑁𝑡 represents
the total number of sample points of the correlation matrices obtained from the full model. The
loss term 𝐿𝐶 ensures that the non-local statistical properties of the resolved variables can be
accurately preserved while the loss term 𝐿Λ ensures the aforementioned orthogonality among the
non-Markovian features. To simulate the constructed model, we always set ˆ𝚲 = I such that J in
form of Eq. (2.13) strictly satisfies the semi-positive definiteness condition. We emphasize that the
non-Markovian encoder weights (cid:8)w𝑖 (𝑡 𝑗 )(cid:9) 𝑁𝑤
𝑗=1 do not explicitly appear in the loss function. However,
they are involved in the training process along with J since the correlation functions C1 and C2
further depend on the definition of ζ𝑖, i.e., they are concurrently learned for the best approximation of
the extended Markovian dynamics of [q; p; ζ]. As shown in Sec. 3.3, this joint learning of both the
non-Markovian features and the dynamic equations enables us to probe the optimal representation
of the reduced models that leads to more accurate numerical results than the ones constructed
by the pre-selected bases, and can be easily implemented for models with multi-dimensional
resolved variables. In this study, we choose 𝑁𝑡 = 5000 for all the numerical examples and choose
𝑁𝑤 = 1800 for the one-dimensional reduced model and 1200 for the four-dimensional reduced
model, respectively. The training is conducted by the ADAM optimization algorithm Kingma and
Ba (2015) where 1000 points are randomly selected per each training step
We conclude this subsection with the following remarks: (I) Instead of Eq. (2.14), the reduced
dynamics can be also cast into the evolution of the correlation matrices by multiplying q(0) to both
16
sides of Eq. (2.5). For the present study, we found that both formulations yield accurate reduced
models. (II) Rather than learning the full sets of non-Markovian features, we can fix one of them as
M (cid:164)v + ∇𝑈 (q); this ensures that the time-derivatives of correlation functions of the reduced model
can accurately match the values of the full model near 𝑡 = 0. In the numerical examples presented in
following Sec. 3.3, all the reduced models are constructed with this choice. For simple notation,
we set it to be the last feature. For example, the fourth-order reduced model is constructed using 4
non-Markovian features. ζ1, ζ2 and ζ3 take the form of Eq. (2.4), and ζ4 is set to be M (cid:164)v + ∇𝑈 (q).
(III) In principle, for reduced models of Hamiltonian systems, the form of matrix J can be further
restricted to
J = −diag(0, ˆL ˆL𝑇 ) + J 𝐴,
(2.16)
where ˆL ∈ R𝑛𝑚×𝑛𝑚 is a lower-triangle matrix. Eq. (2.16) ensures that the embedded kernel in Eq.
(cid:1) 𝛿(𝑡)). ˜θ(𝑡) recovers the form
(2.5) does not contain the Markovian memory term (i.e., (cid:0)J11 + J 𝑇
11
of standard GLE and the second fluctuation-dissipation relationship shown in Eq. (2.9) recovers the
standard form, i.e., (cid:10) ˜R(𝑡) ˜R(𝑡′)𝑇 (cid:11) = 𝛽−1 ˜θ(𝑡 − 𝑡′). In this study, both forms yield accurate numerical
results; the contribution of the Markovian term constructed by Eq. (2.13) is less than 1%.
2.3 Numerical results
2.3.1 A tagged particle in aqueous environment
We demonstrate our method by modeling a tagged particle immersed in solvent particles. The
particle interaction is governed by the pairwise force
F𝑖 𝑗 (Q𝑖 𝑗 ) =
𝑓0(1 − 𝑄𝑖 𝑗 /𝑟𝑐)e𝑖 𝑗 , 𝑄𝑖 𝑗 ≤ 𝑟𝑐
0,
𝑄𝑖 𝑗 > 𝑟𝑐
where Q𝑖 and Q 𝑗 are the positions of 𝑖-th and 𝑗-th particles. Q𝑖 𝑗 = Q𝑖 − Q 𝑗 , 𝑄𝑖 𝑗 = ∥Q𝑖 − Q 𝑗 ∥, and
e𝑖 𝑗 = Q𝑖 𝑗
𝑄𝑖 𝑗 , and 𝑟𝑐 is the cut-off distance. The full system consists of 4000 particles in a 10 × 10 × 10
cubic box with periodic boundary condition along each direction. We set 𝑓0 = 50, 𝑟𝑐 = 1, and the
particle mass to be unit. Nosé-Hoover thermostat is used with 𝑘 𝐵𝑇 = 0.5 and time step 𝛿𝑡 = 2 × 10−3.
128 samples are collected from a production stage of 6 × 105 steps, which are used as the initial
17
conditions of the NVE simulations of the full model for a production stage of 1 × 105 steps using
the Velocity-Verlet integrator.
(a)
(c)
(b)
(d)
Figure 2.1 Numerical results for a tagged particle in the solvent particle bath. (a) Velocity correlation
function 𝐶𝑣𝑣 (𝑡) obtained from the full MD model and the reduced models constructed by the present
method based on the joint learning approximation, the rational function approximation Lei et al.
(2016a), and the Petrov–Galerkin projection with fixed bases Lei and Li (2021). (b) Predicted
evolution of the probability density function of the particle velocity obtained from the full MD and
the different reduced models with the second-order approximation. The initial velocity 𝑣 is set to 0
(the vertical line). (c) 𝐶𝑣𝑣 (𝑡) obtained from the full MD model and different orders of the present
joint learning approximation. (d) Encoder weights for the three non-Markovian features obtained
from the present joint learning with the fourth-order approximation.
The reduced dynamics in terms of the tagged particle is constructed in form of Eq. (2.5) along
with the learning of the non-Markovian features {ζ𝑖}𝑛
𝑖=1. The free energy 𝑈 (q) vanishes for this
case. For comparison, we also construct the reduced model using the previous approaches based on
the Petrov-Galerkin projection (named as fixed-basis) Lei and Li (2021) and the rational function
18
10−210−1100101t10−310−210−1100Cvv(t)MD2ndjointlearning2ndfixedbasis2ndrationalfunction−1.50−0.750.000.751.50v0.000.250.500.751.00ρ(v)t=0.06t=0.10MD2ndjoinedlearning2ndfixedbasis10−210−1100101t10−310−210−1100Cvv(t)MD2ndjointlearning3rdjointlearning4thjointlearning0250500750j−0.50.00.51.0wi(j∆t)w1w2w3approximation Lei et al. (2016a). Fig. 2.1(a) shows the velocity correlation function of constructed
models using two non-Markovian features, or equivalently, two projection bases, as well as the
second-order rational function approximation. The model constructed by the present (named as
the joint-learning) method shows the best agreement with the full model based on the molecular
dynamics (MD) simulations. The model accuracy can be further examined by the evolution of
probability density function (PDF) of the particle velocity. Specifically, we fix the velocity to be
zero as 𝑡 = 0 and sample the instantaneous PDF thereafter. Fig. 2.1(b) shows the obtained PDF at
𝑡 = 0.06. The present approach yields more accurate result than the Petrov-Galerkin method. As
shown in Fig. 2.1(c), the accuracy of the reduced model shows further improvement as we increase
the number of non-Markovian features. In particular, the reduced model with the fourth-order
approximation can accurately capture the oscillations over the full regime. Fig. 2.1(d) shows the
obtained encoder weights of the fourth-order approximation. All of the three encoder functions
show pronounced oscillations near 𝑡 = 0 and decay to 0 for large 𝑡. Unlike the empirically chosen
fractional-derivative bases in Ref. Lei and Li (2021), the present method enables the encoders to be
optimized for the best approximation of the non-local statistics, and therefore yields more accurate
results.
Figure 2.2 A sketch of a chain-molecule represented by united atom model. Reduced models are
constructed with respect to a four-dimensional resolved vector q, which represents the end-to-end
distance (𝑞1), the radius of gyration (𝑞2), and the end-to-middle distances (𝑞3 and 𝑞4), respectively.
19
qq1AAACmHicdVHbahsxEJW3t9S9JGnfmhdRE0ghLLtOSto3QwtuIA9JUycB25hZ7awjLGk30myJWUy/oK/tt/VvonWd4Jj2gGA4Z6Qzo5MUSjqKoj+N4MHDR4+frD1tPnv+4uX6xuarM5eXVmBP5Cq3Fwk4VNJgjyQpvCgsgk4UnieTT7V+/h2tk7n5RtMChxrGRmZSAHnq9GoUjzZaURh/rMGjsB15vL8t2jwOozlabIHj0WbjxyDNRanRkFDgXD+OChpWYEkKhbPmoHRYgJjAGPu+NKDRDav5rDO+7ZmUZ7n1xxCfs8s3KtDOTXXiOzXQpVvVavJfWr+k7MOwkqYoCY34a5SVilPO68V5Ki0KUlNfgLDSz8rFJVgQ5L/nnkui/Q4pZgOdVIPaKMlmze3ljrl7gWKFvi6NFHmKq7Sia7JQsw5JgzT16lVXKsVPwbg7wT9aKzuf5ViS2z3yGZndrkWcvFvubvrIbnPh/y/O2mG8F7ZP9ludw0V4a2yLvWU7LGYHrMO+sGPWY4KN2U/2i/0O3gSdoBsseoPG4s5rdg/B1xsxsM2nq2AAACmHicdVFdaxNBFJ2sXzV+tNU3fRkMhQpl2V2Vtm+BClHwoVrTFpIQ7s7eTYfMzG5n7paGJfgLfNXf5r9xNqaSBj0wcDnnzpx756Slko6i6FcruHP33v0HGw/bjx4/ebq5tf3s1BWVFdgXhSrseQoOlTTYJ0kKz0uLoFOFZ+n0qNHPrtA6WZivNCtxpGFiZC4FkKdOLsfJeKsThfFhAx6FSeTx7qZIeBxGC3TYEsfj7da3YVaISqMhocC5QRyVNKrBkhQK5+1h5bAEMYUJDnxpQKMb1YtZ53zHMxnPC+uPIb5gV2/UoJ2b6dR3aqALt6415L+0QUX5waiWpqwIjfhjlFeKU8GbxXkmLQpSM1+AsNLPysUFWBDkv+eWS6r9DhnmQ53Ww8YozeftndWOhXuJYo2+rowURYbrtKJrstCwDkmDNM3qdU8qxU/AuL+Cf7RRdt/LiSS398lnZPZ6FnH6erW77SO7yYX/vzhNwvhNmHx+2+l+XIa3wV6yV2yXxWyfddkHdsz6TLAJ+85+sJ/Bi6Ab9IJlb9Ba3nnObiH48hsz4s2oq3AAACmHicdVHbahsxEJW3t9TpJWne2hdRE0ghLLtOS9o3QwtuoA9pEycB25hZ7awjLGk30myJWUy+oK/tt/VvonWd4pjkgGA4Z6Qzo5MUSjqKor+N4MHDR4+frD1trj97/uLlxuarE5eXVmBP5Cq3Zwk4VNJgjyQpPCssgk4UniaTz7V++hOtk7k5pmmBQw1jIzMpgDx1dDHaG220ojD+VINHYTvy+HBTtHkcRnO02AKHo83G1SDNRanRkFDgXD+OChpWYEkKhbPmoHRYgJjAGPu+NKDRDav5rDO+7ZmUZ7n1xxCfs8s3KtDOTXXiOzXQuVvVavIurV9S9nFYSVOUhEb8M8pKxSnn9eI8lRYFqakvQFjpZ+XiHCwI8t9zyyXRfocUs4FOqkFtlGSz5vZyx9y9QLFCX5ZGijzFVVrRJVmoWYekQZp69aorleJHYNx/wT9aKztf5FiS2/3mMzK7XYs4ebfc3fSR3eTC7y9O2mG8F7a/v291DhbhrbE37C3bYTHbZx32lR2yHhNszH6x3+xP8DroBN1g0Rs0Fne22C0EP64BNhTNqQ==q4AAACmHicdVHbahsxEJW3t9TpJWne2hdRE0ghLLtuSto3QwtuoA9pEycB25hZ7awjLGk30myJWUy+oK/tt/VvonWd4pjkgGA4Z6Qzo5MUSjqKor+N4MHDR4+frD1trj97/uLlxuarE5eXVmBP5Cq3Zwk4VNJgjyQpPCssgk4UniaTz7V++hOtk7k5pmmBQw1jIzMpgDx1dDHaG220ojD+VINHYTvy+HBTtHkcRnO02AKHo83G1SDNRanRkFDgXD+OChpWYEkKhbPmoHRYgJjAGPu+NKDRDav5rDO+7ZmUZ7n1xxCfs8s3KtDOTXXiOzXQuVvVavIurV9S9nFYSVOUhEb8M8pKxSnn9eI8lRYFqakvQFjpZ+XiHCwI8t9zyyXRfocUs4FOqkFtlGSz5vZyx9y9QLFCX5ZGijzFVVrRJVmoWYekQZp69aorleJHYNx/wT9aKztf5FiS2/3mMzK7XYs4ebfc3fSR3eTC7y9O2mH8Pmx/32t1DhbhrbE37C3bYTHbZx32lR2yHhNszH6x3+xP8DroBN1g0Rs0Fne22C0EP64BOEbNqg==2.3.2 One-dimensional reduced model of a polymer molecule
We consider the reduced dynamics of a polymer molecule consisting of 𝑁 = 16 atoms. The
intramolecular potential is governed by
𝑉mol(Q) =
𝑁
∑︁
𝑖≠ 𝑗
𝑉p(𝑄𝑖 𝑗 ) +
𝑁𝑏∑︁
𝑖=1
𝑉b(𝑙𝑖) +
𝑁𝑎∑︁
𝑖=1
𝑉a(𝜃𝑖) +
𝑁𝑑
∑︁
𝑖=1
𝑉d(𝜙𝑖),
(2.17)
where 𝑙𝑖, 𝜃𝑖, 𝜙𝑖 represent the individual bond length, bond angle, and dihedral angle, respectively.
𝑉p, 𝑉b, 𝑉a, and 𝑉d represent the pairwise Lennard-Jones, finite extensible nonlinear elastic bond,
harmonic angle, and multiharmonic dihedral interactions whose explicit forms are specified in
Appendix A. The atom mass is set to unit, thermal temperature 𝑘 𝐵𝑇 is set to 1.0, and the time step
𝛿𝑡 is set to be 1 × 10−3. 512 samples are collected from a production stage of 3 × 106 steps, which
are used as the initial conditions of the NVE simulations of the full model for a production stage of
1 × 106 steps using the Velocity-Verlet integrator. Fig. 2.2 shows a sketch of the polymer molecule.
To examine the effectiveness of the present method, we first construct a 1D reduced dynamics in
terms of the end-to-end distance 𝑞1 = ∥Q1 − Q𝑁 ∥ as done in the previous work Lei and Li (2021)
based on the Petrov-Galerkin method, and compare the numerical results obtained from the two
methods. Figure 2.3(a) shows the velocity correlation function 𝐶𝑣𝑣 (𝑡) = ⟨𝑣1(𝑡)𝑣1(0)⟩ obtained
from the full MD and different orders of fixed-basis and joint-learning approximations. With the
same order of approximation, the current method yields better agreement with the MD results.
Specifically, the second-order model of the current method can capture the pattern around 𝑡 = 4 and
the fourth-order model can capture the patterns around 𝑡 = 0.4 and 𝑡 = 4. However, the previous
method with the same order approximation shows limitation to accurately capture these two patterns.
Figure 2.3(b) shows the displacement auto-correlation function 𝐶𝑞𝑞 (𝑡) = ⟨𝑞1(𝑡)𝑞1(0)⟩ obtained
from full MD and the reduced models constructed by the present method with different number
of non-Markovian features. As we introduce more features, the predicted correlation functions
approaches the MD results. In particular, the fourth-order model can capture the oscillations of the
MD results at 𝑡 = 10 and 𝑡 = 25. Figure 2.3(c) shows the encoder weights of non-Markovian features
for the fourth-order approximation. Similar to the tagged particle system, the encoder functions
exhibit pronounced oscillations at the short time and decay to zero at longer time.
20
(a)
(c)
(b)
(d)
Figure 2.3 Numerical results of a one-dimensional reduced model representing the dynamics of the
end–end distance of a polymer molecule system. (a)–(b) Velocity correlation function 𝐶𝑣𝑣 (𝑡) and
the Laplace transform of the memory function Θ(𝜆) obtained from the full MD simulations and
the different orders of the present joint learning approximation, and the Petrov–Galerkin projection
with fixed-basis approximation. (c) Displacement correlation function 𝐶𝑞𝑞 (𝑡) obtained from the
full MD and different orders of the joint learning approximation. (d) Encoder weights for the three
non-Markovian features of the reduced model with the fourth-order approximation.
21
10−210−1100101102t−0.50.51.52.5Cvv(t)MD2ndjointlearning4thjointlearning2ndfixedbasis4thfixedbasis10−1100101102t−50510Cqq(t)MD2ndjointlearning3rdjointlearning4thjointlearning050100150200j−0.2−0.10.00.10.2wi(j∆t)w1w2w310−210−1100101102103λ0.10.30.50.7Θ(λ)MD2ndjointlearning4thjointlearning2ndfixedbasis4thfixedbasisThe accuracy of the constructed reduced models can be further examined by comparing the
embedding memory kernels ˜θ(𝑡) with the full MD model. Figure 2.3(d) shows the Laplace transform
of the memory kernel of the reduced models ˜𝚯(𝜆) = ∫ +∞
˜θ(𝑡) exp (−𝑡/𝜆)d𝑡. The MD kernel 𝚯(𝜆)
0
is obtained by 𝚯(𝜆) = −G(𝜆)H (𝜆)−1, where G(𝜆) and H (𝜆) are the Laplace transform of the
correlation matrices g(𝑡) = ⟨M (cid:164)v(𝑡) + ∇𝑈 (q), q(0)⟩ and h(𝑡) = ⟨v(𝑡), q(0)⟩. Compared with the
previous method, the current method yields better agreement with MD results. Specifically, the
second- and fourth-order of the joint learning approximation, and the fourth-order of the fixed
basis approximation show good agreement with the MD result 𝚯(𝜆) for 𝜆 between 1 and 1000.
Furthermore, the fourth-order model of the joint learning approximation can further capture the
pronounced peak regime of the MD results near 𝜆 = 0.1. We emphasize that the memory kernel
˜θ(𝑡) is not explicitly constructed during the learning process; ˜θ(𝑡) approaches θ(𝑡) as we impose the
constraint (2.14) such that the correlation matrices of the reduced dynamics match the ones of the
full model. This enables us to circumvent the direct fitting of the matrix-valued memory function
for multi-dimensional GLEs, and efficiently construct the numerically-stable reduced model that
retains the non-local statistics and coherent noise as shown in the following example.
2.3.3 Four-dimensional reduced model of a polymer molecule
Finally, we construct a reduced model in terms of a four-dimensional resolved vector q =
[𝑞1, 𝑞2, 𝑞3, 𝑞4] defined by
∥Q𝑖 − Q𝑐 ∥2, Q𝑐 =
𝑞1 = ∥Q1 − Q𝑁 ∥ ,
𝑁
∑︁
𝑞2
2 =
1
𝑁
𝑖=1
(cid:13)
(cid:13)
(cid:13)Q⌊ 𝑁
(cid:13)
(cid:13)
(cid:13)Q⌈ 𝑁
2 ⌋ − Q1
2 ⌉ − Q𝑁
,
(cid:13)
(cid:13)
(cid:13)
(cid:13)
(cid:13)
(cid:13)
,
𝑞3 =
𝑞4 =
1
𝑁
𝑁
∑︁
𝑖=1
Q𝑖,
(2.18)
where 𝑞1, 𝑞2, 𝑞3, and 𝑞4 represent the end-to-end distance, radius of gyration, and two center-
to-end distances, respectively. The four-dimensional free energy function 𝑈 (q) is constructed by
matching the average force sampled from the restraint molecular dynamics and represented by a
neural network; we refer to Appendix B for details. Rather than constructing the four-dimensional
22
GLE kernel θ(𝑡), we directly learn the reduced model (2.5) by minimizing the loss function (2.15).
(a)
(c)
(b)
(d)
Figure 2.4 Numerical results of a four-dimensional reduced model representing the dynamics of a
polymer molecule system, with conformation states characterized by the resolved variables q (see
Eq. (2.18)). (a)–(c) Diagonal components of the velocity correlation function C𝑣𝑣 (𝑡) = ⟨v(𝑡) v(0)𝑇 ⟩.
Note that [C𝑣𝑣 (𝑡)]44 is omitted because it is similar to [C𝑣𝑣 (𝑡)]33. (d) Constructed encoder weights
of the first non-Markovian feature ζ1 for the fourth-order reduced model.
Figure 2.4(a-c) show the diagonal components of the velocity correlation matrix C𝑣𝑣 (𝑡) =
(cid:10)v(𝑡)v(0)𝑇 (cid:11) obtained from the full MD and the reduced models using different order approximations.
Specifically, the components [C𝑣𝑣 (𝑡)]11 and [C𝑣𝑣 (𝑡)]33 show similar values near 𝑡 = 0 since both
𝑞1 and 𝑞3 characterize the distances between the individual particles, e.g., 𝑣1 = (Q1 − Q𝑁 ) · (V1 −
V𝑁 )/∥Q1 − Q𝑁 ∥. As the distribution of the velocity difference between the two free-end particles
follows N (0, 2𝑘 𝐵𝑇I), the variance of 𝑣1 is 2𝑘 𝐵𝑇. Similar argument also holds for 𝑣3 and 𝑣4. On
the long-time scale, [C𝑣𝑣 (𝑡)]11 and [C𝑣𝑣 (𝑡)]22 decay much slower than [C𝑣𝑣 (𝑡)]33 and [C𝑣𝑣 (𝑡)]44
23
10−210−1100101102t−0.50.51.52.5[Cvv(t)]11MD0thorder2ndjointlearning4thjointlearning10−210−1100101102t−0.030.000.030.060.09[Cvv(t)]22MD0thorder2ndjointlearning4thjointlearning10−210−1100101102t−0.50.51.52.5[Cvv(t)]33MD0thorder2ndjointlearning4thjointlearning0255075100j−0.2−0.10.00.10.2[w1](j∆t)[w1]11[w1]22[w1]33[w1]44and show pronounced oscillations near 𝑡 = 10 and 𝑡 = 25. The differences can be understood as
follow: Compared with the end-to-middle distances 𝑞3 and 𝑞4, the end-to-end distance 𝑞1 and radius
of gyration 𝑞2 represent the global states of the molecular conformation. Based on the scaling
law of the idealized Gaussian chain model de Gennes (1979), the relaxation time of 𝑞1 and 𝑞2 is
proportional to 𝑁 2. Accordingly, [C𝑣𝑣 (𝑡)]11 decays four times slower than [C𝑣𝑣 (𝑡)]33, which is
qualitatively consistent with the present numerical results.
The transient dynamics of the correlation functions can be accurately captured by the reduced
model. As we increase the number of non-Markovian features, the predictions show better agreement
with MD results. Specifically, the zeroth-order (i.e., Langevin) model is insufficient to capture the
patterns around 0.5 and 5. The second-order model yields an accurate prediction for [C𝑣𝑣 (𝑡)]33
but less accurate predictions for [C𝑣𝑣 (𝑡)]11 and [C𝑣𝑣 (𝑡)]22. The fourth-order model yields good
agreement for all the components over the full regime. Fig. 2.4(d) shows the encoder weights of the
first non-Markovian feature ζ1, which naturally encode the non-local statistics among the resolved
variables, and decay to 0 at large time.
Fig. 2.5 shows the off-diagonal components of the velocity correlation matrix C𝑣𝑣 (𝑡). Similar to
the diagonal components, [C𝑣𝑣 (𝑡)]12 represents the coupling between the dynamics of two global
conformation states and therefore exhibits the longest correlation with pronounced oscillations at
𝑡 = 10 and 𝑡 = 25. On the other hand, [C𝑣𝑣 (𝑡)]13 and [C𝑣𝑣 (𝑡)]23 represent the coupling between
a global state and semi-global state, and therefore exhibit intermediate correlation. In addition,
[C𝑣𝑣 (𝑡)]34 exhibits weaker correlation compared with the other components since the coupling
between the dynamics of 𝑞3 and 𝑞4 is mainly governed by the local bond- and angle-interactions
associated with 8-th and 9-th atom. The predictions of the second-order reduced model show fairly
good agreement with the full MD results for [C𝑣𝑣 (𝑡)]13 and [C𝑣𝑣 (𝑡)]23 but less agreement for
[C𝑣𝑣 (𝑡)]12. The fourth-order reduced model yields good agreement for all the components.
Fig. 2.6 shows the components of the embedded matrix-valued kernels in the Laplace space
obtained from the full MD and the reduced models. In particular, ˜𝚯(𝜆) obtained from the second-
order model shows good agreement with 𝚯(𝜆) obtained from the full MD within the regime of large
24
(a)
(c)
(b)
(d)
Figure 2.5 (a-d) Off-diagonal components of the velocity correlation function C𝑣𝑣 (𝑡) for a polymer
molecule system whose conformation states are characterized by a four-dimensional resolved vector
q defined by Eq. (2.18).
𝜆. The fourth-order model yields good agreement over the full regime, which is consistent with
the accurate prediction of the velocity correlation functions shown in Fig. 2.4 and 2.5 (see also
Appendix E for θ(𝑡)). While the kernel function θ(𝑡) is not explicitly constructed in the present
method, the accurate recovery of 𝚯(𝜆) verifies that the constructed models faithfully retain the
non-Markovian dynamics of the resolved variables.
2.4 Summary
In this study, we developed a data-driven approach to accurately learn the stochastic reduced
dynamics of full Hamiltonian systems with non-Markovian memory. The method essentially
provides an efficient approach to approximate the multi-dimensional generalized Langevin equation.
25
10−210−1100101102t−0.10.00.10.2[Cvv(t)]12MD0thorder2ndjointlearning4thjointlearning10−210−1100101102t−0.30.00.30.60.9[Cvv(t)]13MD0thorder2ndjointlearning4thjointlearning10−210−1100101102t−0.050.000.050.100.15[Cvv(t)]23MD0thorder2ndjointlearning4thjointlearning10−210−1100101102t−0.10−0.050.000.050.10[Cvv(t)]34MD0thorder2ndjointlearning4thjointlearning(a)
(c)
(b)
(d)
Figure 2.6 (a-d) Components of the embedded matrix-valued kernel 𝚯(𝜆) in the Laplace space
obtained from the full MD and a four-dimensional reduced model of a polymer molecule system.
Rather than directly fitting the matrix-valued memory kernel, the present method seeks a set of
non-Markovian features whose evolution naturally encodes with the orthogonal dynamics of the
resolved variables, and establishes a joint learning of the extended dynamics in terms of both the
resolved variables and the non-Markovian features. Compared with the previous studies based on
the rational function approximation Lei et al. (2016a) and the Petrov-Galerkin projection Lei and Li
(2021) with the pre-selected fractional derivative bases, the present method enables us to probe the
optimal representation of the reduced dynamics through the joint learning of the non-Markovian
features. The constructed features retain a clear physical interpretation and can be loosely viewed as
the convolution of the velocity history. This enables us to construct the proper learning formulation
such that the reduced dynamics strictly preserves the second fluctuation-dissipation theorem and
26
10−210−1100101102103λ0.00.51.01.5[Θ(λ)]11MD2ndjointlearning4thjointlearning10−210−1100101102103λ0.30.60.91.2[Θ(λ)]22MD2ndjointlearning4thjointlearning10−210−1100101102103λ−0.6−0.4−0.20.0[Θ(λ)]12MD2ndjointlearning4thjointlearning10−210−1100101102103λ−0.35−0.25−0.15−0.05[Θ(λ)]23MD2ndjointlearning4thjointlearningretains the consistent invariant density distribution. Moreover, the learning process does not require
the on-the-fly computation of the time correlations of these features from the time-series samples,
and automatically ensures numerical stability of the constructed model without empirical treatment.
This is particularly well-suited for the construction of reduced dynamics of complex systems such as
the conformation dynamics of macromolecular systems, where multi-dimensional resolved variables
are often needed to characterize the transition dynamics with non-local cross-correlations among
the variables.
Building upon the data-driven framework for modeling state-independent memory effects
introduced in Chapter 2, Chapter 3 extends this approach to more complex dynamical systems with
state-dependent memory. While Chapter 2 demonstrated how a fixed set of convolutional encoders
could capture global non-Markovian behavior through auxiliary variables, this assumption becomes
limiting in systems where memory varies across configurations — such as when transitions between
metastable states occur. To address this, Chapter 3 introduces a heterogeneous encoding architecture
that allows the memory kernels to adapt locally to the system’s state. This generalization enables a
more accurate and flexible representation of reduced dynamics in high-dimensional systems where
memory effects are inherently configuration-dependent.
27
CHAPTER 3
ENHANCED SAMPLING DATA-DRIVEN CONSTRUCTION OF STOCHASTIC
REDUCED DYNAMICS ENCODED WITH STATE-DEPENDENT MEMORY
3.1
Introduction
Predictive modeling of multi-scale dynamic systems remains a significant challenge across
various fields, including biology, materials science, and fluid physics. A prominent example is
coarse-grained molecular dynamics (CGMD), where the goal is to simplify molecular system
representations while preserving their essential dynamic behavior. The generalized Langevin
equation (GLE) has emerged as a widely used framework for capturing the non-Markovian dynamics
inherent in many CGMD processes. A range of approaches has been proposed for parameterizing
the memory Lange and Grubmüller (2006); Darve et al. (2009b); Ceriotti et al. (2009); Baczewski
and Bond (2013); Davtyan et al. (2015b); Lei et al. (2016b); Russo et al. (2019); Jung et al. (2017b);
Lee et al. (2019b); Ma et al. (2019); Wang et al. (2020b,c); Zhu and Venturi (2020); Vroylandt
et al. (2022); She et al. (2023); Xie and E (2024) aiming to reconstruct specific dynamic properties
accurately. However, recent work Lyu and Lei (2023b); Ge et al. (2024) reveal that recovering
isotropic properties alone may be insufficient for accurately reproducing the underlying complex
dynamics. These findings underscore the importance of incorporating state-dependent memory
effects to achieve precise reconstruction of dynamic behaviors.
The accurate parameterization of a state-dependent memory kernel hinges on effectively capturing
the dynamic properties within the phase space. However, practical applications often face challenges
due to the inherent complexity of the energy landscape in phase space. This complexity is typically
marked by the presence of numerous metastable states, which are separated by significant energy
barriers. These barriers hinder transitions between states, making it difficult to comprehensively
sample the phase space and accurately reconstruct the memory kernel. Addressing this challenge
requires advanced techniques capable of efficiently exploring these landscapes while retaining
essential dynamic information. The critical role of sampling in phase space has been widely
acknowledged, particularly in the construction of free energy landscapes. To address this, numerous
28
methodologies have been developed, each offering unique advantages for overcoming sampling
challenges. Notable approaches include umbrella sampling Torrie and Valleau (1977), which applies
biased potentials to enhance exploration; histogram reweighting Kumar et al. (1992), enabling the
integration of data from multiple simulations; metadynamics Laio and Parrinello (2002b); Barducci
et al. (2008), which facilitates the escape from metastable states through adaptive biasing; and
variational enhanced sampling Valsson and Parrinello (2014); Shaffer et al. (2016); Bonati et al.
(2019), a framework that leverages variational principles to optimize bias potentials. These methods
collectively underscore the importance of efficient phase space exploration in capturing accurate
free energy profiles. Despite their great success and wide application to capture the static properties,
the importance of the sampling for the dynamic properties is largely ignored.
In this study, we employ our previously developed consensus-based enhanced sampling technique
to simultaneously construct the free energy surface and parameterize the memory kernel. The
conservative force is determined through constrained dynamics at selected points in the phase
space, while dynamic information at these points is obtained via multiple free dynamics simulations
initiated from the same locations.
3.2 Methods
The system under consideration is modeled as a Hamiltonian system with a 6𝑁-dimensional
phase space vector Z = [Q; P], represent the position and momentum vectors, respectively. The
dynamics of the system are governed by the equation of motion:
(cid:164)Z = S∇𝐻 (Z),
(3.1)
0
I
where S = (cid:169)
(cid:173)
(cid:173)
−I 0
(cid:171)
(cid:170)
(cid:174)
(cid:174)
(cid:172)
is the symplectic matrix that preserves the structure of Hamiltonian dynamics,
with I being the identity matrix. and 𝐻 (Z) denotes the Hamiltonian function. For sufficiently large
𝑁, the simulation of Eq. (3.1) becomes computationally prohibitive. However, in many practical
scenarios, interest lies in a low-dimensional resolved variable, z(𝑡) = 𝜙(Z(𝑡)), where 𝜙 : R6𝑁 → R𝑚
serves as a mapping that projects the high-dimensional pace onto a reduced space of interest.
The Mori-Zwanzig (MZ) formalism provides a robust foundation for constructing approximate
29
dynamics for resolved variables by employing a projection operator. This framework separates
the resolved and unresolved components of the system, enabling a reduced description of the
dynamics while incorporating memory effects and fluctuating forces to account for the influence of
unresolved variables. The projection operator P maps functions of the full system to functions of
the coarse-grained (CG) system, and is defined as:
(P 𝑓 )(z) =
∫ 𝛿(𝜙(Z) − z) 𝑓 (Z) 𝜌(Z)dZ
∫ 𝛿(𝜙(Z) − z) 𝜌(Z)dZ
,
where z represents the CG variables, Φ(Z) is the mapping from the full system to the CG system, and
𝜌(Z) is the probability density function of the full system. The dynamics of CG variable follows:
𝜕
𝜕𝑡
𝜙(Z) = exp(𝑡L)PL𝜙(Z) +
∫ 𝑡
0
exp ((𝑡 − 𝑠)L) PL exp(𝑠QL)QL𝜙(Z)d𝑠+exp(𝑡QL)QL𝜙(Z).
Here, L := is the Liouville operator and Q = I − P. Motivate by this, the reduced dynamics can be
written as
(cid:164)q = M(q)−1p,
(cid:164)p = −F(q) −
∫ 𝑡
0
𝜃 (𝑡 − 𝜏) (cid:164)q(𝜏)d𝜏 + R (𝑡),
(3.2)
Here, q = 𝜙𝑞 (Q) denotes a coarse-grained variable, and p is the corresponding momentum,
associated with a mass matrix M(q). The effective free energy for q is defined as 𝑈eff(q) =
𝛽 log ∫ dZ𝛿(𝜙𝑞 (Z) − q) 𝜌(Z), with 𝛽 = 1
− 1
𝐾𝐵𝑇 is the inverse temperature. Inspired by previous
research She et al. (2023), Equation (3.2) can be reformulated as an extended Markovian process
(q, p, ξ), where ξ represents auxiliary variables that will be defined later. These auxiliary variables
serve to capture the memory effects inherent in the original generalized Langevin equation (GLE)
and are assumed to follow a Gaussian distribution, 𝑁 (0, 1). The evolution takes the form
d
=
q
(cid:170)
(cid:174)
(cid:174)
p
(cid:174)
(cid:174)
(cid:174)
(cid:172)
ξ
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
0
I 0
−I
0
J(q)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
∇qF
∇pF
∇ξF
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
0 0 0
0
0
𝚺(q)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
d𝑡 +
dW𝑡,
(3.3)
where F (q, p, ξ) = 𝑈 (q) + p𝑇 M(q)−1p + ξ𝑇 ξ is the free energy for the extended system. The
relationship 𝚺(q)𝚺(q)𝑇 = −𝐾𝐵𝑇 (J(q)𝚲+𝚲J(q)𝑇 ), ensures consistency with the second fluctuation-
dissipation theorem, where 𝚲 is covariance matrix of (p, ξ). By solving the Fokker-Planck equation,
30
we have the invariant distribution of the extended system 𝜌𝑒 (q, p, ξ) ∝ exp(−𝛽F (q, p, ξ)). The
invariant distribution should be consistent with the free energy for q, with ∫ 𝜌𝑒 (q, p, ξ)dpdξ ∝
exp(−𝛽𝑈eff(q)), from which we notice that 𝑈 (q) = 𝑈eff(q) − 1
2𝛽 log |M(q)−1|. To construct
the hidden variables, we define them as a linear combination of past momentum values p(𝑡 −
𝛿𝑡), · · · , p(𝑡 − 𝑁𝑤𝛿𝑡), i.e
ξ𝑖 (𝑡) =
𝑁𝑤
∑︁
𝑗=1
w 𝑗𝑖p(𝑡 − 𝑖𝛿𝑡),
where w 𝑗𝑖 are the coefficients to be optimized, and 𝑁𝜉 is the number of momentum terms included
in the linear combination. To simplify the process of training and collecting data, we construct our
(3.4)
matrix J(q) as follows
J(q) =
0
h𝑇 (q)
−h(q)M−1(q)
ˆJ(q)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
here h(q) is a vector represented by a neural network which takes the coarse-grained variable q as
input. Then we use choleschy decomposition to form ˆJ(q) = −L(q)L𝑇 (q) + A(q). Note that L(q)
is a block-wise lower triangle matrix and A(q) is a block-wise antisymmetric matrix represented by
two different neural networks respectively. Now, the the second fluctuation-dissipation theorem can
be simplified as follows
where
ˆ𝚺(q) ˆ𝚺(q)𝑇 = −𝐾𝐵𝑇 ( ˆJ(q) ˆ𝚲 + ˆ𝚲ˆJ(q))
𝚺(q) =
, 𝚲(q) =
(3.5)
0
0
0 ˆ𝚺(q)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
M(q)
(cid:169)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:173)
(cid:171)
0
0
ˆ𝚲(q)
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
With the constructed ξ, we can computed the correlation function by multiply p(0) given q(0) = q
on both side of Eq. (3.3),
(cid:10)p + ∇qF , p(0) |q(0) = q (cid:11)
⟨ξ, p(0) |q(0) = q ⟩
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
d (cid:169)
(cid:173)
(cid:173)
(cid:171)
(cid:124)
(cid:123)(cid:122)
C1 (𝑡,q;𝑤)
(cid:10)∇pF , p(0) |q(0) = q (cid:11)
(cid:170)
= J(q) (cid:169)
(cid:174)
(cid:173)
(cid:10)∇ξF , p(0) |q(0) = q (cid:11)
(cid:174)
(cid:173)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:172)
(cid:171)
(cid:123)(cid:122)
(cid:125)
(cid:124)
C2 (𝑡,q;𝑤)
d𝑡
.
(3.6)
31
(cid:170)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:174)
(cid:172)
(cid:170)
(cid:174)
(cid:174)
(cid:172)
(cid:125)
By construct J(𝑞) as ˜J(q; 𝜃𝐽) = ˜𝚺(q; 𝜃𝐽) ˜𝚺(q; 𝜃𝐽)𝑇 , the loss function can be constructed as
𝑁𝑠
∑︁
𝑖=1
(cid:13)
(cid:13)
(cid:13)
(cid:13)
min
𝜃 𝐽 ,w
d
d𝑡 C1(𝑡, q𝑖; 𝑤) − ˜J(q𝑖; 𝜃𝐽)C2(𝑡, q𝑖; w)
2
(cid:13)
(cid:13)
(cid:13)
(cid:13)
+
𝑁𝑠
∑︁
𝑖=1
(cid:13) ˆ𝚲(q𝑖; w) − I(cid:13)
(cid:13)
2 ,
(cid:13)
(3.7)
where we optimize ˜J(q; 𝜃𝐽) and auxiliary variable ξ depends on w at the same time on the training
set {q𝑖}𝑁𝑠
𝑖=1. The construction of dynamics also depends on an accurate free energy surface 𝑈eff(q)
and mass matrix M(q). It can be computed from restrained dynamics by introducing a harmonic
term into full potential, i.e.
U𝑘 (Q, q) = U (Q) +
𝑘
2
(𝜙𝑞 (Q) − q)𝑇 (𝜙𝑞 (Q) − q),
where 𝑘 represents the magnitude of the restrained potential and U is the potential of full
system without restraint. The mean force can be computed by ∇𝑈eff(q) = lim𝑘→∞ F𝑘 (q) and
M(q) = lim𝑘→∞ M𝑘 (q), where
F𝑘 (q) =
∫
1
𝑍𝑘 (q)
𝑘 (𝜙𝑞 (Q) − q) exp(−𝛽U𝑘 (Q, q))dQ
and
M𝑘 (q) =
∫
1
𝑍𝑘 (q)
1
2𝛽( d𝜙𝑞 (Q)
)2
exp(−𝛽U𝑘 (Q, q))dQ.
d𝑡
Two neural networks ˜M(q; 𝜃 𝑀) and ˜F(q; 𝜃𝐹) is constructed to approximate M(q) and F(q)
respectively and trained loss function on the same dataset
𝑁𝑠
∑︁
𝑖=1
min
𝜃 𝑀
∥ ˜M(q𝑖; 𝜃 𝑀) − M(q𝑖) ∥2, min
𝜃𝐹
𝑁𝑠
∑︁
𝑖=1
∥ ˜F(q𝑖; 𝜃𝐹) − F(q𝑖) ∥2.
The sampling points are adaptively selected by consensus-based enhanced sampling strategy
Lyu and Lei (2023a) with a McKean type stochastic differential equation for 𝑁𝑤 walker q1, · · · , q𝑁𝑤
dq𝑖 = −
1
𝛾
∇z𝐺 (q𝑖)d𝑡 +
√︄
2
𝜅ℎ𝛾 dW𝑡
(3.8)
where 𝐺 (q𝑖) = 1
2 (q𝑖 − m)𝑇 V−1(q𝑖 − m), where m = (cid:205)𝑁𝑤
𝑖=1 q𝑖 𝑝(q𝑖), V = (𝜅𝑙 + 𝜅ℎ) (cid:205)𝑁𝑤
𝑖=1
. The R (q𝑖) represents the residual at the point q𝑖, which we
(q𝑖 − m)𝑇 (q𝑖 −
m) 𝑝(q𝑖) and 𝑝(q𝑖) =
exp(−𝜅𝑙R (q𝑖))
(cid:205)𝑁𝑤
𝑗
exp(−𝜅𝑙R (q 𝑗 ))
choose to be ∥ ˜F(q𝑖; 𝜃𝐹) − F(q𝑖)∥2 in this project. The first right term in Eq. (3.8) represents the
32
exploitation term that uses current information to drive the sampler towards the maximum residual
region, and the second term is a high temperature exploration term 𝜅ℎ that explores the unknown
region. Notice that the residual also depends on the neural network parameters, then by iterative
optimization of our neural network representation and the sampling points, we can get a good
training set over the phase space.
3.3 Numerical results
3.3.1 One-dimensional state-dependent reduced model of a polymer molecule
To illustrate the core concept of the present method, we begin with a polymer molecule with
𝑁 = 16 atoms, where the intramolecular potential is defined by
𝑉mol(Q) =
𝑁
∑︁
𝑖≠ 𝑗
𝑉p(𝑄𝑖 𝑗 ) +
𝑁𝑏∑︁
𝑖=1
𝑉b(𝑙𝑖) +
𝑁𝑎∑︁
𝑖=1
𝑉a(𝜃𝑖) +
𝑁𝑑
∑︁
𝑖=1
𝑉d(𝜙𝑖),
(3.9)
where 𝑉𝑝 is the Lennard-Jones intermolecular potential, 𝑉𝑏 is the harmonic bonds, 𝑉𝑎 and 𝑉𝑑 denotes
the potential on angle and dihedral angle respectively. The end-to-end distance 𝑞1 = ∥Q1 − Q𝑁 ∥ is
used as a collective variable in the 1D reduced dynamics framework to evaluate the effectiveness
of our current method. We selected 25 distinct points uniformly from the range of 𝑞1 ∈ [2, 18] to
create our training set.
Four auxiliary variables ξ𝑖 are learned in standard GLE and our state-dependent GLE. The
overall velocity correlation function 𝐶𝑣𝑣 (𝑡) = ⟨𝑣1(𝑡)𝑣1(0)⟩ in presented in Figure 3.1(a). Both
state-dependent GLE and standard GLE align well with the MD result. We also present the encoder
weights for non-Markovian features as obtained from the state-dependent GLE approximation in
Figure 3.1(b). The encoder functions demonstrate pronounced oscillations at short times, reflecting
the dynamic interactions present in the system during that initial period. As time progresses,
these oscillations gradually decay to zero, indicating that the influence of these non-Markovian
features diminishes at longer time scales. This behavior underscores the transient nature of the
non-Markovian dynamics in the system.
However, the fitness of the overall GLE do not represent the good performance of the learned
dynamics. We compare the conditional autocorrelation with q starting from 25 selected points by
33
(a)
(b)
Figure 3.1 Numerical results of a one-dimensional reduced model representing the dynamics of the
end–end distance of a polymer molecule system. (a) Overall velocity correlation function 𝐶𝑣𝑣 (𝑡)
obtained from MD, 4th order standard GLE, and state-dependent GLE. (b) Encoder weights for the
three non-Markovian features of the state-dependent GLE.
standard GLE, state-dependent GLE and MD in Figure 3.2. This comparison will reveal differences
of the diffusion behavior at different points on the phase space in MD is captured by the standard
state-dependent but not in standard one. Inaccuracy in the diffusion process will in return affect
the precision of the transition process.Figure 3.2(a) illustrates the time distribution of 𝑞1 for values
greater than 15, comparing results from MD simulations, GLE and state-dependent GLE. The data
reveals that the state-dependent GLE provides a closer match to the MD results than the standard
GLE method.
However, the fitness of the overall GLE do not represent the good performance of the learned
dynamics. We compare the conditional autocorrelation with q starting from 25 selected points by
standard GLE, state-dependent GLE and MD in Figure 3.2. This comparison will reveal differences
of the diffusion behavior at different points on the phase space in MD is captured by the standard
state-dependent but not in standard one. Inaccuracy in the diffusion process will in return affect
the precision of the transition process.Figure 3.2(a) illustrates the time distribution of 𝑞1 for values
greater than 15, comparing results from MD simulations, GLE and state-dependent GLE. The data
reveals that the state-dependent GLE provides a closer match to the MD results than the standard
GLE method.
34
10−210−1100101102t−0.10.10.30.5Cvv(t)MD4thState-dependentGLE4thGLE050100150200j−1.0−0.6−0.20.20.61.0wi(j∆t)w1w2w3(a)
(c)
(b)
(d)
Figure 3.2 Numerical results of a one-dimensional reduced model representing the dynamics of
the end–end distance of a polymer molecule system. (a) Distribution of 𝑞1 > 15 obtained from
the full MD simulations, the 4th-order GLE approximation, and the 4th-order state-dependent GLE
approximation. (b–d) Conditional velocity correlation functions obtained from MD, standard GLE,
and state-dependent GLE, respectively.
3.3.2 Two-dimensional state-dependent reduced model of an alanine dipeptid
We further demonstrate the effectiveness of our state-dependent reduced modeling framework
using the alanine dipeptide molecule (Ace-Ala-Nme), commonly referred to as Ala2. The full-atom
molecular dynamics (MD) simulation is performed for a solvated alanine dipeptide immersed in 383
explicit water molecules at 300 K, using the Amber99-SB force field and the TIP3P water model. A
time step of 2.5 × 10−4 ps is used for numerical integration.
To reduce dimensionality, we adopt two dihedral angles as collective variables (CVs): the 𝜙
angle defined by atoms (C, N, C𝛼, C), and the 𝜓 angle defined by atoms (N, C𝛼, C, N). These angles
35
10−1100101102t−10−5051015Cvv(t)MD10−1100101102t−0.10.10.30.5Cvv(t)4thGLE10−1100101102t−10−5051015Cvv(t)4thState-dependentGLEprovide a compact representation of the molecular conformations.
A consensus-based sampling strategy selects 1,000 representative configurations from the
MD trajectory to train the state-dependent generalized Langevin equation (GLE) model. At each
configuration, we compute the conservative force, effective mass, and the velocity autocorrelation
function. Four auxiliary variables are introduced to close the non-Markovian system, forming a
Markovian embedding that captures state-dependent memory effects.
The model’s accuracy is validated by comparing conditional momentum autocorrelation functions
from two conformational regions. As shown in Figure 3.3, the state-dependent GLE faithfully
reproduces MD results, including subtle oscillatory features that are missed by traditional GLE
models with fixed memory kernels.
36
(a)
(c)
(e)
(b)
(d)
(f)
Figure 3.3 Numerical results of the two-dimensional reduced model in terms of the two dihedral
angles of the alanine dipeptide system. (a–c) Conditional momentum auto-correlation functions
obtained from full MD simulations and the 4th-order GLE approximation at (𝜙, 𝜓) = (−1.60, 2.78).
(d–f) Conditional momentum auto-correlation functions at (𝜙, 𝜓) = (−2.90, −0.16).
We also compute the distribution of transition time between different local minima. Four local
37
minima is selected in Figure 3.4. The distribution of transition time between point 0 and other three
points is shown in Figure 3.5
(a)
Figure 3.4 The heatmap of the free energy surface m𝐺 (𝜙, 𝜓). Colored solid circles mark four local
minima of the configuration.
Figure 3.6 presents the distribution of time spent at each point, based on results from MD, 4𝑡ℎ
order of GLE and 4𝑡ℎ order of sate-dependent GLE. The results demonstrate that our current method
provides improved accuracy for each point compared to GLE method.
3.4 Summary
The state-dependent generalized Langevin equation (GLE) is usefull tool to describe the non-
Markovian behavior in many processes in the CGMD problem accurately. In this study, we employ
our previously developed consensus-based enhanced sampling strategy to simultaneously construct
the heterogeneous memory kernel and the free energy surface. The conservative force is calculated
using constrained dynamics at specific points in the phase space, while the dynamic information at
these points is gathered through multiple free dynamics initiated from the same locations. We then
train our neural network to capture the differences among the various conditional auto-correlation
functions. The results demonstrate that our current method provides improved accuracy for each
point compared to the GLE method.
38
(a)
(c)
(e)
(b)
(d)
(f)
Figure 3.5 Numerical results of a two-dimensional reduced model representing the two dihedral
angles of the alanine dipeptide system. (a) Distribution of transition time from position 0 to position
1. (b) From position 1 to 0. (c) From position 0 to 2. (d) From position 2 to 0. (e) From position 0
to 3. (f) From position 3 to 0.
39
(a)
(c)
(b)
(d)
Figure 3.6 Numerical results of a two-dimensional reduced model representing the two dihedral
angles of the alanine dipeptide system. (a) Distribution of time periods spent at position 0 before
transitioning to position 1. (b–d) Distributions of time spent at positions 1, 2, and 3, respectively.
40
CHAPTER 4
GENERATIVE MODEL BASED IDENTIFYING METASTABLE STATES IN FULL
MOLECULE SPACE
4.1
Introduction
Normalizing flows have gained significant traction in recent years as flexible generative models
that provide exact likelihood evaluation and tractable sampling through invertible transformations
between data and latent spaces. While highly expressive, their performance critically depends
on the structure of the latent prior distribution.
In most conventional settings—including in
many state-of-the-art normalizing flow architectures—a simple unimodal prior such as a standard
multivariate Gaussian is employed. This assumption works well for data distributions that are
themselves unimodal or smoothly varying, but it becomes a substantial bottleneck in modeling
systems characterized by multimodality, sharp transitions, or complex geometrical features in
high-dimensional spaces.
This issue is particularly critical in domains such as molecular dynamics or continuum mechanics,
where data often arise from a mixture of metastable states or rare-event transitions. These systems
naturally lead to multimodal distributions, with each mode representing a distinct macroscopic
configuration or energy basin. While recent developments in normalizing flows—such as NICE,
RealNVP, and Glow Dinh et al. (2014, 2017); Kingma and Dhariwal (2018)—have significantly
improved the expressivity of the transformation through architectural innovations like coupling
layers and conditioning, they still rely on simple unimodal latent priors. As a result, such models are
limited in their ability to identify and represent metastable states explicitly, since the prior structure
does not reflect the inherent multimodality of the system.
Moreover, during training, these models typically focus on maximizing the overall likelihood and
do not incorporate gradient or perturbation-based penalties to enforce key physical constraints—such
as requiring the gradient of the log-density to vanish at latent maxima or ensuring that log-density
values at these mapped maxima are indeed local maxima in data space. Without these constraints,
the learned transformation may distort the latent structure and fail to preserve the correspondence
41
between latent and data-space modes, ultimately limiting the interpretability and metastable state
resolution of the model.
To address these limitations, we propose a generative modeling framework that uses a Mixture-
of-Gaussians (MoG) prior to explicitly represent multiple metastable modes in the latent space. The
goal is not merely to enhance expressivity, but to enforce a maximum-to-maximum correspondence
between the latent space and data space—ensuring that each latent mode is mapped to a high-
density metastable state in the observed configuration space. To achieve this, we use an invertible
transformation that preserves the structure of the distribution under the change of variables. While
our implementation adopts KRNet for this transformation due to its flexibility and scalability, the
approach is general and compatible with any expressive normalizing flow architecture. During
training, we further impose gradient penalties to enforce vanishing gradients at mapped maxima,
and contrastive perturbation penalties to ensure local maximality in data space. This strategy allows
the model to capture and preserve the metastable structure inherent in complex physical systems,
rather than simply fitting the data distribution in a likelihood sense.
Our design is inspired in part by recent advances in multimodal flow-based generative modeling,
such as the bounded KRNet architecture introduced by Peng et al. Peng et al. (2023), which
demonstrated that introducing structural constraints in the latent space can significantly improve
the accuracy and interpretability of normalizing flows. Building on this line of thinking, our
model introduces a Mixture-of-Gaussians (MoG) latent prior not merely for greater flexibility, but
to explicitly encode multiple metastable modes that reflect the complex landscape of molecular
systems. Each Gaussian component captures a different region of the latent space, which is then
mapped—via a KRNet transformation—into a distinct metastable basin in data space. To enforce
this mode-to-mode correspondence, we introduce gradient penalties to drive the log-density gradient
toward zero at each latent mode, and contrastive perturbation losses to ensure that these mapped
points are true maxima in the data space.
Beyond improving generative performance, this design also enhances scientific interpretability.
The multimodal latent structure enables soft clustering of generated configurations, where each mode
42
can be associated with meaningful collective variables (CVs) such as torsion angles, end-to-end
distances, or radius of gyration. This provides insight into the system’s metastable organization
and helps identify the slow reaction coordinates that govern long-time dynamics. In summary, our
MoG-based KRNet formulation introduces a novel framework for aligning latent and data-space
maxima, enabling both accurate density modeling and interpretable discovery of metastable structure
in high-dimensional molecular data.
4.2 Method
4.2.1 Overview of the MoG-KRnet Framework
MoG-KRnet is a bijective generative model that constructs an invertible mapping 𝑓 : R𝑑 → R𝑑
transforming samples from a hybrid latent distribution 𝑝𝑍 to the data distribution 𝑝 𝑋. The key
idea is to approximate a transport map that rearranges a tractable base measure into a complex,
potentially multimodal data distribution. For a given observation x ∈ R𝑑, the model defines the
log-density through the change-of-variable formula:
log 𝑝 𝑋 (x) = log 𝑝𝑍 ( 𝑓 (x)) + log | det 𝐽 𝑓 (x)|,
where 𝐽 𝑓 (x) = ∇x 𝑓 (x) ∈ R𝑑×𝑑 denotes the Jacobian of 𝑓 . This formulation permits exact likelihood
evaluation and allows optimization via maximum likelihood estimation.
4.2.2 Hybrid Latent Prior
The latent variable z ∈ R𝑑 is decomposed into two independent blocks, denoted z1 ∈ R𝑑1
and z2 ∈ R𝑑2, with 𝑑1 + 𝑑2 = 𝑑. The first block z1 follows a product of one-dimensional
mixture-of-Gaussians:
𝑝(z1) =
𝑑1(cid:214)
𝐾 𝑗
∑︁
𝑗=1
𝑘=1
𝜋 𝑗,𝑘 · N (𝑧 𝑗 ; 𝜇 𝑗,𝑘 , 𝜎2
𝑗,𝑘 ),
while the second block z2 ∼ N (0, I𝑑2) is standard Gaussian. This hybrid prior combines multimodal
expressiveness with analytical tractability and defines the target measure for the flow map.
43
4.2.3 KRnet Architecture as Progressive Triangular Transport
Inspired by the Knothe–Rosenblatt rearrangement, MoG-KRnet factorizes the transformation 𝑓
into a sequence of stage-wise maps:
𝑓 = 𝑓 (𝐾) ◦ 𝑓 (𝐾−1) ◦ · · · ◦ 𝑓 (1),
where each stage 𝑓 (𝑘) updates a block of coordinates while conditioning on preceding ones,
approximating triangular transport structure. Each 𝑓 (𝑘) is implemented as a composition of
transformations:
𝑓 (𝑘) = S (𝑘) ◦ A (𝑘) ◦ N (𝑘) ◦ R (𝑘),
where R (𝑘) is a linear transformation with learnable LU structure, N (𝑘) is an actnorm layer that
ensures zero-mean and unit variance per dimension (with learnable parameters), A (𝑘) is a stack
of affine coupling layers (described in Section 3.4), and S (𝑘) performs a squeezing operation that
reallocates dimension usage over stages.
This layered composition ensures that the full Jacobian 𝐽 𝑓 remains triangular or block-triangular,
allowing for efficient computation of log | det 𝐽 𝑓 | as a sum over the individual layers.
4.2.4 Affine Coupling Transformations
Each affine coupling layer partitions the input x = [x1, x2], and updates x2 using a scale-and-shift
transformation conditioned on x1:
x′
2 = x2 ⊙ (1 + 𝛼 · tanh(𝑠(x1))) + 𝛾 · tanh(𝑡 (x1)).
Here, 𝑠 and 𝑡 are neural networks; 𝛼 ∈ (0, 1) is a fixed stability parameter (e.g., 0.6); 𝛾 ∈ R𝑑′
is
a learnable global vector. The inverse transformation is analytically computable, ensuring exact
invertibility. The log-determinant of the Jacobian for each coupling layer is efficiently computed as:
log | det 𝐽A | =
𝑑′
∑︁
𝑖=1
log (1 + 𝛼 · tanh(𝑠𝑖 (x1))) ,
which contributes additively to the total log-likelihood.
44
4.2.5 Mode Alignment Regularization
We introduce a geometric regularization mechanism to promote semantic alignment between the
latent and data spaces. Let zmax ∈ R𝑑 be the point in latent space corresponding to the global mode
of the prior. We define it as the concatenation of the mean of the dominant mixture components in
z1, and the zero vector in z2:
zmax = [𝜇∗
1
, . . . , 𝜇∗
𝑑1
, 0, . . . , 0],
where 𝜇∗
𝑗 is the mean of the most probable component for dimension 𝑗. We compute the corresponding
data-space mode as xmax = 𝑓 −1(zmax).
To encourage xmax to align with a mode of the data distribution, we introduce two regularization
terms. The first is a gradient penalty:
Lgrad =
(cid:13)
(cid:13)
(cid:13)
∇x log 𝑝 𝑋 (x)(cid:12)
(cid:12)x=xmax
2
(cid:13)
(cid:13)
(cid:13)
,
which encourages stationarity of the log-density at xmax. The second is a local contrastive penalty
defined over perturbed neighborhoods:
Lcontrast =
1
𝑀
𝑀
∑︁
𝑖=1
(cid:16)
max
0, log 𝑝 𝑋 (xneigh
𝑖
) − log 𝑝 𝑋 (xmax)
(cid:17)
,
where xneigh
𝑖
= xmax + 𝜖𝑖, and 𝜖𝑖 ∼ N (0, 𝜎2I). This enforces that xmax is not only a critical point, but
a local maximum.
4.2.6 Objective Function
The total loss function used for training is the sum of the negative log-likelihood and the
mode-alignment penalties:
L = −Ex∼D log 𝑝 𝑋 (x) + 𝜆grad · Lgrad + 𝜆contrast · Lcontrast.
Here, 𝜆grad, 𝜆contrast ≥ 0 control the strength of the regularization terms.
4.2.7 Sampling and Inference
For density evaluation, an input x ∈ R𝑑 is mapped through the flow to obtain z = 𝑓 (x), and the
log-likelihood is computed via:
log 𝑝 𝑋 (x) = log 𝑝𝑍 (z) +
𝐾
∑︁
𝑘=1
log | det 𝐽 𝑓 (𝑘 ) (·)|.
45
Each sub-map contributes a triangular Jacobian, so the determinant is computed in linear time. The
prior log-density is decomposed as:
log 𝑝𝑍 (z) =
𝑑1∑︁
𝑗=1
𝐾 𝑗
∑︁
𝑘=1
log (cid:169)
(cid:173)
(cid:171)
𝜋 𝑗,𝑘 · N (𝑧 𝑗 ; 𝜇 𝑗,𝑘 , 𝜎2
𝑗,𝑘 )(cid:170)
(cid:174)
(cid:172)
−
1
2
∥z2∥2 −
𝑑2
2
log(2𝜋).
To generate samples, latent vectors z ∼ 𝑝𝑍 are drawn by sampling each MoG dimension 𝑧 𝑗 from its
categorical mixture and the Gaussian block from standard normal. The resulting z is passed through
the inverse flow x = 𝑓 −1(z), which is exact and fully differentiable.
4.3 Numerical Result
4.3.1 Approximation of the Müller-Brown Equilibrium Distribution
We first evaluate the capacity of MoG-KRnet to approximate a complex, multimodal target
distribution arising from the well-known Müller-Brown potential—a classical benchmark in
molecular simulation that features multiple metastable wells separated by high-energy barriers. The
target equilibrium distribution is the Boltzmann-Gibbs measure:
𝑝 𝑋 (x) =
1
𝑍 exp
(cid:18)
−
𝑈 (x)
𝑘 𝐵𝑇
(cid:19)
,
where 𝑈 (x) denotes the potential energy, 𝑇 = 30𝐾, and x ∈ R2. We simulate overdamped Langevin
dynamics,
𝑑x
𝑑𝑡
= −∇𝑈 (x) + √︁
2𝑘 𝐵𝑇 · ξ(𝑡),
to generate a reference dataset of 5 million samples. These samples serve as empirical draws from
the true equilibrium distribution 𝑝 𝑋.
The model is instantiated as a single-stage KRnet with depth 𝐷 = 64, input dimension
𝑑 = 2, and coupling width 256. Each flow stage includes alternating affine coupling layers with
learnable LU-based linear transformations and interleaved squeezing operations. The coupling
layers implement:
z2 = z2 ⊙ (1 + 𝛼 · tanh(𝑠(z1))) + 𝛾 · tanh(𝑡 (z1)),
46
with 𝛼 = 0.6, and 𝑠, 𝑡 being two-layer neural networks with ReLU activations. The latent prior
𝑝𝑍 (z) consists of a product of a one-dimensional mixture-of-Gaussians:
2
∑︁
𝑝(𝑧1) =
𝜋𝑘 · N (𝑧1; 𝜇𝑘 , 𝜎2
𝑘 ),
𝜋 = [0.6, 0.4],
𝜇 = [−2, 2], 𝜎 = [1.0, 0.5],
𝑘=1
and a standard Gaussian 𝑧2 ∼ N (0, 1).
In addition to maximum likelihood estimation, we introduce geometric regularization to ensure
that the high-density region of the latent prior is mapped to a corresponding mode in the target
space. This is done via two penalty terms:
1. A local contrastive penalty encourages log-probability at mapped prior mode xmax to exceed
its local neighbors:
Lcontrast =
(cid:16)
max
𝑀
∑︁
𝑖=1
0, log 𝑝 𝑋 (xneigh
𝑖
) − log 𝑝 𝑋 (xmax)
(cid:17)
,
where xneigh
𝑖
= xmax + 𝜖𝑖, and 𝜖𝑖 ∼ N (0, 𝜎2I).
2. A gradient penalty enforces ∇ log 𝑝 𝑋 (xmax) ≈ 0:
Lgrad =
(cid:13)
(cid:13)
(cid:13)
∇x log 𝑝 𝑋 (x)(cid:12)
(cid:12)x=xmax
(cid:13)
(cid:13)
(cid:13)1
.
The total objective is:
L = −Ex∼D [log 𝑝 𝑋 (x)] + 𝜆contrastLcontrast + 𝜆gradLgrad
Training is performed using the Adam optimizer with an initial learning rate of 1 × 10−4, which
is decayed by 10% every 5000 iterations. Penalty coefficients 𝜆contrast, 𝜆grad, and 𝜆align are initialized
at small values and gradually increased (by 5% every 5000 iterations) to guide the flow toward stable
mode alignment while preventing instability in early optimization.
Minibatches of 2D coordinates are drawn from preprocessed subsets {set 𝑗 }19
𝑗=0, and the entire
training loop runs for 100,000 steps. Gradients are clipped using global norm clipping with a
threshold of 0.01. Checkpoints are saved regularly and used for restarting long runs.
Figure 4.1 shows the learned transport map learned by MoG-KRnet. On the left, samples drawn
from the latent space exhibit two clearly separated high-density regions corresponding to distinct
47
components in the mixture-of-Gaussians prior. After training, these two latent modes are transported
via the inverse flow 𝑓 −1 into two distinct basins of the Müller-Brown energy landscape, shown on
the right.
This demonstrates that the model not only captures the overall multimodal structure of the target
distribution but also learns a semantically consistent transport: each latent mode is mapped to a
specific metastable state of the physical system. The smoothness and separation of the transformed
samples reflect that the triangular KRnet flow—coupled with the hybrid prior and geometric
regularization—successfully avoids mode collapse and learns a one-to-one correspondence between
latent and physical modes.
Such mode-resolving behavior is difficult to achieve with standard normalizing flows that rely on
unimodal priors. In contrast, MoG-KRnet leverages the flexibility of mixture components to assign
and map separated probability mass to different energetic regions in a physically meaningful way.
This mode alignment improves both sampling fidelity and interpretability, especially in systems
where multiple competing basins dominate the dynamics.
Figure 4.1 Learned transport from latent space to physical space using MoG-KRnet. The left half of
the image shows samples drawn from the latent distribution, which has two distinct high-density
regions due to the mixture-of-Gaussians prior. The right half shows the transformed samples under
the inverse flow 𝑓 −1, which accurately maps the two latent modes into the two metastable basins of
the Müller-Brown potential. This demonstrates the model’s ability to perform semantically aligned
mode separation, transporting distinct regions of latent mass to physically meaningful targets.
48
4.3.2 Approximation of the Alanine Dipeptide Equilibrium Distribution
To further assess the scalability and generalization capacity of MoG-KRnet, we apply it
to approximate the equilibrium distribution of a higher-dimensional molecular system: alanine
dipeptide in implicit solvent. This molecule is a well-known testbed in molecular simulation due
to its low dimensionality yet rich conformational landscape, characterized by transitions between
metastable states in the Ramachandran (𝜙, 𝜓)-angle space and the full Cartesian coordinates of
selected atoms.
We generate reference samples for alanine dipeptide (Ace-Ala-Nme, commonly referred to as
Ala2) via full-atom molecular dynamics (MD) simulation in explicit solvent. The simulation system
consists of the alanine dipeptide molecule immersed in 383 TIP3P water molecules. The simulation
is carried out at 350 K using the Amber99-SB force field and Langevin dynamics for temperature
control. A time step of 2.5 × 10−4 ps is used for numerical integration.
Trajectories are collected from equilibrium simulations and projected onto a reduced coordinate
space consisting of Cartesian positions of selected heavy atoms. In total, 5 million configurations are
used to construct the dataset Dala ∼ 𝑝 𝑋, representing the high-dimensional equilibrium distribution
over molecular conformations.
The MoG-KRnet model is constructed to map the equilibrium distribution of alanine dipeptide
into a structured latent space. The model input is a 15-dimensional vector representing the
Cartesian coordinates of five key atoms—[5, 7, 9, 15, 17]—selected to capture relevant backbone
conformational fluctuations while avoiding redundant degrees of freedom. This atom selection
encompasses chemically meaningful internal coordinates, including both 𝜙 and 𝜓 torsions, as well
as spatial end-to-end geometry.
The flow transformation 𝑓 : R15 → R15 consists of 7 staged mappings inspired by the pseudo-
triangular structure of the Knothe–Rosenblatt rearrangement. Each stage contains 24 affine coupling
layers with hidden width 128, supplemented by actnorm, LU-based rotations, and squeezing
operations. This progressive architecture allows early layers to resolve nonlinear, multimodal
features in dominant subspaces, while later layers refine global geometry.
49
The latent prior 𝑝𝑍 is a hybrid of three independent one-dimensional mixture-of-Gaussians
(MoG) and a standard multivariate Gaussian:
𝑝𝑍 (z) =
3
(cid:214)
𝐾 𝑗
∑︁
𝑗=1
𝑘=1
𝜋 𝑗,𝑘 · N (𝑧 𝑗 ; 𝜇 𝑗,𝑘 , 𝜎2
𝑗,𝑘 ) · N (z4:15; 0, I12).
This design reflects the assumption that a small number of latent coordinates capture discrete
conformational transitions (e.g., basin-hopping), while the remaining degrees of freedom reflect
continuous fluctuations in local structure. The independence of latent dimensions facilitates efficient
sampling and interpretable mode decomposition. Latent samples are drawn by independently
sampling each MoG dimension using categorical selection followed by Gaussian sampling, and
appending a standard normal vector for the Gaussian block.
Training is performed on a dataset of 5 million MD samples using a composite loss:
Ltotal = LNLL + 𝜆gradLgrad + 𝜆contrastLcontrast + Lrep.
Here, LNLL is the negative log-likelihood, Lgrad enforces local maximality at the mapped latent
maxima, and Lcontrast ensures neighborhood contrast.
To prevent mode collapse, a pairwise repulsion loss Lrep is introduced between mapped maxima
{x(𝑖)
max} using a soft margin criterion.
Training runs for 200,000 iterations with the Adam optimizer, an initial learning rate of 10−7,
and dynamically scaled penalty weights. During training, KDE diagnostics are periodically used
to ensure that generated samples recover the correct distributions in torsional and Euclidean
observables.
In Fig. 4.2, we compare the predicted marginal densities of the 𝜙 and 𝜓 dihedral angles from
KRnet to reference MD histograms. MoG-KRnet accurately reproduces all modes in both coordinates
and captures the correct relative amplitudes. In particular, the sharp peak near 𝜓 ≈ 3.0 is learned
precisely, and the multimodality in 𝜙 is preserved without mode collapse.
To assess the learned joint dependencies, we visualize the 2D density over (𝜙, 𝜓) in Fig. 4.3,
along with the eight mode points mapped from latent maxima. The conformational basins of Ala2 are
clearly recovered, and the mode points are well-separated, each landing within distinct high-density
50
(a) Marginal distribution of 𝜙
(b) Marginal distribution of 𝜓
Figure 4.2 Comparison of 1D dihedral angle marginals between KRnet (red) and MD ground truth
(blue).
regions. This confirms that MoG-KRnet not only fits the data globally but also identifies meaningful
latent structure.
(a) 𝜙–𝜓 dihedral distribution of Ala2. Warm
background indicates predicted density; overlaid
points mark mapped latent maxima.
(b) End-to-end distance vs. radius of gyration
(𝑅𝑔) from sampled configurations. Latent max-
ima are projected onto this plane.
Figure 4.3 Predicted equilibrium features of alanine dipeptide from the trained MoG-KRnet model.
An important feature of our MoG-KRnet framework is the explicit mapping of latent density
modes to high-probability basins in configuration space. By construction, each mode of the hybrid
prior z(𝑖)
max ∈ R𝑑. These
mapped maxima are designed to coincide with peaks in the learned data distribution 𝑝 𝑋 (x), enforced
max is mapped through the inverse flow 𝑓 −1 to a corresponding point x(𝑖)
51
during training via gradient and contrastive penalties.
As shown in Figure 4.3, this alignment holds in both the dihedral angle space (𝜙, 𝜓) and
in structural coordinates.
In the left panel, each mapped latent mode falls within a distinct
conformational basin in the (𝜙, 𝜓) landscape, indicating that MoG-KRnet captures metastability
through a structured latent space. In the right panel, the same latent maxima are distributed across the
manifold of end-to-end distance and radius of gyration, further supporting the physical consistency
and geometric expressiveness of the learned model. These results confirm that our framework not
only generates accurate samples but also provides a semantically meaningful latent representation
that aligns with physically interpretable features of the molecular system.
4.4 Summary.
In this work, we proposed MoG-KRnet, a novel flow-based generative framework designed for ap-
proximating high-dimensional equilibrium distributions in complex molecular systems. Our approach
builds upon the theory of invertible transformations and exploits a staged Knothe–Rosenblatt-inspired
architecture to progressively map structured latent representations to physical configuration space.
A distinguishing feature of MoG-KRnet is its use of a hybrid latent prior that combines independent
one-dimensional mixture-of-Gaussians (MoG) components with standard Gaussian variables. This
formulation enables the model to flexibly represent multi-modal distributions while maintaining
computational tractability and efficient sampling.
To ensure meaningful correspondence between latent and physical modes, we introduced a
max encodes a distinct peak in the latent density, and then enforcing through loss penalties
mode-alignment strategy during training. This involves constructing the prior such that each latent
mode z(𝑖)
that each of these modes is mapped to a high-probability region x(𝑖)
max) in the observed
space. This is achieved via gradient-based penalties to minimize ∥∇x log 𝑝 𝑋 (xmax) ∥, and contrastive
max = 𝑓 −1(z(𝑖)
penalties to ensure that xmax is indeed a local maximum compared to its neighbors. This alignment
strategy imbues the model with semantic coherence and supports downstream interpretability.
The efficacy of MoG-KRnet was demonstrated on two benchmark systems. For the Müller-Brown
potential, we showed that the model accurately captured the bimodal equilibrium distribution and
52
learned to associate each latent mode with a distinct physical basin. For the more complex alanine
dipeptide molecule in explicit solvent, the model was trained on 5 million full-atom MD snapshots
and succeeded in learning the joint equilibrium distribution over both angular variables (dihedral
angles 𝜙, 𝜓) and structural observables (end-to-end distance and radius of gyration). In all cases,
the mapped latent maxima landed squarely in dominant high-density regions of the data space,
confirming the success of the mode-to-basin alignment. Furthermore, generated samples from the
model reproduced the marginal and joint distributions of key physical features with high fidelity,
closely matching empirical histograms derived from MD data.
Together, these contributions underscore the dual strengths of MoG-KRnet: the capacity to
approximate complex, multi-modal densities in high dimensions, and the ability to structure the
latent space in a physically meaningful and interpretable manner. Our results demonstrate that
MoG-KRnet provides not only a powerful generative model but also a principled tool for reduced
representation of molecular systems, where the mapping from latent to physical coordinates respects
the underlying metastable structure of the dynamics. This makes it particularly well-suited for tasks
in coarse-grained modeling, statistical reweighting, and uncertainty-aware exploration of equilibrium
configurations.
53
CHAPTER 5
CONCLUSION
This thesis presents a unified, data-driven framework for constructing reduced-order models of
high-dimensional, non-Markovian dynamical systems. By integrating advances in memory-aware
modeling and normalizing flow-based latent representations, we address two central challenges
in coarse-grained modeling: accurately capturing long-time correlations and resolving complex,
multi-modal equilibrium distributions.
We began by developing a novel learning-based approach to non-Markovian stochastic re-
duced modeling. By augmenting the resolved dynamics with a set of learned auxiliary vari-
ables—interpretable as non-Markovian features—we showed that the complex memory effects
embedded in full-atom molecular simulations can be faithfully captured without directly estimating
memory kernels. This framework builds on the Mori–Zwanzig formalism and circumvents conven-
tional kernel fitting by matching correlation functions in an extended variable space. Numerical
results on tagged particles and polymer chains demonstrated excellent agreement with full molecular
dynamics (MD) simulations, validating the expressiveness and robustness of the proposed models.
We extended this methodology to incorporate state-dependent memory kernels, thereby enabling
more realistic dynamics in systems with heterogeneous free energy landscapes. Our framework
captures local variations in unresolved degrees of freedom and accommodates basin-specific
relaxation times and noise structures. Through simulations on polymer systems, we observed
significant improvements in predictive accuracy and sampling fidelity compared to global or
fixed-kernel models.
To handle the challenge of modeling complex equilibrium distributions, we introduced a new
Mixture-of-Gaussians (MoG) KRNet architecture—termed KRnet-MoG-GLE—as a probabilistic
generative model for full molecular dynamics. By replacing the unimodal latent prior with a flexible
MoG prior, our model gains the capacity to represent multiple metastable basins and generate
samples that better match empirical distributions. Importantly, we designed the training to enforce
mode-alignment, ensuring that the maxima of the latent variables map to high-probability regions in
54
the data space. This was demonstrated convincingly in the Müller-Brown and alanine dipeptide
systems, where our model successfully resolved distinct basins and accurately approximated target
observables such as end-to-end distance, radius of gyration, and dihedral angles.
Taken together, the contributions of this work represent a significant step forward in the design of
interpretable, memory-embedded, and generative reduced-order models. By marrying the strengths
of stochastic modeling with expressive latent-variable architectures, our approach enables efficient
exploration and inference in systems that are otherwise intractable due to their high dimensionality
and long memory effects.
Ultimately, this thesis lays the foundation for data-driven, physically-consistent reduced models
that can serve as scalable surrogates for multi-scale simulations, with broad applicability across
molecular biophysics, soft materials, and beyond.
55
BIBLIOGRAPHY
Ayaz, C., Dalton, B. A., and Netz, R. R. (2022). Generalized langevin equation with a non-linear
potential of mean force and non-linear memory friction from a hybrid projection scheme. Physical
Review E, 105:054138.
Baczewski, A. D. and Bond, S. D. (2013). Numerical integration of the extended variable generalized
Langevin equation with a positive Prony representable memory kernel. The Journal of chemical
physics, 139(4):044107.
Barducci, A., Bussi, G., and Parrinello, M. (2008). Well-tempered metadynamics: a smoothly
converging and tunable free-energy method. Physical Review Letters, 100(2):020603.
Bittracher, A., Koltai, P., Klus, S., Banisch, R., Dellnitz, M., and Schütte, C. (2018). Transition
manifolds of complex metastable systems. Journal of Nonlinear Science, 28(2):471–512.
Bonati, L., Zhang, Y.-Y., and Parrinello, M. (2019). Neural networks-based variationally enhanced
sampling. Proceedings of the National Academy of Sciences, 116(36):17641–17647.
Ceriotti, M., Bussi, G., and Parrinello, M. (2009). Langevin equation with colored noise for
constant-temperature molecular dynamics simulations. Physical review letters, 102(2):020601.
Chen, M., Li, X., and Liu, C. (2014). Computation of the memory functions in the generalized
Langevin models for collective dynamics of macromolecules. J. Chem. Phys., 141:064112.
Chiavazzo, E., Covino, R., Coifman, R. R., Gear, C. W., Georgiou, A. S., Hummer, G., and
Kevrekidis, I. G. (2017). Intrinsic map dynamics exploration for uncharted effective free-energy
landscapes. Proceedings of the National Academy of Sciences, 114(28):E5494–E5503.
Chorin, A. J., Hald, O. H., and Kupferman, R. (2002). Optimal prediction with memory. Phys. D,
166:239–257.
Chorin, A. J. and Lu, F. (2015). Discrete approach to stochastic parametrization and dimension
reduction in nonlinear dynamics. Proceedings of the National Academy of Sciences, 112(32):9804–
9809.
Coifman, R. R., Kevrekidis, I. G., Lafon, S., Maggioni, M., and Nadler, B. (2008). Diffusion maps,
reduction coordinates, and low dimensional representation of stochastic systems. Multiscale
Modeling & Simulation, 7(2):842–864.
Corless, M. and Frazho, A. (2003). Linear Systems and Control: An Operator Perspective. Chapman
& Hall/CRC Pure and Applied Mathematics. Taylor & Francis.
Crosskey, M. and Maggioni, M. (2017). Atlas: A geometric approach to learning high-dimensional
stochastic systems near manifolds. Multiscale Modeling & Simulation, 15(1):110–156.
56
Daldrop, J. O., Kowalik, B. G., and Netz, R. R. (2017). External potential modifies friction of
molecular solutes in water. Physical Review X, 7(4):041065.
Darve, E. and Pohorille, A. (2001). Calculating free energies using average force. The Journal of
Chemical Physics, 115(20):9169–9183.
Darve, E., Solomon, J., and Kia, A. (2009a). Computing generalized Langevin equations and
generalized fokker-planck equations. Proc. Natl. Acad. Sci., 106(27):10884–10889.
Darve, E., Solomon, J., and Kia, A. (2009b). Computing generalized Langevin equations and general-
ized Fokker-Planck equations. Proceedings of the National Academy of Sciences, 106(27):10884–
10889.
Davtyan, A., Dama, J. F., Voth, G. A., and Andersen, H. C. (2015a). Dynamic force matching:
A method for constructing dynamical coarse-grained models with realistic time dependence. J.
Chem. Phys., 142.
Davtyan, A., Dama, J. F., Voth, G. A., and Andersen, H. C. (2015b). Dynamic force matching:
A method for constructing dynamical coarse-grained models with realistic time dependence. J.
Chem. Phys., 142.
de Gennes, P. (1979). Scaling concepts in polymer physics. Cornell Univ. Pr.
Dibak, M., del Razo, M. J., De Sancho, D., Schütte, C., and Noé, F. (2018). Msm/rd: Coupling
markov state models of molecular kinetics with reaction-diffusion simulations. The Journal of
Chemical Physics, 148(21):214107.
Dinh, L., Krueger, D., and Bengio, Y. (2014). Nice: Non-linear independent components estimation.
arXiv preprint arXiv:1410.8516.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using real nvp.
In
International Conference on Learning Representations (ICLR).
Español, P. (2004). Statistical mechanics of coarse-graining. In Novel Methods in Soft Matter
Simulations, pages 69–115. Springer.
Fang, L., Ge, P., Zhang, L., E, W., and Lei, H. (2022). DeePN2: A deep learning-based non-Newtonian
hydrodynamic model. Journal of Machine Learning, 1:114–140.
Feng, L., Gao, T., Dai, M., and Duan, J. (2022). Auto-sde: Learning effective reduced dynamics
from data-driven stochastic dynamical systems. arXiv preprint:arXiv:2205.04151.
Frenkel, D. and Smit, B. (2001). Understanding molecular simulation:
from algorithms to
applications, volume 1. Elsevier.
57
Ge, P., Zhang, Z., and Lei, H. (2024). Data-driven learning of the generalized langevin equation
with state-dependent memory. Physical Review Letters, 133(7):077301.
Giannakis, D. (2019). Data-driven spectral decomposition and forecasting of ergodic dynamical
systems. Applied and Computational Harmonic Analysis, 47(2):338–396.
Grogan, F., Lei, H., Li, X., and Baker, N. A. (2020). Data-driven molecular modeling with the
generalized langevin equation. J. Comput. Phys., 418:109633–109641.
Harlim, J., Jiang, S. W., Liang, S., and Yang, H. (2020). Machine learning for prediction with
missing dynamics. Journal of Computational Physics, page 109922.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8):1735–1780.
Hudson, T. and Li, X. H. (2020). Coarse-Graining of Overdamped Langevin Dynamics via the
Mori–Zwanzig Formalism. Multiscale Modeling & Simulation, 18(2):1113–1135.
Jung, G., Hanke, M., and Schmid, F. (2017a). Iterative reconstruction of memory kernels. Journal
of Chemical Theory and Computation, 13(6):2481–2488.
Jung, G., Hanke, M., and Schmid, F. (2017b). Iterative reconstruction of memory kernels. Journal
of Chemical Theory and Computation, 13(6):2481–2488.
Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization.
International
Conference on Learning Representations (ICLR).
Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In
Advances in neural information processing systems (NeurIPS).
Klippenstein, V., Tripathy, M., Jung, G., Schmid, F., and van der Vegt, N. F. (2021). Introduc-
ing memory in coarse-grained molecular simulations. The Journal of Physical Chemistry B,
125(19):4931–4954.
Klus, S., Nüske, F., Koltai, P., Wu, H., Kevrekidis, I., Schütte, C., and Noé, F. (2018). Data-
driven model reduction and transfer operator approximation. Journal of Nonlinear Science,
28(3):985–1010.
Klus, S., Nüske, F., Peitz, S., Niemann, J.-H., Clementi, C., and Schütte, C. (2020). Data-driven
approximation of the koopman generator: Model reduction, system identification, and control.
Physica D: Nonlinear Phenomena, 406:132416.
Koopman, B. O. (1931). Hamiltonian systems and transformation in Hilbert space. Proceedings of
the National Academy of Sciences, 17(5):315–318.
58
Kowalik, B., Daldrop, J. O., Kappler, J., Schulz, J. C., Schlaich, A., and Netz, R. R. (2019).
Memory-kernel extraction for different molecular solutes in solvents of varying viscosity in
confinement. Physical Review E, 100(1):012126.
Krivov, S. V. (2013). On reaction coordinate optimality.
Journal of Chemical Theory and
Computation, 9(1):135–146.
Kubo, R. (1966). The fluctuation-dissipation theorem. Reports on Progress in Physics, 29(1):255–
284.
Kumar, S., Rosenberg, J. M., Bouzida, D., Swendsen, R. H., and Kollman, P. A. (1992). THE
weighted histogram analysis method for free-energy calculations on biomolecules. I. The method.
Journal of Computational Chemistry, 13(8):1011–1021.
Laio, A. and Parrinello, M. (2002a). Escaping free-energy minima. Proceedings of the National
Academy of Sciences, 99(20):12562–12566.
Laio, A. and Parrinello, M. (2002b). Escaping free-energy minima. Proceedings of the National
Academy of Sciences, 99(20):12562–12566.
Lange, O. F. and Grubmüller, H. (2006). Collective Langevin dynamics of conformational motions
in proteins. The Journal of Chemical Physics, 124(21):214903.
Lange, O. F. and Grubmüller, H. (2006). Collective Langevin dynamics of conformational motions
in proteins. J. Chem. Phys., 124:214903.
Lee, H. S., Ahn, S.-H., and Darve, E. F. (2019a). The multi-dimensional generalized Langevin
equation for conformational motion of proteins. The Journal of Chemical Physics, 150(17):174113.
Lee, H. S., Ahn, S.-H., and Darve, E. F. (2019b). The multi-dimensional generalized Langevin
equation for conformational motion of proteins. The Journal of Chemical Physics, 150(17):174113.
Lei, H., Baker, N. A., and Li, X. (2016a). Data-driven parameterization of the generalized Langevin
equation. Proceedings of the National Academy of Sciences, 113(50):14183–14188.
Lei, H., Baker, N. A., and Li, X. (2016b). Data-driven parameterization of the generalized Langevin
equation. Proceedings of the National Academy of Sciences, 113(50):14183–14188.
Lei, H. and Li, X. (2021). Petrov–Galerkin methods for the construction of non-Markovian dynamics
preserving nonlocal statistics. The Journal of Chemical Physics, 154(18):184108.
Lei, H., Wu, L., and E, W. (2020). Machine learning based non-Newtonian fluid model with
molecular fidelity. Physical Review E, 102:043309.
Li, W. and Ma, A. (2014). Recent developments in methods for identifying reaction coordinates.
59
Molecular Simulation, 40(10-11):784–793.
Lin, K. K. and Lu, F. (2021). Data-driven model reduction, Wiener projections, and the Koopman-
Mori-Zwanzig formalism. Journal of Computational Physics, 424:109864.
Lu, J. and Vanden-Eijnden, E. (2014). Exact dynamical coarse-graining without time-scale separation.
The Journal of Chemical Physics, 141(4):044109.
Lyu, L. and Lei, H. (2023a). Consensus-based construction of high-dimensional free energy surface.
arXiv preprint arXiv:2311.05009.
Lyu, L. and Lei, H. (2023b). Construction of coarse-grained molecular dynamics with many-body
non-markovian memory. Physical Review Letters, 131(17):177301.
Ma, C., Wang, J., and E, W. (2018). Model reduction with memory and the machine learning of
dynamical systems. Communications in Computational Physics, 25(4):947–962.
Ma, L., Li, X., and Liu, C. (2019). Coarse-graining langevin dynamics using reduced-order
techniques. Journal of Computational Physics, 380:170–190.
Maragliano, L. and Vanden-Eijnden, E. (2006). A temperature accelerated method for sampling free
energy and determining reaction pathways in rare events simulations. Chemical Physics Letters,
426(1):168 – 175.
Maragliano, L. and Vanden-Eijnden, E. (2008). Single-sweep methods for free energy calculations.
The Journal of Chemical Physics, 128(18):184110.
Mori, H. (1965a). A continued-fraction representation of the time-correlation functions. Prog.
Theor. Phys., 34:399–416.
Mori, H. (1965b). Transport, collective motion, and Brownian motion. Progress of Theoretical
Physics, 33(3):423–455.
Noid, W. G., Chu, J.-W., Ayton, G. S., Krishna, V., Izvekov, S., Voth, G. A., Das, A., and Andersen,
H. C. (2008). The multiscale coarse-graining method. I. A rigorous bridge between atomistic and
coarse-grained models. J. Chem. Phys., 128(24):244114.
Peng, J., Wu, H., Zhou, Z., and Nie, Q. (2023). Bounded krnet: Density estimation for bounded
data with normalizing flows. arXiv preprint arXiv:2305.09063.
Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G., and Noé, F. (2013). Identification
of slow molecular order parameters for markov model construction. The Journal of Chemical
Physics, 139(1):015102.
Price, J., Meuris, B., Shapiro, M., and Stinis, P. (2021). Optimal renormalization of multiscale
60
systems. Proceedings of the National Academy of Sciences, 118(37):e2102266118.
Rohrdanz, M. A., Zheng, W., Maggioni, M., and Clementi, C. (2011). Determination of reaction
coordinates via locally scaled diffusion map. The Journal of Chemical Physics, 134(12):124116.
Rosso, L., Minàry, P., Zhu, Z., and Tuckerman, M. E. (2002). On the use of the adiabatic molecular
dynamics technique in the calculation of free energy profiles. The Journal of Chemical Physics,
116(11):4389–4402.
Russo, A., Durán-Olivencia, M. A., Kevrekidis, I. G., and Kalliadasis, S. (2019). Deep learning as
closure for irreversible processes: A data-driven generalized Langevin equation. arXiv preprint
arXiv:1903.09562.
Shaffer, P., Valsson, O., and Parrinello, M. (2016). Enhanced, targeted sampling of high-dimensional
free-energy landscapes using variationally enhanced sampling, with an application to chignolin.
Proceedings of the National Academy of Sciences, 113(5):1150–1155.
She, Z., Ge, P., and Lei, H. (2023). Data-driven construction of stochastic reduced dynamics encoded
with non-markovian features. Journal of Chemical Physics, 158(3):034102.
Stinis, P. (2015). Renormalized Mori-Zwanzig-reduced models for systems without scale sepa-
ration. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences,
471(2176):20140446.
Torrie, G. M. and Valleau, J. P. (1977). Nonphysical sampling distributions in monte carlo free-energy
estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199.
Valsson, O. and Parrinello, M. (2014). Variational approach to enhanced sampling and free energy
calculations. Phys. Rev. Lett., 113:090601.
Vlachas, P. R., Byeon, W., Wan, Z. Y., Sapsis, T. P., and Koumoutsakos, P. (2018). Data-driven fore-
casting of high-dimensional chaotic systems with long short-term memory networks. Proceedings
of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844.
Vroylandt, H., Goudenège, L., Monmarché, P., Pietrucci, F., and Rotenberg, B. (2022). Likelihood-
based non-markovian models from molecular dynamics. Proceedings of the National Academy
of Sciences, 119(13):e2117586119.
Wall, H. (1948). Analytic Theory of Continued Fractions. Analytic Theory of Continued Fractions.
D. Van Nostrand Company.
Wang, Q., Ripamonti, N., and Hesthaven, J. S. (2020a). Recurrent neural network closure of
parametric POD-Galerkin reduced-order models based on the Mori-Zwanzig formalism. Journal
of Computational Physics, 410:109402.
61
Wang, Q., Ripamonti, N., and Hesthaven, J. S. (2020b). Recurrent neural network closure of
parametric POD-Galerkin reduced-order models based on the Mori-Zwanzig formalism. Journal
of Computational Physics, 410:109402.
Wang, S., Ma, Z., and Pan, W. (2020c). Data-driven coarse-grained modeling of polymers in
solution with structural and dynamic properties conserved. Soft Matter, 16(36):8330–8344.
Xie, P. and E, W. (2024). Coarse-graining conformational dynamics with multidimensional general-
ized langevin equation: How, when, and why. Journal of Chemical Theory and Computation.
Ye, F. X. F., Yang, S., and Maggioni, M. (2021). Nonlinear model reduction for slow-fast stochastic
systems near manifolds. arXiv preprint:arXiv:2104.02120.
Zhu, Y., Tang, Y.-H., and Kim, C. (2022). Learning stochastic dynamics with statistics-informed
neural network. arXiv preprint: arXiv:2202.12278.
Zhu, Y. and Venturi, D. (2018). Faber approximation of the Mori-Zwanzig equation. Journal of
Computational Physics, 372:694–718.
Zhu, Y. and Venturi, D. (2020). Generalized langevin equations for systems with local interactions.
Journal of Statistical Physics, 178:1217.
Zieliński, P. and Hesthaven, J. S. (2022). Discovery of slow variables in a class of multiscale
stochastic systems via neural networks. Journal of Nonlinear Science, 32(4):51.
Zwanzig, R. (1973). Nonlinear generalized Langevin equations. Journal of Statistical Physics,
9(3):215–220.
Zwanzig, R. (2001). Nonequilibrium Statistical Mechanics. Oxford University Press.
62
APPENDIX A
MICROSCALE MODEL OF THE POLYMER MOLECULE
The polymer molecule is modeled as a bead-spring chain consisting of 4 sub-units. Each sub-unit
consists of 4 atoms. The full potential is given by
𝑉m𝑜𝑙 (m𝑄) =
𝑁
∑︁
𝑖≠ 𝑗
𝑉p(𝑄𝑖 𝑗 ) +
𝑁𝑏∑︁
𝑖=1
𝑉b(𝑙𝑖) +
𝑁𝑎∑︁
𝑖=1
𝑉a(𝜃𝑖) +
𝑁𝑑
∑︁
𝑖=1
𝑉d(𝜙𝑖),
(A.1)
where 𝑉p, 𝑉b, 𝑉a, and 𝑉d represent the pairwise, bond, angle, and dihedral interactions whose detailed
forms are specified as below.
The pairwise interaction 𝑉p is modeled by the Lennard-Jones potential
𝑉p(𝑄) =
(cid:20) (cid:16) 𝜎
𝑄
(cid:17) 12
(cid:16) 𝜎
𝑄
−
(cid:17) 6(cid:21)
4𝜀
− 4𝜀
(cid:20)(cid:16) 𝜎
𝑄𝑐
(cid:17) 12
(cid:16) 𝜎
𝑄𝑐
−
(cid:17) 6(cid:21)
, 𝑄 < 𝑄𝑐
(A.2)
0, 𝑄 ≥ 𝑄𝑐
where 𝜀 = 0.005, 𝜎 = 1.8 and 𝑄𝑐 = 10.0.
The bond potential 𝑉b is modeled by the finite extensible nonlinear elastic bond (FENE) potential
𝑉b(𝑙) = −
(cid:34)
𝑙2
0 log
1 −
𝑘 𝑠
2
(cid:35)
,
𝑙2
𝑙2
0
(A.3)
where three different bond types. Within each sub-unit, the atoms 1-2, 3-4 are connected
by type-1 bond. The atoms 2-3 are connected by type-2 bond. Finally, the sub-unit groups are
connected by type-3 bond. The detailed parameter set is given by Tab. A.1.
Type
1
2
3
𝑘 𝑠
0.4
0.64
0.32
𝑙0
1.8
1.6
1.8
Table A.1 Parameters of the FENE bond interactions.
The angle potential 𝑉a is modeled by the harmonic angle potential
𝑉a(𝜃) =
𝑘 𝑎
2
(𝜃 − 𝜃0)2 ,
(A.4)
63
Type
1
2
𝑘 𝑎
1.2
1.5
𝜃0
114.0
119.7
Table A.2 Parameters of the harmonic angle interaction.
where two different types. Within each sub-unit group, the bond angles formed by 1-2-3 and 2-3-4
are imposed by type-1 potential. The bond angles formed by atoms of different sub-unit groups (e.g.,
3-4-5, 4-5-6) are imposed by type-2 potential. The detailed parameter set is given by Tab. A.2.
The dihedral potential 𝑉d is modeled by the multiharmonic dihedral potential
𝑉d(𝜙) =
6
∑︁
𝑖=1
𝐴𝑛 cos(𝑛−1) (𝜙),
(A.5)
where two different types. Type-1 dihedral potential is imposed to dihedral angles formed by 2-3-4-5,
4-5-6-7, · · · . Type-2 dihedral potential is imposed to dihedral angles formed by 3-4-5-6, 7-8-9-10,
· · · . The detailed parameter set is given by Tab. A.3.
Type 𝐴1
1 0.0673
2 0.1602
𝐴2
1.8479
-3.9993
𝐴3
0.0079
0.2483
𝐴4
-2.2410
6.2837
𝐴5
-0.0058
0.0165
𝐴6
0.0051
-0.0146
Table A.3 Parameters of the multiharmonic dihedral interaction.
64
APPENDIX B
CONSTRUCTION OF THE FOUR-DIMENSIONAL FREE ENERGY FUNCTION
Accurate construction of the multi-dimensional free energy is a well-known non-trivial problem.
To construct the free energy function 𝑈 (m𝑞) for the four-dimensional resolved variables m𝑞
defined by (2.18), we conduct the restraint molecular dynamics simulation to sample the average
force. Specifically, for each target configuration m𝑞∗, we impose a biased quadratic potential
𝑈b𝑖𝑎𝑠 (m𝑞, m𝑞∗) by
1
2
where 𝑘1, · · · , 𝑘4 represents the magnitude of the bias potential. We choose the values such that
𝑈b𝑖𝑎𝑠 (m𝑞, m𝑞∗) =
𝑘𝑖 (cid:0)𝑞𝑖 − 𝑞∗
𝑖
(B.1)
(cid:1) 2 ,
𝑖=1
4
∑︁
the fluctuations are about 5% of target values. For the polymer molecule considered in the present
study, the effective restraint force applied to the full atom (cid:8)m𝑄 𝑗 (cid:9) 𝑁
𝑗=1 is given by
m𝐹b𝑖𝑎𝑠 (m𝑞, m𝑞∗) = −
4
∑︁
𝑖=1
𝑘𝑖 (cid:0)𝑞𝑖 − 𝑞∗
𝑖
(cid:1) ∇m𝑄 𝑗 𝑞𝑖,
(B.2)
where the gradient terms are given by
m𝑄 𝑁 − m𝑄1
𝑞1
𝛿 𝑗,𝑁 ,
𝛿 𝑗,1 +
m𝑄1 − m𝑄 𝑁
𝑞1
2 (cid:0)m𝑄 𝑗 − m𝑄𝑐(cid:1)
𝑁𝑞2
m𝑄1 − m𝑄
⌊ 𝑁
2 ⌋
,
∇m𝑄 𝑗 𝑞1 =
∇m𝑄 𝑗 𝑞2 =
∇m𝑄 𝑗 𝑞3 =
∇m𝑄 𝑗 𝑞4 =
m𝑄
𝛿 𝑗,1 +
⌈ 𝑁
2 ⌉
𝛿 𝑗,𝑁 +
m𝑄
2 ⌋ − m𝑄1
⌊ 𝑁
𝑞3
2 ⌉ − m𝑄 𝑁
⌈ 𝑁
𝑞4
𝛿 𝑗,⌊ 𝑁
2 ⌋
𝛿 𝑗,⌈ 𝑁
2 ⌉
,
𝑞3
m𝑄 𝑁 − m𝑄
𝑞4
(B.3)
,
where 𝛿𝑖, 𝑗 represents the Kronecker delta function.
The free energy 𝑈 (m𝑞) is approximated by a 4-layer fully connected neural network ˜𝑈 (m𝑞).
Each hidden layer has 160 neurons; hyperbolic tangent function is used as the activation function.
˜𝑈 (m𝑞) is trained by minimizing the empirical loss
𝐿 =
𝑁𝑠
∑︁
𝑘=1
(cid:13)
(cid:13)
(cid:13)
−∇m𝑞 (𝑘 ) ˜𝑈 (m𝑞) − m𝐹b𝑖𝑎𝑠 (m𝑞, m𝑞 (𝑘))
(cid:13)
2
(cid:13)
(cid:13)
,
(B.4)
where m𝑞 (𝑘) represents a sampled configuration.
In this work, we construct
˜𝑈 (m𝑞) using
𝑁𝑠 = 400000 sample points collected from a simulation with a production stage of 1 × 107 steps.
65
For each configuration, the number of step is between 1 × 106 and 6 × 106 such that the empirical
sampling error is less than 5% of the mean value.
To verify the accuracy of ˜𝑈 (m𝑞), we numerically evaluate the integration
𝑘 𝐵𝑇m𝐼 ≡
∫
∫
m𝑞 ⊗ ∇𝑈 (m𝑞)e−𝑈 (m𝑞)/𝑘 𝐵𝑇 dm𝑞/
e−𝑈 (m𝑞)/𝑘 𝐵𝑇 dm𝑞 ≈
1
𝑁𝑠
𝑁𝑠
∑︁
𝑘=1
m𝑞 (𝑘) ⊗ ∇ ˜𝑈 (m𝑞 (𝑘)).
(B.5)
Therefore, the difference between the numerical summation and 𝑘 𝐵𝑇m𝐼 provide a metric. For this
case, 𝑘 𝐵𝑇 = 1. The average term yields
𝑁𝑠
∑︁
1
𝑁𝑠
m𝑞 (𝑘) ⊗ ∇ ˜𝑈 (m𝑞 (𝑘)) =
1.0362 −0.0011 0.0087
which verifies that the constructed ˜𝑈 (m𝑞) is an accurate approximation of 𝑈 (m𝑞).
0.9913 −0.0020
0.0062
0.9913
0.0018
0.0008
0.0021
0.0068
0.9814
0.0098
0.0076
0.0096
0.0094
𝑘=1
,
(B.6)
66
APPENDIX C
FLUCTUATION-DISSIPATION THEOREM OF THE EXTENDED DYNAMICS
For the extended dynamics in form of Eqs. (2.5)(2.6), we can show that the embedded memory kernel
˜m𝜃 (𝑡) and fluctuation term ˜mR (𝑡) satisfy the second-fluctuation dissipation theorem. Without
loss of generality, we set the covariance of the non-Markovian features to be 𝑘 𝐵𝑇m𝐼 following the
learning method presented in Sec. 2.2.3, i.e., mΛ = m𝐼,
˜m𝐽 = m𝐽.
Proposition C.0.1. The embedded memory kernel of the extended dynamics (2.5)(2.6) takes the
form ˜m𝜃 (𝑡) = −
(cid:16)
m𝐽11𝛿(𝑡) + m𝐽12em𝐽22𝑡m𝐽21
(cid:17)
. Furthermore, by choosing the initial condition of
m𝜁 and the white noise term m𝜉 (𝑡) = mΣ (cid:164)m𝑊 𝑡 satisfying
(cid:10)m𝜁 (0)m𝜁 (0)𝑇 (cid:11) = 𝛽−1m𝐼
(cid:10)m𝜉 (𝑡)m𝜉 (𝑠)𝑇 (cid:11) = −𝛽−1(m𝐽 + m𝐽𝑇 )𝛿(𝑡 − 𝑠),
(C.1)
the embedded kernel
˜m𝜃 (𝑡) and mR (𝑡) satisfies the second fluctuation-dissipation theorem, i.e.,
(cid:10) ˜mR (𝑡)
˜mR (𝑡′)𝑇 (cid:11) = −𝛽−1 (cid:16)
˜m𝐽12e ˜m𝐽22 (𝑡−𝑡′) ˜m𝐽21 + ( ˜m𝐽11 + ˜m𝐽𝑇
11)𝛿(𝑡 − 𝑡′)
(cid:17)
.
(C.2)
Proof. With mΛ = m𝐼 and ˜m𝐽 = m𝐽, we can take the integration of m𝜁 (𝑡) in Eq. (2.5), yielding
m𝜁 (𝑡) =
∫ 𝑡
0
em𝐽22 (𝑡−𝑠)m𝐽21m𝑣(𝑠)d𝑠 +
∫ 𝑡
0
em𝐽22 (𝑡−𝑠)m𝜉2(𝑠)d𝑠 + em𝐽22𝑡m𝜁 (0).
(C.3)
Plugging m𝜁 (𝑡) into the dynamic equation of m𝑣 gives
m𝑀 (cid:164)m𝑣 = − ∇𝑈 (m𝑞) + m𝐽11m𝑣 +
m𝐽12em𝐽22 (𝑡−𝑠)m𝐽21m𝑣(𝑠)d𝑡
∫ 𝑡
+
+ m𝜉1(𝑡)
(cid:32)(cid:32)
(cid:32)(cid:32)
(cid:125)
(cid:123)(cid:122)
(cid:124)
˜R1 (𝑡)
∫ 𝑡
0
m𝐽12em𝐽22 (𝑡−𝑠)m𝜉2(𝑠)d𝑠
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
0
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:124)
(cid:123)(cid:122)
˜R2 (𝑡)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
+ m𝐽12em𝐽22𝑡m𝜁 (0)
(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)
(cid:123)(cid:122)
(cid:125)
˜R3 (𝑡)
(cid:124)
.
(C.4)
(cid:125)
67
We check the covariance matrices of the noise terms, i.e.,
(cid:10) ˜R1(𝑡) ˜R1(𝑡′)𝑇 (cid:11) = −𝛽−1(m𝐽11 + m𝐽𝑇
∫ 𝑡
∫ 𝑡′
(cid:10) ˜R2(𝑡) ˜R2(𝑡′)𝑇 (cid:11) =
11)𝛿(𝑡 − 𝑡′),
m𝐽12em𝐽22 (𝑡−𝑠) (cid:10)m𝜉2(𝑠)m𝜉2(𝑠′)𝑇 (cid:11) em𝐽𝑇
∫ 𝑡′
22
m𝐽12em𝐽22 (𝑡−𝑠) (m𝐽22 + m𝐽𝑇
0
m𝐽12em𝐽22 (𝑡−𝑠′) (m𝐽22 + m𝐽𝑇
0
= −𝛽−1
= −𝛽−1
0
∫ 𝑡
0
∫ 𝑡′
0
22)mem𝐽𝑇
22
(𝑡′−𝑠′)m𝐽𝑇
12d𝑠′,
(𝑡′−𝑠′)m𝐽𝑇
12d𝑠d𝑠′
22)𝛿(𝑠 − 𝑠′)m𝐽𝑇
21em𝐽𝑇
22
(𝑡′−𝑠′)m𝐽𝑇
12d𝑠d𝑠′
= −𝛽−1m𝐽12em𝐽22𝑡+m𝐽𝑇
(cid:10) ˜R3(𝑡) ˜R3(𝑡′)𝑇 (cid:11) = m𝐽12em𝐽22𝑡 (cid:10)m𝜁 (0)m𝜁 (0)𝑇 (cid:11) em𝐽𝑇
= 𝛽−1m𝐽12em𝐽22𝑡em𝐽𝑇
m𝐽𝑇
𝑡′
𝑡′
22
22
.
m𝐽𝑇
12
12 + 𝛽−1m𝐽12em𝐽22 (𝑡−𝑡′)m𝐽𝑇
m𝐽𝑇
12
𝑡′
22
12
, ∀𝑡′ ≤ 𝑡
Moreover, for 𝑡 > 𝑡′, all the cross terms vanish except (cid:10) ˜R2(𝑡) ˜R1(𝑡′)𝑇 (cid:11), i.e.,
(cid:10) ˜R2(𝑡) ˜R1(𝑡′)𝑇 (cid:11) =
∫ 𝑡
0
m𝐽12em𝐽22 (𝑡−𝑠) ⟨m𝜉2(𝑠)m𝜉1(𝑡′)⟩ d𝑠
∫ 𝑡
m𝐽12em𝐽22 (𝑡−𝑠) (m𝐽21 + m𝐽𝑇
12)𝛿(𝑡′ − 𝑠)d𝑠
= −𝛽−1
0
= −𝛽−1m𝐽12em𝐽22 (𝑡−𝑡′) (m𝐽21 + m𝐽𝑇
12).
Combining Eq. (C.5) and Eq. (C.6), we have
(cid:10) ˜R (𝑡) ˜R (𝑡′)𝑇 (cid:11) = 𝛽−1m𝐽12em𝐽22 (𝑡−𝑡′)m𝐽𝑇
12 − 𝛽−1m𝐽12em𝐽22 (𝑡−𝑡′) (m𝐽21 + m𝐽𝑇
12)
− 𝛽−1(m𝐽11 + m𝐽𝑇
11)𝛿(𝑡 − 𝑡′)
m𝐽12em𝐽22 (𝑡−𝑡′)m𝐽21 + (m𝐽11 + m𝐽𝑇
= −𝛽−1 (cid:16)
11)𝛿(𝑡 − 𝑡′)
(cid:17)
.
(C.5)
(C.6)
(C.7)
□
As a special case, by imposing the restraint specified by Eq. (2.16) such that m𝐽11 + m𝐽𝑇
11 = 0
and m𝐽12 = −m𝐽𝑇
21, the memory kernel
˜m𝜃 (𝑡) recovers −m𝐽12em𝐽22𝑡m𝐽𝑇
12 without the Markovian
part, and the second fluctuation-dissipation theorem recovers the standard form, i.e.,
(cid:10) ˜mR (𝑡)
˜mR (0)𝑇 (cid:11) = 𝛽−1 ˜m𝜃 (𝑡).
(C.8)
68
APPENDIX D
INVARIANT PROBABILITY DENSITY FUNCTION
Proposition D.0.1. By choosing the white noise following Eq. (C.1), the reduced model (2.5)(2.6)
retains the invariant density function
∫
𝜌e𝑞 (m𝑞, m𝑝, m𝜁) = exp [−𝛽𝑊 (m𝑞, m𝑝, m𝜁)] /
exp [−𝛽𝑊 (m𝑞, m𝑝, m𝜁)] dm𝑞dm𝑝dm𝜁 .
(D.1)
Proof. By Eq. (C.1), the covariance of the white noise of the full extended system is given by
m𝐺 + m𝐺𝑇 = d𝑖𝑎𝑔(0, mΣmΣ𝑇 ). Accordingly, the Fokker-Plank equation follows
𝜕 𝜌(m𝑧, 𝑡)
𝜕𝑡
(cid:18)
= ∇ ·
−m𝐺∇𝑊 (m𝑧) 𝜌(m𝑧, 𝑡) −
𝛽−1(m𝐺 + m𝐺𝑇 )∇𝜌(m𝑧, 𝑡)
(cid:19)
,
(D.2)
1
2
where 𝜌(m𝑧, 𝑡) represents the probability density function of the extended variables m𝑧 =
[m𝑞; m𝑝; m𝜁]. For 𝜌e𝑞 (m𝑞, m𝑝, m𝜁) ∝ exp [−𝛽𝑊 (m𝑞, m𝑝, m𝜁)], the RHS follows
(cid:18)
∇ ·
𝛽−1m𝐺∇𝜌e𝑞 (m𝑧, 𝑡) −
1
2
𝛽−1(m𝐺 + m𝐺𝑇 )∇𝜌e𝑞 (m𝑧, 𝑡)
(cid:19)
= 𝛽−1∇ ·
(cid:16)
m𝐺 𝐴∇𝜌e𝑞 (m𝑧, 𝑡)
(cid:17)
where the last identity holds because m𝐺 𝐴 is anti-symmetric.
≡ 0,
(D.3)
□
69
APPENDIX E
MEMORY KERNEL OF THE POLYMER MOLECULE SYSTEMS
Fig. E.1 shows the embedded matrix-valued kernels m𝜃 (𝑡) of the full MD and the 4D reduced
models of the polymer molecule system. Similar to the kernel in the Laplace space mΘ(𝜆) shown in
Fig. 2.6, the good agreement between the full MD and the reduced models verifies that the reduced
model can accurately retain the non-Markovian dynamics of the resolved variables.
(a)
(c)
(b)
(d)
Figure E.1 (a–d) Components of the embedded matrix-valued kernel m𝜃 (𝑡) obtained from the full
MD and the four-dimensional reduced model of a polymer molecule system She et al. (2023).
70
10−210−1100101102t−100102030[θ(t)]11MD2ndjointlearning4thjointlearning10−210−1100101102t−7.50.07.515.022.530.0[θ(t)]22MD2ndjointlearning4thjointlearning10−210−1100101102t−8−404[θ(t)]12MD2ndjointlearning4thjointlearning10−210−1100101102t−7−4−12[θ(t)]23MD2ndjointlearning4thjointlearning