MACHINE LEARNING-BASED STOCHASTIC REDUCED MODELING OF GLE AND STATE-DEPENDENT GLE

By Zhiyuan She

A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering—Doctor of Philosophy

2025

ABSTRACT

Predictive modeling of high-dimensional dynamical systems remains a central challenge in numerous scientific fields, including biology, materials science, and fluid mechanics. When clear scale separation is lacking, a reduced model must accurately capture the pronounced memory effects arising from unresolved variables, making non-Markovian modeling essential. In this thesis, we develop and analyze data-driven methods for constructing generalized Langevin equations (GLEs) and extended stochastic differential equations that faithfully encode non-Markovian behaviors. Building on the Mori–Zwanzig formalism, we first propose an approach to learn a set of non-Markovian features—auxiliary variables that incorporate the history of the resolved coordinates—so that the effective dynamics inherits long-time correlations. By matching the evolution of correlation functions in the extended variable space, our method systematically approximates the multi-dimensional GLE without requiring direct estimates of complicated memory kernels. We show that this approach yields stable, high-fidelity reduced models for molecular systems, enabling significantly lower-dimensional simulations that nonetheless reproduce key statistical and dynamical properties of the original system. We then extend this framework to incorporate state-dependent memory kernels, facilitating enhanced sampling across diverse regions of phase space. We demonstrate that constructing heterogeneous memory kernels—reflecting the local variations in unresolved degrees of freedom—improves the model’s accuracy and robustness, especially in systems exhibiting multiple metastable states. Through both numerical experiments and theoretical analysis, we highlight how these data-driven non-Markovian models outperform traditional Markovian or fixed-memory approaches. To address complex, multi-modal distributions in high-dimensional data, we then modify the latent variable of a KRNet normalizing-flow architecture from a single Gaussian to a mixture-of-Gaussians (MoG). This richer latent representation not only improves the model’s expressiveness and training stability but also facilitates discovering collective variables (CVs), as the multi-modal latent space reveals distinct modes corresponding to relevant metastable states or slow degrees of freedom. Through both numerical experiments and theoretical analysis, we show that integrating a MoG prior into KRNet yields superior density estimation, enhanced sampling of metastable basins, and a more interpretable set of learned CVs. Altogether, this thesis provides a comprehensive methodology for deriving scalable, memory-embedded reduced dynamics augmented by advanced latent representations. Such models open new possibilities for multi-scale simulations by merging fine-grained molecular fidelity with tractable coarse-grained representations, all while systematically leveraging the benefits of multi-modal latent spaces to identify key low-dimensional features. Our results underscore the practical advantages of incorporating non-Markovian features and a mixture-based flow model in capturing the full complexity of real-world molecular and dynamical systems.
Copyright by ZHIYUAN SHE 2025

ACKNOWLEDGEMENTS

I extend my heartfelt gratitude to my advisor, Professor Huan Lei, whose unwavering support, insightful guidance, and boundless patience have made this work possible. His passion for research and dedication to academic excellence have continually inspired me to broaden my intellectual horizons and to push myself beyond my comfort zone. I am truly grateful for the countless hours he devoted to reading my drafts, sharing thoughtful feedback, and challenging me to become a better researcher and problem-solver.

I also wish to acknowledge the encouragement and camaraderie of my groupmates, Pei Ge and Liyao Lyu. Their collaborative spirit, readiness to exchange ideas, and commitment to building a supportive research environment greatly enriched my work. Our thought-provoking discussions often revealed new perspectives and solutions that shaped this thesis for the better. I deeply appreciate their help both inside and outside the lab, from meticulous code reviews to the friendly conversations that made my time in graduate school more rewarding.

I am especially thankful to my wife, Xia Li, whose unwavering love and understanding have been an anchor throughout this demanding journey. Her belief in my abilities, willingness to celebrate even the smallest milestones, and compassionate reassurance during times of doubt provided me with the emotional resilience to persevere. I cannot overstate how much her steadfast support has meant to me—her presence has been the guiding light that helped me navigate this path.

To each of you—my advisor, colleagues, and family—your contributions, generosity, and devotion to this endeavor remain deeply appreciated. Your collective influence will continue to shape my academic and personal development for years to come.

TABLE OF CONTENTS

CHAPTER 1 OVERVIEW
CHAPTER 2 DATA-DRIVEN CONSTRUCTION OF STOCHASTIC REDUCED DYNAMICS ENCODED WITH NON-MARKOVIAN FEATURES
    2.1 Introduction
    2.2 Methods
    2.3 Numerical results
    2.4 Summary
CHAPTER 3 ENHANCED SAMPLING DATA-DRIVEN CONSTRUCTION OF STOCHASTIC REDUCED DYNAMICS ENCODED WITH STATE-DEPENDENT MEMORY
    3.1 Introduction
    3.2 Methods
    3.3 Numerical results
    3.4 Summary
CHAPTER 4 GENERATIVE MODEL BASED IDENTIFYING METASTABLE STATES IN FULL MOLECULE SPACE
    4.1 Introduction
    4.2 Method
    4.3 Numerical Result
    4.4 Summary
CHAPTER 5 CONCLUSION
BIBLIOGRAPHY
APPENDIX A MICROSCALE MODEL OF THE POLYMER MOLECULE
APPENDIX B CONSTRUCTION OF THE FOUR-DIMENSIONAL FREE ENERGY FUNCTION
APPENDIX C FLUCTUATION-DISSIPATION THEOREM OF THE EXTENDED DYNAMICS
APPENDIX D INVARIANT PROBABILITY DENSITY FUNCTION
APPENDIX E MEMORY KERNEL OF THE POLYMER MOLECULE SYSTEMS

CHAPTER 1
OVERVIEW

The accurate modeling of high-dimensional dynamical systems remains a cornerstone challenge across a range of scientific disciplines, from soft matter and biophysics to fluid mechanics and climate science. Modern simulations of such systems often demand immense computational resources due to the intricate interactions and multi-scale nature of the underlying processes. Although fully resolved models, which include all microscopic variables and fine-grained details, can in principle capture the relevant physics, they frequently prove infeasible in practice because of prohibitive computational cost. As a result, significant efforts in coarse-grained modeling aim to reduce dimensionality and complexity while preserving the essential statistics and long-time behaviors of the original high-dimensional problem.

A central insight in coarse-graining is that purely Markovian models—those assuming instantaneous and memoryless evolution—often fail to reproduce the observed time correlations and transport properties in real systems. These discrepancies become particularly pronounced in situations lacking clear time-scale separations, wherein so-called “fast” or unresolved degrees of freedom exert non-negligible influences over extended time horizons. Consequently, one must explicitly retain memory terms to produce physically accurate reduced dynamics. In theoretical treatments, the Mori–Zwanzig (MZ) formalism provides a foundation for describing the projected dynamics of a lower-dimensional set of variables, augmented with a time-dependent memory kernel and stochastic fluctuation terms. However, direct numerical implementation of these ideas is seldom straightforward because the memory kernel typically lacks a closed-form expression and may require large volumes of data to estimate reliably.

In response, a growing body of literature has turned to data-driven approaches that learn reduced equations or effective models directly from simulation trajectories or experimental data. Such methods circumvent the direct computation of memory kernels by leveraging machine learning techniques to discover the underlying structure of the system. This dissertation focuses on integrating and extending three distinct strategies that address complementary aspects of non-Markovian modeling for high-dimensional dynamical systems:

1. Non-Markovian Feature Learning. Our first strategy proposes a learning framework that sidesteps the need for explicit memory-kernel estimation by introducing a set of auxiliary variables. These additional variables encapsulate the “history” of the coarse-grained coordinates, effectively transforming what would otherwise be a non-Markovian model into a higher-dimensional extended Markovian system. By matching correlation functions between the full model and the extended reduced model, we ensure that critical temporal dependencies are accurately preserved.
This correlation-matching step is instrumental: it encodes the slow relaxations and recurrent configurations that are crucial for capturing long-time dynamics. Numerical experiments on molecular systems demonstrate that non-Markovian feature learning can yield reduced-order simulations with excellent fidelity to the reference trajectories, all while maintaining moderate computational cost.

2. State-Dependent Memory Kernels. While the first approach offers a single global mechanism to embed memory effects, many real systems exhibit state-dependent memory. For instance, macromolecular or fluid systems may have multiple metastable basins, each with distinct relaxation times or energetic barriers. In such cases, it is insufficient to assume a uniform memory kernel throughout the entire state space. Our second strategy addresses this limitation by introducing heterogeneous memory kernels that adapt to the local environment of the resolved variables. Rather than fitting a single kernel function, we allow the memory to vary based on the instantaneous configuration or thermodynamic state. This added flexibility is particularly advantageous for systems with complex free-energy landscapes, as it enables more accurate modeling of basin-to-basin transitions, barrier crossing, and other processes sensitive to local unresolved dynamics. By learning these heterogeneous kernels from data, we capture nuanced variations in the memory structure, significantly improving sampling efficiency and predictive performance in multi-basin or multi-phase scenarios.

3. KRNet with a Mixture-of-Gaussians Latent Representation. Although robust memory modeling is critical for accurate dynamics, effectively capturing distributional complexity in high-dimensional systems poses an additional challenge. Many normalizing-flow methods, which learn invertible transformations from simple latent spaces to complex data distributions, rely on a single Gaussian prior for the latent variables. Such a unimodal assumption can limit the expressivity of the model, particularly when the target distribution is multi-modal or exhibits heavy tails. Here, we introduce an advanced KRNet architecture that employs a mixture-of-Gaussians (MoG) as the latent prior. By allowing the latent space to be multi-modal, KRNet gains greater flexibility in approximating intricate molecular or continuum distributions. Moreover, analyzing individual mixture components provides insights into physically meaningful collective variables (CVs), such as reaction coordinates or slow degrees of freedom that govern the system’s long-time behavior. This MoG-based design not only improves density-estimation accuracy but also enhances interpretability, offering an additional avenue for understanding how different metastable states or configurations map to the underlying latent structure.

Taken as a whole, these three complementary approaches form a cohesive framework for memory-aware, distribution-aware coarse-graining. In the first two, we focus on correctly capturing time correlations and historical dependence—the hallmark of non-Markovian dynamics. In the third, we emphasize handling complex data distributions and uncovering low-dimensional representations. Implemented together, they enable practitioners to develop robust reduced-order models that do not sacrifice crucial multi-scale or multi-modal characteristics of the original system.
Beyond theoretical importance, these techniques offer practical advantages: multi-scale simulations become more feasible as the reduced models require fewer degrees of freedom to achieve a similar predictive capability, potentially reducing wall-clock times while maintaining essential physical fidelity. Moreover, our approaches naturally expose slow modes and transitions that might otherwise remain obscured in fully detailed simulation data, thereby aiding in physical interpretation. Researchers can identify meaningful collective variables or design specialized sampling protocols targeting critical regions of phase space (e.g., near transition states or interfaces).

Ultimately, the methods presented here reflect a broader trend in computational science: as machine learning and high-performance computing continue to advance, new opportunities emerge for data-driven modeling of complex phenomena. These advances permit us to go beyond naively discarding unresolved scales, instead systematically incorporating their effects through memory terms, latent-variable modeling, or both. The chapters that follow detail each of these methods, their theoretical underpinnings, and the empirical studies that demonstrate their utility. Collectively, they underline the feasibility of incorporating memory effects and multi-modal representations in next-generation coarse-grained simulation frameworks, bridging the gap between brute-force full-resolution models and simpler—but often inaccurate—Markovian approximations.

In summary, the remainder of this dissertation proceeds as follows:

• We begin by examining non-Markovian feature learning and explain how auxiliary variables, grounded in correlation-function matching, facilitate extended Markovian representations of intrinsically non-Markovian processes.

• Next, we tackle state-dependent memory kernels as a direct way to incorporate local environmental or configurational effects, significantly improving the realism of reduced models in multi-basin systems.

• Finally, we present KRNet with a mixture-of-Gaussians (MoG) latent space, illustrating how it enhances distribution modeling and offers a pathway to derive physically interpretable collective variables. We highlight its synergy with memory-based approaches, demonstrating an integrated methodology for advanced coarse-grained simulations.

Through these contributions, this dissertation seeks to illustrate the power and flexibility of data-driven, memory-embedded modeling. By capturing temporal dependencies and multi-modal structures simultaneously, researchers can generate high-fidelity reduced-order simulations capable of exploring complex energy landscapes, long-time dynamical evolution, and rarely visited metastable states. The approaches and results stand to benefit a wide array of fields, ranging from molecular biophysics and materials science to geophysics and fluid mechanics, all of which confront the challenges posed by limited computational budgets and intrinsically non-Markovian dynamics.

CHAPTER 2
DATA-DRIVEN CONSTRUCTION OF STOCHASTIC REDUCED DYNAMICS ENCODED WITH NON-MARKOVIAN FEATURES

2.1 Introduction

Predictive modeling of multi-scale dynamic systems is a long-standing problem in many fields such as biology, materials science, and fluid physics. One essential challenge arises from the high dimensionality; numerical simulations of the full models often show limitations in the achievable spatio-temporal scales.
Alternatively, reduced models in terms of a set of resolved variables are often used to probe the evolution on the scale of interest. However, the construction of reliable reduced models remains a highly non-trivial problem. In particular, for systems without a clear scale separation, the reduced dynamics often exhibits non-Markovian memory effects, whose analytic form is generally unknown. To close the reduced dynamics, existing methods are primarily based on the following two approaches. The first approach seeks various numerical approximations of the memory term by projecting the full dynamics onto the resolved variables based on frameworks such as the Mori-Zwanzig formalism Mori (1965b); Zwanzig (1973) or canonical models such as the generalized Langevin equation (GLE) Zwanzig (2001). Examples include the t-model approximation Chorin et al. (2002), the Galerkin discretization Darve et al. (2009a), the regularized integral equation discretization Lange and Grubmüller (2006), the hierarchical construction Chen et al. (2014); Stinis (2015); Zhu and Venturi (2018); Hudson and Li (2020); Price et al. (2021), and so on. Recent studies Ma et al. (2018); Vlachas et al. (2018); Harlim et al. (2020); Wang et al. (2020a) based on recurrent neural networks Hochreiter and Schmidhuber (1997) provide a promising approach to learn the memory term of deterministic dynamics. Yet, for ergodic dynamics, how to impose a coherent noise term compensating for the unresolved variables remains open. The second approach parameterizes the memory term by a certain ansatz, e.g., the fictitious particle Davtyan et al. (2015a), continued fraction Wall (1948); Mori (1965a), or rational function Corless and Frazho (2003), such that the memory and noise terms can be embedded in an extended Markovian dynamics Mori (1965a); Ceriotti et al. (2009); Baczewski and Bond (2013); Davtyan et al. (2015a); Lei et al. (2016a); Jung et al. (2017a); Lee et al. (2019a); Russo et al. (2019); Ma et al. (2019); Grogan et al. (2020). In addition, non-Markovian models are represented by discrete dynamics with exogenous inputs in the form of NARMAX (nonlinear autoregression moving average with exogenous input) Chorin and Lu (2015); Lin and Lu (2021) and SINN (statistics-informed neural network) Zhu et al. (2022), parameterized for each specific time step. Recent work by Vroylandt et al. (2022) presents an expectation-maximization method to parameterize the reduced model from the full-model trajectories. Refs. Daldrop et al. (2017); Kowalik et al. (2019) present an efficient approach based on analyzing the force correlation function to extract the memory function for the reduced dynamics of aqueous molecules under a quadratic confinement potential; see also the recent review Klippenstein et al. (2021) for further discussion. Despite the overall success, most studies focus on cases with a scalar memory function. Notably, the reduced model of a two-dimensional GLE is constructed in Ref. Lee et al. (2019a). To the best of our knowledge, the systematic construction of stochastic reduced dynamics of multi-dimensional resolved variables remains under-explored. Ideally, to obtain a reliable reduced model, the construction needs to accurately retain the non-Markovian features, enable certain modeling flexibility (e.g., the dimensionality of the resolved variables) and adaptivity (e.g., the order of approximation), and guarantee numerical stability and robustness.
In a recent study, we developed a Petrov-Galerkin approach Lei and Li (2021) to construct the non-Markovian reduced dynamics by projecting the full dynamics into a subspace spanned by a set of projection bases in the form of fractional derivatives of the resolved variables. The obtained reduced model is parameterized as extended stochastic differential equations by introducing a set of test bases. Different from most existing approaches, the construction does not rely on the direct fitting of the memory function. Non-local statistical properties can be naturally matched by choosing the appropriate bases, and the model accuracy can be systematically improved by introducing more basis functions to expand the projection subspace. Despite these appealing properties, the construction relies on heuristic choices of the projection and test bases. Given the target number of bases, how to choose the optimal basis functions for the best representation of the non-Markovian dynamics remains an open problem. Furthermore, the numerical stability needs to be treated empirically. These issues limit the applications in complex systems with multi-dimensional resolved variables.

In this work, we aim to address the above issues by developing a new data-driven approach to construct the stochastic reduced dynamics of multi-dimensional resolved variables. The method is based on the joint learning of a set of non-Markovian features and the extended dynamic equation in terms of both the resolved variables and these features. Unlike the empirically chosen projection bases adopted in the previous work Lei and Li (2021), the non-Markovian features take an interpretable form that encodes the history of the resolved variables, and are learned along with the extended Markovian dynamics such that they are optimal for the reduced model representation. In this sense, they represent the optimal subspace that embodies the non-Markovian nature of the resolved variables. The learning process enables adaptive choices of the number of features and is easy to implement by matching the evolution of the correlation functions of the extended variables. In particular, the explicit form of the encoder function enables us to obtain the correlation functions of these features directly from the ones of the resolved variables rather than from the time-series samples. The constructed model automatically ensures numerical stability, strictly satisfies the second fluctuation-dissipation theorem Kubo (1966), and retains the consistent invariant distribution Español (2004); Noid et al. (2008). We demonstrate the method by modeling the dynamics of a tagged particle immersed in solvents and of a polymer molecule. With the same number of features (or equivalently, projection bases), the present method yields more accurate reduced models than the previous methods Lei et al. (2016a); Lei and Li (2021) due to the concurrent learning of the non-Markovian features. More importantly, reduced models with respect to multi-dimensional resolved variables can be conveniently constructed without the cumbersome efforts of matrix-valued kernel fitting and stabilization treatment. This is well-suited for model reduction in practical applications, where the constructed reduced models often need to retain the non-local correlations among the resolved variables.
It provides a convenient approach to construct meso-scale models encoded with molecular-level fidelity and paves the way towards constructing reliable continuum-level transport model equations Lei et al. (2020); Fang et al. (2022). Finally, it is worthwhile to mention that the present study focuses on the model reduction of ergodic dynamic systems where the full or part of the resolved variables are specified as known quantities that either retain a clear physical interpretation (e.g., the tagged particle position) or are experimentally accessible (e.g., the polymer end-to-end distance, the radius of gyration). Another relevant direction focuses on learning the slow or Markovian dynamics from complex dynamic systems where the resolved variables are unknown a priori; we refer to Refs. Rohrdanz et al. (2011); Pérez-Hernández et al. (2013); Li and Ma (2014); Krivov (2013); Lu and Vanden-Eijnden (2014); Bittracher et al. (2018) on learning resolved variables that retain the Markovianity, Refs. Coifman et al. (2008); Chiavazzo et al. (2017); Crosskey and Maggioni (2017); Ye et al. (2021); Feng et al. (2022); Zieliński and Hesthaven (2022) on learning the slow dynamics on a non-linear manifold, and Refs. Giannakis (2019); Klus et al. (2018); Dibak et al. (2018); Klus et al. (2020) on model reduction of the transfer operator.

2.2 Methods

2.2.1 Problem Setup

Let us consider the full model as a Hamiltonian system represented by a 6N-dimensional phase vector Z = [Q; P], where Q and P are the position and momentum vectors, respectively. The equation of motion follows
\[
\dot{\mathbf{Z}} = \mathbf{S}\nabla H(\mathbf{Z}), \tag{2.1}
\]
where
\[
\mathbf{S} = \begin{pmatrix} \mathbf{0} & \mathbf{I} \\ -\mathbf{I} & \mathbf{0} \end{pmatrix}
\]
is the symplectic matrix, H(Z) is the Hamiltonian function, and the initial condition is given by Z(0) = Z_0. For high-dimensional systems with N ≫ 1, the numerical simulation of Eq. (2.1) can be computationally expensive. It is often desirable to construct a reduced model with respect to a set of low-dimensional resolved variables z(t) := φ(Z(t)), where φ: R^{6N} → R^m represents the mapping from the full to the coarse-grained state space with m ≪ N.

With the explicit form of H(Z) and φ(Z), the evolution of z(t) can be mapped from the initial values via the Koopman operator Koopman (1931), i.e., z(t) = e^{tL} z(0), where the Liouville operator Lφ(Z) = −((∇H(Z_0))^T S ∇_{Z_0}) φ(Z) depends on the full-dimensional phase vector Z. Using the Duhamel–Dyson formula, the evolution of z(t) can be further formulated in terms of z based on the Mori-Zwanzig (MZ) projection formalism Mori (1965b); Zwanzig (1973). However, the numerical evaluation of the derived model relies on solving the full-dimensional orthogonal dynamics Chorin et al. (2002), which can still be computationally expensive. In practice, the resolved variables are often defined by the position vector Q. The MZ-formed reduced dynamics is often simplified into the GLE, i.e.,
\[
\dot{\mathbf{q}} = \mathbf{M}^{-1}\mathbf{p}, \qquad
\dot{\mathbf{p}} = -\nabla U(\mathbf{q}) - \int_0^t \boldsymbol{\theta}(t-\tau)\,\dot{\mathbf{q}}(\tau)\,\mathrm{d}\tau + \mathbf{R}(t), \tag{2.2}
\]
where q ∈ R^m is the vector of so-called collective variables, M is the mass matrix, U(q) is the free energy function, θ(t): R^+ → R^{m×m} is a matrix-valued function representing the memory kernel, and R(t) is a stationary colored noise related to θ(t) through the second fluctuation-dissipation condition Kubo (1966), i.e., ⟨R(t)R(0)^T⟩ = k_B T θ(t). It is worth mentioning that Eq. (2.2) is not exact based on the MZ formalism.
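As a concrete aside, the memory term in Eq. (2.2) is mechanically simple to discretize once a kernel is available; its cost, which grows with the stored history, is part of what motivates the construction below. A minimal sketch under stated assumptions (uniform time grid, a placeholder scalar kernel `theta`; all names are illustrative, not the dissertation's code):

```python
import numpy as np

# Minimal sketch: the memory force of Eq. (2.2) evaluated by direct
# quadrature over a stored velocity history.  `theta` is a placeholder
# scalar kernel; in the matrix-valued case theta(t) is m-by-m and the
# history entries are length-m vectors.
def memory_force(theta, v_hist, dt):
    """Approximate -int_0^t theta(t - tau) qdot(tau) dtau on a uniform grid.

    v_hist[k] holds qdot(k * dt) for k = 0, ..., n, so that t = n * dt.
    """
    n = len(v_hist) - 1
    lags = dt * np.arange(n, -1, -1)      # values of t - tau for tau = 0..t
    return -np.trapz(theta(lags) * np.asarray(v_hist), dx=dt)
```

Each step requires a sum over the entire history, which is precisely the overhead that a Markovian embedding avoids.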
In particular, the memory function generally depends on the resolved variables z, and the noise term could be non-Gaussian; we refer to Ref. Ayaz et al. (2022) for further discussion. Nevertheless, even for the simplified GLE form (2.2), the accurate construction of the reduced model can remain highly non-trivial. Specifically, the numerical simulation requires explicit knowledge of both the free energy U(q) and the memory function θ(t). Several methods based on importance sampling Kumar et al. (1992); Darve and Pohorille (2001); Laio and Parrinello (2002a) and temperature elevation Rosso et al. (2002); Maragliano and Vanden-Eijnden (2006, 2008) have been developed to construct the multi-dimensional free energy function. In real applications, the main challenge often lies in the treatment of the memory kernel θ(t). In particular, for multi-dimensional collective variables q, the efficient construction of a numerically stable matrix-valued memory function remains under-explored.

In this study, we develop an alternative approach to learn the reduced model. Rather than directly constructing the memory function θ(t) in Eq. (2.2), we seek a set of non-Markovian features from the full model, denoted by {ζ_i}_{i=1}^n, and establish a joint learning of the reduced Markovian dynamics in terms of both the resolved variables and these features, i.e.,
\[
\mathrm{d}\tilde{\mathbf{z}} = \mathbf{g}(\tilde{\mathbf{z}})\,\mathrm{d}t + \boldsymbol{\Sigma}\,\mathrm{d}\mathbf{W}_t, \tag{2.3}
\]
where z̃ := [q; p; ζ_1; ⋯; ζ_n] represents the extended variables and W_t represents the standard Wiener process. In principle, any such extended system would generally lead to a non-Markovian dynamics for the resolved variables z = [q; p]. However, the essential challenge is to determine {ζ_i}_{i=1}^n so that the non-local statistical properties of z can be preserved with sufficient accuracy. Also, the form of g(·) and Σ will need to be properly designed such that the reduced model retains the consistent thermal fluctuations and density distribution. In particular, the introduction of auxiliary variables can be loosely considered as approximating the full-dimensional Koopman operator in a subspace. However, different from Ref. Lei and Li (2021), the features {ζ_i}_{i=1}^n are not empirically chosen projection bases; instead, they are learned along with the model equation (2.3) for the best approximation of the non-local statistics. This essential difference enables the present method to generate more accurate reduced models and to be easy to implement for multi-dimensional resolved variables without empirical treatment for numerical stability.

2.2.2 Non-Markovian features and the extended dynamics

To illustrate the essential idea, let us consider a solute molecule immersed in solvent particles. To construct a reduced model (2.3) for the solute molecule, a main question is how to construct the auxiliary variables ζ := [ζ_1; ζ_2; ⋯; ζ_n]. Intuitively, ζ_i should depend on the full-dimensional vector Z such that their evolution encodes certain unresolved dynamics orthogonal to the subspace spanned by z(t). A straightforward approach is to represent ζ(t) as a function of Z(t), i.e., ζ = h(Z). However, the direct construction of the general formulation h(Z) would become impractical since the learning generally involves sampling the individual solvent particles near the solute molecule; the computational cost could become intractable due to the high dimensionality of Z.
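Before describing the construction, it is worth recalling why an extended system of the form (2.3) can carry memory at all. In the textbook special case of a scalar GLE with an exponential kernel θ(t) = (γ/τ)e^{−t/τ}, augmenting (q, p) with a single Ornstein–Uhlenbeck variable reproduces the GLE exactly, including a colored noise obeying ⟨R(t)R(0)⟩ = k_BT θ(t). A minimal sketch of this classical embedding (all parameters and the harmonic potential are illustrative placeholders, not the learned model of this chapter):

```python
import numpy as np

# Sketch: Markovian embedding of a scalar GLE with exponential kernel
# theta(t) = (gamma / tau) * exp(-t / tau).  The auxiliary variable zeta
# carries the memory force plus an FDT-consistent colored noise.
# U(q) = q^2 / 2 and all parameters below are illustrative.
rng = np.random.default_rng(0)
m, kBT, gamma, tau = 1.0, 1.0, 2.0, 0.5
dt, n_steps = 1e-3, 100000
sigma = np.sqrt(2.0 * kBT * gamma) / tau   # sets <R(t)R(0)> = kBT * theta(t)

q, p, zeta = 0.0, 0.0, 0.0
for _ in range(n_steps):
    v = p / m
    # zeta accumulates -int_0^t (gamma/tau) exp(-(t-s)/tau) v(s) ds plus an
    # Ornstein-Uhlenbeck fluctuation, so p feels memory force + colored noise.
    zeta += (-zeta / tau - (gamma / tau) * v) * dt \
            + sigma * np.sqrt(dt) * rng.standard_normal()
    p += (-q + zeta) * dt                  # -grad U(q) + embedded GLE force
    q += (p / m) * dt
```

The learning problem of this chapter can be viewed as discovering such auxiliary variables, and their coupling, directly from data rather than assuming a kernel form.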
To circumvent the above challenges, the key is to formulate ζ(t) such that it properly encodes the unresolved dynamics of Z(t) and, meanwhile, can be easily sampled. One important observation is that the history of p(t) naturally encodes the unresolved dynamics orthogonal to z(t), and its values can be conveniently obtained. To see this, we note that the dynamics follows ṗ = Lp, where the Liouville operator Lφ(Z) = −((∇H(Z_0))^T S ∇_{Z_0}) φ(Z) depends on the full-dimensional vector Z. Therefore, it is possible to construct ζ(t) by some encoder functions in terms of the time history of p(t), i.e., p(τ) with τ ≤ t, such that they retain certain components orthogonal to z(t). This is somewhat similar to the study Lei et al. (2020) on modeling the non-local constitutive dynamics of non-Newtonian fluids by learning a set of features from the polymer configuration space. The main difference is that the present features ζ are non-Markovian in the temporal space. Accordingly, we define a set of non-Markovian features {ζ_i}_{i=1}^n by
\[
\boldsymbol{\zeta}_i(t) = \int_0^{+\infty} \boldsymbol{\omega}_i(\tau)\,\mathbf{v}(t-\tau)\,\mathrm{d}\tau
\approx \sum_{j=1}^{N_w} \mathbf{w}_i(j\delta t)\,\mathbf{v}(t - j\delta t), \tag{2.4}
\]
where v := M^{-1}p represents the generalized velocity and ω_i: R^+ → R^{m×m} is the encoder function, represented by N_w discrete weights {w_i(jδt)}_{j=1}^{N_w} whose values will be determined later. ζ_i(t) can be loosely viewed as a generalized convolution over the history of v(t) whose evolution encodes the orthogonal dynamics of z(t). Therefore, it is possible to learn a set of ζ_i(t) such that the joint dynamics in terms of both z(t) and ζ_i(t) can be well approximated by the extended Markovian model (2.3). Moreover, the linear form in terms of v(t) ensures that the invariant density function of ζ_i(t) retains the Gaussian distribution consistent with v(t). We can further impose a constraint such that v(t) and ζ_i(t) are uncorrelated. This leads to an additional quadratic term in the energy function of the extended system, i.e.,
\[
W(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta}) = U(\mathbf{q}) + \tfrac{1}{2}\mathbf{p}^T\mathbf{M}^{-1}\mathbf{p} + \tfrac{1}{2}\boldsymbol{\zeta}^T\hat{\boldsymbol{\Lambda}}^{-1}\boldsymbol{\zeta},
\]
where Λ̂ represents the covariance matrix of ζ_1, ⋯, ζ_n. The reduced dynamics can be written as
\[
\mathrm{d}\begin{pmatrix}\mathbf{q}\\ \mathbf{p}\\ \boldsymbol{\zeta}\end{pmatrix}
= \mathbf{G}\,\nabla W(\mathbf{q},\mathbf{p},\boldsymbol{\zeta})\,\mathrm{d}t + \boldsymbol{\Sigma}\,\mathrm{d}\mathbf{W}_t, \tag{2.5}
\]
where the matrix G ∈ R^{(2+n)m×(2+n)m} takes the form
\[
\mathbf{G} =
\begin{pmatrix}
\mathbf{0} & \mathbf{I} & \mathbf{0}\\
-\mathbf{I} & \mathbf{J}_{11} & \mathbf{J}_{12}\\
\mathbf{0} & \mathbf{J}_{21} & \mathbf{J}_{22}
\end{pmatrix}
\begin{pmatrix}
\mathbf{I} & \mathbf{0} & \mathbf{0}\\
\mathbf{0} & \mathbf{I} & \mathbf{0}\\
\mathbf{0} & \mathbf{0} & \hat{\boldsymbol{\Lambda}}
\end{pmatrix}, \tag{2.6}
\]
where the blocks J_{11} ∈ R^{m×m}, J_{12} ∈ R^{m×nm}, J_{21} ∈ R^{nm×m}, and J_{22} ∈ R^{nm×nm} form the matrix J ∈ R^{(1+n)m×(1+n)m}. The matrix J further determines the statistical properties of the resolved variables and will be learned along with the non-Markovian features {ω_i(t)}_{i=1}^n from the training samples, as discussed in the next subsection. Given the reduced model in the form of Eqs. (2.5) and (2.6), the noise covariance matrix can be determined by
\[
\boldsymbol{\Sigma}\boldsymbol{\Sigma}^T = -\beta^{-1}\left(\mathbf{J}\boldsymbol{\Lambda} + \boldsymbol{\Lambda}^T\mathbf{J}^T\right), \tag{2.7}
\]
where β = 1/k_BT and Λ = diag(I, Λ̂). The form of Λ implies that v and ζ_i are uncorrelated, and it is consistent with the energy function of the extended system W(q, p, ζ).
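In discrete form, each feature of Eq. (2.4) is simply a learned convolution over a window of stored velocities. A small sketch for the scalar case m = 1 (the random weights below are placeholders standing in for trained encoder values):

```python
import numpy as np

# Sketch of Eq. (2.4): non-Markovian features as learned convolutions of
# the generalized-velocity history.  The weights are random placeholders
# standing in for trained encoder weights w_i(j * dt).
rng = np.random.default_rng(0)
n_feat, N_w = 3, 1800                          # N_w as used for the 1D models
w = 0.01 * rng.standard_normal((n_feat, N_w))  # hypothetical encoder weights

def features(w, v_hist):
    """zeta_i(t) ~ sum_j w_i(j dt) v(t - j dt); v_hist[j-1] = v(t - j dt)."""
    L = min(len(v_hist), w.shape[1])
    return w[:, :L] @ np.asarray(v_hist[:L])   # one value per feature
```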
The form of Λ also alleviates the non-negative definiteness constraint on ΣΣ^T, as discussed in Sec. 2.2.3. Furthermore, we can show that model (2.5) strictly satisfies the second fluctuation-dissipation theorem. Specifically, the embedded kernel in Eq. (2.5) takes the form
\[
\tilde{\boldsymbol{\theta}}(t) = -\left(\tilde{\mathbf{J}}_{11}\,\delta(t) + \tilde{\mathbf{J}}_{12}\,\mathrm{e}^{\tilde{\mathbf{J}}_{22}t}\,\tilde{\mathbf{J}}_{21}\right), \tag{2.8}
\]
where J̃_{11} = [J̃]_{1:m,1:m}, J̃_{12} = [J̃]_{1:m,m+1:}, J̃_{21} = [J̃]_{m+1:,1:m}, and J̃_{22} = [J̃]_{m+1:,m+1:} are the sub-blocks of the matrix J̃ := JΛ. The colored noise R̃(t) in terms of p(t) is related to θ̃(t) by
\[
\langle \tilde{\mathbf{R}}(t)\tilde{\mathbf{R}}(t')^T \rangle = -\beta^{-1}\left(\tilde{\mathbf{J}}_{12}\,\mathrm{e}^{\tilde{\mathbf{J}}_{22}(t-t')}\,\tilde{\mathbf{J}}_{21} + \left(\tilde{\mathbf{J}}_{11} + \tilde{\mathbf{J}}_{11}^T\right)\delta(t-t')\right) \tag{2.9}
\]
with t' < t. Moreover, we can show that the reduced model retains the consistent invariant density function with the full model, i.e.,
\[
\rho_{\mathrm{eq}} \propto \exp\left[-\beta W(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta})\right]. \tag{2.10}
\]
We refer to Appendix C and D for details.

We conclude this subsection with two remarks: (I) In principle, the mass matrix M further depends on q. Ref. Lee et al. (2019a) reports that the varying mass matrix plays a secondary effect on the reduced dynamics of the molecular system therein; see also Ref. Ayaz et al. (2022) for the cases of the nonlinear distance coordinate with constant mass. A constant mass matrix is adopted in the present study; reduced models with a varying mass matrix can be constructed by introducing an additional term in the conservative force and will be considered in a future study. (II) The non-Markovian features {ζ_i}_{i=1}^n in the form of Eq. (2.4) can be generalized to retain state-dependence, e.g., ζ_i(t) = ∫_0^{+∞} ω_i(τ, q(τ)) v(t − τ) dτ, which leads to a reduced model with state-dependent non-Markovian memory. In this study, we demonstrate the proposed learning framework by constructing the reduced model (2.5) that approximates the standard GLE (2.2) with a state-independent memory function θ(t). As shown in the numerical examples, although θ(t) is not explicitly constructed, it is well approximated by the memory kernel embedded in the reduced model (2.5) by matching the evolution of the correlation matrices for both the resolved and the extended variables. The learning of reduced models with a heterogeneous memory term will be systematically investigated in a future study.

2.2.3 Joint learning of the reduced dynamics

Construction of the above reduced models relies on the joint learning of the non-Markovian features (2.4), in the form of the encoder functions {ω_i(t)}_{i=1}^n, and the reduced dynamics (2.5)-(2.6), determined by the free energy U(q) and the matrix J. In this study, we represent the multi-dimensional free energy U(q) using a neural network and parameterize it based on molecular dynamics Frenkel and Smit (2001); we refer to the Appendix for details. Furthermore, the covariance of the noise term specified by Eq. (2.7) implies that J and Λ (i.e., the encoder functions ω_i(t)) need to satisfy the following constraint
\[
\mathbf{J}\boldsymbol{\Lambda} + \boldsymbol{\Lambda}\mathbf{J}^T \preceq \mathbf{0}. \tag{2.11}
\]
Directly imposing condition (2.11) becomes cumbersome for the joint learning of J and ω_i(t). Fortunately, this issue can be avoided by imposing an orthogonality constraint among the non-Markovian features, i.e.,
\[
[\hat{\boldsymbol{\Lambda}}]_{ij} := \beta\,\langle \boldsymbol{\zeta}_i, \boldsymbol{\zeta}_j\rangle
= \beta \sum_{k,k'} \langle \mathbf{w}_i(t - k\delta t)\mathbf{v}(k\delta t),\ \mathbf{w}_j(t - k'\delta t)\mathbf{v}(k'\delta t)\rangle
= \delta_{ij}\,\mathbf{I}, \qquad 1 \le i, j \le n, \tag{2.12}
\]
where the inner product ⟨f(Z), g(Z)⟩ = ∫ f(Z)g(Z)^T ρ(Z) dZ is defined with respect to the equilibrium density function of the full model ρ(Z) = e^{−βH(Z)} / ∫ e^{−βH(Z)} dZ. In addition, we also impose orthogonality constraints such that ζ and p are uncorrelated.
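By stationarity, the covariance entries in Eq. (2.12) reduce to sums of lagged velocity correlations, so Λ̂ can be evaluated without ever sampling the features themselves. A sketch for the scalar case (function and variable names are hypothetical; `C_vv[k]` stands for the equilibrium autocorrelation at lag k·δt):

```python
import numpy as np

# Sketch: the feature covariance of Eq. (2.12) assembled from the encoder
# weights and the stationary velocity autocorrelation (scalar case m = 1).
# Lambda_hat[i, j] = beta * sum_{k, k'} w_i[k] * C_vv(|k - k'|) * w_j[k'].
def feature_covariance(w, C_vv, beta):
    N_w = w.shape[1]
    idx = np.abs(np.arange(N_w)[:, None] - np.arange(N_w)[None, :])
    C = C_vv[idx]                  # Toeplitz matrix of lagged correlations
    return beta * w @ C @ w.T
```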
Therefore, condition (2.11) can be transformed into seeking J such that J + J^T ≼ 0, and we represent J by
\[
\mathbf{J} = -\mathbf{L}\mathbf{L}^T + \mathbf{J}_A, \tag{2.13}
\]
where L ∈ R^{(n+1)m×(n+1)m} is a lower-triangular matrix with positive diagonal elements, such that LL^T represents the Cholesky decomposition of a symmetric positive-definite matrix, and J_A represents an anti-symmetric matrix. Unlike the studies Mori (1965a); Ceriotti et al. (2009) based on the direct kernel approximation, we note that J takes a more general form and is not restricted to be diagonal or tri-diagonal. With the proper form of J, we can cast the reduced dynamics into the evolution of the correlation matrices by taking the inner product of both sides of Eq. (2.5) with v(0), i.e.,
\[
\frac{\mathrm{d}}{\mathrm{d}t}
\underbrace{\begin{pmatrix}
\langle \mathbf{M}\mathbf{v}, \mathbf{v}(0)\rangle\\
\langle \boldsymbol{\zeta}_1, \mathbf{v}(0)\rangle\\
\vdots\\
\langle \boldsymbol{\zeta}_n, \mathbf{v}(0)\rangle
\end{pmatrix}}_{\mathbf{C}_1(t)}
+
\underbrace{\begin{pmatrix}
\langle \nabla U(\mathbf{q}), \mathbf{v}(0)\rangle\\
\mathbf{0}\\
\vdots\\
\mathbf{0}
\end{pmatrix}}_{\mathbf{C}_0(t)}
= \mathbf{J}
\underbrace{\begin{pmatrix}
\langle \mathbf{v}, \mathbf{v}(0)\rangle\\
\langle \boldsymbol{\zeta}_1, \mathbf{v}(0)\rangle\\
\vdots\\
\langle \boldsymbol{\zeta}_n, \mathbf{v}(0)\rangle
\end{pmatrix}}_{\mathbf{C}_2(t)}, \tag{2.14}
\]
where the correlation matrices ⟨ζ_i(t), v(0)⟩ can be directly obtained from the correlation matrix of the resolved variables ⟨v(t), v(0)⟩ given the encoder weights, i.e.,
\[
\langle \boldsymbol{\zeta}_i(t), \mathbf{v}(0)\rangle = \sum_{j=1}^{N_w} \mathbf{w}_i(t_j)\,\langle \mathbf{v}(t - t_j), \mathbf{v}(0)\rangle,
\]
where t_j = jδt and the encoder weights w_i(t_j) will be learned jointly. Therefore, we are able to construct J from the pre-computed, noise-free correlation matrices instead of by on-the-fly computation from the time-series samples of q and p. The reduced model can be trained by minimizing the following loss function
\[
L_C = \sum_{j=1}^{N_t} \left\| \mathbf{C}_1'(t_j) + \mathbf{C}_0(t_j) - \mathbf{J}\,\mathbf{C}_2(t_j) \right\|^2, \qquad
L_\Lambda = \left\| \boldsymbol{\Lambda} - \mathbf{I} \right\|^2, \qquad
L = \lambda_C L_C + \lambda_\Lambda L_\Lambda, \tag{2.15}
\]
where C_1 = [⟨Mv, v(0)⟩; ⟨ζ_1, v(0)⟩; ⋯; ⟨ζ_n, v(0)⟩], and C_0(t) and C_2(t) are defined similarly in Eq. (2.14). λ_C and λ_Λ are hyperparameters, t_j refers to the discrete time points, and N_t represents the total number of sample points of the correlation matrices obtained from the full model. The loss term L_C ensures that the non-local statistical properties of the resolved variables can be accurately preserved, while the loss term L_Λ ensures the aforementioned orthogonality among the non-Markovian features. To simulate the constructed model, we always set Λ̂ = I such that J in the form of Eq. (2.13) strictly satisfies the semi-definiteness condition (2.11).
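Since the objective (2.15) is a least-squares residual over precomputed correlation tensors, it maps directly onto automatic differentiation. A hedged sketch in PyTorch (all shapes and helper names such as `make_J` are illustrative assumptions; assembling the correlation tensors from the encoder weights via Eq. (2.14) is omitted):

```python
import torch

# Sketch of the joint-learning objective, Eqs. (2.13) and (2.15).
# C1dot, C0, C2: tensors of shape (N_t, d, d) with d = (n + 1) * m.
def make_J(L_raw, A_raw):
    """J = -L L^T + J_A with L lower-triangular, positive diagonal."""
    L = torch.tril(L_raw, diagonal=-1) \
        + torch.diag(torch.nn.functional.softplus(torch.diagonal(L_raw)))
    J_A = A_raw - A_raw.T                     # anti-symmetric part
    return -L @ L.T + J_A

def loss(L_raw, A_raw, Lam, C1dot, C0, C2, lam_C=1.0, lam_Lam=1.0):
    J = make_J(L_raw, A_raw)
    resid = C1dot + C0 - torch.einsum('ij,tjk->tik', J, C2)
    L_C = resid.pow(2).sum()                  # correlation-matching term
    L_Lam = (Lam - torch.eye(Lam.shape[0])).pow(2).sum()  # orthogonality
    return lam_C * L_C + lam_Lam * L_Lam
```

In line with the text, such a loss would be minimized with the Adam optimizer over randomly selected time points.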
We emphasize that the non-Markovian encoder weights {w_i(t_j)}_{j=1}^{N_w} do not explicitly appear in the loss function. However, they are involved in the training process along with J, since the correlation functions C_1 and C_2 further depend on the definition of ζ_i; i.e., they are concurrently learned for the best approximation of the extended Markovian dynamics of [q; p; ζ]. As shown in Sec. 2.3, this joint learning of both the non-Markovian features and the dynamic equations enables us to probe the optimal representation of the reduced models, which leads to more accurate numerical results than the ones constructed from pre-selected bases, and it can be easily implemented for models with multi-dimensional resolved variables. In this study, we choose N_t = 5000 for all the numerical examples, and choose N_w = 1800 for the one-dimensional reduced model and 1200 for the four-dimensional reduced model, respectively. The training is conducted by the ADAM optimization algorithm Kingma and Ba (2015), where 1000 points are randomly selected per training step.

We conclude this subsection with the following remarks: (I) Instead of Eq. (2.14), the reduced dynamics can also be cast into the evolution of the correlation matrices by taking the inner product of both sides of Eq. (2.5) with q(0). For the present study, we found that both formulations yield accurate reduced models. (II) Rather than learning the full set of non-Markovian features, we can fix one of them as M v̇ + ∇U(q); this ensures that the time-derivatives of the correlation functions of the reduced model can accurately match the values of the full model near t = 0. In the numerical examples presented in the following Sec. 2.3, all the reduced models are constructed with this choice. For simple notation, we set it to be the last feature. For example, the fourth-order reduced model is constructed using 4 non-Markovian features: ζ_1, ζ_2, and ζ_3 take the form of Eq. (2.4), and ζ_4 is set to be M v̇ + ∇U(q). (III) In principle, for reduced models of Hamiltonian systems, the form of the matrix J can be further restricted to
\[
\mathbf{J} = -\operatorname{diag}\left(\mathbf{0}, \hat{\mathbf{L}}\hat{\mathbf{L}}^T\right) + \mathbf{J}_A, \tag{2.16}
\]
where L̂ ∈ R^{nm×nm} is a lower-triangular matrix. Eq. (2.16) ensures that the embedded kernel in Eq. (2.5) does not contain the Markovian memory term (i.e., (J_{11} + J_{11}^T)δ(t)). θ̃(t) recovers the form of the standard GLE, and the second fluctuation-dissipation relationship shown in Eq. (2.9) recovers the standard form, i.e., ⟨R̃(t)R̃(t')^T⟩ = β^{-1} θ̃(t − t'). In this study, both forms yield accurate numerical results; the contribution of the Markovian term constructed by Eq. (2.13) is less than 1%.

2.3 Numerical results

2.3.1 A tagged particle in aqueous environment

We demonstrate our method by modeling a tagged particle immersed in solvent particles. The particle interaction is governed by the pairwise force
\[
\mathbf{F}_{ij}(\mathbf{Q}_{ij}) =
\begin{cases}
f_0\left(1 - Q_{ij}/r_c\right)\mathbf{e}_{ij}, & Q_{ij} \le r_c,\\
\mathbf{0}, & Q_{ij} > r_c,
\end{cases}
\]
where Q_i and Q_j are the positions of the i-th and j-th particles, Q_{ij} = Q_i − Q_j is the displacement vector, Q_{ij} = ‖Q_i − Q_j‖ is its magnitude, e_{ij} = Q_{ij}/Q_{ij}, and r_c is the cut-off distance. The full system consists of 4000 particles in a 10 × 10 × 10 cubic box with periodic boundary conditions along each direction. We set f_0 = 50, r_c = 1, and the particle mass to unity. A Nosé-Hoover thermostat is used with k_BT = 0.5 and time step δt = 2 × 10^{-3}.
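The soft linear repulsion above translates directly into code; a minimal sketch for one particle pair (the periodic-image convention of the simulation box is omitted, and names are illustrative):

```python
import numpy as np

# Sketch of the pairwise force of Sec. 2.3.1:
# F_ij = f0 * (1 - Q_ij / r_c) * e_ij for Q_ij <= r_c, else 0.
f0, r_c = 50.0, 1.0

def pair_force(Qi, Qj):
    d = Qi - Qj                            # displacement vector Q_ij
    r = np.linalg.norm(d)                  # distance Q_ij
    if r == 0.0 or r > r_c:
        return np.zeros_like(d)
    return f0 * (1.0 - r / r_c) * (d / r)  # directed along e_ij = Q_ij / Q_ij
```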
128 samples are collected from a production stage of 6 × 10^5 steps, which are used as the initial conditions of the NVE simulations of the full model for a production stage of 1 × 10^5 steps using the Velocity-Verlet integrator.

Figure 2.1 Numerical results for a tagged particle in the solvent particle bath. (a) Velocity correlation function C_vv(t) obtained from the full MD model and the reduced models constructed by the present method based on the joint learning approximation, the rational function approximation Lei et al. (2016a), and the Petrov–Galerkin projection with fixed bases Lei and Li (2021). (b) Predicted evolution of the probability density function of the particle velocity obtained from the full MD and the different reduced models with the second-order approximation. The initial velocity v is set to 0 (the vertical line). (c) C_vv(t) obtained from the full MD model and different orders of the present joint learning approximation. (d) Encoder weights for the three non-Markovian features obtained from the present joint learning with the fourth-order approximation.

The reduced dynamics in terms of the tagged particle is constructed in the form of Eq. (2.5), along with the learning of the non-Markovian features {ζ_i}_{i=1}^n. The free energy U(q) vanishes for this case. For comparison, we also construct the reduced model using the previous approaches based on the Petrov-Galerkin projection (referred to as the fixed-basis method) Lei and Li (2021) and the rational function approximation Lei et al. (2016a). Fig. 2.1(a) shows the velocity correlation function of the constructed models using two non-Markovian features, or equivalently, two projection bases, as well as the second-order rational function approximation. The model constructed by the present method (referred to as the joint-learning method) shows the best agreement with the full model based on the molecular dynamics (MD) simulations. The model accuracy can be further examined by the evolution of the probability density function (PDF) of the particle velocity. Specifically, we fix the velocity to be zero at t = 0 and sample the instantaneous PDF thereafter. Fig. 2.1(b) shows the obtained PDF at t = 0.06. The present approach yields more accurate results than the Petrov-Galerkin method. As shown in Fig. 2.1(c), the accuracy of the reduced model shows further improvement as we increase the number of non-Markovian features. In particular, the reduced model with the fourth-order approximation can accurately capture the oscillations over the full regime. Fig. 2.1(d) shows the obtained encoder weights of the fourth-order approximation. All three encoder functions show pronounced oscillations near t = 0 and decay to 0 for large t. Unlike the empirically chosen fractional-derivative bases in Ref. Lei and Li (2021), the present method enables the encoders to be optimized for the best approximation of the non-local statistics, and therefore yields more accurate results.

Figure 2.2 A sketch of a chain molecule represented by the united-atom model.
Reduced models are constructed with respect to a four-dimensional resolved vector q, which represents the end-to-end distance (q_1), the radius of gyration (q_2), and the end-to-middle distances (q_3 and q_4), respectively.

2.3.2 One-dimensional reduced model of a polymer molecule

We consider the reduced dynamics of a polymer molecule consisting of N = 16 atoms. The intramolecular potential is governed by
\[
V_{\mathrm{mol}}(\mathbf{Q}) = \sum_{i \ne j}^{N} V_{\mathrm{p}}(Q_{ij}) + \sum_{i=1}^{N_b} V_{\mathrm{b}}(l_i) + \sum_{i=1}^{N_a} V_{\mathrm{a}}(\theta_i) + \sum_{i=1}^{N_d} V_{\mathrm{d}}(\phi_i), \tag{2.17}
\]
where l_i, θ_i, and φ_i represent the individual bond lengths, bond angles, and dihedral angles, respectively. V_p, V_b, V_a, and V_d represent the pairwise Lennard-Jones, finite extensible nonlinear elastic bond, harmonic angle, and multiharmonic dihedral interactions, whose explicit forms are specified in Appendix A. The atom mass is set to unity, the thermal temperature k_BT is set to 1.0, and the time step δt is set to 1 × 10^{-3}. 512 samples are collected from a production stage of 3 × 10^6 steps, which are used as the initial conditions of the NVE simulations of the full model for a production stage of 1 × 10^6 steps using the Velocity-Verlet integrator. Fig. 2.2 shows a sketch of the polymer molecule.
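The internal coordinates entering Eq. (2.17) are standard geometric quantities; a sketch of their evaluation from atom positions follows (the potential terms themselves are given in Appendix A and are not reproduced here, so these helpers are generic, not the dissertation's code):

```python
import numpy as np

# Sketch: bond length, bond angle, and dihedral angle entering Eq. (2.17),
# computed from individual atom positions (each a length-3 array).
def bond_length(Qa, Qb):
    return np.linalg.norm(Qa - Qb)

def bond_angle(Qa, Qb, Qc):                # angle at the middle atom Qb
    u, v = Qa - Qb, Qc - Qb
    c = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return np.arccos(np.clip(c, -1.0, 1.0))

def dihedral_angle(Qa, Qb, Qc, Qd):        # signed torsion about the Qb-Qc bond
    b1, b2, b3 = Qb - Qa, Qc - Qb, Qd - Qc
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(m1 @ n2, n1 @ n2)
```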
To examine the effectiveness of the present method, we first construct a 1D reduced dynamics in terms of the end-to-end distance q_1 = ‖Q_1 − Q_N‖, as done in the previous work Lei and Li (2021) based on the Petrov-Galerkin method, and compare the numerical results obtained from the two methods. Figure 2.3(a) shows the velocity correlation function C_vv(t) = ⟨v_1(t)v_1(0)⟩ obtained from the full MD and different orders of the fixed-basis and joint-learning approximations. With the same order of approximation, the current method yields better agreement with the MD results. Specifically, the second-order model of the current method can capture the pattern around t = 4, and the fourth-order model can capture the patterns around t = 0.4 and t = 4. However, the previous method with the same order of approximation shows limitations in accurately capturing these two patterns. Figure 2.3(b) shows the displacement auto-correlation function C_qq(t) = ⟨q_1(t)q_1(0)⟩ obtained from the full MD and the reduced models constructed by the present method with different numbers of non-Markovian features. As we introduce more features, the predicted correlation functions approach the MD results. In particular, the fourth-order model can capture the oscillations of the MD results at t = 10 and t = 25. Figure 2.3(c) shows the encoder weights of the non-Markovian features for the fourth-order approximation. Similar to the tagged particle system, the encoder functions exhibit pronounced oscillations at short times and decay to zero at longer times.

Figure 2.3 Numerical results of a one-dimensional reduced model representing the dynamics of the end-to-end distance of a polymer molecule system. (a) Velocity correlation function C_vv(t) obtained from the full MD simulations, different orders of the present joint learning approximation, and the Petrov–Galerkin projection with the fixed-basis approximation. (b) Displacement correlation function C_qq(t) obtained from the full MD and different orders of the joint learning approximation. (c) Encoder weights for the three non-Markovian features of the reduced model with the fourth-order approximation. (d) The Laplace transform of the memory function Θ(λ) obtained from the full MD simulations and the different orders of the joint-learning and fixed-basis approximations.

The accuracy of the constructed reduced models can be further examined by comparing the embedded memory kernels θ̃(t) with the full MD model. Figure 2.3(d) shows the Laplace transform of the memory kernel of the reduced models, Θ̃(λ) = ∫_0^{+∞} θ̃(t) exp(−t/λ) dt. The MD kernel Θ(λ) is obtained by Θ(λ) = −G(λ)H(λ)^{-1}, where G(λ) and H(λ) are the Laplace transforms of the correlation matrices g(t) = ⟨M v̇(t) + ∇U(q), q(0)⟩ and h(t) = ⟨v(t), q(0)⟩. Compared with the previous method, the current method yields better agreement with the MD results. Specifically, the second- and fourth-order joint-learning approximations, and the fourth-order fixed-basis approximation, show good agreement with the MD result Θ(λ) for λ between 1 and 1000. Furthermore, the fourth-order model of the joint-learning approximation can further capture the pronounced peak regime of the MD results near λ = 0.1.
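This Laplace-space comparison is straightforward to reproduce numerically from sampled correlation matrices; a sketch using the convention F(λ) = ∫_0^∞ f(t) e^{−t/λ} dt adopted in the text (array shapes and names are illustrative):

```python
import numpy as np

# Sketch: extract the MD memory kernel in Laplace space,
# Theta(lam) = -G(lam) H(lam)^{-1}, from correlation matrices sampled on a
# uniform time grid t; g_t and h_t have shape (N_t, m, m).
def laplace(f_t, t, lam):
    w = np.exp(-t / lam)                       # convention exp(-t / lam)
    return np.trapz(f_t * w[:, None, None], t, axis=0)

def md_kernel(g_t, h_t, t, lam):
    G = laplace(g_t, t, lam)
    H = laplace(h_t, t, lam)
    return -G @ np.linalg.inv(H)
```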
We emphasize that the memory kernel θ̃(t) is not explicitly constructed during the learning process; θ̃(t) approaches θ(t) as we impose the constraint (2.14) such that the correlation matrices of the reduced dynamics match the ones of the full model. This enables us to circumvent the direct fitting of the matrix-valued memory function for multi-dimensional GLEs, and to efficiently construct a numerically stable reduced model that retains the non-local statistics and coherent noise, as shown in the following example.

2.3.3 Four-dimensional reduced model of a polymer molecule

Finally, we construct a reduced model in terms of a four-dimensional resolved vector q = [q_1, q_2, q_3, q_4] defined by
\[
q_1 = \left\| \mathbf{Q}_1 - \mathbf{Q}_N \right\|, \qquad
q_2^2 = \frac{1}{N}\sum_{i=1}^{N} \left\| \mathbf{Q}_i - \mathbf{Q}_c \right\|^2, \qquad
\mathbf{Q}_c = \frac{1}{N}\sum_{i=1}^{N} \mathbf{Q}_i, \qquad
q_3 = \left\| \mathbf{Q}_{\lfloor N/2 \rfloor} - \mathbf{Q}_1 \right\|, \qquad
q_4 = \left\| \mathbf{Q}_{\lceil N/2 \rceil} - \mathbf{Q}_N \right\|, \tag{2.18}
\]
where q_1, q_2, q_3, and q_4 represent the end-to-end distance, the radius of gyration, and the two center-to-end distances, respectively. The four-dimensional free energy function U(q) is constructed by matching the average force sampled from restrained molecular dynamics and is represented by a neural network; we refer to Appendix B for details. Rather than constructing the four-dimensional GLE kernel θ(t), we directly learn the reduced model (2.5) by minimizing the loss function (2.15).

Figure 2.4 Numerical results of a four-dimensional reduced model representing the dynamics of a polymer molecule system, with conformation states characterized by the resolved variables q (see Eq. (2.18)). (a)–(c) Diagonal components of the velocity correlation function C_vv(t) = ⟨v(t)v(0)^T⟩. Note that [C_vv(t)]_{44} is omitted because it is similar to [C_vv(t)]_{33}. (d) Constructed encoder weights of the first non-Markovian feature ζ_1 for the fourth-order reduced model.

Figures 2.4(a)-(c) show the diagonal components of the velocity correlation matrix C_vv(t) = ⟨v(t)v(0)^T⟩ obtained from the full MD and the reduced models using different orders of approximation. Specifically, the components [C_vv(t)]_{11} and [C_vv(t)]_{33} show similar values near t = 0 since both q_1 and q_3 characterize distances between individual particles, e.g., v_1 = (Q_1 − Q_N)·(V_1 − V_N)/‖Q_1 − Q_N‖. As the distribution of the velocity difference between the two free-end particles follows N(0, 2k_BT I), the variance of v_1 is 2k_BT. A similar argument also holds for v_3 and v_4. On the long-time scale, [C_vv(t)]_{11} and [C_vv(t)]_{22} decay much slower than [C_vv(t)]_{33} and [C_vv(t)]_{44}, and show pronounced oscillations near t = 10 and t = 25. The differences can be understood as follows: compared with the end-to-middle distances q_3 and q_4, the end-to-end distance q_1 and the radius of gyration q_2 represent the global states of the molecular conformation. Based on the scaling law of the idealized Gaussian chain model de Gennes (1979), the relaxation time of q_1 and q_2 is proportional to N^2. Accordingly, [C_vv(t)]_{11} decays four times slower than [C_vv(t)]_{33}, which is qualitatively consistent with the present numerical results.
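Referring back to Eq. (2.18), the resolved variables can be evaluated directly from the chain coordinates; a minimal sketch (Q is an (N, 3) array; atom k of the text corresponds to row k − 1, following Eq. (2.18) as written):

```python
import numpy as np

# Sketch: the four resolved variables of Eq. (2.18) from polymer
# coordinates Q, an (N, 3) array with N = 16.
def resolved_variables(Q):
    N = Q.shape[0]
    Qc = Q.mean(axis=0)                                  # center of mass Q_c
    q1 = np.linalg.norm(Q[0] - Q[N - 1])                 # end-to-end distance
    q2 = np.sqrt(((Q - Qc) ** 2).sum(axis=1).mean())     # radius of gyration
    q3 = np.linalg.norm(Q[N // 2 - 1] - Q[0])            # floor(N/2)-to-end
    q4 = np.linalg.norm(Q[(N + 1) // 2 - 1] - Q[N - 1])  # ceil(N/2)-to-end
    return np.array([q1, q2, q3, q4])
```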
The transient dynamics of the correlation functions can be accurately captured by the reduced model. As we increase the number of non-Markovian features, the predictions show better agreement with the MD results. Specifically, the zeroth-order (i.e., Langevin) model is insufficient to capture the patterns around 𝑡 = 0.5 and 𝑡 = 5. The second-order model yields an accurate prediction for [C𝑣𝑣(𝑡)]33 but less accurate predictions for [C𝑣𝑣(𝑡)]11 and [C𝑣𝑣(𝑡)]22. The fourth-order model yields good agreement for all the components over the full regime. Fig. 2.4(d) shows the encoder weights of the first non-Markovian feature ζ1, which naturally encode the non-local statistics among the resolved variables and decay to zero at long times.

Fig. 2.5 shows the off-diagonal components of the velocity correlation matrix C𝑣𝑣(𝑡). Similar to the diagonal components, [C𝑣𝑣(𝑡)]12 represents the coupling between the dynamics of two global conformation states and therefore exhibits the longest correlation, with pronounced oscillations at 𝑡 = 10 and 𝑡 = 25. On the other hand, [C𝑣𝑣(𝑡)]13 and [C𝑣𝑣(𝑡)]23 represent the coupling between a global state and a semi-global state, and therefore exhibit intermediate correlation. In addition, [C𝑣𝑣(𝑡)]34 exhibits weaker correlation compared with the other components, since the coupling between the dynamics of 𝑞3 and 𝑞4 is mainly governed by the local bond and angle interactions associated with the 8th and 9th atoms. The predictions of the second-order reduced model show fairly good agreement with the full MD results for [C𝑣𝑣(𝑡)]13 and [C𝑣𝑣(𝑡)]23 but less agreement for [C𝑣𝑣(𝑡)]12. The fourth-order reduced model yields good agreement for all the components.

Figure 2.5 (a–d) Off-diagonal components of the velocity correlation function C𝑣𝑣(𝑡) for a polymer molecule system whose conformation states are characterized by a four-dimensional resolved vector q defined by Eq. (2.18).

Fig. 2.6 shows the components of the embedded matrix-valued kernels in the Laplace space obtained from the full MD and the reduced models. In particular, ˜𝚯(𝜆) obtained from the second-order model shows good agreement with 𝚯(𝜆) obtained from the full MD within the regime of large 𝜆. The fourth-order model yields good agreement over the full regime, which is consistent with the accurate prediction of the velocity correlation functions shown in Figs. 2.4 and 2.5 (see also Appendix E for θ(𝑡)). While the kernel function θ(𝑡) is not explicitly constructed in the present method, the accurate recovery of 𝚯(𝜆) verifies that the constructed models faithfully retain the non-Markovian dynamics of the resolved variables.

Figure 2.6 (a–d) Components of the embedded matrix-valued kernel 𝚯(𝜆) in the Laplace space obtained from the full MD and a four-dimensional reduced model of a polymer molecule system.
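All of the comparisons above rely on estimating C𝑣𝑣(𝑡) = ⟨v(𝑡)v(0)ᵀ⟩ from time-series data. A minimal stationary-average estimator (a sketch, assuming a uniformly sampled trajectory of resolved velocities; the input array is hypothetical) reads:

import numpy as np

def velocity_correlation(v, max_lag):
    """v: (T, m) trajectory of resolved velocities, uniformly sampled.
    Returns C[k] ~ <v(t + k dt) v(t)^T>, shape (max_lag, m, m),
    averaging over all admissible time origins (stationarity)."""
    T, m = v.shape
    C = np.zeros((max_lag, m, m))
    for k in range(max_lag):
        C[k] = np.einsum('ti,tj->ij', v[k:T], v[:T - k]) / (T - k)
    return C

v = np.random.randn(20000, 4)      # hypothetical 4D velocity time series
Cvv = velocity_correlation(v, 200)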
2.4 Summary

In this study, we developed a data-driven approach to accurately learn the stochastic reduced dynamics of full Hamiltonian systems with non-Markovian memory. The method essentially provides an efficient approach to approximating the multi-dimensional generalized Langevin equation. Rather than directly fitting the matrix-valued memory kernel, the present method seeks a set of non-Markovian features whose evolution naturally encodes the orthogonal dynamics of the resolved variables, and establishes a joint learning of the extended dynamics in terms of both the resolved variables and the non-Markovian features. Compared with previous studies based on the rational function approximation Lei et al. (2016a) and the Petrov–Galerkin projection Lei and Li (2021) with pre-selected fractional derivative bases, the present method enables us to probe the optimal representation of the reduced dynamics through the joint learning of the non-Markovian features. The constructed features retain a clear physical interpretation and can be loosely viewed as convolutions of the velocity history. This enables us to construct a proper learning formulation such that the reduced dynamics strictly preserves the second fluctuation-dissipation theorem and retains the consistent invariant density distribution. Moreover, the learning process does not require the on-the-fly computation of the time correlations of these features from the time-series samples, and it automatically ensures numerical stability of the constructed model without empirical treatment. This is particularly well-suited for the construction of reduced dynamics of complex systems such as the conformational dynamics of macromolecular systems, where multi-dimensional resolved variables are often needed to characterize the transition dynamics with non-local cross-correlations among the variables.

Building upon the data-driven framework for modeling state-independent memory effects introduced in Chapter 2, Chapter 3 extends this approach to more complex dynamical systems with state-dependent memory. While Chapter 2 demonstrated how a fixed set of convolutional encoders could capture global non-Markovian behavior through auxiliary variables, this assumption becomes limiting in systems where memory varies across configurations, such as when transitions between metastable states occur. To address this, Chapter 3 introduces a heterogeneous encoding architecture that allows the memory kernels to adapt locally to the system's state. This generalization enables a more accurate and flexible representation of reduced dynamics in high-dimensional systems where memory effects are inherently configuration-dependent.

CHAPTER 3
ENHANCED SAMPLING DATA-DRIVEN CONSTRUCTION OF STOCHASTIC REDUCED DYNAMICS ENCODED WITH STATE-DEPENDENT MEMORY

3.1 Introduction

Predictive modeling of multi-scale dynamic systems remains a significant challenge across various fields, including biology, materials science, and fluid physics. A prominent example is coarse-grained molecular dynamics (CGMD), where the goal is to simplify molecular system representations while preserving their essential dynamic behavior. The generalized Langevin equation (GLE) has emerged as a widely used framework for capturing the non-Markovian dynamics inherent in many CGMD processes. A range of approaches has been proposed for parameterizing the memory Lange and Grubmüller (2006); Darve et al. (2009b); Ceriotti et al. (2009); Baczewski and Bond (2013); Davtyan et al.
(2015b); Lei et al. (2016b); Russo et al. (2019); Jung et al. (2017b); Lee et al. (2019b); Ma et al. (2019); Wang et al. (2020b,c); Zhu and Venturi (2020); Vroylandt et al. (2022); She et al. (2023); Xie and E (2024), aiming to reconstruct specific dynamic properties accurately. However, recent work Lyu and Lei (2023b); Ge et al. (2024) reveals that recovering isotropic properties alone may be insufficient for accurately reproducing the underlying complex dynamics. These findings underscore the importance of incorporating state-dependent memory effects to achieve precise reconstruction of dynamic behaviors.

The accurate parameterization of a state-dependent memory kernel hinges on effectively capturing the dynamic properties within the phase space. However, practical applications often face challenges due to the inherent complexity of the energy landscape in phase space. This complexity is typically marked by the presence of numerous metastable states, which are separated by significant energy barriers. These barriers hinder transitions between states, making it difficult to comprehensively sample the phase space and accurately reconstruct the memory kernel. Addressing this challenge requires advanced techniques capable of efficiently exploring these landscapes while retaining essential dynamic information. The critical role of sampling in phase space has been widely acknowledged, particularly in the construction of free energy landscapes. To address this, numerous methodologies have been developed, each offering unique advantages for overcoming sampling challenges. Notable approaches include umbrella sampling Torrie and Valleau (1977), which applies biased potentials to enhance exploration; histogram reweighting Kumar et al. (1992), enabling the integration of data from multiple simulations; metadynamics Laio and Parrinello (2002b); Barducci et al. (2008), which facilitates the escape from metastable states through adaptive biasing; and variationally enhanced sampling Valsson and Parrinello (2014); Shaffer et al. (2016); Bonati et al. (2019), a framework that leverages variational principles to optimize bias potentials. These methods collectively underscore the importance of efficient phase space exploration in capturing accurate free energy profiles. Despite their great success and wide application in capturing static properties, however, the importance of sampling for dynamic properties has been largely ignored.

In this study, we employ our previously developed consensus-based enhanced sampling technique to simultaneously construct the free energy surface and parameterize the memory kernel. The conservative force is determined through constrained dynamics at selected points in the phase space, while dynamic information at these points is obtained via multiple free dynamics simulations initiated from the same locations.

3.2 Methods

The system under consideration is modeled as a Hamiltonian system with a 6𝑁-dimensional phase space vector Z = [Q; P], where Q and P represent the position and momentum vectors, respectively. The dynamics of the system are governed by the equation of motion

Ż = S∇𝐻(Z),   (3.1)

where S = [0, I; −I, 0] is the symplectic matrix that preserves the structure of Hamiltonian dynamics, with I being the identity matrix, and 𝐻(Z) denotes the Hamiltonian function. For sufficiently large 𝑁, the simulation of Eq. (3.1) becomes computationally prohibitive.
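In block form, Eq. (3.1) simply recovers q̇ = ∂𝐻/∂p and ṗ = −∂𝐻/∂q. The following toy sketch (a one-dimensional harmonic Hamiltonian, chosen purely for illustration) makes the role of S explicit:

import numpy as np

# toy Hamiltonian H(Z) = q^2/2 + p^2/2 with Z = [q; p]
S = np.array([[0.0, 1.0],
              [-1.0, 0.0]])          # symplectic matrix of Eq. (3.1)

def grad_H(Z):
    return Z                         # grad H = [q, p] for this choice of H

Z = np.array([1.0, 0.0])
dt = 1e-3
for _ in range(10000):
    Z = Z + dt * S @ grad_H(Z)       # forward Euler on Zdot = S grad H
print(0.5 * Z @ Z)                   # energy, approximately conserved for small dt;
                                     # a symplectic integrator would be used in practice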
However, in many practical scenarios, interest lies in a low-dimensional resolved variable z(𝑡) = 𝜙(Z(𝑡)), where 𝜙 : R^{6N} → R^m serves as a mapping that projects the high-dimensional phase space onto a reduced space of interest.

The Mori–Zwanzig (MZ) formalism provides a robust foundation for constructing approximate dynamics for resolved variables by employing a projection operator. This framework separates the resolved and unresolved components of the system, enabling a reduced description of the dynamics while incorporating memory effects and fluctuating forces to account for the influence of unresolved variables. The projection operator P maps functions of the full system to functions of the coarse-grained (CG) system, and is defined as

(P𝑓)(z) = ∫ 𝛿(𝜙(Z) − z) 𝑓(Z) 𝜌(Z) dZ / ∫ 𝛿(𝜙(Z) − z) 𝜌(Z) dZ,

where z represents the CG variables, 𝜙(Z) is the mapping from the full system to the CG system, and 𝜌(Z) is the probability density function of the full system. The dynamics of the CG variable follows

∂𝜙(Z)/∂𝑡 = exp(𝑡L) PL𝜙(Z) + ∫_0^𝑡 exp((𝑡 − 𝑠)L) PL exp(𝑠QL) QL𝜙(Z) d𝑠 + exp(𝑡QL) QL𝜙(Z).

Here, L := (S∇𝐻(Z)) · ∇_Z is the Liouville operator associated with Eq. (3.1), and Q = I − P. Motivated by this, the reduced dynamics can be written as

q̇ = M(q)⁻¹p,   ṗ = −F(q) − ∫_0^𝑡 θ(𝑡 − 𝜏) q̇(𝜏) d𝜏 + R(𝑡).   (3.2)

Here, q = 𝜙_q(Q) denotes a coarse-grained variable, and p is the corresponding momentum, associated with a mass matrix M(q). The effective free energy for q is defined as 𝑈_eff(q) = −𝛽⁻¹ log ∫ dZ 𝛿(𝜙_q(Z) − q) 𝜌(Z), with 𝛽 = 1/(𝑘_B𝑇) the inverse temperature.

Inspired by previous research She et al. (2023), Equation (3.2) can be reformulated as an extended Markovian process (q, p, ξ), where ξ represents auxiliary variables that will be defined later. These auxiliary variables serve to capture the memory effects inherent in the original generalized Langevin equation (GLE) and are assumed to follow a Gaussian distribution N(0, 1). The evolution takes the form

d[q; p; ξ] = [0, I, 0; −I, J(q)] [∇_q F; ∇_p F; ∇_ξ F] d𝑡 + [0; 𝚺(q)] dW_𝑡,   (3.3)

where semicolons separate block rows: the top row drives dq by ∇_p F, the −I block supplies the conservative contribution −∇_q F to dp, and the 2 × 2 lower-right block J(q), acting on the (p, ξ) components and specified below, encodes the dissipative and memory coupling. Here F(q, p, ξ) = 𝑈(q) + pᵀM(q)⁻¹p + ξᵀξ is the free energy of the extended system. The relationship 𝚺(q)𝚺(q)ᵀ = −𝑘_B𝑇 (J(q)𝚲 + 𝚲J(q)ᵀ) ensures consistency with the second fluctuation-dissipation theorem, where 𝚲 is the covariance matrix of (p, ξ). By solving the Fokker–Planck equation, we obtain the invariant distribution of the extended system, 𝜌_e(q, p, ξ) ∝ exp(−𝛽F(q, p, ξ)). The invariant distribution should be consistent with the free energy for q, with ∫ 𝜌_e(q, p, ξ) dp dξ ∝ exp(−𝛽𝑈_eff(q)), from which we notice that 𝑈(q) = 𝑈_eff(q) − (1/2𝛽) log |M(q)⁻¹|.

To construct the hidden variables, we define them as a linear combination of past momentum values p(𝑡 − 𝛿𝑡), …, p(𝑡 − 𝑁_w𝛿𝑡), i.e.,

ξ_i(𝑡) = Σ_{j=1}^{N_w} w_{ji} p(𝑡 − j𝛿𝑡),   (3.4)

where the w_{ji} are coefficients to be optimized, 𝑁_w is the number of momentum terms included in the linear combination, and the index i = 1, …, 𝑁_ξ runs over the auxiliary variables.
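On a discrete trajectory, the encoding in Eq. (3.4) is just a learned linear filter over the momentum history; a sketch of this construction (hypothetical shapes, NumPy) is:

import numpy as np

def auxiliary_variables(p_traj, w):
    """p_traj: (T, m) momentum time series with spacing delta_t;
    w: (N_w, N_xi) encoder weights, column i giving the i-th feature.
    Returns xi of shape (T - N_w, N_xi, m) with
    xi_i(t) = sum_j w[j, i] * p(t - j * delta_t)."""
    T, m = p_traj.shape
    N_w, N_xi = w.shape
    xi = np.zeros((T - N_w, N_xi, m))
    for n in range(N_w, T):
        hist = p_traj[n - N_w:n][::-1]          # p(t - dt), ..., p(t - N_w dt)
        xi[n - N_w] = np.einsum('jm,ji->im', hist, w)
    return xi

p = np.random.randn(5000, 1)        # hypothetical 1D momentum series
w = np.random.randn(200, 4) * 0.1   # 4 features over a 200-step window
xi = auxiliary_variables(p, w)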
To simplify the process of training and data collection, we construct the matrix J(q) as

J(q) = [0, hᵀ(q); −h(q)M⁻¹(q), Ĵ(q)],

where h(q) is a vector represented by a neural network that takes the coarse-grained variable q as input. We then use a Cholesky-type decomposition to form Ĵ(q) = −L(q)L(q)ᵀ + A(q), where L(q) is a block-wise lower-triangular matrix and A(q) is a block-wise antisymmetric matrix, each represented by its own neural network. The second fluctuation-dissipation theorem can then be simplified as

Σ̂(q)Σ̂(q)ᵀ = −𝑘_B𝑇 (Ĵ(q)Λ̂ + Λ̂Ĵ(q)ᵀ),   (3.5)

where

𝚺(q) = [0, 0; 0, Σ̂(q)],   𝚲(q) = [M(q), 0; 0, Λ̂(q)].

With the constructed ξ, we can compute the correlation functions by multiplying both sides of Eq. (3.3) by p(0), conditioned on q(0) = q:

d[⟨p + ∇_q F, p(0) | q(0) = q⟩; ⟨ξ, p(0) | q(0) = q⟩] = J(q) [⟨∇_p F, p(0) | q(0) = q⟩; ⟨∇_ξ F, p(0) | q(0) = q⟩] d𝑡,   (3.6)

where the bracketed vector on the left-hand side is denoted C1(𝑡, q; w) and the one on the right-hand side C2(𝑡, q; w). By constructing J(q) as ˜J(q; θ_J) = ˜𝚺(q; θ_J)˜𝚺(q; θ_J)ᵀ, the loss function can be constructed as

min_{θ_J, w} Σ_{i=1}^{N_s} ‖ (d/d𝑡) C1(𝑡, q_i; w) − ˜J(q_i; θ_J) C2(𝑡, q_i; w) ‖² + Σ_{i=1}^{N_s} ‖ Λ̂(q_i; w) − I ‖²,   (3.7)

where we simultaneously optimize ˜J(q; θ_J) and the auxiliary variables ξ, which depend on w, over the training set {q_i}_{i=1}^{N_s}.
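To make the parameterization concrete, the sketch below builds Ĵ = −LLᵀ + A and recovers a noise amplitude satisfying Eq. (3.5) through a symmetric matrix square root; the small random matrices are hypothetical stand-ins for the neural-network outputs L(q) and A(q) at a fixed q:

import numpy as np

kBT = 1.0
n = 4                                        # toy size of the (p, xi) block
rng = np.random.default_rng(0)

L = np.tril(rng.standard_normal((n, n)))     # stand-in for the lower-triangular net L(q)
B = rng.standard_normal((n, n))
A = B - B.T                                  # stand-in for the antisymmetric net A(q)
J_hat = -L @ L.T + A                         # J-hat(q) = -L L^T + A

Lam_hat = np.eye(n)                          # covariance of xi, driven toward I by the loss
D = -kBT * (J_hat @ Lam_hat + Lam_hat @ J_hat.T)   # = 2 kBT L L^T >= 0 (A cancels)
w_eig, V = np.linalg.eigh(D)
Sigma_hat = V @ np.diag(np.sqrt(np.clip(w_eig, 0.0, None))) @ V.T  # Sigma Sigma^T = D
assert np.allclose(Sigma_hat @ Sigma_hat.T, D, atol=1e-8)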
The construction of the dynamics also depends on an accurate free energy surface 𝑈_eff(q) and mass matrix M(q). These can be computed from restrained dynamics by introducing a harmonic term into the full potential, i.e.,

U_k(Q, q) = U(Q) + (k/2) (𝜙_q(Q) − q)ᵀ(𝜙_q(Q) − q),

where k represents the magnitude of the restraining potential and U is the potential of the full system without restraint. The mean force can be computed as ∇𝑈_eff(q) = lim_{k→∞} F_k(q), and M(q) = lim_{k→∞} M_k(q), where

F_k(q) = (1/Z_k(q)) ∫ k (𝜙_q(Q) − q) exp(−𝛽U_k(Q, q)) dQ,
M_k(q) = (1/Z_k(q)) ∫ (1/2𝛽) (d𝜙_q(Q)/d𝑡)² exp(−𝛽U_k(Q, q)) dQ.

Two neural networks ˜M(q; θ_M) and ˜F(q; θ_F) are constructed to approximate M(q) and F(q), respectively, and are trained on the same dataset with the loss functions

min_{θ_M} Σ_{i=1}^{N_s} ‖ ˜M(q_i; θ_M) − M(q_i) ‖²,   min_{θ_F} Σ_{i=1}^{N_s} ‖ ˜F(q_i; θ_F) − F(q_i) ‖².

The sampling points are adaptively selected by the consensus-based enhanced sampling strategy Lyu and Lei (2023a), with a McKean-type stochastic differential equation for the 𝑁_w walkers q_1, …, q_{N_w}:

dq_i = −(1/𝛾) ∇G(q_i) d𝑡 + √(2/(𝜅_h𝛾)) dW_𝑡,   (3.8)

where G(q_i) = (1/2)(q_i − m)ᵀ V⁻¹ (q_i − m), with

m = Σ_{i=1}^{N_w} q_i p(q_i),   V = (𝜅_l + 𝜅_h) Σ_{i=1}^{N_w} (q_i − m)(q_i − m)ᵀ p(q_i),   p(q_i) = exp(−𝜅_l R(q_i)) / Σ_j exp(−𝜅_l R(q_j)).

Here R(q_i) represents the residual at the point q_i, which we choose to be ‖˜F(q_i; θ_F) − F(q_i)‖² in this project. The first term on the right-hand side of Eq. (3.8) is an exploitation term that uses current information to drive the sampler towards the maximum-residual region, while the second term, with high-temperature parameter 𝜅_h, explores the unknown regions. Notice that the residual also depends on the neural network parameters; by iteratively optimizing the neural network representation and the sampling points, we obtain a good training set over the phase space.

3.3 Numerical results

3.3.1 One-dimensional state-dependent reduced model of a polymer molecule

To illustrate the core concept of the present method, we begin with a polymer molecule with 𝑁 = 16 atoms, where the intramolecular potential is defined by

𝑉_mol(Q) = Σ_{i≠j}^{N} 𝑉_p(𝑄_{ij}) + Σ_{i=1}^{N_b} 𝑉_b(𝑙_i) + Σ_{i=1}^{N_a} 𝑉_a(𝜃_i) + Σ_{i=1}^{N_d} 𝑉_d(𝜙_i),   (3.9)

where 𝑉_p is the pairwise Lennard-Jones potential, 𝑉_b is the harmonic bond potential, and 𝑉_a and 𝑉_d denote the angle and dihedral-angle potentials, respectively. The end-to-end distance 𝑞1 = ∥Q1 − Q𝑁∥ is used as a collective variable in the 1D reduced dynamics framework to evaluate the effectiveness of the current method. We selected 25 distinct points uniformly from the range 𝑞1 ∈ [2, 18] to create our training set. Four auxiliary variables ξ_i are learned in both the standard GLE and our state-dependent GLE.

The overall velocity correlation function 𝐶𝑣𝑣(𝑡) = ⟨𝑣1(𝑡)𝑣1(0)⟩ is presented in Figure 3.1(a). Both the state-dependent GLE and the standard GLE align well with the MD result. We also present the encoder weights for the non-Markovian features obtained from the state-dependent GLE approximation in Figure 3.1(b). The encoder functions demonstrate pronounced oscillations at short times, reflecting the dynamic interactions present in the system during that initial period. As time progresses, these oscillations gradually decay to zero, indicating that the influence of these non-Markovian features diminishes at longer time scales. This behavior underscores the transient nature of the non-Markovian dynamics in the system.

Figure 3.1 Numerical results of a one-dimensional reduced model representing the dynamics of the end-to-end distance of a polymer molecule system. (a) Overall velocity correlation function 𝐶𝑣𝑣(𝑡) obtained from MD, the 4th-order standard GLE, and the 4th-order state-dependent GLE. (b) Encoder weights for the three non-Markovian features of the state-dependent GLE.
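Returning briefly to the sampling strategy of Section 3.2, one Euler–Maruyama step of Eq. (3.8) for the walker ensemble can be sketched as follows (the residual function and all parameter values are hypothetical placeholders):

import numpy as np

def consensus_step(q, R, dt=1e-2, gamma=1.0, kappa_l=10.0, kappa_h=5.0):
    """q: (N_w, d) walker positions; R: callable returning residuals (N_w,).
    One Euler--Maruyama step of the McKean-type SDE in Eq. (3.8)."""
    res = R(q)
    p = np.exp(-kappa_l * (res - res.min()))       # shift for numerical stability
    p /= p.sum()                                   # weights p(q_i)
    m = p @ q                                      # weighted mean
    dq = q - m
    V = (kappa_l + kappa_h) * np.einsum('i,ij,ik->jk', p, dq, dq)
    grad_G = dq @ np.linalg.inv(V)                 # grad of G = 1/2 dq^T V^{-1} dq
    noise = np.sqrt(2.0 * dt / (kappa_h * gamma)) * np.random.randn(*q.shape)
    return q - (dt / gamma) * grad_G + noise

q = np.random.rand(64, 1) * 16 + 2                 # 64 walkers on q1 in [2, 18]
q = consensus_step(q, R=lambda x: np.sum((x - 10.0) ** 2, axis=1))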
However, the fit of the overall velocity correlation function alone does not demonstrate the quality of the learned dynamics. We therefore compare in Figure 3.2 the conditional autocorrelation functions, with q starting from the 25 selected points, obtained from the standard GLE, the state-dependent GLE, and MD. This comparison reveals that the differences in diffusion behavior at different points of the phase space observed in MD are captured by the state-dependent GLE but not by the standard one. Inaccuracy in the diffusion process will in turn affect the precision of the transition process. Figure 3.2(a) illustrates the time distribution of 𝑞1 for values greater than 15, comparing results from MD simulations, the GLE, and the state-dependent GLE. The data reveal that the state-dependent GLE provides a closer match to the MD results than the standard GLE method.

Figure 3.2 Numerical results of a one-dimensional reduced model representing the dynamics of the end-to-end distance of a polymer molecule system. (a) Distribution of 𝑞1 > 15 obtained from the full MD simulations, the 4th-order GLE approximation, and the 4th-order state-dependent GLE approximation. (b–d) Conditional velocity correlation functions obtained from MD, the standard GLE, and the state-dependent GLE, respectively.

3.3.2 Two-dimensional state-dependent reduced model of an alanine dipeptide

We further demonstrate the effectiveness of our state-dependent reduced modeling framework using the alanine dipeptide molecule (Ace-Ala-Nme), commonly referred to as Ala2. The full-atom molecular dynamics (MD) simulation is performed for a solvated alanine dipeptide immersed in 383 explicit water molecules at 300 K, using the Amber99-SB force field and the TIP3P water model. A time step of 2.5 × 10⁻⁴ ps is used for numerical integration.

To reduce dimensionality, we adopt two dihedral angles as collective variables (CVs): the 𝜙 angle defined by atoms (C, N, C𝛼, C), and the 𝜓 angle defined by atoms (N, C𝛼, C, N). These angles provide a compact representation of the molecular conformations. A consensus-based sampling strategy selects 1,000 representative configurations from the MD trajectory to train the state-dependent generalized Langevin equation (GLE) model. At each configuration, we compute the conservative force, the effective mass, and the velocity autocorrelation function. Four auxiliary variables are introduced to close the non-Markovian system, forming a Markovian embedding that captures state-dependent memory effects. The model's accuracy is validated by comparing conditional momentum autocorrelation functions from two conformational regions.
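For reference, the dihedral CVs themselves follow the standard four-atom construction; a minimal sketch (the coordinates below are hypothetical; in practice p0–p3 are the backbone atoms listed above):

import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (radians) defined by four atomic positions."""
    b1, b2, b3 = p1 - p0, p2 - p1, p3 - p2
    n1 = np.cross(b1, b2)                  # normal of the first plane
    n2 = np.cross(b2, b3)                  # normal of the second plane
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    return np.arctan2(m1 @ n2, n1 @ n2)

phi = dihedral(*np.random.rand(4, 3))      # hypothetical positions of (C, N, C-alpha, C)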
As shown in Figure 3.3, the state-dependent GLE faithfully reproduces the MD results, including subtle oscillatory features that are missed by traditional GLE models with fixed memory kernels.

Figure 3.3 Numerical results of the two-dimensional reduced model in terms of the two dihedral angles of the alanine dipeptide system. (a–c) Conditional momentum auto-correlation functions obtained from full MD simulations and the 4th-order GLE approximation at (𝜙, 𝜓) = (−1.60, 2.78). (d–f) Conditional momentum auto-correlation functions at (𝜙, 𝜓) = (−2.90, −0.16).

We also compute the distribution of transition times between different local minima. Four local minima are selected in Figure 3.4. The distribution of transition times between point 0 and the other three points is shown in Figure 3.5.

Figure 3.4 Heatmap of the free energy surface 𝐺(𝜙, 𝜓). Colored solid circles mark four local minima of the configuration.

Figure 3.6 presents the distribution of time spent at each point, based on results from MD, the 4th-order GLE, and the 4th-order state-dependent GLE. The results demonstrate that our current method provides improved accuracy for each point compared to the GLE method.

3.4 Summary

The state-dependent generalized Langevin equation (GLE) is a useful tool for accurately describing the non-Markovian behavior of many processes arising in CGMD. In this study, we employed our previously developed consensus-based enhanced sampling strategy to simultaneously construct the heterogeneous memory kernel and the free energy surface. The conservative force is calculated using constrained dynamics at specific points in the phase space, while the dynamic information at these points is gathered through multiple free dynamics initiated from the same locations. We then train our neural network to capture the differences among the various conditional auto-correlation functions. The results demonstrate that our current method provides improved accuracy for each point compared to the GLE method.

Figure 3.5 Numerical results of a two-dimensional reduced model representing the two dihedral angles of the alanine dipeptide system. (a) Distribution of transition times from position 0 to position 1. (b) From position 1 to 0. (c) From position 0 to 2. (d) From position 2 to 0. (e) From position 0 to 3. (f) From position 3 to 0.

Figure 3.6 Numerical results of a two-dimensional reduced model representing the two dihedral angles of the alanine dipeptide system. (a) Distribution of time periods spent at position 0 before transitioning to position 1. (b–d) Distributions of time spent at positions 1, 2, and 3, respectively.

CHAPTER 4
GENERATIVE MODEL BASED IDENTIFYING METASTABLE STATES IN FULL MOLECULE SPACE

4.1 Introduction

Normalizing flows have gained significant traction in recent years as flexible generative models that provide exact likelihood evaluation and tractable sampling through invertible transformations between data and latent spaces. While highly expressive, their performance critically depends on the structure of the latent prior distribution. In most conventional settings—including many state-of-the-art normalizing flow architectures—a simple unimodal prior such as a standard multivariate Gaussian is employed.
This assumption works well for data distributions that are themselves unimodal or smoothly varying, but it becomes a substantial bottleneck in modeling systems characterized by multimodality, sharp transitions, or complex geometrical features in high-dimensional spaces. This issue is particularly critical in domains such as molecular dynamics or continuum mechanics, where data often arise from a mixture of metastable states or rare-event transitions. These systems naturally lead to multimodal distributions, with each mode representing a distinct macroscopic configuration or energy basin.

While recent developments in normalizing flows—such as NICE, RealNVP, and Glow Dinh et al. (2014, 2017); Kingma and Dhariwal (2018)—have significantly improved the expressivity of the transformation through architectural innovations like coupling layers and conditioning, they still rely on simple unimodal latent priors. As a result, such models are limited in their ability to identify and represent metastable states explicitly, since the prior structure does not reflect the inherent multimodality of the system. Moreover, during training, these models typically focus on maximizing the overall likelihood and do not incorporate gradient- or perturbation-based penalties to enforce key physical constraints, such as requiring the gradient of the log-density to vanish at latent maxima or ensuring that log-density values at these mapped maxima are indeed local maxima in data space. Without these constraints, the learned transformation may distort the latent structure and fail to preserve the correspondence between latent and data-space modes, ultimately limiting the interpretability and the metastable-state resolution of the model.

To address these limitations, we propose a generative modeling framework that uses a Mixture-of-Gaussians (MoG) prior to explicitly represent multiple metastable modes in the latent space. The goal is not merely to enhance expressivity, but to enforce a maximum-to-maximum correspondence between the latent space and the data space, ensuring that each latent mode is mapped to a high-density metastable state in the observed configuration space. To achieve this, we use an invertible transformation that preserves the structure of the distribution under the change of variables. While our implementation adopts KRNet for this transformation due to its flexibility and scalability, the approach is general and compatible with any expressive normalizing flow architecture. During training, we further impose gradient penalties to enforce vanishing gradients at mapped maxima, and contrastive perturbation penalties to ensure local maximality in data space. This strategy allows the model to capture and preserve the metastable structure inherent in complex physical systems, rather than simply fitting the data distribution in a likelihood sense.

Our design is inspired in part by recent advances in multimodal flow-based generative modeling, such as the bounded KRNet architecture introduced by Peng et al. (2023), which demonstrated that introducing structural constraints in the latent space can significantly improve the accuracy and interpretability of normalizing flows. Building on this line of thinking, our model introduces a Mixture-of-Gaussians (MoG) latent prior not merely for greater flexibility, but to explicitly encode multiple metastable modes that reflect the complex landscape of molecular systems.
Each Gaussian component captures a different region of the latent space, which is then mapped, via a KRNet transformation, into a distinct metastable basin in data space. To enforce this mode-to-mode correspondence, we introduce gradient penalties to drive the log-density gradient toward zero at each latent mode, and contrastive perturbation losses to ensure that these mapped points are true maxima in the data space.

Beyond improving generative performance, this design also enhances scientific interpretability. The multimodal latent structure enables soft clustering of generated configurations, where each mode can be associated with meaningful collective variables (CVs) such as torsion angles, end-to-end distances, or the radius of gyration. This provides insight into the system's metastable organization and helps identify the slow reaction coordinates that govern long-time dynamics.

In summary, our MoG-based KRNet formulation introduces a novel framework for aligning latent and data-space maxima, enabling both accurate density modeling and interpretable discovery of metastable structure in high-dimensional molecular data.

4.2 Method

4.2.1 Overview of the MoG-KRnet Framework

MoG-KRnet is a bijective generative model that constructs an invertible mapping 𝑓 : R^d → R^d transforming samples from a hybrid latent distribution 𝑝_Z to the data distribution 𝑝_X. The key idea is to approximate a transport map that rearranges a tractable base measure into a complex, potentially multimodal data distribution. For a given observation x ∈ R^d, the model defines the log-density through the change-of-variables formula

log 𝑝_X(x) = log 𝑝_Z(𝑓(x)) + log |det 𝐽_𝑓(x)|,

where 𝐽_𝑓(x) = ∇_x 𝑓(x) ∈ R^{d×d} denotes the Jacobian of 𝑓. This formulation permits exact likelihood evaluation and allows optimization via maximum likelihood estimation.

4.2.2 Hybrid Latent Prior

The latent variable z ∈ R^d is decomposed into two independent blocks, denoted z1 ∈ R^{d1} and z2 ∈ R^{d2}, with d1 + d2 = d. The first block z1 follows a product of one-dimensional mixtures of Gaussians,

𝑝(z1) = Π_{j=1}^{d1} Σ_{k=1}^{K_j} 𝜋_{j,k} · N(𝑧_j; 𝜇_{j,k}, 𝜎²_{j,k}),

while the second block z2 ∼ N(0, I_{d2}) is standard Gaussian. This hybrid prior combines multimodal expressiveness with analytical tractability and defines the target measure for the flow map.

4.2.3 KRnet Architecture as Progressive Triangular Transport

Inspired by the Knothe–Rosenblatt rearrangement, MoG-KRnet factorizes the transformation 𝑓 into a sequence of stage-wise maps,

𝑓 = 𝑓^{(K)} ∘ 𝑓^{(K−1)} ∘ ⋯ ∘ 𝑓^{(1)},

where each stage 𝑓^{(k)} updates a block of coordinates while conditioning on the preceding ones, approximating a triangular transport structure. Each 𝑓^{(k)} is implemented as a composition of transformations,

𝑓^{(k)} = S^{(k)} ∘ A^{(k)} ∘ N^{(k)} ∘ R^{(k)},

where R^{(k)} is a linear transformation with learnable LU structure, N^{(k)} is an actnorm layer that ensures zero mean and unit variance per dimension (with learnable parameters), A^{(k)} is a stack of affine coupling layers (described in Section 4.2.4), and S^{(k)} performs a squeezing operation that reallocates dimension usage over stages. This layered composition ensures that the full Jacobian 𝐽_𝑓 remains triangular or block-triangular, allowing for efficient computation of log |det 𝐽_𝑓| as a sum over the individual layers.

4.2.4 Affine Coupling Transformations

Each affine coupling layer partitions the input x = [x1, x2] and updates x2 using a scale-and-shift transformation conditioned on x1:

x2′ = x2 ⊙ (1 + 𝛼 · tanh(𝑠(x1))) + 𝛾 · tanh(𝑡(x1)).
Here, 𝑠 and 𝑡 are neural networks; 𝛼 ∈ (0, 1) is a fixed stability parameter (e.g., 0.6); and 𝛾 ∈ R^{d′} is a learnable global vector. The inverse transformation is analytically computable, ensuring exact invertibility. The log-determinant of the Jacobian for each coupling layer is efficiently computed as

log |det 𝐽_A| = Σ_{i=1}^{d′} log(1 + 𝛼 · tanh(𝑠_i(x1))),

which contributes additively to the total log-likelihood.

4.2.5 Mode Alignment Regularization

We introduce a geometric regularization mechanism to promote semantic alignment between the latent and data spaces. Let z_max ∈ R^d be the point in latent space corresponding to the global mode of the prior. We define it as the concatenation of the means of the dominant mixture components in z1 and the zero vector in z2:

z_max = [𝜇*_1, …, 𝜇*_{d1}, 0, …, 0],

where 𝜇*_j is the mean of the most probable component for dimension j. We compute the corresponding data-space mode as x_max = 𝑓⁻¹(z_max). To encourage x_max to align with a mode of the data distribution, we introduce two regularization terms. The first is a gradient penalty,

L_grad = ‖ ∇_x log 𝑝_X(x) |_{x=x_max} ‖²,

which encourages stationarity of the log-density at x_max. The second is a local contrastive penalty defined over perturbed neighborhoods,

L_contrast = (1/M) Σ_{i=1}^{M} max(0, log 𝑝_X(x_i^neigh) − log 𝑝_X(x_max)),

where x_i^neigh = x_max + 𝜖_i and 𝜖_i ∼ N(0, 𝜎²I). This enforces that x_max is not only a critical point, but a local maximum.

4.2.6 Objective Function

The total loss function used for training is the sum of the negative log-likelihood and the mode-alignment penalties:

L = −E_{x∼D} log 𝑝_X(x) + 𝜆_grad · L_grad + 𝜆_contrast · L_contrast.

Here, 𝜆_grad, 𝜆_contrast ≥ 0 control the strength of the regularization terms.

4.2.7 Sampling and Inference

For density evaluation, an input x ∈ R^d is mapped through the flow to obtain z = 𝑓(x), and the log-likelihood is computed via

log 𝑝_X(x) = log 𝑝_Z(z) + Σ_{k=1}^{K} log |det 𝐽_{𝑓^{(k)}}|.

Each sub-map contributes a triangular Jacobian, so the determinant is computed in linear time. The prior log-density is decomposed as

log 𝑝_Z(z) = Σ_{j=1}^{d1} log( Σ_{k=1}^{K_j} 𝜋_{j,k} · N(𝑧_j; 𝜇_{j,k}, 𝜎²_{j,k}) ) − (1/2)‖z2‖² − (d2/2) log(2𝜋).

To generate samples, latent vectors z ∼ 𝑝_Z are drawn by sampling each MoG dimension 𝑧_j from its categorical mixture and the Gaussian block from a standard normal. The resulting z is passed through the inverse flow x = 𝑓⁻¹(z), which is exact and fully differentiable.

4.3 Numerical Result

4.3.1 Approximation of the Müller-Brown Equilibrium Distribution

We first evaluate the capacity of MoG-KRnet to approximate a complex, multimodal target distribution arising from the well-known Müller-Brown potential—a classical benchmark in molecular simulation that features multiple metastable wells separated by high-energy barriers. The target equilibrium distribution is the Boltzmann–Gibbs measure

𝑝_X(x) = (1/Z) exp(−𝑈(x)/(𝑘_B𝑇)),

where 𝑈(x) denotes the potential energy, 𝑇 = 30 K, and x ∈ R². We simulate overdamped Langevin dynamics,

dx/d𝑡 = −∇𝑈(x) + √(2𝑘_B𝑇) · ξ(𝑡),

to generate a reference dataset of 5 million samples. These samples serve as empirical draws from the true equilibrium distribution 𝑝_X. The model is instantiated as a single-stage KRnet with depth 𝐷 = 64, input dimension 𝑑 = 2, and coupling width 256.
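To fix ideas, the affine coupling map of Section 4.2.4 and its log-Jacobian contribution can be sketched in a few lines for the two-dimensional case used in this experiment (the linear maps W_s and W_t are stand-ins for the trained two-layer networks 𝑠 and 𝑡):

import numpy as np

alpha, gamma = 0.6, 0.1
W_s = np.random.randn(1, 1); W_t = np.random.randn(1, 1)   # stand-ins for s, t

def coupling_forward(x):
    """x: (B, 2) batch; updates x2 conditioned on x1 (d = 2 toy case)."""
    x1, x2 = x[:, :1], x[:, 1:]
    s = np.tanh(x1 @ W_s.T)                 # tanh(s(x1))
    t = np.tanh(x1 @ W_t.T)                 # tanh(t(x1))
    scale = 1.0 + alpha * s                 # positive since alpha in (0, 1)
    y2 = x2 * scale + gamma * t
    logdet = np.sum(np.log(scale), axis=1)  # log|det J| per sample
    return np.concatenate([x1, y2], axis=1), logdet

def coupling_inverse(y):
    y1, y2 = y[:, :1], y[:, 1:]
    s = np.tanh(y1 @ W_s.T); t = np.tanh(y1 @ W_t.T)
    x2 = (y2 - gamma * t) / (1.0 + alpha * s)   # exact analytic inverse
    return np.concatenate([y1, x2], axis=1)

x = np.random.randn(8, 2)
y, ld = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)      # exact invertibility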
Each flow stage includes alternating affine coupling layers with learnable LU-based linear transformations and interleaved squeezing operations. The coupling layers implement

z2′ = z2 ⊙ (1 + 𝛼 · tanh(𝑠(z1))) + 𝛾 · tanh(𝑡(z1)),

with 𝛼 = 0.6, and with 𝑠, 𝑡 being two-layer neural networks with ReLU activations. The latent prior 𝑝_Z(z) consists of the product of a one-dimensional mixture of Gaussians,

𝑝(𝑧1) = Σ_{k=1}^{2} 𝜋_k · N(𝑧1; 𝜇_k, 𝜎²_k),   𝜋 = [0.6, 0.4],   𝜇 = [−2, 2],   𝜎 = [1.0, 0.5],

and a standard Gaussian 𝑧2 ∼ N(0, 1).

In addition to maximum likelihood estimation, we introduce geometric regularization to ensure that the high-density region of the latent prior is mapped to a corresponding mode in the target space. This is done via two penalty terms:

1. A local contrastive penalty encourages the log-probability at the mapped prior mode x_max to exceed that of its local neighbors:

L_contrast = Σ_{i=1}^{M} max(0, log 𝑝_X(x_i^neigh) − log 𝑝_X(x_max)),

where x_i^neigh = x_max + 𝜖_i and 𝜖_i ∼ N(0, 𝜎²I).

2. A gradient penalty enforces ∇ log 𝑝_X(x_max) ≈ 0:

L_grad = ‖ ∇_x log 𝑝_X(x) |_{x=x_max} ‖₁.

The total objective is

L = −E_{x∼D}[log 𝑝_X(x)] + 𝜆_contrast L_contrast + 𝜆_grad L_grad.

Training is performed using the Adam optimizer with an initial learning rate of 1 × 10⁻⁴, which is decayed by 10% every 5000 iterations. Penalty coefficients 𝜆_contrast, 𝜆_grad, and 𝜆_align are initialized at small values and gradually increased (by 5% every 5000 iterations) to guide the flow toward stable mode alignment while preventing instability in early optimization. Minibatches of 2D coordinates are drawn from preprocessed subsets {set_j}_{j=0}^{19}, and the entire training loop runs for 100,000 steps. Gradients are clipped using global-norm clipping with a threshold of 0.01. Checkpoints are saved regularly and used for restarting long runs.

Figure 4.1 shows the transport map learned by MoG-KRnet. On the left, samples drawn from the latent space exhibit two clearly separated high-density regions corresponding to distinct components of the mixture-of-Gaussians prior. After training, these two latent modes are transported via the inverse flow 𝑓⁻¹ into two distinct basins of the Müller-Brown energy landscape, shown on the right. This demonstrates that the model not only captures the overall multimodal structure of the target distribution but also learns a semantically consistent transport: each latent mode is mapped to a specific metastable state of the physical system. The smoothness and separation of the transformed samples reflect that the triangular KRnet flow—coupled with the hybrid prior and geometric regularization—successfully avoids mode collapse and learns a one-to-one correspondence between latent and physical modes.

Such mode-resolving behavior is difficult to achieve with standard normalizing flows that rely on unimodal priors. In contrast, MoG-KRnet leverages the flexibility of the mixture components to assign and map separated probability mass to different energetic regions in a physically meaningful way. This mode alignment improves both sampling fidelity and interpretability, especially in systems where multiple competing basins dominate the dynamics.

Figure 4.1 Learned transport from latent space to physical space using MoG-KRnet. The left half of the image shows samples drawn from the latent distribution, which has two distinct high-density regions due to the mixture-of-Gaussians prior. The right half shows the transformed samples under the inverse flow 𝑓⁻¹, which accurately maps the two latent modes into the two metastable basins of the Müller-Brown potential. This demonstrates the model's ability to perform semantically aligned mode separation, transporting distinct regions of latent mass to physically meaningful targets.
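For completeness, drawing from the hybrid latent prior used in this experiment and evaluating its log-density is straightforward; a minimal sketch with the mixture parameters quoted above:

import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.6, 0.4]); mu = np.array([-2.0, 2.0]); sig = np.array([1.0, 0.5])

def sample_prior(n):
    """Draw z = (z1, z2): z1 from the 1D mixture of Gaussians, z2 ~ N(0, 1)."""
    k = rng.choice(2, size=n, p=pi)          # categorical component selection
    z1 = rng.normal(mu[k], sig[k])
    z2 = rng.standard_normal(n)
    return np.stack([z1, z2], axis=1)

def log_prior(z):
    comp = pi * np.exp(-0.5 * ((z[:, :1] - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))
    return np.log(comp.sum(axis=1)) - 0.5 * z[:, 1] ** 2 - 0.5 * np.log(2 * np.pi)

z = sample_prior(1000)
print(log_prior(z).mean())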
4.3.2 Approximation of the Alanine Dipeptide Equilibrium Distribution

To further assess the scalability and generalization capacity of MoG-KRnet, we apply it to approximate the equilibrium distribution of a higher-dimensional molecular system: alanine dipeptide in explicit solvent. This molecule is a well-known testbed in molecular simulation due to its low dimensionality yet rich conformational landscape, characterized by transitions between metastable states in the Ramachandran (𝜙, 𝜓)-angle space and the full Cartesian coordinates of selected atoms.

We generate reference samples for alanine dipeptide (Ace-Ala-Nme, commonly referred to as Ala2) via full-atom molecular dynamics (MD) simulation in explicit solvent. The simulation system consists of the alanine dipeptide molecule immersed in 383 TIP3P water molecules. The simulation is carried out at 350 K using the Amber99-SB force field and Langevin dynamics for temperature control. A time step of 2.5 × 10⁻⁴ ps is used for numerical integration. Trajectories are collected from equilibrium simulations and projected onto a reduced coordinate space consisting of the Cartesian positions of selected heavy atoms. In total, 5 million configurations are used to construct the dataset D_ala ∼ 𝑝_X, representing the high-dimensional equilibrium distribution over molecular conformations.

The MoG-KRnet model is constructed to map the equilibrium distribution of alanine dipeptide into a structured latent space. The model input is a 15-dimensional vector representing the Cartesian coordinates of five key atoms—[5, 7, 9, 15, 17]—selected to capture relevant backbone conformational fluctuations while avoiding redundant degrees of freedom. This atom selection encompasses chemically meaningful internal coordinates, including both 𝜙 and 𝜓 torsions, as well as spatial end-to-end geometry.

The flow transformation 𝑓 : R¹⁵ → R¹⁵ consists of 7 staged mappings inspired by the pseudo-triangular structure of the Knothe–Rosenblatt rearrangement. Each stage contains 24 affine coupling layers with hidden width 128, supplemented by actnorm, LU-based rotations, and squeezing operations. This progressive architecture allows early layers to resolve nonlinear, multimodal features in dominant subspaces, while later layers refine the global geometry.

The latent prior 𝑝_Z is a hybrid of three independent one-dimensional mixtures of Gaussians (MoG) and a standard multivariate Gaussian:

𝑝_Z(z) = Π_{j=1}^{3} Σ_{k=1}^{K_j} 𝜋_{j,k} · N(𝑧_j; 𝜇_{j,k}, 𝜎²_{j,k}) · N(z_{4:15}; 0, I12).

This design reflects the assumption that a small number of latent coordinates capture discrete conformational transitions (e.g., basin hopping), while the remaining degrees of freedom reflect continuous fluctuations in local structure. The independence of the latent dimensions facilitates efficient sampling and interpretable mode decomposition. Latent samples are drawn by independently sampling each MoG dimension using categorical selection followed by Gaussian sampling, and appending a standard normal vector for the Gaussian block.

Training is performed on a dataset of 5 million MD samples using a composite loss:

L_total = L_NLL + 𝜆_grad L_grad + 𝜆_contrast L_contrast + L_rep.
Here, L_NLL is the negative log-likelihood, L_grad enforces stationarity of the log-density at the mapped latent maxima, and L_contrast ensures local-neighborhood contrast. To prevent mode collapse, a pairwise repulsion loss L_rep is introduced between the mapped maxima {x_max^{(i)}} using a soft margin criterion. Training runs for 200,000 iterations with the Adam optimizer, an initial learning rate of 10⁻⁷, and dynamically scaled penalty weights. During training, KDE diagnostics are periodically used to ensure that generated samples recover the correct distributions in torsional and Euclidean observables.

In Fig. 4.2, we compare the predicted marginal densities of the 𝜙 and 𝜓 dihedral angles from KRnet to reference MD histograms. MoG-KRnet accurately reproduces all modes in both coordinates and captures the correct relative amplitudes. In particular, the sharp peak near 𝜓 ≈ 3.0 is learned precisely, and the multimodality in 𝜙 is preserved without mode collapse. To assess the learned joint dependencies, we visualize the 2D density over (𝜙, 𝜓) in Fig. 4.3, along with the eight mode points mapped from latent maxima. The conformational basins of Ala2 are clearly recovered, and the mode points are well separated, each landing within distinct high-density regions. This confirms that MoG-KRnet not only fits the data globally but also identifies meaningful latent structure.

Figure 4.2 Comparison of 1D dihedral angle marginals between KRnet (red) and MD ground truth (blue). (a) Marginal distribution of 𝜙. (b) Marginal distribution of 𝜓.

Figure 4.3 Predicted equilibrium features of alanine dipeptide from the trained MoG-KRnet model. (a) 𝜙–𝜓 dihedral distribution of Ala2; warm background indicates predicted density, and overlaid points mark mapped latent maxima. (b) End-to-end distance vs. radius of gyration (𝑅𝑔) from sampled configurations; latent maxima are projected onto this plane.

An important feature of our MoG-KRnet framework is the explicit mapping of latent density modes to high-probability basins in configuration space. By construction, each mode of the hybrid prior z_max^{(i)} ∈ R^d is mapped through the inverse flow 𝑓⁻¹ to a corresponding point x_max^{(i)} ∈ R^d. These mapped maxima are designed to coincide with peaks in the learned data distribution 𝑝_X(x), enforced during training via gradient and contrastive penalties. As shown in Figure 4.3, this alignment holds in both the dihedral angle space (𝜙, 𝜓) and in structural coordinates. In the left panel, each mapped latent mode falls within a distinct conformational basin in the (𝜙, 𝜓) landscape, indicating that MoG-KRnet captures metastability through a structured latent space. In the right panel, the same latent maxima are distributed across the manifold of end-to-end distance and radius of gyration, further supporting the physical consistency and geometric expressiveness of the learned model. These results confirm that our framework not only generates accurate samples but also provides a semantically meaningful latent representation that aligns with physically interpretable features of the molecular system.

4.4 Summary

In this work, we proposed MoG-KRnet, a novel flow-based generative framework designed for approximating high-dimensional equilibrium distributions in complex molecular systems. Our approach builds upon the theory of invertible transformations and exploits a staged Knothe–Rosenblatt-inspired architecture to progressively map structured latent representations to physical configuration space.
A distinguishing feature of MoG-KRnet is its use of a hybrid latent prior that combines independent one-dimensional mixture-of-Gaussians (MoG) components with standard Gaussian variables. This formulation enables the model to flexibly represent multi-modal distributions while maintaining computational tractability and efficient sampling.

To ensure meaningful correspondence between latent and physical modes, we introduced a mode-alignment strategy during training. This involves constructing the prior such that each latent mode z_max^{(i)} encodes a distinct peak in the latent density, and then enforcing through loss penalties that each of these modes is mapped to a high-probability region x_max^{(i)} = 𝑓⁻¹(z_max^{(i)}) in the observed space. This is achieved via gradient-based penalties to minimize ‖∇_x log 𝑝_X(x_max)‖, and contrastive penalties to ensure that x_max is indeed a local maximum compared to its neighbors. This alignment strategy imbues the model with semantic coherence and supports downstream interpretability.

The efficacy of MoG-KRnet was demonstrated on two benchmark systems. For the Müller-Brown potential, we showed that the model accurately captured the bimodal equilibrium distribution and learned to associate each latent mode with a distinct physical basin. For the more complex alanine dipeptide molecule in explicit solvent, the model was trained on 5 million full-atom MD snapshots and succeeded in learning the joint equilibrium distribution over both angular variables (dihedral angles 𝜙, 𝜓) and structural observables (end-to-end distance and radius of gyration). In all cases, the mapped latent maxima landed squarely in dominant high-density regions of the data space, confirming the success of the mode-to-basin alignment. Furthermore, generated samples from the model reproduced the marginal and joint distributions of key physical features with high fidelity, closely matching empirical histograms derived from MD data.

Together, these contributions underscore the dual strengths of MoG-KRnet: the capacity to approximate complex, multi-modal densities in high dimensions, and the ability to structure the latent space in a physically meaningful and interpretable manner. Our results demonstrate that MoG-KRnet provides not only a powerful generative model but also a principled tool for reduced representation of molecular systems, where the mapping from latent to physical coordinates respects the underlying metastable structure of the dynamics. This makes it particularly well-suited for tasks in coarse-grained modeling, statistical reweighting, and uncertainty-aware exploration of equilibrium configurations.

CHAPTER 5
CONCLUSION

This thesis presents a unified, data-driven framework for constructing reduced-order models of high-dimensional, non-Markovian dynamical systems. By integrating advances in memory-aware modeling and normalizing flow-based latent representations, we address two central challenges in coarse-grained modeling: accurately capturing long-time correlations and resolving complex, multi-modal equilibrium distributions.

We began by developing a novel learning-based approach to non-Markovian stochastic reduced modeling. By augmenting the resolved dynamics with a set of learned auxiliary variables—interpretable as non-Markovian features—we showed that the complex memory effects embedded in full-atom molecular simulations can be faithfully captured without directly estimating memory kernels.
This framework builds on the Mori–Zwanzig formalism and circumvents conventional kernel fitting by matching correlation functions in an extended variable space. Numerical results on tagged particles and polymer chains demonstrated excellent agreement with full molecular dynamics (MD) simulations, validating the expressiveness and robustness of the proposed models.

We extended this methodology to incorporate state-dependent memory kernels, thereby enabling more realistic dynamics in systems with heterogeneous free energy landscapes. Our framework captures local variations in unresolved degrees of freedom and accommodates basin-specific relaxation times and noise structures. Through simulations on polymer systems, we observed significant improvements in predictive accuracy and sampling fidelity compared to global or fixed-kernel models.

To handle the challenge of modeling complex equilibrium distributions, we introduced a new Mixture-of-Gaussians (MoG) KRNet architecture—termed KRnet-MoG-GLE—as a probabilistic generative model for full molecular dynamics. By replacing the unimodal latent prior with a flexible MoG prior, our model gains the capacity to represent multiple metastable basins and generate samples that better match empirical distributions. Importantly, we designed the training to enforce mode alignment, ensuring that the maxima of the latent variables map to high-probability regions in the data space. This was demonstrated convincingly in the Müller-Brown and alanine dipeptide systems, where our model successfully resolved distinct basins and accurately approximated target observables such as end-to-end distance, radius of gyration, and dihedral angles.

Taken together, the contributions of this work represent a significant step forward in the design of interpretable, memory-embedded, and generative reduced-order models. By marrying the strengths of stochastic modeling with expressive latent-variable architectures, our approach enables efficient exploration and inference in systems that are otherwise intractable due to their high dimensionality and long memory effects. Ultimately, this thesis lays the foundation for data-driven, physically consistent reduced models that can serve as scalable surrogates for multi-scale simulations, with broad applicability across molecular biophysics, soft materials, and beyond.

BIBLIOGRAPHY

Ayaz, C., Dalton, B. A., and Netz, R. R. (2022). Generalized Langevin equation with a non-linear potential of mean force and non-linear memory friction from a hybrid projection scheme. Physical Review E, 105:054138.
Baczewski, A. D. and Bond, S. D. (2013). Numerical integration of the extended variable generalized Langevin equation with a positive Prony representable memory kernel. The Journal of Chemical Physics, 139(4):044107.
Barducci, A., Bussi, G., and Parrinello, M. (2008). Well-tempered metadynamics: A smoothly converging and tunable free-energy method. Physical Review Letters, 100(2):020603.
Bittracher, A., Koltai, P., Klus, S., Banisch, R., Dellnitz, M., and Schütte, C. (2018). Transition manifolds of complex metastable systems. Journal of Nonlinear Science, 28(2):471–512.
Bonati, L., Zhang, Y.-Y., and Parrinello, M. (2019). Neural networks-based variationally enhanced sampling. Proceedings of the National Academy of Sciences, 116(36):17641–17647.
Ceriotti, M., Bussi, G., and Parrinello, M. (2009). Langevin equation with colored noise for constant-temperature molecular dynamics simulations. Physical Review Letters, 102(2):020601.
Chen, M., Li, X., and Liu, C. (2014). Computation of the memory functions in the generalized Langevin models for collective dynamics of macromolecules. The Journal of Chemical Physics, 141:064112.
Chiavazzo, E., Covino, R., Coifman, R. R., Gear, C. W., Georgiou, A. S., Hummer, G., and Kevrekidis, I. G. (2017). Intrinsic map dynamics exploration for uncharted effective free-energy landscapes. Proceedings of the National Academy of Sciences, 114(28):E5494–E5503.
Chorin, A. J., Hald, O. H., and Kupferman, R. (2002). Optimal prediction with memory. Physica D, 166:239–257.
Chorin, A. J. and Lu, F. (2015). Discrete approach to stochastic parametrization and dimension reduction in nonlinear dynamics. Proceedings of the National Academy of Sciences, 112(32):9804–9809.
Coifman, R. R., Kevrekidis, I. G., Lafon, S., Maggioni, M., and Nadler, B. (2008). Diffusion maps, reduction coordinates, and low dimensional representation of stochastic systems. Multiscale Modeling & Simulation, 7(2):842–864.
Corless, M. and Frazho, A. (2003). Linear Systems and Control: An Operator Perspective. Chapman & Hall/CRC Pure and Applied Mathematics. Taylor & Francis.
Crosskey, M. and Maggioni, M. (2017). ATLAS: A geometric approach to learning high-dimensional stochastic systems near manifolds. Multiscale Modeling & Simulation, 15(1):110–156.
Daldrop, J. O., Kowalik, B. G., and Netz, R. R. (2017). External potential modifies friction of molecular solutes in water. Physical Review X, 7(4):041065.
Darve, E. and Pohorille, A. (2001). Calculating free energies using average force. The Journal of Chemical Physics, 115(20):9169–9183.
Darve, E., Solomon, J., and Kia, A. (2009a). Computing generalized Langevin equations and generalized Fokker–Planck equations. Proceedings of the National Academy of Sciences, 106(27):10884–10889.
Darve, E., Solomon, J., and Kia, A. (2009b). Computing generalized Langevin equations and generalized Fokker–Planck equations. Proceedings of the National Academy of Sciences, 106(27):10884–10889.
Davtyan, A., Dama, J. F., Voth, G. A., and Andersen, H. C. (2015a). Dynamic force matching: A method for constructing dynamical coarse-grained models with realistic time dependence. The Journal of Chemical Physics, 142.
Davtyan, A., Dama, J. F., Voth, G. A., and Andersen, H. C. (2015b). Dynamic force matching: A method for constructing dynamical coarse-grained models with realistic time dependence. The Journal of Chemical Physics, 142.
de Gennes, P. (1979). Scaling Concepts in Polymer Physics. Cornell University Press.
Dibak, M., del Razo, M. J., De Sancho, D., Schütte, C., and Noé, F. (2018). MSM/RD: Coupling Markov state models of molecular kinetics with reaction-diffusion simulations. The Journal of Chemical Physics, 148(21):214107.
Dinh, L., Krueger, D., and Bengio, Y. (2014). NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516.
Dinh, L., Sohl-Dickstein, J., and Bengio, S. (2017). Density estimation using Real NVP. In International Conference on Learning Representations (ICLR).
Español, P. (2004). Statistical mechanics of coarse-graining. In Novel Methods in Soft Matter Simulations, pages 69–115. Springer.
Fang, L., Ge, P., Zhang, L., E, W., and Lei, H. (2022). DeePN2: A deep learning-based non-Newtonian hydrodynamic model. Journal of Machine Learning, 1:114–140.
Feng, L., Gao, T., Dai, M., and Duan, J. (2022). Auto-SDE: Learning effective reduced dynamics from data-driven stochastic dynamical systems. arXiv preprint arXiv:2205.04151.
Frenkel, D. and Smit, B. (2001). Understanding Molecular Simulation: From Algorithms to Applications, volume 1. Elsevier.
Ge, P., Zhang, Z., and Lei, H. (2024). Data-driven learning of the generalized Langevin equation with state-dependent memory. Physical Review Letters, 133(7):077301.
Giannakis, D. (2019). Data-driven spectral decomposition and forecasting of ergodic dynamical systems. Applied and Computational Harmonic Analysis, 47(2):338–396.
Grogan, F., Lei, H., Li, X., and Baker, N. A. (2020). Data-driven molecular modeling with the generalized Langevin equation. Journal of Computational Physics, 418:109633–109641.
Harlim, J., Jiang, S. W., Liang, S., and Yang, H. (2020). Machine learning for prediction with missing dynamics. Journal of Computational Physics, page 109922.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Hudson, T. and Li, X. H. (2020). Coarse-graining of overdamped Langevin dynamics via the Mori–Zwanzig formalism. Multiscale Modeling & Simulation, 18(2):1113–1135.
Jung, G., Hanke, M., and Schmid, F. (2017). Iterative reconstruction of memory kernels. Journal of Chemical Theory and Computation, 13(6):2481–2488.
Kingma, D. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
Kingma, D. P. and Dhariwal, P. (2018). Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems (NeurIPS).
Klippenstein, V., Tripathy, M., Jung, G., Schmid, F., and van der Vegt, N. F. (2021). Introducing memory in coarse-grained molecular simulations. The Journal of Physical Chemistry B, 125(19):4931–4954.
Klus, S., Nüske, F., Koltai, P., Wu, H., Kevrekidis, I., Schütte, C., and Noé, F. (2018). Data-driven model reduction and transfer operator approximation. Journal of Nonlinear Science, 28(3):985–1010.
Klus, S., Nüske, F., Peitz, S., Niemann, J.-H., Clementi, C., and Schütte, C. (2020). Data-driven approximation of the Koopman generator: Model reduction, system identification, and control. Physica D: Nonlinear Phenomena, 406:132416.
Koopman, B. O. (1931). Hamiltonian systems and transformation in Hilbert space. Proceedings of the National Academy of Sciences, 17(5):315–318.
Kowalik, B., Daldrop, J. O., Kappler, J., Schulz, J. C., Schlaich, A., and Netz, R. R. (2019). Memory-kernel extraction for different molecular solutes in solvents of varying viscosity in confinement. Physical Review E, 100(1):012126.
Krivov, S. V. (2013). On reaction coordinate optimality. Journal of Chemical Theory and Computation, 9(1):135–146.
Kubo, R. (1966). The fluctuation-dissipation theorem. Reports on Progress in Physics, 29(1):255–284.
Kumar, S., Rosenberg, J. M., Bouzida, D., Swendsen, R. H., and Kollman, P. A. (1992). The weighted histogram analysis method for free-energy calculations on biomolecules. I. The method. Journal of Computational Chemistry, 13(8):1011–1021.
Laio, A. and Parrinello, M. (2002). Escaping free-energy minima. Proceedings of the National Academy of Sciences, 99(20):12562–12566.
Lange, O. F. and Grubmüller, H. (2006). Collective Langevin dynamics of conformational motions in proteins. The Journal of Chemical Physics, 124(21):214903.
Lee, H. S., Ahn, S.-H., and Darve, E. F. (2019). The multi-dimensional generalized Langevin equation for conformational motion of proteins. The Journal of Chemical Physics, 150(17):174113.
Lei, H., Baker, N. A., and Li, X. (2016). Data-driven parameterization of the generalized Langevin equation. Proceedings of the National Academy of Sciences, 113(50):14183–14188.
Lei, H. and Li, X. (2021). Petrov–Galerkin methods for the construction of non-Markovian dynamics preserving nonlocal statistics. The Journal of Chemical Physics, 154(18):184108.
Lei, H., Wu, L., and E, W. (2020). Machine learning based non-Newtonian fluid model with molecular fidelity. Physical Review E, 102:043309.
Li, W. and Ma, A. (2014). Recent developments in methods for identifying reaction coordinates. Molecular Simulation, 40(10-11):784–793.
Lin, K. K. and Lu, F. (2021). Data-driven model reduction, Wiener projections, and the Koopman–Mori–Zwanzig formalism. Journal of Computational Physics, 424:109864.
Lu, J. and Vanden-Eijnden, E. (2014). Exact dynamical coarse-graining without time-scale separation. The Journal of Chemical Physics, 141(4):044109.
Lyu, L. and Lei, H. (2023a). Consensus-based construction of high-dimensional free energy surface. arXiv preprint arXiv:2311.05009.
Lyu, L. and Lei, H. (2023b). Construction of coarse-grained molecular dynamics with many-body non-Markovian memory. Physical Review Letters, 131(17):177301.
Ma, C., Wang, J., and E, W. (2018). Model reduction with memory and the machine learning of dynamical systems. Communications in Computational Physics, 25(4):947–962.
Ma, L., Li, X., and Liu, C. (2019). Coarse-graining Langevin dynamics using reduced-order techniques. Journal of Computational Physics, 380:170–190.
Maragliano, L. and Vanden-Eijnden, E. (2006). A temperature accelerated method for sampling free energy and determining reaction pathways in rare events simulations. Chemical Physics Letters, 426(1):168–175.
Maragliano, L. and Vanden-Eijnden, E. (2008). Single-sweep methods for free energy calculations. The Journal of Chemical Physics, 128(18):184110.
Mori, H. (1965a). A continued-fraction representation of the time-correlation functions. Progress of Theoretical Physics, 34:399–416.
Mori, H. (1965b). Transport, collective motion, and Brownian motion. Progress of Theoretical Physics, 33(3):423–455.
Noid, W. G., Chu, J.-W., Ayton, G. S., Krishna, V., Izvekov, S., Voth, G. A., Das, A., and Andersen, H. C. (2008). The multiscale coarse-graining method. I. A rigorous bridge between atomistic and coarse-grained models. The Journal of Chemical Physics, 128(24):244114.
Peng, J., Wu, H., Zhou, Z., and Nie, Q. (2023). Bounded KRnet: Density estimation for bounded data with normalizing flows. arXiv preprint arXiv:2305.09063.
Pérez-Hernández, G., Paul, F., Giorgino, T., De Fabritiis, G., and Noé, F. (2013). Identification of slow molecular order parameters for Markov model construction. The Journal of Chemical Physics, 139(1):015102.
Price, J., Meuris, B., Shapiro, M., and Stinis, P. (2021). Optimal renormalization of multiscale systems. Proceedings of the National Academy of Sciences, 118(37):e2102266118.
Rohrdanz, M. A., Zheng, W., Maggioni, M., and Clementi, C. (2011). Determination of reaction coordinates via locally scaled diffusion map. The Journal of Chemical Physics, 134(12):124116.
Rosso, L., Minàry, P., Zhu, Z., and Tuckerman, M. E. (2002). On the use of the adiabatic molecular dynamics technique in the calculation of free energy profiles. The Journal of Chemical Physics, 116(11):4389–4402.
Russo, A., Durán-Olivencia, M. A., Kevrekidis, I. G., and Kalliadasis, S. (2019). Deep learning as closure for irreversible processes: A data-driven generalized Langevin equation. arXiv preprint arXiv:1903.09562.
Shaffer, P., Valsson, O., and Parrinello, M. (2016). Enhanced, targeted sampling of high-dimensional free-energy landscapes using variationally enhanced sampling, with an application to chignolin. Proceedings of the National Academy of Sciences, 113(5):1150–1155.
She, Z., Ge, P., and Lei, H. (2023). Data-driven construction of stochastic reduced dynamics encoded with non-Markovian features. The Journal of Chemical Physics, 158(3):034102.
Stinis, P. (2015). Renormalized Mori–Zwanzig-reduced models for systems without scale separation. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 471(2176):20140446.
Torrie, G. M. and Valleau, J. P. (1977). Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199.
Valsson, O. and Parrinello, M. (2014). Variational approach to enhanced sampling and free energy calculations. Physical Review Letters, 113:090601.
Vlachas, P. R., Byeon, W., Wan, Z. Y., Sapsis, T. P., and Koumoutsakos, P. (2018). Data-driven forecasting of high-dimensional chaotic systems with long short-term memory networks. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 474(2213):20170844.
Vroylandt, H., Goudenège, L., Monmarché, P., Pietrucci, F., and Rotenberg, B. (2022). Likelihood-based non-Markovian models from molecular dynamics. Proceedings of the National Academy of Sciences, 119(13):e2117586119.
Wall, H. (1948). Analytic Theory of Continued Fractions. D. Van Nostrand Company.
Wang, Q., Ripamonti, N., and Hesthaven, J. S. (2020a). Recurrent neural network closure of parametric POD-Galerkin reduced-order models based on the Mori–Zwanzig formalism. Journal of Computational Physics, 410:109402.
Wang, S., Ma, Z., and Pan, W. (2020b). Data-driven coarse-grained modeling of polymers in solution with structural and dynamic properties conserved. Soft Matter, 16(36):8330–8344.
Xie, P. and E, W. (2024). Coarse-graining conformational dynamics with multidimensional generalized Langevin equation: How, when, and why. Journal of Chemical Theory and Computation.
Ye, F. X. F., Yang, S., and Maggioni, M. (2021). Nonlinear model reduction for slow-fast stochastic systems near manifolds. arXiv preprint arXiv:2104.02120.
Zhu, Y., Tang, Y.-H., and Kim, C. (2022). Learning stochastic dynamics with statistics-informed neural network. arXiv preprint arXiv:2202.12278.
Zhu, Y. and Venturi, D. (2018). Faber approximation of the Mori–Zwanzig equation. Journal of Computational Physics, 372:694–718.
Zhu, Y. and Venturi, D. (2020). Generalized Langevin equations for systems with local interactions. Journal of Statistical Physics, 178:1217.
Zieliński, P. and Hesthaven, J. S. (2022). Discovery of slow variables in a class of multiscale stochastic systems via neural networks. Journal of Nonlinear Science, 32(4):51.
Zwanzig, R. (1973). Nonlinear generalized Langevin equations. Journal of Statistical Physics, 9(3):215–220.
Zwanzig, R. (2001). Nonequilibrium Statistical Mechanics. Oxford University Press.

APPENDIX A
MICROSCALE MODEL OF THE POLYMER MOLECULE

The polymer molecule is modeled as a bead-spring chain consisting of 4 sub-units, each consisting of 4 atoms. The full potential is given by
$$V_{\mathrm{mol}}(\mathbf{Q}) = \sum_{i \neq j}^{N} V_{\mathrm{p}}(Q_{ij}) + \sum_{i=1}^{N_b} V_{\mathrm{b}}(l_i) + \sum_{i=1}^{N_a} V_{\mathrm{a}}(\theta_i) + \sum_{i=1}^{N_d} V_{\mathrm{d}}(\phi_i), \qquad (\mathrm{A.1})$$
where $V_{\mathrm{p}}$, $V_{\mathrm{b}}$, $V_{\mathrm{a}}$, and $V_{\mathrm{d}}$ represent the pairwise, bond, angle, and dihedral interactions, whose detailed forms are specified below.

The pairwise interaction $V_{\mathrm{p}}$ is modeled by the truncated and shifted Lennard-Jones potential
$$V_{\mathrm{p}}(Q) = \begin{cases} 4\varepsilon\left[\left(\dfrac{\sigma}{Q}\right)^{12} - \left(\dfrac{\sigma}{Q}\right)^{6}\right] - 4\varepsilon\left[\left(\dfrac{\sigma}{Q_c}\right)^{12} - \left(\dfrac{\sigma}{Q_c}\right)^{6}\right], & Q < Q_c, \\ 0, & Q \ge Q_c, \end{cases} \qquad (\mathrm{A.2})$$
where $\varepsilon = 0.005$, $\sigma = 1.8$, and $Q_c = 10.0$.

The bond potential $V_{\mathrm{b}}$ is modeled by the finitely extensible nonlinear elastic (FENE) potential
$$V_{\mathrm{b}}(l) = -\frac{k_s l_0^2}{2} \log\left[1 - \frac{l^2}{l_0^2}\right], \qquad (\mathrm{A.3})$$
with three different bond types: within each sub-unit, atoms 1-2 and 3-4 are connected by type-1 bonds and atoms 2-3 by a type-2 bond; the sub-unit groups are connected by type-3 bonds. The detailed parameter set is given in Tab. A.1.

Type  k_s   l_0
1     0.4   1.8
2     0.64  1.6
3     0.32  1.8

Table A.1 Parameters of the FENE bond interactions.

The angle potential $V_{\mathrm{a}}$ is modeled by the harmonic angle potential
$$V_{\mathrm{a}}(\theta) = \frac{k_a}{2}\left(\theta - \theta_0\right)^2, \qquad (\mathrm{A.4})$$
with two different angle types: within each sub-unit group, the bond angles formed by atoms 1-2-3 and 2-3-4 are governed by the type-1 potential, while the bond angles formed by atoms of different sub-unit groups (e.g., 3-4-5, 4-5-6) are governed by the type-2 potential. The detailed parameter set is given in Tab. A.2.

Type  k_a   theta_0
1     1.2   114.0
2     1.5   119.7

Table A.2 Parameters of the harmonic angle interaction.

The dihedral potential $V_{\mathrm{d}}$ is modeled by the multiharmonic dihedral potential
$$V_{\mathrm{d}}(\phi) = \sum_{n=1}^{6} A_n \cos^{n-1}(\phi), \qquad (\mathrm{A.5})$$
with two different dihedral types: the type-1 potential is imposed on the dihedral angles formed by atoms 2-3-4-5, 4-5-6-7, and so on, and the type-2 potential on those formed by atoms 3-4-5-6, 7-8-9-10, and so on. The detailed parameter set is given in Tab. A.3.

Type  A_1     A_2      A_3     A_4      A_5      A_6
1     0.0673  1.8479   0.0079  -2.2410  -0.0058  0.0051
2     0.1602  -3.9993  0.2483  6.2837   0.0165   -0.0146

Table A.3 Parameters of the multiharmonic dihedral interaction.
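To make the parameterization concrete, the following minimal Python sketch evaluates the four interaction terms of Eqs. (A.2)-(A.5) with the parameters of Tabs. A.1-A.3. It is an illustration rather than the production simulation code: the function names are ours, and the unit conventions (angles in the units of Tab. A.2, dihedral angles in radians) are assumptions.

```python
import numpy as np

# Parameters of Eqs. (A.2)-(A.5) and Tabs. A.1-A.3 (reduced units).
EPS, SIGMA, QC = 0.005, 1.8, 10.0
FENE = {1: (0.4, 1.8), 2: (0.64, 1.6), 3: (0.32, 1.8)}    # type: (k_s, l_0)
ANGLE = {1: (1.2, 114.0), 2: (1.5, 119.7)}                 # type: (k_a, theta_0)
DIHEDRAL = {1: [0.0673, 1.8479, 0.0079, -2.2410, -0.0058, 0.0051],
            2: [0.1602, -3.9993, 0.2483, 6.2837, 0.0165, -0.0146]}

def v_pair(Q):
    """Truncated and shifted Lennard-Jones pair potential, Eq. (A.2)."""
    if Q >= QC:
        return 0.0
    lj = lambda r: 4.0 * EPS * ((SIGMA / r) ** 12 - (SIGMA / r) ** 6)
    return lj(Q) - lj(QC)

def v_bond(l, bond_type):
    """FENE bond potential, Eq. (A.3); diverges as l approaches l_0."""
    k_s, l0 = FENE[bond_type]
    return -0.5 * k_s * l0 ** 2 * np.log(1.0 - (l / l0) ** 2)

def v_angle(theta, angle_type):
    """Harmonic angle potential, Eq. (A.4)."""
    k_a, theta0 = ANGLE[angle_type]
    return 0.5 * k_a * (theta - theta0) ** 2

def v_dihedral(phi, dihedral_type):
    """Multiharmonic dihedral potential, Eq. (A.5): sum_n A_n cos^(n-1)(phi)."""
    return sum(A_n * np.cos(phi) ** (n - 1)
               for n, A_n in enumerate(DIHEDRAL[dihedral_type], start=1))

# Example: evaluate each term once.
print(v_pair(2.0), v_bond(1.2, 2), v_angle(116.0, 1), v_dihedral(0.5, 1))
```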
APPENDIX B
CONSTRUCTION OF THE FOUR-DIMENSIONAL FREE ENERGY FUNCTION

Accurate construction of a multi-dimensional free energy is a well-known non-trivial problem. To construct the free energy function $U(\mathbf{q})$ for the four-dimensional resolved variables $\mathbf{q}$ defined by (2.18), we conduct restrained molecular dynamics simulations to sample the average force. Specifically, for each target configuration $\mathbf{q}^*$, we impose a biased quadratic potential
$$U_{\mathrm{bias}}(\mathbf{q}, \mathbf{q}^*) = \frac{1}{2}\sum_{i=1}^{4} k_i \left(q_i - q_i^*\right)^2, \qquad (\mathrm{B.1})$$
where $k_1, \cdots, k_4$ represent the magnitudes of the bias potential, chosen such that the fluctuations are about 5% of the target values.

For the polymer molecule considered in the present study, the effective restraint force applied to the full-atom coordinates $\{\mathbf{Q}_j\}_{j=1}^{N}$ is given by
$$\mathbf{F}_{\mathrm{bias}}(\mathbf{q}, \mathbf{q}^*) = -\sum_{i=1}^{4} k_i \left(q_i - q_i^*\right) \nabla_{\mathbf{Q}_j} q_i, \qquad (\mathrm{B.2})$$
where the gradient terms are given by
$$\begin{aligned} \nabla_{\mathbf{Q}_j} q_1 &= \frac{\mathbf{Q}_1 - \mathbf{Q}_N}{q_1}\,\delta_{j,1} + \frac{\mathbf{Q}_N - \mathbf{Q}_1}{q_1}\,\delta_{j,N}, \\ \nabla_{\mathbf{Q}_j} q_2 &= \frac{2\left(\mathbf{Q}_j - \mathbf{Q}_c\right)}{N q_2}, \\ \nabla_{\mathbf{Q}_j} q_3 &= \frac{\mathbf{Q}_1 - \mathbf{Q}_{\lfloor N/2 \rfloor}}{q_3}\,\delta_{j,1} + \frac{\mathbf{Q}_{\lfloor N/2 \rfloor} - \mathbf{Q}_1}{q_3}\,\delta_{j,\lfloor N/2 \rfloor}, \\ \nabla_{\mathbf{Q}_j} q_4 &= \frac{\mathbf{Q}_N - \mathbf{Q}_{\lceil N/2 \rceil}}{q_4}\,\delta_{j,N} + \frac{\mathbf{Q}_{\lceil N/2 \rceil} - \mathbf{Q}_N}{q_4}\,\delta_{j,\lceil N/2 \rceil}, \end{aligned} \qquad (\mathrm{B.3})$$
where $\delta_{i,j}$ represents the Kronecker delta function.

The free energy $U(\mathbf{q})$ is approximated by a 4-layer fully connected neural network $\tilde{U}(\mathbf{q})$. Each hidden layer has 160 neurons, with the hyperbolic tangent as the activation function. $\tilde{U}(\mathbf{q})$ is trained by minimizing the empirical loss
$$L = \sum_{k=1}^{N_s} \left\| -\nabla \tilde{U}(\mathbf{q}^{(k)}) - \mathbf{F}_{\mathrm{bias}}(\mathbf{q}, \mathbf{q}^{(k)}) \right\|^2, \qquad (\mathrm{B.4})$$
where $\mathbf{q}^{(k)}$ represents a sampled configuration. In this work, we construct $\tilde{U}(\mathbf{q})$ using $N_s = 400000$ sample points collected from simulations with a production stage of $1 \times 10^7$ steps. For each configuration, the number of steps is between $1 \times 10^6$ and $6 \times 10^6$, such that the empirical sampling error is less than 5% of the mean value.

To verify the accuracy of $\tilde{U}(\mathbf{q})$, we numerically evaluate the integral
$$k_B T \mathbf{I} \equiv \int \mathbf{q} \otimes \nabla U(\mathbf{q})\, \mathrm{e}^{-U(\mathbf{q})/k_B T}\, \mathrm{d}\mathbf{q} \Big/ \int \mathrm{e}^{-U(\mathbf{q})/k_B T}\, \mathrm{d}\mathbf{q} \approx \frac{1}{N_s} \sum_{k=1}^{N_s} \mathbf{q}^{(k)} \otimes \nabla \tilde{U}(\mathbf{q}^{(k)}). \qquad (\mathrm{B.5})$$
The deviation of the numerical summation from $k_B T \mathbf{I}$ therefore provides an accuracy metric; for this case, $k_B T = 1$. The sample average yields a matrix whose diagonal entries are 1.0362, 0.9913, 0.9913, and 0.9814, with all off-diagonal entries below 0.01 in magnitude (B.6), which verifies that the constructed $\tilde{U}(\mathbf{q})$ is an accurate approximation of $U(\mathbf{q})$.
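The network and loss of Eq. (B.4) translate directly into a few lines of PyTorch. The sketch below is a minimal illustration under stated assumptions: the tensors `q_samples` and `mean_forces` are random placeholders standing in for the 400,000 restrained-MD configurations and their recorded average restraint forces, and the optimizer settings are illustrative rather than the values used in this work.

```python
import torch
import torch.nn as nn

class FreeEnergyNet(nn.Module):
    """Fully connected network for U(q) as described in Appendix B:
    4 layers, 160 neurons per hidden layer, tanh activations."""
    def __init__(self, dim=4, width=160):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, 1),
        )

    def forward(self, q):
        return self.net(q).squeeze(-1)

def force_matching_loss(model, q, mean_force):
    """Empirical loss of Eq. (B.4): match -grad U_tilde(q) to the sampled
    mean restraint force at each configuration."""
    q = q.clone().requires_grad_(True)
    grad_U = torch.autograd.grad(model(q).sum(), q, create_graph=True)[0]
    return ((-grad_U - mean_force) ** 2).sum(dim=1).mean()

# Placeholder data standing in for the N_s = 400,000 samples.
q_samples = torch.randn(1024, 4)
mean_forces = torch.randn(1024, 4)

model = FreeEnergyNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for epoch in range(100):
    optimizer.zero_grad()
    loss = force_matching_loss(model, q_samples, mean_forces)
    loss.backward()
    optimizer.step()
```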
APPENDIX C
FLUCTUATION-DISSIPATION THEOREM OF THE EXTENDED DYNAMICS

For the extended dynamics in the form of Eqs. (2.5)-(2.6), we can show that the embedded memory kernel $\tilde{\boldsymbol{\theta}}(t)$ and the fluctuation term $\tilde{\mathcal{R}}(t)$ satisfy the second fluctuation-dissipation theorem. Without loss of generality, we set the covariance of the non-Markovian features to be $k_B T \mathbf{I}$ following the learning method presented in Sec. 2.2.3, i.e., $\boldsymbol{\Lambda} = \mathbf{I}$, $\tilde{\mathbf{J}} = \mathbf{J}$.

Proposition C.0.1. The embedded memory kernel of the extended dynamics (2.5)-(2.6) takes the form
$$\tilde{\boldsymbol{\theta}}(t) = -\left(\mathbf{J}_{11}\delta(t) + \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}t}\mathbf{J}_{21}\right).$$
Furthermore, by choosing the initial condition of $\boldsymbol{\zeta}$ and the white noise term $\boldsymbol{\xi}(t) = \boldsymbol{\Sigma}\dot{\mathbf{W}}_t$ satisfying
$$\langle \boldsymbol{\zeta}(0)\boldsymbol{\zeta}(0)^T \rangle = \beta^{-1}\mathbf{I}, \qquad \langle \boldsymbol{\xi}(t)\boldsymbol{\xi}(s)^T \rangle = -\beta^{-1}\left(\mathbf{J} + \mathbf{J}^T\right)\delta(t - s), \qquad (\mathrm{C.1})$$
the embedded kernel $\tilde{\boldsymbol{\theta}}(t)$ and $\tilde{\mathcal{R}}(t)$ satisfy the second fluctuation-dissipation theorem, i.e.,
$$\langle \tilde{\mathcal{R}}(t)\tilde{\mathcal{R}}(t')^T \rangle = -\beta^{-1}\left(\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-t')}\mathbf{J}_{21} + \left(\mathbf{J}_{11} + \mathbf{J}_{11}^T\right)\delta(t-t')\right). \qquad (\mathrm{C.2})$$

Proof. With $\boldsymbol{\Lambda} = \mathbf{I}$ and $\tilde{\mathbf{J}} = \mathbf{J}$, we can integrate $\boldsymbol{\zeta}(t)$ in Eq. (2.5), yielding
$$\boldsymbol{\zeta}(t) = \int_0^t \mathrm{e}^{\mathbf{J}_{22}(t-s)}\mathbf{J}_{21}\mathbf{v}(s)\,\mathrm{d}s + \int_0^t \mathrm{e}^{\mathbf{J}_{22}(t-s)}\boldsymbol{\xi}_2(s)\,\mathrm{d}s + \mathrm{e}^{\mathbf{J}_{22}t}\boldsymbol{\zeta}(0). \qquad (\mathrm{C.3})$$
Plugging $\boldsymbol{\zeta}(t)$ into the dynamic equation of $\mathbf{v}$ gives
$$\mathbf{M}\dot{\mathbf{v}} = -\nabla U(\mathbf{q}) + \mathbf{J}_{11}\mathbf{v} + \int_0^t \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s)}\mathbf{J}_{21}\mathbf{v}(s)\,\mathrm{d}s + \underbrace{\boldsymbol{\xi}_1(t)}_{\tilde{\mathcal{R}}_1(t)} + \underbrace{\int_0^t \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s)}\boldsymbol{\xi}_2(s)\,\mathrm{d}s}_{\tilde{\mathcal{R}}_2(t)} + \underbrace{\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}t}\boldsymbol{\zeta}(0)}_{\tilde{\mathcal{R}}_3(t)}. \qquad (\mathrm{C.4})$$
We check the covariance matrices of the noise terms. For $t \ge t'$,
$$\begin{aligned} \langle \tilde{\mathcal{R}}_1(t)\tilde{\mathcal{R}}_1(t')^T \rangle &= -\beta^{-1}\left(\mathbf{J}_{11} + \mathbf{J}_{11}^T\right)\delta(t-t'), \\ \langle \tilde{\mathcal{R}}_2(t)\tilde{\mathcal{R}}_2(t')^T \rangle &= \int_0^t\!\!\int_0^{t'} \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s)}\langle \boldsymbol{\xi}_2(s)\boldsymbol{\xi}_2(s')^T \rangle \mathrm{e}^{\mathbf{J}_{22}^T(t'-s')}\mathbf{J}_{12}^T\,\mathrm{d}s\,\mathrm{d}s' \\ &= -\beta^{-1}\int_0^t\!\!\int_0^{t'} \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s)}\left(\mathbf{J}_{22} + \mathbf{J}_{22}^T\right)\delta(s-s')\,\mathrm{e}^{\mathbf{J}_{22}^T(t'-s')}\mathbf{J}_{12}^T\,\mathrm{d}s\,\mathrm{d}s' \\ &= -\beta^{-1}\int_0^{t'} \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s')}\left(\mathbf{J}_{22} + \mathbf{J}_{22}^T\right)\mathrm{e}^{\mathbf{J}_{22}^T(t'-s')}\mathbf{J}_{12}^T\,\mathrm{d}s' \\ &= -\beta^{-1}\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}t}\mathrm{e}^{\mathbf{J}_{22}^T t'}\mathbf{J}_{12}^T + \beta^{-1}\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-t')}\mathbf{J}_{12}^T, \\ \langle \tilde{\mathcal{R}}_3(t)\tilde{\mathcal{R}}_3(t')^T \rangle &= \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}t}\langle \boldsymbol{\zeta}(0)\boldsymbol{\zeta}(0)^T \rangle \mathrm{e}^{\mathbf{J}_{22}^T t'}\mathbf{J}_{12}^T = \beta^{-1}\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}t}\mathrm{e}^{\mathbf{J}_{22}^T t'}\mathbf{J}_{12}^T. \end{aligned} \qquad (\mathrm{C.5})$$
Moreover, for $t > t'$, all the cross terms vanish except $\langle \tilde{\mathcal{R}}_2(t)\tilde{\mathcal{R}}_1(t')^T \rangle$, i.e.,
$$\langle \tilde{\mathcal{R}}_2(t)\tilde{\mathcal{R}}_1(t')^T \rangle = \int_0^t \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s)}\langle \boldsymbol{\xi}_2(s)\boldsymbol{\xi}_1(t')^T \rangle\,\mathrm{d}s = -\beta^{-1}\int_0^t \mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-s)}\left(\mathbf{J}_{21} + \mathbf{J}_{12}^T\right)\delta(t'-s)\,\mathrm{d}s = -\beta^{-1}\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-t')}\left(\mathbf{J}_{21} + \mathbf{J}_{12}^T\right). \qquad (\mathrm{C.6})$$
Combining Eq. (C.5) and Eq. (C.6), we have
$$\langle \tilde{\mathcal{R}}(t)\tilde{\mathcal{R}}(t')^T \rangle = \beta^{-1}\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-t')}\mathbf{J}_{12}^T - \beta^{-1}\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-t')}\left(\mathbf{J}_{21} + \mathbf{J}_{12}^T\right) - \beta^{-1}\left(\mathbf{J}_{11} + \mathbf{J}_{11}^T\right)\delta(t-t') = -\beta^{-1}\left(\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}(t-t')}\mathbf{J}_{21} + \left(\mathbf{J}_{11} + \mathbf{J}_{11}^T\right)\delta(t-t')\right). \qquad (\mathrm{C.7})$$
□

As a special case, by imposing the restraint specified by Eq. (2.16) such that $\mathbf{J}_{11} + \mathbf{J}_{11}^T = 0$ and $\mathbf{J}_{12} = -\mathbf{J}_{21}^T$, the memory kernel $\tilde{\boldsymbol{\theta}}(t)$ reduces to $\mathbf{J}_{12}\mathrm{e}^{\mathbf{J}_{22}t}\mathbf{J}_{12}^T$ without the Markovian part, and the second fluctuation-dissipation theorem recovers the standard form, i.e.,
$$\langle \tilde{\mathcal{R}}(t)\tilde{\mathcal{R}}(0)^T \rangle = \beta^{-1}\tilde{\boldsymbol{\theta}}(t). \qquad (\mathrm{C.8})$$
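Proposition C.0.1 also suggests a direct way to evaluate the embedded kernel once the blocks of $\mathbf{J}$ have been learned: the smooth part of $\tilde{\boldsymbol{\theta}}(t)$ is a matrix exponential. The Python sketch below uses illustrative block sizes and randomly generated matrices (not learned values) to evaluate the kernel on a time grid and to check the consistency $\tilde{\boldsymbol{\theta}}(0) = \mathbf{J}_{12}\mathbf{J}_{12}^T$ implied by the restraint of Eq. (2.16).

```python
import numpy as np
from scipy.linalg import expm

def embedded_kernel(J12, J21, J22, t_grid):
    """Smooth part of the embedded memory kernel of Proposition C.0.1,
    theta(t) = -J12 expm(J22 t) J21; the Markovian J11*delta(t) part is
    handled separately as an instantaneous friction."""
    return np.array([-J12 @ expm(J22 * t) @ J21 for t in t_grid])

# Illustrative blocks for m = 2 resolved and n = 3 auxiliary variables,
# generated to satisfy the restraint of Eq. (2.16): J12 = -J21^T and a
# J22 with negative-definite symmetric part (stable auxiliary dynamics).
rng = np.random.default_rng(0)
m, n = 2, 3
J12 = rng.standard_normal((m, n))
J21 = -J12.T
B = rng.standard_normal((n, n))
J22 = -(B @ B.T) - np.eye(n)

t_grid = np.linspace(0.0, 10.0, 200)
theta = embedded_kernel(J12, J21, J22, t_grid)   # shape (200, 2, 2)

# Under Eq. (2.16), theta(t) = J12 expm(J22 t) J12^T, so theta(0) = J12 J12^T
# is symmetric positive semidefinite, consistent with the standard form of
# the second fluctuation-dissipation theorem in Eq. (C.8).
assert np.allclose(theta[0], J12 @ J12.T)
```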
APPENDIX D
INVARIANT PROBABILITY DENSITY FUNCTION

Proposition D.0.1. By choosing the white noise following Eq. (C.1), the reduced model (2.5)-(2.6) retains the invariant density function
$$\rho_{\mathrm{eq}}(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta}) = \exp\left[-\beta W(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta})\right] \Big/ \int \exp\left[-\beta W(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta})\right]\,\mathrm{d}\mathbf{q}\,\mathrm{d}\mathbf{p}\,\mathrm{d}\boldsymbol{\zeta}. \qquad (\mathrm{D.1})$$

Proof. By Eq. (C.1), the covariance of the white noise of the full extended system is given by $\mathbf{G} + \mathbf{G}^T = \mathrm{diag}(0, \boldsymbol{\Sigma}\boldsymbol{\Sigma}^T)$. Accordingly, the Fokker–Planck equation follows
$$\frac{\partial \rho(\mathbf{z}, t)}{\partial t} = \nabla \cdot \left( -\mathbf{G}\nabla W(\mathbf{z})\,\rho(\mathbf{z}, t) - \frac{1}{2}\beta^{-1}\left(\mathbf{G} + \mathbf{G}^T\right)\nabla\rho(\mathbf{z}, t) \right), \qquad (\mathrm{D.2})$$
where $\rho(\mathbf{z}, t)$ represents the probability density function of the extended variables $\mathbf{z} = [\mathbf{q}; \mathbf{p}; \boldsymbol{\zeta}]$. For $\rho_{\mathrm{eq}}(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta}) \propto \exp\left[-\beta W(\mathbf{q}, \mathbf{p}, \boldsymbol{\zeta})\right]$, the right-hand side becomes
$$\nabla \cdot \left( \beta^{-1}\mathbf{G}\nabla\rho_{\mathrm{eq}}(\mathbf{z}) - \frac{1}{2}\beta^{-1}\left(\mathbf{G} + \mathbf{G}^T\right)\nabla\rho_{\mathrm{eq}}(\mathbf{z}) \right) = \beta^{-1}\nabla \cdot \left( \mathbf{G}_A \nabla\rho_{\mathrm{eq}}(\mathbf{z}) \right) \equiv 0, \qquad (\mathrm{D.3})$$
where $\mathbf{G}_A = (\mathbf{G} - \mathbf{G}^T)/2$ is the anti-symmetric part of $\mathbf{G}$; the last identity holds because $\mathbf{G}_A$ is constant and anti-symmetric. □

APPENDIX E
MEMORY KERNEL OF THE POLYMER MOLECULE SYSTEMS

Fig. E.1 shows the embedded matrix-valued kernels $\boldsymbol{\theta}(t)$ of the full MD model and the 4D reduced models of the polymer molecule system. Similar to the kernel in Laplace space $\boldsymbol{\Theta}(\lambda)$ shown in Fig. 2.6, the good agreement between the full MD results and the reduced models verifies that the reduced model can accurately retain the non-Markovian dynamics of the resolved variables.

[Figure E.1: four panels (a-d) plotting the components $[\theta(t)]_{11}$, $[\theta(t)]_{22}$, $[\theta(t)]_{12}$, and $[\theta(t)]_{23}$ against $t$ on a logarithmic time axis, comparing the full MD result with the 2nd- and 4th-order joint-learning reduced models.]

Figure E.1 (a-d) Components of the embedded matrix-valued kernel $\boldsymbol{\theta}(t)$ obtained from the full MD and the four-dimensional reduced model of a polymer molecule system (She et al., 2023).
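Finally, the invariant-density statement of Proposition D.0.1 can be checked by direct simulation of the extended dynamics. The Euler-Maruyama sketch below uses a hypothetical one-dimensional double-well example with illustrative coupling blocks (none of these values come from the polymer system above); at equilibrium the sampled second moment of the auxiliary variables should approach $\beta^{-1}\mathbf{I}$, and the step size and run length are chosen only for illustration.

```python
import numpy as np

# Euler-Maruyama simulation of the extended dynamics (2.5)-(2.6) for a
# hypothetical 1D example: U(q) = (q^2 - 1)^2, M = 1, beta = 1, two
# auxiliary variables, with J11 = 0 and J21 = -J12^T per Eq. (2.16).
rng = np.random.default_rng(1)
beta, dt, n_steps = 1.0, 1e-3, 1_000_000
J12 = np.array([1.0, 0.5])                   # coupling row, illustrative
B = rng.standard_normal((2, 2))
J22 = -(B @ B.T) - np.eye(2)                 # stable auxiliary block
# White-noise amplitude from Eq. (C.1): <xi2 xi2^T> = -(J22 + J22^T)/beta.
Sigma = np.linalg.cholesky(-(J22 + J22.T) / beta)

q, v = 0.0, 0.0
zeta = np.zeros(2)
second_moment = np.zeros((2, 2))
for _ in range(n_steps):
    force = -4.0 * q * (q * q - 1.0)         # -dU/dq for the double well
    v += (force + J12 @ zeta) * dt
    q += v * dt
    noise = Sigma @ (np.sqrt(dt) * rng.standard_normal(2))
    zeta += (-J12 * v + J22 @ zeta) * dt + noise
    second_moment += np.outer(zeta, zeta)

# Proposition D.0.1 predicts <zeta zeta^T> = I/beta under rho_eq.
print(second_moment / n_steps)               # should be close to identity
```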