DEEP LEARNING REGULARIZATION: THEORY AND DATA PERSPECTIVES By Xitong Zhang A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computational Mathematics, Science and Engineering—Doctor of Philosophy 2024 ABSTRACT Generalization is a central research topic in deep learning. To enhance the test performance of well-trained models on unseen data, it is essential to apply regularization techniques that refine the model’s expressive capabilities and the training process. This thesis categorizes regularization into theory-driven and data-driven approaches. Theory-driven regularization encompasses methods that are broadly applicable across various contexts, including conventional techniques such as weight decay and dropout. Conversely, data-driven regularization involves techniques specifically designed for particular data sets and applications. For instance, different neural network architectures can be developed to capture various useful patterns in data for specific applications. This dissertation explores both types of regularization, from the development of new training algorithms with theoretical guarantees to the design of deep learning architectures for data-driven approaches. For theory-driven regularization, this dissertation discusses a training algorithm based on PAC-Bayes bound. PAC-Bayes bound evaluates the upper bound of the test error using only training data. However, minimizing the upper bound of the test error using existing PAC-Bayes bounds, which are theoretically tight and should intuitively benefit generalization, often results in compromised test performance compared to empirical risk minimization (ERM) with commonly used regularization techniques such as weight decay, large learning rates, and small batch sizes. The designed algorithm seeks to bridge the gap between theoretical tightness and practical effectiveness in boosting test performance for classification tasks. For data-driven regularization, this dissertation discusses graph neural networks specifically designed for directed graphs and spatial-temporal seismic data. It also introduces a physics-informed deep learning framework for full-waveform inversion, which aims to estimate subsurface structures based on seismic data by integrating the governing acoustic wave equation with convolutional neural networks. Additionally, data augmentation is considered a specialized form of regularization. This thesis explores the design of generative neural networks for time-lapse full-waveform inversion to obtain more training samples and achieve lower test errors in the target inversion task. The material presented in this dissertation incorporates several publications and preprints. For details on PAC-Bayes training, where the model is trained using the PAC-Bayes bound, review Zhang et al. (2023). For discussions on regularization through the design of graph neural networks, refer to Zhang et al. (2021b) and Zhang et al. (2022). For physics-informed regularization, see Jin et al. (2021). For approaches to data augmentation with generative models, check Yang et al. (2022). Copyright by XITONG ZHANG 2024 To all warriors exploring in the darkness. "There is only one heroism in the world: to see the world as it is and to love it." — Romain Rolland v ACKNOWLEDGEMENTS Pursuing a PhD degree has been a long and memorable journey, filled with many emotions, including happiness, excitement, hopefulness, and, at times, depression. 
In this dissertation, I sincerely thank all my friends, colleagues, advisors, and family who have supported me throughout this journey. Over the course of my six-year journey at Michigan State University, while my research may not be the most acclaimed, I am confident that my experiences are among the most diverse of all the graduates. Throughout my PhD, I had the opportunity to work in several different research labs. After transferring from another department, my journey in the Department of Computational Mathematics, Science, and Engineering began under the guidance of Dr. Matthew Hirn and, later, Dr. Rongrong Wang. They introduced me to high-quality research practices, effective project management, and the essentials of good leadership and mentorship. Additionally, I was fortunate to intern with Dr. Youzuo Lin four years ago, who encouraged and guided me in exploring a completely new field in geoscience. Working with them has been a profoundly rewarding experience; words cannot fully express my gratitude. I would also like to extend my thanks to all my other committee members, Dr. Saiprasad Ravishankar and Dr. Jianrong Wang, for their invaluable guidance, advice, and constructive feedback. I am deeply thankful to Jingwen Shi, who has been a steadfast companion throughout nearly my entire research career. I also appreciate Qi Wang, who provided encouragement and support as I explored new research paths during challenging times. I am grateful to Guangliang Liu, Haitao Mao, and Zhiyu Xue for their support during my job search. Additionally, I extend my gratitude to all my labmates and friends, including, but not limited to, He Lyu, Avrajit Ghosh, Ismail Alkhouri, Peng Jin, Will Reichard-Flynn, Yuxin Yang, Shihang Feng, Michael Perlmutter, Yixuan He, Xiaorui Liu, and Junyuan Hong. It is impossible to name everyone here; please accept my apologies if I have inadvertently omitted anyone. Most of all, I extend my deepest gratitude to my family, my unwavering supporters. Their understanding, encouragement, and love have been the foundation of my success throughout all that has happened these years.

TABLE OF CONTENTS

CHAPTER 1  OVERVIEW ... 1
    1.1  Background ... 1
    1.2  Dissertation Contributions ... 2
    1.3  Dissertation Structure ... 5
CHAPTER 2  THEORY-DRIVEN REGULARIZATION ... 6
    2.1  Introduction of Implicit Regularization: The Gradient Descent Case ... 6
    2.2  Unlocking Tuning-free Generalization: Minimizing the PAC-Bayes Bound with Trainable Priors ... 9
CHAPTER 3  DATA-DRIVEN REGULARIZATION ... 29
    3.1  MagNet: A Neural Network for Directed Graphs ... 29
    3.2  Spatio-Temporal Graph Convolutional Networks for Earthquake Source Characterization ... 47
    3.3  Unsupervised Learning of Full-Waveform Inversion: Connecting CNN and Partial Differential Equation in a Loop ... 71
    3.4  Making Invisible Visible: Data-Driven Seismic Inversion with Spatio-temporally Constrained Data Augmentation ... 86
CHAPTER 4  CONCLUSION ... 114
BIBLIOGRAPHY ... 117
APPENDIX A  UNLOCKING TUNING-FREE GENERALIZATION: MINIMIZING THE PAC-BAYES BOUND WITH TRAINABLE PRIORS ... 135
APPENDIX B  MAGNET: A NEURAL NETWORK FOR DIRECTED GRAPHS ... 167
APPENDIX C  UNSUPERVISED LEARNING OF FULL-WAVEFORM INVERSION: CONNECTING CNN AND PARTIAL DIFFERENTIAL EQUATION IN A LOOP ... 176

CHAPTER 1 OVERVIEW

1.1 Background

Before delving into the details of this dissertation, it is crucial to define what is meant by ‘regularization,’ a term with varied definitions in the field. For instance, a recent taxonomy has categorized regularization techniques into explicit and implicit types. According to Hernández-García and König (2018), explicit regularization involves techniques that reduce the representational capacity of a model class H0, such as a neural network, resulting in a sub-hypothesis set H1 ⊂ H0 based on specific assumptions. Examples of explicit regularization include weight decay and dropout. On the other hand, implicit regularization comprises techniques that reduce generalization error indirectly through characteristics of the network architecture, the training data, or the learning algorithm, as discussed in Zhang et al. (2021a). Due to some ambiguity in these definitions, this dissertation adopts the taxonomy proposed by Kukačka et al. (2017), defining regularization as follows:

Definition 1.1.1. Regularization is any supplementary technique that aims at enhancing the model’s ability to generalize, i.e., to produce better results on the test set.

Based on this definition, we can categorize different types of regularization with the following objectives. An arbitrary deep learning model can be described as a function f_θ : x ↦ y with trainable weights θ ∈ Θ. The objective of training is to find the optimal weights θ∗ that minimize the loss function L : Θ ↦ R:

    θ∗ = arg min_θ L(θ).                                                        (1.1)

The loss function generally takes the form

    L = E_{(x,y)∼D} [ ℓ(f_θ(x), y) + R(···) ],                                   (1.2)

where D represents the data distribution, ℓ is the misfit function that measures the discrepancy between the network output f_θ(x) and the target label y, and R is the extra penalty term based on specific criteria, such as Occam’s razor (e.g., weight decay). Since the data distribution D is generally unknown, we evaluate the loss function using a training dataset S ∼ D, defined as S = {(x_i, y_i)}_{i=1}^m, where S comprises m pairs of training samples. Training can then be framed as solving the following optimization task:

    θ∗ = arg min_θ (1/|S|) Σ_{(x_i, y_i) ∈ S} [ ℓ(f_θ(x_i), y_i) + R(···) ].     (1.3)

Based on the above equation, we can identify different sources of regularization (a minimal code sketch of objective (1.3) follows this list):

• f: the neural network architecture, such as the use of pooling layers to achieve output invariance to slight spatial distortions in the input.
• R: the extra penalty term, for example, weight decay and sharpness minimization (Foret et al., 2020).
• S: the training dataset, including techniques like data augmentation (Wang et al., 2017).
• ℓ: the misfit function, such as the dice coefficient, which is robust to class imbalance (Milletari et al., 2016).
• arg min_θ: the optimization procedure, for instance, implicit regularization in (stochastic) gradient descent (Barrett and Dherin, 2020; Ghosh et al., 2022), as well as early-stopping and warm-start methods.
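As a deliberately minimal illustration of objective (1.3), the following PyTorch sketch trains a toy model with the misfit ℓ taken to be cross-entropy and an explicit penalty R written out as weight decay. The model, the synthetic data, and all hyperparameter values are illustrative assumptions, not settings used elsewhere in this dissertation.

import torch

# f_theta: the architecture; l: the misfit; R: an explicit penalty; S: the training set
model = torch.nn.Linear(10, 2)                                  # f_theta (toy choice)
loss_fn = torch.nn.functional.cross_entropy                     # the misfit l
weight_decay = 1e-4                                             # strength of R(theta) = ||theta||_2^2
opt = torch.optim.SGD(model.parameters(), lr=1e-1)

# a toy stand-in for S = {(x_i, y_i)}_{i=1}^m drawn from D
S = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(100)]

for x_i, y_i in S:                                              # one pass over the sum in (1.3)
    penalty = sum(p.pow(2).sum() for p in model.parameters())   # R(...)
    loss = loss_fn(model(x_i), y_i) + weight_decay * penalty    # l(f_theta(x_i), y_i) + R(...)
    opt.zero_grad()
    loss.backward()
    opt.step()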
1.2 Dissertation Contributions This dissertation focuses on exploring all types of regularization— 𝑓 , 𝑅, ℓ, S, and arg min . More 𝜃 specifically: 1. In Section 2.2, we discuss a two-stage training algorithm for neural networks that minimizes the PAC-Bayes bound (Zhang et al., 2023), integrating both ℓ and arg min . Previous research 𝜃 2 in PAC-Bayes learning theory primarily focused on establishing tight upper bounds for test errors, whereas PAC-Bayes training updates network weights to minimize these bounds for better generalization. While theoretically tight, practical implementations of PAC-Bayes bounds often fell short of achieving test errors as low as those obtained by empirical risk minimization (ERM) with optimally tuned hyperparameters of commonly used regularization, such as learning rate, dropout, and weight decay. Moreover, traditional PAC-Bayes training algorithms Pérez-Ortiz et al. (2021) typically require bounded loss functions and extensive searches over priors using additional datasets, limiting their applicability. Our new PAC-Bayes training algorithm, which allows for unbounded loss and involves a two-stage training process, minimizes reliance on prior tuning by training both the prior and posterior using the same dataset. Comprehensive evaluations across various classification tasks and neural network architectures show that our method not only surpasses existing PAC-Bayes algorithms but also achieves test accuracies comparable to ERM with optimal commonly chosen regularization settings. 2. For 𝑓 , this dissertation presents the design of graph neural networks, MagNet (Zhang et al., 2021b) and STGNN (Zhang et al., 2022). • Section 3.1 introduces MagNet, a graph neural network for directed graphs. Unlike typical GNNs that focus on undirected graphs and require symmetrization, MagNet utilizes a complex Hermitian matrix, the magnetic Laplacian, to preserve direction information. This matrix captures undirected geometric structure in the magnitude of its entries and directional information in their phases. A “charge” parameter adjusts spectral information to account for variations among directed cycles. MagNet has been applied to various node classification and link prediction tasks, showing superior performance compared to other methods on most tasks. It is adaptable to other GNN architectures such as GCN (Kipf and Welling, 2016) and ChebNet (Defferrard et al., 2016). • Section 3.2 discusses the Spatiotemporal Graph Neural Network (STGNN) designed 3 for estimating earthquake locations and magnitudes. Traditional machine learning earthquake characterization methods use waveform information from a single station; STGNN, however, utilizes data from multiple stations to construct dynamic graphs via adaptive message passing. Tested on data from the Southern California Seismic Network and Oklahoma, STGNN has demonstrated more accurate earthquake location predictions than baseline models. 3. Combining 𝑓 and 𝑅, Section 3.3 details an architecture for Full-Waveform Inversion (FWI) (Jin et al., 2021). FWI is typically used in geophysics to estimate subsurface velocity maps from seismic data, a challenging task formulated by a second-order partial differential equation (PDE). By using finite difference methods to approximate forward modeling of the PDE and modeling its inversion with a CNN, we transform the supervised inversion task into an unsupervised seismic data reconstruction task. 
The architecture effectively acts as an auto-encoder, with the decoder designed around governing physics ( 𝑓 ). Perceptual loss (Johnson et al., 2016) has also been found to enhance generalization in this setting (𝑅). Our results indicate that the model, utilizing only seismic data, achieves accuracy comparable to supervised methods and outperforms them when more unlabeled seismic data is included. 4. For the regularization of S, Section 3.4 describes a data augmentation approach based on generative neural networks (Yang et al., 2022) for time-lapse full-waveform inversion (FWI). Traditional data augmentation techniques from computer vision often yield physically unacceptable samples that do not benefit FWI. We developed generative models that incorporate physics knowledge, such as governing equations and observable phenomena, to enhance the quality of the synthetic data. We applied these techniques to detect small CO2 leakages and validated our methods through comprehensive numerical tests. Our analysis shows that data-driven seismic imaging can be significantly improved with our data augmentation techniques. 4 1.3 Dissertation Structure The dissertation is organized as follows. Chapter 2 explores theory-driven regularization, starting with a review of implicit regularization in gradient descent in Section 2.1. While numerous studies have investigated individual regularization techniques and recognized their advantages, the interactions among these techniques remain less understood. Consequently, extensive tuning of the hyperparameters associated with each regularization technique is often necessary to achieve optimal test performance in practical applications. Motivated by this challenge, Section 2.2 introduces a training algorithm based on PAC-Bayes theory. This algorithm proves effective across various in-domain classification tasks and different architectures, aiming to enhance generalization. Chapter 3 focuses on data-driven regularization by designing deep learning architectures tailored for specific applications. Section 3.1 introduces MagNet, a graph neural network for directed graphs. Section 3.2 presents STGNN, a graph neural network that predicts earthquake locations and magnitudes based on waveforms collected from various seismic stations. Section 3.3 discusses full-waveform inversion conducted in an unsupervised manner, leveraging governing physical principles. Additionally, Section 3.4 describes a scientific data augmentation approach specifically for time-lapse full-waveform inversion. Finally, Chapter 4 summarizes the discussed regularization techniques and discusses the potential future work. 5 CHAPTER 2 THEORY-DRIVEN REGULARIZATION This chapter focuses on theory-driven regularization, which operates independently of data char- acteristics. The methods discussed here are applicable across a broader spectrum of applications, underpinned by theoretical frameworks aimed at enhancing generalization. This chapter initially explores implicit regularization in Section 2.1, represented by the arg min 𝜃 term in Equation 1.3. Building on the insights gained from studying implicit regularization effects, this chapter will then delve into a training algorithm that leverages the PAC-Bayes bound, detailed in Section 2.2. This approach illustrates how theoretical principles can guide the development of practical regularization techniques that enhance model performance across various domains. 
2.1 Introduction of Implicit Regularization: The Gradient Descent Case

Figure 2.1 Backward error analysis. The numerical solution of the system dθ/dt = f(θ) incurs error because of discretization; it can instead be viewed as the exact solution of a modified system dθ/dt = f̃(θ).

Barrett and Dherin (2020) pointed out, by analyzing the gradient descent algorithm with backward error analysis, that deep learning models trained with larger learning rates can achieve better generalization performance. Backward error analysis is used to measure the error introduced by the discretization of ODE solvers. The general idea is visualized in Figure 2.1: to measure the error of the numerical scheme θ_{n+1} = Φ_h(θ_n), which approximates the system dθ/dt = f(θ), we look for a modified system dθ/dt = f̃(θ) whose exact solution passes through the iterates of Φ_h, and compare f̃ with f through a Taylor expansion.

For gradient descent, the iterative weight update is determined by the learning rate h and the gradient of the loss function ∇_θ E(·):

    θ_{n+1} = θ_n + h f(θ_n) = θ_n − h ∇_θ E(θ_n).                              (2.1)

To find the f̃ that makes the trajectory of dθ/dt = f̃(θ) coincide with that of (2.1) on the discrete time grid, we first write f̃(θ) as a series in h:

    dθ/dt = f̃(θ) = f(θ) + h f_1(θ) + h² f_2(θ) + ···                           (2.2)

Then, for any fixed t > 0, expanding the exact flow of (2.2) over one step of length h gives

    θ_{t+1} = θ_t + h ( f(θ_t) + h f_1(θ_t) + h² f_2(θ_t) + ··· )
              + (h²/2) ( f′(θ_t) + h f_1′(θ_t) + h² f_2′(θ_t) + ··· ) ( f(θ_t) + h f_1(θ_t) + h² f_2(θ_t) + ··· ) + ···
            ≈ θ_t + h f(θ_t) + (h²/2) ( f′(θ_t) f(θ_t) + 2 f_1(θ_t) ).          (2.3)

For the two trajectories to match, dθ/dt = f̃(θ) should reproduce θ_{n+1} = θ_n + h f(θ_n) whenever t = n, so the term (h²/2) f′(θ_t) f(θ_t) + h² f_1(θ_t) must be 0. Thus f_1(θ) = −(1/2) f′(θ) f(θ), which turns Equation (2.2) into

    dθ/dt ≈ f(θ) + h f_1(θ) = f(θ) − (h/2) f′(θ) f(θ) = −∇( E(θ) + (h/4) ||∇E(θ)||² ).   (2.4)

Equation (2.4) indicates that the numerical solution of gradient descent carries the implicit regularization term ||∇E(θ)||². The loss that gradient descent effectively minimizes becomes

    Ẽ(θ) = E(θ) + (h/4) ||∇E(θ)||².                                             (2.5)

Consequently, this explains why test performance is often better with a larger learning rate for the gradient descent method. Following the same analysis, Smith et al. (2021) reach the same conclusion for the stochastic gradient descent method.

There are several other implicit regularization techniques and scenarios. Keskar et al. (2016) notes that training with large batches often leads to convergence at sharp minima, which experimentally results in poorer generalization performance compared to flatter minima. Neyshabur et al. (2014) and Nakkiran et al. (2021) empirically observe that test performance improves with the addition of more trainable weights, even after achieving 100% training accuracy. Kobak et al. (2020) discusses how features with independent components can act as implicit regularization in ridge regression, where the optimal regularization weight can sometimes be negative. Bishop (1995) shows that adding small Gaussian noise to the features leads to an expected loss equivalent to adding an implicit term of ||∂L/∂x||² to the original loss function. Additionally, Santurkar et al. (2018) empirically examines the effect of batch normalization, concluding that it smooths training without reducing internal covariate shift, as initially proposed by Ioffe and Szegedy (2015).
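The modified loss (2.5) can also be imposed explicitly rather than left to the step size: one adds (h/4)·||∇E(θ)||² as a penalty and differentiates through it with a second backward pass. The sketch below is a minimal PyTorch illustration of this idea, in the spirit of Barrett and Dherin (2020); the toy model, data, and the coefficient value are assumptions for demonstration only.

import torch

def regularized_loss(model, loss_fn, x, y, h):
    # E(theta) + (h/4) * ||grad E(theta)||^2, the explicit counterpart of (2.5)
    loss = loss_fn(model(x), y)
    params = [p for p in model.parameters() if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)   # keep the graph for double backprop
    grad_norm_sq = sum(g.pow(2).sum() for g in grads)
    return loss + (h / 4.0) * grad_norm_sq

# usage on a toy network (names and sizes are illustrative)
model = torch.nn.Sequential(torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2))
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
total = regularized_loss(model, torch.nn.functional.cross_entropy, x, y, h=1e-2)
opt.zero_grad()
total.backward()
opt.step()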
Implicit regularization generally arises from mechanisms where the strength is difficult to tune, unlike traditional regularization techniques such as weight decay. For example, learning rate and momentum (Ghosh et al., 2022) are related to convergence: overly large values can hinder convergence. To enhance the effect of implicit regularization without impacting convergence, one can convert implicit terms to explicit ones. A notable method in this context is Sharpness-Aware Minimization (SAM) (Foret et al., 2020). However, despite SAM’s effectiveness in achieving good test performance, Andriushchenko et al. (2023) argue that sharpness does not necessarily correlate well with generalization. Instead, it correlates more with training parameters like the learning rate, which may arbitrarily relate to generalization depending on the setup. This raises a pertinent question: Is there a more reliable metric strongly correlated with generalization? The answer lies in the PAC-Bayes bound, which directly measures the upper bound of the test error. If the PAC-Bayes bound is sufficiently tight, it should correlate perfectly with generalization. This concept will be elaborated in the next section. 8 2.2 Unlocking Tuning-free Generalization: Minimizing the PAC-Bayes Bound with Trainable Priors The PAC-Bayes bound is instrumental in assessing the generalization capabilities of machine learning models by estimating the upper limits of test errors. This theoretical framework offers crucial insights into a model’s generalization ability and provides a solid foundation for developing practical training algorithms (Shawe-Taylor and Williamson, 1997). PAC-Bayes bounds are particularly valuable as they elucidate the discrepancy between training and generalization errors, underline the importance of incorporating regularizers in empirical risk minimization, and demonstrate how larger datasets can enhance generalization. The efficacy of PAC-Bayes bounds in determining the generalization capabilities of machine learning models has been validated by extensive empirical evidence across various generalization metrics (Jiang et al., 2019). Minimizing the upper bound of generalization error is inherently advantageous for generalization. This section introduces a training algorithm that leverages the PAC-Bayes bounds, aiming to optimize these theoretical limits through an approach involving trainable priors. The objective function, designed based on the proposed PAC-Bayes bound, corresponds to the ℓ term in Equation 1.3, while the complete practical training algorithm is represented by the arg min 𝜃 term. 2.2.1 Introduction Traditionally, PAC-Bayes bounds have been primarily used for quality assurance or model selection (McAllester, 1998, 1999; Herbrich and Graepel, 2000), particularly with smaller machine learning models. Recent work has introduced a framework that minimizes a PAC-Bayes bound during training large neural networks (Dziugaite and Roy, 2017). Ideally, the generalization performance of deep neural networks could be enhanced by directly minimizing its quantitative upper bounds, specifically the PAC-Bayes bounds, without incorporating any other regularization tricks. However, the effectiveness of applying PAC-Bayes training to deep neural networks is challenged by the well-known issue that PAC-Bayes bounds can become vacuous in highly over-parameterized settings (Livni and Moran, 2020). Additionally, selecting a suitable prior, which should be independent of training samples, is critical yet challenging. 
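To make the contrast with implicit mechanisms concrete, a single SAM update, the explicit sharpness-reducing method discussed above, can be sketched as follows. This is a minimal illustration of the procedure of Foret et al. (2020); the perturbation radius rho, the toy model and data, and the helper name are assumptions.

import torch

def sam_update(model, loss_fn, x, y, base_opt, rho=0.05):
    # 1) gradient at the current weights
    params = [p for p in model.parameters() if p.requires_grad]
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    grad_norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in params)) + 1e-12

    # 2) climb to the (approximate) worst-case weights inside an L2 ball of radius rho
    eps = []
    with torch.no_grad():
        for p in params:
            e = rho * p.grad / grad_norm
            p.add_(e)
            eps.append(e)

    # 3) gradient at the perturbed weights, then undo the perturbation and step
    base_opt.zero_grad()
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)
    base_opt.step()

# usage (toy model and data, illustrative only)
model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
sam_update(model, torch.nn.functional.cross_entropy, x, y, opt)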
This often leads to conducting a parameter search for 9 the prior using separate datasets (Dziugaite et al., 2021). Furthermore, existing PAC-Bayes training methods are typically tailored for bounded loss (Dziugaite and Roy, 2017, 2018; Pérez-Ortiz et al., 2021), limiting their straightforward application to popular losses like Cross-Entropy. On the other hand, the prevalent training methods for neural networks, which involve minimizing empirical risk with SGD/Adam, achieve satisfactory test performance. However, they often require integration with various regularization techniques to optimize generalization performance. For instance, research has shown that factors such as larger learning rates (Cohen et al., 2021; Barrett and Dherin, 2020), momentum (Ghosh et al., 2022; Cattaneo et al., 2023), smaller batch sizes (Lee and Jang, 2022), parameter noise injection (Neelakantan et al., 2015; Orvieto et al., 2022), and batch normalization (Luo et al., 2018) all induce higher degrees of implicit regularization, yielding better generalization. Besides, various explicit regularization techniques, such as weight decay (Loshchilov and Hutter, 2017), dropout (Wei et al., 2020), label noise (Damian et al., 2021) can also significantly affect generalization. While many studies have explored individual regularization techniques to identify their unique benefits, the interaction among these regularizations remains less understood. As a result, in practical scenarios, one has to extensively tune the hyperparameters corresponding to each regularization technique to obtain the optimal test performance. Although further investigation is needed to fully understand the underlying mechanisms, training models using ERM with various regularization methods remains the prevalent choice and typically delivers state-of-the-art test performance. While PAC-Bayes training is built upon a solid theoretical basis for analyzing generalization, its wider adoption is limited by existing assumptions about loss and challenges in prior selection. Moreover, it is still an open question regarding how to enhance PAC-Bayes training to match the performance of ERM methods with well-tuned regularizations. This section introduces a training algorithm using a new PAC-Bayes bound for unbounded loss. The contribution is summarized as follows: 1. We introduce a new PAC-Bayes bound for unbounded loss complemented by a training algorithm. This algorithm simultaneously optimizes the prior and the posterior using the same dataset. 2. The test performance of the proposed algorithm is theoretically justified. 10 3. The proposed PAC-Bayes training algorithm outperforms existing methods that minimize other PAC-Bayes bounds in terms of test performance. 4. Our training algorithm approaches the best test performance of the widely-used ERM using SGD/Adam, enhanced by standard regularizations like noise injection and weight decay. 5. Our training algorithm exhibits robustness to variation in hyperparameters such as learning rate and batch size. Besides, the same hyperparameter configuration is effective across various neural network architectures. 2.2.2 Preliminaries This section outlines the PAC-Bayes framework. For any supervised learning problem, the goal is to find a proper model h from some hypothesis space H , with the help of the training data S ≡ {𝑧𝑖}𝑚 𝑖=1, where 𝑧𝑖 is the training pair with sample x𝑖 and its label 𝑦𝑖. 
Given the loss function ℓ(h; z_i) : h ↦ R+, which measures the misfit between the true label y_i and the label predicted by h, the empirical and population/generalization errors are defined as

    ℓ(h; S) = (1/m) Σ_{i=1}^m ℓ(h; z_i),        ℓ(h; D) = E_{S∼D}[ℓ(h; S)],

by assuming that the training and testing data are i.i.d. sampled from the unknown distribution D. PAC-Bayes bounds include a family of upper bounds on the generalization error of the following type.

Theorem 2.2.1. (Maurer, 2004) Assume the loss function ℓ is bounded within the interval [0, 1]. Given a preset prior distribution P over the model space H, and given a scalar δ ∈ (0, 1), for any choice of i.i.d. m-sized training dataset S according to D, and all posterior distributions Q over H,

    E_{h∼Q} ℓ(h; D) ≤ E_{h∼Q} ℓ(h; S) + sqrt( ( log(2√m/δ) + KL(Q||P) ) / (2m) )

holds with probability at least 1 − δ. Here, KL stands for the Kullback-Leibler divergence.

A PAC-Bayes bound measures the gap between the expected empirical and generalization errors. It is worth noting that this bound holds for all posteriors Q for any given data-independent prior P, which enables optimization of the bound by searching for the best posterior. In practice, the posterior expectation corresponds to the trained model, and the prior expectation can be set to the initial model. In this section, we will use || · || to denote a generic norm, and || · ||2 to denote the L2 norm.

2.2.3 Related Work

PAC-Bayes bounds were first used to train neural networks in Dziugaite and Roy (2017). Specifically, the bound of McAllester (1999) was employed to train shallow stochastic neural networks on binary MNIST classification with a bounded 0-1 loss and was proven to be non-vacuous. Following this work, many recent studies (Letarte et al., 2019; Rivasplata et al., 2019; Pérez-Ortiz et al., 2021; Biggs and Guedj, 2021; Perez-Ortiz et al., 2021; Zhou et al., 2018b) expanded the applicability of PAC-Bayes bounds to a wider range of neural network architectures and datasets. However, most studies are limited to training shallow networks with binary labels using bounded loss, which restricts their broader application to deep network training. Although PAC-Bayes bounds for unbounded loss have been established (Audibert and Catoni, 2011; Alquier and Guedj, 2018; Holland, 2019; Kuzborskij and Szepesvári, 2019; Haddouche et al., 2021; Rivasplata et al., 2020; Rodríguez-Gálvez et al., 2023; Casado et al., 2024), it remains unclear whether these bounds can lead to enhanced test performance in training neural networks. This uncertainty arises partly because they usually include assumptions that are difficult to validate or terms that are hard to compute in real applications. For example, Kuzborskij and Szepesvári (2019) derived a PAC-Bayes bound under the second-order moment condition on the unbounded loss. However, as mentioned in the paper, that bound is semi-empirical, in the sense that it contains the population second-order moment of the loss, in contrast to usual PAC-Bayes bounds that only contain empirical quantities that can be computed from the data. To the best of our knowledge, existing PAC-Bayes bounds built under the second-order moment condition all suffer from this issue. Recently, Dziugaite et al. (2021) suggested that a tighter PAC-Bayes bound could be achieved with a data-dependent prior.
They divide the data into two sets, using one to train the prior and the other to train the posterior with the optimized prior, thus making the prior independent from the 12 training dataset for the posterior. This, however, reduces the training data available for the posterior. Dziugaite and Roy (2018) and Rivasplata et al. (2020) justified the approach of learning the prior and posterior with the same set of data by utilizing differential privacy. However, the argument only holds for priors provably satisfying the so-called 𝐷𝑃(𝜖)-condition in differential privacy, which limits their practical application. Pérez-Ortiz et al. (2021) also empirically shows training with Dziugaite and Roy (2018) could sacrifice test accuracy if the bound is not tight enough. In this work, we advance the PAC-Bayes training approach, enhancing its practicality and showcasing its potential in realistic settings. 2.2.4 New PAC-Bayes Bound for Unbounded loss Popular PAC-Bayes training algorithms (Dziugaite and Roy, 2017, 2018; Pérez-Ortiz et al., 2021) are limited to bounded loss. When dealing with unbounded Cross-Entropy loss1, they require a clipping of the loss to small bounded regions before applying the training, leading to suboptimal performance. On the other hand, PAC-Bayes bounds for unbounded loss were also established in the literature (Germain et al., 2016; Rodríguez-Gálvez et al., 2023) where the requirement of bounded loss is replaced by the weaker requirement of the finite second-order moment of the loss or finite CGF (cumulant generating function). However, these bounds are often not non-vacuous when applied to deep neural networks (as shown in Figure 2.2 of Section 2.2.8), meaning that the numerical value of the bound is too large for the training to progress. We propose a modified PAC-Bayes bound that imposes milder conditions, making it effective for training deep networks. The new bound is based on a modification of the existing assumption of the loss function, detailed as follows. Definition 2.2.2 (Exponential moment on finite intervals). Let 𝑋 be a random variable defined on the probability space (Ω, F , P) and 0 ≤ 𝛾1 ≤ 𝛾2 be two numbers. We call any 𝐾 > 0 an exponential moment bound of 𝑋 over the interval [𝛾1, 𝛾2], when E[exp (𝛾(E[𝑋] − 𝑋))] ≤ exp (𝛾2𝐾) (2.6) 1The MSE loss could also be unbounded when used in regression tasks 13 holds for all 𝛾 ∈ [𝛾1, 𝛾2]. By restricting the range of 𝛾 to a finite interval [𝛾1, 𝛾2], (2.6) is weaker than the usual exponential moment condition for sub-Gaussian distributions. Later, when we apply this condition to the PAC- Bayes analysis, the random variable 𝑋 in Def. 2.2.2 will represent the loss function. Since most loss functions in machine learning (e.g., Cross-Entropy, 𝐿1, MSE, Huber loss, hinge loss, Log-cosh loss, quantile loss) are non-negative, it is of great interest to analyze the strength of Definition 2.2.2 under 𝑋 ≥ 0. In this case, we can show that our condition is weaker than the second-order moment condition, which is currently the weakest condition allowing the establishment of PAC-Bayes bounds. Lemma 2.2.3. For non-negative random variable 𝑋 ≥ 0, the existence of 𝐾 on the interval 𝛾 ∈ [0, ∞) in Definition 2.2.2 can be implied by the existence of the second-order moment E𝑋 2 < ∞. This lemma suggests that for non-negative loss functions, our Definition 2.2.2 is weaker than the second-order moment condition. 
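To make Definition 2.2.2 concrete, the smallest admissible K on a finite interval [γ1, γ2] can be estimated from i.i.d. samples of a non-negative loss by evaluating log E[exp(γ(E[X] − X))]/γ² on a grid of γ values and taking the maximum. The sketch below only illustrates the definition; the dissertation's actual estimator of K̂(λ), which additionally samples models from the prior, is given in Appendix A.2.1, and the helper name and synthetic losses here are assumptions.

import numpy as np

def estimate_K(loss_samples, gamma1=0.5, gamma2=10.0, n_grid=100):
    # Monte Carlo estimate of the smallest K in Definition 2.2.2 over [gamma1, gamma2]
    x = np.asarray(loss_samples, dtype=np.float64)
    mean = x.mean()
    best = 0.0
    for g in np.linspace(gamma1, gamma2, n_grid):
        moment = np.mean(np.exp(g * (mean - x)))        # empirical E[exp(gamma (E[X] - X))]
        best = max(best, np.log(moment) / g ** 2)
    return best

# synthetic non-negative "losses" standing in for per-sample cross-entropy values
rng = np.random.default_rng(0)
print(estimate_K(rng.exponential(scale=0.7, size=10_000)))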
In addition, the assumption 𝑋 ≥ 0 can be further relaxed to 𝑋 ≥ −𝑀 with 𝑀 > 0, as in this case the random variable 𝑋 + 𝑀 is non-negative to which we can apply Lemma 2.2.3. Proof of Lemma 2.2.3. We show that E𝑋 2 < ∞ implies Definition 2.2.2 holding for any 𝛾 ∈ [0, ∞) with some finite 𝐾. Since E𝑋 2 < ∞, we have (E𝑋)2 ≤ E𝑋 2 < ∞. If 𝛾 ≥ 1 E𝑋 , then it suffices to take the 𝐾 in E𝑒𝛾(E𝑋−𝑋) ≤ 𝑒𝛾2𝐾 to be 𝐾 = E𝑋 𝛾 ≤ (E𝑋)2 ≡ 𝐾1. If 𝛾 < 1 E𝑋 , then using the inequality 𝑒𝑥 ≤ 1 + 𝑥 + 𝑥2, ∀𝑥 < 1 with 𝑥 := 𝛾(E𝑋 − 𝑋) ≤ 𝛾E𝑋 < 1, we have E𝑒𝛾(E𝑋−𝑋) ≤ E(1 + 𝛾(E𝑋 − 𝑋) + 𝛾2(E𝑋 − 𝑋)2) = 1 + 𝛾2Var(𝑋) ≤ 𝑒𝛾2Var(𝑋) Therefore, it suffices to take 𝐾 = Var(𝑋) ≡ 𝐾2. Collecting the two cases, we see taking 𝐾 = max{𝐾1, 𝐾2} would be enough for Definition 4.1 to hold with 𝛾1 = 0, 𝛾2 = ∞. □ 14 Remark 2.2.4 (Comparison with the first-order-moment condition). Still under the assumption 𝑋 ≥ 0, when the 𝛾1 in Definition 2.2.2 is finite (bounded away from 0), the existence of 𝐾 can be implied by the existence of first-order moment. Indeed, by taking 𝐾 = E[𝑋] 𝛾1 , the inequality E[𝑋] − 𝑋 ≤ E[𝑋] (assumed 𝑋 ≥ 0) immediately implies (2.6). However, this argument does not hold when 𝛾1 → 0. Hence we cannot say our condition is as weak as the first-order moment condition. We want to emphasize that the main motivation for proposing Definition 4.1 is from an empirical perspective, where we want to have a bound with a smaller numerical value. Therefore, in practice, we always take 𝛾1, 𝛾2 to be positive scalars. In addition, we propose to make the exponential moment bound depend on the prior distribution, which leads to a further reduction of the bound. For this purpose, we first extend Definition 2.2.2 from a single random variable to a family of random variables parameterized by models in a hypothesis space. Let us first explain what we mean by random variables parameterized by models in a hypothesis space. In the network setting, let us define 𝑋 (h) as 𝑋 (h) ≡ ℓ( 𝑓𝜃 (𝑥), 𝑦), where ℓ is the loss and h = 𝑓𝜃 is the model/network parametrized by weight 𝜃. For a fixed model h (i.e. 𝑓𝜃), we see 𝑋 (h) is a random variable whose randomness comes from the input pairs (𝑥, 𝑦) ∼ D (D is the data distribution). Since this random variable 𝑋 (h) varies with h, we call it a random variable parameterized by models h. Definition 2.2.5 (Exponential moment over hypotheses). Let 𝑋 (h) be a random variable param- eterized by the hypothesis h in some space H (i.e., h ∈ H ), and fix an interval [𝛾1, 𝛾2] with 0 < 𝛾1 < 𝛾2 < ∞. Let {Pλ, λ ∈ Λ} be a family of distribution over H parameterized by λ ∈ Λ ⊆ R𝑘 . Then, we call any non-negative function 𝐾 (λ) a uniform exponential moment bound for 𝑋 (h) over the priors {Pλ, λ ∈ Λ} and the interval [𝛾1, 𝛾2], if the following holds Eh∼Pλ E[exp (𝛾(E[𝑋 (h)] − 𝑋 (h)))] ≤ exp (𝛾2𝐾 (λ)), 15 for all 𝛾 ∈ [𝛾1, 𝛾2], and any λ ∈ Λ ⊆ R𝑘 . The minimal such 𝐾 (λ) is 𝐾min(λ) = sup 𝛾∈[𝛾1,𝛾2] 1 𝛾2 log(Eh∼Pλ E[exp (𝛾(E[𝑋 (h)] − 𝑋 (h)))]). (2.7) Similar to Definition 2.2.2, when dealing with non-negative loss, the existence of the exponential moment bound 𝐾min is guaranteed, provided that the second-order moment of the loss is bounded, or provided that the first-order moment of the loss is bounded and 𝛾1 is bounded away from 0. Now, we can establish the PAC-Bayes bound for losses that satisfy Definition 2.2.5. Theorem 2.2.6 (PAC-Bayes bound for unbounded loss with a preset prior distribution). Given a prior distribution Pλ over the hypothesis space H , parametrized by λ ∈ Λ. 
Assume the loss ℓ(h, 𝑧𝑖) as a random variable parametrized by h satisfies Definition 2.2.5. Fix some 𝛿 ∈ (0, 1). For any 0 < 𝛿 < 1 and 𝛾 ∈ [𝛾1, 𝛾2], we have (cid:18) 𝑃S ∀Q ∈ Q, Eh∼Qℓ(h; D) ≤ Eh∼Qℓ(h; S) + 1 𝛾𝑚 (log 1 𝛿 + KL(Q||Pλ)) + 𝛾𝐾 (λ) (cid:19) ≥ 1 − 𝛿 where Q is the set of all probability distributions. Remark 2.2.7. By setting 𝛾 = 𝑂 (𝑚−1/2), we observe that the asymptotic behavior of this bound aligns with the 𝑂 (𝑚−1/2) convergence rate of popular PAC-Bayes bounds in the literature. A corollary of this theorem and Lemma 2.2.3 is that this convergence rate can be achieved for CE loss under a bounded second-order moment condition. While bounds under the second-order moment condition were derived in the literature, as discussed in Section 2.2.3, our bound seems to be the first purely empirical bound (i.e., computable from data) that can be easily used for training. Moreover, our use of finite 𝛾1 > 0 and 𝛾2 < ∞ and the permission of 𝐾 to depend on the prior parameter 𝜆 further reduce the value of the bound. The proof is available in Appendix A.1.1. With the relaxed requirements on the loss function, our bound offers a basis for establishing effective optimization over both the posterior and the prior. We will first outline the training process, which focuses on jointly optimizing the prior and posterior to avoid the complex hyper-parameter search over the prior as Pérez-Ortiz et al. (2021), followed by a discussion of its theoretical guarantees. 16 The procedure is similar to the one in Dziugaite and Roy (2017), but has been adapted to align with our newly proposed bound. We begin by parameterizing the posterior distribution as Qσ (h), where h ∈ R𝑑 represents the mean of the posterior, and σ ∈ R𝑑 accounts for the variations in each model parameter from this mean (i.e., variance). Next, we parameterize the prior as Pλ, where λ ∈ R𝑘 . We operate under the assumption that the prior has significantly fewer parameters than the posterior, that is, 𝑘 ≪ 𝑑; the relevance of this assumption will become apparent upon examining Theorem 2.2.10. For our PAC-Bayes training, we propose to optimize over all four variables: h, 𝛾, σ, and λ: ( ˆh, ˆ𝛾, ˆσ, ˆλ) = arg min h,λ,σ,𝛾∈[𝛾1,𝛾2] 𝐿𝑃 𝐴𝐶 (h, 𝛾, σ, λ), (P) where 𝐿𝑃 𝐴𝐶 (h, 𝛾, σ, λ) = E ˜h∼Qσ (h) ℓ( ˜h; S) + 1 𝛾𝑚 (log 1 𝛿 + KL(Qσ (h)||Pλ)) + 𝛾𝐾 (λ). Compared to previous PAC-Bayes training, the most notable change in 𝐿𝑃 𝐴𝐶 is that we allow 𝐾 to depend on the prior parameter λ, and optimize it along with other terms. We provide an end-to-end theorem that guarantees the performance of this optimization algorithm. To derive our theorem, we need the following assumptions: Assumption 2.2.8 (Continuity of the KL divergence). Let 𝔔 be a family of posterior distributions, let 𝔓 = {𝑃λ, λ ∈ Λ ⊆ R𝑘 } be a family of prior distributions parameterized by λ. We say the KL divergence KL(Q||Pλ) is continuous with respect to λ over the posterior family, if there exists some non-decreasing function 𝜂1(𝑥) : R+ ↦→ R+ with 𝜂1(0) = 0, such that |KL(Q||Pλ) − KL(Q||P ˜λ)| ≤ 𝜂1(∥λ − ˜λ∥), for all pairs λ, ˜λ ∈ Λ and for all Q ∈ 𝔔. Assumption 2.2.9 (Continuity of the exponential moment bound). Let 𝐾min(λ) be as defined in Definition 2.2.5. Assume it is continuous with respect to the parameter λ of the prior in the sense that there exists a non-decreasing function 𝜂2(𝑥) : R+ ↦→ R+ with 𝜂2(0) = 0 such that |𝐾min(λ) − 𝐾min( ˜λ)| ≤ 𝜂2(∥λ − ˜λ∥), for all λ, ˜λ ∈ Λ. 
17 These two assumptions are quite weak and can be satisfied by popular continuous distributions, such as the exponential family. We will first present a general theorem applicable to all distribution families satisfying these assumptions. Then we demonstrate why the Gaussian prior/posterior distribution, commonly used in practice, satisfies these assumptions. Theorem 2.2.10 (PAC-Bayes bound for unbounded losses and trainable priors). Assume the loss ℓ(h, 𝑧𝑖) as a random variable parametrized by h satisfies Definition 2.2.5. Let 𝔔 be a family of posterior distribution, let 𝔓 = {𝑃λ, λ ∈ Λ ⊆ R𝑘 } be a family of prior distributions parameterized by λ. Let 𝑛(𝜀) := N (Λ, ∥ · ∥, 𝜀) be the covering number of the set of the prior parameters. Under Assumption 2.2.8 and Assumption 2.2.9, the following inequality holds for the minimizer ( ˆh, ˆ𝛾, ˆσ, ˆλ) of (P) and any 𝜖, 𝜀 > 0 with probability as least 1 − 𝜖: E h∼Q ˆσ ( ˆh) ℓ(h; D) ≤ 𝐿𝑃 𝐴𝐶 ( ˆh, ˆ𝛾, ˆσ, ˆλ) + 𝜂, (2.8) where 𝜂 = 𝐵𝜀 + 𝐶 (𝜂1(𝜀) + 𝜂2(𝜀)) + log(𝑛(𝜀)+ 𝛾1𝑚 𝜂2, 𝑚 and the upper bounds of the parameters in the prior and posterior. ) 2 −𝛾 𝛾 1 2𝜀 , and 𝐶 and 𝐵 are constants depending on 𝛾1, 𝛾2, The proof is available in Appendix A.1.2. The theorem provides a generalization bound on the model learned as the minimizer of (P) with data-dependent priors. This bound contains the PAC-Bayes loss 𝐿𝑃 𝐴𝐶 along with an additional correction term 𝜂, that is notably absent in the traditional PAC-Bayes bound with fixed priors. Given that ( ˆh, ˆ𝛾, ˆσ, ˆλ) minimizes 𝐿𝑃 𝐴𝐶, evaluating 𝐿𝑃 𝐴𝐶 at its own minimizer ensures that the first term is small. If the correction term is also small, then the test error remains low. In the next section, we will delve deeper into the condition for this term to be small. Intuitively, selecting a small 𝜀 helps to maintain low values for the first three terms in 𝜂. Although a smaller 𝜀 increases the 𝑛(𝜀) in the last term, this increase is moderated because it is inside the logarithm and divided by the size of the dataset. 18 2.2.5 PAC-Bayes Training Algorithm with Gaussian Families 2.2.6 Gaussian prior and posterior For the 𝐿𝑃 𝐴𝐶 objective to have a closed-form formula, in this section, we employ the Gaussian distribution family. For ease of illustration, we introduce a new notation. Consider a neural network model denoted as 𝑓θ, where 𝑓 represents the network’s architecture, and θ is the weight. In this context, 𝑓θ aligns with the h discussed in earlier sections. Moving forward, we will use 𝑓θ to refer to the model instead of h. We define the posterior distribution of the weights as a Gaussian distribution centered around the trainable weight θ, with trainable variance σ., i.e., the posterior weight distribution is N (θ, diag(σ)), denoted by Qσ (θ), where σ includes the anisotropic variance of the weights and θ includes the mean. The assumption of a diagonal covariance matrix implies the independence of the weights. We consider two types of priors, both centered around the initial weight of the neural network θ0 (as suggested by Dziugaite and Roy (2017)), but with different settings on the variance. Scalar prior: we use a universal scalar to encode the variance of all the weights in the prior, i.e., the weight distribution of P𝜆 is N (θ0, 𝜆𝐼𝑑), where 𝜆 is a scalar. With this prior, the KL divergence KL(Qσ (θ)||P𝜆 (θ0)) in (P) is: (cid:34) 1 2 −1⊤ 𝑑 log(σ) + 𝑑 (log(𝜆) − 1) + (∥σ∥1 + ∥θ − θ0∥2 2) 𝜆 (cid:35) . 
(2.9) Layerwise prior: weights in the 𝑖th layer share a common variance λ𝑖, but different layers could have different variances. By setting λ = (λ1, ...., λ𝑘 ) as the vector containing all the layerwise variances of a 𝑘-layer neural network, the weight distribution of prior Pλ is N (θ0, BlockDiag(λ)), where BlockDiag(λ) is obtained by diagonally stacking all λ𝑖 𝐼𝑑𝑖 into a 𝑑 × 𝑑 matrix, where 𝑑𝑖 is the number of weights of the 𝑖th layer. The KL divergence for layerwise prior is in Appendix A.1.3. For shallow networks, it is enough to use the scalar prior; for deep neural networks and neural networks constructed from different types of layers, using the layerwise prior is more sensible. By plugging in the closed-form (2.9) for KL(Qσ (θ)||Pλ(θ0)) into the PAC-Bayes bound in Theorem 2.2.10, we have the following corollary that justifies the usage of PAC-Bayes bound on large neural networks with the trainable prior. 19 Corollary 2.2.11. Suppose the posterior and prior are Gaussian distributions as defined above. Assume all parameters for the prior and posterior are bounded, i.e., we restrict the model parameter θ, the posterior variance σ and the prior variance λ, all to be searched over bounded sets, + : ∥σ∥1 ≤ 𝑑𝑇 }, Λ =: {λ ∈ [𝑒−𝑎, 𝑒𝑏] 𝑘 }, respectively, 𝑑 𝑀 }, Σ := {σ ∈ R𝑑 √ Θ := {θ ∈ R𝑑 : ∥θ∥2 ≤ with fixed 𝑀, 𝑇, 𝑎, 𝑏 > 0. Then, • Assumption 2.2.8 holds with 𝜂1(𝑥) = 𝐿1𝑥, where 𝐿1 = 1 √ 2 max{𝑑, 𝑒𝑎 (2 • Assumption 2.2.9 holds with 𝜂2(𝑥) = 𝐿2𝑥, where 𝐿2 = 1 𝛾2 1 (cid:16) 2𝑑𝑀 2𝑒2𝑎 + 𝑑 (𝑎+𝑏) 2 𝑑 𝑀 + 𝑑𝑇)} (cid:17) • With high probability, the PAC-Bayes bound for the minimizer of (P) has the form E θ∼Q ˆσ ( ˆθ) ℓ( 𝑓θ; D) ≤ 𝐿𝑃 𝐴𝐶 ( ˆθ, ˆ𝛾, ˆσ, ˆλ) + 𝜂, where 𝜂 = 𝑘 𝛾1𝑚 (cid:16) 1 + log 2(𝐶 𝐿+𝐵)Δ𝛾1𝑚 𝑘 (cid:17) , 𝐿 = 𝐿1 + 𝐿2, Δ := max{𝑏 + 𝑎, 2(𝛾2 − 𝛾1)}, 𝐶 = 1 𝛾1𝑚 + 𝛾2 𝐵 is a constant depending on 𝛾1, 𝛿, 𝑀, 𝑑, 𝑇, 𝑎, 𝑏, 𝑚2. In the bound, the term 𝐿𝑃 𝐴𝐶 ( ˆθ, ˆ𝛾, ˆσ, ˆλ) is inherently minimized as it evaluates the function 𝐿𝑃 𝐴𝐶 at its own minimizer. The overall bound remains low if the correction term 𝜂 can be deemed insignificant. The logarithm term in the definition of 𝜂 grows very mildly with the dimension in general, so we can treat it (almost) as a constant. Thus, 𝜂 ∼ 𝑘 𝛾1𝑚 , from which we see that 1). 𝜂 (and therefore the bound) would be small if prior’s degree of freedom 𝑘 is substantially less than the dataset size 𝑚 2). This bound still achieves the asymptotic rate of 𝑂 (𝑚−1/2) after optimizing over 𝛾1. We note that even if the corollary assumes that the parameters (i.e., mean and variance) of the Gaussian distribution are bounded, the random variable itself is still unbounded, so the loss is still unbounded. The proof and more discussions can be found in Appendix A.1.4. 2.2.7 Training algorithm Estimating 𝐾min(λ): In practice, the function 𝐾min(λ) must be estimated first. Since we showed in Corollary 2.2.11 and Remark 2.2.4 that 𝐾min(λ) is Lipschtiz continuous and bounded, we can approximate it using piecewise-linear functions. Notably, since for each fixed λ ∈ Λ, the prior is 2See Appendix A.1.4 for the explicit form of 𝐵. 20 Algorithm 2.1 PAC-Bayes training (scalar prior) Input: initial weight θ0 ∈ R𝑑, 𝑇1 = 500, 𝜆1 = 𝑒−12, 𝜆2 = 𝑒2, 𝛾1 = 0.5, 𝛾2 = 10. // 𝑇1, 𝜆1, 𝜆2, 𝛾1, 𝛾2 can be fixed in all experiments of Sec.2.2.8. 
Output: trained weight ˆθ, posterior noise level ˆσ θ ← θ0, v ← 1d · log( 1 𝑑 ∥θ0∥1) Obtain ˆ𝐾 (𝜆) with Λ = [𝜆1, 𝜆2] using (A.20) (Appendix Algorithm A.1) /*Stage 1*/ for epoch = 1 : 𝑇1 do 𝑑 ∥θ0∥1), 𝑏 ← log( 1 for sampling one batch 𝑠 from S do //Ensure non-negative variances 𝜆 ← exp(𝑏), σ ← exp(v) P𝜆 ← N (θ0; 𝜆𝐼𝑑), Qσ (θ) ← θ + N (0; diag(σ)) //Get the stochastic version of E ˜θ∼Qσ (θ) ℓ( 𝑓 ˜θ; S) Draw one ˜θ ∼ Qσ (θ) and evaluate ℓ( 𝑓 ˜θ; S) Compute the KL divergence as (2.9) Compute 𝛾 as (2.10) Compute the loss function L as 𝐿𝑃 𝐴𝐶 in (P) //Update all parameters 𝑏 ← 𝑏 + 𝜂 𝜕L 𝜕𝑏 , v ← v + 𝜂 𝜕L 𝜕v , θ ← θ + 𝜂 𝜕L 𝜕θ end for end for //Fix the noise level from now on ˆσ ← exp(v) /*Stage 2*/ while not converge do for sampling one batch 𝑠 from S do //Noise injection Draw one ˜θ ∼ Q ˆσ (θ) and evaluate ℓ( 𝑓 ˜θ; S) as ˜L, //Update model parameters θ ← θ + 𝜂 𝜕 ˜L 𝜕θ end for end while ˆθ ← θ independent of the data, this procedure of estimating 𝐾min(λ) can be carried out before training. More details are in Appendix A.2.1. Two-stage PAC-Bayes training: Algorithm 2.1 outlines the proposed PAC-Bayes training algorithm that contains two stages. Stage 1 performs pure PAC-Bayes bound minimization, and Stage 2 is a refinement stage. The version of Algorithm 1 that uses a layerwise prior is detailed in Appendix A.2.2. For Stage 1, although there are several input parameters to be specified, one 21 can use the same choice of values across very different network architectures and datasets with minor modifications. Please see Appendix A.3.1 for more discussions. When everything else in the PAC-Bayes loss is fixed, 𝛾 ∈ [𝛾1, 𝛾2] has a closed-form solution, 𝛾∗ = min √︄ max       𝛾1, 1 𝐾min log 1 𝛿 + KL(Qσ (θ)||Pλ(θ0)) 𝑚    , 𝛾2    (2.10) Therefore, we only need to perform gradient updates on the other three variables, θ, σ, λ. The second stage of training: Gastpar et al. (2023); Nagarajan and Kolter (2019) showed that achieving high accuracy on certain distributions precludes the possibility of getting a tight generalization bound in overparameterized settings. This implies that it is less possible to use reasonable generalization bound to fully train one overparameterized model on a particular dataset. By minimizing the PAC-Bayes bound only, it is also observed in our PAC-Bayes training (Stage 1) that the training accuracy is hard to reach 100%. Therefore, we add a second stage to ensure convergence of the training loss. Specifically, in Stage 2, we continue to update the model by minimizing only Eθ∼Q ˆσ ℓ( 𝑓θ; S) over θ, and keep all other variables (i.e., λ, σ) fixed to the solution found by Stage 1. This is essentially a stochastic gradient descent with noise injection, the level of which has been learned from Stage 1. The two-stage training is similar to the idea of the learning-rate scheduler (LRS). In LRS, the initial large learning rate introduces an implicit bias that guides the solution path towards a flat region (Cohen et al., 2021; Barrett and Dherin, 2020), and the later lower learning rate ensures the convergence to a local minimizer in this region. Without the large learning rate stage, it cannot reach the flat region; without the small learning rate stage, it cannot converge to a local minimizer. For the two-stage PAC-Bayes training, Stage 1 (PAC-Bayes stage) guides the solution to flat regions by minimizing the generalization bound, and Stage 2 is necessary for an actual convergence to a local minimizer. 
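To connect the pieces, the following self-contained PyTorch sketch mirrors the structure of Algorithm 2.1 with a scalar prior on a toy logistic-regression problem: Stage 1 minimizes L_PAC of (P) over the posterior mean θ, the log-variances v = log σ, and the log prior variance b = log λ, using the closed-form γ of (2.10); Stage 2 fixes the learned noise level and continues with noise-injected ERM on θ only. The toy data, the constant K̂ (which in Algorithm 2.1 is a pre-estimated piecewise-linear function of λ), the confidence level δ, the learning rates, and the use of Adam for all updates are simplifying assumptions.

import math
import torch

torch.manual_seed(0)

# toy data and model: multinomial logistic regression, so theta is a single flat vector
m_samples, dim, n_class = 2000, 20, 3
X = torch.randn(m_samples, dim)
y = torch.randint(0, n_class, (m_samples,))

def emp_loss(theta_flat):
    W = theta_flat.view(n_class, dim)
    return torch.nn.functional.cross_entropy(X @ W.t(), y)

def kl_scalar_prior(theta, theta0, v, b):
    # KL( N(theta, diag(exp(v))) || N(theta0, exp(b) I) ), cf. (2.9)
    d = theta.numel()
    return 0.5 * (-v.sum() + d * (b - 1.0)
                  + (torch.exp(v).sum() + (theta - theta0).pow(2).sum()) / torch.exp(b))

def pac_bayes_loss(loss_val, kl, K_hat, m, delta=0.025, g1=0.5, g2=10.0):
    # closed-form gamma of (2.10), then L_PAC of problem (P)
    gamma = torch.clamp(torch.sqrt((math.log(1 / delta) + kl) / (m * K_hat)), g1, g2)
    return loss_val + (math.log(1 / delta) + kl) / (gamma * m) + gamma * K_hat

theta0 = 0.01 * torch.randn(n_class * dim)
theta = theta0.clone().requires_grad_(True)
init_scale = math.log(float(theta0.abs().mean()))                    # log((1/d) ||theta0||_1)
v = torch.full_like(theta0, init_scale).requires_grad_(True)         # log posterior variances
b = torch.tensor(init_scale, requires_grad=True)                     # log prior variance
K_hat = 0.05                                                         # assumed pre-estimated bound

# Stage 1: minimize the PAC-Bayes bound over theta, v (posterior) and b (prior)
opt1 = torch.optim.Adam([theta, v, b], lr=1e-2)
for _ in range(500):
    theta_tilde = theta + torch.exp(0.5 * v) * torch.randn_like(theta)   # one posterior sample
    bound = pac_bayes_loss(emp_loss(theta_tilde),
                           kl_scalar_prior(theta, theta0, v, b), K_hat, m_samples)
    opt1.zero_grad(); bound.backward(); opt1.step()

# Stage 2: fix the learned noise level, continue with noise-injected ERM on theta only
sigma_half = torch.exp(0.5 * v.detach())
opt2 = torch.optim.Adam([theta], lr=1e-2)
for _ in range(500):
    loss = emp_loss(theta + sigma_half * torch.randn_like(theta))
    opt2.zero_grad(); loss.backward(); opt2.step()

# deterministic prediction with the posterior mean
test_logits = X @ theta.detach().view(n_class, dim).t()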
Regularizations in the PAC-Bayes training: By plugging the KL divergence (2.9) into P, we can see that in the case of Gaussian priors and posteriors, the PAC-Bayes loss is nothing but the original training loss augmented by a noise injection and a weight decay, except that strength of both of them are automatically learned. More discussions are available in Appendix A.2.3. 22 Prediction: After training, we use the mean of the posterior as the trained model and perform deterministic prediction on the test dataset. In Appendix A.2.4, we provide some mathematical intuition of why the deterministic predictor is expected to perform even better than the Bayesian predictor. 2.2.8 Experiments In this section, we demonstrate the efficacy of the proposed PAC-Bays training algorithm through extensive numerical experiments. Specifically, we conduct comparisons between our algorithm and existing PAC-Bayes training algorithms, as well as conventional training algorithms based on Empirical Risk Minimization (ERM). Our approach yields competitive test accuracy in all settings and exhibits a high degree of robustness w.r.t. the choice of hyperparameters. Comparison with different PAC-Bayes bounds and existing PAC-Bayes training algorithms: We compared our PAC-Bayes training algorithm using the layerwise prior with baselines in Pérez- Ortiz et al. (2021): quad (Rivasplata et al., 2019), lambda (Thiemann et al., 2017), classic (McAllester, 1999), and bbb (Blundell et al., 2015) in the context of deep convolutional neural networks. The baseline PAC-Bayes algorithms contain a variety of crucial hyperparameters, including variance of the prior (1e-2 to 5e-6), learning rate (1e-3 to 1e-2), momentum (0.95, 0.99), dropout rate (0 to 0.3) in the training of the prior, and the KL trade-off coefficient (1e-5 to 0.1) for bbb. These hyperparameters were chosen by grid search. The batch size is 250 for all methods. Our findings, as detailed in Table 2.1, show that our algorithm outperforms the other PAC-Bayes methods regarding test accuracy. It is important to note that all four baselines employed the PAC-Bayes bound for bounded loss. Therefore, they need to convert unbounded loss into bounded loss for training purposes. Various conversion methods were evaluated by Pérez-Ortiz et al. (2021), and the most effective one was selected for producing the results presented. To demonstrate the necessity of our newly proposed PAC-Bayes bound for unbounded loss, we compared this new bound with two existing PAC-Bayes bounds for unbounded loss. One is based on the subGaussian assumption (Corollary 4 of Germain et al. (2016)), while the other (Theorem 9 of Rodríguez-Gálvez et al. (2023)) assumes the loss function is a bounded cumulant generating 23 Figure 2.2 Training process when minimizing different PAC-Bayes bounds on CNN9 using CIFAR10 (Stage 1). Minimizing our bound (layer) achieves a tighter bound and better test accuracy compared with optimizing the other two (subGaussian and CGF). Table 2.1 Test accuracy of convolution neural networks on CIFAR10. The test accuracy of baselines for bounded loss is from Table 5 of Pérez-Ortiz et al. (2021), calculated as 1-the zero-one error of the deterministic predictor. subG represents the subGaussian bound. Our proposed PAC-Bayes training with a layerwise prior (layer) achieves the best test accuracy across all models. 
bounded unbounded quad lambda classic bbb subG CGF layer CNN9 CNN13 CNN15 78.63 84.47 85.31 79.39 84.48 85.51 78.33 84.22 85.20 83.49 81.49 85.84 85.41 85.63 85.95 80.02 85.46 84.21 88.31 84.36 87.55 function (CGF). It is important to note that, as of now, no training algorithms specifically leverage these PAC-Bayes bounds for unbounded loss. Therefore, for a fair comparison, we conducted an experiment by replacing our PAC-Bayes bound with the other two bounds and using the same two-stage training algorithm with the trainable layerwise prior. We found that the two baseline bounds are not non-vacuous on CNN9/13/15; both are larger than 1e5. The subGaussian bound even explodes on CNN13 and CNN15. When using these bounds for training a model, it is expected that they deliver worse 3 performance than the proposed one as shown in Table 2.1. We also visualized the test accuracy when minimizing different PAC-Bayes bounds for unbounded loss in Stage 1. As shown in Figure 2.2, minimizing our PAC-Bayes bound can achieve better generalization performance. The details of the two baseline bounds are in Appendix A.3.2. 3Despite the vacuousness of the bound, the final results are still meaningful due to the use of Stage 2. 24 Table 2.2 Test accuracy of CNNs on C10 (CIFAR10) and C100 (CIFAR100) with batch size 128. Our PAC-Bayes training with scalar and layerwise prior are labeled scalar and layer. The best and second-best test accuracies are highlighted and underlined. Our PAC-Bayes training can approximately match the best performance of the baseline. VGG13 VGG19 ResNet18 ResNet34 Dense121 C10 C100 C10 C100 C10 C100 C10 C100 C10 C100 90.2 SGD Adam 88.5 AdamW 88.4 scalar layer 88.7 89.7 66.9 63.7 61.8 67.2 67.1 90.2 89.0 89.0 89.2 90.5 64.5 58.8 62.3 61.3 62.3 89.9 87.5 87.9 88.0 89.3 64.0 61.6 61.4 68.8 68.9 90.0 87.9 88.3 89.6 90.9 70.3 59.5 59.9 69.5 69.9 91.8 91.2 91.5 91.2 91.5 74.0 70.0 70.1 71.4 72.2 Comparison with ERM optimized by SGD/Adam with various regularizations: We tested our PAC-Bayes training on CIFAR10 and CIFAR100 datasets with no data augmentation4 on various popular deep neural networks, VGG13, VGG19 (Simonyan and Zisserman, 2014), ResNet18, ResNet34 (He et al., 2016), and Dense121 (Huang et al., 2017) by comparing its performance with conventional empirical risk minimization by SGD/Adam enhanced by various regularizations (which we call baselines). The training of baselines involves a grid search for the best hyperparameters, including momentum for SGD (0.3 to 0.9), learning rate (1e-3 to 0.2), weight decay (1e-4 to 1e-2), and noise injection (5e-4 to 1e-2). The batch size was set to be 128. We reported the highest test accuracy obtained from this search as the baseline results. For all convolutional neural networks, our method employed Adam with a fixed learning rate of 1e-4. Since the CIFAR10 and CIFAR100 datasets do not have a published validation dataset, we used the test dataset to find the best hyperparameters of baselines during the grid search, which might lead to a slightly inflated performance for baselines. Nevertheless, as presented in Table 2.2, the test accuracy of our method is still competitive. Please refer to Appendix A.3.5 for more details. Evaluation on graph neural networks: To demonstrate the broad applicability of the proposed PAC-Bayes training algorithm to different network architectures, we evaluated it on graph neural networks (GNNs). 
Unlike CNNs, optimal GNN performance has been reported when using the AdamW optimizer for ERM with dropout enabled. To ensure the best baseline results, we conducted a hyperparameter search over the learning rate (1e-3 to 1e-2), weight decay (0 to 1e-2), noise injection (0 to 1e-2), and dropout (0 to 0.8), and we report the highest test accuracy as the baseline result. For our method, we used Adam with the learning rate fixed to 1e-2 for all graph neural networks. We follow the convention for graph datasets by randomly assigning 20 nodes per class for training, 500 nodes for validation, and the remaining nodes for testing. We tested four architectures, GCN (Kipf and Welling, 2016), GAT (Veličković et al., 2018), SAGE (Hamilton et al., 2017), and APPNP (Gasteiger et al., 2018), on five benchmark datasets: CoraML, Citeseer, PubMed, Cora, and DBLP (Bojchevski and Günnemann, 2017). Since the GNNs have only two convolution layers, applying our algorithm with the scalar prior is sensible. For our PAC-Bayes training, we retained the dropout layer in GAT as is, since it differs from conventional dropout and essentially drops edges of the input graph. The other architectures do not have this type of dropout; hence, our PAC-Bayes training for these architectures does not include dropout. Table 2.3 demonstrates that the performance of our algorithm closely approximates the best outcome of the baseline. Appendix A.3.6 provides additional details and more results. Extra analysis on few-shot text classification with transformers is in Appendix A.3.7.

4Results with data augmentation can be found in Appendix A.3.4.

Table 2.3 Test accuracy of GNNs trained with AdamW versus our proposed method with the scalar prior (scalar). The best test accuracies are highlighted. The performance of our training can almost match the best results of the baseline obtained after carefully tuning hyperparameters.

CoraML Citeseer PubMed Cora DBLP GCN GAT AdamW 85.7±0.7 90.3±0.4 85.0±0.6 scalar 60.7±0.7 84.9±0.8 62.0±0.4 86.1±0.7 90.0±0.4 AdamW 85.7±1.0 90.8±0.3 85.9±0.8 scalar 90.6±0.5 84.4±0.5 84.0±0.4 63.5±0.4 60.9±0.6 80.6±1.4 80.5±0.6 81.8±0.6 81.0±0.5 SAGE AdamW 85.7±0.5 90.5±0.5 86.5±0.5 scalar 60.6±0.5 83.5±0.4 90.0±0.5 84.4±0.6 61.2±0.2 80.7±0.6 79.9±0.5 APPNP AdamW 86.6±0.7 91.0±0.4 87.1±0.6 scalar 62.5±0.4 85.1±0.5 90.4±0.5 85.7±0.4 63.5±0.4 80.6±2.8 81.8±0.5

Table 2.4 Test accuracy of CNNs on CIFAR10 (C10) and CIFAR100 (C100) using a batch size of 2048. Values in (·) indicate how much the results differ from those obtained with a batch size of 128. Our PAC-Bayes training with the scalar and layerwise priors is labeled scalar and layer. The most robust results with respect to the increase in batch size are highlighted, indicating the elevated robustness of our method to batch size compared with the baselines.

          VGG13                        ResNet18
          C10           C100          C10           C100
SGD       87.7 (-2.5)   60.1 (-6.8)   85.4 (-4.5)   61.5 (-2.6)
Adam      90.7 (+2.2)   66.2 (+2.5)   87.7 (+0.2)   65.4 (+3.8)
AdamW     87.2 (-1.1)   61.0 (-0.8)   84.9 (-2.9)   58.9 (-2.5)
scalar    88.9 (+0.2)   66.0 (-1.2)   88.9 (+0.9)   68.7 (-0.1)
layer     89.4 (-0.3)   67.1 (0.0)    89.2 (-0.1)   69.3 (+0.3)

Table 2.5 Test accuracy of ResNet18 and VGG13 trained with different learning rates on CIFAR10. The best test accuracies are highlighted. Our method is more robust to learning rate variations.
Model      Method   3e-5   5e-5   1e-4   2e-4   3e-4   5e-4   1e-3
ResNet18   layer    88.4   88.8   89.3   88.6   88.3   89.2   87.3
           Adam     66.6   73.9   81.2   85.3   86.4   87.0   87.5
VGG13      layer    88.6   88.9   89.7   89.6   89.6   89.5   88.7
           Adam     84.3   84.8   85.8   87.4   87.9   88.3   88.5

Evaluation on the sensitivity of hyperparameters: In previous experiments, we selected specific batch sizes and learning rates as the only two tunable hyperparameters of our algorithm, with all other parameters remaining constant across all experiments. We further demonstrate that batch size and learning rate variations do not significantly impact our final performance. This suggests a general robustness of our method to hyperparameters, reducing the necessity for extensive tuning. More specifically, with the learning rate fixed to 5e-4 in our method, Table 2.4 shows that increasing the batch size from 128 to a very large one, 2048, for VGG13 and ResNet18 does not significantly affect the performance of PAC-Bayes training, in contrast to ERM with extensive tuning as before. Also, as shown in Table 2.5, our algorithm is more robust to learning rate changes than ERM, which uses the optimal weight decay and noise injection settings from Table 2.2. Please refer to Appendix A.3.8 for more results.

2.2.9 Summary

In this section, the objective function is given by the proposed PAC-Bayes bound, which plays the role of the ℓ term in Equation (1.3), while the training procedure implements the arg min over 𝜃. This integration of theoretical principles and practical implementation forms a robust algorithm for reducing the generalization error of machine learning models. Specifically, we presented the practical deployment of the PAC-Bayes bound, extending its use to effectively training neural networks with satisfactory test performance. To realize this, we proposed a new PAC-Bayes bound for unbounded loss with a trainable prior. This new bound overcomes the limitations inherent in the assumptions of bounded loss and extensive prior selection.

CHAPTER 3
DATA-DRIVEN REGULARIZATION

This chapter presents data-dependent regularization approaches. According to the regularization taxonomy in Equation (1.3), the terms 𝑓 , 𝑅, ℓ, and S can all be data-dependent. To regulate learning and directly learn desired patterns, a new architecture, corresponding to the 𝑓 term, is required. Depending on the task, a specific 𝑅 term, such as the total variation (TV) loss that is popular in full-waveform inversion (FWI), can be incorporated into the loss function. Specialized ℓ functions can be designed to encode physical constraints relevant to scientific data. Furthermore, data augmentation techniques, tailored to the unique characteristics of the data, can be implemented as part of S.

3.1 MagNet: A Neural Network for Directed Graphs

Introducing graph structures to a collection of objects allows the encoding of pairwise relationships, which often possess inherent directional properties. For instance, the WebKB dataset Pei et al. (2020) comprises a list of university websites interconnected by hyperlinks, where one website might link to another without a reciprocal link, typifying directed graphs. In this context, this section introduces MagNet, a graph convolutional neural network designed for directed graphs. This network, represented by the 𝑓 term in Equation (1.3), leverages the magnetic Laplacian to effectively model and learn from the directional relationships present in such datasets.

3.1.1 Introduction

Most graph neural networks fall into one of two families: spectral networks or spatial networks.
Spatial methods define graph convolution as a localized averaging operation with iteratively learned weights. Spectral networks, on the other hand, define convolution on graphs via the eigendecompositon of the (normalized) graph Laplacian. The eigenvectors of the graph Laplacian assume the role of Fourier modes, and convolution is defined as entrywise multiplication in the Fourier basis. For a comprehensive review of both spatial and spectral networks, we refer the reader to Zhou et al. (2018a) and Wu et al. (2020b). Many spatial graph CNNs have natural extensions to directed graphs. However, these extensions typically only consider the outgoing neighbors of each vertex and neglect the incoming neighbors. 29 Therefore, they run the risk of discarding potentially important information. Consider, for example, a directed social network such as Twitter, where the nodes are Twitter accounts and a directed edge (𝑢, 𝑣) ∈ 𝐸 means that account 𝑢 mentions account 𝑣 (using the @ functionality). To infer something about account 𝑣, there is important information to be gathered both from other accounts that 𝑣 mentions, and accounts that mention 𝑣. Therefore, it is common for spatial methods to preprocess the data by symmetrizing the adjacency matrix, effectively creating an undirected graph. For example, while Veličković et al. (2018) explicitly notes that their network is well-defined on directed graphs, their experiments treat all citation networks as undirected for improved performance. Extending spectral methods to directed graphs is not straightforward since the adjacency matrix is asymmetric and, thus, there is no obvious way to define a symmetric, real-valued Laplacian with a full set of real eigenvalues that uniquely encodes any directed graph. We overcome this challenge by constructing a network based on the magnetic Laplacian L(𝑞) defined in Section 3.1.2. Unlike the directed graph Laplacians used in works such as Ma et al. (2019); Monti et al. (2018); Tong et al. (2020a,b), the magnetic Laplacian is not a real-valued symmetric matrix. Instead, it is a complex-valued Hermitian matrix that encodes the fundamentally asymmetric nature of a directed graph via the complex phase of its entries. Since L(𝑞) is Hermitian, the spectral theorem implies it has an orthonormal basis of complex eigenvectors corresponding to real eigenvalues. Moreover, Theorem B.6.1, stated in Section B.6 of the appendix, shows that L(𝑞) is positive semidefinite, similar to the traditional Laplacian. Setting 𝑞 = 0 is equivalent to symmetrizing the adjacency matrix and no importance is given to directional information. When 𝑞 = .25, on the other hand, we have that L(.25) (𝑢, 𝑣) = −L(.25) (𝑣, 𝑢) whenever there is an edge from 𝑢 to 𝑣 but not from 𝑣 to 𝑢. Different values of 𝑞 highlight different graph motifs Fanuel et al. (2018, 2017); Guo and Mohar (2017); Mohar (2020), and therefore the optimal choice of 𝑞 varies. Learning the appropriate value of 𝑞 from data allows MagNet to adaptively incorporate directed information. We also note that L(𝑞) has been applied to graph signal processing Furutani et al. (2020), community detection Fanuel et al. (2017), and clustering Cloninger (2017); Fanuel et al. (2018); F. de Resende and F. Costa (2020). 30 In Section 3.1.4, we show how the networks constructed in Bruna et al. (2014); Defferrard et al. (2016); Kipf and Welling (2016) can be adapted to directed graphs by incorporating complex Hermitian matrices, such as the magnetic Laplacian. 
When 𝑞 = 0, we effectively recover the networks constructed in those previous works. Therefore, our work generalizes these networks in a way that is suitable for directed graphs. Our method is very general and is not tied to any particular choice of network architecture. Indeed, the main ideas of this work could be adapted to nearly any spectral graph neural network, and some spatial ones. In Section 3.1.3, we summarize related work on directed graph neural networks as well as other papers studying the magnetic Laplacian and its applications in data science. In Section 3.1.5, we apply our network to node classification and link prediction tasks. We compare against several spectral and spatial methods as well as networks designed for directed graphs. We find that MagNet obtains the best or second-best performance on five out of six node-classification tasks and has the best performance on seven out of eight link-prediction tasks tested on real-world data, in addition to providing excellent node-classification performance on difficult synthetic data. The full implementation details, theoretical results concerning the magnetic Laplacian, extended examples, and further numerical details are in the appendix Section B. 3.1.2 The magnetic Laplacian Spectral graph theory has been remarkably successful in relating geometric characteristics of undirected graphs to properties of eigenvectors and eigenvalues of graph Laplacians and related matrices. For example, the tasks of optimal graph partitioning, sparsification, clustering, and embedding may be approximated by eigenvectors corresponding to small eigenvalues of various Laplacians (see, e.g., Chung and Graham (1997); Shi and Malik (1997); Belkin and Niyogi (2003); Spielman and Teng (2004); Coifman and Lafon (2006)). Similarly, the graph signal processing research community leverages the full set of eigenvectors to extend the Fourier transform to these structures Ortega et al. (2018). Furthermore, numerous papers Bruna et al. (2014); Defferrard et al. (2016); Kipf and Welling (2016) have shown that this eigendecomposition can be used to define neural networks on graphs. In this section, we provide the background needed to extend these 31 constructions to directed graphs via complex Hermitian matrices such as the magnetic Laplacian. We let 𝐺 = (𝑉, 𝐸) be a directed graph where 𝑉 is a set of 𝑁 vertices and 𝐸 ⊆ 𝑉 × 𝑉 is a set of directed edges. If (𝑢, 𝑣) ∈ 𝐸, then we say there is an edge from 𝑢 to 𝑣. For the sake of simplicity, we will focus on the case where the graph is unweighted and has no self-loops, i.e., (𝑣, 𝑣) ∉ 𝐸, but our methods have natural extensions to graphs with self-loops and/or weighted edges. If both (𝑢, 𝑣) ∈ 𝐸 and (𝑣, 𝑢) ∈ 𝐸, then one may consider this pair of directed edges as a single undirected edge. A directed graph can be described by an adjacency matrix (A(𝑢, 𝑣))𝑢,𝑣∈𝑉 where A(𝑢, 𝑣) = 1 if (𝑢, 𝑣) ∈ 𝐸 and A(𝑢, 𝑣) = 0 otherwise. Unless 𝐺 is undirected, A is not symmetric, and, indeed, this is the key technical challenge in extending spectral graph neural networks to directed graphs. In the undirected case, where the adjacency matrix A is symmetric, the (unnormalized) graph Laplacian can be defined by L = D − A, where D is a diagonal degree matrix. It is well-known that L is a symmetric, positive-semidefinite matrix and therefore has an orthonormal basis of eigenvectors associated with non-negative eigenvalues. However, when A is asymmetric, direct attempts to define the Laplacian this way typically yield complex eigenvalues. 
This impedes the straightforward extension of classical methods of spectral graph theory and graph signal processing to directed graphs. A key point of this project is to represent the directed graph through a complex Hermitian matrix L such that: (1) the magnitude of L(𝑢, 𝑣) indicates the presence of an edge, but not its direction; and (2) the phase of L(𝑢, 𝑣) indicates the direction of the edge, or if the edge is undirected. Such matrices have been explored in the directed graph literature (see Section 3.1.3), but not in the context of graph neural networks. They have several advantages over their real-valued matrix counterparts. In particular, a single symmetric real-valued matrix will not uniquely represent a directed graph. Instead, one must use several matrices, as in Tong et al. (2020b), but this increases the complexity of the resulting network. Alternatively, one can work with an asymmetric, real-valued matrix, such as the adjacency matrix or the random walk matrix. However, the spatial graph filters that result from such matrices are typically limited by the fact that they can only aggregate information from the vertices that can be reached in one hop from a central vertex, but ignore the equally important subset of vertices that can reach the central vertex in one hop. Complex Hermitian matrices, however, lead to filters that aggregate information from both sets of vertices. Finally, one could use a real-valued skew-symmetric matrix, but such matrices do not generalize well to graphs with both directed and undirected edges.

The optimal choice of complex Hermitian matrix is an open question. Here, we utilize a parameterized family of magnetic Laplacians, which have proven to be useful in other data-driven contexts Fanuel et al. (2017); Cloninger (2017); Fanuel et al. (2018); F. de Resende and F. Costa (2020). We first define the symmetrized adjacency matrix and the corresponding degree matrix by

A_s(𝑢, 𝑣) := (1/2)(A(𝑢, 𝑣) + A(𝑣, 𝑢)),  1 ≤ 𝑢, 𝑣 ≤ 𝑁,
D_s(𝑢, 𝑢) := ∑_{𝑣∈𝑉} A_s(𝑢, 𝑣),  1 ≤ 𝑢 ≤ 𝑁,

with D_s(𝑢, 𝑣) = 0 for 𝑢 ≠ 𝑣. We capture directional information via a phase matrix,1 Θ^(𝑞),

Θ^(𝑞)(𝑢, 𝑣) := 2𝜋𝑞(A(𝑢, 𝑣) − A(𝑣, 𝑢)),  𝑞 ≥ 0,

where exp(𝑖Θ^(𝑞)) is defined component-wise by exp(𝑖Θ^(𝑞))(𝑢, 𝑣) := exp(𝑖Θ^(𝑞)(𝑢, 𝑣)). Letting ⊙ denote component-wise multiplication, we define the complex Hermitian adjacency matrix H^(𝑞) by

H^(𝑞) := A_s ⊙ exp(𝑖Θ^(𝑞)).

Since Θ^(𝑞) is skew-symmetric, H^(𝑞) is Hermitian. When 𝑞 = 0, we have Θ^(0) = 0 and so H^(0) = A_s. This effectively corresponds to treating the graph as undirected. For 𝑞 ≠ 0, the phase of H^(𝑞)(𝑢, 𝑣) encodes edge direction, and the value H^(𝑞)(𝑢, 𝑣) separates four possible cases: no edge, edge from 𝑢 to 𝑣, edge from 𝑣 to 𝑢, and undirected edge. If there is no edge, we will have H^(𝑞)(𝑢, 𝑣) = 0. In the case of a directed edge, the Hermitian adjacency will be complex valued, and changing the direction of an edge will correspond to complex conjugation. For example, in the case where 𝑞 = .25, if there is an edge from 𝑢 to 𝑣 but not from 𝑣 to 𝑢, we have

H^(.25)(𝑢, 𝑣) = 𝑖/2 = −H^(.25)(𝑣, 𝑢).

Thus, in this setting, an edge from 𝑢 to 𝑣 is treated as the opposite of an edge from 𝑣 to 𝑢. On the other hand, if (𝑢, 𝑣), (𝑣, 𝑢) ∈ 𝐸 (which can be interpreted as a single undirected edge), then H^(𝑞)(𝑢, 𝑣) = H^(𝑞)(𝑣, 𝑢) = 1, and we see that the phase, Θ^(𝑞)(𝑢, 𝑣) = 0, encodes the lack of direction in the edge. For the rest of this section, we will assume that 𝑞 lies in between these two extreme values, i.e., 0 ≤ 𝑞 ≤ .25.

1Our definition of Θ^(𝑞) coincides with that used in Furutani et al. (2020). However, another definition (differing by a minus sign) also appears in the literature. The resulting magnetic Laplacians have the same eigenvalues, and the corresponding eigenvectors are complex conjugates of one another. Therefore, this difference does not affect the performance of our network, since our final layer separates the real and imaginary parts before multiplying by a trainable weight matrix (see Section 3.1.4 for details on the network structure).
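As a concrete illustration of these definitions, the following is a minimal NumPy sketch written for this discussion (not the implementation used in this work); the toy graph, the function name, and the inclusion of the normalized magnetic Laplacian L_N^(𝑞) defined in (3.1) below are our own choices.

```python
import numpy as np

def magnetic_laplacian(A: np.ndarray, q: float = 0.25):
    """A: binary, possibly asymmetric adjacency matrix without self-loops."""
    A_s = 0.5 * (A + A.T)                    # symmetrized adjacency A_s
    Theta = 2.0 * np.pi * q * (A - A.T)      # skew-symmetric phase matrix Theta^(q)
    H = A_s * np.exp(1j * Theta)             # Hermitian adjacency H^(q) = A_s (elementwise*) exp(i Theta)
    d = np.maximum(A_s.sum(axis=1), 1e-12)   # degrees D_s(u, u)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_N = np.eye(A.shape[0]) - (D_inv_sqrt @ A_s @ D_inv_sqrt) * np.exp(1j * Theta)
    return H, L_N

# Toy graph: a directed edge 0 -> 1 and an undirected pair {1, 2}.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 1, 0]], dtype=float)
H, L_N = magnetic_laplacian(A, q=0.25)
print(np.allclose(H, H.conj().T))     # True: H^(q) is Hermitian
print(H[0, 1], H[1, 0])               # ~0.5j and ~-0.5j, i.e. i/2 and -i/2 for the directed edge
lam = np.linalg.eigvalsh(L_N)         # real eigenvalues of the Hermitian L_N^(q)
print(lam.min() >= -1e-10, lam.max() <= 2 + 1e-10)   # consistent with the PSD and [0, 2] claims
```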
We define the normalized and unnormalized magnetic Laplacians by

L_U^(𝑞) := D_s − H^(𝑞) = D_s − A_s ⊙ exp(𝑖Θ^(𝑞)),   L_N^(𝑞) := I − (D_s^(−1/2) A_s D_s^(−1/2)) ⊙ exp(𝑖Θ^(𝑞)).   (3.1)

Note that when 𝐺 is undirected, L_U^(𝑞) and L_N^(𝑞) reduce to the standard undirected Laplacians. Both L_U^(𝑞) and L_N^(𝑞) are Hermitian. Theorem 1 (Section B.6 of the appendix) shows they are positive semidefinite and thus are diagonalized by an orthonormal basis of complex eigenvectors u1, . . . , u𝑁 associated to real, nonnegative eigenvalues 𝜆1, . . . , 𝜆𝑁. Similar to the traditional normalized Laplacian, Theorem 2 (Section B.6 of the appendix) shows that the eigenvalues of L_N^(𝑞) lie in [0, 2], and we may factor L_N^(𝑞) = UΛU†, where U is the 𝑁 × 𝑁 matrix whose 𝑘-th column is u𝑘, Λ is the diagonal matrix with Λ(𝑘, 𝑘) = 𝜆𝑘, and U† is the conjugate transpose of U (a similar formula holds for L_U^(𝑞)). Furthermore, recall that for an undirected graph L = BB⊤, where B is the signed incidence matrix. Similarly, Theorem 3 (Section B.6 of the appendix) shows that L_U^(𝑞) = B^(𝑞)(B^(𝑞))†, where B^(𝑞) is a modified incidence matrix.

The magnetic Laplacian encodes geometric information in its eigenvectors and eigenvalues. In the directed star graph (Section B.7 of the appendix), for example, directional information is contained in the eigenvectors only, whereas the eigenvalues are invariant to the direction of the edges. On the other hand, for the directed cycle graph, the magnetic Laplacian encodes the directed nature of the graph solely in its spectrum. In general, both the eigenvectors and the eigenvalues may contain important information, which we leverage in MagNet.

3.1.3 Related work

In Section 3.1.3.1, we describe other graph neural networks designed specifically for directed graphs. Notably, none of these methods encode directionality with complex numbers, instead opting for real-valued, symmetric matrices. In Section 3.1.3.2, we review other work studying the magnetic Laplacian, which has been studied for several decades and has lately garnered interest in the network science and graph signal processing communities. However, to the best of our knowledge, this is the first work to use it to construct a graph neural network. We also note that there are numerous approaches to graph signal processing on directed graphs. Many of these rely on a natural analog of Fourier modes, typically defined through either a factorization of a graph shift operator or by solving an optimization problem. For further review, we refer the reader to Marques et al. (2020).

3.1.3.1 Neural networks for directed graphs

In Ma et al. (2019), the authors construct a directed Laplacian via identities involving the random walk matrix and its stationary distribution 𝚷.
When 𝐺 is undirected, one can use the fact that 𝚷 is proportional to the degree vector to verify this directed Laplacian reduces to the standard normalized graph Laplacian. However, this method requires 𝐺 to be strongly connected, unlike MagNet. The authors of Tong et al. (2020b) use a first-order proximity matrix A𝐹 (equivalent to A𝑠 here), as well as two second-order proximity matrices A𝑆in and A𝑆out. A𝑆in is defined by A𝑆in (𝑢, 𝑣) ≠ 0 if there exists a 𝑤 such that (𝑤, 𝑢), (𝑤, 𝑣) ∈ 𝐸, and A𝑆out is defined analogously. These three matrices collectively describe and distinguish the neighborhood of each vertex and those vertices that can reach a vertex in a single hop. The authors construct three different Laplacians and use a fusion operator to share information across channels. Similarly, inspired by Benson et al. (2016), in Monti et al. (2018), the authors consider several different symmetric Laplacian matrices corresponding to a number of different graph motifs. The method of Tong et al. (2020a) builds upon the ideas of both Ma et al. (2019) and Tong et al. (2020b) and considers a directed Laplacian similar to the one used in Ma et al. (2019), but with a PageRank matrix in place of the random-walk matrix. This allows for applications to graphs which are not strongly connected. Similar to Tong et al. (2020b), they use higher-order receptive fields (analogous to the second-order adjacency matrices discussed above) and an inception module to share information between receptive fields of different orders. We also note Klicpera et al. (2019a), which uses an approach based on PageRank in the spatial domain. There are also some related methods for directed graphs that are not based on the graph Laplacian, such as the directed graph embedding Sim et al. (2021), and directed message passing for molecular graphs Klicpera et al. 35 (2019b). 3.1.3.2 Related work on the magnetic Laplacian and Hermitian adjacency matrices The magnetic Laplacian has been studied since at least Lieb and Loss (1993). The name originates from its interpretation as a quantum mechanical Hamiltonian of a particle under magnetic flux. Early works focused on 𝑑-regular graphs, where the eigenvectors of the magnetic Laplacian are equivalent to those of the Hermitian adjacency matrix. The authors of Guo and Mohar (2017), for example, show that using a complex-valued Hermitian adjacency matrix rather than the symmetrized adjacency matrix reduces the number of small, non-isomorphic cospectral graphs. Topics of current research into Hermitian adjacency matrices include clustering tasks Cucuringu et al. (2020) and the role of the parameter 𝑞 Mohar (2020). The magnetic Laplacian is also the subject of ongoing research in graph signal processing Furutani et al. (2020), community detection Fanuel et al. (2017), and clustering Cloninger (2017); Fanuel et al. (2018); F. de Resende and F. Costa (2020). For example, Fanuel et al. (2018) uses the phase of the eigenvectors to construct eigenmap embeddings analogous to Belkin and Niyogi (2003). The role of 𝑞 is highlighted in the works of Fanuel et al. (2018, 2017); Guo and Mohar (2017); Mohar (2020), which show how particular choices of 𝑞 may highlight various graph motifs. In our context, this indicates that 𝑞 should be carefully tuned via cross-validation. Lastly, we note that numerous other directed graph Laplacians have been studied and applied to data science Chung (2005); Chung and Kempton (2013); Palmer and Zheng (2021). 
However, as alluded to in Section 3.1.2, these methods typically do not use complex Hermitian matrices.

3.1.4 MagNet

Most graph neural network architectures can be described as being either spectral or spatial. Spatial networks such as Veličković et al. (2018); Hamilton et al. (2017); Atwood and Towsley (2016); Duvenaud et al. (2015) typically extend convolution to graphs by performing a weighted average of features over neighborhoods N(𝑢) = {𝑣 : (𝑢, 𝑣) ∈ 𝐸}. These neighborhoods are well-defined even when 𝐸 is not symmetric, so spatial methods typically have natural extensions to directed graphs. However, such simplistic extensions may miss important information in the directed graph. For example, filters defined using N(𝑢) are not capable of assimilating the equally important information contained in {𝑣 : (𝑣, 𝑢) ∈ 𝐸}. Alternatively, these methods may also use the symmetrized adjacency matrix, but then they cannot learn to balance directed and undirected approaches.

In this section, we show how to extend spectral methods to directed graphs using the magnetic Laplacian introduced in Section 3.1.2. To highlight the flexibility of our approach, we show how three spectral graph neural network architectures can be adapted to incorporate the magnetic Laplacian. Our approach is very general, and so for most of this section we will perform our analysis for a general complex Hermitian, positive semidefinite matrix. However, we view the magnetic Laplacian as our primary object of interest (and use it in all of our experiments in Section 3.1.5) because of the large body of literature studying its spectral properties and applying it to data science (see Section 3.1.3).

3.1.4.1 Spectral convolution via the magnetic Laplacian

In this section, we let L denote a Hermitian, positive semidefinite matrix, such as the normalized or unnormalized magnetic Laplacian introduced in Section 3.1.2, on a directed graph 𝐺 = (𝑉, 𝐸), |𝑉| = 𝑁. We let u1, . . . , u𝑁 be an orthonormal basis of eigenvectors for L and let U be the 𝑁 × 𝑁 matrix whose 𝑘-th column is u𝑘. We define the directed graph Fourier transform for a signal x : 𝑉 → ℂ by x̂ = U†x, so that x̂(𝑘) = ⟨x, u𝑘⟩. We regard the eigenvectors u1, . . . , u𝑁 as the generalizations of discrete Fourier modes to directed graphs. Since U is unitary, we have the Fourier inversion formula

x = Ux̂ = ∑_{𝑘=1}^{𝑁} x̂(𝑘) u𝑘.   (3.2)

In Euclidean space, convolution corresponds to pointwise multiplication in the Fourier basis. Thus, we define the convolution of x with a filter y in the Fourier domain by setting the 𝑘-th Fourier coefficient of y ∗ x to be ŷ(𝑘)x̂(𝑘). By (3.2), this implies y ∗ x = U Diag(ŷ) x̂ = (U Diag(ŷ) U†) x, and so we say Y is a convolution matrix if

Y = U Σ U†,   (3.3)

for a diagonal matrix Σ. This is the natural generalization of the class of convolutions used in Bruna et al. (2014).
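The following short NumPy sketch (ours, using a random toy graph and an arbitrary choice of spectral response) illustrates the Fourier transform (3.2) and the convolution matrices (3.3) built from the eigendecomposition of L_N^(𝑞).

```python
import numpy as np

rng = np.random.default_rng(0)
N, q = 6, 0.25
A = (rng.random((N, N)) < 0.3).astype(float)      # random directed toy graph
np.fill_diagonal(A, 0.0)
A_s = 0.5 * (A + A.T)
Theta = 2 * np.pi * q * (A - A.T)
d = np.maximum(A_s.sum(axis=1), 1e-12)
L_N = np.eye(N) - (A_s / np.sqrt(np.outer(d, d))) * np.exp(1j * Theta)

lam, U = np.linalg.eigh(L_N)        # real eigenvalues, unitary eigenvector matrix U
x = rng.standard_normal(N)          # a graph signal on the vertices
x_hat = U.conj().T @ x              # directed graph Fourier transform, x_hat(k) = <x, u_k>
print(np.allclose(U @ x_hat, x))    # Fourier inversion formula (3.2)

h = np.exp(-2.0 * lam)              # an illustrative (low-pass) spectral response
Y = U @ np.diag(h) @ U.conj().T     # convolution matrix Y = U Sigma U^dagger, eq. (3.3)
y = Y @ x                           # the filtered signal y * x
print(np.allclose(Y, Y.conj().T))   # Y is Hermitian for a real-valued response
```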
Next, following Defferrard et al. (2016) (see also Hammond et al. (2011)), we show that a spectral network can be implemented in the spatial domain via polynomials of L by letting Σ be a polynomial of Λ in (3.3). This reduces the number of trainable parameters to prevent overfitting, avoids explicit diagonalization of the matrix L (which is expensive for large graphs), and improves stability to perturbations Levie et al. (2019). As in Defferrard et al. (2016), we define a normalized eigenvalue matrix, with entries in [−1, 1], by Λ̃ = (2/𝜆max)Λ − I and assume

Σ = ∑_{𝑘=0}^{𝐾} 𝜃𝑘 𝑇𝑘(Λ̃),

for some real-valued 𝜃0, . . . , 𝜃𝐾, where 𝑇𝑘 is the Chebyshev polynomial defined by 𝑇0(𝑥) = 1, 𝑇1(𝑥) = 𝑥, and 𝑇𝑘(𝑥) = 2𝑥𝑇𝑘−1(𝑥) − 𝑇𝑘−2(𝑥) for 𝑘 ≥ 2. Using (UΛ̃U†)^𝑘 = UΛ̃^𝑘U†, one has

Yx = U ∑_{𝑘=0}^{𝐾} 𝜃𝑘 𝑇𝑘(Λ̃) U† x = ∑_{𝑘=0}^{𝐾} 𝜃𝑘 𝑇𝑘(L̃) x,   (3.4)

where, analogous to Λ̃, we define L̃ := (2/𝜆max) L − I. It is important to note that, due to the complex Hermitian structure of L̃, the value Yx(𝑢) aggregates information both from the values of x on N𝑘(𝑢), the 𝑘-hop neighborhood of 𝑢, and from the values of x on {𝑣 : dist(𝑣, 𝑢) ≤ 𝑘}, which consists of those vertices that can reach 𝑢 in 𝑘 hops. While in an undirected graph these two sets of vertices are the same, that is not the case for general directed graphs. Furthermore, due to the difference in phase between an edge (𝑢, 𝑣) and an edge (𝑣, 𝑢), the filter matrix Y is also capable of aggregating information from these two sets in different ways. This capability is in contrast to any single, symmetric, real-valued matrix, as well as to any matrix that encodes just N(𝑢).

To obtain a network similar to Kipf and Welling (2016), we set 𝐾 = 1, assume that L = L_N^(𝑞), use 𝜆max ≤ 2 to make the approximation 𝜆max ≈ 2, and set 𝜃1 = −𝜃0. With this, we obtain

Yx = 𝜃0 (I + (D_s^(−1/2) A_s D_s^(−1/2)) ⊙ exp(𝑖Θ^(𝑞))) x.

As in Kipf and Welling (2016), we substitute I + (D_s^(−1/2) A_s D_s^(−1/2)) ⊙ exp(𝑖Θ^(𝑞)) → (D̃_s^(−1/2) Ã_s D̃_s^(−1/2)) ⊙ exp(𝑖Θ^(𝑞)). This renormalization helps avoid instabilities arising from vanishing/exploding gradients and yields

Yx = 𝜃0 ((D̃_s^(−1/2) Ã_s D̃_s^(−1/2)) ⊙ exp(𝑖Θ^(𝑞))) x,   (3.5)

where Ã_s = A_s + I and D̃_s(𝑖, 𝑖) = ∑_𝑗 Ã_s(𝑖, 𝑗). In theory, the matrix exp(𝑖Θ^(𝑞)) is dense. However, in practice, one only needs to compute a small fraction of its entries. In most real-world datasets, the symmetrized adjacency matrix will be sparse. Since the Hermitian adjacency matrix is constructed via pointwise multiplication between the symmetrized adjacency matrix and the phase matrix, it is only necessary to compute the phase matrix for entries (𝑢, 𝑣) where A_s(𝑢, 𝑣) ≠ 0. Thus, the efficiency of the proposed algorithm is comparable to that of standard GCN algorithms, and it can leverage any existing developments, such as Fey et al. (2021), that increase the efficiency of standard GCNs (although the computational complexity of our method does differ by a factor of four because of the cost of complex-valued multiplication).

3.1.4.2 The MagNet architecture

Let 𝐿 be the number of convolution layers in our network, and let X^(0) be an 𝑁 × 𝐹0 input feature matrix with columns x_1^(0), . . . , x_{𝐹0}^(0). Since our filters are complex, we use a complex version of ReLU defined by 𝜎(𝑧) = 𝑧 if −𝜋/2 ≤ arg(𝑧) < 𝜋/2, and 𝜎(𝑧) = 0 otherwise (where arg(𝑧) is the complex argument of 𝑧 ∈ ℂ). Let 𝐹ℓ be the number of channels in layer ℓ, and for 1 ≤ ℓ ≤ 𝐿, 1 ≤ 𝑖 ≤ 𝐹ℓ−1, and 1 ≤ 𝑗 ≤ 𝐹ℓ, we let Y_{𝑖𝑗}^(ℓ) be a convolution matrix defined in the sense of either (3.3), (3.4), or (3.5). Define the ℓ-th layer feature matrix X^(ℓ), with columns x_1^(ℓ), . . . , x_{𝐹ℓ}^(ℓ), as

x_𝑗^(ℓ) = 𝜎( ∑_{𝑖=1}^{𝐹ℓ−1} Y_{𝑖𝑗}^(ℓ) x_𝑖^(ℓ−1) + b_𝑗^(ℓ) ),   (3.6)

with b_𝑗^(ℓ)(𝑣) = 𝑏_𝑗^(ℓ) and real(𝑏_𝑗^(ℓ)) = imag(𝑏_𝑗^(ℓ)). In matrix form, we write X^(ℓ) = Z^(ℓ)(X^(ℓ−1)), where Z^(ℓ) is a hidden layer of the form (3.6). In the numerical experiments reported in Section 3.1.5, we utilize formulation (3.4) with L = L_N^(𝑞). In most cases we set 𝐾 = 1, for which

X^(ℓ) = 𝜎( X^(ℓ−1) W_self^(ℓ) + L̃_N^(𝑞) X^(ℓ−1) W_neigh^(ℓ) + B^(ℓ) ),

where W_self^(ℓ) and W_neigh^(ℓ) are learned weight matrices corresponding to the filter weights in (3.4), and B^(ℓ)(𝑣, ·) = (𝑏_1^(ℓ), . . . , 𝑏_{𝐹ℓ}^(ℓ)) for each 𝑣 ∈ 𝑉.
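A minimal PyTorch sketch of this 𝐾 = 1 layer is given below. It follows the implementation strategy described in Section 3.1.5.2 of storing each complex tensor as two real tensors; the class name, shapes, and the use of real-valued weight matrices are our own simplifications rather than the released MagNet code.

```python
import torch
import torch.nn as nn

class MagNetConvSketch(nn.Module):
    """One K = 1 layer: X <- sigma(X W_self + L_tilde X W_neigh + B), with complex
    quantities stored as (real, imaginary) pairs of real tensors."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.w_self = nn.Linear(in_channels, out_channels, bias=False)
        self.w_neigh = nn.Linear(in_channels, out_channels, bias=False)
        self.bias = nn.Parameter(torch.zeros(out_channels))   # real(b) = imag(b)

    def forward(self, x_re, x_im, L_re, L_im):
        # x_re, x_im: (N, F) real/imaginary feature parts.
        # L_re, L_im: real/imaginary parts of the (renormalized) magnetic Laplacian term.
        # Complex matrix product (L_re + i L_im)(x_re + i x_im):
        agg_re = L_re @ x_re - L_im @ x_im
        agg_im = L_re @ x_im + L_im @ x_re
        out_re = self.w_self(x_re) + self.w_neigh(agg_re) + self.bias
        out_im = self.w_self(x_im) + self.w_neigh(agg_im) + self.bias
        # Complex ReLU sigma(z): keep z when -pi/2 <= arg(z) < pi/2, implemented here by
        # masking on a nonnegative real part (equivalent up to the measure-zero boundary).
        mask = (out_re >= 0).to(out_re.dtype)
        return out_re * mask, out_im * mask

# Toy usage with placeholder inputs: 5 nodes, 3 input and 4 output channels.
N, F_in, F_out = 5, 3, 4
x_re, x_im = torch.randn(N, F_in), torch.zeros(N, F_in)
L_re, L_im = torch.eye(N), torch.zeros(N, N)   # placeholder matrices, not a real graph
layer = MagNetConvSketch(F_in, F_out)
out_re, out_im = layer(x_re, x_im, L_re, L_im)
```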
After the convolutional layers, we unwind the complex 𝑁 × 𝐹𝐿 matrix X^(𝐿) into a real-valued 𝑁 × 2𝐹𝐿 matrix, apply a linear layer consisting of right-multiplication by a 2𝐹𝐿 × 𝑛𝑐 weight matrix W^(𝐿+1) (where 𝑛𝑐 is the number of classes), and apply softmax. In our experiments, we set 𝐿 = 2 or 3. When 𝐿 = 2, our network applied to node classification, as illustrated in Figure 3.1, is given by

softmax(unwind(Z^(2)(Z^(1)(X^(0)))) W^(3)).

For link prediction, we apply the same method through the unwind layer and then concatenate the rows corresponding to pairs of nodes to obtain the edge features.

Figure 3.1 MagNet (𝐿 = 2) applied to node classification.

Figure 3.2 Meta-graphs for the synthetic data sets: (a) ordered meta-graph; (b) cyclic meta-graph. (Cluster labels 0-4; blue edges indicate the majority flow, gray edges indicate noise edges.)

3.1.5 Experiments

3.1.5.1 Datasets

Directed Stochastic Block Model. We construct a directed stochastic block model (DSBM) as follows. First, we divide 𝑁 vertices into 𝑛𝑐 equally-sized clusters 𝐶1, . . . , 𝐶𝑛𝑐. We define {𝛼𝑖,𝑗}1≤𝑖,𝑗≤𝑛𝑐 to be a collection of probabilities, 0 < 𝛼𝑖,𝑗 ≤ 1 with 𝛼𝑖,𝑗 = 𝛼𝑗,𝑖, and for an unordered pair 𝑢 ≠ 𝑣 we create an undirected edge between 𝑢 and 𝑣 with probability 𝛼𝑖,𝑗 if 𝑢 ∈ 𝐶𝑖, 𝑣 ∈ 𝐶𝑗. To turn this undirected graph into a directed graph, we define {𝛽𝑖,𝑗}1≤𝑖,𝑗≤𝑛𝑐 to be a collection of probabilities such that 0 ≤ 𝛽𝑖,𝑗 ≤ 1 and 𝛽𝑖,𝑗 + 𝛽𝑗,𝑖 = 1. For each undirected edge {𝑢, 𝑣}, we assign that edge a direction by the rule that the edge points from 𝑢 to 𝑣 with probability 𝛽𝑖,𝑗 if 𝑢 ∈ 𝐶𝑖 and 𝑣 ∈ 𝐶𝑗, and points from 𝑣 to 𝑢 otherwise. If 𝛼𝑖,𝑗 is constant, then the only way to determine the clusters is from the directional information.

In Figure 3.3, we plot the performance of MagNet and other methods on variations of the DSBM. In each of these, we set 𝑛𝑐 = 5, and the goal is to classify the vertices by cluster. We set 𝑁 = 2500, except in Figure 3.3d, where 𝑁 = 500. In Figure 3.3a, we plot the performance of our model on the DSBM with 𝛼𝑖,𝑗 := 𝛼∗ = .1, .08, and .05 for 𝑖 ≠ 𝑗, which varies the density of inter-cluster edges, and we set 𝛼𝑖,𝑖 = .1. Here we set 𝛽𝑖,𝑖 = .5 and 𝛽𝑖,𝑗 = .05 for 𝑖 > 𝑗. This corresponds to the ordered meta-graph in Figure 3.2a. Figure 3.3b also uses the ordered meta-graph, but here we fix 𝛼𝑖,𝑗 = .1 for all 𝑖, 𝑗, set 𝛽𝑖,𝑗 = 𝛽∗ for 𝑖 > 𝑗, and allow 𝛽∗ to vary from .05 to .4, which varies the net flow (related to the flow imbalance in He et al. (2021)) from one cluster to another. The results in Figure 3.3c utilize a cyclic meta-graph structure as in Figure 3.2b (without the gray noise edges). Specifically, we set 𝛼𝑖,𝑗 = .1 if 𝑖 = 𝑗 or 𝑖 = 𝑗 ± 1 mod 5, and 𝛼𝑖,𝑗 = 0 otherwise. We define 𝛽𝑖,𝑗 = 𝛽∗, 𝛽𝑗,𝑖 = 1 − 𝛽∗ when 𝑗 = (𝑖 − 1) mod 5, and 𝛽𝑖,𝑗 = 0 otherwise. In Figure 3.3d, we add noise to the cyclic structure of our meta-graph by setting 𝛼𝑖,𝑗 = .1 for all 𝑖, 𝑗 and 𝛽𝑖,𝑗 = .5 for all (𝑖, 𝑗) connected by a gray edge in Figure 3.2b (keeping 𝛽𝑖,𝑗 the same as in Figure 3.3c for the blue edges).

Real datasets. Texas, Wisconsin, and Cornell are WebKB datasets modeling links between websites at different universities Pei et al. (2020).
We use these datasets for both link prediction and node classification with nodes labeled as student, project, course, staff, and faculty in the latter case. Telegram Bovet and Grindrod (2020) is a pairwise influence network between 245 Telegram channels with 8, 912 links. To the best of our knowledge, this dataset has not previously been studied in the graph neural network literature. Labels are generated from the method discussed in Bovet and Grindrod (2020), with a total of four classes. The datasets Chameleon and Squirrel Rozemberczki et al. (2019) represent links between Wikipedia pages related to chameleons and squirrels. We use these datasets for link prediction. Likewise, WikiCS Mernyei and Cangea (2020) is a collection of Computer Science articles. Cora-ML and CiteSeer are popular citation networks with node labels corresponding to 41 scientific subareas. We use the versions of these datasets provided in Bojchevski and Günnemann (2017). Further details are given in the appendix. 3.1.5.2 Training and implementation details Node classification is performed in a semi-supervised setting (i.e., access to the test data, but not the test labels, during training). For the datasets Cornell, Texas, Wisconsin, and Telegram we use a 60%/20%/20% training/validation/test split, which might be viewed as more akin to supervised learning, because of the small graph size. For Cora-ML and CiteSeer, we use the same split as Tong et al. (2020a). For all of these datasets we use 10 random data splits. For the DSBM datasets, we generated 5 graphs randomly for each type and for each set of parameters, each with 10 different random node splits. We use 20% of the nodes for validation and we vary the proportion of training samples based on the classification difficulty, using 2%, 10%, and 60% of nodes per class for the ordered, cyclic, and noisy cyclic DSBM graphs, respectively, during training, and the rest for testing. Hyperpameters were selected using one of the five generated graphs, and then applied to the other four generated graphs. In the main text, there are two types of link prediction tasks conducted for performance evaluation. The first type is to predict the edge direction of pairs of vertices 𝑢, 𝑣 for which either (𝑢, 𝑣) ∈ 𝐸 or (𝑣, 𝑢) ∈ 𝐸. The second type is existence prediction. The model is asked to predict if (𝑢, 𝑣) ∈ 𝐸 by considering ordered pairs of vertices (𝑢, 𝑣). For both types of link prediction, we removed 15% of edges for testing, 5% for validation, and use the rest of the edges for training. The connectivity was maintained during splitting. 10 splits were generated randomly for each graph and the input features are in-degree and out-degree of nodes. In the appendix, we report on two additional link prediction tasks based on a three-class classification setup: (𝑢, 𝑣) ∈ 𝐸, (𝑣, 𝑢) ∈ 𝐸, or (𝑢, 𝑣), (𝑣, 𝑢) ∉ 𝐸. Full details are provided in the appendix. In all experiments, we used the normalized magnetic Laplacian and implement MagNet with convolution defined as in (3.4), meaning that our network may be viewed as the magnetic Laplacian generalization of ChebNet. The setting of the hyperparameter 𝑞 and other network hyperparameters is obtained by cross-validation. 
Since complex tensors are currently still in beta in PyTorch, we did not use them; instead, we stored each complex tensor as two real tensors (one for the real part and one for the imaginary part) and carried out complex multiplication using the standard formula (𝑎 + 𝑖𝑏)(𝑐 + 𝑖𝑑) = (𝑎𝑐 − 𝑏𝑑) + 𝑖(𝑏𝑐 + 𝑎𝑑), where 𝑎, 𝑏, 𝑐, and 𝑑 can be real numbers or real matrices.

We compare with multiple baselines in three categories: (i) spectral methods: ChebNet Defferrard et al. (2016) and GCN Kipf and Welling (2016); (ii) spatial methods: APPNP Klicpera et al. (2019a), SAGE Hamilton et al. (2017), GIN Xu et al. (2018), and GAT Veličković et al. (2018); and (iii) methods designed for directed graphs: DGCN Tong et al. (2020b) and two variants of Tong et al. (2020a), a basic version (DiGraph) and a version with higher-order inception blocks (DiGraphIB). All baselines in the experiments have two graph convolutional layers, except for node classification on the DSBM with cyclic meta-graphs (Figures 3.3c, 3.3d, and 3.2b), for which we also tested three layers during the hyperparameter search. For ChebNet, we use the symmetrized adjacency matrix. For the spatial networks, we apply both the symmetrized and the asymmetric adjacency matrix for node classification and report the better of the two results. The appendix provides full details, as well as results for two other types of baselines: (i) BiGCN, BiSAGE, and BiGAT, which are obtained by applying GCN, SAGE, and GAT to both the original adjacency matrix and the transposed adjacency matrix; and (ii) a 𝑘-nearest-neighbors classifier based on the eigenvector with the smallest eigenvalue of the magnetic Laplacian Fanuel et al. (2017).

3.1.5.3 Results

We see that MagNet performs well across all tasks. As indicated in Table 3.1, our cross-validation procedure selects 𝑞 = 0 for node classification on the citation networks Cora-ML and CiteSeer. This means we achieved the best performance when regarding directional information as noise, suggesting that symmetrization-based methods are appropriate for node classification on citation networks. This matches our intuition. For example, in Cora-ML, the task is to classify research papers by scientific subarea. If the topic of a given paper is "machine learning," then it is likely to both cite and be cited by other machine learning papers. For all other datasets, we find the optimal value of 𝑞 is nonzero, indicating that directional information is important. Our network exhibits the best performance on three out of six of these datasets and is a close second on Texas and Telegram.

Table 3.1 Node classification accuracy (%). The best results are in bold and the second are underlined.

Type       Method      Cornell      Texas        Wisconsin    Cora-ML     CiteSeer    Telegram     Score
Spectral   ChebNet     79.8±5.0     79.2±7.5     81.6±6.3     80.0±1.8    66.7±1.6    70.2±6.8      6.94
           GCN         59.0±6.4     58.7±3.8     55.9±5.4     82.0±1.1    66.0±1.5    73.4±5.8     19.16
Spatial    APPNP       58.7±4.0     57.0±4.8     51.8±7.4     82.6±1.4    66.9±1.8    67.3±3.0     18.75
           SAGE        80.0±6.1     84.3±5.5     83.1±4.8     82.3±1.2    66.0±1.5    66.4±6.4      5.76
           GIN         57.9±5.7     65.2±6.5     58.2±5.1     78.1±2.0    63.3±2.5    86.4±4.3     16.53
           GAT         57.6±4.9     61.1±5.0     54.1±4.2     81.9±1.0    67.3±1.3    72.6±7.5     16.39
Directed   DGCN        67.3±4.3     71.7±7.4     65.5±4.7     81.3±1.4    66.3±2.0    90.4±5.6      8.55
           DiGraph     66.8±6.2     64.9±8.1     59.6±3.8     79.4±1.8    62.6±2.2    82.0±3.1     15.70
           DiGraphIB   64.4±9.0     64.9±13.7    64.1±7.0     79.3±1.2    61.1±1.7    64.1±7.0     16.36
Ours       MagNet      84.3±7.0     83.3±6.1     85.7±3.2     79.8±2.5    67.5±1.8    87.6±2.9      1.10
           Best 𝑞      0.25         0.15         0.05         0.0         0.0         0.15         -
We also achieve at least a four-percentage-point improvement over both ChebNet and GCN on the four data sets for which 𝑞 > 0. These networks are similar to ours but use the classical graph Laplacian. This isolates the effect of the magnetic Laplacian and shows that it is a valuable tool for encoding directional information. MagNet also compares favorably to non-spectral methods on the WebKB networks (Cornell, Texas, Wisconsin). Indeed, MagNet obtains a ∼4% improvement on Cornell and a ∼2.5% improvement on Wisconsin, while on Texas it has the second-best accuracy, close behind SAGE. We also see that the other directed methods have relatively poor performance on the WebKB networks, perhaps because these graphs are fairly small and have very few training samples. To make this analysis more quantitative, we computed the absolute difference between the classification accuracy of each method and that of the top-performing method (in percentage points) on each data set and averaged over the six data sets. In this context, lower scores are better, and a score of zero would indicate that a method is the top performer on every data set. As reported in Table 3.1, MagNet achieved the best score of 1.1 percentage points.

Figure 3.3 Node classification accuracy on the DSBM (vertical axes: accuracy). Error bars are one standard error. MagNet is bold red; the other curves are DGCN, Digraph, DigraphIB, ChebNet, GCN, APPNP, SAGE, GIN, and GAT. (a) Ordered DSBM with varying edge density. (b) Ordered DSBM with varying net flow. (c) Cyclic DSBM with varying net flow. (d) Noisy cyclic DSBM with varying net flow.

On the DSBM datasets, as illustrated in Figure 3.3, we see that MagNet generally performs quite well and is the best-performing network in the vast majority of cases. The networks DGCN and DiGraphIB rely on second-order proximity matrices. As demonstrated in Figure 3.3c, these methods are well suited for networks with a cyclic meta-graph structure, since nodes in the same cluster are likely to have common neighbors. MagNet, on the other hand, does not use second-order proximity but can still learn the clusters by stacking multiple layers together. This improves MagNet's ability to adapt to directed graphs with different underlying topologies. This is illustrated in Figure 3.3d, where the network has an approximately cyclic meta-graph structure. In this setting, MagNet continues to perform well, but the performance of DGCN and DiGraphIB deteriorates significantly. Interestingly, MagNet performs well on the DSBM cyclic meta-graph (Figure 3.3c) with 𝑞 ≈ .1, whereas 𝑞 ≥ .2 is preferred for the other three DSBM tests; we leave a more in-depth investigation for future work.

For link prediction, we achieve the best performance on seven out of eight tests, as shown in Table 3.2. We also note that Table 3.2 reports optimal nonzero 𝑞 values for each task. This indicates that incorporating directional information is important for link prediction, even on citation networks

Table 3.2 Link prediction accuracy (%). The best results are in bold and the second are underlined.
Direction prediction Existence prediction Cornell Wisconsin Cora-ML CiteSeer Cornell Wisconsin Cora-ML CiteSeer ChebNet GCN 71.0±5.5 56.2±8.7 67.5±4.5 72.7±1.5 68.0±1.6 80.1±2.3 82.5±1.9 80.0±0.6 77.4±0.4 71.0±4.0 79.8±1.1 68.9±2.8 75.1±1.4 75.1±1.9 81.6±0.5 76.9±0.5 APPNP SAGE GIN GAT 75.1±3.5 83.7±0.7 77.9±1.6 74.9±1.5 75.7±2.2 82.5±0.6 78.6±0.7 69.5±9.0 75.2±11.0 72.0±3.5 68.2±0.8 68.7±1.5 79.8±2.4 77.3±2.9 75.0±0.0 74.1±1.0 69.3±6.0 74.8±3.7 83.2±0.9 76.3±1.4 74.5±2.1 76.2±1.9 82.5±0.7 77.9±0.7 67.9±11.1 53.2±2.6 50.0±0.1 50.6±0.5 77.9±3.2 74.6±0.0 75.0±0.0 75.0±0.0 80.7±6.3 DGCN 79.3±1.9 DiGraph DiGraphIB 79.8±4.8 74.5±7.2 79.6±1.5 78.5±2.3 80.0±3.9 82.8±2.0 82.1±0.5 81.2±0.4 82.3±4.9 80.8±1.1 81.0±1.1 80.6±2.5 82.8±2.6 81.8±0.5 82.2±0.6 82.0±4.9 83.4±1.1 82.5±1.3 80.5±3.6 82.4±2.2 82.2±0.5 81.0±0.5 MagNet Best 𝑞 82.9±3.5 83.3±3.0 86.5±0.7 84.8±1.2 81.1±3.3 82.8±2.2 82.7±0.7 79.9±0.6 0.20 0.10 0.20 0.15 0.25 0.05 0.05 0.05 such as Cora and CiteSeer. This matches our intuition, since there is a clear difference between a paper with many citations and one with many references. 3.1.6 Summary In this section, we introduced MagNet, a neural network specifically designed for directed graphs, utilizing the magnetic Laplacian. This network represents an advancement in spectral graph convolutional networks by extending their application to directed graphs, thereby regularizing learning through the 𝑓 term in Equation (1.3). We have demonstrated the effectiveness of MagNet, particularly highlighting the crucial role of incorporating directional information via a complex Hermitian matrix. Our results, based on both real and synthetic datasets for tasks such as link prediction and node classification, confirm that MagNet provides substantial improvements in handling directional data compared to traditional methods. 46 3.2 Spatio-Temporal Graph Convolutional Networks for Earthquake Source Characterization This section discusses the application of a graph neural network designed specifically for earthquake source characterization. This network exemplifies the regularization of the 𝑓 term in Equation (1.3), focusing on the unique challenges presented by spatio-temporal seismic data inherent in earthquake events. 3.2.1 Introduction Earthquake source characterization plays a fundamental role in various seismic studies, including earthquake early-warning, hazard assessment, subsurface energy exploration, etc. Li et al. (2020a). Characterization of an earthquake source can be posed as a classical inverse problem. Its purpose is to infer the source information (location, magnitude, etc.) from seismic recordings. Various approaches have been developed to characterize earthquake sources, the most well-established being traveltime-based inversion Zhang et al. (2017); Li and van der Baan (2016); Lin et al. (2015); Zhang and Thurber (2003) and waveform-based inversion Beskardes et al. (2018); Zhebel and Eisner (2015); Pesicek et al. (2014); Gajewski et al. (2007). Traveltime-based methods implement a multi-step process, in which the arrival times of P and S waves are determined through phase detection and associated to specific earthquakes; earthquake locations are estimated as an inversion process given arrival times, station locations, and a velocity model. Magnitudes are calculated based on waveform amplitudes and source-receiver distances. 
Though traveltime-based methods are commonly used in seismic applications, they are susceptible to noise-related errors, particularly when estimating low-magnitude events, and fail to utilize abundant phase and amplitude information in the complete waveform. In contrast, waveform-based inversion integrates all phase and amplitude information recorded in seismographs, resulting in high quality source characterization. However, waveform-based inversion is computationally expensive. Both methods require domain expertise to properly tune parameters in the inversion process. Deep learning for source characterization provides a data-driven alternative, where integrated location and magnitude predictions extract full-waveform features with less computational expense than waveform inversion. Advances in algorithms and computing, and the availability of large, high-quality datasets have 47 allowed machine learning techniques to attain spectacular success in seismological applications Kong et al. (2019); Bergen et al. (2019) including phase picking Zhu and Beroza (2019), seismic discrimination Li et al. (2018), waveform denoising Zhu et al. (2019a), phase association Ross et al. (2019), earthquake location Perol et al. (2018), as well as magnitude estimation Mousavi and Beroza (2020b). Although machine learning has long been applied to seismic event detection Wang and Teng (1995); Tiira (1999), the first work to leverage recent advances in deep learning was developed by Perol et al. (2018), where convolutional neural networks (CNN’s) were trained to detect earthquakes from single station recordings and predict the source locations from among six regions. Though successful in establishing foundational research in machine learning for earthquake location, the CNN model is restricted to waveforms from a single seismic station and can only classify earthquakes into broad geographic groups without providing specific location information. Since then, more advanced single-station approaches have been developed to improve location accuracy. Mousavi and Beroza (2020a) build Bayesian neural networks to learn epicenter distance, P-wave travel time, and associated uncertainty from single-station data. Recently, multi-station based machine learning methods have shown promising results. For instance, Kriegerowski et al. (2019) develop a CNN structure that combines three-component waveforms from multiple stations to predict hypocenter locations, resulting in more accurate source parameters than single station methods. Zhang et al. (2020) developed an end-to-end fully convolutional network (FCN) to predict the probability distribution of earthquake location directly from input data recorded at multiple stations, which was extended to determine earthquake locations and magnitudes from continuous waveforms for earthquake early warning Zhang et al. (2021c). Shen and Shen (2021) also adopt a CNN framework, extracting the location, magnitude, and origin time from continuous waveforms collected across a seismic network. Though multiple-station approaches improve upon single-station methods, the use of standard convolutional layers is limited in several ways: (1) CNN’s are designed to function on evenly-spaced grids (i.e. photographs) where information is exclusively shared between adjacent cells, and (2) CNN’s require the input of station locations to be static (i.e. recordings from station N must always 48 be found at position N of the input file) in order to learn positional mapping. 
These assumptions are inappropriate for seismic networks, which are not regularly-spaced and may record information related to non-adjacent stations. Additionally, station outages, the addition/removal of stations to seismic networks, and the ability to select a localized array for the detection of small-magnitude events makes dynamic station input highly desirable for source characterization. To solve this problem, recently several graph-based machine learning methods have been developed. Münchmeyer et al. (2020) developed an attention-based transformer model for earthquake early warning, which was extended to predict hypocenters and magnitudes of events in Münchmeyer et al. (2021). While this model is successful in implementing a multistation approach that allows for dynamic inputs, high computational complexity restricts inputs to a relatively small number of stations. Another method for implementing flexible, multi-station input that avoids high complexity for large networks is through graph convolution. This method is implemented by van den Ende and Ampuero (2020), who develop a multi-station source characterization model. This model regards features as nodes on an edgeless graph, implementing single-station convolution and global pooling. However, global pooling may not sufficiently extract all useful information from multiple seismic stations, as the pooling layer is ideally applied after global features are obtained by feature fusion along the spatial dimension. Yano et al. (2021) introduce a multi-station technique in which edges are selected and held fixed for all inputs. While this model allows for more meaningful features to be constructed than in global pooling, station inputs are required to be fixed during training and implementation, introducing the same limitation inherent to CNN’s. McBrearty and Beroza (2022) propose a GNN framework using multiple pre-defined graphs constructed on both labels and station locations. The model allows for variation in the set of input stations, but the inputs are waveform amplitudes and phase arrival times rather than whole waveforms. To harness the full functionality of Graph Neural Networks (GNN’s) while maintaining flexibility in the location and number of seismic stations, we design a data-driven framework, spatio-temporal graph neural network (STGNN), that creates edges automatically to combine waveform features and spatial information. In order to evaluate the performance of our approach, we compare STGNN 49 to two baselines: the GNN model designed by van den Ende and Ampuero (2020) and the Fully Convolutional Network (FCN) designed by Zhang et al. (2020). We apply all three models to the two datasets upon which the baselines were originally tested and trained: (1) regional 2.5 < 𝑀 < 6 earthquakes recorded by 185 seismic stations in Southern California from 2000 to 2019, and (2) local 0 < 𝑀 < 4 earthquakes recorded by 30 seismic stations in Oklahoma from 2014 to 2015. 3.2.2 Methodology 3.2.2.1 Overview Graph neural networks (GNN’s) are designed to handle graphical data, or data that can be represented by vertices connected by edges. In GNN’s, convolution and pooling operates along connecting edges. In CNN’s, on the other hand, convolution and pooling operates on regions closest together on a Euclidean grid, meaning that input order directly impacts information-sharing and featurization. This is not the case for GNN’s, in which edges are not restricted to Euclidean grids but may instead be constructed by any criteria. 
Two major advantages of GNN architectures are that they do not require a fixed input order, and can handle graphs with different sets of vertices. These properties of GNN’s are well-suited for seismic data analysis with inputs from multiple stations. It is common for stations in a seismic network to be added, removed, or repositioned, or for the recording quality of individual stations to fluctuate over time due to operation and/or equipment issues. It is therefore beneficial to dynamically select relevant seismic stations for source characterization. We therefore propose a dynamic GNN framework as the basis for STGNN, in which seismic stations act as nodes, connected by dynamically defined edges. Inspired by Wang et al. (2019), our graph convolutions follow the design of EdgeConv layers to automatically generate edges between nodes. Instead of manually constructing fixed edges or implementing an edgeless graph, our framework learns to combine useful information from multiple stations implicitly during the training process. Our framework consists of three major components as shown in Figure 3.4: 1. Waveform feature extraction: We first extract temporal features from the waveform recorded at each seismic station using a CNN-based encoder. The three-channel seismic recordings are 50 Figure 3.4 The overview of STGNN. There are three major components in STGNN: (1) Waveform feature extraction for obtaining time domain feature from each station independently. (2) Spatial feature fusion for time domain feature integration from different stations based on their geographic locations and extracted feature similarity. (3) Earthquake location and magnitude prediction given spatiotemporal features from the previous step. See Fig 3.5 for a more detailed summary of the network architecture. reduced to a low dimensional representation. 2. Spatial feature fusion: We then represent the seismic station network as a graph, in which each node (i.e. station) is connected to other nodes by automatically generated edges. Through iterative steps of edge generation and convolution, the perceptive field is gradually enlarged. The model integrates and fuses features from different stations to obtain a high-order spatiotemporal representation of the recorded wavefield over the seismic network. The graph convolutional architecture considers both geographic locations and waveform feature similarity among multiple seismic stations. 3. Prediction: The last component is the prediction module. A fully-connected neural network outputs four normalized scalars corresponding to latitude, longitude, depth and magnitude based on features learned from the previous steps. 51 3.2.2.2 Graph Convolutional Layers The spatial feature fusion process consists of multiple graph convolution layers. The goal of each graph convolution layer is to enlarge the perceptive field by combining the extracted feature of each seismic stations and auto-selected neighbor stations. Each graph convolution layer can be broken down into two steps: edge generation and feature update. Edge generation. Each station node is connected to several other station nodes which show maximum similarity to the node. Similarity measurements are based on two criteria: 1. Geographic distance: The geographic distance is the intuitive choice, since adjacent stations tend to record related signals due to similar wave paths. Additionally, events are more likely to be mutually recorded by stations in close proximity, especially in the case of small-magnitude events. 2. 
2. Feature similarity: As the same earthquake event can be recorded by distant stations over a large area, waveform similarity provides a complementary perspective to geographic distance. We directly compare the ℓ2 distance between features from stations i and j, ||x_i − x_j||_2, where x_i and x_j are the extracted feature vectors; this allows us to combine waveform features from distant stations.

In edge generation, we link every station with its K-nearest neighbors based on their similarity, where K is a tunable hyperparameter. In our framework, both geographic proximity and waveform feature similarity are considered. Ignoring the feature channel and batch dimensions, we assume the feature used for neighbor selection is X ∈ R^{N×d}, where N is the number of stations and d is the feature dimension. d equals 2 when the criterion for generating edges is geographic distance (longitude and latitude). Let x_i = X(i) and x_j = X(j); the process of selecting the K-nearest neighbors can be described by the following equations:

1. Compute the pair-wise distance matrix D ∈ R^{N×N}:

D(i, j) = \|x_i - x_j\|_2, \quad 1 \le i \le N, \; 1 \le j \le N.    (3.7)

2. Get the K-nearest neighbors \mathcal{N} ∈ R^{N×K} by sorting each row of D:

\mathcal{N}(i) = \text{smallest-}K(D(i)), \quad 1 \le i \le N.    (3.8)

In practice, the similarity between waveforms can also be affected by other factors, such as wave path and signal-to-noise ratio. By training on a large number of samples with different sets of seismic stations and distinct spatial distributions, the network learns to embed these implicit and complex factors into low-dimensional features automatically in order to minimize the misfit between labels and predictions.

Feature update. Given the edges, we update the feature of each station by

\tilde{x}_i = \operatorname{maxpool}_{j \in \mathcal{N}_{\mathrm{distance}}(i)} \big( g(x_i - x_j) + f(x_i) \big) + \operatorname{maxpool}_{j' \in \mathcal{N}_{\mathrm{similarity}}(i)} \big( g'(x_i - x_{j'}) + f'(x_i) \big),    (3.9)

where x_i, x_j, and x_{j'} are the features of stations i, j, and j', respectively; j is a neighbor of i based on geographic distance, and j' is a neighbor of i based on feature similarity from the previous edge generation step. g(·), f(·), g'(·), and f'(·) are trainable fully connected neural networks, and \tilde{x}_i is the updated feature of station i. Max pooling is conducted along the constructed edges to combine information from the K-nearest neighbors of i, so that each station is once more associated with a single feature vector. The update is asymmetric for stations i and j to encourage their update processes to be different, as it is possible that only one of the stations records the event.

3.2.2.3 Architecture

The model takes as input (1) a list of station coordinates and (2) waveforms recorded by each station. Each input can contain an arbitrary set of stations, limited by a maximum number of stations set during training. If the number of functioning stations is less than the maximum number of stations for which the model is trained, the input is padded with zeroed channels and the coordinates of the missing stations are set to (−1, −1). A graphical illustration of the architecture is presented in Figure 3.5. In Temporal Feature Extraction, time domain waveform features are extracted from each station independently using an encoder with eleven convolutional layers. These features are used to construct the initial inputs for Spatial Feature Fusion. Two graphs are generated for each layer of graph convolution: one in which edges are generated based on geographic distance, and one in which edges are generated based on waveform feature similarity.
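To make the edge-generation step (Equations 3.7–3.8) and the feature update (Equation 3.9) concrete, the following is a minimal PyTorch sketch of a single graph convolution layer. The names knn_edges and StationGraphConv are illustrative, and using single linear layers for g, f, g′, and f′ is a simplification of the trainable networks described above, not the exact STGNN implementation.

```python
import torch
import torch.nn as nn

def knn_edges(x, k):
    # x: (N, d) node features (station coordinates or extracted waveform features).
    # Returns the k nearest neighbors of each node under the l2 distance
    # (Equations 3.7-3.8): D(i, j) = ||x_i - x_j||_2.
    dist = torch.cdist(x, x)                      # (N, N) pairwise distances
    dist.fill_diagonal_(float("inf"))             # exclude self-loops
    return dist.topk(k, largest=False).indices    # (N, k) neighbor indices

class StationGraphConv(nn.Module):
    """One EdgeConv-style layer combining a distance-based and a feature-based graph
    (a simplified sketch of Equation 3.9)."""
    def __init__(self, in_dim, out_dim, k=5):
        super().__init__()
        self.k = k
        self.g = nn.Linear(in_dim, out_dim)       # acts on x_i - x_j
        self.f = nn.Linear(in_dim, out_dim)       # acts on x_i
        self.g2 = nn.Linear(in_dim, out_dim)      # feature-similarity branch
        self.f2 = nn.Linear(in_dim, out_dim)

    def aggregate(self, x, nbrs, g, f):
        # x: (N, d); nbrs: (N, k). Max-pool g(x_i - x_j) + f(x_i) over neighbors j.
        diff = x.unsqueeze(1) - x[nbrs]           # (N, k, d)
        msg = g(diff) + f(x).unsqueeze(1)         # (N, k, out_dim)
        return msg.max(dim=1).values              # (N, out_dim)

    def forward(self, x, coords):
        nbr_dist = knn_edges(coords, self.k)      # neighbors by geographic distance
        nbr_feat = knn_edges(x, self.k)           # neighbors by feature similarity
        return (self.aggregate(x, nbr_dist, self.g, self.f)
                + self.aggregate(x, nbr_feat, self.g2, self.f2))
```

In the full framework, the distance-based graph additionally uses the normalized station coordinates for neighbor selection and convolution, as described next.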
For graphs in which geographic distance dictates edges, two scalars containing station coordinates, normalized between −1 and 1, are concatenated to the station's feature vector for neighbor selection and graph convolution. Spatial Feature Fusion uses four graph convolutional layers to obtain spatially hierarchical features. The features from all four graph convolutions are concatenated along the feature dimension for the final source characterization regression. A fully-connected layer transforms the features in each station, and the features are then compressed with adaptive max pooling along the station dimension. The compressed features are regressed to scalar predictions of latitude, longitude, depth, and magnitude using a fully-connected neural network. The objective function is

\mathcal{L} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{4} \| y_i - \hat{y}_i \|,    (3.10)

where \hat{y}_i and y_i are the prediction and ground truth values of the ith sample, respectively, represented as vectors of latitude, longitude, depth, and magnitude. The maximum number of stations and the number of edges are architectural parameters set during training. The model design in PyTorch allows retraining to new architectural parameters without fundamental alteration of the code.

Figure 3.5 Overview of STGNN, including the three components outlined in Figure 3.4. (1) In Temporal Feature Extraction, standard convolution and maximum pooling along the time dimension reduce 3-component waveforms to a feature vector of length 64. These feature vectors are concatenated to form T0. (2) In Spatial Feature Fusion, four layers of graph convolution are performed. In each layer, two graphs are constructed, Distance-Based (D) and Feature-Based (F), which are combined through element-wise summation and max pooled along the K-nearest-neighbors dimension. (3) In Prediction, feature tensors T0...T4 are concatenated and regressed to normalized predictions of latitude, longitude, depth, and magnitude. The pooling is applied along the station dimension.

3.2.3 Experiments

In this section, the data, experiment settings, and results are discussed. We evaluate STGNN in two ways: (1) performance on two datasets compared to GNN and CNN baselines, and (2) stability analysis of STGNN with various settings.

3.2.3.1 Data Description

The Southern California dataset uses waveforms and catalogue information collected by the Southern California Earthquake Data Center (SCEDC) Hutton et al. (2010), and was used for training and testing in the GNN baseline van den Ende and Ampuero (2020). The selected dataset contains events from January 2000 to June 2019 within a geographic subset from 32° to 36° latitude and −120° to −116° longitude, a depth range of 0–30 km, and a magnitude range of 2.5 < 𝑀 < 6. The final dataset contains 2,209 events recorded by 185 broadband seismic stations.

Figure 3.6 Maps of the two target regions used in this study: (a) Southern California and (b) Oklahoma. The distribution of all seismic stations (red triangles) and earthquakes (black stars) are shown. The 30 stations selected for fixed input testing are surrounded by a green circle. Black lines indicate seismic faults, United States Geological Survey and California Geological Survey (2022).

The Oklahoma dataset uses waveforms and catalogue information collected by the Nanometrics Research Network, and was used for training and testing in the FCN baseline Zhang et al. (2020).
The selected dataset contains events from March 2014 to July 2015 Nanometrics Seismological Instruments (2013) within a geographic subset from 34.482° to 37° latitude and −98.405° to −95.527° longitude, a depth range of 0–12 km, and a magnitude range of 1.5 < 𝑀 < 4. The final dataset contains 3,456 events recorded by 30 broadband seismic stations.

All waveforms and catalogues were accessed using ObsPy Beyreuther et al. (2010). Each trace contains 200 sec of seismic displacement collected by three orthogonal channels, which is interpolated into 4,096 evenly spaced samples, resulting in a sampling rate of approximately 20 Hz. For both datasets, the instrument response was removed, and waveforms were bandpass filtered from 1–8 Hz. As the recorded displacement amplitudes are very small, the waveforms are multiplied by a constant scaling factor of 10⁷ to raise the input data to a numerically stable range close to [−1, 1] without eliminating magnitude information. A map of events and stations is shown in Figure 3.6.

One advantage of the graph neural network is its ability to make predictions using dynamic inputs (i.e., the selected stations and their order in the input file are not necessarily the same for each sample). To demonstrate this ability, we perform tests with STGNN and the GNN baseline using Southern California data with dynamic inputs, in which functioning stations are randomly selected for each event. However, the FCN baseline requires a fixed input, in which the same stations must occupy the same position for each sample. To make a fair comparison, we train STGNN as well as both baselines on thirty fixed stations to compare the performance of all methods. This results in three datasets: (1) the Dynamic Southern California Dataset, in which 100 stations are randomly selected for each sample, as well as (2) the Fixed Southern California Dataset and (3) the Fixed Oklahoma Dataset, in which a static set of 30 stations is used for every sample. For all datasets, events are omitted where fewer than 25 stations are functioning.

Figure 3.7 The monthly earthquake frequency distribution for (a) Southern California and (b) Oklahoma. The temporal boundaries between the training, validation, and testing data are indicated by color.

3.2.3.2 Training Procedure

In the experiments, we use AdamW as the optimizer with a learning rate of 3 × 10⁻⁴. The ℓ2 regularization term λ is 1 × 10⁻⁴. Models are trained for 400 epochs with early stopping after 50 epochs without validation error improvement, from which we select the model with the best validation performance. We use a 20–80 split to divide each dataset into testing and training data, and reserve 20% of the training data for validation. The datasets are not randomly shuffled, but rather separated by time so that the training data precede the testing data. This approach avoids potential information leakage Kaufman et al. (2012) which might occur from spatially and temporally localized swarms. This method of splitting data also better simulates a real-use case, in which historic earthquakes would be used to train a model to detect more recent events on a network where station configuration and seismic characteristics may evolve over time. Figure 3.7 shows the monthly event frequency distribution in the training and testing datasets. We use a sliding window with a length of 100 sec and a stride of 5 sec to create ten 100 sec samples from each 200 sec recording.
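As a rough sketch of the waveform preparation and sliding-window cropping described above, the snippet below assumes the raw traces have already been loaded (e.g., with ObsPy) and the instrument response removed; the filter order, the resampling scheme, and the 20.48 Hz rate implied by 4,096 samples over 200 sec are assumptions of this illustration rather than documented settings.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def preprocess_trace(wave, fs_in, n_out=4096, scale=1e7):
    # Bandpass filter 1-8 Hz, resample to n_out evenly spaced samples, and
    # rescale amplitudes to a numerically stable range (the constant factor above).
    sos = butter(4, [1.0, 8.0], btype="bandpass", fs=fs_in, output="sos")
    filtered = sosfiltfilt(sos, wave)
    t_old = np.linspace(0.0, 1.0, len(filtered))
    t_new = np.linspace(0.0, 1.0, n_out)
    return scale * np.interp(t_new, t_old, filtered)

def sliding_windows(recording, fs=20.48, win_sec=100.0, stride_sec=5.0):
    # recording: (n_stations, 3, n_samples) array covering the full 200 s trace.
    # The same crop is applied to every station so relative timing is preserved;
    # window length and stride follow the description in the text.
    win, stride = int(win_sec * fs), int(stride_sec * fs)
    n = recording.shape[-1]
    return [recording[..., s:s + win] for s in range(0, n - win + 1, stride)]
```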
This sliding-window cropping augments the data by increasing the sample size and by exposing different portions of the wavetrain, ensuring that the model can be used without known origin and arrival times. Within each sample, the same time shift is applied to all stations.

Figure 3.8 (a) MAE of each tested model, where the location error is measured in km. Location error refers to the Euclidean distance between the predicted location and the true event location. (b) MAE of the magnitude predictions from the graph convolutional neural networks when applied to the Oklahoma dataset with 30 fixed stations, the Southern California dataset with 30 fixed stations, and the Southern California dataset with 100 dynamically selected stations.

3.2.3.3 Performance Comparison

To evaluate our developed framework, we compare the performance of our model against two baselines: (1) van den Ende and Ampuero (2020) (referenced as GNN, for graph neural network), and (2) Zhang et al. (2020) (referenced as FCN, for fully convolutional network). The performance of each model is evaluated using the following metrics:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|,    (3.11)

\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,    (3.12)

R^2 = 1 - \frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},    (3.13)

where \hat{y}_i is the model's prediction, y_i is the true value, \bar{y} is the average true value, and n is the total number of predictions. Both STGNN and GNN make normalized predictions between −1 and 1. When calculating the above metrics, the values are first reverted from the normalized scalars to degrees of latitude and longitude, kilometers of depth, and magnitude values. Degrees of latitude and longitude are then converted to kilometers using conversions of 110 km/degree and 92 km/degree, respectively. Testing is conducted across three datasets:

1. Dynamic Southern California Dataset: The performance is tested for the STGNN and GNN models. Five neighbors (K = 5) were selected for feature update for STGNN. Results are detailed in Table 3.3.

2. Fixed Southern California Dataset: The performance is tested for the STGNN, GNN, and FCN models. Seven neighbors (K = 7) were selected for feature update for STGNN. Results are detailed in Table 3.4.

3. Fixed Oklahoma Dataset: The performance is tested for the STGNN, GNN, and FCN models. Seven neighbors (K = 7) were selected for feature update for STGNN. Results are detailed in Table 3.5.

The performance overview (Figure 3.8) demonstrates that our proposed model achieves a higher location accuracy than the baselines for all datasets. STGNN makes predictions with an average of 6.8 km less location error than the FCN baseline, a 40% improvement across all tested datasets. Across all datasets, STGNN makes predictions with an average of 3.0 km less location error than the GNN baseline, a 22% improvement. The improved location is primarily due to epicentral location accuracy. All tested models demonstrate low R2 values for depth prediction. STGNN and GNN achieve comparable magnitude prediction, while FCN does not support magnitude prediction. STGNN appears to incorporate a consistent bias, underpredicting magnitude values. Figures 3.9, 3.10, and 3.11 plot all predictions to give a richer understanding of model capacity beyond individual quality metrics.
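A small sketch of the evaluation metrics (Equations 3.11–3.13) and of the degree-to-kilometer conversion described above is given below; the linear de-normalization and the latitude bounds in the example are assumptions for illustration, not documented implementation details.

```python
import numpy as np

KM_PER_DEG_LAT, KM_PER_DEG_LON = 110.0, 92.0   # conversions quoted above

def regression_metrics(y_true, y_pred):
    # MAE (Eq. 3.11), MSE (Eq. 3.12), and R^2 (Eq. 3.13) for one target quantity.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mae = np.abs(y_true - y_pred).mean()
    mse = ((y_true - y_pred) ** 2).mean()
    r2 = 1.0 - ((y_pred - y_true) ** 2).sum() / ((y_true - y_true.mean()) ** 2).sum()
    return mae, mse, r2

def latitude_to_km(pred_norm, lat_min=32.0, lat_max=36.0):
    # Map a normalized prediction in [-1, 1] back to degrees (assumed linear map),
    # then to kilometers, before computing the metrics above.
    deg = lat_min + 0.5 * (pred_norm + 1.0) * (lat_max - lat_min)
    return deg * KM_PER_DEG_LAT
```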
Observation of the individual predictions in these figures makes it clear that while both models succeed in learning a meaningful mapping to latitude and longitude predictions, depth predictions are highly scattered and little better than predictions of the mean.

Table 3.3 Performance of STGNN and the GNN baseline when applied to the Southern California dataset with dynamic inputs. MAE refers to the mean absolute error (Equation 3.11) and MSE refers to the mean squared error (Equation 3.12), where a lower value indicates less error. The R2 value (Equation 3.13) is a measure of how strongly variation in the predicted values is related to variation in the ground truth value, where a value close to 1 is indicative of high accuracy.

Latitude — MAE (km) / MSE (10² km) / R2:
  STGNN  7.548 ± 9.841 / 1.538 ± 9.255 / 0.980
  GNN    10.201 ± 11.791 / 2.431 ± 12.438 / 0.969
Longitude — MAE (km) / MSE (10² km) / R2:
  STGNN  6.931 ± 8.152 / 1.145 ± 5.589 / 0.986
  GNN    10.095 ± 12.086 / 2.480 ± 11.865 / 0.970
Depth — MAE (km) / MSE (10² km) / R2:
  STGNN  3.472 ± 2.928 / 0.206 ± 0.358 / 0.266
  GNN    3.837 ± 3.166 / 0.247 ± 0.399 / 0.120
Magnitude — MAE / MSE / R2:
  STGNN  0.120 ± 0.114 / 0.027 ± 0.085 / 0.826
  GNN    0.120 ± 0.126 / 0.030 ± 0.105 / 0.807

Table 3.4 Performance of STGNN, GNN, and FCN baselines when applied to the Southern California dataset with fixed inputs. MAE refers to the mean absolute error (Equation 3.11) and MSE refers to the mean squared error (Equation 3.12), where a lower value indicates less error. The R2 value (Equation 3.13) is a measure of how strongly variation in the predicted values is related to variation in the ground truth value, where a value close to 1 is indicative of high accuracy.

Latitude — MAE (km) / MSE (10² km) / R2:
  STGNN  10.396 ± 11.388 / 2.378 ± 10.067 / 0.954
  GNN    11.263 ± 11.696 / 2.637 ± 8.010 / 0.949
  FCN    14.415 ± 21.827 / 6.842 ± 34.697 / 0.869
Longitude — MAE (km) / MSE (10² km) / R2:
  STGNN  9.663 ± 13.048 / 2.636 ± 15.761 / 0.962
  GNN    11.485 ± 12.199 / 2.807 ± 10.252 / 0.960
  FCN    16.369 ± 24.872 / 8.865 ± 47.323 / 0.874
Depth — MAE (km) / MSE (10² km) / R2:
  STGNN  4.030 ± 3.396 / 0.278 ± 0.405 / −0.069
  GNN    4.264 ± 3.384 / 0.296 ± 0.403 / −0.141
  FCN    4.105 ± 3.324 / 0.279 ± 0.431 / −0.074
Magnitude — MAE / MSE / R2:
  STGNN  0.216 ± 0.151 / 0.069 ± 0.083 / 0.583
  GNN    0.120 ± 0.118 / 0.028 ± 0.088 / 0.830

3.2.3.4 Stability Analysis

There are three critical hyper-parameters in STGNN: the number of neighbors (K) considered for edge generation, the maximum number of observed stations, and the random selection of seismic stations when creating datasets. We use the Southern California Dataset to vary these hyperparameters in order to assess the stability of STGNN. The results of the parameter permutation are shown in Figure 3.12. When a parameter is not permuted, 100 stations, 5 edges, 4 graph convolutions, and a random seed of 0 are used. For each prediction, a random subset of functional stations was selected. We permute the random seed during sample selection to alter the set of stations used for training. We find that the random subsets return similar results for all predictions except for magnitude, which shows a higher degree of variation. Prediction accuracy improves as more stations are used. Accuracy is moderately impacted by the number of edges selected, with magnitude predictions fluctuating most significantly.
Table 3.5 Performance of STGNN, GNN, and FCN baselines when applied to the Oklahoma dataset with fixed inputs. MAE refers to the mean absolute error (Equation 3.11) and MSE refers to the mean squared error (Equation 3.12), where a lower value indicates less error. The R2 value (Equation 3.13) is a measure of how strongly variation in the predicted values is related to variation in the ground truth value, where a value close to 1 is indicative of high accuracy.

Latitude — MAE (km) / MSE (10² km) / R2:
  STGNN  3.574 ± 5.755 / 0.459 ± 3.665 / 0.975
  GNN    7.166 ± 12.414 / 2.055 ± 14.820 / 0.897
  FCN    9.219 ± 16.418 / 3.545 ± 23.070 / 0.822
Longitude — MAE (km) / MSE (10² km) / R2:
  STGNN  3.697 ± 4.936 / 0.380 ± 2.365 / 0.942
  GNN    5.934 ± 8.144 / 1.015 ± 5.547 / 0.904
  FCN    9.308 ± 11.883 / 2.279 ± 8.244 / 0.785
Depth — MAE (km) / MSE (10² km) / R2:
  STGNN  1.686 ± 1.427 / 0.049 ± 0.082 / 0.036
  GNN    1.701 ± 1.423 / 0.049 ± 0.078 / 0.090
  FCN    1.865 ± 1.546 / 0.059 ± 0.084 / −0.086
Magnitude — MAE / MSE / R2:
  STGNN  0.154 ± 0.126 / 0.040 ± 0.066 / 0.790
  GNN    0.195 ± 0.142 / 0.058 ± 0.083 / 0.681

Overall, the model appears to be generally stable, with magnitude demonstrating the greatest sensitivity to hyperparameter tuning. We also examine the impact of the number of GNN layers on the model's performance. A depth of four convolutions produces the best balance of location and magnitude accuracy and is used for this study.

3.2.4 Discussion

3.2.4.1 Architecture Strengths and Weaknesses

Our STGNN has several advantages over the FCN baseline model. One of the primary advantages is the ability to make predictions on a dynamic set of inputs, allowing the model to adapt to station outages, network alterations, and station subsetting. As STGNN featurizes individual stations rather than an ordered network image, the model can be easily trained to predict using any number of stations without architectural alteration.

Figure 3.9 Testing comparison on 100 dynamically selected stations from the Southern California dataset with 5 convolutional edges. "STGNN" and "GNN" denote the performance of our framework and van den Ende and Ampuero (2020), respectively. In the scatter plot, each point represents an event, and a position on the diagonal line corresponds to perfect agreement between the predicted value (x-axis) and the true value (y-axis). Latitude and longitude values are displayed in degrees and depth values are displayed in kilometers.

The FCN baseline uses an image-to-image strategy, outputting a probability volume in which the highest values correspond to the event location. This has the advantage of predicting a probability amplitude, which Zhang et al. (2020) demonstrate as a useful measure of prediction uncertainty, especially in cases where earthquakes occur outside the bounds of the modeled region. However, the volumetric output comes at the cost of resolution limitation due to discretization. The gridded, three-dimensional output also requires a high degree of model complexity. The FCN baseline consequently comprises approximately 27 million parameters, while our STGNN with scalar predictions comprises fewer than 0.24 million parameters.

Figure 3.10 Testing comparison on 30 fixed stations from the Oklahoma dataset with 7 convolutional edges. "STGNN", "GNN", and "FCN" denote the performance of our framework, van den Ende and Ampuero (2020), and Zhang et al. (2020), respectively. In the scatter plot, each point represents an event, and a position on the diagonal line corresponds to perfect agreement between the predicted value (x-axis) and the true value (y-axis). Latitude and longitude values are displayed in degrees and depth values are displayed in kilometers. Magnitude is omitted for the FCN, as this model makes only location predictions.
The baseline GNN van den Ende and Ampuero (2020) implements edgeless graph convolution (i.e., station-by-station convolutions with global pooling), while STGNN implements convolution and pooling over dynamically generated edges. Figure 3.13 gives insight into the edge generation process. For clear visualization, we select a case with 50 seismic stations and K = 5. In the edges generated by waveform similarity, stations that have recorded an event are generally connected to other recording stations, forming distinct clusters from the edges generated by geographic proximity. This indicates that the model is able to successfully extract waveform information and associate stations in order to characterize an event. The graph generated based on feature similarity is different from the graph created based on location, showing that feature similarity complements geographic location when aggregating features from different seismic stations.

Figure 3.11 Testing comparison on 30 fixed stations from the Southern California dataset with 7 convolutional edges. "STGNN", "GNN", and "FCN" denote the performance of our framework, van den Ende and Ampuero (2020), and Zhang et al. (2020), respectively. In the scatter plot, each point represents an event, and a position on the diagonal line corresponds to perfect agreement between the predicted value (x-axis) and the true value (y-axis). Latitude and longitude values are displayed in degrees and depth values are displayed in kilometers. Magnitude is omitted for the FCN, as this model makes only location predictions.

A limitation that STGNN shares with the baselines is that predictions can be made only within a certain range of area, depth, and magnitude. The model outputs normalized values between −1 and 1 which correspond to a range selected at the beginning of training. The spatial restrictions are similar to the bounds set in inversion-based methods and are arguably less limiting, as the predictions made by our model are continuous and therefore not bound by grid spacing. However, STGNN is more limited than non-machine-learning methods with regard to magnitude prediction. Magnitudes falling above or below the training range cannot be predicted by STGNN or the deep learning baselines. The limited range of predictions adversely impacts the usefulness of the deep learning methods for applications such as Earthquake Early Warning, where magnitude saturation must be avoided.

Figure 3.12 Stability analysis permuting (a) the number of edges used to connect nodes during graph convolution, (b) the random seed used to select stations for the model input, (c) the number of stations used for prediction, and (d) the number of graph convolutions implemented. When a parameter is not permuted, 100 stations, 5 edges, 4 graph convolutions, and a random seed of 0 are used.

Figure 3.13 Graphs constructed by different layers of the graph neural network: (a) graph convolution layer based on geographic distance among seismic stations; (b) 1st, (c) 2nd, (d) 3rd, and (e) 4th graph convolution layers based on the extracted feature similarity. Blue markers denote seismic stations without the event signal, red markers denote stations with the event signal, and edges are colored by their starting station. The information from stations with the event signal is clustered in deeper layers.
The limitations posed by fixed prediction ranges are made less severe by STGNN's ability to be tuned to new ranges with small amounts of training data. However, the fixed prediction ranges nonetheless represent a weakness in our framework.

3.2.4.2 Impacts on Location Prediction

Overall location error for the STGNN model is 5.41 km for the Fixed Oklahoma Dataset and 14.75 km for the Fixed Southern California Dataset. The higher loss for the Southern California dataset may be attributable to the larger size of the region. As locations in both the smaller and larger regions are normalized to values between −1 and 1, errors in the initial prediction will result in larger errors when converted to kilometers in larger regions. In addition, larger regions may include a greater range of structural complexity that may be more challenging for the model to learn. Location error for the Dynamic California Dataset was 3.93 km lower than that of the Fixed California Dataset. This supports the assumption that dynamic inputs improve not only the flexibility but also the performance of prediction models.

3.2.4.3 Synthetic Testing

While substantial improvements have been made in the prediction of latitude and longitude, magnitude does not improve in every dataset, and depth predictions are inaccurate for all models. To test the capacity of our model under ideal circumstances, we train our model using synthetic data. The synthetic waveforms are generated using Green's functions created with PyFK Xi et al. (2021) from a 1-D sedimentary half-space model, with an epicentral resolution of 1 km, a depth resolution of 0.5 km, and a sampling rate of 20.48 Hz. Thirty recording sensors are used with the same configuration as the Fixed Oklahoma Dataset. No label or waveform noise is applied. The high degree of accuracy suggests that the fundamental architecture of the model has the capacity to learn depth and magnitude estimation (Figure 3.14). The differences between the simulated waveforms and the recorded data are (1) label noise, (2) waveform noise, and (3) subsurface complexity. While the fundamental structure holds promise, STGNN must be improved to address these factors before it can be effectively applied to real seismic datasets for depth and magnitude prediction.

Figure 3.14 Testing performance of STGNN on synthetic data. In the scatter plot, each point represents an event, and a position on the diagonal line corresponds to perfect agreement between the predicted value (x-axis) and the true value (y-axis). Latitude and longitude values are displayed in degrees and depth values are displayed in kilometers.

3.2.5 Summary

This section has presented a graph convolutional neural network (GCNN) specifically designed for earthquake source characterization, utilizing waveform data from multiple seismic stations. This application leverages the regularization of the 𝑓 term as defined in Equation (1.3). Through experimental validation in two distinct seismic environments, the Spatio-Temporal Graph Neural Network (STGNN) demonstrated superior performance over both fully-convolutional neural network (FCN) and traditional graph neural network (GNN) baselines. A significant advantage of the
STGNN framework is its ability to dynamically learn feature generation and fusion processes directly from the data, thereby eliminating the need for static input types or manually predefined graph structures. This allows for an effective synthesis of waveform features and spatial data, enhancing the model's predictive accuracy and adaptability.

3.3 Unsupervised Learning of Full-Waveform Inversion: Connecting CNN and Partial Differential Equation in a Loop

In addition to designing regularization based on data structure as described in Sections 3.1 and 3.2, this section introduces a physics-informed machine learning architecture for full-waveform inversion, which is mathematically modeled by partial differential equations. By integrating governing physical equations with neural networks, we impose regularization through the 𝑓 term in Equation (1.3). Furthermore, experimental results indicate that the perceptual loss Johnson et al. (2016) (represented as the 𝑅 term in Equation (1.3)) significantly enhances seismic data reconstruction.

3.3.1 Introduction

Geophysical properties (such as velocity, impedance, and density) play an important role in various subsurface applications, including subsurface energy exploration, carbon capture and sequestration, estimating pathways of subsurface contaminant transport, and earthquake early warning systems that provide critical alerts. These properties can be obtained via seismic surveys, i.e., receiving reflected/refracted seismic waves generated by a controlled source. This section focuses on reconstructing subsurface velocity maps from seismic measurements. Mathematically, the velocity map and seismic measurements are related through an acoustic-wave equation (a second-order partial differential equation) as follows:

\nabla^2 p(\mathbf{r}, t) - \frac{1}{v(\mathbf{r})^2}\,\frac{\partial^2 p(\mathbf{r}, t)}{\partial t^2} = s(\mathbf{r}, t),    (3.14)

where p(r, t) denotes the pressure wavefield at spatial location r and time t, v(r) represents the velocity map, and s(r, t) is the source term. Full-Waveform Inversion (FWI) is a methodology that determines high-resolution velocity maps v(r) of the subsurface by matching synthetic seismic waveforms to raw recorded seismic data p(r̃, t), where r̃ represents the locations of seismic receivers. A velocity map describes the wave propagation speed in the subsurface region of interest. An example in a 2D scenario is shown in Figure 3.15a. In particular, the x-axis represents the horizontal offset of a region, and the y-axis stands for the depth. Regions with the same geologic information (velocity) are called layers in velocity maps. In a sample of seismic measurements (termed a shot gather in geophysics), as depicted in Figure 3.15b, each grid point on the x-axis represents a receiver, and the value along the y-axis is a 1D time-series signal recorded by each receiver.

Figure 3.15 An example of (a) a velocity map and (b) seismic measurements (named a shot gather in geophysics) and the 1D time-series signal recorded by a receiver.

Existing approaches solve FWI in two directions: physics-driven and data-driven. Physics-driven approaches rely on the forward modeling of Equation 3.14, which simulates seismic data from a velocity map by finite differences.
These physics-driven methods optimize the velocity map per seismic sample, iteratively updating the velocity map from an initial guess such that the simulated seismic data (after forward modeling) are close to the input seismic measurements. However, these methods are slow and difficult to scale up, as the iterative optimization is required per input sample. Data-driven approaches consider the FWI problem as an image-to-image translation task and apply convolutional neural networks (CNNs) to learn the mapping from seismic data to velocity maps (Wu and Lin, 2019). The limitation of these methods is that they require paired seismic data and velocity maps to train the network. Such ground truth velocity maps are hardly accessible in real-world scenarios because generating them is extremely time-consuming even for domain experts.

In this work, we leverage the advantages of both directions (physics + data driven) and shift the paradigm to unsupervised learning of FWI by connecting forward modeling and a CNN in a loop. Specifically, as shown in Figure 3.16, a CNN is trained to predict a velocity map from seismic data, which is followed by forward modeling to reconstruct the seismic data. The loop is closed by applying a reconstruction loss on the seismic data to train the CNN. Due to the differentiable forward modeling, the whole loop can be trained end-to-end. Note that the CNN is trained in an unsupervised manner, as the ground truth of the velocity map is not needed. We name our unsupervised approach UPFWI (Unsupervised Physics-informed Full-Waveform Inversion). Additionally, we find that perceptual loss (Johnson et al., 2016) is crucial for improving the overall quality of predicted velocity maps due to its superior capability in preserving the coherence of the reconstructed waveforms compared with other losses such as Mean Squared Error (MSE) and Mean Absolute Error (MAE).

To encourage fair comparison on a large dataset with more complicated geological structures, we introduce a new synthetic dataset named OpenFWI, which contains 60,000 labeled data (velocity map and seismic data pairs) and 48,000 unlabeled data (seismic data alone). 30,000 of those velocity maps contain curved layers that are more challenging for inversion. We also add geological faults with various shift distances and tilting angles to all velocity maps. We evaluate our method on this dataset. Experimental results show that for velocity maps with flat layers, our UPFWI trained with 48,000 unlabeled data achieves 1146.09 in MSE, which is 26.77% smaller than that of the supervised baseline H-PGNN+ (Sun et al., 2021), and 0.9895 in Structural Similarity (SSIM), which is 0.0021 higher; for velocity maps with curved layers, our UPFWI achieves 3639.96 in MSE, which is 28.30% smaller, and 0.9756 in SSIM, which is 0.0057 higher. Our contributions are summarized as follows:

• We propose to solve FWI in an unsupervised manner by connecting a CNN and forward modeling in a loop, enabling end-to-end learning from seismic data alone.

• We find that perceptual loss is helpful for boosting the performance to be comparable to the supervised counterpart.

• We introduce a large-scale dataset as a benchmark to encourage further research on FWI.
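To make the proposed loop concrete, the following is a minimal PyTorch sketch under strong simplifying assumptions: a single source, a toy finite-difference forward operator (a 5-point Laplacian, zero initial wavefields, no absorbing boundaries), and illustrative names (forward_modeling, upfwi_step, loss_fn) that are not the implementation used in this work.

```python
import torch
import torch.nn.functional as F

def forward_modeling(v, source, dt=1e-3, dx=15.0):
    # Toy differentiable acoustic forward operator: explicit time stepping of the
    # discretized wave equation (cf. Equation 3.22 later in this section).
    # v: (batch, H, W) velocity maps; source: (T, H, W) injected source wavefield.
    lap_kernel = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]],
                              device=v.device) / dx ** 2
    lap_kernel = lap_kernel.view(1, 1, 3, 3)
    p_prev = torch.zeros_like(v)
    p_curr = torch.zeros_like(v)
    traces = []
    for t in range(source.shape[0]):
        lap_p = F.conv2d(p_curr.unsqueeze(1), lap_kernel, padding=1).squeeze(1)
        p_next = (2 * p_curr - p_prev
                  + (v ** 2) * (dt ** 2) * (lap_p - source[t]))
        traces.append(p_next[:, 0, :])      # record the surface row as receivers
        p_prev, p_curr = p_curr, p_next
    return torch.stack(traces, dim=1)       # (batch, T, n_receivers)

def upfwi_step(cnn, seismic, source, loss_fn, optimizer):
    # One unsupervised training step; seismic: (batch, T, n_receivers) in this sketch.
    # No ground-truth velocity map is used anywhere.
    v_pred = cnn(seismic)                   # CNN inversion: seismic -> velocity
    seismic_rec = forward_modeling(v_pred, source)
    loss = loss_fn(seismic_rec, seismic)    # reconstruction loss on seismic data
    optimizer.zero_grad()
    loss.backward()                         # gradients flow through the PDE solver
    optimizer.step()
    return loss.item()
```

Because the time-stepping loop is written entirely in differentiable tensor operations, autograd propagates the reconstruction loss back through the PDE solver to the CNN parameters, which is the property the unsupervised loop relies on.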
3.3.2 Preliminaries of Full-Waveform Inversion (FWI)

The goal of FWI in geophysics is to invert for a velocity map v ∈ R^{W×H} from seismic measurements p ∈ R^{S×T×R}, where W and H denote the horizontal and vertical dimensions of the velocity map, S is the number of sources used to generate waves during the data acquisition process, T denotes the number of samples in the wavefields recorded by each receiver, and R represents the total number of receivers.

Figure 3.16 Schematic illustration of our proposed method, which comprises a CNN to learn an inverse mapping and a differentiable operator to approximate the forward modeling of the PDE.

Figure 3.17 Unsupervised UPFWI (ours) vs. supervised H-PGNN+ (Sun et al., 2021), in terms of Mean Squared Error (MSE) and Structural Similarity (SSIM). Our method achieves better performance, e.g., lower MSE and higher SSIM, when involving more unlabeled data (>24k).

In conventional physics-driven methods, forward modeling commonly refers to the process of simulating seismic data p̃ from a given estimated velocity map v̂. For simplicity, the forward acoustic-wave operator f can be expressed as

\tilde{p} = f(\hat{v}).    (3.15)

Given this forward operator f, the physics-driven FWI can be posed as a minimization problem (Virieux and Operto, 2009)

E(\hat{v}) = \min_{\hat{v}} \left\{ \|p - f(\hat{v})\|_2^2 + \lambda R(\hat{v}) \right\},    (3.16)

where ||p − f(v̂)||²₂ is the ℓ2 distance between the true seismic measurements p and the corresponding simulated data f(v̂), λ is a regularization parameter, and R(v̂) is the regularization term, which is often the ℓ2 or ℓ1 norm of v̂. This requires optimization per sample, which is slow as the optimization involves multiple iterations from an initial guess. Data-driven methods leverage CNNs to directly learn the inverse mapping as (Adler et al., 2021)

\hat{v} = g_\theta(p) \approx f^{-1}(p),    (3.17)

where g_θ(·) is the approximated inverse operator of f(·) parameterized by θ. In practice, g_θ is usually implemented as a CNN (Adler et al., 2021; Wu and Lin, 2019). This requires paired seismic data and velocity maps for supervised learning. However, the acquisition of a large volume of velocity maps in field applications can be extremely challenging and computationally prohibitive.

3.3.3 Related Work

Physics-driven Methods: In the past few decades, many regularization techniques have been proposed to alleviate the ill-posedness and non-linearity of FWI (Hu et al., 2009; Burstedde and Ghattas, 2009; Ramírez and Lewis, 2010; Lin and Huang, 2017, 2015b,a; Guitton, 2012; Treister and Haber, 2016). Other researchers focused on multi-scale techniques and decomposed the data into different frequency bands (Bunks et al., 1995; Boonyasiriwat et al., 2009).

Data-driven Methods: Recently, some researchers have employed neural networks to solve FWI. Those methods can be further divided into supervised and unsupervised methods.

Supervised: One type of supervised method requires labeled samples to directly learn the inverse mapping, and can be formulated as:

\hat{v}(p) = g_{\theta^*}(p) \quad \text{s.t.} \quad \theta^*(\Phi_s) = \arg\min_{\theta} \sum_{\{v_i, p_i\} \in \Phi_s} \mathcal{L}(g_\theta(p_i), v_i),    (3.18)

where p denotes the seismic measurements, v is the velocity map, θ represents the trainable weights in the inversion network g_θ(·), f(·) is the forward modeling, and L(·,·) is a loss function. One example of supervised methods is the fully connected network proposed by Araya-Polo et al. (2018). Wu and Lin (2019) developed an encoder-decoder structured network to handle more complex velocity maps.
Zhang and Lin (2020) adopted a GAN and transfer learning to improve generalizability. Li et al. (2020b) designed SeisInvNet to solve the misalignment issue when dealing with sources from different locations. In Yang and Ma (2019), a U-Net architecture with skip connections was proposed. Feng et al. (2021) proposed a multi-scale framework by considering different frequency components. Rojas-Gómez et al. (2020) developed an adaptive data augmentation method to improve generalizability. Sun et al. (2021) combined the data-driven and physics-based methods and proposed the H-PGNN model. Another type of supervised method uses GANs to learn a distribution from the velocity maps in the training set as a prior (Richardson, 2018; Mosser et al., 2020). These methods can be formulated as:

\hat{v}(z^*) = g_{\theta^*}(z^*) \quad \text{s.t.} \quad z^*(p) = \arg\min_{z} \mathcal{L}\big(f(g_{\theta^*}(z)), p\big), \quad \theta^*(\Phi_v) = \arg\min_{\theta} \sum_{v_i \in \Phi_v} \mathcal{L}_{\mathrm{GAN}}(g_\theta(\alpha_i), v_i),    (3.19)

where Φ_v is a training dataset including numerous velocity maps, and z and α_i are tensors sampled from the normal distribution. The iterative optimization is then performed on z to draw a velocity map sampled from the prior distribution.

Unsupervised: The existing unsupervised methods follow the iterative optimization paradigm and perform FWI per sample. They employ neural networks to reparameterize velocity maps. The networks serve as an implicit regularization and are required to be pretrained on an expert initial guess. Those methods can be formulated as:

\hat{v}(p) = g_{\theta^*(p)}(a) \quad \text{s.t.} \quad \theta^*(p) = \arg\min_{\theta} \mathcal{L}\big(f(g_\theta(a)), p\big),    (3.20)

where a is a random tensor. Different network architectures have been proposed, including CNN-domain FWI (Wu and McMechan, 2019) and DNN-FWI (He and Wang, 2021). Zhu et al. (2021) developed NNFWI, which does not need pretraining in advance, but the initial guess is still required to be fed into the PDE together with the estimated velocity maps.

3.3.4 Method

In this section, we present our Unsupervised Physics-informed solution (named UPFWI), which connects a CNN and forward modeling in a loop. It addresses limitations of both physics-driven and data-driven approaches, as it requires neither optimization at inference (per sample) nor velocity maps as supervision.

3.3.4.1 UPFWI: Connecting CNN and Forward Modeling

As depicted in Figure 3.16, our UPFWI connects a CNN g_θ and a differentiable forward operator f to form a loop. In particular, the CNN takes seismic measurements p as input and generates the corresponding velocity map v̂. We then apply the forward acoustic-wave operator f (see Equation 3.15) to the estimated velocity map v̂ to reconstruct the seismic data p̃. Typically, the forward modeling employs finite differences (FD) to discretize the wave equation (Equation 3.14). The details of forward modeling are discussed in Section 3.3.4.3. The loop is closed by the reconstruction loss between the input seismic data p and the reconstructed seismic data p̃ = f(g_θ(p)). Notice that the ground truth of the velocity map v is not involved, and the training process is unsupervised. Since the forward operator is differentiable, the reconstruction loss can be backpropagated (via gradient descent) to update the parameters θ in the CNN.

3.3.4.2 CNN Network Architecture

We use an encoder-decoder structured CNN (similar to Wu and Lin (2019) and Zhang and Lin (2020)) to model the mapping from seismic data p ∈ R^{S×T×R} to velocity map v ∈ R^{W×H}. The encoder compresses the seismic input and then transforms the latent vector to build the velocity estimation through a decoder. See the implementation details in Appendix C.1.
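The following is a minimal sketch of such an encoder-decoder mapping from an (S, T, R) seismic tensor to a (W, H) velocity map; the layer counts, channel widths, and the tanh output normalization are illustrative choices, not the architecture detailed in Appendix C.1.

```python
import torch
import torch.nn as nn

class InversionCNN(nn.Module):
    """Minimal encoder-decoder sketch: seismic data (S, T, R) -> velocity map (W, H).
    All widths and depths are illustrative only."""
    def __init__(self, n_sources=5, out_size=70):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_sources, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                 # compress to a latent vector
        )
        self.decoder = nn.Sequential(
            nn.Linear(128, 256), nn.ReLU(),
            nn.Linear(256, out_size * out_size),     # expand to the velocity grid
        )
        self.out_size = out_size

    def forward(self, p):
        # p: (batch, S, T, R) shot gathers treated as S-channel images.
        z = self.encoder(p).flatten(1)
        v = self.decoder(z).view(-1, self.out_size, self.out_size)
        return torch.tanh(v)                         # normalized velocity in [-1, 1]
```

A model of this form plays the role of the CNN g_θ in the loop sketched earlier, with a de-normalization step mapping the output back to physical velocity values.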
3.3.4.3 Differentiable Forward Modeling

We apply the standard finite difference (FD) in the space domain and time domain to discretize the original wave equation. Specifically, the second-order time derivative ∂²p(r, t)/∂t² in Equation 3.14 is approximated with a central finite difference as follows:

\frac{\partial^2 p(\mathbf{r}, t)}{\partial t^2} \approx \frac{1}{(\Delta t)^2}\left(p_{\mathbf{r}}^{t+1} - 2p_{\mathbf{r}}^{t} + p_{\mathbf{r}}^{t-1}\right) + O\!\left[(\Delta t)^2\right],    (3.21)

where p_r^t denotes the pressure wavefield at timestep t, and p_r^{t+1} and p_r^{t−1} are the wavefields at t + Δt and t − Δt, respectively. The Laplacian of p(r, t) can be estimated in a similar way in the space domain (see Appendix C.2). Therefore, the wave equation can then be written as

p_{\mathbf{r}}^{t+1} = \left(2 + v^2 (\Delta t)^2 \nabla^2\right) p_{\mathbf{r}}^{t} - p_{\mathbf{r}}^{t-1} - v^2 (\Delta t)^2 s_{\mathbf{r}}^{t},    (3.22)

where ∇² here denotes the discrete Laplace operator. The wavefield at the initial timestep is set to zero (i.e., p_r^0 = 0). Thus, the gradient of the loss L with respect to the estimated velocity at spatial location r can be computed using the chain rule as

\frac{\partial \mathcal{L}}{\partial v(\mathbf{r})} = \sum_{t=0}^{T} \left[ \frac{\partial \mathcal{L}}{\partial p(\mathbf{r}, t)} \, \frac{\partial p(\mathbf{r}, t)}{\partial v(\mathbf{r})} \right],    (3.23)

where T indicates the total number of timesteps.

3.3.4.4 Loss Function

The reconstruction loss of our UPFWI includes a pixel-wise loss and a perceptual loss as follows:

\mathcal{L}(p, \tilde{p}) = \mathcal{L}_{\mathrm{pixel}}(p, \tilde{p}) + \mathcal{L}_{\mathrm{perceptual}}(p, \tilde{p}),    (3.24)

where p and p̃ are the input and reconstructed seismic data, respectively. The pixel-wise loss L_pixel combines the ℓ1 and ℓ2 distances as:

\mathcal{L}_{\mathrm{pixel}}(p, \tilde{p}) = \lambda_1 \ell_1(p, \tilde{p}) + \lambda_2 \ell_2(p, \tilde{p}),    (3.25)

where λ1 and λ2 are two hyper-parameters to control the relative importance. For the perceptual loss L_perceptual, we extract features from conv5 in a VGG-16 network (Simonyan and Zisserman, 2014) pretrained on ImageNet (Krizhevsky et al., 2012) and combine the ℓ1 and ℓ2 distances as:

\mathcal{L}_{\mathrm{perceptual}}(p, \tilde{p}) = \lambda_3 \ell_1(\phi(p), \phi(\tilde{p})) + \lambda_4 \ell_2(\phi(p), \phi(\tilde{p})),    (3.26)

where φ(·) represents the output of conv5 in the VGG-16 network, and λ3 and λ4 are two hyper-parameters. Compared to the pixel-wise loss, the perceptual loss is better at capturing the region-wise structure, which reflects the waveform coherence. This is crucial to boost the overall accuracy of velocity maps (e.g., the quantitative velocity values and the structural information).

3.3.5 OpenFWI Dataset

We introduce a new large-scale geophysics FWI dataset, OpenFWI, which consists of 108K seismic data samples for two types of velocity maps: one with flat layers (named FlatFault) and the other with curved layers (named CurvedFault). Each type has 54K seismic data samples, including 30K with paired velocity maps (labeled) and 24K unlabeled. The 30K labeled pairs are split into 24K/3K/3K for training, validation, and testing, respectively. Samples are shown in Appendix C.3.

The shape of the curves in our dataset follows a sine function. Velocity maps in CurvedFault are designed to validate the effectiveness of FWI methods on curved topography. Compared to the maps with flat layers, curved velocity maps yield much more irregular geological structures, making inversion more challenging. Both FlatFault and CurvedFault contain 30,000 samples with 2 to 4 layers and their corresponding seismic data. Each velocity map has dimensions of 70×70, and the grid size is 15 meters in both directions. The layer thickness ranges from 15 grids to 35 grids, and the velocity in each layer is randomly sampled from a uniform distribution between 3,000 meters/second and 6,000 meters/second. The velocity is designed to increase with depth to be more physically realistic. We also add geological faults to every velocity map.
The faults shift from 10 grids to 20 grids, and the tilting angle ranges from −123 to 123 degrees. To synthesize the seismic data, five sources are evenly placed on the surface with a 255-meter spacing, and seismic traces are recorded by 70 receivers located at every grid point with an interval of 15 meters. The source is a Ricker wavelet with a central frequency of 25 Hz (Wang, 2015). Each receiver records time-series data for 1 second, and we use a 1 millisecond sample rate to generate 1,000 timesteps. Therefore, the dimensions of the seismic data become 5×1000×70. Compared to existing datasets (Yang and Ma, 2019; Moseley et al., 2020), OpenFWI is significantly larger. It includes more complicated and physically realistic velocity maps. We hope it establishes a more challenging benchmark for the community.

3.3.6 Experiments

In this section, we present experimental results of our proposed UPFWI evaluated on OpenFWI.

3.3.6.1 Implementation Details

Training Details: The input seismic data are normalized to the range [−1, 1]. We employ the AdamW (Loshchilov and Hutter, 2018) optimizer with momentum parameters β1 = 0.9, β2 = 0.999 and a weight decay of 1 × 10⁻⁴ to update all parameters of the network. The initial learning rate is set to 3.2 × 10⁻⁴, and we reduce the learning rate by a factor of 10 when the validation loss reaches a plateau. The minimum learning rate is set to 3.2 × 10⁻⁶. The size of a mini-batch is set to 128. All trade-off hyper-parameters λ in our loss function are set to 1. We implement our models in PyTorch and train them on 8 NVIDIA Tesla V100 GPUs. All models are randomly initialized.

Evaluation Metrics: We consider three metrics for evaluating the velocity maps inverted by our method: MAE, MSE, and SSIM. Both MAE and MSE have been employed in existing methods (Wu and Lin, 2019; Zhang and Lin, 2020) to measure pixel-wise errors. Considering that the layered-structured velocity maps contain highly structured information, degradation or distortion can be easily perceived by a human. To better align with human vision, we employ SSIM to measure perceptual similarity. Note that for the MAE and MSE calculation, we denormalize velocity maps to their original scale, while we keep them in the normalized scale [−1, 1] for SSIM according to the algorithm.

Comparison: We compare our method with three state-of-the-art algorithms: two pure data-driven methods, i.e., InversionNet (Wu and Lin, 2019) and VelocityGAN (Zhang and Lin, 2020), and a physics-informed method, H-PGNN (Sun et al., 2021). We follow the implementation described in these papers and search for the best hyper-parameters for the OpenFWI dataset. Note that we improve H-PGNN by replacing the network architecture with the CNN in our UPFWI and adding the perceptual loss, resulting in significantly boosted performance. We refer to our implementation as H-PGNN+, which is a strong supervised baseline. Our method has two variants (UPFWI-24K and UPFWI-48K), using 24K and 48K unlabeled seismic data, respectively.

3.3.6.2 Main Results

Results on FlatFault: Table 3.6 shows the results of different methods on FlatFault. Compared to InversionNet and VelocityGAN, our UPFWI-24K performs better in MSE and SSIM, but is slightly worse in MAE. Compared to H-PGNN+, there is a gap between our UPFWI-24K and H-PGNN+ when trained with the same amount of data. However, after we double the size of the unlabeled data (from 24K to 48K), a significant improvement is observed in our UPFWI-48K for all three metrics, and it outperforms all three supervised baselines in MSE and SSIM.
This demonstrates the potential of our UPFWI to achieve higher performance as more unlabeled data are involved.

Table 3.6 Quantitative results evaluated on OpenFWI in terms of MAE, MSE, and SSIM. Our UPFWI yields comparable inversion accuracy compared to supervised baselines. For H-PGNN+, we use our network architecture to replace the original one reported in their paper, and an additional perceptual loss between seismic data is added during training.

Supervised — FlatFault: MAE↓ / MSE↓ / SSIM↑; CurvedFault: MAE↓ / MSE↓ / SSIM↑
  InversionNet  15.83 / 2156.00 / 0.9832    23.77 / 5285.38 / 0.9681
  VelocityGAN   16.15 / 1770.31 / 0.9857    25.83 / 5076.79 / 0.9699
  H-PGNN+       12.91 / 1565.02 / 0.9874    24.19 / 5139.60 / 0.9685
Unsupervised
  UPFWI-24K     16.27 / 1705.35 / 0.9866    29.59 / 5712.25 / 0.9652
  UPFWI-48K     14.60 / 1146.09 / 0.9895    23.56 / 3639.96 / 0.9756

Figure 3.18 Comparison of different methods on inverted velocity maps of FlatFault (top) and CurvedFault (bottom); columns show the ground truth and the results of InversionNet, VelocityGAN, H-PGNN+, UPFWI-24K (ours), and UPFWI-48K (ours). For FlatFault, our UPFWI-48K reveals more accurate details at layer boundaries and the slope of the fault in the deep region. For CurvedFault, our UPFWI reconstructs the geological anomalies on the surface that best match the ground truth.

The velocity maps inverted by different methods are shown in the top row of Figure 3.18. Consistent with our quantitative analysis, more accurate details are observed in the velocity maps generated by UPFWI-48K. For instance, we find in the visualization results that both InversionNet and VelocityGAN generate blurry results in the deep region, while H-PGNN+, UPFWI-24K, and UPFWI-48K yield much clearer boundaries. We attribute this finding to the impact of the seismic loss. We further observe that the slope of the fault in the deep region is different from that in the shallow region, yet only UPFWI-48K replicates this result, as highlighted by the green square.

Figure 3.19 Comparison of UPFWI with different loss functions on (a) waveform residuals and their corresponding inversion results (ground truth provided in the first column; the remaining columns correspond to pixel-ℓ2, pixel-ℓ1ℓ2, and pixel-ℓ1ℓ2 + perceptual), and (b) single-trace residuals recorded by the receiver at 525 m offset. Our UPFWI trained with pixel-wise loss (ℓ1 + ℓ2 distance) and perceptual loss yields the most accurate results. Best viewed in color.

Results on CurvedFault: Table 3.6 shows the results on CurvedFault. Performance degradation is observed for all models, due to the more complicated geological structures in CurvedFault. Although our UPFWI-24K underperforms the three supervised baselines, our UPFWI-48K significantly boosts the performance, outperforming all supervised methods in terms of all three metrics. This demonstrates the power of unsupervised learning in our UPFWI, which greatly benefits from more unlabeled data when dealing with the more complicated curved structures. The bottom row of Figure 3.18 shows the visualized velocity maps in CurvedFault obtained using different methods. Similar to the observation in FlatFault, our UPFWI-48K yields more accurate details compared to the results of the supervised methods. For instance, only our UPFWI-24K and UPFWI-48K precisely reconstruct the fault beneath the curve around the top-left corner, as highlighted by the yellow square. Although some artifacts are observed in the results of UPFWI-24K around the layer boundary in the deep region, they are eliminated in the results of UPFWI-48K. More visualization results are shown in Appendix C.3.
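Before ablating the individual terms in the next subsection, the combined pixel-wise and perceptual reconstruction loss (Equations 3.24–3.26) can be sketched as follows. Replicating each single-channel gather to three channels for the ImageNet-pretrained VGG-16, omitting ImageNet normalization, and truncating the network near the end of its convolutional stack are simplifying assumptions of this sketch, not the exact implementation.

```python
import torch.nn as nn
from torchvision.models import vgg16

class SeismicReconstructionLoss(nn.Module):
    """Pixel-wise (l1 + l2) plus VGG-based perceptual loss, cf. Equations 3.24-3.26."""
    def __init__(self, lambdas=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        self.l1_loss, self.l2_loss = nn.L1Loss(), nn.MSELoss()
        self.lambdas = lambdas
        # Frozen VGG-16 feature extractor truncated near the end of its conv stack.
        self.vgg = vgg16(pretrained=True).features[:30].eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)

    def features(self, x):
        # x: (batch, S, T, R); fold the source dimension into the batch and
        # replicate the single channel to 3 so the gather can be fed to VGG-16.
        b, s, t, r = x.shape
        img = x.reshape(b * s, 1, t, r).repeat(1, 3, 1, 1)
        return self.vgg(img)

    def forward(self, p_rec, p_true):
        w1, w2, w3, w4 = self.lambdas
        pixel = w1 * self.l1_loss(p_rec, p_true) + w2 * self.l2_loss(p_rec, p_true)
        f_rec, f_true = self.features(p_rec), self.features(p_true)
        perceptual = w3 * self.l1_loss(f_rec, f_true) + w4 * self.l2_loss(f_rec, f_true)
        return pixel + perceptual
```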
3.3.6.3 Ablation Study

Loss Terms: We study the contribution of each loss term in our loss function: (a) pixel-wise ℓ2 distance (MSE), (b) pixel-wise ℓ1 distance (MAE), and (c) perceptual loss. All experiments are conducted on FlatFault using 24,000 unlabeled data. Figure 3.19a shows the predicted velocity maps for three loss combinations (pixel-ℓ2, pixel-ℓ1ℓ2, pixel-ℓ1ℓ2+perceptual) in UPFWI. The ground truth seismic data and velocity map are shown in the left column. For each loss option, we show the difference between the reconstructed and the input seismic data (top) and the predicted velocity (bottom). When using the pixel-wise loss in ℓ2 distance alone, there are some obvious artifacts in both the seismic data (around 600 milliseconds) and the velocity map. These artifacts are mitigated by introducing an additional pixel-wise loss in ℓ1 distance. With the perceptual loss added, more details are correctly retained (e.g., seismic data from 400 milliseconds to 600 milliseconds, velocity boundaries between layers). Figure 3.19b compares the reconstructed seismic data (in terms of residual to the ground truth) at a slice of 525 meter offset (orange dashed line in Figure 3.19a). Clearly, the combination of pixel-wise and perceptual loss has the smallest residual.

Table 3.7 Quantitative results of our UPFWI with different loss function settings.

Loss (pixel-ℓ1 / pixel-ℓ2 / perceptual) — Velocity Error: MAE↓ / MSE↓ / SSIM↑; Seismic Error: MAE↓ / MSE↓ / SSIM↑
  – / ✓ / –    32.61 / 10014.47 / 0.9735    0.0167 / 0.0023 / 0.9978
  ✓ / ✓ / –    21.71 / 2999.55 / 0.9775     0.0155 / 0.0025 / 0.9977
  ✓ / ✓ / ✓    16.27 / 1705.35 / 0.9866     0.0140 / 0.0021 / 0.9984

Table 3.8 Quantitative results of our UPFWI evaluated on the Marmousi and Salt datasets.

Method — Marmousi: MAE↓ / MSE↓ / SSIM↑; Salt: MAE↓ / MSE↓ / SSIM↑
  InversionNet  149.67 / 45936.23 / 0.7889    25.98 / 8669.98 / 0.9764
  UPFWI         221.93 / 125825.75 / 0.7920   150.34 / 164595.28 / 0.7837

Table 3.9 Quantitative results of our UPFWI with different architectures.

Network — MAE↓ / MSE↓ / SSIM↑
  CNN        16.27 / 1705.35 / 0.9866
  ViT        41.44 / 11029.01 / 0.9461
  MLP-Mixer  22.32 / 4177.37 / 0.9726

Table 3.10 Quantitative results of our UPFWI tested on seismic inputs with different noise levels.

σ (10⁻⁴) — FlatFault: PSNR / MAE↓ / MSE↓ / SSIM↑; CurvedFault: PSNR / MAE↓ / MSE↓ / SSIM↑
  0.5    61.60 / 15.68 / 1343.21 / 0.9888    61.72 / 23.78 / 3704.00 / 0.9751
  1.0    58.70 / 24.84 / 4010.78 / 0.9733    58.70 / 24.84 / 4010.78 / 0.9733
  5.0    51.58 / 44.33 / 7592.57 / 0.9681    51.68 / 46.90 / 10415.38 / 0.9441

The quantitative results are shown in Table 3.7. They are consistent with our observations in the qualitative analysis (Figure 3.19a). In particular, using the pixel-wise loss in ℓ2 distance alone has the worst performance. The involvement of the ℓ1 distance mitigates velocity errors but is slightly worse on the MSE and SSIM of the seismic error. Adding the perceptual loss boosts the performance in all metrics by a clear margin. This shows that the perceptual loss helps retain waveform coherence, which is correlated with the velocity boundaries, and validates our proposed loss function (combining pixel-wise and perceptual loss).

More Challenging Datasets: We further evaluate our UPFWI on two more challenging tests, the Marmousi and Salt (Yang and Ma, 2019) datasets, and achieve solid results. For the Marmousi dataset, we follow the work of Feng et al. (2021) and employ the Marmousi velocity map as the style image to construct a low-resolution dataset. Table 3.8 shows the quantitative results on both datasets. Although our UPFWI achieves good results on the Salt dataset with preserved subsurface structures, it has clearly larger errors than the supervised InversionNet.
This is due to two reasons: (a) the Salt dataset has a small amount of training data (120 samples), which is very challenging for unsupervised methods; and (b) the variability between training and testing samples is small, which favors supervised methods significantly more than their unsupervised counterparts. Visualizations of the results on the Marmousi and Salt datasets are shown in Appendix C.4.

Other Network Architectures: We further conducted experiments using a Vision Transformer (ViT, Dosovitskiy et al., 2020) and an MLP-Mixer (Tolstikhin et al., 2021) to replace the CNN as the encoder. Table 3.9 further shows the quantitative results. Solid results are obtained for both network architectures, indicating that our proposed method is model-agnostic. Visualization results are shown in Appendix C.4.

Table 3.11 Quantitative results of our UPFWI tested on seismic inputs with missing traces.

Missing Traces — FlatFault: MAE↓ / MSE↓ / SSIM↑; CurvedFault: MAE↓ / MSE↓ / SSIM↑
  4 (5%)     21.23 / 1772.05 / 0.9868     41.33 / 6914.12 / 0.9622
  7 (10%)    33.66 / 3504.25 / 0.9814     61.72 / 12445.90 / 0.9453
  17 (25%)   85.21 / 16731.69 / 0.9457    121.06 / 36770.77 / 0.8853

Robustness Evaluation: We validate the robustness of our UPFWI models with two additional tests: (1) testing data contaminated by Gaussian noise and (2) testing data with missing traces. The quantitative results are shown in Table 3.10 and Table 3.11, respectively. We observe that in both experiments our model is robust to a certain level of noise and irregular acquisition. Visualization results are shown in Appendix C.4.

3.3.7 Summary

This section presented a novel physics-informed machine learning architecture tailored for full-waveform inversion, a process effectively modeled by partial differential equations. By integrating these equations with convolutional neural networks, we achieved a sophisticated form of regularization captured by the 𝑓 term in Equation (1.3). Our experimental findings underscore the benefits of incorporating perceptual loss, as denoted by the 𝑅 term in Equation (1.3), which has significantly enhanced the quality of seismic data reconstruction. This method not only elevates the precision of seismic interpretations but also broadens the scope of neural networks in geophysical data analysis, setting the stage for more detailed and accurate geological assessments.

3.4 Making Invisible Visible: Data-Driven Seismic Inversion with Spatio-temporally Constrained Data Augmentation

Deep learning methods have unlocked new potential for leveraging the vast amounts of data available today. Particularly noteworthy are the advancements in data-driven full-waveform inversion techniques, as seen in recent developments Zhang and Lin (2020); Wu and Lin (2019); Araya-Polo et al. (2018). These methods, however, typically rely on large-scale datasets that are often not available for full-waveform inversion tasks, thereby underscoring the critical role of data augmentation. Traditional data augmentation techniques used in computer vision, such as shifting, rotating, scaling, and cropping, may interfere with the physical integrity of seismic data. This section introduces a data augmentation method using generative models (variational autoencoders) that adheres to the S term in Equation (1.3), specifically designed to maintain the physical properties inherent in the data.

3.4.1 Introduction

Most of the current data-driven seismic FWI techniques are built on end-to-end neural network structures.
In order to improve inversion accuracy and model generalization, data-driven techniques are usually trained on a large volume of data, which in turn significantly increases the complexity of the networks. In Wu and Lin (2019), an encoder-decoder structure is developed to learn the regression correspondence from raw seismic data to velocity maps. In Araya-Polo et al. (2018), a fully-connected network structure is designed for the inversion. In Zhang and Lin (2020), a generative model is utilized and trained to learn the inversion operator. To give an idea of the size of the training data, in Wu and Lin (2019), a ten-layer encoder-decoder results in more than 40 million learnable model parameters, and more than 60,000 pairs of labeled simulations need to be available to train this deep neural network, as reported in Wu and Lin (2019). However, obtaining such a large amount of data is challenging (or even infeasible) for some subsurface applications due to practical obstacles in data acquisition and simulation. In particular, seismic FWI for monitoring is notoriously known as a "small-data regime": a limited number of seismic sensors are distributed over a large area, and very few time-lapse observations can be affordably acquired Lumley (2001). Training a data-driven seismic FWI model using limited data will result in weak generalizability and overfitting.

To fully unleash the power of deep learning for a better, faster, and cheaper subsurface seismic FWI approach, we develop a new data augmentation technique to bridge the gap by addressing the critical issues of generalizability and the capability of generating a large volume of high-quality training data. Data augmentation, the process of creating new samples by manipulating the original data, addresses the data shortage at the root of the problem Shorten and Khoshgoftaar (2019). However, the most popular data augmentation methods are not appropriate for seismic imaging due to their inability to incorporate generic physics properties. Furthermore, seismic data usually exhibit both spatial and temporal characteristics. To address those issues, we develop data augmentation techniques that account for both spatio-temporal characteristics and the critical seismic physics to generate high-quality simulations. Our models are built on the variational autoencoder (VAE) to take advantage of its direct tie to the latent representations Kingma and Welling (2014). The design of our techniques considers different representations of physics, including the governing equations, the observable perception, and the physics phenomena. To validate the performance of our developed techniques, we test our models using an existing CO2 leakage synthetic dataset, the Kimberlina dataset, generated and operated by the U.S. Department of Energy (DOE) Jordan and Wagoner (2017). Our interest is to employ our data-driven FWI to image and detect small CO2 leaks. Via various numerical tests, we demonstrate that our data augmentation techniques significantly improve the data representativeness of the training set, which in turn enhances the seismic imaging accuracy. Specifically, CO2 plumes related to small leaks can now be much better imaged than those obtained without using augmentation.

In the following sections, we first briefly review related work in Section 3.4.2. We develop and discuss our data augmentation techniques in Section 3.4.4. We further provide all the numerical tests and results in Section 3.4.5.
Finally, further discussion, future work, and concluding remarks will be presented.

3.4.2 Related Work

3.4.2.1 Deep Generative Models

Generative models are a class of unsupervised learning approaches that explicitly or implicitly model the distribution of the true data so as to generate new samples with some variations Bishop (2006). Current state-of-the-art generative models are built on deep neural networks (i.e., deep generative models (DGMs)). Examples of recent DGMs include variational autoencoders (VAE) Kingma and Welling (2014) and generative adversarial networks (GAN) Goodfellow et al. (2014). As a variant of the autoencoder, the VAE belongs to the DGMs that learn the data distribution explicitly. It solves a variational inference problem to maximize the marginalized data likelihood by using a generative network (decoder) and a recognition network (encoder). Once fully trained, the encoder learns a distribution over the latent variable given an observation, and the decoder learns a distribution over the observation given a latent variable. VAE and its variants have shown great potential in generating data for augmentation in different applications Luo et al. (2020); Hsu et al. (2017); Nishizaki (2017); Liu et al. (2018). In particular, Luo et al. (2020) employ a vanilla VAE to generate synthetic EEG time series for recognizing emotions. Nishizaki (2017) also employs a VAE to generate waveform data for automatic speech recognition. In Liu et al. (2018), a VAE is used to extract useful features in the latent space from image data; linear interpolation in the latent space is then conducted to obtain new synthetic images. Hsu et al. (2017) develop a VAE-based data augmentation technique to address the distribution mismatch between source and target domains for improving the performance of a domain adaptation method in speech recognition. Besides VAE, other DGMs (such as GAN) have also been applied to the task of data augmentation Zhang et al. (2019); Antoniou et al. (2017); Shrivastava et al. (2017); Sixt et al. (2018). In comparison, VAE provides a natural connection to the data distribution by collapsing most dimensions in the latent representations. Another noticeable benefit of VAE-based DGMs is that they are relatively easier to train, with less technical complexity in hyper-parameter selection. Despite these encouraging results, a direct application of DGMs to our problems faces two major challenges. Firstly, DGMs are in general highly data-demanding. Secondly, they are purely driven by data without considering physics.

3.4.2.2 Physics-Informed Deep Learning

Physics-informed (i.e., domain-aware) learning is a critical task for the scientific machine learning (SciML) community Bas (2019). In particular, how to incorporate physics information has become one of the most challenging and important research topics across different scientific domains Sun et al. (2020); Gomez et al. (2020); Wang et al. (2020); Raissi et al. (2019); Zhu et al. (2019b). Thorough surveys on this topic have been published by Karniadakis et al. (2021); Willard et al. (2020). As pointed out in Karniadakis et al. (2021), there are three ways to make a learning algorithm physics-informed: "observation bias", "inductive bias", and "learning bias". The observation bias approaches introduce physics to the model directly through data that embody the underlying knowledge.
The inductive bias approaches focus on designing neural network architectures that implicitly enforce the physics knowledge associated with a given predictive task. The learning bias approaches incorporate the physics knowledge in a soft manner by appropriately penalizing the loss function of conventional neural networks. The approach developed in this work belongs to two of the above categories: observation bias and learning bias.

There are many benefits to considering physics knowledge when designing neural network models. Regardless of the application domain, one of the major benefits is to improve the robustness of the prediction model and to produce physically meaningful (and more accurate) results. In particular, Lagaris et al. (1998) propose an artificial neural network method to solve partial differential equations (PDEs) for flow simulations. Raissi et al. (2019) develop a deep-learning-based nonlinear PDE solver. Zhu et al. (2019b) develop a numerical PDE solver using a convolutional encoder-decoder and a flow-based generative model with physics constraints, and demonstrate more accurate results. Sun et al. (2020) develop another PDE solver using the physics-informed deep learning method; their method leverages both full-physics simulations and additional physics-based constraints. Wang et al. (2020) develop a spatiotemporal deep learning model that accounts for both data characteristics and the underlying physics to synthesize high-quality turbulent imagery. All of these works provide us with great inspiration for leveraging useful physics information while developing deep learning models for our seismic imaging problems.

Figure 3.20 Illustration of the Kimberlina dataset and three modeling modules used to generate the simulated velocity maps. (a) CO2 storage reservoir model, (b) wellbore leakage model, (c) multi-phase flow and reactive transport models of CO2 migration in aquifers Buscheck et al. (2019); Yang et al. (2019), and (d) illustration of a set of simulations with 20 velocity maps over a duration of 200 years. A CO2 leakage will result in a decrease of the velocity value at the location where the leak happens. (© 2022 IEEE)

3.4.2.3 Data Augmentation in Seismic Exploration

In the seismic exploration community, there has been surprisingly little work addressing this dilemma of the lack of data for data-driven seismic FWI. The existing approaches can be roughly categorized into two groups: those based on velocity building Liu et al. (2021); Ren et al. (2021); Wu et al. (2020a) and those based on pure machine learning approaches Feng et al. (2020); Gomez et al. (2020); Ovcharenko et al. (2019). Specifically, in Liu et al. (2021) and Ren et al. (2021), a large volume of subsurface velocity maps is generated to include different geologic structures; the geometry of those pre-generated geologic structures is assumed to follow a certain distribution. Wu et al. (2020a) design a workflow to automatically build a subsurface structure with folding and faulting features. Their method relies on an initial layer-like structure and therefore produces unsatisfactory results when applied to different sites. In Gomez et al. (2020), an adaptive data augmentation technique is developed to augment the training by using unlabeled seismic data. Feng et al. (2020) develop a style transfer technique to generate synthetic velocity maps from natural images.
Ovcharenko et al. (2019) develop a set of subsurface structure maps using customized subsurface random model generators. Their method relies strongly on domain knowledge to generate the content images, which in turn significantly limits the variability of the training set.

3.4.3 Small CO2 Leak Detection and the Kimberlina Dataset

In geologic carbon sequestration (GCS), also known as carbon capture and storage (CCS), developing effective monitoring methods is urgently needed to detect and respond to CO2 leakage. This is particularly important for early detection, which would provide timely warning and intervention before potential damage to the environment (such as acidification of groundwater, killing of plant life, and contamination of the atmosphere) Ha-Duong and Keith (2003). On the other hand, detecting small CO2 leaks is also technically challenging, since it requires high detectability and sufficient spatial resolution from geophysical methods to capture the subtle perturbations of geologic features induced by the leaks. Considering this pressing need, our goal in this work is to assess and further improve the early CO2 leak-detection capabilities of the seismic FWI method.

To the best of our knowledge, there is no available field seismic data that fits the scope of our problem of interest. This lack of data is recognized by the U.S. Department of Energy (DOE), and to alleviate this problem, given the importance of this application, the DOE, through the National Risk Assessment Partnership (NRAP) project, has generated a set of high-fidelity simulations, the Kimberlina dataset, with the aim of providing a standard baseline dataset to understand and assess the effectiveness of various geophysical monitoring techniques for detecting CO2 leakage Jordan and Wagoner (2017). The Kimberlina dataset is generated from a hypothetical numerical model built on the geologic structure of a commercial-scale geologic carbon sequestration reservoir at the Kimberlina site in the southern San Joaquin Basin, 30 km northwest of Bakersfield, CA, USA. The simulation procedure consists of four modules: a CO2 storage reservoir model (Fig. 3.20(a)), a wellbore leakage model (Fig. 3.20(b)), multi-phase flow and reactive transport models of CO2 migration in aquifers (Fig. 3.20(c)), and a geophysical model. In particular, the P-wave velocity maps used in this work belong to the geophysical model, which is created based on the realistic geologic-layer properties of the GCS site as shown in Fig. 3.20(b) Buscheck et al. (2019); Yang et al. (2019).

Figure 3.21 Distribution of leakage mass of the Kimberlina dataset. Each of the splits covers 20%, 20%, 20%, and 40% of the data samples, respectively. (© 2022 IEEE)

Figure 3.22 Schematic illustration of our (a) autoencoder and (b) VAE generative models. (© 2022 IEEE)

The Kimberlina dataset contains 991 CO2 leakage scenarios, each simulated over a duration of 200 years, with 20 leakage velocity maps provided (i.e., one every 10 years) for each scenario. An illustration of one specific leakage simulation and the associated leakage mass over 200 years is shown in Fig. 3.20(d). We also provide the overall distribution of the whole dataset in Fig. 3.21. For a balanced dataset, we would expect the data label (leakage mass) to be uniformly distributed, which however is not the case for the Kimberlina dataset.
In particular, the whole dataset can be split into four parts by leakage mass as

    Tiny      if mass < 9.10 × 10^6 kg,
    Small     if 9.10 × 10^6 kg < mass < 2.67 × 10^7 kg,
    Medium    if 2.67 × 10^7 kg < mass < 8.05 × 10^7 kg,
    Large     if 8.05 × 10^7 kg < mass.                        (3.27)

Each of the splits covers 20%, 20%, 20%, and 40% of the data samples, respectively. Although 20% of the samples are tiny-leakage samples, they are distributed from 0 to 9.1 × 10^6 kg, which covers nearly 70% of the CO2 leakage scenarios, as shown in Fig. 3.21. In other words, the density of tiny samples is much lower than that of the other three classes. This sparsity and imbalanced sample density create the major challenge when imaging tiny-leakage samples.

The Kimberlina project focuses on shallow CO2 leakage. That leads to 3-layer synthetic velocity models (baseline and monitor), which reflect the shallow geologic structure from the field study. The Kimberlina model and simulations have been the basis for a variety of extensive research efforts in characterizing and detecting CO2 using different geophysical approaches Appriou et al. (2020); Chen and Huang (2020); Zhou et al. (2019); Buscheck et al. (2019); Yang et al. (2019). Our interest is in early-leak detection, which requires imaging those small leaks. The imbalance of the dataset becomes a major challenge for our data-driven seismic inversion technique, since it will mislead our InversionNet model towards medium or large leaks. On the other hand, due to the limitations of the physical simulations, we are not able to generate more synthetic data for the small leaks. Hence, these practical obstacles place our problem in a low-data-regime scenario. Next, we describe our techniques to augment the Kimberlina dataset while preserving the physics information as much as we can, in order to improve the prediction accuracy of our InversionNet model.

3.4.4 Methodology

3.4.4.1 Physics of the Problem

Our data augmentation techniques will leverage existing physics knowledge of the problem. It is worth clarifying what specific physics information is referred to in this context for designing and training our neural networks.

1. Governing Equations. One of the most prominent pieces of physics knowledge in our problem is the set of governing equations used to generate the original physical simulations (as shown in Fig. 3.20). Those equations describe specific physical relationships between time and spatial derivatives explicitly using temporally dynamic formulas. In order to generate physically meaningful synthesized data, it is important to embed that physics information in the generative models.

2. Observable Perception. As described in Section 3.4.3, the data that we are interested in synthesizing are two-dimensional (2D), which means they follow a distribution that manifests itself in a certain visual perception. We expect our generative model to capture the underlying true data distribution, which in turn requires that the synthesized 2D data physically "look like" those in the training data.

3. Physics Phenomena. Any physical simulation should respect realistic physics phenomena. As one example, in our problem of interest, the super-critical CO2 will migrate over time, meaning that the spatial spread of CO2 should gradually increase over time. Designing our generative model so that it does not violate this phenomenon would potentially help to improve its performance.
We will consider all of the above during the development of our generative models. Another important point is that all of the above physics information is consistent throughout the entire temporal duration.

3.4.4.2 Data-driven Generative Models

To compensate for the imbalanced data shown in Fig. 3.21, we would like to generate more data in the small-leak region. Fortunately, the original Kimberlina dataset provides full-physics simulations in the medium- and large-leak regions. Those data are generated by the governing physics equations, which means that the physics knowledge is represented implicitly in the data. Our first two generative models are built on the autoencoder and the VAE to leverage those existing simulations while taking into account the temporal variation.

3.4.4.3 Autoencoder

Our first model is a "regression" model that provides interpolated data for the temporally missing points (as shown in Fig. 3.22(a)). The hypothesis behind this idea is that, given the consistency of the physics, we expect that once fully trained, our generative model will capture the intrinsic dynamics of the physics from the existing simulations, so that it will provide physically realistic predictions at any given time, particularly those at the early stage of the leakage.

Figure 3.23 Schematic illustration of (a) our new variational autoencoder with perception loss, and (b) the perception loss using the pre-trained VGG-19 network Simonyan and Zisserman (2014). (© 2022 IEEE)

Technically, our model is based on an autoencoder structure, which consists of a convolutional encoder F_θ, with a set of trainable parameters θ, and a convolutional decoder G_φ, with a set of trainable parameters φ. To incorporate the spatial information, we set two input channels of our encoder to the first and the last velocity maps from one simulation. To incorporate the temporal information, we further create a temporal matrix by replicating the single time value over all matrix entries. The temporal matrix is then used as one of the three input channels of the autoencoder, together with the other two. When training the autoencoder, all three input channels are convolved together, leading to the incorporation of both spatial and temporal information. Once fully trained, our encoder learns to reduce the dimensionality of this mixture of inputs to a latent variable, which is a high-level latent representation containing both spatial and temporal information. Our decoder estimates the target velocity map from the latent variable, which can be viewed as a nonlinear high-dimensional regression. The structure of our generative model is shown in Fig. 3.22(a), and mathematically our autoencoder can be represented as

    Encoder:  z = F_θ(x_{s,10}, x_{s,200}, t),
    Decoder:  x̂_{s,t} = G_φ(z),                        (3.28)

where x_{s,10}, x_{s,200}, and t are the inputs of the encoder. Here t is the time of the velocity map that needs to be predicted, provided as a temporal matrix created by replicating the single time value over all matrix entries; x_{s,10} and x_{s,200} are the first and the last velocity maps from the same simulation s, where 10 and 200 are the time indices of the data. The latent variable z is output by the encoder F, and the decoder G produces x̂_{s,t}, the estimated velocity map of simulation s at time t.
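To make the input construction in Eq. (3.28) concrete, the following is a minimal PyTorch-style sketch; the layer sizes, latent dimension, map size, and the normalization of t by 200 are illustrative assumptions, not the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class TemporalAutoencoder(nn.Module):
    """Minimal sketch of the encoder/decoder in Eq. (3.28).

    The encoder takes a 3-channel input: the first velocity map (t = 10),
    the last velocity map (t = 200), and a constant map holding the target
    time t. The decoder maps the latent code back to one velocity map.
    Layer sizes below are illustrative, not the architecture of the thesis.
    """
    def __init__(self, latent_dim: int = 64, map_size: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * (map_size // 4) ** 2, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * (map_size // 4) ** 2),
            nn.Unflatten(1, (64, map_size // 4, map_size // 4)),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, x_first, x_last, t):
        # Replicate the scalar time over a full map so it can be convolved
        # together with the two velocity maps (three input channels total).
        t_map = torch.full_like(x_first, float(t) / 200.0)  # normalization is an assumption
        z = self.encoder(torch.stack([x_first, x_last, t_map], dim=1))
        return self.decoder(z).squeeze(1)

# Usage sketch: predict the velocity map of a batch of simulations at year t = 60.
model = TemporalAutoencoder()
x10, x200 = torch.rand(4, 64, 64), torch.rand(4, 64, 64)
x60_hat = model(x10, x200, t=60)   # shape (4, 64, 64)
```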
We use the mean squared error (MSE) as our optimality criterion to compute the reconstruction loss between the ground truth and the generated velocity maps and to update the trainable parameters through backpropagation:

    L(θ, φ) = L_recon = (1 / |S||T|) Σ_{s∈S, t∈T} ( x_{s,t} − x̂_{s,t} )²
                      = (1 / |S||T|) Σ_{s∈S, t∈T} ( x_{s,t} − G_φ(F_θ(x_{s,10}, x_{s,200}, t)) )².        (3.29)

This model generates synthetic samples in the data domain. As discussed in Oring et al. (2020), generating samples in the latent space might increase the variability within the data distribution. We therefore study the VAE and its capability of generating samples.

3.4.4.4 Variational Autoencoder

The variational autoencoder (VAE) is a probabilistic generative model that creates a latent representation of the input data, which allows us to generate new samples with high diversity by manipulating the latent representations. Unlike the autoencoder model, which incorporates temporal information as part of the input (Fig. 3.22(a)), our VAE generative model produces new temporal interpolations separately in two steps. In the first step, we train the VAE by taking only velocity maps as input, without explicit temporal information. Once fully trained, the VAE is able to generate latent variables representing the velocity maps. In the second step, we apply a linear interpolation scheme on the normally distributed latent space to produce new latent variables for synthesizing new velocity samples.

The idea behind the VAE generative model is similar to that of the autoencoder. We expect the physics knowledge, i.e., the governing physics relationships, to be captured by training the VAE on the simulations. The consistency of the physics information is leveraged when generating new samples at different times. We provide an illustration of our VAE generative model in Fig. 3.22(b). The encoder, F, and decoder, G, structures of the VAE are similar to those of the autoencoder. However, the encoder of the VAE learns the posterior distribution q_θ(z|x), i.e., the distribution parameters of the latent variable z given the input x, and the decoder learns the conditional distribution p_φ(x|z), i.e., the distribution of the reconstructed data given the latent variable. There is a prior distribution p(z) over the latent space, which we set to a standard normal distribution. The output of the encoder q_θ(z|x) has two parts: the mean and the log-variance of the posterior distribution. One known problem with the VAE is that gradients cannot flow through the sampling step at the bottleneck of mean and log-variance, so we apply the reparameterization trick to allow gradients to flow through the bottleneck Kingma and Welling (2014),

    z = μ + σ ⊙ ε,        (3.30)

where z ∈ R^64 is the latent sample, μ ∈ R^64 and σ ∈ R^64 are the mean and standard deviation of the posterior distribution q_θ(z|x) (the standard deviation is recovered from the log-variance output by the encoder), and ε ∈ R^64 is a random variable sampled from a standard normal distribution, independent of μ and σ. We employ the standard VAE loss function,

    L(θ, φ) = L_recon + L_kld
            = −Σ_i E_{q_θ(z_i|x_i)} [ log ( p_φ(x_i, z_i) / q_θ(z_i|x_i) ) ]
            = −Σ_i E_{q_θ(z_i|x_i)} [ log p_φ(x_i|z_i) + log p(z_i) − log q_θ(z_i|x_i) ]
            = Σ_i ( x_i − x̂_i )² + Σ_i D_KL( q_θ(z_i|x_i) ‖ p(z_i) ),        (3.31)

where L_kld = Σ_i D_KL( q_θ(z_i|x_i) ‖ p(z_i) ) = Σ_i E_{q_θ(z_i|x_i)} [ log q_θ(z_i|x_i) − log p(z_i) ] measures the KL-divergence between the posterior distribution q_θ(z|x) and the prior distribution p(z), and L_recon = −Σ_i E_{q_θ(z_i|x_i)} [ log p_φ(x_i|z_i) ] = Σ_i ( x_i − x̂_i )² is the reconstruction loss between the ground truth velocity maps x_i and the generated velocity maps x̂_i (up to constants, under a Gaussian likelihood).
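As a minimal sketch of Eqs. (3.30)–(3.31), the reparameterization and the loss can be written as follows in PyTorch. The 64-dimensional latent size follows the text; the closed-form Gaussian KL term and the hypothetical encoder/decoder names in the usage comments are standard implementation choices assumed here, not the exact code of this work.

```python
import torch

def reparameterize(mu, logvar):
    """Eq. (3.30): z = mu + sigma * eps, with sigma recovered from the log-variance."""
    sigma = torch.exp(0.5 * logvar)
    eps = torch.randn_like(sigma)
    return mu + sigma * eps

def vae_loss(x, x_hat, mu, logvar):
    """Eq. (3.31): squared-error reconstruction plus KL(q(z|x) || N(0, I)).

    The KL term uses the closed form for a diagonal Gaussian posterior
    against a standard normal prior (an assumed, standard choice).
    """
    recon = torch.sum((x - x_hat) ** 2)
    kld = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

# Usage sketch with a hypothetical encoder/decoder pair producing a
# 64-dimensional latent code, as described in the text:
# mu, logvar = encoder(x)            # each of shape (batch, 64)
# z = reparameterize(mu, logvar)
# x_hat = decoder(z)
# loss = vae_loss(x, x_hat, mu, logvar)
```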
When generating new velocity maps to augment our dataset, directly drawing random samples from the prior distribution may lead to velocity maps associated with different leaks, whereas our interest is to obtain more small-leakage velocity maps. We therefore use an interpolation strategy. In particular, we obtain the latent variables of two velocity maps from the same simulation, namely z_1 and z_2, through the encoder F. Multiple new latent variables can be interpolated between z_1 and z_2 before passing them through the decoder G to generate additional velocity maps Berthelot et al. (2018). The procedure can be posed as

    x̂_α = G_φ( α z_1 + (1 − α) z_2 )
        = G_φ( α F_θ(x_{s,10}) + (1 − α) F_θ(x_{s,200}) ),        (3.32)

where α ∈ [0, 1] is the interpolation coefficient. Different from the autoencoder model, the temporal information is not used as an input in this VAE generative model.

It is worth mentioning that for either the autoencoder or the VAE shown in Fig. 3.22, we expect the governing physics to be learned by training the models on full-physics simulations. However, other physics knowledge (such as the observable perception or the physics phenomena) cannot be captured by these generative models. Hence, we come up with two different strategies to further constrain our generative models with additional physics knowledge.

Figure 3.24 Schematic illustration of (a) the spatio-temporal dynamics of velocity maps at four consecutive times, and (b) our new variational autoencoder with regularization. In (a), the CO2 plume is observed migrating towards a specific spatial direction over the monitoring period. (© 2022 IEEE)

3.4.4.5 Spatio-temporal Constrained Generative Models

3.4.4.6 Variational Autoencoder with Perception Loss

Perception of the generated velocity map is an important criterion for evaluating the quality of the synthesized image data. Both loss functions in Eqs. (3.29) and (3.31) employ the L2 norm to quantify the error. However, as pointed out by Zhang et al. (2018), classic per-pixel measures are insufficient for assessing structured data such as images. The perception of the generated velocity map and that of the true velocity map should clearly be consistent throughout the whole dataset. Inspired by recent work on style transfer Gatys et al. (2015), we can quantify perception using features extracted from the pre-trained VGG-19 classification network Simonyan and Zisserman (2014) and calculate the perception error between the true and the generated velocity maps. Reducing this perception error can thus make our generated velocity maps more physically realistic. Our new VAE generative model is shown in Fig. 3.23(a), where an additional loss term (the "Perception Loss") is added on top of the spatial loss and the KL divergence. In Fig. 3.23(b), we illustrate how we extract spatial features and use them to calculate the perception loss. In particular, we generate a representation using the Gram matrix, G^l ∈ R^{N_l × N_l}, on feature maps from several intermediate layers of the VGG-19 net:

    G^l_{ij} = Σ_k F^l_{ik} F^l_{jk},        (3.33)

where G^l_{ij} is the inner product between the vectorized feature maps i and j at layer l, and F^l_{ik} is the value at position k of the vectorized feature map of the i-th filter at layer l of the VGG-19 net. There are N_l feature maps at layer l of the VGG-19 net, each of size M_l, which is the height times the width of the feature map.
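A minimal sketch of the Gram-matrix computation in Eq. (3.33), together with the layer weighting that appears in Eq. (3.34) below, is given here for PyTorch feature tensors. The feature tensors are assumed to come from a frozen, pre-trained VGG-19; the feature-extraction code itself is omitted.

```python
import torch

def gram_matrix(feat):
    """Eq. (3.33): G^l_{ij} = sum_k F^l_{ik} F^l_{jk} for one VGG-19 layer.

    `feat` is a feature tensor of shape (batch, N_l, H, W) taken from an
    intermediate layer of a frozen, pre-trained VGG-19 (assumed here).
    """
    b, n_l, h, w = feat.shape
    f = feat.reshape(b, n_l, h * w)          # vectorize each feature map
    return torch.bmm(f, f.transpose(1, 2))   # (batch, N_l, N_l)

def perception_loss(feats_true, feats_gen):
    """Sum of squared Gram-matrix differences over the selected layers,
    weighted by 1 / (4 N_l^2 M_l^2) as in Eq. (3.34)."""
    loss = 0.0
    for ft, fg in zip(feats_true, feats_gen):
        _, n_l, h, w = ft.shape
        m_l = h * w
        coeff = 1.0 / (4.0 * n_l ** 2 * m_l ** 2)
        loss = loss + coeff * torch.sum((gram_matrix(ft) - gram_matrix(fg)) ** 2)
    return loss
```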
With the Gram matrix of the ground truth velocity map x (denoted G^l) and that of the generated velocity map x̂_α (denoted A^l) obtained at layer l, the perception loss function is

    L_phys = Σ_l λ_l Σ_{i,j} ( G^l_{ij} − A^l_{ij} )²,        (3.34)

where λ_l = 1 / (4 N_l² M_l²) is the coefficient of the perception loss at layer l. Hence, our new VAE loss function with the perception loss becomes

    L(θ, φ) = L_recon + L_kld + L_phys
            = Σ_i ( x_i − x̂_i )² + Σ_i D_KL( q_θ(z_i|x_i) ‖ p(z_i) ) + Σ_l λ_l Σ_{i,j} ( G^l_{ij} − A^l_{ij} )².        (3.35)

An important hyper-parameter that needs to be carefully tuned is the selection of the layers from the VGG-19 net used for calculating the perception loss. We provide more details later in the numerical tests.

3.4.4.7 Variational Autoencoder with Regularization

As we discussed previously, prominent phenomena also play a critical role in designing our generative model. For the CO2 storage problem, it is well understood that, starting from the injection well, CO2 enters the formation at high flow rates and migrates vigorously into the most permeable regions under strong pressure gradients while displacing native fluids (e.g., brine) Birkholzer et al. (2015). Our full-physics simulations, as shown in Fig. 3.20, accurately illustrate this process, which results in a prominent spatially and temporally varying pattern of the CO2 plume (shown in Fig. 3.24(a)). We expect our generative model to respect (or at least not violate) this particular dynamics. To enforce this constraint, our idea is to design a regularization term informed by the leakage process, with the hope of producing new samples that are consistent with the underlying spatio-temporal dynamics.

Figure 3.25 (a) Ground truth velocity maps, and generated velocity maps using (b) autoencoder, (c) VAE, (d) VAE with perception loss, and (e) VAE with regularization. (© 2022 IEEE)

Figure 3.26 Reconstruction loss on the test dataset using different models w.r.t. each year in the dataset. (© 2022 IEEE)

As shown in Fig. 3.24(a), we observe that the CO2 plume spreads in a particular spatial direction over time during the migration, which indicates a clear spatio-temporal dynamical pattern. We therefore impose regularization on the difference between velocity maps at two consecutive times and ensure that the dynamics of the ground truth are preserved as well as possible when generating new synthetics. To achieve this, we employ an L1-norm-based regularization given by

    L_reg = ‖ (x_{s,t1} − x_{s,t2}) − (x̂_{s,t1} − x̂_{s,t2}) ‖_1,        (3.36)

where x_{s,t1} and x_{s,t2} are ground truth velocity maps at two consecutive times, and x̂_{s,t1} and x̂_{s,t2} are the generated velocity maps at the same times. The times t1 and t2 are adjacent to each other, with t1 > t2. The reason we use the L1 norm instead of the L2 norm is that the difference between the temporal changes of two consecutive velocity maps can sometimes be very close to zero, which would make the L2-norm value too small and may lead to a vanishing-gradient issue.
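The regularizer in Eq. (3.36) is a single L1 penalty on temporal differences; a minimal PyTorch sketch is given below. The tensor shapes and the pairing of consecutive monitoring times are illustrative assumptions.

```python
import torch

def temporal_l1_reg(x_t1, x_t2, x_hat_t1, x_hat_t2):
    """Eq. (3.36): L1 penalty on the mismatch between the true and the
    generated velocity change across two consecutive times (t1 > t2)."""
    true_change = x_t1 - x_t2
    gen_change = x_hat_t1 - x_hat_t2
    return torch.sum(torch.abs(true_change - gen_change))

# Usage sketch: the inputs are velocity maps of shape (batch, H, W) for one
# simulation at two adjacent monitoring times; the result is added to the
# VAE loss with weight gamma, as in Eq. (3.37) below.
```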
In Fig. 3.24(b), we provide the network structure of our generative model with the new VAE loss function with regularization,

    L(θ, φ) = L_recon + L_kld + γ L_reg
            = Σ_i ( x_i − x̂_i )² + Σ_i D_KL( q_θ(z_i|x_i) ‖ p(z_i) ) + γ ‖ (x_{s,t1} − x_{s,t2}) − (x̂_{s,t1} − x̂_{s,t2}) ‖_1,        (3.37)

where the first and second terms are the reconstruction loss and the KL-divergence of the VAE, respectively, and the third term is the regularization with regularization parameter γ. The regularization parameter in Eq. (3.37) is important to the accuracy of the data generation; we explore its impact and how we select it in our numerical tests.

Another technical detail worth mentioning is the selection of the norms (i.e., L2 norm versus L1 norm) in our loss functions and regularization terms. The L2 loss (i.e., mean squared error (MSE)) and the L1 loss (i.e., mean absolute error (MAE)) are two of the most popular loss functions. Since the MAE is minimized by the conditional median, which may lead to bias during optimization, we choose the MSE, which is minimized by the conditional mean, for the reconstruction loss. However, we select an L1 regularization term to promote sparsity when constraining the differences between two (or three) adjacent velocity maps.

3.4.5 Experiments

With all four generative models designed in the previous section, we validate their performance in several scenarios, including a general assessment of the synthesized data (Test 1) and the performance in imaging small CO2 leakage (Test 2). We also provide numerical tests to illustrate what constitutes a reasonable augmented data size (Test 3) and how we pick some of the critical hyper-parameters (Test 4).

Figure 3.27 Clustering results of the generated versus true samples. PCA-based clustering for (a) autoencoder, (b) VAE, (c) VAE with perception loss, and (d) VAE with regularization; NMF-based clustering for (e) autoencoder, (f) VAE, (g) VAE with perception loss, and (h) VAE with regularization. (© 2022 IEEE)

3.4.5.1 Experiment Setup

We use 800 leakage scenarios in our dataset (16,000 samples in total) as the training dataset and the remaining simulations (3,763 samples in total) as the test dataset. For training our proposed data augmentation models, we use a batch size of 32 and train the models for 100 epochs using the ADAM optimizer with a learning rate of 0.0001. The model weights are initialized with He initialization He et al. (2015). For the training of InversionNet, we use a batch size of 24 and train InversionNet for 80 epochs using the ADAM optimizer with an initial learning rate of 0.01 and a weight decay coefficient of 0.0001.

Table 3.12 Computational costs of different generative models. Row 1 is the size of each model. Row 2 is the memory cost. Row 3 is the time per epoch in training each model. Row 4 is the total time in training each model. Row 5 is the time to generate a single sample. Row 6 is the time to generate a set of 3,000 velocity maps in parallel. (© 2022 IEEE)

                               Autoencoder    VAE          VAE_percep    VAE_reg
  Parameter #                  3,382,290      6,432,818    6,432,818     6,432,818
  GPU Memory Cost              10.3GB         14.6GB       14.6GB        14.6GB
  Time (Training/epoch)        48s            36s          258s          48s
  Time (Training/total)        80m            60m          430m          80m
  Time (Generation/sample)     1.36s          4.38s        4.38s         4.38s
  Time (Generation/set)        3.13m          3.41m        3.41m         3.41m
Table 3.13 Computational costs of training InversionNet without and with an augmented dataset containing 3,000 additional velocity maps. (© 2022 IEEE)

                             InversionNet without augment    InversionNet with augment
  Training Time (total)      1h23m                           1h31m

3.4.5.2 Test 1: Velocity Map Generation

We provide the synthesized velocity maps generated using our four different generative models in Fig. 3.25. For the VAE, we use the ground truth velocity maps in the first column as the input velocity maps. For the autoencoder, we use the two velocity maps at 10 years and 200 years as the input velocity maps. Overall, we observe that all four generative models produce reasonable results. The VAE (Fig. 3.25(c)) yields images with the highest variability among all four models, particularly when compared with the autoencoder results (Fig. 3.25(b)). However, due to this variability, some unrealistic features can also be observed in the VAE results; an example is the last row, where the whole leakage plume is unphysically split into two. The VAE with perception loss (Fig. 3.25(d)) produces better results in preserving the perception of the velocity map: the unphysical data in the later stage are improved, and the images at the early stage also match the true images better than the VAE results. However, we also notice some artifacts generated by the VAE with perception loss model. The best results among all four models are produced by the VAE with regularization (Fig. 3.25(e)), which yields not only cleaner images but also highly accurate images at the early stage.

To further compare the generative models quantitatively, we run our models on the test dataset and calculate the test loss w.r.t. each year of the data. The result is shown in Fig. 3.26. Consistent with what we observe in Fig. 3.25, the VAE with regularization (in red) yields the best performance among all four models. The autoencoder yields a lower reconstruction loss for velocity maps after 80 years compared to the vanilla VAE, VAE_percep, and VAE_reg models. However, this only implies that the autoencoder reconstructs velocity maps after 80 years more accurately; it does not mean that the autoencoder can generate more realistic and diversified velocity maps from the underlying distribution, which is essential for InversionNet to capture the underlying data distribution and improve its generalization ability.

Another means to assess the quality of our synthesized data is to visualize the distribution of the generated data versus that of the true data. To achieve this, we employ two commonly used methods: Principal Component Analysis (PCA) Smith (2002) and Non-Negative Matrix Factorization (NMF) Lee and Seung (1999). The visualization results are provided in Fig. 3.27. Regardless of the clustering approach, the VAE with regularization (Figs. 3.27(d) and (h)) produces the distribution matching most closely that of the ground truth. The VAE and the VAE with perception loss yield comparable results. The autoencoder-based model performs the worst of all four models.

Computational cost is also an important factor in evaluating the performance of a generative model. To that end, we provide in Table 3.12 more details of the cost by comparing the model complexity (number of parameters), memory consumption, training time per epoch, total training time, sample-generation time, and total generation time required. In particular, we compare all four models: autoencoder, VAE, VAE_percep, and VAE_reg.
We observe that the autoencoder yields the smallest number of network parameters among all four models, while the three VAE-based models yield comparable network complexity and memory requirements. As for the training time, VAE_percep is the most time-consuming, whereas the remaining three models are comparable in training cost. The excessive training time of the VAE_percep model is due to accessing multiple layers of VGG-19 to extract the spatial features needed to compute the Gram matrices and the perception loss. Once the models are fully trained, generating the augmented dataset (3,000 samples) only takes around 4 minutes, which is not very time-consuming. With new samples generated, we provide the training time comparison of InversionNet with and without the augmented dataset in Table 3.13; training InversionNet with the augmented dataset only adds 8 minutes of training time. To summarize, in this test we demonstrate the capability of our four generative models to synthesize high-quality velocity maps, which provide additional data to train data-driven seismic imaging methods. In particular, our VAE with regularization generates synthesized data with the most appealing visual quality and the closest match to the true data distribution.

3.4.5.3 Test 2: Performance on Edge Cases

The main purpose of this test is to evaluate the performance of our data augmentation techniques in improving the data-driven seismic imaging method for characterizing small leakage. We first generate 4 groups of synthesized data using our proposed generative models, with 3,000 velocity maps in each group. As a baseline, we train InversionNet Wu and Lin (2019) with an initial training dataset, which consists of 800 simulations, amounting to a total of around 15,000 samples. We further design two test categories: one with all sizes of leakage (named "General Leakage") and one with only tiny and small leakage (named "Small Leakage"). For the definitions of the different leakage categories (tiny, small, medium, and large), please refer to Eq. (3.27) and Fig. 3.21. The numbers of test samples of each leakage category in General Leakage and Small Leakage are provided in Table 3.14.

Table 3.14 Two different test sets for evaluating the performance of our generative models. For the definitions of the different leakage categories (tiny, small, medium, and large), please refer to Eq. (3.27) and Fig. 3.21. (© 2022 IEEE)

                     Tiny    Small    Medium    Large
  General leakage    717     770      675       1494
  Small leakage      153     38       0         0

Table 3.15 Test loss of InversionNet on the General Leakage and Small Leakage tests without augmentation (Col 2), and with augmentation data generated using the autoencoder (Col 3), VAE (Col 4), VAE with perception loss (Col 5), and VAE with regularization (Col 6). The results using VAE with regularization are the best compared to all others. (© 2022 IEEE)

                     Baseline    AE          VAE         VAE_percep    VAE_reg
  General leakage    0.001294    0.001229    0.001522    0.001331      0.001093
  Small leakage      0.000780    0.000813    0.001157    0.000924      0.000646

For all tests, we train InversionNet for 80 epochs to ensure its convergence. We report in Table 3.15 the test loss of InversionNet on both test categories with and without augmented training data sets. In particular, our VAE with regularization yields the smallest loss value for both "General Leakage" and "Small Leakage". On the other hand, the VAE model (Col 4) produces the worst results among all four methods.
We suspect that the high variability and some unphysical synthesized samples "confuse" InversionNet in learning the data distribution, leading to degraded performance. This is supported by the observation that, once additional constraints are imposed on the VAE model, an immediate performance improvement can be seen in the results using the VAE with perception loss (Col 5) and the VAE with regularization (Col 6).

Figure 3.28 Illustration of the error distribution (median, range, and outliers) in a box-plot on the small leakage test. (© 2022 IEEE)

To better understand the error distribution of the different generative models, we also provide in Fig. 3.28 a box-plot for the small leakage test. Of the three methods, VAE_reg clearly yields the best performance, with the smallest median value and interquartile range. Comparing the VAE_percep and AE models, although they both produce similar median values, VAE_percep is much less dispersed than AE. However, we notice more outliers in the VAE_percep box-plot than in the other two box-plots, which explains the degradation of the overall test loss of VAE_percep reported in Table 3.15.

To better visualize the performance in imaging small leakage, we provide the reconstructed imaging results of InversionNet in Figs. 3.29(b) to (d). The differences of the reconstructions from the ground truth (obtained by subtraction) are further provided in Figs. 3.30(a) to (c). The ground truth of the testing samples is shown in Fig. 3.29(a). To quantify the errors of the imaging results, we use two metrics: the mean absolute error (MAE) and the structural similarity index (SSIM) Wang et al. (2004). The leakage mass and errors of the imaging results are provided in Table 3.16. InversionNet trained without any augmentation data yields the worst imaging results: the CO2 plume is either very hard to visualize or severely distorted, which results in the highest MAE and the lowest SSIM values compared to the others. InversionNet trained on augmented data sets produces much-improved imaging results. In particular, the one using the VAE with regularization yields the best imaging results, with the smallest MAE and highest SSIM values.

Figure 3.29 Four groups of InversionNet imaging results (b, c, d) on small leakage test data. (a) Ground truth, InversionNet imaging results (b) without augmentation, and with augmented data sets generated using (c) VAE with perception loss and (d) VAE with regularization. (© 2022 IEEE)

Figure 3.30 Four groups of differences of InversionNet imaging results to the ground truth on small leakage test data. (a) Difference of InversionNet imaging results without augmentation (Fig. 3.29(b)) to the ground truth (Fig. 3.29(a)), (b) difference of results with the augmented data set generated using VAE with perception loss (Fig. 3.29(c)) to the ground truth, (c) difference of results with the augmented data set generated using VAE with regularization (Fig. 3.29(d)) to the ground truth. (© 2022 IEEE)

Figure 3.31 Illustration of the baseline velocity map. (© 2022 IEEE)

Besides visualization of the resulting images, quantifying the spatial resolution provides a different perspective on the quality of the results. Here, we employ the commonly used wavenumber analysis on our imaging results to help assess the resolution of the resulting images. Our focus is on the velocity perturbation induced by the CO2 leaks as shown in Fig. 3.29.
The perturbation can be obtained by subtracting the time-lapse images from the baseline image (shown in Fig. 3.31), where the baseline image refers to the one without any leaks. Once the velocity perturbation is obtained, we apply the spatial Fourier transform to obtain the wavenumber content (i.e., the Kz spectrum), and we provide the plots (in Fig. 3.32) for all four imaging results shown in Fig. 3.29. We observe that the Kz spectra of our results (in blue) are much closer to those of the ground truth (in red) than the baseline method (in black) for all imaging results. This indicates that our imaging method yields higher spatial resolution than the baseline method.

Table 3.16 Leakage mass (Col 2) of CO2 for the four groups of velocity maps shown in Fig. 3.29, and MAE and SSIM errors of the InversionNet imaging results using the baseline without augmentation (Col 4), VAE with perception loss (Col 5), and VAE with regularization (Col 6). (© 2022 IEEE)

  Group    Leakage Mass        Metric    Baseline    VAE_percep    VAE_reg
  1        4.08 × 10^6 kg      MAE       0.00277     0.00237       0.000889
                               SSIM      0.9907      0.9929        0.9936
  2        4.29 × 10^3 kg      MAE       0.00437     0.00346       0.00290
                               SSIM      0.9833      0.9842        0.9871
  3        8.89 × 10^5 kg      MAE       0.00102     0.000742      0.000766
                               SSIM      0.9985      0.9987        0.9986
  4        8.05 × 10^5 kg      MAE       0.00255     0.00138       0.00168
                               SSIM      0.9926      0.9928        0.9949

Figure 3.32 Resolution analysis of the four imaging results shown in Fig. 3.29. The Kz spectra of our results (in blue) are much closer to those of the ground truth (in red) than the baseline method (in black), indicating that our imaging method yields higher spatial resolution than the baseline method. (© 2022 IEEE)

In this test, we study the performance of InversionNet using augmented data sets generated by our models on various leakage scenarios. Due to the lack of consideration of the underlying physics knowledge, neither the autoencoder nor the vanilla VAE model helps to improve the overall performance of InversionNet. On the other hand, our proposed model, VAE_reg, is capable of generating physically realistic synthetic samples, which in turn further improves the imaging of all leakage cases; in particular, the imaging resolution on tiny leakage is significantly enhanced, with the CO2 plume much better resolved. It is worth mentioning that although the other proposed model, VAE_percep, is capable of generating synthetic samples comparable to those of VAE_reg, it may not lead to an improved overall imaging quality on this dataset. We still include this model and its results, since it may perform better on a different application and dataset. Through the numerical tests and comparisons, we conclude that the performance of InversionNet can be much improved in imaging all leakage cases, particularly for tiny leaks.

3.4.5.4 Test 3: Determination of Augmented Data Size

The size of the augmentation data is critical to the resulting performance. Without sufficient augmented data, the imaging results of InversionNet will be sub-optimal. In contrast, augmenting the original training set with too much data may lead to an "augmentation leak", meaning that the synthesized data dominate the training set and distort the true data distribution Zhao et al. (2020). In this test, we aim to find the best range for the amount of augmentation data to optimize the performance of InversionNet.
Figure 3.33 Mean and standard deviation of the test loss for different augmentation sizes: test loss of InversionNet using the augmented data set generated with (a) VAE with perception loss and (b) VAE with regularization. For all the tests, the same Small Leakage data set (see Table 3.14) is used as the test data. (© 2022 IEEE)

Through Tests 1 and 2, we learn that the VAE with perception loss and the VAE with regularization usually yield better results. Hence, we focus on the impact of varying the augmentation data size for these two models. We vary the size of the augmented data set among 350, 800, 1500, 3000, 4500, 6000, and 7500. For each case, we generate 5 different groups of augmentation data using our generative models. With all those groups of synthesized data available, we train InversionNet with the augmented data and test it on the same Small Leakage data as used in Test 2 (see Table 3.14). We report the corresponding loss values in terms of mean and standard deviation in Fig. 3.33. We observe a general "decrease first → bottom → increase later" pattern in both results in Fig. 3.33. This type of pattern has recently been discovered and analyzed in other data augmentation literature Zhao et al. (2020); Karras et al. (2020). It is mainly caused by an augmentation leak, which should be used as an indication for deciding on a reasonable augmentation data size. Interestingly, for both the VAE with perception loss and the VAE with regularization, the smallest test loss values are achieved when 3,000 synthetic velocity maps are generated. Hence, throughout all the tests, we use 3,000 as the size of the augmented data set.

3.4.5.5 Test 4: Hyper-parameter Selection

Hyper-parameters play an important role in our generative models. In this test, we study two critical hyper-parameters: the selection of the layers from the pre-trained VGG-19 network in Eq. (3.35) and the regularization parameter γ in Eq. (3.37).

Figure 3.34 Visualization of the hyper-parameters versus loss values. (a) Different combinations of layers selected from VGG-19 used in our VAE with perception loss. (b) Various values of the regularization parameter used in our VAE with regularization. (© 2022 IEEE)

To select the optimal VGG-19 layers for computing the perception loss, we note that there are 5 convolutional blocks in VGG-19, as shown in Fig. 3.23(b). We follow an idea similar to Gatys et al. (2015) to select effective layers. In particular, we choose from the following four combinations: (A) [conv1_1, conv2_1]; (B) [conv1_1, conv2_1, conv3_1]; (C) [conv1_1, conv2_1, conv3_1, conv4_1]; and (D) [conv1_1, conv2_1, conv3_1, conv4_1, conv5_1]. We train the VAE with perception loss using the different combinations of layers and compute the test loss of each case on the same test dataset (as shown in Fig. 3.34(a)). We observe that the VAE with perception loss reaches its smallest loss when using combination (D), which indicates that the first layer of each of the 5 convolutional blocks of VGG-19 should be used to compute the perception loss. Similarly, in order to select the optimal regularization parameter γ, we choose 7 values evenly distributed on a log scale from 10^-3 to 10^3 (i.e., 10^-3, 10^-2, 10^-1, 10^0, 10^1, 10^2, and 10^3). For each γ value, we train a VAE with regularization and test it on a common test dataset. The resulting test loss is shown in Fig. 3.34(b). We observe that the test loss is relatively stable when γ < 10^3, and it reaches its lowest value when γ = 10^2. Thus, γ = 10^2 is the optimal value for our problem.
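For reference, the γ sweep described above is a simple log-spaced grid search; the sketch below shows only the grid, and the training and evaluation calls are hypothetical placeholders rather than the actual code used in this work.

```python
import numpy as np

# Seven candidate values of the regularization parameter gamma,
# evenly spaced on a log scale from 1e-3 to 1e3, as in Test 4.
gammas = np.logspace(-3, 3, num=7)   # [1e-3, 1e-2, ..., 1e3]

# for gamma in gammas:
#     model = train_vae_with_regularization(gamma)   # hypothetical placeholder
#     print(gamma, evaluate_test_loss(model))        # hypothetical placeholder
```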
3.4.6 Summary

In this work, we have developed several spatio-temporal data augmentation strategies using convolutional neural networks to enhance data-driven seismic imaging, particularly in reconstructing "rare events." Our data augmentation techniques not only incorporate various kinds of physics information (such as governing equations, observable perception, and physics phenomena) but also integrate this information through a perceptual loss and additional regularization. We evaluated the performance of our generative models by imaging very small CO2 leaks from the subsurface with InversionNet. Through detailed comparison and analysis, we demonstrated that incorporating physics information is crucial for generating realistic and physically consistent synthetic data. This enhanced data quality significantly improves the representativeness of the training set. These advancements correspond to the regularization captured by the S term in Equation (1.3).

CHAPTER 4

CONCLUSION

This dissertation has explored the landscape of regularization techniques in deep learning, dividing the discussion into theory-driven and data-driven approaches as outlined in Chapters 2 and 3, respectively. The theory-driven part underscored the importance of understanding and applying theory such as the PAC-Bayes bounds discussed in Section 2.2. By minimizing the upper bound of the generalization error, this dissertation illustrated how theoretical insights into regularization can lead to more effective training algorithms and improved model performance across various tasks. Conversely, in the data-driven part, this dissertation introduced architectures such as MagNet and STGNN (Sections 3.1 and 3.2) to encode specific patterns directly from the data. Furthermore, the application of physics-informed frameworks in Section 3.3 and data augmentation strategies in Section 3.4 demonstrated how the integration of physical laws and generative models can significantly enhance the representational capability and physical consistency of training datasets. The contributions of this dissertation lie in the integration of practical machine learning architectures and generalization theory to address both applied and fundamental challenges in the field.

The methodologies and frameworks developed in this dissertation open several promising avenues for future research. Below, we outline potential areas for extending and improving the work introduced in various sections of this thesis:

1. PAC-Bayes Training (Zhang et al., 2023): This work has primarily utilized Gaussian priors and posteriors. Exploring alternative distributions could uncover unique advantages. The complexity added by managing additional parameters such as λ and σ in PAC-Bayes training necessitates more efficient parameterization of the prior and posterior. Additionally, the optimization challenges introduced by these parameters underscore the need for a comprehensive convergence analysis to ensure the robustness and efficacy of the training algorithms.

2. MagNet (Zhang et al., 2021b): While MagNet extends naturally to weighted, directed graphs where all edges are directed, its application to weighted mixed graphs (containing both directed and undirected edges) remains unexplored. Moreover, the current implementation lacks an attention mechanism and does not scale well to large graphs.
Future research could explore architectural enhancements to address these limitations, potentially incorporating scalable graph learning techniques and attention mechanisms to improve performance on large datasets.

3. STGNN (Zhang et al., 2022): Transferability remains a major concern in earthquake source characterization. Although STGNN does not rely on specific graph structures or seismic station distributions, its performance in regions that differ from the training area requires improvement. Addressing this challenge might involve leveraging recent advancements in foundation models Bommasani et al. (2021), which could provide new ways to enhance model robustness and adaptability across different geophysical environments.

4. UPFWI (Jin et al., 2021): The UPFWI model faces challenges with velocity maps in which adjacent layers have very close velocity values, which could be improved by adopting updated architectures such as Jagtap et al. (2020). Additionally, the computational demands of forward modeling, in terms of speed and memory due to the necessity of storing gradients for backpropagation, are substantial. Future iterations of this model could experiment with alternative loss functions, such as an adversarial loss, and explore computational strategies to balance resource use and accuracy. Expanding the application of CNN-PDE integration to other inverse problems, such as medical imaging and flow estimation, also holds significant potential.

5. Seismic Data Augmentation (Yang et al., 2022): Evaluating the quality of synthesized data remains a significant challenge. Unlike disciplines such as computer vision, where metrics like the Fréchet Inception Distance (FID) are commonly used, seismic data requires domain-specific metrics for effective evaluation. Additionally, the choice of domain for data augmentation (velocity or seismic) impacts the utility of the synthetic data. While augmentation in the velocity domain leverages prominent spatiotemporal dynamics, augmentation in the seismic domain aligns more closely with real-world scenarios, presenting a trade-off that warrants further exploration.

These directions not only build on the work presented in this dissertation but also promise to advance the state of the art in machine learning. By bridging theoretical insights with practical applications, this work sets the stage for the development of more robust, interpretable, and effective deep learning models and training algorithms.

BIBLIOGRAPHY

(2019). Basic research needs for scientific machine learning. Technical report, U.S. Department of Energy, Advanced Scientific Computing Research.

Adler, A., Araya-Polo, M., and Poggio, T. (2021). Deep learning for seismic inverse problems: toward the acceleration of geophysical analysis workflows. IEEE Signal Processing Magazine, 38(2):89–119.

Alquier, P. and Guedj, B. (2018). Simpler pac-bayesian bounds for hostile data. Machine Learning, 107(5):887–902.

Andriushchenko, M., Croce, F., Müller, M., Hein, M., and Flammarion, N. (2023). A modern look at the relationship between sharpness and generalization. arXiv preprint arXiv:2302.07011.

Antoniou, A., Storkey, A., and Edwards, H. (2017). Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340.

Appriou, D., Bonneville, A., Zhou, Q., and Gasperikova, E. (2020). Time-lapse gravity monitoring of CO2 migration based on numerical modeling of a faulted storage complex. International Journal Greenhouse Gas Control, 95:102956.
(2018). Deep-learning tomography. The Leading Edge, 37(1):58–66. Atwood, J. and Towsley, D. (2016). Diffusion-convolutional neural networks. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29, pages 1993–2001. Curran Associates, Inc. Audibert, J.-Y. and Catoni, O. (2011). Robust linear least squares regression. The Annals of Statistics, 39(5):2766 – 2794. Barrett, D. G. and Dherin, B. (2020). Implicit gradient regularization. arXiv preprint arXiv:2009.11162. Belkin, M. and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation, 15(6):1373–1396. Benson, A. R., Gleich, D. F., and Leskovec, J. (2016). Higher-order organization of complex networks. Science, 353(6295):163–166. Bergen, K. J., Johnson, P. A., Maarten, V., and Beroza, G. C. (2019). Machine learning for data-driven discovery in solid earth geoscience. Science, 363(6433). Berthelot, D., Raffel, C., Roy, A., and Goodfellow, I. (2018). Understanding and improving 117 interpolation in autoencoders via an adversarial regularizer. arXiv preprint arXiv:1807.07543. Beskardes, G. D., Hole, J. A., Wang, K., Michaelides, M., and Wu, Q. (2018). A comparison of earthquake back-projection imaging methods for dense local arrays. Geophysical Journal International, 212(3):1986–2002. Beyreuther, M., Barsch, R., Krischer, L., Megies, T., Behr, Y., and Wassermann, J. (2010). Obspy: A python toolbox for seismology. Seismological Research Letters, 81(3):530–533. Biggs, F. and Guedj, B. (2021). Differentiable pac–bayes objectives with partially aggregated neural networks. Entropy, 23(10):1280. Birkholzer, J., Oldenburg, C., and Zhou, Q. (2015). CO2 migration and pressure evolution in deep saline aquifers. International Journal of Greenhouse Gas Control, 40:203–220. Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer Science & Business Media. Bishop, C. M. (1995). Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1):108–116. Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). Weight uncertainty in neural network. In International conference on machine learning, pages 1613–1622. PMLR. Bojchevski, A. and Günnemann, S. (2017). Deep gaussian embedding of graphs: Unsupervised inductive learning via ranking. arXiv preprint arXiv:1707.03815. Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., Bernstein, M. S., Bohg, J., Bosselut, A., Brunskill, E., et al. (2021). On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258. Boonyasiriwat, C., Valasek, P., Routh, P., Cao, W., Schuster, G. T., and Macy, B. (2009). An efficient multiscale method for time-domain waveform tomography. Geophysics, 74(6):WCC59–WCC68. Bovet, A. and Grindrod, P. (2020). The activity of the far right on telegram v2.11. Bruna, J., Zaremba, W., Szlam, A., and LeCun, Y. (2014). Spectral networks and deep locally connected networks on graphs. In International Conference on Learning Representations (ICLR). Bunks, C., Saleck, F., Zaleski, S., and Chavent, G. (1995). Multiscale seismic waveform inversion. Geophysics, 60(5):1457–1473. Burstedde, C. and Ghattas, O. (2009). Algorithmic strategies for full waveform inversion: 1D experiments. Geophysics, 74(6):37–46. 118 Buscheck, T., Mansoor, K., Yang, X., Wainwright, H., and Carroll, S. (2019). 
Downhole pressure and chemical monitoring for CO2 and brine leak detection in aquifers above a CO2 storage reservoir. International Journal Greenhouse Gas Control, 91:102812. Casado, I., Ortega, L. A., Masegosa, A. R., and Pérez, A. (2024). Pac-bayes-chernoff bounds for unbounded losses. arXiv preprint arXiv:2401.01148. Cattaneo, M. D., Klusowski, J. M., and Shigida, B. (2023). On the implicit bias of adam. arXiv preprint arXiv:2309.00079. Chen, T. and Huang, L. (2020). Optimal design of microseismic monitoring network: Synthetic study for the Kimberlina CO2 storage demonstration site. International Journal Greenhouse Gas Control, 95:102981. Chung, F. (2005). Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9(1):1–19. Chung, F. and Kempton, M. (2013). A local clustering algorithm for connection graphs. In International Workshop on Algorithms and Models for the Web-Graph, pages 26–43. Springer. Chung, F. R. and Graham, F. C. (1997). Spectral graph theory. Number 92. American Mathematical Soc. Cloninger, A. (2017). A note on markov normalized magnetic eigenmaps. Applied and Computational Harmonic Analysis, 43(2):370 – 380. Cohen, J., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. (2020). Gradient descent on neural In International Conference on Learning networks typically occurs at the edge of stability. Representations. Cohen, J. M., Kaur, S., Li, Y., Kolter, J. Z., and Talwalkar, A. (2021). Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065. Coifman, R. R. and Lafon, S. (2006). Diffusion maps. Applied and computational harmonic analysis, 21(1):5–30. Collino, F. and Tsogka, C. (2001). Application of the perfectly matched absorbing layer model to the linear elastodynamic problem in anisotropic heterogeneous media. Geophysics, 66(1):294–307. Cucuringu, M., Li, H., Sun, H., and Zanetti, L. (2020). Hermitian matrices for clustering directed graphs: insights and applications. In International Conference on Artificial Intelligence and Statistics, pages 983–992. PMLR. Damian, A., Ma, T., and Lee, J. D. (2021). Label noise sgd provably prefers flat global minimizers. Advances in Neural Information Processing Systems, 34:27449–27461. 119 Defferrard, M., Bresson, X., and Vandergheynst, P. (2016). Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems 29, pages 3844–3852. Devlin, J., Chang, M., Lee, K., and Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations. Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., and Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 28, pages 2224–2232. Curran Associates, Inc. Dziugaite, G. K., Hsu, K., Gharbieh, W., Arpino, G., and Roy, D. (2021). On the role of data in pac-bayes bounds. In International Conference on Artificial Intelligence and Statistics, pages 604–612. PMLR. Dziugaite, G. K. and Roy, D. M. (2017). 
Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI). Dziugaite, G. K. and Roy, D. M. (2018). Data-dependent pac-bayes priors via differential privacy. Advances in neural information processing systems, 31. F. de Resende, B. M. and F. Costa, L. d. (2020). Characterization and comparison of large directed networks through the spectra of the magnetic laplacian. Chaos: An Interdisciplinary Journal of Nonlinear Science, 30(7):073141. Fanuel, M., Alaíz, C. M., Ángela Fernández, and Suykens, J. A. (2018). Magnetic eigenmaps for the visualization of directed networks. Applied and Computational Harmonic Analysis, 44:189–199. Fanuel, M., Alaiz, C. M., and Suykens, J. A. (2017). Magnetic eigenmaps for community detection in directed networks. Physical Review E, 95(2):022302. Feng, S., Lin, Y., and Wohlberg, B. (2020). Physically realistic training data construction for data-driven full-waveform inversion and traveltime tomography. In SEG Technical Program Expanded Abstracts, pages 3472–3476. Feng, S., Lin, Y., and Wohlberg, B. (2021). Multiscale data-driven seismic full-waveform inversion with field data study. IEEE Transactions on Geoscience and Remote Sensing, pages 1–14. Fey, M., Lenssen, J. E., Weichert, F., and Leskovec, J. (2021). GNNAutoScale: Scalable and 120 expressive graph neural networks via historical embeddings. In International Conference on Machine Learning (ICML). Foret, P., Kleiner, A., Mobahi, H., and Neyshabur, B. (2020). Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations. Furutani, S., Shibahara, T., Akiyama, M., Hato, K., and Aida, M. (2020). Graph signal processing for directed graphs based on the hermitian laplacian. In Machine Learning and Knowledge Discovery in Databases, pages 447–463. Gajewski, D., Anikiev, D., Kashtan, B., and Tessmer, E. (2007). Localization of seismic events by diffraction stacking. In SEG Technical Program Expanded Abstracts 2007, pages 1287–1291. Gasteiger, J., Bojchevski, A., and Günnemann, S. (2018). Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997. Gastpar, M., Nachum, I., Shafer, J., and Weinberger, T. (2023). Fantastic generalization measures are nowhere to be found. Gatys, L. A., Ecker, A. S., and Bethge, M. (2015). A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576. Geiping, J., Goldblum, M., Pope, P. E., Moeller, M., and Goldstein, T. (2021). Stochastic training is not necessary for generalization. arXiv preprint arXiv:2109.14119. Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. (2016). Pac-bayesian theory meets bayesian inference. Advances in Neural Information Processing Systems, 29. Ghosh, A., Lyu, H., Zhang, X., and Wang, R. (2022). Implicit regularization in heavy-ball momentum accelerated stochastic gradient descent. In The Eleventh International Conference on Learning Representations. Gomez, R., Yang, J., Lin, Y., Theiler, J., and Wohlberg, B. (2020). Physics-consistent data-driven waveform inversion with adaptive data augmentation. arXiv preprint (also accepted in IEEE Geoscience and Remote Sensing Letters). Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. 
In Advances in neural information processing systems, pages 2672–2680. Guitton, A. (2012). Blocky regularization schemes for full waveform inversion. Geophysical Prospecting, 60:870–884. Guo, K. and Mohar, B. (2017). Hermitian adjacency matrix of digraphs and mixed graphs. Journal of Graph Theory, 85(1):217–248. 121 Ha-Duong, M. and Keith, D. (2003). Carbon storage: the economic efficiency of storing CO2 in leaky reservoirs. Technological Choices for Sustainability, 5:181–189. Haddouche, M., Guedj, B., Rivasplata, O., and Shawe-Taylor, J. (2021). Pac-bayes unleashed: Generalisation bounds with unbounded losses. Entropy, 23(10):1330. Hamilton, W., Ying, Z., and Leskovec, J. (2017). Inductive representation learning on large graphs. Advances in neural information processing systems, 30. Hammond, D. K., Vandergheynst, P., and Gribonval, R. (2011). Wavelets on graphs via spectral graph theory. Applied and Computational Harmonic Analysis, 30(2):129–150. He, K., Zhang, X., Ren, S., and Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034. He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778. He, Q. and Wang, Y. (2021). Reparameterized full-waveform inversion using deep neural networks. Geophysics, 86(1):V1–V13. He, Y., Reinert, G., and Cucuringu, M. (2021). Digrac: Digraph clustering with flow imbalance. arXiv preprint arXiv:2106.05194. Herbrich, R. and Graepel, T. (2000). A pac-bayesian margin bound for linear classifiers: Why svms work. Advances in neural information processing systems, 13. Hernández-García, A. and König, P. (2018). Data augmentation instead of explicit regularization. arXiv preprint arXiv:1806.03852. Holland, M. (2019). Pac-bayes under potentially heavy tails. Advances in Neural Information Processing Systems, 32. Hsu, W.-N., Zhang, Y., and Glass, J. (2017). Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation. In 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pages 16–23. IEEE. Hu, W., Abubakar, A., and Habashy, T. (2009). Simultaneous multifrequency inversion of full- waveform seismic data. Geophysics, 74(2):1–14. Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708. 122 Hutton, K., Woessner, J., and Hauksson, E. (2010). Earthquake monitoring in southern california for seventy-seven years (1932–2008). Bulletin of the Seismological Society of America, 100(2):423– 446. Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by In International Conference on Machine Learning, pages reducing internal covariate shift. 448–456. PMLR. Jagtap, A. D., Kawaguchi, K., and Karniadakis, G. E. (2020). Adaptive activation functions accelerate convergence in deep and physics-informed neural networks. Journal of Computational Physics, 404:109136. Jiang, Y., Neyshabur, B., Mobahi, H., Krishnan, D., and Bengio, S. (2019). Fantastic generalization measures and where to find them. In International Conference on Learning Representations. Jin, P., Zhang, X., Chen, Y., Huang, S. X., Liu, Z., and Lin, Y. (2021). 
Unsupervised learning of full-waveform inversion: Connecting cnn and partial differential equation in a loop. arXiv preprint arXiv:2110.07584. Johnson, J., Alahi, A., and Fei-Fei, L. (2016). Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pages 694–711. Springer. Jordan, P. and Wagoner, J. (2017). Characterizing construction of existing wells to a CO2 storage target: The Kimberlina site, California. Technical report, U.S. Department of Energy - Office of Fossil Energy. Karniadakis, G., Kevrekidis, I., Lu, L., Perdikaris, P., Wang, S., and Yang, L. (2021). Physics- informed machine learning. Nature Reviews Physics, 3:422–440. Karras, T., Aittala, M., Hellsten, J., Laine, S., Lehtinen, J., and Aila, T. (2020). Training generative adversarial networks with limited data. arXiv preprint arXiv:2006.06676v2. Kaufman, S., Rosset, S., Perlich, C., and Stitelman, O. (2012). Leakage in data mining: Formulation, detection, and avoidance. ACM Transactions on Knowledge Discovery from Data (TKDD), 6(4):1–21. Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2016). On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836. Kingma, D. P. and Welling, M. (2014). Auto-encoding variational bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings. Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional 123 networks. arXiv preprint arXiv:1609.02907. Klicpera, J., Bojchevski, A., and Günnemann, S. (2019a). Predict then propagate: Graph neural networks meet personalized pagerank. In ICLR. Klicpera, J., Groß, J., and Günnemann, S. (2019b). Directional message passing for molecular graphs. In International Conference on Learning Representations. Kobak, D., Lomond, J., and Sanchez, B. (2020). The optimal ridge penalty for real-world high- dimensional data can be zero or negative due to the implicit ridge regularization. J. Mach. Learn. Res., 21:169–1. Kong, Q., Trugman, D. T., Ross, Z. E., Bianco, M. J., Meade, B. J., and Gerstoft, P. (2019). Machine learning in seismology: Turning data into insights. Seismological Research Letters, 90(1):3–14. Kriegerowski, M., Petersen, G. M., Vasyura-Bathke, H., and Ohrnberger, M. (2019). A deep convolutional neural network for localization of clustered earthquakes based on multistation full waveforms. Seismological Research Letters, 90:510 – 516. Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097– 1105. Kukačka, J., Golkov, V., and Cremers, D. (2017). Regularization for deep learning: A taxonomy. arXiv preprint arXiv:1710.10686. Kuzborskij, I. and Szepesvári, C. (2019). Efron-stein pac-bayesian inequalities. arXiv preprint arXiv:1909.01931. Lagaris, I. E., Likas, A., and Fotiadis, D. I. (1998). Artificial neural networks for solving ordinary and partial differential equations. IEEE transactions on neural networks, 9(5):987–1000. Lee, D. D. and Seung, H. S. (1999). Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791. Lee, S. and Jang, C. (2022). A new characterization of the edge of stability based on a sharpness In The Eleventh International Conference on measure aware of batch gradient distribution. Learning Representations. 
Letarte, G., Germain, P., Guedj, B., and Laviolette, F. (2019). Dichotomize and generalize: Pac-bayesian binary activated deep neural networks. Advances in Neural Information Processing Systems, 32. Levie, R., Huang, W., Bucci, L., Bronstein, M. M., and Kutyniok, G. (2019). Transferability of spectral graph convolutional neural networks. arXiv preprint arXiv:1907.12972. 124 Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., and Gur-Ari, G. (2020). The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218. Li, L., Tan, J., Schwarz, B., Stanek, F., Poiata, N., Shi, P., Diekmann, L., Eisner, L., and Gajewski, D. (2020a). Recent advances and challenges of waveform-based seismic location methods at multiple scales. Reviews of Geophysics, page e2019RG000667. Li, S., Liu, B., Ren, Y., Chen, Y., Yang, S., Wang, Y., and Jiang, P. (2020b). Deep-learning inversion of seismic data. IEEE Transactions on Geoscience and Remote Sensing, 58(3):2135–2149. Li, Z., Meier, M.-A., Hauksson, E., Zhan, Z., and Andrews, J. (2018). Machine learning seismic wave discrimination: Application to earthquake early warning. Geophysical Research Letters, 45(10):4773–4779. Li, Z. and van der Baan, M. (2016). Microseismic event localization by acoustic time reversal extrapolation. Geophysics, 81(3):KS123–KS134. Lieb, E. H. and Loss, M. (1993). Fluxes, Laplacians, and Kasteleyn’s theorem. In Statistical Mechanics, pages 457–483. Springer. Lin, Y. and Huang, L. (2015a). Acoustic- and elastic-waveform inversion using a modified Total-Variation regularization scheme. Geophysical Journal International, 200(1):489–502. Lin, Y. and Huang, L. (2015b). Quantifying subsurface geophysical properties changes using double-difference seismic-waveform inversion with a modified Total-Variation regularization scheme. Geophysical Journal International, 203(3):2125–2149. Lin, Y. and Huang, L. (2017). Building subsurface velocity models with sharp interfaces using interface-guided seismic full-waveform inversion. Pure and Applied Geophysics, 174(11):4035– 4055. Lin, Y., Syracuse, E. M., Maceira, M., Zhang, H., and Larmat, C. (2015). Double-difference traveltime tomography with edge-preserving regularization and a priori interfaces. Geophysical Journal International, 201(2):574–594. Liu, B., Yang, S., Ren, Y., Xu, X., Jiang, P., and Chen, Y. (2021). Deep-learning seismic full-waveform inversion for realistic structural models. Geophysics, 86(1):R31 – R44. Liu, X., Zou, Y., Kong, L., Diao, Z., Yan, J., Wang, J., Li, S., Jia, P., and You, J. (2018). Data augmentation via latent space interpolation for image classification. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 728–733. IEEE. Livni, R. and Moran, S. (2020). A limitation of the pac-bayes framework. Advances in Neural Information Processing Systems, 33:20543–20553. 125 Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101. Loshchilov, I. and Hutter, F. (2018). Decoupled weight decay regularization. In International Conference on Learning Representations. Lumley, D. (2001). Time-lapse seismic reservoir monitoring. Geophysics, 66:50–53. Luo, P., Wang, X., Shao, W., and Peng, Z. (2018). Towards understanding regularization in batch normalization. arXiv preprint arXiv:1809.00846. Luo, Y., Zhu, L., Wan, Z., and Lu, B. (2020). Data augmentation for enhancing EEG-based emotion recognition with deep generative models. Journal of Neural Engineering, 17(5):056021. 
Ma, Y., Hao, J., Yang, Y., Li, H., Jin, J., and Chen, G. (2019). Spectral-based graph convolutional network for directed graphs. arXiv:1907.08990. Marques, A. G., Segarra, S., and Mateos, G. (2020). Signal processing on directed graphs: The role of edge directionality when processing and learning from network data. IEEE Signal Processing Magazine, 37(6):99–116. Maurer, A. (2004). A note on the pac bayesian theorem. arXiv preprint cs/0411099. McAllester, D. A. (1998). Some pac-bayesian theorems. In Proceedings of the eleventh annual conference on Computational learning theory, pages 230–234. McAllester, D. A. (1999). Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170. McBrearty, I. W. and Beroza, G. C. (2022). Earthquake location and magnitude estimation with graph neural networks. arXiv preprint arXiv:2203.05144. Mernyei, P. and Cangea, C. (2020). Wiki-cs: A wikipedia-based benchmark for graph neural networks. arXiv preprint arXiv:2007.02901. Milletari, F., Navab, N., and Ahmadi, S.-A. (2016). V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee. Mohar, B. (2020). A new kind of hermitian matrices for digraphs. Linear Algebra and its Applications, 584:343–352. Monti, F., Otness, K., and Bronstein, M. M. (2018). Motifnet: A motif-based graph convolutional network for directed graphs. In 2018 IEEE Data Science Workshop, pages 225–228. 126 Moseley, B., Nissen-Meyer, T., and Markham, A. (2020). Deep learning for fast simulation of seismic waves in complex media. Solid Earth, 11(4):1527–1549. Mosser, L., Dubrule, O., and Blunt, M. J. (2020). Stochastic seismic waveform inversion using generative adversarial networks as a geological prior. Mathematical Geosciences, 52(1):53–79. Mousavi, S. M. and Beroza, G. C. (2020a). Bayesian-deep-learning estimation of earthquake location from single-station observations. IEEE Transactions on Geoscience and Remote Sensing, pages 1 – 14. Mousavi, S. M. and Beroza, G. C. (2020b). A machine-learning approach for earthquake magnitude estimation. Geophysical Research Letters, 47(1):e2019GL085976. Münchmeyer, J., Bindi, D., Leser, U., and Tilmann, F. (2020). The transformer earthquake alerting model: a new versatile approach to earthquake early warning. Geophysical Journal International, 225(1):646–656. Münchmeyer, J., Bindi, D., Leser, U., and Tilmann, F. (2021). Earthquake magnitude and location estimation from real time seismic waveforms with a transformer network. Geophysical Journal International, 226(2):1086–1104. Nagarajan, V. and Kolter, J. Z. (2019). Uniform convergence may be unable to explain generalization in deep learning. Advances in Neural Information Processing Systems, 32. Nair, V. and Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., and Sutskever, I. (2021). Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12):124003. Nanometrics Seismological Instruments (2013). Nanometrics research network. Neelakantan, A., Vilnis, L., Le, Q. V., Sutskever, I., Kaiser, L., Kurach, K., and Martens, J. (2015). Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807. 
Neyshabur, B., Tomioka, R., and Srebro, N. (2014). In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614. Nishizaki, H. (2017). Data augmentation and feature extraction using variational autoencoder for acoustic modeling. In 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), pages 1222–1227. IEEE. 127 Oring, A., Yakhini, Z., and Hel-Or, Y. (2020). Autoencoder image interpolation by shaping the latent space. arXiv preprint arXiv:2008.01487v2. Ortega, A., Frossard, P., Kovačević, J., Moura, J. M., and Vandergheynst, P. (2018). Graph signal processing: Overview, challenges, and applications. Proceedings of the IEEE, 106(5):808–828. Orvieto, A., Kersting, H., Proske, F., Bach, F., and Lucchi, A. (2022). Anticorrelated noise injection for improved generalization. In International Conference on Machine Learning, pages 17094–17116. PMLR. Ovcharenko, O., Kazei, V., Peter, D., and Alkhalifah, T. (2019). Style transfer for generation of realistically textured subsurface models. In SEG Technical Program Expanded Abstracts 2019, pages 2393–2397. Society of Exploration Geophysicists. Palmer, W. R. and Zheng, T. (2021). Spectral clustering for directed networks. Studies in Computational Intelligence, 943. Pei, H., Wei, B., Chang, K. C.-C., Lei, Y., and Yang, B. (2020). Geom-gcn: Geometric graph convolutional networks. arXiv preprint arXiv:2002.05287. Perez-Ortiz, M., Rivasplata, O., Guedj, B., Gleeson, M., Zhang, J., Shawe-Taylor, J., Bober, M., and Kittler, J. (2021). Learning pac-bayes priors for probabilistic neural networks. arXiv preprint arXiv:2109.10304. Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., and Szepesvári, C. (2021). Tighter risk certificates for neural networks. The Journal of Machine Learning Research, 22(1):10326–10365. Perol, T., Gharbi, M., and Denolle, M. (2018). Convolutional neural network for earthquake detection and location. Science Advances, 4:e1700578. Pesicek, J. D., Child, D., Artman, B., and Cieslik, K. (2014). Picking versus stacking in a modern microearthquake location: Comparison of results from a surface passive seismic monitoring array in Oklahoma. Geophysics, 79(6):KS61–KS68. Raissi, M., Perdikaris, P., and Karniadakis, G. E. (2019). Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707. Ramírez, A. and Lewis, W. (2010). Regularization and full-waveform inversion: A two-step approach. In 80th Annual International Meeting, SEG, Expanded Abstracts, pages 2773–2778. Ren, Y., Nie, L., Yang, S., Jiang, P., and Chen, Y. (2021). Building complex seismic velocity models for deep learning inversion. IEEE Access, 4(1):R31 – R44. Richardson, A. (2018). Generative adversarial networks for model order reduction in seismic 128 full-waveform inversion. arXiv preprint arXiv:1806.00828. Rivasplata, O., Kuzborskij, I., Szepesvári, C., and Shawe-Taylor, J. (2020). Pac-bayes analysis beyond the usual bounds. Advances in Neural Information Processing Systems, 33:16833–16845. Rivasplata, O., Tankasali, V. M., and Szepesvári, C. (2019). Pac-bayes with backprop. arXiv preprint arXiv:1908.07380. Rodríguez-Gálvez, B., Thobaben, R., and Skoglund, M. (2023). More pac-bayes bounds: From bounded losses, to losses with general tail behaviors, to anytime-validity. arXiv preprint arXiv:2306.12214. 
Rojas-Gómez, R., Yang, J., Lin, Y., Theiler, J., and Wohlberg, B. (2020). Physics-consistent data-driven waveform inversion with adaptive data augmentation. IEEE Geoscience and Remote Sensing Letters. Ross, Z. E., Yue, Y., Meier, M.-A., Hauksson, E., and Heaton, T. H. (2019). Phaselink: A deep learning approach to seismic phase association. Journal of Geophysical Research: Solid Earth, 124(1):856–869. Rozemberczki, B., Allen, C., and Sarkar, R. (2019). Multi-scale attributed node embedding. arXiv preprint arXiv:1909.13021. Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? Advances in neural information processing systems, 31. Shawe-Taylor, J. and Williamson, R. C. (1997). A pac analysis of a bayesian estimator. In Proceedings of the tenth annual conference on Computational learning theory, pages 2–9. Shen, H. and Shen, Y. (2021). Array-based convolutional neural networks for automatic detection and 4d localization of earthquakes in hawai ‘i. Seismological Society of America, 92(5):2961–2971. Shi, J. and Malik, J. (1997). Normalized cuts and image segmentation. In Proceedings of IEEE computer society conference on computer vision and pattern recognition, pages 731–737. IEEE. Shorten, C. and Khoshgoftaar, T. (2019). A survey on image data augmentation for deep learning. Journal of Big Data, 6(1):1–48. Shrivastava, A., Pfister, T., Tuzel, O., Susskind, J., Wang, W., and Webb, R. (2017). Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2107–2116. Sim, A., Wiatrak, M., Brayne, A., Creed, P., and Paliwal, S. (2021). Directed graph embeddings in pseudo-riemannian manifolds. In Meila, M. and Zhang, T., editors, Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, 129 volume 139 of Proceedings of Machine Learning Research, pages 9681–9690. PMLR. Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Sixt, L., Wild, B., and Landgraf, T. (2018). Rendergan: Generating realistic labeled data. Frontiers in Robotics and AI, 5:66. Smith, L. I. (2002). A tutorial on principal components analysis. Smith, S. L., Dherin, B., Barrett, D. G., and De, S. (2021). On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176. Spielman, D. A. and Teng, S.-H. (2004). Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 81–90. Sun, J., Innanen, K. A., and Huang, C. (2021). Physics-guided deep learning for seismic inversion with hybrid training and uncertainty analysis. Geophysics, 86(3):R303–R317. Sun, L., Gao, H., Pan, S., and Wang, J.-X. (2020). Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. Computer Methods in Applied Mechanics and Engineering, 361:112732. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. (2016). Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826. Thiemann, N., Igel, C., Wintenberger, O., and Seldin, Y. (2017). A strongly quasiconvex pac-bayesian bound. In International Conference on Algorithmic Learning Theory, pages 466–492. PMLR. 
Tiira, T. (1999). Detecting teleseismic events using artificial neural networks. Comput. Geosci., 25:929 – 938. Tolstikhin, I., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., Yung, J., Steiner, A. P., Keysers, D., Uszkoreit, J., et al. (2021). Mlp-mixer: An all-mlp architecture for vision. In Thirty-Fifth Conference on Neural Information Processing Systems. Tong, Z., Liang, Y., Sun, C., Li, X., Rosenblum, D., and Lim, A. (2020a). Digraph inception convolutional networks. In NeurIPS. Tong, Z., Liang, Y., Sun, C., Rosenblum, D. S., and Lim, A. (2020b). Directed graph convolutional network. arXiv:2004.13970. Treister, E. and Haber, E. (2016). Full waveform inversion guided by travel time tomography. SIAM 130 Journal on Scientific Computing, 39:S587–S609. United States Geological Survey and California Geological Survey (2022). Quaternary fault and fold database for the united state. van den Ende, M. P. and Ampuero, J.-P. (2020). Automated seismic source characterisation using deep graph neural networks. Geophysical Research Letters, page e2020GL088690. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., and Bengio, Y. (2018). Graph Attention Networks. International Conference on Learning Representations. Virieux, J. and Operto, S. (2009). An overview of full-waveform inversion in exploration geophysics. Geophysics, 74(6):WCC1–WCC26. Wang, J., Perez, L., et al. (2017). The effectiveness of data augmentation in image classification using deep learning. Convolutional Neural Networks Vis. Recognit, 11(2017):1–8. Wang, J. and Teng, T. (1995). Artificial neural network-based seismic detector. Bull. Seismol. Soc. Am., 85:308 – 319. Wang, R., Kashinath, K., Mustafa, M., Albert, A., and Yu, R. (2020). Towards physics-informed deep learning for turbulent flow prediction. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1457–1466. Wang, Y. (2015). Frequencies of the Ricker wavelet. Geophysics, 80(2):A31–A37. Wang, Y., Sun, Y., Liu, Z., Sarma, S. E., Bronstein, M. M., and Solomon, J. M. (2019). Dynamic graph cnn for learning on point clouds. Acm Transactions On Graphics (tog), 38(5):1–12. Wang, Z., Bovik, A. C., Sheikh, H. R., and Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612. Wei, C., Kakade, S., and Ma, T. (2020). The implicit and explicit regularization effects of dropout. In International conference on machine learning, pages 10181–10192. PMLR. Willard, J., Jia, X., Xu, S., Steinbach, M., and Kumar, V. (2020). Integrating physics-based modeling with machinelearning: A survey. arXiv preprint arXiv:2003.04919v4. Wu, X., Geng, Z., Shi, Y., Pham, N., Fomel, S., and Caumon, G. (2020a). Building realistic structure models to train convolutional neural networks for seismic structural interpretation. Geophysics, 85(4):WA27–WA39. Wu, Y. and Lin, Y. (2019). InversionNet: An efficient and accurate data-driven full waveform inversion. IEEE Transactions on Computational Imaging, 6(1):419–433. 131 Wu, Y. and McMechan, G. A. (2019). Parametric convolutional neural network-domain full-waveform inversion. Geophysics, 84(6):R881–R896. Wu, Z., Pan, S., Chen, F., Long, G., Zhang, C., and Yu, P. S. (2020b). A comprehensive survey on graph neural networks. IEEE Transactions on Neural Networks and Learning Systems, 32(1):4–24. Xi, Z., Li, J., Chen, M., and Wei, S. (2021). 
Pyfk: A fast mpi and cuda accelerated python package for calculating synthetic seismograms based on the frequencywavenumber method. In AGU Fall Meeting Abstracts, volume 2021, pages S15E–0288. Xu, K., Hu, W., Leskovec, J., and Jegelka, S. (2018). How powerful are graph neural networks? arXiv preprint arXiv:1810.00826. Yang, F. and Ma, J. (2019). Deep-learning inversion: A next-generation seismic velocity model building method. Geophysics, 84(4):R583–R599. Yang, X., Buscheck, T., Mansoor, K., Wang, Z., Gao, K., Huang, L., Wainwright, H., and Carroll, S. (2019). Assessment of geophysical monitoring methods for detection of brine and CO2 leakage in drinking water aquifers. International Journal Greenhouse Gas Control, 90:102803. Yang, Y., Zhang, X., Guan, Q., and Lin, Y. (2022). Making invisible visible: Data-driven seismic inversion with spatio-temporally constrained data augmentation. IEEE Transactions on Geoscience and Remote Sensing, 60:1–16, copyright © 2022 IEEE. Yano, K., Shiina, T., Kurata, S., Kato, A., Komaki, F., Sakai, S., and Hirata, N. (2021). Graph- partitioning based convolutional neural network for earthquake detection using a seismic array. Journal of Geophysical Research: Solid Earth, 126(5):e2020JB020269. Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2021a). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115. Zhang, H. and Thurber, C. H. (2003). Double-difference tomography: The method and its application to the Hayward Fault, California. Bulletin of the Seismological Society of America, 93(5):1875–1889. Zhang, R., Isola, P., Efros, A., Shechtman, E., and Wang, O. (2018). The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint arXiv:1801.03924v2. Zhang, X., Ghosh, A., Liu, G., and Wang, R. (2023). Unleashing the power of pac-bayes training for unbounded loss. Zhang, X., He, Y., Brugnone, N., Perlmutter, M., and Hirn, M. (2021b). Magnet: A neural network for directed graphs. Advances in neural information processing systems, 34:27003–27015. 132 Zhang, X., Reichard-Flynn, W., Zhang, M., Hirn, M., and Lin, Y. (2022). Spatiotemporal graph convolutional networks for earthquake source characterization. Journal of Geophysical Research: Solid Earth, 127(11):e2022JB024401. Zhang, X., Wang, Z., Liu, D., and Ling, Q. (2019). Dada: Deep adversarial data augmentation for extremely low data regime classification. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2807–2811. IEEE. Zhang, X., Zhang, J., Yuan, C., Liu, S., Chen, Z., and Li, W. (2020). Locating induced earthquakes with a network of seismic station in Oklahoma via a deep learning method. Scientific Report, 10. Zhang, X., Zhang, M., and Tian, X. (2021c). Real-time earthquake early warning with deep learning: Application to the 2016 m 6.0 central apennines, italy earthquake. Geophysical Research Letters, 48(5):2020GL089394. Zhang, Z. and Lin, Y. (2020). Data-driven seismic waveform inversion: A study on the robustness and generalization. IEEE Transactions on Geoscience and Remote sensing, 58(10):6900–6913. Zhang, Z., Rector, J. W., and Nava, M. J. (2017). Simultaneous inversion of multiple microseismic data for event locations and velocity model with bayesian inference. Geophysics, 82(3):KS27– KS39. Zhao, Z., Zhang, Z., Chen, T., Singh, S., and Zhang, H. (2020). Image augmentations for GAN training. arXiv preprint arXiv:2006.02595v1. Zhebel, O. and Eisner, L. 
(2015). Simultaneous microseismic event localization and source mechanism determination. Geophysics, 80(1):KS1–KS9. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., and Sun, M. (2018a). Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434. Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2018b). Non-vacuous gener- alization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862. Zhou, Z., Lin, Y., Zhang, Z., Wu, Y., Wang, Z., Dilmore, R., and Guthrie, G. (2019). A data-driven CO2 leakage detection using seismic data and spatial-temporal densely connected convolutional neural networks. International Journal of Greenhouse Gas Control, 90:102790. Zhu, W. and Beroza, G. C. (2019). Phasenet: a deep-neural-network-based seismic arrival-time picking method. Geophysical Journal International, 216(1):261–273. Zhu, W., Mousavi, S. M., and Beroza, G. C. (2019a). Seismic signal denoising and decomposition using deep neural networks. IEEE Transactions on Geoscience and Remote Sensing, 57(11):9476– 9488. 133 Zhu, W., Xu, K., Darve, E., Biondi, B., and Beroza, G. C. (2021). Integrating deep neural networks with full-waveform inversion: Reparametrization, regularization, and uncertainty quantification. Geophysics, 87(1):1–103. Zhu, Y., Zabaras, N., Koutsourelakis, P.-S., and Perdikaris, P. (2019b). Physics-constrained deep learning for high-dimensional surrogate modeling and uncertainty quantification without labeled data. Journal of Computational Physics, 394:56–81. 134 APPENDIX A UNLOCKING TUNING-FREE GENERALIZATION: MINIMIZING THE PAC-BAYES BOUND WITH TRAINABLE PRIORS A.1 Proofs A.1.1 Proofs of Theorem 2.2.6 Theorem A.1.1. Given a prior Pλ parametrized by λ ∈ Λ over the hypothesis set H . Fix λ ∈ Λ, 𝛿 ∈ (0, 1) and 𝛾 ∈ [𝛾1, 𝛾2]. For any choice of i.i.d 𝑚-sized training dataset S according to D, and all posterior distributions Q over H , we have Eh∼Qℓ(h; D) ≤ Eh∼Qℓ(h; S) + 1 𝛾𝑚 (log 1 𝛿 + KL(Q||Pλ)) + 𝛾𝐾 (λ) (A.1) holds with probability at least 1 − 𝛿 when ℓ(h, ·) satisfies Definition 2.2.5 with bound 𝐾 (λ). Proof. Firstly, in the bounded interval 𝛾 ∈ [𝛾1, 𝛾2], we bound the difference of the expected loss over the posterior distribution evaluated on the training dataset S and D with the KL divergence between the posterior distribution Q and prior distribution Pλ evaluated over a hypothesis space H . For 𝛾 ∈ [𝛾1, 𝛾2], ES∼D [exp (𝛾𝑚(Eh∼Qℓ(h; D) − Eh∼Qℓ(h; S)) − KL(Q||Pλ))] dQ dPλ =ES∼D [exp (𝛾𝑚(Eh∼Qℓ(h; D) − Eh∼Qℓ(h; S)) − Eh∼Q log (h))] ≤ES∼DEh∼Q [exp (𝛾𝑚(ℓ(h; D) − ℓ(h; S)) − log dQ dPλ (h))] =Eh∼Pλ ES∼D [exp(𝛾𝑚(ℓ(h; D) − ℓ(h; S)))], (A.2) (A.3) (A.4) where dQ/dP denotes the Radon-Nikodym derivative. In (A.2), we use KL(Q||P𝜆) = Eh∼Q log dQ dP𝜆 (h). From (A.2) to (A.3), Jensen’s inequality is used over the convex exponential function. Since this argument holds for any 𝑄, we have ES∼D [exp (𝛾𝑚(Eh∼Qℓ(h; D) − Eh∼Qℓ(h; S)) − KL(Q||Pλ))] ≤ sup Q∈Q Eh∼Pλ ES∼D [exp(𝛾𝑚(ℓ(h; D) − ℓ(h; S)))] (A.5) (A.6) 135 Let 𝑋 = ℓ(h; D) − ℓ(h; S), then 𝑋 is centered with E[𝑋] = 0. Then, by Definition 2.2.5, ∃𝐾 (λ), Eh∼P𝜆ES∼D [exp (𝛾𝑚𝑋)] ≤ exp (𝑚𝛾2𝐾 (λ)). (A.7) Using Markov’s inequality, (A.8) holds with probability at least 1 − 𝛿. exp (𝛾𝑚𝑋) ≤ exp (𝑚𝛾2𝐾 (λ)) 𝛿 . (A.8) Combining (A.5) and (A.8), the following inequality holds with probability at least 1 − 𝛿. 
exp (𝛾𝑚(Eh∼Qℓ(h; D) − Eh∼Qℓ(h; S)) − KL(Q||Pλ)) ≤ sup Q∈Q exp (𝑚𝛾2𝐾 (λ)) 𝛿 ⇒𝛾𝑚(Eh∼Qℓ(h; D) − Eh∼Qℓ(h; S)) − KL(Q||Pλ) ≤ log 1 𝛿 + 𝑚𝛾2𝐾 (λ), ∀Q ⇒Eh∼Qℓ(h; D) ≤ Eh∼Qℓ(h; S) + 1 𝛾𝑚 (log 1 𝛿 + KL(Q||Pλ)) + 𝛾𝐾 (λ), ∀Q. (A.9) The bound A.9 is exactly the statement of the Theorem. □ A.1.2 Proof of Theorem 2.2.10 Theorem A.1.2. Let 𝑛(𝜀) := N (Λ, ∥ · ∥, 𝜀) be the covering number of the set of the prior parameters. Under Assumption 2.2.8 and Assumption 2.2.9, the following inequality holds for the minimizer ( ˆh, ˆ𝛾, ˆσ, ˆλ) of upper bound in (A.1) with probability as least 1 − 𝜖: E h∼Q ˆσ ( ˆh) ℓ(h; D) ≤ E h∼Q ˆσ ( ˆh) ℓ(h; S) + (cid:34) log 𝑛(𝜀) + 𝜖 1 ˆ𝛾𝑚 𝛾2−𝛾1 𝜀 (cid:35) + KL(Q ˆσ ( ˆh)||P ˆλ) + ˆ𝛾𝐾 ( ˆλ) + 𝜂 = 𝐿𝑃 𝐴𝐶 ( ˆh, ˆ𝛾, ˆσ, ˆλ) + 𝜂 holds for any 𝜖, 𝜀 > 0, where 𝜂 = 𝐵𝜀 + 𝐶 (𝜂1(𝜀) + 𝜂2(𝜀)) + log(𝑛(𝜀)+ 𝛾1𝑚 𝐵 := supλ∈Λ (KL(Q ˆσ ( ˆh)||Pλ) + log 1 𝛿 ) + 𝐾 (λ). 1 𝑚𝛾2 1 (A.10) ) , with 𝐶 = 1 𝛾1𝑚 + 𝛾2, and 2 −𝛾 𝛾 1 𝜀 Proof: In this proof, we extend our PAC-Bayes bound with data-independent priors to data- dependent ones that accommodate the error when the prior distribution is parameterized and optimized over a finite set of parameters 𝔓 = {𝑃λ, λ ∈ Λ ⊆ R𝑘 } with a much smaller dimension 136 than the model itself. Let T(Λ, ∥ · ∥, 𝜀) be an 𝜀-cover of the set Λ, which states that for any λ ∈ Λ, there exists a ˜λ ∈ T(Λ, ∥ · ∥, 𝜀) , such that ||λ − ˜λ|| ≤ 𝜀. Now we select the posterior distribution as Qσ (h), parameterized by h and σ ∈ R𝑑, where h represents the mean of the posterior, and σ accounts for the variations in each model parameter from this mean. Assuming the prior P is parameterized by λ ∈ R𝑘 (𝑘 ≪ 𝑑). Then the PAC-Bayes bound A.1 holds already for any ( ˆh, 𝛾, ˆσ, λ), with fixed λ ∈ Λ and 𝛾 ∈ [𝛾1, 𝛾2], i.e., E ˜h∼Q ˆσ ( ˆh) ℓ( ˜h; D) ≤ E ˜h∼Q ˆσ ( ˆh) ℓ( ˜h; S) + 1 𝛾𝑚 (log 1 𝛿 + KL(Q ˆσ ( ˆh)||Pλ)) + 𝛾𝐾 (λ) (A.11) with probability over 1 − 𝛿. Now, for the collection of λs in the 𝜀-net T(Λ, ∥ · ∥, 𝜀), by the union bound, the PAC-Bayes bound uniformly holds on the 𝜀-net with probability at least 1 − |T|𝛿 = 1 − 𝑛(𝜀)𝛿. For an arbitrary λ ∈ Λ, its distance to the 𝜀-net is at most 𝜀. Then under Assumption 2.2.8 and Assumption 2.2.9, we have: and |KL(Q||Pλ) − KL(Q||P ˜λ)| ≤ 𝜂1(∥λ − ˜λ∥) ≤ 𝜂1(𝜀), min ˜λ∈T |𝐾 (λ) − 𝐾 ( ˜λ)| ≤ 𝜂2(∥λ − ˜λ∥) ≤ 𝜂2(𝜀). min ˜λ∈T Similarly, for 𝛾, a 𝜀-net on its range 𝛾1 ≤ 𝛾 ≤ 𝛾2 is the uniform grid with a grid separation 𝜀, so the net contains 𝛾2−𝛾1 𝜀 points. By the union bound, requiring the PAC-Bayes bound to uniformly hold for all the 𝛾 within this 𝜀-net induces an extra probability of failure of 𝛾2−𝛾1 𝜀 𝛿. So, the total probability of failure is 𝑛(𝜀)𝛿 + 𝛾2−𝛾1 𝜀 𝛿. For an arbitrary 𝛾 ∈ Γ, and Γ := {𝛾 ∈ [𝛾1, 𝛾2]}, its distance to the 𝜀-net T′ is at most 𝜀, we have: |𝐿𝑃 𝐴𝐶 ( ˆh, 𝛾, ˆσ, λ) − 𝐿𝑃 𝐴𝐶 ( ˆh, ˜𝛾, ˆσ, λ)| = min ˜𝛾∈T′ = ≤ − (cid:12) (cid:12) (cid:12) (cid:12) 1 ˜𝛾 1 𝛿 1 𝛾 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 1 𝛿 KL(Q ˆσ ( ˆh)||Pλ) + log + |𝛾 − ˜𝛾|𝐾 (λ) (KL(Q ˆσ ( ˆh)||Pλ) + log ) + 𝐾 (λ) (cid:32) 1 𝑚𝛾2 1 (KL(Q ˆσ ( ˆh)||Pλ) + log 1 𝛿 ) + 𝐾 (λ) 𝜀 |𝛾 − ˜𝛾| (cid:19) (cid:33) (cid:18) 1 𝑚 (cid:18) 1 𝑚𝛾 ˜𝛾 ≤ 𝐵𝜀, 137 where 𝐵 := supλ∈Λ 1 𝑚𝛾2 1 (KL(Q ˆσ ( ˆh)||Pλ) + log 1 𝛿 ) + 𝐾 (λ), clearly, 𝐵 is a constant depending on the range of the parameters. 
With the three inequalities above, we can control the PAC-Bayes loss at the given λ and 𝛾 as follows: min ˜λ∈T, ˜𝛾∈T′ |𝐿𝑃 𝐴𝐶 ( ˆh, 𝛾, ˆσ, λ) − 𝐿𝑃 𝐴𝐶 ( ˆh, ˜𝛾, ˆσ, ˜λ)| ≤ min ˜𝛾∈T′ |𝐿𝑃 𝐴𝐶 ( ˆh, 𝛾, ˆσ, λ) − 𝐿𝑃 𝐴𝐶 ( ˆh, ˜𝛾, ˆσ, λ)| + min ˜λ∈T |𝐿𝑃 𝐴𝐶 ( ˆh, ˜𝛾, ˆσ, λ) − 𝐿𝑃 𝐴𝐶 ( ˆh, ˜𝛾, ˆσ, ˜λ)| ≤ 𝐵𝜀 + ≤ 𝐵𝜀 + 1 ˜𝛾𝑚 1 𝛾1𝑚 𝜂1(𝜀) + ˜𝛾𝜂2(𝜀) 𝜂1(𝜀) + 𝛾2𝜂2(𝜀) ≤ 𝐵𝜀 + 𝐶 (𝜂1(𝜀) + 𝜂2(𝜀)) where 𝐶 = 1 𝛾1 + 𝛾2 and 𝛾1 ≤ 𝛾 ≤ 𝛾2. Since this inequality holds for any λ ∈ Λ and 𝛾 ∈ Γ, it certainly holds for the optima ˆλ and ˆ𝛾. Combining this with (A.11), we have E h∼Q ˆσ ( ˆh) ℓ(h; D) ≤ 𝐿𝑃 𝐴𝐶 ( ˆh, ˆ𝛾, ˆσ, ˆλ) + 𝐵𝜀 + 𝐶 (𝜂1(𝜀) + 𝜂2(𝜀)), where 𝐵 := supλ∈Λ 1 𝑚𝛾2 1 (KL(Q ˆσ ( ˆh)||Pλ) + log 1 𝛿 ) + 𝐾 (λ). Now taking 𝜖 := (𝑛(𝜀) + 𝛾2−𝛾1 𝜀 )𝛿 to be the previously calculated probability of failure, we get, with probability 1 − 𝜖, it holds that E h∼Q ˆσ ( ˆh) ℓ(h; D) ≤ E h∼Q ˆσ ( ˆh) ℓ(h; S) + (cid:34) log 𝑛(𝜀) + 𝜖 1 ˆ𝛾𝑚 𝛾2−𝛾1 𝜀 (cid:35) + KL(Q ˆσ ( ˆh)||P ˆλ) (A.12) + ˆ𝛾𝐾 ( ˆλ) + 𝐵𝜀 + 𝐶 (𝜂1(𝜀) + 𝜂2(𝜀)) ≤ 𝐿𝑃 𝐴𝐶 ( ˆh, ˆ𝛾, ˆσ, ˆλ) + 𝜂 and the proof is completed. (A.13) □ A.1.3 KL divergence of the Gaussian prior and posterior For a 𝑘-layer network, the prior is written as Pλ(θ0), where θ0 is the random initialized model parameterized by θ0 and λ ∈ R𝑘 + is the vector containing the variance for each layer. The set of 138 all such priors is denoted by 𝔓 := {Pλ(θ0), λ ∈ Λ ⊆ R𝑘 , θ0 ∈ Θ}. In the PAC-Bayes training, we select the posterior distribution to be centered around the trained model parameterized by θ, with independent anisotropic variance. Specifically, for a network with 𝑑 trainable parameters, the posterior is Qσ (θ) := N (θ, diag(σ)), where θ (the current model) is the mean and σ ∈ R𝑑 + is the vector containing the variance for each trainable parameter. The set of all posteriors is 𝔔 := {Qσ (θ), σ ∈ Σ, θ ∈ Θ}, and the KL divergence between all such prior and posterior in 𝔓 and 𝔔 is: KL(Qσ (θ)||Pλ(θ0)) = 1 2 (cid:34) 𝑘 ∑︁ 𝑖=1 −1⊤ 𝑑𝑖 log(σ𝑖) + 𝑑𝑖 (log(λ𝑖) − 1) + ∥σ𝑖 ∥1 + ∥(θ − θ0)𝑖 ∥2 2) λ𝑖 (cid:35) , (A.14) where σ𝑖, (θ − θ0)𝑖 are vectors denoting the variances and weights for the 𝑖-th layer, respectively, and 𝜆𝑖 is the scalar variance for the 𝑖-th layer. 𝑑𝑖 = dim(σ𝑖), and 1𝑑𝑖 denotes an all-ones vector of length 𝑑𝑖 1. Scalar prior is a special case of the layerwise prior by setting all entries of λ to be equal, for which the KL divergence reduces to KL(Qσ (θ)||P𝜆 (θ0)) = (cid:20) 1 2 −1⊤ 𝑑 log(σ) + 𝑑 (log(𝜆) − 1) + 1 𝜆 (∥σ∥1 + ∥θ − θ0∥2 2) (cid:21) . (A.15) A.1.4 Proof of Corollary 2.2.11 Recall for the training, we proposed to optimize over all four variables: θ, 𝛾, σ, and λ. 
( ˆθ, ˆ𝛾, ˆσ, ˆλ) = arg min θ,λ,σ, 𝛾∈[𝛾1,𝛾2] E ˜θ∼Qσ (θ) ℓ( 𝑓 ˜θ; S) + (log (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) (cid:124) 1 𝛾𝑚 1 𝛿 (cid:123)(cid:122) ≡𝐿 𝑃 𝐴𝐶 (θ,𝛾,σ,λ) + KL(Qσ (θ)||Pλ)) + 𝛾𝐾 (λ) (cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32)(cid:32) . (A.16) (cid:125) Corollary A.1.3. Assume all parameters for the prior and posterior are bounded, i.e., we restrict the model parameter θ, the posterior variance σ and the prior variance λ, and the exponential moment 𝐾 (λ) all to be searched over bounded sets, Θ := {θ ∈ R𝑑 : ∥θ∥2 ≤ 𝑑𝑇 }, Λ =: {λ ∈ [𝑒−𝑎, 𝑒𝑏] 𝑘 }, Γ := {𝛾 ∈ [𝛾1, 𝛾2]}, respectively, with fixed 𝑀, 𝑇, 𝑎, 𝑏 > 0. Then, 𝑑 𝑀 }, Σ := {σ ∈ R𝑑 + : ∥σ∥1 ≤ √ • Assumption 2.2.8 holds with 𝜂1(𝑥) = 𝐿1𝑥, where 𝐿1 = 1 2 max{𝑑, 𝑒𝑎 (2 √ 𝑑 𝑀 + 𝑑𝑇)} 1Note that with a little ambiguity, the λ𝑖 here has a different meaning from that in (A.20) and Algorithm A.1, here λ𝑖 means the 𝑖th element in λ, whereas in (A.20) and Algorithm A.1, λ𝑖 means the 𝑖th element in the discrete set. 139 • Assumption 2.2.9 holds with 𝜂2(𝑥) = 𝐿2𝑥, where 𝐿2 = 1 𝛾2 1 (cid:16) 2𝑑𝑀 2𝑒2𝑎 + 𝑑 (𝑎+𝑏) 2 (cid:17) • With high probability, the PAC-Bayes bound for the minimizer of (P) has the form E θ∼Q ˆσ ( ˆh) ℓ( 𝑓θ; D) ≤ 𝐿𝑃 𝐴𝐶 ( ˆθ, ˆ𝛾, ˆσ, ˆλ) + 𝜂, (cid:16) where 𝜂 = 𝑘 𝛾1𝑚 (KL(Q ˆσ ( ˆθ)||Pλ) + log 1 1 + log 2(𝐶 𝐿+𝐵)Δ𝛾1𝑚 supλ∈Λ 𝑘 1 𝑚𝛾2 1 𝛿 ) + 𝐾 (λ), and 𝐶 = 1 𝛾1𝑚 + 𝛾2. (cid:17) , 𝐿 = 𝐿1 + 𝐿2, Δ := max{𝑏 + 𝑎, 2(𝛾2 − 𝛾1)}, 𝐵 = Proof: We first prove the two assumptions are satisfied by the Gaussian family with bounded parameter spaces. To prove Assumption 2.2.8 is satisfied, let 𝑣𝑖 = log 1/𝜆𝑖, 𝑖 = 1, ..., 𝑘 and perform a change of variable from 𝜆𝑖 to 𝑣𝑖. The weight of prior for the 𝑖th layer now becomes N (0, 𝑒−𝑣𝑖 I𝑑𝑖 )), where 𝑑𝑖 is the number of trainable parameters in the 𝑖th layer. It is straightforward to compute 𝜕KL(Qσ || ˜Pv) 𝜕𝑣𝑖 = 1 2 [−𝑑𝑖 + 𝑒𝑣𝑖 (∥σ𝑖 ∥1 + ∥θ𝑖 − θ0,𝑖 ∥2 2)], where σ𝑖, θ𝑖, θ0,𝑖 are the blocks of σ, θ, θ0, containing the parameters associated with the 𝑖th layer, respectively. Now, given the assumptions on the boundedness of the parameters, we have: ∥∇vKL(Qσ || ˜Pv)∥2 ≤ ∥∇vKL(Qσ || ˜Pv) ∥1 ≤ max{𝑑, 𝑒𝑎 (2 √ 1 2 𝑑 𝑀 + 𝑑𝑇)} ≡ 𝐿1(𝑑, 𝑀, 𝑇, 𝑎), (A.17) where we used the assumption ∥σ∥1 ≤ 𝑑𝑇 and ∥θ0∥2, ∥θ∥2 ≤ √ 𝑑 𝑀. Equation A.17 says 𝐿1(𝑑, 𝑀, 𝑇, 𝑎) is a valid Lipschitz bound on the KL divergence and therefore Assumption 2.2.8 is satisfied by setting 𝜂1(𝑥) = 𝐿1(𝑑, 𝑀, 𝑇, 𝑎)𝑥. Next, we prove Assumption 2.2.9 is satisfied. 
We use 𝐾min(λ) defined in Definition 2.2.5 as the 140 𝐾 (λ) in the PAC-Bayes training, and verify that it makes Assumption 2.2.9 hold. |𝐾min(λ1) − 𝐾min(λ2)| log(Eθ∼Pλ1 E𝑧∼D [exp (𝛾ℓ( 𝑓θ; 𝑧))]) − sup 𝛾∈[𝛾1,𝛾2] 1 𝛾2 log(Eθ∼Pλ2 E𝑧∼D [exp (𝛾ℓ( 𝑓θ; 𝑧))]) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) E𝑧∼D [exp (𝛾ℓ( 𝑓θ; 𝑧))]) − log(Eθ∼Pλ2 E𝑧∼D [exp (𝛾ℓ( 𝑓θ; 𝑧))]) (cid:12) (cid:12) (cid:12) E𝑧∼D [exp (𝛾ℓ( 𝑓θ; 𝑧))] 𝑝λ1 (θ) 𝑝λ2 (θ) ) − log(Eθ∼Pλ2 E𝑧∼D [exp (𝛾ℓ( 𝑓θ; 𝑧))]) (cid:12) (cid:12) (cid:12) (cid:12) 1 𝛾2 1 𝛾2 1 𝛾2 1 𝛾2 log = (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) sup 𝛾∈[𝛾1,𝛾2] ≤ sup 𝛾∈[𝛾1,𝛾2] = sup 𝛾∈[𝛾1,𝛾2] ≤ sup (cid:12) (cid:12) (cid:12) (cid:12) 𝛾∈[𝛾1,𝛾2] 1 𝛾2 1 1 𝛾2 1 sup h∈H (cid:18) ≤ ≤ (cid:12) (cid:12)log(Eθ∼Pλ1 (cid:12) (cid:12) (cid:12) log(Eθ∼Pλ2 (cid:12) (cid:12) (cid:12) (cid:12) sup log (cid:12) (cid:12) θ∈Θ 𝑝λ1 (θ) 𝑝λ2 (θ) 𝑝λ1 (θ) 𝑝λ2 (θ) (cid:12) (cid:12) (cid:12) (cid:12) 𝑑 (𝑎 + 𝑏) 2 (cid:19) 2𝑑𝑀 2𝑒2𝑎 + (cid:12) (cid:12) (cid:12) (cid:12) ∥λ1 − λ2∥2, where the first inequality used the property of the supremum, the 𝑝𝜆1 (θ), 𝑝𝜆2 (θ) in the fourth line denote the probability density function of Gaussian with mean θ and variance parametrized by λ1, λ2 (i.e., λ1,𝑖, λ2,𝑖 are the variances for the 𝑖th layer), the second inequality use the fact that if 𝑋 (h) is a non-negative function of h and 𝑌 (h) is a bounded function of h, then |Eh(𝑋 (h)𝑌 (h))| ≤ ( sup h∈H |𝑌 (h)|) · Eh𝑋 (h). The last inequality used the formula of the Gaussian density 𝑝(𝑥; 𝜇, Σ) = 1 (2𝜋)𝑑/2|Σ|1/2 exp (cid:18) − 1 2 (𝑥 − 𝜇)𝑇 Σ−1(𝑥 − 𝜇) (cid:19) and the boundedness of the parameters. Therefore, Assumption 2.2.9 is satisfied by setting 𝜂2(𝑥) = 𝐿2(𝑑, 𝑀, 𝛾1, 𝑎)𝑥, where 𝐿2(𝑑, 𝑀, 𝛾1, 𝑎) = 1 𝛾2 1 (cid:16) 2𝑑𝑀 2𝑒2𝑎 + 𝑑 (𝑎+𝑏) 2 (cid:17) . Let 𝐿(𝑑, 𝑀, 𝑇, 𝛾1, 𝑎) = 𝐿1(𝑑, 𝑀, 𝑇, 𝑎) + 𝐿2(𝑑, 𝑀, 𝛾1, 𝑎). Then we can apply Theorem 2.2.10, to get with probability 1 − 𝜖, E θ∼Q ˆσ ( ˆθ) ℓ( 𝑓θ; D) ≤ E θ∼Q ˆσ ( ˆθ)ℓ( 𝑓θ; S) + (cid:34) log 𝑛(𝜀) + 𝜖 𝛾2−𝛾1 2𝜀 1 ˆ𝛾𝑚 + KL(Q ˆσ ( ˆθ)||Pλ) + ˆ𝛾𝐾min( ˆλ)+ (A.18) (cid:35) (𝐶 𝐿 (𝑑, 𝑀, 𝑇, 𝛾1, 𝑎)) + 𝐵)𝜀. 141 Here, we used 𝜂1(𝑥) = 𝐿1𝑥 and 𝜂2(𝑥) = 𝐿2𝑥. Note that for the set [−𝑏, 𝑎] 𝑘 , the covering number 𝑛(𝜀) = N ([−𝑏, 𝑎] 𝑘 , | · |, 𝜀) is (cid:17) 𝑘 (cid:16) 𝑏+𝑎 2𝜀 , and the covering number 𝛾2−𝛾1 2𝜀 for 𝛾 ∈ [𝛾1, 𝛾2]. We introduce a new variable 𝜌 > 0, letting 𝜀 = 𝜌 2(𝐶 𝐿 (𝑑,𝑀,𝑇,𝛾1,𝑎)+𝐵) and inserting it into equation (A.18), we obtain with probability 1 − 𝜖: E θ∼Q ˆσ ( ˆθ) ℓ( 𝑓θ; D) ≤ E θ∼Q ˆ𝜎 ( ˆθ) ℓ( 𝑓θ; S) + 1 ˆ𝛾𝑚 (cid:20) log 1 𝜖 + KL(Q ˆσ ( ˆθ)||Pλ) (cid:21) + ˆ𝛾𝐾min( ˆλ) + 𝜌 + 𝑘 𝛾1𝑚 log 2(𝐶 𝐿 (𝑑, 𝑀, 𝑇, 𝛾1, 𝑎) + 𝐵)Δ 𝜌 . where Δ := max{𝑏 + 𝑎, 2(𝛾2 − 𝛾1)}. Optimizing over 𝜌, we obtain: E θ∼Q ˆσ ( ˆθ) ℓ( 𝑓θ; D) ≤ E θ∼Q ˆσ ( ˆθ) ℓ( 𝑓θ; S) + 1 ˆ𝛾𝑚 (cid:20) log 1 𝜖 + KL(Q ˆσ ( ˆθ)||Pλ) (cid:21) + ˆ𝛾𝐾min( ˆλ) + (cid:18) 𝑘 𝛾1𝑚 1 + log 2(𝐶 𝐿 (𝑑, 𝑀, 𝑇, 𝛾1, 𝑎) + 𝐵)Δ𝛾1𝑚 𝑘 (cid:19) (cid:18) 𝑘 𝛾1𝑚 1 + log 2(𝐶 𝐿(𝑑, 𝑀, 𝑇, 𝛾1, 𝑎) + 𝐵)Δ𝛾1𝑚 𝑘 (cid:19) . = 𝐿𝑃 𝐴𝐶 ( ˆθ, ˆ𝛾, ˆσ, ˆλ) + Hence we have E θ∼Q ˆσ ( ˆθ) ℓ( 𝑓θ; D) ≤ 𝐿𝑃 𝐴𝐶 ( ˆθ, ˆ𝛾, ˆσ, ˆλ) + 𝜂, where 𝜂 = max( 1 𝛾1𝑚 (cid:18) 𝑘 𝛾1𝑚 1 + log (1 + log(2(𝐶 𝐿 (𝑑, 𝑀, 𝑇, 𝛾1, 𝑎) + 𝐵) (𝛾2 − 𝛾1)𝛾1𝑚)), 2(𝐶 𝐿 (𝑑, 𝑀, 𝑇, 𝛾1, 𝑎) + 𝐵)Δ𝛾1𝑚 𝑘 (cid:19) ). □ Remark A.1.4. In defining the boundedness of the domain Θ of θ in Corollary 2.2.11, we used √ √ 𝑑 𝑀 as the bound. Here, the factor 𝑑 (where 𝑑 denotes the dimension of h) is used to encapsulate the idea that if on average, the components of the weight are bounded by 𝑀, then the ℓ2 norm would naturally be bounded by √ 𝑑 𝑀. The same idea applies to the definition of Σ. 142 Remark A.1.5. 
Due to the above remark, 𝑀, 𝑇, 𝑎, 𝑏 can be treated as dimension-independent constants that do not grow with the network size 𝑑. As a result, the constants 𝐿1, 𝐿2, 𝐿 in Corollary 2.2.11 are dominated by 𝑑, i.e., 𝐿1, 𝐿2, 𝐿 = 𝑂(𝑑). This implies that the logarithmic term in 𝜂 scales as 𝑂(log 𝑑), which grows very mildly with the network size. Therefore, Corollary 2.2.11 can be used as a generalization guarantee for large neural networks.

A.2 Algorithm Details

A.2.1 Algorithms to estimate 𝐾(λ)
In this section, we explain the algorithm to compute 𝐾(𝜆). In previous literature, the moment bound 𝐾 or its analog in PAC-Bayes bounds was often assumed to be a constant. One of our contributions is to allow 𝐾 to vary with the variance 𝜆 of the prior, so that if a small prior variance is found by PAC-Bayes training, the corresponding 𝐾 is also small.

We perform linear interpolation to approximate the function 𝐾min(𝜆) defined in (2) of the main text. When 𝜆 is 1D, we first compute 𝐾min(𝜆) on a finite grid of the domain of 𝜆 by solving (A.19) below. With the computed function values on the grid {𝐾min(𝜆𝑖)}𝑖, we can construct a piecewise linear function as the approximation of 𝐾min(𝜆).

𝐾min(𝜆𝑖) = argmin_{𝐾>0} 𝐾
s.t.  exp(𝛾²𝐾) ≥ (1/(𝑛𝑚)) ∑_{𝑙=1}^{𝑛} ∑_{𝑗=1}^{𝑚} exp(𝛾(ℓ(𝑓_{θ𝑙}; S) − ℓ(𝑓_{θ𝑙}; 𝑧𝑗))),    (A.19)
∀𝛾 ∈ [𝛾1, 𝛾2],  θ𝑙 ∼ N(θ0, 𝜆𝑖),  𝜆min ≤ 𝜆𝑖 ≤ 𝜆max,

where θ𝑙 ∼ P_{𝜆𝑖}(θ0), 𝑙 = 1, ..., 𝑛, are samples from the prior distribution and are fixed when solving (A.19) for 𝐾min(𝜆𝑖). (A.19) is the discrete version of formula (2) in the main text. This optimization problem is one-dimensional, and the function in the constraint is monotonic in 𝐾, so it can be solved efficiently by the bisection method.

When extending this procedure to high dimensions, where λ is a 𝑘-dimensional vector, we would need to set up a grid for the domain of λ in 𝑘-dimensional space and estimate 𝐾min at each grid point, which is time-consuming when 𝑘 is large. To address this issue, we propose the following approximation:

K̂(max(λ𝑖)) = argmin_{𝐾>0} 𝐾
s.t.  exp(𝛾²𝐾) ≥ (1/(𝑛𝑚)) ∑_{𝑙=1}^{𝑛} ∑_{𝑗=1}^{𝑚} exp(𝛾(ℓ(𝑓_{θ𝑙}; S) − ℓ(𝑓_{θ𝑙}; 𝑧𝑗))),    (A.20)
∀𝛾 ∈ [𝛾1, 𝛾2],  θ𝑙 ∼ N(θ0, max(λ𝑖)),  𝜆min ≤ max(λ𝑖) ≤ 𝜆max,  𝑖 = 1, ..., 𝑠,

where λ𝑖 is a random sample from the domain Λ of λ. Since each λ𝑖 is 𝑘-dimensional, max(λ𝑖) denotes the maximum of its 𝑘 coordinates. The idea behind formulation (A.20) is as follows: we use the 1D function K̂(max(λ)) as a surrogate for the original 𝑘-dimensional function 𝐾min(λ) (i.e., 𝐾min(λ) ≤ K̂(max(λ))). Estimating this 1D surrogate function is easy using the bisection method. This procedure certainly overestimates the true 𝐾min(λ), but since the surrogate is also a valid exponential moment bound, it is safe to use as a replacement for 𝐾(λ) in our PAC-Bayes bound during training. In practice, we also tried replacing max(λ𝑖) with mean(λ𝑖) to mitigate the overestimation, but the final performance stays the same. The details of the whole procedure are presented in Algorithm A.1.

Algorithm A.1 Compute 𝐾(λ) given a set of query priors
Input: 𝛾1 and 𝛾2, sampling time 𝑠 of prior variances, model sampling time 𝑛 = 10, the initial neural network weight θ0, the training dataset S = {𝑧𝑖}_{𝑖=1}^{𝑚}
Output: the piecewise linear interpolation K̃(λ) for 𝐾min(λ)
  Draw 𝑠 random samples for the prior variances V = {λ𝑖 ∈ Λ ⊆ R^𝑘, 𝑖 = 1, ..., 𝑠}
  Set up a discrete grid Γ for the interval [𝛾1, 𝛾2] of 𝛾.
Algorithm A.1 Compute K(λ) given a set of query priors
Input: γ1 and γ2, sampling time s of prior variances, model sampling time n = 10, the initial neural network weight θ0, the training dataset S = {z_i}, i = 1, ..., m
Output: the piecewise linear interpolation ˜K(λ) for K_min(λ)
Draw s random samples for the prior variances V = {λ_i ∈ Λ ⊆ R^k, i = 1, ..., s}
Set up a discrete grid Γ for the interval [γ1, γ2] of γ
for λ_i ∈ V do
    for l = 1 : n do
        Sample weights from the Gaussian distribution θ_l ∼ N(θ0, λ_i)
        Use θ_l, Γ and S to compute one term of the sum in (A.20)
    end for
    Solve ˆK(max(λ_i)) using (A.20)
end for
Fit a piecewise linear function ˜K(λ) to the data {(λ_i, ˆK(max(λ_i)))}, i = 1, ..., s

A.2.2 PAC-Bayes Training with layerwise prior
Similar to Algorithm 2.1, our PAC-Bayes training with a layerwise prior is stated here in Algorithm A.2.

Algorithm A.2 PAC-Bayes training (layerwise prior)
Input: initial weight θ0 ∈ R^d, the number of layers k, T1, λ1 = e^{−12}, λ2 = e^{2}, γ1 = 0.5, γ2 = 10  // T1, λ1, λ2, γ1, γ2 can be fixed in all experiments of Sec. 2.2.8
Output: trained model ˆθ, posterior noise level ˆσ
// Initialization
θ ← θ0, v ← 1_d · log((1/d) Σ_{i=1}^{d} |θ_{0,i}|), b ← 1_k · log((1/d) Σ_{i=1}^{d} |θ_{0,i}|)
Obtain the estimated ˜K(λ̄) with Λ = [λ1, λ2]^k using (A.20) and Appendix A.2.1
// Stage 1
for epoch = 1 : T1 do
    for each sampled batch s from S do
        λ ← exp(b), σ ← exp(v)  // Ensure non-negative variances
        Construct the covariance of P_λ from λ  // Setting the variance of the weights in layer i all to the scalar λ(i)
        Draw one ˜θ ∼ Q_σ(θ) and evaluate ℓ(f_˜θ; S)  // Stochastic version of E_{˜θ∼Q_σ(θ)} ℓ(f_˜θ; S)
        Compute the KL-divergence as in (A.14)
        Compute γ as in (2.10)
        Compute the loss function L as L_PAC in (P)
        θ ← θ − η ∂L/∂θ, b ← b − η ∂L/∂b, v ← v − η ∂L/∂v  // Update all parameters
    end for
end for
ˆσ ← exp(v)  // Fix the noise level from now on
// Stage 2
while not converged do
    for each sampled batch s from S do
        Draw one sample ˜θ ∼ Q_ˆσ(θ) and evaluate ℓ(f_˜θ; S) as ˜L  // Noise injection
        θ ← θ − η ∂˜L/∂θ  // Update model parameters
    end for
end while
ˆθ ← θ

A.2.3 Regularizations in PAC-Bayes bound
Only noise injection and weight decay are essential terms arising from our derived PAC-Bayes bound. Since many factors in normal training, such as mini-batching and dropout, enhance generalization by some form of noise injection, it is unsurprising that they can be substituted by the well-calibrated noise injection in PAC-Bayes training. Like the most commonly used implicit regularizations (large learning rate, momentum, small batch size), dropout and batch normalization are also known to penalize the sharpness of the loss function indirectly. Wei et al. (2020) show that dropout introduces an explicit regularization that penalizes sharpness and an implicit regularization that is analogous to the effect of stochasticity in small mini-batch stochastic gradient descent. Similarly, it is well studied that batch normalization (Luo et al., 2018) allows the use of a large learning rate by reducing the variance of the layer batches, and large allowable learning rates regularize sharpness through the edge of stability (Cohen et al., 2020). As shown in the equation below, the first term (noise injection) in our PAC-Bayes bound explicitly penalizes the trace of the Hessian of the loss, which directly relates to sharpness and is quite similar to the regularization effect of batch normalization and dropout. During training, suppose the current posterior is $Q_{\hat\sigma}(\hat\theta) = \mathcal{N}(\hat\theta, \mathrm{diag}(\hat\sigma))$; then the expectation of the training loss over the posterior is:
\[ \mathbb{E}_{\theta\sim Q_{\hat\sigma}(\hat\theta)}\ell(f_\theta;\mathcal{D}) = \mathbb{E}_{\Delta\theta\sim Q_{\hat\sigma}(0)}\ell(f_{\hat\theta+\Delta\theta};\mathcal{D}) \approx \ell(f_{\hat\theta};\mathcal{D}) + \mathbb{E}_{\Delta\theta\sim Q_{\hat\sigma}(0)}\Big(\nabla\ell(f_{\hat\theta};\mathcal{D})^{\top}\Delta\theta + \frac{1}{2}\Delta\theta^{\top}\nabla^2\ell(f_{\hat\theta};\mathcal{D})\Delta\theta\Big) = \ell(f_{\hat\theta};\mathcal{D}) + \frac{1}{2}\mathrm{Tr}\big(\mathrm{diag}(\hat\sigma)\nabla^2\ell(f_{\hat\theta};\mathcal{D})\big). \]
The second regularization term (weight decay) in the bound additionally ensures that the minimizer found is close to the initialization. Although the relation of this regularizer to sharpness is not very clear, empirical results suggest that weight decay may have a regularization effect separate from sharpness. In brief, the sharpness regularization provided by dropout and batch normalization can also be well emulated by noise injection, with weight decay supplying an additional effect.
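The Taylor-expansion argument above can be checked numerically on a toy quadratic loss, for which the expected loss increase under Gaussian weight noise equals exactly ½ Tr(diag(σ)∇²ℓ). The following is a small self-contained sketch (an illustration only, not code from the dissertation):

import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.standard_normal((d, d))
H = A @ A.T                           # Hessian of the toy quadratic loss (positive semidefinite)
theta = rng.standard_normal(d)        # "trained" weights
sigma = rng.uniform(0.01, 0.1, d)     # diagonal posterior variances

def loss(w):
    return 0.5 * w @ H @ w            # toy loss whose Hessian is H everywhere

# Monte Carlo estimate of E_{dtheta ~ N(0, diag(sigma))}[loss(theta + dtheta)] - loss(theta)
noise = rng.standard_normal((200_000, d)) * np.sqrt(sigma)
perturbed = theta + noise
mc_gap = np.mean(0.5 * np.einsum('ni,ij,nj->n', perturbed, H, perturbed)) - loss(theta)

# Second-order prediction from the expansion above: 0.5 * Tr(diag(sigma) * Hessian)
pred_gap = 0.5 * np.sum(sigma * np.diag(H))
print(round(mc_gap, 4), round(pred_gap, 4))   # the two values agree up to Monte Carlo error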
A.2.4 Deterministic Prediction
Recall that for any θ ∈ R^d and σ ∈ R^d_+, we use $Q_\sigma(\theta)$ to denote the multivariate normal distribution with mean θ and covariance matrix diag(σ). If we rewrite the left-hand side of the PAC-Bayes bound by Taylor expansion, we have:
\[ \mathbb{E}_{\theta\sim Q_{\hat\sigma}(\hat\theta)}\ell(f_\theta;\mathcal{D}) = \mathbb{E}_{\Delta\theta\sim Q_{\hat\sigma}(0)}\ell(f_{\hat\theta+\Delta\theta};\mathcal{D}) \approx \ell(f_{\hat\theta};\mathcal{D}) + \mathbb{E}_{\Delta\theta\sim Q_{\hat\sigma}(0)}\Big(\nabla\ell(f_{\hat\theta};\mathcal{D})^{T}\Delta\theta + \frac{1}{2}\Delta\theta^{\top}\nabla^2\ell(f_{\hat\theta};\mathcal{D})\Delta\theta\Big) = \ell(f_{\hat\theta};\mathcal{D}) + \frac{1}{2}\mathrm{Tr}\big(\mathrm{diag}(\hat\sigma)\nabla^2\ell(f_{\hat\theta};\mathcal{D})\big) \ge \ell(f_{\hat\theta};\mathcal{D}). \tag{A.21} \]
Recall that here $\hat\theta$ and $\hat\sigma$ are the minimizers of the PAC-Bayes loss, obtained by solving the optimization problem (P). Equation (A.21) states that the deterministic predictor has a smaller prediction error than the Bayesian predictor. However, note that the last inequality in (A.21) is derived under the assumption that the term $\nabla^2\ell(f_{\hat\theta};\mathcal{D})$ is positive semidefinite. This is a reasonable assumption since $\hat\theta$ is a local minimizer of the PAC-Bayes loss, and the PAC-Bayes loss is close to the population loss when the number of samples is large. Nevertheless, since this property only approximately holds, the presented argument can only serve as an intuition for the potential benefits of using the deterministic predictor.

A.3 Extended Experimental Details
We conducted experiments using eight A5000 GPUs with four AMD EPYC 7543 32-core processors. To speed up the training of the posterior and prior variance, we utilized a warmup method that updates the noise level of the posterior of each layer as a scalar for the first 50 epochs and then proceeds with normal updates after the warmup period. This method only affects the convergence speed, not the generalization, and it was only used for large models in image classification.

A.3.1 Parameter Settings
Recall that the exponential moment bound $K(\lambda)$ is estimated over a range $[\gamma_1,\gamma_2]$ of $\gamma$ as per Definition 2.2.5. This means that we need the inequality
\[ \mathbb{E}_{h\sim P_\lambda}\mathbb{E}\big[\exp\big(\gamma(\mathbb{E}[X(h)] - X(h))\big)\big] \le \exp\big(\gamma^2 K(\lambda)\big) \]
to hold for any $\gamma$ in this range. One needs to be a little cautious when choosing the upper bound $\gamma_2$: if it is too large, then the empirical estimate of $\mathbb{E}_{h\sim P_\lambda}\mathbb{E}[\exp(\gamma(\mathbb{E}[X(h)] - X(h)))]$ would have too large a variance. Therefore, we recommend setting $\gamma_2$ to no more than 10 or 20. The choice of $\gamma_1$ does not seem to be very crucial, so we have fixed it to 0.5 throughout.

For large datasets (such as MNIST or CIFAR10), $m$ is large. Then, according to Theorem 2.2.10, we can set the range $M, T, a, b$ of the trainable parameters to be very large with only a small increase of the bound (as $M, T, a, b$ appear inside the logarithm), and during training the parameters would not exceed these bounds even if we do not clip them. Hence, no clipping is needed for very large networks or for small networks with proper initializations. But when the dataset size $m$ is small, or the initialization is not good enough, the correction term could be large, and clipping is needed. Clipping is also needed from the usual numerical-stability point of view: as $\lambda$ appears in the denominator of the KL-divergence, it cannot be too close to 0. Because of this, in the numerical experiments on GNNs and on CNN13/CNN15, we clip the domain of $\lambda$ at a lower bound of 0.1 and 5e−3, respectively. For the VGG and ResNet experiments, clipping $\lambda$ is optional.
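As a concrete illustration of this clipping (a hypothetical snippet; the tensor name `b`, the bounds, and the placement after the optimizer step are assumptions, not the dissertation's exact code), the layerwise log-variances can simply be clamped after each update so that λ = exp(b) stays inside [λ_min, λ_max]:

import math
import torch

lambda_min, lambda_max = 5e-3, math.e ** 2   # illustrative bounds on the prior variance
b = torch.zeros(10, requires_grad=True)      # log of the layerwise prior variances, lambda = exp(b)

# ... after computing the PAC-Bayes loss, calling loss.backward() and optimizer.step() ...
with torch.no_grad():
    b.clamp_(min=math.log(lambda_min), max=math.log(lambda_max))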
A.3.2 Baseline PAC-Bayes bounds for unbounded loss functions
We compared two baseline PAC-Bayes bounds when training CNNs with our layerwise PAC-Bayes bound. The bounds are expressed in our notation.

• SubGaussian (Corollary 4 of Germain et al. (2016)):
\[ \mathbb{E}_{\theta\sim Q_\sigma(\theta)}\ell(f_\theta;\mathcal{D}) \le \mathbb{E}_{\theta\sim Q_\sigma(\theta)}\ell(f_\theta;\mathcal{S}) + \frac{1}{m}\Big(\log\frac{1}{\delta} + \mathrm{KL}\big(Q_\sigma(\theta)\,\|\,P\big)\Big) + \frac{1}{2}s^2, \tag{A.22} \]
where $s^2$ is the variance factor obtained by assuming the loss function $\ell$ is sub-Gaussian, as defined below:
\[ \mathbb{E}_{\theta\sim P}\mathbb{E}_{\mathcal{S}\sim\mathcal{D}}\exp\big[\gamma\big(\ell(f_\theta;\mathcal{D}) - \ell(f_\theta;\mathcal{S})\big)\big] \le \exp\Big(\frac{\gamma^2 s^2}{2}\Big), \quad \forall\,\gamma\in\mathbb{R}^+. \]

• CGF (Theorem 9 of Rodríguez-Gálvez et al. (2023)):
\[ \mathbb{E}_{\theta\sim Q_\sigma(\theta)}\ell(f_\theta;\mathcal{D}) \le \mathbb{E}_{\theta\sim Q_\sigma(\theta)}\ell(f_\theta;\mathcal{S}) + \Big(\frac{1}{\gamma}\Big(\log\frac{1}{\delta} + \mathrm{KL}\big(Q_\sigma(\theta)\,\|\,P\big)\Big) + \psi(\gamma)\Big), \tag{A.23} \]
where $\psi(\gamma)$ is a convex and continuously differentiable function defined on $[0, b)$ for some $b\in\mathbb{R}^+$ such that $\psi(0) = \psi'(0) = 0$ and
\[ \mathbb{E}_{\theta\sim P}\mathbb{E}_{\mathcal{S}\sim\mathcal{D}}\big[\exp\big(\gamma(\ell(f_\theta;\mathcal{D}) - \ell(f_\theta;\mathcal{S}))\big)\big] \le \exp(\psi(\gamma)) \]
for all $\gamma\in[0, b)$. No specific form of $\psi(\gamma)$ is provided in the original paper, so we set $\psi(\gamma) = K\gamma^2$. Moreover, $\gamma$ appears in the denominator of the bound, so we optimized $\gamma$ when evaluating this bound and clipped $\gamma$ to the same range $[0.5, 10)$ as we did for our algorithm.
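For reference, the right-hand sides of (A.22) and (A.23) can be evaluated with a small helper like the following (a sketch under the choices stated above, i.e., ψ(γ) = Kγ² and γ grid-searched over [0.5, 10); the function names and example numbers are illustrative):

import numpy as np

def subgaussian_bound(emp_loss, kl, m, delta, s2):
    # Right-hand side of (A.22).
    return emp_loss + (np.log(1.0 / delta) + kl) / m + 0.5 * s2

def cgf_bound(emp_loss, kl, delta, K, gamma_grid=np.linspace(0.5, 10.0, 200, endpoint=False)):
    # Right-hand side of (A.23) with psi(gamma) = K * gamma^2, minimized over gamma in [0.5, 10).
    values = emp_loss + (np.log(1.0 / delta) + kl) / gamma_grid + K * gamma_grid ** 2
    return values.min()

# example usage with made-up numbers
print(subgaussian_bound(emp_loss=0.15, kl=250.0, m=50_000, delta=0.05, s2=0.1))
print(cgf_bound(emp_loss=0.15, kl=250.0, delta=0.05, K=0.01))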
A.3.3 Image classification
There is no data augmentation in the experiment results reported in the main text; the results with data augmentation can be found below. For the layerwise prior, we treated each parameter in the PyTorch object model.parameters() as an independent layer, i.e., the weights and bias of one convolution/batch-norm layer were treated as two different layers. The number of training epochs of Stage 1 is 500 for PAC-Bayes training. Moreover, a learning rate scheduler was added to both our method and the baselines to make the training fully converge. Specifically, the learning rate is reduced by a factor of 0.1 whenever the training accuracy does not increase for 20 epochs. For PAC-Bayes training, the scheduler is only activated in Stage 2. Training is terminated when the training accuracy stays above 99.9% for 20 epochs or when the learning rate decreases below 1e−5. We also add label smoothing (0.1) (Szegedy et al., 2016) to the neural networks when comparing SGD/Adam with our method on image classification tasks to enhance the final test accuracy of all training methods.

The searched hyperparameter values include momentum for SGD (0.3, 0.6, 0.9), learning rates (1e−3, 5e−3, 1e−2, 5e−2, 1e−1, 2e−1), weight decay (1e−4, 5e−4, 1e−3, 5e−3, 1e−2), and noise injection (5e−4, 1e−3, 5e−3, 1e−2). The best learning rate for Adam and AdamW is the same, since weight decay is the only difference between the two optimizers. We adjusted one hyperparameter at a time while keeping the others fixed to accelerate the search. To determine the optimal value of a hyperparameter, we compared the mean test accuracy of the last five epochs, and then used the selected value when tuning the next one. We used this extensive grid search as a baseline to ensure the best achievable test accuracy in the literature (Table 4 of Geiping et al. (2021)). Noise injection is only applied to Adam/AdamW, as it sometimes causes instability for SGD and does not seem to increase the test performance.

The test accuracy from all experiments with batch size 128 and learning rate 1e−4 is shown in Figure A.1 and Figure A.2. To best demonstrate the sensitivity of the baselines to hyperparameter selection and to motivate our PAC-Bayes training, we organize the test accuracy below for ResNet18. For search efficiency, we searched the hyperparameters one by one. For SGD, we first searched the learning rate with the momentum and the weight decay set to 0 (both are default values for SGD), and then used the best learning rate to search for the momentum. Finally, the best learning rate and momentum were used to search for the weight decay. For Adam, we searched the learning rate, weight decay, and noise injection in a similar order. Since AdamW and Adam coincide when the weight decay is 0, we searched for the best weight decay of AdamW based on the best learning rate obtained from the search on Adam.

A.3.4 Compatibility with Data Augmentation
We did not include data augmentation in the experiments in the main text because, with data augmentation, there is no rigorous way of choosing the sample size m that appears in the PAC-Bayes bound. More specifically, for the PAC-Bayes bound to be valid, the training data has to consist of i.i.d. samples from some underlying distribution, and most data augmentation techniques break the i.i.d. assumption. As a result, if we have 10 times more samples after augmentation, the new information they bring in is much less than that from 10 times as many i.i.d. samples. In this case, how to determine the effective sample size m to use in the PAC-Bayes bound is an open question. Since knowing whether a training method works well with data augmentation is important, we carried out PAC-Bayes training with an ad-hoc choice of m, namely the size of the augmented dataset. We compared the grid-search results of SGD and Adam against PAC-Bayes training on CIFAR10 with ResNet18. The augmentation is achieved by random flipping and random cropping, which increases the size of the training set by a factor of 128. The test accuracy is 95.2% for SGD, 94.3% for Adam, 94.4% for AdamW, and 94.3% for PAC-Bayes training with the layerwise prior. In contrast, the test accuracy without data augmentation is lower than 90% for all methods. This suggests that data augmentation does not conflict with PAC-Bayes training in practice.

A.3.5 Model analysis
We examined the learning process of PAC-Bayes training by analyzing the posterior variance σ of different layers in models trained by Algorithm A.2. Typically, batch-norm layers have smaller σ values than convolution layers. Additionally, the shallow convolution layers and the last few layers have smaller σ values than the middle layers. We also found that skip-connections in ResNet18 have smaller σ values than nearby layers, suggesting that important layers with a greater impact on the output have smaller σ values.

In Stage 1, the training loss is higher than the testing loss, which means the adopted PAC-Bayes bound is able to bound the generalization error throughout the PAC-Bayes training stage. Additionally, we observed that the final value of K is usually very close to the minimum of the sampled function values. The average value of σ is updated rapidly during the initial 50 warmup epochs but then progresses slowly until Stage 2. The details can be found in Figures A.9 and A.13.
Based on the figures, the shallow convolution layers and the last few layers have smaller σ values than the middle layers for all models. We also found that skip-connections in ResNet18 and ResNet34 have smaller σ values than nearby layers on both datasets, suggesting that important layers with a greater impact on the output have smaller σ values.

Computational cost: In PAC-Bayes training, we have four groups of parameters, θ, λ, σ, γ. Among these, γ can be computed on the fly whenever needed, so it does not need to be stored. We need to store θ, λ, σ, where σ has the same size as θ and the size of λ equals the number of layers, which is much smaller. Hence the total storage is approximately doubled. Likewise, when computing the gradients for θ, λ, σ, the cost of automatic differentiation in each iteration is also approximately doubled. In the inference stage, the complexity is the same as in conventional training.

Effect of the two stages: We tested the effect of the two stages. Without the first stage, the algorithm cannot automatically learn the noise level and weight decay to be used in the second stage. If the first stage is present but too short (10 epochs, for example), the final performance of VGG13 on CIFAR100 reduces to 64.0%. Without Stage 2, the final performance is not as good as reported either: the test accuracy of models such as VGG13 and ResNet18 on CIFAR10 would be about 10% lower, as shown in Figures A.9 and A.13.

A.3.6 Node classification by GNNs
We test the PAC-Bayes training algorithm on the following popular GNN models, tuning the learning rate (1e−3, 5e−3, 1e−2), weight decay (0, 1e−4, 1e−3, 1e−2), noise injection (0, 1e−3, 5e−3, 1e−2), and dropout (0, 0.4, 0.8). The number of filters per layer is 32 in GCN (Kipf and Welling, 2016) and SAGE (Hamilton et al., 2017). For GAT (Veličković et al., 2018), the number of filters per layer is 8, the number of heads is 8, and the dropout rate of the attention coefficients is 0.6. For APPNP (Gasteiger et al., 2018), the number of filters is 32, K = 10, and α = 0.1. We set the number of layers to 2, which achieves the best baseline performance. A ReLU activation and a dropout layer are added between the convolution layers for baseline training only. Since GNNs are faster to train than convolutional neural networks, we tested all possible combinations of the above parameters for the baseline, i.e., 144 searches per model on one dataset. For PAC-Bayes training, we use Adam as the optimizer with a learning rate of 1e−2 for all models when using both training and validation nodes.

We also ran a separate experiment using both training and validation nodes for training. For the baselines, we first need to train the model to find the best hyperparameters as before and then train the model again on the combined data. Our PAC-Bayes training also matches the best generalization of the baselines in this setting. All results are visualized in Figures A.5-A.8. The AdamW+val and scalar+val entries record the performance of the baseline and of PAC-Bayes training, respectively, when both training and validation datasets are used for training. We can see that the test accuracy after adding validation nodes increases significantly for both methods, and the results of our algorithm still match the best test accuracy of the baselines. Our proposed PAC-Bayes training with the scalar prior is better than most of the settings explored during the search and achieves comparable test accuracy when validation nodes are added to training.
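For reference, the two-layer GCN baseline described above corresponds to a model of roughly the following form (a sketch assuming the PyTorch Geometric GCNConv layer; the dropout value shown is just one of the searched settings):

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCNBaseline(torch.nn.Module):
    # Two-layer GCN with 32 filters and ReLU + dropout between the convolutions,
    # matching the baseline configuration described above (illustrative sketch only).
    def __init__(self, num_features, num_classes, hidden=32, dropout=0.4):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, num_classes)
        self.dropout = dropout

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))
        x = F.dropout(x, p=self.dropout, training=self.training)
        return self.conv2(x, edge_index)

# usage: logits = GCNBaseline(dataset.num_features, dataset.num_classes)(data.x, data.edge_index)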
A.3.7 Few-shot text classification with transformers
The proposed method is also observed to work on transformer networks. We conducted experiments on two text classification tasks from the GLUE benchmark, as shown in Table A.1. SST is a sentiment analysis task: the goal is to determine whether the emotional tone of a given text is positive, negative, or neutral. QNLI (Question-answering Natural Language Inference) focuses on the logical relationship between a given question and a corresponding sentence: the objective is to determine whether the sentence contradicts, entails, or is neutral with respect to the question. We use classification accuracy as the evaluation metric for both tasks.

The baseline method uses grid search over the learning rate (1e−1, 1e−2, 1e−3), batch size (2, 8, 16, 32, 80), dropout ratio (0, 0.5), optimization algorithm (SGD, AdamW), noise injection (0, 1e−5, 1e−4, 1e−3, 1e−2, 1e−1), and weight decay (0, 1e−1, 1e−2, 1e−3, 1e−4). The learning rate and batch size of our method are set to 1e−3 and 100 (i.e., full batch), respectively. In this task, the number of training samples is small (80), so the preset γ2 = 10 is a bit large and prevents the model from achieving its best possible performance with PAC-Bayes training.

We adopt BERT (Devlin et al., 2018) as the backbone and add one fully connected layer as the classification layer. Only the added classification layer is trainable; the pre-trained model is frozen without gradient updates. To simulate a few-shot learning scenario, we randomly sample 100 instances from the original training set and take the whole development set to evaluate the classification performance. We split the training set into 5 splits, taking one split as the validation data and the rest as the training set. Each experiment was conducted five times, and we report the average performance. We used PAC-Bayes training with the scalar prior in this experiment. According to Table A.1, our method is competitive with the baseline on the SST task, where the performance gap is only 0.4 points. On the QNLI task, our method outperforms the baseline by a large margin, and the variance of our method is smaller than that of the baseline.

Table A.1 Test accuracy on the development sets of 2 GLUE benchmarks.
            SST          QNLI
baseline    72.9±0.99    62.6±0.10
scalar      72.5±0.99    64.2±0.02
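The frozen-backbone setup described above can be sketched as follows (assuming the Hugging Face transformers API; this is an illustration, not the dissertation's code):

import torch
from transformers import BertModel, BertTokenizer

# Frozen BERT backbone with a single trainable classification layer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
backbone = BertModel.from_pretrained("bert-base-uncased")
for p in backbone.parameters():
    p.requires_grad = False                               # pre-trained model is frozen

head = torch.nn.Linear(backbone.config.hidden_size, 2)    # the only trainable layer

def classify(texts):
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        feats = backbone(**enc).pooler_output              # [batch, hidden_size] sentence features
    return head(feats)                                     # class logits

logits = classify(["a delightful film", "a tedious mess"])
print(logits.shape)                                        # torch.Size([2, 2])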
A.3.8 Additional experiments on stability
We conducted extra experiments to showcase the robustness of the proposed PAC-Bayes training algorithm. Specifically, we tested the effect of different learning rates on ResNet18 and VGG13 models trained with the layerwise prior. The learning rate has long been known to be an important factor for the generalization of baseline training: within the stability range of gradient descent, larger learning rates are generally observed to give better generalization (Lewkowycz et al., 2020). In contrast, the generalization of the PAC-Bayes trained model is less sensitive to the learning rate. We do observe that, due to the newly introduced noise parameters, the stability of the optimization gets worse, which in turn requires a lower learning rate to achieve stable training. But as long as stability is guaranteed by setting the learning rate low enough, our results in Tables A.2 and A.3 indicate that the test accuracy remains stable across various learning rates for ResNet18 and VGG13. The dash in the tables means that the learning rate for that particular setting is too large to maintain training stability. For learning rates below 1e−4, we trained the model in Stage 1 for more epochs (700) to fully update the prior and posterior variance.

We also demonstrate that the warmup iterations (discussed at the beginning of this section) do not affect generalization: as shown in Table A.4, the test accuracy is insensitive to the number of warmup iterations. Furthermore, additional evaluations of the effects of batch size (Table A.5), the optimizer (Table A.6), and γ1 and γ2 (Table A.7) are provided below. We further visualize the sorted test accuracy of the baselines and of our proposed PAC-Bayes training with large batch sizes and a fixed learning rate of 5e−4 in Figure A.3 and Figure A.4. These figures demonstrate that our PAC-Bayes training algorithm achieves better test accuracy than most searched settings. For the models VGG13 and ResNet18, the large batch size is 2048; for the larger models VGG19 and ResNet34, the large batch size is set to 1280 due to the GPU memory limitation.

Table A.2 Test accuracy of ResNet18 trained with different learning rates.
lr         3e−5   5e−5   1e−4   2e−4   3e−4   5e−4
CIFAR10    88.4   88.8   89.3   88.6   88.3   89.2
CIFAR100   69.2   69.0   68.9   69.1   69.1   69.6

Table A.3 Test accuracy of VGG13 trained with different learning rates.
lr         3e−5   5e−5   1e−4   2e−4   3e−4   5e−4
CIFAR10    88.6   88.9   89.7   89.6   89.6   89.5
CIFAR100   67.7   68.0   67.1   -      -      -

Table A.4 Test accuracy of ResNet18 trained with different numbers of warmup epochs for σ.
warmup     10     20     50     80     100    150
CIFAR10    88.5   88.5   89.3   89.5   89.5   88.9
CIFAR100   69.4   69.6   68.9   69.1   68.1   69.0

Table A.5 Test accuracy of VGG13 with different batch sizes.
Batch Size   128    256    1024   2048   2500
Test Acc     89.7   89.7   88.7   89.4   88.3

Table A.6 Test accuracy of ResNet18 using SGD: effects of different momentum values (with learning rate 1e−3) and different learning rates (with momentum 0.9).
Momentum        0.3    0.6    0.9
Test Acc        88.6   88.8   89.2
Learning Rate   3e−4   1e−3   1e−4
Test Acc        88.3   88.8   89.2

Table A.7 Test accuracy of ResNet18 with different settings for γ1 (with γ2 = 20) and γ2 (with γ1 = 0.1).
γ1         0.5    0.1    1.0
Test Acc   88.8   89.3   88.8
γ2         10     15     20
Test Acc   89.3   89.4   89.4

Figure A.1 Sorted test accuracy of CIFAR10. The x-axis represents the experiment index. Panels: (a) VGG13, (b) VGG19, (c) ResNet18, (d) ResNet34, (e) Dense121.

Figure A.2 Sorted test accuracy of CIFAR100. The x-axis represents the experiment index. Panels: (a) VGG13, (b) VGG19, (c) ResNet18, (d) ResNet34, (e) Dense121.

Figure A.3 Sorted test accuracy of CIFAR10 with large batch sizes. The x-axis represents the experiment index. Panels: (a) VGG13 (batch: 2048), (b) ResNet18 (batch: 2048), (c) VGG19 (batch: 1280), (d) ResNet34 (batch: 1280).

Figure A.4 Sorted test accuracy of CIFAR100 with large batch sizes. The x-axis represents the experiment index. Panels: (a) VGG13 (batch: 2048), (b) ResNet18 (batch: 2048), (c) VGG19 (batch: 1280), (d) ResNet34 (batch: 1280).

Figure A.5 Test accuracy of GCN. The first and third quartiles construct the interval over the ten random splits. {+val} denotes the performance with both training and validation datasets used for training. Panels: (a) CoraML, (b) Citeseer, (c) CoraFull, (d) DBLP, (e) DBLP.

Figure A.6 Test accuracy of SAGE. The first and third quartiles construct the interval over the ten random splits. {+val} denotes the performance with both training and validation datasets used for training. Panels: (a) CoraML, (b) Citeseer, (c) CoraFull, (d) DBLP, (e) DBLP.
Figure A.7 Test accuracy of GAT. The first and third quartiles construct the interval over the ten random splits. {+val} denotes the performance with both training and validation datasets used for training. Panels: (a) CoraML, (b) Citeseer, (c) CoraFull, (d) DBLP, (e) DBLP.

Figure A.8 Test accuracy of APPNP. The first and third quartiles construct the interval over the ten random splits. {+val} denotes the performance with both training and validation datasets used for training. Panels: (a) CoraML, (b) Citeseer, (c) CoraFull, (d) DBLP, (e) DBLP.

Figure A.9 Training details of ResNet18 on CIFAR10. The red star denotes the final K. Panels: (a) mean(σ) of batch-norm layers, (b) mean(σ) of convolution layers, (c) mean(σ) in training, (d) function ˜K(λ̄), (e) training and testing process.

Figure A.10 Training details of ResNet18 on CIFAR100. The red star denotes the final K. Panels as in Figure A.9.

Figure A.11 Training details of ResNet34 on CIFAR10. The red star denotes the final K. Panels as in Figure A.9.

Figure A.12 Training details of ResNet34 on CIFAR100. The red star denotes the final K. Panels as in Figure A.9.

Figure A.13 Training details of VGG13 on CIFAR10. The red star denotes the final K. Panels as in Figure A.9.

APPENDIX B
MAGNET: A NEURAL NETWORK FOR DIRECTED GRAPHS

B.1 List of method abbreviations
• MagNet (this paper)
• ChebNet Defferrard et al. (2016)
• GCN Kipf and Welling (2016)
• APPNP Klicpera et al. (2019a)
• GAT Veličković et al. (2018)
• SAGE Hamilton et al. (2017)
• GIN Xu et al. (2018)
• DGCN Tong et al. (2020b)
• DiGraph Tong et al. (2020a)
• DiGraphIB Tong et al. (2020a): DiGraph with inception blocks
• BiGCN: applying GCN on the original adjacency matrix and its transpose matrix separately
• BiSAGE: applying SAGE on the original adjacency matrix and its transpose matrix separately
• BiGAT: applying GAT on the original adjacency matrix and its transpose matrix separately
• KNN: K-nearest neighbors based on the eigenvectors with the smallest eigenvalues of the magnetic Laplacian Fanuel et al. (2017) (see the sketch below)
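For reference, the magnetic Laplacian that MagNet and the KNN baseline build on can be assembled as in the following sketch (a minimal illustration assuming the phase matrix Θ^(q)(u, v) = 2πq(A(u, v) − A(v, u)) and the symmetrized degree matrix used in Appendix B.6; the helper name and the dense-matrix setup are ours):

import numpy as np

def magnetic_laplacian(A, q=0.25, normalized=True):
    # A: dense directed adjacency matrix (N x N), A[u, v] = 1 if there is an edge u -> v.
    A_s = 0.5 * (A + A.T)                           # symmetrized adjacency
    theta = 2.0 * np.pi * q * (A - A.T)             # antisymmetric phase matrix Theta^(q)
    H = A_s * np.exp(1j * theta)                    # Hermitian "magnetic" adjacency
    d_s = A_s.sum(axis=1)                           # symmetrized degrees
    if not normalized:
        return np.diag(d_s) - H                     # unnormalized L_U^(q)
    d_inv_sqrt = np.zeros_like(d_s)
    nz = d_s > 0
    d_inv_sqrt[nz] = d_s[nz] ** -0.5
    D = np.diag(d_inv_sqrt)
    return np.eye(A.shape[0]) - D @ H @ D           # normalized L_N^(q)

# example: the directed 3-cycle 0 -> 1 -> 2 -> 0
A = np.array([[0, 1, 0], [0, 0, 1], [1, 0, 0]], dtype=float)
L = magnetic_laplacian(A, q=0.25)
print(np.allclose(L, L.conj().T))                   # Hermitian
print(np.round(np.linalg.eigvalsh(L), 4))           # real eigenvalues, contained in [0, 2]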
B.2 Further implementation details
We set the parameter K = 1 in our implementation of both ChebNet and MagNet, except for the synthetic noisy cyclic graphs with random input features, for which we also tried K = 2 for MagNet. We train all models for a maximum of 3000 epochs and stop early if the validation error does not decrease for 500 epochs, for both the node classification and link prediction tasks. One dropout layer with a probability of 0.5 is placed before the last linear layer. The model with the best validation accuracy during training is selected for testing. We tune the number of filters in [16, 32, 48] for the graph convolutional layers of all models except DiGraphIB, since its inception block has more trainable parameters. For node classification, we tune the learning rate in [1e−3, 5e−3, 1e−2] for all models. Compared with node classification, the number of available samples for link prediction is much larger; thus, we set a relatively small learning rate of 1e−3. We use Adam as the optimizer and ℓ2 regularization with the hyperparameter set to 5e−4 to avoid overfitting. We report the best testing performance obtained by grid search based on validation accuracy. For node classification on the synthetic datasets, we generate a one-dimensional node feature sampled from the standard normal distribution. We use the original features for the other node classification datasets. For link prediction, we use the in-degree and out-degree as the node features for all datasets instead of the original features. This allows all models to learn directed information from the adjacency matrix. Our experiments were conducted on 8 compute nodes, each with 1 Nvidia Tesla V100 GPU, 120 GB RAM, and 32 Intel Xeon E5-2660 v3 CPUs, as well as on a compute node with 8 Nvidia RTX 8000 GPUs, 1000 GB RAM, and 48 Intel Xeon Silver 4116 CPUs.

Here are implementation details specific to certain methods:
• We set the parameter ε to 0 in GIN for both tasks.
• For GAT and BiGAT, the number of heads is tuned in [2, 4, 8].
• For APPNP, we set K = 10 for node classification (following the original paper, Klicpera et al. (2019a)) and search K in [1, 5, 10] for link prediction.
• The coefficient α for PageRank-based models (APPNP, DiGraph) is searched in [0.05, 0.1, 0.15, 0.2].
• For DiGraph, the model includes graph convolutional layers without the high-order approximation and inception module. The high-order Laplacian and the inception module are included in DiGraphIB.
• DiGraphIB is a bit different from the other networks because it requires generating a three-channel Laplacian tensor. For this network, the number of filters for each channel is searched in [6, 11, 21] for node classification and link prediction.
• For GCN, the out-degree-normalized directed adjacency matrix (including self-loops) is also tried in addition to the symmetrized adjacency matrix for node classification tasks, except for the synthetic datasets, since symmetrization would break the cluster pattern.
• For the other spatial methods, including APPNP, GAT, SAGE, and GIN, we tried both the symmetrized adjacency matrices and the original directed adjacency matrices for node classification tasks, except for the synthetic datasets.
• For KNN, we set q = 0.25 and K = 5.

B.3 Datasets
B.4 Node classification
We use six real datasets for node classification. A directed edge is defined as follows. If the edge (u, v) ∈ E but (v, u) ∉ E, then (u, v) is a directed edge. If (u, v) ∈ E and (v, u) ∈ E, then (u, v) and (v, u) are undirected edges (in other words, undirected edges that are not self-loops are counted twice). For the citation datasets, Cora-ML and Citeseer, we randomly select 20 nodes in each class for training, 500 nodes for validation, and the rest for testing, following Tong et al. (2020a). For the synthetic datasets (ordered DSBM graphs, cyclic DSBM graphs, noisy cyclic DSBM graphs), we generate a one-dimensional node feature sampled from the standard normal distribution. Ten folds are generated randomly for each dataset, except for Cornell, Texas, and Wisconsin, for which we use the same training, validation, and testing folds as Pei et al. (2020).
For Telegram, we treat it as a directed, unweighted graph and randomly generate 10 splits for training/validation/testing with 60%/20%/20% of the nodes. The node features are sampled from the normal distribution.

B.5 Link prediction
We use eight real datasets for link prediction. Instead of using the original features, we use the in-degree and out-degree as the node features in order to allow the models to learn structural information from the adjacency matrix directly. The connectivity is maintained by taking the undirected minimum spanning tree before removing edges for validation and testing. For the results in the main text, undirected edges and, if they exist, pairs of vertices with multiple edges between them may be placed in the training/validation/testing sets. However, labels that indicate the direction of such edges are not well defined and can therefore be considered noisy labels from the machine learning perspective. In order to obtain a full set of well-defined, noiseless labels, in the supplement we also run experiments in which undirected edges and pairs of vertices with multiple edges between them are ignored when sampling edges for training/validation/testing (in other words, only directed edges, and the absence of an edge, are included). We evaluated all models on four prediction tasks. To construct the datasets that we use for training, validation, and testing, which consist of pairs of vertices in the graph, we do the following. (1) Existence prediction: if (u, v) ∈ E, we give (u, v) the label 0, and otherwise its label is 1. The proportion of the two classes of edges is 25% and 75%, respectively, when undirected edges and multi-edges are included, and 50% and 50%, respectively, when only directed edges are included. (2) Direction prediction: given an ordered node pair (u, v), we give the label 0 if (u, v) ∈ E and the label 1 if (v, u) ∈ E, conditioning on (u, v) ∈ E or (v, u) ∈ E. The proportion of the two types of edges is 50% and 50%. We randomly generated ten folds for all datasets and used 15% and 5% of the edges for testing and validation, respectively.

B.6 Eigenvalues of the magnetic Laplacian
In this section we state and prove three theorems. Theorem B.6.1, which shows that both the normalized and unnormalized magnetic Laplacians are positive semidefinite, is well known (see, e.g., Fanuel et al. (2018)). Theorem B.6.2, which shows that the eigenvalues of the normalized magnetic Laplacian lie in the interval [0, 2], is a straightforward adaptation of the corresponding result for the traditional normalized graph Laplacian. Finally, Theorem B.6.4 proves that the unnormalized magnetic Laplacian may be factored in terms of a complex-valued incidence matrix, analogous to the well-known result for the standard graph Laplacian. We give full proofs of all three results for completeness.

Theorem B.6.1. Let G = (V, E) be a directed graph where V is a set of N vertices and E ⊆ V × V is a set of directed edges. Then, for all q ≥ 0, both the unnormalized magnetic Laplacian L_U^(q) and its normalized counterpart L_N^(q) are positive semidefinite.

Proof. Let x ∈ C^N. We first note that since L_U^(q) is Hermitian, we have Imag(x†L_U^(q) x) = 0.
Next, 170 we use the definition of D𝑠 and the fact that A𝑠 is symmetric to observe that 2Real (cid:16) (cid:17) x†L(𝑞) 𝑈 x 𝑁 ∑︁ D𝑠 (𝑢, 𝑣)x(𝑢)x(𝑣) − 2 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)x(𝑢)x(𝑣) cos(𝑖𝚯(𝑞) (𝑢, 𝑣)) 𝑢,𝑣=1 𝑁 ∑︁ D𝑠 (𝑢, 𝑢)x(𝑢)x(𝑢) − 2 𝑢,𝑣=1 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)x(𝑢)x(𝑣) cos(𝑖𝚯(𝑞) (𝑢, 𝑣)) 𝑢=1 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)|x(𝑢)|2 − 2 𝑢,𝑣=1 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)x(𝑢)x(𝑣) cos(𝑖𝚯(𝑞) (𝑢, 𝑣)) =2 =2 =2 𝑢,𝑣=1 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)|x(𝑢)|2 + 𝑢,𝑣=1 𝑁 ∑︁ 𝑢,𝑣=1 A𝑠 (𝑣, 𝑢)|x(𝑣)|2 − 2 𝑁 ∑︁ 𝑢,𝑣=1 A𝑠 (𝑢, 𝑣)x(𝑢)x(𝑣) cos(𝑖𝚯(𝑞) (𝑢, 𝑣)) (cid:16) (cid:16) A𝑠 (𝑢, 𝑣) A𝑠 (𝑢, 𝑣) |x(𝑢)|2 + |x(𝑣)|2 − 2x(𝑢)x(𝑣) cos(𝑖𝚯(𝑞) (𝑢, 𝑣)) (cid:17) (B.1) |x(𝑢)|2 + |x(𝑣)|2 − 2|x(𝑢)||x(𝑣)| (cid:17) A𝑠 (𝑢, 𝑣)(|x(𝑢)| − |x(𝑣)|)2 = = ≥ = 𝑢,𝑣=1 𝑁 ∑︁ 𝑢,𝑣=1 𝑁 ∑︁ 𝑢,𝑣=1 𝑁 ∑︁ 𝑢,𝑣=1 ≥0. Thus, L(𝑞) 𝑈 is positive semidefinite. For the normalized magnetic Laplacian, we note that D−1/2 𝑠 A𝑠D−1/2 𝑠 (cid:17) ⊙ exp(𝑖𝚯(𝑞)) = D−1/2 𝑠 (cid:16) A𝑠 ⊙ exp(𝑖𝚯(𝑞)) (cid:17) D−1/2 𝑠 , (cid:16) and therefore Thus, letting y = D−1/2 𝑠 x, the fact that D𝑠 is diagonal implies L(𝑞) 𝑁 = D−1/2 𝑠 L(𝑞) 𝑈 D−1/2 𝑠 . x†L(𝑞) 𝑁 x = x†D−1/2 𝑠 L(𝑞) 𝑈 D−1/2 𝑠 x = y†L(𝑞) 𝑈 y ≥ 0. (B.2) □ Theorem B.6.2. Let 𝐺 = (𝑉, 𝐸) be a directed graph where 𝑉 is a set of 𝑁 vertices and 𝐸 ⊆ 𝑉 × 𝑉 is a set of directed edges. Then, for all 𝑞 ≥ 0, the eigenvalues of the normalized magnetic Laplacian L(𝑞) 𝑁 are contained in the interval [0, 2]. 171 Proof. By Theorem B.6.1, we know that L(𝑞) 𝑁 has real, nonnegative eigenvalues. Therefore, we need to show that the lead eigenvalue, 𝜆𝑁 , is less than or equal to 2. The Courant-Fischer theorem shows that 𝜆𝑁 = max x≠0 x†L(𝑞) 𝑁 x x†x . Therefore, using (B.2) and setting y = D−1/2 x, we have 𝑠 𝑠 L(𝑞) 𝑈 D−1/2 x†D−1/2 x†x 𝜆𝑁 = max x≠0 𝑠 x y†L(𝑞) 𝑈 y y†D𝑠y . = max y≠0 First, we observe that since D𝑠 is diagonal, we have y†D𝑠y = 𝑁 ∑︁ 𝑢,𝑣=1 D𝑠 (𝑢, 𝑣)y(𝑢)y(𝑣) = 𝑁 ∑︁ 𝑢=1 D𝑠 (𝑢, 𝑢)|y(𝑢)|2 Next, we note that by (B.1), we have y†L(𝑞) 𝑈 y = 1 2 𝑁 ∑︁ 𝑢,𝑣=1 𝑁 ∑︁ (cid:16) A𝑠 (𝑢, 𝑣) |x(𝑢)|2 + |x(𝑣)|2 − 2x(𝑢)x(𝑣) cos(𝑖𝚯(𝑞) (𝑢, 𝑣)) (cid:17) A𝑠 (𝑢, 𝑣)(|x(𝑢)| + |x(𝑣)|)2 1 2 ≤ ≤ 𝑢,𝑣=1 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)(|x(𝑢)|2 + |x(𝑣)|2). 𝑢,𝑣=1 Therefore, since As is symmetric, we have y†L(𝑞) 𝑈 y ≤ 2 𝑁 ∑︁ A𝑠 (𝑢, 𝑣)|x(𝑢)|2 𝑢,𝑣=1 𝑁 ∑︁ |x(𝑢)|2 (cid:33) A𝑠 (𝑢, 𝑣) (cid:32) 𝑁 ∑︁ 𝑣=1 D𝑠 (𝑢, 𝑢)|x(𝑢)|2 = 2 = 2 𝑢=1 𝑁 ∑︁ 𝑢=1 = 2y†D𝑠y. □ Definition B.6.3. Let 𝐺 = (𝑉, 𝐸) be a directed graph where 𝑉 is a set of 𝑁 vertices and 𝐸 ⊆ 𝑉 × 𝑉 is a set of directed edges. We say that a link (𝑢, 𝑣) ∈ 𝐸 is bidirectional if the “reverse" link (𝑣, 𝑢) is also also in 𝐸. If a link is not bidirectional we say that it is unidirectional. 172 Theorem B.6.4. Let 𝐺 = (𝑉, 𝐸) be a directed graph where 𝑉 is a set of 𝑁 vertices and 𝐸 ⊆ 𝑉 × 𝑉 is a set of directed edges. Then, for all 𝑞 ≥ 0, the unnormalized magnetic Laplacian may be factored as L(𝑞) 𝑈 = B(𝑞) (B(𝑞))†, where B(𝑞) is a modified incidence matrix defined by B(𝑞) ( 𝑗, ℓ) = 𝑒𝑖𝜋𝑞 𝑒−𝑖𝜋𝑞 1√ 2 −1√ 2 1 1 0    if j is the source of link ℓ and ℓ is unidirectional if j is the sink of the link ℓ and ℓ is unidirectional if j is the source of the link ℓ and ℓ is bidirectional . if j is the sink of the link ℓ and ℓ is bidirectional otherwise Proof. Let B = B(𝑞) for the remainder of the proof. By definition we have, (BB†)( 𝑗, 𝑘) = ∑︁ ℓ B( 𝑗, ℓ)B(𝑘, ℓ) If 𝑗 = 𝑘, we have (BB†)( 𝑗, 𝑗) ∑︁ ℓ = = B( 𝑗, ℓ)B( 𝑗, ℓ) ∑︁ B( 𝑗, ℓ)B( 𝑗, ℓ) + ∑︁ B( 𝑗, ℓ)B( 𝑗, ℓ) ℓ unidirectional st. 𝑗 is a source ∑︁ + ℓ bidirectional st. 𝑗 is a source ℓ unidirectional st. 𝑗 is a sink ∑︁ B( 𝑗, ℓ)B( 𝑗, ℓ) + B( 𝑗, ℓ)B( 𝑗, ℓ) ℓ bidirectional st. 𝑗 is a sink −1 √ 2 ∑︁ ℓ unidirectional st. 
𝑗 is a sink 𝑒−𝑖𝜋𝑞 (cid:19) + 𝑒𝑖𝜋𝑞 (cid:18) −1 √ 2 ∑︁ 1 + ∑︁ 1 ℓ bidirectional st. 𝑗 is a source ℓ bidirectional st. 𝑗 is a sink = = ∑︁ 1 √ 2 𝑒𝑖𝜋𝑞 1 √ 2 𝑒𝑖𝜋𝑞 + ℓ unidirectional st. 𝑗 is a source 1 2 (𝑑𝑖𝑛 ( 𝑗) + 𝑑𝑜𝑢𝑡 ( 𝑗)) =𝑑𝑠 ( 𝑗). If 𝑗 ≠ 𝑘 and there is a link from 𝑗 to 𝑘 but not from 𝑘 to 𝑗, then (BB†)( 𝑗, 𝑘) = ∑︁ ℓ B( 𝑗, ℓ)B(𝑘, ℓ) = 𝑒𝑖𝜋𝑞 1 √ 2 (cid:18) −1 √ 2 (cid:19) 𝑒𝑖𝜋𝑞 = −1 2 𝑒2𝜋𝑖𝑞 = −H(𝑞) ( 𝑗, 𝑘) 173 Likewise, if there is a link from 𝑘 to 𝑗 but not from 𝑗 to 𝑘 we have (BB†)( 𝑗, 𝑘) = ∑︁ ℓ B( 𝑗, ℓ)B(𝑘, ℓ) = (cid:18) −1 √ 2 𝑒−𝑖𝜋𝑞 (cid:19) 1 √ 2 𝑒−𝑖𝜋𝑞 = −1 2 𝑒−2𝜋𝑖𝑞 = −H(𝑞) ( 𝑗, 𝑘). Lastly, if there is neither a link from 𝑘 to 𝑗 or 𝑗 to 𝑘 we have (BB†) ( 𝑗, 𝑘) = 0. □ B.7 The eigenvectors and eigenvalues of directed stars and cycles In this section, we examine the eigenvectors and eigenvalues of the unnormalized magnetic Laplacian on two example graphs. As alluded to in the main text, in the directed star graph directional information is contained in the eigenvectors only. For the directed cycle, on the other hand, the magnetic Laplacian encodes the directed nature of the graph only through the eigenvalues. Both examples can be verified via direct pen and paper calculation. Figure B.1 Directed stars (a) 𝐺 (in), and (b) 𝐺 (out) Example 1. Let 𝐺 (in) and 𝐺 (out) be the directed star graphs with vertices 𝑉 = {1, . . . , 𝑁 } and edges pointing in/out to the central vertex as shown in Figure B.1. Then the eigenvalues of L(𝑞,in) , the 𝑈 unnormalized magnetic Laplacian on 𝐺in, are given by 𝜆in 1 = 0, 𝜆in 𝑘 = 1 2 for 2 ≤ 𝑘 ≤ 𝑁 − 1, and 𝜆in 𝑁 = 𝑁 2 . If we let 𝑣 = 1 be the central vertex, then the lead eigenvector is given by 1 (1) = 𝑒2𝜋𝑖𝑞, uin uin 1 (𝑛) = 1, 2 ≤ 𝑛 ≤ 𝑁. For 2 ≤ 𝑘 ≤ 𝑁 − 1, the eigenvectors are uin 𝑘 = δ𝑘 − δ𝑘+1, 174 and the final eigenvector is given by 𝑁 (1) = −𝑒2𝜋𝑖𝑞, uin uin 𝑁 (𝑛) = 1 𝑁 − 1 , 2 ≤ 𝑛 ≤ 𝑁. The phase matrices satisfies 𝚯(𝑞,in) = −𝚯(𝑞,out). Therefore, the associated magnetic Laplacians satisfy L(𝑞,in) (𝑢, 𝑣). Since these matrices are Hermitian, this implies that the (𝑢, 𝑣) = L(𝑞,out) 𝑈 𝑈 corresponding eigenvalue-eigenvector pairs satisfy 𝜆in 1 and uout identify the central vertex, and the sign of their imaginary parts at this vertex identifies whether it is 𝑘 . Hence, uin 𝑘 , and uin 𝑘 = 𝜆out 𝑘 = uout 1 a source or a sink. On the other hand, the eigenvalues give no directional information. Example 2. Let 𝐺 be the directed cycle. Then, then the eigenvalues of L(𝑞) 𝑈 is are the classical Fourier modes u𝑘 (𝑛) = 𝑒(2𝜋𝑖𝑘𝑛/𝑁), independent of 𝑞. The eigenvalues, however, do depend on 𝑞 and are given by 𝜆𝑘 = 1 − cos (cid:18) 2𝜋 (cid:18) 𝑘 𝑁 (cid:19)(cid:19) , + 𝑞 1 ≤ 𝑘 ≤ 𝑁. 175 APPENDIX C UNSUPERVISED LEARNING OF FULL-WAVEFORM INVERSION: CONNECTING CNN AND PARTIAL DIFFERENTIAL EQUATION IN A LOOP C.1 Network Architecture Since the number of receivers 𝑅 and the number of timesteps 𝑇 in seismic measurements are unbalanced (𝑇 ≫ 𝑅), we first stack a 7×1 and six 3×1 convolutional layers (with stride 2 every the other layer to reduce dimension) to extract temporal features until the temporal dimension is close to 𝑅. Then, six 3×3 convolutional layers are followed to extract spatial-temporal features. The resolution is down-sampled every the other layer by using stride 2. Next, the feature map is flattened and a fully connected layer is applied to generate the latent feature with dimension 512. The decoder first repeats the latent vector by 25 times to generate a 5×5×512 tensor. 
Then it is followed by five 3×3 convolutional layers with nearest-neighbor upsampling in between, resulting in a feature map of size 80×80×32. Finally, we center-crop the feature map (70×70) and apply a 3×3 convolution layer to output a single-channel velocity map. All the aforementioned convolutional and upsampling layers are followed by batch normalization (Ioffe and Szegedy, 2015) and a leaky ReLU (Nair and Hinton, 2010) as the activation function.

C.2 Derivation of Forward Modeling in Practice
Similar to the finite difference in the time domain, in the 2D situation, applying the fourth-order central finite difference in space, the Laplacian of $p(\mathbf{r}, t)$ can be discretized as
\[ \nabla^2 p(\mathbf{r},t) = \frac{\partial^2 p}{\partial x^2} + \frac{\partial^2 p}{\partial z^2} \approx \frac{1}{(\Delta x)^2}\sum_{i=-2}^{2} c_i\, p^{t}_{x+i,z} + \frac{1}{(\Delta z)^2}\sum_{i=-2}^{2} c_i\, p^{t}_{x,z+i} + O\big[(\Delta x)^4 + (\Delta z)^4\big], \tag{C.1} \]
where $c_0 = -\frac{5}{2}$, $c_1 = \frac{4}{3}$, $c_2 = -\frac{1}{12}$, $c_i = c_{-i}$, and $x$ and $z$ stand for the horizontal offset and the depth of a 2D velocity map, respectively. For convenience, we assume that the vertical grid spacing $\Delta z$ is identical to the horizontal grid spacing $\Delta x$.

Given the approximations in Equations 3.21 and C.1, we can rewrite Equation 3.14 as
\[ p^{t+1}_{x,z} = (2 - 5\alpha)\, p^{t}_{x,z} - p^{t-1}_{x,z} - (\Delta x)^2\alpha\, s^{t}_{x,z} + \alpha \sum_{\substack{i=-2 \\ i\neq 0}}^{2} c_i\big(p^{t}_{x+i,z} + p^{t}_{x,z+i}\big), \tag{C.2} \]
where $\alpha = \big(\frac{v\Delta t}{\Delta x}\big)^2$.

During the simulation of the forward modeling, the boundaries of the velocity maps should be handled carefully because they may cause reflection artifacts that interfere with the desired waves. One of the standard methods to reduce the boundary effects is to add absorbing layers around the original velocity map. Waves are trapped and attenuated by a damping parameter when propagating through these absorbing layers. Here, we follow Collino and Tsogka (2001) and implement the damping parameter as
\[ \kappa = d(u) = \frac{3uv}{2L^2}\,\ln\Big(\frac{1}{R}\Big), \tag{C.3} \]
where $L$ denotes the overall thickness of the absorbing layers, $u$ indicates the distance between the current position and the closest boundary of the original velocity map, and $R$ is the theoretical reflection coefficient, chosen to be $10^{-7}$. With absorbing layers added, Equation 3.22 can ultimately be written as
\[ p^{t+1}_{x,z} = (2 - 5\alpha - \kappa)\, p^{t}_{x,z} - (1-\kappa)\, p^{t-1}_{x,z} - (\Delta x)^2\alpha\, s^{t}_{x,z} + \alpha \sum_{\substack{i=-2 \\ i\neq 0}}^{2} c_i\big(p^{t}_{x+i,z} + p^{t}_{x,z+i}\big). \tag{C.4} \]
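To make the update rule concrete, the following is a minimal NumPy sketch of one time step of Equation (C.4) (an illustration, not the dissertation's implementation; periodic wrap-around at the array edges is used for brevity, whereas in practice the grid is padded with the absorbing layers described above, and the damping field kappa is assumed precomputed from (C.3)):

import numpy as np

C = {0: -5.0 / 2.0, 1: 4.0 / 3.0, 2: -1.0 / 12.0}   # 4th-order central-difference coefficients

def wave_step(p_cur, p_prev, v, src, kappa, dt, dx):
    # One explicit time step of Eq. (C.4).
    # p_cur, p_prev: pressure fields at times t and t-1 (2D arrays);
    # v: velocity map, src: source term s^t, kappa: damping field.
    alpha = (v * dt / dx) ** 2
    lap = 2.0 * C[0] * p_cur                          # c_0 contributes once per spatial axis
    for i in (1, 2):
        lap += C[i] * (np.roll(p_cur, i, axis=0) + np.roll(p_cur, -i, axis=0)
                       + np.roll(p_cur, i, axis=1) + np.roll(p_cur, -i, axis=1))
    p_next = ((2.0 - kappa) * p_cur - (1.0 - kappa) * p_prev
              + alpha * lap - dx ** 2 * alpha * src)
    return p_next

Iterating this step over t yields the simulated seismic data used in the data-reconstruction loss of the CNN-PDE loop.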
C.3 OpenFWI Examples and Inversion Results of Different Methods

Figure C.1 More examples of velocity maps and their corresponding seismic measurements in the OpenFWI dataset. Columns: velocity map and seismic channels 1-5.

Figure C.2 Comparison of different methods on the inverted velocity maps of FlatFault. The details revealed by our UPFWI are highlighted. Columns: Ground Truth, InversionNet, VelocityGAN, H-PGNN+, UPFWI-24K (Ours), UPFWI-48K (Ours).

Figure C.3 Comparison of different methods on the inverted velocity maps of CurvedFault. The details revealed by our UPFWI are highlighted. Columns: Ground Truth, InversionNet, VelocityGAN, H-PGNN+, UPFWI-24K (Ours), UPFWI-48K (Ours).

C.4 Additional Experiment Results

Figure C.4 Results on the low-resolution Marmousi dataset. This dataset contains low-resolution velocity maps generated using style transfer with the Marmousi velocity map as the style image. Our UPFWI model yields good results in shallow regions, and it also captures some geological structures in deeper regions. A similar phenomenon is observed in the prediction of the smoothed Marmousi velocity map (bottom-right corner). Panels: ground truth and prediction pairs.

Figure C.5 Results on the salt-bodies dataset. This dataset contains more complicated velocity maps. Our UPFWI model yields good velocity map predictions (bottom) on both the salt bodies and the background geological structures compared to the ground truth (top).

Figure C.6 Results of UPFWI with different network architectures. We replace the CNN in our model with Vision Transformer (ViT) and MLP-Mixer as the encoder and test them on the FlatFault dataset. Both models yield reasonable velocity maps, demonstrating that our proposed learning paradigm is model-agnostic. Columns: Ground Truth, CNN, MLP-Mixer, ViT.

Figure C.7 Results of adding Gaussian noise to FlatFault. The model is trained on the clean data (without noise) and tested on different noise levels (clean, PSNR = 61.60 dB, 58.70 dB, 51.58 dB). Rows show the seismic input and the corresponding velocity map. Clearly, our method is robust to the noise, although slight degradation is observed when the noise level increases.

Figure C.8 Results of adding Gaussian noise to CurvedFault. The model is trained on the clean data (without noise) and tested on different noise levels (clean, PSNR = 61.72 dB, 58.70 dB, 51.68 dB). Rows show the seismic input and the corresponding velocity map. Similar to the results on FlatFault, our method is robust to the noise, although slight degradation is observed when the noise level increases.

Figure C.9 Results of randomly missing traces on FlatFault. The model is trained on the clean data (without missing traces) and tested on multiple missing rates from 5% to 25% (clean, 7, 10, and 17 missing traces). Rows show the seismic input and the corresponding velocity map. Our method is robust to the missing traces: although a higher missing rate leads to shifts in velocity values, the geological structures are well preserved.

Figure C.10 Results of randomly missing traces on CurvedFault. The model is trained on the clean data (without missing traces) and tested on multiple missing rates from 5% to 25% (clean, 7, 10, and 17 missing traces). Rows show the seismic input and the corresponding velocity map. Similar to the results on FlatFault, our method is robust to the missing traces: although a higher missing rate leads to shifts in velocity values, the geological structures are well preserved.

Additional experiments to investigate generalization. We conducted two additional experiments: (1) training our model on the CurvedFault dataset and further testing it on the FlatFault dataset (visualization results are shown in Figure C.11, and quantitative results in Table C.1); (2) testing our model on time-lapse imaging problems (visualization results are shown in Figure C.12). The results demonstrate that our proposed model has a certain degree of generalization ability.

Table C.1 Quantitative results of our UPFWI models evaluated on FlatFault.
Training Dataset   Test Dataset   MAE↓    MSE↓       SSIM↑
FlatFault          FlatFault      14.60   1146.09    0.9895
CurvedFault        FlatFault      50.80   17627.65   0.9253

Figure C.11 Results on generalization across datasets. The test is performed on FlatFault by applying a UPFWI model that is trained on the CurvedFault dataset. Although the artifacts are not negligible, the fault structures and velocity values are well preserved. This demonstrates that our model generalizes to a certain degree. Panels: ground truth and prediction.

Figure C.12 Results on generalizability over geological anomalies. The test is performed on a dataset where we add additional geological anomalies to simulate time-lapse imaging problems.
The velocity maps containing those anomalies are not included during training. However, our model captures the spatial and temporal dynamics of the anomalies in its predictions. This demonstrates that our model generalizes to a certain degree. Panels: ground truth and prediction at t = 0, 1, 2, 3.