BAYESIAN UNCERTAINTY QUANTIFICATION OF COMPUTER MODELS WITH EFFICIENT CALIBRATION AND COMPUTATION

By

Vojtech Kejzlar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2020

ABSTRACT

BAYESIAN UNCERTAINTY QUANTIFICATION OF COMPUTER MODELS WITH EFFICIENT CALIBRATION AND COMPUTATION

By Vojtech Kejzlar

The use of mathematical models, typically implemented in the form of computer code, has proliferated as a way to solve complex problems in many scientific applications such as nuclear physics and climate research. The computational and statistical tools of Uncertainty Quantification (UQ) are instrumental in assessing how accurately a computer model describes a physical process. The Bayesian framework for UQ has become the dominant approach because it provides a principled way of quantifying uncertainty in the language of probabilities. The ever-growing access to high performance computing in scientific communities has meanwhile created the need to develop next-generation tools and theory for the analysis of computer models. Motivated by practical research problems, this dissertation proposes novel computational tools and UQ methodology aimed at enhancing the quality of computer models, which leads to improved predictive capability and a more "honest" UQ.

First, we consider model uncertainty, which arises in situations when several competing models are available to describe the same or a similar physical phenomenon. One of the historically dominant methods to account for this source of uncertainty is Bayesian Model Averaging (BMA). We perform a systematic analysis of prediction errors and show that the use of the BMA posterior mean predictor leads to a reduction in mean squared error. In response to a recurrent research scenario in nuclear physics, BMA is extended to the situation where models are defined on non-identical study regions. We illustrate our methodology via pedagogical simulations and applications to forecasting nuclear observables, which exhibit improvements in both prediction error and empirical coverage probabilities.

In the second part of this dissertation, we concentrate on individual computer models, with particular focus on those which are computationally too expensive to be used directly for predictions. Furthermore, we consider computer models that need to be calibrated with experimental observations, because they depend on inputs whose values are generally unknown. We develop an efficient algorithm based on variational Bayes inference (VBI) for the calibration of computer models with Gaussian processes (GPs). To preserve the efficiency of VBI in the presence of dependent data, we adopt a pairwise decomposition of the data likelihood using vine copulas, which separate the information on the dependence structure in the data from their marginal distributions. We provide both theoretical and empirical evidence for the computational scalability of our algorithm and demonstrate the opportunities given by our method on a real-data example through calibration of the Liquid Drop Model of nuclear binding energies. As a fast and easy-to-implement alternative to the fully Bayesian treatment (such as the VBI approach), we propose an empirical Bayes approach to computer-enabled predictions of physical quantities. We offer a new perspective on the Bayesian calibration framework with GPs and provide its representation as a Bayesian hierarchical model.
Consequently, posterior consistency of the physical process is established, assuming certain smoothness properties of the GP priors and the existence of a strongly consistent estimator of the noise scale. A simulation study and a real-data example that support the consistency and efficiency of the empirical Bayes method are provided as well.

To Strýček for showing me it was possible.

ACKNOWLEDGEMENTS

I would like to take this opportunity to sincerely thank my advisors, Prof. Tapabrata Maiti and Prof. Frederi Viens, for their support, guidance, and encouragement. Prof. Maiti's exemplary commitment to research will continue to serve as a guide in my own academic pursuits. Thank you, Prof. Viens, for consistently providing me with resources and opportunities over the past years. I also thank my committee member Prof. Witold Nazarewicz, who readily welcomed me to the world of physics and allowed me to participate in exciting interdisciplinary projects. I would also like to thank Prof. R.V. Ramamoorthi for being a part of my committee, for our discussions, and for his insight. I would like to extend many thanks to my research partners, Dr. Léo Neufcourt, Dr. Shrijita Bhattacharya, Dr. Stefan Wild, Dr. P.-G. Reinhard, and Mookyong Son.

Lastly, I am grateful to my family. Thanks to my parents for their unequivocal support during all my studies and especially for their foresight in enrolling me in computer classes in elementary school despite my protests. Thanks to my brother, Jakub, for his willingness to meet up with me anywhere in the world, and thanks to my wife, May, for believing in me from the start and always cheering me on.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
CHAPTER 1  INTRODUCTION
  1.1 Computer models and sources of uncertainty
  1.2 Bayesian calibration of imperfect computer models
  1.3 Bayesian model averaging
  1.4 Dissertation outline
CHAPTER 2  SURVEY OF BAYESIAN MODEL AVERAGING WITH EXAMPLES AND EXTENSION TO DISCREPANT DOMAINS
  2.1 Optimality of BMA predictions
  2.2 BMA with discrepant domains
    2.2.1 Two models
    2.2.2 K models
  2.3 Examples and applications
    2.3.1 Averaging of proton potentials
    2.3.2 Averaging of nuclear mass emulators in the Ca region
    2.3.3 Averaging of the Liquid Drop Model variants
    2.3.4 Averaging of models with discrepant domains: a pedagogical example
  2.4 Technical details and supplementary results
    2.4.1 A simple example of evidence integral with closed form solution
    2.4.2 Proofs
    2.4.3 Supplement for the general case of K models
    2.4.4 Supplement for the examples and applications
CHAPTER 3  AN EFFICIENT ALGORITHM FOR BAYESIAN CALIBRATION OF COMPUTER MODELS VIA VARIATIONAL INFERENCE
  3.1 Variational Bayes inference
  3.2 Variational calibration of computer models
    3.2.1 Multivariate copulas and likelihood decomposition
    3.2.2 Scalable algorithm with truncated vine copulas
  3.3 Implementation details
    3.3.1 Selection of truncation level
    3.3.2 Variance reduction of Monte Carlo approximations
    3.3.3 Choice of the learning rate
    3.3.4 Parametrizations
  3.4 Applications
    3.4.1 Simulation study
    3.4.2 Calibration of the Liquid Drop Model
  3.5 Technical details and supplementary results
    3.5.1 Scalable algorithm with truncated vine copulas: C-vine
    3.5.2 Proofs
    3.5.3 Supplement for the calibration of the Liquid Drop Model
CHAPTER 4  EMPIRICAL BAYES CALIBRATION OF COMPUTER MODELS WITH CONSISTENT PREDICTIONS
  4.1 Hierarchical model for Bayesian calibration of computer models
  4.2 Posterior consistency, a theoretical validation
  4.3 Parameter estimation and prediction
    4.3.1 Estimation of hyperparameters
      4.3.1.1 Marginal data likelihood
      4.3.1.2 Predictive likelihood with K-fold cross-validation
    4.3.2 Algorithm for predictions
  4.4 Applications
    4.4.1 Transverse harmonic wave
    4.4.2 The Liquid Drop Model revisited
  4.5 Technical details and supplementary results
    4.5.1 Equivalency of hierarchical model
    4.5.2 Proofs
    4.5.3 Supplement for the transverse harmonic wave simulation
CHAPTER 5  CONCLUSION
  5.1 Future research
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: RMSE (in MeV) and the improvement under the BMA posterior mean predictor calculated on the testing dataset (n = 70, A = 250).

Table 2.2: Model posterior weights for 9 nuclear mass models with the RMSE (in MeV) and the MSE improvement for the training and the testing datasets. The last three rows correspond to the averaging with the prior weights, the simplified BMA (Neufcourt et al., 2019), and the full BMA.
Table 2.3: Posterior model weights under the averaging scenarios with two (L and H; left) and three (L, H, and L+H; right) models. The weights for the full intermediate domain of nuclei and the subset of 8 randomly selected nuclei are listed.

Table 2.4: The RMSEs (in MeV) of the predictions from the 4 LDM variants as well as the values from BMA, calculated on the held-out data in the intermediate domain of even-even nuclei from AME2003.

Table 2.5: Scheme depicting the observations contained in the training dataset of the models according to the proportion of shared data. The crosses mark the values contained in the domain of each model.

Table 2.6: Summary of the domain corrected BMA analysis in the asymmetric case of the pedagogical example.

Table 2.7: Sample size breakdown for the training (AME2003) and the testing (AME2016 \ AME2003) datasets of nuclear separation energies in the Ca region according to Z and N parities.

Table 2.8: The model posterior weights, RMSE (in MeV) and MSE improvement calculated on both the training (AME2003) and the testing (AME2016 \ AME2003) datasets for 3 nuclear mass models.

Table 2.9: Summary of the domain corrected BMA analysis in the symmetric case of the pedagogical example.

Table 3.1: The specification of GPs for the simulation study.

Table 3.2: Comparison of the MSE for the simple scenario using the MH, the NUTS, and the VC algorithms.

Table 3.3: The RMSE of the VC (Algorithm 3.2) after 24 hours dedicated to running the algorithm compared with the RMSE based on the LS estimates. The parameter estimates (and their standard errors) are also displayed.

Table 3.4: The space of calibration parameters used for generating the outputs of the semi-empirical mass formula (1.1).

Table 4.1: The RMSE comparison of the empirical Bayes approach and the fully Bayesian treatment. The GP hyperparameters were estimated using Algorithm 4.1.

Table 4.2: The estimates of calibration parameters and the noise scale under each method.

Table 4.3: The RMSEs of the predictions evaluated on 145 even-even nuclei from the AME2003 dataset. The parameter estimates are also listed. The posterior means are shown in the case of the MH algorithm.

LIST OF FIGURES

Figure 1.1: Realizations of a Gaussian process with zero mean and squared exponential covariance function with η = 1 and ℓ = 0.1.

Figure 2.1: The Woods-Saxon potential and the Coulomb potential along with the training (140 observations) and the testing datasets (70 observations) generated from the mixture of the two potentials.

Figure 2.2: ECPs for the testing dataset (m = 70, A = 250).

Figure 2.3: The ECPs calculated on the independent testing dataset (AME2016 \ AME2003).
Figure 2.4: Even-even nuclei from AME2003 divided into the domains of light (Z < 40, N < 50), heavy (Z > 50, N > 80), and intermediate nuclei (remaining 155 nuclei). The subset of 8 randomly selected nuclei is also depicted (from Kejzlar et al. (2020)).

Figure 2.5: The ECPs for the four LDM variants used in our study and the averaging scenarios with two (L and H) and three models (L, H, and L+H) (from Kejzlar et al. (2020)).

Figure 2.6: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the asymmetric case. The dashed line segments represent the translated values of the original observations.

Figure 2.7: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the symmetric case. The dashed line segments represent the translated values of the original observations.

Figure 3.1: A D-vine tree representation of a copula with 5 variables.

Figure 3.2: The approximate posterior distributions for the target calibration parameters. The VC (Algorithm 3.2) was carried out using an l = 3 truncated D-vine and compared with the results from the NUTS and the MH algorithm.

Figure 3.3: The evolution of the MSE of the posterior predictive means based on the VC with cumulatively implemented variance reduction techniques described in Section 3.3.2. The figure is based on an independently generated set of 50 testing points. Time and memory demands for each of the implementations are also plotted; the VC (Algorithm 3.2) was carried out using an l = 3 truncated D-vine.

Figure 3.4: The evolution of the MSE of the posterior predictive means based on the VC (Algorithm 3.2), the MH algorithm, and the NUTS. The figure is based on an independently generated set of 200 testing points. The VC (Algorithm 3.2) was carried out using an l = 5 truncated D-vine.

Figure 3.5: Recorded memory profiles of Algorithm 3.2, the MH algorithm, and the NUTS for the duration of 1 hour under the simulation scenario.

Figure 3.6: Experimental binding energies of nuclei in the AME2003 dataset (2225 observations).

Figure 3.7: The residual plot for 225 experimental binding energies in the testing dataset.

Figure 4.1: Detail of 95% credible bands plotted at t = 0.21.

Figure 4.2: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.21.

Figure 4.3: Binding energies of even-even nuclei in the AME2003 dataset divided into a testing and a training dataset.

Figure 4.4: Detail of 95% credible bands plotted at t = 0.00.

Figure 4.5: Detail of 95% credible bands plotted at t = 0.43.
Figure 4.6: Detail of 95% credible bands plotted at t = 0.71.

Figure 4.7: Detail of 95% credible bands plotted at t = 1.00.

Figure 4.8: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.00.

Figure 4.9: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.43.

Figure 4.10: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.71.

Figure 4.11: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 1.00.

LIST OF ALGORITHMS

Algorithm 3.1: Variational calibration with truncated D-vine copulas.
Algorithm 3.2: Variational calibration with truncated D-vine copulas II.
Algorithm 3.3: Variational calibration with truncated C-vine copulas.
Algorithm 3.4: Variational calibration with truncated C-vine copulas II.
Algorithm 4.1: Empirical Bayes algorithm for predictions of physical quantities.

KEY TO ABBREVIATIONS

BMA   Bayesian model averaging
CM    Computer model
CDF   Cumulative distribution function
DFT   Density functional theory
ECP   Empirical coverage probability
EB    Empirical Bayes
EDF   Energy density functional
HPD   Highest posterior density
GP    Gaussian process
LDM   Liquid Drop Model
MC    Monte Carlo
MCMC  Markov chain Monte Carlo
MH    Metropolis-Hastings
MSE   Mean square error
NUTS  No-U-Turn sampler
PMSE  Posterior mean square error
UQ    Uncertainty quantification
RMSE  Root mean square error
SGA   Stochastic gradient ascent
SVI   Stochastic variational inference
VI    Variational inference
VBI   Variational Bayes inference

CHAPTER 1

INTRODUCTION

With the advancements of computer architectures in the 21st century, mathematical models implemented on a computer, which we shall refer to as computer models, have become the driving force behind the acceleration of the cycle of the scientific process. This is because computer models are typically much faster, safer, and more economical to run than physical experiments. For example, experiments in high energy physics are conducted in particle colliders that cost billions of dollars and can take up to a decade to build. Moreover, some physical experiments associated with rare natural events such as volcanic eruptions or earthquakes are infeasible to carry out for all practical purposes.

Computer models, despite being an extremely useful tool (Box, 1976), are an imperfect representation of physical systems. The comprehensive study of the impact of all forms of modeling errors is called uncertainty quantification (UQ). The Bayesian methodology of UQ, which is the main approach considered in this work, has been a heavily utilized statistical device due to its natural way of describing uncertainty in the language of probabilities; see Higdon et al. (2015), McDonnell et al. (2015), and King et al. (2019) for examples in nuclear physics, Sexton et al. (2012) and Pollard et al. (2016) for examples in climatology, and Williams et al. (2006), Lawrence et al. (2010), Plumlee et al. (2016), and Zhang et al. (2019) for applications in engineering, astrophysics, and medicine.
Meanwhile, the incoming era of exascale computing (systems capable of 10^18 double precision floating point operations per second) has spawned the development of complex computer models that produce massive amounts of data. This consequently creates the need to bring the computational and statistical tools of UQ into the big-data age.

1.1 Computer models and sources of uncertainty

To illustrate the various sources of uncertainty in computer models on a simple example, let us consider the 4-parameter Liquid Drop Model (LDM) (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Kirson, 2008; Benzaid et al., 2020), which is a global (across the whole nuclear chart) model of nuclear binding energies: the minimum energy needed to disassemble the nucleus of an atom into unbound protons and neutrons. In principle, the LDM treats the nucleus like a drop of incompressible fluid of very high density. Despite this simplification, the LDM makes reasonable estimates of average properties of nuclei. The LDM is formulated through the semi-empirical mass formula as

    E_B(N, Z) = a_vol A − a_surf A^{2/3} − a_sym (N − Z)^2 / A − a_C Z(Z − 1) / A^{1/3},    (1.1)

where Z is the proton number, N is the neutron number, and A = Z + N is the mass number of the nucleus. The model parameters are (a_vol, a_surf, a_sym, a_C), representing the volume, surface, symmetry, and Coulomb energy, respectively. These parameters have specific physical meaning; a_vol, for instance, is proportional to the volume of the nucleus. See Krane (1987) for more details. We can identify the following sources of uncertainty, as proposed by Kennedy and O'Hagan (2001).

Parameter uncertainty: The model is a function of fixed but unknown parameters (a_vol, a_surf, a_sym, a_C). These parameters are context specific and need to be estimated with reported standard errors. The process of model fitting is also known as calibration.

Observation error: In estimating the unknown model parameters, we will be making use of experimental data from the actual physical process. These measurements typically contain some observation error that should be accounted for.

Model inadequacy: As we already mentioned at the beginning of this chapter, computer models are an imperfect representation of physical systems. Even if we knew the true values of the model parameters, the LDM predictions would not equal the true values of the physical process. This uncertainty (error) is often interpreted as "missing physics" in the model and is differentiated from the observation error by its systematic nature.

Parametric uncertainty: Note that the LDM is a linear function of the parameter vector (a_vol, a_surf, a_sym, a_C). It is possible that one or more predictor variables are highly linearly correlated (multicollinearity) and the LDM can be reduced to a model with fewer parameters.

Model uncertainty: The LDM is not the only model of nuclear binding energies. In fact, there are many alternative and competing models. In order to conduct comprehensive UQ of the modeling framework, we should allow for this possibility.

The subsequent two sections describe Bayesian formalisms that provide statistically principled ways to account for the various sources of uncertainty described above, with the exception of the parametric uncertainty. The parametric uncertainty is not the focus of this dissertation, and we refer the reader to Jaganathen et al. (2017) and Kejzlar et al. (2020) for some examples in nuclear physics.
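To make the formula concrete, the following minimal sketch evaluates (1.1) in Python; the coefficient values are rough textbook-style numbers (in MeV) chosen purely for illustration and are not the calibrated estimates obtained later in this dissertation.

```python
# A minimal sketch of the 4-parameter Liquid Drop Model (1.1); the default coefficients
# are illustrative textbook-style values in MeV, not the calibrated estimates from this work.
def liquid_drop_energy(Z, N, a_vol=15.5, a_surf=16.8, a_sym=23.0, a_C=0.72):
    """Binding energy E_B(N, Z) in MeV from the semi-empirical mass formula (1.1)."""
    A = Z + N
    return (a_vol * A
            - a_surf * A ** (2 / 3)
            - a_sym * (N - Z) ** 2 / A
            - a_C * Z * (Z - 1) / A ** (1 / 3))

# Example: a rough binding-energy estimate for 56Fe (Z = 26, N = 30).
print(liquid_drop_energy(Z=26, N=30))
```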
1.2 Bayesian calibration of imperfect computer models

Let us consider observations y = (y_1, ..., y_n) of a physical process ζ(t_i) depending on a known set of inputs t_i ∈ Ω ⊂ R^p following the relationship

    y_i = ζ(t_i) + σ ε_i,    i = 1, ..., n,    (1.2)

where σ represents the scale of the observation error (noise) and typically ε_i ~ i.i.d. N(0, 1). Our aim is to establish statistically principled predictions of new values y* = (y*_1, ..., y*_J) of the physical process ζ at yet-to-be-observed inputs (t*_1, ..., t*_J), using y and a computer model f_m defined as a mapping (t, θ) ↦ f_m(t, θ). As we can see, the computer model depends on an additional set of inputs θ ∈ Θ ⊂ R^d that we call calibration parameters. These are considered fixed but unknown quantities common to all the observations y_i and all the instances of the physical process that we intend to predict using the calibrated computer model. The calibration parameters represent inherent properties of the physical process that cannot be directly measured or controlled in an experiment. In the most rudimentary form, one can think of the calibration parameters as parameters in standard regression problems. To this extent, we suppose the relationship between the observations y_i, the physical process ζ, and the computer model f_m proposed by Kennedy and O'Hagan (2001):

    y_i = f_m(t_i, θ) + δ(t_i) + σ ε_i,    (1.3)

where δ(t_i) represents an unknown systematic error between the computer model and the physical process. While δ(t_i) is intrinsically deterministic, a non-parametric approach using a Gaussian process prior model is typically imposed for Bayesian inference.

Definition 1. δ(t) has a Gaussian process distribution if for every i = 1, 2, 3, ... the joint distribution of δ(t_1), ..., δ(t_i) is multivariate normal. It is fully characterized by the mean function m(t) = E[δ(t)] and the covariance function k(t, t') = Cov[δ(t), δ(t')]. We write δ(t) ∼ GP(m_δ(t), k_δ(t, t')).

Gaussian processes are a convenient way of placing a distribution over a space of functions, with the covariance function characterizing the relationship of the process at different inputs. Typically, the mean function is chosen to be zero or some dense family of basis functions (wavelets, Fourier, polynomials) across the input domain: m(·) = h(·)^T β, where h(·) = (h_1(·), ..., h_p(·)) are the basis functions and β is a hyperparameter. A typical choice for the covariance function is a stationary covariance function that depends on the inputs through t − t'. For example, a Gaussian kernel covariance function (also called squared exponential or radial basis function kernel) takes the form

    k(t, t') = η^2 exp( −(1/2) (t − t')^T M (t − t') ),

where M corresponds to a positive definite diagonal matrix of hyperparameters and η is a scaling parameter. We refer to the case of M = (1/ℓ^2) I, for some ℓ > 0, as an isotropic version of the kernel, because it is invariant to rotation. The case of M with different diagonal terms is called an anisotropic version of the kernel. Other popular choices for stationary covariance functions are Matérn kernels, polynomial kernels, or exponential kernels (Rasmussen and Williams, 2006). See Figure 1.1 to visualize realizations of a Gaussian process.

Figure 1.1: Realizations of a Gaussian process with zero mean and squared exponential covariance function with η = 1 and ℓ = 0.1.
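For readers who want to reproduce draws such as those in Figure 1.1, the following minimal sketch (not the dissertation's published code) samples realizations of a zero-mean Gaussian process with the isotropic squared exponential kernel, using η = 1 and ℓ = 0.1.

```python
# A minimal sketch of drawing realizations from a zero-mean Gaussian process with the
# isotropic squared exponential kernel (eta = 1, l = 0.1), as in Figure 1.1.
import numpy as np

def sq_exp_kernel(t, t_prime, eta=1.0, l=0.1):
    """k(t, t') = eta^2 exp(-(t - t')^2 / (2 l^2)) for scalar inputs t, t'."""
    return eta ** 2 * np.exp(-0.5 * (t - t_prime) ** 2 / l ** 2)

t = np.linspace(0.0, 1.0, 200)
K = sq_exp_kernel(t[:, None], t[None, :]) + 1e-8 * np.eye(len(t))  # jitter for stability
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mean=np.zeros(len(t)), cov=K, size=3)
# draws has shape (3, 200); plotting each row against t gives Figure 1.1-style curves.
```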
It is often the case that the evaluation of the computer model f_m is too expensive in terms of both time and space (memory). It is common practice to reduce the number of necessary computer model evaluations by considering a Gaussian process prior model

    f_m(t, θ) ∼ GP(m_f(t, θ), k_f((t, θ), (t', θ'))).

In this setup, the data also include a set of model evaluations z = (z_1, ..., z_s) over a grid {(t̃_1, θ̃_1), ..., (t̃_s, θ̃_s)}. These are usually selected using some space-filling design such as a uniform or Latin hypercube design (Morris and Mitchell, 1995), which is a design that has a good coverage of the space with evenly distributed points in each one-dimensional projection. The complete data set d in the case of computationally expensive models consists of the n observations y_i from the physical process ζ and the s evaluations z_j of the computer model f_m, i.e., d = (d_1, ..., d_{n+s}) := (y, z). We shall denote the set of unknown parameters as φ = (θ, γ, σ), with γ denoting the set of hyperparameters of the Gaussian processes' mean and covariance functions. Consequently, the complete dataset d conditioned on (θ, γ, σ) follows the multivariate normal distribution

    d | θ, γ, σ ∼ N(M(θ, γ), K(θ, γ, σ)),    (1.4)

where

    M(θ, γ) = ( M_f(T_y(θ)) + M_δ(T_y),  M_f(T_z(θ̃)) )^T    (1.5)

and

    K(θ, γ, σ) = [ K_f(T_y(θ), T_y(θ)) + K_δ(T_y, T_y) + σ^2 I_n    K_f(T_y(θ), T_z(θ̃)) ]
                 [ K_f(T_z(θ̃), T_y(θ))                              K_f(T_z(θ̃), T_z(θ̃)) ].    (1.6)

Here, K_f(T_y(θ), T_y(θ)) is the matrix with (i, j) element k_f((t_i, θ), (t_j, θ)), K_δ(T_y, T_y) is the matrix with (i, j) element k_δ(t_i, t_j), and K_f(T_z(θ̃), T_z(θ̃)) is the matrix with (i, j) element k_f((t̃_i, θ̃_i), (t̃_j, θ̃_j)). We can similarly define K_f(T_y(θ), T_z(θ̃)) with the kernel k_f.

Under this framework, Bayesian calibration consists of deriving the full posterior distribution p(φ|d) given by Bayes' theorem, namely

    p(φ|d) = p(d|φ) p(φ) / ∫ p(d|φ) p(φ) dφ ∝ p(d|φ) p(φ),    (1.7)

where p(φ) expresses our prior uncertainty about the unknown parameters. The Bayesian predictions of y* are specified by the posterior predictive distribution p(y*|d). This is given by integrating the conditional density of y*, given φ and the data d, against the posterior density p(φ|d):

    p(y*|d) = ∫ p(y*|d, φ) p(φ|d) dφ.    (1.8)

The conditional density p(y*|d, φ) is a multivariate normal density given directly by the statistical model (1.3) and the specification of the Gaussian processes. We postpone the detailed description of this likelihood to Chapter 3.
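The block structure of (1.5)–(1.6) is easy to assemble in practice. The following simplified sketch does so for a one-dimensional input t, with hypothetical squared exponential kernels and zero mean functions standing in for k_f, k_δ, m_f, and m_δ; with zero means, the complete-data likelihood (1.4) is then the multivariate normal density N(0, K) evaluated at the stacked vector (y, z).

```python
# A minimal sketch (hypothetical kernels, zero mean functions, scalar input t) of assembling
# the joint covariance matrix K(theta, gamma, sigma) in (1.6) for d = (y, z): an emulator
# kernel k_f over the joint input (t, theta), a discrepancy kernel k_delta over t alone, and
# observation noise added only to the physical-data block.
import numpy as np

def sq_exp(X, Y, eta, l):
    """eta^2 exp(-||x - y||^2 / (2 l^2)) for rows of X and Y."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return eta ** 2 * np.exp(-0.5 * d2 / l ** 2)

def joint_covariance(t_y, theta, t_z, theta_z, sigma, eta_f=1.0, l_f=0.3, eta_d=0.5, l_d=0.5):
    """t_y: (n,), theta: (d,), t_z: (s,), theta_z: (s, d); returns the (n+s) x (n+s) matrix."""
    n = len(t_y)
    Xy = np.column_stack([t_y, np.tile(theta, (n, 1))])   # T_y(theta): observations at theta
    Xz = np.column_stack([t_z, theta_z])                  # T_z(theta_tilde): design of model runs
    Kyy = (sq_exp(Xy, Xy, eta_f, l_f)
           + sq_exp(t_y[:, None], t_y[:, None], eta_d, l_d)
           + sigma ** 2 * np.eye(n))
    Kyz = sq_exp(Xy, Xz, eta_f, l_f)
    Kzz = sq_exp(Xz, Xz, eta_f, l_f)
    return np.block([[Kyy, Kyz], [Kyz.T, Kzz]])
```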
Here we point out a few caveats of the framework described above. First, the calibration parameter θ is in general non-identifiable. Indeed, δ(t_i) = ζ(t_i) − f_m(t_i, θ) yields the same distribution for y_i for any choice of θ. Several authors have pointed this out and proposed various methods to mitigate the problem, including Brynjarsdóttir and O'Hagan (2014); Plumlee (2017); Tuo and Wu (2015, 2016); Bayarri et al. (2007). Our main goal here, nonetheless, is not the correct identification of θ but prediction. Second, the posterior distribution p(φ|d) does not have a closed form and needs to be approximated. The traditionally used Markov chain Monte Carlo (MCMC) methods that approximate p(φ|d)—such as the Metropolis-Hastings (MH) algorithm (Chib and Greenberg, 1995) or more advanced ones including Hamiltonian Monte Carlo or the No-U-Turn sampler (NUTS) (Homan and Gelman, 2014)—work only with a relatively small sample size because of the computational costs associated with the evaluation of p(d|φ). This clearly calls for the development of computationally efficient alternatives to the traditional approaches.

1.3 Bayesian model averaging

Bayesian model averaging (BMA) is the natural Bayesian framework in scenarios with several competing models M_1, ..., M_K, when one is not comfortable selecting a single model at the desired level of certainty (Bernardo and Smith, 1994; Kass and Raftery, 1995; Hoeting et al., 1999; Wasserman, 2000). The seminal review work by Geweke (1999) introduced BMA in econometrics and later in other fields such as political and social sciences; BMA has also been applied to the medical sciences (Balasubramanian et al., 2014; Schorning et al., 2016), ecology and evolution (Silvestro et al., 2014; Hooten and Hobbs, 2015), genetics (Wei et al., 2011; Wen, 2015), astrophysics (Parkinson and Liddle, 2013), fluid dynamics (Radaideh et al., 2019), machine learning (Clyde et al., 2011; Hernández et al., 2018), and lately in nuclear physics (Neufcourt et al., 2019, 2020a,b; Kejzlar et al., 2020). For any quantity of interest O, e.g., the value y*, the BMA posterior density p(O|d) corresponds to the mixture of the posterior densities of the individual models:

    p(O|d) = Σ_{k=1}^{K} p(O|d, M_k) p(M_k|d),    (1.9)

where d are the given datapoints. These datapoints are typically the observations y, unless we consider the specific scenario of computationally expensive computer models in Section 1.2, where we also include the set of model runs z. For notational consistency, we shall denote the set of datapoints as d throughout this dissertation, with the actual content of d clarified by the context in which it is considered. Formula (1.9) expresses that the actual posterior probability of a quantity of interest O is the average of O's posterior distributions given each model, weighted by the model posterior probabilities. In other words, (1.9) is simply a mixture of K distributions, which makes sampling from the BMA posterior density immediate once we obtain the posterior samples under each model. The posterior model weights p(M_k|d) are the posterior probabilities that a given model is the hypothetical true model; they are given by a simple application of Bayes' theorem:

    p(M_k|d) = p(d|M_k) p(M_k) / Σ_{ℓ=1}^{K} p(d|M_ℓ) p(M_ℓ),    (1.10)

where p(M_k) represents the prior probability that M_k is the true model. The so-called evidence (integrals) p(d|M_k) are obtained by integrating the data likelihood against the prior density of the model parameters φ_k, namely

    p(d|M_k) = ∫ p(d|φ_k, M_k) p(φ_k|M_k) dφ_k.    (1.11)

Additionally, the definition of expected value yields the posterior mean of O as

    E[O|d] = Σ_{k=1}^{K} E[O|d, M_k] p(M_k|d),    (1.12)

and the well-known conditional variance formula (Casella and Berger, 2002) yields the posterior variance of O, given d, as

    Var[O|d] = Σ_{k=1}^{K} p(M_k|d) Var[O|d, M_k] + Var[E(O|d, M)|d].    (1.13)

Note that the term Var[E(O|d, M)|d] is the variance of a function of the discrete random variable M (ranging over the set of all models being considered), which accounts for the model uncertainty. This model uncertainty is not accounted for by individual models; its inclusion thus allows for a more honest UQ.
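Because (1.9) is a finite mixture, BMA predictions can be computed directly from per-model posterior draws and the weights (1.10). The sketch below (with hypothetical inputs, not results from this work) samples a quantity O from the BMA posterior and evaluates the BMA posterior mean (1.12).

```python
# A minimal sketch of working with the BMA posterior (1.9) as a mixture: given posterior
# draws of O under each model and the posterior model weights p(M_k | d), sample O from
# the BMA posterior and compute its posterior mean (1.12). Inputs are hypothetical.
import numpy as np

def bma_sample(per_model_draws, weights, size, seed=0):
    """per_model_draws: list of 1-D arrays of posterior draws of O, one array per model."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    ks = rng.choice(len(per_model_draws), p=w, size=size)       # model indicator draws
    return np.array([rng.choice(per_model_draws[k]) for k in ks])

def bma_mean(per_model_draws, weights):
    """Weight-averaged posterior mean, i.e., the sample analogue of (1.12)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return float(sum(wk * np.mean(draws) for wk, draws in zip(w, per_model_draws)))
```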
One of the challenges of BMA is that it becomes unclear how one should proceed in scenarios where alternative models are defined on different subsets of the same input space. This is, for example, a usual situation in nuclear physics, for instance for nuclear mass models: ab initio (also known as A-body) models range over lighter nuclei due to contemporary computational limitations, while Energy Density Functionals (EDF) can cover the whole nuclear chart (Klupfel et al., 2009; Kortelainen et al., 2010b).

1.4 Dissertation outline

The main content of this dissertation is organized as follows. Chapter 2 provides a survey of the remaining details for successful implementation of the BMA framework, with particular focus on the calculation of the evidence integral (1.11). We perform a systematic analysis of the prediction errors, focusing on the fact that BMA is the optimal linear combination (projection) in the L2 sense under the posterior probability distribution, among all the possible mixtures of models. Motivated by recurrent scenarios in nuclear physics, we subsequently extend BMA to situations when the different models constrain different subsets of the data. Lastly, we present a set of pedagogical examples as well as real-data applications of the BMA methodology highlighting its benefits in terms of the improvement of prediction accuracy and UQ. Some results from this chapter are also provided in Kejzlar et al. (2019).

Chapter 3 presents a novel and computationally efficient algorithm based on variational Bayes inference (VBI) for the calibration of computer models with Gaussian processes. We provide both theoretical and empirical evidence for the computational scalability of our methodology and describe all the necessary details for an efficient implementation of the proposed algorithm. We demonstrate the opportunities given by our method for practitioners on a real-data example through the calibration of the Liquid Drop Model of nuclear binding energies. The algorithmic development done in this chapter is also provided in Kejzlar and Maiti (2020).

Chapter 4 develops an empirical Bayes (EB) approach for the Bayesian calibration framework outlined in Section 1.2 that can be understood as an easy-to-implement and fast approximation of the fully Bayesian treatment. Firstly, we utilize the structural convenience of Gaussian processes and restate the calibration framework as a Bayesian hierarchical model. Secondly, we make use of this new representation and extend the results of Choi and Schervish (2007a) on non-parametric regression problems to theoretically investigate the proposed EB approach. A numerical simulation study and a real-data example are also provided.

In Chapter 5, we discuss the likely future theoretical and computational extensions of the methodologies developed in Chapters 2 – 4.

For ease of readability, all proofs of lemmas, propositions, and theorems, together with the technical details of the numerical studies, are provided in the section titled "Technical details and supplementary results" at the end of the respective chapter. We also provide fully documented Python code that reproduces all the results in this dissertation and can be easily modified and used by practitioners in a public repository at https://github.com/kejzlarv.
CHAPTER 2

SURVEY OF BAYESIAN MODEL AVERAGING WITH EXAMPLES AND EXTENSION TO DISCREPANT DOMAINS

Interest in model averaging arises, as discussed in Chapter 1, in situations when several competing models are available to solve the same or a similar problem, and no single model can be selected at a desired level of certainty. For example, there is a multitude of competing computer models for numerical weather prediction, including the American model (Global Forecast System) and the European model (European Centre for Medium-Range Weather Forecasts) (Lynch, 2008). In nuclear physics, alternative models arise through different theoretical strategies in modeling atomic nuclei such as the A-body modeling approach or the density functional theory (DFT) (Nazarewicz, 2016).

In this chapter, we consider a general situation where measurements (t_i, y_i), i = 1, ..., n, of a physical process t ↦ ζ(t) are used to predict new values y* of the physical process ζ, where t ∈ Ω ⊂ R^p. Furthermore, we suppose there are K competing models M_1, ..., M_K of the observations y_i, where the kth model is parametrized by a vector of unknown parameters φ_k ∈ R^{p_k} for k = 1, ..., K and p_k ≥ 1 (e.g., for the Bayesian calibration with Gaussian processes, φ = (θ, γ, σ)). Given each model, we consider the data likelihood p(d|φ_k, M_k) and the prior density p(φ_k|M_k); the dataset d typically consists of the experimental observations y = (y_1, ..., y_n) only; however, it can also include a set of computer model evaluations z = (z_1, ..., z_s) under the Bayesian calibration framework with computationally expensive models described in Section 1.2.

BMA provides a way of accounting for the model uncertainty induced by the existence of alternative models. If O is the quantity of interest, e.g., the value y*, the BMA posterior density p(O|d) corresponds to the mixture of the posterior densities of the individual models:

    p(O|d) = Σ_{k=1}^{K} p(O|d, M_k) p(M_k|d).    (2.1)

The posterior weights p(M_k|d) are given by a simple application of Bayes' theorem:

    p(M_k|d) = p(d|M_k) p(M_k) / Σ_{ℓ=1}^{K} p(d|M_ℓ) p(M_ℓ).    (2.2)

These weights are determined by two quantities that are the key to a successful implementation of the BMA framework. First, one needs to assign suitable prior probabilities p(M_k) that M_k is the true model. Hoeting et al. (1999) note that,

    When there is little prior information about the relative plausibility of the models considered, the assumption that all models are equally likely a priori is a reasonable "neutral" choice.

One can, nevertheless, choose informative prior distributions when prior information about the plausibility of each model is available. Eliciting an informative prior is a non-trivial task, but Madigan et al. (1995) provide some guidance in the context of graphical models that can be applied in other settings as well. The second key quantity is the evidence integral

    p(d|M_k) = ∫ p(d|φ_k, M_k) p(φ_k|M_k) dφ_k.    (2.3)

The numerical evaluation of evidence integrals is challenging in practice, because a closed form solution is available only in elementary situations for exponential family distributions with conjugate priors (see Section 2.4 for a simple example), and it thus requires approximation. The simplest and most commonly used approximation in the literature, and the one we have adopted in our applications, is the Monte Carlo (MC) integration estimate

    p̂_MC(d|M_k) = (1/n_MC) Σ_i p(d|φ_k^{(i)}, M_k),    (2.4)

where the φ_k^{(i)} are i.i.d. samples from the prior p(φ_k|M_k) for i = 1, ..., n_MC.
While this MC integration yields reasonable results, it requires separate evaluations of the likelihood at new samples from the prior p(φ_k|M_k), which can be very costly in computing time.

Another frequently used method is the Laplace approximation, which relies on the fact that the integral (2.3) has a closed form in the case of a linear regression with Gaussian noise. It corresponds to a second-order Taylor expansion of the log-likelihood around its maximum, which makes the likelihood Gaussian. Namely, the Laplace approximation is

    p̂_L(d|M_k) = (2π)^{p_k/2} |Σ̃_k|^{1/2} p(d|φ̃_k, M_k) p(φ̃_k|M_k),    (2.5)

where φ̃_k is the mode of p(φ_k|d, M_k) and Σ̃_k = (−D^2 l(φ̃_k))^{−1} is the inverse of the Hessian matrix of second derivatives (evaluated at φ̃_k) of l(φ_k) = log(p(d|φ_k, M_k) p(φ_k|M_k)). The Laplace method typically gives very good results for very peaked likelihoods. We refer the reader to Kass and Raftery (1995) for an exhaustive survey of classical methods used to compute the evidence integral. Also, the more recently proposed Nested Sampling algorithm by Skilling (2006), expanded by Feroz et al. (2009), provides another alternative to these classical approaches.
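As an illustration of the estimate (2.4), the following minimal sketch averages the data likelihood over prior draws on the log scale; log_likelihood and prior_sampler are hypothetical user-supplied callables for a given model M_k, and the log-sum-exp step guards against numerical underflow of full-dataset likelihoods.

```python
# A minimal sketch of the Monte Carlo approximation (2.4) of the evidence integral (2.3):
# average the data likelihood over i.i.d. draws from the prior. The callables below are
# hypothetical placeholders for a given model M_k.
import numpy as np

def mc_log_evidence(log_likelihood, prior_sampler, n_mc=10_000):
    """Returns log of (1/n_mc) * sum_i p(d | phi_i, M_k) with phi_i ~ p(phi_k | M_k)."""
    log_liks = np.array([log_likelihood(prior_sampler()) for _ in range(n_mc)])
    m = log_liks.max()                                   # log-sum-exp to avoid underflow
    return m + np.log(np.mean(np.exp(log_liks - m)))
```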
BMA, while a conceptually straightforward and natural approach to account for model uncertainty, becomes challenging in scenarios where alternative models are defined on different subsets of the same input space; this can typically arise with local models or with numerical models with different constraints. It is also a usual situation in nuclear physics, for instance for nuclear mass models: ab initio models range over lighter nuclei due to contemporary computational limitations, while EDFs can cover the whole nuclear chart (Klupfel et al., 2009; Kortelainen et al., 2010b). This also happens when one considers mixing models produced by the calibration of observables of different types – typically some nuclear models are fitted on nuclear binding energies, while others are fitted on binding energies and other observables such as rms charge radii (a measure of the size of an atomic nucleus). Surprisingly, we have not found in the literature a principled approach to adapt BMA to this situation, or to compare models with similar, overlapping, but significantly non-identical domains. To address this "domain discrepancy", in Section 2.2 we present a method which relaxes the requirement that all models cover the same domain (i.e., that d is common to all models considered). Other applications of our framework could include time series with missing data or different time scales, e.g., in a financial setting where additionally different classes of assets can be treated as observables.

The remaining sections of this chapter are organized as follows. Section 2.1 provides a systematic analysis of prediction errors under individual models as compared to the BMA framework. Section 2.2 develops the BMA methodology for models with discrepant domains. Section 2.3 contains an extensive collection of simulation studies, pedagogical examples, and real-data applications highlighting the benefits of BMA in terms of the improvement of prediction accuracy and UQ. All technical details and supplementary results are provided in Section 2.4.

2.1 Optimality of BMA predictions

BMA is not the only way to deal with several alternative models and to account for model uncertainty, but it does have the property of reducing the posterior mean square error (PMSE) of prediction of a new observation y*. In this section, we illustrate this property in a clear and concise way. Let us, for simplicity of notation, consider two competing models M_1 and M_2; the treatment of multiple models follows from a similar argument, and our verbal descriptions below occasionally refer to the general case without further comment. Denote by ŷ*_1 := E[y*|d, M_1] and ŷ*_2 := E[y*|d, M_2] the posterior means of y* under each model, and let ŷ* := E[y*|d]. We also define p_k := p(M_k|d), k = 1, 2, for the posterior probability of each model. Thus the BMA posterior mean estimator (1.12) can be written as ŷ* = p_1 ŷ*_1 + p_2 ŷ*_2. The PMSE of y* is then defined as E[(ŷ* − y*)^2|d] and has the following decomposition.

Lemma 1. For every λ_1, λ_2 ≥ 0 satisfying λ_1 + λ_2 = 1, we have

    E[(y* − ŷ*)^2|d] = E[(y* − λ_1 ŷ*_1 − λ_2 ŷ*_2)^2|d] − [(λ_1 − p_1) ŷ*_1 + (λ_2 − p_2) ŷ*_2]^2.    (2.6)

This lemma shows explicitly that the PMSE of the BMA predictor is smaller than the PMSE associated with any convex combination λ_1 ŷ*_1 + λ_2 ŷ*_2 of the two models' posterior means. It also measures how much smaller it is, and shows that equality holds as soon as the convex coefficients λ_k are equal to the posterior probabilities p_k of each model, k = 1, 2. Specifically, by applying Lemma 1 twice, with (λ_1, λ_2) = (1, 0) and with (λ_1, λ_2) = (0, 1), we obtain the following dual expression for the PMSE of the BMA predictor, involving each individual model's PMSE and showing how much smaller the former is compared to the two latter:

    E[(y* − ŷ*_1)^2|d] − p_2^2 (ŷ*_1 − ŷ*_2)^2 = E[(y* − ŷ*)^2|d] = E[(y* − ŷ*_2)^2|d] − p_1^2 (ŷ*_1 − ŷ*_2)^2.    (2.7)

The relationship (2.7) directly implies

    E[(y* − ŷ*)^2|d] ≤ E[(y* − ŷ*_k)^2|d],    k = 1, 2.    (2.8)

This inequality clearly states that the BMA estimator (1.12) gives a prediction error at least as small as the best of the models considered, in the PMSE sense. We interpret this as a translation of the fact that each model that goes into creating the BMA estimator necessarily ignores model uncertainty. Note that this says nothing about how the BMA estimator would compare to a model not used in its definition. Moreover, since Lemma 1 covers all convex combinations of the original models, it shows that BMA achieves the following minimum:

    (p(M_k|d))_{k=1,2} = arg min_{λ ∈ [0,1]^2 : λ_1 + λ_2 = 1} E[(y* − (λ_1 ŷ*_1 + λ_2 ŷ*_2))^2|d].    (2.9)

Hence, the BMA estimator is actually optimal over all convex combinations of the individual estimators ŷ*_1 and ŷ*_2. The optimality of BMA can also be established from a decision-theoretic perspective; see Chapter 6 in Bernardo and Smith (1994) for details. We can also express the reduction of the PMSE for the BMA estimator, compared to the best (lowest) PMSE among all of the individual models, as

    r^2_BMA := 1 − E[(ŷ* − y*)^2|d] / min_k E[(ŷ*_k − y*)^2|d],    k = 1, ..., K.    (2.10)

In the specific case of two competing models, if we assume for instance that the "best" model is M_2, we can obtain an even more explicit expression for r^2_BMA, which provides the relative gain attained by BMA, namely

    r^2_BMA = p(M_1|d)^2 (ŷ*_1 − ŷ*_2)^2 / E[(ŷ*_2 − y*)^2|d].    (2.11)

Below in Section 2.3, we denote the sample version of the expression in (2.10) as r̂^2_BMA, which we will use to evaluate the performance of BMA quantitatively.
To finish this section, we decompose the quantity E[(ŷ* − y*)^2|d] against the residuals (ŷ*_k − y*), k = 1, 2, from each individual model, assuming p_1, p_2 > 0. This is easily done by symmetrizing formula (2.7), reintroducing y* to identify these residuals, and then taking another conditional expectation with respect to d to avoid an expression which depends on unobserved data. We obtain

    E[(y* − ŷ*)^2|d] = (p_1 − p_1^2) E[(y* − ŷ*_1)^2|d] + (p_2 − p_2^2) E[(y* − ŷ*_2)^2|d] − (p_1^2 + p_2^2) E[(ŷ*_1 − y*)(y* − ŷ*_2)|d].    (2.12)

Formula (2.12) shows that the PMSE of the BMA estimator is an explicit linear combination of the prediction errors of the estimators for each constituent model, but that one must subtract a coupling correction term on the right-hand side of (2.12). It is interesting to note that the weights in the aforementioned linear combination can be interpreted as the variances of Bernoulli random variables with the posterior model probabilities p_1 and p_2 as their success probabilities. Also note that, since these variances satisfy p_k − p_k^2 < p_k, the linear combination is not convex, but is smaller. The correction term is not necessarily a subtraction of a positive term, but it is likely to be when both individual models have significant biases in opposite directions for the prediction of y*. This is particularly interesting when the two models have similar posterior performances. Both values of p_k will in this situation be close to 1/2, which minimizes the values of p_k − p_k^2 for both k = 1, 2. This is a scenario where using BMA will significantly improve prediction errors even when each model is competitive compared to the other, regardless of how large the individual models' biases are, and without knowing in which direction they go, as long as the two models are assumed to have significant defects that work in opposite directions.

A sanity check reveals an interesting characteristic of BMA: suppose that p_1 = 1, so that the BMA estimate is given by ŷ* = ŷ*_1. According to (2.12), we must have E[(y* − ŷ*)^2|d] = 0, and further E[(y* − ŷ*)^2] = 0, i.e., y* = ŷ* = ŷ*_1 a.s. given d; in other words, model 1 must provide a perfect description of the reality.
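The identities above are straightforward to verify numerically. The following toy sketch (synthetic posterior draws, not data from this dissertation) checks the dual expression (2.7) by treating the BMA posterior of y* as the mixture of two model posteriors with weights p_1 and p_2.

```python
# A minimal numerical check of the dual PMSE identity (2.7), using synthetic posterior
# draws in a toy two-model setting; all numbers below are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(1)
p1, p2 = 0.3, 0.7                                   # posterior model weights
s1 = rng.normal(2.0, 1.0, size=200_000)             # draws of y* | d, M1
s2 = rng.normal(3.5, 0.8, size=200_000)             # draws of y* | d, M2
yhat1, yhat2 = s1.mean(), s2.mean()
yhat = p1 * yhat1 + p2 * yhat2                      # BMA posterior mean (1.12)

mix = np.where(rng.random(s1.size) < p1, s1, s2)    # draws from the BMA posterior of y* | d
pmse_bma = np.mean((mix - yhat) ** 2)
pmse_1 = np.mean((mix - yhat1) ** 2)                # PMSE of model 1's posterior mean
pmse_2 = np.mean((mix - yhat2) ** 2)                # PMSE of model 2's posterior mean

gap2 = (yhat1 - yhat2) ** 2
# All three printed quantities agree up to Monte Carlo error, as (2.7) asserts.
print(pmse_bma, pmse_1 - p2 ** 2 * gap2, pmse_2 - p1 ** 2 * gap2)
```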
2.2 BMA with discrepant domains

Let us continue the discussion about BMA of models with similar, overlapping, but significantly non-identical domains from the beginning of this chapter in a formal setting. Let us consider two models M_A and M_B, which we will also denote by (A) and (B) or merely A and B for simplicity, and assume that they are respectively defined only on different strict subsets t^{(A)} and t^{(B)} of the data. We denote by d^{(A)} and d^{(B)} the corresponding subsets of the data d, as well as by d^{(−A)} and d^{(−B)} their respective complements in d. The actual Bayesian evidences for these models are the probabilities p(d|A) and p(d|B), but these quantities are not clearly defined. On the other hand, p(d^{(A)}|A) and p(d^{(B)}|B), where each model refers only to its original range of validity, are the evidences of the models corresponding to the classical BMA theory described in Section 1.3 and at the beginning of this chapter. Nevertheless, we have the following expansion:

    p(d|A) = p(d^{(A)}, d^{(−A)}|A) = p(d^{(A)}|A) p(d^{(−A)}|d^{(A)}, A).    (2.13)

This expression means that to obtain model (A)'s actual Bayesian evidence, p(d^{(A)}|A) must be multiplied by a corrective factor p(d^{(−A)}|d^{(A)}, A), which represents the information one has on d^{(−A)} assuming that model (A) holds and that it does not provide any prediction at the data points in d^{(−A)}. Note that the distribution p(d|A) is meaningful only to the extent that d – and thus d^{(−A)} – is measurable in the underlying probability space, which implies the existence of underlying distributions p(d^{(−A)}) and subsequently of p(d^{(−A)}|d^{(A)}) and p(d^{(−A)}|d^{(A)}, A). To that extent, the problem of averaging models with different domains can be ill posed if these distributions cannot be defined convincingly.

If the data d^{(A)} and d^{(−A)} are independent, conditionally on model (A), in other words, if no information can be gleaned about d^{(−A)} from d^{(A)} or from (A), i.e., d^{(−A)} is unconstrained by (A) and by d^{(A)}, then it is legitimate to ignore the aforementioned correction factor, which should be p(d^{(−A)}|d^{(A)}, A) = 1. In particular, this is the case if, given model (A), d^{(−A)} is considered deterministically equal to its sample value. Conversely, setting the corrective factor to p(d^{(−A)}|d^{(A)}, A) = 1 outside of this scope is an approximation to that extent, and not in general a fair evaluation of the information contained in the "globality" of a model. We shall refer to this case as BMA with independent model domains. Although it has been adopted as a natural matter of convenience, it raises serious concerns, for which we cannot find better words than Trotta's assessment (Trotta, 2008):

    On the other hand, it is important to notice that the Bayesian evidence does not penalize models with parameters that are unconstrained by the data. It is easy to see that unmeasured parameters (i.e. parameters whose posterior is equal to the prior) do not contribute to the evidence integral, and hence model comparison does not act against them, awaiting better data.

Let us point out an extreme situation that occurs when model (A) predicts values d^{(−A)} that have no physical meaning; e.g., in the case of nuclear mass models, this can be the mass of a nucleus which a model predicts not to exist, and therefore the mass has no physical meaning. In this case, the model (A) is actually strongly constrained by d^{(−A)}, to the point that p(d^{(−A)}|A) = 0, yielding p(d^{(−A)}|d^{(A)}, A) = 0, which rules the model (A) impossible as long as d^{(−A)} is not empty.

Another tempting option is to restrict the domain of interest to the domain common to all models and simply consider p(d^{(A)∩(B)}|A) and p(d^{(A)∩(B)}|B), which can be obtained in a standard way according to (2.3). As we ignore even more data, this approach is arguably worse than setting p(d^{(−A)}|d^{(A)}, A) = 1.

Let us illustrate how the assumption of independent model domains, namely setting p(d^{(−A)}|d^{(A)}, A) = 1, can fail to provide a satisfactory ranking of models in two examples where a model takes a shortcut by "refusing" to predict challenging points.

Scenario 1. Consider the situation where one model M_0 is empty, so that p(M_0|d) ∝ p(M_0). On the other hand, any other model which constrains any part of the data will have an evidence most likely lower than 1, which implies that the model will end up with a lower posterior weight when starting from equal prior weights. Thus any predictive model will be deemed inferior to a non-predictive one.

Scenario 2.
Take two deterministic models A and B with input space (domain of t) {a, b}; assume model A has deviation 0 at location a and 10^99 at location b, and that model B has deviation 1.001 at location a but does not predict anything at location b. One can easily adjust the numbers to reach an extreme situation (e.g., making A's prediction at location b extremely poor) where model B ends up with a much higher Bayes evidence than model A, while common sense, by which no prediction is a form of extremely poor prediction, would always imply that model A is better than model B.

These examples show how important it is to acknowledge that a model's inability to make predictions at some locations is not a neutral property. The classical BMA approach offers no trade-off: a model withholding its predictions at the most difficult points will always improve its weight. We now introduce our "domain-corrected BMA", where we amend the model weights to account more fairly for the (in)ability of a model to provide predictions at locations of interest.

2.2.1 Two models

Starting from (2.13), instead of setting p(d^{(−A)}|d^{(A)}, A) = 1, which removes the effect of a model's domain in its posterior weights, we propose the weaker assumption that p(d^{(−A)}|d^{(A)}, A) is independent of the model, i.e., we assume

    p(d^{(−A)}|d^{(A)}, A) = p(d^{(−A)}|d^{(A)}).    (2.14)

This is quite natural if we consider that model (A) implies a distribution p(d^{(A)}|A) but provides no information on d^{(−A)}, leaving d^{(−A)} unconstrained by (A) (see the introduction of this section). The evidence p(d|A) is now given by

    p(d|A) ∝_A p(d^{(A)}|A) p(d^{(−A)}|d^{(A)}).    (2.15)

Our assumption d^{(A)} ∪ d^{(B)} = d implies that d^{(−A)} can only be informed by (B). Hence

    p(d^{(−A)}|d^{(A)}) = p(d^{(−A)}|d^{(A)}, B) = p(d^{(−A)}|d^{(A)∩(B)}, B),    (2.16)

which can be written as an explicit integral with respect to model (B)'s parameter φ_B,

    ∫ p(d^{(−A)}|d^{(A)∩(B)}, φ_B, B) p(φ_B|d^{(A)∩(B)}, B) dφ_B.    (2.17)

To approximate (2.17), one can use the same approximation methods as in the case of the classical evidence integral (see the beginning of this section for more details).

2.2.2 K models

In the general case, each model M_k constrains a subset d^{(k)} of the data d (for k = 1, ..., K); as in the case of two models, d^{(−k)} denotes the complement subset of d^{(k)} in d. We also introduce d^{(∩)} := ∩_k d^{(k)} as the set of data common to all individual models. Moreover, we assume that d = ∪_k d^{(k)}, i.e., every datapoint is covered by at least one model. We also assume, up to taking equivalence classes on models (see Section 2.4.3 for details), that for each pair of models there exists a chain of models joining them where each model M_k shares a data point in its domain d^{(k)} with each of its neighbours. Relying on the same principles described in Section 2.2.1, we set

    p(d^{(−k)}|d^{(k)}, M_k) = p(d^{(−k)}|d^{(k)}),    (2.18)

which leads to model posterior probabilities of the form

    p(M_k|d) ∝_k p(d^{(−k)}|d^{(k)}) p(M_k|d^{(k)}).    (2.19)

Compared to the two-model case, the computation of the corrective factors poses an additional difficulty: when there is more than one model constraining d^{(−k)}, the factor p(d^{(−k)}|d^{(k)}) is no longer equal to a single p(d^{(−k)}|d^{(k)}, M_k), but rather to an average over all the models constraining d^{(−k)}.
Hence our domain-corrected BMA corresponds to the intermediate solution where one replaces the factors of the likelihood corresponding to the missing model predictions by a geometric average of the likelihoods over the models which do produce predictions, based on the predictive models' posterior weights. We have found that similar ideas have been developed in the broader framework of evidence theory (Park and Grandhi, 2012, Section 2.2). The notation for a given corrective factor can become cumbersome when model domains have very general intersections, but these corrective factors can still be computed recursively rather than directly. We relegate the calculations of the general case to Section 2.4.3.

2.3 Examples and applications

To illustrate the methodology described in this chapter, we present several examples in which BMA leads to a reduction in prediction error and improved UQ. Our first example is a simple yet sensible scenario of averaging two different models of proton potentials. The second example is an application of the BMA methodology to state-of-the-art nuclear mass models and nuclear mass data. The third example is a BMA study of the LDM (1.1) published by Kejzlar et al. (2020). Lastly, we provide a pedagogical application of model averaging to a synthetic dataset which highlights the interest of the domain-corrected BMA. The predictive improvement is measured in the examples as a relative reduction in the mean square error (MSE), a sample version of (2.10), which we denote as r̂2_BMA. As a measure of UQ fidelity, we consider what is known as the empirical coverage probability (ECP) (Gneiting et al., 2007; Gneiting and Raftery, 2007). Formally, it can be written as

η(α) := (1/J) Σ_{i=1}^{J} 1{y∗_i ∈ Iα(t∗_i)}, (2.20)

where 1 is the indicator function, Iα(t∗_i) is the α-credibility interval produced by a given model at a new input t∗_i, and the y∗_i are the (new) testing data. The ECP represents the proportion of a model's predictions of independent testing points falling into the respective credibility intervals. This quantity is typically plotted against the credibility level α to form a so-called ECP line (e.g., Figure 2.2). This line should theoretically follow the diagonal so that the actual fidelity of the interval corresponds to the nominal value. If the respective ECP line falls above the reference, the credible intervals produced by a given model are too wide (UQ is conservative). Naturally, a model with an ECP line below the reference underestimates the uncertainty of predictions (UQ is liberal). While values of the empirical proportions close to the reference curve are desirable, it is preferable to be conservative rather than liberal: overly narrow credible intervals declare a higher level of assurance than is warranted. Each of the examples in this section looks at a situation with several competing models without any prior knowledge of which is better; thus we set the prior model weights to be uniform over the model space. All the posterior samples were computed using the NUTS. The evidence integrals were approximated using MC integration. All the credible intervals discussed are the highest posterior density (HPD) credible intervals. Given a credibility level α, the α-HPD of a scalar quantity consists of the minimum-width interval containing an α proportion of its MCMC posterior samples. Lastly, some of the supplementary results and modeling details are relegated to Section 2.4.4.
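To make the ECP computation in (2.20) concrete, the following sketch (Python with NumPy) evaluates the empirical coverage of posterior predictive samples at a grid of credibility levels. The array names are illustrative, and the sketch uses equal-tailed quantile intervals rather than the HPD intervals employed in this chapter, so it should be read as a schematic rather than the exact implementation.

import numpy as np

def ecp(post_pred_samples, y_test, alphas):
    """Empirical coverage probability, cf. (2.20).

    post_pred_samples : array of shape (M, J); M posterior predictive draws
                        at each of the J testing inputs t*_i.
    y_test            : array of shape (J,); the testing observations y*_i.
    alphas            : credibility levels at which the ECP line is evaluated.
    """
    coverage = []
    for alpha in alphas:
        lo = np.quantile(post_pred_samples, (1 - alpha) / 2, axis=0)
        hi = np.quantile(post_pred_samples, (1 + alpha) / 2, axis=0)
        # proportion of testing points falling inside their alpha-credibility interval
        coverage.append(np.mean((y_test >= lo) & (y_test <= hi)))
    return np.array(coverage)

# ECP line: plot ecp(...) against the credibility levels and compare to the diagonal.
alphas = np.linspace(0.05, 0.95, 19)

Coverage above the diagonal then signals conservative UQ and coverage below signals overly narrow intervals, matching the discussion above.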
2.3.1 Averaging of proton potentials

In this first example we demonstrate the potential of BMA to improve both prediction accuracy and honesty of UQ in a favorable situation where we average two models associated with different proton potentials. We consider two single-proton potentials describing the average interaction acting on a proton within the spatial range of a nucleus; namely, the Woods-Saxon (WS) potential V1 and the Coulomb potential V2, representing respectively the strong nuclear forces between nucleons (protons and neutrons) and the electromagnetic interactions between protons. For a given nucleus, which we will take with proton and neutron numbers Z = 100 and N = 150 and mass number A = 250, they can be expressed as

V1(r) = −VWS / (1 + e^{(r − RA)/a}), (2.21)
V2(r) = −VC Z / r. (2.22)

Here, VWS = 50 MeV, VC = 0.5 MeV fm, and a = 0.5 are fixed parameters, and RA = 1.25 × A^{1/3} fm is the radius of the nucleus of interest. These two models for energy potentials have the interesting property that both are non-decreasing and vanish at infinity, albeit at different rates, and can correspond to two phenomena with different length scales. As a matter of fact, the strong interactions described by the WS potential are confined to the volume of atomic nuclei (a few fm; 1 fm = 10^{−15} m), i.e. they are short-ranged; in contrast, the electrostatic ones are long-ranged, i.e. they act on much larger length scales (> 10^{−10} m) and compete with the strong interactions in superheavy elements, causing the so-called Coulomb frustration (see Nazarewicz (2018)). This fact is reproduced in our example, where we also expect that V1 should be well constrained by a dataset of stable nuclei, while V2 should play an important role in the description of short-lived superheavy nuclei. More generally, we have in mind a scenario where two models have been developed for different subsets of an input domain and are in competition on some common intermediate domain. Both of these modeling approaches are equally confident that they prevail on the intermediate domain, while the truth is somewhere in between. This situation is quite realistic despite its simplicity, and we can reasonably expect model mixing to have positive outcomes. We simulate the experimental data {(ri, yi)}_{i=1}^{n} at different spatial locations ri, relatively far from the nucleus (r > RA), following a mixture of the two models. Namely,

yi = (1 − ω)V1(ri) + ωV2(ri) + εi, (2.23)

where the εi are standard normal errors, and we take ω = 1/2. Note that in reality, observations of the potentials are not available as such, but can be inferred indirectly from experimental nucleonic densities measured in nucleon scattering experiments (Anni et al., 1995). In particular, we drew a dataset of 210 observations generated according to the model (2.23) with the locations ri sampled uniformly over (RA, 10).

Figure 2.1: The Woods-Saxon potential and the Coulomb potential along with the training (140 observations) and the testing datasets (70 observations) generated from the mixture of the two potentials.

We further randomly divided the data into a training dataset of 140 observations and kept the remaining 70 observations for testing (see Figure 2.1). The two statistical models M1 and M2 considered here are given by the respective energy potentials in (2.23), obtained with ω = 0 and ω = 1, and additive independent experimental errors distributed according to N(0, σj) for j = 1, 2.
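A minimal simulation sketch of this setup is given below (Python with NumPy). The variable names, the random seed, and the exact random draws are illustrative assumptions; only the functional forms (2.21)-(2.23), the sampling range, and the 140/70 split follow the description above.

import numpy as np

rng = np.random.default_rng(0)

Z, A = 100, 250
V_WS, V_C, a = 50.0, 0.5, 0.5          # MeV, MeV fm, and the surface diffuseness
R_A = 1.25 * A ** (1 / 3)              # fm

def V1(r):   # Woods-Saxon potential (2.21)
    return -V_WS / (1.0 + np.exp((r - R_A) / a))

def V2(r):   # Coulomb potential (2.22)
    return -V_C * Z / r

# Mixture data (2.23) with omega = 1/2 and standard normal errors
omega, n = 0.5, 210
r = rng.uniform(R_A, 10.0, size=n)
y = (1 - omega) * V1(r) + omega * V2(r) + rng.standard_normal(n)

# 140 training / 70 testing split, as in Figure 2.1
idx = rng.permutation(n)
train, test = idx[:140], idx[140:]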
The prior distributions for the standard deviations σj were taken to be the non-informative Inv-Gamma(1, 30). Table 2.1 shows the estimated root MSE (RMSE) for the testing dataset. We can see that this simple example gives significantly better predictions under the BMA posterior mean predictor than under each of the models individually. This is, of course, not a surprise and shows that BMA behaves as expected.

Model    RMSE    P(Mk|y)   r̂2_BMA
M1       3.540   0.512     0.930
M2       3.607   0.488     0.933
M_BMA    0.935   -         -

Table 2.1: RMSE (in MeV) and the improvement under the BMA posterior mean predictor calculated on the testing dataset (n = 70, A = 250).

More interesting results can be seen from the angle of the quality of the predictions' UQ in Figure 2.2. In contrast with the individual models, the ECP of the BMA posterior predictions closely matches the reference line and provides evidence that accounting for model uncertainty leads to the desired more honest UQ.

Figure 2.2: ECPs for the testing dataset (m = 70, A = 250).

2.3.2 Averaging of nuclear mass emulators in the Ca region

An important challenge in nuclear structure is to produce quantified predictions of nuclear observables, such as nuclear masses (McDonnell et al., 2015), for all possible pairs (Z, N) of proton numbers Z and neutron numbers N which can be bound together in a nucleus. Such predictions are of direct interest to guide future nuclear experiments or to feed astrophysical calculations for the abundance of elements in the universe. The underlying astrophysical processes, such as the rapid neutron capture which produces heavy elements in stellar environments (Horowitz et al., 2019), take place far from the region of nuclear stability, where no experimental measurements are available, and these observables have to be extracted from extreme extrapolations of theoretical nuclear models. In their recent work, Neufcourt et al. (2019) used GPs (see Section 1.2 for definition) to model the discrepancies between the experimental data and the theoretical calculations for several nuclear models based on the DFT, and obtained quantified extrapolations for nuclear masses in the Calcium region (at the frontier between experimental and theoretical limits). They computed a simplified BMA of 9 global mass models (Bartel et al., 1982; Dobaczewski et al., 1984; Chabanat et al., 1995; Klüpfel et al., 2009; Kortelainen et al., 2010a, 2012, 2014) listed in Table 2.2, defined across the full nuclear landscape from the light to the superheavy nuclei and thus suitable for extrapolations. Their weights, proportional to p(y∗ > 0|y, Mk), are based on each model's probability to assign a positive separation energy y∗ to a testing set of nuclei which have been experimentally observed after 2003, thus independent from the training set of measured neutron separation energies y (the separation energy is the energy needed to remove a neutron or proton from an atomic nucleus). Here, we compare the results of Neufcourt et al. (2019) to the full BMA analysis with model weights given by their posterior probabilities p(Mk|y). Note that all the physical models are taken here as calibrated and their parameter estimation is not part of our analysis.
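The BMA quantities reported throughout these examples combine the individual models' posterior predictive draws with their posterior weights. A schematic version of that combination is sketched below (Python with NumPy); the array names are hypothetical, and it is assumed that each model's posterior predictive samples at the testing inputs have already been produced by the sampler.

import numpy as np

def bma_mixture_samples(samples_by_model, weights, rng=np.random.default_rng(0)):
    """Draw from the BMA posterior predictive mixture.

    samples_by_model : list of arrays, each of shape (M, J), with M posterior
                       predictive draws for one model at J testing inputs.
    weights          : posterior model probabilities p(Mk|d), summing to one.
    """
    M, J = samples_by_model[0].shape
    # for each stored draw, first pick a model according to its posterior weight
    picks = rng.choice(len(samples_by_model), size=M, p=weights)
    return np.stack([samples_by_model[k][m] for m, k in enumerate(picks)])

def bma_posterior_mean(samples_by_model, weights):
    # BMA posterior mean predictor: weighted average of the models' posterior means
    return sum(w * s.mean(axis=0) for w, s in zip(weights, samples_by_model))

def rmse(pred, y_test):
    return float(np.sqrt(np.mean((pred - y_test) ** 2)))

# An r2_BMA-type improvement over model k can then be computed as
# 1 - rmse(bma_pred, y_test)**2 / rmse(pred_k, y_test)**2.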
We consider the same training dataset of one-neutron (S1n) and two-neutron (S2n) sep- aration energies AME2003 (Audi et al., 2003) restricted to the calcium (Ca) region on the nuclear landscape with Z ≥ 14 and N ≤ 22 (n = 139). The predictive performances of each model augmented with the GP model for systematic discrepancies and the BMA pos- terior mean predictor are evaluated on both the training dataset and a testing dataset of 26 new measurements in AME2016 (n = 14) that we denote as AME2016 \ AME2003 (Wang et al., 2017). The predictive performances of each model augmented with a GP model for systematic discrepancies and the BMA posterior mean predictor are evaluated on both the training dataset and a testing dataset of new measurements in AME2016 (n = 14) (Wang et al., 2017). Similarly to Neufcourt et al. (2019), we calculate the model posterior prob- abilities independently over four non-overlapping nuclear domains according to the parity of numbers Z and N with uniform prior distribution over the model space. We assess the performance of BMA using the MSE improvement and the ECP. These were combined over odd and even parities of numbers Z and N in order to mitigate the relatively small size of each parity subset. The GP model specification and the sample sizes breakdown based on the parity of Z and N are given in Section 2.4.4. Model posterior weights S1n (odd N) S2n (even N) even Z odd Z even Z odd Z RMSE (cid:98)r2 Training Errors BM A RMSE (cid:98)r2 Testing 0.000 0.000 0.000 0.000 0.000 0.845 0.002 0.153 0.000 0.000 0.000 0.000 0.000 0.009 0.669 0.013 0.308 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.902 0.098 0.008 0.000 0.000 0.001 0.000 0.089 0.125 0.310 0.467 0.076 0.127 0.142 0.107 0.136 0.110 0.109 0.114 0.146 0.110 0.118 0.105 - 0.308 0.449 0.023 0.400 0.077 0.058 0.149 0.477 0.045 0.110 - 0.713 0.989 0.924 0.840 0.809 0.550 0.806 0.808 0.806 0.641 0.680 0.591 BM A 0.313 0.642 0.591 0.505 0.466 - 0.462 0.465 0.463 0.078 0.131 - Model SLy4 SkP SkM* SV-min UNEDF0 UNEDF1 UNEDF2 FRDM-2012 HFB-24 MBM A(prior) MBM A(simple) MBM A Table 2.2: Model posterior weights for 9 nuclear mass models with the RMSE (in MeV) and the MSE improvement for the training and the testing datasets. The last three rows correspond to the averaging with the prior weights, the simplified BMA (Neufcourt et al., 2019), and the full BMA. Table 2.2 presents the resulting posterior weights of the models, as well as the RMSE and the MSE improvement for both averaging procedures. The predictions based on the full BMA (MBM A) outperform the simplified method of Neufcourt et al. (2019) (MBM A(simple)) by 27 11% on the training dataset and 13% on the testing one, as measured by(cid:98)r2 BM A. The lowest RMSE on the training dataset was attained by SLy4 and UNEDF1 respectively for AME2016 \ AME2003. This result should not discourage practitioner from using BMA posterior mean predictors, because the BMA methodology outlined in this paper allows for existence of a “best” model for a particular data domain. However, such a model does not account for modeling uncertainty whereas BMA does, and therefore the BMA posterior mean estimator performs consistently well irrespective of the dataset. In fact it attains the second lowest RMSE on both AME2003 and AME2016 \ AME2003. Moreover, if we consider only a subset of the whole model space, the BMA attains the lowest RMSE. See Table 2.8 in Section 2.4.4 for the results with a restricted model space. Figure 2.3 shows the ECP of the averaged nuclear mass emulators. 
While it is not clear that the BMA has an improved ECP compared to each individual model, its ECP is certainly significantly better than that of the worst models and comparable to the models with the highest fidelity.

Figure 2.3: The ECPs calculated on the independent testing dataset (AME2016 \ AME2003).

2.3.3 Averaging of the Liquid Drop Model variants

In our second real-data example, we demonstrate the opportunities in nuclear theory offered by BMA through averaging of the LDM that has been optimized on various subsets of the nuclear domain. In the context of the following discussion, it is useful to clarify the notion of a "model". In this specific scenario, by model we understand the combination of the algebraic model formula, the dataset used for its parameter determination, and a statistical model that describes the error structure. The parameters of the LDM are (avol, asurf, asym, aC), representing the volume, surface, symmetry and Coulomb energy, respectively. Because of its linearity and simplicity, the LDM has become a popular model for various statistical applications (Bertsch et al., 2005; Toivanen et al., 2008; Utama et al., 2016; Yuan, 2016; Bertsch and Bingham, 2017; Zhang et al., 2017; Cauchois et al., 2018; Shelley et al., 2014; Pastore, 2019). To study the impact of the fitting domain on the prediction accuracy and UQ fidelity of nuclear mass models, we shall consider the experimental binding energies of 595 even-even nuclei of AME2003 (meaning both Z and N are even) divided into 3 domains according to Figure 2.4. Namely, we define the domain of light nuclei with Z < 40 and N < 50, heavy nuclei with Z > 50 and N > 80, and the intermediate domain DI consisting of the remaining even-even nuclei. To keep some of our results within computable ranges we will also consider 8 randomly selected nuclei in the central subset of the intermediate domain, which we will denote DC. By dividing nuclear domains according to A, we are trying to simulate the current theoretical strategy in modeling atomic nuclei: light nuclei are often described by different classes of models than heavy nuclei, with the intermediate domain being the testing ground for all approaches (Nazarewicz, 2016). Here we use, for testing, the same LDM expression in all domains. The models are distinguished merely by the fitting datasets. In terms of these separated data domains, we consider four LDM variants fitted on specific regions of the nuclear landscape:

Figure 2.4: Even-even nuclei from AME2003 divided into the domains of light (Z < 40, N < 50), heavy (Z > 50, N > 80), and intermediate nuclei (remaining 155 nuclei). The subset of 8 randomly selected nuclei is also depicted (From Kejzlar et al. (2020)).

(i) LDM(A) – LDM fitted on all 595 even-even nuclei.
(ii) LDM(L) – LDM restricted to the light domain (153 nuclei).
(iii) LDM(H) – LDM restricted to the heavy domain (287 nuclei).
(iv) LDM(L + H) – LDM fitted on both the light and heavy domains (440 nuclei).

We emphasize that the intermediate domain DI (and DC) is not used for training in variants (ii)-(iv), but kept aside as an independent testing domain where the different LDM variants compete. Thus we use the binding energies in the intermediate domain to evaluate the predictions and error bounds of these variants and their Bayesian averages.
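For concreteness, the domain definitions above can be written directly as masks over the even-even AME2003 table, and an LDM-type prediction can be evaluated for any (Z, N). The sketch below (Python with NumPy) uses one common four-term liquid-drop parameterization as a stand-in for fm(t, θ); the precise algebraic convention of (1.1) is stated in Chapter 1 and may differ (e.g. in signs or in the exact Coulomb term), so the function here is an illustrative assumption, while the domain masks follow the definitions given above exactly.

import numpy as np

def ldm_binding_energy(Z, N, theta):
    """A common liquid-drop form with volume, surface, symmetry, and Coulomb
    terms; an assumption standing in for the LDM expression (1.1)."""
    a_vol, a_surf, a_sym, a_C = theta
    A = Z + N
    return (a_vol * A
            - a_surf * A ** (2 / 3)
            - a_sym * (N - Z) ** 2 / A
            - a_C * Z ** 2 / A ** (1 / 3))

def domains(Z, N):
    """Domain masks for even-even nuclei, with Z and N given as arrays."""
    light = (Z < 40) & (N < 50)
    heavy = (Z > 50) & (N > 80)
    intermediate = ~light & ~heavy
    return light, heavy, intermediate

# LDM(L), LDM(H), and LDM(L+H) are then obtained by fitting the statistical
# model of the next paragraph on the corresponding masks, while the
# intermediate mask is held out for testing.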
In short, this setup is designed to produce a scenario where two models, which have been optimized on their respective domains, compete to explain the data on a third disconnected domain. Our statistical model for the binding energies yi is the standard

yi = fm(ti, θ) + σεi, (2.24)

where the function fm(t, θ) represents the LDM prediction (1.1) with a given parameter vector θ = (avol, asurf, asym, aC) for a nucleus indexed by t = (Z, N). The errors are modeled as independent standard normal random variables εi with mean zero and unit variance, scaled by σ. For the LDM parameters avol, asurf and asym we use independent normal prior distributions N(0, 100) with mean 0 and standard deviation 100, while for aC we take N(0, 2). For σ we assume a gamma prior distribution Gamma(5, 2) with shape parameter 5 and scale parameter 2. These are chosen to be weakly informative, i.e., distributions whose hyperparameters are chosen to ensure that the prior distribution spans a much wider domain than the resulting posterior. Since the parameter estimation is not the topic of this study, we refer the reader to Kejzlar et al. (2020) for more details about the posterior distributions of these parameters. In this example, we wish to assign a model's weight according to its predictive ability and also to avoid overfitting, in the same spirit as the approach implemented in Neufcourt et al. (2019, 2020a,b). To this end, we evaluate the evidence integrals over a set of binding energies y∗ from the intermediate domain of Figure 2.4, which corresponds to integrating the posterior distribution of new predictions against the posterior distribution of the model parameters:

p(y∗|y, Mk) = ∫ p(y∗|y, θk, σk, Mk) p(θk, σk|y, Mk) dθk dσk. (2.25)

Given that the posterior distribution of the parameters reflects the true distribution of the parameters more accurately than the prior, (2.25) more accurately represents the probability that Mk can explain the data. To assess the impact of the number of evidence datapoints, we evaluate the evidence integrals both on the full intermediate domain DI and on the smaller central domain DC. The integral (2.25) can be estimated using MC integration as

p̂(y∗|y, Mk) = (1/nMC) Σ_{i=1}^{nMC} p(y∗|y, θk^{(i)}, σk^{(i)}, Mk), (2.26)

where (θk^{(i)}, σk^{(i)}) are samples from the posterior distribution p(θk, σk|y, Mk). Table 2.3 shows the posterior weights obtained under the averaging scenarios with two (L and H) and three (L, H, and L+H) models. The corresponding RMSE values for the individual models and the BMA posterior mean predictors are listed in Table 2.4.

               DI                            DC
               BMA(L,H)   BMA(L,H,L+H)       BMA(L,H)   BMA(L,H,L+H)
LDM(L)         0.000      0.000              0.008      0.002
LDM(H)         1.000      0.000              0.992      0.255
LDM(L+H)       -          1.000              -          0.743

Table 2.3: Posterior model weights under the averaging scenarios with two (L and H; left) and three (L, H, and L+H; right) models. The weights for the full intermediate domain of nuclei and the subset of 8 randomly selected nuclei are listed.

As expected, model (H) is selected in the two-model variant, and the (L+H) variant dominates when it is included – this is true for both sets of evidence datasets DC and DI. This is consistent with the RMSE of these models. It shall be emphasized that BMA performs a model selection in the two-model variant, where the RMSEs of the competing models are very different, and model averaging in the three-model variant, where the RMSEs of (H) and (L+H) are close enough.
Table 2.4 also shows how the RMSE of the BMA predictions compare with that of the individual models. In the two-model setup, BMA is very much like (H) and it has a similar RMSE. In the three-model setup, BMA performs much better than the worst model and very close to the best of the averaged models. When computed on the full test domain DI , the RMSEs are systematically smaller for BMA than for all the individual models involved in the averaging (not considering LDM(A)). One may notice that the RMSE of BMA(L, H, L+H) is, perhaps unexpectedly, slightly worse than that of 32 LDM(L+H) on the small domain DC. However, these values are based merely on 8 data points and should be viewed as a crude estimate of true predictive performance. DI DC Mk BMA(L,H) BMA(L,H,L+H) Mk BMA(L,H) BMA(L,H,L+H) LDM(A) LDM(L) LDM(H) LDM(L+H) 3.206 8.176 3.811 3.351 3.810 3.223 1.930 6.825 3.292 1.881 3.300 1.926 Table 2.4: The RMSEs (in MeV) of the predictions from the 4 LDM variants as well as the values from BMA, calculated on the held-out data in the intermediate domain of even-even nuclei from AME2003. Figure 2.5: The ECPs for the four LDM variants used in our study and the averaging scenarios with two (L and H) and three models (L, H, and L+H) (From Kejzlar et al. (2020)). Similarly to all the previous examples in Chapter 2, we also evaluate the models from UQ 33 0.00.20.40.60.81.0Credibility level0.00.20.40.60.81.0Empirical coverageReferenceLDM(L)LDM(H)LDM(L+H)LDM(A)BMA(L,H)BMA(L,H,L+H) quality perspective using the ECP curves. Figure 2.5 shows that the LDM variants fitted to the smaller domains (L or H) tend to underestimate the uncertainty of the predicted binding energies compared to the rather conservative UQ of the (L+H) variant and the LDM fitted to the entire AME2003 dataset. On the other hand, BMA(L,H,L+H) yields an ECP superior to all the LDM variants, including LDM(A), which aligns with our hypothesis that meaningful averaging can lead to an improved UQ. 2.3.4 Averaging of models with discrepant domains: a pedagogical example In this example we study a simulated scenario where two models with t-dynamics of the same order act in the opposite directions. We consider these models to be the realizations of GPs with means mi(t) = αit2 + θi, i ∈ {1, 2}, (2.27) where α1 = 0.5 and α2 = −0.5, and θ1 and θ2 represent unknown parameters to be estimated. The covariance function used for the GPs is squared exponential kernel − (t−t(cid:48))2 ki(t, t(cid:48)) = η2 2(cid:96)2 i e i , i ∈ {1, 2}. (2.28) The prior distributions for the unknown parameters (θi, ηi, (cid:96)i) are listed in Section 2.4.4. Overall, the two statistical models M1 and M2 considered here are given by the respective GPs and additive independent errors distributed according to N (0, σj) for j = 1, 2. The two means (2.27) emulate a natural scenario of competition between models, similar to the proton potential example above, where we are uncertain about the nature of the physical law and resort to BMA in order to account for this uncertainty. To do so, we consider a synthetic dataset y of 18 observations drawn independently from N (0, 10−3) at input points t = {±k, k = 1, 2, . . . 9}. Additionally, we study the impact of the domain correction by assigning a different training dataset y(k) to the models M1 and M2, using seven different scenarios with proportions of shared observations (Dshared) ranging from 20% 34 to 80% according to the scheme in Table 2.5. 
Note the break of symmetry in the domain of y(k) denoted by a circle, we shall refer to those as symmetric and asymmetric scenarios. For each value of Dshared, we carried out the domain-corrected procedure detailed in Section 2.2 and computed the evidence integrals p(y(k)|Mk) as well as the corrective terms p(y(−k)|y(k)). Also note that the approximate computation of these terms (2.17) is more demanding than the computation of the evidence integrals, because it requires integration against the posterior distribution of parameters. Training dataset y(k) 3 2 4 5 6 7 8 9 x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ -9 -8 -7 -6 -5 -4 -3 -2 -1 1 x x x x x x x x x x Dshared Model 0.2 M1 M2 0.3 M1 ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ M2 0.4 M1 M2 0.5 M1 M2 0.6 M1 M2 0.7 M1 M2 0.8 M1 M2 x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ Table 2.5: Scheme depicting the observations contained in the training dataset of the models according to the proportion of shared data. The crosses mark the values contained in the domain of each model. Table 2.6 gives a quantitative summary of the simulation results in the asymmetric sce- nario, where the impact of the domain correction is stronger. See Table 2.9 in Section 2.4.4 for the symmetric case, where the impact of the domain correction is minor due to the sym- metry of training data and the response functions. The RMSE was calculated based on the set of common observations (t ≤ 5). BM A(Q) and BM A(Q0) represent respectively the domain corrected BMA and the BMA with independent model domains. Q denotes the pos- terior odds ratio p(y(−1)|y(1))p(M1|y(1))/[p(y(−2)|y(2))p(M2|y(2))] used to draw samples 35 Dshared Model RMSE p(y(k)|Mk) p(y(−k)|y(k)) Q0 Q (cid:98)r2 3.33 4.69 4.68 M1 M2 MBM A(Q0) 4.53 MBM A(Q) M1 M2 MBM A(Q0) 4.29 MBM A(Q) M1 M2 MBM A(Q0) 3.54 MBM A(Q) 4.36 3.62 4.63 4.38 3.23 2.78 - - 2.69 · 10−21 2.13 · 10−16 7.78 · 10−20 9.25 · 10−18 - - 7.79 · 10−20 5.44 · 10−13 3.29 · 10−18 2.12 · 10−14 - - 3.23 · 10−18 1.13 · 10−8 1.45 · 10−16 3.49 · 10−10 - - - - - - 0.3 0.5 0.7 0.03 0.80 0.02 0.61 0.02 0.72 BM A 0.495 0.494 - - 0.512 0.456 - - 0.593 0.410 - - Table 2.6: Summary of the domain corrected BMA analysis in the asymmetric case of the pedagogical example. from the mixture distribution (2.1) and Q0 is the ratio p(M1|y(1))/p(M2|y(2)). The MSE improvement(cid:98)r2 BM A is with respect to the BMA with domain correction. As expected from our construction, BMA leads to a spectacular decrease of the MSE by about 50%. The BMA posterior mean predictor outperforms consistently the individual models, at all proportions of the shared training data. As the overlap between the two model domains increases, the RMSEs consistently decrease. The same observations hold in the symmetric case. The domain corrected BMA has consistently lower RMSE than the BMA with independent model domains across Dshared. We observe that the values of the corrective factors increase exponentially towards 1 as Dshared increases; indeed the extreme case Dshared = 1, where both models are defined on the same domain, corresponds to the classical BMA framework. 
The odds ratios stay expectedly close to 1, due to the fact that the deviations from the out-of-domain data are comparable across the models; still, the domain-corrected odds ratio Q has a consistently larger variability than Q0, and the difference vanishes as the proportion Dshared of data shared between the two models increases.

2.4 Technical details and supplementary results

2.4.1 A simple example of evidence integral with closed form solution

Let us suppose the following set of K models of experimental observations (ti, yi)_{i=1}^{n}:

yi = fk(ti) + σk εi, k = 1, . . . , K,

where the fk(t) are known deterministic functions, the εi are independent identically distributed standard normal random variables, and σk^2 ∼ Inv-Gamma(αk, βk). We can calculate the evidence integrals (2.3) explicitly as

p(y|Mk) = ∫_0^∞ (2πσk^2)^{−n/2} exp( −Σ_i (y(ti) − fk(ti))^2 / (2σk^2) ) (βk^{αk}/Γ(αk)) (σk^2)^{−(αk+1)} exp( −βk/σk^2 ) dσk^2
        = (2π)^{−n/2} (βk^{αk}/Γ(αk)) ∫_0^∞ (σk^2)^{−(n/2 + αk + 1)} exp( −[ (1/2) Σ_i (y(ti) − fk(ti))^2 + βk ] / σk^2 ) dσk^2
        = (2π)^{−n/2} βk^{αk} Γ(n/2 + αk) / [ Γ(αk) ( (1/2) Σ_i (y(ti) − fk(ti))^2 + βk )^{n/2 + αk} ].

2.4.2 Proofs

Proof of Lemma 1. First, the standard factorization identities give the following expression:

(y∗ − ŷ∗)^2 − (y∗ − λ1 ŷ∗_1 − λ2 ŷ∗_2)^2 = [2y∗ − (λ1 + p1) ŷ∗_1 − (λ2 + p2) ŷ∗_2][(λ1 − p1) ŷ∗_1 + (λ2 − p2) ŷ∗_2].

To get the result of the Lemma, we now take the expectation of the expression above conditioned on d and notice that the right-hand side is, with the exception of y∗, d-measurable:

E[(y∗ − ŷ∗)^2|d] − E[(y∗ − λ1 ŷ∗_1 − λ2 ŷ∗_2)^2|d] = [(p1 − λ1) ŷ∗_1 + (p2 − λ2) ŷ∗_2][(λ1 − p1) ŷ∗_1 + (λ2 − p2) ŷ∗_2].

2.4.3 Supplement for the general case of K models

Let us consider a dataset d and K models M1, . . . , MK, and assume that each model Mk is defined on a subset t(k) of the data inputs. Denote also d(k) the subset of d corresponding to the inputs t(k) and d(−k) the complementary subset, as well as d(∩) := ∩_k d(k). Suppose that all data locations are in the domain of at least one model, so that d = ∪_k d(k). Note that if the datasets are disjoint, there is simply no basis to compare the models. Given a set of models, one can define a unique minimal equivalence relationship ⋆ on the models (i.e. with a maximal number of equivalence classes) satisfying M ⋆ M′ if M and M′ share at least one data point, i.e. M ⋆ M′ if and only if there exist r ≥ 0 and a sequence of models M =: M1, M2, . . . , Mr := M′ such that Mi and Mi+1 have a common data point for each 0 ≤ i < r. The computation of the posterior weights of the models can then be done within each class of equivalence, and we will therefore assume that there is only one such equivalence class. In the standard BMA where all models share d, one can express the posterior probabilities of the models p(Mk|d) using the Bayes formula

p(Mk|d) ∝k p(d|Mk) p(Mk) (2.29)

and estimate the evidence integral p(d|Mk) as at the beginning of this chapter. In our situation, however, the model Mk provides an expression for p(d(k)|Mk) instead of p(d|Mk), so that the standard procedure cannot be applied without a further argument. Starting from (2.29), we expand p(d|Mk) similarly to the two-model case (2.13) as

p(d|Mk) = p(d(k), d(−k)|Mk) = p(d(−k)|d(k), Mk) p(d(k)|Mk).
(2.30) Instead of setting p(d(−k)|d(k),Mk) = 1 which advantages models that withhold their pre- dictions at difficult locations (see the example scenarios and discussion in Section 2.2), our domain-corrected BMA estimates p(d(−k)|d(k),Mk) = p(d(−k)|d(k)). (2.31) 38 This yields the evidence and the posterior weights given respectively by p(d|Mk) = p(d(−k)|d(k))p(d(k)|Mk) p(Mk|d) ∝k p(d(−k)|d(k))p(d(k)|Mk)p(Mk), (2.32) (2.33) similarly to the two-model case. All that is left now is to evaluate the p(d(−k)|d(k)). Let S be the set of q indices of the models that constrain d(−k), we can compute p(d(−k)|d(k)) by conditioning with respect to the models with indices in S. Namely, p(d(−k)|d(k)) = p(d(−k)|d(k),∪q 1 (cid:80) l∈S p(Ml, d(k)) (cid:80) l∈S p(Ml|d(k)) 1 = = (cid:88) l [M = Ml; l ∈ S]) (cid:88) l∈S p(d(−k)|d(k),Ml)p(Ml, d(k)) p(d(−k)|d(k),Ml)p(Ml|d(k)) l∈S The simplest case is when d(−k) is non-divisible, in the sense that for every l ∈ S, we have d(−k) ⊂ d(l) or d(−k) ∩ d(l) = ∅. Then p(d(−k)|d(k),Ml) is given by (2.17) and the sum above have explicit expressions. In the general case, some models may be defined only on a strict subset of y(−k). In that case we have p(d(−k)|d(k),Ml) = p(d(−k)∩(l), d(−k)∩(−l)|d(k),Ml) = p(d(−k)∩(l)|d(k),Ml)p(d(−k)∩(−l)|d(k), d(−k)∩(l),Ml) = p(d(−k)∩(l)|d(k)∩(l),Ml)p(d(−k)∩(−l)|d(l), d(k)∩(−l)) The first term can be explicitly computed as (2.17) and p(d(−k)∩(−l)|d(l), d(k)∩(−l)) can be computed recursively. For practical purposes, it is important to notice that the complexity of the underlying algorithm is at most exponential in the number of models, where each iteration requires the computation of a posterior predictive distributions of decreasing subsets of the data given decreasing subsets of the data, posterior model weights given decreasing subset of the data, and N computations of corrective likelihoods, where N is the number of non-divisible subsets. 39 2.4.4 Supplement for the examples and applications Averaging of nuclear mass emulators in the Ca region. In this real data application, we follow the experimental framework of Neufcourt et al. (2019). Given a (known) theoret- ical nuclear model fm(t) for the one- and two-neutron separation energies, we consider the relationship between the experimental observations yi and the nuclear model as yi = fm(t) + δ(t), for t := (Z, N ) ranging over the two-dimensional nuclear domain. We model the systematic discrepancy δ as the GP δ(Z, N ) ∼ GP(0, kη,(cid:96){(Z, N ), (Z(cid:48), N(cid:48))}), with the mean 0 and the quadratic exponential covariance kernel with three parameters kη,(cid:96){(Z, N ), (Z(cid:48), N(cid:48))} = η2e 2(cid:96)2 Z − (Z−Z(cid:48))2 − (N−N(cid:48))2 2(cid:96)2 N , with independent gamma prior distributions with shape and scale parameters η, (cid:96)Z , (cid:96)N ∼ Gamma(a, b), where b = 1 and a respectively set to 0.8, 0.5 and 1.8. Note that this corresponds to the framework of Bayesian calibration of imperfect computer models described in Section 1.2. The only difference is that here we consider models that were already calibrated, and we don’t explicitly model the experimental error (the is common practice in the nuclear physics community since the experimental error is negligible compared with the systematic error for state-of-the-art models). See supplemental material to Neufcourt et al. (2019) for exhaustive description of the framework. Averaging of models with discrepant domains: a pedagogical example. 
Table 2.9 gives a quantitative summary of the simulation results in the symmetric scenario. Figure 2.6 40 Sample Size S1n (odd N) S2n (even N) Dataset even Z odd Z even Z odd Z AME2003 41 AME2016 \ AME2003 3 28 5 31 3 39 3 Table 2.7: Sample size breakdown for the training (AME2003) and the testing (AME2016 \ AME2003) datasets of nuclear separation energies in the Ca region according to Z and N parities. Model posterior weights Errors Model S1n (odd N) S2n (even N) even Z odd Z even Z odd Z RMSE (cid:98)r2 Training BM A RMSE (cid:98)r2 Testing SkM* 0.000 FRDM-2012 1.000 HFB-24 0.000 MBM A 0.001 0.997 0.002 0.000 0.900 0.100 0.000 0.399 0.601 0.142 0.114 0.146 0.112 0.375 0.031 0.405 - 0.925 0.808 0.806 0.709 BM A 0.413 0.231 0.227 - Table 2.8: The model posterior weights, RMSE (in MeV) and MSE improvement calculated on both the training (AME2003) and the testing (AME2016 \ AME2003) datasets for 3 nuclear mass models. (asymmetric design) and Figure 2.7 (symmetric design) show the posterior mean predictions for M1, M2, domain corrected BMA MBM A(Q), and BMA with independent model do- mains MBM A(Q0). These were obtained for the pedagogical example 2.3.4 using the domain correction developed in Section 2.2. The RMSE for both BM A(Q0) and BM A(Q) is almost identical here (up to a roundoff error) due to the symmetric nature of both the training dataset y(k) and the response functions. The prior distributions used in the example were θi ∼ N (0, 1), i ∼ Inv-Gamma(10, 1), σ2 (cid:96)i ∼ Gamma(1, 10), ηi ∼ Inv-Gamma(10, 1), 41 where both the gamma and the inverse gamma distributions are parametrized in terms of the shape and the rate parameters for i ∈ {1, 2}. Dshared Model RMSE p(y(k)|Mk) p(y(−k)|y(k)) Q0 Q (cid:98)r2 3.28 4.69 4.58 4.64 4.53 M1 M2 MBM A(Q0) 3.28 MBM A(Q) M1 M2 MBM A(Q0) 3.24 MBM A(Q) M1 M2 MBM A(Q0) 3.07 MBM A(Q) M1 M2 MBM A(Q0) 2.53 MBM A(Q) 3.61 3.56 4.37 4.33 3.25 3.08 2.53 - - - - 2.78 · 10−21 1.98 · 10−19 2.73 · 10−21 2.11 · 10−19 - - 7.99 · 10−20 4.33 · 10−16 7.95 · 10−20 3.96 · 10−16 - - 3.32 · 10−18 8.59 · 10−12 3.29 · 10−18 7.84 · 10−12 - - 1.45 · 10−16 2.99 · 10−6 1.42 · 10−16 2.98 · 10−6 - - - - - - 0.2 0.4 0.6 0.8 1.02 0.96 1.01 1.10 1.01 1.11 1.02 1.03 BM A 0.512 0.488 - - 0.511 0.486 - - 0.504 0.495 - - 0.509 0.495 - - Table 2.9: Summary of the domain corrected BMA analysis in the symmetric case of the pedagogical example. 42 Figure 2.6: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the asymmetric case. The dashed line segments represent the translated values of the original observations. 43 4202420020Dshared=0.34202420020Dshared=0.54202420020Dshared=0.7BMA(Q0)BMA(Q)12y* Figure 2.7: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the symmetric case. The dashed line segments represent the translated values of the original observations. 
CHAPTER 3

AN EFFICIENT ALGORITHM FOR BAYESIAN CALIBRATION OF COMPUTER MODELS VIA VARIATIONAL INFERENCE

With the ever-growing access to high performance computing in scientific communities, the use of computational models proliferates to solve complex problems in many scientific applications such as nuclear physics and climate research. An important class of such problems is making predictions, in order to aid the cycle of the scientific process. In particular, our task is to establish statistically principled predictions of new values y∗ of a physical process ζ using a computer model fm and a set of observations y = (y1, . . . , yn) from this process. We would also like to account for various sources of uncertainty associated with individual models (see Section 1.1 for a detailed discussion on UQ of computer models). The general framework that we shall follow, and which allows for predictions with UQ, is called Bayesian calibration. It was originally developed by Kennedy and O'Hagan (2001), with extensions provided by Higdon et al. (2005, 2008); Bayarri et al. (2007); Plumlee (2017, 2019); Gu and Wang (2018) and Xie and Xu (2020), to name a few. Formally, let y = (y1, . . . , yn) be observations of a physical process ζ(ti) depending on a known set of inputs ti ∈ Ω ⊂ Rp. Assume that yi follows

yi = ζ(ti) + σεi, (3.1)

where σ represents the scale of the observation errors εi, which are i.i.d. N(0, 1). As a mathematical description of ζ, we consider a computer model fm defined as the mapping (t, θ) ↦ fm(t, θ), which depends on an additional set of inputs θ ∈ Θ ⊂ Rd that we call calibration parameters. These are fixed but unknown quantities representing fundamental properties of the physical process that cannot be directly measured or controlled in an experiment. We assume a single value of the calibration parameter θ to be common among all the observations yi and all the future instances of the physical process. As we discussed in Chapter 1, a computer model is an imperfect description of reality, and there often exists some systematic discrepancy (error) between the model and the physical process. To this extent, we assume that ζ satisfies ζ(t) = fm(t, θ) + δ(t), where δ(t) is the systematic discrepancy of the model whose form is generally unknown. The complete statistical model then reads as

yi = fm(ti, θ) + δ(ti) + σεi. (3.2)

The systematic discrepancy is modeled non-parametrically using a Gaussian process (GP) with the mean function mδ(t) and the covariance function kδ(t, t′):

δ(t) ∼ GP(mδ(t), kδ(t, t′)). (3.3)

The definition of a GP with examples is provided in Section 1.2. In addition to the computer model being imperfect, it is often too expensive in terms of both computational time and memory to be used directly for inference. A common remedy is to consider the computer model as a realization of a GP with the mean function mf(t, θ) and the covariance function kf((t, θ), (t′, θ′)):

fm(t, θ) ∼ GP(mf(t, θ), kf((t, θ), (t′, θ′))). (3.4)

In this situation, we generate an additional synthetic dataset of model runs z = (z1, . . . , zs) over a fixed grid of inputs {(t̃1, θ̃1), . . . , (t̃s, θ̃s)} selected using a space-filling design such as a uniform or Latin hypercube design (Morris and Mitchell, 1995).
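As an illustration of the space-filling design step, the following sketch uses SciPy's quasi-Monte Carlo module to draw a Latin hypercube design over scaled input and calibration-parameter ranges. The number of runs, the dimensions, and the bounds are placeholders rather than values used in this chapter; any comparable space-filling design would serve the same purpose.

import numpy as np
from scipy.stats import qmc

s = 200                          # number of computer model runs z_1, ..., z_s
p_dim, d_dim = 1, 2              # dim(t) and dim(theta), placeholders

sampler = qmc.LatinHypercube(d=p_dim + d_dim, seed=1)
design01 = sampler.random(n=s)   # s points in the unit hypercube

# rescale to the physical input range Omega and the parameter range Theta
lower = np.array([0.0, -1.0, -1.0])      # placeholder bounds
upper = np.array([1.0,  1.0,  1.0])
design = qmc.scale(design01, lower, upper)

t_tilde, theta_tilde = design[:, :p_dim], design[:, p_dim:]
# z_j = f_m(t_tilde_j, theta_tilde_j) are then evaluated once, offline.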
The complete dataset d therefore consists of n observations yi from the physical process ζ and s evaluations zj of the computer model fm, i.e. d = (d1, . . . , dn+s) := (y, z), and follows the multivariate normal distribution d|φ ∼ N (M (φ), K(φ)), (3.5) where φ = (θ, γ, σ) is the set of all unknown parameters with γ denoting the set of hyper- parameters of the GPs’ mean and covariance functions. M (φ) (1.5) is the mean vector and K(φ) (1.6) is the covariance matrix given by the GPs’ specifications. 46 Under this framework, the Bayesian predictions of y∗ are given by the posterior predictive distribution p(y∗|d), namely (cid:90) p(y∗|d) = p(y∗|d, φ)p(φ|d) dφ. (3.6) The conditional density p(y∗|d, φ) is a multivariate normal density given by the statistical model (1.3) and the specification of GPs (the explicit form is provided in Section 3.4). The posterior distribution of the unknown parameters p(φ|d) is given by the Bayes’ theorem p(φ|d) = . (3.7) (cid:82) p(d|φ)p(φ) dφ p(d|φ)p(φ) The term “calibration” in the Bayesian paradigm includes both an estimation of φ and a full evaluation of uncertainty for every parameter under a prior uncertainty expressed by p(φ). It is also worth noting that the posterior predictive density is rarely computed directly from (3.6). Instead, we first generate samples φ(1), . . . , φ(M ) from p(φ|d) and then obtain samples y∗(1), . . . , y∗(M ) so that y∗(i) ∼ p(y∗|d, φ(i)), i = 1, . . . , M . The posterior predictive density is then approximated using the empirical density of samples y∗(1), . . . , y∗(M ). As a consequence of this simple two-step algorithm, we are interested in effective sampling (approximation) from the posterior distribution p(φ|d). This becomes quickly infeasible with the increasing size of datasets, number of parameters, and model complexity. Traditional MCMC methods that approximate p(φ|d)—such as the MH algorithm or more advanced ones including the Hamiltonian Monte Carlo or the NUTS—typically fail because of the computational costs associated with the evaluation of p(d|φ). The conventional approaches to scalable Bayesian inference are in general not applicable here because of the highly corre- lated structure of K(φ) or the nature of calibration itself. Indeed, parallelization of MCMC (Neiswanger et al., 2014) works in the case of and independent d, and GP approximation methods are developed in the context of regression problems (Qui˜nonero-Candela and Ras- mussen, 2005; Titsias, 2009; Bauer et al., 2016). This chapter presents a scalable and statistically principled approach to Bayesian calibra- tion of computer models. We offer an alternative approximation to posterior densities using 47 variational Bayesian inference (VBI), which originated as a machine learning algorithm that approximates a target density through optimization. Statisticians and computer scientists (starting with Peterson and Anderson (1987); Jordan et al. (1999)) have been widely using variational techniques because they tend to be faster and easier to scale to massive datasets. Moreover, the recently published frequentist consistency of variational Bayes by Wang and Blei (2019) established VBI as a theoretically valid procedure. The scalability of VBI in modern applications hinges on the efficiency of stochastic optimization in scenarios with independent data points. This efficiency, however, diminishes in the case of Bayesian cali- bration of computer models due to the dependence structure in data (Robbins and Monro, 1951; Hoffman et al., 2013). 
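To make the cost of evaluating p(d|φ) explicit, the sketch below assembles one common form of M(φ) and K(φ) under the Kennedy-O'Hagan specification above and evaluates a single Gaussian log-likelihood. The exact expressions (1.5)-(1.6) are given in Chapter 1, so the squared exponential kernels and the zero-mean GPs used here are illustrative assumptions. The Cholesky factorization makes the O((n + s)^3) scaling, which drives the scalability discussion, visible.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sqexp(x1, x2, eta, ell):
    """Squared exponential kernel (illustrative stand-in for k_f and k_delta)."""
    sq = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2 / ell ** 2, axis=-1)
    return eta ** 2 * np.exp(-0.5 * sq)

def log_likelihood(y, z, t, t_tilde, theta_tilde, theta,
                   eta_f, ell_f, eta_d, ell_d, sigma):
    """log p(d|phi) for d = (y, z), assuming zero-mean GPs for brevity."""
    xy = np.column_stack([t, np.tile(theta, (len(t), 1))])   # inputs (t_i, theta)
    xz = np.column_stack([t_tilde, theta_tilde])              # design (t~_j, theta~_j)
    Kyy = (sqexp(xy, xy, eta_f, ell_f) + sqexp(t, t, eta_d, ell_d)
           + sigma ** 2 * np.eye(len(y)))
    Kzz = sqexp(xz, xz, eta_f, ell_f)
    Kyz = sqexp(xy, xz, eta_f, ell_f)
    K = np.block([[Kyy, Kyz], [Kyz.T, Kzz]])                  # K(phi), (n+s) x (n+s)
    data = np.concatenate([y, z])
    c, low = cho_factor(K)                                    # O((n+s)^3): the bottleneck
    quad = data @ cho_solve((c, low), data)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    return -0.5 * (quad + logdet + len(data) * np.log(2 * np.pi))

Every MCMC step or naive gradient evaluation repeats this full factorization, which is what the decomposition introduced next avoids.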
To maintain the speed and scalability of VBI, we adopt a pairwise decomposition of data likelihood using vine copulas that separate the information on a dependence structure in data from their marginal distributions (Cooke and Kurowicka, 2006). Our specific contributions are as follows: 1. We propose a novel version of the black-box variational inference (Ranganath et al., 2014) for Bayesian calibration of computer models that preserves the efficiency of stochastic optimization in a scenario with dependent data. 2. We implement the Rao-Blackwellization, control variates, and importance sampling to reduce the variance of noisy gradient estimates involved in our algorithm. 3. We provide both theoretical and empirical evidence for scalability of our methodology and establish its superiority over the MH algorithm and the NUTS both in terms of time efficiency and memory requirements. 4. Finally, we demonstrate the opportunities in UQ given by the proposed algorithm on a real-word example in the field of nuclear physics. The rest of this chapter is organized as follows. In Section 3.1, we give a general overview of VBI. In Section 3.2, we derive our proposed VBI approach to perform an inexpensive and 48 scalable calibration. We establish statistical validity of the method and provide theoretical justification for its scalability. Subsequently, in Section 3.3, we discuss the implementation details with focus on strategies to reduce the variance of the gradient estimators that are at the center of stochastic optimization for VBI. Section 3.4 presents a simulation study comparing our approach with state-of-the-art methods to approximate posterior distribution and illustrates our method on a real-data application. All technical details, proofs, and supplementary results are provided in section 3.5. 3.1 Variational Bayes inference VBI is an optimization based method that approximates p(φ|d) by a family of distribu- tions q(φ|λ) over latent variables with its own variational parameter λ. Many commonly used families exist with the simplest mean-field family assuming independence of all the components in φ; see Wainwright and Jordan (2008); Hoffman and Blei (2015); Ranganath et al. (2016); Tran et al. (2015, 2017) for examples of more sophisticated families. The approximate distribution q∗ is chosen to satisfy q∗ = arg min q(φ|λ) KL(q(φ|λ)||p(φ|d)). (3.8) Here, KL denotes the Kullback-Leibler divergence of q(φ|λ) from p(φ|d). Finding q∗ is done in practice by maximizing the evidence lower bound (ELBO) (cid:20) L(λ) = Eq (cid:21) log p(d|φ) − KL(q(φ|λ)||p(φ)), (3.9) which is a sum of the expected data log-likelihood log p(d|φ) and the KL divergence between the combined prior distribution p(φ) of calibration parameters, the error scale σ, and GP hy- perparameters and the variational distribution q(φ|λ). Note that we set L(λ) := L(q(φ|λ)) for the ease of notation. Minimizing the ELBO is equivalent to minimizing the original 49 objective function. Indeed, (cid:20) (cid:18) KL(q(φ|λ)||p(φ|d)) = Eq = − log q(φ|λ) (cid:20) − Eq (cid:21) Eq log p(d|φ) − KL(q(φ|λ)||p(φ)) + log p(d). (cid:21) (cid:20) (cid:21) log p(φ|d) (cid:19) The ELBO can be optimized via the standard coordinate- or gradient-ascent methods. These techniques are inefficient for large datasets, because we must optimize the variational parameters globally for the whole dataset. Instead, it has become common practice to use a stochastic gradient ascent (SGA) algorithm, which Hoffman et al. (2013) named “stochastic variational inference” (SVI). 
Similarly to the traditional gradient ascent, SGA updates λ at the t-th iteration with

λ_{t+1} ← λ_t + ρ_t l̃(λ_t). (3.10)

Here, l̃(λ) is a realization of the random variable L̃(λ), so that E(L̃(λ)) = ∇λL(λ). Ranganath et al. (2014) showed that the gradient of the ELBO with respect to the variational parameter λ can be written as

∇λL(λ) = Eq[ ∇λ log q(φ|λ) ( log p(d|φ) − log( q(φ|λ)/p(φ) ) ) ], (3.11)

where ∇λ log q(φ|λ) is the gradient of the variational log-likelihood with respect to λ. SGA converges to a local maximum of L(λ) (global for L(λ) concave (Bottou et al., 1997)) when the learning rate ρ_t follows the Robbins-Monro conditions (Robbins and Monro, 1951)

Σ_{t=1}^{∞} ρ_t = ∞,   Σ_{t=1}^{∞} ρ_t^2 < ∞. (3.12)

The bottleneck in the computation of the gradient ∇λL(λ) is the evaluation of the log-likelihood log p(d|φ), which makes the traditional gradient methods as hard to scale as MCMC methods. SGA algorithms address this challenge. If we consider N independent observations di ∼ p(di|φ), then we can define a noisy estimate of the gradient ∇λL(λ) as

L̃(λ) := N Eq[ ∇λ log q(φ|λ) log p(dI|φ) ] − Eq[ ∇λ log q(φ|λ) log( q(φ|λ)/p(φ) ) ], (3.13)

where I ∼ U(1, . . . , N), so that E(L̃(λ)) = ∇λL(λ). Each update of λ computes the likelihood for only one observation di at a time, which makes SVI scalable for large datasets. One can easily see that, under the framework for Bayesian calibration, E(L̃(λ)) ≠ ∇λL(λ), and that the corresponding SVI does not scale (the noisy estimates are biased).

3.2 Variational calibration of computer models

In this section, we derive the algorithm for a scalable variational inference approach to Bayesian computer model calibration. The first step is finding a convenient decomposition of the likelihood p(d|φ) that allows for an unbiased stochastic estimate of the gradient ∇λL(λ) that depends only on a small subset of the data. Multivariate copulas, and specifically their pairwise construction which we shall introduce below, provide such a decomposition. We are not the first ones to use copulas in the context of VBI. For instance, Tran et al. (2015) and Smith et al. (2020) proposed a multivariate copula as a possible variational family. However, we are the first ones using copulas in the context of computer model calibration implemented via VBI.

3.2.1 Multivariate copulas and likelihood decomposition

Fundamentally, a copula separates the information on the dependence structure of N > 1 random variables X1, . . . , XN from their marginal distributions. Let us assume, for simplicity, that the marginal cumulative distribution functions (CDFs) F1, . . . , FN are continuous and possess the inverse functions F1^{−1}, . . . , FN^{−1}. It follows from the probability integral transform that Ui := Fi(Xi) ∼ U(0, 1) and conversely that Xi = Fi^{−1}(Ui). With this in mind, we have

P(X1 ≤ F1^{−1}(x1), . . . , XN ≤ FN^{−1}(xN)) = P(U1 ≤ x1, . . . , UN ≤ xN) := C(x1, . . . , xN).

The function C is a distribution with support on [0, 1]^N, uniform marginals, and is called a copula. Under the above assumptions, a one-to-one correspondence exists between copula
, FN and the joint distribution function F , there exists a unique copula C such that for all x = (x1, . . . , xN )T ∈ Rn: F (x1, . . . , xN ) = C(F1(x1), . . . , Fn(xN )). Conversely, given the CDFs F1, . . . , FN and a copula C, F defined through C(F1(x1), . . . , Fn(xN )) is an N-variate distribution function with marginals F1, . . . , FN . Consequently, one can write the joint probability density function (pdf) f of X = (X1, . . . , XN )T as f (x1, . . . , xN ) = c(F1(x1), . . . , Fn(xN )) N(cid:89) i=1 fi(xi), (3.14) where c represents the copula density and fi is the marginal pdf of Xi. The key reason for considering copulas is that one can decompose the N -dimensional copula density c into a product of bivariate copulas. The starting point for this construction is a recursive decomposition of the density f into a product of conditional densities N(cid:89) f (x1, . . . , xN ) = f (xi|x1, . . . , xi−1)f (x1). i=2 For N = 2, the Sklar’s theorem implies that and where f (x1, x2) = c12(F1(x1), F2(x2))f1(x1)f2(x2), f (x1|x2) = c12(F1(x1), F2(x2))f1(x1), c12 := c12(F1(x1), F2(x2)) (3.15) (3.16) (3.17) (3.18) is a density of C(F1(x1), F2(x2)) = F (x1, x2). Using (3.17) for the decomposition of (X1, Xt) given X2, . . . , Xt−1, we obtain f (xt|x1, . . . , xt−1) = ( cs,t;s+1,...,t−1)c(t−1),t · ft(xt), (3.19) t−2(cid:89) s=1 52 where and ci,j;i1,...,ik := ci,j;i1,...,ik (F (xi|xi1 , . . . , xik ), F (xj|xi1 , . . . , xik )) (3.20) F (xi, xj|xi1 , . . . , xik ) := Ci,j;i1,...,ik , . . . , xik , . . . , xik )). (3.21) Using (3.15) and (3.19) with the specific index choices s = i, t = i + j, we have that (F (xi|xi1 N−j(cid:89) (cid:20) N−1(cid:89) ), F (xj|xi1 (cid:21) N(cid:89) f (x1, . . . , xN ) = ci,(i+j);(i+1),...,(i+j−1) fk(xk). (3.22) j=1 i=1 k=1 are two-dimensional copulas evaluated at the CDFs F (xi|xi1 Note that ci,j;i1,...,ik , . . . , xik and F (xj|xi1 ). The decomposition above is called a D-vine distribution. A similar , . . . , xik class of decompositions is possible when one applies (3.17) on (Xt−1, Xt) given X1, . . . , Xt−2 and sets j = t − k, j + i = t to get a canonical vine (C-vine) (Cooke and Kurowicka, 2006): ) f (x1, . . . , xN ) = f1(x1) t−1(cid:89) k=1 (cid:20) N(cid:89) N−j(cid:89) t=2 (cid:20) N−1(cid:89) = (cid:21) ct−k,t;1,...,(t−k−1) · ft(xt) (cid:21) N(cid:89) cj,(j+i);1,...,(j−1) fk(xk). j=1 i=1 k=1 One can easily imagine that many such pair-copula decompositions exist. Bedford and Cooke (2002) observed that these can be represented graphically as a sequence of nested trees with undirected edges, which are referred to as vine trees. In order for a pair-copula decomposition to be feasible, Bedford and Cooke (2002) defined a regular vine tree (R-vine) on N variables consisting of connected trees T1, . . . ,TN−1 with nodes Ni and edges Ei satisfying the following conditions: 1. T1 has nodes N1 = {1, . . . , N} and edges E1. 2. For i = 2, . . . , N − 1 the tree Ti has nodes Ni = Ei−1 (i.e., edges in a tree become nodes in the next tree). 3. Two edges in Ti are joined in Ti+1 if they share a common node in Ti. 53 Figure 3.1: A D-vine tree representation of a copula with 5 variables. Here, we focus exclusively on the D-vine and C-vine decompositions because they repre- sent the most-studied instances of regular vines and provide an especially efficient notation. We note, however, that the following results can be extended to any regular vines. Properties of vine copulas (Cooke and Kurowicka, 2006). 
The vine copula construction is particularly attractive for two reasons. First, each pair of variables occurs only once as a conditioning set. Second, the bivariate copulas involved in the decompositions have a convenient form in the case of a Gaussian likelihood f. In particular, let X = (X1, . . . , XN)^T follow a multivariate normal distribution with Fj = Φ, j = 1, . . . , N, where Φ is the standard normal CDF. The bivariate copula density is

c_{i,j;i1,...,ik}(ui, uj) = (1/√(1 − κ^2)) exp{ −[ κ^2 (wi^2 + wj^2) − 2κ wi wj ] / (2(1 − κ^2)) }. (3.23)

Here, ui = F(xi|x_{i1}, . . . , x_{ik}), uj = F(xj|x_{i1}, . . . , x_{ik}), wi = Φ^{−1}(ui), wj = Φ^{−1}(uj), and κ = ρ_{i,j·i1,...,ik} is the partial correlation of variables i, j given i1, . . . , ik. The D-vine and C-vine decompositions also involve conditional CDFs, for which we need further expressions. Let v ∈ D and D−v := D \ v, so that D contains more than one element. F(xj|xD) is typically computed recursively as

F(xj|xD) = h( F(xj|x_{D−v}), F(xv|x_{D−v}) | ρ_{j,v·D−v} ), (3.24)

and the function h is for the Gaussian case given by

h(ui, uj|ρ_{i,j·i1,...,ik}) = Φ( ( Φ^{−1}(ui) − ρ_{i,j·i1,...,ik} Φ^{−1}(uj) ) / √(1 − ρ_{i,j·i1,...,ik}^2) ). (3.25)

Lastly, the partial correlation can also be computed recursively as

ρ_{i,j·D} = ( ρ_{i,j·D−v} − ρ_{i,v·D−v} ρ_{v,j·D−v} ) / ( √(1 − ρ_{i,v·D−v}^2) √(1 − ρ_{v,j·D−v}^2) ). (3.26)

3.2.2 Scalable algorithm with truncated vine copulas

We now consider the data likelihood p(d|φ) according to (3.5) and make use of vines to construct a noisy estimate of the gradient ∇λL(λ). We additionally assume that N = n + s, where n is the number of observations yi from the physical process, and s is the number of computer model runs zj. The log-likelihood log p(d|φ) can be rewritten according to the D-vine decomposition as

log p(d|φ) = Σ_{j=1}^{N−1} Σ_{i=1}^{N−j} p^D_{i,i+j}(φ), (3.27)

where

p^D_{i,i+j}(φ) = log c_{i,(i+j);(i+1),...,(i+j−1)} + (1/(n − 1)) ( log pi(di|φ) + log p_{i+j}(d_{i+j}|φ) ). (3.28)

This can be conveniently used in the expression of the ELBO gradient. For a D-vine, we have that

∇λL(λ) = Eq[ Σ_{j=1}^{N−1} Σ_{i=1}^{N−j} ∇λ log q(φ|λ) p^D_{i,i+j}(φ) ] − Eq[ ∇λ log q(φ|λ) log( q(φ|λ)/p(φ) ) ]. (3.29)

The following proposition gives a noisy unbiased estimate L̃^D(λ) of the gradient (3.29). Similarly, we can derive a noisy estimate L̃^C(λ) of the gradient using a C-vine. We leave the details to Section 3.5.1.

Proposition 1. Let L̃^D(λ) be an estimate of the ELBO gradient ∇λL(λ) defined as

L̃^D(λ) = (N(N − 1)/2) Eq[ ∇λ log q(φ|λ) p^D_{I_D(K)}(φ) ] − Eq[ ∇λ log q(φ|λ) log( q(φ|λ)/p(φ) ) ],

where K ∼ U(1, . . . , N(N − 1)/2), and I_D is the bijection I_D : {1, . . . , N(N − 1)/2} → {(i, i + j) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . , N − 1}}; then L̃^D(λ) is unbiased, i.e., E(L̃^D(λ)) = ∇λL(λ).

As in the case of SVI for independent data, these noisy estimates allow us to update the variational parameter λ without the need to evaluate the whole likelihood p(d|φ). We need to consider only the data consisting of a copula's conditioning and conditioned sets. Unfortunately, both L̃^D(λ) and L̃^C(λ) can be relatively costly to compute for large datasets because of the recursive nature of the calculations involved in evaluating the copula densities. According to Brechmann et al. (2012), Dissmann et al. (2013), and Brechmann and Joe (2015), the most important and strongest dependencies among variables can typically be captured best by the pair copulas of the first trees (a small numerical sketch of the Gaussian pair-copula building blocks (3.23)-(3.26) and of the random pair selection is given below).
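The following sketch (Python with SciPy; function names are illustrative, not the chapter's implementation) implements the Gaussian pair-copula log-density (3.23) and h-function (3.25), the partial-correlation recursion (3.26), and the uniform selection of a single D-vine pair (i, i + j) through a bijection like I_D in Proposition 1, which is the ingredient that makes the noisy gradient estimates cheap.

import numpy as np
from scipy.stats import norm

def gaussian_pair_copula_logdensity(u_i, u_j, kappa):
    """log c_{i,j;D}(u_i, u_j) for a Gaussian pair copula, cf. (3.23)."""
    w_i, w_j = norm.ppf(u_i), norm.ppf(u_j)
    return (-0.5 * np.log(1 - kappa ** 2)
            - (kappa ** 2 * (w_i ** 2 + w_j ** 2) - 2 * kappa * w_i * w_j)
              / (2 * (1 - kappa ** 2)))

def h(u_i, u_j, rho):
    """Gaussian h-function (3.25), used to build conditional CDFs recursively."""
    return norm.cdf((norm.ppf(u_i) - rho * norm.ppf(u_j)) / np.sqrt(1 - rho ** 2))

def partial_corr(rho_ij, rho_iv, rho_vj):
    """Recursion (3.26) for the partial correlation given one more variable v."""
    return (rho_ij - rho_iv * rho_vj) / np.sqrt((1 - rho_iv ** 2) * (1 - rho_vj ** 2))

def draw_dvine_pair(N, rng=np.random.default_rng(0)):
    """Uniformly draw one of the N(N-1)/2 D-vine pairs (i, i+j), i.e. I_D(K)."""
    # for large N one would invert the bijection arithmetically instead of enumerating
    pairs = [(i, i + j) for j in range(1, N) for i in range(1, N - j + 1)]
    return pairs[rng.integers(len(pairs))]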
This notion motivates the use of trun- cated vine copulas, where the copulas associated with the higher-order trees are set to the independence copulas. From the definition of a regular vine, one can show that the joint density f can be decomposed as f (d1, . . . , dN ) = (cid:20) N−1(cid:89) (cid:89) j=1 e∈Ei (cid:21) N(cid:89) k=1 fk(dk), cj(e),k(e);D(e) where e = j(e), k(e); D(e) ∈ Ei is an edge in the ith tree of the vine specification. We define the truncated regular vine copula as follows. Definition 2 (Brechmann et al. (2012)). Let U = {U1, . . . , UN} be a random vector with uniform marginals, and let l ∈ {1, . . . , N−1} be the truncation level. Let Π denote the bivari- ate independence copula. Then, U is said to be distributed according to an N-dimensional l-truncated R-vine copula if C is an N-dimensional R-vine copula with Cj(e),k(e);D(e) = Π ∀e ∈ Ei i = l + 1, . . . , N − 1. 56 For the case of an l-truncated D-vine, we have (cid:20) l(cid:89) N−j(cid:89) j=1 i=1 (cid:21) N(cid:89) k=1 fk(dk), (3.30) f (d1, . . . , dN ) = ci,(i+j);(i+1),...,(i+j−1) and analogically to the case of D-vine with no truncation, the log-likelihood p(d|φ) can be written as a sum of unique elements given in Proposition 2. Proposition 2. If the copula of p(d|φ) is distributed according to an l-truncated D-vine, we can rewrite where log p(d|φ) = l(cid:88) N−j(cid:88) j=1 i=1 Dl i,i+j(φ), p (3.31) Dl i,i+j(φ) = log ci,(i+j);(i+1),...,(i+j−1) + p log pi(di|φ) + 1 ai 1 bi+j log pi+j(di+j|φ), (3.32) and ai = 2l − bi+j = 2l − (cid:20) (cid:20) (l + 1 − i)1i≤l + (l − N + i)1i>N−l (l + 1 − j − i)1i+j≤l + (l − N + j + i)1i+j>N−l (cid:21) , (cid:21) . The main idea for the scalable variational calibration (VC) of computer models is replac- ing the full log-likelihood log(d|φ) in the definition of ELBO with the likelihood based on a truncated vine copula. This yields the l-truncated ELBO for the l-truncated D-vine LDl l(cid:88) N−j(cid:88) j=1 i=1 (λ) = Eq Dl i,i+j(φ) p − KL(q(φ|λ)||p(φ)) (3.33) j=1 i=1 (cid:20) ∇λ log q(φ|λ)(p Eq (cid:21) Dl i,i+j(φ)) (cid:20) − Eq ∇λ log q(φ|λ) log (cid:21) . q(φ|λ) p(φ) with its gradient ∇λLDl (λ) = The following proposition gives a noisy unbiased estimate ˜LDl We can analogously derive an unbiased estimate ˜LCl (λ) of the gradient ∇λLDl (λ). (λ) of the gradient using C-vine (see Section 3.5.1). 57 (cid:20) l(cid:88) N−j(cid:88) (cid:21) (λ) = Proposition 3. Let ˜LDl l(2N − (l + 1)) ˜LDl where K ∼ U (1, . . . , l(2N−(l+1)) l(2N − (l + 1)) 2 2 IDl then ˜LDl : {1, . . . , (λ) is unbiased i.e., E( ˜LDl 2 (cid:20) (λ) be an estimate of the ELBO gradient ∇λLDl ∇λ log q(φ|λ)(p Eq (cid:20) ∇λ log q(φ|λ) log − Eq (cid:21) (λ) defined as q(φ|λ) p(φ) (K)(φ)) (cid:21) , Dl IDl is the bijection ), and IDl } → {(i, i + j) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . l}}, (λ)) = ∇λLDl (λ). Considering the l-truncated ELBO defined above, our proposed algorithm for variational calibration of computer models with truncated vine copulas is stated in Algorithm 3.1. Note that ˜LDl (λ) does not have closed form expression in general due to expectations involved in the computation. Therefore, we resort to a MC approximation of the gradient estimate ˜LDl (λ) using samples from the variational distribution. Algorithm 3.1: Variational calibration with truncated D-vine copulas. 
Input: Data d, mean and covariance functions for GPs in Kennedy-O’Hagan framework, variational family q(φ|λ), truncation level l 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ q(φ|λ) 5 6 7 8 2 ) (cid:80)S K ← U (1, . . . , l(2N−(l+1)) (cid:20) ρ ← tth value of a Robbins-Monro sequence λ ← λ + ρ 1 l(2N−(l+1)) log q(φ[s]|λ) t ← t + 1 ∇λ log q(φ[s]|λ)(cid:0)p l(2N−(l+1)) (cid:1)(cid:21) p(φ[s]) s=1 S 2 2 9 10 until change of λ is less than  // Random sample from q (K)(φ[s]) − Dl IDl Scalability Discussion. The complexity of a bivariate copula evaluation depends on the size of the conditioning dataset due to the recursive nature of the calculations involved (Cooke and Kurowicka, 2006). From the vine tree construction, the cardinality of the conditioning 58 set for D-vine and C-vine is in the worst case N − 2. Nevertheless, on average, we can do better. Lemma 2. Let X be the cardinality of the conditioning set in pD ID(K)(φ) or pC IC (K)(φ), then and E(X) = N−2 3 . P (X = i) = (cid:0)N (cid:1) N − (i + 1) 2 for i ∈ {0, . . . , N − 2} (3.34) The cardinality of conditioning set in Lemma 2 is on average roughly N/3. On the other hand, the cardinality of conditioning set for the case of Algorithm 3.1 is at most l − 1 with the average given by the following lemma. Lemma 3. Let X be the cardinality of the conditioning set in p Dl IDl (K)(φ) or p Cl ICl (K)(φ), then and P (X = i) = N − (i + 1) l(2N−(l+1)) 2 for i ∈ {0, . . . , l − 1}, (3.35) E(X) = (l − 1)(3N − 2l − 2) 3(2N − l − 1) . As a consequence of Lemma 3, E(X) ≈ 2 for N = 105 and truncation level l = 5, which IC (K)(φ) (≈ 33333 for is a significant improvement to the average case pD ID(K)(φ) and pC N = 105). This provides a heuristic yet convincing argument for the scalability. 3.3 Implementation details 3.3.1 Selection of truncation level Selection of the truncation level l is an important element in effective approximation of the posterior distribution p(φ|d) under Algorithm 3.1. Dissmann et al. (2013) propose a sequential approach for selection of l in the case of vine estimation. One sequentially fits models with an increasing truncation level until the quality of fit stays stable or com- putational resources are depleted. We adopt similar idea for the case of VC of computer 59 models with vine copulas. Let λ(l) represents the value of variational parameter estimated with Algorithm 3.1 for a fixed truncation level l. One can then sequentially increase l until ∆(λ(l + 1), λ(l)) <  for some distance metric ∆ and a desired tolerance . 3.3.2 Variance reduction of Monte Carlo approximations The computational convenience of simple MC approximations of the gradient estimators based on the l-truncated D-vine and C-vine copulas ˜LDl (λ) and ˜LCl (λ) (see Section 3.2.2) is typically accompanied by their large variance. The consequence in practice is the need for small step size ρt in the SGA portion of Algorithm 3.1 which results in a slower convergence. In order to reduce the variance of MC approximations, we adopt the same approach as Ruiz et al. (2016) and use the Rao-Blackwellization (Casella and Robert, 1996) in combination with the control variates (Ross, 2006) and importance sampling. The reminder of this section focuses on the case of D-vine decomposition, see Section 3.5.1 for the derivations for C-vines. Rao-Blackwellization. The idea here is to replace the noisy estimate of gradient with its conditional expectation with respect to a subset of φ. 
For simplicity, let us consider a situation with φ = (φ1, φ2) ∈ R2 and variational family q(φ|λ) that factorizes into q(φ1|λ1)q(φ2|λ2). Additionally, let ˆLλ(φ1, φ2) be the MC approximation of the gradient ∇λL(λ). Now, the conditional expectation E[ ˆLλ(φ1, φ2)|φ1] is also an unbiased estimate of ∇λL(λ) since Eq(E[ ˆLλ(φ1, φ2)|φ1]) = Eq( ˆLλ(φ1, φ2)) and Varq(E[ ˆLλ(φ1, φ2)|φ1]) = Varq( ˆLλ(φ1, φ2)) − E[( ˆLλ(φ1, φ2) − E[ ˆLλ(φ1, φ2)|φ1])2] shows that Varq(E[ ˆLλ(φ1, φ2)|φ1]) ≤ Varq( ˆLλ(φ1, φ2)). The factorization of the variational family also makes the conditional expectation straightforward to compute as E[ ˆLλ(φ1, φ2)] q(φ1|λ1)q(φ2|λ2) q(φ1|λ1) dφ2 = E q(φ2|λ2)( ˆLλ(φ1, φ2)), 60 E[ ˆLλ(φ1, φ2)|φ1] = (cid:90) φ2 i.e., we just need to integrate out some variables. Let us consider the MC approximation of the gradient estimator ˜LDl (cid:20)l(2N − (l + 1)) S(cid:88) log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − ∇λj 1 S s=1 2 (λ). The jth entry of the Rao-Blackwellized estimator is 2 l(2N − (l + 1)) log q(φj[s]|λj) p(φj[s]) (cid:1)(cid:21) , where ˜p(j)(φ) are the components of p Dl IDl (K)(φ) that include φj. Control Variates. To further reduce the variance of the MC approximations we will replace the Rao-Blackwellized estimate above with a function that has the same expectation but again smaller variance. For illustration, let us first consider a target function ξ(φ) whose variance we want to reduce, and a function ψ(φ) with finite expectation. Define ˆξ(φ) = ξ(φ) − a(ψ(φ) − Eq[ψ(φ)]), (3.36) where a is a scalar and Eq( ˆξ(φ)) = Eg[ξ(φ)]. The variance of ˆξ(φ) is Varq( ˆξ(φ)) = Varq(ξ(φ)) + a2Varq(ψ(φ)) − 2aCovq(ξ(φ), ψ(φ)). (3.37) This shows that a good choice for function ψ(φ) is one that has high covariance with ξ(φ). Moreover, the value of a that minimizes (3.37) is a∗ = Covq(ξ(φ), ψ(φ)) Varq(ψ(φ)) . (3.38) Let us place the CV back into the context of calibration. Meeting the above described criteria, Ranganath et al. (2014) propose ψ(φ) to be ∇λ log q(φ|λ), because it depends only on the variational distribution and has expectation zero. We can now set the target function ξ(φ) to be l(2N − (l + 1)) 2 log q(φj|λj)(cid:0)˜p(j)(φ) − ∇λj 2 l(2N − (l + 1)) log (cid:1), q(φj|λj) p(φj) 61 which gives the following jth entry of the MC approximation of the gradient estimator ˜LDl (λ) with CV ˜LCV (j) Dl = 1 S (λ) S(cid:88) (cid:20)l(2N − (l + 1)) s=1 2 log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − ∇λj 2(log q(φj [s]|λj ) p(φj [s]) + ˆaD j ) l(2N − (l + 1)) (cid:1)(cid:21) , where ˆaD j is the estimate of a∗ based on additional independent draws from the variational approximation (otherwise the estimator would be biased). Importance sampling. Here, we outline the last variance reduction technique that makes use of importance sampling. We refer to Ruiz et al. (2016) for full description of the method and illustration of its efficiency in the VBI framework. Fundamentally, instead of taking samples from the variational family q(φ|λ) to carry out the MC approximation of the ELBO gradient estimate, we will take samples from an overdispersed distribution r(φ|λ, τ ) in the same family that depends on an additional dispersion parameter τ > 1. Namely, we can (cid:20) l(2N − (l + 1)) write the estimate ˜LDl (λ) as Er(φ|λ,τ ) 2 ∇λ log q(φ|λ)(p (K)(φ) − Dl IDl 2 l(2N − (l + 1)) (cid:21) )w(φ) , q(φ|λ) p(φ) log where w(φ) = q(φ|λ)/r(φ|λ, τ ) is the importance weight which guarantees the estimator to be unbiased. 
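The two corrections just introduced are simple to implement. The sketch below (illustrative, with hypothetical names, not the thesis code) shows the control-variate coefficient (3.38) estimated from an independent pilot sample, and the importance weight w(φ) for a Gaussian variational factor paired with the overdispersed proposal discussed later in Section 3.3.4.

```python
# A minimal sketch: the control-variate correction (3.36)-(3.38) with a* estimated
# from independent pilot draws, and the importance weight w(phi) = q(phi|lambda) /
# r(phi|lambda, tau) when r is the overdispersed Gaussian (variance inflated by tau).
import numpy as np
from scipy.stats import norm

def cv_corrected_mean(xi, psi, xi_pilot, psi_pilot):
    """Estimate E[xi] as mean(xi - a_hat * psi); psi is the score, so E[psi] = 0.
    a_hat = Cov(xi, psi) / Var(psi) is computed from an independent pilot sample,
    which keeps the corrected estimator unbiased."""
    a_hat = np.cov(xi_pilot, psi_pilot)[0, 1] / np.var(psi_pilot, ddof=1)
    return np.mean(xi - a_hat * psi)

def gaussian_importance_weights(phi, mu, sigma, tau):
    """w(phi) for samples phi drawn from the overdispersed proposal N(mu, tau * sigma^2)."""
    q = norm.pdf(phi, loc=mu, scale=sigma)
    r = norm.pdf(phi, loc=mu, scale=sigma * np.sqrt(tau))
    return q / r
```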
The reason to formulate the ˜LDl (λ) this way comes from the fact the optimal proposal (Robert and Casella, 2005) distribution to form the MC estimate is not q(φ|λ), but rather r∗(φ) ∝ q(φ|λ)|ξ(φ)|, where ξ(φ) = l(2N − (l + 1)) 2 ∇λ log q(φ|λ)(cid:0)p (K)(φ) − Dl IDl 2 l(2N − (l + 1)) (3.39) (3.40) (cid:1). q(φ|λ) p(φ) log However, the normalizing constant for the optimal r∗(φ) is intractable, and so Ruiz et al. (2016) propose that an overdispersed version of the variational family that assigns higher 62 probability to the tails of q(φ|λ) is closer to the optimum than q(φ|λ) itself. For example, if the value of λ makes the variational family a poor fit, then the samples φ[s] ∼ q(φ|λ) have a high value for the variational distribution but low for the true posterior. On the other hand, r∗(φ) proposes values of φ[s] for which ξ(φ) is large that are in the tails of p(φ|λ). To see how the importance sampling leads to the reduction of variance of the MC esti- mates, let us consider the following estimator then ξ(φ[s]), φ[s] ∼ p(φ|λ), (cid:2)ξ2(φ)(cid:3) − 1 S (cid:2) ˜LDl (λ)(cid:3)2. Eq 1 S (3.41) (3.42) Similarly, we can derived the variance of the MC estimator with the importance weights s=1 1 S S(cid:88) (cid:3) = (cid:98)LM C = Var(cid:2)(cid:98)LM C S(cid:88) (cid:3) = (cid:2)ξ2(φ) 1 S s=1 Eq ξ(φ[s]) 1 S M C = (cid:98)LO Var(cid:2)(cid:98)LO as q(φ|λ) r(φ|λ, τ ) Now, if we choose the distribution r(φ|λ, τ ) such that Eq M C q(φ[s]|λ) r(φ[s]|λ, τ ) (cid:2)ξ2(φ) (cid:3) ≤ Eq q(φ|λ) r(φ|λ, τ ) , φ[s] ∼ r(φ|λ, τ ), (λ)(cid:3)2. (cid:2) ˜LDl (cid:3) − 1 (cid:2)ξ2(φ)(cid:3), S (3.43) (3.44) (3.45) the variance reduction will be achieved. The optimal r∗ obviously satisfies the condition (3.45). Ruiz et al. (2016) show that the choice of overdispersed version of the variational family q(φ|λ) has similar effect on the variance reduction as the optimal r∗. The details on the form of overdispersed families for specific variational families are discussed later in Section 3.3.4. Combining the ideas of the Rao-Blackwellization, CV, and importance sampling, we have the following jth entry of the MC approximation of the gradient estimator ˜LDl ˜LOCV (j) Dl (λ) (λ) (cid:21) )w(φj[s]) , 2(log q(φj [s]|λj ) p(φj [s]) + ˜aD j ) l(2N − (l + 1)) (cid:20) l(2N − (l + 1)) S(cid:88) = s=1 2S ∇λj log q(φj[s]|λj)(˜p(j)(φ[s]) − 63 where φ[s] ∼ r(φ|λ, τ ) and ˜aD j = (cid:100)Covr( l(2N−(l+1))w(φj ) 2 ∇λj log q(φj|λj)(˜p(j)(φ) − 2 log (cid:100)Varr(∇λj q(φj|λj ) p(φj ) log q(φj|λj)w(φj)) l(2N−(l+1)) ),∇λj log q(φj|λj)w(φj)) . The extension of Algorithm 3.1 with the variance reductions of the MC approximations due to the Rao-Blackwellization, control variates, and importance sampling is in Algorithm 3.2. Algorithm 3.2: Variational calibration with truncated D-vine copulas II. Input: Data d, mean and covariance functions for GPs in Kennedy-O’Hagan framework, variational family q(φ|λ), dispersion parameter τ truncation level l // Random sample from r 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ r(φ|λ, τ ) 5 6 7 8 K ← U (1, . . . , l(2N−(l+1)) ρ ← tth value of a Robbins-Monro sequence ) 2 (cid:20) log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − λ ← λ + ρ(cid:80)S 2(log s=1 q(φj [s]|λj ) p(φj [s]) l(2N−(l+1)) l(2N−(l+1)) 2S ∇λj (cid:21) (cid:1)w(φj[s]) +˜aD j ) t ← t + 1 9 10 until change of λ is less than  3.3.3 Choice of the learning rate Even though the SGA is straightforward in its general definition, the choice of learning rate ρt can be challenging in practice. 
Ideally, one would want the rate to be small in situations where the noisy estimates of the gradient have large variance, and vice versa. The elements of the variational parameter λ can also differ in scale, and one needs to set the learning rate so that the SGA can accommodate even the smallest scales. The rapidly increasing use of machine learning techniques in recent years has produced various algorithms for element-wise, adaptive-scale learning rates. We use the adaptive gradient (AdaGrad) algorithm (Duchi et al., 2011), which has been considered in similar problems before, e.g., Ranganath et al. (2014); other popular algorithms include ADADELTA (Zeiler, 2012) and RMSProp (Tieleman and Hinton, 2012). Let g_T be the gradient used in the T-th step of the SGA algorithm, and let G_t be the matrix consisting of the sum of the outer products of these gradients across the first t iterations, namely

G_t = \sum_{T=1}^{t} g_T g_T^T. (3.46)

AdaGrad defines the element-wise adaptive-scale learning rate as

\rho_t = \eta \cdot \mathrm{diag}(G_t)^{-1/2}, (3.47)

where η is the initial learning rate. It is common practice, however, to add a small constant value (typically of order 10^{-6}) to diag(G_t) to avoid division by zero.

3.3.4 Parametrizations

Variational families. We use a Gaussian distribution for real-valued components of φ and a gamma distribution for positive variables. Both of these families are parametrized in terms of their mean and standard deviation. Moreover, in order to avoid constrained optimization, we transform all the positive variational parameters λ to λ̃ = log(e^λ − 1) and optimize with respect to λ̃.

Overdispersed families. Given a fixed dispersion coefficient τ, the overdispersed Gaussian distribution with mean µ and standard deviation σ is a Gaussian distribution with mean µ and standard deviation σ√τ. The overdispersed gamma distribution with mean µ and standard deviation σ is a gamma distribution with mean µ + (τ − 1)σ²/µ and standard deviation σ√(τµ² + τσ²(τ − 1))/µ (Ruiz et al., 2016).

3.4 Applications

This section empirically establishes the efficiency of our methodology for the VBI-based calibration of computer models. First, we conduct an extensive simulation study, where we focus both on the fidelity of the variational approximation and on prediction accuracy. Second, we demonstrate the opportunities in UQ given by the proposed methodology through calibration of the Liquid Drop Model.

The Bayesian predictions of new observations from the physical process ζ at input locations (t*_1, . . . , t*_J) are obtained according to (3.6). The conditional distribution p(y*|d, φ) is a multivariate normal distribution with the mean vector

M_{y^*}(\phi) = M_f(T^*_y(\theta)) + M_\delta(T^*_y) + C^* K(\phi)^{-1} (d - M(\phi)), (3.48)

and the covariance matrix

K_{y^*}(\phi) = K_f(T^*_y(\theta), T^*_y(\theta)) + K_\delta(T^*_y, T^*_y) + \sigma^2 I_J - C^* K(\phi)^{-1} C^{*T}, (3.49)

where

C^* = \Big( K_f(T^*_y(\theta), T_y(\theta)) + K_\delta(T^*_y, T_y) \quad K_f(T^*_y(\theta), T_z(\tilde{\theta})) \Big). (3.50)

Here, M(φ) and K(φ) are the mean vector and the covariance matrix of the data likelihood p(d|φ), K_f(T*_y(θ), T_y(θ)) is the matrix with (i, j) element k_f((t*_i, θ), (t_j, θ)), and K_δ(T*_y, T_y) is the matrix with (i, j) element k_δ(t*_i, t_j). We can similarly define K_f(T*_y(θ), T_z(θ̃)) with the kernel k_f.
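For concreteness, the following sketch (illustrative only; the array names are placeholders, not the thesis code) shows how the predictive mean and covariance (3.48)-(3.49) can be evaluated with a Cholesky factorization of K(φ) instead of an explicit inverse.

```python
# A sketch of the predictive equations (3.48)-(3.49) for a fixed value of phi;
# all inputs are assumed to be precomputed numpy arrays.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predictive_mean_cov(d, M, K, C_star, M_star, K_star_prior, sigma):
    """Return the mean and covariance of p(y* | d, phi).

    d            : data vector (y, z) of length N
    M, K         : mean vector and covariance matrix of p(d | phi)
    C_star       : J x N cross-covariance between the new points and the data, eq. (3.50)
    M_star       : prior mean at the J new points, M_f + M_delta
    K_star_prior : J x J prior covariance at the new points, K_f + K_delta
    sigma        : observation noise scale
    """
    chol = cho_factor(K, lower=True)
    mean = M_star + C_star @ cho_solve(chol, d - M)
    cov = (K_star_prior + sigma**2 * np.eye(len(M_star))
           - C_star @ cho_solve(chol, C_star.T))
    return mean, cov
```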
3.4.1 Simulation study

In this section, we study Algorithm 3.2 in a simulated scenario, where we first demonstrate the method's fidelity in approximating the posterior distribution of the calibration parameters p(θ|d) and substantiate the indispensability of the variance reduction techniques described in Section 3.3.2 for achieving convergence. Second, we show the scalability of our method in comparison with the popular MH algorithm and the NUTS.

Let us consider a simple scenario following the model (3.4) with a two-dimensional calibration parameter θ = (0.39, 0.60), obtained as a sample from its prior distribution p(θ), and a two-dimensional input variable t = (t1, t2). We model fm(t, θ) and δ(t) with GPs according to the specifications in Table 3.1, with the particular choices η_f = 1/2, l_t = 1, l_θ = 1, η_δ = 1/30, l_δ = 1/30, and β_δ = 0.15.

GP     GP mean                       GP covariance function
f_m    θ1 cos(t1) + θ2 sin(t2)       η_f · exp( −||t − t'||²/(2 l_t²) − ||θ − θ'||²/(2 l_θ²) )
δ      β_δ                           η_δ · exp( −||t − t'||²/(2 l_δ²) )

Table 3.1: The specification of GPs for the simulation study.

We choose the variational family to be the mean-field family with Gaussian distributions for real-valued parameters and gamma distributions for positive variables, following the parametrization discussed in Section 3.3.4. The variational parameters are initialized to match the prior distributions, and we use AdaGrad for the learning rate updates.

Calibration. For the purpose of model calibration, we sampled the data d jointly from the prior with the experimental noise following N(0, 1/100). The calibration parameter values for the model runs z were selected on a uniform grid over [0, 1]^2 and the inputs t over [0, 3]^2. For the first set of experiments, the size of the dataset was N = 225 with n = 144 and s = 81. We used 50 samples from the variational family to approximate the expectations in Algorithm 3.2 and 10 samples to implement the control variates.

Figure 3.2 demonstrates the quality of the variational approximation (Algorithm 3.2) in comparison to the MH algorithm and the NUTS. We can see that our method was able to accurately match both MCMC-based approximations, with a minor deviation in θ1. It is important to note, however, that the variance reduction through the combination of Rao-Blackwellization, control variates, and importance sampling was necessary to achieve meaningful convergence.

Figure 3.2: The approximate posterior distributions for the target calibration parameters. The VC (Algorithm 3.2) was carried out using the l = 3 truncated D-vine and compared with the results from the NUTS and the MH algorithm.

In particular, Figure 3.3 shows the MSE of the posterior predictive means, evaluated on an independently generated set of 50 data points, based on the VC with cumulatively implemented variance reduction techniques. Algorithm 3.2, which employs importance sampling, clearly outperforms the calibration with only Rao-Blackwellization and the calibration with control variates. In fact, each additional variance reduction step tends to decrease the MSE by one order of magnitude. There is naturally a time and space (memory) cost associated with each variance reduction technique. Figure 3.3 shows that the control variates and the importance sampling practically double the time per iteration of the algorithm. This additional complexity is, however, outweighed by the gain in MSE reduction.
The increase in memory consumption is less significant and is due to the storage of the dispersion coefficients used for importance sampling and of the samples needed to compute the control variates. Note that the memory consumed by the algorithms rises over time because we chose to store the values of the variational parameters at each step; the memory demands can be dramatically reduced if we drop these intermediate results. For completeness, Table 3.2 also compares the MSE of the MCMC approximations and the VC at the point of convergence of the algorithms. The resulting prediction errors were, for all practical purposes, equivalent.

Figure 3.3: The evolution of the MSE of the posterior predictive means based on the VC with cumulatively implemented variance reduction techniques described in Section 3.3.2. The figure is based on an independently generated set of 50 testing points. Time and memory demands for each of the implementations are also plotted; the VC (Algorithm 3.2) was carried out using the l = 3 truncated D-vine.

Algorithm                                    MSE
Variational Calibration - RB + CV + IS       2.9 × 10^−3
Metropolis-Hastings                          3.0 × 10^−3
No-U-Turn                                    3.0 × 10^−3

Table 3.2: Comparison of the MSE for the simple scenario using the MH, the NUTS, and the VC algorithms.

Scalability. We now significantly increase the size of the dataset from N = 225 to 0.5 × 10^4 and eventually to 2 × 10^4, with the simulated experimental measurements and the model runs split equally (n = s). For better numerical stability, we expand the space of the input variables to t ∈ [0, 10]^2 and select the inputs using a Latin hypercube design. We also enlarge the testing dataset to 200 points. All the remaining simulation parameters are unchanged. The conventional MCMC methods are already impractical for the purpose of Bayesian calibration with these moderately large amounts of data. We were able to obtain only around 600 posterior samples in the case of N = 1 × 10^4 and about 120 for N = 2 × 10^4 in 25 hours of sampling using the MH algorithm (significantly fewer with the NUTS).

Figure 3.4: The evolution of the MSE of the posterior predictive means based on the VC (Algorithm 3.2), the MH algorithm, and the NUTS. The figure is based on an independently generated set of 200 testing points. The VC (Algorithm 3.2) was carried out using the l = 5 truncated D-vine.

Figure 3.4 demonstrates that Algorithm 3.2 (D-vine with truncation l = 5) converges to a predictive MSE of about 0.003 in under 4 hours for N = 2 × 10^4 and 2 hours for N = 0.5 × 10^4. It took the MH a similar time to achieve this MSE value for N = 0.5 × 10^4, but almost 25 hours for the NUTS. Once we increased the data size to 2 × 10^4, neither the NUTS nor the MH was able to achieve a predictive MSE comparable to the VC within the 25-hour window allotted for sampling; in fact, their errors were larger by an order of magnitude. It is important to mention that both MCMC-based algorithms also have substantially larger memory demands than the VC, as depicted in Figure 3.5. These memory profiles were recorded during a one-hour period of running the algorithms. The MH algorithm and the NUTS were implemented in Python 3 using the PyMC3 module, version 3.5. The memory profiles were measured with the memory-profiler module, version 0.55.0, in Python 3; the VC was also implemented in Python 3.
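As a rough illustration of how such profiles can be collected (a sketch assuming the memory-profiler package; run_calibration is a hypothetical placeholder for whichever algorithm is being profiled), one can record memory readings at fixed intervals while the algorithm runs:

```python
# Record memory usage (in MiB) of a calibration run at 60-second intervals;
# run_calibration is a placeholder for Algorithm 3.2, the MH sampler, or the NUTS.
from memory_profiler import memory_usage

def run_calibration():
    ...  # launch the calibration algorithm here

profile = memory_usage((run_calibration, (), {}), interval=60.0)
```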
Figure 3.5: Recorded memory profiles of Algorithm 3.2, the MH algorithm, and the NUTS for the duration of 1 hour under the simulation scenario.

3.4.2 Calibration of the Liquid Drop Model

Over the past decade or so, the statistical tools of UQ have experienced a robust ramp-up in use in the field of nuclear physics (Ireland and Nazarewicz, 2015). Bayesian calibration has been especially popular because it enhances the understanding of a nuclear model's structure through parameter estimation and potentially advances the quality of nuclear modeling by accounting for systematic errors. In this context, we use our variational Algorithm 3.2 to calibrate the 4-parameter LDM. Since we discussed the LDM several times in this dissertation, we refer the reader to Section 1.1 for a detailed description of the model. We also note that this is by no means the first case in which Bayesian calibration methodology has been applied to study the LDM. In fact, the LDM is a popular model for statistical applications (Bertsch et al., 2005; Yuan, 2016; Bertsch and Bingham, 2017), which is why we choose it to illustrate our methodology as well. The LDM also generally performs better on heavy nuclei than on light nuclei, which alludes to the existence of a significant systematic discrepancy between the model and the experimental binding energies (Reinhard et al., 2006; Kejzlar et al., 2020). Namely, we consider the following statistical model

y = E_B(N, Z) + \delta(N, Z) + \sigma\varepsilon, (3.51)

where δ(N, Z) represents the unknown systematic discrepancy between the semi-empirical mass formula E_B(N, Z) and the experimental binding energies y. The parameter σ is, as usual, the scale of the observation error ε ∼ N(0, 1). The nuclear physics community often (Dobaczewski et al., 2014) considers the least squares (LS) estimator of θ defined as

\hat{\theta}_{L_2} = \arg\min_{\theta} \sum_{i=1}^{n} \big(y_i - E_B(N_i, Z_i)\big)^2, (3.52)

which is also the maximum likelihood estimate of θ in the case of δ = 0. The benefit of this estimator is that it is fast, easy to compute, and allows for analysis under standard linear regression theory. It, however, neglects some sources of uncertainty that are accounted for in the Bayesian calibration framework.

To this end, we shall consider a GP prior with mean zero and the squared exponential covariance function for the systematic discrepancy δ(Z, N). Since the main purpose of the example is to provide a canonical illustration of the methodology in a real-data scenario, we also set a GP prior for the LDM and treat E_B(Z, N) as an unknown function. We use 2000 experimental binding energies randomly selected from the AME2003 dataset (Audi et al., 2003) (publicly available at http://amdc.impcas.ac.cn/web/masstab.html) for calibration, see Figure 3.6, and an additional set of 10^4 model evaluations. The calibration inputs were generated with a Latin hypercube design so that all the reasonable values of (avol, asurf, asym, aC) given by the literature are covered (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Kirson, 2008; Benzaid et al., 2020). The model inputs (Z, N) were selected from the set of 2000 experimental binding energies, duplicated five-fold, and randomly permuted among the generated calibration inputs to span only the set of relevant nuclei. This relatively large number of model runs was chosen so that the combined 6-dimensional space of calibration parameters and model inputs is sufficiently covered, considering the existence of a non-trivial systematic discrepancy. In fact, a uniform experimental design would amount to only 4-5 points per dimension.

Figure 3.6: Experimental binding energies of nuclei in the AME2003 dataset (2225 observations).

Independent Gaussian distributions centered at the LS estimates θ̂_{L_2} (in Table 3.3), with standard deviations large enough to cover the space of inputs used for generating the model runs, were selected to represent the prior knowledge about the calibration parameters. Independent gamma distributions were used as the prior models for the hyperparameters of the GPs' covariance functions. We choose the variational family to be fully factorized, with Gaussian distributions for real-valued parameters and gamma distributions for positive variables. The means of the variational families were initialized as random samples from their respective prior distributions, and the variances were set to match those of the prior distributions. We used AdaGrad for the stochastic optimization. See Section 3.5.3 for further discussion of the prior distributions and the experimental design.

Results. Including the generated model runs, the overall size of the training dataset is 1.2 × 10^4, which already makes MCMC-based calibration impractical, as illustrated by the simulation study in Section 3.4.1. We therefore assess the quality of the variational approximation only against the standard LS estimation and do not consider the MCMC methods. In particular, we consider the testing dataset of the remaining 225 experimental binding energies in AME2003 that were excluded from the training data. The predictions ŷ* of these testing binding energies y* were calculated, under the variational approximation, as the posterior means of y* conditioned on the 1.2 × 10^4 binding energies from the training dataset, i.e., the posterior means of the predictive distribution p(y*|d). The predictions under the LS estimates θ̂_{L_2} were given by the semi-empirical mass formula (1.1).

Table 3.3 gives the RMSEs for both methods under consideration. The VC (Algorithm 3.2) results are based on a 24-hour window dedicated to running the algorithm, with 50 samples used to approximate the expectations, 10 samples used to implement the control variates, and the truncation level selected to be l = 3. By using GPs to account for the systematic discrepancies of the semi-empirical mass formula and the uncertainty of the LDM itself, we were able to reduce the RMSE by approximately 57% compared to the LS benchmark. Table 3.3 additionally shows the calibration parameter estimates and their standard errors. The estimates under the VC are given by the means of their variational families. Both methods calibrate the LDM around the same values, with notably low standard errors for the LS estimates. This is, however, expected, since the θ̂_{L_2} estimates are ordinary LS estimates that, in the presence of heteroscedasticity (see Figure 3.7), become inefficient and tend to significantly underestimate the true variance (Goldberger, 1966; Johnston, 1976).
Method Parameter estimate and standard errors asurf avol 15.42 (0.027) 16.91 (0.086) 22.47 (0.070) 0.69 (0.002) 15.78 (0.198) 15.99 (0.681) 21.94 (0.510) 0.68 (0.018) asym aC LS VC Testing error RMSE (MeV) 3.54 1.52 Table 3.3: The RMSE of the VC (Algorithm 3.2) after 24 hours dedicated to running the algorithm compared with the RMSE based on the LS estimates. The parameter estimates (and their standard errors) are also displayed. The residual plot in Figure 3.7, showing the difference between y∗ and ˆy∗ as a function of the nuclear mass number A, clearly demonstrates a better fit of the testing data with our methodology than is achieved by the simple LS fit. The majority of the residuals appear to be randomly spread around 0 which strongly supports the efficiency of GPs in accounting for the systematic discrepancy δ. Figure 3.7: The residual plot for 225 experimental binding energies in the testing dataset. 75 050100150200250A=Z+N15105051015LSVC 3.5 Technical details and supplementary results 3.5.1 Scalable algorithm with truncated vine copulas: C-vine Here we present the details of the C-vine based versions of Algorithm 3.1 and Algo- rithm 3.2. First, we can decompose the log-likelihood log p(d|φ) using a C-vine as log p(d|φ) = pC j,j+i(φ), (3.53) N−1(cid:88) N−j(cid:88) j=1 i=1 where pC j,j+i(φ) = log cj,(j+i);1,...,(j−1) + 1 N − 1 (cid:0) log pj(dj|φ) + log pj+i(dj+i|φ)(cid:1). (cid:21) (cid:20) − Eq ∇λ log q(φ|λ) log q(φ|λ) p(φ) (3.54) (cid:21) . (3.55) This now yields the following expression for the ELBO gradient: (cid:20) N−1(cid:88) N−j(cid:88) j=1 i=1 ∇λL(λ) = ∇λ log q(φ|λ)(pC j,j+i(φ)) Eq Equivalently to Proposition 1, we have the following proposition that establishes the noisy unbiased estimate of the gradient (3.55) using the C-vine copula decomposition. Proposition 4. Let ˜LC (λ) be an estimate of the ELBO gradient ∇λL(λ) defined as q(φ|λ) p(φ) (cid:20) ∇λ log q(φ|λ) log ∇λ log q(φ|λ)(pC ˜LC (λ) = N (N − 1) IC (K)(φ)) − Eq (cid:21) (cid:20) Eq 2 (cid:21) , where K ∼ U (1, . . . , N (N−1) N (N − 1) 2 IC : {1, . . . , 2 ), and IC is the bijection } → {(j, j + i) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . N − 1}}, then ˜LC (λ) is unbiased i.e., E( ˜LC (λ)) = ∇λL(λ). Again, ˜LC (λ) can be relatively costly to compute for large datasets due to the recursive nature of the copula density computations. We now carry out exactly the same development an using l-truncated C-vine as in the case of Proposition 2 and Proposition 3. 76 Proposition 5. If the copula of p(d|φ) is distributed according to an l-truncated C-vine, we can rewrite log p(d|φ) = l(cid:88) N−j(cid:88) Cl i,i+j(φ), p j=1 i=1 cl i,i+j(φ) = log cj,(j+i);1,...,(j−1) + p 1 aj log pj(dj|φ) + 1 bj+i log pj+i(dj+i|φ), where and aj = N − 1, bj+i = (N − 1 − l)1j+i≤l + l. (3.56) (3.57) Let us now replace the full log-likelihood log(d|φ) in the definition of ELBO with the likelihood based on a truncated vine copula. This yields the l-truncated ELBO for the l-truncated C-vine (cid:20) l(cid:88) N−j(cid:88) p (cid:21) LCl N−j(cid:88) l(cid:88) j=1 i=1 (λ) = Eq Cl j,j+i(φ) − KL(q(φ|λ)||p(φ)) (3.58) (cid:20) Eq j=1 i=1 ∇λ log q(φ|λ)(p (cid:21) Cl j,j+i(φ)) (cid:20) − Eq ∇λ log q(φ|λ) log (cid:21) . q(φ|λ) p(φ) with its gradient ∇λLCl (λ) = Consequently, we get the following proposition that establishes the noisy unbiased estimate (λ). of ∇λLCl (cid:20) (λ) be an estimate of the ELBO gradient ∇λLCl Proposition 6. Let ˜LCl l(2N − (l + 1)) ˜LCl ∇λ log q(φ|λ)(p Eq where K ∼ U (1, . . . 
, l(2N−(l+1)) l(2N − (l + 1)) is the bijection (K)(φ)) − Eq Cl ICl (λ) = (cid:20) (cid:21) 2 2 ∇λ log q(φ|λ) log (λ) defined as q(φ|λ) p(φ) (cid:21) , ICl then ˜LCl : {1, . . . , (λ) is unbiased i.e., E( ˜LCl 2 ), and ICl } → {(j, j + i) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . l}}, (λ)) = ∇λLCl (λ). Algorithm 3.3 postulates the version of Algorithm 3.1 based on the truncated C-vine decomposition. 77 Algorithm 3.3: Variational calibration with truncated C-vine copulas. Input: Data d, mean and covariance functions for GPs in Kennedy-O’Hagan framework, variational family q(φ|λ), truncation level l 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ q(φ|λ) 5 6 7 8 2 ) (cid:80)S K ← U (1, . . . , l(2N−(l+1)) (cid:20) ρ ← tth value of a Robbins-Monro sequence λ ← λ + ρ 1 l(2N−(l+1)) log q(φ[s]|λ) t ← t + 1 ∇λ log q(φ[s]|λ)(cid:0)p l(2N−(l+1)) (cid:1)(cid:21) p(φ[s]) S s=1 2 2 9 10 until change of λ is less than  // Random sample from q (K)(φ[s]) − Cl ICl Variance reduction of MC estimates. Let us now consider the MC approximation of (λ), the jth entry of the Rao-Blackwellized estimator is the gradient estimator ˜LCl (cid:20) l(2N − (l + 1)) S(cid:88) 1 S s=1 2 log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − ∇λj 2 l(2N − (l + 1)) log q(φj[s]|λj) p(φj[s]) where ˜p(j)(φ) are here the components of p Cl ICl (K)(φ) that include φj. We can again use the control variates to reduce the variance of MC approximation of the gradient estimator ˜LCl Blackwellized MC approximation of the gradient estimator ˜LCl (λ). In particular, we consider the following jth entry of the Rao- (λ) with control variates (cid:20)l(2N − (l + 1)) S(cid:88) s=1 2S ˜LCV (j) Cl (λ) = ∇λj log q(φj[s]|λj)(˜p(j)(φ[s]) − 2(log q(φj [s]|λj ) p(φj [s]) + ˆaC j ) l(2N − (l + 1)) where ˆaC j is the estimate of the optimal control variate scalar a∗ based on S (or fever) independent draws from the variational distribution. Namely, (cid:1)(cid:21) , (cid:21) ) , (cid:100)Covq( l(2N−(l+1)) 2 ˆaC j = ∇λj log q(φj|λj)(˜p(j)(φ) − (cid:100)Varq(∇λj log q(φj|λj)) 2 log q(φj|λj ) l(2N−(l+1)) log p(φj )),∇λj log q(φj|λj)) . 78 As in the case of the D-vine, we now derive the ultimate version of Algorithm 3.3. Again, instead of taking the samples from q(φ|λ) to approximate the gradient estimates, we will take samples from an overdispersed distribution r(φ|λ, τ ). Combining the Rao- Blackwellization, control variates, and importance sampling, we have the following jth entry of the MC approximation of the gradient estimator ˜LCl ˜LOCV (j) Cl (λ) ∇λj log q(φj[s]|λj)(˜p(j)(φ[s]) − 2(log q(φj [s]|λj ) p(φj [s]) + ˜aC j ) l(2N − (l + 1)) )w(φj[s]) , (cid:21) (λ) (cid:20) l(2N − (l + 1)) S(cid:88) = s=1 2S where φ[s] ∼ r(φ|λ, τ ) and w(φ[s]) = q(φ[s]|λ)/r(φ[s]|λ, τ ) with ˜aC j = (cid:100)Covr( l(2N−(l+1))w(φj ) 2 ∇λj log q(φj|λj)(˜p(j)(φ) − 2 log (cid:100)Varr(∇λj q(φj|λj ) p(φj ) log q(φj|λj)w(φj)) l(2N−(l+1)) ),∇λj log q(φj|λj)w(φj)) . Algorithm 3.4: Variational calibration with truncated C-vine copulas II. Input: Data d, mean and covariance functions for GPs, variational family q(φ|λ), dispersion parameter τ truncation level l // Random sample from r 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ r(φ|λ, τ ) 5 6 7 8 K ← U (1, . . . 
, l(2N−(l+1)) ρ ← tth value of a Robbins-Monro sequence ) 2 (cid:20) λ ← λ + ρ(cid:80)S 2(log s=1 q(φj [s]|λj ) p(φj [s]) l(2N−(l+1)) +˜aC j ) 2S l(2N−(l+1)) ∇λj (cid:21) (cid:1)w(φj[s]) log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − t ← t + 1 9 10 until change of λ is less than  79 3.5.2 Proofs Proof of Proposition 1. Since P (K = k) = N (N−1), we have directly from the definition of expectation 2 E( ˜LD(λ)) = (cid:20) N (N−1) 2(cid:88) N (N − 1) (cid:20) 2 ∇λ log q(φ|λ) log k=1 2 (cid:21) N (N − 1) q(φ|λ) p(φ) − Eq = ∇λL(λ). (cid:21) Eq ∇λ log q(φ|λ)(pD ID(k)(φ)) The final equality is the consequence of the uniqueness of the pairs of variables in the conditioned sets of the copula density ci,(i+j);(i+1),...,(i+j−1), and that N (N−1) 2 is the number of unordered pairs of N variables. Proof of Proposition 2. It is sufficient to show that for l ∈ {1, . . . , N − 1} the following equality holds: l(cid:88) N−j(cid:88) (cid:20) 1 j=1 i=1 ai where log pi(di|φ) + 1 bi+j (cid:21) log pi+j(di+j|φ) = N(cid:88) k=1 (cid:20) (cid:20) (l + 1 − i)1i≤l + (l − N + i)1i>N−l (l + 1 − j − i)1i+j≤l + (l − N + j + i)1i+j>N−l (cid:21) , (cid:21) . ai = 2l − bi+j = 2l − log p(dk|φ), (3.59) To show this, let us consider the summation (cid:20) (cid:21) log pi(di|φ) + log pi+j(di+j|φ) l(cid:88) N−j(cid:88) (cid:20) l(cid:88) j=1 i=1 = (cid:21) (log p1(d1|φ) + log p1+j(d1+j|φ)) + ··· + (log pN−j(dN−j|φ) + log pN (dN|φ)) . j=1 80 For l = 1, we get l(cid:88) N−j(cid:88) (cid:20) log pi(di|φ) + log pi+j(di+j|φ) (cid:21) j=1 i=1 and for l ≥ 2 = (log p1(d1|φ) + log p2(d2|φ)) + ··· + (log pN−1(dN−1|φ) + log pN (dN|φ)), (cid:20) l(cid:88) N−j(cid:88) (cid:20) (cid:21) log pi(di|φ) + log pi+j(di+j|φ) (cid:20) (cid:21) (log p1(d1|φ) + log p1+l(d1+l|φ)) + ··· + (log pN−l(dN−l|φ) + log pN (dN|φ)) (cid:21) (log p1(d1|φ) + log p2(d2|φ)) + ··· + (log pN−1(dN−1|φ) + log pN (dN|φ)) + ··· + j=1 i=1 = . Note that in the case of l = N − 1, the last summation consists of only one element log p1(d1|φ) + log p1+l(d1+l|φ). By careful examination of the two cases above, we get the following results. For 2l ≤ N : l(cid:88) N−j(cid:88) l(cid:88) (cid:21) log pi(di|φ) + log pi+j(di+j|φ) N−l(cid:88) N(cid:88) (cid:20) j=1 i=1 (l + k − 1) log pk(dk|φ) + = k=1 where the middle term disappears in the case 2l = N , and for 2l > N : (cid:20) k=l+1 2l log pk(dk|φ) + (cid:21) log pi(di|φ) + log pi+j(di+j|φ) l(cid:88) N−j(cid:88) l(cid:88) N−l(cid:88) (l + k − 1) log pk(dk|φ) + N(cid:88) (N − i + l) log pk(dk|φ). k=N−l+1 k=1 i=1 j=1 + = (N − i + l) log pk(dk|φ), k=N−l+1 (N − 1) log pk(dk|φ) k=l+1 If we now check that ai equals to the factors in front of the log-likelihoods in the two cases above, the proof of Proposition 2 is complete. Note that once we check the equality for ai, 81 the same directly translates to bi+j since bi+j is ai with indices set to i + j instead of i. Indeed, for 2l ≤ N and for 2l > N ai = ai =   l + i − 1 i ≤ l l < i ≤ N − l , 2l N − i + l N − l < i l + i − 1 N − 1 N − i + l i ≤ N − l N − l < i ≤ l . l < i Proof of Proposition 3. By the construction of R-vine (see Section 3.2.1), each tree Ti, for i = 1, . . . , N−1 has exactly N − i edges (these are the unique conditioned variable pairs). For any R-vine truncated at level l ∈ {1, . . . 
, N − 1}, we get the number of edges to be l(cid:88) i=1 (N − i) = lN − l(l + 1) l(2N − (l + 1)) 2 = 2 The rest of the proof is identical with that of Proposition 1 due to the uniqueness of the conditioned variable pairs in the copula density ci,(i+j);(i+1),...,(i+j−1), but in this case P (K = k) = 2 l(2N−(l+1)). Proof of Proposition 4. The proof is identical with that of Proposition 1 since each conditioned pair in the copula density cj,(j+i);1,...,(j−1) is unique as well. 82 Proof of Proposition 5. It is sufficient to show that for l ∈ {1, . . . , N − 1} the following equality holds: log pj(dj|φ) + 1 bj+i (cid:21) log pj+i(dj+i|φ) = N(cid:88) k=1 l(cid:88) N−j(cid:88) (cid:20) 1 j=1 i=1 aj where log p(dk|φ), (3.60) aj = N − 1, bj+i = (N − 1 − l)1j+i≤l + l. (cid:21) log pj+i(dj+i|φ) (cid:21) log pj+1(dj+i|φ)) + ··· + log pN (dN|φ) . = log p2(d2|φ) + . . . log pN (dN|φ). (cid:21) log pl+1(dl+1|φ) + . . . log pN (dN|φ) . To show this, let us consider the following summation (cid:20) (cid:21) log pj(dj|φ) + log pj+i(dj+i|φ) i=1 j=1 N−j(cid:88) l(cid:88) (cid:20) l(cid:88) l(cid:88) j=1 = = N−j(cid:88) (N − j) log pj(dj|φ) + (cid:20) l(cid:88) (N − j) log pj(dj|φ) + i=1 Now, for l = 1, we have For l ≥ 2, we have j=1 j=1 j=1 l(cid:88) (cid:20) l(cid:88) (cid:20) j=1 (cid:20) (cid:21) log pj+1(dj+1|φ)) + ··· + log pN (dN|φ) (cid:21) log pj+1(dj+1|φ)) + ··· + log pN (dN|φ) (cid:20) (cid:21) log p2(d2|φ) + . . . log pN (dN|φ) + ··· + = Therefore we can rewrite (cid:20) (cid:21) l(cid:88) log pj+1(dj+1|φ)) + ··· + log pN (dN|φ) l(cid:88) N(cid:88) (j − 1) log pj(dj|φ) + l log pj(dj|φ) j=1 = j=1 j=l+1 83 Overall, (cid:20) (cid:21) log pj(dj|φ) + log pj+i(dj+i|φ) i=1 j=1 N−j(cid:88) l(cid:88) l(cid:88) (N − j) log pj(dj|φ) + l(cid:88) (N − 1) log pk(dk|φ) + j=1 = = l(cid:88) (j − 1) log pj(dj|φ) + N(cid:88) l log pk(dk|φ). j=1 N(cid:88) j=l+1 l log pj(dj|φ) k=1 Since j ∈ {1, . . . , l} and the equality 3.60 holds. Proof of Proposition 6. bj+i = k=l+1 N − 1 j + i ≤ l j + i > l l , The proof is identical with that of Proposition 3 since each conditioned pair in the copula density cj,(j+i);1,...,(j−1) is unique, and a C-vine is a special case of R-vine. Proof of Lemma 2. As we discussed in the proof of Proposition 3, the construction of R-vine implies that each tree Ti, for i = 1, . . . , N−1 has exactly N−i edges (pairs of conditioned variables). Moreover, each tree Ti corresponds to copulas with the conditioning set of size i − 1. Therefore, for X being the cardinality of the conditioning set, we get P (X = i) = for i ∈ {0, . . . , N − 2}. N − (i + 1) (cid:0)N (cid:1) 2 N−2(cid:88) i=0 i Now E(X) = = N − (i + 1) (cid:1) (cid:0)N (cid:20) 2 2 N (N − 1) (N − 1)[ 2 = N (N − 1) (N − 2)(N − 1) i=0 [i(N − 1) − i2] ] − (N − 2)(N − 1)(2N − 3) 6 (cid:21) N − 2 3 . = N−2(cid:88) 2 84 Where the equality on the second line is due to the standard algebraic results on the sum of powers of the first first N integers. Proof of Lemma 3. Analogically to the proof of Lemma 2, while recalling the number of edges for any l-truncated R-vine provided in the proof of Proposition 3, we have for the cardinality of the conditioning set X: Now P (X = i) = N − (i + 1) l(2N−(l+1)) 2 for i ∈ {0, . . . , l − 1}. l−1(cid:88) i=0 i E(X) = N − (i + 1) l(2N−(l+1)) 2 2 (cid:20) 2 = l(2N − (l + 1)) (N − 1)[ (l − 1)l 2 l−1(cid:88) [i(N − 1) − i2] ] − (l − 1)l(2l − 1) (cid:21) i=0 6 = = l(2N − (l + 1)) (l − 1)(3N − 2l − 2) 3(2N − l − 1) . 3.5.3 Supplement for the calibration of the Liquid Drop Model GP specifications. 
In the case of the LDM E_B(Z, N), we consider the GP prior with mean zero and the covariance function

\eta_E \cdot \exp\Big( -\frac{(Z - Z')^2}{2\nu_Z^2} - \frac{(N - N')^2}{2\nu_N^2} - \frac{(a_{vol} - a'_{vol})^2}{2\nu_1^2} - \frac{(a_{surf} - a'_{surf})^2}{2\nu_2^2} - \frac{(a_{sym} - a'_{sym})^2}{2\nu_3^2} - \frac{(a_C - a'_C)^2}{2\nu_4^2} \Big).

Similarly, we consider the GP prior for the systematic discrepancy δ(Z, N) with mean zero and covariance function

\eta_\delta \cdot \exp\Big( -\frac{(Z - Z')^2}{2 l_Z^2} - \frac{(N - N')^2}{2 l_N^2} \Big).

Experimental design. Kennedy and O'Hagan (2001) recommend selecting the calibration inputs for the model runs so that any plausible value θ of the true calibration parameter is covered. In this context, we consider the space of calibration parameters to be centered at the values of the least squares estimates θ̂_{L_2} and broad enough to contain the majority of values provided by the nuclear physics literature (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Kirson, 2008; Benzaid et al., 2020). Table 3.4 gives the lower and upper bounds for the parameter space, so that Lower bound = θ̂_{L_2} − 15 × SE(θ̂_{L_2}) and Upper bound = θ̂_{L_2} + 15 × SE(θ̂_{L_2}). Here SE(θ̂_{L_2}) is given by standard linear regression theory.

Parameter   Lower bound   Upper bound
a_vol       15.008        15.829
a_surf      15.628        18.193
a_sym       21.435        23.505
a_C         0.665         0.720

Table 3.4: The space of calibration parameters used for generating the outputs of the semi-empirical mass formula (1.1).

Prior distributions. First, we consider independent Gaussian distributions centered at the LS estimates θ̂_{L_2} (in Table 3.3) with standard deviations 7.5 × SE(θ̂_{L_2}), so that the calibration parameters used for generating the model runs are covered roughly within two standard deviations of the priors. Namely,

avol ∼ N(15.42, 0.203), asurf ∼ N(16.91, 0.645), asym ∼ N(22.47, 0.525), aC ∼ N(0.69, 0.015).

The prior distributions for the hyperparameters of the GPs were selected as Gamma(α, β), with shape parameter α and scale parameter β, so that they represent vague knowledge about the scale of these parameters given by the literature on nuclear mass models (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Fayans, 1998; Kirson, 2008; McDonnell et al., 2015; Kortelainen et al., 2010a, 2012, 2014; Benzaid et al., 2020; Kejzlar et al., 2020). In particular, the error scale σ is, in the majority of nuclear applications, within units of MeV; we therefore set σ ∼ Gamma(2, 1), with the scale of the systematic error being η_δ ∼ Gamma(10, 1), to allow this quantity to range between units and tens of MeV. It is also reasonable to assume that the mass of a given nucleus is correlated mostly with its neighbours on the nuclear chart. We express this notion through the reasonably wide prior distributions

l_Z ∼ Gamma(10, 1), l_N ∼ Gamma(10, 1), ν_Z ∼ Gamma(10, 1), ν_N ∼ Gamma(10, 1), ν_i ∼ Gamma(10, 1), i = 1, 2, 3, 4.

Finally, the majority of the masses in the training dataset of 2000 experimental binding energies fall into the range [1000, 2000] MeV (1165 masses, to be precise). We consider the following prior distribution for the parameter η_f to reflect the scale of the experimental binding energies: η_f ∼ Gamma(110, 10).
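The following sketch (not the thesis code) shows how a Latin hypercube design over the parameter box in Table 3.4 can be generated with SciPy's quasi-Monte Carlo module.

```python
# Generate 10^4 Latin hypercube settings of (a_vol, a_surf, a_sym, a_C) within
# the bounds of Table 3.4.
from scipy.stats import qmc

lower = [15.008, 15.628, 21.435, 0.665]
upper = [15.829, 18.193, 23.505, 0.720]

sampler = qmc.LatinHypercube(d=4, seed=1)
theta_design = qmc.scale(sampler.random(n=10_000), lower, upper)
```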
87 CHAPTER 4 EMPIRICAL BAYES CALIBRATION OF COMPUTER MODELS WITH CONSISTENT PREDICTIONS Up to this point, we have seen that the Bayesian framework for computer-model-aided inference, described in detail in Section 1.2 and at the beginning of Chapter 3, provides a statistically principled way to account for various sources of uncertainty and leads to bet- ter predictions. It can be especially powerful in scenarios where computer models under consideration are complex and computationally too expensive to be used directly for pre- dictions with quantified uncertainties, because each evaluation of such models often takes several days. Despite these advantages, we have also identified many challenges that make the implementation of the Kennedy and O’Hagan (2001) framework challenging in practice. Let us recall that under a fully Bayesian treatment, the predictions of new values y∗ of a physical ζ using a computer model fm are specified by the posterior predictive distri- bution p(y∗|d). The dataset d here and for the rest of this chapter consists of n obser- vations yi from the physical process ζ and s evaluations zj of the computer model fm, i.e. d = (d1, . . . , dn+s) := (y, z), and follows the multivariate normal distribution (1.4). The pre- dictive distribution p(y∗|d) is obtained by integrating the conditional density p(y∗|d, θ, γ, σ), which is a multivariate normal density given by the statistical model (1.3) and the specifi- cation of GPs, against the posterior density p(θ, γ, σ|d), namely (cid:90) p(y∗|d) = p(y∗|d, θ, γ, σ)p(θ, γ, σ|d) dθ dγ dσ. (4.1) φ An analogical relationship also holds for the predictions of new realizations of the physical process ζ∗. The posterior density p(θ, γ, σ|d), however, does not have a closed form in general and one typically resorts to MCMC methods for approximation. Additionally, the nature of the likelihood p(d|θ, γ, σ) makes the problem hard to scale due to the complex structure of the covariance matrix K(θ, γ, σ) (see (1.6)). In Chapter 3, we developed a novel VBI algorithm that provides an efficient and scalable alternative to the traditional 88 MCMC methods. Nevertheless, the practical implementation of either the MCMC or our VBI approach can be a non-trivial task and requires some practical experience. As an easy-to-implement alternative that avoids the difficulties described above, we pro- pose an empirical Bayes approach for fast and statistically principled predictions of physical quantities using imperfect computer models which instead of placing a (prior) distribution on (θ, γ, σ) estimates these parameters directly form the data. One can therefore utilize the convenience of GPs to obtain closed form, simple, and fast predictions given by the conditional distribution p(y∗|d, θ, γ, σ) (or p(ζ∗|d, θ, γ, σ)). The proposed approach can be viewed as an approximation of the fully Bayesian treatment that neglects some of the uncertainty associated with the unknown parameters. Our contributions are the following. First, we present a fast and easy to implement framework for computer-model-enabled predictions and provide two alternative plug-in es- timators for all the unknown quantities involved. Second, we offer a new perspective on the Kennedy and O’Hagan (2001) framework and provide its representation as a Bayesian hierarchical model. 
This alternative representation allows us to discuss the framework in the context of non-parametric regression problems with GP priors and establish our methods’ theoretical validity through a posterior consistency result. Lastly, we validate the empirical Bayes approach empirically through a simulation study, and illustrate our methodology on a real data application in nuclear physics. The rest of this chapter is organized as follows. In Section 4.1, we show the equivalence of the general framework for Bayesian calibration of computer models with a Bayesian hierar- chical model. Then, in Section 4.2, we discuss the theoretical properties of our approach and establish its posterior consistency. Section 4.3 defines two plug-in estimators for GP model parameters and a consistent estimator of a noise scale component. Section 4.4 contains a simulation study that empirically validates the methodology in this chapter. A real-data application is also included in Section 4.4. 89 4.1 Hierarchical model for Bayesian calibration of computer mod- els Here we show that we can represent the model of Kennedy and O’Hagan (2001) described in Section 1.2, hierarchically, as the following hypotheses about the observations yi, the computer model evaluations zj, and a set of prior distributions. Model for data: yi = ζ(ti) + σi zj = fm((cid:101)tj,(cid:101)θj), i.i.d.∼ N (0, σ). i i = 1, . . . , n, j = 1, . . . , s, (4.2) (4.3) (4.4) Priors: δ(t) ∼ GPδ(mδ(t), kδ(t, t(cid:48))), given γ and independent of  and σ, fm(t, θ) ∼ GPf (mf (t, θ), kf ((t, θ), (t(cid:48), θ(cid:48)))), ζ(t)|θ, γ ∼ GPf + GPδ. given γ and independent of i, σ, and δ, Under this model, the conditional likelihoods for yi and zj are ˜fm (cid:90) (cid:90) n(cid:89) (cid:90) (cid:90) (cid:90) ζ ζ ζ = = i 90 n(cid:89) s(cid:89) i ˜fm p(yi|ζi, σ)p(ζ, z|θ, γ) dζ, j p(yi|ζ(ti), σ) = p(zj|fm((cid:101)tj,(cid:101)θj)) = 1 σ (cid:18) √ 1 2π zj =fm((cid:101)tj ,(cid:101)θj ) exp (cid:19) , (4.5) − (yi − ζ(ti))2 2σ2 (zj), where p(zj|fm((cid:101)tj,(cid:101)θj)) is a likelihood with the point mass at zj = fm((cid:101)tj,(cid:101)θj). Consequently, (cid:90) (cid:90) the equivalence of the two formulations is given through the equality between the likelihood (1.4) and the integral (4.6) p(ζ, ˜fm, d|θ, γ, σ) d ˜fm dζ = p(d|ζ, ˜fm, θ, γ, σ)p(ζ, ˜fm|θ, γ) d ˜fm dζ ζ ˜fm p(yi|ζi, σ) p(zj| ˜fm,j)p(ζ, ˜fm|θ, γ) d ˜fm dζ where ζ = (ζ(t1), . . . , ζ(tn)) = (ζ1, . . . , ζn) and ˜fm = (fm((cid:101)t1,(cid:101)θ1), . . . , fm((cid:101)ts,(cid:101)θs)). The likelihood p(ζ, z|θ, γ) is the multivariate normal distribution with the mean M (θ, γ) (see (1.5)) and the covariance (cid:32) Kp(θ, γ) = (cid:33) Kf (Ty(θ), Ty(θ)) + Kδ(Ty, Ty) Kf (Ty(θ), Tz((cid:101)θ)) Kf (Tz((cid:101)θ), Tz((cid:101)θ)) Kf (Tz((cid:101)θ), Ty(θ)) . Again, Kf (Ty(θ), Ty(θ)) is the matrix with (i, j) element kf ((ti, θ), (tj, θ)), Kδ(Ty, Ty) is the matrix with (i, j) element kδ(ti, tj), and Kf (Tz((cid:101)θ), Tz((cid:101)θ)) is the matrix with (i, j) element kf (((cid:101)ti,(cid:101)θi), ((cid:101)tj,(cid:101)θj)). Kf (Ty(θ), Tz((cid:101)θ)) is defined analogically with the kernel kf . We leave the details of the integral computation for Section 4.5.1. This representations of the model is crucial for the theoretical results obtained in the subsequent section. It reframes the Bayesian model as a version of a non-parametric regression problem with GP prior for ζ(t) and an additive noise. Additionally, we can gain a further insight into the role of the set of model runs z. 
Let us consider a function space F and a subset (cid:101)F ⊂ F, then p(ζ ∈ (cid:101)F|d, θ, γ, σ) ∝ (cid:90) (cid:101)F n(cid:89) i p(yi|ζi, σ)p(ζ|z, θ, γ) dζ. (4.7) One can therefore interpret the model runs z as an additional information provided by the computer model fm that enhances the GP prior p(ζ|z, θ, γ) for the physical process ζ, having the mean function m(cid:88) (cid:104) (cid:105)(cid:104) kf ((t, θ), ((cid:101)tj,(cid:101)θj)) (cid:105) zi − mf ((cid:101)ti,(cid:101)θi) mζ (t) = mf (t, θ) + mδ(t) + κj,i and the covariance function i,j=1 kζ (t, t(cid:48)) = kf ((t, θ), (t(cid:48), θ)) + kδ(t, t(cid:48)) (cid:105)(cid:104) kf ((t, θ), ((cid:101)tj,(cid:101)θj)) − m(cid:88) (cid:104) κj,i (cid:105) kf (((cid:101)ti,(cid:101)θi), (t(cid:48), θ)) , where κj,i is the (j, i) element of the matrix Kf (Tz((cid:101)θ), Tz((cid:101)θ))−1. i,j=1 91 , (4.8) (4.9) 4.2 Posterior consistency, a theoretical validation The revealing consequence of the previous section is that the Kennedy and O’Hagan (2001) framework is equivalent to the non-parametric regression model of an unknown func- tion ζ(t) with the prior distribution p(ζ|z, θ, γ). This is not only a new perspective on the popular framework, but also happens to be the key step that allows us to validate our em- pirical Bayes approach theoretically and establish the posterior consistency of the physical process when the prior p(ζ|z, θ, γ) satisfies certain properties. In the reminder of this section, we assume that the true physical process ζ0 is a con- tinuously differentiable function on the compact and convex set Ω ⊂ Rp. Without loss of generality, we take Ω = [0, 1]p. Additionally, we shall assume the hyperparameters (θ, γ) take values in a set Υ. For any ν > 0, we aim to establish, under suitable conditions, the following: p(ζ ∈ W C ν,n|y1, . . . , yn, z, θ, γ, ˆσn) −−−→ n 0 a.s. P0, (4.10) sup (θ,γ)∈Υ where P0 denotes the joint conditional distribution of {yi}∞ strongly consistent estimator of σ0, and i=1 given true ζ0 and σ0, ˆσn is a Wν,n = with Qn being the empirical measure on the design points given as Qn(t) = n−1(cid:80)n i=1 1ti(t). In Theorem 2, we first present a general result on the consistency of non-parametric |ζ(t) − ζ0(t)| dQn(t) ≤ ν (4.11) regression problems and subsequently discuss the theorem’s conditions in the context of the model described in Section 4.1. This is based on the work of Choi and Schervish (2007a) and Choi (2007), where the authors assume σ is included in Wν,n, and the posterior consistency is derived jointly for ζ and σ. On the other hand, the consistency of ζ conditioned on ˆσn requires a non-trivial modification of their original results. The proof of Theorem 2 is given in Section 4.5.2. Theorem 2. Let {yi}∞ i=1 be independently and normally distributed with the mean ζ(ti) and the standard deviation σ with respect to a common σ-finite measure, where ζ belongs to a 92 (cid:90) (cid:26) ζ : (cid:27) , space of continuously differentiable functions on [0, 1]p denoted as F, and σ > 0. Let ζ0 ∈ F and let P0 denotes the joint conditional distribution of {yi}∞ i=1 given true ζ0 and σ0. Let {Un}∞ n=1 be a sequence of subsets of F. Let ζ have a prior Π(·|θ, γ) where (θ, γ) take values in a set Υ. For any 0 <  < 1 and ζ0(ti) = ζ0,i define: p(yi|ζ0,i, σ0) p(yi|ζi, σ0(1 − )) , Λi(ζ0, ζ) = log Ki(ζ0, ζ) = Eζ0,σ0 Vi(ζ0, ζ) = Varζ0,σ0 (Λi(ζ0, ζ)), (Λi(ζ0, ζ)). 
If the following assumptions are satisfied:

(A1) Suppose there exists a set B with Π(B | θ, γ) > 0 and, for any ∆ > 0, a constant 0 < ε̃_1 < 1, so that for any ε < ε̃_1:
(i) \(\sum_{i=1}^{\infty} V_i(\zeta_0,\zeta)/i^2 < \infty\) for all ζ ∈ B,
(ii) Π(B ∩ {ζ : K_i(ζ_0, ζ) < ∆ for all i} | θ, γ) > 0.

(A2) Suppose there exist tests {Φ_n}_{n=1}^∞, sets {F_n}_{n=1}^∞, and constants C_1, C_2, c_1 > 0 and 0 < ε̃_2 < 1 so that:
(i) \(\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n < \infty\),
(ii) \(\sup_{(\theta,\gamma)\in\Upsilon} \Pi(F_n^C \mid \theta, \gamma) < C_1 e^{-c_1 n}\),
(iii) there exists a constant c > 0 such that for any 0 < ε < ε̃_2 the inequality c + log(1 − ε) − log(1 + ε) > 0 holds and \(\sup_{\zeta \in U_n^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}(1 - \Phi_n) \leq C_2 e^{-cn}\).

(A3) σ̂_n is strongly consistent, i.e., σ̂_n → σ_0 a.s. P_0 as n → ∞.

Then
\[
\sup_{(\theta,\gamma)\in\Upsilon} p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0.
\]

For the purpose of generality of Theorem 2, we do not explicitly condition on the set of model runs z. It is clear from our previous discussions (see (4.7) in particular) that the model runs play the role of fixed constants in the prior distribution over ζ. The dependence on z in (4.10) arises by setting Π(ζ | θ, γ) := p(ζ | z, θ, γ).

We now consider the conditions of Theorem 2 in the context of the model in Section 4.1. These conditions fall into two general categories: one group of conditions is related to the existence of the test functions Φ_n, and the second group revolves around the conditions on the prior distribution.

Our approach to establishing the existence of test functions {Φ_n}_{n=1}^∞ that satisfy conditions (i) and (iii) in Theorem 2 is similar to that of Theorem 2 in Choi and Schervish (2007a). We consider a sieve F_n which grows to the space of continuously differentiable functions on [0, 1]^p. Namely, let
\[
F_n = \left\{\zeta : \|\zeta\|_\infty < M_n,\ \Big\|\frac{\partial}{\partial t_i}\zeta\Big\|_\infty < M_n,\ i = 1,\dots,p\right\}, \tag{4.12}
\]
where M_n = O(n^α) for some α ∈ (1/2, 1). Also, ‖·‖_∞ denotes the supremum norm. Each test is defined as a combination of tests over finitely many elements in a covering of F_n. The existence of tests in the specific case of W_{ν,n} is given in Theorem 3, with its proof provided in Section 4.5.2.

Theorem 3. Let F_n be the sieves defined in (4.12). For any ν > 0 there exist tests {Φ_n}_{n=1}^∞ and constants C > 0 and 0 < ε̃ < 1 so that:
(i) \(\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n < \infty\),
(ii) there exists a constant c > 0 such that for any 0 < ε < ε̃ the inequality c + log(1 − ε) − log(1 + ε) > 0 holds and \(\sup_{\zeta \in W_{\nu,n}^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}(1 - \Phi_n) \leq C e^{-cn}\).

To verify condition (A1) of Theorem 2, it is sufficient to show that the GP prior for ζ assigns positive probability to the following set for any δ > 0:
\[
B_\delta = \{\zeta : \|\zeta - \zeta_0\|_\infty < \delta\}. \tag{4.13}
\]
For any 0 < ε < 1, a short calculation leads to
\[
K_i(\zeta_0, \zeta) = \log(1-\varepsilon) - \frac{1}{2}\left(1 - \frac{1}{(1-\varepsilon)^2}\right) + \frac{[\zeta_0(t_i) - \zeta(t_i)]^2}{2\sigma_0^2(1-\varepsilon)^2}
\leq \log(1-\varepsilon) - \frac{1}{2}\left(1 - \frac{1}{(1-\varepsilon)^2}\right) + \frac{\|\zeta_0 - \zeta\|_\infty^2}{2\sigma_0^2(1-\varepsilon)^2}.
\]
Let a(ε) = log(1 − ε) − 1/2 + 1/[2(1 − ε)²]; it is easy to see that a(ε) is positive for ε ∈ (0, 1) and continuous at ε = 0 with a(0) = 0. Therefore, for every ∆ > 0, there exist δ > 0 and 0 < ε̃ < 1 so that K_i(ζ_0, ζ) < ∆ for all i, all ζ ∈ B_δ, and any ε < ε̃. Additionally, for any ε < ε̃ and any δ > 0,
\[
V_i(\zeta_0, \zeta) = \frac{1}{2}\left[\frac{1}{(1-\varepsilon)^2} - 1\right]^2 + \left[\frac{\zeta_0(t_i) - \zeta(t_i)}{\sigma_0(1-\varepsilon)^2}\right]^2,
\]
which is bounded uniformly in i on B_δ, and as a result, for all ζ ∈ B_δ, \(\sum_{i=1}^{\infty} V_i(\zeta_0,\zeta)/i^2 < \infty\).

The prior condition (ii) of (A2) for the sieve F_n in (4.12) is addressed in Lemma 4 (for the proof see Section 4.5.2).

Lemma 4.
Let the mean function m_ζ(·) of the GP prior for ζ defined on [0, 1]^p be continuously differentiable, and let the covariance function k_ζ(·,·) have mixed partial derivatives up to order 4 that are continuous. Define
\[
\rho_0^2(\theta,\gamma) = \sup_{t\in[0,1]^p}\mathrm{Var}\left(\zeta(t) \mid z, \theta, \gamma\right), \qquad
\rho_i^2(\theta,\gamma) = \sup_{t\in[0,1]^p}\mathrm{Var}\left(\frac{\partial}{\partial t_i}\zeta(t)\,\Big|\, z, \theta, \gamma\right), \quad i = 1,\dots,p.
\]
Suppose that Υ is a compact set, and the ρ_i² are continuous functions of (θ, γ) for all (θ, γ) ∈ Υ, i = 0, ..., p. Then there exist constants C, c > 0 so that
\[
\sup_{(\theta,\gamma)\in\Upsilon} p(F_n^C \mid z, \theta, \gamma) < C e^{-cn},
\]
where F_n are the sieves defined in (4.12).

Below we present the almost sure consistency result (4.10) as a corollary of Theorem 2, Theorem 3, and Lemma 4.

Corollary 1. Let P_0 denote the joint conditional distribution of {y_i}_{i=1}^∞ given the true ζ_0 and σ_0. Let m_ζ(·) and k_ζ(·,·) be the mean and covariance functions of the GP prior for ζ satisfying the conditions of Lemma 4. Assume Υ is a compact set, and for any δ > 0, p(B_δ | z, θ, γ) > 0. If σ̂_n is a strongly consistent estimator of σ_0, then for any ν > 0
\[
\sup_{(\theta,\gamma)\in\Upsilon} p(\zeta \in W_{\nu,n}^C \mid y_1,\dots,y_n, z, \theta, \gamma, \hat{\sigma}_n) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0. \tag{4.14}
\]

Prior conditions. The prior positivity condition requiring p(B_δ | z, θ, γ) > 0 for any δ was extensively studied by Ghosal and Roy (2006) and Tokdar and Ghosh (2007). Theorem 4 of Ghosal and Roy (2006) implies that this condition is satisfied for a GP with continuous sample paths and continuous mean and covariance functions, as long as ζ_0 and m_ζ belong to the reproducing kernel Hilbert space (RKHS) of k_ζ. The continuity of the GP's sample paths is given by the application of Theorem 5 in Ghosal and Roy (2006), which requires the same continuity conditions as Lemma 4 in this section (excluding those on the ρ_i²). It should be clear from (4.8) and (4.9) that m_ζ is continuously differentiable on [0, 1]^p and k_ζ has continuous mixed partial derivatives up to 4th order on [0, 1]^p, as long as the same holds for m_f and m_δ and, respectively, for k_f and k_δ. Tokdar and Ghosh (2007) show that the RKHS of k_ζ spans the space of continuously differentiable functions on [0, 1]^p if k_ζ is a product of p isotropic and integrable univariate covariance functions with continuous mixed partial derivatives up to order 4. For example, the squared exponential covariance function satisfies these requirements, including the continuity of ρ_i² for i = 0, ..., p. This, of course, does not directly imply that such choices for m_f and m_δ, and k_f and k_δ respectively, result in the conditional mean m_ζ and covariance k_ζ satisfying these sufficient conditions. For broader applicability of our results, we note that further investigation of specific choices of mean and covariance functions that satisfy the desired conditions is needed. We intend to address this in our future work. Nevertheless, the simulation study conducted in Section 4.4.1 strongly suggests that choosing the squared exponential kernel leads to consistent predictions.

4.3 Parameter estimation and prediction

Thus far, we have established that the empirical Bayes framework provides a principled approach for inference and enjoys good theoretical properties, all this assuming a (strongly) consistent estimator of σ_0, smoothness of the prior mean and covariance functions, and GP hyperparameters (θ, γ) taking values in some compact set.
In this section, we first propose a strongly consistent estimator of the true noise scale σ_0 and two different plug-in estimators of (θ, γ), defined as minimizers of two alternative loss functions. In particular, we consider the negative data log-likelihood and the negative predictive log-likelihood combined with K-fold cross-validation. Second, we provide the complete empirical Bayes algorithm for simple and fast predictions of physical quantities using (imperfect) computer models.

Let us consider n observations y_i from the physical process under the model (4.2). We propose the following estimator of the noise variance σ_0²:
\[
\hat{\sigma}_n^2 = \frac{\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2}{2(n-1)}. \tag{4.15}
\]

Theorem 4. Suppose ζ_0(t) represents the true physical process and σ_0² is the true value of the experimental error variance, where t ∈ Ω, a compact and convex subset of R^p, and ζ_0 is continuously differentiable on Ω. Let P_0 denote the joint conditional distribution of {y_i}_{i=1}^∞ given the true ζ_0 and σ_0². Also assume the following holds for the design points t_i:
\[
\sup_{i\in\{1,\dots,n-1\},\, j\in\{1,\dots,p\}} |t_{i+1,j} - t_{i,j}| \xrightarrow[n\to\infty]{} 0; \tag{AD}
\]
then
\[
\hat{\sigma}_n^2 \xrightarrow[n\to\infty]{} \sigma_0^2 \quad \text{a.s. } P_0. \tag{4.16}
\]

The proof of Theorem 4 is given in Section 4.5.2. The continuous mapping theorem directly implies the following.

Corollary 2. Under the assumptions of Theorem 4,
\[
\hat{\sigma}_n = \sqrt{\hat{\sigma}_n^2} \xrightarrow[n\to\infty]{} \sigma_0 \quad \text{a.s. } P_0. \tag{4.17}
\]

Remark 1. The assumption (AD) is satisfied by a design that contains at least one point in each hypercube H in Ω with Lebesgue measure λ(H) ≥ 1/(Kn), for some constant 0 < K ≤ 1. This is, for example, the case of an equally spaced design.

4.3.1 Estimation of hyperparameters

4.3.1.1 Marginal data likelihood

We first consider estimates of (θ, γ) as minimizers of a loss function that is reminiscent of the standard maximum likelihood approach, namely
\[
L_{MLE}(\theta, \gamma) = -\log p(d \mid \theta, \gamma, \hat{\sigma}_n), \tag{4.18}
\]
with the negative log-likelihood being
\[
-\log p(d \mid \theta, \gamma, \hat{\sigma}_n) = \frac{1}{2}(d - M(\theta,\gamma))^T K(\theta,\gamma,\hat{\sigma}_n)^{-1}(d - M(\theta,\gamma)) + \frac{1}{2}\log|K(\theta,\gamma,\hat{\sigma}_n)| + \frac{n+s}{2}\log 2\pi.
\]
We can readily interpret the minimizer of L_MLE as a trade-off between the data-fit term ½(d − M(θ, γ))^T K(θ, γ, σ̂_n)^{-1}(d − M(θ, γ)) and the model-complexity penalty ½ log|K(θ, γ, σ̂_n)|, which depends only on the model parameters and the variable inputs.

4.3.1.2 Predictive likelihood with K-fold cross-validation

Another viable approach to estimating the parameters (θ, γ) is to base them on the model's predictive performance on unseen data. Cross-validation is a popular and robust approach to estimating this predictive performance that has been utilized across many statistical applications; see Sundararajan and Keerthi (2001), Rasmussen and Williams (2006), and Martino et al. (2017) for applications with GPs. Here, we consider a K-fold cross-validation where the basic idea is to randomly partition the training dataset into K subsets of equal size. We then select K − 1 subsets for training and use the held-out data as a proxy for estimating the predictive performance. This is repeated until we exhaust all K subsets for the purpose of validation, with typical choices for K being 3, 5, 10, or n (leave-one-out cross-validation). Formally, let y_i represent the i-th subset of the observations y and y_{−i} = y \ y_i. The negative predictive log-likelihood under the K-fold cross-validation is
\[
L_{CV(K)}(\theta, \gamma) = -\sum_{i=1}^{K}\log p(y_i \mid y_{-i}, z, \theta, \gamma, \hat{\sigma}_n). \tag{4.19}
\]
The cross-validation should be more robust against model misspecification and overfitting (Wahba, 1990).
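As a minimal illustration of the estimation step, the following Python sketch computes the noise-scale estimate (4.15) from the ordered observations and obtains (θ̂, γ̂) by numerically minimizing a user-supplied loss such as (4.18) or (4.19). The function names are illustrative and not part of the original text, and the loss itself (e.g., an implementation of the negative log-likelihood above) is assumed to be provided separately.

```python
import numpy as np
from scipy.optimize import minimize

def noise_scale_estimate(y):
    # Estimator (4.15): sigma_hat_n from successive differences of the observations.
    # Assumes y is ordered so that neighboring design points are close, in the
    # spirit of assumption (AD).
    y = np.asarray(y, dtype=float)
    sigma2_hat = np.sum(np.diff(y) ** 2) / (2.0 * (len(y) - 1))
    return np.sqrt(sigma2_hat)

def fit_hyperparameters(loss, init, bounds=None):
    # Plug-in estimation of (theta, gamma): minimize a scalar loss such as
    # L_MLE (4.18) or L_CV(K) (4.19), supplied by the caller as `loss`.
    result = minimize(loss, x0=np.asarray(init, dtype=float), method="L-BFGS-B", bounds=bounds)
    return result.x
```

In practice, the supplied loss function would evaluate (4.18) or (4.19) by building M(θ, γ) and K(θ, γ, σ̂_n) from the chosen mean and covariance functions and the data d.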
4.3.2 Algorithm for predictions

One of the main benefits of the empirical Bayes approach is that once we estimate the unknown parameters (θ, γ, σ), we can obtain a closed-form predictive distribution given these estimates. Formally, let us consider a set of new inputs (t_1^*, ..., t_J^*) at which we want to obtain the predictions according to the model (1.3). As discussed in Section 3.4, the joint normality between d and y^* implies that the conditional distribution p(y^* | d, θ, γ, σ) is a multivariate normal distribution with the mean vector
\[
M_{y^*}(\theta,\gamma,\sigma) = M_f(T_y^*(\theta)) + M_\delta(T_y^*) + C^* K(\theta,\gamma,\sigma)^{-1}(d - M(\theta,\gamma)) \tag{4.20}
\]
and the covariance matrix
\[
K_{y^*}(\theta,\gamma,\sigma) = K_f(T_y^*(\theta), T_y^*(\theta)) + K_\delta(T_y^*, T_y^*) + \sigma^2 I_J - C^* K(\theta,\gamma,\sigma)^{-1} C^{*T}, \tag{4.21}
\]
where
\[
C^* = \Big(\, K_f(T_y^*(\theta), T_y(\theta)) + K_\delta(T_y^*, T_y) \quad\; K_f(T_y^*(\theta), T_z(\tilde{\theta})) \,\Big). \tag{4.22}
\]
Similarly to the conditional covariance matrices discussed previously, K_f(T_y^*(θ), T_y(θ)) is the matrix with (i, j) element k_f((t_i^*, θ), (t_j, θ)), and K_δ(T_y^*, T_y) is the matrix with (i, j) element k_δ(t_i^*, t_j). The matrix K_f(T_y^*(θ), T_z(θ̃)) is defined accordingly with the kernel k_f. An analogous relationship holds for the conditional distribution of the new realizations from the physical process, p(ζ^* | d, θ, γ, σ), where the mean vector is identical to (4.20) and the covariance matrix is
\[
K_{\zeta^*}(\theta,\gamma,\sigma) = K_f(T_y^*(\theta), T_y^*(\theta)) + K_\delta(T_y^*, T_y^*) - C^* K(\theta,\gamma,\sigma)^{-1} C^{*T}. \tag{4.23}
\]
Algorithm 4.1 summarizes the procedure for predictions of physical quantities using imperfect and computationally expensive computer models.

Algorithm 4.1: Empirical Bayes algorithm for predictions of physical quantities.
Input: Data d, mean and covariance functions for the GPs, and new inputs (t_1^*, ..., t_J^*).
1. Use the experimental observations y_1, ..., y_n to compute σ̂_n = (σ̂_n²)^{1/2}.
2. Minimize either L_MLE(θ, γ) or L_CV(K)(θ, γ) to obtain the estimates (θ̂, γ̂).
3. Compute M_{y^*}(θ̂, γ̂, σ̂_n) and K_{y^*}(θ̂, γ̂, σ̂_n), or M_{ζ^*}(θ̂, γ̂, σ̂_n) and K_{ζ^*}(θ̂, γ̂, σ̂_n) respectively, to get the posterior predictive distribution.

4.4 Applications

The main objective of this section is to empirically establish the efficiency of the empirical Bayes method in Algorithm 4.1 and to support the consistency result presented in Section 4.2, all while sacrificing minimally in terms of the fidelity of UQ as compared to the fully Bayesian treatment. To this end, we consider a simulation study where we compare our method (under both L_MLE and L_CV(K)) to a fully Bayesian treatment with posterior samples obtained using the standard MH algorithm. Finally, we revisit the LDM and illustrate our methodology in a real-data scenario.

4.4.1 Transverse harmonic wave

Let us consider a simple computer model representing a periodic wave disturbance that moves through a medium and causes displacement of individual atoms or molecules in the medium. This is called a transverse harmonic wave, where the displacement f_m((t, x), θ) of a particle at location x over time t is given by
\[
f_m((t, x), \theta) = \theta_1 \sin(kx - \theta_2 t + \psi), \tag{4.24}
\]
where θ_1 represents the amplitude of the wave, and θ_2 is the frequency of the wave. The model also depends on the wave number k, which is reciprocal to the wavelength, and the phase constant ψ. For the purpose of this example, we shall consider these to be known values with k = 5 and ψ = 1, and define the model inputs (t, x) over the space [0, 1]² (we assume that the length and time units are all equal to one). A code sketch of the model is given below.
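For concreteness, here is a minimal Python sketch of the computer model (4.24) with the fixed constants k = 5 and ψ = 1; the function and variable names are illustrative and not part of the original study.

```python
import numpy as np

def f_m(t, x, theta1, theta2, k=5.0, psi=1.0):
    # Transverse harmonic wave (4.24): displacement of a particle at location x and time t.
    return theta1 * np.sin(k * x - theta2 * t + psi)

# Quick evaluation on a small grid of inputs (t, x) in [0, 1]^2, using the
# calibration values theta = (1.2, 1.8) that appear later in this section.
tt, xx = np.meshgrid(np.linspace(0.0, 1.0, 5), np.linspace(0.0, 1.0, 5))
print(f_m(tt, xx, theta1=1.2, theta2=1.8).round(3))
```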
The true physical process is modeled according to
\[
\zeta_0(t, x) = f_m((t, x), \theta) + \delta(t, x) = \theta_1 \sin(5x - \theta_2 t + 1) + \beta, \tag{4.25}
\]
where β = 1 is a constant systematic error of the model, and θ = (θ_1, θ_2) are arbitrarily set to (1.2, 1.8). We generate the experimental observations according to the model (1.3) with the true value of the observation error scale σ_0 = 0.2, where the model inputs (t, x) are chosen using the Latin hypercube design over the full space [0, 1]². The space-filling properties of the design guarantee a decreasing bias of the estimator σ̂_n with an increasing sample size. Additionally, we assume that the computer model for the periodic wave disturbance is computationally expensive and generate the set of model runs z using again the Latin hypercube design, now over [0, 1]² × [0, 2]².

We define the GP priors for f_m and δ to have zero means and the covariance functions
\[
k_f(\{t, x, \theta\}, \{t', x', \theta'\}) = \eta_f \cdot \exp\left(-\frac{\|t - t'\|^2}{2\ell_t^2} - \frac{\|x - x'\|^2}{2\ell_x^2} - \frac{\|\theta_1 - \theta_1'\|^2}{2\ell_{\theta_1}^2} - \frac{\|\theta_2 - \theta_2'\|^2}{2\ell_{\theta_2}^2}\right),
\]
\[
k_\delta(\{t, x\}, \{t', x'\}) = \eta_\delta \cdot \exp\left(-\frac{\|t - t'\|^2}{2\nu_t^2} - \frac{\|x - x'\|^2}{2\nu_x^2}\right).
\]
The hyperparameters in this scenario are therefore γ = (η_f, ℓ_t, ℓ_x, ℓ_{θ_1}, ℓ_{θ_2}, η_δ, ν_t, ν_x). For the case of the fully Bayesian treatment, we choose inverse gamma priors with mean 1/2 and variance 1/4 for (σ, η_f, η_δ), gamma priors with mean 1/3 and variance 1/9 for the length scales, and independent Gaussian distributions with mean 0 and variance 4 for the calibration parameters (θ_1, θ_2). These are non-informative priors given the spans of both the input space [0, 1]² and the parameter space [0, 2]².

Table 4.1 shows the RMSEs of predictions of new realizations from the true physical process (4.25), evaluated on a testing dataset of 225 realizations over a uniform grid on [0, 1]². The predictions are taken to be the posterior predictive means under each method. We consider the estimates of hyperparameters obtained using the L_MLE loss and the 10-fold cross-validation predictive loss function. The noise scale parameter was estimated using the consistent estimator σ̂_n defined in Section 4.3.

RMSE values on the testing dataset
                     L_MLE    L_CV(10)    Metropolis-Hastings
n = 125, s = 125     0.048    0.071       0.049
n = 250, s = 250     0.019    0.030       0.037
n = 500, s = 500     0.010    0.019       0.021

Table 4.1: The RMSE comparison of the empirical Bayes approach and the fully Bayesian treatment. The GP hyperparameters were estimated using Algorithm 4.1.

The proposed empirical Bayes approach closely matches the fully Bayesian treatment. In fact, the RMSE under the L_MLE loss is consistently the lowest and decreases monotonically with the increasing size of the dataset. This is a desirable outcome, since the empirical Bayes fit can be readily obtained in several minutes using standard numerical solvers, while sampling from posterior distributions can take hours. It took approximately 2 hours to obtain 10^4 samples in the scenario with the largest sample size on a standard PC with 4 cores.

             n = 125, s = 125           n = 250, s = 250           n = 500, s = 500
Parameter    L_MLE   L_CV(10)   MH      L_MLE   L_CV(10)   MH      L_MLE   L_CV(10)   MH
θ_1          1.217   1.208      1.197   1.251   1.251      1.160   1.206   1.166      1.207
θ_2          1.787   1.765      1.781   1.799   1.771      1.805   1.818   1.765      1.792
σ            0.328   0.328      0.198   0.259   0.259      0.208   0.228   0.228      0.206

Table 4.2: The estimates of calibration parameters and the noise scale under each method (under L_MLE and L_CV(10), the noise scale is the common estimate σ̂_n).
For completeness, we also show the estimates of the calibration parameters and the noise scale under each method in Table 4.2. Posterior means were taken as the estimates for the fully Bayesian solution. We can see again a close match between the approximate empirical Bayes method and the MH algorithm. The only notable difference is in terms of the noise scale estimate σ̂_n. This is expected, since the estimator is only asymptotically unbiased.

Figure 4.1: Detail of 95% credible bands plotted at t = 0.21.

Figure 4.2: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.21.

Figure 4.1 and Figure 4.2 show that the loss in terms of UQ is negligible for all practical purposes. We can see that the empirical Bayes approach slightly overestimates the uncertainty for smaller sample sizes, but this quickly diminishes as the sample size increases. This is likely the consequence of the inflation of the noise scale given by the bias of σ̂_n, which diminishes with the increasing sample size as expected. See Section 4.5.3 for additional figures of the empirical Bayes fit at the time locations t = 0, t = 0.43, t = 0.71, and t = 1.

4.4.2 The Liquid Drop Model revisited

To illustrate our empirical Bayes framework for computer-enabled predictions on a real-data example, we yet again consider the 4-parameter LDM of nuclear binding energies (see Section 1.1 for details). We now present an analysis of 595 experimental binding energies of even-even nuclei from the AME2003 dataset (Audi et al., 2003) (publicly available at http://amdc.impcas.ac.cn/web/masstab.html), randomly divided into a training set of 450 nuclei and a testing set of the remaining 145 nuclei; see Figure 4.3. We consider the statistical model (1.3) and model the systematic discrepancy δ with a zero-mean GP and the isotropic squared exponential covariance function.

Figure 4.3: Binding energies of even-even nuclei in the AME2003 dataset divided into a testing and a training dataset.

For the purpose of this example, we also assume that the LDM is computationally expensive (or not directly accessible) and regard it as an unknown function of (Z, N) and θ. Similarly to the discrepancy δ, we assign a GP prior to E_B(N, Z) with zero mean and the isotropic squared exponential covariance function. To this end, we additionally generated a set of 900 model evaluations using the Latin hypercube design over the space spanning all reasonable values of the parameters θ as given by the nuclear physics literature, similarly to our previous analysis in Section 3.5.3. The corresponding nuclear configurations, the inputs (Z, N), were randomly assigned to the generated values of θ from a set of twice-duplicated training nuclei.

Results. The predictions of nuclear binding energies were computed as the means of the posterior predictive distribution (4.20) conditioned on the estimates of the calibration parameters θ, the GP hyperparameters γ, and the noise scale σ̂_n.
The estimates for (θ, γ) were obtained numerically as the minimizers of L_MLE and L_CV(10). In the case of the fully Bayesian treatment, the priors for the GP hyperparameters were chosen according to Section 3.5.3.

             Parameter estimates                      Testing error
             a_vol    a_surf    a_sym    a_C          RMSE (MeV)
L_MLE        15.07    15.58     22.00    0.68         1.16
L_CV(10)     15.08    16.08     21.19    0.67         1.26
MH           15.32    16.09     22.09    0.70         1.16

Table 4.3: The RMSEs of the predictions evaluated on 145 even-even nuclei from the AME2003 dataset. The parameter estimates are also listed. The posterior means are shown in the case of the MH algorithm.

Table 4.3 gives the RMSE values calculated on the testing set of 145 even-even nuclei for the empirical Bayes approach and also for the MH algorithm. The calibration parameter estimates are also provided, with values that do not differ significantly between the methods considered. The resulting RMSEs are 1.1–1.3 MeV, which is consistent with our previous study in Section 3.4.2 that was conducted on the whole AME2003 dataset using the VBI approach. We also carried out a simple least squares fit of the LDM, with a resulting RMSE of 4.10 MeV evaluated on the same testing set of even-even nuclei. This is an improvement that is consistent with our previous study on the full dataset using the VBI algorithm. Overall, this is quite a remarkable result given the considerable effort that needs to be put forth to implement the fully Bayesian solution and to obtain a sufficient amount of posterior samples.

4.5 Technical details and supplementary results

4.5.1 Equivalency of the hierarchical model

To establish the equivalency between the Bayesian model given by the data likelihood p(d | θ, γ, σ) and the hierarchical model (see Section 4.1), we need to show that the following equality holds:
\[
p(d \mid \theta, \gamma, \sigma) = \int_\zeta \prod_{i=1}^{n} p(y_i \mid \zeta_i, \sigma)\, p(\zeta, z \mid \theta, \gamma)\, d\zeta, \tag{4.26}
\]
where ζ = (ζ(t_1), ..., ζ(t_n)) = (ζ_1, ..., ζ_n) and p(ζ, z | θ, γ) is the multivariate normal distribution with the mean M(θ, γ) (see (1.5)) and the covariance
\[
K_p(\theta,\gamma) =
\begin{pmatrix}
K_f(T_y(\theta), T_y(\theta)) + K_\delta(T_y, T_y) & K_f(T_y(\theta), T_z(\tilde{\theta})) \\
K_f(T_z(\tilde{\theta}), T_y(\theta)) & K_f(T_z(\tilde{\theta}), T_z(\tilde{\theta}))
\end{pmatrix}
=
\begin{pmatrix}
C_{11} & C_{12} \\
C_{21} & C_{22}
\end{pmatrix}.
\]
For ease of notation, let us now write M(θ, γ) = (M_y^T, M_z^T)^T. Then
\[
\int_\zeta \prod_{i=1}^{n} p(y_i \mid \zeta_i, \sigma)\, p(\zeta, z \mid \theta, \gamma)\, d\zeta
= \int_\zeta \frac{\exp\left(-\frac{1}{2}(y-\zeta)^T(\sigma^2 I_n)^{-1}(y-\zeta)\right)}{(2\pi)^{n/2}|\sigma^2 I_n|^{1/2}}
\cdot \frac{\exp\left(-\frac{1}{2}\binom{\zeta - M_y}{z - M_z}^T K_p^{-1}\binom{\zeta - M_y}{z - M_z}\right)}{(2\pi)^{(n+s)/2}|K_p|^{1/2}}\, d\zeta
\]
\[
= \frac{\exp\left(-\frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right)}{(2\pi)^{(n+s)/2}|K|^{1/2}}
\cdot \frac{|K|^{1/2}}{(2\pi)^{n/2}|\sigma^2 I_n|^{1/2}|K_p|^{1/2}}
\]
\[
\quad\times \int_\zeta \exp\left(-\frac{1}{2}(y-\zeta)^T(\sigma^2 I_n)^{-1}(y-\zeta)
- \frac{1}{2}\binom{\zeta - M_y}{z - M_z}^T K_p^{-1}\binom{\zeta - M_y}{z - M_z}
+ \frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right) d\zeta
\]
\[
= \frac{1}{(2\pi)^{(n+s)/2}|K|^{1/2}}\exp\left(-\frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right)\times 1.
\]
The integral is equal to 1, since it is an integration of a multivariate normal probability density function over ζ with the covariance matrix ((σ²I_n)^{-1} + (C_{11} − C_{12}C_{22}^{-1}C_{21})^{-1})^{-1}.
Namely,
\[
\frac{|K|^{1/2}}{|\sigma^2 I_n|^{1/2}|K_p|^{1/2}}
= \frac{|C_{22}|^{1/2}|C_{11} + \sigma^2 I_n - C_{12}C_{22}^{-1}C_{21}|^{1/2}}{|\sigma^2 I_n|^{1/2}|C_{22}|^{1/2}|C_{11} - C_{12}C_{22}^{-1}C_{21}|^{1/2}}
= \frac{|C_{11} + \sigma^2 I_n - C_{12}C_{22}^{-1}C_{21}|^{1/2}}{|\sigma^2 I_n|^{1/2}|C_{11} - C_{12}C_{22}^{-1}C_{21}|^{1/2}}
= \frac{|A + B|^{1/2}}{|A|^{1/2}|B|^{1/2}}
\]
\[
= \frac{1}{(|A^{-1}||B^{-1}||A + B|)^{-1/2}}
= \frac{1}{|A^{-1}B^{-1}A + A^{-1}B^{-1}B|^{-1/2}}
= \frac{1}{|A^{-1}B^{-1}A + A^{-1}|^{-1/2}}
= \frac{1}{|A^{-1}(B^{-1} + A^{-1})A|^{-1/2}}
= \frac{1}{(|A^{-1}||B^{-1} + A^{-1}||A|)^{-1/2}}
= \frac{1}{|(B^{-1} + A^{-1})^{-1}|^{1/2}},
\]
where we used the Schur complement identity for determinants in the first equality, and A = C_{11} − C_{12}C_{22}^{-1}C_{21}, B = σ²I_n. Lastly, writing
\[
K_p^{-1} = \begin{pmatrix} C^-_{11} & C^-_{12} \\ C^-_{21} & C^-_{22} \end{pmatrix},
\]
we have
\[
\exp\left(-\frac{1}{2}(y-\zeta)^T(\sigma^2 I_n)^{-1}(y-\zeta)
- \frac{1}{2}\binom{\zeta - M_y}{z - M_z}^T K_p^{-1}\binom{\zeta - M_y}{z - M_z}
+ \frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right)
\]
\[
\propto \exp\left(-\frac{1}{2}\zeta^T(\sigma^2 I_n)^{-1}\zeta + \zeta^T(\sigma^2 I_n)^{-1}y
- \frac{1}{2}\left[(\zeta - M_y)^T C^-_{11}(\zeta - M_y) + 2(\zeta - M_y)^T C^-_{12}(z - M_z)\right]\right)
\propto \exp\left(-\frac{1}{2}\zeta^T\left((\sigma^2 I_n)^{-1} + C^-_{11}\right)\zeta + \zeta^T b\right),
\]
where b is a constant column vector (not depending on ζ), and C^-_{11} = (C_{11} − C_{12}C_{22}^{-1}C_{21})^{-1} due to the Schur complement identity for the matrix inverse. This shows that the integral is indeed equal to 1 as stated, and the equality (4.26) holds.

4.5.2 Proofs

Proof of Theorem 2

Note that for any ε > 0, the posterior probability of interest p(ζ ∈ U_n^C | y_1, ..., y_n, θ, γ, σ̂_n) can be bounded from above as
\[
p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)
\leq p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)\,\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}
+ \mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| > \varepsilon\},
\]
where
\[
p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)\,\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}
\leq \Phi_n + \frac{I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) + I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon)}{I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)},
\]
with
\[
I_{1n} = (1-\Phi_n)\int_{U_n^C \cap F_n}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\, d\Pi(\zeta|\theta,\gamma),
\]
\[
I_{2n} = \int_{U_n^C \cap F_n^C}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\, d\Pi(\zeta|\theta,\gamma),
\qquad
I_{3n} = \int_{F}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma).
\]
Since assumption (A3) implies that 1{|σ̂_n/σ_0 − 1| > ε} → 0 a.s. P_0, it is enough to show that there exists ε > 0 so that
\[
\sup_{(\theta,\gamma)\in\Upsilon} \Phi_n \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0, \tag{4.27}
\]
\[
\sup_{(\theta,\gamma)\in\Upsilon} e^{\beta_1 n} I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0 \text{ for some } \beta_1 > 0, \tag{4.28}
\]
\[
\sup_{(\theta,\gamma)\in\Upsilon} e^{\beta_2 n} I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0 \text{ for some } \beta_2 > 0, \tag{4.29}
\]
\[
e^{\beta_3 n} I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n) \xrightarrow[n\to\infty]{} \infty \quad \text{a.s. } P_0 \text{ for some } \beta_3 > 0, \text{ uniformly in } (\theta,\gamma)\in\Upsilon, \tag{4.30}
\]
where β_3 ≤ min{β_1, β_2}. The rest of the proof follows the general steps of the proof of Theorem 1 in Choi and Schervish (2007a) and Theorem 9 in Choi (2007), with some non-trivial treatment of the constant ε. We provide step-by-step details below.

Step 1). By the Markov inequality, for any δ > 0,
\[
\sum_{n=1}^{\infty} P_0(\Phi_n > \delta) \leq \frac{1}{\delta}\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n,
\]
which, due to condition (i) of (A2) and the first Borel–Cantelli lemma, yields Φ_n → 0 a.s. P_0. Since this does not depend on (θ, γ), it implies (4.27).

Step 2).
By Fubini's theorem, and for any 0 < ε < ε̃_2,
\[
E_{\zeta_0,\sigma_0}\left(I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon)\right)
= \int_{U_n^C \cap F_n} E_{\zeta_0,\sigma_0}\left[(1-\Phi_n)\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\right] d\Pi(\zeta|\theta,\gamma)
\]
\[
\leq \left(\frac{\sigma_0(1-\varepsilon)}{\sigma_0(1+\varepsilon)}\right)^{-n}\int_{U_n^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}\left[1-\Phi_n\right] d\Pi(\zeta|\theta,\gamma)
\leq \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^{-n}\sup_{\zeta\in U_n^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}\left[1-\Phi_n\right]
\leq \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^{-n} C_2 e^{-cn} = C_2 e^{-\tilde{c}n},
\]
where c̃ = c + log(1 − ε) − log(1 + ε), which together with condition (iii) of (A2) implies c̃ > 0. Thus
\[
P_0\left\{I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \geq e^{-\tilde{c}n/2}\right\} \leq C_2 e^{\tilde{c}n/2} e^{-\tilde{c}n} = C_2 e^{-\tilde{c}n/2}.
\]
Therefore, for any ε > 0 with ε < ε̃_2 there exists a constant c̃ for which the first Borel–Cantelli lemma implies
\[
e^{\tilde{c}n/4} I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0.
\]
Since this bound does not depend on (θ, γ), it implies (4.28).

Step 3). Proceeding as in Step 2), Fubini's theorem implies
\[
E_{\zeta_0,\sigma_0}\left(I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon)\right)
= E_{\zeta_0,\sigma_0}\left[\int_{U_n^C \cap F_n^C}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\, d\Pi(\zeta|\theta,\gamma)\right]
\]
\[
\leq \left(\frac{\sigma_0(1-\varepsilon)}{\sigma_0(1+\varepsilon)}\right)^{-n}\int_{U_n^C \cap F_n^C} E_{\zeta,\sigma_0(1+\varepsilon)}[1]\, d\Pi(\zeta|\theta,\gamma)
\leq \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^{-n}\Pi(F_n^C \mid \theta, \gamma).
\]
Condition (ii) of (A2) and the first Borel–Cantelli lemma then imply that for any ε < (1 − e^{-c_1})/(1 + e^{-c_1}),
\[
\sup_{(\theta,\gamma)\in\Upsilon} e^{\tilde{k}n/4} I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0,
\]
where k̃ = c_1 + log(1 − ε) − log(1 + ε).

Step 4). To prove (4.30), given any 0 < ρ < 1, we first observe the following:
\[
I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)
\geq I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)\,\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \rho\}
\geq \left(\frac{1-\rho}{1+\rho}\right)^{n}\int_{F}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma),
\]
where the last inequality holds on the event {|σ̂_n/σ_0 − 1| ≤ ρ}, which occurs for all sufficiently large n a.s. P_0 by (A3). Let us now define log⁺(x) = max{0, log(x)} and log⁻(x) = −min{0, log(x)}, as well as
\[
W_i = \log^+\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}, \quad
K_i^+(\zeta_0, \zeta) = \int p(y_i|\zeta_{0,i}, \sigma_0)\log^+\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}\, dy_i, \quad
K_i^-(\zeta_0, \zeta) = \int p(y_i|\zeta_{0,i}, \sigma_0)\log^-\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}\, dy_i.
\]
Then we get
\[
\mathrm{Var}_{\zeta_0,\sigma_0}(W_i) = E_{\zeta_0,\sigma_0}(W_i^2) - \{K_i^+(\zeta_0, \zeta)\}^2
\leq E_{\zeta_0,\sigma_0}(W_i^2) - \{K_i(\zeta_0, \zeta)\}^2
\leq \int p(y_i|\zeta_{0,i}, \sigma_0)\left(\log\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}\right)^2 dy_i - \{K_i(\zeta_0, \zeta)\}^2 = V_i(\zeta_0, \zeta).
\]
Hence, by condition (i) of (A1), for any ρ < ε̃_1 and ζ ∈ B,
\[
\sum_{i=1}^{\infty}\frac{\mathrm{Var}_{\zeta_0,\sigma_0}(W_i)}{i^2} \leq \sum_{i=1}^{\infty}\frac{V_i(\zeta_0, \zeta)}{i^2} < \infty,
\]
and by Kolmogorov's strong law of large numbers for independent non-identically distributed random variables (e.g.
Shiryaev (1996), Chapter 3),
\[
\frac{1}{n}\sum_{i=1}^{n}\left(W_i - K_i^+(\zeta_0, \zeta)\right) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0.
\]
As a result, for every ζ ∈ B, with P_0 probability 1,
\[
\liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}
= -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}
\geq -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log^+\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}
\]
\[
= -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}K_i^+(\zeta_0, \zeta)
\geq -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\left(K_i(\zeta_0, \zeta) + \sqrt{\frac{K_i(\zeta_0, \zeta)}{2}}\right)
\geq -\limsup_{n\to\infty}\left(\frac{1}{n}\sum_{i=1}^{n}K_i(\zeta_0, \zeta) + \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{K_i(\zeta_0, \zeta)}{2}}\right).
\]
The equality replacing the averaged log⁺ terms by the averaged K_i^+ follows from the almost sure convergence proved in the previous paragraph; the second-to-last inequality follows from Amewou-Atisso et al. (2003), and the final one from the concavity of the square root.

We now make use of condition (ii) of (A1). Let us consider β > 0 and select ∆ so that ∆ + sqrt(∆/2) ≤ β/8, and define C = B ∩ {ζ : K_i(ζ_0, ζ) < ∆ for all i}. By (A1) there exists ε̃_1 so that 0 < ρ < ε̃_1 implies Π(C | θ, γ) > 0. Therefore, for each ζ ∈ C,
\[
\liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}
\geq -\left(\Delta + \sqrt{\frac{\Delta}{2}}\right) \geq -\frac{\beta}{8},
\]
since (1/n)∑_{i=1}^{n} K_i(ζ_0, ζ) < ∆ for all ζ ∈ C. Finally, for any ρ < min{ε̃_1, (1 − e^{-β/8})/(1 + e^{-β/8})},
\[
\liminf_{n\to\infty} e^{2n\beta/8} I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)
\geq \liminf_{n\to\infty} e^{2n\beta/8}\left(\frac{1-\rho}{1+\rho}\right)^{n}\int_{F}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma)
\]
\[
\geq \liminf_{n\to\infty} e^{2n\beta/8}\left(\frac{1-\rho}{1+\rho}\right)^{n}\int_{C}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma)
\geq \int_{C}\liminf_{n\to\infty} e^{2n\beta/8}\left(\frac{1-\rho}{1+\rho}\right)^{n}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma) = \infty,
\]
where the last inequality follows from Fatou's lemma. Note that the actual bound on I_{3n} does not depend on (θ, γ). Taking ε < min{ε̃_2, (1 − e^{-c_1})/(1 + e^{-c_1})} concludes the proof.

Proof of Theorem 3

We shall first define some notation. Let 0 < r < ν/2 and t = r/4. Let N_t = N(t, F_n, ‖·‖_∞) be the covering number of F_n. In Theorem 2.7.1, van der Vaart and Wellner (1996) show that there exists a constant K so that log N_t ≤ K M_n / t^p, and therefore log N_t = O(M_n), where M_n = O(n^α) for α ∈ (1/2, 1) according to the definition of the sieves. Let us consider τ ∈ (α/2, 1/2) and define c_n = n^τ, so that log(N_t) = o(c_n²). Moreover, let ζ_1, ..., ζ_{N_t} ∈ F_n be finitely many elements of the sieve such that for every ζ ∈ F_n there is an i ∈ {1, ..., N_t} satisfying ‖ζ − ζ_i‖_∞ < t. This implies that if ζ ∈ F_n is such that ∫|ζ(t) − ζ_0(t)| dQ_n(t) > ν, then ∫|ζ_i(t) − ζ_0(t)| dQ_n(t) > ν/2.

The next step in the proof is to construct a test for each ζ_i, with the resulting functions Φ_n defined as a combination of the individual tests, and to show that the probabilities of type I and type II errors satisfy the properties of the theorem. Let us recall that ζ_j = ζ(t_j) and ζ_{0,j} = ζ_0(t_j). For an arbitrary ζ ∈ F_n such that ‖ζ − ζ_i‖_∞ < t, let us define ζ_{1,j} = ζ_i(t_j), and b_j = 1 if ζ_{1,j} > ζ_{0,j} and b_j = −1 otherwise. For any ν > 0, let Ψ_n[ζ, ν] be the indicator of the set A defined as follows:
\[
A = \left\{\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) > 2c_n\sqrt{n}\right\}.
\]
The test functions Φ_n are then
\[
\Phi_n = \max_{1\leq j\leq N_t}\Psi_n\left[\zeta_j, \frac{\nu}{2}\right].
\]

Type I error.
The Mill's ratio implies
\[
E_{\zeta_0,\sigma_0}(\Psi_n) = P_0\left(\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) > 2c_n\sqrt{n}\right) = 1 - \Phi(2c_n) \leq \frac{1}{2c_n\sqrt{2\pi}} e^{-2c_n^2} \leq e^{-2c_n^2}.
\]
The function Φ(·) is the CDF of the standard normal distribution. Consequently, we have
\[
E_{\zeta_0,\sigma_0}(\Phi_n) \leq \sum_{j=1}^{N_t} E_{\zeta_0,\sigma_0}\left(\Psi_n\left[\zeta_j, \frac{\nu}{2}\right]\right) \leq N_t e^{-2c_n^2} = e^{\log(N_t) - 2c_n^2} \leq e^{-c_n^2},
\]
and
\[
\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n < \infty.
\]

Type II error. It is sufficient to find an i for which the probability of the type II error of Ψ_n[ζ_i, ν/2], given an arbitrary ζ ∈ W_{ν,n}^C ∩ F_n, is sufficiently small. This is because the probability of the type II error for the composite test Φ_n is no larger than the smallest one among the Ψ_n[ζ_i, ν/2]. Note that here we assume ∫|ζ(t) − ζ_0(t)| dQ_n(t) > ν, and then ∫|ζ_i(t) − ζ_0(t)| dQ_n(t) > ν/2. For every r < ν/2, Choi and Schervish (2007b) show that
\[
\sum_{j=1}^{n} |\zeta_{1,j} - \zeta_{0,j}| > rn.
\]
Let n be large enough so that 4σ_0 c_n < r√n; then for any 0 < ε < 1,
\[
E_{\zeta,\sigma_0(1+\varepsilon)}\left(1 - \Psi_n\left[\zeta_i, \frac{\nu}{2}\right]\right)
= P_{\zeta,\sigma_0(1+\varepsilon)}\left(\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) \leq 2c_n\sqrt{n}\right)
= P_{\zeta,\sigma_0(1+\varepsilon)}\left(\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j + \zeta_j - \zeta_{1,j} + \zeta_{1,j} - \zeta_{0,j}}{\sigma_0}\right) \leq 2c_n\sqrt{n}\right)
\]
\[
\leq P_{\zeta,\sigma_0(1+\varepsilon)}\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq 2c_n + \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\frac{|\zeta_j - \zeta_{1,j}|}{\sigma_0} - \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\frac{|\zeta_{1,j} - \zeta_{0,j}|}{\sigma_0}\right)
\]
\[
\leq P_{\zeta,\sigma_0(1+\varepsilon)}\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq 2c_n + \frac{r\sqrt{n}}{4\sigma_0} - \frac{r\sqrt{n}}{\sigma_0}\right)
\leq P_{\zeta,\sigma_0(1+\varepsilon)}\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq -\frac{r\sqrt{n}}{4\sigma_0}\right)
\]
\[
= \Phi\left(-\frac{r\sqrt{n}}{4\sigma_0(1+\varepsilon)}\right)
\leq \frac{4\sigma_0(1+\varepsilon)}{r\sqrt{2\pi n}}\, e^{-\frac{nr^2}{32\sigma_0^2(1+\varepsilon)^2}}.
\]
To establish part (ii) of the theorem, we need to show that there exists a constant 0 < ε̃ < 1 so that for any ε < ε̃,
\[
\frac{r^2}{32\sigma_0^2(1+\varepsilon)^2} + \log\left(\frac{1-\varepsilon}{1+\varepsilon}\right) > 0. \tag{4.31}
\]
Take κ = r²/(32σ_0²) and define b(ε) to be the left-hand side of (4.31),
\[
b(\varepsilon) = \kappa\left(\frac{1}{(1+\varepsilon)^2} + \frac{1}{\kappa}\log\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\right).
\]
The function b(ε) is clearly continuous at ε = 0 with b(0) = κ > 0. Hence, for each κ > 0, there exists ε̃ such that for all 0 < ε < ε̃, b(ε) > 0.

Proof of Lemma 4

Theorem 5 of Ghosal and Roy (2006) implies that there exist positive constants C, d_0, d_1, ..., d_p so that for i = 1, ..., p,
\[
P\left(\sup_{t\in[0,1]^p}|\zeta(t)| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_0 M_n^2/\rho_0^2(\theta,\gamma)}, \qquad
P\left(\sup_{t\in[0,1]^p}\Big|\frac{\partial}{\partial t_i}\zeta(t)\Big| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_i M_n^2/\rho_i^2(\theta,\gamma)}.
\]
The continuity of the ρ_i²(θ, γ), for i = 0, ..., p, on the compact set Υ implies that they are uniformly bounded. Therefore, there exist universal constants (Ξ_{0,1}, Ξ_{0,2}), ..., (Ξ_{p,1}, Ξ_{p,2}) such that for i = 0, ..., p,
\[
0 < \Xi_{i,1} \leq \rho_i^2(\theta, \gamma) \leq \Xi_{i,2} \quad \text{for all } (\theta, \gamma) \in \Upsilon.
\]
Hence, for i = 1, ..., p,
\[
\sup_{(\theta,\gamma)\in\Upsilon} P\left(\sup_{t\in[0,1]^p}|\zeta(t)| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_0 M_n^2/\Xi_{0,2}}, \qquad
\sup_{(\theta,\gamma)\in\Upsilon} P\left(\sup_{t\in[0,1]^p}\Big|\frac{\partial}{\partial t_i}\zeta(t)\Big| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_i M_n^2/\Xi_{i,2}}.
\]
A union bound over the p + 1 events defining F_n^C, together with the fact that M_n² grows faster than n, then yields the claimed bound on sup_{(θ,γ)∈Υ} p(F_n^C | z, θ, γ).

Proof of Theorem 4

First, we show that σ̂_n² is asymptotically unbiased. Note that
\[
E[(y_{i+1} - y_i)^2] = [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2 + \sigma_0^2\, E[(\varepsilon_{i+1} - \varepsilon_i)^2] = [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2 + 2\sigma_0^2,
\]
because the ε_i are i.i.d. N(0, 1). Consequently,
\[
E(\hat{\sigma}_n^2) = \frac{\sum_{i=1}^{n-1}[\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2}{2(n-1)} + \sigma_0^2. \tag{4.32}
\]
Since ζ_0 is continuously differentiable on the compact and convex set Ω, it is also (globally) Lipschitz on Ω (e.g.
Schaeffer and Cain (2016), Corollary 3.2.4), and there exists a real constant K so that
\[
|\zeta_0(t_{i+1}) - \zeta_0(t_i)| \leq K\sum_{j=1}^{p}|t_{i+1,j} - t_{i,j}|.
\]
Therefore, due to the design assumption (AD),
\[
0 \leq \frac{\sum_{i=1}^{n-1}[\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2}{2(n-1)} \leq \frac{K^2 p^2}{2}\left[\sup_{i\in\{1,\dots,n-1\},\, j\in\{1,\dots,p\}}|t_{i+1,j} - t_{i,j}|\right]^2 \xrightarrow[n\to\infty]{} 0, \tag{4.33}
\]
and the combination of (4.32) with (4.33) implies
\[
E(\hat{\sigma}_n^2) \xrightarrow[n\to\infty]{} \sigma_0^2. \tag{4.34}
\]
To show the almost sure convergence of σ̂_n², let us now denote x_i = (y_{i+1} − y_i)² and rewrite the estimator σ̂_n² as a sum of two estimators, each consisting of a sum of independent variables:
\[
\hat{\sigma}_n^2
= \frac{1}{2}\cdot\frac{\sum_{i=1}^{(n-1)/2} x_{2i}}{2\left(\frac{n-1}{2}\right)}
+ \frac{1}{2}\cdot\frac{\sum_{j=1}^{(n-1)/2} x_{2j-1}}{2\left(\frac{n-1}{2}\right)}
= \hat{\sigma}_{n,e}^2 + \hat{\sigma}_{n,o}^2.
\]
Without loss of generality, we have assumed that n is an odd integer. Lastly, note that Var(x_i) ≤ C < ∞ uniformly in i. This is because the differences ζ_0(t_{i+1}) − ζ_0(t_i) are uniformly bounded on the compact set Ω due to the continuity of ζ_0, and, additionally, the y_{i+1} − y_i are Gaussian and have bounded moments. We can now apply Kolmogorov's strong law of large numbers for independent non-identically distributed random variables (e.g. Shiryaev (1996), Chapter 3), and as a result
\[
\hat{\sigma}_{n,e}^2 \xrightarrow[n\to\infty]{} \frac{1}{2}\sigma_0^2 \quad \text{a.s. } P_0, \qquad
\hat{\sigma}_{n,o}^2 \xrightarrow[n\to\infty]{} \frac{1}{2}\sigma_0^2 \quad \text{a.s. } P_0, \qquad
\hat{\sigma}_n^2 = \hat{\sigma}_{n,e}^2 + \hat{\sigma}_{n,o}^2 \xrightarrow[n\to\infty]{} \sigma_0^2 \quad \text{a.s. } P_0.
\]

4.5.3 Supplement for the transverse harmonic wave simulation

This section contains some additional figures comparing the empirical Bayes fit with the fully Bayesian approach based on the posterior samples obtained via the MH algorithm.

Figure 4.4: Detail of 95% credible bands plotted at t = 0.00.

Figure 4.5: Detail of 95% credible bands plotted at t = 0.43.

Figure 4.6: Detail of 95% credible bands plotted at t = 0.71.

Figure 4.7: Detail of 95% credible bands plotted at t = 1.00.

Figure 4.8: Comparison of the convergence to the true physical process ζ_0(t, x). The curves with 95% credible intervals are plotted at t = 0.00.

Figure 4.9: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.43.

Figure 4.10: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.71.

Figure 4.11: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 1.00.
CHAPTER 5

CONCLUSION

We devote the final chapter of this dissertation to the summary of the advances in computational statistics and the developments of new statistical tools of UQ that were made in Chapters 2, 3, and 4. We also provide an overview of the new and exciting avenues this work opens for future research.

In Chapter 2, we studied BMA, the natural Bayesian framework to account for the model uncertainty that arises in situations when multiple competing models are available to describe the same or a similar physical process. Motivated by a recurrent scenario in the field of nuclear physics, we extended BMA to the scenario where competing models are defined on non-identical study regions. We gave a theoretical justification for the use of the BMA posterior mean predictor in terms of PMSE reduction. While this predictor does not guarantee a universal improvement in predictive ability, on average, it performs at least as well as the best model under consideration. Finally, we applied the methodology outlined in Chapter 2 under several scenarios that lead to better predictions and improved UQ: one simple and transparent exercise of averaging of proton potentials, and a pedagogical example of domain-corrected averaging with a synthetic dataset. We also provided a full-scale BMA analysis of 9 state-of-the-art nuclear mass models and a study of the LDM of nuclear binding energies trained on discrepant domains of the nuclear chart.

In Chapter 3, we developed a novel VBI approach to Bayesian calibration of computationally complex and many-parameter computer models. We exploited the probabilistic theory of approximation coupled with the pairwise construction of multivariate copulas to create a computationally efficient and scalable algorithm for calibration. In addition, we proposed Rao-Blackwellization, control variates, and importance sampling to reduce the variance of the noisy gradient estimates involved in the stochastic approximation. The theoretical justification for scalability was also established. In our examples, we first carried out an extensive simulation study that provided empirical evidence for the accuracy and scalability of our method in scenarios where the traditional MCMC-based approaches become impractical. We established the superiority of variational calibration over the MH algorithm and NUTS in terms of time efficiency and memory requirements. We also demonstrated the opportunities given by our method for practitioners on a real data example through the calibration of the LDM.

In Chapter 4, we proposed an empirical Bayes approach to model-enabled predictions of physical quantities as a fast and easy-to-implement alternative to the fully Bayesian treatment (also discussed in Chapter 3). A new hierarchical model representation of the Bayesian model for calibration of computer models was presented. A theoretical study of the proposed methodology was provided under this new representation.
In particular, we established the posterior consistency of the physical process, assuming smoothness of the mean and covariance functions of the GP priors and the existence of a strongly consistent estimator of the noise scale. Consequently, we proposed two plug-in estimators for the GP model hyperparameters and a strongly consistent estimator of the noise scale parameter. A simulation study that established the efficiency of the method and empirically verified the consistency was provided. Lastly, we revisited the LDM of binding energies and showed that our method yields results comparable to the fully Bayesian treatment.

5.1 Future research

The extension of BMA to the situation with models defined over non-overlapping input domains addresses only one of many practical challenges in Bayesian model mixing. From a methodology perspective, developing a principled approach to average models locally, with model weights depending on input values, would mitigate the tendency of BMA to perform global model selection when one of the models significantly dominates on some (small) part of the input space. Computationally, BMA is a two-step procedure, where one needs to first obtain samples from the posterior distributions under the individual models and subsequently sample from the BMA posterior density. A direct approximation of the BMA posterior, potentially using variational methods, would considerably improve the ease of implementation.

A natural next step to enhance the impact of the VBI approach for calibration of computer models that we proposed in Chapter 3 would be to examine its theoretical properties. For example, one could pursue a frequentist consistency result similar to Wang and Blei (2019). If we establish the conditions under which the ELBO L(λ) and the l-truncated ELBO L_{D_l}(λ) (respectively L_{C_l}(λ)) are equivalent in the limit, namely L_{D_l}(λ) = L(λ) + o_p(1), the asymptotic properties of Wang and Blei (2019) will directly extend to our methodology. Besides theoretical investigations, a procedure that avoids the current sequential approach to selecting the truncation level would be beneficial. For instance, using fit indices for finding a sufficient truncation appears to be a promising approach, as discussed by Brechmann and Joe (2015).

When it comes to the empirical Bayes approach to model-enabled prediction, we have already noted the need for further investigation of specific mean and covariance functions of GP priors that satisfy the smoothness conditions for posterior consistency. Most importantly, the hierarchical model representation of the Kennedy and O'Hagan (2001) framework, together with the theoretical developments in Section 4.2, constitutes a solid foundation to establish the posterior consistency of the physical process ζ in the fully Bayesian regime; that is, in a scenario with suitable prior distributions over the hyperparameters of the Gaussian process priors and the calibration parameters.

BIBLIOGRAPHY

Amewou-Atisso, M., Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. (2003). Posterior consistency for semi-parametric regression problems. Bernoulli, 9(2):291–312.

Anni, R., Co', G., and Pellegrino, P. (1995). Nuclear charge density distributions from elastic electron scattering data. Nuclear Physics A, 584:35–59.

Audi, G., Wapstra, A., and Thibault, C. (2003). The AME2003 atomic mass evaluation: (ii). tables, graphs and references. Nuclear Physics A, 729:337–676.

Balasubramanian, J. B., Visweswaran, S., Cooper, G. F., and Gopalakrishnan, V. (2014).
Selective model averaging with Bayesian rule learning for predictive biomedicine. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Sci- ence, 2014:17–22. Bartel, J., Quentin, P., Brack, M., Guet, C., and H˚akansson, H.-B. (1982). Towards a better parametrisation of Skyrme-like effective forces: a critical study of the SkM force. Nuclear Physics A, 386(1):79–100. Bauer, M., van der Wilk, M., and Rasmussen, C. E. (2016). Understanding probabilistic sparse gaussian process approximations. In Proceedings of the 30th International Confer- ence on Neural Information Processing Systems, NeurIPS’16, pages 1533–1541. Bayarri, M. J., Berger, J. O., Paulo, R., Sacks, J., Cafeo, J. A., Cavendish, J., Lin, C.- H., and Tu, J. (2007). A framework for validation of computer models. Technometrics, 49:138–154. Bedford, T. and Cooke, R. M. (2002). Vines–a new graphical model for dependent random variables. Annals of Statistics, 30(4):1031–1068. Benzaid, D., Bentridi, S., Kerraci, A., and Amrani, N. (2020). Bethe–Weizs¨acker semiem- pirical mass formula coefficients 2019 update based on AME2016. Nuclear Science and Techniques, 31:9. Bernardo, J. M. and Smith, A. F. M. (1994). Reference analysis, chapter Inference. Wiley. Bertsch, G. F. and Bingham, D. (2017). Estimating parameter uncertainty in binding-energy models by the frequency-domain bootstrap. Physical Review Letters, 119:252501. Bertsch, G. F., Sabbey, B., and Uusn¨akki, M. (2005). Fitting theories of nuclear binding energies. Physical Review C, 71:054311. Bethe, H. A. and Bacher, R. F. (1936). Nuclear physics a. stationary states of nuclei. Reviews of Modern Physics, 8:82–229. 125 Bottou, L., Le Cun, Y., and Bengio, Y. (1997). Global training of document processing systems using graph transformer networks. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 489–493. IEEE. Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356):791–799. Brechmann, E. C., Czado, C., and Aas, K. (2012). Truncated regular vines in high dimensions with application to financial data. The Canadian Journal of Statistics, 40(1):68–85. Brechmann, E. C. and Joe, H. (2015). Truncation of vine copulas using fit indices. Journal of Multivariate Analysis, 138:19–33. Brynjarsd´ottir, J. and O’Hagan, A. (2014). Learning about physical parameters: the impor- tance of model discrepancy. Inverse Problems, 30:114007. Casella, G. and Berger, R. (2002). Statistical Inference. Duxbury advanced series in statistics and decision sciences. Thomson Learning, second edition. Casella, G. and Robert, C. P. (1996). Rao-blackwellisation of sampling schemes. Biometrika, 83(1):81–94. Cauchois, B., L¨u, H., Boilley, D., and Royer, G. (2018). Uncertainty analysis of the nuclear liquid drop model. Physical Review C, 98:024305. Chabanat, E., Bonche, P., Haensel, P., Meyer, J., and Schaeffer, R. (1995). New Skyrme effective forces for supernovae and neutron rich nuclei. Physica Scripta, 1995(T56):231. Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49:327–335. Choi, T. (2007). Alternative posterior consistency results in nonparametric binary regression using gaussian process priors. Journal of Statistical Planning and Inference, 137(9):2975 – 2983. Choi, T. and Schervish, M. J. (2007a). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis, 98(10):1969 – 1987. Choi, T. and Schervish, M. J. (2007b). 
Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. Clyde, M. A., Ghosh, J., and Littman, M. L. (2011). Bayesian adaptive sampling for variable selection and model averaging. Journal of Computational and Graphical Statistics, 20:80– 101. Cooke, R. and Kurowicka, D. (2006). Uncertainty Analysis With High Dimensional Depen- dence Modelling. Wiley. Dissmann, J., Brechmann, E., Czado, C., and Kurowicka, D. (2013). Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59:52–69. 126 Dobaczewski, J., Flocard, H., and Treiner, J. (1984). Hartree-Fock-Bogolyubov description of nuclei near the neutron-drip line. Nuclear Physics A, 422(1):103–139. Dobaczewski, J., Nazarewicz, W., and Reinhard, P.-G. (2014). Error estimates of theoretical models: a guide. Journal of Physics G: Nuclear and Particle Physics, 41(7):074001. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121– 2159. Fayans, S. A. (1998). Towards a universal nuclear density functional. Journal of Experimental and Theoretical Physics Letters, 68(3):169–174. Feroz, F., Hobson, M. P., and Bridges, M. (2009). Multinest: an efficient and robust bayesian inference tool for cosmology and particle physics. Monthly Notices of the Royal Astronom- ical Society, 398:1601–1614. Geweke, J. (1999). Using simulation methods for Bayesian econometric models: inference, development, and communication. Econometric Reviews, 18:1–73. Ghosal, S. and Roy, A. (2006). Posterior consistency of gaussian process prior for nonpara- metric binary regression. Annals of Statistics, 34(5):2413–2429. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibra- tion and sharpness. The Journal of the Royal Statistical Society, Series B (Statistical Methodology), 69:243–268. Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and esti- mation. Journal of the American Statistical Association, 102:359–378. Goldberger, A. (1966). Econometric theory. Wiley publications in statistics. J. Wiley. Gu, M. and Wang, L. (2018). Scaled Gaussian stochastic process for computer model calibra- tion and prediction. SIAM/ASA Journal on Uncertainty Quantification, 6(4):1555–1583. Hern´andez, B., Raftery, A. E., Pennington, S. R., and Parnell, A. C. (2018). Bayesian addi- tive regression trees using Bayesian model averaging. Statistics and Computing, 28:869– 890. Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103:570– 583. Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2005). Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26:448–466. Higdon, D., McDonnell, J. D., Schunck, N., Sarich, J., and Wild, S. M. (2015). A Bayesian ap- proach for parameter estimation and prediction using a computationally intensive model. Journal of Physics G: Nuclear and Particle Physics, 42(3):034009. 127 Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14:382–401. Hoffman, M. and Blei, D. (2015). Stochastic structured variational inference. 
In Proceed- ings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 361–369, San Diego, CA. PMLR. Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347. Homan, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351– 1381. Hooten, M. B. and Hobbs, N. T. (2015). A guide to Bayesian model selection for ecologists. Ecological Monographs, 85:3–28. Horowitz, C. J., Arcones, A., Cˆot´e, B., Dillmann, I., Nazarewicz, W., et al. (2019). r- process nucleosynthesis: connecting rare-isotope beam facilities with the cosmos. Journal of Physics G: Nuclear and Particle Physics, 46(8):083001. Ireland, D. G. and Nazarewicz, W. (2015). Enhancing the interaction between nuclear experiment and theory through information and statistics. Journal of Physics G: Nuclear and Particle Physics, 42(3):030301. Jaganathen, Y., Betan, R. M. I., Michel, N., Nazarewicz, W., and P(cid:32)loszajczak, M. (2017). interaction for psd-shell nuclei. Physical Review C, Quantified gamow shell model 96:054316. Johnston, J. (1976). Econometric Methods. McGraw-Hill. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233. Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90:773–795. Kejzlar, V. and Maiti, T. (2020). Variational inference with vine copulas: An efficient approach for bayesian computer model calibration. arxiv.org/2003.12890. Kejzlar, V., Neufcourt, L., Maiti, T., and Viens, F. (2019). Bayesian averaging of computer models with domain discrepancies: a nuclear physics perspective. arxiv.org/1904.04793. Kejzlar, V., Neufcourt, L., Nazarewicz, W., and Reinhard, P.-G. (2020). Statistical aspects of nuclear mass models. Journal of Physics G: Nuclear and Particle Physics. URL https: // doi. org/ 10. 1088/ 1361-6471/ ab907c . Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464. 128 King, G. B., Lovell, A. E., Neufcourt, L., and Nunes, F. M. (2019). Direct comparison between Bayesian and frequentist uncertainty quantification for nuclear reactions. Physical Review Letters, 122:232502. Kirson, M. W. (2008). Mutual influence of terms in a semi-empirical mass formula. Nuclear Physics A, 798(1):29 – 60. Klupfel, P., Reinhard, P. G., Burvenich, T. J., and Maruhn, J. A. (2009). Variations on a theme by Skyrme: A systematic study of adjustments of model parameters. Physical Review C, 79:034310. Kl¨upfel, P., Reinhard, P.-G., B¨urvenich, T. J., and Maruhn, J. A. (2009). Variations on a theme by Skyrme: A systematic study of adjustments of model parameters. Physical Review C, 79(3):034310. Kortelainen, M., Lesinski, T., Mor´e, J. J., Nazarewicz, W., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. M. (2010a). Nuclear energy density optimization. Physical Review C, 82(2):024313. Kortelainen, M., Lesinski, T., Mor´e, J., Nazarewicz, W., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. (2010b). Nuclear energy density optimization. Physical Review C, 82:024313. Kortelainen, M., McDonnell, J., Nazarewicz, W., Olsen, E., Reinhard, P.-G., Sarich, J., Schunck, N., Wild, S. M., Davesne, D., Erler, J., and Pastore, A. 
Kortelainen, M., McDonnell, J., Nazarewicz, W., Reinhard, P.-G., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. M. (2012). Nuclear energy density optimization: Large deformations. Physical Review C, 85:024304.
Krane, K. (1987). Introductory Nuclear Physics. Wiley.
Lawrence, E., Heitmann, K., White, M., Higdon, D., Wagner, C., Habib, S., and Williams, B. (2010). The Coyote Universe III: simulation suite and precision emulator for the nonlinear matter power spectrum. The Astrophysical Journal, 713(2):1322–1331.
Lynch, P. (2008). The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7):3431–3444. Special issue on predicting weather, climate and extreme events.
Madigan, D., Gavrin, J., and Raftery, A. E. (1995). Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Communications in Statistics - Theory and Methods, 24(9):2271–2292.
Martino, L., Laparra, V., and Camps-Valls, G. (2017). Probabilistic cross-validation estimators for Gaussian process regression. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 823–827.
McDonnell, J. D., Schunck, N., Higdon, D., Sarich, J., Wild, S. M., and Nazarewicz, W. (2015). Uncertainty quantification for nuclear density functional theory and information content of new measurements. Physical Review Letters, 114(12):122501.
Morris, M. D. and Mitchell, T. J. (1995). Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43(3):381–402.
Myers, W. D. and Swiatecki, W. J. (1966). Nuclear masses and deformations. Nuclear Physics, 81(2):1–60.
Nazarewicz, W. (2016). Challenges in nuclear structure theory. Journal of Physics G: Nuclear and Particle Physics, 43:044002.
Nazarewicz, W. (2018). The limits of nuclear mass and charge. Nature Physics, 14(6):537–541.
Neiswanger, W., Wang, C., and Xing, E. P. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 623–632, Arlington, VA. AUAI Press.
Neufcourt, L., Cao, Y., Giuliani, S., Nazarewicz, W., Olsen, E., and Tarasov, O. B. (2020a). Beyond the proton drip line: Bayesian analysis of proton-emitting nuclei. Physical Review C, 101:014319.
Neufcourt, L., Cao, Y., Giuliani, S. A., Nazarewicz, W., Olsen, E., and Tarasov, O. B. (2020b). Quantified limits of the nuclear landscape. Physical Review C, 101:044307.
Neufcourt, L., Cao, Y., Nazarewicz, W., Olsen, E., and Viens, F. (2019). Neutron drip line in the Ca region from Bayesian Model Averaging. Physical Review Letters, 122:062502.
Park, I. and Grandhi, R. V. (2012). Quantification of model-form and parametric uncertainty using evidence theory. Structural Safety, 39:44–51.
Parkinson, D. and Liddle, A. R. (2013). Bayesian model averaging in astrophysics: A review. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(1):3–14.
Pastore, A. (2019). An introduction to bootstrap for nuclear physics. Journal of Physics G: Nuclear and Particle Physics, 46(5):052001.
Peterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019.
Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285.
Plumlee, M. (2019). Computer model calibration with confidence and consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3):519–545.
Plumlee, M., Joseph, V. R., and Yang, H. (2016). Calibrating functional parameters in the ion channel models of cardiac cells. Journal of the American Statistical Association, 111:500–509.
Pollard, D., Chang, W., Haran, M., Applegate, P., and DeConto, R. (2016). Large ensemble modeling of the last deglacial retreat of the West Antarctic Ice Sheet: comparison of simple and advanced statistical techniques. Geoscientific Model Development, 9(5):1697–1723.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, pages 1939–1959.
Radaideh, M. I., Borowiec, K., and Kozlowski, T. (2019). Integrated framework for model assessment and advanced uncertainty quantification of nuclear computer codes under Bayesian statistics. Reliability Engineering & System Safety, 189:357–377.
Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822. PMLR.
Ranganath, R., Tran, D., and Blei, D. M. (2016). Hierarchical variational models. In Proceedings of the 33rd International Conference on Machine Learning – Volume 48, ICML’16, pages 2568–2577. JMLR.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
Reinhard, P.-G., Bender, M., Nazarewicz, W., and Vertse, T. (2006). From finite nuclei to the nuclear liquid drop: Leptodermous expansion based on self-consistent mean-field theory. Physical Review C, 73:014309.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407.
Robert, C. and Casella, G. (2005). Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer New York.
Ross, S. M. (2006). Simulation. Academic Press, Inc., Orlando, FL, fourth edition.
Ruiz, F. J. R., Titsias, M. K., and Blei, D. M. (2016). Overdispersed black-box variational inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 647–656, Arlington, VA. AUAI Press.
Schaeffer, D. G. and Cain, J. W. (2016). Nonlinear Systems: Local Theory, pages 79–109. Springer New York, New York, NY.
Schorning, K., Bornkamp, B., Bretz, F., and Dette, H. (2016). Model selection versus model averaging in dose finding studies. Statistics in Medicine, 35:4021–4040.
Sexton, D. M. H., Murphy, J. M., Collins, M., and Webb, M. J. (2012). Multivariate probabilistic projections using imperfect climate models. Part I: outline of methodology. Climate Dynamics, 38(11):2513–2542.
Shelley, M., Becker, P., Gration, A., and Pastore, A. (2014). Advanced statistical methods to fit nuclear models. Acta Physica Polonica B, 12:649.
Shiryaev, A. N. (1996). Convergence of Probability Measures. Central Limit Theorem, pages 308–378. Springer New York, New York, NY.
Silvestro, D., Schnitzler, J., Liow, L. H., Antonelli, A., and Salamin, N. (2014). Bayesian estimation of speciation and extinction from incomplete fossil occurrence data. Systematic Biology, 63:349–367.
Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–860.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8:229–231.
Smith, M. S., Loaiza-Maya, R., and Nott, D. J. (2020). High-dimensional copula variational approximation through transformation. Journal of Computational and Graphical Statistics, pages 1–35.
Sundararajan, S. and Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103–1118.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5, pages 567–574. PMLR.
Toivanen, J., Dobaczewski, J., Kortelainen, M., and Mizuyama, K. (2008). Error analysis of nuclear mass fits. Physical Review C, 78:034306.
Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42.
Tran, D., Blei, D. M., and Airoldi, E. M. (2015). Copula variational inference. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NeurIPS’15, pages 3564–3572, Cambridge, MA. MIT Press.
Tran, D., Ranganath, R., and Blei, D. M. (2017). Hierarchical implicit models and likelihood-free variational inference. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NeurIPS’17, pages 5529–5539.
Trotta, R. (2008). Bayes in the sky: Bayesian inference and model selection in cosmology. Contemporary Physics, 49:71–104.
Tuo, R. and Wu, C. F. J. (2015). Efficient calibration for imperfect computer models. Annals of Statistics, 43:2331–2352.
Tuo, R. and Wu, C. F. J. (2016). A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification, 4:767–795.
Utama, R., Piekarewicz, J., and Prosper, H. B. (2016). Nuclear mass predictions for the crustal composition of neutron stars: A Bayesian neural network approach. Physical Review C, 93:014311.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
Wang, M., Audi, G., Kondev, F. G., Huang, W. J., Naimi, S., and Xu, X. (2017). The AME2016 atomic mass evaluation (II). Tables, graphs and references. Chinese Physics C, 41:030003.
Wang, Y. and Blei, D. M. (2019). Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 114(527):1147–1161.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44:92–107.
Wei, W., Visweswaran, S., and Cooper, G. F. (2011). The application of naive Bayes model averaging to predict Alzheimer’s disease from genome-wide data. Journal of the American Medical Informatics Association, 18:370–375.
Weizsäcker, C. F. v. (1935). Zur Theorie der Kernmassen. Z. Phys., 96(7):431–458.
Wen, X. (2015). Bayesian model comparison in genetic association analysis: linear mixed modeling and SNP set testing. Biostatistics, 16:701–712.
Williams, B., Higdon, D., Gattiker, J., Moore, L., McKay, M., and Keller-McNulty, S. (2006). Combining experimental data and computer simulations, with an application to flyer plate experiments. Bayesian Analysis, 1(4):765–792.
Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, pages 1–47.
Yuan, C. (2016). Uncertainty decomposition method and its application to the liquid drop model. Physical Review C, 93:034310.
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
Zhang, H. F., Wang, L. H., Yin, J. P., Chen, P. H., and Zhang, H. F. (2017). Performance of the Levenberg-Marquardt neural network approach in nuclear mass prediction. Journal of Physics G: Nuclear and Particle Physics, 44(4):045110.
Zhang, L., Jiang, Z., Choi, J., Lim, C.-Y., Maiti, T., and Baek, S. (2019). Patient-specific prediction of abdominal aortic aneurysm expansion using Bayesian calibration. IEEE Journal of Biomedical and Health Informatics.