BAYESIAN UNCERTAINTY QUANTIFICATION OF COMPUTER MODELS WITH EFFICIENT CALIBRATION AND COMPUTATION

By

Vojtech Kejzlar

A DISSERTATION

Submitted to Michigan State University in partial fulfillment of the requirements for the degree of

Statistics – Doctor of Philosophy

2020

ABSTRACT

BAYESIAN UNCERTAINTY QUANTIFICATION OF COMPUTER MODELS WITH EFFICIENT CALIBRATION AND COMPUTATION

By Vojtech Kejzlar

The use of mathematical models, typically implemented in the form of computer code, has proliferated as a way to solve complex problems in many scientific applications such as nuclear physics and climate research. The computational and statistical tools of Uncertainty Quantification (UQ) are instrumental in assessing how accurately a computer model describes a physical process. The Bayesian framework for UQ has become the dominant approach because it provides a principled way of quantifying uncertainty in the language of probabilities. The ever-growing access to high performance computing in scientific communities has meanwhile created the need to develop next-generation tools and theory for the analysis of computer models. Motivated by practical research problems, this dissertation proposes novel computational tools and UQ methodology aimed at enhancing the quality of computer models, which leads to improved predictive capability and a more "honest" UQ.

First, we consider model uncertainty, which arises in situations when several competing models are available to describe the same or a similar physical phenomenon. One of the historically dominant methods to account for this source of uncertainty is Bayesian Model Averaging (BMA). We perform a systematic analysis of prediction errors and show that the use of the BMA posterior mean predictor leads to a reduction in mean squared error. In response to a recurrent research scenario in nuclear physics, BMA is extended to the situation where models are defined on non-identical study regions. We illustrate our methodology via pedagogical simulations and applications to forecasting nuclear observables, which exhibit improvements in both prediction error and empirical coverage probabilities.

In the second part of this dissertation, we concentrate on individual computer models, with particular focus on those which are computationally too expensive to be used directly for predictions. Furthermore, we consider computer models that need to be calibrated with experimental observations, because they depend on inputs whose values are generally unknown. We develop an efficient algorithm based on variational Bayes inference (VBI) for the calibration of computer models with Gaussian processes (GPs). To preserve the efficiency of VBI in the presence of dependent data, we adopt a pairwise decomposition of the data likelihood using vine copulas, which separate the information on the dependence structure in the data from their marginal distributions. We provide both theoretical and empirical evidence for the computational scalability of our algorithm and demonstrate the opportunities given by our method on a real-data example through calibration of the Liquid Drop Model of nuclear binding energies. As a fast and easy-to-implement alternative to the fully Bayesian treatment (such as the VBI approach), we propose an empirical Bayes approach to computer-enabled predictions of physical quantities. We offer a new perspective on the Bayesian calibration framework with GPs and provide its representation as a Bayesian hierarchical model.
Consequently, posterior consistency of the physical process is established, assuming certain smoothness properties of the GP priors and the existence of a strongly consistent estimator of the noise scale. A simulation study and a real-data example that support the consistency and efficiency of the empirical Bayes method are provided as well.

To Strýček for showing me it was possible.

ACKNOWLEDGEMENTS

I would like to take this opportunity to sincerely thank my advisors, Prof. Tapabrata Maiti and Prof. Frederi Viens, for their support, guidance, and encouragement. Prof. Maiti's exemplary commitment to research will continue to serve as a guide in my own academic pursuits. Thank you, Prof. Viens, for consistently providing me with resources and opportunities over the past years. I also thank my committee member Prof. Witold Nazarewicz, who readily welcomed me to the world of physics and allowed me to participate in exciting interdisciplinary projects. I would also like to thank Prof. R.V. Ramamoorthi for being a part of my committee, for our discussions, and for his insight. I would like to extend many thanks to my research partners, Dr. Léo Neufcourt, Dr. Shrijita Bhattacharya, Dr. Stefan Wild, Dr. P.-G. Reinhard, and Mookyong Son.

Lastly, I am grateful to my family. Thanks to my parents for their unequivocal support during all my studies and especially for their foresight in enrolling me in computer classes in elementary school despite my protests. Thanks to my brother, Jakub, for his willingness to meet up with me anywhere in the world, and thanks to my wife, May, for believing in me from the start and always cheering me on.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
KEY TO ABBREVIATIONS
CHAPTER 1  INTRODUCTION
  1.1 Computer models and sources of uncertainty
  1.2 Bayesian calibration of imperfect computer models
  1.3 Bayesian model averaging
  1.4 Dissertation outline
CHAPTER 2  SURVEY OF BAYESIAN MODEL AVERAGING WITH EXAMPLES AND EXTENSION TO DISCREPANT DOMAINS
  2.1 Optimality of BMA predictions
  2.2 BMA with discrepant domains
    2.2.1 Two models
    2.2.2 K models
  2.3 Examples and applications
    2.3.1 Averaging of proton potentials
    2.3.2 Averaging of nuclear mass emulators in the Ca region
    2.3.3 Averaging of the Liquid Drop Model variants
    2.3.4 Averaging of models with discrepant domains: a pedagogical example
  2.4 Technical details and supplementary results
    2.4.1 A simple example of evidence integral with closed form solution
    2.4.2 Proofs
    2.4.3 Supplement for the general case of K models
    2.4.4 Supplement for the examples and applications
CHAPTER 3  AN EFFICIENT ALGORITHM FOR BAYESIAN CALIBRATION OF COMPUTER MODELS VIA VARIATIONAL INFERENCE
  3.1 Variational Bayes inference
  3.2 Variational calibration of computer models
    3.2.1 Multivariate copulas and likelihood decomposition
    3.2.2 Scalable algorithm with truncated vine copulas
  3.3 Implementation details
    3.3.1 Selection of truncation level
    3.3.2 Variance reduction of Monte Carlo approximations
    3.3.3 Choice of the learning rate
    3.3.4 Parametrizations
  3.4 Applications
    3.4.1 Simulation study
    3.4.2 Calibration of the Liquid Drop Model
  3.5 Technical details and supplementary results
    3.5.1 Scalable algorithm with truncated vine copulas: C-vine
    3.5.2 Proofs
    3.5.3 Supplement for the calibration of the Liquid Drop Model
CHAPTER 4  EMPIRICAL BAYES CALIBRATION OF COMPUTER MODELS WITH CONSISTENT PREDICTIONS
  4.1 Hierarchical model for Bayesian calibration of computer models
  4.2 Posterior consistency, a theoretical validation
  4.3 Parameter estimation and prediction
    4.3.1 Estimation of hyperparameters
      4.3.1.1 Marginal data likelihood
      4.3.1.2 Predictive likelihood with K-fold cross-validation
    4.3.2 Algorithm for predictions
  4.4 Applications
    4.4.1 Transverse harmonic wave
    4.4.2 The Liquid Drop Model revisited
  4.5 Technical details and supplementary results
    4.5.1 Equivalency of hierarchical model
    4.5.2 Proofs
    4.5.3 Supplement for the transverse harmonic wave simulation
CHAPTER 5  CONCLUSION
  5.1 Future research
BIBLIOGRAPHY

LIST OF TABLES

Table 2.1: RMSE (in MeV) and the improvement under the BMA posterior mean predictor calculated on the testing dataset (n = 70, A = 250).

Table 2.2: Model posterior weights for 9 nuclear mass models with the RMSE (in MeV) and the MSE improvement for the training and the testing datasets. The last three rows correspond to the averaging with the prior weights, the simplified BMA (Neufcourt et al., 2019), and the full BMA.
Table 2.3: Posterior model weights under the averaging scenarios with two (L and H; left) and three (L, H, and L+H; right) models. The weights for the full intermediate domain of nuclei and the subset of 8 randomly selected nuclei are listed.

Table 2.4: The RMSEs (in MeV) of the predictions from the 4 LDM variants as well as the values from BMA, calculated on the held-out data in the intermediate domain of even-even nuclei from AME2003.

Table 2.5: Scheme depicting the observations contained in the training dataset of the models according to the proportion of shared data. The crosses mark the values contained in the domain of each model.

Table 2.6: Summary of the domain corrected BMA analysis in the asymmetric case of the pedagogical example.

Table 2.7: Sample size breakdown for the training (AME2003) and the testing (AME2016 \ AME2003) datasets of nuclear separation energies in the Ca region according to Z and N parities.

Table 2.8: The model posterior weights, RMSE (in MeV) and MSE improvement calculated on both the training (AME2003) and the testing (AME2016 \ AME2003) datasets for 3 nuclear mass models.

Table 2.9: Summary of the domain corrected BMA analysis in the symmetric case of the pedagogical example.

Table 3.1: The specification of GPs for the simulation study.

Table 3.2: Comparison of the MSE for the simple scenario using the MH, the NUTS, and the VC algorithms.

Table 3.3: The RMSE of the VC (Algorithm 3.2) after 24 hours dedicated to running the algorithm compared with the RMSE based on the LS estimates. The parameter estimates (and their standard errors) are also displayed.

Table 3.4: The space of calibration parameters used for generating the outputs of the semi-empirical mass formula (1.1).

Table 4.1: The RMSE comparison of the empirical Bayes approach and the fully Bayesian treatment. The GP hyperparameters were estimated using Algorithm 4.1.

Table 4.2: The estimates of calibration parameters and the noise scale under each method.

Table 4.3: The RMSEs of the predictions evaluated on 145 even-even nuclei from the AME2003 dataset. The parameter estimates are also listed. The posterior means are shown in the case of the MH algorithm.

LIST OF FIGURES

Figure 1.1: Realizations of a Gaussian process with zero mean and squared exponential covariance function with η = 1 and ℓ = 0.1.

Figure 2.1: The Woods-Saxon potential and the Coulomb potential along with the training (140 observations) and the testing datasets (70 observations) generated from the mixture of the two potentials.

Figure 2.2: ECPs for the testing dataset (m = 70, A = 250).

Figure 2.3: The ECPs calculated on the independent testing dataset (AME2016 \ AME2003).
Figure 2.4: Even-even nuclei from AME2003 divided into the domains of light (Z < 40, N < 50), heavy (Z > 50, N > 80), and intermediate nuclei (remaining 155 nuclei). The subset of 8 randomly selected nuclei is also depicted (from Kejzlar et al. (2020)).

Figure 2.5: The ECPs for the four LDM variants used in our study and the averaging scenarios with two (L and H) and three models (L, H, and L+H) (from Kejzlar et al. (2020)).

Figure 2.6: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the asymmetric case. The dashed line segments represent the translated values of the original observations.

Figure 2.7: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the symmetric case. The dashed line segments represent the translated values of the original observations.

Figure 3.1: A D-vine tree representation of a copula with 5 variables.

Figure 3.2: The approximate posterior distributions for the target calibration parameters. The VC (Algorithm 3.2) was carried out using an l = 3 truncated D-vine and compared with the results from the NUTS and the MH algorithm.

Figure 3.3: The evolution of the MSE of the posterior predictive means based on the VC with cumulatively implemented variance reduction techniques described in Section 3.3.2. The figure is based on an independently generated set of 50 testing points. Time and memory demands for each of the implementations are also plotted; the VC (Algorithm 3.2) was carried out using an l = 3 truncated D-vine.

Figure 3.4: The evolution of the MSE of the posterior predictive means based on the VC (Algorithm 3.2), the MH algorithm, and the NUTS. The figure is based on an independently generated set of 200 testing points. The VC (Algorithm 3.2) was carried out using an l = 5 truncated D-vine.

Figure 3.5: Recorded memory profiles of Algorithm 3.2, the MH algorithm, and the NUTS for the duration of 1 hour under the simulation scenario.

Figure 3.6: Experimental binding energies of nuclei in the AME2003 dataset (2225 observations).

Figure 3.7: The residual plot for 225 experimental binding energies in the testing dataset.

Figure 4.1: Detail of 95% credible bands plotted at t = 0.21.

Figure 4.2: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.21.

Figure 4.3: Binding energies of even-even nuclei in the AME2003 dataset divided into a testing and a training dataset.

Figure 4.4: Detail of 95% credible bands plotted at t = 0.00.

Figure 4.5: Detail of 95% credible bands plotted at t = 0.43.
Figure 4.6: Detail of 95% credible bands plotted at t = 0.71.

Figure 4.7: Detail of 95% credible bands plotted at t = 1.00.

Figure 4.8: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.00.

Figure 4.9: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.43.

Figure 4.10: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.71.

Figure 4.11: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 1.00.

LIST OF ALGORITHMS

Algorithm 3.1: Variational calibration with truncated D-vine copulas.
Algorithm 3.2: Variational calibration with truncated D-vine copulas II.
Algorithm 3.3: Variational calibration with truncated C-vine copulas.
Algorithm 3.4: Variational calibration with truncated C-vine copulas II.
Algorithm 4.1: Empirical Bayes algorithm for predictions of physical quantities.

KEY TO ABBREVIATIONS

BMA   Bayesian model averaging
CM    Computer model
CDF   Cumulative distribution function
DFT   Density functional theory
ECP   Empirical coverage probability
EB    Empirical Bayes
EDF   Energy density functional
HPD   Highest posterior density
GP    Gaussian process
LDM   Liquid Drop Model
MC    Monte Carlo
MCMC  Markov chain Monte Carlo
MH    Metropolis-Hastings
MSE   Mean square error
NUTS  No-U-Turn sampler
PMSE  Posterior mean square error
UQ    Uncertainty quantification
RMSE  Root mean square error
SGA   Stochastic gradient ascent
SVI   Stochastic variational inference
VI    Variational inference
VBI   Variational Bayes inference

CHAPTER 1

INTRODUCTION

With the advancements of computer architectures in the 21st century, mathematical models implemented on a computer, which we shall refer to as computer models, have become the driving force behind the acceleration of the cycle of the scientific process. This is because computer models are typically much faster, safer, and more economical to run than physical experiments. For example, experiments in high energy physics are conducted in particle colliders that cost billions of dollars and can take up to a decade to build. Moreover, some physical experiments associated with rare natural events such as volcanic eruptions or earthquakes are infeasible to carry out for all practical purposes.

Computer models, despite being an extremely useful tool (Box, 1976), are an imperfect representation of physical systems. The comprehensive study of the impact of all forms of modeling errors is called uncertainty quantification (UQ). The Bayesian methodology of UQ, which is the main approach considered in this work, has been a heavily utilized statistical device due to its natural way of describing uncertainty in the language of probabilities; see Higdon et al. (2015), McDonnell et al. (2015), and King et al. (2019) for examples in nuclear physics, Sexton et al. (2012) and Pollard et al. (2016) for examples in climatology, and Williams et al. (2006), Lawrence et al. (2010), Plumlee et al. (2016), and Zhang et al. (2019) for applications in engineering, astrophysics, and medicine.
Meanwhile, the incoming era of exascale computing (systems capable of 10^18 double precision floating point operations per second) has spawned the development of complex computer models that produce massive amounts of data. This consequently creates the need to bring the computational and statistical tools of UQ into the big-data age.

1.1 Computer models and sources of uncertainty

To illustrate the various sources of uncertainty in computer models on a simple example, let us consider the 4-parameter Liquid Drop Model (LDM) (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Kirson, 2008; Benzaid et al., 2020), which is a global (across the whole nuclear chart) model of nuclear binding energies: the minimum energy needed to disassemble the nucleus of an atom into unbound protons and neutrons. In principle, the LDM treats the nucleus like a drop of incompressible fluid of very high density. Despite this simplification, the LDM makes reasonable estimates of average properties of nuclei. The LDM is formulated through the semi-empirical mass formula as

    E_B(N, Z) = a_vol A − a_surf A^{2/3} − a_sym (N − Z)^2 / A − a_C Z(Z − 1) / A^{1/3},    (1.1)

where Z is the proton number, N is the neutron number, and A = Z + N is the mass number of the nucleus. The model parameters are (a_vol, a_surf, a_sym, a_C), representing the volume, surface, symmetry, and Coulomb energy, respectively. These parameters have specific physical meaning; a_vol, for instance, is proportional to the volume of the nucleus. See Krane (1987) for more details. We can identify the following sources of uncertainty, as proposed by Kennedy and O'Hagan (2001).

Parameter uncertainty: The model is a function of fixed but unknown parameters (a_vol, a_surf, a_sym, a_C). These parameters are context specific and need to be estimated with reported standard errors. The process of model fitting is also known as calibration.

Observation error: In estimating the unknown model parameters, we will be making use of experimental data from the actual physical process. These measurements typically contain some observation error that should be accounted for.

Model inadequacy: As we already mentioned at the beginning of this chapter, computer models are an imperfect representation of physical systems. Even if we knew the true values of the model parameters, the LDM predictions would not equal the true values of the physical process. This uncertainty (error) is often interpreted as "missing physics" in the model and is differentiated from the observation error by its systematic nature.

Parametric uncertainty: Note that the LDM is a linear function of the parameter vector (a_vol, a_surf, a_sym, a_C). It is possible that one or more predictor variables are highly linearly correlated (multicollinearity) and the LDM can be reduced to a model with fewer parameters.

Model uncertainty: The LDM is not the only model of nuclear binding energies. In fact, there are many alternative and competing models. In order to conduct comprehensive UQ of the modeling framework, we should allow for this possibility.

The subsequent two sections describe Bayesian formalisms that provide statistically principled ways to account for the various sources of uncertainty described above, with the exception of the parametric uncertainty. The parametric uncertainty is not the focus of this dissertation, and we refer the reader to Jaganathen et al. (2017) and Kejzlar et al. (2020) for some examples in nuclear physics.
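To make the formula concrete, the following minimal sketch evaluates (1.1) in Python; the coefficient values are rough textbook-style numbers (in MeV) chosen purely for illustration and are not the calibrated estimates obtained later in this dissertation.

```python
# A minimal sketch of the 4-parameter Liquid Drop Model (1.1); the default coefficients
# are illustrative textbook-style values in MeV, not the calibrated estimates from this work.
def liquid_drop_energy(Z, N, a_vol=15.5, a_surf=16.8, a_sym=23.0, a_C=0.72):
    """Binding energy E_B(N, Z) in MeV from the semi-empirical mass formula (1.1)."""
    A = Z + N
    return (a_vol * A
            - a_surf * A ** (2 / 3)
            - a_sym * (N - Z) ** 2 / A
            - a_C * Z * (Z - 1) / A ** (1 / 3))

# Example: a rough binding-energy estimate for 56Fe (Z = 26, N = 30).
print(liquid_drop_energy(Z=26, N=30))
```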
1.2 Bayesian calibration of imperfect computer models

Let us consider observations y = (y_1, ..., y_n) of a physical process ζ(t_i) depending on a known set of inputs t_i ∈ Ω ⊂ R^p following the relationship

    y_i = ζ(t_i) + σ ε_i,    i = 1, ..., n,    (1.2)

where σ represents the scale of the observation error (noise) and typically ε_i ~ i.i.d. N(0, 1). Our aim is to establish statistically principled predictions of new values y* = (y*_1, ..., y*_J) of the physical process ζ at yet-to-be-observed inputs (t*_1, ..., t*_J), using y and a computer model f_m defined as a mapping (t, θ) ↦ f_m(t, θ). As we can see, the computer model depends on an additional set of inputs θ ∈ Θ ⊂ R^d that we call calibration parameters. These are considered fixed but unknown quantities common to all the observations y_i and all the instances of the physical process that we intend to predict using the calibrated computer model. The calibration parameters represent inherent properties of the physical process that cannot be directly measured or controlled in an experiment. In the most rudimentary form, one can think of the calibration parameters as parameters in standard regression problems. To this extent, we suppose the relationship between the observations y_i, the physical process ζ, and the computer model f_m proposed by Kennedy and O'Hagan (2001):

    y_i = f_m(t_i, θ) + δ(t_i) + σ ε_i,    (1.3)

where δ(t_i) represents an unknown systematic error between the computer model and the physical process. While δ(t_i) is intrinsically deterministic, a non-parametric approach using a Gaussian process prior model is typically imposed for Bayesian inference.

Definition 1. δ(t) has a Gaussian process distribution if for every i = 1, 2, 3, ... the joint distribution of δ(t_1), ..., δ(t_i) is multivariate normal. It is fully characterized by the mean function m(t) = E[δ(t)] and the covariance function k(t, t') = Cov[δ(t), δ(t')]. We write δ(t) ∼ GP(m_δ(t), k_δ(t, t')).

Gaussian processes are a convenient way of placing a distribution over a space of functions, with the covariance function characterizing the relationship of the process at different inputs. Typically, the mean function is chosen to be zero or some dense family of basis functions (wavelets, Fourier, polynomials) across the input domain: m(·) = h(·)^T β, where h(·) = (h_1(·), ..., h_p(·)) are the basis functions and β is a hyperparameter. A typical choice for the covariance function is a stationary covariance function that depends on the inputs through t − t'. For example, a Gaussian kernel covariance function (also called squared exponential or radial basis function kernel) takes the form

    k(t, t') = η^2 exp( −(1/2) (t − t')^T M (t − t') ),

where M corresponds to a positive definite diagonal matrix of hyperparameters and η is a scaling parameter. We refer to the case of M = (1/ℓ^2) I, for some ℓ > 0, as an isotropic version of the kernel, because it is invariant to rotation. The case of M with different diagonal terms is called an anisotropic version of the kernel. Other popular choices for stationary covariance functions are Matérn kernels, polynomial kernels, or exponential kernels (Rasmussen and Williams, 2006). See Figure 1.1 to visualize realizations of a Gaussian process.

Figure 1.1: Realizations of a Gaussian process with zero mean and squared exponential covariance function with η = 1 and ℓ = 0.1.
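For readers who want to reproduce draws such as those in Figure 1.1, the following minimal sketch (not the dissertation's published code) samples realizations of a zero-mean Gaussian process with the isotropic squared exponential kernel, using η = 1 and ℓ = 0.1.

```python
# A minimal sketch of drawing realizations from a zero-mean Gaussian process with the
# isotropic squared exponential kernel (eta = 1, l = 0.1), as in Figure 1.1.
import numpy as np

def sq_exp_kernel(t, t_prime, eta=1.0, l=0.1):
    """k(t, t') = eta^2 exp(-(t - t')^2 / (2 l^2)) for scalar inputs t, t'."""
    return eta ** 2 * np.exp(-0.5 * (t - t_prime) ** 2 / l ** 2)

t = np.linspace(0.0, 1.0, 200)
K = sq_exp_kernel(t[:, None], t[None, :]) + 1e-8 * np.eye(len(t))  # jitter for stability
rng = np.random.default_rng(0)
draws = rng.multivariate_normal(mean=np.zeros(len(t)), cov=K, size=3)
# draws has shape (3, 200); plotting each row against t gives Figure 1.1-style curves.
```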
It is often the case that the evaluation of the computer model f_m is too expensive in terms of both time and space (memory). It is common practice to reduce the number of necessary computer model evaluations by considering a Gaussian process prior model

    f_m(t, θ) ∼ GP(m_f(t, θ), k_f((t, θ), (t', θ'))).

In this setup, the data also include a set of model evaluations z = (z_1, ..., z_s) over a grid {(t̃_1, θ̃_1), ..., (t̃_s, θ̃_s)}. These are usually selected using some space-filling design such as a uniform or Latin hypercube design (Morris and Mitchell, 1995), which is a design that has a good coverage of the space with evenly distributed points in each one-dimensional projection. The complete data set d in the case of computationally expensive models consists of the n observations y_i from the physical process ζ and the s evaluations z_j of the computer model f_m, i.e., d = (d_1, ..., d_{n+s}) := (y, z). We shall denote the set of unknown parameters as φ = (θ, γ, σ), with γ denoting the set of hyperparameters of the Gaussian processes' mean and covariance functions. Consequently, the complete dataset d conditioned on (θ, γ, σ) follows the multivariate normal distribution

    d | θ, γ, σ ∼ N(M(θ, γ), K(θ, γ, σ)),    (1.4)

where

    M(θ, γ) = ( M_f(T_y(θ)) + M_δ(T_y),  M_f(T_z(θ̃)) )^T    (1.5)

and

    K(θ, γ, σ) = [ K_f(T_y(θ), T_y(θ)) + K_δ(T_y, T_y) + σ^2 I_n    K_f(T_y(θ), T_z(θ̃)) ]
                 [ K_f(T_z(θ̃), T_y(θ))                              K_f(T_z(θ̃), T_z(θ̃)) ].    (1.6)

Here, K_f(T_y(θ), T_y(θ)) is the matrix with (i, j) element k_f((t_i, θ), (t_j, θ)), K_δ(T_y, T_y) is the matrix with (i, j) element k_δ(t_i, t_j), and K_f(T_z(θ̃), T_z(θ̃)) is the matrix with (i, j) element k_f((t̃_i, θ̃_i), (t̃_j, θ̃_j)). We can similarly define K_f(T_y(θ), T_z(θ̃)) with the kernel k_f.

Under this framework, Bayesian calibration consists of deriving the full posterior distribution p(φ|d) given by Bayes' theorem, namely

    p(φ|d) = p(d|φ) p(φ) / ∫ p(d|φ) p(φ) dφ ∝ p(d|φ) p(φ),    (1.7)

where p(φ) expresses our prior uncertainty about the unknown parameters. The Bayesian predictions of y* are specified by the posterior predictive distribution p(y*|d). This is given by integrating the conditional density of y*, given φ and the data d, against the posterior density p(φ|d):

    p(y*|d) = ∫ p(y*|d, φ) p(φ|d) dφ.    (1.8)

The conditional density p(y*|d, φ) is a multivariate normal density given directly by the statistical model (1.3) and the specification of the Gaussian processes. We postpone the detailed description of this likelihood to Chapter 3.
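The block structure of (1.5)–(1.6) is easy to assemble in practice. The following simplified sketch does so for a one-dimensional input t, with hypothetical squared exponential kernels and zero mean functions standing in for k_f, k_δ, m_f, and m_δ; with zero means, the complete-data likelihood (1.4) is then the multivariate normal density N(0, K) evaluated at the stacked vector (y, z).

```python
# A minimal sketch (hypothetical kernels, zero mean functions, scalar input t) of assembling
# the joint covariance matrix K(theta, gamma, sigma) in (1.6) for d = (y, z): an emulator
# kernel k_f over the joint input (t, theta), a discrepancy kernel k_delta over t alone, and
# observation noise added only to the physical-data block.
import numpy as np

def sq_exp(X, Y, eta, l):
    """eta^2 exp(-||x - y||^2 / (2 l^2)) for rows of X and Y."""
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return eta ** 2 * np.exp(-0.5 * d2 / l ** 2)

def joint_covariance(t_y, theta, t_z, theta_z, sigma, eta_f=1.0, l_f=0.3, eta_d=0.5, l_d=0.5):
    """t_y: (n,), theta: (d,), t_z: (s,), theta_z: (s, d); returns the (n+s) x (n+s) matrix."""
    n = len(t_y)
    Xy = np.column_stack([t_y, np.tile(theta, (n, 1))])   # T_y(theta): observations at theta
    Xz = np.column_stack([t_z, theta_z])                  # T_z(theta_tilde): design of model runs
    Kyy = (sq_exp(Xy, Xy, eta_f, l_f)
           + sq_exp(t_y[:, None], t_y[:, None], eta_d, l_d)
           + sigma ** 2 * np.eye(n))
    Kyz = sq_exp(Xy, Xz, eta_f, l_f)
    Kzz = sq_exp(Xz, Xz, eta_f, l_f)
    return np.block([[Kyy, Kyz], [Kyz.T, Kzz]])
```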
Here we point out a few caveats of the framework described above. First, the calibration parameter θ is in general non-identifiable. Indeed, δ(t_i) = ζ(t_i) − f_m(t_i, θ) yields the same distribution for y_i for any choice of θ. Several authors have pointed this out and proposed various methods to mitigate the problem, including Brynjarsdóttir and O'Hagan (2014); Plumlee (2017); Tuo and Wu (2015, 2016); Bayarri et al. (2007). Our main goal here, nonetheless, is not the correct identification of θ but prediction. Second, the posterior distribution p(φ|d) does not have a closed form and needs to be approximated. The traditionally used Markov chain Monte Carlo (MCMC) methods that approximate p(φ|d)—such as the Metropolis-Hastings (MH) algorithm (Chib and Greenberg, 1995) or more advanced ones including Hamiltonian Monte Carlo or the No-U-Turn sampler (NUTS) (Homan and Gelman, 2014)—work only with a relatively small sample size because of the computational costs associated with the evaluation of p(d|φ). This clearly calls for the development of computationally efficient alternatives to the traditional approaches.

1.3 Bayesian model averaging

Bayesian model averaging (BMA) is the natural Bayesian framework in scenarios with several competing models M_1, ..., M_K, when one is not comfortable selecting a single model at the desired level of certainty (Bernardo and Smith, 1994; Kass and Raftery, 1995; Hoeting et al., 1999; Wasserman, 2000). The seminal review work by Geweke (1999) introduced BMA in econometrics and later in other fields such as political and social sciences; BMA has also been applied to the medical sciences (Balasubramanian et al., 2014; Schorning et al., 2016), ecology and evolution (Silvestro et al., 2014; Hooten and Hobbs, 2015), genetics (Wei et al., 2011; Wen, 2015), astrophysics (Parkinson and Liddle, 2013), fluid dynamics (Radaideh et al., 2019), machine learning (Clyde et al., 2011; Hernández et al., 2018), and lately in nuclear physics (Neufcourt et al., 2019, 2020a,b; Kejzlar et al., 2020). For any quantity of interest O, e.g., the value y*, the BMA posterior density p(O|d) corresponds to the mixture of the posterior densities of the individual models:

    p(O|d) = Σ_{k=1}^{K} p(O|d, M_k) p(M_k|d),    (1.9)

where d are the given datapoints. These datapoints are typically the observations y, unless we consider the specific scenario of computationally expensive computer models in Section 1.2, where we also include the set of model runs z. For notational consistency, we shall denote the set of datapoints as d throughout this dissertation, with the actual content of d clarified by the context in which it is considered. Formula (1.9) expresses that the actual posterior probability of a quantity of interest O is the average of O's posterior distributions given each model, weighted by the model posterior probabilities. In other words, (1.9) is simply a mixture of K distributions, which makes sampling from the BMA posterior density immediate once we obtain the posterior samples under each model. The posterior model weights p(M_k|d) are the posterior probabilities that a given model is the hypothetical true model; they are given by a simple application of Bayes' theorem:

    p(M_k|d) = p(d|M_k) p(M_k) / Σ_{ℓ=1}^{K} p(d|M_ℓ) p(M_ℓ),    (1.10)

where p(M_k) represents the prior probability that M_k is the true model. The so-called evidence (integrals) p(d|M_k) are obtained by integrating the data likelihood against the prior density of the model parameters φ_k, namely

    p(d|M_k) = ∫ p(d|φ_k, M_k) p(φ_k|M_k) dφ_k.    (1.11)

Additionally, the definition of expected value yields the posterior mean of O as

    E[O|d] = Σ_{k=1}^{K} E[O|d, M_k] p(M_k|d),    (1.12)

and the well-known conditional variance formula (Casella and Berger, 2002) yields the posterior variance of O, given d, as

    Var[O|d] = Σ_{k=1}^{K} p(M_k|d) Var[O|d, M_k] + Var[E(O|d, M)|d].    (1.13)

Note that the term Var[E(O|d, M)|d] is the variance of a function of the discrete random variable M (ranging over the set of all models being considered), which accounts for the model uncertainty. This model uncertainty is not accounted for by individual models; its inclusion thus allows for a more honest UQ.
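Because (1.9) is a finite mixture, BMA predictions can be computed directly from per-model posterior draws and the weights (1.10). The sketch below (with hypothetical inputs, not results from this work) samples a quantity O from the BMA posterior and evaluates the BMA posterior mean (1.12).

```python
# A minimal sketch of working with the BMA posterior (1.9) as a mixture: given posterior
# draws of O under each model and the posterior model weights p(M_k | d), sample O from
# the BMA posterior and compute its posterior mean (1.12). Inputs are hypothetical.
import numpy as np

def bma_sample(per_model_draws, weights, size, seed=0):
    """per_model_draws: list of 1-D arrays of posterior draws of O, one array per model."""
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    ks = rng.choice(len(per_model_draws), p=w, size=size)       # model indicator draws
    return np.array([rng.choice(per_model_draws[k]) for k in ks])

def bma_mean(per_model_draws, weights):
    """Weight-averaged posterior mean, i.e., the sample analogue of (1.12)."""
    w = np.asarray(weights, dtype=float)
    w /= w.sum()
    return float(sum(wk * np.mean(draws) for wk, draws in zip(w, per_model_draws)))
```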
One of the challenges of BMA is that it becomes unclear how one should proceed in scenarios where alternative models are defined on different subsets of the same input space. This is, for example, a usual situation in nuclear physics, for instance for nuclear mass models: ab initio (also known as A-body) models range over lighter nuclei due to contemporary computational limitations, while Energy Density Functionals (EDF) can cover the whole nuclear chart (Klupfel et al., 2009; Kortelainen et al., 2010b).

1.4 Dissertation outline

The main content of this dissertation is organized as follows. Chapter 2 provides a survey of the remaining details for successful implementation of the BMA framework, with particular focus on the calculation of the evidence integral (1.11). We perform a systematic analysis of the prediction errors, focusing on the fact that BMA is the optimal linear combination (projection) in the L2 sense under the posterior probability distribution, among all the possible mixtures of models. Motivated by recurrent scenarios in nuclear physics, we subsequently extend BMA to situations when the different models constrain different subsets of the data. Lastly, we present a set of pedagogical examples as well as real-data applications of the BMA methodology highlighting its benefits in terms of the improvement of prediction accuracy and UQ. Some results from this chapter are also provided in Kejzlar et al. (2019).

Chapter 3 presents a novel and computationally efficient algorithm based on variational Bayes inference (VBI) for the calibration of computer models with Gaussian processes. We provide both theoretical and empirical evidence for the computational scalability of our methodology and describe all the necessary details for an efficient implementation of the proposed algorithm. We demonstrate the opportunities given by our method for practitioners on a real-data example through the calibration of the Liquid Drop Model of nuclear binding energies. The algorithmic development done in this chapter is also provided in Kejzlar and Maiti (2020).

Chapter 4 develops an empirical Bayes (EB) approach for the Bayesian calibration framework outlined in Section 1.2 that can be understood as an easy-to-implement and fast approximation of the fully Bayesian treatment. Firstly, we utilize the structural convenience of Gaussian processes and restate the calibration framework as a Bayesian hierarchical model. Secondly, we make use of this new representation and extend the results of Choi and Schervish (2007a) on non-parametric regression problems to theoretically investigate the proposed EB approach. A numerical simulation study and a real-data example are also provided.

In Chapter 5, we discuss the likely future theoretical and computational extensions of the methodologies developed in Chapters 2 – 4.

For ease of readability, all proofs of lemmas, propositions, and theorems, together with the technical details of the numerical studies, are provided in the section titled "Technical details and supplementary results" at the end of the respective chapter. We also provide fully documented Python code that reproduces all the results in this dissertation and can be easily modified and used by practitioners in a public repository at https://github.com/kejzlarv.
CHAPTER 2

SURVEY OF BAYESIAN MODEL AVERAGING WITH EXAMPLES AND EXTENSION TO DISCREPANT DOMAINS

Interest in model averaging arises, as discussed in Chapter 1, in situations when several competing models are available to solve the same or a similar problem, and no single model can be selected at a desired level of certainty. For example, there is a multitude of competing computer models for numerical weather prediction, including the American model (Global Forecast System) and the European model (European Centre for Medium-Range Weather Forecasts) (Lynch, 2008). In nuclear physics, alternative models arise through different theoretical strategies in modeling atomic nuclei such as the A-body modeling approach or the density functional theory (DFT) (Nazarewicz, 2016).

In this chapter, we consider a general situation where measurements (t_i, y_i), i = 1, ..., n, of a physical process t ↦ ζ(t) are used to predict new values y* of the physical process ζ, where t ∈ Ω ⊂ R^p. Furthermore, we suppose there are K competing models M_1, ..., M_K of the observations y_i, where the kth model is parametrized by a vector of unknown parameters φ_k ∈ R^{p_k} for k = 1, ..., K and p_k ≥ 1 (e.g., for the Bayesian calibration with Gaussian processes, φ = (θ, γ, σ)). Given each model, we consider the data likelihood p(d|φ_k, M_k) and the prior density p(φ_k|M_k); the dataset d typically consists of the experimental observations y = (y_1, ..., y_n) only; however, it can also include a set of computer model evaluations z = (z_1, ..., z_s) under the Bayesian calibration framework with computationally expensive models described in Section 1.2.

BMA provides a way of accounting for the model uncertainty induced by the existence of alternative models. If O is the quantity of interest, e.g., the value y*, the BMA posterior density p(O|d) corresponds to the mixture of the posterior densities of the individual models:

    p(O|d) = Σ_{k=1}^{K} p(O|d, M_k) p(M_k|d).    (2.1)

The posterior weights p(M_k|d) are given by a simple application of Bayes' theorem:

    p(M_k|d) = p(d|M_k) p(M_k) / Σ_{ℓ=1}^{K} p(d|M_ℓ) p(M_ℓ).    (2.2)

These weights are determined by two quantities that are the key to a successful implementation of the BMA framework. First, one needs to assign suitable prior probabilities p(M_k) that M_k is the true model. Hoeting et al. (1999) note that,

    When there is little prior information about the relative plausibility of the models considered, the assumption that all models are equally likely a priori is a reasonable "neutral" choice.

One can, nevertheless, choose informative prior distributions when prior information about the plausibility of each model is available. Eliciting an informative prior is a non-trivial task, but Madigan et al. (1995) provide some guidance in the context of graphical models that can be applied in other settings as well. The second key quantity is the evidence integral

    p(d|M_k) = ∫ p(d|φ_k, M_k) p(φ_k|M_k) dφ_k.    (2.3)

The numerical evaluation of evidence integrals is challenging in practice, because a closed form solution is available only in elementary situations for exponential family distributions with conjugate priors (see Section 2.4 for a simple example), and it thus requires approximation. The simplest and most commonly used approximation in the literature, and the one we have adopted in our applications, is the Monte Carlo (MC) integration estimate

    p̂_MC(d|M_k) = (1/n_MC) Σ_i p(d|φ_k^{(i)}, M_k),    (2.4)

where the φ_k^{(i)} are i.i.d. samples from the prior p(φ_k|M_k) for i = 1, ..., n_MC.
While this MC integration yields reasonable results, it requires separate evaluations of the likelihood at new samples from the prior p(φ_k|M_k), which can be very costly in computing time.

Another frequently used method is the Laplace approximation, which relies on the fact that the integral (2.3) has a closed form in the case of a linear regression with Gaussian noise. It corresponds to a second-order Taylor expansion of the log-likelihood around its maximum, which makes the likelihood Gaussian. Namely, the Laplace approximation is

    p̂_L(d|M_k) = (2π)^{p_k/2} |Σ̃_k|^{1/2} p(d|φ̃_k, M_k) p(φ̃_k|M_k),    (2.5)

where φ̃_k is the mode of p(φ_k|d, M_k) and Σ̃_k = (−D^2 l(φ̃_k))^{−1} is the inverse of the Hessian matrix of second derivatives (evaluated at φ̃_k) of l(φ_k) = log(p(d|φ_k, M_k) p(φ_k|M_k)). The Laplace method typically gives very good results for very peaked likelihoods. We refer the reader to Kass and Raftery (1995) for an exhaustive survey of classical methods used to compute the evidence integral. Also, the more recently proposed Nested Sampling algorithm by Skilling (2006), expanded by Feroz et al. (2009), provides another alternative to these classical approaches.
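As an illustration of the estimate (2.4), the following minimal sketch averages the data likelihood over prior draws on the log scale; log_likelihood and prior_sampler are hypothetical user-supplied callables for a given model M_k, and the log-sum-exp step guards against numerical underflow of full-dataset likelihoods.

```python
# A minimal sketch of the Monte Carlo approximation (2.4) of the evidence integral (2.3):
# average the data likelihood over i.i.d. draws from the prior. The callables below are
# hypothetical placeholders for a given model M_k.
import numpy as np

def mc_log_evidence(log_likelihood, prior_sampler, n_mc=10_000):
    """Returns log of (1/n_mc) * sum_i p(d | phi_i, M_k) with phi_i ~ p(phi_k | M_k)."""
    log_liks = np.array([log_likelihood(prior_sampler()) for _ in range(n_mc)])
    m = log_liks.max()                                   # log-sum-exp to avoid underflow
    return m + np.log(np.mean(np.exp(log_liks - m)))
```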
BMA, while a conceptually straightforward and natural approach to account for model uncertainty, becomes challenging in scenarios where alternative models are defined on different subsets of the same input space; this can typically arise with local models or with numerical models with different constraints. It is also a usual situation in nuclear physics, for instance for nuclear mass models: ab initio models range over lighter nuclei due to contemporary computational limitations, while EDFs can cover the whole nuclear chart (Klupfel et al., 2009; Kortelainen et al., 2010b). This also happens when one considers mixing models produced by the calibration of observables of different types – typically some nuclear models are fitted on nuclear binding energies, while others are fitted on binding energies and other observables such as rms charge radii (a measure of the size of an atomic nucleus). Surprisingly, we have not found in the literature a principled approach to adapt BMA to this situation, or to compare models with similar, overlapping, but significantly non-identical domains. To address this "domain discrepancy", in Section 2.2 we present a method which relaxes the requirement that all models cover the same domain (i.e., that d is common to all models considered). Other applications of our framework could include time series with missing data or different time scales, e.g., in a financial setting where additionally different classes of assets can be treated as observables.

The remaining sections of this chapter are organized as follows. Section 2.1 provides a systematic analysis of prediction errors under individual models as compared to the BMA framework. Section 2.2 develops the BMA methodology for models with discrepant domains. Section 2.3 contains an extensive collection of simulation studies, pedagogical examples, and real-data applications highlighting the benefits of BMA in terms of the improvement of prediction accuracy and UQ. All technical details and supplementary results are provided in Section 2.4.

2.1 Optimality of BMA predictions

BMA is not the only way to deal with several alternative models and to account for model uncertainty, but it does have the property of reducing the posterior mean square error (PMSE) of prediction of a new observation y*. In this section, we illustrate this property in a clear and concise way. Let us, for simplicity of notation, consider two competing models M_1 and M_2; the treatment of multiple models follows from a similar argument, and our verbal descriptions below occasionally refer to the general case without further comment. Denote by ŷ*_1 := E[y*|d, M_1] and ŷ*_2 := E[y*|d, M_2] the posterior means of y* under each model, and let ŷ* := E[y*|d]. We also define p_k := p(M_k|d), k = 1, 2, for the posterior probability of each model. Thus the BMA posterior mean estimator (1.12) can be written as ŷ* = p_1 ŷ*_1 + p_2 ŷ*_2. The PMSE of y* is then defined as E[(ŷ* − y*)^2|d] and has the following decomposition.

Lemma 1. For every λ_1, λ_2 ≥ 0 satisfying λ_1 + λ_2 = 1, we have

    E[(y* − ŷ*)^2|d] = E[(y* − λ_1 ŷ*_1 − λ_2 ŷ*_2)^2|d] − [(λ_1 − p_1) ŷ*_1 + (λ_2 − p_2) ŷ*_2]^2.    (2.6)

This lemma shows explicitly that the PMSE of the BMA predictor is smaller than the PMSE associated with any convex combination λ_1 ŷ*_1 + λ_2 ŷ*_2 of the two models' posterior means. It also measures how much smaller it is, and shows that equality holds as soon as the convex coefficients λ_k are equal to the posterior probabilities p_k of each model, k = 1, 2. Specifically, by applying Lemma 1 twice, with (λ_1, λ_2) = (1, 0) and with (λ_1, λ_2) = (0, 1), we obtain the following dual expression for the PMSE of the BMA predictor, involving each individual model's PMSE and showing how much smaller the former is compared to the two latter:

    E[(y* − ŷ*_1)^2|d] − p_2^2 (ŷ*_1 − ŷ*_2)^2 = E[(y* − ŷ*)^2|d] = E[(y* − ŷ*_2)^2|d] − p_1^2 (ŷ*_1 − ŷ*_2)^2.    (2.7)

The relationship (2.7) directly implies

    E[(y* − ŷ*)^2|d] ≤ E[(y* − ŷ*_k)^2|d],    k = 1, 2.    (2.8)

This inequality clearly states that the BMA estimator (1.12) gives a prediction error at least as small as the best of the models considered, in the PMSE sense. We interpret this as a translation of the fact that each model that goes into creating the BMA estimator necessarily ignores model uncertainty. Note that this says nothing about how the BMA estimator would compare to a model not used in its definition. Moreover, since Lemma 1 covers all convex combinations of the original models, it shows that BMA achieves the following minimum:

    (p(M_k|d))_{k=1,2} = arg min_{λ ∈ [0,1]^2 : λ_1 + λ_2 = 1} E[(y* − (λ_1 ŷ*_1 + λ_2 ŷ*_2))^2|d].    (2.9)

Hence, the BMA estimator is actually optimal over all convex combinations of the individual estimators ŷ*_1 and ŷ*_2. The optimality of BMA can also be established from a decision-theoretic perspective; see Chapter 6 in Bernardo and Smith (1994) for details. We can also express the reduction of the PMSE for the BMA estimator, compared to the best (lowest) PMSE among all of the individual models, as

    r^2_BMA := 1 − E[(ŷ* − y*)^2|d] / min_k E[(ŷ*_k − y*)^2|d],    k = 1, ..., K.    (2.10)

In the specific case of two competing models, if we assume for instance that the "best" model is M_2, we can obtain an even more explicit expression for r^2_BMA, which provides the relative gain attained by BMA, namely

    r^2_BMA = p(M_1|d)^2 (ŷ*_1 − ŷ*_2)^2 / E[(ŷ*_2 − y*)^2|d].    (2.11)

Below in Section 2.3, we denote the sample version of the expression in (2.10) as r̂^2_BMA, which we will use to evaluate the performance of BMA quantitatively.
To finish this section, we decompose the quantity E[(ŷ* − y*)^2|d] against the residuals (ŷ*_k − y*), k = 1, 2, from each individual model, assuming p_1, p_2 > 0. This is easily done by symmetrizing formula (2.7), reintroducing y* to identify these residuals, and then taking another conditional expectation with respect to d to avoid an expression which depends on unobserved data. We obtain

    E[(y* − ŷ*)^2|d] = (p_1 − p_1^2) E[(y* − ŷ*_1)^2|d] + (p_2 − p_2^2) E[(y* − ŷ*_2)^2|d] − (p_1^2 + p_2^2) E[(ŷ*_1 − y*)(y* − ŷ*_2)|d].    (2.12)

Formula (2.12) shows that the PMSE of the BMA estimator is an explicit linear combination of the prediction errors of the estimators for each constituent model, but that one must subtract a coupling correction term on the right-hand side of (2.12). It is interesting to note that the weights in the aforementioned linear combination can be interpreted as the variances of Bernoulli random variables with the posterior model probabilities p_1 and p_2 as their success probabilities. Also note that, since these variances satisfy p_k − p_k^2 < p_k, the linear combination is not convex, but is smaller. The correction term is not necessarily a subtraction of a positive term, but it is likely to be when both individual models have significant biases in opposite directions for the prediction of y*. This is particularly interesting when the two models have similar posterior performances. Both values of p_k will in this situation be close to 1/2, which minimizes the values of p_k − p_k^2 for both k = 1, 2. This is a scenario where using BMA will significantly improve prediction errors even when each model is competitive compared to the other, regardless of how large the individual models' biases are, and without knowing in which direction they go, as long as the two models are assumed to have significant defects that work in opposite directions.

A sanity check reveals an interesting characteristic of BMA: suppose that p_1 = 1, so that the BMA estimate is given by ŷ* = ŷ*_1. According to (2.12), we must have E[(y* − ŷ*)^2|d] = 0, and further E[(y* − ŷ*)^2] = 0, i.e., y* = ŷ* = ŷ*_1 a.s. given d; in other words, model 1 must provide a perfect description of the reality.
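The identities above are straightforward to verify numerically. The following toy sketch (synthetic posterior draws, not data from this dissertation) checks the dual expression (2.7) by treating the BMA posterior of y* as the mixture of two model posteriors with weights p_1 and p_2.

```python
# A minimal numerical check of the dual PMSE identity (2.7), using synthetic posterior
# draws in a toy two-model setting; all numbers below are arbitrary illustrations.
import numpy as np

rng = np.random.default_rng(1)
p1, p2 = 0.3, 0.7                                   # posterior model weights
s1 = rng.normal(2.0, 1.0, size=200_000)             # draws of y* | d, M1
s2 = rng.normal(3.5, 0.8, size=200_000)             # draws of y* | d, M2
yhat1, yhat2 = s1.mean(), s2.mean()
yhat = p1 * yhat1 + p2 * yhat2                      # BMA posterior mean (1.12)

mix = np.where(rng.random(s1.size) < p1, s1, s2)    # draws from the BMA posterior of y* | d
pmse_bma = np.mean((mix - yhat) ** 2)
pmse_1 = np.mean((mix - yhat1) ** 2)                # PMSE of model 1's posterior mean
pmse_2 = np.mean((mix - yhat2) ** 2)                # PMSE of model 2's posterior mean

gap2 = (yhat1 - yhat2) ** 2
# All three printed quantities agree up to Monte Carlo error, as (2.7) asserts.
print(pmse_bma, pmse_1 - p2 ** 2 * gap2, pmse_2 - p1 ** 2 * gap2)
```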
2.2 BMA with discrepant domains

Let us continue the discussion about BMA of models with similar, overlapping, but significantly non-identical domains from the beginning of this chapter in a formal setting. Let us consider two models M_A and M_B, which we will also denote by (A) and (B) or merely A and B for simplicity, and assume that they are respectively defined only on different strict subsets t^{(A)} and t^{(B)} of the data. We denote by d^{(A)} and d^{(B)} the corresponding subsets of the data d, as well as by d^{(−A)} and d^{(−B)} their respective complements in d. The actual Bayesian evidences for these models are the probabilities p(d|A) and p(d|B), but these quantities are not clearly defined. On the other hand, p(d^{(A)}|A) and p(d^{(B)}|B), where each model refers only to its original range of validity, are the evidences of the models corresponding to the classical BMA theory described in Section 1.3 and at the beginning of this chapter. Nevertheless, we have the following expansion:

    p(d|A) = p(d^{(A)}, d^{(−A)}|A) = p(d^{(A)}|A) p(d^{(−A)}|d^{(A)}, A).    (2.13)

This expression means that to obtain model (A)'s actual Bayesian evidence, p(d^{(A)}|A) must be multiplied by a corrective factor p(d^{(−A)}|d^{(A)}, A), which represents the information one has on d^{(−A)} assuming that model (A) holds and that it does not provide any prediction at the data points in d^{(−A)}. Note that the distribution p(d|A) is meaningful only to the extent that d – and thus d^{(−A)} – is measurable in the underlying probability space, which implies the existence of underlying distributions p(d^{(−A)}) and subsequently of p(d^{(−A)}|d^{(A)}) and p(d^{(−A)}|d^{(A)}, A). To that extent, the problem of averaging models with different domains can be ill posed if these distributions cannot be defined convincingly.

If the data d^{(A)} and d^{(−A)} are independent, conditionally on model (A), in other words, if no information can be gleaned about d^{(−A)} from d^{(A)} or from (A), i.e., d^{(−A)} is unconstrained by (A) and by d^{(A)}, then it is legitimate to ignore the aforementioned correction factor, which should be p(d^{(−A)}|d^{(A)}, A) = 1. In particular, this is the case if, given model (A), d^{(−A)} is considered deterministically equal to its sample value. Conversely, setting the corrective factor to p(d^{(−A)}|d^{(A)}, A) = 1 outside of this scope is an approximation to that extent, and not in general a fair evaluation of the information contained in the "globality" of a model. We shall refer to this case as BMA with independent model domains. Although it has been adopted as a natural matter of convenience, it raises serious concerns, for which we cannot find better words than Trotta's assessment (Trotta, 2008):

    On the other hand, it is important to notice that the Bayesian evidence does not penalize models with parameters that are unconstrained by the data. It is easy to see that unmeasured parameters (i.e. parameters whose posterior is equal to the prior) do not contribute to the evidence integral, and hence model comparison does not act against them, awaiting better data.

Let us point out an extreme situation that occurs when model (A) predicts values d^{(−A)} that have no physical meaning; e.g., in the case of nuclear mass models, this can be the mass of a nucleus which a model predicts not to exist, and therefore the mass has no physical meaning. In this case, the model (A) is actually strongly constrained by d^{(−A)}, to the point that p(d^{(−A)}|A) = 0, yielding p(d^{(−A)}|d^{(A)}, A) = 0, which rules the model (A) impossible as long as d^{(−A)} is not empty.

Another tempting option is to restrict the domain of interest to the domain common to all models and simply consider p(d^{(A)∩(B)}|A) and p(d^{(A)∩(B)}|B), which can be obtained in a standard way according to (2.3). As we ignore even more data, this approach is arguably worse than setting p(d^{(−A)}|d^{(A)}, A) = 1.

Let us illustrate how the assumption of independent model domains, namely setting p(d^{(−A)}|d^{(A)}, A) = 1, can fail to provide a satisfactory ranking of models in two examples where a model takes a shortcut by "refusing" to predict challenging points.

Scenario 1. Consider the situation where one model M_0 is empty, so that p(M_0|d) ∝ p(M_0). On the other hand, any other model which constrains any part of the data will have an evidence most likely lower than 1, which implies that the model will end up with a lower posterior weight when starting from equal prior weights. Thus any predictive model will be deemed inferior to a non-predictive one.

Scenario 2.
Take two deterministic models A and B with input space (domain of t) {a, b}; assume model A has deviation 0 at location a and 10^99 at location b, and that model B has deviation 1.001 at location a but does not predict anything at location b. One can easily adjust the numbers to reach an extreme situation (e.g., making A's prediction at location b extremely poor) where model B ends up with a much higher Bayes evidence than model A, while common sense, by which no prediction is a form of extremely poor prediction, would always imply that model A is better than model B.

These examples show how important it is to acknowledge that a model's inability to make predictions at some locations is not a neutral property. The classical BMA approach offers no trade-off: a model withholding its predictions at the most difficult points will always improve its weight. We now introduce our "domain-corrected BMA", where we amend the model weights to account more fairly for the (in)ability of a model to provide predictions at locations of interest.

2.2.1 Two models

Starting from (2.13), instead of setting p(d^{(−A)}|d^{(A)}, A) = 1, which removes the effect of a model's domain in its posterior weights, we propose the weaker assumption that p(d^{(−A)}|d^{(A)}, A) is independent of the model, i.e., we assume

    p(d^{(−A)}|d^{(A)}, A) = p(d^{(−A)}|d^{(A)}).    (2.14)

This is quite natural if we consider that model (A) implies a distribution p(d^{(A)}|A) but provides no information on d^{(−A)}, leaving d^{(−A)} unconstrained by (A) (see the introduction of this section). The evidence p(d|A) is now given by

    p(d|A) ∝_A p(d^{(A)}|A) p(d^{(−A)}|d^{(A)}).    (2.15)

Our assumption d^{(A)} ∪ d^{(B)} = d implies that d^{(−A)} can only be informed by (B). Hence

    p(d^{(−A)}|d^{(A)}) = p(d^{(−A)}|d^{(A)}, B) = p(d^{(−A)}|d^{(A)∩(B)}, B),    (2.16)

which can be written as an explicit integral with respect to model (B)'s parameter φ_B,

    ∫ p(d^{(−A)}|d^{(A)∩(B)}, φ_B, B) p(φ_B|d^{(A)∩(B)}, B) dφ_B.    (2.17)

To approximate (2.17), one can use the same approximation methods as in the case of the classical evidence integral (see the beginning of this section for more details).

2.2.2 K models

In the general case, each model M_k constrains a subset d^{(k)} of the data d (for k = 1, ..., K); as in the case of two models, d^{(−k)} denotes the complement subset of d^{(k)} in d. We also introduce d^{(∩)} := ∩_k d^{(k)} as the set of data common to all individual models. Moreover, we assume that d = ∪_k d^{(k)}, i.e., every datapoint is covered by at least one model. We also assume, up to taking equivalence classes on models (see Section 2.4.3 for details), that for each pair of models there exists a chain of models joining them where each model M_k shares a data point in its domain d^{(k)} with each of its neighbours. Relying on the same principles described in Section 2.2.1, we set

    p(d^{(−k)}|d^{(k)}, M_k) = p(d^{(−k)}|d^{(k)}),    (2.18)

which leads to model posterior probabilities of the form

    p(M_k|d) ∝_k p(d^{(−k)}|d^{(k)}) p(M_k|d^{(k)}).    (2.19)

Compared to the two-model case, the computation of the corrective factors poses an additional difficulty: when there is more than one model constraining d^{(−k)}, the factor p(d^{(−k)}|d^{(k)}) is no longer equal to a single p(d^{(−k)}|d^{(k)}, M_k), but rather to an average over all the models constraining d^{(−k)}.
Hence our domain-corrected BMA corresponds to the intermediate solution where one replaces the factors of the likelihood corresponding to the missing model predictions by a geometric average of the likelihoods over the models which do produce predictions, based on the predictive models' posterior weights. We have found that similar ideas have been developed in the broader framework of evidence theory (Park and Grandhi, 2012, Section 2.2). The notation for a given corrective factor can become cumbersome when model domains have very general intersections, but these corrective factors can still be computed recursively rather than directly. We relegate the calculations of the general case to Section 2.4.3.

2.3 Examples and applications

To illustrate the methodology described in this chapter, we present several examples in which BMA leads to a reduction in prediction error and improved UQ. Our first example is a simple yet sensible scenario of averaging two different models of proton potentials. The second example is an application of the BMA methodology to state-of-the-art nuclear mass models and nuclear mass data. The third example is a BMA study of the LDM (1.1) published by Kejzlar et al. (2020). Lastly, we provide a pedagogical application of model averaging to a synthetic dataset which highlights the interest of the domain-corrected BMA. The predictive improvement is measured in the examples as a relative reduction in the mean square error (MSE), a sample version of (2.10), which we denote as r̂2_BMA. As a measure of UQ fidelity, we consider what is known as the empirical coverage probability (ECP) (Gneiting et al., 2007; Gneiting and Raftery, 2007). Formally, it can be written as

η(α) := (1/J) Σ_{i=1}^{J} 1{y∗_i ∈ Iα(t∗_i)}, (2.20)

where 1 is the indicator function, Iα(t∗_i) is the α-credibility interval produced by a given model at a new input t∗_i, and the y∗_i are the (new) testing data. The ECP represents the proportion of a model's predictions of independent testing points falling into the respective credibility intervals. This quantity is typically plotted against the credibility level α to form a so-called ECP line (e.g., Figure 2.2). This line should theoretically follow the diagonal so that the actual fidelity of the interval corresponds to the nominal value. If the respective ECP line falls above the reference, the credible intervals produced by a given model are too wide (UQ is conservative). Naturally, a model with an ECP line below the reference underestimates the uncertainty of predictions (UQ is liberal). While values of the empirical proportions close to the reference curve are desirable, it is preferable to be conservative rather than liberal: overly narrow credible intervals declare a higher level of assurance than is warranted. Each of the examples in this section looks at a situation with several competing models without any prior knowledge of which is better; thus we set the prior model weights to be uniform over the model space. All the posterior samples were computed using the NUTS. The evidence integrals were approximated using MC integration. All the credible intervals discussed are the highest posterior density (HPD) credible intervals. Given a credibility level α, the α-HPD of a scalar quantity consists of the minimum-width interval containing an α proportion of its MCMC posterior samples. Lastly, some of the supplementary results and modeling details are relegated to Section 2.4.4.
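To make the ECP computation in (2.20) concrete, the following sketch (Python with NumPy) evaluates the empirical coverage of posterior predictive samples at a grid of credibility levels. The array names are illustrative, and the sketch uses equal-tailed quantile intervals rather than the HPD intervals employed in this chapter, so it should be read as a schematic rather than the exact implementation.

import numpy as np

def ecp(post_pred_samples, y_test, alphas):
    """Empirical coverage probability, cf. (2.20).

    post_pred_samples : array of shape (M, J); M posterior predictive draws
                        at each of the J testing inputs t*_i.
    y_test            : array of shape (J,); the testing observations y*_i.
    alphas            : credibility levels at which the ECP line is evaluated.
    """
    coverage = []
    for alpha in alphas:
        lo = np.quantile(post_pred_samples, (1 - alpha) / 2, axis=0)
        hi = np.quantile(post_pred_samples, (1 + alpha) / 2, axis=0)
        # proportion of testing points falling inside their alpha-credibility interval
        coverage.append(np.mean((y_test >= lo) & (y_test <= hi)))
    return np.array(coverage)

# ECP line: plot ecp(...) against the credibility levels and compare to the diagonal.
alphas = np.linspace(0.05, 0.95, 19)

Coverage above the diagonal then signals conservative UQ and coverage below signals overly narrow intervals, matching the discussion above.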
2.3.1 Averaging of proton potentials

In this first example we demonstrate the potential of BMA to improve both prediction accuracy and honesty of UQ in a favorable situation where we average two models associated with different proton potentials. We consider two single-proton potentials describing the average interaction acting on a proton within the spatial range of a nucleus; namely, the Woods-Saxon (WS) potential V1 and the Coulomb potential V2, representing respectively the strong nuclear forces between nucleons (protons and neutrons) and the electromagnetic interactions between protons. For a given nucleus, which we will take with proton and neutron numbers Z = 100 and N = 150 and mass number A = 250, they can be expressed as

V1(r) = −VWS / (1 + e^{(r − RA)/a}), (2.21)
V2(r) = −VC Z / r. (2.22)

Here, VWS = 50 MeV, VC = 0.5 MeV fm, and a = 0.5 are fixed parameters, and RA = 1.25 × A^{1/3} fm is the radius of the nucleus of interest. These two models for energy potentials have the interesting property that both are non-decreasing and vanish at infinity, albeit at different rates, and can correspond to two phenomena with different length scales. As a matter of fact, the strong interactions described by the WS potential are confined to the volume of atomic nuclei (a few fm; 1 fm = 10^{−15} m), i.e. they are short-ranged; in contrast, the electrostatic ones are long-ranged, i.e. they act on much larger length scales (> 10^{−10} m) and compete with the strong interactions in superheavy elements, causing the so-called Coulomb frustration (see Nazarewicz (2018)). This fact is reproduced in our example, where we also expect that V1 should be well constrained by a dataset of stable nuclei, while V2 should play an important role in the description of short-lived superheavy nuclei. More generally, we have in mind a scenario where two models have been developed for different subsets of an input domain and are in competition on some common intermediate domain. Both of these modeling approaches are equally confident that they prevail on the intermediate domain, while the truth is somewhere in between. This situation is quite realistic despite its simplicity, and we can reasonably expect model mixing to have positive outcomes. We simulate the experimental data {(ri, yi)}_{i=1}^{n} at different spatial locations ri, relatively far from the nucleus (r > RA), following a mixture of the two models. Namely,

yi = (1 − ω)V1(ri) + ωV2(ri) + εi, (2.23)

where the εi are standard normal errors, and we take ω = 1/2. Note that in reality, observations of the potentials are not available as such, but can be inferred indirectly from experimental nucleonic densities measured in nucleon scattering experiments (Anni et al., 1995). In particular, we drew a dataset of 210 observations generated according to the model (2.23) with the locations ri sampled uniformly over (RA, 10).

Figure 2.1: The Woods-Saxon potential and the Coulomb potential along with the training (140 observations) and the testing datasets (70 observations) generated from the mixture of the two potentials.

We further randomly divided the data into a training dataset of 140 observations and kept the remaining 70 observations for testing (see Figure 2.1). The two statistical models M1 and M2 considered here are given by the respective energy potentials in (2.23), obtained with ω = 0 and ω = 1, and additive independent experimental errors distributed according to N(0, σj) for j = 1, 2.
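A minimal simulation sketch of this setup is given below (Python with NumPy). The variable names, the random seed, and the exact random draws are illustrative assumptions; only the functional forms (2.21)-(2.23), the sampling range, and the 140/70 split follow the description above.

import numpy as np

rng = np.random.default_rng(0)

Z, A = 100, 250
V_WS, V_C, a = 50.0, 0.5, 0.5          # MeV, MeV fm, and the surface diffuseness
R_A = 1.25 * A ** (1 / 3)              # fm

def V1(r):   # Woods-Saxon potential (2.21)
    return -V_WS / (1.0 + np.exp((r - R_A) / a))

def V2(r):   # Coulomb potential (2.22)
    return -V_C * Z / r

# Mixture data (2.23) with omega = 1/2 and standard normal errors
omega, n = 0.5, 210
r = rng.uniform(R_A, 10.0, size=n)
y = (1 - omega) * V1(r) + omega * V2(r) + rng.standard_normal(n)

# 140 training / 70 testing split, as in Figure 2.1
idx = rng.permutation(n)
train, test = idx[:140], idx[140:]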
The prior distributions for the standard deviations σj were taken to be the non-informative Inv-Gamma(1, 30). Table 2.1 shows the estimated root MSE (RMSE) for the testing dataset. We can see that this simple example gives significantly better predictions under the BMA posterior mean predictor than under each of the models individually. This is, of course, not a surprise and shows that BMA behaves as expected.

Model    RMSE    P(Mk|y)   r̂2_BMA
M1       3.540   0.512     0.930
M2       3.607   0.488     0.933
M_BMA    0.935   -         -

Table 2.1: RMSE (in MeV) and the improvement under the BMA posterior mean predictor calculated on the testing dataset (n = 70, A = 250).

More interesting results can be seen from the angle of the quality of the predictions' UQ in Figure 2.2. In contrast with the individual models, the ECP of the BMA posterior predictions closely matches the reference line and provides evidence that accounting for model uncertainty leads to the desired more honest UQ.

Figure 2.2: ECPs for the testing dataset (m = 70, A = 250).

2.3.2 Averaging of nuclear mass emulators in the Ca region

An important challenge in nuclear structure is to produce quantified predictions of nuclear observables, such as nuclear masses (McDonnell et al., 2015), for all possible pairs (Z, N) of proton numbers Z and neutron numbers N which can be bound together in a nucleus. Such predictions are of direct interest to guide future nuclear experiments or to feed astrophysical calculations for the abundance of elements in the universe. The underlying astrophysical processes, such as the rapid neutron capture which produces heavy elements in stellar environments (Horowitz et al., 2019), take place far from the region of nuclear stability, where no experimental measurements are available, and these observables have to be extracted from extreme extrapolations of theoretical nuclear models. In their recent work, Neufcourt et al. (2019) used GPs (see Section 1.2 for definition) to model the discrepancies between the experimental data and the theoretical calculations for several nuclear models based on the DFT, and obtained quantified extrapolations for nuclear masses in the Calcium region (at the frontier between experimental and theoretical limits). They computed a simplified BMA of 9 global mass models (Bartel et al., 1982; Dobaczewski et al., 1984; Chabanat et al., 1995; Klüpfel et al., 2009; Kortelainen et al., 2010a, 2012, 2014) listed in Table 2.2, defined across the full nuclear landscape from the light to the superheavy nuclei and thus suitable for extrapolations. Their weights, proportional to p(y∗ > 0|y, Mk), are based on each model's probability to assign a positive separation energy y∗ to a testing set of nuclei which have been experimentally observed after 2003, thus independent from the training set of measured neutron separation energies y (the separation energy is the energy needed to remove a neutron or proton from an atomic nucleus). Here, we compare the results of Neufcourt et al. (2019) to the full BMA analysis with model weights given by their posterior probabilities p(Mk|y). Note that all the physical models are taken here as calibrated and their parameter estimation is not part of our analysis.
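The BMA quantities reported throughout these examples combine the individual models' posterior predictive draws with their posterior weights. A schematic version of that combination is sketched below (Python with NumPy); the array names are hypothetical, and it is assumed that each model's posterior predictive samples at the testing inputs have already been produced by the sampler.

import numpy as np

def bma_mixture_samples(samples_by_model, weights, rng=np.random.default_rng(0)):
    """Draw from the BMA posterior predictive mixture.

    samples_by_model : list of arrays, each of shape (M, J), with M posterior
                       predictive draws for one model at J testing inputs.
    weights          : posterior model probabilities p(Mk|d), summing to one.
    """
    M, J = samples_by_model[0].shape
    # for each stored draw, first pick a model according to its posterior weight
    picks = rng.choice(len(samples_by_model), size=M, p=weights)
    return np.stack([samples_by_model[k][m] for m, k in enumerate(picks)])

def bma_posterior_mean(samples_by_model, weights):
    # BMA posterior mean predictor: weighted average of the models' posterior means
    return sum(w * s.mean(axis=0) for w, s in zip(weights, samples_by_model))

def rmse(pred, y_test):
    return float(np.sqrt(np.mean((pred - y_test) ** 2)))

# An r2_BMA-type improvement over model k can then be computed as
# 1 - rmse(bma_pred, y_test)**2 / rmse(pred_k, y_test)**2.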
We consider the same training dataset of one-neutron (S1n) and two-neutron (S2n) sep- aration energies AME2003 (Audi et al., 2003) restricted to the calcium (Ca) region on the nuclear landscape with Z ≥ 14 and N ≤ 22 (n = 139). The predictive performances of each model augmented with the GP model for systematic discrepancies and the BMA pos- terior mean predictor are evaluated on both the training dataset and a testing dataset of 26 new measurements in AME2016 (n = 14) that we denote as AME2016 \ AME2003 (Wang et al., 2017). The predictive performances of each model augmented with a GP model for systematic discrepancies and the BMA posterior mean predictor are evaluated on both the training dataset and a testing dataset of new measurements in AME2016 (n = 14) (Wang et al., 2017). Similarly to Neufcourt et al. (2019), we calculate the model posterior prob- abilities independently over four non-overlapping nuclear domains according to the parity of numbers Z and N with uniform prior distribution over the model space. We assess the performance of BMA using the MSE improvement and the ECP. These were combined over odd and even parities of numbers Z and N in order to mitigate the relatively small size of each parity subset. The GP model specification and the sample sizes breakdown based on the parity of Z and N are given in Section 2.4.4. Model posterior weights S1n (odd N) S2n (even N) even Z odd Z even Z odd Z RMSE (cid:98)r2 Training Errors BM A RMSE (cid:98)r2 Testing 0.000 0.000 0.000 0.000 0.000 0.845 0.002 0.153 0.000 0.000 0.000 0.000 0.000 0.009 0.669 0.013 0.308 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.902 0.098 0.008 0.000 0.000 0.001 0.000 0.089 0.125 0.310 0.467 0.076 0.127 0.142 0.107 0.136 0.110 0.109 0.114 0.146 0.110 0.118 0.105 - 0.308 0.449 0.023 0.400 0.077 0.058 0.149 0.477 0.045 0.110 - 0.713 0.989 0.924 0.840 0.809 0.550 0.806 0.808 0.806 0.641 0.680 0.591 BM A 0.313 0.642 0.591 0.505 0.466 - 0.462 0.465 0.463 0.078 0.131 - Model SLy4 SkP SkM* SV-min UNEDF0 UNEDF1 UNEDF2 FRDM-2012 HFB-24 MBM A(prior) MBM A(simple) MBM A Table 2.2: Model posterior weights for 9 nuclear mass models with the RMSE (in MeV) and the MSE improvement for the training and the testing datasets. The last three rows correspond to the averaging with the prior weights, the simplified BMA (Neufcourt et al., 2019), and the full BMA. Table 2.2 presents the resulting posterior weights of the models, as well as the RMSE and the MSE improvement for both averaging procedures. The predictions based on the full BMA (MBM A) outperform the simplified method of Neufcourt et al. (2019) (MBM A(simple)) by 27 11% on the training dataset and 13% on the testing one, as measured by(cid:98)r2 BM A. The lowest RMSE on the training dataset was attained by SLy4 and UNEDF1 respectively for AME2016 \ AME2003. This result should not discourage practitioner from using BMA posterior mean predictors, because the BMA methodology outlined in this paper allows for existence of a “best” model for a particular data domain. However, such a model does not account for modeling uncertainty whereas BMA does, and therefore the BMA posterior mean estimator performs consistently well irrespective of the dataset. In fact it attains the second lowest RMSE on both AME2003 and AME2016 \ AME2003. Moreover, if we consider only a subset of the whole model space, the BMA attains the lowest RMSE. See Table 2.8 in Section 2.4.4 for the results with a restricted model space. Figure 2.3 shows the ECP of the averaged nuclear mass emulators. 
While it is not clear that the BMA has an improved ECP compared to each individual model, its ECP is certainly significantly better than that of the worst models and comparable to the models with the highest fidelity.

Figure 2.3: The ECPs calculated on the independent testing dataset (AME2016 \ AME2003).

2.3.3 Averaging of the Liquid Drop Model variants

In our second real-data example, we demonstrate the opportunities in nuclear theory offered by BMA through averaging of the LDM that has been optimized on various subsets of the nuclear domain. In the context of the following discussion, it is useful to clarify the notion of a "model". In this specific scenario, by model we understand the combination of the algebraic model formula, the dataset used for its parameter determination, and a statistical model that describes the error structure. The parameters of the LDM are (avol, asurf, asym, aC), representing the volume, surface, symmetry and Coulomb energy, respectively. Because of its linearity and simplicity, the LDM has become a popular model for various statistical applications (Bertsch et al., 2005; Toivanen et al., 2008; Utama et al., 2016; Yuan, 2016; Bertsch and Bingham, 2017; Zhang et al., 2017; Cauchois et al., 2018; Shelley et al., 2014; Pastore, 2019). To study the impact of the fitting domain on the prediction accuracy and UQ fidelity of nuclear mass models, we shall consider the experimental binding energies of 595 even-even nuclei of AME2003 (meaning both Z and N are even) divided into 3 domains according to Figure 2.4. Namely, we define the domain of light nuclei with Z < 40 and N < 50, heavy nuclei with Z > 50 and N > 80, and the intermediate domain DI consisting of the remaining even-even nuclei. To keep some of our results within computable ranges we will also consider 8 randomly selected nuclei in the central subset of the intermediate domain, which we will denote DC. By dividing nuclear domains according to A, we are trying to simulate the current theoretical strategy in modeling atomic nuclei: light nuclei are often described by different classes of models than heavy nuclei, with the intermediate domain being the testing ground for all approaches (Nazarewicz, 2016). Here we use, for testing, the same LDM expression in all domains. The models are distinguished merely by the fitting datasets. In terms of these separated data domains, we consider four LDM variants fitted on specific regions of the nuclear landscape:

Figure 2.4: Even-even nuclei from AME2003 divided into the domains of light (Z < 40, N < 50), heavy (Z > 50, N > 80), and intermediate nuclei (remaining 155 nuclei). The subset of 8 randomly selected nuclei is also depicted (From Kejzlar et al. (2020)).

(i) LDM(A) – LDM fitted on all 595 even-even nuclei.
(ii) LDM(L) – LDM restricted to the light domain (153 nuclei).
(iii) LDM(H) – LDM restricted to the heavy domain (287 nuclei).
(iv) LDM(L + H) – LDM fitted on both the light and heavy domains (440 nuclei).

We emphasize that the intermediate domain DI (and DC) is not used for training in variants (ii)-(iv), but kept aside as an independent testing domain where the different LDM variants compete. Thus we use the binding energies in the intermediate domain to evaluate the predictions and error bounds of these variants and their Bayesian averages.
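For concreteness, the domain definitions above can be written directly as masks over the even-even AME2003 table, and an LDM-type prediction can be evaluated for any (Z, N). The sketch below (Python with NumPy) uses one common four-term liquid-drop parameterization as a stand-in for fm(t, θ); the precise algebraic convention of (1.1) is stated in Chapter 1 and may differ (e.g. in signs or in the exact Coulomb term), so the function here is an illustrative assumption, while the domain masks follow the definitions given above exactly.

import numpy as np

def ldm_binding_energy(Z, N, theta):
    """A common liquid-drop form with volume, surface, symmetry, and Coulomb
    terms; an assumption standing in for the LDM expression (1.1)."""
    a_vol, a_surf, a_sym, a_C = theta
    A = Z + N
    return (a_vol * A
            - a_surf * A ** (2 / 3)
            - a_sym * (N - Z) ** 2 / A
            - a_C * Z ** 2 / A ** (1 / 3))

def domains(Z, N):
    """Domain masks for even-even nuclei, with Z and N given as arrays."""
    light = (Z < 40) & (N < 50)
    heavy = (Z > 50) & (N > 80)
    intermediate = ~light & ~heavy
    return light, heavy, intermediate

# LDM(L), LDM(H), and LDM(L+H) are then obtained by fitting the statistical
# model of the next paragraph on the corresponding masks, while the
# intermediate mask is held out for testing.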
In short, this setup is designed to produce a scenario where two models, which have been optimized on their respective domains, compete to explain the data on a third disconnected domain. Our statistical model for the binding energies yi is the standard

yi = fm(ti, θ) + σεi, (2.24)

where the function fm(t, θ) represents the LDM prediction (1.1) with a given parameter vector θ = (avol, asurf, asym, aC) for a nucleus indexed by t = (Z, N). The errors are modeled as independent standard normal random variables εi with mean zero and unit variance, scaled by σ. For the LDM parameters avol, asurf and asym we use independent normal prior distributions N(0, 100) with mean 0 and standard deviation 100, while for aC we take N(0, 2). For σ we assume a gamma prior distribution Gamma(5, 2) with shape parameter 5 and scale parameter 2. These are chosen to be weakly informative, i.e., distributions whose hyperparameters are chosen to ensure that the prior distribution spans a much wider domain than the resulting posterior. Since the parameter estimation is not the topic of this study, we refer the reader to Kejzlar et al. (2020) for more details about the posterior distributions of these parameters. In this example, we wish to assign a model's weight according to its predictive ability and also to avoid overfitting, in the same spirit as the approach implemented in Neufcourt et al. (2019, 2020a,b). To this end, we evaluate the evidence integrals over a set of binding energies y∗ from the intermediate domain of Figure 2.4, which corresponds to integrating the posterior distribution of new predictions against the posterior distribution of the model parameters:

p(y∗|y, Mk) = ∫ p(y∗|y, θk, σk, Mk) p(θk, σk|y, Mk) dθk dσk. (2.25)

Given that the posterior distribution of the parameters reflects the true distribution of the parameters more accurately than the prior, (2.25) more accurately represents the probability that Mk can explain the data. To assess the impact of the number of evidence datapoints, we evaluate the evidence integrals both on the full intermediate domain DI and on the smaller central domain DC. The integral (2.25) can be estimated using MC integration as

p̂(y∗|y, Mk) = (1/nMC) Σ_{i=1}^{nMC} p(y∗|y, θk^{(i)}, σk^{(i)}, Mk), (2.26)

where (θk^{(i)}, σk^{(i)}) are samples from the posterior distribution p(θk, σk|y, Mk). Table 2.3 shows the posterior weights obtained under the averaging scenarios with two (L and H) and three (L, H, and L+H) models. The corresponding RMSE values for the individual models and the BMA posterior mean predictors are listed in Table 2.4.

               DI                            DC
               BMA(L,H)   BMA(L,H,L+H)       BMA(L,H)   BMA(L,H,L+H)
LDM(L)         0.000      0.000              0.008      0.002
LDM(H)         1.000      0.000              0.992      0.255
LDM(L+H)       -          1.000              -          0.743

Table 2.3: Posterior model weights under the averaging scenarios with two (L and H; left) and three (L, H, and L+H; right) models. The weights for the full intermediate domain of nuclei and the subset of 8 randomly selected nuclei are listed.

As expected, model (H) is selected in the two-model variant, and the (L+H) variant dominates when it is included – this is true for both sets of evidence datasets DC and DI. This is consistent with the RMSE of these models. It shall be emphasized that BMA performs a model selection in the two-model variant, where the RMSEs of the competing models are very different, and model averaging in the three-model variant, where the RMSEs of (H) and (L+H) are close enough.
Table 2.4 also shows how the RMSE of the BMA predictions compare with that of the individual models. In the two-model setup, BMA is very much like (H) and it has a similar RMSE. In the three-model setup, BMA performs much better than the worst model and very close to the best of the averaged models. When computed on the full test domain DI , the RMSEs are systematically smaller for BMA than for all the individual models involved in the averaging (not considering LDM(A)). One may notice that the RMSE of BMA(L, H, L+H) is, perhaps unexpectedly, slightly worse than that of 32 LDM(L+H) on the small domain DC. However, these values are based merely on 8 data points and should be viewed as a crude estimate of true predictive performance. DI DC Mk BMA(L,H) BMA(L,H,L+H) Mk BMA(L,H) BMA(L,H,L+H) LDM(A) LDM(L) LDM(H) LDM(L+H) 3.206 8.176 3.811 3.351 3.810 3.223 1.930 6.825 3.292 1.881 3.300 1.926 Table 2.4: The RMSEs (in MeV) of the predictions from the 4 LDM variants as well as the values from BMA, calculated on the held-out data in the intermediate domain of even-even nuclei from AME2003. Figure 2.5: The ECPs for the four LDM variants used in our study and the averaging scenarios with two (L and H) and three models (L, H, and L+H) (From Kejzlar et al. (2020)). Similarly to all the previous examples in Chapter 2, we also evaluate the models from UQ 33 0.00.20.40.60.81.0Credibility level0.00.20.40.60.81.0Empirical coverageReferenceLDM(L)LDM(H)LDM(L+H)LDM(A)BMA(L,H)BMA(L,H,L+H) quality perspective using the ECP curves. Figure 2.5 shows that the LDM variants fitted to the smaller domains (L or H) tend to underestimate the uncertainty of the predicted binding energies compared to the rather conservative UQ of the (L+H) variant and the LDM fitted to the entire AME2003 dataset. On the other hand, BMA(L,H,L+H) yields an ECP superior to all the LDM variants, including LDM(A), which aligns with our hypothesis that meaningful averaging can lead to an improved UQ. 2.3.4 Averaging of models with discrepant domains: a pedagogical example In this example we study a simulated scenario where two models with t-dynamics of the same order act in the opposite directions. We consider these models to be the realizations of GPs with means mi(t) = αit2 + θi, i ∈ {1, 2}, (2.27) where α1 = 0.5 and α2 = −0.5, and θ1 and θ2 represent unknown parameters to be estimated. The covariance function used for the GPs is squared exponential kernel − (t−t(cid:48))2 ki(t, t(cid:48)) = η2 2(cid:96)2 i e i , i ∈ {1, 2}. (2.28) The prior distributions for the unknown parameters (θi, ηi, (cid:96)i) are listed in Section 2.4.4. Overall, the two statistical models M1 and M2 considered here are given by the respective GPs and additive independent errors distributed according to N (0, σj) for j = 1, 2. The two means (2.27) emulate a natural scenario of competition between models, similar to the proton potential example above, where we are uncertain about the nature of the physical law and resort to BMA in order to account for this uncertainty. To do so, we consider a synthetic dataset y of 18 observations drawn independently from N (0, 10−3) at input points t = {±k, k = 1, 2, . . . 9}. Additionally, we study the impact of the domain correction by assigning a different training dataset y(k) to the models M1 and M2, using seven different scenarios with proportions of shared observations (Dshared) ranging from 20% 34 to 80% according to the scheme in Table 2.5. 
Note the break of symmetry in the domain of y(k) denoted by a circle, we shall refer to those as symmetric and asymmetric scenarios. For each value of Dshared, we carried out the domain-corrected procedure detailed in Section 2.2 and computed the evidence integrals p(y(k)|Mk) as well as the corrective terms p(y(−k)|y(k)). Also note that the approximate computation of these terms (2.17) is more demanding than the computation of the evidence integrals, because it requires integration against the posterior distribution of parameters. Training dataset y(k) 3 2 4 5 6 7 8 9 x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ -9 -8 -7 -6 -5 -4 -3 -2 -1 1 x x x x x x x x x x Dshared Model 0.2 M1 M2 0.3 M1 ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ M2 0.4 M1 M2 0.5 M1 M2 0.6 M1 M2 0.7 M1 M2 0.8 M1 M2 x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x x ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ ⊗ Table 2.5: Scheme depicting the observations contained in the training dataset of the models according to the proportion of shared data. The crosses mark the values contained in the domain of each model. Table 2.6 gives a quantitative summary of the simulation results in the asymmetric sce- nario, where the impact of the domain correction is stronger. See Table 2.9 in Section 2.4.4 for the symmetric case, where the impact of the domain correction is minor due to the sym- metry of training data and the response functions. The RMSE was calculated based on the set of common observations (t ≤ 5). BM A(Q) and BM A(Q0) represent respectively the domain corrected BMA and the BMA with independent model domains. Q denotes the pos- terior odds ratio p(y(−1)|y(1))p(M1|y(1))/[p(y(−2)|y(2))p(M2|y(2))] used to draw samples 35 Dshared Model RMSE p(y(k)|Mk) p(y(−k)|y(k)) Q0 Q (cid:98)r2 3.33 4.69 4.68 M1 M2 MBM A(Q0) 4.53 MBM A(Q) M1 M2 MBM A(Q0) 4.29 MBM A(Q) M1 M2 MBM A(Q0) 3.54 MBM A(Q) 4.36 3.62 4.63 4.38 3.23 2.78 - - 2.69 · 10−21 2.13 · 10−16 7.78 · 10−20 9.25 · 10−18 - - 7.79 · 10−20 5.44 · 10−13 3.29 · 10−18 2.12 · 10−14 - - 3.23 · 10−18 1.13 · 10−8 1.45 · 10−16 3.49 · 10−10 - - - - - - 0.3 0.5 0.7 0.03 0.80 0.02 0.61 0.02 0.72 BM A 0.495 0.494 - - 0.512 0.456 - - 0.593 0.410 - - Table 2.6: Summary of the domain corrected BMA analysis in the asymmetric case of the pedagogical example. from the mixture distribution (2.1) and Q0 is the ratio p(M1|y(1))/p(M2|y(2)). The MSE improvement(cid:98)r2 BM A is with respect to the BMA with domain correction. As expected from our construction, BMA leads to a spectacular decrease of the MSE by about 50%. The BMA posterior mean predictor outperforms consistently the individual models, at all proportions of the shared training data. As the overlap between the two model domains increases, the RMSEs consistently decrease. The same observations hold in the symmetric case. The domain corrected BMA has consistently lower RMSE than the BMA with independent model domains across Dshared. We observe that the values of the corrective factors increase exponentially towards 1 as Dshared increases; indeed the extreme case Dshared = 1, where both models are defined on the same domain, corresponds to the classical BMA framework. 
The odds ratios stay expectedly close to 1, due to the fact that the deviations from the out-of-domain data are comparable across the models; still, the domain-corrected odds ratio Q has a consistently larger variability than Q0, and the difference vanishes as the proportion Dshared of data shared between the two models increases.

2.4 Technical details and supplementary results

2.4.1 A simple example of evidence integral with closed form solution

Let us suppose the following set of K models of experimental observations (ti, yi)_{i=1}^{n}:

yi = fk(ti) + σk εi, k = 1, . . . , K,

where the fk(t) are known deterministic functions, the εi are independent identically distributed standard normal random variables, and σk^2 ∼ Inv-Gamma(αk, βk). We can calculate the evidence integrals (2.3) explicitly as

p(y|Mk) = ∫_0^∞ (2πσk^2)^{−n/2} exp( −Σ_i (y(ti) − fk(ti))^2 / (2σk^2) ) (βk^{αk}/Γ(αk)) (σk^2)^{−(αk+1)} exp( −βk/σk^2 ) dσk^2
        = (2π)^{−n/2} (βk^{αk}/Γ(αk)) ∫_0^∞ (σk^2)^{−(n/2 + αk + 1)} exp( −[ (1/2) Σ_i (y(ti) − fk(ti))^2 + βk ] / σk^2 ) dσk^2
        = (2π)^{−n/2} βk^{αk} Γ(n/2 + αk) / [ Γ(αk) ( (1/2) Σ_i (y(ti) − fk(ti))^2 + βk )^{n/2 + αk} ].

2.4.2 Proofs

Proof of Lemma 1. First, the standard factorization identities give the following expression:

(y∗ − ŷ∗)^2 − (y∗ − λ1 ŷ∗_1 − λ2 ŷ∗_2)^2 = [2y∗ − (λ1 + p1) ŷ∗_1 − (λ2 + p2) ŷ∗_2][(λ1 − p1) ŷ∗_1 + (λ2 − p2) ŷ∗_2].

To get the result of the Lemma, we now take the expectation of the expression above conditioned on d and notice that the right-hand side is, with the exception of y∗, d-measurable:

E[(y∗ − ŷ∗)^2|d] − E[(y∗ − λ1 ŷ∗_1 − λ2 ŷ∗_2)^2|d] = [(p1 − λ1) ŷ∗_1 + (p2 − λ2) ŷ∗_2][(λ1 − p1) ŷ∗_1 + (λ2 − p2) ŷ∗_2].

2.4.3 Supplement for the general case of K models

Let us consider a dataset d and K models M1, . . . , MK, and assume that each model Mk is defined on a subset t(k) of the data inputs. Denote also d(k) the subset of d corresponding to the inputs t(k) and d(−k) the complementary subset, as well as d(∩) := ∩_k d(k). Suppose that all data locations are in the domain of at least one model, so that d = ∪_k d(k). Note that if the datasets are disjoint, there is simply no basis to compare the models. Given a set of models, one can define a unique minimal equivalence relationship ⋆ on the models (i.e. with a maximal number of equivalence classes) satisfying M ⋆ M′ if M and M′ share at least one data point, i.e. M ⋆ M′ if and only if there exist r ≥ 0 and a sequence of models M =: M1, M2, . . . , Mr := M′ such that Mi and Mi+1 have a common data point for each 0 ≤ i < r. The computation of the posterior weights of the models can then be done within each class of equivalence, and we will therefore assume that there is only one such equivalence class. In the standard BMA where all models share d, one can express the posterior probabilities of the models p(Mk|d) using the Bayes formula

p(Mk|d) ∝k p(d|Mk) p(Mk) (2.29)

and estimate the evidence integral p(d|Mk) as at the beginning of this chapter. In our situation, however, the model Mk provides an expression for p(d(k)|Mk) instead of p(d|Mk), so that the standard procedure cannot be applied without a further argument. Starting from (2.29), we expand p(d|Mk) similarly to the two-model case (2.13) as

p(d|Mk) = p(d(k), d(−k)|Mk) = p(d(−k)|d(k), Mk) p(d(k)|Mk).
(2.30) Instead of setting p(d(−k)|d(k),Mk) = 1 which advantages models that withhold their pre- dictions at difficult locations (see the example scenarios and discussion in Section 2.2), our domain-corrected BMA estimates p(d(−k)|d(k),Mk) = p(d(−k)|d(k)). (2.31) 38 This yields the evidence and the posterior weights given respectively by p(d|Mk) = p(d(−k)|d(k))p(d(k)|Mk) p(Mk|d) ∝k p(d(−k)|d(k))p(d(k)|Mk)p(Mk), (2.32) (2.33) similarly to the two-model case. All that is left now is to evaluate the p(d(−k)|d(k)). Let S be the set of q indices of the models that constrain d(−k), we can compute p(d(−k)|d(k)) by conditioning with respect to the models with indices in S. Namely, p(d(−k)|d(k)) = p(d(−k)|d(k),∪q 1 (cid:80) l∈S p(Ml, d(k)) (cid:80) l∈S p(Ml|d(k)) 1 = = (cid:88) l [M = Ml; l ∈ S]) (cid:88) l∈S p(d(−k)|d(k),Ml)p(Ml, d(k)) p(d(−k)|d(k),Ml)p(Ml|d(k)) l∈S The simplest case is when d(−k) is non-divisible, in the sense that for every l ∈ S, we have d(−k) ⊂ d(l) or d(−k) ∩ d(l) = ∅. Then p(d(−k)|d(k),Ml) is given by (2.17) and the sum above have explicit expressions. In the general case, some models may be defined only on a strict subset of y(−k). In that case we have p(d(−k)|d(k),Ml) = p(d(−k)∩(l), d(−k)∩(−l)|d(k),Ml) = p(d(−k)∩(l)|d(k),Ml)p(d(−k)∩(−l)|d(k), d(−k)∩(l),Ml) = p(d(−k)∩(l)|d(k)∩(l),Ml)p(d(−k)∩(−l)|d(l), d(k)∩(−l)) The first term can be explicitly computed as (2.17) and p(d(−k)∩(−l)|d(l), d(k)∩(−l)) can be computed recursively. For practical purposes, it is important to notice that the complexity of the underlying algorithm is at most exponential in the number of models, where each iteration requires the computation of a posterior predictive distributions of decreasing subsets of the data given decreasing subsets of the data, posterior model weights given decreasing subset of the data, and N computations of corrective likelihoods, where N is the number of non-divisible subsets. 39 2.4.4 Supplement for the examples and applications Averaging of nuclear mass emulators in the Ca region. In this real data application, we follow the experimental framework of Neufcourt et al. (2019). Given a (known) theoret- ical nuclear model fm(t) for the one- and two-neutron separation energies, we consider the relationship between the experimental observations yi and the nuclear model as yi = fm(t) + δ(t), for t := (Z, N ) ranging over the two-dimensional nuclear domain. We model the systematic discrepancy δ as the GP δ(Z, N ) ∼ GP(0, kη,(cid:96){(Z, N ), (Z(cid:48), N(cid:48))}), with the mean 0 and the quadratic exponential covariance kernel with three parameters kη,(cid:96){(Z, N ), (Z(cid:48), N(cid:48))} = η2e 2(cid:96)2 Z − (Z−Z(cid:48))2 − (N−N(cid:48))2 2(cid:96)2 N , with independent gamma prior distributions with shape and scale parameters η, (cid:96)Z , (cid:96)N ∼ Gamma(a, b), where b = 1 and a respectively set to 0.8, 0.5 and 1.8. Note that this corresponds to the framework of Bayesian calibration of imperfect computer models described in Section 1.2. The only difference is that here we consider models that were already calibrated, and we don’t explicitly model the experimental error (the is common practice in the nuclear physics community since the experimental error is negligible compared with the systematic error for state-of-the-art models). See supplemental material to Neufcourt et al. (2019) for exhaustive description of the framework. Averaging of models with discrepant domains: a pedagogical example. 
Table 2.9 gives a quantitative summary of the simulation results in the symmetric scenario. Figure 2.6 40 Sample Size S1n (odd N) S2n (even N) Dataset even Z odd Z even Z odd Z AME2003 41 AME2016 \ AME2003 3 28 5 31 3 39 3 Table 2.7: Sample size breakdown for the training (AME2003) and the testing (AME2016 \ AME2003) datasets of nuclear separation energies in the Ca region according to Z and N parities. Model posterior weights Errors Model S1n (odd N) S2n (even N) even Z odd Z even Z odd Z RMSE (cid:98)r2 Training BM A RMSE (cid:98)r2 Testing SkM* 0.000 FRDM-2012 1.000 HFB-24 0.000 MBM A 0.001 0.997 0.002 0.000 0.900 0.100 0.000 0.399 0.601 0.142 0.114 0.146 0.112 0.375 0.031 0.405 - 0.925 0.808 0.806 0.709 BM A 0.413 0.231 0.227 - Table 2.8: The model posterior weights, RMSE (in MeV) and MSE improvement calculated on both the training (AME2003) and the testing (AME2016 \ AME2003) datasets for 3 nuclear mass models. (asymmetric design) and Figure 2.7 (symmetric design) show the posterior mean predictions for M1, M2, domain corrected BMA MBM A(Q), and BMA with independent model do- mains MBM A(Q0). These were obtained for the pedagogical example 2.3.4 using the domain correction developed in Section 2.2. The RMSE for both BM A(Q0) and BM A(Q) is almost identical here (up to a roundoff error) due to the symmetric nature of both the training dataset y(k) and the response functions. The prior distributions used in the example were θi ∼ N (0, 1), i ∼ Inv-Gamma(10, 1), σ2 (cid:96)i ∼ Gamma(1, 10), ηi ∼ Inv-Gamma(10, 1), 41 where both the gamma and the inverse gamma distributions are parametrized in terms of the shape and the rate parameters for i ∈ {1, 2}. Dshared Model RMSE p(y(k)|Mk) p(y(−k)|y(k)) Q0 Q (cid:98)r2 3.28 4.69 4.58 4.64 4.53 M1 M2 MBM A(Q0) 3.28 MBM A(Q) M1 M2 MBM A(Q0) 3.24 MBM A(Q) M1 M2 MBM A(Q0) 3.07 MBM A(Q) M1 M2 MBM A(Q0) 2.53 MBM A(Q) 3.61 3.56 4.37 4.33 3.25 3.08 2.53 - - - - 2.78 · 10−21 1.98 · 10−19 2.73 · 10−21 2.11 · 10−19 - - 7.99 · 10−20 4.33 · 10−16 7.95 · 10−20 3.96 · 10−16 - - 3.32 · 10−18 8.59 · 10−12 3.29 · 10−18 7.84 · 10−12 - - 1.45 · 10−16 2.99 · 10−6 1.42 · 10−16 2.98 · 10−6 - - - - - - 0.2 0.4 0.6 0.8 1.02 0.96 1.01 1.10 1.01 1.11 1.02 1.03 BM A 0.512 0.488 - - 0.511 0.486 - - 0.504 0.495 - - 0.509 0.495 - - Table 2.9: Summary of the domain corrected BMA analysis in the symmetric case of the pedagogical example. 42 Figure 2.6: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the asymmetric case. The dashed line segments represent the translated values of the original observations. 43 4202420020Dshared=0.34202420020Dshared=0.54202420020Dshared=0.7BMA(Q0)BMA(Q)12y* Figure 2.7: Posterior mean predictions (with 68% HPD credible intervals) for the 10 observations y for the two models in (2.27) as well as their BMA, with the domain correction and with the assumption of independent model domains. This is the symmetric case. The dashed line segments represent the translated values of the original observations. 
CHAPTER 3

AN EFFICIENT ALGORITHM FOR BAYESIAN CALIBRATION OF COMPUTER MODELS VIA VARIATIONAL INFERENCE

With the ever-growing access to high performance computing in scientific communities, the use of computational models proliferates to solve complex problems in many scientific applications such as nuclear physics and climate research. An important class of such problems is making predictions, in order to aid the cycle of the scientific process. In particular, our task is to establish statistically principled predictions of new values y∗ of a physical process ζ using a computer model fm and a set of observations y = (y1, . . . , yn) from this process. We would also like to account for various sources of uncertainty associated with individual models (see Section 1.1 for a detailed discussion on UQ of computer models). The general framework that we shall follow, and which allows for predictions with UQ, is called Bayesian calibration. It was originally developed by Kennedy and O'Hagan (2001), with extensions provided by Higdon et al. (2005, 2008); Bayarri et al. (2007); Plumlee (2017, 2019); Gu and Wang (2018) and Xie and Xu (2020), to name a few. Formally, let y = (y1, . . . , yn) be observations of a physical process ζ(ti) depending on a known set of inputs ti ∈ Ω ⊂ Rp. Assume that yi follows

yi = ζ(ti) + σεi, (3.1)

where σ represents the scale of the observation errors εi, which are i.i.d. N(0, 1). As a mathematical description of ζ, we consider a computer model fm defined as the mapping (t, θ) ↦ fm(t, θ), which depends on an additional set of inputs θ ∈ Θ ⊂ Rd that we call calibration parameters. These are fixed but unknown quantities representing fundamental properties of the physical process that cannot be directly measured or controlled in an experiment. We assume a single value of the calibration parameter θ to be common among all the observations yi and all the future instances of the physical process. As we discussed in Chapter 1, a computer model is an imperfect description of reality, and there often exists some systematic discrepancy (error) between the model and the physical process. To this extent, we assume that ζ satisfies ζ(t) = fm(t, θ) + δ(t), where δ(t) is the systematic discrepancy of the model whose form is generally unknown. The complete statistical model then reads as

yi = fm(ti, θ) + δ(ti) + σεi. (3.2)

The systematic discrepancy is modeled non-parametrically using a Gaussian process (GP) with the mean function mδ(t) and the covariance function kδ(t, t′):

δ(t) ∼ GP(mδ(t), kδ(t, t′)). (3.3)

The definition of a GP with examples is provided in Section 1.2. In addition to the computer model being imperfect, it is often too expensive in terms of both computational time and memory to be used directly for inference. A common remedy is to consider the computer model as a realization of a GP with the mean function mf(t, θ) and the covariance function kf((t, θ), (t′, θ′)):

fm(t, θ) ∼ GP(mf(t, θ), kf((t, θ), (t′, θ′))). (3.4)

In this situation, we generate an additional synthetic dataset of model runs z = (z1, . . . , zs) over a fixed grid of inputs {(t̃1, θ̃1), . . . , (t̃s, θ̃s)} selected using a space-filling design such as a uniform or Latin hypercube design (Morris and Mitchell, 1995).
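As an illustration of the space-filling design step, the following sketch uses SciPy's quasi-Monte Carlo module to draw a Latin hypercube design over scaled input and calibration-parameter ranges. The number of runs, the dimensions, and the bounds are placeholders rather than values used in this chapter; any comparable space-filling design would serve the same purpose.

import numpy as np
from scipy.stats import qmc

s = 200                          # number of computer model runs z_1, ..., z_s
p_dim, d_dim = 1, 2              # dim(t) and dim(theta), placeholders

sampler = qmc.LatinHypercube(d=p_dim + d_dim, seed=1)
design01 = sampler.random(n=s)   # s points in the unit hypercube

# rescale to the physical input range Omega and the parameter range Theta
lower = np.array([0.0, -1.0, -1.0])      # placeholder bounds
upper = np.array([1.0,  1.0,  1.0])
design = qmc.scale(design01, lower, upper)

t_tilde, theta_tilde = design[:, :p_dim], design[:, p_dim:]
# z_j = f_m(t_tilde_j, theta_tilde_j) are then evaluated once, offline.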
The complete dataset d therefore consists of n observations yi from the physical process ζ and s evaluations zj of the computer model fm, i.e. d = (d1, . . . , dn+s) := (y, z), and follows the multivariate normal distribution d|φ ∼ N (M (φ), K(φ)), (3.5) where φ = (θ, γ, σ) is the set of all unknown parameters with γ denoting the set of hyper- parameters of the GPs’ mean and covariance functions. M (φ) (1.5) is the mean vector and K(φ) (1.6) is the covariance matrix given by the GPs’ specifications. 46 Under this framework, the Bayesian predictions of y∗ are given by the posterior predictive distribution p(y∗|d), namely (cid:90) p(y∗|d) = p(y∗|d, φ)p(φ|d) dφ. (3.6) The conditional density p(y∗|d, φ) is a multivariate normal density given by the statistical model (1.3) and the specification of GPs (the explicit form is provided in Section 3.4). The posterior distribution of the unknown parameters p(φ|d) is given by the Bayes’ theorem p(φ|d) = . (3.7) (cid:82) p(d|φ)p(φ) dφ p(d|φ)p(φ) The term “calibration” in the Bayesian paradigm includes both an estimation of φ and a full evaluation of uncertainty for every parameter under a prior uncertainty expressed by p(φ). It is also worth noting that the posterior predictive density is rarely computed directly from (3.6). Instead, we first generate samples φ(1), . . . , φ(M ) from p(φ|d) and then obtain samples y∗(1), . . . , y∗(M ) so that y∗(i) ∼ p(y∗|d, φ(i)), i = 1, . . . , M . The posterior predictive density is then approximated using the empirical density of samples y∗(1), . . . , y∗(M ). As a consequence of this simple two-step algorithm, we are interested in effective sampling (approximation) from the posterior distribution p(φ|d). This becomes quickly infeasible with the increasing size of datasets, number of parameters, and model complexity. Traditional MCMC methods that approximate p(φ|d)—such as the MH algorithm or more advanced ones including the Hamiltonian Monte Carlo or the NUTS—typically fail because of the computational costs associated with the evaluation of p(d|φ). The conventional approaches to scalable Bayesian inference are in general not applicable here because of the highly corre- lated structure of K(φ) or the nature of calibration itself. Indeed, parallelization of MCMC (Neiswanger et al., 2014) works in the case of and independent d, and GP approximation methods are developed in the context of regression problems (Qui˜nonero-Candela and Ras- mussen, 2005; Titsias, 2009; Bauer et al., 2016). This chapter presents a scalable and statistically principled approach to Bayesian calibra- tion of computer models. We offer an alternative approximation to posterior densities using 47 variational Bayesian inference (VBI), which originated as a machine learning algorithm that approximates a target density through optimization. Statisticians and computer scientists (starting with Peterson and Anderson (1987); Jordan et al. (1999)) have been widely using variational techniques because they tend to be faster and easier to scale to massive datasets. Moreover, the recently published frequentist consistency of variational Bayes by Wang and Blei (2019) established VBI as a theoretically valid procedure. The scalability of VBI in modern applications hinges on the efficiency of stochastic optimization in scenarios with independent data points. This efficiency, however, diminishes in the case of Bayesian cali- bration of computer models due to the dependence structure in data (Robbins and Monro, 1951; Hoffman et al., 2013). 
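To make the cost of evaluating p(d|φ) explicit, the sketch below assembles one common form of M(φ) and K(φ) under the Kennedy-O'Hagan specification above and evaluates a single Gaussian log-likelihood. The exact expressions (1.5)-(1.6) are given in Chapter 1, so the squared exponential kernels and the zero-mean GPs used here are illustrative assumptions. The Cholesky factorization makes the O((n + s)^3) scaling, which drives the scalability discussion, visible.

import numpy as np
from scipy.linalg import cho_factor, cho_solve

def sqexp(x1, x2, eta, ell):
    """Squared exponential kernel (illustrative stand-in for k_f and k_delta)."""
    sq = np.sum((x1[:, None, :] - x2[None, :, :]) ** 2 / ell ** 2, axis=-1)
    return eta ** 2 * np.exp(-0.5 * sq)

def log_likelihood(y, z, t, t_tilde, theta_tilde, theta,
                   eta_f, ell_f, eta_d, ell_d, sigma):
    """log p(d|phi) for d = (y, z), assuming zero-mean GPs for brevity."""
    xy = np.column_stack([t, np.tile(theta, (len(t), 1))])   # inputs (t_i, theta)
    xz = np.column_stack([t_tilde, theta_tilde])              # design (t~_j, theta~_j)
    Kyy = (sqexp(xy, xy, eta_f, ell_f) + sqexp(t, t, eta_d, ell_d)
           + sigma ** 2 * np.eye(len(y)))
    Kzz = sqexp(xz, xz, eta_f, ell_f)
    Kyz = sqexp(xy, xz, eta_f, ell_f)
    K = np.block([[Kyy, Kyz], [Kyz.T, Kzz]])                  # K(phi), (n+s) x (n+s)
    data = np.concatenate([y, z])
    c, low = cho_factor(K)                                    # O((n+s)^3): the bottleneck
    quad = data @ cho_solve((c, low), data)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))
    return -0.5 * (quad + logdet + len(data) * np.log(2 * np.pi))

Every MCMC step or naive gradient evaluation repeats this full factorization, which is what the decomposition introduced next avoids.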
To maintain the speed and scalability of VBI, we adopt a pairwise decomposition of data likelihood using vine copulas that separate the information on a dependence structure in data from their marginal distributions (Cooke and Kurowicka, 2006). Our specific contributions are as follows: 1. We propose a novel version of the black-box variational inference (Ranganath et al., 2014) for Bayesian calibration of computer models that preserves the efficiency of stochastic optimization in a scenario with dependent data. 2. We implement the Rao-Blackwellization, control variates, and importance sampling to reduce the variance of noisy gradient estimates involved in our algorithm. 3. We provide both theoretical and empirical evidence for scalability of our methodology and establish its superiority over the MH algorithm and the NUTS both in terms of time efficiency and memory requirements. 4. Finally, we demonstrate the opportunities in UQ given by the proposed algorithm on a real-word example in the field of nuclear physics. The rest of this chapter is organized as follows. In Section 3.1, we give a general overview of VBI. In Section 3.2, we derive our proposed VBI approach to perform an inexpensive and 48 scalable calibration. We establish statistical validity of the method and provide theoretical justification for its scalability. Subsequently, in Section 3.3, we discuss the implementation details with focus on strategies to reduce the variance of the gradient estimators that are at the center of stochastic optimization for VBI. Section 3.4 presents a simulation study comparing our approach with state-of-the-art methods to approximate posterior distribution and illustrates our method on a real-data application. All technical details, proofs, and supplementary results are provided in section 3.5. 3.1 Variational Bayes inference VBI is an optimization based method that approximates p(φ|d) by a family of distribu- tions q(φ|λ) over latent variables with its own variational parameter λ. Many commonly used families exist with the simplest mean-field family assuming independence of all the components in φ; see Wainwright and Jordan (2008); Hoffman and Blei (2015); Ranganath et al. (2016); Tran et al. (2015, 2017) for examples of more sophisticated families. The approximate distribution q∗ is chosen to satisfy q∗ = arg min q(φ|λ) KL(q(φ|λ)||p(φ|d)). (3.8) Here, KL denotes the Kullback-Leibler divergence of q(φ|λ) from p(φ|d). Finding q∗ is done in practice by maximizing the evidence lower bound (ELBO) (cid:20) L(λ) = Eq (cid:21) log p(d|φ) − KL(q(φ|λ)||p(φ)), (3.9) which is a sum of the expected data log-likelihood log p(d|φ) and the KL divergence between the combined prior distribution p(φ) of calibration parameters, the error scale σ, and GP hy- perparameters and the variational distribution q(φ|λ). Note that we set L(λ) := L(q(φ|λ)) for the ease of notation. Minimizing the ELBO is equivalent to minimizing the original 49 objective function. Indeed, (cid:20) (cid:18) KL(q(φ|λ)||p(φ|d)) = Eq = − log q(φ|λ) (cid:20) − Eq (cid:21) Eq log p(d|φ) − KL(q(φ|λ)||p(φ)) + log p(d). (cid:21) (cid:20) (cid:21) log p(φ|d) (cid:19) The ELBO can be optimized via the standard coordinate- or gradient-ascent methods. These techniques are inefficient for large datasets, because we must optimize the variational parameters globally for the whole dataset. Instead, it has become common practice to use a stochastic gradient ascent (SGA) algorithm, which Hoffman et al. (2013) named “stochastic variational inference” (SVI). 
Similarly to the traditional gradient ascent, SGA updates λ at the t-th iteration with

λ_{t+1} ← λ_t + ρ_t l̃(λ_t). (3.10)

Here, l̃(λ) is a realization of the random variable L̃(λ), so that E(L̃(λ)) = ∇λL(λ). Ranganath et al. (2014) showed that the gradient of the ELBO with respect to the variational parameter λ can be written as

∇λL(λ) = Eq[ ∇λ log q(φ|λ) ( log p(d|φ) − log( q(φ|λ)/p(φ) ) ) ], (3.11)

where ∇λ log q(φ|λ) is the gradient of the variational log-likelihood with respect to λ. SGA converges to a local maximum of L(λ) (global for L(λ) concave (Bottou et al., 1997)) when the learning rate ρ_t follows the Robbins-Monro conditions (Robbins and Monro, 1951)

Σ_{t=1}^{∞} ρ_t = ∞,   Σ_{t=1}^{∞} ρ_t^2 < ∞. (3.12)

The bottleneck in the computation of the gradient ∇λL(λ) is the evaluation of the log-likelihood log p(d|φ), which makes the traditional gradient methods as hard to scale as MCMC methods. SGA algorithms address this challenge. If we consider N independent observations di ∼ p(di|φ), then we can define a noisy estimate of the gradient ∇λL(λ) as

L̃(λ) := N Eq[ ∇λ log q(φ|λ) log p(dI|φ) ] − Eq[ ∇λ log q(φ|λ) log( q(φ|λ)/p(φ) ) ], (3.13)

where I ∼ U(1, . . . , N), so that E(L̃(λ)) = ∇λL(λ). Each update of λ computes the likelihood for only one observation di at a time, which makes SVI scalable for large datasets. One can easily see that, under the framework for Bayesian calibration, E(L̃(λ)) ≠ ∇λL(λ), and that the corresponding SVI does not scale (the noisy estimates are biased).

3.2 Variational calibration of computer models

In this section, we derive the algorithm for a scalable variational inference approach to Bayesian computer model calibration. The first step is finding a convenient decomposition of the likelihood p(d|φ) that allows for an unbiased stochastic estimate of the gradient ∇λL(λ) that depends only on a small subset of the data. Multivariate copulas, and specifically their pairwise construction which we shall introduce below, provide such a decomposition. We are not the first ones to use copulas in the context of VBI. For instance, Tran et al. (2015) and Smith et al. (2020) proposed a multivariate copula as a possible variational family. However, we are the first ones using copulas in the context of computer model calibration implemented via VBI.

3.2.1 Multivariate copulas and likelihood decomposition

Fundamentally, a copula separates the information on the dependence structure of N > 1 random variables X1, . . . , XN from their marginal distributions. Let us assume, for simplicity, that the marginal cumulative distribution functions (CDFs) F1, . . . , FN are continuous and possess the inverse functions F1^{−1}, . . . , FN^{−1}. It follows from the probability integral transform that Ui := Fi(Xi) ∼ U(0, 1) and conversely that Xi = Fi^{−1}(Ui). With this in mind, we have

P(X1 ≤ F1^{−1}(x1), . . . , XN ≤ FN^{−1}(xN)) = P(U1 ≤ x1, . . . , UN ≤ xN) := C(x1, . . . , xN).

The function C is a distribution with support on [0, 1]^N, uniform marginals, and is called a copula. Under the above assumptions, a one-to-one correspondence exists between copula
, FN and the joint distribution function F , there exists a unique copula C such that for all x = (x1, . . . , xN )T ∈ Rn: F (x1, . . . , xN ) = C(F1(x1), . . . , Fn(xN )). Conversely, given the CDFs F1, . . . , FN and a copula C, F defined through C(F1(x1), . . . , Fn(xN )) is an N-variate distribution function with marginals F1, . . . , FN . Consequently, one can write the joint probability density function (pdf) f of X = (X1, . . . , XN )T as f (x1, . . . , xN ) = c(F1(x1), . . . , Fn(xN )) N(cid:89) i=1 fi(xi), (3.14) where c represents the copula density and fi is the marginal pdf of Xi. The key reason for considering copulas is that one can decompose the N -dimensional copula density c into a product of bivariate copulas. The starting point for this construction is a recursive decomposition of the density f into a product of conditional densities N(cid:89) f (x1, . . . , xN ) = f (xi|x1, . . . , xi−1)f (x1). i=2 For N = 2, the Sklar’s theorem implies that and where f (x1, x2) = c12(F1(x1), F2(x2))f1(x1)f2(x2), f (x1|x2) = c12(F1(x1), F2(x2))f1(x1), c12 := c12(F1(x1), F2(x2)) (3.15) (3.16) (3.17) (3.18) is a density of C(F1(x1), F2(x2)) = F (x1, x2). Using (3.17) for the decomposition of (X1, Xt) given X2, . . . , Xt−1, we obtain f (xt|x1, . . . , xt−1) = ( cs,t;s+1,...,t−1)c(t−1),t · ft(xt), (3.19) t−2(cid:89) s=1 52 where and ci,j;i1,...,ik := ci,j;i1,...,ik (F (xi|xi1 , . . . , xik ), F (xj|xi1 , . . . , xik )) (3.20) F (xi, xj|xi1 , . . . , xik ) := Ci,j;i1,...,ik , . . . , xik , . . . , xik )). (3.21) Using (3.15) and (3.19) with the specific index choices s = i, t = i + j, we have that (F (xi|xi1 N−j(cid:89) (cid:20) N−1(cid:89) ), F (xj|xi1 (cid:21) N(cid:89) f (x1, . . . , xN ) = ci,(i+j);(i+1),...,(i+j−1) fk(xk). (3.22) j=1 i=1 k=1 are two-dimensional copulas evaluated at the CDFs F (xi|xi1 Note that ci,j;i1,...,ik , . . . , xik and F (xj|xi1 ). The decomposition above is called a D-vine distribution. A similar , . . . , xik class of decompositions is possible when one applies (3.17) on (Xt−1, Xt) given X1, . . . , Xt−2 and sets j = t − k, j + i = t to get a canonical vine (C-vine) (Cooke and Kurowicka, 2006): ) f (x1, . . . , xN ) = f1(x1) t−1(cid:89) k=1 (cid:20) N(cid:89) N−j(cid:89) t=2 (cid:20) N−1(cid:89) = (cid:21) ct−k,t;1,...,(t−k−1) · ft(xt) (cid:21) N(cid:89) cj,(j+i);1,...,(j−1) fk(xk). j=1 i=1 k=1 One can easily imagine that many such pair-copula decompositions exist. Bedford and Cooke (2002) observed that these can be represented graphically as a sequence of nested trees with undirected edges, which are referred to as vine trees. In order for a pair-copula decomposition to be feasible, Bedford and Cooke (2002) defined a regular vine tree (R-vine) on N variables consisting of connected trees T1, . . . ,TN−1 with nodes Ni and edges Ei satisfying the following conditions: 1. T1 has nodes N1 = {1, . . . , N} and edges E1. 2. For i = 2, . . . , N − 1 the tree Ti has nodes Ni = Ei−1 (i.e., edges in a tree become nodes in the next tree). 3. Two edges in Ti are joined in Ti+1 if they share a common node in Ti. 53 Figure 3.1: A D-vine tree representation of a copula with 5 variables. Here, we focus exclusively on the D-vine and C-vine decompositions because they repre- sent the most-studied instances of regular vines and provide an especially efficient notation. We note, however, that the following results can be extended to any regular vines. Properties of vine copulas (Cooke and Kurowicka, 2006). 
The vine copula construction is particularly attractive for two reasons. First, each pair of variables occurs only once as a conditioning set. Second, the bivariate copulas involved in the decompositions have a convenient form in the case of a Gaussian likelihood f. In particular, let X = (X1, . . . , XN)^T follow a multivariate normal distribution with Fj = Φ, j = 1, . . . , N, where Φ is the standard normal CDF. The bivariate copula density is

c_{i,j;i1,...,ik}(ui, uj) = (1/√(1 − κ^2)) exp{ −[ κ^2 (wi^2 + wj^2) − 2κ wi wj ] / (2(1 − κ^2)) }. (3.23)

Here, ui = F(xi|x_{i1}, . . . , x_{ik}), uj = F(xj|x_{i1}, . . . , x_{ik}), wi = Φ^{−1}(ui), wj = Φ^{−1}(uj), and κ = ρ_{i,j·i1,...,ik} is the partial correlation of variables i, j given i1, . . . , ik. The D-vine and C-vine decompositions also involve conditional CDFs, for which we need further expressions. Let v ∈ D and D−v := D \ v, so that D contains more than one element. F(xj|xD) is typically computed recursively as

F(xj|xD) = h( F(xj|x_{D−v}), F(xv|x_{D−v}) | ρ_{j,v·D−v} ), (3.24)

and the function h is for the Gaussian case given by

h(ui, uj|ρ_{i,j·i1,...,ik}) = Φ( ( Φ^{−1}(ui) − ρ_{i,j·i1,...,ik} Φ^{−1}(uj) ) / √(1 − ρ_{i,j·i1,...,ik}^2) ). (3.25)

Lastly, the partial correlation can also be computed recursively as

ρ_{i,j·D} = ( ρ_{i,j·D−v} − ρ_{i,v·D−v} ρ_{v,j·D−v} ) / ( √(1 − ρ_{i,v·D−v}^2) √(1 − ρ_{v,j·D−v}^2) ). (3.26)

3.2.2 Scalable algorithm with truncated vine copulas

We now consider the data likelihood p(d|φ) according to (3.5) and make use of vines to construct a noisy estimate of the gradient ∇λL(λ). We additionally assume that N = n + s, where n is the number of observations yi from the physical process, and s is the number of computer model runs zj. The log-likelihood log p(d|φ) can be rewritten according to the D-vine decomposition as

log p(d|φ) = Σ_{j=1}^{N−1} Σ_{i=1}^{N−j} p^D_{i,i+j}(φ), (3.27)

where

p^D_{i,i+j}(φ) = log c_{i,(i+j);(i+1),...,(i+j−1)} + (1/(n − 1)) ( log pi(di|φ) + log p_{i+j}(d_{i+j}|φ) ). (3.28)

This can be conveniently used in the expression of the ELBO gradient. For a D-vine, we have that

∇λL(λ) = Eq[ Σ_{j=1}^{N−1} Σ_{i=1}^{N−j} ∇λ log q(φ|λ) p^D_{i,i+j}(φ) ] − Eq[ ∇λ log q(φ|λ) log( q(φ|λ)/p(φ) ) ]. (3.29)

The following proposition gives a noisy unbiased estimate L̃^D(λ) of the gradient (3.29). Similarly, we can derive a noisy estimate L̃^C(λ) of the gradient using a C-vine. We leave the details to Section 3.5.1.

Proposition 1. Let L̃^D(λ) be an estimate of the ELBO gradient ∇λL(λ) defined as

L̃^D(λ) = (N(N − 1)/2) Eq[ ∇λ log q(φ|λ) p^D_{I_D(K)}(φ) ] − Eq[ ∇λ log q(φ|λ) log( q(φ|λ)/p(φ) ) ],

where K ∼ U(1, . . . , N(N − 1)/2), and I_D is the bijection I_D : {1, . . . , N(N − 1)/2} → {(i, i + j) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . , N − 1}}; then L̃^D(λ) is unbiased, i.e., E(L̃^D(λ)) = ∇λL(λ).

As in the case of SVI for independent data, these noisy estimates allow us to update the variational parameter λ without the need to evaluate the whole likelihood p(d|φ). We need to consider only the data consisting of a copula's conditioning and conditioned sets. Unfortunately, both L̃^D(λ) and L̃^C(λ) can be relatively costly to compute for large datasets because of the recursive nature of the calculations involved in evaluating the copula densities. According to Brechmann et al. (2012), Dissmann et al. (2013), and Brechmann and Joe (2015), the most important and strongest dependencies among variables can typically be captured best by the pair copulas of the first trees (a small numerical sketch of the Gaussian pair-copula building blocks (3.23)-(3.26) and of the random pair selection is given below).
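The following sketch (Python with SciPy; function names are illustrative, not the chapter's implementation) implements the Gaussian pair-copula log-density (3.23) and h-function (3.25), the partial-correlation recursion (3.26), and the uniform selection of a single D-vine pair (i, i + j) through a bijection like I_D in Proposition 1, which is the ingredient that makes the noisy gradient estimates cheap.

import numpy as np
from scipy.stats import norm

def gaussian_pair_copula_logdensity(u_i, u_j, kappa):
    """log c_{i,j;D}(u_i, u_j) for a Gaussian pair copula, cf. (3.23)."""
    w_i, w_j = norm.ppf(u_i), norm.ppf(u_j)
    return (-0.5 * np.log(1 - kappa ** 2)
            - (kappa ** 2 * (w_i ** 2 + w_j ** 2) - 2 * kappa * w_i * w_j)
              / (2 * (1 - kappa ** 2)))

def h(u_i, u_j, rho):
    """Gaussian h-function (3.25), used to build conditional CDFs recursively."""
    return norm.cdf((norm.ppf(u_i) - rho * norm.ppf(u_j)) / np.sqrt(1 - rho ** 2))

def partial_corr(rho_ij, rho_iv, rho_vj):
    """Recursion (3.26) for the partial correlation given one more variable v."""
    return (rho_ij - rho_iv * rho_vj) / np.sqrt((1 - rho_iv ** 2) * (1 - rho_vj ** 2))

def draw_dvine_pair(N, rng=np.random.default_rng(0)):
    """Uniformly draw one of the N(N-1)/2 D-vine pairs (i, i+j), i.e. I_D(K)."""
    # for large N one would invert the bijection arithmetically instead of enumerating
    pairs = [(i, i + j) for j in range(1, N) for i in range(1, N - j + 1)]
    return pairs[rng.integers(len(pairs))]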
This notion motivates the use of trun- cated vine copulas, where the copulas associated with the higher-order trees are set to the independence copulas. From the definition of a regular vine, one can show that the joint density f can be decomposed as f (d1, . . . , dN ) = (cid:20) N−1(cid:89) (cid:89) j=1 e∈Ei (cid:21) N(cid:89) k=1 fk(dk), cj(e),k(e);D(e) where e = j(e), k(e); D(e) ∈ Ei is an edge in the ith tree of the vine specification. We define the truncated regular vine copula as follows. Definition 2 (Brechmann et al. (2012)). Let U = {U1, . . . , UN} be a random vector with uniform marginals, and let l ∈ {1, . . . , N−1} be the truncation level. Let Π denote the bivari- ate independence copula. Then, U is said to be distributed according to an N-dimensional l-truncated R-vine copula if C is an N-dimensional R-vine copula with Cj(e),k(e);D(e) = Π ∀e ∈ Ei i = l + 1, . . . , N − 1. 56 For the case of an l-truncated D-vine, we have (cid:20) l(cid:89) N−j(cid:89) j=1 i=1 (cid:21) N(cid:89) k=1 fk(dk), (3.30) f (d1, . . . , dN ) = ci,(i+j);(i+1),...,(i+j−1) and analogically to the case of D-vine with no truncation, the log-likelihood p(d|φ) can be written as a sum of unique elements given in Proposition 2. Proposition 2. If the copula of p(d|φ) is distributed according to an l-truncated D-vine, we can rewrite where log p(d|φ) = l(cid:88) N−j(cid:88) j=1 i=1 Dl i,i+j(φ), p (3.31) Dl i,i+j(φ) = log ci,(i+j);(i+1),...,(i+j−1) + p log pi(di|φ) + 1 ai 1 bi+j log pi+j(di+j|φ), (3.32) and ai = 2l − bi+j = 2l − (cid:20) (cid:20) (l + 1 − i)1i≤l + (l − N + i)1i>N−l (l + 1 − j − i)1i+j≤l + (l − N + j + i)1i+j>N−l (cid:21) , (cid:21) . The main idea for the scalable variational calibration (VC) of computer models is replac- ing the full log-likelihood log(d|φ) in the definition of ELBO with the likelihood based on a truncated vine copula. This yields the l-truncated ELBO for the l-truncated D-vine LDl l(cid:88) N−j(cid:88) j=1 i=1 (λ) = Eq Dl i,i+j(φ) p − KL(q(φ|λ)||p(φ)) (3.33) j=1 i=1 (cid:20) ∇λ log q(φ|λ)(p Eq (cid:21) Dl i,i+j(φ)) (cid:20) − Eq ∇λ log q(φ|λ) log (cid:21) . q(φ|λ) p(φ) with its gradient ∇λLDl (λ) = The following proposition gives a noisy unbiased estimate ˜LDl We can analogously derive an unbiased estimate ˜LCl (λ) of the gradient ∇λLDl (λ). (λ) of the gradient using C-vine (see Section 3.5.1). 57 (cid:20) l(cid:88) N−j(cid:88) (cid:21) (λ) = Proposition 3. Let ˜LDl l(2N − (l + 1)) ˜LDl where K ∼ U (1, . . . , l(2N−(l+1)) l(2N − (l + 1)) 2 2 IDl then ˜LDl : {1, . . . , (λ) is unbiased i.e., E( ˜LDl 2 (cid:20) (λ) be an estimate of the ELBO gradient ∇λLDl ∇λ log q(φ|λ)(p Eq (cid:20) ∇λ log q(φ|λ) log − Eq (cid:21) (λ) defined as q(φ|λ) p(φ) (K)(φ)) (cid:21) , Dl IDl is the bijection ), and IDl } → {(i, i + j) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . l}}, (λ)) = ∇λLDl (λ). Considering the l-truncated ELBO defined above, our proposed algorithm for variational calibration of computer models with truncated vine copulas is stated in Algorithm 3.1. Note that ˜LDl (λ) does not have closed form expression in general due to expectations involved in the computation. Therefore, we resort to a MC approximation of the gradient estimate ˜LDl (λ) using samples from the variational distribution. Algorithm 3.1: Variational calibration with truncated D-vine copulas. 
Input: Data d, mean and covariance functions for GPs in Kennedy-O’Hagan framework, variational family q(φ|λ), truncation level l 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ q(φ|λ) 5 6 7 8 2 ) (cid:80)S K ← U (1, . . . , l(2N−(l+1)) (cid:20) ρ ← tth value of a Robbins-Monro sequence λ ← λ + ρ 1 l(2N−(l+1)) log q(φ[s]|λ) t ← t + 1 ∇λ log q(φ[s]|λ)(cid:0)p l(2N−(l+1)) (cid:1)(cid:21) p(φ[s]) s=1 S 2 2 9 10 until change of λ is less than  // Random sample from q (K)(φ[s]) − Dl IDl Scalability Discussion. The complexity of a bivariate copula evaluation depends on the size of the conditioning dataset due to the recursive nature of the calculations involved (Cooke and Kurowicka, 2006). From the vine tree construction, the cardinality of the conditioning 58 set for D-vine and C-vine is in the worst case N − 2. Nevertheless, on average, we can do better. Lemma 2. Let X be the cardinality of the conditioning set in pD ID(K)(φ) or pC IC (K)(φ), then and E(X) = N−2 3 . P (X = i) = (cid:0)N (cid:1) N − (i + 1) 2 for i ∈ {0, . . . , N − 2} (3.34) The cardinality of conditioning set in Lemma 2 is on average roughly N/3. On the other hand, the cardinality of conditioning set for the case of Algorithm 3.1 is at most l − 1 with the average given by the following lemma. Lemma 3. Let X be the cardinality of the conditioning set in p Dl IDl (K)(φ) or p Cl ICl (K)(φ), then and P (X = i) = N − (i + 1) l(2N−(l+1)) 2 for i ∈ {0, . . . , l − 1}, (3.35) E(X) = (l − 1)(3N − 2l − 2) 3(2N − l − 1) . As a consequence of Lemma 3, E(X) ≈ 2 for N = 105 and truncation level l = 5, which IC (K)(φ) (≈ 33333 for is a significant improvement to the average case pD ID(K)(φ) and pC N = 105). This provides a heuristic yet convincing argument for the scalability. 3.3 Implementation details 3.3.1 Selection of truncation level Selection of the truncation level l is an important element in effective approximation of the posterior distribution p(φ|d) under Algorithm 3.1. Dissmann et al. (2013) propose a sequential approach for selection of l in the case of vine estimation. One sequentially fits models with an increasing truncation level until the quality of fit stays stable or com- putational resources are depleted. We adopt similar idea for the case of VC of computer 59 models with vine copulas. Let λ(l) represents the value of variational parameter estimated with Algorithm 3.1 for a fixed truncation level l. One can then sequentially increase l until ∆(λ(l + 1), λ(l)) <  for some distance metric ∆ and a desired tolerance . 3.3.2 Variance reduction of Monte Carlo approximations The computational convenience of simple MC approximations of the gradient estimators based on the l-truncated D-vine and C-vine copulas ˜LDl (λ) and ˜LCl (λ) (see Section 3.2.2) is typically accompanied by their large variance. The consequence in practice is the need for small step size ρt in the SGA portion of Algorithm 3.1 which results in a slower convergence. In order to reduce the variance of MC approximations, we adopt the same approach as Ruiz et al. (2016) and use the Rao-Blackwellization (Casella and Robert, 1996) in combination with the control variates (Ross, 2006) and importance sampling. The reminder of this section focuses on the case of D-vine decomposition, see Section 3.5.1 for the derivations for C-vines. Rao-Blackwellization. The idea here is to replace the noisy estimate of gradient with its conditional expectation with respect to a subset of φ. 
For simplicity, let us consider a situation with φ = (φ1, φ2) ∈ R2 and variational family q(φ|λ) that factorizes into q(φ1|λ1)q(φ2|λ2). Additionally, let ˆLλ(φ1, φ2) be the MC approximation of the gradient ∇λL(λ). Now, the conditional expectation E[ ˆLλ(φ1, φ2)|φ1] is also an unbiased estimate of ∇λL(λ) since Eq(E[ ˆLλ(φ1, φ2)|φ1]) = Eq( ˆLλ(φ1, φ2)) and Varq(E[ ˆLλ(φ1, φ2)|φ1]) = Varq( ˆLλ(φ1, φ2)) − E[( ˆLλ(φ1, φ2) − E[ ˆLλ(φ1, φ2)|φ1])2] shows that Varq(E[ ˆLλ(φ1, φ2)|φ1]) ≤ Varq( ˆLλ(φ1, φ2)). The factorization of the variational family also makes the conditional expectation straightforward to compute as E[ ˆLλ(φ1, φ2)] q(φ1|λ1)q(φ2|λ2) q(φ1|λ1) dφ2 = E q(φ2|λ2)( ˆLλ(φ1, φ2)), 60 E[ ˆLλ(φ1, φ2)|φ1] = (cid:90) φ2 i.e., we just need to integrate out some variables. Let us consider the MC approximation of the gradient estimator ˜LDl (cid:20)l(2N − (l + 1)) S(cid:88) log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − ∇λj 1 S s=1 2 (λ). The jth entry of the Rao-Blackwellized estimator is 2 l(2N − (l + 1)) log q(φj[s]|λj) p(φj[s]) (cid:1)(cid:21) , where ˜p(j)(φ) are the components of p Dl IDl (K)(φ) that include φj. Control Variates. To further reduce the variance of the MC approximations we will replace the Rao-Blackwellized estimate above with a function that has the same expectation but again smaller variance. For illustration, let us first consider a target function ξ(φ) whose variance we want to reduce, and a function ψ(φ) with finite expectation. Define ˆξ(φ) = ξ(φ) − a(ψ(φ) − Eq[ψ(φ)]), (3.36) where a is a scalar and Eq( ˆξ(φ)) = Eg[ξ(φ)]. The variance of ˆξ(φ) is Varq( ˆξ(φ)) = Varq(ξ(φ)) + a2Varq(ψ(φ)) − 2aCovq(ξ(φ), ψ(φ)). (3.37) This shows that a good choice for function ψ(φ) is one that has high covariance with ξ(φ). Moreover, the value of a that minimizes (3.37) is a∗ = Covq(ξ(φ), ψ(φ)) Varq(ψ(φ)) . (3.38) Let us place the CV back into the context of calibration. Meeting the above described criteria, Ranganath et al. (2014) propose ψ(φ) to be ∇λ log q(φ|λ), because it depends only on the variational distribution and has expectation zero. We can now set the target function ξ(φ) to be l(2N − (l + 1)) 2 log q(φj|λj)(cid:0)˜p(j)(φ) − ∇λj 2 l(2N − (l + 1)) log (cid:1), q(φj|λj) p(φj) 61 which gives the following jth entry of the MC approximation of the gradient estimator ˜LDl (λ) with CV ˜LCV (j) Dl = 1 S (λ) S(cid:88) (cid:20)l(2N − (l + 1)) s=1 2 log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − ∇λj 2(log q(φj [s]|λj ) p(φj [s]) + ˆaD j ) l(2N − (l + 1)) (cid:1)(cid:21) , where ˆaD j is the estimate of a∗ based on additional independent draws from the variational approximation (otherwise the estimator would be biased). Importance sampling. Here, we outline the last variance reduction technique that makes use of importance sampling. We refer to Ruiz et al. (2016) for full description of the method and illustration of its efficiency in the VBI framework. Fundamentally, instead of taking samples from the variational family q(φ|λ) to carry out the MC approximation of the ELBO gradient estimate, we will take samples from an overdispersed distribution r(φ|λ, τ ) in the same family that depends on an additional dispersion parameter τ > 1. Namely, we can (cid:20) l(2N − (l + 1)) write the estimate ˜LDl (λ) as Er(φ|λ,τ ) 2 ∇λ log q(φ|λ)(p (K)(φ) − Dl IDl 2 l(2N − (l + 1)) (cid:21) )w(φ) , q(φ|λ) p(φ) log where w(φ) = q(φ|λ)/r(φ|λ, τ ) is the importance weight which guarantees the estimator to be unbiased. 
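The two corrections just introduced are simple to implement. The sketch below (illustrative, with hypothetical names, not the thesis code) shows the control-variate coefficient (3.38) estimated from an independent pilot sample, and the importance weight w(φ) for a Gaussian variational factor paired with the overdispersed proposal discussed later in Section 3.3.4.

```python
# A minimal sketch: the control-variate correction (3.36)-(3.38) with a* estimated
# from independent pilot draws, and the importance weight w(phi) = q(phi|lambda) /
# r(phi|lambda, tau) when r is the overdispersed Gaussian (variance inflated by tau).
import numpy as np
from scipy.stats import norm

def cv_corrected_mean(xi, psi, xi_pilot, psi_pilot):
    """Estimate E[xi] as mean(xi - a_hat * psi); psi is the score, so E[psi] = 0.
    a_hat = Cov(xi, psi) / Var(psi) is computed from an independent pilot sample,
    which keeps the corrected estimator unbiased."""
    a_hat = np.cov(xi_pilot, psi_pilot)[0, 1] / np.var(psi_pilot, ddof=1)
    return np.mean(xi - a_hat * psi)

def gaussian_importance_weights(phi, mu, sigma, tau):
    """w(phi) for samples phi drawn from the overdispersed proposal N(mu, tau * sigma^2)."""
    q = norm.pdf(phi, loc=mu, scale=sigma)
    r = norm.pdf(phi, loc=mu, scale=sigma * np.sqrt(tau))
    return q / r
```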
The reason to formulate the ˜LDl (λ) this way comes from the fact the optimal proposal (Robert and Casella, 2005) distribution to form the MC estimate is not q(φ|λ), but rather r∗(φ) ∝ q(φ|λ)|ξ(φ)|, where ξ(φ) = l(2N − (l + 1)) 2 ∇λ log q(φ|λ)(cid:0)p (K)(φ) − Dl IDl 2 l(2N − (l + 1)) (3.39) (3.40) (cid:1). q(φ|λ) p(φ) log However, the normalizing constant for the optimal r∗(φ) is intractable, and so Ruiz et al. (2016) propose that an overdispersed version of the variational family that assigns higher 62 probability to the tails of q(φ|λ) is closer to the optimum than q(φ|λ) itself. For example, if the value of λ makes the variational family a poor fit, then the samples φ[s] ∼ q(φ|λ) have a high value for the variational distribution but low for the true posterior. On the other hand, r∗(φ) proposes values of φ[s] for which ξ(φ) is large that are in the tails of p(φ|λ). To see how the importance sampling leads to the reduction of variance of the MC esti- mates, let us consider the following estimator then ξ(φ[s]), φ[s] ∼ p(φ|λ), (cid:2)ξ2(φ)(cid:3) − 1 S (cid:2) ˜LDl (λ)(cid:3)2. Eq 1 S (3.41) (3.42) Similarly, we can derived the variance of the MC estimator with the importance weights s=1 1 S S(cid:88) (cid:3) = (cid:98)LM C = Var(cid:2)(cid:98)LM C S(cid:88) (cid:3) = (cid:2)ξ2(φ) 1 S s=1 Eq ξ(φ[s]) 1 S M C = (cid:98)LO Var(cid:2)(cid:98)LO as q(φ|λ) r(φ|λ, τ ) Now, if we choose the distribution r(φ|λ, τ ) such that Eq M C q(φ[s]|λ) r(φ[s]|λ, τ ) (cid:2)ξ2(φ) (cid:3) ≤ Eq q(φ|λ) r(φ|λ, τ ) , φ[s] ∼ r(φ|λ, τ ), (λ)(cid:3)2. (cid:2) ˜LDl (cid:3) − 1 (cid:2)ξ2(φ)(cid:3), S (3.43) (3.44) (3.45) the variance reduction will be achieved. The optimal r∗ obviously satisfies the condition (3.45). Ruiz et al. (2016) show that the choice of overdispersed version of the variational family q(φ|λ) has similar effect on the variance reduction as the optimal r∗. The details on the form of overdispersed families for specific variational families are discussed later in Section 3.3.4. Combining the ideas of the Rao-Blackwellization, CV, and importance sampling, we have the following jth entry of the MC approximation of the gradient estimator ˜LDl ˜LOCV (j) Dl (λ) (λ) (cid:21) )w(φj[s]) , 2(log q(φj [s]|λj ) p(φj [s]) + ˜aD j ) l(2N − (l + 1)) (cid:20) l(2N − (l + 1)) S(cid:88) = s=1 2S ∇λj log q(φj[s]|λj)(˜p(j)(φ[s]) − 63 where φ[s] ∼ r(φ|λ, τ ) and ˜aD j = (cid:100)Covr( l(2N−(l+1))w(φj ) 2 ∇λj log q(φj|λj)(˜p(j)(φ) − 2 log (cid:100)Varr(∇λj q(φj|λj ) p(φj ) log q(φj|λj)w(φj)) l(2N−(l+1)) ),∇λj log q(φj|λj)w(φj)) . The extension of Algorithm 3.1 with the variance reductions of the MC approximations due to the Rao-Blackwellization, control variates, and importance sampling is in Algorithm 3.2. Algorithm 3.2: Variational calibration with truncated D-vine copulas II. Input: Data d, mean and covariance functions for GPs in Kennedy-O’Hagan framework, variational family q(φ|λ), dispersion parameter τ truncation level l // Random sample from r 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ r(φ|λ, τ ) 5 6 7 8 K ← U (1, . . . , l(2N−(l+1)) ρ ← tth value of a Robbins-Monro sequence ) 2 (cid:20) log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − λ ← λ + ρ(cid:80)S 2(log s=1 q(φj [s]|λj ) p(φj [s]) l(2N−(l+1)) l(2N−(l+1)) 2S ∇λj (cid:21) (cid:1)w(φj[s]) +˜aD j ) t ← t + 1 9 10 until change of λ is less than  3.3.3 Choice of the learning rate Even though the SGA is straightforward in its general definition, the choice of learning rate ρt can be challenging in practice. 
Ideally, one would want the rate to be small in situations where the noisy estimates of the gradient have large variance, and vice versa. The elements of the variational parameter λ can also differ in scale, and one needs to set the learning rate so that the SGA can accommodate even the smallest scales. The rapidly increasing use of machine learning techniques in recent years has produced various algorithms for element-wise, adaptive-scale learning rates. We use the adaptive gradient (AdaGrad) algorithm (Duchi et al., 2011), which has been considered in similar problems before, e.g., Ranganath et al. (2014); other popular algorithms include ADADELTA (Zeiler, 2012) and RMSProp (Tieleman and Hinton, 2012). Let g_T be the gradient used in the T-th step of the SGA algorithm, and let G_t be the matrix consisting of the sum of the outer products of these gradients across the first t iterations, namely

G_t = \sum_{T=1}^{t} g_T g_T^T. (3.46)

AdaGrad defines the element-wise adaptive-scale learning rate as

\rho_t = \eta \cdot \mathrm{diag}(G_t)^{-1/2}, (3.47)

where η is the initial learning rate. It is common practice, however, to add a small constant value (typically of order 10^{-6}) to diag(G_t) to avoid division by zero.

3.3.4 Parametrizations

Variational families. We use a Gaussian distribution for real-valued components of φ and a gamma distribution for positive variables. Both of these families are parametrized in terms of their mean and standard deviation. Moreover, in order to avoid constrained optimization, we transform all the positive variational parameters λ to λ̃ = log(e^λ − 1) and optimize with respect to λ̃.

Overdispersed families. Given a fixed dispersion coefficient τ, the overdispersed Gaussian distribution with mean µ and standard deviation σ is a Gaussian distribution with mean µ and standard deviation σ√τ. The overdispersed gamma distribution with mean µ and standard deviation σ is a gamma distribution with mean µ + (τ − 1)σ²/µ and standard deviation σ√(τµ² + τσ²(τ − 1))/µ (Ruiz et al., 2016).

3.4 Applications

This section empirically establishes the efficiency of our methodology for the VBI-based calibration of computer models. First, we conduct an extensive simulation study, where we focus both on the fidelity of the variational approximation and on prediction accuracy. Second, we demonstrate the opportunities in UQ given by the proposed methodology through calibration of the Liquid Drop Model.

The Bayesian predictions of new observations from the physical process ζ at input locations (t*_1, . . . , t*_J) are obtained according to (3.6). The conditional distribution p(y*|d, φ) is a multivariate normal distribution with the mean vector

M_{y^*}(\phi) = M_f(T^*_y(\theta)) + M_\delta(T^*_y) + C^* K(\phi)^{-1} (d - M(\phi)), (3.48)

and the covariance matrix

K_{y^*}(\phi) = K_f(T^*_y(\theta), T^*_y(\theta)) + K_\delta(T^*_y, T^*_y) + \sigma^2 I_J - C^* K(\phi)^{-1} C^{*T}, (3.49)

where

C^* = \Big( K_f(T^*_y(\theta), T_y(\theta)) + K_\delta(T^*_y, T_y) \quad K_f(T^*_y(\theta), T_z(\tilde{\theta})) \Big). (3.50)

Here, M(φ) and K(φ) are the mean vector and the covariance matrix of the data likelihood p(d|φ), K_f(T*_y(θ), T_y(θ)) is the matrix with (i, j) element k_f((t*_i, θ), (t_j, θ)), and K_δ(T*_y, T_y) is the matrix with (i, j) element k_δ(t*_i, t_j). We can similarly define K_f(T*_y(θ), T_z(θ̃)) with the kernel k_f.
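For concreteness, the following sketch (illustrative only; the array names are placeholders, not the thesis code) shows how the predictive mean and covariance (3.48)-(3.49) can be evaluated with a Cholesky factorization of K(φ) instead of an explicit inverse.

```python
# A sketch of the predictive equations (3.48)-(3.49) for a fixed value of phi;
# all inputs are assumed to be precomputed numpy arrays.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predictive_mean_cov(d, M, K, C_star, M_star, K_star_prior, sigma):
    """Return the mean and covariance of p(y* | d, phi).

    d            : data vector (y, z) of length N
    M, K         : mean vector and covariance matrix of p(d | phi)
    C_star       : J x N cross-covariance between the new points and the data, eq. (3.50)
    M_star       : prior mean at the J new points, M_f + M_delta
    K_star_prior : J x J prior covariance at the new points, K_f + K_delta
    sigma        : observation noise scale
    """
    chol = cho_factor(K, lower=True)
    mean = M_star + C_star @ cho_solve(chol, d - M)
    cov = (K_star_prior + sigma**2 * np.eye(len(M_star))
           - C_star @ cho_solve(chol, C_star.T))
    return mean, cov
```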
3.4.1 Simulation study

In this section, we study Algorithm 3.2 in a simulated scenario, where we first demonstrate the method's fidelity in approximating the posterior distribution of the calibration parameters p(θ|d) and substantiate the indispensability of the variance reduction techniques described in Section 3.3.2 for achieving convergence. Second, we show the scalability of our method in comparison with the popular MH algorithm and the NUTS.

Let us consider a simple scenario following the model (3.4) with a two-dimensional calibration parameter θ = (0.39, 0.60), obtained as a sample from its prior distribution p(θ), and a two-dimensional input variable t = (t1, t2). We model fm(t, θ) and δ(t) with GPs according to the specifications in Table 3.1, with the particular choices η_f = 1/2, l_t = 1, l_θ = 1, η_δ = 1/30, l_δ = 1/30, and β_δ = 0.15.

GP     GP mean                       GP covariance function
f_m    θ1 cos(t1) + θ2 sin(t2)       η_f · exp( −||t − t'||²/(2 l_t²) − ||θ − θ'||²/(2 l_θ²) )
δ      β_δ                           η_δ · exp( −||t − t'||²/(2 l_δ²) )

Table 3.1: The specification of GPs for the simulation study.

We choose the variational family to be the mean-field family with Gaussian distributions for real-valued parameters and gamma distributions for positive variables, following the parametrization discussed in Section 3.3.4. The variational parameters are initialized to match the prior distributions, and we use AdaGrad for the learning rate updates.

Calibration. For the purpose of model calibration, we sampled the data d jointly from the prior with the experimental noise following N(0, 1/100). The calibration parameter values for the model runs z were selected on a uniform grid over [0, 1]^2 and the inputs t over [0, 3]^2. For the first set of experiments, the size of the dataset was N = 225 with n = 144 and s = 81. We used 50 samples from the variational family to approximate the expectations in Algorithm 3.2 and 10 samples to implement the control variates.

Figure 3.2 demonstrates the quality of the variational approximation (Algorithm 3.2) in comparison to the MH algorithm and the NUTS. We can see that our method was able to accurately match both MCMC-based approximations, with a minor deviation in θ1. It is important to note, however, that the variance reduction through the combination of Rao-Blackwellization, control variates, and importance sampling was necessary to achieve meaningful convergence.

Figure 3.2: The approximate posterior distributions for the target calibration parameters. The VC (Algorithm 3.2) was carried out using the l = 3 truncated D-vine and compared with the results from the NUTS and the MH algorithm.

In particular, Figure 3.3 shows the MSE of the posterior predictive means, evaluated on an independently generated set of 50 data points, based on the VC with cumulatively implemented variance reduction techniques. Algorithm 3.2, which employs importance sampling, clearly outperforms the calibration with only Rao-Blackwellization and the calibration with control variates. In fact, each additional variance reduction step tends to decrease the MSE by one order of magnitude. There is naturally a time and space (memory) cost associated with each variance reduction technique. Figure 3.3 shows that the control variates and the importance sampling practically double the time per iteration of the algorithm. This additional complexity is, however, outweighed by the gain in MSE reduction.
The increase in memory consumption is less significant and is due to the storage of the dispersion coefficients used for importance sampling and of the samples needed to compute the control variates. Note that the memory consumed by the algorithms rises over time because we chose to store the values of the variational parameters at each step; the memory demands can be dramatically reduced if we drop these intermediate results. For completeness, Table 3.2 also compares the MSE of the MCMC approximations and the VC at the point of convergence of the algorithms. The resulting prediction errors were, for all practical purposes, equivalent.

Figure 3.3: The evolution of the MSE of the posterior predictive means based on the VC with cumulatively implemented variance reduction techniques described in Section 3.3.2. The figure is based on an independently generated set of 50 testing points. Time and memory demands for each of the implementations are also plotted; the VC (Algorithm 3.2) was carried out using the l = 3 truncated D-vine.

Algorithm                                    MSE
Variational Calibration - RB + CV + IS       2.9 × 10^−3
Metropolis-Hastings                          3.0 × 10^−3
No-U-Turn                                    3.0 × 10^−3

Table 3.2: Comparison of the MSE for the simple scenario using the MH, the NUTS, and the VC algorithms.

Scalability. We now significantly increase the size of the dataset from N = 225 to 0.5 × 10^4 and eventually to 2 × 10^4, with the simulated experimental measurements and the model runs split equally (n = s). For better numerical stability, we expand the space of the input variables to t ∈ [0, 10]^2 and select the inputs using a Latin hypercube design. We also enlarge the testing dataset to 200 points. All the remaining simulation parameters are unchanged. The conventional MCMC methods are already impractical for the purpose of Bayesian calibration with these moderately large amounts of data. We were able to obtain only around 600 posterior samples in the case of N = 1 × 10^4 and about 120 for N = 2 × 10^4 in 25 hours of sampling using the MH algorithm (significantly fewer with the NUTS).

Figure 3.4: The evolution of the MSE of the posterior predictive means based on the VC (Algorithm 3.2), the MH algorithm, and the NUTS. The figure is based on an independently generated set of 200 testing points. The VC (Algorithm 3.2) was carried out using the l = 5 truncated D-vine.

Figure 3.4 demonstrates that Algorithm 3.2 (D-vine with truncation l = 5) converges to a predictive MSE of about 0.003 in under 4 hours for N = 2 × 10^4 and 2 hours for N = 0.5 × 10^4. It took the MH a similar time to achieve this MSE value for N = 0.5 × 10^4, but almost 25 hours for the NUTS. Once we increased the data size to 2 × 10^4, neither the NUTS nor the MH was able to achieve a predictive MSE comparable to the VC within the 25-hour window allotted for sampling; in fact, their errors were larger by an order of magnitude. It is important to mention that both MCMC-based algorithms also have substantially larger memory demands than the VC, as depicted in Figure 3.5. These memory profiles were recorded during a one-hour period of running the algorithms. The MH algorithm and the NUTS were implemented in Python 3 using the PyMC3 module, version 3.5. The memory profiles were measured with the memory-profiler module, version 0.55.0, in Python 3; the VC was also implemented in Python 3.
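As a rough illustration of how such profiles can be collected (a sketch assuming the memory-profiler package; run_calibration is a hypothetical placeholder for whichever algorithm is being profiled), one can record memory readings at fixed intervals while the algorithm runs:

```python
# Record memory usage (in MiB) of a calibration run at 60-second intervals;
# run_calibration is a placeholder for Algorithm 3.2, the MH sampler, or the NUTS.
from memory_profiler import memory_usage

def run_calibration():
    ...  # launch the calibration algorithm here

profile = memory_usage((run_calibration, (), {}), interval=60.0)
```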
Figure 3.5: Recorded memory profiles of Algorithm 3.2, the MH algorithm, and the NUTS for the duration of 1 hour under the simulation scenario.

3.4.2 Calibration of the Liquid Drop Model

Over the past decade or so, the statistical tools of UQ have experienced a robust ramp-up in use in the field of nuclear physics (Ireland and Nazarewicz, 2015). Bayesian calibration has been especially popular because it enhances the understanding of a nuclear model's structure through parameter estimation and potentially advances the quality of nuclear modeling by accounting for systematic errors. In this context, we use our variational Algorithm 3.2 to calibrate the 4-parameter LDM. Since we discussed the LDM several times in this dissertation, we refer the reader to Section 1.1 for a detailed description of the model. We also note that this is by no means the first case in which Bayesian calibration methodology has been applied to study the LDM. In fact, the LDM is a popular model for statistical applications (Bertsch et al., 2005; Yuan, 2016; Bertsch and Bingham, 2017), which is why we choose it to illustrate our methodology as well. The LDM also generally performs better on heavy nuclei than on light nuclei, which alludes to the existence of a significant systematic discrepancy between the model and the experimental binding energies (Reinhard et al., 2006; Kejzlar et al., 2020). Namely, we consider the following statistical model

y = E_B(N, Z) + \delta(N, Z) + \sigma\varepsilon, (3.51)

where δ(N, Z) represents the unknown systematic discrepancy between the semi-empirical mass formula E_B(N, Z) and the experimental binding energies y. The parameter σ is, as usual, the scale of the observation error ε ∼ N(0, 1). The nuclear physics community often (Dobaczewski et al., 2014) considers the least squares (LS) estimator of θ defined as

\hat{\theta}_{L_2} = \arg\min_{\theta} \sum_{i=1}^{n} \big(y_i - E_B(N_i, Z_i)\big)^2, (3.52)

which is also the maximum likelihood estimate of θ in the case of δ = 0. The benefit of this estimator is that it is fast, easy to compute, and allows for analysis under standard linear regression theory. It, however, neglects some sources of uncertainty that are accounted for in the Bayesian calibration framework.

To this end, we shall consider a GP prior with mean zero and the squared exponential covariance function for the systematic discrepancy δ(Z, N). Since the main purpose of the example is to provide a canonical illustration of the methodology in a real-data scenario, we also set a GP prior for the LDM and treat E_B(Z, N) as an unknown function. We use 2000 experimental binding energies randomly selected from the AME2003 dataset (Audi et al., 2003) (publicly available at http://amdc.impcas.ac.cn/web/masstab.html) for calibration, see Figure 3.6, and an additional set of 10^4 model evaluations. The calibration inputs were generated with a Latin hypercube design so that all the reasonable values of (avol, asurf, asym, aC) given by the literature are covered (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Kirson, 2008; Benzaid et al., 2020). The model inputs (Z, N) were selected from the set of 2000 experimental binding energies, duplicated five-fold, and randomly permuted among the generated calibration inputs to span only the set of relevant nuclei. This relatively large number of model runs was chosen so that the combined 6-dimensional space of calibration parameters and model inputs is sufficiently covered, considering the existence of a non-trivial systematic discrepancy. In fact, a uniform experimental design would amount to only 4-5 points per dimension.

Figure 3.6: Experimental binding energies of nuclei in the AME2003 dataset (2225 observations).

Independent Gaussian distributions centered at the LS estimates θ̂_{L_2} (in Table 3.3), with standard deviations large enough to cover the space of inputs used for generating the model runs, were selected to represent the prior knowledge about the calibration parameters. Independent gamma distributions were used as the prior models for the hyperparameters of the GPs' covariance functions. We choose the variational family to be fully factorized, with Gaussian distributions for real-valued parameters and gamma distributions for positive variables. The means of the variational families were initialized as random samples from their respective prior distributions, and the variances were set to match those of the prior distributions. We used AdaGrad for the stochastic optimization. See Section 3.5.3 for further discussion of the prior distributions and the experimental design.

Results. Including the generated model runs, the overall size of the training dataset is 1.2 × 10^4, which already makes MCMC-based calibration impractical, as illustrated by the simulation study in Section 3.4.1. We therefore assess the quality of the variational approximation only against the standard LS estimation and do not consider the MCMC methods. In particular, we consider the testing dataset of the remaining 225 experimental binding energies in AME2003 that were excluded from the training data. The predictions ŷ* of these testing binding energies y* were calculated, under the variational approximation, as the posterior means of y* conditioned on the 1.2 × 10^4 binding energies from the training dataset, i.e., the posterior means of the predictive distribution p(y*|d). The predictions under the LS estimates θ̂_{L_2} were given by the semi-empirical mass formula (1.1).

Table 3.3 gives the RMSEs for both methods under consideration. The VC (Algorithm 3.2) results are based on a 24-hour window dedicated to running the algorithm, with 50 samples used to approximate the expectations, 10 samples used to implement the control variates, and the truncation level selected to be l = 3. By using GPs to account for the systematic discrepancies of the semi-empirical mass formula and the uncertainty of the LDM itself, we were able to reduce the RMSE by approximately 57% compared to the LS benchmark. Table 3.3 additionally shows the calibration parameter estimates and their standard errors. The estimates under the VC are given by the means of their variational families. Both methods calibrate the LDM around the same values, with notably low standard errors for the LS estimates. This is, however, expected, since the θ̂_{L_2} estimates are ordinary LS estimates that, in the presence of heteroscedasticity (see Figure 3.7), become inefficient and tend to significantly underestimate the true variance (Goldberger, 1966; Johnston, 1976).
Method Parameter estimate and standard errors asurf avol 15.42 (0.027) 16.91 (0.086) 22.47 (0.070) 0.69 (0.002) 15.78 (0.198) 15.99 (0.681) 21.94 (0.510) 0.68 (0.018) asym aC LS VC Testing error RMSE (MeV) 3.54 1.52 Table 3.3: The RMSE of the VC (Algorithm 3.2) after 24 hours dedicated to running the algorithm compared with the RMSE based on the LS estimates. The parameter estimates (and their standard errors) are also displayed. The residual plot in Figure 3.7, showing the difference between y∗ and ˆy∗ as a function of the nuclear mass number A, clearly demonstrates a better fit of the testing data with our methodology than is achieved by the simple LS fit. The majority of the residuals appear to be randomly spread around 0 which strongly supports the efficiency of GPs in accounting for the systematic discrepancy δ. Figure 3.7: The residual plot for 225 experimental binding energies in the testing dataset. 75 050100150200250A=Z+N15105051015LSVC 3.5 Technical details and supplementary results 3.5.1 Scalable algorithm with truncated vine copulas: C-vine Here we present the details of the C-vine based versions of Algorithm 3.1 and Algo- rithm 3.2. First, we can decompose the log-likelihood log p(d|φ) using a C-vine as log p(d|φ) = pC j,j+i(φ), (3.53) N−1(cid:88) N−j(cid:88) j=1 i=1 where pC j,j+i(φ) = log cj,(j+i);1,...,(j−1) + 1 N − 1 (cid:0) log pj(dj|φ) + log pj+i(dj+i|φ)(cid:1). (cid:21) (cid:20) − Eq ∇λ log q(φ|λ) log q(φ|λ) p(φ) (3.54) (cid:21) . (3.55) This now yields the following expression for the ELBO gradient: (cid:20) N−1(cid:88) N−j(cid:88) j=1 i=1 ∇λL(λ) = ∇λ log q(φ|λ)(pC j,j+i(φ)) Eq Equivalently to Proposition 1, we have the following proposition that establishes the noisy unbiased estimate of the gradient (3.55) using the C-vine copula decomposition. Proposition 4. Let ˜LC (λ) be an estimate of the ELBO gradient ∇λL(λ) defined as q(φ|λ) p(φ) (cid:20) ∇λ log q(φ|λ) log ∇λ log q(φ|λ)(pC ˜LC (λ) = N (N − 1) IC (K)(φ)) − Eq (cid:21) (cid:20) Eq 2 (cid:21) , where K ∼ U (1, . . . , N (N−1) N (N − 1) 2 IC : {1, . . . , 2 ), and IC is the bijection } → {(j, j + i) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . N − 1}}, then ˜LC (λ) is unbiased i.e., E( ˜LC (λ)) = ∇λL(λ). Again, ˜LC (λ) can be relatively costly to compute for large datasets due to the recursive nature of the copula density computations. We now carry out exactly the same development an using l-truncated C-vine as in the case of Proposition 2 and Proposition 3. 76 Proposition 5. If the copula of p(d|φ) is distributed according to an l-truncated C-vine, we can rewrite log p(d|φ) = l(cid:88) N−j(cid:88) Cl i,i+j(φ), p j=1 i=1 cl i,i+j(φ) = log cj,(j+i);1,...,(j−1) + p 1 aj log pj(dj|φ) + 1 bj+i log pj+i(dj+i|φ), where and aj = N − 1, bj+i = (N − 1 − l)1j+i≤l + l. (3.56) (3.57) Let us now replace the full log-likelihood log(d|φ) in the definition of ELBO with the likelihood based on a truncated vine copula. This yields the l-truncated ELBO for the l-truncated C-vine (cid:20) l(cid:88) N−j(cid:88) p (cid:21) LCl N−j(cid:88) l(cid:88) j=1 i=1 (λ) = Eq Cl j,j+i(φ) − KL(q(φ|λ)||p(φ)) (3.58) (cid:20) Eq j=1 i=1 ∇λ log q(φ|λ)(p (cid:21) Cl j,j+i(φ)) (cid:20) − Eq ∇λ log q(φ|λ) log (cid:21) . q(φ|λ) p(φ) with its gradient ∇λLCl (λ) = Consequently, we get the following proposition that establishes the noisy unbiased estimate (λ). of ∇λLCl (cid:20) (λ) be an estimate of the ELBO gradient ∇λLCl Proposition 6. Let ˜LCl l(2N − (l + 1)) ˜LCl ∇λ log q(φ|λ)(p Eq where K ∼ U (1, . . . 
, l(2N−(l+1)) l(2N − (l + 1)) is the bijection (K)(φ)) − Eq Cl ICl (λ) = (cid:20) (cid:21) 2 2 ∇λ log q(φ|λ) log (λ) defined as q(φ|λ) p(φ) (cid:21) , ICl then ˜LCl : {1, . . . , (λ) is unbiased i.e., E( ˜LCl 2 ), and ICl } → {(j, j + i) : i ∈ {1, . . . , N − j} for j ∈ {1, . . . l}}, (λ)) = ∇λLCl (λ). Algorithm 3.3 postulates the version of Algorithm 3.1 based on the truncated C-vine decomposition. 77 Algorithm 3.3: Variational calibration with truncated C-vine copulas. Input: Data d, mean and covariance functions for GPs in Kennedy-O’Hagan framework, variational family q(φ|λ), truncation level l 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ q(φ|λ) 5 6 7 8 2 ) (cid:80)S K ← U (1, . . . , l(2N−(l+1)) (cid:20) ρ ← tth value of a Robbins-Monro sequence λ ← λ + ρ 1 l(2N−(l+1)) log q(φ[s]|λ) t ← t + 1 ∇λ log q(φ[s]|λ)(cid:0)p l(2N−(l+1)) (cid:1)(cid:21) p(φ[s]) S s=1 2 2 9 10 until change of λ is less than  // Random sample from q (K)(φ[s]) − Cl ICl Variance reduction of MC estimates. Let us now consider the MC approximation of (λ), the jth entry of the Rao-Blackwellized estimator is the gradient estimator ˜LCl (cid:20) l(2N − (l + 1)) S(cid:88) 1 S s=1 2 log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − ∇λj 2 l(2N − (l + 1)) log q(φj[s]|λj) p(φj[s]) where ˜p(j)(φ) are here the components of p Cl ICl (K)(φ) that include φj. We can again use the control variates to reduce the variance of MC approximation of the gradient estimator ˜LCl Blackwellized MC approximation of the gradient estimator ˜LCl (λ). In particular, we consider the following jth entry of the Rao- (λ) with control variates (cid:20)l(2N − (l + 1)) S(cid:88) s=1 2S ˜LCV (j) Cl (λ) = ∇λj log q(φj[s]|λj)(˜p(j)(φ[s]) − 2(log q(φj [s]|λj ) p(φj [s]) + ˆaC j ) l(2N − (l + 1)) where ˆaC j is the estimate of the optimal control variate scalar a∗ based on S (or fever) independent draws from the variational distribution. Namely, (cid:1)(cid:21) , (cid:21) ) , (cid:100)Covq( l(2N−(l+1)) 2 ˆaC j = ∇λj log q(φj|λj)(˜p(j)(φ) − (cid:100)Varq(∇λj log q(φj|λj)) 2 log q(φj|λj ) l(2N−(l+1)) log p(φj )),∇λj log q(φj|λj)) . 78 As in the case of the D-vine, we now derive the ultimate version of Algorithm 3.3. Again, instead of taking the samples from q(φ|λ) to approximate the gradient estimates, we will take samples from an overdispersed distribution r(φ|λ, τ ). Combining the Rao- Blackwellization, control variates, and importance sampling, we have the following jth entry of the MC approximation of the gradient estimator ˜LCl ˜LOCV (j) Cl (λ) ∇λj log q(φj[s]|λj)(˜p(j)(φ[s]) − 2(log q(φj [s]|λj ) p(φj [s]) + ˜aC j ) l(2N − (l + 1)) )w(φj[s]) , (cid:21) (λ) (cid:20) l(2N − (l + 1)) S(cid:88) = s=1 2S where φ[s] ∼ r(φ|λ, τ ) and w(φ[s]) = q(φ[s]|λ)/r(φ[s]|λ, τ ) with ˜aC j = (cid:100)Covr( l(2N−(l+1))w(φj ) 2 ∇λj log q(φj|λj)(˜p(j)(φ) − 2 log (cid:100)Varr(∇λj q(φj|λj ) p(φj ) log q(φj|λj)w(φj)) l(2N−(l+1)) ),∇λj log q(φj|λj)w(φj)) . Algorithm 3.4: Variational calibration with truncated C-vine copulas II. Input: Data d, mean and covariance functions for GPs, variational family q(φ|λ), dispersion parameter τ truncation level l // Random sample from r 1 λ ← random initial value 2 t ← 1 3 repeat 4 for s = 1 to S do φ[s] ∼ r(φ|λ, τ ) 5 6 7 8 K ← U (1, . . . 
, l(2N−(l+1)) ρ ← tth value of a Robbins-Monro sequence ) 2 (cid:20) λ ← λ + ρ(cid:80)S 2(log s=1 q(φj [s]|λj ) p(φj [s]) l(2N−(l+1)) +˜aC j ) 2S l(2N−(l+1)) ∇λj (cid:21) (cid:1)w(φj[s]) log q(φj[s]|λj)(cid:0)˜p(j)(φ[s]) − t ← t + 1 9 10 until change of λ is less than  79 3.5.2 Proofs Proof of Proposition 1. Since P (K = k) = N (N−1), we have directly from the definition of expectation 2 E( ˜LD(λ)) = (cid:20) N (N−1) 2(cid:88) N (N − 1) (cid:20) 2 ∇λ log q(φ|λ) log k=1 2 (cid:21) N (N − 1) q(φ|λ) p(φ) − Eq = ∇λL(λ). (cid:21) Eq ∇λ log q(φ|λ)(pD ID(k)(φ)) The final equality is the consequence of the uniqueness of the pairs of variables in the conditioned sets of the copula density ci,(i+j);(i+1),...,(i+j−1), and that N (N−1) 2 is the number of unordered pairs of N variables. Proof of Proposition 2. It is sufficient to show that for l ∈ {1, . . . , N − 1} the following equality holds: l(cid:88) N−j(cid:88) (cid:20) 1 j=1 i=1 ai where log pi(di|φ) + 1 bi+j (cid:21) log pi+j(di+j|φ) = N(cid:88) k=1 (cid:20) (cid:20) (l + 1 − i)1i≤l + (l − N + i)1i>N−l (l + 1 − j − i)1i+j≤l + (l − N + j + i)1i+j>N−l (cid:21) , (cid:21) . ai = 2l − bi+j = 2l − log p(dk|φ), (3.59) To show this, let us consider the summation (cid:20) (cid:21) log pi(di|φ) + log pi+j(di+j|φ) l(cid:88) N−j(cid:88) (cid:20) l(cid:88) j=1 i=1 = (cid:21) (log p1(d1|φ) + log p1+j(d1+j|φ)) + ··· + (log pN−j(dN−j|φ) + log pN (dN|φ)) . j=1 80 For l = 1, we get l(cid:88) N−j(cid:88) (cid:20) log pi(di|φ) + log pi+j(di+j|φ) (cid:21) j=1 i=1 and for l ≥ 2 = (log p1(d1|φ) + log p2(d2|φ)) + ··· + (log pN−1(dN−1|φ) + log pN (dN|φ)), (cid:20) l(cid:88) N−j(cid:88) (cid:20) (cid:21) log pi(di|φ) + log pi+j(di+j|φ) (cid:20) (cid:21) (log p1(d1|φ) + log p1+l(d1+l|φ)) + ··· + (log pN−l(dN−l|φ) + log pN (dN|φ)) (cid:21) (log p1(d1|φ) + log p2(d2|φ)) + ··· + (log pN−1(dN−1|φ) + log pN (dN|φ)) + ··· + j=1 i=1 = . Note that in the case of l = N − 1, the last summation consists of only one element log p1(d1|φ) + log p1+l(d1+l|φ). By careful examination of the two cases above, we get the following results. For 2l ≤ N : l(cid:88) N−j(cid:88) l(cid:88) (cid:21) log pi(di|φ) + log pi+j(di+j|φ) N−l(cid:88) N(cid:88) (cid:20) j=1 i=1 (l + k − 1) log pk(dk|φ) + = k=1 where the middle term disappears in the case 2l = N , and for 2l > N : (cid:20) k=l+1 2l log pk(dk|φ) + (cid:21) log pi(di|φ) + log pi+j(di+j|φ) l(cid:88) N−j(cid:88) l(cid:88) N−l(cid:88) (l + k − 1) log pk(dk|φ) + N(cid:88) (N − i + l) log pk(dk|φ). k=N−l+1 k=1 i=1 j=1 + = (N − i + l) log pk(dk|φ), k=N−l+1 (N − 1) log pk(dk|φ) k=l+1 If we now check that ai equals to the factors in front of the log-likelihoods in the two cases above, the proof of Proposition 2 is complete. Note that once we check the equality for ai, 81 the same directly translates to bi+j since bi+j is ai with indices set to i + j instead of i. Indeed, for 2l ≤ N and for 2l > N ai = ai =   l + i − 1 i ≤ l l < i ≤ N − l , 2l N − i + l N − l < i l + i − 1 N − 1 N − i + l i ≤ N − l N − l < i ≤ l . l < i Proof of Proposition 3. By the construction of R-vine (see Section 3.2.1), each tree Ti, for i = 1, . . . , N−1 has exactly N − i edges (these are the unique conditioned variable pairs). For any R-vine truncated at level l ∈ {1, . . . 
, N − 1}, we get the number of edges to be l(cid:88) i=1 (N − i) = lN − l(l + 1) l(2N − (l + 1)) 2 = 2 The rest of the proof is identical with that of Proposition 1 due to the uniqueness of the conditioned variable pairs in the copula density ci,(i+j);(i+1),...,(i+j−1), but in this case P (K = k) = 2 l(2N−(l+1)). Proof of Proposition 4. The proof is identical with that of Proposition 1 since each conditioned pair in the copula density cj,(j+i);1,...,(j−1) is unique as well. 82 Proof of Proposition 5. It is sufficient to show that for l ∈ {1, . . . , N − 1} the following equality holds: log pj(dj|φ) + 1 bj+i (cid:21) log pj+i(dj+i|φ) = N(cid:88) k=1 l(cid:88) N−j(cid:88) (cid:20) 1 j=1 i=1 aj where log p(dk|φ), (3.60) aj = N − 1, bj+i = (N − 1 − l)1j+i≤l + l. (cid:21) log pj+i(dj+i|φ) (cid:21) log pj+1(dj+i|φ)) + ··· + log pN (dN|φ) . = log p2(d2|φ) + . . . log pN (dN|φ). (cid:21) log pl+1(dl+1|φ) + . . . log pN (dN|φ) . To show this, let us consider the following summation (cid:20) (cid:21) log pj(dj|φ) + log pj+i(dj+i|φ) i=1 j=1 N−j(cid:88) l(cid:88) (cid:20) l(cid:88) l(cid:88) j=1 = = N−j(cid:88) (N − j) log pj(dj|φ) + (cid:20) l(cid:88) (N − j) log pj(dj|φ) + i=1 Now, for l = 1, we have For l ≥ 2, we have j=1 j=1 j=1 l(cid:88) (cid:20) l(cid:88) (cid:20) j=1 (cid:20) (cid:21) log pj+1(dj+1|φ)) + ··· + log pN (dN|φ) (cid:21) log pj+1(dj+1|φ)) + ··· + log pN (dN|φ) (cid:20) (cid:21) log p2(d2|φ) + . . . log pN (dN|φ) + ··· + = Therefore we can rewrite (cid:20) (cid:21) l(cid:88) log pj+1(dj+1|φ)) + ··· + log pN (dN|φ) l(cid:88) N(cid:88) (j − 1) log pj(dj|φ) + l log pj(dj|φ) j=1 = j=1 j=l+1 83 Overall, (cid:20) (cid:21) log pj(dj|φ) + log pj+i(dj+i|φ) i=1 j=1 N−j(cid:88) l(cid:88) l(cid:88) (N − j) log pj(dj|φ) + l(cid:88) (N − 1) log pk(dk|φ) + j=1 = = l(cid:88) (j − 1) log pj(dj|φ) + N(cid:88) l log pk(dk|φ). j=1 N(cid:88) j=l+1 l log pj(dj|φ) k=1 Since j ∈ {1, . . . , l} and the equality 3.60 holds. Proof of Proposition 6. bj+i = k=l+1 N − 1 j + i ≤ l j + i > l l , The proof is identical with that of Proposition 3 since each conditioned pair in the copula density cj,(j+i);1,...,(j−1) is unique, and a C-vine is a special case of R-vine. Proof of Lemma 2. As we discussed in the proof of Proposition 3, the construction of R-vine implies that each tree Ti, for i = 1, . . . , N−1 has exactly N−i edges (pairs of conditioned variables). Moreover, each tree Ti corresponds to copulas with the conditioning set of size i − 1. Therefore, for X being the cardinality of the conditioning set, we get P (X = i) = for i ∈ {0, . . . , N − 2}. N − (i + 1) (cid:0)N (cid:1) 2 N−2(cid:88) i=0 i Now E(X) = = N − (i + 1) (cid:1) (cid:0)N (cid:20) 2 2 N (N − 1) (N − 1)[ 2 = N (N − 1) (N − 2)(N − 1) i=0 [i(N − 1) − i2] ] − (N − 2)(N − 1)(2N − 3) 6 (cid:21) N − 2 3 . = N−2(cid:88) 2 84 Where the equality on the second line is due to the standard algebraic results on the sum of powers of the first first N integers. Proof of Lemma 3. Analogically to the proof of Lemma 2, while recalling the number of edges for any l-truncated R-vine provided in the proof of Proposition 3, we have for the cardinality of the conditioning set X: Now P (X = i) = N − (i + 1) l(2N−(l+1)) 2 for i ∈ {0, . . . , l − 1}. l−1(cid:88) i=0 i E(X) = N − (i + 1) l(2N−(l+1)) 2 2 (cid:20) 2 = l(2N − (l + 1)) (N − 1)[ (l − 1)l 2 l−1(cid:88) [i(N − 1) − i2] ] − (l − 1)l(2l − 1) (cid:21) i=0 6 = = l(2N − (l + 1)) (l − 1)(3N − 2l − 2) 3(2N − l − 1) . 3.5.3 Supplement for the calibration of the Liquid Drop Model GP specifications. 
In the case of the LDM E_B(Z, N), we consider the GP prior with mean zero and the covariance function

\eta_E \cdot \exp\Big( -\frac{(Z - Z')^2}{2\nu_Z^2} - \frac{(N - N')^2}{2\nu_N^2} - \frac{(a_{vol} - a'_{vol})^2}{2\nu_1^2} - \frac{(a_{surf} - a'_{surf})^2}{2\nu_2^2} - \frac{(a_{sym} - a'_{sym})^2}{2\nu_3^2} - \frac{(a_C - a'_C)^2}{2\nu_4^2} \Big).

Similarly, we consider the GP prior for the systematic discrepancy δ(Z, N) with mean zero and covariance function

\eta_\delta \cdot \exp\Big( -\frac{(Z - Z')^2}{2 l_Z^2} - \frac{(N - N')^2}{2 l_N^2} \Big).

Experimental design. Kennedy and O'Hagan (2001) recommend selecting the calibration inputs for the model runs so that any plausible value θ of the true calibration parameter is covered. In this context, we consider the space of calibration parameters to be centered at the values of the least squares estimates θ̂_{L_2} and broad enough to contain the majority of values provided by the nuclear physics literature (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Kirson, 2008; Benzaid et al., 2020). Table 3.4 gives the lower and upper bounds for the parameter space, so that Lower bound = θ̂_{L_2} − 15 × SE(θ̂_{L_2}) and Upper bound = θ̂_{L_2} + 15 × SE(θ̂_{L_2}). Here SE(θ̂_{L_2}) is given by standard linear regression theory.

Parameter   Lower bound   Upper bound
a_vol       15.008        15.829
a_surf      15.628        18.193
a_sym       21.435        23.505
a_C         0.665         0.720

Table 3.4: The space of calibration parameters used for generating the outputs of the semi-empirical mass formula (1.1).

Prior distributions. First, we consider independent Gaussian distributions centered at the LS estimates θ̂_{L_2} (in Table 3.3) with standard deviations 7.5 × SE(θ̂_{L_2}), so that the calibration parameters used for generating the model runs are covered roughly within two standard deviations of the priors. Namely,

avol ∼ N(15.42, 0.203), asurf ∼ N(16.91, 0.645), asym ∼ N(22.47, 0.525), aC ∼ N(0.69, 0.015).

The prior distributions for the hyperparameters of the GPs were selected as Gamma(α, β), with shape parameter α and scale parameter β, so that they represent vague knowledge about the scale of these parameters given by the literature on nuclear mass models (Weizsäcker, 1935; Bethe and Bacher, 1936; Myers and Swiatecki, 1966; Fayans, 1998; Kirson, 2008; McDonnell et al., 2015; Kortelainen et al., 2010a, 2012, 2014; Benzaid et al., 2020; Kejzlar et al., 2020). In particular, the error scale σ is, in the majority of nuclear applications, within units of MeV; we therefore set σ ∼ Gamma(2, 1), with the scale of the systematic error being η_δ ∼ Gamma(10, 1), to allow this quantity to range between units and tens of MeV. It is also reasonable to assume that the mass of a given nucleus is correlated mostly with its neighbours on the nuclear chart. We express this notion through the reasonably wide prior distributions

l_Z ∼ Gamma(10, 1), l_N ∼ Gamma(10, 1), ν_Z ∼ Gamma(10, 1), ν_N ∼ Gamma(10, 1), ν_i ∼ Gamma(10, 1), i = 1, 2, 3, 4.

Finally, the majority of the masses in the training dataset of 2000 experimental binding energies fall into the range [1000, 2000] MeV (1165 masses, to be precise). We consider the following prior distribution for the parameter η_f to reflect the scale of the experimental binding energies: η_f ∼ Gamma(110, 10).
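The following sketch (not the thesis code) shows how a Latin hypercube design over the parameter box in Table 3.4 can be generated with SciPy's quasi-Monte Carlo module.

```python
# Generate 10^4 Latin hypercube settings of (a_vol, a_surf, a_sym, a_C) within
# the bounds of Table 3.4.
from scipy.stats import qmc

lower = [15.008, 15.628, 21.435, 0.665]
upper = [15.829, 18.193, 23.505, 0.720]

sampler = qmc.LatinHypercube(d=4, seed=1)
theta_design = qmc.scale(sampler.random(n=10_000), lower, upper)
```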
87 CHAPTER 4 EMPIRICAL BAYES CALIBRATION OF COMPUTER MODELS WITH CONSISTENT PREDICTIONS Up to this point, we have seen that the Bayesian framework for computer-model-aided inference, described in detail in Section 1.2 and at the beginning of Chapter 3, provides a statistically principled way to account for various sources of uncertainty and leads to bet- ter predictions. It can be especially powerful in scenarios where computer models under consideration are complex and computationally too expensive to be used directly for pre- dictions with quantified uncertainties, because each evaluation of such models often takes several days. Despite these advantages, we have also identified many challenges that make the implementation of the Kennedy and O’Hagan (2001) framework challenging in practice. Let us recall that under a fully Bayesian treatment, the predictions of new values y∗ of a physical ζ using a computer model fm are specified by the posterior predictive distri- bution p(y∗|d). The dataset d here and for the rest of this chapter consists of n obser- vations yi from the physical process ζ and s evaluations zj of the computer model fm, i.e. d = (d1, . . . , dn+s) := (y, z), and follows the multivariate normal distribution (1.4). The pre- dictive distribution p(y∗|d) is obtained by integrating the conditional density p(y∗|d, θ, γ, σ), which is a multivariate normal density given by the statistical model (1.3) and the specifi- cation of GPs, against the posterior density p(θ, γ, σ|d), namely (cid:90) p(y∗|d) = p(y∗|d, θ, γ, σ)p(θ, γ, σ|d) dθ dγ dσ. (4.1) φ An analogical relationship also holds for the predictions of new realizations of the physical process ζ∗. The posterior density p(θ, γ, σ|d), however, does not have a closed form in general and one typically resorts to MCMC methods for approximation. Additionally, the nature of the likelihood p(d|θ, γ, σ) makes the problem hard to scale due to the complex structure of the covariance matrix K(θ, γ, σ) (see (1.6)). In Chapter 3, we developed a novel VBI algorithm that provides an efficient and scalable alternative to the traditional 88 MCMC methods. Nevertheless, the practical implementation of either the MCMC or our VBI approach can be a non-trivial task and requires some practical experience. As an easy-to-implement alternative that avoids the difficulties described above, we pro- pose an empirical Bayes approach for fast and statistically principled predictions of physical quantities using imperfect computer models which instead of placing a (prior) distribution on (θ, γ, σ) estimates these parameters directly form the data. One can therefore utilize the convenience of GPs to obtain closed form, simple, and fast predictions given by the conditional distribution p(y∗|d, θ, γ, σ) (or p(ζ∗|d, θ, γ, σ)). The proposed approach can be viewed as an approximation of the fully Bayesian treatment that neglects some of the uncertainty associated with the unknown parameters. Our contributions are the following. First, we present a fast and easy to implement framework for computer-model-enabled predictions and provide two alternative plug-in es- timators for all the unknown quantities involved. Second, we offer a new perspective on the Kennedy and O’Hagan (2001) framework and provide its representation as a Bayesian hierarchical model. 
This alternative representation allows us to discuss the framework in the context of non-parametric regression problems with GP priors and establish our methods’ theoretical validity through a posterior consistency result. Lastly, we validate the empirical Bayes approach empirically through a simulation study, and illustrate our methodology on a real data application in nuclear physics. The rest of this chapter is organized as follows. In Section 4.1, we show the equivalence of the general framework for Bayesian calibration of computer models with a Bayesian hierar- chical model. Then, in Section 4.2, we discuss the theoretical properties of our approach and establish its posterior consistency. Section 4.3 defines two plug-in estimators for GP model parameters and a consistent estimator of a noise scale component. Section 4.4 contains a simulation study that empirically validates the methodology in this chapter. A real-data application is also included in Section 4.4. 89 4.1 Hierarchical model for Bayesian calibration of computer mod- els Here we show that we can represent the model of Kennedy and O’Hagan (2001) described in Section 1.2, hierarchically, as the following hypotheses about the observations yi, the computer model evaluations zj, and a set of prior distributions. Model for data: yi = ζ(ti) + σi zj = fm((cid:101)tj,(cid:101)θj), i.i.d.∼ N (0, σ). i i = 1, . . . , n, j = 1, . . . , s, (4.2) (4.3) (4.4) Priors: δ(t) ∼ GPδ(mδ(t), kδ(t, t(cid:48))), given γ and independent of  and σ, fm(t, θ) ∼ GPf (mf (t, θ), kf ((t, θ), (t(cid:48), θ(cid:48)))), ζ(t)|θ, γ ∼ GPf + GPδ. given γ and independent of i, σ, and δ, Under this model, the conditional likelihoods for yi and zj are ˜fm (cid:90) (cid:90) n(cid:89) (cid:90) (cid:90) (cid:90) ζ ζ ζ = = i 90 n(cid:89) s(cid:89) i ˜fm p(yi|ζi, σ)p(ζ, z|θ, γ) dζ, j p(yi|ζ(ti), σ) = p(zj|fm((cid:101)tj,(cid:101)θj)) = 1 σ (cid:18) √ 1 2π zj =fm((cid:101)tj ,(cid:101)θj ) exp (cid:19) , (4.5) − (yi − ζ(ti))2 2σ2 (zj), where p(zj|fm((cid:101)tj,(cid:101)θj)) is a likelihood with the point mass at zj = fm((cid:101)tj,(cid:101)θj). Consequently, (cid:90) (cid:90) the equivalence of the two formulations is given through the equality between the likelihood (1.4) and the integral (4.6) p(ζ, ˜fm, d|θ, γ, σ) d ˜fm dζ = p(d|ζ, ˜fm, θ, γ, σ)p(ζ, ˜fm|θ, γ) d ˜fm dζ ζ ˜fm p(yi|ζi, σ) p(zj| ˜fm,j)p(ζ, ˜fm|θ, γ) d ˜fm dζ where ζ = (ζ(t1), . . . , ζ(tn)) = (ζ1, . . . , ζn) and ˜fm = (fm((cid:101)t1,(cid:101)θ1), . . . , fm((cid:101)ts,(cid:101)θs)). The likelihood p(ζ, z|θ, γ) is the multivariate normal distribution with the mean M (θ, γ) (see (1.5)) and the covariance (cid:32) Kp(θ, γ) = (cid:33) Kf (Ty(θ), Ty(θ)) + Kδ(Ty, Ty) Kf (Ty(θ), Tz((cid:101)θ)) Kf (Tz((cid:101)θ), Tz((cid:101)θ)) Kf (Tz((cid:101)θ), Ty(θ)) . Again, Kf (Ty(θ), Ty(θ)) is the matrix with (i, j) element kf ((ti, θ), (tj, θ)), Kδ(Ty, Ty) is the matrix with (i, j) element kδ(ti, tj), and Kf (Tz((cid:101)θ), Tz((cid:101)θ)) is the matrix with (i, j) element kf (((cid:101)ti,(cid:101)θi), ((cid:101)tj,(cid:101)θj)). Kf (Ty(θ), Tz((cid:101)θ)) is defined analogically with the kernel kf . We leave the details of the integral computation for Section 4.5.1. This representations of the model is crucial for the theoretical results obtained in the subsequent section. It reframes the Bayesian model as a version of a non-parametric regression problem with GP prior for ζ(t) and an additive noise. Additionally, we can gain a further insight into the role of the set of model runs z. 
Let us consider a function space F and a subset (cid:101)F ⊂ F, then p(ζ ∈ (cid:101)F|d, θ, γ, σ) ∝ (cid:90) (cid:101)F n(cid:89) i p(yi|ζi, σ)p(ζ|z, θ, γ) dζ. (4.7) One can therefore interpret the model runs z as an additional information provided by the computer model fm that enhances the GP prior p(ζ|z, θ, γ) for the physical process ζ, having the mean function m(cid:88) (cid:104) (cid:105)(cid:104) kf ((t, θ), ((cid:101)tj,(cid:101)θj)) (cid:105) zi − mf ((cid:101)ti,(cid:101)θi) mζ (t) = mf (t, θ) + mδ(t) + κj,i and the covariance function i,j=1 kζ (t, t(cid:48)) = kf ((t, θ), (t(cid:48), θ)) + kδ(t, t(cid:48)) (cid:105)(cid:104) kf ((t, θ), ((cid:101)tj,(cid:101)θj)) − m(cid:88) (cid:104) κj,i (cid:105) kf (((cid:101)ti,(cid:101)θi), (t(cid:48), θ)) , where κj,i is the (j, i) element of the matrix Kf (Tz((cid:101)θ), Tz((cid:101)θ))−1. i,j=1 91 , (4.8) (4.9) 4.2 Posterior consistency, a theoretical validation The revealing consequence of the previous section is that the Kennedy and O’Hagan (2001) framework is equivalent to the non-parametric regression model of an unknown func- tion ζ(t) with the prior distribution p(ζ|z, θ, γ). This is not only a new perspective on the popular framework, but also happens to be the key step that allows us to validate our em- pirical Bayes approach theoretically and establish the posterior consistency of the physical process when the prior p(ζ|z, θ, γ) satisfies certain properties. In the reminder of this section, we assume that the true physical process ζ0 is a con- tinuously differentiable function on the compact and convex set Ω ⊂ Rp. Without loss of generality, we take Ω = [0, 1]p. Additionally, we shall assume the hyperparameters (θ, γ) take values in a set Υ. For any ν > 0, we aim to establish, under suitable conditions, the following: p(ζ ∈ W C ν,n|y1, . . . , yn, z, θ, γ, ˆσn) −−−→ n 0 a.s. P0, (4.10) sup (θ,γ)∈Υ where P0 denotes the joint conditional distribution of {yi}∞ strongly consistent estimator of σ0, and i=1 given true ζ0 and σ0, ˆσn is a Wν,n = with Qn being the empirical measure on the design points given as Qn(t) = n−1(cid:80)n i=1 1ti(t). In Theorem 2, we first present a general result on the consistency of non-parametric |ζ(t) − ζ0(t)| dQn(t) ≤ ν (4.11) regression problems and subsequently discuss the theorem’s conditions in the context of the model described in Section 4.1. This is based on the work of Choi and Schervish (2007a) and Choi (2007), where the authors assume σ is included in Wν,n, and the posterior consistency is derived jointly for ζ and σ. On the other hand, the consistency of ζ conditioned on ˆσn requires a non-trivial modification of their original results. The proof of Theorem 2 is given in Section 4.5.2. Theorem 2. Let {yi}∞ i=1 be independently and normally distributed with the mean ζ(ti) and the standard deviation σ with respect to a common σ-finite measure, where ζ belongs to a 92 (cid:90) (cid:26) ζ : (cid:27) , space of continuously differentiable functions on [0, 1]p denoted as F, and σ > 0. Let ζ0 ∈ F and let P0 denotes the joint conditional distribution of {yi}∞ i=1 given true ζ0 and σ0. Let {Un}∞ n=1 be a sequence of subsets of F. Let ζ have a prior Π(·|θ, γ) where (θ, γ) take values in a set Υ. For any 0 <  < 1 and ζ0(ti) = ζ0,i define: p(yi|ζ0,i, σ0) p(yi|ζi, σ0(1 − )) , Λi(ζ0, ζ) = log Ki(ζ0, ζ) = Eζ0,σ0 Vi(ζ0, ζ) = Varζ0,σ0 (Λi(ζ0, ζ)), (Λi(ζ0, ζ)). 
If the following assumptions are satisfied:

(A1) Suppose there exists a set B with Π(B | θ, γ) > 0 and, for any ∆ > 0, a constant 0 < ε̃_1 < 1, so that for any ε < ε̃_1:
(i) \(\sum_{i=1}^{\infty} V_i(\zeta_0,\zeta)/i^2 < \infty\) for all ζ ∈ B,
(ii) Π(B ∩ {ζ : K_i(ζ_0, ζ) < ∆ for all i} | θ, γ) > 0.

(A2) Suppose there exist tests {Φ_n}_{n=1}^∞, sets {F_n}_{n=1}^∞, and constants C_1, C_2, c_1 > 0 and 0 < ε̃_2 < 1 so that:
(i) \(\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n < \infty\),
(ii) \(\sup_{(\theta,\gamma)\in\Upsilon} \Pi(F_n^C \mid \theta, \gamma) < C_1 e^{-c_1 n}\),
(iii) there exists a constant c > 0 such that for any 0 < ε < ε̃_2 the inequality c + log(1 − ε) − log(1 + ε) > 0 holds and \(\sup_{\zeta \in U_n^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}(1 - \Phi_n) \leq C_2 e^{-cn}\).

(A3) σ̂_n is strongly consistent, i.e., σ̂_n → σ_0 a.s. P_0 as n → ∞.

Then
\[
\sup_{(\theta,\gamma)\in\Upsilon} p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0.
\]

For the purpose of generality of Theorem 2, we do not explicitly condition on the set of model runs z. It is clear from our previous discussions (see (4.7) in particular) that the model runs play the role of fixed constants in the prior distribution over ζ. The dependence on z in (4.10) arises by setting Π(ζ | θ, γ) := p(ζ | z, θ, γ).

We now consider the conditions of Theorem 2 in the context of the model in Section 4.1. These conditions fall into two general categories: one group of conditions is related to the existence of the test functions Φ_n, and the second group revolves around the conditions on the prior distribution.

Our approach to establishing the existence of test functions {Φ_n}_{n=1}^∞ that satisfy conditions (i) and (iii) in Theorem 2 is similar to that of Theorem 2 in Choi and Schervish (2007a). We consider a sieve F_n which grows to the space of continuously differentiable functions on [0, 1]^p. Namely, let
\[
F_n = \left\{\zeta : \|\zeta\|_\infty < M_n,\ \Big\|\frac{\partial}{\partial t_i}\zeta\Big\|_\infty < M_n,\ i = 1,\dots,p\right\}, \tag{4.12}
\]
where M_n = O(n^α) for some α ∈ (1/2, 1). Also, ‖·‖_∞ denotes the supremum norm. Each test is defined as a combination of tests over finitely many elements in a covering of F_n. The existence of tests in the specific case of W_{ν,n} is given in Theorem 3, with its proof provided in Section 4.5.2.

Theorem 3. Let F_n be the sieves defined in (4.12). For any ν > 0 there exist tests {Φ_n}_{n=1}^∞ and constants C > 0 and 0 < ε̃ < 1 so that:
(i) \(\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n < \infty\),
(ii) there exists a constant c > 0 such that for any 0 < ε < ε̃ the inequality c + log(1 − ε) − log(1 + ε) > 0 holds and \(\sup_{\zeta \in W_{\nu,n}^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}(1 - \Phi_n) \leq C e^{-cn}\).

To verify condition (A1) of Theorem 2, it is sufficient to show that the GP prior for ζ assigns positive probability to the following set for any δ > 0:
\[
B_\delta = \{\zeta : \|\zeta - \zeta_0\|_\infty < \delta\}. \tag{4.13}
\]
For any 0 < ε < 1, a short calculation leads to
\[
K_i(\zeta_0, \zeta) = \log(1-\varepsilon) - \frac{1}{2}\left(1 - \frac{1}{(1-\varepsilon)^2}\right) + \frac{[\zeta_0(t_i) - \zeta(t_i)]^2}{2\sigma_0^2(1-\varepsilon)^2}
\leq \log(1-\varepsilon) - \frac{1}{2}\left(1 - \frac{1}{(1-\varepsilon)^2}\right) + \frac{\|\zeta_0 - \zeta\|_\infty^2}{2\sigma_0^2(1-\varepsilon)^2}.
\]
Let a(ε) = log(1 − ε) − 1/2 + 1/[2(1 − ε)²]; it is easy to see that a(ε) is positive for ε ∈ (0, 1) and continuous at ε = 0 with a(0) = 0. Therefore, for every ∆ > 0, there exist δ > 0 and 0 < ε̃ < 1 so that K_i(ζ_0, ζ) < ∆ for all i, all ζ ∈ B_δ, and any ε < ε̃. Additionally, for any ε < ε̃ and any δ > 0,
\[
V_i(\zeta_0, \zeta) = \frac{1}{2}\left[\frac{1}{(1-\varepsilon)^2} - 1\right]^2 + \left[\frac{\zeta_0(t_i) - \zeta(t_i)}{\sigma_0(1-\varepsilon)^2}\right]^2,
\]
which is bounded uniformly in i on B_δ, and as a result, for all ζ ∈ B_δ, \(\sum_{i=1}^{\infty} V_i(\zeta_0,\zeta)/i^2 < \infty\).

The prior condition (ii) of (A2) for the sieve F_n in (4.12) is addressed in Lemma 4 (for the proof see Section 4.5.2).

Lemma 4.
Let the mean function m_ζ(·) of the GP prior for ζ defined on [0, 1]^p be continuously differentiable, and let the covariance function k_ζ(·,·) have mixed partial derivatives up to order 4 that are continuous. Define
\[
\rho_0^2(\theta,\gamma) = \sup_{t\in[0,1]^p}\mathrm{Var}\left(\zeta(t) \mid z, \theta, \gamma\right), \qquad
\rho_i^2(\theta,\gamma) = \sup_{t\in[0,1]^p}\mathrm{Var}\left(\frac{\partial}{\partial t_i}\zeta(t)\,\Big|\, z, \theta, \gamma\right), \quad i = 1,\dots,p.
\]
Suppose that Υ is a compact set, and the ρ_i² are continuous functions of (θ, γ) for all (θ, γ) ∈ Υ, i = 0, ..., p. Then there exist constants C, c > 0 so that
\[
\sup_{(\theta,\gamma)\in\Upsilon} p(F_n^C \mid z, \theta, \gamma) < C e^{-cn},
\]
where F_n are the sieves defined in (4.12).

Below we present the almost sure consistency result (4.10) as a corollary of Theorem 2, Theorem 3, and Lemma 4.

Corollary 1. Let P_0 denote the joint conditional distribution of {y_i}_{i=1}^∞ given the true ζ_0 and σ_0. Let m_ζ(·) and k_ζ(·,·) be the mean and covariance functions of the GP prior for ζ satisfying the conditions of Lemma 4. Assume Υ is a compact set, and for any δ > 0, p(B_δ | z, θ, γ) > 0. If σ̂_n is a strongly consistent estimator of σ_0, then for any ν > 0
\[
\sup_{(\theta,\gamma)\in\Upsilon} p(\zeta \in W_{\nu,n}^C \mid y_1,\dots,y_n, z, \theta, \gamma, \hat{\sigma}_n) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0. \tag{4.14}
\]

Prior conditions. The prior positivity condition requiring p(B_δ | z, θ, γ) > 0 for any δ was extensively studied by Ghosal and Roy (2006) and Tokdar and Ghosh (2007). Theorem 4 of Ghosal and Roy (2006) implies that this condition is satisfied for a GP with continuous sample paths and continuous mean and covariance functions, as long as ζ_0 and m_ζ belong to the reproducing kernel Hilbert space (RKHS) of k_ζ. The continuity of the GP's sample paths is given by the application of Theorem 5 in Ghosal and Roy (2006), which requires the same continuity conditions as Lemma 4 in this section (excluding those on the ρ_i²). It should be clear from (4.8) and (4.9) that m_ζ is continuously differentiable on [0, 1]^p and k_ζ has continuous mixed partial derivatives up to 4th order on [0, 1]^p, as long as the same holds for m_f and m_δ and, respectively, for k_f and k_δ. Tokdar and Ghosh (2007) show that the RKHS of k_ζ spans the space of continuously differentiable functions on [0, 1]^p if k_ζ is a product of p isotropic and integrable univariate covariance functions with continuous mixed partial derivatives up to order 4. For example, the squared exponential covariance function satisfies these requirements, including the continuity of ρ_i² for i = 0, ..., p. This, of course, does not directly imply that such choices for m_f and m_δ, and k_f and k_δ respectively, result in the conditional mean m_ζ and covariance k_ζ satisfying these sufficient conditions. For broader applicability of our results, we note that further investigation of specific choices of mean and covariance functions that satisfy the desired conditions is needed. We intend to address this in our future work. Nevertheless, the simulation study conducted in Section 4.4.1 strongly suggests that choosing the squared exponential kernel leads to consistent predictions.

4.3 Parameter estimation and prediction

Thus far, we have established that the empirical Bayes framework provides a principled approach for inference and enjoys good theoretical properties, all this assuming a (strongly) consistent estimator of σ_0, smoothness of the prior mean and covariance functions, and GP hyperparameters (θ, γ) taking values in some compact set.
In this section, we first propose a strongly consistent estimator of the true noise scale σ_0 and two different plug-in estimators of (θ, γ), defined as minimizers of two alternative loss functions. In particular, we consider the negative data log-likelihood and the negative predictive log-likelihood combined with K-fold cross-validation. Second, we provide the complete empirical Bayes algorithm for simple and fast predictions of physical quantities using (imperfect) computer models.

Let us consider n observations y_i from the physical process under the model (4.2). We propose the following estimator of the noise variance σ_0²:
\[
\hat{\sigma}_n^2 = \frac{\sum_{i=1}^{n-1}(y_{i+1} - y_i)^2}{2(n-1)}. \tag{4.15}
\]

Theorem 4. Suppose ζ_0(t) represents the true physical process and σ_0² is the true value of the experimental error variance, where t ∈ Ω, a compact and convex subset of R^p, and ζ_0 is continuously differentiable on Ω. Let P_0 denote the joint conditional distribution of {y_i}_{i=1}^∞ given the true ζ_0 and σ_0². Also assume the following holds for the design points t_i:
\[
\sup_{i\in\{1,\dots,n-1\},\, j\in\{1,\dots,p\}} |t_{i+1,j} - t_{i,j}| \xrightarrow[n\to\infty]{} 0; \tag{AD}
\]
then
\[
\hat{\sigma}_n^2 \xrightarrow[n\to\infty]{} \sigma_0^2 \quad \text{a.s. } P_0. \tag{4.16}
\]

The proof of Theorem 4 is given in Section 4.5.2. The continuous mapping theorem directly implies the following.

Corollary 2. Under the assumptions of Theorem 4,
\[
\hat{\sigma}_n = \sqrt{\hat{\sigma}_n^2} \xrightarrow[n\to\infty]{} \sigma_0 \quad \text{a.s. } P_0. \tag{4.17}
\]

Remark 1. The assumption (AD) is satisfied by a design that contains at least one point in each hypercube H in Ω with Lebesgue measure λ(H) ≥ 1/(Kn), for some constant 0 < K ≤ 1. This is, for example, the case of an equally spaced design.

4.3.1 Estimation of hyperparameters

4.3.1.1 Marginal data likelihood

We first consider estimates of (θ, γ) as minimizers of a loss function that is reminiscent of the standard maximum likelihood approach, namely
\[
L_{MLE}(\theta, \gamma) = -\log p(d \mid \theta, \gamma, \hat{\sigma}_n), \tag{4.18}
\]
with the negative log-likelihood being
\[
-\log p(d \mid \theta, \gamma, \hat{\sigma}_n) = \frac{1}{2}(d - M(\theta,\gamma))^T K(\theta,\gamma,\hat{\sigma}_n)^{-1}(d - M(\theta,\gamma)) + \frac{1}{2}\log|K(\theta,\gamma,\hat{\sigma}_n)| + \frac{n+s}{2}\log 2\pi.
\]
We can readily interpret the minimizer of L_MLE as a trade-off between the data-fit term ½(d − M(θ, γ))^T K(θ, γ, σ̂_n)^{-1}(d − M(θ, γ)) and the model-complexity penalty ½ log|K(θ, γ, σ̂_n)|, which depends only on the model parameters and the variable inputs.

4.3.1.2 Predictive likelihood with K-fold cross-validation

Another viable approach to estimating the parameters (θ, γ) is to base them on the model's predictive performance on unseen data. Cross-validation is a popular and robust approach to estimating this predictive performance that has been utilized across many statistical applications; see Sundararajan and Keerthi (2001), Rasmussen and Williams (2006), and Martino et al. (2017) for applications with GPs. Here, we consider a K-fold cross-validation where the basic idea is to randomly partition the training dataset into K subsets of equal size. We then select K − 1 subsets for training and use the held-out data as a proxy for estimating the predictive performance. This is repeated until we exhaust all K subsets for the purpose of validation, with typical choices for K being 3, 5, 10, or n (leave-one-out cross-validation). Formally, let y_i represent the i-th subset of the observations y and y_{−i} = y \ y_i. The negative predictive log-likelihood under the K-fold cross-validation is
\[
L_{CV(K)}(\theta, \gamma) = -\sum_{i=1}^{K}\log p(y_i \mid y_{-i}, z, \theta, \gamma, \hat{\sigma}_n). \tag{4.19}
\]
The cross-validation should be more robust against model misspecification and overfitting (Wahba, 1990).
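As a minimal illustration of the estimation step, the following Python sketch computes the noise-scale estimate (4.15) from the ordered observations and obtains (θ̂, γ̂) by numerically minimizing a user-supplied loss such as (4.18) or (4.19). The function names are illustrative and not part of the original text, and the loss itself (e.g., an implementation of the negative log-likelihood above) is assumed to be provided separately.

```python
import numpy as np
from scipy.optimize import minimize

def noise_scale_estimate(y):
    # Estimator (4.15): sigma_hat_n from successive differences of the observations.
    # Assumes y is ordered so that neighboring design points are close, in the
    # spirit of assumption (AD).
    y = np.asarray(y, dtype=float)
    sigma2_hat = np.sum(np.diff(y) ** 2) / (2.0 * (len(y) - 1))
    return np.sqrt(sigma2_hat)

def fit_hyperparameters(loss, init, bounds=None):
    # Plug-in estimation of (theta, gamma): minimize a scalar loss such as
    # L_MLE (4.18) or L_CV(K) (4.19), supplied by the caller as `loss`.
    result = minimize(loss, x0=np.asarray(init, dtype=float), method="L-BFGS-B", bounds=bounds)
    return result.x
```

In practice, the supplied loss function would evaluate (4.18) or (4.19) by building M(θ, γ) and K(θ, γ, σ̂_n) from the chosen mean and covariance functions and the data d.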
4.3.2 Algorithm for predictions

One of the main benefits of the empirical Bayes approach is that once we estimate the unknown parameters (θ, γ, σ), we can obtain a closed-form predictive distribution given these estimates. Formally, let us consider a set of new inputs (t_1^*, ..., t_J^*) at which we want to obtain the predictions according to the model (1.3). As discussed in Section 3.4, the joint normality between d and y^* implies that the conditional distribution p(y^* | d, θ, γ, σ) is a multivariate normal distribution with the mean vector
\[
M_{y^*}(\theta,\gamma,\sigma) = M_f(T_y^*(\theta)) + M_\delta(T_y^*) + C^* K(\theta,\gamma,\sigma)^{-1}(d - M(\theta,\gamma)) \tag{4.20}
\]
and the covariance matrix
\[
K_{y^*}(\theta,\gamma,\sigma) = K_f(T_y^*(\theta), T_y^*(\theta)) + K_\delta(T_y^*, T_y^*) + \sigma^2 I_J - C^* K(\theta,\gamma,\sigma)^{-1} C^{*T}, \tag{4.21}
\]
where
\[
C^* = \Big(\, K_f(T_y^*(\theta), T_y(\theta)) + K_\delta(T_y^*, T_y) \quad\; K_f(T_y^*(\theta), T_z(\tilde{\theta})) \,\Big). \tag{4.22}
\]
Similarly to the conditional covariance matrices discussed previously, K_f(T_y^*(θ), T_y(θ)) is the matrix with (i, j) element k_f((t_i^*, θ), (t_j, θ)), and K_δ(T_y^*, T_y) is the matrix with (i, j) element k_δ(t_i^*, t_j). The matrix K_f(T_y^*(θ), T_z(θ̃)) is defined accordingly with the kernel k_f. An analogous relationship holds for the conditional distribution of the new realizations from the physical process, p(ζ^* | d, θ, γ, σ), where the mean vector is identical to (4.20) and the covariance matrix is
\[
K_{\zeta^*}(\theta,\gamma,\sigma) = K_f(T_y^*(\theta), T_y^*(\theta)) + K_\delta(T_y^*, T_y^*) - C^* K(\theta,\gamma,\sigma)^{-1} C^{*T}. \tag{4.23}
\]
Algorithm 4.1 summarizes the procedure for predictions of physical quantities using imperfect and computationally expensive computer models.

Algorithm 4.1: Empirical Bayes algorithm for predictions of physical quantities.
Input: Data d, mean and covariance functions for the GPs, and new inputs (t_1^*, ..., t_J^*).
1. Use the experimental observations y_1, ..., y_n to compute σ̂_n = (σ̂_n²)^{1/2}.
2. Minimize either L_MLE(θ, γ) or L_CV(K)(θ, γ) to obtain the estimates (θ̂, γ̂).
3. Compute M_{y^*}(θ̂, γ̂, σ̂_n) and K_{y^*}(θ̂, γ̂, σ̂_n), or M_{ζ^*}(θ̂, γ̂, σ̂_n) and K_{ζ^*}(θ̂, γ̂, σ̂_n) respectively, to get the posterior predictive distribution.

4.4 Applications

The main objective of this section is to empirically establish the efficiency of the empirical Bayes method in Algorithm 4.1 and to support the consistency result presented in Section 4.2, all while sacrificing minimally in terms of the fidelity of UQ as compared to the fully Bayesian treatment. To this end, we consider a simulation study where we compare our method (under both L_MLE and L_CV(K)) to a fully Bayesian treatment with posterior samples obtained using the standard MH algorithm. Finally, we revisit the LDM and illustrate our methodology in a real-data scenario.

4.4.1 Transverse harmonic wave

Let us consider a simple computer model representing a periodic wave disturbance that moves through a medium and causes displacement of individual atoms or molecules in the medium. This is called a transverse harmonic wave, where the displacement f_m((t, x), θ) of a particle at location x over time t is given by
\[
f_m((t, x), \theta) = \theta_1 \sin(kx - \theta_2 t + \psi), \tag{4.24}
\]
where θ_1 represents the amplitude of the wave, and θ_2 is the frequency of the wave. The model also depends on the wave number k, which is reciprocal to the wavelength, and the phase constant ψ. For the purpose of this example, we shall consider these to be known values with k = 5 and ψ = 1, and define the model inputs (t, x) over the space [0, 1]² (we assume that the length and time units are all equal to one). A code sketch of the model is given below.
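For concreteness, here is a minimal Python sketch of the computer model (4.24) with the fixed constants k = 5 and ψ = 1; the function and variable names are illustrative and not part of the original study.

```python
import numpy as np

def f_m(t, x, theta1, theta2, k=5.0, psi=1.0):
    # Transverse harmonic wave (4.24): displacement of a particle at location x and time t.
    return theta1 * np.sin(k * x - theta2 * t + psi)

# Quick evaluation on a small grid of inputs (t, x) in [0, 1]^2, using the
# calibration values theta = (1.2, 1.8) that appear later in this section.
tt, xx = np.meshgrid(np.linspace(0.0, 1.0, 5), np.linspace(0.0, 1.0, 5))
print(f_m(tt, xx, theta1=1.2, theta2=1.8).round(3))
```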
The true physical process is modeled according to
\[
\zeta_0(t, x) = f_m((t, x), \theta) + \delta(t, x) = \theta_1 \sin(5x - \theta_2 t + 1) + \beta, \tag{4.25}
\]
where β = 1 is a constant systematic error of the model, and θ = (θ_1, θ_2) are arbitrarily set to (1.2, 1.8). We generate the experimental observations according to the model (1.3) with the true value of the observation error scale σ_0 = 0.2, where the model inputs (t, x) are chosen using the Latin hypercube design over the full space [0, 1]². The space-filling properties of the design guarantee a decreasing bias of the estimator σ̂_n with an increasing sample size. Additionally, we assume that the computer model for the periodic wave disturbance is computationally expensive and generate the set of model runs z using again the Latin hypercube design, now over [0, 1]² × [0, 2]².

We define the GP priors for f_m and δ to have zero means and the covariance functions
\[
k_f(\{t, x, \theta\}, \{t', x', \theta'\}) = \eta_f \cdot \exp\left(-\frac{\|t - t'\|^2}{2\ell_t^2} - \frac{\|x - x'\|^2}{2\ell_x^2} - \frac{\|\theta_1 - \theta_1'\|^2}{2\ell_{\theta_1}^2} - \frac{\|\theta_2 - \theta_2'\|^2}{2\ell_{\theta_2}^2}\right),
\]
\[
k_\delta(\{t, x\}, \{t', x'\}) = \eta_\delta \cdot \exp\left(-\frac{\|t - t'\|^2}{2\nu_t^2} - \frac{\|x - x'\|^2}{2\nu_x^2}\right).
\]
The hyperparameters in this scenario are therefore γ = (η_f, ℓ_t, ℓ_x, ℓ_{θ_1}, ℓ_{θ_2}, η_δ, ν_t, ν_x). For the case of the fully Bayesian treatment, we choose inverse gamma priors with mean 1/2 and variance 1/4 for (σ, η_f, η_δ), gamma priors with mean 1/3 and variance 1/9 for the length scales, and independent Gaussian distributions with mean 0 and variance 4 for the calibration parameters (θ_1, θ_2). These are non-informative priors given the spans of both the input space [0, 1]² and the parameter space [0, 2]².

Table 4.1 shows the RMSEs of predictions of new realizations from the true physical process (4.25), evaluated on a testing dataset of 225 realizations over a uniform grid on [0, 1]². The predictions are taken to be the posterior predictive means under each method. We consider the estimates of hyperparameters obtained using the L_MLE loss and the 10-fold cross-validation predictive loss function. The noise scale parameter was estimated using the consistent estimator σ̂_n defined in Section 4.3.

RMSE values on the testing dataset
                     L_MLE    L_CV(10)    Metropolis-Hastings
n = 125, s = 125     0.048    0.071       0.049
n = 250, s = 250     0.019    0.030       0.037
n = 500, s = 500     0.010    0.019       0.021

Table 4.1: The RMSE comparison of the empirical Bayes approach and the fully Bayesian treatment. The GP hyperparameters were estimated using Algorithm 4.1.

The proposed empirical Bayes approach closely matches the fully Bayesian treatment. In fact, the RMSE under the L_MLE loss is consistently the lowest and decreases monotonically with the increasing size of the dataset. This is a desirable outcome, since the empirical Bayes fit can be readily obtained in several minutes using standard numerical solvers, while sampling from posterior distributions can take hours. It took approximately 2 hours to obtain 10^4 samples in the scenario with the largest sample size on a standard PC with 4 cores.

             n = 125, s = 125           n = 250, s = 250           n = 500, s = 500
Parameter    L_MLE   L_CV(10)   MH      L_MLE   L_CV(10)   MH      L_MLE   L_CV(10)   MH
θ_1          1.217   1.208      1.197   1.251   1.251      1.160   1.206   1.166      1.207
θ_2          1.787   1.765      1.781   1.799   1.771      1.805   1.818   1.765      1.792
σ            0.328   0.328      0.198   0.259   0.259      0.208   0.228   0.228      0.206

Table 4.2: The estimates of calibration parameters and the noise scale under each method (under L_MLE and L_CV(10), the noise scale is the common estimate σ̂_n).
For completeness, we also show the estimates of the calibration parameters and the noise scale under each method in Table 4.2. Posterior means were taken as the estimates for the fully Bayesian solution. We can see again a close match between the approximate empirical Bayes method and the MH algorithm. The only notable difference is in terms of the noise scale estimate σ̂_n. This is expected, since the estimator is only asymptotically unbiased.

Figure 4.1: Detail of 95% credible bands plotted at t = 0.21.

Figure 4.2: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.21.

Figure 4.1 and Figure 4.2 show that the loss in terms of UQ is negligible for all practical purposes. We can see that the empirical Bayes approach slightly overestimates the uncertainty for smaller sample sizes, but this quickly diminishes as the sample size increases. This is likely the consequence of the inflation of the noise scale given by the bias of σ̂_n, which diminishes with the increasing sample size as expected. See Section 4.5.3 for additional figures of the empirical Bayes fit at the time locations t = 0, t = 0.43, t = 0.71, and t = 1.

4.4.2 The Liquid Drop Model revisited

To illustrate our empirical Bayes framework for computer-enabled predictions on a real-data example, we yet again consider the 4-parameter LDM of nuclear binding energies (see Section 1.1 for details). We now present an analysis of 595 experimental binding energies of even-even nuclei from the AME2003 dataset (Audi et al., 2003) (publicly available at http://amdc.impcas.ac.cn/web/masstab.html), randomly divided into a training set of 450 nuclei and a testing set of the remaining 145 nuclei; see Figure 4.3. We consider the statistical model (1.3) and model the systematic discrepancy δ with a zero-mean GP and the isotropic squared exponential covariance function.

Figure 4.3: Binding energies of even-even nuclei in the AME2003 dataset divided into a testing and a training dataset.

For the purpose of this example, we also assume that the LDM is computationally expensive (or not directly accessible) and regard it as an unknown function of (Z, N) and θ. Similarly to the discrepancy δ, we assign a GP prior to E_B(N, Z) with zero mean and the isotropic squared exponential covariance function. To this end, we additionally generated a set of 900 model evaluations using the Latin hypercube design over the space spanning all reasonable values of the parameters θ as given by the nuclear physics literature, similarly to our previous analysis in Section 3.5.3. The corresponding nuclear configurations, the inputs (Z, N), were randomly assigned to the generated values of θ from a set of twice-duplicated training nuclei.

Results. The predictions of nuclear binding energies were computed as the means of the posterior predictive distribution (4.20) conditioned on the estimates of the calibration parameters θ, the GP hyperparameters γ, and the noise scale σ̂_n.
The estimates for (θ, γ) were obtained numerically as the minimizers of L_MLE and L_CV(10). In the case of the fully Bayesian treatment, the priors for the GP hyperparameters were chosen according to Section 3.5.3.

             Parameter estimates                      Testing error
             a_vol    a_surf    a_sym    a_C          RMSE (MeV)
L_MLE        15.07    15.58     22.00    0.68         1.16
L_CV(10)     15.08    16.08     21.19    0.67         1.26
MH           15.32    16.09     22.09    0.70         1.16

Table 4.3: The RMSEs of the predictions evaluated on 145 even-even nuclei from the AME2003 dataset. The parameter estimates are also listed. The posterior means are shown in the case of the MH algorithm.

Table 4.3 gives the RMSE values calculated on the testing set of 145 even-even nuclei for the empirical Bayes approach and also for the MH algorithm. The calibration parameter estimates are also provided, with values that do not differ significantly between the methods considered. The resulting RMSEs are 1.1–1.3 MeV, which is consistent with our previous study in Section 3.4.2 that was conducted on the whole AME2003 dataset using the VBI approach. We also carried out a simple least squares fit of the LDM, with a resulting RMSE of 4.10 MeV evaluated on the same testing set of even-even nuclei. This is an improvement that is consistent with our previous study on the full dataset using the VBI algorithm. Overall, this is quite a remarkable result given the considerable effort that needs to be put forth to implement the fully Bayesian solution and to obtain a sufficient amount of posterior samples.

4.5 Technical details and supplementary results

4.5.1 Equivalency of the hierarchical model

To establish the equivalency between the Bayesian model given by the data likelihood p(d | θ, γ, σ) and the hierarchical model (see Section 4.1), we need to show that the following equality holds:
\[
p(d \mid \theta, \gamma, \sigma) = \int_\zeta \prod_{i=1}^{n} p(y_i \mid \zeta_i, \sigma)\, p(\zeta, z \mid \theta, \gamma)\, d\zeta, \tag{4.26}
\]
where ζ = (ζ(t_1), ..., ζ(t_n)) = (ζ_1, ..., ζ_n) and p(ζ, z | θ, γ) is the multivariate normal distribution with the mean M(θ, γ) (see (1.5)) and the covariance
\[
K_p(\theta,\gamma) =
\begin{pmatrix}
K_f(T_y(\theta), T_y(\theta)) + K_\delta(T_y, T_y) & K_f(T_y(\theta), T_z(\tilde{\theta})) \\
K_f(T_z(\tilde{\theta}), T_y(\theta)) & K_f(T_z(\tilde{\theta}), T_z(\tilde{\theta}))
\end{pmatrix}
=
\begin{pmatrix}
C_{11} & C_{12} \\
C_{21} & C_{22}
\end{pmatrix}.
\]
For ease of notation, let us now write M(θ, γ) = (M_y^T, M_z^T)^T. Then
\[
\int_\zeta \prod_{i=1}^{n} p(y_i \mid \zeta_i, \sigma)\, p(\zeta, z \mid \theta, \gamma)\, d\zeta
= \int_\zeta \frac{\exp\left(-\frac{1}{2}(y-\zeta)^T(\sigma^2 I_n)^{-1}(y-\zeta)\right)}{(2\pi)^{n/2}|\sigma^2 I_n|^{1/2}}
\cdot \frac{\exp\left(-\frac{1}{2}\binom{\zeta - M_y}{z - M_z}^T K_p^{-1}\binom{\zeta - M_y}{z - M_z}\right)}{(2\pi)^{(n+s)/2}|K_p|^{1/2}}\, d\zeta
\]
\[
= \frac{\exp\left(-\frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right)}{(2\pi)^{(n+s)/2}|K|^{1/2}}
\cdot \frac{|K|^{1/2}}{(2\pi)^{n/2}|\sigma^2 I_n|^{1/2}|K_p|^{1/2}}
\]
\[
\quad\times \int_\zeta \exp\left(-\frac{1}{2}(y-\zeta)^T(\sigma^2 I_n)^{-1}(y-\zeta)
- \frac{1}{2}\binom{\zeta - M_y}{z - M_z}^T K_p^{-1}\binom{\zeta - M_y}{z - M_z}
+ \frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right) d\zeta
\]
\[
= \frac{1}{(2\pi)^{(n+s)/2}|K|^{1/2}}\exp\left(-\frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right)\times 1.
\]
The integral is equal to 1, since it is an integration of a multivariate normal probability density function over ζ with the covariance matrix ((σ²I_n)^{-1} + (C_{11} − C_{12}C_{22}^{-1}C_{21})^{-1})^{-1}.
Namely,
\[
\frac{|K|^{1/2}}{|\sigma^2 I_n|^{1/2}|K_p|^{1/2}}
= \frac{|C_{22}|^{1/2}|C_{11} + \sigma^2 I_n - C_{12}C_{22}^{-1}C_{21}|^{1/2}}{|\sigma^2 I_n|^{1/2}|C_{22}|^{1/2}|C_{11} - C_{12}C_{22}^{-1}C_{21}|^{1/2}}
= \frac{|C_{11} + \sigma^2 I_n - C_{12}C_{22}^{-1}C_{21}|^{1/2}}{|\sigma^2 I_n|^{1/2}|C_{11} - C_{12}C_{22}^{-1}C_{21}|^{1/2}}
= \frac{|A + B|^{1/2}}{|A|^{1/2}|B|^{1/2}}
\]
\[
= \frac{1}{(|A^{-1}||B^{-1}||A + B|)^{-1/2}}
= \frac{1}{|A^{-1}B^{-1}A + A^{-1}B^{-1}B|^{-1/2}}
= \frac{1}{|A^{-1}B^{-1}A + A^{-1}|^{-1/2}}
= \frac{1}{|A^{-1}(B^{-1} + A^{-1})A|^{-1/2}}
= \frac{1}{(|A^{-1}||B^{-1} + A^{-1}||A|)^{-1/2}}
= \frac{1}{|(B^{-1} + A^{-1})^{-1}|^{1/2}},
\]
where we used the Schur complement identity for determinants in the first equality, and A = C_{11} − C_{12}C_{22}^{-1}C_{21}, B = σ²I_n. Lastly, writing
\[
K_p^{-1} = \begin{pmatrix} C^-_{11} & C^-_{12} \\ C^-_{21} & C^-_{22} \end{pmatrix},
\]
we have
\[
\exp\left(-\frac{1}{2}(y-\zeta)^T(\sigma^2 I_n)^{-1}(y-\zeta)
- \frac{1}{2}\binom{\zeta - M_y}{z - M_z}^T K_p^{-1}\binom{\zeta - M_y}{z - M_z}
+ \frac{1}{2}\binom{y - M_y}{z - M_z}^T K^{-1}\binom{y - M_y}{z - M_z}\right)
\]
\[
\propto \exp\left(-\frac{1}{2}\zeta^T(\sigma^2 I_n)^{-1}\zeta + \zeta^T(\sigma^2 I_n)^{-1}y
- \frac{1}{2}\left[(\zeta - M_y)^T C^-_{11}(\zeta - M_y) + 2(\zeta - M_y)^T C^-_{12}(z - M_z)\right]\right)
\propto \exp\left(-\frac{1}{2}\zeta^T\left((\sigma^2 I_n)^{-1} + C^-_{11}\right)\zeta + \zeta^T b\right),
\]
where b is a constant column vector (not depending on ζ), and C^-_{11} = (C_{11} − C_{12}C_{22}^{-1}C_{21})^{-1} due to the Schur complement identity for the matrix inverse. This shows that the integral is indeed equal to 1 as stated, and the equality (4.26) holds.

4.5.2 Proofs

Proof of Theorem 2

Note that for any ε > 0, the posterior probability of interest p(ζ ∈ U_n^C | y_1, ..., y_n, θ, γ, σ̂_n) can be bounded from above as
\[
p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)
\leq p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)\,\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}
+ \mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| > \varepsilon\},
\]
where
\[
p(\zeta \in U_n^C \mid y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)\,\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}
\leq \Phi_n + \frac{I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) + I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon)}{I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)},
\]
with
\[
I_{1n} = (1-\Phi_n)\int_{U_n^C \cap F_n}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\, d\Pi(\zeta|\theta,\gamma),
\]
\[
I_{2n} = \int_{U_n^C \cap F_n^C}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\, d\Pi(\zeta|\theta,\gamma),
\qquad
I_{3n} = \int_{F}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma).
\]
Since assumption (A3) implies that 1{|σ̂_n/σ_0 − 1| > ε} → 0 a.s. P_0, it is enough to show that there exists ε > 0 so that
\[
\sup_{(\theta,\gamma)\in\Upsilon} \Phi_n \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0, \tag{4.27}
\]
\[
\sup_{(\theta,\gamma)\in\Upsilon} e^{\beta_1 n} I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0 \text{ for some } \beta_1 > 0, \tag{4.28}
\]
\[
\sup_{(\theta,\gamma)\in\Upsilon} e^{\beta_2 n} I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0 \text{ for some } \beta_2 > 0, \tag{4.29}
\]
\[
e^{\beta_3 n} I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n) \xrightarrow[n\to\infty]{} \infty \quad \text{a.s. } P_0 \text{ for some } \beta_3 > 0, \text{ uniformly in } (\theta,\gamma)\in\Upsilon, \tag{4.30}
\]
where β_3 ≤ min{β_1, β_2}. The rest of the proof follows the general steps of the proof of Theorem 1 in Choi and Schervish (2007a) and Theorem 9 in Choi (2007), with some non-trivial treatment of the constant ε. We provide step-by-step details below.

Step 1). By the Markov inequality, for any δ > 0,
\[
\sum_{n=1}^{\infty} P_0(\Phi_n > \delta) \leq \frac{1}{\delta}\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n,
\]
which, due to condition (i) of (A2) and the first Borel–Cantelli lemma, yields Φ_n → 0 a.s. P_0. Since this does not depend on (θ, γ), it implies (4.27).

Step 2).
By Fubini's theorem, and for any 0 < ε < ε̃_2,
\[
E_{\zeta_0,\sigma_0}\left(I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon)\right)
= \int_{U_n^C \cap F_n} E_{\zeta_0,\sigma_0}\left[(1-\Phi_n)\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\right] d\Pi(\zeta|\theta,\gamma)
\]
\[
\leq \left(\frac{\sigma_0(1-\varepsilon)}{\sigma_0(1+\varepsilon)}\right)^{-n}\int_{U_n^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}\left[1-\Phi_n\right] d\Pi(\zeta|\theta,\gamma)
\leq \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^{-n}\sup_{\zeta\in U_n^C \cap F_n} E_{\zeta,\sigma_0(1+\varepsilon)}\left[1-\Phi_n\right]
\leq \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^{-n} C_2 e^{-cn} = C_2 e^{-\tilde{c}n},
\]
where c̃ = c + log(1 − ε) − log(1 + ε), which together with condition (iii) of (A2) implies c̃ > 0. Thus
\[
P_0\left\{I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \geq e^{-\tilde{c}n/2}\right\} \leq C_2 e^{\tilde{c}n/2} e^{-\tilde{c}n} = C_2 e^{-\tilde{c}n/2}.
\]
Therefore, for any ε > 0 with ε < ε̃_2 there exists a constant c̃ for which the first Borel–Cantelli lemma implies
\[
e^{\tilde{c}n/4} I_{1n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0.
\]
Since this bound does not depend on (θ, γ), it implies (4.28).

Step 3). Proceeding as in Step 2), Fubini's theorem implies
\[
E_{\zeta_0,\sigma_0}\left(I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon)\right)
= E_{\zeta_0,\sigma_0}\left[\int_{U_n^C \cap F_n^C}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \hat{\sigma}_n)}{p(y_i|\zeta_{0,i}, \sigma_0)}\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \varepsilon\}\, d\Pi(\zeta|\theta,\gamma)\right]
\]
\[
\leq \left(\frac{\sigma_0(1-\varepsilon)}{\sigma_0(1+\varepsilon)}\right)^{-n}\int_{U_n^C \cap F_n^C} E_{\zeta,\sigma_0(1+\varepsilon)}[1]\, d\Pi(\zeta|\theta,\gamma)
\leq \left(\frac{1-\varepsilon}{1+\varepsilon}\right)^{-n}\Pi(F_n^C \mid \theta, \gamma).
\]
Condition (ii) of (A2) and the first Borel–Cantelli lemma then imply that for any ε < (1 − e^{-c_1})/(1 + e^{-c_1}),
\[
\sup_{(\theta,\gamma)\in\Upsilon} e^{\tilde{k}n/4} I_{2n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n, \varepsilon) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0,
\]
where k̃ = c_1 + log(1 − ε) − log(1 + ε).

Step 4). To prove (4.30), given any 0 < ρ < 1, we first observe the following:
\[
I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)
\geq I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)\,\mathbb{1}\{|\hat{\sigma}_n/\sigma_0 - 1| \leq \rho\}
\geq \left(\frac{1-\rho}{1+\rho}\right)^{n}\int_{F}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma),
\]
where the last inequality holds on the event {|σ̂_n/σ_0 − 1| ≤ ρ}, which occurs for all sufficiently large n a.s. P_0 by (A3). Let us now define log⁺(x) = max{0, log(x)} and log⁻(x) = −min{0, log(x)}, as well as
\[
W_i = \log^+\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}, \quad
K_i^+(\zeta_0, \zeta) = \int p(y_i|\zeta_{0,i}, \sigma_0)\log^+\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}\, dy_i, \quad
K_i^-(\zeta_0, \zeta) = \int p(y_i|\zeta_{0,i}, \sigma_0)\log^-\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}\, dy_i.
\]
Then we get
\[
\mathrm{Var}_{\zeta_0,\sigma_0}(W_i) = E_{\zeta_0,\sigma_0}(W_i^2) - \{K_i^+(\zeta_0, \zeta)\}^2
\leq E_{\zeta_0,\sigma_0}(W_i^2) - \{K_i(\zeta_0, \zeta)\}^2
\leq \int p(y_i|\zeta_{0,i}, \sigma_0)\left(\log\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}\right)^2 dy_i - \{K_i(\zeta_0, \zeta)\}^2 = V_i(\zeta_0, \zeta).
\]
Hence, by condition (i) of (A1), for any ρ < ε̃_1 and ζ ∈ B,
\[
\sum_{i=1}^{\infty}\frac{\mathrm{Var}_{\zeta_0,\sigma_0}(W_i)}{i^2} \leq \sum_{i=1}^{\infty}\frac{V_i(\zeta_0, \zeta)}{i^2} < \infty,
\]
and by Kolmogorov's strong law of large numbers for independent non-identically distributed random variables (e.g.
Shiryaev (1996), Chapter 3),
\[
\frac{1}{n}\sum_{i=1}^{n}\left(W_i - K_i^+(\zeta_0, \zeta)\right) \xrightarrow[n\to\infty]{} 0 \quad \text{a.s. } P_0.
\]
As a result, for every ζ ∈ B, with P_0 probability 1,
\[
\liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}
= -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}
\geq -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log^+\frac{p(y_i|\zeta_{0,i}, \sigma_0)}{p(y_i|\zeta_i, \sigma_0(1-\rho))}
\]
\[
= -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}K_i^+(\zeta_0, \zeta)
\geq -\limsup_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\left(K_i(\zeta_0, \zeta) + \sqrt{\frac{K_i(\zeta_0, \zeta)}{2}}\right)
\geq -\limsup_{n\to\infty}\left(\frac{1}{n}\sum_{i=1}^{n}K_i(\zeta_0, \zeta) + \sqrt{\frac{1}{n}\sum_{i=1}^{n}\frac{K_i(\zeta_0, \zeta)}{2}}\right).
\]
The equality replacing the averaged log⁺ terms by the averaged K_i^+ follows from the almost sure convergence proved in the previous paragraph; the second-to-last inequality follows from Amewou-Atisso et al. (2003), and the final one from the concavity of the square root.

We now make use of condition (ii) of (A1). Let us consider β > 0 and select ∆ so that ∆ + sqrt(∆/2) ≤ β/8, and define C = B ∩ {ζ : K_i(ζ_0, ζ) < ∆ for all i}. By (A1) there exists ε̃_1 so that 0 < ρ < ε̃_1 implies Π(C | θ, γ) > 0. Therefore, for each ζ ∈ C,
\[
\liminf_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\log\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}
\geq -\left(\Delta + \sqrt{\frac{\Delta}{2}}\right) \geq -\frac{\beta}{8},
\]
since (1/n)∑_{i=1}^{n} K_i(ζ_0, ζ) < ∆ for all ζ ∈ C. Finally, for any ρ < min{ε̃_1, (1 − e^{-β/8})/(1 + e^{-β/8})},
\[
\liminf_{n\to\infty} e^{2n\beta/8} I_{3n}(y_1,\dots,y_n, \theta, \gamma, \hat{\sigma}_n)
\geq \liminf_{n\to\infty} e^{2n\beta/8}\left(\frac{1-\rho}{1+\rho}\right)^{n}\int_{F}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma)
\]
\[
\geq \liminf_{n\to\infty} e^{2n\beta/8}\left(\frac{1-\rho}{1+\rho}\right)^{n}\int_{C}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma)
\geq \int_{C}\liminf_{n\to\infty} e^{2n\beta/8}\left(\frac{1-\rho}{1+\rho}\right)^{n}\prod_{i=1}^{n}\frac{p(y_i|\zeta_i, \sigma_0(1-\rho))}{p(y_i|\zeta_{0,i}, \sigma_0)}\, d\Pi(\zeta|\theta,\gamma) = \infty,
\]
where the last inequality follows from Fatou's lemma. Note that the actual bound on I_{3n} does not depend on (θ, γ). Taking ε < min{ε̃_2, (1 − e^{-c_1})/(1 + e^{-c_1})} concludes the proof.

Proof of Theorem 3

We shall first define some notation. Let 0 < r < ν/2 and t = r/4. Let N_t = N(t, F_n, ‖·‖_∞) be the covering number of F_n. In Theorem 2.7.1, van der Vaart and Wellner (1996) show that there exists a constant K so that log N_t ≤ K M_n / t^p, and therefore log N_t = O(M_n), where M_n = O(n^α) for α ∈ (1/2, 1) according to the definition of the sieves. Let us consider τ ∈ (α/2, 1/2) and define c_n = n^τ, so that log(N_t) = o(c_n²). Moreover, let ζ_1, ..., ζ_{N_t} ∈ F_n be finitely many elements of the sieve such that for every ζ ∈ F_n there is an i ∈ {1, ..., N_t} satisfying ‖ζ − ζ_i‖_∞ < t. This implies that if ζ ∈ F_n is such that ∫|ζ(t) − ζ_0(t)| dQ_n(t) > ν, then ∫|ζ_i(t) − ζ_0(t)| dQ_n(t) > ν/2.

The next step in the proof is to construct a test for each ζ_i, with the resulting functions Φ_n defined as a combination of the individual tests, and to show that the probabilities of type I and type II errors satisfy the properties of the theorem. Let us recall that ζ_j = ζ(t_j) and ζ_{0,j} = ζ_0(t_j). For an arbitrary ζ ∈ F_n such that ‖ζ − ζ_i‖_∞ < t, let us define ζ_{1,j} = ζ_i(t_j), and b_j = 1 if ζ_{1,j} > ζ_{0,j} and b_j = −1 otherwise. For any ν > 0, let Ψ_n[ζ, ν] be the indicator of the set A defined as follows:
\[
A = \left\{\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) > 2c_n\sqrt{n}\right\}.
\]
The test functions Φ_n are then
\[
\Phi_n = \max_{1\leq j\leq N_t}\Psi_n\left[\zeta_j, \frac{\nu}{2}\right].
\]

Type I error.
The Mill's ratio implies
\[
E_{\zeta_0,\sigma_0}(\Psi_n) = P_0\left(\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) > 2c_n\sqrt{n}\right) = 1 - \Phi(2c_n) \leq \frac{1}{2c_n\sqrt{2\pi}} e^{-2c_n^2} \leq e^{-2c_n^2}.
\]
The function Φ(·) is the CDF of the standard normal distribution. Consequently, we have
\[
E_{\zeta_0,\sigma_0}(\Phi_n) \leq \sum_{j=1}^{N_t} E_{\zeta_0,\sigma_0}\left(\Psi_n\left[\zeta_j, \frac{\nu}{2}\right]\right) \leq N_t e^{-2c_n^2} = e^{\log(N_t) - 2c_n^2} \leq e^{-c_n^2},
\]
and
\[
\sum_{n=1}^{\infty} E_{\zeta_0,\sigma_0}\Phi_n < \infty.
\]

Type II error. It is sufficient to find an i for which the probability of the type II error of Ψ_n[ζ_i, ν/2], given an arbitrary ζ ∈ W_{ν,n}^C ∩ F_n, is sufficiently small. This is because the probability of the type II error for the composite test Φ_n is no larger than the smallest one among the Ψ_n[ζ_i, ν/2]. Note that here we assume ∫|ζ(t) − ζ_0(t)| dQ_n(t) > ν, and then ∫|ζ_i(t) − ζ_0(t)| dQ_n(t) > ν/2. For every r < ν/2, Choi and Schervish (2007b) show that
\[
\sum_{j=1}^{n} |\zeta_{1,j} - \zeta_{0,j}| > rn.
\]
Let n be large enough so that 4σ_0 c_n < r√n; then for any 0 < ε < 1,
\[
E_{\zeta,\sigma_0(1+\varepsilon)}\left(1 - \Psi_n\left[\zeta_i, \frac{\nu}{2}\right]\right)
= P_{\zeta,\sigma_0(1+\varepsilon)}\left(\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_{0,j}}{\sigma_0}\right) \leq 2c_n\sqrt{n}\right)
= P_{\zeta,\sigma_0(1+\varepsilon)}\left(\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j + \zeta_j - \zeta_{1,j} + \zeta_{1,j} - \zeta_{0,j}}{\sigma_0}\right) \leq 2c_n\sqrt{n}\right)
\]
\[
\leq P_{\zeta,\sigma_0(1+\varepsilon)}\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq 2c_n + \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\frac{|\zeta_j - \zeta_{1,j}|}{\sigma_0} - \frac{1}{\sqrt{n}}\sum_{j=1}^{n}\frac{|\zeta_{1,j} - \zeta_{0,j}|}{\sigma_0}\right)
\]
\[
\leq P_{\zeta,\sigma_0(1+\varepsilon)}\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq 2c_n + \frac{r\sqrt{n}}{4\sigma_0} - \frac{r\sqrt{n}}{\sigma_0}\right)
\leq P_{\zeta,\sigma_0(1+\varepsilon)}\left(\frac{1}{\sqrt{n}}\sum_{j=1}^{n} b_j\left(\frac{y_j - \zeta_j}{\sigma_0}\right) \leq -\frac{r\sqrt{n}}{4\sigma_0}\right)
\]
\[
= \Phi\left(-\frac{r\sqrt{n}}{4\sigma_0(1+\varepsilon)}\right)
\leq \frac{4\sigma_0(1+\varepsilon)}{r\sqrt{2\pi n}}\, e^{-\frac{nr^2}{32\sigma_0^2(1+\varepsilon)^2}}.
\]
To establish part (ii) of the theorem, we need to show that there exists a constant 0 < ε̃ < 1 so that for any ε < ε̃,
\[
\frac{r^2}{32\sigma_0^2(1+\varepsilon)^2} + \log\left(\frac{1-\varepsilon}{1+\varepsilon}\right) > 0. \tag{4.31}
\]
Take κ = r²/(32σ_0²) and define b(ε) to be the left-hand side of (4.31),
\[
b(\varepsilon) = \kappa\left(\frac{1}{(1+\varepsilon)^2} + \frac{1}{\kappa}\log\left(\frac{1-\varepsilon}{1+\varepsilon}\right)\right).
\]
The function b(ε) is clearly continuous at ε = 0 with b(0) = κ > 0. Hence, for each κ > 0, there exists ε̃ such that for all 0 < ε < ε̃, b(ε) > 0.

Proof of Lemma 4

Theorem 5 of Ghosal and Roy (2006) implies that there exist positive constants C, d_0, d_1, ..., d_p so that for i = 1, ..., p,
\[
P\left(\sup_{t\in[0,1]^p}|\zeta(t)| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_0 M_n^2/\rho_0^2(\theta,\gamma)}, \qquad
P\left(\sup_{t\in[0,1]^p}\Big|\frac{\partial}{\partial t_i}\zeta(t)\Big| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_i M_n^2/\rho_i^2(\theta,\gamma)}.
\]
The continuity of the ρ_i²(θ, γ), for i = 0, ..., p, on the compact set Υ implies that they are uniformly bounded. Therefore, there exist universal constants (Ξ_{0,1}, Ξ_{0,2}), ..., (Ξ_{p,1}, Ξ_{p,2}) such that for i = 0, ..., p,
\[
0 < \Xi_{i,1} \leq \rho_i^2(\theta, \gamma) \leq \Xi_{i,2} \quad \text{for all } (\theta, \gamma) \in \Upsilon.
\]
Hence, for i = 1, ..., p,
\[
\sup_{(\theta,\gamma)\in\Upsilon} P\left(\sup_{t\in[0,1]^p}|\zeta(t)| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_0 M_n^2/\Xi_{0,2}}, \qquad
\sup_{(\theta,\gamma)\in\Upsilon} P\left(\sup_{t\in[0,1]^p}\Big|\frac{\partial}{\partial t_i}\zeta(t)\Big| > M_n \,\Big|\, z, \theta, \gamma\right) \leq C e^{-d_i M_n^2/\Xi_{i,2}}.
\]
A union bound over the p + 1 events defining F_n^C, together with the fact that M_n² grows faster than n, then yields the claimed bound on sup_{(θ,γ)∈Υ} p(F_n^C | z, θ, γ).

Proof of Theorem 4

First, we show that σ̂_n² is asymptotically unbiased. Note that
\[
E[(y_{i+1} - y_i)^2] = [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2 + \sigma_0^2\, E[(\varepsilon_{i+1} - \varepsilon_i)^2] = [\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2 + 2\sigma_0^2,
\]
because the ε_i are i.i.d. N(0, 1). Consequently,
\[
E(\hat{\sigma}_n^2) = \frac{\sum_{i=1}^{n-1}[\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2}{2(n-1)} + \sigma_0^2. \tag{4.32}
\]
Since ζ_0 is continuously differentiable on the compact and convex set Ω, it is also (globally) Lipschitz on Ω (e.g.
Schaeffer and Cain (2016), Corollary 3.2.4), and there exists a real constant K so that
\[
|\zeta_0(t_{i+1}) - \zeta_0(t_i)| \leq K\sum_{j=1}^{p}|t_{i+1,j} - t_{i,j}|.
\]
Therefore, due to the design assumption (AD),
\[
0 \leq \frac{\sum_{i=1}^{n-1}[\zeta_0(t_{i+1}) - \zeta_0(t_i)]^2}{2(n-1)} \leq \frac{K^2 p^2}{2}\left[\sup_{i\in\{1,\dots,n-1\},\, j\in\{1,\dots,p\}}|t_{i+1,j} - t_{i,j}|\right]^2 \xrightarrow[n\to\infty]{} 0, \tag{4.33}
\]
and the combination of (4.32) with (4.33) implies
\[
E(\hat{\sigma}_n^2) \xrightarrow[n\to\infty]{} \sigma_0^2. \tag{4.34}
\]
To show the almost sure convergence of σ̂_n², let us now denote x_i = (y_{i+1} − y_i)² and rewrite the estimator σ̂_n² as a sum of two estimators, each consisting of a sum of independent variables:
\[
\hat{\sigma}_n^2
= \frac{1}{2}\cdot\frac{\sum_{i=1}^{(n-1)/2} x_{2i}}{2\left(\frac{n-1}{2}\right)}
+ \frac{1}{2}\cdot\frac{\sum_{j=1}^{(n-1)/2} x_{2j-1}}{2\left(\frac{n-1}{2}\right)}
= \hat{\sigma}_{n,e}^2 + \hat{\sigma}_{n,o}^2.
\]
Without loss of generality, we have assumed that n is an odd integer. Lastly, note that Var(x_i) ≤ C < ∞ uniformly in i. This is because the differences ζ_0(t_{i+1}) − ζ_0(t_i) are uniformly bounded on the compact set Ω due to the continuity of ζ_0, and, additionally, the y_{i+1} − y_i are Gaussian and have bounded moments. We can now apply Kolmogorov's strong law of large numbers for independent non-identically distributed random variables (e.g. Shiryaev (1996), Chapter 3), and as a result
\[
\hat{\sigma}_{n,e}^2 \xrightarrow[n\to\infty]{} \frac{1}{2}\sigma_0^2 \quad \text{a.s. } P_0, \qquad
\hat{\sigma}_{n,o}^2 \xrightarrow[n\to\infty]{} \frac{1}{2}\sigma_0^2 \quad \text{a.s. } P_0, \qquad
\hat{\sigma}_n^2 = \hat{\sigma}_{n,e}^2 + \hat{\sigma}_{n,o}^2 \xrightarrow[n\to\infty]{} \sigma_0^2 \quad \text{a.s. } P_0.
\]

4.5.3 Supplement for the transverse harmonic wave simulation

This section contains some additional figures comparing the empirical Bayes fit with the fully Bayesian approach based on the posterior samples obtained via the MH algorithm.

Figure 4.4: Detail of 95% credible bands plotted at t = 0.00.

Figure 4.5: Detail of 95% credible bands plotted at t = 0.43.

Figure 4.6: Detail of 95% credible bands plotted at t = 0.71.

Figure 4.7: Detail of 95% credible bands plotted at t = 1.00.

Figure 4.8: Comparison of the convergence to the true physical process ζ_0(t, x). The curves with 95% credible intervals are plotted at t = 0.00.

Figure 4.9: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.43.

Figure 4.10: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 0.71.

Figure 4.11: Comparison of the convergence to the true physical process. The curves with 95% credible intervals are plotted at t = 1.00.
CHAPTER 5

CONCLUSION

We devote the final chapter of this dissertation to the summary of the advances in computational statistics and the developments of new statistical tools of UQ that were made in Chapters 2, 3, and 4. We also provide an overview of the new and exciting avenues this work opens for future research.

In Chapter 2, we studied BMA, the natural Bayesian framework to account for the model uncertainty that arises in situations when multiple competing models are available to describe the same or a similar physical process. Motivated by a recurrent scenario in the field of nuclear physics, we extended BMA to the scenario where competing models are defined on non-identical study regions. We gave a theoretical justification for the use of the BMA posterior mean predictor in terms of PMSE reduction. While this predictor does not guarantee a universal improvement in predictive ability, on average, it performs at least as well as the best model under consideration. Finally, we applied the methodology outlined in Chapter 2 under several scenarios that lead to better predictions and improved UQ: one simple and transparent exercise of averaging of proton potentials, and a pedagogical example of domain-corrected averaging with a synthetic dataset. We also provided a full-scale BMA analysis of 9 state-of-the-art nuclear mass models and a study of the LDM of nuclear binding energies trained on discrepant domains of the nuclear chart.

In Chapter 3, we developed a novel VBI approach to Bayesian calibration of computationally complex and many-parameter computer models. We exploited the probabilistic theory of approximation coupled with the pairwise construction of multivariate copulas to create a computationally efficient and scalable algorithm for calibration. In addition, we proposed Rao-Blackwellization, control variates, and importance sampling to reduce the variance of the noisy gradient estimates involved in the stochastic approximation. The theoretical justification for scalability was also established. In our examples, we first carried out an extensive simulation study that provided empirical evidence for the accuracy and scalability of our method in scenarios where the traditional MCMC-based approaches become impractical. We established the superiority of variational calibration over the MH algorithm and NUTS in terms of time efficiency and memory requirements. We also demonstrated the opportunities given by our method for practitioners on a real data example through the calibration of the LDM.

In Chapter 4, we proposed an empirical Bayes approach to model-enabled predictions of physical quantities as a fast and easy-to-implement alternative to the fully Bayesian treatment (also discussed in Chapter 3). A new hierarchical model representation of the Bayesian model for calibration of computer models was presented. A theoretical study of the proposed methodology was provided under this new representation.
In particular, we established the posterior consistency of the physical process, assuming smoothness of the mean and covariance functions of the GP priors and the existence of a strongly consistent estimator of the noise scale. Consequently, we proposed two plug-in estimators for the GP model hyperparameters and a strongly consistent estimator of the noise scale parameter. A simulation study that established the efficiency of the method and empirically verified the consistency was provided. Lastly, we revisited the LDM of binding energies and showed that our method yields results comparable to the fully Bayesian treatment.

5.1 Future research

The extension of BMA to the situation with models defined over non-overlapping input domains addresses only one of many practical challenges in Bayesian model mixing. From a methodology perspective, developing a principled approach to average models locally, with model weights depending on input values, would mitigate the tendency of BMA to perform global model selection when one of the models significantly dominates on some (small) part of the input space. Computationally, BMA is a two-step procedure, where one needs to first obtain samples from the posterior distributions under the individual models and subsequently sample from the BMA posterior density. A direct approximation of the BMA posterior, potentially using variational methods, would considerably improve the ease of implementation.

A natural next step to enhance the impact of the VBI approach for calibration of computer models that we proposed in Chapter 3 would be to examine its theoretical properties. For example, one could pursue a frequentist consistency result similar to Wang and Blei (2019). If we establish the conditions under which the ELBO L(λ) and the l-truncated ELBO L_{D_l}(λ) (respectively L_{C_l}(λ)) are equivalent in the limit, namely L_{D_l}(λ) = L(λ) + o_p(1), the asymptotic properties of Wang and Blei (2019) will directly extend to our methodology. Besides theoretical investigations, a procedure that avoids the current sequential approach to selecting the truncation level would be beneficial. For instance, using fit indices for finding a sufficient truncation appears to be a promising approach, as discussed by Brechmann and Joe (2015).

When it comes to the empirical Bayes approach to model-enabled prediction, we have already noted the need for further investigation of specific mean and covariance functions of GP priors that satisfy the smoothness conditions for posterior consistency. Most importantly, the hierarchical model representation of the Kennedy and O'Hagan (2001) framework, together with the theoretical developments in Section 4.2, constitutes a solid foundation to establish the posterior consistency of the physical process ζ in the fully Bayesian regime; that is, in a scenario with suitable prior distributions over the hyperparameters of the Gaussian process priors and the calibration parameters.

BIBLIOGRAPHY

Amewou-Atisso, M., Ghosal, S., Ghosh, J. K., and Ramamoorthi, R. (2003). Posterior consistency for semi-parametric regression problems. Bernoulli, 9(2):291–312.

Anni, R., Co', G., and Pellegrino, P. (1995). Nuclear charge density distributions from elastic electron scattering data. Nuclear Physics A, 584:35–59.

Audi, G., Wapstra, A., and Thibault, C. (2003). The AME2003 atomic mass evaluation: (ii). tables, graphs and references. Nuclear Physics A, 729:337–676.

Balasubramanian, J. B., Visweswaran, S., Cooper, G. F., and Gopalakrishnan, V. (2014).
Selective model averaging with Bayesian rule learning for predictive biomedicine. AMIA Joint Summits on Translational Science proceedings AMIA Summit on Translational Sci- ence, 2014:17–22. Bartel, J., Quentin, P., Brack, M., Guet, C., and H˚akansson, H.-B. (1982). Towards a better parametrisation of Skyrme-like effective forces: a critical study of the SkM force. Nuclear Physics A, 386(1):79–100. Bauer, M., van der Wilk, M., and Rasmussen, C. E. (2016). Understanding probabilistic sparse gaussian process approximations. In Proceedings of the 30th International Confer- ence on Neural Information Processing Systems, NeurIPS’16, pages 1533–1541. Bayarri, M. J., Berger, J. O., Paulo, R., Sacks, J., Cafeo, J. A., Cavendish, J., Lin, C.- H., and Tu, J. (2007). A framework for validation of computer models. Technometrics, 49:138–154. Bedford, T. and Cooke, R. M. (2002). Vines–a new graphical model for dependent random variables. Annals of Statistics, 30(4):1031–1068. Benzaid, D., Bentridi, S., Kerraci, A., and Amrani, N. (2020). Bethe–Weizs¨acker semiem- pirical mass formula coefficients 2019 update based on AME2016. Nuclear Science and Techniques, 31:9. Bernardo, J. M. and Smith, A. F. M. (1994). Reference analysis, chapter Inference. Wiley. Bertsch, G. F. and Bingham, D. (2017). Estimating parameter uncertainty in binding-energy models by the frequency-domain bootstrap. Physical Review Letters, 119:252501. Bertsch, G. F., Sabbey, B., and Uusn¨akki, M. (2005). Fitting theories of nuclear binding energies. Physical Review C, 71:054311. Bethe, H. A. and Bacher, R. F. (1936). Nuclear physics a. stationary states of nuclei. Reviews of Modern Physics, 8:82–229. 125 Bottou, L., Le Cun, Y., and Bengio, Y. (1997). Global training of document processing systems using graph transformer networks. In Proceedings of Computer Vision and Pattern Recognition (CVPR), pages 489–493. IEEE. Box, G. E. P. (1976). Science and statistics. Journal of the American Statistical Association, 71(356):791–799. Brechmann, E. C., Czado, C., and Aas, K. (2012). Truncated regular vines in high dimensions with application to financial data. The Canadian Journal of Statistics, 40(1):68–85. Brechmann, E. C. and Joe, H. (2015). Truncation of vine copulas using fit indices. Journal of Multivariate Analysis, 138:19–33. Brynjarsd´ottir, J. and O’Hagan, A. (2014). Learning about physical parameters: the impor- tance of model discrepancy. Inverse Problems, 30:114007. Casella, G. and Berger, R. (2002). Statistical Inference. Duxbury advanced series in statistics and decision sciences. Thomson Learning, second edition. Casella, G. and Robert, C. P. (1996). Rao-blackwellisation of sampling schemes. Biometrika, 83(1):81–94. Cauchois, B., L¨u, H., Boilley, D., and Royer, G. (2018). Uncertainty analysis of the nuclear liquid drop model. Physical Review C, 98:024305. Chabanat, E., Bonche, P., Haensel, P., Meyer, J., and Schaeffer, R. (1995). New Skyrme effective forces for supernovae and neutron rich nuclei. Physica Scripta, 1995(T56):231. Chib, S. and Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49:327–335. Choi, T. (2007). Alternative posterior consistency results in nonparametric binary regression using gaussian process priors. Journal of Statistical Planning and Inference, 137(9):2975 – 2983. Choi, T. and Schervish, M. J. (2007a). On posterior consistency in nonparametric regression problems. Journal of Multivariate Analysis, 98(10):1969 – 1987. Choi, T. and Schervish, M. J. (2007b). 
Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. Clyde, M. A., Ghosh, J., and Littman, M. L. (2011). Bayesian adaptive sampling for variable selection and model averaging. Journal of Computational and Graphical Statistics, 20:80– 101. Cooke, R. and Kurowicka, D. (2006). Uncertainty Analysis With High Dimensional Depen- dence Modelling. Wiley. Dissmann, J., Brechmann, E., Czado, C., and Kurowicka, D. (2013). Selecting and estimating regular vine copulae and application to financial returns. Computational Statistics & Data Analysis, 59:52–69. 126 Dobaczewski, J., Flocard, H., and Treiner, J. (1984). Hartree-Fock-Bogolyubov description of nuclei near the neutron-drip line. Nuclear Physics A, 422(1):103–139. Dobaczewski, J., Nazarewicz, W., and Reinhard, P.-G. (2014). Error estimates of theoretical models: a guide. Journal of Physics G: Nuclear and Particle Physics, 41(7):074001. Duchi, J., Hazan, E., and Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. The Journal of Machine Learning Research, 12:2121– 2159. Fayans, S. A. (1998). Towards a universal nuclear density functional. Journal of Experimental and Theoretical Physics Letters, 68(3):169–174. Feroz, F., Hobson, M. P., and Bridges, M. (2009). Multinest: an efficient and robust bayesian inference tool for cosmology and particle physics. Monthly Notices of the Royal Astronom- ical Society, 398:1601–1614. Geweke, J. (1999). Using simulation methods for Bayesian econometric models: inference, development, and communication. Econometric Reviews, 18:1–73. Ghosal, S. and Roy, A. (2006). Posterior consistency of gaussian process prior for nonpara- metric binary regression. Annals of Statistics, 34(5):2413–2429. Gneiting, T., Balabdaoui, F., and Raftery, A. E. (2007). Probabilistic forecasts, calibra- tion and sharpness. The Journal of the Royal Statistical Society, Series B (Statistical Methodology), 69:243–268. Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and esti- mation. Journal of the American Statistical Association, 102:359–378. Goldberger, A. (1966). Econometric theory. Wiley publications in statistics. J. Wiley. Gu, M. and Wang, L. (2018). Scaled Gaussian stochastic process for computer model calibra- tion and prediction. SIAM/ASA Journal on Uncertainty Quantification, 6(4):1555–1583. Hern´andez, B., Raftery, A. E., Pennington, S. R., and Parnell, A. C. (2018). Bayesian addi- tive regression trees using Bayesian model averaging. Statistics and Computing, 28:869– 890. Higdon, D., Gattiker, J., Williams, B., and Rightley, M. (2008). Computer model calibration using high-dimensional output. Journal of the American Statistical Association, 103:570– 583. Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2005). Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26:448–466. Higdon, D., McDonnell, J. D., Schunck, N., Sarich, J., and Wild, S. M. (2015). A Bayesian ap- proach for parameter estimation and prediction using a computationally intensive model. Journal of Physics G: Nuclear and Particle Physics, 42(3):034009. 127 Hoeting, J. A., Madigan, D., Raftery, A. E., and Volinsky, C. T. (1999). Bayesian model averaging: A tutorial. Statistical Science, 14:382–401. Hoffman, M. and Blei, D. (2015). Stochastic structured variational inference. 
In Proceed- ings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38, pages 361–369, San Diego, CA. PMLR. Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14:1303–1347. Homan, M. D. and Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15:1351– 1381. Hooten, M. B. and Hobbs, N. T. (2015). A guide to Bayesian model selection for ecologists. Ecological Monographs, 85:3–28. Horowitz, C. J., Arcones, A., Cˆot´e, B., Dillmann, I., Nazarewicz, W., et al. (2019). r- process nucleosynthesis: connecting rare-isotope beam facilities with the cosmos. Journal of Physics G: Nuclear and Particle Physics, 46(8):083001. Ireland, D. G. and Nazarewicz, W. (2015). Enhancing the interaction between nuclear experiment and theory through information and statistics. Journal of Physics G: Nuclear and Particle Physics, 42(3):030301. Jaganathen, Y., Betan, R. M. I., Michel, N., Nazarewicz, W., and P(cid:32)loszajczak, M. (2017). interaction for psd-shell nuclei. Physical Review C, Quantified gamow shell model 96:054316. Johnston, J. (1976). Econometric Methods. McGraw-Hill. Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37:183–233. Kass, R. E. and Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90:773–795. Kejzlar, V. and Maiti, T. (2020). Variational inference with vine copulas: An efficient approach for bayesian computer model calibration. arxiv.org/2003.12890. Kejzlar, V., Neufcourt, L., Maiti, T., and Viens, F. (2019). Bayesian averaging of computer models with domain discrepancies: a nuclear physics perspective. arxiv.org/1904.04793. Kejzlar, V., Neufcourt, L., Nazarewicz, W., and Reinhard, P.-G. (2020). Statistical aspects of nuclear mass models. Journal of Physics G: Nuclear and Particle Physics. URL https: // doi. org/ 10. 1088/ 1361-6471/ ab907c . Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464. 128 King, G. B., Lovell, A. E., Neufcourt, L., and Nunes, F. M. (2019). Direct comparison between Bayesian and frequentist uncertainty quantification for nuclear reactions. Physical Review Letters, 122:232502. Kirson, M. W. (2008). Mutual influence of terms in a semi-empirical mass formula. Nuclear Physics A, 798(1):29 – 60. Klupfel, P., Reinhard, P. G., Burvenich, T. J., and Maruhn, J. A. (2009). Variations on a theme by Skyrme: A systematic study of adjustments of model parameters. Physical Review C, 79:034310. Kl¨upfel, P., Reinhard, P.-G., B¨urvenich, T. J., and Maruhn, J. A. (2009). Variations on a theme by Skyrme: A systematic study of adjustments of model parameters. Physical Review C, 79(3):034310. Kortelainen, M., Lesinski, T., Mor´e, J. J., Nazarewicz, W., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. M. (2010a). Nuclear energy density optimization. Physical Review C, 82(2):024313. Kortelainen, M., Lesinski, T., Mor´e, J., Nazarewicz, W., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. (2010b). Nuclear energy density optimization. Physical Review C, 82:024313. Kortelainen, M., McDonnell, J., Nazarewicz, W., Olsen, E., Reinhard, P.-G., Sarich, J., Schunck, N., Wild, S. M., Davesne, D., Erler, J., and Pastore, A. 
Kortelainen, M., McDonnell, J., Nazarewicz, W., Reinhard, P.-G., Sarich, J., Schunck, N., Stoitsov, M. V., and Wild, S. M. (2012). Nuclear energy density optimization: Large deformations. Physical Review C, 85:024304.
Krane, K. (1987). Introductory Nuclear Physics. Wiley.
Lawrence, E., Heitmann, K., White, M., Higdon, D., Wagner, C., Habib, S., and Williams, B. (2010). The Coyote Universe III: simulation suite and precision emulator for the nonlinear matter power spectrum. The Astrophysical Journal, 713(2):1322–1331.
Lynch, P. (2008). The origins of computer weather prediction and climate modeling. Journal of Computational Physics, 227(7):3431–3444. Special issue on predicting weather, climate and extreme events.
Madigan, D., Gavrin, J., and Raftery, A. E. (1995). Eliciting prior information to enhance the predictive performance of Bayesian graphical models. Communications in Statistics - Theory and Methods, 24(9):2271–2292.
Martino, L., Laparra, V., and Camps-Valls, G. (2017). Probabilistic cross-validation estimators for Gaussian process regression. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 823–827.
McDonnell, J. D., Schunck, N., Higdon, D., Sarich, J., Wild, S. M., and Nazarewicz, W. (2015). Uncertainty quantification for nuclear density functional theory and information content of new measurements. Physical Review Letters, 114(12):122501.
Morris, M. D. and Mitchell, T. J. (1995). Exploratory designs for computational experiments. Journal of Statistical Planning and Inference, 43(3):381–402.
Myers, W. D. and Swiatecki, W. J. (1966). Nuclear masses and deformations. Nuclear Physics, 81(2):1–60.
Nazarewicz, W. (2016). Challenges in nuclear structure theory. Journal of Physics G: Nuclear and Particle Physics, 43:044002.
Nazarewicz, W. (2018). The limits of nuclear mass and charge. Nature Physics, 14(6):537–541.
Neiswanger, W., Wang, C., and Xing, E. P. (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI’14, pages 623–632, Arlington, VA. AUAI Press.
Neufcourt, L., Cao, Y., Giuliani, S., Nazarewicz, W., Olsen, E., and Tarasov, O. B. (2020a). Beyond the proton drip line: Bayesian analysis of proton-emitting nuclei. Physical Review C, 101:014319.
Neufcourt, L., Cao, Y., Giuliani, S. A., Nazarewicz, W., Olsen, E., and Tarasov, O. B. (2020b). Quantified limits of the nuclear landscape. Physical Review C, 101:044307.
Neufcourt, L., Cao, Y., Nazarewicz, W., Olsen, E., and Viens, F. (2019). Neutron drip line in the Ca region from Bayesian Model Averaging. Physical Review Letters, 122:062502.
Park, I. and Grandhi, R. V. (2012). Quantification of model-form and parametric uncertainty using evidence theory. Structural Safety, 39:44–51.
Parkinson, D. and Liddle, A. R. (2013). Bayesian model averaging in astrophysics: A review. Statistical Analysis and Data Mining: The ASA Data Science Journal, 6(1):3–14.
Pastore, A. (2019). An introduction to bootstrap for nuclear physics. Journal of Physics G: Nuclear and Particle Physics, 46(5):052001.
Peterson, C. and Anderson, J. R. (1987). A mean field theory learning algorithm for neural networks. Complex Systems, 1:995–1019.
Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285.
Plumlee, M. (2019). Computer model calibration with confidence and consistency. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 81(3):519–545.
Plumlee, M., Joseph, V. R., and Yang, H. (2016). Calibrating functional parameters in the ion channel models of cardiac cells. Journal of the American Statistical Association, 111:500–509.
Pollard, D., Chang, W., Haran, M., Applegate, P., and DeConto, R. (2016). Large ensemble modeling of the last deglacial retreat of the West Antarctic Ice Sheet: comparison of simple and advanced statistical techniques. Geoscientific Model Development, 9(5):1697–1723.
Quiñonero-Candela, J. and Rasmussen, C. E. (2005). A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, pages 1939–1959.
Radaideh, M. I., Borowiec, K., and Kozlowski, T. (2019). Integrated framework for model assessment and advanced uncertainty quantification of nuclear computer codes under Bayesian statistics. Reliability Engineering & System Safety, 189:357–377.
Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822. PMLR.
Ranganath, R., Tran, D., and Blei, D. M. (2016). Hierarchical variational models. In Proceedings of the 33rd International Conference on Machine Learning – Volume 48, ICML’16, pages 2568–2577. JMLR.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press.
Reinhard, P.-G., Bender, M., Nazarewicz, W., and Vertse, T. (2006). From finite nuclei to the nuclear liquid drop: Leptodermous expansion based on self-consistent mean-field theory. Physical Review C, 73:014309.
Robbins, H. and Monro, S. (1951). A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407.
Robert, C. and Casella, G. (2005). Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer New York.
Ross, S. M. (2006). Simulation. Academic Press, Inc., Orlando, FL, fourth edition.
Ruiz, F. J. R., Titsias, M. K., and Blei, D. M. (2016). Overdispersed black-box variational inference. In Proceedings of the Thirty-Second Conference on Uncertainty in Artificial Intelligence, UAI’16, pages 647–656, Arlington, VA. AUAI Press.
Schaeffer, D. G. and Cain, J. W. (2016). Nonlinear Systems: Local Theory, pages 79–109. Springer New York, New York, NY.
Schorning, K., Bornkamp, B., Bretz, F., and Dette, H. (2016). Model selection versus model averaging in dose finding studies. Statistics in Medicine, 35:4021–4040.
Sexton, D. M. H., Murphy, J. M., Collins, M., and Webb, M. J. (2012). Multivariate probabilistic projections using imperfect climate models. Part I: outline of methodology. Climate Dynamics, 38(11):2513–2542.
Shelley, M., Becker, P., Gration, A., and Pastore, A. (2014). Advanced statistical methods to fit nuclear models. Acta Physica Polonica B, 12:649.
Shiryaev, A. N. (1996). Convergence of Probability Measures. Central Limit Theorem, pages 308–378. Springer New York, New York, NY.
Silvestro, D., Schnitzler, J., Liow, L. H., Antonelli, A., and Salamin, N. (2014). Bayesian estimation of speciation and extinction from incomplete fossil occurrence data. Systematic Biology, 63:349–367.
Skilling, J. (2006). Nested sampling for general Bayesian computation. Bayesian Analysis, 1(4):833–860.
Sklar, A. (1959). Fonctions de répartition à n dimensions et leurs marges. Publications de l’Institut de Statistique de l’Université de Paris, 8:229–231.
Smith, M. S., Loaiza-Maya, R., and Nott, D. J. (2020). High-dimensional copula variational approximation through transformation. Journal of Computational and Graphical Statistics, pages 1–35.
Sundararajan, S. and Keerthi, S. S. (2001). Predictive approaches for choosing hyperparameters in Gaussian processes. Neural Computation, 13(5):1103–1118.
Tieleman, T. and Hinton, G. (2012). Lecture 6.5 - RMSProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning.
Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, volume 5, pages 567–574. PMLR.
Toivanen, J., Dobaczewski, J., Kortelainen, M., and Mizuyama, K. (2008). Error analysis of nuclear mass fits. Physical Review C, 78:034306.
Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic Gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34–42.
Tran, D., Blei, D. M., and Airoldi, E. M. (2015). Copula variational inference. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NeurIPS’15, pages 3564–3572, Cambridge, MA. MIT Press.
Tran, D., Ranganath, R., and Blei, D. M. (2017). Hierarchical implicit models and likelihood-free variational inference. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NeurIPS’17, pages 5529–5539.
Trotta, R. (2008). Bayes in the sky: Bayesian inference and model selection in cosmology. Contemporary Physics, 49:71–104.
Tuo, R. and Wu, C. F. J. (2015). Efficient calibration for imperfect computer models. Annals of Statistics, 43:2331–2352.
Tuo, R. and Wu, C. F. J. (2016). A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification, 4:767–795.
Utama, R., Piekarewicz, J., and Prosper, H. B. (2016). Nuclear mass predictions for the crustal composition of neutron stars: A Bayesian neural network approach. Physical Review C, 93:014311.
van der Vaart, A. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
Wahba, G. (1990). Spline Models for Observational Data. Society for Industrial and Applied Mathematics.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
Wang, M., Audi, G., Kondev, F. G., Huang, W. J., Naimi, S., and Xu, X. (2017). The AME2016 atomic mass evaluation (II). Tables, graphs and references. Chinese Physics C, 41:030003.
Wang, Y. and Blei, D. M. (2019). Frequentist consistency of variational Bayes. Journal of the American Statistical Association, 114(527):1147–1161.
Wasserman, L. (2000). Bayesian model selection and model averaging. Journal of Mathematical Psychology, 44:92–107.
Wei, W., Visweswaran, S., and Cooper, G. F. (2011). The application of naive Bayes model averaging to predict Alzheimer’s disease from genome-wide data. Journal of the American Medical Informatics Association, 18:370–375.
Weizsäcker, C. F. v. (1935). Zur Theorie der Kernmassen. Z. Phys., 96(7):431–458.
Wen, X. (2015). Bayesian model comparison in genetic association analysis: linear mixed modeling and SNP set testing. Biostatistics, 16:701–712.
Williams, B., Higdon, D., Gattiker, J., Moore, L., McKay, M., and Keller-McNulty, S. (2006). Combining experimental data and computer simulations, with an application to flyer plate experiments. Bayesian Analysis, 1(4):765–792.
Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, pages 1–47.
Yuan, C. (2016). Uncertainty decomposition method and its application to the liquid drop model. Physical Review C, 93:034310.
Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. arXiv:1212.5701.
Zhang, H. F., Wang, L. H., Yin, J. P., Chen, P. H., and Zhang, H. F. (2017). Performance of the Levenberg-Marquardt neural network approach in nuclear mass prediction. Journal of Physics G: Nuclear and Particle Physics, 44(4):045110.
Zhang, L., Jiang, Z., Choi, J., Lim, C.-Y., Maiti, T., and Baek, S. (2019). Patient-specific prediction of abdominal aortic aneurysm expansion using Bayesian calibration. IEEE Journal of Biomedical and Health Informatics.