CALIBRATED COMPUTER MODELS USING MARKOV CHAIN MONTE CARLO AND VARIATIONAL BAYES By Mookyong Son A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statisticsβ€”Doctor of Philosophy 2023 ABSTRACT Computer models are used to solve complex problems in many scientific applications, such as nuclear physics and climate research. Although a popular approach, Markov chain Monte Carlo- based Bayesian calibration of computer models has not been investigated much regarding theoretical properties until relatively recently. Hence, this work focuses on the theory of computer model calibration through a proof of posterior consistency of the estimated physical process. In Chapter 1, we review the general framework of computer model calibration, Gaussian Processes, and Posterior Consistency. In Chapter 2, we prove the posterior consistency of the estimated physical process in the Bayesian model calibration framework using Markov chain Monte Carlo. We used the extension of Schwartz’s theorem to show the posterior contraction rate using GP priors. In Chapter 3, we propose a fast and scalable posterior approximation algorithm for Bayesian computer model calibration via Variational Inference. Variational Inference is an optimization- based method alternative to the time-consuming Markov chain Monte Carlo approximation of posterior distributions popularized by machine learners. We provide the statistical guarantee of the proposed algorithm in the form of a posterior consistency theorem for the estimated physical process under regularity assumptions on the variational family. The main results are shown in the two widely used classes of Gaussian Process priors, the Squared Exponential covariance class and the Matern covariance class. We also provide a simulation study to demonstrate the proposed method’s time efficiency and fidelity compared to the standard Markov chain Monte Carlo method. In Chapter 4, we applied the Bayesian model calibration framework to 𝛽 decay calculation and compared different calibration approaches. We adopt πœ’2 results as a benchmark to show the advantage of Bayesian approaches. Finally, we conclude the thesis with future directions for further research in Chapter 5. Copyright by MOOKYONG SON 2023 ACKNOWLEDGEMENTS I would like to express my heartfelt gratitude to my advisors, Dr. Tapabrata Maiti and Dr. ShrΔ³ita Bhattacharya, for their invaluable guidance throughout my research journey. Their insightful suggestions have consistently propelled me forward and led to significant progress. I also want to acknowledge the members of my dissertation committee, Dr. Witold Nazarewicz and Dr. Chih-Li Sung, whose feedback and comments have been instrumental in shaping the quality of my research. I am particularly grateful for the weekly meetings with physicists at the Facility for Rare Isotope Beams (FRIB), which have broadened my perspective on the practical applications of Statistics. Additionally, the BAND research collaboration provided me with wonderful networking opportunities with fellow physicists. Furthermore, I extend my gratitude to my mentors, Stefan Wild and Arindam Fadikar, during my remote internship at Argonne National Lab amidst the pandemic. Our discussions greatly honed my expertise in my field. I am thankful for all the friends I made during my time at Michigan State University. 
Special thanks go to Arkaprabha Ganguli, Anirban Samaddar, and Anriana Manousidaki, with whom I shared powerful memories while studying for prelim during the summer. Passing prelim without their support would have been nearly impossible. I am also indebted to Tong Li and Vojtech Kejzlar for the profound discussions that greatly enriched my thesis. Last but not least, my gratitude goes to my family. The unwavering support of my wife, Jin Young Kwak, and my daughter, Lia, has made the completion of my Ph.D. possible. Moreover, I am truly grateful for the unconditional encouragement I have received from my parents, parents-in-law, brother, and brother-in-law throughout this journey. Their constant belief in me has been a driving force behind my accomplishments. Throughout my five years at Michigan State University, the courses and seminars have imparted valuable knowledge, which I have successfully applied to my research. I eagerly look forward to embracing my future as a statistician with enthusiasm and dedication. iv TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Computer Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Gaussian Processes . 1.3 Posterior Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . BIBLIOGRAPHY . . . . . . 1 1 3 5 7 CHAPTER 2 . . . . CALIBRATED COMPUTER MODEL USING MARKOV CHAIN MONTE CARLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Bayesian Hierarchical Formulation of Kennedy and O’Hagan Model 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Preliminaries 2.3 Posterior Consistency in Kenndy and O’Hagan Model . . . . . . . . . . . . . . 17 2.4 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 BIBLIOGRAPHY . . PROOF OF THEOREM 1 APPENDIX A . 22 CHECKING ASSUMPTION 2.2.1 . . . . . . . . . . . . . . . . 27 APPENDIX B CHECKING ASSUMPTION 2.2.2 1 . . . . . . . . . . . . . . . 31 APPENDIX C CHECKING ASSUMPTION 2.2.2 2 . . . . . . . . . . . . . . . 39 APPENDIX D . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 3 CALIBRATED COMPUTER MODEL USING VARIATIONAL BAYES (VARIATIONAL INFERENCE) . . . . . . . . . . . . . . . . . . . . . . 41 3.1 Variational Bayes formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Variational Bayes Algorithm for Discrepancy Model . 44 3.3 Posterior Contraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . 49 3.4 Simulation Study: Ball Dropping Experiment 3.5 Simulation Study: Logistic Function . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 BIBLIOGRAPHY . . CHECKING ASSUMPTION 3.3.2 (I) APPENDIX A . . . . . . . . . . . . . . 57 CHECKING ASSUMPTION 3.3.2 (II) . . . . . . . . . . . . . . 58 APPENDIX B . 61 PROOF OF THEOREM 3 . . . . . . . . . . . . . . . . . . . . APPENDIX C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 4 . . BAYESIAN MODEL CALIBRATION FOR BETA DECAY CALCULATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 . 69 . 70 . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 . 79 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 LATIN HYPERCUBE DESIGN . . . . . . . . . . . . . . . . . 85 CHI-SQUARE OPTIMIZATION FRAMEWORK . . . . . . . . 86 4.1 Data Collection . 4.2 Extension of the calibration model to two observable types . . . . . . . . . . 4.3 Prior Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Calibration Results . 4.5 Prediction Results . . 4.6 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . BIBLIOGRAPHY . APPENDIX A APPENDIX B . . . . . v CHAPTER 5 CONCLUSIONS AND FUTURE DIRECTIONS . . . . . . . . . . . . . 88 vi CHAPTER 1 INTRODUCTION 1.1 Computer Model Calibration Computer models are an essential part of today’s scientific process. With the increasing access to high-performance computing, computer models provide much faster and economical environment for scientists to rapidly develop their domain fields such as Biology, Chemistry, Epidemiology, Engineering, and Physics to just name a few. However, computer models are canonically imperfect representation of physical systems. The landmark publication of Kennedy and O’Hagan (2001) approached this issue by introducing discrepancy function between a physical process and a computer model. This work, up to the present time, accumulated more than 3000 citations and led to countless applications. For instance, see King et al. (2019); Gatt et al. (2020) for examples in nuclear physics, Seede et al. (2020); Honarmandi and ArrΓ³yave (2020) for examples in material sciences, and Colosimoa et al. (2018); HΓΌllen et al. (2020); Choi et al. (2018) for engineering applications. The theoretical properties of the original framework have not been investigated until relatively recently. There are two properties of the calibration model we can think of, the one is identifiability of calibration parameters and the other is the consistency of the estimated physical process to the true physical process. Non-identifiability issue of calibration parameters in the KOH model stems from the original paper Kennedy and O’Hagan (2001), which concerns the distribution of observed physical data induced by the computer model, which does not uniquely determine the calibration parameter. There is a couple of papers to handle this problem by modifying the original model, for example take a look at Plumlee (2017) and Xie and Xu (2020). In this paper, however, our goal is to show the consistency property of the original KOH model in Fully Bayesian setting. There are also some papers to deal with this issue such as Xie and Xu (2020) or Li and Xiong (2022). But again these papers are based on the modified version of the model such as defining the calibration parameters as 𝐿2 minimization between the computer model and the physical process or decomposing the discrepancy term. 1 The KOH model is built based on some assumptions. Firstly, Kennedy and O’Hagan (2001) assumes that although the computer model is not fully available, one does have exact realizations from the computer model at a finite discrete number of points. To facilitate this, a Gaussian process is assumed on the computer model indexed by (𝒕, 𝜽) with nuisance parameters πœ™ 𝑓 . 𝜽 is often referred to as the calibration parameter. Additionally, Kennedy and O’Hagan (2001) assumes a discrepancy term between the true physical process and the computer model. 
It assumes a Gaussian process on the discrepancy term with index 𝒕 with nuisance parameters πœ™π›Ώ. The aim of the Kennedy and O’Hagan (2001) model is to study the relationship between the computer model 𝑓 and the physical process 𝜁 by learning about calibration parameter 𝜽, and eventually to do predictions π’šβˆ— = (π‘¦βˆ— 1 , . . . , π‘¦βˆ— 𝐽) of the physical process 𝜁 at new inputs (π’•βˆ— 1 , . . . , π’•βˆ— 𝐽). The current theoretical developments have three main drawbacks (1) they assume the computer model is explicitly available (2) 𝜽 is an unknown parameter characterizing the computer model (3) πœ™ 𝑓 and πœ™π›Ώ are unknown parameters. In this paper, we provide the posterior consistency of the estimated physical process under the full Bayesian treatment of the Kennedy and O’Hagan (2001) model. Following Kejzlar et al. (2020), we consider an equivalent representation of Kennedy and O’Hagan (2001) model as a hierarchical model. Consequently, we prove the posterior consistency of the estimated physical process using an original extension of Schwartz’s theorem for non-parametric regression problems with GP priors. Unlike all the previous works in the literature, we stick to the assumption that computer model is not fully available and one has only the evaluations from the computer model at finite number of points. Secondly, we assume that πœƒ, πœ“1 and πœ“2 are not just parameters with simple plug-in estimators but random variables with their own prior distributions. 1.1.1 Kennedy O’Hagan Model Let us consider observations π’š = (𝑦1, . . . , 𝑦𝑛) of a physical process 𝜁 depending on a known set of inputs 𝒕𝑖 ∈ 𝛀 βŠ‚ R𝑝, 𝑖 = 1, Β· Β· Β· , 𝑛, 𝑝 β‰₯ 1 given by 𝑦𝑖 = 𝜁 (𝒕𝑖) + πœŽπœ–π‘–, (1.1) 2 where 𝜎 represents the scale of observational error, and πœ–π‘– 𝑖.𝑖.𝑑. ∼ N (0, 1). Let 𝑓 denote a computer model. Kennedy and O’Hagan (2001) introduced a discrepancy term 𝛿 to represent the unknown systematic error between the computer model 𝑓 and the physical process 𝜁. Hence the physical process can be represented as a summation of imperfect computer model plus the discrepancy term. 𝜁 (𝒕𝑖) = 𝑓 (𝒕𝑖, 𝜽) + 𝛿(𝒕𝑖), 𝑖 = 1, . . . , 𝑛 (1.2) Note, the above expression involves known inputs 𝒕𝑖 and unknown parameters 𝜽, also referred to as the calibration parameter. The calibration parameter is not controllable in the physical process 𝜁 and thus may be used to represent aspects of the computer model which can not be measured in the physical observations. In some cases, the calibration parameter may also carry a real physical meaning. For example, consider a famous ball drop example by Plumlee (2017). If one constructs a computer model to check the time for a ball to hit the ground from a certain height. Then the gravitational force could be used as a calibration parameter to build a computer model. Further discussion of physical meaning in calibration parameters could be checked at BrynjarsdΓ³ttir and O’Hagan (2014). Next, we assume that the evaluation of computer model 𝑓 is either expensive or not directly available, which are often the case in a real-world application. The common practice in this situation is to fit a surrogate for a computer model based on the simulation data. Hence, let 𝒛 = (𝑧1, . . . , 𝑧𝑠) denote outputs from the computer model 𝑓 under a known set of inputs Λœπ’•π‘– ∈ 𝛀, 𝑝 β‰₯ 1 and Λœπœ½π‘– ∈ Θ. 𝑧𝑖 = 𝑓 ( Λœπ’•π‘–, Λœπœ½π‘–) 𝑖 = 1, . . . , 𝑠, (1.3) Without loss of generality, we assume 𝛀 = [0, 1] 𝑝 for the rest of the paper. 
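To make the roles of equations (1.1)–(1.3) concrete, the following is a minimal sketch (in Python) of how field data and simulator runs arise in the ball-drop example mentioned above. The free-fall drop-time formula used for $f$, the linear form chosen for the discrepancy $\delta$, and all numerical values are illustrative assumptions only and are not part of the Kennedy and O'Hagan (2001) framework itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def computer_model(t, theta):
    """Toy simulator f(t, theta): time for a ball to fall from height t
    under gravitational acceleration theta (no air resistance)."""
    return np.sqrt(2.0 * t / theta)

def discrepancy(t):
    """Hypothetical systematic error delta(t) between simulator and reality
    (e.g., unmodeled drag); chosen arbitrarily for illustration."""
    return 0.05 * t

theta_true = 9.81          # "true" calibration parameter (unknown in practice)
sigma = 0.02               # observational error scale

# Field observations (1.1)-(1.2): y_i = f(t_i, theta*) + delta(t_i) + sigma * eps_i
t_field = rng.uniform(1.0, 20.0, size=15)      # drop heights in meters (rescaled to [0,1] in the formal setup)
y = computer_model(t_field, theta_true) + discrepancy(t_field) \
    + sigma * rng.normal(size=t_field.size)

# Simulator runs (1.3): z_i = f(t_tilde_i, theta_tilde_i) at chosen design points
t_sim = rng.uniform(1.0, 20.0, size=30)
theta_sim = rng.uniform(8.0, 11.0, size=30)    # design over the calibration parameter
z = computer_model(t_sim, theta_sim)

print(y[:3], z[:3])
```

Here $(\boldsymbol{t}_i, y_i)$ play the role of the physical observations in (1.1) and $(\tilde{\boldsymbol{t}}_i, \tilde{\boldsymbol{\theta}}_i, z_i)$ the simulation data in (1.3), from which a surrogate for $f$ is later built.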
1.2 Gaussian Processes

A Gaussian process is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. Hence we can think of it as an infinite-dimensional multivariate Gaussian distribution. A Gaussian process $f(x)$ is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$, where
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].$$

Figure 1.1 (Gaussian Process priors) The left panel shows realizations of three random functions from a Gaussian process prior, and the right panel shows realizations of three random functions from the posterior distribution updated by five training points, shown as "+" signs. This figure is taken from Rasmussen and Williams (2006).

A commonly adopted mean function is a constant, while many types of covariance functions are widely used. Two of the most common are the 'Squared Exponential' (SE) and 'Matérn' covariance functions. The Squared Exponential covariance function has the form
$$k(x, x'; \eta, l) = \eta \exp\left(-\frac{(x - x')^2}{2 l^2}\right),$$
where $\eta > 0$ is the scale parameter and $l > 0$ is the length scale parameter. The Matérn covariance function has the form
$$k(x, x'; \eta, l, \nu) = \eta \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,|x - x'|}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,|x - x'|}{l}\right),$$
where $K_\nu$ is the modified Bessel function and $\nu > 0$ is the roughness parameter.

Figure 1.2 (Covariance functions of GP) The left panel shows the Matérn covariance as a function of the input distance $r = |x - x'|$ with $\eta$ and $l$ fixed at 1. Different colors represent different values of the roughness parameter $\nu$. The red line corresponds to $\nu = \infty$, which is equivalent to the SE covariance. The right panel shows realizations of random functions from a GP with each covariance function.

The connection between these two classes is that the SE covariance is the special case of the Matérn covariance in which $\nu$ equals $\infty$. As the roughness parameter tends to infinity, the sample functions from a GP with SE covariance become infinitely differentiable, in other words, very smooth. This can be seen in Figure 1.2: the sample function from the SE covariance is colored red and looks very smooth compared to the other, wigglier sample functions. This smoothness assumption is often too restrictive for modeling the response surface of physical processes, so in practice it is often advised to use the Matérn covariance, Stein (1999). However, the SE covariance often makes it easier to derive sample function properties of the GP. Therefore, we consider both cases in developing the large sample theory of Bayesian model calibration in this thesis. (A small numerical sketch of these two kernels is given below.)

1.3 Posterior Consistency

The consistency of estimators is usually discussed in the context of frequentist statistics. In a frequentist framework, a consistent estimator converges in probability to the true parameter as the number of data points approaches infinity. This idea can be extended to the Bayesian framework as well. For Bayesian methods, convergence properties are discussed using the entire posterior distribution rather than a single estimate.
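As a numerical companion to the SE and Matérn covariance functions of Section 1.2, the sketch below evaluates both kernels and illustrates that the Matérn covariance approaches the SE covariance as the roughness parameter $\nu$ grows. The grid of distances and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.special import kv, gamma  # modified Bessel function K_nu and Gamma function

def k_se(x1, x2, eta=1.0, l=1.0):
    """Squared exponential covariance k(x, x') = eta * exp(-(x - x')^2 / (2 l^2))."""
    return eta * np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

def k_matern(x1, x2, eta=1.0, l=1.0, nu=2.5):
    """Matern covariance; nu controls the smoothness of GP sample paths."""
    r = np.atleast_1d(np.abs(np.asarray(x1, float) - np.asarray(x2, float)))
    s = np.sqrt(2.0 * nu) * r / l
    out = np.full_like(s, eta)                 # by definition k(x, x) = eta
    pos = s > 0
    out[pos] = eta * (2.0 ** (1 - nu) / gamma(nu)) * s[pos] ** nu * kv(nu, s[pos])
    return out

r = np.linspace(0.0, 3.0, 7)                   # input distances |x - x'|
for nu in (0.5, 1.5, 2.5, 50.0):               # large nu recovers the SE kernel
    print(nu, np.round(k_matern(0.0, r, nu=nu), 3))
print("SE", np.round(k_se(0.0, r), 3))
```

With $\nu = 1/2, 3/2, 5/2$ the Matérn values differ visibly from the SE values at moderate distances, while a large $\nu$ (here 50) closely approximates the SE kernel, mirroring the comparison in Figure 1.2.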
Assume that a set of data $\boldsymbol{y}_n = (y_1, \cdots, y_n)$ is randomly sampled from $f_0 \in \mathcal{F}$. Then the posterior distribution is a random measure defined as follows,
$$\Pi(B \mid \boldsymbol{y}_n) = \frac{\int_B \prod_{i=1}^{n} f(y_i)\, d\Pi(f)}{\int_{\mathcal{F}} \prod_{i=1}^{n} f(y_i)\, d\Pi(f)},$$
where $B$ is a measurable set in $\mathcal{F}$. The natural question is what happens to the posterior distribution as the sample size $n$ increases to infinity. If the posterior distribution of $f$ concentrates around some small neighborhood of the true parameter $f_0$, then we say that the posterior distribution is consistent. In other words, as we see more data, the data provide more information on the parameter and ideally the posterior distribution of the parameter converges to the truth, which is a point mass.

There has been a line of research on posterior consistency, and the first result is due to Doob (1949). Doob showed that the posterior distribution is consistent almost everywhere on the prior support, but this is not useful for checking consistency at a particular density. Later, Schwartz (1965) introduced a test function framework for establishing posterior consistency. In particular, he showed that for the posterior to be consistent, the true parameter and the complement of a neighborhood of the true parameter must be separable. This idea of separation is conveniently formalized through the existence of appropriate tests for testing $H_0 : f = f_0$ versus $H_A : f \in U^c$, where $U$ is some neighborhood of $f_0$. When posterior consistency holds, it is often possible to quantify the rate at which the posterior contracts to the truth. This is called the posterior contraction rate. Hence, in the following chapters, we show not only that the posterior is consistent but also the posterior contraction rates in the KOH model setup using MCMC and Variational Bayes.

BIBLIOGRAPHY

Brynjarsdóttir, J. and O'Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30:114007.

Choi, W., Menberg, K., Kikumoto, H., Heo, Y., Choudhary, R., and Ooka, R. (2018). Bayesian inference of structural error in inverse models of thermal response tests. Applied Energy, 228:1473–1485.

Colosimo, B. M., Huang, Q., Dasgupta, T., and Tsung, F. (2018). Opportunities and challenges of quality engineering for additive manufacturing. Journal of Quality Technology, 50:233–252.

Doob, J. L. (1949). Application of the theory of martingales. In Le Calcul des Probabilites et ses Applications, 40:23–27.

Gatt, D., Yousif, C., Cellura, M., Camilleri, L., and Guarino, F. (2020). Assessment of building energy modelling studies to meet the requirements of the new energy performance of buildings directive. Renewable and Sustainable Energy Reviews, 127:1–16.

Honarmandi, P. and Arróyave, R. (2020). Uncertainty quantification and propagation in computational materials science and simulation assisted materials design. Integrating Materials and Manufacturing Innovation, 9:103–143.

Hüllen, G., and Sun-Hye Kim, J. Z., Sinha, A., Realff, M. J., and Boukouvala, F. (2020). Managing uncertainty in data-driven simulation-based optimization. Computers and Chemical Engineering, 136:1–16.

Kejzlar, V., Son, M., Bhattacharya, S., and Maiti, T. (2020).
A fast, scalable, and calibrated computer model emulator: An empirical bayes approach. Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464. King, G. B., Lovell, A. E., Neufcourt, L., and Nunes, F. M. (2019). Direct comparison between Bayesian and frequentist uncertainty quantification for nuclear reactions. Physical Review Letters, 122:232502. Li, Y. and Xiong, S. (2022). Physical parameter calibration. arXiv:2208.00124 [stat.ME]. Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285. Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press. 7 Schwartz, L. (1965). On bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4:10–26. Seede, R., Shoukr, D., Zhang, B., Whitt, A., Gibbons, S., Flater, P., Elwany, A., Arroyave, R., and Karaman, I. (2020). An ultra-high strength martensitic steel fabricated using selective laser melting additive manufacturing: Densification, microstructure, and mechanical properties. Acta Materialia, 186:199–214. Stein, M. L. (1999). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media. Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 0(0):1–18. 8 CHAPTER 2 CALIBRATED COMPUTER MODEL USING MARKOV CHAIN MONTE CARLO In Section 1, we restate the original Kennedy O’Hagan (KOH) model framework Kennedy and O’Hagan (2001) as a hierarchical regression model with GP prior. In Section 2, we introduce the general consistency theorem and assumptions we need for the full Bayesian calibration model to be consistent. In Section 3 we check the assumptions of the consistency theorem in the context of the KOH model. Finally, we conclude the chapter with a discussion in section 4. 2.1 Bayesian Hierarchical Formulation of Kennedy and O’Hagan Model We present the KOH model described in Chapter 1 as a Bayesian hierarchical model with GP priors. This representation is adopted from Kejzlar et al. (2020) and is crucial for the theoretical results. It reframes the Bayesian model as a version of a non-parametric regression problem with a GP prior for 𝜁 (𝒕) and additive noise. To be specific, we consider the following GP prior structure: 𝜁 (𝒕)| 𝑓 (𝒕, 𝜽), 𝛿(𝒕) ∼ 𝑓 (𝒕, 𝜽) + 𝛿(𝒕), 𝛿(𝒕) ∼ GP𝛿 (π‘šπ›Ώ (𝒕), π‘˜ 𝛿 (𝒕, 𝒕′); πœ™π›Ώ), 𝑓 (𝒕, 𝜽) ∼ GP 𝑓 (π‘š 𝑓 (𝒕, 𝜽), π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)); πœ™ 𝑓 ). (2.1) where 𝝓𝛿 and 𝝓 𝑓 denote the hyperparameters associated with the Gaussian prior on 𝛿 and 𝑓 respectively. For example, 𝝓 𝑓 = (πœ‚ 𝑓 , 𝑙 𝑓 ) comprise of πœ‚ 𝑓 , the scale parameter of the covariance function and 𝑙 𝑓 , the length scale parameters of the covariance function, 𝑙 𝑓 . 
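Before deriving the joint and conditional distributions, it may help to see the hierarchical prior (2.1) operationally. The sketch below (in Python) draws one realization of $f(\boldsymbol{t}, \boldsymbol{\theta})$ and of $\delta(\boldsymbol{t})$ at a set of field inputs and adds them to obtain a draw of $\zeta$. The zero means, the product SE kernels, and every numerical value (lengthscales, scales, the fixed $\boldsymbol{\theta}$) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(A, B, eta, lengthscales):
    """Product squared exponential kernel between rows of A and B."""
    d2 = np.zeros((A.shape[0], B.shape[0]))
    for k, l in enumerate(lengthscales):
        d2 += ((A[:, [k]] - B[:, [k]].T) / l) ** 2
    return eta * np.exp(-0.5 * d2)

# Field design points t_i in [0, 1]^p and a single fixed calibration value theta
p, n = 1, 20
t = rng.uniform(size=(n, p))
theta = np.full((n, 1), 0.5)                    # same theta attached to every field input

# Prior draw of f(t, theta): GP over the joint (t, theta) input, zero mean
X_f = np.hstack([t, theta])
K_f = se_kernel(X_f, X_f, eta=1.0, lengthscales=[0.3, 0.3]) + 1e-8 * np.eye(n)
f_draw = np.linalg.cholesky(K_f) @ rng.normal(size=n)

# Prior draw of the discrepancy delta(t): GP over t only, zero mean
K_d = se_kernel(t, t, eta=0.1, lengthscales=[0.2]) + 1e-8 * np.eye(n)
delta_draw = np.linalg.cholesky(K_d) @ rng.normal(size=n)

# One realization of the physical process under the hierarchical prior (2.1)
zeta_draw = f_draw + delta_draw
print(zeta_draw[:5])
```

Conditioning such draws on the simulator outputs $\boldsymbol{z}$ is what produces the mean and covariance in (2.7) and (2.8) below.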
We first consider a prior distribution for the scale parameter 𝜎2 as 𝜎2 ∼ Ξ (𝜎2) With 𝝓 = (𝝓𝛿, 𝝓 𝑓 ) ∈ 𝐴𝝓, we assume the following prior distribution of 𝝓 𝝓 ∼ Ξ (𝝓) Finally, we consider the following prior distribution for the calibration parameter 𝜽 ∈ 𝚯 𝜽 ∼ Ξ (𝜽) 9 (2.2) (2.3) (2.4) Based on (1.3) and (2.1) the joint distribution 𝑝(𝜁, 𝒛|𝜽, 𝝓) is a multivariate normal distribution with mean and covariance π‘€πœ,𝒛|𝜽,𝝓 = (cid:169) (cid:173) (cid:173) (cid:171) 𝑀 𝑓 (𝑇𝑦 (𝜽)) + 𝑀𝛿 (𝑇𝑦) 𝑀 𝑓 (𝑇𝑧 ((cid:101)𝜽)) , (cid:170) (cid:174) (cid:174) (cid:172) 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑦 (𝜽)) + 𝐾𝛿 (𝑇𝑦, 𝑇𝑦) 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑧 ((cid:101)𝜽)) 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽)) 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑦 (𝜽)) (cid:170) (cid:174) (cid:174) (cid:172) 𝐾𝜁,𝒛|𝜽,𝝓 = (cid:169) (cid:173) (cid:173) (cid:171) (2.5) (2.6) Here, 𝑀 𝑓 (𝑇𝑦 (𝜽)) is a column vector with 𝑗 th element π‘š 𝑓 (𝒕 𝑗 , 𝜽), 𝑀𝛿 (𝑇𝑦) is a column vector with 𝑗 th element π‘šπ›Ώ (𝒕 𝑗 ), and 𝑀 𝑓 (𝑇𝑧 ((cid:101)𝜽)) is a column vector with 𝑗 th element π‘š 𝑓 ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 ). 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑦 (𝜽)) is the matrix with (𝑖, 𝑗) element π‘˜ 𝑓 ((𝒕𝑖, 𝜽), (𝒕 𝑗 , 𝜽)), 𝐾𝛿 (𝑇𝑦, 𝑇𝑦) is the matrix with (𝑖, 𝑗) element π‘˜ 𝛿 (𝒕𝑖, 𝒕 𝑗 ), and 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽)) is the matrix with (𝑖, 𝑗) element π‘˜ 𝑓 (((cid:101)𝒕𝑖, (cid:101)πœ½π‘–), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )). We can define the matrix 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑧 ((cid:101)𝜽)) similarly with the kernel π‘˜ 𝑓 . To separate out the scale parameters, let’s define new notations for the covariance matrix as follows: 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑦 (𝜽)) = πœ‚ 𝑓 ˜Σ11, 𝐾𝛿 (𝑇𝑦, 𝑇𝑦) = πœ‚π›Ώ ˜Σ11β€², 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑧 ((cid:101)𝜽)) = πœ‚ 𝑓 ˜Σ12, and 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽)) = πœ‚ 𝑓 ˜Σ22. Consequently, the conditional distribution 𝑝(𝜁 |𝒛, 𝜽, 𝝓) is also a multivariate normal with the mean and covariance functions: π‘šπœ |𝒛,𝜽,𝝓 = π‘š 𝑓 (𝒕, 𝜽) + π‘šπ›Ώ (𝒕) + 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) ( 1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , π‘™πœƒ)) [𝑧𝑖 βˆ’ π‘š 𝑓 ((cid:101)𝒕𝑖, (cid:101)πœ½π‘–))] π‘˜ 𝜁 |𝒛,𝜽,𝝓 = π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) + π‘˜ 𝛿 (𝒕, 𝒕′) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) βˆ’ (cid:18) 1 πœ‚ 𝑓 (cid:19) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) 𝑗=1 where (1/πœ‚ 𝑓 ) ΛœΞ£βˆ’1 𝑖=1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) is the ( 𝑗, 𝑖) element of [𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽))]βˆ’1 which depends on 𝑙 𝑓 , 𝑙𝛿. To make our argument simple and efficient, without loss of generality, we are going to assume zero (2.7) (2.8) mean for π‘“π‘š and 𝛿. In what follows, we are interested in the posterior density of the parameter 𝑆 = (𝜁, 𝜎) (cid:35) 𝑑Π(𝑆|𝒛) 𝑑𝑆 𝑝( π’š|𝑆) 𝑝(𝑆|𝒛) 𝑝( π’š, 𝒛) 𝑝(𝑆, π’š, 𝒛) 𝑝(π’š, 𝒛) 1 𝑝(π’š, 𝒛) Ξ (𝑆| π’š, 𝒛) = 𝑝(𝑦𝑖 |πœπ‘–, 𝜎) (cid:34) 𝑛 (cid:214) = = 𝑖 (2.9) 10 where the marginal density 𝑝( π’š, 𝒛) is given by 𝑝( π’š, 𝒛) = ∫ (cid:2)(cid:206)𝑛 𝑖 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)(cid:3) 𝑑Π(𝑆|𝒛) and the prior density 𝑑Π(𝑆|𝒛)/𝑑𝑆 is given by ∫ ∫ Θ (cid:32)∫ 𝐴𝝓 ∫ 𝐴𝝓 Θ (cid:32)∫ ∫ 𝑑Π(𝑆|𝒛) 𝑑𝑆 = = = = 𝑝(𝑆, 𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ = ∫ ∫ Θ 𝐴𝝓 𝑝(𝑆|𝒛, 𝜽, 𝝓) 𝑝(𝝓, 𝜽 |𝒛) π‘‘π“π‘‘πœ½ (cid:33) 𝑝(𝜁 |𝒛, 𝜽, 𝝓) 𝑝(𝝓|𝒛) 𝑝(𝜽) π‘‘π“π‘‘πœ½ Γ— 𝑝(𝜎) 𝑝(𝜁 |𝒛, 𝜽, 𝝓) 𝑝(𝝓 𝑓 |𝒛) 𝑝(𝝓𝛿) 𝑝(𝜽)𝑑𝝓 𝑓 π‘‘π“π›Ώπ‘‘πœ½ Γ— 𝑝(𝜎) (cid:33) ∫ 𝐴𝝓 𝑓 Γ— 𝑑Π2(𝜎) π‘‘πœŽ Θ 𝐴𝝓 𝛿 𝑑Π1(𝜁 |𝒛) π‘‘πœ 2.2 Preliminaries For the purposes of this section, we shall assume that 𝜎2 = 1 and Ξ 1 = Π𝜁 . The results can be easily extended to unknown 𝜎2 with minor modifications. 
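Before turning to the consistency theory, the following minimal sketch shows numerically what the conditioning step behind (2.7) and (2.8) looks like: given simulator runs $\boldsymbol{z}$ at design inputs $(\tilde{\boldsymbol{t}}, \tilde{\boldsymbol{\theta}})$, the conditional mean and covariance of the emulator $f(\cdot, \boldsymbol{\theta})$ follow the usual GP regression formulas. Zero prior means are used, the discrepancy contribution is omitted, and the toy simulator, the single SE kernel, and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def se_kernel(A, B, eta=1.0, l=0.3):
    """Isotropic squared exponential kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return eta * np.exp(-0.5 * d2 / l ** 2)

# Simulator design (t_tilde, theta_tilde) and outputs z from a toy simulator
s = 40
X_sim = rng.uniform(size=(s, 2))                       # columns: (t, theta)
z = np.sin(4 * X_sim[:, 0]) + 0.5 * X_sim[:, 1]        # stand-in for f(t, theta)

# New field inputs t paired with a candidate calibration value theta
t_new = np.linspace(0, 1, 5)[:, None]
theta = 0.7
X_new = np.hstack([t_new, np.full_like(t_new, theta)])

# GP conditioning analogous to (2.7)-(2.8), with zero prior means and no discrepancy term
K_ss = se_kernel(X_sim, X_sim) + 1e-8 * np.eye(s)
K_ns = se_kernel(X_new, X_sim)
alpha = np.linalg.solve(K_ss, z)
cond_mean = K_ns @ alpha                               # analogue of the conditional mean in (2.7)
cond_cov = se_kernel(X_new, X_new) - K_ns @ np.linalg.solve(K_ss, K_ns.T)
print(np.round(cond_mean, 3))
print(np.round(np.diag(cond_cov), 4))
```

In the full model, the discrepancy kernel $k_\delta$ is added to the conditional covariance and the means $m_f$, $m_\delta$ enter the conditional mean, exactly as written in (2.7) and (2.8).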
The basic idea of posterior consistency is that the posterior probability of an arbitrary neighborhood around the true parameter goes to 1 as the sample size 𝑛 goes to infinity. Consider the following neighborhood, (cid:40) (cid:41) Uπœ–π‘› = 𝜁 : |𝜁 (𝑑𝑖) βˆ’ 𝜁0(𝑑𝑖)| ≀ πœ–π‘› (2.10) 1 𝑛 𝑛 βˆ‘οΈ 𝑖=1 Then in the KOH model setup we described in the previous section, posterior consistency means that as 𝑛 goes to ∞, Π𝜁 (cid:8)𝜁 ∈ Uπœ–π‘› |π’š, 𝒛(cid:9) β†’ 0 π‘Ž.𝑠. [π‘ƒπœ0] where the posterior distribution is given by (2.9). There are many pieces of literature regarding posterior consistency. An early result of posterior consistency is by Doob (1949), which shows that the posterior distribution is consistent almost everywhere on prior support. Many Bayesians were satisfied with the result at the time, but it was not useful to check the consistency at a particular density. Then the breakthrough was made by Schwartz (1965), who proved a theorem for posterior consistency of parameters from i.i.d random variables. The conditions for this theorem are the prior positivity of neighborhoods of the true parameters and the existence of uniformly consistent test functions. This introduction 11 of test function conditions made Schwartz’s theorem a powerful tool for checking the posterior consistency in many applications. Please refer to Barron (1999) for further insight into Schwartz’s theorem. Choudhuri et al. (2004) extended this result to a triangular array of independent but non- identically distributed observations in terms of convergence in probability. Choi and Schervish (2007) strengthened this result to almost sure convergence. The proof of KOH model consistency is based on this extension of the Schwartz theorem, and note that the priors are updated with computer model evaluations, which makes the proof complicated. 2.2.1 Posterior Consistency Theorem Theorem 1. Let {𝑦𝑖}∞ 𝑖=1 be independently and normally distributed with mean 𝜁 (𝒕𝑖) and standard deviation 𝜎 > 0 with respect to a common 𝜎-finite measure. Let 𝜁 ∈ F , where F denotes the space of continuously differentiable functions on 𝛀 = [0, 1] 𝑝. Let 𝜁 ∈ F denote the parameter of interest. Let the prior of 𝜁 depending on a known set of inputs 𝒛 = (𝑧1, Β· Β· Β· , 𝑧𝑠) be given by the product measure π‘‘Ξ πœ (𝜁 |𝒛)/π‘‘πœ. For 𝜁0 ∈ F , π‘ƒπœ0 denote the joint conditional distribution of {𝑦𝑖}∞ 𝑖=1 given the true 𝜁0. Let {π‘ˆπ‘›}∞ 𝑛=1 be a sequence of subsets of F . Now under the following two main assumptions, Assumption 2.2.1. Existence of tests. Suppose there exist test functions {Φ𝑛}∞ 𝑛=1, a sequence of sets {F𝑛}∞ 𝑛=1 and constants 𝐢1, 𝐢2, 𝑐1, 𝑐2 > 0 such that 1. (cid:205)∞ 𝑛=1 𝐸𝜁0Φ𝑛 < ∞ 2. sup𝜁 βˆˆπ‘ˆπ‘ π‘›βˆ©F𝑛 𝐸𝑆 (1 βˆ’ Φ𝑛) ≀ 𝐢1 exp(βˆ’π‘1𝑛) Assumption 2.2.2. Prior positivity of neighborhoods. The marginal prior Π𝜁 (𝜁) satisfies 1. Π𝜁 (F 𝑐 𝑛 ) ≀ exp(βˆ’πΆπ‘›πœ– 2 𝑛) 2. Π𝜁 (βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ < πœ–π‘›) β‰₯ exp(βˆ’πΆπ‘›πœ– 2 𝑛) The posterior distribution as in (2.9) satisfies Ξ {𝜁 ∈ π‘ˆπ‘ 𝑛 |π’š, 𝒛} β†’ 0 π‘Ž.𝑠. 
[π‘ƒπœ0] (2.11) 12 We now simplify the posterior distribution in (2.9) as follows Ξ (𝜁 ∈ π‘ˆπ‘ 𝑛 | π’š, 𝒛) = (cid:2)(cid:206)𝑛 ∫ π‘ˆπ‘ 𝑛 ∫ (cid:2)(cid:206)𝑛 𝑖 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)(cid:3) 𝑑Π(𝜁 |𝒛) 𝑖 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)(cid:3) 𝑑Π(𝜁 |𝒛) (1 βˆ’ Φ𝑛) ∫ (cid:2)(cid:206)𝑛 ≀ Φ𝑛 + π‘›βˆ©F𝑛 π‘ˆπ‘ ∫ (cid:2)(cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) ∫ π‘ˆπ‘ + π‘›βˆ©F 𝑐 𝑛 ∫ (cid:2)(cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) (cid:2)(cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) = Φ𝑛 + I(1,𝑛) (π’š) I(3,𝑛) ( π’š) + I(2,𝑛) ( π’š) I(3,𝑛) ( π’š) (2.12) Assumption 2.2.1. 1 and Assumption 2.2.1. 2 requires uniformly consistent sequence of tests for testing 𝐻0 : 𝜁 = 𝜁0 versus 𝐻1 : 𝜁 ∈ π‘ˆπ‘ I(1,𝑛) (π’š) in (2.12) goes to 0 as 𝑛 β†’ ∞. The construction of these test functions for GPs has been 𝑛. These conditions are used to show that the numerator, studied in Choi and Schervish (2007) (see Section 3.5.3) and uses Assumption 2.2.3 on the design points. Note, Assumption 2.2.2. 1 assumes that the prior gives negligible probability outside the set F𝑛, shall be used to show that the numerator, I(2,𝑛) (π’š) in (2.12) goes to 0 as 𝑛 β†’ ∞. Finally, Assumption 2.2.2. 2 states that the marginal prior Π𝜁 gives enough mass on the set 𝐡𝑛 = {𝜁 : βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ ≀ πœ–π‘›} where 𝐡𝑛 denotes a set around the true process 𝜁0, and Assumption 2.2.2. 2 give a positive prior on the neighborhood of the true parameter 𝜁0, and it allows us to show that the denominator, I(3,𝑛) ( π’š) in (2.12) goes to ∞. Assumption 2.2.2. 2 is a problem well-known in the GP community as small ball probability, and it has been studied by many in the probability literature (see Ghosal and Roy (2006); Li and Shao (2001); Tokdar and Ghosh (2007); Vaart and Zanten (2008)). The small ball probability in GP prior is closely related to posterior contraction through the concentration function πœ™πœ0 (πœ–π‘›) = inf β„ŽβˆˆH:βˆ₯β„Žβˆ’πœ0 βˆ₯<πœ–π‘› βˆ₯β„Žβˆ₯2 H βˆ’ log Ξ (βˆ₯𝜁 βˆ₯ < πœ–π‘›) where, H is the Reproducing Kernel Hilbert Space (RKHS) associated to the prior, and βˆ₯Β·βˆ₯H is the corresponding RKHS norm. The first infimum part of the concentration function denotes the decrease in prior mass when the ball is shifted from the origin to 𝜁0. The second part represents 13 prior mass in a ball of radius πœ–π‘› around zero function (see Ghosal and van der Vaart (2017), Vaart and Zanten (2008), and van der Vaart and van Zanten (2011) for further description of this function). By the Theorem 2.1 of Vaart and Zanten (2008), Ξ (βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ < πœ–π‘›) β‰₯ exp(βˆ’πœ™πœ0 (πœ–π‘›)/2). In the Supplement, we show that πœ™πœ0 (πœ–π‘›)/2 ≀ π‘›πœ– 2 𝑛 for the Squared Exponential covariance and MatΓ©rn covariance class. 2.2.2 Assumptions for the KOH model In order to apply consistency Theorem 1 to KOH model, we need assumptions on the smoothness of the Gaussian process prior 𝜁 |𝒛, 𝜽, 𝝓 (see equations (2.7) and (2.8)) together with a set of assumptions on the priors for 𝝓 and 𝜽 (see equations (1.1), (2.3) and (2.4)). Additionally, we need an assumption on the rate of design points filling out the interval [0, 1] 𝑝. 2.2.2.1 Assumptions for design points under KOH Assumption 2.2.3. Non random design points. For each hypercube 𝐻 in [0, 1] 𝑝, let πœ†(𝐻) be its Lebesgue measure. 
Suppose that there exists a constant 0 < 𝐾𝑑 < 1 such that whenever πœ†(𝐻) β‰₯ 1/(𝑛𝐾𝑑), 𝐻 contains at least one design point. Note that equally spaced design satisfies Assumption 2.2.3. Suppose 𝒕𝑖 = (𝑑𝑖,1, Β· Β· Β· , 𝑑𝑖,𝑝), 𝑖 = 1, Β· Β· Β· , 𝑛 are spaced such that |𝑑𝑖+1,π‘˜ βˆ’ 𝑑𝑖,π‘˜ | = π‘›βˆ’1/𝑑, 0 = 𝑑0,π‘˜ < 𝑑1,π‘˜ ≀ Β· Β· Β· ≀ 𝑑𝑛,π‘˜ < 𝑑𝑛+1,π‘˜ = 1. Then each hypercube 𝐻 with a Lebesgue measure of at 𝑖, 𝑗 = 1, Β· Β· Β· , 𝑛, π‘˜ = 1, Β· Β· Β· , 𝑝 and least 1/𝑛 contains at least one design point. By the same logic, space-filling designs such as Latin Hypercube Sampling with maxmin distance Morris and Mitchell (1995), which is commonly used in the computer model fields, also satisfy this assumption. Please refer to Santner et al. (2003) for further information regarding designs in computer experiments. 2.2.2.2 Assumptions on the prior for KOH The prior assumption comprises two parts. One is for the smoothness of sample paths of Gaussian Processes 𝑓 and 𝛿. (Assumption 2.2.4. 1 and Assumption 2.2.4. 2). The other is for technical assumptions on priors of hyperparameters and calibration parameters ( Assumption 2.2.5 1, and Assumption 2.2.5. 2, Assumption 2.2.5. 3). 14 Assumption 2.2.4. Gaussian process. 1. For the Gaussian processes GP 𝑓 and GP𝛿, assume zero mean π‘š 𝑓 = π‘šπ›Ώ = 0. In addition, , Β· Β· Β· , 𝑙 𝑓 𝑝+π‘ž ) ∈ R𝑝+π‘ž, , Β· Β· Β· , 𝑙𝛿 𝑝 ) ∈ R𝑝, inputs 𝒕 = (𝑑1, 𝑑2, Β· Β· Β· , 𝑑 𝑝) ∈ R𝑝, 𝒕′ = (𝑑′1, 𝑑′2, Β· Β· Β· , 𝑑′𝑝) ∈ R𝑝, and assume the scale parameters πœ‚ 𝑓 , πœ‚π›Ώ > 0, length scale parameters 𝑙 𝑓 = (𝑙 𝑓1 𝑙𝛿 = (𝑙𝛿1 calibration parameters 𝜽 = (πœƒ1, Β· Β· Β· , πœƒ 𝑝′), (cid:101)𝜽 = ((cid:101)πœƒ1, Β· Β· Β· , (cid:101)πœƒ 𝑝′) ∈ 𝚯, let the covariance functions β€²) be a product of isotropic and integrable covariance functions, π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) and π‘˜ 𝛿 (𝒕, 𝒕 one for each dimension. π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) = πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝; 𝑙 𝑓 𝑝 )𝑅(πœƒ1, (cid:101)πœƒ1; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, (cid:101)πœƒ 𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) π‘˜ 𝛿 (𝒕, 𝒕′) = πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿2) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝; 𝑙𝛿 𝑝 ) where each 𝑅(𝑑𝑖, 𝑑′𝑖; 𝑙𝑖) = 𝑅 (cid:17) (cid:16) |π‘‘π‘–βˆ’π‘‘β€²π‘– | 𝑙𝑖 , and 𝑅 is the positive multiple of density. 2. For the Gaussian Process GP 𝑓 and GP𝛿, the mean functions π‘š 𝑓 , π‘šπ›Ώ are continuously differ- entiable with respect to input 𝒕 and covariance functions π‘˜ 𝑓 and π‘˜ 𝛿 have partial derivatives up to order 2𝑝 + 2. Assumption 2.2.5. Hyperparameters, calibration parameter and observational error scale pa- rameter. 1. πœ‚ 𝑓 and πœ‚π›Ώ have prior distributions with support R+, and there exists 0 < β„Ž < 1 2 and 𝑏1, 𝑏2, 𝑏3, 𝑏4 > 0 such that P[πœ‚ 𝑓 > π‘›β„Ž] ≀ 𝑏1 exp(βˆ’π‘2𝑛), and P[πœ‚π›Ώ > π‘›β„Ž] ≀ 𝑏3 exp(βˆ’π‘4𝑛), βˆ€π‘› β‰₯ 1. 2. 𝑙 𝑓 and 𝑙𝛿 have prior distributions with bounded support on R+. That is for the bounds, 𝐡 𝑓 , 𝐡𝛿 > 0, set P[𝑙 𝑓𝑖 > 𝐡 𝑓 ] = 0, 𝑓 π‘œπ‘Ÿ 𝑖 = 1, Β· Β· Β· , 𝑝 + 𝑝′, and P[𝑙𝛿 𝑗 > 𝐡𝛿] = 0, 𝑓 π‘œπ‘Ÿ 𝑗 = 1, Β· Β· Β· , 𝑝 3. 𝜽 has a continuous density on the support 𝚯 such that ∫ 𝚯 𝑝(𝜽)π‘‘πœ½ = 1. Note that in Assumption 2.2.4. 1, the product of the isotropic covariance functions is also an isotropic covariance function. The functional form of 𝑅(Β·) in the covariance kernel in Assumption 2.2.4. 1 affects the proof of showing the Assumption 2.2.2. 1. Choi (2005) avoids functional form by considering the continuity of supremum of variance terms (the same assumption was also used 15 in Lemma 1 in Kejzlar et al. (2020)). 
In this paper, we are going to prove Assumption 2.2.2. 1 for general case, which encompasses the widely used covariance functions such as Squared Exponential covariance. Assumption 2.2.4. 1 and Assumption 2.2.4. 2 together with Assumption 2.2.5. 1 and Assumption 2.2.5. 2 are sufficient conditions to ensure that the support of GP prior for 𝜁 |𝒛, 𝜽, 𝝓 contains every continuously differentiable function based on the arguments by Theorem 4 of Ghosal and Roy (2006) and Theorem 4.5 of Tokdar and Ghosh (2007). Particularly, Assumption 2.2.4. 2 implies that the same condition holds for the conditional process 𝜁 |𝒛, 𝜽, 𝝓, which is a function of 𝑓 and 𝛿. This enables us to apply Theorem 4 of Ghosal and Roy (2006) and Theorem 4.5 of Tokdar and Ghosh (2007) in KOH model setup. Many covariance functions satisfy this assumption, including the product of squared exponential covariance functions. For detailed descriptions of the analytic properties of covariance functions, please check Cramer and Leadbetter (1967), Adler (1981), and Adler (1990). Furthermore, Assumption 2.2.4. 1 and Assumption 2.2.4. 2 guarantees the existence of continuous sample derivatives with probability 1, p.171 and p.185 of Cramer and Leadbetter (1967). In addition, by the Theorem 5 of Ghosal and Roy (2006), these Assumption 2.2.4. 1 and Assumption 2.2.4. 2 guarantee that the GP prior 𝜁 |𝒛, 𝜽, 𝝓 gives non zero probability to the set of continuous functions with bounded derivatives. Assumption 2.2.5. 1 relaxes the assumption of compact support for all hyperparameters as used in the empirical Bayes approach Kejzlar et al. (2020) of KOH model. We relaxed that assumption in such a way that scale parameters for GP covariances, 𝑛 𝑓 and 𝑛𝛿, have an exponentially small probability of blowing up. Assumption 2.2.5. 2 indicates that the length scale parameters have bounded support. In the algorithm, we used log transformation to have the support of length scale parameters to be R+. However, we can always use different transformation. For example, sigmoid transformation could be used to have a support on [0, 1]. Hence, without loss of generality let’s assume they are bounded for theoretical investigation. Then, this condition allows us to upper bound (cid:12) the quantity (cid:205)𝑠 (cid:12) by a constant ¯𝐢 in a bounded support. This assumption is (cid:12) a technical assumption to ensure that the conditional prior Π𝜁 (Β·|𝒛) assigns exponentially small 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) ΛœΞ£βˆ’1 (cid:205)𝑠 𝑗=1 𝑖=1 (cid:12) (cid:12) (cid:12) probability to F 𝐢 𝑛 . Assumption 2.2.5. 3 assumes continuous density on the support of calibration 16 parameters. In practice, the support of calibration parameters is determined by domain experts. 2.3 Posterior Consistency in Kenndy and O’Hagan Model We have already presented the KOH model in Section 2.1. We shall first assume that the design points 𝒕 = (𝑑1, Β· Β· Β· , 𝑑𝑛) on which the physical process 𝜁 is defined (see (1.1)) follows the non random design as discussed under Assumption 2.2.3. As described in Section 2.1, we have considered a hierarchical Gaussian prior on the physical process 𝜁 as in (2.1). With 𝒛 defined in (1.3), we obtain an induced conditional multivariate Gaussian prior on the physical process 𝜁 whose mean and covariance kernels are given by (2.7) and (2.8) respectively. We first assume that this conditional prior 𝜁 |𝒛, 𝜽, 𝝓 satisfies all the conditions under Assumption 2.2.4. We also assume that the prior on hyperparameter 𝝓 = (𝑛 𝑓 , 𝑛𝛿, 𝑙 𝑓 , 𝑙𝛿) satisfies Assumption 2.2.5. 
1 and Assumption 2.2.5. 2. Under these assumptions, we next present the main result on the joint consistency of the physical process 𝜁 and the scale parameter 𝜎 under the KOH model. 2.3.1 Posterior Consistency Theorem of KOH Model Theorem 2. Let {𝑦𝑖}∞ 𝑖=1 be independently and normally distributed with mean 𝜁 (𝒕𝑖) and standard deviation 𝜎 > 0 with respect to a common 𝜎-finite measure. Let 𝜁 denote the parameter of interest. Let the design points 𝒕𝑖 ∈ [0, 1] 𝑝 satisfy Assumption 2.2.3. With reference to Section 2.1, let 𝒛 = (𝑧1, Β· Β· Β· , 𝑧𝑠) be as defined in (1.3). We assume that the conditional prior 𝜁 |𝒛, 𝜽, 𝝓 satisfies Assumption 2.2.4. 1, Assumption 2.2.4. 2. We also assume that the prior on 𝝓 satisfies Assumption 2.2.5. 1 and 2.2.5. 2. In addition, assume the calibration parameter satisfies 2.2.5. 3. Let π‘ƒπœ0 denote the joint conditional distribution of {𝑦𝑖}∞ 𝑖=1 given the true 𝜁0. Let the function 𝜁0 be continuously differentiable, then for a sequence πœ–π‘› that goes to 0 and for the neighborhood Uπœ–π‘› of 𝜁 as defined in equation (2.10), Π𝜁 {𝜁 ∈ U𝑐 πœ–π‘› | π’š, 𝒛} β†’ 0 π‘Ž.𝑠. [π‘ƒπœ0] (2.13) where Π𝜁 {|π’š, 𝒛} denotes the posterior distribution of 𝜁 given the data π’š = (𝑦1, Β· Β· Β· , 𝑦𝑛) and 𝒛 = (𝑧1, Β· Β· Β· , 𝑧𝑠). The proof of Theorem 2 is based on the Theorem 1. In order to prove Theorem 2, we shall show 17 that Assumption 2.2.2 and Assumption 2.2.1 hold under the KOH model. We present the proofs of Assumption 2.2.2 and Assumption 2.2.1 under the KOH model in the Appendix. 2.4 Conclusion and Discussion The Kennedy O’Hagan model assumes both physical data and computer model evaluation data are available and make use of the Gaussian Process to calibrate the parameters and to predict physical process at new input space. Although the identifiability issue of πœƒ is still a concern in the KOH model, it has been widely used in diverse fields. At the same time, researchers tried to provide a theoretical justification for this model. For example, you can look at Tuo and Wu (2015, 2016) and Plumlee (2017), among others. However, the theoretical properties of the fully Bayesian approach of the KOH model have not been investigated much. The most relevant paper is Xie and Xu (2020), in which they proposed a Bayesian projected calibration method following the frequentist 𝐿2-projected calibration method in Tuo and Wu (2015) and showed theoretical justification. An additional relevant study conducted by Gramacy and Apley (2015) employed a localized approach with parallelization techniques to approximate a large GP emulator. However, it is important to note that this particular work focused on GP computer emulation and did not operate within the KOH model framework. Hence the main contribution of this paper is the theoretical justification of using the original full Bayesian KOH model. Our approach is based on the extended use of Schwartz’s theorem in our model setup to prove the consistency of the fitted emulator. One issue we haven’t discussed in this paper is the identifiability of 𝜽. This is because it is not an issue for the prediction. However, this would be a great next research topic. As Kennedy and O’Hagan (2001) stated, the parameter is the one best explains the difference between the true physical model and the computer model. So it lacks physical meaning. But when it comes to applications, as discussed in Higdon et al. (2005) and Han et al. 
(2003), there are both cases where the parameter has little or no physical meaning (tuning parameter) and the cases where it does have a physical meaning (calibration parameter). This is an integral part for the domain scientists because the calibration parameters themselves are the interest to them sometimes, and it helps for 18 scientific understanding of the model and future scientific extension. Hence developing the new definition of 𝜽, which contains the concept of both approximating the true physical model and physical interpretation would be a big breakthrough in this field. 19 BIBLIOGRAPHY Abrahamsen, P. (1997). A review of gaussian random fields and correlation functions, volume 917. Norwegian Computing Center. Adler, R. J. (1981). The Geometry of Random Fields. Wiley. Adler, R. J. (1990). An introduction to continuity, extrema, and related topics for Gaussian process, volume 12. IMS Lecture notes-monograph series. Barron, A. R. (1999). Information-theoretic characterization of bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics, 6:27–52. Choi, T. (2005). Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. PhD thesis, Carnegie Mellon University. Choi, T. and Schervish, M. J. (2007). Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. Journal contribution. Choudhuri, N., Ghosal, S., and Roy, A. (2004). Bayesian estimation of the spectral density of a time series. Journal of the American Statistical Association, 99(468):1050 – 1059. Cramer, H. and Leadbetter, M. R. (1967). Stationary and related stochastic processes. Wiley. Doob, J. L. (1949). Application of the theory of martingales. in Le Calcul des Probabilites et ses Applications, 40:23–27. Ghosal, S. and Roy, A. (2006). Posterior consistency of gaussian process prior for nonparametric binary regression. Ann. Statist., 34(5):2413–2429. Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge University Press. Gramacy, R. B. and Apley, D. W. (2015). Local gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578. Han, G., Santner, T. J., and Rawlinson, J. J. (2003). Simultaneous determination of tuning and calibration parameters for computer experiments. Technometrics, 51(4):464 – 474. Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2005). Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26:448–466. Kejzlar, V., Son, M., Bhattacharya, S., and Maiti, T. (2020). A fast, scalable, and calibrated computer model emulator: An empirical bayes approach. 20 Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464. Li, W. V. and Shao, Q.-M. (2001). Gaussian processes: Inequalities, small ball probabilities and applications. Stochastic Processes: Theory and Methods. Handbook of Statistics, 19:533–598. Morris, M. D. and Mitchell, T. J. (1995). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Journal of Statistical Planning and Inference, 43:381–402. Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285. Santner, T. J., Williams, B. J., Notz, W. 
I., and Williams, B. J. (2003). The design and analysis of computer experiments, volume 1. Springer. Schwartz, L. (1965). On bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4:10–26. Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34 – 42. Tuo, R. and Wu, C. F. J. (2015). Efficient calibration for imperfect computer models. The Annals of Statistics, 43:2331–2352. Tuo, R. and Wu, C. F. J. (2016). A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification, 4:767–795. Vaart, A. W. V. D. and Zanten, J. H. V. (2008). Rates of contraction of posterior distributions based on gaussian process priors. The Annals of Statistics, 36:1435–1463. van der Vaart, A. W. and van Zanten, J. H. (2011). Information rates of nonparametric gaussian process methods. Journal of Machine Learning Research, pages 2095–2119. Xie, F., Jin, W., and Xu, Y. (2019). Rates of contraction with respect to 𝐿2-distance for Bayesian nonparametric regression. Electronic Journal of Statistics, 13(2):3485 – 3512. Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 0(0):1–18. 21 APPENDIX A PROOF OF THEOREM 1 Note 𝒛 is implicitly conditioned on the prior. If we decompose the following term. Π𝜁 (U𝑐 πœ–π‘› | π’š, 𝒛) = = = = ∫ 𝐴𝝓 ∫ Θ ∫ ∫ 𝐴𝝓 Θ 𝑝(π’š|𝜁) 𝑝(π’š|𝜁) ∫ ∫ Θ ∫ U𝑐 𝑛 ∫ F ∫ U𝑐 𝑛 ∫ F ∫ U𝑐 𝑛 ∫ F Ξ (U𝑐 𝑛 , 𝜽, 𝝓|π’š, 𝒛)π‘‘π“π‘‘πœ½ 𝑝(𝜁, π’š, 𝒛, 𝜽, 𝝓)π‘‘π“π‘‘πœ½ π‘‘πœ 𝐴𝝓 ∫ 𝑝(𝜁, π’š, 𝒛, 𝜽, 𝝓)π‘‘π“π‘‘πœ½ π‘‘πœ (cid:16) ∫ Θ ∫ 𝑝(𝜁, 𝜽, 𝝓, 𝒛)π‘‘π“π‘‘πœ½ (cid:17) 𝑝(𝜁, 𝜽, 𝝓, 𝒛)π‘‘π“π‘‘πœ½ 𝐴𝝓 (cid:16) ∫ Θ 𝐴𝝓 (cid:17) π‘‘πœ π‘‘πœ 𝑝( π’š|𝜁)𝑑Π(𝜁 |𝒛) 𝑝(π’š|𝜁)𝑑Π(𝜁 |𝒛) (1 βˆ’ Φ𝑛) ∫ ≀ Φ𝑛 + + = Φ𝑛 + ∫ U𝑐 𝑛 ∩F 𝑐 𝑛 ∫ (cid:206)𝑛 F I(1,𝑛) (π’š) I(3,𝑛) ( π’š) ≀ exp(βˆ’πΆ1π‘›πœ€2 = exp(βˆ’πΆ1π‘›πœ€2 (cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) U𝑐 (cid:206)𝑛 ∫ F (cid:206)𝑛 𝑛 ∩F𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎))/( 𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎))/( 𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) + I(2,𝑛) ( π’š) I(3,𝑛) ( π’š) exp(βˆ’πΆ2π‘›πœ€2 𝑛/2) exp(βˆ’(𝐢 + 𝐢4)π‘›πœ€2 𝑛) 𝑛) + exp(βˆ’(𝐢2/2 + 𝐢 βˆ’ 𝐢4)π‘›πœ€2 𝑛) + + exp(βˆ’πΆ3π‘›πœ€2 𝑛/2) exp(βˆ’(𝐢 + 𝐢4)π‘›πœ€2 𝑛) 𝑛) + exp(βˆ’(𝐢3/2 + 𝐢 βˆ’ 𝐢4)π‘›πœ€2 𝑛) ≀ exp(βˆ’πΆπ‘›πœ€2 𝑛) Where I1,𝑛 = (1 βˆ’ Φ𝑛) ∫ U𝑐 𝑛 ∩F𝑛 𝑛 (cid:214) 𝑖=1 ( 𝑝(𝑦𝑖 |πœπ‘–)/𝑝(𝑦𝑖 |𝜁0,𝑖))𝑑Π(𝜁 |𝒛) ( 𝑝(𝑦𝑖 |πœπ‘–))/( 𝑝(𝑦𝑖 |𝜁0,𝑖))𝑑Π(𝜁 |𝒛) ( 𝑝(𝑦𝑖 |πœπ‘–))/( 𝑝(𝑦𝑖 |𝜁0,𝑖))𝑑Π(𝜁 |𝒛) 𝑛 (cid:214) 𝑖=1 I2,𝑛 = I3,𝑛 = ∫ U𝑐 𝑛 ∩F 𝑐 𝑛 𝑛 (cid:214) ∫ F 𝑖=1 22 It suffices to show the following four results, Φ𝑛 ≀ exp(βˆ’πΆ1π‘›πœ€2 𝑛) I1,𝑛 ≀ exp(βˆ’πΆ2π‘›πœ€2 𝑛/2) I2,𝑛 ≀ exp(βˆ’πΆ3π‘›πœ€2 𝑛/2) I3,𝑛 ≀ exp(βˆ’(𝐢 + 𝐢4)π‘›πœ€2 𝑛) (A.1) (A.2) (A.3) (A.4) Mainly, (A.1) involves Markov inequality, Borel-cantelli lemma, and Assumption. 2.2.1. (i) in the main paper. Proofs of (A.2) and (A.3) follows from Theorem 2 of Choudhuri et al. (2004) and proof of (A.4) follows from Lemma 2.4 of Xie et al. (2019). All the proofs need a slight modification from the original papers since we have prior on 𝜁 updated with simulation data 𝒛. These modifications are discussed next. 
Proof of (A.1): Using Assumption 2.2.1 and Markov’s Inequality, =β‡’ P0 [ Φ𝑛 > exp(βˆ’πΆ1π‘›πœ€2 𝑛) ] ≀ E0 [ Φ𝑛 ] 1 exp(βˆ’πΆ1π‘›πœ€2 𝑛) ∞ βˆ‘οΈ =β‡’ ∞ βˆ‘οΈ 𝑛=1 P0 [ Φ𝑛 > exp(βˆ’πΆ1π‘›πœ€2 𝑛) ] ≀ 1 exp(βˆ’πΆ1π‘›πœ€2 𝑛) E0 [ Φ𝑛 ] ≀ 𝐢 ∞ βˆ‘οΈ 𝑛=1 E0 [ Φ𝑛 ] < ∞ 𝑛=1 =β‡’ P0 [ Φ𝑛 > exp(βˆ’πΆ1π‘›πœ€2 𝑛) 𝑖.π‘œ. ] = 0 =β‡’ P0 [ Φ𝑛 ≀ exp(βˆ’πΆ1π‘›πœ€2 𝑛)] β†’ 1 Proof of (A.2): For every non-negative function πœ“π‘› and any set 𝐴, by Fubini’s theorem, (cid:20) E𝜁0 πœ“π‘› ( π’š) ∫ 𝐴 𝑝( π’š|𝜁) 𝑝(π’š|𝜁0) (cid:21) 𝑑Π(𝜁 |𝒛) = = ∫ ∫ ∫ 𝐴 𝐴 πœ“π‘› ( π’š) 𝑝( π’š|𝜁) 𝑝(π’š|𝜁0) 𝑝( π’š|𝜁0)𝑑 ( π’š)𝑑Π(𝜁 |𝒛) E𝜁 [πœ“π‘› ( π’š)] 𝑑Π(𝜁 |𝒛) Now with 𝐴 = U𝑐 𝑛 ∩ F𝑛 and πœ“π‘› = Φ𝑛, we have, E𝜁0 (cid:2)I(1,𝑛) ( π’š)(cid:3) = ∫ U𝑐 𝑛 ∩F𝑛 E𝜁 [(1 βˆ’ Φ𝑛)]𝑑Π(𝜁 |𝒛) ≀ sup 𝜁 ∈U𝑐 𝑛 ∩F𝑛 E𝜁 [(1 βˆ’ Φ𝑛)] ≀ exp(βˆ’πΆ2π‘›πœ– 2 𝑛) 23 where the last inequality follows as a consequence of Assumption. 2.2.1. Thus, (cid:104) P𝜁0 I(1,𝑛) ( π’š) β‰₯ exp (cid:16) βˆ’πΆ2π‘›πœ– 2 𝑛/2 (cid:17)(cid:105) ≀ exp (cid:16) 𝐢2π‘›πœ– 2 𝑛/2 (cid:17) exp(βˆ’πΆ2π‘›πœ– 2 𝑛) = exp (cid:16) βˆ’πΆ2π‘›πœ– 2 𝑛/2 (cid:17) The proof follows as a consequence of Borel-Cantelli lemma. Proof of (A.3): Note that E𝜁0 [I(2,𝑛) ( π’š)] = E𝜁0 ∫ (cid:20)∫ π‘ˆπ‘ π‘›βˆ©F𝑛 𝑝( π’š|𝜁) 𝑝(π’š|𝜁0) (cid:21) 𝑑Π(𝜁 |𝒛) E𝜁 [1] 𝑑Π(𝜁 |𝒛) = U𝑐 𝑛 ∩F 𝑐 𝑛 ≀ 𝑑Π(F 𝑐 𝑛 |𝒛) ≀ exp(βˆ’πΆ3π‘›πœ– 2 𝑛) By Assumption 2.2.2 (i) in the main paper, the second inequality holds. The proof follows as a consequence of the Borel-Cantelli Lemma. Proof of (A.4): Denote Π𝜁 (Β·|𝐡𝑛, 𝒛) = Π𝜁 (Β· ∩ 𝐡𝑛|𝒛)/Π𝜁 (𝐡𝑛|𝒛) to be the renormalized restriction of Π𝜁 (Β·|𝒛) on 𝐡𝑛, where 𝐡𝑛 is defined as, 𝐡𝑛 = {𝜁 : βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ < πœ–π‘›} Let, 𝑉𝑛,𝑖 = 𝜁0(𝒕𝑖) βˆ’ ∫ 𝜁 (𝒕𝑖)π‘‘Ξ πœ (𝜁 |𝐡𝑛), π‘Šπ‘›,𝑖 = ∫ 1 2 (𝜁 (𝒕𝑖) βˆ’ 𝜁0(𝒕𝑖))2π‘‘Ξ πœ (𝜁 |𝐡𝑛) H𝑛 = (cid:40)∫ 𝑛 (cid:214) 𝑖=1 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) π‘‘Ξ πœ (𝜁 |𝒛) > Π𝜁 (𝐡𝑛|𝒛) exp(βˆ’πΆ4π‘›πœ€2 𝑛) (cid:41) Note 𝑉 2 𝑛,𝑖 ≀ πœ€2 𝑛 and π‘Šπ‘›,𝑖 ≀ 1 2 πœ€2 𝑛, because ∫ 𝜁 (𝒙𝑖)π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) (cid:19) 2 (cid:18) 𝜁0(𝒙𝑖) βˆ’ ∫ 𝑉 2 𝑛,𝑖 = ≀ ≀ (𝜁0(𝒙𝑖) βˆ’ 𝜁 (𝒙𝑖))2 π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) ∫ βˆ₯𝜁0(𝒙𝑖) βˆ’ 𝜁 (𝒙𝑖)βˆ₯2 ∞ π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) ≀ πœ€2 𝑛 24 and because, Also note, ∫ 𝑛 (cid:214) 𝑖=1 = = = = ≀ Hence (𝜁 (𝒙𝑖) βˆ’ 𝜁0(𝒙𝑖))2 π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) βˆ₯𝜁 (𝒙𝑖) βˆ’ 𝜁0(𝒙𝑖) βˆ₯2 ∞ π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) π‘Šπ‘›,𝑖 = ≀ ≀ ∫ ∫ πœ€2 𝑛 1 2 1 2 1 2 Π𝜁 (𝜁 |𝐡𝑛, 𝒛) 𝑝(𝑦𝑖 |𝜁0,𝑖) 𝑝(𝑦𝑖 |πœπ‘–) ∫ 𝑛 βˆ‘οΈ (cid:26) βˆ’ (cid:26) βˆ’ (cid:26) 1 2 𝑖=1 ∫ 𝑛 βˆ‘οΈ 𝑖=1 ∫ 𝑛 βˆ‘οΈ 𝑖=1 𝑛 βˆ‘οΈ π‘Šπ‘›,𝑖 + 𝑖=1 1 2 πœ€2 𝑛 + 𝑛 βˆ‘οΈ 𝑖=1 H 𝐢 𝑛 = βŠ‚ βŠ‚ = βŠ‚ = log (2πœ‹) βˆ’ 1 2 (𝑦𝑖 βˆ’ 𝜁0,𝑖)2 + 1 2 log (2πœ‹) + (cid:27) (𝑦𝑖 βˆ’ πœπ‘–)2 1 2 Π𝜁 (𝜁 |𝐡𝑛, 𝒛) (𝜁0,𝑖 + 𝑒𝑖 βˆ’ 𝜁0,𝑖)2 + (𝜁0,𝑖 + 𝑒𝑖 βˆ’ πœπ‘–)2 (cid:27) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) 1 2 1 2 1 2 (πœπ‘– βˆ’ 𝜁0,𝑖)2 + 𝑒𝑖 (πœπ‘– βˆ’ 𝜁0,𝑖) (cid:27) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) 𝑛 βˆ‘οΈ 𝑖=1 𝑒𝑖𝑉𝑛,𝑖 𝑒𝑖𝑉𝑛,𝑖 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) Π𝜁 (𝜁 |𝒛) ≀ Π𝜁 ( Λœπ΅π‘›|𝒛) exp(βˆ’πΆ4π‘›πœ€2 𝑛) (cid:41) Π𝜁 (𝜁 | Λœπ΅π‘›, 𝒛) ≀ exp(βˆ’πΆ4π‘›πœ€2 𝑛) (cid:41) log 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) ≀ βˆ’πΆ4π‘›πœ€2 𝑛 (cid:41) log 𝑛 βˆ‘οΈ 𝑖=1 𝑝(𝑦𝑖 |𝜁0,𝑖) 𝑝(𝑦𝑖 |πœπ‘–) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) > 𝐢4π‘›πœ€2 𝑛 (cid:41) (cid:41) 𝑒𝑖𝑉𝑛,𝑖 > 𝐢4π‘›πœ€2 𝑛 (cid:41) (cid:40)∫ 𝑛 (cid:214) 𝑖=1 (cid:40)∫ 𝑛 (cid:214) 𝑖=1 (cid:40)∫ 𝑛 βˆ‘οΈ 𝑖=1 (cid:40)∫ 𝑛 βˆ‘οΈ 𝑖=1 πœ€2 𝑛 + (cid:40) 1 2 (cid:40) 𝑛 βˆ‘οΈ 𝑖=1 𝑒𝑖𝑉𝑛,𝑖 > 𝐢5π‘›πœ€2 𝑛 25 Now we can use the Chernoff bound for the Gaussian random variables, P0 (cid:40) 𝑛 βˆ‘οΈ 𝑖=1 (cid:41) (cid:32) 𝑒𝑖𝑉𝑛,𝑖 > 𝐢5π‘›πœ– 2 𝑛 ≀ exp βˆ’ 
(cid:16) 𝐢5π‘›πœ€2 𝑛 (cid:17) 2 (cid:33) 1 (cid:205)𝑛 𝑖=1 (cid:33) 𝑉 2 𝑛,𝑖 (cid:32) = exp βˆ’πΆ2 5 𝑉 2 𝑛,𝑖 π‘›πœ– 4 𝑛 𝑖=1 (cid:19) (cid:205)𝑛 1 𝑛 π‘›πœ– 4 𝑛 πœ– 2 𝑛 βˆ’πΆ2 5 (cid:17) βˆ’πΆ2 5 π‘›πœ– 2 𝑛 (cid:18) (cid:16) = exp = exp β†’ 0 Therefore, P0{H 𝐢 𝑛 } ≀ P0 (cid:41) 𝑒𝑖𝑉𝑛,𝑖 > πΆπ‘›πœ€2 𝑛 β†’ 0 (cid:40) 𝑛 βˆ‘οΈ 𝑖=1 With Assumption 2.2.2 (ii) in the main paper, this completes the proof. 26 APPENDIX B CHECKING ASSUMPTION 2.2.1 Assumption 2.2.1. (i) and (ii) require the existence of a uniformly consistent sequence of tests for testing 𝐻0 : 𝜁 = 𝜁0 versus 𝐻1 : 𝜁 ∈ U𝑐 πœ–π‘› ∩ F𝑛. The construction of these test functions for GPs has been studied in Choi and Schervish (2007) (see Section 3.5.3) and uses Assumption 2.2.3 in the main paper about the design points. Since these test functions mainly depend on the properties of 𝜁 and the model 𝑦𝑖 = 𝜁 (𝑑𝑖) + πœŽπœ–π‘–, the adaptation of the test functions in Choi and Schervish (2007) to our case is immediate with known 𝜎. For the completeness of the paper, the proof is adopted here based on the following five lemmas from Choi (2005). Lemma B.0.1. Let 𝜁1 be a continuous function in [0, 1] 𝑝, and define πœπ‘– 𝑗 = πœπ‘– (𝑑 𝑗 ) for 𝑖 = 0, 1 and 𝑗 = 1, Β· Β· Β· , 𝑛. Let πœ– > 0, and let π‘Ÿ > 0. Let 𝑐𝑛 = π‘›π‘Ÿ1 for 𝛼1/2 < 𝜏 < 1/2 and 1/2 < 𝛼1 < 1. Let 𝑏 𝑗 = 1 if 𝜁1 𝑗 > 𝜁0 𝑗 and -1 otherwise. Let Ξ¨1𝑛 [𝜁1, πœ–] be the indicator of the set 𝐴1, where 𝐴1 is defined as 𝐴1 = Then there exists a constant 𝐢3 such that for all 𝜁1 that satisfy (cid:205)𝑛 𝑛 βˆ‘οΈ 𝑗=1 𝑏 𝑗 (cid:19) (cid:18)π‘Œπ‘— βˆ’ 𝜁0 𝑗 𝜎0 √ > 2𝑐𝑛   ο£³ 𝑛   ο£Ύ 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘›, πΈπ‘ƒπœ 0 (Ξ¨1𝑛 [𝜁1, πœ–]) < 𝐢3 exp(βˆ’2𝑐2 𝑛) Also there exists constants 𝐢4 and 𝐢5 such that for all sufficiently large 𝑛 and all 𝜁 satisfying ||𝜁 βˆ’ 𝜁0||∞ < π‘Ÿ/4 and for all 𝜎 ≀ 𝜎0(1 + πœ–) πΈπ‘ƒπœ (1 βˆ’ Ξ¨1𝑛 [𝜁1, πœ–]) ≀ 𝐢4 exp(βˆ’π‘›πΆ5) where π‘ƒπœ is the joint distribution of {π‘Œπ‘›}∞ 𝑛=1 assuming parameters to be 𝜁. Proof: See Lemma 3.5.9 in Choi (2005). Lemma B.0.2. Let 𝜁1 be a continuous function in [0, 1] 𝑝, and define πœπ‘– 𝑗 = πœπ‘– (𝑑 𝑗 ) for 𝑖 = 0, 1 and 𝑗 = 1, Β· Β· Β· , 𝑛. Let Ξ¨2𝑛 be the indicator of the set 𝐴2, where 𝐴2 is defined as 𝐴2 = 𝑛 βˆ‘οΈ 𝑗=1   ο£³ 𝑏 𝑗 (cid:19) (cid:18)π‘Œπ‘— βˆ’ 𝜁0 𝑗 𝜎0 > 𝑛(1 + πœ–)   ο£Ύ 27 Then there exists a constant 𝐢6 such that for all 𝜁1, πΈπ‘ƒπœ 0 (Ξ¨2𝑛) < exp(βˆ’π‘›πΆ6) Also there exists constants 𝐢7 such that for all sufficiently large 𝑛 and all 𝜁 satisfying ||𝜁 βˆ’πœ0||∞ < π‘Ÿ/4 and for all 𝜎 > 𝜎0(1 + πœ–) πΈπ‘ƒπœ (1 βˆ’ Ξ¨2𝑛 [𝜁1, πœ–]) ≀ exp(βˆ’π‘›πΆ7) where π‘ƒπœ is the joint distribution of {π‘Œπ‘›}∞ 𝑛=1 assuming parameters as 𝜁. Proof: See Lemma 3.5.10 in Choi (2005). Lemma B.0.3. Let 𝜁1 be a continuous function in [0, 1] 𝑝, and define πœπ‘– 𝑗 = πœπ‘– (𝑑 𝑗 ) for 𝑖 = 0, 1 and πœ– βˆ’ πœ– 2. Let Ξ¨3𝑛 [𝜁1, πœ–] be the indicator of the set 𝐴3, 𝑗 = 1, Β· Β· Β· , 𝑛. 
Let πœ– > 0, and let 0 < π‘Ÿ < 4𝜎0 √ where 𝐴3 is defined as 𝐴3 = 𝑛 βˆ‘οΈ 𝑗=1 𝑏 𝑗 (cid:19) (cid:18)π‘Œπ‘— βˆ’ 𝜁1 𝑗 𝜎0 ≀ 𝑐(1 βˆ’ πœ– 2)   ο£³   ο£Ύ Then there exists a constant 𝐢8 such that for all 𝜁1 that satisfy (cid:205)𝑛 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘›, πΈπ‘ƒπœ 0 (Ξ¨3𝑛 [𝜁1, πœ–]) < exp(βˆ’π‘›πΆ8) Also there exists constants 𝐢9 such that for all sufficiently large 𝑛 and all 𝜁 satisfying ||𝜁 βˆ’ 𝜁0|| < π‘Ÿ/4 and for all 𝜎 ≀ 𝜎0(1 βˆ’ πœ–) πΈπ‘ƒπœ (1 βˆ’ Ξ¨3𝑛 [𝜁1, πœ–]) ≀ exp(βˆ’π‘›πΆ9) where π‘ƒπœ is the joint distribution of {π‘Œπ‘›}∞ 𝑛=1 assuming parameters to be 𝜁. Proof: See Lemma 3.5.11 in Choi (2005). The following two lemmas show that under Assumption 2.2.3. and Assumption 2.2.4., there exist enough number of design points to acquire (cid:205)𝑛 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘› in 𝑝 dimensions. Lemma B.0.4. Assume Assumption 2.2.3. and Assumption 2.2.4. holds. Let πœ† be the Lebesgue measure in R𝑝. Let 𝐾𝑑 be the constants mentioned in the Assumptions 2.2.3 and 𝑉 be a constant such that 𝑉 > (cid:13) (cid:13) βˆ€π’•1, 𝒕2 ∈ [0, 1] 𝑝, πœ•π‘‘ 𝑗 𝜁0(𝑑1, Β· Β· Β· , 𝑑 𝑝)(cid:13) (cid:13)∞. Let 𝐴𝑉 be the set of all continuous functions 𝛾 such that |𝛾(𝒕1) βˆ’ 𝛾(𝒕2)| ≀ 𝑉 ||𝒕1 βˆ’ 𝒕2||. For each such function 𝛾 and πœ–, define π΅πœ–,𝛾 = {𝑑 : πœ• 28 |𝛾(𝑑)| > πœ– }. Then for each πœ– > 0, there exists an integer 𝑁 such that for all 𝑛 β‰₯ 𝑁 and all 𝛾 ∈ 𝐴𝑉 , 𝑛 βˆ‘οΈ 𝑖=1 |𝛾(𝑑𝑖)| β‰₯ π‘›πœ†(π΅πœ–,𝛾) πœ– 2 Proof: See Lemma 4.2.1 in Choi (2005). Lemma B.0.5. Assume Assumption 2.2.3. and Assumption 2.2.4. holds. Let 𝐴𝑉 be the set of all continuous functions 𝜁 such that ||𝜁 ||∞ < 𝑀𝑛 and (cid:13) (cid:13) (cid:13) πœ• πœ•π’• 𝑗 𝜻 (𝒕1, Β· Β· Β· , 𝒕𝑑) (cid:13) (cid:13) (cid:13)∞ < 𝑉, 𝑗 = 1, Β· Β· Β· , 𝑑. Then for each πœ– > 0, there exist an integer 𝑁 and π‘Ÿ > 0 such that, for all 𝑛 β‰₯ 𝑁 and all 𝜁 ∈ 𝐴𝑉 such that ||𝜁 βˆ’ 𝜁0||1, (cid:205)𝑛 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | β‰₯ π‘Ÿπ‘› Proof: See Lemma 4.2.2 in Choi (2005). Now we are going to prove the existence of test functions in the KOH model setup. Proof: First, define Ψ𝑛 [𝜁1, πœ–] = 1 βˆ’ (1 βˆ’ Ξ¨1𝑛 [𝜁1, πœ–]) (1 βˆ’ Ξ¨2𝑛 [𝜁1, πœ–]) (1 βˆ’ Ξ¨3𝑛 [𝜁1, πœ–]) From Lemmas B.0.1-B.0.3, it is clear that πΈπ‘ƒπœ for each 𝜁1 ∈ U𝐢 πœ–π‘›. To create a test that doesn’t depend on a specific choice of 𝜁1 in Lemma B.0.1, we make use of the covering number of the sieve. Let π‘Ÿ be the same number that appears in 𝑛) and πΈπ‘ƒπœ (1 βˆ’ Ψ𝑛) ≀ exp(βˆ’π‘›πΆ10) Ψ𝑛 < exp(βˆ’2𝑐2 0 Lemmas B.0.1, B.0.2, B.0.3. Let πœ€ = min{πœ–/2, π‘Ÿ/4}. Let π‘πœ€ be the πœ€-covering number of F𝑛 in the supremum norm. Recall that given 0 < 𝛿 < 1/2, (2𝛿 + 1)/2 < 𝛼1 < 1 and 𝛼/2 < 𝜏1 < 1/2. Thus, with 𝑀𝑛 = O (𝑛𝛼1) and 𝑐𝑛 = π‘›πœ1, we have log(π‘πœ€) = π‘œ(𝑐2 𝑛). Let 𝜁 1, Β· Β· Β· , 𝜁 𝑁 πœ€ ∈ F𝑛 be such that for each 𝜁 ∈ F𝑛, there exists 𝑗 such that ||𝜁 βˆ’ 𝜁 𝑗 ||∞ < πœ€. If ||𝜁 βˆ’ 𝜁 𝑗 ||1 > πœ–, then ||𝜁 𝑗 βˆ’ 𝜁0||1 > πœ–/2. Finally, define Ψ𝑛 [𝜁 𝑗 , πœ– 2 ] 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘› holds for every such 𝜁 𝑗 in Φ𝑛 = max 1≀ 𝑗 ≀𝑁 πœ€ Since we verified that there exists π‘Ÿ such that (cid:205)𝑛 Lemma B.0.5, then 29 πΈπ‘ƒπœ 0 Φ𝑛 ≀ 𝑁 πœ€βˆ‘οΈ 𝑗=1 πΈπ‘ƒπœ 0 Ψ𝑛 [𝜁 𝑗 , πœ– 2 ] ≀ 𝐢3π‘πœ€ exp(βˆ’2𝑐2 𝑛) = 𝐢3 exp(log(π‘πœ€) βˆ’ 2𝑐2 𝑛) ≀ 𝐢3 exp(βˆ’π‘2 𝑛) For 𝜁 ∈ U𝐢 πœ–π‘› ∩ F𝑛, the type 2 error probability of Ψ𝑛 is no larger than the minimum of the individual type 2 error probabilities of the Ψ𝑛 [𝜁 𝑗 , πœ–/2] tests. Hence we have the results. 
30 APPENDIX C CHECKING ASSUMPTION 2.2.2 1 We construct our sieve F𝑛 such that it increases to the space of all continuously differentiable functions with bounded derivatives on [0, 1] 𝑝. To this end, we make use of an increasing sequence of constants 𝑀𝑛. To be specific, let {𝑀𝑛}∞ 𝑛=1 be a sequence of numbers such that 𝑀𝑛 β†’ ∞ as 𝑛 β†’ ∞ and 𝑀𝑛 = O (𝑛𝛼) for some 𝛼 ∈ (1/2, 1). Also, define, (cid:26) F𝑛 = 𝜁 (Β·) : ||𝜁 ||∞ < 𝑀𝑛, || πœ• πœ•π‘‘π‘– 𝜁 ||∞ < 𝑀𝑛, 𝑖 = 1, Β· Β· Β· , 𝑝 (cid:27) (C.1) In lieu of (C.1), for the sequence of sieves F𝑛, Π𝜁 (F 𝑐 𝑛 |𝒛) is exponentially small as long as the following two probabilities are exponentially small Π𝜁 {𝜁 : ||𝜁 ||∞ > 𝑀𝑛|𝒛} πœ• πœ•π‘‘π‘– 𝜁 ||∞ > 𝑀𝑛 𝜁 : || Π𝜁 (cid:26) (cid:27) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 , 𝑖 = 1, Β· Β· Β· , 𝑝 (C.2) (C.3) This result is stated in the Lemma below. Lemma C.0.1. Let 𝑀𝑛 = O (𝑛𝛼), where 𝛼 ∈ (1/2, 1) and 2𝛼 βˆ’ β„Ž β‰₯ 1 where β„Ž is as defined in Assumption 2.2.5. 1. Then, there exists positive constants 𝐷1, 𝐷2, 𝑑1, 𝑑2 such that, Π𝜁 {𝜁 : ||𝜁 ||∞ > 𝑀𝑛|𝒛} ≀ 𝐷1 exp(βˆ’π‘‘1𝑛) (cid:12) (cid:26) (cid:12) (cid:12) (cid:12) ≀ 𝐷2 exp(βˆ’π‘‘2𝑛), 𝜁 ||∞ > 𝑀𝑛 πœ• πœ•π‘‘π‘– 𝜁 : || (cid:27) 𝒛 Π𝜁 𝑖 = 1, Β· Β· Β· , 𝑝 Proof: For the proof of the lemma, we show the results step by step. First, note that for π‘₯ β‰₯ 0, 0 ≀ 𝑅(π‘₯) ≀ 1, βˆ’π‘‡1 ≀ 𝑅′(π‘₯) ≀ 0, βˆ’π‘‡2 ≀ 𝑅′′(π‘₯) ≀ 𝑇3 for positive 𝑇1, 𝑇2, and 𝑇3. In addition, if we evaluate at π‘₯ = 0, then 𝑅(0) = 1 and 𝑅′′(0) = βˆ’π‘‡4 < 0, where 𝑇4 > 0. The widely used Squared Exponential covariance function, MatΓ©rn covariance function, and Cauchy covariance function satisfy these equalities and inequalities. Please refer to Chapter 4 of Abrahamsen (1997) for further description. Then, 31 E𝜁 |𝜽,𝝓,𝒛 [𝜁 (𝒕)] = = = ≀ ≀ ≀ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘˜=1 𝑠 βˆ‘οΈ 𝑗=1 𝑠 βˆ‘οΈ π‘˜=1 𝑠 βˆ‘οΈ 𝑗=1 𝑠 βˆ‘οΈ π‘˜=1 𝑗=1 π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) (cid:18) 1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , π‘™πœƒ) (cid:19) π‘§π‘˜ πœ‚ 𝑓 𝑅(𝒕, Λœπ’• 𝑗 ; 𝑙 𝑓𝑑 )𝑅(𝜽, ˜𝜽 𝑗 ; π‘™πœƒ) (cid:18) 1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:19) π‘§π‘˜ 𝑅(𝑑1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′𝑙 𝑓 𝑝+ 𝑝′ ) (cid:16) ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:17) π‘§π‘˜ ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:17) π‘§π‘˜ (cid:19) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘§π‘˜ π‘˜=1 𝑗=1 (cid:12) (cid:12) (cid:12) ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:12) (cid:12) (cid:12) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ (cid:16) 𝑗=1 π‘˜=1 (cid:18) max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} (cid:18) max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} (cid:19) π‘§π‘˜ ¯𝐢 = 𝐢 (C.4) E𝜁 |𝜽,𝝓,𝒛 (cid:20) πœ• πœ•π‘‘ π‘˜ (cid:21) = 𝜁 (𝒕) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘˜=1 𝑗=1 𝑅(𝑑1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(𝑑 π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) βˆ’π‘™ π‘“π‘˜ (cid:33) 𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′𝑙 𝑓 𝑝+ 𝑝′ ) 𝑠 (cid:18) βˆ‘οΈ 𝑠 βˆ‘οΈ (cid:19) 𝑙 π‘“π‘˜ ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) ≀ 𝑇1 max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ (cid:12) (cid:12) (cid:12) π‘˜=1 𝑗=1 ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:17) π‘§π‘˜ (cid:16) (cid:12) (cid:12) (cid:12) (C.5) Step 1) Let, 𝜎2 0 (𝜁) = supπ’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ (𝜁 (𝒕)|𝒛, 𝜽, 𝝓). By applying Lemma.3.5.6 from Choi (2005) at the 1st inequality below, we can bound the probability of the given set from above. 
(cid:18) P (cid:12) (cid:12) (cid:12) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 sup π’•βˆˆπ›€βŠ‚R 𝑝 = ≀ = Θ ∫ 𝐴𝝓 ∫ Θ 𝐴𝝓 ∫ ∫ Θ 𝐴𝝓 ∫ ∫ (cid:18) P (cid:12) (cid:12) (cid:12) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) 𝒛, 𝜽, 𝝓 (cid:12) (cid:12) (cid:12) (cid:12) sup π’•βˆˆπ›€βŠ‚R 𝑝 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ (cid:32) exp βˆ’ 1 4 𝜎2 0 𝑀 2 𝑛 (𝜁) + E2(𝜁 (𝒕)|𝒛, 𝜽, 𝝓) (cid:33) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ 32 Hence, using Assumption 8. 1, (cid:12) (cid:12) (cid:12) (cid:12) sup π’•βˆˆπ›€βŠ‚R 𝑝 > 𝑀𝑛 𝜁 (𝒕) P (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:18) 𝒛 (cid:19) = = ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } 0 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ ∫ ∫ ∫ π‘›β„Ž ∫ π‘›β„Ž Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } 0 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ + + βˆ’ ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } 0 π‘›β„Ž βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } π‘›β„Ž 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) + 𝐢2 + 𝐢2 ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } π‘›β„Ž π‘›β„Ž βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ (cid:170) (cid:174) (cid:174) (cid:172) (cid:170) (cid:174) (cid:174) (cid:172) (cid:170) (cid:174) (cid:174) (cid:172) ∫ ∫ ≀ ∫ π‘›β„Ž ∫ π‘›β„Ž Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } 0 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ ∫ ∫ ∞ ∫ ∞ ∫ + + βˆ’ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } Θ ∫ ∫ Θ ∫ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } ∫ 0 ∫ ∞ π‘›β„Ž ∫ ∞ π‘›β„Ž ∫ ∞ 0 ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } π‘›β„Ž π‘›β„Ž 1𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 1𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 0𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 33 Therefore, (cid:18) P (cid:12) (cid:12) (cid:12) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 ≀ ∫ ∫ Θ sup 
π’•βˆˆπ›€βŠ‚R 𝑝 𝐴𝝓2 ∫ π‘›β„Ž ∫ π‘›β„Ž 1 4 0 0 exp (cid:169) βˆ’ (cid:173) (cid:173) (cid:171) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) + P(πœ‚ 𝑓 > π‘›β„Ž|𝒛) + P(πœ‚π›Ώ > π‘›β„Ž|𝒛) βˆ’ 0 𝑛2𝛼1 π‘›β„Ž + π‘›β„Ž + π‘›β„Ž ¯𝐢 + 𝐢2 (cid:19) (cid:18) ≀ exp (cid:16) ≀ exp βˆ’ 1 4 βˆ’πΆπ‘›2𝛼1βˆ’β„Ž(cid:17) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) ≀ 𝐷1 exp(βˆ’π‘‘1𝑛) Step 2) Let 𝜎2 (cid:16) βˆ’π‘€ 2 𝑛 /4(𝑇4πœ‚ 𝑓 𝑙2 π‘“π‘˜ exp π‘˜ (𝜁) = supπ’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ + 𝑇4πœ‚π›Ώπ‘™2 𝛿𝑖 + πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑇 2 1 (cid:18) πœ• πœ•π‘‘ π‘˜ 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) (cid:12) ¯𝐢 βˆ’ 𝑇1𝐢𝑙 π‘“π‘˜ ¯𝐢) 𝒛, 𝜽, 𝝓 (cid:17) (cid:19) . Further let, π‘™βˆ— π‘“π‘˜ , π‘™βˆ— π›Ώπ‘˜ are the maximizer of in the bounded support. By applying the same logic as in step 1), we have the upper bound of the probability. (cid:12) (cid:12) (cid:12) πœ• πœ•π‘‘ π‘˜ (cid:18) P (cid:18) P sup π’•βˆˆπ›€βŠ‚R 𝑝 ∫ ∫ Θ 𝐴𝝓 ∫ ∫ Θ 𝐴𝝓 = = ∫ π‘›β„Ž ∫ ≀ exp (cid:169) βˆ’ (cid:173) (cid:173) (cid:171) ∫ π‘›β„Ž 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 (cid:12) (cid:12) (cid:12) πœ• πœ•π‘‘ π‘˜ 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) 𝒛, 𝜽, 𝝓 (cid:12) (cid:12) (cid:12) (cid:12) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ sup π’•βˆˆπ›€βŠ‚R 𝑝 1 4 Γ— 𝑀 2 𝑛 π‘˜ (𝜁) + E2 𝜎2 𝜁 |𝜽,𝝓,𝒛 (cid:104) πœ• πœ•π‘‘ π‘˜ 𝜁 (𝒕) (cid:105) (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ Θ 0 0 βˆ’ 1 4 Γ— exp (cid:169) (cid:173) (cid:173) (cid:171) 𝑇4πœ‚ 𝑓 𝑙2 𝑓 + 𝑇4πœ‚π›Ώπ‘™2 𝛿 + πœ‚ 𝑓 𝑙2 𝑓 𝑇 2 1 𝑀 2 𝑛 ¯𝐢 + 𝑇1 (cid:0)maxπ‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ (cid:1) (cid:205)𝑠 π‘˜=1 (cid:205)𝑠 𝑗=1 𝑙 π‘“π‘˜ (cid:12) (cid:12) (cid:12) 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) ΛœΞ£βˆ’1 (cid:170) (cid:174) (cid:174) (cid:172) (cid:12) (cid:12) (cid:12) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ + P(πœ‚ 𝑓 > π‘›β„Ž|𝒛) + P(πœ‚π›Ώ > π‘›β„Ž|𝒛) 34 Hence, (cid:18) P ≀ (cid:12) (cid:12) (cid:12) ∫ π‘›β„Ž sup π’•βˆˆπ›€βŠ‚R 𝑝 ∫ πœ• πœ•π‘‘ π‘˜ ∫ π‘›β„Ž 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 (cid:32) exp βˆ’ 1 4 Γ— 𝑇4πœ‚ 𝑓 π‘™βˆ—2 π‘“π‘˜ + 𝑇4πœ‚π›Ώπ‘™βˆ—2 π›Ώπ‘˜ 𝑀 2 𝑛 + πœ‚ 𝑓 π‘™βˆ—2 π‘“π‘˜ 𝑇 2 1 ¯𝐢 + 𝑇1πΆπ‘™βˆ— π‘“π‘˜ ¯𝐢 (cid:33) Θ 0 0 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ (cid:32) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) 𝑛2𝛼1 + π‘›β„Žπ‘™βˆ—2 π‘“π‘˜ (cid:33) + 𝑇4π‘›β„Žπ‘™βˆ—2 π›Ώπ‘˜ 1 4 βˆ’ Γ— (cid:32) ≀ exp 𝑇4π‘›β„Žπ‘™βˆ—2 π‘“π‘˜ 𝑛2𝛼1βˆ’β„Ž + 𝑇4π‘™βˆ—2 π›Ώπ‘˜ ≲ exp (cid:16) = exp βˆ’ 4(𝑇4π‘™βˆ—2 π‘“π‘˜ βˆ’πΆπ‘›2𝛼1βˆ’β„Ž(cid:17) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) ¯𝐢) 𝑇 2 1 + π‘™βˆ—2 π‘“π‘˜ + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) (cid:33) 𝑇 2 1 ¯𝐢 + 𝑇1πΆπ‘™βˆ— π‘“π‘˜ ¯𝐢 + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) ≀ 𝐷2 exp (βˆ’π‘‘2𝑛) Note that for the second inequality, we used the assumption that the length scale parameters have bounded support, which leads to specific values, π‘™βˆ— 𝑓 and π‘™βˆ— 𝛿, that maximize the term in the exponential. In addition the bound of 𝜎2 0 and 𝜎2 𝑖 used in the proof of step 1) and step 2) are derived below. Bounds used for the proof of step 1) and step 2) above Remember from Assumption (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 𝑖=1 𝑗=1 (cid:205)𝑠 2.2.5. 
2, we have (cid:205)𝑠 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , π‘™πœƒ) is the (j,i)-th element ΛœΞ£βˆ’1 of [𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽))]βˆ’1, and it depends on 𝑙 𝑓 , 𝑙𝛿. Under the assumed covariance structure, the co- variance kernel of 𝜁 conditioned on 𝒛, 𝜽, 𝝓 is as follows. Let 𝑙 𝑓 = (𝑙 𝑓𝑑 , π‘™πœƒ), where 𝑙 𝑓𝑑 = (𝑙 𝑓1 , Β· Β· Β· , 𝑙 𝑓 𝑝 ) and π‘™πœƒ = (𝑙 𝑓 𝑝+1 , Β· Β· Β· , 𝑙𝛿 𝑝 ) , Β· Β· Β· , 𝑙 𝑓 𝑝+𝑝′), and 𝑙𝛿 = (𝑙𝛿1 ≀ ¯𝐢 Also, 1 πœ‚ 𝑓 π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) = π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) + π‘˜ 𝛿 (𝒕, 𝒕′) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ πœ‚ 𝑓 𝑅(𝒕, Λœπ’•π‘–; 𝑙 𝑓𝑑 )𝑅(𝜽, Λœπœ½π‘–; π‘™πœƒ) (cid:18) 1 πœ‚ 𝑓 (cid:19) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) πœ‚ 𝑓 𝑅(𝒕′, Λœπ’• 𝑗 ; 𝑙 𝑓𝑑 )𝑅(𝜽, ˜𝜽 𝑗 ; π‘™πœƒ) βˆ’ 𝑖=1 𝑠 βˆ‘οΈ 𝑗=1 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 βˆ’ πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝒕, Λœπ’•π‘–; 𝑙 𝑓𝑑 )𝑅(𝜽, Λœπœ½π‘–; π‘™πœƒ)𝑅(𝒕′, Λœπ’• 𝑗 ; 𝑙 𝑓𝑑 )𝑅(𝜽, ˜𝜽 𝑗 ; π‘™πœƒ) 35 Hence, π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) = πœ‚ 𝑓 𝑅(𝒕, 𝒕′; 𝑙 𝑓𝑑 ) + πœ‚π›Ώ 𝑅(𝒕, Λœπ’•β€²; 𝑙𝛿) = πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 + πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑙𝑑2,𝑑′2;𝛿2 ; 𝑙 𝑓 𝑝 ) ; 𝑙𝛿 𝑝 ) βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(𝑑2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) By letting π’•βˆ— the maximizer of Vπ‘Žπ‘Ÿ (𝜁 (𝒕)|𝒛, 𝜽, 𝝓), we can bound the following term. 0 (𝜁) = sup 𝜎2 π’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ (𝜁 (𝒕)|𝒛, 𝜽, 𝝓) = πœ‚ 𝑓 𝑅(0) 𝑝 + πœ‚π›Ώ 𝑅(0) 𝑝 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ βˆ’ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ2 𝑖 ; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) ≀ πœ‚ 𝑓 𝑅(0) 𝑝 + πœ‚π›Ώ 𝑅(0) 𝑝 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ + 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘ Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) 36 Since 𝑅(0) = 1 and 𝑅(π‘₯) ≀ 1 for widely used covariance functions such as Squared Exponential covariance, MatΓ©rn covariance, and Cauchy covariance and using Assumption 8. 2, 0 (𝜁) ≀ πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 ¯𝐢 𝜎2 For bounding 𝜎2 𝑖 (𝜁) = supπ’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ (cid:18) πœ• πœ•π‘‘π‘– 𝜁 (𝒕) (cid:19) 𝒛, 𝜽, 𝝓 , we are going to use the fact that the process constructed by taking the partial derivative of 𝜁 |𝒛, 𝜽, 𝝓 with respect to the π‘˜ π‘‘β„Ž component is again (cid:12) (cid:12) (cid:12) (cid:12) a GP with continuous sample paths and covariance function with πœ•2 πœ•π‘‘ π‘˜ πœ•π‘‘β€²π‘˜ π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′). For proof of this fact, please refer to Ghosal and Roy (2006). 
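As a quick numerical illustration of the fact just cited (a sketch of my own, not part of the thesis, using a squared exponential kernel as a stand-in for $R$): the covariance of the derivative process is the mixed second derivative $\partial^2 k/\partial t_k \partial t'_k$ of the original kernel, and this can be checked against a finite-difference approximation.

```python
import numpy as np

def k_se(t, tp, ell=1.0):
    # squared exponential kernel k(t, t') = exp(-(t - t')^2 / ell^2), a stand-in for R
    return np.exp(-((t - tp) ** 2) / ell ** 2)

def d2k_dt_dtp(t, tp, ell=1.0):
    # analytic mixed derivative d^2 k / (dt dt'), i.e. the covariance of the derivative process
    u = (t - tp) / ell
    return (2.0 / ell ** 2) * (1.0 - 2.0 * u ** 2) * np.exp(-u ** 2)

# central finite-difference approximation of the same mixed derivative
h, t, tp = 1e-4, 0.3, 0.7
fd = (k_se(t + h, tp + h) - k_se(t + h, tp - h)
      - k_se(t - h, tp + h) + k_se(t - h, tp - h)) / (4.0 * h ** 2)
print(fd, d2k_dt_dtp(t, tp))   # the two values should agree to several decimal places
```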
πœ• πœ•π‘‘ π‘˜ = πœ• π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) πœ•π‘‘β€²π‘˜ πœ• πœ• πœ•π‘‘β€²π‘˜ [πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 πœ•π‘‘ π‘˜ + πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿1) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙𝛿 𝑝 ) ; 𝑙 𝑓 𝑝 ) βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(𝑑2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ )] = πœ• πœ•π‘‘ π‘˜ [πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅′(𝑑 π‘˜ , π‘‘β€²π‘˜ ; 𝑙 π‘“π‘˜ ) + πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿2) Β· Β· Β· 𝑅′(𝑑 π‘˜ , π‘‘β€²π‘˜ ; π‘™π›Ώπ‘˜ ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙 𝑓 𝑝 ) (cid:19) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙𝛿 𝑝 ) (cid:19) (cid:18) (cid:18) βˆ’π‘™ π‘“π‘˜ 𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ |𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ | 𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ |𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ | 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 βˆ’π‘™π›Ώπ‘˜ βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(𝑑2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝 )𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅′(π‘‘β€²π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) βˆ’π‘™ π‘“π‘˜ (cid:33) π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ )] 37 Hence, πœ• πœ•π‘‘ π‘˜ πœ• πœ•π‘‘β€²π‘˜ π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) = βˆ’πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅′′(𝑑 π‘˜ , π‘‘β€²π‘˜ ; 𝑙 π‘“π‘˜ ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙 𝑓 𝑝 ) βˆ’ πœ‚π›Ώπ‘™2 π›Ώπ‘˜ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿2) Β· Β· Β· 𝑅′′(𝑑 π‘˜ , π‘‘β€²π‘˜ βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(𝑑 π‘˜ , Λœπ‘‘ π‘˜ 𝑖 ; 𝑙 π‘“π‘˜ ) βˆ’π‘™ π‘“π‘˜ ; π‘™π›Ώπ‘˜ ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 (cid:32) ; 𝑙𝛿 𝑝 ) 𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 |𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 | (cid:33) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(π‘‘β€²π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) βˆ’π‘™ π‘“π‘˜ (cid:33) π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Assume that π‘‘βˆ—βˆ— is the maximizer of πœ•2 πœ• (𝑑 π‘˜)2 Vπ‘Žπ‘Ÿ (cid:18) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) (cid:12) (cid:19) 𝒛, 𝜽, 𝝓 . Then, we can simplify the bound by using the following facts coming from the covariance assumption for R. 
π‘˜ (𝜁) = sup 𝜎2 π’•βˆˆπ›€βŠ‚R 𝑝 πœ•2 πœ• (𝑑 π‘˜ )2 Vπ‘Žπ‘Ÿ (cid:18) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) (cid:12) (cid:19) 𝒛, 𝜽, 𝝓 = βˆ’πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑅(0) π‘βˆ’1𝑅′′(0) βˆ’ πœ‚π›Ώπ‘™2 π›Ώπ‘˜ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) βˆ’ 𝑖=1 𝑗=1 𝑅(0) π‘βˆ’1𝑅′′(0) Γ— 𝑅(π‘‘βˆ—βˆ—1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(π‘‘βˆ—βˆ—π‘˜ , Λœπ‘‘ π‘˜ 𝑖 ; 𝑙 π‘“π‘˜ ) (cid:32) 𝑙 π‘“π‘˜ (cid:33) π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 |π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 | Β· Β· Β· 𝑅(π‘‘βˆ—βˆ— 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(π‘‘βˆ—βˆ—1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(π‘‘βˆ—βˆ—π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) 𝑙 π‘“π‘˜ (cid:33) π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(π‘‘βˆ—βˆ— 𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1 βˆ’ Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) ≀ 𝑇4πœ‚ 𝑓 𝑙2 π‘“π‘˜ + 𝑇4πœ‚π›Ώπ‘™2 π›Ώπ‘˜ + πœ‚ 𝑓 𝑙2 π‘“π‘˜ ≀ 𝑇4πœ‚ 𝑓 𝑙2 π‘“π‘˜ + 𝑇4πœ‚π›Ώπ‘™2 π›Ώπ‘˜ + πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑇 2 1 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 (cid:12) (cid:12) (cid:12) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) (cid:12) (cid:12) (cid:12) ¯𝐢 𝑇 2 1 38 APPENDIX D CHECKING ASSUMPTION 2.2.2 2 First, recall two classical spaces of finite smoothness: HΓΆlder and Sobolev. HΓΆlder space 𝐢𝛼 (X), where 𝛼 > 0, is the space of all functions whose partial derivatives of order up to βŒˆπ›ΌβŒ‰ βˆ’ 1 exist and are uniformly bounded, where βŒˆπ›ΌβŒ‰ represents the largest integer that is smaller than 𝛼. Sobolev space 𝐻𝛼 (X) is the set of functions that are restriction to X of functions 𝑓0 : R𝑝 β†’ R with Fourier transform ˆ𝑓0(πœ†) = (2πœ‹)βˆ’π‘ ∫ exp(π‘–πœ†π‘‡ 𝑑) 𝑓0(𝑑)𝑑𝑑 satisfying ∫ (1 + βˆ₯πœ†βˆ₯2)𝛼| ˆ𝑓0|2π‘‘πœ† < ∞. van der Vaart and van Zanten (2011) showed behavior of concentration function in two different classes of GP. First, consider MatΓ©rn class of GP prior, which corresponds to the mean zero GP π‘Š = (π‘Šπ‘‘ : 𝑑 ∈ [0, 1] 𝑝) with covariance function E[π‘Šπ‘ π‘Šπ‘‘] = ∫ R 𝑝 exp(π‘–πœ†π‘‡ (𝑠 βˆ’ 𝑑))π‘š(πœ†)π‘‘πœ† where spectral densities π‘š : R𝑝 β†’ R is given by, for 𝛼 > 0, π‘š(πœ†) = 1 (π‘Ž + βˆ₯πœ†βˆ₯2)𝛼+𝑝/2 (D.1) (D.2) Combining the results of Lemma 3 and Lemma 4 ofvan der Vaart and van Zanten (2011), we have the following results, Lemma D.0.1. For MatΓ©rn class of GP, if 𝜁0 ∈ 𝐢 𝛽 [0, 1] 𝑝 ∩ 𝐻 𝛽 [0, 1] 𝑝 for 0 < 𝛽 ≀ 𝛼, the concentration function satisfies, πœ™π‘€ 𝜁0 (πœ–π‘›) ≲ πœ– βˆ’ (2𝛼+ π‘βˆ’2𝛽) 𝛽 𝑛 βˆ’ 𝑝 𝛼 𝑛 + πœ– (D.3) For the second case, consider the Square Exponential class of GP prior. The squared exponential class corresponds to the mean zero GP with covariance function. E[π‘Šπ‘ π‘Šπ‘‘] = exp(βˆ’ βˆ₯𝑠 βˆ’ 𝑑 βˆ₯2), 𝑠, 𝑑 ∈ [0, 1] 𝑝 (D.4) 39 where spectral densities π‘š : R𝑝 β†’ R is given by, π‘š(πœ†) = 1 2π‘πœ‹ 𝑝/2 exp (cid:19) (cid:18) βˆ’ βˆ₯πœ†βˆ₯2 4 (D.5) Combining the results of Lemma 6 and Lemma 7 ofvan der Vaart and van Zanten (2011), we have the following results, Lemma D.0.2. For Squared Exponential class of GP, if 𝜁0 ∈ 𝐻 𝛽 [0, 1] 𝑝 for 𝑑/2 < 𝛽, then for a constant 𝐢 that depends only on 𝜁0, the concentration function satisfies, πœ™π‘†πΈ 𝜁0 (πœ–π‘›) ≲ exp(πΆπœ– βˆ’2/(π›½βˆ’π‘/2) 𝑛 ) + (βˆ’ log πœ–π‘›)1+𝑝 (D.6) Note both the right hand sides of πœ™π‘€ 𝜁0 (πœ–π‘›) and πœ™π‘†πΈ 𝜁0 (πœ–π‘›) tend to infinty as 𝑛 goes to infinity. 
However, since a slowly decreasing sequence $\epsilon_n$ satisfies $n\epsilon_n^2 \to \infty$ as $n$ goes to infinity, for large enough $n$ we have $\phi^{M}_{\zeta_0}(\epsilon_n) \lesssim n\epsilon_n^2$ and $\phi^{SE}_{\zeta_0}(\epsilon_n) \lesssim n\epsilon_n^2$.

CHAPTER 3
CALIBRATED COMPUTER MODEL USING VARIATIONAL BAYES (VARIATIONAL INFERENCE)

3.1 Variational Bayes formulation

This section introduces the general Variational Bayes (VB) approach that we are going to use. Let the set of parameters of interest $\boldsymbol{\Psi}$ have a prior distribution $p(\boldsymbol{\Psi})$. Given a data set $\mathcal{D}$, the goal of the Bayesian method is to find the posterior distribution. Using Bayes' theorem, that is to say, find
$$
\pi(\boldsymbol{\Psi}\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\boldsymbol{\Psi})\,p(\boldsymbol{\Psi})}{\int_{\boldsymbol{\Psi}} p(\mathcal{D}\mid\boldsymbol{\Psi})\,p(\boldsymbol{\Psi})\,d\boldsymbol{\Psi}}.
$$
The core difficulty of posterior computation comes from the denominator, the evidence. Much of modern Bayesian computation is devoted to approximating this difficult-to-calculate quantity (Blei et al., 2017). There are diverse approaches to approximating posterior distributions; to name a few, Markov chain Monte Carlo (MCMC), Hamiltonian Monte Carlo (HMC), and Approximate Bayesian Computation (ABC). An alternative method from machine learning has recently gained popularity: Variational Bayes (VB), also called Variational Inference (VI). The advantage of VB over the other approaches is clear: it is faster and easier to scale to large data sets. This is particularly appealing in the computer simulation setting, where we want to build a GP surrogate from a large set of simulation runs.

To apply VB, we first need a divergence criterion between two probability measures. The most widely used criterion is the Kullback-Leibler (KL) divergence, or KL distance.

Definition 1. (Kullback-Leibler (KL) divergence) Let $P$ and $Q$ be two probability measures on a measurable space $\mathcal{X}$ with $P$ absolutely continuous with respect to $Q$, and let $\mu$ be any measure on $\mathcal{X}$ for which densities $p$ and $q$ with $P(dx) = p(x)\mu(dx)$ and $Q(dx) = q(x)\mu(dx)$ exist. Then the Kullback-Leibler divergence is defined as
$$
\mathrm{KL}(P\,\|\,Q) = \int_{\mathcal{X}} \log\frac{P(dx)}{Q(dx)}\,P(dx) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\,\mu(dx) = \mathbb{E}_p\!\left[\log\frac{p}{q}\right].
$$

Definition 1 is the standard definition of KL divergence. However, since we are working on infinite-dimensional function spaces, there is no sensible infinite-dimensional Lebesgue measure, so we need a more general definition to work with. Kolmogorov formulated a general definition of KL divergence as a supremum of KL divergences over all finite measurable partitions of the index set (Kolmogorov et al., 1992). This formulation has been adopted in the functional approach to variational inference, for example by de G. Matthews et al. (2016) and Sun et al. (2019). Early works on KL divergence between stochastic processes include Csató and Opper (2002), Csató (2002), Seeger (2003a), and Seeger (2003b).

Definition 2. (Kullback-Leibler (KL) divergence for stochastic processes) For two stochastic processes $P$ and $Q$, the Kullback-Leibler divergence is defined as
$$
\mathrm{KL}(P\,\|\,Q) = \sup_{X \in \mathcal{X}} \mathrm{KL}(P_X\,\|\,Q_X),
$$
where $X$ denotes a measurement set and $P_X$, $Q_X$ are the marginal distributions of the function values at $X$.

Note that this definition of KL divergence depends on the measurement set $X$. Sun et al. (2019) proposed sampling-based measurement sets, where the measurement set is sampled from both the training inputs and random points in the domain.
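To make Definition 2 concrete, here is a small numerical illustration of my own (not from the thesis): for zero-mean GPs the marginal laws at a finite measurement set are multivariate Gaussians, so the finite-dimensional KL in Definition 2 has the usual closed form. The kernel and length scales below are arbitrary choices.

```python
import numpy as np

def k_se(x, y, ell):
    # squared exponential kernel on one-dimensional inputs
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / ell ** 2)

def kl_gaussian(mu1, S1, mu2, S2):
    # KL( N(mu1, S1) || N(mu2, S2) ) for k-dimensional Gaussians
    k = mu1.size
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - k + logdet2 - logdet1)

# Two zero-mean GPs on [0, 1] that differ only in length scale; Definition 2 takes the
# supremum of these finite-dimensional KLs over measurement sets X.
for n_pts in (5, 9, 17):                                  # nested grids
    X = np.linspace(0.0, 1.0, n_pts)
    S1 = k_se(X, X, ell=0.5) + 1e-6 * np.eye(n_pts)       # jitter for numerical stability
    S2 = k_se(X, X, ell=1.0) + 1e-6 * np.eye(n_pts)
    print(n_pts, kl_gaussian(np.zeros(n_pts), S1, np.zeros(n_pts), S2))
```

Because the grids are nested, the marginal KL cannot decrease as the measurement set is refined, which is consistent with taking a supremum over measurement sets in Definition 2.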
In this paper, however, we assume for the theoretical analysis that the measurement set contains all of the training data points. Then, for the same class of stochastic process, the logarithm term in the KL divergence cancels because we are considering a finite measurement set. This simplifies our proof of the consistency of the variational posterior.

The first step in building a VB approach is to predefine a family of distributions over which we optimize. This family is called the variational family, $\mathcal{Q}$. Common practice is to consider a family of distributions that assumes a product form across the dimensions of the parameters, the mean field variational family $\mathcal{Q}_{MF}$. After setting up the variational family, the next step is to find the distribution closest to the true posterior in KL distance over $\mathcal{Q}_{MF}$. Mathematically,
$$
Q^* = \operatorname*{arg\,min}_{Q \in \mathcal{Q}_{MF}} \mathrm{KL}\big(Q \,\|\, \Pi(\cdot\mid\mathcal{D})\big).
$$
The problem with this optimization is that we cannot calculate the evidence $p(\mathcal{D})$. A closer look shows how the KL distance depends on the log evidence:
$$
\mathrm{KL}\big(Q\,\|\,\Pi(\cdot\mid\mathcal{D})\big)
= \mathbb{E}_q[\log q(\boldsymbol{\theta})] - \mathbb{E}_q[\log p(\boldsymbol{\theta}\mid\mathcal{D})]
= \mathbb{E}_q[\log q(\boldsymbol{\theta})] - \mathbb{E}_q[\log p(\boldsymbol{\theta}, \mathcal{D})] + \log p(\mathcal{D}).
$$
Hence we optimize an equivalent objective function, the evidence lower bound (ELBO) $\mathcal{L}(q)$ (Jordan et al., 1999). The name comes from the fact that it lower bounds the log evidence:
$$
\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathcal{D}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})] \le \log p(\mathcal{D}).
$$

3.1.1 Black Box Variational Inference (BBVI)

The VB method has been widely adopted for approximating posterior distributions, but it typically requires significant model-specific analysis to develop the algorithm. This boils down to calculating the ELBO, $\mathbb{E}_q[\log p(\mathcal{D}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})]$, an expectation that often has no closed-form expression. Ranganath et al. (2014) proposed the Black Box Variational Inference (BBVI) method, a stochastic optimization scheme that maximizes the ELBO using an unbiased noisy gradient. For a set of variational parameters $\boldsymbol{\lambda}$,
$$
\nabla_{\boldsymbol{\lambda}} \mathcal{L}(q) = \mathbb{E}_q\big[\nabla_{\boldsymbol{\lambda}} \log q(\boldsymbol{\Psi}\mid\boldsymbol{\lambda}) \,\big(\log p(\boldsymbol{\Psi}, \mathcal{D}) - \log q(\boldsymbol{\Psi}\mid\boldsymbol{\lambda})\big)\big],
\tag{3.1}
$$
and, with $\boldsymbol{\Psi}_s \sim q(\boldsymbol{\Psi}\mid\boldsymbol{\lambda})$, the unbiased Monte Carlo estimate of this gradient is
$$
\widehat{\nabla}_{\boldsymbol{\lambda}} \mathcal{L}(q) = \frac{1}{S}\sum_{s=1}^{S} \nabla_{\boldsymbol{\lambda}} \log q(\boldsymbol{\Psi}_s\mid\boldsymbol{\lambda}) \,\big(\log p(\boldsymbol{\Psi}_s, \mathcal{D}) - \log q(\boldsymbol{\Psi}_s\mid\boldsymbol{\lambda})\big).
\tag{3.2}
$$
Using this noisy gradient, BBVI updates the variational parameters as $\boldsymbol{\lambda}_{t+1} \leftarrow \boldsymbol{\lambda}_t + \rho_t \widehat{\nabla}_{\boldsymbol{\lambda}} \mathcal{L}(q)$ until the ELBO converges. If the sequence of step sizes $\{\rho_t\}_{t=1}^\infty$ satisfies the Robbins-Monro conditions
$$
\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty,
\tag{3.3}
$$
then the algorithm is guaranteed to converge to a local maximum.

3.1.2 Reparametrization Trick

One issue in implementing the above algorithm is that samples of the parameters are not differentiable with respect to the variational parameters $\boldsymbol{\lambda}$. Kingma and Welling (2013) and Rezende et al. (2014) approached this problem by reparametrizing the samples so that the stochasticity is independent of the parameters. By separating out the stochasticity, a sample becomes a deterministic function of the distribution's parameters, so we can differentiate the sample with respect to the variational parameters. For example, a sample $z$ from a normal distribution with mean $\mu$ and variance $\sigma^2$ can equivalently be written as $z = \sigma \epsilon + \mu$, where $\epsilon \sim N(0, 1)$.
This reparametrization trick allows us to sample from a standard normal distribution and then scale by the standard deviation and shift by the mean to obtain a sample from $N(\mu, \sigma^2)$. In this way, we can backpropagate through the samples with respect to the mean and standard deviation.

3.2 Variational Bayes Algorithm for Discrepancy Model

Coming back to the discrepancy model setup, we intend to find a variational approximation to the posterior $p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma \mid \boldsymbol{d})$. First, let $\boldsymbol{\phi} = (\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)$, where $\boldsymbol{\phi}_1$ collects the real-valued parameters and $\boldsymbol{\phi}_2$ the positive-valued parameters. We posit a variational family of the form
$$
\tilde{\mathcal{Q}} = \mathcal{Q}_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}} \times \mathcal{Q}_{\boldsymbol{\theta}} \times \mathcal{Q}_{\boldsymbol{\phi}_1} \times \mathcal{Q}_{\boldsymbol{\phi}_2} \times \mathcal{Q}_{\sigma}.
$$
Here,

Assumption 3.2.1. We assume the following structure on the variational family:
$$
\mathcal{Q}_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}} = \{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi} \sim GP(\boldsymbol{m}^q_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}},\, k^q_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}})\},\quad
\mathcal{Q}_{\boldsymbol{\theta}} = \{\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{m}_{\theta}, \boldsymbol{S}_{\theta})\},\quad
\mathcal{Q}_{\boldsymbol{\phi}_1} = \{\boldsymbol{\phi}_1 \sim \mathcal{N}(\boldsymbol{m}_{\phi_1}, \boldsymbol{S}_{\phi_1})\},
$$
$$
\mathcal{Q}_{\boldsymbol{\phi}_2} = \{\tilde{\boldsymbol{\phi}}_2 \sim \mathcal{N}(\boldsymbol{m}_{\phi_2}, \boldsymbol{S}_{\phi_2})\},\quad
\mathcal{Q}_{\sigma} = \{\tilde{\sigma} \sim \mathcal{N}(m_{\sigma}, s_{\sigma})\},
$$
where $\tilde{\boldsymbol{\phi}}_2 = \log \boldsymbol{\phi}_2$ with the log operator applied elementwise and $\tilde{\sigma} = \log \sigma$. Also, under the mean field variational family, the matrices $\boldsymbol{S}_{\theta}$, $\boldsymbol{S}_{\phi_1}$ and $\boldsymbol{S}_{\phi_2}$ are assumed to be diagonal.

Note that the variational distribution of $\zeta$ depends on the calibration parameter $\boldsymbol{\theta}$ and the hyperparameters $\boldsymbol{\phi}$; hence this is not a mean field family. If we use the same covariance function for the variational distribution of $\zeta$ as for the prior of $\zeta$, such as the squared exponential or Matérn covariance function, then the KL divergence between the variational distribution of $\zeta$ and the prior of $\zeta$ becomes zero, as discussed above. We can then simplify the objective KL function:
$$
\mathrm{KL}\big(q(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma)\,\|\,p(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{d})\big)
= \int \log\!\left(\frac{q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma)}{p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{d})}\right) q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma)\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}\, d\sigma.
\tag{3.4}
$$
Therefore the optimal variational distribution $q^*$ is given by
$$
q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{\lambda})\,\|\,p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{d})\big),
\tag{3.5}
$$
where $\boldsymbol{\lambda} = (\boldsymbol{m}_{\theta}, \boldsymbol{S}_{\theta}, \boldsymbol{m}_{\phi_1}, \boldsymbol{S}_{\phi_1}, \boldsymbol{m}_{\phi_2}, \boldsymbol{S}_{\phi_2}, m_{\sigma}, S_{\sigma})$ denotes the set of all variational parameters taken together, and $\mathcal{Q} = \mathcal{Q}_{\boldsymbol{\theta}} \times \mathcal{Q}_{\boldsymbol{\phi}_1} \times \mathcal{Q}_{\boldsymbol{\phi}_2} \times \mathcal{Q}_{\sigma}$. In practice, the optimal variational distribution $q^*$ is obtained by maximizing an ELBO,
$$
\mathcal{L}(q) = \mathbb{E}_{q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{\lambda})}\big[\log p(\boldsymbol{d}\mid\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma) + \log p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma) - \log q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{\lambda})\big].
\tag{3.6}
$$
We can adopt the Black Box Variational Inference (BBVI) algorithm to solve this optimization problem; that is, form the unbiased Monte Carlo estimate of the derivative of the ELBO,
$$
\begin{aligned}
\nabla_{\boldsymbol{\lambda}}\mathcal{L}(q) &= \mathbb{E}_q\big[\nabla_{\boldsymbol{\lambda}}\log q(\boldsymbol{\theta},\boldsymbol{\phi},\sigma\mid\boldsymbol{\lambda})\big(\log p(\boldsymbol{d}\mid\boldsymbol{\theta},\boldsymbol{\phi},\sigma)+\log p(\boldsymbol{\theta},\boldsymbol{\phi},\sigma)-\log q(\boldsymbol{\theta},\boldsymbol{\phi},\sigma\mid\boldsymbol{\lambda})\big)\big] && (3.7)\\
&\approx \frac{1}{S}\sum_{s=1}^{S}\nabla_{\boldsymbol{\lambda}}\log q(\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s}\mid\boldsymbol{\lambda})\big(\log p(\boldsymbol{d}\mid\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s})+\log p(\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s})-\log q(\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s}\mid\boldsymbol{\lambda})\big) && (3.8)\\
&= \widehat{\nabla}_{\boldsymbol{\lambda}}\mathcal{L}(q), && (3.9)
\end{aligned}
$$
and then use the stochastic gradient descent (SGD) update. The procedure is summarized in Algorithm 1 below. Let $g_1(\boldsymbol{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}) = \boldsymbol{z} \odot \boldsymbol{\sigma} + \boldsymbol{\mu}$ and $g_2(\boldsymbol{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}) = \exp(g_1(\boldsymbol{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}))$.
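As a complement to Algorithm 1 below, the following is a minimal, self-contained sketch of the BBVI update in (3.8) on a hypothetical toy problem (a Gaussian mean with a conjugate prior, not the calibration model). The step size, batch size, and iteration count are illustrative, and the variance-reduction devices of Ranganath et al. (2014) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: y_i ~ N(theta, 1) with a N(0, 1) prior on theta,
# so the exact posterior is N(sum(y)/(n+1), 1/(n+1)).
y = rng.normal(2.0, 1.0, size=50)
n = y.size

def log_joint(theta):
    # log p(y | theta) + log p(theta), additive constants dropped
    return -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1) - 0.5 * theta ** 2

def log_q(theta, m, log_s):
    s2 = np.exp(2.0 * log_s)
    return -0.5 * np.log(2.0 * np.pi * s2) - 0.5 * (theta - m) ** 2 / s2

m, log_s = 0.0, 0.0        # variational parameters lambda = (m, log s)
rho, L = 1e-3, 64          # illustrative step size and batch size
for _ in range(3000):
    eps = rng.standard_normal(L)
    theta = m + np.exp(log_s) * eps                      # g1: reparametrized draws from q
    w = log_joint(theta) - log_q(theta, m, log_s)        # log p(y, theta) - log q(theta | lambda)
    s2 = np.exp(2.0 * log_s)
    score_m = (theta - m) / s2                           # d log q / d m
    score_ls = (theta - m) ** 2 / s2 - 1.0               # d log q / d log s
    grad_m, grad_ls = (score_m * w).mean(), (score_ls * w).mean()   # noisy gradient, cf. (3.8)
    m, log_s = m + rho * grad_m, log_s + rho * grad_ls   # SGD ascent step on the ELBO

print("BBVI estimate:", m, np.exp(log_s))
print("exact posterior:", y.sum() / (n + 1), (1.0 / (n + 1)) ** 0.5)
```

The estimator uses only evaluations of the log joint and the log variational density, which is what makes the approach black box; in the calibration setting the log joint is replaced by $\log p(\boldsymbol{d}\mid\boldsymbol{\theta},\boldsymbol{\phi},\sigma) + \log p(\boldsymbol{\theta},\boldsymbol{\phi},\sigma)$ as in (3.8).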
45 Algorithm 1 BBVI algorithm in Computer Calibration Model Input: Model, tolerance 𝜏, batch size L, learning rate sequence {πœŒπ‘‘ }∞ 𝑑=1 Output: π€βˆ— = 𝝀𝑑 1: Randomly initialize 𝝀0 2: Set 𝑑 ← 0, Ξ” ← 0 3: while increament of ELBO > 𝜏 do 4: 5: 𝑑 ← 𝑑 + 1 (𝜽 1, Β· Β· Β· , 𝜽 𝐿) = 𝑔1(π’˜1, Β· Β· Β· , π’˜ 𝐿; π’Žπœƒ, π‘Ίπœƒ), where (π’˜1, Β· Β· Β· , π’˜ 𝐿) (𝝓1 1 (𝝓1 2 1 ) = 𝑔1(π’˜ 2 ) = 𝑔2(π’˜ β€²1, Β· Β· Β· , π’˜ β€²β€²1, Β· Β· Β· , π’˜ , Β· Β· Β· , 𝝓𝐿 , Β· Β· Β· , 𝝓𝐿 β€² 𝐿; π’Žπœ™1 β€²β€² 𝐿; π’Žπœ™2 β€²β€² 𝐿) β€²1, Β· Β· Β· , π’˜ , π‘Ίπœ™2), where (π’˜ , π‘Ίπœ™2), 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 𝐼2𝑝+𝑝′×2𝑝+𝑝′) 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 1) where (π’˜ β€²β€²1, Β· Β· Β· , π’˜ (𝜎1, Β· Β· Β· 𝜎𝐿) = 𝑔2(𝑒1, Β· Β· Β· , 𝑒𝐿; π‘šπœŽ, π‘†πœŽ), where (𝑒1, Β· Β· Β· , 𝑒𝐿) (cid:205)𝐿 𝑙=1 βˆ‡π€ log π‘ž(𝜽 𝑙, 𝝓𝑙, πœŽπ‘™ |π€π‘‘βˆ’1)Γ— Λ†βˆ‡πœ†L (π‘ž) ← 1 𝐿 (log 𝑝( 𝒅|𝜽 𝑙, 𝝓𝑙, πœŽπ‘™) + log 𝑝(𝜽 𝑙, 𝝓𝑙, πœŽπ‘™) βˆ’ log π‘ž(𝜽 𝑙, 𝝓𝑙, πœŽπ‘™ |π€π‘‘βˆ’1)) 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 𝐼𝑝′×𝑝′) β€² 𝐿) 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 𝐼2Γ—2) 𝝀𝑑 ← π€π‘‘βˆ’1 + πœŒπ‘‘ Λ†βˆ‡πœ†L (π‘ž) 6: 7: 8: 9: 10: 11: 12: 13: end while 3.3 Posterior Contraction Just as MCMC case, we shall assume that 𝜎2 = 1. The results can be extended to unknown 𝜎2 with minor modifications as well. To statistically validate a Bayesian approach, one needs to estab- lish posterior consistency which states that the posterior probability of an arbitrary neighborhood around the true parameter goes to 1 as the sample size 𝑛 goes to infinity. In this section, we first establish the posterior consistency of the true posterior in (2.9). Next, we establish the posterior consistency of the variational posterior in (3.5). We shall use the notation E0 and P0 to denote the expectation and probability of 𝑦𝑖 with respect to the true parameters 𝜁0. In this direction, define π‘„βˆ— = arg min ∫ log 𝑑𝑄(𝜁, πœƒ, πœ™) 𝑑Π(𝜁, πœƒ, πœ™| π’š, 𝒛) 𝑑𝑄(𝜁, πœƒ, πœ™) π‘‘π‘„βˆ— 𝜁 (𝜁) = π‘‘Ξ πœ (𝜁) = π‘‘Ξ πœ (𝜁 | π’š, 𝒛) = π‘„βˆˆQ ∫ ∫ Θ ∫ Ξ¦ ∫ Θ ∫ Ξ¦ ∫ Θ Ξ¦ π‘‘π‘„βˆ—(𝜁, πœƒ, πœ™), 𝑑Π(𝜁, πœƒ, πœ™), 𝑑Π(𝜁, πœƒ, πœ™|π’š, 𝒛) Further, π‘„βˆ— 𝜁 (𝜁), Π𝜁 (𝜁) and Π𝜁 (𝜁 |π’š, 𝒛) denote the marginal distribution of 𝜁 under the variational posterior, the prior, and the true posterior respectively. 46 3.3.1 Assumptions Let πœ–π‘› be a sequence satisfying πœ–π‘› β†’ 0 and π‘›πœ– 2 𝑛 β†’ ∞. Consider the following neighborhood of the true physical process as follows (cid:40) Uπœ–π‘› = 𝜁 : 1 𝑛 𝑛 βˆ‘οΈ 𝑖=1 |𝜁 (𝑑𝑖) βˆ’ 𝜁0(𝑑𝑖)| ≀ πœ–π‘› (cid:41) We consider a sequence of sets {F𝑛}∞ 𝑛=1 such that it increases to include the space of all continuously differentiable functions with bounded derivatives on [0, 1] 𝑝. The following assumption describes the prior distributions on all the unknown parameters under consideration. Assumption 3.3.1. We assume the following structure on the prior distributions of 𝜽, 𝝓1, 𝝓2 and 𝜎2 Ξ (𝜽) ∼ N (π’Ž0 πœƒ, 𝑺0 πœƒ), Ξ (𝝓1) ∼ N (π’Ž0 πœ™1 , 𝑺0 πœ™2 ), Ξ ( Λœπ“2) ∼ N (π’Ž0 πœ™2 , 𝑺0 πœ™2 ), Ξ ( ˜𝜎) ∼ N (π‘š0 𝜎, (𝑠0 𝜎)2), where Λœπ“2 = log 𝝓2 with the log operator being applied element wise and ˜𝜎 = log 𝜎. For simplicity πœƒ, 𝑺0 𝑺0 πœ™1 , and 𝑺0 πœ™2 are assumed to be diagonal matrices. Assumption 3.3.2. 
For some 𝑄 ∈ Q (i) sup𝑖=1,Β·Β·Β· ,𝑛 E𝑄 (cid:2)∫ 𝜁 (𝑑𝑖)𝑑Π(𝜁 |𝜽, 𝝓, 𝒛) βˆ’ 𝜁0(𝑑𝑖)(cid:3) 2 ≀ πœ– 2 𝑛 (ii) K𝐿(𝑄(𝜽, 𝝓)||Ξ (𝜽, 𝝓, 𝒛)) ≀ πΆπ‘›πœ– 2 𝑛 log 𝑛 Assumption 3.3.2 (i) states that there exists a variational member 𝑄 such that under 𝑄 the expectation of difference between the conditional expectation of the physical process 𝜁 under the prior and the true physical process 𝜁0 is negligible. This happens when 𝑄 places overwhelming mass on those values of 𝜽 and 𝝓 for which the conditional expectation of the physical process 𝜁 approaches the true process 𝜁0. For the Squared Exponential covariance and MatΓ©rn covariance class, we establish that Assumption 3.3.2 (i) holds as long as there exists 𝜽 βˆ— 𝑛 and π“βˆ— 𝑛 satisfying ∫ (cid:12) (cid:12) (cid:12) (cid:12) 𝜁 (𝑑𝑖)𝑑Π(𝜁 |𝜽 βˆ— 𝑛, π“βˆ— 𝑛, 𝒛) βˆ’ 𝜁0(𝑑𝑖) (cid:12) (cid:12) (cid:12) (cid:12) ≀ πœ–π‘› (3.11) for some ||𝜽 βˆ— 𝑛||2 ≀ π΅πœ™ with π΅πœƒ and π΅πœ™ independent of 𝑛. Note, (3.11) assumes there exist good choices of 𝜽 and 𝝓 such that the conditional expectation of the physical process 𝜁 under the parameters approaches the true physical process 𝜁0. 47 sup 𝑖=1,Β·Β·Β· ,𝑛 𝑛||2 ≀ π΅πœƒ and ||π“βˆ— Assumption 3.3.2 (ii) states that the KL distance between the variational member 𝑄 and the prior Ξ  is bounded up to an order of π‘›πœ– 2 𝑛 log 𝑛. Although 𝑄 needs to give overwhelming probabilities to good values of 𝜽 and 𝝓, it cannot be too far from the prior Ξ  if variational posterior consistency needs to hold. For the Squared Exponential covariance and MatΓ©rn covariance class, this will hold as long as 𝜽 βˆ— 𝑛 and π“βˆ— 𝑛 satisfying (3.11) are bounded in 𝐿2 norm up to an order of √ 𝑛. Indeed the GP priors as designed in (2.1) and Assumption 3.3.1 on the priors of 𝜽 and 𝝓 together with Assumption 3.2.1 on the variational family Q will be essential components towards the proof Assumption 3.3.2 (i) and (ii). 3.3.2 Main Results Theorem 3. (Variational Posterior Consistency) Suppose Assumption 3.3.2 holds in addition to Assumption 2.2.1 and Assumption 2.2.2, then π‘„βˆ— 𝜁 (U𝑐 πœ€π‘›) βˆ’β†’ 0 in P𝑛 0 probability as 𝑛 β†’ ∞ and πœ€π‘› = πœ–π‘› Note 𝑀𝑛 can be any sequence increasing slowly to ∞. Theorem 3 shows that the variational βˆšοΈπ‘€π‘› log 𝑛, 𝑀𝑛 β†’ ∞. posterior of 𝜁 concentrates around the true parameter value 𝜁0 at the rate πœ€π‘› which is slightly higher than the rate πœ–π‘› for the true posterior. The detailed proof of the Theorem 3 has been provided in the Appendix. However, we next give a brief overview. Proof of Theorem 3: By Corollary 4.15 in Boucheron et al. (2013), we have ∫ 𝑓 π‘‘π‘„βˆ— 𝜁 ≀ K𝐿 (π‘„βˆ— 𝜁 , Π𝜁 (.|π’š, 𝒛)) + log ∫ 𝑒 𝑓 π‘‘Ξ πœ (Β·| π’š, 𝒛) (3.12) Using (3.12) with 𝑓 = πΆπ‘›πœ€2 𝑛1U𝐢 πœ€π‘› with 1 being the indicator function of a set, we get πΆπ‘›πœ€2 π‘›π‘„βˆ— 𝜁 (U𝐢 πœ€π‘›) ≀ K𝐿 (π‘„βˆ— 𝜁 ||Π𝜁 (|π’š, 𝒛)) + log (cid:104) 1 + π‘’πΆπ‘›πœ€2 π‘›Ξ πœ (U𝐢 πœ€π‘› | π’š, 𝒛) (cid:105) (3.13) By Assumptions 2.2.1 and 2.2.2, it can be shown that Π𝜁 (U𝐢 πœ–π‘› |π’š, 𝒛) ≀ exp(βˆ’πΆβ€²π‘›πœ– 2 𝑛) where 𝐢′ > 𝐢. Additionally, under Assumption 3.3.2, it can be shown that K𝐿(π‘„βˆ— 𝜁 ||Π𝜁 (|π’š, 𝒛)) ≀ πΆβ€²β€²π‘›πœ– 2 𝑛 log 𝑛 for some 𝐢′′ > 0. Using these results in Relation (3.13), the proof of Theorem 3 follows. 48 3.4 Simulation Study: Ball Dropping Experiment Figure 3.1 (Functions and data of Ball dropping experiment) The left figure shows the true physical process and the computer model with true calibration parameter. 
The right figure shows the data we obtained from these functions. Let’s consider a famous ball-dropping experiment from Plumlee (2017). A ball is dropped from a chosen vertical position 8 meters above the ground with initial velocity βˆ’0.1 π‘š/𝑠. Then the ball’s vertical position at time π‘₯ is given by following the physical process. 𝜁0(𝑑) = 5 2 log (cid:26) 50 49 βˆ’ 50 49 √ (cid:104) tanhβˆ’1( tanh 0.02) + (cid:105) 2(cid:27) √ 2π‘₯ + 8, π‘₯ ∈ [0, 1] (3.14) We are going to consider an inexact computer model for this physical process with unknown calibration parameter πœƒ as π‘“π‘š (𝑑, πœƒ) = 8 βˆ’ π‘₯ βˆ’ πœƒπ‘₯2 2 (3.15) Note that the calibration parameter represents gravitational acceleration and let the true value of πœƒ to be 10. The left figure in Figure 3.1 shows these two functions in the domain π‘₯ ∈ [0, 1]. We can see that the computer model behaves very similarly to the true physical process, but there is a systematic difference, which will be modeled using GP. For the data generation, we will use Latin hypercube samples for (π‘₯, πœƒ) in the bounded space [0, 1] Γ— [7, 12]. In addition, we are going to add normal random error with a variance 0.01 to the experimental data. Then the location of the training data of size (𝑁, 𝑆) = (30, 200) could 49 0.00.20.40.60.81.0Time: x345678Vertical position of the ballFunctionTrue functionSimulation function with true 0.00.20.40.60.81.0Time: x345678DataExperimental DataSimulation Data Figure 3.2 (Ball dropping example: posterior results of MCMC approach and VI approach) The left figure shows the prior distribution and the posterior distributions of calibration parameters using two different approaches. The right figure presents Root Mean Squared Error for 100 out of the sample physical data. be checked in Figure 3.1, where 𝑁 is the number of experimental data and 𝑆 is the number of simulation data. Figure 3.2 shows the results of the proposed method in terms of posterior distributions and the Root Mean Squared Error (RMSE) for 100 out of the sample physical data. The posterior distributions of the calibration parameter are centered around the truve value, and the proposed VI method shows less uncertainty. This is anticipated since the mean field variational family tends to underestimate the variance. Even in RMSEs the proposed VI method well captures the performance of MCMC approach. The main difference is coming from the running time of each algorithm. The proposed VI methods converged in less than 1 min. This convergence is remarkable compared to MCMC result, which took around 13 minutes for 10000 posterior samples. 3.5 Simulation Study: Logistic Function Let’s consider a simple computer model in the form of a logistic growth function: π‘“π‘š ((𝑑, π‘₯), 𝜽) = (cid:18) 1 + exp 1 + πœƒ1 βˆ’ [1 + π‘₯ + πœƒ2𝑑] (cid:19) (3.16) 50 4681012140.00.20.40.60.8DensityPosterior distributionMCMCVIPrior N(9,2)(30,200)sample size0.010.020.030.040.05RMSERMSE for 100 physical test datatypeMCMCVI where we assume that the calibration parameter 𝜽 takes values in the space [0, 2]2 and the model inputs (𝑑, π‘₯) take values in the space [βˆ’10, 10]2. The true physical process is modeled according to 𝜁0(𝑑, π‘₯) = π‘“π‘š ((𝑑, π‘₯), 𝜽) + 𝛿(𝑑, π‘₯) = (cid:18) 1 + exp 1 + πœƒ1 βˆ’ [1 + π‘₯ + πœƒ2𝑑] (cid:19) ) + 𝛽, (3.17) where 𝛽 = 1 is a constant systematic error of the model and the true value of 𝜽 = (πœƒ1, πœƒ2) are arbitrarily set to be (1, 1.8). 
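For concreteness, here is a small data-generation sketch of my own for this study (not the thesis code). It assumes the logistic form $f_m((t,x),\boldsymbol{\theta}) = 1/\{1+\exp(\theta_1 - [1 + x + \theta_2 t])\}$ for (3.16), and the observation noise level is an assumption, since it is not stated for this example.

```python
import numpy as np
from scipy.stats import qmc   # Latin hypercube designs, as described for these studies

rng = np.random.default_rng(1)

def f_m(t, x, theta1, theta2):
    # assumed reading of the computer model (3.16): a logistic growth surface in (t, x)
    return 1.0 / (1.0 + np.exp(theta1 - (1.0 + x + theta2 * t)))

def zeta0(t, x, theta1=1.0, theta2=1.8, beta=1.0):
    # true process (3.17): the computer model at the true theta plus a constant bias beta = 1
    return f_m(t, x, theta1, theta2) + beta

n, s = 500, 1000   # experimental and simulation sample sizes for the largest case

# experimental inputs (t, x) on [-10, 10]^2, observed with noise (noise sd is an assumption)
X_exp = qmc.scale(qmc.LatinHypercube(d=2, seed=1).random(n), [-10, -10], [10, 10])
y = zeta0(X_exp[:, 0], X_exp[:, 1]) + rng.normal(0.0, 0.1, size=n)

# simulation inputs (t, x, theta1, theta2) with the calibration parameters in [0, 2]^2
D_sim = qmc.scale(qmc.LatinHypercube(d=4, seed=2).random(s),
                  [-10, -10, 0, 0], [10, 10, 2, 2])
z = f_m(D_sim[:, 0], D_sim[:, 1], D_sim[:, 2], D_sim[:, 3])
```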
Since there are two calibration parameters, we consider three sample sizes to compare the time gain and the fidelity of the variational approximation. The locations of the data in the two-dimensional input space are shown in Figure 3.3 for the case $(n, s) = (500, 1000)$. The posterior distributions of the calibration parameters obtained with MCMC and VI are compared in Figure 3.4. As one can see, the MCMC contours often have dented shapes, but the VI method approximates the posterior density well, with rounded contours sitting on top of the MCMC result. As the sample size increases, the uncertainty of the MCMC results decreases, with smaller standard deviations for both $\theta_1$ and $\theta_2$, and the VI results show very similar behavior. The same pattern can be seen in Table 3.1, which reports the mean and standard deviation for each sample size.

Figure 3.3 (Functions and data of Logistic growth function example) The far left figure shows the true logistic growth function with the true calibration parameter. The middle figure shows the 500 experimental data points and the right figure shows the 1000 simulation data points.

Figure 3.4 (Posterior distributions of the two calibration parameters) The blue lines represent the posterior distribution from the MCMC method, the orange color is the variational approximation, and the red dot marks the true value of the calibration parameters. We consider three sample sizes to compare the performance of the variational approximation.

(n, s)        Method   θ1            θ2            Condition (3.11)
(125, 250)    MCMC     0.89 (0.08)   1.77 (0.09)   1.326
              VB       0.89 (0.05)   1.76 (0.08)
(250, 500)    MCMC     0.95 (0.04)   1.72 (0.03)   1.045
              VB       0.93 (0.04)   1.72 (0.03)
(500, 1000)   MCMC     0.97 (0.03)   1.76 (0.02)   1.027
              VB       0.96 (0.02)   1.75 (0.01)

Table 3.1 Posterior means and standard deviations (in parentheses) of the calibration parameters. Here, $n$ is the number of experimental observations and $s$ is the number of computer model evaluations.

Table 3.1 also reports condition (3.11) for the variational family, the supremum over the design points of the difference between the mean of the GP prior given $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ and the true physical process,
$$
\sup_{i=1,\cdots,n} \left| \int \zeta(t_i)\, d\Pi(\zeta \mid \boldsymbol{\theta}^*_n, \boldsymbol{\phi}^*_n, \boldsymbol{z}) - \zeta_0(t_i) \right| \le \epsilon_n.
$$
The left-hand side is computed in this example, and its value decreases toward 1 as the sample size increases. It does not decrease to 0 here because the simulation model in this setup carries a constant bias of 1, so the quantity converges to that value instead. If the simulation model approximated the real physical process well enough that the discrepancy were negligible, the mean of the GP prior updated by the simulation data would be very close to the true physical process and the upper bound on the left-hand side would decrease to 0.

The big difference between the two approaches is in running time. The upper panels of Figure 3.5 show the convergence of the ELBO for the VI approach as a function of time in hours, together with the posterior sampling time for the MCMC approach. ELBO convergence is quite fast compared to the MCMC method. For example, in the $(n, s) = (500, 1000)$ case, the MCMC algorithm took 14 hours to produce 10,000 posterior samples, while the VI approach took around 30 minutes to reach ELBO convergence. In other words, by the time the ELBO converged, the MCMC method had produced at most 450 posterior samples across the three sample sizes. The Root Mean Squared Error (RMSE) on the 100 out-of-sample test points also converges quickly for the VI method and is quite close to the MCMC result. The MCMC approach also converges quite quickly in this simple simulation setup, but acquiring enough posterior samples for inference is very time-consuming compared to the VI approach. In short, the proposed VI method is consistent with MCMC in this simple example, with slightly smaller uncertainty. The computational time, on the other hand, is very different: the VI approach is up to 23 times faster than the MCMC approach in this example.

Figure 3.5 (Time comparison for the Logistic growth function example between the MCMC approach and the VI approach) The upper figures show the change of the ELBO over time for the VI approach, and the lower figures show the change of the Root Mean Squared Error (RMSE) for 100 out-of-sample physical data points over time.

3.6 Conclusion and Discussion

The original KOH model is a powerful approach for computer model calibration and estimation of physical processes. However, the standard MCMC implementation is nontrivial and computationally expensive due to the use of GPs. We addressed the scalability issue of the KOH model by replacing MCMC with a VB approximation, which led to a significant reduction in the code's runtime. Our simulation study shows that the VB method produces a posterior approximation of high fidelity with much less computation time than MCMC. In the future, we plan to deploy the proposed VB approach on a wide range of validation studies, including a real data example.

On the theoretical end, we first established the posterior contraction rate of the true posterior, followed by the posterior contraction rate for the variational approximation. As reflected in the theoretical analysis, the contraction rate of the proposed VB approach is only slightly slower than that of the true posterior. In the theoretical section, we assumed the parameter $\sigma^2$ to be known; extending to the case of unknown $\sigma^2$ will be an interesting line of work. Further, our theoretical development relies on the existence of suitable choices of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ such that the conditional expectation of the computer model approaches the true physical process $\zeta_0$.

BIBLIOGRAPHY

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Csató, L. (2002). Gaussian processes: Iterative sparse approximations. Ph.D. thesis, Aston University.

Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668.

de G. Matthews, A. G., Hensman, J., Turner, R. E., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Artificial Intelligence and Statistics, pages 231–239.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999).
An introduction to variational methods for graphical models. Machine Learning, 37:183–233. Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Kolmogorov, A. N., Shiryaev, A. N., and Lindquist, G. (1992). Selected works of A. N. Kolmogorov. Volume II. Kluwer Academic Publishers. Lee, H. (2000). Consistency of posterior distributions for neural networks. Neural Network, 13:629–642. Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285. Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822. PMLR. Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR. Seeger, M. (2003a). Bayesian gaussian process models: Pac-bayesian generalisation error bounds and sparse approximations. Ph.D. thesis, University of Edinburgh. Seeger, M. (2003b). Pac-bayesian generalisation error bounds for gaussian process classification. 55 Journal of Machine Learning Research, 3:233–269. Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational bayesian neural networks. International Conference on Learning Representations. 56 APPENDIX A CHECKING ASSUMPTION 3.3.2 (I) Under Assumption 3.3.2 (i) using (3.11), 𝜽 βˆ—, π“βˆ— depends on 𝑛. However, for notation simplicity we suppress 𝑛 in the remaining proof. Now, let |E𝜁 |𝜽,π“πœ (𝑑𝑖) βˆ’ 𝜁0(𝑑𝑖)| = 𝑓𝑖 (𝜽, 𝝓) Further define A = {|πœƒ 𝑗 βˆ’ πœƒβˆ— 𝑗 | ≀ πœ–π‘›, |𝝓1, 𝑗 βˆ’ π“βˆ— 1, 𝑗 | ≀ πœ–π‘›, | log 𝝓2, 𝑗 βˆ’ log π“βˆ— 2, 𝑗 | ≀ πœ–π‘›}, Eπ‘ž(𝜽,𝝓) |(𝜁0(𝑑𝑖) βˆ’ E𝜁 |𝜽,𝝓,𝒛 (𝜁 (𝑑𝑖))|2 = ∫ 𝑖 (𝜽, 𝝓)π‘ž(𝜽, 𝝓)π‘‘πœƒπ‘‘πœ™ 𝑓 2 = ≀ ∫ A ∫ A 𝑖 (𝜽, 𝝓)π‘ž(𝜽, 𝝓)π‘‘πœƒπ‘‘πœ™ + 𝑓 2 ∫ A𝑐 𝑖 (𝜽, 𝝓)π‘ž(𝜽, 𝝓)π‘‘πœƒπ‘‘πœ™ 𝑓 2 ( 𝑓 2 𝑖 (𝜽 βˆ—, π“βˆ—) + πΆπœ– 2 𝑛)π‘ž(𝜽, 𝝓)π‘‘πœ½ 𝑑𝝓 + (cid:18) max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ ¯𝐢 + |𝜁0(𝑑𝑖)| (cid:19) 2 𝑄(A𝑐) ≀ πΆπœ– 2 𝑛 + 𝐢 exp(βˆ’π‘›πœ– 2 𝑛) The first inequality above holds since 𝑓 2 𝑖 (𝜽, 𝝓) is a continuous function in 𝜽 and 𝝓. Thus, for every 𝛿1𝑛 > 0, there exists 𝛿2𝑛 > 0 such that for every (𝜽, 𝝓) ∈ B ((𝜽 βˆ—, π“βˆ—), 𝛿2𝑛), 𝑓 2 𝑖 (𝜽, 𝝓) ≀ 𝑓 2 𝑖 (𝜽 βˆ—, π“βˆ—) + 𝛿1𝑛 where B ((𝜽 βˆ—, π“βˆ—), 𝛿𝑛) is a ball of radius 𝛿𝑛 around (𝜽 βˆ—, π“βˆ—). Let πœ–π‘› = min( 𝛿1𝑛, 𝛿2𝑛). √ To prove the second inequality 𝑄(A𝑐) ≲ exp(βˆ’π‘›πœ– 2 𝑛), we use the tail bounds for normal distribution which states if 𝑋 is normally distributed, P(|𝑋 βˆ’ πœ‡|/𝜎 β‰₯ π‘₯) ≀ exp(βˆ’π‘₯2) as π‘₯ β†’ ∞ in addition to the the fact that 𝑄(( 𝐴1 ∩ Β· Β· Β· ∩ π΄π‘Ÿ)𝑐) = (cid:205)π‘Ÿ 𝑄( 𝐴𝑐 𝑖 ). To be specific, √ 𝑄( 𝑛|πœƒ 𝑗 βˆ’ πœƒβˆ— 𝑗 | β‰₯ √ π‘›πœ–π‘›) ≀ 2π‘πœ½ exp(βˆ’π‘›πœ– 2 𝑛) π‘πœ½(cid:214) 𝑗=1 𝑝𝝓1(cid:214) 𝑗=1 𝑄(|πœ™1, 𝑗 βˆ’ π“βˆ— 1, 𝑗 | β‰₯ πœ–π‘›) ≀ 2𝑝𝝓1 exp(βˆ’π‘›πœ– 2 𝑛) 𝑄(| log πœ™2, 𝑗 βˆ’ log πœ™βˆ— 2, 𝑗 | β‰₯ πœ–π‘›) ≀ 2𝑝𝝓2 exp(βˆ’π‘›πœ– 2 𝑛) 𝑝𝝓2(cid:214) 𝑗=1 Finally, | 𝑓𝑖 (𝜽, 𝝓)| ≀ maxπ‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ ¯𝐢 + |𝜁0(𝑑𝑖)| follows as a consequence of (C.4). 
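The calculations in the next appendix expand, term by term, KL divergences between univariate Gaussians. As a sanity check of my own (not part of the thesis), the resulting closed form can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_normal(m, s, mu, sigma):
    # closed form KL( N(m, s^2) || N(mu, sigma^2) )
    return np.log(sigma / s) + (s ** 2 + (m - mu) ** 2) / (2.0 * sigma ** 2) - 0.5

# Monte Carlo estimate of the same quantity: E_q[ log q(x) - log p(x) ] with x ~ q
m, s, mu, sigma = 0.7, 0.2, 0.0, 1.0
x = rng.normal(m, s, size=200_000)
log_q = -0.5 * np.log(2 * np.pi * s ** 2) - 0.5 * (x - m) ** 2 / s ** 2
log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * (x - mu) ** 2 / sigma ** 2
print(kl_normal(m, s, mu, sigma), np.mean(log_q - log_p))  # the two should agree to ~1e-3
```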
APPENDIX B

CHECKING ASSUMPTION 3.3.2 (II)

Assume independent priors for 𝜽 and 𝝓, with matching mean-field variational families:

$$p(\boldsymbol{\theta}) = \prod_{i=1}^{p_{\boldsymbol{\theta}}} N(\mu_{\theta,i}, \sigma^2_{\theta,i}), \qquad q(\boldsymbol{\theta}) = \prod_{i=1}^{p_{\boldsymbol{\theta}}} N(m_{\theta,i}, s^2_{\theta,i}),$$
$$p(\boldsymbol{\phi}_1) = \prod_{i=1}^{p_{\boldsymbol{\phi}_1}} N(\mu_{\phi_1,i}, \sigma^2_{\phi_1,i}), \qquad q(\boldsymbol{\phi}_1) = \prod_{i=1}^{p_{\boldsymbol{\phi}_1}} N(m_{\phi_1,i}, s^2_{\phi_1,i}),$$
$$p(\boldsymbol{\phi}_2) = \prod_{i=1}^{p_{\boldsymbol{\phi}_2}} \mathrm{LN}(\log \mu_{\phi_2,i}, \sigma^2_{\phi_2,i}), \qquad q(\boldsymbol{\phi}_2) = \prod_{i=1}^{p_{\boldsymbol{\phi}_2}} \mathrm{LN}(\log m_{\phi_2,i}, s^2_{\phi_2,i}).$$

Note that Q(𝜽, 𝝓) = Q(𝜽)Q(𝝓1)Q(𝝓2) and Ξ (𝜽, 𝝓|𝒛) = Ξ (𝝓1|𝒛)Ξ (𝝓2|𝒛)Ξ (𝜽). Therefore

$$\mathrm{KL}\big( Q(\zeta,\boldsymbol{\theta},\boldsymbol{\phi}) \,\|\, \Pi(\zeta,\boldsymbol{\theta},\boldsymbol{\phi} \mid \boldsymbol{z}) \big) = \mathrm{KL}\big( Q(\boldsymbol{\phi}_1) \,\|\, \Pi(\boldsymbol{\phi}_1 \mid \boldsymbol{z}) \big) + \mathrm{KL}\big( Q(\boldsymbol{\phi}_2) \,\|\, \Pi(\boldsymbol{\phi}_2 \mid \boldsymbol{z}) \big) + \mathrm{KL}\big( Q(\boldsymbol{\theta}) \,\|\, \Pi(\boldsymbol{\theta}) \big).$$

Each KL term can be computed in closed form. With 𝑝(πœƒπ‘–) = N(πœ‡πœƒ,𝑖, πœŽΒ²πœƒ,𝑖) and π‘ž(πœƒπ‘–) = N(π‘šπœƒ,𝑖, π‘ Β²πœƒ,𝑖),

$$\mathrm{KL}\big( Q(\boldsymbol{\theta}) \,\|\, \Pi(\boldsymbol{\theta}) \big) = -\frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \log s^2_{\theta,i} - \frac{p_{\boldsymbol{\theta}}}{2} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \log \sigma^2_{\theta,i} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \left[ \frac{s^2_{\theta,i}}{\sigma^2_{\theta,i}} + \left( \frac{m_{\theta,i} - \mu_{\theta,i}}{\sigma_{\theta,i}} \right)^2 \right].$$

Choose π‘šπœƒ,𝑖 = πœƒ0,𝑖 and π‘ Β²πœƒ,𝑖 = 1/𝑛. Then, for a positive constant 𝐢1 > 0,

$$\mathrm{KL}\big( Q(\boldsymbol{\theta}) \,\|\, \Pi(\boldsymbol{\theta}) \big) = \frac{p_{\boldsymbol{\theta}}}{2}\log n - \frac{p_{\boldsymbol{\theta}}}{2} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \log \sigma^2_{\theta,i} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \left[ \frac{1}{n\sigma^2_{\theta,i}} + \left( \frac{\theta_{0,i} - \mu_{\theta,i}}{\sigma_{\theta,i}} \right)^2 \right] = o(C_1 n\epsilon_n^2 \log n),$$

where the last step follows from Assumption 8 (i). By the same argument, with 𝑝(πœ™1,𝑖) = N(πœ‡πœ™1,𝑖, πœŽΒ²πœ™1,𝑖), π‘ž(πœ™1,𝑖) = N(π‘šπœ™1,𝑖, π‘ Β²πœ™1,𝑖), π‘šπœ™1,𝑖 = πœ™0,1,𝑖, and π‘ Β²πœ™1,𝑖 = 1/𝑛,

$$\mathrm{KL}\big( Q(\boldsymbol{\phi}_1) \,\|\, \Pi(\boldsymbol{\phi}_1 \mid \boldsymbol{z}) \big) = o(C_1 n\epsilon_n^2 \log n).$$

For 𝝓2, let 𝑝(log πœ™2,𝑖) = N(log πœ‡πœ™2,𝑖, πœŽΒ²πœ™2,𝑖) and π‘ž(log πœ™2,𝑖) = N(log π‘šπœ™2,𝑖, π‘ Β²πœ™2,𝑖). Since the KL divergence is invariant under the one-to-one transformation πœ™2,𝑖 ↦ log πœ™2,𝑖, the log-normal term reduces to the Gaussian case on the log scale. Choosing π‘šπœ™2,𝑖 = πœ™0,2,𝑖 and π‘ Β²πœ™2,𝑖 = 1/𝑛 gives, for a positive constant 𝐢1 > 0,

$$\mathrm{KL}\big( Q(\boldsymbol{\phi}_2) \,\|\, \Pi(\boldsymbol{\phi}_2 \mid \boldsymbol{z}) \big) = \frac{p_{\boldsymbol{\phi}_2}}{2}\log n - \frac{p_{\boldsymbol{\phi}_2}}{2} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\phi}_2}} \log \sigma^2_{\phi_2,i} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\phi}_2}} \left[ \frac{1}{n\sigma^2_{\phi_2,i}} + \left( \frac{\log \phi_{0,2,i} - \log \mu_{\phi_2,i}}{\sigma_{\phi_2,i}} \right)^2 \right] = o(C_1 n\epsilon_n^2 \log n),$$

where the last step again follows from Assumption 8 (i). This completes the proof.

APPENDIX C

PROOF OF THEOREM 3

Based on the sketch of the proof of Theorem 3 in the main paper, the complete proof boils down to showing that KL(Q*_𝜁 β€– Π𝜁(Β·|π’š, 𝒛)) ≀ 𝐢′′ 𝑛 πœ–π‘›Β² log 𝑛 for some 𝐢′′ > 0. This rests on the following lemmas. Again, recall that 𝒛 is implicitly conditioned on in the prior.

Lemma C.0.1 (Inequality for the KL divergence).

$$\mathrm{KL}\big( Q^*_{\zeta} \,\|\, \Pi_{\zeta}(\cdot \mid \boldsymbol{y}, \boldsymbol{z}) \big) \le \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big). \tag{C.1}$$

Lemma C.0.2. There exists a constant 𝐢2 such that for any πœ–π‘› β†’ 0 with π‘›πœ–π‘›Β² β†’ ∞,

$$P_0\left\{ \left| \log \frac{p(\boldsymbol{y} \mid \boldsymbol{z})}{p(\boldsymbol{y} \mid \zeta_0)} \right| < C_2\, n\epsilon_n^2 \log n \right\} \longrightarrow 1. \tag{C.2}$$

Lemma C.0.3. There exists a constant 𝐢3 such that for any πœ–π‘› β†’ 0 with π‘›πœ–π‘›Β² β†’ ∞,

$$P_0\left\{ \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right] \le C_3\, n\epsilon_n^2 \log n \right\} \longrightarrow 1. \tag{C.3}$$

Lemma C.0.4. Using Lemma C.0.2, Lemma C.0.3, and Assumption 3.3.2 (ii) in the main paper, it can be established with dominating probability that, for some 𝐢9 > 0, as 𝑛 β†’ ∞,

$$\mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big) \le C_9\, n\epsilon_n^2 \log n. \tag{C.4}$$

Using Lemma C.0.4, we then show that the KL divergence between the variational posterior and the true posterior is bounded.
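Before turning to the proofs of these lemmas, note that the Gaussian KL terms computed in Appendix B above admit the closed form KL(N(m, sΒ²) β€– N(πœ‡, 𝜎²)) = log(𝜎/s) + (sΒ² + (m βˆ’ πœ‡)Β²)/(2𝜎²) βˆ’ 1/2, and the log-normal term reduces to the same formula on the log scale. A minimal numerical sanity check of this formula is sketched below; it is an illustration only, with arbitrary values, and assumes NumPy/SciPy are available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def kl_normal(m, s, mu, sigma):
    """Closed-form KL( N(m, s^2) || N(mu, sigma^2) ), as used term-by-term in Appendix B."""
    return np.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5

# Monte Carlo check: E_q[ log q(x) - log p(x) ] with q = N(m, s^2) and p = N(mu, sigma^2).
m, s, mu, sigma = 0.3, 0.1, 0.0, 1.0          # e.g. s^2 = 1/n with n = 100 (illustrative values)
x = rng.normal(m, s, size=500_000)
mc = np.mean(norm.logpdf(x, m, s) - norm.logpdf(x, mu, sigma))
print(kl_normal(m, s, mu, sigma), mc)          # the two values should agree closely
```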
C.0.1 Proof of Lemma C.0.1

Note that

$$\mathrm{KL}\big( Q^*_{\zeta} \,\|\, \Pi_{\zeta}(\cdot \mid \boldsymbol{y}, \boldsymbol{z}) \big) = \int \log \frac{dQ^*_{\zeta}(\zeta)}{d\Pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ^*_{\zeta}(\zeta) \le \int \log \frac{dQ^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{d\Pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})$$
$$\le \int \log \frac{dQ(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{d\Pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) = \int \log \frac{q(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi})\, q(\boldsymbol{\theta}, \boldsymbol{\phi})}{\pi(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{y}, \boldsymbol{z})\, \pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) = \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big),$$

where the last equality holds because π‘ž(𝜁 | 𝜽, 𝝓) = πœ‹(𝜁 | 𝜽, 𝝓, π’š, 𝒛). The first inequality follows from the argument below, and the second inequality holds because Q* is the minimizer of KL(Q β€– Ξ (Β·|π’š, 𝒛)) over the variational family.

Proof of the first inequality. Write

$$\int \log \frac{dQ^*_{\zeta}(\zeta)}{d\Pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ^*_{\zeta}(\zeta) = \int \log \frac{q^*_{\zeta}(\zeta)}{\pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})}\, q^*_{\zeta}(\zeta)\, d\zeta.$$

For fixed 𝜁, consider the density $\tilde{\pi}_{\zeta}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big/ \int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}$ and the ratio $r(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) = q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) / \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})$. Then

$$\frac{q^*_{\zeta}(\zeta)}{\pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})} = \frac{\int q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}}{\int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}} = \mathrm{E}_{\tilde{\pi}_{\zeta}}[r],$$

so that, applying Jensen's inequality to the convex function π‘₯ ↦ π‘₯ log π‘₯,

$$q^*_{\zeta}(\zeta) \log \frac{q^*_{\zeta}(\zeta)}{\pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})} = \Big( \int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi} \Big)\, \mathrm{E}_{\tilde{\pi}_{\zeta}}[r] \log \mathrm{E}_{\tilde{\pi}_{\zeta}}[r] \le \Big( \int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi} \Big)\, \mathrm{E}_{\tilde{\pi}_{\zeta}}[r \log r]$$
$$= \int q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) \log \frac{q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{\pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}.$$

Integrating over 𝜁 gives

$$\mathrm{KL}\big( Q^*_{\zeta} \,\|\, \Pi_{\zeta}(\cdot \mid \boldsymbol{y}, \boldsymbol{z}) \big) \le \int q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) \log \frac{q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{\pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}\, d\zeta,$$

which completes the proof.

C.0.2 Proof of Lemma C.0.2

Let $L^* = p(\boldsymbol{y} \mid \boldsymbol{z}) = \int_{\mathcal{S}} p(\boldsymbol{y} \mid \zeta)\, d\Pi(\zeta \mid \boldsymbol{z})$ and $L(\zeta_0) = p(\boldsymbol{y} \mid \zeta_0)$.
By Markov's inequality, we have

$$P_0\left\{ \left| \log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, d\Pi(\zeta \mid \boldsymbol{z}) \right| > C_2\, n\epsilon_n^2 \log n \right\} \le \frac{1}{C_2\, n\epsilon_n^2 \log n}\, \mathrm{E}_0\left| \log \frac{L^*}{L(\zeta_0)} \right| \le \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ \mathrm{KL}\big( L(\zeta_0) \,\|\, L^* \big) + \frac{2}{e} \right\}$$
$$= \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ \mathrm{E}_0\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{\int p(\boldsymbol{y} \mid \zeta)\, d\Pi(\zeta \mid \boldsymbol{z})} \right] + \frac{2}{e} \right\} \le \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ \mathrm{E}_0\big[ -\log \Pi_{\zeta}(B_n \mid \boldsymbol{z}) + C_4\, n\epsilon_n^2 \big] + \frac{2}{e} \right\}$$
$$\le \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ C n\epsilon_n^2 + C_4\, n\epsilon_n^2 + \frac{2}{e} \right\} = \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ (C + C_4)\, n\epsilon_n^2 + \frac{2}{e} \right\} \longrightarrow 0.$$

For the second inequality we used Lemma 4 of Lee (2000); for the second-to-last inequality we applied the logarithm to the result of (A.4).

C.0.3 Proof of Lemma C.0.3

Note that

$$\mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right] = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \frac{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})}{p(\boldsymbol{y} \mid \zeta_0)} \right] = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, p(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{z})\, d\zeta \right].$$

Now, using Markov's inequality, Jensen's inequality, and Fubini's theorem,

$$P_0\left\{ \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, p(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{z})\, d\zeta \right] \ge C_3\, n\epsilon_n^2 \log n \right\} \le \frac{1}{C_3\, n\epsilon_n^2 \log n}\, \mathrm{E}_0\, \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, p(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{z})\, d\zeta \right]$$
$$\le \frac{1}{C_3\, n\epsilon_n^2 \log n}\, \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\, \mathrm{E}_{\zeta \mid \boldsymbol{\theta},\boldsymbol{\phi},\boldsymbol{z}}\, \mathrm{E}_0\left[ -\log \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)} \right] \le \frac{1}{C_3\, n\epsilon_n^2 \log n} \left[ C\epsilon_n^2 + C\epsilon_n^2 \exp(-n\epsilon_n^2) \right] \longrightarrow 0,$$

where the second inequality is due to Jensen's inequality. For the last inequality, expanding the Gaussian log-likelihood ratio and taking expectations gives

$$\mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\, \mathrm{E}_{\zeta \mid \boldsymbol{\theta},\boldsymbol{\phi},\boldsymbol{z}}\, \mathrm{E}_0\left[ -\log \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)} \right] = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \frac{1}{2} \sum_{i=1}^{n} \big( \zeta_0(t_i) - \mathrm{E}_{\zeta \mid \boldsymbol{\theta},\boldsymbol{\phi},\sigma,\boldsymbol{z}}[\zeta(t_i)] \big)^2 \right] \le C\epsilon_n^2 + C\epsilon_n^2 \exp(-n\epsilon_n^2),$$

where the last inequality is a consequence of Assumption 3.3.2 (i).
C.0.4 Proof of Lemma C.0.4

Note that

$$\mathrm{KL}\big( q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big) = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\big[ \log q(\boldsymbol{\theta}, \boldsymbol{\phi}) - \log p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi}) - \log p(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{z}) - \log p(\boldsymbol{z}) + \log p(\boldsymbol{y}, \boldsymbol{z}) \big]$$
$$= \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{z}) \big) + \log \frac{p(\boldsymbol{y}, \boldsymbol{z})}{p(\boldsymbol{z})} + \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right].$$

Hence,

$$\mathrm{KL}\big( q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big) = \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{z}) \big) + \log \int_{\mathcal{S}} \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, d\Pi(\zeta \mid \boldsymbol{z}) + \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right]$$
$$\le C_1\, n\epsilon_n^2 \log n + C_2\, n\epsilon_n^2 \log n + C_3\, n\epsilon_n^2 \log n \le C\, n\epsilon_n^2 \log n$$

with dominating probability, where the three terms are bounded by Assumption 3.3.2 (ii), Lemma C.0.2, and Lemma C.0.3, respectively.

CHAPTER 4

BAYESIAN MODEL CALIBRATION FOR BETA DECAY CALCULATIONS

Figure 4.1 Calibration framework in π›½βˆ’-decay calculation.

Accurate predictions of the 𝛽-decay rates of neutron-rich nuclei are crucial for π‘Ÿ-process studies. The recent development of the proton-neutron finite amplitude method (pnFAM) makes global 𝛽-decay studies feasible within nuclear density functional theory. In pnFAM calculations with the Skyrme functional, the time-odd terms, the isoscalar pairing, and the axial-vector coupling strongly impact the results. However, they are not constrained by the properties of even-even nuclei. Thus, model calibration and uncertainty quantification of 𝛽-decay predictions are necessary. In this chapter, we conduct Bayesian model calibration with selected experimental data and simulation data, and obtain the posterior distribution of the parameters together with reliable predictions. We consider various calibration schemes and compare the results with those of πœ’2 optimization.

From a statistical point of view, this problem can be summarized as follows. Consider a physics model π‘“π‘š that takes two inputs, 𝒕 and 𝜽. The input 𝒕 is known: it consists of the number of protons 𝑍 and the number of neutrons 𝑁. The model also has another set of inputs, the calibration parameters 𝜽, consisting of the dimensionless Landau-Migdal parameter 𝑔′0, the dimensionless isoscalar pairing strength 𝑣0, and the effective axial-vector coupling (quenching effect) 𝑔𝐴. If we plug all five values into π‘“π‘š, the physics model returns two outputs: the Gamow-Teller resonance (GTR) energy and the 𝛽-decay rate. This relationship is visualized in Figure 4.1.

The original framework proposed by Kennedy and O'Hagan (2001) can be computationally demanding since it estimates the emulator 𝑓, the discrepancy 𝛿, and the unknown parameters 𝜽 simultaneously using the joint dataset 𝒅 = (π’š, 𝒛). At the same time, this approach is flexible and generally yields a good fit to the experimental data, at the expense of a potential identifiability issue for the calibration parameters. For example, a poor fit of the emulator can be compensated by the discrepancy term, which can lead to non-physical values of the calibration parameters while the statistical model still fits the training data well (Bayarri et al., 2009). We shall refer to this simultaneous estimation as the integrated approach.
In contrast to the integrated approach, one can fit the emulator first using only the computer model evaluations 𝒛 and then find the posterior distribution of the unknown parameters using the experimental observations π’š. This two-stage estimation is known as the modular approach (Bayarri et al., 2009). The modular approach tends to be computationally less challenging than the integrated approach, as it does not require the evaluation of the joint likelihood 𝑝(𝒅|𝜽). Additionally, while the fitted emulator obtained via the integrated approach can be somewhat inconsistent with the computer model, the modular approach guarantees the performance of the emulator, since it is fitted using computer model evaluations only. From a physical point of view, the modular approach appears more natural, since one is often concerned with the performance of the emulator together with the estimation of the calibration parameters. Hence, we will compare and contrast these two estimation approaches in the context of calibration with GTR energies and 𝛽-decay rates.

In addition, the model discrepancy term often adds another level of complexity when interpreting the results. Hence, we also consider the case where there is no model discrepancy term. To be specific, we consider the following four models: Integrated (𝐼), Integrated without model discrepancy (𝐼 βˆ’ 𝛿), Modular (𝑀), and Modular without discrepancy (𝑀 βˆ’ 𝛿):

$$(I): \quad y_i = f(\boldsymbol{t}_i, \boldsymbol{\theta}) + \delta(\boldsymbol{t}_i) + \sigma \epsilon_i, \tag{4.1}$$
$$(I - \delta): \quad y_i = f(\boldsymbol{t}_i, \boldsymbol{\theta}) + \sigma \epsilon_i, \tag{4.2}$$
$$(M): \quad y_i = \bar{f}(\boldsymbol{t}_i, \boldsymbol{\theta}) + \delta(\boldsymbol{t}_i) + \sigma \epsilon_i, \tag{4.3}$$
$$(M - \delta): \quad y_i = \bar{f}(\boldsymbol{t}_i, \boldsymbol{\theta}) + \sigma \epsilon_i, \tag{4.4}$$

where ¯𝑓 represents a pretrained GP emulator built from simulation data only.

4.1 Data Collection

4.1.1 Experimental data

As shown in Tables 4.1 and 4.2 below, two types of observables are employed in the calibration: GTR energies 𝐸GTR and π›½βˆ’-decay half-lives 𝑇1/2. The half-lives are given in ascending order, and their logarithms lg 𝑇1/2 (base 10) are adopted in the fit, as the half-lives can vary by several orders of magnitude.

No. Nucleus 𝐸GTR Error
1 208Pb 15.6 0.2
2 132Sn 16.3 0.6
3 90Zr 8.7 –
4 112Sn 8.94 0.25

Table 4.1 GTR energies and their experimental errors (in MeV) taken from Refs. Yasuda et al. (2018); Akimune et al. (1995); Gaarde et al. (1981); Pham et al. (1995) for the four nuclei selected in this work.

No. Nucleus 𝑇1/2 Error | No. Nucleus 𝑇1/2 Error
5 98Kr 0.043 0.004 | 18 166Gd 4.8 1.0
6 58Ti 0.058 0.009 | 19 156Nd 5.26 0.2
7 102Sr 0.069 0.006 | 20 204Pt 10.3 1.4
8 82Zn 0.166 0.011 | 21 74Zn 95.6 1.2
9 48Ar 0.475 0.04 | 22 52Ti 102 6
10 60Cr 0.49 0.01 | 23 180Yb 144 30
11 126Cd 0.515 0.017 | 24 114Pd 145.2 3.6
12 114Ru 0.54 0.03 | 25 242U 1008 30
13 134Sn 1.05 0.011 | 26 134Te 2508 48
14 152Ce 1.4 0.2 | 27 92Sr 9399.6 61.2
15 78Zn 1.47 0.15 | 28 156Sm 33840 720
16 72Ni 1.57 0.05 | 29 200Pt 45360 1080
17 92Kr 1.84 0.008 |

Table 4.2 π›½βˆ’-decay half-lives and their experimental errors (in seconds) taken from the National Nuclear Data Center for the 25 nuclei selected in this work. Half-lives are listed in ascending order.

4.1.2 Simulation data

In the context of nuclear density functional theory (DFT), the finite amplitude method (FAM) provides an efficient solution for the (quasiparticle) random phase approximation (QRPA) (Nakatsukasa et al., 2007; Avogadro and Nakatsukasa, 2011).
Recent advancements in the Skyrme proton-neutron FAM (PNFAM) have made it possible to compute charge-changing transitions in deformed nuclei and to perform extensive calculations of 𝛽-decay rates (Shafer et al., 2016; Mustonen et al., 2014; Mustonen and Engel, 2016; Ney et al., 2020). In the PNFAM, the time-odd isovector Skyrme couplings, the isoscalar pairing strength (𝑣0), and the effective axial-vector coupling (𝑔𝐴) have a strong impact on 𝛽-decay transitions. In the actual calculation, the dimensionless Landau-Migdal parameter (𝑔′0) was used instead of the time-odd isovector Skyrme couplings.

4.2 Extension of the calibration model to two observable types

In order to estimate all the unknown quantities, including the calibration parameters, using both GTR energies and 𝛽-decay half-lives, we extend the original framework described in Chapter 1 by modeling each observable type independently. Specifically, let π’š1 denote a column vector of size 𝑛1 containing the experimental observations of the 1st observable type, and let 𝒛1 denote a column vector of size π‘š1 containing the computer model evaluations of the 1st observable type. Overall, for the 1st observable type, we have 𝒅1 = (π’š1, 𝒛1). Similarly, for the 2nd observable type, let π’š2 denote the 𝑛2 experimental observations and 𝒛2 the π‘š2 computer model evaluations; 𝒅2 is then the joint dataset for the 2nd observable type.

As described in Section 4.1.1, the fit observables consist of 4 GTR energies and 25 𝛽-decay half-lives. To emulate the PNFAM with the Skyrme energy density functional, we generated 1000 data points for each data type using a Latin Hypercube Design (see the appendix for more details). Denoting the sets of unknown parameters for each data type by 𝚿1 = (𝜽1, 𝝓1, 𝜎1) and 𝚿2 = (𝜽2, 𝝓2, 𝜎2), the joint conditional data likelihood for the integrated approach is

$$\begin{pmatrix} \boldsymbol{d}_1 \\ \boldsymbol{d}_2 \end{pmatrix} \,\Big|\, \boldsymbol{\Psi}_1, \boldsymbol{\Psi}_2 \;\sim\; N\!\left( \begin{pmatrix} \boldsymbol{0}_1 \\ \boldsymbol{0}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} \right), \tag{4.5}$$

where Ξ£1 and Ξ£2 are the covariance matrices consistent with the KOH model formulation in Chapter 2. The likelihood (4.5) can be used directly for the integrated approach's posterior sampling. Note that the calibration parameters do not, in general, need to be shared among the observable types. In our case, however, 𝜽1 ∩ 𝜽2 β‰  βˆ…, as 𝜽1 = (𝑔′0, 𝑣0) and 𝜽2 = (𝑔′0, 𝑣0, 𝑔𝐴).

The joint conditional likelihood for the modular approach is given by the likelihood of the experimental observations π’š1 and π’š2 alone, since all the computer model outputs are used to train the emulators independently. Namely,

$$\begin{pmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{pmatrix} \,\Big|\, \tilde{\boldsymbol{\Psi}}_1, \tilde{\boldsymbol{\Psi}}_2 \;\sim\; N\!\left( \begin{pmatrix} \hat{S}_1(\boldsymbol{\theta}_1) \\ \hat{S}_2(\boldsymbol{\theta}_2) \end{pmatrix}, \begin{pmatrix} K_{\delta_1}(T_{t_1}, T_{t_1}) + \sigma_1^2 I_{n_1} & 0 \\ 0 & K_{\delta_2}(T_{t_2}, T_{t_2}) + \sigma_2^2 I_{n_2} \end{pmatrix} \right), \tag{4.6}$$

where $\hat{S}_1(\boldsymbol{\theta}_1)$ is the posterior mean of the GP emulator at the design points $\{\tilde{\boldsymbol{t}}_{1,i}\}_{i=1}^{m_1}$ with calibration parameters 𝜽1. The vector ˜𝚿1 consists of the GP hyperparameters for 𝛿1, the scale of the observational error for the 1st data type, and the calibration parameters 𝜽1. $K_{\delta_1}(T_{t_1}, T_{t_1})$ is the matrix with elements $k_{\delta_1}(\tilde{\boldsymbol{t}}_{1,i}, \tilde{\boldsymbol{t}}_{1,j})$.
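Because the covariance in (4.5) and (4.6) is block-diagonal across the two observable types, the joint Gaussian log-likelihood is simply the sum of the per-type log-likelihoods. The short sketch below illustrates this structure; it is an illustrative sketch only, not the code used in this chapter, and the means, covariance blocks, and data are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.linalg import block_diag

def joint_loglik(d1, d2, mean1, mean2, Sigma1, Sigma2):
    """Joint log-likelihood with the block-diagonal covariance of (4.5)/(4.6)."""
    mean = np.concatenate([mean1, mean2])
    Sigma = block_diag(Sigma1, Sigma2)   # zero cross-covariance between observable types
    return multivariate_normal(mean, Sigma).logpdf(np.concatenate([d1, d2]))

def joint_loglik_split(d1, d2, mean1, mean2, Sigma1, Sigma2):
    """Equivalent, and cheaper, because the two blocks do not interact."""
    return (multivariate_normal(mean1, Sigma1).logpdf(d1)
            + multivariate_normal(mean2, Sigma2).logpdf(d2))
```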
The components of the likelihood (4.6) corresponding to the 2nd data type are defined analogously.

4.3 Prior Structure

It is often the case in Bayesian modeling that the careful elicitation of prior distributions is neglected, because with enough data the effect of the prior becomes small (Ghossal and Van der Vaart, 2017). Thoughtful prior selection is nonetheless very important in computer model calibration, because it can help mitigate the potential identifiability issue of the calibration parameters. In particular, strong priors on the GPs' hyperparameters are needed to distinguish the systematic discrepancy 𝛿 from the observational error 𝜎 (BrynjarsdΓ³ttir and O'Hagan, 2014).

In order to improve the computational stability of the MCMC sampler, we first reparameterize the GP length scales as correlation hyperparameters. Specifically, for the length scales of the GP emulator of 𝑓(𝒕, 𝜽), we set

$$\rho^{s}_{j,i} = \exp\!\left( -\frac{1}{4\, l^{s}_{j,i}} \right), \qquad i = 1, \dots, 2 + q_j, \quad j = 1, 2,$$

where π‘žπ‘— is the dimension of the calibration parameters for observable type 𝑗. A similar reparameterization is used for the length-scale parameters of the discrepancy function 𝛿. Since the original length-scale parameters are positive real numbers, MCMC sampling over them is slow, as the sampler potentially has to explore the entire positive real line; with this reparameterization, the sampling only needs to occur on the [0, 1] domain.

We therefore place a Beta distribution on the reparameterized length scales with shape parameters π‘Ž = 1 and 𝑏 = 0.3. This choice is informed by the fact that we want to check which inputs are relevant for the outputs. For example, we found that, among the 4 parameters, 𝑔′1 could be removed from the analysis based on the previous sections. Similarly, in the Bayesian analysis, since the prior mass is clustered around 1 (length scales are very large and the effect of the inputs is small), if an input is important we expect its posterior distribution to move toward 0.

For the precision parameters (the inverses of the scale parameters in the covariance functions), first note that they have a convenient interpretation: they determine how much of the data's variance comes from each term (𝑓, 𝛿, or πœ–). Recall that all the outputs were standardized to have unit variance. Therefore, we specify the precision parameters of the covariance functions for 𝑓1 and 𝑓2 to be close to 1 by assigning them a Gamma prior with shape parameter 𝛼 = 10 and rate parameter 𝛽 = 10, which sets the prior expectation of each precision parameter to 1. For the model discrepancy term, we expect a relatively small bias, which also helps mitigate the potential confounding issues, because the computer emulator is expected to explain most of the variance in the experimental observations. Hence, we set a Gamma prior with shape parameter 𝛼 = 10 and rate parameter 𝛽 = 0.3 on the precision parameters of the discrepancy term, which sets the prior expectation of each precision parameter of the covariance functions for 𝛿1 and 𝛿2 to 33. The relative size of the covariances from the discrepancy term is therefore about 3% (= 1/33) of those from 𝑓1 and 𝑓2. Finally, for the observational error terms, we set a Gamma prior with shape parameter 𝛼 = 10 and rate parameter 𝛽 = 0.0015.
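As a concrete illustration of this prior structure, the sketch below writes down the length-scale reparameterization and the Beta/Gamma priors with the shape and rate values quoted above. It is a schematic sketch with hypothetical helper names, not the code used for the analysis.

```python
import numpy as np
from scipy import stats

def rho_from_lengthscale(l):
    """Map a positive length scale l to the correlation parameter rho = exp(-1/(4l)) in (0, 1)."""
    return np.exp(-1.0 / (4.0 * l))

priors = {
    "rho_emulator":     stats.beta(a=1.0, b=0.3),           # mass near 1: an input is "off" unless the data say otherwise
    "rho_discrepancy":  stats.beta(a=1.0, b=0.3),
    "prec_emulator":    stats.gamma(a=10.0, scale=1/10.0),   # prior mean 1 (outputs standardized to unit variance)
    "prec_discrepancy": stats.gamma(a=10.0, scale=1/0.3),    # prior mean ~33   -> ~3% of the variance
    "prec_obs_error":   stats.gamma(a=10.0, scale=1/0.0015), # prior mean ~6667 -> ~0.015% of the variance
}

for name, dist in priors.items():
    print(f"{name:16s} prior mean = {dist.mean():.4g}")
```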
By the same logic, the relative size of the covariances from the observational error term is about 0.015% (= 1/6666.67) of those from 𝑓1 and 𝑓2. This hierarchical ordering of the variance explained by each term is important for distinguishing between the discrepancy and the observational error. In addition, the choice of prior on the observational error terms is consistent with the πœ’2 optimization results. More discussion of suitable priors on discrepancy terms can be found in BrynjarsdΓ³ttir and O'Hagan (2014) and Chong and Menberg (2018).

When it comes to the calibration parameters, we considered two cases. In the first case, we consider a uniform prior on each parameter space (𝑔′0 ∈ [0, 4], 𝑣0 ∈ [βˆ’2, 0], 𝑔𝐴 ∈ [βˆ’2, 0]). In the second case, we consider weakly informative priors for the calibration parameters, taken to be Truncated Normal (TN) distributions centered around the empirical values (𝑔′0* = 1.6, 𝑣0* = βˆ’1.1, 𝑔𝐴* = βˆ’1) with small standard deviations, truncated at the boundaries of each parameter space.

4.4 Calibration Results

The number of effective parameters is chosen based on the results of Li (2022). We therefore consider two cases, where 𝑔𝐴 is either free or fixed at the value 1.

4.4.1 𝑔𝐴 free case

For the two observable types (𝐸GTR, 𝑇1/2), we fit independent Gaussian Processes with the MatΓ©rn covariance class with roughness parameter 3/2, which share the calibration parameters: (𝑔′0, 𝑣0) for the GTR energies and (𝑔′0, 𝑣0, 𝑔𝐴) for the π›½βˆ’-decay half-lives.

Figure 4.2 shows the posterior distributions of the calibration parameters. The figure's first row shows the results when using TN priors on the calibration parameters. The MCMC results for 'I' and 'M' show multimodality in the posterior distribution of 𝑔′0. The VI method cannot capture this multimodality because we predefined a normal distribution as the variational family for the calibration parameters. However, the VI result is centered around the πœ’2 result, the standard approach in the physics community. The same holds for the models without a discrepancy term. For 𝑣0, the VI approach approximates the MCMC posteriors well; all are slightly off from the πœ’2 result. The results without a discrepancy term are less concentrated but centered at more or less the same values as the posteriors of the models with a discrepancy term. For 𝑔𝐴, while the πœ’2 result is around 0.5, the MCMC posteriors without model discrepancy are shifted slightly to the right, and the MCMC posteriors with model discrepancy center around the value 1.0. The approximate VI posterior is shifted slightly to the right of the MCMC results; this might be compensation for approximating the multimodal posterior of 𝑔′0 with a normal distribution.

The sensitivity of the posterior distributions to the two different priors is minimal for the MCMC methods. The second row of the figure shows the results when we use the uniform prior on each parameter space. The posterior shape is more or less the same for 𝑔′0 and 𝑣0, but the posterior distribution of 𝑔𝐴 has more mass toward the value 0.5. For VI, 𝑔′0 has moved slightly to the left, while the other parameters are concentrated around the same values as in the TN prior case. The means and standard deviations for each case are summarized in Table 4.3.
Method 𝑔′0 𝑣0 𝑔𝐴
I (U) 1.653(0.104) βˆ’0.941(0.064) 0.940(0.140)
I-𝛿 (U) 1.584(0.037) βˆ’0.916(0.080) 0.708(0.125)
M (U) 1.654(0.101) βˆ’0.943(0.066) 0.925(0.148)
M-𝛿 (U) 1.585(0.036) βˆ’0.920(0.080) 0.705(0.123)
VI (U) 1.404(0.028) βˆ’0.941(0.029) 1.089(0.049)
I (TN) 1.655(0.098) βˆ’0.957(0.056) 0.975(0.097)
I-𝛿 (TN) 1.590(0.036) βˆ’0.950(0.073) 0.804(0.129)
M (TN) 1.658(0.098) βˆ’0.958(0.058) 0.967(0.099)
M-𝛿 (TN) 1.589(0.036) βˆ’0.952(0.072) 0.800(0.124)
VI (TN) 1.600(0.049) βˆ’0.955(0.045) 1.147(0.055)
πœ’2 opt 1 1.592(0.034) βˆ’1.197(0.179) 0.503(0.143)
I (U) 1.465(0.171) βˆ’1.179(0.008) 1
I-𝛿 (U) 2.383(0.014) βˆ’1.802(0.191) 1
M (U) 1.456(0.142) βˆ’1.519(0.008) 1
M-𝛿 (U) 1.588(0.036) βˆ’0.898(0.166) 1
VI (U) 1.419(0.038) βˆ’0.778(0.010) 1
I (TN) 1.674(0.118) βˆ’0.942(0.064) 1
I-𝛿 (TN) 1.586(0.038) βˆ’0.780(0.013) 1
M (TN) 1.568(0.158) βˆ’1.119(0.099) 1
M-𝛿 (TN) 1.589(0.036) βˆ’0.922(0.163) 1
VI (TN) 1.422(0.040) βˆ’0.778(0.009) 1
πœ’2 opt 2 1.596(0.039) βˆ’1(0.178) 1

Table 4.3 Posterior summaries (mean and standard deviation) of the calibration parameters using both observable types. 'U' denotes the uniform prior and 'TN' the truncated normal prior. The upper block is the case where 𝑔𝐴 is free; the lower block is the case where 𝑔𝐴 is fixed at 1.

Figure 4.2 Posterior distributions of the calibration parameters using both observable types with free 𝑔𝐴. The columns of the figure show the posterior distribution of each calibration parameter; the upper three panels use the TN prior and the lower three use the uniform prior. The purple solid lines represent the prior distributions (uniform or TN) over each parameter space. Lines with different colors and line types show the different estimation approaches: blue dashed for 'I', orange dotted for 'I-𝛿', green dash-dotted for 'M', red dash-dot-dotted for 'M-𝛿', and the brown solid line for the 'VI' approach proposed in Chapter 3. The black solid vertical line represents the πœ’2 optimization result, with two-sigma error bands as black vertical dotted lines.

4.4.2 𝑔𝐴 fixed case

Figure 4.3 shows the posterior distributions of the calibration parameters when 𝑔𝐴 is fixed at 1. The posterior distributions of 𝑔′0 behave similarly to the free-𝑔𝐴 case: the models with a discrepancy term show multimodality, and the models without a discrepancy term center around the πœ’2 optimization results. The VI results have now moved slightly to the left. On the other hand, the posterior distributions of 𝑣0 differ across models and across the two priors, except for VI. In particular, the 𝐼 βˆ’ 𝛿 case with a uniform prior shows a posterior distribution pushed toward the boundary value near βˆ’2. This is very different from the other results, and we tried to understand why it happens. One possible explanation can be found by looking at the πœ’2 surface in Figure 4.4: the surface has two regions where the πœ’2 value is low. We therefore restricted the parameter space of 𝑣0 to avoid the unphysical region where 𝑣0 is less than βˆ’2. However, by fitting the emulator and the observational error together (the 𝐼 βˆ’ 𝛿 case), the posterior distribution of 𝑣0 moved toward this unphysical area. This means that a substantial proportion of the variance in the experimental data was absorbed by the observational error term. This can be confirmed by looking at the performance of the emulator on 100 GTR and 500 𝛽-decay out-of-sample simulation data points; see Table 4.4.
In both tables, the RMSE for the GTR data is not much different whether one fixes 𝑔𝐴 or not. However, the RMSE for the 𝛽-decay data tends to be higher if we fix 𝑔𝐴. This is because the 𝛽-decay data control 𝑣0 and 𝑔𝐴. In fact, if we look at the ′𝐼 βˆ’ 𝛿′ model with a uniform prior, the emulator's performance on the 𝛽-decay data is significantly worse than for the other methods. As we explained, the posterior distributions of the observational error terms are quite different in the ′𝐼 βˆ’ 𝛿′ case.

Figure 4.3 Posterior distribution of calibration parameters using both observable types with 𝑔𝐴 = 1.

Figure 4.4 πœ’2 surface for fixed 𝑔𝐴.

Method GTR RMSE (TN) π›½βˆ’ RMSE (TN) GTR RMSE (Unif) π›½βˆ’ RMSE (Unif)
I 0.057 0.707 0.056 0.707
I-𝛿 0.057 0.703 0.057 0.703
M 0.056 0.701 0.057 0.701
M-𝛿 0.057 0.700 0.057 0.701
VI 5.085 1.011 5.086 1.011
VI (mean) 0.033 0.538 0.033 0.535
I (𝑔𝐴 = 1) 0.057 1.158 0.057 1.123
I-𝛿 (𝑔𝐴 = 1) 0.056 1.152 0.057 1.545
M (𝑔𝐴 = 1) 0.057 1.130 0.057 1.165
M-𝛿 (𝑔𝐴 = 1) 0.057 1.151 0.057 1.150
VI (𝑔𝐴 = 1) 6.368 7.765 6.525 7.746
VI (𝑔𝐴 = 1, mean) 0.088 0.776 0.088 0.776

Table 4.4 Average RMSEs for 100 GTR and 500 𝛽-decay out-of-sample computer model outputs.

Figure 4.5 shows the posterior distributions of the hyperparameters for the ′𝐼 βˆ’ 𝛿′ case with fixed 𝑔𝐴 using the two different priors. The three panels show the differences in the posterior distributions of 1/πœ‚_{𝑓2}, 1/𝜎1Β², and 1/𝜎2Β², while all the other hyperparameters have posterior distributions that overlap each other. This implies that assigning distinct priors to the calibration parameters had an impact on the learning of the other parameters. This connection arises because the various components, including the emulator, the calibration parameters, and the observational error, are interconnected within the ′𝐼′ or ′𝐼 βˆ’ 𝛿′ framework. The VI approach is also an approximation of MCMC within the ′𝐼′ framework, and it likewise shows poor emulator performance. However, if we use the mean of the variational posterior for prediction instead of the full variational distribution, the performance of the emulator is quite comparable to the MCMC methods.

This poor performance of the emulator can be avoided by training the emulator first (the ′𝑀′ or ′𝑀 βˆ’ 𝛿′ cases) using the simulation data, thereby avoiding the confounding issue. Bayarri et al. (2009) also mentioned five reasons for using the modularization approach, which can be summarized as follows.

1. Keep a good module separate from a suspect module to avoid contamination.
2. Scientific understanding and development can require modularization.
3. Identifiability concerns or confounding might require modularization.
4. Mixing of MCMC analyses can greatly improve under modularization.
5. The computation is not otherwise possible.

In short, we have a certain expectation of the behavior of the emulator, and we would like to prevent the emulator's behavior from being contaminated by the experimental data by keeping the modules separate.

Figure 4.5 Posterior distribution of hyperparameters for fixed 𝑔𝐴.

4.5 Prediction Results

In this section, we compare the RMSEs for 179 out-of-sample beta-decay data based on each model.
Table 4.5 shows the performance of each model in predicting the 179 experimental data. There are two main features in the results: the model discrepancy term reduces the RMSEs, and freeing 𝑔𝐴 also reduces the RMSEs. In addition, the VI approach shows a significantly lower RMSE for fixed 𝑔𝐴. However, the average RMSE for the Bayesian approaches tends to be higher than the πœ’2 optimization results. Why, then, should we adopt Bayesian calibration methods?

In Figure 4.7, we can check the advantage of the Bayesian calibration approach. The x-axis of the figure is the logarithm of the beta-decay rates (the position of the 179 nuclei in terms of beta-decay rates), and the y-axis is the log-ratio of the predicted over the experimental beta-decay rates. If the prediction is close to the experimental value, the ratio becomes one, and taking the logarithm gives zero; for this reason, the value zero on the y-axis is the benchmark and is represented by a dotted line. The πœ’2 results show small prediction uncertainty overall, and hence the πœ’2 prediction intervals cannot capture the value zero. On the other hand, the MCMC prediction intervals capture zero well, with behavior similar to the πœ’2 results: accurate predictions at small beta-decay rates and poor predictions at large beta-decay rates. The VI approach approximates MCMC well, with smaller uncertainty. The prediction behavior of each model is very similar. For free 𝑔𝐴, the larger the x-value, the smaller the log-ratio tends to be; this is due to the physical simulator and can also be seen in the πœ’2 results. For fixed 𝑔𝐴, unlike the πœ’2 results, the predictions based on each model tend to be systematically off from 0.

Method π›½βˆ’ RMSE (TN prior) π›½βˆ’ RMSE (Unif prior)
I 1.702 1.701
I-𝛿 1.989 1.970
M 1.646 1.641
M-𝛿 1.939 1.916
VI 1.457 1.515
πœ’2 opt 1 0.829 0.829
I (𝑔𝐴 = 1) 1.991 1.961
I-𝛿 (𝑔𝐴 = 1) 2.105 2.189
M (𝑔𝐴 = 1) 1.893 1.882
M-𝛿 (𝑔𝐴 = 1) 2.124 2.119
VI (𝑔𝐴 = 1) 1.810 1.807
πœ’2 opt 2 (𝑔𝐴 = 1) 0.974 0.974

Table 4.5 Average RMSEs for the 179 beta-decay experimental data. The πœ’2 results do not depend on the prior.

Another way to see the advantage of Bayesian model calibration compared with the πœ’2 results is Figure 4.8. In this figure, the x-axis represents the theoretical coverage of the true values obtained by controlling the length of the prediction interval, and the y-axis represents how many of those intervals actually include the value 0 among the 179 prediction points (the empirical coverage). We can clearly see that the πœ’2 prediction intervals cannot include the true values even as the intervals are widened, whereas all the Bayesian approaches, whether based on MCMC or VI, capture the true values as the prediction intervals grow.

The main practical difference between the models is the running time of the code: about 80 hours for the ′𝐼′ approach with MCMC, 20 hours for ′𝑀′, and 3 hours for ′𝑉 𝐼′. The models without a discrepancy term took slightly less time than those with a discrepancy term.

Figure 4.6 Prediction of the beta-decay rates of 179 nuclei with one-standard-deviation error bars for the calibration models without discrepancy terms.
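For reference, the sketch below shows schematically how the two summaries in this section can be computed: the RMSE on the lg 𝑇1/2 scale and the empirical coverage of central prediction intervals as in Figure 4.8 (coverage of the experimental value, which is equivalent to coverage of zero on the log-ratio scale). This is an illustrative sketch only; the posterior-draw and experimental arrays are placeholders.

```python
import numpy as np

def rmse_log10(pred_halflife, exp_halflife):
    """RMSE between predicted and experimental half-lives on the log10 (lg) scale."""
    return np.sqrt(np.mean((np.log10(pred_halflife) - np.log10(exp_halflife)) ** 2))

def empirical_coverage(post_draws_log10, exp_log10, levels):
    """Fraction of nuclei whose experimental value falls inside the central interval of
    the posterior predictive draws, for each nominal credible level."""
    cover = []
    for level in levels:
        lo = np.percentile(post_draws_log10, 50 * (1 - level), axis=0)
        hi = np.percentile(post_draws_log10, 50 * (1 + level), axis=0)
        cover.append(np.mean((exp_log10 >= lo) & (exp_log10 <= hi)))
    return np.array(cover)

# Example with placeholder draws: 2000 posterior predictive samples for 179 nuclei.
draws = np.random.default_rng(2).normal(size=(2000, 179))
exp_vals = np.zeros(179)
print(empirical_coverage(draws, exp_vals, levels=[0.5, 0.8, 0.95]))
```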
4.6 Conclusion and Discussion

In this work, we found that choosing an appropriate covariance function is of fundamental importance in modeling the physical process; in that sense, we recommend the MatΓ©rn class of covariance functions rather than defaulting to the SE covariance. Setting relatively strong priors on the hyperparameters of the GPs is also necessary to correctly distinguish the discrepancy term from the observational error (BrynjarsdΓ³ttir and O'Hagan, 2014). In addition, instead of a naive application of the Bayesian calibration model, by keeping the modules separate we obtain a time advantage with results similar to the ′𝐼′ approach and with reliable emulator performance. Excluding the discrepancy term leads to similar predictive performance but can lead to biased results for the calibration parameters. The ′𝑉 𝐼′ approach gives fast and efficient results for calibration and for prediction of experimental values; however, because we restrict the shape of the posterior of the calibration parameters, researchers should be careful when interpreting the calibration parameters or the performance of the emulator.

Figure 4.7 Prediction of the beta-decay rates of 179 nuclei with one-standard-deviation error bars for the calibration models with discrepancy terms.

Figure 4.8 Coverage of the test data set using prediction intervals.

Lastly, further investigation of the relationship between πœ’2 optimization and Bayesian calibration would be an interesting research topic. The current πœ’2 RMSE is better than that of all the Bayesian approaches. Considering that πœ’2 minimizes a scaled version of the squared difference between experimental data and simulation data, the performance of the models without discrepancy is expected to be close to the πœ’2 results. Despite this expectation, the RMSE values for the models without discrepancy are rather poor. One key reason lies in how πœ’2 and the Bayesian approaches utilize the physics model: while πœ’2 employs the physics model directly, the Bayesian methods use a Gaussian Process (GP) emulator to approximate it. As a result, the Bayesian approach tends to perform worse than πœ’2. In cases where the difference between the GP emulator and the physics model becomes negligible, owing to sufficient simulation data, the models without discrepancy can closely approach the πœ’2 results. Furthermore, with an increase in experimental data, the Bayesian approach is expected to outperform πœ’2 optimization. This is because Bayesian models converge to the true physical processes with increasing experimental data, as discussed in previous chapters, while πœ’2 optimization remains within the constrained world of physics models. Nevertheless, this offers a potential rationale for the present findings, and further investigation through theoretical analysis and extensive simulation studies is needed.

BIBLIOGRAPHY

Akimune, H., Daito, I., Fujita, Y., Fujiwara, M., Greenfield, M. B., Harakeh, M. N., Inomata, T., JΓ€necke, J., Katori, K., Nakayama, S., Sakai, H., Sakemi, Y., Tanaka, M., and Yosoi, M. (1995). Direct proton decay from the Gamow-Teller resonance in 208Bi. Phys. Rev. C, 52(2):604–615.

Avogadro, P. and Nakatsukasa, T. (2011). Finite amplitude method for the quasiparticle random-phase approximation. Phys. Rev. C, 84(1):014314.

Balay, S., Abhyankar, S., Adams, M. F., Benson, S., Brown, J., Brune, P., Buschelman, K., Constantinescu, E., Dalcin, L., Dener, A., EΔ³khout, V., Gropp, W. D., Hapla, V., Isaac, T., Jolivet, P., Karpeev, D., Kaushik, D., Knepley, M. G., Kong, F., Kruger, S., May, D. A., McInnes, L. C., Mills, R. T., Mitchell, L., Munson, T., Roman, J. E., Rupp, K., Sanan, P., Sarich, J., Smith, B. F., Zampini, S., Zhang, H., Zhang, H., and Zhang, J. (2021a). PETSc/TAO users manual. Technical Report ANL-21/39 - Revision 3.16, Argonne National Laboratory.

Balay, S., Abhyankar, S., Adams, M. F., Benson, S., Brown, J., Brune, P., Buschelman, K., Constantinescu, E. M., Dalcin, L., Dener, A., EΔ³khout, V., Gropp, W. D., Hapla, V., Isaac, T., Jolivet, P., Karpeev, D., Kaushik, D., Knepley, M. G., Kong, F., Kruger, S., May, D. A., McInnes, L. C., Mills, R. T., Mitchell, L., Munson, T., Roman, J. E., Rupp, K., Sanan, P., Sarich, J., Smith, B. F., Zampini, S., Zhang, H., Zhang, H., and Zhang, J. (2021b). PETSc Web page.

Balay, S., Gropp, W. D., McInnes, L. C., and Smith, B. F. (1997). Efficient management of parallelism in object oriented numerical software libraries. In Arge, E., Bruaset, A. M., and Langtangen, H. P., editors, Mod. Softw. Tools Sci. Comput., pages 163–202. BirkhΓ€user Press.

Bayarri, M. J., Berger, J. O., and Liu, F. (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1):119–150.

BrynjarsdΓ³ttir, J. and O'Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30:114007.

Chong, A. and Menberg, K. (2018). Guidelines for the bayesian calibration of building energy models. Energy and Buildings, 174:527–547.

Gaarde, C., Rapaport, J., Taddeucci, T. N., Goodman, C. D., Foster, C. C., Bainum, D. E., Goulding, C. A., Greenfield, M. B., HΓΆren, D. J., and Sugarbaker, E. (1981). Excitation of giant spin-isospin multipole vibrations. Nucl. Phys. A, 369(2):258–280.

Ghossal, S. and Van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.

Kennedy, M. C. and O'Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464.

Li, T. (2022).
Application of density functional theory in nuclear structure. Ph.D. thesis, Michigan State University. Mustonen, M. T. and Engel, J. (2016). Global description of π›½βˆ’ decay in even-even nuclei with the axially-deformed skyrme finite-amplitude method. Phys. Rev. C, 93(1):014304. Mustonen, M. T., Shafer, T., Zenginerler, Z., and Engel, J. (2014). Finite-amplitude method for charge-changing transitions in axially deformed nuclei. Phys. Rev. C, 90(2):024308. Nakatsukasa, T., Inakura, T., and Yabana, K. (2007). Finite amplitude method for the solution of the random-phase approximation. Phys. Rev. C, 76(2):024318. Ney, E. M., Engel, J., Li, T., and Schunck, N. (2020). Global description of π›½βˆ’ decay with the axially deformed skyrme finite-amplitude method: Extension to odd-mass and odd-odd nuclei. Phys. Rev. C, 102(3):034326. Pham, K., JΓ€necke, J., Roberts, D. A., Harakeh, M. N., Berg, G. P. A., Chang, S., Liu, J., Stephenson, E. J., Davis, B. F., Akimune, H., and Fujiwara, M. (1995). Fragmentation and splitting of Gamow-Teller resonances in Sn(3He,t)Sb charge-exchange reactions, 𝐴=112–124. Phys. Rev. C, 51(2):526–540. Shafer, T., Engel, J., FrΓΆhlich, C., McLaughlin, G. C., Mumpower, M., and Surman, R. (2016). 𝛽 decay of deformed π‘Ÿ-process nuclei near 𝐴 = 80 and 𝐴 = 160, including odd-𝐴 and odd-odd nuclei, with the skyrme finite-amplitude method. Phys. Rev. C, 94(5):055802. Wild, S. M. (2017). Chapter 40: POUNDERS in TAO: Solving Derivative-Free Nonlinear Least- Squares Problems with POUNDERS. In Advances and Trends in Optimization with Engineering Applications, MOS-SIAM Series on Optimization, pages 529–539. Society for Industrial and Applied Mathematics. Yasuda, J., Sasano, M., Zegers, R. G. T., Baba, H., Bazin, D., Chao, W., Dozono, M., Fukuda, N., Inabe, N., Isobe, T., Jhang, G., Kameda, D., Kaneko, M., Kisamori, K., Kobayashi, M., Kobayashi, N., Kobayashi, T., Koyama, S., Kondo, Y., Krasznahorkay, A. J., Kubo, T., Kubota, Y., Kurata-Nishimura, M., Lee, C. S., Lee, J. W., Matsuda, Y., Milman, E., Michimasa, S., Motobayashi, T., Muecher, D., Murakami, T., Nakamura, T., Nakatsuka, N., Ota, S., Otsu, H., Panin, V., Powell, W., Reichert, S., Sakaguchi, S., Sakai, H., Sako, M., Sato, H., Shimizu, Y., Shikata, M., Shimoura, S., Stuhl, L., Sumikama, T., Suzuki, H., Tangwancharoen, S., Takaki, M., Takeda, H., Tako, T., Togano, Y., Tokieda, H., Tsubota, J., Uesaka, T., Wakasa, T., Yako, K., Yoneda, K., and Zenihiro, J. (2018). Extraction of the Landau-Migdal parameter from the Gamow-Teller Giant Resonance in 132Sn. Phys. Rev. Lett., 121(13):132501. 84 APPENDIX A LATIN HYPERCUBE DESIGN For the Bayesian calibration model, we need a set of model evaluations for each observable type. In order to evaluate the model using a GP emulator, design points are required. Hence we used Latin Hypercube Design in the scaled input space of calibration parameters [0, 1]3. The figures below represent the points we used for the evaluation of the computer model. An interesting point is that in the right Figure of A.1, the points in 𝑔𝐴 are very regular compared to other parameters. This is because the calculation of a computer model with different 𝑔𝐴 is cheap once one point is calculated. Figure A.1 (Latin Hypercube Design points) Left figure is for GTR simulation and the right figure is for 𝛽 decay simulation. 85 APPENDIX B CHI-SQUARE OPTIMIZATION FRAMEWORK The πœ’2 optimization framework, based on Dobaczewski et al. 
(2014), seeks to determine the optimal model parameters by minimizing the weighted sum of squared residuals. Here, we briefly introduce the important results, following the presentation of Li (2022):

$$\chi^2(\boldsymbol{x}) = \frac{\sum_{k=1}^{n_d} \varepsilon_k^2(\boldsymbol{x})}{n_d - n_x} = \frac{1}{n_d - n_x} \sum_{k=1}^{n_d} \left[ \frac{s_k(\boldsymbol{x}) - d_k}{w_k} \right]^2. \tag{B.1}$$

The objective function πœ’Β²(𝒙) represents the difference between the model predictions π‘ π‘˜(𝒙) and the experimental values π‘‘π‘˜ for each observable π‘˜, scaled by the respective weights π‘€π‘˜. The weights π‘€π‘˜ are essential to make the residuals dimensionless, especially when dealing with multiple observable types.

The assumption is made that all weighted residuals πœ€π‘˜ are independent and follow a common normal distribution with zero mean and variance 𝜎². Consequently, the πœ’Β² value at the optimal point ˆ𝒙 is approximately equal to this variance 𝜎². To meet this assumption, it is essential to select weights π‘€π‘˜ that closely match the errors associated with the model predictions π‘ π‘˜, encompassing theoretical, numerical, and experimental uncertainties; this strategy is employed to bring πœ’Β²(ˆ𝒙) close to a value of 1. Although each data point can have its own weight, it is common practice for data points of the same type to be assigned the same weight, since their errors are expected to be similar. This ensures consistency in the treatment of similar observations within the optimization process.

The relative weights among different observable types are the key consideration, because a global scale factor 𝑠, known as the Birge factor (Birge, 1932), can be introduced to scale the entire πœ’Β² value and ensure it approaches 1:

$$\chi^2(\hat{\boldsymbol{x}}) \;\to\; \tilde{\chi}^2(\hat{\boldsymbol{x}}) = \chi^2(\hat{\boldsymbol{x}})/s = 1, \qquad w_k \;\to\; \tilde{w}_k = w_k \sqrt{s}. \tag{B.2}$$

Then, for consistency between the weights and the residual distributions, the scaled weight for a given observable type, $\tilde{w}_{\mathrm{typ}}$, should be close to

$$r_{\mathrm{typ}} = \sqrt{ \frac{n_d}{n_{\mathrm{typ}}\, (n_d - n_x)} \sum_{k \in \mathrm{typ}} \big[ s_k(\hat{\boldsymbol{x}}) - d_k \big]^2 }, \tag{B.3}$$

where 𝑛typ is the number of points of the given type.

By using linear expansions of the weighted residuals πœ€π‘˜(𝒙) around the optimal point ˆ𝒙, we can convert the original nonlinear optimization problem into a linear one, which allows us to apply the linear-regression framework. Here are some important conclusions; for more rigorous mathematical details, please refer to Seber (1989). Let 𝒙* be the true parameter vector. Then the difference (ˆ𝒙 βˆ’ 𝒙*) approximately follows a multivariate normal distribution, ˆ𝒙 βˆ’ 𝒙* ∼ 𝑁(0, Cov(ˆ𝒙)), with covariance matrix

$$\mathrm{Cov}(\hat{\boldsymbol{x}}) \approx \chi^2(\hat{\boldsymbol{x}}) \left[ J^{T}(\hat{\boldsymbol{x}})\, J(\hat{\boldsymbol{x}}) \right]^{-1}, \tag{B.4}$$

where the 𝑛𝑑 Γ— 𝑛π‘₯ Jacobian matrix 𝐽(𝒙) is defined by π½π‘˜π‘™ = βˆ‚πœ€π‘˜/βˆ‚π‘₯𝑙. As for the optimization routine, we employ POUNDERS of Wild (2017) in the PETSc/TAO toolkit (Balay et al., 1997, 2021a,b).

CHAPTER 5

CONCLUSIONS AND FUTURE DIRECTIONS

In this thesis, we have explored some theoretical properties of the original KOH model framework, specifically focusing on the posterior consistency of the estimated physical processes. This work was guided by the lack of theory backing up a method that has been widely used in science and engineering problems, with over 4000 citations.
We first started to look at the KOH model as a hierarchical GP regression model and adopted the extension of Schwartz theorem’s test function framework to calculate the posterior contraction rate using true posterior. We further proposed a new variational algorithm to overcome the computational issue of the MCMC-based KOH model. In addition, we showed the posterior contraction rate of the proposed method under reasonable assumptions on GP prior and regularity conditions on the variational family. This was supported by simulation studies and the real data application in 𝛽-decay calculation. We have also examined a variety of issues concerning the implementation of both methods, MCMC and VB: Choosing covariance class for modeling the physical process, prior structure to mitigate the identifiability issue of calibration parameters, and modularizing the training of the emulator for a better understanding of the model and a better interpretation. Our findings and discussions have significant implications of understanding the KOH model. The findings presented in this thesis will serve as a foundation for future theoretical investigations in this area. Other interesting directions of research from here could be summarized as follows. Improving the performance of the emulator: The newly proposed algorithm in this thesis remains rooted in Gaussian Processes. As a consequence, the optimization process necessitated addressing the issue of nonpositive semi-definite covariance. However, there is potential to explore alternative avenues where advanced Machine Learning techniques, such as Deep learning models, diffusion models, or generative models, could replace Gaussian Processes. In particular, the use of Deep Gaussian processes for constructing emulators has gained attention from researchers recently. This presents a captivating and relevant topic for investigation, especially in the current era of AI, 88 enticing researchers from various domains. Calibration parameters carrying physical meaning: An alternative avenue of investigation pertains to the calibration parameters. In the original KOH model, all the unknowns are estimated simultaneously, which raises concerns about the identifiability of calibration parameters. Therefore, an intriguing project would involve establishing a novel definition for calibration parameters that not only ensures identifiability but also preserves their physical significance. Interested people may explore Li and Xiong’s (2022) fascinating method of decomposing the discrepancy term through Taylor expansion. Theoretical investigation of Modular approach: The primary focus of the original KOH model is to excel in handling experimental data. Nevertheless, domain scientists also place signif- icant emphasis on the emulator. Therefore, maintaining a separate emulator module by training it independently could prove beneficial in enhancing the comprehension and interpretation of the model. Demonstrating theoretical backing for the Modular approach presents an engaging path for future research. 89