CALIBRATED COMPUTER MODELS USING MARKOV CHAIN MONTE CARLO AND VARIATIONAL BAYES By Mookyong Son A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Statisticsβ€”Doctor of Philosophy 2023 ABSTRACT Computer models are used to solve complex problems in many scientific applications, such as nuclear physics and climate research. Although a popular approach, Markov chain Monte Carlo- based Bayesian calibration of computer models has not been investigated much regarding theoretical properties until relatively recently. Hence, this work focuses on the theory of computer model calibration through a proof of posterior consistency of the estimated physical process. In Chapter 1, we review the general framework of computer model calibration, Gaussian Processes, and Posterior Consistency. In Chapter 2, we prove the posterior consistency of the estimated physical process in the Bayesian model calibration framework using Markov chain Monte Carlo. We used the extension of Schwartz’s theorem to show the posterior contraction rate using GP priors. In Chapter 3, we propose a fast and scalable posterior approximation algorithm for Bayesian computer model calibration via Variational Inference. Variational Inference is an optimization- based method alternative to the time-consuming Markov chain Monte Carlo approximation of posterior distributions popularized by machine learners. We provide the statistical guarantee of the proposed algorithm in the form of a posterior consistency theorem for the estimated physical process under regularity assumptions on the variational family. The main results are shown in the two widely used classes of Gaussian Process priors, the Squared Exponential covariance class and the Matern covariance class. We also provide a simulation study to demonstrate the proposed method’s time efficiency and fidelity compared to the standard Markov chain Monte Carlo method. In Chapter 4, we applied the Bayesian model calibration framework to 𝛽 decay calculation and compared different calibration approaches. We adopt πœ’2 results as a benchmark to show the advantage of Bayesian approaches. Finally, we conclude the thesis with future directions for further research in Chapter 5. Copyright by MOOKYONG SON 2023 ACKNOWLEDGEMENTS I would like to express my heartfelt gratitude to my advisors, Dr. Tapabrata Maiti and Dr. ShrΔ³ita Bhattacharya, for their invaluable guidance throughout my research journey. Their insightful suggestions have consistently propelled me forward and led to significant progress. I also want to acknowledge the members of my dissertation committee, Dr. Witold Nazarewicz and Dr. Chih-Li Sung, whose feedback and comments have been instrumental in shaping the quality of my research. I am particularly grateful for the weekly meetings with physicists at the Facility for Rare Isotope Beams (FRIB), which have broadened my perspective on the practical applications of Statistics. Additionally, the BAND research collaboration provided me with wonderful networking opportunities with fellow physicists. Furthermore, I extend my gratitude to my mentors, Stefan Wild and Arindam Fadikar, during my remote internship at Argonne National Lab amidst the pandemic. Our discussions greatly honed my expertise in my field. I am thankful for all the friends I made during my time at Michigan State University. 
Special thanks go to Arkaprabha Ganguli, Anirban Samaddar, and Anriana Manousidaki, with whom I shared powerful memories while studying for prelim during the summer. Passing prelim without their support would have been nearly impossible. I am also indebted to Tong Li and Vojtech Kejzlar for the profound discussions that greatly enriched my thesis. Last but not least, my gratitude goes to my family. The unwavering support of my wife, Jin Young Kwak, and my daughter, Lia, has made the completion of my Ph.D. possible. Moreover, I am truly grateful for the unconditional encouragement I have received from my parents, parents-in-law, brother, and brother-in-law throughout this journey. Their constant belief in me has been a driving force behind my accomplishments. Throughout my five years at Michigan State University, the courses and seminars have imparted valuable knowledge, which I have successfully applied to my research. I eagerly look forward to embracing my future as a statistician with enthusiasm and dedication. iv TABLE OF CONTENTS CHAPTER 1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.1 Computer Model Calibration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1.2 Gaussian Processes . 1.3 Posterior Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . BIBLIOGRAPHY . . . . . . 1 1 3 5 7 CHAPTER 2 . . . . CALIBRATED COMPUTER MODEL USING MARKOV CHAIN MONTE CARLO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.1 Bayesian Hierarchical Formulation of Kennedy and O’Hagan Model 9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Preliminaries 2.3 Posterior Consistency in Kenndy and O’Hagan Model . . . . . . . . . . . . . . 17 2.4 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 BIBLIOGRAPHY . . PROOF OF THEOREM 1 APPENDIX A . 22 CHECKING ASSUMPTION 2.2.1 . . . . . . . . . . . . . . . . 27 APPENDIX B CHECKING ASSUMPTION 2.2.2 1 . . . . . . . . . . . . . . . 31 APPENDIX C CHECKING ASSUMPTION 2.2.2 2 . . . . . . . . . . . . . . . 39 APPENDIX D . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 3 CALIBRATED COMPUTER MODEL USING VARIATIONAL BAYES (VARIATIONAL INFERENCE) . . . . . . . . . . . . . . . . . . . . . . 41 3.1 Variational Bayes formulation . . . . . . . . . . . . . . . . . . . . . . . . . . 41 3.2 Variational Bayes Algorithm for Discrepancy Model . 44 3.3 Posterior Contraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 . 49 3.4 Simulation Study: Ball Dropping Experiment 3.5 Simulation Study: Logistic Function . . . . . . . . . . . . . . . . . . . . . . . 50 3.6 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 BIBLIOGRAPHY . . CHECKING ASSUMPTION 3.3.2 (I) APPENDIX A . . . . . . . . . . . . . . 57 CHECKING ASSUMPTION 3.3.2 (II) . . . . . . . . . . . . . . 58 APPENDIX B . 61 PROOF OF THEOREM 3 . . . . . . . . . . . . . . . . . . . . APPENDIX C . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CHAPTER 4 . . BAYESIAN MODEL CALIBRATION FOR BETA DECAY CALCULATIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 . 69 . 70 . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 72 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 . 79 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 LATIN HYPERCUBE DESIGN . . . . . . . . . . . . . . . . . 85 CHI-SQUARE OPTIMIZATION FRAMEWORK . . . . . . . . 86 4.1 Data Collection . 4.2 Extension of the calibration model to two observable types . . . . . . . . . . 4.3 Prior Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4.4 Calibration Results . 4.5 Prediction Results . . 4.6 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . BIBLIOGRAPHY . APPENDIX A APPENDIX B . . . . . v CHAPTER 5 CONCLUSIONS AND FUTURE DIRECTIONS . . . . . . . . . . . . . 88 vi CHAPTER 1 INTRODUCTION 1.1 Computer Model Calibration Computer models are an essential part of today’s scientific process. With the increasing access to high-performance computing, computer models provide much faster and economical environment for scientists to rapidly develop their domain fields such as Biology, Chemistry, Epidemiology, Engineering, and Physics to just name a few. However, computer models are canonically imperfect representation of physical systems. The landmark publication of Kennedy and O’Hagan (2001) approached this issue by introducing discrepancy function between a physical process and a computer model. This work, up to the present time, accumulated more than 3000 citations and led to countless applications. For instance, see King et al. (2019); Gatt et al. (2020) for examples in nuclear physics, Seede et al. (2020); Honarmandi and ArrΓ³yave (2020) for examples in material sciences, and Colosimoa et al. (2018); HΓΌllen et al. (2020); Choi et al. (2018) for engineering applications. The theoretical properties of the original framework have not been investigated until relatively recently. There are two properties of the calibration model we can think of, the one is identifiability of calibration parameters and the other is the consistency of the estimated physical process to the true physical process. Non-identifiability issue of calibration parameters in the KOH model stems from the original paper Kennedy and O’Hagan (2001), which concerns the distribution of observed physical data induced by the computer model, which does not uniquely determine the calibration parameter. There is a couple of papers to handle this problem by modifying the original model, for example take a look at Plumlee (2017) and Xie and Xu (2020). In this paper, however, our goal is to show the consistency property of the original KOH model in Fully Bayesian setting. There are also some papers to deal with this issue such as Xie and Xu (2020) or Li and Xiong (2022). But again these papers are based on the modified version of the model such as defining the calibration parameters as 𝐿2 minimization between the computer model and the physical process or decomposing the discrepancy term. 1 The KOH model is built based on some assumptions. Firstly, Kennedy and O’Hagan (2001) assumes that although the computer model is not fully available, one does have exact realizations from the computer model at a finite discrete number of points. To facilitate this, a Gaussian process is assumed on the computer model indexed by (𝒕, 𝜽) with nuisance parameters πœ™ 𝑓 . 𝜽 is often referred to as the calibration parameter. Additionally, Kennedy and O’Hagan (2001) assumes a discrepancy term between the true physical process and the computer model. 
It assumes a Gaussian process on the discrepancy term with index 𝒕 with nuisance parameters πœ™π›Ώ. The aim of the Kennedy and O’Hagan (2001) model is to study the relationship between the computer model 𝑓 and the physical process 𝜁 by learning about calibration parameter 𝜽, and eventually to do predictions π’šβˆ— = (π‘¦βˆ— 1 , . . . , π‘¦βˆ— 𝐽) of the physical process 𝜁 at new inputs (π’•βˆ— 1 , . . . , π’•βˆ— 𝐽). The current theoretical developments have three main drawbacks (1) they assume the computer model is explicitly available (2) 𝜽 is an unknown parameter characterizing the computer model (3) πœ™ 𝑓 and πœ™π›Ώ are unknown parameters. In this paper, we provide the posterior consistency of the estimated physical process under the full Bayesian treatment of the Kennedy and O’Hagan (2001) model. Following Kejzlar et al. (2020), we consider an equivalent representation of Kennedy and O’Hagan (2001) model as a hierarchical model. Consequently, we prove the posterior consistency of the estimated physical process using an original extension of Schwartz’s theorem for non-parametric regression problems with GP priors. Unlike all the previous works in the literature, we stick to the assumption that computer model is not fully available and one has only the evaluations from the computer model at finite number of points. Secondly, we assume that πœƒ, πœ“1 and πœ“2 are not just parameters with simple plug-in estimators but random variables with their own prior distributions. 1.1.1 Kennedy O’Hagan Model Let us consider observations π’š = (𝑦1, . . . , 𝑦𝑛) of a physical process 𝜁 depending on a known set of inputs 𝒕𝑖 ∈ 𝛀 βŠ‚ R𝑝, 𝑖 = 1, Β· Β· Β· , 𝑛, 𝑝 β‰₯ 1 given by 𝑦𝑖 = 𝜁 (𝒕𝑖) + πœŽπœ–π‘–, (1.1) 2 where 𝜎 represents the scale of observational error, and πœ–π‘– 𝑖.𝑖.𝑑. ∼ N (0, 1). Let 𝑓 denote a computer model. Kennedy and O’Hagan (2001) introduced a discrepancy term 𝛿 to represent the unknown systematic error between the computer model 𝑓 and the physical process 𝜁. Hence the physical process can be represented as a summation of imperfect computer model plus the discrepancy term. 𝜁 (𝒕𝑖) = 𝑓 (𝒕𝑖, 𝜽) + 𝛿(𝒕𝑖), 𝑖 = 1, . . . , 𝑛 (1.2) Note, the above expression involves known inputs 𝒕𝑖 and unknown parameters 𝜽, also referred to as the calibration parameter. The calibration parameter is not controllable in the physical process 𝜁 and thus may be used to represent aspects of the computer model which can not be measured in the physical observations. In some cases, the calibration parameter may also carry a real physical meaning. For example, consider a famous ball drop example by Plumlee (2017). If one constructs a computer model to check the time for a ball to hit the ground from a certain height. Then the gravitational force could be used as a calibration parameter to build a computer model. Further discussion of physical meaning in calibration parameters could be checked at BrynjarsdΓ³ttir and O’Hagan (2014). Next, we assume that the evaluation of computer model 𝑓 is either expensive or not directly available, which are often the case in a real-world application. The common practice in this situation is to fit a surrogate for a computer model based on the simulation data. Hence, let 𝒛 = (𝑧1, . . . , 𝑧𝑠) denote outputs from the computer model 𝑓 under a known set of inputs Λœπ’•π‘– ∈ 𝛀, 𝑝 β‰₯ 1 and Λœπœ½π‘– ∈ Θ. 𝑧𝑖 = 𝑓 ( Λœπ’•π‘–, Λœπœ½π‘–) 𝑖 = 1, . . . , 𝑠, (1.3) Without loss of generality, we assume 𝛀 = [0, 1] 𝑝 for the rest of the paper. 
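To make the roles of equations (1.1)–(1.3) concrete, the following is a minimal sketch (in Python) of how field data and simulator runs arise in the ball-drop example mentioned above. The free-fall drop-time formula used for $f$, the linear form chosen for the discrepancy $\delta$, and all numerical values are illustrative assumptions only and are not part of the Kennedy and O'Hagan (2001) framework itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def computer_model(t, theta):
    """Toy simulator f(t, theta): time for a ball to fall from height t
    under gravitational acceleration theta (no air resistance)."""
    return np.sqrt(2.0 * t / theta)

def discrepancy(t):
    """Hypothetical systematic error delta(t) between simulator and reality
    (e.g., unmodeled drag); chosen arbitrarily for illustration."""
    return 0.05 * t

theta_true = 9.81          # "true" calibration parameter (unknown in practice)
sigma = 0.02               # observational error scale

# Field observations (1.1)-(1.2): y_i = f(t_i, theta*) + delta(t_i) + sigma * eps_i
t_field = rng.uniform(1.0, 20.0, size=15)      # drop heights in meters (rescaled to [0,1] in the formal setup)
y = computer_model(t_field, theta_true) + discrepancy(t_field) \
    + sigma * rng.normal(size=t_field.size)

# Simulator runs (1.3): z_i = f(t_tilde_i, theta_tilde_i) at chosen design points
t_sim = rng.uniform(1.0, 20.0, size=30)
theta_sim = rng.uniform(8.0, 11.0, size=30)    # design over the calibration parameter
z = computer_model(t_sim, theta_sim)

print(y[:3], z[:3])
```

Here $(\boldsymbol{t}_i, y_i)$ play the role of the physical observations in (1.1) and $(\tilde{\boldsymbol{t}}_i, \tilde{\boldsymbol{\theta}}_i, z_i)$ the simulation data in (1.3), from which a surrogate for $f$ is later built.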
1.2 Gaussian Processes

A Gaussian process is a collection of random variables, any finite number of which have a joint multivariate Gaussian distribution. Hence we can think of it as an infinite-dimensional multivariate Gaussian distribution. A Gaussian process $f(x)$ is completely specified by its mean function $m(x)$ and covariance function $k(x, x')$, where
$$m(x) = \mathbb{E}[f(x)], \qquad k(x, x') = \mathbb{E}\big[(f(x) - m(x))(f(x') - m(x'))\big].$$

Figure 1.1 (Gaussian Process priors) The left panel shows realizations of three random functions from a Gaussian process prior, and the right panel shows realizations of three random functions from the posterior distribution updated by five training points, shown as "+" signs. This figure is taken from Rasmussen and Williams (2006).

A commonly adopted mean function is a constant, while many types of covariance functions are widely used. Two of the most common are the 'Squared Exponential' (SE) and 'Matérn' covariance functions. The Squared Exponential covariance function has the form
$$k(x, x'; \eta, l) = \eta \exp\left(-\frac{(x - x')^2}{2 l^2}\right),$$
where $\eta > 0$ is the scale parameter and $l > 0$ is the length scale parameter. The Matérn covariance function has the form
$$k(x, x'; \eta, l, \nu) = \eta \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\,|x - x'|}{l}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\,|x - x'|}{l}\right),$$
where $K_\nu$ is the modified Bessel function and $\nu > 0$ is the roughness parameter.

Figure 1.2 (Covariance functions of GP) The left panel shows the Matérn covariance as a function of the input distance $r = |x - x'|$ with $\eta$ and $l$ fixed at 1. Different colors represent different values of the roughness parameter $\nu$. The red line corresponds to $\nu = \infty$, which is equivalent to the SE covariance. The right panel shows realizations of random functions from a GP with each covariance function.

The connection between these two classes is that the SE covariance is the special case of the Matérn covariance in which $\nu$ equals $\infty$. As the roughness parameter tends to infinity, the sample functions from a GP with SE covariance become infinitely differentiable, in other words, very smooth. This can be seen in Figure 1.2: the sample function from the SE covariance is colored red and looks very smooth compared to the other, wigglier sample functions. This smoothness assumption is often too restrictive for modeling the response surface of physical processes, so in practice it is often advised to use the Matérn covariance, Stein (1999). However, the SE covariance often makes it easier to derive sample function properties of the GP. Therefore, we consider both cases in developing the large sample theory of Bayesian model calibration in this thesis. (A small numerical sketch of these two kernels is given below.)

1.3 Posterior Consistency

The consistency of estimators is usually discussed in the context of frequentist statistics. In a frequentist framework, a consistent estimator converges in probability to the true parameter as the number of data points approaches infinity. This idea can be extended to the Bayesian framework as well. For Bayesian methods, convergence properties are discussed using the entire posterior distribution rather than a single estimate.
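As a numerical companion to the SE and Matérn covariance functions of Section 1.2, the sketch below evaluates both kernels and illustrates that the Matérn covariance approaches the SE covariance as the roughness parameter $\nu$ grows. The grid of distances and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.special import kv, gamma  # modified Bessel function K_nu and Gamma function

def k_se(x1, x2, eta=1.0, l=1.0):
    """Squared exponential covariance k(x, x') = eta * exp(-(x - x')^2 / (2 l^2))."""
    return eta * np.exp(-0.5 * (x1 - x2) ** 2 / l ** 2)

def k_matern(x1, x2, eta=1.0, l=1.0, nu=2.5):
    """Matern covariance; nu controls the smoothness of GP sample paths."""
    r = np.atleast_1d(np.abs(np.asarray(x1, float) - np.asarray(x2, float)))
    s = np.sqrt(2.0 * nu) * r / l
    out = np.full_like(s, eta)                 # by definition k(x, x) = eta
    pos = s > 0
    out[pos] = eta * (2.0 ** (1 - nu) / gamma(nu)) * s[pos] ** nu * kv(nu, s[pos])
    return out

r = np.linspace(0.0, 3.0, 7)                   # input distances |x - x'|
for nu in (0.5, 1.5, 2.5, 50.0):               # large nu recovers the SE kernel
    print(nu, np.round(k_matern(0.0, r, nu=nu), 3))
print("SE", np.round(k_se(0.0, r), 3))
```

With $\nu = 1/2, 3/2, 5/2$ the Matérn values differ visibly from the SE values at moderate distances, while a large $\nu$ (here 50) closely approximates the SE kernel, mirroring the comparison in Figure 1.2.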
Assume that a set of data $\boldsymbol{y}_n = (y_1, \cdots, y_n)$ is randomly sampled from $f_0 \in \mathcal{F}$. Then the posterior distribution is a random measure defined as follows,
$$\Pi(B \mid \boldsymbol{y}_n) = \frac{\int_B \prod_{i=1}^{n} f(y_i)\, d\Pi(f)}{\int_{\mathcal{F}} \prod_{i=1}^{n} f(y_i)\, d\Pi(f)},$$
where $B$ is a measurable set in $\mathcal{F}$. The natural question is what happens to the posterior distribution as the sample size $n$ increases to infinity. If the posterior distribution of $f$ concentrates around some small neighborhood of the true parameter $f_0$, then we say that the posterior distribution is consistent. In other words, as we see more data, the data provide more information on the parameter and ideally the posterior distribution of the parameter converges to the truth, which is a point mass.

There has been a line of research on posterior consistency, and the first result is due to Doob (1949). Doob showed that the posterior distribution is consistent almost everywhere on the prior support, but this is not useful for checking consistency at a particular density. Later, Schwartz (1965) introduced a test function framework for establishing posterior consistency. In particular, he showed that for the posterior to be consistent, the true parameter and the complement of a neighborhood of the true parameter must be separable. This idea of separation is conveniently formalized through the existence of appropriate tests for testing $H_0 : f = f_0$ versus $H_A : f \in U^c$, where $U$ is some neighborhood of $f_0$. When posterior consistency holds, it is often possible to quantify the rate at which the posterior contracts to the truth. This is called the posterior contraction rate. Hence, in the following chapters, we show not only that the posterior is consistent but also the posterior contraction rates in the KOH model setup using MCMC and Variational Bayes.

BIBLIOGRAPHY

Brynjarsdóttir, J. and O'Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30:114007.

Choi, W., Menberg, K., Kikumoto, H., Heo, Y., Choudhary, R., and Ooka, R. (2018). Bayesian inference of structural error in inverse models of thermal response tests. Applied Energy, 228:1473–1485.

Colosimo, B. M., Huang, Q., Dasgupta, T., and Tsung, F. (2018). Opportunities and challenges of quality engineering for additive manufacturing. Journal of Quality Technology, 50:233–252.

Doob, J. L. (1949). Application of the theory of martingales. In Le Calcul des Probabilites et ses Applications, 40:23–27.

Gatt, D., Yousif, C., Cellura, M., Camilleri, L., and Guarino, F. (2020). Assessment of building energy modelling studies to meet the requirements of the new energy performance of buildings directive. Renewable and Sustainable Energy Reviews, 127:1–16.

Honarmandi, P. and Arróyave, R. (2020). Uncertainty quantification and propagation in computational materials science and simulation assisted materials design. Integrating Materials and Manufacturing Innovation, 9:103–143.

Hüllen, G., and Sun-Hye Kim, J. Z., Sinha, A., Realff, M. J., and Boukouvala, F. (2020). Managing uncertainty in data-driven simulation-based optimization. Computers and Chemical Engineering, 136:1–16.

Kejzlar, V., Son, M., Bhattacharya, S., and Maiti, T. (2020).
A fast, scalable, and calibrated computer model emulator: An empirical bayes approach. Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464. King, G. B., Lovell, A. E., Neufcourt, L., and Nunes, F. M. (2019). Direct comparison between Bayesian and frequentist uncertainty quantification for nuclear reactions. Physical Review Letters, 122:232502. Li, Y. and Xiong, S. (2022). Physical parameter calibration. arXiv:2208.00124 [stat.ME]. Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285. Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. Cambridge, MA: MIT Press. 7 Schwartz, L. (1965). On bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4:10–26. Seede, R., Shoukr, D., Zhang, B., Whitt, A., Gibbons, S., Flater, P., Elwany, A., Arroyave, R., and Karaman, I. (2020). An ultra-high strength martensitic steel fabricated using selective laser melting additive manufacturing: Densification, microstructure, and mechanical properties. Acta Materialia, 186:199–214. Stein, M. L. (1999). Interpolation of spatial data: some theory for kriging. Springer Science & Business Media. Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 0(0):1–18. 8 CHAPTER 2 CALIBRATED COMPUTER MODEL USING MARKOV CHAIN MONTE CARLO In Section 1, we restate the original Kennedy O’Hagan (KOH) model framework Kennedy and O’Hagan (2001) as a hierarchical regression model with GP prior. In Section 2, we introduce the general consistency theorem and assumptions we need for the full Bayesian calibration model to be consistent. In Section 3 we check the assumptions of the consistency theorem in the context of the KOH model. Finally, we conclude the chapter with a discussion in section 4. 2.1 Bayesian Hierarchical Formulation of Kennedy and O’Hagan Model We present the KOH model described in Chapter 1 as a Bayesian hierarchical model with GP priors. This representation is adopted from Kejzlar et al. (2020) and is crucial for the theoretical results. It reframes the Bayesian model as a version of a non-parametric regression problem with a GP prior for 𝜁 (𝒕) and additive noise. To be specific, we consider the following GP prior structure: 𝜁 (𝒕)| 𝑓 (𝒕, 𝜽), 𝛿(𝒕) ∼ 𝑓 (𝒕, 𝜽) + 𝛿(𝒕), 𝛿(𝒕) ∼ GP𝛿 (π‘šπ›Ώ (𝒕), π‘˜ 𝛿 (𝒕, 𝒕′); πœ™π›Ώ), 𝑓 (𝒕, 𝜽) ∼ GP 𝑓 (π‘š 𝑓 (𝒕, 𝜽), π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)); πœ™ 𝑓 ). (2.1) where 𝝓𝛿 and 𝝓 𝑓 denote the hyperparameters associated with the Gaussian prior on 𝛿 and 𝑓 respectively. For example, 𝝓 𝑓 = (πœ‚ 𝑓 , 𝑙 𝑓 ) comprise of πœ‚ 𝑓 , the scale parameter of the covariance function and 𝑙 𝑓 , the length scale parameters of the covariance function, 𝑙 𝑓 . 
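Before deriving the joint and conditional distributions, it may help to see the hierarchical prior (2.1) operationally. The sketch below (in Python) draws one realization of $f(\boldsymbol{t}, \boldsymbol{\theta})$ and of $\delta(\boldsymbol{t})$ at a set of field inputs and adds them to obtain a draw of $\zeta$. The zero means, the product SE kernels, and every numerical value (lengthscales, scales, the fixed $\boldsymbol{\theta}$) are assumptions chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def se_kernel(A, B, eta, lengthscales):
    """Product squared exponential kernel between rows of A and B."""
    d2 = np.zeros((A.shape[0], B.shape[0]))
    for k, l in enumerate(lengthscales):
        d2 += ((A[:, [k]] - B[:, [k]].T) / l) ** 2
    return eta * np.exp(-0.5 * d2)

# Field design points t_i in [0, 1]^p and a single fixed calibration value theta
p, n = 1, 20
t = rng.uniform(size=(n, p))
theta = np.full((n, 1), 0.5)                    # same theta attached to every field input

# Prior draw of f(t, theta): GP over the joint (t, theta) input, zero mean
X_f = np.hstack([t, theta])
K_f = se_kernel(X_f, X_f, eta=1.0, lengthscales=[0.3, 0.3]) + 1e-8 * np.eye(n)
f_draw = np.linalg.cholesky(K_f) @ rng.normal(size=n)

# Prior draw of the discrepancy delta(t): GP over t only, zero mean
K_d = se_kernel(t, t, eta=0.1, lengthscales=[0.2]) + 1e-8 * np.eye(n)
delta_draw = np.linalg.cholesky(K_d) @ rng.normal(size=n)

# One realization of the physical process under the hierarchical prior (2.1)
zeta_draw = f_draw + delta_draw
print(zeta_draw[:5])
```

Conditioning such draws on the simulator outputs $\boldsymbol{z}$ is what produces the mean and covariance in (2.7) and (2.8) below.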
We first consider a prior distribution for the scale parameter 𝜎2 as 𝜎2 ∼ Ξ (𝜎2) With 𝝓 = (𝝓𝛿, 𝝓 𝑓 ) ∈ 𝐴𝝓, we assume the following prior distribution of 𝝓 𝝓 ∼ Ξ (𝝓) Finally, we consider the following prior distribution for the calibration parameter 𝜽 ∈ 𝚯 𝜽 ∼ Ξ (𝜽) 9 (2.2) (2.3) (2.4) Based on (1.3) and (2.1) the joint distribution 𝑝(𝜁, 𝒛|𝜽, 𝝓) is a multivariate normal distribution with mean and covariance π‘€πœ,𝒛|𝜽,𝝓 = (cid:169) (cid:173) (cid:173) (cid:171) 𝑀 𝑓 (𝑇𝑦 (𝜽)) + 𝑀𝛿 (𝑇𝑦) 𝑀 𝑓 (𝑇𝑧 ((cid:101)𝜽)) , (cid:170) (cid:174) (cid:174) (cid:172) 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑦 (𝜽)) + 𝐾𝛿 (𝑇𝑦, 𝑇𝑦) 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑧 ((cid:101)𝜽)) 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽)) 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑦 (𝜽)) (cid:170) (cid:174) (cid:174) (cid:172) 𝐾𝜁,𝒛|𝜽,𝝓 = (cid:169) (cid:173) (cid:173) (cid:171) (2.5) (2.6) Here, 𝑀 𝑓 (𝑇𝑦 (𝜽)) is a column vector with 𝑗 th element π‘š 𝑓 (𝒕 𝑗 , 𝜽), 𝑀𝛿 (𝑇𝑦) is a column vector with 𝑗 th element π‘šπ›Ώ (𝒕 𝑗 ), and 𝑀 𝑓 (𝑇𝑧 ((cid:101)𝜽)) is a column vector with 𝑗 th element π‘š 𝑓 ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 ). 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑦 (𝜽)) is the matrix with (𝑖, 𝑗) element π‘˜ 𝑓 ((𝒕𝑖, 𝜽), (𝒕 𝑗 , 𝜽)), 𝐾𝛿 (𝑇𝑦, 𝑇𝑦) is the matrix with (𝑖, 𝑗) element π‘˜ 𝛿 (𝒕𝑖, 𝒕 𝑗 ), and 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽)) is the matrix with (𝑖, 𝑗) element π‘˜ 𝑓 (((cid:101)𝒕𝑖, (cid:101)πœ½π‘–), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )). We can define the matrix 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑧 ((cid:101)𝜽)) similarly with the kernel π‘˜ 𝑓 . To separate out the scale parameters, let’s define new notations for the covariance matrix as follows: 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑦 (𝜽)) = πœ‚ 𝑓 ˜Σ11, 𝐾𝛿 (𝑇𝑦, 𝑇𝑦) = πœ‚π›Ώ ˜Σ11β€², 𝐾 𝑓 (𝑇𝑦 (𝜽), 𝑇𝑧 ((cid:101)𝜽)) = πœ‚ 𝑓 ˜Σ12, and 𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽)) = πœ‚ 𝑓 ˜Σ22. Consequently, the conditional distribution 𝑝(𝜁 |𝒛, 𝜽, 𝝓) is also a multivariate normal with the mean and covariance functions: π‘šπœ |𝒛,𝜽,𝝓 = π‘š 𝑓 (𝒕, 𝜽) + π‘šπ›Ώ (𝒕) + 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) ( 1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , π‘™πœƒ)) [𝑧𝑖 βˆ’ π‘š 𝑓 ((cid:101)𝒕𝑖, (cid:101)πœ½π‘–))] π‘˜ 𝜁 |𝒛,𝜽,𝝓 = π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) + π‘˜ 𝛿 (𝒕, 𝒕′) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) βˆ’ (cid:18) 1 πœ‚ 𝑓 (cid:19) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) 𝑗=1 where (1/πœ‚ 𝑓 ) ΛœΞ£βˆ’1 𝑖=1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) is the ( 𝑗, 𝑖) element of [𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽))]βˆ’1 which depends on 𝑙 𝑓 , 𝑙𝛿. To make our argument simple and efficient, without loss of generality, we are going to assume zero (2.7) (2.8) mean for π‘“π‘š and 𝛿. In what follows, we are interested in the posterior density of the parameter 𝑆 = (𝜁, 𝜎) (cid:35) 𝑑Π(𝑆|𝒛) 𝑑𝑆 𝑝( π’š|𝑆) 𝑝(𝑆|𝒛) 𝑝( π’š, 𝒛) 𝑝(𝑆, π’š, 𝒛) 𝑝(π’š, 𝒛) 1 𝑝(π’š, 𝒛) Ξ (𝑆| π’š, 𝒛) = 𝑝(𝑦𝑖 |πœπ‘–, 𝜎) (cid:34) 𝑛 (cid:214) = = 𝑖 (2.9) 10 where the marginal density 𝑝( π’š, 𝒛) is given by 𝑝( π’š, 𝒛) = ∫ (cid:2)(cid:206)𝑛 𝑖 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)(cid:3) 𝑑Π(𝑆|𝒛) and the prior density 𝑑Π(𝑆|𝒛)/𝑑𝑆 is given by ∫ ∫ Θ (cid:32)∫ 𝐴𝝓 ∫ 𝐴𝝓 Θ (cid:32)∫ ∫ 𝑑Π(𝑆|𝒛) 𝑑𝑆 = = = = 𝑝(𝑆, 𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ = ∫ ∫ Θ 𝐴𝝓 𝑝(𝑆|𝒛, 𝜽, 𝝓) 𝑝(𝝓, 𝜽 |𝒛) π‘‘π“π‘‘πœ½ (cid:33) 𝑝(𝜁 |𝒛, 𝜽, 𝝓) 𝑝(𝝓|𝒛) 𝑝(𝜽) π‘‘π“π‘‘πœ½ Γ— 𝑝(𝜎) 𝑝(𝜁 |𝒛, 𝜽, 𝝓) 𝑝(𝝓 𝑓 |𝒛) 𝑝(𝝓𝛿) 𝑝(𝜽)𝑑𝝓 𝑓 π‘‘π“π›Ώπ‘‘πœ½ Γ— 𝑝(𝜎) (cid:33) ∫ 𝐴𝝓 𝑓 Γ— 𝑑Π2(𝜎) π‘‘πœŽ Θ 𝐴𝝓 𝛿 𝑑Π1(𝜁 |𝒛) π‘‘πœ 2.2 Preliminaries For the purposes of this section, we shall assume that 𝜎2 = 1 and Ξ 1 = Π𝜁 . The results can be easily extended to unknown 𝜎2 with minor modifications. 
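Before turning to the consistency theory, the following minimal sketch shows numerically what the conditioning step behind (2.7) and (2.8) looks like: given simulator runs $\boldsymbol{z}$ at design inputs $(\tilde{\boldsymbol{t}}, \tilde{\boldsymbol{\theta}})$, the conditional mean and covariance of the emulator $f(\cdot, \boldsymbol{\theta})$ follow the usual GP regression formulas. Zero prior means are used, the discrepancy contribution is omitted, and the toy simulator, the single SE kernel, and all numerical values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

def se_kernel(A, B, eta=1.0, l=0.3):
    """Isotropic squared exponential kernel between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return eta * np.exp(-0.5 * d2 / l ** 2)

# Simulator design (t_tilde, theta_tilde) and outputs z from a toy simulator
s = 40
X_sim = rng.uniform(size=(s, 2))                       # columns: (t, theta)
z = np.sin(4 * X_sim[:, 0]) + 0.5 * X_sim[:, 1]        # stand-in for f(t, theta)

# New field inputs t paired with a candidate calibration value theta
t_new = np.linspace(0, 1, 5)[:, None]
theta = 0.7
X_new = np.hstack([t_new, np.full_like(t_new, theta)])

# GP conditioning analogous to (2.7)-(2.8), with zero prior means and no discrepancy term
K_ss = se_kernel(X_sim, X_sim) + 1e-8 * np.eye(s)
K_ns = se_kernel(X_new, X_sim)
alpha = np.linalg.solve(K_ss, z)
cond_mean = K_ns @ alpha                               # analogue of the conditional mean in (2.7)
cond_cov = se_kernel(X_new, X_new) - K_ns @ np.linalg.solve(K_ss, K_ns.T)
print(np.round(cond_mean, 3))
print(np.round(np.diag(cond_cov), 4))
```

In the full model, the discrepancy kernel $k_\delta$ is added to the conditional covariance and the means $m_f$, $m_\delta$ enter the conditional mean, exactly as written in (2.7) and (2.8).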
The basic idea of posterior consistency is that the posterior probability of an arbitrary neighborhood around the true parameter goes to 1 as the sample size 𝑛 goes to infinity. Consider the following neighborhood, (cid:40) (cid:41) Uπœ–π‘› = 𝜁 : |𝜁 (𝑑𝑖) βˆ’ 𝜁0(𝑑𝑖)| ≀ πœ–π‘› (2.10) 1 𝑛 𝑛 βˆ‘οΈ 𝑖=1 Then in the KOH model setup we described in the previous section, posterior consistency means that as 𝑛 goes to ∞, Π𝜁 (cid:8)𝜁 ∈ Uπœ–π‘› |π’š, 𝒛(cid:9) β†’ 0 π‘Ž.𝑠. [π‘ƒπœ0] where the posterior distribution is given by (2.9). There are many pieces of literature regarding posterior consistency. An early result of posterior consistency is by Doob (1949), which shows that the posterior distribution is consistent almost everywhere on prior support. Many Bayesians were satisfied with the result at the time, but it was not useful to check the consistency at a particular density. Then the breakthrough was made by Schwartz (1965), who proved a theorem for posterior consistency of parameters from i.i.d random variables. The conditions for this theorem are the prior positivity of neighborhoods of the true parameters and the existence of uniformly consistent test functions. This introduction 11 of test function conditions made Schwartz’s theorem a powerful tool for checking the posterior consistency in many applications. Please refer to Barron (1999) for further insight into Schwartz’s theorem. Choudhuri et al. (2004) extended this result to a triangular array of independent but non- identically distributed observations in terms of convergence in probability. Choi and Schervish (2007) strengthened this result to almost sure convergence. The proof of KOH model consistency is based on this extension of the Schwartz theorem, and note that the priors are updated with computer model evaluations, which makes the proof complicated. 2.2.1 Posterior Consistency Theorem Theorem 1. Let {𝑦𝑖}∞ 𝑖=1 be independently and normally distributed with mean 𝜁 (𝒕𝑖) and standard deviation 𝜎 > 0 with respect to a common 𝜎-finite measure. Let 𝜁 ∈ F , where F denotes the space of continuously differentiable functions on 𝛀 = [0, 1] 𝑝. Let 𝜁 ∈ F denote the parameter of interest. Let the prior of 𝜁 depending on a known set of inputs 𝒛 = (𝑧1, Β· Β· Β· , 𝑧𝑠) be given by the product measure π‘‘Ξ πœ (𝜁 |𝒛)/π‘‘πœ. For 𝜁0 ∈ F , π‘ƒπœ0 denote the joint conditional distribution of {𝑦𝑖}∞ 𝑖=1 given the true 𝜁0. Let {π‘ˆπ‘›}∞ 𝑛=1 be a sequence of subsets of F . Now under the following two main assumptions, Assumption 2.2.1. Existence of tests. Suppose there exist test functions {Φ𝑛}∞ 𝑛=1, a sequence of sets {F𝑛}∞ 𝑛=1 and constants 𝐢1, 𝐢2, 𝑐1, 𝑐2 > 0 such that 1. (cid:205)∞ 𝑛=1 𝐸𝜁0Φ𝑛 < ∞ 2. sup𝜁 βˆˆπ‘ˆπ‘ π‘›βˆ©F𝑛 𝐸𝑆 (1 βˆ’ Φ𝑛) ≀ 𝐢1 exp(βˆ’π‘1𝑛) Assumption 2.2.2. Prior positivity of neighborhoods. The marginal prior Π𝜁 (𝜁) satisfies 1. Π𝜁 (F 𝑐 𝑛 ) ≀ exp(βˆ’πΆπ‘›πœ– 2 𝑛) 2. Π𝜁 (βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ < πœ–π‘›) β‰₯ exp(βˆ’πΆπ‘›πœ– 2 𝑛) The posterior distribution as in (2.9) satisfies Ξ {𝜁 ∈ π‘ˆπ‘ 𝑛 |π’š, 𝒛} β†’ 0 π‘Ž.𝑠. 
[π‘ƒπœ0] (2.11) 12 We now simplify the posterior distribution in (2.9) as follows Ξ (𝜁 ∈ π‘ˆπ‘ 𝑛 | π’š, 𝒛) = (cid:2)(cid:206)𝑛 ∫ π‘ˆπ‘ 𝑛 ∫ (cid:2)(cid:206)𝑛 𝑖 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)(cid:3) 𝑑Π(𝜁 |𝒛) 𝑖 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)(cid:3) 𝑑Π(𝜁 |𝒛) (1 βˆ’ Φ𝑛) ∫ (cid:2)(cid:206)𝑛 ≀ Φ𝑛 + π‘›βˆ©F𝑛 π‘ˆπ‘ ∫ (cid:2)(cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) ∫ π‘ˆπ‘ + π‘›βˆ©F 𝑐 𝑛 ∫ (cid:2)(cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) (cid:2)(cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))(cid:3) 𝑑Π(𝜁 |𝒛) = Φ𝑛 + I(1,𝑛) (π’š) I(3,𝑛) ( π’š) + I(2,𝑛) ( π’š) I(3,𝑛) ( π’š) (2.12) Assumption 2.2.1. 1 and Assumption 2.2.1. 2 requires uniformly consistent sequence of tests for testing 𝐻0 : 𝜁 = 𝜁0 versus 𝐻1 : 𝜁 ∈ π‘ˆπ‘ I(1,𝑛) (π’š) in (2.12) goes to 0 as 𝑛 β†’ ∞. The construction of these test functions for GPs has been 𝑛. These conditions are used to show that the numerator, studied in Choi and Schervish (2007) (see Section 3.5.3) and uses Assumption 2.2.3 on the design points. Note, Assumption 2.2.2. 1 assumes that the prior gives negligible probability outside the set F𝑛, shall be used to show that the numerator, I(2,𝑛) (π’š) in (2.12) goes to 0 as 𝑛 β†’ ∞. Finally, Assumption 2.2.2. 2 states that the marginal prior Π𝜁 gives enough mass on the set 𝐡𝑛 = {𝜁 : βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ ≀ πœ–π‘›} where 𝐡𝑛 denotes a set around the true process 𝜁0, and Assumption 2.2.2. 2 give a positive prior on the neighborhood of the true parameter 𝜁0, and it allows us to show that the denominator, I(3,𝑛) ( π’š) in (2.12) goes to ∞. Assumption 2.2.2. 2 is a problem well-known in the GP community as small ball probability, and it has been studied by many in the probability literature (see Ghosal and Roy (2006); Li and Shao (2001); Tokdar and Ghosh (2007); Vaart and Zanten (2008)). The small ball probability in GP prior is closely related to posterior contraction through the concentration function πœ™πœ0 (πœ–π‘›) = inf β„ŽβˆˆH:βˆ₯β„Žβˆ’πœ0 βˆ₯<πœ–π‘› βˆ₯β„Žβˆ₯2 H βˆ’ log Ξ (βˆ₯𝜁 βˆ₯ < πœ–π‘›) where, H is the Reproducing Kernel Hilbert Space (RKHS) associated to the prior, and βˆ₯Β·βˆ₯H is the corresponding RKHS norm. The first infimum part of the concentration function denotes the decrease in prior mass when the ball is shifted from the origin to 𝜁0. The second part represents 13 prior mass in a ball of radius πœ–π‘› around zero function (see Ghosal and van der Vaart (2017), Vaart and Zanten (2008), and van der Vaart and van Zanten (2011) for further description of this function). By the Theorem 2.1 of Vaart and Zanten (2008), Ξ (βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ < πœ–π‘›) β‰₯ exp(βˆ’πœ™πœ0 (πœ–π‘›)/2). In the Supplement, we show that πœ™πœ0 (πœ–π‘›)/2 ≀ π‘›πœ– 2 𝑛 for the Squared Exponential covariance and MatΓ©rn covariance class. 2.2.2 Assumptions for the KOH model In order to apply consistency Theorem 1 to KOH model, we need assumptions on the smoothness of the Gaussian process prior 𝜁 |𝒛, 𝜽, 𝝓 (see equations (2.7) and (2.8)) together with a set of assumptions on the priors for 𝝓 and 𝜽 (see equations (1.1), (2.3) and (2.4)). Additionally, we need an assumption on the rate of design points filling out the interval [0, 1] 𝑝. 2.2.2.1 Assumptions for design points under KOH Assumption 2.2.3. Non random design points. For each hypercube 𝐻 in [0, 1] 𝑝, let πœ†(𝐻) be its Lebesgue measure. 
Suppose that there exists a constant 0 < 𝐾𝑑 < 1 such that whenever πœ†(𝐻) β‰₯ 1/(𝑛𝐾𝑑), 𝐻 contains at least one design point. Note that equally spaced design satisfies Assumption 2.2.3. Suppose 𝒕𝑖 = (𝑑𝑖,1, Β· Β· Β· , 𝑑𝑖,𝑝), 𝑖 = 1, Β· Β· Β· , 𝑛 are spaced such that |𝑑𝑖+1,π‘˜ βˆ’ 𝑑𝑖,π‘˜ | = π‘›βˆ’1/𝑑, 0 = 𝑑0,π‘˜ < 𝑑1,π‘˜ ≀ Β· Β· Β· ≀ 𝑑𝑛,π‘˜ < 𝑑𝑛+1,π‘˜ = 1. Then each hypercube 𝐻 with a Lebesgue measure of at 𝑖, 𝑗 = 1, Β· Β· Β· , 𝑛, π‘˜ = 1, Β· Β· Β· , 𝑝 and least 1/𝑛 contains at least one design point. By the same logic, space-filling designs such as Latin Hypercube Sampling with maxmin distance Morris and Mitchell (1995), which is commonly used in the computer model fields, also satisfy this assumption. Please refer to Santner et al. (2003) for further information regarding designs in computer experiments. 2.2.2.2 Assumptions on the prior for KOH The prior assumption comprises two parts. One is for the smoothness of sample paths of Gaussian Processes 𝑓 and 𝛿. (Assumption 2.2.4. 1 and Assumption 2.2.4. 2). The other is for technical assumptions on priors of hyperparameters and calibration parameters ( Assumption 2.2.5 1, and Assumption 2.2.5. 2, Assumption 2.2.5. 3). 14 Assumption 2.2.4. Gaussian process. 1. For the Gaussian processes GP 𝑓 and GP𝛿, assume zero mean π‘š 𝑓 = π‘šπ›Ώ = 0. In addition, , Β· Β· Β· , 𝑙 𝑓 𝑝+π‘ž ) ∈ R𝑝+π‘ž, , Β· Β· Β· , 𝑙𝛿 𝑝 ) ∈ R𝑝, inputs 𝒕 = (𝑑1, 𝑑2, Β· Β· Β· , 𝑑 𝑝) ∈ R𝑝, 𝒕′ = (𝑑′1, 𝑑′2, Β· Β· Β· , 𝑑′𝑝) ∈ R𝑝, and assume the scale parameters πœ‚ 𝑓 , πœ‚π›Ώ > 0, length scale parameters 𝑙 𝑓 = (𝑙 𝑓1 𝑙𝛿 = (𝑙𝛿1 calibration parameters 𝜽 = (πœƒ1, Β· Β· Β· , πœƒ 𝑝′), (cid:101)𝜽 = ((cid:101)πœƒ1, Β· Β· Β· , (cid:101)πœƒ 𝑝′) ∈ 𝚯, let the covariance functions β€²) be a product of isotropic and integrable covariance functions, π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) and π‘˜ 𝛿 (𝒕, 𝒕 one for each dimension. π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) = πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝; 𝑙 𝑓 𝑝 )𝑅(πœƒ1, (cid:101)πœƒ1; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, (cid:101)πœƒ 𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) π‘˜ 𝛿 (𝒕, 𝒕′) = πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿2) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝; 𝑙𝛿 𝑝 ) where each 𝑅(𝑑𝑖, 𝑑′𝑖; 𝑙𝑖) = 𝑅 (cid:17) (cid:16) |π‘‘π‘–βˆ’π‘‘β€²π‘– | 𝑙𝑖 , and 𝑅 is the positive multiple of density. 2. For the Gaussian Process GP 𝑓 and GP𝛿, the mean functions π‘š 𝑓 , π‘šπ›Ώ are continuously differ- entiable with respect to input 𝒕 and covariance functions π‘˜ 𝑓 and π‘˜ 𝛿 have partial derivatives up to order 2𝑝 + 2. Assumption 2.2.5. Hyperparameters, calibration parameter and observational error scale pa- rameter. 1. πœ‚ 𝑓 and πœ‚π›Ώ have prior distributions with support R+, and there exists 0 < β„Ž < 1 2 and 𝑏1, 𝑏2, 𝑏3, 𝑏4 > 0 such that P[πœ‚ 𝑓 > π‘›β„Ž] ≀ 𝑏1 exp(βˆ’π‘2𝑛), and P[πœ‚π›Ώ > π‘›β„Ž] ≀ 𝑏3 exp(βˆ’π‘4𝑛), βˆ€π‘› β‰₯ 1. 2. 𝑙 𝑓 and 𝑙𝛿 have prior distributions with bounded support on R+. That is for the bounds, 𝐡 𝑓 , 𝐡𝛿 > 0, set P[𝑙 𝑓𝑖 > 𝐡 𝑓 ] = 0, 𝑓 π‘œπ‘Ÿ 𝑖 = 1, Β· Β· Β· , 𝑝 + 𝑝′, and P[𝑙𝛿 𝑗 > 𝐡𝛿] = 0, 𝑓 π‘œπ‘Ÿ 𝑗 = 1, Β· Β· Β· , 𝑝 3. 𝜽 has a continuous density on the support 𝚯 such that ∫ 𝚯 𝑝(𝜽)π‘‘πœ½ = 1. Note that in Assumption 2.2.4. 1, the product of the isotropic covariance functions is also an isotropic covariance function. The functional form of 𝑅(Β·) in the covariance kernel in Assumption 2.2.4. 1 affects the proof of showing the Assumption 2.2.2. 1. Choi (2005) avoids functional form by considering the continuity of supremum of variance terms (the same assumption was also used 15 in Lemma 1 in Kejzlar et al. (2020)). 
In this paper, we are going to prove Assumption 2.2.2. 1 for general case, which encompasses the widely used covariance functions such as Squared Exponential covariance. Assumption 2.2.4. 1 and Assumption 2.2.4. 2 together with Assumption 2.2.5. 1 and Assumption 2.2.5. 2 are sufficient conditions to ensure that the support of GP prior for 𝜁 |𝒛, 𝜽, 𝝓 contains every continuously differentiable function based on the arguments by Theorem 4 of Ghosal and Roy (2006) and Theorem 4.5 of Tokdar and Ghosh (2007). Particularly, Assumption 2.2.4. 2 implies that the same condition holds for the conditional process 𝜁 |𝒛, 𝜽, 𝝓, which is a function of 𝑓 and 𝛿. This enables us to apply Theorem 4 of Ghosal and Roy (2006) and Theorem 4.5 of Tokdar and Ghosh (2007) in KOH model setup. Many covariance functions satisfy this assumption, including the product of squared exponential covariance functions. For detailed descriptions of the analytic properties of covariance functions, please check Cramer and Leadbetter (1967), Adler (1981), and Adler (1990). Furthermore, Assumption 2.2.4. 1 and Assumption 2.2.4. 2 guarantees the existence of continuous sample derivatives with probability 1, p.171 and p.185 of Cramer and Leadbetter (1967). In addition, by the Theorem 5 of Ghosal and Roy (2006), these Assumption 2.2.4. 1 and Assumption 2.2.4. 2 guarantee that the GP prior 𝜁 |𝒛, 𝜽, 𝝓 gives non zero probability to the set of continuous functions with bounded derivatives. Assumption 2.2.5. 1 relaxes the assumption of compact support for all hyperparameters as used in the empirical Bayes approach Kejzlar et al. (2020) of KOH model. We relaxed that assumption in such a way that scale parameters for GP covariances, 𝑛 𝑓 and 𝑛𝛿, have an exponentially small probability of blowing up. Assumption 2.2.5. 2 indicates that the length scale parameters have bounded support. In the algorithm, we used log transformation to have the support of length scale parameters to be R+. However, we can always use different transformation. For example, sigmoid transformation could be used to have a support on [0, 1]. Hence, without loss of generality let’s assume they are bounded for theoretical investigation. Then, this condition allows us to upper bound (cid:12) the quantity (cid:205)𝑠 (cid:12) by a constant ¯𝐢 in a bounded support. This assumption is (cid:12) a technical assumption to ensure that the conditional prior Π𝜁 (Β·|𝒛) assigns exponentially small 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) ΛœΞ£βˆ’1 (cid:205)𝑠 𝑗=1 𝑖=1 (cid:12) (cid:12) (cid:12) probability to F 𝐢 𝑛 . Assumption 2.2.5. 3 assumes continuous density on the support of calibration 16 parameters. In practice, the support of calibration parameters is determined by domain experts. 2.3 Posterior Consistency in Kenndy and O’Hagan Model We have already presented the KOH model in Section 2.1. We shall first assume that the design points 𝒕 = (𝑑1, Β· Β· Β· , 𝑑𝑛) on which the physical process 𝜁 is defined (see (1.1)) follows the non random design as discussed under Assumption 2.2.3. As described in Section 2.1, we have considered a hierarchical Gaussian prior on the physical process 𝜁 as in (2.1). With 𝒛 defined in (1.3), we obtain an induced conditional multivariate Gaussian prior on the physical process 𝜁 whose mean and covariance kernels are given by (2.7) and (2.8) respectively. We first assume that this conditional prior 𝜁 |𝒛, 𝜽, 𝝓 satisfies all the conditions under Assumption 2.2.4. We also assume that the prior on hyperparameter 𝝓 = (𝑛 𝑓 , 𝑛𝛿, 𝑙 𝑓 , 𝑙𝛿) satisfies Assumption 2.2.5. 
1 and Assumption 2.2.5. 2. Under these assumptions, we next present the main result on the joint consistency of the physical process 𝜁 and the scale parameter 𝜎 under the KOH model. 2.3.1 Posterior Consistency Theorem of KOH Model Theorem 2. Let {𝑦𝑖}∞ 𝑖=1 be independently and normally distributed with mean 𝜁 (𝒕𝑖) and standard deviation 𝜎 > 0 with respect to a common 𝜎-finite measure. Let 𝜁 denote the parameter of interest. Let the design points 𝒕𝑖 ∈ [0, 1] 𝑝 satisfy Assumption 2.2.3. With reference to Section 2.1, let 𝒛 = (𝑧1, Β· Β· Β· , 𝑧𝑠) be as defined in (1.3). We assume that the conditional prior 𝜁 |𝒛, 𝜽, 𝝓 satisfies Assumption 2.2.4. 1, Assumption 2.2.4. 2. We also assume that the prior on 𝝓 satisfies Assumption 2.2.5. 1 and 2.2.5. 2. In addition, assume the calibration parameter satisfies 2.2.5. 3. Let π‘ƒπœ0 denote the joint conditional distribution of {𝑦𝑖}∞ 𝑖=1 given the true 𝜁0. Let the function 𝜁0 be continuously differentiable, then for a sequence πœ–π‘› that goes to 0 and for the neighborhood Uπœ–π‘› of 𝜁 as defined in equation (2.10), Π𝜁 {𝜁 ∈ U𝑐 πœ–π‘› | π’š, 𝒛} β†’ 0 π‘Ž.𝑠. [π‘ƒπœ0] (2.13) where Π𝜁 {|π’š, 𝒛} denotes the posterior distribution of 𝜁 given the data π’š = (𝑦1, Β· Β· Β· , 𝑦𝑛) and 𝒛 = (𝑧1, Β· Β· Β· , 𝑧𝑠). The proof of Theorem 2 is based on the Theorem 1. In order to prove Theorem 2, we shall show 17 that Assumption 2.2.2 and Assumption 2.2.1 hold under the KOH model. We present the proofs of Assumption 2.2.2 and Assumption 2.2.1 under the KOH model in the Appendix. 2.4 Conclusion and Discussion The Kennedy O’Hagan model assumes both physical data and computer model evaluation data are available and make use of the Gaussian Process to calibrate the parameters and to predict physical process at new input space. Although the identifiability issue of πœƒ is still a concern in the KOH model, it has been widely used in diverse fields. At the same time, researchers tried to provide a theoretical justification for this model. For example, you can look at Tuo and Wu (2015, 2016) and Plumlee (2017), among others. However, the theoretical properties of the fully Bayesian approach of the KOH model have not been investigated much. The most relevant paper is Xie and Xu (2020), in which they proposed a Bayesian projected calibration method following the frequentist 𝐿2-projected calibration method in Tuo and Wu (2015) and showed theoretical justification. An additional relevant study conducted by Gramacy and Apley (2015) employed a localized approach with parallelization techniques to approximate a large GP emulator. However, it is important to note that this particular work focused on GP computer emulation and did not operate within the KOH model framework. Hence the main contribution of this paper is the theoretical justification of using the original full Bayesian KOH model. Our approach is based on the extended use of Schwartz’s theorem in our model setup to prove the consistency of the fitted emulator. One issue we haven’t discussed in this paper is the identifiability of 𝜽. This is because it is not an issue for the prediction. However, this would be a great next research topic. As Kennedy and O’Hagan (2001) stated, the parameter is the one best explains the difference between the true physical model and the computer model. So it lacks physical meaning. But when it comes to applications, as discussed in Higdon et al. (2005) and Han et al. 
(2003), there are both cases where the parameter has little or no physical meaning (tuning parameter) and the cases where it does have a physical meaning (calibration parameter). This is an integral part for the domain scientists because the calibration parameters themselves are the interest to them sometimes, and it helps for 18 scientific understanding of the model and future scientific extension. Hence developing the new definition of 𝜽, which contains the concept of both approximating the true physical model and physical interpretation would be a big breakthrough in this field. 19 BIBLIOGRAPHY Abrahamsen, P. (1997). A review of gaussian random fields and correlation functions, volume 917. Norwegian Computing Center. Adler, R. J. (1981). The Geometry of Random Fields. Wiley. Adler, R. J. (1990). An introduction to continuity, extrema, and related topics for Gaussian process, volume 12. IMS Lecture notes-monograph series. Barron, A. R. (1999). Information-theoretic characterization of bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics, 6:27–52. Choi, T. (2005). Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. PhD thesis, Carnegie Mellon University. Choi, T. and Schervish, M. J. (2007). Posterior Consistency in Nonparametric Regression Problems under Gaussian Process Priors. Journal contribution. Choudhuri, N., Ghosal, S., and Roy, A. (2004). Bayesian estimation of the spectral density of a time series. Journal of the American Statistical Association, 99(468):1050 – 1059. Cramer, H. and Leadbetter, M. R. (1967). Stationary and related stochastic processes. Wiley. Doob, J. L. (1949). Application of the theory of martingales. in Le Calcul des Probabilites et ses Applications, 40:23–27. Ghosal, S. and Roy, A. (2006). Posterior consistency of gaussian process prior for nonparametric binary regression. Ann. Statist., 34(5):2413–2429. Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge University Press. Gramacy, R. B. and Apley, D. W. (2015). Local gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578. Han, G., Santner, T. J., and Rawlinson, J. J. (2003). Simultaneous determination of tuning and calibration parameters for computer experiments. Technometrics, 51(4):464 – 474. Higdon, D., Kennedy, M., Cavendish, J. C., Cafeo, J. A., and Ryne, R. D. (2005). Combining field data and computer simulations for calibration and prediction. SIAM Journal on Scientific Computing, 26:448–466. Kejzlar, V., Son, M., Bhattacharya, S., and Maiti, T. (2020). A fast, scalable, and calibrated computer model emulator: An empirical bayes approach. 20 Kennedy, M. C. and O’Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464. Li, W. V. and Shao, Q.-M. (2001). Gaussian processes: Inequalities, small ball probabilities and applications. Stochastic Processes: Theory and Methods. Handbook of Statistics, 19:533–598. Morris, M. D. and Mitchell, T. J. (1995). A comparison of three methods for selecting values of input variables in the analysis of output from a computer code. Journal of Statistical Planning and Inference, 43:381–402. Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285. Santner, T. J., Williams, B. J., Notz, W. 
I., and Williams, B. J. (2003). The design and analysis of computer experiments, volume 1. Springer. Schwartz, L. (1965). On bayes procedures. Z. Wahrscheinlichkeitstheorie und Verw. Gebiete, 4:10–26. Tokdar, S. T. and Ghosh, J. K. (2007). Posterior consistency of logistic gaussian process priors in density estimation. Journal of Statistical Planning and Inference, 137(1):34 – 42. Tuo, R. and Wu, C. F. J. (2015). Efficient calibration for imperfect computer models. The Annals of Statistics, 43:2331–2352. Tuo, R. and Wu, C. F. J. (2016). A theoretical framework for calibration in computer models: parametrization, estimation and convergence properties. SIAM/ASA Journal on Uncertainty Quantification, 4:767–795. Vaart, A. W. V. D. and Zanten, J. H. V. (2008). Rates of contraction of posterior distributions based on gaussian process priors. The Annals of Statistics, 36:1435–1463. van der Vaart, A. W. and van Zanten, J. H. (2011). Information rates of nonparametric gaussian process methods. Journal of Machine Learning Research, pages 2095–2119. Xie, F., Jin, W., and Xu, Y. (2019). Rates of contraction with respect to 𝐿2-distance for Bayesian nonparametric regression. Electronic Journal of Statistics, 13(2):3485 – 3512. Xie, F. and Xu, Y. (2020). Bayesian projected calibration of computer models. Journal of the American Statistical Association, 0(0):1–18. 21 APPENDIX A PROOF OF THEOREM 1 Note 𝒛 is implicitly conditioned on the prior. If we decompose the following term. Π𝜁 (U𝑐 πœ–π‘› | π’š, 𝒛) = = = = ∫ 𝐴𝝓 ∫ Θ ∫ ∫ 𝐴𝝓 Θ 𝑝(π’š|𝜁) 𝑝(π’š|𝜁) ∫ ∫ Θ ∫ U𝑐 𝑛 ∫ F ∫ U𝑐 𝑛 ∫ F ∫ U𝑐 𝑛 ∫ F Ξ (U𝑐 𝑛 , 𝜽, 𝝓|π’š, 𝒛)π‘‘π“π‘‘πœ½ 𝑝(𝜁, π’š, 𝒛, 𝜽, 𝝓)π‘‘π“π‘‘πœ½ π‘‘πœ 𝐴𝝓 ∫ 𝑝(𝜁, π’š, 𝒛, 𝜽, 𝝓)π‘‘π“π‘‘πœ½ π‘‘πœ (cid:16) ∫ Θ ∫ 𝑝(𝜁, 𝜽, 𝝓, 𝒛)π‘‘π“π‘‘πœ½ (cid:17) 𝑝(𝜁, 𝜽, 𝝓, 𝒛)π‘‘π“π‘‘πœ½ 𝐴𝝓 (cid:16) ∫ Θ 𝐴𝝓 (cid:17) π‘‘πœ π‘‘πœ 𝑝( π’š|𝜁)𝑑Π(𝜁 |𝒛) 𝑝(π’š|𝜁)𝑑Π(𝜁 |𝒛) (1 βˆ’ Φ𝑛) ∫ ≀ Φ𝑛 + + = Φ𝑛 + ∫ U𝑐 𝑛 ∩F 𝑐 𝑛 ∫ (cid:206)𝑛 F I(1,𝑛) (π’š) I(3,𝑛) ( π’š) ≀ exp(βˆ’πΆ1π‘›πœ€2 = exp(βˆ’πΆ1π‘›πœ€2 (cid:206)𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) U𝑐 (cid:206)𝑛 ∫ F (cid:206)𝑛 𝑛 ∩F𝑛 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎)/𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎))/( 𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) 𝑖=1( 𝑝(𝑦𝑖 |πœπ‘–, 𝜎))/( 𝑝(𝑦𝑖 |𝜁0,𝑖, 𝜎0))𝑑Π(𝜁 |𝒛) + I(2,𝑛) ( π’š) I(3,𝑛) ( π’š) exp(βˆ’πΆ2π‘›πœ€2 𝑛/2) exp(βˆ’(𝐢 + 𝐢4)π‘›πœ€2 𝑛) 𝑛) + exp(βˆ’(𝐢2/2 + 𝐢 βˆ’ 𝐢4)π‘›πœ€2 𝑛) + + exp(βˆ’πΆ3π‘›πœ€2 𝑛/2) exp(βˆ’(𝐢 + 𝐢4)π‘›πœ€2 𝑛) 𝑛) + exp(βˆ’(𝐢3/2 + 𝐢 βˆ’ 𝐢4)π‘›πœ€2 𝑛) ≀ exp(βˆ’πΆπ‘›πœ€2 𝑛) Where I1,𝑛 = (1 βˆ’ Φ𝑛) ∫ U𝑐 𝑛 ∩F𝑛 𝑛 (cid:214) 𝑖=1 ( 𝑝(𝑦𝑖 |πœπ‘–)/𝑝(𝑦𝑖 |𝜁0,𝑖))𝑑Π(𝜁 |𝒛) ( 𝑝(𝑦𝑖 |πœπ‘–))/( 𝑝(𝑦𝑖 |𝜁0,𝑖))𝑑Π(𝜁 |𝒛) ( 𝑝(𝑦𝑖 |πœπ‘–))/( 𝑝(𝑦𝑖 |𝜁0,𝑖))𝑑Π(𝜁 |𝒛) 𝑛 (cid:214) 𝑖=1 I2,𝑛 = I3,𝑛 = ∫ U𝑐 𝑛 ∩F 𝑐 𝑛 𝑛 (cid:214) ∫ F 𝑖=1 22 It suffices to show the following four results, Φ𝑛 ≀ exp(βˆ’πΆ1π‘›πœ€2 𝑛) I1,𝑛 ≀ exp(βˆ’πΆ2π‘›πœ€2 𝑛/2) I2,𝑛 ≀ exp(βˆ’πΆ3π‘›πœ€2 𝑛/2) I3,𝑛 ≀ exp(βˆ’(𝐢 + 𝐢4)π‘›πœ€2 𝑛) (A.1) (A.2) (A.3) (A.4) Mainly, (A.1) involves Markov inequality, Borel-cantelli lemma, and Assumption. 2.2.1. (i) in the main paper. Proofs of (A.2) and (A.3) follows from Theorem 2 of Choudhuri et al. (2004) and proof of (A.4) follows from Lemma 2.4 of Xie et al. (2019). All the proofs need a slight modification from the original papers since we have prior on 𝜁 updated with simulation data 𝒛. These modifications are discussed next. 
Proof of (A.1): Using Assumption 2.2.1 and Markov’s Inequality, =β‡’ P0 [ Φ𝑛 > exp(βˆ’πΆ1π‘›πœ€2 𝑛) ] ≀ E0 [ Φ𝑛 ] 1 exp(βˆ’πΆ1π‘›πœ€2 𝑛) ∞ βˆ‘οΈ =β‡’ ∞ βˆ‘οΈ 𝑛=1 P0 [ Φ𝑛 > exp(βˆ’πΆ1π‘›πœ€2 𝑛) ] ≀ 1 exp(βˆ’πΆ1π‘›πœ€2 𝑛) E0 [ Φ𝑛 ] ≀ 𝐢 ∞ βˆ‘οΈ 𝑛=1 E0 [ Φ𝑛 ] < ∞ 𝑛=1 =β‡’ P0 [ Φ𝑛 > exp(βˆ’πΆ1π‘›πœ€2 𝑛) 𝑖.π‘œ. ] = 0 =β‡’ P0 [ Φ𝑛 ≀ exp(βˆ’πΆ1π‘›πœ€2 𝑛)] β†’ 1 Proof of (A.2): For every non-negative function πœ“π‘› and any set 𝐴, by Fubini’s theorem, (cid:20) E𝜁0 πœ“π‘› ( π’š) ∫ 𝐴 𝑝( π’š|𝜁) 𝑝(π’š|𝜁0) (cid:21) 𝑑Π(𝜁 |𝒛) = = ∫ ∫ ∫ 𝐴 𝐴 πœ“π‘› ( π’š) 𝑝( π’š|𝜁) 𝑝(π’š|𝜁0) 𝑝( π’š|𝜁0)𝑑 ( π’š)𝑑Π(𝜁 |𝒛) E𝜁 [πœ“π‘› ( π’š)] 𝑑Π(𝜁 |𝒛) Now with 𝐴 = U𝑐 𝑛 ∩ F𝑛 and πœ“π‘› = Φ𝑛, we have, E𝜁0 (cid:2)I(1,𝑛) ( π’š)(cid:3) = ∫ U𝑐 𝑛 ∩F𝑛 E𝜁 [(1 βˆ’ Φ𝑛)]𝑑Π(𝜁 |𝒛) ≀ sup 𝜁 ∈U𝑐 𝑛 ∩F𝑛 E𝜁 [(1 βˆ’ Φ𝑛)] ≀ exp(βˆ’πΆ2π‘›πœ– 2 𝑛) 23 where the last inequality follows as a consequence of Assumption. 2.2.1. Thus, (cid:104) P𝜁0 I(1,𝑛) ( π’š) β‰₯ exp (cid:16) βˆ’πΆ2π‘›πœ– 2 𝑛/2 (cid:17)(cid:105) ≀ exp (cid:16) 𝐢2π‘›πœ– 2 𝑛/2 (cid:17) exp(βˆ’πΆ2π‘›πœ– 2 𝑛) = exp (cid:16) βˆ’πΆ2π‘›πœ– 2 𝑛/2 (cid:17) The proof follows as a consequence of Borel-Cantelli lemma. Proof of (A.3): Note that E𝜁0 [I(2,𝑛) ( π’š)] = E𝜁0 ∫ (cid:20)∫ π‘ˆπ‘ π‘›βˆ©F𝑛 𝑝( π’š|𝜁) 𝑝(π’š|𝜁0) (cid:21) 𝑑Π(𝜁 |𝒛) E𝜁 [1] 𝑑Π(𝜁 |𝒛) = U𝑐 𝑛 ∩F 𝑐 𝑛 ≀ 𝑑Π(F 𝑐 𝑛 |𝒛) ≀ exp(βˆ’πΆ3π‘›πœ– 2 𝑛) By Assumption 2.2.2 (i) in the main paper, the second inequality holds. The proof follows as a consequence of the Borel-Cantelli Lemma. Proof of (A.4): Denote Π𝜁 (Β·|𝐡𝑛, 𝒛) = Π𝜁 (Β· ∩ 𝐡𝑛|𝒛)/Π𝜁 (𝐡𝑛|𝒛) to be the renormalized restriction of Π𝜁 (Β·|𝒛) on 𝐡𝑛, where 𝐡𝑛 is defined as, 𝐡𝑛 = {𝜁 : βˆ₯𝜁 βˆ’ 𝜁0βˆ₯∞ < πœ–π‘›} Let, 𝑉𝑛,𝑖 = 𝜁0(𝒕𝑖) βˆ’ ∫ 𝜁 (𝒕𝑖)π‘‘Ξ πœ (𝜁 |𝐡𝑛), π‘Šπ‘›,𝑖 = ∫ 1 2 (𝜁 (𝒕𝑖) βˆ’ 𝜁0(𝒕𝑖))2π‘‘Ξ πœ (𝜁 |𝐡𝑛) H𝑛 = (cid:40)∫ 𝑛 (cid:214) 𝑖=1 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) π‘‘Ξ πœ (𝜁 |𝒛) > Π𝜁 (𝐡𝑛|𝒛) exp(βˆ’πΆ4π‘›πœ€2 𝑛) (cid:41) Note 𝑉 2 𝑛,𝑖 ≀ πœ€2 𝑛 and π‘Šπ‘›,𝑖 ≀ 1 2 πœ€2 𝑛, because ∫ 𝜁 (𝒙𝑖)π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) (cid:19) 2 (cid:18) 𝜁0(𝒙𝑖) βˆ’ ∫ 𝑉 2 𝑛,𝑖 = ≀ ≀ (𝜁0(𝒙𝑖) βˆ’ 𝜁 (𝒙𝑖))2 π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) ∫ βˆ₯𝜁0(𝒙𝑖) βˆ’ 𝜁 (𝒙𝑖)βˆ₯2 ∞ π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) ≀ πœ€2 𝑛 24 and because, Also note, ∫ 𝑛 (cid:214) 𝑖=1 = = = = ≀ Hence (𝜁 (𝒙𝑖) βˆ’ 𝜁0(𝒙𝑖))2 π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) βˆ₯𝜁 (𝒙𝑖) βˆ’ 𝜁0(𝒙𝑖) βˆ₯2 ∞ π‘‘Ξ πœ (𝜁 |𝐡𝑛, 𝒛) π‘Šπ‘›,𝑖 = ≀ ≀ ∫ ∫ πœ€2 𝑛 1 2 1 2 1 2 Π𝜁 (𝜁 |𝐡𝑛, 𝒛) 𝑝(𝑦𝑖 |𝜁0,𝑖) 𝑝(𝑦𝑖 |πœπ‘–) ∫ 𝑛 βˆ‘οΈ (cid:26) βˆ’ (cid:26) βˆ’ (cid:26) 1 2 𝑖=1 ∫ 𝑛 βˆ‘οΈ 𝑖=1 ∫ 𝑛 βˆ‘οΈ 𝑖=1 𝑛 βˆ‘οΈ π‘Šπ‘›,𝑖 + 𝑖=1 1 2 πœ€2 𝑛 + 𝑛 βˆ‘οΈ 𝑖=1 H 𝐢 𝑛 = βŠ‚ βŠ‚ = βŠ‚ = log (2πœ‹) βˆ’ 1 2 (𝑦𝑖 βˆ’ 𝜁0,𝑖)2 + 1 2 log (2πœ‹) + (cid:27) (𝑦𝑖 βˆ’ πœπ‘–)2 1 2 Π𝜁 (𝜁 |𝐡𝑛, 𝒛) (𝜁0,𝑖 + 𝑒𝑖 βˆ’ 𝜁0,𝑖)2 + (𝜁0,𝑖 + 𝑒𝑖 βˆ’ πœπ‘–)2 (cid:27) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) 1 2 1 2 1 2 (πœπ‘– βˆ’ 𝜁0,𝑖)2 + 𝑒𝑖 (πœπ‘– βˆ’ 𝜁0,𝑖) (cid:27) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) 𝑛 βˆ‘οΈ 𝑖=1 𝑒𝑖𝑉𝑛,𝑖 𝑒𝑖𝑉𝑛,𝑖 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) Π𝜁 (𝜁 |𝒛) ≀ Π𝜁 ( Λœπ΅π‘›|𝒛) exp(βˆ’πΆ4π‘›πœ€2 𝑛) (cid:41) Π𝜁 (𝜁 | Λœπ΅π‘›, 𝒛) ≀ exp(βˆ’πΆ4π‘›πœ€2 𝑛) (cid:41) log 𝑝(𝑦𝑖 |πœπ‘–) 𝑝(𝑦𝑖 |𝜁0,𝑖) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) ≀ βˆ’πΆ4π‘›πœ€2 𝑛 (cid:41) log 𝑛 βˆ‘οΈ 𝑖=1 𝑝(𝑦𝑖 |𝜁0,𝑖) 𝑝(𝑦𝑖 |πœπ‘–) Π𝜁 (𝜁 |𝐡𝑛, 𝒛) > 𝐢4π‘›πœ€2 𝑛 (cid:41) (cid:41) 𝑒𝑖𝑉𝑛,𝑖 > 𝐢4π‘›πœ€2 𝑛 (cid:41) (cid:40)∫ 𝑛 (cid:214) 𝑖=1 (cid:40)∫ 𝑛 (cid:214) 𝑖=1 (cid:40)∫ 𝑛 βˆ‘οΈ 𝑖=1 (cid:40)∫ 𝑛 βˆ‘οΈ 𝑖=1 πœ€2 𝑛 + (cid:40) 1 2 (cid:40) 𝑛 βˆ‘οΈ 𝑖=1 𝑒𝑖𝑉𝑛,𝑖 > 𝐢5π‘›πœ€2 𝑛 25 Now we can use the Chernoff bound for the Gaussian random variables, P0 (cid:40) 𝑛 βˆ‘οΈ 𝑖=1 (cid:41) (cid:32) 𝑒𝑖𝑉𝑛,𝑖 > 𝐢5π‘›πœ– 2 𝑛 ≀ exp βˆ’ 
(cid:16) 𝐢5π‘›πœ€2 𝑛 (cid:17) 2 (cid:33) 1 (cid:205)𝑛 𝑖=1 (cid:33) 𝑉 2 𝑛,𝑖 (cid:32) = exp βˆ’πΆ2 5 𝑉 2 𝑛,𝑖 π‘›πœ– 4 𝑛 𝑖=1 (cid:19) (cid:205)𝑛 1 𝑛 π‘›πœ– 4 𝑛 πœ– 2 𝑛 βˆ’πΆ2 5 (cid:17) βˆ’πΆ2 5 π‘›πœ– 2 𝑛 (cid:18) (cid:16) = exp = exp β†’ 0 Therefore, P0{H 𝐢 𝑛 } ≀ P0 (cid:41) 𝑒𝑖𝑉𝑛,𝑖 > πΆπ‘›πœ€2 𝑛 β†’ 0 (cid:40) 𝑛 βˆ‘οΈ 𝑖=1 With Assumption 2.2.2 (ii) in the main paper, this completes the proof. 26 APPENDIX B CHECKING ASSUMPTION 2.2.1 Assumption 2.2.1. (i) and (ii) require the existence of a uniformly consistent sequence of tests for testing 𝐻0 : 𝜁 = 𝜁0 versus 𝐻1 : 𝜁 ∈ U𝑐 πœ–π‘› ∩ F𝑛. The construction of these test functions for GPs has been studied in Choi and Schervish (2007) (see Section 3.5.3) and uses Assumption 2.2.3 in the main paper about the design points. Since these test functions mainly depend on the properties of 𝜁 and the model 𝑦𝑖 = 𝜁 (𝑑𝑖) + πœŽπœ–π‘–, the adaptation of the test functions in Choi and Schervish (2007) to our case is immediate with known 𝜎. For the completeness of the paper, the proof is adopted here based on the following five lemmas from Choi (2005). Lemma B.0.1. Let 𝜁1 be a continuous function in [0, 1] 𝑝, and define πœπ‘– 𝑗 = πœπ‘– (𝑑 𝑗 ) for 𝑖 = 0, 1 and 𝑗 = 1, Β· Β· Β· , 𝑛. Let πœ– > 0, and let π‘Ÿ > 0. Let 𝑐𝑛 = π‘›π‘Ÿ1 for 𝛼1/2 < 𝜏 < 1/2 and 1/2 < 𝛼1 < 1. Let 𝑏 𝑗 = 1 if 𝜁1 𝑗 > 𝜁0 𝑗 and -1 otherwise. Let Ξ¨1𝑛 [𝜁1, πœ–] be the indicator of the set 𝐴1, where 𝐴1 is defined as 𝐴1 = Then there exists a constant 𝐢3 such that for all 𝜁1 that satisfy (cid:205)𝑛 𝑛 βˆ‘οΈ 𝑗=1 𝑏 𝑗 (cid:19) (cid:18)π‘Œπ‘— βˆ’ 𝜁0 𝑗 𝜎0 √ > 2𝑐𝑛   ο£³ 𝑛   ο£Ύ 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘›, πΈπ‘ƒπœ 0 (Ξ¨1𝑛 [𝜁1, πœ–]) < 𝐢3 exp(βˆ’2𝑐2 𝑛) Also there exists constants 𝐢4 and 𝐢5 such that for all sufficiently large 𝑛 and all 𝜁 satisfying ||𝜁 βˆ’ 𝜁0||∞ < π‘Ÿ/4 and for all 𝜎 ≀ 𝜎0(1 + πœ–) πΈπ‘ƒπœ (1 βˆ’ Ξ¨1𝑛 [𝜁1, πœ–]) ≀ 𝐢4 exp(βˆ’π‘›πΆ5) where π‘ƒπœ is the joint distribution of {π‘Œπ‘›}∞ 𝑛=1 assuming parameters to be 𝜁. Proof: See Lemma 3.5.9 in Choi (2005). Lemma B.0.2. Let 𝜁1 be a continuous function in [0, 1] 𝑝, and define πœπ‘– 𝑗 = πœπ‘– (𝑑 𝑗 ) for 𝑖 = 0, 1 and 𝑗 = 1, Β· Β· Β· , 𝑛. Let Ξ¨2𝑛 be the indicator of the set 𝐴2, where 𝐴2 is defined as 𝐴2 = 𝑛 βˆ‘οΈ 𝑗=1   ο£³ 𝑏 𝑗 (cid:19) (cid:18)π‘Œπ‘— βˆ’ 𝜁0 𝑗 𝜎0 > 𝑛(1 + πœ–)   ο£Ύ 27 Then there exists a constant 𝐢6 such that for all 𝜁1, πΈπ‘ƒπœ 0 (Ξ¨2𝑛) < exp(βˆ’π‘›πΆ6) Also there exists constants 𝐢7 such that for all sufficiently large 𝑛 and all 𝜁 satisfying ||𝜁 βˆ’πœ0||∞ < π‘Ÿ/4 and for all 𝜎 > 𝜎0(1 + πœ–) πΈπ‘ƒπœ (1 βˆ’ Ξ¨2𝑛 [𝜁1, πœ–]) ≀ exp(βˆ’π‘›πΆ7) where π‘ƒπœ is the joint distribution of {π‘Œπ‘›}∞ 𝑛=1 assuming parameters as 𝜁. Proof: See Lemma 3.5.10 in Choi (2005). Lemma B.0.3. Let 𝜁1 be a continuous function in [0, 1] 𝑝, and define πœπ‘– 𝑗 = πœπ‘– (𝑑 𝑗 ) for 𝑖 = 0, 1 and πœ– βˆ’ πœ– 2. Let Ξ¨3𝑛 [𝜁1, πœ–] be the indicator of the set 𝐴3, 𝑗 = 1, Β· Β· Β· , 𝑛. 
Let πœ– > 0, and let 0 < π‘Ÿ < 4𝜎0 √ where 𝐴3 is defined as 𝐴3 = 𝑛 βˆ‘οΈ 𝑗=1 𝑏 𝑗 (cid:19) (cid:18)π‘Œπ‘— βˆ’ 𝜁1 𝑗 𝜎0 ≀ 𝑐(1 βˆ’ πœ– 2)   ο£³   ο£Ύ Then there exists a constant 𝐢8 such that for all 𝜁1 that satisfy (cid:205)𝑛 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘›, πΈπ‘ƒπœ 0 (Ξ¨3𝑛 [𝜁1, πœ–]) < exp(βˆ’π‘›πΆ8) Also there exists constants 𝐢9 such that for all sufficiently large 𝑛 and all 𝜁 satisfying ||𝜁 βˆ’ 𝜁0|| < π‘Ÿ/4 and for all 𝜎 ≀ 𝜎0(1 βˆ’ πœ–) πΈπ‘ƒπœ (1 βˆ’ Ξ¨3𝑛 [𝜁1, πœ–]) ≀ exp(βˆ’π‘›πΆ9) where π‘ƒπœ is the joint distribution of {π‘Œπ‘›}∞ 𝑛=1 assuming parameters to be 𝜁. Proof: See Lemma 3.5.11 in Choi (2005). The following two lemmas show that under Assumption 2.2.3. and Assumption 2.2.4., there exist enough number of design points to acquire (cid:205)𝑛 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘› in 𝑝 dimensions. Lemma B.0.4. Assume Assumption 2.2.3. and Assumption 2.2.4. holds. Let πœ† be the Lebesgue measure in R𝑝. Let 𝐾𝑑 be the constants mentioned in the Assumptions 2.2.3 and 𝑉 be a constant such that 𝑉 > (cid:13) (cid:13) βˆ€π’•1, 𝒕2 ∈ [0, 1] 𝑝, πœ•π‘‘ 𝑗 𝜁0(𝑑1, Β· Β· Β· , 𝑑 𝑝)(cid:13) (cid:13)∞. Let 𝐴𝑉 be the set of all continuous functions 𝛾 such that |𝛾(𝒕1) βˆ’ 𝛾(𝒕2)| ≀ 𝑉 ||𝒕1 βˆ’ 𝒕2||. For each such function 𝛾 and πœ–, define π΅πœ–,𝛾 = {𝑑 : πœ• 28 |𝛾(𝑑)| > πœ– }. Then for each πœ– > 0, there exists an integer 𝑁 such that for all 𝑛 β‰₯ 𝑁 and all 𝛾 ∈ 𝐴𝑉 , 𝑛 βˆ‘οΈ 𝑖=1 |𝛾(𝑑𝑖)| β‰₯ π‘›πœ†(π΅πœ–,𝛾) πœ– 2 Proof: See Lemma 4.2.1 in Choi (2005). Lemma B.0.5. Assume Assumption 2.2.3. and Assumption 2.2.4. holds. Let 𝐴𝑉 be the set of all continuous functions 𝜁 such that ||𝜁 ||∞ < 𝑀𝑛 and (cid:13) (cid:13) (cid:13) πœ• πœ•π’• 𝑗 𝜻 (𝒕1, Β· Β· Β· , 𝒕𝑑) (cid:13) (cid:13) (cid:13)∞ < 𝑉, 𝑗 = 1, Β· Β· Β· , 𝑑. Then for each πœ– > 0, there exist an integer 𝑁 and π‘Ÿ > 0 such that, for all 𝑛 β‰₯ 𝑁 and all 𝜁 ∈ 𝐴𝑉 such that ||𝜁 βˆ’ 𝜁0||1, (cid:205)𝑛 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | β‰₯ π‘Ÿπ‘› Proof: See Lemma 4.2.2 in Choi (2005). Now we are going to prove the existence of test functions in the KOH model setup. Proof: First, define Ψ𝑛 [𝜁1, πœ–] = 1 βˆ’ (1 βˆ’ Ξ¨1𝑛 [𝜁1, πœ–]) (1 βˆ’ Ξ¨2𝑛 [𝜁1, πœ–]) (1 βˆ’ Ξ¨3𝑛 [𝜁1, πœ–]) From Lemmas B.0.1-B.0.3, it is clear that πΈπ‘ƒπœ for each 𝜁1 ∈ U𝐢 πœ–π‘›. To create a test that doesn’t depend on a specific choice of 𝜁1 in Lemma B.0.1, we make use of the covering number of the sieve. Let π‘Ÿ be the same number that appears in 𝑛) and πΈπ‘ƒπœ (1 βˆ’ Ψ𝑛) ≀ exp(βˆ’π‘›πΆ10) Ψ𝑛 < exp(βˆ’2𝑐2 0 Lemmas B.0.1, B.0.2, B.0.3. Let πœ€ = min{πœ–/2, π‘Ÿ/4}. Let π‘πœ€ be the πœ€-covering number of F𝑛 in the supremum norm. Recall that given 0 < 𝛿 < 1/2, (2𝛿 + 1)/2 < 𝛼1 < 1 and 𝛼/2 < 𝜏1 < 1/2. Thus, with 𝑀𝑛 = O (𝑛𝛼1) and 𝑐𝑛 = π‘›πœ1, we have log(π‘πœ€) = π‘œ(𝑐2 𝑛). Let 𝜁 1, Β· Β· Β· , 𝜁 𝑁 πœ€ ∈ F𝑛 be such that for each 𝜁 ∈ F𝑛, there exists 𝑗 such that ||𝜁 βˆ’ 𝜁 𝑗 ||∞ < πœ€. If ||𝜁 βˆ’ 𝜁 𝑗 ||1 > πœ–, then ||𝜁 𝑗 βˆ’ 𝜁0||1 > πœ–/2. Finally, define Ψ𝑛 [𝜁 𝑗 , πœ– 2 ] 𝑗=1 |𝜁1 𝑗 βˆ’ 𝜁0 𝑗 | > π‘Ÿπ‘› holds for every such 𝜁 𝑗 in Φ𝑛 = max 1≀ 𝑗 ≀𝑁 πœ€ Since we verified that there exists π‘Ÿ such that (cid:205)𝑛 Lemma B.0.5, then 29 πΈπ‘ƒπœ 0 Φ𝑛 ≀ 𝑁 πœ€βˆ‘οΈ 𝑗=1 πΈπ‘ƒπœ 0 Ψ𝑛 [𝜁 𝑗 , πœ– 2 ] ≀ 𝐢3π‘πœ€ exp(βˆ’2𝑐2 𝑛) = 𝐢3 exp(log(π‘πœ€) βˆ’ 2𝑐2 𝑛) ≀ 𝐢3 exp(βˆ’π‘2 𝑛) For 𝜁 ∈ U𝐢 πœ–π‘› ∩ F𝑛, the type 2 error probability of Ψ𝑛 is no larger than the minimum of the individual type 2 error probabilities of the Ψ𝑛 [𝜁 𝑗 , πœ–/2] tests. Hence we have the results. 
30 APPENDIX C CHECKING ASSUMPTION 2.2.2 1 We construct our sieve F𝑛 such that it increases to the space of all continuously differentiable functions with bounded derivatives on [0, 1] 𝑝. To this end, we make use of an increasing sequence of constants 𝑀𝑛. To be specific, let {𝑀𝑛}∞ 𝑛=1 be a sequence of numbers such that 𝑀𝑛 β†’ ∞ as 𝑛 β†’ ∞ and 𝑀𝑛 = O (𝑛𝛼) for some 𝛼 ∈ (1/2, 1). Also, define, (cid:26) F𝑛 = 𝜁 (Β·) : ||𝜁 ||∞ < 𝑀𝑛, || πœ• πœ•π‘‘π‘– 𝜁 ||∞ < 𝑀𝑛, 𝑖 = 1, Β· Β· Β· , 𝑝 (cid:27) (C.1) In lieu of (C.1), for the sequence of sieves F𝑛, Π𝜁 (F 𝑐 𝑛 |𝒛) is exponentially small as long as the following two probabilities are exponentially small Π𝜁 {𝜁 : ||𝜁 ||∞ > 𝑀𝑛|𝒛} πœ• πœ•π‘‘π‘– 𝜁 ||∞ > 𝑀𝑛 𝜁 : || Π𝜁 (cid:26) (cid:27) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 , 𝑖 = 1, Β· Β· Β· , 𝑝 (C.2) (C.3) This result is stated in the Lemma below. Lemma C.0.1. Let 𝑀𝑛 = O (𝑛𝛼), where 𝛼 ∈ (1/2, 1) and 2𝛼 βˆ’ β„Ž β‰₯ 1 where β„Ž is as defined in Assumption 2.2.5. 1. Then, there exists positive constants 𝐷1, 𝐷2, 𝑑1, 𝑑2 such that, Π𝜁 {𝜁 : ||𝜁 ||∞ > 𝑀𝑛|𝒛} ≀ 𝐷1 exp(βˆ’π‘‘1𝑛) (cid:12) (cid:26) (cid:12) (cid:12) (cid:12) ≀ 𝐷2 exp(βˆ’π‘‘2𝑛), 𝜁 ||∞ > 𝑀𝑛 πœ• πœ•π‘‘π‘– 𝜁 : || (cid:27) 𝒛 Π𝜁 𝑖 = 1, Β· Β· Β· , 𝑝 Proof: For the proof of the lemma, we show the results step by step. First, note that for π‘₯ β‰₯ 0, 0 ≀ 𝑅(π‘₯) ≀ 1, βˆ’π‘‡1 ≀ 𝑅′(π‘₯) ≀ 0, βˆ’π‘‡2 ≀ 𝑅′′(π‘₯) ≀ 𝑇3 for positive 𝑇1, 𝑇2, and 𝑇3. In addition, if we evaluate at π‘₯ = 0, then 𝑅(0) = 1 and 𝑅′′(0) = βˆ’π‘‡4 < 0, where 𝑇4 > 0. The widely used Squared Exponential covariance function, MatΓ©rn covariance function, and Cauchy covariance function satisfy these equalities and inequalities. Please refer to Chapter 4 of Abrahamsen (1997) for further description. Then, 31 E𝜁 |𝜽,𝝓,𝒛 [𝜁 (𝒕)] = = = ≀ ≀ ≀ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘˜=1 𝑠 βˆ‘οΈ 𝑗=1 𝑠 βˆ‘οΈ π‘˜=1 𝑠 βˆ‘οΈ 𝑗=1 𝑠 βˆ‘οΈ π‘˜=1 𝑗=1 π‘˜ 𝑓 ((𝒕, 𝜽), ((cid:101)𝒕 𝑗 , (cid:101)𝜽 𝑗 )) (cid:18) 1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , π‘™πœƒ) (cid:19) π‘§π‘˜ πœ‚ 𝑓 𝑅(𝒕, Λœπ’• 𝑗 ; 𝑙 𝑓𝑑 )𝑅(𝜽, ˜𝜽 𝑗 ; π‘™πœƒ) (cid:18) 1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:19) π‘§π‘˜ 𝑅(𝑑1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′𝑙 𝑓 𝑝+ 𝑝′ ) (cid:16) ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:17) π‘§π‘˜ ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:17) π‘§π‘˜ (cid:19) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘§π‘˜ π‘˜=1 𝑗=1 (cid:12) (cid:12) (cid:12) ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:12) (cid:12) (cid:12) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ (cid:16) 𝑗=1 π‘˜=1 (cid:18) max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} (cid:18) max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} (cid:19) π‘§π‘˜ ¯𝐢 = 𝐢 (C.4) E𝜁 |𝜽,𝝓,𝒛 (cid:20) πœ• πœ•π‘‘ π‘˜ (cid:21) = 𝜁 (𝒕) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ π‘˜=1 𝑗=1 𝑅(𝑑1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(𝑑 π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) βˆ’π‘™ π‘“π‘˜ (cid:33) 𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′𝑙 𝑓 𝑝+ 𝑝′ ) 𝑠 (cid:18) βˆ‘οΈ 𝑠 βˆ‘οΈ (cid:19) 𝑙 π‘“π‘˜ ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) ≀ 𝑇1 max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ (cid:12) (cid:12) (cid:12) π‘˜=1 𝑗=1 ΛœΞ£βˆ’1 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) (cid:17) π‘§π‘˜ (cid:16) (cid:12) (cid:12) (cid:12) (C.5) Step 1) Let, 𝜎2 0 (𝜁) = supπ’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ (𝜁 (𝒕)|𝒛, 𝜽, 𝝓). By applying Lemma.3.5.6 from Choi (2005) at the 1st inequality below, we can bound the probability of the given set from above. 
(cid:18) P (cid:12) (cid:12) (cid:12) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 sup π’•βˆˆπ›€βŠ‚R 𝑝 = ≀ = Θ ∫ 𝐴𝝓 ∫ Θ 𝐴𝝓 ∫ ∫ Θ 𝐴𝝓 ∫ ∫ (cid:18) P (cid:12) (cid:12) (cid:12) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) 𝒛, 𝜽, 𝝓 (cid:12) (cid:12) (cid:12) (cid:12) sup π’•βˆˆπ›€βŠ‚R 𝑝 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ (cid:32) exp βˆ’ 1 4 𝜎2 0 𝑀 2 𝑛 (𝜁) + E2(𝜁 (𝒕)|𝒛, 𝜽, 𝝓) (cid:33) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ 32 Hence, using Assumption 8. 1, (cid:12) (cid:12) (cid:12) (cid:12) sup π’•βˆˆπ›€βŠ‚R 𝑝 > 𝑀𝑛 𝜁 (𝒕) P (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:18) 𝒛 (cid:19) = = ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } 0 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ ∫ ∫ ∫ π‘›β„Ž ∫ π‘›β„Ž Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } 0 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ + + βˆ’ ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } 0 π‘›β„Ž βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } π‘›β„Ž 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) + 𝐢2 + 𝐢2 ∫ ∫ ∫ ∞ ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } π‘›β„Ž π‘›β„Ž βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ (cid:170) (cid:174) (cid:174) (cid:172) (cid:170) (cid:174) (cid:174) (cid:172) (cid:170) (cid:174) (cid:174) (cid:172) ∫ ∫ ≀ ∫ π‘›β„Ž ∫ π‘›β„Ž Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } 0 0 βˆ’ 1 4 exp (cid:169) (cid:173) (cid:173) (cid:171) πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ ∫ ∫ ∞ ∫ ∞ ∫ + + βˆ’ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } Θ ∫ ∫ Θ ∫ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚ 𝛿 } ∫ 0 ∫ ∞ π‘›β„Ž ∫ ∞ π‘›β„Ž ∫ ∞ 0 ∫ ∞ Θ π΄π“βˆ’{ πœ‚ 𝑓 , πœ‚π›Ώ } π‘›β„Ž π‘›β„Ž 1𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 1𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 0𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ 33 Therefore, (cid:18) P (cid:12) (cid:12) (cid:12) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 ≀ ∫ ∫ Θ sup 
π’•βˆˆπ›€βŠ‚R 𝑝 𝐴𝝓2 ∫ π‘›β„Ž ∫ π‘›β„Ž 1 4 0 0 exp (cid:169) βˆ’ (cid:173) (cid:173) (cid:171) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 (cid:205)𝑠 𝑖=1 𝑀 2 𝑛 (cid:205)𝑠 𝑗=1 (cid:12) (cid:12) (cid:12) 𝐾 βˆ’1 𝑗,𝑖 (𝑙 𝑓 , π‘™πœƒ) (cid:12) (cid:12) (cid:12) + 𝐢2 (cid:170) (cid:174) (cid:174) (cid:172) + P(πœ‚ 𝑓 > π‘›β„Ž|𝒛) + P(πœ‚π›Ώ > π‘›β„Ž|𝒛) βˆ’ 0 𝑛2𝛼1 π‘›β„Ž + π‘›β„Ž + π‘›β„Ž ¯𝐢 + 𝐢2 (cid:19) (cid:18) ≀ exp (cid:16) ≀ exp βˆ’ 1 4 βˆ’πΆπ‘›2𝛼1βˆ’β„Ž(cid:17) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) ≀ 𝐷1 exp(βˆ’π‘‘1𝑛) Step 2) Let 𝜎2 (cid:16) βˆ’π‘€ 2 𝑛 /4(𝑇4πœ‚ 𝑓 𝑙2 π‘“π‘˜ exp π‘˜ (𝜁) = supπ’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ + 𝑇4πœ‚π›Ώπ‘™2 𝛿𝑖 + πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑇 2 1 (cid:18) πœ• πœ•π‘‘ π‘˜ 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) (cid:12) ¯𝐢 βˆ’ 𝑇1𝐢𝑙 π‘“π‘˜ ¯𝐢) 𝒛, 𝜽, 𝝓 (cid:17) (cid:19) . Further let, π‘™βˆ— π‘“π‘˜ , π‘™βˆ— π›Ώπ‘˜ are the maximizer of in the bounded support. By applying the same logic as in step 1), we have the upper bound of the probability. (cid:12) (cid:12) (cid:12) πœ• πœ•π‘‘ π‘˜ (cid:18) P (cid:18) P sup π’•βˆˆπ›€βŠ‚R 𝑝 ∫ ∫ Θ 𝐴𝝓 ∫ ∫ Θ 𝐴𝝓 = = ∫ π‘›β„Ž ∫ ≀ exp (cid:169) βˆ’ (cid:173) (cid:173) (cid:171) ∫ π‘›β„Ž 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 (cid:12) (cid:12) (cid:12) πœ• πœ•π‘‘ π‘˜ 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) 𝒛, 𝜽, 𝝓 (cid:12) (cid:12) (cid:12) (cid:12) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ sup π’•βˆˆπ›€βŠ‚R 𝑝 1 4 Γ— 𝑀 2 𝑛 π‘˜ (𝜁) + E2 𝜎2 𝜁 |𝜽,𝝓,𝒛 (cid:104) πœ• πœ•π‘‘ π‘˜ 𝜁 (𝒕) (cid:105) (cid:170) (cid:174) (cid:174) (cid:172) 𝑝(𝜽, 𝝓|𝒛)π‘‘π“π‘‘πœ½ Θ 0 0 βˆ’ 1 4 Γ— exp (cid:169) (cid:173) (cid:173) (cid:171) 𝑇4πœ‚ 𝑓 𝑙2 𝑓 + 𝑇4πœ‚π›Ώπ‘™2 𝛿 + πœ‚ 𝑓 𝑙2 𝑓 𝑇 2 1 𝑀 2 𝑛 ¯𝐢 + 𝑇1 (cid:0)maxπ‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ (cid:1) (cid:205)𝑠 π‘˜=1 (cid:205)𝑠 𝑗=1 𝑙 π‘“π‘˜ (cid:12) (cid:12) (cid:12) 22( 𝑗,π‘˜) (𝑙 𝑓 , 𝑙𝛿) ΛœΞ£βˆ’1 (cid:170) (cid:174) (cid:174) (cid:172) (cid:12) (cid:12) (cid:12) 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝝓2|𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ + P(πœ‚ 𝑓 > π‘›β„Ž|𝒛) + P(πœ‚π›Ώ > π‘›β„Ž|𝒛) 34 Hence, (cid:18) P ≀ (cid:12) (cid:12) (cid:12) ∫ π‘›β„Ž sup π’•βˆˆπ›€βŠ‚R 𝑝 ∫ πœ• πœ•π‘‘ π‘˜ ∫ π‘›β„Ž 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) > 𝑀𝑛 (cid:19) (cid:12) (cid:12) (cid:12) (cid:12) 𝒛 (cid:32) exp βˆ’ 1 4 Γ— 𝑇4πœ‚ 𝑓 π‘™βˆ—2 π‘“π‘˜ + 𝑇4πœ‚π›Ώπ‘™βˆ—2 π›Ώπ‘˜ 𝑀 2 𝑛 + πœ‚ 𝑓 π‘™βˆ—2 π‘“π‘˜ 𝑇 2 1 ¯𝐢 + 𝑇1πΆπ‘™βˆ— π‘“π‘˜ ¯𝐢 (cid:33) Θ 0 0 𝑝(πœ‚ 𝑓 |𝒛) 𝑝(πœ‚π›Ώ |𝒛) 𝑝(𝜽)π‘‘π“π‘‘πœ½ (cid:32) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) 𝑛2𝛼1 + π‘›β„Žπ‘™βˆ—2 π‘“π‘˜ (cid:33) + 𝑇4π‘›β„Žπ‘™βˆ—2 π›Ώπ‘˜ 1 4 βˆ’ Γ— (cid:32) ≀ exp 𝑇4π‘›β„Žπ‘™βˆ—2 π‘“π‘˜ 𝑛2𝛼1βˆ’β„Ž + 𝑇4π‘™βˆ—2 π›Ώπ‘˜ ≲ exp (cid:16) = exp βˆ’ 4(𝑇4π‘™βˆ—2 π‘“π‘˜ βˆ’πΆπ‘›2𝛼1βˆ’β„Ž(cid:17) + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) ¯𝐢) 𝑇 2 1 + π‘™βˆ—2 π‘“π‘˜ + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) (cid:33) 𝑇 2 1 ¯𝐢 + 𝑇1πΆπ‘™βˆ— π‘“π‘˜ ¯𝐢 + 𝑏1 exp(βˆ’π‘2𝑛) + 𝑏3 exp(βˆ’π‘4𝑛) ≀ 𝐷2 exp (βˆ’π‘‘2𝑛) Note that for the second inequality, we used the assumption that the length scale parameters have bounded support, which leads to specific values, π‘™βˆ— 𝑓 and π‘™βˆ— 𝛿, that maximize the term in the exponential. In addition the bound of 𝜎2 0 and 𝜎2 𝑖 used in the proof of step 1) and step 2) are derived below. Bounds used for the proof of step 1) and step 2) above Remember from Assumption (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 𝑖=1 𝑗=1 (cid:205)𝑠 2.2.5. 
2, we have (cid:205)𝑠 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , π‘™πœƒ) is the (j,i)-th element ΛœΞ£βˆ’1 of [𝐾 𝑓 (𝑇𝑧 ((cid:101)𝜽), 𝑇𝑧 ((cid:101)𝜽))]βˆ’1, and it depends on 𝑙 𝑓 , 𝑙𝛿. Under the assumed covariance structure, the co- variance kernel of 𝜁 conditioned on 𝒛, 𝜽, 𝝓 is as follows. Let 𝑙 𝑓 = (𝑙 𝑓𝑑 , π‘™πœƒ), where 𝑙 𝑓𝑑 = (𝑙 𝑓1 , Β· Β· Β· , 𝑙 𝑓 𝑝 ) and π‘™πœƒ = (𝑙 𝑓 𝑝+1 , Β· Β· Β· , 𝑙𝛿 𝑝 ) , Β· Β· Β· , 𝑙 𝑓 𝑝+𝑝′), and 𝑙𝛿 = (𝑙𝛿1 ≀ ¯𝐢 Also, 1 πœ‚ 𝑓 π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) = π‘˜ 𝑓 ((𝒕, 𝜽), (𝒕′, πœ½β€²)) + π‘˜ 𝛿 (𝒕, 𝒕′) 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ πœ‚ 𝑓 𝑅(𝒕, Λœπ’•π‘–; 𝑙 𝑓𝑑 )𝑅(𝜽, Λœπœ½π‘–; π‘™πœƒ) (cid:18) 1 πœ‚ 𝑓 (cid:19) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) πœ‚ 𝑓 𝑅(𝒕′, Λœπ’• 𝑗 ; 𝑙 𝑓𝑑 )𝑅(𝜽, ˜𝜽 𝑗 ; π‘™πœƒ) βˆ’ 𝑖=1 𝑠 βˆ‘οΈ 𝑗=1 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 βˆ’ πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝒕, Λœπ’•π‘–; 𝑙 𝑓𝑑 )𝑅(𝜽, Λœπœ½π‘–; π‘™πœƒ)𝑅(𝒕′, Λœπ’• 𝑗 ; 𝑙 𝑓𝑑 )𝑅(𝜽, ˜𝜽 𝑗 ; π‘™πœƒ) 35 Hence, π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) = πœ‚ 𝑓 𝑅(𝒕, 𝒕′; 𝑙 𝑓𝑑 ) + πœ‚π›Ώ 𝑅(𝒕, Λœπ’•β€²; 𝑙𝛿) = πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 + πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑙𝑑2,𝑑′2;𝛿2 ; 𝑙 𝑓 𝑝 ) ; 𝑙𝛿 𝑝 ) βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(𝑑2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+1) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) By letting π’•βˆ— the maximizer of Vπ‘Žπ‘Ÿ (𝜁 (𝒕)|𝒛, 𝜽, 𝝓), we can bound the following term. 0 (𝜁) = sup 𝜎2 π’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ (𝜁 (𝒕)|𝒛, 𝜽, 𝝓) = πœ‚ 𝑓 𝑅(0) 𝑝 + πœ‚π›Ώ 𝑅(0) 𝑝 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ βˆ’ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ2 𝑖 ; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) ≀ πœ‚ 𝑓 𝑅(0) 𝑝 + πœ‚π›Ώ 𝑅(0) 𝑝 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ + 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) (cid:12) 𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘ Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(π‘‘βˆ—1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(π‘‘βˆ—2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(π‘‘βˆ—π‘, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) 36 Since 𝑅(0) = 1 and 𝑅(π‘₯) ≀ 1 for widely used covariance functions such as Squared Exponential covariance, MatΓ©rn covariance, and Cauchy covariance and using Assumption 8. 2, 0 (𝜁) ≀ πœ‚ 𝑓 + πœ‚π›Ώ + πœ‚ 𝑓 ¯𝐢 𝜎2 For bounding 𝜎2 𝑖 (𝜁) = supπ’•βˆˆπ›€βŠ‚R 𝑝 Vπ‘Žπ‘Ÿ (cid:18) πœ• πœ•π‘‘π‘– 𝜁 (𝒕) (cid:19) 𝒛, 𝜽, 𝝓 , we are going to use the fact that the process constructed by taking the partial derivative of 𝜁 |𝒛, 𝜽, 𝝓 with respect to the π‘˜ π‘‘β„Ž component is again (cid:12) (cid:12) (cid:12) (cid:12) a GP with continuous sample paths and covariance function with πœ•2 πœ•π‘‘ π‘˜ πœ•π‘‘β€²π‘˜ π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′). For proof of this fact, please refer to Ghosal and Roy (2006). 
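As a quick numerical illustration of the fact just cited (a sketch of my own, not part of the thesis, using a squared exponential kernel as a stand-in for $R$): the covariance of the derivative process is the mixed second derivative $\partial^2 k/\partial t_k \partial t'_k$ of the original kernel, and this can be checked against a finite-difference approximation.

```python
import numpy as np

def k_se(t, tp, ell=1.0):
    # squared exponential kernel k(t, t') = exp(-(t - t')^2 / ell^2), a stand-in for R
    return np.exp(-((t - tp) ** 2) / ell ** 2)

def d2k_dt_dtp(t, tp, ell=1.0):
    # analytic mixed derivative d^2 k / (dt dt'), i.e. the covariance of the derivative process
    u = (t - tp) / ell
    return (2.0 / ell ** 2) * (1.0 - 2.0 * u ** 2) * np.exp(-u ** 2)

# central finite-difference approximation of the same mixed derivative
h, t, tp = 1e-4, 0.3, 0.7
fd = (k_se(t + h, tp + h) - k_se(t + h, tp - h)
      - k_se(t - h, tp + h) + k_se(t - h, tp - h)) / (4.0 * h ** 2)
print(fd, d2k_dt_dtp(t, tp))   # the two values should agree to several decimal places
```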
πœ• πœ•π‘‘ π‘˜ = πœ• π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) πœ•π‘‘β€²π‘˜ πœ• πœ• πœ•π‘‘β€²π‘˜ [πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 πœ•π‘‘ π‘˜ + πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿1) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙𝛿 𝑝 ) ; 𝑙 𝑓 𝑝 ) βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(𝑑2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ )] = πœ• πœ•π‘‘ π‘˜ [πœ‚ 𝑓 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅′(𝑑 π‘˜ , π‘‘β€²π‘˜ ; 𝑙 π‘“π‘˜ ) + πœ‚π›Ώ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿2) Β· Β· Β· 𝑅′(𝑑 π‘˜ , π‘‘β€²π‘˜ ; π‘™π›Ώπ‘˜ ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙 𝑓 𝑝 ) (cid:19) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙𝛿 𝑝 ) (cid:19) (cid:18) (cid:18) βˆ’π‘™ π‘“π‘˜ 𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ |𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ | 𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ |𝑑 π‘˜ βˆ’ π‘‘β€²π‘˜ | 𝑖 ; 𝑙 𝑓2) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 βˆ’π‘™π›Ώπ‘˜ βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1)𝑅(𝑑2, Λœπ‘‘2 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝 )𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓2) Β· Β· Β· 𝑅′(π‘‘β€²π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) βˆ’π‘™ π‘“π‘˜ (cid:33) π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ )] 37 Hence, πœ• πœ•π‘‘ π‘˜ πœ• πœ•π‘‘β€²π‘˜ π‘˜ 𝜁 |𝒛,𝜽,𝝓 (𝒕, 𝒕′) = βˆ’πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑅(𝑑1, 𝑑′1; 𝑙 𝑓1)𝑅(𝑑2, 𝑑′2; 𝑙 𝑓2) Β· Β· Β· 𝑅′′(𝑑 π‘˜ , π‘‘β€²π‘˜ ; 𝑙 π‘“π‘˜ ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 ; 𝑙 𝑓 𝑝 ) βˆ’ πœ‚π›Ώπ‘™2 π›Ώπ‘˜ 𝑅(𝑑1, 𝑑′1; 𝑙𝛿1)𝑅(𝑑2, 𝑑′2; 𝑙𝛿2) Β· Β· Β· 𝑅′′(𝑑 π‘˜ , π‘‘β€²π‘˜ βˆ’ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿)𝑅(𝑑1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(𝑑 π‘˜ , Λœπ‘‘ π‘˜ 𝑖 ; 𝑙 π‘“π‘˜ ) βˆ’π‘™ π‘“π‘˜ ; π‘™π›Ώπ‘˜ ) Β· Β· Β· 𝑅(𝑑 𝑝, 𝑑′𝑝 (cid:32) ; 𝑙𝛿 𝑝 ) 𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 |𝑑 π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 | (cid:33) Β· Β· Β· 𝑅(𝑑 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(𝑑′1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1)𝑅(𝑑′2, Λœπ‘‘2 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(π‘‘β€²π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) βˆ’π‘™ π‘“π‘˜ (cid:33) π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |π‘‘β€²π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(𝑑′𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Assume that π‘‘βˆ—βˆ— is the maximizer of πœ•2 πœ• (𝑑 π‘˜)2 Vπ‘Žπ‘Ÿ (cid:18) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) (cid:12) (cid:19) 𝒛, 𝜽, 𝝓 . Then, we can simplify the bound by using the following facts coming from the covariance assumption for R. 
π‘˜ (𝜁) = sup 𝜎2 π’•βˆˆπ›€βŠ‚R 𝑝 πœ•2 πœ• (𝑑 π‘˜ )2 Vπ‘Žπ‘Ÿ (cid:18) 𝜁 (𝒕) (cid:12) (cid:12) (cid:12) (cid:12) (cid:19) 𝒛, 𝜽, 𝝓 = βˆ’πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑅(0) π‘βˆ’1𝑅′′(0) βˆ’ πœ‚π›Ώπ‘™2 π›Ώπ‘˜ 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ πœ‚ 𝑓 ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) βˆ’ 𝑖=1 𝑗=1 𝑅(0) π‘βˆ’1𝑅′′(0) Γ— 𝑅(π‘‘βˆ—βˆ—1, Λœπ‘‘1 𝑖 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(π‘‘βˆ—βˆ—π‘˜ , Λœπ‘‘ π‘˜ 𝑖 ; 𝑙 π‘“π‘˜ ) (cid:32) 𝑙 π‘“π‘˜ (cid:33) π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 |π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑖 | Β· Β· Β· 𝑅(π‘‘βˆ—βˆ— 𝑝, Λœπ‘‘ 𝑝 𝑖 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1, Λœπœƒπ‘–,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒπ‘–,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒπ‘–,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) Γ— 𝑅(π‘‘βˆ—βˆ—1, Λœπ‘‘1 𝑗 ; 𝑙 𝑓1) Β· Β· Β· 𝑅′(π‘‘βˆ—βˆ—π‘˜ , Λœπ‘‘ π‘˜ 𝑗 ; 𝑙 π‘“π‘˜ ) (cid:32) 𝑙 π‘“π‘˜ (cid:33) π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 |π‘‘βˆ—βˆ—π‘˜ βˆ’ Λœπ‘‘ π‘˜ 𝑗 | Β· Β· Β· 𝑅(π‘‘βˆ—βˆ— 𝑝, Λœπ‘‘ 𝑝 𝑗 ; 𝑙 𝑓 𝑝 ) Γ— 𝑅(πœƒ1 βˆ’ Λœπœƒ 𝑗,1; 𝑙 𝑓 𝑝+1)𝑅(πœƒ2, Λœπœƒ 𝑗,2; 𝑙 𝑓 𝑝+2) Β· Β· Β· 𝑅(πœƒ 𝑝′, Λœπœƒ 𝑗,𝑝′; 𝑙 𝑓 𝑝+ 𝑝′ ) ≀ 𝑇4πœ‚ 𝑓 𝑙2 π‘“π‘˜ + 𝑇4πœ‚π›Ώπ‘™2 π›Ώπ‘˜ + πœ‚ 𝑓 𝑙2 π‘“π‘˜ ≀ 𝑇4πœ‚ 𝑓 𝑙2 π‘“π‘˜ + 𝑇4πœ‚π›Ώπ‘™2 π›Ώπ‘˜ + πœ‚ 𝑓 𝑙2 π‘“π‘˜ 𝑇 2 1 𝑠 βˆ‘οΈ 𝑠 βˆ‘οΈ 𝑖=1 𝑗=1 (cid:12) (cid:12) (cid:12) ΛœΞ£βˆ’1 22( 𝑗,𝑖) (𝑙 𝑓 , 𝑙𝛿) (cid:12) (cid:12) (cid:12) ¯𝐢 𝑇 2 1 38 APPENDIX D CHECKING ASSUMPTION 2.2.2 2 First, recall two classical spaces of finite smoothness: HΓΆlder and Sobolev. HΓΆlder space 𝐢𝛼 (X), where 𝛼 > 0, is the space of all functions whose partial derivatives of order up to βŒˆπ›ΌβŒ‰ βˆ’ 1 exist and are uniformly bounded, where βŒˆπ›ΌβŒ‰ represents the largest integer that is smaller than 𝛼. Sobolev space 𝐻𝛼 (X) is the set of functions that are restriction to X of functions 𝑓0 : R𝑝 β†’ R with Fourier transform ˆ𝑓0(πœ†) = (2πœ‹)βˆ’π‘ ∫ exp(π‘–πœ†π‘‡ 𝑑) 𝑓0(𝑑)𝑑𝑑 satisfying ∫ (1 + βˆ₯πœ†βˆ₯2)𝛼| ˆ𝑓0|2π‘‘πœ† < ∞. van der Vaart and van Zanten (2011) showed behavior of concentration function in two different classes of GP. First, consider MatΓ©rn class of GP prior, which corresponds to the mean zero GP π‘Š = (π‘Šπ‘‘ : 𝑑 ∈ [0, 1] 𝑝) with covariance function E[π‘Šπ‘ π‘Šπ‘‘] = ∫ R 𝑝 exp(π‘–πœ†π‘‡ (𝑠 βˆ’ 𝑑))π‘š(πœ†)π‘‘πœ† where spectral densities π‘š : R𝑝 β†’ R is given by, for 𝛼 > 0, π‘š(πœ†) = 1 (π‘Ž + βˆ₯πœ†βˆ₯2)𝛼+𝑝/2 (D.1) (D.2) Combining the results of Lemma 3 and Lemma 4 ofvan der Vaart and van Zanten (2011), we have the following results, Lemma D.0.1. For MatΓ©rn class of GP, if 𝜁0 ∈ 𝐢 𝛽 [0, 1] 𝑝 ∩ 𝐻 𝛽 [0, 1] 𝑝 for 0 < 𝛽 ≀ 𝛼, the concentration function satisfies, πœ™π‘€ 𝜁0 (πœ–π‘›) ≲ πœ– βˆ’ (2𝛼+ π‘βˆ’2𝛽) 𝛽 𝑛 βˆ’ 𝑝 𝛼 𝑛 + πœ– (D.3) For the second case, consider the Square Exponential class of GP prior. The squared exponential class corresponds to the mean zero GP with covariance function. E[π‘Šπ‘ π‘Šπ‘‘] = exp(βˆ’ βˆ₯𝑠 βˆ’ 𝑑 βˆ₯2), 𝑠, 𝑑 ∈ [0, 1] 𝑝 (D.4) 39 where spectral densities π‘š : R𝑝 β†’ R is given by, π‘š(πœ†) = 1 2π‘πœ‹ 𝑝/2 exp (cid:19) (cid:18) βˆ’ βˆ₯πœ†βˆ₯2 4 (D.5) Combining the results of Lemma 6 and Lemma 7 ofvan der Vaart and van Zanten (2011), we have the following results, Lemma D.0.2. For Squared Exponential class of GP, if 𝜁0 ∈ 𝐻 𝛽 [0, 1] 𝑝 for 𝑑/2 < 𝛽, then for a constant 𝐢 that depends only on 𝜁0, the concentration function satisfies, πœ™π‘†πΈ 𝜁0 (πœ–π‘›) ≲ exp(πΆπœ– βˆ’2/(π›½βˆ’π‘/2) 𝑛 ) + (βˆ’ log πœ–π‘›)1+𝑝 (D.6) Note both the right hand sides of πœ™π‘€ 𝜁0 (πœ–π‘›) and πœ™π‘†πΈ 𝜁0 (πœ–π‘›) tend to infinty as 𝑛 goes to infinity. 
However, since a slowly decreasing sequence $\epsilon_n$ satisfies $n\epsilon_n^2 \to \infty$ as $n$ goes to infinity, for large enough $n$ we have $\phi^{M}_{\zeta_0}(\epsilon_n) \lesssim n\epsilon_n^2$ and $\phi^{SE}_{\zeta_0}(\epsilon_n) \lesssim n\epsilon_n^2$.

CHAPTER 3
CALIBRATED COMPUTER MODEL USING VARIATIONAL BAYES (VARIATIONAL INFERENCE)

3.1 Variational Bayes formulation

This section introduces the general Variational Bayes (VB) approach that we are going to use. Let the set of parameters of interest $\boldsymbol{\Psi}$ have a prior distribution $p(\boldsymbol{\Psi})$. Given a data set $\mathcal{D}$, the goal of the Bayesian method is to find the posterior distribution. Using Bayes' theorem, that is to say, find
$$
\pi(\boldsymbol{\Psi}\mid\mathcal{D}) = \frac{p(\mathcal{D}\mid\boldsymbol{\Psi})\,p(\boldsymbol{\Psi})}{\int_{\boldsymbol{\Psi}} p(\mathcal{D}\mid\boldsymbol{\Psi})\,p(\boldsymbol{\Psi})\,d\boldsymbol{\Psi}}.
$$
The core difficulty of posterior computation comes from the denominator, the evidence. Much of modern Bayesian computation is devoted to approximating this difficult-to-calculate quantity (Blei et al., 2017). There are diverse approaches to approximating posterior distributions; to name a few, Markov chain Monte Carlo (MCMC), Hamiltonian Monte Carlo (HMC), and Approximate Bayesian Computation (ABC). An alternative method from machine learning has recently gained popularity: Variational Bayes (VB), also called Variational Inference (VI). The advantage of VB over the other approaches is clear: it is faster and easier to scale to large data sets. This is particularly appealing in the computer simulation setting, where we want to build a GP surrogate from a large set of simulation runs.

To apply VB, we first need a divergence criterion between two probability measures. The most widely used criterion is the Kullback-Leibler (KL) divergence, or KL distance.

Definition 1. (Kullback-Leibler (KL) divergence) Let $P$ and $Q$ be two probability measures on a measurable space $\mathcal{X}$ with $P$ absolutely continuous with respect to $Q$, and let $\mu$ be any measure on $\mathcal{X}$ for which densities $p$ and $q$ with $P(dx) = p(x)\mu(dx)$ and $Q(dx) = q(x)\mu(dx)$ exist. Then the Kullback-Leibler divergence is defined as
$$
\mathrm{KL}(P\,\|\,Q) = \int_{\mathcal{X}} \log\frac{P(dx)}{Q(dx)}\,P(dx) = \int_{\mathcal{X}} p(x)\log\frac{p(x)}{q(x)}\,\mu(dx) = \mathbb{E}_p\!\left[\log\frac{p}{q}\right].
$$

Definition 1 is the standard definition of KL divergence. However, since we are working on infinite-dimensional function spaces, there is no sensible infinite-dimensional Lebesgue measure, so we need a more general definition to work with. Kolmogorov formulated a general definition of KL divergence as a supremum of KL divergences over all finite measurable partitions of the index set (Kolmogorov et al., 1992). This formulation has been adopted in the functional approach to variational inference, for example by de G. Matthews et al. (2016) and Sun et al. (2019). Early works on KL divergence between stochastic processes include Csató and Opper (2002), Csató (2002), Seeger (2003a), and Seeger (2003b).

Definition 2. (Kullback-Leibler (KL) divergence for stochastic processes) For two stochastic processes $P$ and $Q$, the Kullback-Leibler divergence is defined as
$$
\mathrm{KL}(P\,\|\,Q) = \sup_{X \in \mathcal{X}} \mathrm{KL}(P_X\,\|\,Q_X),
$$
where $X$ denotes a measurement set and $P_X$, $Q_X$ are the marginal distributions of the function values at $X$.

Note that this definition of KL divergence depends on the measurement set $X$. Sun et al. (2019) proposed sampling-based measurement sets, where the measurement set is sampled from both the training inputs and random points in the domain.
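To make Definition 2 concrete, here is a small numerical illustration of my own (not from the thesis): for zero-mean GPs the marginal laws at a finite measurement set are multivariate Gaussians, so the finite-dimensional KL in Definition 2 has the usual closed form. The kernel and length scales below are arbitrary choices.

```python
import numpy as np

def k_se(x, y, ell):
    # squared exponential kernel on one-dimensional inputs
    return np.exp(-((x[:, None] - y[None, :]) ** 2) / ell ** 2)

def kl_gaussian(mu1, S1, mu2, S2):
    # KL( N(mu1, S1) || N(mu2, S2) ) for k-dimensional Gaussians
    k = mu1.size
    S2_inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    _, logdet1 = np.linalg.slogdet(S1)
    _, logdet2 = np.linalg.slogdet(S2)
    return 0.5 * (np.trace(S2_inv @ S1) + diff @ S2_inv @ diff - k + logdet2 - logdet1)

# Two zero-mean GPs on [0, 1] that differ only in length scale; Definition 2 takes the
# supremum of these finite-dimensional KLs over measurement sets X.
for n_pts in (5, 9, 17):                                  # nested grids
    X = np.linspace(0.0, 1.0, n_pts)
    S1 = k_se(X, X, ell=0.5) + 1e-6 * np.eye(n_pts)       # jitter for numerical stability
    S2 = k_se(X, X, ell=1.0) + 1e-6 * np.eye(n_pts)
    print(n_pts, kl_gaussian(np.zeros(n_pts), S1, np.zeros(n_pts), S2))
```

Because the grids are nested, the marginal KL cannot decrease as the measurement set is refined, which is consistent with taking a supremum over measurement sets in Definition 2.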
In this paper, however, we assume for the theoretical analysis that the measurement set contains all of the training data points. Then, for the same class of stochastic process, the logarithm term in the KL divergence cancels because we are considering a finite measurement set. This simplifies our proof of the consistency of the variational posterior.

The first step in building a VB approach is to predefine a family of distributions over which we optimize. This family is called the variational family, $\mathcal{Q}$. Common practice is to consider a family of distributions that assumes a product form across the dimensions of the parameters, the mean field variational family $\mathcal{Q}_{MF}$. After setting up the variational family, the next step is to find the distribution closest to the true posterior in KL distance over $\mathcal{Q}_{MF}$. Mathematically,
$$
Q^* = \operatorname*{arg\,min}_{Q \in \mathcal{Q}_{MF}} \mathrm{KL}\big(Q \,\|\, \Pi(\cdot\mid\mathcal{D})\big).
$$
The problem with this optimization is that we cannot calculate the evidence $p(\mathcal{D})$. A closer look shows how the KL distance depends on the log evidence:
$$
\mathrm{KL}\big(Q\,\|\,\Pi(\cdot\mid\mathcal{D})\big)
= \mathbb{E}_q[\log q(\boldsymbol{\theta})] - \mathbb{E}_q[\log p(\boldsymbol{\theta}\mid\mathcal{D})]
= \mathbb{E}_q[\log q(\boldsymbol{\theta})] - \mathbb{E}_q[\log p(\boldsymbol{\theta}, \mathcal{D})] + \log p(\mathcal{D}).
$$
Hence we optimize an equivalent objective function, the evidence lower bound (ELBO) $\mathcal{L}(q)$ (Jordan et al., 1999). The name comes from the fact that it lower bounds the log evidence:
$$
\mathcal{L}(q) = \mathbb{E}_q[\log p(\mathcal{D}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})] \le \log p(\mathcal{D}).
$$

3.1.1 Black Box Variational Inference (BBVI)

The VB method has been widely adopted for approximating posterior distributions, but it typically requires significant model-specific analysis to develop the algorithm. This boils down to calculating the ELBO, $\mathbb{E}_q[\log p(\mathcal{D}, \boldsymbol{\theta})] - \mathbb{E}_q[\log q(\boldsymbol{\theta})]$, an expectation that often has no closed-form expression. Ranganath et al. (2014) proposed the Black Box Variational Inference (BBVI) method, a stochastic optimization scheme that maximizes the ELBO using an unbiased noisy gradient. For a set of variational parameters $\boldsymbol{\lambda}$,
$$
\nabla_{\boldsymbol{\lambda}} \mathcal{L}(q) = \mathbb{E}_q\big[\nabla_{\boldsymbol{\lambda}} \log q(\boldsymbol{\Psi}\mid\boldsymbol{\lambda}) \,\big(\log p(\boldsymbol{\Psi}, \mathcal{D}) - \log q(\boldsymbol{\Psi}\mid\boldsymbol{\lambda})\big)\big],
\tag{3.1}
$$
and, with $\boldsymbol{\Psi}_s \sim q(\boldsymbol{\Psi}\mid\boldsymbol{\lambda})$, the unbiased Monte Carlo estimate of this gradient is
$$
\widehat{\nabla}_{\boldsymbol{\lambda}} \mathcal{L}(q) = \frac{1}{S}\sum_{s=1}^{S} \nabla_{\boldsymbol{\lambda}} \log q(\boldsymbol{\Psi}_s\mid\boldsymbol{\lambda}) \,\big(\log p(\boldsymbol{\Psi}_s, \mathcal{D}) - \log q(\boldsymbol{\Psi}_s\mid\boldsymbol{\lambda})\big).
\tag{3.2}
$$
Using this noisy gradient, BBVI updates the variational parameters as $\boldsymbol{\lambda}_{t+1} \leftarrow \boldsymbol{\lambda}_t + \rho_t \widehat{\nabla}_{\boldsymbol{\lambda}} \mathcal{L}(q)$ until the ELBO converges. If the sequence of step sizes $\{\rho_t\}_{t=1}^\infty$ satisfies the Robbins-Monro conditions
$$
\sum_{t=1}^{\infty} \rho_t = \infty, \qquad \sum_{t=1}^{\infty} \rho_t^2 < \infty,
\tag{3.3}
$$
then the algorithm is guaranteed to converge to a local maximum.

3.1.2 Reparametrization Trick

One issue in implementing the above algorithm is that samples of the parameters are not differentiable with respect to the variational parameters $\boldsymbol{\lambda}$. Kingma and Welling (2013) and Rezende et al. (2014) approached this problem by reparametrizing the samples so that the stochasticity is independent of the parameters. By separating out the stochasticity, a sample becomes a deterministic function of the distribution's parameters, so we can differentiate the sample with respect to the variational parameters. For example, a sample $z$ from a normal distribution with mean $\mu$ and variance $\sigma^2$ can equivalently be written as $z = \sigma \epsilon + \mu$, where $\epsilon \sim N(0, 1)$.
This reparametrization trick allows us to sample from a standard normal distribution and then scale by the standard deviation and shift by the mean to obtain a sample from $N(\mu, \sigma^2)$. In this way, we can backpropagate through the samples with respect to the mean and standard deviation.

3.2 Variational Bayes Algorithm for Discrepancy Model

Coming back to the discrepancy model setup, we intend to find a variational approximation to the posterior $p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma \mid \boldsymbol{d})$. First, let $\boldsymbol{\phi} = (\boldsymbol{\phi}_1, \boldsymbol{\phi}_2)$, where $\boldsymbol{\phi}_1$ collects the real-valued parameters and $\boldsymbol{\phi}_2$ the positive-valued parameters. We posit a variational family of the form
$$
\tilde{\mathcal{Q}} = \mathcal{Q}_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}} \times \mathcal{Q}_{\boldsymbol{\theta}} \times \mathcal{Q}_{\boldsymbol{\phi}_1} \times \mathcal{Q}_{\boldsymbol{\phi}_2} \times \mathcal{Q}_{\sigma}.
$$
Here,

Assumption 3.2.1. We assume the following structure on the variational family:
$$
\mathcal{Q}_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}} = \{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi} \sim GP(\boldsymbol{m}^q_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}},\, k^q_{\zeta\mid\boldsymbol{\theta},\boldsymbol{\phi}})\},\quad
\mathcal{Q}_{\boldsymbol{\theta}} = \{\boldsymbol{\theta} \sim \mathcal{N}(\boldsymbol{m}_{\theta}, \boldsymbol{S}_{\theta})\},\quad
\mathcal{Q}_{\boldsymbol{\phi}_1} = \{\boldsymbol{\phi}_1 \sim \mathcal{N}(\boldsymbol{m}_{\phi_1}, \boldsymbol{S}_{\phi_1})\},
$$
$$
\mathcal{Q}_{\boldsymbol{\phi}_2} = \{\tilde{\boldsymbol{\phi}}_2 \sim \mathcal{N}(\boldsymbol{m}_{\phi_2}, \boldsymbol{S}_{\phi_2})\},\quad
\mathcal{Q}_{\sigma} = \{\tilde{\sigma} \sim \mathcal{N}(m_{\sigma}, s_{\sigma})\},
$$
where $\tilde{\boldsymbol{\phi}}_2 = \log \boldsymbol{\phi}_2$ with the log operator applied elementwise and $\tilde{\sigma} = \log \sigma$. Also, under the mean field variational family, the matrices $\boldsymbol{S}_{\theta}$, $\boldsymbol{S}_{\phi_1}$ and $\boldsymbol{S}_{\phi_2}$ are assumed to be diagonal.

Note that the variational distribution of $\zeta$ depends on the calibration parameter $\boldsymbol{\theta}$ and the hyperparameters $\boldsymbol{\phi}$; hence this is not a mean field family. If we use the same covariance function for the variational distribution of $\zeta$ as for the prior of $\zeta$, such as the squared exponential or Matérn covariance function, then the KL divergence between the variational distribution of $\zeta$ and the prior of $\zeta$ becomes zero, as discussed above. We can then simplify the objective KL function:
$$
\mathrm{KL}\big(q(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma)\,\|\,p(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{d})\big)
= \int \log\!\left(\frac{q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma)}{p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{d})}\right) q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma)\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}\, d\sigma.
\tag{3.4}
$$
Therefore the optimal variational distribution $q^*$ is given by
$$
q^* = \operatorname*{arg\,min}_{q \in \mathcal{Q}} \mathrm{KL}\big(q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{\lambda})\,\|\,p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{d})\big),
\tag{3.5}
$$
where $\boldsymbol{\lambda} = (\boldsymbol{m}_{\theta}, \boldsymbol{S}_{\theta}, \boldsymbol{m}_{\phi_1}, \boldsymbol{S}_{\phi_1}, \boldsymbol{m}_{\phi_2}, \boldsymbol{S}_{\phi_2}, m_{\sigma}, S_{\sigma})$ denotes the set of all variational parameters taken together, and $\mathcal{Q} = \mathcal{Q}_{\boldsymbol{\theta}} \times \mathcal{Q}_{\boldsymbol{\phi}_1} \times \mathcal{Q}_{\boldsymbol{\phi}_2} \times \mathcal{Q}_{\sigma}$. In practice, the optimal variational distribution $q^*$ is obtained by maximizing an ELBO,
$$
\mathcal{L}(q) = \mathbb{E}_{q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{\lambda})}\big[\log p(\boldsymbol{d}\mid\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma) + \log p(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma) - \log q(\boldsymbol{\theta}, \boldsymbol{\phi}, \sigma\mid\boldsymbol{\lambda})\big].
\tag{3.6}
$$
We can adopt the Black Box Variational Inference (BBVI) algorithm to solve this optimization problem; that is, form the unbiased Monte Carlo estimate of the derivative of the ELBO,
$$
\begin{aligned}
\nabla_{\boldsymbol{\lambda}}\mathcal{L}(q) &= \mathbb{E}_q\big[\nabla_{\boldsymbol{\lambda}}\log q(\boldsymbol{\theta},\boldsymbol{\phi},\sigma\mid\boldsymbol{\lambda})\big(\log p(\boldsymbol{d}\mid\boldsymbol{\theta},\boldsymbol{\phi},\sigma)+\log p(\boldsymbol{\theta},\boldsymbol{\phi},\sigma)-\log q(\boldsymbol{\theta},\boldsymbol{\phi},\sigma\mid\boldsymbol{\lambda})\big)\big] && (3.7)\\
&\approx \frac{1}{S}\sum_{s=1}^{S}\nabla_{\boldsymbol{\lambda}}\log q(\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s}\mid\boldsymbol{\lambda})\big(\log p(\boldsymbol{d}\mid\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s})+\log p(\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s})-\log q(\boldsymbol{\theta}^{s},\boldsymbol{\phi}^{s},\sigma^{s}\mid\boldsymbol{\lambda})\big) && (3.8)\\
&= \widehat{\nabla}_{\boldsymbol{\lambda}}\mathcal{L}(q), && (3.9)
\end{aligned}
$$
and then use the stochastic gradient descent (SGD) update. The procedure is summarized in Algorithm 1 below. Let $g_1(\boldsymbol{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}) = \boldsymbol{z} \odot \boldsymbol{\sigma} + \boldsymbol{\mu}$ and $g_2(\boldsymbol{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}) = \exp(g_1(\boldsymbol{z}; \boldsymbol{\mu}, \boldsymbol{\sigma}))$.
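As a complement to Algorithm 1 below, the following is a minimal, self-contained sketch of the BBVI update in (3.8) on a hypothetical toy problem (a Gaussian mean with a conjugate prior, not the calibration model). The step size, batch size, and iteration count are illustrative, and the variance-reduction devices of Ranganath et al. (2014) are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: y_i ~ N(theta, 1) with a N(0, 1) prior on theta,
# so the exact posterior is N(sum(y)/(n+1), 1/(n+1)).
y = rng.normal(2.0, 1.0, size=50)
n = y.size

def log_joint(theta):
    # log p(y | theta) + log p(theta), additive constants dropped
    return -0.5 * ((y[None, :] - theta[:, None]) ** 2).sum(axis=1) - 0.5 * theta ** 2

def log_q(theta, m, log_s):
    s2 = np.exp(2.0 * log_s)
    return -0.5 * np.log(2.0 * np.pi * s2) - 0.5 * (theta - m) ** 2 / s2

m, log_s = 0.0, 0.0        # variational parameters lambda = (m, log s)
rho, L = 1e-3, 64          # illustrative step size and batch size
for _ in range(3000):
    eps = rng.standard_normal(L)
    theta = m + np.exp(log_s) * eps                      # g1: reparametrized draws from q
    w = log_joint(theta) - log_q(theta, m, log_s)        # log p(y, theta) - log q(theta | lambda)
    s2 = np.exp(2.0 * log_s)
    score_m = (theta - m) / s2                           # d log q / d m
    score_ls = (theta - m) ** 2 / s2 - 1.0               # d log q / d log s
    grad_m, grad_ls = (score_m * w).mean(), (score_ls * w).mean()   # noisy gradient, cf. (3.8)
    m, log_s = m + rho * grad_m, log_s + rho * grad_ls   # SGD ascent step on the ELBO

print("BBVI estimate:", m, np.exp(log_s))
print("exact posterior:", y.sum() / (n + 1), (1.0 / (n + 1)) ** 0.5)
```

The estimator uses only evaluations of the log joint and the log variational density, which is what makes the approach black box; in the calibration setting the log joint is replaced by $\log p(\boldsymbol{d}\mid\boldsymbol{\theta},\boldsymbol{\phi},\sigma) + \log p(\boldsymbol{\theta},\boldsymbol{\phi},\sigma)$ as in (3.8).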
45 Algorithm 1 BBVI algorithm in Computer Calibration Model Input: Model, tolerance 𝜏, batch size L, learning rate sequence {πœŒπ‘‘ }∞ 𝑑=1 Output: π€βˆ— = 𝝀𝑑 1: Randomly initialize 𝝀0 2: Set 𝑑 ← 0, Ξ” ← 0 3: while increament of ELBO > 𝜏 do 4: 5: 𝑑 ← 𝑑 + 1 (𝜽 1, Β· Β· Β· , 𝜽 𝐿) = 𝑔1(π’˜1, Β· Β· Β· , π’˜ 𝐿; π’Žπœƒ, π‘Ίπœƒ), where (π’˜1, Β· Β· Β· , π’˜ 𝐿) (𝝓1 1 (𝝓1 2 1 ) = 𝑔1(π’˜ 2 ) = 𝑔2(π’˜ β€²1, Β· Β· Β· , π’˜ β€²β€²1, Β· Β· Β· , π’˜ , Β· Β· Β· , 𝝓𝐿 , Β· Β· Β· , 𝝓𝐿 β€² 𝐿; π’Žπœ™1 β€²β€² 𝐿; π’Žπœ™2 β€²β€² 𝐿) β€²1, Β· Β· Β· , π’˜ , π‘Ίπœ™2), where (π’˜ , π‘Ίπœ™2), 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 𝐼2𝑝+𝑝′×2𝑝+𝑝′) 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 1) where (π’˜ β€²β€²1, Β· Β· Β· , π’˜ (𝜎1, Β· Β· Β· 𝜎𝐿) = 𝑔2(𝑒1, Β· Β· Β· , 𝑒𝐿; π‘šπœŽ, π‘†πœŽ), where (𝑒1, Β· Β· Β· , 𝑒𝐿) (cid:205)𝐿 𝑙=1 βˆ‡π€ log π‘ž(𝜽 𝑙, 𝝓𝑙, πœŽπ‘™ |π€π‘‘βˆ’1)Γ— Λ†βˆ‡πœ†L (π‘ž) ← 1 𝐿 (log 𝑝( 𝒅|𝜽 𝑙, 𝝓𝑙, πœŽπ‘™) + log 𝑝(𝜽 𝑙, 𝝓𝑙, πœŽπ‘™) βˆ’ log π‘ž(𝜽 𝑙, 𝝓𝑙, πœŽπ‘™ |π€π‘‘βˆ’1)) 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 𝐼𝑝′×𝑝′) β€² 𝐿) 𝑖.𝑖.𝑑. ∼ 𝑁 (0, 𝐼2Γ—2) 𝝀𝑑 ← π€π‘‘βˆ’1 + πœŒπ‘‘ Λ†βˆ‡πœ†L (π‘ž) 6: 7: 8: 9: 10: 11: 12: 13: end while 3.3 Posterior Contraction Just as MCMC case, we shall assume that 𝜎2 = 1. The results can be extended to unknown 𝜎2 with minor modifications as well. To statistically validate a Bayesian approach, one needs to estab- lish posterior consistency which states that the posterior probability of an arbitrary neighborhood around the true parameter goes to 1 as the sample size 𝑛 goes to infinity. In this section, we first establish the posterior consistency of the true posterior in (2.9). Next, we establish the posterior consistency of the variational posterior in (3.5). We shall use the notation E0 and P0 to denote the expectation and probability of 𝑦𝑖 with respect to the true parameters 𝜁0. In this direction, define π‘„βˆ— = arg min ∫ log 𝑑𝑄(𝜁, πœƒ, πœ™) 𝑑Π(𝜁, πœƒ, πœ™| π’š, 𝒛) 𝑑𝑄(𝜁, πœƒ, πœ™) π‘‘π‘„βˆ— 𝜁 (𝜁) = π‘‘Ξ πœ (𝜁) = π‘‘Ξ πœ (𝜁 | π’š, 𝒛) = π‘„βˆˆQ ∫ ∫ Θ ∫ Ξ¦ ∫ Θ ∫ Ξ¦ ∫ Θ Ξ¦ π‘‘π‘„βˆ—(𝜁, πœƒ, πœ™), 𝑑Π(𝜁, πœƒ, πœ™), 𝑑Π(𝜁, πœƒ, πœ™|π’š, 𝒛) Further, π‘„βˆ— 𝜁 (𝜁), Π𝜁 (𝜁) and Π𝜁 (𝜁 |π’š, 𝒛) denote the marginal distribution of 𝜁 under the variational posterior, the prior, and the true posterior respectively. 46 3.3.1 Assumptions Let πœ–π‘› be a sequence satisfying πœ–π‘› β†’ 0 and π‘›πœ– 2 𝑛 β†’ ∞. Consider the following neighborhood of the true physical process as follows (cid:40) Uπœ–π‘› = 𝜁 : 1 𝑛 𝑛 βˆ‘οΈ 𝑖=1 |𝜁 (𝑑𝑖) βˆ’ 𝜁0(𝑑𝑖)| ≀ πœ–π‘› (cid:41) We consider a sequence of sets {F𝑛}∞ 𝑛=1 such that it increases to include the space of all continuously differentiable functions with bounded derivatives on [0, 1] 𝑝. The following assumption describes the prior distributions on all the unknown parameters under consideration. Assumption 3.3.1. We assume the following structure on the prior distributions of 𝜽, 𝝓1, 𝝓2 and 𝜎2 Ξ (𝜽) ∼ N (π’Ž0 πœƒ, 𝑺0 πœƒ), Ξ (𝝓1) ∼ N (π’Ž0 πœ™1 , 𝑺0 πœ™2 ), Ξ ( Λœπ“2) ∼ N (π’Ž0 πœ™2 , 𝑺0 πœ™2 ), Ξ ( ˜𝜎) ∼ N (π‘š0 𝜎, (𝑠0 𝜎)2), where Λœπ“2 = log 𝝓2 with the log operator being applied element wise and ˜𝜎 = log 𝜎. For simplicity πœƒ, 𝑺0 𝑺0 πœ™1 , and 𝑺0 πœ™2 are assumed to be diagonal matrices. Assumption 3.3.2. 
For some 𝑄 ∈ Q (i) sup𝑖=1,Β·Β·Β· ,𝑛 E𝑄 (cid:2)∫ 𝜁 (𝑑𝑖)𝑑Π(𝜁 |𝜽, 𝝓, 𝒛) βˆ’ 𝜁0(𝑑𝑖)(cid:3) 2 ≀ πœ– 2 𝑛 (ii) K𝐿(𝑄(𝜽, 𝝓)||Ξ (𝜽, 𝝓, 𝒛)) ≀ πΆπ‘›πœ– 2 𝑛 log 𝑛 Assumption 3.3.2 (i) states that there exists a variational member 𝑄 such that under 𝑄 the expectation of difference between the conditional expectation of the physical process 𝜁 under the prior and the true physical process 𝜁0 is negligible. This happens when 𝑄 places overwhelming mass on those values of 𝜽 and 𝝓 for which the conditional expectation of the physical process 𝜁 approaches the true process 𝜁0. For the Squared Exponential covariance and MatΓ©rn covariance class, we establish that Assumption 3.3.2 (i) holds as long as there exists 𝜽 βˆ— 𝑛 and π“βˆ— 𝑛 satisfying ∫ (cid:12) (cid:12) (cid:12) (cid:12) 𝜁 (𝑑𝑖)𝑑Π(𝜁 |𝜽 βˆ— 𝑛, π“βˆ— 𝑛, 𝒛) βˆ’ 𝜁0(𝑑𝑖) (cid:12) (cid:12) (cid:12) (cid:12) ≀ πœ–π‘› (3.11) for some ||𝜽 βˆ— 𝑛||2 ≀ π΅πœ™ with π΅πœƒ and π΅πœ™ independent of 𝑛. Note, (3.11) assumes there exist good choices of 𝜽 and 𝝓 such that the conditional expectation of the physical process 𝜁 under the parameters approaches the true physical process 𝜁0. 47 sup 𝑖=1,Β·Β·Β· ,𝑛 𝑛||2 ≀ π΅πœƒ and ||π“βˆ— Assumption 3.3.2 (ii) states that the KL distance between the variational member 𝑄 and the prior Ξ  is bounded up to an order of π‘›πœ– 2 𝑛 log 𝑛. Although 𝑄 needs to give overwhelming probabilities to good values of 𝜽 and 𝝓, it cannot be too far from the prior Ξ  if variational posterior consistency needs to hold. For the Squared Exponential covariance and MatΓ©rn covariance class, this will hold as long as 𝜽 βˆ— 𝑛 and π“βˆ— 𝑛 satisfying (3.11) are bounded in 𝐿2 norm up to an order of √ 𝑛. Indeed the GP priors as designed in (2.1) and Assumption 3.3.1 on the priors of 𝜽 and 𝝓 together with Assumption 3.2.1 on the variational family Q will be essential components towards the proof Assumption 3.3.2 (i) and (ii). 3.3.2 Main Results Theorem 3. (Variational Posterior Consistency) Suppose Assumption 3.3.2 holds in addition to Assumption 2.2.1 and Assumption 2.2.2, then π‘„βˆ— 𝜁 (U𝑐 πœ€π‘›) βˆ’β†’ 0 in P𝑛 0 probability as 𝑛 β†’ ∞ and πœ€π‘› = πœ–π‘› Note 𝑀𝑛 can be any sequence increasing slowly to ∞. Theorem 3 shows that the variational βˆšοΈπ‘€π‘› log 𝑛, 𝑀𝑛 β†’ ∞. posterior of 𝜁 concentrates around the true parameter value 𝜁0 at the rate πœ€π‘› which is slightly higher than the rate πœ–π‘› for the true posterior. The detailed proof of the Theorem 3 has been provided in the Appendix. However, we next give a brief overview. Proof of Theorem 3: By Corollary 4.15 in Boucheron et al. (2013), we have ∫ 𝑓 π‘‘π‘„βˆ— 𝜁 ≀ K𝐿 (π‘„βˆ— 𝜁 , Π𝜁 (.|π’š, 𝒛)) + log ∫ 𝑒 𝑓 π‘‘Ξ πœ (Β·| π’š, 𝒛) (3.12) Using (3.12) with 𝑓 = πΆπ‘›πœ€2 𝑛1U𝐢 πœ€π‘› with 1 being the indicator function of a set, we get πΆπ‘›πœ€2 π‘›π‘„βˆ— 𝜁 (U𝐢 πœ€π‘›) ≀ K𝐿 (π‘„βˆ— 𝜁 ||Π𝜁 (|π’š, 𝒛)) + log (cid:104) 1 + π‘’πΆπ‘›πœ€2 π‘›Ξ πœ (U𝐢 πœ€π‘› | π’š, 𝒛) (cid:105) (3.13) By Assumptions 2.2.1 and 2.2.2, it can be shown that Π𝜁 (U𝐢 πœ–π‘› |π’š, 𝒛) ≀ exp(βˆ’πΆβ€²π‘›πœ– 2 𝑛) where 𝐢′ > 𝐢. Additionally, under Assumption 3.3.2, it can be shown that K𝐿(π‘„βˆ— 𝜁 ||Π𝜁 (|π’š, 𝒛)) ≀ πΆβ€²β€²π‘›πœ– 2 𝑛 log 𝑛 for some 𝐢′′ > 0. Using these results in Relation (3.13), the proof of Theorem 3 follows. 48 3.4 Simulation Study: Ball Dropping Experiment Figure 3.1 (Functions and data of Ball dropping experiment) The left figure shows the true physical process and the computer model with true calibration parameter. 
The right figure shows the data we obtained from these functions. Let’s consider a famous ball-dropping experiment from Plumlee (2017). A ball is dropped from a chosen vertical position 8 meters above the ground with initial velocity βˆ’0.1 π‘š/𝑠. Then the ball’s vertical position at time π‘₯ is given by following the physical process. 𝜁0(𝑑) = 5 2 log (cid:26) 50 49 βˆ’ 50 49 √ (cid:104) tanhβˆ’1( tanh 0.02) + (cid:105) 2(cid:27) √ 2π‘₯ + 8, π‘₯ ∈ [0, 1] (3.14) We are going to consider an inexact computer model for this physical process with unknown calibration parameter πœƒ as π‘“π‘š (𝑑, πœƒ) = 8 βˆ’ π‘₯ βˆ’ πœƒπ‘₯2 2 (3.15) Note that the calibration parameter represents gravitational acceleration and let the true value of πœƒ to be 10. The left figure in Figure 3.1 shows these two functions in the domain π‘₯ ∈ [0, 1]. We can see that the computer model behaves very similarly to the true physical process, but there is a systematic difference, which will be modeled using GP. For the data generation, we will use Latin hypercube samples for (π‘₯, πœƒ) in the bounded space [0, 1] Γ— [7, 12]. In addition, we are going to add normal random error with a variance 0.01 to the experimental data. Then the location of the training data of size (𝑁, 𝑆) = (30, 200) could 49 0.00.20.40.60.81.0Time: x345678Vertical position of the ballFunctionTrue functionSimulation function with true 0.00.20.40.60.81.0Time: x345678DataExperimental DataSimulation Data Figure 3.2 (Ball dropping example: posterior results of MCMC approach and VI approach) The left figure shows the prior distribution and the posterior distributions of calibration parameters using two different approaches. The right figure presents Root Mean Squared Error for 100 out of the sample physical data. be checked in Figure 3.1, where 𝑁 is the number of experimental data and 𝑆 is the number of simulation data. Figure 3.2 shows the results of the proposed method in terms of posterior distributions and the Root Mean Squared Error (RMSE) for 100 out of the sample physical data. The posterior distributions of the calibration parameter are centered around the truve value, and the proposed VI method shows less uncertainty. This is anticipated since the mean field variational family tends to underestimate the variance. Even in RMSEs the proposed VI method well captures the performance of MCMC approach. The main difference is coming from the running time of each algorithm. The proposed VI methods converged in less than 1 min. This convergence is remarkable compared to MCMC result, which took around 13 minutes for 10000 posterior samples. 3.5 Simulation Study: Logistic Function Let’s consider a simple computer model in the form of a logistic growth function: π‘“π‘š ((𝑑, π‘₯), 𝜽) = (cid:18) 1 + exp 1 + πœƒ1 βˆ’ [1 + π‘₯ + πœƒ2𝑑] (cid:19) (3.16) 50 4681012140.00.20.40.60.8DensityPosterior distributionMCMCVIPrior N(9,2)(30,200)sample size0.010.020.030.040.05RMSERMSE for 100 physical test datatypeMCMCVI where we assume that the calibration parameter 𝜽 takes values in the space [0, 2]2 and the model inputs (𝑑, π‘₯) take values in the space [βˆ’10, 10]2. The true physical process is modeled according to 𝜁0(𝑑, π‘₯) = π‘“π‘š ((𝑑, π‘₯), 𝜽) + 𝛿(𝑑, π‘₯) = (cid:18) 1 + exp 1 + πœƒ1 βˆ’ [1 + π‘₯ + πœƒ2𝑑] (cid:19) ) + 𝛽, (3.17) where 𝛽 = 1 is a constant systematic error of the model and the true value of 𝜽 = (πœƒ1, πœƒ2) are arbitrarily set to be (1, 1.8). 
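For concreteness, here is a small data-generation sketch of my own for this study (not the thesis code). It assumes the logistic form $f_m((t,x),\boldsymbol{\theta}) = 1/\{1+\exp(\theta_1 - [1 + x + \theta_2 t])\}$ for (3.16), and the observation noise level is an assumption, since it is not stated for this example.

```python
import numpy as np
from scipy.stats import qmc   # Latin hypercube designs, as described for these studies

rng = np.random.default_rng(1)

def f_m(t, x, theta1, theta2):
    # assumed reading of the computer model (3.16): a logistic growth surface in (t, x)
    return 1.0 / (1.0 + np.exp(theta1 - (1.0 + x + theta2 * t)))

def zeta0(t, x, theta1=1.0, theta2=1.8, beta=1.0):
    # true process (3.17): the computer model at the true theta plus a constant bias beta = 1
    return f_m(t, x, theta1, theta2) + beta

n, s = 500, 1000   # experimental and simulation sample sizes for the largest case

# experimental inputs (t, x) on [-10, 10]^2, observed with noise (noise sd is an assumption)
X_exp = qmc.scale(qmc.LatinHypercube(d=2, seed=1).random(n), [-10, -10], [10, 10])
y = zeta0(X_exp[:, 0], X_exp[:, 1]) + rng.normal(0.0, 0.1, size=n)

# simulation inputs (t, x, theta1, theta2) with the calibration parameters in [0, 2]^2
D_sim = qmc.scale(qmc.LatinHypercube(d=4, seed=2).random(s),
                  [-10, -10, 0, 0], [10, 10, 2, 2])
z = f_m(D_sim[:, 0], D_sim[:, 1], D_sim[:, 2], D_sim[:, 3])
```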
Since there are two calibration parameters, we consider three sample sizes to compare the time gain and the fidelity of the variational approximation. The locations of the data in the two-dimensional input space are shown in Figure 3.3 for the case $(n, s) = (500, 1000)$. The posterior distributions of the calibration parameters obtained with MCMC and VI are compared in Figure 3.4. As one can see, the MCMC contours often have dented shapes, but the VI method approximates the posterior density well, with rounded contours sitting on top of the MCMC result. As the sample size increases, the uncertainty of the MCMC results decreases, with smaller standard deviations for both $\theta_1$ and $\theta_2$, and the VI results show very similar behavior. The same pattern can be seen in Table 3.1, which reports the mean and standard deviation for each sample size.

Figure 3.3 (Functions and data of Logistic growth function example) The far left figure shows the true logistic growth function with the true calibration parameter. The middle figure shows the 500 experimental data points and the right figure shows the 1000 simulation data points.

Figure 3.4 (Posterior distributions of the two calibration parameters) The blue lines represent the posterior distribution from the MCMC method, the orange color is the variational approximation, and the red dot marks the true value of the calibration parameters. We consider three sample sizes to compare the performance of the variational approximation.

(n, s)        Method   θ1            θ2            Condition (3.11)
(125, 250)    MCMC     0.89 (0.08)   1.77 (0.09)   1.326
              VB       0.89 (0.05)   1.76 (0.08)
(250, 500)    MCMC     0.95 (0.04)   1.72 (0.03)   1.045
              VB       0.93 (0.04)   1.72 (0.03)
(500, 1000)   MCMC     0.97 (0.03)   1.76 (0.02)   1.027
              VB       0.96 (0.02)   1.75 (0.01)

Table 3.1 Posterior means and standard deviations (in parentheses) of the calibration parameters. Here, $n$ is the number of experimental observations and $s$ is the number of computer model evaluations.

Table 3.1 also reports condition (3.11) for the variational family, the supremum over the design points of the difference between the mean of the GP prior given $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ and the true physical process,
$$
\sup_{i=1,\cdots,n} \left| \int \zeta(t_i)\, d\Pi(\zeta \mid \boldsymbol{\theta}^*_n, \boldsymbol{\phi}^*_n, \boldsymbol{z}) - \zeta_0(t_i) \right| \le \epsilon_n.
$$
The left-hand side is computed in this example, and its value decreases toward 1 as the sample size increases. It does not decrease to 0 here because the simulation model in this setup carries a constant bias of 1, so the quantity converges to that value instead. If the simulation model approximated the real physical process well enough that the discrepancy were negligible, the mean of the GP prior updated by the simulation data would be very close to the true physical process and the upper bound on the left-hand side would decrease to 0.

The big difference between the two approaches is in running time. The upper panels of Figure 3.5 show the convergence of the ELBO for the VI approach as a function of time in hours, together with the posterior sampling time for the MCMC approach. ELBO convergence is quite fast compared to the MCMC method. For example, in the $(n, s) = (500, 1000)$ case, the MCMC algorithm took 14 hours to produce 10,000 posterior samples, while the VI approach took around 30 minutes to reach ELBO convergence. In other words, by the time the ELBO converged, the MCMC method had produced at most 450 posterior samples across the three sample sizes. The Root Mean Squared Error (RMSE) on the 100 out-of-sample test points also converges quickly for the VI method and is quite close to the MCMC result. The MCMC approach also converges quite quickly in this simple simulation setup, but acquiring enough posterior samples for inference is very time-consuming compared to the VI approach. In short, the proposed VI method is consistent with MCMC in this simple example, with slightly smaller uncertainty. The computational time, on the other hand, is very different: the VI approach is up to 23 times faster than the MCMC approach in this example.

Figure 3.5 (Time comparison for the Logistic growth function example between the MCMC approach and the VI approach) The upper figures show the change of the ELBO over time for the VI approach, and the lower figures show the change of the Root Mean Squared Error (RMSE) for 100 out-of-sample physical data points over time.

3.6 Conclusion and Discussion

The original KOH model is a powerful approach for computer model calibration and estimation of physical processes. However, the standard MCMC implementation is nontrivial and computationally expensive due to the use of GPs. We addressed the scalability issue of the KOH model by replacing MCMC with a VB approximation, which led to a significant reduction in the code's runtime. Our simulation study shows that the VB method produces a posterior approximation of high fidelity with much less computation time than MCMC. In the future, we plan to deploy the proposed VB approach on a wide range of validation studies, including a real data example.

On the theoretical end, we first established the posterior contraction rate of the true posterior, followed by the posterior contraction rate for the variational approximation. As reflected in the theoretical analysis, the contraction rate of the proposed VB approach is only slightly slower than that of the true posterior. In the theoretical section, we assumed the parameter $\sigma^2$ to be known; extending to the case of unknown $\sigma^2$ will be an interesting line of work. Further, our theoretical development relies on the existence of suitable choices of $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ such that the conditional expectation of the computer model approaches the true physical process $\zeta_0$.

BIBLIOGRAPHY

Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877.

Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.

Csató, L. (2002). Gaussian processes: Iterative sparse approximations. Ph.D. thesis, Aston University.

Csató, L. and Opper, M. (2002). Sparse on-line Gaussian processes. Neural Computation, 14(3):641–668.

de G. Matthews, A. G., Hensman, J., Turner, R. E., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Artificial Intelligence and Statistics, pages 231–239.

Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., and Saul, L. K. (1999).
An introduction to variational methods for graphical models. Machine Learning, 37:183–233. Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Kolmogorov, A. N., Shiryaev, A. N., and Lindquist, G. (1992). Selected works of A. N. Kolmogorov. Volume II. Kluwer Academic Publishers. Lee, H. (2000). Consistency of posterior distributions for neural networks. Neural Network, 13:629–642. Plumlee, M. (2017). Bayesian calibration of inexact computer models. Journal of the American Statistical Association, 112:1274–1285. Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, volume 33 of Proceedings of Machine Learning Research, pages 814–822. PMLR. Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–1286. PMLR. Seeger, M. (2003a). Bayesian gaussian process models: Pac-bayesian generalisation error bounds and sparse approximations. Ph.D. thesis, University of Edinburgh. Seeger, M. (2003b). Pac-bayesian generalisation error bounds for gaussian process classification. 55 Journal of Machine Learning Research, 3:233–269. Sun, S., Zhang, G., Shi, J., and Grosse, R. (2019). Functional variational bayesian neural networks. International Conference on Learning Representations. 56 APPENDIX A CHECKING ASSUMPTION 3.3.2 (I) Under Assumption 3.3.2 (i) using (3.11), 𝜽 βˆ—, π“βˆ— depends on 𝑛. However, for notation simplicity we suppress 𝑛 in the remaining proof. Now, let |E𝜁 |𝜽,π“πœ (𝑑𝑖) βˆ’ 𝜁0(𝑑𝑖)| = 𝑓𝑖 (𝜽, 𝝓) Further define A = {|πœƒ 𝑗 βˆ’ πœƒβˆ— 𝑗 | ≀ πœ–π‘›, |𝝓1, 𝑗 βˆ’ π“βˆ— 1, 𝑗 | ≀ πœ–π‘›, | log 𝝓2, 𝑗 βˆ’ log π“βˆ— 2, 𝑗 | ≀ πœ–π‘›}, Eπ‘ž(𝜽,𝝓) |(𝜁0(𝑑𝑖) βˆ’ E𝜁 |𝜽,𝝓,𝒛 (𝜁 (𝑑𝑖))|2 = ∫ 𝑖 (𝜽, 𝝓)π‘ž(𝜽, 𝝓)π‘‘πœƒπ‘‘πœ™ 𝑓 2 = ≀ ∫ A ∫ A 𝑖 (𝜽, 𝝓)π‘ž(𝜽, 𝝓)π‘‘πœƒπ‘‘πœ™ + 𝑓 2 ∫ A𝑐 𝑖 (𝜽, 𝝓)π‘ž(𝜽, 𝝓)π‘‘πœƒπ‘‘πœ™ 𝑓 2 ( 𝑓 2 𝑖 (𝜽 βˆ—, π“βˆ—) + πΆπœ– 2 𝑛)π‘ž(𝜽, 𝝓)π‘‘πœ½ 𝑑𝝓 + (cid:18) max π‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ ¯𝐢 + |𝜁0(𝑑𝑖)| (cid:19) 2 𝑄(A𝑐) ≀ πΆπœ– 2 𝑛 + 𝐢 exp(βˆ’π‘›πœ– 2 𝑛) The first inequality above holds since 𝑓 2 𝑖 (𝜽, 𝝓) is a continuous function in 𝜽 and 𝝓. Thus, for every 𝛿1𝑛 > 0, there exists 𝛿2𝑛 > 0 such that for every (𝜽, 𝝓) ∈ B ((𝜽 βˆ—, π“βˆ—), 𝛿2𝑛), 𝑓 2 𝑖 (𝜽, 𝝓) ≀ 𝑓 2 𝑖 (𝜽 βˆ—, π“βˆ—) + 𝛿1𝑛 where B ((𝜽 βˆ—, π“βˆ—), 𝛿𝑛) is a ball of radius 𝛿𝑛 around (𝜽 βˆ—, π“βˆ—). Let πœ–π‘› = min( 𝛿1𝑛, 𝛿2𝑛). √ To prove the second inequality 𝑄(A𝑐) ≲ exp(βˆ’π‘›πœ– 2 𝑛), we use the tail bounds for normal distribution which states if 𝑋 is normally distributed, P(|𝑋 βˆ’ πœ‡|/𝜎 β‰₯ π‘₯) ≀ exp(βˆ’π‘₯2) as π‘₯ β†’ ∞ in addition to the the fact that 𝑄(( 𝐴1 ∩ Β· Β· Β· ∩ π΄π‘Ÿ)𝑐) = (cid:205)π‘Ÿ 𝑄( 𝐴𝑐 𝑖 ). To be specific, √ 𝑄( 𝑛|πœƒ 𝑗 βˆ’ πœƒβˆ— 𝑗 | β‰₯ √ π‘›πœ–π‘›) ≀ 2π‘πœ½ exp(βˆ’π‘›πœ– 2 𝑛) π‘πœ½(cid:214) 𝑗=1 𝑝𝝓1(cid:214) 𝑗=1 𝑄(|πœ™1, 𝑗 βˆ’ π“βˆ— 1, 𝑗 | β‰₯ πœ–π‘›) ≀ 2𝑝𝝓1 exp(βˆ’π‘›πœ– 2 𝑛) 𝑄(| log πœ™2, 𝑗 βˆ’ log πœ™βˆ— 2, 𝑗 | β‰₯ πœ–π‘›) ≀ 2𝑝𝝓2 exp(βˆ’π‘›πœ– 2 𝑛) 𝑝𝝓2(cid:214) 𝑗=1 Finally, | 𝑓𝑖 (𝜽, 𝝓)| ≀ maxπ‘˜βˆˆ{1,Β·Β·Β· ,𝑠} π‘§π‘˜ ¯𝐢 + |𝜁0(𝑑𝑖)| follows as a consequence of (C.4). 
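The calculations in the next appendix expand, term by term, KL divergences between univariate Gaussians. As a sanity check of my own (not part of the thesis), the resulting closed form can be verified numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_normal(m, s, mu, sigma):
    # closed form KL( N(m, s^2) || N(mu, sigma^2) )
    return np.log(sigma / s) + (s ** 2 + (m - mu) ** 2) / (2.0 * sigma ** 2) - 0.5

# Monte Carlo estimate of the same quantity: E_q[ log q(x) - log p(x) ] with x ~ q
m, s, mu, sigma = 0.7, 0.2, 0.0, 1.0
x = rng.normal(m, s, size=200_000)
log_q = -0.5 * np.log(2 * np.pi * s ** 2) - 0.5 * (x - m) ** 2 / s ** 2
log_p = -0.5 * np.log(2 * np.pi * sigma ** 2) - 0.5 * (x - mu) ** 2 / sigma ** 2
print(kl_normal(m, s, mu, sigma), np.mean(log_q - log_p))  # the two should agree to ~1e-3
```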
APPENDIX B

CHECKING ASSUMPTION 3.3.2 (II)

Assume independent priors for 𝜽 and 𝝓, with matching mean-field variational families:

$$p(\boldsymbol{\theta}) = \prod_{i=1}^{p_{\boldsymbol{\theta}}} N(\mu_{\theta,i}, \sigma^2_{\theta,i}), \qquad q(\boldsymbol{\theta}) = \prod_{i=1}^{p_{\boldsymbol{\theta}}} N(m_{\theta,i}, s^2_{\theta,i}),$$
$$p(\boldsymbol{\phi}_1) = \prod_{i=1}^{p_{\boldsymbol{\phi}_1}} N(\mu_{\phi_1,i}, \sigma^2_{\phi_1,i}), \qquad q(\boldsymbol{\phi}_1) = \prod_{i=1}^{p_{\boldsymbol{\phi}_1}} N(m_{\phi_1,i}, s^2_{\phi_1,i}),$$
$$p(\boldsymbol{\phi}_2) = \prod_{i=1}^{p_{\boldsymbol{\phi}_2}} \mathrm{LN}(\log \mu_{\phi_2,i}, \sigma^2_{\phi_2,i}), \qquad q(\boldsymbol{\phi}_2) = \prod_{i=1}^{p_{\boldsymbol{\phi}_2}} \mathrm{LN}(\log m_{\phi_2,i}, s^2_{\phi_2,i}).$$

Note that Q(𝜽, 𝝓) = Q(𝜽)Q(𝝓1)Q(𝝓2) and Ξ (𝜽, 𝝓|𝒛) = Ξ (𝝓1|𝒛)Ξ (𝝓2|𝒛)Ξ (𝜽). Therefore

$$\mathrm{KL}\big( Q(\zeta,\boldsymbol{\theta},\boldsymbol{\phi}) \,\|\, \Pi(\zeta,\boldsymbol{\theta},\boldsymbol{\phi} \mid \boldsymbol{z}) \big) = \mathrm{KL}\big( Q(\boldsymbol{\phi}_1) \,\|\, \Pi(\boldsymbol{\phi}_1 \mid \boldsymbol{z}) \big) + \mathrm{KL}\big( Q(\boldsymbol{\phi}_2) \,\|\, \Pi(\boldsymbol{\phi}_2 \mid \boldsymbol{z}) \big) + \mathrm{KL}\big( Q(\boldsymbol{\theta}) \,\|\, \Pi(\boldsymbol{\theta}) \big).$$

Each KL term can be computed in closed form. With 𝑝(πœƒπ‘–) = N(πœ‡πœƒ,𝑖, πœŽΒ²πœƒ,𝑖) and π‘ž(πœƒπ‘–) = N(π‘šπœƒ,𝑖, π‘ Β²πœƒ,𝑖),

$$\mathrm{KL}\big( Q(\boldsymbol{\theta}) \,\|\, \Pi(\boldsymbol{\theta}) \big) = -\frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \log s^2_{\theta,i} - \frac{p_{\boldsymbol{\theta}}}{2} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \log \sigma^2_{\theta,i} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \left[ \frac{s^2_{\theta,i}}{\sigma^2_{\theta,i}} + \left( \frac{m_{\theta,i} - \mu_{\theta,i}}{\sigma_{\theta,i}} \right)^2 \right].$$

Choose π‘šπœƒ,𝑖 = πœƒ0,𝑖 and π‘ Β²πœƒ,𝑖 = 1/𝑛. Then, for a positive constant 𝐢1 > 0,

$$\mathrm{KL}\big( Q(\boldsymbol{\theta}) \,\|\, \Pi(\boldsymbol{\theta}) \big) = \frac{p_{\boldsymbol{\theta}}}{2}\log n - \frac{p_{\boldsymbol{\theta}}}{2} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \log \sigma^2_{\theta,i} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\theta}}} \left[ \frac{1}{n\sigma^2_{\theta,i}} + \left( \frac{\theta_{0,i} - \mu_{\theta,i}}{\sigma_{\theta,i}} \right)^2 \right] = o(C_1 n\epsilon_n^2 \log n),$$

where the last step follows from Assumption 8 (i). By the same argument, with 𝑝(πœ™1,𝑖) = N(πœ‡πœ™1,𝑖, πœŽΒ²πœ™1,𝑖), π‘ž(πœ™1,𝑖) = N(π‘šπœ™1,𝑖, π‘ Β²πœ™1,𝑖), π‘šπœ™1,𝑖 = πœ™0,1,𝑖, and π‘ Β²πœ™1,𝑖 = 1/𝑛,

$$\mathrm{KL}\big( Q(\boldsymbol{\phi}_1) \,\|\, \Pi(\boldsymbol{\phi}_1 \mid \boldsymbol{z}) \big) = o(C_1 n\epsilon_n^2 \log n).$$

For 𝝓2, let 𝑝(log πœ™2,𝑖) = N(log πœ‡πœ™2,𝑖, πœŽΒ²πœ™2,𝑖) and π‘ž(log πœ™2,𝑖) = N(log π‘šπœ™2,𝑖, π‘ Β²πœ™2,𝑖). Since the KL divergence is invariant under the one-to-one transformation πœ™2,𝑖 ↦ log πœ™2,𝑖, the log-normal term reduces to the Gaussian case on the log scale. Choosing π‘šπœ™2,𝑖 = πœ™0,2,𝑖 and π‘ Β²πœ™2,𝑖 = 1/𝑛 gives, for a positive constant 𝐢1 > 0,

$$\mathrm{KL}\big( Q(\boldsymbol{\phi}_2) \,\|\, \Pi(\boldsymbol{\phi}_2 \mid \boldsymbol{z}) \big) = \frac{p_{\boldsymbol{\phi}_2}}{2}\log n - \frac{p_{\boldsymbol{\phi}_2}}{2} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\phi}_2}} \log \sigma^2_{\phi_2,i} + \frac{1}{2}\sum_{i=1}^{p_{\boldsymbol{\phi}_2}} \left[ \frac{1}{n\sigma^2_{\phi_2,i}} + \left( \frac{\log \phi_{0,2,i} - \log \mu_{\phi_2,i}}{\sigma_{\phi_2,i}} \right)^2 \right] = o(C_1 n\epsilon_n^2 \log n),$$

where the last step again follows from Assumption 8 (i). This completes the proof.

APPENDIX C

PROOF OF THEOREM 3

Based on the sketch of the proof of Theorem 3 in the main paper, the complete proof boils down to showing that KL(Q*_𝜁 β€– Π𝜁(Β·|π’š, 𝒛)) ≀ 𝐢′′ 𝑛 πœ–π‘›Β² log 𝑛 for some 𝐢′′ > 0. This rests on the following lemmas. Again, recall that 𝒛 is implicitly conditioned on in the prior.

Lemma C.0.1 (Inequality for the KL divergence).

$$\mathrm{KL}\big( Q^*_{\zeta} \,\|\, \Pi_{\zeta}(\cdot \mid \boldsymbol{y}, \boldsymbol{z}) \big) \le \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big). \tag{C.1}$$

Lemma C.0.2. There exists a constant 𝐢2 such that for any πœ–π‘› β†’ 0 with π‘›πœ–π‘›Β² β†’ ∞,

$$P_0\left\{ \left| \log \frac{p(\boldsymbol{y} \mid \boldsymbol{z})}{p(\boldsymbol{y} \mid \zeta_0)} \right| < C_2\, n\epsilon_n^2 \log n \right\} \longrightarrow 1. \tag{C.2}$$

Lemma C.0.3. There exists a constant 𝐢3 such that for any πœ–π‘› β†’ 0 with π‘›πœ–π‘›Β² β†’ ∞,

$$P_0\left\{ \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right] \le C_3\, n\epsilon_n^2 \log n \right\} \longrightarrow 1. \tag{C.3}$$

Lemma C.0.4. Using Lemma C.0.2, Lemma C.0.3, and Assumption 3.3.2 (ii) in the main paper, it can be established with dominating probability that, for some 𝐢9 > 0, as 𝑛 β†’ ∞,

$$\mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big) \le C_9\, n\epsilon_n^2 \log n. \tag{C.4}$$

Using Lemma C.0.4, we then show that the KL divergence between the variational posterior and the true posterior is bounded.
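Before turning to the proofs of these lemmas, note that the Gaussian KL terms computed in Appendix B above admit the closed form KL(N(m, sΒ²) β€– N(πœ‡, 𝜎²)) = log(𝜎/s) + (sΒ² + (m βˆ’ πœ‡)Β²)/(2𝜎²) βˆ’ 1/2, and the log-normal term reduces to the same formula on the log scale. A minimal numerical sanity check of this formula is sketched below; it is an illustration only, with arbitrary values, and assumes NumPy/SciPy are available.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

def kl_normal(m, s, mu, sigma):
    """Closed-form KL( N(m, s^2) || N(mu, sigma^2) ), as used term-by-term in Appendix B."""
    return np.log(sigma / s) + (s**2 + (m - mu)**2) / (2 * sigma**2) - 0.5

# Monte Carlo check: E_q[ log q(x) - log p(x) ] with q = N(m, s^2) and p = N(mu, sigma^2).
m, s, mu, sigma = 0.3, 0.1, 0.0, 1.0          # e.g. s^2 = 1/n with n = 100 (illustrative values)
x = rng.normal(m, s, size=500_000)
mc = np.mean(norm.logpdf(x, m, s) - norm.logpdf(x, mu, sigma))
print(kl_normal(m, s, mu, sigma), mc)          # the two values should agree closely
```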
C.0.1 Proof of Lemma C.0.1

Note that

$$\mathrm{KL}\big( Q^*_{\zeta} \,\|\, \Pi_{\zeta}(\cdot \mid \boldsymbol{y}, \boldsymbol{z}) \big) = \int \log \frac{dQ^*_{\zeta}(\zeta)}{d\Pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ^*_{\zeta}(\zeta) \le \int \log \frac{dQ^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{d\Pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})$$
$$\le \int \log \frac{dQ(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{d\Pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) = \int \log \frac{q(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi})\, q(\boldsymbol{\theta}, \boldsymbol{\phi})}{\pi(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{y}, \boldsymbol{z})\, \pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) = \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big),$$

where the last equality holds because π‘ž(𝜁 | 𝜽, 𝝓) = πœ‹(𝜁 | 𝜽, 𝝓, π’š, 𝒛). The first inequality follows from the argument below, and the second inequality holds because Q* is the minimizer of KL(Q β€– Ξ (Β·|π’š, 𝒛)) over the variational family.

Proof of the first inequality. Write

$$\int \log \frac{dQ^*_{\zeta}(\zeta)}{d\Pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})}\, dQ^*_{\zeta}(\zeta) = \int \log \frac{q^*_{\zeta}(\zeta)}{\pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})}\, q^*_{\zeta}(\zeta)\, d\zeta.$$

For fixed 𝜁, consider the density $\tilde{\pi}_{\zeta}(\boldsymbol{\theta}, \boldsymbol{\phi}) = \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big/ \int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}$ and the ratio $r(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) = q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) / \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})$. Then

$$\frac{q^*_{\zeta}(\zeta)}{\pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})} = \frac{\int q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}}{\int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}} = \mathrm{E}_{\tilde{\pi}_{\zeta}}[r],$$

so that, applying Jensen's inequality to the convex function π‘₯ ↦ π‘₯ log π‘₯,

$$q^*_{\zeta}(\zeta) \log \frac{q^*_{\zeta}(\zeta)}{\pi_{\zeta}(\zeta \mid \boldsymbol{y}, \boldsymbol{z})} = \Big( \int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi} \Big)\, \mathrm{E}_{\tilde{\pi}_{\zeta}}[r] \log \mathrm{E}_{\tilde{\pi}_{\zeta}}[r] \le \Big( \int \pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})\, d\boldsymbol{\theta}\, d\boldsymbol{\phi} \Big)\, \mathrm{E}_{\tilde{\pi}_{\zeta}}[r \log r]$$
$$= \int q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) \log \frac{q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{\pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}.$$

Integrating over 𝜁 gives

$$\mathrm{KL}\big( Q^*_{\zeta} \,\|\, \Pi_{\zeta}(\cdot \mid \boldsymbol{y}, \boldsymbol{z}) \big) \le \int q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi}) \log \frac{q^*(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi})}{\pi(\zeta, \boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z})}\, d\boldsymbol{\theta}\, d\boldsymbol{\phi}\, d\zeta,$$

which completes the proof.

C.0.2 Proof of Lemma C.0.2

Let $L^* = p(\boldsymbol{y} \mid \boldsymbol{z}) = \int_{\mathcal{S}} p(\boldsymbol{y} \mid \zeta)\, d\Pi(\zeta \mid \boldsymbol{z})$ and $L(\zeta_0) = p(\boldsymbol{y} \mid \zeta_0)$.
By Markov's inequality, we have

$$P_0\left\{ \left| \log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, d\Pi(\zeta \mid \boldsymbol{z}) \right| > C_2\, n\epsilon_n^2 \log n \right\} \le \frac{1}{C_2\, n\epsilon_n^2 \log n}\, \mathrm{E}_0\left| \log \frac{L^*}{L(\zeta_0)} \right| \le \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ \mathrm{KL}\big( L(\zeta_0) \,\|\, L^* \big) + \frac{2}{e} \right\}$$
$$= \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ \mathrm{E}_0\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{\int p(\boldsymbol{y} \mid \zeta)\, d\Pi(\zeta \mid \boldsymbol{z})} \right] + \frac{2}{e} \right\} \le \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ \mathrm{E}_0\big[ -\log \Pi_{\zeta}(B_n \mid \boldsymbol{z}) + C_4\, n\epsilon_n^2 \big] + \frac{2}{e} \right\}$$
$$\le \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ C n\epsilon_n^2 + C_4\, n\epsilon_n^2 + \frac{2}{e} \right\} = \frac{1}{C_2\, n\epsilon_n^2 \log n} \left\{ (C + C_4)\, n\epsilon_n^2 + \frac{2}{e} \right\} \longrightarrow 0.$$

For the second inequality we used Lemma 4 of Lee (2000); for the second-to-last inequality we applied the logarithm to the result of (A.4).

C.0.3 Proof of Lemma C.0.3

Note that

$$\mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right] = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \frac{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})}{p(\boldsymbol{y} \mid \zeta_0)} \right] = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, p(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{z})\, d\zeta \right].$$

Now, using Markov's inequality, Jensen's inequality, and Fubini's theorem,

$$P_0\left\{ \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, p(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{z})\, d\zeta \right] \ge C_3\, n\epsilon_n^2 \log n \right\} \le \frac{1}{C_3\, n\epsilon_n^2 \log n}\, \mathrm{E}_0\, \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ -\log \int \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, p(\zeta \mid \boldsymbol{\theta}, \boldsymbol{\phi}, \boldsymbol{z})\, d\zeta \right]$$
$$\le \frac{1}{C_3\, n\epsilon_n^2 \log n}\, \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\, \mathrm{E}_{\zeta \mid \boldsymbol{\theta},\boldsymbol{\phi},\boldsymbol{z}}\, \mathrm{E}_0\left[ -\log \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)} \right] \le \frac{1}{C_3\, n\epsilon_n^2 \log n} \left[ C\epsilon_n^2 + C\epsilon_n^2 \exp(-n\epsilon_n^2) \right] \longrightarrow 0,$$

where the second inequality is due to Jensen's inequality. For the last inequality, expanding the Gaussian log-likelihood ratio and taking expectations gives

$$\mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\, \mathrm{E}_{\zeta \mid \boldsymbol{\theta},\boldsymbol{\phi},\boldsymbol{z}}\, \mathrm{E}_0\left[ -\log \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)} \right] = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \frac{1}{2} \sum_{i=1}^{n} \big( \zeta_0(t_i) - \mathrm{E}_{\zeta \mid \boldsymbol{\theta},\boldsymbol{\phi},\sigma,\boldsymbol{z}}[\zeta(t_i)] \big)^2 \right] \le C\epsilon_n^2 + C\epsilon_n^2 \exp(-n\epsilon_n^2),$$

where the last inequality is a consequence of Assumption 3.3.2 (i).
C.0.4 Proof of Lemma C.0.4

Note that

$$\mathrm{KL}\big( q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big) = \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\big[ \log q(\boldsymbol{\theta}, \boldsymbol{\phi}) - \log p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi}) - \log p(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{z}) - \log p(\boldsymbol{z}) + \log p(\boldsymbol{y}, \boldsymbol{z}) \big]$$
$$= \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{z}) \big) + \log \frac{p(\boldsymbol{y}, \boldsymbol{z})}{p(\boldsymbol{z})} + \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{1}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right].$$

Hence,

$$\mathrm{KL}\big( q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, p(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{y}, \boldsymbol{z}) \big) = \mathrm{KL}\big( Q(\boldsymbol{\theta}, \boldsymbol{\phi}) \,\|\, \Pi(\boldsymbol{\theta}, \boldsymbol{\phi} \mid \boldsymbol{z}) \big) + \log \int_{\mathcal{S}} \frac{p(\boldsymbol{y} \mid \zeta)}{p(\boldsymbol{y} \mid \zeta_0)}\, d\Pi(\zeta \mid \boldsymbol{z}) + \mathrm{E}_{q(\boldsymbol{\theta},\boldsymbol{\phi})}\left[ \log \frac{p(\boldsymbol{y} \mid \zeta_0)}{p(\boldsymbol{y} \mid \boldsymbol{z}, \boldsymbol{\theta}, \boldsymbol{\phi})} \right]$$
$$\le C_1\, n\epsilon_n^2 \log n + C_2\, n\epsilon_n^2 \log n + C_3\, n\epsilon_n^2 \log n \le C\, n\epsilon_n^2 \log n$$

with dominating probability, where the three terms are bounded by Assumption 3.3.2 (ii), Lemma C.0.2, and Lemma C.0.3, respectively.

CHAPTER 4

BAYESIAN MODEL CALIBRATION FOR BETA DECAY CALCULATIONS

Figure 4.1 Calibration framework in π›½βˆ’-decay calculation.

Accurate predictions of the 𝛽-decay rates of neutron-rich nuclei are crucial for π‘Ÿ-process studies. The recent development of the proton-neutron finite amplitude method (pnFAM) makes global 𝛽-decay studies feasible within nuclear density functional theory. In pnFAM calculations with the Skyrme functional, the time-odd terms, the isoscalar pairing, and the axial-vector coupling strongly impact the results. However, they are not constrained by the properties of even-even nuclei. Thus, model calibration and uncertainty quantification of 𝛽-decay predictions are necessary. In this chapter, we conduct Bayesian model calibration with selected experimental data and simulation data, and obtain the posterior distribution of the parameters together with reliable predictions. We consider various calibration schemes and compare the results with those of πœ’2 optimization.

From a statistical point of view, this problem can be summarized as follows. Consider a physics model π‘“π‘š that takes two inputs, 𝒕 and 𝜽. The input 𝒕 is known: it consists of the number of protons 𝑍 and the number of neutrons 𝑁. The model also has another set of inputs, the calibration parameters 𝜽, consisting of the dimensionless Landau-Migdal parameter 𝑔′0, the dimensionless isoscalar pairing strength 𝑣0, and the effective axial-vector coupling (quenching effect) 𝑔𝐴. If we plug all five values into π‘“π‘š, the physics model returns two outputs: the Gamow-Teller resonance (GTR) energy and the 𝛽-decay rate. This relationship is visualized in Figure 4.1.

The original framework proposed by Kennedy and O'Hagan (2001) can be computationally demanding since it estimates the emulator 𝑓, the discrepancy 𝛿, and the unknown parameters 𝜽 simultaneously using the joint dataset 𝒅 = (π’š, 𝒛). At the same time, this approach is flexible and generally yields a good fit to the experimental data, at the expense of a potential identifiability issue for the calibration parameters. For example, a poor fit of the emulator can be compensated by the discrepancy term, which can lead to non-physical values of the calibration parameters while the statistical model still fits the training data well (Bayarri et al., 2009). We shall refer to this simultaneous estimation as the integrated approach.
In contrast to the integrated approach, one can fit the emulator first using only the computer model evaluations 𝒛 and then find the posterior distribution of the unknown parameters using the experimental observations π’š. This two-stage estimation is known as the modular approach (Bayarri et al., 2009). The modular approach tends to be computationally less challenging than the integrated approach, as it does not require the evaluation of the joint likelihood 𝑝(𝒅|𝜽). Additionally, while the fitted emulator obtained via the integrated approach can be somewhat inconsistent with the computer model, the modular approach guarantees the performance of the emulator, since it is fitted using computer model evaluations only. From a physical point of view, the modular approach appears more natural, since one is often concerned with the performance of the emulator together with the estimation of the calibration parameters. Hence, we will compare and contrast these two estimation approaches in the context of calibration with GTR energies and 𝛽-decay rates.

In addition, the model discrepancy term often adds another level of complexity when interpreting the results. Hence, we also consider the case where there is no model discrepancy term. To be specific, we consider the following four models: Integrated (𝐼), Integrated without model discrepancy (𝐼 βˆ’ 𝛿), Modular (𝑀), and Modular without discrepancy (𝑀 βˆ’ 𝛿):

$$(I): \quad y_i = f(\boldsymbol{t}_i, \boldsymbol{\theta}) + \delta(\boldsymbol{t}_i) + \sigma \epsilon_i, \tag{4.1}$$
$$(I - \delta): \quad y_i = f(\boldsymbol{t}_i, \boldsymbol{\theta}) + \sigma \epsilon_i, \tag{4.2}$$
$$(M): \quad y_i = \bar{f}(\boldsymbol{t}_i, \boldsymbol{\theta}) + \delta(\boldsymbol{t}_i) + \sigma \epsilon_i, \tag{4.3}$$
$$(M - \delta): \quad y_i = \bar{f}(\boldsymbol{t}_i, \boldsymbol{\theta}) + \sigma \epsilon_i, \tag{4.4}$$

where ¯𝑓 represents a pretrained GP emulator built from simulation data only.

4.1 Data Collection

4.1.1 Experimental data

As shown in Tables 4.1 and 4.2 below, two types of observables are employed in the calibration: GTR energies 𝐸GTR and π›½βˆ’-decay half-lives 𝑇1/2. The half-lives are given in ascending order, and their logarithms lg 𝑇1/2 (base 10) are adopted in the fit, as the half-lives can vary by several orders of magnitude.

No. Nucleus 𝐸GTR Error
1 208Pb 15.6 0.2
2 132Sn 16.3 0.6
3 90Zr 8.7 –
4 112Sn 8.94 0.25

Table 4.1 GTR energies and their experimental errors (in MeV) taken from Refs. Yasuda et al. (2018); Akimune et al. (1995); Gaarde et al. (1981); Pham et al. (1995) for the four nuclei selected in this work.

No. Nucleus 𝑇1/2 Error | No. Nucleus 𝑇1/2 Error
5 98Kr 0.043 0.004 | 18 166Gd 4.8 1.0
6 58Ti 0.058 0.009 | 19 156Nd 5.26 0.2
7 102Sr 0.069 0.006 | 20 204Pt 10.3 1.4
8 82Zn 0.166 0.011 | 21 74Zn 95.6 1.2
9 48Ar 0.475 0.04 | 22 52Ti 102 6
10 60Cr 0.49 0.01 | 23 180Yb 144 30
11 126Cd 0.515 0.017 | 24 114Pd 145.2 3.6
12 114Ru 0.54 0.03 | 25 242U 1008 30
13 134Sn 1.05 0.011 | 26 134Te 2508 48
14 152Ce 1.4 0.2 | 27 92Sr 9399.6 61.2
15 78Zn 1.47 0.15 | 28 156Sm 33840 720
16 72Ni 1.57 0.05 | 29 200Pt 45360 1080
17 92Kr 1.84 0.008 |

Table 4.2 π›½βˆ’-decay half-lives and their experimental errors (in seconds) taken from the National Nuclear Data Center for the 25 nuclei selected in this work. Half-lives are listed in ascending order.

4.1.2 Simulation data

In the context of nuclear density functional theory (DFT), the finite amplitude method (FAM) provides an efficient solution for the (quasiparticle) random phase approximation (QRPA) (Nakatsukasa et al., 2007; Avogadro and Nakatsukasa, 2011).
Recent advancements in the Skyrme proton-neutron FAM (PNFAM) have made it possible to compute charge-changing transitions in deformed nuclei and to perform extensive calculations of 𝛽-decay rates (Shafer et al., 2016; Mustonen et al., 2014; Mustonen and Engel, 2016; Ney et al., 2020). In the PNFAM, the time-odd isovector Skyrme couplings, the isoscalar pairing strength (𝑣0), and the effective axial-vector coupling (𝑔𝐴) have a strong impact on 𝛽-decay transitions. In the actual calculation, the dimensionless Landau-Migdal parameter (𝑔′0) was used instead of the time-odd isovector Skyrme couplings.

4.2 Extension of the calibration model to two observable types

In order to estimate all the unknown quantities, including the calibration parameters, using both GTR energies and 𝛽-decay half-lives, we extend the original framework described in Chapter 1 by modeling each observable type independently. Specifically, let π’š1 denote a column vector of size 𝑛1 containing the experimental observations of the 1st observable type, and let 𝒛1 denote a column vector of size π‘š1 containing the computer model evaluations of the 1st observable type. Overall, for the 1st observable type, we have 𝒅1 = (π’š1, 𝒛1). Similarly, for the 2nd observable type, let π’š2 denote the 𝑛2 experimental observations and 𝒛2 the π‘š2 computer model evaluations; 𝒅2 is then the joint dataset for the 2nd observable type.

As described in Section 4.1.1, the fit observables consist of 4 GTR energies and 25 𝛽-decay half-lives. To emulate the PNFAM with the Skyrme energy density functional, we generated 1000 data points for each data type using a Latin Hypercube Design (see the appendix for more details). Denoting the sets of unknown parameters for each data type by 𝚿1 = (𝜽1, 𝝓1, 𝜎1) and 𝚿2 = (𝜽2, 𝝓2, 𝜎2), the joint conditional data likelihood for the integrated approach is

$$\begin{pmatrix} \boldsymbol{d}_1 \\ \boldsymbol{d}_2 \end{pmatrix} \,\Big|\, \boldsymbol{\Psi}_1, \boldsymbol{\Psi}_2 \;\sim\; N\!\left( \begin{pmatrix} \boldsymbol{0}_1 \\ \boldsymbol{0}_2 \end{pmatrix}, \begin{pmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \end{pmatrix} \right), \tag{4.5}$$

where Ξ£1 and Ξ£2 are the covariance matrices consistent with the KOH model formulation in Chapter 2. The likelihood (4.5) can be used directly for the integrated approach's posterior sampling. Note that the calibration parameters do not, in general, need to be shared among the observable types. In our case, however, 𝜽1 ∩ 𝜽2 β‰  βˆ…, as 𝜽1 = (𝑔′0, 𝑣0) and 𝜽2 = (𝑔′0, 𝑣0, 𝑔𝐴).

The joint conditional likelihood for the modular approach is given by the likelihood of the experimental observations π’š1 and π’š2 alone, since all the computer model outputs are used to train the emulators independently. Namely,

$$\begin{pmatrix} \boldsymbol{y}_1 \\ \boldsymbol{y}_2 \end{pmatrix} \,\Big|\, \tilde{\boldsymbol{\Psi}}_1, \tilde{\boldsymbol{\Psi}}_2 \;\sim\; N\!\left( \begin{pmatrix} \hat{S}_1(\boldsymbol{\theta}_1) \\ \hat{S}_2(\boldsymbol{\theta}_2) \end{pmatrix}, \begin{pmatrix} K_{\delta_1}(T_{t_1}, T_{t_1}) + \sigma_1^2 I_{n_1} & 0 \\ 0 & K_{\delta_2}(T_{t_2}, T_{t_2}) + \sigma_2^2 I_{n_2} \end{pmatrix} \right), \tag{4.6}$$

where $\hat{S}_1(\boldsymbol{\theta}_1)$ is the posterior mean of the GP emulator at the design points $\{\tilde{\boldsymbol{t}}_{1,i}\}_{i=1}^{m_1}$ with calibration parameters 𝜽1. The vector ˜𝚿1 consists of the GP hyperparameters for 𝛿1, the scale of the observational error for the 1st data type, and the calibration parameters 𝜽1. $K_{\delta_1}(T_{t_1}, T_{t_1})$ is the matrix with elements $k_{\delta_1}(\tilde{\boldsymbol{t}}_{1,i}, \tilde{\boldsymbol{t}}_{1,j})$.
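Because the covariance in (4.5) and (4.6) is block-diagonal across the two observable types, the joint Gaussian log-likelihood is simply the sum of the per-type log-likelihoods. The short sketch below illustrates this structure; it is an illustrative sketch only, not the code used in this chapter, and the means, covariance blocks, and data are placeholders.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.linalg import block_diag

def joint_loglik(d1, d2, mean1, mean2, Sigma1, Sigma2):
    """Joint log-likelihood with the block-diagonal covariance of (4.5)/(4.6)."""
    mean = np.concatenate([mean1, mean2])
    Sigma = block_diag(Sigma1, Sigma2)   # zero cross-covariance between observable types
    return multivariate_normal(mean, Sigma).logpdf(np.concatenate([d1, d2]))

def joint_loglik_split(d1, d2, mean1, mean2, Sigma1, Sigma2):
    """Equivalent, and cheaper, because the two blocks do not interact."""
    return (multivariate_normal(mean1, Sigma1).logpdf(d1)
            + multivariate_normal(mean2, Sigma2).logpdf(d2))
```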
The components of the likelihood (4.6) corresponding to the 2nd data type are defined analogously.

4.3 Prior Structure

It is often the case in Bayesian modeling that the careful elicitation of prior distributions is neglected, because with enough data the effect of the prior becomes small (Ghossal and Van der Vaart, 2017). Thoughtful prior selection is nonetheless very important in computer model calibration, because it can help mitigate the potential identifiability issue of the calibration parameters. In particular, strong priors on the GPs' hyperparameters are needed to distinguish the systematic discrepancy 𝛿 from the observational error 𝜎 (BrynjarsdΓ³ttir and O'Hagan, 2014).

In order to improve the computational stability of the MCMC sampler, we first reparameterize the GP length scales as correlation hyperparameters. Specifically, for the length scales of the GP emulator of 𝑓(𝒕, 𝜽), we set

$$\rho^{s}_{j,i} = \exp\!\left( -\frac{1}{4\, l^{s}_{j,i}} \right), \qquad i = 1, \dots, 2 + q_j, \quad j = 1, 2,$$

where π‘žπ‘— is the dimension of the calibration parameters for observable type 𝑗. A similar reparameterization is used for the length-scale parameters of the discrepancy function 𝛿. Since the original length-scale parameters are positive real numbers, MCMC sampling over them is slow, as the sampler potentially has to explore the entire positive real line; with this reparameterization, the sampling only needs to occur on the [0, 1] domain.

We therefore place a Beta distribution on the reparameterized length scales with shape parameters π‘Ž = 1 and 𝑏 = 0.3. This choice is informed by the fact that we want to check which inputs are relevant for the outputs. For example, we found that, among the 4 parameters, 𝑔′1 could be removed from the analysis based on the previous sections. Similarly, in the Bayesian analysis, since the prior mass is clustered around 1 (length scales are very large and the effect of the inputs is small), if an input is important we expect its posterior distribution to move toward 0.

For the precision parameters (the inverses of the scale parameters in the covariance functions), first note that they have a convenient interpretation: they determine how much of the data's variance comes from each term (𝑓, 𝛿, or πœ–). Recall that all the outputs were standardized to have unit variance. Therefore, we specify the precision parameters of the covariance functions for 𝑓1 and 𝑓2 to be close to 1 by assigning them a Gamma prior with shape parameter 𝛼 = 10 and rate parameter 𝛽 = 10, which sets the prior expectation of each precision parameter to 1. For the model discrepancy term, we expect a relatively small bias, which also helps mitigate the potential confounding issues, because the computer emulator is expected to explain most of the variance in the experimental observations. Hence, we set a Gamma prior with shape parameter 𝛼 = 10 and rate parameter 𝛽 = 0.3 on the precision parameters of the discrepancy term, which sets the prior expectation of each precision parameter of the covariance functions for 𝛿1 and 𝛿2 to 33. The relative size of the covariances from the discrepancy term is therefore about 3% (= 1/33) of those from 𝑓1 and 𝑓2. Finally, for the observational error terms, we set a Gamma prior with shape parameter 𝛼 = 10 and rate parameter 𝛽 = 0.0015.
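As a concrete illustration of this prior structure, the sketch below writes down the length-scale reparameterization and the Beta/Gamma priors with the shape and rate values quoted above. It is a schematic sketch with hypothetical helper names, not the code used for the analysis.

```python
import numpy as np
from scipy import stats

def rho_from_lengthscale(l):
    """Map a positive length scale l to the correlation parameter rho = exp(-1/(4l)) in (0, 1)."""
    return np.exp(-1.0 / (4.0 * l))

priors = {
    "rho_emulator":     stats.beta(a=1.0, b=0.3),           # mass near 1: an input is "off" unless the data say otherwise
    "rho_discrepancy":  stats.beta(a=1.0, b=0.3),
    "prec_emulator":    stats.gamma(a=10.0, scale=1/10.0),   # prior mean 1 (outputs standardized to unit variance)
    "prec_discrepancy": stats.gamma(a=10.0, scale=1/0.3),    # prior mean ~33   -> ~3% of the variance
    "prec_obs_error":   stats.gamma(a=10.0, scale=1/0.0015), # prior mean ~6667 -> ~0.015% of the variance
}

for name, dist in priors.items():
    print(f"{name:16s} prior mean = {dist.mean():.4g}")
```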
By the same logic, the relative size of the covariances from the observational error term is about 0.015% (= 1/6666.67) of those from 𝑓1 and 𝑓2. This hierarchical ordering of the variance explained by each term is important for distinguishing between the discrepancy and the observational error. In addition, the choice of prior on the observational error terms is consistent with the πœ’2 optimization results. More discussion of suitable priors on discrepancy terms can be found in BrynjarsdΓ³ttir and O'Hagan (2014) and Chong and Menberg (2018).

When it comes to the calibration parameters, we considered two cases. In the first case, we consider a uniform prior on each parameter space (𝑔′0 ∈ [0, 4], 𝑣0 ∈ [βˆ’2, 0], 𝑔𝐴 ∈ [βˆ’2, 0]). In the second case, we consider weakly informative priors for the calibration parameters, taken to be Truncated Normal (TN) distributions centered around the empirical values (𝑔′0* = 1.6, 𝑣0* = βˆ’1.1, 𝑔𝐴* = βˆ’1) with small standard deviations, truncated at the boundaries of each parameter space.

4.4 Calibration Results

The number of effective parameters is chosen based on the results of Li (2022). We therefore consider two cases, where 𝑔𝐴 is either free or fixed at the value 1.

4.4.1 𝑔𝐴 free case

For the two observable types (𝐸GTR, 𝑇1/2), we fit independent Gaussian Processes with the MatΓ©rn covariance class with roughness parameter 3/2, which share the calibration parameters: (𝑔′0, 𝑣0) for the GTR energies and (𝑔′0, 𝑣0, 𝑔𝐴) for the π›½βˆ’-decay half-lives.

Figure 4.2 shows the posterior distributions of the calibration parameters. The figure's first row shows the results when using TN priors on the calibration parameters. The MCMC results for 'I' and 'M' show multimodality in the posterior distribution of 𝑔′0. The VI method cannot capture this multimodality because we predefined a normal distribution as the variational family for the calibration parameters. However, the VI result is centered around the πœ’2 result, the standard approach in the physics community. The same holds for the models without a discrepancy term. For 𝑣0, the VI approach approximates the MCMC posteriors well; all are slightly off from the πœ’2 result. The results without a discrepancy term are less concentrated but centered at more or less the same values as the posteriors of the models with a discrepancy term. For 𝑔𝐴, while the πœ’2 result is around 0.5, the MCMC posteriors without model discrepancy are shifted slightly to the right, and the MCMC posteriors with model discrepancy center around the value 1.0. The approximate VI posterior is shifted slightly to the right of the MCMC results; this might be compensation for approximating the multimodal posterior of 𝑔′0 with a normal distribution.

The sensitivity of the posterior distributions to the two different priors is minimal for the MCMC methods. The second row of the figure shows the results when we use the uniform prior on each parameter space. The posterior shape is more or less the same for 𝑔′0 and 𝑣0, but the posterior distribution of 𝑔𝐴 has more mass toward the value 0.5. For VI, 𝑔′0 has moved slightly to the left, while the other parameters are concentrated around the same values as in the TN prior case. The means and standard deviations for each case are summarized in Table 4.3.
Method 𝑔′0 𝑣0 𝑔𝐴
I (U) 1.653(0.104) βˆ’0.941(0.064) 0.940(0.140)
I-𝛿 (U) 1.584(0.037) βˆ’0.916(0.080) 0.708(0.125)
M (U) 1.654(0.101) βˆ’0.943(0.066) 0.925(0.148)
M-𝛿 (U) 1.585(0.036) βˆ’0.920(0.080) 0.705(0.123)
VI (U) 1.404(0.028) βˆ’0.941(0.029) 1.089(0.049)
I (TN) 1.655(0.098) βˆ’0.957(0.056) 0.975(0.097)
I-𝛿 (TN) 1.590(0.036) βˆ’0.950(0.073) 0.804(0.129)
M (TN) 1.658(0.098) βˆ’0.958(0.058) 0.967(0.099)
M-𝛿 (TN) 1.589(0.036) βˆ’0.952(0.072) 0.800(0.124)
VI (TN) 1.600(0.049) βˆ’0.955(0.045) 1.147(0.055)
πœ’2 opt 1 1.592(0.034) βˆ’1.197(0.179) 0.503(0.143)
I (U) 1.465(0.171) βˆ’1.179(0.008) 1
I-𝛿 (U) 2.383(0.014) βˆ’1.802(0.191) 1
M (U) 1.456(0.142) βˆ’1.519(0.008) 1
M-𝛿 (U) 1.588(0.036) βˆ’0.898(0.166) 1
VI (U) 1.419(0.038) βˆ’0.778(0.010) 1
I (TN) 1.674(0.118) βˆ’0.942(0.064) 1
I-𝛿 (TN) 1.586(0.038) βˆ’0.780(0.013) 1
M (TN) 1.568(0.158) βˆ’1.119(0.099) 1
M-𝛿 (TN) 1.589(0.036) βˆ’0.922(0.163) 1
VI (TN) 1.422(0.040) βˆ’0.778(0.009) 1
πœ’2 opt 2 1.596(0.039) βˆ’1(0.178) 1

Table 4.3 Posterior summaries (mean and standard deviation) of the calibration parameters using both observable types. 'U' denotes the uniform prior and 'TN' the truncated normal prior. The upper block is the case where 𝑔𝐴 is free; the lower block is the case where 𝑔𝐴 is fixed at 1.

Figure 4.2 Posterior distributions of the calibration parameters using both observable types with free 𝑔𝐴. The columns of the figure show the posterior distribution of each calibration parameter; the upper three panels use the TN prior and the lower three use the uniform prior. The purple solid lines represent the prior distributions (uniform or TN) over each parameter space. Lines with different colors and line types show the different estimation approaches: blue dashed for 'I', orange dotted for 'I-𝛿', green dash-dotted for 'M', red dash-dot-dotted for 'M-𝛿', and the brown solid line for the 'VI' approach proposed in Chapter 3. The black solid vertical line represents the πœ’2 optimization result, with two-sigma error bands as black vertical dotted lines.

4.4.2 𝑔𝐴 fixed case

Figure 4.3 shows the posterior distributions of the calibration parameters when 𝑔𝐴 is fixed at 1. The posterior distributions of 𝑔′0 behave similarly to the free-𝑔𝐴 case: the models with a discrepancy term show multimodality, and the models without a discrepancy term center around the πœ’2 optimization results. The VI results have now moved slightly to the left. On the other hand, the posterior distributions of 𝑣0 differ across models and across the two priors, except for VI. In particular, the 𝐼 βˆ’ 𝛿 case with a uniform prior shows a posterior distribution pushed toward the boundary value near βˆ’2. This is very different from the other results, and we tried to understand why it happens. One possible explanation can be found by looking at the πœ’2 surface in Figure 4.4: the surface has two regions where the πœ’2 value is low. We therefore restricted the parameter space of 𝑣0 to avoid the unphysical region where 𝑣0 is less than βˆ’2. However, by fitting the emulator and the observational error together (the 𝐼 βˆ’ 𝛿 case), the posterior distribution of 𝑣0 moved toward this unphysical area. This means that a substantial proportion of the variance in the experimental data was absorbed by the observational error term. This can be confirmed by looking at the performance of the emulator on 100 GTR and 500 𝛽-decay out-of-sample simulation data points; see Table 4.4.
In both tables, the RMSE for the GTR data is not much different whether one fixes 𝑔𝐴 or not. However, the RMSE for the 𝛽-decay data tends to be higher if we fix 𝑔𝐴. This is because the 𝛽-decay data control 𝑣0 and 𝑔𝐴. In fact, if we look at the ′𝐼 βˆ’ 𝛿′ model with a uniform prior, the emulator's performance on the 𝛽-decay data is significantly worse than for the other methods. As we explained, the posterior distributions of the observational error terms are quite different in the ′𝐼 βˆ’ 𝛿′ case.

Figure 4.3 Posterior distribution of calibration parameters using both observable types with 𝑔𝐴 = 1.

Figure 4.4 πœ’2 surface for fixed 𝑔𝐴.

Method GTR RMSE (TN) π›½βˆ’ RMSE (TN) GTR RMSE (Unif) π›½βˆ’ RMSE (Unif)
I 0.057 0.707 0.056 0.707
I-𝛿 0.057 0.703 0.057 0.703
M 0.056 0.701 0.057 0.701
M-𝛿 0.057 0.700 0.057 0.701
VI 5.085 1.011 5.086 1.011
VI (mean) 0.033 0.538 0.033 0.535
I (𝑔𝐴 = 1) 0.057 1.158 0.057 1.123
I-𝛿 (𝑔𝐴 = 1) 0.056 1.152 0.057 1.545
M (𝑔𝐴 = 1) 0.057 1.130 0.057 1.165
M-𝛿 (𝑔𝐴 = 1) 0.057 1.151 0.057 1.150
VI (𝑔𝐴 = 1) 6.368 7.765 6.525 7.746
VI (𝑔𝐴 = 1, mean) 0.088 0.776 0.088 0.776

Table 4.4 Average RMSEs for 100 GTR and 500 𝛽-decay out-of-sample computer model outputs.

Figure 4.5 shows the posterior distributions of the hyperparameters for the ′𝐼 βˆ’ 𝛿′ case with fixed 𝑔𝐴 using the two different priors. The three panels show the differences in the posterior distributions of 1/πœ‚_{𝑓2}, 1/𝜎1Β², and 1/𝜎2Β², while all the other hyperparameters have posterior distributions that overlap each other. This implies that assigning distinct priors to the calibration parameters had an impact on the learning of the other parameters. This connection arises because the various components, including the emulator, the calibration parameters, and the observational error, are interconnected within the ′𝐼′ or ′𝐼 βˆ’ 𝛿′ framework. The VI approach is also an approximation of MCMC within the ′𝐼′ framework, and it likewise shows poor emulator performance. However, if we use the mean of the variational posterior for prediction instead of the full variational distribution, the performance of the emulator is quite comparable to the MCMC methods.

This poor performance of the emulator can be avoided by training the emulator first (the ′𝑀′ or ′𝑀 βˆ’ 𝛿′ cases) using the simulation data, thereby avoiding the confounding issue. Bayarri et al. (2009) also mentioned five reasons for using the modularization approach, which can be summarized as follows.

1. Keep a good module separate from a suspect module to avoid contamination.
2. Scientific understanding and development can require modularization.
3. Identifiability concerns or confounding might require modularization.
4. Mixing of MCMC analyses can greatly improve under modularization.
5. The computation is not otherwise possible.

In short, we have a certain expectation of the behavior of the emulator, and we would like to prevent the emulator's behavior from being contaminated by the experimental data by keeping the modules separate.

Figure 4.5 Posterior distribution of hyperparameters for fixed 𝑔𝐴.

4.5 Prediction Results

In this section, we compare the RMSEs for 179 out-of-sample beta-decay data based on each model.
Table 4.5 shows the performance of each model in predicting the 179 experimental data. There are two main features in the results: the model discrepancy term reduces the RMSEs, and freeing 𝑔𝐴 also reduces the RMSEs. In addition, the VI approach shows a significantly lower RMSE for fixed 𝑔𝐴. However, the average RMSE for the Bayesian approaches tends to be higher than the πœ’2 optimization results. Why, then, should we adopt Bayesian calibration methods?

In Figure 4.7, we can check the advantage of the Bayesian calibration approach. The x-axis of the figure is the logarithm of the beta-decay rates (the position of the 179 nuclei in terms of beta-decay rates), and the y-axis is the log-ratio of the predicted over the experimental beta-decay rates. If the prediction is close to the experimental value, the ratio becomes one, and taking the logarithm gives zero; for this reason, the value zero on the y-axis is the benchmark and is represented by a dotted line. The πœ’2 results show small prediction uncertainty overall, and hence the πœ’2 prediction intervals cannot capture the value zero. On the other hand, the MCMC prediction intervals capture zero well, with behavior similar to the πœ’2 results: accurate predictions at small beta-decay rates and poor predictions at large beta-decay rates. The VI approach approximates MCMC well, with smaller uncertainty. The prediction behavior of each model is very similar. For free 𝑔𝐴, the larger the x-value, the smaller the log-ratio tends to be; this is due to the physical simulator and can also be seen in the πœ’2 results. For fixed 𝑔𝐴, unlike the πœ’2 results, the predictions based on each model tend to be systematically off from 0.

Method π›½βˆ’ RMSE (TN prior) π›½βˆ’ RMSE (Unif prior)
I 1.702 1.701
I-𝛿 1.989 1.970
M 1.646 1.641
M-𝛿 1.939 1.916
VI 1.457 1.515
πœ’2 opt 1 0.829 0.829
I (𝑔𝐴 = 1) 1.991 1.961
I-𝛿 (𝑔𝐴 = 1) 2.105 2.189
M (𝑔𝐴 = 1) 1.893 1.882
M-𝛿 (𝑔𝐴 = 1) 2.124 2.119
VI (𝑔𝐴 = 1) 1.810 1.807
πœ’2 opt 2 (𝑔𝐴 = 1) 0.974 0.974

Table 4.5 Average RMSEs for the 179 beta-decay experimental data. The πœ’2 results do not depend on the prior.

Another way to see the advantage of Bayesian model calibration compared with the πœ’2 results is Figure 4.8. In this figure, the x-axis represents the theoretical coverage of the true values obtained by controlling the length of the prediction interval, and the y-axis represents how many of those intervals actually include the value 0 among the 179 prediction points (the empirical coverage). We can clearly see that the πœ’2 prediction intervals cannot include the true values even as the intervals are widened, whereas all the Bayesian approaches, whether based on MCMC or VI, capture the true values as the prediction intervals grow.

The main practical difference between the models is the running time of the code: about 80 hours for the ′𝐼′ approach with MCMC, 20 hours for ′𝑀′, and 3 hours for ′𝑉 𝐼′. The models without a discrepancy term took slightly less time than those with a discrepancy term.

Figure 4.6 Prediction of the beta-decay rates of 179 nuclei with one-standard-deviation error bars for the calibration models without discrepancy terms.
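For reference, the sketch below shows schematically how the two summaries in this section can be computed: the RMSE on the lg 𝑇1/2 scale and the empirical coverage of central prediction intervals as in Figure 4.8 (coverage of the experimental value, which is equivalent to coverage of zero on the log-ratio scale). This is an illustrative sketch only; the posterior-draw and experimental arrays are placeholders.

```python
import numpy as np

def rmse_log10(pred_halflife, exp_halflife):
    """RMSE between predicted and experimental half-lives on the log10 (lg) scale."""
    return np.sqrt(np.mean((np.log10(pred_halflife) - np.log10(exp_halflife)) ** 2))

def empirical_coverage(post_draws_log10, exp_log10, levels):
    """Fraction of nuclei whose experimental value falls inside the central interval of
    the posterior predictive draws, for each nominal credible level."""
    cover = []
    for level in levels:
        lo = np.percentile(post_draws_log10, 50 * (1 - level), axis=0)
        hi = np.percentile(post_draws_log10, 50 * (1 + level), axis=0)
        cover.append(np.mean((exp_log10 >= lo) & (exp_log10 <= hi)))
    return np.array(cover)

# Example with placeholder draws: 2000 posterior predictive samples for 179 nuclei.
draws = np.random.default_rng(2).normal(size=(2000, 179))
exp_vals = np.zeros(179)
print(empirical_coverage(draws, exp_vals, levels=[0.5, 0.8, 0.95]))
```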
4.6 Conclusion and Discussion

In this work, we found that choosing an appropriate covariance function is of fundamental importance in modeling the physical process; in that sense, we recommend the MatΓ©rn class of covariance functions rather than defaulting to the SE covariance. Setting relatively strong priors on the hyperparameters of the GPs is also necessary to correctly distinguish the discrepancy term from the observational error (BrynjarsdΓ³ttir and O'Hagan, 2014). In addition, instead of a naive application of the Bayesian calibration model, by keeping the modules separate we obtain a time advantage with results similar to the ′𝐼′ approach and with reliable emulator performance. Excluding the discrepancy term leads to similar predictive performance but can lead to biased results for the calibration parameters. The ′𝑉 𝐼′ approach gives fast and efficient results for calibration and for prediction of experimental values; however, because we restrict the shape of the posterior of the calibration parameters, researchers should be careful when interpreting the calibration parameters or the performance of the emulator.

Figure 4.7 Prediction of the beta-decay rates of 179 nuclei with one-standard-deviation error bars for the calibration models with discrepancy terms.

Figure 4.8 Coverage of the test data set using prediction intervals.

Lastly, further investigation of the relationship between πœ’2 optimization and Bayesian calibration would be an interesting research topic. The current πœ’2 RMSE is better than that of all the Bayesian approaches. Considering that πœ’2 minimizes a scaled version of the squared difference between experimental data and simulation data, the performance of the models without discrepancy is expected to be close to the πœ’2 results. Despite this expectation, the RMSE values for the models without discrepancy are rather poor. One key reason lies in how πœ’2 and the Bayesian approaches utilize the physics model: while πœ’2 employs the physics model directly, the Bayesian methods use a Gaussian Process (GP) emulator to approximate it. As a result, the Bayesian approach tends to perform worse than πœ’2. In cases where the difference between the GP emulator and the physics model becomes negligible, owing to sufficient simulation data, the models without discrepancy can closely approach the πœ’2 results. Furthermore, with an increase in experimental data, the Bayesian approach is expected to outperform πœ’2 optimization. This is because Bayesian models converge to the true physical processes with increasing experimental data, as discussed in previous chapters, while πœ’2 optimization remains within the constrained world of physics models. Nevertheless, this offers a potential rationale for the present findings, and further investigation through theoretical analysis and extensive simulation studies is needed.

BIBLIOGRAPHY

Akimune, H., Daito, I., Fujita, Y., Fujiwara, M., Greenfield, M. B., Harakeh, M. N., Inomata, T., JΓ€necke, J., Katori, K., Nakayama, S., Sakai, H., Sakemi, Y., Tanaka, M., and Yosoi, M. (1995). Direct proton decay from the Gamow-Teller resonance in 208Bi. Phys. Rev. C, 52(2):604–615.

Avogadro, P. and Nakatsukasa, T. (2011). Finite amplitude method for the quasiparticle random-phase approximation. Phys. Rev. C, 84(1):014314.

Balay, S., Abhyankar, S., Adams, M. F., Benson, S., Brown, J., Brune, P., Buschelman, K., Constantinescu, E., Dalcin, L., Dener, A., EΔ³khout, V., Gropp, W. D., Hapla, V., Isaac, T., Jolivet, P., Karpeev, D., Kaushik, D., Knepley, M. G., Kong, F., Kruger, S., May, D. A., McInnes, L. C., Mills, R. T., Mitchell, L., Munson, T., Roman, J. E., Rupp, K., Sanan, P., Sarich, J., Smith, B. F., Zampini, S., Zhang, H., Zhang, H., and Zhang, J. (2021a). PETSc/TAO users manual. Technical Report ANL-21/39 - Revision 3.16, Argonne National Laboratory.

Balay, S., Abhyankar, S., Adams, M. F., Benson, S., Brown, J., Brune, P., Buschelman, K., Constantinescu, E. M., Dalcin, L., Dener, A., EΔ³khout, V., Gropp, W. D., Hapla, V., Isaac, T., Jolivet, P., Karpeev, D., Kaushik, D., Knepley, M. G., Kong, F., Kruger, S., May, D. A., McInnes, L. C., Mills, R. T., Mitchell, L., Munson, T., Roman, J. E., Rupp, K., Sanan, P., Sarich, J., Smith, B. F., Zampini, S., Zhang, H., Zhang, H., and Zhang, J. (2021b). PETSc Web page.

Balay, S., Gropp, W. D., McInnes, L. C., and Smith, B. F. (1997). Efficient management of parallelism in object oriented numerical software libraries. In Arge, E., Bruaset, A. M., and Langtangen, H. P., editors, Mod. Softw. Tools Sci. Comput., pages 163–202. BirkhΓ€user Press.

Bayarri, M. J., Berger, J. O., and Liu, F. (2009). Modularization in Bayesian analysis, with emphasis on analysis of computer models. Bayesian Analysis, 4(1):119–150.

BrynjarsdΓ³ttir, J. and O'Hagan, A. (2014). Learning about physical parameters: the importance of model discrepancy. Inverse Problems, 30:114007.

Chong, A. and Menberg, K. (2018). Guidelines for the bayesian calibration of building energy models. Energy and Buildings, 174:527–547.

Gaarde, C., Rapaport, J., Taddeucci, T. N., Goodman, C. D., Foster, C. C., Bainum, D. E., Goulding, C. A., Greenfield, M. B., HΓΆren, D. J., and Sugarbaker, E. (1981). Excitation of giant spin-isospin multipole vibrations. Nucl. Phys. A, 369(2):258–280.

Ghossal, S. and Van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge University Press.

Kennedy, M. C. and O'Hagan, A. (2001). Bayesian calibration of computer models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63:425–464.

Li, T. (2022).
Application of density functional theory in nuclear structure. Ph.D. thesis, Michigan State University. Mustonen, M. T. and Engel, J. (2016). Global description of π›½βˆ’ decay in even-even nuclei with the axially-deformed skyrme finite-amplitude method. Phys. Rev. C, 93(1):014304. Mustonen, M. T., Shafer, T., Zenginerler, Z., and Engel, J. (2014). Finite-amplitude method for charge-changing transitions in axially deformed nuclei. Phys. Rev. C, 90(2):024308. Nakatsukasa, T., Inakura, T., and Yabana, K. (2007). Finite amplitude method for the solution of the random-phase approximation. Phys. Rev. C, 76(2):024318. Ney, E. M., Engel, J., Li, T., and Schunck, N. (2020). Global description of π›½βˆ’ decay with the axially deformed skyrme finite-amplitude method: Extension to odd-mass and odd-odd nuclei. Phys. Rev. C, 102(3):034326. Pham, K., JΓ€necke, J., Roberts, D. A., Harakeh, M. N., Berg, G. P. A., Chang, S., Liu, J., Stephenson, E. J., Davis, B. F., Akimune, H., and Fujiwara, M. (1995). Fragmentation and splitting of Gamow-Teller resonances in Sn(3He,t)Sb charge-exchange reactions, 𝐴=112–124. Phys. Rev. C, 51(2):526–540. Shafer, T., Engel, J., FrΓΆhlich, C., McLaughlin, G. C., Mumpower, M., and Surman, R. (2016). 𝛽 decay of deformed π‘Ÿ-process nuclei near 𝐴 = 80 and 𝐴 = 160, including odd-𝐴 and odd-odd nuclei, with the skyrme finite-amplitude method. Phys. Rev. C, 94(5):055802. Wild, S. M. (2017). Chapter 40: POUNDERS in TAO: Solving Derivative-Free Nonlinear Least- Squares Problems with POUNDERS. In Advances and Trends in Optimization with Engineering Applications, MOS-SIAM Series on Optimization, pages 529–539. Society for Industrial and Applied Mathematics. Yasuda, J., Sasano, M., Zegers, R. G. T., Baba, H., Bazin, D., Chao, W., Dozono, M., Fukuda, N., Inabe, N., Isobe, T., Jhang, G., Kameda, D., Kaneko, M., Kisamori, K., Kobayashi, M., Kobayashi, N., Kobayashi, T., Koyama, S., Kondo, Y., Krasznahorkay, A. J., Kubo, T., Kubota, Y., Kurata-Nishimura, M., Lee, C. S., Lee, J. W., Matsuda, Y., Milman, E., Michimasa, S., Motobayashi, T., Muecher, D., Murakami, T., Nakamura, T., Nakatsuka, N., Ota, S., Otsu, H., Panin, V., Powell, W., Reichert, S., Sakaguchi, S., Sakai, H., Sako, M., Sato, H., Shimizu, Y., Shikata, M., Shimoura, S., Stuhl, L., Sumikama, T., Suzuki, H., Tangwancharoen, S., Takaki, M., Takeda, H., Tako, T., Togano, Y., Tokieda, H., Tsubota, J., Uesaka, T., Wakasa, T., Yako, K., Yoneda, K., and Zenihiro, J. (2018). Extraction of the Landau-Migdal parameter from the Gamow-Teller Giant Resonance in 132Sn. Phys. Rev. Lett., 121(13):132501. 84 APPENDIX A LATIN HYPERCUBE DESIGN For the Bayesian calibration model, we need a set of model evaluations for each observable type. In order to evaluate the model using a GP emulator, design points are required. Hence we used Latin Hypercube Design in the scaled input space of calibration parameters [0, 1]3. The figures below represent the points we used for the evaluation of the computer model. An interesting point is that in the right Figure of A.1, the points in 𝑔𝐴 are very regular compared to other parameters. This is because the calculation of a computer model with different 𝑔𝐴 is cheap once one point is calculated. Figure A.1 (Latin Hypercube Design points) Left figure is for GTR simulation and the right figure is for 𝛽 decay simulation. 85 APPENDIX B CHI-SQUARE OPTIMIZATION FRAMEWORK The πœ’2 optimization framework, based on Dobaczewski et al. 
(2014), seeks to determine the optimal model parameters by minimizing the weighted sum of squared residuals. Here, we briefly introduce the important results, following the presentation of Li (2022):

$$\chi^2(\boldsymbol{x}) = \frac{\sum_{k=1}^{n_d} \varepsilon_k^2(\boldsymbol{x})}{n_d - n_x} = \frac{1}{n_d - n_x} \sum_{k=1}^{n_d} \left[ \frac{s_k(\boldsymbol{x}) - d_k}{w_k} \right]^2. \tag{B.1}$$

The objective function πœ’Β²(𝒙) represents the difference between the model predictions π‘ π‘˜(𝒙) and the experimental values π‘‘π‘˜ for each observable π‘˜, scaled by the respective weights π‘€π‘˜. The weights π‘€π‘˜ are essential to make the residuals dimensionless, especially when dealing with multiple observable types.

The assumption is made that all weighted residuals πœ€π‘˜ are independent and follow a common normal distribution with zero mean and variance 𝜎². Consequently, the πœ’Β² value at the optimal point ˆ𝒙 is approximately equal to this variance 𝜎². To meet this assumption, it is essential to select weights π‘€π‘˜ that closely match the errors associated with the model predictions π‘ π‘˜, encompassing theoretical, numerical, and experimental uncertainties; this strategy is employed to bring πœ’Β²(ˆ𝒙) close to a value of 1. Although each data point can have its own weight, it is common practice for data points of the same type to be assigned the same weight, since their errors are expected to be similar. This ensures consistency in the treatment of similar observations within the optimization process.

The relative weights among different observable types are the key consideration, because a global scale factor 𝑠, known as the Birge factor (Birge, 1932), can be introduced to scale the entire πœ’Β² value and ensure it approaches 1:

$$\chi^2(\hat{\boldsymbol{x}}) \;\to\; \tilde{\chi}^2(\hat{\boldsymbol{x}}) = \chi^2(\hat{\boldsymbol{x}})/s = 1, \qquad w_k \;\to\; \tilde{w}_k = w_k \sqrt{s}. \tag{B.2}$$

Then, for consistency between the weights and the residual distributions, the scaled weight for a given observable type, $\tilde{w}_{\mathrm{typ}}$, should be close to

$$r_{\mathrm{typ}} = \sqrt{ \frac{n_d}{n_{\mathrm{typ}}\, (n_d - n_x)} \sum_{k \in \mathrm{typ}} \big[ s_k(\hat{\boldsymbol{x}}) - d_k \big]^2 }, \tag{B.3}$$

where 𝑛typ is the number of points of the given type.

By using linear expansions of the weighted residuals πœ€π‘˜(𝒙) around the optimal point ˆ𝒙, we can convert the original nonlinear optimization problem into a linear one, which allows us to apply the linear-regression framework. Here are some important conclusions; for more rigorous mathematical details, please refer to Seber (1989). Let 𝒙* be the true parameter vector. Then the difference (ˆ𝒙 βˆ’ 𝒙*) approximately follows a multivariate normal distribution, ˆ𝒙 βˆ’ 𝒙* ∼ 𝑁(0, Cov(ˆ𝒙)), with covariance matrix

$$\mathrm{Cov}(\hat{\boldsymbol{x}}) \approx \chi^2(\hat{\boldsymbol{x}}) \left[ J^{T}(\hat{\boldsymbol{x}})\, J(\hat{\boldsymbol{x}}) \right]^{-1}, \tag{B.4}$$

where the 𝑛𝑑 Γ— 𝑛π‘₯ Jacobian matrix 𝐽(𝒙) is defined by π½π‘˜π‘™ = βˆ‚πœ€π‘˜/βˆ‚π‘₯𝑙. As for the optimization routine, we employ POUNDERS of Wild (2017) in the PETSc/TAO toolkit (Balay et al., 1997, 2021a,b).

CHAPTER 5

CONCLUSIONS AND FUTURE DIRECTIONS

In this thesis, we have explored some theoretical properties of the original KOH model framework, specifically focusing on the posterior consistency of the estimated physical processes. This work was guided by the lack of theory backing up a method that has been widely used in science and engineering problems, with over 4000 citations.
We first started to look at the KOH model as a hierarchical GP regression model and adopted the extension of Schwartz theorem’s test function framework to calculate the posterior contraction rate using true posterior. We further proposed a new variational algorithm to overcome the computational issue of the MCMC-based KOH model. In addition, we showed the posterior contraction rate of the proposed method under reasonable assumptions on GP prior and regularity conditions on the variational family. This was supported by simulation studies and the real data application in 𝛽-decay calculation. We have also examined a variety of issues concerning the implementation of both methods, MCMC and VB: Choosing covariance class for modeling the physical process, prior structure to mitigate the identifiability issue of calibration parameters, and modularizing the training of the emulator for a better understanding of the model and a better interpretation. Our findings and discussions have significant implications of understanding the KOH model. The findings presented in this thesis will serve as a foundation for future theoretical investigations in this area. Other interesting directions of research from here could be summarized as follows. Improving the performance of the emulator: The newly proposed algorithm in this thesis remains rooted in Gaussian Processes. As a consequence, the optimization process necessitated addressing the issue of nonpositive semi-definite covariance. However, there is potential to explore alternative avenues where advanced Machine Learning techniques, such as Deep learning models, diffusion models, or generative models, could replace Gaussian Processes. In particular, the use of Deep Gaussian processes for constructing emulators has gained attention from researchers recently. This presents a captivating and relevant topic for investigation, especially in the current era of AI, 88 enticing researchers from various domains. Calibration parameters carrying physical meaning: An alternative avenue of investigation pertains to the calibration parameters. In the original KOH model, all the unknowns are estimated simultaneously, which raises concerns about the identifiability of calibration parameters. Therefore, an intriguing project would involve establishing a novel definition for calibration parameters that not only ensures identifiability but also preserves their physical significance. Interested people may explore Li and Xiong’s (2022) fascinating method of decomposing the discrepancy term through Taylor expansion. Theoretical investigation of Modular approach: The primary focus of the original KOH model is to excel in handling experimental data. Nevertheless, domain scientists also place signif- icant emphasis on the emulator. Therefore, maintaining a separate emulator module by training it independently could prove beneficial in enhancing the comprehension and interpretation of the model. Demonstrating theoretical backing for the Modular approach presents an engaging path for future research. 89